# Machine Learning - Predicting student scores - not checked/finished
---

In previous worksheets, we have done a simple linear regression using the `stats.linregress` function. Today we are going to take it one step further and train Python to predict future test scores. We will do this by splitting our dataset into a training set (80%) and a test set (20%), and by splitting each of those into x and y. Using the training set, the computer will run a linear regression (as we did before in Python and R) and learn to predict the student scores. We will then test how well it predicts by giving it the test dataset's x and comparing its predictions with the actual student scores (y).  

To do this, we will be using a new library: **sklearn** as well as:
* pandas  
* numpy  
* seaborn  
* matplotlib.pyplot  
* scipy.stats  

`sklearn` has many associated packages related to machine learning but the ones we'll be using today are:  

`from sklearn.model_selection import train_test_split`   
`from sklearn.linear_model import LinearRegression`  
`from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error`  

Go ahead and load all the required packages below
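
One way to load everything listed above is sketched here. The short aliases (`pd`, `np`, `sns`, `plt`) are only a common convention, not a requirement:

```python
# Data handling and plotting libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from scipy import stats

# The three sklearn imports named above
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error
```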

### Today we will be using a simple dataset of student scores
---

`url = "https://raw.githubusercontent.com/lilaceri/Working-with-data-/main/Data%20Sets%20for%20code%20divisio/student_scores.csv"`

It contains 2 variables, `Hours` studied and test `Scores`. 

Read in the data and have a look at it. 

### Exercise 1 - Clean the data
---

* check data for null values and remove if necessary
* check data for outliers using a boxplot
* remove outliers if necessary (you can use the function you created in the Correlation and Models worksheet)
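
As a rough illustration of these cleaning steps, here is a sketch on a small made-up table (not the worksheet's dataset). The helper `remove_outliers_iqr` and its 1.5 × IQR rule are assumptions for this example; the function you wrote in the earlier worksheet may work differently:

```python
import numpy as np
import pandas as pd

# Made-up data for illustration only: one missing value, one extreme score
toy = pd.DataFrame({
    "Hours": [1.0, 2.0, 3.0, 4.0, np.nan, 5.0, 6.0],
    "Scores": [12.0, 25.0, 31.0, 44.0, 50.0, 55.0, 400.0],
})

# Count missing values per column, then drop any row that contains one
null_counts = toy.isnull().sum()
toy = toy.dropna()

def remove_outliers_iqr(df, column):
    """Hypothetical helper: keep rows within 1.5 * IQR of the quartiles."""
    q1 = df[column].quantile(0.25)
    q3 = df[column].quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return df[(df[column] >= lower) & (df[column] <= upper)]

cleaned = remove_outliers_iqr(toy, "Scores")
print(null_counts)
print(cleaned)
```

A boxplot such as `sns.boxplot(x=toy["Scores"])` shows the same extreme point visually before you decide whether to remove it.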


### Exercise 2 - visualise linear relationship
---

Using seaborn's `regplot`, visualise the relationship between the two variables. 

* use `sns.regplot()`
* split data into x (hours) and y (scores) variables

*(this is the same as using linregress and matplotlib to create a scatter with line of best fit)*

### Exercise 3 - check for normality 
---

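* using `describe()` - check if the mean and median are similar 
* using seaborn's `histplot` (for example with `kde=True`), check the shape of the data (the older `distplot` function is deprecated in recent seaborn versions) 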

Does it look roughly gaussian in shape?
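
A minimal sketch of the mean-versus-median check, using a handful of made-up scores rather than the worksheet's data. In the output of `describe()`, the median is the row labelled `50%`:

```python
import pandas as pd

# Made-up, symmetric scores for illustration only
scores = pd.Series([20.0, 25.0, 30.0, 35.0, 40.0], name="Scores")

summary = scores.describe()
mean_value = summary["mean"]
median_value = summary["50%"]  # describe() reports the median as the 50% quantile

print(mean_value, median_value)
```

When the two values are close, the distribution is at least roughly symmetric, which is one informal sign (not proof) of a Gaussian-like shape.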

### Exercise 4 - Check the correlation
---

* create a correlation matrix using the corr() function 
* create a heatmap using `sns.heatmap()` 

Are they highly correlated (close to 1 or -1)?
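
For illustration, a correlation matrix on a small made-up table (not the worksheet's data) might be built like this:

```python
import pandas as pd

# Made-up data for illustration only
toy = pd.DataFrame({
    "Hours": [1, 2, 3, 4, 5],
    "Scores": [11, 19, 32, 41, 48],
})

# Pearson correlation between every pair of columns
corr_matrix = toy.corr()
hours_vs_scores = corr_matrix.loc["Hours", "Scores"]
print(corr_matrix)
```

The matrix can then be passed to `sns.heatmap(corr_matrix, annot=True)` to draw it with the values shown in each cell.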

### Exercise 5 - reshape x and y 
---

Earlier we split our data into x and y. If you look at their shape now, each will be one-dimensional, for example (25, ). sklearn expects the x data to be two-dimensional, with one row per observation and one column per variable, so here we need the shape (25, 1). In `reshape(-1, 1)`, the `1` asks for a single column and the `-1` tells numpy to work out the number of rows from the length of the data. (Strictly, only x has to be 2D, but in this worksheet we reshape y as well so that both have the same shape.) 


To transform our x and y datasets into a shape that sklearn can understand we will use numpy to reshape the data. 

This can be done using `np.array(variable).reshape(-1, 1)`  

* reshape the x and y variables using `np.array(variable).reshape(-1, 1)` 
* check the shape of each one 
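
The effect of the reshape can be seen on a short made-up list (the values below are placeholders, not the worksheet's data):

```python
import numpy as np

hours = [2.5, 5.1, 3.2, 8.5]  # made-up values for illustration

x_flat = np.array(hours)          # 1-dimensional: one axis of length 4
x_col = x_flat.reshape(-1, 1)     # 2-dimensional: 4 rows, 1 column

print(x_flat.shape)
print(x_col.shape)
```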

### Exercise 6 - splitting the data 
---

In order to tell how accurately our model can predict, we need to test it. To do this we will split off 20% of the data and use it later to test our model's predictions. The remaining 80% will be used to train (teach) the computer how to predict. 

We will therefore turn our x and y variables into 4 datasets: x_train, y_train, x_test and y_test.

To do this use:

`x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=1)`

* the `test_size` parameter is how big you want your testing set to be, in this case 0.2 = 20% of the original data  
* the `random_state` parameter sets the seed for the random shuffle that happens before the split. Without it, the data would be split differently every time the code was run. The specific number doesn't matter, but using the same number always reproduces the same split

**Split your data into training and testing sets**
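
A small sketch of what the split does, using made-up arrays of 10 rows (not the worksheet's data). Running the split twice with the same `random_state` gives the same rows each time:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 10 observations where y = 2x + 1
x = np.arange(10).reshape(-1, 1)
y = (2 * np.arange(10) + 1).reshape(-1, 1)

x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2, random_state=1
)

# Same seed, same split
x_train2, x_test2, y_train2, y_test2 = train_test_split(
    x, y, test_size=0.2, random_state=1
)

print(x_train.shape, x_test.shape)
print(np.array_equal(x_test, x_test2))
```

Note that each x row stays paired with its own y row after shuffling, which is why x and y are passed to the function together.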


### Exercise 7 - fit the data and perform linear regression
---

We are now going to perform a linear regression on our training data, so that we can fit our model to the training data. 

To do this:

1. create a `LinearRegression()` model object and store it in a variable called `lin_reg` 
2. use `lin_reg.fit(x_train, y_train)` to fit the model to the data 

Similar to `stats.linregress`, we can use the intercept and coefficients from this model to plot a line of best fit.  
* the slope is accessed using: `lin_reg.coef_`
* the intercept is accessed using: `lin_reg.intercept_` 

3. plot a scatter plot using matplotlib
4. add a line of best fit using the slope and intercept from your regression
5. compare it to the regplot you made in Ex 2 
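
The fitting steps can be sketched on made-up data that lies exactly on the line y = 10x + 5, so the slope and intercept the model should recover are known in advance. This assumes both x and y were reshaped to 2D as in Exercise 5, in which case `coef_` is a 2D array and `intercept_` is a 1D array:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up training data lying exactly on y = 10x + 5
x_train = np.array([1.0, 2.0, 3.0, 4.0]).reshape(-1, 1)
y_train = 10 * x_train + 5

lin_reg = LinearRegression()
lin_reg.fit(x_train, y_train)

# With a 2D y, coef_ has shape (1, 1) and intercept_ has shape (1,)
slope = lin_reg.coef_[0][0]
intercept = lin_reg.intercept_[0]

# y values of the fitted line at each training x, ready for plotting
line_y = slope * x_train + intercept
print(slope, intercept)
```

With matplotlib, `plt.scatter(x_train, y_train)` followed by `plt.plot(x_train, line_y)` would draw the points and the fitted line on the same axes.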

### Exercise 8 - make some predictions
---

Now that we've created a model using our training data, we can test it using the test data. We can attempt to predict student scores (y) from the number of hours studied (x), and then compare these predictions to the actual student scores.

* create a new variable to store your predicted y values in 
* use `lin_reg.predict(dataset)` to predict using your x_test dataset 
* compare your predicted y values with your actual y values (y_test) - are they similar?
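
As a sketch of this step on made-up numbers (not the worksheet's data), the predictions can be placed next to the actual values in a small table for comparison:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Made-up training data lying exactly on y = 10x + 5
x_train = np.array([1.0, 2.0, 3.0, 4.0]).reshape(-1, 1)
y_train = 10 * x_train + 5

# Made-up test data that deviates slightly from that line
x_test = np.array([[5.0], [6.0]])
y_test = np.array([[56.0], [63.0]])

lin_reg = LinearRegression()
lin_reg.fit(x_train, y_train)

y_predictions = lin_reg.predict(x_test)

# ravel() flattens the (n, 1) arrays so they fit in DataFrame columns
comparison = pd.DataFrame({
    "actual": y_test.ravel(),
    "predicted": y_predictions.ravel(),
})
print(comparison)
```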

### Exercise 9 - How good is our model at predicting?
---

As mentioned in previous worksheets, r squared (r^2) measures how much of the variance in y can be explained by x. The closer r^2 is to 1, the better our model is predicting y. 

* using `r2_score(y_test, y_predictions)` see how well your model is predicting y (the actual values go first, then the predictions)
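
In sklearn, `r2_score` takes the actual values as its first argument and the predictions as its second. The made-up numbers below (not from the worksheet's data) show that swapping the two gives a different answer:

```python
from sklearn.metrics import r2_score

# Made-up actual and predicted values for illustration
y_test = [20.0, 30.0, 40.0, 50.0]
y_predictions = [22.0, 29.0, 41.0, 47.0]

r2 = r2_score(y_test, y_predictions)          # actual first, predicted second
r2_swapped = r2_score(y_predictions, y_test)  # wrong order, different result

print(r2, r2_swapped)
```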


### Exercise 10 - evaluating the model further
---

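r^2 isn't the only way to evaluate our model.

* Mean Absolute Error (MAE) - the average size of the errors, i.e. how far the predicted values are from the actual values on average, in the same units as y  
    `mean_absolute_error(test, predictions)`  
    
* Root Mean Squared Error (RMSE) - the square root of the average squared difference between the predicted and actual values; because the errors are squared before averaging, large errors count for more than in MAE  
    `np.sqrt(mean_squared_error(test, predictions))`  
    (older versions of sklearn also accept `mean_squared_error(test, predictions, squared=False)`, but that argument has been removed in recent versions)

Both of these measures tell us a similar thing: the typical size of the prediction error, where lower is better.

The bigger the spread of the data and the smaller the dataset, the higher the error tends to be.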


**Calculate the MAE and RMSE for your model using the y test and prediction values and evaluate how good your model is** 
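
A sketch of both calculations on made-up values (not the worksheet's data). RMSE is computed here as the square root of the mean squared error, which is an assumption made so the example does not depend on a particular sklearn version:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Made-up actual and predicted values; the errors are 2, 1, 1 and 3
y_test = [20.0, 30.0, 40.0, 50.0]
y_predictions = [22.0, 29.0, 41.0, 47.0]

# MAE: mean of the absolute errors = (2 + 1 + 1 + 3) / 4
mae = mean_absolute_error(y_test, y_predictions)

# RMSE: square root of the mean of the squared errors = sqrt((4 + 1 + 1 + 9) / 4)
rmse = np.sqrt(mean_squared_error(y_test, y_predictions))

print(mae, rmse)
```

RMSE is never smaller than MAE on the same data, and the gap between them grows when a few errors are much larger than the rest.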
