# Data Science Society Workshop 6: Scikit-Learn (Linear Regression)
In this workshop we begin our work on Scikit-Learn, one of the most popular machine learning packages for Python. Today's workshop focuses on linear regression.
### Installation of Scikit-Learn
You may not have Scikit-Learn installed on your device. Run the following code cell to install it.

In [None]:
!pip install scikit-learn

### Importing
Scikit-Learn (imported as ````sklearn````) is split into a number of modules, so we usually import each function or class from the relevant module. Run the code cell below to import everything we'll need for the workshop.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
import pickle
%matplotlib inline

### The Problem
Below there are two code cells to extract a set of data and then plot it. The first code cell generates a data frame by reading the csv file "RegressionData.csv". Plotting the two arrays reveals a set of data that seems to show some kind of relationship between ````x```` and ````y````. The data here seems to change linearly as ````x```` changes. We can model this with a line of best fit for the data. The equation for a straight line is given by:
$$y = mx + c $$
Here $y$ is known as your label or dependent variable and $x$ is your feature or independent variable. The value $m$ is the gradient or slope of the line, while $c$ is the $y$-intercept, i.e. the point where the line crosses the $y$-axis. The algorithms we will use today from Scikit-Learn are designed to determine the values of $m$ and $c$ so that a line of best fit may be found.

In [None]:
# Generates data frame from csv file
df = pd.read_csv("RegressionData.csv")

# Turning the columns into arrays
x = df["x"].values
y = df["y"].values

In [None]:
# Plots the data loaded in the cell above
plt.figure()
plt.grid(True)
plt.plot(x,y,'r.')
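
To make the idea of a line of best fit concrete before we turn to Scikit-Learn, here is a minimal sketch that estimates $m$ and $c$ directly with the standard least-squares formulas for a single feature. The data is synthetic (an assumption for illustration, not the contents of "RegressionData.csv"), and the names ````x_demo````, ````y_demo````, ````m_hat```` and ````c_hat```` are hypothetical.

```python
import numpy as np

# Synthetic data (an assumption, not RegressionData.csv):
# points scattered around the line y = 3x + 2
rng = np.random.default_rng(0)
x_demo = np.linspace(0, 10, 50)
y_demo = 3.0 * x_demo + 2.0 + rng.normal(scale=0.5, size=x_demo.size)

# Least-squares estimates for one feature:
#   m = sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))**2)
#   c = mean(y) - m * mean(x)
x_mean, y_mean = x_demo.mean(), y_demo.mean()
m_hat = np.sum((x_demo - x_mean) * (y_demo - y_mean)) / np.sum((x_demo - x_mean) ** 2)
c_hat = y_mean - m_hat * x_mean

print("Estimated gradient:", m_hat)
print("Estimated intercept:", c_hat)
```

The estimates should land close to the gradient of 3 and intercept of 2 used to generate the data, with small differences caused by the added noise.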

### Preprocessing the data
#### Reshaping
The first step is to prepare the data, which is known as preprocessing. In this case we need to reshape the data: ````x```` is currently a one-dimensional array, whereas the functions from Scikit-Learn expect the features as a two-dimensional array with one row per sample and one column per feature. To fix this we can use ````np.reshape(x,(-1,1))````. Here we will use ````x.reshape(-1,1)````, which for a Numpy array ````x```` is equivalent to the ````np.reshape```` call. The $1$ specifies that the data will have one column, and the $-1$ tells Numpy to work out the size of the remaining dimension (the number of rows) from the length of the array. It is also worth noting that we typically use a capital ````X```` to denote our features or independent variables, and again we use ````y```` for our dependent variable or label. Depending on how your data is formatted to begin with, this step may not be required. If your ````X```` is a data frame with multiple columns, for instance, you will not need to convert it to an array. A single column of features generally needs to be converted to an array and reshaped as demonstrated below.

In [None]:
# Original x and y, together with their shapes
print(x, x.shape)
print(y, y.shape)

In [None]:
# Independent variable or features
X = x.reshape(-1,1)

# Dependent variable or labels
y = y.reshape(-1,1)

# Reshaped X and y
print(X)
print(y)
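
As a small illustration of what the reshape does, the sketch below applies it to a hypothetical four-element array (````x_demo```` is made up for this example) and prints the shape before and after.

```python
import numpy as np

# Hypothetical 1-D array standing in for a single feature
x_demo = np.array([1.0, 2.0, 3.0, 4.0])

# -1 lets NumPy infer the number of rows; 1 fixes the number of columns
X_demo = x_demo.reshape(-1, 1)

print(x_demo.shape)   # (4,)
print(X_demo.shape)   # (4, 1)

# The array method and the np.reshape function give the same result
same = np.array_equal(X_demo, np.reshape(x_demo, (-1, 1)))
print(same)           # True
```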

#### Separating into Training and Test Sets
Here we split our data into a training set and a test set for each variable using the ````train_test_split```` function. Its inputs are our two variables plus an additional argument giving the proportion of the data to place in the test set. Here we pass ````test_size=0.2````, so $20\%$ of the data is allocated to the test set while $80\%$ is allocated to the training set. The rows are assigned to each set at random.

In [None]:
# Separates the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
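
Because the split is random, the scores later in the notebook will change a little each time the cell above is run. One way to make the split repeatable is the ````random_state```` argument of ````train_test_split````. The sketch below uses ten made-up samples (the ````_demo```` names and the seed value 42 are arbitrary choices for this example).

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Ten hypothetical samples lying on y = 2x + 1
X_demo = np.arange(10).reshape(-1, 1)
y_demo = 2 * X_demo + 1

# test_size=0.2 puts 2 of the 10 rows in the test set
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)
print(X_tr.shape, X_te.shape)   # (8, 1) (2, 1)

# Repeating the call with the same seed gives the same split
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)
print(np.array_equal(X_te, X_te2))   # True
```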

### Training the Regressor
We begin by defining a ````regressor```` by creating an instance of the ````LinearRegression```` class. The argument ````n_jobs```` sets the number of CPU cores (parallel jobs) the computation may use, and passing $-1$ tells the regressor to use all available processors. For a small problem with a single label, like this one, it is unlikely to make a noticeable difference. Then we call ````regressor.fit(X_train, y_train)````. This passes our training data into the regressor in order to train it. For the linear regression algorithm, training means finding the values of $m$ and $c$ in the equation for the line of best fit.

In [None]:
# Defining our regressor
regressor = LinearRegression(n_jobs=-1)

# Train the regressor
fit = regressor.fit(X_train, y_train)

#### Returning the Gradient and the Intercept
Having trained the regressor we can now return the values of the gradient and intercept. These are simply the attributes ````.coef_```` and ````.intercept_```` of our trained regressor. The ````fit```` method returns the regressor itself, so the variables ````fit```` and ````regressor```` refer to the same trained object. Here we access the attributes through ````fit````.

In [None]:
# Returns gradient and intercept
print("Gradient:",fit.coef_)
print("Intercept:",fit.intercept_)
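
The values printed above are arrays rather than plain numbers because ````y```` was reshaped into a column. The sketch below, using a few made-up points that lie exactly on $y = 2x + 1$, shows the shapes involved and one way to pull out $m$ and $c$ as single numbers. The names ````model````, ````fitted````, ````m_demo```` and ````c_demo```` are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical points lying exactly on y = 2x + 1
X_demo = np.array([[0.0], [1.0], [2.0], [3.0]])
y_demo = 2.0 * X_demo + 1.0          # column vector, shape (4, 1)

model = LinearRegression()
fitted = model.fit(X_demo, y_demo)

# fit returns the estimator itself, so both names point at one object
print(fitted is model)               # True

# With a column-shaped y, coef_ is 2-D and intercept_ is 1-D
print(model.coef_.shape)             # (1, 1)
print(model.intercept_.shape)        # (1,)

# Extracting the gradient and intercept as single numbers
m_demo = model.coef_[0, 0]
c_demo = model.intercept_[0]
print(m_demo, c_demo)
```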

### Line of Best Fit Values
Now that we've trained our regressor we can obtain the values for the line of best fit. We do this by passing a set of points to ````regressor.predict````. For each point the regressor predicts a value based on the line of best fit that was obtained. Essentially, the input is passed through the equation $ y = mx + c $, where the values of $m$ and $c$ are given by ````fit.coef_```` and ````fit.intercept_```` respectively. The code cell below uses ````regressor.predict```` to obtain a predicted set of values. Then we obtain another set of values by explicitly plugging the same points into the formula for our line of best fit. We subtract one set from the other to show that ````regressor.predict```` does indeed do the same thing: the differences should all be zero, up to floating-point rounding.

In [None]:
# Predicted values 
y_pred = regressor.predict(X_test)

# y = m*x + c
best_fit_line = fit.coef_*X_test + fit.intercept_

# Comparing the two
y_pred - best_fit_line
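
Instead of inspecting the differences by eye, the same check can be done with ````np.allclose````, which tests whether two arrays agree to within floating-point tolerance. The sketch below is self-contained and uses synthetic data (an assumption, not the workshop data set), with hypothetical ````_demo```` names.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data scattered around y = 4x - 3 (an assumption for illustration)
rng = np.random.default_rng(1)
X_demo = rng.uniform(0, 10, size=(30, 1))
y_demo = 4.0 * X_demo - 3.0 + rng.normal(scale=0.3, size=(30, 1))

model = LinearRegression().fit(X_demo, y_demo)

# Predictions from the regressor and from the explicit formula y = m*x + c
y_from_predict = model.predict(X_demo)
y_from_formula = model.coef_ * X_demo + model.intercept_

# True when the two arrays agree up to floating-point rounding
matches = np.allclose(y_from_predict, y_from_formula)
print(matches)
```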

### Plotting the Line of Best Fit
Below we plot the line of best fit against the original data. The line lets us predict what ````y```` value we would obtain for an ````x```` value that lies within the interval covered by the data but was not actually recorded in the data set. This is known as interpolation. In effect, our line is a prediction of how the value of ````y```` changes as ````x```` changes.

In [None]:
# Plot of the data with the line of best fit
plt.plot(X_test,y_pred)
plt.plot(x,y,'r.')
plt.grid(True)
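
As a sketch of interpolation in code, the example below trains on four made-up points lying on $y = 2x + 1$ and then predicts at $x = 2.5$, a value between the recorded points. The names ````x_new```` and ````y_new```` are hypothetical. Note that ````predict```` expects a two-dimensional input even for a single point.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical recorded points on y = 2x + 1
X_demo = np.array([[1.0], [2.0], [3.0], [4.0]])
y_demo = np.array([3.0, 5.0, 7.0, 9.0])

model = LinearRegression().fit(X_demo, y_demo)

# A new x value inside the recorded interval, as a 2-D array (one row, one column)
x_new = np.array([[2.5]])
y_new = model.predict(x_new)
print(y_new)
```

The prediction should be very close to $6$, the value of $2x + 1$ at $x = 2.5$.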

### Scoring the Model
Let us compare the predicted results with the actual values by putting them into a data frame. Note that the slicing ````[:,0]```` is needed because each value in the dictionary passed to ````pd.DataFrame```` must be one-dimensional. Since we reshaped our arrays into two-dimensional single-column arrays at the start, we select that column to get a one-dimensional array back.

In [None]:
# Converts predicted values and test values to a data frame
df = pd.DataFrame({"Predicted": y_pred[:,0], "Actual": y_test[:,0]})
df
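
An alternative to slicing with ````[:,0]```` is the Numpy method ````ravel````, which flattens an array to one dimension. The sketch below uses small made-up arrays (not the workshop results) and also adds a hypothetical "Residual" column holding the difference between actual and predicted values.

```python
import numpy as np
import pandas as pd

# Made-up column-shaped arrays standing in for y_pred and y_test
y_pred_demo = np.array([[1.1], [2.0], [2.9]])
y_test_demo = np.array([[1.0], [2.0], [3.0]])

# ravel() flattens each (3, 1) array to shape (3,)
comparison = pd.DataFrame({
    "Predicted": y_pred_demo.ravel(),
    "Actual": y_test_demo.ravel(),
})

# Residual = actual value minus predicted value
comparison["Residual"] = comparison["Actual"] - comparison["Predicted"]
print(comparison)
```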

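We can score our model by calling ````regressor.score(X_test,y_test)````. The test set is data that our regressor has not seen during training, so it provides a fair basis on which to judge the results. For a regressor, the score is the coefficient of determination $R^2$. The best score you can obtain for the model is $1$. The lower the score, the worse a fit the model is to the data, and the score can even be negative for a model that fits worse than always predicting the mean of ````y````.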

In [None]:
# Determines a score for our model
score = regressor.score(X_test, y_test)
print(score)
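
To show what the score measures, the sketch below computes $R^2 = 1 - SS_{res}/SS_{tot}$ by hand and compares it with the value returned by ````score````. The data is synthetic (an assumption for illustration) and the ````_demo````, ````ss_res````, ````ss_tot```` and ````r2_manual```` names are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data scattered around y = 1.5x + 4 (an assumption for illustration)
rng = np.random.default_rng(2)
X_demo = rng.uniform(0, 10, size=(40, 1))
y_demo = 1.5 * X_demo[:, 0] + 4.0 + rng.normal(scale=1.0, size=40)

model = LinearRegression().fit(X_demo, y_demo)
y_hat = model.predict(X_demo)

# Residual sum of squares: squared error of the model's predictions
ss_res = np.sum((y_demo - y_hat) ** 2)
# Total sum of squares: squared error of always predicting the mean
ss_tot = np.sum((y_demo - y_demo.mean()) ** 2)

r2_manual = 1.0 - ss_res / ss_tot
r2_sklearn = model.score(X_demo, y_demo)
print(r2_manual, r2_sklearn)
```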

### Pickling
When dealing with large datasets it may take a while to train your regressor (or classifier for a classification algorithm). In such cases it's often useful to use pickling, which in effect stores the trained model so that it can be loaded in later. To do this we first have to open a file for writing by using ````open("filename","wb")````. This command creates a file with the specified ````"filename"```` if no such file exists, and overwrites the file if it does. The ````"wb"```` argument specifies what we will do with the file: the ````w```` indicates that we would like to write into the file and the ````b```` that the file is opened in binary mode, which pickling requires. We also assign our opened file to the variable ````f````, but any suitable variable name is fine. To store the model in the file we write ````pickle.dump(regressor,f)````, where the first argument is the regressor to be stored and the second is the variable name chosen for our file. Once we are done with the file we write ````f.close()```` to close it.

In [None]:
# Creates a .pickle file to store our trained regressor
f = open("linearregression.pickle","wb")

# Stores the regressor in the file f
pickle.dump(regressor,f)

# Closes the file
f.close()

To retrieve our trained regressor we first open the file with ````open("linearregression.pickle","rb")```` and assign it to the variable ````pickle_in````. Here we swapped ````w```` for ````r````, meaning that instead of writing to our file we only want to read it. Then to load in our trained regressor we write ````pickle.load(pickle_in)````. Once again we conclude by closing the file with ````pickle_in.close()````.

In [None]:
# Opens the file to read
pickle_in = open("linearregression.pickle","rb")

# Loads the regressor from the file
regressor = pickle.load(pickle_in)

# Closes the file
pickle_in.close()
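
The open, dump or load, and close steps can also be written with Python's ````with```` statement, which closes the file automatically even if an error occurs. The sketch below is one way to do this. It trains a tiny model on made-up data and writes to a temporary directory so that it does not touch "linearregression.pickle"; the file name "demo_model.pickle" and the variable names are hypothetical. As a general precaution, only load pickle files that come from a source you trust.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.linear_model import LinearRegression

# Tiny made-up data set on y = 2x + 1
X_demo = np.array([[0.0], [1.0], [2.0]])
y_demo = np.array([1.0, 3.0, 5.0])
model = LinearRegression().fit(X_demo, y_demo)

# A temporary directory keeps this sketch from overwriting real files
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo_model.pickle")

    # Write the trained model; the file is closed when the block ends
    with open(path, "wb") as f:
        pickle.dump(model, f)

    # Read the model back in
    with open(path, "rb") as f:
        restored = pickle.load(f)

# The restored model makes the same predictions as the original
same_predictions = np.allclose(model.predict(X_demo), restored.predict(X_demo))
print(same_predictions)
```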

We can then score our pickled regressor and make predictions from it as we would if we had trained it again.

In [None]:
# Scoring the pickled regressor
regressor.score(X_test,y_test)

In [None]:
# Inserts values of X into line of best fit
y_pred = regressor.predict(X)

In [None]:
# Plots line of best fit along with the data
plt.plot(X,y_pred)
plt.plot(x,y,'r.')
plt.grid(True)

### Multiple Linear Regression
The example used above involved only one feature and one label. However, in the real world it is highly unlikely that a label or dependent variable would depend on only one feature or independent variable. The equation for the value of $y$ in such an instance can be generalised as follows:
$$ y = c + m_1 x_1 + m_2 x_2 + \dots + m_n x_n + \epsilon $$
The equation is similar to the one before, except that we have one gradient $m_i$ for each feature $x_i$ and an additional $\epsilon$ term, which is the error. In effect, $\epsilon$ captures the part of $y$ that the linear combination of the features does not explain, such as noise in the data.

Below is an example of how to solve a linear regression problem with multiple features. The dataset in question [1] is about different types of advertising and their effect on sales. The independent variables are given by the "TV", "Radio" and "Newspaper" columns, while "Sales" is our dependent variable.

In [None]:
# Converts advertising csv to a data frame
df = pd.read_csv("advertising.csv")
df

The process for applying the linear regression algorithm when you have multiple independent variables is very similar to that for our original case. The key difference is that ````X```` consists of more than one column, so converting it to a Numpy array and reshaping it is not required. Once you have ````X```` and ````y````, the procedure for splitting the data into training and test sets, training the regressor and so forth is the same.

In [None]:
# Independent variables
X = df.drop("Sales",axis=1)

# Dependent variable
y = df["Sales"].values.reshape(-1,1)

# Splitting into test and training data
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.2)

In [None]:
# Defining regressor
regressor = LinearRegression(n_jobs=-1)

# Training our regressor
fit = regressor.fit(X_train,y_train)

# Predicting values
y_pred = fit.predict(X_test)

# Scoring our regressor
fit.score(X_test,y_test)

In [None]:
# Comparing predicted against actual values
df = pd.DataFrame({"Predicted": y_pred[:,0], "Actual": y_test[:,0]})
df
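
With several features, ````.coef_```` holds one gradient per column of ````X````, in column order. The sketch below illustrates this on synthetic data rather than "advertising.csv": the column names "a", "b" and "c" and the coefficients used to build the target are made up for this example.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic features (an assumption for illustration, not advertising.csv)
rng = np.random.default_rng(3)
features = pd.DataFrame(
    rng.uniform(0, 100, size=(50, 3)), columns=["a", "b", "c"]
)

# Noise-free target: y = 5 + 0.5*a + 0.2*b - 0.1*c
target = 5.0 + 0.5 * features["a"] + 0.2 * features["b"] - 0.1 * features["c"]

model = LinearRegression().fit(features, target)

# One gradient per feature, in the same order as the columns
for name, coef in zip(features.columns, model.coef_):
    print(name, coef)
print("Intercept:", model.intercept_)
```

Since the target here contains no noise, the fitted gradients and intercept should match the values used to build it, up to floating-point rounding.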

### References 
[1] advertising.csv: https://www.kaggle.com/ashydv/advertising-dataset