
# Multiple Linear Regression with Numeric Values
## Newspaper sales prediction Example
This notebook uses multiple linear regression to predict newspaper sales from Advert Spend, Price, Offered Prize Value and whether or not it was Wet that day.
  
All of the variables are numeric and the data are clean, so only the basic steps of performing regression are given. No hyper-parameter search is included, so no validation data are used.


Note: at various points this week we're using numpy arrays rather than pandas dataframes. Scikit-learn can handle both; in many circumstances numpy arrays are faster, but the syntax is a bit different (often more complex). You'll see many examples of numpy and scikit-learn if you look around the web, so it's helpful to see this alternative approach.
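As a rough illustration of the syntax difference mentioned above, the sketch below selects the same inputs and target once with pandas (by column name) and once with numpy (by position). The three rows are copied from the data shown later in this notebook; the column names are assumptions made for this example, not taken from the file.

```python
import numpy as np
import pandas as pd

# Three rows with made-up column names (the names are assumptions)
df = pd.DataFrame({
    "advert": [1757.0, 1695.0, 2359.0],
    "price": [60.0, 45.0, 45.0],
    "sales": [50611.0, 45457.0, 72836.0],
})

# pandas: pick columns by name
X_df = df[["advert", "price"]]
y_df = df["sales"]

# numpy: pick columns by position
arr = df.to_numpy()
X_np = arr[:, 0:2]   # columns 0 and 1
y_np = arr[:, -1]    # last column

print(X_df.shape, X_np.shape)
```

Both versions hold the same numbers; only the way the columns are addressed differs.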

In [1]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn import metrics
import numpy as np

## Load the Data
You should download the file called `Newspaper_Numeric.csv` from the course website and put it in a folder accessible to this notebook. In the code below, we assume it is in the same folder as the notebook. Change the code if it is somewhere else.

In [3]:
data = np.loadtxt('Newspaper_Numeric.csv', delimiter=",", skiprows=1)
data

array([[1.7570e+03, 6.0000e+01, 3.0000e+01, 1.0000e+00, 5.0611e+04],
       [1.6950e+03, 4.5000e+01, 3.0000e+01, 1.0000e+00, 4.5457e+04],
       [2.3590e+03, 4.5000e+01, 7.0000e+01, 0.0000e+00, 7.2836e+04],
       ...,
       [1.2350e+03, 6.5000e+01, 7.0000e+01, 0.0000e+00, 5.2755e+04],
       [1.5980e+03, 6.0000e+01, 4.0000e+01, 0.0000e+00, 5.1524e+04],
       [2.2290e+03, 5.0000e+01, 4.0000e+01, 0.0000e+00, 6.5331e+04]])
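The `skiprows=1` argument is there because `np.loadtxt` only parses numbers, so the header line has to be skipped. A minimal sketch of that behaviour follows, using an in-memory string in place of the real file. The header names here are hypothetical; check the actual CSV for the real ones.

```python
import io
import numpy as np

# A tiny stand-in for the CSV file; the header names are hypothetical
csv_text = """advert,price,prize,wet,sales
1757,60,30,1,50611
1695,45,30,1,45457
2359,45,70,0,72836
"""

# skiprows=1 skips the header so only the numeric rows are parsed
demo = np.loadtxt(io.StringIO(csv_text), delimiter=",", skiprows=1)

# If the column names are needed, read the first line separately
header = csv_text.splitlines()[0].split(",")

print(demo.shape)
print(header)
```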

## Extract the Inputs and Outputs - single input
We'll begin with a simple linear regression: that is, predicting a single output variable from a single input variable.

The target output variable, `sales`, is the last column in the file, so we put that into a variable called `y`. `advert spend` is our input, in column 0, which we put into `X`. Then we split off 30% of the rows for testing.

In [4]:
cols = data.shape[1]
X = data[:,0:1]
y = data[:,cols-1]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30)
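Two details of the cell above are easy to miss. The slice `0:1` keeps `X` two-dimensional, which is the shape scikit-learn expects for inputs, whereas a plain index `0` would give a one-dimensional array. Also, `train_test_split` shuffles randomly, so results change from run to run unless `random_state` is set. The sketch below shows both points on a small made-up array; the seed value is arbitrary.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up array with 10 rows and 5 columns, standing in for the real data
toy = np.arange(50, dtype=float).reshape(10, 5)

X_1d = toy[:, 0]     # plain index: one-dimensional, shape (10,)
X_2d = toy[:, 0:1]   # slice: two-dimensional, shape (10, 1)
y_toy = toy[:, -1]
print(X_1d.shape, X_2d.shape)

# Fixing random_state (an arbitrary choice here) makes the split repeatable
split_a = train_test_split(X_2d, y_toy, test_size=0.30, random_state=0)
split_b = train_test_split(X_2d, y_toy, test_size=0.30, random_state=0)
print(np.array_equal(split_a[1], split_b[1]))
```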

## Build the Regression Model
We fit the regression model next and print the R-squared value.

In [5]:
# make a new regression model object, and fit it to the data
reg = LinearRegression().fit(X_train, y_train)

# we can interrogate this object for the trained model coefficients and an r^2 score
# note: score(X, y) below is computed on all rows (training and test together)
print("R Squared =",reg.score(X, y))
print("Coefficients and intercept =",reg.coef_, reg.intercept_)

R Squared = 0.7262845381804389
Coefficients and intercept = [14.71172193] 29713.039315480128
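The coefficient and intercept define the fitted line: prediction = intercept + coefficient × input. The sketch below first checks that arithmetic against `predict` on a tiny synthetic dataset (invented for this example), then applies it to the values printed above. The advert spend of 2000 is a hypothetical input.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic, exactly linear data (an assumption for illustration): y = 3x + 5
X_demo = np.array([[1.0], [2.0], [3.0], [4.0]])
y_demo = 3.0 * X_demo[:, 0] + 5.0
demo_reg = LinearRegression().fit(X_demo, y_demo)

# Manual prediction: intercept + coefficient * input
x_new = np.array([[10.0]])
manual = demo_reg.intercept_ + demo_reg.coef_[0] * x_new[0, 0]
from_model = demo_reg.predict(x_new)[0]
print(manual, from_model)

# The same arithmetic with the coefficient and intercept printed above
coef, intercept = 14.71172193, 29713.039315480128
advert_spend = 2000.0  # hypothetical input value
predicted_sales = intercept + coef * advert_spend
print(round(predicted_sales, 2))
```

Your own coefficient and intercept will differ from run to run, because the train/test split is random.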


## Finally, Predict on the Test Data
We predict the values for the test data and calculate the mean absolute error for that data. Try other metrics in the second line.

In [6]:
preds = reg.predict(X_test)
test_MAE = metrics.mean_absolute_error(y_test, preds)
print("Mean Absolute Error on test =",test_MAE)

Mean Absolute Error on test = 3465.7932054853964


In [11]:
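# R-squared on the test data (a goodness-of-fit score, not an error: higher is better)
r2_test = metrics.r2_score(y_test, preds)
print(r2_test)
# mean squared error on the test data (in squared sales units)
mse_test = metrics.mean_squared_error(y_test, preds)
print(mse_test)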


0.7939208253497808
18093770.68386469
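To see what these metrics measure, the sketch below computes mean absolute error, mean squared error and R-squared by hand with numpy and compares them with the scikit-learn functions. The four target and prediction values are made up for the example.

```python
import numpy as np
from sklearn import metrics

# Made-up targets and predictions
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.0, 5.0, 8.0, 11.0])

errors = y_true - y_pred

mae = np.mean(np.abs(errors))   # average size of the errors
mse = np.mean(errors ** 2)      # average squared error
rmse = np.sqrt(mse)             # back in the same units as the target

# R-squared: 1 - (residual sum of squares / total sum of squares)
ss_res = np.sum(errors ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1.0 - ss_res / ss_tot

print(mae, metrics.mean_absolute_error(y_true, y_pred))
print(mse, metrics.mean_squared_error(y_true, y_pred))
print(r2, metrics.r2_score(y_true, y_pred))
```

Taking the square root of the mean squared error gives a value in the same units as `sales`, which is easier to compare with the mean absolute error.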


## Extract the Inputs and Outputs - multiple inputs
Now we'll try multiple linear regression.

The target output variable, `sales`, is the last column in the file, so we put that into a variable called `y` and the other (input) columns into `X`. Then we split off 30% of the rows for testing.

In [13]:
# how many columns are in the data?
cols = data.shape[1]

# get the feature columns
X = data[:,0:cols-1]

# the target variable is in the last column
y = data[:,cols-1]

# train/test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30)

## Build the Regression Model
We fit the regression model next and print the R-squared value. With more inputs it is usually a bit higher than the score for simple linear regression, although each run uses a fresh random split, so the exact values vary.

In [14]:
reg = LinearRegression().fit(X_train, y_train)
print("R Squared =",reg.score(X, y))
print("Coefficients and intercept =",reg.coef_, reg.intercept_)

R Squared = 0.8315795565670185
Coefficients and intercept = [   14.98317875  -126.40657446    88.25701868 -2804.81612855] 31957.993396006048
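The cell above calls `reg.score(X, y)`, which scores the model on every row, including the ones it was trained on. One possible variation, sketched below on synthetic data, is to report R-squared separately for the training rows and the held-out test rows. The data, coefficients and seed here are invented for the example and are not the newspaper data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic data with four inputs (an assumption, not the newspaper file)
rng = np.random.default_rng(0)
X_syn = rng.normal(size=(200, 4))
true_coefs = np.array([15.0, -126.0, 88.0, -2800.0])  # invented values
y_syn = X_syn @ true_coefs + 32000.0 + rng.normal(scale=100.0, size=200)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.30, random_state=0
)
syn_reg = LinearRegression().fit(X_tr, y_tr)

# Score the training rows and the held-out rows separately
train_r2 = syn_reg.score(X_tr, y_tr)
test_r2 = syn_reg.score(X_te, y_te)
print("R Squared on train =", train_r2)
print("R Squared on test  =", test_r2)
```

The test score is the more honest estimate of how the model behaves on unseen data.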


## Finally, Predict on the Test Data
We predict the values for the test data and calculate the mean absolute error for that data. Try other metrics in the second line.

In [15]:
preds = reg.predict(X_test)
test_MAE = metrics.mean_absolute_error(y_test, preds)
print("Mean Absolute Error on test =",test_MAE)

Mean Absolute Error on test = 3145.8335433344587
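In this notebook the single-input and multiple-input models are fitted on different random splits, so their scores are not directly comparable. A fairer comparison, sketched below under the assumption of synthetic data and an arbitrary seed, fits both models on the same split.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic data (invented for this example): three of four inputs matter
rng = np.random.default_rng(1)
X_all = rng.normal(size=(300, 4))
y_all = (10.0 * X_all[:, 0] - 4.0 * X_all[:, 1] + 2.0 * X_all[:, 2]
         + rng.normal(scale=1.0, size=300))

# One split, reused for both models (random_state fixes the shuffle)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_all, y_all, test_size=0.30, random_state=42
)

single = LinearRegression().fit(X_tr[:, 0:1], y_tr)  # column 0 only
multi = LinearRegression().fit(X_tr, y_tr)           # all four columns

r2_single_train = single.score(X_tr[:, 0:1], y_tr)
r2_multi_train = multi.score(X_tr, y_tr)
r2_single_test = single.score(X_te[:, 0:1], y_te)
r2_multi_test = multi.score(X_te, y_te)

print("train R Squared:", r2_single_train, r2_multi_train)
print("test R Squared: ", r2_single_test, r2_multi_test)
```

On the training rows, adding inputs to a least-squares fit cannot lower R-squared. On the test rows an improvement is likely when the extra inputs carry real signal, but it is not guaranteed.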
