Train/test split for regression
As you learned in Chapter 1, train and test sets are vital to ensure that your supervised learning model is able to generalize well to new data. This was true for classification models, and is equally true for linear regression models.

In this exercise, you will split the Gapminder dataset into training and testing sets, and then fit and predict a linear regression over all features. In addition to computing the R2 score, you will also compute the Root Mean Squared Error (RMSE), which is another commonly used metric to evaluate regression models. The feature array X and target variable array y have been pre-loaded for you from the DataFrame df.

Instructions

* Import LinearRegression from sklearn.linear_model, mean_squared_error from sklearn.metrics, and train_test_split from sklearn.model_selection.
* Using X and y, create training and test sets such that 30% is used for testing and 70% for training. Use a random state of 42.
* Create a linear regression regressor called reg_all, fit it to the training set, and evaluate it on the test set.
* Compute and print the R2 score using the .score() method on the test set.
* Compute and print the RMSE. To do this, first compute the Mean Squared Error using the mean_squared_error() function with the arguments y_test and y_pred, and then take its square root using np.sqrt().


In [1]:
# Import numpy, pandas, and seaborn
import numpy as np
import pandas as pd
import seaborn as sns


# Read the CSV file into a DataFrame: df
df = pd.read_csv('boston.csv')
df.head()

Unnamed: 0,CRIM,ZN,INDUS,CHAS,NX,RM,AGE,DIS,RAD,TAX,PTRATIO,B,LSTAT,MEDV
0,0.00632,18.0,2.31,0,0.538,6.575,65.2,4.09,1,296.0,15.3,396.9,4.98,24.0
1,0.02731,0.0,7.07,0,0.469,6.421,78.9,4.9671,2,242.0,17.8,396.9,9.14,21.6
2,0.02729,0.0,7.07,0,0.469,7.185,61.1,4.9671,2,242.0,17.8,392.83,4.03,34.7
3,0.03237,0.0,2.18,0,0.458,6.998,45.8,6.0622,3,222.0,18.7,394.63,2.94,33.4
4,0.06905,0.0,2.18,0,0.458,7.147,54.2,6.0622,3,222.0,18.7,396.9,5.33,36.2


In [3]:
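# Use TAX as the target and AGE as the single feature,
# each reshaped into a column vector of shape (n_samples, 1)
y = df.TAX.values.reshape(-1, 1)
X = df.AGE.values.reshape(-1, 1)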
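
The reshape(-1, 1) calls above matter because scikit-learn estimators expect the feature array to be 2-D, with one row per sample and one column per feature. A minimal sketch of what the call does, using the five AGE values shown in df.head() (the names age and X_demo are made up for this illustration):

```python
import numpy as np

# The five AGE values from df.head(), standing in for df.AGE.values
age = np.array([65.2, 78.9, 61.1, 45.8, 54.2])

# reshape(-1, 1): exactly one column; -1 lets NumPy infer the number of rows
X_demo = age.reshape(-1, 1)

print(age.shape)     # 1-D array
print(X_demo.shape)  # 2-D column vector
```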

In [4]:
# Import necessary modules
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Create training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state=42)

# Create the regressor: reg_all
reg_all = LinearRegression()

# Fit the regressor to the training data
reg_all.fit(X_train, y_train)

# Predict on the test data: y_pred
y_pred = reg_all.predict(X_test)

In [5]:
# Compute and print R^2 and RMSE

print("R^2: {}".format(reg_all.score(X_test, y_test)))
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
print("Root Mean Squared Error: {}".format(rmse))

R^2: 0.3193812661749622
Root Mean Squared Error: 142.80189185421455


[sqrt](https://docs.scipy.org/doc/numpy/reference/generated/numpy.sqrt.html)

[R square and rmse](https://stats.stackexchange.com/questions/142248/difference-between-r-square-and-rmse-in-linear-regression)

RMSE is the root mean squared error. It is a natural choice when the prediction errors are assumed to follow a roughly normal distribution.

It measures the typical deviation of the model's predictions from the actual values in the dataset, expressed in the same units as the target.
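
As a small worked example of that definition, the sketch below computes RMSE by hand with NumPy. The numbers are made up for illustration, and y_true and y_hat are hypothetical names that do not come from the exercise.

```python
import numpy as np

# Made-up values, for illustration only
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([2.0, 5.0, 8.0, 11.0])

residuals = y_true - y_hat            # [1, 0, -1, -2]
mse_manual = np.mean(residuals ** 2)  # (1 + 0 + 1 + 4) / 4 = 1.5
rmse_manual = np.sqrt(mse_manual)     # square root brings it back to the units of y

print("MSE:", mse_manual)
print("RMSE:", rmse_manual)
```

Because the errors are squared before averaging, the single residual of -2 contributes 4 of the 6 units of squared error, so large misses dominate the metric.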

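R2 is the coefficient of determination. It is at most 1 and usually falls between 0 and 1, but it can be negative on held-out data when the model predicts worse than simply using the mean of the target.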

R^2 = 1-(SSE/SST)

SSE: sum of squared errors (residuals); SST: total sum of squares of the target around its mean.
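
To make the formula concrete, here is a minimal sketch that evaluates 1 - SSE/SST for three sets of predictions on made-up data. The helper r_squared and all variable names are hypothetical, written only for this illustration.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """R^2 = 1 - SSE/SST, computed directly from the definition."""
    sse = np.sum((y_true - y_pred) ** 2)           # sum of squared errors
    sst = np.sum((y_true - np.mean(y_true)) ** 2)  # total sum of squares
    return 1 - sse / sst

# Made-up target values: mean 2.5, SST = 2.25 + 0.25 + 0.25 + 2.25 = 5
y_true = np.array([1.0, 2.0, 3.0, 4.0])

perfect = y_true.copy()                          # SSE = 0
mean_only = np.full_like(y_true, y_true.mean())  # SSE = SST
poor = np.array([4.0, 3.0, 2.0, 1.0])            # SSE = 9 + 1 + 1 + 9 = 20

print(r_squared(y_true, perfect))    # 1.0
print(r_squared(y_true, mean_only))  # 0.0
print(r_squared(y_true, poor))       # -3.0
```

Predicting the mean everywhere gives exactly 0, and predictions that are worse than the mean push the value below 0.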

R-squared is simply the fraction of response variance that is captured by the model.

If R-squared = 1, the model fits the data perfectly.

While both metrics indicate goodness of fit, R-squared is easier to interpret because it does not depend on the units of the target.

It directly measures how much of the variance in the data the model captures.

For example, if R2 = 0.7, the model explains 70% of the variance in the data; the remaining 30% is unexplained.

If your R2 is around 0.35, the model explains only 35% of the variance, which is usually not a good fit.

As a rough rule of thumb, an R2 in the range 0.7-0.8 indicates a good model, although what counts as good depends on the problem.
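
The two metrics are also directly linked: dividing both SSE and SST by the number of samples n turns R^2 = 1 - SSE/SST into R^2 = 1 - MSE/Var(y), where Var(y) is the population variance of the observed values. The sketch below checks this using the three test values and predictions printed in the last cell of this notebook (the variable names are made up for this illustration):

```python
import numpy as np

# Observed test values and model predictions copied from the notebook's final cell
y_obs = np.array([9.0, 8.5, 14.0])
y_hat = np.array([8.05666667, 10.07166667, 12.08666667])

mse = np.mean((y_obs - y_hat) ** 2)  # SSE / n
var_y = np.var(y_obs)                # SST / n (population variance, ddof=0)

r2 = 1 - mse / var_y
rmse = np.sqrt(mse)

print("MSE:", mse)
print("RMSE:", rmse)
print("R^2:", r2)
```

So RMSE reports the size of the error in the units of the target, while R^2 reports the same error relative to how much the target varies.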


In [7]:
import numpy as np
from sklearn import linear_model
from sklearn.metrics import mean_squared_error, r2_score

# Fit a one-feature linear regression on three training points
reg = linear_model.LinearRegression()
ar = np.array([[[1],[2],[3]], [[2.01],[4.03],[6.04]]])
x_train = ar[0,:]
y_train = ar[1,:]
reg.fit(x_train, y_train)
print('Coefficients: \n', reg.coef_)

# Held-out points used to evaluate the fit
xTest = np.array([[4],[5],[6]])
ytest = np.array([[9],[8.5],[14]])

preds = reg.predict(xTest)
print("R2 score : %.2f" % r2_score(ytest, preds))
print("Mean squared error: %.2f" % mean_squared_error(ytest, preds))

# Recompute the metrics by hand: collect the squared errors and their sum (SSE)
er = []
sse = 0
for i in range(len(ytest)):
    print("actual=", ytest[i], " predicted=", preds[i])
    sq_err = (ytest[i] - preds[i]) ** 2
    er.append(sq_err)
    sse = sse + sq_err

# MSE is the sum of squared errors divided by the number of test points
print("MSE", sse / len(er))

print("variance of squared errors", np.var(er))
print("mean of squared errors", np.mean(er))

m = np.mean(ytest)
print("average of observed values", m)

# Total sum of squares (SST): squared deviations of the observed values from their mean
sst = 0
for i in range(len(ytest)):
    sst = sst + ((ytest[i] - m) ** 2)

print("total sum of squares", sst)
print("sum of squared residuals", sse)
print("r2 calculated", 1 - (sse / sst))

Coefficients: 
 [[2.015]]
R2 score : 0.62
Mean squared error: 2.34
actual= [9.]  predicted= [8.05666667]
actual= [8.5]  predicted= [10.07166667]
actual= [14.]  predicted= [12.08666667]
MSE [2.34028611]
variance of squared errors 1.2881398892129619
mean of squared errors 2.3402861111111117
average of observed values 10.5
total sum of squares [18.5]
sum of squared residuals [7.02085833]
r2 calculated [0.62049414]
