# Lab: Evaluating Model Predictions for Regression Models

How do you know if a regression model is a good estimator of what you are trying to predict?  
This lab will walk you through building several multivariate linear regression models using different prediction variables and then comparing the model predictions using evaluation tools such as R-squared and Mean Squared Error (MSE).

## Section 1: Prepare Model Data
The first step is to import the data you will be using to make your model predictions.  You will be using the [Wine Quality dataset](https://archive.ics.uci.edu/ml/datasets/Wine+Quality) from the [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/index.php).  This dataset was originally created for the paper ["Modeling wine preferences by data mining from physicochemical properties"](http://dx.doi.org/10.1016/j.dss.2009.05.016), published in *Decision Support Systems* by P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis.

### Import Data
Python has a package called [pandas](https://pandas.pydata.org/), which is great for reading and manipulating data.  You will be using this package to import the provided CSV into a *pandas* [dataframe](https://pandas.pydata.org/pandas-docs/stable/user_guide/dsintro.html#dataframe).

In [1]:
#Import pandas
import pandas as pd

#Load data
wine = pd.read_csv('../data/winequality.csv')

#Show first 10 rows of data frame
wine.head(10)

,type,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality
0,white,7.0,0.27,0.36,20.7,0.045,45.0,170.0,1.001,3.0,0.45,8.8,6
1,white,6.3,0.3,0.34,1.6,0.049,14.0,132.0,0.994,3.3,0.49,9.5,6
2,white,8.1,0.28,0.4,6.9,0.05,30.0,97.0,0.9951,3.26,0.44,10.1,6
3,white,7.2,0.23,0.32,8.5,0.058,47.0,186.0,0.9956,3.19,0.4,9.9,6
4,white,7.2,0.23,0.32,8.5,0.058,47.0,186.0,0.9956,3.19,0.4,9.9,6
5,white,8.1,0.28,0.4,6.9,0.05,30.0,97.0,0.9951,3.26,0.44,10.1,6
6,white,6.2,0.32,0.16,7.0,0.045,30.0,136.0,0.9949,3.18,0.47,9.6,6
7,white,7.0,0.27,0.36,20.7,0.045,45.0,170.0,1.001,3.0,0.45,8.8,6
8,white,6.3,0.3,0.34,1.6,0.049,14.0,132.0,0.994,3.3,0.49,9.5,6
9,white,8.1,0.22,0.43,1.5,0.044,28.0,129.0,0.9938,3.22,0.45,11.0,6


Your model will be attempting to predict the **quality** value, given one or more of the other columns.  The variables you use to predict a modeled output are called *features* or *predictor variables*.  

Using the *groupby()* function in *pandas*, you can count the total number of records in the dataframe by white and red wine.

In [2]:
#Display count by wine type
pd.DataFrame(wine.groupby('type').size())

type,0
red,1599
white,4898


### Create Dummy Variable
Linear regression models work with numeric inputs and cannot use text categories directly.  The **type** field in the dataframe is a categorical variable, taking one of two category values (red or white).  

You can use the *get_dummies* function in *pandas* to convert this categorical variable into a dummy variable of 1 (true) for white wine and 0 (false) for red wine and add it as a column called **white** to the dataframe.

In [3]:
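#Create dummy variable column using pandas get_dummies
#dtype=int keeps the column as 1/0 (newer pandas versions default to True/False)
wine['white'] = pd.get_dummies(wine['type'], drop_first=True, dtype=int)['white']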

wine.head()

,type,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality,white
0,white,7.0,0.27,0.36,20.7,0.045,45.0,170.0,1.001,3.0,0.45,8.8,6,1
1,white,6.3,0.3,0.34,1.6,0.049,14.0,132.0,0.994,3.3,0.49,9.5,6,1
2,white,8.1,0.28,0.4,6.9,0.05,30.0,97.0,0.9951,3.26,0.44,10.1,6,1
3,white,7.2,0.23,0.32,8.5,0.058,47.0,186.0,0.9956,3.19,0.4,9.9,6,1
4,white,7.2,0.23,0.32,8.5,0.058,47.0,186.0,0.9956,3.19,0.4,9.9,6,1
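
To see what *get_dummies* and *drop_first* do on a small scale, here is a minimal sketch using a toy dataframe.  The four rows are hypothetical values chosen for illustration, not records from the wine dataset.

```python
import pandas as pd

# Toy data (hypothetical values, not taken from the wine dataset)
toy = pd.DataFrame({'type': ['red', 'white', 'white', 'red']})

# Without drop_first there is one column per category, in sorted order
all_dummies = pd.get_dummies(toy['type'], dtype=int)

# drop_first=True drops the first category ('red'), leaving only 'white'
white_only = pd.get_dummies(toy['type'], drop_first=True, dtype=int)

print(list(all_dummies.columns))
print(list(white_only.columns))
print(white_only['white'].tolist())
```

With only two categories, the single remaining column carries all of the information: a 0 in **white** can only mean red.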


### Create training and test data sets
When evaluating how well your model does predicting wine quality, you want to use the model to predict quality for records it has never seen before.  This is accomplished by splitting your dataset into a training dataset, which will be used to train the model, and a test dataset, which will be held in reserve until it is time to test how well the model does in the "real world."  

This split will be done using the [train_test_split](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html) function from the [scikit-learn](https://scikit-learn.org/stable/index.html) package.

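Note: The *shuffle* parameter defaults to **True**; it is set explicitly here to make clear that the records are shuffled before splitting, which ensures that both the training and test sets get a mix of red and white wines.  Setting *random_state* makes the split reproducible.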

In [4]:
#Import train_test_split from scikit-learn
from sklearn.model_selection import train_test_split

#First, split the dataset into features (predictors) and the output (target)
#Features are all columns except quality (the output) and the 'type' column (which was converted to a dummy variable)
features = wine[wine.columns.difference(['type','quality'])]
target = wine['quality']

#Split into training and test datasets using an 80%/20% split
features_train, \
features_test, \
target_train, \
target_test = train_test_split(features, target, test_size = 0.2, shuffle=True, random_state = 123)

Using the *shape* attribute in *pandas*, you can see how many records went into each dataset.

In [5]:
#Count training dataframes
print("There are {} records in the training features dataframe.".format(features_train.shape[0]))
print("There are {} records in the training target dataframe.".format(target_train.shape[0]))

There are 5197 records in the training features dataframe.
There are 5197 records in the training target dataframe.


Now it is your turn.  Count the number of records in the test features and target dataframes.

In [7]:
#Count test dataframes
print("There are {} records in the testing features dataframe.".format(features_test.shape[0]))
print("There are {} records in the testing target dataframe.".format(target_test.shape[0]))

There are 1300 records in the testing features dataframe.
There are 1300 records in the testing target dataframe.
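
The 5197/1300 split can be reproduced without the wine data.  The sketch below assumes only the record counts shown earlier (1599 red and 4898 white) and uses a placeholder array in place of the real rows.  It illustrates that a 20% share of 6497 records (1299.4) is rounded up for the test set.

```python
import math

import numpy as np
from sklearn.model_selection import train_test_split

# Total records: red + white wines from the groupby count
n_records = 1599 + 4898

# Placeholder array standing in for the real rows (an assumption for illustration)
placeholder = np.arange(n_records)

train_part, test_part = train_test_split(
    placeholder, test_size=0.2, shuffle=True, random_state=123
)

print(len(train_part), len(test_part))

# The test count matches the 20% share rounded up
print(math.ceil(0.2 * n_records))
```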


## Section 2: Build the Model
Now that your data has been split into training and test datasets, you can build your model.  The model you will be creating is a [multivariate linear regression](https://en.wikipedia.org/wiki/Linear_regression) model (often called multiple linear regression, since it has several predictors and a single target).  This means that you will be creating a function that uses a linear combination of one or more of the input or predictor variables to estimate the target variable.
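
As a rough illustration of what "linear combination" means, the sketch below fits *LinearRegression* on synthetic data.  The two features and the coefficients 2, -3 and 5 are assumptions made up for this example.  It then rebuilds the predictions by hand from the fitted intercept and coefficients.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data (an assumption for illustration): y = 2*x1 - 3*x2 + 5, no noise
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
y = 2 * X[:, 0] - 3 * X[:, 1] + 5

toy_model = LinearRegression().fit(X, y)

# A prediction is the intercept plus the weighted sum of the features
manual = toy_model.intercept_ + X @ toy_model.coef_

print(toy_model.coef_, toy_model.intercept_)
print(np.allclose(manual, toy_model.predict(X)))
```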

But how do you know which features are useful in predicting the target variable of wine quality?  

The answer is [feature selection](https://en.wikipedia.org/wiki/Feature_selection).  There are many different methods of feature selection, but for this lab, you will employ forward and backward stepwise selection.
* Forward stepwise selection starts with no features and, at each step, adds the single feature that improves the fit the most.
* Backward stepwise selection starts with all features and, at each step, removes the single feature whose removal hurts the fit the least.
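
The forward procedure described above can be sketched by hand.  This is a simplified illustration on synthetic data, not the *scikit-learn* implementation: the helper name `forward_select`, the data, and the choice of 5-fold cross-validated $R^2$ as the fit measure are all assumptions for the example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic data: only columns 0 and 2 drive the target
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = 3 * X[:, 0] + 1.5 * X[:, 2] + rng.normal(scale=0.1, size=200)

def forward_select(X, y, n_features):
    """Greedy forward selection scored by mean cross-validated R-squared."""
    selected = []
    remaining = list(range(X.shape[1]))
    while len(selected) < n_features:
        # Score each remaining column when added to the columns chosen so far
        scores = {
            col: cross_val_score(
                LinearRegression(), X[:, selected + [col]], y, cv=5, scoring='r2'
            ).mean()
            for col in remaining
        }
        best = max(scores, key=scores.get)
        selected.append(best)
        remaining.remove(best)
    return selected

chosen = forward_select(X, y, n_features=2)
print(chosen)
```

Backward selection mirrors this loop: it starts from all columns and repeatedly drops the column whose removal leaves the highest score.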

So how do you measure model fit?  Again, there are many evaluation metrics, but for this lab, you will be exploring [$R^2$](https://en.wikipedia.org/wiki/Coefficient_of_determination) and [Mean Squared Error](https://en.wikipedia.org/wiki/Mean_squared_error).

In [8]:
#Check scikit-learn version and update if lower than 0.24
! pip show scikit-learn

Name: scikit-learn
Version: 0.24.1
Summary: A set of python modules for machine learning and data mining
Home-page: http://scikit-learn.org
Author: None
Author-email: None
License: new BSD
Location: /home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages
Requires: threadpoolctl, numpy, scipy, joblib
Required-by: sklearn


In [9]:
#If version is less than 0.24, run this code
! pip install --upgrade scikit-learn



In [10]:
#Import SequentialFeatureSelector and LinearRegression from sklearn
#Note: SequentialFeatureSelector is a new feature in version 0.24
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

#Initialize your linear regression model
model = LinearRegression()

#Set up forward stepwise selection model using R-squared
forward_selection = SequentialFeatureSelector(model, n_features_to_select=5, direction='forward', scoring='r2')

#Fit the forward selection model
forward_selection.fit(features_train, target_train)

#Display the five best features using forward selection
forward_best = features_train.columns[forward_selection.get_support()]
forward_best

Index(['alcohol', 'residual sugar', 'sulphates', 'volatile acidity', 'white'], dtype='object')

In [11]:
#Train linear regression model with features from forward stepwise selection
best_forward_selection = model.fit(features_train[forward_best], target_train)

#Use the new model to predict wine quality on the training and test datasets
training_predict = best_forward_selection.predict(features_train[forward_best])
test_predict = best_forward_selection.predict(features_test[forward_best])

## Section 3: Evaluate the Model
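$R^2$ is a measure of how much of the variance in the target variable is explained by the predictor variables.  For a linear regression with an intercept, it ranges from 0 to 1 on the training data, with 0 meaning the model explains none of the variance and 1 meaning it explains the variance perfectly.  On unseen data, $R^2$ can fall below 0 if the model predicts worse than always guessing the mean.  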

The closer $R^2$ is to 1, the better the model fit.  

$R^2$ is calculated in *scikit-learn* using the [r2_score](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.r2_score.html) function.

In [12]:
#Import R-squared metric
from sklearn.metrics import r2_score

#Calculate R-squared for the training and test datasets
training_r2 = r2_score(target_train, training_predict)
test_r2 = r2_score(target_test, test_predict)

print('The R-squared value for the training dataset is {:.5}'.format(training_r2))
print('The R-squared value for the test dataset is {:.5}'.format(test_r2))

The R-squared value for the training dataset is 0.28689
The R-squared value for the test dataset is 0.26194


The $R^2$ value is higher for the training dataset than for the test set.  This is expected: a model generally does a better job predicting wine quality on data it has already seen than on new, unseen data from the test dataset.
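
The definition behind *r2_score* can be checked by hand as one minus the ratio of the residual sum of squares to the total sum of squares.  The sketch below uses four hypothetical quality scores, not values from the lab's datasets, and also shows that predictions worse than the mean produce a negative score.

```python
import numpy as np
from sklearn.metrics import r2_score

# Hypothetical quality scores and predictions
y_true = np.array([5.0, 6.0, 7.0, 6.0])
y_pred = np.array([5.5, 6.0, 6.5, 6.0])

# Residual sum of squares: squared gaps between actual and predicted values
ss_res = np.sum((y_true - y_pred) ** 2)

# Total sum of squares: squared gaps between actual values and their mean
ss_tot = np.sum((y_true - y_true.mean()) ** 2)

manual_r2 = 1 - ss_res / ss_tot
print(manual_r2, r2_score(y_true, y_pred))

# Predictions that are worse than always guessing the mean give a negative score
bad_pred = np.array([7.0, 6.0, 5.0, 6.0])
bad_r2 = r2_score(y_true, bad_pred)
print(bad_r2)
```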

The mean squared error is computed as follows:
<center>
    $\text{MSE}=\frac{1}{n}\sum_{i=1}^{n}{(y_{i}-\hat{y}_{i})^2}$
</center>  

where 
* $n$ is the number of records
* $y_{i}$ is the actual target value (true wine quality)
* $\hat{y}_{i}$ is the predicted target value (predicted wine quality)

The lower the MSE value, the better the model fit.

MSE is calculated in *scikit-learn* using the [mean_squared_error](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.mean_squared_error.html) function.

You can now calculate the MSE for the training and test set below.

In [14]:
#Import MSE metric
from sklearn.metrics import mean_squared_error

#Calculate mean squared error for the training and test datasets
training_mse = mean_squared_error(target_train, training_predict)
test_mse = mean_squared_error(target_test, test_predict)

print('The mean squared error for the training dataset is {:.5}'.format(training_mse))
print('The mean squared error for the test dataset is {:.5}'.format(test_mse))

The mean squared error for the training dataset is 0.5457
The mean squared error for the test dataset is 0.55357
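
The MSE formula can also be applied directly with *numpy*.  The sketch below uses four hypothetical quality scores to show that the hand calculation agrees with *mean_squared_error*.  It also takes the square root (RMSE), which is in the same units as the target.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Hypothetical quality scores and predictions
y_true = np.array([5.0, 6.0, 7.0, 6.0])
y_pred = np.array([5.5, 6.0, 6.5, 7.0])

# Average of the squared differences between actual and predicted values
errors = y_true - y_pred
manual_mse = np.mean(errors ** 2)

# Root mean squared error, expressed in quality points
rmse = np.sqrt(manual_mse)

print(manual_mse, mean_squared_error(y_true, y_pred))
print(rmse)
```

By the same logic, the test MSE of 0.55357 reported above corresponds to an RMSE of roughly 0.74 quality points.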


## Section 4: Compare Models
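Now you can create additional models and compare them against each other.  For example, you can use backward selection to create a model with five features, using MSE as the scoring metric.  In *scikit-learn*, this scorer is named 'neg_mean_squared_error' because scorers follow a higher-is-better convention, so the MSE is negated.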

In [15]:
#Create a new LinearRegression object so the fitted forward selection model is not overwritten
model = LinearRegression()
#Set up backward stepwise selection model using MSE with 5 features
backward_selection = SequentialFeatureSelector(model, 
                                               n_features_to_select=5, 
                                               direction='backward', 
                                               scoring='neg_mean_squared_error')

#Fit the backward selection model
backward_selection.fit(features_train, target_train)

#Display the five best features using backward selection
backward_best = features_train.columns[backward_selection.get_support()]
backward_best

Index(['alcohol', 'residual sugar', 'sulphates', 'total sulfur dioxide',
       'volatile acidity'],
      dtype='object')
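
The 'neg_mean_squared_error' scoring name reflects the *scikit-learn* convention that a higher score is better, so the MSE is reported with its sign flipped.  The sketch below uses synthetic data (an assumption for illustration) and *cross_val_score* to show the negative scores and how to turn them back into ordinary MSE values.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic data: the target depends on the first column plus noise
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))
y = X[:, 0] + rng.normal(scale=0.5, size=100)

# One score per fold; each is the negated MSE on that fold
neg_mse_scores = cross_val_score(
    LinearRegression(), X, y, cv=5, scoring='neg_mean_squared_error'
)

# Flip the sign to recover ordinary (non-negative) MSE values
mse_scores = -neg_mse_scores

print(neg_mse_scores)
print(mse_scores.mean())
```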

Now fit a new model using the five best features from backward selection.  You will use it to predict wine quality on the test dataset in the next step.

In [16]:
#Train linear regression model with features from backward stepwise selection
best_backward_selection = model.fit(features_train[backward_best], target_train)

Compare the $R^2$ values for the forward and backward models.  Which model does a better job predicting wine quality on the test dataset?

In [19]:
#Predict wine quality for both models on the test dataset
test_forward = best_forward_selection.predict(features_test[forward_best])
test_backward = best_backward_selection.predict(features_test[backward_best])

#Calculate R-squared
forward_r2 = r2_score(target_test, test_forward)
backward_r2 = r2_score(target_test, test_backward)

print('R-squared values for the test dataset.')
print('Forward selection model: {:.5f}'.format(forward_r2))
print('Backward selection model: {:.5f}'.format(backward_r2))

R-squared values for the test dataset.
Forward selection model: 0.26194
Backward selection model: 0.26001


Now compare the forward and backward models using mean squared error instead of $R^2$.

In [20]:
#Calculate MSE
forward_mse = mean_squared_error(target_test, test_forward)
backward_mse = mean_squared_error(target_test, test_backward)

print('MSE values for the test dataset.')
print('Forward selection model: {:.5}'.format(forward_mse))
print('Backward selection model: {:.5}'.format(backward_mse))

MSE values for the test dataset.
Forward selection model: 0.55357
Backward selection model: 0.55502
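
On a single dataset, $R^2$ and MSE are tied together by $R^2 = 1 - \text{MSE}/\text{Var}(y)$, where $\text{Var}(y)$ is the variance of the actual target values.  This is why both metrics rank the two models the same way on the test set.  The sketch below checks the identity on synthetic targets and two hypothetical sets of predictions labeled A and B.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Synthetic integer "quality" scores and two sets of noisy predictions
rng = np.random.default_rng(2)
y_true = rng.integers(3, 9, size=200).astype(float)
pred_a = y_true + rng.normal(scale=0.7, size=200)
pred_b = y_true + rng.normal(scale=0.9, size=200)

# Population variance of the actual values (ddof=0)
target_variance = np.var(y_true)

summary = {}
for name, pred in [('A', pred_a), ('B', pred_b)]:
    mse = mean_squared_error(y_true, pred)
    r2 = r2_score(y_true, pred)
    # Third value rebuilds R-squared from the MSE and the target variance
    summary[name] = (mse, r2, 1 - mse / target_variance)
    print(name, summary[name])
```

Plugging in the test results printed above, both models imply a target variance of about 0.75 on the test set (0.55357 / (1 - 0.26194) and 0.55502 / (1 - 0.26001)).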


## Further Practice
If you want to practice building and comparing models to see if you can improve the model fit, you can try different combinations of the three parameters we have been adjusting for stepwise selection:
* <b>direction</b>: 'forward' or 'backward'
* <b>n_features_to_select</b>: Vary between 1 and 11 (since there are 12 total features)
* <b>scoring</b>: 'r2' or 'neg_mean_squared_error'

You could also use other [scoring metrics](https://scikit-learn.org/stable/modules/model_evaluation.html#scoring-parameter) for regression models, such as explained variance or mean absolute error.
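
One way to organize this practice is to loop over the parameter combinations and collect the test scores.  The following is a minimal sketch on synthetic data from *make_regression*, with an arbitrary small grid.  With the lab's data, you would substitute `features_train`, `features_test`, `target_train` and `target_test`.  It requires *scikit-learn* 0.24 or later.

```python
from itertools import product

from sklearn.datasets import make_regression
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption for illustration)
X, y = make_regression(n_samples=200, n_features=6, n_informative=3,
                       noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=123
)

results = []
for direction, n_features in product(['forward', 'backward'], [2, 4]):
    # Select features on the training data only
    selector = SequentialFeatureSelector(
        LinearRegression(), n_features_to_select=n_features,
        direction=direction, scoring='r2'
    )
    selector.fit(X_train, y_train)
    mask = selector.get_support()

    # Refit on the selected columns and score on the test data
    fitted = LinearRegression().fit(X_train[:, mask], y_train)
    test_r2 = r2_score(y_test, fitted.predict(X_test[:, mask]))
    results.append((direction, n_features, test_r2))

for row in results:
    print(row)
```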