## Computing the Mean Absolute Error of a Second Model
In this exercise, you will engineer new features and compute the score and loss of a new model.

In [1]:
#import libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error

# pipeline
from sklearn.pipeline import Pipeline

# preprocessing
from sklearn.preprocessing import MinMaxScaler 
from sklearn.preprocessing import StandardScaler 
from sklearn.preprocessing import PolynomialFeatures

In [2]:
_headers = ['CIC0', 'SM1', 'GATS1i', 'NdsCH', 'Ndssc','MLOGP', 'response']
df = pd.read_csv('https://raw.githubusercontent.com/'\
                 'PacktWorkshops/The-Data-Science-Workshop'\
                 '/master/Chapter06/Dataset/qsar_fish_toxicity.csv', names=_headers, sep=';')

In [3]:
df.head()

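|   | CIC0 | SM1 | GATS1i | NdsCH | Ndssc | MLOGP | response |
|---|------|-----|--------|-------|-------|-------|----------|
| 0 | 3.260 | 0.829 | 1.676 | 0 | 1 | 1.453 | 3.770 |
| 1 | 2.189 | 0.580 | 0.863 | 0 | 0 | 1.348 | 3.115 |
| 2 | 2.125 | 0.638 | 0.831 | 0 | 0 | 1.348 | 3.531 |
| 3 | 3.027 | 0.331 | 1.472 | 1 | 0 | 1.807 | 3.510 |
| 4 | 2.094 | 0.827 | 0.860 | 0 | 0 | 1.886 | 5.390 |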


In [4]:
# split data
features = df.drop('response', axis=1).values
labels = df[['response']].values
X_train, X_eval, y_train, y_eval = train_test_split(features, labels, test_size=0.2, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_eval, y_eval, random_state=0)

In this step, you begin by separating the DataFrame called df into two NumPy arrays (the .values attribute converts each selection from a DataFrame into an array). The first array is called features and contains all of the independent variables that you will use to make your predictions. The second is called labels and contains the values that you are trying to predict.

In the third line, you split features and labels into four sets using train_test_split. X_train and y_train contain 80% of the data and are used for training your model. X_eval and y_eval contain the remaining 20%.

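In the fourth line, you split X_eval and y_eval into two additional sets. X_val and y_val contain 75% of the evaluation data because you did not specify a ratio or size, and test_size defaults to 0.25 in that case. X_test and y_test contain the remaining 25%. Relative to the full dataset, that is roughly 15% for validation and 5% for testing.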
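The proportions described above can be checked with a small sketch. The 100-row arrays below are made-up stand-ins for the real dataset, chosen only so that the percentages are easy to read off.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data: 100 rows, 6 feature columns, 1 label column
features = np.arange(600).reshape(100, 6)
labels = np.arange(100).reshape(100, 1)

# First split: 80% for training, 20% held out for evaluation
X_train, X_eval, y_train, y_eval = train_test_split(
    features, labels, test_size=0.2, random_state=0)

# Second split: no size is given, so 25% of the held-out rows become the test set
X_val, X_test, y_val, y_test = train_test_split(
    X_eval, y_eval, random_state=0)

split_sizes = (len(X_train), len(X_val), len(X_test))
print(split_sizes)  # (80, 15, 5)
```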

In [5]:
# create a pipeline and engineer quadratic features
steps = [('scaler', MinMaxScaler()), 
         ('poly', PolynomialFeatures(2)), 
         ('model', LinearRegression())]

In this step, you begin by creating a Python list called steps. The list contains three tuples, each one representing either a transformation or a model. In every tuple, the first item is the name of the step and the second is the object that carries it out. The first tuple, named scaler, uses MinMaxScaler to rescale each column of the data. The second, named poly, uses PolynomialFeatures to create additional features by multiplying the columns with themselves and with each other, up to the degree that you specify. In this case, you specify 2, so it generates every product of at most two columns (the squares and the pairwise products) along with a constant column. The last tuple, named model, holds your LinearRegression model.
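To see what the poly step produces, here is a minimal sketch on a single made-up row with two features. The values 2 and 3 are arbitrary and chosen so that each output column is easy to recognize.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# One made-up row with two features, a = 2 and b = 3
row = np.array([[2.0, 3.0]])

poly = PolynomialFeatures(2)
expanded = poly.fit_transform(row)
# Columns are ordered: 1, a, b, a^2, a*b, b^2
print(expanded)  # [[1. 2. 3. 4. 6. 9.]]

# With six input columns, as in this exercise, degree 2 yields 28 columns
six_columns = np.zeros((1, 6))
n_expanded = PolynomialFeatures(2).fit_transform(six_columns).shape[1]
print(n_expanded)  # 28
```

The jump from 6 to 28 columns is why the quadratic model has more capacity than a plain linear regression on the raw features.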

In [6]:
# create a simple Linear Regression model with pipeline
model = Pipeline(steps)

In this step, you create an instance of Pipeline and store it in a variable called model. Pipeline performs a series of transformations, which are specified in the steps list you defined in the previous cell. This works because the transformers (MinMaxScaler and PolynomialFeatures) implement fit() and transform(), as well as the combined fit_transform(), which Pipeline calls on every step except the last. The last step only needs fit(); you may recall from previous examples that models are trained using the fit() method that LinearRegression implements.

In [7]:
# train the model
model.fit(X_train, y_train)

Pipeline(steps=[('scaler', MinMaxScaler()), ('poly', PolynomialFeatures()),
                ('model', LinearRegression())])

First, X_train will be scaled. Next, additional features will be engineered. Finally, training will happen using the LinearRegression model.
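The sequence above can be written out by hand to show what the pipeline does on your behalf. This is a sketch on made-up data (the arrays below merely stand in for X_train, y_train, and X_val), not a description of Pipeline's internals.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, PolynomialFeatures

# Made-up data standing in for X_train, y_train and X_val
rng = np.random.default_rng(0)
X_train = rng.uniform(0, 10, size=(50, 3))
y_train = X_train[:, 0] ** 2 + X_train[:, 1] + rng.normal(0, 0.1, size=50)
X_val = rng.uniform(0, 10, size=(10, 3))

# Pipeline version: one fit() call runs all three steps in order
pipe = Pipeline([('scaler', MinMaxScaler()),
                 ('poly', PolynomialFeatures(2)),
                 ('model', LinearRegression())])
pipe.fit(X_train, y_train)
pipe_pred = pipe.predict(X_val)

# Manual version: scale, engineer features, then train
scaler = MinMaxScaler()
poly = PolynomialFeatures(2)
reg = LinearRegression()
X_scaled = scaler.fit_transform(X_train)
X_poly = poly.fit_transform(X_scaled)
reg.fit(X_poly, y_train)

# At prediction time the fitted transformers are reused with transform() only
manual_pred = reg.predict(poly.transform(scaler.transform(X_val)))

same = np.allclose(pipe_pred, manual_pred)
print(same)  # True
```

Note that the validation data is only ever passed through transform(), never fit_transform(), so the scaling learned from the training set is the one applied to new data.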

In [8]:
# use model to predict on validation dataset
y_pred = model.predict(X_val)

In [9]:
# compute MAE
mae = mean_absolute_error(y_val, y_pred)
print('MAE: {}'.format(mae))

MAE: 0.660552610083608


The loss that you compute at this step is called a validation loss because you make use of the validation dataset. This is different from a training loss that is computed using the training dataset. This distinction is important to note as you study other documentation or books, which might refer to both.
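The following sketch computes both losses side by side. It uses made-up regression data rather than the fish toxicity dataset, so the printed values are illustrative only.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Made-up regression data (not the fish toxicity dataset)
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(200, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(0, 0.3, size=200)

X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=0)
model = LinearRegression().fit(X_train, y_train)

# Training loss: error on the rows the model was fitted on
train_mae = mean_absolute_error(y_train, model.predict(X_train))

# Validation loss: error on rows the model has not seen
val_mae = mean_absolute_error(y_val, model.predict(X_val))

# MAE is simply the mean of the absolute residuals
manual_val_mae = np.mean(np.abs(y_val - model.predict(X_val)))

print('training MAE: {}'.format(train_mae))
print('validation MAE: {}'.format(val_mae))
```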

In [10]:
# compute R2 score
r2 = model.score(X_val, y_val)
print('R^2: {}'.format(r2))

R^2: 0.6284921344153387
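For a regressor, score() returns the coefficient of determination, R^2 = 1 - SS_res / SS_tot. The sketch below recomputes it by hand on made-up data to show where the number comes from; the variable names are illustrative.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Made-up data, used only to compare the three ways of getting R^2
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(100, 2))
y = 2.0 * X[:, 0] - X[:, 1] + rng.normal(0, 0.2, size=100)

model = LinearRegression().fit(X, y)
y_pred = model.predict(X)

# Residual sum of squares and total sum of squares
ss_res = np.sum((y - y_pred) ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
manual_r2 = 1 - ss_res / ss_tot

score_r2 = model.score(X, y)      # method on the estimator
metric_r2 = r2_score(y, y_pred)   # standalone metric function

print(manual_r2, score_r2, metric_r2)
```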


At this point, you should see a difference between the R2 score and the MAE of the first model and the second model (in the first model, the MAE and R2 scores were 0.781629 and 0.498688 respectively).

In this exercise, you engineered new features, which gives the model a hypothesis of a higher polynomial degree. Such a model should perform better than simpler models up to a certain point. After engineering the features and training the new model, you computed the R2 score and MAE, which you can use to compare this model with the model you trained previously. On the validation set, this model is better: it has a higher R2 score (about 0.628 versus 0.499) and a lower MAE (about 0.661 versus 0.782).
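The comparison between the two models can be sketched as a small loop. The data below is made up, with a deliberately quadratic target so that the benefit of degree-2 features is visible; the names candidates and results are illustrative and do not come from the exercise.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, PolynomialFeatures

# Made-up data with a deliberately quadratic target
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(300, 2))
y = X[:, 0] ** 2 + X[:, 0] * X[:, 1] + rng.normal(0, 0.05, size=300)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Two candidate models: plain linear, and linear on degree-2 features
candidates = {
    'linear': Pipeline([('scaler', MinMaxScaler()),
                        ('model', LinearRegression())]),
    'quadratic': Pipeline([('scaler', MinMaxScaler()),
                           ('poly', PolynomialFeatures(2)),
                           ('model', LinearRegression())]),
}

# Fit each candidate and record its validation MAE and R^2
results = {}
for name, pipe in candidates.items():
    pipe.fit(X_train, y_train)
    results[name] = {
        'mae': mean_absolute_error(y_val, pipe.predict(X_val)),
        'r2': pipe.score(X_val, y_val),
    }
    print('{}: MAE={:.3f}, R^2={:.3f}'.format(
        name, results[name]['mae'], results[name]['r2']))
```

Both metrics are computed on the same validation split, which is what makes the two models directly comparable.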