In [0]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

In [0]:
# column headers
_headers = ['CIC0', 'SM1', 'GATS1i', 'NdsCH', 'Ndssc', 'MLOGP', 'response']

In [0]:
# read in data
df = pd.read_csv('https://raw.githubusercontent.com/PacktWorkshops/The-Data-Science-Workshop/master/Chapter06/Dataset/qsar_fish_toxicity.csv', names=_headers, sep=';')

In [7]:
df.head()

Unnamed: 0,CIC0,SM1,GATS1i,NdsCH,Ndssc,MLOGP,response
0,3.26,0.829,1.676,0,1,1.453,3.77
1,2.189,0.58,0.863,0,0,1.348,3.115
2,2.125,0.638,0.831,0,0,1.348,3.531
3,3.027,0.331,1.472,1,0,1.807,3.51
4,2.094,0.827,0.86,0,0,1.886,5.39


In [0]:
# Selecting features from dataset
features = df.drop('response', axis=1).values

In [0]:
# Selecting Labels from dataset
labels = df[['response']].values

In [0]:
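# hold out 20% of the data for evaluation; the remaining 80% is the training set
X_train, X_eval, y_train, y_eval = train_test_split(features, labels, test_size=0.2, random_state=0)
# split the held-out data into validation and test sets (default test_size is 0.25)
X_val, X_test, y_val, y_test = train_test_split(X_eval, y_eval, random_state=0)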
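
To get a feel for the proportions this two-stage split produces, here is a minimal sketch on a hypothetical array of 100 rows (the toy array and the variable names are assumptions for illustration, not part of the exercise data). With the default test_size of 0.25 in the second call, the data should end up split roughly 80/15/5.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 100 rows, 3 feature columns, 1 label column
X_toy = np.arange(300, dtype=float).reshape(100, 3)
y_toy = np.arange(100, dtype=float).reshape(-1, 1)

# First split: 80% train, 20% held out
X_tr, X_ev, y_tr, y_ev = train_test_split(X_toy, y_toy, test_size=0.2, random_state=0)
# Second split: held-out rows into validation (75%) and test (25%)
X_va, X_te, y_va, y_te = train_test_split(X_ev, y_ev, random_state=0)

print(len(X_tr), len(X_va), len(X_te))
```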

In [0]:
# create an instance of the LinearRegression model
model = LinearRegression()

In [13]:
# train the model on the training set
model.fit(X_train, y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

In [0]:
# predict on the validation set
y_pred = model.predict(X_val)

In [15]:
r2 = model.score(X_val, y_val)
print('R^2 score: {}'.format(r2))

R^2 score: 0.5623861754188693


In [19]:
# put actual and predicted values side by side for comparison
_ys = pd.DataFrame(dict(actuals=y_val.reshape(-1), predicted=y_pred.reshape(-1)))
_ys.head()

Unnamed: 0,actuals,predicted
0,3.742,4.155885
1,6.143,6.398238
2,4.674,5.183181
3,4.865,3.771333
4,4.732,4.593059


**Computing the MAE of a Model**

In [21]:
# Let's compute our MEAN ABSOLUTE ERROR
from sklearn.metrics import mean_absolute_error
mae = mean_absolute_error(y_val, y_pred)
print('MAE: {}'.format(mae))

MAE: 0.7243440846447939


In [22]:
# Let's get the R2 score
r2 = model.score(X_val, y_val)
print('R^2 score: {}'.format(r2))

R^2 score: 0.5623861754188693


The R2 score is the coefficient of determination: the proportion of the variance in the target that the model explains. It is at most 1, and a higher R2 score indicates a better fit.
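
As a rough illustration of how the coefficient of determination is computed, the sketch below works it out by hand as 1 minus the ratio of the residual sum of squares to the total sum of squares, and compares the result with scikit-learn's r2_score. The toy values are hypothetical and are not taken from the fish toxicity data.

```python
import numpy as np
from sklearn.metrics import r2_score

# Hypothetical toy values, for illustration only
y_true = np.array([3.0, 5.0, 4.0, 6.0])
y_hat = np.array([2.5, 5.5, 4.0, 5.0])

# Residual sum of squares: squared prediction errors
ss_res = np.sum((y_true - y_hat) ** 2)
# Total sum of squares: squared deviations from the mean of the actual values
ss_tot = np.sum((y_true - y_true.mean()) ** 2)

r2_manual = 1 - ss_res / ss_tot
r2_sklearn = r2_score(y_true, y_hat)
print(r2_manual, r2_sklearn)
```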

In this exercise, you calculated the MAE, an important metric for evaluating regression models. It is the average absolute difference between the actual and predicted values, so a lower MAE is better.
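
The MAE itself is simple enough to compute by hand. The following sketch, which uses hypothetical toy values rather than the exercise data, averages the absolute errors with NumPy and checks the result against mean_absolute_error.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# Hypothetical toy values, for illustration only
y_true = np.array([3.0, 5.0, 4.0, 6.0])
y_hat = np.array([2.5, 5.5, 4.0, 5.0])

# Mean of the absolute differences between actual and predicted values
mae_manual = np.mean(np.abs(y_true - y_hat))
mae_sklearn = mean_absolute_error(y_true, y_hat)
print(mae_manual, mae_sklearn)
```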

You will now train a second model and compare its R2 score and MAE with those of the first model to see which performs better.

In [0]:
# pipeline
from sklearn.pipeline import Pipeline
# preprocessing
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import PolynomialFeatures

In [0]:
# create a pipeline and engineer quadratic features
steps = [
    ('scaler', MinMaxScaler()),
    ('poly', PolynomialFeatures(2)),
    ('model', LinearRegression())
]

In the step above, you begin by creating a Python list called steps. The list contains three tuples, each one representing either a transformation or a model. The first tuple represents a scaling operation.

The first item in each tuple is the name of the step; here it is scaler. The second item is the object that does the work, in this case MinMaxScaler, which rescales each feature to the range 0 to 1.

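The second step, called poly, creates additional features by raising the columns to powers and multiplying them together, up to the degree that you specify. In this case, you specify 2, so it adds the square of each column and the product of each pair of columns, along with a constant bias column.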

Third comes your LinearRegression model.
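
To make the two transformations concrete, here is a small sketch that applies each one to a hypothetical toy array (the arrays and variable names are assumptions for illustration). MinMaxScaler should map each column onto the range 0 to 1, and PolynomialFeatures(2) should turn a row [a, b] into [1, a, b, a^2, a*b, b^2].

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, PolynomialFeatures

# Hypothetical toy data with two columns on different scales
X_toy = np.array([[1.0, 10.0],
                  [3.0, 20.0],
                  [5.0, 40.0]])

# Each column is rescaled so its minimum becomes 0 and its maximum becomes 1
X_scaled = MinMaxScaler().fit_transform(X_toy)
print(X_scaled)

# A single row [2, 3] expands to [1, 2, 3, 4, 6, 9]
X_poly = PolynomialFeatures(2).fit_transform(np.array([[2.0, 3.0]]))
print(X_poly)
```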

In [0]:
# create a simple Linear Regression model with a pipeline
model = Pipeline(steps)

In this step, you create an instance of Pipeline and store it in a variable called model. Pipeline chains the operations in the steps list you defined earlier and applies them in order.

This works because the transformers (MinMaxScaler and PolynomialFeatures) implement the fit() and transform() methods, as well as fit_transform(), which combines the two. The final step only needs to implement fit(). You may recall from previous examples that models such as LinearRegression are trained using the fit() method.
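
One quick way to see which methods each step offers is to inspect the objects directly. The sketch below is only an illustration (the methods dictionary is a hypothetical helper): it records whether each estimator has fit(), transform(), and predict(). The two transformers should report fit and transform but not predict, while LinearRegression should report fit and predict but not transform.

```python
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import MinMaxScaler, PolynomialFeatures

# Record which of the three methods each estimator provides
methods = {}
for est in (MinMaxScaler(), PolynomialFeatures(2), LinearRegression()):
    name = type(est).__name__
    methods[name] = {m: hasattr(est, m) for m in ('fit', 'transform', 'predict')}
    print(name, methods[name])
```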

In [26]:
# train the model
model.fit(X_train, y_train)

Pipeline(memory=None,
         steps=[('scaler', MinMaxScaler(copy=True, feature_range=(0, 1))),
                ('poly',
                 PolynomialFeatures(degree=2, include_bias=True,
                                    interaction_only=False, order='C')),
                ('model',
                 LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,
                                  normalize=False))],
         verbose=False)

In this step, you call the fit method and provide X_train and y_train as parameters. Because the model is a pipeline, three operations happen. First, X_train is scaled. Next, additional features are engineered. Finally, the LinearRegression model is trained on the engineered features.
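
The sequence of operations can be sketched by running the same steps by hand and checking that the predictions match those of the pipeline. The synthetic data and variable names below are assumptions made for illustration; this is a simplified picture of what fitting a pipeline amounts to, not a description of its internals.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, PolynomialFeatures

# Hypothetical synthetic data with a quadratic relationship plus noise
rng = np.random.RandomState(0)
X_toy = rng.uniform(0, 10, size=(50, 2))
y_toy = 1.5 * X_toy[:, 0] ** 2 - X_toy[:, 1] + rng.normal(scale=0.1, size=50)
X_new = rng.uniform(0, 10, size=(5, 2))

# The pipeline version
pipe = Pipeline([
    ('scaler', MinMaxScaler()),
    ('poly', PolynomialFeatures(2)),
    ('model', LinearRegression()),
])
pipe.fit(X_toy, y_toy)

# The same three operations done by hand
scaler = MinMaxScaler()
poly = PolynomialFeatures(2)
reg = LinearRegression()
X_scaled = scaler.fit_transform(X_toy)   # 1. scale the features
X_poly = poly.fit_transform(X_scaled)    # 2. engineer quadratic features
reg.fit(X_poly, y_toy)                   # 3. train the regression model

# At prediction time, only transform() is applied before predict()
manual_pred = reg.predict(poly.transform(scaler.transform(X_new)))
same = np.allclose(pipe.predict(X_new), manual_pred)
print(same)
```

Note that new data is passed through transform() rather than fit_transform(), so the scaling learned from the training data is reused rather than re-estimated.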

In [0]:
# let's use our model to predict on our validation dataset
y_pred = model.predict(X_val)

In [28]:
# Let's compute our MEAN ABSOLUTE ERROR
mae = mean_absolute_error(y_val, y_pred)
print('MAE: {}'.format(mae))

MAE: 0.6605526100836078


In [29]:
# Let's get the R2 score
r2 = model.score(X_val, y_val)
print('R^2 score: {}'.format(r2))

R^2 score: 0.6284921344153389


In this exercise, you engineered new features that give you a model with a hypothesis of a higher polynomial degree. Such a model should perform better than simpler models up to a certain point, beyond which the added complexity tends to cause overfitting. After engineering and training the new model, you computed the R2 score and MAE, which you can use to compare this model with the model you trained previously. On the validation set, this model is the better of the two: it has a higher R2 score (about 0.63 versus 0.56) and a lower MAE (about 0.66 versus 0.72).
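
To explore how performance changes with polynomial degree, you could repeat the comparison for several degrees. The sketch below does this on hypothetical synthetic data with a known quadratic relationship (the data, the fit_and_score helper, and the chosen degrees are all assumptions for illustration, so the numbers will not match those of the fish toxicity exercise).

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, PolynomialFeatures

# Hypothetical synthetic data: the target depends on squares and products
rng = np.random.RandomState(0)
X_syn = rng.uniform(-1, 1, size=(200, 3))
y_syn = (2 + 3 * X_syn[:, 0] ** 2 - 2 * X_syn[:, 0] * X_syn[:, 1]
         + rng.normal(scale=0.1, size=200))
X_tr, X_va, y_tr, y_va = train_test_split(X_syn, y_syn, test_size=0.2, random_state=0)

def fit_and_score(degree):
    """Fit a scaled polynomial regression and return its train and validation scores."""
    pipe = Pipeline([
        ('scaler', MinMaxScaler()),
        ('poly', PolynomialFeatures(degree)),
        ('model', LinearRegression()),
    ])
    pipe.fit(X_tr, y_tr)
    return {
        'degree': degree,
        'train_r2': pipe.score(X_tr, y_tr),
        'val_r2': pipe.score(X_va, y_va),
        'val_mae': mean_absolute_error(y_va, pipe.predict(X_va)),
    }

results = [fit_and_score(d) for d in (1, 2, 3)]
for row in results:
    print(row)
```

Comparing the validation scores rather than the training scores matters here, because the training R2 cannot get worse as features are added, whereas the validation score can.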