## Extracting the Linear Regression Coefficient

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
import altair as alt

In [2]:
file_url = 'https://raw.githubusercontent.com/PacktWorkshops/The-Data-Science-Workshop/master/Chapter09/Dataset/phpYYZ4Qc.csv'

In [3]:
df = pd.read_csv(file_url)
df.head()

|  | a1cx | a1cy | a1sx | a1sy | a1rho | a1pop | a2cx | a2cy | a2sx | a2sy | ... | b2x | b2y | b2call | b2eff | b3x | b3y | b3call | b3eff | mxql | rej |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.41301 | 0.607442 | 0.332608 | 0.406812 | -0.151224 | 1.525222 | -0.144368 | 0.852368 | 0.412397 | 1.728169 | ... | -0.776759 | -0.78377 | 8.0 | 0.603486 | -0.997118 | -0.502138 | 5.0 | 1.169388 | 9.0 | 0.049118 |
| 1 | -0.602384 | 0.350618 | 0.429196 | 0.414476 | -0.124489 | 4.597991 | 0.579458 | 0.651134 | 0.104394 | 0.636356 | ... | -0.00282 | -0.080542 | 2.0 | 1.125542 | -0.983397 | -0.107632 | 5.0 | 1.186039 | 7.0 | 0.242579 |
| 2 | -0.322881 | -0.538491 | 1.60226 | 0.039605 | 0.196023 | 1.909005 | -0.675672 | 0.963618 | 0.147458 | 1.414008 | ... | -0.952645 | -0.5716 | 5.0 | 1.280392 | 0.771129 | -0.665756 | 5.0 | 1.024203 | 6.0 | 0.0 |
| 3 | -0.23357 | -0.936451 | 1.710192 | 2.179527 | 0.438461 | 4.742055 | -0.163625 | -0.923273 | 0.597622 | 0.118409 | ... | -0.198235 | -0.205276 | 2.0 | 0.509727 | -0.579544 | 0.480094 | 6.0 | 1.568492 | 7.0 | 0.469045 |
| 4 | 0.403126 | 0.313367 | 0.822382 | 1.393975 | 0.253435 | 9.39863 | 0.312528 | 0.288321 | 0.431867 | 0.110369 | ... | 0.573352 | 0.315217 | 2.0 | 0.622033 | -0.134747 | 0.669948 | 3.0 | 1.295913 | 9.0 | 0.0 |


In [4]:
y = df.pop('rej')

In [5]:
df.describe()

|  | a1cx | a1cy | a1sx | a1sy | a1rho | a1pop | a2cx | a2cy | a2sx | a2sy | ... | b1eff | b2x | b2y | b2call | b2eff | b3x | b3y | b3call | b3eff | mxql |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 8192.0 | 8192.0 | 8192.0 | 8192.0 | 8192.0 | 8192.0 | 8192.0 | 8192.0 | 8192.0 | 8192.0 | ... | 8192.0 | 8192.0 | 8192.0 | 8192.0 | 8192.0 | 8192.0 | 8192.0 | 8192.0 | 8192.0 | 8192.0 |
| mean | -0.00323 | -0.000939 | 1.011966 | 1.004934 | -0.001786 | 3.06852 | -0.00024 | -0.0011 | 0.996411 | 0.993268 | ... | 1.248389 | -0.016438 | 0.001713 | 3.994629 | 1.247177 | 0.000752 | 0.001569 | 4.010864 | 1.242247 | 7.00293 |
| std | 0.579124 | 0.577033 | 1.027579 | 1.001369 | 0.288346 | 3.068512 | 0.575362 | 0.577482 | 0.978954 | 1.009709 | ... | 0.43208 | 0.57901 | 0.5717 | 1.632885 | 0.430426 | 0.578329 | 0.579423 | 1.637225 | 0.431913 | 1.421186 |
| min | -0.999409 | -0.999864 | 0.000191 | 5.5e-05 | -0.499944 | 0.000162 | -0.999986 | -0.999717 | 0.000114 | 0.00043 | ... | 0.50011 | -0.999677 | -0.999622 | 2.0 | 0.500194 | -0.999998 | -0.999897 | 2.0 | 0.500334 | 5.0 |
| 25% | -0.499372 | -0.500793 | 0.293154 | 0.28342 | -0.253597 | 0.883641 | -0.490687 | -0.50707 | 0.286112 | 0.282905 | ... | 0.870518 | -0.516167 | -0.484386 | 3.0 | 0.874773 | -0.500204 | -0.502221 | 3.0 | 0.869434 | 6.0 |
| 50% | -0.013944 | 0.00139 | 0.692698 | 0.69024 | -0.001311 | 2.130011 | 0.000478 | 0.007756 | 0.701715 | 0.688298 | ... | 1.251238 | -0.01136 | -0.001674 | 4.0 | 1.247008 | 0.00471 | -0.004599 | 4.0 | 1.235297 | 7.0 |
| 75% | 0.506926 | 0.494132 | 1.394602 | 1.398254 | 0.24683 | 4.235844 | 0.502243 | 0.490494 | 1.392521 | 1.363553 | ... | 1.61904 | 0.483353 | 0.496305 | 5.0 | 1.615999 | 0.50349 | 0.510076 | 5.0 | 1.617383 | 8.0 |
| max | 0.999881 | 0.999828 | 10.30377 | 8.856675 | 0.499937 | 27.015529 | 0.999743 | 0.999511 | 7.927502 | 13.507064 | ... | 1.999976 | 0.999404 | 0.999759 | 8.0 | 1.999761 | 0.999774 | 0.999771 | 8.0 | 1.999838 | 9.0 |


In [6]:
# split data into training and test sets with test_size=0.3
X_train, X_test, y_train, y_test = train_test_split(df, y, test_size=0.3, random_state=1)

In [7]:
# instantiate StandardScaler
scaler = StandardScaler()

# fit the scaler on the training set and standardize it using fit_transform()
X_train = scaler.fit_transform(X_train) # each feature now has zero mean and unit variance

# standardize testing set with the statistics learned from the training set
X_test = scaler.transform(X_test) # features only; the target y is left unscaled
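
`StandardScaler` rescales each feature to zero mean and unit variance; it does not squeeze values into a fixed range such as -1 to 1. The following minimal sketch illustrates this on synthetic data. The skewed array `X_demo` and the other `*_demo` names are assumptions made for illustration and are not part of the exercise.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic, right-skewed feature (hypothetical data, not the bank dataset)
rng = np.random.default_rng(0)
X_demo = rng.exponential(scale=3.0, size=(1000, 1))

scaler_demo = StandardScaler()
X_scaled = scaler_demo.fit_transform(X_demo)

# After standardization the mean is ~0 and the standard deviation is ~1
print(X_scaled.mean(axis=0))
print(X_scaled.std(axis=0))

# The values themselves are not bounded: skewed features keep large outliers
print(X_scaled.min(), X_scaled.max())
```

If a bounded range is actually required, a different transformer would be needed; `StandardScaler` only centers and rescales.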

In [8]:
# instantiate LinearRegression
lr_model = LinearRegression()

# train model on training set
lr_model.fit(X_train, y_train)

LinearRegression()

In [9]:
# predict the outcomes of training and testing sets using .predict()
preds_train = lr_model.predict(X_train)
preds_test = lr_model.predict(X_test)

# calculate MSE on training set
train_mse = mean_squared_error(y_train, preds_train)
print(train_mse)

0.007062801218218886


In [10]:
# calculate MSE on testing set
test_mse = mean_squared_error(y_test, preds_test)
print(test_mse)

0.006571073840504757


The MSE on the testing set (about 0.0066) is also low and very close to the one on the training set (about 0.0071), so there is no sign that the model is overfitting.
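
One way to make this comparison explicit is to look at the ratio of the two errors. The sketch below is only an illustration: the helper `mse` is a hypothetical re-implementation of what `mean_squared_error` computes, and the two scores are copied from the outputs above. There is no fixed threshold for this ratio; a value close to 1 is merely a rough indication that the model generalizes about as well as it fits.

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean of the squared differences between targets and predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean((y_true - y_pred) ** 2))

# Scores copied from the training and testing cells above
train_mse = 0.007062801218218886
test_mse = 0.006571073840504757

# A ratio far above 1 would suggest overfitting; here it is slightly below 1
ratio = test_mse / train_mse
print(round(ratio, 3))
```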

In [11]:
# print coefficients of linear regression model
lr_model.coef_

array([-4.94784409e-04, -9.33729668e-04, -2.81877324e-03, -3.29306515e-03,
        5.93297658e-05,  4.02235021e-02,  1.98044098e-03, -3.39452179e-04,
       -5.44949336e-03, -4.82500415e-03,  8.22897305e-05,  6.24226133e-02,
       -2.66390121e-04,  5.69511279e-04, -2.30657474e-03, -2.51428876e-03,
       -1.31637908e-03,  3.96978126e-02, -1.05127088e-02,  3.02116023e-04,
       -3.89728413e-04,  3.57359816e-03, -1.43282556e-02, -1.13460794e-03,
       -1.02515456e-03,  5.50201521e-03, -4.02044201e-03, -1.15164578e-03,
        2.15908520e-04,  5.59779791e-03, -2.42464365e-03, -1.75240320e-02])
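
Each value in `coef_` is the weight the model applies to one (standardized) feature: a prediction is the intercept plus the weighted sum of the features. The sketch below checks this on synthetic data; the arrays `X_demo`, `y_demo` and the model `model_demo` are assumptions for illustration, not objects from this exercise.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic regression problem with three features (hypothetical data)
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(200, 3))
true_weights = np.array([2.0, -1.0, 0.0])
y_demo = 0.5 + X_demo @ true_weights + rng.normal(scale=0.1, size=200)

model_demo = LinearRegression().fit(X_demo, y_demo)

# Rebuild the predictions by hand from the learned parameters
manual_preds = model_demo.intercept_ + X_demo @ model_demo.coef_

# The manual computation matches .predict()
print(np.allclose(manual_preds, model_demo.predict(X_demo)))  # True
```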

In [12]:
coef_df = pd.DataFrame()

In [13]:
coef_df['feature'] = df.columns
coef_df.head()

|  | feature |
|---|---|
| 0 | a1cx |
| 1 | a1cy |
| 2 | a1sx |
| 3 | a1sy |
| 4 | a1rho |


In [14]:
coef_df['coefficient'] = lr_model.coef_
coef_df.head()

|  | feature | coefficient |
|---|---|---|
| 0 | a1cx | -0.000495 |
| 1 | a1cy | -0.000934 |
| 2 | a1sx | -0.002819 |
| 3 | a1sy | -0.003293 |
| 4 | a1rho | 5.9e-05 |


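From this output, we can see that a1sx and a1sy have the most negative coefficients of the five features shown (-0.002819 and -0.003293). Because the features were standardized, coefficient magnitudes are comparable, so these two contribute more to the prediction than the other three features shown here.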
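
Since contribution depends on the magnitude of a coefficient rather than its sign, ranking by absolute value makes this comparison easier to read. The sketch below rebuilds the five rows shown above in a separate DataFrame; the names `coef_head` and `abs_coefficient` are assumptions introduced for illustration.

```python
import pandas as pd

# The five rows displayed by coef_df.head() above
coef_head = pd.DataFrame({
    'feature': ['a1cx', 'a1cy', 'a1sx', 'a1sy', 'a1rho'],
    'coefficient': [-0.000495, -0.000934, -0.002819, -0.003293, 0.000059],
})

# Rank features by the magnitude of their coefficient, largest first
coef_head['abs_coefficient'] = coef_head['coefficient'].abs()
ranked = coef_head.sort_values('abs_coefficient', ascending=False)

print(ranked['feature'].tolist())
# ['a1sy', 'a1sx', 'a1cy', 'a1cx', 'a1rho']
```

The same two lines could be applied to the full `coef_df` to rank all features.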

In [15]:
alt.Chart(coef_df).mark_bar().encode(x='coefficient', y='feature')

The three features with the largest positive coefficients push the predicted value of the target `rej` up: as the population grows in any of the three areas (for example `a1pop`), the predicted chance of customers leaving increases. Conversely, the three features with the most negative coefficients pull the prediction down: if the maximum length (`mxql`), the bank-1 efficiency level (`b1eff`), or the temperature increases, the predicted risk of customers leaving decreases.
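
The features with the strongest positive and negative influence can also be located directly from the coefficient array, without reading them off the chart. The sketch below copies the values printed by `lr_model.coef_` above and reports column positions only; the variable names are assumptions for illustration. In the column listings shown earlier, positions 5, 22 and 31 correspond to `a1pop`, `b1eff` and `mxql`.

```python
import numpy as np

# Coefficients copied from the lr_model.coef_ output above
coefs = np.array([
    -4.94784409e-04, -9.33729668e-04, -2.81877324e-03, -3.29306515e-03,
     5.93297658e-05,  4.02235021e-02,  1.98044098e-03, -3.39452179e-04,
    -5.44949336e-03, -4.82500415e-03,  8.22897305e-05,  6.24226133e-02,
    -2.66390121e-04,  5.69511279e-04, -2.30657474e-03, -2.51428876e-03,
    -1.31637908e-03,  3.96978126e-02, -1.05127088e-02,  3.02116023e-04,
    -3.89728413e-04,  3.57359816e-03, -1.43282556e-02, -1.13460794e-03,
    -1.02515456e-03,  5.50201521e-03, -4.02044201e-03, -1.15164578e-03,
     2.15908520e-04,  5.59779791e-03, -2.42464365e-03, -1.75240320e-02,
])

# argsort orders positions from the most negative to the most positive value
order = np.argsort(coefs)
most_negative = order[:3].tolist()
most_positive = order[::-1][:3].tolist()

print(most_positive)  # [11, 5, 17]
print(most_negative)  # [31, 22, 18]
```

For the chart itself, Altair accepts a sort on the y encoding, for example `y=alt.Y('feature', sort='-x')`, which should order the bars by coefficient instead of alphabetically.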

In this exercise, you learned how to extract the coefficients of a trained linear regression model and how to identify which variables make the biggest contribution to the prediction.