## Variable magnitude

Feature magnitude matters for several reasons. The **regression coefficient** of a variable is directly influenced by its scale. **Variables with bigger magnitude** dominate over the ones with smaller magnitude. **Gradient descent converges faster** when features are on similar scales. **Feature scaling** helps decrease the time needed to find **support vectors for SVMs**. **Euclidean distances** are sensitive to feature magnitude.

**Machine learning models sensitive to feature magnitude:** Linear and Logistic Regression, Neural Networks, Support Vector Machines, KNN, K-means clustering, Linear Discriminant Analysis (LDA), Principal Component Analysis (PCA).

**Machine learning models insensitive to feature magnitude:** the ones based on trees: Classification and Regression Trees, Random Forests, Gradient Boosted Trees.
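To make the point about Euclidean distances concrete, here is a minimal sketch with two made-up passengers (the values are hypothetical, not rows from the dataset). It shows how much each feature contributes to the squared distance when the features are left on their original scales.

```python
import numpy as np

# Two hypothetical passengers described by (pclass, age, fare).
a = np.array([1.0, 30.0, 200.0])
b = np.array([3.0, 32.0, 10.0])

# Squared difference per feature: the term each feature adds
# to the squared Euclidean distance.
contrib = (a - b) ** 2
share = contrib / contrib.sum()

print('distance:', np.sqrt(contrib.sum()))
print('share per feature (pclass, age, fare):', share.round(4))
```

With these numbers almost all of the distance comes from fare, even though the two passengers are in opposite classes.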

In [42]:
import pandas as pd
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import MinMaxScaler  # to scale the features
from sklearn.metrics import roc_auc_score  # to evaluate performance
from sklearn.model_selection import train_test_split

### Load data with numerical variables only

**Load numerical variables of the Titanic Dataset!..**

In [43]:
data = pd.read_csv('titanic.csv',
                   usecols=['pclass', 'age', 'fare', 'survived'])
data.head()

| | pclass | survived | age | fare |
|---|---|---|---|---|
| 0 | 1 | 1 | 29.0 | 211.3375 |
| 1 | 1 | 1 | 0.9167 | 151.55 |
| 2 | 1 | 0 | 2.0 | 151.55 |
| 3 | 1 | 0 | 30.0 | 151.55 |
| 4 | 1 | 0 | 25.0 | 151.55 |


In [44]:
data.describe()

| | pclass | survived | age | fare |
|---|---|---|---|---|
| count | 1309.0 | 1309.0 | 1046.0 | 1308.0 |
| mean | 2.294882 | 0.381971 | 29.881135 | 33.295479 |
| std | 0.837836 | 0.486055 | 14.4135 | 51.758668 |
| min | 1.0 | 0.0 | 0.1667 | 0.0 |
| 25% | 2.0 | 0.0 | 21.0 | 7.8958 |
| 50% | 3.0 | 0.0 | 28.0 | 14.4542 |
| 75% | 3.0 | 1.0 | 39.0 | 31.275 |
| max | 3.0 | 1.0 | 80.0 | 512.3292 |


We can see that **Fare** varies between **0 and 512**, **Age** between **0.17 and 80**, and **Class** between **1 and 3**.

In [45]:
for col in ['pclass', 'age', 'fare']:
    print(col, 'range: ', data[col].max() - data[col].min())

pclass range:  2
age range:  79.8333
fare range:  512.3292


These are the **ranges of values** of each feature, and they differ widely: from 2 for Class to about 512 for Fare.

In [46]:
X_train, X_test, y_train, y_test = train_test_split(
    data[['pclass', 'age', 'fare']].fillna(0),
    data.survived,
    test_size=0.3,
    random_state=0)
X_train.shape, X_test.shape

((916, 3), (393, 3))

### Feature Scaling

Scale the features between 0 and 1 using the **MinMaxScaler**. It transforms each feature by $X_{\text{rescaled}} = \dfrac{X - X_{\min}}{X_{\max} - X_{\min}}$ and transforms back by $X = X_{\text{rescaled}} \, (X_{\max} - X_{\min}) + X_{\min}$.
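The two formulas can be checked by hand with NumPy. This is a minimal sketch on a made-up array of fare-like values (the numbers are illustrative only); it applies the forward transform and then the inverse to confirm the round trip.

```python
import numpy as np

# Hypothetical fare-like values, used only for illustration.
x = np.array([0.0, 7.9, 14.5, 31.3, 512.3])

x_min, x_max = x.min(), x.max()

# Forward transform: (x - min) / (max - min)
x_scaled = (x - x_min) / (x_max - x_min)

# Inverse transform: x_scaled * (max - min) + min
x_back = x_scaled * (x_max - x_min) + x_min

print(x_scaled)
print(np.allclose(x_back, x))
```

`MinMaxScaler` offers the same round trip through its `transform` and `inverse_transform` methods, applied column by column.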

**Scale the features between 0 and 1. Re-scale the datasets!..**

In [47]:
scaler = MinMaxScaler()  # call the scaler
scaler.fit(X_train)  # fit the scaler on the train set only
X_train_scaled = scaler.transform(X_train)  # rescale the datasets
X_test_scaled = scaler.transform(X_test)  # reuse the train-set min and max

**Look at the scaled training dataset!..**

In [48]:
print('Mean: ', X_train_scaled.mean(axis=0))
print('Standard Deviation: ', X_train_scaled.std(axis=0))
print('Minimum value: ', X_train_scaled.min(axis=0))
print('Maximum value: ', X_train_scaled.max(axis=0))

Mean:  [0.64628821 0.33048359 0.06349833]
Standard Deviation:  [0.42105785 0.23332045 0.09250036]
Minimum value:  [0. 0. 0.]
Maximum value:  [1. 1. 1.]


Now the maximum value of every feature is 1 and the minimum value is 0, as expected. So the features are on a more similar scale.

### Logistic Regression

Let's evaluate the effect of feature scaling in a Logistic Regression.

**Build model on unscaled variables! Call the model !**

In [49]:
logit = LogisticRegression(
    random_state=44,
    C=1000,  # c big to avoid regularization
    solver='lbfgs')
logit.fit(X_train, y_train)  # train the model
print('Train set')  # evaluate performance
pred = logit.predict_proba(X_train)
print('Logistic Regression roc-auc: {}'.format(
    roc_auc_score(y_train, pred[:, 1])))
print('Test set')
pred = logit.predict_proba(X_test)
print('Logistic Regression roc-auc: {}'.format(
    roc_auc_score(y_test, pred[:, 1])))

Train set
Logistic Regression roc-auc: 0.6793181006244372
Test set
Logistic Regression roc-auc: 0.7175488081411426


**Check the coefficients**

In [50]:
logit.coef_

array([[-0.71428242, -0.00923013,  0.00425235]])

**Build model on scaled variables! call the model**

In [51]:
logit = LogisticRegression(   # call the model
    random_state=44,
    C=1000,  # c big to avoid regularization
    solver='lbfgs')
logit.fit(X_train_scaled, y_train)  # train the model using the re-scaled data
print('Train set')  # evaluate performance
pred = logit.predict_proba(X_train_scaled)
print('Logistic Regression roc-auc: {}'.format(
    roc_auc_score(y_train, pred[:, 1])))
print('Test set')
pred = logit.predict_proba(X_test_scaled)
print('Logistic Regression roc-auc: {}'.format(
    roc_auc_score(y_test, pred[:, 1])))

Train set
Logistic Regression roc-auc: 0.6793281640744896
Test set
Logistic Regression roc-auc: 0.7175488081411426


In [52]:
logit.coef_

array([[-1.42875872, -0.68293349,  2.17646757]])

We observe that the **performance of logistic regression** did not change when using the datasets with the **features scaled** (compare the roc-auc values for the train and test sets of the models with and without feature scaling). However, the coefficients differ greatly between the two models. The **magnitude of the variable** was affecting the coefficients. After scaling, **all 3 variables** have coefficients of comparable magnitude, whereas **before scaling** we would be inclined to think that **PClass was driving the Survival outcome**.
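The link between the two sets of coefficients can be checked with simple arithmetic. A one-unit change in a min-max scaled feature corresponds to `max - min` units of the original feature, so each unscaled coefficient multiplied by its feature range should land close to the scaled coefficient. The sketch below uses the coefficients printed above; the training-set ranges are an assumption (age is taken as 0 to 80 because missing ages were filled with 0, and the fare maximum is assumed to fall in the training split).

```python
import numpy as np

# Coefficients reported above for the unscaled and scaled models.
coef_unscaled = np.array([-0.71428242, -0.00923013, 0.00425235])
coef_scaled = np.array([-1.42875872, -0.68293349, 2.17646757])

# ASSUMPTION: training-set ranges (max - min) for pclass, age, fare.
feature_range = np.array([2.0, 80.0, 512.3292])

# Expected scaled coefficient = unscaled coefficient * feature range.
predicted = coef_unscaled * feature_range

for name, p, s in zip(['pclass', 'age', 'fare'], predicted, coef_scaled):
    print(f'{name}: predicted {p:.3f}, fitted {s:.3f}')
```

The values for pclass and fare agree closely. The gap for age is larger, which may reflect optimizer convergence on the unscaled data or a training range that differs from the assumed one.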

### Support Vector Machines

**Build model on unscaled variables!**

In [53]:
SVM_model = SVC(random_state=44, probability=True, gamma='auto')  # call the model
SVM_model.fit(X_train, y_train)  #  train the model
print('Train set')  # evaluate performance
pred = SVM_model.predict_proba(X_train)
print('SVM roc-auc: {}'.format(roc_auc_score(y_train, pred[:, 1])))
print('Test set')
pred = SVM_model.predict_proba(X_test)
print('SVM roc-auc: {}'.format(roc_auc_score(y_test, pred[:, 1])))

Train set
SVM roc-auc: 0.882393490960506
Test set
SVM roc-auc: 0.6617581992146452


**Build model on scaled variables! Call model**

In [54]:
SVM_model = SVC(random_state=44, probability=True, gamma='auto')
SVM_model.fit(X_train_scaled, y_train)  # train the model
print('Train set')   # evaluate performance
pred = SVM_model.predict_proba(X_train_scaled)
print('SVM roc-auc: {}'.format(roc_auc_score(y_train, pred[:, 1])))
print('Test set')
pred = SVM_model.predict_proba(X_test_scaled)
print('SVM roc-auc: {}'.format(roc_auc_score(y_test, pred[:, 1])))

Train set
SVM roc-auc: 0.6780802962679695
Test set
SVM roc-auc: 0.6841435761296388


**Feature scaling improved the performance of the support vector machine.** After feature scaling the model is no longer over-fitting to the training set (compare the train roc-auc of 0.88 for the model on unscaled features vs 0.68 for the model on scaled features). In addition, the roc-auc for the testing set increased as well (0.66 vs 0.68).
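Because the SVM depends on the scaler, one convenient option is to bundle both steps so that the scaler is always fitted on the training data only. The sketch below uses scikit-learn's `make_pipeline` on synthetic stand-in data (not the Titanic set); the data, the split and the variable names are assumptions made for illustration.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

# Synthetic stand-in data: two features on very different scales,
# with the label driven by the small-scale feature.
rng = np.random.default_rng(0)
X = np.column_stack([rng.uniform(0, 1, 200), rng.uniform(0, 500, 200)])
y = (X[:, 0] > 0.5).astype(int)

X_tr, X_te = X[:150], X[150:]
y_tr, y_te = y[:150], y[150:]

# The pipeline fits the scaler on the training data and applies the
# same transformation before the SVM at prediction time.
model = make_pipeline(MinMaxScaler(), SVC(gamma='auto'))
model.fit(X_tr, y_tr)

print('test accuracy:', model.score(X_te, y_te))
```

This is equivalent in spirit to the manual `fit` / `transform` steps used in this notebook, just harder to get wrong when cross-validating.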

### K-Nearest Neighbours

**Build model on unscaled variables! Call model**

In [55]:
KNN = KNeighborsClassifier(n_neighbors=5)
KNN.fit(X_train, y_train)  # train the model
print('Train set')  # evaluate performance
pred = KNN.predict_proba(X_train)
print('KNN roc-auc: {}'.format(roc_auc_score(y_train, pred[:,1])))
print('Test set')
pred = KNN.predict_proba(X_test)
print('KNN roc-auc: {}'.format(roc_auc_score(y_test, pred[:,1])))

Train set
KNN roc-auc: 0.8131141849360215
Test set
KNN roc-auc: 0.6947901111664178


**Build model on scaled variables! Call model**

In [56]:
KNN = KNeighborsClassifier(n_neighbors=5)
KNN.fit(X_train_scaled, y_train)  # train the model
print('Train set')  # evaluate performance
pred = KNN.predict_proba(X_train_scaled)
print('KNN roc-auc: {}'.format(roc_auc_score(y_train, pred[:,1])))
print('Test set')
pred = KNN.predict_proba(X_test_scaled)
print('KNN roc-auc: {}'.format(roc_auc_score(y_test, pred[:,1])))

Train set
KNN roc-auc: 0.826928785995703
Test set
KNN roc-auc: 0.7232453957192633


We observe for KNN as well that feature scaling improved the performance of the model. **The model built on scaled features shows better generalisation**, with a higher roc-auc for the testing set (0.72 vs 0.69 for the model built on unscaled features). **Both KNN models are over-fitting to the train set**. Thus, we would need to change the parameters of the model or use fewer features to try to decrease over-fitting, which is beyond the scope of this demonstration.

### Random Forests

**Build model on non-scaled features! Call the model!**

In [57]:
rf = RandomForestClassifier(n_estimators=200, random_state=39)
rf.fit(X_train, y_train)  # train the model
print('Train set')  # evaluate performance
pred = rf.predict_proba(X_train)
print('Random Forests roc-auc: {}'.format(roc_auc_score(y_train, pred[:, 1])))
print('Test set')
pred = rf.predict_proba(X_test)
print('Random Forests roc-auc: {}'.format(roc_auc_score(y_test, pred[:, 1])))

Train set
Random Forests roc-auc: 0.9866810238554083
Test set
Random Forests roc-auc: 0.7326751838946961


**Build model on scaled features! Call the model!**

In [58]:
rf = RandomForestClassifier(n_estimators=200, random_state=39)
rf.fit(X_train_scaled, y_train)  # train the model
print('Train set')  # evaluate performance
pred = rf.predict_proba(X_train_scaled)
print('Random Forests roc-auc: {}'.format(roc_auc_score(y_train, pred[:,1])))
print('Test set')
pred = rf.predict_proba(X_test_scaled)
print('Random Forests roc-auc: {}'.format(roc_auc_score(y_test, pred[:,1])))

Train set
Random Forests roc-auc: 0.9867917218059866
Test set
Random Forests roc-auc: 0.7312510370001659


As expected, **Random Forests** shows **virtually no change** in performance regardless of whether it is trained on a dataset with **scaled or unscaled features**. This model in particular is over-fitting to the training set, so some work would be needed to reduce the over-fitting. That is beyond the scope of this demonstration.
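The reason tree-based models are insensitive to scaling is that a split only depends on the ordering of a feature's values, and min-max scaling preserves that ordering. The sketch below illustrates this with a single decision tree on synthetic integer-valued data (illustrative only, not the Titanic set): the same tree settings are fitted on raw and on scaled features, and the training predictions are compared.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic features on different scales (hypothetical age and fare).
rng = np.random.default_rng(0)
age = rng.integers(1, 80, size=300)
fare = rng.integers(0, 500, size=300)
X = np.column_stack([age, fare]).astype(float)
y = ((fare > 100) & (age < 40)).astype(int)

X_scaled = MinMaxScaler().fit_transform(X)

tree_raw = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X, y)
tree_scaled = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X_scaled, y)

# Scaling keeps the order of the values within each feature, so both
# trees can partition the training rows in the same way.
same = np.array_equal(tree_raw.predict(X), tree_scaled.predict(X_scaled))
print('identical training predictions:', same)
```

In a random forest, small differences such as the ones seen above can still appear through floating-point effects on split thresholds, which is consistent with the near-identical roc-auc values.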

### AdaBoost

**Train AdaBoost on non-scaled features! Call the model!**

In [59]:
ada = AdaBoostClassifier(n_estimators=200, random_state=44)
ada.fit(X_train, y_train)  # train the model
print('Train set')  # evaluate model performance
pred = ada.predict_proba(X_train)
print('AdaBoost roc-auc: {}'.format(roc_auc_score(y_train, pred[:,1])))
print('Test set')
pred = ada.predict_proba(X_test)
print('AdaBoost roc-auc: {}'.format(roc_auc_score(y_test, pred[:,1])))

Train set
AdaBoost roc-auc: 0.7970629821021541
Test set
AdaBoost roc-auc: 0.7473867595818815


**Train adaboost on scaled features! Call the model!**

In [60]:
ada = AdaBoostClassifier(n_estimators=200, random_state=44)
ada.fit(X_train_scaled, y_train)  # train the model
print('Train set')  # evaluate model performance
pred = ada.predict_proba(X_train_scaled)
print('AdaBoost roc-auc: {}'.format(roc_auc_score(y_train, pred[:,1])))
print('Test set')
pred = ada.predict_proba(X_test_scaled)
print('AdaBoost roc-auc: {}'.format(roc_auc_score(y_test, pred[:,1])))

Train set
AdaBoost roc-auc: 0.7970629821021541
Test set
AdaBoost roc-auc: 0.7475250262706707


As expected, **AdaBoost** shows **no change** in performance regardless of whether it is trained on a dataset with **scaled or unscaled features**.