#### Variable Magnitude

In Linear Regression models, the scale of variables used to estimate the output matters. Linear models are of the type y = w x + b, where the regression coefficient w represents the expected change in y for a one unit change in x (the predictor). Thus, the magnitude of w is partly determined by the magnitude of the units being used for x. If x is a distance variable, just changing the scale from kilometers to miles will cause a change in the magnitude of the coefficient.
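The kilometers-versus-miles point can be checked with a small sketch. The data below is synthetic and the names `distance_km` and `time_min` are made up for illustration; the only claim is that refitting the same relationship in different units rescales the coefficient by the unit conversion factor.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Hypothetical data: trip distance in kilometers and a noisy travel time in minutes
distance_km = rng.uniform(1, 100, size=200)
time_min = 1.5 * distance_km + rng.normal(0, 5, size=200)

# The same distances expressed in miles (1 mile = 1.609344 km)
KM_PER_MILE = 1.609344
distance_mi = distance_km / KM_PER_MILE

# Fit the same target on each version of the predictor
coef_km = LinearRegression().fit(distance_km.reshape(-1, 1), time_min).coef_[0]
coef_mi = LinearRegression().fit(distance_mi.reshape(-1, 1), time_min).coef_[0]

print(f"coefficient per km:   {coef_km:.4f}")
print(f"coefficient per mile: {coef_mi:.4f}")
print(f"ratio (mile / km):    {coef_mi / coef_km:.4f}")
```

The predictions of the two models are identical; only the size of the coefficient changes, by the factor `KM_PER_MILE`.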

In addition, when we estimate the outcome y from multiple predictors x1, x2, ..., xn, predictors with larger numeric ranges can dominate those with smaller numeric ranges.

Gradient descent converges faster when all the predictors (x1 to xn) are on a similar scale, so feature scaling is useful for Neural Networks as well.
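As a rough illustration of the convergence claim, the sketch below runs plain batch gradient descent on a least-squares problem, once on raw features and once on min-max scaled features. Everything here is an assumption made for the demo: the synthetic data, the helper `gd_iterations`, the step size of 1 / (largest eigenvalue of the Hessian), and the stopping tolerance.

```python
import numpy as np

def gd_iterations(X, y, tol=1e-6, max_iter=200_000):
    """Batch gradient descent on mean squared error; returns the iterations used."""
    n = len(y)
    hessian = X.T @ X / n
    # Step size 1/L, where L is the largest Hessian eigenvalue (safe for a quadratic loss)
    lr = 1.0 / np.linalg.eigvalsh(hessian).max()
    w = np.zeros(X.shape[1])
    for i in range(1, max_iter + 1):
        grad = X.T @ (X @ w - y) / n
        if np.linalg.norm(grad) < tol:
            return i
        w -= lr * grad
    return max_iter

rng = np.random.default_rng(0)
n = 500
x1 = rng.normal(0, 1, n)    # small-scale feature
x2 = rng.normal(0, 30, n)   # large-scale feature
y = 2.0 * x1 + 0.05 * x2 + rng.normal(0, 0.1, n)

raw = np.column_stack([x1, x2])
scaled = (raw - raw.min(axis=0)) / (raw.max(axis=0) - raw.min(axis=0))

# Add an intercept column so both versions can fit the same predictions
ones = np.ones((n, 1))
iters_raw = gd_iterations(np.hstack([ones, raw]), y)
iters_scaled = gd_iterations(np.hstack([ones, scaled]), y)

print(f"iterations on raw features:    {iters_raw}")
print(f"iterations on scaled features: {iters_scaled}")
```

With features on very different scales the loss surface is elongated, so a step size that is safe for the large-scale direction makes slow progress along the small-scale one.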

In Support Vector Machines, feature scaling can decrease the time to find the support vectors.

Finally, methods using Euclidean distances or distances in general are also affected by the magnitude of the features, as Euclidean distance is sensitive to variations in the magnitude or scales of the predictors. Therefore feature scaling is required for methods that utilise distance calculations like k-nearest neighbours (KNN) and k-means clustering.
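A small worked example shows how magnitude can change which point counts as the nearest neighbour. The three passengers and the feature ranges used for scaling (age 0 to 80, fare 0 to 512) are illustrative assumptions, not values taken from the dataset.

```python
import numpy as np

def euclidean(p, q):
    """Euclidean distance between two points."""
    return float(np.sqrt(np.sum((p - q) ** 2)))

# Three hypothetical passengers described by (age, fare)
a = np.array([25.0, 10.0])
b = np.array([70.0, 12.0])    # very different age, similar fare
c = np.array([26.0, 150.0])   # similar age, very different fare

# On raw values, the fare difference dominates the distance
d_ab_raw = euclidean(a, b)
d_ac_raw = euclidean(a, c)

# Min-max scaling with assumed ranges: age 0-80, fare 0-512
lo = np.array([0.0, 0.0])
hi = np.array([80.0, 512.0])

def scale(p):
    return (p - lo) / (hi - lo)

d_ab_scaled = euclidean(scale(a), scale(b))
d_ac_scaled = euclidean(scale(a), scale(c))

print(f"raw:    d(a, b) = {d_ab_raw:.2f}, d(a, c) = {d_ac_raw:.2f}")
print(f"scaled: d(a, b) = {d_ab_scaled:.3f}, d(a, c) = {d_ac_scaled:.3f}")
```

On the raw values `b` is the closer point to `a`, while after scaling `c` is closer, so a distance-based method could reach a different answer depending only on the units.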

**Magnitude matters for the following reasons:**
- The regression coefficient is directly influenced by the scale of the variable
- Variables with bigger magnitude / value range dominate over the ones with smaller magnitude / value range
- Gradient descent converges faster when features are on similar scales
- Feature scaling helps decrease the time to find support vectors for SVMs
- Euclidean distances are sensitive to feature magnitude

**Feature Magnitude sensitive machine learning models**
- Linear and Logistic Regression
- Neural Networks
- Support Vector Machines
- KNN
- K-means clustering
- Linear Discriminant Analysis (LDA)
- Principal Component Analysis (PCA)

**Feature Magnitude insensitive machine learning models**
- Classification and Regression Trees
- Random Forests
- Gradient Boosted Trees

In this notebook:
- Feature magnitude 

Dataset: Titanic 

In [2]:
import pandas as pd 

from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier
from sklearn.neighbors import KNeighborsClassifier

from sklearn.preprocessing import MinMaxScaler

from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split


In [3]:
# load the Titanic dataset, keeping only a few columns

use_columns = ['age', 'pclass','fare', 'survived']

data = pd.read_csv('../datasets/titanic.csv', usecols=use_columns)
data.head()

| | pclass | survived | age | fare |
|---|---|---|---|---|
| 0 | 1 | 1 | 29.0 | 211.3375 |
| 1 | 1 | 1 | 0.9167 | 151.55 |
| 2 | 1 | 0 | 2.0 | 151.55 |
| 3 | 1 | 0 | 30.0 | 151.55 |
| 4 | 1 | 0 | 25.0 | 151.55 |


In [4]:
data.describe()

| | pclass | survived | age | fare |
|---|---|---|---|---|
| count | 1309.0 | 1309.0 | 1046.0 | 1308.0 |
| mean | 2.294882 | 0.381971 | 29.881135 | 33.295479 |
| std | 0.837836 | 0.486055 | 14.4135 | 51.758668 |
| min | 1.0 | 0.0 | 0.1667 | 0.0 |
| 25% | 2.0 | 0.0 | 21.0 | 7.8958 |
| 50% | 3.0 | 0.0 | 28.0 | 14.4542 |
| 75% | 3.0 | 1.0 | 39.0 | 31.275 |
| max | 3.0 | 1.0 | 80.0 | 512.3292 |


We can see that Fare varies between 0 and 512, Age between 0.17 and 80, and Class between 1 and 3. So the variables have different magnitudes.

In [7]:
# split the data for training and testing 

X_train, X_test, y_train, y_test = train_test_split(
    data[['pclass', 'age', 'fare']].fillna(0),
    data.survived,
    test_size=0.3,
    random_state=0)

print(f"Train shape: {X_train.shape}, Test shape: {X_test.shape}")

Train shape: (916, 3), Test shape: (393, 3)


##### Feature Scaling

Here, we are going to use `MinMaxScaler` from scikit-learn. It rescales each feature to the range **0-1**.

**Formula:**

Transformation:

$X_{rescaled} = (X - X_{min}) / (X_{max} - X_{min})$

Reverting to the original scale:

$X = X_{rescaled} * (X_{max} - X_{min}) + X_{min}$
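The two formulas can be checked against scikit-learn on a toy array. The matrix below is made up (its columns only mimic class, age and fare); the sketch compares a manual computation with `MinMaxScaler.transform` and `MinMaxScaler.inverse_transform`.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Small made-up matrix: columns mimic class, age and fare
X = np.array([[1.0, 29.0, 211.3],
              [3.0,  2.0,   7.9],
              [2.0, 40.0,  31.3],
              [3.0, 80.0,   0.0]])

# Manual min-max transformation, applied column by column
x_min = X.min(axis=0)
x_max = X.max(axis=0)
X_manual = (X - x_min) / (x_max - x_min)

# The same transformation with scikit-learn
scaler = MinMaxScaler().fit(X)
X_sklearn = scaler.transform(X)

# Manual inverse transformation back to the original units
X_restored = X_manual * (x_max - x_min) + x_min

print("manual == sklearn:", np.allclose(X_manual, X_sklearn))
print("inverse matches:  ", np.allclose(X_restored, scaler.inverse_transform(X_sklearn)))
```

The minimum and maximum are computed per column, so each feature is mapped to 0-1 independently of the others.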



In [12]:
# let's scale the features with MinMaxScaler
# (fit on the training set only, then apply the same transformation to both sets)

scaler = MinMaxScaler()
scaler.fit(X_train)

X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

In [13]:
print('Mean: ', X_train_scaled.mean(axis=0))
print('Standard Deviation: ', X_train_scaled.std(axis=0))
print('Minimum value: ', X_train_scaled.min(axis=0))
print('Maximum value: ', X_train_scaled.max(axis=0))

Mean:  [0.64628821 0.33048359 0.06349833]
Standard Deviation:  [0.42105785 0.23332045 0.09250036]
Minimum value:  [0. 0. 0.]
Maximum value:  [1. 1. 1.]


Now the maximum value of every feature is 1 and the minimum value is 0.

##### Logistic Regression 


In [16]:
logitreg = LogisticRegression(
        C=1000,  # large C keeps regularization negligible
        solver='lbfgs',
        random_state=44
    )

logitreg.fit(X_train, y_train)


print('Train set')
pred = logitreg.predict_proba(X_train)
print('Logistic Regression roc-auc: {}'.format(
    roc_auc_score(y_train, pred[:, 1])))
print('Test set')
test_pred = logitreg.predict_proba(X_test)
print('Logistic Regression roc-auc: {}'.format(
    roc_auc_score(y_test, test_pred[:, 1])))

Train set
Logistic Regression roc-auc: 0.6793181006244372
Test set
Logistic Regression roc-auc: 0.7175488081411426


In [18]:
logitreg.coef_

array([[-0.71428242, -0.00923013,  0.00425235]])

In [20]:
# Let's use the scaled data
logitreg = LogisticRegression(
        C=1000,  # large C keeps regularization negligible
        solver='lbfgs',
        random_state=44
    )

logitreg.fit(X_train_scaled, y_train)


print('Train set')
pred = logitreg.predict_proba(X_train_scaled)
print('Logistic Regression roc-auc: {}'.format(
    roc_auc_score(y_train, pred[:, 1])))
print('Test set')
test_pred = logitreg.predict_proba(X_test_scaled)
print('Logistic Regression roc-auc: {}'.format(
    roc_auc_score(y_test, test_pred[:, 1])))

Train set
Logistic Regression roc-auc: 0.6793281640744896
Test set
Logistic Regression roc-auc: 0.7175488081411426


In [21]:
logitreg.coef_

array([[-1.42875872, -0.68293349,  2.17646757]])

We observe that the performance of logistic regression did not change when using the datasets with the features scaled (compare roc-auc values for train and test set for models with and without feature scaling).

However, when looking at the coefficients we do see a big difference in the values. This is because the magnitude of each variable affects its coefficient. After scaling, the 3 coefficients are of comparable magnitude, whereas before scaling we would be inclined to think that PClass alone was driving the Survival outcome.
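The link between the two sets of coefficients can be made explicit with a little arithmetic on the values printed above. Since min-max scaling divides each feature by its range, a coefficient fitted on the scaled feature should be roughly the raw coefficient multiplied by that range. The sketch below only divides the reported numbers; reading the ratios as training-set ranges is an assumption, because the notebook does not print those ranges.

```python
import numpy as np

# Coefficients reported above, in the column order pclass, age, fare
coef_raw = np.array([-0.71428242, -0.00923013, 0.00425235])
coef_scaled = np.array([-1.42875872, -0.68293349, 2.17646757])

# If w_scaled is approximately w_raw * (max - min), this ratio approximates each feature's range
ratio = coef_scaled / coef_raw

for name, r in zip(["pclass", "age", "fare"], ratio):
    print(f"{name}: scaled / raw = {r:.2f}")
```

The ratios come out at roughly 2, 74 and 512. That is in line with `pclass` spanning 1 to 3 and `fare` spanning 0 to about 512 in `data.describe()`; the `age` ratio would be consistent with a training-split range somewhat narrower than the overall 0 to 80 (after filling missing ages with 0), though this cannot be confirmed from the outputs shown.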

##### Support Vector Machine


In [24]:
SVM_model = SVC(probability=True, gamma='auto',random_state=44)
SVM_model.fit(X_train, y_train)

print('Train set')
pred = SVM_model.predict_proba(X_train)
print('SVM roc-auc: {}'.format(roc_auc_score(y_train, pred[:, 1])))
print('Test set')
pred = SVM_model.predict_proba(X_test)
print('SVM roc-auc: {}'.format(roc_auc_score(y_test, pred[:, 1])))

Train set
SVM roc-auc: 0.882393490960506
Test set
SVM roc-auc: 0.6617581992146452


In [25]:
SVM_model.coef_

AttributeError: coef_ is only available when using a linear kernel

In [26]:
SVM_model = SVC(probability=True, gamma='auto',random_state=44)
SVM_model.fit(X_train_scaled, y_train)

print('Train set')
pred = SVM_model.predict_proba(X_train_scaled)
print('SVM roc-auc: {}'.format(roc_auc_score(y_train, pred[:, 1])))
print('Test set')
pred = SVM_model.predict_proba(X_test_scaled)
print('SVM roc-auc: {}'.format(roc_auc_score(y_test, pred[:, 1])))

Train set
SVM roc-auc: 0.6780802962679695
Test set
SVM roc-auc: 0.6841435761296388


Feature scaling improved the performance of the support vector machine. After feature scaling, the model no longer over-fits the training set (compare the train roc-auc of 0.88 for the model on unscaled features with 0.68 for the model on scaled features). In addition, the roc-auc for the testing set increased as well (0.66 vs 0.68).
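One common way to keep scaling and model fitting together is scikit-learn's `make_pipeline`, which fits the scaler on the training data only and reuses it at prediction time. The sketch below uses synthetic stand-in data rather than the Titanic set, and the inflated third feature is an assumption made to mimic a large-magnitude variable such as fare.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

# Synthetic stand-in data (not the Titanic set); one feature is inflated on purpose
X, y = make_classification(n_samples=600, n_features=3, n_informative=3,
                           n_redundant=0, random_state=0)
X[:, 2] *= 500

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# The scaler is fitted inside the pipeline, on the training data only
svm_pipe = make_pipeline(
    MinMaxScaler(),
    SVC(probability=True, gamma='auto', random_state=44),
)
svm_pipe.fit(X_tr, y_tr)

auc = roc_auc_score(y_te, svm_pipe.predict_proba(X_te)[:, 1])
print(f"Pipeline SVM test roc-auc: {auc:.3f}")
```

Calling `svm_pipe.predict_proba(X_te)` applies the fitted scaler before the SVM, so the test set never needs to be transformed by hand.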

##### K-Nearest Neighbours


In [27]:
KNN = KNeighborsClassifier(n_neighbors=5)

KNN.fit(X_train, y_train)

print('Train set')
pred = KNN.predict_proba(X_train)
print('KNN roc-auc: {}'.format(roc_auc_score(y_train, pred[:,1])))
print('Test set')
pred = KNN.predict_proba(X_test)
print('KNN roc-auc: {}'.format(roc_auc_score(y_test, pred[:,1])))

Train set
KNN roc-auc: 0.8131141849360215
Test set
KNN roc-auc: 0.6947901111664178


In [28]:
KNN = KNeighborsClassifier(n_neighbors=5)

KNN.fit(X_train_scaled, y_train)

print('Train set')
pred = KNN.predict_proba(X_train_scaled)
print('KNN roc-auc: {}'.format(roc_auc_score(y_train, pred[:,1])))
print('Test set')
pred = KNN.predict_proba(X_test_scaled)
print('KNN roc-auc: {}'.format(roc_auc_score(y_test, pred[:,1])))

Train set
KNN roc-auc: 0.826928785995703
Test set
KNN roc-auc: 0.7232453957192633


We observe for KNN as well that feature scaling improved the performance of the model. The model built on scaled features shows better generalisation, with a higher roc-auc for the testing set (0.72 vs 0.69 for the model built on unscaled features).

Both KNN models are over-fitting to the train set. Thus, we would need to change the parameters of the model or use fewer features to try to decrease over-fitting, which is beyond the scope of this demonstration.

##### Random Forest 

In [29]:
rf = RandomForestClassifier(n_estimators=200, random_state=39)

# train the model
rf.fit(X_train, y_train)

# evaluate performance
print('Train set')
pred = rf.predict_proba(X_train)
print('Random Forests roc-auc: {}'.format(roc_auc_score(y_train, pred[:, 1])))
print('Test set')
pred = rf.predict_proba(X_test)
print('Random Forests roc-auc: {}'.format(roc_auc_score(y_test, pred[:, 1])))

Train set
Random Forests roc-auc: 0.9866810238554083
Test set
Random Forests roc-auc: 0.7326751838946961


In [30]:
rf = RandomForestClassifier(n_estimators=200, random_state=39)

# train the model
rf.fit(X_train_scaled, y_train)

# evaluate performance
print('Train set')
pred = rf.predict_proba(X_train_scaled)
print('Random Forests roc-auc: {}'.format(roc_auc_score(y_train, pred[:, 1])))
print('Test set')
pred = rf.predict_proba(X_test_scaled)
print('Random Forests roc-auc: {}'.format(roc_auc_score(y_test, pred[:, 1])))

Train set
Random Forests roc-auc: 0.9867917218059866
Test set
Random Forests roc-auc: 0.7312510370001659


As expected, Random Forests shows essentially no change in performance regardless of whether it is trained on a dataset with scaled or unscaled features. This model, in particular, is over-fitting to the training set, so we would need to do some work to reduce the over-fitting. That is beyond the scope of this demonstration.
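The reason tree-based models are largely unaffected is that a split only depends on the ordering of a feature's values, and min-max scaling preserves that ordering. The sketch below illustrates this with a single `DecisionTreeClassifier` on synthetic data; the features and the labelling rule are invented for the demo, and small differences can still appear in practice because of floating-point rounding or randomness in an ensemble.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Synthetic features on very different scales (hypothetical data)
X = np.column_stack([
    rng.integers(1, 4, size=300),     # class-like: 1, 2 or 3
    rng.uniform(0, 80, size=300),     # age-like
    rng.uniform(0, 500, size=300),    # fare-like
]).astype(float)

# Invented labelling rule based on thresholds of two features
y = ((X[:, 2] > 100) | (X[:, 1] < 10)).astype(int)

X_scaled = MinMaxScaler().fit_transform(X)

tree_raw = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
tree_scaled = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_scaled, y)

# Fraction of samples on which the two trees make the same prediction
agreement = np.mean(tree_raw.predict(X) == tree_scaled.predict(X_scaled))
print(f"prediction agreement: {agreement:.3f}")
```

The two trees partition the samples in the same way; only the numeric thresholds stored in the nodes differ, since they are expressed in the units of whichever version of the features the tree was trained on.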

##### AdaBoost

In [31]:
ada = AdaBoostClassifier(n_estimators=200, random_state=44)

# train the model
ada.fit(X_train, y_train)

# evaluate model performance
print('Train set')
pred = ada.predict_proba(X_train)
print('AdaBoost roc-auc: {}'.format(roc_auc_score(y_train, pred[:,1])))
print('Test set')
pred = ada.predict_proba(X_test)
print('AdaBoost roc-auc: {}'.format(roc_auc_score(y_test, pred[:,1])))

Train set
AdaBoost roc-auc: 0.7970629821021541
Test set
AdaBoost roc-auc: 0.7473867595818815


In [32]:
ada = AdaBoostClassifier(n_estimators=200, random_state=44)

# train the model
ada.fit(X_train_scaled, y_train)

# evaluate model performance
print('Train set')
pred = ada.predict_proba(X_train_scaled)
print('AdaBoost roc-auc: {}'.format(roc_auc_score(y_train, pred[:,1])))
print('Test set')
pred = ada.predict_proba(X_test_scaled)
print('AdaBoost roc-auc: {}'.format(roc_auc_score(y_test, pred[:,1])))

Train set
AdaBoost roc-auc: 0.7970629821021541
Test set
AdaBoost roc-auc: 0.7475250262706707


As expected, AdaBoost shows essentially no change in performance regardless of whether it is trained on a dataset with scaled or unscaled features.