## Scale of variables

Reference: https://machinelearningmastery.com/

- In Linear Regression models **y = w x + b**, the scale of the input variable **x** matters

- The value of **w** is partly affected by the magnitude of **x**

- Changing the unit of **x** from mm to km (dividing **x** by 10^6) multiplies the magnitude of **w** by 10^6 for the same fit

- Unscaled input variables can result in a slow or unstable learning process

- Unscaled target variables in regression problems can result in exploding gradients

- Input variables with larger values may dominate the learning process

- Gradient descent converges faster when the input features are scaled 

- SVMs usually perform better with scaled features

- Methods that require distance metrics, e.g., KNN, KMeans, are usually affected by the scale of input features 
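
The effect of units on **w** can be checked with a small sketch. The data here are synthetic and hypothetical (a length in mm and a noisy linear target), and `np.polyfit` stands in for any least-squares fit of **y = w x + b**; none of this comes from the notebook's dataset.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: a length x measured in mm and a noisy linear target y.
x_mm = rng.uniform(0.0, 5_000.0, size=200)
y = 0.002 * x_mm + 1.0 + rng.normal(0.0, 0.1, size=200)

# The same measurements expressed in km (1 km = 1e6 mm).
x_km = x_mm / 1e6

# Least-squares fit of y = w * x + b for each unit.
# np.polyfit returns coefficients from highest degree to lowest: [w, b].
w_mm, b_mm = np.polyfit(x_mm, y, deg=1)
w_km, b_km = np.polyfit(x_km, y, deg=1)

print(f"w with x in mm: {w_mm:.6f}")
print(f"w with x in km: {w_km:.2f}")
print(f"ratio w_km / w_mm: {w_km / w_mm:.1f}")
```

The fitted line is the same in both cases; only the number used to express the slope changes, which is why the raw magnitude of **w** says little about feature importance unless the inputs share a scale.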

### Affected Models

- KNN
- K-means clustering
- Linear Discriminant Analysis 
- Principal Component Analysis 
- Linear and Logistic Regression
- Neural Networks
- Support Vector Machines


### Unaffected Models

- Trees
- Random Forests
- Gradient Boosted Trees
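
A rough way to see why tree-based models are largely unaffected: a split only depends on the ordering of values within a feature, and min-max or standard scaling preserves that ordering. The sketch below uses synthetic, hypothetical data and a fixed `random_state` so that both trees make the same random choices; it is an illustration, not a proof.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Hypothetical data: two features on very different scales.
X = np.column_stack([rng.uniform(0, 1, 300), rng.uniform(0, 1000, 300)])
y = (X[:, 0] + X[:, 1] / 1000 > 1.0).astype(int)

X_tr, X_te = X[:200], X[200:]
y_tr = y[:200]

# Fit the scaler on the training rows only, then apply it to both sets.
scaler = MinMaxScaler().fit(X_tr)

# Same seed for both trees so that ties are broken identically.
tree_raw = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
tree_scaled = DecisionTreeClassifier(random_state=0).fit(scaler.transform(X_tr), y_tr)

pred_raw = tree_raw.predict(X_te)
pred_scaled = tree_scaled.predict(scaler.transform(X_te))

# Fraction of test rows on which the two trees agree.
agreement = (pred_raw == pred_scaled).mean()
print("agreement between raw and scaled trees:", agreement)
```

Without a fixed seed, small differences between the raw and scaled scores of tree models can still appear from run to run, because the estimators themselves are randomized.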


In [0]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

In [64]:
from google.colab import drive
drive.mount('/content/gdrive')
data = pd.read_csv("gdrive/My Drive/Colab Notebooks/FeatureEngineering/train.csv")

Drive already mounted at /content/gdrive; to attempt to forcibly remount, call drive.mount("/content/gdrive", force_remount=True).


In [65]:
data.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [0]:
num_cols = ['Survived', 'Pclass', 'Age', 'Fare']
data = data[num_cols]

In [67]:
data.describe()

Unnamed: 0,Survived,Pclass,Age,Fare
count,891.0,891.0,714.0,891.0
mean,0.383838,2.308642,29.699118,32.204208
std,0.486592,0.836071,14.526497,49.693429
min,0.0,1.0,0.42,0.0
25%,0.0,2.0,20.125,7.9104
50%,0.0,3.0,28.0,14.4542
75%,1.0,3.0,38.0,31.0
max,1.0,3.0,80.0,512.3292


In [68]:
for i in ['Pclass', 'Age', 'Fare']:
    print(i,': ', data[i].max()-data[i].min())

Pclass :  2
Age :  79.58
Fare :  512.3292


In [69]:
data.isnull().sum()

Survived      0
Pclass        0
Age         177
Fare          0
dtype: int64

In [70]:
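# Missing Age values are filled with the column mean of the full dataset
# (computed before the split). No random_state is set, so the split and
# the scores below vary from run to run.
X_train, X_test, y_train, y_test = train_test_split(
    data[['Age', 'Fare', 'Pclass']].fillna(data.mean()),
    data['Survived'],
    test_size=0.2)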

X_train.shape, X_test.shape

((712, 3), (179, 3))
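
Filling missing values with a mean computed on the full dataset lets information from the test rows reach the training data. One possible alternative, sketched here on a hypothetical stand-in DataFrame (not the Titanic file), is to split first and fit a `SimpleImputer` on the training rows only. The `random_state=0` values are assumptions added for reproducibility.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Hypothetical stand-in for the numeric columns used in this notebook.
df = pd.DataFrame({
    "Age": rng.uniform(1, 80, 100),
    "Fare": rng.uniform(0, 500, 100),
    "Pclass": rng.integers(1, 4, 100),
    "Survived": rng.integers(0, 2, 100),
})
# Blank out 20 Age values to mimic missing data.
df.loc[df.sample(20, random_state=0).index, "Age"] = np.nan

# Split first, so the test rows play no part in the imputation statistics.
X_tr, X_te, y_tr, y_te = train_test_split(
    df[["Age", "Fare", "Pclass"]], df["Survived"],
    test_size=0.2, random_state=0)

# Learn the column means on the training rows, then reuse them on the test rows.
imputer = SimpleImputer(strategy="mean")
X_tr_filled = imputer.fit_transform(X_tr)
X_te_filled = imputer.transform(X_te)

print(X_tr_filled.shape, X_te_filled.shape)
print("training-set Age mean used for filling:", imputer.statistics_[0])
```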

### Feature Scaling

In [71]:
for i in ['Pclass', 'Age', 'Fare']:
    print(i,'Min: ', X_train[i].min())
    print(i,'Max: ', X_train[i].max())
    print(i,'Range: ', X_train[i].max()-X_train[i].min())
    print(i,'Mean: ', X_train[i].mean())
    print(i,'Std: ', X_train[i].std())

Pclass Min:  1
Pclass Max:  3
Pclass Range:  2
Pclass Mean:  2.2907303370786516
Pclass Std:  0.8392773834291926
Age Min:  0.42
Age Max:  80.0
Age Range:  79.58
Age Mean:  29.915777428949063
Age Std:  13.103765391938433
Fare Min:  0.0
Fare Max:  512.3292
Fare Range:  512.3292
Fare Mean:  32.83680323033706
Fare Std:  51.81885296097275


In [0]:
obj = StandardScaler()
X_train_scaled = obj.fit_transform(X_train)
X_test_scaled = obj.transform(X_test)

In [73]:
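# NOTE: X_train_scaled is a NumPy array, so X_train_scaled[i] selects row i
# (one passenger), not feature i. The statistics printed below are per row;
# use X_train_scaled[:, i] for per-feature statistics.
for i in range(3):
    print(i,'Min: ', X_train_scaled[i].min())
    print(i,'Max: ', X_train_scaled[i].max())
    print(i,'Range: ', X_train_scaled[i].max()-X_train_scaled[i].min())
    print(i,'Mean: ', X_train_scaled[i].mean())
    print(i,'Std: ', X_train_scaled[i].std())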

0 Min:  -0.47867188547973005
0 Max:  0.8456897383917251
0 Range:  1.324361623871455
0 Mean:  0.09902737911531101
0 Std:  0.5537116115928274
1 Min:  -1.5389878605584064
1 Max:  4.444810827034529
1 Range:  5.983798687592936
1 Mean:  0.8180164330164081
1 Std:  2.602657213669343
2 Min:  -0.7572440289155674
2 Max:  0.8456897383917251
2 Range:  1.6029337673072925
2 Mean:  -0.13133579327609476
2 Std:  0.6999103191462845
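
To inspect what `StandardScaler` does to each feature, the statistics have to be taken per column (`[:, j]`) rather than per row. A minimal sketch on synthetic, hypothetical data (the column names are placeholders): after fitting, every column of the training array should have mean close to 0 and standard deviation close to 1.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Hypothetical features on different scales (placeholders, not the real data).
X_demo = np.column_stack([
    rng.uniform(0, 80, 500),      # Age-like
    rng.exponential(30, 500),     # Fare-like
    rng.integers(1, 4, 500),      # Pclass-like
]).astype(float)

X_std = StandardScaler().fit_transform(X_demo)

# Column-wise statistics: index with [:, j] to select feature j.
for j, name in enumerate(["Age-like", "Fare-like", "Pclass-like"]):
    col = X_std[:, j]
    print(f"{name}: min={col.min():.3f}, max={col.max():.3f}, "
          f"mean={col.mean():.3f}, std={col.std():.3f}")
```

Note that `StandardScaler` divides by the population standard deviation (`ddof=0`), which matches NumPy's default `std()` but not pandas' default (`ddof=1`).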


In [0]:
obj = MinMaxScaler()
X_train_scaled = obj.fit_transform(X_train)
X_test_scaled = obj.transform(X_test)

In [75]:
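# NOTE: as above, X_train_scaled[i] selects row i (one passenger), so the
# statistics printed below are per row; use X_train_scaled[:, i] for
# per-feature statistics.
for i in range(3):
    print(i,'Min: ', X_train_scaled[i].min())
    print(i,'Max: ', X_train_scaled[i].max())
    print(i,'Range: ', X_train_scaled[i].max()-X_train_scaled[i].min())
    print(i,'Mean: ', X_train_scaled[i].mean())
    print(i,'Std: ', X_train_scaled[i].std())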

0 Min:  0.015712553569072387
0 Max:  1.0
0 Range:  0.9842874464309276
0 Mean:  0.45828267158007363
0 Std:  0.40790364164369575
1 Min:  0.0
1 Max:  0.5133418122566507
1 Range:  0.5133418122566507
1 Mean:  0.2698824722266242
1 Std:  0.2104021395471772
2 Min:  0.015330377421392339
2 Max:  1.0
2 Range:  0.9846696225786077
2 Mean:  0.42045736548209095
2 Std:  0.42048319800287653
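
`MinMaxScaler` with its default range maps each training column to [0, 1] via (x - min) / (max - min). The sketch below, on synthetic and hypothetical data, checks this per column and also shows that new data transformed with the training minimum and maximum is not guaranteed to stay inside [0, 1].

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(0)

# Hypothetical "train" and "new" samples drawn from the same distributions.
X_fit = np.column_stack([rng.uniform(0, 80, 400), rng.exponential(30, 400)])
X_new = np.column_stack([rng.uniform(0, 80, 100), rng.exponential(30, 100)])

mm = MinMaxScaler().fit(X_fit)
X_fit_mm = mm.transform(X_fit)
X_new_mm = mm.transform(X_new)

# The same transformation written out by hand, column by column.
col_min = X_fit.min(axis=0)
col_max = X_fit.max(axis=0)
X_fit_manual = (X_fit - col_min) / (col_max - col_min)

print("train min per column:", X_fit_mm.min(axis=0))
print("train max per column:", X_fit_mm.max(axis=0))
print("new-data min per column:", X_new_mm.min(axis=0))
print("new-data max per column:", X_new_mm.max(axis=0))
```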


In [76]:
from sklearn.linear_model import LogisticRegression
classifier = LogisticRegression()

# Fit and score on the unscaled features.
# predict() already returns a 1-D array of class labels, so no rounding is needed.
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)
print(accuracy_score(y_test, y_pred))

# Refit and score on the scaled features
# (X_train_scaled currently holds the MinMaxScaler output).
classifier.fit(X_train_scaled, y_train)
y_pred = classifier.predict(X_test_scaled)
print(accuracy_score(y_test, y_pred))


0.7374301675977654
0.7318435754189944


In [77]:
from sklearn.linear_model import RidgeClassifierCV
classifier = RidgeClassifierCV()

# Fit and score on the unscaled features
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)
print(accuracy_score(y_test, y_pred))

# Refit and score on the scaled features
classifier.fit(X_train_scaled, y_train)
y_pred = classifier.predict(X_test_scaled)
print(accuracy_score(y_test, y_pred))


0.7374301675977654
0.7374301675977654


In [79]:
from sklearn.svm import SVC
classifier = SVC()

# Fit and score on the unscaled features
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)
print(accuracy_score(y_test, y_pred))

# Refit and score on the scaled features
classifier.fit(X_train_scaled, y_train)
y_pred = classifier.predict(X_test_scaled)
print(accuracy_score(y_test, y_pred))


0.6927374301675978
0.7430167597765364


In [80]:
from sklearn.neural_network import MLPClassifier
classifier = MLPClassifier()

# Fit and score on the unscaled features
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)
print(accuracy_score(y_test, y_pred))

# Refit and score on the scaled features
classifier.fit(X_train_scaled, y_train)
y_pred = classifier.predict(X_test_scaled)
print(accuracy_score(y_test, y_pred))

0.7262569832402235
0.7374301675977654


In [81]:
from sklearn.svm import LinearSVC
classifier = LinearSVC()

# Fit and score on the unscaled features
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)
print(accuracy_score(y_test, y_pred))

# Refit and score on the scaled features
classifier.fit(X_train_scaled, y_train)
y_pred = classifier.predict(X_test_scaled)
print(accuracy_score(y_test, y_pred))


0.3743016759776536
0.7374301675977654




In [82]:
from sklearn.ensemble import RandomForestClassifier
classifier = RandomForestClassifier()

# Fit and score on the unscaled features
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)
print(accuracy_score(y_test, y_pred))

# Refit and score on the scaled features.
# No random_state is set, so the two forests differ by random seed as well.
classifier.fit(X_train_scaled, y_train)
y_pred = classifier.predict(X_test_scaled)
print(accuracy_score(y_test, y_pred))


0.6815642458100558
0.6871508379888268


In [83]:
from sklearn.tree import DecisionTreeClassifier
classifier = DecisionTreeClassifier()

# Fit and score on the unscaled features
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)
print(accuracy_score(y_test, y_pred))

# Refit and score on the scaled features.
# No random_state is set, so tie-breaking between splits may differ between fits.
classifier.fit(X_train_scaled, y_train)
y_pred = classifier.predict(X_test_scaled)
print(accuracy_score(y_test, y_pred))

0.6927374301675978
0.6983240223463687


In [84]:
from sklearn.ensemble import GradientBoostingClassifier
classifier = GradientBoostingClassifier()

# Fit and score on the unscaled features
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)
print(accuracy_score(y_test, y_pred))

# Refit and score on the scaled features
classifier.fit(X_train_scaled, y_train)
y_pred = classifier.predict(X_test_scaled)
print(accuracy_score(y_test, y_pred))


0.7039106145251397
0.7039106145251397


In [85]:
from sklearn.linear_model import SGDClassifier
classifier = SGDClassifier()

# Fit and score on the unscaled features
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)
print(accuracy_score(y_test, y_pred))

# Refit and score on the scaled features
classifier.fit(X_train_scaled, y_train)
y_pred = classifier.predict(X_test_scaled)
print(accuracy_score(y_test, y_pred))


0.6759776536312849
0.7318435754189944


In [86]:
from sklearn.linear_model import Perceptron
classifier = Perceptron()

# Fit and score on the unscaled features
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)
print(accuracy_score(y_test, y_pred))

# Refit and score on the scaled features
classifier.fit(X_train_scaled, y_train)
y_pred = classifier.predict(X_test_scaled)
print(accuracy_score(y_test, y_pred))


0.7039106145251397
0.7206703910614525


In [87]:
from sklearn.naive_bayes import GaussianNB
classifier = GaussianNB()

# Fit and score on the unscaled features
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)
print(accuracy_score(y_test, y_pred))

# Refit and score on the scaled features
classifier.fit(X_train_scaled, y_train)
y_pred = classifier.predict(X_test_scaled)
print(accuracy_score(y_test, y_pred))


0.6927374301675978
0.6927374301675978


In [88]:
from sklearn.neighbors import KNeighborsClassifier
classifier = KNeighborsClassifier()

# Fit and score on the unscaled features
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)
print(accuracy_score(y_test, y_pred))

# Refit and score on the scaled features
classifier.fit(X_train_scaled, y_train)
y_pred = classifier.predict(X_test_scaled)
print(accuracy_score(y_test, y_pred))

0.6536312849162011
0.6927374301675978
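
The fit/predict/score pattern repeated in the cells above could be folded into one helper. The sketch below is one possible way to do it, not the notebook's method: `compare_scaling` is a hypothetical name, the data come from `make_classification` rather than the Titanic file, and the scaler is wrapped in a `Pipeline` so that it is always fitted on the training rows only. Seeds are fixed so that differences are not caused by random initialisation.

```python
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier


def compare_scaling(models, X_tr, X_te, y_tr, y_te):
    """Return {name: (accuracy_raw, accuracy_scaled)} for each model (hypothetical helper)."""
    results = {}
    for name, model in models.items():
        # clone() gives a fresh, unfitted copy for each experiment.
        raw = clone(model).fit(X_tr, y_tr)
        scaled = make_pipeline(StandardScaler(), clone(model)).fit(X_tr, y_tr)
        results[name] = (raw.score(X_te, y_te), scaled.score(X_te, y_te))
    return results


# Synthetic data with one feature blown up to a much larger scale.
X, y = make_classification(n_samples=500, n_features=4, n_informative=3,
                           n_redundant=1, random_state=0)
X[:, 0] *= 1000.0
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

models = {
    "LogisticRegression": LogisticRegression(max_iter=1000),
    "SVC": SVC(),
    "KNeighborsClassifier": KNeighborsClassifier(),
    "DecisionTreeClassifier": DecisionTreeClassifier(random_state=0),
}

results = compare_scaling(models, X_tr, X_te, y_tr, y_te)
for name, (acc_raw, acc_scaled) in results.items():
    print(f"{name:24s} raw={acc_raw:.3f}  scaled={acc_scaled:.3f}")
```

A single train/test split gives a noisy estimate; repeating the comparison with cross-validation would give a steadier picture of which models benefit from scaling.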
