# Centering and Scaling

Many machine learning models rely on some notion of distance between observations, so features on much larger scales can disproportionately influence the model. KNN, for example, uses distance explicitly when making predictions. For this reason, we want features to be on a similar scale. To achieve this, we can normalize or standardize the data, often referred to as scaling and centering.

There are several ways to scale our data. Given any column, we can subtract the mean and divide by the standard deviation so that every feature is centered around zero and has a variance of one. This is called standardization. We can also subtract the minimum and divide by the range of the data, so the normalized dataset has a minimum of zero and a maximum of one. Or we can rescale the data so that it ranges from -1 to 1.
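
As a minimal sketch of the three rescalings described above, the snippet below applies each one to a made-up column `x` (a hypothetical toy array, not a column of `music_df`). It uses `np.std` with its default population formula, which matches what `StandardScaler` computes.

```python
import numpy as np

# Hypothetical toy column, not taken from music_df.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

# Standardization: subtract the mean, divide by the standard deviation.
standardized = (x - x.mean()) / x.std()

# Min-max normalization: subtract the minimum, divide by the range.
normalized = (x - x.min()) / (x.max() - x.min())

# Rescale the normalized values from [0, 1] to [-1, 1].
centered = 2 * normalized - 1

print(standardized.mean(), standardized.std())
print(normalized.min(), normalized.max())
print(centered.min(), centered.max())
```

In scikit-learn, `StandardScaler` and `MinMaxScaler` (with its `feature_range` parameter) cover these cases without hand-written arithmetic.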

In [1]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import Lasso, LogisticRegression

In [2]:
music_df = pd.read_csv('Data/music_clean.csv')
music_df = music_df.drop('Unnamed: 0', axis=1)
print(music_df.shape)
music_df.head()

(1000, 12)


,popularity,acousticness,danceability,duration_ms,energy,instrumentalness,liveness,loudness,speechiness,tempo,valence,genre
0,60.0,0.896,0.726,214547.0,0.177,2e-06,0.116,-14.824,0.0353,92.934,0.618,1
1,63.0,0.00384,0.635,190448.0,0.908,0.0834,0.239,-4.795,0.0563,110.012,0.637,1
2,59.0,7.5e-05,0.352,456320.0,0.956,0.0203,0.125,-3.634,0.149,122.897,0.228,1
3,54.0,0.945,0.488,352280.0,0.326,0.0157,0.119,-12.02,0.0328,106.063,0.323,1
4,55.0,0.245,0.667,273693.0,0.647,0.000297,0.0633,-7.787,0.0487,143.995,0.3,1


In [3]:
X = music_df.drop('genre', axis=1).values
y = music_df['genre'].values

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(np.mean(X), np.std(X))
print(np.mean(X_train_scaled), np.std(X_train_scaled))

19762.413275219726 71791.8429618064
4.037174635000569e-16 0.9999999999999993
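
The two numbers printed above average over every entry of the array, so they mix all columns together. A per-column check is often more informative. The sketch below uses a made-up three-column array (`X_demo` is a hypothetical stand-in, not `music_df`) to confirm that each column ends up with mean close to zero and standard deviation close to one after `fit_transform`.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Hypothetical stand-in for X_train: three columns on very different scales.
X_demo = np.column_stack([
    rng.normal(0.5, 0.2, size=200),
    rng.normal(120, 30, size=200),
    rng.normal(250_000, 60_000, size=200),
])

demo_scaler = StandardScaler()
X_demo_scaled = demo_scaler.fit_transform(X_demo)

# Per-column statistics after scaling.
col_means = X_demo_scaled.mean(axis=0)
col_stds = X_demo_scaled.std(axis=0)
print(col_means)
print(col_stds)

# The statistics learned during fit are stored on the scaler and are
# reused, unchanged, whenever transform() is called on new data.
print(demo_scaler.mean_)
print(demo_scaler.scale_)
```

Because `transform` reuses the training statistics, a scaled test set will generally have column means and standard deviations near, but not exactly at, zero and one.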


## Scaling in a pipeline

In [4]:
X = music_df.drop('genre', axis=1).values
y = music_df['genre'].values

steps = [('scaler', StandardScaler()),
         ('knn', KNeighborsClassifier(n_neighbors=6))]

pipeline = Pipeline(steps)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state=21)

knn_scaled = pipeline.fit(X_train, y_train)
y_pred = knn_scaled.predict(X_test)
print(knn_scaled.score(X_test, y_test))

0.805


Compare this accuracy of about 80% with the same KNN model fit on unscaled data:

In [5]:
X = music_df.drop('genre', axis=1).values
y = music_df['genre'].values

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state=21)
knn_unscaled = KNeighborsClassifier(n_neighbors=6).fit(X_train, y_train)
print(knn_unscaled.score(X_test, y_test))

0.515
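
One way to see why unscaled KNN struggles here is to look at a single distance. The sketch below takes the first two rows from the `music_df.head()` output shown earlier (features only, `genre` excluded) and measures how much of their squared Euclidean distance comes from `duration_ms` alone. It assumes Euclidean distance, which is the default metric for `KNeighborsClassifier`.

```python
import numpy as np

features = ["popularity", "acousticness", "danceability", "duration_ms",
            "energy", "instrumentalness", "liveness", "loudness",
            "speechiness", "tempo", "valence"]

# First two rows of music_df.head(), with the genre column left out.
row0 = np.array([60.0, 0.896, 0.726, 214547.0, 0.177, 2e-06,
                 0.116, -14.824, 0.0353, 92.934, 0.618])
row1 = np.array([63.0, 0.00384, 0.635, 190448.0, 0.908, 0.0834,
                 0.239, -4.795, 0.0563, 110.012, 0.637])

# Squared difference contributed by each feature.
sq_diff = (row0 - row1) ** 2

# Euclidean distance between the two songs on the raw features.
distance = np.sqrt(sq_diff.sum())

# Fraction of the squared distance that is due to duration_ms.
duration_share = sq_diff[features.index("duration_ms")] / sq_diff.sum()

print(distance)
print(duration_share)
```

On the raw features, the distance between these two songs is almost entirely the difference in their durations in milliseconds, so the other ten features have very little say in which neighbors are chosen.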


## Using Cross Validation and Scaling in a Pipeline

In [6]:
X = music_df.drop('genre', axis=1).values
y = music_df['genre'].values

steps = [('scaler', StandardScaler()),
         ('knn', KNeighborsClassifier())]

pipeline = Pipeline(steps)
parameters = {'knn__n_neighbors' : np.arange(1,50)}
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state=21)

cv = GridSearchCV(pipeline, param_grid=parameters)
cv.fit(X_train, y_train)
y_pred = cv.predict(X_test)

print(cv.best_score_)
print(cv.best_params_)

0.8137500000000001
{'knn__n_neighbors': 12}
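
Note that `best_score_` is the mean cross-validated score on the training folds, not a score on the held-out test set. The sketch below shows how the two can be reported side by side. It uses synthetic data from `make_classification` as a hypothetical stand-in for `music_df`, and the names `demo_pipeline` and `demo_cv` are illustrative only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic binary classification data (an assumption, not music_df).
X_demo, y_demo = make_classification(n_samples=300, n_features=8,
                                     random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2,
                                          random_state=21)

demo_pipeline = Pipeline([("scaler", StandardScaler()),
                          ("knn", KNeighborsClassifier())])
demo_cv = GridSearchCV(demo_pipeline,
                       param_grid={"knn__n_neighbors": np.arange(1, 20)})
demo_cv.fit(X_tr, y_tr)

# Mean cross-validated accuracy of the best parameter setting.
cv_score = demo_cv.best_score_

# Accuracy of the refitted best pipeline on the held-out test set.
test_score = demo_cv.score(X_te, y_te)

print(cv_score, test_score)
print(demo_cv.best_params_)
```

Because the scaler sits inside the pipeline, it is refit on the training portion of each fold, so no information from the validation fold leaks into the scaling statistics.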


### Centering and Scaling: Regression

Use a pipeline to preprocess the music_df features and build a lasso regression model to predict a song's loudness. The score printed is R-squared on the test set.

In [7]:
X = music_df.drop('loudness', axis=1).values
y = music_df['loudness'].values

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state=21)
steps = [("scaler", StandardScaler()),
         ("lasso", Lasso(alpha=0.5))]

pipeline = Pipeline(steps)
pipeline.fit(X_train, y_train)
print(pipeline.score(X_test, y_test))

0.6976727596061001

For comparison, fit the same lasso model on the unscaled features:


In [8]:
X = music_df.drop('loudness', axis=1).values
y = music_df['loudness'].values

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state=21)
lasso_unscaled = Lasso(alpha=0.5).fit(X_train, y_train)
print(lasso_unscaled.score(X_test, y_test))

0.48710377916766456


### Centering and Scaling: Classification

Bring together scaling and model building into a pipeline for cross-validation.

Build a pipeline to scale features in the music_df dataset and perform grid search cross-validation using a logistic regression model with different values for the hyperparameter C. The target variable here is "genre", which contains binary values for rock as 1 and any other genre as 0.

In [9]:
X = music_df.drop('genre', axis=1).values
y = music_df['genre'].values

steps = [("scaler", StandardScaler()),
         ("logreg", LogisticRegression())]

pipeline = Pipeline(steps)

parameters = {"logreg__C": np.linspace(0.001, 1.0, 20)}
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, 
                                                    random_state=21)

cv = GridSearchCV(pipeline, param_grid=parameters)

cv.fit(X_train, y_train)
print(cv.best_score_, "\n", cv.best_params_)

0.8625 
 {'logreg__C': 0.15873684210526315}
