# Centering And Scaling

We can inspect the range of each feature in the data frame with:

> df.describe()

## Why scale your data

* Many models rely on some form of distance between samples to make predictions
* Features on larger scales can unduly influence the model
    * Example: k-NN uses distance explicitly when making predictions
* We want features to be on a similar scale
* Normalizing (scaling and centering) the data achieves this
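
To make the point about distance concrete, here is a minimal sketch on made-up data (the array `X` and the helper `distances_to_first` are illustrative assumptions, not part of the original notebook). It compares which sample is nearest to a query point before and after standardizing the columns.

```python
import numpy as np

# Hypothetical samples: column 0 is a small-scale feature,
# column 1 is a large-scale feature.
X = np.array([
    [0.10, 1000.0],  # query point
    [0.90, 1005.0],  # A: very different in column 0, almost equal in column 1
    [0.12, 1060.0],  # B: almost equal in column 0, moderately different in column 1
    [0.50, 1200.0],
    [0.80,  800.0],
])

def distances_to_first(data):
    """Euclidean distance from the first row to every other row."""
    return np.linalg.norm(data[1:] - data[0], axis=1)

# On the raw data, the large-scale column dominates the distance.
raw = distances_to_first(X)

# Standardize each column: subtract its mean, divide by its standard deviation.
X_std = (X - X.mean(axis=0)) / X.std(axis=0)
scaled = distances_to_first(X_std)

# Row index (in X) of the nearest neighbour of the query point.
nearest_raw = int(np.argmin(raw)) + 1
nearest_scaled = int(np.argmin(scaled)) + 1
print(nearest_raw, nearest_scaled)
```

On the raw data the nearest neighbour is sample A, because differences in the second column swamp everything else. After standardization both columns contribute comparably and sample B becomes the nearest.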

## Ways to normalize your data

* Standardization: subtract the mean and divide by the standard deviation
    * All features are centered around zero and have variance 1
* Min-max scaling: subtract the minimum and divide by the range
    * Each feature has minimum zero and maximum one
* Can also rescale so the data ranges from -1 to +1
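
The three options above can be sketched directly in NumPy. The matrix `X` below is a made-up example, and the variable names are chosen for illustration only. In scikit-learn, `StandardScaler` and `MinMaxScaler` (with its `feature_range` parameter) cover the same transformations.

```python
import numpy as np

# Hypothetical feature matrix: rows are samples, columns are features.
X = np.array([[1.0, 200.0],
              [2.0, 400.0],
              [3.0, 600.0],
              [4.0, 800.0]])

# Standardization: zero mean and unit variance per column.
X_standardized = (X - X.mean(axis=0)) / X.std(axis=0)

# Min-max scaling: each column is mapped onto [0, 1].
col_min = X.min(axis=0)
col_range = X.max(axis=0) - col_min
X_minmax = (X - col_min) / col_range

# Rescaling to [-1, 1]: stretch the [0, 1] result and shift it.
X_symmetric = 2 * X_minmax - 1
```

All statistics are computed per column (`axis=0`), so each feature is rescaled independently of the others.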


## Scale Example

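In [None]:
import numpy as np
from sklearn.preprocessing import scale

# Standardize each column of X to zero mean and unit variance
X_scaled = scale(X)
np.mean(X), np.std(X)
# (8.13421922452, 16.7265339794)
np.mean(X_scaled), np.std(X_scaled)  # mean is ~0 up to floating-point error
# (2.54662653149e-15, 1.0)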
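
As a cross-check, the sketch below compares `scale` with a column-by-column standardization done by hand. The data is randomly generated for illustration and is not the `X` used in these notes.

```python
import numpy as np
from sklearn.preprocessing import scale

# Hypothetical data with two columns on very different scales.
rng = np.random.default_rng(0)
X = rng.normal(loc=[5.0, 300.0], scale=[1.0, 50.0], size=(100, 2))

X_scaled = scale(X)

# The same transformation computed by hand, column by column.
X_manual = (X - X.mean(axis=0)) / X.std(axis=0)

matches = bool(np.allclose(X_scaled, X_manual))
col_means = X_scaled.mean(axis=0)  # approximately 0 for every column
col_stds = X_scaled.std(axis=0)    # approximately 1 for every column
```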

### Scaling with Pipeline

In [None]:
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Scale the features, then classify with k-NN
steps = [("scaler", StandardScaler()),
         ("knn", KNeighborsClassifier())]

pipeline = Pipeline(steps)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=21)

knn_scaled = pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)
accuracy_score(y_test, y_pred)
# 0.956

# k-NN without scaling
knn_unscaled = KNeighborsClassifier().fit(X_train, y_train)
knn_unscaled.score(X_test, y_test)
# 0.928

> Scaling improved the test accuracy from 0.928 to 0.956
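
A pipeline fits the scaler on the training split only and reuses those statistics on the test split. The sketch below spells out the equivalent manual steps and checks that both versions give the same predictions. It runs on synthetic data from `make_classification`, which is an assumption made here because the original `X` and `y` are not shown.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (not the dataset used in these notes).
X, y = make_classification(n_samples=300, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=21)

# Manual version: learn mean and standard deviation from the training set only,
# then apply the same statistics to the test set.
scaler = StandardScaler().fit(X_train)
knn = KNeighborsClassifier().fit(scaler.transform(X_train), y_train)
manual_pred = knn.predict(scaler.transform(X_test))

# Pipeline version: the same two steps bundled into one estimator.
pipeline = Pipeline([("scaler", StandardScaler()),
                     ("knn", KNeighborsClassifier())])
pipeline_pred = pipeline.fit(X_train, y_train).predict(X_test)

same_predictions = bool(np.array_equal(manual_pred, pipeline_pred))
```

Keeping the test set out of the scaler's `fit` step avoids leaking information about the test data into the model.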


## CV and Scaling in a Pipeline

In [None]:
from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV

steps = [("scaler", StandardScaler()),
         ("knn", KNeighborsClassifier())]

pipeline = Pipeline(steps)
# Parameter names follow the pattern <step name>__<parameter name>
parameters = {"knn__n_neighbors": np.arange(1, 50)}
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=21)
cv = GridSearchCV(pipeline, param_grid=parameters)
cv.fit(X_train, y_train)
y_pred = cv.predict(X_test)

print(cv.best_params_)
# {'knn__n_neighbors': 41}

print(cv.score(X_test, y_test))
# 0.956

print(classification_report(y_test, y_pred))

![](../images/centering_scaling_classification_report.png)
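
Grid-search keys for a pipeline are written as the step name, a double underscore, and the parameter name. The sketch below illustrates this on synthetic data, with a smaller grid and an explicit `cv=5`; both are choices made here to keep the example fast and are not taken from the original notebook.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (not the dataset used in these notes).
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

pipeline = Pipeline([("scaler", StandardScaler()),
                     ("knn", KNeighborsClassifier())])

# Every tunable parameter of the pipeline is exposed as "<step>__<param>".
has_key = "knn__n_neighbors" in pipeline.get_params()

search = GridSearchCV(pipeline,
                      param_grid={"knn__n_neighbors": np.arange(1, 10)},
                      cv=5)
search.fit(X, y)

# The winning value and the refitted k-NN step that uses it.
best_k = search.best_params_["knn__n_neighbors"]
best_knn = search.best_estimator_.named_steps["knn"]
```

Because the scaler sits inside the pipeline, it is refitted on the training part of every cross-validation fold, so the validation part never influences the scaling statistics.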

