# Scaling data 

## Effects of Scaling

### Scaling affects interpretability
Let's say we have following ordinary least squares regressions.

1. $y = \hat \beta_0 + \hat \beta_1x_1 + \hat \beta_2x_2$

Where $x_1$ and $x_2$ are in meters.

2. $y = \hat \beta_0 + \hat \beta_1x_1 + \hat \beta_2x_2$

Where $x_1$ is in meters and $x_2$ is in centimeters.

For 1., we can say that the larger $\beta$ has a greater effect on y, because $x_1$ and $x_2$ are on the same scale.

For 2., the $\hat \beta_2$ is 100 times smaller than the same coefficient in 1. So we can't conclude that $\beta_1$ is a more important variable than $\beta_2$, because the scales are different.

### When to scale or not scale?

**Scale** when your machine learning model includes distances (e.g. SVM, KNN, KMeans, Linear Regression) or uses gradient descent because it can help with convergence. We scale when distance calculations are involved because features with large scales can overpower the distance calculation.
![](knn_scaling.JPG)

Don't need to scale when distances aren't involved (e.g. trees). Trees are unaffected because they will split in the same relative position whether the data is scaled or not.

**In general, it does not hurt to scale the data.**

### Types of Scaling

#### Standardization vs Normalization

**Normalization (Min-max scaler)** gets our feature values between 0 and 1.

$X_{norm} = \frac{X-X_{min}}{X_{max} - X_{min}}$

We use normalization when we use CNNs.

**Standardization (Standard scaler)** gets our features into z-scores

$Z = \frac{X - \mu}{\sigma}$

Usually, standardization performs better than normalization.

In [4]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler, StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error

Let's do an example to see how scaling affects our regression. In this example, we are trying to predict the mpg of a car based on its engine size, horsepower, and weight. The data was simulated.

In [2]:
# Set a seed for reproducibility
np.random.seed(42)

# Generate the toy dataset
n_samples = 100

# Features: engine_size (in liters), horsepower, weight (in kg)
engine_size = np.random.uniform(1.0, 4.0, n_samples)  # Engine size between 1.0 and 4.0 liters
horsepower = engine_size * np.random.uniform(60, 100, n_samples)  # Roughly proportional to engine size
weight = np.random.uniform(800, 2000, n_samples)  # Weight between 800 and 2000 kg

# Target variable: mpg (miles per gallon), inversely related to horsepower and weight
mpg = 50 - (horsepower / 50) - (weight / 1000) + np.random.normal(0, 2, n_samples)

# Create a DataFrame
df = pd.DataFrame({
    'engine_size': engine_size,
    'horsepower': horsepower,
    'weight': weight,
    'mpg': mpg
})

First, let's split the data into train and test.

In [15]:
X = df.drop('mpg', axis = 1)
y = df['mpg']

# Split into 80% train, 20% test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=np.random.seed(42))

# Scale features using Min-Max (Normalization)
mm_scaler = MinMaxScaler()
X_train_mm = mm_scaler.fit_transform(X_train)
X_test_mm = mm_scaler.transform(X_test)

# Scale features using Standard scaler (Standardization)
ss_scaler = StandardScaler()
X_train_ss = mm_scaler.fit_transform(X_train)
X_test_ss = mm_scaler.transform(X_test)

Now lets perform a linear regression unscaled, normalized, and standardized data.

In [18]:
# Unscaled
lm_unscaled = LinearRegression()
lm_unscaled.fit(X_train, y_train)
pred_unscaled = lm_unscaled.predict(X_test)
print(f"MSE of unscaled data: {mean_squared_error(y_test, pred_unscaled)}")

# Min-Max (Normalization)
lm_mm = LinearRegression()
lm_mm.fit(X_train_mm, y_train)
pred_mm = lm_mm.predict(X_test_mm)
print(f"MSE of Min-Max data: {mean_squared_error(y_test, pred_mm)}")


lm_ss = LinearRegression()
lm_ss.fit(X_train_ss, y_train)
pred_ss = lm_ss.predict(X_test_ss)
print(f"MSE of standardized data: {mean_squared_error(y_test, pred_ss)}")

MSE of unscaled data: 8.409350248370767
MSE of Min-Max data: 8.409350248370782
MSE of standardized data: 8.409350248370782


It look like scaling made a huge difference here in terms of performance. Let's see what happens when we use KNN.

In [20]:
# Unscaled
knn_unscaled = KNeighborsRegressor()
knn_unscaled.fit(X_train, y_train)
pred_unscaled = lm_unscaled.predict(X_test)
print(f"MSE of unscaled data: {mean_squared_error(y_test, pred_unscaled)}")

# Min-Max (Normalization)
knn_mm = KNeighborsRegressor()
knn_mm.fit(X_train_mm, y_train)
pred_mm = knn_mm.predict(X_test_mm)
print(f"MSE of normalized data: {mean_squared_error(y_test, pred_mm)}")

# Standard Scaler (standardization)
knn_ss = KNeighborsRegressor()
knn_ss.fit(X_train_ss, y_train)
pred_ss = knn_ss.predict(X_test_ss)
print(f"MSE of standardized data: {mean_squared_error(y_test, pred_ss)}")

MSE of unscaled data: 8.409350248370767
MSE of normalized data: 8.366496434906873
MSE of standardized data: 8.366496434906873


We see that there is a slight improvement in MSE when we scale using standardization or normalization.

We know that trees based models are not really affected by scaling, so we will show this property now.

In [23]:
tree_unscaled = DecisionTreeRegressor(random_state=np.random.seed(42))
tree_unscaled.fit(X_train, y_train)
pred_unscaled = tree_unscaled.predict(X_test)
print(f"MSE of unscaled data: {mean_squared_error(y_test, pred_unscaled)}")

tree_mm = DecisionTreeRegressor(random_state=np.random.seed(42))
tree_mm.fit(X_train_mm, y_train)
pred_mm = tree_mm.predict(X_test_mm)
print(f"MSE of unscaled data: {mean_squared_error(y_test, pred_mm)}")

tree_ss = DecisionTreeRegressor(random_state=np.random.seed(42))
tree_ss.fit(X_train_ss, y_train)
pred_ss = tree_ss.predict(X_test_ss)
print(f"MSE of unscaled data: {mean_squared_error(y_test, pred_ss)}")

MSE of unscaled data: 11.372798254414686
MSE of unscaled data: 11.372798254414686
MSE of unscaled data: 11.372798254414686


As we can see, the MSE of the decision tree models do not change even if we scale the features.