Basic imports

In [2]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import kagglehub, os



Data

In [None]:
df = pd.DataFrame({
    'height': [150, 160, 170, 180, 190],
    'weight': [50, 60, 70, 80, 90],
    'age': [18, 25, 35, 45, 55]
})
df

Unnamed: 0,height,weight,age
0,150,50,18
1,160,60,25
2,170,70,35
3,180,80,45
4,190,90,55


<hr> 

# Feature scaling

Feature scaling is a technique used in machine learning and statistics to standardize or normalize the range of the independent variables (features). It keeps features with large numeric ranges from dominating features with small ones, which would otherwise bias scale-sensitive models.

## Types of models
### Important for algorithms that rely on:
* distance calculations (e.g. K-Nearest Neighbors, Support Vector Machines, K-Means Clustering, Principal Component Analysis)
* gradient-based optimization (e.g. Linear Regression, Logistic Regression, Neural Networks)
* penalties on coefficient size (e.g. Ridge and Lasso Regression)

### Scaling is usually unnecessary when:
* using tree-based algorithms (e.g. Decision Trees, Random Forests, Gradient Boosting Machines)
* features are already on the same scale (e.g. the data is already normalized or percentage-based)
* using models whose predictions do not depend on feature scale (e.g. Naive Bayes or non-regularized logistic regression, although gradient-based solvers may still converge faster on scaled data)

## Types of data
### Scale:
* numerical data - features such as age, income, or temperature, which have different ranges or units (e.g. age 18-70 vs. income 45 000 - 1 000 000), need scaling for scale-sensitive algorithms such as KNN or SVM. Otherwise the model may pay more attention to the feature with the larger range simply because its values are bigger, and effectively "ignore" the feature with smaller values
* ordinal data - data with a natural order encoded as integers (e.g. education level coded as 1, 2, 3, and 4). Scale it when it is used with distance-based models so that its range is comparable to the other features. Keep in mind that integer codes imply equal spacing between categories, which is not always true, and scaling does not change that
* continuous data - features like height, weight, or time, which can span a wide range of values. The same reasoning applies: without scaling, the algorithm might focus on the feature with the larger range even though both matter for the prediction
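To make the "larger range dominates" point concrete, here is a minimal sketch with made-up numbers (the three people and their age/income values are hypothetical). It compares Euclidean distances before and after standardization.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: columns are [age, income]
X = np.array([
    [25, 50_000],   # person 0
    [60, 51_000],   # person 1: very different age, similar income
    [26, 80_000],   # person 2: similar age, very different income
], dtype=float)

def euclidean(a, b):
    """Plain Euclidean distance between two feature vectors."""
    return float(np.linalg.norm(a - b))

# Raw units: the income column swamps the age column
d_raw_01 = euclidean(X[0], X[1])
d_raw_02 = euclidean(X[0], X[2])

# After standardization both columns contribute on a comparable scale
X_scaled = StandardScaler().fit_transform(X)
d_scaled_01 = euclidean(X_scaled[0], X_scaled[1])
d_scaled_02 = euclidean(X_scaled[0], X_scaled[2])

print(f"raw:    d(0,1)={d_raw_01:.1f}  d(0,2)={d_raw_02:.1f}")
print(f"scaled: d(0,1)={d_scaled_01:.3f}  d(0,2)={d_scaled_02:.3f}")
```

In raw units person 1 looks far closer to person 0 than person 2 does, purely because the 35-year age gap is tiny next to the income gap. After scaling, the two distances are of similar size.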


### Do not scale:
* binary features (e.g. 0 or 1, male or female) - these are already within a fixed range, so scaling is unnecessary
* categorical data (one-hot encoded) - features that are categorical and have been one-hot encoded (e.g. color=red, green, blue) do not require scaling, because the resulting binary columns only mark the presence or absence of a category
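One possible way to scale only the numeric columns and leave binary columns alone is scikit-learn's `ColumnTransformer`. The sketch below uses a hypothetical `people` DataFrame (the column names and values are invented for illustration).

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

# Hypothetical data: two numeric columns plus a binary flag
people = pd.DataFrame({
    "age": [18, 25, 35, 45, 55],
    "income": [45_000, 60_000, 120_000, 300_000, 1_000_000],
    "is_member": [0, 1, 1, 0, 1],
})

numeric_cols = ["age", "income"]

# Scale the numeric columns; pass every other column through unchanged
preprocess = ColumnTransformer(
    transformers=[("num", StandardScaler(), numeric_cols)],
    remainder="passthrough",
)

scaled = preprocess.fit_transform(people)

# Output order: transformed columns first, then the passthrough columns
scaled_people = pd.DataFrame(scaled, columns=numeric_cols + ["is_member"])
print(scaled_people)
```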

If you also scale the dependent variable (y), fit a separate scaler instance for it instead of reusing the one fitted on the features
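A minimal sketch of keeping the two scalers apart. Treating height and age as features and weight as the target is an assumption made only for this illustration, and the names `x_scaler` and `y_scaler` are arbitrary.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Assumed split for illustration: [height, age] as features, weight as target
X = np.array([[150, 18], [160, 25], [170, 35], [180, 45], [190, 55]], dtype=float)
y = np.array([50, 60, 70, 80, 90], dtype=float)

x_scaler = StandardScaler()  # learns the mean/std of the features
y_scaler = StandardScaler()  # learns the mean/std of the target only

X_scaled = x_scaler.fit_transform(X)
# Scalers expect 2-D input, so reshape y to a single column and back
y_scaled = y_scaler.fit_transform(y.reshape(-1, 1)).ravel()

# The dedicated y scaler maps scaled values (e.g. predictions) back to original units
y_restored = y_scaler.inverse_transform(y_scaled.reshape(-1, 1)).ravel()
print(y_restored)
```

Keeping `y_scaler` around is what makes it possible to convert model predictions back to the original units with `inverse_transform`.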


## I. Min-max scaling (normalization)

Min-max scaling transforms each feature into a specific range, typically [0, 1]. It is suitable when the data must fit within fixed bounds

Sensitivity to outliers:
high - an outlier far from the rest of the data becomes the new minimum or maximum, so the remaining values get compressed into a very small part of the range

Formula:
x_scaled = (x - x_min) / (x_max - x_min)

Value range:
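[0, 1] for the data the scaler was fitted on (always non-negative); new values outside the fitted minimum/maximum map outside this range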

Best use case:
Use when data must be scaled to a specific range, especially for algorithms sensitive to feature size (Neural Networks, K-Means Clustering)
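As a sanity check, the formula can be applied by hand and compared with `MinMaxScaler`. The sketch below also appends a made-up outlier (1000, not part of the notebook's data) to show the compression described above.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

heights = np.array([150, 160, 170, 180, 190], dtype=float)

def min_max(x):
    """Apply x_scaled = (x - x_min) / (x_max - x_min) to a 1-D array."""
    return (x - x.min()) / (x.max() - x.min())

manual = min_max(heights)
sklearn_result = MinMaxScaler().fit_transform(heights.reshape(-1, 1)).ravel()
print(manual)          # same values as the scaler produces
print(sklearn_result)

# Hypothetical outlier: the five original values get squeezed near 0
with_outlier = np.append(heights, 1000.0)
squeezed = min_max(with_outlier)
print(squeezed.round(4))
```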


In [None]:
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()

scaled_data = scaler.fit_transform(df)

scaled_df = pd.DataFrame(scaled_data, columns=df.columns)
print("Scaled df")
scaled_df

Scaled df


Unnamed: 0,height,weight,age
0,0.0,0.0,0.0
1,0.25,0.25,0.189189
2,0.5,0.5,0.459459
3,0.75,0.75,0.72973
4,1.0,1.0,1.0


In [None]:
print("Original df")
df 

Original df


Unnamed: 0,height,weight,age
0,150,50,18
1,160,60,25
2,170,70,35
3,180,80,45
4,190,90,55


## II. Standardization (z-score normalization)

Standardization transforms the data so that each feature has a mean of 0 and a standard deviation of 1. It is often used when the data is roughly Gaussian (normally distributed) or when all features should contribute on a comparable scale

Sensitivity to outliers:
moderate to high - outliers pull the mean towards them and inflate the standard deviation, so the bulk of the data ends up squeezed into a narrow band of z-scores

Formula:
x_scaled = (x-mean)/stdev

Value range:
(-∞, ∞) (can be negative or positive, depending on the mean and variance)

Best use case:
Use when the data is roughly normally distributed, or with models that work best on zero-centered features of comparable variance (e.g. Linear/Logistic Regression, SVM)
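The formula is easy to reproduce by hand, with one detail worth knowing: `StandardScaler` divides by the population standard deviation (`ddof=0`), while pandas' `Series.std()` defaults to the sample standard deviation (`ddof=1`). A minimal sketch on the `age` values from the example DataFrame:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

ages = pd.Series([18, 25, 35, 45, 55], dtype=float)

# Population standard deviation (ddof=0): what StandardScaler uses
z_population = (ages - ages.mean()) / ages.std(ddof=0)

# pandas default is the sample standard deviation (ddof=1): slightly smaller z-scores
z_sample = (ages - ages.mean()) / ages.std()

z_sklearn = StandardScaler().fit_transform(ages.to_numpy().reshape(-1, 1)).ravel()

print(z_population.round(6).tolist())
print(z_sample.round(6).tolist())
print(z_sklearn.round(6).tolist())
```

With only five rows the gap between the two conventions is visible; on large datasets it becomes negligible.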


In [None]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

scaled_data = scaler.fit_transform(df)

scaled_df = pd.DataFrame(scaled_data, columns=df.columns)

print("Scaled df")
scaled_df

Scaled df


Unnamed: 0,height,weight,age
0,-1.414214,-1.414214,-1.321256
1,-0.707107,-0.707107,-0.795756
2,0.0,0.0,-0.045043
3,0.707107,0.707107,0.705671
4,1.414214,1.414214,1.456384


In [None]:
print("Original df")
df 

Original df


Unnamed: 0,height,weight,age
0,150,50,18
1,160,60,25
2,170,70,35
3,180,80,45
4,190,90,55


## III. Robust scaling

Robust scaling centers each feature on its median and divides by its interquartile range (IQR), which makes it robust to outliers

Sensitivity to outliers:
Since the median is not sensitive to extreme values and the IQR only considers the middle 50% of the data, this method is much less affected by outliers

Value range:
(-∞, ∞) (can be negative or positive, but outliers are less impactful)

Formula:
x_scaled = (x - median) / IQR

Best use case:
Use when the data contains outliers that should be kept but should not dominate the scaling (e.g. financial data or other datasets with extreme values)
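A minimal sketch of the same formula by hand, using `np.percentile` for the quartiles and comparing against `RobustScaler` with its default settings. The outlier value 500 is made up for the illustration.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

ages = np.array([18, 25, 35, 45, 55], dtype=float)

def robust_scale(x):
    """Apply x_scaled = (x - median) / IQR to a 1-D array."""
    q1, median, q3 = np.percentile(x, [25, 50, 75])
    return (x - median) / (q3 - q1)

manual = robust_scale(ages)
sklearn_result = RobustScaler().fit_transform(ages.reshape(-1, 1)).ravel()
print(manual)          # [-0.85 -0.5   0.    0.5   1.  ]
print(sklearn_result)

# Hypothetical outlier: the original five values keep a usable spread
with_outlier = np.append(ages, 500.0)
robust_inliers = robust_scale(with_outlier)[:5]
print(robust_inliers.round(2))
print(robust_inliers.max() - robust_inliers.min())
```

The outlier itself still maps to a large value, but it no longer decides how the rest of the data is spread out.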

In [None]:
from sklearn.preprocessing import RobustScaler

scaler = RobustScaler()

scaled_data = scaler.fit_transform(df)

scaled_df = pd.DataFrame(scaled_data, columns=df.columns)

print("Scaled df")
scaled_df

Scaled df


Unnamed: 0,height,weight,age
0,-1.0,-1.0,-0.85
1,-0.5,-0.5,-0.5
2,0.0,0.0,0.0
3,0.5,0.5,0.5
4,1.0,1.0,1.0


In [None]:
print("Original df")
df 

Original df


Unnamed: 0,height,weight,age
0,150,50,18
1,160,60,25
2,170,70,35
3,180,80,45
4,190,90,55


## IV. Normalization (L2 norm scaling)

Normalization rescales each row (sample) so that the sum of its squared values equals 1, i.e. every row has unit L2 norm. Unlike the scalers above, it works per row, not per column. This is often used when the data represents directions (e.g. text or image data)

Value range:
[-1, 1] (each value keeps its sign; for non-negative data such as the example DataFrame, values fall in [0, 1])

Best use case:
Use when the magnitude of the data matters less than its direction (e.g., text data, image data)
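A minimal sketch of L2 normalization by hand: each row is divided by its own Euclidean norm. The third row, `[-3, 4, 0]`, is a made-up example added to show that negative inputs stay negative.

```python
import numpy as np
from sklearn.preprocessing import Normalizer

X = np.array([
    [150, 50, 18],   # first row of the example DataFrame
    [190, 90, 55],   # last row of the example DataFrame
    [-3, 4, 0],      # hypothetical row with a negative value
], dtype=float)

# Divide every row by its own L2 norm
row_norms = np.linalg.norm(X, axis=1, keepdims=True)
manual = X / row_norms

sklearn_result = Normalizer(norm="l2").fit_transform(X)

print(manual.round(6))
print(np.linalg.norm(manual, axis=1))  # every row now has length 1
```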

In [None]:
from sklearn.preprocessing import Normalizer

scaler = Normalizer()

scaled_data = scaler.fit_transform(df)

scaled_df = pd.DataFrame(scaled_data, columns=df.columns)

print("Scaled df")
scaled_df

Scaled df


Unnamed: 0,height,weight,age
0,0.942595,0.314198,0.113111
1,0.926467,0.347425,0.14476
2,0.908364,0.374032,0.187016
3,0.89086,0.395938,0.222715
4,0.874314,0.414149,0.253091


In [None]:
print("Original df")
df 

Original df


Unnamed: 0,height,weight,age
0,150,50,18
1,160,60,25
2,170,70,35
3,180,80,45
4,190,90,55
