
## What is Standardization?
Standardization rescales a feature so that it has a mean of 0 and a standard deviation of 1. It does not make the data normally distributed: the result follows a standard normal distribution only if the original feature was already normal. It’s also called z-score normalization because it converts raw values into z-scores.

For a data point $x$:

$$z = \frac{x - \mu}{\sigma}$$

$\mu$ = mean of the dataset

$\sigma$ = standard deviation of the dataset

$z$ = standardized value

Key Characteristics:

1. The resulting distribution has a mean of 0 and a standard deviation of 1.

2. It preserves the shape of the original distribution but rescales it.

3. It doesn’t bound the values to a specific range (e.g., values can be negative or greater than 1).
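
The characteristics above can be checked on a small example. The sketch below uses a made-up toy array (not the advertising dataset) and compares a hand-computed z-score with `StandardScaler`, assuming the population standard deviation (`ddof=0`), which is what scikit-learn uses.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy feature (values are made up for illustration)
x = np.array([[10.0], [20.0], [30.0], [40.0], [100.0]])

# Manual z-score: subtract the mean, divide by the population std (ddof=0)
mu = x.mean()
sigma = x.std()  # NumPy's default is ddof=0
z_manual = (x - mu) / sigma

# StandardScaler applies the same formula to each column
z_sklearn = StandardScaler().fit_transform(x)

print("manual matches sklearn:", np.allclose(z_manual, z_sklearn))
print("mean:", z_manual.mean(), "std:", z_manual.std())
print("min:", z_manual.min(), "max:", z_manual.max())
```

The mean comes out at (numerically) zero and the standard deviation at one, while the minimum is negative and the maximum exceeds 1, which illustrates that the values are not bounded.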

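When to Use Standardization?
1. Algorithms that assume features are centered around zero with comparable variance, and often roughly normal (e.g., PCA, SVMs with an RBF kernel, L1- or L2-regularized linear models).
2. Features with different units or scales, such as Age (years) next to Area Income (currency), so that no single feature dominates.
3. Outliers are important: z-scores are unbounded, so extreme values stay extreme instead of being squeezed into a fixed range. Outliers still shift the mean and standard deviation used for scaling.
4. Gradient-based optimization, where features on a similar scale usually help gradient descent converge faster.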
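
As a rough illustration of the gradient-based case, the sketch below chains `StandardScaler` with a logistic regression in a pipeline. The data is synthetic (`make_classification`), one feature is inflated on purpose to mimic mixed scales, and the split sizes and seeds are arbitrary assumptions. Fitting the scaler inside the pipeline means its mean and standard deviation come from the training split only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for a real dataset (assumption, not the advertising data)
X, y = make_classification(
    n_samples=500, n_features=4, n_informative=3, n_redundant=1, random_state=0
)
X[:, 0] *= 1000.0  # put one feature on a much larger scale

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# The scaler is fitted on the training split, then reused on the test split
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)
test_accuracy = model.score(X_test, y_test)

scaler_step = model.named_steps["standardscaler"]
print("means learned from training data:", scaler_step.mean_)
print("test accuracy:", test_accuracy)
```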

In [None]:
import pandas as pd
from sklearn.preprocessing import StandardScaler


In [None]:
df = pd.read_csv(r'/content/advertising.csv')
df.head()

,Daily Time Spent on Site,Age,Area Income,Daily Internet Usage,Ad Topic Line,City,Male,Country,Timestamp,Clicked on Ad
0,68.95,35,61833.9,256.09,Cloned 5thgeneration orchestration,Wrightburgh,0,Tunisia,2016-03-27 00:53:11,0
1,80.23,31,68441.85,193.77,Monitored national standardization,West Jodi,1,Nauru,2016-04-04 01:39:02,0
2,69.47,26,59785.94,236.5,Organic bottom-line service-desk,Davidton,0,San Marino,2016-03-13 20:35:42,0
3,74.15,29,54806.18,245.89,Triple-buffered reciprocal time-frame,West Terrifurt,1,Italy,2016-01-10 02:31:19,0
4,68.37,35,73889.99,225.58,Robust logistical utilization,South Manuel,0,Iceland,2016-06-03 03:36:18,0


In [None]:
# info
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 10 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Daily Time Spent on Site  1000 non-null   float64
 1   Age                       1000 non-null   int64  
 2   Area Income               1000 non-null   float64
 3   Daily Internet Usage      1000 non-null   float64
 4   Ad Topic Line             1000 non-null   object 
 5   City                      1000 non-null   object 
 6   Male                      1000 non-null   int64  
 7   Country                   1000 non-null   object 
 8   Timestamp                 1000 non-null   object 
 9   Clicked on Ad             1000 non-null   int64  
dtypes: float64(3), int64(3), object(4)
memory usage: 78.3+ KB


In [None]:
# Checking for null values
df.isnull().sum()

Daily Time Spent on Site,0
Age,0
Area Income,0
Daily Internet Usage,0
Ad Topic Line,0
City,0
Male,0
Country,0
Timestamp,0
Clicked on Ad,0


In [None]:
# select only the numerical columns (note: 'Male' and 'Clicked on Ad' are binary, not continuous)
df_num = [column for column in df.columns if df[column].dtype != 'object']
df_num

['Daily Time Spent on Site',
 'Age',
 'Area Income',
 'Daily Internet Usage',
 'Male',
 'Clicked on Ad']

In [None]:
df_num = df[df_num]


In [None]:
df_num.head()

,Daily Time Spent on Site,Age,Area Income,Daily Internet Usage,Male,Clicked on Ad
0,68.95,35,61833.9,256.09,0,0
1,80.23,31,68441.85,193.77,1,0
2,69.47,26,59785.94,236.5,0,0
3,74.15,29,54806.18,245.89,1,0
4,68.37,35,73889.99,225.58,0,0


In [None]:
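# initialize the scaler
scaler = StandardScaler()

# fit on the data and transform it; the result is a NumPy array
# note: this also scales the binary 'Male' column and the target 'Clicked on Ad'
df_scaled = scaler.fit_transform(df_num)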

In [None]:
# wrap the array back into a DataFrame with the original column names
df_scaled = pd.DataFrame(df_scaled, columns=df_num.columns, index=df_num.index)
df_scaled.head()


,Daily Time Spent on Site,Age,Area Income,Daily Internet Usage,Male,Clicked on Ad
0,0.249267,-0.114905,0.509691,1.73403,-0.962695,-1.0
1,0.961132,-0.570425,1.00253,0.313805,1.03875,-1.0
2,0.282083,-1.139826,0.356949,1.287589,-0.962695,-1.0
3,0.577432,-0.798185,-0.014456,1.50158,1.03875,-1.0
4,0.212664,-0.114905,1.408868,1.038731,-0.962695,-1.0
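
A quick sanity check on a standardized frame is to confirm that each column has mean 0 and standard deviation 1. The sketch below uses only the `Age` and `Area Income` values from the five rows shown earlier (so the numbers will differ from the full 1000-row result); the names `toy`, `toy_scaler`, and `toy_scaled` are illustrative. It also shows that pandas' default `std()` uses `ddof=1`, so it reports a value slightly above 1, and that `inverse_transform` recovers the original values.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Five-row stand-in built from the head() output above (illustrative only)
toy = pd.DataFrame({
    "Age": [35, 31, 26, 29, 35],
    "Area Income": [61833.90, 68441.85, 59785.94, 54806.18, 73889.99],
})

toy_scaler = StandardScaler()
toy_scaled = pd.DataFrame(
    toy_scaler.fit_transform(toy), columns=toy.columns, index=toy.index
)

col_means = toy_scaled.mean()
pop_std = toy_scaled.std(ddof=0)  # population std, matches StandardScaler
sample_std = toy_scaled.std()     # pandas default (ddof=1), slightly above 1

# Undo the scaling to get the original units back
restored = toy_scaler.inverse_transform(toy_scaled)

print(col_means)
print(pop_std)
print(sample_std)
```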


## What is Normalization?
Normalization rescales the data to fit within a specific range, typically [0, 1] or [-1, 1]. Unlike standardization, it doesn’t assume anything about the distribution; it just rescales the values linearly.

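Common Normalization methods:
1. Min-Max Scaler: maps each feature to [0, 1] using (x - min) / (max - min).
2. Robust Scaler: subtracts the median and divides by the interquartile range, so the result is not bounded but is less affected by outliers.
3. Mean Scaler (mean normalization): commonly (x - mean) / (max - min), which centers the feature at 0 and keeps it within [-1, 1].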
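
To make the three methods concrete, here is a minimal sketch on a made-up one-column array. `MinMaxScaler` and `RobustScaler` come from scikit-learn; mean normalization is computed by hand here, assuming the (x - mean) / (max - min) definition.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, RobustScaler

# Hypothetical toy feature with one large value
x = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

min_max = MinMaxScaler().fit_transform(x)         # (x - min) / (max - min)
robust = RobustScaler().fit_transform(x)          # (x - median) / IQR
mean_norm = (x - x.mean()) / (x.max() - x.min())  # computed by hand

print("min-max:  ", min_max.ravel())
print("robust:   ", robust.ravel())
print("mean norm:", mean_norm.ravel())
```

Only the min-max result is confined to [0, 1]; the robust-scaled values are unbounded, and the mean-normalized values are centered on 0.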

Key Characteristics:
1. Values are constrained to a specific range (e.g., [0, 1]).
2. The shape of the distribution is preserved, but the scale is compressed.
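3. Can be sensitive to outliers: min-max scaling depends on the minimum and maximum, so a single extreme value compresses all other values into a narrow band.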
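
The outlier sensitivity is easy to see on a small example. The sketch below defines a hypothetical helper `min_max` that applies the min-max formula by hand, checks it against `MinMaxScaler`, and then adds one extreme value to made-up data.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

def min_max(values):
    """Rescale a 1-D array to [0, 1] by hand."""
    return (values - values.min()) / (values.max() - values.min())

clean = np.array([20.0, 30.0, 40.0, 50.0, 60.0])
with_outlier = np.append(clean, 1000.0)  # one extreme value added

scaled_clean = min_max(clean)
scaled_outlier = min_max(with_outlier)

# The hand-written version should agree with MinMaxScaler
matches_sklearn = np.allclose(
    scaled_clean, MinMaxScaler().fit_transform(clean.reshape(-1, 1)).ravel()
)

print("matches sklearn:", matches_sklearn)
print("without outlier:", scaled_clean)
print("with outlier:   ", scaled_outlier[:5])  # the same five points, now near 0
```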

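When to Use Normalization?
1. Bounded input requirements: the model expects inputs in a fixed range (e.g., pixel intensities scaled to [0, 1]).
2. Distance-based algorithms (e.g., KNN, k-means), where features with larger ranges would otherwise dominate the distance.
3. No assumption of distribution: the data is not roughly normal, or its distribution is unknown.
4. Outliers aren’t critical: there are few extreme values, or they have already been handled.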
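
For the bounded-input case, `MinMaxScaler` accepts a `feature_range` argument, so the same scaler can target [0, 1] or [-1, 1]. The sketch below uses made-up 8-bit intensity values as an assumed example input.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical 8-bit pixel intensities (0 to 255)
pixels = np.array([[0.0], [64.0], [128.0], [255.0]])

# Default target range is [0, 1]
unit_range = MinMaxScaler().fit_transform(pixels)

# A different target range can be requested explicitly
signed_range = MinMaxScaler(feature_range=(-1, 1)).fit_transform(pixels)

print(unit_range.ravel())
print(signed_range.ravel())
```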

In [None]:
from sklearn.preprocessing import MinMaxScaler

In [None]:
df_num

,Daily Time Spent on Site,Age,Area Income,Daily Internet Usage,Male,Clicked on Ad
0,68.95,35,61833.90,256.09,0,0
1,80.23,31,68441.85,193.77,1,0
2,69.47,26,59785.94,236.50,0,0
3,74.15,29,54806.18,245.89,1,0
4,68.37,35,73889.99,225.58,0,0
...,...,...,...,...,...,...
995,72.97,30,71384.57,208.58,1,1
996,51.30,45,67782.17,134.42,1,1
997,51.63,51,42415.72,120.37,1,1
998,55.55,19,41920.79,187.95,0,0


In [None]:
# MinMaxScaler rescales each column to [0, 1] by default
scaler = MinMaxScaler()
df_scaled2 = scaler.fit_transform(df_num)

In [None]:
# wrap the array back into a DataFrame with the original column names
df_scaled2 = pd.DataFrame(df_scaled2, columns=df_num.columns, index=df_num.index)
df_scaled2.head()

,Daily Time Spent on Site,Age,Area Income,Daily Internet Usage,Male,Clicked on Ad
0,0.617882,0.380952,0.730472,0.916031,0.0,0.0
1,0.809621,0.285714,0.831375,0.538746,1.0,0.0
2,0.626721,0.166667,0.6992,0.797433,0.0,0.0
3,0.706272,0.238095,0.62316,0.85428,1.0,0.0
4,0.608023,0.380952,0.914568,0.731323,0.0,0.0


In [None]:
df_scaled2.describe()

,Daily Time Spent on Site,Age,Area Income,Daily Internet Usage,Male,Clicked on Ad
count,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0
mean,0.550743,0.404976,0.626119,0.455383,0.481,0.5
std,0.269482,0.20918,0.20484,0.265785,0.499889,0.50025
min,0.0,0.0,0.0,0.0,0.0,0.0
25%,0.318885,0.238095,0.504446,0.206139,0.0,0.0
50%,0.605388,0.380952,0.656847,0.474331,0.0,0.5
75%,0.781022,0.547619,0.786005,0.690232,1.0,1.0
max,1.0,1.0,1.0,1.0,1.0,1.0


## Which One to Choose?
Use Standardization:
1. When working with gradient-based models (e.g., neural networks, logistic regression).
2. When features have different units or very different variances.
3. When outliers are meaningful and should stay extreme after scaling rather than being squeezed into a fixed range.

Use Normalization:
1. When working with algorithms sensitive to magnitude (e.g., KNN, SVM).
2. When inputs need to be in a specific range (e.g., image data: 0-255 → 0-1).
3. When the shape of the distribution is unknown or irrelevant and you just want every feature on the same scale.

Combine or Experiment:

In practice, try both and evaluate model performance (e.g., via cross-validation). Some datasets benefit from one over the other depending on the algorithm.
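
One way to run that comparison is sketched below. It assumes synthetic data from `make_classification`, a KNN classifier, and 5-fold cross-validation; all of these are illustrative choices, so the scores say nothing about the advertising dataset. Putting the scaler inside the pipeline means it is refitted on each training fold.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Synthetic stand-in data (assumption); one feature is inflated on purpose
X, y = make_classification(
    n_samples=400, n_features=5, n_informative=3, n_redundant=0, random_state=42
)
X[:, 0] *= 1000.0

candidates = {
    "standardization": StandardScaler(),
    "normalization": MinMaxScaler(),
}

mean_scores = {}
for name, candidate_scaler in candidates.items():
    # Scaler and model are evaluated together, fold by fold
    pipeline = make_pipeline(candidate_scaler, KNeighborsClassifier(n_neighbors=5))
    scores = cross_val_score(pipeline, X, y, cv=5)
    mean_scores[name] = scores.mean()
    print(f"{name}: mean accuracy = {scores.mean():.3f}")
```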