In [2]:
!pip install pandas
!pip install scikit-learn

Collecting pandas
  Downloading pandas-2.3.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (12.3 MB)
Collecting tzdata>=2022.7
  Downloading tzdata-2025.2-py2.py3-none-any.whl (347 kB)
Collecting pytz>=2020.1
  Using cached pytz-2025.2-py2.py3-none-any.whl (509 kB)
Installing collected packages: pytz, tzdata, pandas
Successfully installed pandas-2.3.1 pytz-2025.2 tzdata-2025.2

### 1. Absolute Maximum Scaling

This method takes two steps:

1. Find the maximum absolute value among all the entries of a given feature (column).
2. Divide each entry of the column by this maximum value.

$X_{\text{scaled}} = \frac{X_i}{\max(|X|)}$

After these two steps, every entry of the column lies in the range -1 to 1. This method is not used very often because it is highly sensitive to outliers: a single extreme value sets the scale for the whole column, and outliers are very common in real-world data.
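
The two steps above can be sketched on a small made-up array, once by hand and once with scikit-learn's `MaxAbsScaler`. The toy values are assumptions chosen for illustration and are not taken from the dataset used in this notebook.

```python
import numpy as np
from sklearn.preprocessing import MaxAbsScaler

# Toy data (hypothetical values): two features on very different scales.
X = np.array([[ 10.0, -200.0],
              [-50.0,  100.0],
              [ 25.0,   50.0]])

# Step 1: maximum absolute value of each column.
max_abs = np.abs(X).max(axis=0)

# Step 2: divide every entry by its column's maximum absolute value.
manual = X / max_abs

# MaxAbsScaler performs the same per-column division.
sk = MaxAbsScaler().fit_transform(X)

print(manual)
print(np.allclose(manual, sk))
```

Because each column is divided by its own largest magnitude, the entry with that magnitude maps to exactly 1 or -1 and everything else falls in between.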

In [3]:
import pandas as pd
df = pd.read_csv('SampleFile.csv')
print(df.head())

   LotArea  MSSubClass
0     8450          60
1     9600          20
2    11250          60
3     9550          70
4    14260          60


Now let's apply the first method, absolute maximum scaling. First, we compute the maximum absolute value of each column.

In [5]:
import numpy as np
# Column-wise maximum of the absolute values
max_vals = np.abs(df).max()
max_vals

LotArea       215245
MSSubClass       190
dtype: int64

Now we divide each column by its maximum absolute value.

In [6]:
print(df / max_vals)

       LotArea  MSSubClass
0     0.039258    0.315789
1     0.044600    0.105263
2     0.052266    0.315789
3     0.044368    0.368421
4     0.066250    0.315789
...        ...         ...
1455  0.036781    0.315789
1456  0.061209    0.105263
1457  0.042008    0.368421
1458  0.045144    0.105263
1459  0.046166    0.105263

[1460 rows x 2 columns]


### 2. Min-Max Scaling

This method also takes two steps:

1. Find the minimum and the maximum value of the column.
2. Subtract the minimum value from each entry and divide the result by the difference between the maximum and the minimum value.

$$
X_{\text{scaled}} = \frac{X_i - X_{\min}}{X_{\max} - X_{\min}}
$$

Because it relies on the minimum and the maximum value, this method is also sensitive to outliers. After the two steps above, the data lies in the range 0 to 1.
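
As a minimal sketch of the formula, the snippet below computes min-max scaling by hand on a made-up array and compares it with `MinMaxScaler`. The toy values are assumptions for illustration only.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy data (hypothetical values): three samples, two features.
X = np.array([[ 1.0,  20.0],
              [ 5.0,  60.0],
              [11.0, 190.0]])

# Column-wise minimum and maximum.
col_min = X.min(axis=0)
col_max = X.max(axis=0)

# (X - min) / (max - min), applied column by column.
manual = (X - col_min) / (col_max - col_min)

# MinMaxScaler with its default feature_range=(0, 1) gives the same result.
sk = MinMaxScaler().fit_transform(X)

print(manual)
print(np.allclose(manual, sk))
```

The smallest entry of each column maps to 0 and the largest to 1, which is why one extreme value can push all other entries toward one end of the range.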

In [7]:
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
scaled_data = scaler.fit_transform(df)
scaled_df = pd.DataFrame(scaled_data, 
                         columns=df.columns)
scaled_df.head()

    LotArea  MSSubClass
0  0.033420    0.235294
1  0.038795    0.000000
2  0.046507    0.235294
3  0.038561    0.294118
4  0.060576    0.235294


### 3. Normalization

Normalization rescales each data point (each row, treated as a vector) so that it has a length of 1. This is done by dividing every component of the data point by its "length", the Euclidean (L2) norm. Think of it as shrinking or stretching a vector until it fits a standard length of 1, without changing its direction.

The formula for Normalization looks like this:

$$
X_{\text{scaled}} = \frac{X_i}{\|X\|}
$$

Where:

- $X_i$ is the $i^{\text{th}}$ element of the vector $X$
- $\|X\|$ is the Euclidean norm: $\|X\| = \sqrt{\sum_{i=1}^n X_i^2}$
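
The formula can be checked by hand. The sketch below takes the first three rows printed by `df.head()` earlier, divides each row by its own Euclidean norm, and compares the result with `Normalizer`. Typing the rows in by hand (rather than reading the CSV) is an assumption made to keep the example self-contained.

```python
import numpy as np
from sklearn.preprocessing import Normalizer

# First three rows (LotArea, MSSubClass) as shown by df.head() above.
X = np.array([[ 8450.0, 60.0],
              [ 9600.0, 20.0],
              [11250.0, 60.0]])

# Euclidean norm of each row; keepdims=True lets the division broadcast per row.
row_norms = np.linalg.norm(X, axis=1, keepdims=True)
manual = X / row_norms

# Normalizer works row by row and uses the L2 norm by default.
sk = Normalizer(norm="l2").fit_transform(X)

print(manual.round(6))
print(np.linalg.norm(manual, axis=1))  # length of every scaled row
```

Note that this operates on rows (samples), unlike the other scalers in this section, which operate on columns (features). Since `LotArea` is much larger than `MSSubClass`, it dominates each row's norm and ends up very close to 1.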

In [8]:
from sklearn.preprocessing import Normalizer

scaler = Normalizer()
scaled_data = scaler.fit_transform(df)
scaled_df = pd.DataFrame(scaled_data,
                         columns=df.columns)
print(scaled_df.head())

    LotArea  MSSubClass
0  0.999975    0.007100
1  0.999998    0.002083
2  0.999986    0.005333
3  0.999973    0.007330
4  0.999991    0.004208


### 4. Standardization
This method of scaling is based on the mean and the standard deviation of the data.

1. First, calculate the mean and standard deviation of the feature we would like to standardize.
2. Then subtract the mean from each entry and divide the result by the standard deviation.

This gives the feature a mean of zero and a standard deviation of 1. It does not change the shape of the distribution, so data that is not normally distributed stays that way.

$$
X_{\text{scaled}} = \frac{X_i - X_{\text{mean}}}{\sigma}
$$
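
A minimal sketch of the formula on made-up data, compared against `StandardScaler`. The toy values are assumptions for illustration only.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy data (hypothetical values): four samples, two features.
X = np.array([[ 1.0, 100.0],
              [ 2.0, 300.0],
              [ 3.0, 200.0],
              [10.0, 400.0]])

# Column-wise mean and population standard deviation (ddof=0),
# which is the convention StandardScaler uses.
mean = X.mean(axis=0)
std = X.std(axis=0)

manual = (X - mean) / std
sk = StandardScaler().fit_transform(X)

print(manual.round(3))
print(manual.mean(axis=0).round(6), manual.std(axis=0).round(6))
```

One detail to watch when reproducing this with pandas: `DataFrame.std()` defaults to the sample standard deviation (`ddof=1`), so a hand calculation with pandas defaults will differ slightly from `StandardScaler` unless `ddof=0` is passed.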


In [9]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
scaled_data = scaler.fit_transform(df)
scaled_df = pd.DataFrame(scaled_data,
                         columns=df.columns)
print(scaled_df.head())

    LotArea  MSSubClass
0 -0.207142    0.073375
1 -0.091886   -0.872563
2  0.073480    0.073375
3 -0.096897    0.309859
4  0.375148    0.073375


### 5. Robust Scaling

This method of scaling uses two statistical measures of the data:

- Median
- Interquartile range (IQR)

After calculating these two values, we subtract the median from each entry and divide the result by the interquartile range.

$$
X_{\text{scaled}} = \frac{X_i - X_{\text{median}}}{\text{IQR}}
$$

**Why use Robust Scaling?**
Imagine a dataset of house prices where most houses cost between \$100k and \$500k, but there is one mansion priced at \$10 million.

With Min-Max scaling or StandardScaler, that mansion would dominate the scaling statistics (the maximum, or the mean and standard deviation) and squeeze the rest of the data into a narrow band.

RobustScaler instead computes its statistics from the median and the middle 50% of the data (the interquartile range), which a few extreme values barely affect, so the typical values end up more evenly spread.
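
The house-price scenario can be sketched as follows. The six prices are made-up values chosen to match the description (five ordinary houses plus one mansion), not real data.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler, RobustScaler

# Hypothetical prices: five ordinary houses and one $10 million mansion.
prices = np.array(
    [100_000, 200_000, 300_000, 400_000, 500_000, 10_000_000],
    dtype=float,
).reshape(-1, 1)

spread = {}
for scaler in (MinMaxScaler(), StandardScaler(), RobustScaler()):
    scaled = scaler.fit_transform(prices).ravel()
    typical = scaled[:-1]  # everything except the mansion
    # How far apart the ordinary houses are after scaling.
    spread[type(scaler).__name__] = typical.max() - typical.min()

for name, value in spread.items():
    print(f"{name}: spread of ordinary houses = {value:.3f}")
```

The ordinary houses keep a much wider spread under `RobustScaler` than under the other two scalers, which is the effect described above.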

**Example** Let’s say we have the following values for a feature:
```
[1, 2, 2, 3, 4, 5, 100]
```
- Median = 3
- Q1 = 2, Q3 = 5 (the medians of the lower and upper halves) → IQR = 5 - 2 = 3

Now apply Robust Scaling:
```
(1 - 3) / 3 = -0.67
(2 - 3) / 3 = -0.33
(2 - 3) / 3 = -0.33
(3 - 3) / 3 =  0.0
(4 - 3) / 3 =  0.33
(5 - 3) / 3 =  0.67
(100 - 3) / 3 = 32.33  ← still large, but doesn't affect the rest
```
The middle values are well-scaled, and the outlier (100) is still high but doesn't **squash** the rest of the data.
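
Quartile conventions differ between tools. The hand calculation above takes Q1 and Q3 as the medians of the lower and upper halves, while NumPy's default (linear interpolation), which `RobustScaler` relies on, can give slightly different quartiles for small samples. The sketch below reruns the same seven values to show what the library actually computes.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

x = np.array([1, 2, 2, 3, 4, 5, 100], dtype=float)

# Median and quartiles with NumPy's default (linear) interpolation.
median = np.median(x)
q1, q3 = np.percentile(x, [25, 75])
iqr = q3 - q1

# Robust scaling by hand with those statistics.
manual = (x - median) / iqr

# RobustScaler expects a 2-D array, so reshape to a single column.
sk = RobustScaler().fit_transform(x.reshape(-1, 1)).ravel()

print(median, q1, q3, iqr)
print(manual.round(2))
print(np.allclose(manual, sk))
```

With this convention Q3 comes out as 4.5 rather than 5, so the IQR is 2.5 and the scaled values differ a little from the hand-worked ones. The conclusion is unchanged: the middle values stay well spread and the outlier remains large.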

**When to Use RobustScaler?**

Use it when
- Your data has outliers.
- You want resilient and stable scaling for models sensitive to feature scales (like SVMs, k-NN, or logistic regression).

In [10]:
from sklearn.preprocessing import RobustScaler

scaler = RobustScaler()
scaled_data = scaler.fit_transform(df)
scaled_df = pd.DataFrame(scaled_data,
                         columns=df.columns)
print(scaled_df.head())

    LotArea  MSSubClass
0 -0.254076         0.2
1  0.030015        -0.6
2  0.437624         0.2
3  0.017663         0.4
4  1.181201         0.2


#### Scaler Comparison Table

| **Scaler**                       | **Best When...**                                                                           | **Sensitive to Outliers?** | **Typical Output Range** | **Use Case Examples**                                |
| -------------------------------- | ------------------------------------------------------------------------------------------ | -------------------------- | ------------------------ | ---------------------------------------------------- |
| **1. Absolute Max Scaling**      | Data needs to be scaled relative to the largest absolute value                             | Yes                        | -1 to 1                  | Simple range compression, rarely used in practice    |
| **2. Min-Max Scaling**           | You want to bring data to a fixed range (e.g., 0–1) and there are no or few outliers       | Yes                        | 0 to 1 (or custom range) | Neural networks, image data, constrained algorithms  |
| **3. Normalization (L1/L2)**     | You want to scale **samples** (rows) to unit norm (L1: absolute values sum to 1; L2: Euclidean norm = 1) | Yes          | Vector norm = 1          | Text classification (TF-IDF), KNN, cosine similarity |
| **4. Standardization (Z-Score)** | Features are roughly normally distributed, or the model expects zero-mean, unit-variance inputs | Yes                   | Mean = 0, Std = 1        | Logistic regression, SVMs, PCA                       |
| **5. Robust Scaling**            | Your data has outliers and you don’t want them to skew your model                          | No                         | Centered around 0        | Regression, SVMs, k-NN on data with outliers         |
