# 1. Scaling Numerical Features (Robust Scaling)

Robust Scaling is a feature scaling technique that is less sensitive to outliers than Min-Max Scaling and Standardization. Instead of the mean and standard deviation (as in Standardization) or the minimum and maximum values (as in Min-Max Scaling), Robust Scaling centers each feature on its median and scales it by its interquartile range (IQR).

Why Use Robust Scaling?

1. Handling Outliers: The primary advantage of Robust Scaling is that outliers have little influence on it. Outliers can significantly skew the mean and standard deviation, which distorts Standardization. Similarly, a single extreme minimum or maximum compresses the majority of the data into a small range in Min-Max Scaling. The median and IQR depend only on the central portion of the data, so a few extreme values barely change them.
2. Data with Non-Gaussian Distributions: If your data is skewed or heavy-tailed rather than approximately normal (Gaussian) and contains outliers, Robust Scaling can often provide a more stable and representative scaling than other methods.
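
To make the two points above concrete, here is a minimal sketch that applies all three formulas by hand with NumPy. The `prices` array is an illustrative assumption (nine prices with one extreme value, 2500), and the variable names are made up for this example.

```python
import numpy as np

# Illustrative prices; 2500 is the extreme value
prices = np.array([320, 409, 650, 1200, 850, 500, 999, 250, 2500], dtype=float)

# Min-Max Scaling: (x - min) / (max - min)
min_max = (prices - prices.min()) / (prices.max() - prices.min())

# Standardization: (x - mean) / std
standardized = (prices - prices.mean()) / prices.std()

# Robust Scaling: (x - median) / IQR
q1, median, q3 = np.percentile(prices, [25, 50, 75])
robust = (prices - median) / (q3 - q1)

# Compare how far the centre statistics move when the outlier is dropped
inliers = prices[prices < 2500]
print("mean   with / without outlier:", prices.mean(), inliers.mean())
print("median with / without outlier:", np.median(prices), np.median(inliers))
print("largest min-max value among the inliers:", min_max[prices < 2500].max())
```

With these numbers the mean moves from 647.25 to roughly 853 once the 2500 value is included, while the median only moves from 575 to 650. Min-Max Scaling confines the eight ordinary prices to about [0, 0.42], because the single outlier claims the rest of the range.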

How Robust Scaling Works (The Formula):

For each value x in a numerical feature, the Robust Scaled value x_scaled is calculated using the following formula:

x_scaled = (x - median) / IQR

Where:

x is the original value of the feature.

median is the middle value of the feature when the data is sorted (the 50th percentile).

IQR is the interquartile range, which is the difference between the 75th percentile (Q3) and the 25th percentile (Q1) of the feature's values. It represents the spread of the middle 50% of the data.
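
As a small worked example of the formula, the sketch below scales a hypothetical five-value feature. It assumes NumPy's default percentile method (linear interpolation), which is why Q1 and Q3 land exactly on data points here.

```python
import numpy as np

# Hypothetical feature; 100 is an obvious outlier
values = np.array([1.0, 2.0, 3.0, 4.0, 100.0])

median = np.median(values)                # 3.0
q1, q3 = np.percentile(values, [25, 75])  # 2.0 and 4.0
iqr = q3 - q1                             # 2.0

x_scaled = (values - median) / iqr
print(x_scaled.tolist())  # [-1.0, -0.5, 0.0, 0.5, 48.5]
```

The four ordinary values end up between -1 and 0.5, while the outlier stays far away at 48.5. The scaling does not pull it in; it only stops it from dictating the scale for the other values.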

Steps in Robust Scaling:

1. Calculate the Median: Compute the median value for the numerical feature across all data points in your training set.
2. Calculate the Interquartile Range (IQR): Compute the IQR for the same feature across the training set by finding the difference between the 75th and 25th percentiles.
3. Apply the Formula: For each value of the feature in both the training and testing (and future unseen) datasets, subtract the calculated median and then divide by the calculated IQR.
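
The three steps above can be sketched in plain NumPy as follows. The helper names `fit_robust` and `transform_robust` and the `train` / `test` arrays are hypothetical, chosen only for illustration. The sketch does not guard against a constant feature, where the IQR would be 0.

```python
import numpy as np

def fit_robust(train_values):
    """Steps 1 and 2: learn the median and IQR from the training set only."""
    median = np.median(train_values)
    q1, q3 = np.percentile(train_values, [25, 75])
    return median, q3 - q1

def transform_robust(values, median, iqr):
    """Step 3: apply (x - median) / IQR using the training statistics."""
    return (np.asarray(values, dtype=float) - median) / iqr

# Hypothetical training and test values for a single feature
train = np.array([15, 25, 32, 40, 55, 78, 95, 120, 300], dtype=float)
test = np.array([10, 55, 500], dtype=float)

median, iqr = fit_robust(train)  # statistics come from train only
train_scaled = transform_robust(train, median, iqr)
test_scaled = transform_robust(test, median, iqr)

print("median:", median, "IQR:", iqr)
print("scaled test values:", test_scaled)
```

Note that `test` is never passed to `fit_robust`; it is only transformed with the statistics learned from `train`.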

Important Considerations:

1. Apply Statistics from Training Data: As with other scaling methods, compute the median and IQR from the training data only, then reuse those values to transform the training data, the test data, and any new data. Computing them on the full dataset leaks information from the test set into preprocessing.
2. Preserves Relative Order: Robust Scaling is a linear transformation with a positive scale factor, so it preserves the relative order of the data points within the feature. Outliers are not removed or clipped; they remain extreme values after scaling.
3. Range of Scaled Data: The scaled data from Robust Scaling does not have a fixed range like [0, 1] in Min-Max Scaling, nor does it guarantee a specific mean and standard deviation like Standardization. On the data used for fitting, the scaled feature has a median of 0 and an IQR of 1; the overall range depends on the distribution of the data and the presence of outliers.
4. When to Use: Robust Scaling is particularly useful when you suspect or know that your data contains outliers and you want a scaling method that is not heavily influenced by these extreme values.
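
Considerations 2 and 3 can be checked directly. The following is a minimal sketch using scikit-learn's `RobustScaler` with its default settings on an illustrative single-column array (the values are an assumption for this example).

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

# Illustrative single feature as a column vector; 2500 is the outlier
x = np.array([[320.0], [409.0], [650.0], [1200.0], [850.0],
              [500.0], [999.0], [250.0], [2500.0]])
scaled = RobustScaler().fit_transform(x)

# Relative order: sorting the original and the scaled values gives the same ranking
same_order = np.array_equal(np.argsort(x.ravel()), np.argsort(scaled.ravel()))

# On the fitted data the median maps to 0 and the IQR to 1,
# but the minimum and maximum are not fixed in advance
q1, med, q3 = np.percentile(scaled, [25, 50, 75])
print("order preserved:", same_order)
print("median:", med, "IQR:", q3 - q1)
print("min:", scaled.min(), "max:", scaled.max())
```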

In summary, Robust Scaling is a valuable feature scaling technique when dealing with data that contains outliers, as it uses statistics (median and IQR) that are less sensitive to these extreme values, providing a more stable scaling of the majority of the data.

# Import necessary dependencies

In [56]:
import pandas as pd
from sklearn.preprocessing import RobustScaler

# Create sample dataset

In [57]:
# Sample DataFrame representing features of Men's Sports Apparel
data = pd.DataFrame({
    'Price_INR': [320, 409, 650, 1200, 850, 500, 999, 250, 2500],  # Potential outlier
    'Rating_Out_Of_5': [4.2, 3.9, 4.5, 4.1, 3.7, 4.0, 4.3, 3.5, 2.5],  # Potential outlier
    'Number_of_Reviews': [55, 32, 120, 78, 25, 40, 95, 15, 300],  # Potential outlier
    'Discount_Percentage': [0.0, 0.15, 0.05, 0.20, 0.10, 0.0, 0.25, 0.30, 0.50]  # Potential outlier
})

print("Original Data:")
data

Original Data:


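|  | Price_INR | Rating_Out_Of_5 | Number_of_Reviews | Discount_Percentage |
|---|---|---|---|---|
| 0 | 320 | 4.2 | 55 | 0.0 |
| 1 | 409 | 3.9 | 32 | 0.15 |
| 2 | 650 | 4.5 | 120 | 0.05 |
| 3 | 1200 | 4.1 | 78 | 0.2 |
| 4 | 850 | 3.7 | 25 | 0.1 |
| 5 | 500 | 4.0 | 40 | 0.0 |
| 6 | 999 | 4.3 | 95 | 0.25 |
| 7 | 250 | 3.5 | 15 | 0.3 |
| 8 | 2500 | 2.5 | 300 | 0.5 |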


# Numerical Features (Robust Scaling) implementation

In [58]:
# 1. Initialize the RobustScaler
scaler = RobustScaler()

In [59]:
# 2. Fit the scaler to the data
# This calculates the median and interquartile range (IQR) for each numerical column.
# It's important to fit on the training data only in a real-world scenario.

scaler.fit(data)

In [60]:
# 3. Transform the data
# This applies the Robust Scaling formula to each value in each numerical column
# using the median and IQR calculated in the fit step.

scaled_data = scaler.transform(data)

In [61]:
# 4. Create a new DataFrame with the scaled values
scaled_df = pd.DataFrame(scaled_data, columns=data.columns)

print("\nRobust Scaled Data:")
scaled_df


Robust Scaled Data:


|  | Price_INR | Rating_Out_Of_5 | Number_of_Reviews | Discount_Percentage |
|---|---|---|---|---|
| 0 | -0.559322 | 0.4 | 0.0 | -0.75 |
| 1 | -0.408475 | -0.2 | -0.365079 | 0.0 |
| 2 | 0.0 | 1.0 | 1.031746 | -0.5 |
| 3 | 0.932203 | 0.2 | 0.365079 | 0.25 |
| 4 | 0.338983 | -0.6 | -0.47619 | -0.25 |
| 5 | -0.254237 | 0.0 | -0.238095 | -0.75 |
| 6 | 0.591525 | 0.6 | 0.634921 | 0.5 |
| 7 | -0.677966 | -1.0 | -0.634921 | 0.75 |
| 8 | 3.135593 | -3.0 | 3.888889 | 1.75 |


In [62]:
# 5. Inspect the median (center_) and IQR (scale_) learned during the fit step
print("\nMedian of each feature (from the data used for fitting):")
print(scaler.center_)
print("\nInterquartile Range (IQR) of each feature (from the data used for fitting):")
print(scaler.scale_)


Median of each feature (from the data used for fitting):
[6.5e+02 4.0e+00 5.5e+01 1.5e-01]

Interquartile Range (IQR) of each feature (from the data used for fitting):
[5.9e+02 5.0e-01 6.3e+01 2.0e-01]
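
As a sanity check on the values printed above, the sketch below recomputes the median and IQR with pandas and applies the formula by hand. It rebuilds two of the sample columns so that it runs on its own. It assumes the scaler's default settings and pandas' default (linear) quantile interpolation.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import RobustScaler

# Two columns from the sample dataset, repeated here so the block is self-contained
data = pd.DataFrame({
    'Price_INR': [320, 409, 650, 1200, 850, 500, 999, 250, 2500],
    'Number_of_Reviews': [55, 32, 120, 78, 25, 40, 95, 15, 300],
})

scaler = RobustScaler().fit(data)

# Per-column median and IQR computed independently of the scaler
medians = data.median()
iqrs = data.quantile(0.75) - data.quantile(0.25)

centers_match = np.allclose(scaler.center_, medians.to_numpy())
scales_match = np.allclose(scaler.scale_, iqrs.to_numpy())

# Applying (x - median) / IQR by hand should reproduce scaler.transform
manual = (data - medians) / iqrs
transform_matches = np.allclose(manual.to_numpy(), scaler.transform(data))

print(centers_match, scales_match, transform_matches)
```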


Important Note for Real-World Applications:

As with other scaling methods:

1. Fit the RobustScaler only on your training data.
2. Use the transform() method (with the fitted scaler) on both your training and testing data. This prevents data leakage and ensures consistent scaling.
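
A minimal sketch of that workflow is shown below. The split is done with `train_test_split`, and the `test_size` and `random_state` values are arbitrary choices for illustration. With only nine rows the split is purely a demonstration of the call order, not a realistic evaluation setup.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import RobustScaler

# Two columns from the sample dataset, repeated here so the block is self-contained
data = pd.DataFrame({
    'Price_INR': [320, 409, 650, 1200, 850, 500, 999, 250, 2500],
    'Number_of_Reviews': [55, 32, 120, 78, 25, 40, 95, 15, 300],
})

# Split first, so the test rows never influence the scaling statistics
train, test = train_test_split(data, test_size=0.33, random_state=42)

scaler = RobustScaler()

# fit_transform on the training data: learns median and IQR, then scales
train_scaled = pd.DataFrame(
    scaler.fit_transform(train), columns=train.columns, index=train.index
)

# transform only on the test data: reuses the training median and IQR
test_scaled = pd.DataFrame(
    scaler.transform(test), columns=test.columns, index=test.index
)

print(train_scaled)
print(test_scaled)
```

Because the scaler is fitted on `train` alone, `scaler.center_` equals the training medians rather than the medians of the full dataset.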