# 1. Scaling Numerical Features (Standardization)

Standardization, also known as Z-score normalization, is a common technique used in feature scaling to transform numerical features so that they have a mean of 0 and a standard deviation of 1. This process centers the data around zero and scales it based on its variability.

Why Standardize Numerical Features?

1. Sensitivity of Algorithms to Feature Scales: Many machine learning algorithms are sensitive to the scale of input features. Models trained with gradient descent (neural networks, and linear or logistic regression when fitted iteratively) usually converge faster when features are on similar scales. Distance-based algorithms (k-nearest neighbors, k-means clustering, support vector machines with radial basis function kernels) compute distances between data points, so features with larger numeric ranges can dominate those distances.
2. Preventing Feature Dominance: If one numerical feature has a much larger range of values than others, it might dominate the learning process in some algorithms, even if it's not the most important feature. Standardization helps to give all features a more equal footing.
3. Improving Model Stability: Extreme values or different scales can sometimes lead to numerical instability during the training of certain models. Standardization can help mitigate this.
4. Assumption of Data Distribution: Some statistical techniques and models assume that the data is normally distributed or at least symmetric around the mean. Standardization does not make non-normal data normal; it only shifts the feature to a mean of 0 and rescales it to unit variance, which suits methods that expect centered, comparably scaled inputs.
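To make the point about distance calculations concrete, here is a minimal sketch using the Price_INR and Rating_Out_Of_5 columns of this notebook's sample dataset. The helper `price_share_of_distance` is a hypothetical name introduced only for this illustration; it reports how much of the squared Euclidean distance between two rows comes from the price column, before and after z-score scaling.

```python
import numpy as np

# Price_INR and Rating_Out_Of_5 columns from the sample dataset in this notebook
features = np.array([
    [320, 4.2], [409, 3.9], [650, 4.5], [1200, 4.1],
    [850, 3.7], [500, 4.0], [999, 4.3], [250, 3.5],
])

def price_share_of_distance(a, b):
    """Fraction of the squared Euclidean distance contributed by the first column.

    Hypothetical helper, used here for illustration only.
    """
    squared_diff = (a - b) ** 2
    return squared_diff[0] / squared_diff.sum()

# Distance between the first two products on the raw scale
raw_share = price_share_of_distance(features[0], features[1])

# Same comparison after standardizing each column (population std, ddof=0)
scaled = (features - features.mean(axis=0)) / features.std(axis=0)
scaled_share = price_share_of_distance(scaled[0], scaled[1])

print("price share of squared distance (raw):   ", round(raw_share, 5))
print("price share of squared distance (scaled):", round(scaled_share, 5))
```

On the raw scale almost all of the distance comes from the price difference, simply because prices are in the hundreds while ratings differ by fractions of a point. After scaling, the rating difference is able to contribute as well.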

How Standardization Works (The Formula):

For each value x in a numerical feature, the standardized value x_scaled is calculated using the following formula:

x_scaled = (x - mean) / standard deviation

Where:

mean is the average value of the feature, computed from the training data.
standard deviation measures the spread of the feature's values around that mean. scikit-learn's StandardScaler uses the population standard deviation (dividing by n rather than n - 1).
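As a worked instance of the formula, the sketch below applies it by hand to the Price_INR values from this notebook's sample dataset. It assumes the population standard deviation (`ddof=0` in NumPy), which is the convention StandardScaler follows; the variable names are illustrative.

```python
import numpy as np

# Price_INR values from the sample dataset used later in this notebook
prices = np.array([320, 409, 650, 1200, 850, 500, 999, 250], dtype=float)

mean = prices.mean()        # average of the feature
std = prices.std(ddof=0)    # population standard deviation (divides by n)

# x_scaled = (x - mean) / standard deviation, applied to every value at once
prices_scaled = (prices - mean) / std

print("mean:", mean)
print("standard deviation:", round(std, 4))
print("scaled values:", np.round(prices_scaled, 4))
print("mean of scaled values:", round(abs(prices_scaled.mean()), 10))
print("std of scaled values:", round(prices_scaled.std(ddof=0), 10))
```

The mean works out to 647.25, and the first scaled value is approximately -1.0261, which agrees with the Price_INR column in the scaled output shown later in this notebook.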

Steps in Standardization:

1. Calculate the Mean: Compute the average value for the numerical feature across all data points in your training set.
2. Calculate the Standard Deviation: Compute the standard deviation for the same feature across the training set.
3. Apply the Formula: For each value of the feature in both the training and testing (and future unseen) datasets, subtract the calculated mean and then divide by the calculated standard deviation.
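The three steps can be sketched without any library scaler. The split below (first six Price_INR values as training data, last two as test data) is a hypothetical choice made only for illustration.

```python
import numpy as np

# Hypothetical split of the Price_INR values: first six rows train, last two test
train = np.array([320, 409, 650, 1200, 850, 500], dtype=float)
test = np.array([999, 250], dtype=float)

# Steps 1 and 2: the statistics come from the training set only
train_mean = train.mean()
train_std = train.std(ddof=0)

# Step 3: the same two numbers are applied to every dataset
train_scaled = (train - train_mean) / train_std
test_scaled = (test - train_mean) / train_std

print("training mean / std:", round(train_mean, 4), round(train_std, 4))
print("scaled train:", np.round(train_scaled, 4))
print("scaled test: ", np.round(test_scaled, 4))
```

The scaled training values have a mean of 0 and a standard deviation of 1 by construction. The scaled test values generally do not, and that is expected: they are expressed in units of the training distribution.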

Important Considerations:

1. Apply Statistics from Training Data: It is crucial to calculate the mean and standard deviation only from the training data and then use these values to transform both the training and any subsequent testing or new data. This prevents data leakage from the test set into the training process.
2. Impact on Distribution: Standardization changes the scale and center of the data but does not change the underlying shape of the distribution. If your data is heavily skewed, it will remain skewed after standardization.
3. When to Use: Standardization is often preferred when the algorithm being used is sensitive to feature scales, and when the data roughly follows a Gaussian (normal) distribution.
4. Alternative: Min-Max Scaling (Normalization): Another common scaling technique is Min-Max scaling, which scales features to a specific range (typically [0, 1]). The choice between standardization and normalization often depends on the specific algorithm and the characteristics of the data. When the data contains outliers, standardization is usually less distorted than min-max scaling, whose output range is set entirely by the minimum and maximum values; the mean and standard deviation are still affected by outliers, though.

In summary, standardization is a vital step in preparing numerical data for many machine learning algorithms by rescaling features to have a mean of 0 and a standard deviation of 1, helping to improve model performance and stability.

# Import necessary dependencies

In [41]:
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Create sample dataset

In [42]:
# Sample DataFrame representing features of Men's Sports Apparel
data = pd.DataFrame({
    'Price_INR': [320, 409, 650, 1200, 850, 500, 999, 250],  # Prices in Indian Rupees
    'Rating_Out_Of_5': [4.2, 3.9, 4.5, 4.1, 3.7, 4.0, 4.3, 3.5],
    'Number_of_Reviews': [55, 32, 120, 78, 25, 40, 95, 15],
    'Discount_Percentage': [0.0, 0.15, 0.05, 0.20, 0.10, 0.0, 0.25, 0.30]
})

print("Original Data:")
data

Original Data:


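|   | Price_INR | Rating_Out_Of_5 | Number_of_Reviews | Discount_Percentage |
|---|-----------|-----------------|-------------------|---------------------|
| 0 | 320 | 4.2 | 55 | 0.0 |
| 1 | 409 | 3.9 | 32 | 0.15 |
| 2 | 650 | 4.5 | 120 | 0.05 |
| 3 | 1200 | 4.1 | 78 | 0.2 |
| 4 | 850 | 3.7 | 25 | 0.1 |
| 5 | 500 | 4.0 | 40 | 0.0 |
| 6 | 999 | 4.3 | 95 | 0.25 |
| 7 | 250 | 3.5 | 15 | 0.3 |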


# Implementing Standardization of Numerical Features

In [43]:
# 1. Initialize the StandardScaler

scaler = StandardScaler()

In [44]:
# 2. Fit the scaler to the data
# This calculates the mean and standard deviation for each numerical column.
# It's important to fit on the training data only in a real-world scenario.

scaler.fit(data)

In [45]:
# 3. Transform the data
# This applies the standardization formula to each value in each numerical column
# using the mean and standard deviation calculated in the fit step.

scaled_data = scaler.transform(data)

In [46]:
# 4. Create a new DataFrame with the scaled values

scaled_df = pd.DataFrame(scaled_data, columns=data.columns)

In [47]:
scaled_df

|   | Price_INR | Rating_Out_Of_5 | Number_of_Reviews | Discount_Percentage |
|---|-----------|-----------------|-------------------|---------------------|
| 0 | -1.026093 | 0.57735 | -0.072327 | -1.239591 |
| 1 | -0.747033 | -0.412393 | -0.737737 | 0.177084 |
| 2 | 0.008623 | 1.567094 | 1.808179 | -0.767366 |
| 3 | 1.733149 | 0.247436 | 0.593083 | 0.649309 |
| 4 | 0.635723 | -1.072222 | -0.940253 | -0.295141 |
| 5 | -0.461703 | -0.082479 | -0.50629 | -1.239591 |
| 6 | 1.102913 | 0.907265 | 1.084908 | 1.121535 |
| 7 | -1.245578 | -1.732051 | -1.229562 | 1.59376 |


In [48]:
# Get the per-feature mean (mean_) and standard deviation (scale_) learned during fit
print("\nMean of each feature (from the data used for fitting):")
print(scaler.mean_)
print("\nStandard deviation of each feature (from the data used for fitting):")
print(scaler.scale_)


Mean of each feature (from the data used for fitting):
[6.4725e+02 4.0250e+00 5.7500e+01 1.3125e-01]

Standard deviation of each feature (from the data used for fitting):
[3.18928185e+02 3.03108891e-01 3.45651559e+01 1.05881715e-01]
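As a cross-check on these numbers, the sketch below rebuilds the sample DataFrame and compares `scaler.scale_` against the standard deviation computed by pandas. It also uses `inverse_transform` to confirm that scaling is reversible. The variable names are illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Same sample data as earlier in this notebook
data = pd.DataFrame({
    'Price_INR': [320, 409, 650, 1200, 850, 500, 999, 250],
    'Rating_Out_Of_5': [4.2, 3.9, 4.5, 4.1, 3.7, 4.0, 4.3, 3.5],
    'Number_of_Reviews': [55, 32, 120, 78, 25, 40, 95, 15],
    'Discount_Percentage': [0.0, 0.15, 0.05, 0.20, 0.10, 0.0, 0.25, 0.30]
})

scaler = StandardScaler().fit(data)

# pandas defaults to the sample standard deviation (ddof=1);
# StandardScaler uses the population standard deviation (ddof=0)
pop_std = data.std(ddof=0).to_numpy()
sample_std = data.std().to_numpy()

matches_population = np.allclose(scaler.scale_, pop_std)
matches_sample = np.allclose(scaler.scale_, sample_std)
print("scale_ equals population std:", matches_population)
print("scale_ equals sample std:    ", matches_sample)

# inverse_transform undoes the scaling and recovers the original values
restored = scaler.inverse_transform(scaler.transform(data))
round_trip_ok = np.allclose(restored, data.to_numpy())
print("round trip recovers original:", round_trip_ok)
```

The distinction matters when comparing results by hand: calling `data.std()` without `ddof=0` gives slightly larger values than `scaler.scale_`, and with only eight rows the gap is noticeable.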


Important Note for Real-World Applications:

In a real machine learning workflow, it's crucial to:

1. Fit the StandardScaler only on your training data.

2. Use the transform() method (with the fitted scaler) on both your training and testing data to avoid data leakage and ensure that the scaling is consistent across all datasets. You should never fit the scaler on the entire dataset (including the test set) before training your model.
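A minimal sketch of that workflow follows, assuming a 75/25 split made with scikit-learn's `train_test_split`. The split ratio and `random_state` are arbitrary choices for illustration, and with only eight rows the example shows the mechanics rather than a realistic experiment.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Same sample data as earlier in this notebook
data = pd.DataFrame({
    'Price_INR': [320, 409, 650, 1200, 850, 500, 999, 250],
    'Rating_Out_Of_5': [4.2, 3.9, 4.5, 4.1, 3.7, 4.0, 4.3, 3.5],
    'Number_of_Reviews': [55, 32, 120, 78, 25, 40, 95, 15],
    'Discount_Percentage': [0.0, 0.15, 0.05, 0.20, 0.10, 0.0, 0.25, 0.30]
})

# Hypothetical 75/25 split; random_state only makes the sketch repeatable
train_df, test_df = train_test_split(data, test_size=0.25, random_state=42)

# Fit on the training rows only, so test rows never influence the statistics
scaler = StandardScaler()
scaler.fit(train_df)

# Transform both sets with the statistics learned from the training rows
train_scaled = pd.DataFrame(
    scaler.transform(train_df), columns=data.columns, index=train_df.index
)
test_scaled = pd.DataFrame(
    scaler.transform(test_df), columns=data.columns, index=test_df.index
)

print("Training means after scaling (approximately 0):")
print(train_scaled.mean().round(6))
print("\nTest means after scaling (not necessarily 0):")
print(test_scaled.mean().round(6))
```

The test set means are usually not zero after scaling. This is the intended behavior, since the test rows are measured against the training distribution.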