**What is Feature Scaling?**

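Feature scaling is a technique for bringing the independent variables (features) in your data onto a comparable range. It matters because many machine learning algorithms, particularly distance-based ones such as k-nearest neighbors and models trained with gradient descent, work better when all features are on a similar scale.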

Imagine comparing people's heights (in centimeters, ranging from 150 to 200) with their salaries (in dollars, ranging from 30,000 to 100,000). The salary values are hundreds of times larger, so in any calculation that combines the two, such as a distance between two people, salary differences swamp height differences even if height matters just as much. Scaling puts both features on a comparable range.
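
To make the height-versus-salary point concrete, here is a minimal sketch using three made-up people (the values and the helper names `euclidean` and `scale` are illustrative assumptions, not part of the dataset used later). It compares Euclidean distances before and after min-max scaling with the ranges quoted above.

```python
import numpy as np

# Hypothetical people: [height in cm, salary in dollars]
a = np.array([150.0, 50_000.0])
b = np.array([200.0, 50_500.0])   # very different height, similar salary
c = np.array([151.0, 52_000.0])   # similar height, somewhat different salary

def euclidean(u, v):
    """Straight-line distance between two feature vectors."""
    return float(np.sqrt(np.sum((u - v) ** 2)))

# On raw values the salary column decides which person looks closer to a
raw_ab = euclidean(a, b)
raw_ac = euclidean(a, c)

# Min-max scale each feature using the ranges quoted in the text
lo = np.array([150.0, 30_000.0])
hi = np.array([200.0, 100_000.0])

def scale(x):
    """Map each feature onto [0, 1] given the assumed ranges."""
    return (x - lo) / (hi - lo)

scaled_ab = euclidean(scale(a), scale(b))
scaled_ac = euclidean(scale(a), scale(c))

print(f"raw:    d(a,b)={raw_ab:.1f}  d(a,c)={raw_ac:.1f}")
print(f"scaled: d(a,b)={scaled_ab:.3f}  d(a,c)={scaled_ac:.3f}")
```

On the raw values, `b` looks closer to `a` than `c` does, purely because of salary. After scaling, the 50 cm height gap counts and `c` becomes the nearer neighbor.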

**🎯 Feature Scaling (Normalization/Standardization)**


This notebook covers two common methods. Standardization (`StandardScaler`) rescales each feature to mean 0 and standard deviation 1. Normalization (`MinMaxScaler`) rescales each feature to a fixed range, 0 to 1 by default.

In [27]:
import pandas as pd

In [28]:
df = pd.read_csv("500hits.csv", encoding = 'latin-1')
df.head()

Unnamed: 0,PLAYER,YRS,G,AB,R,H,2B,3B,HR,RBI,BB,SO,SB,CS,BA,HOF
0,Ty Cobb,24,3035,11434,2246,4189,724,295,117,726,1249,357,892,178,0.366,1
1,Stan Musial,22,3026,10972,1949,3630,725,177,475,1951,1599,696,78,31,0.331,1
2,Tris Speaker,22,2789,10195,1882,3514,792,222,117,724,1381,220,432,129,0.345,1
3,Derek Jeter,20,2747,11195,1923,3465,544,66,260,1311,1082,1840,358,97,0.31,1
4,Honus Wagner,21,2792,10430,1736,3430,640,252,101,0,963,327,722,15,0.329,1


In [29]:
# Dropping the PLAYER (name) and CS columns before scaling
df = df.drop(columns = ['PLAYER', 'CS'])

In [30]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 465 entries, 0 to 464
Data columns (total 14 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   YRS     465 non-null    int64  
 1   G       465 non-null    int64  
 2   AB      465 non-null    int64  
 3   R       465 non-null    int64  
 4   H       465 non-null    int64  
 5   2B      465 non-null    int64  
 6   3B      465 non-null    int64  
 7   HR      465 non-null    int64  
 8   RBI     465 non-null    int64  
 9   BB      465 non-null    int64  
 10  SO      465 non-null    int64  
 11  SB      465 non-null    int64  
 12  BA      465 non-null    float64
 13  HOF     465 non-null    int64  
dtypes: float64(1), int64(13)
memory usage: 51.0 KB


In [31]:
df.describe().round(3)

Unnamed: 0,YRS,G,AB,R,H,2B,3B,HR,RBI,BB,SO,SB,BA,HOF
count,465.0,465.0,465.0,465.0,465.0,465.0,465.0,465.0,465.0,465.0,465.0,465.0,465.0,465.0
mean,17.049,2048.699,7511.456,1150.314,2170.247,380.953,78.555,201.049,894.26,783.561,847.471,195.905,0.289,0.329
std,2.765,354.392,1294.066,289.635,424.191,96.483,49.363,143.623,486.193,327.432,489.224,181.846,0.021,0.475
min,11.0,1331.0,4981.0,601.0,1660.0,177.0,3.0,9.0,0.0,239.0,0.0,7.0,0.246,0.0
25%,15.0,1802.0,6523.0,936.0,1838.0,312.0,41.0,79.0,640.0,535.0,436.0,63.0,0.273,0.0
50%,17.0,1993.0,7241.0,1104.0,2076.0,366.0,67.0,178.0,968.0,736.0,825.0,137.0,0.287,0.0
75%,19.0,2247.0,8180.0,1296.0,2375.0,436.0,107.0,292.0,1206.0,955.0,1226.0,285.0,0.3,1.0
max,26.0,3308.0,12364.0,2295.0,4189.0,792.0,309.0,755.0,2297.0,2190.0,2597.0,1406.0,0.366,2.0


In [32]:
# Selecting the first 13 columns (every column except HOF) as feature set X1
X1 = df.iloc[:, 0:13]

In [33]:
# Selecting the same 13 columns as a second feature set X2, to be scaled with a different method
X2 = df.iloc[:, 0:13]

In [34]:
# Importing StandardScaler for standardizing features (mean=0, std=1)
from sklearn.preprocessing import StandardScaler

In [35]:
scaleStandard = StandardScaler()
scaleStandard

In [36]:
# Applying standardization on X1
X1 = scaleStandard.fit_transform(X1)

In [37]:
# Converting the scaled array back to a DataFrame with column names
X1 = pd.DataFrame(X1, columns = ['YRS', 'G', 'AB', 'R', 'H', '2B', '3B', 'HR', 'RBI', 'BB', 'SO', 'SB', 'BA'])

In [38]:
X1.head()

Unnamed: 0,YRS,G,AB,R,H,2B,3B,HR,RBI,BB,SO,SB,BA
0,2.516295,2.786078,3.034442,3.787062,4.764193,3.559333,4.389485,-0.585841,-0.346449,1.423013,-1.003628,3.832067,3.64829
1,1.792237,2.760655,2.677044,2.76053,3.444971,3.569709,1.996457,1.909487,2.175837,2.493089,-0.309948,-0.64908,1.996159
2,1.792237,2.091184,2.075964,2.528955,3.171214,4.264876,2.909053,-0.585841,-0.350567,1.826585,-1.283965,1.299723,2.657012
3,1.06818,1.972543,2.849554,2.670665,3.055576,1.691719,-0.254611,0.410896,0.858071,0.912434,2.030966,0.892346,1.004881
4,1.430208,2.099658,2.257758,2.024329,2.972977,2.68778,3.517449,-0.697364,-1.84129,0.548609,-1.065016,2.896201,1.901752


In [39]:
X1.describe().round()

Unnamed: 0,YRS,G,AB,R,H,2B,3B,HR,RBI,BB,SO,SB,BA
count,465.0,465.0,465.0,465.0,465.0,465.0,465.0,465.0,465.0,465.0,465.0,465.0,465.0
mean,-0.0,0.0,0.0,0.0,0.0,0.0,-0.0,-0.0,0.0,0.0,-0.0,0.0,0.0
std,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
min,-2.0,-2.0,-2.0,-2.0,-1.0,-2.0,-2.0,-1.0,-2.0,-2.0,-2.0,-1.0,-2.0
25%,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0
50%,-0.0,-0.0,-0.0,-0.0,-0.0,-0.0,-0.0,-0.0,0.0,-0.0,-0.0,-0.0,-0.0
75%,1.0,1.0,1.0,1.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,1.0
max,3.0,4.0,4.0,4.0,5.0,4.0,5.0,4.0,3.0,4.0,4.0,7.0,4.0
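
As a rough check on what `StandardScaler` computes, the sketch below reproduces it by hand on a small made-up frame (the `toy` values are illustrative, not rows from 500hits.csv). It assumes the usual z-score definition, subtracting the column mean and dividing by the population standard deviation (`ddof=0`).

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Small made-up sample, not taken from 500hits.csv
toy = pd.DataFrame({"YRS": [11, 15, 17, 19, 26],
                    "BA": [0.246, 0.273, 0.287, 0.300, 0.366]})

# Manual z-score with the population standard deviation (ddof=0)
manual = (toy - toy.mean()) / toy.std(ddof=0)

# StandardScaler output for the same data
scaled = pd.DataFrame(StandardScaler().fit_transform(toy), columns=toy.columns)

matches = np.allclose(manual.to_numpy(), scaled.to_numpy())
print("manual z-scores match StandardScaler:", matches)

# pandas' default std() uses ddof=1, so it reports sqrt(n / (n - 1)), not exactly 1
n = len(toy)
print(scaled.std().round(3).tolist(), round(float(np.sqrt(n / (n - 1))), 3))
```

The same `ddof` difference applies to the `describe()` output above: with 465 rows the reported standard deviation is sqrt(465/464), roughly 1.001, which rounds to 1.0.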


In [40]:
# Importing MinMaxScaler for normalizing features to a range (default 0 to 1)
from sklearn.preprocessing import MinMaxScaler

In [41]:
scaleMinMax = MinMaxScaler(feature_range = (0, 1))
scaleMinMax

In [42]:
# Applying min-max normalization on X2
X2 = scaleMinMax.fit_transform(X2)

In [43]:
# Converting the scaled array back to a DataFrame with column names
X2 = pd.DataFrame(X2, columns = ['YRS', 'G', 'AB', 'R', 'H', '2B', '3B', 'HR', 'RBI', 'BB', 'SO', 'SB', 'BA'])

In [44]:
X2.head()

Unnamed: 0,YRS,G,AB,R,H,2B,3B,HR,RBI,BB,SO,SB,BA
0,0.866667,0.861912,0.874035,0.971074,1.0,0.889431,0.954248,0.144772,0.316064,0.517683,0.137466,0.632595,1.0
1,0.733333,0.85736,0.811459,0.79575,0.778964,0.891057,0.568627,0.624665,0.849369,0.697078,0.268002,0.050751,0.708333
2,0.733333,0.737481,0.706217,0.756198,0.733096,1.0,0.715686,0.144772,0.315194,0.585341,0.084713,0.303788,0.825
3,0.6,0.716237,0.841663,0.780401,0.713721,0.596748,0.205882,0.336461,0.570744,0.432086,0.70851,0.250893,0.533333
4,0.666667,0.738998,0.738047,0.670012,0.699881,0.752846,0.813725,0.123324,0.0,0.371092,0.125915,0.511079,0.691667


In [45]:
X2.describe().round(3)

Unnamed: 0,YRS,G,AB,R,H,2B,3B,HR,RBI,BB,SO,SB,BA
count,465.0,465.0,465.0,465.0,465.0,465.0,465.0,465.0,465.0,465.0,465.0,465.0,465.0
mean,0.403,0.363,0.343,0.324,0.202,0.332,0.247,0.257,0.389,0.279,0.326,0.135,0.356
std,0.184,0.179,0.175,0.171,0.168,0.157,0.161,0.193,0.212,0.168,0.188,0.13,0.177
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,0.267,0.238,0.209,0.198,0.07,0.22,0.124,0.094,0.279,0.152,0.168,0.04,0.225
50%,0.4,0.335,0.306,0.297,0.164,0.307,0.209,0.227,0.421,0.255,0.318,0.093,0.342
75%,0.533,0.463,0.433,0.41,0.283,0.421,0.34,0.379,0.525,0.367,0.472,0.199,0.45
max,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
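
The min-max values can be checked by hand with the formula (x - min) / (max - min). The sketch below applies it to a few YRS values read off the outputs above: the column minimum (11) and maximum (26) from `df.describe()`, and 24 and 22 from the first two rows of `df.head()`. The array `yrs` is a stand-in built for this check, not the full column.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# YRS values: the column min (11) and max (26) from the describe() output,
# plus the first two rows shown by df.head() (24 and 22)
yrs = np.array([[11.0], [26.0], [24.0], [22.0]])

# Manual min-max scaling: (x - min) / (max - min)
manual = (yrs - yrs.min(axis=0)) / (yrs.max(axis=0) - yrs.min(axis=0))

# MinMaxScaler on the same values
scaled = MinMaxScaler(feature_range=(0, 1)).fit_transform(yrs)

print(manual.ravel().round(6))
print(np.allclose(manual, scaled))
```

A value of 24 years maps to 13/15, about 0.866667, and 22 years to 11/15, about 0.733333, which agrees with the YRS column in the first two rows of `X2.head()`.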


**Example of Feature Scaling**


In [45]:
# Import necessary libraries
import numpy as np
import pandas as pd
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, MinMaxScaler

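# Load the diabetes dataset
# (note: scikit-learn ships these features already mean-centered and scaled,
# so the "original" values printed below are small numbers around zero)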
data = load_diabetes()

# Create a DataFrame with features and target
df = pd.DataFrame(data.data, columns=data.feature_names)
df['target'] = data.target

# Display the first 5 rows of the dataset
print("Original Data:\n", df.head())

# Split data into features and target
X = df.drop(columns=['target'])  # Feature variables
y = df['target']                  # Target variable

# Split data into training and testing sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# ===========================
# Standardization (StandardScaler)
# ===========================
scaler_std = StandardScaler()

# Fit on the training data only, then apply the same transformation to the
# test data, so no test-set statistics leak into the scaler
X_train_std = scaler_std.fit_transform(X_train)
X_test_std = scaler_std.transform(X_test)

# Convert scaled arrays to DataFrame
X_train_std_df = pd.DataFrame(X_train_std, columns=X.columns)
X_test_std_df = pd.DataFrame(X_test_std, columns=X.columns)

print("\nStandardized Training Data (First 5 Rows):\n", X_train_std_df.head())

# ===========================
# Normalization (MinMaxScaler)
# ===========================
scaler_mm = MinMaxScaler()

# Fit on the training data only, then apply the same transformation to the test data
X_train_mm = scaler_mm.fit_transform(X_train)
X_test_mm = scaler_mm.transform(X_test)

# Convert scaled arrays to DataFrame
X_train_mm_df = pd.DataFrame(X_train_mm, columns=X.columns)
X_test_mm_df = pd.DataFrame(X_test_mm, columns=X.columns)

print("\nNormalized Training Data (First 5 Rows):\n", X_train_mm_df.head())
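
One consequence of fitting the scaler on the training split only is that scaled test values are not guaranteed to stay inside the target range. The sketch below illustrates this with made-up one-feature data (the arrays `X_train_toy` and `X_test_toy` are assumptions for illustration, not the diabetes data).

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Made-up one-feature data: the test set holds values outside the training range
X_train_toy = np.array([[1.0], [2.0], [3.0]])
X_test_toy = np.array([[0.0], [4.0]])

scaler = MinMaxScaler()
scaler.fit(X_train_toy)          # learns min=1 and max=3 from the training data only

train_scaled = scaler.transform(X_train_toy)
test_scaled = scaler.transform(X_test_toy)

print("train:", train_scaled.ravel())
print("test: ", test_scaled.ravel())   # values below 0 and above 1 are possible
```

Here the test values 0 and 4 map to -0.5 and 1.5. This is expected behavior rather than a bug: the test set is transformed with the statistics learned from the training set.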