# Day 12: Feature Scaling & Encoding

Before training ML models, we often need to preprocess data so that all features are on a similar scale and categorical values are converted into numbers.

---

## 1. Feature Scaling  
- Different features can have different units (e.g., Age in years, Salary in ₹).  
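- Distance-based models such as KNN, and models trained with gradient descent (for example, Logistic Regression), are **sensitive to scale**: a feature with a large range (Salary) can dominate one with a small range (Age).  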

### Common Scaling Methods:
- **Min-Max Scaling (Normalization):** Rescales each feature to the range [0, 1] using `(x - min) / (max - min)`.  
- **Standardization (Z-score):** Transforms each feature to have mean = 0 and std = 1 using `(x - mean) / std`.  
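
As a minimal sketch of the two formulas, the snippet below applies them by hand to the Age values from the example dataset. It uses only the standard library and assumes the population standard deviation, which is what scikit-learn's `StandardScaler` uses. The helper names `min_max_scale` and `standardize` are illustrative, not part of any library.

```python
from statistics import mean, pstdev

ages = [18, 25, 32, 40, 50]  # Age column from the example dataset


def min_max_scale(values):
    """x' = (x - min) / (max - min); maps the observed range onto [0, 1]."""
    lo, hi = min(values), max(values)
    # Note: a constant column (hi == lo) would divide by zero here.
    return [(v - lo) / (hi - lo) for v in values]


def standardize(values):
    """z = (x - mean) / std, using the population standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]


print(min_max_scale(ages))  # [0.0, 0.21875, 0.4375, 0.6875, 1.0]
print(standardize(ages))    # first value is roughly -1.338
```

The results should match the `Age` column produced by `MinMaxScaler` and `StandardScaler` in the cells that follow.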

---

## 2. Encoding Categorical Data
Most machine learning models need **numerical input**, so text labels must be converted into numbers.

### Methods:
- **Label Encoding:** Maps each category to an integer (e.g., Female=0, Male=1; scikit-learn's `LabelEncoder` assigns codes in sorted order).  
- **One-Hot Encoding:** Creates dummy variables (e.g., Red → [1,0,0], Blue → [0,1,0], Green → [0,0,1]).  
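
To make the two mappings concrete, here is a small pure-Python sketch of what each encoding does. The functions `label_encode` and `one_hot_encode` are hypothetical helpers written for illustration. They sort the categories first, as `LabelEncoder` does, so the one-hot column order is Blue, Green, Red rather than the Red, Blue, Green order used in the bullet above; the column order is only a convention.

```python
def label_encode(values):
    """Map each category to its index in the sorted list of unique categories."""
    classes = sorted(set(values))
    index = {c: i for i, c in enumerate(classes)}
    return [index[v] for v in values], classes


def one_hot_encode(values):
    """Map each category to an indicator vector containing a single 1."""
    classes = sorted(set(values))
    return [[1 if v == c else 0 for c in classes] for v in values], classes


genders = ["Male", "Female", "Female", "Male", "Female"]
codes, gender_classes = label_encode(genders)
print(gender_classes, codes)  # ['Female', 'Male'] [1, 0, 0, 1, 0]

colors = ["Red", "Blue", "Green"]
vectors, color_classes = one_hot_encode(colors)
print(color_classes)  # ['Blue', 'Green', 'Red']
print(vectors)        # [[0, 0, 1], [1, 0, 0], [0, 1, 0]]
```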

In [4]:
# Day 12: Feature Scaling & Encoding
import pandas as pd
from sklearn.preprocessing import StandardScaler, MinMaxScaler, LabelEncoder, OneHotEncoder

# Example dataset
data = {
    "Age": [18, 25, 32, 40, 50],
    "Salary": [20000, 35000, 50000, 70000, 100000],
    "Gender": ["Male", "Female", "Female", "Male", "Female"],
    "City": ["Delhi", "Mumbai", "Delhi", "Bangalore", "Mumbai"]
}

df = pd.DataFrame(data)
df


   Age  Salary  Gender       City
0   18   20000    Male      Delhi
1   25   35000  Female     Mumbai
2   32   50000  Female      Delhi
3   40   70000    Male  Bangalore
4   50  100000  Female     Mumbai


In [5]:
# 1. Feature Scaling
scaler = StandardScaler()
df_scaled = df.copy()
df_scaled[["Age", "Salary"]] = scaler.fit_transform(df[["Age", "Salary"]])

print("Standardized Data:")
print(df_scaled)

Standardized Data:
        Age    Salary  Gender       City
0 -1.338432 -1.253201    Male      Delhi
1 -0.713831 -0.716115  Female     Mumbai
2 -0.089229 -0.179029  Female      Delhi
3  0.624602  0.537086    Male  Bangalore
4  1.516890  1.611258  Female     Mumbai


In [6]:
# Min-Max Scaling
minmax = MinMaxScaler()
df_minmax = df.copy()
df_minmax[["Age", "Salary"]] = minmax.fit_transform(df[["Age", "Salary"]])

print("Min-Max Scaled Data:")
print(df_minmax)

Min-Max Scaled Data:
       Age  Salary  Gender       City
0  0.00000  0.0000    Male      Delhi
1  0.21875  0.1875  Female     Mumbai
2  0.43750  0.3750  Female      Delhi
3  0.68750  0.6250    Male  Bangalore
4  1.00000  1.0000  Female     Mumbai


In [7]:
# 2. Encoding - Label Encoding
# LabelEncoder sorts the classes, so Female -> 0 and Male -> 1
le = LabelEncoder()
df_le = df.copy()
df_le["Gender"] = le.fit_transform(df["Gender"])

print("Label Encoded Gender:")
print(df_le)

Label Encoded Gender:
   Age  Salary  Gender       City
0   18   20000       1      Delhi
1   25   35000       0     Mumbai
2   32   50000       0      Delhi
3   40   70000       1  Bangalore
4   50  100000       0     Mumbai


In [8]:
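# One-Hot Encoding for City
# drop_first=True avoids the dummy variable trap: the first category in sorted
# order (Bangalore) is dropped and is represented by both dummies being False
df_ohe = pd.get_dummies(df, columns=["City"], drop_first=True)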
print("One-Hot Encoded Data:")
print(df_ohe)

One-Hot Encoded Data:
   Age  Salary  Gender  City_Delhi  City_Mumbai
0   18   20000    Male        True        False
1   25   35000  Female       False         True
2   32   50000  Female        True        False
3   40   70000    Male       False        False
4   50  100000  Female       False         True


### ✅ Key Points:
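- **Scale numerical features** when the model is sensitive to feature magnitude (esp. distance-based models such as KNN).  
- Use **Label Encoding** for binary categories (Male/Female); with more than two unordered categories, the integer codes would imply an order that does not exist.  
- Use **One-Hot Encoding** for multi-category variables with no natural order (City names).  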