ENCODING TECHNIQUES
# 🔥 Day-5: Encoding
---

## 🌟 What is Encoding?

Most machine learning models can only work with **numerical data**.

**Encoding** is the process of converting **categorical variables (Text Labels)** into **numerical values** so that algorithms can process them.

---

## 🚀 Why Is Encoding Needed?

Most machine learning algorithms operate on numbers and **cannot process raw text labels or categories** directly.  
For example:
- Color: `Red`, `Blue`, `Green`
- Gender: `Male`, `Female`

We need to convert them into a numerical format, such as:
- Red → 0, Blue → 1, Green → 2  
or
- Male → 1, Female → 0
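
The mapping above can be sketched with a plain dictionary and `Series.map`. This is a minimal illustration, not a recommendation for every case; the sample values and the names `colors` and `color_codes` are assumptions made for this example.

```python
import pandas as pd

# Hypothetical sample data mirroring the Color example above
colors = pd.Series(["Red", "Blue", "Green", "Blue"], name="Color")

# Hand-written mapping from each category to an integer code
color_codes = {"Red": 0, "Blue": 1, "Green": 2}

# map() looks up every value in the dictionary
encoded = colors.map(color_codes)
print(encoded.tolist())
```

Any value missing from the dictionary becomes `NaN`, which makes unexpected categories easy to spot.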

---

## 🎯 Types of Encoding Techniques

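| Encoding Method       | When to Use                                                                  |
|-----------------------|------------------------------------------------------------------------------|
| **Label Encoding**    | For target labels; codes are assigned in **alphabetical order**, not by rank |
| **Ordinal Encoding**  | When categories have an **order/rank**                                       |
| **One-Hot Encoding**  | When categories are **independent (no order)**                               |
| **Target Encoding**   | When each category should be replaced by the **mean of the target**          |

---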

## 🧩 One-Hot Encoding

In [3]:
# One-hot encoding (nominal data)
import pandas as pd

# read_csv already returns a DataFrame, so no extra wrapping is needed
df = pd.read_csv('homeprices.csv')
df

|   | town            | area | price  |
|---|-----------------|------|--------|
| 0 | monroe township | 2600 | 550000 |
| 1 | monroe township | 3000 | 565000 |
| 2 | monroe township | 3200 | 610000 |
| 3 | monroe township | 3600 | 680000 |
| 4 | monroe township | 4000 | 725000 |
| 5 | west windsor    | 2600 | 585000 |
| 6 | west windsor    | 2800 | 615000 |
| 7 | west windsor    | 3300 | 650000 |
| 8 | west windsor    | 3600 | 710000 |
| 9 | robinsville     | 2600 | 575000 |


In [4]:
dummies = pd.get_dummies(df, columns=['town'])
dummies

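|   | area | price  | town_monroe township | town_robinsville | town_west windsor |
|---|------|--------|----------------------|------------------|-------------------|
| 0 | 2600 | 550000 | True                 | False            | False             |
| 1 | 3000 | 565000 | True                 | False            | False             |
| 2 | 3200 | 610000 | True                 | False            | False             |
| 3 | 3600 | 680000 | True                 | False            | False             |
| 4 | 4000 | 725000 | True                 | False            | False             |
| 5 | 2600 | 585000 | False                | False            | True              |
| 6 | 2800 | 615000 | False                | False            | True              |
| 7 | 3300 | 650000 | False                | False            | True              |
| 8 | 3600 | 710000 | False                | False            | True              |
| 9 | 2600 | 575000 | False                | True             | False             |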
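
Recent pandas versions return boolean dummy columns, as in the output above. If 0/1 integers are preferred, `get_dummies` accepts a `dtype` argument. The sketch below uses a small inline stand-in for `homeprices.csv` (three rows copied from the table above); the names `homes` and `dummies_int` are assumptions made for this example.

```python
import pandas as pd

# Small inline stand-in for homeprices.csv (one row per town)
homes = pd.DataFrame({
    "town": ["monroe township", "west windsor", "robinsville"],
    "area": [2600, 2600, 2600],
    "price": [550000, 585000, 575000],
})

# dtype=int requests 0/1 integer columns instead of booleans
dummies_int = pd.get_dummies(homes, columns=["town"], dtype=int)
print(dummies_int)
```

Each row has exactly one `1` across the `town_` columns, which is what "one-hot" refers to.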


In [5]:
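# drop_first=True drops the first dummy column (town_monroe township).
# That town becomes the all-False baseline, which removes a redundant,
# perfectly collinear column.
encoded_df = pd.get_dummies(df, columns=['town'], drop_first=True)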
encoded_df

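|   | area | price  | town_robinsville | town_west windsor |
|---|------|--------|------------------|-------------------|
| 0 | 2600 | 550000 | False            | False             |
| 1 | 3000 | 565000 | False            | False             |
| 2 | 3200 | 610000 | False            | False             |
| 3 | 3600 | 680000 | False            | False             |
| 4 | 4000 | 725000 | False            | False             |
| 5 | 2600 | 585000 | False            | True              |
| 6 | 2800 | 615000 | False            | True              |
| 7 | 3300 | 650000 | False            | True              |
| 8 | 3600 | 710000 | False            | True              |
| 9 | 2600 | 575000 | True             | False             |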
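
`pd.get_dummies` builds its columns from whatever data it is given, so new data with different categories can produce different columns. One possible alternative is scikit-learn's `OneHotEncoder`, which learns the categories once and reuses them. The sketch below is illustrative only; the town `princeton` and the names `train`, `new`, and `ohe` are assumptions, not part of the dataset.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

train = pd.DataFrame({"town": ["monroe township", "west windsor", "robinsville"]})
# "princeton" is a hypothetical town that never appears in the training data
new = pd.DataFrame({"town": ["west windsor", "princeton"]})

# handle_unknown="ignore" encodes unseen categories as an all-zero row
ohe = OneHotEncoder(handle_unknown="ignore")
ohe.fit(train[["town"]])

# transform() returns a sparse matrix by default; toarray() makes it dense
encoded_new = ohe.transform(new[["town"]]).toarray()
print(list(ohe.categories_[0]))
print(encoded_new)
```

The learned categories are stored in sorted order, so the columns line up the same way for training data and new data.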


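## 🔢 Label Encoding

`LabelEncoder` assigns integer codes in sorted (alphabetical) order, and scikit-learn intends it for encoding target labels. It does not know the real rank of the categories, so in the output below `Bachelor` gets 0 and `High School` gets 1.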

In [6]:
# Label encoding (codes follow alphabetical order, not rank)
from sklearn.preprocessing import LabelEncoder

# Sample Data
df = pd.DataFrame({'Education': ['High School', 'Bachelor', 'Master', 'PhD']})

# Apply Label Encoding
encoder = LabelEncoder()
df['Education_encoded'] = encoder.fit_transform(df['Education'])
df



|   | Education   | Education_encoded |
|---|-------------|-------------------|
| 0 | High School | 1                 |
| 1 | Bachelor    | 0                 |
| 2 | Master      | 2                 |
| 3 | PhD         | 3                 |
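
To see where the codes in the output above come from, it can help to inspect the fitted encoder. The sketch below reuses the same four education levels; the variable names `levels` and `codes` are assumptions made for this example.

```python
from sklearn.preprocessing import LabelEncoder

levels = ["High School", "Bachelor", "Master", "PhD"]

encoder = LabelEncoder()
codes = encoder.fit_transform(levels)

# classes_ holds the learned categories in sorted (alphabetical) order;
# each label's code is its position in this array
print(list(encoder.classes_))
print(codes.tolist())

# inverse_transform maps integer codes back to the original labels
print(list(encoder.inverse_transform(codes)))
```

Because the codes follow alphabetical order, they match the educational rank here only by coincidence for `Master` and `PhD`.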


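## 📶 Ordinal Encoding

Use `OrdinalEncoder` with an explicit `categories` list when the categories have a meaningful order and the integer codes must follow it.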


In [7]:
# Ordinal encoding (for ordered categories)
from sklearn.preprocessing import OrdinalEncoder

# Sample data
df = pd.DataFrame({'Satisfaction': ['Low', 'Medium', 'High']})

# Explicit category order, lowest to highest (one list per encoded column)
order = [['Low', 'Medium', 'High']]

# Apply ordinal encoding; OrdinalEncoder expects 2D input, hence df[['Satisfaction']]
encoder = OrdinalEncoder(categories=order)
df['Satisfaction_encoded'] = encoder.fit_transform(df[['Satisfaction']])
df


|   | Satisfaction | Satisfaction_encoded |
|---|--------------|----------------------|
| 0 | Low          | 0.0                  |
| 1 | Medium       | 1.0                  |
| 2 | High         | 2.0                  |
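
A pandas-only way to get the same ordered codes is an ordered `Categorical`. This is a sketch of one alternative, not a replacement for the `OrdinalEncoder` cell above; the extra `Medium` row and the names `levels` and `cat` are assumptions made for this example.

```python
import pandas as pd

df = pd.DataFrame({"Satisfaction": ["Low", "Medium", "High", "Medium"]})

# Category order from lowest to highest
levels = ["Low", "Medium", "High"]
cat = pd.Categorical(df["Satisfaction"], categories=levels, ordered=True)

# codes gives each value's position in `levels` as a small integer
df["Satisfaction_encoded"] = cat.codes
print(df)
```

Values that are not listed in `levels` receive the code `-1`, so they are worth checking for before training.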


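## 🧮 Target Encoding (Mean Encoding)

Target encoding replaces each category with the mean of the target variable for that category.

scikit-learn did not include this encoder before version 1.3, which added `sklearn.preprocessing.TargetEncoder`. Two other options:

- Import it from the `category_encoders` package: `from category_encoders import TargetEncoder`
- Compute it directly with `groupby()`: `df['encoded'] = df.groupby('City')['Purchased'].transform('mean')`

Because the encoding is computed from the target, calculate the means on the training data only, so that no target information leaks into validation or test data.
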

In [8]:
df = pd.DataFrame({
    'City': ['Delhi', 'Mumbai', 'Delhi', 'Chennai'],
    'Purchased': [1, 0, 1, 0]
})


# Replace each city with the mean of 'Purchased' for that city
df['City_encoded'] = df.groupby('City')['Purchased'].transform('mean')
df

|   | City    | Purchased | City_encoded |
|---|---------|-----------|--------------|
| 0 | Delhi   | 1         | 1.0          |
| 1 | Mumbai  | 0         | 0.0          |
| 2 | Delhi   | 1         | 1.0          |
| 3 | Chennai | 0         | 0.0          |
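
The cell above computes the means on the same rows it encodes. When new data has to be encoded, one common approach is to learn the per-category means on the training rows and apply them elsewhere. The sketch below assumes that approach; the train/test rows, the city `Kolkata`, and the global-mean fallback are illustrative choices, not part of the original example.

```python
import pandas as pd

# Hypothetical training and test data
train = pd.DataFrame({
    "City": ["Delhi", "Mumbai", "Delhi", "Chennai", "Mumbai"],
    "Purchased": [1, 0, 1, 0, 1],
})
test = pd.DataFrame({"City": ["Delhi", "Mumbai", "Kolkata"]})

# Learn the per-city target mean from the training rows only
city_means = train.groupby("City")["Purchased"].mean()

# Fallback for cities never seen in training (an assumption of this sketch)
global_mean = train["Purchased"].mean()

# Apply the learned means; unseen cities get the global mean
test["City_encoded"] = test["City"].map(city_means).fillna(global_mean)
print(test)
```

Here `Kolkata` does not appear in the training data, so it falls back to the overall purchase rate rather than `NaN`.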
