# What is Data Encoding?
Data encoding is the process of converting categorical data into numerical values so that ML algorithms can process them.<br>
It is needed because most ML models work with numbers, not text.<br>
Eg: models like regression, SVM, and neural networks require numerical inputs.



# 1.Label Encoding
Converts each category into a unique numeric value.<br>
Eg: Red   → 0  
    Blue  → 1  
    Green → 2  

**Advantages:<br>**
i)Simple and memory-efficient.<br>
ii)Usable for ordinal data (like "Low", "Medium", "High"), provided the assigned integers actually follow the intended order.

**Limitations: <br>**
i)Label Encoding assigns a unique integer (e.g., 0, 1, 2) to each category, which implies an order. That is acceptable for ordinal data (like small < medium < large) when the integers match the real order, but it is misleading for nominal data (like colors or cities), where no order exists.<br>

**Example with Linear Regression or KNN:<br>**

If "Green"=2 and "Red"=0, the algorithm might wrongly treat Green as "greater than" Red, or as two units away from it.<br>

Distance-based models (KNN, SVM, clustering) may compute:<br><br>
Distance(Red=0, Blue=1) = 1<br>
Distance(Red=0, Green=2) = 2<br>
→ implying Blue is more similar to Red than Green is, which is not logically true for colors.<br>
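
The distance argument above can be checked with a few lines of Python. This is a minimal sketch: the integer codes are the ones used in the text, while the helper names `label_distance` and `one_hot_distance` and the one-hot column order are assumptions made for illustration. One-hot encoding, covered in section 3, is included only for comparison.

```python
import numpy as np

# Integer codes taken from the example in the text (an arbitrary assignment).
label_codes = {"Red": 0, "Blue": 1, "Green": 2}

# One-hot vectors for the same colours (column order chosen arbitrarily here).
one_hot = {
    "Red": np.array([1, 0, 0]),
    "Blue": np.array([0, 1, 0]),
    "Green": np.array([0, 0, 1]),
}

def label_distance(a, b):
    """Absolute difference between two integer codes."""
    return abs(label_codes[a] - label_codes[b])

def one_hot_distance(a, b):
    """Euclidean distance between two one-hot vectors."""
    return float(np.linalg.norm(one_hot[a] - one_hot[b]))

# With label codes, Blue looks closer to Red than Green does.
print(label_distance("Red", "Blue"), label_distance("Red", "Green"))

# With one-hot vectors, both pairs are the same distance apart.
print(one_hot_distance("Red", "Blue"), one_hot_distance("Red", "Green"))
```

The label-encoded distances differ only because of the arbitrary integer assignment, whereas the one-hot distances are identical for every pair of distinct colours.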


In [16]:
from sklearn.preprocessing import LabelEncoder
data = ["Red", "Blue", "Green", "Blue"]
encoder = LabelEncoder() #creating object for LabelEncoder
encoded_data=encoder.fit_transform(data)
print(encoded_data)


[2 0 1 0]


**Why is Red assigned 2, Blue 0, and so on?**<br>
A:  LabelEncoder() takes all unique values from your data.<br>
    ["Red", "Blue", "Green", "Blue"]  <br>
    Unique = {"Red", "Blue", "Green"}<br>
    
It sorts them alphabetically (lexicographically).<br>
    Sorted = ["Blue", "Green", "Red"]<br>
    It then assigns 0, 1, 2 … based on this sorted list.<br>
    
So the rule is: LabelEncoder assigns integers in the sorted (alphabetical) order of the categories, not in their order of appearance.
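
The sorting rule can be inspected directly through the fitted encoder's `classes_` attribute. The sketch below reuses the colour list from the cell above. The `mapping` dictionary and the second, ordinal-looking example are illustrative additions, not part of the original notebook.

```python
from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder()
encoder.fit(["Red", "Blue", "Green", "Blue"])

# classes_ holds the sorted unique labels; a label's position is its code.
print(list(encoder.classes_))

# Build an explicit label -> code dictionary for inspection.
mapping = {label: code for code, label in enumerate(encoder.classes_)}
print(mapping)

# The same alphabetical rule applies to labels that have a natural order:
# "High" sorts before "Low" and "Medium", so it receives code 0.
ordinal_encoder = LabelEncoder()
ordinal_codes = ordinal_encoder.fit_transform(["Low", "Medium", "High"])
print(ordinal_codes)
```

Because of this, LabelEncoder does not preserve the Low < Medium < High order on its own. An explicit order has to be supplied some other way, as in the Ordinal Encoding section.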

# 2.Ordinal Encoding

Ordinal Encoding is a technique where categorical values are converted into integers based on a natural order or ranking that exists in the data.<br>
Eg: for an education_level column:<br>
    High School-->1<br>
    College-->2<br>
    Graduation-->3<br>
    Post-graduation-->4
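
As a quick illustration of the education_level mapping above, the same encoding can be written with a plain dictionary and `Series.map`. This is a minimal sketch: the sample values in `education` are made up, and the dictionary simply restates the order given in the text.

```python
import pandas as pd

# Hypothetical sample column; the values are invented for illustration.
education = pd.Series(
    ["College", "High School", "Post-graduation", "Graduation", "College"]
)

# Explicit rank for each level, lowest to highest, as listed in the text.
education_order = {
    "High School": 1,
    "College": 2,
    "Graduation": 3,
    "Post-graduation": 4,
}

education_encoded = education.map(education_order)
print(education_encoded.tolist())
```

Note that scikit-learn's OrdinalEncoder, used in the next cell, numbers categories from 0 rather than 1. Only the relative order matters.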

In [35]:
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder
df = pd.DataFrame({
    # Every column must have the same number of rows; otherwise pandas
    # raises an error when building the DataFrame.
    "Size": ["Small", "Large", "Medium", "Large", "Small", "Medium"],
    "Quality": ["Low", "Medium", "High", "Low", "High", "Medium"],
})
# Define the order explicitly (lowest to highest) for each column
size_order = ["Small", "Medium", "Large"]
quality_order = ["Low", "Medium", "High"]

# Create the OrdinalEncoder. categories is a list of lists, one list per column,
# in the same order as the DataFrame columns. Even when encoding a single column,
# it is still passed as a list of lists.
encoder = OrdinalEncoder(categories=[size_order, quality_order])
encoded_data = encoder.fit_transform(df)
print(encoded_data)

[[0. 0.]
 [2. 1.]
 [1. 2.]
 [2. 0.]
 [0. 2.]
 [1. 1.]]
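
The list-of-lists requirement also holds when only one column is encoded. The sketch below encodes just the Size column. The names `df_size` and `encoded_size` are introduced here for illustration.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

df_size = pd.DataFrame({
    "Size": ["Small", "Large", "Medium", "Large", "Small", "Medium"],
})
size_order = ["Small", "Medium", "Large"]

# A single column still needs a list containing one list of categories.
single_encoder = OrdinalEncoder(categories=[size_order])

# Double brackets keep the input two-dimensional (a one-column DataFrame).
encoded_size = single_encoder.fit_transform(df_size[["Size"]])
print(encoded_size.ravel())
```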


**Advantages**<br>
i)Memory Efficient : It uses a single integer per category, unlike one-hot encoding which creates multiple columns.<br>
ii)Works Well with Some ML Algorithms : Tree-based models (like Decision Trees, Random Forests, XGBoost) can handle ordinal encoded data well because they split based on thresholds.<br>
iii)Preserves information about relative ordering<br>

**Limitations**<br>
i)Not suitable for nominal data<br>
ii)Requires Domain Knowledge : You must explicitly define the order for categories based on context.<br>

# 3.One-hot Encoding or Nominal Encoding

It is a technique to convert categorical data into numerical data in which each category is represented as a binary vector.<br>
It creates new columns for each category where 1 means the category is present and 0 means it is not.<br>

It is mostly used for nominal categorical data where no order exists.

For example: You have a color column ["Red", "Blue", "Green"] <br>
                One-hot encoding represents it as:<br>
                Red-->[0,0,1]<br>
                Blue-->[1,0,0]<br>
                Green-->[0,1,0]<br>

Why is "Red" encoded as [0, 0, 1] and not something else like [0, 1, 0]? What decides this?<br>
A: How One-Hot Encoding Assigns Vectors:<br>

**1.List all unique categories from the column<br>**
eg: ["Red", "Blue", "Green"]<br><br>
**2.Sort or Order Them<br>**
Depending on the encoder, it might sort them alphabetically or keep the order of appearance. Both encoders used in this section sort alphabetically.<br>
["Blue", "Green", "Red"]<br><br>
**3.Assign Columns Based on Order<br>**
Each unique category is assigned one column in this order.<br>
        Column 0 → Blue  <br>
        Column 1 → Green  <br>
        Column 2 → Red<br><br>
**4.Create Binary Vectors<br>**
For each row, the encoder places 1 in the column corresponding to that category, and 0 elsewhere.

| Category | Blue | Green | Red |
| -------- | ---- | ----- | --- |
| Blue     | 1    | 0     | 0   |
| Green    | 0    | 1     | 0   |
| Red      | 0    | 0     | 1   |
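
The four steps above can be sketched in plain Python without any library. This is only an illustration of the logic, not how scikit-learn implements it internally. The names `categories`, `column_of`, and `one_hot` are assumptions made for this example.

```python
colors = ["Red", "Blue", "Green"]

# Steps 1 and 2: collect the unique categories and sort them alphabetically.
categories = sorted(set(colors))

# Step 3: assign one column index to each category, in sorted order.
column_of = {category: index for index, category in enumerate(categories)}

def one_hot(value):
    """Step 4: binary vector with a 1 in the category's column, 0 elsewhere."""
    vector = [0] * len(categories)
    vector[column_of[value]] = 1
    return vector

for color in colors:
    print(color, one_hot(color))
```

The printed vectors match the table: Blue takes column 0, Green column 1, and Red column 2.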






In [4]:
import pandas as pd
data=pd.DataFrame({
    "Color":["Red","Green","Red","Blue","Blue","Green"]
})

from sklearn.preprocessing import OneHotEncoder
encoder = OneHotEncoder()
encoded_data=encoder.fit_transform(data)
print(encoded_data,"\n")
print(encoded_data.toarray())   # fit_transform returns a sparse matrix, so convert it to a dense array for display
                                # A sparse matrix stores only the non-zero entries and their positions
                                # This is memory-efficient: imagine you have 1 million rows and 1000 categories.
                                # A dense matrix would store 1 million × 1000 = 1 billion elements, most of which are zeros → memory-heavy!
                                # The sparse form stores only the 1s; we convert to a dense array here only for readability

<Compressed Sparse Row sparse matrix of dtype 'float64'
	with 6 stored elements and shape (6, 3)>
  Coords	Values
  (0, 2)	1.0
  (1, 1)	1.0
  (2, 2)	1.0
  (3, 0)	1.0
  (4, 0)	1.0
  (5, 1)	1.0 

[[0. 0. 1.]
 [0. 1. 0.]
 [0. 0. 1.]
 [1. 0. 0.]
 [1. 0. 0.]
 [0. 1. 0.]]


In [16]:
print(pd.get_dummies(data))  # pd.get_dummies converts categorical variables into dummy/indicator columns, which is essentially one-hot encoding
pd.get_dummies(data).shape   # (rows, columns) of the encoded result

   Color_Blue  Color_Green  Color_Red
0       False        False       True
1       False         True      False
2       False        False       True
3        True        False      False
4        True        False      False
5       False         True      False


(6, 3)

**Advantages of One-Hot Encoding**<br>
i)Widely used for nominal data<br>
ii)Better for distance-sensitive algorithms, because every pair of distinct categories is equally far apart<br><br>
**Limitations**<br>
**i)Increased Dimensionality<br>**
    One-hot encoding creates one new column per category.<br>
    For features with many unique categories (high cardinality), this leads to a huge number of columns.<br>
    Example: A "Country" column with 200 countries will create 200 new columns → huge memory and computational cost!<br><br>
**ii)Sparsity<br>**
Since each row has exactly one 1 and all other entries are 0, the resulting matrix is mostly zeros.<br>
Stored as a dense array, this wastes memory and computation.<br><br>
**iii)Overfitting**<br>
When there are many categories, the model gets a large number of features, most of which are zero in any given row.<br>
With few rows per category, the model can memorize category-specific patterns instead of learning general trends, which can lead to poor performance on unseen data.
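
The dimensionality and sparsity limitations can be made concrete with a synthetic column. Everything in this sketch is made up for illustration: the 200 country labels are placeholders (`Country_0` … `Country_199`) and the row count of 1,000 is arbitrary.

```python
import pandas as pd

# Hypothetical data: 1,000 rows cycling through 200 placeholder country labels.
countries = [f"Country_{i % 200}" for i in range(1000)]
df_countries = pd.DataFrame({"Country": countries})

dummies = pd.get_dummies(df_countries)

# One input column has become 200 output columns.
print(dummies.shape)

# Fraction of cells that are 1: exactly one per row, so 1 / 200.
density = dummies.to_numpy().sum() / dummies.size
print(density)
```

Only 0.5% of the cells carry a 1, which is why sparse storage is the default in scikit-learn's OneHotEncoder.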

# 4.Target Guided Ordinal Encoding

It is a technique used to encode categorical data based on its relationship with the target variable.<br>
This technique is useful when a categorical column has many unique categories.<br>

Every category is replaced with a numerical value based on the mean or median of the target variable for that category.<br>
Captures relationship with the target : Categories with a higher average target value get a higher encoded value, so the encoding reflects how each category relates to the outcome.

In [117]:
#Encoding City Based on House Prices
import pandas as pd
df = pd.DataFrame({
    "City": ["Mumbai", "Delhi", "Bangalore", "Delhi", "Mumbai", "Bangalore", "Mumbai", "Delhi"],
    "Price": [100, 80, 120, 85, 110, 130, 90, 95]
})

grouping=df.groupby("City") #Grouping data based on City

mean_prices=grouping["Price"].mean() #In each group calculating mean of Price
print(mean_prices)


City
Bangalore    125.000000
Delhi         86.666667
Mumbai       100.000000
Name: Price, dtype: float64


In [123]:
df["City_encoded"] = df["City"].map(mean_prices)  # Corresponding mean price is mapped to its City
print(df)

        City  Price  City_encoded
0     Mumbai    100    100.000000
1      Delhi     80     86.666667
2  Bangalore    120    125.000000
3      Delhi     85     86.666667
4     Mumbai    110    100.000000
5  Bangalore    130    125.000000
6     Mumbai     90    100.000000
7      Delhi     95     86.666667
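
The cells above encode each city with its mean price directly. A common variant, shown here only as a sketch and not taken from the original notebook, ranks the cities by mean price and uses the integer rank instead. The names `city_rank` and `City_rank` are introduced for this example.

```python
import pandas as pd

df = pd.DataFrame({
    "City": ["Mumbai", "Delhi", "Bangalore", "Delhi", "Mumbai", "Bangalore", "Mumbai", "Delhi"],
    "Price": [100, 80, 120, 85, 110, 130, 90, 95],
})

mean_prices = df.groupby("City")["Price"].mean()

# Sort cities from lowest to highest mean price and number them 1, 2, 3, ...
ranked_cities = mean_prices.sort_values().index
city_rank = {city: rank for rank, city in enumerate(ranked_cities, start=1)}
print(city_rank)

# Replace each city with its rank.
df["City_rank"] = df["City"].map(city_rank)
print(df)
```

The rank keeps the ordering of the cities by price but discards how far apart their means are.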


**Advantages**<br>
i)Works well for high-cardinality features :  Instead of creating many columns (like in one-hot encoding), it keeps things simple.<br>
ii)Captures relationship with the target <br>
iii)Improves model performance in supervised learning :Makes it easier for the model to learn patterns related to the target.<br><br>
**Limitations**<br>
i)Only works in supervised problems : It relies on the target variable, so you can’t use it in unsupervised learning.<br>
ii)Sensitive to outliers: If a city has only one house and it is very expensive, its average price will be high → the model thinks all houses there are expensive.<br>

For example:<br>

| City      | Price |
| --------- | ----- |
| Mumbai    | 100   |
| Delhi     | 80    |
| Bangalore | 120   |
| SmallTown | 1000  |
| Mumbai    | 110   |
| Delhi     | 85    |
| Bangalore | 130   |

Mean of prices after grouping: <br>
Bangalore --> 125.0 <br>
Delhi      --> 82.5 <br>
Mumbai      -->105.0<br>
SmallTown   -->1000.0<br>


SmallTown has only 1 house, but the encoding suggests that all SmallTown houses are extremely expensive.<br>
When a new SmallTown house comes, the model is likely to predict a price near 1000, which may be wrong if the house is cheaper.<br>

So the problem is: <br>
SmallTown's encoded value = 1000 → much larger than the others.<br>
Even though it rests on a single data point, the model may treat SmallTown as extremely expensive and learn misleading patterns!
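
The SmallTown example can be reproduced in code by computing the row count next to each group mean, which makes single-row categories easy to spot. This is a minimal sketch: the threshold `min_rows = 2` is a hypothetical choice for illustration, not a recommended value.

```python
import pandas as pd

df_outlier = pd.DataFrame({
    "City": ["Mumbai", "Delhi", "Bangalore", "SmallTown", "Mumbai", "Delhi", "Bangalore"],
    "Price": [100, 80, 120, 1000, 110, 85, 130],
})

# Mean price and number of rows per city.
summary = df_outlier.groupby("City")["Price"].agg(["mean", "count"])
print(summary)

# Flag cities whose mean rests on fewer rows than the (hypothetical) threshold.
min_rows = 2
unreliable = summary[summary["count"] < min_rows].index.tolist()
print(unreliable)
```

Checking the count before trusting a category's mean is a simple way to notice when an encoded value depends on a single observation.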

# 5.Frequency Encoding
Frequency Encoding replaces each category in a column with the proportion or count of how often that category occurs in the dataset.<br>
For Example: Purchase Prediction <br>


In [21]:
import pandas as pd

df = pd.DataFrame({
    "Brand": ["Apple", "Samsung", "OnePlus", "Samsung", "Apple", "Xiaomi", "OnePlus", "Apple", "Realme", "Xiaomi", "Samsung", "Realme"],
    "Purchased": [1, 1, 0, 1, 1, 0, 0, 1, 0, 0, 1, 0]  # 1 = purchased, 0 = not purchased
})
print(df)


      Brand  Purchased
0     Apple          1
1   Samsung          1
2   OnePlus          0
3   Samsung          1
4     Apple          1
5    Xiaomi          0
6   OnePlus          0
7     Apple          1
8    Realme          0
9    Xiaomi          0
10  Samsung          1
11   Realme          0


In [25]:
# Count how many times each brand appears
brand_count = df['Brand'].value_counts()
print(brand_count)

Brand
Apple      3
Samsung    3
OnePlus    2
Xiaomi     2
Realme     2
Name: count, dtype: int64


In [29]:
# Map counts to the 'Brand' column
df['Brand_encoded_count'] = df['Brand'].map(brand_count)
print(df)

      Brand  Purchased  Brand_encoded_count
0     Apple          1                    3
1   Samsung          1                    3
2   OnePlus          0                    2
3   Samsung          1                    3
4     Apple          1                    3
5    Xiaomi          0                    2
6   OnePlus          0                    2
7     Apple          1                    3
8    Realme          0                    2
9    Xiaomi          0                    2
10  Samsung          1                    3
11   Realme          0                    2


In [33]:
brand_freq = df['Brand'].value_counts(normalize=True) #normalize=True gives proportion values like Apple-->3/12=0.25
print(brand_freq)

Brand
Apple      0.250000
Samsung    0.250000
OnePlus    0.166667
Xiaomi     0.166667
Realme     0.166667
Name: proportion, dtype: float64
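
The proportions can be mapped back onto the column in the same way as the counts. The sketch below rebuilds the Brand column so that it runs on its own. The column name `Brand_encoded_freq` is an assumption that mirrors `Brand_encoded_count`.

```python
import pandas as pd

df = pd.DataFrame({
    "Brand": ["Apple", "Samsung", "OnePlus", "Samsung", "Apple", "Xiaomi",
              "OnePlus", "Apple", "Realme", "Xiaomi", "Samsung", "Realme"],
})

# Proportion of rows belonging to each brand.
brand_freq = df["Brand"].value_counts(normalize=True)

# Replace each brand with its proportion.
df["Brand_encoded_freq"] = df["Brand"].map(brand_freq)
print(df)
```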


**Why Frequency Encoding Helps Here**

✔ In this toy dataset, the more frequent brands (Apple and Samsung) were always purchased, while the less frequent ones were not<br>
✔ So here, higher frequency goes along with a higher chance of purchase; this is a property of this data, not a general rule<br>
✔ The model can now use brand popularity as a numeric feature<br>

**Advantages**<br>
i)Simple to implement, and it doesn’t create extra columns the way one-hot encoding does.<br>
ii)Distinguishes "popular" from "rare" categories : models can pick up on popularity when it is related to the target.<br>
iii)Useful for high-cardinality data (columns with many unique categories).<br>

**Limitations**<br>
i)Different categories can collide : “Apple” and “Samsung” get the same encoded value just because they both appear 25% of the time, even if they have very different customer behaviors.<br>
ii)Can be misleading with rare categories : A luxury brand bought only once might be treated as irrelevant because its frequency is 1/1000 = 0.001.
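
The first limitation is easy to verify on the Brand data: five distinct brands collapse into only two distinct encoded values. In this sketch the `shared` dictionary, which groups brands by their count, is introduced purely for illustration.

```python
import pandas as pd

brands = pd.Series(["Apple", "Samsung", "OnePlus", "Samsung", "Apple", "Xiaomi",
                    "OnePlus", "Apple", "Realme", "Xiaomi", "Samsung", "Realme"])

counts = brands.value_counts()
encoded = brands.map(counts)

# Number of distinct brands versus number of distinct encoded values.
print(brands.nunique(), encoded.nunique())

# Group brands that share the same count and are therefore indistinguishable.
shared = {}
for brand, count in counts.items():
    shared.setdefault(int(count), []).append(brand)
print(shared)
```

After encoding, the model cannot tell Apple from Samsung, or OnePlus from Xiaomi and Realme, using this feature alone.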