## Data Encoding Techniques

Machine learning algorithms typically require numerical input, so categorical data must be encoded.

### LabelEncoder

In [None]:
from sklearn.preprocessing import LabelEncoder

data = ['red', 'blue', 'green', 'blue', 'red']

encoder = LabelEncoder()

encoded = encoder.fit_transform(data)

print(encoded)  # [2 0 1 0 2]: codes follow alphabetical order (blue=0, green=1, red=2)

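Best for: Target labels (y) in classification. LabelEncoder assigns integers in alphabetical order, so it does not preserve a meaningful ranking; for ordered features, use OrdinalEncoder with an explicit category order.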
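LabelEncoder assigns its integer codes by sorting the labels, which can be checked by inspecting the fitted encoder. The sketch below reuses the color list from the cell above; the `mapping` dictionary is only an illustrative helper.

```python
from sklearn.preprocessing import LabelEncoder

data = ['red', 'blue', 'green', 'blue', 'red']

encoder = LabelEncoder()
encoded = encoder.fit_transform(data)

# classes_ holds the learned labels in sorted order; the position is the code
mapping = {str(label): code for code, label in enumerate(encoder.classes_)}
print(mapping)

# inverse_transform recovers the original labels from the integer codes
decoded = [str(label) for label in encoder.inverse_transform(encoded)]
print(decoded)
```

The codes reflect alphabetical position only, so 'red' receiving the largest code says nothing about the colors themselves.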

### OneHotEncoder

One-Hot Encoding is a technique for converting categorical variables into a binary (0/1) numerical format, where each category becomes a new binary feature.

It's ideal for nominal data (categories without inherent order), such as:

    Colors (Red, Green, Blue)

    Countries (India, USA, UK)

    Product categories (Electronics, Clothing, Books)

Each category is transformed into a new binary column:

    1 indicates the presence of the category

    0 indicates its absence

Example: Encoding Colors

| Original Data | Red | Green | Blue |
|---------------|-----|-------|------|
| Red           | 1   | 0     | 0    |
| Green         | 0   | 1     | 0    |
| Blue          | 0   | 0     | 1    |
| Red           | 1   | 0     | 0    |
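The table can be reproduced by hand in a few lines of plain Python, which may help make the mechanics concrete before using a library. This is a minimal sketch, not how pandas or scikit-learn implement it; the variable names are illustrative.

```python
colors = ['Red', 'Green', 'Blue', 'Red']

# Column order follows first appearance here; libraries may order columns differently
categories = list(dict.fromkeys(colors))

# One row per sample: 1 in the column of its category, 0 elsewhere
one_hot = [[1 if color == cat else 0 for cat in categories] for color in colors]

for color, row in zip(colors, one_hot):
    print(color, row)
```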

❌ Can cause high dimensionality (many new columns if a feature has many categories)

❌ May lead to sparse data (many zeros)
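Both drawbacks are easy to measure. The sketch below builds a hypothetical high-cardinality feature (the `user_` IDs are made up for illustration) and relies on scikit-learn's OneHotEncoder returning a SciPy sparse matrix by default.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical feature: 1,000 rows drawn from 200 distinct IDs
ids = np.array([f'user_{i % 200}' for i in range(1000)]).reshape(-1, 1)

encoder = OneHotEncoder()  # sparse output by default
encoded = encoder.fit_transform(ids)

n_rows, n_cols = encoded.shape
# Fraction of cells that are non-zero: exactly one per row
density = encoded.nnz / (n_rows * n_cols)
print(n_rows, n_cols, density)
```

One input column becomes 200 output columns, and only one cell per row is non-zero, which is why sparse storage is usually preferable for features like this.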

In [3]:
import pandas as pd

data = pd.DataFrame({
    'color': ['Red', 'Green', 'Blue', 'Red', 'Yellow']
})

encoded_data = pd.get_dummies(data['color'], prefix='color')
print(encoded_data)

   color_Blue  color_Green  color_Red  color_Yellow
0       False        False       True         False
1       False         True      False         False
2        True        False      False         False
3       False        False       True         False
4       False        False      False          True


##### drop='first' avoids multicollinearity

In [6]:
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

data = pd.DataFrame({'color': ['Red', 'Green', 'Blue', 'Red', 'Yellow']})

# drop='first' removes the first category (Blue), which is then
# represented by a row of all zeros
encoder = OneHotEncoder(drop='first', sparse_output=False)

sk_encoded = encoder.fit_transform(data[['color']])

print(pd.DataFrame(sk_encoded, columns=encoder.get_feature_names_out()))

   color_Green  color_Red  color_Yellow
0          0.0        1.0           0.0
1          1.0        0.0           0.0
2          0.0        0.0           0.0
3          0.0        1.0           0.0
4          0.0        0.0           1.0
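The multicollinearity that dropping a column avoids can be seen by summing across the encoded columns. This sketch uses `pd.get_dummies` with `drop_first=True`, the pandas counterpart of `drop='first'`, on a small illustrative frame.

```python
import pandas as pd

data = pd.DataFrame({'color': ['Red', 'Green', 'Blue', 'Red']})

full = pd.get_dummies(data['color'], prefix='color', dtype=int)
reduced = pd.get_dummies(data['color'], prefix='color', drop_first=True, dtype=int)

# Every row of the full encoding sums to 1, so any one column is
# fully determined by the others (collinear with an intercept term)
print(full.sum(axis=1).tolist())

# After dropping the first category, Blue is the all-zeros row
print(reduced.columns.tolist())
print(reduced.sum(axis=1).tolist())
```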


Best for: Nominal data without inherent order

### Ordinal Encoding

Ordinal encoding is a technique for converting categorical variables with inherent order into numerical values while preserving their ordinal relationship. 

It's ideal for features like:

    Size (XS, S, M, L, XL)

    Education Level (High School, Bachelor's, Master's, PhD)

    Ratings (Poor, Fair, Good, Excellent)

    Income Level (Low, Medium, High)

In [7]:
import pandas as pd
data = pd.DataFrame({'size': ['XS', 'S', 'M', 'L', 'XL', 'S', 'M']})
data

  size
0   XS
1    S
2    M
3    L
4   XL
5    S
6    M


In [8]:
size_order = {'XS': 0, 'S': 1, 'M': 2, 'L': 3, 'XL': 4}

size_order

{'XS': 0, 'S': 1, 'M': 2, 'L': 3, 'XL': 4}

In [9]:
data['size_encoded'] = data['size'].map(size_order)
print(data)

  size  size_encoded
0   XS             0
1    S             1
2    M             2
3    L             3
4   XL             4
5    S             1
6    M             2
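An alternative to a hand-written dictionary is an ordered pandas categorical, sketched below with the same size data. Values missing from `categories` would become NaN with code -1, so the category list is assumed to be complete here.

```python
import pandas as pd

data = pd.DataFrame({'size': ['XS', 'S', 'M', 'L', 'XL', 'S', 'M']})

# Declare the order once; codes follow the position in `categories`
size_dtype = pd.CategoricalDtype(categories=['XS', 'S', 'M', 'L', 'XL'], ordered=True)
data['size'] = data['size'].astype(size_dtype)
data['size_encoded'] = data['size'].cat.codes

print(data['size_encoded'].tolist())

# Ordered categoricals also support comparisons against a category
larger_than_s = (data['size'] > 'S').tolist()
print(larger_than_s)
```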


In [10]:
from sklearn.preprocessing import OrdinalEncoder

data = pd.DataFrame({
    'size': ['XS', 'S', 'M', 'L', 'XL'],
    'rating': ['Poor', 'Fair', 'Good', 'Excellent', 'Good']
})

data

  size     rating
0   XS       Poor
1    S       Fair
2    M       Good
3    L  Excellent
4   XL       Good


In [11]:
categories = {
    'size': ['XS', 'S', 'M', 'L', 'XL'],
    'rating': ['Poor', 'Fair', 'Good', 'Excellent']
}

categories

{'size': ['XS', 'S', 'M', 'L', 'XL'],
 'rating': ['Poor', 'Fair', 'Good', 'Excellent']}

In [12]:
encoder = OrdinalEncoder(categories=[categories['size'], categories['rating']])

encoded_data = encoder.fit_transform(data[['size', 'rating']])

data[['size_encoded', 'rating_encoded']] = encoded_data
print(data)

  size     rating  size_encoded  rating_encoded
0   XS       Poor           0.0             0.0
1    S       Fair           1.0             1.0
2    M       Good           2.0             2.0
3    L  Excellent           3.0             3.0
4   XL       Good           4.0             2.0
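The fitted encoder can also map codes back to labels, which is useful when inspecting model inputs. The sketch below repeats the setup from the cells above and calls `inverse_transform`.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

data = pd.DataFrame({
    'size': ['XS', 'S', 'M', 'L', 'XL'],
    'rating': ['Poor', 'Fair', 'Good', 'Excellent', 'Good']
})

encoder = OrdinalEncoder(categories=[
    ['XS', 'S', 'M', 'L', 'XL'],
    ['Poor', 'Fair', 'Good', 'Excellent']
])
encoded = encoder.fit_transform(data[['size', 'rating']])

# Map the numeric codes back to the original labels
decoded = encoder.inverse_transform(encoded)
print(decoded.tolist())
```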


Appropriate for:

    Features with a clear hierarchy (e.g., education levels, income brackets).

    Tree-based models (Decision Trees, Random Forest, XGBoost).

Avoid for:

    Nominal data (no order, e.g., colors, countries). Use One-Hot Encoding instead.

    Linear models (Logistic Regression, SVM) when the equal spacing between integer codes misrepresents the real distance between categories.

Ordinal encoding is a powerful technique when dealing with ordered categorical data. 

Always define the category order explicitly: by default, OrdinalEncoder sorts the categories alphabetically, which rarely matches the intended ranking.

For nominal data (no order), prefer One-Hot Encoding.
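The advice about defining the order manually can be checked directly by comparing the default behavior with an explicit category list. A minimal sketch using the size labels from earlier:

```python
from sklearn.preprocessing import OrdinalEncoder

sizes = [['XS'], ['S'], ['M'], ['L'], ['XL']]

# No `categories` argument: the order is inferred by sorting the values
default_encoder = OrdinalEncoder()
default_codes = default_encoder.fit_transform(sizes).ravel().tolist()

# Explicit order: codes follow the list passed in
explicit_encoder = OrdinalEncoder(categories=[['XS', 'S', 'M', 'L', 'XL']])
explicit_codes = explicit_encoder.fit_transform(sizes).ravel().tolist()

inferred_order = default_encoder.categories_[0].tolist()
print(inferred_order)
print(default_codes)
print(explicit_codes)
```

With the default, 'L' sorts first and 'XS' last, so the codes no longer track garment size.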