## Data Encoding Techniques

Most machine learning algorithms require numerical input, so categorical data must be encoded first.

### LabelEncoder

In [None]:
from sklearn.preprocessing import LabelEncoder

data = ['red', 'blue', 'green', 'blue', 'red']

encoder = LabelEncoder()
encoded = encoder.fit_transform(data)

print(encoded)

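Best for: Target labels (`y`). LabelEncoder assigns integer codes in sorted (alphabetical) order, not in any meaningful ranking, so it does not preserve the order of ordinal features; use Ordinal Encoding with an explicit category order for those.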
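To see how the integer codes are assigned, the fitted encoder can be inspected. This is a minimal sketch reusing the color list from the cell above; the variable names (`colors`, `label_encoder`, `codes`) are chosen here for illustration.

```python
from sklearn.preprocessing import LabelEncoder

colors = ['red', 'blue', 'green', 'blue', 'red']

label_encoder = LabelEncoder()
codes = label_encoder.fit_transform(colors)

# classes_ holds the learned categories in sorted order;
# each category's position in this array is its integer code.
learned_classes = list(label_encoder.classes_)
print(learned_classes)   # ['blue', 'green', 'red']
print(codes.tolist())    # [2, 0, 1, 0, 2]

# inverse_transform maps integer codes back to the original labels.
restored = label_encoder.inverse_transform(codes).tolist()
print(restored)
```

The codes follow alphabetical order (blue=0, green=1, red=2), which carries no meaning for a feature like color.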

### OneHotEncoder

One-Hot Encoding is a technique for converting categorical variables into a binary (0/1) numerical format, where each category becomes a new binary feature.

It's ideal for nominal data (categories without inherent order), such as:

    Colors (Red, Green, Blue)

    Countries (USA, UK, India)

    Product categories (Electronics, Clothing, Books)

Each category is transformed into a new binary column:

    1 indicates the presence of the category

    0 indicates its absence

Example: Encoding Colors

| Original Data | Red | Green | Blue |
|---------------|-----|-------|------|
| Red           | 1   | 0     | 0    |
| Green         | 0   | 1     | 0    |
| Blue          | 0   | 0     | 1    |
| Red           | 1   | 0     | 0    |

❌ Can cause high dimensionality (many new columns if a feature has many categories)
❌ May lead to sparse data (many zeros)
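The two drawbacks above can be made concrete with a high-cardinality feature. The sketch below uses a hypothetical `city` column with 1,000 distinct values (made-up data, for illustration only) and measures how many columns and zeros one-hot encoding produces.

```python
import pandas as pd

# Hypothetical high-cardinality feature: 1,000 rows, each with a distinct city ID.
cities = pd.Series([f"city_{i}" for i in range(1000)], name="city")

one_hot = pd.get_dummies(cities, prefix="city")

n_rows, n_cols = one_hot.shape
# Each row contains exactly one 1, so every other cell is a zero.
zero_fraction = 1 - one_hot.to_numpy().sum() / (n_rows * n_cols)

print(n_rows, n_cols)            # 1000 1000
print(round(zero_fraction, 3))   # 0.999
```

One original column becomes 1,000 columns, and 99.9% of the resulting cells are zeros.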

In [None]:
import pandas as pd

data = pd.DataFrame({
    'color': ['Red', 'Green', 'Blue', 'Red']
})

encoded_data = pd.get_dummies(data['color'], prefix='color')
print(encoded_data)

##### drop='first' removes one redundant column per feature, which avoids multicollinearity

In [None]:
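import pandas as pd
from sklearn.preprocessing import OneHotEncoder

data = pd.DataFrame({
    'color': ['Red', 'Green', 'Blue', 'Red', 'Yellow']
})

# sparse_output=False returns a dense array (parameter name in scikit-learn >= 1.2)
encoder = OneHotEncoder(sparse_output=False, drop='first')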
encoded_data = encoder.fit_transform(data[['color']])

encoded_df = pd.DataFrame(encoded_data, columns=encoder.get_feature_names_out(['color']))
print(encoded_df)

Best for: Nominal data without inherent order
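Why dropping one column avoids multicollinearity can be checked directly. The sketch below uses `pd.get_dummies` with `drop_first=True` as the pandas counterpart of `drop='first'`; the variable names are chosen here for illustration.

```python
import pandas as pd

colors_df = pd.DataFrame({'color': ['Red', 'Green', 'Blue', 'Red']})

full = pd.get_dummies(colors_df['color'], prefix='color', dtype=int)
reduced = pd.get_dummies(colors_df['color'], prefix='color',
                         drop_first=True, dtype=int)

# With every column kept, each row sums to 1, so any one column can be
# computed from the others (perfect multicollinearity).
row_sums = full.sum(axis=1).tolist()
print(row_sums)                  # [1, 1, 1, 1]

# The dropped category (the first in sorted order, 'Blue') is still
# recoverable: it is the row where all remaining columns are 0.
is_blue = (reduced.sum(axis=1) == 0).tolist()
print(list(reduced.columns))     # ['color_Green', 'color_Red']
print(is_blue)                   # [False, False, True, False]
```

No information is lost by dropping the column. This matters mainly for linear models without regularization; tree-based models are generally unaffected either way.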

### Ordinal Encoding

Ordinal encoding is a technique for converting categorical variables with inherent order into numerical values while preserving their ordinal relationship. 
It's ideal for features like:

    Size (XS, S, M, L, XL)

    Education Level (High School, Bachelor's, Master's, PhD)

    Ratings (Poor, Fair, Good, Excellent)

    Income Level (Low, Medium, High)

In [None]:
import pandas as pd
data = pd.DataFrame({'size': ['XS', 'S', 'M', 'L', 'XL', 'S', 'M']})

size_order = {'XS': 0, 'S': 1, 'M': 2, 'L': 3, 'XL': 4}

data['size_encoded'] = data['size'].map(size_order)
print(data)
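One caveat of the dictionary approach is that `map` silently returns `NaN` for any value missing from the mapping. A minimal sketch, assuming a hypothetical unseen size `'XXL'`:

```python
import pandas as pd

size_order = {'XS': 0, 'S': 1, 'M': 2, 'L': 3, 'XL': 4}
sizes = pd.Series(['S', 'XXL', 'M'])  # 'XXL' is not a key in size_order

mapped = sizes.map(size_order)

# Values absent from the dictionary become NaN instead of raising an error,
# so it is worth counting them before training a model.
n_unmapped = int(mapped.isna().sum())
print(mapped.tolist())
print(n_unmapped)   # 1
```

Checking `isna().sum()` after mapping is a cheap way to catch typos or new categories in the data.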

In [None]:
from sklearn.preprocessing import OrdinalEncoder

data = pd.DataFrame({
    'size': ['XS', 'S', 'M', 'L', 'XL'],
    'rating': ['Poor', 'Fair', 'Good', 'Excellent', 'Good']
})

data

In [None]:
categories = {
    'size': ['XS', 'S', 'M', 'L', 'XL'],
    'rating': ['Poor', 'Fair', 'Good', 'Excellent']
}

categories

In [None]:
encoder = OrdinalEncoder(categories=[categories['size'], categories['rating']])
encoded_data = encoder.fit_transform(data[['size', 'rating']])

data[['size_encoded', 'rating_encoded']] = encoded_data
print(data)

Appropriate for:

    Features with a clear hierarchy (e.g., education levels, income brackets).

    Tree-based models (Decision Trees, Random Forest, XGBoost).

Avoid for:

    Nominal data (no order, e.g., colors, countries). Use One-Hot Encoding instead.

    Linear and distance-based models (Logistic Regression, SVM) when the equal spacing between codes implies relationships that do not hold in the data.

Ordinal encoding is a powerful technique when dealing with ordered categorical data. 

Always define the order explicitly: by default, OrdinalEncoder sorts categories alphabetically, which rarely matches the intended order.

For nominal data (no order), prefer One-Hot Encoding.
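To illustrate the advice on defining the order manually, the sketch below compares `OrdinalEncoder` with its default settings against the same encoder given an explicit category list. The variable names are chosen here for illustration.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

sizes_df = pd.DataFrame({'size': ['XS', 'S', 'M', 'L', 'XL']})

# Default: categories are learned from the data and sorted alphabetically.
default_encoder = OrdinalEncoder()
default_codes = default_encoder.fit_transform(sizes_df[['size']]).ravel().tolist()
print(default_encoder.categories_[0].tolist())  # ['L', 'M', 'S', 'XL', 'XS']
print(default_codes)                            # [4.0, 2.0, 1.0, 0.0, 3.0]

# Explicit: the supplied list defines the order, smallest to largest.
explicit_encoder = OrdinalEncoder(categories=[['XS', 'S', 'M', 'L', 'XL']])
explicit_codes = explicit_encoder.fit_transform(sizes_df[['size']]).ravel().tolist()
print(explicit_codes)                           # [0.0, 1.0, 2.0, 3.0, 4.0]
```

With the defaults, `L` gets the lowest code and `XS` the highest, so the encoded values no longer reflect size.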