# Feature Encoding

Feature encoding is the process of transforming `categorical features` into `numeric features`. This is necessary because most machine learning algorithms expect numeric input and cannot work with raw strings. There are many ways to encode categorical features, and each method has its own advantages and disadvantages. In this notebook, we will explore some of the most popular methods:

- Label encoding
- Ordinal encoding
- One-hot encoding
- Binary encoding

In [1]:
# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [3]:
# data load
df = sns.load_dataset('tips')
df.head()

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size
0,16.99,1.01,Female,No,Sun,Dinner,2
1,10.34,1.66,Male,No,Sun,Dinner,3
2,21.01,3.5,Male,No,Sun,Dinner,3
3,23.68,3.31,Male,No,Sun,Dinner,2
4,24.59,3.61,Female,No,Sun,Dinner,4


In [9]:
# Count how often each value of the time column occurs
print(df['time'].value_counts())

print('_________________________________________________________')

# Count how often each day of the week occurs
print(df['day'].value_counts())

print('_________________________________________________________')

# Count smokers and non-smokers
print(df['smoker'].value_counts())

print('_________________________________________________________')

# Count how often each value of the sex column occurs
print(df['sex'].value_counts())

time
Dinner    176
Lunch      68
Name: count, dtype: int64
_________________________________________________________
day
Sat     87
Sun     76
Thur    62
Fri     19
Name: count, dtype: int64
_________________________________________________________
smoker
No     151
Yes     93
Name: count, dtype: int64
_________________________________________________________
sex
Male      157
Female     87
Name: count, dtype: int64


In [12]:
# Label-encode the time column with scikit-learn's LabelEncoder
from sklearn.preprocessing import LabelEncoder, OneHotEncoder, OrdinalEncoder
le = LabelEncoder()
df['encoded_time'] = le.fit_transform(df['time'])
df.head()

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size,encoded_time
0,16.99,1.01,Female,No,Sun,Dinner,2,0
1,10.34,1.66,Male,No,Sun,Dinner,3,0
2,21.01,3.5,Male,No,Sun,Dinner,3,0
3,23.68,3.31,Male,No,Sun,Dinner,2,0
4,24.59,3.61,Female,No,Sun,Dinner,4,0


In [13]:
df['encoded_time'].value_counts()

encoded_time
0    176
1     68
Name: count, dtype: int64
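
To see which integer stands for which label, the fitted encoder can be inspected. The sketch below uses a small hand-made stand-in for the `time` column (an assumption, so that it runs without downloading the dataset) and reads the mapping from `classes_`, which scikit-learn stores in sorted order.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Small stand-in for the `time` column of the tips dataset (illustrative data)
time = pd.Series(['Dinner', 'Dinner', 'Lunch', 'Dinner', 'Lunch'], name='time')

le = LabelEncoder()
codes = le.fit_transform(time)

# classes_ is sorted; the position of each class is its integer code
mapping = {str(label): int(code) for code, label in enumerate(le.classes_)}
print(mapping)

# inverse_transform recovers the original labels from the integer codes
restored = le.inverse_transform(codes)
print(list(restored))
```

Because the codes follow the sorted class names, `Dinner` receives 0 and `Lunch` receives 1, which matches the counts shown for `encoded_time`.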


In [16]:
df['day'].value_counts()

day
Sat     87
Sun     76
Thur    62
Fri     19
Name: count, dtype: int64

In [15]:
# ordinal encoding the day column using specific order
oe = OrdinalEncoder(categories=[['Thur', 'Fri', 'Sat', 'Sun']])
df['encoded_day'] = oe.fit_transform(df[['day']])
df.head()

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size,encoded_time,encoded_day
0,16.99,1.01,Female,No,Sun,Dinner,2,0,3.0
1,10.34,1.66,Male,No,Sun,Dinner,3,0,3.0
2,21.01,3.5,Male,No,Sun,Dinner,3,0,3.0
3,23.68,3.31,Male,No,Sun,Dinner,2,0,3.0
4,24.59,3.61,Female,No,Sun,Dinner,4,0,3.0


In [17]:
df['encoded_day'].value_counts()

encoded_day
2.0    87
3.0    76
0.0    62
1.0    19
Name: count, dtype: int64
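
The explicit category order is what distinguishes this from label encoding. As a minimal sketch on a hand-made `day` column (illustrative data, not the tips dataset), the same order can be applied either with `OrdinalEncoder` or, as a pandas-only alternative, with an ordered `Categorical`.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Illustrative data standing in for the `day` column
days = pd.DataFrame({'day': ['Sun', 'Sat', 'Thur', 'Fri', 'Sat']})
order = ['Thur', 'Fri', 'Sat', 'Sun']

oe = OrdinalEncoder(categories=[order])
# fit_transform returns a 2-D float array; ravel() flattens it to one column
days['encoded_day'] = oe.fit_transform(days[['day']]).ravel()

# pandas-only alternative: an ordered categorical whose codes are integers
days['day_code'] = pd.Categorical(days['day'], categories=order, ordered=True).codes

print(days)
```

Both columns carry the same ranks (Thur=0, Fri=1, Sat=2, Sun=3); the scikit-learn version stores them as floats, the pandas version as integers.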

In [32]:
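# One-hot encode the sex column; sparse_output=False returns a dense array
# (the parameter was called `sparse` before scikit-learn 1.2)
onehot_encoder = OneHotEncoder(sparse_output=False)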
sex_onehot = onehot_encoder.fit_transform(df[['sex']])
sex_onehot_df = pd.DataFrame(sex_onehot, columns=onehot_encoder.get_feature_names_out(['sex']))
df = pd.concat([df.reset_index(drop=True), sex_onehot_df.reset_index(drop=True)], axis=1)
df.head()



Unnamed: 0,total_bill,tip,sex,smoker,day,time,size,encoded_time,encoded_day,sex_Female,sex_Male
0,16.99,1.01,Female,No,Sun,Dinner,2,0,3.0,1.0,0.0
1,10.34,1.66,Male,No,Sun,Dinner,3,0,3.0,0.0,1.0
2,21.01,3.5,Male,No,Sun,Dinner,3,0,3.0,0.0,1.0
3,23.68,3.31,Male,No,Sun,Dinner,2,0,3.0,0.0,1.0
4,24.59,3.61,Female,No,Sun,Dinner,4,0,3.0,1.0,0.0


In [33]:
# Label-encode the sex column with scikit-learn's LabelEncoder
from sklearn.preprocessing import LabelEncoder, OneHotEncoder, OrdinalEncoder
le = LabelEncoder()
df['encoded_sex'] = le.fit_transform(df['sex'])
df.head()

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size,encoded_time,encoded_day,sex_Female,sex_Male,encoded_sex
0,16.99,1.01,Female,No,Sun,Dinner,2,0,3.0,1.0,0.0,0
1,10.34,1.66,Male,No,Sun,Dinner,3,0,3.0,0.0,1.0,1
2,21.01,3.5,Male,No,Sun,Dinner,3,0,3.0,0.0,1.0,1
3,23.68,3.31,Male,No,Sun,Dinner,2,0,3.0,0.0,1.0,1
4,24.59,3.61,Female,No,Sun,Dinner,4,0,3.0,1.0,0.0,0


In [34]:
df['encoded_sex'].value_counts()

encoded_sex
1    157
0     87
Name: count, dtype: int64


In [27]:
# Load the Titanic dataset for a second one-hot encoding example
titanic = sns.load_dataset('titanic')
titanic.head()

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,0,3,male,22.0,1,0,7.25,S,Third,man,True,,Southampton,no,False
1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False
2,1,3,female,26.0,0,0,7.925,S,Third,woman,False,,Southampton,yes,True
3,1,1,female,35.0,1,0,53.1,S,First,woman,False,C,Southampton,yes,False
4,0,3,male,35.0,0,0,8.05,S,Third,man,True,,Southampton,no,True


In [26]:
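# One-hot encode the embarked column of the Titanic dataset
titanic = sns.load_dataset('titanic')

# sparse_output=False returns a dense array (the parameter was called `sparse` before scikit-learn 1.2)
onehot_encoder = OneHotEncoder(sparse_output=False)
# Missing values in embarked get their own column, embarked_nan
embarked_onehot = onehot_encoder.fit_transform(titanic[['embarked']])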
embarked_onehot_df = pd.DataFrame(embarked_onehot, columns=onehot_encoder.get_feature_names_out(['embarked']))
titanic = pd.concat([titanic.reset_index(drop=True), embarked_onehot_df.reset_index(drop=True)], axis=1)
titanic.head()



Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone,embarked_C,embarked_Q,embarked_S,embarked_nan
0,0,3,male,22.0,1,0,7.25,S,Third,man,True,,Southampton,no,False,0.0,0.0,1.0,0.0
1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False,1.0,0.0,0.0,0.0
2,1,3,female,26.0,0,0,7.925,S,Third,woman,False,,Southampton,yes,True,0.0,0.0,1.0,0.0
3,1,1,female,35.0,1,0,53.1,S,First,woman,False,C,Southampton,yes,False,0.0,0.0,1.0,0.0
4,0,3,male,35.0,0,0,8.05,S,Third,man,True,,Southampton,no,True,0.0,0.0,1.0,0.0


# Binary Encoding

In [37]:
# !pip install category_encoders

In [38]:
df = sns.load_dataset('tips')
df.head()

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size
0,16.99,1.01,Female,No,Sun,Dinner,2
1,10.34,1.66,Male,No,Sun,Dinner,3
2,21.01,3.5,Male,No,Sun,Dinner,3
3,23.68,3.31,Male,No,Sun,Dinner,2
4,24.59,3.61,Female,No,Sun,Dinner,4


In [42]:
from category_encoders import BinaryEncoder

binary_encoder = BinaryEncoder()
df_binary = binary_encoder.fit_transform(df['day'])
df_binary.head()

Unnamed: 0,day_0,day_1,day_2
0,0,0,1
1,0,0,1
2,0,0,1
3,0,0,1
4,0,0,1
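
To make the output above less opaque, here is a hand-rolled sketch of the idea using only pandas. It assumes categories are numbered from 1 in order of first appearance and written with the most significant bit first; this reproduces `Sun -> 0, 0, 1` as shown above, but it illustrates the technique and is not the `category_encoders` implementation.

```python
import pandas as pd

# Illustrative data standing in for the `day` column
day = pd.Series(['Sun', 'Sat', 'Thur', 'Fri', 'Sat'], name='day')

# Step 1: ordinal codes in order of first appearance, starting at 1
codes = pd.factorize(day)[0] + 1

# Step 2: number of bits needed to write the largest code
n_bits = int(codes.max()).bit_length()

# Step 3: one column per bit, most significant bit first
binary = pd.DataFrame(
    {f'day_{i}': (codes >> (n_bits - 1 - i)) & 1 for i in range(n_bits)}
)
print(binary)
```

Four categories need only three columns here, and the saving grows with cardinality: 1,000 categories would need 10 binary columns instead of 1,000 one-hot columns.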


## Assignment: how many types of feature encoding are there?
There is no fixed number, but the most commonly used types of feature encoding are:

1. **Ordinal Encoding:**
   - Assigns unique integers to different categories based on their order or rank. Useful when the categories have a meaningful order.

2. **One-Hot Encoding:**
   - Represents each category as a binary vector. It creates a new binary column for each category and marks the presence of that category with a 1.

3. **Label Encoding:**
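   - Assigns a unique integer to each category. It is simple but should be used carefully, because the integers imply an order that linear and distance-based models may misinterpret. Note that scikit-learn's `LabelEncoder` is intended for target labels and numbers the classes in sorted order.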

4. **Binary Encoding:**
   - Represents the categories with binary code. It involves two steps: first, convert categories to numeric using ordinal encoding, and then convert those integers to binary code.

5. **Count Encoding:**
   - Replaces each category with the count of occurrences in the dataset. Useful when the frequency of a category is informative.

6. **Frequency Encoding:**
   - Similar to count encoding, but replaces each category with its relative frequency (its count divided by the total number of rows) instead of the raw count.

7. **Target Encoding (Mean Encoding):**
   - Replaces each category with the mean of the target variable for that category. It works for both classification and regression, but the means should be computed from training data only to avoid leaking the target into the features.

8. **Feature Hashing:**
   - Transforms categorical features into a fixed-length numeric vector using a hash function. It is useful for high-cardinality categorical features, at the cost of occasional collisions between categories.

9. **Entity Embeddings of Categorical Variables:**
   - Involves representing categorical variables as vectors of continuous numbers, similar to word embeddings in natural language processing.

These methods have their advantages and are suitable for different scenarios. The choice of encoding method often depends on the nature of the data, the machine learning algorithm being used, and the specific requirements of the problem at hand.
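
Count, frequency, and target encoding are not demonstrated elsewhere in this notebook, so here is a minimal pandas sketch of all three. The small `df_demo` frame and its column names are made up for illustration, and `tip` is treated as the target purely as an assumption.

```python
import pandas as pd

# Illustrative data; `tip` plays the role of the target variable
df_demo = pd.DataFrame({
    'day': ['Sat', 'Sun', 'Sat', 'Fri', 'Sat', 'Sun'],
    'tip': [3.0, 2.0, 4.0, 1.0, 5.0, 4.0],
})

# Count encoding: number of rows in which each category occurs
df_demo['day_count'] = df_demo['day'].map(df_demo['day'].value_counts())

# Frequency encoding: share of rows in which each category occurs
df_demo['day_freq'] = df_demo['day'].map(df_demo['day'].value_counts(normalize=True))

# Target (mean) encoding: mean of the target within each category
df_demo['day_target'] = df_demo.groupby('day')['tip'].transform('mean')

print(df_demo)
```

In a real project the target means would be learned on the training split only and then mapped onto validation and test data; computing them on the full dataset, as in this sketch, leaks the target.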

## When to use which type of feature encoding?


Choosing the appropriate feature encoding method depends on the nature of the categorical data and the requirements of the machine learning task. Here are some guidelines along with examples:

1. **Ordinal Encoding:**
   - **When to Use:**
     - Categories have a clear order or rank.
   - **Example:**
     - Education levels (e.g., Low, Medium, High).

2. **One-Hot Encoding:**
   - **When to Use:**
     - Categories are not ordinal, and there is no inherent order.
   - **Example:**
     - Colors (e.g., Red, Blue, Green).

3. **Label Encoding:**
   - **When to Use:**
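     - The column is the target label, the feature has only two values, or the model (for example, a tree-based one) is not misled by the arbitrary integer order.
   - **Example:**
     - Yes/No flags, or class labels in a classification target.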

4. **Binary Encoding:**
   - **When to Use:**
     - High cardinality categorical features.
   - **Example:**
     - ZIP codes or Product IDs.

5. **Count Encoding:**
   - **When to Use:**
     - The number of times a category occurs is itself informative.
   - **Example:**
     - How often each product ID appears in a transaction log.

6. **Frequency Encoding:**
   - **When to Use:**
     - Similar to count encoding, but emphasizes the relative frequency.
   - **Example:**
     - Percentage of transactions by category.

7. **Target Encoding (Mean Encoding):**
   - **When to Use:**
     - Supervised tasks where the mean of the target within each category carries useful signal.
   - **Example:**
     - Default rate for each occupation category in a loan default prediction.

8. **Feature Hashing:**
   - **When to Use:**
     - Dealing with high cardinality and memory constraints.
   - **Example:**
     - User IDs in a large-scale recommendation system.

9. **Entity Embeddings of Categorical Variables:**
   - **When to Use:**
     - Embedding categorical variables into continuous vectors for neural network-based models.
   - **Example:**
     - Embedding user or product categories in collaborative filtering.

Choosing the right encoding method is crucial for the performance of machine learning models. It's essential to consider the characteristics of the data and the assumptions of the algorithms being used. Experimentation and validation on the specific task are often necessary to determine the most effective encoding strategy.
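
For the high-cardinality case, feature hashing can be sketched with scikit-learn's `FeatureHasher`. The user IDs and the choice of 8 output columns below are made up for illustration; real applications typically use far more columns to keep collisions rare.

```python
from sklearn.feature_extraction import FeatureHasher

# Illustrative IDs; in practice there may be millions of distinct values
user_ids = ['user_17', 'user_42', 'user_17', 'user_99']

# n_features fixes the output width regardless of how many IDs exist
hasher = FeatureHasher(n_features=8, input_type='string')

# Each sample is a list of string tokens; here there is one token per row
hashed = hasher.transform([[uid] for uid in user_ids]).toarray()
print(hashed.shape)
print(hashed)
```

Each row has a single non-zero entry (+1 or -1) at the position chosen by the hash, and identical IDs always land in the same position. No vocabulary is stored, which is why the method suits memory-constrained settings.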

In [5]:
# use pandas for feature encoding
import seaborn as sns
import pandas as pd
df = sns.load_dataset('tips')
df.head()

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size
0,16.99,1.01,Female,No,Sun,Dinner,2
1,10.34,1.66,Male,No,Sun,Dinner,3
2,21.01,3.5,Male,No,Sun,Dinner,3
3,23.68,3.31,Male,No,Sun,Dinner,2
4,24.59,3.61,Female,No,Sun,Dinner,4


In [6]:
# One-hot encode the day column with pandas get_dummies
df_dummies = pd.get_dummies(df, columns=['day'])
# In pandas 2.0 and later the new columns hold True/False values rather than 1/0.
df_dummies.head()

Unnamed: 0,total_bill,tip,sex,smoker,time,size,day_Thur,day_Fri,day_Sat,day_Sun
0,16.99,1.01,Female,No,Dinner,2,False,False,False,True
1,10.34,1.66,Male,No,Dinner,3,False,False,False,True
2,21.01,3.5,Male,No,Dinner,3,False,False,False,True
3,23.68,3.31,Male,No,Dinner,2,False,False,False,True
4,24.59,3.61,Female,No,Dinner,4,False,False,False,True
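
If 0/1 columns are preferred over True/False, `get_dummies` accepts a `dtype` argument, and `drop_first=True` removes one redundant column per feature. The sketch below uses a small hand-made frame (illustrative data); because its `day` column holds plain strings rather than a pandas categorical, the dummy columns come out in alphabetical order.

```python
import pandas as pd

# Illustrative data standing in for the tips dataset
df_demo = pd.DataFrame({
    'day': ['Thur', 'Fri', 'Sat', 'Sun', 'Sat'],
    'size': [2, 3, 2, 4, 2],
})

# dtype=int yields 0/1 columns instead of True/False
dummies_int = pd.get_dummies(df_demo, columns=['day'], dtype=int)

# drop_first=True drops the first category's column (here day_Fri),
# since it is implied when all remaining day columns are 0
dummies_reduced = pd.get_dummies(df_demo, columns=['day'], dtype=int, drop_first=True)

print(dummies_int)
print(dummies_reduced)
```

Dropping one column avoids perfectly collinear features, which matters mainly for linear models without regularization.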
