# **Feature Encoding**
> Feature encoding is the process of transforming `categorical features` into `numeric features`. This is necessary because most machine learning algorithms expect numeric input and cannot work with raw category labels directly. There are many different ways to encode categorical features, and each method has its own advantages and disadvantages. In this notebook, we will explore some of the most popular methods for encoding categorical features, such as:

1. `Label encoding`
2. `Ordinal encoding`
3. `One-hot encoding`
4. `Binary encoding`
5. Frequency / Count Encoding
6. Mean Encoding
7. Hashing Encoding
8. Helmert Encoding
9. Sum Encoding
10. Polynomial Encoding
11. Backward Difference Encoding

#### **What are the benefits of encoding?**
Encoding categorical data has several benefits:

1. **Improved Machine Learning Model Performance**: Many machine learning algorithms perform better with numerical input. Encoding transforms categorical data into a format that the algorithms can work with more effectively.

2. **Efficient Representation**: Some forms of encoding can represent the data more efficiently, especially when dealing with a large number of categories. For example, binary encoding can represent the same information as one-hot encoding with fewer features.

3. **Preserving Information**: Certain types of encoding can preserve information about the category order. For example, ordinal encoding can preserve information about the order of categories, which can be useful for ordinal data.

4. **Better Understanding of the Data**: Encoding can sometimes help to better understand the data. For example, writing out the category-to-integer mapping for an ordinal feature makes the ranking of its categories explicit.

5. **Handling of Complex Structures**: Certain types of encoding can handle complex structures in the data. For example, one-hot encoding can handle nominal data where no order is present.

6. **Reduced Memory Usage**: Some encoding techniques can reduce memory usage by creating fewer new features compared to other methods. For example, binary encoding creates fewer new features than one-hot encoding for the same number of categories.

Remember, the choice of encoding method depends on the specific dataset and the machine learning algorithm you are using. Different encoding methods may be appropriate for different types of categorical data (ordinal vs nominal) and different types of models.


In [1]:
# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
# data load
df = sns.load_dataset('tips')
df.head()

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size
0,16.99,1.01,Female,No,Sun,Dinner,2
1,10.34,1.66,Male,No,Sun,Dinner,3
2,21.01,3.5,Male,No,Sun,Dinner,3
3,23.68,3.31,Male,No,Sun,Dinner,2
4,24.59,3.61,Female,No,Sun,Dinner,4


In [3]:
df['time'].value_counts()

time
Dinner    176
Lunch      68
Name: count, dtype: int64

#### **1. Label Encoding:**

> Label encoding is a technique for transforming categorical data into numerical data. Each unique category value is assigned an integer value. For example, "Apple" is 1, "Banana" is 2, "Cherry" is 3. (scikit-learn's `LabelEncoder` starts counting at 0 rather than 1.)

After label encoding, your data would look like this:

| Fruit  | Fruit_Label |
|--------|-------------|
| Apple  | 1           |
| Banana | 2           |
| Cherry | 3           |
| Apple  | 1           |
| Banana | 2           |
| Cherry | 3           |
| Apple  | 1           |

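Now, the "Fruit_Label" column can be used in a machine learning algorithm, with "Apple", "Banana", and "Cherry" represented by the numbers 1, 2, and 3. Keep in mind that the algorithm sees only the numbers: many models will treat 3 as greater than 1, even though the fruits have no natural order.

**When to use**: Label encoding works well for target labels and for binary features, and it is often acceptable for nominal features in tree-based models. For ordinal variables where the order matters (e.g., "low", "medium", "high"), prefer ordinal encoding with an explicit category order, because scikit-learn's `LabelEncoder` assigns codes in sorted (alphabetical) order rather than in the meaningful order.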

In [5]:
# encode the 'time' column with scikit-learn's LabelEncoder
from sklearn.preprocessing import LabelEncoder, OneHotEncoder, OrdinalEncoder
le = LabelEncoder()
df['encoded_time'] = le.fit_transform(df['time'])
df.head()

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size,encoded_time
0,16.99,1.01,Female,No,Sun,Dinner,2,0
1,10.34,1.66,Male,No,Sun,Dinner,3,0
2,21.01,3.5,Male,No,Sun,Dinner,3,0
3,23.68,3.31,Male,No,Sun,Dinner,2,0
4,24.59,3.61,Female,No,Sun,Dinner,4,0


The above code uses the `LabelEncoder` class from the `sklearn.preprocessing` module to encode the 'time' column of the DataFrame `df`.

Here's a step-by-step explanation:

1. `from sklearn.preprocessing import LabelEncoder, OneHotEncoder, OrdinalEncoder`: This line is importing the necessary classes for encoding from sklearn's preprocessing module.

2. `le = LabelEncoder()`: This line is creating an instance of the LabelEncoder class.

3. `df['encoded_time'] = le.fit_transform(df['time'])`: This line is fitting the label encoder and then transforming the 'time' column of the DataFrame. The transformed data (i.e., the encoded labels) are then stored in a new column in the DataFrame called 'encoded_time'.

4. `df.head()`: This line is displaying the first five rows of the DataFrame.

**Observations from the Output:**
- The output is the first five rows of the DataFrame after label encoding. 
- The '`encoded_time`' column contains the encoded values of the 'time' column. 
- The 'time' column contains two unique values, 'Dinner' and 'Lunch' (see the `value_counts()` output above), which are encoded as 0 and 1 respectively. The first five rows are all 'Dinner', so they all show 0.
- The mapping does not depend on the order in which the values appear in the column: `LabelEncoder` sorts the unique values, so the codes follow alphabetical order and can be read from `le.classes_`.
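To check which integer each category received, the fitted encoder can be inspected directly. The following is a minimal sketch on a small stand-in Series (the values are an assumption chosen to mirror the 'time' column, not the full dataset); it relies on the `classes_` attribute and `inverse_transform` method of `LabelEncoder`.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Small stand-in for the 'time' column (illustrative values only)
time = pd.Series(["Dinner", "Lunch", "Dinner", "Lunch", "Dinner"])

le = LabelEncoder()
codes = le.fit_transform(time)

# classes_ holds the categories in sorted order; the position is the code
mapping = {label: int(code) for code, label in enumerate(le.classes_)}
print(mapping)

# inverse_transform turns the integer codes back into the original labels
restored = le.inverse_transform(codes)
print(list(restored))
```

Because the classes are sorted, 'Dinner' comes before 'Lunch' regardless of which value appears first in the data.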

In [6]:
df['day'].value_counts()

day
Sat     87
Sun     76
Thur    62
Fri     19
Name: count, dtype: int64

#### **2. Ordinal Encoding:**
> Ordinal encoding is a type of encoding for ordinal variables. Ordinal variables are categorical variables in which the categories can be meaningfully ordered. The integer assigned to each category `respects the order` of the category.

After ordinal encoding, your data would look like this:

| T-shirt Size | T-shirt Size Ordinal |
|--------------|----------------------|
| Small        | 1                    |
| Medium       | 2                    |
| Large        | 3                    |
| Small        | 1                    |
| Medium       | 2                    |
| Large        | 3                    |
| Small        | 1                    |

Now, the "`T-shirt Size Ordinal`" column can be used in a machine learning algorithm. The algorithm will understand that "Small", "Medium", and "Large" are different sizes of T-shirts, represented by the numbers 1, 2, and 3. Furthermore, it will understand that "Small" is smaller than "Medium", and "Medium" is smaller than "Large", because of the order of the numbers.

**When to use**: Ordinal encoding is best used for ordinal variables where the order of categories is important.

In [11]:
# ordinal encoding the day column using specific order
oe = OrdinalEncoder(categories=[['Thur', 'Fri', 'Sat', 'Sun']])
df['encoded_day'] = oe.fit_transform(df[['day']])
df.head()

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size,encoded_time,encoded_day
0,16.99,1.01,Female,No,Sun,Dinner,2,0,3.0
1,10.34,1.66,Male,No,Sun,Dinner,3,0,3.0
2,21.01,3.5,Male,No,Sun,Dinner,3,0,3.0
3,23.68,3.31,Male,No,Sun,Dinner,2,0,3.0
4,24.59,3.61,Female,No,Sun,Dinner,4,0,3.0


The above code is using the OrdinalEncoder class from the sklearn.preprocessing module to encode the 'day' column of the DataFrame `df`.

Here's a step-by-step explanation:

1. `from sklearn.preprocessing import OrdinalEncoder`: This import already ran in the label encoding cell above (together with `LabelEncoder` and `OneHotEncoder`), so the cell here does not repeat it.

2. `oe = OrdinalEncoder(categories=[['Thur', 'Fri', 'Sat', 'Sun']])`: This line is creating an instance of the OrdinalEncoder class. The 'categories' parameter is set to a list of lists, one inner list per column being encoded; here the single inner list gives the categories of the 'day' column in the desired order.

3. `df['encoded_day'] = oe.fit_transform(df[['day']])`: This line is fitting the ordinal encoder and then transforming the 'day' column of the DataFrame. The transformed data (i.e., the encoded labels) are then stored in a new column in the DataFrame called 'encoded_day'.

4. `df.head()`: This line is displaying the first five rows of the DataFrame.

**Observations from the Output:**
- The output is the first five rows of the DataFrame after ordinal encoding. 
- The '`encoded_day`' column contains the encoded values of the 'day' column. 
- In this case, 'Thur' has been encoded as 0, 'Fri' as 1, 'Sat' as 2, and 'Sun' as 3, as per the order specified in the 'categories' parameter. `OrdinalEncoder` returns floating-point values, which is why the column shows 3.0 rather than 3.
- The machine learning algorithm will now understand that 'Sun' comes after 'Sat', 'Fri', and 'Thur', because of the order of the numbers.
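The same ordered mapping can also be sketched with pandas alone, using an ordered `Categorical`. This is an alternative shown for illustration, not the method used in the cell above; the sample values are assumed, and the category order is the same one passed to `OrdinalEncoder`.

```python
import pandas as pd

# Illustrative sample of day values (not the full dataset)
day = pd.Series(["Sun", "Thur", "Sat", "Fri", "Sun"])

# Same explicit order as in the OrdinalEncoder example
order = ["Thur", "Fri", "Sat", "Sun"]

# An ordered Categorical stores each value as an integer code
day_cat = pd.Categorical(day, categories=order, ordered=True)
encoded_day = pd.Series(day_cat.codes)

print(encoded_day.tolist())
```

Unlike `OrdinalEncoder`, this yields integer codes, and any value missing from `order` would be coded as -1.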

#### **3. One-Hot Encoding:**
> One-hot encoding converts a categorical variable into a set of binary columns so that it can be passed to machine learning algorithms. Each category value gets its own new column holding 1 or 0, so every row is represented as a binary vector with a single 1 in the column of its category.

After one-hot encoding, your data would look like this:

| Fruit  | Is_Apple | Is_Banana | Is_Cherry |
|--------|----------|-----------|-----------|
| Apple  | 1        | 0         | 0         |
| Banana | 0        | 1         | 0         |
| Cherry | 0        | 0         | 1         |
| Apple  | 1        | 0         | 0         |
| Banana | 0        | 1         | 0         |
| Cherry | 0        | 0         | 1         |
| Apple  | 1        | 0         | 0         |

Now, the "Is_Apple", "Is_Banana", and "Is_Cherry" columns can be used in a machine learning algorithm. The algorithm will understand that "Apple", "Banana", and "Cherry" are different types of fruits, represented by the binary vectors (1, 0, 0), (0, 1, 0), and (0, 0, 1) respectively.
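The fruit table above can be reproduced with a few lines of pandas. This is a minimal sketch: the `Is` prefix is chosen only to match the column names in the table, and `dtype=int` is passed so the indicators print as 1/0.

```python
import pandas as pd

# The fruit column from the example table
fruit = pd.DataFrame(
    {"Fruit": ["Apple", "Banana", "Cherry", "Apple", "Banana", "Cherry", "Apple"]}
)

# One indicator column per category, named Is_<category>
dummies = pd.get_dummies(fruit["Fruit"], prefix="Is", dtype=int)

# Put the original column and its indicators side by side
encoded = pd.concat([fruit, dummies], axis=1)
print(encoded)
```

Each row has exactly one 1 among the three indicator columns.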

**When to use**: One-hot encoding is best used for nominal variables, where the categories have no meaningful order (e.g., "cat", "dog", "mouse"). It is most practical when the variable has a small number of categories, because one new column is created per category.

In [15]:
# one hot encoding on the sex column
ohe = OneHotEncoder()
ohe.fit_transform(df[['sex']]).toarray()

array([[1., 0.],
       [0., 1.],
       [0., 1.],
       [0., 1.],
       [1., 0.],
       ...
(output truncated)

#### **One-Hot Encoder Example from Titanic dataset:**

In [13]:
# example of one hot encoding
titanic = sns.load_dataset('titanic')
titanic.head()

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,0,3,male,22.0,1,0,7.25,S,Third,man,True,,Southampton,no,False
1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False
2,1,3,female,26.0,0,0,7.925,S,Third,woman,False,,Southampton,yes,True
3,1,1,female,35.0,1,0,53.1,S,First,woman,False,C,Southampton,yes,False
4,0,3,male,35.0,0,0,8.05,S,Third,man,True,,Southampton,no,True


In [22]:
from sklearn.preprocessing import OneHotEncoder
import seaborn as sns
import pandas as pd

# load the dataset
titanic = sns.load_dataset('titanic')

# create an instance of OneHotEncoder
onehot_encoder = OneHotEncoder()

# fit and transform the 'embarked' column of the DataFrame
embarked_onehot = onehot_encoder.fit_transform(titanic[['embarked']]).toarray()

# get the column names in the encoder's own category order
# (the encoder sorts categories, so unique() would mislabel the columns)
embarked_columns = onehot_encoder.get_feature_names_out(['embarked'])

# create a DataFrame from the one-hot encoded data
embarked_onehot_df = pd.DataFrame(embarked_onehot, columns=embarked_columns)

# concatenate the original DataFrame with the one-hot encoded DataFrame
titanic = pd.concat([titanic.reset_index(drop=True), embarked_onehot_df.reset_index(drop=True)], axis=1)

titanic.head()

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone,embarked_C,embarked_Q,embarked_S,embarked_nan
0,0,3,male,22.0,1,0,7.25,S,Third,man,True,,Southampton,no,False,0.0,0.0,1.0,0.0
1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False,1.0,0.0,0.0,0.0
2,1,3,female,26.0,0,0,7.925,S,Third,woman,False,,Southampton,yes,True,0.0,0.0,1.0,0.0
3,1,1,female,35.0,1,0,53.1,S,First,woman,False,C,Southampton,yes,False,0.0,0.0,1.0,0.0
4,0,3,male,35.0,0,0,8.05,S,Third,man,True,,Southampton,no,True,0.0,0.0,1.0,0.0


#### **4. Binary Encoding:**
> Binary encoding converts a category into binary digits, and each binary digit becomes one feature column. The categories are first encoded as ordinal integers, and those integers are then written in binary. For n unique categories this needs only about log2(n) feature columns (rounded up), instead of the n columns that one-hot encoding creates. Implementations that start the ordinal codes at 1 rather than 0 may need one extra column for some values of n.

After binary encoding, your data would look like this:

| Animal | Animal_Binary |
|--------|---------------|
| Cat    | 01            |
| Dog    | 10            |
| Bird   | 11            |
| Cat    | 01            |
| Dog    | 10            |
| Bird   | 11            |
| Cat    | 01            |

In binary encoding, each category value is first converted to numerical form using ordinal encoding, then those integers are converted into binary code. So, if we have 3 categories like 'Cat', 'Dog', and 'Bird', they could first be encoded into ordinal values like 1, 2, and 3.

Then, these ordinal values are converted into binary: 1 is 01, 2 is 10, and 3 is 11 in binary.
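The two steps described above can be sketched in plain Python. This is an illustrative sketch of the idea, not the implementation used by `category_encoders`; the helper `to_bits` is a hypothetical name, and the codes start at 1 in order of first appearance to match the table.

```python
animals = ["Cat", "Dog", "Bird", "Cat", "Dog", "Bird", "Cat"]

# Step 1: ordinal codes starting at 1, in order of first appearance
codes = {}
for animal in animals:
    codes.setdefault(animal, len(codes) + 1)

# Step 2: number of binary digits needed for the largest code
n_bits = max(codes.values()).bit_length()


def to_bits(code, width):
    """Return the binary digits of `code`, zero-padded to `width` columns."""
    return [int(bit) for bit in format(code, f"0{width}b")]


# Each row becomes one list of 0/1 values, one per feature column
encoded = [to_bits(codes[animal], n_bits) for animal in animals]

for animal, bits in zip(animals, encoded):
    print(animal, bits)
```

With three categories, two binary columns are enough, compared with three columns for one-hot encoding.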

Now, the "Animal_Binary" column can be used in a machine learning algorithm. The algorithm will understand that "Cat", "Dog", and "Bird" are different types of animals, represented by the binary codes 01, 10, and 11 respectively.

**When to use**: Binary encoding is best used when dealing with a `large number of categories` in a categorical feature. It helps to keep the dimensionality low.

#### **You must install this library for binary encoding:**

In [None]:
# !pip install category_encoders

In [23]:
df = sns.load_dataset('tips')
df.head()

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size
0,16.99,1.01,Female,No,Sun,Dinner,2
1,10.34,1.66,Male,No,Sun,Dinner,3
2,21.01,3.5,Male,No,Sun,Dinner,3
3,23.68,3.31,Male,No,Sun,Dinner,2
4,24.59,3.61,Female,No,Sun,Dinner,4


In [None]:
from category_encoders import BinaryEncoder

binary_encoder = BinaryEncoder()
df_binary = binary_encoder.fit_transform(df['day'])
df_binary

Unnamed: 0,day_0,day_1,day_2
0,0,0,1
1,0,0,1
2,0,0,1
3,0,0,1
4,0,0,1
...,...,...,...
239,0,1,0
240,0,1,0
241,0,1,0
242,0,1,0


---
5. **Frequency Encoding**: Replaces each category with how often it occurs in the data, either as a raw count or as a proportion.

6. **Mean Encoding**: Also known as target encoding. It involves calculating the mean of the target variable for each category and replacing the category with that mean.

7. **Hashing Encoding**: Each category is converted into a string and passed through a hash function, and the hash value selects one of a fixed number of output columns. The number of features stays fixed no matter how many categories appear, at the cost of occasional collisions where different categories share a column.

8. **Helmert Encoding**: The mean of the dependent variable for a level is compared to the mean of the dependent variable over all previous levels.

9. **Sum Encoding**: Also known as Deviation Encoding or Effect Encoding. It is similar to one-hot encoding, but each level is compared with the overall mean of the dependent variable rather than with a reference level.

10. **Polynomial Encoding**: A contrast coding scheme that looks for linear, quadratic, and higher-order trends across the levels. It is intended for ordinal variables whose levels are roughly equally spaced.

11. **Backward Difference Encoding**: The mean of the dependent variable for a level is compared with the mean of the dependent variable for the prior level.
---
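Frequency encoding and mean encoding from the list above are simple enough to sketch with pandas alone. The small DataFrame below is made-up toy data, and using `tip` as the target is an assumption for illustration only.

```python
import pandas as pd

# Toy data for illustration (not the real tips dataset)
df_demo = pd.DataFrame(
    {
        "day": ["Sat", "Sun", "Sat", "Fri", "Sun", "Sat"],
        "tip": [2.0, 3.0, 4.0, 1.0, 5.0, 3.0],
    }
)

# Frequency encoding: replace each category with its share of the rows
day_freq = df_demo["day"].value_counts(normalize=True)
df_demo["day_freq"] = df_demo["day"].map(day_freq)

# Mean (target) encoding: replace each category with the mean target value
day_mean = df_demo.groupby("day")["tip"].mean()
df_demo["day_mean"] = df_demo["day"].map(day_mean)

print(df_demo)
```

In a real project the mean-encoding statistics would normally be computed on the training split only, since computing them on all rows leaks target information into the features.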

### **Feature Encoding with Pandas Library:**

In [24]:
# use pandas for feature encoding

df = sns.load_dataset('tips')
df.head()

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size
0,16.99,1.01,Female,No,Sun,Dinner,2
1,10.34,1.66,Male,No,Sun,Dinner,3
2,21.01,3.5,Male,No,Sun,Dinner,3
3,23.68,3.31,Male,No,Sun,Dinner,2
4,24.59,3.61,Female,No,Sun,Dinner,4


In [25]:
# one-hot encode the 'day' column with pandas get_dummies
df_dummies = pd.get_dummies(df, columns=['day'])
df_dummies.head()

Unnamed: 0,total_bill,tip,sex,smoker,time,size,day_Thur,day_Fri,day_Sat,day_Sun
0,16.99,1.01,Female,No,Dinner,2,False,False,False,True
1,10.34,1.66,Male,No,Dinner,3,False,False,False,True
2,21.01,3.5,Male,No,Dinner,3,False,False,False,True
3,23.68,3.31,Male,No,Dinner,2,False,False,False,True
4,24.59,3.61,Female,No,Dinner,4,False,False,False,True


This Python code is using the pandas `get_dummies` function to perform one-hot encoding on the 'day' column of the 'tips' dataset. 

The 'tips' dataset is a built-in dataset in the seaborn library that contains information about the bills paid by different customers in a restaurant, including the total bill amount, tip amount, sex of the person paying the bill, whether they are a smoker or not, the day of the week, time of the day, and the size of the party.

The `get_dummies` function is used to convert the categorical variable 'day' into a format that can be provided to a machine learning algorithm to improve its prediction performance. For each unique value in the 'day' column, `get_dummies` creates a new column in the DataFrame, and uses a binary value to indicate the presence of that value in the original row.

**Observations from the Output:**
- From the output, you can see that the '`day`' column has been replaced by four new columns: '`day_Thur`', '`day_Fri`', '`day_Sat`', and '`day_Sun`'. 
- Each of these columns holds a binary indicator: `True` means that the original 'day' value for that row matches the column name, and `False` means that it does not. Older pandas versions show 1 and 0 instead, and passing `dtype=int` to `get_dummies` gives 1/0 output.

For example, in the first row of the output, the 'day_Sun' column is `True`, which means that the original 'day' value for that row was 'Sun'. All the other 'day' columns for that row are `False`, indicating that the 'day' value was not 'Thur', 'Fri', or 'Sat'.

In machine learning, we often use this kind of one-hot encoding to convert categorical data into a numerical format that can be used by a machine learning algorithm. This is because many machine learning algorithms require their input data to be numerical. By using one-hot encoding, we can convert categorical data into a numerical format without introducing arbitrary numerical values for different categories.
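If numeric 0/1 columns are preferred over boolean ones, `get_dummies` accepts a `dtype` argument. The sketch below uses a small stand-in DataFrame (the values are assumed, not the full dataset) to show the call.

```python
import pandas as pd

# Small stand-in for the 'day' column of the tips dataset
days = pd.DataFrame({"day": ["Sun", "Sat", "Thur", "Fri", "Sun"]})

# dtype=int makes the indicator columns integers instead of booleans
as_int = pd.get_dummies(days, columns=["day"], dtype=int)
print(as_int)
```

With plain string data the new columns are ordered alphabetically; the tips dataset stores 'day' as a categorical, so there the columns follow the category order instead.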