<div style="text-align: center; color: blue; font-size: 24px; font-weight: bold;">
    ONE HOT ENCODING
</div>

Feature encoding is the process of transforming categorical features into numeric features. This is necessary because most machine learning algorithms operate only on numeric inputs. There are many different ways to encode categorical features, and each method has its own advantages and disadvantages. In this notebook, we will explore some of the most popular methods for encoding categorical features:

- Label encoding
- Ordinal encoding
- One-hot encoding
- Binary encoding

- There are two important points to note about numeric data:
    * Numeric values can be compared. For example, a person aged 20 is younger than a person aged 40.
    * Numeric values can be used in calculations. For example, we can calculate the average height of a group of people.

- Sometimes a dataset also contains textual data that contributes to the final output.
- For example, take a dataset that contains the columns 'price', 'town', and 'area'. The home price depends on the 'town' column, which holds textual data.

<span style="color:yellow;font-size: 24px">CONVERTING TEXTUAL DATA INTO NUMERIC DATA</span>  🚀.

- There are 3 ways of converting (encoding) textual data into numeric data.
- The first way is to simply assign a numeric value to each distinct text value.
- For example, we can replace 'monroe township' with 0, 'robinsville' with 1, and 'west windsor' with 2. After conversion, the 'town' column contains only 0, 1, and 2. This type of conversion is called <span style="color:blue">Label Encoding</span>.
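
The replacement described above can be sketched by hand with a plain dictionary and pandas' `Series.map`. The `towns` series here is a made-up miniature of the 'town' column, used only for illustration.

```python
import pandas as pd

# Hypothetical miniature of the 'town' column, for illustration only
towns = pd.Series(["monroe township", "robinsville", "west windsor", "robinsville"])

# Hand-written mapping from each town name to an integer code
town_codes = {"monroe township": 0, "robinsville": 1, "west windsor": 2}

# Replace every town name with its code
encoded = towns.map(town_codes)
print(encoded.tolist())
```

Writing the mapping by hand makes the assignment explicit, but it has to be maintained whenever a new town appears in the data.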

<span style="color:white;font-size:20px">Label Encoding </span>

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import warnings

warnings.filterwarnings('ignore')

In [20]:
df = pd.read_csv('/Users/hackthebox/Downloads/Machine-Learning-Self-Study/ONE_HOT_ENCODING/Data/homeprices.csv')

In [3]:
df.head()

Unnamed: 0,town,area,price
0,monroe township,2600,550000
1,monroe township,3000,565000
2,monroe township,3200,610000
3,monroe township,3600,680000
4,monroe township,4000,725000


In [4]:
df['town'].value_counts()

town
monroe township    5
west windsor       4
robinsville        4
Name: count, dtype: int64

- Label Encoding can be done using the LabelEncoder class of the preprocessing module in sklearn.
- https://scikit-learn.org/dev/modules/generated/sklearn.preprocessing.LabelEncoder.html

In [5]:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()

- Apply label encoding to the 'town' column using the fit_transform() method:

In [6]:
df.town = le.fit_transform(df.town)

- The fit_transform() method first sorts the unique values in the 'town' column alphabetically, so that they appear in the following order:
  
  * monroe township
  * robinsville
  * west windsor
- Then it replaces them with the numeric values 0, 1, and 2:
  * monroe township: 0
  * robinsville: 1
  * west windsor: 2
- 'fit' means learning the set of categories from the data, and 'transform' means converting each value to its numeric code.
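
As a small check of the ordering described above, the fitted encoder exposes the learned categories in its `classes_` attribute, and `inverse_transform` maps codes back to names. The `towns` list is a made-up sample, deliberately given out of alphabetical order.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical sample, deliberately not in alphabetical order
towns = ["west windsor", "monroe township", "robinsville", "monroe township"]

le = LabelEncoder()
codes = le.fit_transform(towns)

# classes_ holds the learned categories in sorted order;
# the position of a category in classes_ is its code
print(list(le.classes_))
print(codes.tolist())

# inverse_transform recovers the original names from the codes
print(list(le.inverse_transform(codes)))
```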

In [8]:
df['town'].value_counts()

town
0    5
2    4
1    4
Name: count, dtype: int64

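- The problem with Label Encoding is that a Machine Learning model can misread the numeric values 0, 1, and 2 as an order or a magnitude, for example that 'west windsor' (2) is greater than 'robinsville' (1).
- For this reason, Label Encoding is not recommended for categorical features that have no natural order. In scikit-learn, LabelEncoder is intended for encoding target labels rather than input features.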
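
A minimal illustration of how the codes can be misread: once the towns are numbers, comparisons and averages run without error even though they have no meaning for town names. The `codes` series simply restates the mapping listed earlier.

```python
import pandas as pd

# The label codes from above, indexed by town name
codes = pd.Series({"monroe township": 0, "robinsville": 1, "west windsor": 2})

# A numeric comparison is valid code, but "west windsor > robinsville" means nothing
is_greater = codes["west windsor"] > codes["robinsville"]

# The "average town" comes out as 1.0, which happens to be the code of 'robinsville'
average_code = codes.mean()

print(is_greater, average_code)
```

A model that uses distances or weighted sums of its inputs performs this kind of arithmetic internally, which is why the codes can mislead it.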

<span style="color:white;font-size:20px">DUMMY VARIABLES(BINARY ENCODING) </span>

- The second way of converting the data is by using <span style="color:yellow">dummy variables</span>. A dummy variable, as its name indicates, is an intermediate variable by which categorical (or textual) data can be represented.
- Dummy variables contain only binary values, either 1 or 0.
- To convert a column into dummy variables, we can use the get_dummies() function of the pandas package.
- https://pandas.pydata.org/docs/reference/api/pandas.get_dummies.html

In [21]:
df1 = df.copy()

In [22]:
df1.head()

Unnamed: 0,town,area,price
0,monroe township,2600,550000
1,monroe township,3000,565000
2,monroe township,3200,610000
3,monroe township,3600,680000
4,monroe township,4000,725000


In [25]:
# get_dummies = pd.get_dummies(df1, columns=['town'])
# get_dummies.head()
dummies = pd.get_dummies(df1.town,dtype='int')
dummies.head()

Unnamed: 0,monroe township,robinsville,west windsor
0,1,0,0
1,1,0,0
2,1,0,0
3,1,0,0
4,1,0,0


- The get_dummies() function sorts the unique values of the 'town' column alphabetically and creates one column per value.
- Each row gets a 1 in the column of its own town and 0 elsewhere, giving the patterns 100, 010, and 001.
- Hence we get:
  * monroe township: 100
  * robinsville: 010
  * west windsor: 001
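
The patterns listed above can be reproduced on a tiny made-up series. This sketch assumes nothing beyond pandas; the `towns` values are illustrative and deliberately unsorted.

```python
import pandas as pd

# Hypothetical sample with one row per town, not in alphabetical order
towns = pd.Series(["west windsor", "monroe township", "robinsville"], name="town")

dummies = pd.get_dummies(towns, dtype="int")

# Columns come out in sorted order regardless of the row order
print(list(dummies.columns))

# Exactly one column is 1 in every row, so each row sums to 1
print(dummies.sum(axis=1).tolist())
print(dummies)
```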

- Dummy Variable Trap: 
    
    * The dummy variable trap is a scenario where the independent variables are related to each other so that any two of the dummy values determine the third.
    * For example, if 'monroe township' is 0 and 'west windsor' is 1, then 'robinsville' must be 0, because exactly one of the three columns is 1 in every row.
- To overcome the dummy variable trap, we drop one of the three columns and keep only two. For example, the town values can be represented as: 
    * monroe township: 10
    * robinsville: 01
    * west windsor: 00
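
One way to see the trap numerically is to compare the rank of the design matrix with its number of columns, assuming the model also fits an intercept (a column of ones). The helper `rank_with_intercept` and the sample data are hypothetical. Note that `drop_first=True` drops the first category in sorted order ('monroe township'), whereas this notebook drops 'west windsor'; either choice removes the redundancy.

```python
import numpy as np
import pandas as pd

# Hypothetical sample of the 'town' column
towns = pd.Series(["monroe township", "robinsville", "west windsor", "robinsville"])

full = pd.get_dummies(towns, dtype="int")                      # three dummy columns
reduced = pd.get_dummies(towns, dtype="int", drop_first=True)  # two dummy columns


def rank_with_intercept(frame):
    """Rank of the matrix formed by an intercept column plus the dummy columns."""
    intercept = np.ones((len(frame), 1))
    return int(np.linalg.matrix_rank(np.hstack([intercept, frame.to_numpy()])))


# 4 columns but rank 3: one column is a combination of the others
print(full.shape[1] + 1, rank_with_intercept(full))

# 3 columns and rank 3: no redundant column remains
print(reduced.shape[1] + 1, rank_with_intercept(reduced))
```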

In [28]:
dummmies = dummies.drop('west windsor',axis=1)
# dummmies = dummies.drop('west windsor',axis='columns')


In [29]:
dummmies

Unnamed: 0,monroe township,robinsville
0,1,0
1,1,0
2,1,0
3,1,0
4,1,0
5,0,0
6,0,0
7,0,0
8,0,0
9,0,1


- Now the third column is gone, so no column can be computed from the others and the remaining columns behave as independent of each other. A row with 0 in both columns represents 'west windsor'.
- Here the third value, 'west windsor', was dropped with the drop() method.
- axis=1 means that a column, rather than a row, is dropped.

In [30]:
!pip install category_encoders

Collecting category_encoders
  Downloading category_encoders-2.6.4-py2.py3-none-any.whl.metadata (8.0 kB)
Downloading category_encoders-2.6.4-py2.py3-none-any.whl (82 kB)
Installing collected packages: category_encoders
Successfully installed category_encoders-2.6.4


In [31]:
from category_encoders import BinaryEncoder

In [39]:
tips = sns.load_dataset('tips')
tips.head()

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size
0,16.99,1.01,Female,No,Sun,Dinner,2
1,10.34,1.66,Male,No,Sun,Dinner,3
2,21.01,3.5,Male,No,Sun,Dinner,3
3,23.68,3.31,Male,No,Sun,Dinner,2
4,24.59,3.61,Female,No,Sun,Dinner,4


In [42]:
tips['day'].value_counts()

day
Sat     87
Sun     76
Thur    62
Fri     19
Name: count, dtype: int64

In [44]:
binary_encoder = BinaryEncoder()
df_binary = binary_encoder.fit_transform(tips['day'])

In [45]:
df_binary

Unnamed: 0,day_0,day_1,day_2
0,0,0,1
1,0,0,1
2,0,0,1
3,0,0,1
4,0,0,1
...,...,...,...
239,0,1,0
240,0,1,0
241,0,1,0
242,0,1,0


<span style="color:white;font-size:20px">ONE HOT ENCODING </span>

- The third way of converting categorical data into numerical values is a technique called <span style="color:yellow">One Hot Encoding</span>, which gives the same kind of output as the 'dummy variables' method.
- In this notebook, One Hot Encoding is applied in 2 stages.
    * In the first stage, we convert the values of the categorical variable into 0, 1, 2 using the LabelEncoder class. Recent versions of scikit-learn's OneHotEncoder also accept string columns directly, so this stage is optional.

In [47]:
df.head()

Unnamed: 0,town,area,price
0,monroe township,2600,550000
1,monroe township,3000,565000
2,monroe township,3200,610000
3,monroe township,3600,680000
4,monroe township,4000,725000


In [48]:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
# fit and transform the data of town column
df.town = le.fit_transform(df.town)

In [49]:
df.head()

Unnamed: 0,town,area,price
0,0,2600,550000
1,0,3000,565000
2,0,3200,610000
3,0,3600,680000
4,0,4000,725000


- In the second stage, we convert these values into binary columns using the OneHotEncoder class of the sklearn.preprocessing module.
- https://scikit-learn.org/dev/modules/generated/sklearn.preprocessing.OneHotEncoder.html

In [50]:
from sklearn.preprocessing import OneHotEncoder
ohe = OneHotEncoder(handle_unknown='ignore') # handle_unknown='ignore': a category not seen during fit is encoded as all zeros instead of raising an error
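
A short sketch of what `handle_unknown='ignore'` does when new data contains a category the encoder was not fitted on. The frames `train` and `new` are made up, and 'princeton' is a hypothetical town that does not appear in the dataset.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical training data with the three known towns
train = pd.DataFrame({"town": ["monroe township", "robinsville", "west windsor"]})

# Hypothetical new data: 'princeton' was never seen during fit
new = pd.DataFrame({"town": ["robinsville", "princeton"]})

ohe = OneHotEncoder(handle_unknown="ignore")
ohe.fit(train)

# The unseen category becomes a row of zeros instead of raising an error
encoded = ohe.transform(new).toarray()
print(encoded)
```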

In [52]:
x1 = ohe.fit_transform(df[['town']]) 

# This returns a sparse matrix, a format that stores only the non-zero entries of a matrix that is mostly zeros.
# The matrix can be converted into a dense array using the .toarray() method.

In [53]:
# The following statement converts the sparse matrix x1 into an array and then into the data frame.
x1 = pd.DataFrame(x1.toarray())

In [54]:
x1

Unnamed: 0,0,1,2
0,1.0,0.0,0.0
1,1.0,0.0,0.0
2,1.0,0.0,0.0
3,1.0,0.0,0.0
4,1.0,0.0,0.0
5,0.0,0.0,1.0
6,0.0,0.0,1.0
7,0.0,0.0,1.0
8,0.0,0.0,1.0
9,0.0,1.0,0.0


- Now that the encoding (or conversion) is complete, we have to drop one of the 3 columns above in order to avoid the dummy variable trap.
- Any one of the 3 columns can be dropped.
- Suppose we drop the 0th column and keep only the 1st and 2nd columns.

In [55]:
x1 = x1.iloc[:,1:]
x1

Unnamed: 0,1,2
0,0.0,0.0
1,0.0,0.0
2,0.0,0.0
3,0.0,0.0
4,0.0,0.0
5,0.0,1.0
6,0.0,1.0
7,0.0,1.0
8,0.0,1.0
9,1.0,0.0
