## Categorical Data

Reference: https://analyticsindiamag.com/a-complete-guide-to-categorical-data-encoding/

Most machine learning models require numeric input, so categorical data has to be converted to numbers before a model can be trained on it. This conversion is called categorical encoding, and the appropriate method depends on the type of categorical data being encoded.

There are two types of categorical data:

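- Nominal data: categories with no inherent order (for example, country names)
- Ordinal data: categories with a meaningful order (for example, grades)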

Common methods for encoding categorical data:

- Label Encoding or Ordinal Encoding
- One-Hot Encoding
- Effect Encoding
- Hash Encoding
- Binary Encoding
- Base-N Encoding
- Target Encoding

### Ordinal Encoding (also known as Label Encoding)

It is used when the categorical variable is ordinal, that is, when the order of the categories matters (for example: good, better, best).
Each category is converted to an integer in a sequence that preserves this order.

In [4]:
import category_encoders as ce
import pandas as pd

df = pd.DataFrame({'grade': ['very poor', 'poor', 'average', 'above average', 'good', 'excellent']})

# Explicit mapping so the integers follow the natural order of the grades
encoder = ce.OrdinalEncoder(cols=['grade'], return_df=True, mapping=[{'col': 'grade', 'mapping': {'very poor': 0, 'poor': 1, 'average': 2, 'above average': 3, 'good': 4, 'excellent': 5}}])

# fit_transform returns a DataFrame; take its 'grade' column as the encoded values
df['transformed'] = encoder.fit_transform(df)['grade']
df.head(6)

|   | grade | transformed |
|---|---|---|
| 0 | very poor | 0 |
| 1 | poor | 1 |
| 2 | average | 2 |
| 3 | above average | 3 |
| 4 | good | 4 |
| 5 | excellent | 5 |
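
The same mapping can be written without the library, which makes the mechanics explicit. This is only a minimal sketch in plain Python; the helper name `ordinal_encode` is hypothetical and not part of `category_encoders`.

```python
# Ordered categories, lowest to highest (same order as the mapping above).
GRADE_ORDER = ['very poor', 'poor', 'average', 'above average', 'good', 'excellent']


def ordinal_encode(values, ordered_categories):
    """Map each value to its position in ordered_categories (hypothetical helper)."""
    rank = {category: position for position, category in enumerate(ordered_categories)}
    # A KeyError here means the value was not listed in ordered_categories.
    return [rank[value] for value in values]


encoded = ordinal_encode(['good', 'very poor', 'excellent'], GRADE_ORDER)
print(encoded)  # [4, 0, 5]
```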


### One-Hot Encoding 

In this type of encoding, we create a new binary column for each category of the categorical variable. For each data point, the column that matches its category is set to 1 and all the other columns are set to 0. This encoding is used for nominal categorical variables.

In [6]:
df=pd.DataFrame({'Country':['India','Sri Lanka','China','Pakistan','Afganistan']})

encoder=ce.OneHotEncoder(cols='Country', handle_unknown='return_nan', return_df=True, use_cat_names=True)

df_enc = encoder.fit_transform(df)
df_enc.head()

|   | Country_India | Country_Sri Lanka | Country_China | Country_Pakistan | Country_Afganistan |
|---|---|---|---|---|---|
| 0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 |
| 2 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
| 3 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 4 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
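
To see what the encoder does under the hood, here is a rough sketch of one-hot encoding in plain Python. The helper `one_hot_encode` is a hypothetical name used for illustration, and it orders columns by first appearance, which is an assumption rather than a documented guarantee of the library.

```python
def one_hot_encode(values):
    """Return (column_names, rows) with one 0/1 column per distinct value (hypothetical helper)."""
    # Keep categories in order of first appearance.
    categories = list(dict.fromkeys(values))
    rows = [[1 if value == category else 0 for category in categories] for value in values]
    return categories, rows


columns, rows = one_hot_encode(['India', 'Sri Lanka', 'China', 'India'])
print(columns)  # ['India', 'Sri Lanka', 'China']
for row in rows:
    print(row)
```

Every row contains exactly one 1, and the number of columns grows with the number of distinct categories.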


### Effect Encoding

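This encoding is also known as Deviation Encoding or Sum Encoding. It is similar to One-Hot Encoding, but it uses the values -1, 0, and 1 instead of only 0 and 1. One category serves as the reference and is coded as -1 in every column, so k categories need only k-1 columns (plus an intercept column in the output below).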

Reference: https://stats.oarc.ucla.edu/other/mult-pkg/faq/general/faqwhat-is-effect-coding/

#### Why use effect coding?
Here’s a good question, why use effect coding instead of dummy coding? If you have several categorical variables in a model it often doesn’t make much difference whether you use effect coding or dummy coding. However, if you have an interaction of two categorical variables then effect coding may provide some benefits. The primary benefit is that you get reasonable estimates of both the main effects and interaction using effect coding. With dummy coding the estimate of the interaction is fine but main effects are not “true” main effects but rather what are called simple effects, i.e., the effect of one variable at one level of the other variable. This is why most analysis of variance programs use some type of effect coding when estimating the various effects in an ANOVA model.

In [7]:
data = pd.DataFrame({'City': ['Delhi', 'Mumbai', 'Hyderabad', 'Chennai', 'Bangalore', 'Delhi', 'Hyderabad']})

# Sum (effect) encoder for the City column
encoder = ce.SumEncoder(cols=['City'], verbose=False)
df_enc = encoder.fit_transform(data)
df_enc.head(7)

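|   | intercept | City_0 | City_1 | City_2 | City_3 |
|---|---|---|---|---|---|
| 0 | 1 | 1.0 | 0.0 | 0.0 | 0.0 |
| 1 | 1 | 0.0 | 1.0 | 0.0 | 0.0 |
| 2 | 1 | 0.0 | 0.0 | 1.0 | 0.0 |
| 3 | 1 | 0.0 | 0.0 | 0.0 | 1.0 |
| 4 | 1 | -1.0 | -1.0 | -1.0 | -1.0 |
| 5 | 1 | 1.0 | 0.0 | 0.0 | 0.0 |
| 6 | 1 | 0.0 | 0.0 | 1.0 | 0.0 |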
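
The pattern in this output can be reproduced by hand. The sketch below is a plain-Python illustration, not the library's implementation: `sum_encode` is a hypothetical helper, it omits the intercept column, and it treats the last category in order of first appearance as the reference, an assumption that happens to match the output above.

```python
def sum_encode(values):
    """Effect (sum) coding: k categories become k-1 columns (hypothetical helper)."""
    categories = list(dict.fromkeys(values))  # order of first appearance
    reference = categories[-1]                # assumption: last category is the reference
    kept = categories[:-1]
    rows = []
    for value in values:
        if value == reference:
            # The reference category is coded as -1 in every column.
            rows.append([-1] * len(kept))
        else:
            rows.append([1 if value == category else 0 for category in kept])
    return kept, rows


cities = ['Delhi', 'Mumbai', 'Hyderabad', 'Chennai', 'Bangalore', 'Delhi', 'Hyderabad']
kept, rows = sum_encode(cities)
print(kept)     # ['Delhi', 'Mumbai', 'Hyderabad', 'Chennai']
print(rows[4])  # [-1, -1, -1, -1]  (Bangalore, the reference category)
```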


### Hash Encoding

This technique encodes a categorical variable in a way similar to one-hot encoding, but the number of columns created is fixed in advance.
Unlike the other encoders, hashing is one-way: the hash encoder's output cannot be converted back into the original input, and different categories can land in the same column, so some information may be lost. It is best suited to high-cardinality data, that is, variables with a very large number of categories.

In [8]:
data=pd.DataFrame({'Month':['January','April','March','April','Februay','June','July','June','September']})

# Create the hash encoder with a fixed number of output columns
encoder = ce.HashingEncoder(cols='Month', n_components=6)

# Fit and transform the data
encoder.fit_transform(data)

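|   | col_0 | col_1 | col_2 | col_3 | col_4 | col_5 |
|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 1 | 0 | 0 | 0 | 1 | 0 | 0 |
| 2 | 0 | 0 | 0 | 0 | 1 | 0 |
| 3 | 0 | 0 | 0 | 1 | 0 | 0 |
| 4 | 0 | 0 | 0 | 1 | 0 | 0 |
| 5 | 0 | 1 | 0 | 0 | 0 | 0 |
| 6 | 1 | 0 | 0 | 0 | 0 | 0 |
| 7 | 0 | 1 | 0 | 0 | 0 | 0 |
| 8 | 0 | 0 | 0 | 0 | 1 | 0 |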
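
The idea behind the hashing trick can be sketched in a few lines of plain Python. This is an illustration under assumptions, not the library's code: `hash_bucket` and `hash_encode` are hypothetical helpers, MD5 is chosen here only as an example hash function, and the column a value lands in may differ from the output above.

```python
import hashlib


def hash_bucket(value, n_components):
    """Pick a column index by hashing the value and taking it modulo n_components."""
    digest = hashlib.md5(str(value).encode('utf-8')).hexdigest()
    return int(digest, 16) % n_components


def hash_encode(values, n_components=6):
    """Return one row per value with a single 1 in the hashed column (hypothetical helper)."""
    rows = []
    for value in values:
        row = [0] * n_components
        row[hash_bucket(value, n_components)] = 1
        rows.append(row)
    return rows


months = ['January', 'April', 'March', 'April', 'February', 'June', 'July', 'June', 'September']
hashed = hash_encode(months, n_components=6)
for month, row in zip(months, hashed):
    print(month, row)
```

The number of columns stays at `n_components` no matter how many distinct values appear. The price is that two different values can share a column, and the original value cannot be recovered from the row.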


### Binary Encoding

Hash encoding can lose information, while one-hot encoding increases the dimensionality of the data. Binary encoding sits between the two: like hash encoding it produces a small number of columns, and like one-hot encoding it loses no information.

In that sense, binary encoding can be seen as a combination of hash and one-hot encoding. Each category is first converted to an integer, and that integer is then written in binary, with one column per binary digit.

The output below shows how it differs from hash and one-hot encoding. This encoding is very helpful for data with a large number of categories.

In [12]:
data=pd.DataFrame({'Month':['January','April','March','April','Februay','June','July','June','September']})
encoder= ce.BinaryEncoder(cols=['Month'],return_df=True)
data_month=encoder.fit_transform(data) 
data_month

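|   | Month_0 | Month_1 | Month_2 |
|---|---|---|---|
| 0 | 0 | 0 | 1 |
| 1 | 0 | 1 | 0 |
| 2 | 0 | 1 | 1 |
| 3 | 0 | 1 | 0 |
| 4 | 1 | 0 | 0 |
| 5 | 1 | 0 | 1 |
| 6 | 1 | 1 | 0 |
| 7 | 1 | 0 | 1 |
| 8 | 1 | 1 | 1 |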
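
The rows above are simply the binary digits of an integer code assigned to each month. The following plain-Python sketch shows one way to arrive at the same pattern; `binary_encode` is a hypothetical helper, and numbering the categories from 1 in order of first appearance is an assumption made to match the output above.

```python
def binary_encode(values):
    """Write each category's integer code in binary, one column per bit (hypothetical helper)."""
    # Assumption: codes start at 1 and follow the order of first appearance.
    codes = {category: code for code, category in enumerate(dict.fromkeys(values), start=1)}
    width = max(codes.values()).bit_length()  # number of bits needed for the largest code
    return [[int(bit) for bit in format(codes[value], f'0{width}b')] for value in values]


months = ['January', 'April', 'March', 'April', 'February', 'June', 'July', 'June', 'September']
binary_rows = binary_encode(months)
for month, row in zip(months, binary_rows):
    print(month, row)
```

Seven distinct months fit in three columns, whereas one-hot encoding would need seven.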


### Base-N Encoding

In a positional number system, the base (or radix) is the number of unique digits, including zero, used to represent numbers. In Base-N encoding, each category is converted to an integer and that integer is written in the chosen base. With base 2, this is the same as binary encoding. With base 10, each digit column takes a value from 0 to 9. The implementation below makes this clearer.

In [13]:
#Create an object for Base N Encoding
encoder= ce.BaseNEncoder(cols=['Month'], return_df=True, base=5)

#Fit and Transform Data
data_encoded=encoder.fit_transform(data)

data_encoded

|   | Month_0 | Month_1 |
|---|---|---|
| 0 | 0 | 1 |
| 1 | 0 | 2 |
| 2 | 0 | 3 |
| 3 | 0 | 2 |
| 4 | 0 | 4 |
| 5 | 1 | 0 |
| 6 | 1 | 1 |
| 7 | 1 | 0 |
| 8 | 1 | 2 |


In the above output, we used base 5. The result is similar to binary encoding, but where binary encoding produced 3 columns, here we have only 2, and the values range from 0 to 4.

If the base is not specified, it defaults to 2, which is equivalent to binary encoding.
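
The digit columns can be derived by hand with repeated division. The sketch below is a plain-Python illustration rather than the library's implementation; `to_base_n_digits` and `base_n_encode` are hypothetical helpers, and the integer codes are assumed to start at 1 in order of first appearance, which matches the output above.

```python
def to_base_n_digits(number, base, width):
    """Return the digits of number in the given base, most significant first, padded to width."""
    digits = []
    for _ in range(width):
        digits.append(number % base)
        number //= base
    return digits[::-1]


def base_n_encode(values, base):
    """Encode each category's integer code as base-N digits (hypothetical helper)."""
    if base < 2:
        raise ValueError('this sketch only handles base >= 2')
    # Assumption: codes start at 1 and follow the order of first appearance.
    codes = {category: code for code, category in enumerate(dict.fromkeys(values), start=1)}
    largest = max(codes.values())
    width = 1
    while base ** width <= largest:  # smallest width that can hold the largest code
        width += 1
    return [to_base_n_digits(codes[value], base, width) for value in values]


months = ['January', 'April', 'March', 'April', 'February', 'June', 'July', 'June', 'September']
base5_rows = base_n_encode(months, base=5)
base2_rows = base_n_encode(months, base=2)
print(base5_rows[-1])  # [1, 2]     (September has code 7, which is 12 in base 5)
print(base2_rows[-1])  # [1, 1, 1]  (7 in binary)
```

A larger base needs fewer columns, at the cost of each column holding more distinct values.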

### Target Encoding

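Target encoding replaces each category with a value derived from the mean of the target variable for that category. It is a Bayesian encoding method, a family of encoders that use the target variable to encode categorical values.

The encoder calculates the mean of the target variable for each category and replaces the category with it. In the output below, the values are pulled toward the overall mean of 45 rather than being exactly the per-category means of 35, 45, and 55. This is consistent with the smoothing that `TargetEncoder` applies, which blends each category mean with the overall mean.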

In [15]:
df = pd.DataFrame({'name':[ 'rahul','ashok','ankit','rahul','ashok','ankit' ],'marks' : [10,20,30,60,70,80]})

encoder = ce.TargetEncoder(cols='name')
encoder.fit_transform(df['name'],df['marks'])



|   | name |
|---|---|
| 0 | 37.689414 |
| 1 | 45.0 |
| 2 | 52.310586 |
| 3 | 37.689414 |
| 4 | 45.0 |
| 5 | 52.310586 |
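
These numbers are not the raw per-name means (35, 45, and 55). One way to arrive at them is a sigmoid-weighted blend of each category mean with the overall mean. The sketch below is an illustration under assumptions: `smoothed_target_encode` is a hypothetical helper, and the values `min_samples_leaf=1` and `smoothing=1.0` are chosen because they reproduce the output above. Other versions of the library may use different defaults and give different numbers.

```python
import math
from collections import defaultdict


def smoothed_target_encode(categories, targets, min_samples_leaf=1, smoothing=1.0):
    """Blend each category's target mean with the overall mean (hypothetical helper)."""
    prior = sum(targets) / len(targets)  # overall mean of the target
    sums = defaultdict(float)
    counts = defaultdict(int)
    for category, target in zip(categories, targets):
        sums[category] += target
        counts[category] += 1

    encoding = {}
    for category, count in counts.items():
        category_mean = sums[category] / count
        # The weight approaches 1 as the category gets more samples.
        weight = 1 / (1 + math.exp(-(count - min_samples_leaf) / smoothing))
        encoding[category] = prior * (1 - weight) + category_mean * weight
    return [encoding[category] for category in categories]


names = ['rahul', 'ashok', 'ankit', 'rahul', 'ashok', 'ankit']
marks = [10, 20, 30, 60, 70, 80]
target_encoded = smoothed_target_encode(names, marks)
print([round(value, 6) for value in target_encoded])
# [37.689414, 45.0, 52.310586, 37.689414, 45.0, 52.310586]
```

With only two samples per name, the weight is about 0.73, so each encoded value sits roughly a quarter of the way from the category mean toward the overall mean.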
