# **Data Cleaning & Preparation**

**CATEGORICAL**
- Nominal: Male (3), Female (2)
- Ordinal: elementary school (1), junior high school (4), senior high school (9), university (10); Likert scale

**CONTINUOUS**
- Interval: measurement data with no absolute zero, such as temperature (0 °C is not an absence of temperature)
- Ratio: count or measurement data that has an absolute zero, such as distance or depth
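
To make the interval versus ratio distinction concrete, here is a small arithmetic sketch. The temperatures and distances are made-up illustration values, not data from this course.

```python
# Ratios are only meaningful on a scale whose zero means "none of the quantity".
celsius_a, celsius_b = 10.0, 20.0

# Naive ratio on an interval scale (Celsius): suggests "twice as hot".
naive_ratio = celsius_b / celsius_a

# The same temperatures in Kelvin, which does have an absolute zero.
kelvin_a, kelvin_b = celsius_a + 273.15, celsius_b + 273.15
kelvin_ratio = kelvin_b / kelvin_a

# Distance is a ratio variable: 20 km really is twice 10 km.
distance_ratio = 20.0 / 10.0

print(naive_ratio, round(kelvin_ratio, 3), distance_ratio)
```

The Celsius ratio of 2.0 is an artifact of where the scale places its zero; on the Kelvin scale the same two temperatures differ by only a few percent.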

### Encode Categorical Data

### **Session Objectives**

<hr>

**General Instructional Objective:** Participants are able to prepare data for building machine learning models.

**This session:** Participants are able to handle categorical data (Encode Categorical Data).

<hr>

## **Encode Categorical Data**

### **Intro**

In most data science problems, our datasets will contain categorical features. Categorical features contain a finite number of discrete values. How we represent these features will have an impact on the performance of our model. As in other aspects of machine learning, there are no silver bullets. Determining the correct approach for our specific model and data is part of the challenge.

Most machine learning algorithms cannot handle categorical variables unless we convert them to numerical values. The performance of many algorithms varies with how the categorical variables are encoded.

Categorical variables can be divided into two categories: nominal (no particular order) and ordinal (ordered).

<img src="x_img.png" style="width:550px;height:400px"/>

A few examples of nominal variables:
* Red, Yellow, Pink, Blue
* Singapore, Japan, USA, India, Korea
* Cow, Dog, Cat, Snake

Examples of ordinal variables:
* High, Medium, Low
* Strongly Agree, Agree, Neutral, Disagree, Strongly Disagree
* Excellent, Okay, Bad

## **Quick Summary**
Here’s the list of Category Encoders functions with their descriptions and the type of data they would be most appropriate to encode.

### **Classic Encoders**
The first group of six classic encoders can be seen on a continuum of embedding information in one column (Ordinal) up to k columns (OneHot). These are very useful encodings for machine learning practitioners to understand.

* Ordinal — convert string labels to integer values 1 through k. Ordinal.
* OneHot — one column for each value to compare vs. all other values. Nominal, ordinal.
* Binary — convert each integer to binary digits. Each binary digit gets one column. Some info loss but fewer dimensions. Ordinal.
* BaseN — Ordinal, Binary, or higher encoding. Nominal, ordinal. Doesn’t add much functionality. Probably avoid.
* Hashing — Like OneHot but fewer dimensions, some info loss due to collisions. Nominal, ordinal.
* Sum — Just like OneHot except one value is held constant and encoded as -1 across all columns.
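
As an illustration of the "fewer dimensions" trade-off in the Binary entry above, here is a hand-rolled sketch of binary encoding using only pandas. It is not the category_encoders implementation; the column names `Temperature_bin_i` and the choice to number levels by first appearance are assumptions made for this example.

```python
import pandas as pd

temps = pd.Series(["Hot", "Cold", "Very Hot", "Warm", "Hot"], name="Temperature")

# Step 1: ordinal codes 1..k, numbered in order of first appearance.
codes, levels = pd.factorize(temps)
codes = codes + 1

# Step 2: number of binary digits needed for the largest code.
n_bits = int(codes.max()).bit_length()

# Step 3: one column per binary digit, most significant digit first.
binary = pd.DataFrame(
    {f"Temperature_bin_{i}": (codes >> (n_bits - 1 - i)) & 1 for i in range(n_bits)},
    index=temps.index,
)
print(binary)
```

With four levels this needs three columns instead of the four that one-hot encoding would produce; the saving grows with cardinality, since the column count grows only logarithmically in the number of levels.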

### **Contrast Encoders**
The contrast encoders all have multiple issues that I argue make them unlikely to be useful for machine learning. They output about one column per level found in a column (the Helmert example later in this section yields three contrast columns for four levels). Their stated intents are below.

* Helmert (reverse) — The mean of the dependent variable for a level is compared to the mean of the dependent variable over all previous levels.
* Backward Difference — the mean of the dependent variable for a level is compared with the mean of the dependent variable for the prior level.
* Polynomial — orthogonal polynomial contrasts. The coefficients taken on by polynomial coding for k=4 levels are the linear, quadratic, and cubic trends in the categorical variable.

### **Bayesian Encoders**
The Bayesian encoders use information from the dependent variable in their encodings. They output one column and can work well with high cardinality data.

* Target — use the mean of the DV, must take steps to avoid overfitting/ response leakage. Nominal, ordinal. For classification tasks.
* LeaveOneOut — similar to target but avoids contamination. Nominal, ordinal. For classification tasks.
* WeightOfEvidence — added in v1.3. Not documented in the docs as of April 11, 2019. The method is explained in this post.
* James-Stein — forthcoming in v1.4. Described in the code here.
* M-estimator — forthcoming in v1.4. Described in the code here. Simplified target encoder.
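
The idea behind the Target and LeaveOneOut entries can be sketched with plain pandas on the Color and Target values used in this notebook. This is a minimal illustration, not how category_encoders implements these encoders; the names `df_demo`, `Color_target` and `Color_loo` are made up for the example.

```python
import pandas as pd

df_demo = pd.DataFrame({
    "Color":  ["Red", "Yellow", "Blue", "Blue", "Red",
               "Yellow", "Red", "Yellow", "Yellow", "Yellow"],
    "Target": [1, 1, 1, 0, 1, 0, 1, 0, 1, 1],
})

# Plain target encoding: replace each category with the mean of the target.
means = df_demo.groupby("Color")["Target"].mean()
df_demo["Color_target"] = df_demo["Color"].map(means)

# Leave-one-out: exclude the current row's own target from the category mean.
# Note: a category that occurs only once would divide by zero and yield NaN.
grp = df_demo.groupby("Color")["Target"]
sums, counts = grp.transform("sum"), grp.transform("count")
df_demo["Color_loo"] = (sums - df_demo["Target"]) / (counts - 1)

print(df_demo)
```

The two Blue rows show why the leave-one-out variant reduces leakage: the plain encoding gives both rows 0.5, while leave-one-out gives each row only the other row's target.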

In [2]:
import pandas as pd
import numpy as np
data = {"Temperature": ['Hot', 'Cold', 'Very Hot', 'Warm', 'Hot', 'Warm', 'Warm', 'Hot', 'Hot', 'Cold'],
       'Color': ['Red','Yellow', 'Blue', 'Blue', 'Red', 'Yellow', 'Red', 'Yellow', 'Yellow', 'Yellow'],
        'Target': [1,1,1,0,1,0,1,0,1,1]
       }
df = pd.DataFrame(data)
df

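| | Temperature | Color | Target |
|---|---|---|---|
| 0 | Hot | Red | 1 |
| 1 | Cold | Yellow | 1 |
| 2 | Very Hot | Blue | 1 |
| 3 | Warm | Blue | 0 |
| 4 | Hot | Red | 1 |
| 5 | Warm | Yellow | 0 |
| 6 | Warm | Red | 1 |
| 7 | Hot | Yellow | 0 |
| 8 | Hot | Yellow | 1 |
| 9 | Cold | Yellow | 1 |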


## **How to Encode Categorical Data**

### 1. **One-hot Encoding**

The first method is one that will no doubt be familiar to you. One-hot encoding expands a categorical feature made up of m categories into m distinct features with values of either 0 or 1.

There are two common ways to implement one-hot encoding: with pandas or with scikit-learn. The example in this section uses pandas.

Strictly speaking, it is often considered more correct to expand m categories into (m - 1) distinct features. The reason is twofold. First, if the values of (m - 1) features are known, the m-th feature can be inferred. Second, including the m-th feature makes the columns perfectly collinear, which can cause certain linear models to become unstable (this is often called the dummy variable trap). In practice I think this depends on your model. Some non-linear models actually do better with m features.

In this method, we map each category to a vector of 1s and 0s denoting the presence or absence of that category. The number of vectors equals the number of categories of the feature. This produces many columns, which slows learning significantly when the feature has a very large number of categories. Pandas has the get_dummies function, which is quite easy to use. For the sample data frame, the code is as follows:

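In [2]:
# Store the result in a new frame so the original df (with Temperature) stays available for later cells.
# dtype=int gives 0/1 columns; newer pandas versions default to True/False.
df_onehot = pd.get_dummies(df, prefix=['Temp'], columns=['Temperature'], dtype=int)
df_onehot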

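| | Color | Target | Temp_Cold | Temp_Hot | Temp_Very Hot | Temp_Warm |
|---|---|---|---|---|---|---|
| 0 | Red | 1 | 0 | 1 | 0 | 0 |
| 1 | Yellow | 1 | 1 | 0 | 0 | 0 |
| 2 | Blue | 1 | 0 | 0 | 1 | 0 |
| 3 | Blue | 0 | 0 | 0 | 0 | 1 |
| 4 | Red | 1 | 0 | 1 | 0 | 0 |
| 5 | Yellow | 0 | 0 | 0 | 0 | 1 |
| 6 | Red | 1 | 0 | 0 | 0 | 1 |
| 7 | Yellow | 0 | 0 | 1 | 0 | 0 |
| 8 | Yellow | 1 | 0 | 1 | 0 | 0 |
| 9 | Yellow | 1 | 1 | 0 | 0 | 0 |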
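
For comparison, the scikit-learn route and the (m - 1)-column variant discussed above can be sketched as follows. This is a minimal example on a shortened copy of the Temperature column; the variable names are illustrative, and the sparse result is converted with `.toarray()` only to make it easy to inspect.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

temps = pd.DataFrame({"Temperature": ["Hot", "Cold", "Very Hot", "Warm", "Hot"]})

# Full encoding: one column per category (m columns).
full = OneHotEncoder()
full_matrix = full.fit_transform(temps).toarray()

# Reduced encoding: drop the first category, leaving m - 1 columns.
reduced = OneHotEncoder(drop="first")
reduced_matrix = reduced.fit_transform(temps).toarray()

print(full.categories_[0])   # categories in sorted order
print(full_matrix.shape)
print(reduced_matrix.shape)
```

In the reduced matrix the dropped category (Cold, the first in sorted order) is represented by a row of all zeros, which is how the m-th feature is inferred from the other m - 1.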


### 2. **Label Encoding**

In this encoding, each category is assigned an integer. With scikit-learn's LabelEncoder the values run from 0 through N - 1, where N is the number of categories of the feature. One major issue with this approach is that the classes have no inherent relation or order, yet an algorithm may treat the integers as ordered or otherwise related. In the example below it may look like Cold < Hot < Very Hot < Warm (0 < 1 < 2 < 3). The scikit-learn code for the data frame is as follows:

In [3]:
from sklearn.preprocessing import LabelEncoder
df['Temp_label_encoded'] = LabelEncoder().fit_transform(df['Temperature'])
df

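| | Temperature | Color | Target | Temp_label_encoded |
|---|---|---|---|---|
| 0 | Hot | Red | 1 | 1 |
| 1 | Cold | Yellow | 1 | 0 |
| 2 | Very Hot | Blue | 1 | 2 |
| 3 | Warm | Blue | 0 | 3 |
| 4 | Hot | Red | 1 | 1 |
| 5 | Warm | Yellow | 0 | 3 |
| 6 | Warm | Red | 1 | 3 |
| 7 | Hot | Yellow | 0 | 1 |
| 8 | Hot | Yellow | 1 | 1 |
| 9 | Cold | Yellow | 1 | 0 |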


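Pandas **factorize** performs the same function, except that it numbers the categories in order of first appearance rather than alphabetically.

In [8]:
# factorize returns (codes, uniques); the codes are already one-dimensional, so no reshape is needed
df['Temp_factorize_encode'] = pd.factorize(df['Temperature'])[0]
df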

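| | Temperature | Color | Target | Temp_label_encoded | Temp_factorize_encode |
|---|---|---|---|---|---|
| 0 | Hot | Red | 1 | 1 | 0 |
| 1 | Cold | Yellow | 1 | 0 | 1 |
| 2 | Very Hot | Blue | 1 | 2 | 2 |
| 3 | Warm | Blue | 0 | 3 | 3 |
| 4 | Hot | Red | 1 | 1 | 0 |
| 5 | Warm | Yellow | 0 | 3 | 3 |
| 6 | Warm | Red | 1 | 3 | 3 |
| 7 | Hot | Yellow | 0 | 1 | 0 |
| 8 | Hot | Yellow | 1 | 1 | 0 |
| 9 | Cold | Yellow | 1 | 0 | 1 |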
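
The two numbering schemes can be compared directly with pandas alone. The sketch below assumes the same Temperature values as the sample data frame; `sort=True` is a standard `pd.factorize` option that switches to sorted order, and the variable names are illustrative.

```python
import pandas as pd

temps = pd.Series(["Hot", "Cold", "Very Hot", "Warm", "Hot",
                   "Warm", "Warm", "Hot", "Hot", "Cold"])

# Default: codes follow the order of first appearance.
codes_appearance, uniques_appearance = pd.factorize(temps)

# sort=True: codes follow the sorted order of the labels.
codes_sorted, uniques_sorted = pd.factorize(temps, sort=True)

# The uniques array lets us map codes back to the original labels.
decoded = uniques_sorted.take(codes_sorted)

print(list(codes_appearance))
print(list(codes_sorted))
print(list(decoded) == list(temps))
```

Neither order carries meaning for a nominal variable, which is why the resulting integers should not be read as a ranking.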


### 3. **Ordinal Encoding**

We use ordinal encoding to ensure that the encoding retains the ordinal nature of the variable. This is reasonable only for ordinal variables, as mentioned in the introduction. It looks similar to label encoding, but label encoding does not consider whether the variable is ordinal; it simply assigns a sequence of integers
* in order of appearance in the data (pandas assigned Hot (0), Cold (1), Very Hot (2) and Warm (3)), or
* in alphabetical order (scikit-learn assigned Cold (0), Hot (1), Very Hot (2) and Warm (3)).

If we take the temperature scale as the order, the values should run from Cold to Very Hot. Ordinal encoding assigns Cold (1) < Warm (2) < Hot (3) < Very Hot (4). Usually, ordinal encoding starts from 1.

In [9]:
Temp_dict = {'Cold': 1, 'Warm': 2, 'Hot': 3, 'Very Hot': 4}
df['Temp_Ordinal'] = df.Temperature.map(Temp_dict)
df

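| | Temperature | Color | Target | Temp_label_encoded | Temp_factorize_encode | Temp_Ordinal |
|---|---|---|---|---|---|---|
| 0 | Hot | Red | 1 | 1 | 0 | 3 |
| 1 | Cold | Yellow | 1 | 0 | 1 | 1 |
| 2 | Very Hot | Blue | 1 | 2 | 2 | 4 |
| 3 | Warm | Blue | 0 | 3 | 3 | 2 |
| 4 | Hot | Red | 1 | 1 | 0 | 3 |
| 5 | Warm | Yellow | 0 | 3 | 3 | 2 |
| 6 | Warm | Red | 1 | 3 | 3 | 2 |
| 7 | Hot | Yellow | 0 | 1 | 0 | 3 |
| 8 | Hot | Yellow | 1 | 1 | 0 | 3 |
| 9 | Cold | Yellow | 1 | 0 | 1 | 1 |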
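
An alternative to a hand-written dictionary is an ordered pandas Categorical, which stores the order once and also supports comparisons. This is a sketch under the same Cold < Warm < Hot < Very Hot assumption; the names `order`, `temp_cat` and `hotter_than_warm` are illustrative.

```python
import pandas as pd

temps = pd.Series(["Hot", "Cold", "Very Hot", "Warm", "Hot",
                   "Warm", "Warm", "Hot", "Hot", "Cold"])

# Declare the order explicitly, lowest to highest.
order = ["Cold", "Warm", "Hot", "Very Hot"]
temp_cat = pd.Categorical(temps, categories=order, ordered=True)

# Category codes start at 0; add 1 to match the 1..4 mapping used above.
temp_ordinal = pd.Series(temp_cat.codes + 1, index=temps.index)

# Ordered categoricals can be compared against a level.
hotter_than_warm = pd.Series(temp_cat > "Warm", index=temps.index)

print(temp_ordinal.tolist())
print(hotter_than_warm.tolist())
```

A label that is missing from `order` gets the code -1 here (so 0 after adding 1), whereas the dictionary approach with `map` yields NaN; either way it is worth checking for such values before training.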


### 4. **Helmert Encoding**

In this encoding, the mean of the dependent variable for a level is compared to the mean of the dependent variable over all previous levels.

The version in category_encoders is sometimes referred to as reverse Helmert coding, because each level is compared with the previous levels. The name 'reverse' differentiates it from forward Helmert coding, which compares each level with the subsequent levels.

In [15]:
import category_encoders as ce
encoder = ce.HelmertEncoder(cols=['Temperature'], drop_invariant=True)
dfh = encoder.fit_transform(df['Temperature'])
df = pd.concat([df, dfh], axis=1)
df

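| | Temperature | Color | Target | Temperature_0 | Temperature_1 | Temperature_2 |
|---|---|---|---|---|---|---|
| 0 | Hot | Red | 1 | -1.0 | -1.0 | -1.0 |
| 1 | Cold | Yellow | 1 | 1.0 | -1.0 | -1.0 |
| 2 | Very Hot | Blue | 1 | 0.0 | 2.0 | -1.0 |
| 3 | Warm | Blue | 0 | 0.0 | 0.0 | 3.0 |
| 4 | Hot | Red | 1 | -1.0 | -1.0 | -1.0 |
| 5 | Warm | Yellow | 0 | 0.0 | 0.0 | 3.0 |
| 6 | Warm | Red | 1 | 0.0 | 0.0 | 3.0 |
| 7 | Hot | Yellow | 0 | -1.0 | -1.0 | -1.0 |
| 8 | Hot | Yellow | 1 | -1.0 | -1.0 | -1.0 |
| 9 | Cold | Yellow | 1 | 1.0 | -1.0 | -1.0 |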
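
To see where these numbers come from, the contrast matrix can be built by hand with NumPy. This is a sketch that reproduces the values in the table above under the assumption that levels are numbered in order of first appearance (Hot, Cold, Very Hot, Warm); `helmert_contrasts` is a hypothetical helper written for this example, not a category_encoders function.

```python
import numpy as np
import pandas as pd

def helmert_contrasts(k):
    """Return a k x (k - 1) reverse-Helmert contrast matrix."""
    m = np.zeros((k, k - 1))
    for j in range(k - 1):
        m[: j + 1, j] = -1.0   # every previous level gets -1
        m[j + 1, j] = j + 1.0  # the current level gets the number of previous levels
    return m

temps = pd.Series(["Hot", "Cold", "Very Hot", "Warm", "Hot",
                   "Warm", "Warm", "Hot", "Hot", "Cold"], name="Temperature")

# Number the levels in order of first appearance, then look up each row's contrast.
codes, levels = pd.factorize(temps)
contrasts = helmert_contrasts(len(levels))
encoded = pd.DataFrame(
    contrasts[codes],
    columns=[f"Temperature_{i}" for i in range(len(levels) - 1)],
    index=temps.index,
)
print(encoded)
```

Each column of the contrast matrix sums to zero, which is what makes it a comparison between one level and the levels before it.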


### 5. **Frequency Encoding**

Frequency encoding uses the frequency of each category as its label. When the frequency is related to the target variable, this helps the model assign weight in direct or inverse proportion to it, depending on the nature of the data. It takes three steps:
* Select a categorical variable you would like to transform
* Group by the categorical variable and obtain counts of each category
* Join it back with the training dataset

Pandas code can be constructed as below:

In [19]:
fe = df.groupby('Temperature').size()/len(df)
df.loc[:, 'Temp_freq_encode'] = df['Temperature'].map(fe)
df

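| | Temperature | Color | Target | Temp_freq_encode |
|---|---|---|---|---|
| 0 | Hot | Red | 1 | 0.4 |
| 1 | Cold | Yellow | 1 | 0.2 |
| 2 | Very Hot | Blue | 1 | 0.1 |
| 3 | Warm | Blue | 0 | 0.3 |
| 4 | Hot | Red | 1 | 0.4 |
| 5 | Warm | Yellow | 0 | 0.3 |
| 6 | Warm | Red | 1 | 0.3 |
| 7 | Hot | Yellow | 0 | 0.4 |
| 8 | Hot | Yellow | 1 | 0.4 |
| 9 | Cold | Yellow | 1 | 0.2 |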
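
In practice the frequencies are usually learned on the training data and then applied to new data, where unseen categories can appear. The sketch below shows one way to handle that; the `test` frame and its "Freezing" label are hypothetical, and filling unseen categories with 0.0 is an assumption made for the example rather than a rule.

```python
import pandas as pd

train = pd.DataFrame({"Temperature": ["Hot", "Cold", "Very Hot", "Warm", "Hot",
                                      "Warm", "Warm", "Hot", "Hot", "Cold"]})
test = pd.DataFrame({"Temperature": ["Hot", "Freezing"]})  # "Freezing" never occurs in train

# Learn relative frequencies from the training data only.
freq = train["Temperature"].value_counts(normalize=True)

# Apply them to both frames; categories unseen in training map to NaN, filled here with 0.0.
train["Temp_freq_encode"] = train["Temperature"].map(freq)
test["Temp_freq_encode"] = test["Temperature"].map(freq).fillna(0.0)

print(train)
print(test)
```

One limitation to keep in mind: two categories that happen to have the same frequency receive the same code and become indistinguishable to the model.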


#### **References**:

* Jeff Hale, "Smarter Ways to Encode Categorical Data for Machine Learning", https://towardsdatascience.com/smarter-ways-to-encode-categorical-data-for-machine-learning-part-1-of-3-6dca2f71b159
* Baijayanta Roy, "All about Categorical Variable Encoding", https://towardsdatascience.com/all-about-categorical-variable-encoding-305f3361fd02
* Wayde Herman, "Categorical Feature Encoding", https://www.kaggle.com/waydeherman/tutorial-categorical-encoding