# Types of encoding for categorical variables
#### Feb 5, 2022 
##### Jey Kim (jeonghyeop.kim@gmail.com)

See the YouTube video: [HERE](https://youtu.be/OTPz5plKb40)
> **`Motivation:`** \
> ML algorithms are mathematical procedures; \
> we need to convert categorical variables into a form that computers can understand.


## A. Two types of categorical variables
1. **`Nominal Categorical Variables`**
- There is no intrinsic ordering to the categories
- e.g. Genders (Male, Female, and others), States (NY, NJ, CA, and others), etc
2. **`Ordinal Categorical Variables`**
- There is intrinsic ordering, or ranking, to the categories
- e.g. Grades (A, B, C, D, F), education levels (PhD, MS, Bachelor's, High School, ...), etc.
- If you want to predict a person's income, the ranking of education levels may be useful information
- Note that states can also be treated as an ordinal category in the context of this prediction.

## B. Types of encodings
B-1. **`Nominal encoding:`** 
> a. One hot encoding; \
> b. One hot encoding with many categorical variables; \
> c. Mean encoding; \
> d. Count or frequency encoding; \
> etc 

B-2. **`Ordinal encoding:`** 
> a. Label (numbering) encoding; \
> b. Target guided ordinal encoding (similar to mean encoding); \
> etc 

-----------------------------------------------------------------

#### <font color=red>B-1-a. One Hot Encoding </font>

For instance, suppose there are 5 unique values (categories) in a column 'Country'. 
  
    'Country'   -->   'S. Korea', 'US', 'Canada', 'Germany'

     S. Korea             1        0       0          0  
     US                   0        1       0          0   
     Canada               0        0       1          0  
     Germany              0        0       0          1       
     Japan                0        0       0          0 

Q. Where is the new 'Japan' column? \
A. When all of the values in the four new columns are 0, the corresponding country must be 'Japan', so we don't need a 5th column.
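
As a minimal sketch of the table above, assuming pandas is available: `pd.get_dummies` builds one indicator column per category, and dropping the 'Japan' column by hand reproduces the four-column layout. The DataFrame and variable names are illustrative.

```python
import pandas as pd

# The five countries from the example above
df = pd.DataFrame({"Country": ["S. Korea", "US", "Canada", "Germany", "Japan"]})

# One 0/1 indicator column per unique category (5 columns)
dummies = pd.get_dummies(df["Country"], dtype=int)

# Drop one column: a row of all zeros now stands for 'Japan'
encoded = dummies.drop(columns="Japan")
print(encoded)
```

`pd.get_dummies(..., drop_first=True)` does the same thing automatically, but it drops the first category in sorted order rather than a category of your choice.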

#### <font color=red>B-1-b. One hot encoding with many variables </font>

For instance, suppose there are 160 unique values (categories) in a column 'Country'. 
  
    'Country'   -->   'S. Korea', 'US', 'Canada', 'Germany', 'Japan', 'Spain', 'France', ... 

     S. Korea             1        0       0          0         0         0        0    ...
     US                   0        1       0          0         0         0        0    ...
     Canada               0        0       1          0         0         0        0    ...   
     Germany              0        0       0          1         0         0        0    ...     
     Japan                0        0       0          0         1         0        0    ...   
     Spain                0        0       0          0         0         1        0    ...   
     France               0        0       0          0         0         0        1    ...     
      ...                  ...

The number of new columns is 159 (160 - 1), which is too many.  
    
Apply `KDD Orange algorithm:`

(a) Count the frequency of each unique value \
(b) Choose the 'k' most frequent values \
(c) Make only 'k'-1 new columns, ignoring the rest
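
Steps (a) to (c) could be sketched as follows. The toy data, `k = 3`, and the `Country_` column prefix are assumptions for illustration. Dropping the last of the top-k indicators follows step (c) as written; some implementations keep all k columns instead.

```python
import pandas as pd

# Hypothetical toy column; a real one would have many more categories
countries = pd.Series(
    ["US", "US", "US", "Canada", "Canada", "Japan", "Japan", "Spain", "France"],
    name="Country",
)

k = 3  # assumed number of top categories to keep

# (a) + (b): count frequencies and take the k most frequent categories
top_k = countries.value_counts().head(k).index.tolist()

# Indicator column for each top-k category; all other categories become all zeros
encoded = pd.DataFrame(
    {f"Country_{c}": (countries == c).astype(int) for c in top_k}
)

# (c): keep only k-1 of those columns
encoded = encoded.iloc[:, :-1]
print(encoded)
```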

#### <font color=red>B-1-c. Mean Encoding</font>

(a) Find the output values associated with 'feature 1'. 
  
    'feature 1'    -->   'Output (e.g., classified binary labels)' 
       A                    1
       B                    1
       C                    0
       D                    1
       D                    1
       A                    0
       C                    1 
      ...                  ...

(b) Compute the average output for each category, e.g.: 
    
    A average output : 0.87 
    B average output : 0.13 
    C average output : 0.41 
    D average output : 0.69 
    
(c) Replace each value in 'feature 1' with the corresponding mean from step (b). 
  
    'feature 1'    -->   'new feature 1'
       A                       0.87
       B                       0.13
       C                       0.41
       D                       0.69
       D                       0.69
       A                       0.87
       C                       0.41
      ...                     ...
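
A minimal pandas sketch of steps (a) to (c), using only the seven rows shown above. The column names are illustrative, and the resulting means differ from the example values (0.87, 0.13, ...), which assume the full, elided dataset.

```python
import pandas as pd

# (a) The seven visible rows of 'feature 1' and their outputs
df = pd.DataFrame(
    {
        "feature_1": ["A", "B", "C", "D", "D", "A", "C"],
        "output": [1, 1, 0, 1, 1, 0, 1],
    }
)

# (b) Average output per category
means = df.groupby("feature_1")["output"].mean()

# (c) Replace each category with its mean
df["new_feature_1"] = df["feature_1"].map(means)
print(df)
```

In practice the means would be computed on the training split only and then mapped onto validation or test data, so that the target does not leak into the features.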

#### <font color=red>B-1-d. Count (frequency) encoding</font>

(a) Count how many times each category appears

(b) Replace each category with its count
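
The two steps can be sketched in pandas as below; the toy column reuses the 'feature 1' values from the mean-encoding example, and the names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({"feature_1": ["A", "B", "C", "D", "D", "A", "C"]})

# (a) Count how many times each category appears
counts = df["feature_1"].value_counts()

# (b) Replace each category with its count
df["feature_1_count"] = df["feature_1"].map(counts)
print(df)
```

Passing `normalize=True` to `value_counts` would give relative frequencies instead of raw counts.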

#### <font color=orange>B-2-a. Label Encoding</font>

Assign a label to each category in a 'feature', following the intrinsic order of the categories. 

EXAMPLE 1:

    'feature 1 (degrees)'    -->   'labels' 
       BS                             2
       MS                             3
       PhD                            4
       HS                             1
       HS                             1
       MS                             3
       BS                             2 
      ...                            ...

EXAMPLE 2:

    'feature 1 (Grades)'    -->   'labels' 
       A                             4
       B                             3
       A                             4
       A                             4
       C                             2
       F                             0
       D+                            1.5 
      ...                            ...
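
Example 1 can be reproduced with an explicit mapping, as sketched below. Writing the order out by hand (the `degree_order` dictionary, an illustrative name) keeps the ranking under your control, whereas an automatic encoder would typically number categories alphabetically.

```python
import pandas as pd

df = pd.DataFrame({"degree": ["BS", "MS", "PhD", "HS", "HS", "MS", "BS"]})

# Explicit ranking of the categories, lowest to highest
degree_order = {"HS": 1, "BS": 2, "MS": 3, "PhD": 4}

# Replace each degree with its label
df["label"] = df["degree"].map(degree_order)
print(df)
```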

#### <font color=orange>B-2-b. Target Guided Ordinal Encoding</font>

(a) Find the output values associated with 'feature 1'. 
  
    'feature 1'    -->   'Output (e.g., classified binary labels)' 
       A                    1
       B                    1
       C                    0
       D                    1
       D                    1
       A                    0
       C                    1 
      ...                  ...

(b) Compute the average output for each category, e.g.: 

    A average output : 0.87 
    B average output : 0.13 
    C average output : 0.41 
    D average output : 0.69 
    
(c) Assign labels by ranking the averages from step (b). 

    A label : 4 (highest)
    B label : 1 (lowest)
    C label : 2 (lower)
    D label : 3 (higher)
    
(d) Replace each value in 'feature 1' with the corresponding label from step (c). 
  
    'feature 1'    -->   'new feature 1'
       A                       4
       B                       1
       C                       2
       D                       3
       D                       3
       A                       4
       C                       2
      ...                     ...
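
Steps (a) to (d) might look like this in pandas. The twelve rows are hypothetical toy data, chosen so that the category means are distinct and rank in the same order as the example (B < C < D < A). Using `rank(method="dense")` to turn means into labels is one possible choice, not the only one.

```python
import pandas as pd

# (a) Hypothetical toy data: 'feature 1' and a binary output
df = pd.DataFrame(
    {
        "feature_1": ["A", "A", "A", "A", "B", "B", "C", "C", "C", "D", "D", "D"],
        "output": [1, 1, 1, 0, 0, 0, 1, 0, 0, 1, 1, 0],
    }
)

# (b) Average output per category: A=0.75, B=0.0, C=0.333..., D=0.666...
means = df.groupby("feature_1")["output"].mean()

# (c) Rank the means: the lowest mean gets label 1, the highest gets label 4
labels = means.rank(method="dense").astype(int)

# (d) Replace each category with its label
df["new_feature_1"] = df["feature_1"].map(labels)
print(labels.to_dict())
```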