# Convert Categorical values in dataframe to Integers (Categorical or Binary)

**References:**

1. https://machinelearningmastery.com/how-to-one-hot-encode-sequence-data-in-python/
2. https://www.analyticsvidhya.com/blog/2016/07/practical-guide-data-preprocessing-python-scikit-learn/

## Methodology

1. The dataset must contain **categorical** values.

2. First, convert the categorical values of the dataframe to **categorical integers** using sklearn's *LabelEncoder*.
3. Then use the categorical integers generated in **step 2** to create a binary one-hot encoding with sklearn's *OneHotEncoder*.

<img src='Data/0.PNG'>

## Imports

In [13]:
import pandas as pd
import numpy as np

import sklearn
import sklearn.preprocessing
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

## Dataset

In [2]:
df = pd.DataFrame(
{
    "Company": ['A', 'B', 'C', 'A', 'D', 'A', 'B'],
    "Rank": [1,4,2,1,3,1,4],
    "Origin": ['India', 'UK', 'USA', 'India', 'India', 'India', 'UK'],
    "Name of CEO": ['Raghav H.', 'S.M. Kishore', 'Shakti Prasad', 'Raghav H.', 'Oshiro Mata', 'Raghav H.', 'S.M. Kishore'],
    "Sex": ['male', 'female', 'male', 'male', 'female', 'male', 'female'],
})

df

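| | Company | Name of CEO | Origin | Rank | Sex |
| --- | --- | --- | --- | --- | --- |
| 0 | A | Raghav H. | India | 1 | male |
| 1 | B | S.M. Kishore | UK | 4 | female |
| 2 | C | Shakti Prasad | USA | 2 | male |
| 3 | A | Raghav H. | India | 1 | male |
| 4 | D | Oshiro Mata | India | 3 | female |
| 5 | A | Raghav H. | India | 1 | male |
| 6 | B | S.M. Kishore | UK | 4 | female |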


----

## Step 1: Converting to Categorical Integers

In [3]:
# Label Encoder...
le = LabelEncoder()

In [4]:
labeled_df = df.apply(le.fit_transform)
labeled_df

# OR each col wise...
# labeled_df = pd.DataFrame()
# for col in df.columns:    
#     labeled_df[col] = le.fit_transform(df[col])

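| | Company | Name of CEO | Origin | Rank | Sex |
| --- | --- | --- | --- | --- | --- |
| 0 | 0 | 1 | 0 | 0 | 1 |
| 1 | 1 | 2 | 1 | 3 | 0 |
| 2 | 2 | 3 | 2 | 1 | 1 |
| 3 | 0 | 1 | 0 | 0 | 1 |
| 4 | 3 | 0 | 0 | 2 | 0 |
| 5 | 0 | 1 | 0 | 0 | 1 |
| 6 | 1 | 2 | 1 | 3 | 0 |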
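Because `df.apply(le.fit_transform)` refits the same encoder on every column, `le` only remembers the classes of the last column once the cell has run. If the integers need to be mapped back to the original values later, one option is to keep a separate fitted encoder per column. The sketch below assumes a two-column subset of the dataset, and the names `encoders`, `labeled` and `restored` are hypothetical, chosen only for this illustration.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Two-column subset of the dataset used in this notebook
df = pd.DataFrame({
    "Company": ["A", "B", "C", "A", "D", "A", "B"],
    "Sex": ["male", "female", "male", "male", "female", "male", "female"],
})

# One fitted LabelEncoder per column, stored by column name
encoders = {col: LabelEncoder().fit(df[col]) for col in df.columns}

# Encode every column with its own encoder
labeled = pd.DataFrame(
    {col: encoders[col].transform(df[col]) for col in df.columns}
)

# Decode the integers back to the original labels
restored = pd.DataFrame(
    {col: encoders[col].inverse_transform(labeled[col]) for col in df.columns}
)
```

Keeping the encoders around also makes it possible to apply the same mapping to new data with `transform`, provided the new data contains no unseen labels.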


#### KEY

In [6]:
# Build a key that maps each assigned integer back to its original value
frames = []

for col in df.columns:
    # LabelEncoder assigns integers to the sorted unique values of a column
    values = sorted(df[col].unique())
    frames.append(pd.DataFrame({
        'Parameter': col,
        'Label': range(len(values)),
        'Value': values,
    }))

Library = pd.concat(frames, axis=0)

Library

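| | Parameter | Label | Value |
| --- | --- | --- | --- |
| 0 | Company | 0 | A |
| 1 | Company | 1 | B |
| 2 | Company | 2 | C |
| 3 | Company | 3 | D |
| 0 | Name of CEO | 0 | Oshiro Mata |
| 1 | Name of CEO | 1 | Raghav H. |
| 2 | Name of CEO | 2 | S.M. Kishore |
| 3 | Name of CEO | 3 | Shakti Prasad |
| 0 | Origin | 0 | India |
| 1 | Origin | 1 | UK |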


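Remember: LabelEncoder always works in ascending order of value. The unique values of a column are sorted first, and integers are then assigned to them in that order, starting from 0. OneHotEncoder orders its output columns by sorted category in the same way.

For example, for Company the sorted values [Accenture, IBM, Infosys, Tata] map to [0, 1, 2, 3].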
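The sorted-order rule can be checked directly through the encoder's `classes_` attribute, which holds the sorted unique labels; the position of a label in `classes_` is the integer it receives. The snippet below is a small illustration using the company names from the example above, supplied in unsorted order. The variable names `companies`, `codes` and `mapping` are hypothetical.

```python
from sklearn.preprocessing import LabelEncoder

# Company names from the example, deliberately unsorted and with a repeat
companies = ["Tata", "IBM", "Accenture", "Infosys", "IBM"]

le = LabelEncoder()
codes = le.fit_transform(companies)

# classes_ is sorted; a label's index in classes_ is its integer code
classes = le.classes_.tolist()
mapping = {label: code for code, label in enumerate(classes)}

print(classes)         # ['Accenture', 'IBM', 'Infosys', 'Tata']
print(codes.tolist())  # [3, 1, 0, 2, 1]
```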

---

## Step 2: Converting Categorical Integers to Binary One-Hot Encoding

#### INPUT 

*NOTE*: The input is the matrix of categorical integers, i.e. labeled_df.

## Method 1: All at once

In [7]:
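# All columns are one-hot encoded by default
# (the categorical_features='all' argument was removed in scikit-learn 0.22)
OHE = OneHotEncoder()

# fit_transform returns a sparse matrix; convert it to a dense array
X = OHE.fit_transform(labeled_df)
X = X.toarray()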

In [8]:
# Column names: the sorted unique values of each original column, flattened into one list
final_df = pd.DataFrame(X, columns=[value for col in df.columns for value in sorted(df[col].unique())])
final_df

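| | A | B | C | D | Oshiro Mata | Raghav H. | S.M. Kishore | Shakti Prasad | India | UK | USA | 1 | 2 | 3 | 4 | female | male |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 1 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 |
| 2 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 3 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 4 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 |
| 5 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 6 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 |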


## Method 2: Column by column

In [15]:
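# sparse_output=False returns a dense NumPy array
# (this argument was named sparse before scikit-learn 1.2)
OHE = sklearn.preprocessing.OneHotEncoder(sparse_output=False)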

In [16]:
OneHotEncoded_df = pd.DataFrame()

for col in labeled_df.columns:
    
    # Fitting the data
    OHE.fit(labeled_df[[col]])
    
    # Transforming into a proper matrix representation
    transformed = OHE.transform(labeled_df[[col]])
    
    # Storing one by one in a temp dataframe
    temp_df = pd.DataFrame(transformed, columns= list(Library[Library.Parameter == col].Value.values))
    
    OneHotEncoded_df = pd.concat([OneHotEncoded_df, temp_df], axis=1)
    
OneHotEncoded_df

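| | A | B | C | D | Oshiro Mata | Raghav H. | S.M. Kishore | Shakti Prasad | India | UK | USA | 1 | 2 | 3 | 4 | female | male |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 1 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 |
| 2 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 3 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 4 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 |
| 5 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 6 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 |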
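For comparison, pandas can produce a similar one-hot table on its own with `pd.get_dummies`. This is a different technique from the sklearn encoders used above, shown only as a sketch on a two-column subset of the dataset; the name `dummies` is hypothetical. Unlike a fitted *OneHotEncoder*, it does not remember the categories for later use on new data.

```python
import pandas as pd

# Two-column subset of the dataset used in this notebook
df = pd.DataFrame({
    "Company": ["A", "B", "C", "A", "D", "A", "B"],
    "Rank": [1, 4, 2, 1, 3, 1, 4],
})

# columns= forces the numeric Rank column to be treated as categorical too;
# dtype=float matches the 1.0/0.0 values produced by OneHotEncoder
dummies = pd.get_dummies(df, columns=["Company", "Rank"], dtype=float)
print(dummies.columns.tolist())
```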


----