<a href="https://colab.research.google.com/github/nurfnick/Data_Viz/blob/main/18_One_Hot_Encoding.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Factors and One Hot Encoding

Often we want to use strings numerically, beyond just counting them.  In data science we frequently want to include a categorical variable in a model, and most models expect numeric input.  To handle this, we can apply a variety of transformations to categorical variables.

## One Hot Encoding

Perhaps the easiest to understand is the **one hot encoder**: we create a new column for every category in the categorical variable.  In pandas, this is performed by the `get_dummies` function.  Let's see it in action.

In [1]:
import pandas as pa

df = pa.read_csv('https://raw.githubusercontent.com/nurfnick/Data_Viz/main/iris.csv')

df.head()

Unnamed: 0,SepalLength,SepalWidth,PedalLength,PedalWidth,Class
0,5.1,3.5,1.4,0.2,Iris-setosa
1,4.9,3.0,1.4,0.2,Iris-setosa
2,4.7,3.2,1.3,0.2,Iris-setosa
3,4.6,3.1,1.5,0.2,Iris-setosa
4,5.0,3.6,1.4,0.2,Iris-setosa


In [8]:
pa.get_dummies(df.Class).head()

Unnamed: 0,Iris-setosa,Iris-versicolor,Iris-virginica
0,1,0,0
1,1,0,0
2,1,0,0
3,1,0,0
4,1,0,0


We see that each column is named after a category, with a 1 indicating membership and a 0 indicating non-membership.
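To make the idea concrete without downloading anything, here is a minimal sketch on a small hand-made table. The `flowers` DataFrame and its values are made up for illustration and are not the iris file used above. Note that recent pandas versions return `True`/`False` columns from `get_dummies` by default, so the sketch passes `dtype=int` to get the 0/1 display shown above.

```python
import pandas as pa

# Hypothetical toy data standing in for the iris Class column
flowers = pa.DataFrame({
    "PetalLength": [1.4, 4.7, 6.0, 1.3],
    "Class": ["Iris-setosa", "Iris-versicolor", "Iris-virginica", "Iris-setosa"],
})

# dtype=int requests 0/1 indicators instead of booleans
dummies = pa.get_dummies(flowers["Class"], dtype=int)

# Attach the indicator columns back onto the numeric feature
encoded = pa.concat([flowers.drop(columns="Class"), dummies], axis=1)
print(encoded)
```

Each row has exactly one 1 among the indicator columns, which is where the name "one hot" comes from.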

We recently ran across some data in which each entry was a list (we actually had to do a bit of cleaning to get there, but that is included below!).  Perhaps we want indicators for each item in those lists.

In [10]:
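from io import StringIO
import re

from bs4 import BeautifulSoup
import requests

r = requests.get('https://en.wikipedia.org/wiki/List_of_highest_mountains_on_Earth')
html_contents = r.text
html_soup = BeautifulSoup(html_contents,"lxml")
tables = html_soup.find_all('table',class_="wikitable")

# Wrap the HTML string in StringIO; newer pandas versions no longer accept literal HTML
df1 = pa.read_html(StringIO(str(tables)))[0]
# Drop the two extra header levels so only the bottom column names remain
df1.columns = df1.columns.droplevel(0).droplevel(0)

# The last column holds the countries
newcol = df1.iloc[:,-1]
# Strip bracketed footnote markers
newcol = newcol.apply(lambda x: re.sub(r"\[(.+?)\]","",x))
# Keep letters only ([A-z] would also let through the punctuation between Z and a)
newcol = newcol.apply(lambda x: re.sub(r"[^A-Za-z]","",x))
# Split into words at each capital letter
newcol = newcol.apply(lambda x: re.findall(r"[A-Z][a-z]*",x))

newcol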

0         [Nepal, China]
1      [Pakistan, China]
2         [Nepal, India]
3         [Nepal, China]
4         [Nepal, China]
             ...        
115              [China]
116       [Nepal, China]
117      [Bhutan, China]
118       [India, China]
119           [Pakistan]
Name: Country (disputed claims in italics), Length: 120, dtype: object
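The three regular-expression steps are easier to follow on a single string. Below is a small standalone sketch; the helper name `extract_countries` and the sample string are made up for illustration and are not taken from the Wikipedia table.

```python
import re

def extract_countries(raw):
    """Turn a raw cell such as 'Nepal[dp 1] China' into a list of names."""
    # 1. Remove bracketed footnote markers like [dp 1]
    no_refs = re.sub(r"\[(.+?)\]", "", raw)
    # 2. Remove everything that is not a letter (spaces, commas, digits)
    letters_only = re.sub(r"[^A-Za-z]", "", no_refs)
    # 3. Split the run of letters at each capital letter
    return re.findall(r"[A-Z][a-z]*", letters_only)

print(extract_countries("Nepal[dp 1] China"))  # ['Nepal', 'China']
```

One limitation of splitting at capital letters: a multi-word name such as "Sri Lanka" would come back as two separate items, so this approach only suits data with single-word names.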

First I'll convert the Series into a DataFrame with one column per list position.

In [25]:
newcol.apply(pa.Series)

Unnamed: 0,0,1,2
0,Nepal,China,
1,Pakistan,China,
2,Nepal,India,
3,Nepal,China,
4,Nepal,China,
...,...,...,...
115,China,,
116,Nepal,China,
117,Bhutan,China,
118,India,China,


Next I'll use the `stack` method to break each row apart into its individual pieces.

In [27]:
newcol.apply(pa.Series).stack()

0    0       Nepal
     1       China
1    0    Pakistan
     1       China
2    0       Nepal
            ...   
117  0      Bhutan
     1       China
118  0       India
     1       China
119  0    Pakistan
Length: 161, dtype: object
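The same two steps can be tried on a tiny hand-made Series to see what the index looks like afterwards. The `toy` data here is hypothetical. I also add an explicit `dropna()` as a precaution, so the result does not depend on how a given pandas version treats the empty cells when stacking.

```python
import pandas as pa

# Hypothetical toy data: each entry is a list of countries
toy = pa.Series([["Nepal", "China"], ["Pakistan"], ["India", "China"]])

# One column per list position; shorter lists are padded with NaN
wide = toy.apply(pa.Series)

# One row per (original row, list position) pair, with the padding removed
long = wide.stack().dropna()
print(long)
```

The result has a two-level index: the outer level is the original row number and the inner level is the position within the list.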

Now I can get the dummies!

In [28]:
pa.get_dummies(newcol.apply(pa.Series).stack())

Unnamed: 0,Unnamed: 1,Afghanistan,Bhutan,China,India,Kyrgyzstan,Nepal,Pakistan,Tajikistan
0,0,0,0,0,0,0,1,0,0
0,1,0,0,1,0,0,0,0,0
1,0,0,0,0,0,0,0,1,0
1,1,0,0,1,0,0,0,0,0
2,0,0,0,0,0,0,1,0,0
...,...,...,...,...,...,...,...,...,...
117,0,0,1,0,0,0,0,0,0
117,1,0,0,1,0,0,0,0,0
118,0,0,0,0,1,0,0,0,0
118,1,0,0,1,0,0,0,0,0


Lastly, we bring it all back together by grouping on the outer index level with `groupby(level=0)` and summing the indicators, which gives one row per original entry.

In [22]:
pa.get_dummies(newcol.apply(pa.Series).stack()).groupby(level = 0).sum()

Unnamed: 0,Afghanistan,Bhutan,China,India,Kyrgyzstan,Nepal,Pakistan,Tajikistan
0,0,0,1,0,0,1,0,0
1,0,0,1,0,0,0,1,0
2,0,0,0,1,0,1,0,0
3,0,0,1,0,0,1,0,0
4,0,0,1,0,0,1,0,0
...,...,...,...,...,...,...,...,...
115,0,0,1,0,0,0,0,0
116,0,0,1,0,0,1,0,0
117,0,1,1,0,0,0,0,0
118,0,0,1,1,0,0,0,0
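As an aside, a shorter route to the same kind of table is the `explode` method, which replaces the `apply(pa.Series).stack()` step. This is a different technique from the one used above, sketched here on hypothetical toy data rather than on `newcol`.

```python
import pandas as pa

# Hypothetical toy data: each entry is a list of countries
toy = pa.Series([["Nepal", "China"], ["Pakistan"], ["India", "China"]])

# explode gives one row per list item and repeats the original index
exploded = toy.explode()

# One hot encode each item, then add the rows that share an original index
indicators = pa.get_dummies(exploded, dtype=int).groupby(level=0).sum()
print(indicators)
```

Because `explode` keeps a single-level index with repeated labels, `groupby(level=0)` still collects all the items belonging to the same original row.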
