# One Hot Encoding

In a multi-class classification problem, the y variable must be one-hot encoded before it is fed to the neural net.  sklearn provides several functions for this, described below.  

**X Variable**  
The easiest option is Pandas `pd.get_dummies(dataframe columns to encode)`, because Pandas gives you column headings.  sklearn returns plain arrays with no column headings, so with `OneHotEncoder` and `LabelEncoder` you have to look up the column order on the fitted encoder yourself (`categories_` or `classes_`).  
  
**Y Variable**  
The easiest options are `LabelBinarizer` and `MultiLabelBinarizer`.  Both expose a `classes_` attribute that can be used as column headings.  See usage below.
  
Once you have the pandas one-hot dataframe, you can convert it to an array using `np.asarray(df)` and feed it to the neural net.  
  
See example below:  
  

In [119]:
import numpy as np
import pandas as pd

mylist = ['apple', 'banana', 'pear', 'pear', 'apple', 'apple']

     

In [120]:
pd.get_dummies(pd.DataFrame(mylist))

   0_apple  0_banana  0_pear
0        1         0       0
1        0         1       0
2        0         0       1
3        0         0       1
4        1         0       0
5        1         0       0
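
The conversion from the one-hot dataframe to an array can be sketched as follows. This is a minimal example with a few assumptions: the column name `fruit` is only illustrative, and `dtype=int` is passed explicitly because pandas 2.0 and later return boolean columns by default rather than 0/1.

```python
import numpy as np
import pandas as pd

mylist = ['apple', 'banana', 'pear', 'pear', 'apple', 'apple']

# Name the column so the headings read fruit_apple, fruit_banana, ...
# ('fruit' is an illustrative name, not from the original notebook)
df = pd.DataFrame({'fruit': mylist})

# dtype=int forces 0/1 output regardless of the pandas version
onehot_df = pd.get_dummies(df, dtype=int)
print(list(onehot_df.columns))

# Plain NumPy array, ready to pass to a neural net
onehot_arr = np.asarray(onehot_df)
print(onehot_arr.shape)
```

Keeping `onehot_df.columns` around is what lets you map array columns back to categories later.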


## One hot using sklearn
  
sklearn has several functions for one-hot encoding, each with an example below.  All of them follow the same pattern:  
  
1. Create an instance of the encoder and assign it to a variable, e.g. `le = LabelEncoder()`.  
2. Call `le.fit_transform(values)`, where `values` is the array containing the text label categories.  
3. Each returns an array (`OneHotEncoder` returns a sparse matrix unless told otherwise).  
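
The three steps above can be sketched as follows, here with `LabelEncoder`. The example also shows, as an addition to the steps listed, that a fitted encoder can be reused on new data with `transform`; the `new_values` array is made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

values = np.array(['apple', 'banana', 'pear', 'pear', 'apple', 'apple'])

# 1. Create the encoder object and bind it to a variable
le = LabelEncoder()

# 2. Learn the categories and encode the values in one call
encoded = le.fit_transform(values)

# 3. The result is a NumPy array; the fitted encoder remembers the categories
print(type(encoded), encoded)
print(le.classes_)

# A fitted encoder can encode new data with the same mapping
new_values = np.array(['pear', 'apple'])  # illustrative data
print(le.transform(new_values))
```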
  

***
**Prep the data**

In [129]:
# Convert list of labels to array
values = np.asarray(mylist)
print('values: \n',values, '\n\n')
print('values shape: \n',values.shape, '\n\n')

# Now reshape it to be a 2D array
values = values.reshape(-1, 1)  # -1 lets NumPy infer the number of rows
print('values after reshape: \n',values, '\n\n')
print('values new shape: \n',values.shape, '\n\n')


values: 
 ['apple' 'banana' 'pear' 'pear' 'apple' 'apple'] 


values shape: 
 (6,) 


values after reshape: 
 [['apple']
 ['banana']
 ['pear']
 ['pear']
 ['apple']
 ['apple']] 


values new shape: 
 (6, 1) 




In [122]:
import numpy as np
import pandas as pd
from numpy import array
from numpy import argmax

from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import LabelBinarizer
from sklearn.preprocessing import MultiLabelBinarizer

***
**OneHotEncoder**  
Used for X variables.  Converts one or more columns of categorical text directly to one-hot format.  Takes a 2D array as input, which is why `values` was reshaped above.

In [123]:
oh = OneHotEncoder(sparse_output=False)  # this argument was named `sparse` before scikit-learn 1.2
myonehot = oh.fit_transform(values)
myonehot


array([[1., 0., 0.],
       [0., 1., 0.],
       [0., 0., 1.],
       [0., 0., 1.],
       [1., 0., 0.],
       [1., 0., 0.]])
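
To work out which output column belongs to which category, the fitted encoder's `categories_` attribute can be inspected. The sketch below assumes a second, made-up colour column to show the multi-column case, and calls `toarray()` on the default sparse result so that it does not depend on the name of the sparse argument.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Two categorical feature columns (the colour column is made up for illustration)
X = np.array([['apple', 'red'],
              ['banana', 'yellow'],
              ['pear', 'green'],
              ['apple', 'green']])

oh = OneHotEncoder()
# The default result is a SciPy sparse matrix; toarray() makes it dense
X_onehot = oh.fit_transform(X).toarray()

# categories_ holds one array of labels per input column, in output column order
for col_idx, cats in enumerate(oh.categories_):
    print(col_idx, list(cats))

print(X_onehot.shape)
```

The output columns are the categories of the first input column followed by those of the second, each group in sorted order.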

***
**LabelEncoder**  
Used for Y variables.  This does not produce one-hot encoding; it produces integer encoding.

In [124]:
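le = LabelEncoder()
# Named int_labels rather than int so the built-in int type is not shadowed
int_labels = le.fit_transform(values.ravel())  # LabelEncoder needs a 1D array
print("Now int_labels has integers, type is ", type(int_labels))
print('int_labels shape: ', int_labels.shape)
int_labels

Now int_labels has integers, type is  <class 'numpy.ndarray'>
int_labels shape:  (6,)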


array([0, 1, 2, 2, 0, 0], dtype=int64)
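
The integer codes can be mapped back to text with `inverse_transform`, and turned into one-hot rows by indexing an identity matrix. The following is a small sketch of both; the `np.eye` trick is one common approach, not something the original notebook uses.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

labels = np.array(['apple', 'banana', 'pear', 'pear', 'apple', 'apple'])

le = LabelEncoder()
int_labels = le.fit_transform(labels)

# Map integer codes back to the original text labels
restored = le.inverse_transform(int_labels)
print(restored)

# Row i of the identity matrix is the one-hot vector for class i,
# so indexing it with the codes yields one row per observation
onehot = np.eye(len(le.classes_), dtype=int)[int_labels]
print(onehot)
```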

***  
**LabelBinarizer**  
Used for Y variables - produces one-hot encoding for Y variables.  Each observation belongs to one and only one class.

In [125]:
lb = LabelBinarizer()
myonehot = lb.fit_transform(values)
# Reuse the array already computed instead of fitting a second time
my1hot_df = pd.DataFrame(myonehot, columns=lb.classes_)
print(my1hot_df)
print('\n \n')
print(myonehot)

   apple  banana  pear
0      1       0     0
1      0       1     0
2      0       0     1
3      0       0     1
4      1       0     0
5      1       0     0

 

[[1 0 0]
 [0 1 0]
 [0 0 1]
 [0 0 1]
 [1 0 0]
 [1 0 0]]
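
Going the other way, from network outputs back to labels, can be sketched as below. The `probs` array is made-up data standing in for a network's predictions. The example also shows a caveat worth knowing: with only two classes, `LabelBinarizer` returns a single 0/1 column rather than two columns.

```python
import numpy as np
from sklearn.preprocessing import LabelBinarizer

lb = LabelBinarizer()
y_onehot = lb.fit_transform(['apple', 'banana', 'pear', 'pear', 'apple', 'apple'])

# Made-up network outputs (one row per observation, one column per class)
probs = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.3, 0.6]])

# argmax picks the most likely column; classes_ maps it back to a label
predicted = lb.classes_[np.argmax(probs, axis=1)]
print(predicted)

# With only two classes the result is a single 0/1 column, not two columns
two_class = LabelBinarizer().fit_transform(['yes', 'no', 'yes'])
print(two_class.shape)
```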


***
**MultiLabelBinarizer**  
Used when an observation can belong to multiple labels at once.

In [126]:
df = pd.DataFrame({"genre": [["action", "drama", "fantasy"],
                             ["fantasy", "action"],
                             ["drama"],
                             ["sci-fi", "drama"]]})

In [127]:
df

                      genre
0  [action, drama, fantasy]
1         [fantasy, action]
2                   [drama]
3           [sci-fi, drama]


In [128]:
mlb = MultiLabelBinarizer()
myonehot = mlb.fit_transform(df['genre'])
# Reuse the array already computed instead of fitting a second time
my1hot_df = pd.DataFrame(myonehot, columns=mlb.classes_)
print('mlb.classes \n',mlb.classes_, '\n\n')
print('my1hot_df \n', my1hot_df, '\n\n')
print('myonehot \n', myonehot, '\n\n')

mlb.classes 
 ['action' 'drama' 'fantasy' 'sci-fi'] 


my1hot_df 
    action  drama  fantasy  sci-fi
0       1      1        1       0
1       1      0        1       0
2       0      1        0       0
3       0      1        0       1 


myonehot 
 [[1 1 1 0]
 [1 0 1 0]
 [0 1 0 0]
 [0 1 0 1]] 
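
The multi-label encoding can also be reversed. This is a minimal sketch using `inverse_transform`, with the genre lists copied from the example above into a plain Python list.

```python
from sklearn.preprocessing import MultiLabelBinarizer

genres = [["action", "drama", "fantasy"],
          ["fantasy", "action"],
          ["drama"],
          ["sci-fi", "drama"]]

mlb = MultiLabelBinarizer()
encoded = mlb.fit_transform(genres)

# inverse_transform returns one tuple of labels per row,
# ordered as in mlb.classes_ rather than as in the input
decoded = mlb.inverse_transform(encoded)
for row in decoded:
    print(row)
```

Note that the original order of labels within each observation is not preserved, only the set of labels.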




***
  
  
**useless stuff**  
I wrote this function to get the equivalent of `LabelEncoder` before I knew about it.  It is no longer needed.

In [33]:
def to_cat(l):
    """Map each label in l to an integer code (1-based, unlike LabelEncoder's 0-based codes)."""
    uniques = sorted(set(l))  # sorted so the codes are reproducible between runs
    word_to_index = {}
    index_to_word = {}
    for x, word in enumerate(uniques, start=1):
        index_to_word[x] = word
        word_to_index[word] = x

    # Encode the input list using the lookup table
    l1 = [word_to_index[t] for t in l]

    return l1, index_to_word, word_to_index
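
To check the claim that this is equivalent to `LabelEncoder`, here is a compact variant compared against sklearn. The name `to_cat_sorted` is hypothetical. Unlike the function above, it assumes sorted, 0-based codes so that the results match `LabelEncoder` exactly.

```python
from sklearn.preprocessing import LabelEncoder

def to_cat_sorted(labels):
    """Hypothetical compact variant: sorted, 0-based codes like LabelEncoder."""
    uniques = sorted(set(labels))
    word_to_index = {word: i for i, word in enumerate(uniques)}
    index_to_word = {i: word for word, i in word_to_index.items()}
    codes = [word_to_index[word] for word in labels]
    return codes, index_to_word, word_to_index

mylist = ['apple', 'banana', 'pear', 'pear', 'apple', 'apple']
codes, index_to_word, word_to_index = to_cat_sorted(mylist)
print(codes)

# Same codes as sklearn's LabelEncoder for this input
assert codes == LabelEncoder().fit_transform(mylist).tolist()
```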