
One-Hot Encoding in Scikit-learn

Intuition

    1. You will prepare your categorical data using LabelEncoder()
    2. You will apply OneHotEncoder() to the integer-encoded DataFrame produced in step 1



In [1]:
!pip install scikit-learn

Collecting scikit-learn
  Downloading https://files.pythonhosted.org/packages/5e/82/c0de5839d613b82bddd088599ac0bbfbbbcbd8ca470680658352d2c435bd/scikit_learn-0.20.3-cp36-cp36m-manylinux1_x86_64.whl (5.4MB)
Installing collected packages: scikit-learn
Successfully installed scikit-learn-0.20.3


In [2]:
import numpy as np
import pandas as pd

In [12]:
X=pd.read_csv("Titanic.csv")

In [13]:
X.head(3)

Unnamed: 0.1,Unnamed: 0,Class,Sex,Age,Survived,Freq
0,1,1st,Male,Child,No,0
1,2,2nd,Male,Child,No,0
2,3,3rd,Male,Child,No,35


In [17]:
# limit to categorical data using df.select_dtypes()
X=X.select_dtypes(include=[object])

In [18]:
X.head(3)

Unnamed: 0,Class,Sex,Age,Survived
0,1st,Male,Child,No
1,2nd,Male,Child,No
2,3rd,Male,Child,No
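The filtering step above can be tried on a small frame first. This is only a sketch: the `toy` DataFrame is a hypothetical stand-in for Titanic.csv, and it uses `exclude="number"` rather than `include=[object]`, which should select the same columns here without depending on how pandas stores strings.

```python
import pandas as pd

# Hypothetical toy frame standing in for the first rows of Titanic.csv
toy = pd.DataFrame({
    "Class": ["1st", "2nd", "3rd"],
    "Sex": ["Male", "Male", "Male"],
    "Freq": [0, 0, 35],
})

# Drop the numeric columns; what remains are the string-valued (categorical) ones
categorical = toy.select_dtypes(exclude="number")
print(list(categorical.columns))
```

The numeric `Freq` column is dropped, which mirrors how the notebook loses `Freq` and the unnamed integer column.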


In [20]:
# check original shape
X.shape

(32, 4)

In [21]:
# import preprocessing from sklearn
from sklearn import preprocessing

In [22]:
# view columns using df.columns
X.columns

Index(['Class', 'Sex', 'Age', 'Survived'], dtype='object')

In [23]:
# TODO: create a LabelEncoder object and fit it to each feature in X


# 1. INSTANTIATE
# encode labels with value between 0 and n_classes-1.
le = preprocessing.LabelEncoder()

In [25]:
# 2/3. FIT AND TRANSFORM
# use df.apply() to apply le.fit_transform to all columns
X_2 = X.apply(le.fit_transform)
X_2.head()

Unnamed: 0,Class,Sex,Age,Survived
0,0,1,1,0
1,1,1,1,0
2,2,1,1,0
3,3,1,1,0
4,0,0,1,0
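Because `X.apply(le.fit_transform)` refits the same `le` on every column, only the last column's categories remain in `le.classes_` afterwards. If you want to look up or invert the mappings later, one option is to keep a separate encoder per column. The sketch below assumes a hypothetical `toy` frame and an `encoders` dict that are not part of the notebook.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy frame standing in for the Titanic columns
toy = pd.DataFrame({
    "Class": ["1st", "2nd", "3rd", "Crew"],
    "Sex": ["Male", "Male", "Female", "Female"],
})

# One encoder per column, so each mapping stays available afterwards
encoders = {col: LabelEncoder() for col in toy.columns}
encoded = toy.copy()
for col, enc in encoders.items():
    encoded[col] = enc.fit_transform(toy[col])

# classes_ holds the categories in sorted order; the position is the integer code
mapping = {col: list(enc.classes_) for col, enc in encoders.items()}
print(mapping)

# inverse_transform recovers the original strings from the codes
restored = encoders["Sex"].inverse_transform(encoded["Sex"])
print(list(restored))
```

The sorted order explains the output above: `Female` becomes 0 and `Male` becomes 1.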




OneHotEncoder

    - Encode categorical integer features using a one-hot aka one-of-K scheme.
    - The input to this transformer should be a matrix of integers, denoting the values taken on by categorical (discrete) features.
    - The output will be a sparse matrix where each column corresponds to one possible value of one feature.
    - It is assumed that input features take on values in the range [0, n_values).
    - This encoding is needed for feeding categorical data to many scikit-learn estimators, notably linear models and SVMs with the standard kernels.
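To make the one-of-K scheme concrete, here is a hand-rolled version in NumPy. It is an illustration of the idea only, not how scikit-learn implements it; the `codes_*` arrays and `n_values_*` counts are made-up examples.

```python
import numpy as np

# Integer codes for two features, e.g. Class (4 values) and Sex (2 values)
codes_class = np.array([0, 1, 2, 3, 0])
codes_sex = np.array([1, 1, 1, 1, 0])
n_values_class, n_values_sex = 4, 2

# Row i of the identity matrix is the one-hot vector for code i
one_hot_class = np.eye(n_values_class)[codes_class]
one_hot_sex = np.eye(n_values_sex)[codes_sex]

# Each feature contributes one column per possible value
one_hot = np.hstack([one_hot_class, one_hot_sex])
print(one_hot.shape)
```

Every row contains exactly one 1 per original feature, so the row sums equal the number of features.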



In [26]:
# TODO: create a OneHotEncoder object, and fit it to all of X

# 1. INSTANTIATE
enc = preprocessing.OneHotEncoder()

# 2. FIT
enc.fit(X_2)

# 3. TRANSFORM
onehotlabels = enc.transform(X_2).toarray()
onehotlabels.shape

# as you can see, the number of rows is unchanged (32),
# but there are now 10 columns: one for each category of each feature (4 + 2 + 2 + 2)

Note: if you used a LabelEncoder before this OneHotEncoder only to convert the categories to integers, you can now pass the original columns to the OneHotEncoder directly.


(32, 10)
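The note about using the OneHotEncoder directly can be sketched as follows. This assumes scikit-learn 0.20 or later, where `OneHotEncoder` accepts string columns; the `toy` frame is again a hypothetical stand-in for the Titanic data.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy frame with raw string categories (no LabelEncoder step)
toy = pd.DataFrame({
    "Class": ["1st", "2nd", "3rd", "Crew"],
    "Sex": ["Male", "Male", "Female", "Female"],
})

# Fit and transform in one call; the result is sparse, so densify it for viewing
enc = OneHotEncoder()
onehot = enc.fit_transform(toy).toarray()
print(onehot.shape)

# categories_ lists the values found for each column, in output-column order
categories = [list(c) for c in enc.categories_]
print(categories)
```

Four Class values plus two Sex values give six output columns, the same counting that yields 10 columns for the full data.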

In [27]:
onehotlabels

array([[1., 0., 0., 0., 0., 1., 0., 1., 1., 0.],
       [0., 1., 0., 0., 0., 1., 0., 1., 1., 0.],
       [0., 0., 1., 0., 0., 1., 0., 1., 1., 0.],
       [0., 0., 0., 1., 0., 1., 0., 1., 1., 0.],
       [1., 0., 0., 0., 1., 0., 0., 1., 1., 0.],
       [0., 1., 0., 0., 1., 0., 0., 1., 1., 0.],
       [0., 0., 1., 0., 1., 0., 0., 1., 1., 0.],
       [0., 0., 0., 1., 1., 0., 0., 1., 1., 0.],
       [1., 0., 0., 0., 0., 1., 1., 0., 1., 0.],
       [0., 1., 0., 0., 0., 1., 1., 0., 1., 0.],
       [0., 0., 1., 0., 0., 1., 1., 0., 1., 0.],
       [0., 0., 0., 1., 0., 1., 1., 0., 1., 0.],
       [1., 0., 0., 0., 1., 0., 1., 0., 1., 0.],
       [0., 1., 0., 0., 1., 0., 1., 0., 1., 0.],
       [0., 0., 1., 0., 1., 0., 1., 0., 1., 0.],
       [0., 0., 0., 1., 1., 0., 1., 0., 1., 0.],
       [1., 0., 0., 0., 0., 1., 0., 1., 0., 1.],
       [0., 1., 0., 0., 0., 1., 0., 1., 0., 1.],
       [0., 0., 1., 0., 0., 1., 0., 1., 0., 1.],
       [0., 0., 0., 1., 0., 1., 0., 1., 0., 1.],
       [1., 0., 0., 