split a column into multiple columns (One Hot Encode)

Khelil Sator edited this page Jun 30, 2019 · 5 revisions

OneHotEncoder (sklearn) vs LabelEncoder (sklearn)

Suppose you have a country feature which can take the values Germany, France, and Spain.

Let's consider the following data:

    country   
0   Germany		
1   Germany		
2   Spain    	
3   France   	 
4   Germany  	

LabelEncoder converts non-numerical labels to numerical labels.
The 3 countries are replaced by the numbers 0, 1, and 2, assigned in alphabetical order (France = 0, Germany = 1, Spain = 2).

    country
0   1
1   1
2   2
3   0
4   1
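
The mapping above can be reproduced by hand. The following is a minimal sketch in plain Python, assuming (as scikit-learn's LabelEncoder does) that the distinct labels are sorted before being numbered. The helper name label_encode is hypothetical and is not part of any library.

```python
def label_encode(values):
    """Number the sorted distinct labels from 0 and encode each value."""
    classes = sorted(set(values))
    index = {label: i for i, label in enumerate(classes)}
    return classes, [index[v] for v in values]

countries = ['Germany', 'Germany', 'Spain', 'France', 'Germany']
classes, codes = label_encode(countries)
print(classes)  # ['France', 'Germany', 'Spain']
print(codes)    # [1, 1, 2, 0, 1]
```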

The problem with LabelEncoder is that the integers it assigns carry an order and a magnitude that the categories themselves do not have.
An algorithm might therefore conclude that Spain (2) > Germany (1) > France (0).
Likewise, if your algorithm internally computes an average, say of Spain (2) and France (0), it gets (2 + 0)/2 = 1, which would mean that the average of Spain and France is Germany (1).
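
The arithmetic behind this objection is easy to check. This is only an illustration using the codes from the example above; the dictionary codes is a hypothetical stand-in for the encoder's mapping.

```python
codes = {'France': 0, 'Germany': 1, 'Spain': 2}

# The integer codes impose an ordering that the countries do not have.
spurious_order = codes['Spain'] > codes['Germany'] > codes['France']
print(spurious_order)  # True

# Averaging two codes lands exactly on a third, unrelated category.
mean_code = (codes['Spain'] + codes['France']) / 2
print(mean_code)                      # 1.0
print(mean_code == codes['Germany'])  # True
```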

To overcome this problem, we use OneHotEncoder instead of LabelEncoder.

OneHotEncoder takes a column and splits it into multiple columns.
It splits the feature country into 3 features (Germany, France, and Spain), each of which is binary (0 or 1).

We have 3 columns (one per country) with 1s and 0s

    Germany  Spain  France
0   1        0      0
1   1        0      0
2   0        1      0
3   0        0      1
4   1        0      0
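
To make the idea concrete, here is a minimal plain-Python sketch that builds the same table. The function one_hot is a hypothetical helper, and the column order is passed in explicitly so that it matches the table above (scikit-learn would sort the categories instead).

```python
def one_hot(values, categories):
    """Return one row per value, with a 1 in the column of the matching category."""
    return [[1 if value == category else 0 for category in categories]
            for value in values]

countries = ['Germany', 'Germany', 'Spain', 'France', 'Germany']
categories = ['Germany', 'Spain', 'France']  # column order used in the table
rows = one_hot(countries, categories)
for row in rows:
    print(row)
# [1, 0, 0]
# [1, 0, 0]
# [0, 1, 0]
# [0, 0, 1]
# [1, 0, 0]
```

Every row contains exactly one 1, so no country is treated as larger than or closer to another.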

OneHotEncoder (sklearn)

import the class OneHotEncoder

>>> from sklearn.preprocessing import OneHotEncoder

instantiate the class OneHotEncoder and create some sample data

>>> OHE = OneHotEncoder()
>>> X = [['Male', 'Paris', 1], ['Female', 'Paris', 4], ['Female', 'Amsterdam', 4], ['Female', 'Paris', 2], ['Female', 'London', 2]]
>>> X[0]
['Male', 'Paris', 1]

apply the fit method to the OneHotEncoder instance

>>> OHE.fit(X)

The categories learned during fit (one array per input column) can be found in the categories_ attribute

>>> OHE.categories_
[array(['Female', 'Male'], dtype=object), array(['Amsterdam', 'London', 'Paris'], dtype=object), array([1, 2, 4], dtype=object)]
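
As a rough illustration of what fit learns, the per-column categories can be derived in plain Python by collecting the sorted distinct values of each column. This is a sketch of the idea, not scikit-learn's actual implementation.

```python
X = [['Male', 'Paris', 1], ['Female', 'Paris', 4], ['Female', 'Amsterdam', 4],
     ['Female', 'Paris', 2], ['Female', 'London', 2]]

# zip(*X) iterates over the columns of X instead of its rows.
categories = [sorted(set(column)) for column in zip(*X)]
print(categories)
# [['Female', 'Male'], ['Amsterdam', 'London', 'Paris'], [1, 2, 4]]
```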

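return the feature names (get_feature_names was deprecated in scikit-learn 1.0 and removed in 1.2; get_feature_names_out replaces it)

>>> OHE.get_feature_names_out()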
array(['x0_Female', 'x0_Male', 'x1_Amsterdam', 'x1_London', 'x1_Paris', 'x2_1', 'x2_2', 'x2_4'], dtype=object)

apply the method transform to the OneHotEncoder instance (it returns a sparse matrix by default, hence the call to toarray())

>>> OHE.transform([['Female', 'Paris', 2], ['Male', 'Amsterdam', 1]]).toarray()
array([[1., 0., 0., 0., 1., 0., 1., 0.], [0., 1., 1., 0., 0., 1., 0., 0.]])
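
Each output row is the concatenation of one indicator block per input feature. The sketch below reproduces the two rows above in plain Python; encode_row is a hypothetical helper, and raising ValueError on an unseen value is an assumption modelled on the encoder's default behaviour.

```python
categories = [['Female', 'Male'], ['Amsterdam', 'London', 'Paris'], [1, 2, 4]]

def encode_row(row, categories):
    """Concatenate one block of 0/1 indicators per feature."""
    encoded = []
    for value, feature_categories in zip(row, categories):
        if value not in feature_categories:
            raise ValueError(f"unknown category: {value!r}")
        encoded.extend(1.0 if value == c else 0.0 for c in feature_categories)
    return encoded

first = encode_row(['Female', 'Paris', 2], categories)
second = encode_row(['Male', 'Amsterdam', 1], categories)
print(first)   # [1.0, 0.0, 0.0, 0.0, 1.0, 0.0, 1.0, 0.0]
print(second)  # [0.0, 1.0, 1.0, 0.0, 0.0, 1.0, 0.0, 0.0]
```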

apply the method inverse_transform to the OneHotEncoder instance

>>> OHE.inverse_transform([[1., 0., 0., 0., 1., 0., 1., 0.]])
array([['Female', 'Paris', 2]], dtype=object)
>>> OHE.inverse_transform([[1., 0., 0., 0., 1., 0., 1., 0.], [0., 1., 1., 0., 0., 1., 0., 0.]])
array([['Female', 'Paris', 2], ['Male', 'Amsterdam', 1]], dtype=object)
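
Going back from an encoded row to the original values amounts to finding, in each feature's block, the position of the 1. A minimal plain-Python sketch follows; decode_row is a hypothetical helper that assumes every block contains exactly one 1.

```python
categories = [['Female', 'Male'], ['Amsterdam', 'London', 'Paris'], [1, 2, 4]]

def decode_row(encoded, categories):
    """Recover the original values from a one-hot encoded row."""
    decoded = []
    start = 0
    for feature_categories in categories:
        block = encoded[start:start + len(feature_categories)]
        decoded.append(feature_categories[block.index(max(block))])
        start += len(feature_categories)
    return decoded

decoded = decode_row([1., 0., 0., 0., 1., 0., 1., 0.], categories)
print(decoded)  # ['Female', 'Paris', 2]
```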

One-Hot Encoding a Feature on a Pandas Dataframe

create a dataset

>>> import pandas as pd
>>> df = pd.DataFrame({'country': ['russia', 'germany', 'australia','korea','germany']})
>>> df
     country
0     russia
1    germany
2  australia
3      korea
4    germany

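One-Hot Encoding a Feature

Since pandas 2.0, get_dummies returns boolean columns (True/False) by default; pass dtype=int to get the 0/1 values shown here.

>>> pd.get_dummies(df, prefix=['country'], dtype=int)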
   country_australia  country_germany  country_korea  country_russia
0                  0                0              0               1
1                  0                1              0               0
2                  1                0              0               0
3                  0                0              1               0
4                  0                1              0               0

if your dataset has other columns in addition to the one you want to one-hot encode

>>> import pandas as pd
>>> df = pd.DataFrame({ 'name': ['josef','michael','john','bawool','klaus'], 'country': ['russia', 'germany', 'australia','korea','germany'] })
>>> df
      name    country
0    josef     russia
1  michael    germany
2     john  australia
3   bawool      korea
4    klaus    germany
>>> df['country']
0       russia
1      germany
2    australia
3        korea
4      germany
Name: country, dtype: object
>>> df.columns
Index(['name', 'country'], dtype='object')

One-Hot Encoding a Feature

>>> pd.get_dummies(df['country'], prefix='country', dtype=int)
   country_australia  country_germany  country_korea  country_russia
0                  0                0              0               1
1                  0                1              0               0
2                  1                0              0               0
3                  0                0              1               0
4                  0                1              0               0
>>> df = pd.concat([df, pd.get_dummies(df['country'], prefix='country', dtype=int)], axis=1)
>>> df
      name    country  country_australia  country_germany  country_korea  country_russia
0    josef     russia                  0                0              0               1
1  michael    germany                  0                1              0               0
2     john  australia                  1                0              0               0
3   bawool      korea                  0                0              1               0
4    klaus    germany                  0                1              0               0

drop the original feature column (it is no longer needed once it has been encoded)

>>> df = df.drop(columns=['country'])
>>> df
      name  country_australia  country_germany  country_korea  country_russia
0    josef                  0                0              0               1
1  michael                  0                1              0               0
2     john                  1                0              0               0
3   bawool                  0                0              1               0
4    klaus                  0                1              0               0
>>> df.columns
Index(['name', 'country_australia', 'country_germany', 'country_korea',
       'country_russia'],
      dtype='object')
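
The encode, concat, and drop steps above can also be done in a single call by passing the columns parameter of get_dummies. The sketch below assumes the same sample data; dtype=int is passed so that the result contains 0/1 values regardless of the pandas version.

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['josef', 'michael', 'john', 'bawool', 'klaus'],
    'country': ['russia', 'germany', 'australia', 'korea', 'germany'],
})

# columns= selects which columns to encode; the other columns are kept as they are,
# and the encoded column is replaced by its dummy columns.
encoded = pd.get_dummies(df, columns=['country'], dtype=int)
print(encoded.columns.tolist())
# ['name', 'country_australia', 'country_germany', 'country_korea', 'country_russia']
```

The column name is used as the prefix by default, so the resulting columns match those obtained step by step above.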