split a column into multiple columns (One Hot Encode)

OneHotEncoder (sklearn) vs LabelEncoder (sklearn)

Suppose you have a country feature which can take the values Germany, France, and Spain.

Let's consider the following data:

   country
0  Germany
1  Germany
2    Spain
3   France
4  Germany

LabelEncoder converts non-numerical labels to numerical labels.
The 3 countries are replaced by the numbers 0, 1, and 2, assigned in alphabetical order (France = 0, Germany = 1, Spain = 2).

The problem with LabelEncoder is that the categories become integers in a single column, which implies an order and a distance that do not exist.
An algorithm might conclude that Spain (2) > Germany (1) > France (0).
Also, if your algorithm internally calculates an average, say for Spain (2) and France (0), it gets (2 + 0) / 2 = 1, which would mean that the average of Spain and France is Germany (1).
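
A minimal sketch of this problem, assuming scikit-learn is installed and reusing the five-row country data from above. The variable names are only illustrative.

```python
from sklearn.preprocessing import LabelEncoder

countries = ['Germany', 'Germany', 'Spain', 'France', 'Germany']

le = LabelEncoder()
codes = le.fit_transform(countries)

# classes_ holds the labels in sorted order; the position is the code.
print(list(le.classes_))   # ['France', 'Germany', 'Spain']
print(codes.tolist())      # [1, 1, 2, 0, 1]

# The misleading arithmetic: the "average" of Spain and France is Germany.
spain_code, france_code = le.transform(['Spain', 'France'])
average = (spain_code + france_code) / 2
print(average)                                   # 1.0
print(le.inverse_transform([int(average)])[0])   # Germany
```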

To overcome this problem, we use OneHotEncoder instead of LabelEncoder.

OneHotEncoder takes a column and splits it into multiple columns.
Here it splits the feature country into 3 features (Germany, France, and Spain), which are all binary (0 or 1).

We get 3 columns (one per country) with 1s and 0s:

   Germany  Spain  France
0        1      0       0
1        1      0       0
2        0      1       0
3        0      0       1
4        1      0       0
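
To make the idea concrete, here is a hand-written sketch in plain Python that rebuilds the table above without any library. It only illustrates the principle and is not how OneHotEncoder is implemented. The column order is fixed by hand to match the table.

```python
countries = ['Germany', 'Germany', 'Spain', 'France', 'Germany']
columns = ['Germany', 'Spain', 'France']   # same order as the table above

# One row per sample: 1 in the column of its country, 0 everywhere else.
one_hot = [[1 if country == column else 0 for column in columns]
           for country in countries]

for index, row in enumerate(one_hot):
    print(index, row)
```

Every row contains exactly one 1, so no category is "larger" than another.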

OneHotEncoder (sklearn)

import the class OneHotEncoder

>>> from sklearn.preprocessing import OneHotEncoder

instantiate the class OneHotEncoder

>>> OHE = OneHotEncoder()

>>> X = [['Male', 'Paris', 1], ['Female', 'Paris', 4], ['Female', 'Amsterdam', 4], ['Female', 'Paris', 2], ['Female', 'London', 2]]
>>> X[0]
['Male', 'Paris', 1]

apply the fit method to the OneHotEncoder instance

>>> OHE.fit(X)

The used categories can be found in the categories_ attribute

>>> OHE.categories_
[array(['Female', 'Male'], dtype=object), array(['Amsterdam', 'London', 'Paris'], dtype=object), array([1, 2, 4], dtype=object)]

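return the feature names (get_feature_names was deprecated in scikit-learn 1.0 and removed in 1.2; current versions use get_feature_names_out)

>>> OHE.get_feature_names_out()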
array(['x0_Female', 'x0_Male', 'x1_Amsterdam', 'x1_London', 'x1_Paris', 'x2_1', 'x2_2', 'x2_4'], dtype=object)

apply the method transform to the OneHotEncoder instance

>>> OHE.transform([['Female', 'Paris', 2], ['Male', 'Amsterdam', 1]]).toarray()
array([[1., 0., 0., 0., 1., 0., 1., 0.], [0., 1., 1., 0., 0., 1., 0., 0.]])

apply the method inverse_transform to the OneHotEncoder instance

>>> OHE.inverse_transform([[1., 0., 0., 0., 1., 0., 1., 0.]])
array([['Female', 'Paris', 2]], dtype=object)
>>> OHE.inverse_transform([[1., 0., 0., 0., 1., 0., 1., 0.], [0., 1., 1., 0., 0., 1., 0., 0.]])
array([['Female', 'Paris', 2], ['Male', 'Amsterdam', 1]], dtype=object)

One-Hot Encoding a Feature on a Pandas Dataframe

create a dataset

>>> import pandas as pd
>>> df = pd.DataFrame({'country': ['russia', 'germany', 'australia','korea','germany']})
>>> df
     country
0     russia
1    germany
2  australia
3      korea
4    germany

One-Hot Encoding a Feature

>>> pd.get_dummies(df,prefix=['country'])
   country_australia  country_germany  country_korea  country_russia
0                  0                0              0               1
1                  0                1              0               0
2                  1                0              0               0
3                  0                0              1               0
4                  0                1              0               0
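
Note that from pandas 2.0 on, get_dummies returns boolean columns (True/False) by default. A sketch of how to get the 0/1 integers shown above on any recent version, by passing dtype=int:

```python
import pandas as pd

df = pd.DataFrame({'country': ['russia', 'germany', 'australia', 'korea', 'germany']})

# dtype=int forces 0/1 integers instead of the boolean default of pandas >= 2.0.
encoded = pd.get_dummies(df, prefix=['country'], dtype=int)
print(encoded)
```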

if you have other columns (in addition to the column you want to one-hot encode) in your dataset

>>> import pandas as pd
>>> df = pd.DataFrame({ 'name': ['josef','michael','john','bawool','klaus'], 'country': ['russia', 'germany', 'australia','korea','germany'] })
>>> df
      name    country
0    josef     russia
1  michael    germany
2     john  australia
3   bawool      korea
4    klaus    germany
>>> df['country']
0       russia
1      germany
2    australia
3        korea
4      germany
Name: country, dtype: object
>>> df.columns
Index(['name', 'country'], dtype='object')

One-Hot Encoding a Feature

>>> pd.get_dummies(df['country'], prefix='country')
   country_australia  country_germany  country_korea  country_russia
0                  0                0              0               1
1                  0                1              0               0
2                  1                0              0               0
3                  0                0              1               0
4                  0                1              0               0
>>> df = pd.concat([df,pd.get_dummies(df['country'], prefix='country')],axis=1)
>>> df
      name    country  country_australia  country_germany  country_korea  country_russia
0    josef     russia                  0                0              0               1
1  michael    germany                  0                1              0               0
2     john  australia                  1                0              0               0
3   bawool      korea                  0                0              1               0
4    klaus    germany                  0                1              0               0

drop the original feature column (you don't need this feature anymore)

>>> df = df.drop(columns=['country'])
>>> df
      name  country_australia  country_germany  country_korea  country_russia
0    josef                  0                0              0               1
1  michael                  0                1              0               0
2     john                  1                0              0               0
3   bawool                  0                0              1               0
4    klaus                  0                1              0               0
>>> df.columns
Index(['name', 'country_australia', 'country_germany', 'country_korea',
       'country_russia'],
      dtype='object')
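
As an alternative sketch, the encode, concat, and drop steps can be done in one call with the columns parameter of get_dummies. It encodes only the listed columns, uses the column name as prefix, and removes the original column. dtype=int is passed here so the result contains 0/1 rather than booleans on pandas 2.0 or later.

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['josef', 'michael', 'john', 'bawool', 'klaus'],
    'country': ['russia', 'germany', 'australia', 'korea', 'germany'],
})

# Encode only 'country'; 'name' is passed through untouched.
encoded = pd.get_dummies(df, columns=['country'], dtype=int)
print(encoded)
print(encoded.columns)
```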
