Split a column into multiple columns (one-hot encoding)
- OneHotEncoder (sklearn) vs LabelEncoder (sklearn)
- OneHotEncoder (sklearn)
- One-Hot Encoding a Feature on a Pandas Dataframe
Suppose you have a country feature that can take the values Germany, France, and Spain.
Let's consider the following data:
country
0 Germany
1 Germany
2 Spain
3 France
4 Germany
LabelEncoder converts non-numerical labels to numerical labels.
The 3 countries are replaced by the numbers 0, 1, and 2.
country
0 1
1 1
2 2
3 0
4 1
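As a rough sketch of the LabelEncoder step above, the same five rows can be encoded as follows. The variable names (countries, label_encoder, codes) are illustrative and not part of the original example.

```python
from sklearn.preprocessing import LabelEncoder

# The five rows of the country feature shown above
countries = ["Germany", "Germany", "Spain", "France", "Germany"]

label_encoder = LabelEncoder()
codes = label_encoder.fit_transform(countries)

# classes_ is sorted alphabetically, so France -> 0, Germany -> 1, Spain -> 2
print(label_encoder.classes_.tolist())  # ['France', 'Germany', 'Spain']
print(codes.tolist())                   # [1, 1, 2, 0, 1]
```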
The problem with LabelEncoder is that it puts integer codes in a single column, and those codes imply an order and a distance between the categories that do not exist.
So the algorithm might conclude that Spain (2) > Germany (1) > France (0).
Also, if your algorithm internally calculates an average, say of Spain (2) and France (0), it gets (2 + 0) / 2 = 1, which would mean that the average of Spain and France is Germany.
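The averaging problem can be reproduced in a few lines of plain Python. This is only an illustration of the arithmetic; the dictionary codes and the names mean_code and decoded are made up for the sketch.

```python
# Integer codes as assigned by LabelEncoder in the example above
codes = {"France": 0, "Germany": 1, "Spain": 2}

# Average of the codes for Spain and France
mean_code = (codes["Spain"] + codes["France"]) / 2

# Look up which country (if any) owns that code
decoded = [name for name, code in codes.items() if code == mean_code]

print(mean_code)  # 1.0
print(decoded)    # ['Germany']
```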
To overcome this problem, we use OneHotEncoder instead of LabelEncoder.
OneHotEncoder takes a column and splits it into multiple columns, one per category.
Here it splits the country feature into 3 binary (0 or 1) features: Germany, Spain, and France.
Each row has a 1 in the column of its country and 0s in the other two columns.
Germany Spain France
0 1 0 0
1 1 0 0
2 0 1 0
3 0 0 1
4 1 0 0
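To make the idea concrete without any library, here is a minimal hand-written sketch that builds the same table. It is not how OneHotEncoder is implemented; the column order is fixed by hand to match the table above, and the names countries, columns and one_hot are illustrative.

```python
# The five rows of the country feature
countries = ["Germany", "Germany", "Spain", "France", "Germany"]

# Column order chosen to match the table above
columns = ["Germany", "Spain", "France"]

# One row per sample, one 0/1 entry per column
one_hot = [
    [1 if country == column else 0 for column in columns]
    for country in countries
]

for row in one_hot:
    print(row)
# [1, 0, 0]
# [1, 0, 0]
# [0, 1, 0]
# [0, 0, 1]
# [1, 0, 0]
```

Every row sums to 1, and no column is numerically "between" two others, which is what removes the ordering problem of the integer codes.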
import the class OneHotEncoder
>>> from sklearn.preprocessing import OneHotEncoder
instantiate the class OneHotEncoder
>>> OHE = OneHotEncoder()
>>> X = [['Male', 'Paris', 1], ['Female', 'Paris', 4], ['Female', 'Amsterdam', 4], ['Female', 'Paris', 2], ['Female', 'London', 2]]
>>> X[0]
['Male', 'Paris', 1]
apply the fit method to the OneHotEncoder instance
>>> OHE.fit(X)
The categories found during fit are stored in the categories_ attribute (one array per input column)
>>> OHE.categories_
[array(['Female', 'Male'], dtype=object), array(['Amsterdam', 'London', 'Paris'], dtype=object), array([1, 2, 4], dtype=object)]
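return the feature names (get_feature_names was removed in scikit-learn 1.2; use get_feature_names_out, available since 1.0)
>>> OHE.get_feature_names_out()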
array(['x0_Female', 'x0_Male', 'x1_Amsterdam', 'x1_London', 'x1_Paris', 'x2_1', 'x2_2', 'x2_4'], dtype=object)
apply the method transform to the OneHotEncoder instance
>>> OHE.transform([['Female', 'Paris', 2], ['Male', 'Amsterdam', 1]]).toarray()
array([[1., 0., 0., 0., 1., 0., 1., 0.], [0., 1., 1., 0., 0., 1., 0., 0.]])
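The call to toarray() is needed because, with default settings, transform returns a SciPy sparse matrix rather than a NumPy array. The sketch below checks this on the same data; it assumes a reasonably recent scikit-learn, and the names encoder, encoded and dense are illustrative.

```python
from scipy import sparse
from sklearn.preprocessing import OneHotEncoder

X = [['Male', 'Paris', 1], ['Female', 'Paris', 4], ['Female', 'Amsterdam', 4],
     ['Female', 'Paris', 2], ['Female', 'London', 2]]

encoder = OneHotEncoder().fit(X)

# With default settings the result is a sparse matrix
encoded = encoder.transform([['Female', 'Paris', 2]])
print(sparse.issparse(encoded))  # True

# toarray() converts it to a dense NumPy array: 1 row, 2 + 3 + 3 = 8 columns
dense = encoded.toarray()
print(dense.shape)  # (1, 8)
```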
apply the method inverse_transform to the OneHotEncoder instance
>>> OHE.inverse_transform([[1., 0., 0., 0., 1., 0., 1., 0.]])
array([['Female', 'Paris', 2]], dtype=object)
>>> OHE.inverse_transform([[1., 0., 0., 0., 1., 0., 1., 0.], [0., 1., 1., 0., 0., 1., 0., 0.]])
array([['Female', 'Paris', 2], ['Male', 'Amsterdam', 1]], dtype=object)
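As a quick sanity check, transform followed by inverse_transform should give back the original rows. A minimal round-trip sketch on the same data (the names encoder, rows and restored are illustrative):

```python
from sklearn.preprocessing import OneHotEncoder

X = [['Male', 'Paris', 1], ['Female', 'Paris', 4], ['Female', 'Amsterdam', 4],
     ['Female', 'Paris', 2], ['Female', 'London', 2]]

encoder = OneHotEncoder().fit(X)

rows = [['Female', 'Paris', 2], ['Male', 'Amsterdam', 1]]

# Encode, then decode again
encoded = encoder.transform(rows).toarray()
restored = encoder.inverse_transform(encoded)

# The decoded rows match the input rows
print(restored.tolist() == rows)  # True
```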
create a dataset
>>> import pandas as pd
>>> df = pd.DataFrame({'country': ['russia', 'germany', 'australia','korea','germany']})
>>> df
country
0 russia
1 germany
2 australia
3 korea
4 germany
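One-Hot Encoding a Feature (dtype=int keeps the 0/1 output on pandas 2.0 and later, where the default is True/False)
>>> pd.get_dummies(df, prefix=['country'], dtype=int)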
country_australia country_germany country_korea country_russia
0 0 0 0 1
1 0 1 0 0
2 1 0 0 0
3 0 0 1 0
4 0 1 0 0
If your dataset has other columns in addition to the one you want to one-hot encode:
>>> import pandas as pd
>>> df = pd.DataFrame({ 'name': ['josef','michael','john','bawool','klaus'], 'country': ['russia', 'germany', 'australia','korea','germany'] })
>>> df
name country
0 josef russia
1 michael germany
2 john australia
3 bawool korea
4 klaus germany
>>> df['country']
0 russia
1 germany
2 australia
3 korea
4 germany
Name: country, dtype: object
>>> df.columns
Index(['name', 'country'], dtype='object')
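One-Hot Encoding a Feature (only the country column; dtype=int keeps the 0/1 output on pandas 2.0 and later, where the default is True/False)
>>> pd.get_dummies(df['country'], prefix='country', dtype=int)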
country_australia country_germany country_korea country_russia
0 0 0 0 1
1 0 1 0 0
2 1 0 0 0
3 0 0 1 0
4 0 1 0 0
append the encoded columns to the original dataframe
>>> df = pd.concat([df, pd.get_dummies(df['country'], prefix='country', dtype=int)], axis=1)
>>> df
name country country_australia country_germany country_korea country_russia
0 josef russia 0 0 0 1
1 michael germany 0 1 0 0
2 john australia 1 0 0 0
3 bawool korea 0 0 1 0
4 klaus germany 0 1 0 0
drop the original feature column (you don't need this feature anymore)
>>> df = df.drop(columns=['country'])
>>> df
name country_australia country_germany country_korea country_russia
0 josef 0 0 0 1
1 michael 0 1 0 0
2 john 1 0 0 0
3 bawool 0 0 1 0
4 klaus 0 1 0 0
>>> df.columns
Index(['name', 'country_australia', 'country_germany', 'country_korea',
'country_russia'],
dtype='object')
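The encode, concat and drop steps can also be done in a single call by passing the columns parameter to pd.get_dummies. A minimal sketch on the same data (the variable name encoded is illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['josef', 'michael', 'john', 'bawool', 'klaus'],
    'country': ['russia', 'germany', 'australia', 'korea', 'germany'],
})

# Only the listed columns are encoded; they are replaced by their dummy
# columns, and the column name is used as the prefix
encoded = pd.get_dummies(df, columns=['country'], dtype=int)

print(list(encoded.columns))
# ['name', 'country_australia', 'country_germany', 'country_korea', 'country_russia']
```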