transform non-numerical labels to numerical labels with LabelEncoder
- examples of features that use non-numerical labels
- converting non-numerical labels to numerical labels with LabelEncoder
read the Titanic dataset
>>> import pandas as pd
>>> df = pd.read_csv("titanic.csv")
>>> df.shape
(891, 15)
>>> df.columns
Index(['survived', 'pclass', 'sex', 'age', 'sibsp', 'parch', 'fare','embarked', 'class', 'who', 'adult_male', 'deck', 'embark_town', 'alive', 'alone'], dtype='object')
see all the data types
sex, embarked, class, who, deck, embark_town and alive use strings (object datatype)
>>> df.dtypes
survived int64
pclass int64
sex object
age float64
sibsp int64
parch int64
fare float64
embarked object
class object
who object
adult_male bool
deck object
embark_town object
alive object
alone bool
dtype: object
select all columns except the ones having string datatype
>>> df.select_dtypes(exclude='object').head(3)
survived pclass age sibsp parch fare adult_male alone
0 0 3 22.0 1 0 7.2500 True False
1 1 1 38.0 1 0 71.2833 False False
2 1 3 26.0 0 0 7.9250 False True
select all columns having string datatype
>>> df.select_dtypes(include='object').head(3)
sex embarked class who deck embark_town alive
0 male S Third man NaN Southampton no
1 female C First woman C Cherbourg yes
2 female S Third woman NaN Southampton yes
select all columns having bool datatype
>>> df.select_dtypes(include='bool').head(3)
adult_male alone
0 True False
1 False False
2 False True
LabelEncoder is an encoder available in the scikit-learn Python library.
It converts non-numerical labels (text) to numerical labels (integers), which machine learning algorithms can better understand.
It encodes labels with values between 0 and N-1, where N is the number of distinct labels.
let's consider the following data:
country
0 Germany
1 Germany
2 Spain
3 France
4 Germany
LabelEncoder converts non-numerical labels to numerical labels.
The 3 countries are replaced by the numbers 0, 1, and 2.
country
0 1
1 1
2 2
3 0
4 1
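The mapping above can be sketched in plain Python (a conceptual sketch of what LabelEncoder does internally, not the scikit-learn implementation): sort the distinct labels alphabetically, number them from 0, then replace each label with its number.

```python
countries = ["Germany", "Germany", "Spain", "France", "Germany"]

# distinct labels in sorted order -- this is what le.classes_ would hold
classes = sorted(set(countries))        # ['France', 'Germany', 'Spain']

# assign each label its position: France -> 0, Germany -> 1, Spain -> 2
mapping = {label: code for code, label in enumerate(classes)}

encoded = [mapping[c] for c in countries]
print(encoded)  # [1, 1, 2, 0, 1]
```

Note that the codes follow alphabetical order of the labels, not the order in which they appear in the data.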
Example
import the preprocessing package from sklearn
>>> from sklearn import preprocessing
instantiate the LabelEncoder class
>>> le = preprocessing.LabelEncoder()
apply the fit method to fit the LabelEncoder instance to the labels
>>> le.fit(["France", "Spain", "France", "Germany", "Germany", "Germany"])
LabelEncoder()
there are 3 classes
>>> le.classes_
array(['France', 'Germany', 'Spain'], dtype='<U7')
>>> list(le.classes_)
['France', 'Germany', 'Spain']
apply the transform method to the fitted LabelEncoder instance. This will encode the labels
>>> le.transform(['Germany', 'Germany', 'Spain', 'France', 'Germany'])
array([1, 1, 2, 0, 1])
>>> list(le.transform(['Germany', 'Germany', 'Spain', 'France', 'Germany']))
[1, 1, 2, 0, 1]
apply the inverse_transform method to the fitted LabelEncoder instance to recover the original labels from the encoded values
>>> le.inverse_transform([2, 2, 1])
array(['Spain', 'Spain', 'Germany'], dtype='<U7')
>>> list(le.inverse_transform([2, 2, 1]))
['Spain', 'Spain', 'Germany']
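Since transform and inverse_transform are exact inverses given the fitted classes, the decoding step amounts to a simple list lookup (a conceptual sketch, not the scikit-learn implementation):

```python
# the contents a fitted encoder's classes_ attribute would hold
classes = ['France', 'Germany', 'Spain']

codes = [2, 2, 1]
# each code is just an index into the classes list
decoded = [classes[i] for i in codes]
print(decoded)  # ['Spain', 'Spain', 'Germany']
```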
The problem with LabelEncoder is that it puts different numbers in the same column, which implies an ordering that does not exist.
An algorithm might conclude that Spain (2) > Germany (1) > France (0).
Also, if the algorithm internally computes averages, the mean of Spain (2) and France (0) is (2 + 0) / 2 = 1, which would suggest that the average of Spain and France is Germany.
To overcome this problem, we use OneHotEncoder instead of LabelEncoder.
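As a sketch of the idea (pure Python, not the scikit-learn OneHotEncoder API), one-hot encoding replaces each label with a binary vector that has one column per class. No ordering or averaging relationship between countries is implied, because each country gets its own column:

```python
classes = ['France', 'Germany', 'Spain']

def one_hot(label):
    # 1 in the column of the matching class, 0 everywhere else
    return [1 if c == label else 0 for c in classes]

for country in ['Germany', 'Germany', 'Spain', 'France', 'Germany']:
    print(country, one_hot(country))
# Germany [0, 1, 0]
# Germany [0, 1, 0]
# Spain   [0, 0, 1]
# France  [1, 0, 0]
# Germany [0, 1, 0]
```

The trade-off is one extra column per distinct label, which matters for high-cardinality features such as deck or embark_town in the Titanic data.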