
transform non-numerical labels to numerical labels with LabelEncoder


examples of features using non-numerical labels

read the titanic dataset

>>> import pandas as pd 
>>> df = pd.read_csv("titanic.csv") 
>>> df.shape
(891, 15)
>>> df.columns
Index(['survived', 'pclass', 'sex', 'age', 'sibsp', 'parch', 'fare','embarked', 'class', 'who', 'adult_male', 'deck', 'embark_town', 'alive', 'alone'], dtype='object')

see all the data types
sex, embarked, class, who, deck, embark_town and alive use strings (object datatype)

>>> df.dtypes
survived         int64
pclass           int64
sex             object
age            float64
sibsp            int64
parch            int64
fare           float64
embarked        object
class           object
who             object
adult_male        bool
deck            object
embark_town     object
alive           object
alone             bool
dtype: object

select all columns except the ones having string datatype

>>> df.select_dtypes(exclude='object').head(3)
   survived  pclass   age  sibsp  parch     fare  adult_male  alone
0         0       3  22.0      1      0   7.2500        True  False
1         1       1  38.0      1      0  71.2833       False  False
2         1       3  26.0      0      0   7.9250       False   True

select all columns having string datatype

>>> df.select_dtypes(include='object').head(3)
      sex embarked  class    who deck  embark_town alive
0    male        S  Third    man  NaN  Southampton    no
1  female        C  First  woman    C    Cherbourg   yes
2  female        S  Third  woman  NaN  Southampton   yes

select all columns having bool datatype

>>> df.select_dtypes(include='bool').head(3)
   adult_male  alone
0        True  False
1       False  False
2       False   True

convert non-numerical labels to numerical labels with LabelEncoder

LabelEncoder is an encoder available in the scikit-learn Python library.
It converts non-numerical labels (text) to numerical labels (integers), which machine learning algorithms can work with directly.
It encodes labels with values between 0 and N-1, where N is the number of distinct labels.

let's consider the following data:

   country
0  Germany
1  Germany
2    Spain
3   France
4  Germany

LabelEncoder converts non-numerical labels to numerical labels.
The 3 countries are replaced by the numbers 0, 1 and 2, assigned in alphabetical order: France = 0, Germany = 1, Spain = 2.

   country
0        1
1        1
2        2
3        0
4        1

Example

import the preprocessing package from sklearn

>>> from sklearn import preprocessing

instantiate the class LabelEncoder

>>> le = preprocessing.LabelEncoder()

call the fit method to fit the LabelEncoder instance to a list of labels

>>> le.fit(["France", "Spain", "France", "Germany", "Germany", "Germany"])
LabelEncoder()

there are 3 classes, stored in alphabetical order in the classes_ attribute

>>> le.classes_
array(['France', 'Germany', 'Spain'], dtype='<U7')
>>> list(le.classes_)
['France', 'Germany', 'Spain']

call the transform method on the fitted LabelEncoder instance to encode labels

>>> le.transform(['Germany', 'Germany', 'Spain', 'France', 'Germany']) 
array([1, 1, 2, 0, 1])
>>> list(le.transform(['Germany', 'Germany', 'Spain', 'France', 'Germany']))
[1, 1, 2, 0, 1]
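
fit and transform can also be combined in a single call with the fit_transform method (it refits the encoder on the labels it is given)

>>> le.fit_transform(['Germany', 'Germany', 'Spain', 'France', 'Germany'])
array([1, 1, 2, 0, 1])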

call the inverse_transform method on the fitted LabelEncoder instance to map numerical labels back to the original text labels

>>> le.inverse_transform([2, 2, 1])
array(['Spain', 'Spain', 'Germany'], dtype='<U7')
>>> list(le.inverse_transform([2, 2, 1]))
['Spain', 'Spain', 'Germany']
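
as a quick sketch, the same approach can be applied to one of the string columns of the titanic DataFrame loaded above, for example sex (the new column name sex_encoded below is just an example)

>>> df['sex_encoded'] = preprocessing.LabelEncoder().fit_transform(df['sex'])
>>> df[['sex', 'sex_encoded']].head(3)
      sex  sex_encoded
0    male            1
1  female            0
2  female            0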

The problem with LabelEncoder is that it puts different numbers in the same column, which suggests an ordering that does not exist in the data.
So the algorithm might conclude that Spain (2) > Germany (1) > France (0).
Also, if the algorithm internally computes averages, the average of Spain (2) and France (0) is (2 + 0)/2 = 1, which would mean the average of Spain and France is Germany.

To overcome this problem, we use OneHotEncoder instead of LabelEncoder.
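
a minimal sketch of one-hot encoding, reusing the toy country data from above (with scikit-learn 0.20+ OneHotEncoder handles string categories directly; it expects a 2D input and returns a sparse matrix by default, hence toarray())

>>> import pandas as pd
>>> from sklearn.preprocessing import OneHotEncoder
>>> countries = pd.DataFrame({'country': ['Germany', 'Germany', 'Spain', 'France', 'Germany']})
>>> ohe = OneHotEncoder()
>>> ohe.fit_transform(countries[['country']]).toarray()
array([[0., 1., 0.],
       [0., 1., 0.],
       [0., 0., 1.],
       [1., 0., 0.],
       [0., 1., 0.]])
>>> ohe.categories_
[array(['France', 'Germany', 'Spain'], dtype=object)]

each country now gets its own 0/1 column, so no artificial ordering between the countries is implied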