
transform non-numerical labels to numerical labels with LabelEncoder


examples of features using non-numerical labels

read the titanic dataset

>>> import pandas as pd 
>>> df = pd.read_csv("titanic.csv") 
>>> df.shape
(891, 15)
>>> df.columns
Index(['survived', 'pclass', 'sex', 'age', 'sibsp', 'parch', 'fare','embarked', 'class', 'who', 'adult_male', 'deck', 'embark_town', 'alive', 'alone'], dtype='object')

see all the data types
sex, embarked, class, who, deck, embark_town and alive use strings (object datatype)

>>> df.dtypes
survived         int64
pclass           int64
sex             object
age            float64
sibsp            int64
parch            int64
fare           float64
embarked        object
class           object
who             object
adult_male        bool
deck            object
embark_town     object
alive           object
alone             bool
dtype: object

select all columns except the ones having string datatype

>>> df.select_dtypes(exclude='object').head(3)
   survived  pclass   age  sibsp  parch     fare  adult_male  alone
0         0       3  22.0      1      0   7.2500        True  False
1         1       1  38.0      1      0  71.2833       False  False
2         1       3  26.0      0      0   7.9250       False   True

select all columns having string datatype

>>> df.select_dtypes(include='object').head(3)
      sex embarked  class    who deck  embark_town alive
0    male        S  Third    man  NaN  Southampton    no
1  female        C  First  woman    C    Cherbourg   yes
2  female        S  Third  woman  NaN  Southampton   yes

select all columns having bool datatype

>>> df.select_dtypes(include='bool').head(3)
   adult_male  alone
0        True  False
1       False  False
2       False   True

convert non-numerical labels to numerical labels with LabelEncoder

LabelEncoder is an encoder available in the scikit-learn Python library.
It converts non-numerical labels (text) to numerical labels (integers), which machine learning algorithms can work with directly.
It encodes labels with values between 0 and N-1, where N is the number of distinct labels.

let's consider the following data:

   country
0  Germany
1  Germany
2    Spain
3   France
4  Germany

LabelEncoder converts non-numerical labels to numerical labels.
The 3 countries are replaced by the numbers 0, 1 and 2, assigned in alphabetical order: France = 0, Germany = 1, Spain = 2.

   country
0        1
1        1
2        2
3        0
4        1

Example

import the preprocessing package from sklearn

>>> from sklearn import preprocessing

instantiate the class LabelEncoder

>>> le = preprocessing.LabelEncoder()

call the fit method to fit the LabelEncoder instance to a list of labels

>>> le.fit(["France", "Spain", "France", "Germany", "Germany", "Germany"])
LabelEncoder()

there are 3 classes, stored in alphabetical order in the classes_ attribute

>>> le.classes_
array(['France', 'Germany', 'Spain'], dtype='<U7')
>>> list(le.classes_)
['France', 'Germany', 'Spain']

call the transform method on the fitted LabelEncoder instance to encode labels

>>> le.transform(['Germany', 'Germany', 'Spain', 'France', 'Germany']) 
array([1, 1, 2, 0, 1])
>>> list(le.transform(['Germany', 'Germany', 'Spain', 'France', 'Germany']))
[1, 1, 2, 0, 1]
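
fit and transform can also be combined in a single call with the fit_transform method (it refits the encoder on the labels it is given)

>>> le.fit_transform(['Germany', 'Germany', 'Spain', 'France', 'Germany'])
array([1, 1, 2, 0, 1])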

call the inverse_transform method on the fitted LabelEncoder instance to map numerical labels back to the original text labels

>>> le.inverse_transform([2, 2, 1])
array(['Spain', 'Spain', 'Germany'], dtype='<U7')
>>> list(le.inverse_transform([2, 2, 1]))
['Spain', 'Spain', 'Germany']
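
as a quick sketch, the same approach can be applied to one of the string columns of the titanic DataFrame loaded above, for example sex (the new column name sex_encoded below is just an example)

>>> df['sex_encoded'] = preprocessing.LabelEncoder().fit_transform(df['sex'])
>>> df[['sex', 'sex_encoded']].head(3)
      sex  sex_encoded
0    male            1
1  female            0
2  female            0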

The problem with LabelEncoder is that it puts different numbers in the same column, which suggests an ordering that does not exist in the data.
So the algorithm might conclude that Spain (2) > Germany (1) > France (0).
Also, if the algorithm internally computes averages, the average of Spain (2) and France (0) is (2 + 0)/2 = 1, which would mean the average of Spain and France is Germany.

To overcome this problem, we use OneHotEncoder instead of LabelEncoder.
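
a minimal sketch of one-hot encoding, reusing the toy country data from above (with scikit-learn 0.20+ OneHotEncoder handles string categories directly; it expects a 2D input and returns a sparse matrix by default, hence toarray())

>>> import pandas as pd
>>> from sklearn.preprocessing import OneHotEncoder
>>> countries = pd.DataFrame({'country': ['Germany', 'Germany', 'Spain', 'France', 'Germany']})
>>> ohe = OneHotEncoder()
>>> ohe.fit_transform(countries[['country']]).toarray()
array([[0., 1., 0.],
       [0., 1., 0.],
       [0., 0., 1.],
       [1., 0., 0.],
       [0., 1., 0.]])
>>> ohe.categories_
[array(['France', 'Germany', 'Spain'], dtype=object)]

each country now gets its own 0/1 column, so no artificial ordering between the countries is implied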