### Feature provided by Scikit-Learn

# Label Encoding
Label encoding converts the categorical values in a column into integers. It works best on hierarchical (ordinal) data, where the categories have a natural order that the integers can preserve.
It still runs on non-hierarchical data, but the integers then imply an order and distances between categories that do not exist. Models that treat the values as magnitudes, such as linear or distance-based models, can lose accuracy as a result.
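
One caveat about ordinal data: `LabelEncoder` assigns integers in sorted (alphabetical) order, which is not necessarily the real order of the categories, and scikit-learn documents it as intended for target labels rather than input features. The sketch below uses a hypothetical `size` column (not part of the penguins data) to compare it with `OrdinalEncoder`, which accepts an explicit category order.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

# Hypothetical ordinal feature, used only for illustration.
sizes = pd.DataFrame({"size": ["small", "large", "medium", "small"]})

# LabelEncoder sorts the classes alphabetically: large=0, medium=1, small=2.
alphabetical = LabelEncoder().fit_transform(sizes["size"])

# OrdinalEncoder can be given the intended order: small=0, medium=1, large=2.
encoder = OrdinalEncoder(categories=[["small", "medium", "large"]])
ordered = encoder.fit_transform(sizes[["size"]]).ravel()

print(alphabetical)  # [2 0 1 2]
print(ordered)       # [0. 2. 1. 0.]
```

With the alphabetical codes, "large" ends up below "small", so the integers no longer reflect the hierarchy.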

In [1]:
import pandas as pd
df = pd.read_csv('penguins.csv')

In [2]:
df.head()

Unnamed: 0,species,island,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,sex
0,Adelie,Torgersen,39.1,18.7,181.0,3750.0,MALE
1,Adelie,Torgersen,39.5,17.4,186.0,3800.0,FEMALE
2,Adelie,Torgersen,40.3,18.0,195.0,3250.0,FEMALE
3,Adelie,Torgersen,,,,,
4,Adelie,Torgersen,36.7,19.3,193.0,3450.0,FEMALE


The penguins dataset has three categorical columns: species, island, and sex.

In [4]:
df["island"].unique()

array(['Torgersen', 'Biscoe', 'Dream'], dtype=object)

In [5]:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
le.fit_transform(df["island"])

array([2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,

In [6]:
df["island_label"] = le.fit_transform(df["island"])

In [7]:
df["island_label"].value_counts()

0    168
1    124
2     52
Name: island_label, dtype: int64

In [8]:
df.tail()

Unnamed: 0,species,island,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,sex,island_label
339,Gentoo,Biscoe,,,,,,0
340,Gentoo,Biscoe,46.8,14.3,215.0,4850.0,FEMALE,0
341,Gentoo,Biscoe,50.4,15.7,222.0,5750.0,MALE,0
342,Gentoo,Biscoe,45.2,14.8,212.0,5200.0,FEMALE,0
343,Gentoo,Biscoe,49.9,16.1,213.0,5400.0,MALE,0
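
To check which integer stands for which island, the fitted encoder can be inspected. This is a minimal sketch on a small stand-in series rather than the full `df["island"]` column; `classes_` and `inverse_transform` are part of `LabelEncoder`.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Small stand-in for df["island"]; the real column has 344 rows.
islands = pd.Series(["Torgersen", "Biscoe", "Dream", "Biscoe"])

le = LabelEncoder()
codes = le.fit_transform(islands)

# classes_ holds the categories in sorted order; position i is the code i.
mapping = {str(name): code for code, name in enumerate(le.classes_)}
print(mapping)  # {'Biscoe': 0, 'Dream': 1, 'Torgersen': 2}

# inverse_transform recovers the original strings from the codes.
restored = le.inverse_transform(codes)
print(list(restored))
```

This mapping matches the counts above: 0 is Biscoe, 1 is Dream, and 2 is Torgersen.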


# One Hot Encoding
One-hot encoding creates one new binary (dummy) column for each category: a row gets a 1 in the column of its own category and a 0 in all the others. As said before, label encoding works best with hierarchical (ordinal) data; one-hot encoding implies no order between categories, so it works best with non-hierarchical data.

In [10]:
add_columns = pd.get_dummies(df["island"])
add_columns

Unnamed: 0,Biscoe,Dream,Torgersen
0,0,0,1
1,0,0,1
2,0,0,1
3,0,0,1
4,0,0,1
...,...,...,...
339,1,0,0
340,1,0,0
341,1,0,0
342,1,0,0
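
Two optional arguments of `pd.get_dummies` may be useful here. The sketch below, on a small stand-in series, uses `prefix` to label each dummy column with its source column and `dtype=int` to force 0/1 values, since pandas 2.x returns `True`/`False` by default.

```python
import pandas as pd

# Small stand-in for df["island"].
islands = pd.Series(["Torgersen", "Biscoe", "Dream"], name="island")

# prefix names the dummy columns after the source column;
# dtype=int yields 0/1 instead of booleans.
dummies = pd.get_dummies(islands, prefix="island", dtype=int)

print(dummies.columns.tolist())
# ['island_Biscoe', 'island_Dream', 'island_Torgersen']
```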


In [11]:
df = df.join(add_columns)
df.drop(["island"], axis=1, inplace=True)
df.head()

Unnamed: 0,species,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,sex,island_label,Biscoe,Dream,Torgersen
0,Adelie,39.1,18.7,181.0,3750.0,MALE,2,0,0,1
1,Adelie,39.5,17.4,186.0,3800.0,FEMALE,2,0,0,1
2,Adelie,40.3,18.0,195.0,3250.0,FEMALE,2,0,0,1
3,Adelie,,,,,,2,0,0,1
4,Adelie,36.7,19.3,193.0,3450.0,FEMALE,2,0,0,1


Use label encoding when your data has ordinal features, where it can preserve the order and help accuracy. Also consider it when a feature has very many distinct categories, because one-hot encoding then creates a dummy column for every category and can consume a lot of memory.

If you have a big dataset with many distinct categorical labels, label encoding is usually the more economical choice: one-hot encoding adds one new column per distinct category, while label encoding only replaces each value with an integer in a single column.
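
The difference in size can be sketched with a hypothetical high-cardinality column (the `city` values below are invented for illustration). One-hot encoding produces one column per distinct value, while integer codes stay in a single column; `cat.codes` is used here as a pandas-only way to get such codes.

```python
import pandas as pd

# Hypothetical column with 1000 distinct values.
cities = pd.Series([f"city_{i}" for i in range(1000)], name="city")

one_hot = pd.get_dummies(cities)             # one column per distinct value
codes = cities.astype("category").cat.codes  # one integer code per row

print(one_hot.shape)  # (1000, 1000)
print(codes.shape)    # (1000,)
```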

If your categories are hierarchical (ordered), use label encoding. If they are non-hierarchical, use one-hot encoding instead.

In [12]:
df.shape

(344, 10)