# Categorical Variables
- Categorical variables are an essential part of most datasets. They assign each observation to a group, so observations in the same group share the same label.


## Why encoding is important

- In real life, categorical (qualitative) data often arrives as strings. We have to convert these categories into numbers so that machine learning algorithms can work with them.
- However, we must be careful with nominal data, because these categories are unordered. If we convert them to integers such as 0, 1, 2, the model may treat them as ordered, which can hurt its performance.
- Instead, values can be encoded by creating additional binary features, one per category, each indicating whether that category is present or not.
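The difference between integer codes and binary indicator features can be sketched on a small made-up column. The `toy` frame and its island names are assumptions for illustration only and are not read from the penguin file.

```python
import pandas as pd

# Hypothetical toy data: three unordered island names
toy = pd.DataFrame({"island": ["Biscoe", "Dream", "Torgersen", "Dream"]})

# Integer codes impose an artificial order (Biscoe < Dream < Torgersen)
toy["island_code"] = toy["island"].astype("category").cat.codes

# Indicator columns: one binary feature per category, no implied order
indicators = pd.get_dummies(toy["island"], prefix="island", dtype=int)

print(toy)
print(indicators)
```

Each row of `indicators` contains exactly one 1, which marks the category that row belongs to.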

In [None]:
# Importing libraries
import numpy as np
import pandas as pd
import sklearn
import category_encoders as ce

In [None]:
penguin_raw = pd.read_csv('penguins_lter_manipulated.csv') # reading the data set

In [None]:
penguin = penguin_raw.copy() # creating a copy to do modification

In [None]:
penguin.head()

Unnamed: 0,studyName,Sample Number,Species,Region,Island,Stage,Individual ID,Clutch Completion,Date Egg,Culmen Length (mm),Culmen Depth (mm),Flipper Length (mm),Body Mass (g),Sex,Delta 15 N (o/oo),Delta 13 C (o/oo),Comments
0,PAL0708,1,Adelie Penguin (Pygoscelis adeliae),Anvers,Torgersen,"Adult, 1 Egg Stage",N1A1,Yes,11-11-2007,,,181.0,3750,MALE,,,Not enough blood for isotopes.
1,PAL0708,2,Adelie Penguin (Pygoscelis adeliae),Anvers,Torgersen,"Adult, 1 Egg Stage",N1A2,Yes,11-11-2007,39.5,17.4,186.0,3800,FEMALE,8.94956,-24.69454,
2,PAL0708,3,Adelie Penguin (Pygoscelis adeliae),Anvers,Torgersen,"Adult, 1 Egg Stage",N2A1,Yes,11/16/07,40.3,18.0,195.0,3250,FEMALE,8.36821,-25.33302,
3,PAL0708,4,Adelie Penguin (Pygoscelis adeliae),Anvers,Torgersen,"Adult, 1 Egg Stage",N2A2,Yes,11/16/07,,,,.,,,,Adult not sampled.
4,PAL0708,5,Adelie Penguin (Pygoscelis adeliae),Anvers,Torgersen,"Adult, 1 Egg Stage",N3A1,Yes,11/16/07,36.7,19.3,193.0,3450,FEMALE,8.76651,-25.32426,


In [None]:
# Dropping certain columns as they are not important
penguin = penguin.drop(['studyName', 'Island', 'Sample Number', 'Stage', 'Region', 'Date Egg', 'Individual ID', 'Delta 15 N (o/oo)', 'Delta 13 C (o/oo)', 'Comments'], axis = 1)

In [None]:
penguin.head()

Unnamed: 0,Species,Clutch Completion,Culmen Length (mm),Culmen Depth (mm),Flipper Length (mm),Body Mass (g),Sex
0,Adelie Penguin (Pygoscelis adeliae),Yes,,,181.0,3750,MALE
1,Adelie Penguin (Pygoscelis adeliae),Yes,39.5,17.4,186.0,3800,FEMALE
2,Adelie Penguin (Pygoscelis adeliae),Yes,40.3,18.0,195.0,3250,FEMALE
3,Adelie Penguin (Pygoscelis adeliae),Yes,,,,.,
4,Adelie Penguin (Pygoscelis adeliae),Yes,36.7,19.3,193.0,3450,FEMALE


In [None]:
penguin.nunique() # nunique() counts the distinct values in each column

Species                  3
Clutch Completion        2
Culmen Length (mm)     144
Culmen Depth (mm)       74
Flipper Length (mm)     55
Body Mass (g)           96
Sex                      3
dtype: int64

- There are many approaches; here we will discuss a few commonly used ones:
    - **Binary encoding**
    - **One-hot encoding**
    - **Dummy encoding**

### Binary encoding 

In [None]:
# Encode only the Species column
ce_be = ce.BinaryEncoder(cols=['Species'])

# Fit the encoder and transform the data
data_binary = ce_be.fit_transform(penguin["Species"])
data_binary

Unnamed: 0,Species_0,Species_1
0,0,1
1,0,1
2,0,1
3,0,1
4,0,1
...,...,...
339,1,1
340,1,1
341,1,1
342,1,1


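- Here we had three species, but binary encoding produced only 2 columns: each category is first mapped to an integer, which is then written in binary digits.
- With one-hot style encodings, by contrast, one column can be removed, because the remaining columns still identify every category.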
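To see where the two columns come from, the two steps can be reproduced by hand with pandas and NumPy. This is a minimal sketch on a made-up `species` series. It assumes the encoder numbers categories from 1 in order of first appearance, which is consistent with the output above but is not taken from the library's source.

```python
import numpy as np
import pandas as pd

# Hypothetical toy column with three categories
species = pd.Series(["Adelie", "Chinstrap", "Gentoo", "Adelie"], name="Species")

# Step 1: map each category to an integer, starting at 1 (assumed numbering)
codes, uniques = pd.factorize(species)
ordinal = codes + 1  # Adelie -> 1, Chinstrap -> 2, Gentoo -> 3

# Step 2: write each integer in binary, one column per bit
n_bits = int(np.ceil(np.log2(len(uniques) + 1)))
shifts = np.arange(n_bits - 1, -1, -1)  # most significant bit first
bits = (ordinal[:, None] >> shifts) & 1

manual_binary = pd.DataFrame(bits, columns=[f"Species_{i}" for i in range(n_bits)])
print(manual_binary)
```

With three categories the largest integer is 3, which needs two binary digits, hence two columns. The saving over one-hot encoding grows with the number of categories.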

## Pandas get dummies


In [None]:
pd.get_dummies(penguin,columns=['Species', 'Clutch Completion', 'Sex'])

Unnamed: 0,Culmen Length (mm),Culmen Depth (mm),Flipper Length (mm),Body Mass (g),Species_Adelie Penguin (Pygoscelis adeliae),Species_Chinstrap penguin (Pygoscelis antarctica),Species_Gentoo penguin (Pygoscelis papua),Clutch Completion_No,Clutch Completion_Yes,Sex_.,Sex_FEMALE,Sex_MALE
0,,,181.0,3750,1,0,0,0,1,0,0,1
1,39.5,17.4,186.0,3800,1,0,0,0,1,0,1,0
2,40.3,18.0,195.0,3250,1,0,0,0,1,0,1,0
3,,,,.,1,0,0,0,1,0,0,0
4,36.7,19.3,193.0,3450,1,0,0,0,1,0,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...
339,,,,.,0,0,1,1,0,0,0,0
340,46.8,14.3,215.0,4850,0,0,1,0,1,0,1,0
341,50.4,15.7,222.0,5750,0,0,1,0,1,0,0,1
342,45.2,14.8,212.0,.,0,0,1,0,1,0,1,0


- `get_dummies` returns a column for every category, so one column per encoded variable can be dropped (for example with `drop_first=True`) before fitting an ML model.
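Dropping one column per variable can be done directly in pandas. The sketch below uses a small made-up `toy` frame in place of the penguin data, so the category values are assumptions for illustration.

```python
import pandas as pd

# Hypothetical toy frame standing in for the penguin data
toy = pd.DataFrame({
    "Species": ["Adelie", "Chinstrap", "Gentoo"],
    "Clutch Completion": ["Yes", "No", "Yes"],
})

# drop_first=True removes the first level of each encoded variable
dummies = pd.get_dummies(
    toy,
    columns=["Species", "Clutch Completion"],
    drop_first=True,
    dtype=int,
)
print(dummies)
```

The dropped level (here `Adelie` and `No`) is represented by a row of zeros in the remaining columns of that variable.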

## Scikit-learn one-hot encoding

In [None]:
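from sklearn.preprocessing import OneHotEncoder

# Names of the columns stored as strings (dtype 'object')
s = (penguin.dtypes == 'object')
cols = list(s[s].index)
# Dense output; on scikit-learn >= 1.2 this argument is named sparse_output instead of sparse
ohe = OneHotEncoder(handle_unknown='ignore', sparse=False)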

In [None]:
penguin_Species = pd.DataFrame(ohe.fit_transform(penguin[["Species"]]))
penguin_Species

Unnamed: 0,0,1,2
0,1.0,0.0,0.0
1,1.0,0.0,0.0
2,1.0,0.0,0.0
3,1.0,0.0,0.0
4,1.0,0.0,0.0
...,...,...,...
339,0.0,0.0,1.0
340,0.0,0.0,1.0
341,0.0,0.0,1.0
342,0.0,0.0,1.0


### Ordinal encoding
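- When dealing with ordinal-level data, the categories have a natural order, so each category can be mapped to an integer that preserves that order.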

In [None]:
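# Toy data: an ordinal variable with five levels
speed = {'Car_speed': ['very high', 'high', 'medium', 'low', 'very low']}
df = pd.DataFrame(speed, columns=["Car_speed"])
# Map each level to its rank (1 = fastest, 5 = slowest)
temp_dict = {'very high': 1, 'high': 2, 'medium': 3, 'low': 4, 'very low': 5}
df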

Unnamed: 0,Car_speed
0,very high
1,high
2,medium
3,low
4,very low


In [None]:
df["speed_ordinal"] = df.Car_speed.map(temp_dict)
df

Unnamed: 0,Car_speed,speed_ordinal
0,very high,1
1,high,2
2,medium,3
3,low,4
4,very low,5
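As an alternative to a hand-written dictionary, the same idea can be sketched with an ordered pandas `Categorical`. The `levels` list below is an assumption about the intended ranking, and it runs from slowest to fastest, so the codes come out in the opposite direction to `temp_dict`.

```python
import pandas as pd

# Assumed ranking, lowest to highest
levels = ["very low", "low", "medium", "high", "very high"]
speeds = pd.Series(["very high", "high", "medium", "low", "very low"])

# An ordered categorical records the ranking explicitly
ordered = pd.Categorical(speeds, categories=levels, ordered=True)

# .codes gives each value's position in `levels` (0 = very low, 4 = very high)
speed_codes = pd.Series(ordered.codes, index=speeds.index)
print(speed_codes.tolist())
```

Which end of the scale gets the smallest number is a modelling choice. What matters is that the integers follow the order of the categories.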
