# Encoding

- Encoding converts discrete categorical variables into discrete numerical variables.
- There are 2 types of categorical variables:
  1. Nominal (Ex: Item Type)
  2. Ordinal (Ex: Outlet_Size (Small, Medium, High))
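The difference between the two types can be sketched with pandas categoricals. This is only an illustration: the item types and the sample values are hypothetical, and the `Small < Medium < High` order is assumed from the example above.

```python
import pandas as pd

# Nominal: categories have no inherent order (hypothetical item types).
item_type = pd.Categorical(["Dairy", "Snacks", "Dairy"], ordered=False)

# Ordinal: categories follow a meaningful order, declared explicitly.
outlet_size = pd.Categorical(
    ["Small", "High", "Medium"],
    categories=["Small", "Medium", "High"],
    ordered=True,
)

print(item_type.ordered)        # False
print(outlet_size.ordered)      # True
# Integer position of each value in the declared order.
print(list(outlet_size.codes))  # [0, 2, 1]
```

For a nominal variable the integer codes carry no meaning, which is why one-hot style encodings are usually preferred for it.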

In [92]:
import numpy as np
import pandas as pd

In [93]:
df = pd.read_csv("homeprices.csv")
df

Unnamed: 0,town,area,price
0,Chennai,2600,5500000
1,Chennai,3000,5650000
2,Chennai,3200,6100000
3,Chennai,3600,6800000
4,Bangalore,2600,5850000
5,Bangalore,2800,6150000
6,Bangalore,3300,6500000
7,Bangalore,3600,7100000
8,Hyderabad,2600,5750000
9,Hyderabad,2900,6000000


In [94]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12 entries, 0 to 11
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   town    12 non-null     object
 1   area    12 non-null     int64 
 2   price   12 non-null     int64 
dtypes: int64(2), object(1)
memory usage: 416.0+ bytes


In [95]:
df["town"].value_counts()

Chennai      4
Bangalore    4
Hyderabad    4
Name: town, dtype: int64

# To convert Nominal Categorical into Numeric

### Option 1: Using Pandas - get_dummies()

In [96]:
# Here I am doing one step at a time.
# Step 1: get dummies for the town column of the given df.
dummies = pd.get_dummies(df.town)
dummies

Unnamed: 0,Bangalore,Chennai,Hyderabad
0,0,1,0
1,0,1,0
2,0,1,0
3,0,1,0
4,1,0,0
5,1,0,0
6,1,0,0
7,1,0,0
8,0,0,1
9,0,0,1


In [97]:
# Step 2: add the dummies data frame above to the original data frame and save the result in a new data frame.

df_dummies = pd.concat([df,dummies],axis = "columns")
df_dummies

Unnamed: 0,town,area,price,Bangalore,Chennai,Hyderabad
0,Chennai,2600,5500000,0,1,0
1,Chennai,3000,5650000,0,1,0
2,Chennai,3200,6100000,0,1,0
3,Chennai,3600,6800000,0,1,0
4,Bangalore,2600,5850000,1,0,0
5,Bangalore,2800,6150000,1,0,0
6,Bangalore,3300,6500000,1,0,0
7,Bangalore,3600,7100000,1,0,0
8,Hyderabad,2600,5750000,0,0,1
9,Hyderabad,2900,6000000,0,0,1


In [98]:
# Step 3: Since we created dummies for the town column, we don't need this column in the data frame any more.

df_dummies.drop("town",axis = "columns", inplace = True)

df_dummies

Unnamed: 0,area,price,Bangalore,Chennai,Hyderabad
0,2600,5500000,0,1,0
1,3000,5650000,0,1,0
2,3200,6100000,0,1,0
3,3600,6800000,0,1,0
4,2600,5850000,1,0,0
5,2800,6150000,1,0,0
6,3300,6500000,1,0,0
7,3600,7100000,1,0,0
8,2600,5750000,0,0,1
9,2900,6000000,0,0,1


In [99]:
# Step 4: Any one city can be removed from the data frame above (e.g. if you see 0 for Hyderabad and 0 for Chennai, you know Bangalore is 1).

df_dummies.drop("Bangalore", axis = "columns",inplace = True)

df_dummies

Unnamed: 0,area,price,Chennai,Hyderabad
0,2600,5500000,1,0
1,3000,5650000,1,0
2,3200,6100000,1,0
3,3600,6800000,1,0
4,2600,5850000,0,0
5,2800,6150000,0,0
6,3300,6500000,0,0
7,3600,7100000,0,0
8,2600,5750000,0,1
9,2900,6000000,0,1


In [100]:
# All of the above can be done in one step (the new columns get a town_ prefix).

df_dum = pd.get_dummies(df,drop_first=True)

df_dum

Unnamed: 0,area,price,town_Chennai,town_Hyderabad
0,2600,5500000,1,0
1,3000,5650000,1,0
2,3200,6100000,1,0
3,3600,6800000,1,0
4,2600,5850000,0,0
5,2800,6150000,0,0
6,3300,6500000,0,0
7,3600,7100000,0,0
8,2600,5750000,0,1
9,2900,6000000,0,1
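Recent pandas versions (2.0 and later) return boolean `True`/`False` dummy columns by default rather than 0/1. The sketch below passes `dtype=int` to get integer columns like the ones shown above. It uses a small inline frame, `df_demo`, built from a few of the rows above instead of reading `homeprices.csv`; that name is only for illustration.

```python
import pandas as pd

# Small inline sample with the same columns as homeprices.csv.
df_demo = pd.DataFrame({
    "town": ["Chennai", "Bangalore", "Hyderabad", "Chennai"],
    "area": [2600, 2600, 2600, 3000],
    "price": [5500000, 5850000, 5750000, 5650000],
})

# columns= names the column to encode; dtype=int forces 0/1 output.
df_demo_dum = pd.get_dummies(
    df_demo, columns=["town"], drop_first=True, dtype=int
)
print(df_demo_dum)
```

Naming the column explicitly with `columns=` also avoids accidentally encoding other text columns that may be added to the data later.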


## Dummy Variable Trap (very important concept)
When one variable can be derived from other variables, the variables are said to be multicollinear. Here, if you know the values of Chennai and Hyderabad, you can easily infer the value of Bangalore: when Chennai = 0 and Hyderabad = 0, Bangalore must be 1. The town dummy variables are therefore multicollinear. In this situation linear regression won't work as expected, because the coefficients are no longer uniquely determined. Hence you need to drop one column.
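One way to see the redundancy is to compare the rank of the design matrix with its number of columns. The snippet below is a minimal sketch on a hypothetical six-row sample of towns; the intercept column of ones stands in for the constant term of a linear model.

```python
import numpy as np
import pandas as pd

towns = pd.Series(
    ["Chennai", "Chennai", "Bangalore", "Bangalore", "Hyderabad", "Hyderabad"]
)

all_dummies = pd.get_dummies(towns, dtype=float)                # 3 columns
reduced = pd.get_dummies(towns, drop_first=True, dtype=float)   # 2 columns
intercept = np.ones((len(towns), 1))

X_full = np.hstack([intercept, all_dummies.to_numpy()])
X_reduced = np.hstack([intercept, reduced.to_numpy()])

rank_full = np.linalg.matrix_rank(X_full)
rank_reduced = np.linalg.matrix_rank(X_reduced)

# With every dummy kept, the dummies sum to the intercept column,
# so the rank is lower than the number of columns.
print(X_full.shape[1], rank_full)        # 4 3
print(X_reduced.shape[1], rank_reduced)  # 3 3
```

When the rank is lower than the column count, infinitely many coefficient vectors give the same fit, which is the practical meaning of the trap.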

### Note: scikit-learn's linear regression still fits when all dummy columns are kept, so it works even if you don't drop one of the town columns. However, we should make a habit of handling the dummy variable trap ourselves in case the library we are using does not handle it for us.
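As a rough check of this behaviour, the sketch below fits `LinearRegression` once with all town dummies and once with the first one dropped, then compares the predictions. The six rows are copied from the data above into a hypothetical `df_demo` frame; this is an illustration, not a statement about every estimator in the library.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

df_demo = pd.DataFrame({
    "town": ["Chennai", "Chennai", "Bangalore", "Bangalore", "Hyderabad", "Hyderabad"],
    "area": [2600, 3000, 2600, 2800, 2600, 2900],
    "price": [5500000, 5650000, 5850000, 6150000, 5750000, 6000000],
})
y = df_demo["price"]
features = df_demo[["town", "area"]]

# All three dummies kept vs. first dummy dropped.
X_all = pd.get_dummies(features, columns=["town"], dtype=float)
X_drop = pd.get_dummies(features, columns=["town"], drop_first=True, dtype=float)

pred_all = LinearRegression().fit(X_all, y).predict(X_all)
pred_drop = LinearRegression().fit(X_drop, y).predict(X_drop)

# The fitted values agree; only the individual coefficients differ.
same = np.allclose(pred_all, pred_drop)
print(same)
```

The predictions match, but the coefficients of the full model are not unique, so dropping a column remains the safer habit when the coefficients need to be interpreted.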

### Option 2: One-hot encoding using sklearn

In [101]:
# Personally I prefer doing this in pandas: here you need to specify the column names yourself, whereas pandas creates them by default.
from sklearn.preprocessing import OneHotEncoder

enc = OneHotEncoder(drop = "first")

# drop="first" removes the first category (Bangalore), so the remaining columns are Chennai and Hyderabad, in that order.
enc_df = pd.DataFrame(enc.fit_transform(df[["town"]]).toarray(),columns=["Chennai","Hyderabad"])

df_ohe = pd.concat([df,enc_df],axis = "columns")

df_ohe.drop("town",axis = "columns", inplace = True)

df_ohe

Unnamed: 0,area,price,Chennai,Hyderabad
0,2600,5500000,1.0,0.0
1,3000,5650000,1.0,0.0
2,3200,6100000,1.0,0.0
3,3600,6800000,1.0,0.0
4,2600,5850000,0.0,0.0
5,2800,6150000,0.0,0.0
6,3300,6500000,0.0,0.0
7,3600,7100000,0.0,0.0
8,2600,5750000,0.0,1.0
9,2900,6000000,0.0,1.0


# To convert Ordinal Categorical into Numeric

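### Option 1: Label encoding using sklearn
- Converts to numeric as per alphabetical order.
- sklearn documents LabelEncoder as intended for target labels (y); it is applied to a feature column here for illustration.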

In [102]:
df = pd.read_csv("homeprices.csv")
df

Unnamed: 0,town,area,price
0,Chennai,2600,5500000
1,Chennai,3000,5650000
2,Chennai,3200,6100000
3,Chennai,3600,6800000
4,Bangalore,2600,5850000
5,Bangalore,2800,6150000
6,Bangalore,3300,6500000
7,Bangalore,3600,7100000
8,Hyderabad,2600,5750000
9,Hyderabad,2900,6000000


In [107]:
# Import from the sklearn library

from sklearn.preprocessing import LabelEncoder

# Create the encoder

le = LabelEncoder()

# Fit and transform

df["town"] = le.fit_transform(df["town"])

df

Unnamed: 0,town,area,price
0,1,2600,5500000
1,1,3000,5650000
2,1,3200,6100000
3,1,3600,6800000
4,0,2600,5850000
5,0,2800,6150000
6,0,3300,6500000
7,0,3600,7100000
8,2,2600,5750000
9,2,2900,6000000
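To check which integer was assigned to which town, the fitted encoder can be inspected. The sketch below uses a short hypothetical series of towns; `classes_` holds the sorted labels, and `inverse_transform` maps the codes back to the original values.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

towns = pd.Series(["Chennai", "Bangalore", "Hyderabad", "Chennai"])

le = LabelEncoder()
codes = le.fit_transform(towns)

# classes_ is sorted, so the position of a label is its code.
mapping = {str(label): int(code) for code, label in enumerate(le.classes_)}
print(mapping)  # {'Bangalore': 0, 'Chennai': 1, 'Hyderabad': 2}

# Recover the original labels from the integer codes.
decoded = le.inverse_transform(codes)
print(list(decoded))
```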


### Option 2: Ordinal encoding using sklearn
- Converts to numeric following the category order passed to the encoder (the first category becomes 0, then ascending).

In [109]:
df = pd.read_csv("homeprices.csv")

# Import from the sklearn library
from sklearn.preprocessing import OrdinalEncoder

# Create the encoder (we need to specify the category order)
oe = OrdinalEncoder(categories=[["Bangalore","Chennai","Hyderabad"]])

# Fit and transform

df["town_ore"] = oe.fit_transform(df[["town"]])

df

Unnamed: 0,town,area,price,town_ore
0,Chennai,2600,5500000,1.0
1,Chennai,3000,5650000,1.0
2,Chennai,3200,6100000,1.0
3,Chennai,3600,6800000,1.0
4,Bangalore,2600,5850000,0.0
5,Bangalore,2800,6150000,0.0
6,Bangalore,3300,6500000,0.0
7,Bangalore,3600,7100000,0.0
8,Hyderabad,2600,5750000,2.0
9,Hyderabad,2900,6000000,2.0
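The explicit order matters most for a genuinely ordinal column. The sketch below uses a hypothetical `Outlet_Size` column (the ordinal example mentioned at the top, assuming the order Small < Medium < High) and compares the default alphabetical codes with codes from an explicit `categories` list.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

sizes = pd.DataFrame({"Outlet_Size": ["Small", "High", "Medium", "Small"]})

# Default: categories are sorted alphabetically (High, Medium, Small).
default_codes = OrdinalEncoder().fit_transform(sizes).ravel()

# Explicit order: Small < Medium < High.
ordered_enc = OrdinalEncoder(categories=[["Small", "Medium", "High"]])
ordered_codes = ordered_enc.fit_transform(sizes).ravel()

print(default_codes.tolist())  # [2.0, 0.0, 1.0, 2.0]
print(ordered_codes.tolist())  # [0.0, 2.0, 1.0, 0.0]
```

With the default order, `High` would get the smallest code, which reverses the intended ranking; passing `categories` keeps the codes aligned with the real order.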


### Option 3: map() using pandas
- Converts to numeric using a mapping of your choice.

In [110]:
df = pd.read_csv("homeprices.csv")

df["town_enc"] = df["town"].map({"Bangalore": 0, "Chennai":1, "Hyderabad": 2})

df

Unnamed: 0,town,area,price,town_enc
0,Chennai,2600,5500000,1
1,Chennai,3000,5650000,1
2,Chennai,3200,6100000,1
3,Chennai,3600,6800000,1
4,Bangalore,2600,5850000,0
5,Bangalore,2800,6150000,0
6,Bangalore,3300,6500000,0
7,Bangalore,3600,7100000,0
8,Hyderabad,2600,5750000,2
9,Hyderabad,2900,6000000,2
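One thing to watch with `map()` is that any value missing from the dictionary becomes `NaN` without raising an error. The sketch below shows a simple check; the town `"Mumbai"` is a hypothetical value that is not in the mapping.

```python
import pandas as pd

town_codes = {"Bangalore": 0, "Chennai": 1, "Hyderabad": 2}
towns = pd.Series(["Chennai", "Mumbai", "Hyderabad"])

encoded = towns.map(town_codes)

# Values absent from the dictionary are mapped to NaN.
unmapped = towns[encoded.isna()].unique().tolist()
print(unmapped)  # ['Mumbai']
```

Running a check like this after mapping helps catch new or misspelled categories before they reach a model.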
