# Data Encoding

Data encoding is the process of converting categorical (string) data into numerical data. Common techniques:

1. Nominal / OHE (One Hot Encoding)
2. Label & Ordinal
3. Target guided ordinal encoding

## Nominal / OHE

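- Converts categorical data into numerical data by creating one binary column per category.
- Suited to nominal data, where the categories have no order.

- Disadvantages:
    - A column with many categories produces many new columns.
    - This can lead to the curse of dimensionality.
    - A column with n distinct values creates n dummy variables.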
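
To make the last point concrete, here is a minimal sketch that counts the dummy columns produced for a column with five distinct values. The `city` column and its values are made up for illustration and are not part of this notebook's data.

```python
import pandas as pd

# Hypothetical column with 5 distinct values (illustrative data only).
cities = pd.DataFrame(
    {"city": ["Pune", "Delhi", "Mumbai", "Chennai", "Kolkata", "Pune"]}
)

# One dummy column is created per distinct category.
dummies = pd.get_dummies(cities["city"])

n_distinct = cities["city"].nunique()
n_dummy_columns = dummies.shape[1]
print(n_distinct, n_dummy_columns)
```

With thousands of distinct values the same rule applies, which is why one-hot encoding becomes impractical for high-cardinality columns.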

## Label and Ordinal Encoding

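- **Label Encoding**: assigns an integer to each category.
- Advantage:
    - No curse of dimensionality, since the result is a single column.
- Disadvantage:
    - The integers imply an order, so it suits ordinal data but can mislead a model when applied to nominal data.

- **Ordinal Encoding**: assigns integers that follow the natural order of the categories.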

- **Target Guided Ordinal Encoding**: encodes categories based on their relationship with the target variable.
    - Very useful when the categorical data has a large number of unique categories.
    - Each category is replaced with the mean or median of the corresponding target values.

# Practical Implementation

## Nominal / OHE
- Each category is represented as a binary vector.
- Example with three categories: single, married, separated.
- single : [1, 0, 0]
- married : [0, 1, 0]
- separated : [0, 0, 1]

In [2]:
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

In [3]:
df = pd.DataFrame({"status":["Single", "Married", "Seperated", "Single", "Married"]})

In [4]:
df

| | status |
|---|---|
| 0 | Single |
| 1 | Married |
| 2 | Seperated |
| 3 | Single |
| 4 | Married |


In [6]:
encoder = OneHotEncoder()
encoder

| Parameter | Value |
|---|---|
| categories | 'auto' |
| drop | None |
| sparse_output | True |
| dtype | <class 'numpy.float64'> |
| handle_unknown | 'error' |
| min_frequency | None |
| max_categories | None |
| feature_name_combiner | 'concat' |


In [9]:
encoded = encoder.fit_transform(df[["status"]]).toarray()
encoded

array([[0., 0., 1.],
       [1., 0., 0.],
       [0., 1., 0.],
       [0., 0., 1.],
       [1., 0., 0.]])

In [10]:
encoder.get_feature_names_out()

array(['status_Married', 'status_Seperated', 'status_Single'],
      dtype=object)

In [12]:
encoder_df = pd.DataFrame(encoded, columns=encoder.get_feature_names_out())
encoder_df

| | status_Married | status_Seperated | status_Single |
|---|---|---|---|
| 0 | 0.0 | 0.0 | 1.0 |
| 1 | 1.0 | 0.0 | 0.0 |
| 2 | 0.0 | 1.0 | 0.0 |
| 3 | 0.0 | 0.0 | 1.0 |
| 4 | 1.0 | 0.0 | 0.0 |


In [None]:
a = encoder.transform([["Single"]]).toarray()


In [17]:
a

array([[0., 0., 1.]])
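
The encoder above uses the default `handle_unknown='error'`, so transforming a category that was not present during `fit` raises an error. A minimal sketch of the alternative setting, `handle_unknown="ignore"`, follows; the `"Widowed"` value and the variable names are assumptions made up for this example.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

train = pd.DataFrame({"status": ["Single", "Married", "Seperated"]})

# handle_unknown="ignore" encodes unseen categories as an all-zero row
# instead of raising an error.
tolerant_encoder = OneHotEncoder(handle_unknown="ignore")
tolerant_encoder.fit(train[["status"]])

# "Widowed" (hypothetical) was never seen during fit.
new_data = pd.DataFrame({"status": ["Single", "Widowed"]})
encoded_new = tolerant_encoder.transform(new_data[["status"]]).toarray()
print(encoded_new)
```

The known category keeps its usual vector, while the unseen one becomes a row of zeros.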

`pd.get_dummies()` is a pandas alternative that one-hot encodes the original DataFrame directly:

In [18]:
pd.get_dummies(df, dtype=float)

| | status_Married | status_Seperated | status_Single |
|---|---|---|---|
| 0 | 0.0 | 0.0 | 1.0 |
| 1 | 1.0 | 0.0 | 0.0 |
| 2 | 0.0 | 1.0 | 0.0 |
| 3 | 0.0 | 0.0 | 1.0 |
| 4 | 1.0 | 0.0 | 0.0 |


In [19]:
pd.concat([df, encoder_df], axis=1)

| | status | status_Married | status_Seperated | status_Single |
|---|---|---|---|---|
| 0 | Single | 0.0 | 0.0 | 1.0 |
| 1 | Married | 1.0 | 0.0 | 0.0 |
| 2 | Seperated | 0.0 | 1.0 | 0.0 |
| 3 | Single | 0.0 | 0.0 | 1.0 |
| 4 | Married | 1.0 | 0.0 | 0.0 |


In [20]:
import seaborn as sns
sns.load_dataset("tips")

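| | total_bill | tip | sex | smoker | day | time | size |
|---|---|---|---|---|---|---|---|
| 0 | 16.99 | 1.01 | Female | No | Sun | Dinner | 2 |
| 1 | 10.34 | 1.66 | Male | No | Sun | Dinner | 3 |
| 2 | 21.01 | 3.50 | Male | No | Sun | Dinner | 3 |
| 3 | 23.68 | 3.31 | Male | No | Sun | Dinner | 2 |
| 4 | 24.59 | 3.61 | Female | No | Sun | Dinner | 4 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 239 | 29.03 | 5.92 | Male | No | Sat | Dinner | 3 |
| 240 | 27.18 | 2.00 | Female | Yes | Sat | Dinner | 2 |
| 241 | 22.67 | 2.00 | Male | Yes | Sat | Dinner | 2 |
| 242 | 17.82 | 1.75 | Male | No | Sat | Dinner | 2 |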


## Label Encoding

- Assigns a unique integer label to each category.
- Can be done with `map` or with a library such as scikit-learn.
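
The `map` route can be sketched as follows. The mapping dictionary and the names `statuses` and `status_to_label` are assumptions chosen for this example; the integers are picked by hand and carry no meaning of their own.

```python
import pandas as pd

statuses = pd.DataFrame(
    {"status": ["Single", "Married", "Seperated", "Single", "Married"]}
)

# Hand-written mapping: the integer given to each category is an arbitrary choice.
status_to_label = {"Married": 0, "Seperated": 1, "Single": 2}

# Series.map looks each value up in the dictionary.
statuses["status_label"] = statuses["status"].map(status_to_label)
print(statuses)
```

Any value missing from the dictionary would come out as `NaN`, so the mapping needs to cover every category in the column.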

In [21]:
df

| | status |
|---|---|
| 0 | Single |
| 1 | Married |
| 2 | Seperated |
| 3 | Single |
| 4 | Married |


In [22]:
from sklearn.preprocessing import LabelEncoder
label_encoder = LabelEncoder()

In [None]:
# LabelEncoder expects a 1-D input, so pass the column as a Series
a = label_encoder.fit_transform(df['status'])

In [25]:
a

array([2, 0, 1, 2, 0])

In [None]:
b = label_encoder.transform(["Seperated"])

In [28]:
b

array([1])
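
To see where these integers come from, the fitted encoder can be inspected. The sketch below refits a fresh `LabelEncoder` on the same values (the names `status`, `le`, and `codes` are chosen for this example) and uses its `classes_` attribute and `inverse_transform` method.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

status = pd.Series(["Single", "Married", "Seperated", "Single", "Married"])

le = LabelEncoder()
codes = le.fit_transform(status)

# classes_ holds the sorted categories; a category's position is its label.
print(le.classes_)

# inverse_transform maps integer labels back to the original strings.
print(le.inverse_transform(codes))
```

The labels follow the sorted (alphabetical) order of the strings, which is why "Married" gets 0 and "Single" gets 2. That order says nothing about the categories themselves.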

## Ordinal Encoding

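- Categories with a natural order are mapped to increasing integers, for example:
    - High School (HS) : 1
    - Graduation (UG) : 2
    - Post graduation (PG) : 3
    - PhD (PHD) : 4
- scikit-learn's `OrdinalEncoder` numbers categories from 0, so the code below yields 0 to 3.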

In [29]:
from sklearn.preprocessing import OrdinalEncoder
df = pd.DataFrame({"qualification": ["HS", "UG", "PG", "PHD", "HS", "PHD"]})

In [30]:
df

| | qualification |
|---|---|
| 0 | HS |
| 1 | UG |
| 2 | PG |
| 3 | PHD |
| 4 | HS |
| 5 | PHD |


In [31]:
encoder = OrdinalEncoder(categories=[["HS", "UG", "PG", "PHD"]])

In [32]:
a = encoder.fit_transform(df[["qualification"]])

In [33]:
a

array([[0.],
       [1.],
       [2.],
       [3.],
       [0.],
       [3.]])

In [None]:
# Encode new data with the already fitted encoder
b = encoder.transform([["PG"]])

In [36]:
b

array([[2.]])
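
By default, `OrdinalEncoder` raises an error for a category that is not in its category list. A minimal sketch of one way to handle this, using the `handle_unknown="use_encoded_value"` and `unknown_value` parameters, is shown below; the `"Diploma"` value and the variable names are assumptions for this example.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

quals = pd.DataFrame({"qualification": ["HS", "UG", "PG", "PHD"]})

# Unknown categories are encoded as -1 instead of raising an error.
safe_encoder = OrdinalEncoder(
    categories=[["HS", "UG", "PG", "PHD"]],
    handle_unknown="use_encoded_value",
    unknown_value=-1,
)
safe_encoder.fit(quals[["qualification"]])

# "Diploma" (hypothetical) is not in the category list.
new_quals = pd.DataFrame({"qualification": ["PG", "Diploma"]})
encoded_quals = safe_encoder.transform(new_quals[["qualification"]])
print(encoded_quals)
```

Choosing -1 keeps unknown values clearly separate from the valid codes 0 to 3.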

## Target Guided Ordinal Encoding
- Encodes each category using its relationship with the target variable.
- Useful when the column has a lot of unique categories.
- Each category is replaced with the mean or median of the target for its group.

In [37]:
df = pd.DataFrame({'time': ['lunch', 'breakfast', 'dinner', 'lunch', 'breakfast', 'dinner', 'lunch', 'breakfast', 'dinner'], 'total_bill' : [120, 130, 90, 125, 150, 190, 160, 180, 189]})

In [38]:
df

| | time | total_bill |
|---|---|---|
| 0 | lunch | 120 |
| 1 | breakfast | 130 |
| 2 | dinner | 90 |
| 3 | lunch | 125 |
| 4 | breakfast | 150 |
| 5 | dinner | 190 |
| 6 | lunch | 160 |
| 7 | breakfast | 180 |
| 8 | dinner | 189 |


In [40]:
# Mean total_bill for each time category
mean_price = df.groupby('time')['total_bill'].mean()

In [43]:
# Replace each category with the mean of its group
df['time_encoded'] = df['time'].map(mean_price)

In [44]:
df

| | time | total_bill | time_encoded |
|---|---|---|---|
| 0 | lunch | 120 | 135.0 |
| 1 | breakfast | 130 | 153.333333 |
| 2 | dinner | 90 | 156.333333 |
| 3 | lunch | 125 | 135.0 |
| 4 | breakfast | 150 | 153.333333 |
| 5 | dinner | 190 | 156.333333 |
| 6 | lunch | 160 | 135.0 |
| 7 | breakfast | 180 | 153.333333 |
| 8 | dinner | 189 | 156.333333 |
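
The cells above replace each category with its group mean. If integer ranks are wanted instead, one possible extension is to rank the categories by that mean. This is a sketch of that extra step, not something the notebook itself does, and the names `bills`, `mean_bill`, and `time_rank` are chosen for this example.

```python
import pandas as pd

bills = pd.DataFrame({
    "time": ["lunch", "breakfast", "dinner"] * 3,
    "total_bill": [120, 130, 90, 125, 150, 190, 160, 180, 189],
})

# Mean of the target for each category.
mean_bill = bills.groupby("time")["total_bill"].mean()

# Rank categories by their mean: the lowest mean gets rank 1.
time_rank = mean_bill.rank(method="dense").astype(int)

bills["time_rank"] = bills["time"].map(time_rank)
print(time_rank)
```

Swapping `.mean()` for `.median()` gives the median-based variant. In practice the group statistics should be computed on training data only, since they are derived from the target.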
