
Use ordinal encoding when the data is ordinal and one-hot encoding when it is nominal. Label encoding is reserved for the target column.

# **Encoding**
Encoding is the process of converting categorical (non-numeric) data into a numerical format so that machine learning algorithms can work with it.

## **Types of data:**

1.   Numerical data: Data that consists of numbers, either discrete or continuous. Example: age, income, temperature.
2.   Categorical data: Data that consists of categories or labels.

    *  Nominal data: Categories with no inherent order (e.g., colors, gender).
    *  Ordinal data: Categories with a defined order (e.g., ratings: poor < average < good).
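
The difference between nominal and ordinal data can be made explicit in pandas. The sketch below is only an illustration: the `ratings` and `colors` values are made up, and `pd.Categorical` is one possible way to record whether an order exists.

```python
import pandas as pd

# Hypothetical ratings column, used only for illustration
ratings = pd.Series(["Poor", "Good", "Average", "Good"])

# Ordinal: declare the order explicitly so that comparisons are meaningful
ordinal = pd.Categorical(ratings, categories=["Poor", "Average", "Good"], ordered=True)
ordinal_codes = ordinal.codes  # position of each value in the declared order

# Nominal: no order is declared, so the categories are just labels
colors = pd.Categorical(["Red", "Blue", "Red"])

print(ordinal_codes)   # integer codes that follow Poor < Average < Good
print(ordinal.max())   # the highest rating according to the declared order
print(colors.ordered)  # False, because nominal data has no order
```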

## **Types of encoding:**

1.   Ordinal encoding
2.   One-hot encoding
3.   Label encoding
4.   Target encoding
5.   Binary encoding
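
Target encoding and binary encoding are listed here but not demonstrated later in this notebook. The following is a minimal pandas/NumPy sketch of both ideas on a made-up frame (`df_demo`, `city`, and `purchased` are assumptions for illustration). It is not a library implementation, and in practice the target means should be computed on the training split only to avoid leaking the target.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, for illustration only
df_demo = pd.DataFrame({
    "city": ["Delhi", "Delhi", "Mumbai", "Mumbai", "Kolkata"],
    "purchased": [1, 0, 1, 1, 0],
})

# Target encoding: replace each category with the mean of the target for that category
city_means = df_demo.groupby("city")["purchased"].mean()
df_demo["city_target"] = df_demo["city"].map(city_means)

# Binary encoding: give each category an integer code, then write that code in binary digits
codes, uniques = pd.factorize(df_demo["city"])          # Delhi -> 0, Mumbai -> 1, Kolkata -> 2
n_bits = max(1, int(codes.max()).bit_length())          # number of bits needed for the largest code
bits = (codes[:, None] >> np.arange(n_bits)[::-1]) & 1  # one column per bit, most significant first

print(df_demo)
print(bits)
```

Binary encoding needs roughly log2(number of categories) columns, which is why it is sometimes preferred over one-hot encoding for high-cardinality features.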

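## **Why is encoding necessary?**
Encoding is essential in machine learning because most algorithms require numerical inputs to perform mathematical computations. Choosing an encoding that matches the data type prevents misinterpretation (for example, implying an order among nominal categories) and preserves meaningful patterns such as the ranking in ordinal data. Compact schemes such as binary or target encoding also keep the number of columns small for high-cardinality features, which helps training efficiency. All of this contributes to improved model performance.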


In [None]:
import numpy as np
import pandas as pd

In [None]:
df = pd.read_csv('customer.csv')

In [None]:
df.sample(5)

Unnamed: 0,age,gender,review,education,purchased
4,16,Female,Average,UG,No
46,64,Female,Poor,PG,No
41,23,Male,Good,PG,Yes
25,57,Female,Good,School,No
43,27,Male,Poor,PG,No


## Analyzing type of categorical data
gender - nominal data

review - ordinal data

education - ordinal data

purchased - nominal data


In [None]:
# Keep only review, education and purchased to focus on ordinal encoding
df = df.iloc[:,2:]

In [None]:
df.head()

Unnamed: 0,review,education,purchased
0,Average,School,No
1,Poor,UG,No
2,Good,PG,No
3,Good,PG,No
4,Average,UG,No


In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(df.iloc[:, 0:2], df.iloc[:, -1], test_size=0.3, random_state=42)

In [None]:
X_train.shape , X_test.shape

((35, 2), (15, 2))

In [None]:
X_train.head()

Unnamed: 0,review,education
6,Good,School
41,Good,PG
46,Poor,PG
47,Good,PG
15,Poor,UG


In [None]:
from sklearn.preprocessing import OrdinalEncoder

In [None]:
oe = OrdinalEncoder(categories=[['Poor','Average','Good'],['School','UG','PG']])


In [None]:
oe.fit(X_train)

In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50 entries, 0 to 49
Data columns (total 3 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   review     50 non-null     object
 1   education  50 non-null     object
 2   purchased  50 non-null     object
dtypes: object(3)
memory usage: 1.3+ KB


In [None]:
X_train = oe.transform(X_train)

In [None]:
X_test = oe.transform(X_test)

In [None]:
X_train

array([[2., 0.],
       [2., 2.],
       [0., 2.],
       [2., 2.],
       [0., 1.],
       [2., 1.],
       [0., 1.],
       [1., 2.],
       [1., 0.],
       [0., 0.],
       [1., 0.],
       [1., 1.],
       [0., 2.],
       [2., 2.],
       [1., 0.],
       [1., 1.],
       [2., 1.],
       [2., 1.],
       [0., 1.],
       [1., 2.],
       [2., 2.],
       [0., 2.],
       [0., 0.],
       [2., 0.],
       [2., 0.],
       [2., 1.],
       [0., 2.],
       [2., 0.],
       [2., 1.],
       [1., 0.],
       [0., 0.],
       [2., 2.],
       [0., 2.],
       [0., 0.],
       [2., 0.]])

In [None]:
oe.categories

[['Poor', 'Average', 'Good'], ['School', 'UG', 'PG']]
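
A fitted `OrdinalEncoder` raises an error by default when it meets a category it was not given. The sketch below uses small made-up frames (`train_demo`, `test_demo`) and shows one way to handle that with `handle_unknown='use_encoded_value'`, plus how `inverse_transform` maps the codes back to labels. Using `-1` for unknown values is an arbitrary choice made for this example.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical toy data, for illustration only
train_demo = pd.DataFrame({"review": ["Poor", "Good", "Average"],
                           "education": ["School", "PG", "UG"]})
test_demo = pd.DataFrame({"review": ["Good", "Excellent"],   # 'Excellent' was never declared
                          "education": ["UG", "PhD"]})       # 'PhD' was never declared

oe_demo = OrdinalEncoder(
    categories=[["Poor", "Average", "Good"], ["School", "UG", "PG"]],
    handle_unknown="use_encoded_value",  # do not raise on unseen categories
    unknown_value=-1,                    # assumed placeholder code for unseen categories
)
oe_demo.fit(train_demo)

encoded_test = oe_demo.transform(test_demo)                        # unseen values become -1
decoded_train = oe_demo.inverse_transform(oe_demo.transform(train_demo))  # back to the labels

print(encoded_test)
print(decoded_train)
```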

## LABEL ENCODING IS ONLY FOR OUTPUT COLUMNS

In [None]:
# to fit our output column
from sklearn.preprocessing import LabelEncoder

In [None]:
le = LabelEncoder()

In [None]:
le.fit(y_train)

In [None]:
le.classes_

array(['No', 'Yes'], dtype=object)

In [None]:
y_train = le.transform(y_train)
y_test = le.transform(y_test)

In [None]:
y_train

array([0, 1, 0, 1, 0, 1, 1, 1, 0, 1, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0,
       1, 0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 0, 0])
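
Predictions from a model trained on the encoded target come back as integers, so it is often useful to map them back to the original labels. A minimal sketch, assuming a made-up target list (`y_demo`) and made-up predictions:

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical target values, for illustration only
y_demo = ["No", "Yes", "No", "Yes"]

le_demo = LabelEncoder()
le_demo.fit(y_demo)  # classes are stored in sorted order: ['No', 'Yes']

codes = le_demo.transform(["Yes", "No"])         # labels -> integer codes
labels = le_demo.inverse_transform([0, 1, 1])    # integer codes (e.g. predictions) -> labels

print(le_demo.classes_)
print(codes)
print(labels)
```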

# One Hot Encoding

In [None]:
import pandas as pd
import numpy as np

In [None]:
df = pd.read_csv('cars.csv')

In [None]:
df.sample(5)

Unnamed: 0,brand,km_driven,fuel,owner,selling_price
4485,Maruti,7600,Petrol,First Owner,480000
3249,Hyundai,40000,Diesel,First Owner,910000
767,Hyundai,91500,Diesel,Second Owner,425000
3575,Toyota,121941,Diesel,First Owner,500000
1379,Maruti,35000,Petrol,Third Owner,75000


In [None]:
df['brand'].nunique()  # 32 different brands

32

## **1.OneHotEncoding using sklearn**

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
X = df.iloc[:, :-1]  # Features (all columns except last)
y = df.iloc[:, -1]   # Target (last column)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)


In [None]:
X_train.head()

Unnamed: 0,brand,km_driven,fuel,owner
6783,Maruti,120000,Petrol,Third Owner
1073,Toyota,100000,Diesel,First Owner
7756,BMW,39000,Diesel,First Owner
144,Toyota,39000,Petrol,First Owner
6424,Maruti,70000,Diesel,Second Owner


In [None]:
from sklearn.preprocessing import OneHotEncoder


In [None]:
ohe = OneHotEncoder(drop='first', sparse_output=False, dtype=np.int32)  # drop='first' drops the first category of each feature; sparse_output=False returns a dense NumPy array

In [None]:
X_train_new = ohe.fit_transform(X_train[['fuel','owner']])

In [None]:
X_test_new = ohe.transform(X_test[['fuel','owner']])  # reuse the encoder fitted on the training data; do not refit on the test set

In [None]:
X_train_new.shape

(5689, 7)

In [None]:
# Add the encoded fuel and owner columns back to the remaining X_train columns by stacking
np.hstack((X_train[['brand','km_driven']].values,X_train_new))

array([['Maruti', 120000, 0, ..., 0, 0, 1],
       ['Toyota', 100000, 1, ..., 0, 0, 0],
       ['BMW', 39000, 1, ..., 0, 0, 0],
       ...,
       ['Hyundai', 35000, 0, ..., 0, 0, 0],
       ['Maruti', 27000, 1, ..., 0, 0, 0],
       ['Maruti', 70000, 0, ..., 1, 0, 0]], dtype=object)

# OHE with top categories

In [None]:
# Encoding brand
counts = df['brand'].value_counts()

In [None]:
df['brand'].nunique()
threshold = 100

In [None]:
repl = counts[counts <= threshold].index

In [None]:
pd.get_dummies(df['brand'].replace(repl,'uncommon')).sample(5)

Unnamed: 0,BMW,Chevrolet,Ford,Honda,Hyundai,Mahindra,Maruti,Renault,Skoda,Tata,Toyota,Volkswagen,uncommon
7455,False,False,True,False,False,False,False,False,False,False,False,False,False
5557,False,False,False,False,False,False,True,False,False,False,False,False,False
4876,False,False,False,False,False,False,False,False,True,False,False,False,False
3191,False,False,False,False,False,False,True,False,False,False,False,False,False
6951,False,False,False,False,False,False,False,False,True,False,False,False,False


Since pandas 2.0, pd.get_dummies() returns boolean (bool) columns by default, so the output shows "True" and "False" instead of "1" and "0".
If you need integers, pass dtype=int.

In [None]:
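# Note: sample(5) runs before get_dummies here, so only the brands present in the sample become columns
pd.get_dummies(df['brand'].replace(repl, 'uncommon').sample(5), dtype=int)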

Unnamed: 0,BMW,Hyundai,Mahindra,Maruti
104,0,0,0,1
2725,0,1,0,0
1689,0,0,1,0
6568,1,0,0,0
5263,0,0,0,1


# Column Transformer

In [None]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import OrdinalEncoder , OneHotEncoder
from sklearn.impute import SimpleImputer

In [None]:
df = pd.read_csv('covid_toy.csv')

In [None]:
df.head()

Unnamed: 0,age,gender,fever,cough,city,has_covid
0,60,Male,103.0,Mild,Kolkata,No
1,27,Male,100.0,Mild,Delhi,Yes
2,42,Male,101.0,Mild,Delhi,No
3,31,Female,98.0,Mild,Kolkata,No
4,65,Female,101.0,Mild,Mumbai,No


Gender and city are nominal data: One-hot encoding

Cough is ordinal data : Ordinal encoding

has_covid is also categorical : Label encoding

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(df.iloc[:,:-1],df.iloc[:,-1],test_size = 0.2, random_state = 42)

In [None]:
X_train

Unnamed: 0,age,gender,fever,cough,city
55,81,Female,101.0,Mild,Mumbai
88,5,Female,100.0,Mild,Kolkata
26,19,Female,100.0,Mild,Kolkata
42,27,Male,100.0,Mild,Delhi
69,73,Female,103.0,Mild,Delhi
...,...,...,...,...,...
60,24,Female,102.0,Strong,Bangalore
71,75,Female,104.0,Strong,Delhi
14,51,Male,104.0,Mild,Bangalore
92,82,Female,102.0,Strong,Kolkata


In [None]:
from sklearn.compose import ColumnTransformer

In [None]:
transformer = ColumnTransformer(transformers =[
                               ('tnf1', SimpleImputer(),['fever']),
                               ('tnf2',OrdinalEncoder(categories=[['Mild','Strong']]),['cough']),
                               ('tnf3',OneHotEncoder(sparse_output = False, drop ='first'),['gender','city'])
                               ],remainder = 'passthrough')

In [None]:
transformer.fit_transform(X_train).shape

(80, 7)

In [None]:
transformer.transform(X_test).shape

(20, 7)