# Understanding Encoding

Most machine learning models expect numerical input, so **categorical data** must be converted to numbers before training. Categorical data comes in two types:

## 1. Nominal Categorical Data
- Categories have **no inherent order**.  
- Example: `Color = {Red, Blue, Green}`  
- Categories cannot be ranked or compared numerically; no category is greater than another.

## 2. Ordinal Categorical Data
- Categories have a **specific order**.  
- Example: `Size = {Small < Medium < Large}`  
- Can be mapped to numbers representing their order.
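
As a rough illustration of mapping an ordered category to numbers, the sketch below uses a pandas ordered categorical. The `sizes` series is made-up toy data, not part of the customer dataset used later.

```python
import pandas as pd

# Hypothetical toy column; the values are illustrative only.
sizes = pd.Series(["Medium", "Small", "Large", "Small"])

# Declare the order explicitly so pandas does not fall back to alphabetical order.
size_type = pd.CategoricalDtype(categories=["Small", "Medium", "Large"], ordered=True)
ordered_sizes = sizes.astype(size_type)

# .cat.codes gives the zero-based position of each value in the declared order.
size_codes = ordered_sizes.cat.codes.tolist()
print(size_codes)  # [1, 0, 2, 0]
```

Because the dtype is ordered, comparisons such as `ordered_sizes > "Small"` are also meaningful, which would not be the case for a nominal column.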

---

## Encoding Methods

To use categorical data in models, we convert it to numbers:

### 1. Ordinal Encoding
- Used for **ordinal input columns**.  
- Assigns numbers based on the order of categories.  
- Example: `Small → 0, Medium → 1, Large → 2` (scikit-learn's `OrdinalEncoder` numbers categories from 0)
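
One detail worth seeing in code: if no order is supplied, `OrdinalEncoder` sorts the categories, which for strings means alphabetical order. The minimal sketch below compares the default with an explicit order on a made-up `Size` column.

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

# Toy single-column input (illustrative); OrdinalEncoder expects 2D data.
X = np.array([["Small"], ["Large"], ["Medium"]], dtype=object)

# Default: categories are sorted, so Large=0, Medium=1, Small=2.
auto_codes = OrdinalEncoder().fit_transform(X).ravel().tolist()

# Explicit order: Small=0, Medium=1, Large=2.
explicit = OrdinalEncoder(categories=[["Small", "Medium", "Large"]])
ordered_codes = explicit.fit_transform(X).ravel().tolist()

print(auto_codes)     # [2.0, 0.0, 1.0]
print(ordered_codes)  # [0.0, 2.0, 1.0]
```

Passing `categories` explicitly is what makes the codes reflect the real ranking rather than the spelling of the labels.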

### 2. Label Encoding
- Used for the **target/output column** in classification problems.  
- Converts labels into numerical values.  
- Example: `Yes → 1, No → 0`
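
A small sketch of label encoding on a made-up target list follows. `LabelEncoder` sorts the classes, so `No` becomes 0 and `Yes` becomes 1, and `inverse_transform` recovers the original labels.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical target values, for illustration only.
labels = ["Yes", "No", "No", "Yes"]

le = LabelEncoder()
encoded = le.fit_transform(labels).tolist()

# classes_ holds the sorted labels; each label's index is its code.
classes = le.classes_.tolist()

# inverse_transform maps the integer codes back to the original labels.
decoded = le.inverse_transform(encoded).tolist()

print(classes)  # ['No', 'Yes']
print(encoded)  # [1, 0, 0, 1]
print(decoded)  # ['Yes', 'No', 'No', 'Yes']
```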

> **Note:** For nominal input columns, we usually use **One-Hot Encoding** to avoid implying an order.
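
To make the note concrete, here is a minimal one-hot sketch on a made-up `Color` column. It calls `.toarray()` on the default sparse result so that it does not depend on the name of the dense-output parameter, which differs between scikit-learn versions.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Toy nominal column (illustrative only).
colors = np.array([["Red"], ["Blue"], ["Green"], ["Blue"]], dtype=object)

ohe = OneHotEncoder()
# fit_transform returns a sparse matrix by default; convert it to a dense array.
one_hot = ohe.fit_transform(colors).toarray()

# Output columns follow the sorted categories.
columns = ohe.categories_[0].tolist()

print(columns)  # ['Blue', 'Green', 'Red']
print(one_hot.tolist())
# [[0.0, 0.0, 1.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [1.0, 0.0, 0.0]]
```

Each row contains a single 1, so no color is numerically larger than another.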


In [58]:
import pandas as pd

In [59]:
dataframe = pd.read_csv('../../EDA/data/customer.csv')

In [60]:
dataframe.head()

,age,gender,review,education,purchased
0,30,Female,Average,School,No
1,68,Female,Poor,UG,No
2,70,Female,Good,PG,No
3,72,Female,Good,PG,No
4,16,Female,Average,UG,No


In [61]:
dataframe.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50 entries, 0 to 49
Data columns (total 5 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   age        50 non-null     int64 
 1   gender     50 non-null     object
 2   review     50 non-null     object
 3   education  50 non-null     object
 4   purchased  50 non-null     object
dtypes: int64(1), object(4)
memory usage: 2.1+ KB


In [62]:
# Keep only the categorical columns: review, education, purchased
df = dataframe.iloc[:, 2:]
df

,review,education,purchased
0,Average,School,No
1,Poor,UG,No
2,Good,PG,No
3,Good,PG,No
4,Average,UG,No
5,Average,School,Yes
6,Good,School,No
7,Poor,School,Yes
8,Average,UG,No
9,Good,UG,Yes


In [63]:
from sklearn.model_selection import train_test_split
# Features are all columns except the last; the target is the last column (purchased)
X_train, X_test, y_train, y_test = train_test_split(df.iloc[:, :-1], df.iloc[:, -1], test_size=0.2)

In [64]:
X_train.head(), y_train.head()

(     review education
 31     Poor    School
 23     Good    School
 33     Good        PG
 6      Good    School
 44  Average        UG,
 31    Yes
 23     No
 33    Yes
 6      No
 44     No
 Name: purchased, dtype: object)

In [65]:
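from sklearn.preprocessing import OrdinalEncoder

# List each column's categories from lowest to highest so the codes preserve the order
oe = OrdinalEncoder(categories=[['Poor', 'Average', 'Good'], ['School', 'UG', 'PG']])
# Fit on the training split only, then apply the same mapping to both splits
oe.fit(X_train)
X_train = oe.transform(X_train)
X_test = oe.transform(X_test)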
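
Since the encoder is fitted on the training split only, the test split could contain a category it has never seen. By default `transform` raises an error in that case. The sketch below shows one possible way to handle it, assuming scikit-learn 0.24 or later; the toy frames and the `Excellent` and `PhD` values are made up for illustration and do not come from `customer.csv`.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical train/test frames, not the notebook's data.
train = pd.DataFrame({"review": ["Poor", "Good", "Average"],
                      "education": ["School", "PG", "UG"]})
test = pd.DataFrame({"review": ["Good", "Excellent"],
                     "education": ["UG", "PhD"]})

oe_safe = OrdinalEncoder(
    categories=[["Poor", "Average", "Good"], ["School", "UG", "PG"]],
    handle_unknown="use_encoded_value",  # do not raise on unseen categories
    unknown_value=-1,                    # code assigned to unseen categories
)
oe_safe.fit(train)

test_codes = oe_safe.transform(test).tolist()
print(test_codes)  # [[2.0, 1.0], [-1.0, -1.0]]
```

Whether `-1` is a sensible placeholder depends on the downstream model, so treat this as a starting point rather than a recommendation.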

In [None]:
X_train

array([[0., 0.],
       [2., 0.],
       [2., 2.],
       [2., 0.],
       [1., 1.],
       [0., 0.],
       [0., 2.],
       [2., 1.],
       [1., 1.],
       [2., 2.],
       [1., 1.],
       [1., 0.],
       [2., 0.],
       [1., 0.],
       [2., 2.],
       [0., 1.],
       [0., 2.],
       [0., 2.],
       [1., 2.],
       [0., 2.],
       [1., 0.],
       [2., 1.],
       [1., 1.],
       [2., 2.],
       [0., 0.],
       [2., 2.],
       [2., 0.],
       [0., 2.],
       [2., 1.],
       [1., 0.],
       [0., 0.],
       [2., 0.],
       [1., 1.],
       [0., 1.],
       [0., 1.],
       [0., 2.],
       [1., 2.],
       [2., 1.],
       [0., 2.],
       [2., 2.]])

In [67]:
oe.categories_


[array(['Poor', 'Average', 'Good'], dtype=object),
 array(['School', 'UG', 'PG'], dtype=object)]
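
The position of each label in `categories_` is its code. As a rough sketch on a small made-up frame with the same two columns, the mapping can be written out explicitly and codes can be turned back into labels with `inverse_transform`:

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Toy frame with the same column names as the notebook (values are illustrative).
X = pd.DataFrame({"review": ["Poor", "Good", "Average"],
                  "education": ["UG", "School", "PG"]})

enc = OrdinalEncoder(categories=[["Poor", "Average", "Good"], ["School", "UG", "PG"]])
enc.fit(X)

# Build one {label: code} dictionary per column from categories_.
mappings = [{cat: code for code, cat in enumerate(cats)} for cats in enc.categories_]
print(mappings[0])  # {'Poor': 0, 'Average': 1, 'Good': 2}
print(mappings[1])  # {'School': 0, 'UG': 1, 'PG': 2}

# inverse_transform converts codes back to the original labels.
restored = enc.inverse_transform([[2.0, 0.0]]).tolist()
print(restored)  # [['Good', 'School']]
```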

In [68]:
from sklearn.preprocessing import LabelEncoder

# LabelEncoder is meant for the target column; it sorts the classes and numbers them from 0
le = LabelEncoder()
le.fit(y_train)


In [69]:
le.classes_

array(['No', 'Yes'], dtype=object)

In [70]:
y_train = le.transform(y_train)
y_test = le.transform(y_test)

In [71]:
y_train

array([1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 1,
       0, 1, 0, 1, 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0])