# Day 24: Encoding Categorical Data

In machine learning, we often come across datasets with categorical data. Categorical data is non-numerical data that is represented by labels or names. Examples of categorical data include gender, color, and type of car.

Broadly, data falls into two types: numerical and categorical. Numerical data is measured or counted and is represented by numbers, whereas categorical data assigns each observation to one of a set of groups and is represented by labels or names.


## Types of Categorical Data

Categorical data can be further classified into two types:
- Nominal data
- Ordinal data

Nominal data is categorical data with no inherent order or ranking: its categories cannot be meaningfully sorted. Colors, gender, and countries are examples of nominal data.

Ordinal data, on the other hand, is categorical data whose categories have an inherent order or ranking. Examples include educational level (high school, bachelor's degree, master's degree) and customer ratings (poor, fair, good, excellent).
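The difference between the two types can be sketched with pandas categoricals. This is only an illustration: the sample values and the variable names `colors` and `ratings` are assumptions, not part of the dataset used later.

```python
import pandas as pd

# Nominal: no order is declared, so only equality checks are meaningful.
colors = pd.Categorical(["red", "green", "blue", "green"])

# Ordinal: declare the order explicitly so min/max and sorting respect it.
ratings = pd.Categorical(
    ["good", "poor", "excellent", "fair"],
    categories=["poor", "fair", "good", "excellent"],
    ordered=True,
)

print(colors.ordered)                # False
print(ratings.ordered)               # True
print(ratings.min(), ratings.max())  # poor excellent
print(ratings.codes.tolist())        # [2, 0, 3, 1]
```

The integer codes of the ordered categorical follow the declared ranking, which is the idea that ordinal encoding builds on.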


## Types of Encoding

To use categorical data in machine learning algorithms, we need to convert it into numerical values. There are several ways to encode categorical data, including:

- Ordinal Encoding
- One Hot Encoding
- Label Encoding
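The sections and example that follow cover ordinal and label encoding. For one hot encoding, a minimal sketch with `pandas.get_dummies` might look like this; the toy column `color` and the names `df_colors` and `one_hot` are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical toy data; not part of customer.csv.
df_colors = pd.DataFrame({"color": ["Red", "Green", "Blue", "Green"]})

# One indicator column per category; dtype=int gives 0/1 instead of booleans.
one_hot = pd.get_dummies(df_colors, columns=["color"], dtype=int)

print(one_hot.columns.tolist())  # ['color_Blue', 'color_Green', 'color_Red']
print(one_hot.iloc[0].tolist())  # [0, 0, 1]  (the first row is "Red")
```

Each row has exactly one 1, so no artificial order is introduced between the categories.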


## Ordinal Encoding

Ordinal encoding is a process of converting ordinal data into numerical data. Ordinal data has a specific order, and ordinal encoding assigns a unique numerical value to each category based on its rank or position in the order.

For example, if we have a dataset with a column for education level that includes categories like "High School", "College", and "Graduate School", we can use ordinal encoding to assign the values 1, 2, and 3 respectively to these categories based on their rank in the order. Any increasing sequence preserves the order; scikit-learn's `OrdinalEncoder`, used in the example below, starts from 0.
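As a rough sketch of the education example, assuming scikit-learn is installed, the order can be passed to `OrdinalEncoder` through its `categories` parameter. The sample rows and the names `education`, `encoder`, and `encoded` are illustrative assumptions.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical sample column; OrdinalEncoder expects 2-D input, hence a DataFrame.
education = pd.DataFrame(
    {"education": ["College", "High School", "Graduate School", "College"]}
)

# Categories are listed from lowest to highest rank.
encoder = OrdinalEncoder(categories=[["High School", "College", "Graduate School"]])
encoded = encoder.fit_transform(education)

print(encoded.ravel())  # [1. 0. 2. 1.]
```

The encoder numbers the categories from 0, so adding 1 to the result reproduces the 1, 2, 3 values described above.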

## Label Encoding

Label encoding converts nominal data into numerical data by assigning a unique integer to each category. Unlike ordinal encoding, the integers carry no order or ranking; they are arbitrary identifiers. Because a model could misread them as ordered, label encoding should only be applied to the output data, that is, the target variable of a machine learning model.

For example, if we have a dataset with a column for color that includes categories like "Red", "Green", and "Blue", we can use label encoding to assign the values 1, 2, and 3 respectively to these categories.
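A small sketch of the color example with scikit-learn's `LabelEncoder` follows. The list `colors` and the variable names are assumptions; note that this encoder sorts the classes and numbers them from 0, so the integers differ from the 1, 2, 3 used in the prose.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical labels; in practice this would be the target column.
colors = ["Red", "Green", "Blue", "Green"]

label_encoder = LabelEncoder()
codes = label_encoder.fit_transform(colors)

# Classes are stored in sorted order; a label's code is its index in classes_.
print(label_encoder.classes_.tolist())                   # ['Blue', 'Green', 'Red']
print(codes.tolist())                                    # [2, 1, 0, 1]
print(label_encoder.inverse_transform([0, 2]).tolist())  # ['Blue', 'Red']
```

The codes only identify the classes; "Red" receiving 2 does not mean it ranks above "Blue".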

## Example

```python
import pandas as pd
import numpy as np

df = pd.read_csv('customer.csv')

df.sample(10)
```

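|    | age | gender | review  | education | purchased |
|----|-----|--------|---------|-----------|-----------|
| 34 | 86  | Male   | Average | School    | No        |
| 36 | 34  | Female | Good    | UG        | Yes       |
| 19 | 97  | Male   | Poor    | PG        | Yes       |
| 13 | 57  | Female | Average | School    | No        |
| 16 | 59  | Male   | Poor    | UG        | Yes       |
| 22 | 18  | Female | Poor    | PG        | Yes       |
| 46 | 64  | Female | Poor    | PG        | No        |
| 11 | 74  | Male   | Good    | UG        | Yes       |
| 8  | 65  | Female | Average | UG        | No        |
| 47 | 38  | Female | Good    | PG        | Yes       |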


```python
# Keep only the review, education, and purchased columns
df = df.iloc[:, 2:]

df.head()
```

|   | review  | education | purchased |
|---|---------|-----------|-----------|
| 0 | Average | School    | No        |
| 1 | Poor    | UG        | No        |
| 2 | Good    | PG        | No        |
| 3 | Good    | PG        | No        |
| 4 | Average | UG        | No        |


```python
from sklearn.model_selection import train_test_split

# Features: review and education; target: purchased
X_train, X_test, y_train, y_test = train_test_split(df.iloc[:, 0:2], df.iloc[:, -1], test_size=0.2)

from sklearn.preprocessing import OrdinalEncoder

# List each column's categories from lowest to highest rank
oe = OrdinalEncoder(categories=[['Poor', 'Average', 'Good'], ['School', 'UG', 'PG']])

# Learn the mapping from the training set, then apply it to both splits
oe.fit(X_train)

X_train = oe.transform(X_train)
X_test = oe.transform(X_test)

X_train
```


```
array([[2., 2.],
       [2., 2.],
       [0., 2.],
       [1., 1.],
       [0., 2.],
       [1., 2.],
       [0., 1.],
       [1., 2.],
       [2., 2.],
       [2., 1.],
       [2., 0.],
       [1., 1.],
       [2., 1.],
       [1., 1.],
       [0., 1.],
       [1., 1.],
       [0., 1.],
       [1., 0.],
       [0., 2.],
       [1., 0.],
       [2., 2.],
       [0., 2.],
       [1., 0.],
       [0., 0.],
       [1., 1.],
       [0., 0.],
       [0., 0.],
       [2., 0.],
       [2., 1.],
       [0., 2.],
       [1., 0.],
       [0., 0.],
       [1., 1.],
       [2., 0.],
       [1., 2.],
       [0., 2.],
       [2., 2.],
       [2., 1.],
       [0., 2.],
       [2., 1.]])
```

```python
X_test
```

```
array([[0., 1.],
       [2., 1.],
       [2., 2.],
       [1., 0.],
       [2., 0.],
       [2., 0.],
       [0., 2.],
       [2., 0.],
       [0., 2.],
       [0., 0.]])
```

```python
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()

# Learn the target classes from the training labels
le.fit(y_train)

le.classes_
```

```
array(['No', 'Yes'], dtype=object)
```

```python
y_train = le.transform(y_train)
y_test = le.transform(y_test)

y_train
```

```
array([1, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 1, 1,
       1, 1, 0, 1, 0, 0, 1, 1, 1, 0, 1, 0, 1, 0, 1, 1, 0, 0])
```
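The walkthrough above depends on `customer.csv`. For readers without that file, the same steps can be sketched on a small inline DataFrame. Everything specific here is an assumption made for illustration: the ten hypothetical rows in `toy`, the names `oe_demo` and `le_demo`, and `random_state=0`, which the original split does not set.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

# Hypothetical stand-in for customer.csv with the same three columns.
toy = pd.DataFrame({
    "review": ["Poor", "Average", "Good", "Good", "Poor",
               "Average", "Good", "Poor", "Average", "Good"],
    "education": ["School", "UG", "PG", "UG", "PG",
                  "School", "School", "UG", "PG", "UG"],
    "purchased": ["No", "Yes", "Yes", "No", "Yes",
                  "No", "Yes", "No", "Yes", "No"],
})

# random_state is an added assumption so the split is repeatable.
X_tr, X_te, y_tr, y_te = train_test_split(
    toy[["review", "education"]], toy["purchased"], test_size=0.2, random_state=0
)

# Ordinal encoding for the ordered input features.
oe_demo = OrdinalEncoder(
    categories=[["Poor", "Average", "Good"], ["School", "UG", "PG"]]
)
X_tr_enc = oe_demo.fit_transform(X_tr)
X_te_enc = oe_demo.transform(X_te)

# Label encoding for the target variable.
le_demo = LabelEncoder()
y_tr_enc = le_demo.fit_transform(y_tr)
y_te_enc = le_demo.transform(y_te)

# inverse_transform maps the numbers back to the original labels.
print(oe_demo.inverse_transform(X_tr_enc[:2]))
print(le_demo.inverse_transform(y_tr_enc[:2]))
```

Both encoders are fitted on the training split only and then reused on the test split, mirroring the steps in the example.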

## Conclusion

Categorical data is an important type of data in machine learning, and encoding it into numerical values is essential for using it in machine learning algorithms. Depending on the type of categorical data, we can use different encoding techniques such as ordinal encoding, one hot encoding, and label encoding to convert it into numerical values.