# **Encoding Categorical Data**
Encoding categorical data is an essential step in preparing data for machine learning, since most algorithms require numerical input. Categorical data holds non-numeric values such as categories, labels, or classes.

In Python, you can use various techniques to encode categorical data, and the choice of encoding method depends on the nature of your data and the machine learning algorithm you plan to use.

## **Import Required Libraries**

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

## **Read the Data**

In [2]:
# Raw string so the backslashes in the Windows path are not treated as escape sequences
df = pd.read_csv(r"D:\Coding\Datasets\customer.csv")
df.head()

Unnamed: 0,age,gender,review,education,purchased
0,30,Female,Average,School,No
1,68,Female,Poor,UG,No
2,70,Female,Good,PG,No
3,72,Female,Good,PG,No
4,16,Female,Average,UG,No


In [3]:
df.shape

(50, 5)

In [4]:
# Extracting the 'review', 'education' and 'purchased' columns from the dataframe
df = df.iloc[:, 2:]
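
Positional slicing with `iloc` depends on the column order in the file. As a minimal sketch, assuming the column names shown in the output above, the same columns can be selected by name. The `sample` frame below is a hypothetical stand-in for customer.csv, not the real data.

```python
import pandas as pd

# Hypothetical stand-in for customer.csv with the same column layout
sample = pd.DataFrame({
    "age": [30, 68, 70],
    "gender": ["Female", "Female", "Female"],
    "review": ["Average", "Poor", "Good"],
    "education": ["School", "UG", "PG"],
    "purchased": ["No", "No", "No"],
})

# Selection by position, as in the cell above
by_position = sample.iloc[:, 2:]

# Selection by name, which keeps working if columns are reordered
by_name = sample[["review", "education", "purchased"]]

print(by_name.equals(by_position))  # True
```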

In [5]:
df.head()

Unnamed: 0,review,education,purchased
0,Average,School,No
1,Poor,UG,No
2,Good,PG,No
3,Good,PG,No
4,Average,UG,No


## **Train Test Split**

In [6]:
from sklearn.model_selection import train_test_split

In [7]:
x_train, x_test, y_train, y_test = train_test_split(df.drop("purchased", axis=1),
                                                    df["purchased"],
                                                    test_size=0.3,
                                                    random_state=0)
x_train.shape, x_test.shape

((35, 2), (15, 2))
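
The split is done before any encoder is fitted, so the encoders only ever see training rows. The sketch below reproduces the 35/15 split sizes on a hypothetical 50-row frame (`toy` is made-up data, not the customer dataset) and checks that features and target stay aligned.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical 50-row frame standing in for the customer data
toy = pd.DataFrame({
    "review": ["Poor", "Average", "Good", "Good", "Average"] * 10,
    "education": ["School", "UG", "PG", "UG", "PG"] * 10,
    "purchased": ["No", "Yes"] * 25,
})

X_tr, X_te, y_tr, y_te = train_test_split(
    toy.drop("purchased", axis=1),
    toy["purchased"],
    test_size=0.3,   # 30% of 50 rows -> 15 test rows
    random_state=0,  # fixed seed for a repeatable split
)

print(X_tr.shape, X_te.shape)  # (35, 2) (15, 2)

# Features and target keep the same row labels after the split
print(X_tr.index.equals(y_tr.index))  # True
```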

## **Ordinal Encoding**
Ordinal encoding is a technique for encoding categorical data where the categories have a meaningful order or ranking. This method assigns a unique integer value to each category based on its order or priority. Ordinal encoding is appropriate when the categorical data represents ordered or ranked values, such as "low," "medium," and "high" or "small," "medium," "large."
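
To make the idea concrete, here is a minimal sketch of ordinal encoding done by hand with a plain dictionary. The `size_order` mapping and the sample values are illustrative assumptions, not part of the customer dataset.

```python
import pandas as pd

# Explicit rank for each category, lowest first (illustrative mapping)
size_order = {"small": 0, "medium": 1, "large": 2}

sizes = pd.Series(["medium", "small", "large", "small"])

# Replace each category with its rank
encoded = sizes.map(size_order)

print(encoded.tolist())  # [1, 0, 2, 0]
```

Because the integers follow the ranking, comparisons such as `encoded > 0` keep their meaning ("larger than small").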

In [8]:
from sklearn.preprocessing import OrdinalEncoder

In [9]:
# Checking the unique values in each column
print("Unique values in each column:")
for col in df.columns:
    print(f"{col}: {df[col].unique()}")

Unique values in each column:
review: ['Average' 'Poor' 'Good']
education: ['School' 'UG' 'PG']
purchased: ['No' 'Yes']


In [10]:
# Create the encoder, listing each column's categories from lowest to highest rank
ordinal_encoder = OrdinalEncoder(categories=[["Poor", "Average", "Good"], ["School", "UG", "PG"]],
                                 dtype=np.int8)
# Fit the encoder on the training data only
ordinal_encoder.fit(x_train)

# Transform the training and testing data
x_train_encoded = ordinal_encoder.transform(x_train)
x_test_encoded = ordinal_encoder.transform(x_test)

In [11]:
ordinal_encoder.categories_

[array(['Poor', 'Average', 'Good'], dtype=object),
 array(['School', 'UG', 'PG'], dtype=object)]
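
The position of each category in `categories_` is the integer it is encoded as. A small sketch on a hypothetical three-row frame (`toy` is made-up data) shows the codes and how `inverse_transform` maps them back to the original labels.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical rows using the same category names as the dataset
toy = pd.DataFrame({
    "review": ["Poor", "Good", "Average"],
    "education": ["UG", "School", "PG"],
})

encoder = OrdinalEncoder(
    categories=[["Poor", "Average", "Good"], ["School", "UG", "PG"]],
    dtype=np.int8,
)

# Each value becomes its index in the corresponding category list
codes = encoder.fit_transform(toy)
print(codes.tolist())  # [[0, 1], [2, 0], [1, 2]]

# inverse_transform recovers the original labels from the codes
restored = encoder.inverse_transform(codes)
print(restored.tolist())
```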

In [12]:
# Converting the encoded array into pandas dataframe
x_train_encoded = pd.DataFrame(x_train_encoded, columns=["review", "education"])
x_test_encoded = pd.DataFrame(x_test_encoded, columns=["review", "education"])

In [13]:
# Print the non-encoded training data
x_train.head()

Unnamed: 0,review,education
7,Poor,School
14,Poor,PG
45,Poor,PG
48,Good,UG
29,Average,UG


In [14]:
# Print the encoded training data
x_train_encoded.head()

Unnamed: 0,review,education
0,0,0
1,0,2
2,0,2
3,2,1
4,1,1


In [15]:
# Print the encoded testing data
x_test_encoded.head()

Unnamed: 0,review,education
0,0,0
1,2,1
2,2,1
3,2,2
4,2,2
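
Since the encoder is fitted on the training split only, new data may contain a category it has never seen, and by default `transform` raises an error in that case. The sketch below, assuming scikit-learn 0.24 or later, uses the `handle_unknown` and `unknown_value` parameters to map unseen categories to -1 instead. The "PhD" value is a hypothetical category that does not appear in the dataset.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

train = pd.DataFrame({"education": ["School", "UG", "PG"]})
# "PhD" is a hypothetical category that was never seen during fitting
test = pd.DataFrame({"education": ["UG", "PhD"]})

encoder = OrdinalEncoder(
    categories=[["School", "UG", "PG"]],
    handle_unknown="use_encoded_value",  # do not raise on unseen categories
    unknown_value=-1,                    # code assigned to unseen categories
)
encoder.fit(train)

result = encoder.transform(test).ravel().tolist()
print(result)  # [1.0, -1.0]
```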


## **Label Encoding**
Label encoding converts categorical values into integers by assigning each category a unique integer label. The integers act only as class identifiers, so this encoding suits categories with no inherent order or ranking, most commonly the target column.

You can use the **'LabelEncoder'** class from the sklearn.preprocessing module to perform label encoding. It encodes target labels with values between 0 and n_classes-1. This transformer should be used to encode the target values, i.e. y, and not the input X.
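
One reason to keep `LabelEncoder` for the target is that it numbers the classes in sorted order and offers no way to specify a ranking. The sketch below applies it to the review categories purely as an illustration of that behaviour.

```python
from sklearn.preprocessing import LabelEncoder

reviews = ["Poor", "Average", "Good", "Poor"]

encoder = LabelEncoder()
codes = encoder.fit_transform(reviews)

# Classes are sorted alphabetically, not by meaning
print(encoder.classes_.tolist())  # ['Average', 'Good', 'Poor']
print(codes.tolist())             # [2, 0, 1, 2]
```

Here "Poor" receives the largest code, which would mislead a model that treats the column as ordered. For a binary target such as "No"/"Yes" the numbering carries no such risk.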

In [16]:
from sklearn.preprocessing import LabelEncoder

In [17]:
# Creating an object of the label encoder class
label_encoder = LabelEncoder()

# Fit the encoder on the training labels
label_encoder.fit(y_train)

# Transform the training and testing data
y_train_encoded = label_encoder.transform(y_train)
y_test_encoded = label_encoder.transform(y_test)

In [18]:
label_encoder.classes_

array(['No', 'Yes'], dtype=object)
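
The index of each entry in `classes_` is the integer assigned to that label. A minimal sketch, using a hypothetical list of labels rather than `y_train`, builds that mapping explicitly and uses `inverse_transform` to turn integer predictions back into labels.

```python
from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder()
encoder.fit(["No", "Yes", "Yes", "No"])  # hypothetical labels

# Position in classes_ is the encoded value
mapping = {str(label): code for code, label in enumerate(encoder.classes_)}
print(mapping)  # {'No': 0, 'Yes': 1}

# Convert encoded values (for example, model predictions) back to labels
decoded = encoder.inverse_transform([1, 0, 1])
print(decoded.tolist())  # ['Yes', 'No', 'Yes']
```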

In [19]:
# Print the y_train data
y_train.head(10)

7     Yes
14    Yes
45    Yes
48    Yes
29    Yes
15     No
30     No
32    Yes
16    Yes
42    Yes
Name: purchased, dtype: object

In [20]:
# Print the y_train_encoded data
y_train_encoded

array([1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1,
       1, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 1, 0])

In [21]:
# Print the y_test_encoded data
y_test_encoded

array([0, 1, 1, 1, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0])