# Encoding Categorical Data
Encoding categorical data involves converting categorical variables into numerical values, making them suitable for machine learning algorithms that work with numeric inputs. Two common techniques for encoding categorical data are Ordinal Encoding and Label Encoding.

### Categorical Data Encoding:

#### 1. **Label Encoding:**
   - **Definition:** Assigns a unique numerical label to each category in a categorical variable.
   - **Implementation:** Transforms categorical values into integers from 0 to $N-1$ (where $N$ is the number of unique categories).
   - **Example:** Scikit-learn's `LabelEncoder` sorts categories alphabetically, so 'Blue', 'Green', and 'Red' become 0, 1, and 2, respectively.
   - **Library Usage:** Scikit-learn provides `LabelEncoder`, which is designed for encoding the target variable (`y`) rather than input features.

#### 2. **Ordinal Encoding:**
   - **Definition:** Specifically used for ordinal categorical variables, assigning numerical labels based on their order or rank.
   - **Implementation:** Assigns integers based on the predefined order/rank of categories.
   - **Example:** Transforming categorical labels like 'Low', 'Medium', and 'High' to 0, 1, and 2, respectively, according to their order.
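   - **Library Usage:** Scikit-learn provides `OrdinalEncoder`, which accepts an explicit category order through its `categories` parameter. A manual mapping (for example, a Python dictionary) also works.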
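
A manual mapping can be sketched with a dictionary and pandas' `Series.map`. The sample values and the `rank` dictionary below are illustrative assumptions, not data from this notebook.

```python
import pandas as pd

# Hypothetical sample data (not from customer.csv)
levels = pd.Series(['Low', 'High', 'Medium', 'Low'])

# Explicit rank for each category, lowest to highest
rank = {'Low': 0, 'Medium': 1, 'High': 2}

# Replace each category with its rank; values missing from the dict become NaN
encoded = levels.map(rank)
print(encoded.tolist())  # [0, 2, 1, 0]
```

The dictionary makes the intended order visible in the code, which is the main reason to prefer it over relying on an encoder's default ordering.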

### When to Use Each Technique:

#### 1. **Label Encoding:**
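   - Suitable for the target variable (`y`), where the integer codes serve only as class identifiers and carry no order or hierarchy.
   - Not ideal for ordinal variables: `LabelEncoder` assigns codes in alphabetical order, which may not match the true rank and can introduce misleading relationships between categories.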

#### 2. **Ordinal Encoding:**
   - Specifically designed for ordinal categorical variables that have a clear order or hierarchy.
   - Preserves the inherent order of categories, maintaining the ordinal relationship in the encoded data.
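
The difference between the two encoders can be illustrated on the same ordinal values. This is a minimal sketch; the `levels` list is hypothetical sample data.

```python
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

# Hypothetical ordinal values
levels = ['Low', 'Medium', 'High', 'Low']

# LabelEncoder sorts alphabetically: High=0, Low=1, Medium=2
le_codes = LabelEncoder().fit_transform(levels)
print(le_codes.tolist())  # [1, 2, 0, 1]

# OrdinalEncoder follows the order given in `categories`: Low=0, Medium=1, High=2
oe = OrdinalEncoder(categories=[['Low', 'Medium', 'High']])
oe_codes = oe.fit_transform([[v] for v in levels]).ravel()
print(oe_codes.tolist())  # [0.0, 1.0, 2.0, 0.0]
```

Only the second encoding satisfies Low < Medium < High, so a model that treats the codes as magnitudes sees the intended ranking.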

### Example (Label Encoding using Python's Scikit-learn):

```python
from sklearn.preprocessing import LabelEncoder

# Sample categorical data
categories = ['Low', 'Medium', 'High', 'Low', 'High']

# Instantiate LabelEncoder
label_encoder = LabelEncoder()

# Fit and transform data
encoded_categories = label_encoder.fit_transform(categories)
# LabelEncoder sorts the classes alphabetically: High=0, Low=1, Medium=2
print(encoded_categories)  # Output: [1 2 0 1 0]
```
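
To check which integer was assigned to which category, the fitted encoder can be inspected. The sketch below repeats the sample data from the example above and uses the `classes_` attribute and `inverse_transform` method of `LabelEncoder`.

```python
from sklearn.preprocessing import LabelEncoder

categories = ['Low', 'Medium', 'High', 'Low', 'High']
label_encoder = LabelEncoder()
encoded = label_encoder.fit_transform(categories)

# classes_ lists the categories in the order of their integer codes
print(label_encoder.classes_.tolist())  # ['High', 'Low', 'Medium']

# inverse_transform maps the codes back to the original labels
decoded = label_encoder.inverse_transform(encoded)
print(decoded.tolist())  # ['Low', 'Medium', 'High', 'Low', 'High']
```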

Label Encoding and Ordinal Encoding both convert categorical data into numerical form. Label Encoding assigns integer labels without regard to any ranking among the categories, whereas Ordinal Encoding assigns integers that follow the categories' order. Choose the technique based on the nature of the categorical variable so that the encoded values preserve meaningful information for the machine learning model.

In [1]:
import numpy as np
import pandas as pd

In [2]:
df = pd.read_csv('customer.csv')

In [14]:
df.sample(5)

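|    | age | gender | review  | education | purchased |
|---:|----:|--------|---------|-----------|-----------|
| 35 |  74 | Male   | Poor    | School    | Yes       |
| 21 |  32 | Male   | Average | PG        | No        |
|  6 |  18 | Male   | Good    | School    | No        |
| 29 |  83 | Female | Average | UG        | Yes       |
| 40 |  39 | Male   | Good    | School    | No        |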


In [15]:
# Drop the first two columns (age, gender); keep review, education, purchased
df = df.iloc[:,2:]

In [16]:
df.head()

|    | review  | education | purchased |
|---:|---------|-----------|-----------|
|  0 | Average | School    | No        |
|  1 | Poor    | UG        | No        |
|  2 | Good    | PG        | No        |
|  3 | Good    | PG        | No        |
|  4 | Average | UG        | No        |


In [17]:
from sklearn.model_selection import train_test_split
# Features: review and education; target: purchased; 20% of rows held out for testing
X_train, X_test, y_train, y_test = train_test_split(df.iloc[:,:2], df.iloc[:,-1], test_size=0.2)

In [18]:
from sklearn.preprocessing import OrdinalEncoder
# One list per column, each ordered from lowest to highest rank
oe = OrdinalEncoder(categories=[['Poor', 'Average', 'Good'],['School', 'UG', 'PG']])

In [19]:
oe.fit(X_train)

In [20]:
X_train = oe.transform(X_train)
X_test = oe.transform(X_test)

In [21]:
X_train

array([[1., 0.],
       [2., 1.],
       [0., 1.],
       [2., 2.],
       [0., 2.],
       [0., 0.],
       [1., 1.],
       [2., 2.],
       [2., 2.],
       [1., 1.],
       [0., 2.],
       [2., 0.],
       [1., 2.],
       [2., 0.],
       [1., 2.],
       [1., 2.],
       [0., 2.],
       [2., 2.],
       [2., 1.],
       [1., 0.],
       [0., 1.],
       [2., 0.],
       [2., 0.],
       [0., 0.],
       [2., 1.],
       [1., 1.],
       [1., 0.],
       [2., 0.],
       [2., 0.],
       [0., 2.],
       [2., 1.],
       [0., 2.],
       [0., 2.],
       [1., 1.],
       [0., 2.],
       [2., 1.],
       [0., 0.],
       [2., 1.],
       [2., 2.],
       [1., 0.]])

In [22]:
X_test

array([[0., 0.],
       [1., 0.],
       [1., 1.],
       [0., 2.],
       [0., 0.],
       [0., 1.],
       [0., 2.],
       [0., 1.],
       [2., 2.],
       [1., 1.]])
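
By default, `OrdinalEncoder` raises an error when `transform` meets a category that is not in its category list. One option, sketched below, is to map such values to a sentinel with `handle_unknown='use_encoded_value'` (available in Scikit-learn 0.24 and later). The rows and the 'PhD' value are hypothetical and do not come from `customer.csv`.

```python
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical training rows: [review, education]
train = [['Poor', 'School'], ['Average', 'UG'], ['Good', 'PG']]
# Hypothetical new row with an education level missing from `categories`
new = [['Good', 'PhD']]

oe_safe = OrdinalEncoder(
    categories=[['Poor', 'Average', 'Good'], ['School', 'UG', 'PG']],
    handle_unknown='use_encoded_value',  # do not raise on unseen categories
    unknown_value=-1,                    # sentinel code for unseen categories
)
oe_safe.fit(train)
print(oe_safe.transform(new).tolist())  # [[2.0, -1.0]]
```

Whether a sentinel is appropriate depends on the downstream model, since -1 will be read as a value below the lowest rank.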

In [23]:
from sklearn.preprocessing import LabelEncoder
# Encode the target column; classes are sorted, so 'No' becomes 0 and 'Yes' becomes 1
le = LabelEncoder()

In [24]:
le.fit(y_train)

In [25]:
y_train = le.transform(y_train)
y_test = le.transform(y_test)

In [26]:
y_train

array([1, 1, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 0, 1, 1, 1, 0, 0, 0,
       0, 0, 1, 1, 1, 0, 0, 1, 1, 0, 0, 0, 1, 1, 0, 0, 1, 0])

In [27]:
y_test

array([1, 0, 0, 0, 1, 1, 1, 1, 1, 0])
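
For readers without `customer.csv`, the whole workflow can be reproduced as one self-contained sketch. The small DataFrame below is a hypothetical stand-in with the same three columns, and `random_state=0` is an added assumption to make the split repeatable; neither is part of the original notebook.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

# Hypothetical stand-in for customer.csv
df = pd.DataFrame({
    'review':    ['Poor', 'Average', 'Good', 'Good', 'Average', 'Poor', 'Good', 'Average'],
    'education': ['School', 'UG', 'PG', 'PG', 'UG', 'School', 'UG', 'PG'],
    'purchased': ['No', 'Yes', 'No', 'Yes', 'No', 'Yes', 'No', 'Yes'],
})

X = df[['review', 'education']]
y = df['purchased']

# random_state is an assumption added for repeatability
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Fit the encoders on the training split only, then apply them to both splits
oe = OrdinalEncoder(categories=[['Poor', 'Average', 'Good'], ['School', 'UG', 'PG']])
X_train_enc = oe.fit_transform(X_train)
X_test_enc = oe.transform(X_test)

le = LabelEncoder()
y_train_enc = le.fit_transform(y_train)
y_test_enc = le.transform(y_test)

print(X_train_enc.shape, X_test_enc.shape)  # (6, 2) (2, 2)
print(le.classes_.tolist())  # ['No', 'Yes']
```

Fitting on the training split and only transforming the test split mirrors the cells above and keeps information from the test rows out of the fitted encoders.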