### Encoding Categorical Data

**Encoding categorical data** is the process of converting categorical variables into numerical values so that machine learning algorithms can process them. This step is essential because most machine learning algorithms require numerical input. 

#### Types of Encoding

1. **Ordinal Encoding**
2. **Label Encoding**
3. **One-Hot Encoding**

#### Ordinal Encoding

**Definition:**
Ordinal encoding assigns a unique integer to each category and is used when the categorical data has an inherent order.

**When to Use:**
- When the categories have a clear, ordered relationship.
- Examples include rating scales (e.g., low, medium, high), education levels (e.g., high school, bachelor's, master's).

**Why Use:**
- Maintains the ordinal relationship between categories.
- Simple and straightforward implementation.

**Example:**
For a feature "Size" with categories [Small, Medium, Large]:

| Size   | Ordinal Encoded |
|--------|-----------------|
| Small  | 0               |
| Medium | 1               |
| Large  | 2               |
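The mapping in the table above can be reproduced with plain pandas. This is a minimal sketch, assuming a small hypothetical `sizes` series; the names `sizes`, `size_order`, and `size_codes` are illustrative. The key point is that the order is stated explicitly rather than inferred from the strings.

```python
import pandas as pd

# Hypothetical toy data mirroring the "Size" table above.
sizes = pd.Series(["Small", "Large", "Medium", "Small"], name="Size")

# The ranking is declared by hand: Small < Medium < Large.
size_order = ["Small", "Medium", "Large"]

# An ordered Categorical exposes each value's position in size_order as an integer code.
size_codes = pd.Categorical(sizes, categories=size_order, ordered=True).codes

print(list(size_codes))
```

If the order were left out, pandas would sort the categories alphabetically (Large, Medium, Small), and the codes would no longer reflect size.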


#### Label Encoding

**Definition:**
Label encoding assigns a unique integer to each category but does not consider any order or relationship between the categories.

**When to Use:**
- When the categories have no natural order but must still be converted to integers, and the algorithm will not interpret those integers as a ranking.
- Often used for target variables in classification problems.

**Why Use:**
- Simple and efficient for converting categories to numerical values.
- Useful for target variables in classification.

**Example:**
For a target "purchased" with classes [No, Yes]:

| purchased | Label Encoded |
|-----------|---------------|
| No        | 0             |
| Yes       | 1             |





#### One-Hot Encoding

**Definition:**
One-hot encoding creates one binary column per category: a 1 marks the rows that belong to that category and a 0 marks all other rows.

**When to Use:**
- When the categories are nominal and do not have an inherent order.
- Examples include color (e.g., red, green, blue), geographical locations (e.g., USA, Canada, Mexico).

**Why Use:**
- Prevents the model from assuming a natural ordering between categories.
- Provides a clear representation of categorical data for algorithms that may be sensitive to integer encoding.

**Example:**
For a feature "Color" with categories [Red, Green, Blue]:

| Color | Red | Green | Blue |
|-------|-----|-------|------|
| Red   | 1   | 0     | 0    |
| Green | 0   | 1     | 0    |
| Blue  | 0   | 0     | 1    |
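The "Color" table above can be produced with `pandas.get_dummies`. This is a minimal sketch on a hypothetical three-row frame; the names `colors` and `one_hot` are illustrative, and scikit-learn's `OneHotEncoder` would be a reasonable alternative inside a pipeline.

```python
import pandas as pd

# Hypothetical toy data mirroring the "Color" table above.
colors = pd.DataFrame({"Color": ["Red", "Green", "Blue"]})

# One 0/1 column per category; dtype=int gives integers instead of booleans.
one_hot = pd.get_dummies(colors["Color"], dtype=int)

# get_dummies returns the columns in sorted order, so reorder them to match the table.
one_hot = one_hot[["Red", "Green", "Blue"]]

print(one_hot)
```

Each row contains exactly one 1, so no category is treated as larger or smaller than another.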


# Ordinal Encoding

In [1]:
import pandas as pd
import numpy as np

In [2]:
df = pd.read_csv('customer.csv')

In [3]:
df.sample(5)

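|    | age | gender | review  | education | purchased |
|----|-----|--------|---------|-----------|-----------|
| 19 | 97  | Male   | Poor    | PG        | Yes       |
| 37 | 94  | Male   | Average | PG        | Yes       |
| 11 | 74  | Male   | Good    | UG        | Yes       |
| 18 | 19  | Male   | Good    | School    | No        |
| 20 | 57  | Female | Average | School    | Yes       |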


In [4]:
# Keep only the categorical columns: review, education, purchased
df = df.iloc[:, 2:]

In [5]:
df.head()

|   | review  | education | purchased |
|---|---------|-----------|-----------|
| 0 | Average | School    | No        |
| 1 | Poor    | UG        | No        |
| 2 | Good    | PG        | No        |
| 3 | Good    | PG        | No        |
| 4 | Average | UG        | No        |


In [6]:
from sklearn.model_selection import train_test_split

In [7]:
# Features: review and education; target: purchased (last column)
X_train, X_test, y_train, y_test = train_test_split(df.iloc[:, :2], df.iloc[:, -1], test_size=0.1)

In [8]:
X_train.columns, X_test.columns

(Index(['review', 'education'], dtype='object'),
 Index(['review', 'education'], dtype='object'))

In [9]:
from sklearn.preprocessing import OrdinalEncoder

In [10]:
# One ordered category list per column, lowest to highest (review, then education)
oe = OrdinalEncoder(categories=[['Poor', 'Average', 'Good'], ['School', 'UG', 'PG']])

In [11]:
# Fit on the training split only
oe.fit(X_train)

print(y_train)
# Apply the same mapping to both splits
X_train_transformed = oe.transform(X_train)
X_test_transformed = oe.transform(X_test)

12     No
34     No
22    Yes
29    Yes
5     Yes
42    Yes
20    Yes
37    Yes
9     Yes
27     No
19    Yes
0      No
32    Yes
15     No
13     No
18     No
3      No
11    Yes
17    Yes
47    Yes
28     No
33    Yes
1      No
38     No
36    Yes
26     No
39     No
44     No
35    Yes
21     No
10    Yes
24    Yes
48    Yes
40     No
2      No
49     No
43     No
31    Yes
7     Yes
14    Yes
45    Yes
25     No
46     No
16    Yes
8      No
Name: purchased, dtype: object


In [12]:
X_train = pd.DataFrame(X_train_transformed, columns=X_train.columns)
X_test = pd.DataFrame(X_test_transformed, columns=X_test.columns)
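One detail worth noting: rebuilding a DataFrame from the transformed array without an `index` argument resets the row labels to 0, 1, 2, ..., while `y_train` keeps the shuffled labels from the split. The sketch below shows one way to keep them aligned, assuming a hypothetical one-column frame; `X_demo`, `oe_demo`, and `X_demo_encoded` are illustrative names, not part of the notebook.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical frame with a shuffled, non-default index, like a train/test split produces.
X_demo = pd.DataFrame({"review": ["Good", "Poor", "Average"]}, index=[12, 34, 22])

oe_demo = OrdinalEncoder(categories=[["Poor", "Average", "Good"]])

# Passing index=X_demo.index keeps the original row labels on the encoded frame.
X_demo_encoded = pd.DataFrame(
    oe_demo.fit_transform(X_demo),
    columns=X_demo.columns,
    index=X_demo.index,
)

print(X_demo_encoded)
```

This matters only if the features and target are later joined or compared by index; positional use, as in fitting a model, is unaffected.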

In [13]:
X_train.head()

|   | review | education |
|---|--------|-----------|
| 0 | 0.0    | 0.0       |
| 1 | 1.0    | 0.0       |
| 2 | 0.0    | 2.0       |
| 3 | 1.0    | 1.0       |
| 4 | 1.0    | 0.0       |


In [14]:
X_test.head()

|   | review | education |
|---|--------|-----------|
| 0 | 2.0    | 0.0       |
| 1 | 1.0    | 1.0       |
| 2 | 1.0    | 1.0       |
| 3 | 2.0    | 0.0       |
| 4 | 2.0    | 2.0       |


In [15]:
oe.categories_

[array(['Poor', 'Average', 'Good'], dtype=object),
 array(['School', 'UG', 'PG'], dtype=object)]
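By default, `OrdinalEncoder` raises an error when `transform` meets a category that is not in its category lists. The sketch below shows one possible way to tolerate unseen values, assuming scikit-learn 0.24 or later, where the `handle_unknown="use_encoded_value"` and `unknown_value` parameters are available. The frames `train_demo` and `new_demo` and the category "Excellent" are hypothetical.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical data: "Excellent" never appears in the declared categories.
train_demo = pd.DataFrame({"review": ["Poor", "Good", "Average"],
                           "education": ["School", "PG", "UG"]})
new_demo = pd.DataFrame({"review": ["Good", "Excellent"],
                         "education": ["UG", "PG"]})

oe_safe = OrdinalEncoder(
    categories=[["Poor", "Average", "Good"], ["School", "UG", "PG"]],
    handle_unknown="use_encoded_value",  # do not raise on unseen categories
    unknown_value=-1,                    # sentinel written in their place
)
oe_safe.fit(train_demo)

encoded_demo = oe_safe.transform(new_demo)
print(encoded_demo)
```

The sentinel -1 sits outside the valid codes 0 to 2, so unseen categories stay distinguishable downstream.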

<hr>

**Here only the independent variables (the features) are encoded. To encode the dependent variable (the target), we use `LabelEncoder`.**

# Label Encoder

In [16]:
from sklearn.preprocessing import LabelEncoder

In [17]:
le = LabelEncoder()

In [18]:
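# Learn the classes from the training target only
le.fit(y_train)

# transform raises an error if it meets a label that was not seen during fit
y_train_encoded = le.transform(y_train)
y_test_encoded = le.transform(y_test)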

In [19]:
le.classes_

array(['No', 'Yes'], dtype=object)
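The position of each label in `classes_` is its integer code, so 'No' becomes 0 and 'Yes' becomes 1. The sketch below shows the round trip on a small hypothetical target, using `inverse_transform` to turn integer predictions back into the original labels; `le_demo`, `y_demo`, `y_demo_encoded`, and `y_demo_decoded` are illustrative names.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical target values mirroring the "purchased" column.
y_demo = ["No", "Yes", "Yes", "No"]

le_demo = LabelEncoder()

# Classes are sorted, so "No" -> 0 and "Yes" -> 1.
y_demo_encoded = le_demo.fit_transform(y_demo)

# Map the integers back to the original string labels.
y_demo_decoded = le_demo.inverse_transform(y_demo_encoded)

print(list(y_demo_encoded), list(y_demo_decoded))
```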