# Data Encoding
1. Nominal/OHE Encoding
2. Label and Ordinal Encoding
3. Target Guided Ordinal Encoding

## Nominal/OHE (One-Hot) Encoding

One-hot encoding converts a categorical variable into numerical columns that data analysis and machine learning tools can work with. It is intended for nominal variables, whose categories have no inherent order.

**How One-Hot Encoding Works**

One-hot encoding works by creating binary (0 or 1) columns for each unique category within a categorical variable. Each binary column represents the presence (1) or absence (0) of a particular category for each data point.

For example, if you have a categorical variable "Color" with values ["Red", "Blue", "Green"], one-hot encoding will create three binary columns: "Red," "Blue," and "Green." Each data point will have a 1 in the column corresponding to its color and 0s in the other columns.
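To make the mechanism concrete, here is a minimal hand-rolled sketch using NumPy only. The list `colors` and the helper names are made up for illustration; in practice a library encoder does this work for you.

```python
import numpy as np

colors = ["Red", "Blue", "Green", "Red"]

# One column per unique category; sorting keeps the column order deterministic.
categories = sorted(set(colors))  # ['Blue', 'Green', 'Red']
column_of = {category: i for i, category in enumerate(categories)}

# Start from all zeros, then set exactly one 1 per row.
one_hot = np.zeros((len(colors), len(categories)), dtype=int)
for row, color in enumerate(colors):
    one_hot[row, column_of[color]] = 1

print(categories)
print(one_hot)
```

Each row contains a single 1, in the column that matches that row's color.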

**When to Use One-Hot Encoding**

One-hot encoding is useful when dealing with categorical data, especially in scenarios like:

- Machine learning models that require numerical input.
- Preventing ordinality assumptions between categories.
- Avoiding bias that might be introduced by assigning numerical values to categories.

In [3]:
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

In [24]:
# Sample dataset with a categorical column "Color"
data = {'color': ['Red', 'Blue', 'Green', 'Red']}
df = pd.DataFrame(data)
df

|   | color |
|---|-------|
| 0 | Red   |
| 1 | Blue  |
| 2 | Green |
| 3 | Red   |


In [5]:
# Create an instance of OneHotEncoder

In [9]:
encoder = OneHotEncoder()

In [15]:
# Fit the encoder and transform the column (the column is named 'color' in df)
encoded = encoder.fit_transform(df[['color']]).toarray()

In [17]:
encoded_df = pd.DataFrame(encoded, columns=encoder.get_feature_names_out())

In [18]:
encoded_df

|   | color_Blue | color_Green | color_Red |
|---|------------|-------------|-----------|
| 0 | 0.0        | 0.0         | 1.0       |
| 1 | 1.0        | 0.0         | 0.0       |
| 2 | 0.0        | 1.0         | 0.0       |
| 3 | 0.0        | 0.0         | 1.0       |


In [20]:
encoder.transform([['Green']]).toarray()



array([[0., 1., 0.]])

## Label Encoding
Label encoding is a method used to convert categorical data into numerical values. It assigns a unique integer or label to each category within a categorical variable, effectively converting it into a numerical format.



**How Label Encoding Works**

Label encoding maps each category to a unique integer, usually starting from 0. The numbering depends on the implementation: scikit-learn's `LabelEncoder`, for instance, sorts the categories and numbers them in that order, so string categories are labeled alphabetically. Because many algorithms treat the integers as ordered magnitudes, the assignment can affect model performance.

For example, if you have a categorical variable "Size" with values ["Small", "Medium", "Large"], label encoding might assign the following labels: {"Small": 0, "Medium": 1, "Large": 2}.
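With scikit-learn's `LabelEncoder`, the labels for the "Size" example come out in alphabetical order rather than in size order. The sketch below uses a made-up list `sizes` and reads the learned mapping back from the encoder's `classes_` attribute.

```python
from sklearn.preprocessing import LabelEncoder

sizes = ["Small", "Medium", "Large", "Medium"]

size_encoder = LabelEncoder()
codes = size_encoder.fit_transform(sizes)

# classes_ holds the categories in sorted order; the position is the label.
mapping = {str(label): code for code, label in enumerate(size_encoder.classes_)}

print(mapping)         # {'Large': 0, 'Medium': 1, 'Small': 2}
print(codes.tolist())  # [2, 1, 0, 1]
```

If the integers need to follow the real order (Small < Medium < Large), the order has to be supplied explicitly, which is what ordinal encoding does.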

**When to Use Label Encoding**

Label encoding is suitable for scenarios where:

- The categorical variable has an ordinal relationship, meaning there's a meaningful order or ranking among categories (e.g., "Low," "Medium," "High"), and the assigned integers respect that order.
- The machine learning algorithm used can work with ordinal data.
- You want to reduce the dimensionality of your dataset compared to one-hot encoding.


In [23]:
from sklearn.preprocessing import LabelEncoder
lbl_encoder = LabelEncoder()

In [25]:
# LabelEncoder expects 1-D input, so pass the column as a Series
lbl_encoder.fit_transform(df['color'])


array([2, 0, 1, 2])

In [27]:
lbl_encoder.transform(['Red'])


array([2])

In [28]:
lbl_encoder.transform(['Blue'])


array([0])

In [29]:
lbl_encoder.transform(['Green'])


array([1])

### Ordinal Encoding

Ordinal encoding is a method used to convert categorical data with a meaningful order or rank into numerical values. It assigns a unique integer or label to each category while preserving the ordinal relationship between them.

**How Ordinal Encoding Works**

Ordinal encoding works by mapping each category to a unique integer based on the predefined order or ranking of the categories. This mapping can be done manually by specifying the order or using automated methods based on the observed frequency of each category.

For example, if you have a categorical variable "Education Level" with values ["High School", "Bachelor's Degree", "Master's Degree", "Ph.D."], ordinal encoding might assign the following labels: {"High School": 0, "Bachelor's Degree": 1, "Master's Degree": 2, "Ph.D.": 3}.
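One way to sketch this mapping in pandas is an ordered categorical, where each value is stored as its position in an explicitly supplied order. The names `education_order` and `levels` are illustrative only.

```python
import pandas as pd

# Explicit order, lowest to highest (taken from the example above).
education_order = ["High School", "Bachelor's Degree", "Master's Degree", "Ph.D."]

levels = pd.Series(["Master's Degree", "High School", "Ph.D.", "Bachelor's Degree"])

# An ordered categorical stores each value as its index in education_order.
ordered = pd.Categorical(levels, categories=education_order, ordered=True)
codes = ordered.codes.tolist()

print(codes)  # [2, 0, 3, 1]
```

Values that are not listed in `education_order` would receive the code -1, so it is worth checking for that before training a model.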

**When to Use Ordinal Encoding**

Ordinal encoding is suitable for scenarios where:

- The categorical variable has a clear and meaningful order or ranking among categories.
- The machine learning algorithm used can work with ordinal data.
- You want to represent the ordinal relationship efficiently without introducing high dimensionality as in one-hot encoding.


In [32]:
from sklearn.preprocessing import OrdinalEncoder

In [41]:
# Sample dataset with a categorical column "Education Level"
data = {'Education Level': ['Bachelor\'s Degree', 'Master\'s Degree', 'High School', 'Ph.D.']}
df = pd.DataFrame(data)

# Define the ordinal mapping
education_mapping = {
    "High School": 0,
    "Bachelor's Degree": 1,
    "Master's Degree": 2,
    "Ph.D.": 3
}

In [40]:
df

|   | Education Level   |
|---|-------------------|
| 0 | Bachelor's Degree |
| 1 | Master's Degree   |
| 2 | High School       |
| 3 | Ph.D.             |


In [34]:
# Create an instance of OrdinalEncoder with an explicit category order, then fit_transform

In [42]:
ordinal_encoder = OrdinalEncoder(categories=[["High School", "Bachelor's Degree", "Master's Degree", "Ph.D."]])

In [45]:
ordinal_encoder.fit_transform(df[['Education Level']])

array([[1.],
       [2.],
       [0.],
       [3.]])

In [47]:
ordinal_encoder.transform([["Bachelor's Degree"]])

array([[1.]])

## Target Guided Ordinal Encoding



Target Guided Ordinal Encoding is a data preprocessing technique that assigns ordinal labels to categories within a categorical variable based on their impact or predictive power concerning the target variable. It is commonly used in supervised machine learning tasks to encode categorical variables efficiently.

**How Target Guided Ordinal Encoding Works**

Target Guided Ordinal Encoding involves the following steps:

1. Calculate a statistical metric, such as the mean or median of the target variable, for each category within the categorical variable.

2. Rank the categories by the calculated metric, in ascending or descending order depending on whether a higher label should correspond to a higher or a lower target value.

3. Assign ordinal labels to the categories according to their rank. With an ascending ranking, the category with the lowest metric value receives the lowest label and the category with the highest metric value receives the highest label.

For example, if you have a categorical variable "Education Level" and a binary target variable "Churn" (0 or 1), you can calculate the mean "Churn" rate for each education level category and then rank them accordingly.
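The three steps can be sketched as a small helper. Everything here is illustrative: `target_guided_ordinal` is a hypothetical function name, the `plans` data is invented, and the mean is only one possible metric.

```python
import pandas as pd

def target_guided_ordinal(df, column, target, ascending=True):
    """Return {category: ordinal label}, ordered by the mean of the target."""
    # Step 1: mean of the target for each category.
    means = df.groupby(column)[target].mean()
    # Step 2: order the categories by that mean.
    ranked = means.sort_values(ascending=ascending).index
    # Step 3: number the categories 0, 1, 2, ... in ranked order.
    return {category: label for label, category in enumerate(ranked)}

# Hypothetical data: subscription plan and whether the customer churned.
plans = pd.DataFrame({
    "plan": ["basic", "pro", "basic", "free", "pro", "free"],
    "churn": [1, 0, 0, 1, 0, 1],
})

mapping = target_guided_ordinal(plans, "plan", "churn")
plans["plan_encoded"] = plans["plan"].map(mapping)

print(mapping)  # {'pro': 0, 'basic': 1, 'free': 2}
```

Passing `ascending=False` reverses the direction of the labels. Categories with identical means still receive distinct labels in this sketch, so ties may need separate handling.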

**When to Use Target Guided Ordinal Encoding**

Target Guided Ordinal Encoding is suitable for scenarios where:

- You have categorical variables with a large number of categories.
- You want to encode categories based on their predictive power or impact on the target variable.
- You believe that the order of encoding should be determined by the relationship with the target variable.

In [49]:
# Sample dataset with a categorical column "Education Level" and a binary target column "Churn"
data = {'Education Level': ['Bachelor\'s Degree', 'Master\'s Degree', 'High School', 'Ph.D.'],
        'Churn': [1, 0, 1, 0]}
df = pd.DataFrame(data)

# Calculate the mean churn rate for each education level
education_mean_churn = df.groupby('Education Level')['Churn'].mean().reset_index()

# Rank the categories based on the mean churn rate.
# method='dense' gives tied categories the same consecutive integer rank;
# the default average rank is fractional for ties and would be truncated by astype(int).
education_mean_churn['Education Level Rank'] = education_mean_churn['Churn'].rank(method='dense').astype(int)

# Map the ordinal labels back to the original dataset
df = df.merge(education_mean_churn[['Education Level', 'Education Level Rank']], on='Education Level', how='left')

print(df)

     Education Level  Churn  Education Level Rank
0  Bachelor's Degree      1                     2
1    Master's Degree      0                     1
2        High School      1                     2
3              Ph.D.      0                     1


In [51]:
data = {'city': ['New York', 'London', 'Paris', 'Tokyo','London','Tokyo'],
        'price': [200,150,300,250,180,320]}
df = pd.DataFrame(data)

In [52]:
df

|   | city     | price |
|---|----------|-------|
| 0 | New York | 200   |
| 1 | London   | 150   |
| 2 | Paris    | 300   |
| 3 | Tokyo    | 250   |
| 4 | London   | 180   |
| 5 | Tokyo    | 320   |


In [55]:
mean_price = df.groupby('city')['price'].mean().to_dict()

In [56]:
mean_price

{'London': 165.0, 'New York': 200.0, 'Paris': 300.0, 'Tokyo': 285.0}

In [58]:
df['city_encoded']=df['city'].map(mean_price)

In [59]:
df['city_encoded']

0    200.0
1    165.0
2    300.0
3    285.0
4    165.0
5    285.0
Name: city_encoded, dtype: float64

In [61]:
df[['price','city_encoded']]

|   | price | city_encoded |
|---|-------|--------------|
| 0 | 200   | 200.0        |
| 1 | 150   | 165.0        |
| 2 | 300   | 300.0        |
| 3 | 250   | 285.0        |
| 4 | 180   | 165.0        |
| 5 | 320   | 285.0        |
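The `city_encoded` column above holds the mean price itself. If integer ordinal labels are preferred, one possible follow-up is to rank the per-city means. This sketch rebuilds the same city and price data so that it runs on its own; `city_rank` is a name chosen here for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "city": ["New York", "London", "Paris", "Tokyo", "London", "Tokyo"],
    "price": [200, 150, 300, 250, 180, 320],
})

mean_price = df.groupby("city")["price"].mean()

# Dense ranks: the city with the lowest mean price gets 1, the next gets 2, ...
city_rank = mean_price.rank(method="dense").astype(int).to_dict()
df["city_rank"] = df["city"].map(city_rank)

print(city_rank)  # {'London': 1, 'New York': 2, 'Paris': 4, 'Tokyo': 3}
```

In a real project the means would normally be computed on the training split only, so that information from the target does not leak into validation or test data.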
