### Data Encoding
1. Nominal/OHE Encoding
2. Label and Ordinal Encoding
3. Target Guided Ordinal Encoding

In machine learning, data encoding refers to the process of converting data from one format or structure to another, primarily to make it compatible with machine learning algorithms. This is particularly crucial for handling categorical data, which represents variables that fall into distinct categories, because most machine learning models require numerical input.

## Nominal/OHE Encoding
One-hot encoding, also known as nominal encoding, is a technique used to represent categorical data as numerical data, which is more suitable for machine learning algorithms. In this technique, each category is represented as a binary vector in which each position corresponds to a unique category: the position of the row's category is 1 and all others are 0. For example, if we have a categorical variable "color" with three possible values (red, green, blue), we can represent it with one-hot encoding as follows:
1. Red : [1, 0, 0]
2. Green : [0, 1, 0]
3. Blue : [0, 0, 1]
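
The mapping above can be reproduced by hand before reaching for a library. The following is a minimal sketch, assuming the category order red, green, blue from the listing; the helper name `one_hot` and the variable names are illustrative only.

```python
import numpy as np

# Category order is an assumption chosen to match the listing above.
categories = ["red", "green", "blue"]
index = {cat: i for i, cat in enumerate(categories)}

def one_hot(value):
    """Return a binary vector with a single 1 at the category's position."""
    vec = np.zeros(len(categories), dtype=int)
    vec[index[value]] = 1
    return vec

vectors = {cat: one_hot(cat).tolist() for cat in categories}
print(vectors)
# {'red': [1, 0, 0], 'green': [0, 1, 0], 'blue': [0, 0, 1]}
```

Library encoders do the same thing but choose the column order themselves (scikit-learn sorts the categories), so the position of the 1 can differ from this hand-built version.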

### Disadvantages
- When a categorical variable has many unique values (high cardinality), one-hot encoding creates many new features, significantly increasing the dimensionality of your dataset.
- Sparse data: the resulting matrices are often sparse, meaning they contain mostly zeros. Some machine learning algorithms process such data inefficiently.
- The increased number of features and sparse data lead to higher computational costs and increased memory usage, which can slow down model training and inference.
- One-hot encoding treats each category as independent, failing to capture any inherent relationship or order between them.
- For small datasets, the increased number of features from one-hot encoding can make models more prone to overfitting, where the model learns the training data too well and performs poorly on new data.
- The technique doesn't scale well to very large vocabularies or features with a vast number of categories, making it impractical for certain large-scale applications.
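
The dimensionality and sparsity points can be checked on a small synthetic column. This is a rough sketch, assuming a hypothetical `user_id` feature with 500 distinct values; the data and variable names are made up for illustration.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical high-cardinality column: 500 distinct IDs, each appearing twice.
ids = [f"user_{i}" for i in range(500)]
df_ids = pd.DataFrame({"user_id": ids * 2})

# OneHotEncoder returns a SciPy sparse matrix by default.
encoded_ids = OneHotEncoder().fit_transform(df_ids[["user_id"]])

n_rows, n_cols = encoded_ids.shape
density = encoded_ids.nnz / (n_rows * n_cols)  # fraction of non-zero entries

print(n_rows, n_cols)  # 1000 500
print(density)         # 0.002
```

One column became 500, and only one entry per row is non-zero. Keeping the result sparse, rather than calling `.toarray()`, avoids storing all the zeros.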

In [1]:
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

In [2]:
## Create a simple dataframe
df = pd.DataFrame({
    'color': ['red', 'blue', 'green', 'red', 'blue']
})

In [3]:
df.head()

| | color |
|---|---|
| 0 | red |
| 1 | blue |
| 2 | green |
| 3 | red |
| 4 | blue |


In [4]:
## create an instance of OneHotEncoder
encoder = OneHotEncoder()

In [5]:
## perform fit and transform
encoded = encoder.fit_transform(df[['color']]).toarray()

In [6]:
encoder_df = pd.DataFrame(encoded,columns=encoder.get_feature_names_out())

In [7]:
encoder_df.head()

| | color_blue | color_green | color_red |
|---|---|---|---|
| 0 | 0.0 | 0.0 | 1.0 |
| 1 | 1.0 | 0.0 | 0.0 |
| 2 | 0.0 | 1.0 | 0.0 |
| 3 | 0.0 | 0.0 | 1.0 |
| 4 | 1.0 | 0.0 | 0.0 |


In [8]:
## for new data (a DataFrame with the same column name avoids a feature-name warning)
encoder.transform(pd.DataFrame({'color': ['blue']})).toarray()

array([[1., 0., 0.]])

In [9]:
pd.concat([df,encoder_df],axis=1)

| | color | color_blue | color_green | color_red |
|---|---|---|---|---|
| 0 | red | 0.0 | 0.0 | 1.0 |
| 1 | blue | 1.0 | 0.0 | 0.0 |
| 2 | green | 0.0 | 1.0 | 0.0 |
| 3 | red | 0.0 | 0.0 | 1.0 |
| 4 | blue | 1.0 | 0.0 | 0.0 |


In [10]:
import seaborn as sns
df = sns.load_dataset('tips')

In [11]:
df.head()

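| | total_bill | tip | sex | smoker | day | time | size |
|---|---|---|---|---|---|---|---|
| 0 | 16.99 | 1.01 | Female | No | Sun | Dinner | 2 |
| 1 | 10.34 | 1.66 | Male | No | Sun | Dinner | 3 |
| 2 | 21.01 | 3.5 | Male | No | Sun | Dinner | 3 |
| 3 | 23.68 | 3.31 | Male | No | Sun | Dinner | 2 |
| 4 | 24.59 | 3.61 | Female | No | Sun | Dinner | 4 |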


In [12]:
enc = OneHotEncoder()

In [13]:
encode = enc.fit_transform(df[['sex']])

## Label Encoding
Label encoding and ordinal encoding are two techniques used to encode categorical data as numerical data.
Label encoding assigns a unique integer label to each category in the variable. The labels are usually assigned in alphabetical order or based on the frequency of the categories; scikit-learn's LabelEncoder sorts the categories and numbers them from 0. This transformer should be used to encode target values, i.e. y, and not the input X. For example, if we have a categorical variable "color" with three possible values (red, green, blue), alphabetical label encoding gives:
1. Blue : 0
2. Green : 1
3. Red : 2
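
As a quick check on string labels, here is a small sketch using scikit-learn's LabelEncoder on the color example. LabelEncoder sorts the classes and numbers them from 0, so the integers it produces depend on that sorted order rather than on the order in which the categories are listed. The variable names are illustrative.

```python
from sklearn.preprocessing import LabelEncoder

colors = ["red", "green", "blue", "green"]

color_le = LabelEncoder()
color_codes = color_le.fit_transform(colors)

print(color_le.classes_)  # ['blue' 'green' 'red']
print(color_codes)        # [2 1 0 1]

# The mapping can be reversed to recover the original labels.
print(color_le.inverse_transform(color_codes))  # ['red' 'green' 'blue' 'green']
```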

In [14]:
from sklearn.preprocessing import LabelEncoder

In [15]:
le = LabelEncoder()

In [16]:
le.fit([1, 2, 2, 6])

In [17]:
le.classes_

array([1, 2, 6])

In [18]:
le.transform([1, 2, 2, 6])

array([0, 1, 1, 2])

In [19]:
le.inverse_transform([0, 1, 1, 2])

array([1, 2, 2, 6])

In [20]:
le.fit_transform([1, 1, 1, 2, 6])

array([0, 0, 0, 1, 2])

## Ordinal Encoding
Ordinal encoding is a technique in machine learning used to convert categorical variables with an inherent order or ranking into numerical values. This method assigns a unique integer to each category based on its position in the defined order. For example, education levels could be encoded as:
1. High school : 1
2. College : 2
3. Graduate : 3
4. Post-Graduate : 4
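
The education ranking above can be applied with a plain dictionary lookup in pandas. This is a minimal sketch with a made-up `education` column; the 1-based numbering is chosen to match the listing.

```python
import pandas as pd

# The order of this list defines the ranking (lowest to highest).
education_order = ["High school", "College", "Graduate", "Post-Graduate"]
education_rank = {level: i for i, level in enumerate(education_order, start=1)}

# Hypothetical sample data.
df_edu = pd.DataFrame({
    "education": ["College", "High school", "Post-Graduate", "Graduate"]
})
df_edu["education_rank"] = df_edu["education"].map(education_rank)

print(df_edu["education_rank"].tolist())  # [2, 1, 4, 3]
```

Unlike label encoding, the integers here carry meaning: a larger value always denotes a higher education level.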

In [21]:
# Ordinal encoder
from sklearn.preprocessing import OrdinalEncoder

In [22]:
# create a sample dataframe with an ordinal variable
df = pd.DataFrame({
    'size' : ['small','medium','large','medium','small','large']
})

In [23]:
df

| | size |
|---|---|
| 0 | small |
| 1 | medium |
| 2 | large |
| 3 | medium |
| 4 | small |
| 5 | large |


In [24]:
## create an instance of ordinal encoder and then fit_transform
encoder = OrdinalEncoder(categories=[['small','medium','large']])

In [25]:
encoder.fit_transform(df[['size']])

array([[0.],
       [1.],
       [2.],
       [1.],
       [0.],
       [2.]])

In [26]:
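## for new data (a DataFrame with the same column name avoids a feature-name warning)
encoder.transform(pd.DataFrame({'size': ['small']}))

array([[0.]])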

## Target Guided Ordinal Encoding
Target guided ordinal encoding is a technique used to encode categorical variables based on their relationship with the target variable. It is useful when we have a categorical variable with a large number of unique categories and we want to use this variable as a feature in our machine learning model.

In target guided ordinal encoding, we replace each category in the categorical variable with a numerical value based on the mean or median of the target variable for that category. This creates a monotonic relationship between the encoded variable and the target variable (categories with a higher average target receive higher values), which can improve the predictive power of our model.
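
A common variant ranks the categories by their mean target and uses the rank, rather than the mean itself, as the encoded value. Below is a minimal sketch of that variant, assuming a small made-up `city`/`price` table; the variable names are illustrative.

```python
import pandas as pd

# Hypothetical data: 'city' is the categorical feature, 'price' the target.
df_tg = pd.DataFrame({
    "city": ["New York", "London", "Paris", "Tokyo", "New York", "Paris"],
    "price": [200, 150, 300, 250, 180, 320],
})

# Mean target per category, sorted from lowest to highest.
city_means = df_tg.groupby("city")["price"].mean().sort_values()

# Rank categories by mean target: the lowest mean gets rank 1.
city_rank = {city: rank for rank, city in enumerate(city_means.index, start=1)}
df_tg["city_rank"] = df_tg["city"].map(city_rank)

print(city_rank)
# {'London': 1, 'New York': 2, 'Tokyo': 3, 'Paris': 4}
```

Because the encoding is derived from the target, the means or ranks should be computed on the training data only and then applied to validation and test data, otherwise target information leaks into the features.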

In [27]:
import pandas as pd

# create a sample dataframe with a categorical variable and a target variable
df = pd.DataFrame({
    'city': ['New York','London','Paris','Tokyo','New York','Paris'],
    'price': [200, 150, 300, 250, 180, 320]
})

In [28]:
df

| | city | price |
|---|---|---|
| 0 | New York | 200 |
| 1 | London | 150 |
| 2 | Paris | 300 |
| 3 | Tokyo | 250 |
| 4 | New York | 180 |
| 5 | Paris | 320 |


In [29]:
## mean of the target for each category
mean_price = df.groupby('city')['price'].mean().to_dict()

In [30]:
mean_price

{'London': 150.0,
 'New York': 190.0,
 'Paris': 310.0,
 'Tokyo': 250.0}

In [31]:
## replace each category with its mean target value
df['city_encoded'] = df['city'].map(mean_price)

In [32]:
df[['city','city_encoded']]

| | city | city_encoded |
|---|---|---|
| 0 | New York | 190.0 |
| 1 | London | 150.0 |
| 2 | Paris | 310.0 |
| 3 | Tokyo | 250.0 |
| 4 | New York | 190.0 |
| 5 | Paris | 310.0 |


In [34]:
import seaborn as sns
df = sns.load_dataset('tips')

In [35]:
df.head()

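| | total_bill | tip | sex | smoker | day | time | size |
|---|---|---|---|---|---|---|---|
| 0 | 16.99 | 1.01 | Female | No | Sun | Dinner | 2 |
| 1 | 10.34 | 1.66 | Male | No | Sun | Dinner | 3 |
| 2 | 21.01 | 3.5 | Male | No | Sun | Dinner | 3 |
| 3 | 23.68 | 3.31 | Male | No | Sun | Dinner | 2 |
| 4 | 24.59 | 3.61 | Female | No | Sun | Dinner | 4 |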


In [37]:
encoder = OneHotEncoder()

In [39]:
encoded_sex = encoder.fit_transform(df[['sex']]).toarray()

In [41]:
encoded_sex = pd.DataFrame(encoded_sex,columns=encoder.get_feature_names_out())

In [44]:
encoded_sex.head()

| | sex_Female | sex_Male |
|---|---|---|
| 0 | 1.0 | 0.0 |
| 1 | 0.0 | 1.0 |
| 2 | 0.0 | 1.0 |
| 3 | 0.0 | 1.0 |
| 4 | 1.0 | 0.0 |
