## Encoding Techniques

In machine learning, encoding techniques convert categorical data (data that represents categories or labels) into a numerical format so that machine learning algorithms can process it. Python offers several libraries and techniques for encoding categorical data. Here are some common encoding techniques in ML using Python:

Types of Encoding
1. Nominal Encoding: the categories have no inherent rank, so no order needs to be specified when converting them to numeric form.
    Example: Gender (Male, Female). There is no ranking to decide.

2. Ordinal Encoding: the categories have a meaningful order that must be specified, as with a Degree feature.
    Example: High School, then 12th, then BE, then ME, and so on.

Based on this, the encoding techniques can be grouped as follows:
1. Nominal Encoding:
    1. One-Hot Encoding.
    2. One-Hot Encoding with many categories.
    3. Mean Encoding.

2. Ordinal Encoding:
    1. Label Encoding.
    2. Target-guided Ordinal Encoding.
    
    




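### Label Encoding:

It assigns a unique integer to each category in a categorical feature.
It is suitable for ordinal data where the order of categories matters, provided the assigned integers actually follow that order.
You can use the LabelEncoder class from the sklearn.preprocessing module. Note that LabelEncoder assigns integers in sorted (alphabetical) order of the category names, and scikit-learn intends it for target labels; OrdinalEncoder is its counterpart for input features.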
 


### Example of Label Encoding:

Suppose you have a dataset with a categorical feature "Education Level" that you want to label encode. Here's how you can do it:

 
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Sample data
data = {'Education Level': ['High School', 'Bachelor\'s Degree', 'Associate\'s Degree', 'Master\'s Degree', 'High School']}

# Create a DataFrame
df = pd.DataFrame(data)

# Initialize the LabelEncoder
encoder = LabelEncoder()

# Fit and transform the data
df['Encoded Education Level'] = encoder.fit_transform(df['Education Level'])

# Display the encoded DataFrame
print(df)


Output:

 
      Education Level  Encoded Education Level
0         High School                        2
1   Bachelor's Degree                        1
2  Associate's Degree                        0
3     Master's Degree                        3
4         High School                        2


In this example, label encoding assigns a unique integer to each category in the "Education Level" column based on the sorted (alphabetical) order of the category names, not on their order of appearance or their educational rank. "Associate's Degree" is assigned 0, "Bachelor's Degree" is assigned 1, "High School" is assigned 2, and "Master's Degree" is assigned 3.

Remember that while label encoding is suitable for ordinal data, it may introduce unintended ordinal relationships if applied to nominal data, so it's crucial to understand the nature of your data before choosing this encoding technique. If there's no inherent order among categories, consider using one-hot encoding instead.



Because the encoded values are integers, many models treat a higher code as a larger quantity, so the category with the highest code can end up carrying more weight than intended.
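When the order of the categories matters, the ranking can be stated explicitly instead of relying on alphabetical order. The following is a minimal sketch using a pandas ordered categorical; the ranking in `education_order` and the column name `Education Rank` are assumptions made for illustration, not part of the original example.

```python
import pandas as pd

# Sample data (same categories as the label-encoding example above)
df = pd.DataFrame({
    "Education Level": [
        "High School", "Bachelor's Degree", "Associate's Degree",
        "Master's Degree", "High School",
    ]
})

# Assumed ranking, lowest to highest; adjust it to your own domain knowledge
education_order = [
    "High School", "Associate's Degree", "Bachelor's Degree", "Master's Degree",
]

# An ordered categorical assigns codes following the list order, not the alphabet
ordered = pd.Categorical(
    df["Education Level"], categories=education_order, ordered=True
)
df["Education Rank"] = ordered.codes

print(df)
```

Here "High School" receives 0 and "Master's Degree" receives 3, so the integers follow the intended ranking.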


### One-Hot Encoding:

For example, consider a feature containing country names:

Country     Germany    France    Spain
Germany        1          0         0
France         0          1         0
Spain          0          0         1
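A table like the one above can be produced with `pd.get_dummies`. This is a small sketch; the DataFrame and the `Country` column name are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({"Country": ["Germany", "France", "Spain"]})

# dtype=int gives 0/1 columns; recent pandas versions default to True/False
one_hot = pd.get_dummies(df["Country"], dtype=int)

# get_dummies orders the columns alphabetically; reorder them to match the table
one_hot = one_hot[["Germany", "France", "Spain"]]

print(one_hot)
```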

One-hot encoding can lead to the dummy variable trap, which is explained in the next subsection.

#### Dummy Variable Trap in One-Hot Encoding:

The dummy variable trap is a situation that can occur when using one-hot encoding to represent categorical data with binary columns. It happens when one binary column can be predicted from the values of the other binary columns in the dataset. In other words, it creates multicollinearity in your dataset, which can lead to issues in some statistical models.

Let's break down the dummy variable trap with an example:

Suppose you have a categorical feature "Season" with three categories: "Spring," "Summer," and "Fall." If you apply one-hot encoding without considering the trap, you might end up with the following columns:

Season_Spring
Season_Summer
Season_Fall

Now, consider this: if you know the values of Season_Summer and Season_Fall, you can easily infer the value of Season_Spring. In other words, one of the columns can be predicted from the values of the others. This redundancy can cause issues in some machine learning models, particularly in linear regression, as it violates the assumption of no perfect multicollinearity.
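The redundancy described above can be checked directly. The sketch below, which uses a small made-up sample, shows that the three dummy columns always sum to 1 in every row, so any one of them can be reconstructed from the other two.

```python
import pandas as pd

df = pd.DataFrame({"Season": ["Spring", "Summer", "Fall", "Spring", "Summer"]})

# Full one-hot encoding: one column per category
dummies = pd.get_dummies(df["Season"], prefix="Season", dtype=int)

# Each row contains exactly one 1, so the columns always sum to 1
row_sums = dummies.sum(axis=1)
print(row_sums.tolist())

# Season_Spring is fully determined by the other two columns
reconstructed_spring = 1 - dummies["Season_Summer"] - dummies["Season_Fall"]
print((reconstructed_spring == dummies["Season_Spring"]).all())
```

In a linear model with an intercept, this constant row sum is what makes the full set of dummy columns perfectly collinear.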

To avoid the dummy variable trap, you should use one less binary column than the number of categories. In the example above, you would only need two columns instead of three. For instance:

Season_Summer
Season_Fall

This way, if both Season_Summer and Season_Fall are 0, you can safely assume that the season is "Spring." By omitting one column, you eliminate the multicollinearity issue.

In Python, libraries like pandas automatically handle the dummy variable trap when you use the drop_first=True parameter in functions like pd.get_dummies(). This parameter removes one of the binary columns to avoid multicollinearity. Here's an example:

 
import pandas as pd

data = {'Season': ['Spring', 'Summer', 'Fall', 'Spring', 'Summer']}
df = pd.DataFrame(data)

# Apply one-hot encoding with drop_first=True
df_encoded = pd.get_dummies(df, columns=['Season'], drop_first=True)

The resulting DataFrame has only two columns, Season_Spring and Season_Summer. pandas drops the first category in sorted order ("Fall"), which becomes the baseline category and avoids the dummy variable trap.










It creates binary columns for each category in a categorical feature, with a 1 indicating the presence of the category and 0 otherwise.
It is suitable for nominal data where there is no inherent order among categories.
You can use the OneHotEncoder class from sklearn.preprocessing or the get_dummies function from pandas.
 


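import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Sample data for illustration; replace with your own 1-D array of categories
categorical_data = np.array(['Germany', 'France', 'Spain', 'France'])

encoder = OneHotEncoder()
# OneHotEncoder expects a 2-D input and returns a sparse matrix by default
encoded_data = encoder.fit_transform(categorical_data.reshape(-1, 1))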
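For a feature with many categories, full one-hot encoding can produce a very wide table. One common way to read "One-Hot Encoding with many categories" from the list at the start is to create binary columns only for the most frequent categories. The sketch below assumes that interpretation; the `City` data and the cutoff `top_k` are hypothetical.

```python
import pandas as pd

# Hypothetical high-cardinality feature
df = pd.DataFrame({
    "City": ["Pune", "Delhi", "Pune", "Mumbai", "Delhi", "Pune", "Agra", "Kochi"]
})

top_k = 2  # assumed cutoff; tune it for your data

# The k most frequent categories
top_categories = df["City"].value_counts().nlargest(top_k).index.tolist()

# One binary column per frequent category; all other categories stay all zeros
for category in top_categories:
    df[f"City_{category}"] = (df["City"] == category).astype(int)

print(df)
```

Rare categories are not given their own column, which keeps the number of new features fixed at `top_k`.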

### Binary Encoding:

It combines label encoding and one-hot encoding by first converting categories to integers and then to binary code.
It reduces dimensionality compared to one-hot encoding.
You can use the BinaryEncoder class from the category_encoders library.

from category_encoders import BinaryEncoder

# Assumes df is a pandas DataFrame with a column named 'categorical_feature'
encoder = BinaryEncoder(cols=['categorical_feature'])
encoded_data = encoder.fit_transform(df)
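To make the two steps concrete, here is a hand-written sketch of the idea using only pandas. It is an illustration of the technique, not a reproduction of how `BinaryEncoder` assigns integers or names its columns; the `Color` data and the output column names are assumptions.

```python
import pandas as pd

df = pd.DataFrame({"Color": ["Red", "Green", "Blue", "Yellow", "Green"]})

# Step 1: map each category to an integer (starting at 1, in order of appearance)
codes = {category: i + 1 for i, category in enumerate(df["Color"].unique())}
integers = df["Color"].map(codes)

# Step 2: write each integer in binary, one column per bit
n_bits = int(integers.max()).bit_length()
for bit in range(n_bits):
    # Column 0 holds the most significant bit
    shift = n_bits - 1 - bit
    df[f"Color_{bit}"] = (integers // (2 ** shift)) % 2

print(df)
```

The number of bit columns grows with the logarithm of the number of categories, whereas one-hot encoding adds one column per category.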

### Count Encoding:

It replaces each category with the count of its occurrences in the dataset.
It can be useful for high cardinality categorical features.
You can use the CountEncoder class from the category_encoders library.

from category_encoders import CountEncoder

# Assumes df is a pandas DataFrame with a column named 'categorical_feature'
encoder = CountEncoder(cols=['categorical_feature'])
encoded_data = encoder.fit_transform(df)
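The same idea can be sketched in plain pandas with `value_counts` and `map`. The `Color` data and the `Color_count` column name are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({"Color": ["Red", "Green", "Red", "Blue", "Red", "Green"]})

# Count how often each category occurs
counts = df["Color"].value_counts()

# Replace each category with its count
df["Color_count"] = df["Color"].map(counts)

print(df)
```

Note that two different categories with the same count receive the same encoded value, so the encoding does not always distinguish them.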

### Target Encoding (Mean Encoding):

It replaces each category with the mean of the target variable for that category.
It can be useful for classification problems.
You can manually calculate and apply target encoding or use libraries like category_encoders.

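from category_encoders import TargetEncoder

# Assumes df holds the features (including 'categorical_feature') and
# target_column holds the target values, e.g. a pandas Series aligned with df
encoder = TargetEncoder(cols=['categorical_feature'])
encoded_data = encoder.fit_transform(df, target_column)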
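The manual calculation mentioned above might look like the following sketch. It also shows one plausible reading of the target-guided ordinal encoding listed at the start: ranking the categories by their mean target and using the rank as the code. The `City` and `Purchased` data and the output column names are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({
    "City": ["Pune", "Delhi", "Pune", "Mumbai", "Delhi", "Pune"],
    "Purchased": [1, 0, 1, 0, 1, 0],  # hypothetical binary target
})

# Mean encoding: replace each category with the mean target for that category
category_means = df.groupby("City")["Purchased"].mean()
df["City_mean"] = df["City"].map(category_means)

# Target-guided ordinal encoding: rank categories by their mean target
ranked = category_means.sort_values().index
rank_map = {category: rank for rank, category in enumerate(ranked)}
df["City_rank"] = df["City"].map(rank_map)

print(df)
```

Because both encodings are computed from the target, the means should be learned on the training data only and then applied to validation or test data, otherwise target information leaks into the features.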