#  Name -> Deven Chhajed
# Roll No-> 32
# Batch -> B1 (CSE)
# Prn -> 1032210789
# Data Encoding

In [1]:
import pandas as pd
from sklearn import preprocessing

**import pandas as pd**: Pandas, a Python library for data manipulation and analysis, offers essential data structures such as DataFrames and Series to facilitate efficient data handling and analysis.

**from sklearn import preprocessing**: The preprocessing module in sklearn transforms raw data into a format better suited to machine learning algorithms. It includes tools for scaling, normalization, and encoding, including the `LabelEncoder` used below.



# Label Encoding

**Label Encoding** converts categorical data into numbers so that it can be used in machine learning models, most of which require numerical input. Here’s a breakdown of the process:

**Unique Category Assignment:** Each unique category value is assigned a distinct integer, known as a label.

**Integer Assignment:** The integers carry no meaning of their own. scikit-learn's `LabelEncoder` sorts the categories before numbering them, so a ‘Color’ feature with values ‘Red’, ‘Blue’, and ‘Green’ is encoded as Blue = 0, Green = 1, and Red = 2.

**Value Replacement:** The categorical values in the dataset are then substituted with their corresponding integers.

Consider a dataset with a ‘Height’ feature having values ‘Tall’, ‘Medium’, and ‘Short’. Post label encoding, these could be represented as 0, 1, and 2 respectively.

However, Label Encoding has a **limitation**: it can imply an ordering among categories that does not exist. For example, in our ‘Height’ scenario, the model might infer that ‘Short’ (2) is “greater than” ‘Tall’ (0), which is meaningless.

In [2]:
data = {
    "Departments": ['Production', 'Finance', 'IT', 'Marketing', 'Finance', 'IT', 'IT', 'Human Resource', 'Finance']
}

# Create a DataFrame from data
df = pd.DataFrame(data)

# Shift the index so that it starts at 1 instead of 0
df.index = df.index + 1

print(df)

le = preprocessing.LabelEncoder()

df['Departments'] = le.fit_transform(df['Departments'])

print("\nData Frame after Label Encoding:\n")
print(df)


      Departments
1      Production
2         Finance
3              IT
4       Marketing
5         Finance
6              IT
7              IT
8  Human Resource
9         Finance

Data Frame after Label Encoding:

   Departments
1            4
2            0
3            2
4            3
5            0
6            2
7            2
8            1
9            0


# "category_encoders" is used for encoding categorical data in Python, which is important for machine learning and data analysis.

In [3]:
pip install category_encoders

Collecting category_encoders
  Downloading category_encoders-2.6.2-py2.py3-none-any.whl (81 kB)
Installing collected packages: category_encoders
Successfully installed category_encoders-2.6.2


# **import category_encoders as ce** allows you to use the category_encoders package in your Python code with the alias ce. This package provides tools for encoding categorical data, making it suitable for use in machine learning and data analysis tasks.

In [4]:
import category_encoders as ce

# Target Encoding
Target encoding, also known as mean encoding, is a method for encoding categorical variables in machine learning. Here's a brief explanation:

1. **Transformation:** Target encoding replaces each category in a categorical variable with a numerical value. This value is computed based on the statistical properties of the target variable for each category.

2. **Supervised Learning:** It is primarily used in supervised learning tasks, where you have a target variable to predict, and you want to encode the categorical features to use them in machine learning models.

3. **Relationship Capture:** Target encoding captures the relationship between the categorical variable and the target variable by aggregating information from the target variable (e.g., mean, median, or other statistics) for each category.

4. **Common Usage:** It is often used with decision trees, random forests, and gradient boosting algorithms, as it can enhance the performance of these models when dealing with categorical features.

5. **Overfitting Concerns:** Care should be taken to prevent overfitting. Techniques like smoothing or regularization can be applied to mitigate the risk of overfitting when using target encoding.

In essence, target encoding helps convert categorical data into a numerical format, which is more suitable for machine learning algorithms, by summarizing the relationship between the categories and the target variable.
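The smoothing mentioned above explains why the encoded values in this notebook are not the raw per-subject means. The sketch below recomputes them by hand. It assumes a sigmoid-weighted blend of the category mean and the global mean, with `MIN_SAMPLES_LEAF = 20` and `SMOOTHING = 10`; these settings are not stated in the notebook and are chosen here because they reproduce its printed output.

```python
import math

subjects = ['Accounts', 'Business', 'Economics', 'Business',
            'Economics', 'Accounts', 'Accounts', 'Accounts']
marks = [50, 30, 70, 80, 45, 97, 80, 68]

# Assumed settings (not stated in the notebook).
MIN_SAMPLES_LEAF = 20
SMOOTHING = 10.0

# Global mean of the target, used as the prior for every category.
prior = sum(marks) / len(marks)

# Collect the target values observed for each category.
by_category = {}
for subject, mark in zip(subjects, marks):
    by_category.setdefault(subject, []).append(mark)

encoding = {}
for subject, values in by_category.items():
    n = len(values)
    category_mean = sum(values) / n
    # Sigmoid weight: small categories lean on the prior,
    # large categories lean on their own mean.
    weight = 1.0 / (1.0 + math.exp(-(n - MIN_SAMPLES_LEAF) / SMOOTHING))
    encoding[subject] = prior * (1.0 - weight) + category_mean * weight

for subject, value in encoding.items():
    print(f"{subject}: {value:.6f}")
```

With only two to four rows per subject, the weight stays small, so every encoded value sits close to the global mean of 65 rather than at the subject's own mean.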

In [5]:
# Define the data in a dictionary
data1 = {
    "Subjects": ['Accounts', 'Business', 'Economics', 'Business', 'Economics', 'Accounts', 'Accounts', 'Accounts'],
    "Marks": [50, 30, 70, 80, 45, 97, 80, 68],
}

# Create a DataFrame from the data
df1 = pd.DataFrame(data1)

# Print the original DataFrame
print("Original DataFrame:\n")
print(df1)

# Create a TargetEncoder object for the "Subjects" column
encoder = ce.TargetEncoder(cols=["Subjects"])

# Apply target encoding to the DataFrame based on the "Marks" column
df1 = encoder.fit_transform(df1, df1["Marks"])

# Print the DataFrame after target encoding
print("\nData Frame after Target Encoding:\n")
print(df1)

Original DataFrame:

    Subjects  Marks
0   Accounts     50
1   Business     30
2  Economics     70
3   Business     80
4  Economics     45
5   Accounts     97
6   Accounts     80
7   Accounts     68

Data Frame after Target Encoding:

    Subjects  Marks
0  66.469839     50
1  63.581489     30
2  63.936117     70
3  63.581489     80
4  63.936117     45
5  66.469839     97
6  66.469839     80
7  66.469839     68


# One Hot Encoding
One-hot encoding is a technique for converting categorical variables into a binary matrix (0s and 1s). Here's a concise explanation:

- **Binary Transformation:** Each category in a categorical variable is transformed into a binary column where only one column is "1" (hot) and the rest are "0" (not hot).

- **Suitability:** One-hot encoding is typically used for categorical variables where there is no intrinsic order or ranking among the categories. It's ideal for scenarios where categories are exclusive and don't have a natural numeric relationship.

- **Avoiding Order Bias:** Unlike label encoding, which assigns arbitrary numbers to categories, one-hot encoding ensures that the machine learning model doesn't interpret the categories as having a meaningful order.

- **Dimensionality Increase:** One-hot encoding can significantly increase the dimensionality of your dataset, which may not be suitable for high-cardinality categorical variables.

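- **Redundancy:** The one-hot columns are not independent of one another: every row sums to 1, so any one column can be derived from the rest. Linear models with an intercept can suffer from this perfect collinearity (the "dummy variable trap"), while tree-based models are generally unaffected.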

In summary, one-hot encoding is a method for converting categorical data into a format that is suitable for machine learning models, especially when categories don't have inherent order or ranking. It helps prevent the model from misinterpreting the data due to arbitrary numerical values.
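As a hand-rolled illustration of the idea (not how pandas implements it), the sketch below builds the one-hot matrix for this notebook's color list in plain Python. The columns are ordered alphabetically, which is an assumption made to line up with the `pd.get_dummies` output.

```python
colors = ['Red', 'Blue', 'Yellow', 'Green', 'Black']

# One column per unique category, in sorted order.
categories = sorted(set(colors))

# Each row has a 1 in the column of its own category and 0 elsewhere.
one_hot = [[1 if color == category else 0 for category in categories]
           for color in colors]

print(categories)
for row in one_hot:
    print(row)
```

Every row contains exactly one 1, which is where the name "one-hot" comes from.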

In [6]:
# Define the data in a dictionary
data2 = {
    "Color": ['Red', 'Blue', 'Yellow', 'Green', 'Black'],
}

# Create a DataFrame from the data
df2 = pd.DataFrame(data2)

# Print the original DataFrame
print("Original DataFrame:\n")
print(df2)

# Perform one-hot encoding on the "Color" column
# (dtype=int keeps the 0/1 output on pandas 2.x, where the default dtype is bool)
one_hot_encoded_data = pd.get_dummies(df2, columns=['Color'], dtype=int)

# Print the DataFrame after one-hot encoding
print("\nData Frame after One-Hot Encoding:\n")
print(one_hot_encoded_data)

Original DataFrame:

    Color
0     Red
1    Blue
2  Yellow
3   Green
4   Black

Data Frame after One-Hot Encoding:

   Color_Black  Color_Blue  Color_Green  Color_Red  Color_Yellow
0            0           0            0          1             0
1            0           1            0          0             0
2            0           0            0          0             1
3            0           0            1          0             0
4            1           0            0          0             0


# Dummy Encoding
Dummy encoding is a close relative of one-hot encoding. A variable with N categories is represented by N-1 binary columns instead of N: one category is chosen as the reference (baseline) and gets no column of its own, so it is encoded as a row of all zeros. Every other category has a single "1" in its own column. Note that the cell below calls `pd.get_dummies` without `drop_first=True`, so its output keeps all four columns and is therefore the full one-hot matrix; passing `drop_first=True` would drop the first category and give the N-1 form.

- **Binary Conversion:** Dummy encoding transforms categorical variables into binary columns.
- **Number of Columns:** N unique categories produce N-1 columns. Each row has at most one "hot" (1) column.
- **Difference from One-Hot Encoding:** One column is dropped. The dropped category is the reference level and is represented by all zeros.
- **No Redundancy:** Because one column is removed, the remaining columns no longer sum to a constant, which avoids the perfect collinearity that full one-hot encoding causes in linear models with an intercept.
- **Potential Dimensionality Increase:** Dummy encoding still adds N-1 columns per variable, which can be a problem for high-cardinality data.
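A minimal plain-Python sketch of N-1 dummy encoding, using this notebook's temperature list. Treating the alphabetically first category as the reference level is an assumption made for the example; in pandas the comparable call would be `pd.get_dummies(df3, columns=["Temperature"], drop_first=True)`.

```python
temperatures = ['Hot', 'Cold', 'Warm', 'Very Hot', 'Hot', 'Warm', 'Cold']

categories = sorted(set(temperatures))

# The first category becomes the reference level and gets no column.
reference, kept = categories[0], categories[1:]

# Rows for the reference category are all zeros.
dummy = [[1 if value == category else 0 for category in kept]
         for value in temperatures]

print("reference:", reference)
print("columns:", kept)
for value, row in zip(temperatures, dummy):
    print(value, row)
```

Here 'Cold' is the reference level, so both 'Cold' rows are encoded as three zeros.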

In [7]:
# Define the data in a dictionary
data3 = {
    "Temperature": ['Hot', 'Cold', 'Warm', 'Very Hot', 'Hot', 'Warm', 'Cold'],
}

# Create a DataFrame from the data
df3 = pd.DataFrame(data3)

# Print the original DataFrame
print("Original DataFrame:\n")
print(df3)

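# Encode the "Temperature" column (drop_first is not set, so all four columns are kept;
# dtype=int keeps the 0/1 output on pandas 2.x, where the default dtype is bool)
df_encoded = pd.get_dummies(df3, columns=["Temperature"], dtype=int)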

# Print the DataFrame after dummy encoding
print("\nData Frame after Dummy Encoding:\n")
print(df_encoded)

Original DataFrame:

  Temperature
0         Hot
1        Cold
2        Warm
3    Very Hot
4         Hot
5        Warm
6        Cold

Data Frame after Dummy Encoding:

   Temperature_Cold  Temperature_Hot  Temperature_Very Hot  Temperature_Warm
0                 0                1                     0                 0
1                 1                0                     0                 0
2                 0                0                     0                 1
3                 0                0                     1                 0
4                 0                1                     0                 0
5                 0                0                     0                 1
6                 1                0                     0                 0


# Binary Encoding
Binary encoding first assigns each category an integer code and then writes that code in binary, one bit per column. It is efficient for high-cardinality variables because the number of columns grows only logarithmically with the number of categories.

- **Binary Representation:** Each category is mapped to an integer, and the binary digits of that integer become the encoded columns.
- **Number of Columns:** Roughly ⌈log2(N)⌉ columns are needed for N unique categories. In the example below, 6 countries fit into 3 columns.
- **Efficiency:** It is more compact than one-hot encoding and dummy encoding when dealing with high-cardinality categorical variables, as it reduces dimensionality.
- **Relation to Order:** Binary encoding does not assume a natural order. The integer codes are assigned arbitrarily (in the output below, in order of first appearance), so two categories that share a bit are not necessarily related, although a model may treat them as similar.
- **Combination of Binary Values:** Each category corresponds to a unique combination of 0s and 1s in the binary columns.

In summary, binary encoding is a technique for converting categorical data into a compact binary format. It trades some interpretability for a much smaller number of columns than one-hot encoding.
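The two steps can be sketched in plain Python. This is an illustration rather than the `category_encoders` implementation; it assumes that codes are handed out in order of first appearance and start at 1, which is consistent with the output printed in this notebook.

```python
countries = ['Dubai', 'Malaysia', 'Hungary', 'China', 'Bhutan',
             'Dubai', 'Hungary', 'Malaysia', 'Australia']

# Step 1: assign integer codes in order of first appearance, starting at 1.
codes = {}
for country in countries:
    if country not in codes:
        codes[country] = len(codes) + 1

# Step 2: write each code in binary, most significant bit first.
n_bits = max(codes.values()).bit_length()
binary = [[(codes[country] >> shift) & 1 for shift in range(n_bits - 1, -1, -1)]
          for country in countries]

print(codes)
for country, bits in zip(countries, binary):
    print(country, bits)
```

Six categories need codes 1 through 6, and 6 is `110` in binary, hence three columns.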

In [8]:
# Define the data in a dictionary
data4 = {
    "Country": ['Dubai', 'Malaysia', 'Hungary', 'China', 'Bhutan', 'Dubai', 'Hungary', 'Malaysia', 'Australia'],
}

# Create a DataFrame from the data
df4 = pd.DataFrame(data4)

# Print the original DataFrame
print("Original DataFrame:\n")
print(df4)

# Create a BinaryEncoder object for the "Country" column
encoder = ce.BinaryEncoder(cols=['Country'])

# Perform binary encoding on the "Country" column
df_encoded = encoder.fit_transform(df4)

# Print the DataFrame after binary encoding
print("\nData Frame after Binary Encoding:\n")
print(df_encoded)

Original DataFrame:

     Country
0      Dubai
1   Malaysia
2    Hungary
3      China
4     Bhutan
5      Dubai
6    Hungary
7   Malaysia
8  Australia

Data Frame after Binary Encoding:

   Country_0  Country_1  Country_2
0          0          0          1
1          0          1          0
2          0          1          1
3          1          0          0
4          1          0          1
5          0          0          1
6          0          1          1
7          0          1          0
8          1          1          0


# Hash Encoding
Hash encoding is a technique used to convert categorical variables into numerical values by applying a hash function to the categories. Here's a concise explanation of hash encoding:

1. **Hashing Categories:** Each category is passed through a hash function, and the resulting value selects one of a fixed number of output columns. The mapping is deterministic, but it is not guaranteed to be unique for each category.

2. **Fixed Dimensionality:** Hash encoding typically reduces the dimensionality compared to one-hot encoding or dummy encoding because the output is fixed in size.

3. **Collisions:** Because the number of columns is fixed, different categories may be mapped to the same column. The model then cannot tell them apart. In this notebook's output, for example, 'January', 'March', and 'September' all fall into `col_4`.

4. **Memory-Efficient:** It can be memory-efficient when dealing with high-cardinality categorical variables because it doesn't create many new columns.

5. **Non-Interpretable:** The numerical values produced by the hash function are typically not interpretable as they lack any ordinal or semantic meaning.

In summary, hash encoding is a method to convert categorical data into a fixed-width numerical format. Collisions can merge distinct categories, and the columns provide no ordinal information. It's often used for dimensionality reduction in scenarios with high-cardinality categorical variables.
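The hashing idea can be sketched in plain Python. This is a simplified illustration, not the `category_encoders` implementation: the helper `hash_column` is hypothetical, and the choice of MD5 followed by a modulo is an assumption made for the example, so the column positions need not match the library's output.

```python
import hashlib

def hash_column(value, n_components):
    """Map a category to a column index (hypothetical helper for illustration)."""
    digest = hashlib.md5(str(value).encode("utf-8")).hexdigest()
    return int(digest, 16) % n_components

# Month list as it appears in this notebook (spelling unchanged).
months = ['January', 'April', 'March', 'April', 'Februay',
          'July', 'April', 'July', 'September']
N_COMPONENTS = 6

hashed = []
for month in months:
    row = [0] * N_COMPONENTS
    row[hash_column(month, N_COMPONENTS)] = 1
    hashed.append(row)

for month, row in zip(months, hashed):
    print(month, row)
```

The same month always lands in the same column, and the width stays at six no matter how many distinct months appear. With seven distinct values and six columns, at least one collision is unavoidable.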

In [9]:
pip install category_encoders



# This package helps prepare categorical data for use in machine learning models by converting them into a suitable format for analysis and prediction.

In [10]:
import category_encoders as ce
import pandas as pd

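# Create the DataFrame
# Note: 'Februay' is misspelled in the data; the output below was generated with this spelling
data4 = pd.DataFrame({'Month': ['January', 'April', 'March', 'April', 'Februay', 'July', 'April', 'July', 'September']})

# Create the hash encoder with six output columns
encoder = ce.HashingEncoder(cols=['Month'], n_components=6)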

In [11]:
encoder.fit_transform(data4)

   col_0  col_1  col_2  col_3  col_4  col_5
0      0      0      0      0      1      0
1      0      0      0      1      0      0
2      0      0      0      0      1      0
3      0      0      0      1      0      0
4      0      0      0      1      0      0
5      1      0      0      0      0      0
6      0      0      0      1      0      0
7      1      0      0      0      0      0
8      0      0      0      0      1      0


# Difference between all the techniques:
| Encoding Technique | Description | Use Cases | Pros | Cons |
|--------------------|-------------|-----------|------|------|
| **One-Hot Encoding** | Creates one binary column per category; each row has a single 1. | Nominal categorical data without a natural order. | Imposes no order on the categories. Compatible with most machine learning algorithms. | Can lead to high-dimensional datasets with many columns. The columns are collinear, which affects linear models with an intercept. |
| **Label Encoding** | Assigns a unique integer to each category or class. | Target labels, or ordinal data when the assigned integers match the natural order. | Simple and easy to implement. Adds no extra columns. | May introduce ordinality where it doesn't exist, potentially leading to misleading results. |
| **Dummy Encoding** | Creates N-1 binary columns for N categories; the dropped category is the reference level, encoded as all zeros. | Nominal categorical data, especially for linear and logistic regression. | Avoids the collinearity of full one-hot encoding. | Like one-hot encoding, can result in high dimensionality in the dataset. |
| **Hash Encoding** | Maps categories into a fixed number of columns using a hash function. | Nominal categorical data with many unique categories. | Reduces dimensionality compared to one-hot encoding, making it suitable for high-cardinality data. | May lead to collisions (different categories mapping to the same column), causing information loss. The columns are not interpretable. |
| **Target Encoding** | Replaces each category with a (usually smoothed) mean of the target variable for that category. | Supervised tasks, both regression and classification. | Incorporates target-related information into the encoding, potentially improving predictive performance. | Prone to overfitting and target leakage if not properly regularized. |
| **Binary Encoding** | Assigns each category an integer code and stores the code's binary digits in separate columns. | Nominal categorical data with many unique categories. | Needs only about log2(N) columns, far fewer than one-hot encoding. | Categories share bits arbitrarily, so the columns are harder to interpret and may suggest similarities that do not exist. |

Each encoding technique serves a specific purpose and is selected based on the type of categorical data and the machine learning algorithm you are using. Make sure to choose the one that best suits your data and model requirements.