# Data Encoding (Encoding Categorical Data)

- Because most machine learning models accept only numerical variables, preprocessing categorical variables is a necessary step. We need to convert these categorical variables to numbers so that the model can understand them and extract useful information.
- The performance of a machine learning model depends not only on the model and its hyperparameters but also on how we process and feed different types of variables to the model.
- Converting categorical data well not only improves model quality but also supports better feature engineering.

# What is categorical data?

- Categorical variables are usually represented as ‘strings’ or ‘categories’ and take a finite number of values. Examples: the city a person lives in, the department a person works in, the highest degree a person has, the grades of a student, etc.

## Ordinal Data: The categories have an inherent order.

- While encoding, one should retain the information about the order of the categories.

## Nominal Data: The categories do not have an inherent order.

- We only have to consider the presence or absence of each category; no notion of order is present.

## Categorical Data Encoding Techniques:

- Ordinal Data: Label Encoding, Ordinal Encoding
- Nominal Data: One-Hot Encoding, Dummy Encoding, Frequency Encoding, Target Encoding

# Label Encoding or Ordinal Encoding

- We use this categorical data encoding technique when the categorical feature is ordinal. In this case, retaining the order is important. Hence encoding should reflect the sequence.
- In Label encoding, each label is converted into an integer value.
  - Label encoding works directly when the alphabetical order of the labels already matches the intended order (for example, letter grades without + or -).
  - It can also be used on nominal data in two cases:
    - The X features are fed to tree-based models.
    - The column is the target variable (Y), because Y is not used in numerical comparisons or distance calculations.
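
To make the difference between the two encoders concrete, here is a minimal sketch on a made-up `degree` column (the data and the `degree_order` list are assumptions for illustration, not part of the loan dataset). It shows that `OrdinalEncoder` can be given the intended order explicitly, while `LabelEncoder` always numbers the labels alphabetically.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

# Toy data (hypothetical): education level has a natural order.
df = pd.DataFrame({"degree": ["Bachelor", "PhD", "High School", "Master", "Bachelor"]})

# Pass the intended order explicitly; without `categories=` the encoder
# sorts the labels alphabetically.
degree_order = ["High School", "Bachelor", "Master", "PhD"]
ordinal = OrdinalEncoder(categories=[degree_order])
df["degree_ordinal"] = ordinal.fit_transform(df[["degree"]]).ravel()

# LabelEncoder numbers the labels in alphabetical order. That is fine for a
# target column, but it does not reflect the real ranking of the degrees.
label = LabelEncoder()
df["degree_label"] = label.fit_transform(df["degree"])

print(df)
```

With the explicit order, "High School" gets the smallest code and "PhD" the largest; with alphabetical numbering, "Bachelor" comes first instead.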

# One Hot Encoding

- We use this categorical data encoding technique when the features are nominal.
- For each level of a categorical feature, we create a new variable. Each category is mapped with a binary variable containing either 0 or 1 (0 represents the absence, and 1 represents the presence of that category)
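
As a small illustration of the mapping described above, the following sketch one-hot encodes a made-up `city` column with `pd.get_dummies` (the column and its values are assumptions for illustration only).

```python
import pandas as pd

# Toy nominal feature (hypothetical values).
df = pd.DataFrame({"city": ["Cairo", "Paris", "Tokyo", "Paris"]})

# One new 0/1 column per category; 1 marks the category present in that row.
one_hot = pd.get_dummies(df["city"], prefix="city", dtype=int)

print(one_hot)
```

Each row contains exactly one 1, in the column of the category that row belongs to.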

## Advantages

- Does not assume any order or distribution: there is no fake ranking, so the model does not treat one category as “bigger” than another.
- Keeps all the information: every category is represented explicitly, and nothing is lost or merged.

## Disadvantages

- Often a poor fit for tree-based models. Trees prefer a single column they can split on, whereas one-hot encoding creates many sparse columns, which makes trees deeper and splits less efficient. Trees often work better with label encoding instead.
- Problematic for linear regression. If you create N columns for N categories, the columns always sum to 1, so each one is fully determined by the others. This causes multicollinearity unless dummy encoding is used.
  - Multicollinearity is a condition where two or more independent variables are highly correlated, causing instability in statistical models like linear regression.
 
# Dummy Encoding

- The dummy coding scheme is similar to one-hot encoding.
- For N categories in a variable, one-hot encoding uses N binary variables. Dummy encoding is a small improvement: it uses N-1 features to represent the same N labels/categories, and the dropped category is represented by all zeros.
- This fixes the linear regression (multicollinearity) problem of one-hot encoding.
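
The N versus N-1 difference can be checked on a toy column. This is a minimal sketch with made-up data (the `color` column is an assumption for illustration); it also shows why the full one-hot matrix is redundant for a linear model.

```python
import pandas as pd

# Toy nominal feature with N = 3 categories (hypothetical values).
df = pd.DataFrame({"color": ["blue", "green", "red", "green"]})

one_hot = pd.get_dummies(df["color"], dtype=int)                 # N columns
dummy = pd.get_dummies(df["color"], drop_first=True, dtype=int)  # N - 1 columns

# Every one-hot row sums to 1, so any column equals 1 minus the others.
row_sums = one_hot.sum(axis=1)

print(one_hot.shape, dummy.shape)
print(dummy)
```

In the dummy-encoded frame the dropped category ("blue", the first alphabetically) appears as a row of zeros, so no information is lost.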

## Drawbacks of One-Hot and Dummy Encoding

- Expands the feature space: more memory, slower training, and the curse of dimensionality, which gets worse for high-cardinality features.
- Does not add extra information while encoding: it only reformats the data and does not create new meaning or signal.

# Frequency Encoding

- It is a way to utilize the frequency of the categories as labels.
- Replace each category with the count (or proportion) of the observations that have that category in the dataset.

## Advantages

- Straightforward to implement.
- Does not expand the feature space.
- Can work well with tree-based algorithms.

## Disadvantages

- We can lose valuable information if two different categories have the same number of observations, because both are replaced with the same number.
- Does not handle new categories in the test set automatically, because their frequencies were not observed during training.
- No direct relationship to the target: it only shows popularity, not usefulness.
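
The sketch below works through frequency encoding on toy data, including the unseen-category problem. The `train`/`test` frames are made up, and filling unseen categories with 0 is only one possible convention, assumed here for illustration.

```python
import pandas as pd

# Toy data (hypothetical): "Lima" never appears in the training set.
train = pd.DataFrame({"city": ["Cairo", "Paris", "Paris", "Tokyo", "Paris", "Cairo"]})
test = pd.DataFrame({"city": ["Paris", "Lima"]})

# Learn the counts on the training data only.
counts = train["city"].value_counts()
train["city_freq"] = train["city"].map(counts)

# Unseen categories map to NaN; here they are assigned a count of 0.
test["city_freq"] = test["city"].map(counts).fillna(0).astype(int)

print(train)
print(test)
```

Passing `normalize=True` to `value_counts` would produce proportions instead of raw counts with the same mapping step.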

## When to Use it

- Nominal data
- High cardinality
- Tree-based models
- Memory is a concern

## When Not to Use it

- You need semantic meaning (the value should carry real-world meaning or information, not just a label or count).
- Categories are equally frequent (they would all receive the same encoded value).

# Target Encoding

- Replace each category with the mean target value for that category. We group the observations by category, calculate the mean of the target within each group, and then assign that mean to the category.
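
The group-then-average procedure can be written by hand in a few lines. This is a minimal sketch on made-up data (the `city` and `target` columns are assumptions for illustration); it computes the plain per-category mean with no smoothing.

```python
import pandas as pd

# Toy data (hypothetical): binary target per observation.
df = pd.DataFrame({
    "city":   ["Cairo", "Cairo", "Paris", "Paris", "Paris", "Tokyo"],
    "target": [1, 0, 1, 1, 0, 0],
})

# Mean of the target within each category.
category_means = df.groupby("city")["target"].mean()

# Assign each row the mean of its category.
df["city_te"] = df["city"].map(category_means)

print(category_means)
```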

## Advantages

- Does not expand the feature space.
- Creates a monotonic relationship between categories and the target because categories with higher target means receive larger encoded values, preserving a consistent directional relationship with the target.

## Disadvantages

- May lead to overfitting: the encoder can learn noise instead of a real pattern, especially when the dataset is small or a category appears very few times.
- Loss of information if two categories have the same target mean: they end up with the same encoded value, so different behaviors become indistinguishable.
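
One common way to reduce the overfitting risk for rare categories is to shrink each category mean toward the global mean. This technique is not described in these notes; the sketch below is an illustration on made-up data, and the smoothing weight `m` is a hypothetical parameter chosen arbitrarily.

```python
import pandas as pd

# Toy data (hypothetical): "Tokyo" appears only once.
df = pd.DataFrame({
    "city":   ["Cairo", "Cairo", "Paris", "Paris", "Paris", "Tokyo"],
    "target": [1, 0, 1, 1, 0, 0],
})

m = 2  # hypothetical smoothing weight: larger m pulls rare categories harder toward the global mean
global_mean = df["target"].mean()

stats = df.groupby("city")["target"].agg(["mean", "count"])

# Weighted average of the category mean and the global mean.
smoothed = (stats["count"] * stats["mean"] + m * global_mean) / (stats["count"] + m)
df["city_te_smoothed"] = df["city"].map(smoothed)

print(smoothed)
```

The single "Tokyo" row has a raw category mean of 0, but its smoothed value moves toward the global mean of 0.5, so one observation no longer dictates the encoding.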

# When to Use Each Type

- For most general machine learning tasks and low cardinality data (few unique values): Start with One-Hot Encoding.
- For classical statistics/econometrics linear models: Use Dummy Encoding to avoid multicollinearity issues.
- For high cardinality features (many unique values): Explore Frequency Encoding (simple and fast) or Target Encoding (powerful but requires careful implementation to prevent overfitting).

# Missing Values Imputation

## What is Imputation?

- Imputation is a technique used for replacing the missing data with some substitute value to retain most of the data/information of the dataset.
- These techniques are used because removing incomplete rows from the dataset every time is not feasible: it can shrink the dataset to a large extent, which not only risks biasing the dataset but can also lead to incorrect analysis.

## Why Is Imputation Important?

We use imputation because missing data can cause:

- Incompatibility with most of the Python libraries used in machine learning: many estimators in ML libraries (the most common is scikit-learn) have no provision for handling missing data automatically, which leads to errors.
- Distortion in the dataset: a large amount of missing data can distort the variable distribution (the data no longer reflects the true original distribution or relationships), i.e. it can increase or decrease the share of a particular category in the dataset.
- Effects on the final model: missing data can bias the dataset and lead to a faulty analysis by the model.
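
The incompatibility point is easy to reproduce. The following minimal sketch, on made-up numbers, fits a scikit-learn `LinearRegression` on a feature matrix containing one `NaN` and catches the resulting error.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data (hypothetical) with a single missing feature value.
X = np.array([[1.0], [2.0], [np.nan], [4.0]])
y = np.array([1.0, 2.0, 3.0, 4.0])

try:
    LinearRegression().fit(X, y)
    fit_failed = False
except ValueError as err:
    # scikit-learn validates the input and rejects NaN for this estimator.
    fit_failed = True
    print("fit failed:", err)
```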

# Types of Missing Data

## Missing Completely At Random (MCAR)

- When the absence of data is completely unrelated to both the observed and unobserved data. Example: missing values could occur due to technical issues like a system crash.

## Missing At Random (MAR)

- When the probability of being missing is the same only within groups defined by the observed data. Example: income is missing less often for younger people than for older people, so the probability of missing income is related to the observed variable "age".

## Missing Not At Random (MNAR)

- When the probability of missingness is related to the value of the missing data itself. Example: Patients with severe symptoms are less likely to report their health status, leading to missing data that depends on their condition.
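
The three mechanisms can be simulated on synthetic data to see how they differ. This is a sketch under stated assumptions: the `age` and `income` columns, the missingness probabilities, and the thresholds are all invented for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 10_000

# Synthetic data (hypothetical): fully observed before values are removed.
df = pd.DataFrame({
    "age": rng.integers(20, 80, size=n),
    "income": rng.normal(5000, 1500, size=n),
})

# MCAR: every row has the same 10% chance of losing its income value.
mcar_mask = rng.random(n) < 0.10

# MAR: the chance depends only on the observed variable "age".
older = (df["age"] >= 50).to_numpy()
mar_mask = rng.random(n) < np.where(older, 0.20, 0.05)

# MNAR: the chance depends on the income value itself.
high_income = (df["income"] > df["income"].quantile(0.8)).to_numpy()
mnar_mask = rng.random(n) < np.where(high_income, 0.40, 0.05)

# Series.mask replaces the values where the mask is True with NaN.
income_mcar = df["income"].mask(mcar_mask)
income_mar = df["income"].mask(mar_mask)
income_mnar = df["income"].mask(mnar_mask)

print("true mean:         ", df["income"].mean())
print("observed mean MCAR:", income_mcar.mean())
print("observed mean MNAR:", income_mnar.mean())
```

Under MNAR the observed mean is pulled down because high incomes are removed more often, which is why this mechanism is the hardest to correct with simple imputation.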

# Imputation Techniques

## Mean/Median Imputation

- Mean/median imputation consists of replacing all occurrences of missing values (NA) within a variable with
  - the mean (if the variable has a Gaussian (normal) distribution)
  - or the median (if the variable has a skewed distribution)
- Mean/median imputation assumes that the data are missing completely at random (MCAR).

### Advantages

- Easy to implement
- Fast way of obtaining complete datasets

### Disadvantages

- Distortion of original variance
- Distortion of covariance with remaining variables within the dataset
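
A small sketch on made-up numbers shows both the choice between mean and median and the variance distortion listed above. The `income` values, including the single large outlier, are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy skewed column (hypothetical): one very large value and two gaps.
income = pd.DataFrame(
    {"income": [2000.0, 2500.0, 3000.0, np.nan, 3500.0, np.nan, 40000.0]}
)

mean_imputer = SimpleImputer(strategy="mean")
median_imputer = SimpleImputer(strategy="median")

income["mean_filled"] = mean_imputer.fit_transform(income[["income"]]).ravel()
income["median_filled"] = median_imputer.fit_transform(income[["income"]]).ravel()

# Filling gaps with a constant adds points at the centre, which shrinks the variance.
var_before = income["income"].var()
var_after = income["mean_filled"].var()

print(income)
print(var_before, var_after)
```

Because of the outlier, the mean fill is far above most observed values, while the median fill stays close to the bulk of the data.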

## Arbitrary/Constant Value Imputation

- This technique is important because it can handle both numerical and categorical variables. We group the missing values in a column and assign them a new value that is far outside the range of that column. Typical choices are values like 99999999 or -9999999 for numerical variables and “Missing” or “Not defined” for categorical variables.
- Assumption: the data are missing not at random (MNAR).
- The missing data is imputed with an arbitrary value that is neither part of the dataset nor its mean, median, or mode.
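
As an illustration of handling both variable types, the following sketch applies `SimpleImputer(strategy="constant")` to one numerical and one categorical toy column. The data and the fill values `-9999` and `"Missing"` are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy data (hypothetical) with one gap in each column.
df = pd.DataFrame({
    "income": [2000.0, np.nan, 3500.0],
    "city": pd.Series(["Cairo", np.nan, "Paris"], dtype=object),
})

# A numeric sentinel far outside the observed range, and a new label for categories.
num_imputer = SimpleImputer(strategy="constant", fill_value=-9999)
cat_imputer = SimpleImputer(strategy="constant", fill_value="Missing")

df["income"] = num_imputer.fit_transform(df[["income"]]).ravel()
df["city"] = cat_imputer.fit_transform(df[["city"]]).ravel()

print(df)
```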

### Advantages

- Easy to implement.
- We can use it in production.
- It retains the importance of “missing values” if it exists.

### Disadvantages

- Can distort original variable distribution.
- Arbitrary values can create outliers.
- Extra caution required in selecting the Arbitrary value.

## Mode/Frequent Category Imputation

- This technique replaces the missing values with the most frequent value, i.e. the mode of that column. It is also referred to as mode imputation.
- Assumption: the data are missing at random (MAR).
- There is a high probability that the missing data looks like the majority of the data.

### Advantages

- Implementation is easy.
- We can obtain a complete dataset in very little time.
- We can use this technique in the production model.

### Disadvantages

- The higher the percentage of missing values, the higher will be the distortion.
- May lead to over-representation of a particular category.
- Can distort original variable distribution.
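
The over-representation effect can be seen on a tiny made-up column. This sketch fills the gaps with the mode and compares the share of the most frequent category before and after (the `city` values are assumptions for illustration).

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy categorical column (hypothetical) with two missing values.
df = pd.DataFrame(
    {"city": pd.Series(["Paris", "Cairo", np.nan, "Paris", np.nan], dtype=object)}
)

mode_imputer = SimpleImputer(strategy="most_frequent")
df["city_filled"] = mode_imputer.fit_transform(df[["city"]]).ravel()

# Share of "Paris" among observed values vs. after imputation.
share_before = (df["city"] == "Paris").sum() / df["city"].notna().sum()
share_after = (df["city_filled"] == "Paris").mean()

print(df)
print(share_before, share_after)
```

Every gap becomes "Paris", so its share rises from two of three observed values to four of five rows.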

In [43]:
import pandas as pd
import seaborn as sns
from sklearn.impute import SimpleImputer
import category_encoders as ce
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import OrdinalEncoder

In [11]:
loanData = pd.read_csv("Loan_Default.csv")
loanData.drop(['ID', 'year'], axis = 1, inplace = True) # axis 1 to drop column

In [15]:
categoricalFeauters = loanData.select_dtypes(include = ['object']).columns.tolist()

In [17]:
OridnalFeauters = ['age'] 
NominalFeauters = categoricalFeauters.copy()
NominalFeauters.remove('age')

# Ordinal Encoding

In [None]:
encoder = OrdinalEncoder()
loanData[OridnalFeauters] = encoder.fit_transform(loanData[OridnalFeauters])

# the fit_transform() function does two things at once:
# fit : learns the categories in each column (sorted alphabetically unless `categories=` is given)
# transform : converts the categories into ordered numeric values

# Nominal Encoding

## Two Methods of One Hot Encoding

### First Method

In [19]:
# using category_encoders library

loanData1 = loanData.copy()
OHE = ce.OneHotEncoder(cols=NominalFeauters, handle_unknown='return_nan', return_df=True, use_cat_names=True)
# creates one hot encoder

# handle_unknown controls how categories that appear at test time but not in the training data are handled
# return_df=True returns a pandas DataFrame instead of a NumPy array
# use_cat_names=True makes the encoded column names include the original category names

encoded_categories = OHE.fit_transform(loanData1[NominalFeauters])

loanData1.drop(NominalFeauters, axis=1, inplace=True)
loanData1 = pd.concat([loanData1, encoded_categories], axis=1)

### Second Method

In [25]:
# Using sklearn library

loanData2 = loanData.copy()
OHE = OneHotEncoder(handle_unknown='ignore', sparse_output=False)

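# sparse_output=False: by default, OneHotEncoder returns a sparse matrix (memory-efficient);
# setting this to False returns a dense NumPy array instead

encoded_categories = pd.DataFrame(
    OHE.fit_transform(loanData2[NominalFeauters]),
    columns=OHE.get_feature_names_out(NominalFeauters),  # keep readable column names
    index=loanData2.index,  # align rows with loanData2 for the concat below
)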

loanData2.drop(NominalFeauters, axis=1, inplace=True)
loanData2 = pd.concat([loanData2, encoded_categories], axis=1)

## Dummy Encoding

In [26]:
loanData3 = loanData.copy()

encoded_categories = pd.get_dummies(data=loanData3[NominalFeauters], drop_first=True)

loanData3.drop(NominalFeauters, axis=1, inplace=True)
loanData3 = pd.concat([loanData3, encoded_categories], axis=1)

## Frequency Encoding

In [28]:
loanData4 = loanData.copy()

for c in NominalFeauters:
    # normalize=True gives the proportion of each category instead of the raw count
    freq = loanData4[c].value_counts(normalize=True)
    loanData4[c+'_freq'] = loanData4[c].map(freq)

loanData4.drop(NominalFeauters, axis=1, inplace=True)

## Target Encoding

In [32]:
loanData5 = loanData.copy()
TE = ce.TargetEncoder(cols=NominalFeauters)
loanData5 = TE.fit_transform(loanData5, loanData5['loan_amount'])

# TargetEncoder encodes against one target column at a time

# Missing Values Imputation

In [33]:
loanData4.isna().sum()

loan_amount                           0
rate_of_interest                  36439
Interest_rate_spread              36639
Upfront_charges                   39642
term                                 41
property_value                    15098
income                             9150
Credit_Score                          0
age                                 200
LTV                               15098
Status                                0
dtir1                             24121
loan_limit_freq                    3344
Gender_freq                           0
approv_in_adv_freq                  908
loan_type_freq                        0
loan_purpose_freq                   134
Credit_Worthiness_freq                0
open_credit_freq                      0
business_or_commercial_freq           0
Neg_ammortization_freq              121
interest_only_freq                    0
lump_sum_payment_freq                 0
construction_type_freq                0
occupancy_type_freq                   0


In [34]:
high_missing_cols = ['rate_of_interest', 'Interest_rate_spread', 'Upfront_charges', 'property_value', 'LTV', 'dtir1']
cols_to_be_imputed = ['term', 'income', 'age', 'loan_limit_freq', 'approv_in_adv_freq', 'loan_purpose_freq', 'Neg_ammortization_freq', 'submission_of_application_freq']

In [35]:
#dropping columns with high missing values

loanData4.drop(columns=high_missing_cols, inplace=True)  # columns= already implies axis=1

In [39]:
missingRows = loanData4[loanData4[cols_to_be_imputed].isnull().any(axis=1)][cols_to_be_imputed]

## Constant Imputer

In [44]:
loanData6 = loanData4.copy()
const = SimpleImputer(strategy='constant', fill_value = 99999)
loanData6[cols_to_be_imputed] = const.fit_transform(loanData6[cols_to_be_imputed])

## Mode Imputer

In [45]:
loanData7 = loanData4.copy()
mode_imputer = SimpleImputer(strategy='most_frequent')
# fit on loanData7 itself; loanData6 has already been filled with the constant value
loanData7[cols_to_be_imputed] = mode_imputer.fit_transform(loanData7[cols_to_be_imputed])

## Mean/Median Imputer

In [46]:
loanData8 = loanData4.copy()

for c in loanData8:
    # call .mean() with parentheses; the bare attribute `.mean` is a method object, not a number
    # (use .median() instead for skewed columns)
    loanData8[c] = loanData8[c].fillna(loanData8[c].mean())

In [47]:
loanData8.isna().sum()

loan_amount                       0
term                              0
income                            0
Credit_Score                      0
age                               0
Status                            0
loan_limit_freq                   0
Gender_freq                       0
approv_in_adv_freq                0
loan_type_freq                    0
loan_purpose_freq                 0
Credit_Worthiness_freq            0
open_credit_freq                  0
business_or_commercial_freq       0
Neg_ammortization_freq            0
interest_only_freq                0
lump_sum_payment_freq             0
construction_type_freq            0
occupancy_type_freq               0
Secured_by_freq                   0
total_units_freq                  0
credit_type_freq                  0
co-applicant_credit_type_freq     0
submission_of_application_freq    0
Region_freq                       0
Security_Type_freq                0
dtype: int64