# Data Preprocessing

Below are some important things to understand about data preprocessing in supervised machine learning.

For a general guide to data preprocessing, see: https://www.kaggle.com/code/farzadnekouei/flight-data-eda-to-preprocessing#Step-3-|-Dataset-Overview

## Categorical Variables Encoding

### Why:
- Machine Learning Algorithms Require Numerical Input: <span style="color:orange;">Most machine learning algorithms, particularly those based on mathematical computations such as linear regression, logistic regression, and support vector machines, require numerical input. These algorithms cannot directly process categorical data in its raw form.</span>

- Improves Model Performance: <span style="color:orange;">Encoding categorical variables can help improve the performance of a model. Proper encoding allows the model to interpret categorical data correctly, leading to more accurate predictions. For instance, one-hot encoding can help the model treat each category as distinct, while ordinal encoding can preserve the order in ordinal data.</span>

- Handles High Cardinality: <span style="color:orange;">Encoding techniques like target encoding or frequency encoding can be beneficial when dealing with high cardinality categorical variables (i.e., variables with many unique categories). These techniques can help reduce dimensionality while still capturing important information from the categories.</span>

- Prevents Ordinal Assumptions: <span style="color:orange;">One-hot encoding prevents the model from assuming any ordinal relationship between categories that are nominal. For example, encoding colors ("red", "green", "blue") with one-hot encoding ensures that the model does not treat "green" as being between "red" and "blue".</span>

- Enhances Interpretability: <span style="color:orange;">Some encoding methods, such as target encoding, can enhance the interpretability of the model by linking categories directly to the target variable. This can be useful for understanding the impact of different categories on the prediction.</span>

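- Reduces Overfitting: <span style="color:orange;">Compact encodings can lower the risk of overfitting. For example, binary encoding needs only about log2(n) columns for n categories, compared with n columns for one-hot encoding, and fewer features give the model less room to memorize the training data. The effect depends on the method: target encoding can itself cause overfitting if the category means are computed on the same rows the model is trained on.</span>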

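- Compatibility with Algorithms: <span style="color:orange;">Tree-based models (e.g., decision trees, random forests) are less sensitive to how categories are encoded, and some implementations (for example LightGBM and CatBoost) accept categorical features directly. Others, such as scikit-learn's decision trees and random forests, still require numeric input, so encoding remains necessary and a good choice of encoding can improve efficiency and performance.</span>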

- Data Consistency: <span style="color:orange;">Encoding ensures that the data fed into the model is consistent and in a suitable format. This consistency is crucial for maintaining the integrity of the training process and ensuring reliable predictions.</span>
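
The "Prevents Ordinal Assumptions" point above can be illustrated with a minimal sketch. The `colors` series is made-up example data, not from any dataset used here; it compares plain integer codes with one-hot columns for a nominal variable.

```python
import pandas as pd

# Hypothetical nominal variable: colors have no natural order
colors = pd.Series(["red", "green", "blue", "green"], name="color")

# Integer codes follow the sorted category order (blue=0, green=1, red=2),
# which a linear model would read as blue < green < red
int_codes = colors.astype("category").cat.codes

# One-hot encoding gives every category its own 0/1 column, so no order is implied
one_hot = pd.get_dummies(colors, dtype=int)

print(int_codes.tolist())
print(one_hot)
```

Every row of `one_hot` contains exactly one 1, so the categories are equally distant from each other, whereas the integer codes place "green" between "blue" and "red".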

In [9]:
# Import libraries  
import numpy as np 
import pandas as pd 
from sklearn import preprocessing 
from tabulate import tabulate

### Examples of Encoding

**1. Label Encoding**
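- Description: Assigns each unique category a different integer value
- Use Case: Suitable for ordinal categorical variables when the integers follow the inherent order of the categories (as enforced in the example below), and for encoding target labels. For nominal features, the implied order can mislead linear and distance-based models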


In [14]:
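cat_var = ["low", "medium", "high"]
cat_var = pd.Categorical(cat_var, categories=["low", "medium", "high"], ordered=True) #declare the order explicitly: low < medium < high
cat_var = cat_var.codes #position of each value in the declared category order
cat_var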

array([0, 1, 2], dtype=int8)
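
As a contrast to the ordered `pd.Categorical` above, here is a small sketch of what happens when the order is not declared. It uses scikit-learn's `LabelEncoder`, which is intended for target labels and assigns codes by sorted category name.

```python
from sklearn.preprocessing import LabelEncoder

levels = ["low", "medium", "high"]

encoder = LabelEncoder()
codes = encoder.fit_transform(levels)

# classes_ holds the categories in sorted (alphabetical) order
print(list(encoder.classes_))  # ['high', 'low', 'medium']
print(codes.tolist())          # [1, 2, 0]
```

Here "high" receives the smallest code, so the integers no longer reflect low < medium < high. When the order matters, it has to be supplied explicitly.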

**2. One-Hot Encoding**
- Description: Creates a new binary column for each category
- Use Case: Suitable for nominal categorical variables with no intrinsic order

In [19]:
from sklearn.preprocessing import OneHotEncoder

cat_var = [["apple"], ["banana"], ["melon"], ["orange"]] #OneHotEncoder expects a 2-D array: one row per sample
print(tabulate(cat_var))
encoder = OneHotEncoder(sparse_output=False) #return a dense array (sparse_output requires scikit-learn >= 1.2)
cat_var = encoder.fit_transform(cat_var)
print(tabulate(cat_var))

------
apple
banana
melon
orange
------
-  -  -  -
1  0  0  0
0  1  0  0
0  0  1  0
0  0  0  1
-  -  -  -


**3. Binary Encoding**
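- Description: Assigns each category an integer, writes that integer in binary, and stores each binary digit in its own column
- Use Case: Useful when there are a large number of categories and one-hot encoding would create too many columns. The number of columns grows only logarithmically with the number of categories (the 11 categories in the example below fit in 4 columns)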

In [25]:
import category_encoders as ce

df = pd.DataFrame({'cat_var':['green','red','blue','pink','black','white',
           'brown','purple','yellow','grey','wheat']})
print(tabulate(df))
encoder = ce.BinaryEncoder(cols=['cat_var'])
df = encoder.fit_transform(df)
print(tabulate(df))


--  ------
 0  green
 1  red
 2  blue
 3  pink
 4  black
 5  white
 6  brown
 7  purple
 8  yellow
 9  grey
10  wheat
--  ------
--  -  -  -  -
 0  0  0  0  1
 1  0  0  1  0
 2  0  0  1  1
 3  0  1  0  0
 4  0  1  0  1
 5  0  1  1  0
 6  0  1  1  1
 7  1  0  0  0
 8  1  0  0  1
 9  1  0  1  0
10  1  0  1  1
--  -  -  -  -
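
To make the mechanics visible, the sketch below reproduces the idea by hand with pandas and NumPy. It is an illustration of the technique under the assumption that codes start at 1 in order of first appearance, not a description of how `category_encoders` is implemented internally. The column names `bit_0` to `bit_3` are made up.

```python
import numpy as np
import pandas as pd

colors = pd.Series(['green', 'red', 'blue', 'pink', 'black', 'white',
                    'brown', 'purple', 'yellow', 'grey', 'wheat'])

# Step 1: give each category an integer code, starting at 1, in order of first appearance
codes = pd.factorize(colors)[0] + 1

# Step 2: number of binary digits needed to represent the largest code
n_bits = int(codes.max()).bit_length()

# Step 3: extract each bit, most significant first, into its own column
shifts = np.arange(n_bits - 1, -1, -1)
bits = (codes[:, None] >> shifts) & 1

binary_df = pd.DataFrame(bits, index=colors,
                         columns=[f"bit_{i}" for i in range(n_bits)])
print(binary_df)
```

Eleven categories need 4 columns here, versus 11 columns with one-hot encoding. The trade-off is that categories share columns, so individual columns are harder to interpret.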


**4. Frequency Encoding**
- Description: Replaces each category with its frequency in the dataset, either as a count or as a proportion (the example below uses proportions)
- Use Case: When the frequency of categories can provide useful information to the model

In [31]:
cat_var = np.random.choice(['green', 'blue', 'red'], size=20, p=[0.3, 0.5, 0.2]) #note: each value is drawn independently with probabilities p, so the observed frequencies will differ slightly from p
df = pd.DataFrame({'cat_var': cat_var})
print(tabulate(df.head(10)))
df['cat_var'] = df['cat_var'].map(df['cat_var'].value_counts(normalize=True))
print(tabulate(df.head(10)))

-  -----
0  blue
1  red
2  blue
3  blue
4  blue
5  blue
6  green
7  red
8  red
9  blue
-  -----
-  ----
0  0.45
1  0.2
2  0.45
3  0.45
4  0.45
5  0.45
6  0.35
7  0.2
8  0.2
9  0.45
-  ----
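
When the data is split into training and test sets, the frequencies are normally learned from the training set and then reused. The sketch below shows this with small made-up frames (`train`, `test`). Mapping unseen categories to 0 is one possible convention, chosen here as an assumption.

```python
import pandas as pd

# Hypothetical training and test data
train = pd.DataFrame({"color": ["blue", "blue", "red", "green", "blue"]})
test = pd.DataFrame({"color": ["red", "blue", "pink"]})  # "pink" is unseen

# Learn the relative frequencies from the training data only
freq = train["color"].value_counts(normalize=True)

train_encoded = train["color"].map(freq)

# Categories missing from the training data map to NaN; treat them as frequency 0
test_encoded = test["color"].map(freq).fillna(0.0)

print(test_encoded.tolist())
```

Note that categories with the same frequency ("red" and "green" here) become indistinguishable after encoding, which is a known limitation of this method.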


**5. Target Encoding (Mean Encoding)**
- Description: Replaces each category with the mean of the target variable for that category
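- Use Case: Useful for high cardinality categorical features, particularly in classification problems. Because the encoding uses the target, the category means should be computed on the training data only; otherwise information about the target leaks into the features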

In [33]:
cat_var = np.random.choice(['green', 'blue', 'red'], size=20, p=[0.3, 0.5, 0.2])
target_var = np.random.randint(0, 11, size=20)
df = pd.DataFrame({'cat_var': cat_var, 'target_var': target_var})
print(tabulate(df.head(10)))
target_mean = df.groupby('cat_var')['target_var'].mean()
df['cat_var'] = df['cat_var'].map(target_mean)
print(tabulate(df.head(10)))

-  -----  --
0  blue    8
1  blue    1
2  green   8
3  green   4
4  blue   10
5  blue    7
6  red     1
7  green   2
8  green   2
9  blue    9
-  -----  --
-  -----  --
0  6.3     8
1  6.3     1
2  3.875   8
3  3.875   4
4  6.3    10
5  6.3     7
6  5       1
7  3.875   2
8  3.875   2
9  6.3     9
-  -----  --
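
Category means based on only a few rows are noisy (for example, "red" appears once in the first ten rows above). A common remedy is to shrink each category mean toward the global mean. The sketch below is one simple variant under stated assumptions: the data is made up, and the smoothing strength `m` is a hypothetical parameter that would normally be tuned.

```python
import pandas as pd

# Hypothetical training data
train = pd.DataFrame({
    "color": ["blue", "blue", "blue", "green", "green", "red"],
    "target": [8, 1, 9, 4, 2, 1],
})

m = 2.0  # smoothing strength (assumed): larger values pull harder toward the global mean
global_mean = train["target"].mean()

# Per-category mean and row count, computed on the training data only
stats = train.groupby("color")["target"].agg(["mean", "count"])

# Weighted average of the category mean and the global mean
smoothed = (stats["count"] * stats["mean"] + m * global_mean) / (stats["count"] + m)

train_encoded = train["color"].map(smoothed)

# New data: unseen categories fall back to the global mean
new = pd.Series(["red", "pink"])
new_encoded = new.map(smoothed).fillna(global_mean)

print(smoothed)
print(new_encoded.tolist())
```

With a single "red" row, the smoothed value lands between that row's target (1) and the global mean, rather than trusting one observation completely. Training rows still contribute to their own encoding in this sketch; out-of-fold encoding is a further step that it does not cover.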


**6. Ordinal Encoding**
- Description: Assigns each unique category an integer value, respecting the order
- Use Case: Specifically for ordinal variables where categories have a defined order

In [42]:
from sklearn.preprocessing import OrdinalEncoder

cat_var = [['low'], ['medium'], ['high']] #2-D array: one row per sample
print(tabulate(cat_var))
encoder = OrdinalEncoder(categories=[['low','medium','high']]) #explicit category order: low < medium < high
cat_var = encoder.fit_transform(cat_var)
print(tabulate(cat_var))

------
low
medium
high
------
-
0
1
2
-
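
By default `OrdinalEncoder` raises an error when it meets a category that was not present during fitting. The sketch below shows the `handle_unknown="use_encoded_value"` option, which maps such values to a chosen sentinel. The value "extreme" and the choice of -1 as sentinel are illustrative assumptions.

```python
from sklearn.preprocessing import OrdinalEncoder

encoder = OrdinalEncoder(
    categories=[["low", "medium", "high"]],  # explicit order: low < medium < high
    handle_unknown="use_encoded_value",
    unknown_value=-1,                        # sentinel for unseen categories (assumed choice)
)
encoder.fit([["low"], ["medium"], ["high"]])

# "extreme" was not seen during fit, so it is encoded as -1
encoded = encoder.transform([["high"], ["low"], ["extreme"]])
print(encoded.ravel().tolist())  # [2.0, 0.0, -1.0]
```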


**7. Hashing Encoding**
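- Description: Applies a hash function to each category to map it to a column of a fixed-size vector
- Use Case: Useful when the number of unique categories is very large, or when not all categories are known in advance
- Note: The number of output columns (`n_features`) is a trade-off that depends on factors such as the size of the data and the number of unique categories. Fewer columns save memory but cause more hash collisions, where different categories share a column (in the output below, for example, `green` and `white` land in the same column and differ only in sign)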

In [84]:
from sklearn.feature_extraction import FeatureHasher


cat_var = [['green'], ['red'], ['blue'], ['pink'], ['black'], ['white'], ['brown'], ['purple'], ['yellow'], ['grey']] #each sample is a list of string tokens
print(tabulate(cat_var))
encoder = FeatureHasher(n_features=7, input_type='string')
cat_var_encoded = encoder.transform(cat_var).toarray() #FeatureHasher is stateless, so no fit step is needed
df = pd.DataFrame(cat_var_encoded)
df = pd.concat([pd.DataFrame(cat_var), df], axis=1)
print(tabulate(df))

------
green
red
blue
pink
black
white
brown
purple
yellow
grey
------
-  ------  -  --  --  --  --  --  -
0  green   0   0   0   0   0   1  0
1  red     0   0  -1   0   0   0  0
2  blue    0   0   0   0  -1   0  0
3  pink    0   0   0   1   0   0  0
4  black   0   0   0  -1   0   0  0
5  white   0   0   0   0   0  -1  0
6  brown   0   1   0   0   0   0  0
7  purple  0   0   0   0   0   0  1
8  yellow  0   0   0   0   1   0  0
9  grey    0  -1   0   0   0   0  0
-  ------  -  --  --  --  --  --  -
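
The effect of `n_features` on collisions can be checked directly. The sketch below counts how many distinct columns the ten colors occupy for a few vector sizes. The sizes 7, 16, and 64 are arbitrary choices for illustration, and no particular counts are claimed here since they depend on the hash function.

```python
from sklearn.feature_extraction import FeatureHasher

colors = ['green', 'red', 'blue', 'pink', 'black',
          'white', 'brown', 'purple', 'yellow', 'grey']
samples = [[c] for c in colors]  # each sample is a list with one string token

columns_used = {}
for n_features in (7, 16, 64):
    hasher = FeatureHasher(n_features=n_features, input_type='string')
    hashed = hasher.transform(samples).toarray()
    # Each row has a single non-zero entry (+1 or -1); record which column it is in
    used = {int(row.nonzero()[0][0]) for row in hashed}
    columns_used[n_features] = len(used)

# Fewer distinct columns than categories means some categories collided
for n_features, n_used in columns_used.items():
    print(f"n_features={n_features}: {n_used} distinct columns for {len(colors)} categories")
```

With 7 columns and 10 categories, collisions are unavoidable. Increasing `n_features` makes them less likely at the cost of a wider, sparser feature matrix.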
