# Feature Engineering Intro I
<hr style="border:2px solid black">

## Introduction

What is our __goal__ when we __train a machine learning model__ (ML)?

<img src = "./images/basic_model.png" width=400>

The primary goal when training a machine learning model is to develop a system that can accurately make predictions or decisions based on input data. 
But how do we improve the accuracy of our predictions?

**Selecting the right model** is essential, but it's only the beginning. Beyond model selection, we must explore other strategies to refine our model's learning, resulting in more **accurate predictions**.

Consider the learning process of the model:

When provided with the `X_train` input, the algorithm determines the optimal parameters to align the model's output with the known `y_train` values.

To elevate our model from just okay to truly effective, we must actively support its learning process.

So, what strategies can we employ to achieve this?


## What is Feature Engineering?
The diagram below is **over-simplified**. Simply feeding our algorithm with all the raw data results in a model that is equally raw and messy. Instead, we need to **clean up** the data and make **creative decisions** to select the **key features** that will enable the model to accurately predict outcomes.

<img src = "./images/feature_eng.png" width=500>

Some aspects of **feature engineering** are methodical and consistent. We'll explore these first.

The other aspects are more similar to an **art form**, requiring a deep understanding of the subject and a bit of **human intuition**.

The process of **feature engineering** can be one of the most **time-consuming** parts of modeling, but it's essential. Without it, your model will struggle to discern patterns amidst the noise.


**Estimated time spent with data organizing**

<img src = "./images/stacked-chart.jpeg" width=400>

### Feature engineering techniques

 |       technique      |                                        usefulness                                |
 |:--------------------:|:--------------------------------------------------------------------------------:|
 |     `Imputation`     |                    fills in missing values in data                    |
 |   `Discretization`   |                groups a feature in some logical fashion into bins                |
 |`Categorical Encoding`|encodes categorical features into numerical values|
 |  `Feature Splitting` |splits a feature into parts|
 |   `Feature Scaling`  |handles the sensitivity of ML algorithms to the scale of input values| 
 |`Feature Expansion`|derives new features from existing ones|
 | `Log Transformation` |deals with ill-behaved (skewed or heteroscedastic) data       |
 |   `Outlier Handling` |takes care of unusually high/low values in the dataset|
 | `RBF Transformation` |uses a continuous distribution to encode ordinal features|

### Feature engineering best practices

#### 1. **Split Dataset** into Train and Test sub-samples as early as possible

This process is flexible (for example, you could remove NaNs from the entire dataset up front), but in the interest of good machine learning habits it is generally better practice to perform this step **after splitting**. If you remove or impute missing values before splitting, information from the test set could influence the training process, leading to overly optimistic performance estimates.

#### 2. **Feature Engineering** Includes any pre-processing techniques, such as:

- Dropping missing values
- Converting strings or non-numeric values into numeric values
- Combining features
- Creating new features

#### 3. **Feature Engineer Test Data** the same way as train data

Make sure to process the test data in the same way as training data.



<hr style="border:2px solid black">

## Example: Penguin Data

#### Load Packages

In [None]:
# data analysis stack
import numpy as np
import pandas as pd

# data visualization stack
import matplotlib.pyplot as plt

import seaborn as sns
sns.set_style('whitegrid')

from sklearn.model_selection import train_test_split

# miscellaneous
import warnings
warnings.filterwarnings("ignore")

#### Load Data

In [None]:
df = pd.read_csv('./data/penguins.csv')
df.head()

#### Quick Exploration

In [None]:
df.info()

In [None]:
df.describe()

#### Features and Target


In [None]:
numerical_features = [
    'bill_length_mm',
    'bill_depth_mm',
    'flipper_length_mm'
]

categorical_features = [
    'species',
    'island',
    'sex'
]

features = numerical_features + categorical_features

target_variable = 'body_mass_g'

#### Feature-Target separation

In [None]:
# Feature matrix 
X = df[features]

# Target column
y = df[target_variable]

#### Train-Test Split

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.25,random_state=42, shuffle=True)

print("Train shape:", X_train.shape)
print("Test shape:", X_test.shape)

This code snippet **splits** the dataset into training and testing sets, where 25% of the data is reserved for testing. The `random_state` parameter ensures that the **split** is reproducible: the same random split will occur each time the code is run. Setting `shuffle=True` (the default) shuffles the data before the split.


For teaching purposes, we'll demonstrate how to add a **validation set** in addition to the usual **train-test split**. The validation set is important to properly **evaluate** and **fine-tune** a model before final testing.

In [None]:
# Further split the training data into training and validation sets
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.33, random_state=42)

print("Train shape:", X_train.shape)
print("Validation shape:", X_val.shape)

### Exploratory Data Analysis

For **Exploratory Data Analysis (EDA)**, you should concatenate `X_train` and `y_train`. This also allows you to analyze the relationship between the features and the target variable.

In [None]:
# Assuming X_train is a DataFrame and y_train is a Series
df_train = pd.concat([X_train, y_train], axis=1)

print("Combined train data shape:", df_train.shape)

**Show Some Plots**

In [None]:
sns.pairplot(df_train,corner=True,hue='island');

The above **pairplot** shows both distribution of single variables and the relationships between two variables. 

1. **Scatter Plots**:
   - Each plot below the diagonal shows how two variables relate to each other. For example, in the plot where 'flipper_length_mm' meets 'bill_length_mm', each point represents those two measurements for one observation.

2. **Histograms/Density Plots**:
   - The plots along the diagonal show how frequently different values occur for a single variable. For instance, the plot for 'bill_depth_mm' displays the distribution of bill depths among all observations.

3. **Color**:
   - The colors represent different categories, in this case, categorized by 'island'. This helps to quickly see if measurements vary noticeably by island, with each color representing a different island.


You can also create other pairplots that use the `sex` or `species` variable as the extra `hue` dimension.

In [None]:
df_train

>**Note** that the **indices** in the DataFrame shown above appear in a **random order**, a result of **shuffling** prior to the data split.


#### Issue with the Data

**Many models cannot handle** missing values, **categorical features** with non-numeric values, or **metric features** with varying magnitudes. Proper data preprocessing is essential to prepare the data for these models.


**Check Missing Values**

First, let's take a look at what's missing:

In [None]:
df_train.isna().sum()

`isna()` and `isnull()` are identical methods that produce a boolean mask where `True` marks a missing value. If we want to turn this into a useful view, we can filter using these masks.

In [None]:
# To find the rows with NaN
df_train.loc[df_train.isna().any(axis=1)]

In [None]:
# check missing values graphically
plt.figure(figsize=(7,5), dpi=100)
sns.heatmap(df_train.isna());

So, looking at either of these, we can see two missing values in both `bill_length_mm` and `flipper_length_mm` columns and six in the `sex` column. How do we deal with them?

<hr style="border:2px solid black">

## 1. Imputation - Filling in the Blanks

#### What can we do with missing information?
There are __a few strategies__:

- __Drop__:
    + rows with missing values
    + columns with a lot of missing values
- __Fill with a value__:
    + __mean__/__median__/__mode__ of a column
    + __interpolate__ / __back fill__ / __forward fill__
    + __mean__/__median__/__mode__ of a group

- With `pandas`: 
    - `df.isna()`: checks for NaNs, then do a sum or a heatmap
    - `df.dropna()`: drop NaNs
    - `df.fillna()`: fill NaNs

By default `dropna()` and `fillna()` return a new DataFrame. To modify the DataFrame directly, you need to reassign the result or pass `inplace=True`.
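The strategies listed above can be sketched on a small toy DataFrame. The column names and values below are hypothetical (they are not taken from the penguins data) and only illustrate dropping rows, filling with a column median, and filling with a group median.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with two missing values in "mass"
toy = pd.DataFrame({
    "species": ["A", "A", "A", "B", "B", "B"],
    "mass": [10.0, np.nan, 14.0, 30.0, 34.0, np.nan],
})

# Strategy 1: drop every row that contains a NaN
dropped = toy.dropna()

# Strategy 2: fill with the median of the whole column
filled_global = toy["mass"].fillna(toy["mass"].median())

# Strategy 3: fill with the median of the group each row belongs to
group_median = toy.groupby("species")["mass"].transform("median")
filled_group = toy["mass"].fillna(group_median)

print(filled_global.tolist())  # overall median is 22.0
print(filled_group.tolist())   # group medians are 12.0 (A) and 32.0 (B)
```

The group-wise fill keeps the imputed values close to the other members of the same group, which is often more plausible than a single global value.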

---

#### 1.1 `SimpleImputer`

We can use the scikit-learn  <a href = "https://scikit-learn.org/stable/modules/generated/sklearn.impute.SimpleImputer.html">`SimpleImputer()`</a>  to quickly impute the missing data in a column. There are four strategies available:

 * `strategy = 'mean'` - **the default option**, numeric only
 
 * `strategy = 'median'` - numeric only
 
 * `strategy = 'most_frequent'` - mode, numeric or categorical
 
 * `strategy = 'constant'` - needs additional arg `fill_value`, numeric or categorical

If your missing values aren't NaN (e.g. None, 0, 999, "badvalue"), you may need to use the `missing_values` argument to let the imputer know what to look for.
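As a minimal sketch of the last two points, the example below uses a made-up sentinel value (`999`) and a made-up fill label (`"unknown"`). Both are assumptions chosen for illustration and do not come from the penguins data.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical numeric column where 999 marks a missing entry
ages = np.array([[25], [999], [40], [31]])

# Tell the imputer that 999 (not NaN) is the missing-value marker
sentinel_imputer = SimpleImputer(missing_values=999, strategy="median")
ages_imputed = sentinel_imputer.fit_transform(ages)
print(sentinel_imputer.statistics_)  # median of the non-missing values

# Hypothetical categorical column with a NaN entry
sex = np.array([["male"], [np.nan], ["female"]], dtype=object)

# strategy='constant' replaces missing entries with the given fill_value
constant_imputer = SimpleImputer(strategy="constant", fill_value="unknown")
sex_imputed = constant_imputer.fit_transform(sex)
print(sex_imputed.ravel())
```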

In [None]:
from sklearn.impute import SimpleImputer

**Sex Column Imputation with the most frequent value**


In [None]:
X_train['sex'].value_counts(dropna=False)

In [None]:
# Instantiating a SimpleImputer object
sex_imputer = SimpleImputer(strategy='most_frequent')#.set_output(transform='pandas')

In [None]:
# Fit the variable imputer on the 'sex' column training data
sex_imputer.fit(X_train[['sex']])

`.fit` teaches the imputer what to insert. In this example the imputer scans through the `sex` column in the training dataset (`X_train`) to determine the most frequent value in that column. After fitting, the imputer internally stores the mode of that column in its `statistics_` attribute.

In [None]:
sex_imputer.statistics_

In [None]:
# Applying the transformation to the 'sex' column of the training data using the pre-fitted imputer.
sex_imputed_train = sex_imputer.transform(X=X_train[['sex']])
sex_imputed_train

The `transform` method uses a pre-trained imputer to fill in the `missing values` in the `sex` column. By default, this method returns a numpy array rather than a pandas DataFrame. If you need to convert this numpy array back into a DataFrame, it's important to ensure that the index of the new DataFrame aligns with the original one to maintain data consistency.

In [None]:
# The long way to the dataframe  
sex_imputed_df_train = pd.DataFrame(data=sex_imputed_train, columns=sex_imputer.get_feature_names_out(), index=X_train.index)
sex_imputed_df_train

In [None]:
sex_imputed_df_train.value_counts(dropna=False)

**How should we impute missing values in the test data?**

When imputing missing values - or applying other data transformations to the **test data**, it is important to **avoid using any information outside of the training data**. This helps us to avoid **data leakage** during the model building process. Therefore, you should apply only those transformations to the test data that are based on parameters established from the training data. Thus, in the context of scikit-learn, we use only the __transform method__ of the imputer or other transformation tools.

üö®üö®üö®**Very Important**üö®üö®üö®

As shown in the [machine learning workflow](../machine_learning_workflow.md#the-machine-learning-workflow) it is good practice to unlock and transform the test data only at the very end.

In [None]:
sex_imputed_test = sex_imputer.transform(X_test[['sex']])
sex_imputed_test

In [None]:
sex_imputed_df_test = pd.DataFrame(data=sex_imputed_test,columns=sex_imputer.get_feature_names_out(), index=X_test.index)
sex_imputed_df_test

---

**Imputation of 'Flipper Length' and 'Bill Depth' Columns Using Median Values**

We will follow the same steps we used for the 'sex' column to address missing values in the 'flipper length' and 'bill depth' columns. But in this case we will use the **median imputation** strategy.

In [None]:
# Instantiating a SimpleImputer object
flipper_bill_imputer = SimpleImputer(strategy='median').set_output(transform='pandas')
flipper_bill_imputer

The `set_output()` method sets the default output of the `transform()` method to a pandas DataFrame.

In [None]:
# Fit the imputer on the 'flipper_length_mm' and 'bill_depth_mm' columns of the training data
flipper_bill_imputer.fit(X_train[['flipper_length_mm','bill_depth_mm']])

In [None]:
# Check the median parameters stored in the imputer after the fit
flipper_bill_imputer.statistics_

In [None]:
X_train[['flipper_length_mm','bill_depth_mm']].median()

In [None]:
# Apply the pre-fitted imputer to the 'flipper_length_mm' and 'bill_depth_mm' columns of the training data
flipper_bill_imputed_df_train = flipper_bill_imputer.transform(X_train[['flipper_length_mm','bill_depth_mm']])
flipper_bill_imputed_df_train.isna()

<hr style="border:2px solid black">

## 2. Categorical encoding - Replacing categories with numbers


Most algorithms aren't capable of handling strings, but there are some helpful tools to convert them into integers.


### 2.1 `OneHotEncoder` for **nominal variables** (categories without inherent order)
We can use the scikit-learn <a href = "https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html?highlight=onehotencoder#sklearn.preprocessing.OneHotEncoder">`OneHotEncoder()`</a> to get dummy values for our categorical data. It converts each category within a column into a separate binary column. Each category is represented by a **1** in its respective column for instances where it appears, and **0** where it does not. There are two options for choosing which columns to drop to avoid perfect collinearity, a common statistical issue for some models:
 
 * `drop = 'first'` - drops the first category in each feature. If there's only one category, it will drop the feature altogether.
 
 * `drop = 'if_binary'` - will only drop a category if the feature is binary (i.e. - yes/no, on/off, etc). Features with one or more than two categories will remain untouched.

After transforming our dataset, we can use `get_feature_names_out()` to get an array of feature names to label the columns.



In [None]:
from sklearn.preprocessing import OneHotEncoder

What are the **unique categories** in the `species` column/feature?

In [None]:
X_train['species'].unique()

In [None]:
# Instantiating a OneHotEncoder object
species_ohencoder = OneHotEncoder(drop='first', handle_unknown='ignore', sparse_output=True)

+ The `handle_unknown` parameter tells the encoder how to deal with categories in new data that weren't present during its initial fitting.
+ When you set `sparse_output=True`, the encoder returns the transformed data as a **sparse matrix** instead of a dense numpy array.
  + **Dense Matrix**
    + Stores all values explicitly, including zeros
    + Very memory-intensive when the matrix has a lot of zeros
  + <a href="https://en.wikipedia.org/wiki/Sparse_matrix#:~:text=In%20numerical%20analysis%20and%20scientific,of%20the%20elements%20are%20zero.">**Sparse Matrix**</a>
    + Stores only the locations and values of non-zero elements
    + Reduces memory usage
    + Speeds up operations

In [None]:
# Fit the variable encoder on the 'species' column of the training data
species_ohencoder.fit(X=X_train[['species']])

In [None]:
# unique categories of the train species column learnt and stored in the encoder
species_ohencoder.categories_

In [None]:
# new columns/features created for the categories
species_ohencoder.get_feature_names_out()


**Note that `species_Adelie` has been dropped**

In [None]:
species_encoded_train_sparse = species_ohencoder.transform(X=X_train[['species']])
species_encoded_train_sparse

We have a **Compressed Sparse Row (CSR)** sparse matrix with dimensions **(nrows, ncols) = (256, 2)**, where most of the entries are zero. This matrix efficiently stores only the 149 non-zero entries, as floating-point numbers.

In [None]:
# Convert the sparse matrix to a dense NumPy array
# (toarray() returns an ndarray, whereas todense() returns a legacy np.matrix)
species_encoded_train_dense = species_encoded_train_sparse.toarray()
species_encoded_train_dense

In [None]:
# Convert the dense matrix to a DataFrame, keeping the original index
species_encoded_train_df = pd.DataFrame(data=species_encoded_train_dense, columns=species_ohencoder.get_feature_names_out(), index=X_train.index)
species_encoded_train_df

In [None]:
# Let's see the encoded species columns for two penguins
species_encoded_train_df.loc[[24,323]]

**How to read the 👆🏽 DataFrame**?

+ Observation (the penguin) no. 24 belongs to neither the `Chinstrap` nor the `Gentoo` species, so it must be `Adelie`
+ Observation (the penguin) no. 323 belongs to the species `Gentoo`

### 2.2 `OrdinalEncoder` for **Ordinal Variables** (Categories with Inherent Order)


We can use the scikit-learn <a href="https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OrdinalEncoder.html">`OrdinalEncoder()`</a> to encode ordinal categories into meaningful numeric values. This encoder is particularly useful for categories that have a natural ordered relationship. Here are some key points:

* `categories` - specifies the order of categories explicitly if the default lexical order is not desired.

* `dtype` - the data type of the output (default is `float64`).

* `handle_unknown` - decides how to handle unknown categories that appear during transformation:
  
  * `handle_unknown='error'` - **the default option**, throws an error if an unknown category is encountered.
  
  * `handle_unknown='use_encoded_value'` - allows assigning a specific integer for unknown categories with additional arg `unknown_value`.

If your categories have a meaningful order, specify this in the `categories` argument to ensure the encoding respects the ordinal nature.


Since the penguins dataset has no feature with ordinal categories, let's consider a simple case involving an ordinal feature: "Education Level". The categories, listed in order of educational achievement, are:

+ High School
+ Bachelor's
+ Master's
+ Doctorate

In ordinal encoding, each category is assigned a unique integer based on its order.
Here is how you can map these categories:
+ High School --> 0
+ Bachelor's --> 1
+ Master's  --> 2
+ Doctorate --> 3
  


In [None]:
from sklearn.preprocessing import OrdinalEncoder

In [None]:
train_data = [['Bachelor\'s'], ['High School'],  ['Doctorate'], ['Master\'s'], ['Doctorate']]
df_train_data_ = pd.DataFrame(data=train_data, columns=['education_level'])
df_train_data_

In [None]:
# Instantiate the encoder object
edu_encoder = OrdinalEncoder(categories=[['High School', 'Bachelor\'s', 'Master\'s', 'Doctorate']],dtype=int).set_output(transform='pandas')

In [None]:
# Fit the variable encoder on the 'education_level' column of the training data
edu_encoder.fit(df_train_data_[['education_level']])

In [None]:
# learned and stored categories by the encoder
edu_encoder.categories_

In [None]:
# Encode the 'education_level' column in the train data
edu_encoded_df_train = edu_encoder.transform(df_train_data_)
edu_encoded_df_train
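The `handle_unknown='use_encoded_value'` option described earlier can be sketched in the same setting. The category `"Diploma"` and the choice of `-1` as `unknown_value` are assumptions made for illustration.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Explicit order of the categories, lowest to highest
levels = ["High School", "Bachelor's", "Master's", "Doctorate"]

train = pd.DataFrame(
    {"education_level": ["Bachelor's", "High School", "Doctorate", "Master's"]}
)
# New data with a category ("Diploma") that was not listed during fitting
new = pd.DataFrame({"education_level": ["Master's", "Diploma"]})

encoder = OrdinalEncoder(
    categories=[levels],
    handle_unknown="use_encoded_value",
    unknown_value=-1,  # hypothetical choice: unknown categories are encoded as -1
)
encoder.fit(train)

encoded_new = encoder.transform(new)
print(encoded_new)
```

Without this option, the default `handle_unknown='error'` would raise an error when `"Diploma"` is encountered during `transform`.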

#### Summary

`OrdinalEncoder()` and `OneHotEncoder()` turn categorical data into numeric values

`OrdinalEncoder()` results in a single column

`OneHotEncoder()` results in multiple columns

`drop='first'` (or `drop_first=True` in `pd.get_dummies()`) removes a column from your dummy frame to avoid **perfect collinearity**, especially when using a regression model. There is some discussion of why multicollinearity is <a href = "https://towardsdatascience.com/multicollinearity-why-is-it-bad-5335030651bf">a problem</a> and <a href = "https://stats.stackexchange.com/questions/224051/one-hot-vs-dummy-encoding-in-scikit-learn">why it might not be that bad</a>. The arguments are interesting, but as a general rule, when using regression models it's best to avoid it.


- `LabelEncoder` is similar to [factorize()](https://pandas.pydata.org/docs/reference/api/pandas.factorize.html) in Pandas
- `OneHotEncoder` is similar to [get_dummies()](https://pandas.pydata.org//docs/reference/api/pandas.get_dummies.html) in Pandas
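For comparison, the two pandas functions mentioned above can be sketched on a small hypothetical Series. One difference worth noticing is that `factorize()` numbers the categories in order of first appearance, while the scikit-learn encoders sort them.

```python
import pandas as pd

# Hypothetical toy column
species = pd.Series(["Adelie", "Gentoo", "Adelie", "Chinstrap"], name="species")

# factorize(): one integer code per category, in order of first appearance
codes, uniques = pd.factorize(species)
print(codes)          # integer codes
print(list(uniques))  # categories in the order they first appear

# get_dummies(): one binary column per category; drop_first removes the first one
dummies = pd.get_dummies(species, prefix="species", drop_first=True, dtype=int)
print(dummies)
```

Unlike the scikit-learn encoders, these functions do not store what they learnt, so applying exactly the same encoding to the test data later takes extra care.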

<hr style="border:2px solid black">

## 3. Discretization (Binning) - Splitting scalars into categories
Breaking a continuous variable into buckets can have several effects:
+ it can reduce the model's sensitivity to minor fluctuations and noise in the data, reducing the risk of overfitting
+ it can cause a loss of important details in the data if the bin widths are not properly chosen, leading to a higher risk of underfitting
+ it makes linear models non-linear:
  + [regression example from scikit-learn](https://scikit-learn.org/stable/auto_examples/preprocessing/plot_discretization.html)
  + [classification example from scikit-learn](https://scikit-learn.org/stable/auto_examples/preprocessing/plot_discretization_classification.html#sphx-glr-auto-examples-preprocessing-plot-discretization-classification-py)

### `KBinsDiscretizer()`

We can use <a href = "https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.KBinsDiscretizer.html">`KBinsDiscretizer()`</a> to turn a set of scalars into bins, and then one-hot encode these bins in a single step. A few parameters to keep in mind:
 
* `n_bins` - choose how many bins to generate, default = 5.
 
* `strategy = 'quantile'` - **default option**, generate bins of equal population.
 
* `strategy = 'uniform'` - generate bins of equal width.
  
* `encode = 'onehot' ` - **default option**, encode the transformed result with one-hot encoding and return a sparse matrix.

In [None]:
sns.displot(data=X_train, x='bill_length_mm',bins=5);

In [None]:
from sklearn.preprocessing import KBinsDiscretizer

In [None]:
# Instantiate  KBinsDiscretizer object
bill_length_binner = KBinsDiscretizer(n_bins = 5, encode='onehot-dense', strategy='quantile').set_output(transform='pandas')

In [None]:
# Fit the variable binner on the 'bill_length_mm' column of the training data
bill_length_binner.fit(X_train[['bill_length_mm']])

In [None]:
# Access and print the bin edges for the 'bill_length_mm' feature
bin_edges = bill_length_binner.bin_edges_
print("Bin edges for 'bill_length_mm':", bin_edges[0])

The bin edges represent the **quintiles**: they divide the data into 5 more-or-less equal parts.

In [None]:
# Get the quintiles
X_train[['bill_length_mm']].quantile(q=[0.,0.2,0.4,0.6,0.8,1.0]).rename({'bill_length_mm':'bill_quintiles'},axis=1)

In [None]:
# Plotting the bill length distribution using the learnt bin edges 
sns.displot(data=X_train, x='bill_length_mm',bins=bin_edges[0]);

As shown in the plot above the heights of the bins are more-or-less equal.

In [None]:
# Transform 'bill_length_mm' using the quintile bins defined in 'bill_length_binner'
bill_length_binned_df_train = bill_length_binner.transform(X_train[['bill_length_mm']])
bill_length_binned_df_train

The following table shows you the mapping between one-hot-encoded categories and binned values:
| Category | Bin Values     |
|----------|----------------|
| 0        | 32.1 - 38.6   |
| 1        | 38.6 - 42.3|
| 2        | 42.3 - 46.2|
| 3        | 46.2 - 49.5|
| 4        | 49.5 - 59.6|


**How to read the 👆🏽 transformed DataFrame**?
+ The transformed DataFrame assigns each penguin's `bill_length_mm` to one of five quintile categories (0 through 4), one-hot encoded as one column per category. Each category corresponds to a specific range of bill lengths
+ Observation (the penguin) no. 24 has a 1 in the column for category 0. This indicates that its bill length falls in the range from the minimum, 32.1, to the first quintile, 38.6


In [None]:
# Bill_length value for the observation 24
X_train[['bill_length_mm']].loc[24]

- `KBinsDiscretizer(strategy='quantile')` is similar to [qcut()](https://pandas.pydata.org/docs/reference/api/pandas.qcut.html) in Pandas
- `KBinsDiscretizer(strategy='uniform')` is similar to [cut()](https://pandas.pydata.org/docs/reference/api/pandas.cut.html) in Pandas
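The difference between equal-population and equal-width bins shows up clearly on skewed data. The sketch below uses the two pandas functions on a hypothetical Series containing one very large value.

```python
import pandas as pd

# Hypothetical skewed data: nine small values and one large one
values = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])

# qcut: bins with (roughly) the same number of observations
equal_count = pd.qcut(values, q=5, labels=False)

# cut: bins with the same width between the minimum and the maximum
equal_width = pd.cut(values, bins=5, labels=False)

print(equal_count.value_counts().sort_index().tolist())  # observations per quantile bin
print(equal_width.value_counts().sort_index().tolist())  # observations per uniform bin
```

With equal-width bins the single large value stretches the range, so almost all observations end up in the first bin, while the quantile bins stay balanced.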

<hr style="border:2px solid black">

## 4. Numerical Feature Scaling - Normalizing Data Ranges

The goal of feature scaling is to transform numerical features to be on a similar scale.

For example, consider the following two features:
+ `income` spans from 20_000 to 100_000 euro
+ `age` spans from 20 to 100 years
  
Without scaling, the income feature would disproportionately influence [any distance-based algorithms](https://towardsdatascience.com/all-about-feature-scaling-bcc0ad75cb35/) because of its larger range of values compared to age.

By applying feature scaling, both income and age can be adjusted so that they **equally contribute to the model's learning**, thereby reflecting true feature importance and subsequently improving the model.
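A small numerical sketch of this effect follows, using three invented people and Euclidean distance. The values are assumptions picked to span the income and age ranges mentioned above, and min-max scaling is used here only as one possible scaler.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical people, columns: [income in euro, age in years]
a = np.array([20_000.0, 20.0])
b = np.array([21_000.0, 100.0])   # similar income to a, very different age
c = np.array([100_000.0, 20.0])   # same age as a, very different income
people = np.vstack([a, b, c])

# Euclidean distances on the raw values: income dominates completely
raw_ab = np.linalg.norm(a - b)
raw_ac = np.linalg.norm(a - c)
print(raw_ab, raw_ac)

# After scaling both features to [0, 1], age and income contribute comparably
scaled = MinMaxScaler().fit_transform(people)
scaled_ab = np.linalg.norm(scaled[0] - scaled[1])
scaled_ac = np.linalg.norm(scaled[0] - scaled[2])
print(scaled_ab, scaled_ac)
```

On the raw data the 80-year age gap between `a` and `b` barely registers next to the income gap between `a` and `c`. After scaling, both gaps cover the full range of their feature and the two distances are nearly equal.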


### 4.1 Min-Max Scaling 

It transforms all values of a numerical feature to a fixed range, typically between 0 and 1, by subtracting the minimum value and dividing by the range as shown in the formula below:

$$\large X_{scaled} = \large \dfrac{X - X_{min}}{X_{max} - X_{min}}$$

After transformation:
+ Features are constrained to a specific range (e.g., 0 to 1) 
  + The minimum value becomes 0 and the maximum value becomes 1
+ The original relationship between data points is preserved
  + Everything in between 0 and 1 is proportionally distributed
+ Outliers are not handled well.
  


In [None]:
from sklearn.preprocessing import MinMaxScaler

In [None]:
# Instantiate the scaler
min_max_scaler = MinMaxScaler(feature_range=(0,1)).set_output(transform='pandas')

In [None]:
# Fit the scaler to the columns 'bill_length_mm' and 'flipper_length_mm' of the training data
min_max_scaler.fit(X=X_train[['bill_length_mm','flipper_length_mm']])

In [None]:
# Learnt Maximum values for each feature on the train data
print("Max values:", min_max_scaler.data_max_)

# Learnt Minimum values for each feature on the train data
print("Min values:", min_max_scaler.data_min_)

# Learnt Range for each feature on the train data
print("Range:", min_max_scaler.data_range_)

In [None]:
X_train[['bill_length_mm','flipper_length_mm']].max()

In [None]:
# Scale the features
min_max_scaled_bill_fli_df_train = min_max_scaler.transform(X=X_train[['bill_length_mm','flipper_length_mm']])
min_max_scaled_bill_fli_df_train

In [None]:
min_max_scaled_bill_fli_df_train.describe()

**Note** that the **min** and **max** values of each feature are 0 and 1, respectively.

In [None]:
fig, ax = plt.subplots(nrows=2,ncols=2)

# Plotting unscaled data
sns.histplot(data=X_train, x='bill_length_mm', ax=ax[0, 0], kde=True)
ax[0, 0].set_title('Unscaled Bill Length')

sns.histplot(data=X_train, x='flipper_length_mm', ax=ax[0, 1], kde=True)
ax[0, 1].set_title('Unscaled Flipper Length')

# Plotting scaled data
sns.histplot(data=min_max_scaled_bill_fli_df_train, x='bill_length_mm', ax=ax[1, 0], kde=True)
ax[1, 0].set_title('Scaled Bill Length')

sns.histplot(data=min_max_scaled_bill_fli_df_train, x='flipper_length_mm', ax=ax[1, 1], kde=True)
ax[1, 1].set_title('Scaled Flipper Length')

# Adjust layout
plt.tight_layout()


As shown in the visualization above, the underlying pattern of the data is preserved after min-max scaling.


### 4.2 Standard Scaling

It transforms the values of a numerical feature so that they have a mean of zero and a standard deviation of one. This is accomplished by subtracting the mean of the feature from each value and then dividing by the standard deviation, as shown in the formula below:

$$\large X_{\text{standardized}} = \frac{X - \mu}{\sigma}$$

Where:
- $\mu$ is the mean of the feature values.
- $\sigma$ is the standard deviation of the feature values.

**After transformation:**
+ **Centered Data**: The mean of the transformed data is 0
+ **Unit Variance**: The standard deviation becomes 1, which means that feature variance is normalized.
+ **Preserves Relationships**: The original relationships between variables are maintained, similar to Min-Max scaling.
+ **Handling Outliers**: Standard scaling is less sensitive to outliers than Min-Max scaling because it does not compress the data into a fixed range.

 **[`StandardScaler()`](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html)**

In [None]:
from sklearn.preprocessing import StandardScaler

### 4.3 Robust Scaling
It transforms the values of a numerical feature by subtracting the median and then dividing by the interquartile range (IQR), as shown in the formula below:

$$\large X_{\text{robust}} = \frac{X - \text{median}(X)}{\text{IQR}}$$

Where:
- **median(X)** is the median of the feature values.
- **IQR** is the interquartile range, which is the difference between the 75th percentile (Q3) and the 25th percentile (Q1) of the feature values.

After transformation:
+ **Centered and Scaled Data**: The median of the transformed data becomes 0. The IQR used as the scaler normalizes the feature variance.
+ **Handling Outliers**: Since the median and IQR are less affected by outliers than the mean and standard deviation, this scaling method is much better suited for datasets with outliers.
+ **Preserving Relationships**: Like other scaling methods, robust scaling maintains the original relationships between variables that are not outliers.



**[`RobustScaler()`](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.RobustScaler.html)**

In [None]:
from sklearn.preprocessing import RobustScaler
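To illustrate the robustness claim, the sketch below scales a hypothetical feature containing one extreme outlier with both `RobustScaler` and `StandardScaler` and compares how far apart the four ordinary values remain.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler, StandardScaler

# Hypothetical feature: four ordinary values and one extreme outlier
X = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

robust_scaler = RobustScaler().fit(X)
print("Median:", robust_scaler.center_)  # learnt median
print("IQR:", robust_scaler.scale_)      # learnt interquartile range

X_robust = robust_scaler.transform(X)
X_standard = StandardScaler().fit_transform(X)

# Spread of the four ordinary values after each scaling
robust_spread = X_robust[3, 0] - X_robust[0, 0]
standard_spread = X_standard[3, 0] - X_standard[0, 0]
print(robust_spread, standard_spread)
```

The outlier inflates the mean and standard deviation, so standard scaling squeezes the ordinary values close together, while the median and IQR are barely affected.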

### 4.4 Log Scaling
Log scaling is a non-linear transformation that uses the logarithmic function to reduce the range and variation of data values.

**Log transformation** applies a logarithmic scale to the values of a numerical feature, typically using the natural logarithm (base e), although any logarithm base can be used depending on the data and the specific needs. The transformation is given by the formula:

$$\large X_{\text{log}} = \log(X)$$

Where:
- **log()** denotes the logarithmic function, which could be to any base.

**After transformation:**
+ **Reduces Scale Differentials**: the scale of high magnitude values is more significantly compressed than those of low magnitude.
+ **Handling Skewness**: Converts a right-skewed distribution into one that is more symmetric.
+ **Stabilizing Variance**: Variance near larger values is reduced more than variance near smaller values, which stabilizes variance across the dataset.
+ **Reducing Impact of Outliers**: Outliers that are far from the majority of data points become less dominating after log transformation due to the compression effect at higher value ranges.

**Considerations:**
- Log transformation can only be applied to positive values. For datasets containing zero or negative values, a constant may be added to each value before applying the log to shift all data into the positive domain.



While **scikit-learn** itself does not provide a direct transformer for log transformations, you can easily implement log transformation using a custom transformer with the help of [**FunctionTransformer**](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.FunctionTransformer.html). This allows you to apply any function, including a logarithmic transformation, to your data.


In [None]:
from sklearn.preprocessing import FunctionTransformer

In [None]:
import numpy as np
# Define the log transformation function
# Adding a constant to avoid taking log of zero
def log_transformation(X, c):
    return np.log(X + c)

def inverse_log_transformation(Xlog, c):
    return np.exp(Xlog) - c

In [None]:
# Create the transformer using the log_transformation function and its inverse
log_transformer = FunctionTransformer(
    func=log_transformation,
    inverse_func=inverse_log_transformation,
    kw_args={'c': 1},
    inv_kw_args={'c': 1},
).set_output(transform='pandas')

In [None]:
# Fit the transformer to the 'flipper_length_mm' column of X_train
log_transformer.fit(X=X_train[['flipper_length_mm']])

In [None]:
# Transform the data using the fitted transformer
log_transformed_flip_df_train = log_transformer.transform(X_train[['flipper_length_mm']])
log_transformed_flip_df_train
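Because an `inverse_func` is supplied, the transformed values can be mapped back with `.inverse_transform()`. Here is a self-contained sketch of that round trip; the small data frame is a hypothetical stand-in for `X_train[['flipper_length_mm']]`, and `np.log1p` / `np.expm1` are used as a compact equivalent of adding and subtracting `c=1`:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import FunctionTransformer

# Hypothetical stand-in for X_train[['flipper_length_mm']]
demo_df = pd.DataFrame({'flipper_length_mm': [181.0, 195.0, 210.0, 230.0]})

demo_transformer = FunctionTransformer(
    func=np.log1p,          # computes log(x + 1)
    inverse_func=np.expm1,  # computes exp(x) - 1
)

# Forward transformation, then back to the original scale
demo_log = demo_transformer.fit_transform(demo_df)
demo_restored = demo_transformer.inverse_transform(demo_log)

print(np.allclose(demo_restored, demo_df))
```

This is useful, for example, when values have to be reported in the original units after modeling on the log scale.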

In [None]:
fig, ax = plt.subplots(nrows=2,ncols=1)

# Plot the original data
sns.histplot(data=X_train, x='flipper_length_mm', ax=ax[0], kde=True)
ax[0].set_title('Original Flipper Length')

# Plot the log-transformed data
sns.histplot(data=log_transformed_flip_df_train, x='flipper_length_mm', ax=ax[1], kde=True)
ax[1].set_title('Log-Transformed Flipper Length')

# Adjust layout
plt.tight_layout()

#### Readings on Scaling
+ [Compare the effect of different scalers on data with outliers](https://scikit-learn.org/stable/auto_examples/preprocessing/plot_all_scaling.html#sphx-glr-auto-examples-preprocessing-plot-all-scaling-py)
+ [Importance of feature scaling](https://scikit-learn.org/stable/auto_examples/preprocessing/plot_scaling_importance.html)

<hr style="border:2px solid black">

## 5. Feature Expansion
### `PolynomialFeatures`
We can use scikit-learn's [`PolynomialFeatures()`](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.PolynomialFeatures.html) to add polynomial or interaction features to our dataset. This transformer is particularly useful when a linear model needs to capture nonlinear relationships or interactions between features. Here are some key points:

* `degree` - specifies the maximum degree of the polynomial features (default is 2). For example, for a single feature $X$ and `degree=2`, it generates $X$ and $X^2$ (plus a constant column of ones unless `include_bias=False`).

* `interaction_only` - if set to `True`, only products of distinct input features are produced. For example, for two features $X_1$ and $X_2$, it generates $X_1 \times X_2$ but not $X_1^2$ or $X_2^2$.

* `include_bias` - decides whether to include a bias column (the feature column consisting of ones). This can be set to `False` if a bias is already handled or not required in the model.

üö®**Very important**üö®:
+ **you scale your features first and then apply polynomial feature expansion**

### 5.1 `Polynomial Terms`

- Additional features obtained by raising an existing feature to some power
- Non-linear relationships can be modelled
- For some feature x, consider the model: 

$$
y = a_0 + a_1x + a_2x^2 +\ldots+\epsilon
$$

- Potential improvement of model accuracy, but increased risk of overfitting
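To make the model above concrete, here is a minimal sketch that fits a linear regression on polynomial terms. The data are hypothetical and noise-free, generated from $y = 1 + 2x + 3x^2$, so the fitted coefficients can be compared with the known ones:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

# Hypothetical noise-free data generated from y = 1 + 2x + 3x^2
x = np.linspace(-1, 1, 20).reshape(-1, 1)
y = 1 + 2 * x[:, 0] + 3 * x[:, 0] ** 2

# Expand x into the columns [x, x^2]; the intercept a_0 is left to the model
X_poly = PolynomialFeatures(degree=2, include_bias=False).fit_transform(x)

# A linear model on the expanded features is a quadratic model in x
model = LinearRegression().fit(X_poly, y)
print(model.intercept_, model.coef_)
```

The fitted intercept and coefficients should match $a_0 = 1$, $a_1 = 2$, $a_2 = 3$ up to floating-point error. With real, noisy data and a high degree, the same flexibility is what leads to overfitting.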

In [None]:
from sklearn.preprocessing import PolynomialFeatures

In [None]:
# Instantiate the polynomial features object
poly_expansion = PolynomialFeatures(degree = 2, interaction_only = False, include_bias = False).set_output(transform='pandas')

In [None]:
poly_expansion.fit(min_max_scaled_bill_fli_df_train[['bill_length_mm']])

In [None]:
expanded_scaled_bill_length_df_train = poly_expansion.transform(min_max_scaled_bill_fli_df_train[['bill_length_mm']])
expanded_scaled_bill_length_df_train

### 5.2 `Interaction Terms`

- For multiple initial features, there could be *interactions* (cross-polynomial terms)
- For 2 features, $x_0$ and $x_1$ for example, a 2nd-degree polynomial may contain:

$$
1,~x_0,~x_1,~x_0^2,~x_0x_1,~x_1^2
$$

- Each term gets its own coefficient in a regression model
- Setting `interaction_only = True` in `PolynomialFeatures` keeps only the cross terms (here $x_0x_1$) and drops the pure powers $x_0^2$ and $x_1^2$

In [None]:
# Instantiate the polynomial features object
poly_expansion = PolynomialFeatures(degree = 2, interaction_only = False, include_bias = False).set_output(transform='pandas')

Since `PolynomialFeatures` doesn't support NaNs, we must preprocess the data before calling `.fit()`.

Let's check for missing values first.

In [None]:
np.isnan(min_max_scaled_bill_fli_df_train).sum()

There are two missing values in the `flipper_length_mm` variable. To handle this, we could apply any of the strategies you learned before, such as filling, imputing, or dropping. Since the data loss is minimal, let's simply drop the affected rows.

In [None]:
# Drop rows with missing values from the data frame
min_max_scaled_bill_fli_df_train.dropna(inplace = True)

In [None]:
# Check again for missing values
np.isnan(min_max_scaled_bill_fli_df_train).sum()

In [None]:
poly_expansion.fit(X=min_max_scaled_bill_fli_df_train)

In [None]:
expanded_scaled_bill_fli_length_df_train = poly_expansion.transform(min_max_scaled_bill_fli_df_train)
expanded_scaled_bill_fli_length_df_train

**Note** that the column `bill_length_mm flipper_length_mm` represents the interaction term. If `interaction_only = True` the quadratic features `bill_length_mm^2` and `flipper_length_mm^2` will not be present.

<hr style="border:2px solid black">

## Bonus 🌶️🌶️🌶️


### Spiced Imputation


Previously, we looked at basic imputation methods that use simple statistics like the **mean** or **median** from the same column with missing values.
As always, there is more to discover. There are also more advanced imputation techniques that consider relationships between different features.

#### Group-wise Mean/Median/Mode Imputation:
Imputes missing values based on the mean, median, or mode calculated within subgroups of the data.
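The core idea can be sketched in plain pandas. The data frame below is hypothetical (two made-up islands "A" and "B"), and the group means are computed on the same data that is being filled:

```python
import numpy as np
import pandas as pd

# Hypothetical data: two groups with one missing value each
demo_df = pd.DataFrame({
    'island': ['A', 'A', 'A', 'B', 'B', 'B'],
    'bill_depth_mm': [18.0, np.nan, 20.0, 14.0, 16.0, np.nan],
})

# For every row, look up the mean of its own group (NaNs are ignored)
group_means = demo_df.groupby('island')['bill_depth_mm'].transform('mean')

# Fill each missing value with the mean of the group the row belongs to
demo_df['bill_depth_imputed'] = demo_df['bill_depth_mm'].fillna(group_means)
print(demo_df)
```

The missing value on island "A" becomes 19.0 and the one on island "B" becomes 15.0. This one-off version does not separate training from test data, which is what a fit/transform interface adds.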

To build a group-wise imputer that follows the scikit-learn interface of `.fit()` and `.transform()`, we will create a custom class based on scikit-learn's `BaseEstimator` and `TransformerMixin` classes.

In [None]:
from sklearn.base import BaseEstimator, TransformerMixin

In [None]:
class GroupMeanImputer(BaseEstimator, TransformerMixin):
    def __init__(self, group_column, target_columns):
        # Initialize with the name of the group column and a list of target columns
        self.group_column = group_column
        self.target_columns = target_columns

    def fit(self, X, y=None):
        # Calculate the mean for each target column within each group
        self.means_ = {}
        # Calculate the unique categories in the group  column
        self.categories_ = X[self.group_column].unique()
        for column in self.target_columns:
            self.means_[column] = X.groupby(self.group_column)[column].mean().to_dict()
        return self

    def transform(self, X):
        # Apply the learned means_ to fill in missing values for each target column
        X = X.copy()
        for column in self.target_columns:
            # Map every row to the mean learned for its group during fit
            # (rows whose group was not seen during fit map to NaN and stay missing)
            group_means = X[self.group_column].map(self.means_[column])
            # Fill only the missing values with the mean of the row's own group
            X[column] = X[column].fillna(group_means)

        return X[self.target_columns]

In [None]:
# Instantiate the group-wise imputer
group_mean_imputer = GroupMeanImputer(group_column='island', target_columns=['bill_depth_mm','flipper_length_mm'])

In [None]:
# Fit the imputer 
group_mean_imputer.fit(X=X_train)

In [None]:
group_mean_imputer.means_

In [None]:
group_mean_imputer.categories_

In [None]:
group_mean_imputer.transform(X=X_train)

#### Model Imputation:
+ Use a model to predict the missing values based on other variables
+ For iterative imputation, scikit-learn provides [IterativeImputer()](https://scikit-learn.org/stable/auto_examples/impute/plot_iterative_imputer_variants_comparison.html#sphx-glr-auto-examples-impute-plot-iterative-imputer-variants-comparison-py), which is still experimental and must be enabled explicitly by importing `enable_iterative_imputer` from `sklearn.experimental`
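A minimal sketch of `IterativeImputer` on a hypothetical array in which the second column is roughly twice the first. The settings `max_iter=10` and `random_state=0` are illustrative choices, and the default estimator is used:

```python
import numpy as np
# IterativeImputer is experimental and has to be enabled explicitly
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Hypothetical data: the second column is about twice the first
X_demo = np.array([
    [1.0, 2.0],
    [2.0, 4.0],
    [3.0, 6.0],
    [4.0, np.nan],
    [np.nan, 10.0],
])

# Each feature with missing values is modeled as a function of the others
iterative_imputer = IterativeImputer(max_iter=10, random_state=0)
X_imputed = iterative_imputer.fit_transform(X_demo)
print(X_imputed.round(2))
```

Observed values are left untouched; only the two missing entries are replaced by model predictions.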

#### K-Nearest Neighbors (KNN) Imputation:
+ Imputes missing entries based on the k-nearest neighbors found by measuring distance from other points.
+ In scikit-learn there is [KNNImputer()](https://scikit-learn.org/stable/modules/generated/sklearn.impute.KNNImputer.html)
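A minimal sketch of `KNNImputer` on a small hypothetical array, with `n_neighbors=2` chosen for illustration:

```python
import numpy as np
from sklearn.impute import KNNImputer

# Hypothetical data with two missing entries
X_demo = np.array([
    [1.0, 2.0, np.nan],
    [3.0, 4.0, 3.0],
    [np.nan, 6.0, 5.0],
    [8.0, 8.0, 7.0],
])

# Each missing entry is replaced by the mean of that feature over the
# 2 nearest rows (distances are computed on the observed features only)
knn_imputer = KNNImputer(n_neighbors=2)
X_imputed = knn_imputer.fit_transform(X_demo)
print(X_imputed)
```

Here the missing value in the first row becomes 4.0 (the mean of 3 and 5 from its two nearest rows), and the missing value in the third row becomes 5.5 (the mean of 3 and 8).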