## Feature Engineering

Feature Engineering is crucial in Machine Learning. It involves selecting, transforming, and creating features from raw data to improve model performance and interpretability. Effective feature engineering can significantly enhance the predictive power and generalization ability of machine learning models.

Yesterday, in the KNN Regression approach, we saw a pretty poor model. Let's apply some feature engineering techniques to see if they improve it.

#### Loading and preparing the data

In [56]:
from sklearn.datasets import fetch_california_housing
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor

from sklearn.preprocessing import MinMaxScaler, StandardScaler

In [None]:
california = fetch_california_housing()
print(california["DESCR"])

In [None]:
df_cali = pd.DataFrame(california["data"], columns = california["feature_names"])
df_cali["median_house_value"] = california["target"]

df_cali.head()

#### Checking for anomalies

In [None]:
df_cali.info()

#### Train Test Split

In [60]:
features = df_cali.drop(columns = ["median_house_value"])
target = df_cali["median_house_value"]

In [61]:
X_train, X_test, y_train, y_test = train_test_split(features, target, test_size = 0.20, random_state=0)

#### Normalization

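During normalization or standardization, it's essential to fit the scaler on the training data only. If the scaler sees the test data while learning its statistics (minimum and maximum, or mean and standard deviation), information from the test set leaks into training and the evaluation becomes overly optimistic.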
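To make the fit/transform split concrete, here is a minimal sketch of min-max scaling done by hand with NumPy. The toy arrays (`X_train_toy`, `X_test_toy`) are made up for illustration; the arithmetic mirrors what `MinMaxScaler` does with its default settings, where the statistics come from the training data and are then reused for the test data.

```python
import numpy as np

# Toy data (hypothetical): one feature, with a test value outside the training range.
X_train_toy = np.array([[10.0], [20.0], [30.0]])
X_test_toy = np.array([[15.0], [40.0]])

# "Fit": learn the minimum and the range from the training data only.
train_min = X_train_toy.min(axis=0)
train_range = X_train_toy.max(axis=0) - train_min

# "Transform": reuse the training statistics for both sets.
X_train_scaled = (X_train_toy - train_min) / train_range
X_test_scaled = (X_test_toy - train_min) / train_range

print(X_train_scaled.ravel())  # training values fall in [0, 1]
print(X_test_scaled.ravel())   # test values may fall outside [0, 1]
```

A test value above the training maximum is scaled to a number greater than 1. That is expected and is not a bug: the test set has to be treated as unseen data.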

Create an instance of the normalizer

In [62]:
normalizer = MinMaxScaler()

Fit it to our training data

In [None]:
normalizer.fit(X_train)

Transforming our training and testing data

In [64]:
X_train_norm = normalizer.transform(X_train)

X_test_norm = normalizer.transform(X_test)

When we transform a dataframe, the normalizer returns a NumPy array instead of a dataframe, so the column names are lost

In [None]:
X_train_norm

In [None]:
X_train_norm = pd.DataFrame(X_train_norm, columns = X_train.columns)
X_train_norm.head()

In [None]:
X_test_norm = pd.DataFrame(X_test_norm, columns = X_test.columns)
X_test_norm.head()

##### KNN Regressor - modeling

Let's create an instance of KNN with the same hyperparameter as before, n_neighbors = 10.

In [68]:
knn = KNeighborsRegressor(n_neighbors=10)

Train KNN on our normalized data

In [None]:
knn.fit(X_train_norm, y_train)

Evaluate the model's performance

In [None]:
knn.score(X_test_norm, y_test)

With raw data we obtained an R2 of 0.16. Just by normalizing our data, the model's performance increases a lot, to an R2 of 0.70.

This happens because KNN is a distance-based algorithm, so it suffers a lot when features are on completely different scales: features with large numeric ranges dominate the distance computation.
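The effect of scale on distances can be seen with a small sketch. The three districts and the feature ranges below are made-up values chosen for illustration (they are not taken from the dataset); the point is only that the nearest neighbor can change once both features share a common scale.

```python
import numpy as np

# Hypothetical districts: [median income, population]
a = np.array([2.0, 1000.0])
b = np.array([8.0, 1100.0])   # very different income, similar population
c = np.array([2.1, 1500.0])   # almost the same income, different population


def euclidean(u, v):
    """Euclidean distance between two points."""
    return float(np.sqrt(np.sum((u - v) ** 2)))


# On raw values, the population column dominates the distance.
raw_ab = euclidean(a, b)
raw_ac = euclidean(a, c)

# Min-max scale with assumed feature ranges (illustrative, not fitted).
feature_min = np.array([0.5, 3.0])
feature_max = np.array([15.0, 35000.0])


def scale(x):
    return (x - feature_min) / (feature_max - feature_min)


scaled_ab = euclidean(scale(a), scale(b))
scaled_ac = euclidean(scale(a), scale(c))

print(f"raw:    d(a, b) = {raw_ab:.2f}, d(a, c) = {raw_ac:.2f}")
print(f"scaled: d(a, b) = {scaled_ab:.4f}, d(a, c) = {scaled_ac:.4f}")
```

On the raw values, `b` looks closer to `a` even though their incomes differ a lot, simply because population is measured in much larger numbers. After scaling, `c` becomes the nearer neighbor.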

## Feature Selection

Even though normalizing our data had a huge impact on KNN performance, we are currently using every single feature of the dataset.

Now let's select features based on the correlations among the features themselves and between each feature and the target.

We want low correlation between features, but high correlation between each feature and our target.

In [None]:
corr = np.abs(df_cali.corr())

# Set up a mask that hides the upper triangle (the matrix is symmetric)
mask = np.zeros_like(corr, dtype=bool)
mask[np.triu_indices_from(mask)] = True

# Set up the matplotlib figure
f, ax = plt.subplots(figsize=(10, 10))
# Generate a custom diverging colormap
cmap = sns.diverging_palette(220, 10, as_cmap=True)
# Draw the heatmap with the mask, the colormap and a square aspect ratio
sns.heatmap(corr, mask=mask, cmap=cmap, vmax=1, square=True, linewidths=.5, cbar_kws={"shrink": .5}, annot=True, ax=ax)

plt.show()

From the correlation matrix we can see that:
- "AveRooms" is highly correlated with "AveBedrms", so we drop the one that is less correlated with our target ("AveBedrms")
- "AveOccup" and "Population" also have very low correlation with our target variable, so let's remove them from our selected features
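The selection rule used above can also be sketched as code. The helper `select_by_correlation` and its thresholds (`pair_threshold`, `target_threshold`) are hypothetical names and arbitrary values introduced here for illustration, and the demo runs on synthetic data. It is not guaranteed to reproduce the manual selection made from the heatmap.

```python
import numpy as np
import pandas as pd


def select_by_correlation(df, target, pair_threshold=0.8, target_threshold=0.1):
    """Hypothetical helper: return a sorted list of features to drop."""
    corr = df.corr().abs()
    target_corr = corr[target].drop(target)
    features = list(target_corr.index)
    to_drop = set()

    # Rule 1: drop features that are barely correlated with the target.
    for name in features:
        if target_corr[name] < target_threshold:
            to_drop.add(name)

    # Rule 2: for each highly correlated pair of features,
    # drop the one that is less correlated with the target.
    for i, first in enumerate(features):
        for second in features[i + 1:]:
            if corr.loc[first, second] > pair_threshold:
                weaker = first if target_corr[first] < target_corr[second] else second
                to_drop.add(weaker)

    return sorted(to_drop)


# Synthetic demo: "b" is a noisier copy of "a", and "noise" is unrelated to "y".
rng = np.random.default_rng(0)
n = 2000
signal = rng.normal(size=n)
toy = pd.DataFrame({
    "a": signal + 0.1 * rng.normal(size=n),
    "b": signal + 0.5 * rng.normal(size=n),
    "noise": rng.normal(size=n),
})
toy["y"] = signal + 0.1 * rng.normal(size=n)

dropped = select_by_correlation(toy, "y")
print(dropped)
```

Thresholds like these are a judgment call; inspecting the heatmap, as done here, remains a reasonable way to sanity-check what an automatic rule proposes.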

In [72]:
X_train_reduced = X_train_norm.drop(columns = ["AveOccup", "Population", "AveBedrms"])
X_test_reduced = X_test_norm.drop(columns = ["AveOccup", "Population", "AveBedrms"])

In [None]:
knn = KNeighborsRegressor(n_neighbors=10)
knn.fit(X_train_reduced, y_train)

knn.score(X_test_reduced, y_test)

By normalizing our data and selecting a subset of the available features, we were able to massively improve our model, increasing the R2 score from 0.16 to 0.70.

Notice that we still haven't fine-tuned our hyperparameter, so we should be able to improve our model even further.
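As a preview of what tuning could look like, here is a minimal sketch using `GridSearchCV` over `n_neighbors`. It runs on synthetic data from `make_regression` rather than the California dataset, and the candidate values in `param_grid` are arbitrary assumptions. Wrapping the scaler and the model in a `Pipeline` means the scaler is refit on the training folds only during cross-validation, which keeps the leakage rule from the normalization step.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in data (not the California housing dataset).
X, y = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.20, random_state=0)

# Scaler and model in one pipeline, so scaling is fit inside each CV fold.
pipe = Pipeline([
    ("scaler", MinMaxScaler()),
    ("knn", KNeighborsRegressor()),
])

# Candidate values for n_neighbors (arbitrary choices for illustration).
param_grid = {"knn__n_neighbors": [3, 5, 10, 20]}

search = GridSearchCV(pipe, param_grid, cv=5, scoring="r2")
search.fit(X_tr, y_tr)

print(search.best_params_)        # best n_neighbors found by cross-validation
print(search.score(X_te, y_te))   # R2 of the refit best model on held-out data
```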