<center>
  <a href="MLSD-03-DetectingAnomalies-A.ipynb" target="_self">Detecting Anomalies A</a> | <a href="./">Content Page</a> | <a href="MLSD-04-FeatureEngineering-B.ipynb">Feature Engineering B</a> | <a href="MLSD-04-FeatureEngineering-Ex-1.ipynb">Feature Engineering Exercise 1</a>
</center>

# <center>FEATURE ENGINEERING A</center>

<center><b>Copyright &copy; 2023 by DR DANNY POO</b><br> e:dannypoo@nus.edu.sg<br> w:drdannypoo.com</center><br>

# Definition
Feature Engineering is the process of transforming data to increase the predictive performance of machine learning models.

# Importance
Feature Engineering is both useful and necessary for the following reasons:
- Often better predictive accuracy. Feature engineering techniques such as standardization and normalization often lead to better weighting of variables, which improves accuracy and sometimes leads to faster convergence.
- Better interpretability of relationships in the data. When we engineer new features and understand how they relate to our outcome of interest, we deepen our understanding of the data. 

Feature engineering is necessary because many models cannot accept certain data representations. <br>
Models like linear regression, for example, cannot handle missing values on their own, so those values need to be imputed (filled in) first. 

# Feature Engineering Methods


![image.png](attachment:image.png)

**Imputation**<br>
Imputation is the process of filling in missing values, i.e. cells in a row for which no information was recorded.

**Outlier Handling**<br>
Outliers are data points that are significantly different from other observations.<br>
Outlier handling is done by removing or replacing these points. 
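
As a minimal sketch of both options, the cell below flags outliers with the common 1.5 × IQR rule (one convention among several, chosen here as an assumption) and then either drops them or clips them to the fences. The `values` series is hypothetical toy data, not part of the supermarket dataset.

```python
import pandas as pd

# Hypothetical toy data: one value lies far from the rest
values = pd.Series([12, 14, 15, 13, 16, 14, 120])

# 1.5 * IQR rule: points beyond these fences are treated as outliers
q1, q3 = values.quantile(0.25), values.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
is_outlier = (values < lower) | (values > upper)

# Option 1: remove the outliers
removed = values[~is_outlier]

# Option 2: replace the outliers by clipping them to the fences
clipped = values.clip(lower=lower, upper=upper)

print(is_outlier.sum(), len(removed), clipped.max())
```

Removing rows discards information in the other columns as well, so clipping is often preferred when the dataset is small.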

**One-hot Encoding**<br>
Categorical values (often referred to as nominal) such as gender, seasons, pets, brand names, or age groups often require transformation, depending on the ML algorithm used. <br>
Decision trees can work with categorical data, but many other algorithms need each category to be represented by an additional artificial binary column.<br>
One-hot encoding is a technique for preprocessing categorical features for machine learning models. For each category, it creates a new binary feature, often called a “<b>dummy variable</b>.”

**Log Transformation**<br>
This method can bring a skewed distribution closer to a normal one. <br>
Logarithm transformation (or log transformation) replaces each value x of a variable with log(x).<br>
The meaning of a fixed difference often varies across a variable's range. For example, the gap between ages 10 and 20 is a doubling, whereas the gap between ages 60 and 70 is a much smaller relative change. <br>
Log transformation puts such differences on a comparable, relative scale.
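
The age example can be checked numerically. This is only an illustrative sketch using the four ages mentioned above: after taking logs, the same 10-year gap becomes a large step for young ages and a small step for old ages, because the log measures relative change.

```python
import numpy as np

ages = np.array([10, 20, 60, 70])
log_ages = np.log(ages)

# Both gaps are 10 years on the original scale
gap_young = log_ages[1] - log_ages[0]   # equals log(20 / 10)
gap_old = log_ages[3] - log_ages[2]     # equals log(70 / 60)

print(round(gap_young, 3), round(gap_old, 3))  # 0.693 0.154
```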

![image-2.png](attachment:image-2.png)

![image-3.png](attachment:image-3.png)

**Scaling**<br>
Scaling is a data calibration technique that facilitates the comparison of different types of data. <br>
It keeps features measured in large numbers from outweighing features measured in small numbers purely because of their units.<br>
For example, despite its small value, the floor number in a building can be as important as the square footage.

![image-4.png](attachment:image-4.png)

Techniques include:<br>
- <b>Standardization</b> is done by calculating the difference between each value and the mean, divided by the standard deviation (sigma), which measures the spread of the values.<br>
- <b>Normalization</b> is quite similar, except that the difference of each value from the mean is divided by the difference between the maximum and minimum values in the dataset. The min-max variant subtracts the minimum instead of the mean, which maps the values onto a 0-1 scale.
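
The techniques can be written out directly with NumPy. This is a small sketch on a hypothetical array `x`; it also shows the min-max variant of normalization, which subtracts the minimum rather than the mean and is what `MinMaxScaler` applies later in this notebook.

```python
import numpy as np

x = np.array([2.0, 4.0, 6.0, 8.0, 10.0])  # hypothetical values

# Standardization (z-score): subtract the mean, divide by the standard deviation
standardized = (x - x.mean()) / x.std()

# Mean normalization: subtract the mean, divide by the range (max - min)
mean_normalized = (x - x.mean()) / (x.max() - x.min())

# Min-max normalization: subtract the minimum, divide by the range
minmax_normalized = (x - x.min()) / (x.max() - x.min())

print(standardized)
print(mean_normalized)
print(minmax_normalized)
```

Standardized values have mean 0 and standard deviation 1, while min-max normalized values always fall between 0 and 1.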





# Feature Engineering
<b>Dataset</b>: Supermarket Sales data set.<br>
<b>Description</b>:
- Invoice ID - Computer generated sales slip invoice identification number
- Branch - Branch of supercenter (3 branches are available identified by A, B, and C).
- City - Location of supercenters
- Customer type - Type of customer, recorded as Member for customers using a member card and Normal for those without one
- Gender - Gender type of customer
- Product line - General item categorization groups
- Unit price - Price of each product in dollars
- Quantity - Number of products purchased by the customer
- Tax 5% - 5% tax fee charged on the customer's purchase
- Total - Total price including tax
- Date - Date of purchase
- Time - Purchase time
- Payment - Payment used by the customer for their purchase
- cogs - Cost of goods sold
- gross margin percentage
- gross income
- Rating - Customer satisfaction rating on their overall shopping experience (On a scale of 1 to 10)

<b>Tasks</b>: 
- To read in and explore data set.
- To carry out <b>Numeric Aggregations</b>.
- To produce <b>indicator variables and interaction terms</b>.
- To carry out <b>Numeric Transformations</b>.
- To carry out <b>Numeric Scaling</b>.
- To <b>encode categorical values</b>.
- To <b>handle missing values</b>.
- To carry out <b>date-time decomposition</b>.

## Read in and Explore Data Set

In [None]:
# Import libraries
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns
sns.set_theme()
sns.set_palette(sns.color_palette(['#851836', '#edbd17']))
sns.set_style("darkgrid")

from sklearn.preprocessing import MinMaxScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

In [None]:
# Read in data
df = pd.read_csv('./data/supermarketSales/supermarketSales.csv')
df.head()

In [None]:
# Examine data format
df.info()

**Observations**:
- There are 1000 rows in total.
- No column has missing values.

## Numeric Aggregations

Numeric aggregation is a common feature engineering approach for longitudinal or panel data - data where subjects are repeated.<br>This dataset has categorical variables with repeated observations (for example, there are multiple entries for each supermarket branch).

Numeric aggregation involves three parameters:
- Categorical column
- Numeric column(s) to be aggregated
- Aggregation type: Mean, median, mode, standard deviation, variance, count etc.

**Example**
- Branch – categorical column to be grouped.
- Tax 5% and Unit price – numeric columns to be aggregated with mean and standard deviation.
- Product line and Gender – categorical columns, used here only to count the entries per branch.
- Mean, standard deviation, and count – aggregations to be used.
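
Before applying this to the dataset, a toy example may help show why `transform` is used rather than `agg`. The frame `toy` and its column names are hypothetical; `agg` returns one row per group, whereas `transform` repeats the group statistic on every original row so it can be stored as a new feature.

```python
import pandas as pd

# Hypothetical toy data with repeated observations per store
toy = pd.DataFrame({
    'store': ['A', 'A', 'B', 'B', 'B'],
    'sales': [10.0, 20.0, 30.0, 60.0, 90.0],
})

# agg() collapses the data to one row per group
per_store = toy.groupby('store')['sales'].agg(['mean', 'count'])

# transform() broadcasts the group statistic back to every original row
toy['store_sales_mean'] = toy.groupby('store')['sales'].transform('mean')
toy['store_sales_count'] = toy.groupby('store')['sales'].transform('count')

print(per_store)
print(toy)
```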

In [None]:
# Numeric aggregations
grouped_df = df.groupby('Branch')

df[['tax_branch_mean','unit_price_mean']] = grouped_df[['Tax 5%', 'Unit price']].transform('mean')
df[['tax_branch_std','unit_price_std']]   = grouped_df[['Tax 5%', 'Unit price']].transform('std')
df[['product_count','gender_count']]      = grouped_df[['Product line', 'Gender']].transform('count')

In [None]:
# Display the columns
df[['Branch', 'Tax 5%', 'tax_branch_mean', 'Unit price', 'unit_price_mean', 'tax_branch_std',
    'unit_price_std', 'product_count', 'gender_count']].head(10)

## Indicator Variables and Interaction Terms
- Indicator variables only take on the value 0 or 1 to indicate the absence or presence of some information.
- Interaction terms are created based on the presence of interaction effects between two or more variables.

**Indicator Variables**:
- Define an indicator variable unit_price_50 to indicate if the product has a unit price greater than 50.

In [None]:
# Is unit price greater than 50?
df['unit_price_50'] = np.where(df['Unit price'] > 50, 1, 0)

**Interaction Terms**:
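- For example, a binary variable such as free shipping may affect customer rating on its own, but its effect may differ depending on quantity, so the product of the two is worth encoding. The cell below builds the same kind of term by multiplying the indicator unit_price_50 by Quantity.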

In [None]:
# Interaction term: the unit_price_50 indicator multiplied by quantity
df['unit_price_50 * qty'] = df['unit_price_50'] * df['Quantity']

In [None]:
# Result
df[['unit_price_50', 'unit_price_50 * qty']].head()

## Numeric Transformations

- Tree-based models (decision trees, random forests, etc.) are not impacted by monotonic numeric transformations of the predictors, because they split on the ordering of values rather than on their magnitude. Therefore, performing these transformations does nothing to improve their predictive performance. 
- For linear regression, these transformations can make a big difference, as the model is sensitive to the scale and distribution of its variables.
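
Both points can be illustrated on synthetic data. The sketch below is an assumption-laden toy, not part of the original analysis: the target is constructed to depend on log(x), a decision tree is fitted on the raw and on the log-transformed predictor, and a linear regression is fitted on the same two versions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

# Synthetic data: the target is a linear function of log(x)
x = np.arange(1, 51, dtype=float).reshape(-1, 1)
y = 3.0 * np.log(x).ravel()
x_log = np.log(x)

# A tree splits on the ordering of values, which log() preserves
tree_raw = DecisionTreeRegressor(max_depth=3, random_state=0).fit(x, y)
tree_log = DecisionTreeRegressor(max_depth=3, random_state=0).fit(x_log, y)
same_tree_predictions = np.allclose(tree_raw.predict(x), tree_log.predict(x_log))

# A linear model fits a straight line, so the transformation matters
r2_raw = LinearRegression().fit(x, y).score(x, y)
r2_log = LinearRegression().fit(x_log, y).score(x_log, y)

print(same_tree_predictions)
print(round(r2_raw, 3), round(r2_log, 3))
```

The tree predictions coincide for both versions of the predictor, while the linear model fits noticeably better once the predictor is on the log scale.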

### Applying Logarithmic Transformation

In [None]:
# Plot variable cogs (Cost of Goods Sold)
fig, (ax1) = plt.subplots(1, figsize=(15,6))
sns.histplot(df['cogs'], ax=ax1, kde=True)

**Observations**:
- cogs is right skewed.

In [None]:
# To correct this skew, apply numeric transformation on cogs to create a new variable log_cogs
df['log_cogs'] = np.log(df['cogs'] + 1)
df[['cogs', 'log_cogs']]

In [None]:
# Function to plot two variables side by side
def plot_hist(data1, data2):
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15,6))
    sns.histplot(data1, ax=ax1, kde=True)
    sns.histplot(data2, ax=ax2, kde=True);

In [None]:
# Plot them side by side
plot_hist(df['cogs'], df['log_cogs'])

**Observations**:
- The log transformation makes the distribution of Cost of Goods Sold (cogs) closer to normal (less right-skewed).
- This will benefit models like linear regression, as their weights/coefficients will be less strongly influenced by the extreme values that caused the initial skewness.

### Applying Square Transformation
- Square a variable if the relationship between a predictor and target variable is not linear, but quadratic in nature (i.e. the target variable changes in proportion to the square of the predictor variable).
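
A small synthetic illustration of why squaring helps, assuming a purely quadratic relationship (the arrays `x` and `y` are hypothetical): the raw predictor shows almost no linear correlation with the target, while the squared predictor is perfectly correlated with it.

```python
import numpy as np

# Hypothetical predictor, symmetric around zero, with a quadratic target
x = np.linspace(-5, 5, 101)
y = 2.0 * x**2 + 1.0

# Linear correlation of the target with the raw and the squared predictor
corr_raw = np.corrcoef(x, y)[0, 1]
corr_squared = np.corrcoef(x**2, y)[0, 1]

print(round(corr_raw, 3), round(corr_squared, 3))
```

A linear model given only `x` would therefore find no usable signal here, whereas one given `x**2` could fit the target exactly.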

In [None]:
# Square variable gross income
df['gross income squared'] = np.square(df['gross income'])

In [None]:
# Display variables
df[['gross income', 'gross income squared']].head()

In [None]:
# Plot them side by side
plot_hist(df['gross income'], df['gross income squared'])

## Numeric Scaling

- 'gross income' and 'Rating' are on very different scales.
- To correct for this, perform 'normalization' to put both columns on a 0-1 scale.

In [None]:
# 'gross income' and 'Rating' are on very different scales.
gincome = df["gross income"]
rating = df["Rating"]

print(f'Gross income range: {gincome.min()} to {gincome.max()}')
print(f'Rating range: {rating.min()} to {rating.max()}')

plot_hist(gincome, rating)

In [None]:
# Applying normalization for both variables to be on a 0-1 scale
df[["gross income", "Rating"]] = MinMaxScaler().fit_transform(df[["gross income", "Rating"]])

plot_hist(df['gross income'], df['Rating'])

## Encode Categorical Values

Most machine learning models can only handle numeric variables. <br>
Therefore, categorical variables must be encoded as numeric ones. <br>
One-hot encoding creates one indicator variable for each category in a categorical column. <br>
Here, two categorical columns are encoded: Gender and Payment.

**To encode**:
- Gender and Payment columns

### One-Hot Encoding

In [None]:
# Get dummies to one-hot-encode variables
pd.get_dummies(df[['Gender','Payment']]).head()

**Potential problem with one-hot-encoding**<br>
What if the column has 1000 categories?<br>
One-hot encoding that one column will create 1000 new columns! <br>
That's a lot! <br>
Feeding that many sparse columns into a model makes it harder to find patterns. <br>
With such high dimensionality, the model also takes much longer to train and find the optimal predictor weights.

To resolve this, try using Target Encoding.

### Target Encoding

For each unique category, the average value of the target variable (assuming it is either continuous or binary) is calculated and that becomes the value for the respective category in the categorical column.

The objective is to encode the predictor variable (a categorical column) into a numeric variable that can be used by the model. <br>
To do this, simply group by the predictor variable to get the mean target value for each predictor category. 

In [None]:
# Assuming target variable has values 1, 4, 5, 6
# Categorical variable 'predictor' has values a, b, a, b
target = [1, 4, 5, 6]
predictor = ['a', 'b', 'a', 'b']

# Create dataframe of these two columns
target_enc_df = pd.DataFrame(data={'target':target, 'predictor':predictor})
target_enc_df

In [None]:
# Determine the mean values of 'target' column based on the category in 'predictor' variable
means = target_enc_df.groupby('predictor')['target'].mean()

# Display the encoded predictor
target_enc_df['predictor_encoded'] = target_enc_df['predictor'].map(means)
target_enc_df

**Applying Target Encoding on Supermarket Sales dataset**<br>
Here, Product line is the categorical column to be target encoded, and Rating, a continuous variable, is the target.

In [None]:
# Categorical column: 'Product line'
# Continuous variable: 'Rating'
means = df.groupby('Product line')['Rating'].mean()

df['Product line target encoded'] = df['Product line'].map(means)
df[['Product line','Product line target encoded','Rating']]

## Missing Value Handling

When data is missing at random, information is lost.<br>
Rows with missing data need to be removed or imputed, as most models do not handle missing data.<br>
Since columns with too many missing values do not usually provide a helpful signal, they can be removed based on a threshold condition for missingness.

### Remove columns and rows with a missing-value rate higher than a threshold

In [None]:
# Remove missing values for a certain threshold
threshold = 0.7

# Dropping columns with missing value rate higher than threshold
df = df[df.columns[df.isnull().mean() < threshold]]

# Dropping rows with missing value rate higher than threshold
df = df.loc[df.isnull().mean(axis=1) < threshold]
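
Because the supermarket data has no missing values, the filter above leaves it unchanged. The sketch below applies the same column filter to a hypothetical frame `toy` in which one column is mostly empty, to make the effect visible.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with different missing-value rates per column
toy = pd.DataFrame({
    'a': [1.0, 2.0, 3.0, 4.0],             # 0% missing
    'b': [np.nan, np.nan, np.nan, 4.0],    # 75% missing
    'c': ['x', None, 'y', 'z'],            # 25% missing
})

threshold = 0.7

# Share of missing values in each column
missing_rate = toy.isnull().mean()

# Keep only the columns whose missing-value rate is below the threshold
kept = toy[toy.columns[missing_rate < threshold]]

print(missing_rate)
print(kept.columns.tolist())
```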

### Impute missing values with a single value (e.g. mean or median of column)

For continuous variable, impute missing values with a single value such as the mean or median of the column. <br>
For categorical columns, impute missing values with the mode, or most frequent category in the column.

In [None]:
# Filling missing values in numeric columns with the medians of those columns
df = df.fillna(df.median(numeric_only=True))

# Filling remaining columns - categorical columns - with mode
df = df.apply(lambda x:x.fillna(x.value_counts().index[0]))
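
The same two imputation rules, shown on a hypothetical frame `toy` that actually contains gaps: the numeric column is filled with its median and the categorical column with its most frequent category.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one gap in each column
toy = pd.DataFrame({
    'price': [10.0, np.nan, 30.0, 100.0],
    'payment': ['Cash', 'Cash', None, 'Ewallet'],
})

# Continuous column: fill with the median of the observed values
toy['price'] = toy['price'].fillna(toy['price'].median())

# Categorical column: fill with the mode (most frequent category)
toy['payment'] = toy['payment'].fillna(toy['payment'].mode()[0])

print(toy)
```

The median is used here instead of the mean because a single large value, such as 100.0 above, pulls the mean upwards but leaves the median unaffected.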

## Date-Time Decomposition
Break down a date variable into its constituents, since a model needs to work with numeric variables.

In [None]:
# The original 'Date' variable
df[['Date']].head()

In [None]:
# Convert to datetime object
df['Date'] = pd.to_datetime(df['Date'])
df[['Date']].head()

In [None]:
# Date Decomposition
df['Year']  = df['Date'].dt.year
df['Month'] = df['Date'].dt.month
df['Day']   = df['Date'].dt.day
df[['Year','Month','Day']].head()

**Observations**:
- The date column, which was in the format "year-month-day", is now split into individual columns, namely year, month, and day.
- Such information can now be used by the model to make predictions, as the new columns are numeric.
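
The same `.dt` accessor exposes further components that may be useful as features. The sketch below is an optional extension on hypothetical date strings; it assumes a month/day/year text format, which is stated explicitly so that pandas does not have to guess.

```python
import pandas as pd

# Hypothetical date strings, assumed to be in month/day/year format
raw_dates = pd.Series(['1/5/2019', '3/8/2019'])
dates = pd.to_datetime(raw_dates, format='%m/%d/%Y')

parts = pd.DataFrame({
    'month': dates.dt.month,
    'day': dates.dt.day,
    'dayofweek': dates.dt.dayofweek,          # Monday = 0, Sunday = 6
    'is_weekend': dates.dt.dayofweek >= 5,    # Saturday or Sunday
})

print(parts)
```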
