# **4 Feature Engineering: Air Quality**

In this section, we apply the data transformations suggested by the insights from the previous step, Exploratory Data Analysis (EDA). We also derive new features from existing ones where that improves our understanding of the dataset. The outcome of this phase is a refined dataset tailored for machine learning classification modeling.

## **Methodology**

* [1. Loading Data from Staged](#1_lds)
* [2. Encoding Target Variable](#2_etv)
* [3. Log Transformation](#3_logt)
* [4. Standardize Features](#4_std)
* [5. Variance Inflation Factor (VIF)](#5_vif)
* [6. Feature Selection and Target](#6_feat)
* [7. Resampling Target](#7_res)
* [8. Feature Selection with SelectKBest](#8_sel)
* [9. Save the Processed Dataset](#9_save)

In [1]:
# Imports
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.feature_selection import SelectKBest, f_classif
from imblearn.over_sampling import SMOTE
from statsmodels.stats.outliers_influence import variance_inflation_factor

---

### **1. Loading data from staged**<a id='1_lds'></a>

In [2]:
# Load the staged dataset
file_path = '../data/staged/air_dataset_staged.csv'
df = pd.read_csv(file_path)

# Convert 'Date' column to datetime
df['Date'] = pd.to_datetime(df['Date'])

---

### **2. Encoding target variable**<a id='2_etv'></a>

This step gives the target variable a numerical representation. The resulting integers are category identifiers only; they are not quantities to do arithmetic on.

In [3]:
label_enc = LabelEncoder()
df['Air_Quality_Encoded'] = label_enc.fit_transform(df['Air_Quality'])
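
To make the encoding easier to read back later, the sketch below shows how `LabelEncoder` assigns codes on a small toy list. The category names used here are assumptions for illustration; the actual values of `Air_Quality` may differ.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical category labels; the real 'Air_Quality' values may differ.
toy_labels = ["Good", "Poor", "Moderate", "Good", "Severe"]

enc = LabelEncoder()
codes = enc.fit_transform(toy_labels)

# classes_ is sorted alphabetically, so the integer codes carry no ordinal meaning.
mapping = {label: code for code, label in enumerate(enc.classes_)}
print(mapping)

# inverse_transform recovers the original labels from the integer codes.
decoded = enc.inverse_transform(codes)
print(list(decoded))
```

Because the codes follow alphabetical order rather than severity, keeping `label_enc` (or its `classes_`) around is what allows predictions to be decoded back into category names.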

---

### **3. Log transformation**<a id='3_logt'></a>

This step reduces the skewness observed in the numerical features.

In [4]:
# Define the numerical columns
numerical_cols = ['PM2.5', 'PM10', 'NO', 'NO2', 'NOx', 'NH3', 'CO', 'SO2', 'O3', 'Benzene', 'Toluene', 'AQI']

# Log transformation to reduce skewness; log1p(x) = log(1 + x), so zeros stay at zero
df[numerical_cols] = np.log1p(df[numerical_cols])
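
As a rough illustration of why `log1p` helps, the sketch below compares the skewness of a synthetic right-skewed column before and after the transformation. The data is randomly generated and only stands in for a pollutant column; it is not taken from the dataset.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic right-skewed, non-negative values standing in for a pollutant column.
raw = pd.Series(rng.lognormal(mean=3.0, sigma=1.0, size=5000))
logged = np.log1p(raw)

skew_before = raw.skew()
skew_after = logged.skew()
print(f"skew before: {skew_before:.2f}, after: {skew_after:.2f}")

# log1p maps 0 to 0, and expm1 undoes the transformation.
zero_stays_zero = np.log1p(0.0) == 0.0
restored = np.expm1(logged)
```

Since `np.expm1` inverts `np.log1p`, values can be mapped back to the original scale when results need to be reported in physical units.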

---

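### **4. Standardize features**<a id='4_std'></a>

Rescaling the numerical features to a mean of 0 and a standard deviation of 1 helps ensure that all features contribute on a comparable scale during the classification task. Note that `X_scaled` is used below only for the VIF calculation; the feature matrix built in step 6 is taken from `df`, so the saved dataset holds the log-transformed, unscaled values.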

In [5]:
# Standardize the features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(df[numerical_cols])
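
The sketch below checks, on synthetic data, what `StandardScaler` computes: each column has its mean subtracted and is divided by its population standard deviation. The two toy columns and their scales are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Two synthetic columns on very different scales.
data = np.column_stack([rng.normal(50, 10, 200), rng.normal(0.5, 0.1, 200)])

toy_scaler = StandardScaler()
scaled = toy_scaler.fit_transform(data)

# Manual equivalent: subtract the column mean, divide by the population std (ddof=0).
manual = (data - data.mean(axis=0)) / data.std(axis=0)

print(scaled.mean(axis=0).round(6), scaled.std(axis=0).round(6))
```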

---

### **5. Variance Inflation Factor (VIF)**<a id='5_vif'></a>

This step calculates the VIF to identify multicollinearity among the features, so that features with VIF > 10 can be removed. After removal, the VIF is recalculated on the reduced feature set to confirm that multicollinearity has been reduced.

In [6]:
# Calculate VIF to handle multicollinearity
vif_data = pd.DataFrame()
vif_data['Feature'] = numerical_cols
vif_data['VIF'] = [variance_inflation_factor(X_scaled, i) for i in range(len(numerical_cols))]
print("VIF before reduction:")
print(vif_data)

VIF before reduction:
    Feature       VIF
0     PM2.5  3.948656
1      PM10  2.242072
2        NO  2.375015
3       NO2  1.987779
4       NOx  2.184968
5       NH3  1.261571
6        CO  2.167666
7       SO2  1.399587
8        O3  1.350048
9   Benzene  1.896216
10  Toluene  2.236061
11      AQI  5.939448
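
To show what the VIF values measure, the sketch below computes them by hand on synthetic data: each feature is regressed on the others and the VIF is 1 / (1 - R²). The toy columns `a`, `b`, `c` and the helper `manual_vif` are hypothetical and are not part of the notebook. The comparison with statsmodels relies on the columns being standardized first, as is done for `X_scaled`.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(0)
n = 500
a = rng.normal(size=n)
b = rng.normal(size=n)
c = a + 0.1 * rng.normal(size=n)  # nearly a copy of a, so a and c are collinear
X_toy = StandardScaler().fit_transform(np.column_stack([a, b, c]))


def manual_vif(X, j):
    """Hypothetical helper: regress column j on the others, return 1 / (1 - R^2)."""
    others = np.delete(X, j, axis=1)
    r2 = LinearRegression().fit(others, X[:, j]).score(others, X[:, j])
    return 1.0 / (1.0 - r2)


vif_manual = [manual_vif(X_toy, j) for j in range(X_toy.shape[1])]
vif_sm = [variance_inflation_factor(X_toy, j) for j in range(X_toy.shape[1])]
print(np.round(vif_manual, 2))
print(np.round(vif_sm, 2))
```

The collinear pair gets a large VIF while the independent column stays close to 1, which is the pattern the VIF > 10 rule of thumb is meant to catch.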


In [7]:
# Collect features with VIF > 10 and drop them; every VIF printed above is below 10, so the list is empty here
high_vif_features = vif_data[vif_data['VIF'] > 10]['Feature'].tolist()
X_reduced = df[numerical_cols].drop(columns=high_vif_features)

In [8]:
# Recalculate VIF for the reduced set of features
X_reduced_scaled = scaler.fit_transform(X_reduced)
vif_data_reduced = pd.DataFrame()
vif_data_reduced['Feature'] = X_reduced.columns
vif_data_reduced['VIF'] = [variance_inflation_factor(X_reduced_scaled, i) for i in range(X_reduced.shape[1])]
print("VIF after reduction:")
print(vif_data_reduced)

VIF after reduction:
    Feature       VIF
0     PM2.5  3.948656
1      PM10  2.242072
2        NO  2.375015
3       NO2  1.987779
4       NOx  2.184968
5       NH3  1.261571
6        CO  2.167666
7       SO2  1.399587
8        O3  1.350048
9   Benzene  1.896216
10  Toluene  2.236061
11      AQI  5.939448
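
The cells above drop every feature over the threshold in one pass. A common variant, sketched below under stated assumptions, removes only the single highest-VIF feature and recomputes, since dropping one member of a collinear pair often brings the other back under the threshold. The helper `drop_high_vif` and the toy columns are hypothetical and are not the notebook's method.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from statsmodels.stats.outliers_influence import variance_inflation_factor


def drop_high_vif(frame, threshold=10.0):
    """Hypothetical helper: drop the highest-VIF column until all VIFs <= threshold."""
    cols = list(frame.columns)
    while len(cols) > 1:
        scaled = StandardScaler().fit_transform(frame[cols])
        vifs = [variance_inflation_factor(scaled, i) for i in range(len(cols))]
        worst = int(np.argmax(vifs))
        if vifs[worst] <= threshold:
            break
        cols.pop(worst)  # remove only the worst offender, then recompute
    return cols


rng = np.random.default_rng(1)
a = rng.normal(size=400)
b = rng.normal(size=400)
toy = pd.DataFrame({"a": a, "b": b, "a_copy": a + 0.05 * rng.normal(size=400)})

kept = drop_high_vif(toy)
print(kept)
```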


---

### **6. Feature Selection and Target**<a id='6_feat'></a>

In this step, we select the features and the target for modeling.

In [9]:
# Define features and target
X = df.drop(columns=['City', 'Air_Quality', 'Date', 'Air_Quality_Encoded'] + high_vif_features)
y = df['Air_Quality_Encoded']

---

### **7. Resample target**<a id='7_res'></a>

In this step, we use SMOTE to generate synthetic samples for the minority classes and balance the dataset.

In [10]:
# Resampling to handle class imbalance
smote = SMOTE(random_state=42)
X_resampled, y_resampled = smote.fit_resample(X, y)

---

### **8. Feature selection**<a id='8_sel'></a>

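In this step, we use SelectKBest with the ANOVA F-test to score the relevance of each feature. Because `k='all'` is passed, every feature is retained; setting `k` to an integer would keep only the top-scoring features.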

In [11]:
# Feature selection using SelectKBest
selector = SelectKBest(score_func=f_classif, k='all')
X_selected = selector.fit_transform(X_resampled, y_resampled)
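
The sketch below illustrates the difference between `k='all'` and an integer `k` on synthetic data, and how the fitted selector's `scores_` and `get_support()` can be mapped back to column names. The column names `f0` to `f5` are hypothetical.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

X_arr, y_toy = make_classification(
    n_samples=300, n_features=6, n_informative=2, n_redundant=0, random_state=0
)
X_toy = pd.DataFrame(X_arr, columns=[f"f{i}" for i in range(6)])  # hypothetical names

# k='all' scores every feature but removes none.
keep_all = SelectKBest(score_func=f_classif, k="all").fit(X_toy, y_toy)
scores = pd.Series(keep_all.scores_, index=X_toy.columns).sort_values(ascending=False)
print(scores)

# An integer k keeps only the k highest-scoring features.
top2 = SelectKBest(score_func=f_classif, k=2).fit(X_toy, y_toy)
top2_names = list(X_toy.columns[top2.get_support()])
print(top2_names)
```

Taking the names from the columns of the matrix that was actually passed to the selector keeps names and values aligned even if some columns were dropped earlier.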

---

### **9. Saving the processed dataset**<a id='9_save'></a>

In [12]:
# Rebuild the output dataset
# Take the names from X.columns (the matrix passed to the selector) so they stay aligned even if VIF dropped features
selected_feature_names = X.columns[selector.get_support(indices=True)].tolist()
processed_df = pd.DataFrame(X_selected, columns=selected_feature_names)
processed_df['Air_Quality'] = y_resampled  # integer-encoded labels from step 2

In [13]:
# Save the dataset for modeling
output_file_path = '../data/processed/processed_air_quality_data.csv'
processed_df.to_csv(output_file_path, index=False)
print(f"Processed dataframe saved to {output_file_path}")

Processed dataframe saved to ../data/processed/processed_air_quality_data.csv
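
As a quick sanity check on the saved file, the sketch below writes a small toy frame to a temporary directory and reads it back. The file name and the two toy rows are assumptions for illustration. It shows that `index=False` keeps the row index out of the CSV, so no extra index column appears on reload.

```python
import os
import tempfile

import pandas as pd

# Toy rows standing in for the processed dataset.
toy = pd.DataFrame({"PM2.5": [4.43, 4.39], "Air_Quality": [2, 5]})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy_processed.csv")  # hypothetical file name
    toy.to_csv(path, index=False)
    reloaded = pd.read_csv(path)

same_columns = list(reloaded.columns) == list(toy.columns)
print(same_columns, reloaded.shape)
```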


In [14]:
processed_df.head()

|   | PM2.5 | PM10 | NO | NO2 | NOx | NH3 | CO | SO2 | O3 | Benzene | Toluene | AQI | Air_Quality |
|---|-------|------|----|-----|-----|-----|----|-----|----|---------|---------|-----|-------------|
| 0 | 4.432363 | 4.571407 | 2.070653 | 3.391484 | 3.547316 | 2.824351 | 2.070653 | 3.922369 | 4.106932 | 0.019803 | 0.0 | 5.347108 | 2 |
| 1 | 4.392472 | 4.571407 | 2.698 | 3.390473 | 3.739573 | 2.824351 | 2.698 | 3.901771 | 4.585682 | 0.039221 | 0.0 | 5.796058 | 5 |
| 2 | 4.559336 | 4.571407 | 3.234355 | 3.51631 | 3.981736 | 2.824351 | 3.234355 | 4.225227 | 4.721441 | 0.215111 | 0.00995 | 6.244167 | 4 |
| 3 | 4.919908 | 4.571407 | 3.79504 | 3.763059 | 4.449335 | 2.824351 | 3.79504 | 4.333755 | 4.641502 | 0.336472 | 0.039221 | 6.663133 | 4 |
| 4 | 5.189228 | 4.571407 | 4.017464 | 3.592093 | 4.301359 | 2.824351 | 4.017464 | 4.026066 | 4.685644 | 0.378436 | 0.058269 | 6.818924 | 4 |
