<h1><span style="color: #6495ED;">Delhi Air Quality Forecasting</span></h1>
<h2><span style="color: #6495ED;">Deep Learning for Time Series: LSTM Networks</span></h2>

Prepared by Lipsita Tripathy

January 2024

## Introduction
Welcome to the series of Jupyter Notebooks for our project on "Delhi Air Quality Prediction." The project aims to build a predictive model for forecasting the air quality index (AQI) in Delhi, a city known for severe air pollution. The notebooks cover each stage of the project: data preparation, exploratory data analysis (EDA), and baseline and advanced modeling.

## Data Dictionary

For this project, we're using air quality data gathered from 40 stations across Delhi, covering the period from 2013 to 2023. The dataset includes 12 distinct features, each representing different air quality and environmental parameters. These data points are collected from each station and then aggregated to form a comprehensive dataset with unique datetime records for each entry.

| Features                  | Description                                                | Type       |
|---------------------------|------------------------------------------------------------|------------|
| Datetime                  | Timestamp indicating the date and time of the recorded data | datetime64 |
| StationId                 | Unique identifier for each monitoring station              | Numeric    |
| PM2.5 (ug/m3)             | Particulate Matter with a diameter of 2.5 microns or less   | Numeric    |
| PM10 (ug/m3)              | Particulate Matter with a diameter of 10 microns or less   | Numeric    |
| NO (ug/m3)                | Nitric Oxide concentration                                 | Numeric    |
| NO2 (ug/m3)               | Nitrogen Dioxide concentration                              | Numeric    |
| NOx (ug/m3)               | Sum of Nitric Oxide and Nitrogen Dioxide concentrations    | Numeric    |
| NH3 (ug/m3)               | Ammonia concentration                                      | Numeric    |
| SO2 (ug/m3)               | Sulfur Dioxide concentration                               | Numeric    |
| CO (ug/m3)                | Carbon Monoxide concentration                              | Numeric    |
| Ozone (ug/m3)             | Ozone concentration                                        | Numeric    |
| Benzene (ug/m3)           | Concentration of Benzene in the air                         | Numeric    |
| Toluene (ug/m3)           | Concentration of Toluene in the air                         | Numeric    |
| Xylene (ug/m3)            | Concentration of Xylene in the air                          | Numeric    |
| RH (%)                    | Relative Humidity in percentage                            | Numeric    |
| WS (m/s)                  | Wind Speed in meters per second                             | Numeric    |
| WD (degree)               | Wind Direction in degrees                                  | Numeric    |
| BP (mmHg)                 | Barometric Pressure in millimeters of mercury              | Numeric    |
| AT (degree C)             | Ambient Temperature in degrees Celsius                     | Numeric    |
| RF (mm)                   | Rainfall in millimeters                                    | Numeric    |
| SR (W/mt2)                | Solar Radiation in Watts per square meter                   | Numeric    |


| Target                    | Description                                                | Type       |
|---------------------------|------------------------------------------------------------|------------|
| <span style="color: #FF0000;">y_AQI</span> | Target variable representing the predicted Air Quality Index for the next 24 hours | Numeric    |


In [36]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from keras.models import Sequential
from keras.layers import LSTM, Dense
from sklearn.metrics import mean_squared_error

In [2]:
df = pd.read_csv('data/generated/all_in_one/Delhi_AQI_final_df_before_modeling.csv')
df.head()

Unnamed: 0,Datetime,AQI,PM2.5 (ug/m3),PM10 (ug/m3),NO (ug/m3),NO2 (ug/m3),NOx (ug/m3),NH3 (ug/m3),SO2 (ug/m3),CO (ug/m3),...,t_CO (ug/m3),t_Ozone (ug/m3),t_Benzene (ug/m3),t_Toluene (ug/m3),t_Xylene (ug/m3),t_WS (m/s),t_SR (W/mt2),t_Volatility_Last_24hr,t_Volatility_Last_7d,t_Volatility_Last_30d
0,2013-01-01 00:00:00,354.0,290.774583,292.631667,52.055615,66.014148,117.224563,75.685556,9.99213,9.138167,...,2.316307,2.694264,1.844071,3.437065,0.0,0.314162,3.714195,2.23082,4.097861,7.233227
1,2013-01-01 01:00:00,358.0,275.749821,296.15,37.73625,59.41563,88.122976,66.740556,9.477546,7.66531,...,2.159328,2.39589,1.851948,3.244656,0.0,0.330103,3.673794,2.353812,4.086369,7.233005
2,2013-01-01 02:00:00,362.0,271.463472,309.03,26.387774,57.951291,61.46469,57.030556,9.207963,10.777421,...,2.466184,2.11825,1.835493,3.169191,0.0,0.329304,3.455326,2.475083,4.074597,7.232991
3,2013-01-01 03:00:00,367.0,279.071667,317.826667,23.310857,59.172513,47.583524,43.298333,10.871667,11.79381,...,2.548961,2.187191,1.83567,3.1627,0.0,0.326422,3.079282,2.592957,4.063127,7.23317
4,2013-01-01 04:00:00,370.0,269.118333,308.521667,24.574667,62.451032,43.535333,32.023333,11.020833,10.027778,...,2.400417,2.336875,1.747604,3.113589,0.0,0.379197,2.525195,2.692013,4.051925,7.233493


In [3]:
df['Datetime'] = pd.to_datetime(df['Datetime'])
df.set_index('Datetime', inplace=True)

In [4]:
# Removing the columns that we don't need
X_columns = df.drop(['Year', 'y_AQI', 'AQI_Category', 'PM2.5 (ug/m3)', 'PM10 (ug/m3)', 'NO (ug/m3)', 'NO2 (ug/m3)', 'NOx (ug/m3)', 'NH3 (ug/m3)', 'SO2 (ug/m3)', 
                      'CO (ug/m3)', 'Ozone (ug/m3)', 'Benzene (ug/m3)', 'Toluene (ug/m3)', 'Xylene (ug/m3)', 'WS (m/s)', 'SR (W/mt2)',
                      'Volatility_Last_24hr', 'Volatility_Last_7d', 'Volatility_Last_30d'], axis=1).columns.tolist()

# Selecting the features (X) and the target variable (y)
X = df[X_columns]
y = df['y_AQI']  

In [5]:
print("Shape of X:", X.shape)
print("Shape of y:", y.shape)

Shape of X: (89784, 31)
Shape of y: (89784,)


In [6]:
# X and y share df's datetime index, so the split is chronological:
# rows before split_date are used for training, the rest for testing
split_date = '2022-03-01'

# Training set
X_train = X[X.index < split_date]
y_train = y[y.index < split_date]

# Testing set
X_test = X[X.index >= split_date]
y_test = y[y.index >= split_date]

### Normalize the Data

### Reasons for Using `reshape(-1, 1)` when Scaling Target Variable

- **Scikit-learn expectation:** Scalers such as `MinMaxScaler` expect a two-dimensional array of shape `(n_samples, n_features)` in `fit`, `transform`, and `fit_transform`. The target `y` is one-dimensional, with shape `(n_samples,)`, so `reshape(-1, 1)` turns it into a single column of shape `(n_samples, 1)`.

- **Consistent data format:** After reshaping, the features (`X`) and the target (`y`) are both two-dimensional arrays, so the same scaling and inverse-scaling steps apply to both.

Both scalers are fit on the training data only and then reused to transform the test data.
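The shape requirement can be illustrated with a small NumPy sketch. The values are made up, and `min_max_fit` and `min_max_transform` are hypothetical helpers that mirror the default `MinMaxScaler` formula, `(x - min) / (max - min)` per column; they are not scikit-learn's implementation.

```python
import numpy as np

# Made-up AQI values standing in for the training and test targets
y_train_demo = np.array([100.0, 300.0, 200.0])
y_test_demo = np.array([150.0, 400.0])

# A 1-D array has shape (n,); reshape(-1, 1) makes it one column of shape (n, 1)
y_train_2d = y_train_demo.reshape(-1, 1)
y_test_2d = y_test_demo.reshape(-1, 1)

def min_max_fit(column_data):
    """Return the per-column minimum and maximum (hypothetical helper)."""
    return column_data.min(axis=0), column_data.max(axis=0)

def min_max_transform(column_data, col_min, col_max):
    """Scale each column with a previously learned range (hypothetical helper)."""
    return (column_data - col_min) / (col_max - col_min)

# Learn the range on the training data only, then reuse it for the test data
col_min, col_max = min_max_fit(y_train_2d)
y_train_demo_scaled = min_max_transform(y_train_2d, col_min, col_max)
y_test_demo_scaled = min_max_transform(y_test_2d, col_min, col_max)

print(y_train_2d.shape)             # (3, 1)
print(y_train_demo_scaled.ravel())  # values 0.0, 1.0, 0.5
print(y_test_demo_scaled.ravel())   # values 0.25, 1.5
```

The test value 400 lies above the training maximum, so it scales to 1.5. This is expected when the range is learned from the training data alone.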

In [7]:
from sklearn.preprocessing import MinMaxScaler

# Initialize the scaler for features and target variable
scaler_X = MinMaxScaler()
scaler_y = MinMaxScaler()

# Fit and transform the scaler on the training features
X_train_scaled = scaler_X.fit_transform(X_train)
# Fit and transform the scaler on the training target variable
y_train_scaled = scaler_y.fit_transform(y_train.values.reshape(-1, 1))

# Transform the test features using the fitted scaler for features
X_test_scaled = scaler_X.transform(X_test)
# Transform the test target variable using the fitted scaler for the target variable
y_test_scaled = scaler_y.transform(y_test.values.reshape(-1, 1))

Before building sequences, we check that `X_train_scaled` and `y_train_scaled` are NumPy arrays with the expected shapes. Both are still two-dimensional at this point. The LSTM layer expects three-dimensional input of shape `(samples, time steps, features)`, which the sequence-creation step produces.

In [8]:
# Check data types and shapes
print(type(X_train_scaled), X_train_scaled.shape)
print(type(y_train_scaled), y_train_scaled.shape)

<class 'numpy.ndarray'> (80304, 31)
<class 'numpy.ndarray'> (80304, 1)


### Create Sequences for LSTM
Define a function to create sequences of input features and target variable for the LSTM model.
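One way to build such sequences is sketched below with NumPy on toy data. The helper name `make_windows` is hypothetical, and taking the label from a separate target array (rather than from the feature matrix) is an assumption about the intended design, not necessarily what this notebook's own function does.

```python
import numpy as np

def make_windows(features, targets, seq_length):
    """Pair each window of `seq_length` feature rows with the target that follows it."""
    X, y = [], []
    for i in range(len(features) - seq_length):
        X.append(features[i:i + seq_length])  # rows i .. i + seq_length - 1
        y.append(targets[i + seq_length])     # target at the next time step
    return np.array(X), np.array(y)

# Toy data: 6 time steps, 2 features, 1 target column
features_demo = np.arange(12, dtype=float).reshape(6, 2)
targets_demo = np.arange(6, dtype=float).reshape(-1, 1) * 10

X_seq_demo, y_seq_demo = make_windows(features_demo, targets_demo, seq_length=3)

print(X_seq_demo.shape)  # (3, 3, 2): samples, time steps, features
print(y_seq_demo.shape)  # (3, 1): one target value per sample
```

With 6 time steps and a window of 3, only 3 samples can be formed, because the first 3 rows have no complete window before them.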

In [9]:
# def create_sequences(data, seq_length):
#     X, y = [], []
#     for i in range(len(data) - seq_length):
#         seq = data[i:i+seq_length]
#         label = data[i+seq_length]
#         X.append(seq)
#         y.append(label)
#     return np.array(X), np.array(y)

### Define Hyperparameters and Create Sequences using the function we created

In [10]:
# seq_length = 24  # Number of hours to consider for each input sequence

# # Create sequences for training data
# X_train_seq, y_train_seq = create_sequences(X_train_scaled, seq_length)

# # Create sequences for test data
# X_test_seq, y_test_seq = create_sequences(X_test_scaled, seq_length)

### Define and Build the LSTM Model:

In [11]:
# model = Sequential()
# model.add(LSTM(50, activation='relu', input_shape=(X_train_seq.shape[1], X_train_seq.shape[2])))
# model.add(Dense(1))
# model.compile(optimizer='adam', loss='mse')

### Train the Model:

Let's train our LSTM model using the training sequences (X_train_seq, y_train_seq).

In [12]:
# model.fit(X_train_seq, y_train_seq, epochs=50, batch_size=32, verbose=2)

### Predict on Test Data:

Use the trained model to make predictions on the test sequences (X_test_seq). Inverse transform the scaled predictions to get predictions in the original scale.
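As a rough illustration of what the inverse transform does, the sketch below undoes min-max scaling by hand. The range values and predictions are made up, and `inverse_min_max` is a hypothetical helper; in the notebook itself the fitted `scaler_y` plays this role.

```python
import numpy as np

# Assumed target range learned from the training data (made-up numbers)
aqi_min, aqi_max = 50.0, 450.0

def inverse_min_max(scaled, data_min, data_max):
    """Undo (x - min) / (max - min) scaling (hypothetical helper)."""
    return scaled * (data_max - data_min) + data_min

# Made-up scaled predictions with shape (n_samples, 1), like the output of Dense(1)
y_pred_scaled_demo = np.array([[0.0], [0.5], [1.0]])
y_pred_demo = inverse_min_max(y_pred_scaled_demo, aqi_min, aqi_max)

print(y_pred_demo.ravel())  # AQI values 50, 250, 450
```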

In [13]:
# y_pred_scaled = model.predict(X_test_seq)
# y_pred = scaler_y.inverse_transform(y_pred_scaled)

### Evaluate Model Performance:

In [14]:
# print(y_test_actual.shape)
# print(y_pred.shape)


In [15]:
# y_test_actual = y_test_actual.reshape(-1)
# y_pred = y_pred.reshape(-1)


In [16]:
# print(len(X_test_seq))
# print(len(y_test_seq))


In [17]:
# print(len(y_test_actual))
# print(len(y_pred))


In [18]:
# from sklearn.metrics import mean_squared_error

# # Inverse transform the scaled actual values for evaluation
# y_test_actual = scaler_y.inverse_transform(y_test_seq).reshape(-1)

# # Calculate Mean Squared Error (MSE)
# mse = mean_squared_error(y_test_actual, y_pred.reshape(-1))
# print(f'Mean Squared Error on Test Set: {mse}')

### Visualize Predictions:

Plot the predicted values against the actual values to visually assess the model's performance.

In [19]:
# plt.plot(y_test_actual, label='Actual')
# plt.plot(y_pred, label='Predicted')
# plt.legend()
# plt.show()

### Create Sequences

In [20]:
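# Function to create sequences.
# Each sample is a window of `seq_length` consecutive rows of `data`; its label
# is the row of `data` that follows the window. When `data` is the scaled feature
# matrix, the label has one value per feature rather than the scaled y_AQI.
def create_sequences(data, seq_length):
    X, y = [], []
    for i in range(len(data) - seq_length):
        seq = data[i:i+seq_length]
        label = data[i+seq_length]
        X.append(seq)
        y.append(label)
    return np.array(X), np.array(y)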


In [21]:
# Choose sequence length for the weekly pattern
seq_length_weekly = 24 * 7  # weekly pattern

# Create sequences for training data
X_train_seq, y_train_seq = create_sequences(X_train_scaled, seq_length_weekly)

# Create sequences for testing data
X_test_seq, y_test_seq = create_sequences(X_test_scaled, seq_length_weekly)

### Build and Compile LSTM Model

In [23]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense, Dropout, LeakyReLU
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.callbacks import EarlyStopping

In [24]:
# Function to build the LSTM model with Leaky ReLU activation
def build_lstm_model(seq_length):
    model = Sequential()
    model.add(LSTM(50, return_sequences=True, input_shape=(seq_length, X_train_seq.shape[2])))
    model.add(LSTM(50))
    model.add(Dropout(0.2))
    model.add(Dense(1))
    model.add(LeakyReLU(alpha=0.1))  # Use Leaky ReLU activation

    optimizer = Adam(learning_rate=0.001)  # Custom learning rate
    model.compile(loss='mse', optimizer=optimizer)
    
    return model

# Build and compile the model for the weekly pattern
lstm_model_weekly = build_lstm_model(seq_length_weekly)



### Train the LSTM Model

In [25]:
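# Train the model with early stopping.
# The label for window i is the scaled y_AQI at row i + seq_length_weekly
# (y_train_seq from create_sequences holds feature rows, not the target).
y_train_target = y_train_scaled[seq_length_weekly:]
early_stopping = EarlyStopping(monitor='val_loss', patience=3, restore_best_weights=True)
lstm_model_weekly.fit(X_train_seq, y_train_target, epochs=20, batch_size=16, validation_split=0.1, callbacks=[early_stopping])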

Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20


<keras.src.callbacks.History at 0x2a008a910>

### Evaluate and Visualize Results

In [31]:
# Evaluate the model on the test set
y_pred_weekly = lstm_model_weekly.predict(X_test_seq)
# Window i is labelled with row i + seq_length_weekly, so the matching
# ground truth is the scaled target from that row onward
y_test_actual_weekly = scaler_y.inverse_transform(y_test_scaled[seq_length_weekly:])
y_pred_actual_weekly = scaler_y.inverse_transform(y_pred_weekly)


In [35]:
# Ensure shapes before calculating MSE
print("Shape of y_test_actual_weekly:", y_test_actual_weekly.shape)
print("Shape of y_pred_actual_weekly:", y_pred_actual_weekly.shape)


Shape of y_test_actual_weekly: (9312, 1)
Shape of y_pred_actual_weekly: (9312, 1)


In [33]:
# Calculate Mean Squared Error (MSE)
mse_weekly = mean_squared_error(y_test_actual_weekly, y_pred_actual_weekly)
print(f'Mean Squared Error on Test Set (Weekly Pattern): {mse_weekly}')

In [None]:
# Visualize actual vs. predicted values
plt.plot(y_test_actual_weekly, label='Actual AQI (Weekly Pattern)')
plt.plot(y_pred_actual_weekly, label='Predicted AQI (Weekly Pattern)')
plt.legend()
plt.xlabel('Time Steps')
plt.ylabel('AQI')
plt.show()

In [29]:
# Evaluate the model on the test set
y_pred = lstm_model_weekly.predict(X_test_seq)
# Ground truth aligned with the predictions: window i is labelled with row i + seq_length_weekly
y_test_actual = scaler_y.inverse_transform(y_test_scaled[seq_length_weekly:])
y_pred_actual = scaler_y.inverse_transform(y_pred)

mse = mean_squared_error(y_test_actual, y_pred_actual)
print(f'Mean Squared Error on Test Set (seq_length={seq_length_weekly}): {mse}')

# Visualize actual vs. predicted values
plt.plot(y_test_actual, label=f'Actual AQI (seq_length={seq_length_weekly})')
plt.plot(y_pred_actual, label=f'Predicted AQI (seq_length={seq_length_weekly})')
plt.legend()
plt.xlabel('Time Steps')
plt.ylabel('AQI')
plt.show()

In [30]:
print(X_train_seq.shape, y_train_seq.shape)
print(X_test_seq.shape, y_test_seq.shape)


(80136, 168, 31) (80136, 31)
(9312, 168, 31) (9312, 31)
