## Notebook 4 - All locations
### **Predicting Dengue Fever Incidence and Disease Dynamics under Climate Change in Southeast Asia**
### Master's thesis by Josephine Lutter, supervised by Professor Roberto Henriques

## Table of Contents
<ul>
  <li><a href="#1.-Import">1. Import</a></li>
  <li><a href="#2.-Data-Preparation">2. Data Preparation</a>
    <ul>
      <li><a href="#a.-Feature-Removal">a. Feature Removal</a></li>
      <li><a href="#b.-Feature-Creation">b. Feature Creation</a></li>
      <li><a href="#c.-Data-Encoding">c. Data Encoding</a></li>
      <li><a href="#d.-Data-Partition">d. Data Partition</a></li>
      <li><a href="#e.-Data-Normalization">e. Data Normalization</a></li>
    </ul>
  </li>
  <li><a href="#3.-Predictive-Modeling">3. Predictive Modeling</a>
    <ul>
      <li><a href="#a.-Recurrent-Neural-Network-(RNN)">a. Recurrent Neural Network (RNN)</a></li>
      <li><a href="#b.-Feedforward-Neural-Network-(FNN)">b. Feedforward Neural Network (FNN)</a></li>
      <li><a href="#c.-Convolutional-Neural-Network-(CNN)">c. Convolutional Neural Network (CNN)</a></li>
    </ul>
  </li>
  <li><a href="#3.1-Final-Training-and-Prediction-on-Test-Set">3.1 Final Training and Prediction on Test Set</a></li>
  <li><a href="#3.2-Climate-Change-Assessment">3.2 Climate Change Assessment</a></li>
</ul>


## 1. Import

In [None]:
#!pip install scipy
#!pip install tensorflow
#!pip install keras-self-attention
#!pip install git+https://www.github.com/keras-team/keras-contrib.git
#!pip install bayesian-optimization
#!pip install keras-tuner

In [None]:
# General imports
import os
import math
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.api as sm
from numpy import concatenate
from pandas import read_csv, DataFrame, concat

# Imports for Data Exploration
from statsmodels.tsa.stattools import adfuller
from statsmodels.tsa.seasonal import STL

# Imports for Data Engineering / Preprocessing
from sklearn.model_selection import train_test_split, TimeSeriesSplit
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
from sklearn.preprocessing import MinMaxScaler, RobustScaler, LabelEncoder
from sklearn.feature_selection import RFE, SequentialFeatureSelector, SelectFromModel
from sklearn.ensemble import RandomForestRegressor

# Imports for Traditional Machine Learning Modeling
from sklearn.model_selection import RandomizedSearchCV, GridSearchCV
from scipy.stats import randint, uniform
from sklearn.neighbors import KNeighborsRegressor
from sklearn.ensemble import AdaBoostRegressor, RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.svm import SVR
from sklearn.metrics import mean_absolute_error, mean_squared_error, mean_absolute_percentage_error, r2_score

# Imports for Deep Learning
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, LSTM, Dropout, Input, Conv1D, MaxPooling1D, Flatten, GRU, SimpleRNN
from tensorflow.keras.optimizers import Adam, RMSprop, Nadam, legacy
from tensorflow.keras.regularizers import l2
from tensorflow.keras.callbacks import ModelCheckpoint, EarlyStopping
import keras_tuner
from keras_tuner import HyperModel
from keras_tuner.tuners import RandomSearch
from keras.utils import plot_model

# Set seed value to ensure constant, deterministic prediction
random_seed = 1234
os.environ['PYTHONHASHSEED'] = str(random_seed)
os.environ['TF_DETERMINISTIC_OPS'] = '1'
os.environ['TF_CUDNN_DETERMINISTIC'] = '1'

# Lastly, setting random seed for modeling
import random
random.seed(random_seed)

import numpy as np
np.random.seed(random_seed)

# Check tensorflow version
#print(tf.__version__)
tf.random.set_seed(random_seed)

session_conf = tf.compat.v1.ConfigProto(intra_op_parallelism_threads=1, inter_op_parallelism_threads=1)
sess = tf.compat.v1.Session(graph=tf.compat.v1.get_default_graph(), config=session_conf)
tf.compat.v1.keras.backend.set_session(sess)

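# Note: check_random_state only returns a RandomState instance and does not set a global seed;
# scikit-learn estimators still need random_state=random_seed passed explicitly
from sklearn.utils import check_random_state
check_random_state(random_seed)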

In [None]:
# Load the Excel files, which were pre-structured in Microsoft Excel to guarantee a consistent structure

file = "/Users/Fine/Documents/Master Business Analytics/Thesis/Research Data/Final Data/Combined//All locations combined.xlsx"
#file = "/Users/Fine/Documents/Master Business Analytics/Thesis/Research Data/Final Data/Combined//Simulations RCP8.5_mod.xlsx"
df_3 = pd.read_excel(file, sheet_name='With outlier capping')
#df_3 = pd.read_excel(file, sheet_name='RCP8.5, Jan, 2030')
#df_3 = pd.read_excel(file, sheet_name='RCP8.5, Jul, 2030')
#df_3 = pd.read_excel(file, sheet_name='RCP8.5, Jan, 2050')
#df_3 = pd.read_excel(file, sheet_name='RCP8.5, Jul, 2050')
#df_3 = pd.read_excel(file, sheet_name='RCP8.5, Jan, 2070')
#df_3 = pd.read_excel(file, sheet_name='RCP8.5, Jul, 2070')
#df_3 = pd.read_excel(file, sheet_name='RCP8.5, Jan, 2100')
#df_3 = pd.read_excel(file, sheet_name='RCP8.5, Jul, 2100')

# Define the location for plots
location = 'Southeast Asia'

# Set 'Date' as the index, round the values, and sort by date to keep the time series in chronological order
df_3.set_index("Date", inplace=True)
df_3 = df_3.round(4).sort_index()

## 2. Data Preparation

The EDA of the monthly combined and aggregated environmental and incidence data was performed in notebook 3. This notebook covers the implementation of the deep learning models. Here, the data preparation differs from the traditional machine learning models, with most steps being performed within the functions.

### a. Feature Removal

Removing features that are identified as irrelevant.

In [None]:
# Drop the highlighted variables from notebook 3

df_3 = df_3.drop(columns=['Min_Daily_Prcp', 'Monthly_Avg_Prcp', 'Min_Average_Temp', 'Max_Average_Temp'])

# Define list
environmental_variables = [
    "Max_Daily_Prcp", "Monthly_Total_Prcp","Monthly_Avg_Temp", 
    "Min_Daily_Temp", "Max_Daily_Temp", "N_Raining_Days",
]

### b. Feature Creation

Having defined the relevant environmental variables in the previous section, the objective is to create lagged variables for time-series forecasting. First, the lag intervals are determined. As discussed in the literature review, the total dengue infection cycle lasts 4-7 weeks, and environmental factors strongly influence the mosquito habitat and lifecycle. Consequently, lagged variables are created for each environmental factor at lags of 1 to 6 months before the current observation, since the data is aggregated monthly.
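The lagging step can be illustrated on a small example. The following is a minimal sketch, assuming a long-format table in which several locations share the same dates; the toy frame, the location labels `A` and `B`, and the temperature values are hypothetical and not taken from the thesis data. It shows lags computed within each location, so that a lagged value never comes from another location's series.

```python
import pandas as pd

# Toy long-format frame: two hypothetical locations, four months each
toy = pd.DataFrame(
    {
        "Name": ["A"] * 4 + ["B"] * 4,
        "Monthly_Avg_Temp": [25.0, 26.0, 27.0, 28.0, 30.0, 31.0, 32.0, 33.0],
    },
    index=pd.to_datetime(["2020-01-01", "2020-02-01", "2020-03-01", "2020-04-01"] * 2),
)
toy.index.name = "Date"

# Shift within each location so a lag never borrows values from another series
for lag in (1, 2):
    toy[f"Monthly_Avg_Temp_lag_{lag}"] = toy.groupby("Name")["Monthly_Avg_Temp"].shift(lag)

# The first rows of each location have no complete lag history and are dropped
toy_lagged = toy.dropna()
print(toy_lagged)
```

With a maximum lag of 2, the first two rows of each location are lost; with the 1 to 6 month lags used in this notebook, the same logic removes the first six rows per location.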

In [None]:
# Creation of lagged features
# Lag intervals from 1 to 6 months
# The first six rows of each location have no complete lag history and therefore contain missing values
# Rows with missing values are excluded to avoid subsequent issues

# Define interval
lagged_intervals = range(1, 7)
new_features = []

# Create lags per location so that a lag never takes values from another location's series
# (df_3 is sorted by Date, so rows of different locations are interleaved)
for variable in environmental_variables:
    for lag in lagged_intervals:
        new_feature_name = f"{variable}_lag_{lag}"
        df_3[new_feature_name] = df_3.groupby("Name")[variable].shift(lag)
        new_features.append(new_feature_name)

# Update environmental_variables
environmental_variables.extend(new_features)

# Drop the rows whose lagged features are missing
df_3.dropna(inplace=True)

# Visualize the engineered data 
df_3.head()

In [None]:
# Check how many features are established after feature engineering
len(environmental_variables)

### c. Data Encoding
Encoding of categorical variable "Name". Source of inspiration: https://medium.com/aiskunks/categorical-data-encoding-techniques-d6296697a40f
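As a small illustration of what label encoding produces, the sketch below encodes a few location names. The names are hypothetical placeholders, not the locations in the dataset. `LabelEncoder` assigns integer codes according to the sorted order of the unique labels, and a fitted encoder can map the codes back to names.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical location names, used only for illustration
names = ["Vietnam", "Laos", "Vietnam", "Cambodia"]

encoder = LabelEncoder()
codes = encoder.fit_transform(names)

# classes_ holds the sorted unique labels; a label's code is its position in that array
mapping = {name: code for code, name in enumerate(encoder.classes_)}
decoded = encoder.inverse_transform(codes)

print(mapping)
print(codes)
```

The integer codes carry no ordinal meaning. In this notebook they serve only to filter the rows of each location, and the column is dropped before the data is passed to a model.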

In [None]:
# Fit and transform the variable Name with LabelEncoder
from sklearn.preprocessing import LabelEncoder
df_3['Location Code'] = LabelEncoder().fit_transform(df_3['Name'])

In [None]:
# Print location names and their respective location codes
df_3[['Name', 'Location Code']].drop_duplicates().sort_values('Location Code')

### d. Data Partition

At this stage, a split is performed to separate a test set from the training phase for the later assessment of the model's success on unseen data.

Here, only the location-specific (individual) approach is pursued, because the holistic model evaluated in Notebook 3 proved ineffective. The last 12 monthly observations of each location are held out as the test set. Within the remaining data, the models are validated either with a time-based 80/20 train-validation split or with time-series cross-validation, which follows an expanding-window training approach.
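To make the expanding-window idea concrete, the sketch below prints the folds that scikit-learn's `TimeSeriesSplit` produces. It assumes 24 monthly observations for a single location; this number is illustrative and does not reflect the actual series length.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# 24 hypothetical monthly observations for a single location
X_toy = np.arange(24).reshape(-1, 1)

tscv = TimeSeriesSplit(n_splits=5)
folds = list(tscv.split(X_toy))

for i, (train_idx, val_idx) in enumerate(folds, start=1):
    # The training window always starts at index 0 and grows with every fold;
    # the validation block always lies directly after it
    print(f"fold {i}: train 0-{train_idx[-1]}, validate {val_idx[0]}-{val_idx[-1]}")
```

Every fold validates on observations that come strictly after its training window, so no future information reaches the model during training.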


#### Individual approach

In [None]:
# Perform the time-based train/test split per location

# The split is done per location to handle multiple independent time series
# The test set holds the last 12 monthly instances (one year) of each location
# The remaining training data is used later for validation and cross-validation

# Specify features and target
features = ['Location Code'] + environmental_variables
target = 'Incidence Rate'

# Divide the DataFrame into features (X) and target (y)
X = df_3[features]#.astype('float32')
y = df_3[target]#.astype('float32')

# Prepare lists to hold split data
train_dfs, test_dfs, y_train_dfs, y_test_dfs = [], [], [], []

# Perform the split for each unique location
# (the loop variable is named loc_code so the plot label `location` defined above is not overwritten)
for loc_code in X['Location Code'].unique():
    # Filter rows for the current location
    location_mask = X['Location Code'] == loc_code
    X_location = X[location_mask]
    y_location = y[location_mask]

    # Calculate the train size by excluding the last 12 instances, which represents a year
    train_size = len(X_location) - 12

    # Split the data into train and test sets for the current location
    X_train_val, X_test = X_location.iloc[:train_size], X_location.iloc[train_size:]
    y_train_val, y_test = y_location.iloc[:train_size], y_location.iloc[train_size:]

    # Append the splits to their respective lists
    train_dfs.append(X_train_val)
    test_dfs.append(X_test)
    y_train_dfs.append(y_train_val)
    y_test_dfs.append(y_test)
    
# Concatenate the splits into training and testing sets
X_train_val, X_test = pd.concat(train_dfs).sort_index(), pd.concat(test_dfs).sort_index()
y_train_val, y_test = pd.concat(y_train_dfs).sort_index(), pd.concat(y_test_dfs).sort_index()

In [None]:
# Visualization: Individual, independent approach

# Visualization of the train-validation-test split by location
plt.figure(figsize=(12, 8))

# Plot data for each unique location
# (the loop variable is named loc_code so the plot label `location` is not overwritten)
for loc_code in df_3['Location Code'].unique(): 
    # Plot training data
    train_subset = X_train_val[X_train_val['Location Code'] == loc_code]
    plt.scatter(train_subset.index, [loc_code] * len(train_subset), color='blue', alpha=0.5)

    # Plot testing data
    test_subset = X_test[X_test['Location Code'] == loc_code]
    plt.scatter(test_subset.index, [loc_code] * len(test_subset), color='red', alpha=0.5)

plt.xlabel('Date')
plt.ylabel('Location Code')
plt.title('Train-Test Split by Location Code', fontsize=18)
plt.legend(['Train', 'Test'])
plt.grid(True)
plt.xticks()
plt.tight_layout()
plt.savefig(os.path.join(f"{location}:Train-Test-Split-for-each-location.png"), dpi=200)
plt.show()

### e. Data Normalization

In [None]:
# Function for feature scaling

def feature_scaling(X_train, X_test, numerical_features, scaler):

    # Apply scaling
    X_train_scaled = scaler.fit_transform(X_train[numerical_features])
    X_test_scaled = scaler.transform(X_test[numerical_features])
    
    # Convert scaled arrays back to DataFrame, ensuring correct column names and index
    X_train_scaled = pd.DataFrame(X_train_scaled, columns=numerical_features, index=X_train.index)
    X_test_scaled = pd.DataFrame(X_test_scaled, columns=numerical_features, index=X_test.index)
    
    # Combine scaled and non-scaled features
    X_train_final = pd.concat([X_train.drop(columns=numerical_features), X_train_scaled], axis=1)
    X_test_final = pd.concat([X_test.drop(columns=numerical_features), X_test_scaled], axis=1)
    
    return X_train_final, X_test_final

In [None]:
# Function for inverse scaling

def inverse_feature_scaling(X_train_scaled, X_test_scaled, numerical_features, scaler):

    # Inverse scaling
    X_train_inverse_scaled = scaler.inverse_transform(X_train_scaled[numerical_features])
    X_test_inverse_scaled = scaler.inverse_transform(X_test_scaled[numerical_features])
    
    # Convert inverse scaled arrays back to DataFrame, ensuring correct column names and index
    X_train_inverse_scaled = pd.DataFrame(X_train_inverse_scaled, columns=numerical_features, index=X_train_scaled.index)
    X_test_inverse_scaled = pd.DataFrame(X_test_inverse_scaled, columns=numerical_features, index=X_test_scaled.index)
    
    # Combine inverse scaled and non-scaled features
    X_train_final = pd.concat([X_train_scaled.drop(columns=numerical_features), X_train_inverse_scaled], axis=1)
    X_test_final = pd.concat([X_test_scaled.drop(columns=numerical_features), X_test_inverse_scaled], axis=1)
    
    return X_train_final, X_test_final

In [None]:
X_train_val_scaled_minmax, X_test_scaled_minmax = feature_scaling(X_train_val, X_test, environmental_variables, MinMaxScaler())

In [None]:
# Note: despite its name, X_val_scaled_robust holds the scaled test set (X_test)
X_train_scaled_robust, X_val_scaled_robust = feature_scaling(X_train_val, X_test, environmental_variables, RobustScaler())

## 3. Predictive Modeling
In this notebook, predictive modeling is performed using deep learning models. Several requirements were addressed: Scaling and encoding data in float format, creating lagged features and removing non-informative ones, setting a random seed for reproducibility, and appropriately splitting the data to address the time-series problem. Data is sorted by index Date, which is a time-based feature. 

Various deep learning models suitable for sequential data such as time series were applied, and different model configurations were tested. To avoid overfitting, the networks were kept shallow (few layers) and early stopping was used: training stops once the validation loss has not improved for 5 consecutive epochs.
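The early-stopping rule can be traced by hand on a list of validation losses. The function below is a simplified, framework-free sketch of patience-based stopping, written only for illustration. `early_stopping_trace` is a hypothetical helper and not part of Keras, and the loss values are made up.

```python
def early_stopping_trace(val_losses, patience=5):
    """Return (stop_epoch, best_epoch) as 0-based indices.

    Simplified illustration of patience-based early stopping,
    not the Keras implementation.
    """
    best_loss = float("inf")
    best_epoch = None
    wait = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            # Improvement: remember this epoch and reset the patience counter
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                # No improvement for `patience` consecutive epochs: stop here
                return epoch, best_epoch
    # Training ran through all epochs without triggering the rule
    return len(val_losses) - 1, best_epoch


# Made-up validation losses: the best value before stopping occurs at epoch 1
losses = [1.00, 0.90, 0.95, 0.96, 0.97, 0.98, 0.99, 0.50]
stop_epoch, best_epoch = early_stopping_trace(losses, patience=5)
print(stop_epoch, best_epoch)
```

In this trace, training stops at epoch 6 and never reaches the lower loss at epoch 7, which is the trade-off of a short patience. Restoring the best weights corresponds to going back to the model state of `best_epoch`.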

Research used: 

- https://towardsdatascience.com/time-series-forecasting-with-deep-learning-and-attention-mechanism-2d001fc871fc
- https://towardsdatascience.com/how-to-reshape-data-and-do-regression-for-time-series-using-lstm-133dad96cd00
- https://medium.com/swlh/a-simple-overview-of-rnn-lstm-and-attention-mechanism-9e844763d07b
- https://machinelearningmastery.com/how-to-develop-lstm-models-for-time-series-forecasting/
- https://medium.com/intel-tech/how-to-apply-transformers-to-time-series-models-spacetimeformer-e452f2825d2e
- https://www.sciencedirect.com/topics/computer-science/deep-learning-model
- https://www.ibm.com/topics/deep-learning
- https://doi.org/10.1016/j.undsp.2023.05.006
- https://doi.org/10.1016/j.chaos.2020.110121
- https://medium.com/aimonks/multivariate-timeseries-analysis-using-tensorflow-9554e607077a
- https://towardsdatascience.com/multivariate-time-series-forecasting-with-deep-learning-3e7b3e2d2bcf
- https://github.com/Alro10/deep-learning-time-series?tab=readme-ov-file

### a. Recurrent Neural Network (RNN)

Applied are the RNN and its successors (GRU and LSTM), which were developed to overcome the vanishing gradient problem. A helpful source of inspiration for multivariate time series forecasting with LSTM:
- https://machinelearningmastery.com/multivariate-time-series-forecasting-lstms-keras/
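Recurrent layers in Keras expect a 3D input of shape `[samples, timesteps, features]`. The sketch below shows the reshaping with NumPy only, assuming a hypothetical matrix of 10 monthly rows and 42 features (6 environmental variables plus their 36 lagged versions). The random values are placeholders.

```python
import numpy as np

# Hypothetical scaled feature matrix: 10 monthly rows, 42 features
X_2d = np.random.default_rng(0).random((10, 42)).astype("float32")

# Insert a timestep axis of length 1: [samples, timesteps, features]
X_3d = X_2d.reshape((X_2d.shape[0], 1, X_2d.shape[1]))

# Dropping the timestep axis again recovers the original 2D matrix
X_back = X_3d.reshape((X_3d.shape[0], X_3d.shape[2]))

print(X_2d.shape, X_3d.shape, X_back.shape)
```

With a single timestep, each sample is a sequence of length one, so the temporal information reaches the network through the lagged feature columns rather than through the recurrence itself.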

#### SimpleRNN

In [None]:
# Simple RNN (Recurrent Neural Network)
# Location-specific approach, without k-fold cross-validation

# Initialize lists to store MAE and RMSE scores for each location
mae_scores = []
rmse_scores = []

# Loop through each location
for location in X_train_val_scaled_minmax['Location Code'].unique():
    location_mask = X_train_val_scaled_minmax['Location Code'] == location
    X_location = X_train_val_scaled_minmax[location_mask].drop(columns=['Location Code']).values.astype('float32')
    y_location = y_train_val[location_mask].values.astype('float32')

    # Regular train-validation split with respect to time-series
    train_size = int(len(X_location) * 0.8)
    X_train, X_val = X_location[0:train_size], X_location[train_size:]
    y_train, y_val = y_location[0:train_size], y_location[train_size:]
    
    # Reshape input to be 3D [samples, timesteps, features] for RNN
    X_train = X_train.reshape((X_train.shape[0], 1, X_train.shape[1]))
    X_val = X_val.reshape((X_val.shape[0], 1, X_val.shape[1]))
        
    # Define and compile the model
    model = Sequential()
    model.add(SimpleRNN(50, input_shape=(X_train.shape[1], X_train.shape[2])))
    model.add(Dense(1)) # Output layer for regression
    model.compile(loss='mae', optimizer='adam')

    # Define early stopping criteria
    # Training stops if the validation loss has not improved for 5 consecutive epochs;
    # the weights of the best epoch are then restored
    early_stopping = EarlyStopping(monitor='val_loss', patience=5, restore_best_weights=True)
    
    # Fit neural network
    history = model.fit(X_train, y_train, epochs=50, batch_size=72, 
                        validation_data=(X_val, y_val), 
                        callbacks=[early_stopping], 
                        verbose=2, 
                        shuffle=False)    
    
    # Prediction and error calculation
    y_pred = model.predict(X_val)
    X_train = X_train.reshape((X_train.shape[0], X_train.shape[2]))
    X_val = X_val.reshape((X_val.shape[0], X_val.shape[2]))
        
    # Perform and save metrics
    rmse = np.sqrt(mean_squared_error(y_val, y_pred))
    mae = mean_absolute_error(y_val, y_pred)
    mae_scores.append(mae)
    rmse_scores.append(rmse)

    # Plotting results and prediction comparisons
    plt.figure(figsize=(20, 10))
    
    # Plot for actual vs predicted
    plt.subplot(1, 2, 1)  # 2 plots, this is the first
    plt.plot(y_val, label='Actual', color='blue')
    plt.plot(y_pred, label='Predicted', color='red')
    plt.title(f'Actual vs Predicted for Location: {location}')
    plt.xlabel('Time Step')  # y_val is a NumPy array, so the x-axis is the position in the validation set
    plt.ylabel('Incidence')
    plt.legend()
    
    # Plot for training and validation loss
    plt.subplot(1, 2, 2)  # This is the second plot
    plt.plot(history.history['loss'], label='Training Loss')
    plt.plot(history.history['val_loss'], label='Validation Loss')
    plt.title('Training and Validation Loss')
    plt.xlabel('Epoch')
    plt.ylabel('Loss')
    plt.legend()
    
    plt.show()

    # Print error metrics for this location
    print(f'MAE: {mae}, RMSE: {rmse}')

# Calculate average MAE and RMSE scores
avg_mae = np.mean(mae_scores)
avg_rmse = np.mean(rmse_scores)

# Print average scores
print(f'Average MAE: {avg_mae}, Average RMSE: {avg_rmse}')

In [None]:
# Cross-validation for SimpleRNN
# Source: https://machinelearningmastery.com/understanding-simple-recurrent-neural-networks-in-keras/

# Set random seeds for constant result
def set_random_seeds(seed_value=42):
    np.random.seed(seed_value)
    tf.random.set_seed(seed_value)
    random.seed(seed_value)
    
set_random_seeds(42)
# Initialize lists to store MAE and RMSE scores for each location
mae_scores = []
rmse_scores = []

# Loop through each location
for location in X_train_val_scaled_minmax['Location Code'].unique():
    location_mask = X_train_val_scaled_minmax['Location Code'] == location
    X_location = X_train_val_scaled_minmax[location_mask].drop(columns=['Location Code']).values.astype('float32')
    y_location = y_train_val[location_mask].values.astype('float32')

    all_actuals = []
    all_predictions = []
    
    # Time-series cross-validation with 5 expanding-window splits
    tscv = TimeSeriesSplit(n_splits=5)
    
    location_mae_scores = []
    location_rmse_scores = []
    
    for train_index, test_index in tscv.split(X_location):
        X_train, X_val = X_location[train_index], X_location[test_index]
        y_train, y_val = y_location[train_index], y_location[test_index]
        
        # Reshape data for RNN (samples, timesteps, features)
        X_train = X_train.reshape((X_train.shape[0], 1, X_train.shape[1]))
        X_val = X_val.reshape((X_val.shape[0], 1, X_val.shape[1]))
        
        # Define and compile the SimpleRNN model
        model = Sequential()
        model.add(SimpleRNN(50, activation='relu', input_shape=(X_train.shape[1], X_train.shape[2]))) #SimpleRNN layer 
        model.add(Dense(1)) #Output layer
        model.compile(loss='mae', optimizer='adam')

        # Define early stopping criteria
        early_stopping = EarlyStopping(monitor='val_loss', patience=5, restore_best_weights=True)

        # Fit neural network
        history = model.fit(X_train, y_train, epochs=50, batch_size=72, 
                            validation_data=(X_val, y_val), 
                            callbacks=[early_stopping], 
                            verbose=2, 
                            shuffle=False)
        
        # Prediction and error calculation
        y_pred = model.predict(X_val).flatten()
        
        # Collect predictions and actual values
        all_actuals.extend(y_val)
        all_predictions.extend(y_pred)
    
        # Calculate metrics and save them
        rmse = np.sqrt(mean_squared_error(y_val, y_pred))
        mae = mean_absolute_error(y_val, y_pred)
        location_mae_scores.append(mae)
        location_rmse_scores.append(rmse)
    
    # Append location-specific scores to the overall scores
    mae_scores.extend(location_mae_scores)
    rmse_scores.extend(location_rmse_scores)
    
    # Plotting predictions vs actual values for the entire location
    plt.figure(figsize=(10, 6))
    plt.plot(all_actuals, label='Actual', color='blue')
    plt.plot(all_predictions, label='Predicted', color='red')
    plt.title(f'Actual vs Predicted for Location: {location}')
    plt.xlabel('Time Step')
    plt.ylabel('Incidence Rate')
    plt.legend()
    plt.show()

    print(f'Average MAE for location {location}: {np.mean(location_mae_scores)}, Average RMSE for location {location}: {np.mean(location_rmse_scores)}')

avg_mae = np.mean(mae_scores)
avg_rmse = np.mean(rmse_scores)
print(f'Average MAE of SimpleRNN: {avg_mae}, Average RMSE of SimpleRNN: {avg_rmse}')

#### GRU

In [None]:
# GRU (Gated Recurrent Unit)
# Location-specific approach, without k-fold cross-validation
# Simpler architecture than LSTM: two gates (update and reset) and no separate cell state
# Mitigates the vanishing gradient problem

# Initialize lists to store MAE and RMSE scores for each location
mae_scores = []
rmse_scores = []

# Loop through each location
for location in X_train_val_scaled_minmax['Location Code'].unique():
    location_mask = X_train_val_scaled_minmax['Location Code'] == location
    X_location = X_train_val_scaled_minmax[location_mask].drop(columns=['Location Code']).values.astype('float32')
    y_location = y_train_val[location_mask].values.astype('float32')

    # Regular train-validation split with respect to time-series
    train_size = int(len(X_location) * 0.8)
    
    X_train, X_val = X_location[0:train_size], X_location[train_size:]
    y_train, y_val = y_location[0:train_size], y_location[train_size:]
    
    # Reshape input to be 3D [samples, timesteps, features] for GRU
    X_train = X_train.reshape((X_train.shape[0], 1, X_train.shape[1]))
    X_val = X_val.reshape((X_val.shape[0], 1, X_val.shape[1]))
        
    # Define and compile the model with GRU
    model = Sequential()
    model.add(GRU(50, input_shape=(X_train.shape[1], X_train.shape[2]))) # 1 GRU layer with 50 units 
    model.add(Dense(1)) # Dense layer
    model.compile(loss='mae', optimizer='adam')

    # Define early stopping criteria
    early_stopping = EarlyStopping(monitor='val_loss', patience=5, restore_best_weights=True)
    
    # Fit neural network
    history = model.fit(X_train, y_train, epochs=50, batch_size=72, 
                        validation_data=(X_val, y_val), 
                        callbacks=[early_stopping], 
                        verbose=2,
                        shuffle=False)    
    
    # Prediction and error calculation
    y_pred = model.predict(X_val)
    
    # Perform and save metrics
    rmse = np.sqrt(mean_squared_error(y_val, y_pred))
    mae = mean_absolute_error(y_val, y_pred)
    mae_scores.append(mae)
    rmse_scores.append(rmse)

    # Creating the plot
    plt.figure(figsize=(10, 6))
    plt.plot(y_val, label='Actual', color='blue')
    plt.plot(y_pred, label='Predicted', color='red')
    plt.title(f'Actual vs Predicted for Location: {location}')
    plt.xlabel('Date')
    plt.ylabel('Incidence')
    plt.legend()
    plt.show()
    
    print(f'MAE: {mae}, RMSE: {rmse}')

# Calculate average MAE and RMSE scores
avg_mae = np.mean(mae_scores)
avg_rmse = np.mean(rmse_scores)

# Print average scores
print(f'Average MAE: {avg_mae}, Average RMSE: {avg_rmse}')

In [None]:
# GRU with cross-validation

# Set random seeds for reproducibility
def set_random_seeds(seed_value=42):
    np.random.seed(seed_value)
    tf.random.set_seed(seed_value)
    random.seed(seed_value)
    
set_random_seeds(42)

# Initialize lists to store MAE and RMSE scores for each location
mae_scores = []
rmse_scores = []

# Loop through each location
for location in X_train_val_scaled_minmax['Location Code'].unique():
    location_mask = X_train_val_scaled_minmax['Location Code'] == location
    X_location = X_train_val_scaled_minmax[location_mask].drop(columns=['Location Code']).values.astype('float32')
    y_location = y_train_val[location_mask].values.astype('float32')

    all_actuals = []
    all_predictions = []
    
    # Time-series cross-validation with 5 expanding-window splits
    tscv = TimeSeriesSplit(n_splits=5)
    
    location_mae_scores = []
    location_rmse_scores = []
    
    for train_index, test_index in tscv.split(X_location):
        X_train, X_val = X_location[train_index], X_location[test_index]
        y_train, y_val = y_location[train_index], y_location[test_index]
        
        # Reshape data for GRU (samples, timesteps, features)
        X_train = X_train.reshape((X_train.shape[0], 1, X_train.shape[1]))
        X_val = X_val.reshape((X_val.shape[0], 1, X_val.shape[1]))
        
        # Define and compile the GRU model
        model = Sequential()
        model.add(GRU(50, activation='relu', input_shape=(X_train.shape[1], X_train.shape[2]))) # 1 GRU layer
        model.add(Dense(1)) # Output layer
        model.compile(loss='mae', optimizer='adam')

        # Define early stopping criteria
        early_stopping = EarlyStopping(monitor='val_loss', patience=5, restore_best_weights=True)

        # Fit neural network
        history = model.fit(X_train, y_train, epochs=50, batch_size=72, 
                            validation_data=(X_val, y_val), 
                            callbacks=[early_stopping], 
                            verbose=2, 
                            shuffle=False)
        
        # Prediction and error calculation
        y_pred = model.predict(X_val).flatten()
        
        # Collect predictions and actual values
        all_actuals.extend(y_val)
        all_predictions.extend(y_pred)
    
        # Calculate metrics and save them
        rmse = np.sqrt(mean_squared_error(y_val, y_pred))
        mae = mean_absolute_error(y_val, y_pred)
        location_mae_scores.append(mae)
        location_rmse_scores.append(rmse)
    
    # Append location-specific scores to the overall scores
    mae_scores.extend(location_mae_scores)
    rmse_scores.extend(location_rmse_scores)
    
    # Plotting predictions vs actual values for the entire location
    plt.figure(figsize=(10, 6))
    plt.plot(all_actuals, label='Actual', color='blue')
    plt.plot(all_predictions, label='Predicted', color='red')
    plt.title(f'Actual vs Predicted for Location: {location}')
    plt.xlabel('Time Step')
    plt.ylabel('Incidence Rate')
    plt.legend()
    plt.show()

    print(f'Average MAE for location {location}: {np.mean(location_mae_scores)}, Average RMSE for location {location}: {np.mean(location_rmse_scores)}')

avg_mae = np.mean(mae_scores)
avg_rmse = np.mean(rmse_scores)
print(f'Average MAE of GRU: {avg_mae}, Average RMSE of GRU: {avg_rmse}')

#### LSTM

In [None]:
# LSTM (Long short-term memory)
# Location-specific approach, without k-fold cross-validation
# Iterative for each location
# Without time-series split cross-validation

# Initialize lists to store MAE and RMSE scores for each location
mae_scores = []
rmse_scores = []

# Loop through each location
for location in X_train_val_scaled_minmax['Location Code'].unique():
    location_mask = X_train_val_scaled_minmax['Location Code'] == location
    X_location = X_train_val_scaled_minmax[location_mask].drop(columns=['Location Code']).values.astype('float32')
    y_location = y_train_val[location_mask].values.astype('float32')

    # Regular train-validation split with respect to time-series
    train_size = int(len(X_location) * 0.8)
    
    X_train, X_val = X_location[0:train_size], X_location[train_size:]
    y_train, y_val = y_location[0:train_size], y_location[train_size:]
    
    # Reshape input to be 3D [samples, timesteps, features] for LSTM
    X_train = X_train.reshape((X_train.shape[0], 1, X_train.shape[1]))
    X_val = X_val.reshape((X_val.shape[0], 1, X_val.shape[1]))
    #print(X_train.shape, y_train.shape, X_val.shape, y_val.shape)
    
    # Define and compile the model
    model = Sequential()
    model.add(LSTM(50, input_shape=(X_train.shape[1], X_train.shape[2]))) # Number of LSTM units (neurons) in the layer
    model.add(Dense(1)) # Dense layer, that is fully connected layer that outputs one unit
    model.compile(loss='mae', optimizer='adam') #MAE as loss and adam optimizer
    #Model summary
    #model.summary()

    # Define early stopping criteria
    early_stopping = EarlyStopping(monitor='val_loss', patience=5, restore_best_weights=True)
    # Fit neural network
    history = model.fit(X_train, y_train, epochs=50, batch_size=72, 
                        validation_data=(X_val, y_val), 
                        callbacks=[early_stopping], 
                        verbose=2,
                        shuffle=False)    
    
    # Plot history
    #pyplot.plot(history.history['loss'], label='Training Loss')
    #pyplot.plot(history.history['val_loss'], label='Validation Loss')
    #pyplot.legend()
    #pyplot.show()

    # Prediction and error calculation
    y_pred = model.predict(X_val)
    X_train = X_train.reshape((X_train.shape[0], X_train.shape[2]))
    X_val = X_val.reshape((X_val.shape[0], X_val.shape[2]))
        
    # Perform and save metrics
    rmse = np.sqrt(mean_squared_error(y_val, y_pred))
    mae = mean_absolute_error(y_val, y_pred)
    mae_scores.append(mae)
    rmse_scores.append(rmse)

    # Creating the plot
    plt.figure(figsize=(10, 6))
    plt.plot(y_val, label='Actual', color='blue')
    plt.plot(y_pred, label='Predicted', color='red')
    plt.title(f'Actual vs Predicted for Location: {location}')
    plt.xlabel('Date')
    plt.ylabel('Incidence')
    plt.legend()
    plt.show()
    
    print(f'MAE: {mae}, RMSE: {rmse}')

# Calculate average MAE and RMSE scores
avg_mae = np.mean(mae_scores)
avg_rmse = np.mean(rmse_scores)

# Print average scores
print(f'Average MAE: {avg_mae}, Average RMSE: {avg_rmse}')

In [None]:
# LSTM (Long short-term memory)
# Location-specific approach, with k-fold cross-validation
# Iterative for each location with cross-validation
# Scaled dataset

# Set random seeds for reproducibility
def set_random_seeds(seed_value=42):
    np.random.seed(seed_value)
    tf.random.set_seed(seed_value)
    random.seed(seed_value)
    
set_random_seeds(42)

# Initialize lists to store MAE and RMSE scores for each location
mae_scores = []
rmse_scores = []

# Loop through each location
for location in X_train_val_scaled_minmax['Location Code'].unique():
    location_mask = X_train_val_scaled_minmax['Location Code'] == location
    X_location = X_train_val_scaled_minmax[location_mask].drop(columns=['Location Code']).values.astype('float32')
    y_location = y_train_val[location_mask].values.astype('float32')

    # Time-series cross-validation with 5 expanding-window splits
    tscv = TimeSeriesSplit(n_splits=5)
    
    for train_index, test_index in tscv.split(X_location):
        X_train, X_val = X_location[train_index], X_location[test_index]
        y_train, y_val = y_location[train_index], y_location[test_index]

        X_train = X_train.reshape((X_train.shape[0], 1, X_train.shape[1]))
        X_val = X_val.reshape((X_val.shape[0], 1, X_val.shape[1]))
        #print(X_train.shape, y_train.shape, X_val.shape, y_val.shape)
        
        # Define and compile the model
        model = Sequential()
        model.add(LSTM(50, input_shape=(X_train.shape[1], X_train.shape[2]))) #LSTM layer
        model.add(Dense(1)) # Output layer
        model.compile(loss='mae', optimizer='adam')
        #Model summary
        #model.summary()

        # Define early stopping criteria
        early_stopping = EarlyStopping(monitor='val_loss', patience=5, restore_best_weights=True)
        # Fit neural network
        history = model.fit(X_train, y_train, epochs=50, batch_size=72, 
                            validation_data=(X_val, y_val), 
                            callbacks=[early_stopping], 
                            shuffle=False)    

        # Plot history
        #pyplot.plot(history.history['loss'], label='Training Loss')
        #pyplot.plot(history.history['val_loss'], label='Validation Loss')
        #pyplot.legend()
        #pyplot.show()
        
        # Prediction and error calculation
        y_pred = model.predict(X_val)
        X_train = X_train.reshape((X_train.shape[0], X_train.shape[2]))
        X_val = X_val.reshape((X_val.shape[0], X_val.shape[2])) #check

        # Perform and save metrics
        rmse = np.sqrt(mean_squared_error(y_val, y_pred))
        mae = mean_absolute_error(y_val, y_pred)
        mae_scores.append(mae)
        rmse_scores.append(rmse)

        # Creating the plot
        plt.figure(figsize=(10, 6))
        plt.plot(y_val, label='Actual', color='blue')
        plt.plot(y_pred, label='Predicted', color='red')
        plt.title(f'Actual vs Predicted for Location: {location}')
        plt.xlabel('Time Step')
        plt.ylabel('Incidence')
        plt.legend()
        plt.show()
    
        print(f'MAE: {mae}, RMSE: {rmse}')

# Calculate average MAE and RMSE scores
avg_mae = np.mean(mae_scores)
avg_rmse = np.mean(rmse_scores)

# Print average scores
print(f'Average MAE: {avg_mae}, Average RMSE: {avg_rmse}')
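
The cross-validated cells in this section rely on `TimeSeriesSplit(n_splits=5)`. As a rough illustration of what such a scheme does, the sketch below builds an expanding-window split in plain Python. The helper `expanding_window_splits` is hypothetical and only assumed to mirror the default behaviour of scikit-learn's splitter (no gap, no maximum training size); it is not the library's implementation.

```python
def expanding_window_splits(n_samples, n_splits=5):
    """Yield (train_indices, val_indices) pairs in chronological order.

    Hypothetical helper: each fold validates on the next block of samples
    and trains on everything that precedes that block.
    """
    fold_size = n_samples // (n_splits + 1)
    if fold_size == 0:
        raise ValueError("Too few samples for the requested number of splits")
    first_val_start = n_samples - n_splits * fold_size
    for val_start in range(first_val_start, n_samples, fold_size):
        train_idx = list(range(0, val_start))
        val_idx = list(range(val_start, val_start + fold_size))
        yield train_idx, val_idx


# Toy example with 12 time steps (an arbitrary illustrative length)
splits = list(expanding_window_splits(n_samples=12, n_splits=5))
for fold, (train_idx, val_idx) in enumerate(splits, start=1):
    print(f"Fold {fold}: train 0..{train_idx[-1]}, validate on {val_idx}")
```

The training window grows with every fold while the validation block always lies strictly after it, so no future observation is used to fit a model that is scored on the past. Early folds are fitted on very few samples, which can make their MAE and RMSE noisier than those of later folds.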

### b. Feedforward Neural Network (FNN)

#### MLP

In [None]:
# Multilayer Perceptron (MLP)
# Location-specific approach, without k-fold cross-validation
# FNN with two hidden layers; information flows in one direction only (input to output)

# Initialize lists to store MAE and RMSE scores for each location
mae_scores = []
rmse_scores = []

# Loop through each location
for location in X_train_val_scaled_minmax['Location Code'].unique():
    location_mask = X_train_val_scaled_minmax['Location Code'] == location
    X_location = X_train_val_scaled_minmax[location_mask].drop(columns=['Location Code']).values.astype('float32')
    y_location = y_train_val[location_mask].values.astype('float32')

    # Regular train-validation split with respect to time-series
    train_size = int(len(X_location) * 0.8)
    X_train, X_val = X_location[0:train_size], X_location[train_size:]
    y_train, y_val = y_location[0:train_size], y_location[train_size:]
    
    # Define and compile the MLP model
    model = Sequential()
    model.add(Dense(64, activation='relu', input_dim=X_train.shape[1])) # Input layer explicitly defined here, first hidden layer
    model.add(Dense(32, activation='relu'))  # Second hidden layer
    model.add(Dense(1))  # Output layer
    model.compile(loss='mae', optimizer='adam')

    # Define early stopping criteria
    early_stopping = EarlyStopping(monitor='val_loss', patience=5, restore_best_weights=True)

    # Fit neural network
    history = model.fit(X_train, y_train, epochs=50, batch_size=72, 
                        validation_data=(X_val, y_val), 
                        callbacks=[early_stopping], 
                        verbose=2)

    # Prediction and error calculation
    y_pred = model.predict(X_val)
        
    # Perform and save metrics
    rmse = np.sqrt(mean_squared_error(y_val, y_pred))
    mae = mean_absolute_error(y_val, y_pred)
    mae_scores.append(mae)
    rmse_scores.append(rmse)

    # Creating the plot
    plt.figure(figsize=(10, 6))
    plt.plot(y_val, label='Actual', color='blue')
    plt.plot(y_pred, label='Predicted', color='red')
    plt.title(f'Actual vs Predicted for Location: {location}')
    plt.xlabel('Time Step')
    plt.ylabel('Incidence')
    plt.legend()
    plt.show()
    
    print(f'MAE: {mae}, RMSE: {rmse}')

# Calculate average MAE and RMSE scores
avg_mae = np.mean(mae_scores)
avg_rmse = np.mean(rmse_scores)

# Print average scores
print(f'Average MAE: {avg_mae}, Average RMSE: {avg_rmse}')

In [None]:
# Multilayer Perceptron (MLP)
# Location-specific approach, with k-fold cross-validation
# Plotting for all splits on each location
# FNN with two hidden layers; information flows in one direction only (input to output)

# Set random seeds for reproducibility
def set_random_seeds(seed_value=42):
    np.random.seed(seed_value)
    tf.random.set_seed(seed_value)
    random.seed(seed_value)
    
# Call the function to set seeds
set_random_seeds()

# Initialize lists to store MAE and RMSE scores for each location
mae_scores = []
rmse_scores = []

# Loop through each location
for location in X_train_val_scaled_minmax['Location Code'].unique():
    location_mask = X_train_val_scaled_minmax['Location Code'] == location
    X_location = X_train_val_scaled_minmax[location_mask].drop(columns=['Location Code']).values.astype('float32')
    y_location = y_train_val[location_mask].values.astype('float32')

    # Initialize lists to store predictions and actual values for plotting
    all_actuals = []
    all_predictions = []

    # Time-series cross-validation: 5 expanding-window splits (each fold trains on all earlier data)
    tscv = TimeSeriesSplit(n_splits=5)
    
    for train_index, test_index in tscv.split(X_location):
        X_train, X_val = X_location[train_index], X_location[test_index]
        y_train, y_val = y_location[train_index], y_location[test_index]
        
        # Define and compile the MLP model
        model = Sequential()
        model.add(Dense(64, activation='relu', input_dim=X_train.shape[1]))  # First hidden layer
        model.add(Dense(32, activation='relu'))  # Additional hidden layer
        model.add(Dense(1))  # Output layer
        model.compile(loss='mae', optimizer='adam')

        # Define early stopping criteria
        early_stopping = EarlyStopping(monitor='val_loss', patience=5, restore_best_weights=True)

        # Fit neural network
        history = model.fit(X_train, y_train, epochs=50, batch_size=72, 
                            validation_data=(X_val, y_val), 
                            callbacks=[early_stopping], 
                            verbose=2)
        
        # Prediction and error calculation
        y_pred = model.predict(X_val).flatten()

        # Accumulate predictions and actuals
        all_predictions.extend(y_pred)
        all_actuals.extend(y_val)

        # Perform and save metrics
        rmse = np.sqrt(mean_squared_error(y_val, y_pred))
        mae = mean_absolute_error(y_val, y_pred)
        mae_scores.append(mae)
        rmse_scores.append(rmse)
    
    # Plot all accumulated predictions and actual values after cross-validation
    plt.figure(figsize=(10, 6))
    plt.plot(all_actuals, label='Actual', color='blue')
    plt.plot(all_predictions, label='Predicted', color='red')
    plt.title(f'Actual vs Predicted for Location: {location}')
    plt.xlabel('Time Step')
    plt.ylabel('Incidence Rate')
    plt.legend()
    plt.show()
    
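    # Average only over this location's folds (mae_scores and rmse_scores accumulate across all locations)
    n_folds = tscv.get_n_splits()
    print(f'Average MAE for location {location}: {np.mean(mae_scores[-n_folds:])}, Average RMSE for location {location}: {np.mean(rmse_scores[-n_folds:])}')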

# Calculate average MAE and RMSE scores
avg_mae = np.mean(mae_scores)
avg_rmse = np.mean(rmse_scores)

# Print average scores
print(f'Average MAE of MLP: {avg_mae}, Average RMSE of MLP: {avg_rmse}')
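
All models in this notebook are compared on MAE and RMSE. The worked example below, written in plain Python with made-up numbers, shows how the two metrics react differently to a single large miss; the helper names `mae` and `rmse` are illustrative and stand in for the scikit-learn functions used in the cells.

```python
import math


def mae(actual, predicted):
    """Mean absolute error: average size of the errors."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def rmse(actual, predicted):
    """Root mean squared error: squares errors first, so large misses weigh more."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


# Hypothetical values: three perfect predictions and one miss of 4 units
actual = [1.0, 2.0, 3.0, 4.0]
predicted = [1.0, 2.0, 3.0, 8.0]

example_mae = mae(actual, predicted)    # (0 + 0 + 0 + 4) / 4 = 1.0
example_rmse = rmse(actual, predicted)  # sqrt((0 + 0 + 0 + 16) / 4) = 2.0
print(f"MAE: {example_mae}, RMSE: {example_rmse}")
```

A gap between RMSE and MAE for a location therefore hints at a few large errors, for example around outbreak peaks, rather than a uniform offset.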

### c. Convolutional Neural Network (CNN)

Sources: 
- https://machinelearningmastery.com/how-to-develop-convolutional-neural-networks-for-multi-step-time-series-forecasting/
- https://thejaskiran99.medium.com/unlocking-the-potential-of-convolutional-neural-networks-cnns-in-time-series-forecasting-b2fac329e184

#### CNN

In [None]:
# Convolutional Neural Networks (CNNs)
# Location-specific approach, without k-fold cross-validation
# 1D CNN, suitable for time series
# The kernel slides along a single axis only
# Source: https://www.tensorflow.org/tutorials/images/cnn

mae_scores = []
rmse_scores = []

# Loop through each location
for location in X_train_val_scaled_minmax['Location Code'].unique():
    location_mask = X_train_val_scaled_minmax['Location Code'] == location
    X_location = X_train_val_scaled_minmax[location_mask].drop(columns=['Location Code']).values.astype('float32')
    y_location = y_train_val[location_mask].values.astype('float32')

    # Regular train-validation split with respect to time-series
    train_size = int(len(X_location) * 0.8)
    X_train, X_val = X_location[0:train_size], X_location[train_size:]
    y_train, y_val = y_location[0:train_size], y_location[train_size:]
    
    # Reshape input for CNN
    X_train = X_train.reshape((X_train.shape[0], X_train.shape[1], 1))
    X_val = X_val.reshape((X_val.shape[0], X_val.shape[1], 1))
    #print(X_train.shape, X_val.shape)
    
    # Define the CNN model
    model = Sequential([
        Conv1D(filters=64, kernel_size=3, activation='relu', input_shape=(X_train.shape[1], 1)), 
        # 1D convolution: more filters give more learning capacity; each kernel spans 3 adjacent positions
        # of the reshaped input (the feature axis, given the reshape to (samples, features, 1))
        MaxPooling1D(pool_size=2), #Pooling layer
        Flatten(),
        Dense(50, activation='relu'),
        Dense(1)
    ])
    model.compile(loss='mae', optimizer='adam')
    #Model summary
    #model.summary()

    # Early stopping criteria
    early_stopping = EarlyStopping(monitor='val_loss', patience=5, restore_best_weights=True)
    
    # Fit CNN
    history = model.fit(X_train, y_train, epochs=50, batch_size=72, 
                        validation_data=(X_val, y_val), 
                        callbacks=[early_stopping], 
                        verbose=2,
                        shuffle=False)
    
    # Plot history
    #pyplot.plot(history.history['loss'], label='Training Loss')
    #pyplot.plot(history.history['val_loss'], label='Validation Loss')
    #pyplot.legend()
    #pyplot.show()

    # Prediction and error calculation
    y_pred = model.predict(X_val)
    #X_train = X_train.reshape((X_train.shape[0], X_train.shape[2]))
    #X_val = X_val.reshape((X_val.shape[0], X_val.shape[2]))
    
    # Calculate metrics and save them
    rmse = np.sqrt(mean_squared_error(y_val, y_pred))
    mae = mean_absolute_error(y_val, y_pred)
    mae_scores.append(mae)
    rmse_scores.append(rmse)
    
    # Plotting predictions vs actual values
    plt.figure(figsize=(10, 6))
    plt.plot(y_val, label='Actual', color='blue')
    plt.plot(y_pred, label='Predicted', color='red')
    plt.title(f'Actual vs Predicted for Location: {location}')
    plt.xlabel('Time Step')
    plt.ylabel('Incidence')
    plt.legend()
    plt.show()

    print(f'MAE: {mae}, RMSE: {rmse}')

avg_mae = np.mean(mae_scores)
avg_rmse = np.mean(rmse_scores)
print(f'Average MAE: {avg_mae}, Average RMSE: {avg_rmse}')

In [None]:
# CNN with cross-validation

# Set random seeds for reproducibility
def set_random_seeds(seed_value=42):
    np.random.seed(seed_value)
    tf.random.set_seed(seed_value)
    random.seed(seed_value)

# Call the function to set seeds
set_random_seeds()

# Initialize lists to store MAE and RMSE scores for each location
mae_scores = []
rmse_scores = []

# Loop through each location
for location in X_train_val_scaled_minmax['Location Code'].unique():
    location_mask = X_train_val_scaled_minmax['Location Code'] == location
    X_location = X_train_val_scaled_minmax[location_mask].drop(columns=['Location Code']).values.astype('float32')
    y_location = y_train_val[location_mask].values.astype('float32')

    # Time-series cross-validation: 5 expanding-window splits (each fold trains on all earlier data)
    tscv = TimeSeriesSplit(n_splits=5)

    all_actuals = []
    all_predictions = []
    
    location_mae_scores = []
    location_rmse_scores = []
    
    for train_index, test_index in tscv.split(X_location):
        X_train, X_val = X_location[train_index], X_location[test_index]
        y_train, y_val = y_location[train_index], y_location[test_index]
    
        # Reshape input for CNN
        X_train = X_train.reshape((X_train.shape[0], X_train.shape[1], 1))
        X_val = X_val.reshape((X_val.shape[0], X_val.shape[1], 1))
    
        # Define the CNN model
        model = Sequential([ 
            Conv1D(filters=64, kernel_size=3, activation='relu', input_shape=(X_train.shape[1], 1)),
            MaxPooling1D(pool_size=2),
            Flatten(),
            Dense(50, activation='relu'),
            Dense(1)
        ])
        model.compile(loss='mae', optimizer='adam')

        # Early stopping criteria
        early_stopping = EarlyStopping(monitor='val_loss', patience=5, restore_best_weights=True)
        
        # Fit CNN
        history = model.fit(X_train, y_train, epochs=50, batch_size=72, 
                            validation_data=(X_val, y_val), 
                            callbacks=[early_stopping], 
                            verbose=2,
                            shuffle=False)

        # Prediction and error calculation
        y_pred = model.predict(X_val).flatten()
        
        # Collect predictions and actual values
        all_actuals.extend(y_val)
        all_predictions.extend(y_pred)
    
        # Calculate metrics and save them
        rmse = np.sqrt(mean_squared_error(y_val, y_pred))
        mae = mean_absolute_error(y_val, y_pred)
        location_mae_scores.append(mae)
        location_rmse_scores.append(rmse)
    
    # Append location-specific scores to the overall scores
    mae_scores.extend(location_mae_scores)
    rmse_scores.extend(location_rmse_scores)
    
    # Plotting predictions vs actual values for the entire location
    plt.figure(figsize=(10, 6))
    plt.plot(all_actuals, label='Actual', color='blue')
    plt.plot(all_predictions, label='Predicted', color='red')
    plt.title(f'Actual vs Predicted for Location: {location}')
    plt.xlabel('Time Step')
    plt.ylabel('Incidence Rate')
    plt.legend()
    plt.show()

    print(f'Average MAE for location {location}: {np.mean(location_mae_scores)}, Average RMSE for location {location}: {np.mean(location_rmse_scores)}')

avg_mae = np.mean(mae_scores)
avg_rmse = np.mean(rmse_scores)
print(f'Average MAE of CNN: {avg_mae}, Average RMSE of CNN: {avg_rmse}')
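
To make the CNN architecture above easier to reason about, the sketch below traces the tensor sizes through `Conv1D`, `MaxPooling1D`, `Flatten` and the two `Dense` layers using plain arithmetic. It assumes the Keras defaults used in the cells (`'valid'` padding, stride 1 for the convolution, pooling stride equal to `pool_size`). The helper `cnn_shapes` and the feature count of 10 are hypothetical; the real number of features depends on the prepared dataset.

```python
def cnn_shapes(n_features, filters=64, kernel_size=3, pool_size=2, dense_units=50):
    """Return layer output lengths and trainable parameter counts.

    Hypothetical helper mirroring the Sequential model used above,
    with an input of shape (n_features, 1).
    """
    conv_len = n_features - kernel_size + 1   # 'valid' padding, stride 1
    pool_len = conv_len // pool_size          # leftover position is dropped
    if pool_len < 1:
        raise ValueError("Input too short for this kernel and pool size")
    flat_len = pool_len * filters             # Flatten: positions x filters
    params = {
        "conv1d": kernel_size * 1 * filters + filters,        # weights + biases
        "dense_hidden": flat_len * dense_units + dense_units,
        "dense_output": dense_units * 1 + 1,
    }
    return conv_len, pool_len, flat_len, params


conv_len, pool_len, flat_len, params = cnn_shapes(n_features=10)
print(f"Conv1D output length: {conv_len}")
print(f"MaxPooling1D output length: {pool_len}")
print(f"Flatten output size: {flat_len}")
print(f"Total trainable parameters: {sum(params.values())}")
```

Comparing these numbers with the output of `model.summary()` is a quick way to confirm that the reshape to `(samples, features, 1)` produced the intended input shape.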

## 3.1 Final Training and Prediction on Test Set
The models are retrained on the combined training and validation data and evaluated on the test set to assess how well they generalize to unseen data before forecasting the annual dengue incidence rate under climate change.

In [None]:
# MLP on the test set

# Set random seeds for reproducibility
def set_random_seeds(seed_value=42):
    np.random.seed(seed_value)
    tf.random.set_seed(seed_value)
    random.seed(seed_value)

# Call the function to set seeds
set_random_seeds()

# Function to train and evaluate the MLP model on the test set
def train_and_evaluate_mlp_on_test_set(X_train_val, y_train_val, X_test, y_test):
    # Initialize lists to store MAE and RMSE scores for each location
    mae_scores = []
    rmse_scores = []

    # Loop through each location
    for location in X_train_val['Location Code'].unique():
        
        location_mask_train = X_train_val['Location Code'] == location
        X_location_train = X_train_val[location_mask_train].drop(columns=['Location Code']).values.astype('float32')
        y_location_train = y_train_val[location_mask_train].values.astype('float32')

        location_mask_test = X_test['Location Code'] == location
        X_location_test = X_test[location_mask_test].drop(columns=['Location Code']).values.astype('float32')
        y_location_test = y_test[location_mask_test].values.astype('float32')

        # Define and compile the MLP model
        model = Sequential()
        model.add(Dense(64, activation='relu', input_shape=(X_location_train.shape[1],)))
        model.add(Dense(32, activation='relu'))
        model.add(Dense(1))
        model.compile(loss='mae', optimizer='adam')

        # Define early stopping criteria
        early_stopping = EarlyStopping(monitor='val_loss', patience=5, restore_best_weights=True)

        # Fit neural network
        history = model.fit(X_location_train, y_location_train, epochs=50, batch_size=72, 
                            validation_data=(X_location_test, y_location_test), 
                            callbacks=[early_stopping], 
                            verbose=2, 
                            shuffle=False)

        # Prediction and error calculation
        y_pred = model.predict(X_location_test).flatten()
    
        # Calculate metrics and save them
        rmse = np.sqrt(mean_squared_error(y_location_test, y_pred))
        mae = mean_absolute_error(y_location_test, y_pred)
        mae_scores.append(mae)
        rmse_scores.append(rmse)
    
        # Plotting predictions vs actual values for the entire location
        plt.figure(figsize=(10, 6))
        plt.plot(y_location_test, label='Actual', color='blue')
        plt.plot(y_pred, label='Predicted', color='red')
        plt.title(f'Actual vs Predicted for Location: {location}')
        plt.xlabel('Time Step')
        plt.ylabel('Incidence Rate')
        plt.legend()
        plt.show()

        print(f'MAE for location {location}: {mae}, RMSE for location {location}: {rmse}')

    avg_mae = np.mean(mae_scores)
    avg_rmse = np.mean(rmse_scores)
    print(f'MAE on Test Set: {avg_mae}, RMSE on Test Set: {avg_rmse}')

# Call the function
train_and_evaluate_mlp_on_test_set(X_train_val_scaled_minmax, y_train_val, X_test_scaled_minmax, y_test)

In [None]:
# Final assessment on the test set using CNN

# Set random seeds for reproducibility
def set_random_seeds(seed_value=42):
    np.random.seed(seed_value)
    tf.random.set_seed(seed_value)
    random.seed(seed_value)

# Call the function to set seeds
set_random_seeds()

def train_and_evaluate_on_test_set(X_train_val, y_train_val, X_test, y_test):
    # Initialize lists to store MAE and RMSE scores for each location
    mae_scores = []
    rmse_scores = []

    # Loop through each location
    for location in X_train_val['Location Code'].unique():
        
        location_mask_train = X_train_val['Location Code'] == location
        X_location_train = X_train_val[location_mask_train].drop(columns=['Location Code']).values.astype('float32')
        y_location_train = y_train_val[location_mask_train].values.astype('float32')

        location_mask_test = X_test['Location Code'] == location
        X_location_test = X_test[location_mask_test].drop(columns=['Location Code']).values.astype('float32')
        y_location_test = y_test[location_mask_test].values.astype('float32')

        # Reshape input for CNN
        X_location_train = X_location_train.reshape((X_location_train.shape[0], X_location_train.shape[1], 1))
        X_location_test = X_location_test.reshape((X_location_test.shape[0], X_location_test.shape[1], 1))

        # Define the CNN model
        model = Sequential([ 
            Conv1D(filters=64, kernel_size=3, activation='relu', input_shape=(X_location_train.shape[1], 1)),
            MaxPooling1D(pool_size=2),
            Flatten(),
            Dense(50, activation='relu'),
            Dense(1)
        ])
        model.compile(loss='mae', optimizer='adam')

        # Early stopping criteria
        early_stopping = EarlyStopping(monitor='val_loss', patience=5, restore_best_weights=True)
        
        # Fit CNN
        history = model.fit(X_location_train, y_location_train, epochs=50, batch_size=72, 
                            validation_data=(X_location_test, y_location_test), 
                            callbacks=[early_stopping], 
                            verbose=2,
                            shuffle=False)

        # Prediction and error calculation
        y_pred = model.predict(X_location_test).flatten()
    
        # Calculate metrics and save them
        rmse = np.sqrt(mean_squared_error(y_location_test, y_pred))
        mae = mean_absolute_error(y_location_test, y_pred)
        mae_scores.append(mae)
        rmse_scores.append(rmse)
    
        # Plotting predictions vs actual values for the entire location
        plt.figure(figsize=(10, 6))
        plt.plot(y_location_test, label='Actual', color='blue')
        plt.plot(y_pred, label='Predicted', color='red')
        plt.title(f'Actual vs Predicted for Location: {location}')
        plt.xlabel('Time Step')
        plt.ylabel('Incidence Rate')
        plt.legend()
        plt.show()

        print(f'MAE for location {location}: {mae}, RMSE for location {location}: {rmse}')

    avg_mae = np.mean(mae_scores)
    avg_rmse = np.mean(rmse_scores)
    print(f'MAE on Test Set: {avg_mae}, RMSE on Test Set: {avg_rmse}')

# Call the function
train_and_evaluate_on_test_set(X_train_val_scaled_minmax, y_train_val, X_test_scaled_minmax, y_test)
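
In the two cells above, the test set is passed as `validation_data`, so `EarlyStopping` selects the restored weights by looking at test data. One possible alternative, sketched below under the assumption that the rows of each location are already in chronological order, is to hold out the last part of the training period for early stopping and keep the test set for the final evaluation only. The helper `split_off_validation_tail`, the 20% fraction and the toy data are hypothetical choices for illustration.

```python
def split_off_validation_tail(X, y, val_fraction=0.2):
    """Split chronologically ordered data into a fit part and a validation tail.

    Hypothetical helper: works on lists or NumPy arrays because it only slices.
    """
    if len(X) != len(y):
        raise ValueError("X and y must have the same number of rows")
    if not 0 < val_fraction < 1:
        raise ValueError("val_fraction must lie strictly between 0 and 1")
    n_val = max(1, round(len(X) * val_fraction))
    cut = len(X) - n_val
    if cut < 1:
        raise ValueError("Not enough rows left for fitting")
    return X[:cut], X[cut:], y[:cut], y[cut:]


# Toy data standing in for one location's features and targets
X_toy = [[t, t * 2] for t in range(10)]
y_toy = [float(t) for t in range(10)]

X_fit, X_early, y_fit, y_early = split_off_validation_tail(X_toy, y_toy)
print(f"Rows for fitting: {len(X_fit)}, rows for early stopping: {len(X_early)}")
```

With this split, `X_early` and `y_early` would be passed to `validation_data` in `model.fit`, while `X_location_test` would only be used in `model.predict`.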

## 3.2 Climate Change Assessment
Location Code 5 (Kuching, Malaysia) was chosen for deployment because it showed the lowest predictive error across all locations.

In [None]:
# Final assessment on the test set for location 5
# This function needs to be run with the different datasets to assess the change in annual incidence rate

# Set random seeds for reproducibility
def set_random_seeds(seed_value=42):
    np.random.seed(seed_value)
    tf.random.set_seed(seed_value)
    random.seed(seed_value)
    
# Call the function to set seeds
set_random_seeds()

def train_and_evaluate_on_test_set(X_train_val, y_train_val, X_test, y_test, location_code):
    # Filter data for the specified location
    location_mask_train = X_train_val['Location Code'] == location_code
    X_location_train = X_train_val[location_mask_train].drop(columns=['Location Code']).values.astype('float32')
    y_location_train = y_train_val[location_mask_train].values.astype('float32')

    location_mask_test = X_test['Location Code'] == location_code
    X_location_test = X_test[location_mask_test].drop(columns=['Location Code']).values.astype('float32')
    y_location_test = y_test[location_mask_test].values.astype('float32')

    # Reshape input for CNN
    X_location_train = X_location_train.reshape((X_location_train.shape[0], X_location_train.shape[1], 1))
    X_location_test = X_location_test.reshape((X_location_test.shape[0], X_location_test.shape[1], 1))

    # Define the CNN model
    model = Sequential([
        Conv1D(filters=64, kernel_size=3, activation='relu', input_shape=(X_location_train.shape[1], 1)),
        MaxPooling1D(pool_size=2),
        Flatten(),
        Dense(50, activation='relu'),
        Dense(1)
    ])
    model.compile(loss='mae', optimizer='adam')

    # Early stopping criteria
    early_stopping = EarlyStopping(monitor='val_loss', patience=5, restore_best_weights=True)

    # Fit CNN
    history = model.fit(X_location_train, y_location_train, epochs=50, batch_size=72, 
                        validation_data=(X_location_test, y_location_test), 
                        callbacks=[early_stopping], 
                        verbose=2,
                        shuffle=False)

    # Prediction and error calculation
    y_pred = model.predict(X_location_test).flatten()
    
    # Calculate metrics and print
    rmse = np.sqrt(mean_squared_error(y_location_test, y_pred))
    mae = mean_absolute_error(y_location_test, y_pred)
    
    # Plotting predictions vs actual values
    plt.figure(figsize=(10, 6))
    plt.plot(y_location_test, label='Actual', color='blue')
    plt.plot(y_pred, label='Predicted', color='red')
    plt.title(f'Actual vs Predicted for Location: {location_code}')
    plt.xlabel('Time Step')
    plt.ylabel('Incidence Rate')
    plt.legend()
    plt.show()

    print(f'MAE for location {location_code}: {mae}, RMSE for location {location_code}: {rmse}')
    print(f"Predicted outcome for Location {location_code}: {y_pred[0]:.2f}")

# Call the function
train_and_evaluate_on_test_set(X_train_val_scaled_minmax, y_train_val, X_test_scaled_minmax, y_test, location_code=5)
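
The comment in the cell above notes that the function is rerun on different datasets to assess the change in the annual incidence rate. A minimal sketch of that comparison follows; the helper `percent_change` and the two example predictions are hypothetical placeholders, not results from the thesis.

```python
def percent_change(baseline, scenario):
    """Relative change of a scenario prediction with respect to the baseline, in percent."""
    if baseline == 0:
        raise ValueError("Baseline must be non-zero to compute a relative change")
    return (scenario - baseline) / abs(baseline) * 100.0


# Hypothetical predictions for one location (placeholders only)
baseline_prediction = 4.00   # prediction on the unmodified test data
scenario_prediction = 4.10   # prediction on a simulated climate scenario

change = percent_change(baseline_prediction, scenario_prediction)
print(f"Change in annual incidence rate: {change:.2f}%")
```

Keeping the random seed and the training data fixed between runs helps ensure that the difference reflects the changed climate inputs rather than a different model initialization.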

In [None]:
#Scores of the simulated datasets have been saved to visualize the predicted change in the annual incidence rate for both scenarios

# Updated data including both scenarios
data_rcp4_5 = {
    "Change in mean surface temperature (°C)": [-0.37, 0.22, 0.60, 0.77, 0.37, 0.52, 0.82, 0.93],
    "Change in monthly mean total precipitation": [67.72, -10.12, 1.29, 69.40, 143.05, 19.69, 123.08, -9.76],
    "Change in yearly incidence rate for Kuching, MY (%)": [-2.35, -1.88, -0.94, -0.47, -0.47, -0.94, 0.47, -0.47]
}

data_rcp8_5 = {
    "Change in mean surface temperature (°C)": [-0.34, 0.76, -0.20, 0.99, 1.36, 1.87, 2.11, 2.50],
    "Change in monthly mean total precipitation": [76.69, 10.09, 84.82, -3.09, -76.87, -30.61, 25.80, -156.44],
    "Change in yearly incidence rate for Kuching, MY (%)": [-1.88, -0.94, -1.88, 1.88, 1.41, 2.35, 2.82, 2.35]
}

df_rcp4_5 = pd.DataFrame(data_rcp4_5)
df_rcp8_5 = pd.DataFrame(data_rcp8_5)

# Create the first scatter plot for Temperature change vs Incidence rate change
plt.figure(figsize=(10, 6))
plt.scatter(df_rcp4_5["Change in mean surface temperature (°C)"], df_rcp4_5["Change in yearly incidence rate for Kuching, MY (%)"], color='b', label='RCP4.5')
plt.scatter(df_rcp8_5["Change in mean surface temperature (°C)"], df_rcp8_5["Change in yearly incidence rate for Kuching, MY (%)"], color='r', label='RCP8.5')
plt.title('Changes in Mean Temperature and Annual Incidence Rate', fontsize=18)
plt.xlabel('Change in mean temperature (°C)')
plt.ylabel('Change in annual incidence rate for Kuching, MY (%)')
plt.grid(True)
plt.legend()
plt.tight_layout()

# Save the first figure
plt.savefig('temperature_change_incidence_rate_scatter.png')

# Show the first plot
plt.show()

# Create the second scatter plot for Precipitation change vs Incidence rate change
plt.figure(figsize=(10, 6))
plt.scatter(df_rcp4_5["Change in monthly mean total precipitation"], df_rcp4_5["Change in yearly incidence rate for Kuching, MY (%)"], color='b', label='RCP4.5')
plt.scatter(df_rcp8_5["Change in monthly mean total precipitation"], df_rcp8_5["Change in yearly incidence rate for Kuching, MY (%)"], color='r', label='RCP8.5')
plt.title('Changes in Total Precipitation and Annual Incidence Rate', fontsize=18)
plt.xlabel('Change in total precipitation (mm)')
plt.ylabel('Change in annual incidence rate for Kuching, MY (%)')
plt.grid(True)
plt.legend()
plt.tight_layout()

# Save the second figure
plt.savefig('precipitation_change_incidence_rate_scatter.png')

# Show the second plot
plt.show()
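
To put a number on the visual impression from the two scatter plots, one option is a Pearson correlation coefficient per scenario. The sketch below recomputes it in plain Python from the temperature and incidence values listed in the cell above; the helper `pearson_r` is an illustrative addition. With only eight points per scenario, the coefficients are best read as descriptive rather than as evidence of a statistically significant effect.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("Need two sequences of equal length with at least 2 values")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


# Values copied from the cell above
temp_rcp4_5 = [-0.37, 0.22, 0.60, 0.77, 0.37, 0.52, 0.82, 0.93]
incidence_rcp4_5 = [-2.35, -1.88, -0.94, -0.47, -0.47, -0.94, 0.47, -0.47]
temp_rcp8_5 = [-0.34, 0.76, -0.20, 0.99, 1.36, 1.87, 2.11, 2.50]
incidence_rcp8_5 = [-1.88, -0.94, -1.88, 1.88, 1.41, 2.35, 2.82, 2.35]

r_temp_rcp4_5 = pearson_r(temp_rcp4_5, incidence_rcp4_5)
r_temp_rcp8_5 = pearson_r(temp_rcp8_5, incidence_rcp8_5)
print(f"RCP4.5 temperature vs incidence: r = {r_temp_rcp4_5:.2f}")
print(f"RCP8.5 temperature vs incidence: r = {r_temp_rcp8_5:.2f}")
```

The same helper can be applied to the precipitation columns to compare how strongly each climate variable tracks the predicted change in incidence.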

In [None]:
## While changes in both total precipitation and mean temperature affect the disease’s response, their effects may be interrelated,
## with mean temperature appearing to dominate.
## Therefore, simulated changes in total precipitation alone were applied to assess their influence on the annual incidence rate.
## At constant mean temperature, total precipitation shows a positive linear trend with changes in the yearly incidence rate,
## supporting the assumption of an interaction between these independent variables. However, given the variability of the predicted values,
## the explanatory value of the simulated changes in total precipitation alone is doubtful.

## Data for RCP4.5 and RCP8.5
#data = {
#    "Scenario": ["4.5"] * 8 + ["8.5"] * 8,
#    "Year": [2030, 2030, 2050, 2050, 2070, 2070, 2100, 2100, 2030, 2030, 2050, 2050, 2070, 2070, 2100, 2100],
#    "Month": ["January", "July", "January", "July", "January", "July", "January", "July", "January", "July", "January", "July", "January", "July", "January", "July"],
#    "Change in total precipitation": [67.72, -10.12, 1.29, 69.40, 143.05, 19.69, 123.08, -9.76, 76.69, 10.09, 84.82, -3.09, -76.87, -30.61, 25.80, -156.44],
#    "Percentage Change": [5.94, 5.48, 5.48, 5.94, 6.85, 5.48, 6.39, 5.48, 5.94, 5.48, 5.94, 5.48, 4.57, 5.02, 5.48, 4.11]
#}
#
#df = pd.DataFrame(data)
#
## Create the scatter plot
#plt.figure(figsize=(10, 6))
#colors = {'4.5': 'blue', '8.5': 'red'}
#for scenario in ['4.5', '8.5']:
#    subset = df[df["Scenario"] == scenario]
#    plt.scatter(subset["Change in total precipitation"], subset["Percentage Change"], c=colors[scenario], label=f'RCP{scenario}')
#
#plt.title('Only Precipitation: Changes in Total Precipitation and Annual Incidence Rate', fontsize=18)
#plt.xlabel('Change in total precipitation (mm)')
#plt.ylabel('Change in annual incidence rate for Kuching, MY (%)')
#plt.grid(True)
#plt.legend()
#plt.tight_layout()
#plt.savefig('change_in_precipitation_vs_percentage_change.png')
#plt.show()