# Dengue Time-Series Forecasting  
## 02 — Baseline Models

This notebook establishes baseline performance before advanced feature engineering.

Objectives:
- Apply proper time-based splitting
- Implement naive forecasting
- Train simple regression models: Linear Regression (LR) and Random Forest (RF)
- Compare performance using MAE
- Understand why basic models struggle

In [59]:
import os

# List the input files available to this notebook
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

/kaggle/input/datasets/hardikthapar/deng123/DengAI_Training_Data_Features.csv
/kaggle/input/datasets/hardikthapar/deng123/DengAI_Training_Data_Labels.csv
/kaggle/input/datasets/hardikthapar/deng123/DengAI_Test_Data_Features.csv


In [60]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

## 1. Load and Prepare Data

We separate data by city and maintain chronological ordering.

In [61]:
train_features = pd.read_csv(
    "/kaggle/input/datasets/hardikthapar/deng123/DengAI_Training_Data_Features.csv",
    index_col=[0, 1, 2]
)

train_labels = pd.read_csv(
    "/kaggle/input/datasets/hardikthapar/deng123/DengAI_Training_Data_Labels.csv",
    index_col=[0, 1, 2]
)

# Separate cities
sj_X_raw = train_features.loc["sj"].copy()
iq_X_raw = train_features.loc["iq"].copy()

sj_y = train_labels.loc["sj"]["total_cases"].copy()
iq_y = train_labels.loc["iq"]["total_cases"].copy()

# Drop date column
sj_X_raw.drop("week_start_date", axis=1, inplace=True)
iq_X_raw.drop("week_start_date", axis=1, inplace=True)

In [62]:
base_features = [
    "reanalysis_specific_humidity_g_per_kg",
    "reanalysis_dew_point_temp_k",
    "station_avg_temp_c",
    "station_min_temp_c",
    "precipitation_amt_mm"
]

sj_X = sj_X_raw[base_features].copy()
iq_X = iq_X_raw[base_features].copy()

# Forward-fill missing values (each gap takes the last earlier observation).
# DataFrame.ffill() replaces the deprecated fillna(method="ffill").
sj_X = sj_X.ffill()
iq_X = iq_X.ffill()



In [63]:
sj_X = sj_X.sort_index()
sj_y = sj_y.sort_index()

iq_X = iq_X.sort_index()
iq_y = iq_y.sort_index()

## 2. Time-Based Train/Test Split

We use an 80/20 chronological split.

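Random splitting is avoided because it would let the models train on weeks that come after some of the test weeks, leaking future information into training.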

In [64]:
def time_split(X, y, split_ratio=0.8):
    split_idx = int(len(X) * split_ratio)

    X_train = X.iloc[:split_idx]
    X_test  = X.iloc[split_idx:]

    y_train = y.iloc[:split_idx]
    y_test  = y.iloc[split_idx:]

    return X_train, X_test, y_train, y_test
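
A single 80/20 cut yields only one error estimate per city. As a hedged alternative, the sketch below uses scikit-learn's `TimeSeriesSplit` to produce several chronological folds in which every training index precedes every test index. The original notebook does not use this class; the toy array and the choice of `n_splits=3` are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Ten toy observations in chronological order (illustrative only)
X_toy = np.arange(10).reshape(-1, 1)

tscv = TimeSeriesSplit(n_splits=3)
folds = list(tscv.split(X_toy))

for fold, (train_idx, test_idx) in enumerate(folds):
    # Training data always ends before the test block begins
    print(f"fold {fold}: train={train_idx.tolist()} test={test_idx.tolist()}")
```

Averaging MAE over such folds would give a steadier comparison than one split, at the cost of fitting each model several times.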

In [65]:
sj_X_train, sj_X_test, sj_y_train, sj_y_test = time_split(sj_X, sj_y)
iq_X_train, iq_X_test, iq_y_train, iq_y_test = time_split(iq_X, iq_y)
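
The objectives list a naive forecast, but no cell in this notebook computes one. A minimal sketch of a persistence baseline follows, assuming "naive" means predicting each week's cases as the previous week's observed count. The helper name `naive_persistence_forecast` and the toy series are hypothetical and not part of the original notebook.

```python
import pandas as pd


def naive_persistence_forecast(y_train: pd.Series, y_test: pd.Series) -> pd.Series:
    """Predict each test week as the previous week's observed case count.

    The first test week uses the last training observation. This is a
    one-step-ahead baseline: it assumes last week's true count is known
    at prediction time.
    """
    full = pd.concat([y_train, y_test])
    # shift(1) moves every value one position later, so week t sees week t-1
    return full.shift(1).iloc[len(y_train):]


# Toy weekly case counts (illustrative values only)
y = pd.Series([4, 5, 9, 14, 10, 6, 3, 2, 5, 8])
y_train, y_test = y.iloc[:8], y.iloc[8:]

naive_pred = naive_persistence_forecast(y_train, y_test)
naive_mae = (y_test - naive_pred).abs().mean()
print("naive MAE:", naive_mae)
```

With the notebook's variables, the equivalent call would be `naive_persistence_forecast(sj_y_train, sj_y_test)`. No naive MAE is quoted here because the original run did not compute one.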

## 4. Linear Regression Baseline

In [66]:
from sklearn.linear_model import LinearRegression

sj_lr = LinearRegression()
iq_lr = LinearRegression()

sj_lr.fit(sj_X_train, sj_y_train)
iq_lr.fit(iq_X_train, iq_y_train)

In [68]:
sj_pred_lr = sj_lr.predict(sj_X_test)
iq_pred_lr = iq_lr.predict(iq_X_test)

In [70]:
from sklearn.metrics import mean_absolute_error

sj_mae_lr = mean_absolute_error(sj_y_test, sj_pred_lr)
iq_mae_lr = mean_absolute_error(iq_y_test, iq_pred_lr)

print('sj mae LR: ', sj_mae_lr)
print('iq mae LR: ', iq_mae_lr)

sj mae LR:  24.729977945031013
iq mae LR:  7.009574422244498
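
For reference, the MAE printed above is the average absolute gap between observed and predicted weekly cases. Writing $y_t$ for the observed count in test week $t$, $\hat{y}_t$ for the prediction, and $n$ for the number of test weeks (symbols introduced here, not in the original):

$$
\mathrm{MAE} = \frac{1}{n} \sum_{t=1}^{n} \left| y_t - \hat{y}_t \right|
$$

A San Juan MAE of about 24.7 therefore means the Linear Regression predictions miss the true weekly count by roughly 25 cases on average.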


## 5. Random Forest Baseline

In [71]:
from sklearn.ensemble import RandomForestRegressor

sj_rf = RandomForestRegressor(n_estimators=300, random_state=2, max_depth=7)
iq_rf = RandomForestRegressor(n_estimators=100, random_state=2, max_depth=4)

sj_rf.fit(sj_X_train, sj_y_train)
iq_rf.fit(iq_X_train, iq_y_train)

In [72]:
sj_pred_rf = sj_rf.predict(sj_X_test)
iq_pred_rf = iq_rf.predict(iq_X_test)

In [73]:
from sklearn.metrics import mean_absolute_error

sj_mae_rf = mean_absolute_error(sj_y_test, sj_pred_rf)
iq_mae_rf = mean_absolute_error(iq_y_test, iq_pred_rf)

print('sj mae rf: ', sj_mae_rf)
print('iq mae rf: ', iq_mae_rf)

sj mae rf:  29.41952844327523
iq mae rf:  7.5911851405619695


## 6. Model Comparison

In [74]:
results = pd.DataFrame({
    "Model": ["Linear Regression", "Random Forest"],
    "San Juan MAE": [sj_mae_lr,sj_mae_rf],
    "Iquitos MAE": [iq_mae_lr,iq_mae_rf]
})

results

|   | Model             | San Juan MAE | Iquitos MAE |
|---|-------------------|--------------|-------------|
| 0 | Linear Regression | 24.729978    | 7.009574    |
| 1 | Random Forest     | 29.419528    | 7.591185    |


## Observations

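- Linear Regression struggles to model outbreak spikes.
- Random Forest can model nonlinear effects, yet it scores a higher MAE than Linear Regression in both cities (29.42 vs 24.73 for San Juan, 7.59 vs 7.01 for Iquitos) and still smooths extreme outbreaks.
- MAE remains high, likely because weekly case counts depend strongly on recent case history, which these climate-only features do not encode.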

### Key Insight

Simple climate-only models fail to capture outbreak dynamics.

This motivates time-aware feature engineering and recursive modeling.
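
As one illustration of what time-aware feature engineering could look like, the sketch below builds lagged copies and a rolling mean of a single climate column using pandas `shift` and `rolling`. The function name `add_lag_features`, the lag and window choices, and the toy values are assumptions for illustration; the original notebook does not specify which features follow.

```python
import pandas as pd


def add_lag_features(X: pd.DataFrame, column: str, lags=(1, 2, 4), window: int = 4) -> pd.DataFrame:
    """Return a copy of X with lagged and rolling-mean versions of one column.

    Each new value at week t is built only from week t and earlier weeks,
    so no future information enters the row.
    """
    out = X.copy()
    for lag in lags:
        # Value observed `lag` weeks before the current row
        out[f"{column}_lag{lag}"] = out[column].shift(lag)
    # Mean over the current week and the previous (window - 1) weeks
    out[f"{column}_roll{window}"] = out[column].rolling(window).mean()
    return out


# Toy data (illustrative values only)
X = pd.DataFrame({"precipitation_amt_mm": [10.0, 0.0, 30.0, 20.0, 40.0, 50.0]})
X_lagged = add_lag_features(X, "precipitation_amt_mm")
print(X_lagged)
```

The earliest rows contain NaN wherever the required history does not exist, so they would need to be dropped or filled before fitting a model.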