<br>
<h1 align="center" style="color:green">Feature Extraction on Meteorological & Dengue Dataset in the City of Manila<br> DOST-Pagasa Port Area Dataset & DOH City of Manila Dengue Dataset</h1>
<div style="text-align:center">Prepared by <b>Jose Rafael C Crisostomo, Jan Vincent G. Elleazar, Dodge Deiniol D. Lapis, and Carl Jacob F. Mateo</b><br>
FOMaC-Autoformer: A Hybrid First Order Markov Chain-Autoformer Model for Dengue Incidence Forecasting in the City of Manila<br>
<b>University of Santo Tomas - College of Information and Computing Sciences</b>
</div>

We extract features from both datasets in three groups: Temporal Features, Statistical Features, and Composite Features. These are then assembled into feature vectors that serve as the input data for the Autoformer model.
***

# Load the dataset

In [1]:
import pandas as pd
import numpy as np

# Load the dataset created in the state discretization step
try:
    data = pd.read_csv('final_data_with_states.csv')
except FileNotFoundError:
    print("Error: 'final_data_with_states.csv' not found.")
    print("Please run your state discretization script first.")
    raise  # Stop here: the cells below cannot run without `data`

# Set 'date' as the index. This is VITAL for time-series features.
data['date'] = pd.to_datetime(data['date'])
data.set_index('date', inplace=True)
data.sort_index(inplace=True)

print("--- Loaded Data ---")
print(data.head())

--- Loaded Data ---
               location  cases  cases_minmax  RAINFALL_minmax  TMAX_minmax  \
date                                                                         
2016-01-10  MANILA CITY     49      0.204167         0.005472     0.380835   
2016-01-17  MANILA CITY     47      0.195833         0.008208     0.471744   
2016-01-24  MANILA CITY     37      0.154167         0.008208     0.455774   
2016-01-31  MANILA CITY     31      0.129167         0.008208     0.380835   
2016-02-07  MANILA CITY     33      0.137500         0.167715     0.253071   

            TMIN_minmax  RH_minmax  WIND_SPEED_minmax  WIND_DIR_X  WIND_DIR_Y  \
date                                                                            
2016-01-10     0.298276   0.500000           0.225734   -0.481295   -0.876559   
2016-01-17     0.443103   0.511666           0.322799   -0.229200   -0.973379   
2016-01-24     0.460345   0.403147           0.354402   -0.844328   -0.535827   
2016-01-31     0.362069   0.
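
The lag features built later use `shift`, which counts rows rather than calendar weeks, so they only mean "n weeks ago" if the date index has no missing weeks. The following is a minimal sketch of such a check on a toy index; the helper name `check_weekly_index` and the toy dates are illustrative assumptions, not part of the original pipeline.

```python
import pandas as pd

def check_weekly_index(index: pd.DatetimeIndex) -> pd.Series:
    """Return the gaps between consecutive dates that are not exactly 7 days."""
    gaps = index.to_series().diff().dropna()
    return gaps[gaps != pd.Timedelta(days=7)]

# Toy index with one missing week (2016-01-24 is absent on purpose).
toy_index = pd.DatetimeIndex(
    ["2016-01-10", "2016-01-17", "2016-01-31", "2016-02-07"]
)

irregular = check_weekly_index(toy_index)
print(irregular)  # one entry, located at the date that follows the gap
```

On the real data, `check_weekly_index(data.index)` would be expected to return an empty Series if every week is present.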

# Temporal Features

In [2]:
print("\n--- 1. Creating Temporal Features ---")

# Extract basic time features
data['month'] = data.index.month
data['week_of_year'] = data.index.isocalendar().week
data['day_of_year'] = data.index.dayofyear
data['year'] = data.index.year

# --- Cyclical Feature Encoding ---
# This is a best practice. It helps the model understand that
# December (12) is "close" to January (1).

def encode_cyclical(df, col, max_val):
    df[col + '_sin'] = np.sin(2 * np.pi * df[col] / max_val)
    df[col + '_cos'] = np.cos(2 * np.pi * df[col] / max_val)
    return df

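# Encode month and week_of_year
# Note: some ISO years contain a week 53; with max_val=52 it wraps onto
# the same point as week 1.
data = encode_cyclical(data, 'month', 12)
data = encode_cyclical(data, 'week_of_year', 52)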

# Now we can drop the original month/week columns if we want
# data.drop(['month', 'week_of_year'], axis=1, inplace=True)

print("Temporal features created and encoded.")
print(data[['month_sin', 'month_cos', 'week_of_year_sin', 'week_of_year_cos']].head())


--- 1. Creating Temporal Features ---
Temporal features created and encoded.
            month_sin  month_cos  week_of_year_sin  week_of_year_cos
date                                                                
2016-01-10   0.500000   0.866025          0.120537          0.992709
2016-01-17   0.500000   0.866025          0.239316          0.970942
2016-01-24   0.500000   0.866025          0.354605          0.935016
2016-01-31   0.500000   0.866025          0.464723          0.885456
2016-02-07   0.866025   0.500000          0.568065          0.822984
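
To illustrate why the sine/cosine pair helps, the sketch below measures the Euclidean distance between encoded months. It reuses the same formula as `encode_cyclical`; the helper names `cyclical_point` and `encoded_distance` are introduced here only for illustration.

```python
import numpy as np

def cyclical_point(value, max_val):
    """Map a cyclical value onto the unit circle as (sin, cos)."""
    angle = 2 * np.pi * value / max_val
    return np.array([np.sin(angle), np.cos(angle)])

def encoded_distance(a, b, max_val):
    """Euclidean distance between two encoded cyclical values."""
    diff = cyclical_point(a, max_val) - cyclical_point(b, max_val)
    return float(np.linalg.norm(diff))

dec_jan = encoded_distance(12, 1, 12)  # neighbours across the year boundary
jan_feb = encoded_distance(1, 2, 12)   # neighbours within the year
jan_jul = encoded_distance(1, 7, 12)   # half a year apart

print(f"Dec-Jan: {dec_jan:.4f}, Jan-Feb: {jan_feb:.4f}, Jan-Jul: {jan_jul:.4f}")
```

With the raw month numbers, December and January differ by 11; after encoding, they are exactly as close as January and February, while months half a year apart are the farthest from each other.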


***
# Statistical Features

In [3]:
print("\n--- 2. Creating Statistical Features (Lags & Rolling) ---")

# List of columns we want to create features for
# We use the minmax-scaled features for this
features_to_lag = [
    'RAINFALL_minmax', 
    'TMAX_minmax', 
    'TMIN_minmax', 
    'RH_minmax',
    'cases_minmax'  # Lagging the target itself is a key predictive feature
]

# Define our window sizes
# We'll use a 4-week window for rolling averages
window_size = 4
# We'll create lags for 1, 2, 3, and 4 weeks
lag_periods = [1, 2, 3, 4]

for col in features_to_lag:
    # --- Rolling Averages ---
    # Mean over the current week and the previous 'window_size' - 1 weeks
    # (the first 'window_size' - 1 rows are NaN)
    col_roll_mean = f'{col}_roll_mean_{window_size}w'
    data[col_roll_mean] = data[col].rolling(window=window_size).mean()
    
    # --- Lagged Features ---
    # Shows the value from 'n' weeks ago
    for lag in lag_periods:
        col_lag = f'{col}_lag_{lag}w'
        data[col_lag] = data[col].shift(lag)

print("Statistical features created.")
print(data.filter(like='RAINFALL_minmax').head(10)) # Show first 10 to see NaNs


--- 2. Creating Statistical Features (Lags & Rolling) ---
Statistical features created.
            RAINFALL_minmax  RAINFALL_minmax_roll_mean_4w  \
date                                                        
2016-01-10         0.005472                           NaN   
2016-01-17         0.008208                           NaN   
2016-01-24         0.008208                           NaN   
2016-01-31         0.008208                      0.007524   
2016-02-07         0.167715                      0.048085   
2016-02-14         0.008208                      0.048085   
2016-02-21         0.008208                      0.048085   
2016-02-28         0.018057                      0.050547   
2016-03-06         0.005472                      0.009986   
2016-03-13         0.008208                      0.009986   

            RAINFALL_minmax_lag_1w  RAINFALL_minmax_lag_2w  \
date                                                         
2016-01-10                     NaN                    
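
Note that `rolling(window=window_size).mean()` includes the current week's value in each average. For the target column this may matter: if the model is asked to predict this week's cases, a rolling mean that already contains this week's cases leaks part of the answer. One possible variant, shown here as a sketch on toy data rather than as the method used in this notebook, shifts the series by one week before rolling so that only past values are averaged.

```python
import pandas as pd

# Toy weekly series standing in for a scaled column such as 'cases_minmax'.
weekly = pd.Series(
    [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    index=pd.date_range("2016-01-10", periods=6, freq="W-SUN"),
    name="cases_minmax",
)

window_size = 4

# As in the cell above: the window ends at (and includes) the current week.
roll_including_current = weekly.rolling(window=window_size).mean()

# Variant: shift first, so the window covers the 4 weeks before the current one.
roll_past_only = weekly.shift(1).rolling(window=window_size).mean()

comparison = pd.DataFrame({
    "value": weekly,
    "roll_including_current": roll_including_current,
    "roll_past_only": roll_past_only,
})
print(comparison)
```

The past-only variant costs one additional NaN row at the start of the series.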

***
# Composite Features

In [4]:
print("\n--- 3. Creating Composite Features ---")

# Interaction between Temperature and Humidity
# This "heat-humidity" feature could be more predictive than either alone.
data['TMAX_x_RH'] = data['TMAX_minmax'] * data['RH_minmax']

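# Temperature spread proxy: difference of the scaled TMAX and TMIN.
# Each column is min-max scaled independently, so this is not the physical
# TMAX - TMIN range and can be negative.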
data['T_range'] = data['TMAX_minmax'] - data['TMIN_minmax']

print("Composite features created.")
print(data[['TMAX_minmax', 'RH_minmax', 'TMAX_x_RH', 'T_range']].head())


--- 3. Creating Composite Features ---
Composite features created.
            TMAX_minmax  RH_minmax  TMAX_x_RH   T_range
date                                                   
2016-01-10     0.380835   0.500000   0.190418  0.082560
2016-01-17     0.471744   0.511666   0.241375  0.028641
2016-01-24     0.455774   0.403147   0.183744 -0.004571
2016-01-31     0.380835   0.387412   0.147540  0.018766
2016-02-07     0.253071   0.620184   0.156951 -0.071067
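
The negative `T_range` values in the output arise because `TMAX_minmax` and `TMIN_minmax` are scaled independently, so their difference no longer corresponds to a physical temperature range. If the unscaled temperatures are available, one alternative is to compute the range first and scale it afterwards. The sketch below assumes hypothetical raw columns named `TMAX` and `TMIN` with made-up values; these names and numbers are not taken from the dataset.

```python
import pandas as pd

# Hypothetical raw temperatures in degrees Celsius (illustrative values only).
raw = pd.DataFrame({
    "TMAX": [30.0, 32.0, 35.0],
    "TMIN": [22.0, 26.0, 25.0],
})

def minmax(series: pd.Series) -> pd.Series:
    """Scale a series to the [0, 1] interval."""
    return (series - series.min()) / (series.max() - series.min())

# Difference of independently scaled columns: can go negative.
scaled_diff = minmax(raw["TMAX"]) - minmax(raw["TMIN"])

# Range computed on raw values, then scaled: stays within [0, 1].
raw_range = raw["TMAX"] - raw["TMIN"]
raw_range_scaled = minmax(raw_range)

print(pd.DataFrame({
    "scaled_diff": scaled_diff,
    "raw_range": raw_range,
    "raw_range_scaled": raw_range_scaled,
}))
```

Either feature can still carry signal; the difference is mainly one of interpretability.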


# Clean and save the dataset

In [5]:
print("\n--- 4. Final Cleanup and Save ---")

# Check how many rows have NaN values
nan_rows_before = data.isna().any(axis=1).sum()
print(f"Rows with NaN values before dropping: {nan_rows_before}")

# Drop all rows that have ANY NaN values
# This ensures our model only trains on complete, valid data
final_feature_data = data.dropna()

nan_rows_after = final_feature_data.isna().any(axis=1).sum()
print(f"Rows with NaN values after dropping: {nan_rows_after}")
print(f"Original data shape: {data.shape}")
print(f"Final data shape: {final_feature_data.shape}")

# Save the final dataset
final_feature_data.to_csv('feature_engineered_data.csv')

print("\n--- Feature Engineering Complete! ---")
print("Your final dataset 'feature_engineered_data.csv' is ready.")
print("This file contains the 'Feature Vectors' for your Autoformer.")


--- 4. Final Cleanup and Save ---
Rows with NaN values before dropping: 11
Rows with NaN values after dropping: 0
Original data shape: (259, 46)
Final data shape: (248, 46)

--- Feature Engineering Complete! ---
Your final dataset 'feature_engineered_data.csv' is ready.
This file contains the 'Feature Vectors' for your Autoformer.
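
If the input columns were complete, the 4-week lags and rolling means would leave NaNs only in the first four rows, so a count of 11 may indicate gaps in other columns. A per-column report can show where the missing values come from before rows are dropped. The following is a sketch on toy data; the helper name `nan_report` and the toy columns are assumptions for illustration.

```python
import numpy as np
import pandas as pd

def nan_report(df: pd.DataFrame) -> pd.Series:
    """Return NaN counts for columns that contain at least one NaN."""
    counts = df.isna().sum()
    return counts[counts > 0].sort_values(ascending=False)

# Toy frame with one missing value in the source column.
toy = pd.DataFrame({"x": [1.0, 2.0, np.nan, 4.0, 5.0, 6.0]})
toy["x_lag_1w"] = toy["x"].shift(1)
toy["x_roll_mean_2w"] = toy["x"].rolling(window=2).mean()

report = nan_report(toy)
rows_with_nan = toy.index[toy.isna().any(axis=1)].tolist()

print(report)
print("Rows that dropna() would remove:", rows_with_nan)
```

The toy example also shows that a single gap in a source column propagates into the lag and rolling columns of the following rows, which is one way interior rows can end up being dropped.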
