<br>
<h1 align="center" style="color:green">State Discretization on Dengue Dataset in the City of Manila<br>DOH City of Manila Dengue Dataset</h1>
<div style="text-align:center">Prepared by <b>Jose Rafael C. Crisostomo, Jan Vincent G. Elleazar, Dodge Deiniol D. Lapis, and Carl Jacob F. Mateo</b><br>
FOMaC-Autoformer: A Hybrid First Order Markov Chain-Autoformer Model for Dengue Incidence Forecasting in the City of Manila<br>
<b>University of Santo Tomas - College of Information and Computing Sciences</b>
</div>

We convert dengue case counts into discrete states (bins) using fixed thresholds. The resulting categorical states allow the First-Order Markov Chain model to produce probabilistic forecasts of the dengue state.
***

# State Discretization
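
As a quick sanity check of the binning rule, the sketch below applies the same thresholds (34 and 75) and labels used in this section's cell to a handful of made-up case counts placed on and around the boundaries. The `sample_cases` values are hypothetical and are not drawn from the DOH dataset. Because `right=True` closes each bin on its right edge, a count of exactly 34 should land in Low and a count of exactly 75 in Medium.

```python
import numpy as np
import pandas as pd

# Thresholds and labels copied from the discretization cell in this section
bins = [-np.inf, 34, 75, np.inf]
labels = ['Low', 'Medium', 'High']

# Hypothetical case counts chosen to sit on and around the two thresholds
sample_cases = pd.Series([0, 34, 35, 75, 76, 240])

# right=True makes each bin include its right edge: (-inf, 34], (34, 75], (75, +inf)
sample_states = pd.cut(sample_cases, bins=bins, labels=labels, right=True)

# Show each count next to the state it was assigned
boundary_check = pd.DataFrame({'cases': sample_cases, 'state': sample_states})
print(boundary_check)
```

Checking the edges this way makes it explicit which state a value sitting exactly on a threshold receives, which matters when the thresholds are later tuned.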

In [1]:
# Import the libraries you already have in your preprocessing file
import pandas as pd
import numpy as np

# 1. Load the final, preprocessed dataset
try:
    final_data = pd.read_csv('final_dengue_meteorological_data.csv')
except FileNotFoundError:
    print("Error: 'final_dengue_meteorological_data.csv' not found.")
    print("Please run your preprocessing script first.")
    # Stop here: without the dataset, final_data is undefined and every later step would fail
    raise

# Make sure 'date' column is a datetime object for proper sorting
if 'date' in final_data.columns:
    final_data['date'] = pd.to_datetime(final_data['date'])
    final_data = final_data.sort_values(by='date').reset_index(drop=True)

    # 2. Define the bins and labels based on EDA
    # Bins: (-inf, 34], (34, 75], (75, +inf)
    # -np.inf and np.inf are the outer edges, so every value falls into a bin
    bins = [-np.inf, 34, 75, np.inf]
    
    # Labels for our 3 states
    labels = ['Low', 'Medium', 'High']

    # 3. Create the new 'state' column
    # We use the raw 'cases' column to determine the state
    final_data['state'] = pd.cut(final_data['cases'], 
                                 bins=bins, 
                                 labels=labels, 
                                 right=True) # right=True means the bin includes the right edge

    # 4. Check the work
    print("--- State Discretization Complete ---")
    print("\nDistribution of Dengue States:")
    
    # Calculate and print the percentage for each state
    state_distribution = final_data['state'].value_counts(normalize=True).sort_index()
    print(state_distribution * 100)

    print("\nDataFrame with new 'state' column:")
    print(final_data[['date', 'cases', 'state', 'cases_minmax']].head())

    # 5. Save this new dataframe for the next phase
    final_data.to_csv('final_data_with_states.csv', index=False)
    
else:
    print("Error: 'date' column not found in the dataset.")

--- State Discretization Complete ---

Distribution of Dengue States:
state
Low       50.579151
Medium    25.096525
High      24.324324
Name: proportion, dtype: float64

DataFrame with new 'state' column:
        date  cases   state  cases_minmax
0 2016-01-10     49  Medium      0.204167
1 2016-01-17     47  Medium      0.195833
2 2016-01-24     37  Medium      0.154167
3 2016-01-31     31     Low      0.129167
4 2016-02-07     33     Low      0.137500
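
Once every row carries a state, a first-order Markov chain can be estimated by counting how often each state is followed by each other state and normalizing every row of counts to sum to 1. The following is a minimal sketch of that idea, not necessarily the exact procedure used in FOMaC-Autoformer. The `states` sequence is a short, hand-written illustration (an assumption), standing in for the date-sorted `state` column saved in `final_data_with_states.csv`.

```python
import pandas as pd

labels = ['Low', 'Medium', 'High']

# Hypothetical state sequence for illustration only;
# in practice this would be the date-sorted 'state' column
states = pd.Series(['Medium', 'Medium', 'Medium', 'Low', 'Low',
                    'Medium', 'High', 'High', 'Medium', 'Low'])

# Pair each state with the one that follows it
current_state = states.iloc[:-1].reset_index(drop=True)
next_state = states.iloc[1:].reset_index(drop=True)

# Count transitions; reindex so all three states appear even if one is never observed
transition_counts = pd.crosstab(
    current_state, next_state, rownames=['current'], colnames=['next']
).reindex(index=labels, columns=labels, fill_value=0)

# Row-normalize the counts to obtain estimated transition probabilities
# (a state that never occurs as 'current' would yield a row of NaN)
transition_matrix = transition_counts.div(transition_counts.sum(axis=1), axis=0)
print(transition_matrix.round(3))
```

Each row of `transition_matrix` is then the estimated probability distribution over the next state given the current one, which is what makes the forecast probabilistic.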
