In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import LabelEncoder

In [2]:
import mercury as mr

# configure app properties
app = mr.App(title="Static notebook", description="Display static notebook", static_notebook=True)

## Introduction and Motivation 

Dengue fever, a mosquito-borne viral disease, poses a significant threat to public health, particularly in tropical and subtropical regions. As global temperatures rise due to climate change, the geographic distribution of the Aedes aegypti mosquito—the primary vector for dengue—and other mosquitos like Aedes albopictus extend into new areas, increasing the risk of outbreaks in regions previously considered safe. This report examines the critical intersection of climate change and dengue fever, focusing on the countries most affected and the potential future landscape for dengue transmission.

Utilizing historical data, I have constructed multiple supervised machine learning models to predict the number of new dengue cases by country, allowing us to gain insights into the patterns and trends of this increasingly prevalent disease. Also, I've used two unsupervised classification methods to identify the future susceptibility of various countries to dengue outbreaks, based on factors like [],[] and [].

The motivation behind this study lies in the urgent need to raise awareness about the expanding risk of dengue fever as climate change intensifies. In many jurisdictions, monitoring and reporting systems are inadequate, leaving medical professionals and concerned citizens without the tools necessary to prepare for and respond to outbreaks effectively. By highlighting vulnerable regions and predicting dengue transmission trends, this report aims to provide valuable information that can support healthcare initiatives and guide policy decisions in a rapidly warming world.


## Dengue in 5 Questions 

Which countries of the world are experiencing the most Dengue fever?

Which countries are now up-and-coming Dengue hotspots? 

What about countries which have no Dengue transmission data available, what can we do for them? 

Given sufficient historical data, can we predict the number of Dengue cases by country?

Given sufficient historical data, can we predict future Dengue hotspots?

## Literature Review 

Outside of academic sources, this project was largely inspired by this Guardian article documenting the impact of Dengue in Burkina Faso. While I had heard of it before, I had not realized the sheer geographical range of Dengue fever, sometimes also called breakbone fever because of how severe the joint pain is for many patients. While I read many sources in the process of making this, here is a condensed summary of the best academic resources I found. 

The impact of Dengue in the tropics goes back centuries and is well documented. The term breakbone fever was coined in the 1800s by Dr. Benjamin Rush. The earliest case in history was recorded on [] and the earliest comprehensive study I could find was [] 

More recently, global warming has expanded the range of multiple mosquitos that are known carriers for this disease, making more and more non-tropical countries susceptible to local epidemics of Dengue. Here is an excellent yet succinct example documenting local dengue spread in Iran. Among other things, it correctly points out that even regions that are neither southern nor tropical are at risk, because many woodlands are also prone to humidity if they are coastal enough. 

There have been attempts to forecast dengue, but likely due to financial and politicial factors, they have been limited to specific countries and not to global settings. The best one I could find was this paper out of Brazil, which accomplished the task using Ensemble Methods, particularly SVMs. More research is being done in the pharmacuetical world, where ML is being used to ameliorate slow and faulty testing procedures for Dengue, like in this paper, but that is less relevant given this is a forecasting attempt. 

In addition, unsupervised methods to predict regional risk for disease are not new, as shown here and here. I could not readily find one for Dengue, so I will do my best to be among the first to go down that route. 

Citations:







## Data Collection

For calculating the number of present and future dengue,  I'll be using OpenDengue National Dengue Reports data, a relatively clean dataset that includes many countries and is relatively clean to begin with. But it relies on countries self-reporting Dengue cases, and many of them choose not to do so because of reasons ranging from political instability/lack of funding and internal collection, to wanting to avoid stigma. As a result there is a lot of opportunity to impute data. 

In [3]:
cases = pd.read_csv('National_extract_V1_3.csv')

In [4]:
cases

Unnamed: 0,adm_0_name,adm_1_name,adm_2_name,full_name,ISO_A0,FAO_GAUL_code,RNE_iso_code,IBGE_code,calendar_start_date,calendar_end_date,Year,dengue_total,case_definition_standardised,S_res,T_res,UUID
0,AFGHANISTAN,,,AFGHANISTAN,AFG,1011446,AFG,,2021-09-05,2021-09-11,2021,18.0,Suspected,Admin0,Week,WHOEMRO-ALL-2021-Y01-05
1,AFGHANISTAN,,,AFGHANISTAN,AFG,1011446,AFG,,2021-09-12,2021-09-18,2021,24.0,Suspected,Admin0,Week,WHOEMRO-ALL-2021-Y01-05
2,AFGHANISTAN,,,AFGHANISTAN,AFG,1011446,AFG,,2021-09-19,2021-09-25,2021,9.0,Suspected,Admin0,Week,WHOEMRO-ALL-2021-Y01-05
3,AFGHANISTAN,,,AFGHANISTAN,AFG,1011446,AFG,,2021-09-26,2021-10-02,2021,104.0,Suspected,Admin0,Week,WHOEMRO-ALL-2021-Y01-05
4,AFGHANISTAN,,,AFGHANISTAN,AFG,1011446,AFG,,2021-10-03,2021-10-09,2021,98.0,Suspected,Admin0,Week,WHOEMRO-ALL-2021-Y01-05
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
29868,YEMEN,,,YEMEN,YEM,269,YEM,,2024-06-23,2024-06-29,2024,139.0,Suspected and confirmed,Admin0,Week,LITERATURE-YEM-20202024-Y01-00
29869,YEMEN,,,YEMEN,YEM,269,YEM,,2024-06-30,2024-07-06,2024,124.0,Suspected and confirmed,Admin0,Week,LITERATURE-YEM-20202024-Y01-00
29870,YEMEN,,,YEMEN,YEM,269,YEM,,2024-07-07,2024-07-13,2024,121.0,Suspected and confirmed,Admin0,Week,LITERATURE-YEM-20202024-Y01-00
29871,YEMEN,,,YEMEN,YEM,269,YEM,,2024-07-14,2024-07-20,2024,121.0,Suspected and confirmed,Admin0,Week,LITERATURE-YEM-20202024-Y01-00


In [5]:
# space for other dataset


## Data Cleaning 

Logging issues:
- some data not encoded correctly
- inconsistent column names
- a lot of useless columns

In [6]:
cases['T_res'].value_counts()

T_res
Week     23248
Year      3495
Month     3130
Name: count, dtype: int64

In [7]:
# num countries?
cases['full_name'].value_counts()

full_name
NICARAGUA                   897
PERU                        813
SINGAPORE                   793
MEXICO                      711
COLOMBIA                    671
                           ... 
MAYOTTE                      10
CHAD                          9
CENTRAL AFRICAN REPUBLIC      7
REUNION                       7
GUINEA                        7
Name: count, Length: 129, dtype: int64

In [8]:
# looks like we have 129 countries 
cases['full_name'].nunique()

129

In [9]:
cases.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 29873 entries, 0 to 29872
Data columns (total 16 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   adm_0_name                    29873 non-null  object 
 1   adm_1_name                    0 non-null      float64
 2   adm_2_name                    0 non-null      float64
 3   full_name                     29873 non-null  object 
 4   ISO_A0                        29873 non-null  object 
 5   FAO_GAUL_code                 29873 non-null  int64  
 6   RNE_iso_code                  29873 non-null  object 
 7   IBGE_code                     0 non-null      float64
 8   calendar_start_date           29873 non-null  object 
 9   calendar_end_date             29873 non-null  object 
 10  Year                          29873 non-null  int64  
 11  dengue_total                  29873 non-null  float64
 12  case_definition_standardised  29873 non-null  object 
 13  S

In [10]:
annual_cases = cases.query('T_res == "Year" & case_definition_standardised == "Total"')

In [11]:
annual_cases

Unnamed: 0,adm_0_name,adm_1_name,adm_2_name,full_name,ISO_A0,FAO_GAUL_code,RNE_iso_code,IBGE_code,calendar_start_date,calendar_end_date,Year,dengue_total,case_definition_standardised,S_res,T_res,UUID
94,AMERICAN SAMOA,,,AMERICAN SAMOA,ASM,5,ASM,,1955-01-01,1955-12-31,1955,0.0,Total,Admin0,Year,TYCHO-ALL-19242017-SV_DF01-00
95,AMERICAN SAMOA,,,AMERICAN SAMOA,ASM,5,ASM,,1979-01-01,1979-12-31,1979,0.0,Total,Admin0,Year,TYCHO-ALL-19242017-SV_DF01-00
96,AMERICAN SAMOA,,,AMERICAN SAMOA,ASM,5,ASM,,1980-01-01,1980-12-31,1980,1.0,Total,Admin0,Year,TYCHO-ALL-19242017-SV_DF01-00
97,AMERICAN SAMOA,,,AMERICAN SAMOA,ASM,5,ASM,,1981-01-01,1981-12-31,1981,1.0,Total,Admin0,Year,TYCHO-ALL-19242017-SV_DF01-00
98,AMERICAN SAMOA,,,AMERICAN SAMOA,ASM,5,ASM,,1982-01-01,1982-12-31,1982,0.0,Total,Admin0,Year,TYCHO-ALL-19242017-SV_DF01-00
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
29291,WALLIS AND FUTUNA,,,WALLIS AND FUTUNA,WLF,266,WLF,,2008-01-01,2008-12-31,2008,0.0,Total,Admin0,Year,TYCHO-ALL-19242017-SV_DF01-00
29292,WALLIS AND FUTUNA,,,WALLIS AND FUTUNA,WLF,266,WLF,,2009-01-01,2009-12-31,2009,13.0,Total,Admin0,Year,TYCHO-ALL-19242017-SV_DF01-00
29296,WALLIS AND FUTUNA,,,WALLIS AND FUTUNA,WLF,266,WLF,,2013-01-01,2013-12-31,2013,94.0,Total,Admin0,Year,LITERATURE-ALL-20112019-Y01-02
29297,WALLIS AND FUTUNA,,,WALLIS AND FUTUNA,WLF,266,WLF,,2017-01-01,2017-12-31,2017,222.0,Total,Admin0,Year,LITERATURE-ALL-20112019-Y01-02


### PCA 

In [12]:
annual_cases.columns

Index(['adm_0_name', 'adm_1_name', 'adm_2_name', 'full_name', 'ISO_A0',
       'FAO_GAUL_code', 'RNE_iso_code', 'IBGE_code', 'calendar_start_date',
       'calendar_end_date', 'Year', 'dengue_total',
       'case_definition_standardised', 'S_res', 'T_res', 'UUID'],
      dtype='object')

In [13]:
drops = ['adm_0_name', 'adm_1_name', 'adm_2_name', 'ISO_A0',
       'FAO_GAUL_code', 'RNE_iso_code', 'IBGE_code', 'calendar_start_date',
       'calendar_end_date', 'case_definition_standardised', 'S_res', 'T_res', 'UUID']

annual_cases = annual_cases.drop(drops, axis=1).reset_index(drop=True)
annual_cases.columns = ['Country', 'Year', 'Dengue_Total']

In [14]:
annual_cases

Unnamed: 0,Country,Year,Dengue_Total
0,AMERICAN SAMOA,1955,0.0
1,AMERICAN SAMOA,1979,0.0
2,AMERICAN SAMOA,1980,1.0
3,AMERICAN SAMOA,1981,1.0
4,AMERICAN SAMOA,1982,0.0
...,...,...,...
3187,WALLIS AND FUTUNA,2008,0.0
3188,WALLIS AND FUTUNA,2009,13.0
3189,WALLIS AND FUTUNA,2013,94.0
3190,WALLIS AND FUTUNA,2017,222.0


In [15]:
annual_cases['Country'].nunique()

97

In [16]:
# hey wait some countries went missing, had 129 
# they were not encoded with the appropriate total - year marker by the administrator 

In [17]:
# add up weeks manually, these don't have annual markers like the are supposed to 
grp_cases = cases.query('T_res == "Week"').groupby(['full_name', 'Year', ]).agg({
    'dengue_total': 'sum'
}).reset_index()
grp_cases

Unnamed: 0,full_name,Year,dengue_total
0,AFGHANISTAN,2021,734.0
1,AFGHANISTAN,2023,1481.0
2,AMERICAN SAMOA,2017,111.0
3,AMERICAN SAMOA,2018,0.0
4,AMERICAN SAMOA,2019,4.0
...,...,...,...
532,YEMEN,2020,62028.0
533,YEMEN,2021,7324.0
534,YEMEN,2022,24545.0
535,YEMEN,2023,21302.0


In [18]:
missing_set = set(grp_cases['full_name']) - set(annual_cases['Country'])
missing_set

{'AFGHANISTAN', 'SUDAN', 'YEMEN'}

In [19]:
# oh wow these are important

In [20]:
# merge back with the original to keep everything
merged_df = pd.merge(cases, grp_cases, on=['full_name', 'Year'])
# extract the missing ones 
missing = merged_df.query('full_name in @missing_set')

missing = missing.groupby(['full_name', 'Year']).agg({
    'dengue_total_y': 'sum'
}).reset_index()
missing.columns = ['Country', 'Year', 'Dengue_Total']



In [21]:
all_countries = pd.concat([annual_cases, missing]).reset_index(drop=True).sort_values(['Country'])

In [22]:
# ah ah ah remove the year 2025 because it is incomplete and will bias our data 
# cdf is cleaned df
cdf = all_countries.query('Year != 2025').reset_index(drop=True)
cdf

Unnamed: 0,Country,Year,Dengue_Total
0,AFGHANISTAN,2023,75531.0
1,AFGHANISTAN,2021,11744.0
2,AMERICAN SAMOA,1955,0.0
3,AMERICAN SAMOA,2016,0.0
4,AMERICAN SAMOA,2015,479.0
...,...,...,...
3199,YEMEN,2018,1469260.0
3200,YEMEN,2017,212.0
3201,YEMEN,2023,1129006.0
3202,YEMEN,2019,4243200.0


## A Note About This Data 
Notice how, because the numbers are self-reported, there can be inconsistencies. For example, American Samoa went from 479 cases in 2015, but 0 in 2016. This could be true, or it could be that there wasn't as accurate reporting in 2016.

And because anyone can contribute to OpenDengue, sometimes statisticians can submit estimated guesses. Yemen skyrockets to around 4 million cases because they were having an epidemic, but also because doctors treating people made estimates. That is why it goes down so dramatically a few years later. People were no longer submitting estimates.

## EDA

## Supervised Learning Approaches: Predicting Total Future Dengue Cases 

In [24]:
# making country numerical 
le = LabelEncoder()
cdf['Country_Encoded'] = le.fit_transform(cdf['Country'])

X = cdf[['Year', 'Country_Encoded']]
y = cdf['Dengue_Total']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=67) # train test split

model = LinearRegression()
model.fit(X_train, y_train)


# y hat for future, 2025-2026 etc.
future_years = pd.DataFrame({
    'Year': [2026, 2026],
    'Country_Encoded': le.transform(['AFGHANISTAN', 'YEMEN'])
})

predictions = model.predict(future_years)
predicted_df = pd.DataFrame({'Country': ['AFGHANISTAN', 'YEMEN'], 'Predicted_Cases': predictions})

predicted_df

Unnamed: 0,Country,Predicted_Cases
0,AFGHANISTAN,57518.109906
1,YEMEN,60602.015255


In [None]:
## Unsupervised Learning Approaches: 