## Data Cleaning: Dengue Data

This notebook will work through the steps to clean the `Weekly Infectious Disease Bulletin Data` from `data.gov.sg`^ and obtain only the relevant dengue case data needed for this project.

^Link to dataset: https://data.gov.sg/dataset/weekly-infectious-disease-bulletin-cases  
The dataset provides weekly infectious disease case counts from 2012 to 2022.

In [1]:
import pandas as pd
import numpy as np

### Load the data

In [2]:
# import csv data
diseases_df = pd.read_csv('../../data/weekly_disease_bulletin/weekly-infectious-disease-bulletin-cases.csv')
diseases_df

Unnamed: 0,epi_week,disease,no._of_cases
0,2012-W01,Acute Viral hepatitis B,0
1,2012-W01,Acute Viral hepatitis C,0
2,2012-W01,Avian Influenza,0
3,2012-W01,Campylobacterenterosis,6
4,2012-W01,Chikungunya Fever,0
...,...,...,...
20065,2022-W52,Japanese Encephalitis,0
20066,2022-W52,Tetanus,0
20067,2022-W52,Botulism,0
20068,2022-W52,Murine Typhus,0


### Keep only dengue data
Many infectious diseases are listed in this dataset. We only want the rows for dengue.

In [3]:
# check disease names
diseases_df['disease'].value_counts().sort_index()

Acute Viral Hepatitis A              313
Acute Viral Hepatitis E              313
Acute Viral hepatitis B              574
Acute Viral hepatitis C              574
Avian Influenza                      574
Botulism                             313
Campylobacter enteritis              313
Campylobacterenterosis               261
Chikungunya                          313
Chikungunya Fever                    261
Cholera                              574
Dengue Fever                         574
Dengue Haemorrhagic Fever            574
Diphtheria                           574
Ebola Virus Disease                  313
Encephalitis                         574
HFMD                                 313
Haemophilus influenzae type b        574
Hand, Foot Mouth Disease             261
Japanese Encephalitis                313
Legionellosis                        574
Leptospirosis                        313
Malaria                              574
Measles                              573
Melioidosis     

We find that there are 2 diseases related to dengue: `Dengue Fever` and `Dengue Haemorrhagic Fever`.
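
Rather than scanning the listing by eye, the dengue-related names can also be picked out with a case-insensitive substring match. The sketch below is only illustrative: it runs on a small toy series (`diseases`, a hypothetical stand-in for `diseases_df['disease']`) built from names that appear in the listing above.

```python
import pandas as pd

# Toy stand-in for diseases_df['disease']; names are taken from the listing above.
diseases = pd.Series([
    "Cholera", "Dengue Fever", "Dengue Haemorrhagic Fever",
    "Chikungunya", "Chikungunya Fever", "Dengue Fever",
])

# Case-insensitive substring match, then list the distinct matching names.
dengue_names = sorted(diseases[diseases.str.contains("dengue", case=False)].unique())
print(dengue_names)
```

A substring match also guards against spelling variants, which matter here because several other diseases appear under two different names in this dataset.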


Based on online research, both are commonly referred to together as dengue. The latter is the more serious, potentially fatal form of the former. Both are caused by the dengue virus, which is spread mainly by the Aedes aegypti mosquito.

However, we should confirm that both listings in the dataset are indeed distinct from one another.

In [4]:
# check if `Dengue Fever` and `Dengue Haemorrhagic Fever` are treated the same or differently
dengue_df = diseases_df[diseases_df['disease'].isin(['Dengue Fever', 'Dengue Haemorrhagic Fever'])]
dengue_df

Unnamed: 0,epi_week,disease,no._of_cases
6,2012-W01,Dengue Fever,74
7,2012-W01,Dengue Haemorrhagic Fever,0
37,2012-W02,Dengue Fever,64
38,2012-W02,Dengue Haemorrhagic Fever,2
68,2012-W03,Dengue Fever,60
...,...,...,...
19962,2022-W50,Dengue Haemorrhagic Fever,1
20000,2022-W51,Dengue Fever,270
20001,2022-W51,Dengue Haemorrhagic Fever,0
20039,2022-W52,Dengue Fever,285


The data appear to reflect the rarer and more serious nature of `Dengue Haemorrhagic Fever`: its weekly case counts are far lower than those for `Dengue Fever`. Since the two series differ, we can also be more confident that they are not duplicate listings in the dataset.
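
One possible way to make this comparison explicit is to pivot the two listings into side-by-side columns. The sketch below uses a toy frame (`sample`, a hypothetical name) holding only the first two weeks shown above, so it is an illustration rather than a check on the full dataset.

```python
import pandas as pd

# Toy rows copied from the first two weeks shown above.
sample = pd.DataFrame({
    "epi_week": ["2012-W01", "2012-W01", "2012-W02", "2012-W02"],
    "disease": ["Dengue Fever", "Dengue Haemorrhagic Fever"] * 2,
    "no._of_cases": [74, 0, 64, 2],
})

# One row per week, one column per disease.
wide = sample.pivot(index="epi_week", columns="disease", values="no._of_cases")

# True if the two series differ in at least one week, i.e. they are not copies.
are_distinct = not wide["Dengue Fever"].equals(wide["Dengue Haemorrhagic Fever"])

# Weekly total across both listings.
wide["total"] = wide.sum(axis=1)
print(wide)
print(are_distinct)
```

A side benefit of `pivot` is that it raises a `ValueError` if any week has more than one row for the same disease, which would flag duplicated entries before they are summed.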

In [5]:
# sum both `Dengue Fever` and `Dengue Haemorrhagic Fever` for each week as a total count
dengue_df = dengue_df.groupby('epi_week').sum(numeric_only=True).reset_index()
dengue_df

Unnamed: 0,epi_week,no._of_cases
0,2012-W01,74
1,2012-W02,66
2,2012-W03,61
3,2012-W04,52
4,2012-W05,85
...,...,...
569,2022-W48,242
570,2022-W49,327
571,2022-W50,290
572,2022-W51,270


### Epidemiological Weeks

We also notice that the infectious disease/dengue data are based on "Epidemiological Weeks", or in short, epi weeks.

For Singapore (and many countries around the world), epi weeks are 7-day weeks that begin on Sunday and end on Saturday. As such, epi week 1 of each year may not begin on Jan 1 of the calendar year; its start date depends on where the Sunday-to-Saturday weeks fall around the new year.

Similarly, epi years are not the same as calendar years, since epi week 1 may start a few calendar days before or after Jan 1.

For standardizing our data across all datasets, we shall follow the use of epi weeks.
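
To make the relationship between epi weeks and calendar dates concrete, here is a minimal sketch that converts a year and week number into the Sunday that starts that week. It assumes the US CDC MMWR convention, in which week 1 is the Sunday-to-Saturday week with at least four days in the new year (equivalently, the week containing Jan 4). Whether this dataset follows exactly that rule is an assumption, and `epi_week_start` is a hypothetical helper, not part of the notebook.

```python
from datetime import date, timedelta


def epi_week_start(year: int, week: int) -> date:
    """Sunday that starts the given epi week, under the assumed 'Jan 4' rule."""
    jan4 = date(year, 1, 4)
    # date.weekday() gives Monday=0 ... Sunday=6, so this is the number of
    # days elapsed since the most recent Sunday (0 if Jan 4 is a Sunday).
    days_since_sunday = (jan4.weekday() + 1) % 7
    week1_start = jan4 - timedelta(days=days_since_sunday)
    return week1_start + timedelta(weeks=week - 1)


print(epi_week_start(2012, 1))
print(epi_week_start(2022, 1))
```

Under this assumption, epi week 1 of 2012 starts on 1 Jan 2012, while epi week 1 of 2022 starts on 2 Jan 2022 and epi week 1 of 2020 starts on 29 Dec 2019, which illustrates how the epi year can begin a few days before or after the calendar year.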

In [6]:
# split epi_week into year and week respectively
dengue_df['year'] = dengue_df['epi_week'].str[:4].astype('int')
dengue_df['week'] = dengue_df['epi_week'].str[6:].astype('int')
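
The positional slices above rely on every value having exactly the `YYYY-Www` shape. As an alternative sketch, a regular expression can parse and validate the format in one step. The series `epi_week` below is a toy stand-in for `dengue_df['epi_week']`, using values that appear in the outputs above.

```python
import pandas as pd

# Toy stand-in for dengue_df['epi_week'].
epi_week = pd.Series(["2012-W01", "2012-W05", "2022-W52"])

# Named groups become the column names of the returned DataFrame.
parts = epi_week.str.extract(r"^(?P<year>\d{4})-W(?P<week>\d{2})$", expand=True)

# Values that do not match the pattern come back as NaN, so fail early on those.
if parts.isna().any().any():
    raise ValueError("unexpected epi_week format")

parts = parts.astype(int)
print(parts)
```

The trade-off is a little more code in exchange for an explicit error if the source ever changes its week format, instead of silently slicing the wrong characters.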

In [7]:
# keep only necessary columns and reorder them for easy reading
dengue_df = dengue_df[['year', 'week', 'no._of_cases']]

# rename columns
dengue_df = dengue_df.rename(columns={'no._of_cases': 'cases'})
dengue_df

Unnamed: 0,year,week,cases
0,2012,1,74
1,2012,2,66
2,2012,3,61
3,2012,4,52
4,2012,5,85
...,...,...,...
569,2022,48,242
570,2022,49,327
571,2022,50,290
572,2022,51,270


### Save the cleaned dataframe as a csv

Saving the cleaned data lets us read it in easily during analysis later on.

In [8]:
dengue_df.to_csv('../../data/cleaned/dengue_clean.csv', index=False)
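
As a quick sanity check on the save step, the sketch below writes a toy frame to a temporary file and reads it back. The frame `cleaned` and the temporary path are hypothetical stand-ins, used so that the example does not touch the project's data folder.

```python
import os
import tempfile

import pandas as pd

# Toy frame with the same columns as the cleaned data.
cleaned = pd.DataFrame({"year": [2012, 2012], "week": [1, 2], "cases": [74, 66]})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "dengue_clean.csv")  # hypothetical temp location
    cleaned.to_csv(path, index=False)
    reloaded = pd.read_csv(path)

# With index=False there is no extra 'Unnamed: 0' column on reload.
round_trip_ok = reloaded.equals(cleaned)
print(reloaded.columns.tolist())
print(round_trip_ok)
```

Passing `index=False` is what keeps the row index out of the file; without it, reading the CSV back would produce an additional unnamed index column.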