<a href="https://colab.research.google.com/github/moira-du-monde/space_time_dengue_fever/blob/data/cleaning_dengue_data.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Data Cleaning: Dengue Fever Cases in Kaohsiung City, 1998-2019**

Install and/or import necessary libraries.

In [39]:
%%capture
# Uncomment the next line if geopandas is not already installed.
#%pip install geopandas

In [40]:
from datetime import date
import pandas as pd
import json
import geopandas as gpd

Load Taiwan's historical dengue fever data from the "dangy" repository by GitHub user jerrytohvan. Then drop the unneeded columns, rename the remaining ones, and filter by county of residence to isolate the case study area (Kaohsiung City).

In [41]:
url = "https://raw.githubusercontent.com/jerrytohvan/dangy/master/data/dengue_case.json"
df = pd.read_json(url, orient='records', dtype='dict')

In [42]:
df.columns

Index(['Onset_day', 'Case_study_date', 'Notification_day', 'gender',
       'age_group', 'Living_county', 'Living_township', 'Residential_village',
       'Minimum_statistical_area', 'Minimum_statistical_area_center_point_X',
       'Minimum_statistical_area_center_point_Y', 'Primary_statistical_area',
       'Secondary_statistical_area', 'Infected_counties_and_cities',
       'Infect_township', 'Infected_village', 'Whether_to_move_abroad',
       'Infected_country', 'Determine_the_number_of_cases',
       'Residential_village_code', 'Infected_village_code', 'Serotype',
       'Ministry_of_the_Interior_resident_county_code', 'Hometown_code',
       'The_Ministry_of_the_Interior_is_infected_with_the_county_code'],
      dtype='object')

In [43]:
# Keep only the onset date, county of residence, and centroid coordinates.
# .copy() makes the result an independent DataFrame that is safe to modify.
df = df[['Onset_day', 'Living_county',
         'Minimum_statistical_area_center_point_X',
         'Minimum_statistical_area_center_point_Y']].copy()
df.head()

Unnamed: 0,Onset_day,Living_county,Minimum_statistical_area_center_point_X,Minimum_statistical_area_center_point_Y
0,1998/01/02,Pingtung County,120.505898941,22.46420665
1,1998/01/03,Pingtung County,120.45365746,22.466338948
2,1998/01/13,Yilan County,121.751433765,24.749214667
3,1998/01/15,Kaohsiung City,120.338158907,22.6303167
4,1998/01/20,Yilan County,121.798235373,24.684507639


In [44]:
# Shorten the column names.
df = df.rename(columns={
    'Minimum_statistical_area_center_point_X': 'x',
    'Minimum_statistical_area_center_point_Y': 'y',
    'Onset_day': 'date',
})

# Put the columns in the order used in the rest of the notebook.
df = df[['Living_county', 'x', 'y', 'date']]
df.head()

Unnamed: 0,Living_county,x,y,date
0,Pingtung County,120.505898941,22.46420665,1998/01/02
1,Pingtung County,120.45365746,22.466338948,1998/01/03
2,Yilan County,121.751433765,24.749214667,1998/01/13
3,Kaohsiung City,120.338158907,22.6303167,1998/01/15
4,Yilan County,121.798235373,24.684507639,1998/01/20


In [45]:
# Take an explicit copy of the Kaohsiung City rows so that later assignments
# modify `data` itself and do not trigger a SettingWithCopyWarning.
data = df.loc[df['Living_county'] == 'Kaohsiung City'].copy()

In [46]:
# Switch the date separator from '/' to '-' (e.g. 1998/01/15 becomes 1998-01-15).
data['date'] = data['date'].str.replace('/', '-')
data.head()


Unnamed: 0,Living_county,x,y,date
3,Kaohsiung City,120.338158907,22.6303167,1998-01-15
9,Kaohsiung City,120.313207729,22.724216594,1998-02-16
10,Kaohsiung City,120.34001159,22.607026822,1998-02-17
12,Kaohsiung City,120.326696134,22.588748367,1998-03-05
16,Kaohsiung City,120.318715566,22.586563562,1998-03-22
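
The filter-and-reformat step can be sketched on a small toy frame. The rows are taken from the outputs shown earlier; the names `toy` and `subset` are illustrative only. Taking an explicit `.copy()` of the filtered rows is one common way to make sure the assignment changes the subset and leaves the source frame alone.

```python
import pandas as pd

# Toy rows copied from the outputs above (dates still use '/').
toy = pd.DataFrame({
    'Living_county': ['Pingtung County', 'Kaohsiung City', 'Yilan County', 'Kaohsiung City'],
    'date': ['1998/01/02', '1998/01/15', '1998/01/20', '1998/02/16'],
})

# .copy() returns an independent DataFrame, so the assignment below
# modifies `subset` only and never writes back into `toy`.
subset = toy.loc[toy['Living_county'] == 'Kaohsiung City'].copy()

# regex=False treats '/' as a literal character rather than a pattern.
subset['date'] = subset['date'].str.replace('/', '-', regex=False)

print(subset['date'].tolist())
```

The original frame keeps its `/`-separated dates, which can be checked with `toy['date'].tolist()`.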


**Add a complete daily date range (including days with 0 new cases) so the data reflects its real temporal structure.**

In [47]:
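# One row per calendar day, from the first Kaohsiung City case shown above
# (1998-01-15) through 2019-02-01.
date_range = pd.DataFrame({'date': pd.date_range(date(1998, 1, 15), date(2019, 2, 1), freq='D')})

# Format as 'YYYY-MM-DD' strings so the merge key matches data['date'].
date_range['date'] = date_range['date'].dt.strftime('%Y-%m-%d')

# A left merge keeps every day; days without a case get NaN in the other columns.
df = pd.merge(date_range, data, how='left', on='date')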
df.head()

Unnamed: 0,date,Living_county,x,y
0,1998-01-15,Kaohsiung City,120.338158907,22.6303167
1,1998-01-16,,,
2,1998-01-17,,,
3,1998-01-18,,,
4,1998-01-19,,,
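
A minimal sketch of how the left merge behaves, using made-up toy records (the coordinates and the names `cases`, `days`, and `merged` are illustrative, not taken from the dataset). It shows two things: days with no case come through with missing values, and a day with several cases produces several rows, so the merged frame is not strictly one row per day.

```python
import pandas as pd

# Toy case records: two cases on 1998-01-15 and one on 1998-01-18.
cases = pd.DataFrame({
    'date': ['1998-01-15', '1998-01-15', '1998-01-18'],
    'x': [120.34, 120.31, 120.33],
    'y': [22.63, 22.72, 22.59],
})

# One row per calendar day, formatted as strings to match cases['date'].
days = pd.DataFrame({
    'date': pd.date_range('1998-01-15', '1998-01-19', freq='D').strftime('%Y-%m-%d')
})

# indicator=True adds a `_merge` column that marks rows with no matching case.
merged = pd.merge(days, cases, how='left', on='date', indicator=True)
empty_days = merged.loc[merged['_merge'] == 'left_only', 'date'].tolist()

print(len(merged))   # five days, but one day has two cases
print(empty_days)    # days that had no case
```

The `indicator` column is optional; it is used here only to make the unmatched days easy to list.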


In [48]:
df = df.fillna(0)

df.to_csv('case_data.csv')

In [49]:
df.head()

Unnamed: 0,date,Living_county,x,y
0,1998-01-15,Kaohsiung City,120.338158907,22.6303167
1,1998-01-16,0,0.0,0.0
2,1998-01-17,0,0.0,0.0
3,1998-01-18,0,0.0,0.0
4,1998-01-19,0,0.0,0.0
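
Note that `fillna(0)` also writes 0 into `Living_county`, `x`, and `y`, so a day without cases looks like a record located at (0, 0). If a later analysis needs the number of cases per day, one possible alternative is to count cases per date and reindex over the full range. This is a sketch under that assumption, not a step this notebook performs; the toy dates and the names `cases`, `all_days`, and `daily_counts` are hypothetical.

```python
import pandas as pd

# Toy case records: two cases on 1998-01-15 and one on 1998-01-18.
cases = pd.DataFrame({'date': ['1998-01-15', '1998-01-15', '1998-01-18']})

# Every calendar day in the period of interest, as 'YYYY-MM-DD' strings.
all_days = pd.date_range('1998-01-15', '1998-01-19', freq='D').strftime('%Y-%m-%d')

# Count cases per day, then add the missing days with an explicit count of 0.
daily_counts = (
    cases.groupby('date').size()
    .reindex(all_days, fill_value=0)
    .rename('cases')
)

print(daily_counts.tolist())
```

Here the zeros mean "no cases on this day" and the coordinate columns are never filled with placeholder values.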
