# Stochastic Prediction Model, <br><small>Case of Dengue Outbreak in Tainan, 2015</small>
<div align="right">cch/03/11/2019</div>

## Part 1

### Data Cleaning and Visualization

Data 
---
- [Tainan Government](http://data.tainan.gov.tw/dataset/denguefevercases): case data

- [Tainan Government](http://data.tainan.gov.tw/dataset/2015-df-mosquito-density): density of vector mosquitoes




Note
---
- vector mosquitoes: *Aedes aegypti* (yellow fever mosquito) and *Aedes albopictus* (Asian tiger mosquito)
- Breteau Index (Kaohsiung):<br>
    ```Breteau Index = 8.349 × ovitrap positive rate + 4.972```

- The data are recorded in **big5** encoding and have to be converted to **utf-8** in advance. Worse, some of the data are in **big5** and some are in **utf-8**.
  - LibreOffice:
    
    ```
    [Save As:] ...
    
    File type: Text csv(.csv)
      √ automatic file extensions
      √ Edit file setting
      
    --------------------
            Use TeXt CSV Format                    
    --------------------
    
    Character set: Unicode (UTF 8)
    ```
  - iconv   
    ```shell
    > iconv -c -f big5 -t utf-8 data.csv > data_utf8.csv
    ```
- Also rename the columns in English.
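Because the file mixes **big5** and **utf-8**, a single-pass conversion can lose lines. The following is a minimal sketch, not the method used to prepare this dataset: it decodes line by line, trying UTF-8 first and falling back to Big5. The function name `decode_mixed` and the sample bytes are hypothetical, and the approach is only a heuristic, since a Big5 line can occasionally also be valid UTF-8.

```python
def decode_mixed(raw: bytes, primary: str = "utf-8", fallback: str = "big5") -> str:
    """Decode bytes line by line, trying `primary` first and `fallback` second."""
    decoded_lines = []
    for line in raw.splitlines(keepends=True):
        try:
            decoded_lines.append(line.decode(primary))
        except UnicodeDecodeError:
            # Not valid UTF-8, so assume Big5; undecodable bytes are dropped,
            # similar in spirit to `iconv -c`.
            decoded_lines.append(line.decode(fallback, errors="ignore"))
    return "".join(decoded_lines)


# Hypothetical sample: a UTF-8 header line followed by a Big5 data line.
sample = "date,district\n".encode("utf-8") + "2015/08/01,永康區\n".encode("big5")
text = decode_mixed(sample)
```

The decoded text can then be written back out as UTF-8 and loaded with pandas as usual.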

In [None]:
from IPython.core.display import HTML
import qgrid
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import gmplot,folium
import seaborn as sns
from ipywidgets import interact,widgets,interactive
from IPython.display import display,clear_output
import warnings
warnings.filterwarnings('ignore')

sns.set()

%matplotlib inline

Ecosystem Info
---
Requirements, installed with `pip`:
- **gmplot**, a package that renders data on Google Maps. The bad news is that Google Maps is not fully free: you need an API key linked to your billing information. Otherwise, try the next option:
- **folium**, which builds on the data wrangling strengths of the Python ecosystem and the mapping strengths of the Leaflet.js library. Manipulate your data in Python, then visualize it in a Leaflet map via folium.
- **statsmodels**, a Python package that provides statistical utilities;
- **arch**, Autoregressive Conditional Heteroskedasticity (ARCH) tools for time series.

Packages used:

In [None]:
%load_ext watermark

In [None]:
%watermark -a "" -v -p numpy,scipy,seaborn,pandas,plotly,qgrid,ipywidgets,folium,matplotlib,notebook

In [None]:
qgrid.set_defaults( precision=4)

Data Sources
---
<a href="http://data.tainan.gov.tw/dataset/dengue-dist">Tainan City Government</a>

Subjects
---

Load native data,

In [None]:
# DataFrame.from_csv was removed in pandas 1.0; read_csv takes the same arguments here
df = pd.read_csv('data/test.csv',index_col=0,parse_dates=[0],encoding="utf-8")

In [None]:
qgrid.show_grid(df)

In [None]:
# parse the date column (e.g. YYYY/MM/DD strings) into pandas datetimes
df['date']=pd.to_datetime(df['date'])

# before summing the number of suspected cases per day, add a new column, num=1
df['num']=1

In [None]:
df.tail(3)

In [None]:
import folium
#from folium.element import IFrame
from folium.plugins import MarkerCluster,CirclePattern

GeoData
===
Folium is used in place of gmplot because it is free to use.

Data Usages
---
- `df['feature']`: the data listed as a table;
     index   'feature'
       1       A
       2       B
       :      ...
- `df['feature'].values`: the same data as an array, [A, B, ...].
-  `list(zip(lats1, lons1))`: pairs up two sequences element by element;
    lats1=[A1,A2,A3,...]
    lons1=[B1,B2,B3,...]
         ⬇︎ zip
      (A1,B1),(A2,B2),...
         ⬇︎ list
      [(A1,B1),(A2,B2),...]
-  `df['date'].dt.strftime('%Y-%m-%d')`: date conversion to `YYYY-MM-DD`, for instance 2019-03-23.
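The usages listed above can be tried on a small table. This is only an illustrative sketch: the `toy` dataframe and its coordinates are invented, while the column names `date`, `latitude` and `longitude` follow the ones used in this notebook.

```python
import pandas as pd

# Invented two-row table with the same column names as the case data.
toy = pd.DataFrame({
    "date": pd.to_datetime(["2015/08/01", "2015/08/02"]),
    "latitude": [22.99, 23.01],
    "longitude": [120.21, 120.23],
})

lats = toy["latitude"].values    # array-like view of one column
lons = toy["longitude"].values

# zip pairs the two arrays element by element; list materializes the pairs.
pairs = list(zip(lats, lons))

# Format each datetime as a YYYY-MM-DD string.
labels = toy["date"].dt.strftime("%Y-%m-%d").tolist()
```

Here `pairs` holds one (latitude, longitude) tuple per row, which is the order folium expects for marker locations.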

In [None]:
lats1=df['latitude'].values
lons1=df['longitude'].values
# folium expects each location as (latitude, longitude)
locations1 = list(zip(lats1, lons1))

dates1=df['date'].dt.strftime('%Y-%m-%d')
dates2=dates1.values

In [None]:
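# wait patiently
# folium expects the map center as (latitude, longitude)
Tainan_COORDINATES = ( df['latitude'].mean(), df['longitude'].mean()) 

map = folium.Map(location=Tainan_COORDINATES, 
                 zoom_start=10)
# add_children is deprecated in folium; add_child is its replacement
map.add_child(MarkerCluster(locations=locations1, popups=dates2))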

In [None]:
# display(map)
map.save('Taiwanfolium.html')

In [None]:
from scipy import stats
import numpy as np


In [None]:
df.columns

In [None]:
def df_zscore(df):
    z = np.abs(stats.zscore(df))
    #df['Z']=z
    #df['Z'].plot(kind='box')
    #print(z) 
    threshold = 10
    print(np.where(z > threshold))
    print(df.shape)

In [None]:
df_zscore(df[['latitude','longitude']])

In [None]:
def df_zscore_iqr(df):
    Q1 = df.quantile(0.25)
    Q3 = df.quantile(0.75)
    IQR = Q3 - Q1
    print(Q1)
    df = df[~((df < (Q1 - 1.5 * IQR)) |(df > (Q3 + 1.5 * IQR))).any(axis=1)]
    #print(df < (Q1 - 1.5 * IQR)) |(df > (Q3 + 1.5 * IQR))
    print(df.shape)

In [None]:
df_zscore_iqr(df[['latitude','longitude']])

In [None]:
# let us focus on the cases in each district
cases=df[['date','num','district']]
cases.tail(3)

In [None]:
df.to_csv("data/tainan-2016-native.csv")

To make qgrid work again (with ipywidgets 6.0.0+), change
```
qgrid.show_grid(df) 
  ⬇︎
qgrid.show_grid(df,show_toolbar=True, grid_options={'forceFitColumns': False, 'defaultColumnWidth': 200})

```
Also, newer Jupyter Notebook versions limit the IOPub data rate; raise the limit in the configuration file, 
**jupyter_notebook_config.py**, as follows:
```
c.NotebookApp.iopub_data_rate_limit=10000000000
```

Note
---
The rapid development of ipywidgets brings many advantages for data display, flexibility, and interactivity in the browser. The downside is that its frequent updates also affect third-party packages: **qgrid-0.3.2** no longer works with IPython > 6!

<del>qgrid 0.3.3</del> 
---
Solved:
1. Modify `$anaconda/etc/jupyter/nbconfig/notebook.json` as follows:
```
{
  "load_extensions": {
     ...
     "qgrid/qgridjs/qgrid.widget": true
   }
}
```

2. Make sure qgrid has been copied to `$Anaconda/share/jupyter/nbextensions` under the name `qgrid` (not `qgridjs`) after executing
```
shell> jupyter nbextension install --py --sys-prefix qgrid
shell> jupyter nbextension enable --py --sys-prefix qgrid
```
   

In [None]:
qgrid.show_grid(df,show_toolbar=True, grid_options={'forceFitColumns': False, 'defaultColumnWidth': 100})


Accumulated Number of Cases
---

The next cells sum the cases for each date.
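As a small illustration of what summing by date does, here is a sketch on invented data (the `toy_cases` table is hypothetical; only the column names `date` and `num` come from this notebook). It also shows `cumsum`, an addition not used in the notebook, in case a running total over the outbreak is wanted rather than per-day totals.

```python
import pandas as pd

# Invented records: one row per reported case, each with num=1.
toy_cases = pd.DataFrame({
    "date": pd.to_datetime(["2015-08-01", "2015-08-01", "2015-08-02", "2015-08-04"]),
    "num": 1,
})

# Per-day totals: rows sharing a date are summed together.
daily = toy_cases.groupby("date")["num"].sum()

# Running total across days.
running = daily.cumsum()
```

Note that days without any reported case simply do not appear in `daily`; they are not filled with zeros.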

In [None]:
#  grouped data with respect to the date
cases_group = cases.groupby('date');
#cases_group.size()

In [None]:
cases_group

In [None]:
# calculate the total number of cases for each date

# sum only the numeric 'num' column; 'district' holds strings
cases_totals = cases_group[['num']].sum()
#cases_totals.sort_values(by='num').head()
cases_totals.tail()

In [None]:
cases_totals.to_csv("data/taiwan-2016-by-date.csv")

First Look at Data Visualization
---

In [None]:
my_plot = cases_totals.plot(kind='area',figsize=[12,4])

In [None]:
# or this one is better
my_plot = cases_totals.plot(drawstyle='steps',figsize=[12,4])

For public health policy, the exact number of infections in each individual district matters more than the total over the whole area.
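The per-district breakdown can be sketched on invented data as follows. The `toy_cases` table and the district names `North` and `East` are hypothetical; the grouping by `date` and `district` and the `xs` lookup mirror the calls used in the cells of this notebook.

```python
import pandas as pd

# Invented records: one row per reported case.
toy_cases = pd.DataFrame({
    "date": pd.to_datetime(["2015-08-01", "2015-08-01", "2015-08-01", "2015-08-02"]),
    "district": ["North", "North", "East", "East"],
    "num": 1,
})

# Group on both keys; the result is indexed by a (date, district) MultiIndex.
by_district = toy_cases.groupby(["date", "district"])[["num"]].sum()

# xs selects one district and leaves a date-indexed table for that district.
east = by_district.xs("East", level="district")
```

Selecting with `xs` keeps the date index, so the result can be plotted directly as a time series for the chosen district.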

In [None]:
#  grouped data with respect to the date and district
cases_group1 = cases.groupby(['date','district']);
cases_1 = cases.groupby(['date','district'])[['num']].sum();

In [None]:
# retrieve the set of distinct district names in the data
district=df.district.tolist()
ndistrict=set(district)

print(ndistrict,len(ndistrict))

In [None]:
# to list
ndistrict=list(ndistrict)

In [None]:
from matplotlib.font_manager import FontProperties
#myFont = FontProperties('/Users/cch/Downloads/kaiu.ttf')
# change font for Traditional Chinese Language
plt.rcParams['font.sans-serif'] = ['LiHei Pro']
plt.rcParams['font.size']=18

myFont = FontProperties('AR PL New Kai')
font_chinese = FontProperties(fname="/Users/cch/Documents/2017/jeibaChinese/fireflysung.ttf")

In [None]:
#plt.rcParams['axes.unicode_minus'] = False

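plt.figure(figsize=(12,5))
# one line per district; labels are the district names from the data
for k in ndistrict:
    plt.plot(cases_1.xs(k,level='district'),label=k)
    #plt.legend(k)
# set the title once, after all lines are drawn
plt.title(u"Number of cases by district",fontsize=30,fontproperties = font_chinese)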

Flexible Theme 
---
Using plotly together with ipywidgets, we build a quick app that displays the plots above:

1. district selection menu:
```
    widgets.Dropdown(
            options=ndistrict,       # all districts in Tainan
            value=ndistrict[9],      # initial option is the 10th district
            description='District:', 
    )
```

In [None]:
# define dropdown options
w = widgets.Dropdown(
    options=ndistrict,
    value=ndistrict[9],
    description='District:',
)

# Matplotlib version, make plot for selected district

def plot_district_case(district_area):   
    plt.rcParams['font.family'] = 'AppleGothic' 
    plt.figure(figsize=(12,6))
    plt.plot(cases_1.xs(district_area,level='district'),drawstyle='steps')
    #plt.legend(k)
    plt.title(district_area+" cases",fontproperties = font_chinese) ;
    #plt.ylim([1,160])
    plt.xticks(rotation=90)
    clear_output(True)    

In [None]:
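# plotly.plotly was moved to the chart_studio package in plotly 4; it is not needed for offline plots
#import plotly.plotly as py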
from plotly.offline import init_notebook_mode, plot, iplot
import plotly.graph_objs as go
import cufflinks
# offline mode
cufflinks.go_offline()
# Set the global theme for cufflinks
cufflinks.set_config_file(world_readable=True, theme='pearl', offline=True)

In [None]:
def plotly_district_case(district_area):   
    # daily case counts for the selected district
    district_cases = cases_1.xs(district_area,level='district')

    title_= district_area+' number of cases'
    district_cases.iplot(title=title_, 
                         xTitle='Date',
                         yTitle='Number of cases')
    clear_output(True)

In [None]:
# try yourself
tool=interact(plot_district_case, district_area=w);
display(tool)

In [None]:
tool=interact(plotly_district_case, district_area=w);
display(tool)

Mosquito Data
---
Without doubt, vector mosquitoes are the main carriers of the infection!

In [None]:
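# DataFrame.from_csv was removed in pandas 1.0; parse_dates=True parses the date index as it did
df_mosquito2 = pd.read_csv('data/mosquito-re1.csv',index_col='date',parse_dates=True,encoding="utf-8")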

In [None]:
qgrid.show_grid(df_mosquito2)

Note
---
`df_mosquito2['BIndex']` is not arranged in date order; to make it work correctly, we have to sort by the `date` index, which is done as follows:
```
df_mosquito2['BIndex'].sort_index()
```
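A minimal sketch of the effect, on invented values (the `toy_mosquito` table is hypothetical; only the `BIndex` and `district` column names and the `date` index name come from this notebook):

```python
import pandas as pd

# Invented BIndex readings whose dates are deliberately out of order.
toy_mosquito = pd.DataFrame(
    {"district": ["East", "East", "East"], "BIndex": [3.0, 1.0, 2.0]},
    index=pd.to_datetime(["2015-09-03", "2015-09-01", "2015-09-02"]),
)
toy_mosquito.index.name = "date"

# sort_index reorders the rows by the date index and returns a new Series;
# the original dataframe is left unchanged.
ordered = toy_mosquito["BIndex"].sort_index()
```

Without the sort, a line plot would connect the points in file order and zigzag back and forth in time.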

In [None]:
# Sorted by date index, though one plot per district is still a little cluttered
for DistName in ndistrict:    
    df_mosquito2[df_mosquito2['district']==DistName]['BIndex'].sort_index().iplot(title=DistName)
    

We use ipywidgets again:


In [None]:
layout1 = cufflinks.Layout(
    title='BIndex',
    height=250,
    width=800
)
layout2 = cufflinks.Layout(
    title='Number of cases',
    height=300,
    width=800
)
# Select the district with the help of ipywidgets again
def plotly_BIndex(district_area):
    title_= district_area+' BIndex'
    p1=df_mosquito2[df_mosquito2['district']==district_area]['BIndex'].sort_index().iplot(title=title_,layout=layout1.to_plotly_json())
    
    #case_size=len(cases_1.xs(district_area,level='district'))

    title2_= district_area+' number of cases'
    p2=cases_1.xs(district_area,level='district').iplot(title=title2_, xTitle='Date',yTitle='Number of cases',layout=layout2.to_plotly_json())

    clear_output(True)

In [None]:
tool=interact(plotly_BIndex, district_area=w);
display(tool)

Binding Together
---
Place these two plots together to observe the relation between BIndex and the epidemic:

In [None]:
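# plotly.plotly was moved to the chart_studio package in plotly 4; it is not needed for offline plots
#import plotly.plotly as py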
from plotly.offline import init_notebook_mode, plot, iplot
import plotly.graph_objs as go

In [None]:
# look at the BIndex (mosquito) dataframe
district_area='永康區'
df_mos=df_mosquito2[df_mosquito2['district']==district_area].sort_index()

In [None]:
df_mos[:5]

In [None]:
# and the infected-case dataframe, which has only one feature
p2=cases_1.xs(district_area,level='district')
p2.head()

Now let's build the figure that puts the *infected* dataframe and the *BIndex* dataframe together:

In [None]:
def plotly_BIndex_num(district_area):
    title_= district_area+' number of cases, BIndex'
    df_mos=df_mosquito2[df_mosquito2['district']==district_area].sort_index()
    p2=cases_1.xs(district_area,level='district')
    
    trace_mosquitos = go.Scatter(
                x=df_mos.index,
                y=df_mos.BIndex,
                name = "BIndex",
                line = dict(color = '#17BECF'),
                opacity = 0.8)
    trace_cases=go.Scatter(
                x=p2.index,
                y=p2.num,
                name ='Number of cases',
                line = dict(color = '#7F7F7F'),
                opacity = 0.8)
    data = [trace_mosquitos,trace_cases]
    
    layout = dict(
                  title = title_,
                  xaxis=dict(
                             rangeselector=dict(
                                  buttons=list([
                                          dict(count=1,label='1m',step='month',stepmode='backward'),
                                          dict(count=3,label='3m',step='month',stepmode='backward'),
                                          dict(step='all')
                                  ])
                              ),
                              rangeslider=dict(visible = True),
                              type='date'
                  )
             )
    fig = go.Figure(data=data, layout=layout)
    iplot(fig, filename = "Infected_Cases_and_BIndex")
    clear_output(True)

In [None]:
tool2=interact(plotly_BIndex_num, district_area=w);
display(tool2)

In [None]:
district_area='永康區'
df_mos=df_mosquito2[df_mosquito2['district']==district_area].sort_index()
p2=cases_1.xs(district_area,level='district')

trace_mosquitos = go.Scatter(
                x=df_mos.index,
                y=df_mos.BIndex,
                name = "BIndex",
                line = dict(color = '#17BECF'),
                opacity = 0.8)
trace_cases=go.Scatter(
                x=p2.index,
                y=p2.num,
                name = "infected number",
                line = dict(color = '#7F7F7F'),
                opacity = 0.8)
data = [trace_mosquitos,trace_cases]

layout = dict(
    title = "Infected Cases and BIndex",
    #xaxis = dict(
    #range = ['2016-07-01','2016-12-31'])
)

fig = go.Figure(data=data, layout=layout)
plot(fig, filename = "Dengue_district_Bindex")