# Capacity Building
## Prerequisites
Some basic understanding of Python variables, data types, looping, conditionals and functions will be of benefit.
## Data inputs
### Imports

Let's import some modules. A module is a library of Python code that we can leverage to provide useful functionality.<br> Modules may be part of the standard Python library, or they may be external packages.

In [None]:
# Install the summer package
# Pip is Python's standard package manager

%pip install summerepi

In [None]:
# Python standard library imports come first
import io  # To hold the downloaded Google Drive zip file in memory
from datetime import datetime, timedelta  # We use datetime to manipulate date-time indexes
from zipfile import ZipFile  # To manage zip files

# Then external package imports
import numpy as np
import pandas as pd  # pd is an alias for pandas. A DataFrame is similar to a data frame in R
import requests  # To download Google Drive files
from matplotlib import pyplot as plt  # matplotlib is a common visualisation package for Python



In [None]:
# We'll do a bit of global setup here too - let's set a plotting style we like (this can easily be omitted)
plt.style.use("ggplot")
# Try just typing plt.s (or similar) and pressing tab (or shift-tab on Colab) to see what's available within plt

Try: There's a variable inside plt.style that contains the list of available styles. Change the plotting style to something you like.

### Define constants and useful variables
Defining and capitalising constants at the start of a Python script or module is a common convention.<br>
Only do this for variables that will never change during runtime.

*Note the two links will change over time and should be verified before running the notebook.*
The daily data link is available from this [link](https://drive.google.com/drive/folders/1ZPPcVU4M7T-dtRyUceb0pMAd8ickYf8o)

In [None]:
# Google Drive file IDs for the Department of Health (DoH) data repository.
# What is the data type here, a tuple or string? Do you know how to check for the type?

# Shareable google drive links
PHL_DOH_LINK = "1hlHG7gIOz_n8kRVBnMDx3ZxdwsEAamLe"  # sheet 05 daily report.
PHL_FASSSTER_LINK = "15eDyTjXng2Zh38DVhmeNy0nQSqOMlGj3" # Fassster google drive zip file.


# We define a day zero for the analysis.
COVID_BASE_DATE = datetime(2019, 12, 31)

# By defining a region variable, we can easily change the analysis later.
region = "NATIONAL CAPITAL REGION (NCR)"



### Utility functions

In [None]:
def fetch_doh_data(link: str) -> pd.DataFrame:
    """Download the DoH sheet 05 daily report from the Google Drive data repository.

    Args:
        link (str): The shareable link code (the Google Drive file ID).

    Returns:
        pd.DataFrame: A data frame containing the hospital data.
    """

    # Build the direct-download URL from the file ID
    url = f"https://drive.google.com/uc?id={link}&export=download&confirm=t"
    df = pd.read_csv(url)

    return df

Now call the function and pass it the DoH link code.

In [None]:
doh_df = fetch_doh_data(PHL_DOH_LINK)
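The way the download address is assembled from a file ID can be checked without touching the network. This is a minimal sketch: `build_drive_download_url` is a hypothetical helper (not part of the notebook), and the file ID is a placeholder rather than a real document.

```python
def build_drive_download_url(file_id: str) -> str:
    """Hypothetical helper: build a direct-download URL from a Google Drive file ID."""
    return f"https://drive.google.com/uc?id={file_id}&export=download&confirm=t"


# Placeholder ID for illustration only; it does not point at a real file
example_url = build_drive_download_url("FILE_ID_GOES_HERE")
print(example_url)

# type() answers the earlier question about the data type of the link constants
print(type(example_url))
```

Keeping the ID separate from the URL template means that only the constant needs updating when the shared file changes.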

In [None]:
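def get_fassster_data(link: str) -> pd.DataFrame:
    """Download a Google Drive zip file and extract its data in memory.

    Args:
        link (str): The shareable link code (the Google Drive file ID).

    Returns:
        pd.DataFrame: A data frame of individual cases for all regions.
    """

    url = f"https://drive.google.com/uc?id={link}&export=download&confirm=t"
    req = requests.get(url)
    req.raise_for_status()  # Stop early if the download failed
    file_like_object = io.BytesIO(req.content)  # Treat the downloaded bytes as a file
    zipfile_ob = ZipFile(file_like_object)
    # Keep only the files whose names start with "2022"
    filenames = [
        each for each in zipfile_ob.namelist() if each.startswith("2022")
    ]
    if not filenames:
        raise FileNotFoundError("No file starting with '2022' found in the zip archive")
    df = pd.read_csv(zipfile_ob.open(filenames[0]))
    return df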


In [None]:
fas_df = get_fassster_data(PHL_FASSSTER_LINK)

Well done! We have downloaded the Philippines' regional COVID-19 datasets into two dataframes.
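The in-memory zip handling used in `get_fassster_data` can be tried without a network connection. The sketch below builds a tiny archive in memory to stand in for the downloaded bytes; the file names and CSV contents are made up for illustration.

```python
import io
from zipfile import ZipFile

import pandas as pd

# Build a small zip archive in memory (a stand-in for the downloaded content)
buffer = io.BytesIO()
with ZipFile(buffer, "w") as archive:
    archive.writestr("2022_cases.csv", "Report_Date,Region\n2022-01-01,NCR\n2022-01-01,NCR\n")
    archive.writestr("readme.txt", "not a data file")

# Re-open the same bytes for reading, without writing anything to disk
with ZipFile(io.BytesIO(buffer.getvalue())) as archive:
    matches = [name for name in archive.namelist() if name.startswith("2022")]
    if not matches:
        raise FileNotFoundError("No file starting with '2022' in the archive")
    with archive.open(matches[0]) as csv_file:
        toy_cases = pd.read_csv(csv_file)

print(toy_cases)
```

Checking that `matches` is not empty before indexing gives a clearer error than an `IndexError` if the archive layout changes.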

In [None]:
df_cases = fas_df

Let's tidy up the Fassster dataset a bit.

In [None]:
df_cases = df_cases.groupby(["Report_Date","Region"],as_index=False).size() # Because each row is a case we can use the group size.
df_cases = df_cases[df_cases['Region']=="NCR"] # Filter for NCR cases
df_cases = df_cases.rename(columns={"Report_Date":"reportdate", "size":"cases"}) # Rename columns to match DoH names.
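Counting rows per group is what turns a line list (one row per case) into daily case counts. A minimal sketch on made-up data, using the same column names as the cell above:

```python
import pandas as pd

# Made-up line list: one row per reported case
line_list = pd.DataFrame(
    {
        "Report_Date": ["2022-01-01", "2022-01-01", "2022-01-02", "2022-01-02"],
        "Region": ["NCR", "NCR", "NCR", "CAR"],
    }
)

# The size of each (date, region) group is the number of cases reported
daily_counts = line_list.groupby(["Report_Date", "Region"], as_index=False).size()

# Keep one region and rename the columns
ncr_counts = daily_counts[daily_counts["Region"] == "NCR"].rename(
    columns={"Report_Date": "reportdate", "size": "cases"}
)
print(ncr_counts)
```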

In [None]:
# Because this particular dataframe is too big to easily inspect,
# we might want to look at parts of it (e.g. the column names)
doh_df.columns

Each column is explained in this metadata [file](https://drive.google.com/file/d/1NdFiDTR6Q_CSvy45uh7lUiMC7XRFsrpV/view?usp=sharing) <br>(*This link will also change over time. Please verify with the daily data [link](https://drive.google.com/drive/folders/1ZPPcVU4M7T-dtRyUceb0pMAd8ickYf8o) before running*).

We need to do some housekeeping.
- Select a subset of the columns.
- Ensure the date type is correct and not a string '10-06-2022'
- Aggregate the hospital and ICU occupancy to daily counts.
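The three housekeeping steps can be sketched on a toy frame first. The values and the `extra` column are invented for illustration; only the other column names follow the DoH dataset used in this notebook.

```python
import pandas as pd

# Invented facility-level records, two facilities reporting on the first day
toy = pd.DataFrame(
    {
        "reportdate": ["2022-06-10", "2022-06-10", "2022-06-11"],
        "region": ["NCR", "NCR", "NCR"],
        "cfname": ["Hospital A", "Hospital B", "Hospital A"],
        "nonicu_o": [10, 5, 12],
        "icu_o": [2, 1, 3],
        "extra": [0, 0, 0],
    }
)

# 1. Select a subset of the columns (copy so later assignments are safe)
toy = toy[["reportdate", "region", "cfname", "nonicu_o", "icu_o"]].copy()

# 2. Convert the date strings to proper datetimes
toy["reportdate"] = pd.to_datetime(toy["reportdate"])

# 3. Aggregate the occupancy columns to one row per day and region
daily = toy.groupby(["reportdate", "region"], as_index=False)[["nonicu_o", "icu_o"]].sum()
print(daily)
```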

In [None]:
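# Take a copy of the selected columns so that assigning to it below does not raise a SettingWithCopyWarning
doh_df = doh_df[['reportdate', 'region', 'cfname', 'nonicu_o', 'icu_o']].copy()
doh_df["reportdate"] = pd.to_datetime(doh_df["reportdate"]).dt.tz_localize(None)
doh_df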


In [None]:
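# Sum only the numeric occupancy columns; cfname (the facility name) is text and should not be summed
doh_df = doh_df.groupby(['reportdate', 'region'], as_index=False)[['nonicu_o', 'icu_o']].sum()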
doh_df

In [None]:
#doh_df['date_index'] = (doh_df.reportdate - COVID_BASE_DATE).dt.days
#doh_df

Let's create a boolean mask to aid with our analysis. Recall the `region` variable we set at the beginning.

In this example, the mask selects the Philippines' National Capital Region. By changing the `region` variable, we can change the focus of the analysis.
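A boolean mask is a Series of `True`/`False` values, one per row, that pandas uses to keep or drop rows. A minimal sketch on invented data (`toy_region` is a stand-in for the notebook's `region` variable):

```python
import pandas as pd

toy = pd.DataFrame({"region": ["NCR", "CAR", "NCR"], "icu_o": [3, 1, 4]})
toy_region = "NCR"

# Comparing a column with a value gives one boolean per row
toy_mask = toy["region"] == toy_region
print(toy_mask)

# Indexing with the mask keeps only the rows where it is True
filtered = toy[toy_mask]
print(filtered)
```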

In [None]:
mask = (doh_df['region'] == region)

In [None]:
doh_df[mask] # Notice how the region is NCR due to the filtering

In [None]:
doh_df = doh_df[mask]

In [None]:
df_cases['reportdate'] = pd.to_datetime(df_cases['reportdate'])

Two become one. More details about this can be found [here](https://pandas.pydata.org/pandas-docs/stable/getting_started/intro_tutorials/08_combine_dataframes.html#join-tables-using-a-common-identifier)
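As a small illustration of a left merge on a shared date column, the sketch below uses invented numbers. Rows from the left frame are always kept, and dates missing from the right frame get `NaN`.

```python
import pandas as pd

occupancy = pd.DataFrame(
    {
        "reportdate": pd.to_datetime(["2022-01-01", "2022-01-02", "2022-01-03"]),
        "icu_o": [3, 4, 5],
    }
)
cases = pd.DataFrame(
    {
        "reportdate": pd.to_datetime(["2022-01-01", "2022-01-03"]),
        "cases": [20, 30],
    }
)

# how="left" keeps every row of `occupancy`, matched on `reportdate`
merged = pd.merge(occupancy, cases, how="left", on="reportdate")
print(merged)
```

Both frames need the same dtype in the join column, which is why the case dates are converted with `pd.to_datetime` before merging.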

In [None]:
df = pd.merge(doh_df,df_cases, how='left', on='reportdate')

In [None]:
# Set the index of this DataFrame to use calendar dates
df = df.set_index('reportdate')

After all that work, let's look at the results.<br />
Pandas has a .plot() function. Here is a [quick](https://pandas.pydata.org/docs/getting_started/intro_tutorials/04_plotting.html) or [detailed](https://pandas.pydata.org/pandas-docs/stable/user_guide/visualization.html?highlight=plot) tutorial.<br />
If you create the `date_index` column (commented out above), you can also pass `x='date_index'` and set `y` to any of the plotted columns.

In [None]:
df[['nonicu_o', 'icu_o','cases']].plot(figsize=(20, 10));  # The semicolon suppresses the printing of the name of the object that was created in this line

In [None]:
# We might prefer to plot this as points rather than a line, given each entry is an observation
df[['nonicu_o', 'icu_o','cases']].plot(figsize=(20, 10), marker='o', linewidth=0);

Let's also download the latest population distributions from our GitHub repository.

In [None]:
population_url = 'https://github.com/monash-emu/AuTuMN/raw/master/data/inputs/world-population/subregions.csv'
df_pop = pd.read_csv(population_url)
df_pop = df_pop[df_pop['region']=='Metro Manila']

In [None]:
df_pop = df_pop.melt(id_vars=['country','iso3','region','year'], var_name='age_group', value_name='pop')
df_pop = df_pop[['region','pop']].groupby('region',as_index=False).sum()
initial_population = df_pop['pop'].iloc[0] * 1000  # The file stores population in thousands, so multiply by 1000 to convert it back to counts.

In [None]:
initial_population
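The reshaping used to obtain `initial_population` can be sketched on a one-row toy table. The age-band column names (`0-4`, `5-9`) and the values are assumptions made for illustration; the real file may use different labels.

```python
import pandas as pd

# Invented wide table: one column per age band, values in thousands
wide = pd.DataFrame(
    {
        "country": ["Philippines"],
        "iso3": ["PHL"],
        "region": ["Metro Manila"],
        "year": [2020],
        "0-4": [1.5],
        "5-9": [2.0],
    }
)

# melt turns the age-band columns into rows (wide to long format)
long = wide.melt(
    id_vars=["country", "iso3", "region", "year"],
    var_name="age_group",
    value_name="pop",
)

# Sum over the age groups, then convert thousands back to counts
total = long[["region", "pop"]].groupby("region", as_index=False).sum()
toy_population = total["pop"].iloc[0] * 1000
print(toy_population)
```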

## Basic model introduction

This page introduces the processes for building and running a simple compartmental disease model with `summer`.
In the following example, we will create an SEIR compartmental model for a general, unspecified emerging infectious disease spreading through a fully susceptible population. In this model there will be:

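- four compartments: susceptible (S), exposed (E), infected (I) and recovered (R)
- a starting population of 100,000 susceptible people, plus 100 who are infected (and infectious); the regional population is used later on
- an evaluation timespan of 300 days, from `start_date` to `end_date`, expressed as days since `COVID_BASE_DATE` (no time step is set in the code below, so the library default applies)
- inter-compartmental flows for infection, progression and recovery (a death flow is included only as a commented-out line)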

You may wish to give the compartments more descriptive names, which is actually what we usually do when building these models.
First, let's look at a complete example of this model in action, and then examine the details of each step. This is the complete example model that we will be working with:

In [None]:
import numpy as np
from summer import CompartmentalModel

start_date = datetime(2021,1,1)  # Define the start date
end_date = start_date + timedelta(days=300)  # Define the duration

# Integer representation of the start and end dates.
start_date_int = (start_date - COVID_BASE_DATE).days
end_date_int = (end_date- COVID_BASE_DATE).days
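Converting between calendar dates and integer day offsets works in both directions. A minimal sketch using the same base date as `COVID_BASE_DATE` (the variable names here are illustrative):

```python
from datetime import datetime, timedelta

base_date = datetime(2019, 12, 31)
start = datetime(2021, 1, 1)

# Subtracting two datetimes gives a timedelta; .days is the whole-day offset
start_int = (start - base_date).days
print(start_int)

# Adding the offset back to the base date recovers the calendar date
recovered = base_date + timedelta(days=start_int)
print(recovered)
```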

In [None]:
# Define the model compartments and time span.
basic_model = CompartmentalModel(
    times=(start_date_int, end_date_int),
    compartments=["S", "E", "I", "R"],
    infectious_compartments=["I"],
    ref_date=COVID_BASE_DATE
)

In [None]:
# Define the initial population and compartmental flows.
basic_model.set_initial_population(distribution={"S": 100000, "E": 0, "I": 100})
basic_model.add_infection_frequency_flow(name="exposure", contact_rate=0.12, source="S", dest="E")
basic_model.add_transition_flow(name="infection", fractional_rate=1/15., source="E", dest="I")
basic_model.add_transition_flow(name="recovery", fractional_rate=0.04, source="I", dest="R")
# basic_model.add_death_flow(name="infection_death", death_rate=0.05, source="I")

# Run the model
basic_model.run()

Our CompartmentalModel object has many methods defined on it. You are encouraged to explore these methods as this object is integral to the platform.

In [None]:
output_df = basic_model.get_outputs_df()

We now have a Pandas dataframe of compartment sizes at each time step.

In [None]:
output_df.head(20)

Extract the target data (daily cases) from the merged dataframe.

In [None]:
df

In [None]:
df[start_date:end_date]

In [None]:
target = df[start_date:end_date]['cases']
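Slicing by dates works because the index holds datetimes. A minimal sketch on an invented series shows that, unlike ordinary Python slices, both end points are included:

```python
from datetime import datetime

import pandas as pd

# Invented daily series with a DatetimeIndex
dates = pd.date_range("2021-01-01", periods=5, freq="D")
series = pd.Series([1, 2, 3, 4, 5], index=dates)

# Label-based slicing on a DatetimeIndex includes both the start and the end
window = series.loc[datetime(2021, 1, 2):datetime(2021, 1, 4)]
print(window)
```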

Useful Matplotlib [guide](https://matplotlib.org/stable/tutorials/introductory/usage.html#sphx-glr-tutorials-introductory-usage-py)

In [None]:
# Visualize the results
subplot = {"title": "SEIR Model Outputs", "xlabel": "Days", "ylabel": "Compartment size"} # A dictionary of key:values pairs that matplotlib will use to label items.
fig, ax = plt.subplots(1, 1, figsize=(12, 6), dpi=120, subplot_kw=subplot) # Create a subplot object.

for compartment in output_df:  # Loop over each compartment
    ax.plot(basic_model.times, output_df[compartment])  # Plot the times and compartment values

ax.legend(["S", "E", "I", "R"]);


Now let's inspect each step of the example in more detail. To start, we create a new [CompartmentalModel](http://summerepi.com/api/model.html) object (the summer library was already imported above). Our model has an attribute called `compartments`, which contains a description of each modelled compartment.

In [None]:
# Define the model
philippines_model = CompartmentalModel(
    times=(start_date_int, end_date_int),
    compartments=["S", "E", "I", "R"],
    infectious_compartments=["I"],
    ref_date=COVID_BASE_DATE
)

### Adding a population 

Initially the model compartments are all empty. Let's add:

- the regional population obtained earlier (`initial_population`), less the seeded infections, to the susceptible (S) compartment, plus
- 1000 people in the infectious (I) compartment.

In [None]:
# Add people to the model
# We'll use the initial_population variable we obtained from the population data earlier
philippines_model.set_initial_population(distribution={"S": initial_population - 1000, "E": 0, "I": 1000})

# View the initial population
philippines_model.initial_population

### Adding inter-compartmental flows 

Now, let's add some flows for people to transition between the compartments. These flows will define the dynamics of our infection. We will add:

- an infection flow from S to E (using frequency-dependent transmission)
- a progression flow from E to I (exposed individuals become infectious)
- a recovery flow from I to R
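Written as equations, these three flows correspond to the standard SEIR system below. This assumes that summer's frequency-dependent infection flow takes the usual form $\beta S I / N$; the symbols are introduced here for illustration, with $\beta$ the contact rate, $\sigma$ the progression rate, $\gamma$ the recovery rate and $N = S + E + I + R$.

$$
\begin{aligned}
\frac{dS}{dt} &= -\beta \frac{S I}{N} \\
\frac{dE}{dt} &= \beta \frac{S I}{N} - \sigma E \\
\frac{dI}{dt} &= \sigma E - \gamma I \\
\frac{dR}{dt} &= \gamma I
\end{aligned}
$$

The four right-hand sides sum to zero, so the total population $N$ stays constant when there is no death flow.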

In [None]:
# Susceptible people can get infected.
philippines_model.add_infection_frequency_flow(name="infection", contact_rate=0.18, source="S", dest="E")

# Exposed people transition to infected.
philippines_model.add_transition_flow(name="progression", fractional_rate=1/15, source="E", dest="I")

# Infectious people recover.
philippines_model.add_transition_flow(name="recovery", fractional_rate=0.04, source="I", dest="R")

# Importantly, we will also request an output for the 'progression' flow, and name this 'notifications'
# This will be available after a model run using the get_derived_outputs_df() method

philippines_model.request_output_for_flow("notifications", "progression")

# Inspect the new flows, which we just added to the model.
philippines_model._flows



### Running the model

Now we can calculate the outputs for the model over the requested time period. 
The model calculates the compartment sizes by solving a system of differential equations (defined by the flows we just added) over the requested time period.
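To make the idea of solving the flow equations concrete, here is a hand-rolled forward-Euler sketch in NumPy. It is not how summer solves the system internally; the function name, the step size `dt` and the assumed frequency-dependent form `beta * S * I / N` are choices made for illustration, with rates borrowed from the basic example above.

```python
import numpy as np


def run_seir_euler(s0, e0, i0, r0, beta, sigma, gamma, days, dt=0.1):
    """Illustrative forward-Euler integration of a basic SEIR model."""
    n_steps = int(round(days / dt))
    out = np.zeros((n_steps + 1, 4))
    out[0] = [s0, e0, i0, r0]
    for step in range(n_steps):
        s, e, i, r = out[step]
        n = s + e + i + r
        # People moved by each flow during one small time step
        new_exposed = beta * s * i / n * dt
        new_infectious = sigma * e * dt
        new_recovered = gamma * i * dt
        out[step + 1] = [
            s - new_exposed,
            e + new_exposed - new_infectious,
            i + new_infectious - new_recovered,
            r + new_recovered,
        ]
    return out


toy_outputs = run_seir_euler(
    100_000, 0, 100, 0, beta=0.12, sigma=1 / 15, gamma=0.04, days=300
)
print(toy_outputs[-1])  # Final S, E, I, R values
```

Every person leaving one compartment enters another, so the row sums stay equal to the starting population, which is a handy sanity check for any compartmental model.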

In [None]:
philippines_model.run()

### View the model outputs

The recommended way to view the model's results is via the `get_outputs_df()` method.

In [None]:
ph_outputs_df = philippines_model.get_outputs_df()
ph_outputs_df

In [None]:
ph_outputs_df.plot()

You can also access the raw NumPy array of outputs, which can be useful in performance-sensitive contexts.

In [None]:
# Force NumPy to format the output array nicely. 
import numpy as np
np.set_printoptions(formatter={'all': lambda f: f"{f:0.2f}"})

# View the first 10 timesteps of the output array.
philippines_model.outputs[:10]

### Accessing derived outputs

Derived outputs are accessed in much the same way as the raw compartment outputs, via the `get_derived_outputs_df()` method.

In [None]:
ph_derived_df = philippines_model.get_derived_outputs_df()
ph_derived_df

### Plot the outputs

You can get a better idea of what is going on inside the model by visualising how the compartment sizes change over time.

In [None]:
# Visualize the results.
subplot = {"title": "SEIR Model Outputs", "xlabel": "Days", "ylabel": "Compartment size"}
fig, ax = plt.subplots(1, 1, figsize=(12, 6), dpi=120, subplot_kw=subplot)

for compartment in ph_outputs_df: # Loop over each compartment. 
    ax.plot(philippines_model.times, ph_outputs_df[compartment]) # Plot the times and compartment values

ax.legend(["S", "E", "I", "R"]);

In [None]:
# Let's allow for the fact that case detection is never complete,
# by multiplying the model outputs through by a constant value
proportion_of_cases_detected = 0.05

fig, ax = plt.subplots(1, 1, figsize=(12, 6), dpi=120)
ax.plot(target)
ax.plot(ph_derived_df["notifications"] * proportion_of_cases_detected)
ax.legend(["Observed","Modelled"])

## Summary

That's it for now. You now know how to:

- Create a model
- Add a population
- Add flows
- Run the model
- Access and visualise the outputs

A detailed API reference for the CompartmentalModel class can be found [here](http://summerepi.com/api/model.html)

We have now reached the point where we have a model that runs and gives some reasonably sensible-looking outputs,
but does not closely match the data we are trying to fit.
However, even though this is a mechanistic model of COVID-19 dynamics,
the poor fit is clearly not the only way in which this model is unrealistic.
Please reflect on the most important ways in which this very simple model is unrealistic.
There are at least a dozen features of the Philippines COVID-19 epidemic that aren't captured in this model.
Try listing them out and ordering them according to importance.
How many of these features would you need to include before you were satisfied that this model was something
that could guide policy or be used for prediction?