# Capacity Building
## Prerequisites
Some basic understanding of Python variables, data types, looping, conditionals and functions will be of benefit.
## Data inputs
### Imports

Let's import some modules. A module is a library of Python code that provides useful functionality.<br> Modules may be part of the standard Python library or come from external packages.

In [None]:
# Install the summer package
# Pip is Python's standard package manager

%pip install summerepi

In [None]:
# Python standard library imports come first
from datetime import datetime, timedelta  # We use datetime to manipulate date-time indexes

# Then external package imports
import pandas as pd  # pd is an alias for pandas, which provides dataframes similar to those in R
from matplotlib import pyplot as plt  # matplotlib is a common visualisation package for Python


In [None]:
# We'll do a bit of global setup here too - let's set a plotting style we like (this can easily be omitted)
plt.style.use("ggplot")
# Try the following to get help on an example command
plt.style.use?
# Try just typing plt.s (or similar) and pressing tab (or shift-tab on Colab) to see what's available within plt

Try: There's an attribute inside plt.style that lists the available styles. Change the plotting style to something you like.

### Define constants and useful variables
Defining and capitalising constants at the start of a Python script or module is a common convention.<br>
Only do this for variables that will never change during runtime.

In [None]:
# URL to the Ministry of Health's GitHub repository.
# What is the data type here, a tuple or string? Do you know how to check for the type?
GITHUB_MOH = (
    "https://raw.githubusercontent.com/MoH-Malaysia/covid19-public/main/epidemic/"
)

# A list containing the files to download.
MOH_FILES = [
    "cases_malaysia",
    "deaths_malaysia",
    "hospital",
    "icu",
    "cases_state",
    "deaths_state",
]

# We define a day zero for the analysis.
COVID_BASE_DATE = datetime(2019, 12, 31)

# By defining a region variable, we can easily change the analysis later.
region = "Malaysia"
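
The comment in the cell above asks whether `GITHUB_MOH` is a tuple or a string. A minimal sketch of how to check, using a made-up URL (`example.org` is purely illustrative): parentheses on their own only group an expression, and it is the comma that makes a tuple.

```python
# Parentheses alone only group an expression, so this is still a string
url_in_parens = (
    "https://example.org/data/"  # hypothetical URL, for illustration only
)

# A trailing comma turns the same expression into a one-item tuple
one_item_tuple = (
    "https://example.org/data/",
)

# type() reports the class of an object; isinstance() tests it directly
print(type(url_in_parens))
print(type(one_item_tuple))
print(isinstance(url_in_parens, str))
```

Running `type(GITHUB_MOH)` in the notebook answers the question for the real constant.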



### Utility functions

In [None]:
# We'll import List from the typing module here, so we can add clear type annotations to our code
from typing import List

In [None]:
def fetch_mys_data(base_url: str, file_list: List[str]) -> pd.DataFrame:
    """
    Request files from MoH and combine them into one data frame.
    
    Args:
        base_url: The base URL to fetch data from.
        file_list: A list of specific files to fetch.

    Returns:
        pd.DataFrame: A data frame containing all the files.
    """
    a_list = []  # An empty list to hold each dataframe (a list can hold any python object)
    for file in file_list:  # Loop over each file name
        data_type = file.split('_')[0]  # Split the file name on '_' and take the first part
        df = pd.read_csv(base_url + file + ".csv")  # Build the full url path to the file and ask pandas to download it
        df['type']  = data_type  # Create a new column 'type' and enter the data_type

        a_list.append(df)  # Place this dataframe into the list. 

    # We have looped over all the files, downloaded each one and placed it in a list of the form [df1, df2, df3, ...]
    
    # Pandas will automatically combine this list into a single dataframe. It will expand the rows and columns as necessary
    df = pd.concat(a_list) 
    
    return df # The function returns the dataframe
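
To see what the loop and `pd.concat` do without touching the network, here is a small sketch that mimics the function on two in-memory CSV strings. The file contents are invented for illustration and only the column names resemble the MoH data.

```python
import io

import pandas as pd

# Two tiny in-memory CSV "files" with made-up numbers (hypothetical data)
fake_files = {
    "cases_malaysia": "date,cases_new\n2021-01-01,10\n2021-01-02,12\n",
    "deaths_malaysia": "date,deaths_new\n2021-01-01,1\n2021-01-02,0\n",
}

frames = []
for name, csv_text in fake_files.items():
    frame = pd.read_csv(io.StringIO(csv_text))  # Read from text instead of a URL
    frame["type"] = name.split("_")[0]  # 'cases' or 'deaths'
    frames.append(frame)

# Rows are stacked; columns missing from a frame are filled with NaN
combined = pd.concat(frames)
print(combined)
```

Note that each frame keeps its own row labels, so the combined index repeats (0, 1, 0, 1). Passing `ignore_index=True` to `pd.concat` would give a fresh 0..n-1 index instead.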

Now call the function, passing it the MoH URL and the list of files.<br> Well done! We have downloaded Malaysia's national and state-level COVID-19 datasets into one dataframe.

In [None]:
df = fetch_mys_data(GITHUB_MOH, MOH_FILES)
df

In [None]:
# Because this particular dataframe is too big to easily inspect,
# we might want to look at parts of it (e.g. the column names)
df.columns

In [None]:
df['state']

We need to do some housekeeping.
- Fill the missing state values with 'Malaysia'
- Ensure the `date` column has a proper datetime type rather than holding strings such as '10-06-2022'
- Create an integer offset from COVID_BASE_DATE. 
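
As a quick illustration of the last item, the integer offset is just the number of days between a date and the base date. This sketch uses only the standard library and re-declares the base date under a local name so that it runs on its own.

```python
from datetime import datetime, timedelta

base_date = datetime(2019, 12, 31)  # Same value as COVID_BASE_DATE

# Date -> integer offset: subtracting two datetimes gives a timedelta
offset = (datetime(2021, 1, 1) - base_date).days
print(offset)  # 367

# Integer offset -> date: add the days back onto the base date
print(base_date + timedelta(days=offset))  # 2021-01-01 00:00:00
```

For a whole pandas column the same subtraction yields a column of timedeltas, which is why the `.dt.days` accessor is needed there.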

In [None]:
df.loc[df['state'].isna(), 'state'] = 'Malaysia' 
df['date'] = pd.to_datetime(df['date'])
df['date_index'] = (df['date'] - COVID_BASE_DATE).dt.days

Let's create a boolean mask to aid with our analysis. Recall the 'region' variable we set at the beginning and the type column we created while downloading the data.<br>

In this example, the mask selects Malaysia's cases. By changing the 'region' variable and/or the value compared with the type column, we can change the focus of the analysis.
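
A boolean mask is simply a column of True/False values, one per row, that pandas uses to keep or drop rows. A minimal sketch on a toy dataframe (the numbers are made up):

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "state": ["Malaysia", "Malaysia", "Johor", "Johor"],
        "type": ["cases", "deaths", "cases", "deaths"],
        "value": [10, 1, 4, 0],  # Made-up numbers
    }
)

# Each comparison gives a Series of True/False; & combines them element-wise.
# The parentheses are required because & binds more tightly than ==.
toy_mask = (toy["state"] == "Malaysia") & (toy["type"] == "cases")

print(toy_mask.tolist())  # [True, False, False, False]
print(toy[toy_mask])  # Only the rows where the mask is True
```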

In [None]:
mask = (df['state'] == region) & (df['type'] == 'cases')

In [None]:
df[mask][['date', 'cases_new', 'deaths_new']]  # Notice how the death data is all NaN due to the filtering

After all that work, let's look at the results.<br />
Pandas has a .plot() function. Here is a [quick](https://pandas.pydata.org/docs/getting_started/intro_tutorials/04_plotting.html) or [detailed](https://pandas.pydata.org/pandas-docs/stable/user_guide/visualization.html?highlight=plot) tutorial.<br />
We can also use `x='date_index'` and change `y` to any `cases_` column.

In [None]:
df[mask].plot(x='date', y='cases_new', figsize=(20, 10));  # The semicolon suppresses the printing of the name of the object that was created in this line

In [None]:
# We might prefer to plot this as points rather than a line, given each entry is an observation
df[mask].plot(x='date', y='cases_new', figsize=(20, 10), marker='o', linewidth=0);

Let's also download the latest population distributions from the MoH GitHub repository.

In [None]:
population_url = 'https://raw.githubusercontent.com/MoH-Malaysia/covid19-public/main/static/population.csv'
df_pop = pd.read_csv(population_url)

In [None]:
df_pop

In [None]:
# Use .iloc[0] to take the first matching row by position, whatever its index label is
initial_population = df_pop[df_pop['state'] == region]['pop'].iloc[0]

## Basic model introduction

This section introduces the process of building and running a simple compartmental disease model with `summer`.
In the following example, we will create an SEIR compartmental model for a general, unspecified emerging infectious disease spreading through a fully susceptible population. In this model there will be:

- four compartments: susceptible (S), exposed (E), infected (I) and recovered (R)
- a starting population in which 100 people are infected (and infectious) and everyone else is susceptible
- an evaluation timespan from `start_date` to `end_date` (300 days) in 1-day steps (`timestep=1.0`)
- inter-compartmental flows for infection, progression and recovery (a death flow is included but commented out)

You may wish to give the compartments more descriptive names, which is actually what we usually do when building these models.
First, let's look at a complete example of this model in action, and then examine the details of each step. This is the complete example model that we will be working with:

In [None]:
import numpy as np
from summer import CompartmentalModel

start_date = datetime(2021, 1, 1)  # Define the start date
end_date = start_date + timedelta(days=300)  # Define the duration

# Integer representation of the start and end dates.
start_date_int = (start_date - COVID_BASE_DATE).days
end_date_int = (end_date - COVID_BASE_DATE).days

In [None]:
# Define the model compartments and time step.
basic_model = CompartmentalModel(
    times=(start_date_int, end_date_int),
    compartments=["S", "E", "I", "R"],
    infectious_compartments=["I"],
    timestep=1.0,
)

In [None]:
# Define the initial population and compartmental flows.
basic_model.set_initial_population(distribution={"S": 100000, "E": 0, "I": 100})
basic_model.add_infection_frequency_flow(name="infection", contact_rate=0.12, source="S", dest="E")
basic_model.add_transition_flow(name="progression", fractional_rate=1/15., source="E", dest="I")
basic_model.add_transition_flow(name="recovery", fractional_rate=0.04, source="I", dest="R")
# Optional: uncomment to add deaths among infectious people
# basic_model.add_death_flow(name="infection_death", death_rate=0.05, source="I")

# Run the model
basic_model.run()

Our CompartmentalModel object has many methods defined on it. You are encouraged to explore these methods as this object is integral to the platform.

In [None]:
output_df = basic_model.get_outputs_df()

We now have a Pandas dataframe of compartment sizes at each time step.

In [None]:
output_df.head(20)

Extract the target data from the MoH dataframe.

In [None]:
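# Select observations by date_index value rather than row position (row 0 need not be day 0)
in_window = (df['date_index'] >= start_date_int) & (df['date_index'] < end_date_int)
target = df[mask & in_window]['cases_new']
x_range = df[mask & in_window]['date_index']  # Integer day offsets matching each target value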
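
When lining observations up against model times, it matters whether rows are picked by position or by their `date_index` value. The two agree only if the first row of the data falls on day zero. A small sketch with a toy series that starts on day 3 (values are made up, and the window bounds are hypothetical):

```python
import pandas as pd

# Toy series whose first observation is on day 3, not day 0 (made-up values)
obs = pd.DataFrame({"date_index": range(3, 13), "cases_new": range(100, 110)})

start, end = 5, 8  # Hypothetical window, end exclusive

# Slicing with integers picks rows by position: the 6th to 8th rows
by_position = obs[start:end]

# Filtering on the column picks rows by their day offset
in_window = (obs["date_index"] >= start) & (obs["date_index"] < end)
by_value = obs[in_window]

print(by_position["date_index"].tolist())  # [8, 9, 10]
print(by_value["date_index"].tolist())  # [5, 6, 7]
```

The positional slice is shifted by three days here, which is exactly the offset between the first observation and day zero.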

Useful Matplotlib [guide](https://matplotlib.org/stable/tutorials/introductory/usage.html#sphx-glr-tutorials-introductory-usage-py)

In [None]:
# Visualize the results
subplot = {"title": "SEIR Model Outputs", "xlabel": "Days", "ylabel": "Compartment size"}  # A dictionary of key: value pairs that matplotlib will use to label the axes
fig, ax = plt.subplots(1, 1, figsize=(12, 6), dpi=120, subplot_kw=subplot)  # Create a figure with a single set of axes

for compartment in output_df:  # Loop over each compartment
    ax.plot(basic_model.times, output_df[compartment])  # Plot the times and compartment values

ax.legend(["S", "E", "I", "R"]);


Now let's inspect each step of the example in more detail. To start, here's how to create a new model: using the class we imported from summer earlier, we create a new [CompartmentalModel](http://summerepi.com/api/model.html) object. Our model has an attribute called `compartments`, which contains a description of each modelled compartment.

In [None]:
# Define the model
malaysia_model = CompartmentalModel(
    times=(start_date_int, end_date_int),
    compartments=["S", "E", "I", "R"],
    infectious_compartments=["I"],
    timestep=1.0,
)

### Adding a population 

Initially the model compartments are all empty. Let's add:

- the population of our region, less 100 (about 32 million people for Malaysia), to the susceptible (S) compartment, plus
- 100 in the infectious (I) compartment.

In [None]:
# Add people to the model
# We'll use the initial_population variable we obtained from the MOH data earlier
malaysia_model.set_initial_population(distribution={"S": initial_population - 100, "E": 0, "I": 100})

# View the initial population
malaysia_model.initial_population

### Adding inter-compartmental flows 

Now, let's add some flows for people to transition between the compartments. These flows will define the dynamics of our infection. We will add:

- an infection flow from S to E (using frequency-dependent transmission)
- a progression flow from E to I (exposed individuals become infectious)
- a recovery flow from I to R
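
Written as differential equations, these three flows correspond to the standard SEIR system below. The symbols are our own notation rather than anything defined by `summer`: $\beta$ is the `contact_rate`, $\sigma$ the progression `fractional_rate`, $\gamma$ the recovery `fractional_rate`, and $N = S + E + I + R$ is the total population.

```latex
\begin{aligned}
\frac{dS}{dt} &= -\beta \frac{S I}{N} \\
\frac{dE}{dt} &= \beta \frac{S I}{N} - \sigma E \\
\frac{dI}{dt} &= \sigma E - \gamma I \\
\frac{dR}{dt} &= \gamma I
\end{aligned}
```

The right-hand sides sum to zero, so the total population stays constant. Under these equations a fractional rate is the reciprocal of the mean time spent in a compartment: $\sigma = 1/15$ implies an average of 15 days in E, and $\gamma = 0.04$ an average of 25 days in I.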

In [None]:
# Susceptible people can get infected.
malaysia_model.add_infection_frequency_flow(name="infection", contact_rate=0.18, source="S", dest="E")

# Exposed people progress to become infectious.
malaysia_model.add_transition_flow(name="progression", fractional_rate=1/15, source="E", dest="I")

# Infectious people recover.
malaysia_model.add_transition_flow(name="recovery", fractional_rate=0.04, source="I", dest="R")

# Importantly, we will also request an output for the 'progression' flow, and name this 'notifications'
# This will be available after a model run using the get_derived_outputs_df() method

malaysia_model.request_output_for_flow("notifications", "progression")

# Inspect the new flows, which we just added to the model.
malaysia_model._flows



### Running the model

Now we can calculate the outputs for the model over the requested time period. 
The model calculates the compartment sizes by solving a system of differential equations (defined by the flows we just added) over the requested time period.
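
To make "solving a system of differential equations" concrete, here is a rough pure-Python sketch that steps a frequency-dependent SEIR system forward with fixed Euler steps. It is not how `summer` works internally. The function name, the 0.1-day step and the round 32 million population are all assumptions chosen for readability, and only the three rates are taken from the example.

```python
def run_seir_euler(population, infectious_seed, contact_rate,
                   progression_rate, recovery_rate, n_days, step=0.1):
    """Integrate a frequency-dependent SEIR model with fixed Euler steps."""
    s, e, i, r = population - infectious_seed, 0.0, float(infectious_seed), 0.0
    history = [(0.0, s, e, i, r)]
    n_steps = round(n_days / step)
    for k in range(1, n_steps + 1):
        total = s + e + i + r
        infection = contact_rate * s * i / total  # S -> E, people per day
        progression = progression_rate * e  # E -> I, people per day
        recovery = recovery_rate * i  # I -> R, people per day
        # Move people between compartments for one small time step
        s -= infection * step
        e += (infection - progression) * step
        i += (progression - recovery) * step
        r += recovery * step
        history.append((k * step, s, e, i, r))
    return history


# Rates as in the Malaysia example; the population is a round hypothetical figure
history = run_seir_euler(
    population=32_000_000, infectious_seed=100, contact_rate=0.18,
    progression_rate=1 / 15, recovery_rate=0.04, n_days=300,
)
t, s, e, i, r = history[-1]
print(f"day {t:.0f}: S={s:,.0f} E={e:,.0f} I={i:,.0f} R={r:,.0f}")
```

Because every person leaving one compartment enters another, the four values always sum to the starting population. The `progression` term is the quantity that a flow output on the 'progression' flow tracks.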

In [None]:
malaysia_model.run()

### Print the model outputs

The recommended way to view the model's results is via the `get_outputs_df()` method.

In [None]:
mm_outputs_df = malaysia_model.get_outputs_df()
mm_outputs_df

In [None]:
mm_outputs_df.plot()

You can also access the raw NumPy array of outputs, which can be useful in performance-sensitive contexts.

In [None]:
# Force NumPy to format the output array nicely. 
import numpy as np
np.set_printoptions(formatter={'all': lambda f: f"{f:0.2f}"})

# View the first 10 timesteps of the output array.
malaysia_model.outputs[:10]

### Accessing derived outputs

Derived outputs are accessed in much the same way as the raw compartment outputs, via the `get_derived_outputs_df()` method.

In [None]:
mm_derived_df = malaysia_model.get_derived_outputs_df()
mm_derived_df

### Plot the outputs

You can get a better idea of what is going on inside the model by visualising how the compartment sizes change over time.

In [None]:
# Visualize the results.
subplot = {"title": "SEIR Model Outputs", "xlabel": "Days", "ylabel": "Compartment size"}
fig, ax = plt.subplots(1, 1, figsize=(12, 6), dpi=120, subplot_kw=subplot)

for compartment in mm_outputs_df: # Loop over each compartment. 
    ax.plot(malaysia_model.times, mm_outputs_df[compartment]) # Plot the times and compartment values

ax.legend(["S", "E", "I", "R"]);

In [None]:
# Let's allow for the fact that case detection is never complete,
# by multiplying the model outputs through by a constant value
proportion_of_cases_detected = 0.05

fig, ax = plt.subplots(1, 1, figsize=(12, 6), dpi=120)
ax.plot(x_range, target)  # Plot the MoH target values
ax.plot(malaysia_model.times, mm_derived_df["notifications"] * proportion_of_cases_detected);

## Summary

That's it for now. You now know how to:

- Create a model
- Add a population
- Add flows
- Run the model
- Access and visualise the outputs

A detailed API reference for the CompartmentalModel class can be found [here](http://summerepi.com/api/model.html).

The point we have reached is a model that runs and gives some reasonably sensible-looking outputs,
but does not perfectly match the data we are trying to fit.
Although this is a mechanistic model of COVID-19 dynamics,
the imperfect fit is clearly not the only respect in which it is unrealistic.
Please reflect on the most important ways in which this very simple model is unrealistic.
There are at least a dozen features of the Malaysian COVID-19 epidemic that aren't captured in this model.
Try listing them out and ordering them according to importance.
How many of these features would you need to include before you were satisfied that this model was something
that could guide policy or be used for prediction?