[![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/lisphilar/covid19-sir/HEAD?labpath=example%2F01_data_preparation.ipynb)

# Data preparation
The first step of data science is data preparation. `covsirphy` provides the following three features for it.

1. Downloading datasets from recommended data servers
2. Reading `pandas.DataFrame`
3. Generator of sample data with SIR-derived ODE model

In [None]:
from datetime import date
from pprint import pprint
import numpy as np
import pandas as pd
import covsirphy as cs
cs.__version__

## 1. Downloading datasets from recommended data servers
We will download datasets from the following recommended data servers.

* **[COVID-19 Data Hub](https://covid19datahub.io/)**
    * Guidotti, E., Ardia, D., (2020), “COVID-19 Data Hub”, Journal of Open Source Software 5(51):2376, doi: 10.21105/joss.02376.
    * The number of cases (JHU style)
    * Population values in each country/province
    * [Government Response Tracker (OxCGRT)](https://github.com/OxCGRT/covid-policy-tracker)
    * The number of tests
* **[Our World In Data](https://github.com/owid/covid-19-data/tree/master/public/data)**
    * Hasell, J., Mathieu, E., Beltekian, D. et al. A cross-country database of COVID-19 testing. Sci Data 7, 345 (2020). https://doi.org/10.1038/s41597-020-00688-8
    * The number of tests
    * The number of vaccinations
    * The number of people who received vaccinations
* **[COVID-19 Open Data by Google Cloud Platform](https://github.com/GoogleCloudPlatform/covid-19-open-data)**
    * O. Wahltinez and others (2020), COVID-19 Open-Data: curating a fine-grained, global-scale data repository for SARS-CoV-2, Work in progress, https://goo.gle/covid-19-open-data
    * Percentage to baseline in visits
    * Note: Please refer to [Google Terms of Service](https://policies.google.com/terms) in advance.
* **[World Population Prospects 2022](https://population.un.org/wpp/)**
    * United Nations, Department of Economic and Social Affairs, Population Division (2022). World Population Prospects 2022, Online Edition.
    * Total population in each country
* **[Datasets for CovsirPhy](https://github.com/lisphilar/covid19-sir/tree/master/data)**
    * Hirokazu Takaya (2020-2022), GitHub repository, COVID-19 dataset in Japan, https://github.com/lisphilar/covid19-sir/tree/master/data
    * The number of cases in Japan (total/prefectures)
    * Metadata regarding Japan prefectures

***

How to request new data loader:  
If you want to use a new dataset for your analysis, please let us know via [GitHub Issues: Request new method of DataLoader class](https://github.com/lisphilar/covid19-sir/issues/new/?template=request-new-method-of-dataloader-class.md). Please read the [Guideline of contribution](https://lisphilar.github.io/covid19-sir/CONTRIBUTING.html) in advance.

### 1-1. With `DataEngineer` class
The quickest way to download data from the recommended data servers is `DataEngineer().download()`.

In [None]:
eng = cs.DataEngineer()
eng.download();

We can get all downloaded records as a `pandas.DataFrame` with the `DataEngineer().all()` method.

In [None]:
all_df = eng.all()
# Overview of the records
all_df.info()

`DataEngineer().citations()` shows the citations of the datasets.

In [None]:
print("\n".join(eng.citations()))

Note that, by default, `DataEngineer().download()` collects country-level data and saves the datasets as CSV files in the "input" folder of the current directory (set by the `directory` argument of `DataEngineer()`). If the saved CSV files were last modified within the last 12 hours (set by the `update_interval` argument of `DataEngineer()`), the saved CSV files will be used as the database.
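
The update-interval rule can be pictured with a small standard-library sketch. This is only an illustration of the idea, not the code `covsirphy` runs internally; the helper name `is_fresh` and the file name `records.csv` are made up for this example.

```python
import os
import tempfile
import time
from pathlib import Path


def is_fresh(path, update_interval_hours=12, now=None):
    """Return True if `path` exists and was modified within the interval."""
    path = Path(path)
    if not path.exists():
        return False
    now = time.time() if now is None else now
    age_hours = (now - path.stat().st_mtime) / 3600
    return age_hours <= update_interval_hours


# A temporary file stands in for a saved CSV file
with tempfile.TemporaryDirectory() as tmp:
    csv_path = Path(tmp) / "records.csv"
    csv_path.write_text("Date,Confirmed\n2022-01-01,1\n")
    fresh_now = is_fresh(csv_path)
    # Pretend the file was last modified 13 hours ago
    old = time.time() - 13 * 3600
    os.utime(csv_path, (old, old))
    fresh_later = is_fresh(csv_path)
    # A file that was never saved is never fresh
    missing = is_fresh(Path(tmp) / "missing.csv")

print(fresh_now, fresh_later, missing)
```

With a rule like this, a file that is missing or older than the interval would trigger a new download, while a recent file would be reused.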

For some countries (e.g. Japan), province/state/prefecture-level data is available, and we can download it as follows.

In [None]:
eng_jpn = cs.DataEngineer()
eng_jpn.download(country="Japan")
eng_jpn.all().head()

For some countries (e.g. USA), city-level data is available, and we can download it as follows.

In [None]:
eng_alabama = cs.DataEngineer()
eng_alabama.download(country="USA", province="Alabama")
eng_alabama.all().head()

Move forward to [Tutorial: Data engineering](https://lisphilar.github.io/covid19-sir/02_data_engineering.html).

### 1-2. With `DataDownloader` class
The `DataEngineer` class is useful because it also has data cleaning methods and so on (explained in [Tutorial: Data engineering](https://lisphilar.github.io/covid19-sir/02_data_engineering.html)), but we can also use the `DataDownloader` class directly for data downloading.

In [None]:
dl = cs.DataDownloader()
dl_df = dl.layer(country=None, province=None)

In [None]:
# Overview of the records
dl_df.info()

Note that ISO3/Province/City columns have string data instead of categorical data.
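
If categorical columns are preferred, plain pandas can convert string columns after downloading. The following is a minimal sketch on a made-up series of ISO3 codes (it does not touch `dl_df`), showing the conversion and its effect on memory usage.

```python
import pandas as pd

# Hypothetical column of ISO3 codes with many repeated values
codes = pd.Series(["JPN", "USA", "JPN", "GBR"] * 1000)

# Convert strings to categorical data: each distinct value is stored once
as_category = codes.astype("category")

print(as_category.cat.categories.tolist())
print(codes.memory_usage(deep=True), as_category.memory_usage(deep=True))
```

Categorical data tends to pay off when a column has few distinct values relative to its length, as location codes usually do.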

In [None]:
# Citations
print("\n".join(dl.citations()))

### Acknowledgement

In February 2020, the CovsirPhy project started on the [Kaggle platform](https://www.kaggle.com/) with the [COVID-19 data with SIR model](https://www.kaggle.com/lisphilar/covid-19-data-with-sir-model) notebook by Hirokazu Takaya, helped by Kagglers, using the following datasets.

- The number of cases (JHU) and linelist: [Novel Corona Virus 2019 Dataset by SRK](https://www.kaggle.com/sudalairajkumar/novel-corona-virus-2019-dataset)
- Population in each country: [covid19 global forecasting: locations population by Dmitry A. Grechka](https://www.kaggle.com/dgrechka/covid19-global-forecasting-locations-population)
- The number of cases in Japan: [COVID-19 dataset in Japan by Lisphilar](https://www.kaggle.com/lisphilar/covid19-dataset-in-japan)

The current version of `covsirphy` does not have interfaces to the Kaggle datasets because they are not being updated at this time. However, we could not have carried out the CovsirPhy project without their support. Thank you!!

## 2. Reading `pandas.DataFrame`
We may need to use our own datasets for analysis when they are not included in the recommended data servers. `DataEngineer().register()` registers new datasets in `pandas.DataFrame` format.

### 2-1. Retrieve Monkeypox line list

First, we will prepare the new dataset as a `pandas.DataFrame`. We will use line list data on Monkeypox 2022.

[Global.health Monkeypox under CC BY 4.0 license](https://github.com/globaldothealth/monkeypox)

In [None]:
today = date.today()
mp_cite = f"Global.health Monkeypox (accessed on {today.strftime('%Y-%m-%d')}):\n" \
    "Kraemer, Tegally, Pigott, Dasgupta, Sheldon, Wilkinson, Schultheiss, et al. " \
    "Tracking the 2022 Monkeypox Outbreak with Epidemiological Data in Real-Time. " \
    "The Lancet Infectious Diseases. https://doi.org/10.1016/S1473-3099(22)00359-0.\n" \
    "European Centre for Disease Prevention and Control/WHO Regional Office for Europe." \
    f" Monkeypox, Joint Epidemiological overview, {today.day} {today.month}, 2022"
print(mp_cite)

Retrieve the CSV file with `pandas.read_csv()`, specifying the engine.

In [None]:
raw_url = "https://raw.githubusercontent.com/globaldothealth/monkeypox/main/latest.csv"
raw = pd.read_csv(raw_url, engine="pyarrow")
raw.info()

Review the data.

In [None]:
raw.head()

In [None]:
pprint(raw.Status.unique())
pprint(raw.Outcome.unique())

### 2-2. Convert line list to the number of cases data
Prepare analyzable data by converting the line list to the number of cases.

Prepare PPT (per protocol set) data.

In [None]:
date_cols = [
    "Date_onset", "Date_confirmation", "Date_hospitalisation",
    "Date_isolation", "Date_death", "Date_last_modified"
]
cols = ["ID", "Status", "City", "Country_ISO3", "Outcome", *date_cols]
df = raw.loc[:, cols].rename(columns={"Country_ISO3": "ISO3"})
df = df.loc[df["Status"].isin(["confirmed", "suspected"])]

for col in date_cols:
    df[col] = pd.to_datetime(df[col])

df["Date_min"] = df[date_cols].min(axis=1)
df["Date_recovered"] = df[["Outcome", "Date_last_modified"]].apply(
    lambda x: x["Date_last_modified"] if x["Outcome"] == "Recovered" else pd.NaT, axis=1)
df["City"] = df["City"].fillna("Unknown")

ppt_df = df.copy()
ppt_df.head()

Calculate daily new confirmed cases.

In [None]:
df = ppt_df.rename(columns={"Date_min": "Date"})
series = df.groupby(["ISO3", "City", "Date"])["ID"].count()
series.name = "Confirmed"
c_df = pd.DataFrame(series)
c_df.head()
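
The conversion from a line list (one row per case) to daily counts can be checked on a toy table. This is a minimal sketch with made-up records and only two date columns; the `toy_` names are chosen so that the notebook variables are not overwritten.

```python
import pandas as pd

# Hypothetical line list: one row per case (all values are made up)
toy = pd.DataFrame({
    "ID": ["N1", "N2", "N3", "N4"],
    "ISO3": ["GBR", "GBR", "GBR", "USA"],
    "City": ["London", "London", "London", "Unknown"],
    "Date_onset": ["2022-05-01", None, "2022-05-03", "2022-05-02"],
    "Date_confirmation": ["2022-05-03", "2022-05-03", "2022-05-04", None],
})
toy_date_cols = ["Date_onset", "Date_confirmation"]
for col in toy_date_cols:
    toy[col] = pd.to_datetime(toy[col])

# Earliest known date per case; missing dates (NaT) are skipped by min()
toy["Date"] = toy[toy_date_cols].min(axis=1)

# Count cases per location and date to get daily new cases
toy_counts = toy.groupby(["ISO3", "City", "Date"])["ID"].count().rename("Confirmed")
print(toy_counts)
```

Each case contributes to exactly one date, so the counts always sum to the number of rows in the line list.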

Calculate daily new recovered cases.

In [None]:
df = ppt_df.rename(columns={"Date_recovered": "Date"})
series = df.groupby(["ISO3", "City", "Date"])["ID"].count()
series.name = "Recovered"
r_df = pd.DataFrame(series)
r_df.head()

Calculate daily new fatal cases.

In [None]:
df = ppt_df.rename(columns={"Date_death": "Date"})
series = df.groupby(["ISO3", "City", "Date"])["ID"].count()
series.name = "Fatal"
f_df = pd.DataFrame(series)
f_df.head()

Combine data (cumulative number).

In [None]:
df = c_df.combine_first(f_df).combine_first(r_df)
df = df.unstack(level=["ISO3", "City"])
df = df.asfreq("D").fillna(0).cumsum()
df = df.stack(level=["ISO3", "City"]).reorder_levels(["ISO3", "City", "Date"])
df = df.sort_index().reset_index()
all_df_city = df.copy()
all_df_city.head()
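
The `asfreq("D").fillna(0).cumsum()` chain is what turns daily new counts into cumulative numbers. A small sketch on made-up daily counts with one missing day shows each step's role.

```python
import pandas as pd

# Hypothetical daily new cases with no record for 2022-05-03
daily = pd.DataFrame(
    {"Confirmed": [1, 2, 3], "Fatal": [0, 1, 0]},
    index=pd.to_datetime(["2022-05-01", "2022-05-02", "2022-05-04"]),
)
daily.index.name = "Date"

# asfreq("D") inserts the missing day as NaN, fillna(0) treats it as
# "no new cases", and cumsum() turns daily counts into running totals
cumulative = daily.asfreq("D").fillna(0).cumsum().astype(int)
print(cumulative)
```

Without the `asfreq("D")` step, days with no new cases would be absent from the cumulative series instead of repeating the previous total.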

At country level (City = "-") and city level (City != "-"):

In [None]:
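# numeric_only=True drops the string "City" column so that it can be re-inserted as "-"
df2 = all_df_city.groupby(["ISO3", "Date"], as_index=False).sum(numeric_only=True)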
df2.insert(1, "City", "-")
df = pd.concat([df2, all_df_city], axis=0)
df = df.loc[df["City"] != "Unknown"]
all_df = df.convert_dtypes()
all_df
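
The layering used here (country totals marked with City = "-" stacked on top of city-level rows) can be seen on a small made-up table. This is a sketch of the same idea with a single value column.

```python
import pandas as pd

# Hypothetical city-level cumulative records
city_df = pd.DataFrame({
    "ISO3": ["GBR", "GBR", "GBR", "GBR"],
    "City": ["London", "Leeds", "London", "Leeds"],
    "Date": pd.to_datetime(["2022-05-01", "2022-05-01", "2022-05-02", "2022-05-02"]),
    "Confirmed": [3, 1, 5, 2],
})

# Sum over cities to get country totals, then mark them with City = "-"
country_df = city_df.groupby(["ISO3", "Date"], as_index=False)["Confirmed"].sum()
country_df.insert(1, "City", "-")

# Stack country-level rows on top of city-level rows
layered = pd.concat([country_df, city_df], axis=0, ignore_index=True)
print(layered)
```

Keeping both levels in one table lets a later step select either level by filtering on the City column.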

Check data.

In [None]:
gis = cs.GIS(layers=["ISO3", "City"], country="ISO3", date="Date")
gis.register(data=all_df, convert_iso3=False);

In [None]:
variable = "Confirmed"
gis.choropleth(
    variable=variable, filename=None,
    title=f"Choropleth map (the number of {variable} cases)"
)

In [None]:
global_df = gis.subset(geo=None).set_index("Date").astype(np.int64)
global_df.tail()
cs.line_plot(global_df, title="The number of cases (Global)")

### 2-3. Retrieve total population data
Total population values are necessary to analyze the data (we will confirm this later with [Tutorial: SIR-derived ODE models](https://lisphilar.github.io/covid19-sir/03_ode.html)).

Population data at **country-level** can be retrieved with `DataDownloader().layer(databases=["wpp"])` via `DataEngineer().download(databases=["wpp"])`.

In [None]:
# Set layers and specify layer name of country
# (which will be converted to ISO3 code for standardization)
eng = cs.DataEngineer(layers=["ISO3", "City"], country=["ISO3"], verbose=1)
# Download and automated registration of population data
eng.download(databases=["wpp"])
# Specify date range to reduce the memory
date_range = (all_df["Date"].min(), all_df["Date"].max())
eng.clean(kinds=["resample"], date_range=date_range)
# Show all data
display(eng.all())
# Show citations
pprint(eng.citations())

### 2-4. Register Monkeypox data
Register the Monkeypox data to `DataEngineer()` instance.

In [None]:
eng.register(data=all_df, citations=[mp_cite])
# Show all data
display(eng.all())
# Show citations
pprint(eng.citations())

Move forward to [Tutorial: Data engineering](https://lisphilar.github.io/covid19-sir/02_data_engineering.html).

## 3. Generator of sample data with SIR-derived ODE model
CovsirPhy can generate sample data with subclasses of `ODEModel` and the `Dynamics` class. Refer to the following.

### 3-1. Sample data of one-phase ODE model
Regarding ODE models, please refer to **[TBC]**. Here, we will create sample data with a one-phase SIR model, a tau value of 1440 min, the first date 01Jan2022, and the last date 30Jun2022. ODE parameter values are preset.

In [None]:
# Create solver with preset
model = cs.SIRModel.from_sample(date_range=("01Jan2022", "30Jun2022"), tau=1440)
# Show settings
pprint(model.settings())

Solve the ODE model with `ODEModel().solve()` method.

In [None]:
one_df = model.solve()
display(one_df.head())
display(one_df.tail())

Plot the time-series data.

In [None]:
cs.line_plot(one_df, title=f"Sample data of SIR model {model.settings()['param_dict']}")

### 3-2. Sample data of multi-phase ODE model
Regarding multi-phase ODE models, please refer to [Phase-dependent SIR models](https://lisphilar.github.io/covid19-sir/04_phase_dependent.html). Here, we will create sample data with a two-phase SIR model, a tau value of 1440 min, the first date 01Jan2022, and the last date 30Jun2022.

The 0th phase: 01Jan2022 - 28Feb2022, rho=0.2, sigma=0.075 (preset)  
The 1st phase: 01Mar2022 - 30Jun2022, **rho=0.4**, sigma=0.075
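
To see what a phase change means numerically, the standard SIR equations can be integrated with parameters that switch on 01Mar2022. The sketch below uses only the standard library and a simple explicit Euler scheme; it is not the solver `covsirphy` uses, and the population and initial number of infected are hypothetical values chosen for illustration.

```python
from datetime import date, timedelta


def simulate_sir(phases, start, end, population, infected0, substeps=24):
    """Integrate a SIR model whose parameters change by phase.

    phases: list of (first_date, rho, sigma) sorted by first_date;
    rho and sigma are non-dimensional rates per one-day step (tau = 1440 min).
    """
    s, i, r = float(population - infected0), float(infected0), 0.0
    records = []
    day = start
    h = 1.0 / substeps
    while day <= end:
        records.append((day, s, i, r))
        # Pick the parameters of the latest phase that has already started
        _, rho, sigma = [p for p in phases if p[0] <= day][-1]
        for _ in range(substeps):
            new_inf = rho * s * i / population * h
            new_rec = sigma * i * h
            s, i, r = s - new_inf, i + new_inf - new_rec, r + new_rec
        day += timedelta(days=1)
    return records


phases = [
    (date(2022, 1, 1), 0.2, 0.075),  # 0th phase
    (date(2022, 3, 1), 0.4, 0.075),  # 1st phase: rho doubled
]
# Population and initial infected are hypothetical values
records = simulate_sir(phases, date(2022, 1, 1), date(2022, 6, 30),
                       population=1_000_000, infected0=1_000)
print(records[0])
print(records[-1])
```

The total S + I + R stays equal to the population at every step, and the infection curve steepens after the rho change.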

We will use the `Dynamics` class. First, set the first/last dates of the dynamics and the 0th phase ODE parameters.

In [None]:
dyn = cs.Dynamics.from_sample(model=cs.SIRModel, date_range=("01Jan2022", "30Jun2022"))
# Show summary
dyn.summary()

Add the 1st phase with the `Dynamics().register()` method.

In [None]:
setting_df = dyn.register()
setting_df.loc["01Mar2022": "30Jun2022", ["rho", "sigma"]] = [0.4, 0.075]
setting_df

In [None]:
dyn.register(data=setting_df)
# Show summary
dyn.summary()

Solve the ODE model with `Dynamics().simulate()` method and plot the time-series data.

In [None]:
two_df = dyn.simulate(model_specific=True)
cs.line_plot(two_df, title="Sample data of two-phase SIR model", v=["01Mar2022"])

When we need to convert model-specific variables to model-free variables (Susceptible/Infected/Fatal/Recovered), we set `model_specific=False` (default).
Because R = "Fatal or Recovered" in the SIR model, we assume that R = "Recovered" and F = 0.

In [None]:
two_df = dyn.simulate(model_specific=False)
cs.line_plot(two_df, title="Sample data of two-phase SIR model with SIRF variables", v=["01Mar2022"])

In practice, the observable variables are Population/Confirmed/Fatal/Recovered. We can calculate Population and Confirmed as follows.

- Confirmed = Infected + Fatal + Recovered
- Population = Susceptible + Confirmed

In [None]:
real_df = two_df.copy()
real_df["Confirmed"] = real_df[["Infected", "Fatal", "Recovered"]].sum(axis=1)
real_df["Population"] = real_df[["Susceptible", "Confirmed"]].sum(axis=1)
real_df = real_df.loc[:, ["Population", "Confirmed", "Recovered", "Fatal"]]
cs.line_plot(real_df, title="Sample data of two-phase SIR model with observable variables", v=["01Mar2022"])

Thank you!