<a href="https://colab.research.google.com/github/lisphilar/covid19-sir/blob/master/example/01_data_preparation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Data preparation
The first step of data science is data preparation. `covsirphy` provides the following three functionalities for it.

1. Downloading datasets from recommended data servers
2. Reading `pandas.DataFrame`
3. Generator of sample data with SIR-derived ODE model

In [None]:
from pprint import pprint
import pandas as pd
try:
    import covsirphy as cs
except ImportError:
    !pip install --upgrade "git+https://github.com/lisphilar/covid19-sir.git#egg=covsirphy" -qq
    import covsirphy as cs
cs.__version__

## 1. Downloading datasets from recommended data servers
We will download datasets from the following recommended data servers.

* **COVID-19 Data Hub, https://covid19datahub.io/**
    * Guidotti, E., Ardia, D., (2020), “COVID-19 Data Hub”, Journal of Open Source Software 5(51):2376, doi: 10.21105/joss.02376.
    * The number of cases (JHU style)
    * Population values in each country/province
    * [Government Response Tracker (OxCGRT)](https://github.com/OxCGRT/covid-policy-tracker)
    * The number of tests
* **Our World In Data, https://github.com/owid/covid-19-data/tree/master/public/data**
    * Hasell, J., Mathieu, E., Beltekian, D. et al. A cross-country database of COVID-19 testing. Sci Data 7, 345 (2020). https://doi.org/10.1038/s41597-020-00688-8
    * The number of tests
    * The number of vaccinations
    * The number of people who received vaccinations
* **COVID-19 Open Data by Google Cloud Platform, https://github.com/GoogleCloudPlatform/covid-19-open-data**
    * O. Wahltinez and others (2020), COVID-19 Open-Data: curating a fine-grained, global-scale data repository for SARS-CoV-2, Work in progress, https://goo.gle/covid-19-open-data
    * Percentage to baseline in visits
    * Note: Please refer to [Google Terms of Service](https://policies.google.com/terms) in advance.
* **World Bank Open Data, https://data.worldbank.org/**
    * World Bank Group (2020), World Bank Open Data, https://data.worldbank.org/
    * Population pyramid
* **Datasets for CovsirPhy, https://github.com/lisphilar/covid19-sir/tree/master/data**
    * Hirokazu Takaya (2020-2022), GitHub repository, COVID-19 dataset in Japan, https://github.com/lisphilar/covid19-sir/tree/master/data.
    * The number of cases in Japan (total/prefectures)
    * Metadata regarding Japan prefectures

***

How to request a new data loader:  
If you want to use a new dataset for your analysis, please let us know via [GitHub Issues: Request new method of DataLoader class](https://github.com/lisphilar/covid19-sir/issues/new/?template=request-new-method-of-dataloader-class.md). Please read [Guideline of contribution](https://lisphilar.github.io/covid19-sir/CONTRIBUTING.html) in advance.

### 1-1. With `DataEngineer` class
The quickest way to download data from the recommended data servers is `DataEngineer().download()`.

In [None]:
eng = cs.DataEngineer()
eng.download();

We can get all the downloaded records as a `pandas.DataFrame` with the `DataEngineer().all()` method.

In [None]:
all_df = eng.all()
# Overview of the records
all_df.info()

`DataEngineer.citations()` shows citations of the datasets.

In [None]:
print("\n".join(eng.citations()))

Note that, by default, `DataEngineer().download()` collects country-level data and saves the datasets as CSV files in the "input" folder (the `directory` argument of `DataEngineer()`) of the current directory. If the saved CSV files were last modified within the last 12 hours (the `update_interval` argument of `DataEngineer()`), they will be used as the database.
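The freshness rule described here can be pictured with a small standard-library sketch. The helper `is_cache_fresh` and the file name `example.csv` are hypothetical names used only for illustration; this is not the code that `DataEngineer` runs internally, just one way such a check could work under the 12-hour rule.

```python
import os
import tempfile
import time


def is_cache_fresh(path, update_interval_hours=12, now=None):
    """Return True when `path` exists and was modified within the given number of hours."""
    if not os.path.exists(path):
        return False
    now = time.time() if now is None else now
    age_hours = (now - os.path.getmtime(path)) / 3600
    return age_hours <= update_interval_hours


# A temporary file stands in for a saved CSV file (hypothetical name)
with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, "example.csv")
    with open(csv_path, "w") as fh:
        fh.write("Date,Confirmed\n")
    # Just written, so it counts as fresh
    fresh_now = is_cache_fresh(csv_path)
    # Pretend the file was last modified 13 hours ago
    old_time = time.time() - 13 * 3600
    os.utime(csv_path, (old_time, old_time))
    fresh_after_13h = is_cache_fresh(csv_path)
    # A file that does not exist can never be reused
    missing = is_cache_fresh(os.path.join(tmp_dir, "missing.csv"))

print(fresh_now, fresh_after_13h, missing)
```

With a check of this kind, a stale or missing file would trigger a new download, while a recent file would be read as-is.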

For some countries (e.g. Japan), province/state/prefecture-level data is available and we can download it as follows.

In [None]:
eng_jpn = cs.DataEngineer()
eng_jpn.download(country="Japan")
eng_jpn.all().head()

For some countries (e.g. USA), city-level data is available and we can download it as follows.

In [None]:
eng_alabama = cs.DataEngineer()
eng_alabama.download(country="USA", province="Alabama")
eng_alabama.all().head()

### 1-2. With `DataDownloader` class
`DataEngineer` class is recommended because it also provides data cleaning methods and so on, but we can also use `DataDownloader` class for data downloading.

In [None]:
dl = cs.DataDownloader()
dl_df = dl.layer(country=None, province=None)

In [None]:
# Overview of the records
dl_df.info()

Note that ISO3/Province/City columns have string data instead of categorical data.

In [None]:
# Citations
print("\n".join(dl.citations()))

## 2. Reading `pandas.DataFrame`
We may need to use our own datasets for analysis when they are not included in the recommended data servers. `DataEngineer().register()` registers new datasets in `pandas.DataFrame` format.

First, we will prepare the new dataset as a `pandas.DataFrame`. Just as a demonstration, we use [COVID-19 dataset in Japan](https://github.com/lisphilar/covid19-sir/tree/master/data). (Note that this dataset is included in the recommended servers, so the following steps are usually unnecessary.)

Hirokazu Takaya (2020-2022), COVID-19 dataset in Japan, GitHub repository, https://github.com/lisphilar/covid19-sir/data/japan

Country-level data:

In [None]:
c_url = "https://raw.githubusercontent.com/lisphilar/covid19-sir/master/data/japan/covid_jpn_total.csv"
c_df = pd.read_csv(c_url, dayfirst=False)
# Check columns of the pandas.DataFrame
c_df.tail()

Prefecture-level data:

In [None]:
p_url = "https://raw.githubusercontent.com/lisphilar/covid19-sir/master/data/japan/covid_jpn_prefecture.csv"
p_df = pd.read_csv(p_url, dayfirst=False)
# Check columns of the pandas.DataFrame
p_df.tail()

We can create a `DataEngineer` instance, specifying the layers of location names. However, the layer names differ between the two datasets: `c_df` has a "Location" layer and `p_df` has a "Prefecture" layer.

In [None]:
print(c_df.columns)
print(p_df.columns)

In [None]:
print(c_df.Location.unique())

To make the country-level dataset, calculate the total values of Domestic/Returnee/Airport for each date.

In [None]:
# Sum numeric columns only, so that the text column "Location" is left out
country_df = c_df.groupby("Date").sum(numeric_only=True).reset_index()
country_df.insert(1, "Country", "Japan")
country_df.insert(2, "Prefecture", pd.NA)
country_df.tail()
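To see what this aggregation does without downloading anything, here is a minimal sketch on toy records. The column names "Confirmed" and "Fatal" and all the values are made up for illustration and are not taken from the real CSV file; only the Domestic/Returnee/Airport locations follow the text above.

```python
import pandas as pd

# Toy records: column names and values are illustrative assumptions
toy_df = pd.DataFrame(
    {
        "Date": ["2022-01-01"] * 3 + ["2022-01-02"] * 3,
        "Location": ["Domestic", "Returnee", "Airport"] * 2,
        "Confirmed": [100, 5, 10, 120, 5, 12],
        "Fatal": [1, 0, 0, 2, 0, 0],
    }
)
toy_df["Date"] = pd.to_datetime(toy_df["Date"])

# Sum the numeric columns per date; the text column "Location" is left out
toy_country_df = toy_df.groupby("Date").sum(numeric_only=True).reset_index()
# Add the location layers: country name and an empty prefecture
toy_country_df.insert(1, "Country", "Japan")
toy_country_df.insert(2, "Prefecture", pd.NA)
print(toy_country_df)
```

Each date ends up with a single row whose values are the totals over the three locations, which is the shape a country-level record needs.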

To make prefecture-level data, add "Country" column.

In [None]:
prefecture_df = p_df.copy()
prefecture_df.insert(1, "Country", "Japan")
prefecture_df.tail()

Create `DataEngineer` instance and register datasets.

In [None]:
# Set layers and specify layer name of country (which will be converted to ISO3 code for standardization)
eng_own = cs.DataEngineer(layers=["Country", "Prefecture"], country="Country")
# Country-level data
eng_own.register(data=country_df, citations="New country-level data", dayfirst=False)
# Prefecture-level data
eng_own.register(data=prefecture_df, citations="New prefecture-level data", dayfirst=False)
# Show data
display(eng_own.all().tail())
# Show citations
print("\n".join(eng_own.citations()))

### Data loading in Kaggle Notebook

We can use the recommended datasets in [Kaggle](https://www.kaggle.com/) Notebook. The datasets are saved in "/kaggle/input/" directory. Additionally, we can use Kaggle Datasets (CSV files) with `covsirphy` in Kaggle Notebook.

Note:  
If you have Kaggle API, you can download Kaggle datasets to your local environment by updating and executing [input.py](https://github.com/lisphilar/covid19-sir/blob/master/input.py) script. CSV files will be saved in "/kaggle/input/" directory.

Kaggle API:  
Go to the account page of Kaggle and download "kaggle.json" by selecting the "API > Create New API Token" button. Copy the JSON file to the top directory of the local repository or to "~/.kaggle". Please refer to [How to Use Kaggle: Public API](https://www.kaggle.com/docs/api) and [stackoverflow: documentation for Kaggle API *within* python?](https://stackoverflow.com/questions/55934733/documentation-for-kaggle-api-within-python#:~:text=Here%20are%20the%20steps%20involved%20in%20using%20the%20Kaggle%20API%20from%20Python.&text=Go%20to%20your%20Kaggle%20account,json%20will%20be%20downloaded)

### Acknowledgement

In February 2020, the CovsirPhy project started on the Kaggle platform with the [COVID-19 data with SIR model](https://www.kaggle.com/lisphilar/covid-19-data-with-sir-model) notebook, using the following datasets.

- The number of cases (JHU) and linelist: [Novel Corona Virus 2019 Dataset by SRK](https://www.kaggle.com/sudalairajkumar/novel-corona-virus-2019-dataset)
- Population in each country:  [covid19 global forecasting: locations population by Dmitry A. Grechka](https://www.kaggle.com/dgrechka/covid19-global-forecasting-locations-population)
- The number of cases in Japan: [COVID-19 dataset in Japan by Lisphilar](https://www.kaggle.com/lisphilar/covid19-dataset-in-japan)

Best Regards.


## 3. Generator of sample data with SIR-derived ODE model
CovsirPhy can generate sample data with subclasses of `ODEModel` and the `Dynamics` class, as described in the following subsections.

### 3-1. Sample data of one-phase ODE model
Regarding ODE models, please refer to **[TBC]**. Here, we will create sample data with a one-phase SIR model, a tau value of 1440 min, the first date 01Jan2022, and the last date 30Jun2022. ODE parameter values are preset.

In [None]:
# Create solver with preset
model = cs.SIRModel.from_sample(date_range=("01Jan2022", "30Jun2022"), tau=1440)
# Show settings
pprint(model.settings())

Solve the ODE model with `ODEModel().solve()` method.

In [None]:
one_df = model.solve()
display(one_df.head())
display(one_df.tail())

Plot the time-series data.

In [None]:
cs.line_plot(one_df, title=f"Sample data of SIR model {model.settings()['param_dict']}")
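For readers who want to see what solving a one-phase SIR model involves, the following is a minimal standard-library sketch using the classical Runge-Kutta method. It is not the solver `covsirphy` uses. The parameter values rho=0.2 and sigma=0.075 follow the preset values quoted in this notebook, while the initial values (999,000 susceptible, 1,000 infected, 0 removed) and the reading of one step as one day (tau = 1440 min) are assumptions made for illustration.

```python
def sir_derivatives(state, rho, sigma):
    """Right-hand side of the SIR equations for state = (S, I, R)."""
    s, i, r = state
    population = s + i + r
    new_infections = rho * s * i / population
    removals = sigma * i
    return (-new_infections, new_infections - removals, removals)


def rk4_step(state, rho, sigma, dt=1.0):
    """Advance the state by one step of size dt with the classical Runge-Kutta method."""
    def shift(slope, factor):
        return tuple(x + factor * k for x, k in zip(state, slope))

    k1 = sir_derivatives(state, rho, sigma)
    k2 = sir_derivatives(shift(k1, dt / 2), rho, sigma)
    k3 = sir_derivatives(shift(k2, dt / 2), rho, sigma)
    k4 = sir_derivatives(shift(k3, dt), rho, sigma)
    return tuple(
        x + dt / 6 * (a + 2 * b + 2 * c + d)
        for x, a, b, c, d in zip(state, k1, k2, k3, k4)
    )


def simulate_sir(initial, rho, sigma, n_steps):
    """Return the list of (S, I, R) states from step 0 to step n_steps."""
    trajectory = [tuple(float(x) for x in initial)]
    for _ in range(n_steps):
        trajectory.append(rk4_step(trajectory[-1], rho, sigma))
    return trajectory


# Assumed initial values; 180 steps cover 01Jan2022 to 30Jun2022 when one step is one day
trajectory = simulate_sir(initial=(999_000, 1_000, 0), rho=0.2, sigma=0.075, n_steps=180)
peak_step = max(range(len(trajectory)), key=lambda k: trajectory[k][1])
print(f"Number of records: {len(trajectory)}, infected peaks at step {peak_step}")
```

The total S + I + R stays constant along the trajectory because the three derivatives sum to zero, which is a quick sanity check for any SIR solver.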

### 3-2. Sample data of multi-phase ODE model
Regarding multi-phase ODE models, please refer to [Phase-dependent SIR models](https://lisphilar.github.io/covid19-sir/04_phase_dependent.html). Here, we will create sample data with a two-phase SIR model, a tau value of 1440 min, the first date 01Jan2022, and the last date 30Jun2022.

The 0th phase: 01Jan2022 - 28Feb2022, rho=0.2, sigma=0.075 (preset)  
The 1st phase: 01Mar2022 - 30Jun2022, **rho=0.4**, sigma=0.075

We will use `Dynamics` class. First, set the first/last dates of the dynamics and the 0th phase ODE parameters.

In [None]:
dyn = cs.Dynamics.from_sample(model=cs.SIRModel, date_range=("01Jan2022", "30Jun2022"))
# Show summary
dyn.summary()

Add the 1st phase with `Dynamics.register()` method.

In [None]:
setting_df = dyn.register()
setting_df.loc["01Mar2022": "30Jun2022", ["rho", "sigma"]] = [0.4, 0.075]
setting_df

In [None]:
dyn.register(data=setting_df)
# Show summary
dyn.summary()
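The idea behind the phase settings, one set of ODE parameter values per date, can be sketched in plain Python. This is not how `Dynamics` stores its settings internally; the `phases` list and the `build_schedule` helper are hypothetical names, and only the dates and parameter values come from the two phases defined above.

```python
from datetime import datetime, timedelta

DATE_FORMAT = "%d%b%Y"  # e.g. "01Mar2022"

# (first date, last date, ODE parameter values) for each phase
phases = [
    ("01Jan2022", "28Feb2022", {"rho": 0.2, "sigma": 0.075}),
    ("01Mar2022", "30Jun2022", {"rho": 0.4, "sigma": 0.075}),
]


def build_schedule(phases):
    """Return a dict mapping each date in the phases to its parameter values."""
    schedule = {}
    for start, end, params in phases:
        day = datetime.strptime(start, DATE_FORMAT).date()
        last = datetime.strptime(end, DATE_FORMAT).date()
        while day <= last:
            schedule[day] = dict(params)
            day += timedelta(days=1)
    return schedule


schedule = build_schedule(phases)

# Count how many dates use each rho value
days_per_rho = {}
for params in schedule.values():
    days_per_rho[params["rho"]] = days_per_rho.get(params["rho"], 0) + 1
print(len(schedule), days_per_rho)
```

A simulation can then look up the parameter values for each date, so the change of rho on 01Mar2022 takes effect without any special handling of phase boundaries.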

Solve the ODE model with `Dynamics().simulate()` method and plot the time-series data.

In [None]:
two_df = dyn.simulate(model_specific=True)
cs.line_plot(two_df, title="Sample data of two-phase SIR model", v=["01Mar2022"])

When we need to convert model-specific variables to model-free variables (Susceptible/Infected/Fatal/Recovered), we set `model_specific=False` (default).
Because R="Fatal or Recovered" in the SIR model, we assume that R="Recovered" and F=0.

In [None]:
two_df = dyn.simulate(model_specific=False)
cs.line_plot(two_df, title="Sample data of two-phase SIR model with SIRF variables", v=["01Mar2022"])
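The conversion rule for the SIR model can be written down as a tiny sketch. The function `to_model_free`, the variable name "Fatal or Recovered" as a dictionary key, and the numbers in `sir_record` are assumptions for illustration; the actual conversion is done inside `Dynamics().simulate()`.

```python
def to_model_free(record):
    """Map SIR variables to Susceptible/Infected/Fatal/Recovered, assuming R="Recovered" and F=0."""
    return {
        "Susceptible": record["Susceptible"],
        "Infected": record["Infected"],
        "Fatal": 0,
        "Recovered": record["Fatal or Recovered"],
    }


# One made-up record with model-specific variables
sir_record = {"Susceptible": 990_000, "Infected": 8_000, "Fatal or Recovered": 2_000}
sirf_record = to_model_free(sir_record)
print(sirf_record)
```

The total number of people is unchanged by the conversion; only the labels of the compartments differ.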

In practice, the observable variables are Population/Confirmed/Fatal/Recovered. We can calculate Population and Confirmed as follows.

- Confirmed = Infected + Fatal + Recovered
- Population = Susceptible + Confirmed

In [None]:
real_df = two_df.copy()
real_df["Confirmed"] = real_df[["Infected", "Fatal", "Recovered"]].sum(axis=1)
real_df["Population"] = real_df[["Susceptible", "Confirmed"]].sum(axis=1)
real_df = real_df.loc[:, ["Population", "Confirmed", "Recovered", "Fatal"]]
cs.line_plot(real_df, title="Sample data of two-phase SIR model with observable variables", v=["01Mar2022"])

Thank you!