# Data preparation
The first step of data science is data preparation. `covsirphy` provides the following three features for that.

1. Downloading datasets from recommended data servers
2. Reading `pandas.DataFrame`
3. Generator of sample data with SIR-derived ODE model

In [None]:
# For 2.
import pandas as pd
# !pip install covsirphy --upgrade
import covsirphy as cs
cs.__version__

## 1. Downloading datasets from recommended data servers
We will download datasets from the following recommended data servers.

- **COVID-19 Data Hub, https://covid19datahub.io/**
    - Guidotti, E., Ardia, D., (2020), “COVID-19 Data Hub”, Journal of Open Source Software 5(51):2376, doi: 10.21105/joss.02376.
    - The number of cases (JHU style)
    - Population values in each country/province
    - [Government Response Tracker (OxCGRT)](https://github.com/OxCGRT/covid-policy-tracker)
    - The number of tests
- **Our World In Data, https://github.com/owid/covid-19-data/tree/master/public/data**
    - Hasell, J., Mathieu, E., Beltekian, D. et al. A cross-country database of COVID-19 testing. Sci Data 7, 345 (2020). https://doi.org/10.1038/s41597-020-00688-8
    - The number of tests
    - The number of vaccinations
    - The number of people who received vaccinations
- **COVID-19 Open Data by Google Cloud Platform, https://github.com/GoogleCloudPlatform/covid-19-open-data**
    - O. Wahltinez and others (2020), COVID-19 Open-Data: curating a fine-grained, global-scale data repository for SARS-CoV-2, Work in progress, https://goo.gle/covid-19-open-data
    - Percentage of visits relative to baseline
    - Note: Please refer to [Google Terms of Service](https://policies.google.com/terms) in advance.
- **World Bank Open Data, https://data.worldbank.org/**
    - World Bank Group (2020), World Bank Open Data, https://data.worldbank.org/
    - Population pyramid
- **Datasets for CovsirPhy, https://github.com/lisphilar/covid19-sir/tree/master/data**
    - Hirokazu Takaya (2020-2022), GitHub repository, COVID-19 dataset in Japan, https://github.com/lisphilar/covid19-sir/tree/master/data.
    - The number of cases in Japan (total/prefectures)
    - Metadata regarding Japan prefectures

***

How to request a new data loader:  
If you want to use a new dataset for your analysis, please kindly inform us using [GitHub Issues: Request new method of DataLoader class](https://github.com/lisphilar/covid19-sir/issues/new/?template=request-new-method-of-dataloader-class.md). Please read [Guideline of contribution](https://lisphilar.github.io/covid19-sir/CONTRIBUTING.html) in advance.

### 1-1. With `DataEngineer` class
The quickest way to download data from the recommended data servers is `DataEngineer().download()`.

In [None]:
eng = cs.DataEngineer()
eng.download()

We can get all downloaded records as a `pandas.DataFrame` with the `DataEngineer().all()` method.

In [None]:
all_df = eng.all()
# Overview of the records
all_df.info()

`DataEngineer.citations()` shows citations of the datasets.

In [None]:
eng.citations()

Note that, by default, `DataEngineer().download()` collects country-level data and saves the datasets as CSV files in the "input" folder (the `directory` argument of `DataEngineer()`) of the current directory. If the saved CSV files were last modified within the last 12 hours (the `update_interval` argument of `DataEngineer()`), they will be used as the database.
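
The caching rule can be illustrated with the standard library alone. The following is a minimal sketch of the idea, not the actual implementation in `covsirphy`; the helper name `is_fresh` and the file names are hypothetical, and the 12-hour threshold simply mirrors the default `update_interval` mentioned above.

```python
import os
import tempfile
import time
from pathlib import Path


def is_fresh(path, update_interval=12):
    """Return True if `path` exists and was modified within the last `update_interval` hours."""
    path = Path(path)
    if not path.exists():
        return False
    age_hours = (time.time() - path.stat().st_mtime) / 3600
    return age_hours <= update_interval


# Demonstration with a temporary file standing in for a saved CSV file
with tempfile.TemporaryDirectory() as tmp:
    csv_path = Path(tmp) / "example.csv"  # hypothetical file name
    csv_path.write_text("Date,Confirmed\n2022-01-01,1\n")
    # The file was just written, so it counts as fresh
    fresh_now = is_fresh(csv_path)
    # Pretend the file was last modified 13 hours ago
    old = time.time() - 13 * 3600
    os.utime(csv_path, (old, old))
    fresh_later = is_fresh(csv_path)
    # A file that does not exist is never fresh
    missing = is_fresh(Path(tmp) / "missing.csv")

print(fresh_now, fresh_later, missing)
```

With a check like this, a loader could reuse the local file when it is fresh and download again otherwise.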

For some countries (e.g. Japan), province/state/prefecture-level data is available, and we can download it as follows.

In [None]:
eng_jpn = cs.DataEngineer()
eng_jpn.download(country="Japan")

For some countries (e.g. USA), city-level data is available, and we can download it as follows.

In [None]:
eng_alabama = cs.DataEngineer()
eng_alabama.download(country="USA", province="Alabama")

### 1-2. With `DataDownloader` class
The `DataEngineer` class is recommended because it also provides data cleaning methods and more, but we can use the `DataDownloader` class if we only need to download data.

In [None]:
dl = cs.DataDownloader()
dl_df = dl.layer(country=None, province=None)

In [None]:
# Overview of the records
dl_df.info()

In [None]:
# Citations
dl.citations()

## 2. Reading `pandas.DataFrame`
We may need to use our own datasets for analysis when they are not included in the recommended data servers. `DataEngineer().register()` registers new datasets in `pandas.DataFrame` format.

First, we will prepare the new dataset as a `pandas.DataFrame`. Just as a demonstration, we use [COVID-19 dataset in Japan](https://github.com/lisphilar/covid19-sir/tree/master/data). (Note that this dataset is included in the recommended servers, so the following is usually unnecessary.)

Hirokazu Takaya (2020-2022), COVID-19 dataset in Japan, GitHub repository, https://github.com/lisphilar/covid19-sir/data/japan

Country-level data:

In [None]:
c_url = "https://raw.githubusercontent.com/lisphilar/covid19-sir/master/data/japan/covid_jpn_total.csv"
c_df = pd.read_csv(c_url, dayfirst=False)
# Check columns of the pandas.DataFrame
c_df.head()

Prefecture-level data:

In [None]:
p_url = "https://raw.githubusercontent.com/lisphilar/covid19-sir/master/data/japan/covid_jpn_prefecture.csv"
p_df = pd.read_csv(p_url, dayfirst=False)
# Check columns of the pandas.DataFrame
p_df.head()

Create a `DataEngineer` instance, specifying the layers of location names. `c_df` has a "Location" layer and `p_df` has a "Prefecture" layer.

In [None]:
print(c_df.columns)
print(p_df.columns)
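
To see which column acts as the location layer in each dataset, we can compare the column names directly. The following is a small sketch with toy stand-ins for `c_df` and `p_df`; the column names (`Date`, `Location`, `Prefecture`, `Positive`) are assumptions modelled on the CSV files above, and the values are made up for illustration only.

```python
import pandas as pd

# Toy stand-ins for c_df and p_df (assumed column names, made-up values)
c_toy = pd.DataFrame({
    "Date": ["2022-01-01", "2022-01-02"],
    "Location": ["Domestic", "Domestic"],
    "Positive": [10, 12],
})
p_toy = pd.DataFrame({
    "Date": ["2022-01-01", "2022-01-01"],
    "Prefecture": ["Tokyo", "Osaka"],
    "Positive": [6, 4],
})

# Columns present in only one frame point to the location layer of that frame
c_only = sorted(set(c_toy.columns) - set(p_toy.columns))
p_only = sorted(set(p_toy.columns) - set(c_toy.columns))
print(c_only, p_only)

# Number of records per location name in each layer
c_counts = c_toy["Location"].value_counts().to_dict()
p_counts = p_toy["Prefecture"].value_counts().to_dict()
print(c_counts, p_counts)
```

The same comparison can be run on the real `c_df` and `p_df` once they are loaded, provided their columns match the assumed names.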

In [None]:
# 

## 3. Generator of sample data with SIR-derived ODE model