# Usage: exploratory data analysis

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/lisphilar/covid19-sir/blob/master/example/usage_dataset.ipynb)

Here, we will review the datasets downloaded and cleaned with the `DataLoader` class. Methods of this class produce the following class instances.

1. `JHUData`: the number of confirmed/infected/fatal/recovered cases
1. `OxCGRTData`: indicators of government responses (OxCGRT)
1. `PCRData`: the number of tests
1. `VaccineData`: the number of vaccinations, people vaccinated
1. `MobilityData`: percentage to baseline in visits
1. `PyramidData`: population pyramid
1. `JapanData`: Japan-specific dataset

If you want to use a new dataset for your analysis, please kindly inform us with [GitHub Issues: Request new method of DataLoader class](https://github.com/lisphilar/covid19-sir/issues/new/?template=request-new-method-of-dataloader-class.md).

Note:  
`LinelistData` (linelist of case reports) was deprecated with [issue #866](https://github.com/lisphilar/covid19-sir/issues/866) at development version 2.22.0.

Note:  
`PopulationData` (population values) was deprecated with [issue #904](https://github.com/lisphilar/covid19-sir/issues/904) at development version 2.22.0.

In this notebook, we review the cleaned datasets one by one and visualize them.

## Preparation

Import the packages.

In [None]:
# !pip install covsirphy --upgrade
from pprint import pprint
import covsirphy as cs
cs.__version__

Data cleaning classes will be produced with methods of `DataLoader` class. Please specify the directory to save CSV files when creating `DataLoader` instance. The default value of `directory` is "input" and we will set "../input" here.

Note:  
Please find the details of `DataLoader` at [Usage: data loading](https://lisphilar.github.io/covid19-sir/markdown/LOADING.html).

In [None]:
# Create DataLoader instance
loader = cs.DataLoader("../input")

Usage of methods will be explained in the following sections. If you want to download all datasets with copy & paste, please refer to [Dataset preparation](https://lisphilar.github.io/covid19-sir/markdown/INSTALLATION.html#dataset-preparation).

## The number of cases (JHU style)

The main data for analysis is the number of cases. The `JHUData` class, created with the `DataLoader.jhu()` method, holds the number of confirmed/fatal/recovered cases. The number of infected cases is calculated as "Confirmed - Recovered - Fatal" during data cleaning.
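
The calculation of infected cases can be illustrated with a minimal sketch. Plain dictionaries stand in for DataFrame rows here, the values are hypothetical, and `add_infected` is an illustrative helper that is not part of covsirphy.

```python
# Hypothetical cumulative counts for three consecutive days (not real data)
records = [
    {"Confirmed": 100, "Fatal": 2, "Recovered": 10},
    {"Confirmed": 150, "Fatal": 3, "Recovered": 30},
    {"Confirmed": 210, "Fatal": 5, "Recovered": 70},
]


def add_infected(row):
    """Return a copy of the row with Infected = Confirmed - Recovered - Fatal."""
    infected = row["Confirmed"] - row["Recovered"] - row["Fatal"]
    return {**row, "Infected": infected}


cleaned = [add_infected(row) for row in records]
for row in cleaned:
    print(row)
```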

In [None]:
# Create instance
jhu_data = loader.jhu()

In [None]:
# Check type
type(jhu_data)

`JHUData.citation` property shows the description of this dataset.

In [None]:
print(jhu_data.citation)

The detailed citation list is saved in the `DataLoader.covid19dh_citation` property. This is not a property of `JHUData`. Because many links are included, they will not be shown in this tutorial.

In [None]:
# Detailed citations (string)
# loader.covid19dh_citation

We can check the raw data with `JHUData.raw` property.

In [None]:
jhu_data.raw.tail()

The cleaned dataset is here.

In [None]:
jhu_data.cleaned().tail()

As you may have noticed, the records are returned as a pandas DataFrame. Because the last rows hold the latest values, `pandas.DataFrame.tail()` was used to review them.

Check the data types and memory usage as follows.

In [None]:
jhu_data.cleaned().info()

Note that the date column has the `datetime64[ns]` dtype, area names have the `category` dtype, and the numbers of cases have the `int64` dtype.

### Total number of cases in all countries

`JHUData.total()` returns the total number of cases in all countries. Fatality and recovery rates are added as columns.
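
As a rough illustration of what such rate columns can mean, the sketch below derives ratios from cumulative totals. The column names and the choice of denominators are assumptions made for this example, not a statement of how covsirphy computes them, and `add_rates` is a hypothetical helper.

```python
def add_rates(row):
    """Return a copy of the row with ratio columns added (None when undefined)."""

    def ratio(numerator, denominator):
        return numerator / denominator if denominator else None

    # Cases with a known outcome (assumed definition for this sketch)
    closed = row["Fatal"] + row["Recovered"]
    return {
        **row,
        "Fatal per Confirmed": ratio(row["Fatal"], row["Confirmed"]),
        "Recovered per Confirmed": ratio(row["Recovered"], row["Confirmed"]),
        "Fatal per (Fatal or Recovered)": ratio(row["Fatal"], closed),
    }


# Hypothetical worldwide totals on one date (not real data)
total_row = {"Confirmed": 200, "Fatal": 4, "Recovered": 96}
rates = add_rates(total_row)
print(rates)
```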

In [None]:
total_df = jhu_data.total()
# Show the oldest data
display(total_df.loc[total_df["Confirmed"] > 0].head())
# Show the latest data
display(total_df.tail())

The first case (registered in the dataset) was recorded on 07Jan2020. The COVID-19 outbreak is still ongoing.

We can create line plots with `covsirphy.line_plot()` function.

In [None]:
cs.line_plot(total_df[["Infected", "Fatal", "Recovered"]], "Total number of cases over time")

Statistics of fatality and recovery rate are here.

In [None]:
total_df.loc[:, total_df.columns.str.contains("per")].describe().T

### Subset for area

`JHUData.subset()` creates a subset for a specific area. We can select a country name and a province name. In this tutorial, "Japan" and "Tokyo in Japan" will be used. Please replace them with your own country/province names.

Subset for a country:   
We can use either country names or ISO3 codes.

In [None]:
# Specify country name
df, complement = jhu_data.records("Japan")
# Or, specify ISO3 code
# df, complement = jhu_data.records("JPN")
# Show records
display(df.tail())
# Show details of complement
print(complement)

Complement of records was performed. The second returned value is the description of the complement. Details will be explained later, and we can skip the complement with the `auto_complement=False` argument. Alternatively, just use the `JHUData.subset()` method when the second returned value (`False` because no complement was performed) is unnecessary.

In [None]:
# Skip complement
df, complement = jhu_data.records("Japan", auto_complement=False)
# Or,
# df = jhu_data.subset("Japan")
display(df.tail())
# Show complement (False because not complemented)
print(complement)

Subset for a province (called "prefecture" in Japan):

In [None]:
df, _ = jhu_data.records("Japan", province="Tokyo")
df.tail()

The list of countries can be checked with `JHUData.countries()` as follows.

In [None]:
pprint(jhu_data.countries(), compact=True)

### Complement

`JHUData.records()` automatically complements the records if necessary and if `auto_complement=True` (default). Each area can have no complement, one complement, or multiple complements, depending on the records and their preprocessing analysis.

We can show the specific kind of complements that were applied to the records of each country with `JHUData.show_complement()` method. The possible kinds of complement for each country are the following:

1. “Monotonic_confirmed/fatal/recovered” (monotonic increasing complement)
    Force the variable to be monotonically increasing.
2. “Full_recovered” (full complement of recovered data)
    Estimate the number of recovered cases using the value of estimated average recovery period.
3. “Partial_recovered” (partial complement of recovered data)
    When recovered values are not updated for some days, extrapolate the values.
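
To make the idea of a monotonic increasing complement concrete, here is a minimal sketch that replaces each value with the running maximum. This is only one simple way to enforce monotonicity, chosen for illustration. The algorithm used by `JHUData` may differ, and `force_monotonic` is a hypothetical helper.

```python
from itertools import accumulate


def force_monotonic(values):
    """Replace each value with the running maximum so the series never decreases."""
    return list(accumulate(values, max))


# Hypothetical cumulative counts with two downward corrections (not real data)
reported = [10, 12, 11, 15, 14, 20]
complemented = force_monotonic(reported)
print(complemented)  # [10, 12, 12, 15, 15, 20]
```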

Note:  
"Recovery period" will be discussed in the next subsection.

For `JHUData.show_complement()`, we can specify country names and province names.

In [None]:
# Specify country name
jhu_data.show_complement(country="Japan")
# Or, specify country and province name
# jhu_data.show_complement(country="Japan", province="Tokyo")

When a list is applied to the `country` argument, all specified countries will be shown. If `None`, all registered countries will be used.

In [None]:
# Specify country names
jhu_data.show_complement(country=["Greece", "Japan"])
# Or, apply None
# jhu_data.show_complement(country=None)

If complement was performed incorrectly or you need new algorithms, kindly let us know via [issue page](https://github.com/lisphilar/covid19-sir/issues).

### Recovery period

We define "recovery period" as the time period between case confirmation and recovery (as it is subjectively defined per country). With the global case records, we estimate the average recovery period using `JHUData.calculate_recovery_period()`.

In [None]:
recovery_period = jhu_data.calculate_recovery_period()
print(f"Average recovery period: {recovery_period} [days]")

What we currently do is to calculate the difference between confirmed cases and fatal cases and try to match it to some recovered cases value in the future. We apply this method for every country that has valid recovery data and average the partial recovery periods in order to obtain a single (average) recovery period. During the calculations, we ignore time intervals that lead to very short (<7 days) or very long (>90 days) partial recovery periods, if these exist with high frequency (>50%) in the records. We have to assume temporarily invariable compartments for this analysis to extract an approximation of the average recovery period.
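
The matching idea described above can be sketched for a single country as follows. This is a simplified illustration under stated assumptions: it always drops periods outside 7 to 90 days (rather than only when they are frequent), it handles one country instead of averaging over many, and `estimate_recovery_period` is a hypothetical function, not the library's implementation.

```python
def estimate_recovery_period(confirmed, fatal, recovered, lower=7, upper=90):
    """Estimate the average number of days from confirmation to recovery.

    For each day, take (Confirmed - Fatal) and find the first day on or after it
    when cumulative Recovered reaches that value. The gap is a partial period.
    """
    periods = []
    for day, (c, f) in enumerate(zip(confirmed, fatal)):
        target = c - f
        if target <= 0:
            continue
        match = next(
            (later for later in range(day, len(recovered)) if recovered[later] >= target),
            None,
        )
        if match is not None:
            periods.append(match - day)
    # Simplification: discard very short or very long partial periods outright
    valid = [p for p in periods if lower <= p <= upper]
    if not valid:
        return None
    return round(sum(valid) / len(valid))


# Hypothetical records: recovered counts trail confirmed counts by 10 days
confirmed = [10 * (i + 1) for i in range(30)]
fatal = [0] * 30
recovered = [0] * 10 + confirmed[:20]
period = estimate_recovery_period(confirmed, fatal, recovered)
print(f"Estimated recovery period: {period} [days]")
```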

Alternatively, we tried to use the linelist of case reports to get a precise value of the recovery period (the average of recovery date minus confirmation date over cases), but the number of records was too small.

### Visualize the number of cases at a timepoint

We can visualize the number of cases with the `JHUData.map()` method. When `country` is `None`, a global map will be shown.

Global map with country level data:

In [None]:
# Global map with country level data
jhu_data.map(country=None, variable="Infected")
# To include/exclude some countries
# jhu_data.map(country=None, variable="Infected", included=["Japan"])
# jhu_data.map(country=None, variable="Infected", excluded=["Japan"])
# To change the date
# jhu_data.map(country=None, variable="Infected", date="01Oct2021")

Values can be retrieved with `.layer()` method.

In [None]:
jhu_data.layer(country=None).tail()

Country map with province level data:

In [None]:
# Country map with province level data
jhu_data.map(country="Japan", variable="Infected")
# To include/exclude some provinces
# jhu_data.map(country="Japan", variable="Infected", included=["Tokyo"])
# jhu_data.map(country="Japan", variable="Infected", excluded=["Tokyo"])
# To change the date
# jhu_data.map(country="Japan", variable="Infected", date="01Oct2021")

Values are here.

In [None]:
jhu_data.layer(country="Japan").tail()

Note for Japan:  
Province "Entering" refers to cases that were confirmed upon entering Japan.

## OxCGRT indicators

Government responses are tracked with [Oxford Covid-19 Government Response Tracker (OxCGRT)](https://github.com/OxCGRT/covid-policy-tracker). Because government responses and the activities of people change the parameter values of SIR-derived models, this dataset is significant when we try to forecast the number of cases. The `OxCGRTData` class will be created with the `DataLoader.oxcgrt()` method.

In [None]:
oxcgrt_data = loader.oxcgrt()

In [None]:
type(oxcgrt_data)

Because records are retrieved via "COVID-19 Data Hub", as with `JHUData`, the data description and raw data are the same.

In [None]:
# Description
print(oxcgrt_data.citation)
# Raw
# oxcgrt_data.raw.tail()

The cleaned dataset is here.

In [None]:
oxcgrt_data.cleaned().tail()

### Subset for area

`OxCGRTData.subset()` creates a subset for a specific area. Only a country name can be selected. Note that province level data is not registered in `OxCGRTData`.

Subset for a country:   
We can use either country names or ISO3 codes.

In [None]:
oxcgrt_data.subset("Japan").tail()
# Or, with ISO3 code
# oxcgrt_data.subset("JPN").tail()

### Visualize indicator values

We can visualize indicator values with `.map()` method. Arguments are the same as `JHUData.map()`, but country name cannot be specified.

In [None]:
oxcgrt_data.map(variable="Stringency_index")

Values are here.

In [None]:
oxcgrt_data.layer().tail()

## The number of tests

The number of tests is also key information to understand the situation. `PCRData` class will be created with `DataLoader.pcr()` method.

In [None]:
pcr_data = loader.pcr()

In [None]:
type(pcr_data)

Because records are retrieved via "COVID-19 Data Hub", as with `JHUData`, the data description and raw data are the same.

In [None]:
# Description
print(pcr_data.citation)
# Raw
# pcr_data.raw.tail()

The cleaned dataset is here.

In [None]:
pcr_data.cleaned().tail()

### Subset for area

`PCRData.subset()` creates a subset for a specific area. We can select country name and province name. 

Subset for a country:   
We can use either country names or ISO3 codes.

In [None]:
pcr_data.subset("Japan").tail()
# Or, with ISO3 code
# pcr_data.subset("JPN").tail()

### Positive rate

Under the assumption that all tests were PCR tests, we can calculate the positive rate of PCR tests as "the number of confirmed cases per the number of tests" with the `PCRData.positive_rate()` method.
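
The definition above can be sketched as a ratio of daily new confirmed cases to daily new tests, expressed as a percentage. This is a minimal illustration with hypothetical values. Any smoothing or cleaning that `PCRData.positive_rate()` may apply is omitted, and the helper names are not part of covsirphy.

```python
def daily_diff(cumulative):
    """Convert cumulative counts to daily new counts."""
    return [after - before for before, after in zip(cumulative, cumulative[1:])]


def positive_rate(tests, confirmed):
    """Return daily positive rates [%]; None on days without new tests."""
    rates = []
    for new_tests, new_confirmed in zip(daily_diff(tests), daily_diff(confirmed)):
        rates.append(100 * new_confirmed / new_tests if new_tests > 0 else None)
    return rates


# Hypothetical cumulative counts for four consecutive days (not real data)
tests = [1000, 1500, 2500, 2500]
confirmed = [50, 75, 95, 95]
rate_list = positive_rate(tests, confirmed)
print(rate_list)  # [5.0, 2.0, None]
```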

In [None]:
pcr_data.positive_rate("Japan").tail()

### Visualize the number of tests

We can visualize the number of tests with `.map()` method. When country is None, global map will be shown. Arguments are the same as `JHUData`, but variable name cannot be specified.

Country level data:

In [None]:
pcr_data.map(country=None)

Values are here.

In [None]:
pcr_data.layer(country=None).tail()

Province level data:

In [None]:
pcr_data.map(country="Japan")

Values are here.

In [None]:
pcr_data.layer(country="Japan").tail()

## Vaccinations

Vaccination is a key factor in ending the outbreak as soon as possible. The `VaccineData` class will be created with the `DataLoader.vaccine()` method.

In [None]:
vaccine_data = loader.vaccine()

In [None]:
type(vaccine_data)

Description is here.

In [None]:
print(vaccine_data.citation)

Raw data is here.

In [None]:
vaccine_data.raw.tail()

The next is the cleaned dataset.

In [None]:
vaccine_data.cleaned().tail()

### Note for variables

Definitions of variables are as follows.

- Vaccinations: cumulative number of vaccinations
- Vaccinations_boosters: cumulative number of booster vaccinations
- Vaccinated_once: cumulative number of people who received at least one vaccine dose
- Vaccinated_full: cumulative number of people who received all doses prescribed by the protocol

Registered countries can be checked with `VaccineData.countries()` method.

In [None]:
pprint(vaccine_data.countries(), compact=True)

### Subset for area

`VaccineData.subset()` creates a subset for a specific area. Only a country name can be selected. Note that province level data is not registered.

Subset for a country:   
We can use either country names or ISO3 codes.

In [None]:
vaccine_data.subset("Japan").tail()
# Or, with ISO3 code
# vaccine_data.subset("JPN").tail()

### Visualize the number of vaccinations

We can visualize the number of vaccinations and the other variables with `.map()` method. Arguments are the same as `JHUData`, but country name cannot be specified.

In [None]:
vaccine_data.map()

Values are here.

In [None]:
vaccine_data.layer().tail()

## Mobility

The level of mobility is a key factor of $\rho$ (effective contact rate) in SIR-derived ODE models. The `MobilityData` class will be created with the `DataLoader.mobility()` method.

In [None]:
mobility_data = loader.mobility()

In [None]:
type(mobility_data)

Description is here.

In [None]:
print(mobility_data.citation)

Raw data is here.

In [None]:
mobility_data.raw.tail()

The next is the cleaned dataset.

In [None]:
mobility_data.cleaned().tail()

### Note for variables

Definitions of variables are as follows.

- Mobility_grocery_and_pharmacy (int): % to baseline in visits (grocery markets, pharmacies etc.)
- Mobility_parks (int): % to baseline in visits (parks etc.)
- Mobility_transit_stations (int): % to baseline in visits (public transport hubs etc.)
- Mobility_retail_and_recreation (int): % to baseline in visits (restaurants, museums etc.)
- Mobility_residential (int): % to baseline in visits (places of residence)
- Mobility_workplaces (int): % to baseline in visits (places of work)
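
To show how a "% to baseline" value can be read, the sketch below expresses a number of visits as an integer percentage of a baseline number of visits, so that 100 means "same as baseline". The visit counts are hypothetical and `percent_to_baseline` is an illustrative helper. How the upstream source publishes its values (for example, as a change from baseline) is not covered here.

```python
def percent_to_baseline(visits, baseline):
    """Express visits as an integer percentage of the baseline number of visits."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return round(100 * visits / baseline)


# Hypothetical visit counts on one day against a baseline of 200 visits (not real data)
baseline_visits = 200
observed = {"Mobility_workplaces": 150, "Mobility_parks": 220}
mobility = {name: percent_to_baseline(v, baseline_visits) for name, v in observed.items()}
print(mobility)  # {'Mobility_workplaces': 75, 'Mobility_parks': 110}
```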

Registered countries can be checked with `MobilityData.countries()` method.

In [None]:
pprint(mobility_data.countries(), compact=True)

### Subset for area

`MobilityData.subset()` creates a subset for a specific area (country/province).

Subset for a country:
We can use either country names or ISO3 codes.

In [None]:
mobility_data.subset("Japan").tail()
# Or, with ISO3 code
# mobility_data.subset("JPN").tail()

### Visualize mobility data

We can visualize the levels of mobility with `MobilityData.map()` method. Arguments are the same as `JHUData`.

In [None]:
mobility_data.map(country=None)

Values are here.

In [None]:
mobility_data.layer().tail()

## Population pyramid

With the population pyramid, we can divide the population into sub-groups. This will be useful when we analyse the meaning of parameters. For example, the number of days people go out differs between the sub-groups. The `PyramidData` class will be created with the `DataLoader.pyramid()` method.

In [None]:
pyramid_data = loader.pyramid()

In [None]:
type(pyramid_data)

Description is here.

In [None]:
print(pyramid_data.citation)

The raw dataset is not registered. A subset will be retrieved when `PyramidData.subset()` is called.

In [None]:
pyramid_data.subset("Japan").tail()

"Per_total" is the proportion of the age group in the total population.

## Japan-specific dataset

This includes the number of confirmed/infected/fatal/recovered/tests/moderate/severe cases at country/prefecture level and metadata of each prefecture (province). `JapanData` class will be created with `DataLoader.japan()` method.

In [None]:
japan_data = loader.japan()

In [None]:
type(japan_data)

Description is here.

In [None]:
print(japan_data.citation)

The next is the cleaned dataset.

In [None]:
japan_data.cleaned().tail()

### Visualize values

We can visualize the values with `.map()` method. Arguments are the same as `JHUData`.

In [None]:
japan_data.map(variable="Severe")

Values are here.

In [None]:
japan_data.layer(country="Japan").tail()

Map with country level data is not prepared, but country level data can be retrieved.

In [None]:
japan_data.layer(country=None).tail()

### Metadata

Additionally, `JapanData.meta()` retrieves metadata for Japanese prefectures.

In [None]:
japan_data.meta().tail()