# Loading the Open COVID-19 Dataset
This very short notebook shows how to load the [Open COVID-19 dataset](https://github.com/open-covid-19/data) and walks through a few commonly performed operations.

First, loading the data is very simple with `pandas`. We can use the CSV or the JSON file to download the entire Open COVID-19 dataset in a single step:

In [1]:
import pandas as pd

# Load CSV data directly from the URL with pandas
data = pd.read_csv('https://open-covid-19.github.io/data/data.csv')

# Alternatively load the JSON data, which should be identical
data_json = pd.read_json('https://open-covid-19.github.io/data/data.json')
assert len(data) == len(data_json)

# Print a small snippet of the dataset
print('The dataset currently contains %d records, here are the last few:' % len(data))
data.tail()

The dataset currently contains 8798 records, here are the last few:


| | Date | CountryCode | CountryName | RegionCode | RegionName | Confirmed | Deaths | Latitude | Longitude | Population |
|---|---|---|---|---|---|---|---|---|---|---|
| 8793 | 2020-03-23 | CN | China | SD | Shandong | 767 | 7.0 | 36.3427 | 118.1498 | |
| 8794 | 2020-03-23 | CN | China | SH | Shanghai | 404 | 4.0 | 31.202 | 121.4491 | |
| 8795 | 2020-03-23 | CN | China | ZJ | Zhejiang | 1238 | 1.0 | 29.1832 | 120.0934 | |
| 8796 | 2020-03-23 | ES | Spain | | | 28572 | 1720.0 | 40.463667 | -3.74922 | 46736776.0 |
| 8797 | 2020-03-23 | IT | Italy | | | 59138 | 5476.0 | 41.87194 | 12.56738 | 60550075.0 |


### Looking at country-level data
Some records contain country-level data, that is, data aggregated over an entire country. Other records contain region-level data, which describe a subdivision of a country; for example, a Chinese province or a US state.

To filter only country-level data from the dataset, look for records that have a null value for the region:

In [2]:
# Look for rows with null RegionCode
countries = data[data['RegionCode'].isna()]

# We no longer need the region-level columns
countries = countries.drop(columns=['RegionCode', 'RegionName'])

countries.tail()

| | Date | CountryCode | CountryName | Confirmed | Deaths | Latitude | Longitude | Population |
|---|---|---|---|---|---|---|---|---|
| 8777 | 2020-03-22 | ZA | South Africa | 240 | 0.0 | -30.559482 | 22.937506 | 58558270.0 |
| 8778 | 2020-03-22 | ZM | Zambia | 2 | 0.0 | -13.133897 | 27.849332 | 17861030.0 |
| 8779 | 2020-03-22 | ZW | Zimbabwe | 2 | 0.0 | -19.015438 | 29.154857 | 14645468.0 |
| 8796 | 2020-03-23 | ES | Spain | 28572 | 1720.0 | 40.463667 | -3.74922 | 46736776.0 |
| 8797 | 2020-03-23 | IT | Italy | 59138 | 5476.0 | 41.87194 | 12.56738 | 60550075.0 |


### Looking at region-level data
Conversely, to filter region-level data for a specific country, we need to look for records where the region columns have non-null values. The following snippet extracts data related to Spain's subregions from the dataset:

In [3]:
# Filter records that have the right country code AND a non-null region code
spain = data[(data['CountryCode'] == 'ES') & data['RegionCode'].notna()]

spain.tail()

| | Date | CountryCode | CountryName | RegionCode | RegionName | Confirmed | Deaths | Latitude | Longitude | Population |
|---|---|---|---|---|---|---|---|---|---|---|
| 8602 | 2020-03-22 | ES | Spain | ML | Melilla | 25 | 0.0 | 35.2937 | -2.9383 | |
| 8603 | 2020-03-22 | ES | Spain | NC | Navarra | 794 | 14.0 | 42.8169 | -1.6432 | |
| 8604 | 2020-03-22 | ES | Spain | PV | País Vasco | 2097 | 97.0 | 43.2627 | -2.9253 | |
| 8605 | 2020-03-22 | ES | Spain | RI | La Rioja | 654 | 18.0 | 42.4667 | -2.45 | |
| 8606 | 2020-03-22 | ES | Spain | VC | Comunidad Valenciana | 1604 | 69.0 | 39.4697 | -0.3774 | |
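
Region-level records are often easier to compare once they are reshaped into one column per region. The following is a rough sketch of that reshaping using `DataFrame.pivot`; the helper name `confirmed_by_region` is hypothetical, the sample values are made up, and the code assumes each (date, region) pair appears at most once.

```python
import pandas as pd

def confirmed_by_region(data: pd.DataFrame, country_code: str) -> pd.DataFrame:
    """Build a Date x RegionCode table of confirmed cases for one country."""
    # Same filter as above: matching country code AND a non-null region code
    regions = data[(data['CountryCode'] == country_code) & data['RegionCode'].notna()]
    # pivot raises a ValueError if a (Date, RegionCode) pair appears twice
    return regions.pivot(index='Date', columns='RegionCode', values='Confirmed')

# Made-up illustrative values, NOT real figures from the dataset
sample_regions = pd.DataFrame({
    'Date': ['2020-03-21', '2020-03-21', '2020-03-22', '2020-03-22', '2020-03-22'],
    'CountryCode': ['XX'] * 5,
    'RegionCode': ['A', 'B', 'A', 'B', None],  # None marks the country-level row
    'Confirmed': [3, 5, 4, 9, 13],
})
table = confirmed_by_region(sample_regions, 'XX')
print(table)
```

With the real data, `confirmed_by_region(data, 'ES')` would be the equivalent call for Spain.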


### Data consistency
Often, region-level data and country-level data come from different sources. As a result, the regional numbers may not add up exactly to the country total, and the dates may be misaligned (the data for a region may be reported sooner or later than the data for the whole country). However, the country-level data and the region-level data will each *always* be self-consistent.
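
To see how large the mismatch is for a given country, one option is to sum the region-level counts per date and place them next to the country-level counts. This is only a sketch: the helper name `compare_region_totals` is hypothetical, the sample values are made up, and it assumes there is at most one country-level row per date.

```python
import pandas as pd

def compare_region_totals(data: pd.DataFrame, country_code: str) -> pd.DataFrame:
    """Compare summed region-level counts with country-level counts, per date."""
    subset = data[data['CountryCode'] == country_code]
    is_region = subset['RegionCode'].notna()

    # Sum confirmed cases over all regions for each date
    region_sum = subset[is_region].groupby('Date')['Confirmed'].sum().rename('RegionSum')

    # Country-level records have a null RegionCode (one row per date is assumed)
    country_total = subset[~is_region].set_index('Date')['Confirmed'].rename('CountryTotal')

    # An outer join keeps dates that appear in only one of the two series
    result = pd.concat([country_total, region_sum], axis=1, join='outer').sort_index()
    result['Difference'] = result['CountryTotal'] - result['RegionSum']
    return result

# Made-up illustrative values, NOT real figures from the dataset
sample = pd.DataFrame({
    'Date': ['2020-03-21', '2020-03-21', '2020-03-21', '2020-03-22'],
    'CountryCode': ['XX', 'XX', 'XX', 'XX'],
    'RegionCode': [None, 'A', 'B', None],
    'Confirmed': [100, 60, 35, 150],
})
comparison = compare_region_totals(sample, 'XX')
print(comparison)
```

A non-zero `Difference` points to numbers that do not add up, while a missing value in either column points to a date that only one of the two levels has reported.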