# Assess Covid-19 data
***

In [1]:
from datetime import date

import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
%matplotlib inline

import seaborn as sns
sns.set()

# allow web access for downloading (this disables HTTPS certificate verification): https://stackoverflow.com/a/60671292
import ssl
ssl._create_default_https_context = ssl._create_unverified_context

%load_ext autoreload
%autoreload 2

## GLOBAL DATA
***
#### Load Data

In [54]:
#df = pd.read_csv('../data/raw/global_confirmed.csv')
#df = pd.read_csv('../data/raw/global_deaths.csv')
df = pd.read_csv('../data/raw/global_recovered.csv')

In [55]:
df.head()

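| | Province/State | Country/Region | Lat | Long | 1/22/20 | 1/23/20 | 1/24/20 | 1/25/20 | 1/26/20 | 1/27/20 | ... | 4/28/20 | 4/29/20 | 4/30/20 | 5/1/20 | 5/2/20 | 5/3/20 | 5/4/20 | 5/5/20 | 5/6/20 | 5/7/20 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | NaN | Afghanistan | 33.0 | 65.0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 228 | 252 | 260 | 310 | 331 | 345 | 397 | 421 | 458 | 468 |
| 1 | NaN | Albania | 41.1533 | 20.1683 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 431 | 455 | 470 | 488 | 519 | 531 | 543 | 570 | 595 | 605 |
| 2 | NaN | Algeria | 28.0339 | 1.6596 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 1651 | 1702 | 1779 | 1821 | 1872 | 1936 | 1998 | 2067 | 2197 | 2323 |
| 3 | NaN | Andorra | 42.5063 | 1.5218 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 398 | 423 | 468 | 468 | 472 | 493 | 499 | 514 | 521 | 526 |
| 4 | NaN | Angola | -11.2027 | 17.8739 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 6 | 7 | 7 | 11 | 11 | 11 | 11 | 11 | 11 | 11 |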


> The `confirmed`, `deaths` and `recovered` files share the same format, so the assessment code below applies to all three.
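
The claim that the three files share one layout can be checked rather than assumed. The sketch below is one possible way to do that; the helper name `same_layout` and the tiny stand-in frames are hypothetical, and in the notebook the frames would come from `pd.read_csv` on the three raw files.

```python
import pandas as pd


def same_layout(frames):
    """Return True if every DataFrame in `frames` has the same column labels in the same order."""
    column_lists = [list(frame.columns) for frame in frames.values()]
    return all(columns == column_lists[0] for columns in column_lists)


# Stand-in frames for illustration only (not the real files).
confirmed = pd.DataFrame({
    "Province/State": [None],
    "Country/Region": ["Albania"],
    "Lat": [41.1533],
    "Long": [20.1683],
    "1/22/20": [0],
})
recovered = confirmed.copy()
# A frame whose date column differs, to show what a mismatch looks like.
deaths = confirmed.rename(columns={"1/22/20": "1/23/20"})

layout_ok = same_layout({"confirmed": confirmed, "recovered": recovered})
layout_mismatch = same_layout({"confirmed": confirmed, "deaths": deaths})
print(layout_ok, layout_mismatch)
```

If the check fails for the real files, the assessment cells would need to be run and reviewed per file instead of once.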

#### Initial Check
***
**Brief summary - data quality:**
* No missing values outside of `Province/State` (where they are expected and irrelevant to us)
* No duplicates
* Proper data format used everywhere
* Values within reasonable bounds (e.g. confirmed cases between 0 and 1.25M are in line with other reports as of 2020-05-07)
* Country value for the United States is 'US'; this should be updated to 'United States' for consistency with e.g. 'United Kingdom'
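
The country-name fix from the last bullet could look roughly like this. It is a minimal sketch on a stand-in frame; `Series.replace` with a dictionary swaps whole values only, so other country names are left untouched.

```python
import pandas as pd

# Stand-in for the 'Country/Region' column of the loaded file (illustrative values).
sample = pd.DataFrame({"Country/Region": ["US", "United Kingdom", "US"]})

# Replace the exact value 'US' with 'United States'.
sample["Country/Region"] = sample["Country/Region"].replace({"US": "United States"})
print(sample["Country/Region"].tolist())
```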

**We do want to update the data structure:**
* Move the dates from columns into rows (Country/Date/Cases)
* Ignore Province/State (use a country total)
* Move Lat & Long to a separate table (Country as key)
* Change the date format to YYYY-MM-DD
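
The restructuring steps listed above can be sketched as follows. This is one possible approach, not necessarily the cleaning code used later: the function names, the `Exampleland` rows and their numbers are made up for illustration, and taking the first row's coordinates for a country with several provinces is an assumption.

```python
import pandas as pd

ID_COLUMNS = ["Province/State", "Country/Region", "Lat", "Long"]


def to_long_format(wide):
    """Return one row per country and date, with provinces summed into a country total."""
    date_columns = [c for c in wide.columns if c not in ID_COLUMNS]
    # Ignore Province/State: sum all rows of a country into one total per date.
    totals = wide.groupby("Country/Region", as_index=False)[date_columns].sum()
    # Move the date columns into rows.
    tidy = totals.melt(id_vars="Country/Region", var_name="Date", value_name="Cases")
    # Convert dates such as '1/22/20' to 'YYYY-MM-DD' strings.
    tidy["Date"] = pd.to_datetime(tidy["Date"], format="%m/%d/%y").dt.strftime("%Y-%m-%d")
    tidy = tidy.rename(columns={"Country/Region": "Country"})
    return tidy.sort_values(["Country", "Date"]).reset_index(drop=True)


def country_coordinates(wide):
    """Return a separate Lat/Long table keyed by country."""
    # Assumption: for countries with several province rows, keep the first row's coordinates.
    coords = wide.groupby("Country/Region", as_index=False)[["Lat", "Long"]].first()
    return coords.rename(columns={"Country/Region": "Country"})


# Illustrative input; 'Exampleland' and its values are invented.
sample = pd.DataFrame({
    "Province/State": ["North", "South", None],
    "Country/Region": ["Exampleland", "Exampleland", "Albania"],
    "Lat": [10.0, 12.0, 41.1533],
    "Long": [20.0, 22.0, 20.1683],
    "1/22/20": [1, 2, 0],
    "1/23/20": [3, 4, 5],
})

long_df = to_long_format(sample)
coords = country_coordinates(sample)
print(long_df)
print(coords)
```

Keeping the coordinates in their own table avoids repeating `Lat` and `Long` on every date row of the long table.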

We do not need the separate United States files: the US figures are also included in the global files, which suffice because we are looking for country-level data.

In [56]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 252 entries, 0 to 251
Columns: 111 entries, Province/State to 5/7/20
dtypes: float64(2), int64(107), object(2)
memory usage: 218.7+ KB


In [57]:
df.duplicated().sum()

0

In [58]:
df.isnull().sum()

Province/State    185
Country/Region      0
Lat                 0
Long                0
1/22/20             0
                 ... 
5/3/20              0
5/4/20              0
5/5/20              0
5/6/20              0
5/7/20              0
Length: 111, dtype: int64

In [59]:
(df.describe().loc['min']<0).sum()

2

> Two columns have a minimum below 0: `Lat` and `Long` (as expected).
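
A slightly more targeted variant of this check leaves out the coordinate columns, so that any hit points directly at a count column. The sketch below assumes the four identifier columns shown in `df.head()`; the helper name and the stand-in frames are hypothetical.

```python
import pandas as pd

ID_COLUMNS = ["Province/State", "Country/Region", "Lat", "Long"]


def negative_count_columns(frame):
    """Return the names of date columns whose minimum is below 0."""
    date_columns = [c for c in frame.columns if c not in ID_COLUMNS]
    minima = frame[date_columns].min()
    return minima[minima < 0].index.tolist()


# Stand-in frame: a negative latitude is fine, the counts are non-negative.
clean = pd.DataFrame({
    "Province/State": [None],
    "Country/Region": ["Angola"],
    "Lat": [-11.2027],
    "Long": [17.8739],
    "1/22/20": [0],
    "1/23/20": [0],
})
# Same frame with an invented negative count, to show what a finding looks like.
broken = clean.assign(**{"1/23/20": [-1]})

clean_hits = negative_count_columns(clean)
broken_hits = negative_count_columns(broken)
print(clean_hits, broken_hits)
```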

In [60]:
df['Country/Region'].value_counts()

China             33
France            11
United Kingdom    11
Australia          8
Netherlands        5
                  ..
Costa Rica         1
Paraguay           1
Egypt              1
Haiti              1
Albania            1
Name: Country/Region, Length: 187, dtype: int64

In [61]:
df[['Country/Region','5/7/20']].sort_values('5/7/20', ascending=False)

| | Country/Region | 5/7/20 |
|---|---|---|
| 225 | US | 195036 |
| 112 | Germany | 141700 |
| 199 | Spain | 128511 |
| 131 | Italy | 96276 |
| 213 | Turkey | 82984 |
| ... | ... | ... |
| 245 | France | 0 |
| 242 | Netherlands | 0 |
| 223 | United Kingdom | 0 |
| 166 | Netherlands | 0 |
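
The listing above is sorted per row, so a country that spans several rows appears more than once. Since we only want country totals, a ranking per country may be more informative. The following is a minimal sketch with a hypothetical helper and invented stand-in values:

```python
import pandas as pd


def country_totals(frame, date_column):
    """Sum all rows of each country for one date column and sort descending."""
    return (
        frame.groupby("Country/Region")[date_column]
        .sum()
        .sort_values(ascending=False)
    )


# Illustrative rows only; the country names and numbers are made up.
sample = pd.DataFrame({
    "Country/Region": ["Exampleland", "Exampleland", "Otherland"],
    "5/7/20": [0, 12, 5],
})

totals = country_totals(sample, "5/7/20")
print(totals)
```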
