# COVID-19 Dataset (updated roughly twice daily)
by *Jeffrey Blessing, Ph.D.* (blessing@msoe.edu or Dr.Jeffrey.Blessing@gmail.com)
## Open access epidemiological data from the COVID-19 outbreak
Coronavirus disease 2019 (COVID-19) is spreading rapidly around the world.  The availability of accurate and robust epidemiological, clinical, and laboratory data early in an epidemic is important to guide public health decision-making.  Consistent recording of epidemiological information is important to understand transmissibility, risk of geographic spread, routes of transmission, and risk factors for infection, and to provide the baseline for epidemiological modeling that can inform planning of response and containment efforts to reduce the burden of disease.  This Jupyter notebook was created to encourage data scientists familiar with Python tools for data analysis to collaborate on discovering insights from COVID-19 data.

My thanks to the Open COVID-19 Data Curation Group (below) for posting the data on GitHub.com.

*Attribution to*: Bo Xu, Moritz U G Kraemer, on behalf of the Open COVID-19 Data Curation Group 
moritz.kraemer@zoo.ox.ac.uk
Department of Zoology,
University of Oxford, 
Oxford OX1 3SZ, UK

**Note**: The __*metadata*__ is, oddly, stored on GitHub as a comma-separated values (CSV) file.  Its uniform resource identifier (URI) is: https://github.com/beoutbreakprepared/nCoV2019/blob/master/latest_data/latestdata.csv.  The URI to the __*raw, CSV data*__ is in the first code cell below.  (You may need to install `wget` on your local machine to fetch the data.  Wget for the Windows operating system can be downloaded and installed from: https://eternallybored.org/misc/wget/.  Be sure to get the 64-bit version if your Windows installation is 64-bit; otherwise install the 32-bit version.)  Users on Unix/Linux systems should be good to go!
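If installing `wget` is inconvenient, the same kind of download can be sketched with the Python standard library alone. The helper name `fetch_to_file` is hypothetical (it is not part of this notebook), and the sketch only reports a failed download instead of raising, since the raw CSV URI used in the first code cell returned a 404 on 2020-05-08.

```python
import shutil
import urllib.error
import urllib.request


def fetch_to_file(url, dest_path):
    """Download `url` to `dest_path`.

    Returns True on success and False if the request fails (for example
    on an HTTP 404). `fetch_to_file` is a hypothetical helper name.
    """
    try:
        # The destination file is only opened once the request has succeeded,
        # so a failed download does not leave an empty file behind.
        with urllib.request.urlopen(url) as response, open(dest_path, "wb") as out:
            shutil.copyfileobj(response, out)
    except urllib.error.URLError as err:  # HTTPError is a subclass of URLError
        print(f"Download failed: {err}")
        return False
    return True
```

A call such as `fetch_to_file(some_uri, "covid-19.csv")` would then stand in for the `wget -O covid-19.csv some_uri` form, where `some_uri` is whichever URI currently serves the data.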

Let's see what we can do with this data in real time!

First, I'll use a `wget` command below to fetch the latest *.csv* data from GitHub.com.  But wait!  The authors have changed the way they upload their data.  Now I must browse to the following URI, manually download the tarball (*.tar.gz*), and extract the latest data with 7-zip:  https://github.com/beoutbreakprepared/nCoV2019/tree/master/latest_data
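Once the tarball has been downloaded by hand, the extraction step can also be done from Python instead of 7-zip. The following is a minimal sketch using the standard `tarfile` module; the helper name `extract_csv_members` is hypothetical, and the archive's file name is whatever the repository currently uses (it is passed in as an argument and not assumed here).

```python
import shutil
import tarfile
from pathlib import Path


def extract_csv_members(archive_path, dest_dir="."):
    """Copy every .csv file found in a .tar.gz archive into `dest_dir`.

    Returns the list of paths written. Any directory structure inside the
    archive is flattened: only the file name of each member is kept.
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    extracted = []
    with tarfile.open(archive_path, "r:gz") as tar:
        for member in tar.getmembers():
            if member.isfile() and member.name.endswith(".csv"):
                target = dest / Path(member.name).name
                # Stream the member to disk rather than loading it in memory.
                with tar.extractfile(member) as src, open(target, "wb") as out:
                    shutil.copyfileobj(src, out)
                extracted.append(target)
    return extracted
```

Calling `extract_csv_members(path_to_downloaded_tarball)` in the notebook's working directory should leave the extracted CSV next to the notebook, where `pd.read_csv` can find it.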

In [5]:
import numpy as np
import pandas as pd

# Downloading the dataset to a local file no longer works!  Use Lancet article URI to find tarball on GitHub.
#!wget -nv -O covid-19.csv https://raw.githubusercontent.com/beoutbreakprepared/nCoV2019/master/latest_data/latestdata.csv

https://raw.githubusercontent.com/beoutbreakprepared/nCoV2019/master/latest_data/latestdata.csv:
2020-05-08 18:00:09 ERROR 404: Not Found.


**Note**:  The 404 error above is what the old `wget` command now returns.  However you fetch the data, update your local copy of the data file at most once per day.

Next, let's import the spreadsheet data into a pandas data frame.

In [14]:
# Parse the CSV data into a pandas data frame
df = pd.read_csv("latestdata.csv", low_memory=False)
print('The dimensions (#rows, #cols) of the data frame are:', df.shape)

The dimensions (#rows, #cols) of the data frame are: (549325, 33)


Let's filter out blank rows and columns, and remove duplicate rows from the data set:

In [15]:
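# Note: each call below returns a new data frame; none modifies df in place,
# and the results are not assigned, so df itself is left unchanged.
df.dropna(how='all')   # returns a copy without all-blank rows
df.dropna(axis=1, how='all')   # returns a copy without all-blank cols
df.drop_duplicates()   # returns a copy without identical rows
print('After dropping empty rows, cols and duplicates, data frame dimensions are:', df.shape)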

After dropping empty rows, cols and duplicates, data frame dimensions are: (549325, 33)
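To make the filtering stick, the return values have to be kept. Here is a minimal sketch on a small made-up frame (the `toy` data and the helper name `drop_blank_and_duplicates` are illustrative, not part of the dataset or the notebook):

```python
import numpy as np
import pandas as pd


def drop_blank_and_duplicates(frame):
    """Return a copy of `frame` without all-blank rows, all-blank columns,
    or duplicate rows. The input frame is not modified."""
    cleaned = frame.dropna(how="all")            # drop rows that are entirely NaN
    cleaned = cleaned.dropna(axis=1, how="all")  # drop columns that are entirely NaN
    cleaned = cleaned.drop_duplicates()          # drop rows identical to an earlier row
    return cleaned


# Made-up data: one blank row, one blank column, one duplicated row.
toy = pd.DataFrame({
    "country": ["Italy", "Italy", None, "Peru"],
    "city": ["Rome", "Rome", None, "Lima"],
    "empty": [np.nan, np.nan, np.nan, np.nan],
})
cleaned = drop_blank_and_duplicates(toy)
print("before:", toy.shape, "after:", cleaned.shape)
```

In the notebook this would be used as `df = drop_blank_and_duplicates(df)`, after which `df.shape` reflects whatever was actually removed.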


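Hmmm, the shape is unchanged.  That is expected: `dropna()` and `drop_duplicates()` return new data frames, and the results above were not assigned back to `df`.  Okay, let's group data by country and city to see how many locations of outbreak we're dealing with: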

In [16]:
df.groupby(['country', 'city']).size()

country   city                                 
Algeria   Adrar                                      3
          Ain Defla                                 31
          Ain Temouchent                             7
          Alger                                    181
          Algiers                                   21
                                                  ... 
Vietnam   Thien Ke commune, Binh Xuyen District      2
Zimbabwe  Bulawayo                                  10
          Harare                                    16
          Mhondoro                                   3
          Victoria Falls                             1
Length: 3785, dtype: int64
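One caveat when reading these group sizes: by default `groupby` leaves out rows where a grouping key is missing, so cases with no recorded city are not counted in any group. The sketch below, on made-up rows, shows how to count those rows separately and how to sort the groups so the largest locations come first.

```python
import pandas as pd

# Made-up rows for illustration; one case has no city recorded.
toy = pd.DataFrame({
    "country": ["Algeria", "Algeria", "Algeria", "Zimbabwe", "Zimbabwe", "Zimbabwe"],
    "city": ["Alger", "Alger", None, "Harare", "Harare", "Harare"],
})

# Group sizes, largest first. Rows with a missing city are not in any group.
sizes = toy.groupby(["country", "city"]).size().sort_values(ascending=False)

# Count the rows that the grouping silently skipped.
missing_city = int(toy["city"].isna().sum())

print(sizes)
print("rows without a city:", missing_city)
```

The same two lines applied to `df` would show how many cases the `(country, city)` grouping leaves out of the real data.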

Let's run a quick describe() on our data frame:

In [17]:
df.describe()

| | latitude | longitude | admin_id |
|---|---|---|---|
| count | 549275.0 | 549275.0 | 549275.0 |
| mean | 40.984042 | 2.849269 | 4992.796843 |
| std | 15.782018 | 60.705855 | 3596.692949 |
| min | -54.0 | -159.727596 | 1.0 |
| 25% | 39.415022 | -64.183333 | 887.0 |
| 50% | 44.036955 | 9.994373 | 6195.0 |
| 75% | 48.04818 | 37.6173 | 7726.0 |
| max | 70.0718 | 174.74 | 10988.0 |


Only 3 numeric columns!  By default, `describe()` summarizes only numeric columns, and nearly all of the rest look like string data fields.  Let's describe the string fields with a new query:

In [18]:
df.describe(include = ['O'])   # only describes columns that hold string objects

| | ID | age | sex | city | province | country | geo_resolution | date_onset_symptoms | date_admission_hospital | date_confirmation | ... | outcome | date_death_or_discharge | notes_for_discussion | location | admin3 | admin2 | admin1 | country_new | data_moderator_initials | travel_history_binary |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 549325 | 18102 | 16777 | 403801 | 528582 | 549223 | 549275 | 4267 | 1862 | 546470 | ... | 2745 | 507 | 1852 | 7542 | 9188 | 318425 | 454917 | 525661 | 296658 | 483850 |
| unique | 549325 | 1482 | 2 | 3764 | 842 | 154 | 7 | 105 | 82 | 137 | ... | 98 | 77 | 190 | 344 | 407 | 1907 | 467 | 150 | 11 | 2 |
| top | 007-210260 | 50-59 | male | Moscow | Central | Italy | admin2 | 23.03.2020 | 30.01.2020 | 23.03.2020 | ... | Under treatment | 18.02.2020 | Hospitalized | Chicago | Birmingham | Moscow | Central | Italy | TR | False |
| freq | 1 | 534 | 9446 | 50553 | 70610 | 157733 | 326879 | 140 | 90 | 33354 | ... | 359 | 22 | 1247 | 985 | 309 | 50551 | 68540 | 157733 | 293266 | 460229 |
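The `top` row shows that the date columns hold strings in day.month.year form (for example 23.03.2020). A minimal sketch of turning such a column into real dates follows, on made-up values. It assumes every well-formed entry uses that one format; anything else (an assumption about what the real column may contain) is turned into `NaT` instead of raising an error.

```python
import pandas as pd

# Made-up values: two well-formed dates, one missing value, one malformed entry.
toy = pd.DataFrame({
    "date_confirmation": ["23.03.2020", "30.01.2020", None, "not a date"],
})

# errors="coerce" converts anything that does not match the format to NaT.
parsed = pd.to_datetime(toy["date_confirmation"], format="%d.%m.%Y", errors="coerce")

print(parsed)
print("unparseable or missing:", int(parsed.isna().sum()))
```

Comparing `parsed.isna().sum()` with the count of missing values in the original column is a quick way to see how many non-empty entries failed to parse.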


Let's see what data types we have for each column in our data frame:

In [19]:
df.dtypes

ID                           object
age                          object
sex                          object
city                         object
province                     object
country                      object
latitude                    float64
longitude                   float64
geo_resolution               object
date_onset_symptoms          object
date_admission_hospital      object
date_confirmation            object
symptoms                     object
lives_in_Wuhan               object
travel_history_dates         object
travel_history_location      object
reported_market_exposure     object
additional_information       object
chronic_disease_binary         bool
chronic_disease              object
source                       object
sequence_available           object
outcome                      object
date_death_or_discharge      object
notes_for_discussion         object
location                     object
admin3                       object
admin2                      