### COVID-19 Statistics Notebook

A few simple examples of using pandas in a Jupyter notebook to explore the JHU CSSE COVID-19 daily reports.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import datetime as dtim

In [None]:
database_url = "https://raw.github.com/CSSEGISandData/COVID-19/master/csse_covid_19_data"

def fetchDatabase(report_name, remove_cols = {"FIPS", "Lat", "Long_", "UID", "ISO3"}):
    # Find yesterday's date in UTC.  The CSSE report files are named by UTC date and updated daily
    yesterday = dtim.datetime.now(dtim.timezone.utc) - dtim.timedelta(days=1)
    mmddyyyy = yesterday.strftime("%m-%d-%Y")
    
    # build the report filename
    report_url = "%s/csse_covid_19_%s/%s.csv" % (database_url, report_name, mmddyyyy)
    rep = pd.read_csv(report_url)
    # drop a few of the less useful columns
    orig_cols = set(list(rep))
    # make sure we don't try to remove a column that doesn't exist
    actual_remove_cols = list(orig_cols.intersection(remove_cols))
    return rep.drop(columns = actual_remove_cols)
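
The set intersection in `fetchDatabase` keeps `drop` from failing on a column that isn't in the report. As a possible alternative, pandas can skip missing labels itself via `errors="ignore"`. Here is a minimal sketch on a tiny made-up frame (the values are illustrative only, not real report data):

```python
import pandas as pd

# Tiny made-up frame standing in for a downloaded report (illustrative values only).
rep = pd.DataFrame({
    "Province_State": ["Maryland", "Nevada"],
    "Confirmed": [100, 50],
    "FIPS": [24, 32],
    "Lat": [39.06, 38.31],
})

remove_cols = ["FIPS", "Lat", "Long_", "UID", "ISO3"]

# errors="ignore" skips labels that are not present instead of raising KeyError.
trimmed = rep.drop(columns=remove_cols, errors="ignore")
print(list(trimmed.columns))
```

Either approach should give the same result; the explicit intersection just makes the intent more visible.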

From the README.md file at [JHU COVID-19 Dataset](https://github.com/CSSEGISandData/COVID-19/tree/master/csse_covid_19_data):

# JHU CSSE COVID-19 Dataset

## Table of contents

 * [Daily reports (csse_covid_19_daily_reports)](#daily-reports-csse_covid_19_daily_reports)
 * [USA daily state reports (csse_covid_19_daily_reports_us)](#usa-daily-state-reports-csse_covid_19_daily_reports_us)
 * [Time series summary (csse_covid_19_time_series)](#time-series-summary-csse_covid_19_time_series)
 * [Data modification records](#data-modification-records)
 * [UID Lookup Table Logic](#uid-lookup-table-logic)
---

## [Daily reports (csse_covid_19_daily_reports)](https://github.com/CSSEGISandData/COVID-19/tree/master/csse_covid_19_data/csse_covid_19_daily_reports)

This folder contains daily case reports. All timestamps are in UTC (GMT+0).

### File naming convention
MM-DD-YYYY.csv in UTC.
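
As a small illustration of this naming convention, the file name for a given date can be built with `strftime`. The helper name `report_filename` is made up for this sketch and is not part of the JHU repository:

```python
import datetime as dtim

def report_filename(day):
    """Return the daily-report file name (MM-DD-YYYY.csv) for a date."""
    # %m and %d are zero-padded, %Y is the four-digit year.
    return day.strftime("%m-%d-%Y") + ".csv"

print(report_filename(dtim.date(2020, 4, 23)))
```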

### Field description
* <b>FIPS</b>: US only. Federal Information Processing Standards code that uniquely identifies counties within the USA.
* <b>Admin2</b>: County name. US only.
* <b>Province_State</b>: Province, state or dependency name.
* <b>Country_Region</b>: Country, region or sovereignty name. The names of locations included on the Website correspond with the official designations used by the U.S. Department of State.
* <b>Last Update</b>: MM/DD/YYYY HH:mm:ss  (24 hour format, in UTC).
* <b>Lat</b> and <b>Long_</b>: Dot locations on the dashboard. All points (except for Australia) shown on the map are based on geographic centroids, and are not representative of a specific address, building or any location at a spatial scale finer than a province/state. Australian dots are located at the centroid of the largest city in each state.
* <b>Confirmed</b>: Confirmed cases include presumptive positive cases and probable cases, in accordance with CDC guidelines as of April 14.
* <b>Deaths</b>: Death totals in the US include confirmed and probable, in accordance with [CDC](https://www.cdc.gov/coronavirus/2019-ncov/cases-updates/cases-in-us.html) guidelines as of April 14.
* <b>Recovered</b>: Recovered cases outside China are estimates based on local media reports, and state and local reporting when available, and therefore may be substantially lower than the true number. US state-level recovered cases are from [COVID Tracking Project](https://covidtracking.com/).
* <b>Active</b>: Active cases = total confirmed - total recovered - total deaths.
* <b>Combined_Key</b>: Admin2 + Province_State + Country_Region.
* <b>Incidence_Rate</b>: = confirmed cases per 100,000 persons.
* <b>Case-Fatality Ratio (%)</b>: = number of recorded deaths / number of confirmed cases.
* <b>US Testing Rate</b>: = total test results per 100,000 persons. The "total test results" is equal to "Total test results (Positive + Negative)" from [COVID Tracking Project](https://covidtracking.com/).
* <b>US Hospitalization Rate (%)</b>: = Total number hospitalized / Number confirmed cases. The "Total number hospitalized" is the "Hospitalized – Cumulative" count from [COVID Tracking Project](https://covidtracking.com/). The "hospitalization rate" and "hospitalized - Cumulative" data is only presented for those states which provide cumulative hospital data.
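
To make two of these definitions concrete, here is a rough sketch that recomputes the active-case count and the hospitalization rate on a tiny frame. The numbers are invented, the states are hypothetical, and the column names `People_Hospitalized` and `Hospitalization_Rate` are assumptions for illustration; check them against the report you actually download.

```python
import pandas as pd

# Made-up numbers for two hypothetical states; not real data.
df = pd.DataFrame({
    "Province_State": ["State A", "State B"],
    "Confirmed": [1000, 400],
    "Deaths": [50, 10],
    "Recovered": [300, 100],
    "People_Hospitalized": [200, 40],  # assumed column name
})

# Active cases = total confirmed - total recovered - total deaths
df["Active"] = df["Confirmed"] - df["Recovered"] - df["Deaths"]

# Hospitalization rate (%) = total number hospitalized / number of confirmed cases
df["Hospitalization_Rate"] = 100 * df["People_Hospitalized"] / df["Confirmed"]

print(df[["Province_State", "Active", "Hospitalization_Rate"]])
```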

### Update frequency
* Files on and after April 23: once per day between 03:30 and 04:00 UTC.
* Files from February 2 to April 22: once per day around 23:59 UTC.
* Files on and before February 1: the last updated files before 23:59 UTC. Sources: [archived_data](https://github.com/CSSEGISandData/COVID-19/tree/master/archived_data) and dashboard.
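
One practical consequence of this schedule is that a report may not be online yet just after midnight UTC. The sketch below guesses the newest date that is likely to be available. It assumes the file for day D is uploaded by about 04:00 UTC on day D+1, which is my reading of the schedule above rather than something the README states outright; `latest_report_date` and `publish_hour` are names invented for this example.

```python
import datetime as dtim

def latest_report_date(now_utc, publish_hour=4):
    """Guess the newest report date likely to be available at now_utc.

    Assumption: the file for day D is uploaded by publish_hour UTC on day D+1.
    """
    today = now_utc.date()
    # Before the assumed publish hour, yesterday's file may not exist yet,
    # so fall back one more day.
    lag = 1 if now_utc.hour >= publish_hour else 2
    return today - dtim.timedelta(days=lag)

now = dtim.datetime.now(dtim.timezone.utc)
print(latest_report_date(now))
```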

### Data sources
Refer to the [mainpage](https://github.com/CSSEGISandData/COVID-19).




In [None]:
# Read the US daily report.  
# Choices are daily_reports (worldwide), daily_reports_us (US only)
daily_rep_us = fetchDatabase("daily_reports_us")
daily_rep_us.head(3)

In [None]:
daily_rep_us.loc[daily_rep_us["Province_State"] == "Maryland"]

In [None]:
# An outcome here is defined as either a reported death or reported recovery
daily_rep_us["Deaths_vs_Outcomes"] = daily_rep_us["Deaths"] / (daily_rep_us["Deaths"] + daily_rep_us["Recovered"])

In [None]:
# Now sort by the ratio of deaths to outcomes, highest first
daily_rep_us.sort_values(by="Deaths_vs_Outcomes", ascending=False, inplace=True)

In [None]:
# When this notebook was written, Nevada had more reported deaths than reported recoveries.
daily_rep_us.head(5)
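
One thing to keep in mind with the deaths-vs-outcomes ratio: it is undefined for a state with no reported deaths or recoveries, and pandas yields NaN there. A minimal sketch with invented numbers (the state names and values are hypothetical) that drops those rows before ranking:

```python
import pandas as pd

# Made-up numbers; "State C" has no recorded outcomes yet.
df = pd.DataFrame({
    "Province_State": ["State A", "State B", "State C"],
    "Deaths": [50, 10, 0],
    "Recovered": [150, 90, 0],
})

# 0 / 0 produces NaN in pandas, so "State C" gets an undefined ratio.
df["Deaths_vs_Outcomes"] = df["Deaths"] / (df["Deaths"] + df["Recovered"])

# Drop rows with an undefined ratio, then rank from highest to lowest.
ranked = (
    df.dropna(subset=["Deaths_vs_Outcomes"])
      .sort_values(by="Deaths_vs_Outcomes", ascending=False)
)
print(ranked)
```

Whether to drop such rows or keep them at the bottom of the table is a judgment call; `sort_values` places NaN last by default.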

# Geographic Presentation

From this point on, you'll need the geoplot and geopandas packages installed.

I did it this way on my Fedora Linux box at home:

```
sudo dnf install python3-geoplot python3-geopandas 
```

You can also install them with pip3.

In any case, if you don't have these packages installed, the remaining code will say ugly things and refuse to run.  
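
If you'd rather find out up front, a quick check along these lines should tell you whether both packages can be imported. This is just a sketch; `missing_packages` is a helper name invented here.

```python
import importlib.util

def missing_packages(names):
    """Return the names from `names` that cannot be found for import."""
    # find_spec returns None when a top-level module is not installed.
    return [name for name in names if importlib.util.find_spec(name) is None]

missing = missing_packages(["geopandas", "geoplot"])
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("geopandas and geoplot are available.")
```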
