Coronavirus Data Science
This repository contains Jupyter notebooks and Python scripts for investigating the 2019 coronavirus outbreak. The goal is to serve as a starting point for tracking and analyzing the outbreak. Setting up an environment to read, analyze, and plot the outbreak data is not trivial, and I hope this helps more people get started.
If you are a researcher, journalist, or other interested member of the public, please use this freely. If you are a data scientist, please fork and contribute back to build a better foundation for future research.
- Provide a framework and tools for loading outbreak data into Python
- Easily visualize outbreak geodata
- Facilitate collaboration among researchers
From the CDC:
2019 Novel Coronavirus (2019-nCoV) is a virus (more specifically, a coronavirus) identified as the cause of an outbreak of respiratory illness first detected in Wuhan, China. Early on, many of the patients in the outbreak in Wuhan, China reportedly had some link to a large seafood and animal market, suggesting animal-to-person spread. However, a growing number of patients reportedly have not had exposure to animal markets, indicating person-to-person spread is occurring. At this time, it’s unclear how easily or sustainably this virus is spreading between people. The latest situation summary updates are available on CDC’s web page 2019 Novel Coronavirus, Wuhan, China.
This is an emerging, rapidly evolving situation and CDC will provide updated information as it becomes available.
The data for tracking the 2019-nCoV outbreak is provided by the Johns Hopkins Center for Systems Science and Engineering. They have created an interactive GIS Dashboard.
In response to this ongoing public health emergency, we developed an online dashboard (static snapshot shown below) to visualize and track the reported cases on a daily timescale; the complete set of data is downloadable as a google sheet. The case data visualized is collected from various sources, including WHO, U.S. CDC, ECDC, China CDC (CCDC), NHC and DXY. DXY is a Chinese website that aggregates NHC and local CCDC situation reports in near real-time, providing more current regional case estimates than the national level reporting organizations are capable of, and is thus used for all the mainland China cases reported in our dashboard (confirmed, suspected, recovered, deaths). U.S. cases (confirmed, suspected, recovered, deaths) are taken from the U.S. CDC, and all other country (suspected and confirmed) case data is taken from the corresponding regional health departments. The dashboard is intended to provide the public with an understanding of the outbreak situation as it unfolds, with transparent data sources.
Pulling Updates from Google Sheets
The data is updated in a read-only Google Sheet.
Download credentials and install dependencies as described in the Google documentation.
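Once credentials are in place, a call such as `service.spreadsheets().values().get(...).execute()` from the Google API Python client returns a dictionary whose `values` entry is a list of rows, with the header row first. The sketch below shows one way to turn that list into a Pandas DataFrame. The function name `values_to_dataframe`, the column names, and the sample rows are assumptions for illustration, not the repository's actual script or real case counts.

```python
import pandas as pd


def values_to_dataframe(values):
    """Convert a Sheets-style list of rows (header first) into a DataFrame."""
    if not values:
        return pd.DataFrame()
    header, rows = values[0], values[1:]
    # The API drops trailing empty cells, so pad short rows to the header width.
    padded = [row + [""] * (len(header) - len(row)) for row in rows]
    frame = pd.DataFrame(padded, columns=header)
    # Cells typically arrive as strings; convert the count columns that are present.
    for column in ("Confirmed", "Deaths", "Recovered"):
        if column in frame.columns:
            numeric = pd.to_numeric(frame[column], errors="coerce")
            frame[column] = numeric.fillna(0).astype(int)
    return frame


# Hypothetical payload shaped like the API's "values" field (placeholder numbers).
values = [
    ["Province/State", "Country/Region", "Confirmed", "Deaths", "Recovered"],
    ["Hubei", "Mainland China", "10", "1", "2"],
    ["Washington", "United States", "1"],
]
frame = values_to_dataframe(values)
print(frame)
```

Padding short rows matters because a row whose last cells are empty comes back shorter than the header, which would otherwise make the DataFrame constructor fail.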
The Jan 25 Jupyter notebook works on a snapshot of data from Jan 25.
- Load the coronavirus data into a Pandas DataFrame and plot
- Load world, China, and US shapefiles into GeoDataFrames
- Merge the coronavirus DataFrame with the GeoDataFrames
- Display on a map
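The merge step above can be sketched with plain Pandas. This is a minimal illustration, not the notebook's code: the helper `merge_cases_with_regions`, the `name` column, and the case counts are hypothetical. A regular DataFrame stands in for the attribute table of a shapefile; if `regions` were instead a GeoDataFrame loaded with `geopandas.read_file()`, the same `merge` call would keep the geometry column.

```python
import pandas as pd


def merge_cases_with_regions(regions, cases,
                             region_key="name", case_key="Province/State"):
    """Left-join case counts onto a table of regions.

    Merging from the regions side keeps every region, including those
    without any reported cases.
    """
    merged = regions.merge(cases, how="left",
                           left_on=region_key, right_on=case_key)
    # Regions with no match get NaN counts; treat them as zero cases.
    merged["Confirmed"] = merged["Confirmed"].fillna(0).astype(int)
    return merged


# Placeholder rows, not real outbreak figures.
cases = pd.DataFrame({"Province/State": ["Hubei", "Guangdong"],
                      "Confirmed": [10, 3]})
# Stand-in for the attribute table of a shapefile.
regions = pd.DataFrame({"name": ["Hubei", "Guangdong", "Tibet"]})

merged = merge_cases_with_regions(regions, cases)
print(merged[["name", "Confirmed"]])
```

With a GeoDataFrame as input, the display step would then be something like `merged.plot(column="Confirmed", legend=True)`, which draws a choropleth colored by case count.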
The nCoV Spread Jupyter notebook loads all data files into one time-indexed DataFrame.
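As a rough sketch of how several snapshot files can be combined into one time-indexed DataFrame, the example below reads each CSV, concatenates the results, and indexes them by timestamp. The function `load_snapshots`, the `Last Update` column name, and the in-memory sample data are assumptions for illustration; the notebook itself may organize this differently.

```python
import io

import pandas as pd


def load_snapshots(sources, time_column="Last Update"):
    """Read several CSV snapshots and stack them into one time-indexed DataFrame."""
    frames = [pd.read_csv(src, parse_dates=[time_column]) for src in sources]
    combined = pd.concat(frames, ignore_index=True)
    # Index by timestamp and sort so time-based slicing and resampling work.
    return combined.set_index(time_column).sort_index()


# In-memory stand-ins for snapshot files; the numbers are placeholders.
snap_a = io.StringIO(
    "Province/State,Last Update,Confirmed\nHubei,2020-01-25 12:00,10\n"
)
snap_b = io.StringIO(
    "Province/State,Last Update,Confirmed\nHubei,2020-01-24 12:00,7\n"
)

timeline = load_snapshots([snap_a, snap_b])
print(timeline)
```

In practice `sources` would be a list of file paths rather than `StringIO` objects. With a sorted `DatetimeIndex` in place, expressions such as `timeline["Confirmed"].resample("D").sum()` give daily totals.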
pip install pandas requests geopandas descartes
- Load and visualize a data snapshot
- Create a script to download new data from Google Sheets
- Visualize time-series data
Coronavirus Data Science © 2019+, P. Daniel Tyreus, PhD. Released under the MIT License.