# Final Project: Phase 2

### Purpose: Initial Data Collection & Processing

### Name & NetID: Jason Zheng, jz449

### Data Description:
What are the observations (rows) and the attributes (columns)?
    - For the states data, each row represents the counts recorded for a given state on a given date. The attributes are State, Confirmed Cases, Deaths, and Date.
    - For the U.S. totals data, each row represents the total counts across the United States on a given date. The attributes are Confirmed Cases, Deaths, Recoveries, and Date.
Why was this dataset created?
    - The API behind this dataset was created to help developers build dashboards and mobile apps that track COVID-19 with the latest information.
Who funded the creation of the dataset?
    - The original data is sourced from the Johns Hopkins Center for Systems Science and Engineering, while the API was created by Kyle Redelinghuys.
What processes might have influenced what data was observed and recorded and what was not?
    - The data was collected from U.S. county and state health departments, multiple national government health departments, and data-aggregating websites that rely on a combination of reporting from local health departments and local media reports. Thus, the accuracy and reliability of the data depend on how rigorously these sources reported their results.
What preprocessing was done, and how did the data come to be in the form that you are using?
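    - From the API (which sources its data from the JHU CSSE public GitHub repository), I converted the JSON response to a pandas DataFrame, filtered it to U.S. entries, dropped the columns I did not need, converted the Date column to datetimes, and split the result into a states DataFrame and a U.S. totals DataFrame, which I am currently using for data analysis.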
If people are involved, were they aware of the data collection and if so, what purpose did they expect the data to be used for?
    - The original data collection was developed to provide researchers, public health authorities and the general public with a dashboard to track the outbreak.

Where can your raw source data be found, if applicable?
    - Raw data from API: https://api.covid19api.com/all
    - us_states.json: https://cornell.box.com/s/373058ydm13hd044ukkbn7v2tdp5ln4c
    - us_totals.json: https://cornell.box.com/s/77y7sy2364p1hefrp0zgz2okbzerq2it

In [15]:
import requests
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import json
import io
import sys
import datetime

## Data Gathering from API

In [2]:
response1 = requests.get("https://api.covid19api.com/all")

In [3]:
data = response1.json()

In [4]:
df = pd.DataFrame.from_records(data)

In [150]:
df.head()

Unnamed: 0,Country,CountryCode,Province,City,CityCode,Lat,Lon,Confirmed,Deaths,Recovered,Active,Date
0,Afghanistan,AF,,,,33.94,67.71,0,0,0,0,2020-01-22T00:00:00Z
1,Afghanistan,AF,,,,33.94,67.71,0,0,0,0,2020-01-23T00:00:00Z
2,Afghanistan,AF,,,,33.94,67.71,0,0,0,0,2020-01-24T00:00:00Z
3,Afghanistan,AF,,,,33.94,67.71,0,0,0,0,2020-01-25T00:00:00Z
4,Afghanistan,AF,,,,33.94,67.71,0,0,0,0,2020-01-26T00:00:00Z
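
To make the conversion step concrete, here is a minimal sketch of how `pd.DataFrame.from_records` turns a list of JSON objects into a dataframe. The sample records only mimic a few of the fields shown above, and `records_to_frame` is a hypothetical helper written for illustration, not part of the notebook.

```python
import pandas as pd

# Hypothetical sample shaped like the API response (only a few fields kept).
sample_records = [
    {"Country": "Afghanistan", "Province": "", "Confirmed": 0, "Deaths": 0,
     "Date": "2020-01-22T00:00:00Z"},
    {"Country": "Afghanistan", "Province": "", "Confirmed": 0, "Deaths": 0,
     "Date": "2020-01-23T00:00:00Z"},
]

def records_to_frame(records):
    """Turn a list of dicts (one dict per observation) into a DataFrame."""
    if not isinstance(records, list):
        # An error payload is often a dict rather than a list of records.
        raise ValueError("expected a JSON array of records")
    return pd.DataFrame.from_records(records)

sample_df = records_to_frame(sample_records)
print(sample_df.shape)
```

Checking the payload type before building the dataframe is a cheap guard in case the request returns an error object instead of the expected list.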


## Filtering / Processing API Data

As seen in the above dataframe of the original data, we do not need all the columns. We will drop the CountryCode, City, CityCode, Lat, Lon and Active columns and filter the data for U.S. entries only. In addition, we will convert the Date column from ISO-style strings to datetimes, which are easier to read and work with.

In [171]:
df_filtered = df.drop(columns=['CountryCode', 'City', 'CityCode', 'Lat', 'Lon', 'Active'])
df_filtered = df_filtered[df_filtered['Country']=='United States of America']
df_filtered.head()

Unnamed: 0,Country,Province,Confirmed,Deaths,Recovered,Date
26670,United States of America,New York,0,0,0,2020-01-22T00:00:00Z
26671,United States of America,New Hampshire,0,0,0,2020-01-22T00:00:00Z
26672,United States of America,Pennsylvania,0,0,0,2020-01-22T00:00:00Z
26673,United States of America,Indiana,0,0,0,2020-01-22T00:00:00Z
26674,United States of America,Alabama,0,0,0,2020-01-22T00:00:00Z


In [172]:
# Converting date format: https://stackoverflow.com/questions/26763344/convert-pandas-column-to-datetime
df_filtered['Date'] = pd.to_datetime(df_filtered['Date'], format="%Y-%m-%dT%H:%M:%SZ")
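
As a small illustration of the conversion above, the sketch below parses two date strings in the same format. The `utc=True` variant is an alternative I am assuming might be useful if the time zone should be kept; it is not what the notebook does. Whether the explicit-format result is timezone-naive may depend on the pandas version, so that is worth checking locally.

```python
import pandas as pd

raw_dates = pd.Series(["2020-01-22T00:00:00Z", "2020-05-05T00:00:00Z"])

# Explicit format string, as in the notebook: the trailing "Z" in the
# format is a literal character, not the %z time-zone directive.
parsed = pd.to_datetime(raw_dates, format="%Y-%m-%dT%H:%M:%SZ")

# Alternative (assumption): let pandas interpret "Z" and keep UTC.
parsed_utc = pd.to_datetime(raw_dates, utc=True)

print(parsed.dt.strftime("%Y-%m-%d").tolist())
print(parsed_utc.dt.tz)
```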

In [173]:
df_filtered = df_filtered.drop(columns=['Country'])
df_filtered.tail()

Unnamed: 0,Province,Confirmed,Deaths,Recovered,Date
368230,Ohio,21,1,0,2020-05-05
368231,Kansas,0,0,0,2020-05-05
368232,Louisiana,657,40,0,2020-05-05
368233,Wyoming,1,0,0,2020-05-05
368234,Georgia,107,5,0,2020-05-05
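
The tail above lists Ohio with 21 confirmed cases on 2020-05-05, which looks low for a statewide count. One possible explanation (an assumption, not something the source confirms) is that several rows share the same Province and Date once the City column is dropped. A quick check, and a possible aggregation if that turns out to be the case, could look like the sketch below; the rows in `toy` are invented.

```python
import pandas as pd

# Invented rows: two Ohio records share a date, as sub-state rows would.
toy = pd.DataFrame({
    "Province": ["Ohio", "Ohio", "Kansas"],
    "Confirmed": [21, 30, 0],
    "Deaths": [1, 2, 0],
    "Date": pd.to_datetime(["2020-05-05", "2020-05-05", "2020-05-05"]),
})

# True if any (Province, Date) pair occurs more than once.
has_repeats = toy.duplicated(subset=["Province", "Date"]).any()

# Sum repeated pairs so that each (Province, Date) appears exactly once.
per_state = (
    toy.groupby(["Province", "Date"], as_index=False)[["Confirmed", "Deaths"]]
    .sum()
)
print(has_repeats, len(per_state))
```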


## Extracting U.S. Totals

Looking at the filtered dataframe more closely, we see that there are entries with an empty Province value. These entries correspond to nationwide U.S. totals, which are inherently different from the state-level data. Therefore, we will extract these entries into a new dataframe called 'df_ustotal'.

In [174]:
df_ustotal = df_filtered[df_filtered['Province']=='']
df_ustotal = df_ustotal.drop(columns=['Province'])
df_ustotal.tail()

Unnamed: 0,Confirmed,Deaths,Recovered,Date
352862,1103461,64943,164015,2020-05-01
355990,1132539,66369,175382,2020-05-02
358999,1158040,67682,180152,2020-05-03
361957,1180375,68922,187180,2020-05-04
365832,1204351,71064,189791,2020-05-05
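
The split on the Province column can be sketched on a toy dataframe as follows. The values are invented, and the final check simply confirms that the totals rows and the state rows together account for every original row.

```python
import pandas as pd

# Invented rows: an empty Province marks a nationwide total.
toy = pd.DataFrame({
    "Province": ["", "New York", "Ohio", ""],
    "Confirmed": [10, 6, 4, 15],
    "Date": pd.to_datetime(
        ["2020-05-04", "2020-05-04", "2020-05-04", "2020-05-05"]
    ),
})

is_total = toy["Province"] == ""   # boolean mask for nationwide rows
totals = toy[is_total].drop(columns=["Province"])
states = toy[~is_total]

# The two pieces should partition the original rows.
assert len(totals) + len(states) == len(toy)
```

If the missing values were stored as NaN rather than as empty strings, the mask would need `toy["Province"].isna()` instead.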


After extracting the U.S. totals data, we will make a new dataframe without the totals and call it 'df_states'; this dataframe will only include the day-by-day data for states. In addition, since the values for 'Recovered' are all zero for state data, we can drop that column altogether.

In [179]:
df_states = df_filtered[df_filtered['Province']!='']
# Rename by name rather than by position, so a change in column order cannot mislabel data
df_states = df_states.rename(columns={'Province': 'State'})

In [176]:
# since all recovered values are 0 for state data, we remove the column
df_states = df_states.drop(columns=['Recovered'])
df_states.tail()

Unnamed: 0,State,Confirmed,Deaths,Date
368230,Ohio,21,1,2020-05-05
368231,Kansas,0,0,2020-05-05
368232,Louisiana,657,40,2020-05-05
368233,Wyoming,1,0,2020-05-05
368234,Georgia,107,5,2020-05-05
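
Before dropping a column on the grounds that it is all zeros, it can help to verify that claim in code. Below is a minimal sketch with invented values; `drop_if_all_zero` is a hypothetical helper, not something the notebook defines.

```python
import pandas as pd

toy_states = pd.DataFrame({
    "State": ["Ohio", "Kansas"],
    "Confirmed": [21, 0],
    "Deaths": [1, 0],
    "Recovered": [0, 0],
})

def drop_if_all_zero(frame, column):
    """Drop `column` only when every value in it equals zero."""
    if (frame[column] == 0).all():
        return frame.drop(columns=[column])
    return frame

cleaned = drop_if_all_zero(toy_states, "Recovered")
print(list(cleaned.columns))
```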


## Export Data to JSON Files

Now, with our filtered dataframes 'df_states' (22 MB) and 'df_ustotal' (6.5 KB), we can export them to JSON files.

In [177]:
df_states.to_json(r'us_states.json')
df_ustotal.to_json(r'us_totals.json')
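
To check that an export can be read back, a round trip like the one below may be useful. It is a sketch under assumptions: the `orient="records"` and `date_format="iso"` options are choices I am making for readability rather than what the cell above uses, the data is invented, and `io.StringIO` stands in for a file on disk.

```python
import io
import pandas as pd

toy = pd.DataFrame({
    "State": ["Ohio", "Kansas"],
    "Confirmed": [21, 0],
    "Date": pd.to_datetime(["2020-05-05", "2020-05-05"]),
})

# orient="records" writes one JSON object per row; date_format="iso"
# writes dates as ISO 8601 strings instead of numeric timestamps.
json_text = toy.to_json(orient="records", date_format="iso")

# Read the text back and parse the Date column as datetimes again.
restored = pd.read_json(
    io.StringIO(json_text), orient="records", convert_dates=["Date"]
)
print(restored["Date"].dt.strftime("%Y-%m-%d").tolist())
```

With the defaults, `to_json` uses a column-oriented layout, which is still valid JSON but is harder to inspect by eye than one object per row.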