# COVID-19 in California: The Big Picture

## 1. Introduction
California has emerged as a hotspot for COVID-19 cases, leading to extensive, high-profile efforts to stem the virus' spread. However, policymakers continue to grapple with the Golden State's substantial size and diversity; sprawling metropolitan areas may see their hospitals straining for resources as deaths rise, while less populated regions could experience unnecessary economic devastation, given their relatively low mortality and infection rates. Therefore, the goal of this study is to provide a county-by-county overview of California's COVID-19 figures, paying particular attention to the following questions:
* What is the relationship between the overall number of cases and the number of serious/fatal cases? 
* What is the relationship between the number of suspected cases (including symptomatic patients awaiting test results) and the number of actual cases?
* Can we accurately predict total deaths for a particular day, given the number of cases?

## 2. Data
The data is a CSV file downloaded on May 16, 2020, from the [California Health and Human Services Agency data repository](https://healthdata.gov/dataset/california-covid-19-hospital-data-and-case-statistics). It contains day-by-day counts of COVID-19 cases by county, broken down along several other dimensions: whether the patient required ICU treatment and whether the case was confirmed or suspected, along with the total number of deaths. The data covers April 1, 2020, to May 15, 2020 (totals before April 1 were unavailable). The following columns are available from this dataset:
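* County
* Date
* Total confirmed cases
* Total deaths
* COVID-19 positive patients
* Suspected COVID-19 positive patients
* ICU COVID-19 positive patients
* ICU COVID-19 suspected patients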

## 3. Setup
We'll begin by loading the required packages and data, making sure that everything shows up correctly.

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [64]:
covid = pd.read_csv('covid19data.csv')
covid.head()

Unnamed: 0,County Name,Most Recent Date,Total Count Confirmed,Total Count Deaths,COVID-19 Positive Patients,Suspected COVID-19 Positive Patients,ICU COVID-19 Positive Patients,ICU COVID-19 Suspected Patients
0,Los Angeles,4/1/2020,3502.0,66.0,739.0,1332.0,335.0,220.0
1,San Bernardino,4/1/2020,245.0,5.0,95.0,196.0,39.0,52.0
2,Orange,4/1/2020,579.0,11.0,117.0,221.0,50.0,48.0
3,Riverside,4/1/2020,306.0,11.0,85.0,182.0,29.0,47.0
4,Sacramento,4/1/2020,299.0,8.0,53.0,138.0,20.0,33.0


In [11]:
covid.shape

(2654, 8)

In [15]:
covid.describe()

Unnamed: 0,Total_Cases,Total_Deaths,Positive,Suspected,ICU_Positive,ICU_Suspected
count,2653.0,2653.0,2629.0,2629.0,2623.0,2623.0
mean,683.464757,26.524689,53.206923,31.321795,19.46321,5.884102
std,2742.827685,128.325401,216.573905,101.664574,73.118813,17.958283
min,0.0,0.0,0.0,0.0,0.0,0.0
25%,5.0,0.0,0.0,0.0,0.0,0.0
50%,53.0,2.0,2.0,3.0,1.0,1.0
75%,393.0,11.0,27.0,22.0,11.0,4.0
max,36317.0,1755.0,1962.0,1350.0,625.0,220.0


## 4. Cleaning

### A. Renaming

For convenience, since we know that our data deals with COVID-19 cases, we can simplify our column names.

In [65]:
covid.columns = ['County', 'Date', 'Total_Cases', 'Total_Deaths', 'Positive', 'Suspected', 'ICU_Positive', 'ICU_Suspected']
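
Assigning a list to `covid.columns` relies on the columns arriving in a fixed order. A slightly more defensive variant, sketched here on a one-row stand-in frame, maps each original name explicitly with `rename`. The names `raw`, `name_map`, and `renamed` are illustrative, and the values are copied from the first row of the `head()` output in Section 3.

```python
import pandas as pd

# One-row stand-in for the raw file, using the original column names.
raw = pd.DataFrame({
    'County Name': ['Los Angeles'],
    'Most Recent Date': ['4/1/2020'],
    'Total Count Confirmed': [3502.0],
    'Total Count Deaths': [66.0],
    'COVID-19 Positive Patients': [739.0],
    'Suspected COVID-19 Positive Patients': [1332.0],
    'ICU COVID-19 Positive Patients': [335.0],
    'ICU COVID-19 Suspected Patients': [220.0],
})

# Explicit old-name -> new-name mapping (illustrative helper, not from the notebook).
name_map = {
    'County Name': 'County',
    'Most Recent Date': 'Date',
    'Total Count Confirmed': 'Total_Cases',
    'Total Count Deaths': 'Total_Deaths',
    'COVID-19 Positive Patients': 'Positive',
    'Suspected COVID-19 Positive Patients': 'Suspected',
    'ICU COVID-19 Positive Patients': 'ICU_Positive',
    'ICU COVID-19 Suspected Patients': 'ICU_Suspected',
}

# rename() matches on name, so a reordered file would still be labeled correctly.
renamed = raw.rename(columns=name_map)
print(renamed.columns.tolist())
```

With this approach, a column that is missing from the file simply keeps no mapping instead of silently shifting every later label by one position.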

### B. Missing Values

As we saw earlier, we seem to have some missing data. Let's take a closer look at the suspect columns:

In [21]:
covid.isnull().sum(axis = 0)

County            0
Date              0
Total_Cases       1
Total_Deaths      1
Positive         25
Suspected        25
ICU_Positive     31
ICU_Suspected    31
dtype: int64

`Total_Cases` and `Total_Deaths` are missing for only one row: Modoc County on April 1, 2020. We can replace these `NaN` values with 0, since Modoc reported 0 positive patients that day and the following days list its total cases and deaths as 0, as the next two code cells show.

In [40]:
covid[covid.Total_Cases.isnull()]

Unnamed: 0,County,Date,Total_Cases,Total_Deaths,Positive,Suspected,ICU_Positive,ICU_Suspected
58,Modoc,4/1/2020,,,0.0,1.0,0.0,0.0


In [41]:
covid[covid.County == 'Modoc'].head()

Unnamed: 0,County,Date,Total_Cases,Total_Deaths,Positive,Suspected,ICU_Positive,ICU_Suspected
58,Modoc,4/1/2020,,,0.0,1.0,0.0,0.0
141,Modoc,4/3/2020,0.0,0.0,0.0,0.0,0.0,0.0
200,Modoc,4/4/2020,0.0,0.0,0.0,0.0,0.0,0.0
259,Modoc,4/5/2020,0.0,0.0,0.0,0.0,0.0,0.0
318,Modoc,4/6/2020,0.0,0.0,0.0,0.0,0.0,0.0


Next, let's look at the rows that have missing values for `ICU_Positive`. We discover that the resulting table accounts for all other missing values, and they seem to pop up for the same counties: Glenn, Alpine, Sierra, Sutter, and an unassigned region (cases that don't fall in any particular county).

In [54]:
covid[covid.ICU_Positive.isnull()].head(8) # remove .head(8) to see the entire table

Unnamed: 0,County,Date,Total_Cases,Total_Deaths,Positive,Suspected,ICU_Positive,ICU_Suspected
54,Unassigned,4/1/2020,50.0,1.0,,,,
55,Glenn,4/1/2020,2.0,0.0,,,,
56,Alpine,4/1/2020,1.0,0.0,,,,
57,Sierra,4/1/2020,0.0,0.0,,,,
60,Alpine,4/2/2020,1.0,0.0,,,,
103,Sierra,4/2/2020,0.0,0.0,,,,
108,Sutter,4/2/2020,10.0,1.0,0.0,0.0,,
113,Unassigned,4/2/2020,49.0,1.0,,,,
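
To check that the gaps really do cluster in a handful of counties, one option is to count missing values per county rather than scan the rows by eye. The sketch below runs on a small made-up frame (`toy`; its counties and values are placeholders, not real data) that reuses two of the notebook's column names.

```python
import numpy as np
import pandas as pd

# Made-up rows for illustration only; counties A, B, and C are placeholders.
toy = pd.DataFrame({
    'County': ['A', 'A', 'B', 'B', 'C'],
    'Positive': [np.nan, np.nan, 0.0, 0.0, 5.0],
    'ICU_Positive': [np.nan, np.nan, np.nan, 0.0, 2.0],
})

hospital_cols = ['Positive', 'ICU_Positive']

# isnull() gives a True/False frame; summing within each county counts the gaps.
missing_by_county = toy[hospital_cols].isnull().groupby(toy['County']).sum()

# Keep only the counties that have at least one missing value.
missing_by_county = missing_by_county[missing_by_county.sum(axis=1) > 0]
print(missing_by_county)
```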


We can reasonably assume that, for the named counties, all `NaN` values can be replaced by 0. Between 4/1/2020 and 4/8/2020, Alpine and Sierra Counties saw no change in total cases or deaths, so their number of in-hospital patients most likely stayed at 0. Sutter County's cases rose steadily, but it reported no lab-confirmed hospital cases, which implies that the new cases were non-hospital patients. Glenn County has missing entries only for 4/1/2020, the first day of reporting; with just 2 confirmed cases, those patients likely contracted COVID-19 before that day and thus were not part of the hospitals' daily count. Finally, since we are focusing on the county-by-county perspective, we will drop the unassigned cases. We then clean our data as follows, making sure no missing data remains:

In [73]:
# Drop the unassigned rows, then fill the remaining gaps with 0
covid = covid[covid.County != 'Unassigned'].fillna(0)
covid.isnull().sum()

County           0
Date             0
Total_Cases      0
Total_Deaths     0
Positive         0
Suspected        0
ICU_Positive     0
ICU_Suspected    0
dtype: int64
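
The assumption that a county's totals stayed flat over a date window can also be checked programmatically. The following is a minimal sketch on invented rows (`toy`) with a hypothetical helper, `flat_counties`; it assumes dates in the `M/D/YYYY` format shown in the outputs above and columns with no remaining `NaN` values.

```python
import pandas as pd

# Invented rows for illustration; county A is flat, county B is growing.
toy = pd.DataFrame({
    'County': ['A', 'A', 'A', 'B', 'B', 'B'],
    'Date': ['4/1/2020', '4/2/2020', '4/8/2020'] * 2,
    'Total_Cases': [1.0, 1.0, 1.0, 10.0, 12.0, 15.0],
    'Total_Deaths': [0.0, 0.0, 0.0, 1.0, 1.0, 1.0],
})

def flat_counties(df, start, end):
    """Return counties whose cumulative totals never change between start and end."""
    dates = pd.to_datetime(df['Date'], format='%m/%d/%Y')
    in_window = (dates >= pd.Timestamp(start)) & (dates <= pd.Timestamp(end))
    window = df[in_window]
    # nunique() == 1 means the column held a single value over the whole window.
    distinct = window.groupby('County')[['Total_Cases', 'Total_Deaths']].nunique()
    return distinct[(distinct == 1).all(axis=1)].index.tolist()

print(flat_counties(toy, '2020-04-01', '2020-04-08'))
```

Applied to the real data, a call like this would list the counties for which "no change in totals" holds over the chosen window, which is the premise behind filling their hospital counts with 0.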

### C. Creating New Columns

According to the data codebook, our columns are measured in different ways. For example, `Total_Cases` and `Total_Deaths` are cumulative, but the other numeric variables are day-to-day counts. Moreover, some values are counted in more than one column; for instance, `ICU_Positive` is a subset of `Positive`. This overlap can hurt model performance later on, so we want to create new columns whose information does not overlap.
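
One way to remove this kind of overlap is to subtract a subset column from its parent. The sketch below is an assumption about how that could look, not necessarily the columns this study ends up using: `Non_ICU_Positive` is a hypothetical name, and the two rows reuse the `Positive` and `ICU_Positive` values from the first lines of the `head()` output in Section 3.

```python
import pandas as pd

# Two rows copied from the head() output shown earlier.
rows = pd.DataFrame({
    'County': ['Los Angeles', 'San Bernardino'],
    'Positive': [739.0, 95.0],
    'ICU_Positive': [335.0, 39.0],
})

# Hypothetical column: positive patients who are not in the ICU.
rows['Non_ICU_Positive'] = rows['Positive'] - rows['ICU_Positive']

# ICU and non-ICU counts now split Positive without double-counting.
recombined = rows['Non_ICU_Positive'] + rows['ICU_Positive']
print(rows)
print((recombined == rows['Positive']).all())
```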

In [76]:
# A plain shift takes the previous row, which usually belongs to a different county
covid.Total_Cases.shift(1)

0          NaN
1       3502.0
2        245.0
3        579.0
4        306.0
         ...  
2648       1.0
2649    1356.0
2651       2.0
2652     789.0
2653     182.0
Name: Total_Cases, Length: 2609, dtype: float64

In [81]:
# Shifting within each county gives that county's previous reported total
covid.groupby('County').Total_Cases.shift(1)

0          NaN
1          NaN
2          NaN
3          NaN
4          NaN
         ...  
2648    1317.0
2649       2.0
2651     760.0
2652     180.0
2653      21.0
Name: Total_Cases, Length: 2609, dtype: float64
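
The two outputs above differ because a plain `shift(1)` pulls the value from the previous row regardless of county, while a shift after `groupby('County')` pulls the previous row within the same county. A small made-up frame makes the difference easier to see (the counties and numbers here are illustrative only):

```python
import pandas as pd

# Rows interleave two counties, as in the real file where each date lists every county.
toy = pd.DataFrame({
    'County': ['A', 'B', 'A', 'B'],
    'Date': ['4/1/2020', '4/1/2020', '4/2/2020', '4/2/2020'],
    'Total_Cases': [100.0, 10.0, 120.0, 15.0],
})

# Previous row overall: county B on 4/1 wrongly inherits county A's total.
plain = toy['Total_Cases'].shift(1)

# Previous row within the same county: the first day of each county is NaN.
by_county = toy.groupby('County')['Total_Cases'].shift(1)

print(pd.DataFrame({'County': toy['County'], 'plain': plain, 'by_county': by_county}))
```

Only the grouped version lines up each county's total with its own total from the previous report, which is what a day-over-day difference needs.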

In [82]:
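covid_full = covid.copy()
# Cumulative cases that have not ended in death
covid_full['Total_Nonfatal_Cases'] = covid_full.Total_Cases - covid_full.Total_Deaths
# New cases per day: today's cumulative total minus the same county's previous total
# (NaN on each county's first reported day, which has no previous row)
covid_full['Today_Cases'] = covid_full.Total_Cases - covid_full.groupby('County').Total_Cases.shift(1)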
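
Two properties of this subtraction are worth keeping in mind. Each county's first reported day has no previous row, so `Today_Cases` is `NaN` there, and the result is a true daily count only if rows are in chronological order within each county. The sketch below, on made-up data, sorts by a parsed date first and uses `groupby(...).diff()`, which computes the same current-minus-previous difference. `Parsed_Date` is a hypothetical helper column, and whether to keep or fill the first-day `NaN` is left open here.

```python
import pandas as pd

# Made-up rows; county A is deliberately listed out of date order.
toy = pd.DataFrame({
    'County': ['A', 'A', 'B', 'B'],
    'Date': ['4/2/2020', '4/1/2020', '4/1/2020', '4/2/2020'],
    'Total_Cases': [120.0, 100.0, 10.0, 15.0],
})

# Parse the M/D/YYYY strings so sorting is chronological, not alphabetical.
toy['Parsed_Date'] = pd.to_datetime(toy['Date'], format='%m/%d/%Y')
toy = toy.sort_values(['County', 'Parsed_Date'])

# diff() within each county equals the total minus the county's shifted total.
toy['Today_Cases'] = toy.groupby('County')['Total_Cases'].diff()
print(toy[['County', 'Date', 'Total_Cases', 'Today_Cases']])
```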