## COVID-19 Post Vaccination Infection Data and Analysis in California

"The California Department of Public Health (CDPH) is identifying vaccination status of COVID-19 cases, hospitalizations, and deaths by analyzing the state immunization registry and registry of confirmed COVID-19 cases. Post-vaccination cases are individuals who have a positive SARS-Cov-2 molecular test (e.g. PCR) at least 14 days after they have completed their primary vaccination series or 14 days after they have completed their booster or additional dose."

All data and data definitions used in this notebook come from the [California Open Data Portal](https://data.ca.gov/dataset/covid-19-post-vaccination-infection-data).

In this notebook, we want to answer some questions:
- What are the trends of vaccination status in California?
- Among those who are not vaccinated, what percentage were infected and what percentage were hospitalized?
- What are the trends of COVID cases?
- What are the trends of COVID hospitalization?

In [1]:
# import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [2]:
# load dataframe
url = "https://data.chhs.ca.gov/dataset/e39edc8e-9db1-40a7-9e87-89169401c3f5/resource/c5978614-6a23-450b-b637-171252052214/download/covid19postvaxstatewidestats.csv"
df = pd.read_csv(url)
df.head()

Unnamed: 0,date,area,area_type,unvaccinated_cases,vaccinated_cases,boosted_cases,unvaccinated_hosp,vaccinated_hosp,boosted_hosp,unvaccinated_deaths,...,population_boosted,unvaccinated_cases_per_100k,vaccinated_cases_per_100k,boosted_cases_per_100k,unvaccinated_hosp_per_100k,vaccinated_hosp_per_100k,boosted_hosp_per_100k,unvaccinated_deaths_per_100k,vaccinated_deaths_per_100k,boosted_deaths_per_100k
0,2021-02-01,California,State,13804,22,0,792,0,0,12,...,0,,,,,,,,,
1,2021-02-02,California,State,11352,17,0,633,0,0,15,...,0,,,,,,,,,
2,2021-02-03,California,State,10328,26,0,567,0,0,17,...,0,,,,,,,,,
3,2021-02-04,California,State,9003,17,0,498,0,0,19,...,0,,,,,,,,,
4,2021-02-05,California,State,8396,17,0,511,0,0,27,...,0,,,,,,,,,


In [4]:
# understand columns
# list of columns
df.columns

Index(['date', 'area', 'area_type', 'unvaccinated_cases', 'vaccinated_cases',
       'boosted_cases', 'unvaccinated_hosp', 'vaccinated_hosp', 'boosted_hosp',
       'unvaccinated_deaths', 'vaccinated_deaths', 'boosted_deaths',
       'population_unvaccinated', 'population_vaccinated',
       'population_boosted', 'unvaccinated_cases_per_100k',
       'vaccinated_cases_per_100k', 'boosted_cases_per_100k',
       'unvaccinated_hosp_per_100k', 'vaccinated_hosp_per_100k',
       'boosted_hosp_per_100k', 'unvaccinated_deaths_per_100k',
       'vaccinated_deaths_per_100k', 'boosted_deaths_per_100k'],
      dtype='object')

In [8]:
# load data dictionary from the source in order to understand the columns
data_dict_url = "https://data.chhs.ca.gov/dataset/e39edc8e-9db1-40a7-9e87-89169401c3f5/resource/0c33ce39-a523-43b6-9fb3-a5bfe25d0cc6/download/postvax_odp_data-dictionary_12p_booster.xlsx"
data_dict = pd.read_excel(data_dict_url, skiprows=1)
data_dict

Unnamed: 0,COLUMN_NAME,FORMAT,DEFINITION
0,DATE,Date,Reporting time period\n\nValues:\nDate in YYY-...
1,AREA,Plain text,"State of Residence\n\nValue: \n""California"""
2,AREA_TYPE,Plain text,Geographic type of the Area field.\n\nValues: ...
3,UNVACCINATED_CASES,Numeric,Total number of laboratory-confirmed COVID-19 ...
4,VACCINATED_CASES,Numeric,Total number of laboratory-confirmed COVID-19 ...
5,BOOSTED_CASES,Numeric,Total number of laboratory-confirmed COVID-19 ...
6,UNVACCINATED_DEATHS,Numeric,Total number of laboratory-confirmed COVID-19 ...
7,VACCINATED_DEATHS,Numeric,Total number of laboratory-confirmed COVID-19 ...
8,BOOSTED_DEATHS,Numeric,Total number of laboratory-confirmed COVID-19 ...
9,UNVACCINATED_HOSP,Numeric,Total number of hospitalized laboratory-confir...
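
The `DEFINITION` column is truncated in the preview above. Below is a minimal sketch of one way to read a full entry. It uses a small stand-in frame instead of the downloaded file; `sample_dict` and the helper `full_definition` are hypothetical names introduced only for this illustration, and the definition text is copied from the definitions quoted in this notebook.

```python
import pandas as pd

# Stand-in for the data dictionary (hypothetical frame, two rows only).
sample_dict = pd.DataFrame({
    "COLUMN_NAME": ["AREA", "UNVACCINATED_CASES"],
    "DEFINITION": [
        'State of Residence\n\nValue: \n"California"',
        "Total number of laboratory-confirmed COVID-19 cases among persons "
        "age 12+ with episode date on the provided date with no record of "
        "any doses of COVID-19 vaccine.",
    ],
})


def full_definition(data_dict, column_name):
    """Return the untruncated definition text for one column name."""
    mask = data_dict["COLUMN_NAME"] == column_name.upper()
    match = data_dict.loc[mask, "DEFINITION"]
    if match.empty:
        raise KeyError(column_name)
    return match.iloc[0]


# Lowercase names (as used in the CSV) are matched against the uppercase dictionary.
print(full_definition(sample_dict, "unvaccinated_cases"))
```

Alternatively, `pd.set_option("display.max_colwidth", None)` lifts the display truncation for every dataframe shown in the notebook.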


Below are important definitions:
- UNVACCINATED_CASES: Total number of laboratory-confirmed COVID-19 cases among persons age 12+ with episode date on the provided date with no record of any doses of COVID-19 vaccine. Persons considered partially vaccinated are not included in the unvaccinated cases.
- VACCINATED_CASES: Total number of laboratory-confirmed COVID-19 cases among persons age 12+ with episode date on the provided date with a complete primary COVID-19 vaccine series (episode date ≥14 days after the 2nd dose of a two-dose series or ≥14 days after a single-dose vaccine). Persons considered partially vaccinated are not included in the vaccinated cases.
- BOOSTED_CASES: Total number of laboratory-confirmed COVID-19 cases among persons age 12+ with episode date on the provided date with a complete COVID-19 vaccine series and additional or booster dose (episode date ≥14 days after the additional or booster dose). 
- POPULATION_VACCINATED: Number of persons age 12+ with a complete primary COVID-19 vaccine series based on information in the California Immunization Registry. This number only includes those persons considered vaccinated, defined as ≥14 days after the 2nd dose of a two-dose series or ≥14 days after a single-dose vaccine.
- POPULATION_BOOSTED: Number of persons age 12+ with a complete COVID-19 vaccine series and an additional or booster dose based on information in the California Immunization Registry. This number only includes those persons considered to have received an additional or booster dose, defined as ≥14 days after the additional or booster dose.


In short:
- UNVACCINATED_CASES: People with no record of any COVID-19 vaccine dose who are confirmed to have COVID-19 (partially vaccinated people are excluded)
- VACCINATED_CASES: People who are fully vaccinated and are confirmed to have COVID-19
- BOOSTED_CASES: People who are fully vaccinated and had additional or booster dose, and are confirmed to have COVID-19

Similar definitions apply to "_DEATHS" for COVID-19 related deaths and "_HOSP" for COVID-19 related hospitalizations. 

In [9]:
# remove unnecessary columns
# we know that this is California statewide data, so we don't need area or area_type
# we also drop the per-100k rates because we have no other state or county data to compare against
to_drop = ['area', 'area_type','unvaccinated_cases_per_100k',
       'vaccinated_cases_per_100k', 'boosted_cases_per_100k',
       'unvaccinated_hosp_per_100k', 'vaccinated_hosp_per_100k',
       'boosted_hosp_per_100k', 'unvaccinated_deaths_per_100k',
       'vaccinated_deaths_per_100k', 'boosted_deaths_per_100k']
df = df.drop(columns=to_drop)
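
The per-100k columns share a common suffix, so the drop list could also be built by pattern instead of being typed out. Below is a sketch on a toy frame; `toy` is a stand-in with a handful of the real column names, and the first-row values are taken from the preview above.

```python
import pandas as pd

# Toy frame with a subset of the dataset's columns (illustrative only).
toy = pd.DataFrame({
    "date": ["2021-02-01"],
    "area": ["California"],
    "area_type": ["State"],
    "unvaccinated_cases": [13804],
    "vaccinated_cases": [22],
    "unvaccinated_cases_per_100k": [float("nan")],
    "vaccinated_cases_per_100k": [float("nan")],
})

# Collect every rate column by its suffix, then drop it with the area fields.
rate_cols = [col for col in toy.columns if col.endswith("_per_100k")]
trimmed = toy.drop(columns=["area", "area_type"] + rate_cols)

print(list(trimmed.columns))
```

Selecting by suffix keeps the cell short and would still work if the source later adds more per-100k columns.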

In [13]:
# check null values
df.isna().sum()

date                       0
unvaccinated_cases         0
vaccinated_cases           0
boosted_cases              0
unvaccinated_hosp          0
vaccinated_hosp            0
boosted_hosp               0
unvaccinated_deaths        0
vaccinated_deaths          0
boosted_deaths             0
population_unvaccinated    0
population_vaccinated      0
population_boosted         0
dtype: int64
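
`isna()` does not check column types. When a CSV is read without `parse_dates`, a date column is normally loaded as plain strings, so a conversion along these lines may help before plotting trends. This is a sketch on toy rows: the `toy` frame is a stand-in whose values come from the preview above, and the weekly resampling is just one possible smoothing choice.

```python
import pandas as pd

# Stand-in for the first three rows of the dataset.
toy = pd.DataFrame({
    "date": ["2021-02-01", "2021-02-02", "2021-02-03"],
    "unvaccinated_cases": [13804, 11352, 10328],
})

# Parse the ISO-formatted strings and use the dates as a sorted index.
toy["date"] = pd.to_datetime(toy["date"], format="%Y-%m-%d")
toy = toy.set_index("date").sort_index()
print(toy.index.dtype)

# Weekly totals (weeks ending on Sunday) as one option for a smoother trend line.
weekly = toy["unvaccinated_cases"].resample("W").sum()
print(weekly)
```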

In [14]:
# data is clean, ready for statistical exploration
df.describe()

Unnamed: 0,unvaccinated_cases,vaccinated_cases,boosted_cases,unvaccinated_hosp,vaccinated_hosp,boosted_hosp,unvaccinated_deaths,vaccinated_deaths,boosted_deaths,population_unvaccinated,population_vaccinated,population_boosted
count,357.0,357.0,357.0,357.0,357.0,357.0,357.0,357.0,357.0,357.0,357.0,357.0
mean,5552.871148,3740.896359,903.498599,272.733894,65.07563,8.997199,44.638655,8.151261,0.610644,13271090.0,16223360.0,1563477.0
std,8690.525805,9465.363219,3022.649704,188.301603,88.85016,26.846892,29.421214,8.947699,1.910115,8123912.0,7529373.0,2975283.0
min,481.0,6.0,0.0,47.0,0.0,0.0,2.0,0.0,0.0,4950144.0,339181.0,0.0
25%,1684.0,77.0,0.0,120.0,7.0,0.0,18.0,0.0,0.0,7054967.0,11442650.0,0.0
50%,2918.0,1171.0,0.0,221.0,49.0,0.0,43.0,6.0,0.0,10315920.0,19376090.0,0.0
75%,5341.0,2106.0,27.0,383.0,84.0,2.0,61.0,14.0,0.0,16939700.0,22163040.0,1348838.0
max,52890.0,58638.0,15882.0,792.0,448.0,151.0,131.0,45.0,13.0,32630550.0,23714710.0,11150740.0


### First impressions

- The mean of vaccinated cases is about 1.5 times smaller than the mean of unvaccinated cases (about 3,741 vs. 5,553).
- The mean of vaccinated hospitalizations is more than 4 times smaller than the mean of unvaccinated hospitalizations (about 65 vs. 273).
- The mean of vaccinated deaths is more than 5 times smaller than the mean of unvaccinated deaths (about 8.2 vs. 44.6).
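
The ratios behind these bullets can be checked from the `mean` row of the `describe()` output above. The sketch below hard-codes those printed means; the `means` dictionary is simply a copy of that row, not a new computation on the dataset.

```python
# (unvaccinated mean, vaccinated mean) copied from the describe() output.
means = {
    "cases": (5552.871148, 3740.896359),
    "hosp": (272.733894, 65.07563),
    "deaths": (44.638655, 8.151261),
}

# Ratio of the unvaccinated mean to the vaccinated mean for each outcome.
ratios = {outcome: unvax / vax for outcome, (unvax, vax) in means.items()}

for outcome, ratio in ratios.items():
    print(f"{outcome}: unvaccinated mean is {ratio:.2f}x the vaccinated mean")
```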

It seems that vaccination status is related to the number of cases, hospitalizations, and deaths.
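
The means above are raw counts, so they do not account for how many people are in each group. One way to approach the second question (what percentage of a group was infected or hospitalized) is to divide each count by the matching population column. The sketch below assumes that approach; the numbers in `day` are made up for illustration, and `pct_of_group` is a hypothetical helper, not part of the dataset or its documentation.

```python
import pandas as pd

# Hypothetical one-day snapshot; these values are NOT taken from the dataset.
day = pd.DataFrame({
    "unvaccinated_cases": [5000],
    "vaccinated_cases": [2000],
    "population_unvaccinated": [10_000_000],
    "population_vaccinated": [20_000_000],
})


def pct_of_group(df, status, outcome="cases"):
    """Percentage of a status group with the given outcome on each row."""
    return 100 * df[f"{status}_{outcome}"] / df[f"population_{status}"]


unvax_pct = pct_of_group(day, "unvaccinated")
vax_pct = pct_of_group(day, "vaccinated")
print(unvax_pct.iloc[0], vax_pct.iloc[0])
```

The same helper would work for hospitalizations by passing `outcome="hosp"`, since the count columns follow the `<status>_<outcome>` naming pattern.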