# Ensemble Challenge
Goal: to capture the complexity and nuances around the evolution of the pandemic at various stages and locations.

## Consider the following settings:
1. *Timepoint 1*: May 1st, 2020. Setting: Michigan State at the beginning of the pandemic when masking was the main preventative measure. No vaccines available.
2. *Timepoint 2*: May 1st, 2021. Setting: Michigan State prior to the arrival of the Delta variant. Vaccines available.
3. *Timepoint 3*: December 15th, 2021. Setting: Michigan State during the start of the first Omicron wave.

4. *BONUS*: Consider the same three time points, but change the setting to Louisiana, which had different COVID-19 dynamics compared to the Northern and Northeastern states.

## ...and related questions for each:
1. What is the most relevant data to use for model calibration?
2. What was our understanding of COVID-19 viral mechanisms at the time? For example, early in the pandemic, we didn't know whether reinfection was a common occurrence, or even possible.
3. What are the parameters related to contagiousness/transmissibility and severity of the dominant strain at the time?
4. What policies were in place for a stated location, and how can this information be incorporated into models? (See https://www.bsg.ox.ac.uk/research/covid-19-government-response-tracker for time series of interventions.)
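One possible way to fold an intervention time series into a compartmental model is to let it scale the transmission rate over time. The sketch below is only an illustration under stated assumptions: the stringency values, `beta0`, `max_reduction`, and `gamma` are hypothetical and uncalibrated, and the linear scaling rule is an assumption rather than an established relationship. It uses a stringency index on a 0-100 scale, as the Oxford tracker does.

```python
import numpy as np

def simulate_sir(beta_t, gamma, population, initial_infected):
    """Daily-step SIR model with a time-varying transmission rate beta_t."""
    s = float(population - initial_infected)
    i = float(initial_infected)
    infected = []
    for beta in beta_t:
        new_infections = beta * s * i / population
        new_recoveries = gamma * i
        s -= new_infections
        i += new_infections - new_recoveries
        infected.append(i)
    return np.array(infected)

# Hypothetical stringency series on a 0-100 scale: no measures for 10 days,
# then a constant level of 80 for 20 days.
stringency = np.array([0.0] * 10 + [80.0] * 20)

# Assumed values, not calibrated to any data.
beta0 = 0.4          # baseline transmission rate per day
max_reduction = 0.6  # fractional drop in beta at stringency 100
gamma = 0.1          # recovery rate per day

# Assumed linear rule: beta falls in proportion to the stringency index.
beta_t = beta0 * (1.0 - max_reduction * stringency / 100.0)

with_policy = simulate_sir(beta_t, gamma, population=10_000_000, initial_infected=100)
no_policy = simulate_sir(np.full(30, beta0), gamma, population=10_000_000, initial_infected=100)
```

In a calibration setting, `max_reduction` could be treated as a free parameter and fitted alongside the others instead of being fixed in advance.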

## For each setting:
1. (a) Take a single model, calibrate it using historical data prior to the given date, and create a 4-week forecast for cases, hospitalizations, and deaths beginning on the given date. (b) Evaluate the forecast using the COVID-19 Forecasting Hub Error Metrics (WIS, MAE). The single model evaluation should be done in the same way as the ensemble.

2. Repeat (1), but with an ensemble of different models.

    a. It is fine to calibrate each model independently and weight naively.
    
    b. It would also be fine to calibrate the ensemble as a whole, assigning weights to the different component models, so that you minimize the error of the ensemble vs. historical data.
    
    c. Use the calibration scores and error metrics computed by the CDC Forecasting Hub. As stated on their [website](https://covid19forecasthub.org/doc/reports/): 
    
    “Periodically, we evaluate the accuracy and precision of the [ensemble forecast](https://covid19forecasthub.org/doc/ensemble/) and component models over recent and historical forecasting periods. Models forecasting incident hospitalizations at a national and state level are evaluated using [adjusted relative weighted interval scores (WIS, a measure of distributional accuracy)](https://arxiv.org/abs/2005.12881), and adjusted relative mean absolute error (MAE), and calibration scores. Scores are evaluated across weeks, locations, and targets. You can read [a paper explaining these procedures in more detail](https://www.medrxiv.org/content/10.1101/2021.02.03.21250974v1), and look at [the most recent monthly evaluation reports](https://covid19forecasthub.org/eval-reports). The final report that includes case and death forecast evaluations is 2023-03-13.” 

3. Produce the forecast outputs in the format specified by the CDC forecasting challenge, including the specified quantiles.
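The steps above (naive ensemble weighting, quantile-format output, and scoring) can be sketched as follows. This is a minimal illustration, not the Forecast Hub's own evaluation code: the function names are hypothetical, the five quantile levels and all forecast values are made up for the example, and the WIS formula follows the definition in the WIS paper linked above (median term plus interval scores weighted by alpha/2).

```python
import numpy as np

def interval_score(y, lower, upper, alpha):
    """Interval score of a central (1 - alpha) prediction interval for observation y."""
    return ((upper - lower)
            + (2.0 / alpha) * max(lower - y, 0.0)
            + (2.0 / alpha) * max(y - upper, 0.0))

def weighted_interval_score(y, quantile_levels, quantile_values):
    """WIS from a symmetric set of quantiles that includes the median (level 0.5)."""
    q = {round(float(l), 3): float(v) for l, v in zip(quantile_levels, quantile_values)}
    lower_levels = sorted(l for l in q if l < 0.5)
    total = 0.5 * abs(y - q[0.5])
    for l in lower_levels:
        alpha = 2.0 * l  # the l and 1 - l quantiles bound the central (1 - alpha) interval
        total += (alpha / 2.0) * interval_score(y, q[l], q[round(1.0 - l, 3)], alpha)
    return total / (len(lower_levels) + 0.5)

def quantile_ensemble(model_quantiles, weights=None):
    """Combine models by a (weighted) mean of their quantiles at each level."""
    return np.average(np.asarray(model_quantiles, dtype=float), axis=0, weights=weights)

# Illustrative quantile levels and forecasts (not the hub's full quantile set).
levels = [0.025, 0.25, 0.5, 0.75, 0.975]
model_a = [80, 95, 100, 105, 120]
model_b = [90, 105, 110, 115, 130]
observed = 108.0

ensemble = quantile_ensemble([model_a, model_b])    # equal weights by default
wis = weighted_interval_score(observed, levels, ensemble)
abs_error = abs(observed - ensemble[2])             # absolute error of the median
```

MAE is the mean of `abs_error` over all scored forecasts. Passing `weights=[w_a, w_b]` to `quantile_ensemble` gives the weighted variant described in (b), with the weights chosen by whatever calibration procedure is used.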

## Data
Use the following data sources:
1. Cases: [Johns Hopkins](https://github.com/CSSEGISandData/COVID-19/blob/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_US.csv), [Reich Lab](https://github.com/reichlab/covid19-forecast-hub/blob/master/data-truth/truth-Incident%20Cases.csv) (pulled from Johns Hopkins, but formatted)

2. Hospitalizations: [HealthData.gov](https://healthdata.gov/Hospital/COVID-19-Reported-Patient-Impact-and-Hospital-Capa/g62h-syeh)

3. Deaths: [Johns Hopkins](https://github.com/reichlab/covid19-forecast-hub/blob/master/data-truth/truth-Incident%20Deaths.csv), [Reich Lab](https://github.com/reichlab/covid19-forecast-hub/blob/master/data-truth/truth-Cumulative%20Deaths.csv)

In [11]:
# Load initial dependencies
import pandas as pd
import numpy as np

In [48]:
# Read in case, hospitalization, and death data
# Note: for calibration, we need case and hospitalization *census* data, and *cumulative* death data (assuming no zombies)
def get_case_hosp_death_data(state, infectious_period):
    '''Read daily incident cases and cumulative deaths from the COVID-19 Forecast Hub
    (https://github.com/reichlab/covid19-forecast-hub) and hospital census data from HealthData.gov.
    Each dataset is sorted by date and filtered to one location; incident case data is then
    converted to census data assuming an infectious period of infectious_period days.
    
    :param state: 2-letter state abbreviation (or "US")
    :param infectious_period: duration of the infectious period (in days)
    :returns: tuple of DataFrames (location_cases, location_hosp, location_deaths)
    '''
    
    # The Forecast Hub truth files identify locations by FIPS code, so translate the
    # abbreviation with fips_dict (defined in a later cell; run that cell first)
    fips_code = str(fips_dict[state]).zfill(2)
    
    # load case data and sort rows by date
    url = 'https://media.githubusercontent.com/media/reichlab/covid19-forecast-hub/master/data-truth/truth-Incident%20Cases.csv'
    raw_cases = pd.read_csv(url)
    raw_cases['date'] = pd.to_datetime(raw_cases['date'])
    raw_cases = raw_cases.sort_values(by='date')
    
    # grab case data for the given location (zfill restores a dropped leading zero)
    is_location = raw_cases['location'].astype(str).str.zfill(2) == fips_code
    location_cases = raw_cases[is_location].reset_index(drop=True)
    location_cases['infectious_period'] = infectious_period
    location_cases['census'] = 0
    
    # convert from incident to census case data
    location_cases = incident_to_census(location_cases, infectious_period)
    
    # load hospital census data (local copy of the HealthData.gov dataset, keyed by state abbreviation)
    raw_hosp = pd.read_csv('hospitalization_data.csv')
    raw_hosp['date'] = pd.to_datetime(raw_hosp['date'])
    raw_hosp = raw_hosp.sort_values(by='date')
    location_hosp = raw_hosp[raw_hosp['state'] == state].reset_index(drop=True)
    
    # load cumulative death data
    url = 'https://media.githubusercontent.com/media/reichlab/covid19-forecast-hub/master/data-truth/truth-Cumulative%20Deaths.csv'
    raw_deaths = pd.read_csv(url)
    raw_deaths['date'] = pd.to_datetime(raw_deaths['date'])
    raw_deaths = raw_deaths.sort_values(by='date')
    is_location = raw_deaths['location'].astype(str).str.zfill(2) == fips_code
    location_deaths = raw_deaths[is_location].reset_index(drop=True)
    
    return location_cases, location_hosp, location_deaths



In [45]:
# Sort rows by date
raw_cases['date'] = pd.to_datetime(raw_cases['date'])
raw_cases.sort_values(by='date', ascending=True, inplace=True)
# display(raw_cases.head())

raw_hosp['date'] = pd.to_datetime(raw_hosp['date'])
raw_hosp.sort_values(by='date', ascending=True, inplace=True)
# display(raw_hosp.head())

raw_deaths['date'] = pd.to_datetime(raw_deaths['date'])
raw_deaths.sort_values(by='date', ascending=True, inplace=True)
# display(raw_deaths.head())


Unnamed: 0,date,location,location_name,value
0,2020-01-22,1001,Autauga County,0
233290,2020-01-22,6035,Lassen County,0
2642436,2020-01-22,45011,Barnwell County,0
2643574,2020-01-22,45013,Beaufort County,0
2644712,2020-01-22,45015,Berkeley County,0




Unnamed: 0,state,date,inpatient_beds_used_covid
29637,PR,2020-01-01,0.0
26949,TX,2020-01-01,0.0
33193,NC,2020-01-01,0.0
25773,HI,2020-01-01,0.0
25065,MT,2020-01-01,0.0




Unnamed: 0,date,location,location_name,value
0,2020-01-22,1001,Autauga County,0
233290,2020-01-22,6035,Lassen County,0
2642436,2020-01-22,45011,Barnwell County,0
2643574,2020-01-22,45013,Beaufort County,0
2644712,2020-01-22,45015,Berkeley County,0


In [72]:
# str.extract needs at least one capture group, so wrap the pattern in parentheses
raw_cases["location"].astype(str).str.extract(r'^(6\d{3})')

In [49]:
# Incident to census function
def incident_to_census(data, duration):
    '''This function converts incident data to census data.
    
    Each incident count is carried in the census for duration consecutive rows, starting
    with the row in which it is reported, so the census is a trailing rolling sum.
    
    :param data: DataFrame ordered by date with a value column holding incident data
    :param duration: this is the length of time (in rows) before leaving the state variable category
    :returns: the same DataFrame with the census column filled out
    '''
    # Vectorized rolling sum; avoids chained assignment such as data.census[i] += ...
    data["census"] = data["value"].rolling(window=duration, min_periods=1).sum()
    
    return data
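The conversion performed by `incident_to_census` amounts to a trailing rolling sum: each incident case is counted in the census for `duration` consecutive days, starting on the day it is reported. The toy example below is a sketch of that idea on made-up numbers; the helper name `incident_to_census_rolling` and the data are hypothetical, and it assumes one row per day with no gaps.

```python
import pandas as pd

def incident_to_census_rolling(data, duration):
    """Census on day t = sum of incident counts over days t-duration+1 .. t."""
    out = data.copy()
    out["census"] = out["value"].rolling(window=duration, min_periods=1).sum()
    return out

# Made-up incident counts for six consecutive days.
toy = pd.DataFrame({
    "date": pd.date_range("2020-03-01", periods=6, freq="D"),
    "value": [1, 2, 3, 4, 0, 0],
})

toy_census = incident_to_census_rolling(toy, duration=3)
print(toy_census[["date", "value", "census"]])
```

With a 3-day duration, the 4 cases reported on day four stay in the census through day six, which is why the census keeps a nonzero value after incident counts drop to zero.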

In [None]:
# Convert US case incident data to census data
# Take a copy so that adding columns does not write into a slice of raw_cases
infectious_period = 7
us_cases = raw_cases[raw_cases["location"] == "US"].copy()
us_cases["infectious_period"] = infectious_period
us_cases["census"] = 0
us_cases = us_cases.reset_index()

cases_census = incident_to_census(us_cases, infectious_period)

In [70]:
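# Select county-level rows for one state: 4-character location codes that start with the state FIPS code
# (here '1' = Alabama, with the leading zero dropped)
fips_code = '1'
raw_cases[(raw_cases["location"].astype(str).str.len() == 4) & (raw_cases["location"].astype(str).str[:1] == fips_code)]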

Unnamed: 0,date,location,location_name,value
0,2020-01-22,1001,Autauga County,0
1,2020-01-23,1001,Autauga County,0
2,2020-01-24,1001,Autauga County,0
3,2020-01-25,1001,Autauga County,0
4,2020-01-26,1001,Autauga County,0
...,...,...,...,...
76241,2023-02-28,1133,Winston County,0
76242,2023-03-01,1133,Winston County,15
76243,2023-03-02,1133,Winston County,0
76244,2023-03-03,1133,Winston County,0


In [65]:
type(raw_cases)

pandas.core.frame.DataFrame

In [15]:
# function to convert from incident to census

# function to get data from one state, given the fips code for that state


In [25]:
# State abbreviations in FIPS order, so each list position is the state's FIPS code.
# 'Skip' entries are placeholders for codes not used by any state or DC (3, 7, 14, 43, 52).
states = ['US', 'AL', 'AK', 'Skip', 'AZ', 'AR', 'CA', 'Skip 2', 'CO', 'CT', 'DE', 'DC', 'FL', 'GA', 'Skip 3',
          'HI', 'ID', 'IL', 'IN', 'IA', 'KS', 'KY', 'LA', 'ME', 'MD', 'MA', 'MI', 'MN', 'MS', 'MO', 'MT', 'NE',
          'NV', 'NH', 'NJ', 'NM', 'NY', 'NC', 'ND', 'OH', 'OK', 'OR', 'PA', 'Skip 4', 'RI', 'SC', 'SD', 'TN',
          'TX', 'UT', 'VT', 'VA', 'Skip 5', 'WA', 'WV', 'WI', 'WY']
fips_dict = {state: fips for fips, state in enumerate(states)}
fips_dict["US"] = "US"
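The integer codes in `fips_dict` may need to be turned into strings before they can be compared with a `location` column. The sketch below shows one way to do that. It assumes state-level locations are written as two-digit, zero-padded strings (for example `"01"` for Alabama) and that leading zeros may have been dropped when the file was parsed. The helper name `to_hub_location` and the small stand-in dictionary are hypothetical.

```python
import pandas as pd

# Small stand-in for fips_dict, using the same values the full dictionary produces.
example_fips = {"US": "US", "AL": 1, "LA": 22, "MI": 26}

def to_hub_location(code):
    """Format a FIPS code as a zero-padded string; 'US' is returned unchanged."""
    return str(code).zfill(2)

# A made-up location column with mixed types, as can happen after read_csv.
locations = pd.Series(["US", 1, 26, 1001, "22"])

# Normalize both sides to strings before comparing.
normalized = locations.astype(str).str.zfill(2)
is_michigan = normalized == to_hub_location(example_fips["MI"])
```

County codes such as `1001` are left at four or five characters by `zfill(2)`, so they cannot be mistaken for a two-character state code.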

In [17]:
my_list = [1, 'US']

In [19]:
my_list[1]

'US'

## Models:
1. You may consider any of the models you have seen in the starter kit, or in the 6-month hackathon and evaluation scenarios.

2. You may search for new models in the literature, or use TA2 model extension/transformation capabilities to modify models already in Terarium.