# Replicating results from "Modeling COVID-19 mortality in the US: Community context and mobility matter"

These analyses are modeled after the work in this [paper](https://www.medrxiv.org/content/10.1101/2020.06.18.20134122v1.full.pdf).

**Paper Data**:
- Chicago, Detroit, Los Angeles, New Orleans, New York (using the median of each statistic across its constituent counties), San Francisco, Seattle
- Sociodemographic data, economic indicators, demographics, documented risk factors, daily $PM_{2.5}$, comorbidity data
- Google mobility data at retail and recreation centers, as % change from baseline. Missing values imputed with a 5-day moving average
- Sociodemographic and comorbidity data reduced with PCA. The top 4 components (80% of variation explained) were used, each characterized by its most important features.
- Target: daily *cumulative* COVID deaths
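
The two preprocessing steps above (moving-average imputation and PCA reduction) can be sketched as follows. This is only an illustration under stated assumptions, not the paper's code: the centered window, the standardization before PCA, and the names `impute_moving_average` and `pca_top_components` are all hypothetical choices made here.

```python
import numpy as np
import pandas as pd

def impute_moving_average(series, window=5):
    """Fill missing values with a moving average of the observed neighbors.

    Assumption: a centered window; the paper only says "5 day moving average".
    """
    rolling_mean = series.rolling(window, min_periods=1, center=True).mean()
    return series.fillna(rolling_mean)

def pca_top_components(X, var_threshold=0.80):
    """Project standardized features onto the fewest PCs reaching var_threshold.

    Assumption: every column of X has non-zero variance.
    """
    X = np.asarray(X, dtype=float)
    Xs = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)   # standardize columns
    _, S, Vt = np.linalg.svd(Xs, full_matrices=False)   # rows of Vt are loadings
    explained = S**2 / np.sum(S**2)                     # variance ratio per PC
    k = int(np.searchsorted(np.cumsum(explained), var_threshold)) + 1
    k = min(k, len(S))
    return Xs @ Vt[:k].T, Vt[:k], explained[:k]

# Toy data standing in for mobility and county-level features.
mobility = pd.Series([1.0, 2.0, np.nan, 4.0, 5.0])
filled = impute_moving_average(mobility)

rng = np.random.default_rng(0)
features = rng.normal(size=(50, 6))
scores, loadings, explained = pca_top_components(features)
print(filled.tolist(), scores.shape, explained.round(3))
```

Inspecting the largest absolute entries in each row of `loadings` is one way to label a component by its most important features.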

**Paper Model**
- quasi-Poisson GAM
- 30 days of lagged mobility & sociodemographic PCs fitted
- log(population) offset (i.e. the model describes deaths per capita, while the response itself stays a count)
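
A short worked equation may help with the offset. Writing $N$ for the county population and $\eta_t$ for the sum of the intercept and smooth terms (both symbols are introduced only for this illustration), the offset enters the log link with a fixed coefficient of 1:

$$\log(E(Y_t)) = \log(N) + \eta_t \quad\Longleftrightarrow\quad \frac{E(Y_t)}{N} = e^{\eta_t}$$

So the covariates act on the expected per-capita death rate rather than on the raw count.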

$$Y_t \sim \mathrm{Poisson}(\mu_t)$$
$$E(Y_t) = \mu_t \quad\quad \mathrm{Var}(Y_t) = \psi \mu_t$$
$$\log(E(Y_t)) = \beta_0 + f(time_t) + f_{county}(time_t) + \sum_{i=1}^4 f(PC_i) + f(mobility_{t-0}, \dots, mobility_{t-30})$$
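
The dispersion parameter $\psi$ in the variance relation is what separates quasi-Poisson from plain Poisson. A minimal sketch of one common estimator, the Pearson statistic divided by the residual degrees of freedom, is shown below. The function name `pearson_dispersion` and the toy numbers are hypothetical, and this is not necessarily the estimator *mgcv* uses internally.

```python
import numpy as np

def pearson_dispersion(y, mu, n_params):
    """Estimate psi as sum((y - mu)^2 / mu) / (n - p).

    y: observed counts; mu: fitted means (all > 0); n_params: effective
    number of fitted parameters (assumed known here).
    """
    y = np.asarray(y, dtype=float)
    mu = np.asarray(mu, dtype=float)
    pearson_chi2 = np.sum((y - mu) ** 2 / mu)
    return pearson_chi2 / (y.size - n_params)

# Toy observed counts and fitted means, for illustration only.
y = np.array([0, 3, 1, 7, 4, 10])
mu = np.array([1.0, 2.0, 2.0, 5.0, 5.0, 8.0])
psi_hat = pearson_dispersion(y, mu, n_params=2)
print(f"estimated dispersion: {psi_hat:.3f}")
```

Values well above 1 indicate overdispersion relative to a Poisson model, which is the situation the quasi-Poisson variance is meant to absorb.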

- $f(time_t)$: common epidemic trajectory across counties (thin-plate regression spline)
- $f_{county}(time_t)$: county-specific epidemic trajectory (factor-smooth interaction)
- $f(PC_i)$: cubic regression spline with 5 knots
- $f(mobility_{t-0}, \dots, mobility_{t-30})$: tensor product of penalized splines over mobility from day $t$ back to day $t-30$
    - Smoothing parameters fit by restricted maximum likelihood (REML) with the R package *mgcv*
    - See the *dlnm* package in R
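
The lagged-mobility term takes a matrix whose columns are mobility at lags 0 through 30. A minimal sketch of building such a matrix with pandas is given below. The helper `build_lag_matrix` and the `lag_XX` column names are hypothetical, and the tensor-product smooth itself (fit in R with *mgcv* and *dlnm*) is not reproduced here.

```python
import pandas as pd

def build_lag_matrix(mobility, max_lag=30):
    """Return a DataFrame whose column lag_k holds mobility shifted back k days.

    Assumption: `mobility` is a daily Series sorted by date with no gaps.
    """
    lagged = {f"lag_{k:02d}": mobility.shift(k) for k in range(max_lag + 1)}
    return pd.concat(lagged, axis=1)

# Toy percent-change-from-baseline values for six consecutive days.
dates = pd.date_range("2020-03-01", periods=6, freq="D")
mobility = pd.Series([0.0, -5.0, -12.0, -20.0, -31.0, -35.0], index=dates)
lags = build_lag_matrix(mobility, max_lag=2)
print(lags)
```

The first `max_lag` rows contain missing values because no earlier observations exist, so those days would have to be dropped or the series extended before fitting.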

**Interpretation**
- Confidence intervals estimated with parametric bootstrapping (bias-corrected)
- Magnitude of the F-statistic for each individual predictor used as a proxy for its explanatory power
- Partial dependence plots
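
To make the bootstrap step concrete, here is a minimal sketch of a bias-corrected parametric bootstrap interval. It uses a plain Poisson mean as a stand-in for the GAM, which is an assumption made purely for illustration. The function name `bc_parametric_bootstrap_ci` and its defaults are hypothetical.

```python
import numpy as np
from statistics import NormalDist

def bc_parametric_bootstrap_ci(y, n_boot=2000, alpha=0.05, seed=0):
    """Bias-corrected percentile CI for a Poisson mean via parametric bootstrap."""
    rng = np.random.default_rng(seed)
    y = np.asarray(y)
    mu_hat = y.mean()  # fitted parameter of the stand-in model
    # Parametric step: simulate datasets from the fitted model and re-estimate.
    boot = rng.poisson(mu_hat, size=(n_boot, y.size)).mean(axis=1)
    # Bias correction: z0 measures how off-center the bootstrap estimates are.
    p = np.clip(np.mean(boot < mu_hat), 1 / n_boot, 1 - 1 / n_boot)
    nd = NormalDist()
    z0 = nd.inv_cdf(float(p))
    lo_q = nd.cdf(2 * z0 + nd.inv_cdf(alpha / 2))
    hi_q = nd.cdf(2 * z0 + nd.inv_cdf(1 - alpha / 2))
    lo, hi = np.quantile(boot, [lo_q, hi_q])
    return mu_hat, lo, hi

# Toy daily counts, simulated only to exercise the function.
counts = np.random.default_rng(1).poisson(5, size=200)
mu_hat, ci_lo, ci_hi = bc_parametric_bootstrap_ci(counts)
print(f"mean={mu_hat:.2f}, 95% CI=({ci_lo:.2f}, {ci_hi:.2f})")
```

For the actual model, the simulation step would draw responses from the fitted GAM and refit it on each replicate, which is far more expensive than this toy case.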

In [3]:
import numpy as np
import pandas as pd
from pathlib import Path
from datetime import datetime
from tqdm import tqdm
import pickle
from matplotlib import dates
import matplotlib.pyplot as plt
import seaborn as sns
import sys; sys.path.append('../')
from src.data_loader.data_loader import *
from src.utils.dates import get_today, lag_date, date2str, str2date, get_format
from src.utils.df_utils import get_date_columns
from src.pandas.align import align_lagged_dates

from scipy.stats import spearmanr

%matplotlib inline
%load_ext autoreload
%autoreload 2

In [4]:
df = get_time_series_dataframe(
    n_days=30,
    mobility_lag=30,
    county_features=[
        'FIPS',
        'Rural-urban_Continuum Code_2013',
        'Density per square mile of land area - Population',
        'Percent of adults with less than a high school diploma 2014-18',
        'PCTPOVALL_2018',
        'Unemployment_rate_2018',
        'Total_age65plus',
        'POP_ESTIMATE_2018',
        'MEDHHINC_2018',
        "Percent of adults with a bachelor's degree or higher 2014-18",
        "Percent of adults with a high school diploma only 2014-18",
        "Total households!!Average household size",
        "HospCt",
        "Beds"
    ]
)

KeyboardInterrupt: 

In [167]:
df.columns

Index(['FIPS', 'mobility_01', 'mobility_02', 'mobility_03', 'mobility_04',
       'mobility_05', 'mobility_06', 'mobility_07', 'mobility_08',
       'mobility_09', 'mobility_10', 'mobility_11', 'mobility_12',
       'mobility_13', 'mobility_14', 'mobility_15', 'mobility_16',
       'mobility_17', 'mobility_18', 'mobility_19', 'mobility_20',
       'mobility_21', 'mobility_22', 'mobility_23', 'mobility_24',
       'mobility_25', 'mobility_26', 'mobility_27', 'mobility_28',
       'mobility_29', 'mobility_30', 'mobility_31', 'mobility_32',
       'mobility_33', 'mobility_34', 'mobility_35', 'mobility_36',
       'mobility_37', 'mobility_38', 'mobility_39', 'mobility_40',
       'mobility_41', 'mobility_42', 'mobility_43', 'mobility_44',
       'mobility_45', 'mobility_46', 'mobility_47', 'mobility_48',
       'mobility_49', 'mobility_50', 'mobility_51', 'mobility_52',
       'mobility_53', 'mobility_54', 'mobility_55', 'mobility_56',
       'mobility_57', 'mobility_58', 'mobility_59', 'm

In [20]:
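def _get_onset_date(row, thresh):
    """Return the first index label at which `row` reaches `thresh`, or NaN if it never does."""
    above = row[row >= thresh]
    if len(above) == 0:
        return np.nan
    # Assumes the index is in date order. idxmin() gives the same label only
    # when the values never decrease (e.g. uncorrected cumulative counts).
    return above.index[0]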