# About

This notebook runs XGBoost to produce Stage 1 and Stage 2 results for a specified instrument, lag/lead, health outcome, and fixed effects combination.

XGBoost models were hyperparameter tuned using 5 fold time series cross validation (expanding window). 
- The best parameters were used to train a full model on the 2002-2017 datasets. 
- Held out performance is best assessed from the CV split, since we don't have access to 2018 data for PM2.5 or health.

The 2nd stage regression predicts the medical outcomes using the predicted PM2.5, as well as the same fixed effects from the first stage regression.

The 2nd stage model is used to make counterfactual predicted health outcomes if we reduced air pollution (predicted PM2.5)  by 1%, 10%, and 25%.

---
Running this notebook requires the following file structure changes:
- Just as in the OLS notebooks, please make sure the modeling data is in `'C:/Users/cilin/Research/CA_hospitals_capstone/data/'` and the medical data is in `'C:/Users/cilin/Research/CA_hospitals_capstone/output/'`. Please make sure these are named the same as when you ran the OLS notebooks.
- Add a new directory or make sure this one exists: `out_dir_xgb = 'C:/Users/cilin/Research/CA_hospitals_capstone/xgb/'`

Please also ensure that you have the `xgboost` package in your Python interpreter. If using conda, you can install xgboost
- This notebook uses the sklearn api for xgboost: `xgb.XGBRegressor()`
- https://github.com/dmlc/xgboost
- Installation Page: https://xgboost.readthedocs.io/en/stable/install.html

In [1]:
# optional. I'm getting annoying warnings that I just want to ignore:
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

# basics
import pandas as pd 
import numpy as np
import os 
import re
from datetime import datetime
from tqdm.notebook import tqdm
tqdm.pandas()
import requests
import urllib
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder
from sklearn.model_selection import TimeSeriesSplit
import patsy

# plotting
import matplotlib.pyplot as plt
from matplotlib.dates import DateFormatter
import plotly.express as px
import seaborn as sns

# modeling
from patsy import dmatrices
import statsmodels.api as sm
from sklearn.linear_model import LinearRegression
from statsmodels.sandbox.regression.gmm import IV2SLS
import xgboost as xgb
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

pd.set_option('display.max_columns', None)
pd.options.mode.chained_assignment = None

Combinations of Instruments and Fixed Effects here:
https://docs.google.com/spreadsheets/d/1_MMYeQuxiov2OLE5AX0CE9R1T1mBrk7vpozy2fGjNBg/edit#gid=0 

Instruments:

- `Izmy_v3_normed_D_and_TPY`
- `Izmy_v4_nodist_normed_TPY`
- `Izmy_v5_all_normed_but_wspd_ratio`
- `Izmy_v7_all_normed_no_wspd`


We will use lead and lag of 9 months and 3 months respectively.

For fixed effects, we choose from a set of 9 possible combinations outlined here:

- https://docs.google.com/spreadsheets/d/1_MMYeQuxiov2OLE5AX0CE9R1T1mBrk7vpozy2fGjNBg/edit#gid=0

In [2]:
# set instrumental variable version
#predictor = 'Izmy_v3_normed_D_and_TPY'
predictor = 'Izmy_v4_nodist_normed_TPY'
#predictor = 'Izmy_v5_all_normed_but_wspd_ratio'
#predictor = 'Izmy_v7_all_normed_no_wspd'

# set FE to one of 8 sets (int)
FE_set_num = 7

# sets unique notebooks index (string - this is just to output a csv file name)
notebook_index = "diff16_no_outliers_fn_T"

# set option to filter outliers on medical outcomes data
filter_medical_outliers = True

# set medical target
medical_target = 'y_visits_all_malignant_cancers_fwd3_diff_r12' # select the name just as it is in 2sls 2nd stage output notebook
#medical_target = 'y_visits_blood_vessel_diseases_fwd3_diff_r12' 

In [3]:
lead_time = '9'
lag_time = '3'
lag_style = 'fwd'

# define lead time for IV: 'last_month', 'r6', 'r9', 'r12'
IV_lead = "r" + str(lead_time)
HO_lag = lag_style + str(lag_time)

if IV_lead:
    IV_lead_input = "_" + IV_lead 
else:
    # don't add underscore if empty string
    IV_lead_input = IV_lead

# define lag time for Health Outcome: '', 'fwd3', 'cent3', 'fwd6', 'cent6', 'fwd12', 'cent12'
if HO_lag:
    HO_lag_input = "_" + HO_lag 
else:
    # don't add underscore if empty string
    HO_lag_input = HO_lag

# IV options: 1 month, 6 months, 9 months, 12 months
IV_window_col = [f'pm25{IV_lead_input}']

# health outcome options (fwd or cent): 1 month, 3 months, 6 months, 12 months
health_outcome_window_col = [f'y_injuries{HO_lag_input}']

filter_cols = IV_window_col + health_outcome_window_col # columns to filter out at the beginning and end of df, before modeling

target_name_s1 = f'pm25{IV_lead_input}'
predictor_name_s1 = f'{predictor}{IV_lead_input}'

print(f"Stage 1\nTarget Name (target_name_s1) = {target_name_s1}\nPredictor Name (predictor_name_s1) = {predictor_name_s1}")

print(f"\nStage 2\nHealth Outcome Lag Input (HO_lag_input) = {HO_lag_input}")
print(f"Medical Health Outcome Target: {medical_target}")
print(f"Predictor Name: {target_name_s1 + '_hat'}")


Stage 1
Target Name (target_name_s1) = pm25_r9
Predictor Name (predictor_name_s1) = Izmy_v4_nodist_normed_TPY_r9

Stage 2
Health Outcome Lag Input (HO_lag_input) = _fwd3
Medical Health Outcome Target: y_visits_all_malignant_cancers_fwd3_diff_r12
Predictor Name: pm25_r9_hat


# Set Path

Add a new elif section for your path if you want

In [4]:
# local or gdrive
path_source = 'local_cornelia'

if path_source == 'gdrive':
  from google.colab import drive
  drive.mount('/content/gdrive')
  data_path = '/content/gdrive/MyDrive/Classes/W210_capstone/W210_Capstone/Data'
  fitted_models_path = '/content/gdrive/MyDrive/Classes/W210_capstone/W210_Capstone/fitted_models/2022-10-23'
  
elif path_source == 'local':
  data_path = '/Users/tj/trevorj@berkeley.edu - Google Drive/My Drive/Classes/W210_capstone/W210_Capstone/Data'
  fitted_models_path = '/Users/tj/trevorj@berkeley.edu - Google Drive/My Drive/Classes/W210_capstone/W210_Capstone/fitted_models/2022-10-23'

elif path_source == 'local_anand':

  # in Anand's computer
  in_dir_sc = 'C:/Users/anandadmin/MIDS/210 Capstone/data/'
  in_dir_h = 'C:/Users/anandadmin/MIDS/210 Capstone/output/'
  out_dir1 = 'C:/Users/anandadmin/MIDS/210 Capstone/models_s1/'
  out_dir2 = 'C:/Users/anandadmin/MIDS/210 Capstone/models_s2/'
  # folder containing csvs documenting which fixed effects are in which csv files
  out_dir3 = 'C:/Users/anandadmin/MIDS/210 Capstone/fixed_effects/'  
  # folder to store XGB model and output csv files
  out_dir_xgb = 'G:\\.shortcut-targets-by-id\\11wLy1WKwOTcthBs1rpfEzkqax2BZG-6E\W210_Capstone\\fitted_models\\2022-11-19\\XGB'

elif path_source == 'local_cornelia':
  in_dir_sc = 'C:/Users/cilin/Research/CA_hospitals_capstone/data/'
  in_dir_h = 'C:/Users/cilin/Research/CA_hospitals_capstone/output/'
  # folder containing stage 1 outputs
  out_dir1 = 'C:/Users/cilin/Research/CA_hospitals_capstone/models_s1/'
  # folder containing stage 2 outputs
  out_dir2 = 'C:/Users/cilin/Research/CA_hospitals_capstone/models_s2/'
  # folder containing csvs documenting which fixed effects are in which csv files
  out_dir3 = 'C:/Users/cilin/Research/CA_hospitals_capstone/fixed_effects/'
  # folder to store XGB model and output csv files
  out_dir_xgb = 'C:/Users/cilin/Research/CA_hospitals_capstone/xgb/'

elif path_source == 'msl':
  in_dir_sc = 'C:/Users/matts/Documents/Berkeley MIDS/DataSci 210 Capstone/non-push files/data/'
  in_dir_h = in_dir_sc + 'output/'
  # folder containing stage 1 outputs
  out_dir1 = in_dir_sc + 'models_s1/'
  # folder containing stage 2 outputs
  out_dir2 = in_dir_sc + 'models_s2/'
  # folder containing csvs documenting which fixed effects are in which csv files
  out_dir3 = in_dir_sc + 'fixed_effects/'
  # folder to store XGB model and output csv files
  out_dir_xgb = in_dir_sc + 'xgb/'



elif path_source == 'work':
  data_path = '/Users/trevorjohnson/trevorj@berkeley.edu - Google Drive/My Drive/Classes/W210_capstone/W210_Capstone/Data'
  fitted_models_path = '/Users/trevorjohnson/trevorj@berkeley.edu - Google Drive/My Drive/Classes/W210_capstone/W210_Capstone/fitted_models/2022-10-23'

In [5]:
# non-medical data
for file in os.listdir(in_dir_sc):
    if file.startswith('modeling'):
        # read in our modeling data
        df = pd.read_csv(os.path.join(in_dir_sc, file))

# add key to df
df['patzip_year_month'] = df.school_zip.astype(str) + '-' + df.year.astype(str) + '-' + df.month.astype(str)

display(df.head(1))

Unnamed: 0.2,Unnamed: 0.1,Unnamed: 0,year_month,school_zip,school_county_v2,school_region_name,pm25,school_elevation_m,ps_elevation_m,population_0_4,population_0_4_male,population_0_4_female,population_5_9,population_5_9_male,population_5_9_female,population_10_14,population_10_14_male,population_10_14_female,population_15_19,population_15_19_male,population_15_19_female,total_pop_under19,pop_under19_male,pop_under19_female,total_population,total_population_male,total_population_female,point_source_pm25_tpy,dist_school_to_ps_m,angle_to_school,ps_wspd_merge,school_wdir_wrt_0n,ps_wdir_wrt_0n,school_wind_alignment,ps_wind_alignment,avg_wind_alignment,avg_wind_alignment_cosine,nearby_point_source_count,school_wspd,ca_agi_per_returns,total_tax_liability,tax_liability_per_capita,school_temperature,ps_temperature,school_count,pm25_last_month,pm25_r6,pm25_r9,pm25_r12,pm25_r24,pm25_slope6,pm25_slope9,pm25_slope12,pm25_slope24,pm25_lag_12mo,year,month,school_county_v2_alameda,school_county_v2_alpine,school_county_v2_amador,school_county_v2_butte,school_county_v2_calaveras,school_county_v2_colusa,school_county_v2_contra_costa,school_county_v2_del_norte,school_county_v2_el_dorado,school_county_v2_fresno,school_county_v2_glenn,school_county_v2_humboldt,school_county_v2_imperial,school_county_v2_inyo,school_county_v2_kern,school_county_v2_kings,school_county_v2_lake,school_county_v2_lassen,school_county_v2_los_angeles,school_county_v2_madera,school_county_v2_marin,school_county_v2_mariposa,school_county_v2_mendocino,school_county_v2_merced,school_county_v2_modoc,school_county_v2_mono,school_county_v2_monterey,school_county_v2_napa,school_county_v2_nevada,school_county_v2_orange,school_county_v2_placer,school_county_v2_plumas,school_county_v2_riverside,school_county_v2_sacramento,school_county_v2_san_benito,school_county_v2_san_bernardino,school_county_v2_san_diego,school_county_v2_san_francisco,school_county_v2_san_joaquin,school_county_v2_san_luis_obispo,school_county_v2_san_mateo,school_county_v2_santa_barbara,school_county_v2_santa_clara,school_county_v2_santa_cruz,school_county_v2_shasta,school_county_v2_sierra,school_county_v2_siskiyou,school_county_v2_solano,school_county_v2_sonoma,school_county_v2_stanislaus,school_county_v2_sutter,school_county_v2_tehama,school_county_v2_trinity,school_county_v2_tulare,school_county_v2_tuolumne,school_county_v2_ventura,school_county_v2_yolo,school_county_v2_yuba,month_01,month_02,month_03,month_04,month_05,month_06,month_07,month_08,month_09,month_10,month_11,month_12,y-m,central_wind_alignment_180_high,avg_count_ps_within_5km,avg_elevation_diff_m,avg_wspd_ratio_ps_sch,avg_wspd_ratio_sch_ps,avg_school_wspd,avg_ps_wspd,new_alignment_90_high,ps_pm25_tpy_top_20,school_to_ps_geod_dist_m_top_20,avg_wspd_top_15,Izmy_v1_unnormed,Izmy_v2_nodist_unnormed,Izmy_v3_normed_D_and_TPY,Izmy_v4_nodist_normed_TPY,Izmy_v5_all_normed_but_wspd_ratio,Izmy_v6_unnormed_no_wspd,Izmy_v7_all_normed_no_wspd,Izmy_v8_normed_D_and_TPY_no_wspd,avg_temp,diff_temp_s_ps,patzip_year_month
0,0,0,2000-01-01,90001,Los Angeles,Los Angeles County,32.149998,44.728889,43.703333,6196.0,3209.0,2987.0,6672.0,3397.0,3275.0,5562.0,2850.0,2712.0,5075.0,2599.0,2476.0,23505.0,12055.0,11450.0,54481.0,27320.0,27161.0,14.241154,3854.812685,-90.196586,0.757031,-172.758321,-172.758321,82.561735,82.561735,82.561735,1.124995,0.0,0.757031,20049.704556,2608176.0,47.87313,14.277778,14.266667,9,,,,,,,,,,,2000,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,2000-01,,,,,,,,,,,,,,,,,,,,14.272222,0.011111,90001-2000-1


In [6]:
df = df.drop(columns = ['Unnamed: 0', 'Unnamed: 0.1'])


# Data Clean

Non Med Data

In [7]:
def roll_selected_cols(df, cols_to_roll:list = ['Izmy_v1_unnormed'\
    ,'Izmy_v2_nodist_unnormed' \
    ,'Izmy_v3_normed_D_and_TPY' \
    ,'Izmy_v4_nodist_normed_TPY' \
    ,'Izmy_v5_all_normed'
    ,'Izmy_v6_unnormed_no_wspd'
    ,'Izmy_v7_all_normed_no_wspd'
    ,'Izmy_v8_normed_D_and_TPY_no_wspd'
    ,'new_alignment_90_high'
    ,'avg_temp']
    ,rolling_periods:list = [1, 6, 9, 12]):

    """Generates rolling averages for the input variables over the input time periods.
    Inputs: df (pd dataframe): contains the data on a y-m level
            cols_to_roll (list): list of columns to generate rolling avgs--must be in df
            rolling_periods (list): list of time windows (in months) to roll over
            
    Outputs: df: Pandas dataframe containing the new columns
             all_cols: list of list containing the new columns, separated by input type"""
    
    df_int = df.copy().sort_values(['school_zip', 'year_month']).reset_index(drop=True)
    
    all_cols_int = []

    # Roll each variable
    for col_index in range(len(cols_to_roll)):
        new_cols = []

        col_to_roll = cols_to_roll[col_index]
        rolling_periods = [1, 6, 9, 12]

        for period in rolling_periods:
            df_int[f'{col_to_roll}_r{period}'] = df_int.groupby('school_zip')[col_to_roll]\
                .apply(lambda x: x.rolling(window=period, min_periods=period, closed='left').mean())
            
            new_cols.append(col_to_roll + "_r" + str(period))

        all_cols_int.append([col_to_roll] + new_cols)
        
    return df_int, all_cols_int


cols_to_roll = [predictor
    ,'avg_wspd_top_15'
    ,'avg_temp'
    ,'diff_temp_s_ps']

rolling_periods = [int(lead_time)]

df, all_cols = roll_selected_cols(df=df, cols_to_roll=cols_to_roll, rolling_periods=rolling_periods)

# rename the last month column just to be consistent and safe
df.rename(columns={'pm25_last_month': 'pm25_r1'}, inplace=True)

# drop if year >=2018
df = df[df.year.le(2017)]

# print shape of data
print('Shape of our schools modeling data ', df.shape)
df.head(2)

Shape of our schools modeling data  (294897, 164)


Unnamed: 0,year_month,school_zip,school_county_v2,school_region_name,pm25,school_elevation_m,ps_elevation_m,population_0_4,population_0_4_male,population_0_4_female,population_5_9,population_5_9_male,population_5_9_female,population_10_14,population_10_14_male,population_10_14_female,population_15_19,population_15_19_male,population_15_19_female,total_pop_under19,pop_under19_male,pop_under19_female,total_population,total_population_male,total_population_female,point_source_pm25_tpy,dist_school_to_ps_m,angle_to_school,ps_wspd_merge,school_wdir_wrt_0n,ps_wdir_wrt_0n,school_wind_alignment,ps_wind_alignment,avg_wind_alignment,avg_wind_alignment_cosine,nearby_point_source_count,school_wspd,ca_agi_per_returns,total_tax_liability,tax_liability_per_capita,school_temperature,ps_temperature,school_count,pm25_r1,pm25_r6,pm25_r9,pm25_r12,pm25_r24,pm25_slope6,pm25_slope9,pm25_slope12,pm25_slope24,pm25_lag_12mo,year,month,school_county_v2_alameda,school_county_v2_alpine,school_county_v2_amador,school_county_v2_butte,school_county_v2_calaveras,school_county_v2_colusa,school_county_v2_contra_costa,school_county_v2_del_norte,school_county_v2_el_dorado,school_county_v2_fresno,school_county_v2_glenn,school_county_v2_humboldt,school_county_v2_imperial,school_county_v2_inyo,school_county_v2_kern,school_county_v2_kings,school_county_v2_lake,school_county_v2_lassen,school_county_v2_los_angeles,school_county_v2_madera,school_county_v2_marin,school_county_v2_mariposa,school_county_v2_mendocino,school_county_v2_merced,school_county_v2_modoc,school_county_v2_mono,school_county_v2_monterey,school_county_v2_napa,school_county_v2_nevada,school_county_v2_orange,school_county_v2_placer,school_county_v2_plumas,school_county_v2_riverside,school_county_v2_sacramento,school_county_v2_san_benito,school_county_v2_san_bernardino,school_county_v2_san_diego,school_county_v2_san_francisco,school_county_v2_san_joaquin,school_county_v2_san_luis_obispo,school_county_v2_san_mateo,school_county_v2_santa_barbara,school_county_v2_santa_clara,school_county_v2_santa_cruz,school_county_v2_shasta,school_county_v2_sierra,school_county_v2_siskiyou,school_county_v2_solano,school_county_v2_sonoma,school_county_v2_stanislaus,school_county_v2_sutter,school_county_v2_tehama,school_county_v2_trinity,school_county_v2_tulare,school_county_v2_tuolumne,school_county_v2_ventura,school_county_v2_yolo,school_county_v2_yuba,month_01,month_02,month_03,month_04,month_05,month_06,month_07,month_08,month_09,month_10,month_11,month_12,y-m,central_wind_alignment_180_high,avg_count_ps_within_5km,avg_elevation_diff_m,avg_wspd_ratio_ps_sch,avg_wspd_ratio_sch_ps,avg_school_wspd,avg_ps_wspd,new_alignment_90_high,ps_pm25_tpy_top_20,school_to_ps_geod_dist_m_top_20,avg_wspd_top_15,Izmy_v1_unnormed,Izmy_v2_nodist_unnormed,Izmy_v3_normed_D_and_TPY,Izmy_v4_nodist_normed_TPY,Izmy_v5_all_normed_but_wspd_ratio,Izmy_v6_unnormed_no_wspd,Izmy_v7_all_normed_no_wspd,Izmy_v8_normed_D_and_TPY_no_wspd,avg_temp,diff_temp_s_ps,patzip_year_month,Izmy_v4_nodist_normed_TPY_r1,Izmy_v4_nodist_normed_TPY_r6,Izmy_v4_nodist_normed_TPY_r9,Izmy_v4_nodist_normed_TPY_r12,avg_wspd_top_15_r1,avg_wspd_top_15_r6,avg_wspd_top_15_r9,avg_wspd_top_15_r12,avg_temp_r1,avg_temp_r6,avg_temp_r9,avg_temp_r12,diff_temp_s_ps_r1,diff_temp_s_ps_r6,diff_temp_s_ps_r9,diff_temp_s_ps_r12
0,2000-01-01,90001,Los Angeles,Los Angeles County,32.149998,44.728889,43.703333,6196.0,3209.0,2987.0,6672.0,3397.0,3275.0,5562.0,2850.0,2712.0,5075.0,2599.0,2476.0,23505.0,12055.0,11450.0,54481.0,27320.0,27161.0,14.241154,3854.812685,-90.196586,0.757031,-172.758321,-172.758321,82.561735,82.561735,82.561735,1.124995,0.0,0.757031,20049.704556,2608176.0,47.87313,14.277778,14.266667,9,,,,,,,,,,,2000,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,2000-01,,,,,,,,,,,,,,,,,,,,14.272222,0.011111,90001-2000-1,,,,,,,,,,,,,,,,
1,2000-02-01,90001,Los Angeles,Los Angeles County,13.666667,44.728889,43.703333,6196.0,3209.0,2987.0,6672.0,3397.0,3275.0,5562.0,2850.0,2712.0,5075.0,2599.0,2476.0,23505.0,12055.0,11450.0,54481.0,27320.0,27161.0,14.241154,3854.812685,-90.196586,0.965276,30.294778,30.294778,120.491364,120.491364,120.491364,0.547186,0.0,0.965276,20049.704556,2608176.0,47.87313,13.877778,13.866667,9,32.149998,,,,,,,,,,2000,2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,2000-02,,,,,,,,,,,,,,,,,,,,13.872222,0.011111,90001-2000-2,,,,,,,,,14.272222,,,,0.011111,,,


Med data

In [8]:
# medical data
df_med = pd.DataFrame()
df_med['patzip_year_month'] = df.patzip_year_month.unique()

for file in os.listdir(in_dir_h):
    # read in cornelia's healthcare data
    temp = pd.read_csv(os.path.join(in_dir_h, file)).iloc[:,1:]
    # rename number_of_visits column
    temp.rename(
        columns={'number_of_visits':'visits_'+file.split('.')[0]},
        inplace=True
    )

    # merge to df_med
    df_med = df_med.merge(
        temp[['patzip_year_month', 'visits_'+file.split('.')[0]]],
        on='patzip_year_month',
        how='left'
    )

# Filter out 2018 data bc it's all nulls 
# if year > 2017, drop
df_med['year_h'] = df_med.patzip_year_month.str.split('-').str[1]
df_med = df_med[df_med.year_h.le('2017')]
df_med.drop(columns='year_h', inplace=True)

# print shape of data
print('Shape of medical data ', df_med.shape)
df_med[~df_med.iloc[:, -1].isna()].head(2)

Shape of medical data  (294897, 11)


Unnamed: 0,patzip_year_month,visits_all_malignant_cancers,visits_all_nonblood_malignant_cancers,visits_blood_diseases,visits_blood_or_bv_diseases,visits_blood_vessel_diseases,visits_cardioresp_cancers,visits_hematopoietic_cancers,visits_injuries,visits_respiratory,visits_type_1_diabetes
36008,91214-2008-12,,,,,,,,12.0,9.0,1.0
36070,91214-2014-2,,,1.0,1.0,,,,5.0,10.0,1.0


In [9]:
df_med.columns

Index(['patzip_year_month', 'visits_all_malignant_cancers',
       'visits_all_nonblood_malignant_cancers', 'visits_blood_diseases',
       'visits_blood_or_bv_diseases', 'visits_blood_vessel_diseases',
       'visits_cardioresp_cancers', 'visits_hematopoietic_cancers',
       'visits_injuries', 'visits_respiratory', 'visits_type_1_diabetes'],
      dtype='object')

Dynamically add 'y_' and lag times to the column names

In [10]:
# Get the list of outcome columns from the merged health dataset (df_med above)
num_visits_col_names = [i for i in list(df_med.columns) if 'visit' in i]
print("num_visits_col_names:\n{}\n".format(num_visits_col_names))

# y_col_names is used in Section 5.4.
y_col_names = ['y_' + i for i in num_visits_col_names]
print("y_col_names:\n{}\n".format(y_col_names))

# create a list of columns for outcome variables with health outcome lag window added
# y_col_names_lag is used in Step 7
y_col_names_lag = []
for i in y_col_names:
    new_name = i + HO_lag_input
    y_col_names_lag.append(new_name)

print("y_col_names_lag:\n{}".format(y_col_names_lag))

# create a list of columns for outcome variables with health outcome lag window added
# y_col_names_lag is used in Step 7
y_col_names_lag_diff = []
for i in y_col_names_lag:
    new_name = i + '_diff_r12'
    y_col_names_lag_diff.append(new_name)

print("\ny_col_names_lag_diff:\n{}".format(y_col_names_lag_diff))

num_visits_col_names:
['visits_all_malignant_cancers', 'visits_all_nonblood_malignant_cancers', 'visits_blood_diseases', 'visits_blood_or_bv_diseases', 'visits_blood_vessel_diseases', 'visits_cardioresp_cancers', 'visits_hematopoietic_cancers', 'visits_injuries', 'visits_respiratory', 'visits_type_1_diabetes']

y_col_names:
['y_visits_all_malignant_cancers', 'y_visits_all_nonblood_malignant_cancers', 'y_visits_blood_diseases', 'y_visits_blood_or_bv_diseases', 'y_visits_blood_vessel_diseases', 'y_visits_cardioresp_cancers', 'y_visits_hematopoietic_cancers', 'y_visits_injuries', 'y_visits_respiratory', 'y_visits_type_1_diabetes']

y_col_names_lag:
['y_visits_all_malignant_cancers_fwd3', 'y_visits_all_nonblood_malignant_cancers_fwd3', 'y_visits_blood_diseases_fwd3', 'y_visits_blood_or_bv_diseases_fwd3', 'y_visits_blood_vessel_diseases_fwd3', 'y_visits_cardioresp_cancers_fwd3', 'y_visits_hematopoietic_cancers_fwd3', 'y_visits_injuries_fwd3', 'y_visits_respiratory_fwd3', 'y_visits_type_1_di

Merge non-medical and medical datasets

In [11]:
if isinstance(df.year_month[0], str):
  # if year month is still a string, convert it to datetime
  # don't try if already converted
    df['year_month'] = df['year_month'].map(lambda x: datetime.strptime(x, '%Y-%m-%d'))

# merge df_h to df_sc
df = df.merge(
    df_med,
    on='patzip_year_month',
    how='left'
)


# print shape of data
print('Shape of data ', df.shape)
df.head(2)

Shape of data  (294897, 174)


Unnamed: 0,year_month,school_zip,school_county_v2,school_region_name,pm25,school_elevation_m,ps_elevation_m,population_0_4,population_0_4_male,population_0_4_female,population_5_9,population_5_9_male,population_5_9_female,population_10_14,population_10_14_male,population_10_14_female,population_15_19,population_15_19_male,population_15_19_female,total_pop_under19,pop_under19_male,pop_under19_female,total_population,total_population_male,total_population_female,point_source_pm25_tpy,dist_school_to_ps_m,angle_to_school,ps_wspd_merge,school_wdir_wrt_0n,ps_wdir_wrt_0n,school_wind_alignment,ps_wind_alignment,avg_wind_alignment,avg_wind_alignment_cosine,nearby_point_source_count,school_wspd,ca_agi_per_returns,total_tax_liability,tax_liability_per_capita,school_temperature,ps_temperature,school_count,pm25_r1,pm25_r6,pm25_r9,pm25_r12,pm25_r24,pm25_slope6,pm25_slope9,pm25_slope12,pm25_slope24,pm25_lag_12mo,year,month,school_county_v2_alameda,school_county_v2_alpine,school_county_v2_amador,school_county_v2_butte,school_county_v2_calaveras,school_county_v2_colusa,school_county_v2_contra_costa,school_county_v2_del_norte,school_county_v2_el_dorado,school_county_v2_fresno,school_county_v2_glenn,school_county_v2_humboldt,school_county_v2_imperial,school_county_v2_inyo,school_county_v2_kern,school_county_v2_kings,school_county_v2_lake,school_county_v2_lassen,school_county_v2_los_angeles,school_county_v2_madera,school_county_v2_marin,school_county_v2_mariposa,school_county_v2_mendocino,school_county_v2_merced,school_county_v2_modoc,school_county_v2_mono,school_county_v2_monterey,school_county_v2_napa,school_county_v2_nevada,school_county_v2_orange,school_county_v2_placer,school_county_v2_plumas,school_county_v2_riverside,school_county_v2_sacramento,school_county_v2_san_benito,school_county_v2_san_bernardino,school_county_v2_san_diego,school_county_v2_san_francisco,school_county_v2_san_joaquin,school_county_v2_san_luis_obispo,school_county_v2_san_mateo,school_county_v2_santa_barbara,school_county_v2_santa_clara,school_county_v2_santa_cruz,school_county_v2_shasta,school_county_v2_sierra,school_county_v2_siskiyou,school_county_v2_solano,school_county_v2_sonoma,school_county_v2_stanislaus,school_county_v2_sutter,school_county_v2_tehama,school_county_v2_trinity,school_county_v2_tulare,school_county_v2_tuolumne,school_county_v2_ventura,school_county_v2_yolo,school_county_v2_yuba,month_01,month_02,month_03,month_04,month_05,month_06,month_07,month_08,month_09,month_10,month_11,month_12,y-m,central_wind_alignment_180_high,avg_count_ps_within_5km,avg_elevation_diff_m,avg_wspd_ratio_ps_sch,avg_wspd_ratio_sch_ps,avg_school_wspd,avg_ps_wspd,new_alignment_90_high,ps_pm25_tpy_top_20,school_to_ps_geod_dist_m_top_20,avg_wspd_top_15,Izmy_v1_unnormed,Izmy_v2_nodist_unnormed,Izmy_v3_normed_D_and_TPY,Izmy_v4_nodist_normed_TPY,Izmy_v5_all_normed_but_wspd_ratio,Izmy_v6_unnormed_no_wspd,Izmy_v7_all_normed_no_wspd,Izmy_v8_normed_D_and_TPY_no_wspd,avg_temp,diff_temp_s_ps,patzip_year_month,Izmy_v4_nodist_normed_TPY_r1,Izmy_v4_nodist_normed_TPY_r6,Izmy_v4_nodist_normed_TPY_r9,Izmy_v4_nodist_normed_TPY_r12,avg_wspd_top_15_r1,avg_wspd_top_15_r6,avg_wspd_top_15_r9,avg_wspd_top_15_r12,avg_temp_r1,avg_temp_r6,avg_temp_r9,avg_temp_r12,diff_temp_s_ps_r1,diff_temp_s_ps_r6,diff_temp_s_ps_r9,diff_temp_s_ps_r12,visits_all_malignant_cancers,visits_all_nonblood_malignant_cancers,visits_blood_diseases,visits_blood_or_bv_diseases,visits_blood_vessel_diseases,visits_cardioresp_cancers,visits_hematopoietic_cancers,visits_injuries,visits_respiratory,visits_type_1_diabetes
0,2000-01-01,90001,Los Angeles,Los Angeles County,32.149998,44.728889,43.703333,6196.0,3209.0,2987.0,6672.0,3397.0,3275.0,5562.0,2850.0,2712.0,5075.0,2599.0,2476.0,23505.0,12055.0,11450.0,54481.0,27320.0,27161.0,14.241154,3854.812685,-90.196586,0.757031,-172.758321,-172.758321,82.561735,82.561735,82.561735,1.124995,0.0,0.757031,20049.704556,2608176.0,47.87313,14.277778,14.266667,9,,,,,,,,,,,2000,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,2000-01,,,,,,,,,,,,,,,,,,,,14.272222,0.011111,90001-2000-1,,,,,,,,,,,,,,,,,,,,,,,,,,
1,2000-02-01,90001,Los Angeles,Los Angeles County,13.666667,44.728889,43.703333,6196.0,3209.0,2987.0,6672.0,3397.0,3275.0,5562.0,2850.0,2712.0,5075.0,2599.0,2476.0,23505.0,12055.0,11450.0,54481.0,27320.0,27161.0,14.241154,3854.812685,-90.196586,0.965276,30.294778,30.294778,120.491364,120.491364,120.491364,0.547186,0.0,0.965276,20049.704556,2608176.0,47.87313,13.877778,13.866667,9,32.149998,,,,,,,,,,2000,2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,2000-02,,,,,,,,,,,,,,,,,,,,13.872222,0.011111,90001-2000-2,,,,,,,,,14.272222,,,,0.011111,,,,,,,,,,,,,


#### Fill in nulls conditionally on merged datasets

* For each health outcome, fill in the nulls for a zipcode with 0's ONLY if that row occurred after the first non-zero/not-null visit in that zipcode for that health outcome (keeps them as nulls otherwise)

In [12]:
def filter_nans(df, visits_cols = num_visits_col_names, replace_after_1 = True):
    """Function to generate columns in place that do one of two things
    
    If replace_after_1 == True, replacec NaNs with 0's only if that 
    row occurred after the first non-zero/not null visit in that zipcode for the specific
    health outcome. Keeps them as nulls otherwise.
    
    If replace_after_1 == False, replaces all NaNs with 0's if there exists any
    non-zero/not null visit in that zipcode for the specific health outcome. Keeps
    them as nulls otherwise.

    Args:
        df (DataFrame): Input dataframe
        visits_cols (list, optional): list of columns to selectively filter NaNs
        replace_after_1: Bool
    Returns:
        DataFrame with columns replaced with their NaN-filtered versions
    """

    def get_rowIndex(row):
        """Function intended for applying across df rows

        Args:
            row (int): row

        Returns:
            int: index of row
        """
      
        return row.name

    def compare_and_replace(orig_visits, dataset_row_idx, school_zip):
        """Function intended for applying across df rows
         Selectively replaces NaNs with 0's
        Args:
            orig_visits: original column that needs to be filtered
            dataset_row_idx: column with row indices for the entire df
            school_zip: column with school zips

        Returns:
            float or NaN
        """
        
        # school zip + zip idx
        first_val_row_idx = dict_row_idx[school_zip]
        zip_idx = dict_zip_idx[school_zip]
        max_idx = dict_max_zipindex_per_zip[school_zip]
        difference = max_idx - zip_idx + 1

        # If there is a nonzero/non-null, replace all NaNs after with 0's
        if replace_after_1 == True:
        # check the school zip first
            if dataset_row_idx < first_val_row_idx:
                orig_visits = orig_visits
            elif (dataset_row_idx >= first_val_row_idx) and (dataset_row_idx <=  first_val_row_idx + difference):
                if pd.isnull(orig_visits):
                    orig_visits = 0
                else:
                    orig_visits = orig_visits
            return orig_visits

        # If there's a nonzero/non-null value at any point, replace all NaNs for that zipcode with 0's
        elif replace_after_1 == False:
            # in the event that there is no nonzero/non-null anywhere in the zip
            if zip_idx == df_grouped_schools.shape[0]:
                orig_visits = orig_visits
            # there is a 1.0 somewhere in the zip, change all NaNs to 0's
            else:
                if pd.isnull(orig_visits):
                    orig_visits = 0
                else: 
                    orig_visits = orig_visits
            return orig_visits

        
    # group df by school_zip, year_month
    df_grouped_schools = df.groupby(['school_zip', 'year_month']).tail(1)

    unique_school_zips = list(df_grouped_schools['school_zip'].unique())

    # generate overall row index
    df_grouped_schools['rowIndex'] = df_grouped_schools.apply(get_rowIndex, axis=1)

    # generate row indices that rest per school zip
    df_grouped_schools['zipIndex'] = df_grouped_schools.groupby(['school_zip'])['year_month'].rank('first', ascending=True).astype(int)
    df_grouped_schools['zipIndex'] = df_grouped_schools['zipIndex'] - 1

    # generate dictionary that gets max index per school zip
    dict_max_zipindex_per_zip = {}
    for i in unique_school_zips:
        dict_max_zipindex_per_zip[i] = df_grouped_schools[df_grouped_schools['school_zip']==i]['zipIndex'].max()

    for i in visits_cols:
        dict_zip_idx = {}
        dict_row_idx = {}
        for j in unique_school_zips:
            temp = df_grouped_schools[df_grouped_schools['school_zip']==j]

            visits_series = pd.Series(temp[i]) # one school zip, filtered to 1 health outcome
            bool_not_null = visits_series.notnull()
            all_indices_not_null = np.where(bool_not_null)[0]

            # save index of the first non-NaN value within the zipcode indices
            # if everything every value for zip is NaN, set value to # of records in df
            try:
                groupby_index = all_indices_not_null[0]
            except IndexError:
                groupby_index = df_grouped_schools.shape[0]
            dict_zip_idx[j] = groupby_index
            
            # save index of the row from whole dataset; set value to # of records in df if not
            try:
                row_idx = temp.loc[temp['zipIndex'] == groupby_index, 'rowIndex'].values[0]
            except IndexError:
                row_idx = df_grouped_schools.shape[0]
            dict_row_idx[j] = row_idx
        
        df_grouped_schools[i] = df_grouped_schools.apply(lambda row: compare_and_replace(row[i], row['rowIndex'], row['school_zip']), axis=1)

    # drop rowIndex and zipIndex cols
    df_grouped_schools.drop(columns=['rowIndex', 'zipIndex'], inplace=True)

    return df_grouped_schools

In [13]:
# call function:
df.sort_values(['school_zip', 'year_month'], inplace=True)
df = filter_nans(df, visits_cols = num_visits_col_names, replace_after_1 = True)
print('Shape of data ', df.shape)
display(df)

Shape of data  (294897, 174)


Unnamed: 0,year_month,school_zip,school_county_v2,school_region_name,pm25,school_elevation_m,ps_elevation_m,population_0_4,population_0_4_male,population_0_4_female,population_5_9,population_5_9_male,population_5_9_female,population_10_14,population_10_14_male,population_10_14_female,population_15_19,population_15_19_male,population_15_19_female,total_pop_under19,pop_under19_male,pop_under19_female,total_population,total_population_male,total_population_female,point_source_pm25_tpy,dist_school_to_ps_m,angle_to_school,ps_wspd_merge,school_wdir_wrt_0n,ps_wdir_wrt_0n,school_wind_alignment,ps_wind_alignment,avg_wind_alignment,avg_wind_alignment_cosine,nearby_point_source_count,school_wspd,ca_agi_per_returns,total_tax_liability,tax_liability_per_capita,school_temperature,ps_temperature,school_count,pm25_r1,pm25_r6,pm25_r9,pm25_r12,pm25_r24,pm25_slope6,pm25_slope9,pm25_slope12,pm25_slope24,pm25_lag_12mo,year,month,school_county_v2_alameda,school_county_v2_alpine,school_county_v2_amador,school_county_v2_butte,school_county_v2_calaveras,school_county_v2_colusa,school_county_v2_contra_costa,school_county_v2_del_norte,school_county_v2_el_dorado,school_county_v2_fresno,school_county_v2_glenn,school_county_v2_humboldt,school_county_v2_imperial,school_county_v2_inyo,school_county_v2_kern,school_county_v2_kings,school_county_v2_lake,school_county_v2_lassen,school_county_v2_los_angeles,school_county_v2_madera,school_county_v2_marin,school_county_v2_mariposa,school_county_v2_mendocino,school_county_v2_merced,school_county_v2_modoc,school_county_v2_mono,school_county_v2_monterey,school_county_v2_napa,school_county_v2_nevada,school_county_v2_orange,school_county_v2_placer,school_county_v2_plumas,school_county_v2_riverside,school_county_v2_sacramento,school_county_v2_san_benito,school_county_v2_san_bernardino,school_county_v2_san_diego,school_county_v2_san_francisco,school_county_v2_san_joaquin,school_county_v2_san_luis_obispo,school_county_v2_san_mateo,school_county_v2_santa_barbara,school_county_v2_santa_clara,school_county_v2_santa_cruz,school_county_v2_shasta,school_county_v2_sierra,school_county_v2_siskiyou,school_county_v2_solano,school_county_v2_sonoma,school_county_v2_stanislaus,school_county_v2_sutter,school_county_v2_tehama,school_county_v2_trinity,school_county_v2_tulare,school_county_v2_tuolumne,school_county_v2_ventura,school_county_v2_yolo,school_county_v2_yuba,month_01,month_02,month_03,month_04,month_05,month_06,month_07,month_08,month_09,month_10,month_11,month_12,y-m,central_wind_alignment_180_high,avg_count_ps_within_5km,avg_elevation_diff_m,avg_wspd_ratio_ps_sch,avg_wspd_ratio_sch_ps,avg_school_wspd,avg_ps_wspd,new_alignment_90_high,ps_pm25_tpy_top_20,school_to_ps_geod_dist_m_top_20,avg_wspd_top_15,Izmy_v1_unnormed,Izmy_v2_nodist_unnormed,Izmy_v3_normed_D_and_TPY,Izmy_v4_nodist_normed_TPY,Izmy_v5_all_normed_but_wspd_ratio,Izmy_v6_unnormed_no_wspd,Izmy_v7_all_normed_no_wspd,Izmy_v8_normed_D_and_TPY_no_wspd,avg_temp,diff_temp_s_ps,patzip_year_month,Izmy_v4_nodist_normed_TPY_r1,Izmy_v4_nodist_normed_TPY_r6,Izmy_v4_nodist_normed_TPY_r9,Izmy_v4_nodist_normed_TPY_r12,avg_wspd_top_15_r1,avg_wspd_top_15_r6,avg_wspd_top_15_r9,avg_wspd_top_15_r12,avg_temp_r1,avg_temp_r6,avg_temp_r9,avg_temp_r12,diff_temp_s_ps_r1,diff_temp_s_ps_r6,diff_temp_s_ps_r9,diff_temp_s_ps_r12,visits_all_malignant_cancers,visits_all_nonblood_malignant_cancers,visits_blood_diseases,visits_blood_or_bv_diseases,visits_blood_vessel_diseases,visits_cardioresp_cancers,visits_hematopoietic_cancers,visits_injuries,visits_respiratory,visits_type_1_diabetes
0,2000-01-01,90001,Los Angeles,Los Angeles County,32.149998,44.728889,43.703333,6196.0,3209.0,2987.0,6672.0,3397.0,3275.0,5562.0,2850.0,2712.0,5075.0,2599.0,2476.0,23505.0,12055.0,11450.0,54481.0,27320.0,27161.0,14.241154,3854.812685,-90.196586,0.757031,-172.758321,-172.758321,82.561735,82.561735,82.561735,1.124995,0.0,0.757031,20049.704556,2.608176e+06,47.873130,14.277778,14.266667,9,,,,,,,,,,,2000,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,2000-01,,,,,,,,,,,,,,,,,,,,14.272222,0.011111,90001-2000-1,,,,,,,,,,,,,,,,,,,,,,,,,,
1,2000-02-01,90001,Los Angeles,Los Angeles County,13.666667,44.728889,43.703333,6196.0,3209.0,2987.0,6672.0,3397.0,3275.0,5562.0,2850.0,2712.0,5075.0,2599.0,2476.0,23505.0,12055.0,11450.0,54481.0,27320.0,27161.0,14.241154,3854.812685,-90.196586,0.965276,30.294778,30.294778,120.491364,120.491364,120.491364,0.547186,0.0,0.965276,20049.704556,2.608176e+06,47.873130,13.877778,13.866667,9,32.149998,,,,,,,,,,2000,2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,2000-02,,,,,,,,,,,,,,,,,,,,13.872222,0.011111,90001-2000-2,,,,,,,,,14.272222,,,,0.011111,,,,,,,,,,,,,
2,2000-03-01,90001,Los Angeles,Los Angeles County,17.183334,44.728889,43.703333,6196.0,3209.0,2987.0,6672.0,3397.0,3275.0,5562.0,2850.0,2712.0,5075.0,2599.0,2476.0,23505.0,12055.0,11450.0,54481.0,27320.0,27161.0,14.241154,3854.812685,-90.196586,0.199593,107.246620,107.246620,147.368213,147.368213,147.368213,0.172183,0.0,0.199593,20049.704556,2.608176e+06,47.873130,14.677778,14.666667,9,13.666667,,,,,,,,,,2000,3,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,2000-03,,,,,,,,,,,,,,,,,,,,14.672222,0.011111,90001-2000-3,,,,,,,,,13.872222,,,,0.011111,,,,,,,,,,,,,
3,2000-04-01,90001,Los Angeles,Los Angeles County,17.366667,44.728889,43.703333,6196.0,3209.0,2987.0,6672.0,3397.0,3275.0,5562.0,2850.0,2712.0,5075.0,2599.0,2476.0,23505.0,12055.0,11450.0,54481.0,27320.0,27161.0,14.241154,3854.812685,-90.196586,0.771902,73.610928,73.610928,155.392932,155.392932,155.392932,0.159581,0.0,0.771902,20049.704556,2.608176e+06,47.873130,16.055556,16.600000,9,17.183334,,,,,,,,,,2000,4,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,2000-04,,,,,,,,,,,,,,,,,,,,16.327778,-0.544444,90001-2000-4,,,,,,,,,14.672222,,,,0.011111,,,,,,,,,,,,,
4,2000-05-01,90001,Los Angeles,Los Angeles County,17.616667,44.728889,43.703333,6196.0,3209.0,2987.0,6672.0,3397.0,3275.0,5562.0,2850.0,2712.0,5075.0,2599.0,2476.0,23505.0,12055.0,11450.0,54481.0,27320.0,27161.0,14.241154,3854.812685,-90.196586,1.063785,73.965232,73.965232,155.432299,155.432299,155.432299,0.158167,0.0,1.063785,20049.704556,2.608176e+06,47.873130,17.855556,18.533333,9,17.366667,,,,,,,,,,2000,5,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,2000-05,,,,,,,,,,,,,,,,,,,,18.194444,-0.677778,90001-2000-5,,,,,,,,,16.327778,,,,-0.544444,,,,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
294892,2017-08-01,97635,Modoc,Superior California,5.925000,1484.680000,1331.890000,0.0,0.0,0.0,0.0,0.0,0.0,17.0,12.0,5.0,0.0,0.0,0.0,17.0,12.0,5.0,214.0,113.0,101.0,0.877750,60528.982992,21.694962,0.427147,128.596355,114.046726,106.901393,92.351765,99.626579,0.832774,0.0,0.371966,59799.084409,3.809391e+07,178008.912538,8.014286,19.914286,1,5.600000,2.841146,2.942708,3.593750,3.345443,0.461518,0.036927,-0.254611,-0.039781,6.550000,2017,8,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,2017-08,128.313092,0.0,-673.963333,1.515257,1.474275,1.895626,2.163724,41.923230,30.157308,205925.295872,2.029675,114.808556,2.312278e+07,12428.042293,6444.545693,138.089359,71.899755,86.222436,7760.019217,13.964286,-11.900000,97635-2017-8,6482.153474,4478.345836,4342.789788,4380.227509,2.164433,2.608587,2.506471,2.474437,14.086170,10.063198,6.605095,7.770526,10.827660,2.965668,0.728964,-1.315258,,,,,,,,0.0,,
294893,2017-09-01,97635,Modoc,Superior California,3.678125,1484.680000,1331.890000,0.0,0.0,0.0,0.0,0.0,0.0,17.0,12.0,5.0,0.0,0.0,0.0,17.0,12.0,5.0,214.0,113.0,101.0,0.877750,60528.982992,21.694962,0.073094,132.915270,-97.997283,111.220308,119.692244,115.456276,0.570178,0.0,0.135517,59799.084409,3.809391e+07,178008.912538,8.014286,15.500000,1,5.925000,3.283333,3.102431,3.541667,3.244010,0.988214,0.364479,-0.030573,0.040514,7.009375,2017,9,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,2017-09,121.254592,0.0,-673.963333,1.138648,1.710182,2.295742,2.090771,39.123371,30.157308,205925.295872,2.193256,72.582018,1.462739e+07,7814.384034,4056.614586,86.826489,64.158576,76.759739,6908.376471,11.757143,-7.485714,97635-2017-9,6444.545693,5008.947793,4775.122824,4473.625712,2.029675,2.436629,2.479211,2.455685,13.964286,11.625846,7.928904,7.733621,-11.900000,-0.159629,0.017853,-1.241448,,,,,,,,0.0,,
294894,2017-10-01,97635,Modoc,Superior California,3.931250,1484.680000,1331.890000,0.0,0.0,0.0,0.0,0.0,0.0,17.0,12.0,5.0,0.0,0.0,0.0,17.0,12.0,5.0,214.0,113.0,101.0,0.877750,60528.982992,21.694962,0.425838,13.870442,-67.817276,7.824520,89.512238,48.668379,1.660416,0.0,0.476738,59799.084409,3.809391e+07,178008.912538,6.050000,8.672340,1,3.678125,3.668750,3.113194,3.264063,3.259766,0.661071,0.442760,0.132299,0.045463,3.081250,2017,10,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,2017-10,117.610480,0.0,-673.963333,0.992379,2.098003,2.572472,1.968177,36.263669,30.157308,205925.295872,2.270324,57.395807,1.163883e+07,6153.608210,3212.957923,68.373425,61.499497,73.680521,6631.246893,7.361170,-2.622340,97635-2017-10,4056.614586,5102.440152,4671.455319,4434.423646,2.193256,2.375880,2.486227,2.455118,11.757143,12.194818,9.608402,7.733621,-7.485714,-1.297573,-0.449078,-1.241448,,,,,,,,0.0,,
294895,2017-11-01,97635,Modoc,Superior California,4.284375,1484.680000,1331.890000,0.0,0.0,0.0,0.0,0.0,0.0,17.0,12.0,5.0,0.0,0.0,0.0,17.0,12.0,5.0,214.0,113.0,101.0,0.877750,60528.982992,21.694962,1.701108,32.728486,26.375591,11.033524,4.680629,7.857076,1.990612,0.0,2.559094,59799.084409,3.809391e+07,178008.912538,-0.700000,4.800000,1,3.931250,4.015104,3.397917,3.334896,3.230339,0.335446,0.351927,0.149650,0.067154,4.487500,2017,11,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,2017-11,134.498873,0.0,-673.963333,0.747694,2.975230,3.196173,1.749303,46.899849,30.157308,205925.295872,2.472738,52.229601,1.051500e+07,5578.276660,2892.839642,61.980852,76.280400,91.166790,8205.011112,2.050000,-5.500000,97635-2017-11,3212.957923,5019.307382,4509.354802,4399.935525,2.270324,2.291536,2.460531,2.420958,7.361170,12.362156,10.384643,7.710704,-2.622340,-2.287010,-0.468227,-1.287282,,,,,,,,0.0,,


Create response variables

In [14]:
# Create response variables, which is visits / population
for col in num_visits_col_names:
    df['y_'+col] = 1000 * df[col] / df['total_pop_under19']
    
# add year trend
year_map = {label: idx for idx, label in enumerate(np.sort(df.year.unique()))}
df["year_trend"] = df.year.map(year_map)
df["year_trend"] = df.year_trend + 1

# fixing month datatype
df['month'] = df['month'].astype(str)

# create county_month
df['county_month'] = df.apply(lambda df: df['month'].rjust(2, '0') + '_' + df['school_county_v2'], axis=1)

# create year_month_county (in case we want to just direclty use this var for the interaction effects)
df['year_month_county'] = df.apply(lambda df: str(df['year']) + '_' + df['month'] + '_' + df['school_county_v2'], axis=1)

# reset index
df.reset_index(drop=True, inplace=True)

# print shape of data
print('Shape of data ', df.shape) # Old shape of data  (294897, 174)
df.head(2)

Shape of data  (294897, 187)


Unnamed: 0,year_month,school_zip,school_county_v2,school_region_name,pm25,school_elevation_m,ps_elevation_m,population_0_4,population_0_4_male,population_0_4_female,population_5_9,population_5_9_male,population_5_9_female,population_10_14,population_10_14_male,population_10_14_female,population_15_19,population_15_19_male,population_15_19_female,total_pop_under19,pop_under19_male,pop_under19_female,total_population,total_population_male,total_population_female,point_source_pm25_tpy,dist_school_to_ps_m,angle_to_school,ps_wspd_merge,school_wdir_wrt_0n,ps_wdir_wrt_0n,school_wind_alignment,ps_wind_alignment,avg_wind_alignment,avg_wind_alignment_cosine,nearby_point_source_count,school_wspd,ca_agi_per_returns,total_tax_liability,tax_liability_per_capita,school_temperature,ps_temperature,school_count,pm25_r1,pm25_r6,pm25_r9,pm25_r12,pm25_r24,pm25_slope6,pm25_slope9,pm25_slope12,pm25_slope24,pm25_lag_12mo,year,month,school_county_v2_alameda,school_county_v2_alpine,school_county_v2_amador,school_county_v2_butte,school_county_v2_calaveras,school_county_v2_colusa,school_county_v2_contra_costa,school_county_v2_del_norte,school_county_v2_el_dorado,school_county_v2_fresno,school_county_v2_glenn,school_county_v2_humboldt,school_county_v2_imperial,school_county_v2_inyo,school_county_v2_kern,school_county_v2_kings,school_county_v2_lake,school_county_v2_lassen,school_county_v2_los_angeles,school_county_v2_madera,school_county_v2_marin,school_county_v2_mariposa,school_county_v2_mendocino,school_county_v2_merced,school_county_v2_modoc,school_county_v2_mono,school_county_v2_monterey,school_county_v2_napa,school_county_v2_nevada,school_county_v2_orange,school_county_v2_placer,school_county_v2_plumas,school_county_v2_riverside,school_county_v2_sacramento,school_county_v2_san_benito,school_county_v2_san_bernardino,school_county_v2_san_diego,school_county_v2_san_francisco,school_county_v2_san_joaquin,school_county_v2_san_luis_obispo,school_county_v2_san_mateo,school_county_v2_santa_barbara,school_county_v2_santa_clara,school_county_v2_santa_cruz,school_county_v2_shasta,school_county_v2_sierra,school_county_v2_siskiyou,school_county_v2_solano,school_county_v2_sonoma,school_county_v2_stanislaus,school_county_v2_sutter,school_county_v2_tehama,school_county_v2_trinity,school_county_v2_tulare,school_county_v2_tuolumne,school_county_v2_ventura,school_county_v2_yolo,school_county_v2_yuba,month_01,month_02,month_03,month_04,month_05,month_06,month_07,month_08,month_09,month_10,month_11,month_12,y-m,central_wind_alignment_180_high,avg_count_ps_within_5km,avg_elevation_diff_m,avg_wspd_ratio_ps_sch,avg_wspd_ratio_sch_ps,avg_school_wspd,avg_ps_wspd,new_alignment_90_high,ps_pm25_tpy_top_20,school_to_ps_geod_dist_m_top_20,avg_wspd_top_15,Izmy_v1_unnormed,Izmy_v2_nodist_unnormed,Izmy_v3_normed_D_and_TPY,Izmy_v4_nodist_normed_TPY,Izmy_v5_all_normed_but_wspd_ratio,Izmy_v6_unnormed_no_wspd,Izmy_v7_all_normed_no_wspd,Izmy_v8_normed_D_and_TPY_no_wspd,avg_temp,diff_temp_s_ps,patzip_year_month,Izmy_v4_nodist_normed_TPY_r1,Izmy_v4_nodist_normed_TPY_r6,Izmy_v4_nodist_normed_TPY_r9,Izmy_v4_nodist_normed_TPY_r12,avg_wspd_top_15_r1,avg_wspd_top_15_r6,avg_wspd_top_15_r9,avg_wspd_top_15_r12,avg_temp_r1,avg_temp_r6,avg_temp_r9,avg_temp_r12,diff_temp_s_ps_r1,diff_temp_s_ps_r6,diff_temp_s_ps_r9,diff_temp_s_ps_r12,visits_all_malignant_cancers,visits_all_nonblood_malignant_cancers,visits_blood_diseases,visits_blood_or_bv_diseases,visits_blood_vessel_diseases,visits_cardioresp_cancers,visits_hematopoietic_cancers,visits_injuries,visits_respiratory,visits_type_1_diabetes,y_visits_all_malignant_cancers,y_visits_all_nonblood_malignant_cancers,y_visits_blood_diseases,y_visits_blood_or_bv_diseases,y_visits_blood_vessel_diseases,y_visits_cardioresp_cancers,y_visits_hematopoietic_cancers,y_visits_injuries,y_visits_respiratory,y_visits_type_1_diabetes,year_trend,county_month,year_month_county
0,2000-01-01,90001,Los Angeles,Los Angeles County,32.149998,44.728889,43.703333,6196.0,3209.0,2987.0,6672.0,3397.0,3275.0,5562.0,2850.0,2712.0,5075.0,2599.0,2476.0,23505.0,12055.0,11450.0,54481.0,27320.0,27161.0,14.241154,3854.812685,-90.196586,0.757031,-172.758321,-172.758321,82.561735,82.561735,82.561735,1.124995,0.0,0.757031,20049.704556,2608176.0,47.87313,14.277778,14.266667,9,,,,,,,,,,,2000,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,2000-01,,,,,,,,,,,,,,,,,,,,14.272222,0.011111,90001-2000-1,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,1,01_Los Angeles,2000_1_Los Angeles
1,2000-02-01,90001,Los Angeles,Los Angeles County,13.666667,44.728889,43.703333,6196.0,3209.0,2987.0,6672.0,3397.0,3275.0,5562.0,2850.0,2712.0,5075.0,2599.0,2476.0,23505.0,12055.0,11450.0,54481.0,27320.0,27161.0,14.241154,3854.812685,-90.196586,0.965276,30.294778,30.294778,120.491364,120.491364,120.491364,0.547186,0.0,0.965276,20049.704556,2608176.0,47.87313,13.877778,13.866667,9,32.149998,,,,,,,,,,2000,2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,2000-02,,,,,,,,,,,,,,,,,,,,13.872222,0.011111,90001-2000-2,,,,,,,,,14.272222,,,,0.011111,,,,,,,,,,,,,,,,,,,,,,,,1,02_Los Angeles,2000_2_Los Angeles


### Make Rolling HO Sum Columns

In [15]:
# sort data on date
df = df.sort_values('year_month').reset_index(drop=True)

# train/test split 
# keep 2018 as the held out test set
df_test = df[df.year == 2018]
df = df[df.year != 2018]    # filter out 2018 from our dataset

In [16]:
# get rolling n month sum
def create_rolling_sum(df, var_name:str = 'number_of_visits_hem_cancers', num_months=3, center_arg:bool = False):
  """
    Creates rolling sums for the number of visits for a given health outcome. 
    Overwrite your dataframe with the output.
    Function saves the result as a column into the dataframe with subscripts 
    - '{var_name}_fwd{number of months}' for forward sums
    - '{var_name}_cent{number of months}' for centered sums

    Function includes the current month as one of the months in num_months.

    Dataframe input MUST be sorted by ['school_zip', 'year_month'] ahead of time.

    `df = df.sort_values(['school_zip', 'year_month'])`

    Suggested: filter out tail end of dates so rolling averages are not filled with imputed values.

  Args:
      `df` (dataframe): dataframe having columns for 'school_zip', datetime 'year_month', and number of visits. Dataframe must be sorted by 
      `var_name` (str, optional): health outcome number of visits. Defaults to 'number_of_visits_hem_cancers'.
      `num_months` (int, optional): Number of months to take rolling sum over. Defaults to 3.
      `center_arg` (bool, optional): If this sum should be centered on current month. Defaults to False.

  Returns:
      `df_int`: returns dataframe with column added
  """
  df_int = df.copy().sort_values(['school_zip', 'year_month']).reset_index(drop=True)
  
  if center_arg:
    df_int[f'{var_name}_cent{num_months}'] = df_int.groupby('school_zip')[var_name]\
                                      .apply(lambda x:x.rolling(num_months, center=True).sum())
  else:
    df_int[f'{var_name}_fwd{num_months}'] = df_int.groupby('school_zip')[var_name]\
                                      .apply(lambda x:x.rolling(num_months).sum().shift(1-num_months))

  
  return df_int 


df = df.sort_values(['school_zip', 'year_month'])
starting_cols = list(df.columns)

# 3 months ---
n = int(lag_time) # specify number of months

for health_outcome_y_col, health_outcome_visits_col in zip(y_col_names, num_visits_col_names):
    # forward looking columns
    df = create_rolling_sum(df=df, var_name=health_outcome_visits_col, num_months=n, center_arg=False)
    df[f'{health_outcome_y_col}_fwd{n}'] = 1000 * df[f'{health_outcome_visits_col}_fwd{n}'] / df['total_pop_under19']

    # centered columns
    df = create_rolling_sum(df=df, var_name=health_outcome_visits_col, num_months=n, center_arg=True)
    df[f'{health_outcome_y_col}_cent{n}'] = 1000 * df[f'{health_outcome_visits_col}_cent{n}'] / df['total_pop_under19']


# print columns added
ending_cols = list(df.columns)
window_3months_columns = [c for c in ending_cols if c not in starting_cols]
print(f"\nColumns added for health outcomes using {n} month window:\n{window_3months_columns}")
starting_cols = list(df.columns)


# # filter data to appropriate data range
# df = df[df.year >= 2002]


Columns added for health outcomes using 3 month window:
['visits_all_malignant_cancers_fwd3', 'y_visits_all_malignant_cancers_fwd3', 'visits_all_malignant_cancers_cent3', 'y_visits_all_malignant_cancers_cent3', 'visits_all_nonblood_malignant_cancers_fwd3', 'y_visits_all_nonblood_malignant_cancers_fwd3', 'visits_all_nonblood_malignant_cancers_cent3', 'y_visits_all_nonblood_malignant_cancers_cent3', 'visits_blood_diseases_fwd3', 'y_visits_blood_diseases_fwd3', 'visits_blood_diseases_cent3', 'y_visits_blood_diseases_cent3', 'visits_blood_or_bv_diseases_fwd3', 'y_visits_blood_or_bv_diseases_fwd3', 'visits_blood_or_bv_diseases_cent3', 'y_visits_blood_or_bv_diseases_cent3', 'visits_blood_vessel_diseases_fwd3', 'y_visits_blood_vessel_diseases_fwd3', 'visits_blood_vessel_diseases_cent3', 'y_visits_blood_vessel_diseases_cent3', 'visits_cardioresp_cancers_fwd3', 'y_visits_cardioresp_cancers_fwd3', 'visits_cardioresp_cancers_cent3', 'y_visits_cardioresp_cancers_cent3', 'visits_hematopoietic_canc

Make the 12 month difference between health outcome columns

In [17]:
def y_outcome_difference_maker(df, y_variable_col_name, num_months_back):

  '''
  This function takes the 12 month difference of the health outcome. 

  We would use it after we make our y_visits_ columns in Cornelia's notebook on those columns. In stage 2, we use these columns as our y variable. 

  Fix Cornelia's 2sls script notebooks to use the y_visits that are there originally and then run and OLS on this difference column for it right underneath.
  '''
  df_copy = df.copy().sort_values(['school_zip', 'year_month']).reset_index(drop=True)

  df_copy[f'{y_variable_col_name}_diff_r{num_months_back}'] = df_copy.groupby('school_zip')[y_variable_col_name]\
                                        .apply(lambda x:x - x.shift(num_months_back))

  print(f"Outcome generated: {y_variable_col_name}_diff_r{num_months_back}")
  return df_copy


In [18]:
# add the 12 month differences for the health outcomes
for var_name in y_col_names_lag:
    df = y_outcome_difference_maker(df, var_name, 12)

display(df)

Outcome generated: y_visits_all_malignant_cancers_fwd3_diff_r12
Outcome generated: y_visits_all_nonblood_malignant_cancers_fwd3_diff_r12
Outcome generated: y_visits_blood_diseases_fwd3_diff_r12
Outcome generated: y_visits_blood_or_bv_diseases_fwd3_diff_r12
Outcome generated: y_visits_blood_vessel_diseases_fwd3_diff_r12
Outcome generated: y_visits_cardioresp_cancers_fwd3_diff_r12
Outcome generated: y_visits_hematopoietic_cancers_fwd3_diff_r12
Outcome generated: y_visits_injuries_fwd3_diff_r12
Outcome generated: y_visits_respiratory_fwd3_diff_r12
Outcome generated: y_visits_type_1_diabetes_fwd3_diff_r12


Unnamed: 0,year_month,school_zip,school_county_v2,school_region_name,pm25,school_elevation_m,ps_elevation_m,population_0_4,population_0_4_male,population_0_4_female,population_5_9,population_5_9_male,population_5_9_female,population_10_14,population_10_14_male,population_10_14_female,population_15_19,population_15_19_male,population_15_19_female,total_pop_under19,pop_under19_male,pop_under19_female,total_population,total_population_male,total_population_female,point_source_pm25_tpy,dist_school_to_ps_m,angle_to_school,ps_wspd_merge,school_wdir_wrt_0n,ps_wdir_wrt_0n,school_wind_alignment,ps_wind_alignment,avg_wind_alignment,avg_wind_alignment_cosine,nearby_point_source_count,school_wspd,ca_agi_per_returns,total_tax_liability,tax_liability_per_capita,school_temperature,ps_temperature,school_count,pm25_r1,pm25_r6,pm25_r9,pm25_r12,pm25_r24,pm25_slope6,pm25_slope9,pm25_slope12,pm25_slope24,pm25_lag_12mo,year,month,school_county_v2_alameda,school_county_v2_alpine,school_county_v2_amador,school_county_v2_butte,school_county_v2_calaveras,school_county_v2_colusa,school_county_v2_contra_costa,school_county_v2_del_norte,school_county_v2_el_dorado,school_county_v2_fresno,school_county_v2_glenn,school_county_v2_humboldt,school_county_v2_imperial,school_county_v2_inyo,school_county_v2_kern,school_county_v2_kings,school_county_v2_lake,school_county_v2_lassen,school_county_v2_los_angeles,school_county_v2_madera,school_county_v2_marin,school_county_v2_mariposa,school_county_v2_mendocino,school_county_v2_merced,school_county_v2_modoc,school_county_v2_mono,school_county_v2_monterey,school_county_v2_napa,school_county_v2_nevada,school_county_v2_orange,school_county_v2_placer,school_county_v2_plumas,school_county_v2_riverside,school_county_v2_sacramento,school_county_v2_san_benito,school_county_v2_san_bernardino,school_county_v2_san_diego,school_county_v2_san_francisco,school_county_v2_san_joaquin,school_county_v2_san_luis_obispo,school_county_v2_san_mateo,school_county_v2_santa_barbara,school_county_v2_santa_clara,school_county_v2_santa_cruz,school_county_v2_shasta,school_county_v2_sierra,school_county_v2_siskiyou,school_county_v2_solano,school_county_v2_sonoma,school_county_v2_stanislaus,school_county_v2_sutter,school_county_v2_tehama,school_county_v2_trinity,school_county_v2_tulare,school_county_v2_tuolumne,school_county_v2_ventura,school_county_v2_yolo,school_county_v2_yuba,month_01,month_02,month_03,month_04,month_05,month_06,month_07,month_08,month_09,month_10,month_11,month_12,y-m,central_wind_alignment_180_high,avg_count_ps_within_5km,avg_elevation_diff_m,avg_wspd_ratio_ps_sch,avg_wspd_ratio_sch_ps,avg_school_wspd,avg_ps_wspd,new_alignment_90_high,ps_pm25_tpy_top_20,school_to_ps_geod_dist_m_top_20,avg_wspd_top_15,Izmy_v1_unnormed,Izmy_v2_nodist_unnormed,Izmy_v3_normed_D_and_TPY,Izmy_v4_nodist_normed_TPY,Izmy_v5_all_normed_but_wspd_ratio,Izmy_v6_unnormed_no_wspd,Izmy_v7_all_normed_no_wspd,Izmy_v8_normed_D_and_TPY_no_wspd,avg_temp,diff_temp_s_ps,patzip_year_month,Izmy_v4_nodist_normed_TPY_r1,Izmy_v4_nodist_normed_TPY_r6,Izmy_v4_nodist_normed_TPY_r9,Izmy_v4_nodist_normed_TPY_r12,avg_wspd_top_15_r1,avg_wspd_top_15_r6,avg_wspd_top_15_r9,avg_wspd_top_15_r12,avg_temp_r1,avg_temp_r6,avg_temp_r9,avg_temp_r12,diff_temp_s_ps_r1,diff_temp_s_ps_r6,diff_temp_s_ps_r9,diff_temp_s_ps_r12,visits_all_malignant_cancers,visits_all_nonblood_malignant_cancers,visits_blood_diseases,visits_blood_or_bv_diseases,visits_blood_vessel_diseases,visits_cardioresp_cancers,visits_hematopoietic_cancers,visits_injuries,visits_respiratory,visits_type_1_diabetes,y_visits_all_malignant_cancers,y_visits_all_nonblood_malignant_cancers,y_visits_blood_diseases,y_visits_blood_or_bv_diseases,y_visits_blood_vessel_diseases,y_visits_cardioresp_cancers,y_visits_hematopoietic_cancers,y_visits_injuries,y_visits_respiratory,y_visits_type_1_diabetes,year_trend,county_month,year_month_county,visits_all_malignant_cancers_fwd3,y_visits_all_malignant_cancers_fwd3,visits_all_malignant_cancers_cent3,y_visits_all_malignant_cancers_cent3,visits_all_nonblood_malignant_cancers_fwd3,y_visits_all_nonblood_malignant_cancers_fwd3,visits_all_nonblood_malignant_cancers_cent3,y_visits_all_nonblood_malignant_cancers_cent3,visits_blood_diseases_fwd3,y_visits_blood_diseases_fwd3,visits_blood_diseases_cent3,y_visits_blood_diseases_cent3,visits_blood_or_bv_diseases_fwd3,y_visits_blood_or_bv_diseases_fwd3,visits_blood_or_bv_diseases_cent3,y_visits_blood_or_bv_diseases_cent3,visits_blood_vessel_diseases_fwd3,y_visits_blood_vessel_diseases_fwd3,visits_blood_vessel_diseases_cent3,y_visits_blood_vessel_diseases_cent3,visits_cardioresp_cancers_fwd3,y_visits_cardioresp_cancers_fwd3,visits_cardioresp_cancers_cent3,y_visits_cardioresp_cancers_cent3,visits_hematopoietic_cancers_fwd3,y_visits_hematopoietic_cancers_fwd3,visits_hematopoietic_cancers_cent3,y_visits_hematopoietic_cancers_cent3,visits_injuries_fwd3,y_visits_injuries_fwd3,visits_injuries_cent3,y_visits_injuries_cent3,visits_respiratory_fwd3,y_visits_respiratory_fwd3,visits_respiratory_cent3,y_visits_respiratory_cent3,visits_type_1_diabetes_fwd3,y_visits_type_1_diabetes_fwd3,visits_type_1_diabetes_cent3,y_visits_type_1_diabetes_cent3,y_visits_all_malignant_cancers_fwd3_diff_r12,y_visits_all_nonblood_malignant_cancers_fwd3_diff_r12,y_visits_blood_diseases_fwd3_diff_r12,y_visits_blood_or_bv_diseases_fwd3_diff_r12,y_visits_blood_vessel_diseases_fwd3_diff_r12,y_visits_cardioresp_cancers_fwd3_diff_r12,y_visits_hematopoietic_cancers_fwd3_diff_r12,y_visits_injuries_fwd3_diff_r12,y_visits_respiratory_fwd3_diff_r12,y_visits_type_1_diabetes_fwd3_diff_r12
0,2000-01-01,90001,Los Angeles,Los Angeles County,32.149998,44.728889,43.703333,6196.0,3209.0,2987.0,6672.0,3397.0,3275.0,5562.0,2850.0,2712.0,5075.0,2599.0,2476.0,23505.0,12055.0,11450.0,54481.0,27320.0,27161.0,14.241154,3854.812685,-90.196586,0.757031,-172.758321,-172.758321,82.561735,82.561735,82.561735,1.124995,0.0,0.757031,20049.704556,2.608176e+06,47.873130,14.277778,14.266667,9,,,,,,,,,,,2000,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,2000-01,,,,,,,,,,,,,,,,,,,,14.272222,0.011111,90001-2000-1,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,1,01_Los Angeles,2000_1_Los Angeles,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
1,2000-02-01,90001,Los Angeles,Los Angeles County,13.666667,44.728889,43.703333,6196.0,3209.0,2987.0,6672.0,3397.0,3275.0,5562.0,2850.0,2712.0,5075.0,2599.0,2476.0,23505.0,12055.0,11450.0,54481.0,27320.0,27161.0,14.241154,3854.812685,-90.196586,0.965276,30.294778,30.294778,120.491364,120.491364,120.491364,0.547186,0.0,0.965276,20049.704556,2.608176e+06,47.873130,13.877778,13.866667,9,32.149998,,,,,,,,,,2000,2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,2000-02,,,,,,,,,,,,,,,,,,,,13.872222,0.011111,90001-2000-2,,,,,,,,,14.272222,,,,0.011111,,,,,,,,,,,,,,,,,,,,,,,,1,02_Los Angeles,2000_2_Los Angeles,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
2,2000-03-01,90001,Los Angeles,Los Angeles County,17.183334,44.728889,43.703333,6196.0,3209.0,2987.0,6672.0,3397.0,3275.0,5562.0,2850.0,2712.0,5075.0,2599.0,2476.0,23505.0,12055.0,11450.0,54481.0,27320.0,27161.0,14.241154,3854.812685,-90.196586,0.199593,107.246620,107.246620,147.368213,147.368213,147.368213,0.172183,0.0,0.199593,20049.704556,2.608176e+06,47.873130,14.677778,14.666667,9,13.666667,,,,,,,,,,2000,3,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,2000-03,,,,,,,,,,,,,,,,,,,,14.672222,0.011111,90001-2000-3,,,,,,,,,13.872222,,,,0.011111,,,,,,,,,,,,,,,,,,,,,,,,1,03_Los Angeles,2000_3_Los Angeles,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
3,2000-04-01,90001,Los Angeles,Los Angeles County,17.366667,44.728889,43.703333,6196.0,3209.0,2987.0,6672.0,3397.0,3275.0,5562.0,2850.0,2712.0,5075.0,2599.0,2476.0,23505.0,12055.0,11450.0,54481.0,27320.0,27161.0,14.241154,3854.812685,-90.196586,0.771902,73.610928,73.610928,155.392932,155.392932,155.392932,0.159581,0.0,0.771902,20049.704556,2.608176e+06,47.873130,16.055556,16.600000,9,17.183334,,,,,,,,,,2000,4,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,2000-04,,,,,,,,,,,,,,,,,,,,16.327778,-0.544444,90001-2000-4,,,,,,,,,14.672222,,,,0.011111,,,,,,,,,,,,,,,,,,,,,,,,1,04_Los Angeles,2000_4_Los Angeles,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
4,2000-05-01,90001,Los Angeles,Los Angeles County,17.616667,44.728889,43.703333,6196.0,3209.0,2987.0,6672.0,3397.0,3275.0,5562.0,2850.0,2712.0,5075.0,2599.0,2476.0,23505.0,12055.0,11450.0,54481.0,27320.0,27161.0,14.241154,3854.812685,-90.196586,1.063785,73.965232,73.965232,155.432299,155.432299,155.432299,0.158167,0.0,1.063785,20049.704556,2.608176e+06,47.873130,17.855556,18.533333,9,17.366667,,,,,,,,,,2000,5,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,2000-05,,,,,,,,,,,,,,,,,,,,18.194444,-0.677778,90001-2000-5,,,,,,,,,16.327778,,,,-0.544444,,,,,,,,,,,,,,,,,,,,,,,,1,05_Los Angeles,2000_5_Los Angeles,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
294892,2017-08-01,97635,Modoc,Superior California,5.925000,1484.680000,1331.890000,0.0,0.0,0.0,0.0,0.0,0.0,17.0,12.0,5.0,0.0,0.0,0.0,17.0,12.0,5.0,214.0,113.0,101.0,0.877750,60528.982992,21.694962,0.427147,128.596355,114.046726,106.901393,92.351765,99.626579,0.832774,0.0,0.371966,59799.084409,3.809391e+07,178008.912538,8.014286,19.914286,1,5.600000,2.841146,2.942708,3.593750,3.345443,0.461518,0.036927,-0.254611,-0.039781,6.550000,2017,8,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,2017-08,128.313092,0.0,-673.963333,1.515257,1.474275,1.895626,2.163724,41.923230,30.157308,205925.295872,2.029675,114.808556,2.312278e+07,12428.042293,6444.545693,138.089359,71.899755,86.222436,7760.019217,13.964286,-11.900000,97635-2017-8,6482.153474,4478.345836,4342.789788,4380.227509,2.164433,2.608587,2.506471,2.474437,14.086170,10.063198,6.605095,7.770526,10.827660,2.965668,0.728964,-1.315258,,,,,,,,0.0,,,,,,,,,,0.0,,,18,08_Modoc,2017_8_Modoc,,,,,,,,,,,,,,,,,,,,,,,,,,,,,0.0,0.0,0.0,0.0,,,,,,,,,,,,,,,,0.0,,
294893,2017-09-01,97635,Modoc,Superior California,3.678125,1484.680000,1331.890000,0.0,0.0,0.0,0.0,0.0,0.0,17.0,12.0,5.0,0.0,0.0,0.0,17.0,12.0,5.0,214.0,113.0,101.0,0.877750,60528.982992,21.694962,0.073094,132.915270,-97.997283,111.220308,119.692244,115.456276,0.570178,0.0,0.135517,59799.084409,3.809391e+07,178008.912538,8.014286,15.500000,1,5.925000,3.283333,3.102431,3.541667,3.244010,0.988214,0.364479,-0.030573,0.040514,7.009375,2017,9,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,2017-09,121.254592,0.0,-673.963333,1.138648,1.710182,2.295742,2.090771,39.123371,30.157308,205925.295872,2.193256,72.582018,1.462739e+07,7814.384034,4056.614586,86.826489,64.158576,76.759739,6908.376471,11.757143,-7.485714,97635-2017-9,6444.545693,5008.947793,4775.122824,4473.625712,2.029675,2.436629,2.479211,2.455685,13.964286,11.625846,7.928904,7.733621,-11.900000,-0.159629,0.017853,-1.241448,,,,,,,,0.0,,,,,,,,,,0.0,,,18,09_Modoc,2017_9_Modoc,,,,,,,,,,,,,,,,,,,,,,,,,,,,,0.0,0.0,0.0,0.0,,,,,,,,,,,,,,,,0.0,,
294894,2017-10-01,97635,Modoc,Superior California,3.931250,1484.680000,1331.890000,0.0,0.0,0.0,0.0,0.0,0.0,17.0,12.0,5.0,0.0,0.0,0.0,17.0,12.0,5.0,214.0,113.0,101.0,0.877750,60528.982992,21.694962,0.425838,13.870442,-67.817276,7.824520,89.512238,48.668379,1.660416,0.0,0.476738,59799.084409,3.809391e+07,178008.912538,6.050000,8.672340,1,3.678125,3.668750,3.113194,3.264063,3.259766,0.661071,0.442760,0.132299,0.045463,3.081250,2017,10,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,2017-10,117.610480,0.0,-673.963333,0.992379,2.098003,2.572472,1.968177,36.263669,30.157308,205925.295872,2.270324,57.395807,1.163883e+07,6153.608210,3212.957923,68.373425,61.499497,73.680521,6631.246893,7.361170,-2.622340,97635-2017-10,4056.614586,5102.440152,4671.455319,4434.423646,2.193256,2.375880,2.486227,2.455118,11.757143,12.194818,9.608402,7.733621,-7.485714,-1.297573,-0.449078,-1.241448,,,,,,,,0.0,,,,,,,,,,0.0,,,18,10_Modoc,2017_10_Modoc,,,,,,,,,,,,,,,,,,,,,,,,,,,,,0.0,0.0,0.0,0.0,,,,,,,,,,,,,,,,0.0,,
294895,2017-11-01,97635,Modoc,Superior California,4.284375,1484.680000,1331.890000,0.0,0.0,0.0,0.0,0.0,0.0,17.0,12.0,5.0,0.0,0.0,0.0,17.0,12.0,5.0,214.0,113.0,101.0,0.877750,60528.982992,21.694962,1.701108,32.728486,26.375591,11.033524,4.680629,7.857076,1.990612,0.0,2.559094,59799.084409,3.809391e+07,178008.912538,-0.700000,4.800000,1,3.931250,4.015104,3.397917,3.334896,3.230339,0.335446,0.351927,0.149650,0.067154,4.487500,2017,11,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,2017-11,134.498873,0.0,-673.963333,0.747694,2.975230,3.196173,1.749303,46.899849,30.157308,205925.295872,2.472738,52.229601,1.051500e+07,5578.276660,2892.839642,61.980852,76.280400,91.166790,8205.011112,2.050000,-5.500000,97635-2017-11,3212.957923,5019.307382,4509.354802,4399.935525,2.270324,2.291536,2.460531,2.420958,7.361170,12.362156,10.384643,7.710704,-2.622340,-2.287010,-0.468227,-1.287282,,,,,,,,0.0,,,,,,,,,,0.0,,,18,11_Modoc,2017_11_Modoc,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,0.0,0.0,,,,,,,,,,,,,,,,,,


Show General Statistics and Outliers

In [19]:
# for our y variable: y_col_names_lag
for i in y_col_names_lag:
    print(f"Outcome: {i}")
    display(df[i].describe())

    # df_temp_print = df[ np.abs(df[i]-df[i].mean()) <= min(3*df[i].std(), 1000*int(lag_time)) ]
    df_temp_print = df[ np.abs(df[i]-df[i].mean()) <= (3*df[i].std()) ]
    j = df[~df[i].isna()].shape[0]-df_temp_print[~df_temp_print[i].isna()].shape[0]
    print(f"If outliers (over 3 std away) were filtered out we lose {j} rows:")
    display(df_temp_print[i].describe())
    print("\n----\n")

Outcome: y_visits_all_malignant_cancers_fwd3


count    84843.000000
mean         0.212604
std          0.941026
min          0.000000
25%          0.000000
50%          0.000000
75%          0.203438
max         73.637703
Name: y_visits_all_malignant_cancers_fwd3, dtype: float64

If outliers (over 3 std away) were filtered out we lose 655 rows:


count    84188.000000
mean         0.157435
std          0.312241
min          0.000000
25%          0.000000
50%          0.000000
75%          0.196773
max          3.021148
Name: y_visits_all_malignant_cancers_fwd3, dtype: float64


----

Outcome: y_visits_all_nonblood_malignant_cancers_fwd3


count    73144.000000
mean         0.118735
std          0.777079
min          0.000000
25%          0.000000
50%          0.000000
75%          0.070467
max         73.637703
Name: y_visits_all_nonblood_malignant_cancers_fwd3, dtype: float64

If outliers (over 3 std away) were filtered out we lose 404 rows:


count    72740.000000
mean         0.083135
std          0.215550
min          0.000000
25%          0.000000
50%          0.000000
75%          0.065718
max          2.444988
Name: y_visits_all_nonblood_malignant_cancers_fwd3, dtype: float64


----

Outcome: y_visits_blood_diseases_fwd3


count    118306.000000
mean          0.331593
std           1.194500
min           0.000000
25%           0.000000
50%           0.140708
75%           0.417711
max         149.261157
Name: y_visits_blood_diseases_fwd3, dtype: float64

If outliers (over 3 std away) were filtered out we lose 680 rows:


count    117626.000000
mean          0.284462
std           0.433476
min           0.000000
25%           0.000000
50%           0.137589
75%           0.412541
max           3.903201
Name: y_visits_blood_diseases_fwd3, dtype: float64


----

Outcome: y_visits_blood_or_bv_diseases_fwd3


count    120335.000000
mean          0.390894
std           1.339085
min           0.000000
25%           0.000000
50%           0.178731
75%           0.492622
max         149.261157
Name: y_visits_blood_or_bv_diseases_fwd3, dtype: float64

If outliers (over 3 std away) were filtered out we lose 670 rows:


count    119665.000000
mean          0.336469
std           0.495614
min           0.000000
25%           0.000000
50%           0.175552
75%           0.485296
max           4.405752
Name: y_visits_blood_or_bv_diseases_fwd3, dtype: float64


----

Outcome: y_visits_blood_vessel_diseases_fwd3


count    88579.000000
mean         0.078135
std          0.626184
min          0.000000
25%          0.000000
50%          0.000000
75%          0.074036
max         84.803256
Name: y_visits_blood_vessel_diseases_fwd3, dtype: float64

If outliers (over 3 std away) were filtered out we lose 256 rows:


count    88323.000000
mean         0.062033
std          0.151850
min          0.000000
25%          0.000000
50%          0.000000
75%          0.072813
max          1.907778
Name: y_visits_blood_vessel_diseases_fwd3, dtype: float64


----

Outcome: y_visits_cardioresp_cancers_fwd3


count    10632.000000
mean         0.019074
std          0.656128
min          0.000000
25%          0.000000
50%          0.000000
75%          0.000000
max         66.666667
Name: y_visits_cardioresp_cancers_fwd3, dtype: float64

If outliers (over 3 std away) were filtered out we lose 6 rows:


count    10626.000000
mean         0.011177
std          0.081611
min          0.000000
25%          0.000000
50%          0.000000
75%          0.000000
max          1.805576
Name: y_visits_cardioresp_cancers_fwd3, dtype: float64


----

Outcome: y_visits_hematopoietic_cancers_fwd3


count    66363.000000
mean         0.134498
std          0.664703
min          0.000000
25%          0.000000
50%          0.000000
75%          0.114840
max         38.402458
Name: y_visits_hematopoietic_cancers_fwd3, dtype: float64

If outliers (over 3 std away) were filtered out we lose 502 rows:


count    65861.000000
mean         0.095008
std          0.209949
min          0.000000
25%          0.000000
50%          0.000000
75%          0.111475
max          2.128420
Name: y_visits_hematopoietic_cancers_fwd3, dtype: float64


----

Outcome: y_visits_injuries_fwd3


count    129380.000000
mean          6.169621
std          12.049327
min           0.000000
25%           1.853282
50%           4.509889
75%           7.922590
max        1000.000000
Name: y_visits_injuries_fwd3, dtype: float64

If outliers (over 3 std away) were filtered out we lose 687 rows:


count    128693.000000
mean          5.687429
std           5.631682
min           0.000000
25%           1.834382
50%           4.484305
75%           7.841292
max          42.137199
Name: y_visits_injuries_fwd3, dtype: float64


----

Outcome: y_visits_respiratory_fwd3


count    144730.000000
mean          8.221355
std          14.142988
min           0.000000
25%           1.010226
50%           5.434783
75%          11.058954
max        1000.000000
Name: y_visits_respiratory_fwd3, dtype: float64

If outliers (over 3 std away) were filtered out we lose 1093 rows:


count    143637.000000
mean          7.553922
std           8.165795
min           0.000000
25%           0.983284
50%           5.372215
75%          10.882163
max          50.644383
Name: y_visits_respiratory_fwd3, dtype: float64


----

Outcome: y_visits_type_1_diabetes_fwd3


count    65096.000000
mean         0.093490
std          1.854704
min          0.000000
25%          0.000000
50%          0.000000
75%          0.000000
max        223.563604
Name: y_visits_type_1_diabetes_fwd3, dtype: float64

If outliers (over 3 std away) were filtered out we lose 76 rows:


count    65020.000000
mean         0.065952
std          0.256183
min          0.000000
25%          0.000000
50%          0.000000
75%          0.000000
max          5.558767
Name: y_visits_type_1_diabetes_fwd3, dtype: float64


----



In [20]:
# for 12 month difference columns: y_col_names_lag_diff
for i in y_col_names_lag_diff:
    print(f"Outcome: {i}")
    display(df[i].describe())

    # df_temp_print = df[ np.abs(df[i]-df[i].mean()) <= min(3*df[i].std(), 1000*int(lag_time)) ]
    df_temp_print = df[ np.abs(df[i]-df[i].mean()) <= (3*df[i].std()) ]
    j = df[~df[i].isna()].shape[0]-df_temp_print[~df_temp_print[i].isna()].shape[0]
    print(f"If outliers (over 3 std away) were filtered out we lose {j} rows:")
    display(df_temp_print[i].describe())
    print("\n----\n")

Outcome: y_visits_all_malignant_cancers_fwd3_diff_r12


count    77659.000000
mean        -0.033094
std          1.015722
min        -73.637703
25%         -0.081269
50%          0.000000
75%          0.080366
max         24.075501
Name: y_visits_all_malignant_cancers_fwd3_diff_r12, dtype: float64

If outliers (over 3 std away) were filtered out we lose 725 rows:


count    76934.000000
mean        -0.004727
std          0.401358
min         -3.076923
25%         -0.077501
50%          0.000000
75%          0.079408
max          3.012336
Name: y_visits_all_malignant_cancers_fwd3_diff_r12, dtype: float64


----

Outcome: y_visits_all_nonblood_malignant_cancers_fwd3_diff_r12


count    66620.000000
mean        -0.033740
std          0.826200
min        -73.637703
25%          0.000000
50%          0.000000
75%          0.000000
max         11.001100
Name: y_visits_all_nonblood_malignant_cancers_fwd3_diff_r12, dtype: float64

If outliers (over 3 std away) were filtered out we lose 449 rows:


count    66171.000000
mean        -0.009232
std          0.275312
min         -2.486325
25%          0.000000
50%          0.000000
75%          0.000000
max          2.439495
Name: y_visits_all_nonblood_malignant_cancers_fwd3_diff_r12, dtype: float64


----

Outcome: y_visits_blood_diseases_fwd3_diff_r12


count    109991.000000
mean          0.017082
std           1.111794
min         -71.428571
25%          -0.099722
50%           0.000000
75%           0.156239
max          66.666667
Name: y_visits_blood_diseases_fwd3_diff_r12, dtype: float64

If outliers (over 3 std away) were filtered out we lose 1355 rows:


count    108636.000000
mean          0.022007
std           0.507314
min          -3.306375
25%          -0.095854
50%           0.000000
75%           0.153526
max           3.350084
Name: y_visits_blood_diseases_fwd3_diff_r12, dtype: float64


----

Outcome: y_visits_blood_or_bv_diseases_fwd3_diff_r12


count    111939.000000
mean          0.029374
std           1.337979
min         -84.803256
25%          -0.103757
50%           0.000000
75%           0.184143
max          66.666667
Name: y_visits_blood_or_bv_diseases_fwd3_diff_r12, dtype: float64

If outliers (over 3 std away) were filtered out we lose 1138 rows:


count    110801.000000
mean          0.032704
std           0.589523
min          -3.984064
25%          -0.101162
50%           0.000000
75%           0.180971
max           4.037712
Name: y_visits_blood_or_bv_diseases_fwd3_diff_r12, dtype: float64


----

Outcome: y_visits_blood_vessel_diseases_fwd3_diff_r12


count    81152.000000
mean         0.006351
std          0.736551
min        -84.803256
25%          0.000000
50%          0.000000
75%          0.000212
max         57.836900
Name: y_visits_blood_vessel_diseases_fwd3_diff_r12, dtype: float64

If outliers (over 3 std away) were filtered out we lose 221 rows:


count    80931.000000
mean         0.010355
std          0.200255
min         -2.192982
25%          0.000000
50%          0.000000
75%          0.000167
max          2.177285
Name: y_visits_blood_vessel_diseases_fwd3_diff_r12, dtype: float64


----

Outcome: y_visits_cardioresp_cancers_fwd3_diff_r12


count    9297.000000
mean       -0.018755
std         0.701188
min       -66.666667
25%         0.000000
50%         0.000000
75%         0.000000
max         0.322789
Name: y_visits_cardioresp_cancers_fwd3_diff_r12, dtype: float64

If outliers (over 3 std away) were filtered out we lose 6 rows:


count    9291.000000
mean       -0.009722
std         0.083292
min        -1.805576
25%         0.000000
50%         0.000000
75%         0.000000
max         0.322789
Name: y_visits_cardioresp_cancers_fwd3_diff_r12, dtype: float64


----

Outcome: y_visits_hematopoietic_cancers_fwd3_diff_r12


count    59928.000000
mean        -0.030020
std          0.671133
min        -38.402458
25%         -0.020599
50%          0.000000
75%          0.001457
max         14.459225
Name: y_visits_hematopoietic_cancers_fwd3_diff_r12, dtype: float64

If outliers (over 3 std away) were filtered out we lose 620 rows:


count    59308.000000
mean        -0.007444
std          0.251637
min         -2.041616
25%         -0.005956
50%          0.000000
75%          0.001315
max          1.956063
Name: y_visits_hematopoietic_cancers_fwd3_diff_r12, dtype: float64


----

Outcome: y_visits_injuries_fwd3_diff_r12


count    120486.000000
mean          1.066814
std          13.041632
min       -1000.000000
25%          -0.224195
50%           0.389368
75%           1.987654
max        1000.000000
Name: y_visits_injuries_fwd3_diff_r12, dtype: float64

If outliers (over 3 std away) were filtered out we lose 674 rows:


count    119812.000000
mean          0.920229
std           4.584674
min         -37.982082
25%          -0.219896
50%           0.386977
75%           1.970359
max          39.953654
Name: y_visits_injuries_fwd3_diff_r12, dtype: float64


----

Outcome: y_visits_respiratory_fwd3_diff_r12


count    135829.000000
mean          0.757231
std          13.195394
min       -1000.000000
25%          -0.813746
50%           0.151220
75%           2.213538
max        1000.000000
Name: y_visits_respiratory_fwd3_diff_r12, dtype: float64

If outliers (over 3 std away) were filtered out we lose 963 rows:


count    134866.000000
mean          0.702583
std           5.735152
min         -38.809258
25%          -0.799648
50%           0.150196
75%           2.183570
max          40.322581
Name: y_visits_respiratory_fwd3_diff_r12, dtype: float64


----

Outcome: y_visits_type_1_diabetes_fwd3_diff_r12


count    58082.000000
mean        -0.006648
std          1.918203
min       -223.563604
25%          0.000000
50%          0.000000
75%          0.000000
max         19.704433
Name: y_visits_type_1_diabetes_fwd3_diff_r12, dtype: float64

If outliers (over 3 std away) were filtered out we lose 81 rows:


count    58001.000000
mean         0.011049
std          0.316518
min         -5.712523
25%          0.000000
50%          0.000000
75%          0.000000
max          5.712523
Name: y_visits_type_1_diabetes_fwd3_diff_r12, dtype: float64


----



Prepare Data for Modeling

In [21]:
# filter data to appropriate data range
df = df[df.year >= 2002]

print(min(df.year))
print(max(df.year))

2002
2017


In [22]:
# Select variables for modeling
date_var = 'year_month' 
zip_var = 'school_zip'
y_var_s2 = medical_target   # need to change this so it runs on Cornelia's PC
print(f"medical target: {y_var_s2}")

# stage 1 variables
instruments_cols = [predictor]

stage_1_IVs = [s + IV_lead_input for s in instruments_cols]
stage_1_target = [target_name_s1]

# stage 2 variables
stage_2_HO_targets = [s + HO_lag_input for s in y_col_names]

num_vars = ['school_elevation_m', 'nearby_point_source_count', 'school_wspd', \
            'tax_liability_per_capita', 'school_temperature', 'school_count', 'pm25_r6', 'pm25_r12']
counties = [i for i in df.columns if re.search('^school_county_v2_', i)]
months = [i for i in df.columns if re.search ('^month_', i)]
# potentially use county_month instead of the above 

xvars = num_vars + counties + months 
yvar = [y_var_s2]

medical target: y_visits_all_malignant_cancers_fwd3_diff_r12


# xgb stage 1

Setups

In [23]:
basics = counties + months + ['year_trend']
basics_str = " ~ school_county_v2 + month + year_trend * C(county_month)"
env = ['avg_temp'+IV_lead_input , 'avg_elevation_diff_m']
env_str = '+ avg_temp' + IV_lead_input + ' + avg_elevation_diff_m'


if FE_set_num == 1:
    # FE Set 1
    adds = []
    adds_str = ""
elif FE_set_num == 2:
    # FE Set 2
    adds = ['ca_agi_per_returns', 'total_population']
    adds_str = ' + ca_agi_per_returns + total_population'
elif FE_set_num == 3:
    # FE Set 3
    adds = ['school_count', 'total_population']
    adds_str = ' + school_count + total_population'
elif FE_set_num == 4:
    # FE Set 4
    adds = ['total_population', 'avg_count_ps_within_5km']
    adds_str = ' + total_population + avg_count_ps_within_5km'
elif FE_set_num == 5:
    # FE Set 5
    adds = ['ca_agi_per_returns']
    adds_str = ' + ca_agi_per_returns'
elif FE_set_num == 6:
    # FE Set 6
    adds = ['total_population']
    adds_str = ' + total_population'
elif FE_set_num == 7:
    # FE Set 7
    adds = ['ca_agi_per_returns', 'total_population', 'avg_wspd_top_15_r' + str(lead_time)]
    adds_str = ' + ca_agi_per_returns + total_population + avg_wspd_top_15_r' + str(lead_time)
elif FE_set_num == 8:
    env = []
    adds = []
    adds_str = ''
    env_str = ''
elif FE_set_num == 9:
    # FE Set 9
    adds = ['avg_wspd_top_15_r' + str(lead_time)]
    env = []
    adds_str = 'avg_wspd_top_15_r' + str(lead_time)
    env_str = ""




fixed_effects_cols = basics + env + adds
fixed_effects_cols_str = basics_str + env_str + adds_str

print("basics:\n{}\n".format(basics))
print("env:\n{}\n".format(env))

print("Fixed effects are:\n{}".format(fixed_effects_cols))
print("Fixed effects String is:\n{}".format(fixed_effects_cols_str))

basics:
['school_county_v2_alameda', 'school_county_v2_alpine', 'school_county_v2_amador', 'school_county_v2_butte', 'school_county_v2_calaveras', 'school_county_v2_colusa', 'school_county_v2_contra_costa', 'school_county_v2_del_norte', 'school_county_v2_el_dorado', 'school_county_v2_fresno', 'school_county_v2_glenn', 'school_county_v2_humboldt', 'school_county_v2_imperial', 'school_county_v2_inyo', 'school_county_v2_kern', 'school_county_v2_kings', 'school_county_v2_lake', 'school_county_v2_lassen', 'school_county_v2_los_angeles', 'school_county_v2_madera', 'school_county_v2_marin', 'school_county_v2_mariposa', 'school_county_v2_mendocino', 'school_county_v2_merced', 'school_county_v2_modoc', 'school_county_v2_mono', 'school_county_v2_monterey', 'school_county_v2_napa', 'school_county_v2_nevada', 'school_county_v2_orange', 'school_county_v2_placer', 'school_county_v2_plumas', 'school_county_v2_riverside', 'school_county_v2_sacramento', 'school_county_v2_san_benito', 'school_county_v2_

In [24]:
# check if all these columns are in the dataframe
in_col_list = [True if i in df.columns else i for i in fixed_effects_cols ]

print(f"These should be all true if all fixed effects are in df.columns:\n{in_col_list}")

print(f"\nStage 1 Variables---\n")
print(f"target_name_s1: {target_name_s1}\nIn df columns? {target_name_s1 in df.columns}\n")
print(f"predictor_name_s1: {predictor_name_s1}\nIn df columns? {predictor_name_s1 in df.columns}\n")

print(f"Fixed Effects (fixed_effects_cols): {fixed_effects_cols}\n")

target_name_s1_predictions = target_name_s1 + "_hat"
print(f"Saving predictions (target_name_s1_predictions) as `{target_name_s1_predictions}`")

# create a df for modeling stage 1: drops nulls in all columns used
df_model_s1 = df.dropna(subset=([target_name_s1, predictor_name_s1] + fixed_effects_cols))

print(f"Size of df before filtering for modeling: {df.shape}")
print(f"Size of df after filtering for modeling: {df_model_s1.shape}")


X_s1 = df_model_s1[[predictor_name_s1] + fixed_effects_cols]
y_s1 = df_model_s1[target_name_s1]



These should be all true if all fixed effects are in df.columns:
[True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True]

Stage 1 Variables---

target_name_s1: pm25_r9
In df columns? True

predictor_name_s1: Izmy_v4_nodist_normed_TPY_r9
In df columns? True

Fixed Effects (fixed_effects_cols): ['school_county_v2_alameda', 'school_county_v2_alpine', 'school_county_v2_amador', 'school_county_v2_butte', 'school_county_v2_calaveras', 'school_county_v2_colusa', 'school_county_v2_contra_costa', 'school_county_v2_del_norte', 'school_county_v2_el_dorado', 'school_county_v2_fresno', 'school_county_v2_g

In [25]:
# need to sort the values and use the index to sort things for TimeSeriesSplit
df_model_s1.sort_values(by=['year_month'], inplace=True)
df_model_s1 = df_model_s1.reset_index(drop=True)

In [26]:
# sklearn api version
def time_series_cv(
  df: pd.DataFrame, 
  xvars: list, 
  yvar: str, 
  hyperparams: dict = {'max_depth': [1, 5, 10], 'subsample': [.8, 1], 'eta': [.1, .3]}, 
  search_type='grid', 
  folds=5, 
  verbose=1):

  ''' 
  Inputs:
  - df: dataframe of your training data
  - xvars: a list of all the xvars to pass to xgboost
  - yvar: string of your target variable
  - verbose: optionality for diff amounts of printouts. Can be 0, 1, 2. 0 = silent, 1 = update after each fold, 2 = update after every single hyperparam combination. 
  - hyperparams: this must be a dictionary of lists. So each key is a xgb hyperparam, then it must have a list of values to tune with. 
    See the default for an example. Can put in an arbitrary number of hyperparam options. 
  
  Output:
  - dictionary with the following keys: ['fold', 'hyperparams', 'rmse_train', 'rmse_test']. 
  eval(best_hyperparams)
  - dictionary with the best hyperparameters to retrain model
  '''

  # need to sort the values and use the index to sort things
  df.sort_values(by=['year_month'], inplace=True)
  df = df.reset_index(drop=True)

  # this dictionary will hold all the final results
  final_res = {'fold':[], 'hyperparams':[], 'rmse_train': [], 'rmse_test': [],
                'huber_loss_train': [], 'huber_loss_test': []}

  # get only necessary fields in df
  df = df[xvars + [yvar]]

  # set up the time series split class, to do an expanding window cross fold. 
  tss = TimeSeriesSplit(n_splits=folds)
  tss_folds = tss.split(df)
  all_folds = [i for i in tss_folds]

  # get all combinations of hyperparams
  def expand_grid(hyperparams):
    keys = list(hyperparams.keys())
    hyperparams_df = pd.DataFrame(np.array(np.meshgrid(*[hyperparams[key_i] for key_i in keys])).T.reshape(-1, len(keys)))
    hyperparams_df.columns = keys 
    return hyperparams_df

  df_hyperparams = expand_grid(hyperparams)

  # loss functions
  def get_rmse(df_train, model):
    ytrue = df_train[yvar].values.flatten()
    yhat = model.predict(df_train.drop(columns=yvar))
    rmse = np.mean(((ytrue - yhat)**2)**.5)
    return rmse 
  
  def get_huber_loss(df_train, model):
    # # Let the delta for Huber Loss be 2*standard deviation of non-zero entries for the y variable
    # twice_std = 2 * df_train[df_train[yvar] > 0][yvar].std  

    # Let the delta for Huber Loss be 2*standard deviation of the y variable
    twice_std = 2 * df_train[yvar].std() 

    def huber_loss(y_actual,y_predicted,delta):
      # https://towardsdatascience.com/understanding-loss-functions-the-smart-way-904266e9393
      # approaches MSE for small error an approaches MAE in case of outliers.
      delta = 5
      total_points = y_actual.size
      total_error = 0
      for i in range(total_points):
        error = np.absolute(y_predicted[i] - y_actual[i])
        if error < delta:
          huber_error = (error*error)/2
        else:
          huber_error = delta*(error - (0.5*delta))
        total_error+=huber_error
      total_huber_error = total_error/total_points
      return total_huber_error  # mean huber_loss

    ytrue = df_train[yvar].values.flatten()
    yhat = model.predict(df_train.drop(columns=yvar))

    huber_loss_val = huber_loss(y_actual=ytrue, y_predicted=yhat, delta=twice_std)

    return huber_loss_val

  # loop over each expanding time series window
  for fold_count,fold in enumerate(all_folds):
    if verbose > 0:
      print('Working on fold {}/{}'.format(fold_count+1, folds))

    df_train = df.loc[fold[0]]
    df_test = df.loc[fold[1]]

    # within each time series cross fold, perform a grid search with all hyperparam combinations and evaluate results. 
    if search_type == 'grid':
      for param_set_i in range(df_hyperparams.shape[0]):
        hyperparams_i = {x:y for x,y in zip(df_hyperparams.columns, df_hyperparams.loc[param_set_i].to_list())}
        
        # fix datatype for some vars
        def fix_int(var, hyperparams_i):
          if var in hyperparams_i.keys():
            hyperparams_i[var] = int(hyperparams_i[var])
          
          return hyperparams_i
        
        hyperparams_i = fix_int('max_depth', hyperparams_i)
        hyperparams_i = fix_int('n_estimators', hyperparams_i)

        # if adding hyperparams based on integeters, do this fix_int so it isnt converted to float
        

        # fit xgb
        xgb_reg = xgb.XGBRegressor(booster="gbtree", **hyperparams_i, verbosity=0, random_state=20)
        xgb_reg = xgb_reg.fit(X=df_train[xvars], y=df_train[yvar], eval_set=[(df_test[xvars], df_test[yvar])], verbose=0)
        
        # save results
        rmse_train = get_rmse(df_train, xgb_reg)
        rmse_test = get_rmse(df_test, xgb_reg)
        huber_train = get_huber_loss(df_train, xgb_reg)
        huber_test = get_huber_loss(df_test, xgb_reg)
        final_res['fold'].append(fold_count)
        final_res['hyperparams'].append(hyperparams_i)
        final_res['rmse_train'].append(rmse_train)
        final_res['rmse_test'].append(rmse_test)
        final_res['huber_loss_train'].append(huber_train)
        final_res['huber_loss_test'].append(huber_test)

        if verbose == 2:
          print('{}: rmse train: {:.3f}, rmse test: {:.3f}'.format(hyperparams_i, rmse_train, rmse_test))

    elif search_type == 'random': 
      pass 
      # haven't done this yet
  
  # print out final best hyperparams before returning the output
  output2 = pd.DataFrame({
    'hyperparams': final_res['hyperparams'],
    'fold': final_res['fold'],
    'rmse_train': final_res['rmse_train'],
    'rmse_test': final_res['rmse_test']
  })
  output2['hyperparams'] = output2['hyperparams'].astype(str)
  output2 = output2.groupby('hyperparams')[['rmse_train', 'rmse_test']].mean().reset_index().sort_values('rmse_test')
  print('best hyperparams: {}'.format(output2.iloc[0,0]))
  print(f"best RMSE test: {output2.iloc[0]['rmse_test']}")

  best_hyperparams = output2.iloc[0,0]

  
  return final_res, eval(best_hyperparams)
  

Running Sklearn CV function for Stage 1

In [27]:
# specify best hyperparams for Stage 1

# hard code this because we don't want to retune many times on Cornelia's PC

best_params_s1 = {'max_depth': 18, 'n_estimators': 180, 'subsample': 0.8, 'eta': 0.1}

In [28]:
# fit one more model w/ best hyperparams
xgbr_s1_best = xgb.XGBRegressor(**best_params_s1, random_state=20)
xgbr_s1_best.fit(X_s1, y_s1)
y_hat = xgbr_s1_best.predict(X_s1)

# make sure its the same
rmse_val = np.mean(((y_s1 - y_hat)**2)**.5)
print(f"RMSE of the best Stage 1 Model: {rmse_val}")

df_model_s1[target_name_s1+"_hat"] = y_hat # generate predicted pm2.5

columns = [target_name_s1, target_name_s1+'_hat']

# compute visits by patzip_year_month
fig, axes = plt.subplots(1, 2, sharex=False, sharey=False, figsize=(10, 5))

for idx, ax in enumerate(axes.flatten()):
    sns.histplot(
            df_model_s1[columns[idx]],
            ax=ax
        )

RMSE of the best Stage 1 Model: 0.11657807139995723


# xgb stage 2


Making the df_model for Stage 2: `df_model_s2`

In [29]:
# Prepare dataset

print(f"Name of the target variable for Stage 2: {y_var_s2} \nIn df_model_s1 columns? {y_var_s2 in df_model_s1.columns}\n")

predictor_name_s2 = target_name_s1 + "_hat"

print(f"Name of the predictor variable for Stage 2: {predictor_name_s2} \nIn df_model_s1 columns? {predictor_name_s2 in df_model_s1.columns}\n")

# drop na's in columns of interes, sort by year_month and reset index
df_model_s2 = df_model_s1.dropna(subset=([y_var_s2, predictor_name_s2] + fixed_effects_cols))
df_model_s2 = df_model_s2.sort_values('year_month').reset_index(drop=True)

print(f"Size of 2nd stage df before filtering for modeling: {df_model_s1.shape}")
print(f"Size of 2nd stage df after filtering for modeling: {df_model_s2.shape}")

Name of the target variable for Stage 2: y_visits_all_malignant_cancers_fwd3_diff_r12 
In df_model_s1 columns? True

Name of the predictor variable for Stage 2: pm25_r9_hat 
In df_model_s1 columns? True

Size of 2nd stage df before filtering for modeling: (262450, 238)
Size of 2nd stage df after filtering for modeling: (77571, 238)


In [30]:
# optional code to filter outliers

if filter_medical_outliers:
    print(f"Outcome: {y_var_s2}")
    display(df_model_s2[y_var_s2].describe())
    old_row_count = df_model_s2.shape[0]

    # df_model_s2 = df_model_s2[ np.abs(df_model_s2[y_var_s2]-df_model_s2[y_var_s2].mean()) <= min(3*df_model_s2[y_var_s2].std(), 1000*int(lag_time)) ]
    df_model_s2 = df_model_s2[ np.abs(df_model_s2[y_var_s2]-df_model_s2[y_var_s2].mean()) <= (3*df_model_s2[y_var_s2].std()) ]
    new_row_count = df_model_s2.shape[0]
    print(f"Outliers (over 3 std away) were filtered out, We drop {old_row_count - new_row_count} rows:")
    display(df_model_s2[y_var_s2].describe())

Outcome: y_visits_all_malignant_cancers_fwd3_diff_r12


count    77571.000000
mean        -0.032173
std          0.987722
min        -73.637703
25%         -0.081174
50%          0.000000
75%          0.080592
max         24.075501
Name: y_visits_all_malignant_cancers_fwd3_diff_r12, dtype: float64

Outliers (over 3 std away) were filtered out, We drop 767 rows:


count    76804.000000
mean        -0.003793
std          0.395213
min         -2.993349
25%         -0.077122
50%          0.000000
75%          0.079697
max          2.922746
Name: y_visits_all_malignant_cancers_fwd3_diff_r12, dtype: float64

In [31]:
# split out X and y for stage 2 predictions
X_s2 = df_model_s2[[predictor_name_s2] + fixed_effects_cols]
y_s2 = df_model_s2[y_var_s2]

For reference: best XGB CV results for notebook 06:

- {'max_depth': 18, 'subsample': 0.8, 'eta': 0.01, 'n_estimators': 180, 'reg_lambda': 1.0}
- best RMSE test: 1.4724818369485975

In [32]:
# tune XGBoost for Stage 2
output_s2, best_params_s2 = time_series_cv(df_model_s2, 
  xvars = ([predictor_name_s2] + fixed_effects_cols), 
  yvar = y_var_s2, 
  hyperparams = {'max_depth': [18], 
                  'subsample': [.8, 1], 
                  'eta': [0.01, 0.03], 
                  'n_estimators': [180],
                  'reg_lambda': [0.01, 0.1, 1]}, 
  search_type = 'grid', 
  folds = 5, 
  verbose=1)

Working on fold 1/5
Working on fold 2/5
Working on fold 3/5
Working on fold 4/5
Working on fold 5/5
best hyperparams: {'max_depth': 18, 'subsample': 0.8, 'eta': 0.01, 'n_estimators': 180, 'reg_lambda': 1.0}
best RMSE test: 0.2652389645140029


Search space takes ~1hour on Anand's Laptop with 32gb Memory:

```
hyperparams = {'max_depth': [18], 
                  'subsample': [.8, 1], 
                  'eta': [0.01, 0.03], 
                  'n_estimators': [180],
                  'reg_lambda': [0.01, 0.1, 1]}
```

In [33]:
best_params_s2  # best hyperparams for stage 2

{'max_depth': 18,
 'subsample': 0.8,
 'eta': 0.01,
 'n_estimators': 180,
 'reg_lambda': 1.0}

In [34]:
# Tabulate CV results
output_s2_df = pd.DataFrame({
  'hyperparams': output_s2['hyperparams'],
  'fold': output_s2['fold'],
  'rmse_train': output_s2['rmse_train'],
  'rmse_test': output_s2['rmse_test'],
  'huber_loss_train': output_s2['huber_loss_train'],
  'huber_loss_test': output_s2['huber_loss_test']
})

output_s2_df['hyperparams'] = output_s2_df['hyperparams'].astype(str)

output_s2_grp = output_s2_df.groupby('hyperparams')[['rmse_train', 'rmse_test', \
'huber_loss_train', 'huber_loss_test']].mean().reset_index().sort_values('rmse_test')

pd.set_option('max_colwidth', None)
output_s2_grp[['hyperparams', 'rmse_test']].head(20)

Unnamed: 0,hyperparams,rmse_test
2,"{'max_depth': 18, 'subsample': 0.8, 'eta': 0.01, 'n_estimators': 180, 'reg_lambda': 1.0}",0.265239
1,"{'max_depth': 18, 'subsample': 0.8, 'eta': 0.01, 'n_estimators': 180, 'reg_lambda': 0.1}",0.268513
0,"{'max_depth': 18, 'subsample': 0.8, 'eta': 0.01, 'n_estimators': 180, 'reg_lambda': 0.01}",0.269033
5,"{'max_depth': 18, 'subsample': 0.8, 'eta': 0.03, 'n_estimators': 180, 'reg_lambda': 1.0}",0.271545
8,"{'max_depth': 18, 'subsample': 1.0, 'eta': 0.01, 'n_estimators': 180, 'reg_lambda': 1.0}",0.274228
4,"{'max_depth': 18, 'subsample': 0.8, 'eta': 0.03, 'n_estimators': 180, 'reg_lambda': 0.1}",0.278819
3,"{'max_depth': 18, 'subsample': 0.8, 'eta': 0.03, 'n_estimators': 180, 'reg_lambda': 0.01}",0.279817
7,"{'max_depth': 18, 'subsample': 1.0, 'eta': 0.01, 'n_estimators': 180, 'reg_lambda': 0.1}",0.281291
11,"{'max_depth': 18, 'subsample': 1.0, 'eta': 0.03, 'n_estimators': 180, 'reg_lambda': 1.0}",0.281651
6,"{'max_depth': 18, 'subsample': 1.0, 'eta': 0.01, 'n_estimators': 180, 'reg_lambda': 0.01}",0.282697


Train final Stage 2 Model with Best Parameters

- Notebook 06 best results: Stage 2 RMSE = 0.7906362606392728

In [35]:
# Train the model using best stage 2's hyper parameters
# fit one more model w/ best hyperparams
xgbr_s2_best = xgb.XGBRegressor(**best_params_s2, random_state=20)
xgbr_s2_best.fit(X_s2, y_s2)
y_s2_hat = xgbr_s2_best.predict(X_s2)

# make sure its the same
rmse_val_s2 = np.mean(((y_s2 - y_s2_hat)**2)**.5)
print(f"RMSE of the best Stage 2 Model: {rmse_val_s2}")

# generate predicted health outcomes
df_model_s2[y_var_s2+"_hat"] = y_s2_hat

columns = [y_var_s2, y_var_s2+"_hat"]

# compute visits by patzip_year_month
fig, axes = plt.subplots(1, 2, sharex=False, sharey=False, figsize=(10, 5))

for idx, ax in enumerate(axes.flatten()):
    ax.set_xlim(0, 15)
    sns.histplot(
            df_model_s2[columns[idx]],
            ax=ax
        )

RMSE of the best Stage 2 Model: 0.19554708888144037


## Counterfactual Generation: Make the 1%, 10%, 25% mitigation pm25_hat columns

Make these columns, run the s2 XGB on these columns to get their new HO predicted values

Save out csv with columns.

In [36]:
print(f"Rows originally: {df_model_s2.shape[0]}")

# we don't want to be predicting negative rates (our original y variable), but it is okay to predict negative differences
# filter out negative rates and see how many we dropped
if y_var_s2 not in y_col_names_lag_diff:
    print(f"Rows filtered out: {df_model_s2.shape[0] - df_model_s2[df_model_s2[y_var_s2 + '_hat'] >= 0].shape[0]}")

    df_model_s2 = df_model_s2[df_model_s2[y_var_s2 + '_hat'] >= 0]  # filter out the places we predict negative health outcome


Rows originally: 76804


Creates the Counterfactual Differences Columns

- Each column gives the difference between the health outcome with X% reduction in predicted PM2.5 and with 0% reduction in predicted PM2.5.
- Each column begins with `'Difference_'` and ends with `'_X%_reduction'` where `X` is the percent we hypothetically reduce PM2.5.

In [37]:

# Make Column of 0.99*df_model_s2[predictor_name_s2]
reduced_pm25_col_name = predictor_name_s2 + "_1%_reduction"
df_model_s2[reduced_pm25_col_name] = 0.99 * df_model_s2[predictor_name_s2]
X_temp = df_model_s2[[reduced_pm25_col_name] + fixed_effects_cols]

 # XGBRegressor needs columns to be named the same as during predict, so rename our prediction column for Stage 2
X_temp.rename(columns={reduced_pm25_col_name: predictor_name_s2}, inplace=True)

# Predict Health Outcome for this column, add as column to df_model_s2
df_model_s2[y_var_s2 + "_hat" + "_1%_reduction"] = xgbr_s2_best.predict(X_temp)
df_model_s2["Difference_" + y_var_s2 + "_hat" + "_1%_reduction"] = df_model_s2[y_var_s2 + "_hat" + "_1%_reduction"] - df_model_s2[y_var_s2 + "_hat"]


# Make Column of 0.90*df_model_s2[predictor_name_s2]
reduced_pm25_col_name = predictor_name_s2 + "_10%_reduction"
df_model_s2[reduced_pm25_col_name] = 0.90 * df_model_s2[predictor_name_s2]
X_temp = df_model_s2[[reduced_pm25_col_name] + fixed_effects_cols]
X_temp.rename(columns={reduced_pm25_col_name: predictor_name_s2}, inplace=True)


# Predict Health Outcome for this column, add as column to df_model_s2
df_model_s2[y_var_s2 + "_hat" + "_10%_reduction"] = xgbr_s2_best.predict(X_temp)
df_model_s2["Difference_" + y_var_s2 + "_hat" + "_10%_reduction"] = df_model_s2[y_var_s2 + "_hat" + "_10%_reduction"] - df_model_s2[y_var_s2 + "_hat"]


# Make Column of 0.75*df_model_s2[predictor_name_s2]
reduced_pm25_col_name = predictor_name_s2 + "_25%_reduction"
df_model_s2[reduced_pm25_col_name] = 0.75 * df_model_s2[predictor_name_s2]
X_temp = df_model_s2[[reduced_pm25_col_name] + fixed_effects_cols]
X_temp.rename(columns={reduced_pm25_col_name: predictor_name_s2}, inplace=True)


# Predict Health Outcome for this column, add as column to df_model_s2
df_model_s2[y_var_s2 + "_hat" + "_25%_reduction"] = xgbr_s2_best.predict(X_temp)
df_model_s2["Difference_" + y_var_s2 + "_hat" + "_25%_reduction"] = df_model_s2[y_var_s2 + "_hat" + "_25%_reduction"] - df_model_s2[y_var_s2 + "_hat"]


Adding fixed versions of difference columns to make the counterfactual map legend color bars less weird.

- Fixed versions keep the data to plot that only has mitigation showing no reduction or some reduction in health outcome.
- All have suffix of `'_fixed'` ad the end.

In [38]:
df_model_s2["Difference_" + y_var_s2 + "_hat" + "_1%_reduction_fixed"] = df_model_s2["Difference_" + y_var_s2 + "_hat" + "_1%_reduction"] \
                                                                    .mask(
                                                                       df_model_s2["Difference_" + y_var_s2 + "_hat" + "_1%_reduction"] > 0, 0 
                                                                    )


df_model_s2["Difference_" + y_var_s2 + "_hat" + "_10%_reduction_fixed"] = df_model_s2["Difference_" + y_var_s2 + "_hat" + "_10%_reduction"] \
                                                                    .mask(
                                                                       df_model_s2["Difference_" + y_var_s2 + "_hat" + "_10%_reduction"] > 0, 0 
                                                                    )

df_model_s2["Difference_" + y_var_s2 + "_hat" + "_25%_reduction_fixed"] = df_model_s2["Difference_" + y_var_s2 + "_hat" + "_25%_reduction"] \
                                                                    .mask(
                                                                       df_model_s2["Difference_" + y_var_s2 + "_hat" + "_25%_reduction"] > 0, 0 
                                                                    )

In [39]:
# List the columns added for Counterfactual Analysis
cols_added_counterfactual = [i for i in df_model_s2.columns if "Difference_" in i]

print(f"Columns added for the counterfactual analysis: {cols_added_counterfactual}")

Columns added for the counterfactual analysis: ['Difference_y_visits_all_malignant_cancers_fwd3_diff_r12_hat_1%_reduction', 'Difference_y_visits_all_malignant_cancers_fwd3_diff_r12_hat_10%_reduction', 'Difference_y_visits_all_malignant_cancers_fwd3_diff_r12_hat_25%_reduction', 'Difference_y_visits_all_malignant_cancers_fwd3_diff_r12_hat_1%_reduction_fixed', 'Difference_y_visits_all_malignant_cancers_fwd3_diff_r12_hat_10%_reduction_fixed', 'Difference_y_visits_all_malignant_cancers_fwd3_diff_r12_hat_25%_reduction_fixed']


In [40]:
# check out the relevant columns needed for the counterfactual plot
counter_fact_columns = ['year_month', 'school_zip'] + [i for i in df_model_s2.columns if predictor_name_s2 in i] + cols_added_counterfactual
df_model_s2[counter_fact_columns].head(10)

Unnamed: 0,year_month,school_zip,pm25_r9_hat,pm25_r9_hat_1%_reduction,pm25_r9_hat_10%_reduction,pm25_r9_hat_25%_reduction,Difference_y_visits_all_malignant_cancers_fwd3_diff_r12_hat_1%_reduction,Difference_y_visits_all_malignant_cancers_fwd3_diff_r12_hat_10%_reduction,Difference_y_visits_all_malignant_cancers_fwd3_diff_r12_hat_25%_reduction,Difference_y_visits_all_malignant_cancers_fwd3_diff_r12_hat_1%_reduction_fixed,Difference_y_visits_all_malignant_cancers_fwd3_diff_r12_hat_10%_reduction_fixed,Difference_y_visits_all_malignant_cancers_fwd3_diff_r12_hat_25%_reduction_fixed
0,2002-01-01,94954,14.52686,14.381592,13.074174,10.895145,-0.000497,0.00069,0.008096,-0.000497,0.0,0.0
22,2002-01-01,94541,14.298583,14.155598,12.868725,10.723937,0.075358,0.08007,0.084902,0.0,0.0,0.0
21,2002-01-01,95682,11.973226,11.853494,10.775903,8.979919,0.005953,0.010681,0.004392,0.0,0.0,0.0
20,2002-01-01,95828,11.831411,11.713098,10.64827,8.873558,-0.022529,-0.017721,-0.014108,-0.022529,-0.017721,-0.014108
19,2002-01-01,94565,16.460915,16.296305,14.814823,12.345686,-0.05838,-0.060703,-0.055591,-0.05838,-0.060703,-0.055591
17,2002-01-01,94577,14.509278,14.364185,13.058351,10.881959,0.008056,0.007728,0.011101,0.0,0.0,0.0
16,2002-01-01,94544,14.189492,14.047598,12.770543,10.642119,0.014504,0.014866,0.015697,0.0,0.0,0.0
15,2002-01-01,94122,20.757551,20.549976,18.681795,15.568163,-0.075197,-0.078701,-0.040787,-0.075197,-0.078701,-0.040787
14,2002-01-01,95136,13.957859,13.81828,12.562073,10.468394,-0.04708,-0.043089,-0.042779,-0.04708,-0.043089,-0.042779
13,2002-01-01,92124,17.390959,17.217049,15.651862,13.043219,-0.046434,-0.049771,-0.049712,-0.046434,-0.049771,-0.049712


In [41]:
# summary statistics on our counterfactual column
df_model_s2[cols_added_counterfactual].describe()

Unnamed: 0,Difference_y_visits_all_malignant_cancers_fwd3_diff_r12_hat_1%_reduction,Difference_y_visits_all_malignant_cancers_fwd3_diff_r12_hat_10%_reduction,Difference_y_visits_all_malignant_cancers_fwd3_diff_r12_hat_25%_reduction,Difference_y_visits_all_malignant_cancers_fwd3_diff_r12_hat_1%_reduction_fixed,Difference_y_visits_all_malignant_cancers_fwd3_diff_r12_hat_10%_reduction_fixed,Difference_y_visits_all_malignant_cancers_fwd3_diff_r12_hat_25%_reduction_fixed
count,76804.0,76804.0,76804.0,76804.0,76804.0,76804.0
mean,3.5e-05,0.000275,0.001168,-0.040516,-0.039786,-0.0388
std,0.175684,0.171671,0.168071,0.117597,0.114723,0.111789
min,-1.916411,-1.844287,-1.798303,-1.916411,-1.844287,-1.798303
25%,-0.028806,-0.028388,-0.027704,-0.028806,-0.028388,-0.027704
50%,7.6e-05,0.000365,0.000891,0.0,0.0,0.0
75%,0.028886,0.02929,0.030117,0.0,0.0,0.0
max,2.007987,1.922167,1.953944,0.0,0.0,0.0


In [42]:
# number of rows total
df_model_s2.shape

(76804, 251)

In [43]:
# number of rows where 1% reduction reduced health outcome
df_model_s2[df_model_s2[y_var_s2 + '_hat'] > df_model_s2[y_var_s2 + '_hat_1%_reduction']].shape

(38304, 251)

In [44]:
# number of rows where 10% reduction reduced health outcome
df_model_s2[df_model_s2[y_var_s2 + '_hat'] > df_model_s2[y_var_s2 + '_hat_10%_reduction']].shape

(38022, 251)

In [45]:
# number of rows where 25% reduction reduced health outcome
df_model_s2[df_model_s2[y_var_s2 + '_hat'] > df_model_s2[y_var_s2 + '_hat_25%_reduction']].shape

(37461, 251)

In [46]:
# Take a look at the top 50 counties where 25% reduction in pollution reduced health outcomes, by frequency.
df_model_s2[df_model_s2[y_var_s2 + '_hat'] > df_model_s2[y_var_s2 + '_hat_25%_reduction']]['school_zip'].value_counts().head(50)

95458    138
95912    136
95454    130
92365    127
93266    125
93256    121
92121    120
96132    118
96118    118
92661    118
95988    118
95135    114
93647    114
94517    112
95461    112
93428    110
95324    109
94571    108
95453    107
92887    107
93624    105
95328    105
92283    105
94132    104
95547    104
92124    104
95465    103
94965    103
92679    103
92782    102
95322    102
93453    101
92866    101
92612    101
95631    100
94939     99
92548     99
95922     99
94949     98
95949     98
95672     98
95987     98
92391     97
94618     97
93311     97
95326     97
95357     97
95677     96
91364     96
95315     95
Name: school_zip, dtype: int64

Saving Stage 2 csv: 

In [47]:
counterfactual_csv_name = notebook_index + "_XGB_s2_INSTRUMENT_" + predictor_name_s1 + "_FE_SET_" + str(FE_set_num) + "_TARGETING_" + y_var_s2

print(counterfactual_csv_name)

diff16_no_outliers_fn_T_XGB_s2_INSTRUMENT_Izmy_v4_nodist_normed_TPY_r9_FE_SET_7_TARGETING_y_visits_all_malignant_cancers_fwd3_diff_r12


In [48]:
# dropping health data before saving csv
df_model_s2.drop(columns=(num_visits_col_names + y_col_names), inplace=True)    
df_model_s2.to_csv(os.path.join(out_dir_xgb, counterfactual_csv_name + ".csv"))

### Save XGBoost Models
- save as `.txt`

Stage 1

In [49]:
stage1_model_name = notebook_index + "_XGB_MODEL_s1_INSTRUMENT_" + predictor_name_s1 + "_FE_SET_" + str(FE_set_num) + "_TARGETING_" + y_var_s2
stage1_model_name

'diff16_no_outliers_fn_T_XGB_MODEL_s1_INSTRUMENT_Izmy_v4_nodist_normed_TPY_r9_FE_SET_7_TARGETING_y_visits_all_malignant_cancers_fwd3_diff_r12'

In [50]:
xgbr_s1_best.save_model(os.path.join(out_dir_xgb, stage1_model_name+ ".txt"))

In [51]:
# load it back in to see if it works!
models1_loaded = xgb.XGBRegressor()
models1_loaded.load_model(os.path.join(out_dir_xgb, stage1_model_name+ ".txt"))
print(models1_loaded.best_ntree_limit)

models1_loaded.predict(X_s1)

180


array([21.942682 , 22.881857 , 23.642935 , ...,  3.2541504,  3.451995 ,
        3.5449603], dtype=float32)

Stage 2

In [52]:
stage2_model_name = notebook_index + "_XGB_MODEL_s2_INSTRUMENT_" + predictor_name_s1 + "_FE_SET_" + str(FE_set_num) + "_TARGETING_" + y_var_s2
stage2_model_name

'diff16_no_outliers_fn_T_XGB_MODEL_s2_INSTRUMENT_Izmy_v4_nodist_normed_TPY_r9_FE_SET_7_TARGETING_y_visits_all_malignant_cancers_fwd3_diff_r12'

In [53]:
xgbr_s2_best.save_model(os.path.join(out_dir_xgb, stage2_model_name+ ".txt"))

In [54]:
# load it back in to see if it works!
models2_loaded = xgb.XGBRegressor()
models2_loaded.load_model(os.path.join(out_dir_xgb, stage2_model_name+ ".txt"))
print(models2_loaded.best_ntree_limit)

models2_loaded.predict(X_s2)

180


array([0.07857811, 0.06686186, 0.06826557, ..., 0.28133696, 0.08945359,
       0.06420609], dtype=float32)

## Verify the Exclusion Restriction

Compare the Instrument Values to the residuals of the 2nd stage, using the fixed effects chosen.

In [55]:
def fit_1st_stage(df, pm_col, instr_col, fixed_effects):
    ''' First stage to check whether wfeI is a strong instrument for pm25I
    '''
    
    temp = df.copy()

    # create FE and interactions between FE and continous vars
    f = pm_col + fixed_effects
    y, X_fe = patsy.dmatrices(f, temp, return_type="dataframe")
    
    X = temp[[instr_col]]
    
    # concat
    X = pd.concat([X, X_fe], axis=1)

    # fit model
    fit_1st_stage = sm.OLS(y, X).fit(
        cov_type='cluster',
        cov_kwds={'groups':temp['school_county_v2']},
        use_t=True
    )
    
    # display estimates
    print('Outcome: ', pm_col)
    print('-------------------------------------')
    display(pd.concat(
        [
            pd.DataFrame(fit_1st_stage.params).reset_index(drop=False).rename(columns={"index":"variable", 0:"coef"}).iloc[0:2,:],
            pd.DataFrame(fit_1st_stage.bse).reset_index(drop=False).rename(columns={"index":"variable", 0:"std err"}).iloc[0:2,1:2],
            fit_1st_stage.conf_int(alpha=0.05, cols=None).reset_index(drop=True).rename(columns={0:"[0.025", 1:"0.975]"}).iloc[0:2,:],
            pd.DataFrame(fit_1st_stage.pvalues.values).reset_index(drop=True).rename(columns={0:"p-value"}).iloc[0:2,:]
        ],
        axis=1
    ))
    
    # save pm25I_hat
    temp[pm_col+'_hat_OLS'] = fit_1st_stage.get_prediction(X).summary_frame()['mean']

    return temp


In [56]:
def fit_2sls(df, outcome, independent_var_name, fixed_effects):
    ''''''
    #independent_var_name example: 'pm25_hat'

    # drop if outcome is nan
    temp = df[~df[outcome].isna()]
    temp.reset_index(drop=True, inplace=True)

    # optional code to filter outliers
    if filter_medical_outliers:
        # print(f"Outcome: {outcome}")
        # display(temp[outcome].describe())
        old_row_count = temp.shape[0]

        #temp = temp[ np.abs(temp[outcome]-temp[outcome].mean()) <= min(3*temp[outcome].std(), 1000*int(lag_time)) ]
        temp = temp[ np.abs(temp[outcome]-temp[outcome].mean()) <= (3*temp[outcome].std()) ]
        new_row_count = temp.shape[0]
        print(f"Outliers (over 3 std away) were filtered out, We drop {old_row_count - new_row_count} rows.")
        #display(temp[outcome].describe())

    # create FE and interactions between FE and continous vars
    f = outcome + fixed_effects #" ~ county + month + year_trend * C(county_month)"
    y, X_fe = patsy.dmatrices(f, temp, return_type="dataframe")
    
    #X = temp[['pm25_hat']]
    X = temp[[independent_var_name]]
    
    # concat
    X = pd.concat([X, X_fe], axis=1)

    # fit model
    model_a = sm.OLS(y, X).fit(
        cov_type='cluster',
        cov_kwds={'groups':temp['school_county_v2']},
        use_t=True
    )
    
    # display estimates
    print('Outcome: ', outcome)
    print('-------------------------------------')
    display(pd.concat(
        [
            pd.DataFrame(model_a.params).reset_index(drop=False).rename(columns={"index":"variable", 0:"coef"}).iloc[0:2,:],
            pd.DataFrame(model_a.bse).reset_index(drop=False).rename(columns={"index":"variable", 0:"std err"}).iloc[0:2,1:2],
            model_a.conf_int(alpha=0.05, cols=None).reset_index(drop=True).rename(columns={0:"[0.025", 1:"0.975]"}).iloc[0:2,:],
            pd.DataFrame(model_a.pvalues.values).reset_index(drop=True).rename(columns={0:"p-value"}).iloc[0:2,:]
        ],
        axis=1
    ))

    # save outcome_hat
    temp[outcome+'_hat_OLS'] = model_a.get_prediction(X).summary_frame()['mean']
    temp[outcome+'_hat_OLS'+'_residuals'] = model_a.resid


    return temp

In [57]:
fixed_effects_cols_str

' ~ school_county_v2 + month + year_trend * C(county_month)+ avg_temp_r9 + avg_elevation_diff_m + ca_agi_per_returns + total_population + avg_wspd_top_15_r9'

In [58]:
df_all_iv_filtered = fit_1st_stage(df_model_s1, target_name_s1, predictor_name_s1, fixed_effects_cols_str)


Outcome:  pm25_r9
-------------------------------------


Unnamed: 0,variable,coef,std err,[0.025,0.975],p-value
0,Izmy_v4_nodist_normed_TPY_r9,-0.000107,1.9e-05,-0.000145,-7e-05,4.021765e-07
1,Intercept,15.636282,2.221806,11.187193,20.085371,2.753969e-09


In [59]:
rmse_val = np.mean(((df_all_iv_filtered[target_name_s1] - df_all_iv_filtered[target_name_s1+'_hat_OLS'])**2)**.5)
print('RMSE for {} and {}: {}'.format(target_name_s1, target_name_s1+'_hat_OLS', rmse_val))

RMSE for pm25_r9 and pm25_r9_hat_OLS: 1.7011202481774543


In [60]:
s2_df = fit_2sls(df_all_iv_filtered, y_var_s2, target_name_s1+"_hat_OLS", fixed_effects_cols_str)

Outliers (over 3 std away) were filtered out, We drop 767 rows.
Outcome:  y_visits_all_malignant_cancers_fwd3_diff_r12
-------------------------------------


Unnamed: 0,variable,coef,std err,[0.025,0.975],p-value
0,pm25_r9_hat_OLS,0.007882,0.003682,0.000483,0.01528,0.037289
1,Intercept,-0.182179,0.078232,-0.339392,-0.024965,0.024045


  return np.sqrt(self.var_pred_mean)


In [61]:
rmse_val = np.mean(((s2_df[y_var_s2] - s2_df[y_var_s2+'_hat_OLS'])**2)**.5)
print('RMSE for {} and {}: {}'.format(y_var_s2, y_var_s2+'_hat_OLS', rmse_val))

RMSE for y_visits_all_malignant_cancers_fwd3_diff_r12 and y_visits_all_malignant_cancers_fwd3_diff_r12_hat_OLS: 0.20679298998165824


In [62]:
print(f"Correlation Matrix Between Instrument (predictor_name_s1) and Medical Outcome ({y_var_s2})'s Residuals:\n")

df_corr = s2_df[[predictor_name_s1] + [i for i in s2_df.columns if 'OLS_residuals' in i]].corr()

# plot the results
f, ax = plt.subplots(figsize=(12, 9))
sns.heatmap(df_corr, vmin=-1, center=0, vmax=1, cmap=sns.diverging_palette(20, 220, n=200), square=True)

ax.set_xticklabels(ax.get_xticklabels(), 
    rotation=45, horizontalalignment='right')
ax.set_title(f"Showing that {predictor_name_s1} is not correlated \nwith the residuals of the 2nd stage", fontdict = {"fontsize": 20})
print('')

display(df_corr)

Correlation Matrix Between Instrument (predictor_name_s1) and Medical Outcome (y_visits_all_malignant_cancers_fwd3_diff_r12)'s Residuals:




Unnamed: 0,Izmy_v4_nodist_normed_TPY_r9,y_visits_all_malignant_cancers_fwd3_diff_r12_hat_OLS_residuals
Izmy_v4_nodist_normed_TPY_r9,1.0,-3.254732e-12
y_visits_all_malignant_cancers_fwd3_diff_r12_hat_OLS_residuals,-3.254732e-12,1.0
