# Scaled Community Risk Factors

The goal of this notebook is to operationalize a measure of "hypothetical COVID-19 community risk" for incarcerated people in each Pennsylvania State Correctional Institution (SCI). Hypothetical COVID-19 community risk measures the risk of negative health outcomes that incarcerated people might face if they were living in their home communities instead of in prison. For illustration, the hypothetical COVID-19 community risk factor for a single Philadelphian would be the death rate or infection rate for Philadelphia. Each SCI, however, houses incarcerated people from many different counties, so the measure must account for that mix. We therefore operationalize it as the weighted average of the community death or infection rates of the counties represented in a given SCI: each county's rate is weighted by that county's share of the SCI's population, and the weighted rates are summed (the weights sum to 1).
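
As a minimal sketch of this weighted average, assume a hypothetical SCI that houses people from three counties. The county labels, headcounts, and death rates below are invented for illustration and are not drawn from the data used in this notebook.

```python
import pandas as pd

# Hypothetical toy data: one SCI, three counties (all numbers are invented).
toy = pd.DataFrame({
    "county": ["A", "B", "C"],
    "incarcerated_persons": [50, 30, 20],
    "death_rate": [100.0, 200.0, 50.0],
})

# Weight of each county = its share of the SCI population (weights sum to 1).
toy["weight"] = toy["incarcerated_persons"] / toy["incarcerated_persons"].sum()

# Weighted average of the county death rates.
toy_risk = (toy["weight"] * toy["death_rate"]).sum()
print(toy_risk)
```

With these invented numbers the result is 0.5 × 100 + 0.3 × 200 + 0.2 × 50 = 120.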


In [1]:
import pandas as pd
import numpy as np
from plotnine import ggplot, geoms, theme, scales, ggtitle, labels, aes
from plotnine import element_blank,element_line,element_rect,element_text

import matplotlib.pyplot as plt
%matplotlib inline 

In [51]:
county_df = pd.read_csv('../../data/State_wide_data/COVID-19_Aggregate_Death_Data_Current_Daily_County_Health.csv')
snapshot = pd.read_csv('../../data/comparing_SCI-County/County_pop_over_SCI.csv')

### County Proportions in SCIs

In [52]:
# transform returns a result aligned to the original rows, so it can be assigned directly
snapshot['proportion_by_sci'] = snapshot.groupby('location')['incarcerated_persons'].transform(lambda x: x / x.sum())
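
One way to sanity-check this step is to confirm that the proportions sum to 1 within every SCI. The sketch below computes the proportions with `transform` on a small hypothetical frame that reuses the notebook's column names; the rows and counts are invented.

```python
import numpy as np
import pandas as pd

# Hypothetical rows using the notebook's column names; the counts are invented.
toy_snapshot = pd.DataFrame({
    "location": ["ALBION", "ALBION", "DALLAS", "DALLAS", "DALLAS"],
    "county": ["Erie", "Philadelphia", "Luzerne", "Philadelphia", "Berks"],
    "incarcerated_persons": [60, 40, 10, 70, 20],
})

# Share of each SCI's population that comes from each county.
group_totals = toy_snapshot.groupby("location")["incarcerated_persons"].transform("sum")
toy_snapshot["proportion_by_sci"] = toy_snapshot["incarcerated_persons"] / group_totals

# Sanity check: within every SCI the proportions should sum to 1.
sums_by_sci = toy_snapshot.groupby("location")["proportion_by_sci"].sum()
assert np.allclose(sums_by_sci, 1.0)
```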

### Clean and format data to merge

In [53]:
county_df['date'] = pd.to_datetime(county_df['Date of Death'])
# Keep only the columns needed for the merge and rename the county column to match snapshot
county_df_deathrate = county_df[['County Name','Total Death Rate','date']].rename(columns={'County Name': 'county'})
county_df_deathrate = county_df_deathrate.sort_values(by='date')

# Keep only the most recent date in county_df (03/08)
county_df_deathrate = county_df_deathrate.loc[county_df_deathrate.date == county_df_deathrate.date.max()]

snapshot['county'] = snapshot['county'].str.title()

snapshot = snapshot.loc[:, ~snapshot.columns.str.contains('^Unnamed')]

#### Check overlap in county names

In [54]:
# counties in public health data not in snapshot
[i for i in county_df_deathrate['county'].unique() if i not in snapshot['county'].unique()]

['Pennsylvania', 'McKean']

In [55]:
# counties in snapshot not in public health data
[i for i in snapshot['county'].unique() if i not in county_df_deathrate['county'].unique()]

['Mckean', 'Out Of State']

In [56]:
snapshot.loc[snapshot['county'] == "Mckean",'county'] = "McKean"

**Note:** The snapshot data contains a county value called "Out Of State", and the public health data contains a county value called "Pennsylvania". Neither value exists in the other dataset. The checks below indicate that "Pennsylvania" is the total over the entire state and that "Out Of State" never makes up more than 1% of the population of an SCI.

For these reasons, these values are dropped from the data.

In [57]:
Penn_pop = county_df.loc[(county_df['County Name'] == "Pennsylvania") &
                    (county_df['date'] == county_df.date.max())]['2019 Population '].values[0]
notPenn_pop = county_df.loc[(county_df['County Name'] != "Pennsylvania") &
                    (county_df['date'] == county_df.date.max())]['2019 Population '].sum()

Penn_pop == notPenn_pop

True

In [58]:
snapshot.loc[snapshot['county'] == 'Out Of State',['location','proportion_by_sci']]

| | location | proportion_by_sci |
|---|---|---|
| 1176 | ALBION | 0.003784 |
| 1177 | BENNER TOWNSHIP | 0.003579 |
| 1178 | CAMBRIDGE SPRINGS | 0.001147 |
| 1179 | CAMP HILL | 0.003532 |
| 1180 | CHESTER | 0.00295 |
| 1181 | COAL TOWNSHIP | 0.004801 |
| 1182 | DALLAS | 0.0 |
| 1183 | FAYETTE | 0.002051 |
| 1184 | FOREST | 0.001336 |
| 1185 | FRACKVILLE | 0.005455 |


In [59]:
snapshot = snapshot.loc[snapshot['county'] != "Out Of State"]
county_df_deathrate = county_df_deathrate.loc[county_df_deathrate['county'] != 'Pennsylvania']

##### Merge data

In [60]:
merged_df = snapshot.merge(county_df_deathrate,on='county',how='left')
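
A left merge silently leaves missing values for any county name that has no match, so it can help to audit the merge. The sketch below uses the `validate` and `indicator` parameters of pandas `merge` on two small hypothetical frames; the rows and rates are invented, and one county spelling is deliberately left uncorrected to show what an unmatched key looks like.

```python
import pandas as pd

# Hypothetical left table; the last county spelling is deliberately wrong.
toy_left = pd.DataFrame({
    "location": ["ALBION", "ALBION", "DALLAS"],
    "county": ["Erie", "McKean", "Mckean"],
})

# Hypothetical right table with one row per county (rates are invented).
toy_right = pd.DataFrame({
    "county": ["Erie", "McKean"],
    "Total Death Rate": [150.0, 90.0],
})

# validate raises if the right table has duplicate counties;
# indicator adds a _merge column recording where each row was found.
checked = toy_left.merge(toy_right, on="county", how="left",
                         validate="many_to_one", indicator=True)

# Counties present in the left table but missing from the right table.
unmatched = checked.loc[checked["_merge"] == "left_only", "county"].tolist()
print(unmatched)
```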

## Calculate Hypothetical COVID-19 Community Risk Factor

In [61]:
def HCRF(subset):
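    """
    Calculate the weighted average death rate for one SCI.
    The weights (proportion_by_sci) are treated as summing to 1,
    so no denominator is used. Note that they were computed before
    the 'Out Of State' rows were dropped.
    """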
    scaled_values = subset['Total Death Rate'] * subset['proportion_by_sci']
    return np.sum(scaled_values)

In [64]:
hcrf = merged_df.groupby('location').apply(HCRF).reset_index()
hcrf.columns = ['location','HCRF']
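
The same per-SCI weighted sum can also be written without `groupby(...).apply`, by multiplying the two columns first and then summing within each location. This is only an alternative sketch on a hypothetical frame that reuses the notebook's column names; the numbers are invented.

```python
import pandas as pd

# Hypothetical merged rows; proportions and rates are invented.
toy_merged = pd.DataFrame({
    "location": ["ALBION", "ALBION", "DALLAS", "DALLAS"],
    "proportion_by_sci": [0.6, 0.4, 0.25, 0.75],
    "Total Death Rate": [100.0, 200.0, 80.0, 40.0],
})

# Scale each county's rate by its weight, then sum the scaled rates per SCI.
toy_result = (
    (toy_merged["Total Death Rate"] * toy_merged["proportion_by_sci"])
    .groupby(toy_merged["location"])
    .sum()
    .rename("HCRF")
    .reset_index()
)
print(toy_result)
```

For the invented rows, ALBION works out to 0.6 × 100 + 0.4 × 200 = 140 and DALLAS to 0.25 × 80 + 0.75 × 40 = 50.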

#### Merge HCRF into merged_df

In [67]:
merged_df = merged_df.merge(hcrf,on='location',how='left')

In [69]:
merged_df.to_csv('../../data/comparing_SCI-County/County_pop_over_SCI_with_HCRF.csv', index=False)