<h1>Exploring Health Datasets for Link-Health to Determine Association between ACP Eligibility and Health Outcomes</h1>

In [103]:
import pandas as pd
import numpy as np
import warnings
warnings.simplefilter('ignore') #Turn off warnings

<h2>Cleaning the PLACES Dataset (DO NOT TOUCH)</h2>

places_zcta = pd.read_csv("noah_csv/places_zcta.csv")

places_zcta

In [104]:
ma_acp_adoption = pd.read_csv("noah_csv/MA_ACP_Adoption.csv")
ma_acp_adoption

Unnamed: 0,City,Adoption Rate,Eligible Households,Enrolled Households,Eligible Unconnected Households
0,Abington,17%,2169,376,448
1,Agawam Town,20%,5414,1069,2213
2,Amesbury Town,26%,1821,477,382
3,Arlington,14%,4816,651,925
4,Attleboro,25%,6110,1525,1392
...,...,...,...,...,...
72,Wilmington,11%,1975,225,403
73,Winchester,9%,1925,182,370
74,Winthrop Town,11%,4462,499,1586
75,Woburn,17%,4790,826,1069


<h2>Merging Zip Codes with Towns in the ACP Dataset and PLACES Data</h2>

In [105]:
ma_zips_raw = pd.read_csv("noah_csv/ma_zips.csv")
ma_zips = ma_zips_raw[["Zipcode", "City"]]
ma_zips = ma_zips.rename(columns={"Zipcode":"Zip Code"})
ma_zips

Unnamed: 0,Zip Code,City
0,1001,Agawam
1,1002,Amherst
2,1003,Amherst
3,1004,Amherst
4,1005,Barre
...,...,...
698,2783,Taunton
699,2790,Westport
700,2791,Westport Point
701,5501,Andover


In [106]:
#Clean Zip Codes: read_csv parsed them as integers and dropped the leading zero,
#so convert to strings and left-pad to five digits
ma_zips["Zip Code"] = ma_zips["Zip Code"].astype(str).str.zfill(5)
ma_zips

Unnamed: 0,Zip Code,City
0,01001,Agawam
1,01002,Amherst
2,01003,Amherst
3,01004,Amherst
4,01005,Barre
...,...,...
698,02783,Taunton
699,02790,Westport
700,02791,Westport Point
701,05501,Andover


In [107]:
ma_acp_adoption_zc = ma_acp_adoption.merge(ma_zips, on="City")
ma_acp_adoption_zc

Unnamed: 0,City,Adoption Rate,Eligible Households,Enrolled Households,Eligible Unconnected Households,Zip Code
0,Abington,17%,2169,376,448,02351
1,Arlington,14%,4816,651,925,02474
2,Arlington,14%,4816,651,925,02476
3,Attleboro,25%,6110,1525,1392,02703
4,Belmont,11%,2455,268,471,02478
...,...,...,...,...,...,...
206,Worcester,60%,42567,25455,14501,01614
207,Worcester,60%,42567,25455,14501,01615
208,Worcester,60%,42567,25455,14501,01653
209,Worcester,60%,42567,25455,14501,01654
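
One thing worth checking in the merged output above: "Agawam Town" appears in the ACP table, but the zip code table lists the town as "Agawam", and no Agawam row shows up after the merge. An inner merge silently drops towns whose names do not match exactly. The sketch below uses a few rows copied from the tables above to show how `indicator=True` on a left merge can surface those towns. The `City_clean` column and the rule of stripping a trailing " Town" are assumptions for illustration, not something confirmed against the full datasets.

```python
import pandas as pd

# Toy rows copied from the tables above; the real frames are ma_acp_adoption and ma_zips.
acp = pd.DataFrame({
    "City": ["Abington", "Agawam Town", "Arlington"],
    "Adoption Rate": ["17%", "20%", "14%"],
})
zips = pd.DataFrame({
    "Zip Code": ["02351", "01001", "02474", "02476"],
    "City": ["Abington", "Agawam", "Arlington", "Arlington"],
})

# A left merge with indicator=True marks ACP towns that found no zip code as "left_only".
check = acp.merge(zips, on="City", how="left", indicator=True)
unmatched = check.loc[check["_merge"] == "left_only", "City"].tolist()

# Assumption: removing a trailing " Town" recovers the name used in the zip code file.
acp["City_clean"] = acp["City"].str.replace(r"\s+Town$", "", regex=True)
fixed = acp.merge(
    zips, left_on="City_clean", right_on="City",
    how="left", suffixes=("", "_zip"), indicator=True,
)
still_unmatched = fixed.loc[fixed["_merge"] == "left_only", "City"].tolist()
```

Running the same check on the full `ma_acp_adoption` frame would list every town that the inner merge dropped, which may call for more cleaning rules than the single suffix handled here.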


The problem is that the ACP data from the superhighway dataset is at the town level, while the PLACES data is at the ZCTA level. To compare them, we would need to aggregate the PLACES values for all zip codes in the same town. Because the PLACES values are percentages, that aggregate should be a population-weighted average rather than a simple average of the percentages.

My thoughts:
- Add Town to the PLACES dataset, merging by zip code
- Find a dataset that provides population by ZCTA and merge it with the PLACES dataset on Zip Code
- Aggregate the percent data in the PLACES dataset, weighted by population, so that the PLACES and ACP datasets have the same level of granularity
    - The published confidence intervals would no longer apply to the aggregated values, so we would have to drop them
    
Option 2 is to ignore population and assume every ZCTA in a city has that city's adoption rate (for example, treating 01614 and 01615 as each having a 60% adoption rate), which would likely be a major oversight
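
To make the difference between the two options concrete, here is a minimal sketch of the population-weighted aggregation next to a plain average. All numbers are made up for illustration: the `Population` column stands in for a ZCTA population dataset that has not been found yet, and the `Data_Value` figures are not real PLACES values.

```python
import pandas as pd

# Illustrative numbers only: populations and prevalence values below are hypothetical.
toy = pd.DataFrame({
    "City": ["Worcester", "Worcester", "Arlington"],
    "Zip Code": ["01614", "01615", "02474"],
    "Population": [1000, 3000, 2000],   # hypothetical ZCTA populations
    "Data_Value": [10.0, 20.0, 15.0],   # hypothetical crude prevalence (%)
})

# Weighted mean per town: sum(value * population) / sum(population).
toy["weighted"] = toy["Data_Value"] * toy["Population"]
sums = toy.groupby("City")[["weighted", "Population"]].sum()
town_weighted = sums["weighted"] / sums["Population"]

# Unweighted mean per town, for comparison.
town_unweighted = toy.groupby("City")["Data_Value"].mean()
```

With these toy numbers the weighted value for Worcester is 17.5 while the plain average is 15.0, because the larger ZCTA pulls the town-level figure toward its own value. On the real PLACES data the grouping would presumably also need to include `Measure`, so that each health measure is aggregated separately.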

<h2>Exporting MA places data separately</h2>
<h4>The full PLACES dataset was too large to upload to GitHub, so I exported a separate places_ma dataset, which is now in the noah_csv folder</h4>

In [112]:
places_ma = pd.read_csv("noah_csv/places_ma.csv")
places_ma.drop(

,Unnamed: 0,Year,Zip Code,Category,Measure,Data_Value_Type,Data_Value,Confidence Interval
0,0,2020,1005,Health Outcomes,Arthritis among adults aged >=18 years,Crude prevalence,27.2,"(25.8, 28.6)"
1,1,2020,1007,Health Outcomes,Stroke among adults aged >=18 years,Crude prevalence,2.1,"(1.9, 2.4)"
2,2,2020,1008,Health Outcomes,Obesity among adults aged >=18 years,Crude prevalence,29.2,"(27.9, 30.5)"
3,3,2020,1009,Health Outcomes,Obesity among adults aged >=18 years,Crude prevalence,29.7,"(28.0, 31.5)"
4,4,2020,1026,Health Outcomes,Chronic kidney disease among adults aged >=18 ...,Crude prevalence,2.8,"(2.6, 3.0)"
...,...,...,...,...,...,...,...,...
25719,25719,2020,1057,Health Risk Behaviors,Current smoking among adults aged >=18 years,Crude prevalence,15.4,"(13.5, 17.2)"
25720,25720,2020,1833,Health Outcomes,All teeth lost among adults aged >=65 years,Crude prevalence,7.5,"(4.8, 10.6)"
25722,25722,2019,1830,Prevention,Cholesterol screening among adults aged >=18 y...,Crude prevalence,89.4,"(88.9, 90.0)"
25723,25723,2020,1245,Health Status,Fair or poor self-rated health status among ad...,Crude prevalence,10.9,"(8.9, 13.2)"
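
Note that the `Zip Code` values in this output (1005, 1007, ...) have lost their leading zero, while the cleaned `ma_zips` table stores five-character strings such as 01005. Merging the two as they are would match nothing. Below is a minimal sketch of the "add Town to PLACES" step, using two rows copied from the output above and one row from the zip code table. The frame names `places` and `towns` are stand-ins for `places_ma` and `ma_zips`.

```python
import pandas as pd

# Two rows copied from the places_ma output above (zip codes read in as integers).
places = pd.DataFrame({
    "Zip Code": [1005, 1007],
    "Measure": ["Arthritis among adults aged >=18 years",
                "Stroke among adults aged >=18 years"],
    "Data_Value": [27.2, 2.1],
})
# One row copied from the cleaned ma_zips table.
towns = pd.DataFrame({"Zip Code": ["01005"], "City": ["Barre"]})

# Pad to five characters so the key has the same type and format on both sides.
places["Zip Code"] = places["Zip Code"].astype(str).str.zfill(5)

# A left merge keeps PLACES rows even when no town is found for the zip code.
with_town = places.merge(towns, on="Zip Code", how="left")
```

Using `how="left"` keeps every PLACES row, so zip codes without a town show up as missing values in `City` instead of disappearing from the result.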
