Steps:
1. Finalise data sets (be brutal; identify roots and stems; address missing values, modelling each missing value as the column mean)
2. Model linear regression statistics (feature importances; chicken feed/auto)
3. Prediction: random forest
4. Data visualisation (pairplots)
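
The mean imputation mentioned in step 1 could look roughly like the sketch below. The frame and its column names are hypothetical placeholders, not the final feature table, and only a handful of toy values are used.

```python
import numpy as np
import pandas as pd

# Hypothetical feature table; column names are placeholders for illustration.
features = pd.DataFrame({
    "poverty_universe": [55073.0, np.nan, 21979.0],
    "risk_percentile": [65.42, 68.39, np.nan],
})

# Replace each missing value with the mean of its own column.
imputed = features.fillna(features.mean())
print(imputed)
```

Mean imputation keeps every county in the model but shrinks the variance of the imputed columns, so it may be worth comparing against simply dropping incomplete rows.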

In [2]:
import pandas as pd

### Covid 19 Cases by County (USA Facts/CDC)

For most states, USAFacts directly collects the daily county-level cumulative totals of positive cases and deaths from a table, dashboard, or PDF on the state public health website. This data is compiled either through scraping or manual entry. The underlying data is available for download below the US county map and has helped government agencies like the Centers for Disease Control and Prevention in their nationwide efforts.

REFERENCES:
1. https://usafacts.org/visualizations/coronavirus-covid-19-spread-map/

In [3]:
covid_cases = pd.read_csv("data/covid_confirmed_usafacts_200803.csv")

In [4]:
covid_cases.head()

Unnamed: 0,countyFIPS,County Name,State,stateFIPS,1/22/20,1/23/20,1/24/20,1/25/20,1/26/20,1/27/20,...,7/23/20,7/24/20,7/25/20,7/26/20,7/27/20,7/28/20,7/29/20,7/30/20,7/31/20,8/1/20
0,0,Statewide Unallocated,AL,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,1001,Autauga County,AL,1,0,0,0,0,0,0,...,905,921,932,942,965,974,974,1002,1015,1030
2,1003,Baldwin County,AL,1,0,0,0,0,0,0,...,2461,2513,2662,2708,2770,2835,2835,3028,3101,3142
3,1005,Barbour County,AL,1,0,0,0,0,0,0,...,534,539,552,562,569,575,575,585,598,602
4,1007,Bibb County,AL,1,0,0,0,0,0,0,...,289,303,318,324,334,337,338,352,363,368


In [5]:
covid_cases_dropped = covid_cases.drop(columns=['8/1/20'])

In [6]:
covid_cases_dropped_only = covid_cases_dropped.iloc[:,-192:]

In [7]:
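# Caution: the date columns are cumulative totals, so this row sum adds up running totals rather than counting cases.
covid_cases_total = covid_cases_dropped['Total Cases'] = covid_cases_dropped.iloc[:, -192:].sum(axis=1)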
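
Because USAFacts publishes cumulative totals, the most recent date column is already the total to date. The sketch below uses a toy frame built from the last three date columns of the first two counties shown in the `head()` output; the variable names are illustrative assumptions, not part of the notebook.

```python
import pandas as pd

# Toy frame mimicking the USAFacts layout: one cumulative column per date.
toy_cases = pd.DataFrame({
    "countyFIPS": [1001, 1003],
    "7/29/20": [974, 2835],
    "7/30/20": [1002, 3028],
    "7/31/20": [1015, 3101],
})
date_cols = ["7/29/20", "7/30/20", "7/31/20"]

# Cumulative series: the latest column is the total to date.
total_to_date = toy_cases[date_cols[-1]]

# New cases per day are the day-over-day differences along the columns.
daily_new = toy_cases[date_cols].diff(axis=1)

# For comparison: summing cumulative columns overshoots the total.
row_sum = toy_cases[date_cols].sum(axis=1)

print(total_to_date.tolist(), row_sum.tolist())
```

If a per-county case count is the intended modelling target, `total_to_date` is probably the quantity to carry forward.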

In [8]:
covid_cases_filter = covid_cases_dropped.loc[:,["countyFIPS", "County Name", "State", "stateFIPS", "Total Cases"]].copy()
covid_cases_filter["countyFIPS"] = covid_cases_filter["countyFIPS"].astype(str)
print(covid_cases_filter.dtypes)

countyFIPS     object
County Name    object
State          object
stateFIPS       int64
Total Cases     int64
dtype: object


In [9]:
covid_cases_filter['countyFIPS_2d'] = covid_cases_filter['countyFIPS'].str[-2:]
covid_cases_filter = covid_cases_filter.loc[:,["stateFIPS", "countyFIPS_2d", "County Name", "State", "Total Cases"]]
covid_cases_filter

Unnamed: 0,stateFIPS,countyFIPS_2d,County Name,State,Total Cases
0,1,0,Statewide Unallocated,AL,0
1,1,01,Autauga County,AL,39746
2,1,03,Baldwin County,AL,76970
3,1,05,Barbour County,AL,24625
4,1,07,Bibb County,AL,13636
...,...,...,...,...,...
3190,56,37,Sweetwater County,WY,7361
3191,56,39,Teton County,WY,13823
3192,56,41,Uinta County,WY,9737
3193,56,43,Washakie County,WY,3104
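
County FIPS codes are five digits: two for the state and three for the county. Taking the last two characters therefore maps, for example, 1001 and 1101 to the same key. A minimal sketch of a collision-free split, assuming the integer `countyFIPS` layout shown above (the column name `countyFIPS_3d` and the toy frame are hypothetical):

```python
import pandas as pd

# Toy FIPS values: unallocated row, two Alabama counties, one Wyoming county.
toy = pd.DataFrame({"countyFIPS": [0, 1001, 1101, 56037]})

# Zero-pad to the full 5-digit code before slicing.
fips5 = toy["countyFIPS"].astype(str).str.zfill(5)
toy["stateFIPS_2d"] = fips5.str[:2]
toy["countyFIPS_3d"] = fips5.str[2:]

# For comparison: the last-two-characters key used above.
last_two = toy["countyFIPS"].astype(str).str[-2:]

print(toy)
print(last_two.tolist())
```

Keeping both parts as zero-padded strings also makes the key directly comparable across the USAFacts, Census, and PolicyMap files.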


### Covid 19 Deaths by County (USA Facts/CDC)

For most states, USAFacts directly collects the daily county-level cumulative totals of positive cases and deaths from a table, dashboard, or PDF on the state public health website. This data is compiled either through scraping or manual entry. The underlying data is available for download below the US county map and has helped government agencies like the Centers for Disease Control and Prevention in their nationwide efforts.

REFERENCES:
1. https://usafacts.org/visualizations/coronavirus-covid-19-spread-map/

In [10]:
covid_deaths = pd.read_csv("data/covid_deaths_usafacts_200803.csv")

In [11]:
covid_deaths_dropped = covid_deaths.drop(columns=['8/1/20'])

In [12]:
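# Caution: the date columns are cumulative totals, so this row sum adds up running totals rather than counting deaths.
covid_deaths_total = covid_deaths_dropped['Total Deaths'] = covid_deaths_dropped.iloc[:, -192:].sum(axis=1)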

In [14]:
covid_deaths_filter = covid_deaths_dropped.loc[:,["countyFIPS", "County Name", "State", "stateFIPS", "Total Deaths"]]
covid_deaths_filter["countyFIPS"] = covid_deaths_filter["countyFIPS"].astype(str)
print(covid_deaths_filter.dtypes)

countyFIPS      object
County Name     object
State           object
stateFIPS        int64
Total Deaths     int64
dtype: object


In [15]:
covid_deaths_filter['countyFIPS_2d'] = covid_deaths_filter['countyFIPS'].str[-2:]
covid_deaths_filter = covid_deaths_filter.loc[:,["stateFIPS", "countyFIPS_2d", "County Name", "State", "Total Deaths"]]
covid_deaths_filter

Unnamed: 0,stateFIPS,countyFIPS_2d,County Name,State,Total Deaths
0,1,0,Statewide Unallocated,AL,0
1,1,01,Autauga County,AL,909
2,1,03,Baldwin County,AL,958
3,1,05,Barbour County,AL,155
4,1,07,Bibb County,AL,103
...,...,...,...,...,...
3190,56,37,Sweetwater County,WY,34
3191,56,39,Teton County,WY,101
3192,56,41,Uinta County,WY,0
3193,56,43,Washakie County,WY,291


### Per capita incidence of poverty by U.S. county (U.S. Census)

The poverty universe is made up of persons for whom the Census Bureau can determine poverty status (either "in poverty" or "not in poverty").

REFERENCES:
1. SAIPE Model Input Data: https://www.census.gov/data/datasets/time-series/demo/saipe/model-tables.html

In [16]:
poverty = pd.read_csv("data/allpovu.csv")
poverty_all_ages = poverty.loc[:,["State FIPS code", "County FIPS code", "Name", "State Postal Code", "Poverty Universe, All Ages"]]
poverty_all_ages

Unnamed: 0,State FIPS code,County FIPS code,Name,State Postal Code,"Poverty Universe, All Ages"
0,0,0,United States,US,319184033.0
1,1,0,Alabama,AL,4763811.0
2,1,1,Autauga County,AL,55073.0
3,1,3,Baldwin County,AL,215255.0
4,1,5,Barbour County,AL,21979.0
...,...,...,...,...,...
3196,56,37,Sweetwater County,WY,42205.0
3197,56,39,Teton County,WY,22888.0
3198,56,41,Uinta County,WY,20135.0
3199,56,43,Washakie County,WY,7735.0
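
The SAIPE table mixes national and state aggregates (county code 0) with county rows. One possible way to keep only counties and build a five-digit key is sketched below; the toy rows are copied from the output above, and the `fips5` column name is an assumption.

```python
import pandas as pd

# Toy rows copied from the poverty table above.
toy_poverty = pd.DataFrame({
    "State FIPS code": [0, 1, 1, 56],
    "County FIPS code": [0, 0, 1, 37],
    "Name": ["United States", "Alabama", "Autauga County", "Sweetwater County"],
    "Poverty Universe, All Ages": [319184033.0, 4763811.0, 55073.0, 42205.0],
})

# A county code of 0 marks a national or state aggregate; drop those rows.
counties = toy_poverty[toy_poverty["County FIPS code"] != 0].copy()

# Build a zero-padded 5-digit key: 2 state digits + 3 county digits.
counties["fips5"] = (
    counties["State FIPS code"].astype(str).str.zfill(2)
    + counties["County FIPS code"].astype(str).str.zfill(3)
)

print(counties[["fips5", "Name"]])
```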


### County Population by Racial/Ethnic Characteristics 2010-2019 (U.S. Census Bureau)

METHODOLOGY FOR THE UNITED STATES POPULATION ESTIMATES: VINTAGE 2019
Nation, States, Counties, and Puerto Rico – April 1, 2010 to July 1, 2019

Each year, the United States Census Bureau produces and publishes estimates of the population for the
nation, states, counties, state/county equivalents, and Puerto Rico. We estimate the resident population for
each year since the most recent decennial census by using measures of population change. The resident
population includes all people currently residing in the United States.

With each annual release of population estimates, the Population Estimates Program revises and updates the
entire time series of estimates from April 1, 2010 to July 1 of the current year, which we refer to as the
vintage year. We use the term “vintage” to denote an entire time series created with a consistent population
starting point and methodology. The release of a new vintage of estimates supersedes any previous series
and incorporates the most up-to-date input data and methodological improvements.

REFERENCES:
1. Annual County Resident Population Estimates by Age, Sex, Race, and Hispanic Origin: April 1, 2010 to July 1, 2019 (https://www.census.gov/data/tables/time-series/demo/popest/2010s-counties-detail.html)
2. File Layout: https://www2.census.gov/programs-surveys/popest/technical-documentation/file-layouts/2010-2019/cc-est2019-alldata.pdf

In [17]:
race = pd.read_csv("data/cc-est2019-alldata.csv", encoding = "ISO-8859-1")

In [18]:
race

Unnamed: 0,SUMLEV,STATE,COUNTY,STNAME,CTYNAME,YEAR,AGEGRP,TOT_POP,TOT_MALE,TOT_FEMALE,...,HWAC_MALE,HWAC_FEMALE,HBAC_MALE,HBAC_FEMALE,HIAC_MALE,HIAC_FEMALE,HAAC_MALE,HAAC_FEMALE,HNAC_MALE,HNAC_FEMALE
0,50,1,1,Alabama,Autauga County,1,0,54571,26569,28002,...,607,538,57,48,26,32,9,11,19,10
1,50,1,1,Alabama,Autauga County,1,1,3579,1866,1713,...,77,56,9,5,4,1,0,0,2,1
2,50,1,1,Alabama,Autauga County,1,2,3991,2001,1990,...,64,66,2,3,2,7,2,3,2,0
3,50,1,1,Alabama,Autauga County,1,3,4290,2171,2119,...,51,57,13,7,5,5,2,1,1,1
4,50,1,1,Alabama,Autauga County,1,4,4290,2213,2077,...,48,44,7,5,0,2,2,1,3,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
716371,50,56,45,Wyoming,Weston County,12,14,499,280,219,...,4,5,0,0,0,0,0,0,0,0
716372,50,56,45,Wyoming,Weston County,12,15,352,180,172,...,1,2,0,0,0,0,3,0,0,0
716373,50,56,45,Wyoming,Weston County,12,16,229,107,122,...,2,0,0,0,0,0,0,0,0,0
716374,50,56,45,Wyoming,Weston County,12,17,198,82,116,...,1,1,0,0,1,0,0,0,0,0
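
The file holds one row per county, year code, and age group, so it needs filtering before it can be joined at county level. According to the file layout cited above, `YEAR == 12` is the July 1, 2019 estimate and `AGEGRP == 0` is the all-ages total. The sketch below applies that filter to a toy frame; the `YEAR == 12` population values are made up for illustration.

```python
import pandas as pd

# Toy rows in the cc-est2019-alldata layout. The YEAR == 12 values are
# placeholders, not real estimates.
toy_race = pd.DataFrame({
    "STATE": [1, 1, 1, 1],
    "COUNTY": [1, 1, 1, 1],
    "YEAR": [1, 1, 12, 12],
    "AGEGRP": [0, 1, 0, 1],
    "TOT_POP": [54571, 3579, 1000, 100],
})

# Keep the latest estimate (YEAR == 12) and the all-ages total (AGEGRP == 0),
# which leaves one row per county.
latest_totals = toy_race[(toy_race["YEAR"] == 12) & (toy_race["AGEGRP"] == 0)]

print(latest_totals)
```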


In [None]:
race.columns.tolist()

# SELECTION - Z Value

# WA_MALE
# WAC_MALE

# WA_FEMALE
# WAC_FEMALE

# BA_MALE
# BAC_MALE

# BA_FEMALE
# BAC_FEMALE

# IA_MALE
# IAC_MALE

# IA_FEMALE
# IAC_FEMALE

# AA_MALE
# AAC_MALE 

# AA_FEMALE
# AAC_FEMALE

# NA_MALE
# NAC_MALE 

# NA_FEMALE
# NAC_FEMALE

# TOM_MALE
# TOM_FEMALE

### Incidence of Pre-existing Conditions & Coverage of Flu Vaccine

People of any age with the following conditions are at increased risk of severe illness from COVID-19 (according to the CDC, 17 July 2020):

PolicyMap worked with journalists at the New York Times to create this index assessing a county’s relative risk of its population developing severe COVID-19 symptoms. The index represents the relative risk for a high proportion of residents in each county to develop serious health complications from COVID-19 because of underlying health conditions identified by the CDC as contributing to a person’s risk of developing severe symptoms from the virus. These conditions include COPD, heart disease, high blood pressure, diabetes, and obesity.

Estimates of COPD, heart disease, high blood pressure, diabetes, and obesity prevalence at the tract and ZCTA level are from PolicyMap’s Health Outcome Estimates. Estimates of diabetes and obesity prevalence at the county level are from the CDC’s U.S. Diabetes Surveillance System.

Normalized scores were then converted to percentiles and z scores for easier interpretation. Percentiles rank counties from the lowest score to the highest on a scale of 0 to 100, where a score of 50 represents the median value. A county’s z score shows how many standard deviations above or below the average a county’s risk level falls. A score of 0.6, for example, would mean that the county has a higher risk than average, but is still within one standard deviation of the average and is therefore not unusually high. Risk categories from very low to very high are assigned based on z scores.
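
To make the z score and percentile definitions concrete, here is a small sketch on five made-up normalized scores. It uses the population standard deviation and a simple rank-based percentile; PolicyMap's exact formulas are not given in the source, so both choices are assumptions.

```python
import pandas as pd

# Five made-up normalized index scores (illustrative only).
scores = pd.Series([0.80, 0.95, 1.00, 1.05, 1.20], name="index_normalized")

# z score: number of standard deviations above or below the mean.
z = (scores - scores.mean()) / scores.std(ddof=0)

# Rank-based percentile on a 0-100 scale (one of several possible definitions).
pct = scores.rank(pct=True) * 100

print(pd.DataFrame({"score": scores, "z": z.round(2), "percentile": pct}))
```

A county at the mean score gets a z score of 0, and scores further above the mean get larger positive values.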

Constrained features to the following (according to CDC advisory 28 July, 2020):
- Serious heart conditions, such as heart failure, coronary artery disease, or cardiomyopathies (CVDINFR4, CVDCRHD4)
- Cancer (CHCOCNCR)
- Chronic kidney disease (CHCKDNY)
- COPD (CHCCOPD1)
- Obesity (BMI > 30) (_BMI5CAT value 4; not available at county level)
- Sickle cell disease (not available)
- Solid organ transplantation 
- Type 2 diabetes mellitus (proxy; taking insulin: INSULIN)


Proxy Prevention Coverage
- Adult flu shot/spray past 12 mos (FLUSHOT6)


REFERENCES:
1. Covid 19 People with Certain Medical Conditions https://www.cdc.gov/coronavirus/2019-ncov/need-extra-precautions/people-with-medical-conditions.html?CDC_AA_refVal=https%3A%2F%2Fwww.cdc.gov%2Fcoronavirus%2F2019-ncov%2Fneed-extra-precautions%2Fgroups-at-higher-risk.html
2. Centers for Disease Control and Prevention (CDC). Behavioral Risk Factor Surveillance System Survey Data. Atlanta, Georgia: U.S. Department of Health and Human Services, Centers for Disease Control and Prevention, 2017. https://www.cdc.gov/brfss/smart/smart_2017.html
3. Evidence used to update the list of underlying medical conditions that increase a person’s risk of severe illness from COVID-19: https://www.cdc.gov/coronavirus/2019-ncov/need-extra-precautions/evidence-table.html
4. PolicyMap Severe COVID-19 Health Risk Index: https://www.policymap.com/download-covid19-data/

In [19]:
# CDC SMART Data
# preexisting = pd.read_sas("data/llcp2018_2.xpt")
# preexisting.to_csv('data/llcp2018.csv')
# preexisting = pd.read_csv("data/MMSA2017.csv")
# preexisting["_MMSA"] = preexisting["_MMSA"].astype(str)
# print(preexisting.dtypes)
# preexisting['countyFIPS_2d'] = preexisting['_MMSA'].str[2:4]
# preexisting['stateFIPS_2d'] = preexisting['_MMSA'].str[0:2]

In [1]:
import pandas as pd
preexisting = pd.read_csv("data/COVID_Risk_Index_Data.csv")

In [5]:
preexisting

Unnamed: 0,geo_boundary_type_id,geo_boundary_identifier,geo_boundary_definition_id,time_frame,index_raw,index_normalized,index_zscore,index_percentile,index_category
0,4,1001,54,2020,39300,0.98,0.36,65.42,Above Average
1,4,1003,54,2020,145554,0.99,0.43,68.39,Above Average
2,4,1005,54,2020,24233,1.18,1.85,97.09,High
3,4,1007,54,2020,18562,1.06,0.95,83.36,Above Average
4,4,1009,54,2020,45082,1.05,0.89,81.75,Above Average
...,...,...,...,...,...,...,...,...,...
3229,4,72151,54,2020,-9999,-9999.00,-9999.00,-9999.00,
3230,4,72153,54,2020,-9999,-9999.00,-9999.00,-9999.00,
3231,4,78010,54,2020,-9999,-9999.00,-9999.00,-9999.00,
3232,4,78020,54,2020,-9999,-9999.00,-9999.00,-9999.00,
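
The rows for codes 72xxx and 78xxx carry -9999 in every index column and an empty category, which looks like a missing-data sentinel (an assumption; the source does not say so explicitly). Left in place, those values would distort any mean or regression, so one option is to convert them to NaN first, as in this sketch on toy rows copied from the output above:

```python
import numpy as np
import pandas as pd

# Toy rows copied from the output above: one valid county, one sentinel row.
toy_risk = pd.DataFrame({
    "geo_boundary_identifier": [1001, 72151],
    "index_zscore": [0.36, -9999.0],
    "index_percentile": [65.42, -9999.0],
})
metric_cols = ["index_zscore", "index_percentile"]

# Treat -9999 as missing (assumed sentinel) so it cannot skew statistics.
toy_risk[metric_cols] = toy_risk[metric_cols].replace(-9999, np.nan)

# Rows with a usable index value.
valid = toy_risk.dropna(subset=metric_cols)

print(toy_risk)
```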


In [7]:
preexisting["geo_boundary_identifier"] = preexisting["geo_boundary_identifier"].astype(str)
print(preexisting.dtypes)

preexisting.columns.tolist()

geo_boundary_type_id            int64
geo_boundary_identifier        object
geo_boundary_definition_id      int64
time_frame                      int64
index_raw                       int64
index_normalized              float64
index_zscore                  float64
index_percentile              float64
index_category                 object
dtype: object


In [14]:
# Zero-pad to 5 digits first; otherwise "1001" splits into state "10" instead of "01".
fips5 = preexisting['geo_boundary_identifier'].str.zfill(5)
preexisting['countyFIPS_2d'] = fips5.str[2:]
preexisting['stateFIPS_2d'] = fips5.str[0:2]

In [15]:
preexisting

Unnamed: 0,geo_boundary_type_id,geo_boundary_identifier,geo_boundary_definition_id,time_frame,index_raw,index_normalized,index_zscore,index_percentile,index_category,countyFIPS_2d,stateFIPS_2d
0,4,1001,54,2020,39300,0.98,0.36,65.42,Above Average,001,01
1,4,1003,54,2020,145554,0.99,0.43,68.39,Above Average,003,01
2,4,1005,54,2020,24233,1.18,1.85,97.09,High,005,01
3,4,1007,54,2020,18562,1.06,0.95,83.36,Above Average,007,01
4,4,1009,54,2020,45082,1.05,0.89,81.75,Above Average,009,01
...,...,...,...,...,...,...,...,...,...,...,...
3229,4,72151,54,2020,-9999,-9999.00,-9999.00,-9999.00,,151,72
3230,4,72153,54,2020,-9999,-9999.00,-9999.00,-9999.00,,153,72
3231,4,78010,54,2020,-9999,-9999.00,-9999.00,-9999.00,,010,78
3232,4,78020,54,2020,-9999,-9999.00,-9999.00,-9999.00,,020,78


In [17]:
# Keep the state and county keys alongside the risk percentile.
preexisting_clean = preexisting.loc[:,["stateFIPS_2d", "countyFIPS_2d", "index_percentile"]]
preexisting_clean

Unnamed: 0,stateFIPS_2d,countyFIPS_2d,index_percentile
0,01,001,65.42
1,01,003,68.39
2,01,005,97.09
3,01,007,83.36
4,01,009,81.75
...,...,...,...
3229,72,151,-9999.00
3230,72,153,-9999.00
3231,78,010,-9999.00
3232,78,020,-9999.00


### Flu Coverage (CDC Wonder)? 