Steps:
1. Finalise data sets (be brutal; identify roots and stems; address missing values by imputing them with the column mean)
2. Model linear regression statistics (feature importances; chicken feed/auto)
3. Prediction: random forest
4. Data visualisation (pairplots)
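
The mean imputation mentioned in step 1 could look roughly like the sketch below. The frame and its column names (`median_income`, `pct_over_65`) are hypothetical and only stand in for whatever features the final data set ends up with.

```python
import pandas as pd

# Toy frame; the column names are illustrative, not taken from the notebook's files.
features = pd.DataFrame({
    "median_income": [52000.0, None, 61000.0, 47000.0],
    "pct_over_65": [14.0, 18.0, None, 16.0],
})

# Fill each column's gaps with that column's own mean (mean() skips NaN by default).
imputed = features.fillna(features.mean(numeric_only=True))
print(imputed)
```

Mean imputation keeps every row but shrinks the variance of the filled columns, so it is worth checking how many values were missing before relying on it.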

In [1]:
import pandas as pd

### Covid 19 Cases by County (USA Facts/CDC)

For most states, USAFacts directly collects the daily county-level cumulative totals of positive cases and deaths from a table, dashboard, or PDF on the state public health website. This data is compiled either through scraping or manual entry.

REFERENCES:
1. https://usafacts.org/visualizations/coronavirus-covid-19-spread-map/
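
Because the daily columns hold cumulative totals, the count to date is simply the latest column, and daily new cases are the column-to-column differences. The sketch below illustrates this on a three-day slice of the two Alabama counties shown in this notebook; the frame name `cumulative` is introduced here only for illustration.

```python
import pandas as pd

# Cumulative confirmed cases for two counties over three days
# (values copied from the USAFacts table displayed in this notebook).
cumulative = pd.DataFrame(
    {"7/29/20": [974, 2835], "7/30/20": [1002, 3028], "7/31/20": [1015, 3101]},
    index=["Autauga County", "Baldwin County"],
)

# The running total to date is the last cumulative column.
total_to_date = cumulative.iloc[:, -1]

# Daily new cases are the differences between consecutive columns
# (the first column has no predecessor, so it becomes NaN).
new_cases = cumulative.diff(axis=1)
print(total_to_date)
print(new_cases)
```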

In [2]:
covid_cases = pd.read_csv("data/covid_confirmed_usafacts_200803.csv")

In [3]:
covid_cases.head()

Unnamed: 0,countyFIPS,County Name,State,stateFIPS,1/22/20,1/23/20,1/24/20,1/25/20,1/26/20,1/27/20,...,7/23/20,7/24/20,7/25/20,7/26/20,7/27/20,7/28/20,7/29/20,7/30/20,7/31/20,8/1/20
0,0,Statewide Unallocated,AL,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,1001,Autauga County,AL,1,0,0,0,0,0,0,...,905,921,932,942,965,974,974,1002,1015,1030
2,1003,Baldwin County,AL,1,0,0,0,0,0,0,...,2461,2513,2662,2708,2770,2835,2835,3028,3101,3142
3,1005,Barbour County,AL,1,0,0,0,0,0,0,...,534,539,552,562,569,575,575,585,598,602
4,1007,Bibb County,AL,1,0,0,0,0,0,0,...,289,303,318,324,334,337,338,352,363,368


In [4]:
covid_cases_dropped = covid_cases.drop(columns=['8/1/20'])

In [5]:
covid_cases_dropped_only = covid_cases_dropped.iloc[:,-192:]

In [6]:
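# Row-wise sum of the 192 daily columns (1/22/20 to 7/31/20). These columns are cumulative, so this sum is not the case count to date (that is the 7/31/20 column).
covid_cases_total = covid_cases_dropped['Total Cases'] = covid_cases_dropped.iloc[:, -192:].sum(axis=1)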

In [7]:
# Take a copy so the column assignment below acts on an independent frame.
covid_cases_filter = covid_cases_dropped.loc[:,["countyFIPS", "County Name", "State", "stateFIPS", "Total Cases"]].copy()
covid_cases_filter["countyFIPS"] = covid_cases_filter["countyFIPS"].astype(str)
print(covid_cases_filter.dtypes)

countyFIPS     object
County Name    object
State          object
stateFIPS       int64
Total Cases     int64
dtype: object


In [8]:
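# Keep the last two characters of the FIPS string. County codes are three digits, so codes above 099 collide (1001 and 1101 both give "01").
covid_cases_filter['countyFIPS_2d'] = covid_cases_filter['countyFIPS'].str[-2:]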
covid_cases_filter = covid_cases_filter.loc[:,["stateFIPS", "countyFIPS_2d", "County Name", "State", "Total Cases"]]
covid_cases_filter

Unnamed: 0,stateFIPS,countyFIPS_2d,County Name,State,Total Cases
0,1,0,Statewide Unallocated,AL,0
1,1,01,Autauga County,AL,39746
2,1,03,Baldwin County,AL,76970
3,1,05,Barbour County,AL,24625
4,1,07,Bibb County,AL,13636
...,...,...,...,...,...
3190,56,37,Sweetwater County,WY,7361
3191,56,39,Teton County,WY,13823
3192,56,41,Uinta County,WY,9737
3193,56,43,Washakie County,WY,3104
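
The combined FIPS code is a two-digit state code followed by a three-digit county code, so a key built from only the last two characters is not unique within a state. A minimal sketch of an alternative split, assuming the codes are first zero-padded to five characters; the column name `countyFIPS_3d` is hypothetical and not used elsewhere in this notebook.

```python
import pandas as pd

# Example combined FIPS codes as they appear after read_csv (integers, no leading zeros).
county_fips = pd.Series([0, 1001, 1101, 56043])

# Pad to five characters so the state part is always the first two
# and the county part is always the last three.
padded = county_fips.astype(str).str.zfill(5)

keys = pd.DataFrame({
    "stateFIPS": padded.str[:2],
    "countyFIPS_3d": padded.str[2:],  # hypothetical column name
})
print(keys)
```

With three characters, 1001 and 1101 map to `001` and `101` instead of both mapping to `01`.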


### Covid 19 Deaths by County (USA Facts/CDC)

In [9]:
covid_deaths = pd.read_csv("data/covid_deaths_usafacts_200803.csv")

In [10]:
covid_deaths_dropped = covid_deaths.drop(columns=['8/1/20'])

In [11]:
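# Row-wise sum of the 192 daily columns (1/22/20 to 7/31/20). These columns are cumulative, so this sum is not the death count to date (that is the 7/31/20 column).
covid_deaths_total = covid_deaths_dropped['Total Deaths'] = covid_deaths_dropped.iloc[:, -192:].sum(axis=1)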

In [13]:
# Take a copy so the column assignment below acts on an independent frame.
covid_deaths_filter = covid_deaths_dropped.loc[:,["countyFIPS", "County Name", "State", "stateFIPS", "Total Deaths"]].copy()
covid_deaths_filter["countyFIPS"] = covid_deaths_filter["countyFIPS"].astype(str)
print(covid_deaths_filter.dtypes)

countyFIPS      object
County Name     object
State           object
stateFIPS        int64
Total Deaths     int64
dtype: object


In [14]:
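# Keep the last two characters of the FIPS string. County codes are three digits, so codes above 099 collide (1001 and 1101 both give "01").
covid_deaths_filter['countyFIPS_2d'] = covid_deaths_filter['countyFIPS'].str[-2:]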
covid_deaths_filter = covid_deaths_filter.loc[:,["stateFIPS", "countyFIPS_2d", "County Name", "State", "Total Deaths"]]
covid_deaths_filter

Unnamed: 0,stateFIPS,countyFIPS_2d,County Name,State,Total Deaths
0,1,0,Statewide Unallocated,AL,0
1,1,01,Autauga County,AL,909
2,1,03,Baldwin County,AL,958
3,1,05,Barbour County,AL,155
4,1,07,Bibb County,AL,103
...,...,...,...,...,...
3190,56,37,Sweetwater County,WY,34
3191,56,39,Teton County,WY,101
3192,56,41,Uinta County,WY,0
3193,56,43,Washakie County,WY,291
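
Since the cases and deaths tables share the same FIPS columns, they can be joined into one county-level frame. The sketch below is one possible way to do that, not the notebook's own step: the frames are toy data, the death counts are made-up placeholders, and `validate="one_to_one"` is added so pandas raises an error if a key is duplicated on either side.

```python
import pandas as pd

# Toy inputs keyed on the full county FIPS code.
# The case counts are from the table above; the death counts are placeholders.
cases = pd.DataFrame({
    "stateFIPS": [1, 1],
    "countyFIPS": [1001, 1003],
    "cases": [1015, 3101],
})
deaths = pd.DataFrame({
    "stateFIPS": [1, 1],
    "countyFIPS": [1001, 1003],
    "deaths": [20, 22],  # placeholder values, not real data
})

# Inner join on both keys; validate guards against duplicated county keys.
merged = cases.merge(
    deaths,
    on=["stateFIPS", "countyFIPS"],
    how="inner",
    validate="one_to_one",
)
print(merged)
```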


### Per capita incidence of poverty by U.S. county (U.S. Census)

The poverty universe is made up of persons for whom the Census Bureau can determine poverty status (either "in poverty" or "not in poverty").

REFERENCES:
1. SAIPE Model Input Data: https://www.census.gov/data/datasets/time-series/demo/saipe/model-tables.html

In [15]:
poverty = pd.read_csv("data/allpovu.csv")
poverty_all_ages = poverty.loc[:,["State FIPS code", "County FIPS code", "Name", "State Postal Code", "Poverty Universe, All Ages"]]
poverty_all_ages

Unnamed: 0,State FIPS code,County FIPS code,Name,State Postal Code,"Poverty Universe, All Ages"
0,0,0,United States,US,319184033.0
1,1,0,Alabama,AL,4763811.0
2,1,1,Autauga County,AL,55073.0
3,1,3,Baldwin County,AL,215255.0
4,1,5,Barbour County,AL,21979.0
...,...,...,...,...,...
3196,56,37,Sweetwater County,WY,42205.0
3197,56,39,Teton County,WY,22888.0
3198,56,41,Uinta County,WY,20135.0
3199,56,43,Washakie County,WY,7735.0
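
The table mixes national, state, and county rows. Judging by the rows displayed above, the aggregate rows carry a county code of 0, so a county-only view could be obtained as sketched below. That reading of the county code is an assumption based on the visible rows and should be checked against the SAIPE documentation.

```python
import pandas as pd

# First rows of the poverty table as displayed above.
poverty_all_ages = pd.DataFrame({
    "State FIPS code": [0, 1, 1, 1],
    "County FIPS code": [0, 0, 1, 3],
    "Name": ["United States", "Alabama", "Autauga County", "Baldwin County"],
    "Poverty Universe, All Ages": [319184033.0, 4763811.0, 55073.0, 215255.0],
})

# Assumption: a county code of 0 marks a national or state aggregate row.
counties_only = poverty_all_ages[poverty_all_ages["County FIPS code"] != 0]
print(counties_only)
```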


### County Population by Racial/Ethnic Characteristics 2010-2019 (U.S. Census Bureau)

METHODOLOGY FOR THE UNITED STATES POPULATION ESTIMATES: VINTAGE 2019
Nation, States, Counties, and Puerto Rico – April 1, 2010 to July 1, 2019

Each year, the United States Census Bureau produces and publishes estimates of the population for the
nation, states, counties, state/county equivalents, and Puerto Rico. We estimate the resident population for
each year since the most recent decennial census by using measures of population change. The resident
population includes all people currently residing in the United States.

With each annual release of population estimates, the Population Estimates Program revises and updates the
entire time series of estimates from April 1, 2010 to July 1 of the current year, which we refer to as the
vintage year. We use the term “vintage” to denote an entire time series created with a consistent population
starting point and methodology. The release of a new vintage of estimates supersedes any previous series
and incorporates the most up-to-date input data and methodological improvements.

REFERENCES:
1. Annual County Resident Population Estimates by Age, Sex, Race, and Hispanic Origin: April 1, 2010 to July 1, 2019 (https://www.census.gov/data/tables/time-series/demo/popest/2010s-counties-detail.html)

In [16]:
race = pd.read_csv("data/cc-est2019-alldata.csv", encoding = "ISO-8859-1")

In [17]:
race

Unnamed: 0,SUMLEV,STATE,COUNTY,STNAME,CTYNAME,YEAR,AGEGRP,TOT_POP,TOT_MALE,TOT_FEMALE,...,HWAC_MALE,HWAC_FEMALE,HBAC_MALE,HBAC_FEMALE,HIAC_MALE,HIAC_FEMALE,HAAC_MALE,HAAC_FEMALE,HNAC_MALE,HNAC_FEMALE
0,50,1,1,Alabama,Autauga County,1,0,54571,26569,28002,...,607,538,57,48,26,32,9,11,19,10
1,50,1,1,Alabama,Autauga County,1,1,3579,1866,1713,...,77,56,9,5,4,1,0,0,2,1
2,50,1,1,Alabama,Autauga County,1,2,3991,2001,1990,...,64,66,2,3,2,7,2,3,2,0
3,50,1,1,Alabama,Autauga County,1,3,4290,2171,2119,...,51,57,13,7,5,5,2,1,1,1
4,50,1,1,Alabama,Autauga County,1,4,4290,2213,2077,...,48,44,7,5,0,2,2,1,3,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
716371,50,56,45,Wyoming,Weston County,12,14,499,280,219,...,4,5,0,0,0,0,0,0,0,0
716372,50,56,45,Wyoming,Weston County,12,15,352,180,172,...,1,2,0,0,0,0,3,0,0,0
716373,50,56,45,Wyoming,Weston County,12,16,229,107,122,...,2,0,0,0,0,0,0,0,0,0
716374,50,56,45,Wyoming,Weston County,12,17,198,82,116,...,1,1,0,0,1,0,0,0,0,0
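
Each county appears once per `YEAR` code and `AGEGRP` code, which is why the file has over 700,000 rows. A minimal sketch of reducing it to one row per county follows. It assumes, per the Census file layout, that `YEAR == 12` is the July 1, 2019 estimate and `AGEGRP == 0` is the all-ages total; both codes are worth confirming against the layout document. The `YEAR == 12` values in the toy frame are placeholders.

```python
import pandas as pd

# Toy slice of the race table for one county.
# The YEAR == 1 rows are from the output above; the YEAR == 12 values are placeholders.
race = pd.DataFrame({
    "STATE": [1, 1, 1, 1],
    "COUNTY": [1, 1, 1, 1],
    "YEAR": [1, 1, 12, 12],
    "AGEGRP": [0, 1, 0, 1],
    "TOT_POP": [54571, 3579, 55000, 3500],
})

# Assumed codes: YEAR 12 = latest estimate, AGEGRP 0 = all ages combined.
latest_totals = race[(race["YEAR"] == 12) & (race["AGEGRP"] == 0)]
print(latest_totals)
```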


In [18]:
race.columns.tolist()

['SUMLEV',
 'STATE',
 'COUNTY',
 'STNAME',
 'CTYNAME',
 'YEAR',
 'AGEGRP',
 'TOT_POP',
 'TOT_MALE',
 'TOT_FEMALE',
 'WA_MALE',
 'WA_FEMALE',
 'BA_MALE',
 'BA_FEMALE',
 'IA_MALE',
 'IA_FEMALE',
 'AA_MALE',
 'AA_FEMALE',
 'NA_MALE',
 'NA_FEMALE',
 'TOM_MALE',
 'TOM_FEMALE',
 'WAC_MALE',
 'WAC_FEMALE',
 'BAC_MALE',
 'BAC_FEMALE',
 'IAC_MALE',
 'IAC_FEMALE',
 'AAC_MALE',
 'AAC_FEMALE',
 'NAC_MALE',
 'NAC_FEMALE',
 'NH_MALE',
 'NH_FEMALE',
 'NHWA_MALE',
 'NHWA_FEMALE',
 'NHBA_MALE',
 'NHBA_FEMALE',
 'NHIA_MALE',
 'NHIA_FEMALE',
 'NHAA_MALE',
 'NHAA_FEMALE',
 'NHNA_MALE',
 'NHNA_FEMALE',
 'NHTOM_MALE',
 'NHTOM_FEMALE',
 'NHWAC_MALE',
 'NHWAC_FEMALE',
 'NHBAC_MALE',
 'NHBAC_FEMALE',
 'NHIAC_MALE',
 'NHIAC_FEMALE',
 'NHAAC_MALE',
 'NHAAC_FEMALE',
 'NHNAC_MALE',
 'NHNAC_FEMALE',
 'H_MALE',
 'H_FEMALE',
 'HWA_MALE',
 'HWA_FEMALE',
 'HBA_MALE',
 'HBA_FEMALE',
 'HIA_MALE',
 'HIA_FEMALE',
 'HAA_MALE',
 'HAA_FEMALE',
 'HNA_MALE',
 'HNA_FEMALE',
 'HTOM_MALE',
 'HTOM_FEMALE',
 'HWAC_MALE',
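 'HWAC_FEMALE',
 'HBAC_MALE',
 'HBAC_FEMALE',
 'HIAC_MALE',
 'HIAC_FEMALE',
 'HAAC_MALE',
 'HAAC_FEMALE',
 'HNAC_MALE',
 'HNAC_FEMALE']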

### Incidence of Pre-existing Conditions & Coverage of Flu Vaccine

According to the CDC (17 July 2020), people of any age with certain underlying medical conditions are at increased risk of severe illness from COVID-19.

The Behavioral Risk Factor Surveillance System (BRFSS) is the nation’s premier system of health-related telephone surveys that collect state data about U.S. residents regarding their health-related risk behaviors, chronic health conditions, and use of preventive services.

REFERENCES:
1. Covid 19 People with Certain Medical Conditions https://www.cdc.gov/coronavirus/2019-ncov/need-extra-precautions/people-with-medical-conditions.html?CDC_AA_refVal=https%3A%2F%2Fwww.cdc.gov%2Fcoronavirus%2F2019-ncov%2Fneed-extra-precautions%2Fgroups-at-higher-risk.html
2. 2017 SMART: BRFSS City and County Data and Documentation: https://www.cdc.gov/brfss/smart/smart_2017.html


In [20]:
preexisting = pd.read_csv("data/MMSA2017.csv")

In [21]:
preexisting["_MMSA"] = preexisting["_MMSA"].astype(str)
print(preexisting.dtypes)

Unnamed: 0      int64
DISPCODE      float64
STATERE1      float64
SAFETIME      float64
HHADULT       float64
               ...   
_AIDTST3      float64
_MMSA          object
_MMSAWT       float64
SEQNO         float64
MMSANAME       object
Length: 178, dtype: object
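
Because `_MMSA` is read as a float, `astype(str)` keeps the trailing `.0` (for example `"10100.0"`). If a clean five-character code is wanted, one option is to cast through pandas' nullable integer type first. This is a sketch on toy values rather than the notebook's own step.

```python
import pandas as pd

# Two _MMSA values as they appear in the file (floats).
mmsa = pd.Series([10100.0, 49340.0])

# Direct conversion keeps the decimal part in the text.
as_float_text = mmsa.astype(str)

# Casting to the nullable Int64 type first drops the ".0"
# and still tolerates missing values in the column.
as_code_text = mmsa.astype("Int64").astype(str)
print(as_float_text.tolist(), as_code_text.tolist())
```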


In [22]:
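# Characters 3-4 of the _MMSA string. _MMSA identifies a metropolitan/micropolitan statistical area, not a state+county FIPS code, so this slice is not a county FIPS value.
preexisting['countyFIPS_2d'] = preexisting['_MMSA'].str[2:4]

In [23]:
# Characters 1-2 of the _MMSA string; for the same reason this is not a state FIPS value.
preexisting['stateFIPS_2d'] = preexisting['_MMSA'].str[0:2]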
preexisting

Unnamed: 0.1,Unnamed: 0,DISPCODE,STATERE1,SAFETIME,HHADULT,GENHLTH,PHYSHLTH,MENTHLTH,POORHLTH,HLTHPLN1,...,_RFSEAT3,_FLSHOT6,_PNEUMO2,_AIDTST3,_MMSA,_MMSAWT,SEQNO,MMSANAME,countyFIPS_2d,stateFIPS_2d
0,0,1200.0,,1.0,4.0,3.0,10.0,88.0,88.0,1.0,...,2.0,,,2.0,10100.0,110.207620,2.017000e+09,"b'Aberdeen, SD, Micropolitan Statistical Area'",10,10
1,1,1200.0,,1.0,,2.0,3.0,88.0,88.0,1.0,...,1.0,1.0,2.0,2.0,10100.0,28.646615,2.017000e+09,"b'Aberdeen, SD, Micropolitan Statistical Area'",10,10
2,2,1200.0,,1.0,3.0,2.0,8.0,5.0,1.0,1.0,...,2.0,,,2.0,10100.0,115.602476,2.017000e+09,"b'Aberdeen, SD, Micropolitan Statistical Area'",10,10
3,3,1200.0,,1.0,2.0,4.0,30.0,3.0,3.0,1.0,...,2.0,,,1.0,10100.0,376.026237,2.017000e+09,"b'Aberdeen, SD, Micropolitan Statistical Area'",10,10
4,4,1200.0,,1.0,1.0,1.0,88.0,88.0,,1.0,...,1.0,1.0,1.0,2.0,10100.0,20.708628,2.017000e+09,"b'Aberdeen, SD, Micropolitan Statistical Area'",10,10
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
230870,230870,1200.0,,1.0,1.0,3.0,3.0,10.0,88.0,1.0,...,1.0,1.0,1.0,2.0,49340.0,972.800188,2.017001e+09,"b'Worcester, MA-CT, Metropolitan Statistical A...",34,49
230871,230871,1200.0,,1.0,3.0,4.0,88.0,2.0,88.0,1.0,...,9.0,,,,49340.0,1912.000660,2.017001e+09,"b'Worcester, MA-CT, Metropolitan Statistical A...",34,49
230872,230872,1200.0,,1.0,2.0,4.0,3.0,1.0,3.0,1.0,...,1.0,,,2.0,49340.0,298.676655,2.017001e+09,"b'Worcester, MA-CT, Metropolitan Statistical A...",34,49
230873,230873,1200.0,,1.0,4.0,1.0,88.0,88.0,,1.0,...,1.0,,,1.0,49340.0,650.621713,2.017001e+09,"b'Worcester, MA-CT, Metropolitan Statistical A...",34,49
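
Each row above is one survey respondent, so area-level figures need to be aggregated using the survey weight `_MMSAWT`. The sketch below computes a weighted flu-shot share per area on toy data. It assumes that `_FLSHOT6` is coded 1 for "yes" and 2 for "no", with other codes and blanks treated as unanswered; that coding should be checked against the BRFSS codebook. The area names and weights are made up.

```python
import pandas as pd

# Toy respondent-level data; area names and weights are illustrative only.
survey = pd.DataFrame({
    "MMSANAME": ["Area A", "Area A", "Area A", "Area B", "Area B"],
    "_FLSHOT6": [1.0, 2.0, None, 1.0, 1.0],
    "_MMSAWT": [100.0, 300.0, 50.0, 20.0, 80.0],
})

# Assumption: 1 = had a flu shot, 2 = did not; everything else is dropped.
answered = survey[survey["_FLSHOT6"].isin([1.0, 2.0])].copy()

# Weight counted only for respondents who answered "yes".
answered["vaccinated_wt"] = answered["_MMSAWT"].where(answered["_FLSHOT6"] == 1.0, 0.0)

# Weighted share = weight of "yes" answers / weight of all valid answers, per area.
totals = answered.groupby("MMSANAME")[["vaccinated_wt", "_MMSAWT"]].sum()
flu_share = totals["vaccinated_wt"] / totals["_MMSAWT"]
print(flu_share)
```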
