Steps:
1. Finalise data sets (be brutal, identify roots and stems; address missing values, model missing value evaluate to mean)
2. Model linear regression statistics (feature importances; chicken feed/auto)
3. Prediction: random forest
4. data visualisation (pairplots)

In [1]:
import pandas as pd

### Covid 19 Cases by County (USA Facts/CDC)

For most states, USAFacts directly collects the daily county-level cumulative totals of positive cases and deaths from a table, dashboard, or PDF on the state public health website. This data is compiled either through scraping or manual entry. The underlying data is available for download below the US county map and has helped government agencies like the Centers for Disease Control and Prevention in its nationwide efforts.

REFERENCES:
1. https://usafacts.org/visualizations/coronavirus-covid-19-spread-map/

In [2]:
covid_cases = pd.read_csv("data/covid_confirmed_usafacts_200803.csv")

In [3]:
covid_cases.head()

Unnamed: 0,countyFIPS,County Name,State,stateFIPS,1/22/20,1/23/20,1/24/20,1/25/20,1/26/20,1/27/20,...,7/23/20,7/24/20,7/25/20,7/26/20,7/27/20,7/28/20,7/29/20,7/30/20,7/31/20,8/1/20
0,0,Statewide Unallocated,AL,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,1001,Autauga County,AL,1,0,0,0,0,0,0,...,905,921,932,942,965,974,974,1002,1015,1030
2,1003,Baldwin County,AL,1,0,0,0,0,0,0,...,2461,2513,2662,2708,2770,2835,2835,3028,3101,3142
3,1005,Barbour County,AL,1,0,0,0,0,0,0,...,534,539,552,562,569,575,575,585,598,602
4,1007,Bibb County,AL,1,0,0,0,0,0,0,...,289,303,318,324,334,337,338,352,363,368


In [4]:
covid_cases_dropped = covid_cases.drop(columns=['8/1/20'])

In [5]:
covid_cases_dropped_only = covid_cases_dropped.iloc[:,-192:]

In [6]:
covid_cases_total = covid_cases_dropped['Total Cases']= covid_cases_dropped.iloc[:, -192:].sum(axis=1)

In [7]:
covid_cases_filter = covid_cases_dropped.loc[:,["countyFIPS", "County Name", "State", "stateFIPS", "Total Cases"]]
covid_cases_filter["countyFIPS"] = covid_cases_filter["countyFIPS"].astype(str)
print(covid_cases_filter.dtypes)

countyFIPS     object
County Name    object
State          object
stateFIPS       int64
Total Cases     int64
dtype: object


In [88]:
covid_cases_filter['countyFIPS_2d'] = covid_cases_filter['countyFIPS'].str[-3:]
covid_cases_filter['countyFIPS_2d'] = covid_cases_filter['countyFIPS_2d'].astype(str).str.zfill(3)
covid_cases_filter = covid_cases_filter.loc[:,["countyFIPS", "stateFIPS", "countyFIPS_2d", "County Name", "State", "Total Cases"]]

In [89]:
covid_cases_clean = covid_cases_filter.copy()

In [90]:
covid_cases_clean = covid_cases_clean.loc[covid_cases_clean['County Name'] != "Statewide Unallocated"]
covid_cases_clean["countyFIPS_2d"] = covid_cases_clean["countyFIPS_2d"].astype(int)
covid_cases_clean

Unnamed: 0,countyFIPS,stateFIPS,countyFIPS_2d,County Name,State,Total Cases
1,1001,1,1,Autauga County,AL,39746
2,1003,1,3,Baldwin County,AL,76970
3,1005,1,5,Barbour County,AL,24625
4,1007,1,7,Bibb County,AL,13636
5,1009,1,9,Blount County,AL,19311
...,...,...,...,...,...,...
3190,56037,56,37,Sweetwater County,WY,7361
3191,56039,56,39,Teton County,WY,13823
3192,56041,56,41,Uinta County,WY,9737
3193,56043,56,43,Washakie County,WY,3104


In [91]:
test_cases = covid_cases_clean.loc[(covid_cases_clean["County Name"] == "Montgomery County")]
test_cases

Unnamed: 0,countyFIPS,stateFIPS,countyFIPS_2d,County Name,State,Total Cases
51,1101,1,101,Montgomery County,AL,270419
164,5097,5,97,Montgomery County,AR,494
501,13209,13,209,Montgomery County,GA,3094
677,17135,17,135,Montgomery County,IL,5509
766,18107,18,107,Montgomery County,IN,23333
874,19137,19,137,Montgomery County,IA,1067
968,20125,20,125,Montgomery County,KS,4069
1098,21173,21,173,Montgomery County,KY,4260
1229,24031,24,31,Montgomery County,MD,1221275
1475,28097,28,97,Montgomery County,MS,11547


### Covid 19 Deaths by County (USA Facts/CDC)

For most states, USAFacts directly collects the daily county-level cumulative totals of positive cases and deaths from a table, dashboard, or PDF on the state public health website. This data is compiled either through scraping or manual entry. The underlying data is available for download below the US county map and has helped government agencies like the Centers for Disease Control and Prevention in its nationwide efforts.

REFERENCES:
1. https://usafacts.org/visualizations/coronavirus-covid-19-spread-map/

In [11]:
covid_deaths = pd.read_csv("data/covid_deaths_usafacts_200803.csv")

In [12]:
covid_deaths_dropped = covid_deaths.drop(columns=['8/1/20'])

In [13]:
covid_deaths_total = covid_deaths_dropped['Total Deaths']= covid_deaths_dropped.iloc[:, -192:].sum(axis=1)

In [14]:
covid_deaths_filter = covid_deaths_dropped.loc[:,["countyFIPS", "County Name", "State", "stateFIPS", "Total Deaths"]]

In [15]:
covid_deaths_filter = covid_deaths_dropped.loc[:,["countyFIPS", "County Name", "State", "stateFIPS", "Total Deaths"]]
covid_deaths_filter["countyFIPS"] = covid_deaths_filter["countyFIPS"].astype(str)
print(covid_deaths_filter.dtypes)

countyFIPS      object
County Name     object
State           object
stateFIPS        int64
Total Deaths     int64
dtype: object


In [16]:
covid_deaths_filter['countyFIPS_2d'] = covid_deaths_filter['countyFIPS'].str[-3:]
covid_deaths_filter['countyFIPS_2d'] = covid_deaths_filter['countyFIPS_2d'].astype(str).str.zfill(3)
covid_deaths_filter = covid_deaths_filter.loc[:,["countyFIPS", "stateFIPS", "countyFIPS_2d", "County Name", "State", "Total Deaths"]]
covid_deaths_filter

Unnamed: 0,countyFIPS,stateFIPS,countyFIPS_2d,County Name,State,Total Deaths
0,0,1,0,Statewide Unallocated,AL,0
1,1001,1,01,Autauga County,AL,909
2,1003,1,03,Baldwin County,AL,958
3,1005,1,05,Barbour County,AL,155
4,1007,1,07,Bibb County,AL,103
...,...,...,...,...,...,...
3190,56037,56,37,Sweetwater County,WY,34
3191,56039,56,39,Teton County,WY,101
3192,56041,56,41,Uinta County,WY,0
3193,56043,56,43,Washakie County,WY,291


In [17]:
covid_deaths_clean = covid_deaths_filter.copy()
covid_deaths_clean = covid_deaths_clean.loc[covid_deaths_clean['County Name'] != "Statewide Unallocated"]

In [18]:
covid_deaths_clean["countyFIPS_2d"] = covid_deaths_clean["countyFIPS_2d"].astype(int)
covid_deaths_clean.describe()

Unnamed: 0,stateFIPS,countyFIPS_2d,Total Deaths
count,3146.0,3146.0,3146.0
mean,30.267324,43.578195,3606.441513
std,15.150104,28.737545,25563.202914
min,1.0,0.0,0.0
25%,18.0,19.0,0.0
50%,29.0,41.0,125.0
75%,45.0,67.0,789.0
max,56.0,99.0,710054.0


### Per capital incidence of poverty by U.S county (U.S Census)

The poverty universe is made up of persons for whom the Census Bureau can determine poverty status (either "in poverty" or "not in poverty").

REFERENCES:
1. SAIPE Model Input Data: https://www.census.gov/data/datasets/time-series/demo/saipe/model-tables.html

In [85]:
poverty = pd.read_csv("data/allpovu.csv")
poverty
# poverty_all_ages = poverty.loc[:,["State FIPS code", "County FIPS code", "Name", "State Postal Code", "Poverty Universe, All Ages"]]
# poverty_all_ages.rename(columns={'State FIPS code': 'stateFIPS', 'County FIPS code': 'countyFIPS_2d'}, inplace=True)
# poverty_all_ages

Unnamed: 0,State FIPS code,County FIPS code,Name,State Postal Code,"Poverty Universe, All Ages","Poverty Universe, Age 5-17 related","Poverty Universe, Age 0-17","Poverty Universe, Age 0-4","Poverty Universe, All Ages.1","Poverty Universe, Age 5-17 related.1",...,"Poverty Universe, Age 0-17.18","Poverty Universe, Age 0-4.18","Poverty Universe, All Ages.19","Poverty Universe, Age 5-17 related.19","Poverty Universe, Age 0-17.19","Poverty Universe, Age 0-4.19","Poverty Universe, All Ages.20","Poverty Universe, Age 5-17 related.20","Poverty Universe, Age 0-17.20","Poverty Universe, Age 0-4.20"
0,0,0,United States,US,319184033.0,52529919.0,72163269.0,19301529.0,317741588.0,52669201.0,...,71741141.0,19181906.0,276207757.0,51642359.0,71684956.0,18968750.0,271059449.0,51060953.0,71338364.0,19382484.0
1,1,0,Alabama,AL,4763811.0,781913.0,1069994.0,284188.0,4752519.0,790771.0,...,1104080.0,296196.0,4368014.0,804291.0,1120718.0,293558.0,4348444.0,789510.0,1088427.0,295264.0
2,1,1,Autauga County,AL,55073.0,9677.0,12987.0,,55021.0,9911.0,...,12377.0,,43711.0,9245.0,12507.0,,43524.0,8856.0,12148.0,
3,1,3,Baldwin County,AL,215255.0,34508.0,46265.0,,209922.0,34058.0,...,34503.0,,139273.0,25048.0,34302.0,,136585.0,24609.0,33859.0,
4,1,5,Barbour County,AL,21979.0,3848.0,5106.0,,22224.0,3901.0,...,7148.0,,26480.0,5422.0,7341.0,,25482.0,5138.0,6966.0,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3196,56,37,Sweetwater County,WY,42205.0,8155.0,11023.0,,42690.0,8349.0,...,10009.0,,37170.0,7784.0,10656.0,,39688.0,9049.0,12244.0,
3197,56,39,Teton County,WY,22888.0,3061.0,4172.0,,23080.0,3154.0,...,3545.0,,18235.0,2494.0,3554.0,,14750.0,2110.0,3177.0,
3198,56,41,Uinta County,WY,20135.0,4298.0,5757.0,,20328.0,4372.0,...,6200.0,,19525.0,4722.0,6521.0,,20389.0,5482.0,7455.0,
3199,56,43,Washakie County,WY,7735.0,1320.0,1722.0,,7916.0,1414.0,...,2057.0,,8155.0,1620.0,2164.0,,8472.0,1651.0,2231.0,


In [20]:
poverty_all_ages.rename(columns={'Name': 'County Name', 'State Postal Code': 'State'}, inplace=True)
poverty_clean = poverty_all_ages.copy()
poverty_clean['County FIPS code'] = poverty_clean['County FIPS code'].astype(str).str.zfill(3)
poverty_clean["countyFIPS"] = poverty_clean["State FIPS code"] + poverty_clean["County FIPS code"]

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3201 entries, 0 to 3200
Data columns (total 5 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   stateFIPS                   3201 non-null   int64  
 1   countyFIPS_2d               3201 non-null   int64  
 2   County Name                 3201 non-null   object 
 3   State                       3201 non-null   object 
 4   Poverty Universe, All Ages  3193 non-null   float64
dtypes: float64(1), int64(2), object(2)
memory usage: 125.2+ KB


In [None]:
poverty_clean["countyFIPS_2d"] = poverty_clean["countyFIPS_2d"].astype(int)
poverty_clean["stateFIPS"] = poverty_clean["stateFIPS"].astype(int)
poverty_clean["countyFIPS"] = poverty_clean["countyFIPS"].astype(int)
poverty_clean.info()

In [21]:
poverty_clean = poverty_clean.loc[poverty_clean['countyFIPS_2d'] != 0]

Unnamed: 0,stateFIPS,countyFIPS_2d,County Name,State,"Poverty Universe, All Ages"
2,1,1,Autauga County,AL,55073.0
3,1,3,Baldwin County,AL,215255.0
4,1,5,Barbour County,AL,21979.0
5,1,7,Bibb County,AL,20212.0
6,1,9,Blount County,AL,57238.0
...,...,...,...,...,...
3196,56,37,Sweetwater County,WY,42205.0
3197,56,39,Teton County,WY,22888.0
3198,56,41,Uinta County,WY,20135.0
3199,56,43,Washakie County,WY,7735.0


In [66]:
# poverty_clean["Poverty Universe, All Ages"] = poverty_clean["Poverty Universe, All Ages"].astype(int)

poverty_clean.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 3149 entries, 2 to 3200
Data columns (total 5 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   stateFIPS                   3149 non-null   int64  
 1   countyFIPS_2d               3149 non-null   int64  
 2   County Name                 3149 non-null   object 
 3   State                       3149 non-null   object 
 4   Poverty Universe, All Ages  3141 non-null   float64
dtypes: float64(1), int64(2), object(2)
memory usage: 147.6+ KB


In [67]:
null_data_pov = poverty_clean[poverty_clean.isnull().any(axis=1)]
null_data_pov

Unnamed: 0,stateFIPS,countyFIPS_2d,County Name,State,"Poverty Universe, All Ages"
92,2,201,Prince of Wales-Outer Ketchikan Census Area,AK,
95,2,232,Skagway-Hoonah-Angoon Census Area,AK,
98,2,270,Wade Hampton Census Area,AK,
100,2,280,Wrangell-Petersburg Census Area,AK,
565,15,5,Kalawao County,HI,
2465,46,113,Shannon County,SD,
2969,51,515,Bedford city,VA,
2974,51,560,Clifton Forge,VA,


In [71]:
poverty_clean['Poverty Universe, All Ages'] = poverty_clean['Poverty Universe, All Ages'].fillna((poverty_clean['Poverty Universe, All Ages'].mean()))

In [77]:
poverty_clean.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 3149 entries, 2 to 3200
Data columns (total 5 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   stateFIPS                   3149 non-null   int64  
 1   countyFIPS_2d               3149 non-null   int64  
 2   County Name                 3149 non-null   object 
 3   State                       3149 non-null   object 
 4   Poverty Universe, All Ages  3149 non-null   float64
dtypes: float64(1), int64(2), object(2)
memory usage: 147.6+ KB


### County Population by Racial/Ethnic Characteristics 2010-2019 (U.S. Census Bureau)

METHODOLOGY FOR THE UNITED STATES POPULATION ESTIMATES: VINTAGE 2019
Nation, States, Counties, and Puerto Rico – April 1, 2010 to July 1, 2019

Each year, the United States Census Bureau produces and publishes estimates of the population for the
nation, states, counties, state/county equivalents, and Puerto Rico.1 We estimate the resident population for
each year since the most recent decennial census by using measures of population change. The resident
population includes all people currently residing in the United States.

With each annual release of population estimates, the Population Estimates Program revises and updates the
entire time series of estimates from April 1, 2010 to July 1 of the current year, which we refer to as the
vintage year. We use the term “vintage” to denote an entire time series created with a consistent population
starting point and methodology. The release of a new vintage of estimates supersedes any previous series
and incorporates the most up-to-date input data and methodological improvements

REFERENCES:
1. Annual County Resident Population Estimates by Age, Sex, Race, and Hispanic Origin: April 1, 2010 to July 1, 2019 (https://www.census.gov/data/tables/time-series/demo/popest/2010s-counties-detail.html)
2. File Layout: https://www2.census.gov/programs-surveys/popest/technical-documentation/file-layouts/2010-2019/cc-est2019-alldata.pdf

In [22]:
race = pd.read_csv("data/cc-est2019-alldata.csv", encoding = "ISO-8859-1")

In [23]:
# race.columns.tolist()

# SELECTION - Z Value
# sum columns by race and gender 
# e.g. race["WA_MALE_TOTAL"] = race.loc[:, ["WA_MALE", "WAC_MALE"].sum()

# WA_MALE
# WAC_MALE

# WA_FEMALE
# WAC_FEMALE

# BA_MALE
# BAC_MALE

# BA_FEMALE
# BAC_FEMALE

# IA_MALE
# IAC_MALE

# IA_FEMALE
# IAC_FEMALE

# AA_MALE
# AAC_MALE 

# AA_FEMALE
# AAC_FEMALE

# NA_MALE
# NAC_MALE 

# NA_FEMALE
# NAC_FEMALE

# TOM_MALE
# TOM_FEMALE

race["WA_MALE_TOTAL"] = race.loc[:, ["WA_MALE", "WAC_MALE"]].sum(axis=1)
race["WA_FEMALE_TOTAL"] = race.loc[:, ["WA_FEMALE", "WAC_FEMALE"]].sum(axis=1)
race["BA_MALE_TOTAL"] = race.loc[:, ["BA_MALE", "BAC_MALE"]].sum(axis=1)
race["BA_FEMALE_TOTAL"] = race.loc[:, ["BA_FEMALE", "BAC_FEMALE"]].sum(axis=1)
race["IA_MALE_TOTAL"] = race.loc[:, ["IA_MALE", "IAC_MALE"]].sum(axis=1)
race["IA_FEMALE_TOTAL"] = race.loc[:, ["IA_FEMALE", "IAC_FEMALE"]].sum(axis=1)
race["AA_MALE_TOTAL"] = race.loc[:, ["AA_MALE", "AAC_MALE"]].sum(axis=1)
race["AA_FEMALE_TOTAL"] = race.loc[:, ["AA_FEMALE", "AAC_FEMALE"]].sum(axis=1)
race["NA_MALE_TOTAL"] = race.loc[:, ["NA_MALE", "NAC_MALE"]].sum(axis=1)
race["NA_FEMALE_TOTAL"] = race.loc[:, ["NA_FEMALE", "NAC_FEMALE"]].sum(axis=1)

In [24]:
race["YEAR"] = race["YEAR"].astype(int)
race

Unnamed: 0,SUMLEV,STATE,COUNTY,STNAME,CTYNAME,YEAR,AGEGRP,TOT_POP,TOT_MALE,TOT_FEMALE,...,WA_MALE_TOTAL,WA_FEMALE_TOTAL,BA_MALE_TOTAL,BA_FEMALE_TOTAL,IA_MALE_TOTAL,IA_FEMALE_TOTAL,AA_MALE_TOTAL,AA_FEMALE_TOTAL,NA_MALE_TOTAL,NA_FEMALE_TOTAL
0,50,1,1,Alabama,Autauga County,1,0,54571,26569,28002,...,42928,44393,9263,10436,396,453,500,693,71,55
1,50,1,1,Alabama,Autauga County,1,1,3579,1866,1713,...,2890,2684,767,679,28,21,47,43,4,1
2,50,1,1,Alabama,Autauga County,1,2,3991,2001,1990,...,3091,3109,824,777,41,27,49,63,4,7
3,50,1,1,Alabama,Autauga County,1,3,4290,2171,2119,...,3352,3301,884,842,44,39,55,55,8,6
4,50,1,1,Alabama,Autauga County,1,4,4290,2213,2077,...,3292,3209,1027,868,35,27,64,45,10,7
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
716371,50,56,45,Wyoming,Weston County,12,14,499,280,219,...,514,409,1,2,7,2,38,25,0,0
716372,50,56,45,Wyoming,Weston County,12,15,352,180,172,...,349,339,0,1,4,2,7,2,0,0
716373,50,56,45,Wyoming,Weston County,12,16,229,107,122,...,212,240,0,0,2,4,0,0,0,0
716374,50,56,45,Wyoming,Weston County,12,17,198,82,116,...,161,230,0,0,2,2,1,0,0,0


In [25]:
# YEAR: 12 = 7/1/2019 & AGEGRP: 0 = Total

race_12 = race.loc[(race['YEAR'] == 12) & (race['AGEGRP'] == 0)]
race_12.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 3142 entries, 209 to 716357
Data columns (total 90 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   SUMLEV           3142 non-null   int64 
 1   STATE            3142 non-null   int64 
 2   COUNTY           3142 non-null   int64 
 3   STNAME           3142 non-null   object
 4   CTYNAME          3142 non-null   object
 5   YEAR             3142 non-null   int64 
 6   AGEGRP           3142 non-null   int64 
 7   TOT_POP          3142 non-null   int64 
 8   TOT_MALE         3142 non-null   int64 
 9   TOT_FEMALE       3142 non-null   int64 
 10  WA_MALE          3142 non-null   int64 
 11  WA_FEMALE        3142 non-null   int64 
 12  BA_MALE          3142 non-null   int64 
 13  BA_FEMALE        3142 non-null   int64 
 14  IA_MALE          3142 non-null   int64 
 15  IA_FEMALE        3142 non-null   int64 
 16  AA_MALE          3142 non-null   int64 
 17  AA_FEMALE        3142 non-nul

In [26]:
race_12.loc[:,["STATE", "COUNTY", "STNAME", "CTYNAME", "TOT_POP", "WA_MALE_TOTAL", "WA_FEMALE_TOTAL"
               , "BA_MALE_TOTAL", "BA_FEMALE_TOTAL", "IA_MALE_TOTAL", "IA_FEMALE_TOTAL"
               , "AA_MALE_TOTAL", "AA_FEMALE_TOTAL", "NA_MALE_TOTAL", "NA_FEMALE_TOTAL"]]

Unnamed: 0,STATE,COUNTY,STNAME,CTYNAME,TOT_POP,WA_MALE_TOTAL,WA_FEMALE_TOTAL,BA_MALE_TOTAL,BA_FEMALE_TOTAL,IA_MALE_TOTAL,IA_FEMALE_TOTAL,AA_MALE_TOTAL,AA_FEMALE_TOTAL,NA_MALE_TOTAL,NA_FEMALE_TOTAL
209,1,1,Alabama,Autauga County,55869,42250,43920,10751,12270,395,446,727,879,87,75
437,1,3,Alabama,Baldwin County,223234,191540,202761,19832,21115,2721,2624,2337,3394,254,267
665,1,5,Alabama,Barbour County,24686,12906,11608,12743,11280,285,182,127,141,72,41
893,1,7,Alabama,Bibb County,22394,17635,16976,5951,3719,159,151,72,73,50,16
1121,1,9,Alabama,Blount County,57826,54866,56713,1174,1080,592,598,232,265,102,62
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
715445,56,37,Wyoming,Sweetwater County,42343,41325,38927,828,640,881,801,516,619,97,99
715673,56,39,Wyoming,Teton County,23464,23328,21591,248,181,310,272,358,573,63,43
715901,56,41,Wyoming,Uinta County,20226,19698,19209,199,186,391,402,123,180,54,44
716129,56,43,Wyoming,Washakie County,7805,7602,7321,80,57,169,198,66,102,13,13


In [27]:
race_12.describe()

Unnamed: 0,SUMLEV,STATE,COUNTY,YEAR,AGEGRP,TOT_POP,TOT_MALE,TOT_FEMALE,WA_MALE,WA_FEMALE,...,WA_MALE_TOTAL,WA_FEMALE_TOTAL,BA_MALE_TOTAL,BA_FEMALE_TOTAL,IA_MALE_TOTAL,IA_FEMALE_TOTAL,AA_MALE_TOTAL,AA_FEMALE_TOTAL,NA_MALE_TOTAL,NA_FEMALE_TOTAL
count,3142.0,3142.0,3142.0,3142.0,3142.0,3142.0,3142.0,3142.0,3142.0,3142.0,...,3142.0,3142.0,3142.0,3142.0,3142.0,3142.0,3142.0,3142.0,3142.0,3142.0
mean,50.0,30.280076,103.572884,12.0,0.0,104468.3,51450.45,53017.89,39526.52,40206.83,...,80336.65,81715.76,14085.58,15289.41,1771.046467,1772.44303,6459.161,7024.877,387.793125,382.213558
std,0.0,15.144339,107.70406,0.0,0.0,333456.7,163867.7,169627.6,117934.1,119046.0,...,240143.0,242433.4,56469.54,65123.35,6577.875055,6677.061211,44552.79,48929.02,3681.779723,3625.889024
min,50.0,1.0,1.0,12.0,0.0,86.0,41.0,45.0,13.0,11.0,...,29.0,30.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,50.0,18.0,35.0,12.0,0.0,10902.5,5459.75,5407.25,4544.25,4461.75,...,9183.0,8992.5,219.0,148.25,121.0,115.0,65.0,77.0,9.0,8.0
50%,50.0,29.0,79.0,12.0,0.0,25726.0,12869.0,12828.5,10944.5,11019.5,...,22101.5,22294.5,1278.5,839.0,325.0,300.0,204.0,232.0,28.0,25.0
75%,50.0,45.0,133.0,12.0,0.0,68072.75,34152.25,34512.5,29107.75,29758.75,...,58778.25,60189.5,6424.0,6014.0,1008.75,971.75,952.75,1083.0,108.75,100.0
max,50.0,56.0,840.0,12.0,0.0,10039110.0,4949041.0,5090066.0,3552806.0,3545503.0,...,7240650.0,7229593.0,1140998.0,1362219.0,188264.0,185298.0,1529785.0,1731856.0,171165.0,166656.0


In [28]:
race_12.rename(columns={'CTYNAME': 'County Name'}, inplace=True)
race_12.rename(columns={'STATE': 'stateFIPS'}, inplace=True)
race_12.rename(columns={'COUNTY': 'countyFIPS_2d'}, inplace=True)
race_12

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  errors=errors,


Unnamed: 0,SUMLEV,stateFIPS,countyFIPS_2d,STNAME,County Name,YEAR,AGEGRP,TOT_POP,TOT_MALE,TOT_FEMALE,...,WA_MALE_TOTAL,WA_FEMALE_TOTAL,BA_MALE_TOTAL,BA_FEMALE_TOTAL,IA_MALE_TOTAL,IA_FEMALE_TOTAL,AA_MALE_TOTAL,AA_FEMALE_TOTAL,NA_MALE_TOTAL,NA_FEMALE_TOTAL
209,50,1,1,Alabama,Autauga County,12,0,55869,27092,28777,...,42250,43920,10751,12270,395,446,727,879,87,75
437,50,1,3,Alabama,Baldwin County,12,0,223234,108247,114987,...,191540,202761,19832,21115,2721,2624,2337,3394,254,267
665,50,1,5,Alabama,Barbour County,12,0,24686,13064,11622,...,12906,11608,12743,11280,285,182,127,141,72,41
893,50,1,7,Alabama,Bibb County,12,0,22394,11929,10465,...,17635,16976,5951,3719,159,151,72,73,50,16
1121,50,1,9,Alabama,Blount County,12,0,57826,28472,29354,...,54866,56713,1174,1080,592,598,232,265,102,62
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
715445,50,56,37,Wyoming,Sweetwater County,12,0,42343,21808,20535,...,41325,38927,828,640,881,801,516,619,97,99
715673,50,56,39,Wyoming,Teton County,12,0,23464,12142,11322,...,23328,21591,248,181,310,272,358,573,63,43
715901,50,56,41,Wyoming,Uinta County,12,0,20226,10224,10002,...,19698,19209,199,186,391,402,123,180,54,44
716129,50,56,43,Wyoming,Washakie County,12,0,7805,3963,3842,...,7602,7321,80,57,169,198,66,102,13,13


### Incidence of Pre-existing Conditions & Coverage of Flu Vaccine

People of any age with the following conditions are at increased risk of severe illness from COVID-19 (according to CDC, 17 July 17 2020:

PolicyMap worked with journalists at the New York Times to create this index assessing a county’s relative risk of its population developing severe COVID-19 symptoms. The index represents the relative risk for a high proportion of residents in each county to develop serious health complications from COVID-19 because of underlying health conditions identified by the CDC as contributing to a person’s risk of developing severe symptoms from the virus. These conditions include COPD, heart disease, high blood pressure, diabetes, and obesity.

Estimates of COPD, heart disease, high blood pressure, and diabetes and obesity prevalence at the tract and ZCTA level are from PolicyMap’s Health Outcome Estimates. Estimates of diabetes and obesity prevalence at the county level are from the CDC’s U.S. Diabetes Surveillance System.

Normalized scores were then converted to percentiles and z scores for easier interpretation. Percentiles rank counties from the lowest score to the highest on a scale of 0 to 100, where a score of 50 represents the median value. A county’s z score shows how many standard deviations above or below the average a county’s risk level falls. A score of 0.6, for example, would mean that the county has a higher risk than average, but is still within one standard deviation of the average and is therefore not unusually high. Risk categories from very low to very high are assigned based on z scores.

Constrained features to the following (according to CDC advisory 28 July, 2020):
- Serious heart conditions, such as heart failure, coronary artery disease, or cardiomyopathies (CVDINFR4, CVDCRHD4)
- Cancer (CHCOCNCR)
- Chronic kidney disease (CHCKDNY)
- COPD (CHCCOPD1)
- Obesity (BMI> 30) ( _BMI5CAT value 4; not available at county level)
- Sickle cell disease (not available)
- Solid organ transplantation 
- Type 2 diabetes mellitus (proxy; taking insulin: INSULIN)


Proxy Prevention Coverage
- Adult flu shot/spray past 12 mos (FLUSHOT6)


REFERENCES:
1. Covid 19 People with Certain Medical Conditions https://www.cdc.gov/coronavirus/2019-ncov/need-extra-precautions/people-with-medical-conditions.html?CDC_AA_refVal=https%3A%2F%2Fwww.cdc.gov%2Fcoronavirus%2F2019-ncov%2Fneed-extra-precautions%2Fgroups-at-higher-risk.html
2. Centers for Disease Control and Prevention (CDC). Behavioral Risk Factor Surveillance System Survey Data. Atlanta, Georgia: U.S. Department of Health and Human Services, Centers for Disease Control and Prevention, 2017.: https://www.cdc.gov/brfss/smart/smart_2017.html
3. Evidence used to update the list of underlying medical conditions that increase a person’s risk of severe illness from COVID-19: https://www.cdc.gov/coronavirus/2019-ncov/need-extra-precautions/evidence-table.html
4. PolicyMap Severe COVID-19 Health Risk Index: https://www.policymap.com/download-covid19-data/

In [None]:
# CDC SMART Data
# preexisting = pd.read_sas("data/llcp2018_2.xpt")
# preexisting.to_csv('data/llcp2018.csv')
# preexisting = pd.read_csv("data/MMSA2017.csv")
# preexisting["_MMSA"] = preexisting["_MMSA"].astype(str)
# print(preexisting.dtypes)
# preexisting['countyFIPS_2d'] = preexisting['_MMSA'].str[2:4]
# preexisting['stateFIPS_2d'] = preexisting['_MMSA'].str[0:2]

In [29]:
preexisting = pd.read_csv("data/risk_clean3.csv")

In [30]:
preexisting

Unnamed: 0,countyFIPS,stateFIPS,countyFIPS_2d,index_percentile,index_category
0,1001,10,1,65.42,Above Average
1,1003,10,3,68.39,Above Average
2,1005,10,5,97.09,High
3,1007,10,7,83.36,Above Average
4,1009,10,9,81.75,Above Average
...,...,...,...,...,...
3138,56037,56,3,10.42,Low
3139,56039,56,3,2.94,Very low
3140,56041,56,4,27.13,Below Average
3141,56043,56,4,32.76,Below Average


In [31]:
print(preexisting.dtypes)

countyFIPS            int64
stateFIPS             int64
countyFIPS_2d         int64
index_percentile    float64
index_category       object
dtype: object


In [32]:
preexisting.describe()

Unnamed: 0,countyFIPS,stateFIPS,countyFIPS_2d,index_percentile
count,3143.0,3143.0,3143.0,3143.0
mean,30390.411709,34.588291,13.360165,48.120646
std,15164.71772,15.498601,16.958315,181.451891
min,1001.0,10.0,0.0,-9999.0
25%,18178.0,21.0,3.0,27.02
50%,29177.0,34.0,8.0,51.38
75%,45082.0,48.0,15.0,75.47
max,56045.0,90.0,99.0,100.0


In [34]:
preexisting.rename(columns={'index_percentile': 'Risk Index'}, inplace=True)

In [57]:
preexisting["countyFIPS"] = preexisting["countyFIPS"].astype(int)
preexisting["Risk Index"] = preexisting["Risk Index"].astype(int)

preexisting.describe()

Unnamed: 0,countyFIPS,stateFIPS,countyFIPS_2d,Risk Index
count,3143.0,3143.0,3143.0,3143.0
mean,30390.411709,34.588291,13.360165,47.64238
std,15164.71772,15.498601,16.958315,181.442873
min,1001.0,10.0,0.0,-9999.0
25%,18178.0,21.0,3.0,26.5
50%,29177.0,34.0,8.0,51.0
75%,45082.0,48.0,15.0,75.0
max,56045.0,90.0,99.0,100.0


### Flu Coverage (CDC Wonder)? 

## Merging DataFrames

In [37]:
merged_cases_death_1 = covid_cases_clean.merge(covid_deaths_clean, on=["stateFIPS", "countyFIPS_2d", "countyFIPS", "County Name", "State"], how='left', validate="1:1")

In [61]:
merged_cases_death_1.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 3145 entries, 0 to 3144
Data columns (total 7 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   countyFIPS     3145 non-null   object 
 1   stateFIPS      3145 non-null   int64  
 2   countyFIPS_2d  3145 non-null   int64  
 3   County Name    3145 non-null   object 
 4   State          3145 non-null   object 
 5   Total Cases    3145 non-null   int64  
 6   Total Deaths   3130 non-null   float64
dtypes: float64(1), int64(3), object(3)
memory usage: 196.6+ KB


In [80]:
merged_cases_death_pov_2 = merged_cases_death_1.merge(poverty_clean, on=["County Name", "State"], how='left', validate="1:1")

In [81]:
merged_cases_death_pov_2.info()

null_data = merged_cases_death_pov[merged_cases_death_pov.isnull().any(axis=1)]
null_data

<class 'pandas.core.frame.DataFrame'>
Int64Index: 3145 entries, 0 to 3144
Data columns (total 10 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   countyFIPS                  3145 non-null   object 
 1   stateFIPS_x                 3145 non-null   int64  
 2   countyFIPS_2d_x             3145 non-null   int64  
 3   County Name                 3145 non-null   object 
 4   State                       3145 non-null   object 
 5   Total Cases                 3145 non-null   int64  
 6   Total Deaths                3130 non-null   float64
 7   stateFIPS_y                 3118 non-null   float64
 8   countyFIPS_2d_y             3118 non-null   float64
 9   Poverty Universe, All Ages  3118 non-null   float64
dtypes: float64(4), int64(3), object(3)
memory usage: 270.3+ KB


Unnamed: 0,countyFIPS,stateFIPS,countyFIPS_2d,County Name,State,Total Cases,Total Deaths,"Poverty Universe, All Ages"
50,1101,1,1,Montgomery County,AL,270419,6635.0,
51,1103,1,3,Morgan County,AL,71926,337.0,
52,1105,1,5,Perry County,AL,12179,69.0,
53,1107,1,7,Pickens County,AL,16762,500.0,
54,1109,1,9,Pike County,AL,30320,252.0,
...,...,...,...,...,...,...,...,...
3117,55133,55,33,Waukesha County,WI,116212,3346.0,
3118,55135,55,35,Waupaca County,WI,9630,556.0,
3119,55137,55,37,Waushara County,WI,2327,0.0,
3120,55139,55,39,Winnebago County,WI,45941,786.0,


In [42]:
merged_cases_death_pov_race = merged_cases_death_pov.merge(race_12, on=["stateFIPS", "countyFIPS_2d", "County Name"], how='left', validate="1:1")

In [50]:
merged_cases_death_pov_race["countyFIPS"] = merged_cases_death_pov_race["countyFIPS"].astype(int)
merged_cases_death_pov_race.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 3145 entries, 0 to 3144
Data columns (total 95 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   countyFIPS                  3145 non-null   int64  
 1   stateFIPS                   3145 non-null   int64  
 2   countyFIPS_2d               3145 non-null   int64  
 3   County Name                 3145 non-null   object 
 4   State                       3145 non-null   object 
 5   Total Cases                 3145 non-null   int64  
 6   Total Deaths                3130 non-null   float64
 7   Poverty Universe, All Ages  1901 non-null   float64
 8   SUMLEV                      1901 non-null   float64
 9   STNAME                      1901 non-null   object 
 10  YEAR                        1901 non-null   float64
 11  AGEGRP                      1901 non-null   float64
 12  TOT_POP                     1901 non-null   float64
 13  TOT_MALE                    1901 

In [58]:
merged_cases_death_pov_race_risk_2 = merged_cases_death_pov_race.merge(preexisting, on=["countyFIPS", "stateFIPS", "countyFIPS_2d"], how='left', validate="1:1")

In [60]:
merged_cases_death_pov_race_risk_2.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 3145 entries, 0 to 3144
Data columns (total 97 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   countyFIPS                  3145 non-null   int64  
 1   stateFIPS                   3145 non-null   int64  
 2   countyFIPS_2d               3145 non-null   int64  
 3   County Name                 3145 non-null   object 
 4   State                       3145 non-null   object 
 5   Total Cases                 3145 non-null   int64  
 6   Total Deaths                3130 non-null   float64
 7   Poverty Universe, All Ages  1901 non-null   float64
 8   SUMLEV                      1901 non-null   float64
 9   STNAME                      1901 non-null   object 
 10  YEAR                        1901 non-null   float64
 11  AGEGRP                      1901 non-null   float64
 12  TOT_POP                     1901 non-null   float64
 13  TOT_MALE                    1901 

In [None]:
merged_cases_death_pov_race_risk_1.loc[merged_cases_death_pov_race_risk_1["Risk Index"] != "NaN"]

In [None]:
null_data = merged_cases_death_pov_race_risk_1[merged_cases_death_pov_race_risk_1.isnull().any(axis=1)]

In [None]:
null_data