# DATA 602 PROJECT PROPOSAL

## Covid Rates and Income

### <u>Research Question</u>

Is there a relationship between Covid rates (cases, hospitalizations, deaths) and income?

### <u>Justification</u>

This question is relevant across industries: understanding how Covid susceptibility varies with income supports better risk assessment and contingency planning for future public health events.

### <u>Data Sources</u>

This analysis will focus on three datasets from two data sources:

* IRS Data by Zip Code - 2019 (source: [US Dept of Treasury](https://catalog.data.gov/dataset/zip-code-data))
* Provisional COVID-19 Death Counts in the United States by County (source: [CDC](https://data.cdc.gov/NCHS/Provisional-COVID-19-Death-Counts-in-the-United-St/kn79-hsxy))  
* United States COVID-19 Community Levels by County (source: [CDC](https://data.cdc.gov/Public-Health-Surveillance/United-States-COVID-19-Community-Levels-by-County/3nnm-4jni))

### <u>Libraries</u>

This project will utilize the following libraries:  

* Pandas  
* NumPy  
* Matplotlib  
* Seaborn  
* Additional libraries as needed

### <u>EDA & Summary Statistics</u>

#### IRS Data

In [25]:
irs.head()

| | STATEFIPS | STATE | zipcode | agi_stub | N1 | mars1 | MARS2 | MARS4 | ELF | CPREP | ... | N85300 | A85300 | N11901 | A11901 | N11900 | A11900 | N11902 | A11902 | N12000 | A12000 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | AL | 0 | 1 | 778210.0 | 491030.0 | 84770.0 | 189600.0 | 712890.0 | 30670.0 | ... | 0.0 | 0.0 | 62720.0 | 51936.0 | 671860.0 | 1700965.0 | 669570.0 | 1694792.0 | 1980.0 | 3512.0 |
| 1 | 1 | AL | 0 | 2 | 525940.0 | 247140.0 | 123910.0 | 139860.0 | 481760.0 | 18960.0 | ... | 0.0 | 0.0 | 85860.0 | 122569.0 | 438020.0 | 1274802.0 | 435210.0 | 1266557.0 | 3670.0 | 7410.0 |
| 2 | 1 | AL | 0 | 3 | 285700.0 | 105140.0 | 128140.0 | 44560.0 | 260570.0 | 10670.0 | ... | 0.0 | 0.0 | 73980.0 | 154932.0 | 212040.0 | 575315.0 | 208470.0 | 564202.0 | 5020.0 | 13653.0 |
| 3 | 1 | AL | 0 | 4 | 179070.0 | 38820.0 | 123110.0 | 13740.0 | 164300.0 | 5020.0 | ... | 0.0 | 0.0 | 51330.0 | 139065.0 | 126850.0 | 401581.0 | 123310.0 | 388749.0 | 3040.0 | 10377.0 |
| 4 | 1 | AL | 0 | 5 | 257010.0 | 28180.0 | 216740.0 | 7150.0 | 236850.0 | 8400.0 | ... | 90.0 | 141.0 | 104290.0 | 460071.0 | 152790.0 | 598248.0 | 144640.0 | 539385.0 | 9180.0 | 56257.0 |


In [26]:
irs.describe()

| | STATEFIPS | zipcode | agi_stub | N1 | mars1 | MARS2 | MARS4 | ELF | CPREP | PREP | ... | N85300 | A85300 | N11901 | A11901 | N11900 | A11900 | N11902 | A11902 | N12000 | A12000 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 166159.0 | 166159.0 | 166159.0 | 166159.0 | 166159.0 | 166159.0 | 166159.0 | 166159.0 | 166159.0 | 166159.0 | ... | 166159.0 | 166159.0 | 166159.0 | 166159.0 | 166159.0 | 166159.0 | 166159.0 | 166159.0 | 166159.0 | 166159.0 |
| mean | 29.666885 | 48859.485553 | 3.499949 | 1860.508 | 912.7834 | 647.8571 | 258.03676 | 1689.549 | 75.771821 | 952.1963 | ... | 62.932011 | 301.6957 | 398.2022 | 2277.891 | 1411.865 | 4982.449 | 1375.67 | 3915.695 | 43.742861 | 1012.98 |
| std | 15.121486 | 27167.679271 | 1.707871 | 37223.35 | 22249.99 | 12000.8 | 6336.0643 | 33470.49 | 1866.726495 | 19368.59 | ... | 3210.746066 | 17279.29 | 7639.843 | 77505.17 | 29398.22 | 105199.3 | 29036.42 | 75702.7 | 991.034426 | 51363.45 |
| min | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 18.0 | 27020.0 | 2.0 | 70.0 | 0.0 | 40.0 | 0.0 | 70.0 | 0.0 | 50.0 | ... | 0.0 | 0.0 | 20.0 | 22.0 | 50.0 | 168.0 | 50.0 | 151.0 | 0.0 | 0.0 |
| 50% | 29.0 | 48843.0 | 3.0 | 260.0 | 80.0 | 110.0 | 30.0 | 240.0 | 0.0 | 150.0 | ... | 0.0 | 0.0 | 60.0 | 148.0 | 190.0 | 647.0 | 180.0 | 572.0 | 0.0 | 0.0 |
| 75% | 42.0 | 70652.5 | 5.0 | 1080.0 | 390.0 | 380.0 | 100.0 | 990.0 | 40.0 | 560.0 | ... | 0.0 | 0.0 | 240.0 | 669.0 | 770.0 | 2523.0 | 740.0 | 2238.0 | 30.0 | 62.0 |
| max | 56.0 | 99999.0 | 6.0 | 5506120.0 | 4069770.0 | 1818210.0 | 945490.0 | 4827070.0 | 338290.0 | 3022550.0 | ... | 932390.0 | 4668052.0 | 1160480.0 | 21541410.0 | 4297720.0 | 18454510.0 | 4268070.0 | 9218146.0 | 203100.0 | 12627690.0 |


In [27]:
irs.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 166159 entries, 0 to 166158
Columns: 152 entries, STATEFIPS to A12000
dtypes: float64(148), int64(3), object(1)
memory usage: 192.7+ MB


The key for this data has been uploaded to the project [GitHub repository](https://github.com/josh1den/DATA-602/blob/main/FINAL_PROJECT/data_overview.pdf). The original IRS dataset contains over 166k observations across 152 columns and occupies about 192.7 MB in memory, so it must be read in locally. The data will be reduced to the components needed for this analysis, written to .csv, and uploaded to the project GitHub repository.
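A minimal sketch of what that reduction step might look like, using only columns visible in the output above (`STATE`, `zipcode`, `agi_stub`, `N1`). It assumes that `N1` is the number of returns and that zipcode values `0` and `99999` are aggregate rows rather than real zip codes; both assumptions should be checked against the data key. The function name `summarize_irs` and the output column names are illustrative, and the toy values are made up.

```python
import pandas as pd


def summarize_irs(irs: pd.DataFrame) -> pd.DataFrame:
    """Collapse the IRS data to one row per zip code with N1 summed per agi_stub."""
    # Assumption: zipcode 0 and 99999 are aggregate rows, not real zip codes.
    is_real_zip = ~irs["zipcode"].isin([0, 99999])
    rows = irs.loc[is_real_zip, ["STATE", "zipcode", "agi_stub", "N1"]]
    # One column per agi_stub value, holding the summed N1 for that bracket.
    wide = rows.pivot_table(
        index=["STATE", "zipcode"],
        columns="agi_stub",
        values="N1",
        aggfunc="sum",
        fill_value=0,
    )
    wide.columns = [f"returns_stub_{stub}" for stub in wide.columns]
    wide["total_returns"] = wide.sum(axis=1)
    return wide.reset_index()


# Toy rows standing in for the real data (values are made up).
toy_irs = pd.DataFrame(
    {
        "STATE": ["AL", "AL", "AL", "AL"],
        "zipcode": [0, 35004, 35004, 35005],
        "agi_stub": [1, 1, 2, 1],
        "N1": [778210.0, 1500.0, 1300.0, 1200.0],
    }
)
irs_small = summarize_irs(toy_irs)
print(irs_small)
```

The reduced frame could then be written out with `irs_small.to_csv("irs_by_zip.csv", index=False)`, where the file name is a placeholder.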

#### CDC Community Levels By County

In [24]:
cdc_comm.head()

| | county | county_fips | state | county_population | health_service_area_number | health_service_area | health_service_area_population | covid_inpatient_bed_utilization | covid_hospital_admissions_per_100k | covid_cases_per_100k | covid-19_community_level | date_updated |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Lincoln County | 55069 | Wisconsin | 27593.0 | 282 | Marathon (Wausau), WI - Wood, WI | 291401.0 | 4.7 | 13.4 | 177.58 | Medium | 2022-08-18 |
| 1 | Manitowoc County | 55071 | Wisconsin | 78981.0 | 355 | Sheboygan (Sheboygan), WI - Manitowoc, WI | 244410.0 | 3.4 | 9.8 | 169.66 | Low | 2022-08-18 |
| 2 | Marathon County | 55073 | Wisconsin | 135692.0 | 282 | Marathon (Wausau), WI - Wood, WI | 291401.0 | 4.7 | 13.4 | 209.3 | High | 2022-08-18 |
| 3 | Monroe County | 55081 | Wisconsin | 46253.0 | 290 | La Crosse (La Crosse), WI - Monroe, WI | 257027.0 | 3.9 | 15.6 | 216.2 | High | 2022-08-18 |
| 4 | Portage County | 55097 | Wisconsin | 70772.0 | 400 | Portage, WI | 70772.0 | 5.9 | 7.1 | 217.6 | Medium | 2022-08-18 |


In [28]:
cdc_comm.describe()

| | county_fips | county_population | health_service_area_number | health_service_area_population | covid_inpatient_bed_utilization | covid_hospital_admissions_per_100k | covid_cases_per_100k |
|---|---|---|---|---|---|---|---|
| count | 112836.0 | 112835.0 | 112836.0 | 112829.0 | 112648.0 | 112778.0 | 112836.0 |
| mean | 31438.02789 | 102920.0 | 400.462033 | 580860.4 | 3.25508 | 7.716028 | 144.930991 |
| std | 16331.50567 | 329363.8 | 243.44496 | 995262.5 | 2.66225 | 6.788769 | 186.263383 |
| min | 1001.0 | 86.0 | 1.0 | 2274.0 | 0.0 | 0.0 | 0.0 |
| 25% | 19033.0 | 11131.0 | 186.0 | 90212.0 | 1.3 | 3.0 | 44.8175 |
| 50% | 30027.0 | 26118.0 | 409.0 | 224914.0 | 2.8 | 6.5 | 106.76 |
| 75% | 46111.0 | 67215.0 | 587.0 | 554557.0 | 4.5 | 10.7 | 194.2 |
| max | 78000.0 | 10039110.0 | 905.0 | 13214800.0 | 36.0 | 171.2 | 13017.75 |


In [29]:
cdc_comm.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 112836 entries, 0 to 112835
Data columns (total 12 columns):
 #   Column                              Non-Null Count   Dtype  
---  ------                              --------------   -----  
 0   county                              112836 non-null  object 
 1   county_fips                         112836 non-null  int64  
 2   state                               112836 non-null  object 
 3   county_population                   112835 non-null  float64
 4   health_service_area_number          112836 non-null  int64  
 5   health_service_area                 112836 non-null  object 
 6   health_service_area_population      112829 non-null  float64
 7   covid_inpatient_bed_utilization     112648 non-null  float64
 8   covid_hospital_admissions_per_100k  112778 non-null  float64
 9   covid_cases_per_100k                112836 non-null  float64
 10  covid-19_community_level            112782 non-null  object 
 11  date_updated              

This dataset contains over 112k observations across 12 columns, with county-level information about Covid-19 cases and hospitalizations.
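With roughly 112k rows and a `date_updated` column, counties most likely appear once per reporting date; that is an inference from the output above, not something the source states. Under that assumption, the sketch below collapses the data to one row per county by averaging the per-100k columns. The function name `county_covid_rates` and the output column names are illustrative, and averaging across dates is only one possible summary.

```python
import pandas as pd


def county_covid_rates(cdc_comm: pd.DataFrame) -> pd.DataFrame:
    """Return one row per county_fips with rates averaged over reporting dates."""
    return cdc_comm.groupby("county_fips", as_index=False).agg(
        county=("county", "first"),
        state=("state", "first"),
        # "mean" skips missing values, so a county with some gaps still gets a value.
        mean_cases_per_100k=("covid_cases_per_100k", "mean"),
        mean_admissions_per_100k=("covid_hospital_admissions_per_100k", "mean"),
        n_reports=("date_updated", "nunique"),
    )


# Toy rows: the 2022-08-18 values come from the head() output above,
# the 2022-08-11 row is made up to illustrate a second report with a gap.
toy_comm = pd.DataFrame(
    {
        "county": ["Lincoln County", "Lincoln County", "Manitowoc County"],
        "county_fips": [55069, 55069, 55071],
        "state": ["Wisconsin", "Wisconsin", "Wisconsin"],
        "covid_hospital_admissions_per_100k": [13.4, None, 9.8],
        "covid_cases_per_100k": [177.58, 150.0, 169.66],
        "date_updated": ["2022-08-18", "2022-08-11", "2022-08-18"],
    }
)
county_rates = county_covid_rates(toy_comm)
print(county_rates)
```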

#### CDC Provisional Death Counts By County

In [22]:
cdc_prov.head()

| | Date as of | Start Date | End Date | State | County name | FIPS County Code | Urban Rural Code | Deaths involving COVID-19 | Deaths from All Causes | Footnote |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 10/19/2022 | 01/01/2020 | 10/15/2022 | AK | Aleutians East Borough | 2013 | Noncore | NaN | 22.0 | One or more data cells have counts between 1-9... |
| 1 | 10/19/2022 | 01/01/2020 | 10/15/2022 | AK | Anchorage Municipality | 2020 | Medium metro | 734.0 | 7081.0 | NaN |
| 2 | 10/19/2022 | 01/01/2020 | 10/15/2022 | AK | Bethel Census Area | 2050 | Noncore | 39.0 | 317.0 | NaN |
| 3 | 10/19/2022 | 01/01/2020 | 10/15/2022 | AK | Denali Borough | 2068 | Noncore | NaN | 24.0 | One or more data cells have counts between 1-9... |
| 4 | 10/19/2022 | 01/01/2020 | 10/15/2022 | AK | Dillingham Census Area | 2070 | Noncore | NaN | 96.0 | One or more data cells have counts between 1-9... |


In [30]:
cdc_prov.describe()

| | FIPS County Code | Deaths involving COVID-19 | Deaths from All Causes |
|---|---|---|---|
| count | 3085.0 | 2706.0 | 3084.0 |
| mean | 30357.15624 | 391.218404 | 3027.573281 |
| std | 15162.540083 | 1179.109564 | 8527.317813 |
| min | 1001.0 | 10.0 | 14.0 |
| 25% | 18175.0 | 29.0 | 304.0 |
| 50% | 29147.0 | 75.0 | 717.5 |
| 75% | 45075.0 | 297.5 | 2120.5 |
| max | 56045.0 | 31094.0 | 223502.0 |


In [31]:
cdc_prov.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3085 entries, 0 to 3084
Data columns (total 10 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   Date as of                 3085 non-null   object 
 1   Start Date                 3085 non-null   object 
 2   End Date                   3085 non-null   object 
 3   State                      3085 non-null   object 
 4   County name                3085 non-null   object 
 5   FIPS County Code           3085 non-null   int64  
 6   Urban Rural Code           3085 non-null   object 
 7   Deaths involving COVID-19  2706 non-null   float64
 8   Deaths from All Causes     3084 non-null   float64
 9   Footnote                   379 non-null    object 
dtypes: float64(2), int64(1), object(7)
memory usage: 241.1+ KB


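This dataset contains 3,085 observations across 10 columns, with total Covid deaths by state and county. Covid death counts are missing for 379 counties, which matches the number of rows carrying a footnote about counts between 1 and 9.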
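Because `Deaths involving COVID-19` has fewer non-null values than there are rows, any death measure needs to handle the gaps explicitly. The sketch below flags rows with a missing Covid count and computes Covid deaths as a share of all-cause deaths for the rest. The share is one possible measure, not one the proposal commits to, and the function and new column names are illustrative. The toy rows reuse values from the `head()` output above.

```python
import pandas as pd


def add_death_measures(cdc_prov: pd.DataFrame) -> pd.DataFrame:
    """Flag missing Covid death counts and add Covid deaths as a share of all deaths."""
    out = cdc_prov.copy()
    out["covid_deaths_missing"] = out["Deaths involving COVID-19"].isna()
    # The division propagates NaN, so rows with a missing count stay missing.
    out["covid_share_of_deaths"] = (
        out["Deaths involving COVID-19"] / out["Deaths from All Causes"]
    )
    return out


toy_prov = pd.DataFrame(
    {
        "County name": [
            "Aleutians East Borough",
            "Anchorage Municipality",
            "Bethel Census Area",
        ],
        "FIPS County Code": [2013, 2020, 2050],
        "Deaths involving COVID-19": [None, 734.0, 39.0],
        "Deaths from All Causes": [22.0, 7081.0, 317.0],
    }
)
deaths = add_death_measures(toy_prov)
print(deaths[["County name", "covid_deaths_missing", "covid_share_of_deaths"]])
```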

### <u>Combining the Datasets</u>

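These datasets will be transformed and combined to create a master dataframe containing zip code, income, and Covid rate information. Because the IRS data is keyed by zip code and both CDC datasets are keyed by county FIPS code, the datasets must be aligned on a common geography before they can be joined.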
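One way the combination could work is sketched below, assuming a zip-to-county lookup table is available. The proposal does not name one, so `zip_to_county`, the `mean_agi` income column, and the function name `build_master` are all hypothetical, and the sketch assumes each zip code maps to exactly one county. Zip codes and income values in the toy data are placeholders; the FIPS codes and case rates come from the `head()` output above.

```python
import pandas as pd


def build_master(
    irs_zip: pd.DataFrame,
    zip_to_county: pd.DataFrame,
    county_rates: pd.DataFrame,
) -> pd.DataFrame:
    """Attach county-level Covid rates to zip-level income rows."""
    # Step 1: look up the county for each zip code.
    # validate= raises if a zip code appears more than once in the lookup.
    with_fips = irs_zip.merge(
        zip_to_county, on="zipcode", how="left", validate="many_to_one"
    )
    # Step 2: every zip code inherits the rates of its county.
    # Zip codes without a match keep missing values instead of being dropped.
    return with_fips.merge(
        county_rates, on="county_fips", how="left", validate="many_to_one"
    )


# Placeholder zip codes and income values (made up for illustration).
toy_irs_zip = pd.DataFrame(
    {"zipcode": [11111, 22222, 33333], "mean_agi": [55.0, 42.0, 61.0]}
)
toy_lookup = pd.DataFrame(
    {"zipcode": [11111, 22222, 33333], "county_fips": [55069, 55069, 55071]}
)
toy_rates = pd.DataFrame(
    {"county_fips": [55069, 55071], "covid_cases_per_100k": [177.58, 169.66]}
)
master = build_master(toy_irs_zip, toy_lookup, toy_rates)
print(master)
```

With a frame like this in hand, a first look at the research question could start from `master[["mean_agi", "covid_cases_per_100k"]].corr()`, keeping in mind that zip codes in the same county share identical Covid rates.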