# COVID-19 and Census Data 
___

<b> Table of Contents: </b>
<br> [0. Importing Libraries and Loading the Data](#0)
<br> [1. COVID Confirmed Cases as Target Variable](#1)
<br> [2. COVID Deaths as Target Variable](#2)

This dataset tracks COVID-19 cases and deaths for each county in the tri-state area (New York, New Jersey, and Connecticut). The CSV file contains case and death counts for every day from January 22, 2020 through June 10, 2020. This project uses this data source because of the reliability of its county-level data. The Census Economic and Income data is also reported at the county level, so the two datasets had to be brought into a consistent format before they could be combined.

Data source: USAFACTS https://usafacts.org/visualizations/coronavirus-covid-19-spread-map/
- COVID-19 Confirmed Cases and Deaths
- Population (from 2019 US county data)

Data range: 
- January 22, 2020 - June 9, 2020 (data gathering stopped 2 weeks before the final Capstone was due, to leave time for data cleaning and prep)


Quick Notes on Data Cleaning:
- Calculated total cases and deaths per month
- Changed state abbreviations to full names to match the Census data
- Filled in the state name for any counties where it was left blank
- Compared the county names and their order in this dataset against the Census data
- Deleted counties not present in both datasets
- Added county population counts where they were listed as 0
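
A few of the cleaning steps above can be sketched in pandas. This is only an illustration under stated assumptions, not the code used to build `tristate_final_data.csv`: the frames `covid` and `census`, the raw column names (`County Name`, `State`, `population`), and the "Beta" and "Gamma" counties with their values are hypothetical toy data.

```python
import pandas as pd

# Hypothetical miniature inputs; the real raw column names may differ.
covid = pd.DataFrame({
    "County Name": ["Albany County", "Beta County", "Gamma County"],
    "State": ["NY", "NJ", "CT"],
    "population": [304204, 0, 500],
})
census = pd.DataFrame({
    "NAME": ["Albany County, New York", "Beta County, New Jersey"],
    "Population": [304204, 1000],
})

STATE_NAMES = {"NY": "New York", "NJ": "New Jersey", "CT": "Connecticut"}

# Expand state abbreviations so the key matches the Census "County, State" format.
covid["State"] = covid["State"].map(STATE_NAMES)
covid["NAME"] = covid["County Name"] + ", " + covid["State"]

# An inner merge keeps only counties present in both datasets
# ("Gamma County, Connecticut" is dropped here).
merged = covid.merge(census, on="NAME", how="inner")

# Where the COVID file lists a population of 0, fall back to the Census figure.
merged["population"] = merged["population"].mask(
    merged["population"] == 0, merged["Population"]
)

print(merged[["NAME", "population"]])
```

An inner merge is one simple way to enforce "present in both datasets"; comparing the two name lists first (for example with `set` differences) would make it easier to see which counties are being dropped and why.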

<a id = "0"> <h2> 0. Importing Libraries and Loading the Data </h2> </a>
___

_Importing Libraries_

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
from scipy.stats import norm
import numpy as np
import csv

_Loading the Data_

In [2]:
df_tristate_cases = pd.read_csv("tristate_final_data.csv")
df_tristate_cases.head()

| | NAME | Covid Confirmed Cases | Covid Deaths | Population | Covid Case Rate (per 1000) | Covid Death Rate (per 1000) | Households SNAP | Estimated Individuals SNAP | SNAP % Population | SNAP Per Capita Benefit or TAM | ... | Other Race Alone | Hispanic or Latino | Median Age | Male Median Age | Female Median Age | Total Households | Average Household Size | Total Families | state code | county code |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Albany County, New York | 87943 | 3989 | 304204 | 289.092188 | 13.112911 | 15087 | 34247 | 11% | $51,473,977.00 | ... | 7647 | 289287 | 38.5 | 36.8 | 40.0 | 126251 | 2 | 60631 | 36 | 1 |
| 1 | Allegany County, New York | 2601 | 126 | 48946 | 53.140195 | 2.574266 | 2994 | 7305 | 15% | $10,979,956.00 | ... | 557 | 48276 | 37.8 | 36.2 | 39.2 | 18208 | 2 | 10576 | 36 | 3 |
| 2 | Bronx County, New York | 2544478 | 234021 | 1385108 | 1837.024983 | 168.955056 | 184934 | 512267 | 37% | $769,937,572.00 | ... | 73243 | 643695 | 32.8 | 30.6 | 34.9 | 483449 | 3 | 368196 | 36 | 5 |
| 3 | Broome County, New York | 24218 | 1802 | 200600 | 120.727817 | 8.983051 | 13226 | 30684 | 15% | $46,118,533.00 | ... | 5087 | 193822 | 40.2 | 38.1 | 42.2 | 82167 | 2 | 40559 | 36 | 7 |
| 4 | Cattaraugus County, New York | 3875 | 128 | 80317 | 48.246324 | 1.593685 | 5801 | 13980 | 17% | $21,012,556.00 | ... | 1363 | 78972 | 40.7 | 39.8 | 41.6 | 32263 | 2 | 18801 | 36 | 9 |
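
The preview shows that some columns, such as `SNAP % Population` and `SNAP Per Capita Benefit or TAM`, hold formatted strings (`11%`, `$51,473,977.00`) rather than numbers. If they are needed for modeling, they could be converted along the lines below. This is a minimal sketch on two rows copied from the preview; the leading space in the column names is an assumption based on how `df_tristate_cases.info()` lists the SNAP columns.

```python
import pandas as pd

# Two rows copied from the preview; column names carry a leading space,
# as the info() listing suggests.
df = pd.DataFrame({
    " SNAP % Population": ["11%", "15%"],
    " SNAP Per Capita Benefit or TAM": ["$51,473,977.00", "$10,979,956.00"],
})

# Strip stray whitespace from the column names.
df.columns = df.columns.str.strip()

# Drop the trailing "%" and convert to a number.
df["SNAP % Population"] = pd.to_numeric(df["SNAP % Population"].str.rstrip("%"))

# Drop "$" and thousands separators before converting.
df["SNAP Per Capita Benefit or TAM"] = pd.to_numeric(
    df["SNAP Per Capita Benefit or TAM"].str.replace(r"[$,]", "", regex=True)
)

print(df.dtypes)
```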


<b> Number of rows, number of columns in dataset </b>

In [8]:
print('Number of rows in the dataset    :', df_tristate_cases.shape[0])
print('Number of columns in the dataset :', df_tristate_cases.shape[1])

Number of rows in the dataset    : 91
Number of columns in the dataset : 43


In [9]:
df_tristate_cases.describe().round(2)

| | Covid Confirmed Cases | Covid Deaths | Population | Covid Case Rate (per 1000) | Covid Death Rate (per 1000) | Median Age | Male Median Age | Female Median Age | Average Household Size | state code | county code |
|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 91.0 | 91.0 | 91.0 | 91.0 | 91.0 | 91.0 | 91.0 | 91.0 | 91.0 | 91.0 | 91.0 |
| mean | 345605.69 | 24234.99 | 348836.19 | 494.78 | 30.01 | 40.16 | 38.84 | 41.45 | 2.44 | 33.16 | 47.79 |
| std | 706010.22 | 59348.61 | 468866.99 | 569.68 | 36.52 | 3.1 | 3.23 | 3.02 | 0.5 | 7.59 | 36.96 |
| min | 304.0 | 0.0 | 4836.0 | 16.82 | 0.0 | 29.8 | 28.5 | 31.1 | 2.0 | 9.0 | 1.0 |
| 25% | 5696.0 | 222.5 | 64956.0 | 88.36 | 3.93 | 38.5 | 37.0 | 39.9 | 2.0 | 34.0 | 15.0 |
| 50% | 36536.0 | 1815.0 | 149265.0 | 230.12 | 13.46 | 40.4 | 39.2 | 41.9 | 2.0 | 36.0 | 37.0 |
| 75% | 297332.0 | 19315.0 | 467878.0 | 698.36 | 48.04 | 41.7 | 40.55 | 42.95 | 3.0 | 36.0 | 78.0 |
| max | 3587059.0 | 347696.0 | 2504700.0 | 2523.79 | 168.96 | 51.3 | 50.9 | 51.8 | 3.0 | 36.0 | 123.0 |
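
The two rate columns summarized above appear to be the raw counts divided by `Population` and scaled to 1,000 residents. That formula is an assumption inferred from the preview rows rather than something the notebook states, but it reproduces the displayed values, as this small check on the first two counties suggests:

```python
import pandas as pd

# First two rows of the data preview shown earlier.
df = pd.DataFrame({
    "NAME": ["Albany County, New York", "Allegany County, New York"],
    "Covid Confirmed Cases": [87943, 2601],
    "Covid Deaths": [3989, 126],
    "Population": [304204, 48946],
})

# Assumed formula: events per 1,000 residents.
df["Covid Case Rate (per 1000)"] = df["Covid Confirmed Cases"] / df["Population"] * 1000
df["Covid Death Rate (per 1000)"] = df["Covid Deaths"] / df["Population"] * 1000

print(df.round(6))
```

Note that `describe()` summarizes only the numeric columns by default, which is why just 11 of the 43 columns appear in the table.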


In [10]:
df_tristate_cases.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 91 entries, 0 to 90
Data columns (total 43 columns):
NAME                                          91 non-null object
Covid Confirmed Cases                         91 non-null int64
Covid Deaths                                  91 non-null int64
Population                                    91 non-null int64
Covid Case Rate (per 1000)                    91 non-null float64
Covid Death Rate (per 1000)                   91 non-null float64
 Households SNAP                              91 non-null object
 Estimated Individuals SNAP                   91 non-null object
 SNAP % Population                            91 non-null object
 SNAP Per Capita Benefit or TAM               91 non-null object
 Total Citizen Educated in US                 91 non-null object
 Citizen Less than High School  Education     91 non-null object
 Citizen High School  Graduate                91 non-null object
 Citizen Some College  Education              91 non-