# DataEng: Data Integration Activity

Name: Karan Patel

In [1]:
import pandas as pd
import numpy as np

## Aggregate Census Data to County Level

Create a Python program that produces a __one-row-per-county__ version of the ACS data set. To do this, you will need to think about how to properly aggregate Census Tract-level data into County-level summaries.

In this step you can also eliminate unneeded columns from the ACS data. 

In [2]:
original_acs_df = pd.read_csv('./data/acs2017_census_tract_data.csv.gz', usecols=['State', 'County', 'TotalPop', 'IncomePerCap', 'Poverty'])
display(original_acs_df.head())

Unnamed: 0,State,County,TotalPop,IncomePerCap,Poverty
0,Alabama,Autauga County,1845,33018.0,10.7
1,Alabama,Autauga County,2172,18996.0,22.4
2,Alabama,Autauga County,3385,21236.0,14.7
3,Alabama,Autauga County,4267,28068.0,2.3
4,Alabama,Autauga County,9965,36905.0,12.2
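
Because `IncomePerCap` and `Poverty` are per-tract rates rather than counts, a plain average across tracts would treat a tract of 1,845 people the same as one of 9,965. The sketch below uses only the five Autauga County tract rows shown above (not the whole county) to illustrate how a simple mean and a population-weighted mean can differ. The variable names are illustrative.

```python
import pandas as pd

# The five Autauga County tract rows from the head() output above.
tracts = pd.DataFrame({
    'TotalPop': [1845, 2172, 3385, 4267, 9965],
    'IncomePerCap': [33018.0, 18996.0, 21236.0, 28068.0, 36905.0],
})

# Unweighted: every tract counts equally, regardless of size.
simple_mean = tracts['IncomePerCap'].mean()

# Weighted: each tract contributes in proportion to its population.
weighted_mean = (
    (tracts['IncomePerCap'] * tracts['TotalPop']).sum() / tracts['TotalPop'].sum()
)

print(f"simple mean:   {simple_mean:.2f}")
print(f"weighted mean: {weighted_mean:.2f}")
```

The weighted figure is pulled toward the largest tract, which is the behavior a county-level per-capita value should have.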


__Question__: Show your aggregated county-level data rows for the following counties: Loudoun County, Virginia; Washington County, Oregon; Harlan County, Kentucky; Malheur County, Oregon.

__Answer__: See output from code block below.

In [9]:
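def acs_data_county_aggregator(df):
    """Collapse one county's tract rows into a single summary row."""
    total_pop = df['TotalPop'].sum()

    # IncomePerCap and Poverty are per-tract rates, so weight each tract by its population.
    # Note: pd.Series upcasts the mixed int/float values to float, so TotalPop displays as e.g. 374558.0.
    data = {'TotalPop': total_pop.astype(int),
            'IncomePerCap': ((df['IncomePerCap'] * df['TotalPop']).sum() / total_pop).round(2),
            'Poverty': ((df['Poverty'] * df['TotalPop']).sum() / total_pop).round(2)
            }
    return pd.Series(data)

# Group on State as well as County, since county names repeat across states.
acs_df = original_acs_df.groupby(['State', 'County']).apply(acs_data_county_aggregator).reset_index()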

def get_criteria(county, state, df):
    return (df.County == county) & (df.State == state)

display(acs_df[get_criteria('Loudoun County', 'Virginia', acs_df)])
display(acs_df[get_criteria('Washington County', 'Oregon', acs_df)])
display(acs_df[get_criteria('Harlan County', 'Kentucky', acs_df)])
display(acs_df[get_criteria('Malheur County', 'Oregon', acs_df)])

Unnamed: 0,State,County,TotalPop,IncomePerCap,Poverty
2968,Virginia,Loudoun County,374558.0,50455.65,3.69


Unnamed: 0,State,County,TotalPop,IncomePerCap,Poverty
2241,Oregon,Washington County,572071.0,35369.05,10.32


Unnamed: 0,State,County,TotalPop,IncomePerCap,Poverty
1040,Kentucky,Harlan County,27548.0,15456.97,35.67


Unnamed: 0,State,County,TotalPop,IncomePerCap,Poverty
2230,Oregon,Malheur County,30421.0,17567.5,24.3
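
One caveat about the weighted averages: pandas skips `NaN` when summing, so if a tract has a missing `IncomePerCap` or `Poverty` value, its population still counts in the denominator and the county figure is biased low. Whether this affects the counties shown here has not been checked. The sketch below is one possible NaN-aware variant. The helper name `weighted_mean` and the toy rows are hypothetical, not part of the assignment.

```python
import numpy as np
import pandas as pd

def weighted_mean(values, weights):
    """Weighted mean that ignores rows where `values` is NaN."""
    mask = values.notna()
    total_weight = weights[mask].sum()
    if total_weight == 0:
        return np.nan  # no usable rows
    return (values[mask] * weights[mask]).sum() / total_weight

# Hypothetical tracts: the second one has no reported income.
toy = pd.DataFrame({
    'TotalPop': [1000, 3000, 1000],
    'IncomePerCap': [20000.0, np.nan, 40000.0],
})

# Naive version: the NaN tract's 3000 residents still inflate the denominator.
naive = (toy['IncomePerCap'] * toy['TotalPop']).sum() / toy['TotalPop'].sum()

# NaN-aware version: only tracts with a reported value are weighted.
robust = weighted_mean(toy['IncomePerCap'], toy['TotalPop'])

print(naive, robust)
```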


## Simplify the COVID Data

Simplify the COVID data along the time dimension. The COVID data set contains day-level resolution data from (approximately) March of 2020 through February of 2021. However, you will only need four data points per county: total cases, total deaths, cases reported during December 2020, and deaths reported during December 2020.

Create a Python program that reduces the COVID data to one line per county.


In [4]:
original_covid_df = pd.read_csv('./data/COVID_county_data.csv.gz', parse_dates=['date'])
display(original_covid_df.head())

Unnamed: 0,date,county,state,fips,cases,deaths
0,2020-01-21,Snohomish,Washington,53061.0,1,0.0
1,2020-01-22,Snohomish,Washington,53061.0,1,0.0
2,2020-01-23,Snohomish,Washington,53061.0,1,0.0
3,2020-01-24,Cook,Illinois,17031.0,1,0.0
4,2020-01-24,Snohomish,Washington,53061.0,1,0.0


__Question__: Show your simplified COVID data for the counties listed above. 

__Answer:__

In [5]:
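def covid_data_county_aggregator(df):
    """Collapse one county's daily rows into the four required values."""
    # Rows dated within December 2020 (both endpoints inclusive).
    december = df[df['date'].between('2020-12-01', '2020-12-31')]

    # NOTE: these are plain sums over the daily rows. If `cases` and `deaths`
    # are cumulative running totals rather than daily counts, the sums overcount.
    data = {'total-cases': df['cases'].sum().astype(int),
            'total-deaths': df['deaths'].sum().astype(int),
            'cases-dec-2020': december['cases'].sum().astype(int),
            'deaths-dec-2020': december['deaths'].sum().astype(int)
            }
    return pd.Series(data)

covid_df = original_covid_df.groupby(['state', 'county']).apply(covid_data_county_aggregator).reset_index()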
display(covid_df)

Unnamed: 0,state,county,total-cases,total-deaths,cases-dec-2020,deaths-dec-2020
0,Alabama,Autauga,645935,9042,108652,1355
1,Alabama,Baldwin,2003567,23041,348455,4502
2,Alabama,Barbour,268771,4077,40753,931
3,Alabama,Bibb,261043,5272,47009,1244
4,Alabama,Blount,630106,8669,121270,1590
...,...,...,...,...,...,...
3269,Wyoming,Teton,305376,617,59845,67
3270,Wyoming,Uinta,200783,1037,41859,201
3271,Wyoming,Unknown,37,0,0,0
3272,Wyoming,Washakie,84354,2622,20107,350
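
The totals above look too large to be case counts: Autauga County shows 645,935 total cases, while the county-level ACS data puts its population at roughly 55,000. That pattern is what you would get if `cases` and `deaths` are cumulative running totals and each day's total is summed again. Assuming that is the case (it is not confirmed anywhere in this notebook), a sketch of an alternative aggregator takes the latest running total and differences it around December. The function name `covid_cumulative_aggregator` and the toy rows are hypothetical.

```python
import pandas as pd

def covid_cumulative_aggregator(df):
    """Reduce one county's rows, ASSUMING cases/deaths are cumulative totals."""
    df = df.sort_values('date')

    def last_value(frame, col):
        # Most recent non-missing running total, or 0 if there are no rows.
        values = frame[col].dropna()
        return int(values.iloc[-1]) if len(values) else 0

    before_dec = df[df['date'] < pd.Timestamp('2020-12-01')]
    through_dec = df[df['date'] <= pd.Timestamp('2020-12-31')]

    return pd.Series({
        'total-cases': last_value(df, 'cases'),
        'total-deaths': last_value(df, 'deaths'),
        # December activity = total at end of December minus total at end of November.
        'cases-dec-2020': last_value(through_dec, 'cases') - last_value(before_dec, 'cases'),
        'deaths-dec-2020': last_value(through_dec, 'deaths') - last_value(before_dec, 'deaths'),
    })

# Made-up rows for one hypothetical county, only to exercise the function.
toy = pd.DataFrame({
    'date': pd.to_datetime(['2020-11-29', '2020-11-30', '2020-12-01',
                            '2020-12-31', '2021-01-05']),
    'cases': [10, 12, 15, 40, 50],
    'deaths': [1.0, 1.0, 2.0, 5.0, 6.0],
})
summary = covid_cumulative_aggregator(toy)
print(summary)
```

If the assumption holds, this function could be passed to the same `groupby(['state', 'county']).apply(...)` call in place of the summing aggregator.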


## Integrate COVID Data with ACS Data

Create a single pandas DataFrame containing one row per county and using the columns described above. You are free to add additional columns if needed. For example, you might want to normalize all of the COVID data by the population of each county so that you have a consistent “number of cases/deaths per 100,000 residents” value for each county.

__Question:__ List your integrated data for all counties in the State of Oregon.

In [10]:
display(acs_df)
display(covid_df)

Unnamed: 0,State,County,TotalPop,IncomePerCap,Poverty
0,Alabama,Autauga County,55036.0,27823.92,13.76
1,Alabama,Baldwin County,203360.0,29364.37,11.87
2,Alabama,Barbour County,26201.0,17561.09,26.87
3,Alabama,Bibb County,22580.0,20911.18,14.92
4,Alabama,Blount County,57667.0,22020.72,15.60
...,...,...,...,...,...
3215,Wyoming,Sweetwater County,44527.0,31699.95,12.08
3216,Wyoming,Teton County,22923.0,49200.63,6.84
3217,Wyoming,Uinta County,20758.0,27114.84,14.82
3218,Wyoming,Washakie County,8253.0,27344.99,12.87


Unnamed: 0,state,county,total-cases,total-deaths,cases-dec-2020,deaths-dec-2020
0,Alabama,Autauga,645935,9042,108652,1355
1,Alabama,Baldwin,2003567,23041,348455,4502
2,Alabama,Barbour,268771,4077,40753,931
3,Alabama,Bibb,261043,5272,47009,1244
4,Alabama,Blount,630106,8669,121270,1590
...,...,...,...,...,...,...
3269,Wyoming,Teton,305376,617,59845,67
3270,Wyoming,Uinta,200783,1037,41859,201
3271,Wyoming,Unknown,37,0,0,0
3272,Wyoming,Washakie,84354,2622,20107,350
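
The two frames displayed above cannot be joined directly: the ACS frame uses `State`/`County` with names like "Autauga County", while the COVID frame uses `state`/`county` with names like "Autauga". A minimal sketch of one way to integrate them follows, assuming that stripping a trailing " County" is enough to align the names (it would not cover parishes, boroughs, or independent cities). The ACS rows are copied from outputs earlier in this notebook. The COVID numbers are made-up placeholders, not real data, and the `-per-100k` column names are illustrative.

```python
import pandas as pd

# ACS rows copied from the aggregated output earlier in the notebook.
acs = pd.DataFrame({
    'State': ['Oregon', 'Oregon', 'Alabama'],
    'County': ['Washington County', 'Malheur County', 'Autauga County'],
    'TotalPop': [572071.0, 30421.0, 55036.0],
    'IncomePerCap': [35369.05, 17567.5, 27823.92],
    'Poverty': [10.32, 24.3, 13.76],
})

# PLACEHOLDER COVID values (hypothetical), including an 'Unknown' row with no ACS match.
covid = pd.DataFrame({
    'state': ['Oregon', 'Oregon', 'Alabama', 'Oregon'],
    'county': ['Washington', 'Malheur', 'Autauga', 'Unknown'],
    'total-cases': [20000, 3000, 5000, 10],
    'total-deaths': [200, 50, 60, 0],
    'cases-dec-2020': [5000, 500, 900, 0],
    'deaths-dec-2020': [50, 10, 12, 0],
})

# Assumption: removing a trailing " County" aligns the two naming schemes.
acs['county_key'] = acs['County'].str.replace(r' County$', '', regex=True)

# Inner join keeps only counties present in both data sets.
merged = acs.merge(
    covid, left_on=['State', 'county_key'], right_on=['state', 'county'], how='inner'
).drop(columns=['state', 'county', 'county_key'])

# Normalize each COVID count to a rate per 100,000 residents.
for col in ['total-cases', 'total-deaths', 'cases-dec-2020', 'deaths-dec-2020']:
    merged[f'{col}-per-100k'] = merged[col] / merged['TotalPop'] * 100_000

oregon = merged[merged['State'] == 'Oregon']
print(oregon)
```

An inner join silently drops rows such as the `Unknown` county. Passing `indicator=True` with `how='outer'` to `merge` is one way to inspect which counties fail to match before deciding how to handle them.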
