# Project Name

## Group Members and Contributions

 - Chiehkun (Timo) Chen
 - Jordan Daley
 - Jacob Moul, PID: A13548393
 - Hannah Peterson, PID: A13724073
 - Yun (Denise) Tang
 - George Thomas


## Introduction and Background

### Background Research

### Research Question

Question: Did the 2011-19 drought in California disproportionately affect low-income communities?

This project examines the impact of climate change on low-income communities through one climatic event: drought. We focus on California because it experienced a prolonged drought over the past decade, lasting 376 consecutive weeks from December 2011 to March 2019. In particular, we investigate whether the drought had disproportionately negative effects on low-income communities compared to average- and high-income communities. This question matters because, as the effects of global warming grow more severe, efforts must be made to protect the communities most vulnerable to them.

To answer this question, we plan to analyze indicators of economic well-being and effects of drought for various communities before, during, and after the most recent drought. For example, we plan to investigate the relationship between community income and costs associated with the drought, such as utility rates, as well as potential health issues, such as respiratory illnesses, that are known to increase in conjunction with drought.

### Hypothesis

Hypothesis: The 2011-19 drought in California disproportionately affected low-income communities.

We expect to find that these communities suffered more than relatively better-off communities because they have fewer safeguards against environmental events and are less able to bear the cost of higher utility rates or healthcare, for example.

## Data

### Data Sets

**Community Economic Data**
 - Data Set Name: 'cbp[yr]co.txt' (for years 2000-2016)
      - We modified these files to include only observations for California and renamed them 'cbp[yr]co_mod.csv'
 - Source: https://www.census.gov/programs-surveys/cbp/data/datasets.html

> These are County Business Patterns (CBP) data sets, provided with the description: “This series includes the number of establishments, employment during the week of March 12, first quarter payroll, and annual payroll. This data is useful for studying the economic activity of small areas; analyzing economic changes over time; and as a benchmark for other statistical series, surveys, and databases between economic censuses”. After being restricted to California, the 2016 data set, for example, contains 36,616 observations of 26 variables, several of which are identifiers such as state or county code. It also contains first quarter payroll, annual payroll, and number of employees, among other variables, for different industries in each California county. The data come from the US Census Bureau, and all of the data sets are downloadable in CSV format.
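
The California-only files described above could be produced along the following lines. This is a minimal sketch, not the exact procedure we used: it assumes the raw file uses the same lowercase column names as the modified files (`fipstate`, `fipscty`, `naics`), and the rows below are made-up stand-ins for a real 'cbp[yr]co.txt' file.

```python
import io
import pandas as pd

# Stand-in for a raw 'cbp[yr]co.txt' file; the rows and values are made up
raw_text = io.StringIO(
    "fipstate,fipscty,naics,emp,ap,est\n"
    "6,1,------,100,5000,10\n"
    "6,3,------,20,800,2\n"
    "16,23,------,30,900,3\n"
)
raw = pd.read_csv(raw_text)

# Keep only California rows (California's state FIPS code is 06)
california = raw[raw['fipstate'] == 6].reset_index(drop=True)

# A real run would then save the result, e.g. california.to_csv(..., index=False)
print(california)
```

Filtering on the numeric state code rather than on a county name avoids picking up same-named counties in other states.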


## Data Cleaning/Pre-Processing

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import glob

In [2]:
### Import County Business Patterns Data, add year to each table

# Load the California-only CBP file for each year from 2000 through 2016
cbp_data = []
for year in range(2000, 2017):
    # file names use the two-digit year, e.g. cbp00co_mod.csv for 2000
    df = pd.read_csv(f'Data/cb_patterns/cbp{year % 100:02d}co_mod.csv')
    df['year'] = year
    cbp_data.append(df)

In [3]:
### Wrangle data

# Variables to be used in CBP analysis, fips is combined state and county code; for others, see county_layout_2015.txt
cbp_vars = ['fips', 'emp', 'ap', 'est', 'year']

simplified_cbp_data = []

for df in cbp_data:    
    # reformat FIPS county code for merging
    df[['fipstate', 'fipscty']] = df[['fipstate', 'fipscty']].astype(str)
    df['fips'] = df.fipstate.str.zfill(2) + df.fipscty.str.zfill(3)
    
    # select aggregate county data
    df = df[df.naics == '------']
    
    # drop unnecessary columns
    df = df[cbp_vars]
    
    simplified_cbp_data.append(df)
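
To illustrate the FIPS reformatting in the loop above, here is a small worked example. The two rows are made up for illustration; only the column names come from the CBP tables.

```python
import pandas as pd

# Toy rows; column names follow the CBP tables, values are illustrative
toy = pd.DataFrame({'fipstate': [6, 6], 'fipscty': [1, 115]})

# Zero-pad to a 2-digit state code and a 3-digit county code, then concatenate
toy_fips = (toy['fipstate'].astype(str).str.zfill(2)
            + toy['fipscty'].astype(str).str.zfill(3))

print(toy_fips.tolist())  # ['06001', '06115']
```

Without the padding, state 6 and county 1 would combine to '61', which would not match the five-character codes used in the drought tables.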

In [4]:
cbp_final = pd.concat(simplified_cbp_data).reset_index(drop=True)
cbp_final = cbp_final.rename(columns={'fips': 'FIPS', 'emp': 'Employment', 'ap': 'Annual Payroll', 
                                      'est': 'Establishments', 'year': 'Year'})

In [5]:
### Import and Wrangle Drought Data

# collect all drought file names
drought_files = glob.glob('Data/drought/*')

# import all drought tables
drought_data = []
for file in drought_files:
    drought_data.append(pd.read_csv(file))

simplified_drought_data = []

for df in drought_data:
    # add year column, reformat FIPS column for merging
    df['Year'] = pd.to_datetime(df.ValidStart).dt.year
    df['FIPS'] = df.FIPS.astype(str).str.zfill(5)
    
    # average drought index per county per year
    # (assumes each file covers a single calendar year, taken from its first row)
    by_year = df.groupby(['FIPS', 'County'])[['None', 'D0', 'D1', 'D2', 'D3', 'D4']].mean().reset_index()
    by_year['Year'] = df.Year.iloc[0]
    
    simplified_drought_data.append(by_year)

In [6]:
drought_final = pd.concat(simplified_drought_data)
drought_final = drought_final.rename(columns={'None': 'No Drought'})
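
The yearly averaging above labels every row in a file with the year of that file's first record. If a file ever contained weeks from more than one calendar year, one possible alternative is to group on the year as well. The sketch below uses made-up weekly values and only the `D0` column to keep it short; it is an illustration of the idea, not the code used for our results.

```python
import pandas as pd

# Toy weekly records; column names follow the drought tables, values are made up
weekly = pd.DataFrame({
    'FIPS': ['06001', '06001', '06001', '06003'],
    'County': ['Alameda County', 'Alameda County', 'Alameda County', 'Alpine County'],
    'ValidStart': ['2012-12-25', '2013-01-01', '2013-01-08', '2013-01-01'],
    'D0': [10.0, 20.0, 40.0, 5.0],
})
weekly['Year'] = pd.to_datetime(weekly['ValidStart']).dt.year

# Grouping on Year as well keeps each week in the calendar year it belongs to
yearly = weekly.groupby(['FIPS', 'County', 'Year'], as_index=False)['D0'].mean()
print(yearly)
```

Here the single 2012 week for Alameda County stays in its own row instead of being averaged into 2013.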

In [7]:
### Import and Wrangle Unemployment Data

unemploy = pd.read_csv('Data/Local_Area_Unemployment_Statistics__LAUS___Annual_Average.csv')

# select county unemployment data
unemploy = unemploy[(unemploy['Area Type'] == 'County') & (unemploy['Year'] < 2017)]

# select desired variable
unemploy = unemploy[['Area Name', 'Year', 'Unemployment Rate']].reset_index(drop=True)

# rename columns for merging
unemploy = unemploy.rename(columns={'Area Name': 'County'})

In [8]:
### Merge Data Sets

# outer join keeps county-years that appear in only one of the CBP and drought tables
first_merge = pd.merge(left=cbp_final, right=drought_final, on=['FIPS', 'Year'], how='outer')
# default inner join keeps only county-years that also have an unemployment rate
project_df = pd.merge(left=first_merge, right=unemploy, on=['County', 'Year'])
project_df.head()

| | FIPS | Employment | Annual Payroll | Establishments | Year | County | No Drought | D0 | D1 | D2 | D3 | D4 | Unemployment Rate |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 6001 | 655730.0 | 28627936.0 | 36391.0 | 2000 | Alameda County | 97.8625 | 2.1375 | 0.0 | 0.0 | 0.0 | 0.0 | 3.6 |
| 1 | 6003 | 1048.0 | 13362.0 | 52.0 | 2000 | Alpine County | 94.659423 | 5.340577 | 0.0 | 0.0 | 0.0 | 0.0 | 6.3 |
| 2 | 6005 | 7378.0 | 174367.0 | 849.0 | 2000 | Amador County | 94.334038 | 5.665962 | 0.0 | 0.0 | 0.0 | 0.0 | 5.2 |
| 3 | 6007 | 53776.0 | 1209766.0 | 4597.0 | 2000 | Butte County | 94.517885 | 5.482115 | 0.0 | 0.0 | 0.0 | 0.0 | 6.2 |
| 4 | 16023 | NaN | NaN | NaN | 2000 | Butte County | 60.268462 | 39.731538 | 30.603846 | 0.0 | 0.0 | 0.0 | 6.2 |
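
Row 4 of this preview has FIPS 16023, whose state prefix is 16 (Idaho) rather than California's 06. It has no CBP values, yet it carries the same unemployment rate as California's Butte County. This suggests the drought tables may include counties outside California, which then match California counties when merging on county name alone. One possible guard is sketched below with toy data; the table contents are made up, and `validate` is a standard `pd.merge` option.

```python
import pandas as pd

# Toy stand-ins for the merged CBP/drought table and the unemployment table
merged = pd.DataFrame({
    'FIPS': ['06007', '16023'],
    'County': ['Butte County', 'Butte County'],
    'Year': [2000, 2000],
})
unemploy_toy = pd.DataFrame({
    'County': ['Butte County'], 'Year': [2000], 'Unemployment Rate': [6.2],
})

# Keep only California counties (FIPS codes starting with '06') before joining on name
california_only = merged[merged['FIPS'].str.startswith('06')]

# validate raises a MergeError if a county-year appears more than once on either side
checked = pd.merge(california_only, unemploy_toy,
                   on=['County', 'Year'], validate='one_to_one')
print(checked)
```

Without the FIPS filter, the `validate='one_to_one'` check would fail on this toy data, which is a quick way to surface duplicate county names.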


In [40]:
# the column names are on the second line of the file
income_1999 = pd.read_csv('Data/income_by_county_1999.csv', header=1)
# keep the columns from the seventh onward
income_1999 = income_1999.iloc[:, 6:]
income_1999 = income_1999.rename(columns={income_1999.columns[0]: 'County',
                                          income_1999.columns[2]: '% pop. in poverty'})
income_1999.head()

| | County | Per capita income in 1999 (dollars) | % pop. in poverty |
|---|---|---|---|
| 0 | California | 22711 | 14.2 |
| 1 | Alameda County | 26680 | 11.0 |
| 2 | Alpine County | 24431 | 19.5 |
| 3 | Amador County | 22412 | 9.2 |
| 4 | Butte County | 17517 | 19.8 |
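
Since our question compares low-income communities with average- and high-income ones, the income table will eventually need to be turned into income groups. The sketch below shows one possible way to do this with `pd.qcut`, using the rows from the preview above. The tier names and the choice of three groups are assumptions for illustration, not a decision we have made, and with so few rows the split is only indicative.

```python
import pandas as pd

# Rows copied from the preview above
income_toy = pd.DataFrame({
    'County': ['California', 'Alameda County', 'Alpine County',
               'Amador County', 'Butte County'],
    'Per capita income in 1999 (dollars)': [22711, 26680, 24431, 22412, 17517],
})

# Drop the statewide aggregate so only counties remain
counties = income_toy[income_toy['County'] != 'California'].copy()

# Split counties into three quantile-based income groups (tier names are hypothetical)
counties['Income Tier'] = pd.qcut(
    counties['Per capita income in 1999 (dollars)'],
    q=3, labels=['low', 'middle', 'high'],
)
print(counties[['County', 'Income Tier']])
```

Note that the first row of the income table is the statewide figure, so it would need to be removed before any county-level merge as well.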


## Data Description

## Data Exploration

## Data Analysis

## Ethical Considerations

Because this research relies only on quantitative measures such as income values or disease rates, it does not require personal information. To protect the privacy of the individuals represented in the data, any personal information that is not needed for the analysis (such as the name or address of a household from which utility data is collected) will be removed from the final results. We do not believe that our question or data sets are invasive in nature, and we expect such cases to be rare, if they occur at all. For our analyses, it is important to be aware of the racial inequalities present in low-income communities. Before making any generalizations, we will check, where the data is available, that the ethnicities of the households or individuals in the census data are representative of the communities we are studying.

## Conclusions and Discussion