# Master's Capstone - James Quacinella

# Table of Contents

* [Abstract](#Abstract)
* [Methods](#Methods)
* [Pre-Data Collection](#Pre-Data-Collection)
* [Data Collection](#Data-Collection)
  * [Consumer Expenditure Report](#Consumer-Expenditure-Report)
  * [USDA Food Plans](#USDA-Food-Plans)
  * [Free Market Rent Data From HUD](#Free-Market-Rent-Data-From-HUD)
  * [Medical Expenditure Panel Survey from the AHRQ](#Medical-Expenditure-Panel-Survey-from-the-AHRQ)
  * [Tax Data](#Tax-Data)
* [Model Variables](#Model-Variables)
    * [Housing Costs](#Housing-Costs)
    * [Food Costs](#Food-Costs)
    * [Transportation Costs](#Transportation-Cost)
    * [Child Care Cost](#Child-Care-Cost)
    * [Health Insurance Costs](#Health-Insurance-Costs)
    * [Health Care Costs](#Health-Care-Costs)
    * [Other Necessities Cost](#Other-Necessities-Cost)
    * [Taxes](#Taxes-Data)
      * [Payroll Taxes](#Payroll-Taxes)
      * [State Tax Rate](#State-Tax-Rate)
      * [Federal Income Tax Rate](#Federal-Income-Tax-Rate)
* [Introductory Analysis](#Introductory-Analysis)
* [Appendix](#Appendix---Data-Tables)
  * [Data Tables](#Appendix---Data-Tables)
  * [Things to Revisit](#Appendix---Things-to-Revisit)


# Abstract

**Objectives:** This study will extend an established model for estimating the current living wage in 2015 to the past decade for the purpose of:

* an exploratory analysis trends in the gap between the estimated living wage and the minimum wage
* evaluating any correlation between the living wage gap and other economic metrics, including public funds spent on social services

**Methods:** The original data set for this model is for 2014. This study will extend the data sources of this model into the past to enable trend analysis. Data for economic metrics from public data sources will supplement this data for correlation analysis.


# Methods

## Model

The original model proposed estimated the living wage in terms of 9 variables:

** *basic_needs_budget* ** = *food_cost* + *child_care_cost* + ( *insurance_premiums* + *health_care_costs* ) + *housing_cost* + *transportation_cost* + *other_necessities_cost*

** *living_wage* ** = *basic_needs_budget* + ( *basic_needs_budget* \* *tax_rate* )

## Data Sources

The following data sources are used to find estimates of the model variables:

* The food cost is estimated from data from the USDA’s low-cost food plan national average in June 2014.
* Child care is based off state-level estimates published by the National Association of Child Care Resource and Referral Agencies.
* Insurance costs are based on the insurance component of the 2013 Medical Expenditure Panel Survey.
* Housing costs are estimated from the HUD Fair Market Rents (FMR) estimates
* Other variables are pulled from the 2014 Bureau of Labor Statistics Consumer Expenditure Survey.

These data sets extend into the past, allowing for calculating the model for years past. The data will also have to be adjusted for inflation 6.

## Analytic Approach

First, data will be gathered from the data sources of the original model but will be extended into the past. The methodology followed by the model will be replicated to come up with a data set representing estimates of the living wage across time. After the data set is prepared, the trend of the living wage as compared to minimum wage can be examined. Has the gap increased or decreased over time, and at what rate? Have certain areas seen larger than average increases or decreases in this gap? 

Once preliminary trend analysis is done, this data set will be analyzed in comparison to other economic trends to see if any interesting correlations can be found. Correlations to GDP growth rate and the national rate of unemployment can be made, but the primary investigation will be to see if the living wage gap correlates to national spending on SNAP (Food stamps). In other words, we will see if there is any (potentially time lagged) relationship between the living wage gap and how much the United States needs to spend to support those who cannot make ends meet. A relationship here can potentially indicate that shrinking this gap could lower public expenditures.


## Presentation Of Results

Results will be presented for both parts of the data analysis. For studying the living wage gap trends, this report will present graphs of time series, aggregated in different ways, of the living wage as well as the living wage gap. Some of these time series will be presented along side data on public expenditures on SNAP to visually inspect for correlations.

## Background / Sources

- Glasmeier AK, Nadeau CA, Schultheis E: LIVING WAGE CALCULATOR User’s Guide / Technical Notes 2014 Update
- USDA low-cost food plan, June, 2014
- Child Care in America 2014 State fact sheets
- 2013 Medical Expenditure Panel Survey Available
- Consumer Expenditure Survey
- Inflation Calculator

------

------

------

# Pre-Data Collection ([TOC](#Table-of-Contents))

Lets do all of our imports now:

In [2]:
import numpy as np
from prettytable import PrettyTable
from IPython.core.display import HTML
from collections import OrderedDict, defaultdict
from bs4 import BeautifulSoup
import os
from pprint import pprint
import pandas as pd
import itertools
import editdist

# Path to local dir on my laptop
PROJECT_PATH = "/home/james/Development/Masters/Thesis"

def constant_factory(value):
    ''' Always prodcues a constant value; used fo defaultdict '''
    return itertools.repeat(value).next

def caption(msg, tablenum):
    ''' Help convert text into suitable table caption '''
    return "<br><b>Table %d - %s</b>" % (tablenum, msg)

Lets setup some inflation multipliers:

In [3]:
# Multiply a dollar value to get the equivalent 2014 dollars
# Original numbers from model; used to confirm methodology matches the original model
inflation_multipliers = {
    2010: 1.092609, 
    2011: 1.059176,
    2012: 1.037701,
    2013: 1.022721,
    2014: 1.0
}

# Updated inflation numbers should scale to 2015 dollars
updated_inflation_multipliers = {
    2000: 1.3811731,
    2001: 1.3429588,
    2002: 1.3220567,
    2003: 1.2925978,
    2004: 1.2590683,
    2005: 1.2178085,
    2006: 1.179752,
    2007: 1.1470807,
    2008: 1.1046664,
    2009: 1.1086106, 
    2010: 1.0907198, 
    2011: 1.0573444,
    2012: 1.0359069,
    2013: 1.0209524,
    2014: 1.004655
}

Global identifiers used throughout the project:

In [4]:
# Constants used to refer to US regions
REGION_EAST = 'east'
REGION_MIDWEST = 'midwest'
REGION_SOUTH = 'south'
REGION_WEST = 'west'
REGION_BASE = 'base'   # Used for when a state is not in a region (Alaska, Hawaii mostly)

# Create a state initial to region mapping to use for regional weighting
state_to_region_mapping = defaultdict(constant_factory(REGION_BASE))
    
state_to_region_mapping.update(
    { 
    'PA': REGION_EAST, 'NJ': REGION_EAST, 'NY': REGION_EAST, 'CT': REGION_EAST, 'MA': REGION_EAST,
    'NH': REGION_EAST, 'VT': REGION_EAST, 'ME': REGION_EAST, 'RI': REGION_EAST, 
    'OH': REGION_MIDWEST, 'IL': REGION_MIDWEST, 'IN': REGION_MIDWEST, 'WI': REGION_MIDWEST, 'MI': REGION_MIDWEST,
    'MN': REGION_MIDWEST, 'IA': REGION_MIDWEST, 'MO': REGION_MIDWEST, 'KS': REGION_MIDWEST, 'NE': REGION_MIDWEST,
    'SD': REGION_MIDWEST, 'ND': REGION_MIDWEST,
    'TX': REGION_SOUTH, 'OK': REGION_SOUTH, 'AR': REGION_SOUTH, 'LA': REGION_SOUTH, 'MS': REGION_SOUTH,
    'AL': REGION_SOUTH, 'GA': REGION_SOUTH, 'FL': REGION_SOUTH, 'SC': REGION_SOUTH, 'NC': REGION_SOUTH,
    'VA': REGION_SOUTH, 'WV': REGION_SOUTH, 'KY': REGION_SOUTH, 'TN': REGION_SOUTH, 'MD': REGION_SOUTH,
    'DE': REGION_SOUTH,
    'CA': REGION_WEST, 'OR': REGION_WEST, 'WA': REGION_WEST, 'NV': REGION_WEST, 'ID': REGION_WEST,
    'UT': REGION_WEST, 'AZ': REGION_WEST, 'MT': REGION_WEST, 'WY': REGION_WEST, 'CO': REGION_WEST,
    'NM': REGION_WEST, 'AK': REGION_BASE, 'HI': REGION_BASE
})

Lets setup regional differences for the food data:

In [5]:
# Multiply price of food by regional multipler to get better estimate of food costs
food_regional_multipliers = {
    REGION_EAST: 0.08,
    REGION_WEST: 0.11,
    REGION_SOUTH: -0.07,
    REGION_MIDWEST: -0.05,
}

------

------

------

#  Data Collection 

([TOC](#Table-of-Contents))

The following sections will outline how I gathered the data for the various model parameters as well as other data we need to calculate their values. The original model was made for 2014 data and extending this data to the past means we need to be careful that any changes in the underlying data methodology of these parameters needs to be noted.

## Data Sources

### Consumer Expenditure Report

Wget commands used to get the Consumer Expenditure Reports:

In [None]:
# Get CEX for 2013 and 2014 (XLSX format)
for i in `seq 2013 2014`; do wget http://www.bls.gov/cex/$i/aggregate/cusize.xlsx -O ${i}_cex.xlsx; done

# Get CEX for 2004 - 2012 (XLS format)
for i in `seq 2004 2012`; do wget http://www.bls.gov/cex/$i/aggregate/cusize.xls -O ${i}_cex.xls; done

# Get CEX for 2001 to 2003 (TXT format)
for i in `seq 2001 2003`; do wget http://www.bls.gov/cex/aggregate/$i/cusize.txt -O ${i}_cex.txt; done


# Get CEX region for 2013 and 2014 (XLSX format)
for i in `seq 2013 2014`; do wget http://www.bls.gov/cex/$i/aggregate/region.xlsx -O ${i}_region_cex.xlsx; done

# Get CEX region for 2004 - 2012 (XLS format)
for i in `seq 2004 2012`; do wget http://www.bls.gov/cex/$i/aggregate/region.xls -O ${i}_region_cex.xls; done

# Get CEX region for 2001 to 2003 (TXT format)
for i in `seq 2001 2003`; do wget http://www.bls.gov/cex/aggregate/$i/region.txt -O ${i}_region_cex.txt; done

### USDA Food Plans

Wget commands used to gather data files:

In [None]:
# Change command to get '10 - '15
for i in {1..9}; do  wget http://www.cnpp.usda.gov/sites/default/files/usda_food_plans_cost_of_food/CostofFoodJun0$i.pdf; done

### Free Market Rent Data From HUD

Below are the wget commands for getting the FMR data

#### TODO 

* Extract counties -> state -> region mapping

In [None]:
cd data/fmr
for i in `seq 2014 2015`; do wget http://www.huduser.gov/portal/datasets/fmr/fmr${i}f/FY${i}_4050_RevFinal.xls -O fmr${i}.xlsx; done
for i in `seq 2010 2013`; do wget http://www.huduser.gov/portal/datasets/fmr/fmr${i}f/FY${i}_4050_Final.xls -O fmr${i}.xlsx; done
for i in `seq 2009 2009`; do wget http://www.huduser.gov/portal/datasets/fmr/fmr${i}r/FY${i}_4050_Rev_Final.xls -O fmr${i}.xlsx; done

# GRRRR
wget http://www.huduser.gov/portal/datasets/fmr/fmr2008r/FMR_county_fy2008r_rdds.xls
wget http://www.huduser.gov/portal/datasets/fmr/fmr2007f/FY2007F_County_Town.xls
wget http://www.huduser.gov/portal/datasets/fmr/fmr2006r/FY2006_County_Town.xls
wget http://www.huduser.gov/portal/datasets/fmr/fmr2005r/Revised_FY2005_CntLevel.xls
wget http://www.huduser.gov/portal/datasets/FMR/FMR2004F/FMR2004F_County.xls
wget http://www.huduser.gov/portal/datasets/fmr/FMR2003F_County.xls
wget http://www.huduser.gov/portal/datasets/fmr/FMR2002F.xls

In [23]:
# Counties dict will map county ID to useful infomation, mostly region and state
counties = { }

### Medical Expenditure Panel Survey from the AHRQ

Below are the wget commands used to download this data. This data will have to be further parsed from HTML.

In [None]:
# Load insurance data
cd data/insurance
for i in `seq 2001 2014`; do 
    wget -O ${i}_tiic2.html http://meps.ahrq.gov/mepsweb/data_stats/summ_tables/insr/state/series_2/${i}/tiic2.htm ;
done

### Tax Data

Here is all the files we need for tax data:

In [None]:
# Data from Tax Foundation on individual tax rates per state per year
cd data/taxes
wget -O State_Individual_Income_Tax_Rates_2000-2014.xlsx http://taxfoundation.org/sites/taxfoundation.org/files/docs/State%20Individual%20Income%20Tax%20Rates%2C%202000-2014.xlsx

---

---

---

## Model Variables  

[TOC](#Table-of-Contents)

### Housing Costs

Definition from the model:

> We assumed  that  a  one  adult  family  would  rent a  single occupancy unit (zero bedrooms) for an individual adult household, that a two adult family would  rent a one bedroom apartment,

The counties are identified by the FIPS code, which is just state code + county code + subcounty code (only post 2005). 

We need to do some string matching to find FIPS codes for 2002, since they are not in the file. Exact matches work for 84% of the data. The other data is filled in via finding name with smallest levishtein distance. Used [py-editdist]( http://www.mindrot.org/projects/py-editdist) instead of nltk's implementation due to speed issues.

Final data can be found in the [Appendix: Housing Costs Data Table](#Housing-Costs-Data-Table).

#### Methodology Confidence 

**TODO**

#### TODO

* Handle inflation
* Look into 2005 and 2006 transition
* figure out multi level index for data
* subset all data to include counties that are across all years


In [27]:
# Fair Market Rent data
fmr_data = { }

def pad_county(county):
    ''' Pad counties to three digits when we need to construct one manually. '''
    return '%03d' % county

def pad_fips(fip):
    ''' Add 99999 to end of fip code (which nullifies the subcounty identifier) '''
    return int(str(fip) + '99999')

# For now, loading 2002 - 2014
for year in range(2002, 2015):
    with open(PROJECT_PATH + "/data/fmr/fmr%d.csv" % year, 'rb') as csvfile:
        # Store dataframe from csv into dict
        fmr_data[year] = pd.read_csv(csvfile)
        
        # Lower case headings to make life easier
        fmr_data[year].columns = map(str.lower, fmr_data[year].columns)
        
        # Custom processing per year
        if year > 2012:
            # Left out "fips2010"
            fmr_data[year] = fmr_data[year][["fmr0", "county", "cousub", "countyname", "fips2000", "pop2010", "state", "state_alpha"]]
            
            # TODO: should we do this?
            # fmr_data[year]['fips'] = fmr_data[year]['fips2000']
            fmr_data[year].rename(columns={'fips2000':'fips'}, inplace=True)
            
            fmr_data[year] = fmr_data[year].query('cousub == 99999').reset_index(drop=True)
        elif year > 2005:
            fmr_data[year] = fmr_data[year][["fmr0", "county", "cousub", "countyname", "fips", "pop2000", "state", "state_alpha"]]
            fmr_data[year] = fmr_data[year].query('cousub == 99999').reset_index(drop=True)
        elif year == 2005:
            fmr_data[year] = fmr_data[year][["fmr_0bed", "county", "countyname", "pop2000", "state", "state_alpha", "stco"
]]
            fmr_data[year].rename(columns={'stco':'fips', 'fmr_0bed': 'fmr0'}, inplace=True)
            fmr_data[year]['fips'] = fmr_data[year]['fips'].map(pad_fips)
        elif year == 2004:
            fmr_data[year] = fmr_data[year][["new_fmr0", "county", "countyname", "pop100", "state", "state_alpha"]]
            fmr_data[year]['fips'] = fmr_data[year]['state'].map(str) + fmr_data[year]['county'].map(pad_county)
            fmr_data[year].rename(columns={'stco':'fips', 'new_fmr0': 'fmr0'}, inplace=True)
            fmr_data[year]['fips'] = fmr_data[year]['fips'].map(pad_fips)
        elif year == 2003:
            fmr_data[year] = fmr_data[year][["fmr0", "county", "countyname", "pop", "state", "state_alpha"]]
            fmr_data[year]['fips'] = fmr_data[year]['state'].map(str) + fmr_data[year]['county'].map(pad_county)
            fmr_data[year]['fips'] = fmr_data[year]['fips'].map(pad_fips)
        elif year == 2002:
            # NOTE: we have to calculate FIPS codes by hand in cell below
            fmr_data[year] = fmr_data[year][["fmr0br", "areaname", "st"]]
            fmr_data[year].rename(columns={'st':'state_alpha', 'fmr0br': 'fmr0', 'areaname': 'countyname'}, inplace=True)

        # Inflation
        fmr_data[year]['fmr0_inf'] = fmr_data[year]['fmr0'] * updated_inflation_multipliers[year]
        
        # Add region column
        # METHOD: the defaultdict will use region_base if the state is not in the initial state to region mapping
        fmr_data[year]['region'] = fmr_data[year]['state_alpha'].map(lambda x: state_to_region_mapping[x])

        
##### Handle 2002 data ######

# Custom comparator to compare column of strings to given string
def compare_lambda(y):
    def compare(x):
        return (x[0], x[1], editdist.distance(x[1], y))
    return compare

# Init list of fips we need to find and a bitmap of which 2002 counties we processes
fips = [ None ] * len(fmr_data[2002]['countyname'])
found_bitmap = [ False ] * len(fmr_data[2002]['countyname'])

# For each count in 2002 ...
for idx, countyname in enumerate(fmr_data[2002]['countyname']):
    # See if any row mathes this countyname exactly
    county_matches = fmr_data[2003]['countyname'].map(lambda x: x.lower()) == countyname.lower()
    found = np.any(county_matches)
    if found: 
        found_bitmap[idx] = True
        fips[idx] = fmr_data[2003]['fips'][idx]


# 84% found a match. can we do better with lev dist?
# print np.sum(found_bitmap) / float(len(found_bitmap))

# Get list of counties (as tuples) in 2003 which we try to match to
good_counties = list(enumerate(fmr_data[2003]['countyname']))

# For each county in 2002 ...
for idx, countyname in enumerate(fmr_data[2002]['countyname']):
    # If already matched, we skip; otherwise ...
    if not found_bitmap[idx]:
        # Get list of distances from 2002 countyname to all 2003 countynames
        # NOTE: use of compare_lambda to create custom comparator that also 
        # returns data in (idx, countyname, levdist) form
        distances = map(compare_lambda(countyname.lower()), 
                        map(lambda x: (x[0], x[1].lower()), 
                            good_counties))
        
        # Find the minimum distance (with custom key to only compare third element, which is levdist)
        min_distance = min(distances, key=lambda x: x[2])
        
        # Update bitmap and store appropriate FIPS code from 2003 
        found_bitmap[idx] = True
        fips[idx] = fmr_data[2003]['fips'][idx]

# Add calculated fips to new column in 2002
fmr_data[2002]['fips'] = fips


##### Construct final multi-level dataframe #####
fmr_df = pd.DataFrame()
for year in range(2002, 2014):
    mindex = pd.MultiIndex.from_tuples(zip([year]*len(fmr_data[year]), fmr_data[year]['fips']), names=["year", "fips"])
    new_df = fmr_data[year]
    new_df.index = mindex
    new_df.columns = fmr_data[year].columns
    fmr_df = pd.concat([fmr_df, new_df])

#### Issue with county change from 2005 to 2006

In [96]:
for year in range(2003, 2014):
    x = set(fmr_data[year]['fips'])
    y = set(fmr_data[year+1]['fips'])
    print("Diff between %d and %d is: %s" % (year, year+1, len(y.difference(x))))
    print("Diff between %d and %d is: %s" % (year, year+1, len(x.difference(y))))
    print

# print(list(set(fmr_data[2005]['fips']))[0:10])
# print(list(set(fmr_data[2006]['fips']))[0:10])

print set(fmr_data[2006]['fips']).difference(set(fmr_data[2005]['fips']))

Diff between 2003 and 2004 is: 2
Diff between 2003 and 2004 is: 0

Diff between 2004 and 2005 is: 12
Diff between 2004 and 2005 is: 33

Diff between 2005 and 2006 is: 39
Diff between 2005 and 2006 is: 67

Diff between 2006 and 2007 is: 0
Diff between 2006 and 2007 is: 0

Diff between 2007 and 2008 is: 0
Diff between 2007 and 2008 is: 0

Diff between 2008 and 2009 is: 0
Diff between 2008 and 2009 is: 0

Diff between 2009 and 2010 is: 0
Diff between 2009 and 2010 is: 0

Diff between 2010 and 2011 is: 5
Diff between 2010 and 2011 is: 3

Diff between 2011 and 2012 is: 0
Diff between 2011 and 2012 is: 0

Diff between 2012 and 2013 is: 0
Diff between 2012 and 2013 is: 0

Diff between 2013 and 2014 is: 0
Diff between 2013 and 2014 is: 0

set([7204999999, 7209599999, 7208399999, 7214199999, 7207199999, 7200199999, 5119599999, 7213199999, 7211799999, 7209399999, 5166099999, 7208199999, 7211599999, 7207999999, 7200999999, 5175099999, 5153099999, 5100599999, 7205799999, 7203999999, 7204399999, 72

## Food Costs

([TOC](#Table-of-Contents))
 
Data for the food calculations have been successfully downloaded in PDF form. The main way to calculate this is, from the PDF:

>Adult  food  consumption  costs  are  estimated  by  averaging  the  low - cost  plan  food  costs for  males  and  females  between  19  and  50

Note, we add 20% to the values from the data sheets, since the notes on all published PDFs from the USDA state to add 20% to the listed values for individuals since:

>The costs given are for individuals in 4-person families. For individuals in other size families, the following adjustments are suggested: 1-person—add 20 percent; ...

The notes for the model also state that regional weights are applied to give a better estimate for food costs across the nation. The result of this section are values for 2014 that match exactly to the data given on the model website, so I am confident the implementation of the methodology below is correct.

The final data can be seen in the [Appendix: Food Costs Data Table](#Food-Costs-Data-Table)

### Notes: Change of USDA Methodology

In 2006, the data from the USDA changed the age ranges for their healthy meal cost calculations. The differences in range are minimal and should not effect overall estimations.

### Methodology Confidence

The methodology of this section produces numbers exactly like the original model, so the confidence in the methodology is **high**.

In [13]:
# The base food cost (not regionally weighed) for nation (data pulled manually from PDFs)
national_monthly_food_cost_per_year = {
    2014: {"base": np.average([241.50, 209.80])},
    2013: {"base": np.average([234.60, 203.70])},
    2012: {"base": np.average([234.00, 203.00])},
    2011: {"base": np.average([226.80, 196.90])},
    2010: {"base": np.average([216.30, 187.70])},
    2009: {"base": np.average([216.50, 187.90])},
    2008: {"base": np.average([216.90, 189.60])},
    2007: {"base": np.average([200.20, 174.10])},
    2006: {"base": np.average([189.70, 164.80])},
    2005: {"base": np.average([186.20, 162.10])},
    2004: {"base": np.average([183.10, 159.50])},
    2003: {"base": np.average([174.20, 151.70])},
    2002: {"base": np.average([170.30, 148.60])},
    2001: {"base": np.average([166.80, 145.60])},
}

# Create ordered dict to make sure we process things in order
national_monthly_food_cost_per_year = OrderedDict(sorted(national_monthly_food_cost_per_year.items(), 
                                                        key=lambda t: t[0]))

# Adjust the data according to notes above
for year in national_monthly_food_cost_per_year:
    # Inflation and 20% adjustment
    national_monthly_food_cost_per_year[year]["base"] = \
        national_monthly_food_cost_per_year[year]["base"] * 1.20 * updated_inflation_multipliers[year]

    # Regional adjustment
#     national_monthly_food_cost_per_year[year]["regional"] = { }
    for region in food_regional_multipliers:
        national_monthly_food_cost_per_year[year][region] = \
            national_monthly_food_cost_per_year[year]["base"] * (1 + food_regional_multipliers[region])

national_monthly_food_cost_per_year_df = pd.DataFrame.from_dict(national_monthly_food_cost_per_year)

In yearly form:

In [165]:
# # Print it nicely in yearly costs
# pt = PrettyTable()
# pt.add_column("Year", national_monthly_food_cost_per_year.keys())
# pt.add_column("Food Cost (per year)", [np.round(x["base"] * 12) for x in national_monthly_food_cost_per_year.values()])
# for region in food_regional_multipliers:
#     pt.add_column("Food Cost (%s)" % region, [np.round(x["regional"][region] * 12) for x in national_monthly_food_cost_per_year.values()])

# # Print as HTML
# HTML(pt.get_html_string() + caption("Food Data Loaded from USDA Pricing on Meals", 1))

## Transportation Cost

([TOC](#Table-of-Contents))
 
Looking at the (1) Cars and trucks (used), (2) gasoline and motor oil, (3) other vehicle expenses, and (4)  public  transportation fields under "Transportation" in the 2014 Consumer Expenditure Report, we can pull out information from each to model the claculation done in the original model. For each sub-variable, we get the amount of money (in millions) and the percentgae of that that single adults spend. After multiple those numbers (accounting for units) and dividiing by the total number of single adults in the survey gives us a mean total cost per adult.

The original model takes into account regional drift by scaling based on each regions. NOTE: See todo in this section

Since this data reflects conditions in 2013, we account for inflation to get the 2014 estimate that is produced in the original model.

### TODO:

* Figure out how to do regional differences correctly. Emailed model creator for clarification

In [55]:
# Transportation data from 2014 survey is for year 2013, etc
cex = {
    2012: {
        "single_adults": 37770.0,
        "transport": {
            "used_car": 209764.0,
            "gasoline": 328170.0,
            "other_vehicle": 324668.0,
            "public": 67486.0,
            "used_car_percent": 0.152,
            "gasoline_percent": 0.158,
            "other_vehicle_percent": 0.191,
            "public_percent": 0.174,
            "regional": {
                REGION_EAST:   16.4 / 17.6,  
                REGION_MIDWEST: 18.0 / 17.6,
                REGION_SOUTH: 18.9 / 17.6,
                REGION_WEST: 16.5 / 17.6,
            }
        }

    },
    2013: {
        "single_adults": 37884.0,
        "transport": {
            "used_car": 214524.0,
            "gasoline": 313481.0,
            "other_vehicle": 345454.0,
            "public": 73842.0,
            "used_car_percent": 0.146,
            "gasoline_percent": 0.157,
            "other_vehicle_percent": 0.163,
            "public_percent": 0.172,
            "regional": {
                REGION_EAST: 15.7 / 17.0,     # 0.923
                REGION_MIDWEST: 16.9 / 17.0,  # 0.994
                REGION_SOUTH: 18.3 / 17.0,    # 1.076
                REGION_WEST: 16.1 / 17.0,     # 0.947
            }
        }
    },
}

# Ideal numbers from model
ideal_transport_2013 = (3764, 4569, 4697, 4054)

# Base price for transport
transportation_costs = defaultdict(dict)

for year in cex:
    transportation_costs[year]["base"] = \
        (1000000 * ((cex[year]["transport"]["used_car"] * cex[year]["transport"]["used_car_percent"]) + \
                    (cex[year]["transport"]["gasoline"] * cex[year]["transport"]["gasoline_percent"]) + \
                    (cex[year]["transport"]["other_vehicle"] * cex[year]["transport"]["other_vehicle_percent"] ) + \
                    (cex[year]["transport"]["public"] * cex[year]["transport"]["public_percent"] )) /  float(cex[year]["single_adults"] * 1000) ) * inflation_multipliers[year]

    # Account for regional drift
    for region in cex[year]["transport"]["regional"]:
        transportation_costs[year][region] = transportation_costs[year]["base"] * cex[year]["transport"]["regional"][region]

transportation_costs["2014_ideal"]["base"] = 0.0
transportation_costs["2014_ideal"][REGION_EAST] = ideal_transport_2013[0]
transportation_costs["2014_ideal"][REGION_MIDWEST] = ideal_transport_2013[1]
transportation_costs["2014_ideal"][REGION_SOUTH] = ideal_transport_2013[2]
transportation_costs["2014_ideal"][REGION_WEST] = ideal_transport_2013[3]

# Print it nicely
errors = []
pt = PrettyTable()
pt.add_column("Year", transportation_costs.keys())
for region in sorted(transportation_costs[2013].keys()):
    data = [ transportation_costs[year][region] for year in transportation_costs  ]
    pt.add_column("Trans Cost (%s)" % region, data)
    errors.append(transportation_costs["2014_ideal"][region] - data[-2])

print(sum([np.abs(error) for error in errors]))

# Print as HTML
HTML(pt.get_html_string())

5209.92768399


Year,Trans Cost (base),Trans Cost (east),Trans Cost (midwest),Trans Cost (south),Trans Cost (west)
2012,4326.89007326,4031.87484099,4425.22848402,4646.48990822,4056.45944368
2013,4037.18458744,3728.45870723,4013.43644281,4345.91046766,3823.45128575
2014_ideal,0.0,3764.0,4569.0,4697.0,4054.0


### Testing theory about regional difference

In [123]:
# Order: NE, MW, S, W
used_car_rations = (2.5 / 3.2, 3.5 / 3.2, 3.5 / 3.2, 2.9 / 3.2)
gas_rations = (3.8 / 4.6, 4.7 / 4.6, 5.2 / 4.6, 4.5 / 4.6)
other_rations = (5.2 / 5.1, 5.0  / 5.1, 5.1 / 5.1,  5.1 / 5.1)
public_rations = (1.6/1.1,  0.9/1.1,  0.8/1.1, 1.2/1.1)

error = []
for region in range(4):
    val = (1000000 * 
         ( (
            (cex[2013]["transport"]["used_car"] * cex[2013]["transport"]["used_car_percent"] * used_car_rations[region]) + \
            (cex[2013]["transport"]["gasoline"] * cex[2013]["transport"]["gasoline_percent"] * gas_rations[region]) + \
            (cex[2013]["transport"]["other_vehicle"] * cex[2013]["transport"]["other_vehicle_percent"] * other_rations[region]) + \
            (cex[2013]["transport"]["public"] * cex[2013]["transport"]["public_percent"] * public_rations[region])
        ) /  (float(cex[2013]["single_adults"] * 1000)) ) * inflation_multipliers[2013])
    errors.append( val - ideal_transport_2013[region] )

print(sum([np.abs(error) for error in errors]))

7516.27558175


In [134]:
# Calculate regional diff values from aggregated data (since 'combined' only goes back to 2012)
print 1/ (6790803*1000000*20.1 / (1152035*1000000*18.6))
print 1/ (6790803*1000000*21.7 / (1152035*1000000*21.7))
print 1/ (6790803*1000000*34.3 / (1152035*1000000*37.1))
print 1/ (6790803*1000000*23.9 / (1152035*1000000*22.6))


print 1152035/6790803.0



0.15698618246
0.169646358465
0.183495040788
0.160418732272
0.169646358465


## Child Care Cost

([TOC](#Table-of-Contents))
 
Manually download PDFs from ChildCareAware.org. Sadly, they only go back to 2010. I can now either:

* have to find other estimates of child care costs from pre-2010 (prefered)
* check if the Consumer Expenditure Survey has data on this
* impute the data (dont think this is a good idea)
* limit the analysis going back to 2010 (which seems limiting since other data, like the Consumer Expenditure Survey in 2014 provides 2013 data and that is the latest currently).

Currently I am only focusing on modeling costs for a single adult (an assumption I made early on) since I am interested in trends, and the other 'family configurations' are just linear combinations of the costs for one adult and for one child. However if I wanted to extend the numbers for 1 adult + 1 child, I would have to look into this further. For now I'll move on.

## Health Insurance Costs

([TOC](#Table-of-Contents))
 
The model uses data from the Medical Expenditure Panel Survey from the Agency for Healthcare Research and Quality (searchable [here](http://meps.ahrq.gov/mepsweb/data_stats/quick_tables_search.jsp?component=2&subcomponent=2)). Specifically, the model assumes a single adult's insurance costs are best estimated from Table X.C.1 Employee contribution distributions (in dollars) for private-sector employees enrolled in single coverage. This survey gives the mean cost for a single adult per state. 

Table X.C.1 was only added to the survey starting in 2006. There is an alternative table that appears in all years (Table II.C.2: Average total employee contribution (in dollars) per enrolled employee for single coverage at private-sector establishments), which is what is downloaded from the previous section.

One problem is that in 2007 this survey was not done. I solved this by linearly impute data from 2006 and 2008, which seems resonable if we can assume that costs tend to go up every year and not go down. This is true for the data I have looked at.
    
Another problem is that some states do not appear in the earlier data due to funding issues (and not being able to get a statistically significant sample). I fix this by using the value in the data for 'states not specified' and fill in the missing states.

Below is code on processing each html file.

Final table shown in [Appendix: Insurance Costs Data Table](#Insurance-Costs-Data-Table)

In [143]:
# Process HTML files with BeautifulSoup
insurance_costs = {}
insurance_costs_path = os.path.join(PROJECT_PATH, "data/insurance")

# Loop thru all the files
for filename in [f for f in os.listdir(insurance_costs_path) if f.endswith('.html')]:
    states = {}
    
    # File is for what year?
    year = int(filename.split('_')[0])
    
    # Open file
    full_filename = os.path.join(insurance_costs_path, filename)
    f = open(full_filename, "r")
    
    # Import into BeautifulSoup
    data = f.readlines()
    soup = BeautifulSoup(''.join(data))

    # Works for years 2008 - 2014
    if year in range(2008, 2015):
        for tr in soup.find_all('tr'):
            # State is located in the TR element
            state = tr.get_text().split("\n")[1].lower().strip()
            
            # Find the data, but if you can't, skip it
            td = tr.find_all('td')
            value = None
            if td: 
                try:
                    value = float(td[0].get_text().strip().replace(",", ""))
                    
                    # Account for inflation and round up
                    value = float(np.round(value * updated_inflation_multipliers[year]))
                except ValueError as e:
                    continue

                # We need to stop processing after the first chunk or if we couldnt get a value
                if state not in states and value:
                    states[state] = value
    # Works for 2001 - 2006
    elif year in range(2001, 2007):
        for tr in soup.find_all('tr'):
            td = tr.find_all('td')

            value = None
            if len(td) > 2: 
                # Same as above, but state is fist TD, not in TR
                state = td[0].get_text().lower().strip()
                try:
                    value = float(td[1].get_text().strip().replace(",", ""))
                    
                    # Account for inflation and round up
                    value = float(np.round(value * updated_inflation_multipliers[year]))
                except ValueError as e:
                    continue

            if state not in states and value:
                states[state] = value
    else:
        pass

    # Add data from file to global dict
    insurance_costs[year] = states

    
# For each state in 2007, linearly impute the data
insurance_costs[2007] = { }
for state in insurance_costs[2014]:
    insurance_costs[2007][state] = (insurance_costs[2006][state] + insurance_costs[2008][state]) / 2.0

def state_filter(state):
    ''' Filter out some entries from the html that we pick up as states'''
    return "district" not in state and 'united' not in state and 'separately' not in state

# Get all states in 2014, assuming thats the largest set of states
full_set_of_states = set([state for state in sorted(insurance_costs[2014].keys()) if state_filter(state)])
for year in range(2001, 2015):   
    # Find current set of states from this year
    current_set_of_states = set([state for state in sorted(insurance_costs[year].keys()) if state_filter(state)])
    
    # Find difference between states we have now and states in 2014
    diff = full_set_of_states.difference(current_set_of_states)
    
    # If there are some states missing, fill in those states with given value from "States not shown separately" in data
    if diff and 'states not shown separately' in insurance_costs[year]:
        # Fill in each state
        for state in list(diff):
            insurance_costs[year][state] = insurance_costs[year]['states not shown separately']

# Create final dataframe results for this  model variable
insurance_costs_df = pd.DataFrame(insurance_costs)

## Health Care Costs

([TOC](#Table-of-Contents))

Calculated from the CEX data from above, essentially done once regional differencing is done

### TODO

* Complete data load once regional differences are figured out

## Other Necessities Cost

([TOC](#Table-of-Contents))

Calculated from the CEX data from above, essentially done once regional differencing is done

>   Expenditures for other necessities are based on 
2013 data by household size  from  the  2014 Bureau  of  Labor  Statistics  Consumer  Expenditure  Survey
including: (1) Apparel  and  services,  (2)  Housekeeping  supplies,  (3)  Personal  care  products  and  services, 
(4)  Reading, and (5) Miscellaneous.  These costs were further adjusted for regional differences using annual  expenditure  shares  reported  by  region



### TODO

* Complete data load once regional differences are figured out

In [48]:
# Update cex dictionary with values for other variable
cex[2013].update( 
    {
        "single_adults": 37884.0,
        "other": {
            "apparel": 226385.0, 
            "housekeeping": 80097.0,
            "personal_care": 81837.0,
            "reading": 13086,
            "misc": 99290,
            
            "apparel_percent": 0.13,
            "housekeeping_percent": 0.164,
            "personal_care_percent": 0.182,
            "reading_percent": 0.205,
            "misc_percent": 0.228,
            
            "apparel_region": [ x / 3.3 for x in (3.3, 3.6, 3.2, 3.3)],
            "housekeeping_region": [ x / 1.2 for x in (1.0, 1.4, 1.2, 1.1 )],
            "personal_care_region": [ x / 1.2 for x in (1.2, 1.3, 1.2, 1.2 )],
            "reading_region": [ x / 1.0 for x in (1,1,1,1)],
            "misc_region": [ x / 1.5 for x in (1.4, 1.5, 1.3, 1.6)],
        }
    }
)

# Values for 'other' from county webpages
ideal_other_2013 = (2096, 2127, 2253, 2284)

for region in range(4):
    val = (1000000 * 
         ( (
            (cex[2013]["other"]["apparel"] * cex[2013]["other"]["apparel_percent"] * cex[2013]["other"]["apparel_region"][region]) + \
            (cex[2013]["other"]["housekeeping"] * cex[2013]["other"]["housekeeping_percent"] * cex[2013]["other"]["housekeeping_region"][region]) + \
            (cex[2013]["other"]["personal_care"] * cex[2013]["other"]["personal_care_percent"] * cex[2013]["other"]["personal_care_region"][region]) + \
            (cex[2013]["other"]["reading"] * cex[2013]["other"]["reading_percent"] * cex[2013]["other"]["reading_region"][region]) + \
            (cex[2013]["other"]["misc"] * cex[2013]["other"]["misc_percent"] * cex[2013]["other"]["misc_region"][region])
        ) /  (float(cex[2013]["single_adults"] * 1000)) ) * inflation_multipliers[2013])

    # Print difference between calc and data from website
    print "Diff for region %d: %f %f" % (region, val, ideal_other_2013[region])

Diff for region 0: 2134.921071 2096.000000
Diff for region 1: 2399.604462 2127.000000
Diff for region 2: 2129.205730 2253.000000
Diff for region 3: 2245.958136 2284.000000


In [42]:
float(cex[2013]["single_adults"] * 1000)

37884000.0

## Taxes Data

([TOC](#Table-of-Contents))

From the model documentation:

> Estimates for payroll taxes, state income tax, and federal income tax rates are included in the calculation of a living wage. Property taxes and sales taxes are already represented in the budget estimates through the cost of rent and other necessities. 

All tax data can be found in the [Appendix: Tax Data Tables](#Tax-Data-Tables).

Lets look at the other tax break downs:

### Payroll Taxes

>A flat payroll tax and state income tax rate is applied to the basic needs budget. Payroll tax is a
nationally representative rate as specified in the Federal Insurance Contributions Act. 
>>The payroll tax rate (Social Security and Medicare taxes) is 6.2% of total wages as of 2014.

I am not sure where the model gets 6.2% from. The data from the [SSA website](https://www.ssa.gov/oact/progdata/taxRates.html) states that 6.2% is the rate for the Social Security part of the FICA tax. This might be a mistake in the original model. I will use 6.2% for any work in confirming how close I am to the real model, but will use the combined rate (which includes Medicare's Hospital Insurance rate) when calculating final numbers for my model.

Another thing to note is that in 2011 and 2012, the rate for the Social Security part of the FICA tax was 2% lower for individuals.

In [7]:
# Data from FICA rates
updated_fica_tax_rate = dict(zip(
        [2000, 2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009, 2010, 2011, 2012, 2013, 2014],
        [0.0765, 0.0765, 0.0765, 0.0765, 0.0765, 0.0765, 0.0765, 0.0765, 0.0765, 0.0765, 0.0765, 0.0565, 0.0565, 0.0765, 0.0765]))

# Create dataframe
updated_fica_tax_rate_df = pd.DataFrame.from_dict({"fica rate": updated_fica_tax_rate}).transpose()

# Data that the model used (see notes above)
fica_tax_rate = {
    2013: 0.062
}

### State Tax Rate

>The state tax rate is taken from the second lowest income tax rate for 2011 for the state as reported by the CCH State Tax Handbook (the lowest bracket was used if the second lowest bracket was for incomes of over 30,000 ) (we assume no deductions). 26 
>>State income tax rates are for the 2011 tax year. These rates were taken from the 2011 CCH Tax Handbook
(various organizations provide the CCH State Tax Handbook rates (including The Tax Foundation)). No updates
were available as of March 30, 2014

Using the excel file provide by [The Tax Foundation](http://taxfoundation.org/tax-topics/state-taxes#article), the second lowest tax bracket's rate is chosen as the rate for the state (except when the bracket is for incomes > 30k, as the original model suggests). 

This only came into play in the later years for Vermont, North Dakota, and RI. To be consistent, I used the lowest tax bracket for all years.

Note that I used the rate under "Single" since the model is only for adults. This is done by hand by importing correct numbers from the spreadsheet, which is imported via CSV below:

In [9]:
updated_state_tax_rate_df = pd.DataFrame.from_csv("data/taxes/formatted_state_taxes.csv")

### Federal Income Tax Rate

> The federal income tax rate is calculated as a percentage of total income based on the average tax paid by median-income four-person families as reported by the Tax Policy Center of the Urban Institute and Brookings Institution for 2013. 27
>>The Tax Policy Center reported that the average federal income tax rate for 2013 was 5.32%. This estimate
includes the effects of (1) the Earned Income Tax Credit (assuming two eligible children), (2) the Child Tax Credit
expansion as part of EGTRRA, and (3) the Making Work Pay Credit enacted in the American Recovery and
Reinvestment Act of 2009.

One issue is that the model authors used ["Historical Federal Income Tax Rates for a Family of Four"](http://www.taxpolicycenter.org/taxfacts/displayafact.cfm?Docid=226). Since I am 
focusing on single adults, I should use ["Historical Average Federal Tax Rates for Nonelderly Childless Households"](http://www.taxpolicycenter.org/taxfacts/displayafact.cfm?DocID=465&Topic2id=20&Topic3id=22). However, that data stops at 2011 for some reason, so for consistency, I will stick with the model definition and use the Family of Four rate.

Also, the model officially used a number that is different than what is on the updated link above. I will use the number used by the model to confirm the methodology (if I can), but use numbers from the updated data.

In [154]:
# Original model used 5.32% for the tax rate in 2013; the document this was taken from has since been updated. 
# So to confirm my methodology, I will use the 5.32% value; however, I will use updated information going forward
updated_federal_income_tax_rate = dict(zip(
        [2000,2001,2002,2003,2004,2005,2006,2007,2008,2009,2010,2011,2012,2013,2014],
        [0.0802, 0.0671, 0.0656, 0.0534, 0.0537, 0.0569, 0.0585, 0.0593, 0.0354, 0.0447, 0.0450, 0.0559, 0.0584, 0.0579, 0.0534]))

# Create dataframe
updated_federal_income_tax_rate_df = \
    pd.DataFrame.from_dict({'federal income tax rate': updated_federal_income_tax_rate}).transpose()

# Value used by model, used for verifying methodology
federal_income_tax_rate = {
    2013: 0.0532
}


-----

-----

-----

## Correlation Data

### Minimum Wage or Mediun Wage per County or State

-----

-----

-----

## Creating Final Merged Data Frame

Take all data loaded in prior into a multi-level index data frame

**TODO:** Do this as we continue with analysis

# Introductory Analysis

Create visualizations on:

* Find national mean living wage gap, plot it over time
* Look at distributions over states of living wage gap over time (facet grid, each graph is a state showing gap over time)
* Seperate counties based on race and find national means of gap per year


# Correlations with Economic Metrics

* motion chart of states, x = gap, y = life exp, debt levels

# Appendix - Data Tables 

([TOC](#Table-of-Contents))

## Housing Costs Data Table

The model definition is specified in [Model Variable: Housing](#Housing-Costs).

In [28]:
fmr_df

Unnamed: 0_level_0,Unnamed: 1_level_0,county,countyname,cousub,fips,fmr0,fmr0_inf,pop,pop100,pop2000,pop2010,region,state,state_alpha
year,fips,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1
2002,101599999,,"Anchorage, AK MSA",,101599999,519,686.147427,,,,,base,,AK
2002,108199999,,ALEUTIANS EAST,,108199999,538,711.266505,,,,,base,,AK
2002,100999999,,ALEUTIANS WEST,,100999999,461,609.468139,,,,,base,,AK
2002,107399999,,BETHEL,,107399999,695,918.829407,,,,,base,,AK
2002,111599999,,BRISTOL BAY,,111599999,558,737.707639,,,,,base,,AK
2002,111799999,,DILLINGHAM,,111799999,671,887.100046,,,,,base,,AK
2002,111399999,,FAIRBANKS-NORTH STAR,,111399999,423,559.229984,,,,,base,,AK
2002,107999999,,HAINES BOROUGH,,107999999,501,662.350407,,,,,base,,AK
2002,110399999,,JUNEAU,,110399999,748,988.898412,,,,,base,,AK
2002,104599999,,KENAI PENINSULA,,104599999,455,601.535799,,,,,base,,AK


## Food Costs Data Table

The model definition is specified in [Model Variable: Food Costs](#Food-Costs).

In [179]:
national_monthly_food_cost_per_year_df

Unnamed: 0,2001,2002,2003,2004,2005,2006,2007,2008,2009,2010,2011,2012,2013,2014
base,187.44,191.34,195.54,205.56,208.98,212.7,224.58,243.9,242.64,264.848422,269.263723,272.085202,268.955169,270.78
east,202.4352,206.6472,211.1832,222.0048,225.6984,229.716,242.5464,263.412,262.0512,286.036295,290.804821,293.852018,290.471582,292.4424
midwest,178.068,181.773,185.763,195.282,198.531,202.065,213.351,231.705,230.508,251.606001,255.800537,258.480942,255.50741,257.241
south,174.3192,177.9462,181.8522,191.1708,194.3514,197.811,208.8594,226.827,225.6552,246.309032,250.415262,253.039238,250.128307,251.8254
west,208.0584,212.3874,217.0494,228.1716,231.9678,236.097,249.2838,270.729,269.3304,293.981748,298.882732,302.014574,298.540237,300.5658


## Insurance Costs Data Table

The model definition is specified in [Model Variable: Health Insurance Costs](#Health-Insurance-Costs).

In [144]:
insurance_costs_df

Unnamed: 0,2001,2002,2003,2004,2005,2006,2007,2008,2009,2010,2011,2012,2013,2014
alabama,622,620.0,636.0,726.0,838.0,891.0,925.0,959.0,1025.0,1193.0,1195.0,1279.0,1410.0,1362.0
alaska,449,533.0,433.0,535.0,895.0,714.0,764.0,814.0,842.0,909.0,1146.0,1208.0,1102.0,1286.0
arizona,503,547.0,560.0,662.0,752.0,803.0,807.0,811.0,851.0,974.0,1209.0,1200.0,1102.0,1096.0
arkansas,496,533.0,644.0,616.0,796.0,699.0,740.0,781.0,750.0,967.0,1028.0,1024.0,978.0,958.0
california,369,446.0,475.0,554.0,592.0,658.0,699.5,741.0,795.0,1145.0,1032.0,1035.0,1116.0,1129.0
colorado,499,590.0,581.0,677.0,741.0,717.0,857.5,998.0,971.0,965.0,1122.0,1148.0,1188.0,1244.0
connecticut,629,620.0,789.0,773.0,749.0,862.0,927.0,992.0,1082.0,1348.0,1273.0,1368.0,1536.0,1305.0
delaware,559,495.0,711.0,694.0,905.0,735.0,810.0,885.0,1101.0,1289.0,1183.0,1373.0,1459.0,1237.0
district of columbia,507,,710.0,634.0,765.0,699.0,845.0,991.0,906.0,1180.0,1235.0,1133.0,1198.0,1197.0
florida,584,569.0,750.0,723.0,892.0,860.0,962.5,1065.0,969.0,1172.0,1202.0,1213.0,1440.0,1394.0


## Tax Data Tables

The model definition is specified in [Model Variable: Taxes](#Taxes-Data).

In [164]:
updated_fica_tax_rate_df

Unnamed: 0,2000,2001,2002,2003,2004,2005,2006,2007,2008,2009,2010,2011,2012,2013,2014
fica rate,0.0765,0.0765,0.0765,0.0765,0.0765,0.0765,0.0765,0.0765,0.0765,0.0765,0.0765,0.0565,0.0565,0.0765,0.0765


In [162]:
updated_federal_income_tax_rate_df

Unnamed: 0,2000,2001,2002,2003,2004,2005,2006,2007,2008,2009,2010,2011,2012,2013,2014
federal income tax rate,0.0802,0.0671,0.0656,0.0534,0.0537,0.0569,0.0585,0.0593,0.0354,0.0447,0.045,0.0559,0.0584,0.0579,0.0534


In [10]:
updated_state_tax_rate_df

Unnamed: 0,2001,2002,2003,2004,2005,2006,2007,2008,2009,2010,2011,2012,2013,2014
AL,2.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0
AK,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
AZ,2.87,3.2,3.2,3.2,3.2,3.2,3.04,2.88,2.88,2.88,2.88,2.88,2.88,2.88
AR,1.0,2.5,2.5,2.5,2.5,2.5,2.5,2.5,2.5,2.5,2.5,2.5,2.5,2.5
CA,1.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.25,2.25,2.0,2.0,2.0,2.0
CO,4.63,4.63,4.63,4.63,4.63,4.63,4.63,4.63,4.63,4.63,4.63,4.63,4.63,4.63
CT,3.0,4.5,4.5,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0
DE,0.0,2.2,2.2,2.2,2.2,2.2,2.2,2.2,2.2,2.2,2.2,2.2,2.2,2.2
FL,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
GA,1.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0


## Appendix - Things to Revisit 

([TOC](#Table-of-Contents))

* Go back and redo CEX data to do proper regional weighting
* Make sure all inflation adjustments are done and done correctly