# Final exam DSE 200

Your task in this take-home exam is to analyze the evolution of poverty and income distribution in the United States.

### Data source: 

https://www.census.gov/did/www/saipe/data/statecounty/data/index.html

Use the code snippet provided below to access the data files.

The data fields are described here: https://www.census.gov/did/www/saipe/data/statecounty/data/2000.html
Extract the required fields. The required fields are:
   'State FIPS', 'County FIPS','Poverty Estimate All Ages', 'Poverty Percent All Ages', 
   'Poverty Estimate Under Age 18', 'Poverty Percent Under Age 18', 'Poverty Estimate Ages 5-17', 
   'Poverty Percent Ages 5-17', 'Median Household Income','Name','Postal'
 
### Pandas Data-Frames

1)
    - create a data frame with just the country-wide data
    - create a data frame with just the data of the states
    - create a data frame with just the data of the counties

    
2) Plot a graph of 'Poverty Percent All Ages' of the entire country vs year (line graph).
   Plot a graph of 'Median Household Income' of the entire country vs year (line graph)

3) Plot the total poverty in each state across the years and compare it with the country-wide poverty

4) Plot county-wide poverty stats

- Create a dataframe with the unique FIPS code (obtained by combining the state and county FIPS codes), 'Poverty Percent All Ages' in every county in 2000, 'Poverty Percent All Ages' in every county in 2013, and the change ratio between 2000 and 2013. (Change ratio = poverty % in 2013 / poverty % in 2000. Divide this by the nationwide change ratio for normalization. A value > 1 indicates that poverty in the county grew faster than it did nationwide, and a value < 1 indicates that it grew more slowly than nationwide or declined.)


FIPS code is a unique code to identify counties and states in the US. In this data you have been given state code and county code separately. You will have to combine the state code and the county code to generate a unique code for each place, which will be used to plot on the map. Please note that before combining, make sure the state code is 2 digits and county code is 3 digits by adding zeroes to the front.
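
As a minimal sketch of the padding step described above, `str.zfill` left-pads each code with zeros before the two parts are concatenated. The helper name `make_fips` is hypothetical and not part of the exam code.

```python
def make_fips(state_code, county_code):
    """Combine a state and county code into a 5-character FIPS string.

    `make_fips` is an illustrative helper; it accepts ints or strings.
    """
    state = str(state_code).strip().zfill(2)    # e.g. 6 -> '06'
    county = str(county_code).strip().zfill(3)  # e.g. 37 -> '037'
    return state + county


print(make_fips(6, 37))      # 06037
print(make_fips('13', '1'))  # 13001
```

Keep the combined code as a string while cleaning the data: converting it to an integer drops the leading zero of single-digit state codes.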


- Plot the 'Poverty Percent All Ages' for each county on the map for the years 2000 and 2013.
- Plot the normalized change ratio on the map.
- Find the counties which witnessed the most positive and most negative change in poverty.

For this visualization, we will use a library called Vincent, which lets you plot data for places using their FIPS code instead of using latitude/longitude. 

To install vincent, run
`pip install vincent` or `conda install vincent`


To use it and display the maps inside the notebook, run

`import vincent`

`vincent.core.initialize_notebook()`

You can find further details about how to use it here - http://wrobstory.github.io/2013/10/mapping-data-python.html and https://github.com/wrobstory/vincent

Before closing your notebook, please clear the output of the vincent maps, as it otherwise becomes difficult to reload the notebook later. For plotting the counties on the map, you will need to use the file us_counties.topo.json present in the exam folder.

Tips:

- Check the type of each data field before operating on it. This will also help you debug errors.
- Clean the data before using it: drop rows with missing values.
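
A small sketch of both tips, run on made-up rows rather than the real SAIPE files. Values read from a text file arrive as strings, and this sketch assumes a lone `.` marks a missing value, which is the convention the cleaning cells of this notebook rely on.

```python
import numpy as np
import pandas as pd

# Made-up rows for illustration only; not real census figures.
raw = pd.DataFrame({
    'Name': ['County A', 'County B', 'County C'],
    'Poverty Percent All Ages': ['12.5', '.', '9.1'],
})

# Tip 1: inspect the types. Numbers read from text are still strings here.
print(raw.dtypes)

# Tip 2: mark the placeholder as missing, drop those rows, then convert.
clean = raw.replace('.', np.nan).dropna().copy()
clean['Poverty Percent All Ages'] = pd.to_numeric(clean['Poverty Percent All Ages'])

print(clean)
```

Converting only after the placeholder rows are gone means `pd.to_numeric` can raise on anything unexpected instead of silently leaving text in a numeric column.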

In [None]:
import urllib2
import pandas as pd

urls = ['https://www.census.gov/did/www/saipe/downloads/estmod93/est93ALL.dat',
        'https://www.census.gov/did/www/saipe/downloads/estmod95/est95ALL.dat',
        'https://www.census.gov/did/www/saipe/downloads/estmod97/est97ALL.dat',
        'https://www.census.gov/did/www/saipe/downloads/estmod98/est98ALL.dat',
        'https://www.census.gov/did/www/saipe/downloads/estmod99/est99ALL.dat',
        'https://www.census.gov/did/www/saipe/downloads/estmod00/est00ALL.dat',
        'https://www.census.gov/did/www/saipe/downloads/estmod01/est01ALL.dat',
        'https://www.census.gov/did/www/saipe/downloads/estmod02/est02ALL.dat',
        'https://www.census.gov/did/www/saipe/downloads/estmod03/est03ALL.dat',
        'https://www.census.gov/did/www/saipe/downloads/estmod04/est04ALL.txt',
        'https://www.census.gov/did/www/saipe/downloads/estmod05/est05ALL.txt',
        'https://www.census.gov/did/www/saipe/downloads/estmod06/est06ALL.txt',
        'https://www.census.gov/did/www/saipe/downloads/estmod07/est07ALL.txt',
        'https://www.census.gov/did/www/saipe/downloads/estmod08/est08ALL.txt',
        'https://www.census.gov/did/www/saipe/downloads/estmod09/est09ALL.txt',
        'https://www.census.gov/did/www/saipe/downloads/estmod10/est10ALL.txt',
        'https://www.census.gov/did/www/saipe/downloads/estmod11/est11all.txt',
        'https://www.census.gov/did/www/saipe/downloads/estmod12/est12ALL.txt',
        'https://www.census.gov/did/www/saipe/downloads/estmod13/est13ALL.txt']

cols = [
    'State FIPS', 'County FIPS', 'FIPS', 'Poverty Estimate All Ages', 
    'Poverty Percent All Ages', 'Poverty Estimate Under Age 18', 
    'Poverty Percent Under Age 18', 'Poverty Estimate Ages 5-17', 
    'Poverty Percent Ages 5-17', 'Median Household Income', 'Name',
    'Postal', 'Year'
]

data_dict = {col: [] for col in cols}

def getUrl(urls):
    
    for url in urls:
        print 'processing', url
        response = urllib2.urlopen(url)
        lines = response.read().split('\n')
        del lines[-1]
        
        # extract year
        year = int(url[-9:-7])
        if year > 16:
            year += 1900
        else:
            year += 2000
        
        # extract data
        for line in lines:
            data_dict['State FIPS'].append(line[:2].strip().rjust(2,'0'))
            data_dict['County FIPS'].append(line[3:6].strip().rjust(3,'0'))
            data_dict['FIPS'].append(' ')
            data_dict['Poverty Estimate All Ages'].append(line[7:15].strip())
            data_dict['Poverty Percent All Ages'].append(line[34:38].strip())
            data_dict['Poverty Estimate Under Age 18'].append(line[49:57].strip())
            data_dict['Poverty Percent Under Age 18'].append(line[76:80].strip())
            data_dict['Poverty Estimate Ages 5-17'].append(line[91:99].strip())
            data_dict['Poverty Percent Ages 5-17'].append(line[118:122].strip())
            data_dict['Median Household Income'].append(line[133:139].strip())
            data_dict['Name'].append(line[193:238].strip())
            data_dict['Postal'].append(line[239:241].strip())
            data_dict['Year'].append(year)
            
    return None
        
getUrl(urls)

In [None]:
# make sure column lengths are the same
for key in data_dict.keys():
    print key,'column length is',len(data_dict[key])

In [None]:
# create dataframe
data_df = pd.DataFrame(data_dict)

# reorder columns
data_df = data_df[cols]

In [None]:
import numpy as np

# replace '.' values with NaN (np.nan is the spelling that works across NumPy versions)
data_df = data_df.replace('.', np.nan)

In [None]:
# check for null values
print 'Checking Null values:'    

for col in cols:
    print col, 'null values:', data_df[col].isnull().sum()

In [None]:
# drop null values
print 'dataframe shape before dropping NaN values:', data_df.shape
data_df = data_df.dropna()
print 'dataframe shape after dropping NaN values:', data_df.shape

In [None]:
# convert data to numeric
data_df = data_df.apply(lambda x: pd.to_numeric(x, errors='ignore'))

# convert FIPS columns back to strings of appropriate format
data_df['State FIPS'] = data_df['State FIPS'].apply(lambda x: str(x).rjust(2,'0'))
data_df['County FIPS'] = data_df['County FIPS'].apply(lambda x: str(x).rjust(3,'0'))

# concatenate State FIPS and County FIPS to create FIPS column data
data_df['FIPS'] = data_df.apply(lambda x: x['State FIPS'] + x['County FIPS'], axis = 1)

# check data types
print 'Checking data types:'
for col in cols:
    print col, 'is', type(data_df[col].values[0])

In [None]:
# check dataframe
data_df.head()

### Q1: Create the dataframes

Download and parse the data files and create the following three pandas dataframes: (your dataframes should have data in the format shown below)

 * US_stat: statistics for the whole United States.
 * states_stat: Statistics for each state.
 * county_stat: Statistics for each county.

In [None]:
# filter full dataframe by Name
US_stat = data_df[data_df['Name'] == 'United States']

# set index to Year
US_stat = US_stat.set_index('Year')

# check dataframe
print 'US_stat dataframe shape is', US_stat.shape
print 'same length as number of years? ', len(US_stat) == len(urls)
US_stat.head()

In [None]:
# reset index to Postal and Year
states_stat = data_df.set_index(['Postal', 'Year'])

# remove county level data by selecting only FIPS code '000'
states_stat = states_stat[states_stat['County FIPS'] == '000']

# remove rows where 'Name' is 'United States'
states_stat = states_stat[states_stat['Name'] != 'United States']

# sort by index
states_stat = states_stat.sort_index()

# check dataframe
print 'states_stat dataframe shape is', states_stat.shape
print 'same length as number of states times number of years?', \
    len(states_stat) == len(urls) * len(states_stat.Name.unique())
states_stat.head()

In [None]:
# set index to Postal, FIPS, and Year
county_stat = data_df.set_index(['Postal', 'FIPS', 'Year'])

# remove non-county level data by keeping only rows whose County FIPS is not '000'
county_stat = county_stat[county_stat['County FIPS'] != '000']

# remove rows where 'Name' is 'United States'
county_stat = county_stat[county_stat['Name'] != 'United States']

# sort by index
county_stat = county_stat.sort_index()

# check dataframe
print 'county_stat dataframe shape is', county_stat.shape
#print 'same length as number of states times number of years?', \
#    len(county_stat) == len(urls) * len(county_stat.Name.unique())
county_stat.head()

In [None]:
import matplotlib.pyplot as plt
%pylab inline

### Q2. Plot the US-wide statistics on poverty.

Plot the 'Poverty Percent All Ages' and 'Median Household Income' across entire US over the years. 

Compute the change ratio of the poverty percentage in the US between 2000 and 2013: [poverty % in 2013] / [poverty % in 2000].

In [None]:
# calculate median income change ratio
income_2000 = US_stat.loc[2000]['Median Household Income']
income_2013 = US_stat.loc[2013]['Median Household Income']
income_increase = float(income_2013) / income_2000
print 'nationwide median income in 2000 =', income_2000
print 'nationwide median income in 2013 =', income_2013
print 'nationwide income change ratio from 2000 to 2013 =', income_increase

# calculate poverty percent change ratio
poverty_2000 = US_stat.loc[2000]['Poverty Percent All Ages']
poverty_2013 = US_stat.loc[2013]['Poverty Percent All Ages']
poverty_increase = poverty_2013 / poverty_2000
print '\nnationwide poverty percent in 2000 =', poverty_2000
print 'nationwide poverty percent in 2013 =', poverty_2013
print 'nationwide poverty percent change ratio from 2000 to 2013 =', poverty_increase

# create figure and axes
fig, ax = plt.subplots(1,2, figsize=(10, 3))

# create first subplot
ax[0].plot(US_stat['Median Household Income'].index.values, US_stat['Median Household Income'].values, 'b-')
ax[0].grid()
ax[0].set_title('Median Household Income ($)')
#ax[0].set_ylabel('$')

ax[1].plot(US_stat['Poverty Percent All Ages'].index.values, US_stat['Poverty Percent All Ages'].values, 'b-')
ax[1].grid()
ax[1].set_title('Poverty Percent All Ages (%)')
#ax[1].set_ylabel('%')

plt.show()

### 2000 was a good year

We see from these graphs that even though the median household income in the US keeps increasing at a more or less constant rate, the poverty level reached a minimum in 2000 and has since increased dramatically, by about 40% (a change ratio of 1.39 between 2000 and 2013).

We will now attempt to identify the geographic distribution of the rise in poverty since 2000.

We start by plotting the time evolution of poverty for each of the states.

### Q3: Plot the change in poverty percentages by state.

For each state, plot the poverty level across time and compare it with the nation-wide poverty level. Produce a graph similar to the ones below.

In [None]:
# get nationwide poverty percentage values
US_poverty = US_stat['Poverty Percent All Ages'].values

# get years for data
years = US_stat['Poverty Percent All Ages'].index.values

# get list of states
states = states_stat.index.get_level_values('Postal').unique()

# create figure and axes for plots
fig, ax = plt.subplots(11,5, figsize=(10, 20), sharex=True, sharey=True)

# create plots
state_index = 0
for i in range(11):
    for j in range(5):
        if state_index >= len(states):
            break
        else:
            state = states[state_index]
            state_index += 1
            state_poverty = states_stat.loc[state]['Poverty Percent All Ages'].values
            ax[i, j].plot(years, US_poverty, 'b-', label = 'US Poverty Percentage')
            ax[i, j].plot(years, state_poverty, 'r-', label = 'State Poverty Percentage')
            ax[i, j].set_title(state)
            ax[i, j].grid()           

# rotate x-axis ticklabels
for j in range(5):
    for label in ax[10, j].get_xticklabels():
        label.set_rotation(45)

# create main figure title
fig.suptitle('Percentage Poverty by State from 1993 to 2013', 
             fontsize = 14, fontweight = 'bold', y = 0.93)

# adjust vertical spacing between subplots
fig.subplots_adjust(hspace = 0.35)

# add legend to figure
#handles, labels = ax[0,0].get_legend_handles_labels()
#plt.figlegend(handles, labels, 'upper center')

plt.show()

### Q4: plot poverty statistics by county

Using the vincent library and the dataframe `county_stat`, generate the following three maps.

1. Overall percentage of poverty for each county in 2000.
![poverty2000](Poverty2000.jpg)
1. Overall percentage of poverty for each county in 2013.
![poverty2013](Poverty2013.jpg)
1. Change ratio of the poverty percentage from 2000 to 2013 for each county, divided by the nation-wide change ratio (1.39).
![povertyChange](PovertyChange.jpg)
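
The quantity shown in the third map can be sketched as follows. The FIPS codes and county figures below are invented for illustration; only the nation-wide ratio of 1.39 comes from the text above.

```python
import pandas as pd

NATIONAL_RATIO = 1.39  # nation-wide change ratio quoted in the text above

# Invented county values and made-up FIPS codes, for illustration only.
pov_2000 = pd.DataFrame({'FIPS': ['99001', '99002', '99003'],
                         'Poverty Percent 2000': [10.0, 20.0, 8.0]})
pov_2013 = pd.DataFrame({'FIPS': ['99001', '99002', '99003'],
                         'Poverty Percent 2013': [13.9, 22.0, 16.0]})

# An inner merge keeps only counties present in both years.
both = pd.merge(pov_2000, pov_2013, on='FIPS', how='inner')

# Raw change ratio, then the ratio normalized by the national change.
both['Change Ratio'] = both['Poverty Percent 2013'] / both['Poverty Percent 2000']
both['Normalized Change'] = both['Change Ratio'] / NATIONAL_RATIO

print(both)
```

A normalized value of 1 means the county's poverty percentage grew exactly as fast as the national one; values above 1 mean faster growth, and values below 1 mean slower growth or a decline.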

In [None]:
import vincent
vincent.core.initialize_notebook()

In [None]:
# function to get data for any year
def get_data_for_year(input_df, year):
    'Function returns data for specific year and converts FIPS column to integer'
    return_df = input_df.reset_index()
    # copy the filtered rows so the FIPS conversion does not write to a view
    return_df = return_df[return_df['Year'] == year].copy()
    return_df['FIPS'] = return_df['FIPS'].astype(int)
    return return_df

In [None]:
# get county data for year 2000
county_2000 = get_data_for_year(county_stat, 2000)
county_2000.head()

In [None]:
# plot poverty percent by county for year 2000
county_topo = 'us_counties.topo.json'

geo_data = [{'name': 'counties',
             'url': county_topo,
             'feature': 'us_counties.geo'}]

vis_pov_2000 = vincent.Map(data=county_2000, geo_data=geo_data, scale=850, projection='albersUsa',
                  data_bind='Poverty Percent All Ages', data_key='FIPS', 
                  map_key={'counties': 'properties.FIPS'})

vis_pov_2000.marks[0].properties.enter.stroke_opacity = vincent.ValueRef(value=0.35)

vis_pov_2000.legend(title='Poverty 2000 (%)')

# change color scales
vis_pov_2000.scales['color'].type = 'threshold'
vis_pov_2000.scales['color'].domain = [0, 4, 6, 8, 10, 12, 20, 30]

vis_pov_2000.display()

In [None]:
# get county data for year 2013
county_2013 = get_data_for_year(county_stat, 2013)
county_2013.head()

In [None]:
# plot poverty data by county for year 2013
county_topo = 'us_counties.topo.json'

geo_data = [{'name': 'counties',
             'url': county_topo,
             'feature': 'us_counties.geo'}]

vis_pov_2013 = vincent.Map(data=county_2013, geo_data=geo_data, scale=850, projection='albersUsa',
                  data_bind='Poverty Percent All Ages', data_key='FIPS', 
                  map_key={'counties': 'properties.FIPS'})

vis_pov_2013.marks[0].properties.enter.stroke_opacity = vincent.ValueRef(value=0.35)

vis_pov_2013.legend(title='Poverty 2013 (%)')

# change color scales
vis_pov_2013.scales['color'].type = 'threshold'
vis_pov_2013.scales['color'].domain = [0, 4, 6, 8, 10, 12, 20, 30]

vis_pov_2013.display()

In [None]:
# create dataframes for year 2000 and 2013 with only select columns
# (named sub_cols so the full column list in `cols` is not overwritten)
sub_cols = ['FIPS', 'Name', 'Postal', 'Poverty Percent All Ages', 'Median Household Income']
county_2000_sub = county_2000[sub_cols].copy()
county_2013_sub = county_2013[sub_cols].copy()

# rename columns
county_2000_sub.columns = ['FIPS', 'Name', 'Postal', 'Poverty Percent 2000', 'Median Income 2000']
county_2013_sub.columns = ['FIPS', 'Name', 'Postal', 'Poverty Percent 2013', 'Median Income 2013']

# merge dataframes
comparison_df = pd.merge(county_2000_sub, county_2013_sub, on=['FIPS', 'Name', 'Postal'], how='inner')

# add percent change ratio column
comparison_df['Poverty Change'] = comparison_df['Poverty Percent 2013'] / comparison_df['Poverty Percent 2000']
comparison_df['Poverty Change'] = comparison_df['Poverty Change'].apply(lambda x: x / poverty_increase)

comparison_df['Income Change'] = comparison_df['Median Income 2013'] / comparison_df['Median Income 2000']
comparison_df['Income Change'] = comparison_df['Income Change'].apply(lambda x: x / income_increase)

comparison_df.head()

In [None]:
# plot change in poverty by county from year 2000 to year 2013
county_topo = 'us_counties.topo.json'

geo_data = [{'name': 'counties',
             'url': county_topo,
             'feature': 'us_counties.geo'}]

vis_pov_comp = vincent.Map(data=comparison_df, geo_data=geo_data, scale=850, projection='albersUsa',
                  data_bind='Poverty Change', data_key='FIPS', 
                  map_key={'counties': 'properties.FIPS'})

vis_pov_comp.marks[0].properties.enter.stroke_opacity = vincent.ValueRef(value=0.35)

vis_pov_comp.legend(title='Poverty Change 2000-2013')

# change color scales
vis_pov_comp.scales['color'].type = 'threshold'
vis_pov_comp.scales['color'].domain = [0, 0.8, 0.9, 1, 1.1, 1.2, 1.3]

# change to divergent color scheme
#vis_pov_comp.scales['color'].range = ["#abdda4","#f46d43"]
#vis_pov_comp.colors(brew='YlGnBu')
#vis_pov_comp.colors(brew='Spectral')

vis_pov_comp.display()

### Q5: Identify the extremes.
Find out which are the counties in which the poverty percentage increased or decreased the most during the period 2000 - 2013.

In [None]:
# find counties with largest increase in poverty percent
largest_increase = comparison_df.sort_values('Poverty Change', ascending=False)
largest_increase.head(10)

Of the top 10 counties with the largest poverty change, four are in Michigan.  

Below we find the top 500 counties with the largest increase in poverty, and create a bar chart of the states with the most counties in the top 500.  The five states with the most counties in the top 500 are Indiana, Michigan, Georgia, Ohio, and Wisconsin.

In [None]:
n = 500
top_states = largest_increase[:n]['Postal'].value_counts()[:10]
x_pos = np.arange(len(top_states))
plt.bar(x_pos, top_states.values, alpha = 0.5)
plt.xticks(x_pos + 0.5, top_states.index, rotation = 45, size = 'small')
plt.ylabel('Number of Counties')
plt.title('States with Most Counties in Top '+str(n)+' Poverty Change')
plt.gca().yaxis.grid()  # gca() reuses the current axes instead of creating new ones
plt.show();

Below is a bar chart of state counts for all counties in which the poverty change was simply above the national average (rather than in the more extreme group consisting of the top 500).

In [None]:
all_increase = comparison_df[comparison_df['Poverty Change'] > 1]
top_states = all_increase['Postal'].value_counts()[:10]
x_pos = np.arange(len(top_states))
plt.bar(x_pos, top_states.values, alpha = 0.5)
plt.xticks(x_pos + 0.5, top_states.index, rotation = 45, size = 'small')
plt.ylabel('Number of Counties')
plt.title('States with Most Counties with Poverty Change > 1')      
plt.gca().yaxis.grid()  # gca() reuses the current axes instead of creating new ones
plt.show();

The top states by number of counties with poverty change above the national average are GA, IN, OH, MI, and NC.  Four of these five states are also found in the top five states by number of counties with poverty change among the top 500 in the nation.  So states that tend to have the most counties with the highest increases in poverty also have lots of counties with merely above average increases in poverty.
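
The overlap claimed above can be checked directly from the two top-five lists stated in this notebook, with postal codes substituted for the state names given earlier.

```python
# Top five states by counties in the top 500 for poverty change
# (Indiana, Michigan, Georgia, Ohio, Wisconsin, as listed above).
top_500_states = {'IN', 'MI', 'GA', 'OH', 'WI'}

# Top five states by counties with poverty change above the national average.
above_average_states = {'GA', 'IN', 'OH', 'MI', 'NC'}

shared = sorted(top_500_states & above_average_states)
only_above_average = sorted(above_average_states - top_500_states)

print(shared)              # ['GA', 'IN', 'MI', 'OH']
print(only_above_average)  # ['NC']
```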

Below, we find the counties with the lowest poverty change.

In [None]:
# find counties with smallest increase (or largest decrease) in poverty percent
smallest_increase = comparison_df.sort_values('Poverty Change', ascending=True)
smallest_increase.head(10)

Of the top 10 counties with the smallest poverty change, four are in North Dakota, and three are in Texas.  

Below we create a list of the top 500 counties with the smallest poverty change, and a bar chart of the states with the most counties in the list.  We see that by far the state with the most counties in the list is Texas, followed by North Dakota, Oklahoma, South Dakota, and Nebraska. 

In [None]:
n = 500
bottom_states = smallest_increase[:n]['Postal'].value_counts()[:10]

x_pos = np.arange(len(bottom_states))
plt.bar(x_pos, bottom_states.values, alpha = 0.5)
plt.xticks(x_pos + 0.5, bottom_states.index, rotation = 45, size = 'small')
plt.ylabel('Number of Counties')
plt.title('States with Most Counties in Bottom '+str(n)+' Poverty Change')
plt.gca().yaxis.grid()  # gca() reuses the current axes instead of creating new ones
plt.show();

Below is a graph of state counts for all counties in which the poverty change was simply below the national average (rather than in the more extreme group consisting of the bottom 500).

In [None]:
all_decrease = comparison_df[comparison_df['Poverty Change'] < 1]
bottom_states = all_decrease['Postal'].value_counts()[:10]
x_pos = np.arange(len(bottom_states))
plt.bar(x_pos, bottom_states.values, alpha = 0.5)
plt.xticks(x_pos + 0.5, bottom_states.index, rotation = 45, size = 'small')
plt.ylabel('Number of Counties')
plt.title('States with the Most Counties with Poverty Change < 1')      
plt.gca().yaxis.grid()  # gca() reuses the current axes instead of creating new ones
plt.show();

We find that the top state is TX, again by a large margin.  The remainder of the top five states are VA, NE, KY, and OK.  Of those, VA and KY were not among the top 10 states when limited to the bottom 500 counties.  Apparently VA and KY had many counties with poverty increases below the national average, but few that were far enough below it to appear in the bottom 500.

### Further Analysis - Georgia
Interestingly, GA, one of the states with the most counties in the top 500 for poverty growth, is also seventh in the list of states by number of counties with below-average poverty growth.  This could simply indicate that GA has a large number of counties.  However, GA did not appear among the top 10 states by number of counties in the bottom 500 for poverty growth.  This suggests an asymmetry: GA counties with above-average poverty growth were often among the most extreme nationwide (the top 500), while GA counties with below-average growth rarely fell in the bottom 500.  In other words, the distribution may be skewed, with many GA counties experiencing very high poverty growth and only a few experiencing very low poverty growth.

We can determine how many counties in GA fell into each of the categories.

In [None]:
print 'GA counties with above average increase in poverty:',all_increase['Postal'].value_counts()['GA']
print 'GA counties in top 500 increase in poverty:',largest_increase[:500]['Postal'].value_counts()['GA']

print 'GA counties with below average increase in poverty:',all_decrease['Postal'].value_counts()['GA']
print 'GA counties in bottom 500 increase in poverty:',smallest_increase[:500]['Postal'].value_counts()['GA']

Of the 88 counties in GA that experienced an above average increase in poverty between 2000 and 2013, 48 of those were in the top 500 counties in terms of poverty increase, but of the 71 counties in GA that experienced a below average increase in poverty, only 1 was in the bottom 500 counties in terms of poverty increase.

This confirms that the counties in GA that experienced above average increases in poverty were likely to be counties that had some of the highest increases in poverty nationwide, while counties in GA that experienced below average increases in poverty were not among the lowest in the country.
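
Expressed as shares, using only the four counts reported above, the asymmetry looks like this:

```python
# Counts for Georgia reported in the text above.
above_avg, above_avg_in_top_500 = 88, 48
below_avg, below_avg_in_bottom_500 = 71, 1

# Share of each group that also falls in the nationwide extreme.
share_top = 100.0 * above_avg_in_top_500 / above_avg
share_bottom = 100.0 * below_avg_in_bottom_500 / below_avg

print('above-average counties in the top 500: %.1f%%' % share_top)        # 54.5%
print('below-average counties in the bottom 500: %.1f%%' % share_bottom)  # 1.4%
```

More than half of the above-average counties are in the nationwide extreme, against roughly one in seventy of the below-average counties.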

To investigate this, below is a scatter plot of GA Poverty Percent in 2000 vs. 2013, with a line of slope equal to the national average change in poverty of 1.39 added for reference.  We can see that most of the data that falls below the line (i.e. below average poverty change) is still fairly close to the line, while the data that falls above the line (i.e. above average poverty change) is more likely to fall further from the line.

In [None]:
GA_df = comparison_df[comparison_df['Postal'] ==  'GA']
GA_pov_2000 = GA_df['Poverty Percent 2000'].values
GA_pov_2013 = GA_df['Poverty Percent 2013'].values
plt.scatter(GA_pov_2000, GA_pov_2013, alpha = 0.4)
plt.xlabel('GA Poverty Percent 2000')
plt.ylabel('GA Poverty Percent 2013')
plt.title('GA Poverty Percent in 2000 vs. 2013')
plt.plot([5,30],[5*poverty_increase,30*poverty_increase],'g--',linewidth = 2, label = 'Line of Slope 1.39')
plt.legend()
plt.show();

Additionally, below are histograms of poverty change by county for GA only, and for the nation as a whole.  We can clearly see that the distribution for the state of GA is right skewed.  While lots of counties fall both above and below 1 (which corresponds to the national average poverty change), very few counties fall below 0.78 (which corresponds to the poverty change cutoff for the bottom 500 counties), but many counties fall above 1.13 (the poverty change cutoff for the top 500 counties).

In [None]:
cutoff_top = largest_increase[:500]['Poverty Change'].min()
cutoff_bottom = smallest_increase[:500]['Poverty Change'].max()
#print len(comparison_df[comparison_df['Poverty Change']>=cutoff_top])
#print len(comparison_df[comparison_df['Poverty Change']<=cutoff_bottom])
print 'poverty change cutoff for top 500 counties =', cutoff_top
print 'poverty change cutoff for bottom 500 counties = ',cutoff_bottom

In [None]:
fig, ax = plt.subplots(2,1,figsize=(5,5), sharex=True)
ax[0].hist(GA_df['Poverty Change'].values, bins = 20, alpha = 0.5)
ax[0].xaxis.grid()
ax[0].set_title('Poverty Change - GA Only')
ax[1].hist(comparison_df['Poverty Change'].values, bins = 30, alpha = 0.5)
ax[1].xaxis.grid()
ax[1].set_title('Poverty Change - Nationwide')
plt.show()

### Further Analysis - Median Income and Poverty Change
Some functions have been created to simplify the analysis process.  Below is a function to determine the percentage makeup of total poverty estimates by age group.

In [None]:
# function to calculate percentage makeup of total poverty estimates by age group
def add_pct_by_age(input_df):
    'Function to create additional dataframe columns indicating percentage makeup of total poverty by age group'
    return_df = input_df.reset_index()
    
    # add column for child and adult poverty estimates
    return_df['Poverty Estimate Under 5'] = return_df['Poverty Estimate Under Age 18'] - return_df['Poverty Estimate Ages 5-17']
    return_df['Poverty Estimate Over 18'] = return_df['Poverty Estimate All Ages'] - return_df['Poverty Estimate Under Age 18']
    
    # add columns for percent of each group contribution to total poverty estimates
    return_df['Pct of Total Under 5'] = return_df['Poverty Estimate Under 5'] / return_df['Poverty Estimate All Ages']
    return_df['Pct of Total Over 18'] = return_df['Poverty Estimate Over 18'] / return_df['Poverty Estimate All Ages']
    return_df['Pct of Total 5-17'] = return_df['Poverty Estimate Ages 5-17'] / return_df['Poverty Estimate All Ages']
    
    # reformat new columns
    return_df['Pct of Total Under 5'] = return_df['Pct of Total Under 5'].apply(lambda x: round(100 * x, 1))
    return_df['Pct of Total Over 18'] = return_df['Pct of Total Over 18'].apply(lambda x: round(100 * x, 1))
    return_df['Pct of Total 5-17'] = return_df['Pct of Total 5-17'].apply(lambda x: round(100 * x, 1))
    
    # remove extraneous index column
    return_df = return_df.drop('index', axis=1)
    return return_df
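
To make the arithmetic in `add_pct_by_age` concrete, here is the same calculation on a single hypothetical row in plain Python. The three estimates are invented round numbers, not census data.

```python
# Hypothetical poverty estimates for one area (invented for illustration).
all_ages = 1000
under_18 = 350
ages_5_17 = 250

# Derived groups, mirroring the subtractions in add_pct_by_age.
under_5 = under_18 - ages_5_17          # 100
age_18_and_over = all_ages - under_18   # 650

# Share of total poverty contributed by each group, in percent.
shares = {
    'Under 5': round(100.0 * under_5 / all_ages, 1),
    '5-17': round(100.0 * ages_5_17 / all_ages, 1),
    '18 and over': round(100.0 * age_18_and_over / all_ages, 1),
}
print(shares)
```

The three shares should sum to 100 up to rounding, which makes a quick sanity check on the function's output. Note that the column the function names 'Over 18' covers everyone aged 18 and older, since it is computed as all ages minus under 18.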

Below is a second function, which creates a county comparison dataframe from input values that indicate a start and end year and a column to be compared, and a boolean input to control whether the comparison is normalized by dividing the change for each county by the national change.

In [None]:
# function to create county comparison dataframe from input years and column to compare
def create_comparison_df(startYr, endYr, col, bnormalize):
    '''
    Function to create county comparison dataframe.
    startYr:  first year of data for comparison
    endYr:  second year of data for comparison
    col:  column name of data for comparison
    bnormalize:  boolean indicating whether to normalize change by dividing by national average change
    '''
    # create startYr and endYr dataframes
    start_df = get_data_for_year(county_stat, startYr)
    end_df = get_data_for_year(county_stat, endYr)
    start_df = add_pct_by_age(start_df)
    end_df = add_pct_by_age(end_df)
    
    # calculate national base rate of change for column of interest
    US_stat_start = get_data_for_year(US_stat, startYr)
    US_stat_end = get_data_for_year(US_stat, endYr)
    US_stat_start = add_pct_by_age(US_stat_start)
    US_stat_end = add_pct_by_age(US_stat_end)
    
    # pull out scalar national values (each selection holds a single row)
    base_start = US_stat_start[US_stat_start['Year'] == startYr][col].iloc[0]
    base_end = US_stat_end[US_stat_end['Year'] == endYr][col].iloc[0]
    if bnormalize:
        base_change = float(base_end) / base_start
    else:
        base_change = 1.0
    
    # create year dataframes with only select columns
    cols = ['FIPS', 'Name', 'Postal']
    cols.append(col)
    start_df_sub = start_df[cols]
    end_df_sub = end_df[cols]

    # rename columns
    startCol = col+' '+str(startYr)
    endCol = col+' '+str(endYr)
    start_df_sub.columns = ['FIPS', 'Name', 'Postal', startCol]
    end_df_sub.columns = ['FIPS', 'Name', 'Postal', endCol]

    # merge dataframes
    comparison_df = pd.merge(start_df_sub, end_df_sub, on=['FIPS', 'Name', 'Postal'], how='inner')

    # add percent change ratio column
    colName = col + ' Change'
    comparison_df[colName] = comparison_df[endCol] / comparison_df[startCol]
    comparison_df[colName] = comparison_df[colName].apply(lambda x: x / base_change)

    return comparison_df

Here we create a dataframe comparing median household income by county in year 2000 and year 2013, followed by a plot of the county data using the vincent package.

As was done previously for poverty change, we normalize median income change by dividing by the national average change in median income.  This results in change values less than 1 indicating income growth lower than national average income growth, and change values greater than 1 indicating an income growth larger than the national average.

In [None]:
income_comp_df = create_comparison_df(2000, 2013, 'Median Household Income', True)
income_comp_df.head()

In [None]:
# plot change in median household income by county from year 2000 to year 2013
county_topo = 'us_counties.topo.json'

geo_data = [{'name': 'counties',
             'url': county_topo,
             'feature': 'us_counties.geo'}]

vis_inc_comp = vincent.Map(data=income_comp_df, geo_data=geo_data, scale=850, projection='albersUsa',
                  data_bind='Median Household Income Change', data_key='FIPS', 
                  map_key={'counties': 'properties.FIPS'})

vis_inc_comp.marks[0].properties.enter.stroke_opacity = vincent.ValueRef(value=0.35)

vis_inc_comp.legend(title='Income Change 2000-2013')

# change color scales
vis_inc_comp.scales['color'].type = 'threshold'
vis_inc_comp.scales['color'].domain = [0, 0.8, 0.9, 1, 1.1, 1.2, 1.3]
#vis_inc_comp.scales['color'].domain = [1.3, 1.2, 1.1, 1, 0.9, 0.8, 0]

# change to divergent color scheme
#vis_inc_comp.scales['color'].range = ["#abdda4","#f46d43"]
#vis_inc_comp.colors(brew='YlGnBu')
#vis_inc_comp.colors(brew='Spectral')

vis_inc_comp.display()

For comparison, below is a plot of poverty change generated previously.

In [None]:
vis_pov_comp.display()

From a visual inspection of the two maps, it appears that counties with income growth above the national average between 2000 and 2013 tended to have poverty growth below the national average, and counties with lower income growth tended to have higher poverty growth. That result makes intuitive sense, as we would expect changes in poverty and income levels to be negatively correlated.

To gauge the degree of correlation, let's create a scatterplot of income growth vs. poverty growth.  First, we join the data, then we create the scatterplot.

In [None]:
# create and join dataframes comparing income growth and poverty growth by county
income_comp_df = create_comparison_df(2000, 2013, 'Median Household Income', True)
poverty_comp_df = create_comparison_df(2000, 2013, 'Poverty Percent All Ages', True)
joined_df = pd.merge(income_comp_df, poverty_comp_df, on=['FIPS', 'Name', 'Postal'], how='inner')
joined_df.head()

In [None]:
# create scatterplot
x = joined_df['Median Household Income Change'].values
y = joined_df['Poverty Percent All Ages Change'].values
plt.scatter(x, y, alpha = 0.15)
plt.xlabel('Median Household Income Change')
plt.ylabel('Poverty Percent All Ages Change')
plt.title('Poverty and Income Change for US Counties from 2000 to 2013', fontsize = 12)
plt.grid()
plt.show()

# calculate correlation (parenthesized print runs under both Python 2 and 3)
print('Correlation coefficients:')
print(np.corrcoef(x, y))

While quite a few counties had above-average income growth together with above-average poverty growth (or the opposite), the general inverse relationship between changes in poverty and income seems to hold. The scatterplot shows this visually, and the calculated correlation of -0.68 supports it.
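
Note that `np.corrcoef` returns a 2x2 matrix whose off-diagonal entries hold the Pearson correlation between the two inputs. The sketch below shows how to read that value and how `scipy.stats.pearsonr` yields the same coefficient along with a p-value. The two arrays are synthetic stand-ins generated for illustration, not the county data, so the printed numbers will not match the -0.68 reported above.

```python
import numpy as np
from scipy import stats

# synthetic stand-ins for the two change-ratio columns (not the SAIPE data)
rng = np.random.RandomState(0)
income_change = rng.normal(1.0, 0.1, size=500)
poverty_change = 2.0 - income_change + rng.normal(0.0, 0.1, size=500)

# corrcoef returns a 2x2 matrix; the off-diagonal entry is Pearson's r
corr_matrix = np.corrcoef(income_change, poverty_change)
r_from_matrix = corr_matrix[0, 1]

# pearsonr returns the same coefficient plus a two-sided p-value
r, p_value = stats.pearsonr(income_change, poverty_change)

print('r from corrcoef: %.3f' % r_from_matrix)
print('r from pearsonr: %.3f (p = %.3g)' % (r, p_value))
```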

### Exploratory Plots
A quick visual inspection of the histograms of median household income and poverty rates for 2000 and 2013 suggests that the mean values differ between the two years.

In [None]:
# get median income data for 2000 and 2013
inc_2000 = joined_df['Median Household Income 2000'].values
inc_2013 = joined_df['Median Household Income 2013'].values

fig, ax = plt.subplots(2,1, figsize=(5, 5), sharex=True, sharey=False)
ax[0].hist(inc_2000, bins = 100, alpha = 0.5)
ax[0].set_title('Median Household Income 2000')
ax[0].grid()
ax[1].hist(inc_2013, bins = 100, alpha = 0.5)
ax[1].set_title('Median Household Income 2013')
ax[1].set_xlabel('$')
ax[1].grid()
plt.show();

In [None]:
# get poverty percent data for 2000 and 2013
pov_2000 = joined_df['Poverty Percent All Ages 2000'].values
pov_2013 = joined_df['Poverty Percent All Ages 2013'].values

fig, ax = plt.subplots(2,1, figsize=(5, 5), sharex=True, sharey=False)
ax[0].hist(pov_2000, bins = 100, alpha = 0.5)
ax[0].set_title('Poverty Percent All Ages 2000')
ax[0].grid()
ax[1].hist(pov_2013, bins = 100, alpha = 0.5)
ax[1].set_title('Poverty Percent All Ages 2013')
ax[1].set_xlabel('%')
ax[1].grid()
plt.show();

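We can also see the difference in median income and poverty rate by county between 2000 and 2013 by plotting the 2013 values against the 2000 values, as shown below. A dashed line of slope 1 has been added to each plot for reference; points above the line are counties whose value was higher in 2013 than in 2000.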

In [None]:
fig, ax = plt.subplots(1,2, figsize=(12,5))

ax[0].scatter(inc_2000, inc_2013, alpha = 0.15)
ax[0].set_xlabel('Median Income 2000')
ax[0].set_ylabel('Median Income 2013')
ax[0].set_title('Median Income 2000 vs. 2013', fontsize = 12)

# add line of slope 1 for visual reference
inc_all = np.concatenate((inc_2000, inc_2013))
min_inc = np.min(inc_all)
max_inc = np.max(inc_all)
ax[0].plot([min_inc, max_inc], [min_inc, max_inc], 'g--', label = 'Line of Slope 1', linewidth = 2)
ax[0].grid()
ax[0].legend()

ax[1].scatter(pov_2000, pov_2013, alpha = 0.15)
ax[1].set_xlabel('Poverty Percent 2000')
ax[1].set_ylabel('Poverty Percent 2013')
ax[1].set_title('Poverty Percent 2000 vs. 2013', fontsize = 12)

# add line of slope 1 for visual reference
pov_all = np.concatenate((pov_2000, pov_2013))
min_pov = np.min(pov_all)
max_pov = np.max(pov_all)
ax[1].plot([min_pov, max_pov], [min_pov, max_pov], 'g--', label = 'Line of Slope 1', linewidth = 2)
ax[1].grid()
ax[1].legend()

# adjust horizontal spacing between subplots
fig.subplots_adjust(wspace = 0.35)
plt.show();

For the median income comparison graph, it is very clear that the vast majority of points lie above the reference line, indicating that median income by county in 2013 was generally higher than it was in 2000.

It also appears that most of the points in the poverty scatterplot lie above the slope 1 reference line, though not to the same degree as for median income. The scatter plots therefore suggest a difference in both median income and poverty percent between 2000 and 2013.
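
The visual impression of "most points above the line" can be turned into a number by computing the share of counties whose 2013 value exceeds their 2000 value. The helper below is a minimal sketch; `share_above_identity` is a hypothetical name introduced here, and the example values are made up for illustration.

```python
import numpy as np

def share_above_identity(before, after):
    """Return the fraction of points that lie strictly above the slope 1 line."""
    before = np.asarray(before, dtype=float)
    after = np.asarray(after, dtype=float)
    return np.mean(after > before)

# hypothetical poverty percentages for four counties
example_2000 = [10.0, 12.5, 20.0, 8.0]
example_2013 = [12.0, 12.0, 25.0, 9.5]

share = share_above_identity(example_2000, example_2013)
print('Share of points above the reference line: %.2f' % share)
```

With the notebook's arrays, the same helper could be called as `share_above_identity(pov_2000, pov_2013)` or `share_above_identity(inc_2000, inc_2013)`.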

### Statistical Testing
To draw valid conclusions about whether the means for 2000 and 2013 truly differ, however, we must perform a statistical test. Because we are comparing the same counties at two points in time, we use a paired t-test.

In [None]:
import scipy.stats as stats

pov_2000 = joined_df['Poverty Percent All Ages 2000'].values
pov_2013 = joined_df['Poverty Percent All Ages 2013'].values
stats.ttest_rel(pov_2000, pov_2013)

In [None]:
inc_2000 = joined_df['Median Household Income 2000'].values
inc_2013 = joined_df['Median Household Income 2013'].values
stats.ttest_rel(inc_2000, inc_2013)

The p-values in the output of both tests are essentially zero. In other words, if there were no true difference in means, the probability of observing differences this large in poverty rates and income levels between 2000 and 2013 would be negligible. This confirms that the difference in mean values between the two years is statistically significant for both measures.
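
For intuition about what the paired t-test computes, it is equivalent to a one-sample t-test on the per-county differences against a mean of zero. The sketch below checks that equivalence on synthetic data (generated values, not the county data) and also computes the statistic by hand.

```python
import numpy as np
from scipy import stats

# synthetic paired measurements for illustration (not the SAIPE data)
rng = np.random.RandomState(1)
before = rng.normal(14.0, 5.0, size=200)
after = before + rng.normal(2.0, 3.0, size=200)

# paired t-test, as used in the cells above
t_paired, p_paired = stats.ttest_rel(before, after)

# equivalent one-sample t-test on the pairwise differences
diff = before - after
t_one, p_one = stats.ttest_1samp(diff, 0.0)

# the same statistic by hand: mean difference over its standard error
t_manual = diff.mean() / (diff.std(ddof=1) / np.sqrt(len(diff)))

print('paired: %.4f, one-sample: %.4f, manual: %.4f'
      % (t_paired, t_one, t_manual))
```

The sign of the statistic depends on argument order: passing the earlier year first gives a negative value when the later year is larger on average.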

### Ideas for Additional Analysis
Several avenues of additional analysis that may yield interesting results are listed below:

1. Investigation of a possible regression to the mean effect, where poverty growth may have been higher for counties with low initial poverty levels and lower for counties with high initial poverty levels;

2. Investigation of whether poverty and/or income growth is related to initial poverty levels or initial median income levels;

3. Investigation of whether the poverty growth rates for different age groups exhibit similar behavior;

4. Investigation of additional relationships between poverty and/or income growth using external data sources, such as
    * Unemployment and education data, available at https://www.ers.usda.gov/data-products/county-level-data-sets/download-data/

    * Health information for diseases such as diabetes, available at http://www.cdc.gov/diabetes/data/county.html
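
As a possible starting point for the first idea, one could correlate each county's initial poverty rate with its change ratio. The sketch below uses synthetic data in which every county has a fixed underlying rate and both years are measured with independent noise; the column name `Change` and all values are assumptions made for illustration. Even with no real change over time, the correlation comes out negative, which is the regression to the mean pattern that a real analysis would need to separate from genuine effects.

```python
import numpy as np
import pandas as pd

# synthetic counties: a fixed underlying rate plus independent noise per year
rng = np.random.RandomState(2)
true_rate = rng.uniform(10.0, 30.0, size=1000)
df = pd.DataFrame({
    'Poverty Percent All Ages 2000': true_rate + rng.normal(0.0, 1.5, size=1000),
    'Poverty Percent All Ages 2013': true_rate + rng.normal(0.0, 1.5, size=1000),
})

# change ratio, as defined in the comparison dataframes (unnormalized here)
df['Change'] = (df['Poverty Percent All Ages 2013'] /
                df['Poverty Percent All Ages 2000'])

# correlation between the initial level and the subsequent change ratio
r_initial_vs_change = df['Poverty Percent All Ages 2000'].corr(df['Change'])
print('Correlation of initial rate with change ratio: %.3f'
      % r_initial_vs_change)
```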