# Poverty in America - 2016: Thinkful Data Science Prep Course Capstone

In this report, I explore poverty rates in America. I first describe the dataset, then analyze the data, and conclude by proposing further research. <br><br> The following links lead to the three main sections of this report.

1. [Describe your dataset](#Describe your dataset) <br>
2. [Ask and answer analytic questions](#Ask and answer analytic questions) <br>
3. [Propose further research](#Propose further research) <br>

_Prep code: Import packages_

In [56]:
%matplotlib notebook

from IPython.display import display
import pandas as pd
import seaborn as sns
from matplotlib import pyplot as plt
import numpy as np
from sklearn import linear_model
import statsmodels.formula.api as smf

<a id="Describe your dataset"></a>

## Describe your dataset


> Describe and explore your dataset in the initial section of your Report. What does your data contain and what is its background? Where does it come from? Why is it interesting or significant? Conduct summary statistics and produce visualizations for the particular variables from the dataset that you will use.

_Load CSV file_

In [57]:
poverty_16 = pd.read_csv('/Users/jAir/Desktop/Research/thinkful/datasets/Poverty_16.csv', index_col=1)

In [58]:
poverty_16.head()

Unnamed: 0,Geographic Information,Unnamed: 2,Total Population,All Ages,Unnamed: 4,Unnamed: 5,Unnamed: 6,Unnamed: 7,Unnamed: 8,Age 0-17,...,Unnamed: 22,Unnamed: 23,Age 0-4,Unnamed: 25,Unnamed: 26,Unnamed: 27,Unnamed: 28,Unnamed: 29,Welfare,DorR
US,0,United States,323405935,44268996,44022086,44515906,14.0,13.9,14.1,14115713,...,57502,57732,4156949,4108397,4205501,21.3,21.1,21.5,123.0,
AL,1,Alabama,4860545,814197,796927,831467,17.2,16.8,17.6,267674,...,45650,46968,78675,74257,83093,27.4,25.9,28.9,1.2,r
AK,2,Alaska,741522,71916,67813,76019,9.9,9.3,10.5,24897,...,74375,77913,7919,6898,8940,15.1,13.2,17.0,0.8,r
AZ,4,Arizona,6908642,1107153,1086482,1127824,16.4,16.1,16.7,377445,...,52863,54099,106817,101686,111948,24.9,23.7,26.1,3.4,r
AR,5,Arkansas,2988231,497388,485180,509596,17.2,16.8,17.6,165724,...,43530,45282,50341,47095,53587,27.3,25.5,29.1,0.9,r


This dataset was downloaded from the __[United States Census Bureau's website](https://www2.census.gov/programs-surveys/saipe/datasets/2016/2016-state-and-county/)__.

The welfare data came from __[usgovernmentspending.com](https://www.usgovernmentspending.com/compare_state_spending_2016b40a)__.

It contains poverty rates for all US states for 2016 and breaks that data down into several subgroups, such as families with young children. It also contains the median household income and welfare spending in each state.

_Select columns to analyze and exclude the row with the US total_

In [59]:
poverty_sub = poverty_16[['Total Population','All Ages', 'Age 0-17',  
                                 'Age 5-17 in Families',
                                 'Age 0-4', 'Median Household Income', 'Welfare', 'DorR']]
poverty_sub = poverty_sub.iloc[1:51, :]
poverty_sub.head()

|  | Total Population | All Ages | Age 0-17 | Age 5-17 in Families | Age 0-4 | Median Household Income | Welfare | DorR |
|---|---|---|---|---|---|---|---|---|
| AL | 4860545 | 814197 | 267674 | 185889 | 78675 | 46309 | 1.2 | r |
| AK | 741522 | 71916 | 24897 | 16061 | 7919 | 76144 | 0.8 | r |
| AZ | 6908642 | 1107153 | 377445 | 263614 | 106817 | 53481 | 3.4 | r |
| AR | 2988231 | 497388 | 165724 | 112376 | 50341 | 44406 | 0.9 | r |
| CA | 39296476 | 5527621 | 1782764 | 1242780 | 502432 | 67715 | 17.8 | b |


_Visualize the data with a bar chart to begin data exploration._

In [60]:
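# Use iloc[:, 1:] so every state is plotted; iloc[1:51, 1:] would also drop the first state (AL),
# since the US row was already removed above.
poverty_sub.iloc[:, 1:].plot(figsize = (17,8),kind = 'bar')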
sns.set()
plt.ylabel("Number of People Living in Poverty")
plt.title('Poverty in America')
plt.show()


This bar chart highlights two needs. First, states with large populations, such as California and Texas, have more people living in poverty than states with small populations, such as Alaska, so the data needs to be standardized, which I do below. Second, comparing the number of people in poverty by age group in a single chart is problematic: the chart contains too much information to read. Below, I break out the different age groups for individual comparison.

_Add a new column to the dataframe that contains the percent of people in poverty (stored as a fraction of the total population), which allows for a better comparison between levels of poverty in each state. Then create a bar chart of the percentage of the population of each state living in poverty. To aid in viewing the data, I also sort it by the perc_pop_pov column._


In [61]:
poverty_sub['perc_pop_pov'] = poverty_sub['All Ages']/poverty_sub['Total Population']
poverty_sort = poverty_sub.sort_values(by=['perc_pop_pov'])
poverty_sort.head()

|  | Total Population | All Ages | Age 0-17 | Age 5-17 in Families | Age 0-4 | Median Household Income | Welfare | DorR | perc_pop_pov |
|---|---|---|---|---|---|---|---|---|---|
| NH | 1335015 | 97753 | 21896 | 14329 | 6300 | 70986 | 0.5 | b | 0.073222 |
| HI | 1428683 | 132597 | 33801 | 21877 | 10436 | 74659 | 0.9 | b | 0.092811 |
| MD | 6024752 | 572786 | 172868 | 117468 | 50081 | 78787 | 3.5 | b | 0.095072 |
| CT | 3587685 | 342171 | 97462 | 67644 | 27362 | 73380 | 2.4 | b | 0.095374 |
| MN | 5525050 | 531789 | 158751 | 106426 | 46337 | 65583 | 3.0 | b | 0.096251 |
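As a quick check of the `perc_pop_pov` calculation, the New Hampshire row above can be reproduced by hand. This is a minimal sketch that uses only the two figures from that row; the helper name `poverty_rate` is hypothetical and not part of the notebook.

```python
def poverty_rate(people_in_poverty, total_population):
    """Return the share of the population living in poverty (a fraction, not a percent)."""
    return people_in_poverty / total_population


# New Hampshire figures taken from the table above: 'All Ages' and 'Total Population'.
nh_rate = poverty_rate(97753, 1335015)
print(round(nh_rate, 6))
```

Because the column holds a fraction, multiplying by 100 would give the value as a percent.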


In [62]:
poverty_sort.iloc[0:, -1].plot(figsize = (15,6), kind = 'bar')
plt.ylabel('Percentage of People Living in Poverty')
plt.title('State Poverty Rates')
plt.show()

The bar chart above makes it easy to compare the rates of poverty in each state. New Hampshire has the lowest rate of poverty, while Mississippi has the highest rate.

_Before continuing with the analysis, it is important to understand the dataset better. To do that, I produce a histogram and a QQ plot._

In [63]:
poverty_sub.perc_pop_pov.hist()
plt.ylabel('Number of States')
plt.xlabel('Poverty Rate')
plt.title('Histogram of State Poverty Rates')
plt.show()

In [64]:
norm = np.random.normal(0, 1, 50)
norm.sort()
plt.plot(norm, poverty_sort.perc_pop_pov, "o") 
plt.ylabel('Percentage of People Living in Poverty')
plt.xlabel('Normal Distribution')
plt.title('QQ Plot State Poverty Rates/Normal Distribution')
plt.show() 

From the above histogram and the QQ plot, we see that the poverty data is roughly normal. Knowing this allows for some specific types of statistical analysis that I will cover below.
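The QQ plot above draws random normal samples for the x-axis, so it changes slightly on every run. One alternative, sketched here as an assumption rather than the method used in this report, is to compute theoretical normal quantiles with the standard library. The helper name `normal_quantiles` and the plotting positions `(i - 0.5) / n` are illustrative choices.

```python
from statistics import NormalDist


def normal_quantiles(n):
    """Theoretical standard-normal quantiles at plotting positions (i - 0.5) / n."""
    std_normal = NormalDist(mu=0.0, sigma=1.0)
    return [std_normal.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]


# 50 quantiles, matching the number of random draws used in the QQ plot cell.
quantiles = normal_quantiles(50)
print(quantiles[0], quantiles[-1])
```

These values could take the place of `norm` on the x-axis, plotted against the sorted poverty rates, which would make the plot reproducible.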

A final piece of data that I want to explore is welfare spending.

_Add a column for welfare spending per person_

In [65]:
poverty_sub['welfare_dollars_per_person'] = (poverty_sub['Welfare']*1000000000)/poverty_sub['Total Population']

In [66]:
list(poverty_sub)

['Total Population',
 'All Ages',
 'Age 0-17',
 'Age 5-17 in Families',
 'Age 0-4',
 'Median Household Income',
 'Welfare',
 'DorR',
 'perc_pop_pov',
 'welfare_dollars_per_person']

In [67]:
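# poverty_sort was created before welfare_dollars_per_person was added,
# so plot the new column directly from poverty_sub.
poverty_sub['welfare_dollars_per_person'].sort_values().plot(figsize = (15,6), kind = 'bar')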
plt.ylabel('Number of Welfare Dollars Spent Per Person')
plt.title('State Welfare Spending')
plt.show()

Here we see a large range of welfare spending per person. I will analyze this further below.
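The per-person figure can be checked by hand for one state. The sketch below assumes, as the `1000000000` multiplier in the cell above implies, that the `Welfare` column is expressed in billions of dollars. The helper name is hypothetical.

```python
def welfare_dollars_per_person(welfare_billions, total_population):
    """Convert state welfare spending in billions of dollars to dollars per resident."""
    return welfare_billions * 1_000_000_000 / total_population


# New Hampshire figures from this report: Welfare = 0.5, Total Population = 1,335,015.
nh_spending = welfare_dollars_per_person(0.5, 1335015)
print(round(nh_spending, 2))
```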

<a id="Ask and answer analytic questions"></a>

## Ask and answer analytic questions

 > Ask three analytic questions and answer each one with a combination of statistics and visualizations. These analytic questions can focus on individuals behaviors or comparisons of the population.

The above exploration of the poverty rates in the US leads me to ask the following questions: <br>

1. Given the range of poverty rates, which states can be considered "abnormal"?
2. How do poverty rates for specific age ranges compare to the overall poverty rates for each state?
3. Is there a correlation between poverty rates and welfare spending?

In order to understand which states' poverty rates might be abnormal, I calculate the z-score for each state's poverty rate. Given that this data is roughly normal, any state with a z-score above 2 or below -2 can be considered extreme.

_I begin by creating three new columns:_
1. z_score_perc_pov contains the z-score of perc_pop_pov;
2. perc_pop_child contains the percentage of the population that are children aged 0-4 living in poverty;
3. z_score_perc_child contains the z_scores for perc_pop_child. The z-scores are calculated to help understand better how each state's poverty rates compare.

In [68]:
poverty_sub['z_score_perc_pov'] = (poverty_sub.perc_pop_pov - poverty_sub.perc_pop_pov.mean())/poverty_sub.perc_pop_pov.std(ddof=0)
poverty_sub['perc_pop_child'] = poverty_sub['Age 0-4']/poverty_sub['Total Population']
poverty_sub['z_score_perc_child'] = (poverty_sub.perc_pop_child - poverty_sub.perc_pop_child.mean())/poverty_sub.perc_pop_child.std(ddof=0)
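To make the z-score formula concrete, here is a minimal standard-library sketch of the same calculation, using the population standard deviation to match `ddof=0`. It is applied to only the five lowest poverty rates shown earlier in this report, so the resulting values are for illustration and will not match the columns computed over all states. The function name `z_scores` is hypothetical.

```python
from statistics import mean, pstdev


def z_scores(values):
    """Standardize values using the population standard deviation (ddof=0)."""
    mu = mean(values)
    sigma = pstdev(values)
    return [(v - mu) / sigma for v in values]


# Five poverty rates (NH, HI, MD, CT, MN) from the sorted table earlier in the report.
sample_rates = [0.073222, 0.092811, 0.095072, 0.095374, 0.096251]
sample_z = z_scores(sample_rates)
print([round(z, 3) for z in sample_z])
```

By construction, the standardized values have mean 0 and population standard deviation 1, which is what makes the cutoff of 2 comparable across columns.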

_Then I plot the z-scores for overall poverty rates, sorted from least to greatest and color coded, with blue being low and red being high._

In [69]:
poverty_sort_z = poverty_sub.sort_values(by=['z_score_perc_pov'])

colors = plt.cm.coolwarm(np.linspace(0,1,51))
poverty_sort_z.iloc[0:, -3].plot(figsize = (15,6), kind = 'bar', stacked=True, color = colors)
plt.title('Comparison of State Poverty Rates')
plt.show()

This chart quickly shows us the range of poverty rates in America. New Hampshire stands out as having an abnormally low poverty rate, while Louisiana and Mississippi have abnormally high poverty rates.

In order to compare poverty rates for families with young children (0-4 years of age) with the overall rate of poverty in the general population, I plot the z-scores for the 0-4 age group, while keeping the data sorted according to the rate of poverty in the general population.

In [70]:
poverty_sort_z.iloc[0:, -1].plot(figsize = (15,6), kind = 'bar', stacked=True, color = colors)
plt.title('Comparison of State Poverty Rates: Families with Young Children')
plt.show()

The bars that appear to be "out of order" represent those states that have significantly different rates of poverty for the 0-4 age group compared to the overall population. While both West Virginia and DC have overall high rates of poverty, their poverty rates for families with young children are lower. On the other end of the spectrum, we see that Alaska has a relatively low rate of poverty overall, but its rate for families with children is not as impressive.
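Instead of spotting "out of order" bars by eye, the gap between the two z-scores can be ranked directly. The sketch below is one possible approach, not the method used in this report. It uses the five rows that appear in the `poverty_sort_z.head()` output of this report, and the helper name `rank_by_gap` is hypothetical.

```python
# (state, z_score_perc_pov, z_score_perc_child) for the five rows shown in this report.
rows = [
    ("NH", -2.039935, -2.193883),
    ("HI", -1.36738, -1.441894),
    ("MD", -1.289733, -1.148753),
    ("CT", -1.279377, -1.348239),
    ("MN", -1.249273, -1.127182),
]


def rank_by_gap(records):
    """Sort states by the absolute difference between child and overall z-scores."""
    gaps = [(state, child - overall) for state, overall, child in records]
    return sorted(gaps, key=lambda item: abs(item[1]), reverse=True)


ranked = rank_by_gap(rows)
for state, gap in ranked:
    # A positive gap means the 0-4 age group fares worse than the state's overall rate suggests.
    print(state, round(gap, 3))
```

Applied to the full dataframe, the same idea would surface the states whose bars look out of place in the chart.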

_Another way to view the same data is to plot both the overall rate and the rate for 0-4 together_

In [71]:
poverty_sort_z.iloc[0:, [-3, -1]].plot(figsize = (15,6), kind = 'bar')
plt.title('Comparison of State Poverty Rates: Families with Young Children')
plt.show()


This view helps to highlight some of the differences between poverty rates. As an example, my home state of Oregon is right in the middle for overall poverty rate, but its rate for families with young children is significantly better.

The final question that I wanted to analyze is: Is there a correlation between poverty rates and welfare spending?

_To address that question, I use a scatter plot._

In [72]:
poverty_sort_z.head()

|  | Total Population | All Ages | Age 0-17 | Age 5-17 in Families | Age 0-4 | Median Household Income | Welfare | DorR | perc_pop_pov | welfare_dollars_per_person | z_score_perc_pov | perc_pop_child | z_score_perc_child |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| NH | 1335015 | 97753 | 21896 | 14329 | 6300 | 70986 | 0.5 | b | 0.073222 | 374.527627 | -2.039935 | 0.004719 | -2.193883 |
| HI | 1428683 | 132597 | 33801 | 21877 | 10436 | 74659 | 0.9 | b | 0.092811 | 629.950801 | -1.36738 | 0.007305 | -1.441894 |
| MD | 6024752 | 572786 | 172868 | 117468 | 50081 | 78787 | 3.5 | b | 0.095072 | 580.936775 | -1.289733 | 0.008313 | -1.148753 |
| CT | 3587685 | 342171 | 97462 | 67644 | 27362 | 73380 | 2.4 | b | 0.095374 | 668.95505 | -1.279377 | 0.007627 | -1.348239 |
| MN | 5525050 | 531789 | 158751 | 106426 | 46337 | 65583 | 3.0 | b | 0.096251 | 542.981511 | -1.249273 | 0.008387 | -1.127182 |


In [73]:
poverty_sort_z.plot.scatter('perc_pop_pov', 'welfare_dollars_per_person')
plt.ylabel('welfare_dollars_per_person')
plt.xlabel('Percentage of Population Living in Poverty')
plt.title('Correlation between Welfare Spending and Poverty')
plt.show()


The scatter plot seems to show a slight correlation between welfare spending and poverty rates. All the states with welfare spending above 600 dollars per person have rates of poverty below 13 percent. There also appears to be a slight downward slope to the data. In order to test this correlation, I replot the data and include the Pearson correlation coefficient (PCC) and p-value.

With a PCC of -0.48, we can conclude that there is a moderate negative correlation between welfare spending and poverty rates.
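The cell that computed the coefficient is not shown here, so the following is only a sketch of how a Pearson correlation coefficient can be calculated, written with the standard library. The function name `pearson_r` is hypothetical. The five rows from the `poverty_sort_z.head()` output are used purely as sample input; they are the five lowest-poverty states, so the result will not reproduce the -0.48 obtained from the full dataset.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("xs and ys must have the same length (at least 2)")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / sqrt(var_x * var_y)


# perc_pop_pov and welfare_dollars_per_person for NH, HI, MD, CT, MN (sample input only).
rates = [0.073222, 0.092811, 0.095072, 0.095374, 0.096251]
spending = [374.527627, 629.950801, 580.936775, 668.95505, 542.981511]
r_sample = pearson_r(rates, spending)
print(round(r_sample, 3))
```

With the full dataframe, the two columns `perc_pop_pov` and `welfare_dollars_per_person` would be passed in place of the sample lists.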

In [74]:
from mpl_toolkits.mplot3d import Axes3D



fig = plt.figure()
ax = fig.add_subplot(111, projection = '3d')
ax.scatter(poverty_sort_z['welfare_dollars_per_person'], poverty_sort_z['Median Household Income'], poverty_sort_z['perc_pop_pov'], marker = 'o', c = poverty_sort_z['DorR'], s=100)
ax.set_xlabel('Welfare Dollars Per Person')
ax.set_zlabel('Percent of Population Living in Poverty')
ax.set_ylabel('Median Household Income')
plt.show()


In [75]:
list(poverty_sort_z)

['Total Population',
 'All Ages',
 'Age 0-17',
 'Age 5-17 in Families',
 'Age 0-4',
 'Median Household Income',
 'Welfare',
 'DorR',
 'perc_pop_pov',
 'welfare_dollars_per_person',
 'z_score_perc_pov',
 'perc_pop_child',
 'z_score_perc_child']

In [76]:
poverty_sort_z = poverty_sort_z.rename(index=str, columns={'Total Population':'population',
 'Age 0-17':'0_17',
 'Age 5-17 in Families':'five_to_seventeen',
 'Age 0-4':'zero_to_four',                                         
 'Median Household Income':'income',
 'Welfare':'welfare'})

In [77]:
list(poverty_sort_z)

['population',
 'All Ages',
 '0_17',
 'five_to_seventeen',
 'zero_to_four',
 'income',
 'welfare',
 'DorR',
 'perc_pop_pov',
 'welfare_dollars_per_person',
 'z_score_perc_pov',
 'perc_pop_child',
 'z_score_perc_child']

In [86]:
from sklearn.model_selection import train_test_split

X = poverty_sort_z[['population',
 'five_to_seventeen',
 'income',
 'welfare']]
y = poverty_sort_z['zero_to_four']
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=7)

In [87]:
# Instantiate and fit our model.
reg_all = linear_model.LinearRegression()

reg_all.fit(X_train, y_train)

y_pred = reg_all.predict(X_test)

# Inspect the results.
print('\nCoefficients: \n', reg_all.coef_)
print('\nIntercept: \n', reg_all.intercept_)
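print('\nR-squared:')
# Note: this scores the model on the full dataset (training and test rows),
# not only on the held-out test set.
print(reg_all.score(X, y))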


Coefficients: 
 [ 7.97387765e-04  4.16627704e-01 -1.33806702e-02 -6.72686039e+02]

Intercept: 
 940.9214040984953

R-squared:
0.9966294317206569
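The R-squared above is computed on all rows, including those used for training. To evaluate on held-out data only, the same statistic can be computed from `y_test` and `y_pred`. The sketch below shows the formula, 1 minus the ratio of the residual sum of squares to the total sum of squares, in plain Python. The function name `r_squared` and the toy numbers are hypothetical.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Toy values for illustration; in the notebook this would be r_squared(list(y_test), list(y_pred)).
toy_true = [10.0, 20.0, 30.0, 40.0]
toy_pred = [12.0, 18.0, 33.0, 39.0]
print(round(r_squared(toy_true, toy_pred), 4))
```

A model that always predicts the mean of `y_true` scores 0 under this definition, and a perfect model scores 1.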


In [88]:
# Write out the model formula.
# Your dependent variable on the left, independent variables on the right
# Use a ~ to represent an '=' from the functional form
linear_formula = 'zero_to_four ~ population+five_to_seventeen+income+welfare'

# Fit the model to our data using the formula.
lm = smf.ols(formula=linear_formula, data=poverty_sort_z).fit()
lm.params

Intercept            3274.147123
population              0.000516
five_to_seventeen       0.423224
income                 -0.003791
welfare             -2005.655156
dtype: float64

In [89]:
lm.pvalues

Intercept            4.439984e-01
population           3.354727e-01
five_to_seventeen    2.697212e-31
income               9.598543e-01
welfare              1.313704e-04
dtype: float64

In [90]:
lm.rsquared

0.9984175243677941

In [91]:
from sklearn.model_selection import cross_val_score
cv_results = cross_val_score(reg_all, X, y, cv=7)
cv_results

array([0.9599934 , 0.97804345, 0.99747567, 0.99765762, 0.99408013,
       0.99155859, 0.97044144])
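The seven fold scores are easier to compare with the earlier R-squared values once they are summarized. This short sketch copies the scores from the output above and reports their mean, spread, and minimum using the standard library.

```python
from statistics import mean, stdev

# Fold scores copied from the cross_val_score output above.
cv_scores = [0.9599934, 0.97804345, 0.99747567, 0.99765762,
             0.99408013, 0.99155859, 0.97044144]

cv_mean = mean(cv_scores)
cv_spread = stdev(cv_scores)  # sample standard deviation across folds
print(f"mean R^2 = {cv_mean:.4f}, std = {cv_spread:.4f}, min = {min(cv_scores):.4f}")
```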

<a id="Propose further research"></a>

## Propose further research

> Lastly, make a proposal for a realistic future research project on this dataset that would use some data science techniques you'd like to learn in the bootcamp. Just like your earlier questions, your research proposal should present one or more clear questions. Then you should describe the techniques you would apply in order to arrive at an answer.

For future analysis, I would like to explore how poverty rates have changed over time. Data is readily available for at least as far back as 1995, giving us more than 20 years' worth of data to work with. My main question would be: Is there a long-term correlation between welfare spending and poverty rates? I would like to use some machine learning models to predict what would happen to poverty rates if a state increased its spending on welfare. I would also like to include other relevant data, such as spending on unemployment, housing assistance, healthcare, food programs, and other social safety net programs. Other interesting data to include might be unemployment rates, houselessness rates, levels of education, rates of urbanization, and major industries for each state, among many others.