### Unit 4.2.2  Capstone Analytic Report and Research Proposal

As a capstone to this fundamentals course, prepare an Analytic Report and Research Proposal on a dataset of your choosing. Your Report should accomplish these three goals:

1. Describe your dataset. Describe and explore your dataset in the initial section of your Report. What does your data contain and what is its background? Where does it come from? Why is it interesting or significant? Conduct summary statistics and produce visualizations for the particular variables from the dataset that you will use.

2. Ask and answer analytic questions. Ask three analytic questions and answer each one with a combination of statistics and visualizations. These analytic questions can focus on individuals' behaviors or comparisons of the population.

3. Propose further research. Lastly, make a proposal for a realistic future research project on this dataset that would use some data science techniques you'd like to learn in the bootcamp. Just like your earlier questions, your research proposal should present one or more clear questions. Then you should describe the techniques you would apply in order to arrive at an answer.

#### 1. Describe your dataset.

I have chosen a [Kaggle dataset](https://www.kaggle.com/muonneutrino/us-census-demographic-data/data#acs2015_county_data.csv) built from US Census demographic data. It has one row for each county or county equivalent in the US, including DC and Puerto Rico. I will be comparing two files from this dataset, one from 2015 and the other from 2017.

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
# plt.style.use('classic')
%matplotlib inline

In [2]:
df15 = pd.read_csv('acs2015_county_data.csv', index_col=None)
df17 = pd.read_csv('acs2017_county_data.csv', index_col=None)

In [3]:
df15.head(5)

|   | CensusId | State | County | TotalPop | Men | Women | Hispanic | White | Black | Native | ... | Walk | OtherTransp | WorkAtHome | MeanCommute | Employed | PrivateWork | PublicWork | SelfEmployed | FamilyWork | Unemployment |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1001 | Alabama | Autauga | 55221 | 26745 | 28476 | 2.6 | 75.8 | 18.5 | 0.4 | ... | 0.5 | 1.3 | 1.8 | 26.5 | 23986 | 73.6 | 20.9 | 5.5 | 0.0 | 7.6 |
| 1 | 1003 | Alabama | Baldwin | 195121 | 95314 | 99807 | 4.5 | 83.1 | 9.5 | 0.6 | ... | 1.0 | 1.4 | 3.9 | 26.4 | 85953 | 81.5 | 12.3 | 5.8 | 0.4 | 7.5 |
| 2 | 1005 | Alabama | Barbour | 26932 | 14497 | 12435 | 4.6 | 46.2 | 46.7 | 0.2 | ... | 1.8 | 1.5 | 1.6 | 24.1 | 8597 | 71.8 | 20.8 | 7.3 | 0.1 | 17.6 |
| 3 | 1007 | Alabama | Bibb | 22604 | 12073 | 10531 | 2.2 | 74.5 | 21.4 | 0.4 | ... | 0.6 | 1.5 | 0.7 | 28.8 | 8294 | 76.8 | 16.1 | 6.7 | 0.4 | 8.3 |
| 4 | 1009 | Alabama | Blount | 57710 | 28512 | 29198 | 8.6 | 87.9 | 1.5 | 0.3 | ... | 0.9 | 0.4 | 2.3 | 34.9 | 22189 | 82.0 | 13.5 | 4.2 | 0.4 | 7.7 |


In [4]:
df17.head(5)

|   | CountyId | State | County | TotalPop | Men | Women | Hispanic | White | Black | Native | ... | Walk | OtherTransp | WorkAtHome | MeanCommute | Employed | PrivateWork | PublicWork | SelfEmployed | FamilyWork | Unemployment |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1001 | Alabama | Autauga County | 55036 | 26899 | 28137 | 2.7 | 75.4 | 18.9 | 0.3 | ... | 0.6 | 1.3 | 2.5 | 25.8 | 24112 | 74.1 | 20.2 | 5.6 | 0.1 | 5.2 |
| 1 | 1003 | Alabama | Baldwin County | 203360 | 99527 | 103833 | 4.4 | 83.1 | 9.5 | 0.8 | ... | 0.8 | 1.1 | 5.6 | 27.0 | 89527 | 80.7 | 12.9 | 6.3 | 0.1 | 5.5 |
| 2 | 1005 | Alabama | Barbour County | 26201 | 13976 | 12225 | 4.2 | 45.7 | 47.8 | 0.2 | ... | 2.2 | 1.7 | 1.3 | 23.4 | 8878 | 74.1 | 19.1 | 6.5 | 0.3 | 12.4 |
| 3 | 1007 | Alabama | Bibb County | 22580 | 12251 | 10329 | 2.4 | 74.6 | 22.0 | 0.4 | ... | 0.3 | 1.7 | 1.5 | 30.0 | 8171 | 76.0 | 17.4 | 6.3 | 0.3 | 8.2 |
| 4 | 1009 | Alabama | Blount County | 57667 | 28490 | 29177 | 9.0 | 87.4 | 1.5 | 0.3 | ... | 0.4 | 0.4 | 2.1 | 35.0 | 21380 | 83.9 | 11.9 | 4.0 | 0.1 | 4.9 |


In [5]:
# Let's get a list of column names to see what we are dealing with here.
list(df15.columns.values)

['CensusId',
 'State',
 'County',
 'TotalPop',
 'Men',
 'Women',
 'Hispanic',
 'White',
 'Black',
 'Native',
 'Asian',
 'Pacific',
 'Citizen',
 'Income',
 'IncomeErr',
 'IncomePerCap',
 'IncomePerCapErr',
 'Poverty',
 'ChildPoverty',
 'Professional',
 'Service',
 'Office',
 'Construction',
 'Production',
 'Drive',
 'Carpool',
 'Transit',
 'Walk',
 'OtherTransp',
 'WorkAtHome',
 'MeanCommute',
 'Employed',
 'PrivateWork',
 'PublicWork',
 'SelfEmployed',
 'FamilyWork',
 'Unemployment']

In [6]:
# Same check for the 2017 file, so the two column lists can be compared.
list(df17.columns.values)

['CountyId',
 'State',
 'County',
 'TotalPop',
 'Men',
 'Women',
 'Hispanic',
 'White',
 'Black',
 'Native',
 'Asian',
 'Pacific',
 'VotingAgeCitizen',
 'Income',
 'IncomeErr',
 'IncomePerCap',
 'IncomePerCapErr',
 'Poverty',
 'ChildPoverty',
 'Professional',
 'Service',
 'Office',
 'Construction',
 'Production',
 'Drive',
 'Carpool',
 'Transit',
 'Walk',
 'OtherTransp',
 'WorkAtHome',
 'MeanCommute',
 'Employed',
 'PrivateWork',
 'PublicWork',
 'SelfEmployed',
 'FamilyWork',
 'Unemployment']

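Both files have the same number of columns, and the column names differ in only two places: 2015 has `Citizen` where 2017 has `VotingAgeCitizen`, and the ID column is called `CensusId` in 2015 and `CountyId` in 2017. According to the Kaggle documentation, both ID columns are county IDs. The `head()` output also shows that the `County` values are written differently (`Autauga` in 2015, `Autauga County` in 2017), so the ID is the safer key for matching counties across years.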
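The column comparison and the cross-year match can be sketched in a few lines. This is a minimal illustration, not the notebook's own code: `toy15`, `toy17`, `column_diff`, and `UnemploymentChange` are hypothetical names, and the toy frames hold only a few values copied from the `head()` output, so the `Citizen`/`VotingAgeCitizen` difference would only appear when the helper is run on the full `df15` and `df17`.

```python
import pandas as pd

# Toy stand-ins for df15 and df17, using values from the head() output.
toy15 = pd.DataFrame({
    "CensusId": [1001, 1003, 1005],
    "County": ["Autauga", "Baldwin", "Barbour"],
    "TotalPop": [55221, 195121, 26932],
    "Unemployment": [7.6, 7.5, 17.6],
})
toy17 = pd.DataFrame({
    "CountyId": [1001, 1003, 1005],
    "County": ["Autauga County", "Baldwin County", "Barbour County"],
    "TotalPop": [55036, 203360, 26201],
    "Unemployment": [5.2, 5.5, 12.4],
})


def column_diff(left, right):
    """Return (only_in_left, only_in_right) as sorted lists of column names."""
    left_cols, right_cols = set(left.columns), set(right.columns)
    return sorted(left_cols - right_cols), sorted(right_cols - left_cols)


only15, only17 = column_diff(toy15, toy17)

# Give both years the same key name (choosing "CountyId" is an assumption).
toy15 = toy15.rename(columns={"CensusId": "CountyId"})

# Join on the numeric ID rather than the county name, because the names
# are spelled differently in the two files.
both = toy15.merge(
    toy17, on="CountyId", suffixes=("_15", "_17"), validate="one_to_one"
)
both["UnemploymentChange"] = both["Unemployment_17"] - both["Unemployment_15"]
```

Passing `validate="one_to_one"` makes the merge raise an error if a county ID is duplicated in either year, which is a cheap sanity check before comparing the two files.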

I'm building a state-to-region lookup based on [Wikipedia](https://en.wikipedia.org/wiki/List_of_regions_of_the_United_States) so I can run queries by US region.

In [11]:
region = {"Connecticut": 'Northeast',"Maine": 'Northeast',"Massachusetts": 'Northeast',
           "New Hampshire": 'Northeast', "Rhode Island": 'Northeast',"Vermont": 'Northeast',
           "New Jersey": 'Northeast',"New York": 'Northeast', "Pennsylvania": 'Northeast',
           "Indiana": 'Midwest',"Illinois": 'Midwest',"Michigan": 'Midwest',"Ohio": 'Midwest',
           "Wisconsin": 'Midwest',"Iowa": 'Midwest',"Kansas": 'Midwest',"Minnesota": 'Midwest',
           "Missouri": 'Midwest',"Nebraska": 'Midwest',"North Dakota": 'Midwest',
           "South Dakota": 'Midwest',"Delaware": 'South',"District of Columbia": 'South',
           "Florida": 'South',"Georgia": 'South',"Maryland": 'South',"North Carolina": 'South',
           "South Carolina": 'South',"Virginia": 'South',"West Virginia": 'South',"Alabama": 'South',
           "Kentucky": 'South',"Mississippi": 'South',"Tennessee": 'South',"Arkansas": 'South',
           "Louisiana": 'South',"Oklahoma": 'South',"Texas": 'South',"Arizona": 'West',
           "Colorado": 'West',"Idaho": 'West',"New Mexico": 'West',"Montana": 'West',
           "Utah": 'West',"Nevada": 'West',"Wyoming": 'West',"Alaska": 'West',"California": 'West',
            "Hawaii": 'West',"Oregon": 'West',"Washington": 'West'}

In [13]:
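# Look up each county's region from its state. The cell that defines `region`
# must be run first. States missing from `region` (such as Puerto Rico) get NaN.
df15['Region'] = df15['State'].map(region)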
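One way to attach the region lookup to each county, and then query by region, is `Series.map` on the `State` column followed by a `groupby`. The sketch below runs on a toy frame rather than the real data: `toy_region` is an abbreviated stand-in for the full `region` dictionary, the two Alabama populations come from the `head()` output, and the Ohio and Puerto Rico rows are invented placeholders used only to show the behavior.

```python
import pandas as pd

# Abbreviated stand-in for the full `region` dictionary.
toy_region = {"Alabama": "South", "Ohio": "Midwest", "California": "West"}

toy = pd.DataFrame({
    "State": ["Alabama", "Alabama", "Ohio", "Puerto Rico"],
    # First two values are from head(); the last two are invented placeholders.
    "TotalPop": [55221, 195121, 1000, 2000],
})

# map() looks up each state's region; states missing from the dict become NaN.
toy["Region"] = toy["State"].map(toy_region)

# List the states that did not get a region, so nothing is dropped silently.
unmapped = sorted(toy.loc[toy["Region"].isna(), "State"].unique())

# Example query: total population per region (rows with NaN region are excluded).
pop_by_region = toy.groupby("Region")["TotalPop"].sum()
```

Because the dataset includes Puerto Rico and the lookup does not, checking `unmapped` before aggregating shows which rows will be left out of any per-region result.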