# Full Technical Report: Where's the American Dream?
**_Exploring Intergenerational Economic Mobility by U.S. County_**

Nateé Johnson  
January 2020

### Overview

The goal of this project was to model and visualize Absolute Upward Mobility (AUM) by county as a function of location-specific characteristics. The aim was to determine which features of a county improve a low-income child’s chances of achieving better economic outcomes than their parents. My approach was to define what it means for a county to be a hotbed of opportunity (a.k.a. “Mobile”) versus a place where average upward mobility is hampered (a.k.a. “Not Mobile”). Using binary classification models, I looked at feature importance to determine which characteristics of a child’s neighborhood impact their economic outcomes.

Using scikit-learn in Python, I found that my Random Forest classifier had greater distinguishing power than my logistic regression model. I encountered missing data (inspected with the missingno library) and class imbalance (~10% of the counties met the criteria to be labeled “Mobile”). To address these, I experimented with SMOTE (Synthetic Minority Oversampling Technique), but found that imputing the median yielded comparable results and was more interpretable. I accounted for the class imbalance within the Random Forest classifier using the `class_weight` parameter. I used Python’s Folium library to generate a multi-layer choropleth map with pop-up markers and a dynamic tooltip.

Prior to modeling and visualizing, I spent a significant amount of time studying the datasets and becoming familiar with the topic area, which was new to me. I used Pandas to clean data from Opportunity Insights and County Health Rankings, and merged them on the FIPS code (after some fun string manipulation to get the codes into the same format). Opportunity Insights is a massive effort exploring intergenerational economic mobility. Researchers from Harvard University, Brown University, and the Census Bureau compiled decades of data from tax returns, W-2s, and U.S. census survey results to link parents and children. Looking at parents’ income and their age at childbirth, they traced those children to adulthood and captured their income at a comparable age.

To create my map, I had to reformat the dataframe to fit the structure of the county GeoJSON file. This was an important step that allowed the information for each county to be called dynamically from the dataframe as a user hovers over the relevant spatial area. The full functionality of each layer is available when that layer is viewed on its own (https://nateej1.github.io/AmericanDream_Geo/).
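
The FIPS-code merge mentioned above might look roughly like the sketch below. The column names (`cty`, `FIPS`), the toy values, and the helper `normalize_fips` are hypothetical and only illustrate the idea: a county FIPS code is a five-digit string (two-digit state plus three-digit county), so codes stored as integers or floats lose their leading zeros and have to be zero-padded before the two tables can be joined.

```python
import pandas as pd


def normalize_fips(value):
    """Return a county FIPS code as a zero-padded five-character string."""
    # Codes read from CSV often arrive as ints or floats (1001 or 1001.0),
    # which drops the leading zero of states such as Alabama ("01").
    return str(int(float(value))).zfill(5)


# Hypothetical stand-ins for the two cleaned sources (toy values, not real data)
mobility = pd.DataFrame({
    "cty": [1001, 6037, 36061],
    "Absolute Upward Mobility": [37.0, 41.0, 44.0],
})
health = pd.DataFrame({
    "FIPS": ["01001", "06037", "36061"],
    "Teenage Birth Rate": [0.05, 0.04, 0.03],
})

# Bring both keys into the same five-character format
mobility["County FIPS Code"] = mobility["cty"].map(normalize_fips)
health["County FIPS Code"] = health["FIPS"].map(normalize_fips)

# validate raises if either side has duplicate county codes
merged = mobility.merge(
    health, on="County FIPS Code", how="inner", validate="one_to_one"
)
print(merged[["County FIPS Code", "Absolute Upward Mobility", "Teenage Birth Rate"]])
```

Using `validate="one_to_one"` is a design choice in this sketch: it makes the merge fail loudly if a county appears twice in either table instead of silently duplicating rows.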

### Imports
Import libraries and write settings here.

In [1]:
# Data manipulation
import pandas as pd
import pickle
import numpy as np
import seaborn as sns
from pandas_profiling import ProfileReport
import os
import sys

# Options for pandas
pd.options.display.max_columns = 50
pd.options.display.max_rows = 30


# Visualizations
import chart_studio.plotly as py
import folium
import matplotlib.pyplot as plt
import plotly.graph_objs as go
from plotly.offline import iplot, init_notebook_mode
init_notebook_mode(connected=True)

# Analysis & Modeling Packages
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_graphviz
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_curve, auc, confusion_matrix, average_precision_score, precision_recall_curve, classification_report
from sklearn import tree 
from IPython.display import Image  



### Local Scripts

Import functions to clean data

In [None]:


src_dir = os.path.join(os.getcwd(), '..', '..', 'src')
sys.path.append(src_dir)

from d02_processing.cleaned_data import Clean_OppIns_raw

data = Clean_OppIns_raw()

In [2]:
data

## Data Exploration

### Data Sources & Cleaning

#### Opportunity Insights

Using two datasets:

*Where is the Land of Opportunity? The Geography of Intergenerational Mobility in the United States*

Chetty, Hendren, Kline and Saez (2014)
Descriptive Statistics by County and Commuting Zone  
http://www.equality-of-opportunity.org/assets/documents/mobility_geo.pdf 

**Data source**: http://www.equality-of-opportunity.org/data/ 


Preliminary scripts: 

- [ ] Cleaned mobility df
- [ ] Cleaned county health df

In [None]:
# This returns cleaned County Health data
Clean_CountyHealth_raw()

# This returns cleaned Opportunity Insights data
Clean_OppIns_raw()

In [None]:
## Using Pandas Profiling library for comprehensive look at data

ProfileReport(df)
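
Alongside the profile report, a quick way to quantify missingness and apply the median imputation described in the overview could look like this. It is a minimal sketch on a toy frame; the column values are hypothetical, and plain pandas is used here in place of the missingno plots.

```python
import numpy as np
import pandas as pd

# Toy frame with gaps (hypothetical values, not the project data)
df_toy = pd.DataFrame({
    "Median Parent Income": [52000.0, np.nan, 61000.0, 48000.0],
    "Teenage Birth Rate": [0.05, 0.07, np.nan, 0.03],
})

# Share of missing values per column
missing_share = df_toy.isna().mean()
print(missing_share)

# Replace each missing value with the median of its own column
medians = df_toy.median(numeric_only=True)
df_imputed = df_toy.fillna(medians)
print(df_imputed)
```

In a modeling workflow, the medians would normally be computed on the training split only and then applied to the test split, so that no information from the held-out counties leaks into training.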

#### County Health Rankings

Data was originally sourced from Tableau Public:  
https://public.tableau.com/s/sites/default/files/media/County_Health_Rankings.csv

Files for individual years can be found at the source:  
https://www.countyhealthrankings.org/app/

### Data Decisions

insert image of original data set and talk about decision to average over years '97-'02

Given missing values, date ranges, etc. I will proceed using: 




## Analysis/Modeling

### Modeling on Income Data Only

Decision Tree and Logistic Regression

In choosing features to model on, I am conscious of data leakage and auto-correlation (?). 

In [None]:
county_mobility = pd.read_pickle('../../data/02_intermediate/county_mobility_incomeOnly')
county_health = pd.read_pickle('../../data/02_intermediate/county_measures')

#### Setting up data

In [None]:
features = [
    'Top 1% Income Share',
    'Interquartile Income Range',
    'Median Parent Income'
]

# Keep only rows with complete feature values, then split into features and target
complete_rows = county_mobility.dropna(subset=features, axis=0)
X = complete_rows[features]
y = complete_rows[['Target']]

# Hold out 30% of the counties for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=5)

**SMOTE - Addressing Unbalanced Classes**

In [None]:
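# Oversample the minority class in the training split only, up to a 1:1 class ratio
smote = SMOTE(random_state=42, sampling_strategy=1.0)
# y_train is a one-column DataFrame; np.array(y_train).ravel() gives an
# equivalent 1-D target if one is needed
X_res, y_res = smote.fit_resample(X_train, y_train)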
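
For comparison with SMOTE, the overview mentions handling the imbalance inside the Random Forest through `class_weight` and then reading off feature importances. A minimal sketch of that idea follows. It runs on synthetic data from `make_classification` with roughly 10% positives, the feature names are borrowed from the list above purely as labels, and the hyperparameters are assumptions rather than the settings used in the project.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Labels only; the synthetic columns have no relation to the real variables
feature_names = [
    "Top 1% Income Share",
    "Interquartile Income Range",
    "Median Parent Income",
]

# Synthetic stand-in for the county data: 3 features, ~10% positive class
X_syn, y_syn = make_classification(
    n_samples=2000, n_features=3, n_informative=3, n_redundant=0,
    weights=[0.9, 0.1], random_state=5,
)

# stratify keeps the class ratio the same in the train and test splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.3, random_state=5, stratify=y_syn
)

# class_weight="balanced" weights classes inversely to their frequency
models = {
    "Logistic Regression": LogisticRegression(class_weight="balanced", max_iter=1000),
    "Random Forest": RandomForestClassifier(
        n_estimators=200, class_weight="balanced", random_state=5
    ),
}

scores = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    # Score with the predicted probability of the positive class
    scores[name] = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])
    print(f"{name}: ROC AUC = {scores[name]:.3f}")

# Impurity-based importances; they sum to 1 across features
importances = dict(zip(feature_names, models["Random Forest"].feature_importances_))
for feat, imp in sorted(importances.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{feat}: {imp:.3f}")
```

Unlike SMOTE, class weighting leaves the training rows untouched and only changes how much each class counts during fitting, which keeps the inputs easier to interpret.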

## Results

### Feature Correlation

### Mapping Results

The output from the following code can be found here: http://nateej1.github.io/AmericanDream_Geo

Static Image
![AUM-map](../../AUM-map-preview.png)
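
The map code below reads tooltip fields such as `State` and `Absolute Upward Mobility` from the GeoJSON itself, so the county attributes first have to be copied into each feature's `properties`. One way this step could be done is sketched here; the toy GeoJSON, the mobility value, and the helper `attach_properties` are hypothetical and stand in for however `updated_county_json` was actually built.

```python
import pandas as pd


def attach_properties(geojson, frame, key_col, geo_key, fields):
    """Copy selected DataFrame columns into matching GeoJSON feature properties.

    The GeoJSON dict is modified in place and also returned.
    """
    lookup = frame.set_index(key_col)[fields].to_dict(orient="index")
    for feature in geojson["features"]:
        props = feature["properties"]
        row = lookup.get(props[geo_key])
        for field in fields:
            # Counties absent from the DataFrame get None so the field still exists
            props[field] = None if row is None else row[field]
    return geojson


# Toy inputs (hypothetical values, geometry omitted for brevity)
county_json = {
    "type": "FeatureCollection",
    "features": [
        {"type": "Feature", "properties": {"GEO_ID": "01001", "NAME": "Autauga"}, "geometry": None},
        {"type": "Feature", "properties": {"GEO_ID": "01003", "NAME": "Baldwin"}, "geometry": None},
    ],
}
county_df = pd.DataFrame({
    "County FIPS Code": ["01001"],
    "State": ["Alabama"],
    "Absolute Upward Mobility": [37.0],
})

updated = attach_properties(
    county_json, county_df,
    key_col="County FIPS Code", geo_key="GEO_ID",
    fields=["State", "Absolute Upward Mobility"],
)
print(updated["features"][0]["properties"])
```

For the join to work, the key stored in the GeoJSON (`GEO_ID` here) and the DataFrame key must be in exactly the same string format.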

In [None]:
# Instantiating map object
aum_map = folium.Map(location=[37, -95], zoom_start=5)
bins = list(data_imp['Absolute Upward Mobility'].quantile([0, 0.25, 0.5, 0.75, 1]))

choro = folium.Choropleth(
    geo_data=updated_county_json,
    data=data_imp,
    columns=['County FIPS Code', 'Absolute Upward Mobility'],
    key_on='properties.GEO_ID',
    fill_color='YlOrRd',
    fill_opacity=0.5,
    nan_fill_color='gray',
    nan_fill_opacity=0.9,
    line_opacity=0.5,
    legend_name='Absolute Upward Mobility',
    bins=bins,
    reset=True, 
    highlight=True,
    name='Absolute Upward Mobility'
    
)

choro2 = folium.Choropleth(
    geo_data=updated_county_json,
    data=data_imp,
    columns=['County FIPS Code', 'Target'],
    key_on='properties.GEO_ID',
    fill_color='PuBu',
    fill_opacity=0.5,
    nan_fill_color='gray',
    nan_fill_opacity=0.9,
    line_opacity=0.5,
    legend_name='Mobility Outcomes',
    bins=3,
    reset=True, 
    highlight=True,
    name='Child Can Move at least 1 quartile up'
    )

locations = list(zip(data_imp.lat, data_imp.lon))
popup_content = list(zip(
    data_imp['County Name'],
    data_imp['State'],
    round(data_imp['Absolute Upward Mobility'], 1),
    round(data_imp['Teenage Birth Rate'] * 100, 1),
))
popups = [
    '<center> {} County, {} <br>  <b>AUM:</b> {} <br><b>Teen Birth Rate:</b> {}% <br>'.format(
        name, state, aum, share)
    for (name, state, aum, share) in popup_content
]


tooltip = folium.features.GeoJsonTooltip(fields=['NAME', 'State', 'Absolute Upward Mobility', '% Unemployment', '% Teenage Birth Rate', '% Share Between p25 and p75'],
                                         aliases=[
                                             'County', 'State', 'Absolute Upward Mobility', 'Unemployment Rate','Teen Birth Rate', '"Middle Class" Income Share'],
                                         style=('background-color: grey; color: white; font-family:'
                                                'courier new; font-size: 24px; padding: 10px;'),
                                         localize=True)

choro.geojson.add_child(tooltip)

data_imp_copy = data_imp.reset_index()

for idx, row in data_imp_copy.iterrows():
    aum = row['Absolute Upward Mobility']
    # Flag the highest-mobility counties in green and the lowest in red
    if aum >= 60:
        icon = folium.Icon(color='green', icon='ok-sign')
    elif aum <= 32:
        icon = folium.Icon(color='red')
    else:
        continue
    marker = folium.Marker(location=locations[idx], icon=icon)
    folium.Popup(popups[idx], max_width='150%').add_to(marker)
    marker.add_to(choro2)

    
choro.add_to(aum_map)
choro2.add_to(aum_map)
folium.LayerControl().add_to(aum_map)
aum_map.save(os.path.join('../../results', 'aum_map8.html'))

## Conclusions and Next Steps

With more robust datasets, this analysis could inform policy decisions that can foster more equitable environments for upward mobility. 

In continuing this investigation, I would like to incorporate education data from the Urban Institute and create a map product with more interactive features (using Dash), allowing the audience to engage in their own exploration of the topic.