Pull demographics
=================

Pull NYC neighborhood demographic data and see whether we can use it in our subway prediction.

In [1]:
import os

import pandas as pd

In [2]:
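# Keep only the .xls profiles so that a re-run skips extracted.csv, which is written to this folder later.
files = [f'demographics/{n}' for n in sorted(os.listdir('demographics')) if n.endswith('.xls')]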
files

['demographics/demo_east_harlem_north.xls',
 'demographics/demo_east_harlem_south.xls',
 'demographics/demo_lenox_hill.xls',
 'demographics/demo_upper_east_carnegie.xls',
 'demographics/demo_yorkville.xls']

In [4]:
demos = [pd.read_excel(f) for f in files]
demos[0].head()

Unnamed: 0,2010-2014 ACS Economic Profile,Unnamed: 1,Unnamed: 2,NYC Census FactFinder,Unnamed: 4,Unnamed: 5
0,Selected Neighborhood: East Harlem North,,,,,
1,Selected Economic Characteristics\n(Grayed val...,Number,,,Percent,
2,,Estimate,MOE,CV*,Estimate,MOE
3,EMPLOYMENT STATUS,,,,,
4,Population 16 years and over,47371,1722,2.2,1,


We need to get these files into something usable. Values to include are population (the "Population 16 years and over" row and the employed count), commuting (car alone, public transit, walked, plus mean travel time), and household income (the median). I'll keep the raw estimates rather than the percentages.

In [11]:
test = demos[1]
population = test.iloc[4]['Unnamed: 1']
employed = test.iloc[7]['Unnamed: 1']
population, employed

(49229, 25095)

In [12]:
# commuting

car_alone = test.iloc[28]['Unnamed: 1']
public = test.iloc[30]['Unnamed: 1']
walked = test.iloc[31]['Unnamed: 1']
mean_travel_time_min = test.iloc[35]['Unnamed: 1']
car_alone, public, walked, mean_travel_time_min

(1564, 17124, 4050, 33.2)

In [13]:
# income
median_income = test.iloc[80]['Unnamed: 1']
median_income

35857

In [17]:
details = {'neighborhood': test.iloc[0]['2010-2014 ACS Economic Profile'].split(':')[1].strip(),
           'population': test.iloc[4]['Unnamed: 1'],
           'employed': test.iloc[7]['Unnamed: 1'],
           'commute_car': test.iloc[28]['Unnamed: 1'],
           'commute_public': test.iloc[30]['Unnamed: 1'],
           'commute_walk': test.iloc[31]['Unnamed: 1'],
           'mean_commute_minutes': test.iloc[35]['Unnamed: 1'],
           'median_income': test.iloc[80]['Unnamed: 1']}
details

{'commute_car': 1564,
 'commute_public': 17124,
 'commute_walk': 4050,
 'employed': 25095,
 'mean_commute_minutes': 33.2,
 'median_income': 35857,
 'neighborhood': 'East Harlem South',
 'population': 49229}

In [20]:
frame = pd.DataFrame(details, index=[0])

In [21]:
frame

Unnamed: 0,commute_car,commute_public,commute_walk,employed,mean_commute_minutes,median_income,neighborhood,population
0,1564,17124,4050,25095,33.2,35857,East Harlem South,49229


In [22]:
def pull_demographics(frame, index):
    """Extract the selected estimates from one parsed profile sheet as a one-row DataFrame.

    Rows are addressed by position, so every sheet must share the same layout.
    """
    estimates = frame['Unnamed: 1']  # the "Number / Estimate" column
    details = {'neighborhood': frame.iloc[0]['2010-2014 ACS Economic Profile'].split(':')[1].strip(),
               'population': estimates.iloc[4],  # "Population 16 years and over"
               'employed': estimates.iloc[7],
               'commute_car': estimates.iloc[28],
               'commute_public': estimates.iloc[30],
               'commute_walk': estimates.iloc[31],
               'mean_commute_minutes': estimates.iloc[35],
               'median_income': estimates.iloc[80]}
    return pd.DataFrame(details, index=[index])
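
The function above addresses rows by position, which breaks silently if a sheet ever gains or loses a line. A minimal sketch of a label-based lookup follows, assuming the row labels sit in the first column as in the `head()` output. `value_for` and `sheet` are hypothetical names, and only the "Population 16 years and over" label is taken from the output shown earlier; the labels of the other rows would need to be read off the actual files.

```python
import pandas as pd

# Tiny stand-in for one parsed profile sheet, copied from the head() output above.
sheet = pd.DataFrame({
    '2010-2014 ACS Economic Profile': [
        'Selected Neighborhood: East Harlem North',
        'EMPLOYMENT STATUS',
        'Population 16 years and over',
    ],
    'Unnamed: 1': [None, None, 47371],
})


def value_for(frame, label, column='Unnamed: 1'):
    """Return the value in `column` for the first row whose label matches exactly."""
    labels = frame['2010-2014 ACS Economic Profile'].astype(str).str.strip()
    matches = frame.loc[labels == label, column]
    if matches.empty:
        # Fail loudly instead of returning a value from the wrong row.
        raise KeyError(f'no row labelled {label!r}')
    return matches.iloc[0]


population_16_plus = value_for(sheet, 'Population 16 years and over')
```

A missing label raises `KeyError`, so a layout change shows up as an error rather than as a wrong number in the table.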

In [24]:
demo_frames = [pull_demographics(f, i) for i, f in enumerate(demos)]
demo = pd.concat(demo_frames)
demo

Unnamed: 0,commute_car,commute_public,commute_walk,employed,mean_commute_minutes,median_income,neighborhood,population
0,1467,16331,2407,22315,35.0,26099,East Harlem North,47371
1,1564,17124,4050,25095,33.2,35857,East Harlem South,49229
2,3153,25407,12422,49080,29.5,98797,Lenox Hill-Roosevelt Island,69894
3,2697,14528,5487,30007,26.7,155213,Upper East Side-Carnegie Hill,49172
4,3463,32344,6899,51031,33.3,98840,Yorkville,71578


In [25]:
demo.to_csv('demographics/extracted.csv')
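
`to_csv` writes the index as an unnamed first column, which comes back as `Unnamed: 0` unless it is read as the index. A small sketch of the round trip, using a temporary directory and a two-row sample copied from the table above (`demo_sample` is a stand-in, not the full `demo` frame):

```python
import os
import tempfile

import pandas as pd

# Two rows and three columns copied from the extracted table above.
demo_sample = pd.DataFrame({
    'neighborhood': ['East Harlem North', 'East Harlem South'],
    'population': [47371, 49229],
    'median_income': [26099, 35857],
})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'extracted.csv')
    demo_sample.to_csv(path)

    # Without index_col, the saved index becomes a data column named "Unnamed: 0".
    naive = pd.read_csv(path)
    # With index_col=0, the first column is restored as the index.
    restored = pd.read_csv(path, index_col=0)
```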

Create SQL database
------------------------

In [26]:
from sqlalchemy import create_engine
from sqlalchemy_utils import database_exists, create_database
import psycopg2

In [27]:
dbname = 'demographics'
# SQLAlchemy 1.4+ only accepts the 'postgresql' dialect name; the 'postgres' alias was removed.
engine = create_engine(f'postgresql://mikemoran@localhost/{dbname}')
if not database_exists(engine.url):
    create_database(engine.url)

In [28]:
demo.to_sql('demographics_nyc_table', engine, if_exists='replace')
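
To confirm the table was written, it can be read straight back. The sketch below swaps in an in-memory SQLite connection from the standard library so it runs without a Postgres server; that swap is an assumption made for illustration only. Against the real database, the same `pd.read_sql` call would take `engine` in place of `conn`. `demo_sample` is a two-row stand-in copied from the table above.

```python
import sqlite3

import pandas as pd

# Two rows copied from the extracted table above.
demo_sample = pd.DataFrame({
    'neighborhood': ['East Harlem North', 'East Harlem South'],
    'population': [47371, 49229],
    'median_income': [26099, 35857],
})

# In-memory SQLite stands in for the Postgres engine used in the notebook.
conn = sqlite3.connect(':memory:')
demo_sample.to_sql('demographics_nyc_table', conn, if_exists='replace')

# to_sql stores an unnamed index in a column called "index"; read it back as the index.
round_trip = pd.read_sql('SELECT * FROM demographics_nyc_table', conn,
                         index_col='index')
conn.close()
```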