# Mo' Money, Mo' Problems

<br>

## I hoped to answer one central question:
# <font color = blue> Do rich neighborhoods complain more than poor ones?</font>

<br>

## I also wanted to confirm an assumption about an indicator of gentrification: <br><br>_Department of Buildings Job Filings are a precursor to tenant displacement and therefore gentrification_

***

The NYC Housing Database we used for this project contains data collected from the city of New York by housing activists. While the database contains a tremendous amount of information, I concentrated on two factors:

## Housing Preservation and Development complaints (also called 311 complaints) <br>& <br>Department of Buildings Job Filings 

HPD complaints are any grievances filed by tenants through the 311 system that were routed to HPD. <br>
DOB job filings are permit applications for construction work in a residential building.

For this analysis, I concentrated on a cluster of 10 zip codes in northern Manhattan, including Harlem. I pulled the income information for each zip code from the Census Bureau and joined that information with a count of HPD complaints and DOB job filings from the NYC Housing database. I used this combined data to calculate the year-over-year percent change for each category.
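Before trusting a join like this, it helps to check whether any zip/year pairs exist in one table but not the other, since an inner merge drops them silently. The following is a minimal sketch on a small illustrative subset of the numbers; the frame names `income` and `complaints` are placeholders, and rows are deliberately left out so that there is something to catch.

```python
import pandas as pd

# Small illustrative subset; some rows are deliberately missing on each side.
income = pd.DataFrame({
    "zip": [10027, 10027, 10030],
    "year": [2012, 2013, 2012],
    "income": [35694, 37872, 30674],
})
complaints = pd.DataFrame({
    "zip": [10027, 10030, 10030],
    "year": [2012, 2012, 2013],
    "hpd_complaints_count": [1, 1, 1],
})

# An outer join with indicator=True keeps every row and labels its origin.
# validate raises if a (zip, year) pair appears more than once on either side.
check = pd.merge(income, complaints, on=["zip", "year"], how="outer",
                 indicator=True, validate="one_to_one")

# Rows that an inner join would have silently dropped.
unmatched = check[check["_merge"] != "both"]
print(unmatched[["zip", "year", "_merge"]])
```

If `unmatched` is empty, the inner merge loses nothing; otherwise the listed zip/year pairs are worth a second look before computing any percent changes.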

# So, do rich neighborhoods complain more?<br>
The short answer is: 
## **Yes!**
like, a lot more.

In [11]:
import pandas as pd
import numpy as np
import json
from pandas.io.json import json_normalize

In [13]:
#Import our 3 csvs containing the data from SQL queries and Census
df_hpd = pd.read_csv('data/hpd_complaintsbyzip.csv')
df_dob = pd.read_csv('data/dobjob_byzip.csv')
df_income = pd.read_csv('data/census_income_12_16.csv')
#Merge income and dob dataframes
df_inc_dob = pd.merge(df_income, df_dob, left_on=['zip', 'year'], right_on=['zip', 'year'])
#Merge the above DF with the HPD complaints
df_final = pd.merge(df_inc_dob, df_hpd, left_on=['zip', 'year'], right_on=['zip', 'year'])
#Make a list of each zipcode in the DF for our for loop
zip_list = df_final['zip'].unique().tolist()
#Create our dataframe with all info, plus the percent change between each
#make a blank DF
df_out = pd.DataFrame()

for zip_code in zip_list:
    #Make a temporary df that contains only the info from one zip code
    #(.copy() gives us an independent DataFrame, which avoids pandas' SettingWithCopyWarning)
    df_zip = df_final[df_final['zip'] == zip_code].copy()
    #calculate the percent change in the three values we're interested in: income, DOB Job filings and HPD complaints
    df_zip['income_change'] = df_zip['income'].pct_change().values*100
    df_zip['dobjob_change'] = df_zip['dobjob_count'].pct_change().values*100
    df_zip['hpd_complaints_change'] = df_zip['hpd_complaints_count'].pct_change().values*100
    #concat this new information together into one df
    frames = [df_out, df_zip]
    df_out = pd.concat(frames)
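The per-zip loop can probably be replaced with a `groupby`, which computes the percent change within each zip code in one step. This is only a sketch of that alternative, using income figures for two zip codes copied from this notebook; it is not the code that produced the results here.

```python
import pandas as pd

# Income figures for two of the zip codes, copied from this notebook.
df = pd.DataFrame({
    "zip": [10027, 10030, 10027, 10030, 10027, 10030],
    "year": [2012, 2012, 2013, 2013, 2014, 2014],
    "income": [35694, 30674, 37872, 31925, 40013, 32561],
})

# Sort so that each zip code's rows are in year order before differencing.
df = df.sort_values(["zip", "year"]).reset_index(drop=True)

# pct_change within each zip code; the first year of each group is NaN.
df["income_change"] = df.groupby("zip")["income"].pct_change() * 100

print(df)
```

The same pattern would apply to the `dobjob_count` and `hpd_complaints_count` columns, and it never assigns into a filtered slice.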


# The first question we had to answer was: which zip codes saw an increase in income?

# Now, income alone isn't a great indicator of whether a neighborhood has gentrified.

### Ideally, income would be combined with other data, like the number of higher-education degrees and house values.
But for our purposes we can use income as a general indicator, and it appears that many of these zip codes have seen a substantial increase.

In [24]:
df_out[['zip', 'year', 'income', 'income_change']]

| | zip | year | income | income_change |
|---|---|---|---|---|
| 0 | 10027 | 2012 | 35694 | NaN |
| 6 | 10027 | 2013 | 37872 | 6.101866 |
| 15 | 10027 | 2014 | 40013 | 5.653253 |
| 25 | 10027 | 2015 | 40782 | 1.921875 |
| 34 | 10027 | 2016 | 42754 | 4.835467 |
| 1 | 10030 | 2012 | 30674 | NaN |
| 7 | 10030 | 2013 | 31925 | 4.078373 |
| 16 | 10030 | 2014 | 32561 | 1.992169 |
| 26 | 10030 | 2015 | 33196 | 1.950186 |
| 35 | 10030 | 2016 | 33720 | 1.578503 |


# And how did the number of complaints change during that same period?

In [30]:
df_out[['zip', 'year', 'hpd_complaints_count', 'hpd_complaints_change', 'income_change']]

| | zip | year | hpd_complaints_count | hpd_complaints_change | income_change |
|---|---|---|---|---|---|
| 0 | 10027 | 2012 | 1 | NaN | NaN |
| 6 | 10027 | 2013 | 4 | 300.0 | 6.101866 |
| 15 | 10027 | 2014 | 2113 | 52725.0 | 5.653253 |
| 25 | 10027 | 2015 | 4266 | 101.893043 | 1.921875 |
| 34 | 10027 | 2016 | 3624 | -15.049226 | 4.835467 |
| 1 | 10030 | 2012 | 1 | NaN | NaN |
| 7 | 10030 | 2013 | 1 | 0.0 | 4.078373 |
| 16 | 10030 | 2014 | 1558 | 155700.0 | 1.992169 |
| 26 | 10030 | 2015 | 3219 | 106.61104 | 1.950186 |
| 35 | 10030 | 2016 | 2413 | -25.038832 | 1.578503 |


# Hm. Increases of more than 100,000 percent? That seems fishy.

There are a number of potential reasons for this:

1. Rich people really do complain more
2. Something changed with the data reporting in 2014
3. Some other factor caused the increase
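The arithmetic behind those huge numbers is worth spelling out: percent change is (new - old) / old * 100, so a baseline of 1 complaint turns almost any later count into an enormous percentage. Here is a rough sketch using the zip 10030 counts from the table above. The `MIN_BASELINE` cutoff is a hypothetical value I picked for illustration, not something derived from the data.

```python
import pandas as pd

# HPD complaint counts for zip 10030, copied from the table above.
counts = pd.Series([1, 1, 1558, 3219, 2413],
                   index=[2012, 2013, 2014, 2015, 2016],
                   name="hpd_complaints_count")

# Percent change = (new - old) / old * 100, so a baseline of 1 blows up the result.
change = counts.pct_change() * 100

# MIN_BASELINE is a hypothetical cutoff, not something taken from the data set.
MIN_BASELINE = 100

# Keep a percent change only when the previous year's count meets the cutoff.
reliable = change.where(counts.shift(1) >= MIN_BASELINE)

print(pd.DataFrame({"count": counts, "change": change, "reliable_change": reliable}))
```

Masking the low-baseline years does not explain why 2012 and 2013 are nearly empty, but it keeps the 2014 jump from dominating any comparison with income.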

## What can we take away from this? 
<br>
Honestly, not much. One of the issues with this data set is that without ground-level reporting, like speaking with the people who manage the HPD complaints database, it's difficult to discern what these numbers imply. Also, this is not a complete set of 311 complaints, just the ones that were routed to HPD. We'd have a better idea if we had all the 311 complaints rather than a subset.

<br>

# How about our other question: can we verify that after an increase in DOB Job filings a neighborhood's income increases?

In [32]:
df_out[['zip', 'year','dobjob_count','dobjob_change','income_change']]

| | zip | year | dobjob_count | dobjob_change | income_change |
|---|---|---|---|---|---|
| 0 | 10027 | 2012 | 766 | NaN | NaN |
| 6 | 10027 | 2013 | 864 | 12.793734 | 6.101866 |
| 15 | 10027 | 2014 | 945 | 9.375 | 5.653253 |
| 25 | 10027 | 2015 | 1022 | 8.148148 | 1.921875 |
| 34 | 10027 | 2016 | 1004 | -1.761252 | 4.835467 |
| 1 | 10030 | 2012 | 181 | NaN | NaN |
| 7 | 10030 | 2013 | 199 | 9.944751 | 4.078373 |
| 16 | 10030 | 2014 | 262 | 31.658291 | 1.992169 |
| 26 | 10030 | 2015 | 303 | 15.648855 | 1.950186 |
| 35 | 10030 | 2016 | 243 | -19.80198 | 1.578503 |


## It certainly seems like that hypothesis holds true <br>
Generally, there is a large increase in DOB job filings followed by a large increase in income, and the filings tend to begin falling off as the income levels out.
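One way to put a number on "filings first, income later" is to compare each year's income change with the previous year's change in filings. The sketch below does that for zip 10027, using the values from the table above. The one-year lag is my assumption, and with only three usable pairs the result is a rough indication at best.

```python
import pandas as pd

# Year-over-year percent changes for zip 10027, copied from the table above.
df = pd.DataFrame({
    "year": [2013, 2014, 2015, 2016],
    "dobjob_change": [12.793734, 9.375, 8.148148, -1.761252],
    "income_change": [6.101866, 5.653253, 1.921875, 4.835467],
})

# Shift filings forward one year so each row pairs last year's filing change
# with this year's income change. The one-year lag is an assumption.
df["dobjob_change_prev"] = df["dobjob_change"].shift(1)

# Pearson correlation; rows with a missing value are skipped automatically.
lagged_corr = df["dobjob_change_prev"].corr(df["income_change"])
same_year_corr = df["dobjob_change"].corr(df["income_change"])

print(f"same year: {same_year_corr:.2f}, one-year lag: {lagged_corr:.2f}")
```

To run this across all ten zip codes, the shift would need to happen inside a `groupby("zip")` so that one zip code's last year does not leak into the next zip code's first year.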

# The Process
## -Why did you choose the project you chose?
I think understanding the factors that surround gentrification, and the process by which it happens, is important.
## -How did you obtain your data?
Acquiring the data was relatively easy once I remembered how SQL worked. The data came from a few painful queries to an amazing database.
## -What were the central challenges of transforming and aggregating your data?
1. Calculating the percent change was soooo much more painful than it needed to be. 
2. Much of the information in the NYC housing database is partial or incomplete. Understanding what information I could get from where, and how to join it together in a meaningful way, was the biggest challenge.

## -What discoveries did you make during the data wrangling process? (These can be either programmatic discoveries or new knowledge about the subject itself.)
My biggest discovery actually had nothing to do with my data. While I attempted to extract zip codes from one database to be joined to another, I found that when filing for DOB jobs many building owners used the wrong zip code on their applications. While normally I would assume this was an accident, the number of different zip codes used and the frequency with which the same building owner used the same wrong zip code were too high to be an accident.
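A check along these lines could make that pattern visible. This is purely a sketch: the `owner`, `filed_zip`, and `actual_zip` columns and every value in the frame are made up for illustration, and the real tables would need their own join to produce something similar.

```python
import pandas as pd

# Entirely hypothetical filings: column names and values are made up.
filings = pd.DataFrame({
    "owner": ["A", "A", "A", "B", "B", "C"],
    "filed_zip": [10027, 10027, 10027, 10030, 10031, 10026],
    "actual_zip": [10030, 10030, 10030, 10030, 10030, 10026],
})

# Flag filings where the zip on the application differs from the building's zip.
filings["mismatch"] = filings["filed_zip"] != filings["actual_zip"]

# For each owner: number of filings, share with a wrong zip, and how many
# distinct wrong zips they used. A high share with one repeated wrong zip
# is the pattern described above.
wrong = filings[filings["mismatch"]]
summary = pd.DataFrame({
    "filings": filings.groupby("owner").size(),
    "mismatch_share": filings.groupby("owner")["mismatch"].mean(),
    "distinct_wrong_zips": wrong.groupby("owner")["filed_zip"].nunique(),
}).fillna({"distinct_wrong_zips": 0})

print(summary)
```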


# What's next?

## 1. Bigger data set
Now that the groundwork has been laid, expanding this analysis to look at the whole city rather than a small subset should be relatively easy.
## 2. More journalism
Before we can draw any real conclusions, I need a better understanding of the other factors that could influence the increase in HPD complaints.
## 3. Better statistics
We all know that correlation is not causation. At best, I've shown a correlation. In addition to understanding the confounding variables discussed above, a deeper analysis of the statistical significance is necessary.
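As a starting point for that deeper analysis, a permutation test is one simple way to ask whether a correlation could have come from chance. The sketch below uses the zip 10027 percent changes from this notebook. The helper `permutation_p_value` is something I wrote for illustration, not part of any library, and with only four data points (24 possible orderings) it cannot say much.

```python
import numpy as np

# Percent changes for zip 10027 (2013-2016), copied from this notebook.
dobjob_change = np.array([12.793734, 9.375, 8.148148, -1.761252])
income_change = np.array([6.101866, 5.653253, 1.921875, 4.835467])

def permutation_p_value(x, y, n_permutations=2000, seed=0):
    """Two-sided p-value for the Pearson correlation of x and y via shuffling y."""
    rng = np.random.default_rng(seed)
    observed = np.corrcoef(x, y)[0, 1]
    hits = 0
    for _ in range(n_permutations):
        shuffled = rng.permutation(y)
        # Count shuffles whose correlation is at least as extreme as the observed one.
        if abs(np.corrcoef(x, shuffled)[0, 1]) >= abs(observed):
            hits += 1
    # Add one to numerator and denominator so the estimate is never exactly zero.
    return observed, (hits + 1) / (n_permutations + 1)

r, p = permutation_p_value(dobjob_change, income_change)
print(f"r = {r:.2f}, p = {p:.2f}")
```

Pooling all ten zip codes would give the test more to work with, although the yearly values within a zip code are not independent, so even that result would need careful reading.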
## 4. Show the changes over time
My limited understanding of Leaflet prevented this, but this map may convey the data better with a time slider.
## 5. Show data in another form
Maps are great, but they may not be the best way to display what I wanted to investigate.