# 1. Proposal:

## What is the problem you are attempting to solve?
<br> 
I want to determine whether it makes more sense to buy or rent in NY, MA, or CT next year. The project goals are to bring visibility into the rental and housing markets in these three states, and to shed light on the factors that affect the decision to buy, given the projected performance of that investment 5 or 10 years into the future. I will use economic data from the Federal Reserve as well as Zillow data.
<br>

## How is your solution valuable?
<br>
This problem affects people who are in a position to buy but are not yet sure whether the market has bottomed out or whether it makes sense to rent a bit longer. I am focusing on these three states because of their significance in my own life. The product would therefore be valuable to anyone in a similar position who is looking at the East Coast and trying to decide where to live and which location offers the most potential, both as a place to live and as an investment. User inputs could include priorities for the surrounding area, such as proximity to schools, highways, or public transportation.
<br>

## What is your data source and how will you access it?
<br>
Zillow data combined with interest rate, unemployment, and inflation data. I already have the datasets I need. They are sourced from the links below:
<br>
https://www.zillow.com/research/data/ 
https://www.kaggle.com/zillow/zecon 
https://www.kaggle.com/federalreserve/interest-rates
<br>
The variables I will be looking at are the interest rate data plus location and listing information for Massachusetts, New York, and Connecticut specifically. I may also scrape Google Maps data on proximity to major cities to get commute times.
<br>

## What techniques from the course do you anticipate using?

<br>
After EDA, cleaning, and feature engineering, I will follow the economics specialization for this project. I will rely primarily on regression models (linear, Probit, and Tobit) and robust regression (Huber, Theil-Sen, and RANSAC) to estimate gains or losses 5 or 10 years into the future in buy vs. rent scenarios, since these long-term projections would help with the decision. I will try all of these models and see which performs best. I can use K-means clustering to build county profiles that would be most advantageous based on proximity to Boston and NYC, as well as schools, public transport, and highways. I can also highlight certain counties or towns as suggestions based on this clustering.
<br>
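To illustrate why the robust estimators are on the list, here is a minimal sketch comparing an ordinary least-squares slope with a Theil-Sen slope (the median of all pairwise slopes). It is written in plain Python rather than with a library, and the yearly prices are made-up illustration values, not Zillow data.

```python
from itertools import combinations
from statistics import mean, median


def ols_slope(x, y):
    """Ordinary least-squares slope of y on x."""
    x_bar, y_bar = mean(x), mean(y)
    num = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    den = sum((xi - x_bar) ** 2 for xi in x)
    return num / den


def theil_sen_slope(x, y):
    """Theil-Sen slope: the median of the slopes over all pairs of points."""
    slopes = [
        (y[j] - y[i]) / (x[j] - x[i])
        for i, j in combinations(range(len(x)), 2)
        if x[j] != x[i]
    ]
    return median(slopes)


# Hypothetical yearly median prices (in $1000s) rising by 10 per year,
# with one deliberately bad data point in 2015.
years = list(range(2010, 2020))
prices = [300 + 10 * (yr - 2010) for yr in years]
prices[5] = 900  # outlier

print("OLS slope:      ", round(ols_slope(years, prices), 2))
print("Theil-Sen slope:", round(theil_sen_slope(years, prices), 2))
```

The single outlier pulls the least-squares slope away from the underlying trend, while the median of pairwise slopes is unaffected by it. That is the property I would be hoping for if some months in the housing data turn out to be noisy.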

## What do you anticipate to be the biggest challenge you’ll face?
<br>Compiling everything together and combining the various datasets in an optimal way, bringing in adequate complexity, and scraping the data I need and getting it to work with my dataframes.
<br>


Notes:

Square footage, proximity to city, lot size, etc. Input/free variable: how long would we stay in that location (10 years? 5?). If a regression is feasible for an area, the year-over-year increase in housing prices could factor in, along with mortgage parameters, etc.
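A rough sketch of how the horizon and mortgage parameters could feed a buy vs. rent comparison. It uses the standard fixed-rate annuity formula and deliberately ignores taxes, maintenance, closing costs, and the opportunity cost of the down payment. The function names, the interest rate, and the growth rates are all my own placeholder assumptions. The price and rent are the latest Massachusetts values visible in the Zillow outputs in section 2.

```python
def monthly_payment(principal, annual_rate, years):
    """Fixed-rate mortgage payment from the standard annuity formula."""
    r = annual_rate / 12
    n = years * 12
    if r == 0:
        return principal / n
    return principal * r * (1 + r) ** n / ((1 + r) ** n - 1)


def remaining_balance(principal, annual_rate, years, months_paid):
    """Loan balance still owed after `months_paid` payments."""
    r = annual_rate / 12
    pay = monthly_payment(principal, annual_rate, years)
    if r == 0:
        return principal - pay * months_paid
    growth = (1 + r) ** months_paid
    return principal * growth - pay * (growth - 1) / r


def buy_vs_rent(price, down_payment, annual_rate, term_years, monthly_rent,
                horizon_years, annual_appreciation, annual_rent_growth):
    """Return (net cost of buying, total rent paid) over the horizon."""
    loan = price - down_payment
    months = horizon_years * 12
    cash_out = down_payment + monthly_payment(loan, annual_rate, term_years) * months
    sale_value = price * (1 + annual_appreciation) ** horizon_years
    equity = sale_value - remaining_balance(loan, annual_rate, term_years, months)
    rent_paid = sum(
        monthly_rent * 12 * (1 + annual_rent_growth) ** year
        for year in range(horizon_years)
    )
    return cash_out - equity, rent_paid


# Placeholder scenario: 20% down, 4% rate, 2% appreciation, 3% rent growth.
for horizon in (5, 10):
    buy, rent = buy_vs_rent(385600, 0.2 * 385600, 0.04, 30, 2500, horizon, 0.02, 0.03)
    print(f"{horizon} years: net cost of buying {buy:,.0f}, rent paid {rent:,.0f}")
```

In the real project the appreciation and rent growth inputs would come from the regression models rather than being typed in by hand.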

Clustering techniques: similar neighborhoods in CT/MA/NY. Inputs for similarity?

- Schools
- Public transit
- Price per sq foot
- Increase in value
- Proximity to highways
- Colleges
- Company headquarters

Google Maps API to pull in ancillary data: proximity to trains, highways, etc. Transit time to Grand Central or Back Bay.
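Before wiring up the Google Maps API, a straight-line distance could serve as a first proxy for proximity. Below is a minimal haversine sketch. The hub and town coordinates are approximate values I typed in by hand, and straight-line distance is an assumption standing in for real transit time.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0

# Approximate (latitude, longitude) pairs, entered by hand for illustration.
HUBS = {
    "Grand Central": (40.7527, -73.9772),
    "Back Bay": (42.3474, -71.0752),
}


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two lat/lon points."""
    phi1, phi2 = radians(lat1), radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = radians(lon2 - lon1)
    a = sin(d_phi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(d_lambda / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))


def nearest_hub(lat, lon):
    """Return (hub name, distance in km) for the closest hub."""
    distances = {name: haversine_km(lat, lon, *coords) for name, coords in HUBS.items()}
    name = min(distances, key=distances.get)
    return name, distances[name]


# Roughly Stamford, CT (approximate coordinates).
hub, km = nearest_hub(41.05, -73.54)
print(f"Nearest hub: {hub}, about {km:.0f} km away")
```

The resulting distance column could be one of the clustering inputs, to be replaced with actual transit times if the API calls work out.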

See if we can get data on town/neighborhood level
Rate of increase. 

# 2. OK, let's begin with loading the data, wrangling, cleaning and EDA:

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

### Here I am importing interest rate data from 1954 to 2017. I am also importing home sale prices and home rental prices; both sets are from Zillow.

In [2]:
df_interestrates = pd.read_csv('/Users/ir3n3br4t515/Desktop/index.csv')
df_zillowsale = pd.read_csv('/Users/ir3n3br4t515/Desktop/Sale_Prices_State.csv')
df_zillowrent = pd.read_csv('/Users/ir3n3br4t515/Desktop/State_MedianRentalPrice_AllHomes.csv')


In [21]:
df_interestrates.head()

Unnamed: 0,Year,Month,Day,Federal Funds Target Rate,Federal Funds Upper Target,Federal Funds Lower Target,Effective Federal Funds Rate,Real GDP (Percent Change),Unemployment Rate,Inflation Rate
0,1954,7,1,,,,0.8,4.6,5.8,
1,1954,8,1,,,,1.22,,6.0,
2,1954,9,1,,,,1.06,,6.1,
3,1954,10,1,,,,0.85,8.0,5.7,
4,1954,11,1,,,,0.83,,5.3,


In [34]:
# I will only be looking at the year, month, unemployment, inflation and effective federal funds rate, so I will make a new df.
# .copy() makes df_econ independent of df_interestrates, so adding columns later does not trigger SettingWithCopyWarning.
df_econ = df_interestrates[['Year', 'Month', 'Unemployment Rate', 'Inflation Rate', 'Effective Federal Funds Rate']].copy()
df_econ.head()

Unnamed: 0,Year,Month,Unemployment Rate,Inflation Rate,Effective Federal Funds Rate
0,1954,7,5.8,,0.8
1,1954,8,6.0,,1.22
2,1954,9,6.1,,1.06
3,1954,10,5.7,,0.85
4,1954,11,5.3,,0.83


In [33]:
df_econ_transposed = df_econ.T
df_econ_transposed.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,894,895,896,897,898,899,900,901,902,903
Year,1954.0,1954.0,1954.0,1954.0,1954.0,1954.0,1955.0,1955.0,1955.0,1955.0,...,2016.0,2016.0,2016.0,2016.0,2016.0,2016.0,2017.0,2017.0,2017.0,2017.0
Month,7.0,8.0,9.0,10.0,11.0,12.0,1.0,2.0,3.0,4.0,...,8.0,9.0,10.0,11.0,12.0,12.0,1.0,2.0,3.0,3.0
Unemployment Rate,5.8,6.0,6.1,5.7,5.3,5.0,4.9,4.7,4.6,4.7,...,4.9,4.9,4.8,4.6,4.7,,4.8,4.7,,
Inflation Rate,,,,,,,,,,,...,2.3,2.2,2.1,2.1,2.2,,2.3,2.2,,
Effective Federal Funds Rate,0.8,1.22,1.06,0.85,0.83,1.28,1.39,1.29,1.35,1.43,...,0.4,0.4,0.4,0.41,0.54,,0.65,0.66,,


In [38]:
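# Build a 'YYYY-MM' key that matches the Zillow column names.
# Adding Year and Month numerically would give 1954 + 7 = 1961, which is not a usable date key.
df_econ = df_econ.assign(
    YearMonth=df_econ['Year'].astype(str) + '-' + df_econ['Month'].astype(str).str.zfill(2)
)
df_econ.head()

Unnamed: 0,Year,Month,Unemployment Rate,Inflation Rate,Effective Federal Funds Rate,YearMonth
0,1954,7,5.8,,0.8,1954-07
1,1954,8,6.0,,1.22,1954-08
2,1954,9,6.1,,1.06,1954-09
3,1954,10,5.7,,0.85,1954-10
4,1954,11,5.3,,0.83,1954-11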


### Below I am creating new dfs for our rent and sale data so that we are looking only at MA, NY and CT as these are the three states we are interested in based on the proposal. 

In [26]:
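# Select the three states by name rather than by row position, so the filter still works if the file's row order changes.
states = ['New York', 'Massachusetts', 'Connecticut']
df_rent = df_zillowrent[df_zillowrent['RegionName'].isin(states)]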

df_rent.head()

Unnamed: 0,RegionName,SizeRank,2010-01,2010-02,2010-03,2010-04,2010-05,2010-06,2010-07,2010-08,...,2018-12,2019-01,2019-02,2019-03,2019-04,2019-05,2019-06,2019-07,2019-08,2019-09
2,New York,3,,,,,,,,,...,3150,3100,3195.0,3200.0,3490,3295,3260,3200,3375.0,3234.0
13,Massachusetts,14,,,1552.5,1675.0,1552.5,1775.0,1550.0,1600.0,...,2500,2550,2600.0,2699.0,2700,2700,2700,2695,2599.0,2500.0
28,Connecticut,29,,1700.0,1800.0,1900.0,1850.0,2000.0,1875.0,1800.0,...,1700,1725,1750.0,1800.0,1800,1850,1850,1850,1800.0,1800.0


In [27]:
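# Select the three states by name rather than by row position, so the filter still works if the file's row order changes.
states = ['New York', 'Massachusetts', 'Connecticut']
df_sale = df_zillowsale[df_zillowsale['RegionName'].isin(states)]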

df_sale.head()

Unnamed: 0,RegionID,RegionName,SizeRank,2008-03,2008-04,2008-05,2008-06,2008-07,2008-08,2008-09,...,2018-12,2019-01,2019-02,2019-03,2019-04,2019-05,2019-06,2019-07,2019-08,2019-09
1,43,New York,2,,,,,,,,...,291200,297800,297400,294700,292600.0,296500.0,306300.0,311100.0,314400.0,
12,26,Massachusetts,13,314500.0,309500.0,297400.0,291300.0,287900.0,288200.0,285400.0,...,368000,368500,378400,388000,390200.0,384100.0,383300.0,385800.0,385600.0,
27,11,Connecticut,28,264000.0,263100.0,254900.0,242200.0,237300.0,237600.0,238700.0,...,236500,236500,234800,238800,241500.0,247300.0,242800.0,241000.0,239100.0,
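The Zillow tables are wide (one column per month) while the economic data is long (one row per month). One possible way to line them up is to melt the Zillow table into long form and merge on a `YYYY-MM` key. This is only a sketch on toy frames: the two sale-price columns copy values shown above, while the unemployment numbers and the names `sale_wide`, `sale_long`, `econ`, and `merged` are placeholders of mine.

```python
import pandas as pd

# Toy stand-in for df_sale (two states, two months).
sale_wide = pd.DataFrame({
    "RegionName": ["Massachusetts", "Connecticut"],
    "2008-03": [314500.0, 264000.0],
    "2008-04": [309500.0, 263100.0],
})

# Toy stand-in for df_econ; the unemployment values are placeholders.
econ = pd.DataFrame({
    "Year": [2008, 2008],
    "Month": [3, 4],
    "Unemployment Rate": [5.0, 5.1],
})
econ["YearMonth"] = econ["Year"].astype(str) + "-" + econ["Month"].astype(str).str.zfill(2)

# Wide -> long: one row per (state, month).
sale_long = sale_wide.melt(
    id_vars="RegionName", var_name="YearMonth", value_name="SalePrice"
)

# Attach the economic indicators for the matching month.
merged = sale_long.merge(
    econ[["YearMonth", "Unemployment Rate"]], on="YearMonth", how="left"
)
print(merged)
```

One caveat from the real data: the interest-rate table can hold more than one row for the same month (the transposed output shows December 2016 twice), so it would probably need to be de-duplicated or aggregated per month before a merge like this.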


### Let's see how much of our data we are missing!

In [17]:
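# Percent of rows missing in each column. isnull().mean() divides by the total row count;
# dividing by count() would divide by the non-missing rows only and can give values above 100 or inf.
df_sale.isnull().mean() * 100


RegionID        0.000000
RegionName      0.000000
SizeRank        0.000000
2008-03        33.333333
2008-04        33.333333
                 ...
2019-05         0.000000
2019-06         0.000000
2019-07         0.000000
2019-08         0.000000
2019-09       100.000000
Length: 142, dtype: float64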

In [18]:
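# Percent of rows missing in each column. isnull().mean() divides by the total row count;
# dividing by count() would divide by the non-missing rows only and can give values above 100 or inf.
df_rent.isnull().mean() * 100


RegionName      0.000000
SizeRank        0.000000
2010-01       100.000000
2010-02        66.666667
2010-03        33.333333
                 ...
2019-05         0.000000
2019-06         0.000000
2019-07         0.000000
2019-08         0.000000
2019-09         0.000000
Length: 119, dtype: float64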

In [19]:
# Percent of rows missing in each column. isnull().mean() divides by the total row count;
# dividing by count() would divide by the non-missing rows only and can give values above 100.
df_interestrates.isnull().mean() * 100


Year                             0.000000
Month                            0.000000
Day                              0.000000
Federal Funds Target Rate       48.893805
Federal Funds Upper Target      88.606195
Federal Funds Lower Target      88.606195
Effective Federal Funds Rate    16.814159
Real GDP (Percent Change)       72.345133
Unemployment Rate               16.814159
Inflation Rate                  21.460177
dtype: float64

For loop: for every row in this df, I will generate a request of the same shape (using the Python requests library). Every request I send hopefully gets a response. I can save the raw JSON to separate files, load it into a new df, or keep it as a Python list where each element is JSON data. If we generate requests for a lot of rows, we might run into rate limits and it might take a while to run. I could put delays between requests.
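A minimal sketch of that loop, assuming one request per row and a fixed pause between requests. The helper names, the example URL, and the parameter names are hypothetical. The fetch step is passed in as a function so the loop can be tried offline with a stub before pointing it at a real API.

```python
import json
import time


def collect_responses(rows, build_params, fetch, delay_seconds=1.0):
    """Send one request per row, pausing between requests to respect rate limits."""
    results = []
    for i, row in enumerate(rows):
        if i > 0:
            time.sleep(delay_seconds)  # delay between consecutive requests
        results.append(fetch(build_params(row)))
    return results


def fetch_with_requests(params, url="https://example.com/api"):
    """Real fetch step; the URL is a placeholder, not an actual endpoint."""
    import requests  # imported here so the offline demo below runs without it

    response = requests.get(url, params=params, timeout=10)
    response.raise_for_status()
    return response.json()


def save_json(results, path):
    """Write the list of JSON responses to a single file."""
    with open(path, "w") as f:
        json.dump(results, f)


# Offline demo: a stub stands in for the API. With a dataframe, the rows could
# come from df.to_dict("records").
rows = [{"town": "Stamford"}, {"town": "Worcester"}]


def fake_fetch(params):
    return {"origin": params["origin"], "status": "OK"}


responses = collect_responses(
    rows, lambda row: {"origin": row["town"]}, fake_fetch, delay_seconds=0.1
)
print(responses)
```

Swapping `fake_fetch` for `fetch_with_requests` would send real requests. The list can then be saved with `save_json` or loaded into a df with `pd.DataFrame(responses)`.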

Maybe going from listing to town level. Or county.