# Applying ML to Predict NYC Real Estate Value & Investment Opportunity

*This notebook scrapes streeteasy.com for for-sale listings across the Five Boroughs, then applies ML to build and evaluate a predictive pricing model*

In [1]:
from selenium import webdriver
# from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
# from selenium.common.exceptions import TimeoutException
import pandas as pd
import numpy as np
import time
import random
# from retrying import retry
import tenacity
# import json
# import os
# from collections import OrderedDict
# from urlparse import urlparse

In [2]:
boroughs = ['manhattan','brooklyn','queens','bronx','staten-island']

In [3]:
def url_def(lst):
    # Build the StreetEasy for-sale search URL for each borough slug
    return ['http://streeteasy.com/for-sale/' + str(item) + '/status:listed?refined_search=true'
            for item in lst]

In [4]:
urls = url_def(boroughs)
urls

['http://streeteasy.com/for-sale/manhattan/status:listed?refined_search=true',
 'http://streeteasy.com/for-sale/brooklyn/status:listed?refined_search=true',
 'http://streeteasy.com/for-sale/queens/status:listed?refined_search=true',
 'http://streeteasy.com/for-sale/bronx/status:listed?refined_search=true',
 'http://streeteasy.com/for-sale/staten-island/status:listed?refined_search=true']

### Feature Selection

What would be the most useful features to collect for this project?

- **Type of house, location of house, neighborhood, number of rooms, number of baths**, availability of building amenities (laundry, doorman, super), proximity to transit and proximity to the waterfront, with the price of the house as the target.

Features in bold are available on streeteasy...

Q. ***Can any useful features be engineered from those available or retrieved from an alternate source?***

### Scraped features

The features available from Streeteasy.com are:

 - House type, Geo-location, House address, No. of beds, No. of baths, Square area of house, Neighborhood, Price
 
*What features can be derived from these? What additional insight will these derived features provide?*
*Can more useful features be retrieved from other sources to complement Streeteasy?*
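
One possible answer, sketched under assumptions: price per square foot and the distance from a listing to a fixed reference point can both be derived from the scraped columns. The helper names `haversine_km` and `price_per_sqft` and the reference coordinates are illustrative choices of mine, not something the scrape provides.

```python
import math

# Hypothetical reference point (roughly Midtown Manhattan); an assumption for illustration.
REFERENCE_POINT = (40.7549, -73.9840)

def haversine_km(lat1, lng1, lat2, lng2):
    """Great-circle distance in kilometres between two (lat, lng) points."""
    r = 6371.0  # mean Earth radius in km
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lng2 - lng1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def price_per_sqft(price, area):
    """Return price / area, or None when either value is missing or the area is zero."""
    if price is None or not area:
        return None
    return price / float(area)

# Example using one scraped row (288 Albany Avenue, Crown Heights)
dist = haversine_km(40.6705, -73.9396, *REFERENCE_POINT)
ppsf = price_per_sqft(1299000.0, 3000.0)
```

Whether distance to a single point is a useful proxy for "location" is an open question; distance to the nearest subway stop would need data from another source.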

In [5]:
listing_type = []
lat = []
lng = []
address = []
no_of_beds = []
no_of_baths = []
sq_area = []
n_hood = []
borough = []
price = []

### *Thoughts*

Do the project requirements dictate the statistical method/algorithm used? Will these, in turn, determine whether categorical or continuous variables are required?

- *Linear Regression*
- *Logistic Regression* **(This is not a classification task...)**
- *Random forest*

**N.B. This requirement directly dictates the page_scrape function below.**

**I'm going with numerical variables where possible...**
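
A minimal sketch of what "numerical where possible" could look like at parse time. `to_number` is a hypothetical helper, not part of the scraper below; it returns `None` instead of the string 'N/A' so that pandas can keep the column numeric and treat the gap as missing.

```python
import re

def to_number(text):
    """Parse the first number in a scraped string; return None if there is none.

    Handles values such as '$1,299,000', '2 beds', '1.5 baths' and '1,245 ft'.
    """
    if text is None:
        return None
    match = re.search(r'\d[\d,]*(?:\.\d+)?', text)
    if match is None:
        return None
    return float(match.group(0).replace(',', ''))

parsed = [to_number(s) for s in ['$1,299,000', '2 beds', '1.5 baths', 'studio']]
```

Categorical fields such as building type and neighborhood would still need encoding before a linear model can use them.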

In [32]:
def page_scrape(page):
    # count = 0
    # On slow connections...
    # result = WebDriverWait(page, 30).until(EC.presence_of_element_located((By.ID, 'result-details')))
    # listings = result.find_element_by_tag_name('ul').find_element_by_tag_name('li')
    listings = page.find_element_by_id('result-details').find_element_by_tag_name('ul').find_elements_by_tag_name('li')
    # collect data here by iterating through each listing and appending to our lists
    for l in listings[:14]:
        # I need an IF statement to test whether the listing is legit before scraping to reduce the amount of N/A values
        # Initiating a counter to help identify at what listing the code breaks, if it does...
        # This has become redundant with the introduction of tenacity retry function
        # count +=1
        
        # longitude and latitude
        g = None
        try:
            g = l.get_attribute('se:map:point')
            if g:
                lt, ln = g.split(',')
                lat.append(float(lt))
                lng.append(float(ln))
            else:
                lat.append('N/A')
                lng.append('N/A')
        except:
            lat.append('N/A')
            lng.append('N/A')
        # time.sleep(1)
        
        # address
        ad = None
        try:
            ad = l.find_element_by_class_name('details-title').text.split('\n')[0]
            if ad:
                address.append(ad)
            else:
                address.append('N/A')
        except:
            address.append('N/A')
        # time.sleep(1)
        
        # price
        p = None
        try:
            p = float(l.find_element_by_class_name('price').text.replace('$','').replace(',', ''))
            if p:
                price.append(p)
            else:
                price.append('N/A')
        except:
            price.append('N/A')
        # time.sleep(1)
        
        # number of beds
        bd_detail = None
        try:
            bd_detail = l.find_element_by_class_name('details_info').find_element_by_tag_name('span')
            if bd_detail.text.find('bed') > 0:
                no_of_beds.append(float(bd_detail.text.split(' ')[0]))
            # do we want this as a string or float? what are the regression/ml requirements?
            else:
                no_of_beds.append('N/A')
        except:
            no_of_beds.append('N/A')
        # time.sleep(1)
        
        # number of baths
        baths = None
        try:
            lstn_details = l.find_element_by_class_name('details_info').find_elements_by_tag_name('span')
            for detail in lstn_details:
                if detail.text.find('bath') > 0:
                    try:
                        baths = float(detail.text.split(' ')[0])
                    except:
                        baths = 'N/A'
        except:
            baths = 'N/A'
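        # Record 'N/A' (not None) when no bath detail was found, matching the other features
        no_of_baths.append(baths if baths is not None else 'N/A')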
        # time.sleep(1)
        
        # square area NB: value in previous listing is being appended to next listing. FIX!
        # update: fixed.
        area = None
        try:
            l_details = l.find_element_by_class_name('details_info').find_elements_by_tag_name('span')
            for detail in l_details:
                if detail.text.find('ft') > 0:
                    area = float(detail.text.split(' ')[0].replace(',', ''))
            if area:
                sq_area.append(area)
            else:
                sq_area.append('N/A')
        except:
            sq_area.append('N/A')
        # time.sleep(1)
        
        # listing type and neighborhood
        l_type = None
        nhood = None
        try:
            area_details = l.find_elements_by_class_name('details_info')[1].text
            l_type, nhood = area_details.split(' in ')
            if l_type:
                listing_type.append(l_type)
            else:
                listing_type.append('N/A')
            if nhood:
                n_hood.append(nhood)
            else:
                n_hood.append('N/A')
        except:
            listing_type.append('N/A')
            n_hood.append('N/A')
        # time.sleep(1)
    # if count == 14:
        # print('Moving on to the next page...')
    # streeteasy introduces a captcha when they suspect scraping. How will this be overridden?
    # fixed by using Firefox in place of Chrome
    
    return listings

In [7]:
# To navigate to the next page. Self-explanatory, really...

def next_page():
    nxt = WebDriverWait(driver, 10).until(EC.element_to_be_clickable((By.CLASS_NAME, 'next')))
    # formerly: nxt = listns[-1].find_element_by_class_name('next'); WebDriverWait improves stability
    nxt.click()

In [41]:
# To "refresh" the browser page if a 300000ms TimeoutException occurs due to page crashing...

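# N.B. tenacity.retry() with no arguments retries forever, with no wait between attempts
@tenacity.retry()
def retry():
    driver.get(driver.current_url)

# OR (needs the TimeoutException import, currently commented out above):
# ret = tenacity.Retrying(retry=tenacity.retry_if_exception_type(TimeoutException))
# ret.call(driver.get, driver.current_url)  # pass the callable, not the result of calling it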

# checkpointing using ediblepickle package is another option...    

In [46]:
start_time = time.time()
# manhattan almost done, restarting with brooklyn due to strange firefox memory crash; using range(1,)
for x in range(0, len(urls)):
    start_borough_time = time.time()
    counter = 1
    page_crash = 0
    driver = webdriver.Firefox()
    driver.get(urls[x])
    time.sleep(2) # possible crash here without time.sleep?
    listns = page_scrape(driver)
    last_page = int(listns[-1].find_elements_by_class_name('page')[-1].text)
    # print 'Counter:', counter
    next_page()
    for i in range(1, last_page):
        try:
            counter += 1 # increment repetition may occur here if a break occurs in this try loop... avoided by using scraping in except loop
            # borough.append(boroughs[x])
            listns = page_scrape(driver)
            next_page()
        except:
            page_crash += 1
            print('Retrying...')
            # ret.call(driver.get(driver.current_url))
            retry()
        finally:
            delay = random.uniform(0.5, 1)
            time.sleep(delay)
    
    print(time.time() - start_borough_time)
    driver.close()
    
    dic = {'building type':listing_type, 'latitude':lat, 'longitude':lng, 'address':address, 'beds':no_of_beds, 'baths':no_of_baths, 'area':sq_area, 'neighborhood':n_hood, 'price':price}
    data = pd.DataFrame(dic)
    data.to_csv(boroughs[x]+'.csv', index=False)
    
    # re-initializing lists
    listing_type = []
    lat = []
    lng = []
    address = []
    no_of_beds = []
    no_of_baths = []
    sq_area = []
    n_hood = []
    borough = []
    price = []
print(time.time() - start_time)

Retrying...
323.078999996
324.131999969


In [36]:
driver.close()

In [43]:
dic = {'building type':listing_type, 'latitude':lat, 'longitude':lng, 'address':address, 'beds':no_of_beds, 'baths':no_of_baths, 'area':sq_area, 'neighborhood':n_hood, 'price':price}

{'address': [u'834 Sterling Place #PH4',
  u'288 Albany Avenue',
  u'1291 Gates',
  u'798 Knickerbocker Avenue',
  u'160 Imlay Street #3A5',
  u'160 Imlay Street #4D3',
  u'160 Imlay Street #4D1',
  u'455 Marlborough Rd',
  u'322 Empire Boulevard',
  u'1854 85th Street',
  u'385 East 16th Street #6B',
  u'1151 Rogers Avenue',
  u'1889 Albany Avenue',
  u'560 East 28th Street',
  u'622 Grand Avenue #103',
  u'715 Macon Street',
  u'466 East 92nd Street',
  u'65 Monroe Street',
  u'105 Eighth Avenue #6',
  u'155 Hicks Street #1A',
  u'145 Newell Street',
  u'21 Dikeman Street',
  u'1031 E 57th Street',
  u'1976 Ocean Avenue #2',
  u'448 Neptune Avenue #20S',
  u'425 Prospect Place #2B',
  u'108 Neptune Avenue #7F',
  u'558 79',
  u'1 Northside Piers #10AB',
  u'37 West End Avenue #3B',
  u'917 Cleveland Street',
  u'604 East 82nd Street',
  u'3121 Farragut Rd',
  u'49 East 54th Street',
  u'1100 East 38th Street',
  u'2475 Ocean Avenue #6',
  u'1213 Avenue Z #C29',
  u'347 82nd Street',


In [44]:
data = pd.DataFrame(dic)
data.to_csv('Queens.csv', index=False)

In [39]:
dic = {'building type':listing_type, 'latitude':lat, 'longitude':lng, 'address':address, 'beds':no_of_beds, 'baths':no_of_baths, 'area':sq_area, 'neighborhood':n_hood, 'price':price}
data = pd.DataFrame(dic)
data.to_csv('Brooklyn.csv')
# The data is not yet good enough to warrant saving to disk...

data.head()

| Unnamed: 0 | address | area | baths | beds | building type | latitude | longitude | neighborhood | price |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 834 Sterling Place #PH4 | | 1.0 | 2.0 | Condo | 0.0 | 0.0 | Crown Heights | 890000.0 |
| 1 | 288 Albany Avenue | 3000.0 | 2.0 | 5.0 | Multi-family | 40.6705 | -73.9396 | Crown Heights | 1299000.0 |
| 2 | 1291 Gates | | 4.0 | 6.0 | Multi-family | 40.6943 | -73.9177 | Bushwick | 1275000.0 |
| 3 | 798 Knickerbocker Avenue | | | | | 40.6925 | -73.9078 | | 2395000.0 |
| 4 | 160 Imlay Street #3A5 | 1245.0 | 2.0 | 1.0 | Condo | 40.6805 | -74.0103 | Red Hook | 1325000.0 |


In [40]:
data.tail()

| Unnamed: 0 | address | area | baths | beds | building type | latitude | longitude | neighborhood | price |
|---|---|---|---|---|---|---|---|---|---|
| 6545 | 66-68 Washington Avenue #5R | 1123.0 | 1.0 | 2.0 | Condo | 40.6967 | -73.9678 | Clinton Hill | 699000.0 |
| 6546 | 405 Dean Street #4A | 1125.0 | 1.0 | 1.0 | Condo | 40.6828 | -73.9778 | Park Slope | 689000.0 |
| 6547 | | | | | | | | | |
| 6548 | | | | | | | | | |
| 6549 | | | | | | | | | |


In [18]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12376 entries, 0 to 12375
Data columns (total 9 columns):
address          12376 non-null object
area             12376 non-null object
baths            12297 non-null object
beds             12376 non-null object
building type    12376 non-null object
latitude         12376 non-null object
longitude        12376 non-null object
neighborhood     12376 non-null object
price            12376 non-null object
dtypes: object(9)
memory usage: 870.3+ KB


In [17]:
print(data.describe())

       address   area    baths     beds building type      latitude  \
count    12376  12376  12297.0  12376.0         12376  12376.000000   
unique   10444   2389     28.0     21.0             9   2066.000000   
top        N/A    N/A      1.0      2.0         Condo     40.772022   
freq        20   3717   4509.0   3803.0          6738    177.000000   

           longitude     neighborhood      price  
count   12376.000000            12376    12376.0  
unique   1724.000000              100     1971.0  
top       -73.990588  Upper West Side  2995000.0  
freq      177.000000              930      114.0  


### Handling 'N/A' values, duplicates and outliers...

*Should samples with missing data be discarded, or should the gaps be filled with the feature median? What is the usual practice here?*

It depends on which feature the 'N/A' occurs in.

*Are statistical outliers really outliers in this use case? (Yes/**No**)?*

**A check for repeated listings must be implemented. Where in the pipeline is the best place for it?**
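
One way these decisions could be applied after scraping, as a sketch only. The toy frame, the choice to drop rows without a price, the median fill for area, and the "same address and price means a repeated listing" rule are all assumptions of mine, not settled policy for this project.

```python
import pandas as pd

# Toy frame standing in for the scraped data; the values are illustrative only.
raw = pd.DataFrame({
    'address': ['1 Example St', '1 Example St', '2 Example Ave', '3 Example Rd'],
    'area':    [1000.0, 1000.0, 'N/A', 800.0],
    'price':   [500000.0, 500000.0, 750000.0, 'N/A'],
})

numeric_cols = ['area', 'price']

clean = raw.copy()
# Coerce numeric columns: 'N/A' strings become NaN and the dtype becomes float.
clean[numeric_cols] = clean[numeric_cols].apply(pd.to_numeric, errors='coerce')
# The target cannot be imputed sensibly, so drop rows without a price.
clean = clean.dropna(subset=['price'])
# Treat rows with the same address and price as repeated listings.
clean = clean.drop_duplicates(subset=['address', 'price'])
# Fill remaining gaps in a feature with that feature's median.
clean['area'] = clean['area'].fillna(clean['area'].median())
```

Doing the duplicate check here, on the assembled DataFrame, keeps the scraper simple; checking inside `page_scrape` would avoid storing repeats at all but needs a set of seen addresses.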

### Exploratory Data Analysis

In [None]:
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
# sns.set(style='whitegrid', context='notebook')


### Model Equations
- Multivariate Linear Regression:
 - $y = w_0x_0 + w_1x_1 + ... + w_mx_m = \sum\limits_{i=0}^{m} w_ix_i = w^Tx$
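
With the usual convention $x_0 = 1$, the weight $w_0$ acts as the intercept. As a worked note (standard ordinary least squares, which I am assuming here rather than something already decided in this notebook), the weights would be chosen to minimise the sum of squared errors over the $n$ training listings:

$$J(w) = \frac{1}{2}\sum_{j=1}^{n}\left(y^{(j)} - w^Tx^{(j)}\right)^2$$

Here $x^{(j)}$ is the feature vector of listing $j$ and $y^{(j)}$ is its price.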

In [None]:
# sklearn.cross_validation was removed in scikit-learn 0.20; model_selection replaces it
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn import metrics


### Performance Measure

RMSE (root mean squared error)
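
For reference, $\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{j=1}^{n}\left(y^{(j)} - \hat{y}^{(j)}\right)^2}$, where $\hat{y}^{(j)}$ is the predicted price of listing $j$. A minimal plain-Python sketch follows; the function name `rmse` and the example prices are mine and are not results from this notebook.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError('inputs must be non-empty and of equal length')
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / float(len(squared_errors)))

# Illustrative values only.
example = rmse([100.0, 200.0, 300.0], [110.0, 190.0, 300.0])
```

Because RMSE is in the same units as the target (dollars), it is easy to read against listing prices, though it is sensitive to the very expensive listings in the data.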