# Applying ML to Predict NYC Real Estate Value & Investment Opportunity

*This notebook scrapes streeteasy.com for listings for sale in the Five Boroughs, then uses ML to build and evaluate a model that predicts their value.*

In [1]:
# 1. what are the requirements?
# from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver.common.keys import Keys
# from selenium.webdriver.support.ui import WebDriverWait
# from selenium.webdriver.support import expected_conditions as EC
import pandas as pd
import time
import json
# from collections import OrderedDict
from urlparse import urlparse

In [2]:
boroughs = ['manhattan','brooklyn','queens','bronx','staten-island']

In [3]:
def url_def(lst):
    # build the 'for sale' search URL for each borough slug
    return ['http://streeteasy.com/for-sale/' + str(item) + '/status:listed?refined_search=true'
            for item in lst]

In [4]:
urls = url_def(boroughs)
urls

['http://streeteasy.com/for-sale/manhattan/status:listed?refined_search=true',
 'http://streeteasy.com/for-sale/brooklyn/status:listed?refined_search=true',
 'http://streeteasy.com/for-sale/queens/status:listed?refined_search=true',
 'http://streeteasy.com/for-sale/bronx/status:listed?refined_search=true',
 'http://streeteasy.com/for-sale/staten-island/status:listed?refined_search=true']

In [21]:
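# to visually inspect all generated urls, open each one briefly in Chrome
# (left disabled inside a string: it launches one browser window per borough)
'''
for url in urls:
    driver = webdriver.Chrome()
    try:
        driver.get(url)
        time.sleep(0.5)
    except Exception as e:
        print('Could not load %s: %s' % (url, e))
    finally:
        driver.close()
'''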

### Feature Selection

What would be the most useful features to collect for this project?

- **Type of house, location, neighborhood, number of rooms, number of baths**, building amenities (laundry, doorman, super), proximity to transit and proximity to the waterfront, with the price of the house as the target.

Features in bold are available on streeteasy...

Q. ***Can any useful features be engineered from those available or retrieved from an alternate source?***

### Scraped features

The features available from Streeteasy.com are:

 - House type, Geo-location, House address, No. of beds, No. of baths, Square area of house, Neighborhood, Price
 
*What features can be derived from these? What additional insight will these derived features provide?*
*Can more useful features be retrieved from other sources to complement Streeteasy?*
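
One possible answer, as a minimal sketch: the scraped `lat,long` string can be split into numeric coordinates, price and area give a price per square foot, and coordinates give a distance to any reference point (a transit hub or the waterfront, say). The helper names `split_geo`, `price_per_sqft` and `haversine_km` are hypothetical, and the sample listing values are made up for illustration.

```python
import math

def split_geo(geo):
    """Turn a 'lat,long' string such as '40.73989868,-73.98729706' into two floats."""
    lat, lon = geo.split(',')
    return float(lat), float(lon)

def price_per_sqft(price, area):
    """Price divided by area; None when either value is missing or the area is zero."""
    if not isinstance(price, (int, float)) or not isinstance(area, (int, float)) or area == 0:
        return None
    return price / area

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points given in degrees."""
    r = 6371.0  # mean Earth radius in km
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

# Two coordinate strings in the format the scraper collects
a = split_geo('40.73989868,-73.98729706')
b = split_geo('40.71559906,-74.01609802')
distance = haversine_km(a[0], a[1], b[0], b[1])

# Hypothetical listing: $1,500,000 for 1,000 sq ft
ppsf = price_per_sqft(1500000.0, 1000.0)
```

Price per square foot normalizes for size, which makes listings in different neighborhoods easier to compare; distance features would need reference coordinates from another source.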

In [19]:
listing_id = []
listing_type = []
lat_long = []
address = []
no_of_beds = []
no_of_baths = []
sq_area = []
n_hood = []
price = []

### *Thoughts*

Do the project requirements dictate the statistical method/algorithm used? Will these, in turn, determine whether categorical or continuous variables are required?

- *Linear Regression*
- *Logistic Regression*
- *Random forest*

Number of beds, baths, square area can be either categorical or continuous...

**N.B. This requirement directly dictates the page_scrape function below.**
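
To make the distinction concrete, here is a minimal sketch of the two treatments of the same column. Kept as a number, the bed count enters a model as a single ordered feature; treated as a category, it becomes one indicator column per distinct value. The `one_hot` helper is hypothetical and written without libraries for clarity; `pandas.get_dummies` performs the same transformation on a DataFrame.

```python
def one_hot(values):
    """Return (categories, rows), where each row has a 1 in the column of its category."""
    categories = sorted(set(values))
    index = {c: i for i, c in enumerate(categories)}
    rows = []
    for v in values:
        row = [0] * len(categories)
        row[index[v]] = 1
        rows.append(row)
    return categories, rows

beds = [3.0, 1.0, 2.0, 3.0]

# Continuous treatment: use the numbers directly as a single feature column
continuous = [[b] for b in beds]

# Categorical treatment: one indicator column per distinct bed count
categories, categorical = one_hot(beds)
```

Storing the scraped values as floats keeps both options open, since a numeric column can always be converted to categories later.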

In [34]:
def page_scrape(page):
    count = 0
    # time.sleep(5)
    listings = page.find_element_by_xpath('//*[@id="result-details"]/ul').find_elements_by_tag_name('li')
    # collect data here by iterating through each listing and appending to our lists
    for l in listings[:14]:
        # initiate a counter to help identify at what listing the code breaks, if it does...
        # also, a counter for the number of pages scraped should be implemented in the function that navigates pages
        count +=1
        
        # longitude and latitude
        g = l.get_attribute('se:map:point')
        if g:
            lat_long.append(g)
        else:
            lat_long.append('N/A')
        g = None
        time.sleep(1)
        
        # address
        ad = l.find_element_by_class_name('details-title').text.split('\n')[0]
        if ad:
            address.append(ad)
        else:
            address.append('N/A')
        ad = None
        time.sleep(1)
        
        # price
        p = float(l.find_element_by_class_name('price').text.replace('$','').replace(',', ''))
        if p:
            price.append(p)
        else:
            price.append('N/A')
        p = None
        time.sleep(1)
        
        # number of beds
        bd_detail = l.find_element_by_class_name('details_info').find_element_by_tag_name('span')
        if bd_detail.text.find('bed') > 0:
            no_of_beds.append(float(bd_detail.text.split(' ')[0]))
            # do we want this as a string or float? what are the regression/ml requirements?
        else:
            no_of_beds.append('N/A')
        bd_detail = None
        time.sleep(1)
        
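        # number of baths and square area (both are read from the same 'details_info' spans)
        # initialise both so a listing with no bath or area span does not raise a NameError
        baths = None
        area = None
        lstn_details = l.find_element_by_class_name('details_info').find_elements_by_tag_name('span')
        for detail in lstn_details: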
            if detail.text.find('bath') > 0:
                baths = float(detail.text.split(' ')[0])
            if detail.text.find('ft') > 0:
                area = float(detail.text.split(' ')[0].replace(',', ''))
        if baths:
            no_of_baths.append(baths)
        else:
            no_of_baths.append('N/A')
        baths = None
        time.sleep(1)
        
        # square area NB: value in previous listing is being appended to next listing. FIX!
        # update: fixed.
        if area:
            sq_area.append(area)
        else:
            sq_area.append('N/A')
        area = None
        time.sleep(1)
        
        # listing type and neighborhood
        area_details = l.find_elements_by_class_name('details_info')[1].text
        # partition() avoids an IndexError when the text contains no ' in '
        l_type, _, nhood = area_details.partition(' in ')
        l_type = l_type.strip(' ')
        nhood = nhood.strip(' ')
            
        if l_type:
            listing_type.append(l_type)
            # note: 'Condop' is a distinct NYC listing type (a co-op/condo hybrid), not a misspelling of 'Condo';
            # keep it as its own category and only normalize genuine misspellings
        else:
            listing_type.append('N/A')
        if nhood:
            n_hood.append(nhood)
        else:
            n_hood.append('N/A')
        l_type = None
        nhood = None
        time.sleep(1)
    
    print(count)
    if count == 14:
        print('Moving on to the next page...')
    # streeteasy introduces a captcha when they suspect scraping. How will this be overridden?

### Handling 'N/A' values and outliers...

*Are samples with missing data discarded or replaced with the feature median? What is the norm as pertains to this situation...?*

Are statistical outliers really outliers in this use case? (Yes/**No**)?
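
Both options for missing data can be compared on a small scale. The sketch below assumes the `'N/A'` placeholder that `page_scrape` appends, shows dropping versus median imputation, and flags (rather than removes) values outside the conventional 1.5 × IQR fences so they can be inspected before any decision. The function names are hypothetical and the sample areas are made up.

```python
import statistics

def drop_missing(values, missing='N/A'):
    """Discard placeholder entries."""
    return [v for v in values if v != missing]

def impute_median(values, missing='N/A'):
    """Replace placeholder entries with the median of the observed values."""
    med = statistics.median(drop_missing(values, missing))
    return [med if v == missing else v for v in values]

def iqr_outliers(values):
    """Return the values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4, method='inclusive')
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [v for v in values if v < low or v > high]

areas = [650.0, 'N/A', 900.0, 1200.0, 'N/A', 800.0, 9500.0]

observed = drop_missing(areas)
imputed = impute_median(areas)
flagged = iqr_outliers(observed)
```

A very large apartment is a real listing, not a measurement error, so flagged values are better treated as candidates for review than as rows to delete.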

In [None]:
def get_last_page(page):
    last_page = int(page.find_element_by_xpath('//*[@id="result-details"]/ul/li[17]/nav/span[10]').text)
    return last_page

In [15]:
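def nxt_page(page):
    # follow the 'next' link in the pagination nav; a missing link means this is the last page
    try:
        page.find_element_by_class_name('next').find_element_by_tag_name('a').click()
        print("You're on the next page...")
    except Exception:
        print('You have reached the last page...')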
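
Pagination can also be driven by rewriting the results URL rather than interacting with the page. The sketch below assumes, without having confirmed it against the site, that results pages are selected with a `page` query parameter; `page_url` is a hypothetical helper.

```python
try:
    from urllib.parse import urlparse, urlunparse, parse_qsl, urlencode  # Python 3
except ImportError:
    from urlparse import urlparse, urlunparse, parse_qsl  # Python 2
    from urllib import urlencode

def page_url(url, page_number):
    """Return `url` with its (assumed) `page` query parameter set to `page_number`."""
    parts = urlparse(url)
    # keep every existing parameter except a previous page number
    query = [(k, v) for k, v in parse_qsl(parts.query) if k != 'page']
    query.append(('page', str(page_number)))
    return urlunparse(parts._replace(query=urlencode(query)))

first = 'http://streeteasy.com/for-sale/manhattan/status:listed?refined_search=true'
second = page_url(first, 2)
third = page_url(second, 3)
```

Combined with the last page number read from the pagination nav, this would allow a plain `for` loop over page numbers instead of repeated clicks.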

In [25]:
driver.close()

In [35]:
page_scrape(driver)

14
Moving on to the next page...


In [26]:
driver = webdriver.Chrome()
time.sleep(2)
driver.get(urls[0])
time.sleep(5)

page_scrape(driver)
# time.sleep(5)
# nxt_page(driver)

ValueError: invalid literal for float(): 2,467
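
The `ValueError` above arises because `float()` does not accept thousands separators, so text like `2,467` has to be cleaned before conversion. A minimal sketch of a helper that could be applied to every numeric field follows; `to_number` is a hypothetical name and the sample strings are illustrative.

```python
import re

def to_number(text):
    """Extract the first number from text like '$1,250,000' or '2,467 ft'; None if absent."""
    match = re.search(r'\d[\d,]*(?:\.\d+)?', text)
    if match is None:
        return None
    # drop the thousands separators before converting
    return float(match.group(0).replace(',', ''))

values = [to_number(s) for s in ['2,467 ft', '$1,250,000', '2.5 baths', 'studio']]
```

Returning `None` for text without a number lets the caller decide how to record the missing value instead of failing partway through a page.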

In [36]:
dic = {'type':listing_type, 'geo':lat_long, 'addr':address, 'beds':no_of_beds, 'baths':no_of_baths, 'area':sq_area, 'hood':n_hood, 'price':price}
for key in dic:
    print(dic[key])

[3.0, 3.0, 3.0, 3.0, 1.0, 3.0, 4.0, 2.0, 3.0, 1.0, 2.0, 3.0, 1.0, 'N/A', 'N/A', 'N/A']
[u'45 East 22nd Street #46A', u'45 East 22nd Street #46A', u'45 East 22nd Street #46A', u'2 River Terrace #14N', u'453 Fdr Drive #C1703', u'45 Park Place #38', u'45 Park Place #24', u'45 Park Place #19W', u'45 Park Place #7E', u'200 East End Avenue #5N', u'1060 Fifth Avenue #10C', u'34 Gramercy Park East #8AR', u'435 East 85th Street #4MN', u'240 East 76th Street #11P', u'77 Bleecker Street #214', u'98 Park Terrace East #2H']
[3.0, 3.0, 1.0, 3.5, 4.5, 2.5, 3.5, 1.0, 2.5, 2.0, 1.5, 1.0, 1.0, 1.0]
[u'40.73989868,-73.98729706', u'40.73989868,-73.98729706', u'40.73989868,-73.98729706', u'40.71559906,-74.01609802', u'40.71289825,-73.97889709', u'40.71379852,-74.00990295', u'40.71379852,-74.00990295', u'40.71379852,-74.00990295', u'40.71379852,-74.00990295', u'40.77730179,-73.9434967', u'40.782321,-73.959613', u'40.73730087,-73.98500061', u'40.77569962,-73.94809723', u'40.77119827,-73.95700073', u'40.72710

In [18]:
nxt_page(driver)

You have reached the last page...


In [21]:
last_page = int(driver.find_element_by_xpath('//*[@id="result-details"]/ul/li[17]/nav/span[10]').text)
last_page


69

In [21]:
driver.close()

In [72]:
# 3. how will we iterate over all results pages and repeat step 2?
try:
    next_page = driver.find_element_by_class_name('next').find_element_by_tag_name('a')
    next_page.click()
except:
    print('You have reached the last page...')

NoSuchElementException: Message: no such element: Unable to locate element: {"method":"class name","selector":"next"}
  (Session info: chrome=58.0.3029.110)
  (Driver info: chromedriver=2.27.440174 (e97a722caafc2d3a8b807ee115bfb307f7d2cfd9),platform=Windows NT 10.0.14393 x86_64)


In [46]:
next_page.click()

### Exploratory Data Analysis

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
# sns.set(style='whitegrid', context='notebook')

### Model Equations
- Multiple Linear Regression:
 - $y = w_0x_0 + w_1x_1 + \dots + w_mx_m = \sum\limits_{i=0}^{m} w_ix_i = w^Tx$
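
As a worked instance of the equation, the sketch below sets $x_0 = 1$ so that $w_0$ acts as the intercept, builds the design matrix with `np.c_`, and recovers the weights by ordinary least squares. The data are synthetic, generated from known weights, and are not scraped listings.

```python
import numpy as np

# Synthetic single-feature data generated from y = 2 + 3 * x (no noise)
x = np.array([1.0, 2.0, 3.0, 4.0])
y = 2.0 + 3.0 * x

# Prepend the constant column x_0 = 1 so that w_0 is the intercept
X = np.c_[np.ones(len(x)), x]

# Ordinary least squares: find w minimising ||Xw - y||^2
w, residuals, rank, singular_values = np.linalg.lstsq(X, y, rcond=None)

# Predictions are the matrix form of y = w^T x, applied to every row of X
y_hat = X.dot(w)
```

Because the data contain no noise, the fitted `w` matches the generating weights up to floating-point error; real listings would leave nonzero residuals.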

In [None]:
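# train_test_split lives in sklearn.model_selection (sklearn.cross_validation was deprecated and later removed)
from sklearn.model_selection import train_test_split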
from sklearn import metrics


In [7]:
import numpy as np
help(np.c_)

Help on CClass in module numpy.lib.index_tricks object:

class CClass(AxisConcatenator)
 |  Translates slice objects to concatenation along the second axis.
 |  
 |  This is short-hand for ``np.r_['-1,2,0', index expression]``, which is
 |  useful because of its common occurrence. In particular, arrays will be
 |  stacked along their last axis after being upgraded to at least 2-D with
 |  1's post-pended to the shape (column vectors made out of 1-D arrays).
 |  
 |  For detailed documentation, see `r_`.
 |  
 |  Examples
 |  --------
 |  >>> np.c_[np.array([[1,2,3]]), 0, 0, np.array([[4,5,6]])]
 |  array([[1, 2, 3, 0, 0, 4, 5, 6]])
 |  
 |  Method resolution order:
 |      CClass
 |      AxisConcatenator
 |      __builtin__.object
 |  
 |  Methods defined here:
 |  
 |  __init__(self)
 |  
 |  ----------------------------------------------------------------------
 |  Methods inherited from AxisConcatenator:
 |  
 |  __getitem__(self, key)
 |  
 |  __getslice__(self, i, j)
 |  
 |  __le