# Applying ML to Predict NYC Real Estate Value & Investment Opportunity

*This notebook scrapes streeteasy.com for listings for sale in the five boroughs and applies ML to build and evaluate a model that predicts their prices.*

In [1]:
from selenium import webdriver
# from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver.common.keys import Keys
# from selenium.webdriver.support.ui import WebDriverWait
# from selenium.webdriver.support import expected_conditions as EC
import pandas as pd
import time
# import json
import random
# import os
# from collections import OrderedDict
# from urlparse import urlparse

In [2]:
boroughs = ['manhattan','brooklyn','queens','bronx','staten-island']

In [3]:
def url_def(lst):
    lst_of_urls = []
    for item in lst:
        lst_of_urls.append('http://streeteasy.com/for-sale/'+str(item)+'/status:listed?refined_search=true')
    return lst_of_urls

In [4]:
urls = url_def(boroughs)
urls

['http://streeteasy.com/for-sale/manhattan/status:listed?refined_search=true',
 'http://streeteasy.com/for-sale/brooklyn/status:listed?refined_search=true',
 'http://streeteasy.com/for-sale/queens/status:listed?refined_search=true',
 'http://streeteasy.com/for-sale/bronx/status:listed?refined_search=true',
 'http://streeteasy.com/for-sale/staten-island/status:listed?refined_search=true']

### Feature Selection

What would be the most useful features to collect for this project?

- **Type of house, location, neighborhood, number of rooms, number of baths**, availability of building amenities (laundry, doorman, super), proximity to transit, and proximity to the waterfront, with the price of the house as the target.

Features in bold are available on StreetEasy.

Q. ***Can any useful features be engineered from those available or retrieved from an alternate source?***

### Scraped features

The features available from Streeteasy.com are:

 - House type, geolocation (latitude and longitude), address, number of beds, number of baths, area (sq ft), neighborhood, price
 
*What features can be derived from these? What additional insight will these derived features provide?*
*Can more useful features be retrieved from other sources to complement Streeteasy?*
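As one possible answer to the question of derived features, the sketch below computes two candidates from the scraped fields: price per square foot, and the great-circle distance from a listing to a reference point. The function names `haversine_km` and `price_per_sq_ft` and the reference coordinates (an approximate location for Times Square) are illustrative assumptions, not part of the scraper.

```python
import math


def haversine_km(lat1, lng1, lat2, lng2):
    """Great-circle distance in kilometres between two (lat, lng) points."""
    r = 6371.0  # mean Earth radius in km
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lng2 - lng1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * r * math.asin(math.sqrt(a))


def price_per_sq_ft(price, area):
    """Price divided by area, or None when either value is missing or not positive."""
    if not isinstance(price, (int, float)) or not isinstance(area, (int, float)):
        return None  # e.g. the 'N/A' placeholder strings
    if area <= 0:
        return None
    return price / float(area)


# Hypothetical reference point: approximate coordinates of Times Square.
REF_LAT, REF_LNG = 40.758, -73.9855

# Example values for a single listing (latitude, longitude, price, area).
distance_km = haversine_km(40.735199, -74.0093, REF_LAT, REF_LNG)
ppsf = price_per_sq_ft(1599000.0, 737)
```

Whether distance to a single landmark is informative is an open question; distance to the nearest subway station or to the waterfront would need an additional data source.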

In [5]:
listing_type = []
lat = []
lng = []
address = []
no_of_beds = []
no_of_baths = []
sq_area = []
n_hood = []
price = []

### *Thoughts*

Do the project requirements dictate the statistical method/algorithm used? Will these, in turn, determine whether categorical or continuous variables are required?

- *Linear Regression*
- *Logistic Regression*
- *Random forest*

**N.B. This requirement directly dictates the page_scrape function below.**

**I'm going with numerical variables where possible...**
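Categorical fields such as building type and neighborhood cannot be collected as numbers, but they can be converted afterwards. A minimal sketch using `pandas.get_dummies` for one-hot encoding follows; the three rows are made-up illustrative values, not scraped data.

```python
import pandas as pd

# Made-up sample rows, for illustration only.
sample = pd.DataFrame({
    'building type': ['Condo', 'Co-op', 'Condo'],
    'beds': [1.0, 2.0, 1.0],
    'price': [1599000.0, 2195000.0, 999999.0],
})

# Replace the categorical column with one indicator column per category.
encoded = pd.get_dummies(sample, columns=['building type'])
print(encoded.columns.tolist())
```

One-hot encoding suits linear models; tree-based models such as random forests can also work with integer category codes.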

In [6]:
def page_scrape(page):
    count = 0
    listings = page.find_element_by_tag_name('body').find_element_by_id('result-details').find_element_by_tag_name('ul').find_elements_by_tag_name('li')
    # collect data here by iterating through each listing and appending to our lists
    for l in listings[:14]:
        # initiate a counter to help identify at what listing the code breaks, if it does...
        count +=1
        print count
        
        # longitude and latitude
        g = None
        try:
            g = l.get_attribute('se:map:point')
            if g:
                lt, ln = g.split(',')
                lat.append(float(lt))
                lng.append(float(ln))
            else:
                lat.append('N/A')
                lng.append('N/A')
        except:
            lat.append('N/A')
            lng.append('N/A')
        # time.sleep(1)
        
        # address
        ad = None
        try:
            ad = l.find_element_by_class_name('details-title').text.split('\n')[0]
            if ad:
                address.append(ad)
            else:
                address.append('N/A')
        except:
            address.append('N/A')
        # time.sleep(1)
        
        # price
        p = None
        try:
            p = float(l.find_element_by_class_name('price').text.replace('$','').replace(',', ''))
            if p:
                price.append(p)
            else:
                price.append('N/A')
        except:
            price.append('N/A')
        # time.sleep(1)
        
        # number of beds
        bd_detail = None
        try:
            bd_detail = l.find_element_by_class_name('details_info').find_element_by_tag_name('span')
            if bd_detail.text.find('bed') > 0:
                no_of_beds.append(float(bd_detail.text.split(' ')[0]))
            # do we want this as a string or float? what are the regression/ml requirements?
            else:
                no_of_beds.append('N/A')
        except:
            no_of_beds.append('N/A')
        # time.sleep(1)
        
        # number of baths
        baths = None
        try:
            lstn_details = l.find_element_by_class_name('details_info').find_elements_by_tag_name('span')
            for detail in lstn_details:
                if detail.text.find('bath') > 0:
                    try:
                        baths = float(detail.text.split(' ')[0])
                    except:
                        baths = 'N/A'
        except:
            baths = 'N/A'
        no_of_baths.append(baths)
        # time.sleep(1)
        
        # square area NB: value in previous listing is being appended to next listing. FIX!
        # update: fixed.
        area = None
        try:
            l_details = l.find_element_by_class_name('details_info').find_elements_by_tag_name('span')
            for detail in l_details:
                if detail.text.find('ft') > 0:
                    area = float(detail.text.split(' ')[0].replace(',', ''))
            if area:
                sq_area.append(area)
            else:
                sq_area.append('N/A')
        except:
            sq_area.append('N/A')
        # time.sleep(1)
        
        # listing type and neighborhood
        l_type = None
        nhood = None
        try:
            area_details = l.find_elements_by_class_name('details_info')[1].text
            l_type, nhood = area_details.split(' in ')
            if l_type:
                listing_type.append(l_type)
            else:
                listing_type.append('N/A')
            if nhood:
                n_hood.append(nhood)
            else:
                n_hood.append('N/A')
        except:
            listing_type.append('N/A')
            n_hood.append('N/A')
        # time.sleep(1)
    if count == 14:
        print('Moving on to the next page...')
    # streeteasy introduces a captcha when they suspect scraping. How will this be overridden?
    # fixed by using Firefox in place of Chrome
    return listings
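
The beds, baths and area blocks above all apply the same rule: find a span containing a keyword and convert its leading token to a float. One way to make that logic testable without a browser is to separate it from Selenium, as sketched below. The helper names `leading_number` and `parse_details` are hypothetical, and the span texts ('2 beds', '1.5 baths', '1,190 ft') are assumed from the way `page_scrape` splits and cleans them.

```python
def leading_number(text, keyword):
    """Return the leading number in `text` as a float if `keyword` occurs in it, else None."""
    if keyword not in text:
        return None
    try:
        return float(text.split(' ')[0].replace(',', ''))
    except ValueError:
        return None  # keyword present but no parseable number


def parse_details(span_texts):
    """Extract beds, baths and area from a list of span strings; missing values stay None."""
    result = {'beds': None, 'baths': None, 'area': None}
    for text in span_texts:
        for field, keyword in (('beds', 'bed'), ('baths', 'bath'), ('area', 'ft')):
            value = leading_number(text, keyword)
            # keep the first match for each field
            if value is not None and result[field] is None:
                result[field] = value
    return result


parsed = parse_details(['2 beds', '1.5 baths', '1,190 ft'])
```

In the scraper, `span_texts` could be built with `[s.text for s in lstn_details]`, so each listing is parsed once and a missing field cannot inherit a value from the previous listing.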

In [7]:
def next_page():
    nxt = listns[-1].find_element_by_class_name('next')
    nxt.click()

In [8]:
t = time.time()
driver = webdriver.Firefox()
for url in urls:
    driver.get(url)
    time.sleep(2)
    listns = page_scrape(driver)
    # the last <li> holds the pagination links; the final 'page' link gives the page count
    last_page = int(listns[-1].find_elements_by_class_name('page')[-1].text)
    counter = 1
    print 'Counter:', counter
    next_page()
    # distinct loop variable, so the outer borough loop is not shadowed
    for page_no in range(1, last_page):
        counter += 1
        delay = random.uniform(2.5, 5)
        listns = page_scrape(driver)
        next_page()
        print 'Counter:', counter
        time.sleep(delay)
    print time.time() - t
total_time = time.time() - t
print total_time

1
2
3
4
5
6
7
8
9
10
11
12
13
14
Moving on to the next page...
Counter: 1
[... output condensed: the 1-14 listing count, "Moving on to the next page..." and an incrementing "Counter:" line repeat for every page; the captured log breaks off after Counter: 13 and resumes at Counter: 110, continuing through Counter: 122 ...]
1
2
3
4
5
6
7
8
9
10
11
12
13


TimeoutException: Message: Timeout loading page after 300000ms
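
A page-load timeout like the one above ends the whole run. One common mitigation is to retry the failing call a few times with a growing delay. The helper below is a generic sketch, not part of the original notebook; the name `with_retries` and its parameters are assumptions.

```python
import random
import time


def with_retries(action, attempts=3, base_delay=2.0,
                 exceptions=(Exception,), sleep=time.sleep):
    """Call `action()` up to `attempts` times, sleeping longer after each failure."""
    for attempt in range(1, attempts + 1):
        try:
            return action()
        except exceptions:
            if attempt == attempts:
                raise  # out of retries: surface the last error
            # exponential backoff plus a little random jitter
            sleep(base_delay * 2 ** (attempt - 1) + random.uniform(0, 1))


# Possible use inside the scraping loop (requires a live Selenium driver):
# from selenium.common.exceptions import TimeoutException
# with_retries(lambda: driver.get(url), exceptions=(TimeoutException,))
```

Because the result lists are module-level, everything scraped before a failure is still available afterwards, which is why a DataFrame can be built from the partial run.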


In [9]:
counter

176

In [10]:
driver.close()

In [10]:
dic = {'building type':listing_type, 'latitude':lat, 'longitude':lng, 'address':address, 'beds':no_of_beds, 'baths':no_of_baths, 'area':sq_area, 'neighborhood':n_hood, 'price':price}

In [11]:
data = pd.DataFrame(dic)
# data.to_csv('Streeteasy_data.csv')  # the data is not yet good enough to warrant saving to disk
data.head()

| Unnamed: 0 | address | area | baths | beds | building type | latitude | longitude | neighborhood | price |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 366 West 11th Street #12D | 737 | 1.0 | 1.0 | Condo | 40.735199 | -74.0093 | West Village | 1599000.0 |
| 1 | 438 East 12th Street #3R | 1190 | 2.5 | 2.0 | Condo | 40.729 | -73.982101 | East Village | 2195000.0 |
| 2 | 390 West End Avenue #PHK | 623 | 1.0 | 1.0 | Condo | 40.783699 | -73.980904 | Upper West Side | 999999.0 |
| 3 | 350 West 42nd Street #24J | 525 | 1.0 |  | Condo | 40.7579 | -73.992302 | Hell's Kitchen | 830000.0 |
| 4 | 402 East 90th Street #6F | 900 | 1.0 | 1.0 | Condo | 40.778999 | -73.946999 | Yorkville | 849000.0 |


In [12]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2464 entries, 0 to 2463
Data columns (total 9 columns):
address          2464 non-null object
area             2464 non-null object
baths            2448 non-null object
beds             2464 non-null object
building type    2464 non-null object
latitude         2464 non-null float64
longitude        2464 non-null float64
neighborhood     2464 non-null object
price            2464 non-null float64
dtypes: float64(3), object(6)
memory usage: 173.3+ KB


In [13]:
print data.describe()

          latitude    longitude         price
count  2464.000000  2464.000000  2.464000e+03
mean     40.741372   -73.948381  3.105610e+06
std       0.821652     1.490471  5.143441e+06
min       0.000000   -74.018097  1.600000e+05
25%      40.736500   -73.992401  8.250000e+05
50%      40.759909   -73.979066  1.572500e+06
75%      40.776125   -73.963189  3.295000e+06
max      40.872398     0.000000  7.800000e+07


### Handling 'N/A' values and outliers...

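*Should samples with missing data be discarded, or should the missing values be replaced with the feature median? What is the usual practice in this situation?*

*Note that the `describe()` output above shows a minimum latitude and a maximum longitude of 0, so some rows carry placeholder coordinates rather than real ones; these need the same treatment as 'N/A'.*

Are statistical outliers really outliers in this use case? (Yes/**No**)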
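
Both options for missing data can be tried side by side. The sketch below uses a small made-up frame that mimics the scraped columns: it converts the 'N/A' strings to `NaN` with `pd.to_numeric`, treats a latitude of 0 as missing, and then either drops incomplete rows or fills them with the column median. The column values are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Made-up rows mimicking the scraper's output, including 'N/A' placeholders.
raw = pd.DataFrame({
    'beds': [1.0, 'N/A', 2.0, 3.0],
    'area': [737.0, 525.0, 'N/A', 1190.0],
    'latitude': [40.7352, 0.0, 40.7290, 40.7837],
})

clean = raw.copy()
for col in ['beds', 'area']:
    # non-numeric entries such as 'N/A' become NaN
    clean[col] = pd.to_numeric(clean[col], errors='coerce')
# a latitude of exactly 0 cannot be a New York listing, so treat it as missing
clean['latitude'] = clean['latitude'].replace(0.0, np.nan)

dropped = clean.dropna()                 # option 1: discard incomplete rows
imputed = clean.fillna(clean.median())   # option 2: fill with the column median
```

Dropping rows is simpler but shrinks the sample; median imputation keeps every row at the cost of flattening the distribution of the imputed feature.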

### Exploratory Data Analysis

In [None]:
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
# sns.set(style='whitegrid', context='notebook')


### Model Equations
- Multivariate Linear Regression:
 - $y = w_0x_0 + w_1x_1 + \dots + w_mx_m = \sum\limits_{i=0}^{m} w_ix_i = w^Tx$, where $x_0 = 1$ so that $w_0$ is the intercept
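
The linear model above can be evaluated as a matrix product once a constant column of ones is prepended to the feature matrix. The sketch below does this with `np.c_` on synthetic data generated from known weights, then recovers the weights by least squares. The data and the weight values are made up for illustration.

```python
import numpy as np

# Synthetic data: 5 samples, 2 features.
X = np.array([[1.0, 2.0],
              [2.0, 0.5],
              [3.0, 1.5],
              [4.0, 3.0],
              [5.0, 2.5]])
true_w = np.array([10.0, 2.0, -1.0])  # w_0 (intercept), w_1, w_2

# Prepend the constant column x_0 = 1 so that w_0 acts as the intercept.
X_b = np.c_[np.ones(len(X)), X]
y = X_b.dot(true_w)  # y = w^T x for every row

# Least-squares estimate of the weights from (X_b, y).
w_hat, residuals, rank, singular_values = np.linalg.lstsq(X_b, y, rcond=None)
y_pred = X_b.dot(w_hat)
```

With noise-free targets the estimate matches `true_w` up to floating-point error; on real listings the fit will only be approximate.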

In [None]:
# sklearn.cross_validation was deprecated in 0.18 and later removed; model_selection replaces it
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn import metrics


In [7]:
import numpy as np
help(np.c_)

Help on CClass in module numpy.lib.index_tricks object:

class CClass(AxisConcatenator)
 |  Translates slice objects to concatenation along the second axis.
 |  
 |  This is short-hand for ``np.r_['-1,2,0', index expression]``, which is
 |  useful because of its common occurrence. In particular, arrays will be
 |  stacked along their last axis after being upgraded to at least 2-D with
 |  1's post-pended to the shape (column vectors made out of 1-D arrays).
 |  
 |  For detailed documentation, see `r_`.
 |  
 |  Examples
 |  --------
 |  >>> np.c_[np.array([[1,2,3]]), 0, 0, np.array([[4,5,6]])]
 |  array([[1, 2, 3, 0, 0, 4, 5, 6]])
 |  
 |  Method resolution order:
 |      CClass
 |      AxisConcatenator
 |      __builtin__.object
 |  
 |  Methods defined here:
 |  
 |  __init__(self)
 |  
 |  ----------------------------------------------------------------------
 |  Methods inherited from AxisConcatenator:
 |  
 |  __getitem__(self, key)
 |  
 |  __getslice__(self, i, j)
 |  
 |  __le

### Performance Measure

Root mean squared error (RMSE), which is expressed in the same units as the target (dollars).
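
For $n$ predictions $\hat{y}_i$ of targets $y_i$, $RMSE = \sqrt{\frac{1}{n}\sum\limits_{i=1}^{n} (y_i - \hat{y}_i)^2}$. A minimal standard-library sketch follows; the function name `rmse` and the sample numbers are illustrative.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences of numbers."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError('inputs must be non-empty and of equal length')
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / float(len(squared_errors)))


# Made-up targets and predictions.
error = rmse([3.0, 5.0, 7.0], [2.0, 5.0, 9.0])
```

Because the errors are squared before averaging, a few badly mispredicted high-priced listings can dominate the score.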