# Getting all historic sales off streeteasy

In this notebook, we present an alternate webscraping approach to getting historic sales data off streeteasy. Our previous approach involved paging through current sale pages. Unfortunately this does not allow us to construct a training set.

In this notebook, we present an alternate approach (that unfortunately may not scale). Specifically, we
1. Page through NYC buildings (approximately 10.3k pages of buildings) to get the urls for each building.
2. Parse the building page in order to construct the url that will give us access to all historic sales in that building.
3. Scrape these historic sale tables to get:
    - Sale price
    - Sale date
    - Unit number
    - URL corresponding to the sale (page gives granular features about the unit).

Unfortunately, this approach is not particular efficient, since it requires approximately:
- 10.3k calls to get the building urls
- 126k calls to get the building pages (approximately 126k buildings in the streeteasy NYC dataset)
- 126k calls to get the historic sale tables from the building pages
- An unknown number of sale pages (some multiplicative factor of 126k).

A (potentially) more efficient approach is to:
- Randomly sample from the building pages (i.e. random draw from 1 to 10.3k)
- Scrap the historic sale tables for these buildings to determine an approximate range of identifiers for sales (sale pages are indexed integers over an unknown range).
- Try fetching all pages corresponding to sales in this range, ignoring 404 return codes.


# Step 1: Page through NYC buildings

In [78]:
import urllib2
from BeautifulSoup import BeautifulSoup
import re
import numpy as np
import pandas as pd

def get_all_building_urls(test=True, test_num=15):
    #Get first page
    base_url = 'http://streeteasy.com/buildings/nyc'
    page = urllib2.urlopen(base_url)
    soup = BeautifulSoup(page.read())
    urls = soup.findAll("div", {"class": "details-title"})
    hrefs = [x.find('a', href=True)['href'] for x in urls]
    
    #Get number of pages
    num_pages = soup.findAll('span', {'class':'page'})[-1]
    num_pages = int(num_pages.find('a').contents[0])
    
    if test:
        pagelimit = test_num
    else:
        pagelimit = num_pages
        
    #get rest of pages:
    for page in range(2,pagelimit):
        if page % 10 == 0:
            print page
        page_url = base_url + "?page={}".format(page)
        page = urllib2.urlopen(page_url)
        soup = BeautifulSoup(page.read())
        urls = soup.findAll("div", {"class": "details-title"})
        new_hrefs = [x.find('a', href=True)['href'] for x in urls]
        hrefs.extend(new_hrefs)
    return hrefs

In [82]:
def get_all_building_urls_regex(test=True, test_num=15):
    #Alternate approach just pattern matching
    
    #Get first page using beautiful soup (to find num pages)
    base_url = 'http://streeteasy.com/buildings/nyc'
    page = urllib2.urlopen(base_url)
    soup = BeautifulSoup(page.read())
    urls = soup.findAll("div", {"class": "details-title"})
    hrefs = [x.find('a', href=True)['href'] for x in urls]
    
    #Get number of pages
    num_pages = soup.findAll('span', {'class':'page'})[-1]
    num_pages = int(num_pages.find('a').contents[0])
    
    if test:
        pagelimit = test_num
    else:
        pagelimit = num_pages
        
    #get rest of pages using regex match:
    for page in range(2,pagelimit):
        if page % 10 == 0:
            print page
        page_url = base_url + "?page={}".format(page)
        page = urllib2.urlopen(page_url)
        text = page.read()
        new_hrefs = list(set(re.findall('/building/[^\s/"]*', text)))
        hrefs.extend(new_hrefs)
    return hrefs

In [83]:
%time get_all_building_urls()
%time first_15 = get_all_building_urls_regex()

#time magic doesn't suppress output
done = True
print done

10
CPU times: user 12.9 s, sys: 286 ms, total: 13.2 s
Wall time: 39.6 s
10
CPU times: user 999 ms, sys: 82.7 ms, total: 1.08 s
Wall time: 26.1 s
True


### Time estimate:

Assuming linear runtime with number of pages, this implies it'll take about

$$11.8s *(10300/15 pages) * 1/60 (min/sec)$$

or approximately 2 hours to get all buildings.

# Step 2: Get past sale tables

In [74]:
#Get past sales for a building:
def parse_row(row):
    '''
    Helper function to parse table of past sales for buildings.
    '''
    try:
        cols = row.findAll('td')
        date = cols[0].contents[0].strip()
        sale_url = re.search(r'/sale/[0-9]*', str(cols[0].contents[1])).group(0)
        unit = cols[1].find('a').contents[0]
        sale_price = cols[2].find('span', {'class':'price'}).contents[0].replace('$','').replace(',','').strip()
        return {'Date':date, 'URL':sale_url, 'Unit':unit, 'Price': sale_price}
    except:
        return {'Date':np.nan, 'URL':np.nan, 'Unit':np.nan, 'Price': np.nan}

def get_all_building_sales(building_url):
    build_url = 'http://streeteasy.com/{}#tab_building_detail=2'.format(building_url)
    build_page = urllib2.urlopen(build_url)
    soup = BeautifulSoup(build_page.read())
    visible_url = soup.findAll("div", {"class": "tabset-content",  'se:behavior':"loadable"})
    
    #Note after some digging into the source code, it appears that all the sales are found
    #using a modification of this visible url:
    #ex: http://streeteasy.com/nyc/property_activity/past_transactions_body/8612508?all_activity=true&show_sales=true
    building_number = re.search('[0-9]+',visible_url[0]['se:url'])
    
    #Get all past sales
    sales_url = 'http://streeteasy.com/nyc/property_activity/past_transactions_body/{}?all_activity=true&show_sales=true'.format(building_number.group(0))
    sales_page = urllib2.urlopen(sales_url)
    soup = BeautifulSoup(sales_page.read())
    
    #Many rows in the table are not recorded sales- only those recorded as sold sold.
    rows = soup.findAll('tr')
    sold_rows = [row for row in soup.findAll('tr') if re.search('Sold', str(row))]
    if len(sold_rows)>0:
        parsed = pd.DataFrame([parse_row(row) for row in sold_rows])
        parsed['Date'] = pd.to_datetime(parsed['Date'],infer_datetime_format=True)
        parsed = parsed.loc[parsed['Date'].dt.year>=2010,:]
        parsed.dropna(axis=0, how='any', inplace=True)
        return parsed
    else:
        return None

In [94]:
def get_all_building_sales_regex(building_url):
    
    #Read in building page
    build_url = 'http://streeteasy.com/{}#tab_building_detail=2'.format(building_url)
    build_page = urllib2.urlopen(build_url).read()
    
    #Find building number using regex
    visible_url = re.search('/nyc/property_activity/past_transactions_component/([0-9]*)', build_page)   
    
    if visible_url:
        building_number = visible_url.group(1)

        #Get all past sales for building
        sales_table_url = 'http://streeteasy.com/nyc/property_activity/past_transactions_body/{}?all_activity=true&show_sales=true'.format(building_number)
        sales_table_page = urllib2.urlopen(sales_table_url).read()
        sales_urls = re.findall('/sale/[0-9]+', sales_table_page)
        return sales_urls
    
    else:
        return []
    

In [79]:
%time get_all_building_sales(building_url)
%time get_all_building_sales_regex(building_url)

CPU times: user 7.83 s, sys: 119 ms, total: 7.95 s
Wall time: 14.8 s
CPU times: user 14.6 ms, sys: 8.65 ms, total: 23.2 ms
Wall time: 5.01 s


['/sale/1095961',
 '/sale/1095961',
 '/sale/1219516',
 '/sale/1219516',
 '/sale/1211096',
 '/sale/1211096',
 '/sale/1230062',
 '/sale/1230062',
 '/sale/1168412',
 '/sale/1168412',
 '/sale/1178689',
 '/sale/1178689',
 '/sale/1214803',
 '/sale/1214803',
 '/sale/1206140',
 '/sale/1206140',
 '/sale/1123425',
 '/sale/1123425',
 '/sale/1108140',
 '/sale/1108140',
 '/sale/1131447',
 '/sale/1131447',
 '/sale/1130825',
 '/sale/1130825',
 '/sale/1188140',
 '/sale/1188140',
 '/sale/1174632',
 '/sale/1174632',
 '/sale/1138282',
 '/sale/1138282',
 '/sale/1218622',
 '/sale/1218622',
 '/sale/1146153',
 '/sale/1146153',
 '/sale/1192570',
 '/sale/1192570',
 '/sale/1100209',
 '/sale/1100209',
 '/sale/1114684',
 '/sale/1114684',
 '/sale/1178687',
 '/sale/1178687',
 '/sale/1205603',
 '/sale/1205603',
 '/sale/1178691',
 '/sale/1178691',
 '/sale/1206994',
 '/sale/1206994',
 '/sale/1206995',
 '/sale/1206995',
 '/sale/1114677',
 '/sale/1114677',
 '/sale/1136369',
 '/sale/1136369',
 '/sale/1210107',
 '/sale/12

In [92]:
def get_all_sales_for_time_test(first_15):
    sales = []
    for building in first_15:
        sales.extend(get_all_building_sales_regex(building))
    return sales

In [95]:
%time sales = get_all_sales_for_time_test(list(set(first_15)))

CPU times: user 3.61 s, sys: 2.43 s, total: 6.04 s
Wall time: 20min 31s


In [97]:
len(list(set(sales)))

26889

In [100]:
list(set(sales))[0:6]

['/sale/1198913',
 '/sale/927342',
 '/sale/840286',
 '/sale/880849',
 '/sale/1144747',
 '/sale/95681']

In [103]:
sale_nums = [int(x.split('/')[-1]) for x in list(set(sales))]

In [104]:
import numpy as np
np.max(sale_nums)

1249694

In [105]:
np.min(sale_nums)

5231

# Step 3- not implemented

Note step 3 (scraping sale page) is not implemented here since it was previously implemented to scrape current sales.