# Getting all historic sales off streeteasy

In this notebook, we present an alternate web-scraping approach to getting historic sales data off streeteasy. Our previous approach involved paging through current sale pages, which unfortunately does not allow us to construct a training set.

The approach presented here may not scale. Specifically, we:
1. Page through NYC buildings (approximately 10.3k pages of buildings) to get the urls for each building.
2. Parse the building page in order to construct the url that will give us access to all historic sales in that building.
3. Scrape these historic sale tables to get:
    - Sale price
    - Sale date
    - Unit number
    - URL corresponding to the sale (page gives granular features about the unit).

Unfortunately, this approach is not particularly efficient, since it requires approximately:
- 10.3k calls to get the building urls
- 126k calls to get the building pages (approximately 126k buildings in the streeteasy NYC dataset)
- 126k calls to get the historic sale tables from the building pages
- An unknown number of sale pages (some multiplicative factor of 126k).
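
To put these counts in perspective, the tally below adds up the known request counts and converts them to wall-clock time. The one-second delay between requests is an assumption made purely for illustration, not a figure from this notebook, and the sale-page requests are left out because their number is unknown.

```python
# Approximate request counts taken from the list above.
listing_pages = 10300      # pages of building listings
building_pages = 126000    # one page per building
sale_tables = 126000       # one historic sale table per building

known_requests = listing_pages + building_pages + sale_tables

# ASSUMPTION: one request per second, chosen only for illustration.
seconds_per_request = 1.0
hours = known_requests * seconds_per_request / 3600.0

print("known requests: {}".format(known_requests))
print("hours at {} s/request: {:.1f}".format(seconds_per_request, hours))
```

Under that assumed rate, the three known stages alone come to 262,300 requests, or roughly three days of continuous scraping, before any sale pages are fetched.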

A (potentially) more efficient approach is to:
- Randomly sample from the building pages (i.e. random draw from 1 to 10.3k)
- Scrape the historic sale tables for these buildings to determine an approximate range of identifiers for sales (sale pages are indexed by integers over an unknown range).
- Try fetching all pages corresponding to sales in this range, ignoring 404 return codes.
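
The sampling idea above could be sketched roughly as follows. This is a minimal illustration written for Python 3's standard library rather than the notebook's Python 2 `urllib2`. The helper names (`sample_listing_pages`, `sale_id_range`, `fetch_sale_page`), the `padding` parameter, and the `/sale/<id>` page URL are assumptions introduced here, not part of the original code.

```python
import random
import re
from urllib.error import HTTPError
from urllib.request import urlopen

BASE_URL = 'http://streeteasy.com'


def sample_listing_pages(num_pages, sample_size, seed=None):
    """Draw distinct listing page numbers uniformly from 1..num_pages."""
    rng = random.Random(seed)
    return sorted(rng.sample(range(1, num_pages + 1), sample_size))


def sale_id_range(sale_urls, padding=0):
    """Return (low, high) bounds on the integer ids seen in '/sale/<id>' URLs."""
    ids = []
    for url in sale_urls:
        match = re.search(r'/sale/([0-9]+)', url)
        if match:
            ids.append(int(match.group(1)))
    if not ids:
        raise ValueError('no sale ids found')
    return max(min(ids) - padding, 1), max(ids) + padding


def fetch_sale_page(sale_id, opener=urlopen):
    """Fetch one sale page; return None when the id does not exist (404)."""
    # ASSUMPTION: sale pages live at BASE_URL + '/sale/<id>'.
    url = '{}/sale/{}'.format(BASE_URL, sale_id)
    try:
        return opener(url).read()
    except HTTPError as err:
        if err.code == 404:
            return None
        raise  # any other HTTP error is a real problem, so surface it


# Sale URLs of the kind returned by the historic sale tables.
observed = ['/sale/1232003', '/sale/1216932', '/sale/1148750']
low, high = sale_id_range(observed)
print(low, high)
```

A full run would then loop over `range(low, high + 1)` and call `fetch_sale_page` on each id. The sampled range only bounds the ids that were actually observed, so a nonzero `padding` may be worth using to catch sales just outside it.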


# Step 1: Page through NYC buildings

In [213]:
import urllib2
from BeautifulSoup import BeautifulSoup
import re
import numpy as np
import pandas as pd  #needed by get_all_building_sales below

def get_all_building_urls(test=True):
    #Get first page
    base_url = 'http://streeteasy.com/buildings/nyc'
    page = urllib2.urlopen(base_url)
    soup = BeautifulSoup(page.read())
    urls = soup.findAll("div", {"class": "details-title"})
    hrefs = [x.find('a', href=True)['href'] for x in urls]
    
    #Get number of pages
    num_pages = soup.findAll('span', {'class':'page'})[-1]
    num_pages = int(num_pages.find('a').contents[0])
    
    if test:
        pagelimit = 5
    else:
        pagelimit = num_pages
        
    #Get rest of pages (pagelimit is inclusive, so the last page is fetched too):
    for page_num in range(2, pagelimit + 1):
        if page_num % 10 == 0:
            print page_num
        page_url = base_url + "?page={}".format(page_num)
        page = urllib2.urlopen(page_url)
        soup = BeautifulSoup(page.read())
        urls = soup.findAll("div", {"class": "details-title"})
        new_hrefs = [x.find('a', href=True)['href'] for x in urls]
        hrefs.extend(new_hrefs)
    return hrefs

# Step 2: Get past sale tables

In [230]:
#Get past sales for a building:
def parse_row(row):
    '''
    Helper function to parse table of past sales for buildings.
    '''
    try:
        cols = row.findAll('td')
        date = cols[0].contents[0].strip()
        sale_url = re.search(r'/sale/[0-9]*', str(cols[0].contents[1])).group(0)
        unit = cols[1].find('a').contents[0]
        sale_price = cols[2].find('span', {'class':'price'}).contents[0].replace('$','').replace(',','').strip()
        return {'Date':date, 'URL':sale_url, 'Unit':unit, 'Price': sale_price}
    #Malformed rows become all-NaN records, which are dropped later
    except Exception:
        return {'Date':np.nan, 'URL':np.nan, 'Unit':np.nan, 'Price': np.nan}

def get_all_building_sales(building_url):
    #building_url already starts with '/', so strip it to avoid a double slash
    build_url = 'http://streeteasy.com/{}#tab_building_detail=2'.format(building_url.lstrip('/'))
    build_page = urllib2.urlopen(build_url)
    soup = BeautifulSoup(build_page.read())
    visible_url = soup.findAll("div", {"class": "tabset-content",  'se:behavior':"loadable"})
    
    #Note after some digging into the source code, it appears that all the sales are found
    #using a modification of this visible url:
    #ex: http://streeteasy.com/nyc/property_activity/past_transactions_body/8612508?all_activity=true&show_sales=true
    building_number = re.search('[0-9]+',visible_url[0]['se:url'])
    
    #Get all past sales
    sales_url = 'http://streeteasy.com/nyc/property_activity/past_transactions_body/{}?all_activity=true&show_sales=true'.format(building_number.group(0))
    sales_page = urllib2.urlopen(sales_url)
    soup = BeautifulSoup(sales_page.read())
    
    #Many rows in the table are not recorded sales; keep only those marked as sold.
    rows = soup.findAll('tr')
    sold_rows = [row for row in rows if re.search('Sold', str(row))]
    if len(sold_rows)>0:
        parsed = pd.DataFrame([parse_row(row) for row in sold_rows])
        parsed['Date'] = pd.to_datetime(parsed['Date'],infer_datetime_format=True)
        #Keep sales from 2010 onward and drop rows that failed to parse
        parsed = parsed.loc[parsed['Date'].dt.year>=2010,:].dropna(axis=0, how='any')
        return parsed
    else:
        return None
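
The URL construction described in the comments above can be isolated into a small helper, which makes it easier to check without hitting the site. The sketch below is written for Python 3. The function name `build_sales_url` and the example `se:url` value are hypothetical; only the resulting sales URL format comes from this notebook.

```python
import re

SALES_URL_TEMPLATE = ('http://streeteasy.com/nyc/property_activity/'
                      'past_transactions_body/{}?all_activity=true&show_sales=true')


def build_sales_url(loadable_url):
    """Build the all-sales URL from the 'se:url' value of the loadable tab."""
    # As in the cell above, the first run of digits is taken as the building id.
    match = re.search(r'[0-9]+', loadable_url)
    if match is None:
        raise ValueError('no building id in {!r}'.format(loadable_url))
    return SALES_URL_TEMPLATE.format(match.group(0))


# ASSUMPTION: hypothetical 'se:url' value; only the embedded id matters here.
example_se_url = '/nyc/property_activity/past_transactions_component/8612508'
print(build_sales_url(example_se_url))
```

Raising an error when no id is found surfaces a changed page layout immediately, rather than producing a malformed request.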

In [212]:
test_buildings = get_all_building_urls()

2
3
4


In [231]:
test_buildings[32]

u'/building/greenwich-club'

In [232]:
test_df = get_all_building_sales(test_buildings[32])

In [233]:
test_df

| | Date | Price | URL | Unit |
|---|---|---|---|---|
| 0 | 2016-10-26 | 685000 | /sale/1232003 | #429 |
| 1 | 2016-09-23 | 749000 | /sale/1216932 | #919 |
| 2 | 2016-09-13 | 995000 | /sale/1226794 | #2306 |
| 3 | 2016-08-25 | 620000 | /sale/1213144 | #405 |
| 4 | 2016-08-18 | 905000 | /sale/1148750 | #708 |
| 5 | 2016-08-09 | 765000 | /sale/1213097 | #1317 |
| 6 | 2016-08-05 | 955000 | /sale/1211482 | #907 |
| 7 | 2016-07-28 | 750000 | /sale/1182950 | #1317 |
| 8 | 2016-07-19 | 625000 | /sale/1212725 | #1012 |
| 9 | 2016-07-18 | 650000 | /sale/1211137 | #728 |


# Step 3: Not implemented

Step 3 (scraping the individual sale pages) is not implemented here, since it was previously implemented to scrape current sales.