# Collect house price data for Vancouver

The goal of this project is to scrape listing pages for housing prices in Vancouver. For each listing it should collect the house type, address, bedrooms, bathrooms, square footage, agent/agency, and of course the listing price. The listing ID serves as the key for catching changes and updating a listing's record when it is scraped on multiple days.
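
As a rough sketch of how the listing ID could drive change detection across days: the `update_listings` helper, the in-memory `store` dict, and the field names below are illustrative assumptions rather than part of the scraper. The listing ID and list price are those of the example listing scraped in this notebook; the dates and the second-day price are made-up placeholders.

```python
from datetime import date

def update_listings(store, scraped, seen_on):
    """Merge one day's scraped records into `store`, keyed by listing ID.

    Returns a list of (listing_id, field, old_value, new_value) changes.
    """
    changes = []
    for listing_id, record in scraped.items():
        if listing_id not in store:
            # first sighting: remember when the listing appeared
            store[listing_id] = {**record, 'first_seen': seen_on, 'last_seen': seen_on}
            continue
        known = store[listing_id]
        for field, new_value in record.items():
            if known.get(field) != new_value:
                changes.append((listing_id, field, known.get(field), new_value))
                known[field] = new_value
        known['last_seen'] = seen_on
    return changes

store = {}
# day 1 (placeholder date): ID and price of the example listing
update_listings(store, {'R2227563': {'list_price': 2088000}}, date(2018, 1, 1))
# day 2 (placeholder date): an invented price drop, only to exercise the logic
changes = update_listings(store, {'R2227563': {'list_price': 1988000}}, date(2018, 1, 2))
print(changes)  # [('R2227563', 'list_price', 2088000, 1988000)]
```

Keeping `first_seen` and `last_seen` alongside each record also gives a crude measure of how long a listing has been visible.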

Unfortunately, sale prices are not included at the moment, but they could be added in the future if the data becomes available. The MLS listing number should be enough to join the sale price back to the original listing(s), and the joined records should also give an idea of how long the house was on the market.
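
If sale data does become available, the join could look roughly like the sketch below. Everything here is hypothetical: the `join_sales` helper, the field names, the dates, and the sale price are invented for illustration. Only the listing ID and list price match the example listing scraped in this notebook.

```python
from datetime import date

# Keyed by MLS listing number. The dates and the sale price are invented
# placeholders, not real data.
listings = {'R2227563': {'list_price': 2088000, 'first_seen': date(2018, 1, 1)}}
sales = {'R2227563': {'sale_price': 2000000, 'sold_on': date(2018, 2, 15)}}

def join_sales(listings, sales):
    """Attach sale price and days on market to every listing that has sold."""
    joined = {}
    for listing_id, listing in listings.items():
        sale = sales.get(listing_id)
        if sale is None:
            continue  # still on the market, or no sale data for this listing
        joined[listing_id] = {
            **listing,
            **sale,
            'days_on_market': (sale['sold_on'] - listing['first_seen']).days,
        }
    return joined

joined = join_sales(listings, sales)
print(joined['R2227563']['days_on_market'])  # 45
```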

There are several sources of housing data online. [REW](https://www.rew.ca/properties/areas/vancouver-bc) seems to be the most comprehensive and popular. It lists properties from multiple agencies, and its pages are easy enough to parse. We can start by spinning up Beautiful Soup and learning how to scrape a listing that just went up.

In [1]:
from bs4 import BeautifulSoup
import requests
import pandas as pd

target = 'https://www.rew.ca/properties/R2227563/401-e-55th-avenue-vancouver-bc'
page = requests.get(target)
page.raise_for_status()  # stop here if the listing page could not be fetched
soup = BeautifulSoup(page.text, 'html5lib')
print(soup.prettify())

<!DOCTYPE html>
<html data-locale="en-CA" prefix="og: http://ogp.me/ns#" xmlns:fb="http://ogp.me/ns/fb#">
 <head>
  <!-- https://developers.google.com/tag-manager/devguide?hl=en#adding-data-layer-variables-to-a-page -->
  <script>
   dataLayer = [];
  dataLayer.push({'propertyType': 'house'});dataLayer.push({'propertyCity': 'Vancouver'});dataLayer.push({'propertyNeighbourhood': 'South Vancouver'});dataLayer.push({'propertyPrice': '2088000'});
  </script>
  <!-- Google Tag Manager -->
  <script>
   (function(w,d,s,l,i){w[l]=w[l]||[];w[l].push({'gtm.start':
new Date().getTime(),event:'gtm.js'});var f=d.getElementsByTagName(s)[0],
j=d.createElement(s),dl=l!='dataLayer'?'&l='+l:'';j.async=true;j.src=
'//www.googletagmanager.com/gtm.js?id='+i+dl;f.parentNode.insertBefore(j,f);
})(window,document,'script','dataLayer','GTM-X9BN');
  </script>
  <!-- End Google Tag Manager -->
  <meta charset="utf-8"/>
  <script type="text/javascript">
   window.NREUM||(NREUM={});NREUM.info={"beacon":"bam.nr-d

In [2]:
# save target to local file so it doesn't change on us.
import os
fname = 'R2227563.html'
fpath = os.path.join('./examples/', fname)
if not os.path.exists(fpath):
    os.makedirs(os.path.dirname(fpath), exist_ok=True)
    with open(fpath, 'w', encoding='utf-8') as fp:
        fp.write(soup.prettify())

In [3]:
# load from file
with open(fpath, 'r', encoding='utf-8') as fp:
    page = fp.read()
soup = BeautifulSoup(page, 'html5lib')

### Pull out the listing ID and URL for reference

In [10]:
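# The listing ID appears in many places, including the URL, so grab the URL
# from the first <link> tag in <head> and pull the ID out of it.
import re
url = soup.head.link['href']
# the ID is the path segment immediately after "properties/"
listing_id = re.findall(r'(?<=properties/)[^/]+', url)[0]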

print('Listing ID: {}\nURL: {}'.format(listing_id, url))

Listing ID: R2227563
URL: https://www.rew.ca/properties/R2227563/401-e-55th-avenue-vancouver-bc
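
A regular expression is not the only option here. As a minimal sketch, assuming every listing URL follows the `/properties/<listing ID>/<slug>` pattern seen above, the ID can also be read from the parsed URL path. The `listing_id_from_url` helper is a hypothetical name introduced for illustration.

```python
from urllib.parse import urlparse

def listing_id_from_url(url):
    """Return the path segment that follows 'properties', or None if absent."""
    segments = [s for s in urlparse(url).path.split('/') if s]
    if 'properties' in segments:
        idx = segments.index('properties')
        if idx + 1 < len(segments):
            return segments[idx + 1]
    return None

url = 'https://www.rew.ca/properties/R2227563/401-e-55th-avenue-vancouver-bc'
print(listing_id_from_url(url))  # R2227563
```

Note that this returns whatever follows `properties`, so an area page such as `/properties/areas/vancouver-bc` would yield `areas`; a stricter check on the ID format would be needed to filter those out.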


### Address

In [11]:
# pull out the address, street, city, province and postal code
street_address = soup.find('span', itemprop='streetAddress').text.strip()
city = soup.find('span', itemprop='addressLocality').text.strip()
prov = soup.find('span', itemprop='addressRegion').text.strip()
postal = soup.find('span', itemprop='postalCode').text.strip()

print('Address: {}, {} {} {}'.format(street_address, city, prov, postal))

Address: 401 E 55th Avenue, Vancouver BC V5X 3P4


### List Price

In [14]:
list_price = soup.find('div', class_='propertyheader-price').text.strip()
# format as a number
list_price = int(list_price.strip('$').replace(',',''))

print('List price: {}'.format(list_price))

List price: 2088000
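
The `strip('$')` and `replace` calls above work when the price is a plain dollar figure. As a hedged sketch, assuming some listings might show non-numeric text in the price element (this has not been confirmed against the site), a small helper can return `None` instead of raising. `parse_price` is a hypothetical name.

```python
import re

def parse_price(text):
    """Convert a price string such as '$2,088,000' to an int, or None."""
    # first run of digits, allowing thousands separators
    match = re.search(r'\d[\d,]*', text)
    if match is None:
        return None
    return int(match.group().replace(',', ''))

print(parse_price('$2,088,000'))     # 2088000
print(parse_price('Contact agent'))  # None
```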


### House Type and Square Footage

In [48]:
# pull square footage from header
summary_bar = soup.find('div', class_='summarybar')
sqft_span = summary_bar.find(string=re.compile('Sqft')).find_parent().text
sqft = int(re.findall(r'\d+', sqft_span)[0])

# house type is easier to grab from the property overview table
property_overview = soup.find('caption', string=re.compile('Property Overview')).find_parent()
property_overview_dict = {}
for row in property_overview.find_all("tr"):
    key = row.find('th').text.strip()
    val = row.find('td').text.strip()
    property_overview_dict[key] = val

property_type = property_overview_dict['Property Type']

print('Property Type: {}\nSquare Footage: {} sqft'.format(property_type, sqft))

Property Type: House
Square Footage: 2537 sqft


### Number of Beds and Baths

In [20]:
beds_span = summary_bar.find(string=re.compile('Bed')).find_parent().text
beds = int(re.findall(r'\d+', beds_span)[0])

baths_span = summary_bar.find(string=re.compile('Bath')).find_parent().text
baths = int(re.findall(r'\d+', baths_span)[0])

print('Beds: {}\nBaths: {}'.format(beds, baths))

Beds: 6
Baths: 5
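
The bed, bath, and square-footage lookups all follow the same pattern, so they could be folded into one helper. The sketch below assumes the summary bar resembles the stand-in HTML in the snippet (the real page's markup is not reproduced here) and uses the built-in `html.parser` so that it runs without `html5lib`. `summary_number` is a hypothetical helper name.

```python
import re
from bs4 import BeautifulSoup

def summary_number(summary_bar, label):
    """Return the first integer in the element whose text contains `label`."""
    node = summary_bar.find(string=re.compile(label))
    if node is None:
        return None  # label not present in the summary bar
    match = re.search(r'\d+', node.find_parent().get_text())
    return int(match.group()) if match else None

# Stand-in markup (an assumption), using the values printed above.
html = ('<div class="summarybar"><ul>'
        '<li>6 Bed</li><li>5 Bath</li><li>2537 Sqft</li>'
        '</ul></div>')
bar = BeautifulSoup(html, 'html.parser').find('div', class_='summarybar')

print(summary_number(bar, 'Bed'))     # 6
print(summary_number(bar, 'Bath'))    # 5
print(summary_number(bar, 'Sqft'))    # 2537
print(summary_number(bar, 'Garage'))  # None
```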


### Description

In [21]:
soup.find('div', itemprop='description').text.strip()

'Custom built home, like new. Located in the popular South Vancouver Sunset area, this 6 bdrms plus den, 5 baths home bright and open layout. Home features: HRV, high ceiling, laminate hardwood flooring on the top floor, living room, dining room and den on the main floor, all granite countertops, tile flooring in kitchen and family room, stainless steel kitchen appliances plus legal suite in the basement and potential suite, 2 car garage and much more. Wlaking distance to Sunset Park and new Community Centre, Main Street and Fraser Street shops and Langara Golf Course. Move in and enjoy. All measurements are approximate and must be verified by the Buyer.'

### Community

In [26]:
community = property_overview_dict['Sub-Area/Community']

print('Community: {}'.format(community))

Community: South Vancouver


### Lot Size

In [31]:
depth = int(property_overview_dict['Depth'].strip())
frontage = int(property_overview_dict['Frontage'].strip())
lot_size = depth * frontage

print('Lot size: {} sqft ({} ft frontage)'.format(lot_size, frontage))

Lot size: 3696 sqft (33 ft frontage)
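
Multiplying depth by frontage only approximates the area of a roughly rectangular lot, and `int()` will raise if a value is missing or carries a decimal or unit. As a sketch under the assumption that such values can occur (not verified against the site), the hypothetical helpers `parse_feet` and `lot_size` below tolerate them. The depth of 112 ft is the value implied by the output above (3696 / 33).

```python
import re

def parse_feet(text):
    """Return the first number in `text` as a float, or None if there is none."""
    match = re.search(r'\d+(?:\.\d+)?', text.replace(',', ''))
    return float(match.group()) if match else None

def lot_size(depth_text, frontage_text):
    """Approximate lot area in square feet, or None if either side is missing."""
    depth = parse_feet(depth_text)
    frontage = parse_feet(frontage_text)
    if depth is None or frontage is None:
        return None
    return depth * frontage

print(lot_size('112', '33'))  # 3696.0
print(lot_size('', '33'))     # None
```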


### Agent & Broker

In [34]:
agent = property_overview_dict['Primary Agent']
broker = property_overview_dict['Primary Broker']

print('Broker: {}\nAgent: {}'.format(broker, agent))

Broker: RE/MAX Real Estate Services
Agent: Major Olak


### Features List

In [61]:
feature_table = soup.find('caption', string=re.compile('Special'))
feature_table.find_next('th').find_next('td').text.strip()

'ClthWsh/Dryr/Frdg/Stve/DW, Drapes/Window Coverings, Garage Door Opener, Heat Recov. Vent., Jetted Bathtub, Security System, Smoke Alarm'
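
With every field extracted, one possible next step is to gather them into a single record and hand it to pandas. The sketch below hard-codes the values printed in the cells above so that it runs on its own; in the notebook the same dictionary could be built from the variables directly. The column names are assumptions chosen for illustration.

```python
import pandas as pd

# Values copied from the outputs above; column names are illustrative.
record = {
    'listing_id': 'R2227563',
    'url': 'https://www.rew.ca/properties/R2227563/401-e-55th-avenue-vancouver-bc',
    'street_address': '401 E 55th Avenue',
    'city': 'Vancouver',
    'province': 'BC',
    'postal_code': 'V5X 3P4',
    'list_price': 2088000,
    'property_type': 'House',
    'sqft': 2537,
    'beds': 6,
    'baths': 5,
    'community': 'South Vancouver',
    'lot_size_sqft': 3696,
    'frontage_ft': 33,
    'broker': 'RE/MAX Real Estate Services',
    'agent': 'Major Olak',
}

# One row per listing, indexed by listing ID so later scrapes can be matched up.
listings = pd.DataFrame([record]).set_index('listing_id')
print(listings[['list_price', 'sqft', 'beds', 'baths']])
```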