## Introduction

This notebook documents the creation of a real estate price and information dataset. I use the BeautifulSoup and Selenium packages to scrape data from realestate.com.au and domain.com.au, then process it into a structured dataset for modelling.

In [2]:
import os
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By
import sys
import numpy as np
import pandas as pd
import regex as re
import requests
from math import ceil
from time import sleep, time, perf_counter
from random import randint
from IPython.display import clear_output

Open the Firefox browser and go to the website using Selenium.

In [218]:
# Open firefox browser
browser = webdriver.Firefox()

# Go to Realestate.com.au, NSW properties 1st page
browser.get('https://www.realestate.com.au/buy/in-nsw/list-1')

First we need to inspect the HTML of the site to find the tags that store the links to the detailed property information pages. Let's use BeautifulSoup to parse the page's HTML and find the details link for each property.

In [219]:
soup = BeautifulSoup(browser.page_source, 'html.parser')

# Get links to each house's webpage
listings = soup.find_all("a", class_="details-link residential-card__details-link")
listings[:5]

[<a class="details-link residential-card__details-link" href="/property-apartment-nsw-bondi-131687742"><span class="">11/28 Edward Street, Bondi</span></a>,
 <a class="details-link residential-card__details-link" href="/property-house-nsw-riverstone-131892442"><span class="">56 McCulloch Street, Riverstone</span></a>,
 <a class="details-link residential-card__details-link" href="/property-house-nsw-harrington+park-131784446"><span class="">5 Sir Warwick Fairfax Drive, Harrington Park</span></a>,
 <a class="details-link residential-card__details-link" href="/property-house-nsw-blacktown-131892406"><span class="">8 Kastelan Street, Blacktown</span></a>,
 <a class="details-link residential-card__details-link" href="/property-house-nsw-narrabeen-131783742"><span class="">11 Albemarle Street, Narrabeen</span></a>]

Each of these entries contains a 'class' and an 'href' attribute. What we are after is the href attribute, which gives us the suffix of the specific property's webpage link. That attribute can be extracted very simply with the following:

In [55]:
listings[0]['href']

'/property-unit-nsw-westmead-131784298'

To get the full link, we just need to append the suffix to the root URL, https://www.realestate.com.au. 

In [56]:
property_links = [ 'https://www.realestate.com.au' + suffix['href'] for suffix in listings ]
property_links[:5]

['https://www.realestate.com.au/property-unit-nsw-westmead-131784298',
 'https://www.realestate.com.au/property-unit-nsw-manly+vale-131599398',
 'https://www.realestate.com.au/property-unit-nsw-albury-131892214',
 'https://www.realestate.com.au/property-house-nsw-rozelle-131784258',
 'https://www.realestate.com.au/property-apartment-nsw-coolangatta-131892246']
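String concatenation works here because every href starts with a slash. As a slightly more defensive alternative, the standard library's `urllib.parse.urljoin` resolves a relative href against a base URL and leaves already-absolute links untouched. The sketch below is only illustrative; `BASE_URL` and `to_absolute` are names introduced here and are not part of the notebook.

```python
from urllib.parse import urljoin

BASE_URL = "https://www.realestate.com.au"  # root URL used throughout this notebook


def to_absolute(hrefs, base=BASE_URL):
    """Resolve each (possibly relative) href against the base URL."""
    return [urljoin(base, href) for href in hrefs]


sample_hrefs = [
    "/property-unit-nsw-westmead-131784298",  # relative, as found in the href attribute
    "https://www.realestate.com.au/property-house-nsw-rozelle-131784258",  # already absolute
]
absolute_links = to_absolute(sample_hrefs)
print(absolute_links)
```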

Now that we have all the property links from the first page, we need to repeat the same process for the 2nd, 3rd, 4th etc. pages of the realestate.com.au website (for NSW alone there are 55,000 homes).

For example, the snippet below is the HTML corresponding to the 'Next' button (we can see the link is just 'list-2', 'list-3' etc., so perhaps we could even hardcode this):

```HTML
<a href="/buy/property-house-in-nsw/list-2?includeSurrounding=false" class="rui-button-brand pagination__link-next" title="Go to Next Page" rel="next"><span class="pagination__next-label">Next</span><span class="rui-icon rui-icon-forward-small"></span></a>
```

Again, we can use BeautifulSoup's find_all method to look for 'a' tags with the class "rui-button-brand pagination__link-next".

In [220]:
# Get link to next page
nextpage = soup.find_all("a", class_="rui-button-brand pagination__link-next")
nextpage_link = 'https://www.realestate.com.au' + nextpage[0]['href']
nextpage_link

'https://www.realestate.com.au/buy/in-nsw/list-2'
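Because the next-page link differs from the first page's URL only in its `list-N` suffix, the page URLs could also be generated directly rather than scraped from the 'Next' button. A minimal sketch, assuming the `list-N` pattern holds for every page; `page_urls` is a hypothetical helper introduced for illustration:

```python
def page_urls(base, first_page, last_page):
    """Build listing-page URLs by appending the page number to a 'list-' prefix."""
    return [f"{base}{page}" for page in range(first_page, last_page + 1)]


nsw_pages = page_urls("https://www.realestate.com.au/buy/in-nsw/list-", 1, 3)
for url in nsw_pages:
    print(url)
```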

We actually don't need Selenium to scrape realestate.com.au, since it's mostly a static HTML page. I can even just loop through the page numbers (i.e. list-1, list-2, list-3, ...). Instead of Selenium I will use the Python requests library to fetch the pages.

Below is the same snippet from above that extracts the desired HTML tags from the site.

Note: a User-Agent header must be specified.

In [4]:
headers={'User-Agent': "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.132 Safari/537.36"}
req  = requests.get('https://www.realestate.com.au/buy/in-2155/list-1', headers=headers)
soup = BeautifulSoup(req.content, 'html.parser')

# Get links to each house's webpage
links = soup.find_all("a", class_="details-link residential-card__details-link")
links[:5]

[<a class="details-link residential-card__details-link" href="/property-apartment-nsw-rouse+hill-131704398"><span class="">66/97 Caddies Boulevard, Rouse Hill</span></a>,
 <a class="details-link residential-card__details-link" href="/property-house-nsw-north+kellyville-131803638"><span class="">10 Blue Wren Way, North Kellyville</span></a>,
 <a class="details-link residential-card__details-link" href="/property-apartment-nsw-rouse+hill-131913202"><span class="">5/93 Caddies Boulevard, Rouse Hill</span></a>,
 <a class="details-link residential-card__details-link" href="/property-house-nsw-kellyville-131911202"><span class="">10 Blackham Road, Kellyville</span></a>,
 <a class="details-link residential-card__details-link" href="/property-house-nsw-kellyville-131910858"><span class="">35 Glenrowan Avenue, Kellyville</span></a>]

We need some code to better handle retries in the web scrape, since errors can happen at any point. Ideally the code should also retain the list obtained so far and save the page number and postcode that the web crawler is up to.

In [5]:
# This code comes from: https://www.peterbe.com/plog/best-practice-with-retries-with-requests
# Basically we 'replace requests.get(...)', with 'requests_retry_session().get(...)'
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry


def requests_retry_session(
    retries=3,
    backoff_factor=0.3,
    status_forcelist=(500, 502, 504),
    session=None,
):
    session = session or requests.Session()
    retry = Retry(
        total=retries,
        read=retries,
        connect=retries,
        backoff_factor=backoff_factor,
        status_forcelist=status_forcelist,
    )
    adapter = HTTPAdapter(max_retries=retry)
    session.mount('http://', adapter)
    session.mount('https://', adapter)
    return session

In [6]:
def get_property_links(url, max_pages, headers):
    """Collect property links from pages 1..max_pages of a listing URL prefix."""
    property_links = []

    for i in range(max_pages):
        # Page numbers are 1-based: list-1, list-2, ...
        page_url = url + str(i + 1) + '?includeSurrounding=false'
        req = requests_retry_session().get(page_url, headers=headers)
        soup = BeautifulSoup(req.content, 'html.parser')
        listings = soup.find_all("a", class_="details-link residential-card__details-link")
        page_link = ['https://www.realestate.com.au' + row['href'] for row in listings]
        property_links.extend(page_link)
        print('page' + str(i + 1))
        # Random pause between requests
        sleep(np.random.lognormal(0, 1))

    return property_links
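The log-normal delay used between requests has a median of one second but a long right tail, so an occasional pause can be much longer, or close to zero. One option, sketched below as an assumption rather than something this notebook does, is to clamp the delay to a range. `polite_delay` and its bounds are hypothetical; `random.lognormvariate` is the standard-library counterpart of `np.random.lognormal`.

```python
import random


def polite_delay(mu=0.0, sigma=1.0, low=0.5, high=10.0, rng=random):
    """Draw a log-normal delay in seconds and clamp it to [low, high]."""
    delay = rng.lognormvariate(mu, sigma)
    return min(max(delay, low), high)


rng = random.Random(0)  # seeded only so the example is repeatable
delays = [polite_delay(rng=rng) for _ in range(1000)]
print(min(delays), max(delays))
```

Inside the loop, `sleep(polite_delay())` would then take the place of the unclamped draw.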

In [72]:
headers={'User-Agent': "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.132 Safari/537.36"}

start=perf_counter()
property_links = get_property_links('https://www.realestate.com.au/buy/in-2000/list-', 3,headers)
end=perf_counter()
end-start

page1
page2
page3


5.820175299999846

In [73]:
property_links[:5]

['https://www.realestate.com.au/property-apartment-nsw-sydney-131803558',
 'https://www.realestate.com.au/property-apartment-nsw-sydney-131800378',
 'https://www.realestate.com.au/property-apartment-nsw-sydney-131798910',
 'https://www.realestate.com.au/property-apartment-nsw-sydney-131692682',
 'https://www.realestate.com.au/property-apartment-nsw-sydney-131787250']

One thing to make sure of is that the crawler doesn't try to go beyond the last page. If it does, the site either starts showing surrounding suburbs or just displays a blank page. We can use the results summary at the top of the list (e.g. '25 of 48 results') to calculate the maximum number of pages to visit.

In [32]:
req_test  = requests_retry_session().get( 'https://www.realestate.com.au/buy/in-pyrmont/list-1?includeSurrounding=false' ,headers=headers)
soup = BeautifulSoup(req_test.content, 'html.parser')
html_section = soup.find("div", class_="results-set-header__summary")
print(html_section.text)
print(re.findall(r"(\d+) result", html_section.text)[0])

1-25 of 50 results
50


In [33]:
def get_max_pages(url,headers):
    req  = requests_retry_session().get( url + '1?includeSurrounding=false' ,headers=headers)
    soup = BeautifulSoup(req.content, 'html.parser')
    results = soup.find("div", class_="results-set-header__summary")
    num_results = re.findall(r"(\d+) result", results.text)[0]
    max_pages = ceil(int(num_results)/25)
    return max_pages
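Two edge cases seem worth guarding against here, although neither was checked against the live site: a summary whose count contains a thousands separator (the pattern `(\d+) result` would then capture only the trailing digits), and a page with no summary at all. The sketch below works on the summary text only; `pages_from_summary` and `RESULTS_PER_PAGE` are names introduced for illustration.

```python
import re
from math import ceil

RESULTS_PER_PAGE = 25  # page size implied by the '1-25 of 50 results' summary


def pages_from_summary(summary_text, per_page=RESULTS_PER_PAGE):
    """Return the number of result pages implied by text like '1-25 of 50 results'."""
    if not summary_text:
        return 0  # no summary found: treat as zero results
    match = re.search(r"(\d[\d,]*)\s+results?", summary_text)
    if match is None:
        return 0
    total = int(match.group(1).replace(",", ""))  # drop any thousands separators
    return ceil(total / per_page)


print(pages_from_summary("1-25 of 50 results"))
print(pages_from_summary("1-25 of 1,234 results"))
print(pages_from_summary(None))
```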

Realestate.com.au does not display all 55,000 homes when I filter by NSW; it limits the results to 80 pages. At 25 listings per page, that is only 2,000 homes, which is not a large enough dataset for us. One thing we can do is get a full NSW postcode list and run a search for each postcode (most likely no postcode has more than 2,000 ads). I found a postcode list that's been generously made public by Matthew Proctor here: https://www.matthewproctor.com/australian_postcodes

After testing manually, it looks like searching by postcode gives a URL of the form:

URL = 'https://www.realestate.com.au/buy/in-' + POSTCODE + '/list-' + PAGE_NUMBER

Below is code to loop through each postcode and extract each link.

In [34]:
# Get Postcodes
postcodes = pd.read_csv("./data/australian_postcodes.csv")
postcode_list=postcodes[postcodes.type=="Delivery Area"].postcode
nsw_postcodes = postcode_list[(postcode_list>=2000) & (postcode_list<=2999)].unique()
nsw_postcodes[:20]

array([2000, 2006, 2007, 2008, 2009, 2010, 2011, 2015, 2016, 2017, 2018,
       2019, 2020, 2021, 2022, 2023, 2024, 2025, 2026, 2027], dtype=int64)

For each postcode, get the max number of pages from the results count, and then loop through to get all the property links. 

In [35]:
nsw_property_links = []
num_requests = 0
start_time = time()
i=0
headers={'User-Agent': "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.132 Safari/537.36"}


for postcode in nsw_postcodes:
    url = 'https://www.realestate.com.au/buy/in-' + str(postcode) + '/list-'
    max_pages = get_max_pages(url, headers=headers) # get the number of pages of results
    num_requests += 1
    print('Postcode: '+ str(postcode) + ', Total pages: ' + str(max_pages))
    
    if max_pages == 0: # if there are 0 results, skip to the next postcode
        continue
    
    sleep(np.random.lognormal(0,1))
    
    nsw_property_links.extend(get_property_links(url, max_pages,headers)) # get all property links from each page, and add to end of list
    num_requests += max_pages
    elapsed_time = time() - start_time
    print('Request: {}; Frequency: {} requests/s'.format(num_requests, num_requests/elapsed_time))
    
    i+=1
    if i>=5:
        clear_output(wait=True)
        i=0
    

Postcode: 2127, Total pages: 8
page1
page2
page3
page4
page5
page6
page7
page8
Request: 388; Frequency: 0.3932425589487163 requests/s
Postcode: 2128, Total pages: 1
page1
Request: 390; Frequency: 0.39384666372771215 requests/s
Postcode: 2130, Total pages: 1


ConnectionError: HTTPSConnectionPool(host='www.realestate.com.au', port=443): Max retries exceeded with url: /buy/in-2130/list-1?includeSurrounding=false (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x000001E1A12F3438>: Failed to establish a new connection: [Errno 11001] getaddrinfo failed'))
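A failure like the one above loses the crawl's position unless progress is written to disk as it goes. The sketch below shows one way to do that, assuming a JSON checkpoint file. The file layout and the names `save_checkpoint`, `load_checkpoint` and `crawl_with_checkpoint` are hypothetical, and `fetch_links` stands in for the per-postcode scraping done by `get_max_pages` and `get_property_links`.

```python
import json
import os
import tempfile


def save_checkpoint(path, links, done_postcodes):
    """Write the links collected so far and the finished postcodes to disk."""
    tmp_path = path + ".tmp"
    with open(tmp_path, "w") as fp:
        json.dump({"links": links, "done": done_postcodes}, fp)
    os.replace(tmp_path, path)  # swap in the new file only once it is fully written


def load_checkpoint(path):
    """Return (links, done_postcodes), or two empty lists if no checkpoint exists."""
    if not os.path.exists(path):
        return [], []
    with open(path) as fp:
        state = json.load(fp)
    return state["links"], state["done"]


def crawl_with_checkpoint(postcodes, fetch_links, path):
    """Crawl each postcode once, skipping any already recorded in the checkpoint."""
    links, done = load_checkpoint(path)
    for postcode in postcodes:
        if postcode in done:
            continue
        links.extend(fetch_links(postcode))  # may raise; earlier progress is already saved
        done.append(postcode)
        save_checkpoint(path, links, done)
    return links


# Demonstration with a stand-in fetcher instead of real HTTP requests.
def fake_fetch(postcode):
    return [f"link-{postcode}-{n}" for n in range(2)]


checkpoint_path = os.path.join(tempfile.mkdtemp(), "crawl_state.json")
first_run = crawl_with_checkpoint([2000, 2006], fake_fetch, checkpoint_path)
resumed = crawl_with_checkpoint([2000, 2006, 2007], fake_fetch, checkpoint_path)
print(len(first_run), len(resumed))
```

With the real postcode array, each value would need converting with `int(postcode)` first, since NumPy integers are not JSON serializable.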

In [78]:
nsw_property_links[:10]

['https://www.realestate.com.au/property-apartment-nsw-sydney-131803558',
 'https://www.realestate.com.au/property-apartment-nsw-sydney-131800378',
 'https://www.realestate.com.au/property-apartment-nsw-sydney-131798910',
 'https://www.realestate.com.au/property-apartment-nsw-sydney-131692682',
 'https://www.realestate.com.au/property-apartment-nsw-sydney-131787250',
 'https://www.realestate.com.au/property-apartment-nsw-sydney-131891918',
 'https://www.realestate.com.au/property-apartment-nsw-sydney-131778582',
 'https://www.realestate.com.au/property-apartment-nsw-walsh+bay-131878854',
 'https://www.realestate.com.au/property-apartment-nsw-the+rocks-131583162',
 'https://www.realestate.com.au/property-unit-nsw-sydney-131774526']

In [36]:
len(nsw_property_links)

5816

The code previously failed at ~8000 links. Now that it's fixed, we should also collect the basic info (number of bedrooms, car parks, bathrooms, price and floor area) in the same request that gets the links.

In [81]:
import pickle
with open("./data/nsw_property_links.txt", "wb") as fp:   #Pickling
    pickle.dump(nsw_property_links, fp)

# with open("./data/nsw_property_links.txt", "rb") as fp:   # Unpickling
#     nsw_property_links = pickle.load(fp)

In [29]:
headers={'User-Agent': "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.132 Safari/537.36"}

# get_max_pages('https://www.realestate.com.au/buy/in-2344/list-',headers=headers)
req  = requests_retry_session().get( 'https://www.realestate.com.au/buy/in-20000/list-1?includeSurrounding=false' ,headers=headers)
soup = BeautifulSoup(req.content, 'html.parser')
results = soup.find("div", class_="results-set-header__summary")
num_results = re.findall(r"(\d+) result", results.text)[0]
num_results
max_pages = ceil(int(num_results)/25)
max_pages
postcode=2344
url = 'https://www.realestate.com.au/buy/in-' + str(postcode) + '/list-'
max_pages = get_max_pages(url, headers=headers) # get the number of pages of results
print('Postcode: '+ str(postcode) + ', Total pages: ' + str(max_pages))

Postcode: 2344, Total pages: 1


For each page, we now have to extract the relevant features. The cells below test out the tags to search for each property attribute.

In [38]:
# Get links to each house's webpage
links = soup.find_all("a", class_="details-link residential-card__details-link")
links[0]['href']

'/property-apartment-nsw-rouse+hill-131704398'

In [12]:
# Get Prices of Property
prices = soup.find_all("div", class_="residential-card__price")
prices[0].text

'$800,000 - $850,000'

In [17]:
# Get Addresses of Property
addresses = soup.find_all("a", class_="details-link residential-card__details-link")
addresses[0].text

'66/97 Caddies Boulevard, Rouse Hill'

In [62]:
# Get number of bedrooms in Property
beds = soup.find_all("span", class_="general-features__icon general-features__beds")
beds[0].text

' 3'

In [63]:
# Get number of bathrooms in Property
baths = soup.find_all("span", class_="general-features__icon general-features__baths")
baths[0].text

' 2'

In [64]:
# Get number of parking spots in Property
cars = soup.find_all("span", class_="general-features__icon general-features__cars")
cars[0].text

' 2'

In [68]:
# Get floor area of Property
floor_areas = soup.find_all("span", class_="property-size__icon property-size__building")
floor_areas[1].text

IndexError: list index out of range

In [70]:
# Get land area of Property
floor_areas = soup.find_all("span", class_="property-size__icon property-size__land")
floor_areas[0].text

'\xa0741'

In [37]:
# Get Property Type
property_types = soup.find_all("span", class_="residential-card__property-type")
property_types[0].text

'Apartment'
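The raw strings above still need converting to numbers before modelling. A minimal sketch of that cleaning step follows, assuming only the formats seen in these outputs: a dollar range such as '$800,000 - $850,000', counts with leading whitespace, and areas prefixed by a non-breaking space. `parse_price_range` and `parse_number` are hypothetical helpers, and a price field without any dollar amount simply yields no values.

```python
import re


def parse_price_range(text):
    """Return (low, high) in dollars from text like '$800,000 - $850,000'."""
    amounts = [int(m.replace(",", "")) for m in re.findall(r"\$(\d[\d,]*)", text)]
    if not amounts:
        return None, None  # no dollar amount in the text
    return min(amounts), max(amounts)


def parse_number(text):
    """Return the first number found in the text as a float, or None."""
    match = re.search(r"\d[\d,]*(?:\.\d+)?", text)
    if match is None:
        return None
    return float(match.group(0).replace(",", ""))


print(parse_price_range("$800,000 - $850,000"))
print(parse_number(" 3"))
print(parse_number("\xa0741"))
```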