# SgLandScape

Scrape the list of condo page links from the main 99.co website.

From the condo page links, we can find more details for each condo, including:
* Condo name
* Address
* Recent transaction prices and property sizes


Import the libraries we need.  

We use BeautifulSoup to parse the webpage HTML.


In [1]:
import requests
import string
import time
from datetime import datetime
from bs4 import BeautifulSoup

Let's test a simple page request, see what we get, and work out how to parse it.

In [2]:
r = requests.get("https://www.99.co/singapore/condos-apartments?alphabet=a&page=1")
r.encoding

'UTF-8'

The body of the HTTP response is in r.text.

In [3]:
r.text



In [4]:
len(r.text)

133850

Now we use BeautifulSoup to parse the HTML.

In [5]:
soup = BeautifulSoup(r.text, 'html5lib')

Hey soup, find me all the div elements of class 'content-section'.  The condo details are in these sections.

In [6]:
props = soup.find_all('div', class_='content-section')

How many did soup find?

In [7]:
len(props)

20

Show me the contents of the first 'content-section'.

In [8]:
props[0]

<div class="content-section"><a href="/singapore/condos-apartments/astrid-meadows" style="color: #595959;" target="_self">
            <div class="content content--padding">
              <h4 class="content-title title--light nnBoldHeader nnBoldHeader--h4">Astrid Meadows</h4>
            </div>
            <div class="content-body">
              <div class="clearfix">
                <div class="row">
                  <div class="col-sm-4 col-xs-4">
                    <div class="thumbnail" style="background-image: url('//dwk5ggjgl5r05.cloudfront.net/J56P7Ry3n6nq7TNXhh78LT?width=300&amp;height=300&amp;mode=crop&amp;sampling=lanczos&amp;quality=70&amp;version=8&amp;signature=1575fc72be7cc929eab6dc9ed0e0c5426b3b4938');"></div>
                  </div>
                  <ul class="list-unstyled col-xs-8 clearfix development-info">
                    <li class="col-xs-12"><span class="room-feature-key">Type: <b>condo</b></span>
                    </li>
                    <li class="c

I think I have what I want: the condo page link.

In [9]:
atag = props[0].find('a')
atag['href']

'/singapore/condos-apartments/astrid-meadows'

Now let's formalise this and turn the steps for getting a condo page link out of an HTML section into a function, so we can reuse them with one call.

In [10]:
def get_list_of_property_links(props):
    props_list = []
    for prop in props:
        atag = prop.find('a')
        if atag is not None:
            props_list.append(atag['href'])
    return props_list

property_links = get_list_of_property_links(props)
print(property_links)


['/singapore/condos-apartments/astrid-meadows', '/singapore/condos-apartments/ao-jiang-apartments', '/singapore/condos-apartments/affluence-court', '/singapore/condos-apartments/ampas-apartment', '/singapore/condos-apartments/acacia-lodge', '/singapore/condos-apartments/advance-apartments', '/singapore/condos-apartments/airview-towers', '/singapore/condos-apartments/asimont-barker', '/singapore/condos-apartments/adam-green', '/singapore/condos-apartments/ardmore-residence', '/singapore/condos-apartments/aalto', '/singapore/condos-apartments/adam-park-condominium', '/singapore/condos-apartments/airstream', '/singapore/condos-apartments/avon-park', '/singapore/condos-apartments/amaranda-gardens', '/singapore/condos-apartments/astor-green', '/singapore/condos-apartments/axis-siglap', '/singapore/condos-apartments/auralis', '/singapore/condos-apartments/arc-at-tampines', '/singapore/condos-apartments/ardmore-3']


Okay, now we're ready to grab all condos from A to Z.  So let's prepare a list of the letters a-z so we can request the pages for each letter.

In [11]:
atoz = string.ascii_lowercase
for i in range(0,26):
    print(atoz[i:i+1])

a
b
c
d
e
f
g
h
i
j
k
l
m
n
o
p
q
r
s
t
u
v
w
x
y
z


Again, let's put this into a function that returns a list of the letters a-z.

In [12]:
def gen_atoz_list():
    # list() splits the string into its single-character letters
    return list(string.ascii_lowercase)


We're now putting together all the little pieces. 

Here we make it easy to generate a full page URL given a letter and a page number.

In [13]:
#r = requests.get("https://www.99.co/singapore/condos-apartments?alphabet=a&page=100")

def gen_probe_url(alpha, page_num):
    url_template = 'https://www.99.co/singapore/condos-apartments?alphabet={0}&page={1}'
    probe_url = url_template.format(alpha, page_num)
    return probe_url
    
    
assert( gen_probe_url('b', 2) == 'https://www.99.co/singapore/condos-apartments?alphabet=b&page=2' )
assert( gen_probe_url('a', 100) == 'https://www.99.co/singapore/condos-apartments?alphabet=a&page=100' )
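
As a possible variation on the template approach, the query string can be built with the standard library's `urlencode`, which escapes values for us. This is only a sketch: the helper name `gen_probe_url_encoded` and the `BASE_URL` constant are made up for illustration, and the base URL is the one used in this notebook.

```python
from urllib.parse import urlencode

# Listing URL used in this notebook; the constant name is hypothetical.
BASE_URL = 'https://www.99.co/singapore/condos-apartments'

def gen_probe_url_encoded(alpha, page_num):
    # urlencode escapes each value and joins the key=value pairs with '&'
    query = urlencode({'alphabet': alpha, 'page': page_num})
    return '{0}?{1}'.format(BASE_URL, query)

print(gen_probe_url_encoded('b', 2))
```

For plain letters and page numbers this produces the same URL as `gen_probe_url`. `requests.get` also accepts a `params` dictionary, which would do the same encoding for us.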

Also, we don't want to disrupt the 99.co website too much, so we want to slow down our requests.  Does the sleep function work?

In [14]:
print('Starting sleep')
time.sleep(5)
print('Waking')

Starting sleep
Waking
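
If a fixed pause feels too mechanical, one option is to add a small random extra to each wait. This is a sketch and not something the crawl below relies on: `polite_delay` and `polite_sleep` are hypothetical helper names, and the 5 second base simply mirrors the value tested above.

```python
import random
import time

def polite_delay(base_seconds=5.0, jitter_seconds=1.0):
    """Return base_seconds plus a random extra of up to jitter_seconds."""
    return base_seconds + random.uniform(0.0, jitter_seconds)

def polite_sleep(base_seconds=5.0, jitter_seconds=1.0, sleep=time.sleep):
    """Sleep for a jittered delay and return how long we waited."""
    delay = polite_delay(base_seconds, jitter_seconds)
    sleep(delay)
    return delay

# Tiny values here so the demonstration finishes quickly
print('Starting sleep')
waited = polite_sleep(0.01, 0.01)
print('Waking after {0:.3f}s'.format(waited))
```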


A big step here.  Now we're seriously productionising all the little pieces.

Given the HTML of a listing page, get all the condo page links from it.

In [15]:
def get_property_links_from_page(page_text):
    soup = BeautifulSoup(page_text, 'html5lib')
    props = soup.find_all('div', class_='content-section')
    property_links = get_list_of_property_links(props)
    return property_links
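
Before hitting the real site, the parsing step can be checked offline against a small hand-written snippet. This is a sketch under a few assumptions: `SAMPLE_HTML` is invented markup that only imitates the 'content-section' structure seen earlier, and the parser defaults to Python's built-in 'html.parser' so the check does not need html5lib installed.

```python
from bs4 import BeautifulSoup

# Hand-written stand-in for a listing page; the real markup is richer.
SAMPLE_HTML = """
<html><body>
  <div class="content-section">
    <a href="/singapore/condos-apartments/astrid-meadows">Astrid Meadows</a>
  </div>
  <div class="content-section">
    <p>A section without any link</p>
  </div>
  <div class="other-section">
    <a href="/not-a-condo">Ignored</a>
  </div>
</body></html>
"""

def get_list_of_property_links(props):
    props_list = []
    for prop in props:
        atag = prop.find('a')
        if atag is not None:  # skip sections that contain no link
            props_list.append(atag['href'])
    return props_list

def get_property_links_from_page(page_text, parser='html.parser'):
    soup = BeautifulSoup(page_text, parser)
    props = soup.find_all('div', class_='content-section')
    return get_list_of_property_links(props)

links = get_property_links_from_page(SAMPLE_HTML)
print(links)
```

Only the first section contributes a link: the second has no anchor tag, and the third is not of class 'content-section'.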
        

Here we wrap the request for each page in a try/except block inside a loop, making up to 3 attempts.

In [16]:
def get_page_contents(alpha, probe_url):
    
    num_retries = 3
    while num_retries > 0:
        try:
            print('{0} Probing {1}, {2}'.format(datetime.now().time(), alpha, probe_url))
            r = requests.get(probe_url)
            return r
        except requests.exceptions.RequestException:
            # network problem: wait a little, then try again
            num_retries = num_retries - 1
            time.sleep(5)
    
    # if all 3 attempts raise an exception, return None
    return None
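
The same retry idea can be tried out without touching the network by wrapping any callable. The sketch below is an illustration, not the function used in this notebook: `call_with_retries` and `flaky_fetch` are hypothetical names, and the wait that grows after each failure is a swapped-in variation on the fixed 5 second pause.

```python
import time

def call_with_retries(func, num_retries=3, wait_seconds=5, sleep=time.sleep):
    """Call func() up to num_retries times; return None if every attempt raises."""
    for attempt in range(1, num_retries + 1):
        try:
            return func()
        except Exception as exc:
            print('Attempt {0} failed: {1}'.format(attempt, exc))
            if attempt < num_retries:
                # wait longer after each failed attempt
                sleep(wait_seconds * attempt)
    return None

# Offline demonstration: a stand-in for the network call that fails twice
attempts = {'count': 0}

def flaky_fetch():
    attempts['count'] += 1
    if attempts['count'] < 3:
        raise ConnectionError('simulated network failure')
    return 'page contents'

# Pass a no-op sleep so the demonstration does not actually wait
result = call_with_retries(flaky_fetch, sleep=lambda seconds: None)
print(result, attempts['count'])
```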
    

## Putting this all together

Finally, we have all the small working pieces to put into one main loop. 
We're going to crawl the condo page links for every letter from a to z. 
For each page response, we'll parse the page for condo page links. 
We'll collect the condo page links in a list that we print out at the end. 


In [17]:
atoz = gen_atoz_list()

all_property_links = []
for alpha in atoz:
    for pagenum in range(1, 100):
        probe_url = gen_probe_url(alpha, pagenum)
        r = get_page_contents(alpha, probe_url)
        if r is None or len(r.text) < 100:
            break

        property_links = get_property_links_from_page(r.text)
        all_property_links.extend(property_links)
        print('Total Property Links = {0}'.format(len(all_property_links)))

        # if this page added no new links, move on to the next letter
        if not property_links:
            break
        time.sleep(5)

print(all_property_links)

03:58:30.236054 Probing a, https://www.99.co/singapore/condos-apartments?alphabet=a&page=1
Total Property Links = 20
03:58:36.329016 Probing a, https://www.99.co/singapore/condos-apartments?alphabet=a&page=2
Total Property Links = 40
03:58:42.513843 Probing a, https://www.99.co/singapore/condos-apartments?alphabet=a&page=3
Total Property Links = 60
03:58:48.689884 Probing a, https://www.99.co/singapore/condos-apartments?alphabet=a&page=4
Total Property Links = 76
03:58:54.693916 Probing a, https://www.99.co/singapore/condos-apartments?alphabet=a&page=5
Total Property Links = 76
03:58:55.679825 Probing b, https://www.99.co/singapore/condos-apartments?alphabet=b&page=1
Total Property Links = 96
03:59:01.798232 Probing b, https://www.99.co/singapore/condos-apartments?alphabet=b&page=2
Total Property Links = 116
03:59:07.869332 Probing b, https://www.99.co/singapore/condos-apartments?alphabet=b&page=3
Total Property Links = 136
03:59:14.013837 Probing b, https://www.99.co/singapore/condos-

Total Property Links = 1059
04:04:52.782095 Probing m, https://www.99.co/singapore/condos-apartments?alphabet=m&page=2
Total Property Links = 1079
04:04:59.086892 Probing m, https://www.99.co/singapore/condos-apartments?alphabet=m&page=3
Total Property Links = 1099
04:05:05.113854 Probing m, https://www.99.co/singapore/condos-apartments?alphabet=m&page=4
Total Property Links = 1119
04:05:11.342132 Probing m, https://www.99.co/singapore/condos-apartments?alphabet=m&page=5
Total Property Links = 1139
04:05:17.413846 Probing m, https://www.99.co/singapore/condos-apartments?alphabet=m&page=6
Total Property Links = 1155
04:05:23.400086 Probing m, https://www.99.co/singapore/condos-apartments?alphabet=m&page=7
Total Property Links = 1155
04:05:24.345335 Probing n, https://www.99.co/singapore/condos-apartments?alphabet=n&page=1
Total Property Links = 1175
04:05:30.372886 Probing n, https://www.99.co/singapore/condos-apartments?alphabet=n&page=2
Total Property Links = 1195
04:05:36.493662 Prob

Total Property Links = 2204
04:11:33.301857 Probing v, https://www.99.co/singapore/condos-apartments?alphabet=v&page=1
Total Property Links = 2224
04:11:39.589430 Probing v, https://www.99.co/singapore/condos-apartments?alphabet=v&page=2
Total Property Links = 2244
04:11:45.642219 Probing v, https://www.99.co/singapore/condos-apartments?alphabet=v&page=3
Total Property Links = 2260
04:11:51.837937 Probing v, https://www.99.co/singapore/condos-apartments?alphabet=v&page=4
Total Property Links = 2260
04:11:53.063113 Probing w, https://www.99.co/singapore/condos-apartments?alphabet=w&page=1
Total Property Links = 2280
04:11:59.337519 Probing w, https://www.99.co/singapore/condos-apartments?alphabet=w&page=2
Total Property Links = 2300
04:12:05.629144 Probing w, https://www.99.co/singapore/condos-apartments?alphabet=w&page=3
Total Property Links = 2320
04:12:11.711139 Probing w, https://www.99.co/singapore/condos-apartments?alphabet=w&page=4
Total Property Links = 2329
04:12:18.266181 Prob
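
The stopping rule used in the crawl (move on to the next letter once a page contributes no links) can be exercised offline with a fake page source. This is a sketch: `crawl_letter`, `fake_fetch` and `FAKE_PAGES` are hypothetical names, and the fake data is invented to resemble a letter with two full pages and one partial page.

```python
def crawl_letter(fetch_links, max_pages=99):
    """Collect links page by page until a page yields none or max_pages is reached."""
    collected = []
    for page_num in range(1, max_pages + 1):
        links = fetch_links(page_num)
        if not links:
            break  # an empty page means this letter is exhausted
        collected.extend(links)
    return collected

# Invented stand-in for one letter's listing pages
FAKE_PAGES = {
    1: ['/condo-{0}'.format(i) for i in range(0, 20)],
    2: ['/condo-{0}'.format(i) for i in range(20, 40)],
    3: ['/condo-{0}'.format(i) for i in range(40, 56)],
}

def fake_fetch(page_num):
    # pages past the last one come back empty
    return FAKE_PAGES.get(page_num, [])

links = crawl_letter(fake_fetch)
print(len(links))
```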

In [18]:
print(all_property_links)

['/singapore/condos-apartments/astrid-meadows', '/singapore/condos-apartments/ao-jiang-apartments', '/singapore/condos-apartments/affluence-court', '/singapore/condos-apartments/ampas-apartment', '/singapore/condos-apartments/acacia-lodge', '/singapore/condos-apartments/advance-apartments', '/singapore/condos-apartments/airview-towers', '/singapore/condos-apartments/asimont-barker', '/singapore/condos-apartments/adam-green', '/singapore/condos-apartments/ardmore-residence', '/singapore/condos-apartments/aalto', '/singapore/condos-apartments/adam-park-condominium', '/singapore/condos-apartments/airstream', '/singapore/condos-apartments/avon-park', '/singapore/condos-apartments/amaranda-gardens', '/singapore/condos-apartments/astor-green', '/singapore/condos-apartments/axis-siglap', '/singapore/condos-apartments/auralis', '/singapore/condos-apartments/arc-at-tampines', '/singapore/condos-apartments/ardmore-3', '/singapore/condos-apartments/alpha-apartments', '/singapore/condos-apartments

Dump all the property page links we found to a file.

In [19]:
with open("all_links.txt", "w") as f:
    for link in all_property_links:
        condo_link = "https://www.99.co" + link + "\n"
        f.write(condo_link)
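
A possible refinement before writing the file is to drop repeated links and build the absolute URLs with `urljoin` rather than string concatenation. This is a sketch: the helper name `to_absolute_unique` is made up, and whether the crawl actually produces duplicates is an assumption that has not been checked here.

```python
from urllib.parse import urljoin

# Site root used in this notebook; the constant name is hypothetical.
BASE = 'https://www.99.co'

def to_absolute_unique(links, base=BASE):
    # dict.fromkeys keeps the first occurrence of each link, in order
    unique_links = list(dict.fromkeys(links))
    return [urljoin(base, link) for link in unique_links]

sample_links = [
    '/singapore/condos-apartments/astrid-meadows',
    '/singapore/condos-apartments/aalto',
    '/singapore/condos-apartments/astrid-meadows',  # repeated on purpose
]

for url in to_absolute_unique(sample_links):
    print(url)
```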
