[👈 Chapter 24](24-functions.ipynb) -
[🏠 To index](README.md)

# 25 - Web scraping with BeautifulSoup

In [5]:
# Web scraping is the black art of transforming a messy webpage into nicely structured data
# Web pages are structured as well, but they're usually not as tidily structured as a CSV or JSON file
# This means you need to create a structure yourself, and it also means you need to clean and tidy
# up the data contained in the webpages.
#
# Web scraping also raises a couple of ethical and legal issues. Scraping is not against the law,
# but it can definitely fall into a legal grey area, depending on what you do with the data.
# 
# Leaving those issues aside, let's scrape a website! We're going to transform
# listings on kamernet.nl into structured data. Our starting point is the listing of recent rooms in Utrecht:
# https://kamernet.nl/huren/kamer-utrecht
# 
# Before we can do any scraping we need to get the HTML using the requests library
import requests
req = requests.get("https://kamernet.nl/huren/kamer-utrecht")
html = req.text

# Because getting an HTML page can be slow, and sites are eager to block you for repeated requests
# to the same page, it's safer and faster to save the data to a file, so we can use that
# instead of endlessly downloading the same URL.
with open("rooms-utrecht.html", "w") as file:
    file.write(html)
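
The save-to-file idea above can be wrapped in a small helper that only downloads when no saved copy exists yet. This is a minimal sketch, not part of the original notebook: `load_html()` and `fetch_with_requests()` are hypothetical names, and the 10-second timeout is an arbitrary choice.

```python
from pathlib import Path


def fetch_with_requests(url):
    """Download a page and return its HTML (hypothetical helper)."""
    import requests  # imported here so the caching logic below works without it

    response = requests.get(url, timeout=10)  # timeout value is an assumption
    response.raise_for_status()  # stop on 4xx/5xx instead of saving an error page
    return response.text


def load_html(url, cache_path, fetch=fetch_with_requests):
    """Return the saved copy if there is one, otherwise download and save it."""
    path = Path(cache_path)
    if path.exists():
        return path.read_text(encoding="utf-8")
    html = fetch(url)
    path.write_text(html, encoding="utf-8")
    return html
```

Calling `load_html("https://kamernet.nl/huren/kamer-utrecht", "rooms-utrecht.html")` would then hit the website only the first time. Delete the file when you want a fresh copy.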

In [6]:
# Now we can start scraping the HTML. For this process the Developer Tools in either Chrome or Firefox are 
# indispensable. You use those tools to select (Right click and 'Inspect element') elements and look at 
# attributes like classes or ids you can use to write selectors.
#
# We're going to first get the rooms from the page and get their title. Note how we write this in a function.
# Why? If we're going to change from the static file to the live website we can easily change the code calling
# the function. Our function stays the same, it just accepts HTML.
#
# If you don't have the BeautifulSoup library yet, install it together with the lxml parser
# using `pip install beautifulsoup4 lxml` on the terminal or the Anaconda prompt

from bs4 import BeautifulSoup # We import the BeautifulSoup library here

def get_rooms(html):
    soup = BeautifulSoup(html, "lxml") # The 'lxml' argument is called the parser, you can try 'html5lib' here as well
    rooms = [] # Our room data is going here
    
    # 'soup' is now our parsed HTML. We use the select() method to get the room elements using a
    # CSS selector for all elements with the 'rowSearchResultRoom' class
    rooms_list = soup.select(".rowSearchResultRoom")
    
    print(f"Found {len(rooms_list)} rooms")
    
    # Loop over our rooms
    for room in rooms_list:
        # 'room' is a BeautifulSoup Tag object, which also has the select() method
        title = room.select(".tile-title") # Kamernet has decent class names indicating different data
        
        # select() *always* returns a list, even if there's just one element! 
        # So we need to get the first element in the list
        title = title[0]
        
        # Now we can use the get_text() method to get the text in the element
        title = title.get_text()
        
        # Note that all these methods can be chained, so this is the same (and a lot shorter!)
        title = room.select(".tile-title")[0].get_text()
        
        # And add it to the rooms list
        rooms.append(title)
        
    return rooms
    
# And here we're calling the function with our saved webpage
with open("rooms-utrecht.html") as file:
    html = file.read()

rooms = get_rooms(html)

# Let's use a pandas dataframe for easy viewing
import pandas as pd
pd.DataFrame(rooms)

Found 18 rooms


Unnamed: 0,0
0,Acaciastraat
1,Hebriden
2,Koeweitdreef
3,Nieuwe Koekoekstraat
4,Marco Pololaan
5,Rooseveltlaan
6,Edmond Audranstraat
7,Nieuwe Keizersgracht
8,Ridderschapstraat
9,Amsterdamsestraatweg
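
Indexing with `[0]` raises an `IndexError` as soon as one listing lacks the element. A possible alternative is `select_one()`, which returns the first match or `None`. The sketch below uses made-up HTML that only mirrors the class names above, a hypothetical `get_titles()` function, and the `html.parser` parser that ships with Python, so it runs without lxml.

```python
from bs4 import BeautifulSoup

# Made-up HTML for illustration; the second room has no title element
SAMPLE = """
<div class="rowSearchResultRoom"><div class="tile-title"> Acaciastraat </div></div>
<div class="rowSearchResultRoom"></div>
"""


def get_titles(html):
    soup = BeautifulSoup(html, "html.parser")
    titles = []
    for room in soup.select(".rowSearchResultRoom"):
        title = room.select_one(".tile-title")  # first match, or None
        # strip=True trims the whitespace around the text
        titles.append(title.get_text(strip=True) if title is not None else None)
    return titles


titles = get_titles(SAMPLE)
print(titles)  # ['Acaciastraat', None]
```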


In [7]:
# Okay, now that we know the basics, let's try getting some more information.
# Something that isn't in the original data is the price per square meter.
# We can calculate that by dividing the rent by the surface of the room
def get_rooms(html):
    soup = BeautifulSoup(html, "lxml")
    rooms = []

    for room in soup.select(".rowSearchResultRoom"): # Note that we're directly using soup.select() here
        # We need to convert the rent and surface from strings to integers, so make variables
        # with the strings first
        rent_str = room.select(".tile-rent")[0].get_text()
        surface_str = room.select(".tile-surface")[0].get_text()

        # When you look at the rent strings you see the price always has 3 digits and starts around
        # the third character. int() ignores whitespace around the digits in the slice, but this
        # method will break when prices are lower than €100 or higher than €999.
        # We're taking that risk
        rent = int(rent_str[2:6])
        
        # Same method for the surface: take the first two characters. A single-digit surface still
        # works because int() ignores the trailing space, but a surface of 100 square meters or more
        # would silently be cut off to its first two digits
        surface = int(surface_str[0:2])
        
        # We can finally calculate the price per square meter. 
        # Note that we're using the built-in round() function here to get a whole number
        rent_per_sqm = round(rent / surface)

        # Let's also add a bool indicating whether gas, water and electricity (G/W/E) are included in the rent
        has_gwl = "incl. G/W/E" in rent_str
        
        # Let's also get the thumbnail. Note that the `src` attribute contains the image URL,
        # so we need to use get() instead of get_text() to get that value
        image = room.select(".tile-img img")[0].get("src")

        rooms.append({
            "available" : room.select(".tile-availability .left")[0].get_text(), # We're doing a nested selector here
            "furnished" : room.select(".tile-furnished")[0].get_text(),
            "has_gwl" : has_gwl,
            "image" : image,
            "rent" : rent,
            "rent_per_sqm" : f"€{rent_per_sqm}", # Use an f-string here to get the Euro character
            "rent_str" : rent_str,
            "surface" : surface,
            "surface_str" : surface_str,
            "title" : room.select(".tile-title")[0].get_text(),
        })

    return rooms

with open("rooms-utrecht.html") as f:
    rooms = get_rooms(f.read())

# Make a dataframe (we'll sort it by rent_per_sqm later)
df = pd.DataFrame(rooms)

# describe() gives some nice statistics here, such as the average rent and surface
print(df.describe())

             rent    surface
count   18.000000  18.000000
mean   470.388889  19.500000
std    131.033062  11.366877
min    260.000000   8.000000
25%    380.000000  12.750000
50%    442.500000  17.000000
75%    584.500000  22.750000
max    705.000000  60.000000
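
The slices in `get_rooms()` depend on the digits sitting at fixed positions in the string. A regular expression that looks for the digits themselves is one way to drop that dependency. This is a sketch with hypothetical helper names (`parse_euros()`, `parse_surface()`), and it assumes that a dot inside an amount is a thousands separator, as in "1.250".

```python
import re


def parse_euros(text):
    """Return the first whole-euro amount in text, or None if there is none."""
    match = re.search(r"\d[\d.]*", text)
    if match is None:
        return None
    # Assumption: a dot inside the number is a thousands separator ("1.250")
    return int(match.group().replace(".", ""))


def parse_surface(text):
    """Return the first whole number in text, or None if there is none."""
    match = re.search(r"\d+", text)
    return int(match.group()) if match else None


print(parse_euros("€ 705,- incl. G/W/E"))  # 705
print(parse_euros("€ 1.250,-"))            # 1250
print(parse_surface("8 m2"))               # 8
```

Returning `None` instead of raising an error leaves it up to the caller to skip or flag listings without a usable number.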


In [8]:
# Use pandas' to_csv() method to save to a CSV file
df.to_csv("rooms-utrecht.csv")

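# And finally, show the table sorted by rent per square meter. Note that rent_per_sqm holds text
# like '€12', so this is an alphabetical sort; it matches the numeric order only as long as
# every value has the same number of digits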
df.sort_values('rent_per_sqm')

Unnamed: 0,available,furnished,has_gwl,image,rent,rent_per_sqm,rent_str,surface,surface_str,title
10,01-01-'19 - Onbepaalde tijd,Gemeubileerd,False,https://resources.kamernet.nl/image/418f4669-a...,705,€12,"€ 705,-",60,60 m2,Oslostraat
1,13-11-'18 - 31-12-'20,Kaal,True,https://resources.kamernet.nl/image/5f88cd98-2...,500,€19,"€ 500,- incl. G/W/E",27,27 m2,Hebriden
3,01-12-'18 - Onbepaalde tijd,Kaal,True,https://resources.kamernet.nl/Content/images/p...,435,€19,"€ 435,- incl. G/W/E",23,23 m2,Nieuwe Koekoekstraat
12,01-12-'18 - Onbepaalde tijd,Kaal,True,https://resources.kamernet.nl/image/a057752d-c...,490,€20,"€ 490,- incl. G/W/E",24,24 m2,Korfoedreef
8,03-12-'18 - Onbepaalde tijd,Kaal,True,https://resources.kamernet.nl/image/9edcf15f-8...,405,€22,"€ 405,- incl. G/W/E",18,18 m2,Ridderschapstraat
9,01-01-'19 - Onbepaalde tijd,Gestoffeerd,True,https://resources.kamernet.nl/image/06912528-0...,395,€23,"€ 395,- incl. G/W/E",17,17 m2,Amsterdamsestraatweg
17,01-12-'18 - Onbepaalde tijd,Kaal,True,https://resources.kamernet.nl/Content/images/p...,295,€25,"€ 295,- incl. G/W/E",12,12 m2,Swammerdamstraat
11,12-11-'18 - Onbepaalde tijd,Gemeubileerd,False,https://resources.kamernet.nl/image/6fdff551-8...,596,€26,"€ 596,-",23,23 m2,Reykjavikplein
7,12-11-'18 - Onbepaalde tijd,Gestoffeerd,True,https://resources.kamernet.nl/image/35b27f34-4...,400,€27,"€ 400,- incl. G/W/E",15,15 m2,Nieuwe Keizersgracht
14,11-11-'18 - 01-01-'20,Gemeubileerd,True,https://resources.kamernet.nl/image/25892b7b-4...,610,€28,"€ 610,- incl. G/W/E",22,22 m2,Dolomieten
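
Because `rent_per_sqm` is stored as a string with a Euro sign, `sort_values()` compares it as text. The sketch below uses three made-up rooms (not taken from the scraped page) to show the difference, and keeps a numeric column for sorting next to a formatted one for display. The column name `rent_per_sqm_str` is an assumption for this example.

```python
import pandas as pd

# Made-up rows, chosen so the prices per square meter have different lengths
df_demo = pd.DataFrame({
    "title": ["A", "B", "C"],
    "rent": [900, 705, 300],
    "surface": [9, 60, 30],
})

# Numeric column for calculations and sorting: 100, 12 and 10
df_demo["rent_per_sqm"] = (df_demo["rent"] / df_demo["surface"]).round().astype(int)
# Text column for display only
df_demo["rent_per_sqm_str"] = "€" + df_demo["rent_per_sqm"].astype(str)

by_text = df_demo.sort_values("rent_per_sqm_str")["title"].tolist()
by_number = df_demo.sort_values("rent_per_sqm")["title"].tolist()

print(by_text)    # ['C', 'A', 'B'] because "€100" sorts before "€12"
print(by_number)  # ['C', 'B', 'A']
```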

