# Examples week 5
## Day 1: functions and web scraping with BeautifulSoup
### Functions

In [1]:
# There's one essential part of programming that we haven't touched upon yet: functions
# We have used functions a lot already, such as the print() function and the len() function
name = "Barrie"
name_length = len(name)
print(name_length)

6


In [2]:
# Functions are a pretty simple concept: you 'call' them with parentheses, optionally with an argument.
# Functions are used to encapsulate functionality that you reuse often in a program.
# 
# To create a function yourself you use the def statement
def print_barrie():
    print("Barrie")
    
# And you call it by using the function's name and parentheses
print_barrie()

Barrie


In [3]:
# You can define arguments between the parentheses that follow the function's name
def print_friend(friend):
    print(f"{friend} is a friend of Barrie")
    
print_friend("Tinus")

Tinus is a friend of Barrie


In [5]:
# A function can take as many arguments as you like, separated by commas
def print_friends(friend1, friend2):
    print(f"{friend1} is a friend of Barrie")
    print(f"{friend2} is also Barrie's friend")
        
print_friends("Tinus", "Hans")

Tinus is a friend of Barrie
Hans is also Barrie's friend


In [10]:
# Functions can also transform values by using the return statement
def friendify(name):
    friend = f"{name} is a friend"
    return friend

friend = friendify("Tinus")
print(friend)

Tinus is a friend
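
The difference between printing a value and returning it is easy to miss, because both show up on screen in a notebook. The sketch below uses two hypothetical helpers, `make_greeting()` and `show_greeting()`, which are not part of the examples above, to show that only a returned value can be stored and reused.

```python
# Hypothetical helpers, only meant to contrast return with print
def make_greeting(name):
    return f"{name} is a friend"   # hands the string back to the caller

def show_greeting(name):
    print(f"{name} is a friend")   # shows the string, but returns nothing

returned = make_greeting("Tinus")
printed = show_greeting("Tinus")

print(returned.upper())  # the returned string can be used again
print(printed)           # a function without a return statement gives back None
```

If you want to do anything with the result later on, such as storing it in a list or writing it to a file, the function needs to return it.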


In [1]:
# Let's create a more useful function. Remember the code you wrote in chapter 3 
# to read the footballers.json file and parse the JSON data?
# You could write a function to accept any filename
import json

def read_json(filename):
    with open(filename) as f:
        data = json.load(f)
        
    return data

footballers = read_json("footballers.json")
print(footballers[0]["name"])

Neymar


In [2]:
# When you call a function without the defined number of arguments you'll
# get an error
def friend_of(friend1, friend2):
    print(f"{friend1} is a friend of {friend2}")
    
friend_of("Barrie")

TypeError: friend_of() missing 1 required positional argument: 'friend2'

In [3]:
# However, you can also give a default value for an argument
def friend_of(friend1, friend2 = "Tinus"):
    print(f"{friend1} is a friend of {friend2}")
    
friend_of("Barrie")

Barrie is a friend of Tinus


In [4]:
# You can still override a default argument by passing a value yourself
def friend_of(friend1, friend2 = "Tinus"):
    print(f"{friend1} is a friend of {friend2}")
    
friend_of("Barrie", "Hans")

Barrie is a friend of Hans
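
Arguments can also be passed by name, which is often clearer when a function has default values. This is a small variant of `friend_of()` that returns the string instead of printing it, so the result is easier to inspect; that change is an assumption made for this sketch only.

```python
# Variant of friend_of() that returns the sentence instead of printing it
def friend_of(friend1, friend2="Tinus"):
    return f"{friend1} is a friend of {friend2}"

default_used = friend_of("Barrie")                      # friend2 falls back to "Tinus"
by_keyword = friend_of("Barrie", friend2="Hans")        # override the default by name
any_order = friend_of(friend2="Hans", friend1="Barrie") # named arguments can go in any order

print(default_used)
print(by_keyword)
print(any_order)
```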


In [13]:
# Functions can call other functions
def a_really_good_friend(name):
    friend = a_good_friend(name)
    return f"{friend}, a really good friend"

def a_good_friend(name):
    return f"{name} is a good friend"

a_really_good_friend("Barrie")

'Barrie is a good friend, a really good friend'

In [12]:
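# The body of a function only runs when the function is called, so every function it uses
# must be defined before that call. The easiest way to make sure of that is to put all
# your functions at the top of your file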
def a_really_good_friend(name):
    friend = a_good_friend(name)
    return f"{friend}, a really good friend"

a_really_good_friend("Barrie")

def a_good_friend(name):
    return f"{name} is a good friend"

NameError: name 'a_good_friend' is not defined
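
What matters is the order of the call, not the order of the definitions. The sketch below uses hypothetical `outer()` and `helper()` functions and catches the error, so you can see the same call fail before `helper()` exists and succeed afterwards.

```python
# Hypothetical functions, only meant to show when names are looked up
def outer(name):
    return helper(name) + "!"   # helper is only looked up when outer() runs

try:
    outer("Barrie")             # helper() does not exist yet
except NameError as error:
    first_attempt = f"failed: {error}"

def helper(name):
    return f"Hello {name}"

second_attempt = outer("Barrie")  # now helper() exists, so the same call works

print(first_attempt)
print(second_attempt)
```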

### Web scraping with BeautifulSoup

In [34]:
# Web scraping is the black art of transforming a messy webpage into nicely structured data
# Web pages are structured as well, but they're usually not as tidily structured as a CSV or JSON file
# This means you need to create a structure yourself, and it also means you need to clean and tidy
# up the data contained in the webpages.
#
# Web scraping also raises a couple of ethical and legal issues. Scraping is not against the law,
# but it can definitely end up in a legal grey area, depending on what you
# do with the data.
# 
# Leaving all those issues aside, let's scrape a website! We're going to be looking at transforming
# listings on kamernet.nl to structured data. Our starting point is the listing of recent rooms in Utrecht:
# https://kamernet.nl/huren/kamer-utrecht
# 
# Before we can do any scraping we need to get the HTML using the requests library
import requests
req = requests.get("https://kamernet.nl/huren/kamer-utrecht")
html = req.text

# Because getting an HTML page can be slow, and sites are eager to block you for repeated requests
# to the same page, it's safer and faster to save the data to a file, so we can use that
# instead of endlessly downloading the same URL. 
file = open("rooms-utrecht.html", "w")
file.write(html)
file.close()
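
The download-once idea can be wrapped in a function, so the page is only fetched when the saved file doesn't exist yet. This is a minimal sketch: `get_html()` and its `fetch` argument are hypothetical names, and the demo uses a fake fetcher and a temporary folder so it runs without touching the network.

```python
import os
import tempfile

def get_html(url, cache_file, fetch):
    """Return the HTML for url, calling fetch(url) only if cache_file is missing."""
    if os.path.exists(cache_file):
        with open(cache_file, encoding="utf-8") as f:
            return f.read()

    html = fetch(url)
    with open(cache_file, "w", encoding="utf-8") as f:
        f.write(html)
    return html

# Demo with a fake fetcher that records how often it is called
calls = []

def fake_fetch(url):
    calls.append(url)
    return "<html>rooms</html>"

with tempfile.TemporaryDirectory() as folder:
    cache = os.path.join(folder, "rooms-utrecht.html")
    first = get_html("https://kamernet.nl/huren/kamer-utrecht", cache, fake_fetch)
    second = get_html("https://kamernet.nl/huren/kamer-utrecht", cache, fake_fetch)

print(len(calls))  # the second call read the saved file instead of fetching again
```

With the real site you could pass something like `lambda url: requests.get(url).text` as the `fetch` argument.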

In [45]:
# Now we can start scraping the HTML. For this process the Developer Tools in either Chrome or Firefox are 
# indispensable. You use those tools to select (Right click and 'Inspect element') elements and look at 
# attributes like classes or ids you can use to write selectors.
# We're going to first get the rooms from the page and get their title. Note how we write this in a function.
# Why? If we're going to change from the static file to the live website we can easily change the code calling
# the function. Our function stays the same, it just accepts HTML.

from bs4 import BeautifulSoup # We import the BeautifulSoup library here

def get_rooms(html):
    soup = BeautifulSoup(html, "lxml") # The 'lxml' argument is called the parser, you can try 'html5lib' here as well
    rooms = [] # Our room data is going here
    
    # 'soup' is now our parsed HTML. We use the select() method to get the room elements using a
    # CSS selector for all elements with the 'rowSearchResultRoom' class
    rooms_list = soup.select(".rowSearchResultRoom")
    
    print(f"Found {len(rooms_list)} rooms")
    
    # Loop over our rooms
    for room in rooms_list:
        # 'room' is a new BeautifulSoup element that also accepts the select() method
        title = room.select(".tile-title") # Kamernet has decent class names indicating different data
        
        # select() *always* returns a list, even if there's just one element! 
        # So we need to get the first element in the list
        title = title[0]
        
        # Now we can use the get_text() method to get the text in the element
        title = title.get_text()
        
        # Note that all these methods can be chained, so this is the same (and a lot shorter!)
        title = room.select(".tile-title")[0].get_text()
        
        # And add it to the rooms list
        rooms.append(title)
        
    return rooms
    
# And here we're calling the function with our saved webpage
file = open("rooms-utrecht.html")
html = file.read()
file.close()

rooms = get_rooms(html)

# Let's use a pandas dataframe for easy viewing
import pandas as pd
pd.DataFrame(rooms)

Found 18 rooms


Unnamed: 0,0
0,Aureliahof
1,Groeneweg
2,Oudegracht
3,Blois van Treslongstraat
4,Noorderstraat
5,Noorderstraat
6,Noorderstraat
7,Noorderstraat
8,Billitonstraat
9,Lodewijk Napoleonplantsoen
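
Because select() returns a list, `select(...)[0]` raises an IndexError as soon as one room is missing the element you're looking for. BeautifulSoup also has a `select_one()` method that returns the first match, or `None` when there is none. The sketch below uses made-up HTML with the same class names, and the built-in `html.parser` so it doesn't need lxml; `get_titles()` is a hypothetical name.

```python
from bs4 import BeautifulSoup

# Made-up HTML: the second room has no title element
sample_html = """
<div class="rowSearchResultRoom"><div class="tile-title">Oudegracht</div></div>
<div class="rowSearchResultRoom"><div class="tile-rent">€ 300,-</div></div>
"""

def get_titles(html):
    soup = BeautifulSoup(html, "html.parser")
    titles = []

    for room in soup.select(".rowSearchResultRoom"):
        title = room.select_one(".tile-title")  # a single element, or None
        titles.append(title.get_text(strip=True) if title else None)

    return titles

titles = get_titles(sample_html)
print(titles)
```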


In [68]:
# Okay, now that we know the basics, let's try getting some more information
# Something that isn't in the original data is the price per square meter,
# we can calculate that if we divide the rent by the surface of the room
def get_rooms(html):
    soup = BeautifulSoup(html, "lxml")
    rooms = []

    for room in soup.select(".rowSearchResultRoom"): # Note that we're directly using soup.select() here
        # We need to convert the rent and surface from strings to integers, so make variables
        # with the strings first
        rent_str = room.select(".tile-rent")[0].get_text()
        surface_str = room.select(".tile-surface")[0].get_text()

        # When you look at the rent strings you see the price is always 3 digits and starting from the 
        # third character. This method will break when prices are lower than €100 or higher than €999,
        # but we're taking that risk
        rent = int(rent_str[2:6])   
        
        # Same method for surface, and same problem here: if surface is lower than 10 square meters
        # or higher than 99 square meters this will fail
        surface = int(surface_str[0:2])
        
        # We can finally calculate the price per square meter, 
        # note that we're using the built-in round() function here, to get a whole number
        rent_per_sqm = round(rent / surface)

        # Let's also add a bool indicating whether gas, water and electricity are included in the rent
        has_gwl = "incl. G/W/E" in rent_str
        
        # Let's also get the thumbnail. Note that the image URL is stored in the `src` attribute,
        # so we need to use get() instead of get_text() to get that value
        image = room.select(".tile-img img")[0].get("src")

        rooms.append({
            "available" : room.select(".tile-availability .left")[0].get_text(), # We're doing a nested selector here
            "furnished" : room.select(".tile-furnished")[0].get_text(),
            "has_gwl" : has_gwl,
            "image" : image,
            "rent" : rent,
            "rent_per_sqm" : f"€{rent_per_sqm}", # Use an f-string here to get the Euro character
            "rent_str" : rent_str,
            "surface" : surface,
            "surface_str" : surface_str,
            "title" : room.select(".tile-title")[0].get_text(),
        })

    return rooms

# We're using the with statement here, which automatically closes the
# file you're opening and is a bit shorter than the three lines this took before
with open("rooms-utrecht.html") as f:
    rooms = get_rooms(f.read())

# Make a dataframe
df = pd.DataFrame(rooms)

# describe() gives some nice statistics here, such as the average rent and surface
print(df.describe())

# Use pandas' to_csv() method to save to a CSV file
df.to_csv("rooms-utrecht.csv")

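# And finally, show the table sorted by rent per square meter. Note that rent_per_sqm is a
# string like "€12", so this sorts alphabetically; that only matches the numeric order
# as long as all values have the same number of digits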
df.sort_values('rent_per_sqm')

             rent    surface
count   18.000000  18.000000
mean   525.277778  21.944444
std    128.833593   8.940800
min    300.000000  11.000000
25%    422.500000  14.000000
50%    511.500000  22.500000
75%    629.250000  28.750000
max    750.000000  40.000000


Unnamed: 0,available,furnished,has_gwl,image,rent,rent_per_sqm,rent_str,surface,surface_str,title
15,01-08-'18 - 29-08-'18,Gemeubileerd,True,https://resources.kamernet.nl/image/ab8bc6e6-8...,300,€12,"€ 300,- incl. G/W/E",26,26 m2,Parkstraat
1,10-07-'18 - Onbepaalde tijd,Gestoffeerd,True,https://resources.kamernet.nl/image/c08ed0e2-5...,420,€14,"€ 420,- incl. G/W/E",30,30 m2,Blois van Treslongstraat
3,10-07-'18 - Onbepaalde tijd,Kaal,True,https://resources.kamernet.nl/Content/images/p...,639,€16,"€ 639,- incl. G/W/E",40,40 m2,Noorderstraat
2,01-08-'18 - 31-07-'23,Kaal,True,https://resources.kamernet.nl/Content/images/p...,488,€17,"€ 488,- incl. G/W/E",29,29 m2,Noorderstraat
4,01-08-'18 - 31-07-'23,Kaal,True,https://resources.kamernet.nl/Content/images/p...,523,€17,"€ 523,- incl. G/W/E",30,30 m2,Noorderstraat
6,10-07-'18 - Onbepaalde tijd,Gestoffeerd,False,https://resources.kamernet.nl/image/21886f05-4...,650,€19,"€ 650,-",35,35 m2,Billitonstraat
5,01-08-'18 - 31-07-'23,Kaal,True,https://resources.kamernet.nl/Content/images/p...,495,€20,"€ 495,- incl. G/W/E",25,25 m2,Noorderstraat
17,15-08-'18 - Onbepaalde tijd,Kaal,True,https://resources.kamernet.nl/image/49c7dfe7-0...,700,€25,"€ 700,- incl. G/W/E",28,28 m2,El Salvadordreef
13,29-07-'18 - 30-07-'18,Gemeubileerd,True,https://resources.kamernet.nl/image/e5e07393-a...,375,€27,"€ 375,- incl. G/W/E",14,14 m2,Kanaalstraat
14,01-08-'18 - Onbepaalde tijd,Kaal,True,https://resources.kamernet.nl/image/41cc97a3-6...,550,€28,"€ 550,- incl. G/W/E",20,20 m2,Marco Pololaan
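
The fixed slices used for rent and surface break as soon as a number has more or fewer digits. A slightly more forgiving approach is to pull the first run of digits out of the string with a regular expression, and to keep the price per square meter as a number so it sorts correctly. This is a sketch with made-up rooms; `parse_number()` is a hypothetical helper, and it assumes the rent is written without a thousands separator.

```python
import re

def parse_number(text):
    """Return the first run of digits in text as an int, or None if there are no digits."""
    match = re.search(r"\d+", text)
    return int(match.group()) if match else None

# Made-up rooms in the same format as the Kamernet strings
rooms = [
    {"rent_str": "€ 1250,- incl. G/W/E", "surface_str": "9 m2"},
    {"rent_str": "€ 300,- incl. G/W/E", "surface_str": "26 m2"},
    {"rent_str": "€ 650,-", "surface_str": "35 m2"},
]

for room in rooms:
    room["rent"] = parse_number(room["rent_str"])
    room["surface"] = parse_number(room["surface_str"])
    room["rent_per_sqm"] = round(room["rent"] / room["surface"])  # kept as a number

# Sorting on the number gives the real order
rooms_sorted = sorted(rooms, key=lambda room: room["rent_per_sqm"])

for room in rooms_sorted:
    print(f"€{room['rent_per_sqm']} per m2 for {room['rent_str']}")
```

The Euro sign is only added when printing, so the values themselves stay numeric and can be sorted or averaged.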
