# The heaviest list

[The heaviest list](https://stubru.be/stem/dezwaarstelijst/lijst) (*De Zwaarste Lijst*) is a yearly countdown of the best heavy metal records, played over the Easter weekend on Studio Brussel.

![](images/2022-03-23-17-48-05.png)

The plan is to get this list into a Python list, so that we can iterate over it or compare it to other heavy metal lists to see which music ranks best overall. However, we can't do this the same way we did for the minifigs:

![](images/2022-03-23-17-50-49.png)

The actual list is generated by JavaScript after the page has loaded. This means that if we simply scrape the page using BeautifulSoup, we only get the general page skeleton, not the list itself. To get the actual list, our Python code needs to mimic, or interact with, a web browser. In this example we'll be using Firefox, but Chromium would work as well. We also need Selenium so that Python can drive the browser. BeautifulSoup and requests are still needed as well.
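To make the problem concrete, here is a minimal sketch using two hypothetical, heavily simplified HTML strings: one standing in for what the server sends, one for what the browser holds after the script has run. The `app.js` name and both snippets are assumptions for illustration only. It uses the standard library's `html.parser` so it runs without a network connection.

```python
from html.parser import HTMLParser

# Hypothetical stand-in for what the server sends:
# an empty container plus a script that fills it in later.
RAW_HTML = """
<html><body>
  <div id="hoofdinhoud"></div>
  <script src="app.js"></script>
</body></html>
"""

# Hypothetical stand-in for the same page after a browser has run the script.
RENDERED_HTML = """
<html><body>
  <div id="hoofdinhoud">
    <div class="css-901oao">Metallica</div>
  </div>
</body></html>
"""


class ClassCounter(HTMLParser):
    """Count start tags that carry a given CSS class."""

    def __init__(self, css_class):
        super().__init__()
        self.css_class = css_class
        self.count = 0

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value may be None
        classes = (dict(attrs).get("class") or "").split()
        if self.css_class in classes:
            self.count += 1


def count_class(html, css_class):
    """Return how many elements in `html` have the class `css_class`."""
    parser = ClassCounter(css_class)
    parser.feed(html)
    return parser.count


print(count_class(RAW_HTML, "css-901oao"))       # 0
print(count_class(RENDERED_HTML, "css-901oao"))  # 1
```

A static scraper only ever sees the first string, which is why it finds nothing to extract.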

In [1]:
! pip install requests
! pip install BeautifulSoup4
! pip install selenium

In [1]:
import requests
from bs4 import BeautifulSoup

There is a small chance we are wrong and can simply scrape the list using BeautifulSoup after all. Let's try. We want these parts of the list:

![](images/2022-03-23-18-19-43.png)

We need the divs with the class `css-901oao`, which contain the band names.

In [2]:
URL = "https://stubru.be/stem/dezwaarstelijst/lijst"
page = requests.get(URL)

soup = BeautifulSoup(page.content, "html.parser")

results = soup.find("div", {"class": "css-901oao"})
if not results:
    print("Nothing")
else:
    print(results.prettify())

Nothing


No luck. So we'll need to drive a browser from our Python script. To do this, we'll use a webdriver. There is one for every major browser:

| Browser | Driver |
|--- | --- |
| Firefox | [geckodriver](https://github.com/mozilla/geckodriver/releases) | 
| Chrome | [ChromeDriver](https://chromedriver.chromium.org/downloads) |
| Edge | [MS Edge WebDriver](https://developer.microsoft.com/en-us/microsoft-edge/tools/webdriver/) |
| Opera | [Opera Chromium Driver](https://github.com/operasoftware/operachromiumdriver) |
| Safari | [Safari driver](https://developer.apple.com/documentation/webkit/testing_with_webdriver_in_safari) |

You can choose which one you'll use. The rest of this example uses Firefox and geckodriver.

'Installing' may be the wrong word here: you download and extract an executable, and then have to make sure your program can find it. There are three options:

1) place it in a known location and reference it from your Python program (we'll be doing this one)
2) place it in a location already in the path (like c:\windows\system32)
3) add the location of the driver to the path

We'll go for option 1 since you shouldn't mess with system folders or the path unless you absolutely have to. Therefore:

1) download the latest version of the webdriver
2) unpack it
3) place the entire folder in "c:\tmp"
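The options above can be combined in a small helper. This is only a sketch: `resolve_driver` is a hypothetical name, and the fallback is the geckodriver location used in this example. It checks the path first and falls back to the known location otherwise.

```python
import shutil


def resolve_driver(
    name="geckodriver",
    fallback="C:/tmp/geckodriver-v0.30.0-win64/geckodriver.exe",
):
    """Return the driver's location: the entry on the path if there is one,
    otherwise the explicit fallback location."""
    found = shutil.which(name)  # None when the executable is not on the path
    return found if found else fallback


driver_path = resolve_driver()
print(driver_path)
```

The result could be passed to `Service(...)` in place of a hard-coded string. Selenium 4.6 and later also bundle Selenium Manager, which tries to locate or download a matching driver when none is specified, so `webdriver.Firefox()` without a `Service` may work as well.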

In [4]:
# import requests -> done, see before
# from bs4 import BeautifulSoup -> done, see before
from selenium import webdriver
from selenium.webdriver.common.by import By
# Change the following line if using another browser
from selenium.webdriver.firefox.service import Service
import time

# specify the url
# URL = "https://stubru.be/stem/dezwaarstelijst/lijst" -> done, see before

# run firefox webdriver from executable path of your choice
ser = Service('C:/tmp/geckodriver-v0.30.0-win64/geckodriver.exe')
driver = webdriver.Firefox(service=ser)
# driver = webdriver.Edge()

# get web page
driver.get(URL)
# sleep for 3s to give the page's JavaScript time to build the list
time.sleep(3)
# driver.quit()

SessionNotCreatedException: Message: Expected browser binary location, but unable to find binary in default location, no 'moz:firefoxOptions.binary' capability provided, and no binary flag set on the command line


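If Firefox is installed, a browser window opens and loads the page. The window stays open purely for demonstration; you can uncomment the last line (`driver.quit()`) to close it, but the following cells still need the driver. If you instead get a `SessionNotCreatedException` like the one above, Selenium could not find a Firefox binary: install Firefox or switch to another browser's driver, such as the commented-out Edge line.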
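A fixed three-second sleep is a guess: it may be too short on a slow connection and wasteful on a fast one. One alternative is to poll until the content appears. The sketch below uses a hypothetical helper, `wait_until`, written with the standard library only, and a fake lookup function standing in for the browser.

```python
import time


def wait_until(condition, timeout=10.0, poll=0.5):
    """Call `condition` repeatedly until it returns something truthy.

    Returns that value, or raises TimeoutError once `timeout` seconds
    have passed without a truthy result.
    """
    deadline = time.monotonic() + timeout
    while True:
        value = condition()
        if value:
            return value
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(poll)


# Stand-in for a page that needs a few polls before its content appears.
attempts = {"n": 0}


def fake_find_elements():
    attempts["n"] += 1
    return ["row"] if attempts["n"] >= 3 else []


print(wait_until(fake_find_elements, timeout=2.0, poll=0.01))  # ['row']
```

With the real driver the condition could be `lambda: driver.find_elements(by=By.CLASS_NAME, value="css-901oao")`. Selenium ships its own version of this idea as `WebDriverWait(driver, 10).until(...)` in `selenium.webdriver.support.ui`.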

Next up is trying to find the band names again. We're reusing the previous code, but changing it to use the new webdriver.

In [10]:
results = driver.find_elements(by=By.CLASS_NAME, value="css-901oao")
if not results:
    print("Nothing")
else:
    print(len(results))
    for t in results[0:20]:
        print(t.text)

506
De Zwaarste Lijst
De 666 zwaarste gitaarplaten, door jou gekozen.
#
Nummer
Vorige positie
Aantal keer
1
Metallica
Master Of Puppets
1
7
2
Tool
Schism
3
7
3
Brutus
All Along
8


Great, but not quite there. Or maybe we are: you could cycle through this list and sort it into a proper dataframe. But there is another way: by using an XPath you can let the webdriver home in on the element much more precisely.
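The first route, cycling through the flat list, could look roughly like this. It is a sketch under an assumption read off the output above: 506 entries would be 6 header entries followed by 100 rows of 5 fields each (position, artist, title, previous position, number of times). The helper name `rows_from_flat` is hypothetical.

```python
def rows_from_flat(texts, header_count=6, width=5):
    """Group a flat list of cell texts into rows of `width` fields.

    The first `header_count` entries (page title, subtitle, column headers)
    are skipped; an incomplete trailing row is dropped.
    """
    cells = texts[header_count:]
    full = len(cells) - len(cells) % width
    return [tuple(cells[i:i + width]) for i in range(0, full, width)]


# The first entries printed above, typed in by hand for illustration.
sample = [
    "De Zwaarste Lijst", "De 666 zwaarste gitaarplaten, door jou gekozen.",
    "#", "Nummer", "Vorige positie", "Aantal keer",
    "1", "Metallica", "Master Of Puppets", "1", "7",
    "2", "Tool", "Schism", "3", "7",
    "3", "Brutus", "All Along", "8",  # cut off by the [0:20] slice
]

for position, artist, title, previous, times in rows_from_flat(sample):
    print(position, artist, title, previous, times)
```

On the live page the input would be `[t.text for t in results]`. If any row is missing a field, the grouping shifts from that row onward, which is one reason to prefer a more targeted lookup.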

First, copy the XPath of the first element using Chrome's Inspect tool (right-click the element in the Elements panel, then Copy > Copy XPath).

![](images/2022-03-23-18-57-03.png)

In [11]:
xpath = r'//*[@id="hoofdinhoud"]/div/main/div/div/div[3]/div[1]/table/tbody/tr[1]/td[3]/div'

results = driver.find_elements(by=By.XPATH, value=xpath)

if not results:
    print("Nothing")
else:
    print(len(results))
    for t in results[0:20]:
        print(t.text)

1
Metallica
Master Of Puppets


Good, but that's only one. How do we get all items in the list?

In [12]:
first_xpath  = r'//*[@id="hoofdinhoud"]/div/main/div/div/div[3]/div[1]/table/tbody/tr[1]/td[3]/div'
second_xpath = r'//*[@id="hoofdinhoud"]/div/main/div/div/div[3]/div[1]/table/tbody/tr[2]/td[3]/div'
last_xpath   = r'//*[@id="hoofdinhoud"]/div/main/div/div/div[3]/div[1]/table/tbody/tr[100]/td[3]/div'

# notice the pattern?

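from selenium.common.exceptions import NoSuchElementException

final_list = []

for i in range(1, 101):
    # only the row index tr[i] changes between entries
    xpath = rf'//*[@id="hoofdinhoud"]/div/main/div/div/div[3]/div[1]/table/tbody/tr[{i}]/td[3]/div'
    try:
        # find_element raises an exception instead of returning an empty result
        result = driver.find_element(by=By.XPATH, value=xpath)
    except NoSuchElementException:
        print("Nothing")
    else:
        final_list.append(result)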

for i, t in enumerate(final_list[0:10],start=1):
    print(i, t.text)

1 Metallica
Master Of Puppets
2 Tool
Schism
3 Brutus
All Along
4 Slayer
Raining blood
5 Amenra
A Solitary Reign
6 Iron Maiden
Fear Of The Dark
7 Rammstein
DEUTSCHLAND
8 Steak Number Eight
The Sea is Dying
9 Channel Zero
Black Fuel
10 Slipknot
Duality


Good enough! Well, not really, but the rest is up to you. (See the exercises.)
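As a possible shortcut for the loop above: an XPath step without an index matches every sibling of that name, so dropping the `[i]` after `tr` selects the cell in all rows at once. The sketch below shows the idea on a hypothetical, simplified table using the standard library's `xml.etree.ElementTree`, which supports this subset of XPath.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified table with the band name in the third column.
TABLE = """
<table><tbody>
  <tr><td>1</td><td/><td><div>Metallica</div></td></tr>
  <tr><td>2</td><td/><td><div>Tool</div></td></tr>
  <tr><td>3</td><td/><td><div>Brutus</div></td></tr>
</tbody></table>
"""

root = ET.fromstring(TABLE)

# tr[1] selects only the first row ...
first = root.findall("./tbody/tr[1]/td[3]/div")
# ... while a bare tr selects every row.
all_rows = root.findall("./tbody/tr/td[3]/div")

print([d.text for d in first])     # ['Metallica']
print([d.text for d in all_rows])  # ['Metallica', 'Tool', 'Brutus']
```

With the webdriver, the equivalent would be a single `driver.find_elements(by=By.XPATH, value=...)` call with `tr` instead of `tr[1]` in the copied XPath, which should return one element per row.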

The only thing left to do now is quit the driver.

In [13]:
driver.quit()