# Web Scraping Advanced (oDCM)

In this tutorial, we'll expand on tools like BeautifulSoup, which is essential for static websites but has limitations with dynamic content. While BeautifulSoup remains useful, Selenium allows us to handle both static and dynamic sites by simulating user actions like scrolling, clicking, and logging in. It works within a browser window, making it a more intuitive complement to BeautifulSoup, rather than a replacement.

## Learning Objectives

Students will be able to:
- Emulate user interactions on a site
  - clicking,
  - scrolling,
  - filling forms.
- Save the retrieved data as JSON files.

--- 

<div class="alert alert-block alert-info"><b>Support Needed?</b> 
    For technical issues outside of scheduled classes, please check the <a href="https://odcm.hannesdatta.com/docs/course/support" target="_blank">support section</a> on the course website.
</div>


---
## 1. Selenium 

### 1.1 Let's recap: Why Selenium? 

In the Web Scraping 101 tutorial, we mainly used BeautifulSoup to turn HTML into a data structure that we could search and access using Python-like syntax. While it is easy to get started with this library, it has limitations when it comes to dynamic websites, that is, websites whose content changes without the page being actively refreshed. Examples are TikTok (with its infinite scroll) and Twitch (where the chat window updates dynamically).

Selenium can handle both static and dynamic websites and mimic user behavior (e.g., scrolling, clicking, logging in). It launches a separate browser window in which all actions are visible, which makes it feel more intuitive.

### 1.2 Installing Selenium

<div class="alert alert-block alert-warning"><b>Installing Selenium and Chromedriver</b> 

To install Selenium and Chromedriver locally, please follow the <a href="https://tilburgsciencehub.com/configure/python-for-scraping/?utm_campaign=referral-short">Tutorial on Tilburg Science Hub</a>.
    
You can also use the code snippet below to automate the installation. Running this snippet takes a little longer each time, but the benefit is that it almost always works!
</div>



In [None]:
# Installing necessary packages
!pip3 install webdriver_manager
!pip3 install selenium

# Importing required libraries
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager

# Setting up Chrome WebDriver with WebDriver Manager using Service
service = Service(ChromeDriverManager().install())
driver = webdriver.Chrome(service=service)

# Opening the 'music to scrape' website
url = "https://music-to-scrape.org/"
driver.get(url)

# Optional: let element lookups wait up to 10 seconds for elements to appear
driver.implicitly_wait(10)  # 10 seconds


If everything went smoothly, your computer opened a new Chrome window and loaded `music-to-scrape`.

__Importantly, you need to rerun this code cell every time you want to open a new instance of Chrome.__
__To close Chrome, use `driver.quit()`.__
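
Because a Chrome window stays open until `driver.quit()` is called, it can help to tie opening and closing together. The sketch below shows one possible pattern and is not part of the course code: `managed_driver` is a hypothetical helper name, and `make_driver` is assumed to be any function that returns a driver object with a `quit()` method.

```python
from contextlib import contextmanager


@contextmanager
def managed_driver(make_driver):
    """Yield a driver and always call quit() afterwards (hypothetical helper)."""
    driver = make_driver()  # e.g., a function that returns webdriver.Chrome(...)
    try:
        yield driver
    finally:
        # Runs even if the scraping code raises an error
        driver.quit()
```

With the setup above, this could be used as `with managed_driver(lambda: webdriver.Chrome(service=service)) as driver:` followed by the scraping code, so that Chrome closes even when an error interrupts the cell.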

<div class="alert alert-block alert-info"><b>Using Google Colab</b> 

If you're using Google Colab, no browser window will open on your screen.
    
Whenever the code navigates to a new page, open that page manually in your own browser. Although this feels a little less interactive, you will still be able to work through this tutorial!

</div>


### 1.3 Getting Access to a Website


Next, we're going to tell the browser to visit the "Music to Scrape" website. We'll use the `driver` object we created earlier and call the `get` method, passing the URL of the website we'd like to extract data from.

In [None]:
driver.get("https://music-to-scrape.org/")

From this point, we can use BeautifulSoup as we learned previously, though this time we create the `soup` object from the page source held by the `driver` object.

## 2. Using Selenium in Combination with BeautifulSoup

Selenium is powerful for interacting with dynamic websites, but once the webpage has loaded, we can still use BeautifulSoup for parsing the HTML and extracting data efficiently. BeautifulSoup is well-suited for quickly navigating and querying the page's source code, while Selenium handles the initial loading and interaction with dynamic content. This combination gives us flexibility and efficiency in our scraping tasks.

Here's an example of how to use Selenium to load a page and then pass the HTML to BeautifulSoup for parsing. We'll extract the title of the page from https://music-to-scrape.org.

In [None]:
# Initialize the Selenium WebDriver first (see the code snippet in section 1.2)
from bs4 import BeautifulSoup

# Use Selenium to load the page
driver.get("https://music-to-scrape.org")

# Get the page source and pass it to BeautifulSoup
soup = BeautifulSoup(driver.page_source, 'html.parser')

# Example: Extracting the page title using BeautifulSoup
page_title = soup.find('title').get_text()
print("Page Title:", page_title)


In this example, Selenium loads the webpage, and then we pass the page’s source code to BeautifulSoup to extract the page title.

__Exercise 1:__

Now, it is your turn. Use the same combination of Selenium and BeautifulSoup to store the names of the artists in the top 15 weekly tracks in a JSON dictionary.

__Hint:__

Look at the structure of the page in the browser's developer tools to find where the artist's name is located.

*Solution:*


In [None]:
import time
# Use Selenium to load the page
driver.get("https://music-to-scrape.org")

# Wait until the website is loaded
time.sleep(2)

# Get the page source and pass it to BeautifulSoup
soup = BeautifulSoup(driver.page_source, 'html.parser')

# Solution: Extracting the names from the weekly top 15 section
section = soup.find('section', attrs={'name': 'weekly_15'})

tracks = section.find_all(class_='list-group-item')

json_dic = []

for tr in tracks:
    json_dic.append({'name': tr.find('h5').get_text()})
json_dic


In this section, we combined Selenium with BeautifulSoup to scrape data from a dynamically loaded website. After using Selenium to load the page and waiting for it to fully render, we passed the HTML to BeautifulSoup for parsing. This allowed us to extract data efficiently, like the weekly top 15 tracks, and store it in a JSON-like format. This method leverages the strengths of both tools: Selenium for dynamic content handling and BeautifulSoup for data extraction.
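
The list built in the solution only lives in memory. One way to keep it is to write it to a JSON file with Python's built-in `json` module. The following is a minimal sketch: `save_records` and `load_records` are hypothetical helper names, and the two artist names are placeholders rather than scraped values. A temporary folder is used so that the sketch leaves no files behind; in a project you would pass a regular path such as `weekly_artists.json`.

```python
import json
import tempfile
from pathlib import Path


def save_records(records, path):
    """Write a list of dictionaries to a single JSON file."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(records, f, ensure_ascii=False, indent=2)


def load_records(path):
    """Read the list of dictionaries back from disk."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


# Placeholder data in the same shape as json_dic from the solution above
example_records = [{"name": "Artist A"}, {"name": "Artist B"}]

with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = Path(tmp_dir) / "weekly_artists.json"
    save_records(example_records, out_path)
    restored = load_records(out_path)

print(restored == example_records)
```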

## 3. Clicking on buttons with Selenium
In many websites, content is hidden or only appears after certain actions like clicking a button. While BeautifulSoup is excellent for parsing static HTML, it cannot perform actions such as clicking or interacting with dynamic elements. For example, a button might load new content that doesn’t appear in the initial HTML source, or content might change after interacting with certain elements on the page (e.g., loading more items, navigating through a slideshow, etc.). This is where Selenium comes in—it can mimic user interactions like clicking, scrolling, and more, making it a powerful tool for handling dynamic websites.

In this section, we’ll show you how to use Selenium to click on a button. For example, let’s try to click the “Previous Week” button on this profile page: https://music-to-scrape.org/user?username=PandaVector67.

In [None]:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
import time

# Navigate to the specific user's page
driver.get("https://music-to-scrape.org/user?username=PandaVector67")

# Wait until the website is fully loaded
time.sleep(2)

# Locate the 'Previous Week' link by its visible text and click it
previous_button = driver.find_element(By.LINK_TEXT, "Previous Week")
previous_button.click()

# Wait a bit to allow the new page to load
time.sleep(2)

# Get the new page's source and pass it to BeautifulSoup if needed
soup = BeautifulSoup(driver.page_source, 'html.parser')

# (Optional) Print the new section title to confirm the action
print("New Section Title:", soup.find_all('h2')[1].get_text())


__Explanation:__
1. Selenium Interaction: We used Selenium to load the webpage and then locate the “Previous Week” link.
2. Click Action: Using find_element() with the By.LINK_TEXT method, we identified the link by its label text and triggered a click.
3. Post-Click Action: After clicking, we waited for the new content to load and optionally passed it to BeautifulSoup for further scraping.
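
The fixed `time.sleep(2)` calls above wait either too long or not long enough, depending on the connection. A common alternative is to poll until a condition holds. The sketch below shows the idea in plain Python; `wait_until` is a hypothetical helper written for illustration (Selenium ships its own `WebDriverWait` class for the same purpose), and `condition` is assumed to be any function that returns a truthy value once the page is ready.

```python
import time


def wait_until(condition, timeout=10.0, poll_interval=0.5):
    """Call condition() repeatedly until it returns a truthy value.

    Returns that value, or raises TimeoutError after `timeout` seconds.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Condition not met within {timeout} seconds")
        time.sleep(poll_interval)
```

For instance, `wait_until(lambda: driver.find_elements(By.LINK_TEXT, "Previous Week"))` would return the matching elements as soon as the link exists, because `find_elements` returns an empty list while nothing matches.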

<div class="alert alert-block alert-info"><b>Primer: Locating Elements in Selenium</b> 

<p>In Selenium, there are several ways to locate elements on a webpage. While we used <code>LINK_TEXT</code> in the example to click a button, this is just one of many methods available for finding elements. Unlike BeautifulSoup, which relies on methods like <code>find</code> and <code>find_all</code>, Selenium has its own set of strategies for interacting with elements.</p>


<p>Here are some common ways to locate elements in Selenium:</p>
<ul>
  <li><strong>By ID</strong>: <code>driver.find_element(By.ID, "element_id")</code></li>
  <li><strong>By Name</strong>: <code>driver.find_element(By.NAME, "element_name")</code></li>
  <li><strong>By Class Name</strong>: <code>driver.find_element(By.CLASS_NAME, "class_name")</code></li>
  <li><strong>By Tag Name</strong>: <code>driver.find_element(By.TAG_NAME, "tag_name")</code></li>
  <li><strong>By CSS Selector</strong>: <code>driver.find_element(By.CSS_SELECTOR, "css_selector")</code></li>
  <li><strong>By XPath</strong>: <code>driver.find_element(By.XPATH, "xpath_expression")</code></li>
</ul>

<p>It is important to note that the syntax and methods in Selenium differ from BeautifulSoup. While BeautifulSoup focuses on parsing HTML and querying the structure, Selenium is designed to interact with the browser, and thus its element locators reflect that.</p>

<p>To avoid confusion, keep in mind that the two libraries use different conventions for querying elements, and you will need to adjust your approach accordingly.</p>

<p>For more information on the various ways to locate elements in Selenium, check out the official documentation <a href="https://selenium-python.readthedocs.io/locating-elements.html" target="_blank">here</a>.</p>


</div>


## 4. Scrolling on a site

When scraping websites, you may encounter pages with infinite scrolling or content that only loads when you scroll down, such as social media feeds or product listings. BeautifulSoup cannot handle scrolling since it only interacts with the initial HTML source code. However, Selenium can simulate user interactions like scrolling, making it a critical tool for scraping websites with dynamic loading.

In this section, we'll cover how to scroll on a webpage using Selenium, and we'll explain different stopping rules to control the scrolling behavior.

__Example: Slow Scrolling on music-to-scrape.org__

Let's start with a simple example of scrolling through the webpage https://music-to-scrape.org. The following code snippet shows how to scroll slowly down the page using Selenium.

In [None]:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
import time

# Navigate to the website
driver.get("https://music-to-scrape.org")

# Sleep to allow the page to load completely
time.sleep(2)

# Scroll down the page slowly
scroll_pause_time = 1
for i in range(3):  # Scroll down 3 times
    print(f'Scroll number {i + 1}')
    driver.find_element(By.TAG_NAME, 'body').send_keys(Keys.PAGE_DOWN)
    time.sleep(scroll_pause_time)  # Pause to simulate human-like scrolling


In this example:
- We load the website and use Keys.PAGE_DOWN to simulate the Page Down key, which scrolls the page down one screen at a time.
- We include a pause (time.sleep()) between each scroll to mimic a more natural, human-like behavior.

There are different stopping rules to control when the scrolling should end:

1. Until a certain number of iterations: You can stop after a fixed number of scrolls, as in the example above, where we scrolled 3 times. This is useful when you want to limit the depth of your scraping.
 
2. Until an element is located: You might want to stop scrolling once a specific element appears on the page, such as a "Load More" button or a particular section.
```
while not driver.find_elements(By.CLASS_NAME, 'target-element'):
    driver.find_element(By.TAG_NAME, 'body').send_keys(Keys.PAGE_DOWN)
    time.sleep(scroll_pause_time)  # give new content time to load
```

3. Until the end of the page is reached: To stop scrolling at the bottom of the page, compare the total page height before and after each scroll. If the height no longer grows, no new content was loaded.

```
last_height = driver.execute_script("return document.body.scrollHeight")

while True:
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(scroll_pause_time)
    new_height = driver.execute_script("return document.body.scrollHeight")
    if new_height == last_height:  # Stop if the scroll height hasn't changed
        break
    last_height = new_height
```
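
On a true infinite-scroll page, the third rule may never trigger, so it is often combined with the first one. The sketch below wraps both in a function; `scroll_to_bottom` and `max_rounds` are names introduced here for illustration, and `driver` is assumed to be a Selenium driver (only its `execute_script` method is used).

```python
import time


def scroll_to_bottom(driver, pause=1.0, max_rounds=20):
    """Scroll until the page height stops growing or max_rounds is reached.

    Returns the number of scroll actions performed.
    """
    last_height = driver.execute_script("return document.body.scrollHeight")
    rounds = 0
    while rounds < max_rounds:
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        rounds += 1
        time.sleep(pause)  # give newly requested content time to load
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:  # nothing new was loaded
            break
        last_height = new_height
    return rounds
```

The cap keeps the loop from running forever on feeds that always load more content; the returned count tells you whether the bottom was reached before the cap.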


__Exercise:__

Now it’s your turn. Use Selenium to scroll through the https://music-to-scrape.org website and extract the name of each track displayed after scrolling down several times.
1. Scroll down the page five times.
2. Store the names of all tracks that appear after scrolling in a JSON dictionary.

*Solution:*

In [None]:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from bs4 import BeautifulSoup
import time

# Navigate to the website
driver.get("https://music-to-scrape.org")

# Sleep to allow the page to load completely
time.sleep(2)

json_dic = []

# Scroll down the page slowly
scroll_pause_time = 1
for i in range(5):
    print(f'Scroll number {i + 1}')
    driver.find_element(By.TAG_NAME, 'body').send_keys(Keys.PAGE_DOWN)
    time.sleep(scroll_pause_time)  # Pause to simulate human-like scrolling

    soup = BeautifulSoup(driver.page_source, 'html.parser')

    # find_all returns every element with class 'card', not just the first one
    for song in soup.find_all(class_='card'):
        json_dic.append({'name': song.get_text(), 'iteration': i})
json_dic

In this solution:
- We scrolled the page five times using Keys.PAGE_DOWN.
- After scrolling, we passed the current page source to BeautifulSoup to extract the track names.
- We stored the names of all tracks found on the page, along with the iteration number, in a JSON dictionary.

Do you know why the names are always the same? At this stage, music-to-scrape.org loads all of its content at once, so it is technically not very dynamic (yet). In your own projects, this scrolling approach enables you to scrape content that is only loaded after you scroll.
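
One practical consequence of parsing the page on every iteration is that `json_dic` contains the same tracks several times. A minimal sketch for removing such repeats while keeping the first occurrence of each name; `deduplicate` is a hypothetical helper, and the sample records are placeholders that only mimic the structure of `json_dic`.

```python
def deduplicate(records, key="name"):
    """Keep the first record for each distinct value of `key`, in original order."""
    seen = set()
    unique = []
    for record in records:
        value = record[key]
        if value not in seen:
            seen.add(value)
            unique.append(record)
    return unique


# Placeholder records shaped like the entries of json_dic
sample = [
    {"name": "Track A", "iteration": 0},
    {"name": "Track B", "iteration": 0},
    {"name": "Track A", "iteration": 1},
]
print(deduplicate(sample))
```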

## 5. Filling in forms

Filling in forms is a common task when interacting with websites, especially when you need to perform actions like logging in, searching, or submitting data. While BeautifulSoup is excellent for parsing static HTML, it cannot interact with forms or simulate user inputs. Selenium, on the other hand, can fill in forms, submit them, and interact with various input fields on a page, making it ideal for tasks like automated searches.

In this section, we’ll explore how to use Selenium to fill in a search form, query for artist names, and extract results. We'll query the search bar on https://music-to-scrape.org using artist names found on the page itself.

In [None]:
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
import time

# Navigate to the website
driver.get("https://music-to-scrape.org")
time.sleep(2)

artists = ['Adrenalin', 'Presets', 'Lupe']

# Loop through each artist and perform a search
for artist_name in artists:
    # Locate the search input field
    search_bar = driver.find_element(By.NAME, 'query')

    # Clear any previous input, then type the artist's name and submit the form
    search_bar.clear()
    search_bar.send_keys(artist_name)
    search_bar.send_keys(Keys.RETURN)

    # Wait for search results to load
    time.sleep(2)


In this example, we use Selenium to automate searches on music-to-scrape.org for multiple artist names. Extracting the results for each query is left for the exercise below. Here's a breakdown of how the process works:

1. __Loading the Website:__ We start by navigating to the website using Selenium's driver.get() method and give the page a moment to load using time.sleep().
2. __Looping Through Artists:__ We define a list of artist names (['Adrenalin', 'Presets', 'Lupe']) and loop through each name to perform a search on the site.
3. __Interacting with the Search Bar:__ For each artist, we locate the search input field using driver.find_element(By.NAME, 'query'). We clear the field, input the artist's name, and simulate pressing the return key to submit the search.
4. __Waiting for Results:__ After submitting the search, we wait for the results to load using another time.sleep(), allowing the page to update with the new search results.

__Exercise:__

Extend the code snippet above to store all search results on disk as newline-separated JSON objects (one object per line).

*Solution:*

In [None]:
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from bs4 import BeautifulSoup
import time
import json

# Navigate to the website
driver.get("https://music-to-scrape.org")
time.sleep(1)

artists = ['Adrenalin', 'Presets', 'Lupe']

# Open the output file in append mode; it is closed automatically after the block
with open('output.json', 'a', encoding='utf-8') as f:
    # Loop through each artist and perform a search
    for artist_name in artists:
        # Locate the search input field
        search_bar = driver.find_element(By.NAME, 'query')

        # Clear any previous input, then type the artist's name and submit the form
        search_bar.clear()
        search_bar.send_keys(artist_name)
        search_bar.send_keys(Keys.RETURN)

        # Wait for search results to load
        time.sleep(2)

        soup = BeautifulSoup(driver.page_source, 'html.parser')

        results = soup.find(class_='container').find_all(class_='list-group-item')
        for rank, row in enumerate(results, start=1):
            res = {
                'query': artist_name,
                'rank': rank,  # rank of search result
                'type': row.find(class_='center-text').find('p').get_text(),
                'result': row.find(class_='center-text').get_text().strip(),
            }
            # One JSON object per line
            f.write(json.dumps(res) + '\n')
        


In this exercise, we’ve walked through how to automate the process of filling in a search form, extracting the resulting data, and saving it in a structured format using Selenium and BeautifulSoup.

Here's a quick walk-through:
1. After submitting each search, we pause briefly using `time.sleep()` to allow the page to load fully before parsing the content. This ensures that we capture the updated results from the search.
2. Once the results page has loaded, we use BeautifulSoup to parse the page’s HTML. We specifically look for the `container` section where the search results are displayed, with each result being located within a `list-group-item` class. This allows us to accurately extract the data we’re interested in.
3. For each result, we create a JSON-like dictionary to store:
   - **The artist name** (`query`) that was used in the search.
   - **The rank** (`rank`) of each result, incremented for each item in the list.
   - **The type of result** (`type`), such as "Song" or "Album", which we find in the `<p>` tag within each result.
   - **The full result text** (`result`), which includes the specific details about the result, extracted from the relevant element.
4. We then write each result into an `output.json` file, ensuring each search result is stored as an individual JSON object. This allows us to save our scraped data in a structured and reusable format for future analysis.
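
A file written this way holds one JSON object per line, so it cannot be read with a single `json.load()` call. Below is a sketch of reading it back line by line; `read_json_lines` is a hypothetical helper, and the demonstration writes two placeholder records to a temporary file instead of touching `output.json`.

```python
import json
import os
import tempfile


def read_json_lines(path):
    """Parse a newline-delimited JSON file into a list of dictionaries."""
    records = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # skip blank lines
                records.append(json.loads(line))
    return records


# Placeholder records in the same shape as the solution's `res` dictionaries
rows = [
    {"query": "Lupe", "rank": 1, "type": "placeholder", "result": "placeholder"},
    {"query": "Lupe", "rank": 2, "type": "placeholder", "result": "placeholder"},
]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "output.json")
    with open(path, "a", encoding="utf-8") as f:
        for row in rows:
            f.write(json.dumps(row) + "\n")
    loaded = read_json_lines(path)

print(len(loaded))
```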


## 6. Summary

In this tutorial, you've explored how to combine Selenium and BeautifulSoup to scrape data from dynamic websites effectively. Here's a quick summary of key takeaways:

- Selenium vs. BeautifulSoup:
  - Selenium: Ideal for interacting with dynamic websites, such as handling forms, clicking buttons, and simulating user actions like scrolling. It works by controlling a browser and mimicking user behavior.
  - BeautifulSoup: Best suited for parsing static HTML content and extracting data from the page source. It cannot interact with dynamic elements but is excellent for navigating and querying structured HTML.
- Using Them Together: By combining Selenium for interaction and BeautifulSoup for parsing, you can scrape dynamic sites, automate form submissions, extract data, and store it efficiently in formats like JSON.

This combination allows you to tackle more complex scraping tasks by automating the interaction with dynamic content while still benefiting from BeautifulSoup’s simplicity and efficiency in data extraction.