# Web Scraping Advanced (oDCM)

[add intro line]


## Learning Objectives

Students will be able to: 
- Make more advanced use of `selenium`, emulating user interaction on a site (e.g., scrolling and filling in forms)
- Access data that is hidden behind a login-screen
- Apply search parameters to obtain subsets of data
- Capture and store images from the web
- Save the retrieved data as tabular files (e.g., CSV)

--- 

<div class="alert alert-block alert-info"><b>Support Needed?</b> 
    For technical issues outside of scheduled classes, please check the <a href="https://odcm.hannesdatta.com/docs/course/support" target="_blank">support section</a> on the course website.
</div>


---
## 1. Selenium 

### 1.1 Let's recap: Why Selenium? 

In the Web Scraping 101 tutorial, we mainly used BeautifulSoup to turn HTML into a data structure that we could search and access using Python-like syntax. While it's easy to get started with this library, it has limitations when it comes to dynamic websites, that is, websites whose content is loaded or changed by JavaScript after the initial page load or in response to user actions. Selenium can handle both static and dynamic websites and mimic user behavior (e.g., scrolling, clicking, logging in). It launches a separate browser window in which all actions are visible, which makes it feel more intuitive.

### 1.2 Installing Selenium

<div class="alert alert-block alert-warning"><b>Installing Selenium and Chromedriver</b> 

To install Selenium and Chromedriver locally, please follow the <a href="https://tilburgsciencehub.com/configure/python-for-scraping/?utm_campaign=referral-short">Tutorial on Tilburg Science Hub</a>.
    
You can also use the code snippet below to automate the installation. Running this snippet takes a little longer each time, but the benefit is that it almost always works!
</div>



In [32]:
# Installing necessary packages
!pip3 install webdriver_manager
!pip3 install selenium

# Importing required libraries
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager

# Setting up Chrome WebDriver with WebDriver Manager using Service
service = Service(ChromeDriverManager().install())
driver = webdriver.Chrome(service=service)

# Opening the 'music to scrape' website
url = "https://music-to-scrape.org/"
driver.get(url)

# Optional: wait up to 10 seconds whenever Selenium looks up an element that is not present yet
driver.implicitly_wait(10)  # 10 seconds



[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m23.3.1[0m[39;49m -> [0m[32;49m24.2[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpython3.11 -m pip install --upgrade pip[0m

[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m23.3.1[0m[39;49m -> [0m[32;49m24.2[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpython3.11 -m pip install --upgrade pip[0m


If everything went smoothly, your computer opened a new Chrome window and loaded `music-to-scrape`. 

<div class="alert alert-block alert-info"><b>Using Google Colab</b> 

If you're using Google Colab, you won't see a browser window open.
    
Whenever the code switches pages, just open that page manually in your own browser. Although this feels a little less interactive, you will still be able to work through this tutorial!

</div>
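
On a machine without a display, such as Google Colab, Chrome is usually started in headless mode. The snippet below is a minimal sketch of that setup, not part of the original tutorial: the helper name `make_headless_driver` and the constant `HEADLESS_ARGS` are made up for illustration, and the sketch assumes that Chrome itself and a recent version of Selenium are already installed in the environment.

```python
# Chrome flags commonly used on machines without a display (e.g., Google Colab)
HEADLESS_ARGS = [
    "--headless=new",           # run Chrome without opening a window
    "--no-sandbox",             # often required inside containers
    "--disable-dev-shm-usage",  # avoid crashes caused by a small /dev/shm
]


def make_headless_driver():
    """Start Chrome without a visible window (assumes Chrome and Selenium are installed)."""
    # Imported here so that the constant above can be inspected without Selenium
    from selenium import webdriver

    options = webdriver.ChromeOptions()
    for arg in HEADLESS_ARGS:
        options.add_argument(arg)
    return webdriver.Chrome(options=options)


# Usage (uncomment on a machine where Chrome is available):
# driver = make_headless_driver()
# driver.get("https://music-to-scrape.org/")
# driver.save_screenshot("homepage.png")  # inspect the page without a window
```

Saving a screenshot is one way to check what the headless browser currently "sees" when no window is available.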


### 1.3 Access Dynamic Sites Programmatically

**Importance**  
Next, we're going to tell the browser to visit the "Music to Scrape" website. We'll use the `driver` object we created earlier and call the `get` method, passing the URL of the website we'd like to extract data from.

In [33]:
driver.get("https://music-to-scrape.org/")

From this point, we can use BeautifulSoup as we learned previously, though we create the `res` object from the `driver` object this time. 

In [34]:
# Parse the HTML that Selenium has currently loaded in the browser
from bs4 import BeautifulSoup
res = driver.page_source.encode('utf-8')
soup = BeautifulSoup(res, "html.parser")

### 1.4 Scroll Sites Programmatically

Let's scrape the Top 15 Weekly Tracks. We first scroll down to that section and then extract the title and artist of each track:

In [9]:
# Scroll down to the "Top 15 Weekly Tracks" section
driver.execute_script("window.scrollTo(0, document.body.scrollHeight / 3);")

# Pause for a moment to ensure the section loads completely
import time
time.sleep(2)


In [11]:
# Find all track titles and artists (Adjusted based on the actual HTML structure)
track_entries = soup.find_all('div', class_='row justify-content-center')

# Initialize a list to store the track details
track_details = []

# Loop through each track entry and extract the artist name and track title
for track in track_entries:
    # The <b> tag within the <h5> element holds the artist name (see the output below)
    artist_name = track.find('h5').find('b').get_text()

    # The <p> element holds the track title
    track_title = track.find('p').get_text()

    # Format and store the track info as "title: artist"
    track_info = f"{track_title}: {artist_name}"
    track_details.append(track_info)
    print(track_info)  # Print the track info for demonstration purposes


China: Tito Puente
Pacific Coast Remix: DJ Quik
Funk You: Babylon Disco
It Makes No Difference Now: Billie Jo Spears
Tin Pan Alley (aka Roughest Place In Town): Stevie Ray Vaughan And Double Trouble
Blue Yodel No. 1(T For Texas): Charlie McCoy
Exogenesis: Symphony Part 1 [Overture]: Muse
Welcome: Les Bonapartes
Peanuts: Little Joe & The Thrillers
The Tube: Spectra Soul
Snippet 15: Bare Jr.
Travellin' On: Charlie Byrd Trio
Couldn't Stand The Weather: Stevie Ray Vaughan And Double Trouble
Purple Haze: Johnny Pearson
Paint It Black: Chris Farlowe
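
The loop above only prints each track. To keep the results for later analysis, one option is to collect a dictionary per track and write the list to a CSV file with Python's built-in `csv` module. The sketch below uses three rows copied from the output above; the helper `save_to_csv`, the file name `weekly_tracks.csv`, and the column names are assumptions made for illustration.

```python
import csv

# Three rows copied from the output above; in practice, build this list inside the loop
tracks = [
    {"title": "China", "artist": "Tito Puente"},
    {"title": "Pacific Coast Remix", "artist": "DJ Quik"},
    {"title": "Funk You", "artist": "Babylon Disco"},
]


def save_to_csv(rows, filename):
    """Write a list of dictionaries to a CSV file, one row per dictionary."""
    fieldnames = list(rows[0].keys())  # assumes all rows share the same keys
    # newline="" prevents blank lines between rows on Windows
    with open(filename, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)


save_to_csv(tracks, "weekly_tracks.csv")
```

Storing each track as a dictionary, rather than as one formatted string, keeps title and artist in separate columns, which makes the file easier to load into a spreadsheet or data frame later on.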


In [12]:

# Initialize an empty list to store track information
tracks = []

# Get the initial scroll height of the page
last_height = driver.execute_script("return document.body.scrollHeight")

# Infinite scrolling: Continue until no new content is loaded
while True:
    # Get the current page source and parse it with BeautifulSoup
    res = driver.page_source
    soup = BeautifulSoup(res, 'html.parser')
    
    # Find all track titles and artists (Adjusted based on the actual HTML structure)
    track_entries = soup.find_all('div', class_='row justify-content-center')
    
    # Add track details to the list
    for track in track_entries:
        # The <b> tag inside <h5> holds the artist name
        artist_name = track.find('h5').find('b').get_text()
        
        # The <p> element holds the track title
        track_title = track.find('p').get_text()

        # Format the track info and check for duplicates before adding
        track_info = f"Title: {track_title}\nArtist: {artist_name}\n"
        
        # Append if the track is not already in the list
        if track_info not in tracks:
            tracks.append(track_info)
    
    # Scroll down to load more content
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    
    # Pause for two seconds to allow new content to load
    time.sleep(2)
    
    # Recalculate the scroll height and break the loop if the end is reached
    new_height = driver.execute_script("return document.body.scrollHeight")
    if new_height == last_height:
        break
    last_height = new_height

# Print the tracks we've collected in the desired format
for idx, track in enumerate(tracks, 1):
    print(f"{idx}. {track}")




1. Title: China
Artist: Tito Puente

2. Title: Pacific Coast Remix
Artist: DJ Quik

3. Title: Funk You
Artist: Babylon Disco

4. Title: It Makes No Difference Now
Artist: Billie Jo Spears

5. Title: Tin Pan Alley (aka Roughest Place In Town)
Artist: Stevie Ray Vaughan And Double Trouble

6. Title: Blue Yodel No. 1(T For Texas)
Artist: Charlie McCoy

7. Title: Exogenesis: Symphony Part 1 [Overture]
Artist: Muse

8. Title: Welcome
Artist: Les Bonapartes

9. Title: Peanuts
Artist: Little Joe & The Thrillers

10. Title: The Tube
Artist: Spectra Soul

11. Title: Snippet 15
Artist: Bare Jr.

12. Title: Travellin' On
Artist: Charlie Byrd Trio

13. Title: Couldn't Stand The Weather
Artist: Stevie Ray Vaughan And Double Trouble

14. Title: Purple Haze
Artist: Johnny Pearson

15. Title: Paint It Black
Artist: Chris Farlowe
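
The scrolling part of the loop above can be separated from the parsing part. The helper below is a minimal sketch of that idea: it scrolls until the page height stops growing and adds a safety cap so that the loop cannot run forever on a page that keeps loading content. The function name `scroll_to_bottom` and the parameters `pause` and `max_rounds` are made up for this example; only `driver.execute_script` is taken from the code above.

```python
import time


def scroll_to_bottom(driver, pause=2.0, max_rounds=20):
    """Scroll down until the page height stops growing or max_rounds is reached.

    Returns the number of scrolls that actually loaded new content.
    """
    last_height = driver.execute_script("return document.body.scrollHeight")
    rounds_with_new_content = 0

    for _ in range(max_rounds):
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)  # give the site time to load additional content

        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            break  # nothing new was loaded, so we reached the end
        last_height = new_height
        rounds_with_new_content += 1

    return rounds_with_new_content


# Usage with the `driver` object created earlier (not run here):
# scroll_to_bottom(driver)
# soup = BeautifulSoup(driver.page_source, "html.parser")
```

Parsing the page once after scrolling has finished, instead of on every iteration, also removes the need to check for duplicates, provided the site keeps previously loaded items on the page.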



# TBD: add some exercises 
e.g. scrape a user's recently played tracks : https://music-to-scrape.org/user?username=Shadow06


### 2.1 Click Sites Programmatically

Next, we click on every entry in the weekly top 15 and extract basic information (title, artist, year, and number of plays) from its detail page. 

In [35]:
from selenium.webdriver.common.by import By
import time
from bs4 import BeautifulSoup

# Function to extract information from each album's detail page 
def extract_album_info(driver):
    # Get the page source after navigating to the album detail page
    page_source = driver.page_source
    # Parse the page using BeautifulSoup
    soup = BeautifulSoup(page_source, 'html.parser')

    # Find the div that contains the basic information (by class name)
    basic_info_div = soup.find('div', class_='song_basic_information_card')

    # Extract the Title, Artist, Year, and Plays
    title = basic_info_div.find_all('h5')[0].find_next('p').get_text()
    artist = basic_info_div.find_all('h5')[1].find_next('p').get_text()
    year = basic_info_div.find_all('h5')[2].find_next('p').get_text()
    plays = basic_info_div.find_all('h5')[3].find_next('p').get_text()

    # Return the album information as a dictionary
    return {
        "Title": title,
        "Artist": artist,
        "Year": year,
        "Plays": plays
    }


# Function to click each album, navigate to its page, and collect album information
def click_each_album_and_collect_info(driver):
    # Locate the top 15 weekly track links (based on <a> tags with "song-id" in the href attribute)
    album_links = driver.find_elements(By.CSS_SELECTOR, 'a[href*="song-id"]')
    
    # List to store album info
    album_info_list = []

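    # Loop over the links by index: iterating over the list itself would keep using
    # the original elements, even after we re-locate the links at the end of each round
    for i in range(len(album_links)):
        try:
            link = album_links[i]  # Pick the most recently located element
            print(f"Clicking album {i + 1}...")  # Print which album we're clicking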
            link.click()  # Click the album link
            time.sleep(2)  # Wait for the album page to load

            # Extract information from the album page using BeautifulSoup
            album_info = extract_album_info(driver)
            print(f"Extracted Info: {album_info}")
            
            # Append the collected info to the list
            album_info_list.append(album_info)

            # Navigate back to the main page (weekly tracks list)
            driver.back()
            time.sleep(2)  # Wait for the page to load again

            # Re-locate the album links after navigating back (since the DOM may have refreshed)
            album_links = driver.find_elements(By.CSS_SELECTOR, 'a[href*="song-id"]')

        except Exception as e:
            print(f"An error occurred while clicking album {i + 1}: {e}")

    return album_info_list



# Main logic to collect and print the album information
def main_flow():
    # Open the website and make sure it is loaded
    url = "https://music-to-scrape.org/"
    driver.get(url)
    time.sleep(3)  # Wait for the page to load

    # Scroll to ensure the albums are visible
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight / 3);")
    time.sleep(2)  # Wait for content to load

    # Click on each album and collect information
    album_info_list = click_each_album_and_collect_info(driver)

    # Print the collected album information
    print("\nCollected Album Info:")
    for album in album_info_list:
        print(album)


main_flow()



Clicking album 1...
Extracted Info: {'Title': 'Goor', 'Artist': 'Plaza', 'Year': 'None', 'Plays': '151'}
Clicking album 2...
Extracted Info: {'Title': 'Say Your Prayers', 'Artist': 'Gang Starr', 'Year': '1991', 'Plays': '131'}
Clicking album 3...
Extracted Info: {'Title': 'Red Barchetta (feat. James LaBrie_ James Murphey_ Sean Malone_ Sean Reinert & David Townson)', 'Artist': 'Steve Morse', 'Year': 'None', 'Plays': '114'}
Clicking album 4...
Extracted Info: {'Title': 'Love Letter', 'Artist': 'Gerald Veasley', 'Year': '1999', 'Plays': '160'}
Clicking album 5...
An error occurred while clicking album 5: Message: element not interactable
  (Session info: chrome=128.0.6613.120)
Stacktrace:
0   chromedriver                        0x00000001022a9248 chromedriver + 5096008
1   chromedriver                        0x00000001022a0a7a chromedriver + 5061242
2   chromedriver                        0x0000000101e2be01 chromedriver + 388609
3   chromedriver                        0x0000000101e79a85 c

The error `element not interactable` most likely occurs because Selenium tries to click an element that is not visible on screen, so the element first needs to be scrolled into view. Now let's apply our scrolling skills to this problem! 

In [36]:
# Add scroll before clicking 
from selenium.webdriver.common.by import By
import time
from bs4 import BeautifulSoup

# Function to extract information from each album's detail page using BeautifulSoup and class names
def extract_album_info(driver):
    # Get the page source after navigating to the album detail page
    page_source = driver.page_source
    # Parse the page using BeautifulSoup
    soup = BeautifulSoup(page_source, 'html.parser')

    # Find the div that contains the basic information (by class name)
    basic_info_div = soup.find('div', class_='song_basic_information_card')

    # Extract the Title, Artist, Year, and Plays
    title = basic_info_div.find_all('h5')[0].find_next('p').get_text()
    artist = basic_info_div.find_all('h5')[1].find_next('p').get_text()
    year = basic_info_div.find_all('h5')[2].find_next('p').get_text()
    plays = basic_info_div.find_all('h5')[3].find_next('p').get_text()

    # Return the album information as a dictionary
    return {
        "Title": title,
        "Artist": artist,
        "Year": year,
        "Plays": plays
    }

# Function to click each album, navigate to its page, and collect album information
def click_each_album_and_collect_info(driver):
    # Locate the top 15 weekly track links (based on <a> tags with "song-id" in the href attribute)
    album_links = driver.find_elements(By.CSS_SELECTOR, 'a[href*="song-id"]')
    
    # List to store album info
    album_info_list = []

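    # Loop over the links by index: iterating over the list itself would keep using
    # the original elements, even after we re-locate the links at the end of each round
    for i in range(len(album_links)):
        try:
            link = album_links[i]  # Pick the most recently located element
            print(f"Clicking album {i + 1}...")  # Print which album we're clicking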
            
            # Scroll to the element to make sure it's visible
            driver.execute_script("arguments[0].scrollIntoView(true);", link)
            time.sleep(1)  # Allow some time for scrolling

            # Click the album link
            link.click()
            time.sleep(2)  # Wait for the album page to load

            # Extract information from the album page using BeautifulSoup
            album_info = extract_album_info(driver)
            print(f"Extracted Info: {album_info}")
            
            # Append the collected info to the list
            album_info_list.append(album_info)

            # Navigate back to the main page (weekly tracks list)
            driver.back()
            time.sleep(2)  # Wait for the page to load again

            # Re-locate the album links after navigating back (since the DOM may have refreshed)
            album_links = driver.find_elements(By.CSS_SELECTOR, 'a[href*="song-id"]')

        except Exception as e:
            print(f"An error occurred while clicking album {i + 1}: {e}")

    return album_info_list



# Main logic to collect and print the album information
def main_flow():
    # Open the website and make sure it is loaded
    url = "https://music-to-scrape.org/"
    driver.get(url)
    time.sleep(3)  # Wait for the page to load

    # Scroll to ensure the albums are visible
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight / 3);")
    time.sleep(2)  # Wait for content to load

    # Click on each album and collect information
    album_info_list = click_each_album_and_collect_info(driver)

    # Print the collected album information
    print("\nCollected Album Info:")
    for album in album_info_list:
        print(album)


main_flow()




Clicking album 1...
Extracted Info: {'Title': 'Between Hell and A Heartbeat [Explicit]', 'Artist': 'Hatebreed (int License)', 'Year': 'None', 'Plays': '160'}
Clicking album 2...
Extracted Info: {'Title': 'Greatest X', 'Artist': 'Janet Jackson', 'Year': '2008', 'Plays': '142'}
Clicking album 3...
Extracted Info: {'Title': 'Short On Ideas / One Last Cigarette', 'Artist': 'Less Than Jake', 'Year': '2002', 'Plays': '168'}
Clicking album 4...
Extracted Info: {'Title': 'Porque Yo Te Amo', 'Artist': 'Leonardo Favio', 'Year': 'None', 'Plays': '150'}
Clicking album 5...
Extracted Info: {'Title': 'Why?', 'Artist': 'Kisha', 'Year': '1998', 'Plays': '128'}
Clicking album 6...
Extracted Info: {'Title': 'Jumke', 'Artist': 'Sukshinder Shinda', 'Year': 'None', 'Plays': '173'}
Clicking album 7...
Extracted Info: {'Title': "Romeo's Exile", 'Artist': "Liam O'Flynn", 'Year': '1995', 'Plays': '131'}
Clicking album 8...
Extracted Info: {'Title': 'The Man with the Dogs', 'Artist': 'Dead Kennedys', 'Year': '1
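
Clicking a link and navigating back works, but every round trip can invalidate the elements located earlier. A possible alternative, sketched below under the assumption that each link carries a usable `href` attribute, is to collect all link targets first and then visit each URL directly. The helper name `collect_unique_hrefs` is made up for this example; it only relies on the `get_attribute` method that Selenium elements provide.

```python
def collect_unique_hrefs(elements):
    """Return the href of each element, without duplicates, in page order."""
    hrefs = []
    for element in elements:
        href = element.get_attribute("href")
        if href and href not in hrefs:
            hrefs.append(href)
    return hrefs


# Usage with the objects defined earlier in this notebook (not run here):
# links = driver.find_elements(By.CSS_SELECTOR, 'a[href*="song-id"]')
# album_info_list = []
# for url in collect_unique_hrefs(links):
#     driver.get(url)   # visit the detail page directly, no click or back() needed
#     time.sleep(2)     # wait for the detail page to load
#     album_info_list.append(extract_album_info(driver))
```

Because the URLs are plain strings, they stay valid no matter how often the browser navigates, and no scrolling is needed to make a link clickable.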

---
### 2.2 Scrape Image Files

**Importance**  
In previous examples, we looked at scraping textual data from a web page. On image-heavy sites such as Instagram, however, it makes sense to store the image files as well. To this end, we extract the link to the image source (`image_link`) and pass it to the `wget` library. You can name the image whatever you want (e.g., `my_image.jpg`). By default, the image is stored in your current working directory (i.e., where this notebook resides).

Please note that the `wget` module is not part of the Python standard library, so you need to install it before running the cell below. Either type `pip install wget` in your terminal or use the Anaconda Navigator interface (see below - in your case there will likely be only the base (root) environment).
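
As a rough sketch of the download step, the example below uses `urllib.request` from the standard library instead of `wget`, so that it runs without an extra installation. The helper names `image_filename` and `download_image` and the fallback name `image.jpg` are assumptions made for illustration; the commented usage assumes that `soup` and `driver` exist as in the earlier cells.

```python
import os
from urllib.parse import urljoin, urlparse
from urllib.request import urlretrieve


def image_filename(image_link, default="image.jpg"):
    """Derive a local file name from the last part of the image URL."""
    name = os.path.basename(urlparse(image_link).path)
    return name or default


def download_image(image_link, page_url, folder="."):
    """Download one image and return the path of the saved file."""
    full_url = urljoin(page_url, image_link)  # turn a relative src into a full URL
    target = os.path.join(folder, image_filename(full_url))
    urlretrieve(full_url, target)
    return target


# Usage (not run here; `soup` and `driver` come from the earlier cells):
# for img in soup.find_all("img"):
#     src = img.get("src")
#     if src:
#         print(download_image(src, driver.current_url))
```

Deriving the file name from the URL avoids overwriting the same `my_image.jpg` when several images are downloaded in a loop, although images that share a file name on the site would still overwrite each other.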

[modify this later]