# Web Scraping 101

*After finishing this tutorial, you can extract data from multiple pages on the web, and export such data to CSV files so that you can use it in an analysis. Plan a few hours to work through this notebook. Taking a few breaks inbetween keeps you sharp!*

*Just starting out with web scraping? Then make sure to have followed the ["webdata for dummies" tutorial](https://odcm.hannesdatta.com/docs/modules/week2/webdata-for-dummies/) first.*

*Enjoy!*

--- 

## Learning Objectives

Our main goal is to compile a panel data set of music consumption data for (simulated) users of music-to-scrape.org, a platform developed for practicing web scraping skills.

* Identifying a strategy to generating seeds (“sampling”)
    * Extracting multiple elements at once using the `.find_all()` function
    * Preventing array misalignment
* Navigating on a website 
    * Using URLs to programmatically visit web pages
    * Writing loops to execute data collections in bulk using functions
* Improving extraction design
    * Implementing timers and modularizing extraction code
    * Storing data in CSV or JSON files with relevant meta data
* Scraping more advanced, dynamic websites
    * Understanding the difference between headless requests and browser emulation 
    * Learn when to apply one of the two methods (using `requests` and `selenium`)

--- 

<div class="alert alert-block alert-info"><b>Support Needed?</b> 
    For technical issues outside of scheduled classes, please check the <a href="https://odcm.hannesdatta.com/docs/course/support" target="_blank">support section</a> on the course website.
</div>


## 1. Generating seeds ("sampling")


__Importance__

So far, we've extracted (=parsed) some information (e.g., names of featured artists) from an artist's individual *artist page*. What we haven't done yet is to take a closer look at the consumption of individual users.

In fact, individual users are often a focal point of attention in web scraping. For example, we can sample users' tweets on Twitter/X, or users' movie watching behavior on trakt.tv. 

Yet, before we can start building what is called a "panel data set" (i.e., multiple users, observed over multiple time periods), we need to decide for __which users to obtain information__. Ideally, we would like to capture information for a *sample of users* (or books, movies, series, games - depending on the platform.).

In web scraping, we typically refer to a "seed" as a starting point for a data collection. Without a seed, there's no data to collect.

For example, before we can crawl through all users available at [music-to-scrape.org](https://music-to-scrape.org), we first need to generate a *list of many users of the platform*. (Note that obtaining the user names of ALL users of the site is barely possible).

One way to get there would be to:

1. first visit the main homepage of [music-to-scrape.org](https://music-to-scrape.org), showing five recently active users at the time, and
2. visit a users' profile page and start scraping their consumption data (or anything else on that page; we have done this in the webdata for dummies tutorial). 

Note that the homepage allows us to "navigate" to the users' profile pages, such as by clicking on the user name or the avatar (see red boxes in the figure below). 

<img src="https://raw.githubusercontent.com/hannesdatta/course-odcm/master/content/docs/modules/week3/webscraping101/images/mts-users.png" align="left" width=80%/>

### 1.1 Collecting links to use as seeds

Let's take a look at how the links for users' profile pages are written in the website's source code.

Open the [website](https://music-to-scrape.org), and inspect the underlying HTML code with the Chrome or Firefox Inspector (right click --> inspect element). 

Each user contains a clickable link (`<a>`), containing the link (`href`) to the user's profile page. 

<img src="https://raw.githubusercontent.com/hannesdatta/course-odcm/master/content/docs/modules/week3/webscraping101/images/mts-inspect-link.png" align="left" width=60%/>

How could we tell a computer to capture the links to the various user pages?

One simple way is to select *elements by their tags*. For example, to extract all links (`<a>` tags). 

<div class="alert alert-block alert-info"><b>How to extract multiple elements at once?</b>
    <br>
    
- By working through other tutorials, you may already be familiar with the <code>.find()</code> function of BeautifulSoup. The <code>.find()</code> function returns the <b>first element</b> that matches your particular "search query". <br>
- If you want to extract <b>all elements</b> that match a particular search pattern (say, a class name), you can use BeautifulSoup's <code>.find_all()</code> function.<br>
- Note that the "result" of the <code>.find_all()</code> option is a list of results __that you need to iterate through.__

</div>


__Exercise 1.1__

Please run the code cell below, which extracts all links (the `a` tag!), and prints the URL (`href`) to the screen. Don't worry, you don't need need to understand the code yet, we'll go over it line by line shortly!

If you look at these links more closely, you'll notice that we're not interested in many of these links... 

Make a list of all links we're *not* interested in (i.e., those *not* pointing to a user page). Which ones are those? Can you find out why they are there?

In [7]:
# Run this code now
import requests
from bs4 import BeautifulSoup

# make a get request to the books overview page (see Webdata for Dummies tutorial)
user_agent = {'User-agent': 'Mozilla/5.0'}
url = 'https://books.toscrape.com/catalogue/category/books_1/index.html'
url = 'http://127.0.0.1:8000'

res = requests.get(url, headers = user_agent)
res.encoding = res.apparent_encoding

soup = BeautifulSoup(res.text)

# return the href attribute in the <a> tag nested within the first product class element
for link in soup.find_all("a"):
    if 'href' in link.attrs: 
        print(link.attrs["href"])

/privacy_terms
/privacy_terms
about
/
/
/tutorial_scraping
/tutorial_api
#
https://api.music-to-scrape.org/docs
/about
song?song-id=SOQELHR12AC4689565
song?song-id=SOZONJS12A58A7920D
song?song-id=SOPTRXF12A8C135387
song?song-id=SOHUDTL12AAFF43497
song?song-id=SOLCJXU12A8C134178
song?song-id=SOEHBWL12A6D4F95D1
song?song-id=SOOIAGM12AB01862C7
song?song-id=SOHHJIL12A8C144BDB
song?song-id=SOTKTCQ12AB01863FF
song?song-id=SODKRYJ12AC468A00F
song?song-id=SOMNCXX12A8C130C46
song?song-id=SOEUWYJ12A8C144BE8
song?song-id=SOWUAYB12A6D4FA1D4
song?song-id=SORUWYJ12AB0181EBF
song?song-id=SOMURVQ12A67AD741A
song?song-id=SOUNWEU12CF5F88ADE
song?song-id=SOQLNGZ12A8C1314E8
song?song-id=SOJUORK12A8C143BE5
song?song-id=SOGYMWC12A6D4FBAF2
song?song-id=SOCIFIF12A8C1378E5
song?song-id=SOHWJYQ12A8C13658B
song?song-id=SOXFSTR12A8AE463B0
song?song-id=SOWYBEG12A6D4F9142
song?song-id=SOUBVIH12A8C137C89
song?song-id=SOQXGVE12CF5F86D20
artist?artist-id=ARQIWOW11F4C840FF1
artist?artist-id=ARWBL9E1187FB4E695
artist?ar

**Your answer**

...

__Solution__

The links we want to ignore are...

* The links to the about or privacy pages
* Any link pointing to the most popular songs or artists
* Any social media links, etc.

These links are present on the page, because they are used by users to navigate on the page. 

### 1.2 Collecting *More Specific* Links

__Importance__

We've just discovered that selecting elements by their tags gives us many irrelevant links. But, how can we narrow down these links, or, in other words, __how can we scrape only the users we're interested in?__.

To answer this question, we need to briefly revisit the notion of how an HTML code is structured. __Open your browser's inspect mode again and hover over the "recently active users" section on the site.__

After inspecting, you'd probably notice that the page is generated according to a rigid structure: all user links are contained in a `<section>` tag, with the attribute `name="recent_users"`. The "wrong links" extracted above (i.e., to the about or privacy pages) are *not* part of these elements. 

So, if we can tell our scraper that we're only interested in the `<a>` tags *within the particular `<section>` with attribute `name` equal to `recent_users`, we end up with our desired selection of links. 

__Let's try it out__

Like before, we'll use `.find_all()` to capture all matching elements on the page. The difference, however, is that we do not directly try to extract the __links__ with the tag `a`, but first try to obtain a __list with product containers__ identified by the classname `product_pod`.

Run the code below, in which we first try to capture all book containers using the `product_pod` class.


In [9]:
import requests
from bs4 import BeautifulSoup

# make request
url = 'https://books.toscrape.com/catalogue/category/books_1/index.html'
url = 'http://127.0.0.1:8000'

res = requests.get(url, headers = user_agent)
res.encoding = res.apparent_encoding

soup = BeautifulSoup(res.text)

relevant_section = soup.find('section',attrs={'name':'recent_users'})

users = []
for link in relevant_section.find_all("a"):
    if 'href' in link.attrs: 
        users.append(link.attrs['href'])
users

['user?username=CoderTech68',
 'user?username=PixelGalaxy84',
 'user?username=Pixel48',
 'user?username=PixelGamer12',
 'user?username=GalaxyNinja25',
 'user?username=GalaxyShadow34']

As expected, we retrieve up to six user names. You can now also use the `users` object to look at the data for the first, second, third, ... user.

In [10]:
users[0] # returns the link to the user page of the 1st user

'user?username=CoderTech68'

...to subsequently try to extract the link for the first book...

Note the user list still contains a lot of "other" things, unrelated to the user name. Remember, we extracted the __links__ to the profile pages, not just the user names.

If we want to remove anything but the usernames, we can modify our extraction function slightly, using Python's `split` function.


In [11]:
users = []
for link in relevant_section.find_all("a"):
    if 'href' in link.attrs: 
        users.append(link.attrs['href'].split('=')[1])
users

['CoderTech68',
 'PixelGalaxy84',
 'Pixel48',
 'PixelGamer12',
 'GalaxyNinja25',
 'GalaxyShadow34']

Need explanation on this code? Just copy-paste it to ChatGPT and ask for an explanation, e.g., using this prompt:

> I struggle to understand this piece of Python code in the context of web scraping. 
> Can you please explain it, paying attention to the complicated last line (user.append())?

Pretty cool, right? So let's proceed with some exercises.

#### Exercise 1.2
1. Modify the loop (`for link in relevant_section`...) written above to extract the *absolute URLs* rather than the relative URLs. Specifically, combine the website's URL (`https://music-to-scrape.org/`) and the string you extracted earlier (`user?username=GalaxyShadow34`). The final URL needs to be: `https://music-to-scrape.org/user?username=GalaxyShadow34`.

2. Write a function to collect many user names (seeds) from this page, returning this information as an array. 

3. Execute your function from 2) in a while loop, that runs every 2 seconds for a duration of 15 seconds. Importantly, write all URLs to a new-line separated text file, called `seeds.txt`.

In [7]:
# your answer goes here!

#### Solutions

In [14]:
# Question 1 
urls = []
for link in relevant_section.find_all("a"):
    if 'href' in link.attrs: 
        extracted_link = link.attrs['href']
        urls.append(f'https://music-to-scrape.org/{extracted_link}')
urls

['https://music-to-scrape.org/user?username=CoderTech68',
 'https://music-to-scrape.org/user?username=PixelGalaxy84',
 'https://music-to-scrape.org/user?username=Pixel48',
 'https://music-to-scrape.org/user?username=PixelGamer12',
 'https://music-to-scrape.org/user?username=GalaxyNinja25',
 'https://music-to-scrape.org/user?username=GalaxyShadow34']

In [17]:
# Question 2
import requests
from bs4 import BeautifulSoup

url = 'https://music-to-scrape.org/'
url = 'http://127.0.0.1:8000'

def get_users():
  
    res = requests.get(url, headers = user_agent)
    res.encoding = res.apparent_encoding
    
    soup = BeautifulSoup(res.text)
    
    relevant_section = soup.find('section',attrs={'name':'recent_users'})

    links = []
    for link in relevant_section.find_all("a"):
        if 'href' in link.attrs: 
            extracted_link = link.attrs['href']
            links.append(f'https://music-to-scrape.org/{extracted_link}')
    return(links) # to return all links

get_users()

['https://music-to-scrape.org/user?username=GalaxyNinja25',
 'https://music-to-scrape.org/user?username=PandaStealth43',
 'https://music-to-scrape.org/user?username=CoderTech68',
 'https://music-to-scrape.org/user?username=GamerMoon97',
 'https://music-to-scrape.org/user?username=Geek61',
 'https://music-to-scrape.org/user?username=Dragon05']

In [21]:
# Question 3
import time

# Define the duration in seconds (1 minute = 60 seconds)
duration = 15

# Calculate the end time
end_time = time.time() + duration

f = open('seeds.txt','a')

# Run the loop until the current time reaches the end time
while time.time() < end_time:
    for user in get_users():
        f.write(user+'\n')
    time.sleep(2)  # Sleep for a few seconds between each execution
f.close()


# 1.3 Preventing array misalignment

So far, we have only extracted *one* piece of information (the URL) from the list of recently active users. But, what if we want to also extract the recently consumed song? 

<img src="https://raw.githubusercontent.com/hannesdatta/course-odcm/master/content/docs/modules/week3/webscraping101/images/mts-song-tag.png" align="left" width=60%/>



A simple solution may be to just use multiple `.find_all()` commands.

__Example__:


In [23]:
# Run this code now
import requests
from bs4 import BeautifulSoup

url = 'https://music-to-scrape.org/'
url = 'http://127.0.0.1:8000'

res = requests.get(url, headers = user_agent)
res.encoding = res.apparent_encoding

soup = BeautifulSoup(res.text)

relevant_section = soup.find('section',attrs={'name':'recent_users'})

# getting links
links = []
for link in relevant_section.find_all("a"):
    if 'href' in link.attrs: 
        extracted_link = link.attrs['href']
        links.append(f'https://music-to-scrape.org/{extracted_link}')

# getting songs
songs = []
for song in relevant_section.find_all("span"):
    songs.append(song.get_text())


# links for each user
print(links)

# recent songs for each user
print(songs)

['https://music-to-scrape.org/user?username=user24', 'https://music-to-scrape.org/user?username=user34', 'https://music-to-scrape.org/user?username=user7', 'https://music-to-scrape.org/user?username=user4', 'https://music-to-scrape.org/user?username=user42', 'https://music-to-scrape.org/user?username=user19']
['Jewel - Serve The Ego (Hani Num Dub)', 'Cirrus - She kills', 'Lionel Rogg - Die Kunst der Fuge_ BWV 1080 (2007 Digital Remaster): Contrapunctus XVII - Inversus', 'Anna Abreu - Shame', 'Fudge Tunnel - Gut Rot']


While this approach seems easily implemented, it is __highly error-prone and needs to be avoided.__

So... what happened?

The length for these two objects - `links` and `songs` - differ! While the links properly render for each user, we can only retrieve song information for a subset of songs. In the end, we won't be able to tell WHICH song is part of WHICH user.

<div class="alert alert-block alert-info"><b>What's an array misalignment?</b>
    <br>
    
<ul>
<li>
When extracting information from the web, we sometimes are prone to "ripping apart" the website's original structure by putting data points into individual arrays (e.g., lists such as one list for user names and another for their recently consumed songs). </li>
<li>In so doing, we violate the data's original structure: we should store information on users, and <b>each user</b> has a user name/link and song.</li>
    <li>The <b>correct way of organizing the data</b> is to create a list of users (e.g., in a dictionary) and then store each attribute (e.g., the song, etc.) <b>within</b> these objects. <b>Only if we store data this way</b> can we be sure to store everything correctly. </li>
<br>
<li>When we do not adhere to this practice, we run the risk of "array misalignment". For example, if only ONE data point were missing for a user, then the (independent) user names array (say, with 6 items) wouldn't be "1:1 aligned" with the song array (say, with only 2-5 items).</li>

</div>

__So, how to do it correctly?__

We will first have to iterate through each __user__, and *within* each user, extract the required information.

Storing the information in a list of dictionaries corresponds most to this solution (see the example below):

In [34]:
# Import necessary libraries
import requests
from bs4 import BeautifulSoup

# Define the URL you want to scrape
url = 'https://music-to-scrape.org/'
url = 'http://127.0.0.1:8000'

# Send an HTTP GET request to the URL and store the response
res = requests.get(url, headers=user_agent)

# Set the encoding of the response to the apparent encoding
res.encoding = res.apparent_encoding

# Parse the HTML content of the response using BeautifulSoup
soup = BeautifulSoup(res.text)

# Find the HTML section with the attribute 'name' equal to 'recent_users'
relevant_section = soup.find('section', attrs={'name': 'recent_users'})

# Identify individual users within the relevant section
users = relevant_section.find_all(class_='mobile-user-margin')

# Initialize a list to store user data
user_data = []

# Loop through each user in the list of users
for user in users:
    # Check if the user has an 'href' attribute within an anchor tag
    if 'href' in user.find('a').attrs:
        # Extract the link from the 'href' attribute
        extracted_link = user.find('a').attrs['href']
    
    # Check if the user has a 'span' element
    if user.find('span') is not None:
        # Get the text content of the 'span' element, which represents song names
        song_name = user.find('span').get_text()
    else:
        # If there is no 'span' element, set the song_name to 'NA'
        song_name = 'NA'
    
    # Create a dictionary object with the extracted data
    obj = {'url': extracted_link, 'song_name': song_name}
    
    # Append the dictionary to the user_data list
    user_data.append(obj)

# user_data now contains a list of dictionaries, each representing user information with a URL and song name
user_data

[{'url': 'user?username=user24',
  'song_name': 'Set Your Goals - the fallen...'},
 {'url': 'user?username=user34', 'song_name': 'Cirrus - She kills'},
 {'url': 'user?username=user7',
  'song_name': 'Lionel Rogg - Die Kunst der Fuge_ BWV 1080 (2007 Digital Remaster): Contrapunctus XVII - Inversus'},
 {'url': 'user?username=user4', 'song_name': 'Anna Abreu - Shame'},
 {'url': 'user?username=user42', 'song_name': 'Lit - Lovely Day'},
 {'url': 'user?username=user19', 'song_name': 'Fudge Tunnel - Gut Rot'}]

## 2. Navigating on a Website

### 2.1. Using URLs

__Importance__

Alright - what have we learnt up this point?

We've learnt how to extract seeds (here: users) from __one page -- the homepage of the platform.__

So... what's missing?

Exactly! [`music-to-scrape.org`](https://music-to-scrape.org) contains data on many users. 

The objective of this section is to navigate through each user's __consumption history__. However, it's important to note that this information is spread across multiple pages, and we need to visit them one by one.

__Let's try it out__

Open [the website](https://music-to-scrape.org/user?username=StarCoder49), and click on the "previous" button at the top of the page.

<img src="https://raw.githubusercontent.com/hannesdatta/course-odcm/master/content/docs/modules/week3/webscraping101/images/mts-user-page.png" align="left" width=60%/>


Repeat this a couple of times, and observe how the URL in your navigation bar is changing...

- `https://music-to-scrape.org/user?username=StarCoder49`
- `https://music-to-scrape.org/user?username=StarCoder49&week=37`
- `https://music-to-scrape.org/user?username=StarCoder49&week=36`
- `https://music-to-scrape.org/user?username=StarCoder49&week=35`
- ...

Can you guess the next one...?

Indeed! The URL can be divided into a __fixed base part__ (`https://music-to-scrape.org/user?username=StarCoder49`), and a __counter__ that is dependent on the page you're visiting (e.g., `&week=36`). 

__Now let's create a list of all URLs!__ 

Click once on "previous week" to figure out in which week you currently are. Then, we can create a variable, counting downwards from that number to 0 (the first week in which the platform was active). 

Then, we assemble the complete list of URLs. In our application (as of 11 Sept. 2023), the current week number is 37.

In [47]:
counter = 37
page_urls = []
while counter >=0:
    page_urls.append(f'https://music-to-scrape.org/user?username=StarCoder49&week={counter}')
    counter-=1
page_urls

['https://music-to-scrape.org/user?username=StarCoder49&week=37',
 'https://music-to-scrape.org/user?username=StarCoder49&week=36',
 'https://music-to-scrape.org/user?username=StarCoder49&week=35',
 'https://music-to-scrape.org/user?username=StarCoder49&week=34',
 'https://music-to-scrape.org/user?username=StarCoder49&week=33',
 'https://music-to-scrape.org/user?username=StarCoder49&week=32',
 'https://music-to-scrape.org/user?username=StarCoder49&week=31',
 'https://music-to-scrape.org/user?username=StarCoder49&week=30',
 'https://music-to-scrape.org/user?username=StarCoder49&week=29',
 'https://music-to-scrape.org/user?username=StarCoder49&week=28',
 'https://music-to-scrape.org/user?username=StarCoder49&week=27',
 'https://music-to-scrape.org/user?username=StarCoder49&week=26',
 'https://music-to-scrape.org/user?username=StarCoder49&week=25',
 'https://music-to-scrape.org/user?username=StarCoder49&week=24',
 'https://music-to-scrape.org/user?username=StarCoder49&week=23',
 'https://

As expected, this gives a list of all page URLs that contain consumption data for this particular user. 

In [44]:
# print the number of pages urls (btw, run print(page_urls) for yourself to see all page URLs!)
print("The number of page urls in the list is: " + str(len(page_urls)))

The number of page urls in the list is: 38


#### Exercise 2.1

Let's take a step back again, and practice combining the seeds from the previous exercises, with what you've just learnt. 


1. Use the function `get_users()` to generate once a list of users currently active on the site (see exercise 1.2). Store these user names in a list called `users`.
2. Create an empty object, called `urls`. Then, loop through your list of users and - at each iteration - append all URLs (i.e., user-week) to the list.


In [14]:
# your answer goes here!

#### Solutions
1. Let's first use the get_users() function from above (see exercise 1.2) to regenerate a list of users.

In [39]:
users = get_users()

2. Let us now assemble the list of URLs.

In [48]:
urls = []

for user in users:
    counter = 37
    page_urls = []
    while counter >=0:
        urls.append(f'{user}&week={counter}')
        counter-=1

In [49]:
# view result
urls

['https://music-to-scrape.org/user?username=Wizard25&week=37',
 'https://music-to-scrape.org/user?username=Wizard25&week=36',
 'https://music-to-scrape.org/user?username=Wizard25&week=35',
 'https://music-to-scrape.org/user?username=Wizard25&week=34',
 'https://music-to-scrape.org/user?username=Wizard25&week=33',
 'https://music-to-scrape.org/user?username=Wizard25&week=32',
 'https://music-to-scrape.org/user?username=Wizard25&week=31',
 'https://music-to-scrape.org/user?username=Wizard25&week=30',
 'https://music-to-scrape.org/user?username=Wizard25&week=29',
 'https://music-to-scrape.org/user?username=Wizard25&week=28',
 'https://music-to-scrape.org/user?username=Wizard25&week=27',
 'https://music-to-scrape.org/user?username=Wizard25&week=26',
 'https://music-to-scrape.org/user?username=Wizard25&week=25',
 'https://music-to-scrape.org/user?username=Wizard25&week=24',
 'https://music-to-scrape.org/user?username=Wizard25&week=23',
 'https://music-to-scrape.org/user?username=Wizard25&we

Of course, one of the big disadvantages of this "manual" link building is that we need to "know" how many pages to extract information from. This may vastly differ by user and across time. 

We turn towards this issue next.

### 2.2 Using links contained in elements (e.g., buttons)

__Importance__

For now, the user link extraction has worked without problems. Yet, there's still one little improvement that we can make. *If the number of pages changes*, we need to manually update for how many pages we would like to retrieve seeds.

A general solution is therefore to look up whether there is a `previous` button on the page (see HTML code below). We can then either "grab" the URL and visit it (so, in essence, we're still using URLs to navigate), or - instead - "click" on it.

<img src="https://raw.githubusercontent.com/hannesdatta/course-odcm/master/content/docs/modules/week3/webscraping101/images/mts-previous-page.png" align="left" width=60% style="border: 1px solid black" />

__Let's try it out__

So, let's write a snippet that "captures" the link of the next page button on the [books page](https://books.toscrape.com).

We always proceed in small steps.

In [51]:
# Step 1: Load the website's source code and convert to BeautifulSoup object
url = 'https://music-to-scrape.org/user?username=StarCoder49'
url = 'http://127.0.0.1:8000/user?username=StarCoder49'

header = {'User-agent': 'Mozilla/5.0'}
res = requests.get(url, headers = header)
res.encoding = res.apparent_encoding
soup = BeautifulSoup(res.text)

In [58]:
# Step 2: Trying to locate the previous button, using a combination of class names and attribute-value pairs.
soup.find(class_='page-link', attrs={'type':'previous_page'})

<a class="page-link" href="user?username=StarCoder49&amp;week=36" type="previous_page">Previous
                                        Week</a>

In [62]:
# Step 3: Trying to extract the `href` attribute
soup.find(class_='page-link', attrs={'type':'previous_page'}).attrs['href']

'user?username=StarCoder49&week=36'

In [64]:
# Step 4: Storing "previous page" link
previous_page_link = soup.find(class_='page-link', attrs={'type':'previous_page'}).attrs['href']
previous_page_link # print it

'user?username=StarCoder49&week=36'

At each iteration, we can observe how we're getting closer to the information we need.

Now, we only need to combine the base URL (`https://music-to-scrape.org/`) with the page number.

In [66]:
previous_page_link = soup.find(class_='page-link', attrs={'type':'previous_page'}).attrs['href']
f'https://music-to-scrape.org/{previous_page_link}'

'https://music-to-scrape.org/user?username=StarCoder49&week=36'

__Exercise 2.2__

Please first load the snippet below, which has wrapped the "previous page" capturing in a function. Observe the use of `try` and `except`, which accounts for the last page NOT having a next page button.

In [76]:
def previous_page(url):
    header = {'User-agent': 'Mozilla/5.0'}
    res = requests.get(url, headers = header)
    res.encoding = res.apparent_encoding
    soup = BeautifulSoup(res.text)
    try:
        previous_page_link = soup.find(class_='page-link', attrs={'type':'previous_page'}).attrs['href']
        return(f'https://music-to-scrape.org/{previous_page_link}')
    except:
        return('no previous page')


1. Pass 'https://music-to-scrape.org/user?username=StarCoder49&week=36' to `previous_page()` and observe the output. Then, use  `https://music-to-scrape.org/user?username=StarCoder49&week=0`. Is that what you expected? 

2. Write a while loop that assembles a list of all product pages for the user `StarCoder49`, by extracting previous page URLs from each page and appending them to an array/list called `urls`.


In [23]:
# write your code here

__Solution__

In [74]:
# Question 1
previous_page('https://music-to-scrape.org/user?username=StarCoder49&week=36')

'https://music-to-scrape.org/no next page'

In [25]:
previous_page('https://music-to-scrape.org/user?username=StarCoder49&week=0')
# returns "no next page"

'https://books.toscrape.com/catalogue/category/books_1/no next page'

In [81]:
# Question 2
urls = []

# define first URL to start from
url = 'https://music-to-scrape.org/user?username=StarCoder49'
url = 'http://127.0.0.1:8000/user?username=StarCoder49'

while True:
    print('Trying to get previous page URL from ' + url)
    previous_url = previous_page(url)
    if 'no previous page' in previous_url: break
    url = previous_url
    urls.append(url)
    
urls

Trying to get previous page URL from http://127.0.0.1:8000/user?username=StarCoder49
Trying to get previous page URL from https://music-to-scrape.org/user?username=StarCoder49&week=36


['https://music-to-scrape.org/user?username=StarCoder49&week=36']

### 2.3 Collecting information from each user page

Up to this moment, we have defined which seeds to use (usernames from the homepage), and identified from which pages we would like to extract information (e.g., for weeks 37 through 0). Yet, we haven't yet extracted any of the consumption data from the website (e.g., which song a particular user has listened to in a given week.

For this, we use our previous learnings (e.g., see "Web scraping for Dummies" tutorial in this course) to iterate through the table.

__Exercise 2.3__

View the code snippet below, which *prints* the information on what songs were listened to by a user to the screen.


In [117]:
# Step 1: Load the website's source code and convert to BeautifulSoup object
url = 'https://music-to-scrape.org/user?username=StarCoder49'

header = {'User-agent': 'Mozilla/5.0'}
res = requests.get(url, headers = header)
res.encoding = res.apparent_encoding
soup = BeautifulSoup(res.text)

table = soup.find('table')

rows = table.find_all('tr')

for row in rows:
    #print(row)
    data = row.find_all('td')
    
    if len(data)>0:
        song_name=data[0].get_text()
        artist_name=data[1].get_text()
        date=data[2].get_text()
        time=data[3].get_text()

        print(f'Song "{song_name}" by "{artist_name}"')

Song "So Ist Das Nun Mal" by "Andreas Dorau"
Song "Hex Breaker" by "Taint"
Song "Picture Book" by "Ray Davies"
Song "Lombrigas e os vermes" by "Eddie"
Song "124 Stomp" by "Azukx"
Song "Untitled" by "Rhian Sheehan"
Song "Bare As You Dare" by "Lady Saw"
Song "Long Day (Album Version)" by "Soul Asylum"
Song "Dear Friend" by "Shunza"
Song "Bop Till You Drop" by "Michael Stanley Band"
Song "On The Boards" by "Taste"
Song "Fool For Your Loving" by "Whitesnake"
Song "None Missing" by "Birdapres"
Song "Hello" by "LL Cool J / Amil"


1. Store the information in a list of dictionaries, containing the following data points:
    - username
    - song
    - artist
    - date
    - time
2. Wrap your code in a function, that returns the JSON dictionary from 1).

__Solution__

In [119]:
# Q1:

url = 'https://music-to-scrape.org/user?username=StarCoder49'
url = 'http://127.0.0.1:8000/user?username=StarCoder49'

header = {'User-agent': 'Mozilla/5.0'}
res = requests.get(url, headers = header)
res.encoding = res.apparent_encoding
soup = BeautifulSoup(res.text)

table = soup.find('table')

rows = table.find_all('tr')

json_data=[]
for row in rows:
    data = row.find_all('td')

    if len(data)>0:
        song_name=data[0].get_text()
        artist_name=data[1].get_text()
        date=data[2].get_text()
        time=data[3].get_text()
        json_data.append({'song_name': song_name,
                          'artist_name': artist_name,
                          'date': date,
                          'time': time,
                          'username': url.split('=')[1]})
json_data

[{'song_name': 'So Ist Das Nun Mal',
  'artist_name': 'Andreas Dorau',
  'date': '2023-09-11',
  'time': '23:03:59',
  'username': 'StarCoder49'},
 {'song_name': 'Hex Breaker',
  'artist_name': 'Taint',
  'date': '2023-09-11',
  'time': '22:59:18',
  'username': 'StarCoder49'},
 {'song_name': 'Picture Book',
  'artist_name': 'Ray Davies',
  'date': '2023-09-11',
  'time': '22:56:46',
  'username': 'StarCoder49'},
 {'song_name': 'Lombrigas e os vermes',
  'artist_name': 'Eddie',
  'date': '2023-09-11',
  'time': '22:53:56',
  'username': 'StarCoder49'},
 {'song_name': '124 Stomp',
  'artist_name': 'Azukx',
  'date': '2023-09-11',
  'time': '22:46:37',
  'username': 'StarCoder49'},
 {'song_name': 'Untitled',
  'artist_name': 'Rhian Sheehan',
  'date': '2023-09-11',
  'time': '22:45:58',
  'username': 'StarCoder49'},
 {'song_name': 'Bare As You Dare',
  'artist_name': 'Lady Saw',
  'date': '2023-09-11',
  'time': '22:42:07',
  'username': 'StarCoder49'},
 {'song_name': 'Long Day (Album Ve

In [121]:
#Q2

def get_consumption_history(url):
    header = {'User-agent': 'Mozilla/5.0'}
    res = requests.get(url, headers = header)
    res.encoding = res.apparent_encoding
    soup = BeautifulSoup(res.text)
    
    table = soup.find('table')
    
    rows = table.find_all('tr')
    
    json_data=[]
    for row in rows:
        data = row.find_all('td')
    
        if len(data)>0:
            song_name=data[0].get_text()
            artist_name=data[1].get_text()
            date=data[2].get_text()
            time=data[3].get_text()
            json_data.append({'song_name': song_name,
                              'artist_name': artist_name,
                              'date': date,
                              'time': time,
                              'username': url.split('=')[1]})
    return(json_data)

In [122]:
# try running the function
get_consumption_history('http://127.0.0.1:8000/user?username=StarCoder49')


[{'song_name': 'So Ist Das Nun Mal',
  'artist_name': 'Andreas Dorau',
  'date': '2023-09-11',
  'time': '23:03:59',
  'username': 'StarCoder49'},
 {'song_name': 'Hex Breaker',
  'artist_name': 'Taint',
  'date': '2023-09-11',
  'time': '22:59:18',
  'username': 'StarCoder49'},
 {'song_name': 'Picture Book',
  'artist_name': 'Ray Davies',
  'date': '2023-09-11',
  'time': '22:56:46',
  'username': 'StarCoder49'},
 {'song_name': 'Lombrigas e os vermes',
  'artist_name': 'Eddie',
  'date': '2023-09-11',
  'time': '22:53:56',
  'username': 'StarCoder49'},
 {'song_name': '124 Stomp',
  'artist_name': 'Azukx',
  'date': '2023-09-11',
  'time': '22:46:37',
  'username': 'StarCoder49'},
 {'song_name': 'Untitled',
  'artist_name': 'Rhian Sheehan',
  'date': '2023-09-11',
  'time': '22:45:58',
  'username': 'StarCoder49'},
 {'song_name': 'Bare As You Dare',
  'artist_name': 'Lady Saw',
  'date': '2023-09-11',
  'time': '22:42:07',
  'username': 'StarCoder49'},
 {'song_name': 'Long Day (Album Ve

In [123]:
# Check whether it also works for different weeks
get_consumption_history('http://127.0.0.1:8000/user?username=StarCoder49&week=12')

[{'song_name': 'Fiel Enamorado',
  'artist_name': 'Estrellas Cubanas',
  'date': '2023-03-25',
  'time': '23:15:18',
  'username': 'StarCoder49&week'},
 {'song_name': 'Trop De...',
  'artist_name': 'Ramses',
  'date': '2023-03-25',
  'time': '23:12:42',
  'username': 'StarCoder49&week'},
 {'song_name': 'Midnight',
  'artist_name': 'Joe Satriani',
  'date': '2023-03-25',
  'time': '23:09:36',
  'username': 'StarCoder49&week'},
 {'song_name': 'Five Long Years',
  'artist_name': 'Freddie King',
  'date': '2023-03-25',
  'time': '23:05:12',
  'username': 'StarCoder49&week'},
 {'song_name': 'Birak Yakami',
  'artist_name': 'Ebru Yasar',
  'date': '2023-03-25',
  'time': '23:01:37',
  'username': 'StarCoder49&week'},
 {'song_name': 'Bailando',
  'artist_name': 'Casual',
  'date': '2023-03-25',
  'time': '22:58:14',
  'username': 'StarCoder49&week'},
 {'song_name': 'Dis_ Oh Dis (Everybody Loves A Lover)',
  'artist_name': 'Line Renaud',
  'date': '2023-03-25',
  'time': '22:55:23',
  'usernam

To retrieve consumption data, you could now loop through the previously generated list of links (see webdata for dummies tutorial).

## 3. Improving Extraction Design

### 3.1 Timers

__Importance__

Before we started running some of the cells above, you may have observed the usage of the `time.sleep` function. Sending many requests at the same time can overload a server. Therefore, it's highly recommended to pause between requests rather than sending them all simultaneously. This avoids that your IP address (i.e., numerical label assigned to each device connected to the internet) gets blocked, and you can no longer visit (and scrape) the website. 

__Let's try it out__

In Python, you can import the `time` module, which pauses the execution of future commands for a given amount of time. For example, the print statement after `time.sleep(3)` will only be executed after 3 seconds:

In [30]:
# run this cell again to see the timer in action yourself!
import time
pause = 3
time.sleep(pause)
print(f"I'll be printed to the console after {pause} seconds!")

I'll be printed to the console after 3 seconds!


__Exercise 3.1__

Modify the code above to sleep for 2 minutes. Go grab a coffee inbetween. Did it take you longer than 2 minutes?

(if you want to abort the running code, just select the cell and push the "stop" button)

In [31]:
# your answer goes here!

**Solution**  

In [32]:
time.sleep(2*60)
print("Done!")

Done!


### 3.2 Modularization

**Importance**  

In scraping, many things have to be executed *multiple times*. For example, whenever we open a new page on books.toscrape.com, we would like to extract all the available book links.

To help us execute things over and over again, we will "modularize" our code into functions. We can then call these functions whenever we need them. Another benefit from using functions is that we can improve the readability and reusability of our code. If you need a quick refresher on functions, please revisit section 4 of the [Python Bootcamp](https://odcm.hannesdatta.com/docs/tutorials/pythonbootcamp/).

**Let's try it out**

Let's finish up our book URL scraper by putting together everything we have learned thus far.

1. We need a function that extracts all seeds, given a category URL. We would like to store these seeds in a JSON file and save it to the disk. This will consititute our "sample" going forward.
2. We need a function that opens this JSON file, and captures all of the relevant product information (for now, let's use the title and price).

__Exercise 3.2__

Write a function to accomplish (1) above? (capturing the seeds and storing them in a JSON file)? Start with the solution in 2.3.

__Solution__

In [33]:
import time

def get_seeds(start_url = 'https://books.toscrape.com/catalogue/category/books_1/'):
    seeds = []
    url = start_url
    counter = 0 #initialize counter so that you can break earlier from this loop when needed

    while True:
        counter+=1

        if (counter>4): break # (de)activate this comment if you want to break after x iterations for prototyping

        print(f'Trying to get next page URL from {url}')

        header = {'User-agent': 'Mozilla/5.0'}
        res = requests.get(url, headers=header)
        res.encoding = res.apparent_encoding
        soup = BeautifulSoup(res.text)

        # extract information
        urls = soup.find_all(class_="product_pod")
        for book in urls:
            url_book = book.find("a").attrs["href"]
            book_url = "https://books.toscrape.com/catalogue/" + url_book
            book_url = book_url.replace('../', '')
            seeds.append({'product_url': book_url,
                          'page_url': url,
                          'timestamp': int(time.time())})
        
        # next page available?
        try:
            url = 'https://books.toscrape.com/catalogue/category/books_1/' + soup.find(class_='next').find('a')['href']
        except:
            break # no next page present
            
    return(seeds)


In [34]:
data = get_seeds('https://books.toscrape.com/catalogue/category/books_1/')

Trying to get next page URL from https://books.toscrape.com/catalogue/category/books_1/
Trying to get next page URL from https://books.toscrape.com/catalogue/category/books_1/page-2.html
Trying to get next page URL from https://books.toscrape.com/catalogue/category/books_1/page-3.html
Trying to get next page URL from https://books.toscrape.com/catalogue/category/books_1/page-4.html


In [35]:
# preview the data
data

[{'product_url': 'https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html',
  'page_url': 'https://books.toscrape.com/catalogue/category/books_1/',
  'timestamp': 1676484189},
 {'product_url': 'https://books.toscrape.com/catalogue/tipping-the-velvet_999/index.html',
  'page_url': 'https://books.toscrape.com/catalogue/category/books_1/',
  'timestamp': 1676484189},
 {'product_url': 'https://books.toscrape.com/catalogue/soumission_998/index.html',
  'page_url': 'https://books.toscrape.com/catalogue/category/books_1/',
  'timestamp': 1676484189},
 {'product_url': 'https://books.toscrape.com/catalogue/sharp-objects_997/index.html',
  'page_url': 'https://books.toscrape.com/catalogue/category/books_1/',
  'timestamp': 1676484189},
 {'product_url': 'https://books.toscrape.com/catalogue/sapiens-a-brief-history-of-humankind_996/index.html',
  'page_url': 'https://books.toscrape.com/catalogue/category/books_1/',
  'timestamp': 1676484189},
 {'product_url': 'https://books.toscr

In [36]:
# store data in new-line separated JSON files

import json
f = open('seeds.json','w',encoding = 'utf-8')
for item in data:
        f.write(json.dumps(item))
        f.write('\n')
f.close()

__Exercise 3.3__

Now, let's write some code that loads `seeds.json`, and visits each of the websites to extract the product title and price. Remember to build in a little timer (e.g., waiting for 1 second). The prototype/starting code below stops automatically after 5 iterations to minimize server load. Try removing the prototyping condition using the comment character `#` when you think you're done!


In [37]:
# start from the code below
import time # we need the time package for implementing a bit of waiting time

content = open('seeds.json', 'r').readlines() # let's read in the seed data

counter = 0 # initialize counter to 0

# loop through all lines of the JSON file
for line in content:
    # increment counter and check whether prototyping condition is met
    counter = counter + 1
    if counter>5: break # deactivate this if you want to loop through the entire file
        
    # convert loaded data to JSON object/dictionary for querying
    obj = json.loads(line)
    
    # show URL for which product information needs to be captured
    print(obj['product_url'])
    
    # eventually sleep for a second
    time.sleep(1)
    

https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html
https://books.toscrape.com/catalogue/tipping-the-velvet_999/index.html
https://books.toscrape.com/catalogue/soumission_998/index.html
https://books.toscrape.com/catalogue/sharp-objects_997/index.html
https://books.toscrape.com/catalogue/sapiens-a-brief-history-of-humankind_996/index.html


<div class="alert alert-block alert-info"><b>Tips</b>
    <br>
    <ul>
        <li>
            Use the function <code>parse_website</code> from exercise 1.6 in the "webdata for dummies" tutorial and remove the file saving part.
        </li>
 
</div>


__Solution__

In [38]:
# Paste the parse_website() function here from an earlier tutorial. Remember also using the import statements!
import requests
from bs4 import BeautifulSoup

def parse_website(url):
    header = {'User-agent': 'Mozilla/5.0'} # with the user agent, we let Python know for which browser version to retrieve the website
    request = requests.get(url, headers = header)
    request.encoding = request.apparent_encoding # set encoding to UTF-8
    source_code = request.text

    # make information "extractable" using BeautifulSoup
    soup = BeautifulSoup(source_code)
    
    # title
    title = soup.find('h1').get_text()
    price = soup.find(class_='price_color').get_text()
    instock = soup.find(class_='instock availability').get_text().strip()
    stars = soup.find(class_='star-rating').attrs['class'][1]

    data = {'title': title,
            'price': price,
            'instock': instock,
            'stars': stars}
    
    return(data)

In [39]:
# test whether the function works (I just randomly picked a book)
parse_website('https://books.toscrape.com/catalogue/set-me-free_988/index.html')

{'title': 'Set Me Free',
 'price': '£17.46',
 'instock': 'In stock (19 available)',
 'stars': 'Five'}

In [40]:
# now start from the code above and "use" the function

# start from the code below
import time # we need the time package for implementing a bit of waiting time

content = open('seeds.json', 'r').readlines() # let's read in the seed data

counter = 0 # initialize counter to 0

# loop through all lines of the JSON file
for line in content:
    # increment counter and check whether prototyping condition is met
    counter = counter + 1
    if counter>5: break # deactivate this if you want to loop through the entire file
        
    # convert loaded data to JSON object/dictionary for querying
    obj = json.loads(line)
    
    # show URL for which product information needs to be captured
    url = obj['product_url']
    print(f'Retrieving data for {url}.')
    
    retrieved_data = parse_website(url)
    retrieved_data['timestamp_retrieval'] = int(time.time())
    # store data
    f = open('book_data.json', 'a', encoding = 'utf-8')
    f.write(json.dumps(retrieved_data))
    f.write('\n')
    f.close() 
    
    # eventually sleep for a second
    time.sleep(1)
 

Retrieving data for https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html.
Retrieving data for https://books.toscrape.com/catalogue/tipping-the-velvet_999/index.html.
Retrieving data for https://books.toscrape.com/catalogue/soumission_998/index.html.
Retrieving data for https://books.toscrape.com/catalogue/sharp-objects_997/index.html.
Retrieving data for https://books.toscrape.com/catalogue/sapiens-a-brief-history-of-humankind_996/index.html.


In [41]:
# inspect data in pandas
import pandas as pd
pd.read_json('book_data.json', lines=True)

Unnamed: 0,title,price,instock,stars,timestamp_retrieval
0,A Light in the Attic,£51.77,In stock (22 available),Three,2023-02-15 08:14:16
1,Tipping the Velvet,£53.74,In stock (20 available),One,2023-02-15 08:14:17
2,Soumission,£50.10,In stock (20 available),One,2023-02-15 08:14:19
3,Sharp Objects,£47.82,In stock (20 available),Four,2023-02-15 08:14:20
4,Sapiens: A Brief History of Humankind,£54.23,In stock (20 available),Five,2023-02-15 08:14:22
5,A Light in the Attic,£51.77,In stock (22 available),Three,2023-02-15 18:03:18
6,Tipping the Velvet,£53.74,In stock (20 available),One,2023-02-15 18:03:20
7,Soumission,£50.10,In stock (20 available),One,2023-02-15 18:03:21
8,Sharp Objects,£47.82,In stock (20 available),Four,2023-02-15 18:03:23
9,Sapiens: A Brief History of Humankind,£54.23,In stock (20 available),Five,2023-02-15 18:03:25


### 3.3 Summary

At the beginning of this tutorial, we set out the promise of writing multi-page scrapers from start to finish. Although the examples we have studied are relatively simple, the same principles (seed definition, data extraction plan, page-level data collection) apply to any other website you'd like to scrape. 

But... then, there are more *advanced websites*, which we address next.

# 4. Scraping more advanced, dynamic websites

In previous tutorials, you have used the `requests` library to retrieve web data. For example, re-run the following code.



In [42]:
import requests
from bs4 import BeautifulSoup

header = {'User-agent': 'Mozilla/5.0'}
request = requests.get('https://books.toscrape.com/catalogue/sharp-objects_997/index.html', headers = header)
request.encoding = request.apparent_encoding
source_code = request.text

# save website 
f=open('simple_website.html','w',encoding='utf-8')
f.write(source_code)
f.close()

# parse some information
soup=BeautifulSoup(source_code)
soup.find('h1')

<h1>Sharp Objects</h1>

This works well for relatively simple websites, but... try the same for the homepage of Twitch!

In [43]:
request = requests.get('https://www.twitch.tv/', headers = header)
request.encoding = request.apparent_encoding
source_code = request.text
soup=BeautifulSoup(source_code)

# save website 
f=open('advanced_website.html','w',encoding='utf-8')
f.write(source_code)
f.close()

When trying to open `advanced_website.html` in your browser, you quickly realize there is a problem. You can't see what's on the website when you manually open it using the URL. This mainly has to do with how advanced a website is: in the case of Twitch, you'd encounter quite a dynamic site with a video player, previews, real-time updates on the number of streams, etc. The normal request library isn't just able to handle it. 

So, we're resorting to an alternative way to retrieve data, using `selenium`.

## 4.1 Making a connection to a website using Selenium

<div class="alert alert-block alert-warning"><b>Installing Selenium and Chromedriver</b> 

To install Selenium and Chromedriver locally, please follow the <a href="https://tilburgsciencehub.com/configure/python-for-scraping/?utm_campaign=referral-short">Tutorial on Tilburg Science Hub</a>.
    
You can also use the code snippet below to automate the installation. Running this snippet takes a little longer each time, but the benefit is that it almost always works!
</div>


In [44]:
# Installing and starting up Chrome using Webdriver Manager
!pip install webdriver_manager
!pip install selenium

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from webdriver_manager.chrome import ChromeDriverManager

# Opening the Twitch site
driver = webdriver.Chrome(ChromeDriverManager().install())

url = "https://twitch.tv/"
driver.get(url)

If everything went smooth, your computer opened a new Chrome window, and opened `twitch.tv`. 

<div class="alert alert-block alert-info"><b>Using Google Colab</b> 

If you're using Google Colab, you don't see your browser open up manually.
    
Whenever you switch pages, just manually open that page in your browser. Although this feels like a little less interactive, you will still be able to work through this tutorial!

</div>

From now onwards, you can use `driver.get('https://google.com')` to point to different websites (i.e., you don't need to install it over and over again, unless you open up a new instance of Jupyter Notebook).

## 4.2 Using BeautifulSoup with Selenium


We can now also try to extract information. Note that we're converting the source code of the site to a `BeautifulSoup` object (because you may have learnt how to use `BeautifulSoup` earlier).

In [45]:
# we also need the time package to wait a few seconds until the page is loaded
import time
url = "https://twitch.tv/"
driver.get(url)
time.sleep(3)

Rather than using the "source code" obtained with the `requests` library, we can now convert the source code of the Selenium website to a BeautifulSoup object.

In [46]:
soup=BeautifulSoup(driver.page_source)

...and start experimenting with querying the site, such as retrieving the titles of the currently active streams.

In [47]:
streams = soup.find_all('a', attrs = {'data-test-selector':"TitleAndChannel"})

# print a list of stream names
counter = 0
for stream in streams:
    counter = counter + 1
    print('Stream ' + str(counter) + ': ' + stream.get_text())


Stream 1: 🔴CLICK HERE🔴CLICK NOW🔴CLICKY CLICKY🔴NEWS BIG🔴DRAMA MEGA🔴NO VALENTINE ANDY🔴LONELY CERTIFIED CONTENT🔴BASEMENT WARLORD🔴#1 GOBLIN🔴xQc
Stream 2: HIGLIGHTS: G2 Esports vs Heroic - IEM Katowice 2023 - Grand FinalESL_CSGO
Stream 3: VCT LOCK//IN  TH vs. EG— Alpha Bracket Day 3VALORANT
Stream 4: PSN: AuzioMF - 86+ MIXED CAMPAIGN PLAYER PICKS! 🔥 !prime @AuzioMFAuzioMF
Stream 5: ADEYEMI'S ARMYdannyaarons
Stream 6: #ANALYSE 5 AVEC SLIPIXotplol_
Stream 7: [DROPS] Annie Huffley Hufflepuff playthrough - HARD mode - 100% challenges complete, 92% trophies !nordvpnAnnieFuchsia
Stream 8: freelancingsips_
Stream 9: [DROPS] CAN I GET A HOYYAHHHHHHHHH!!! hogwarts later <3 !discord !skillsharelydiaviolet
Stream 10: ❌Donaton Stream❌ !Donaton !TSamSaberi
Stream 11: DON'T HAVE MUCH TIME BUT I WANT TO STREAMFoolish_Gamers
Stream 12: 🦐 WHOLESOME ORCA IS BACK! AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA ~ BLO'HOLE BLAST FLAVOR RELEASE → !gg 《VTuber》!socials !gg !merchShylily
Stream 13: i r wizard | disc

Wow - this is cool. You've just learnt a second way to open websites using `selenium`. The benefit of `selenium` is that you can work with highly dynamic websites (which also helps you to not getting blocked). The drawback is that `selenium` is slower than just using the `requests` library, and it may sometimes be buggy on computers without a screen (which matters when you scale up your data collection.

<div class="alert alert-block alert-info"><b>Awesome stuff with Selenium</b> 

Selenium is your best shot at navigating a dynamic website. It can do amazing things, such as 
    
<ul>
    <li>"clicking" on buttons</li>
    <li>scrolling through a site</li>
    <li>hovering over items and capturing information from popups,</li>
    <li>starting to play a stream,</li>
    <li>typing text and submitting it in the chat, and</li>
    <li>so much more...!</li>
</ul>
    
Note though that we won't cover the advanced functionality of Selenium in this tutorial, but the optional "Web data advanced" tutorial holds the necessary information.
   
</div>



__Exercise 4.1__

Please write code snippets to extract the following pieces of information. Do you choose `requests` or `selenium`?

1. The titles of all `<h2>` tags from `https://odcm.hannesdatta.com/docs/course/`
2. The titles of all available TV series from `https://www.bol.com/nl/nl/l/series/3133/30291/` (about 24)

```
soup.find_all('a', class_='product-title')
```


We also need the time package to wait a few seconds until the page is loaded.

```
import time
url = "https://twitch.tv/" # some example URL
driver.get(url)
time.sleep(3)
```

In [48]:
# write your solution here

In [49]:
# Solution to question 1:
header = {'User-agent': 'Mozilla/5.0'} # with the user agent, we let Python know for which browser version to retrieve the website
request = requests.get('https://odcm.hannesdatta.com/docs/course/', headers = header)
request.encoding = request.apparent_encoding # set encoding to UTF-8
soup = BeautifulSoup(request.text)
for title in soup.find_all('h2'): print(title.get_text())

Instructor
Course description
Prerequisites
Teaching format
Assessment
Code of Conduct
Structure of the course
More links


In [50]:
# Solution to question 2:
driver.get('https://www.bol.com/nl/nl/l/series/3133/30291/')
time.sleep(3)
soup = BeautifulSoup(driver.page_source)

In [51]:
urls = []
for url in soup.find_all('a', class_='product-title'):
    urls.append(url.attrs['href'])
urls

['/nl/nl/p/midsomer-murders-seizoen-19-deel-2/9200000119833762/',
 '/nl/nl/p/midsomer-murders-seizoen-17/9200000132010294/',
 '/nl/nl/p/ncis-seizoen-19/9300000135569426/',
 '/nl/nl/p/fawlty-towers/9300000087454356/',
 '/nl/nl/p/sisi-seizoen-2/9300000139818897/',
 '/nl/nl/p/chicago-fire-seizoen-10/9300000123634169/',
 '/nl/nl/p/flikken-maastricht-seizoen-16/9300000096688928/',
 '/nl/nl/p/star-trek-discovery-seizoen-4/9300000127973053/',
 '/nl/nl/p/house-of-the-dragon-seizoen-1/9300000127606162/',
 '/nl/nl/p/game-of-thrones-seizoen-1-8/9300000045366024/',
 '/nl/nl/p/ncis-los-angeles-s12/9300000058801046/',
 '/nl/nl/p/star-trek-picard-seizoen-2/9300000123707493/',
 '/nl/nl/p/nachtwacht-het-donkere-spiegelbeeld/9300000128499338/',
 '/nl/nl/p/midsomer-murders-seizoen-12-deel-2/9200000132010284/',
 '/nl/nl/p/midsomer-murders-seizoen-18-deel-1/9200000132010326/',
 '/nl/nl/p/columbo-complete-collection/9200000096426621/',
 '/nl/nl/p/outlander-seizoen-6-blu-ray-import-met-nl-ondertiteling/93000

### 4.3 Using interactive elements (e.g., by clicking buttons)

__Importance__

For more dynamic websites, we may have to click on certain elements (rather than extracting some URL).

<div class="alert alert-block alert-info"><b>Extracting elements using Selenium, not BeautifulSoup</b> 

Selenium is really great for navigating dynamic website. There are two ways in which you can use it for querying sites:
    
<ul>
    <li>put the "selenium" source code (<code>driver.page_source</code>) to BeautifulSoup, and then use BeautifulSoup commands, or </li>
    <li>directly use selenium (and it's own query language) to extract elements.</li>
</ul>
    
In the next few examples, we are using selenium's "internal" query language (which you identify easily because it is a subfunction of the `driver` object, and because it has a different name (`find_element`, instead of `find` or `find_all`).
    
Want to know more about selenium's built-in query language? Check out the "Advanced Web Scraping Tutorial", or dig up some extra material from the web. Knowing both BeautifulSoup and Selenium makes you most productive!
  
</div>

__Try it out__

If you haven't done so, rerun the installation code for `selenium` from above. Then, proceed by running the following cell and observe what happens in your browser.


In [52]:
driver.get('https://books.toscrape.com/catalogue/category/books_1/')

After a few seconds, your browser will have loaded the website in Chrome. Now, run the next cells.

In [53]:
# Step 1: Let's try location the element
from selenium.webdriver.common.by import By
driver.find_element(By.CLASS_NAME, 'next')

<selenium.webdriver.remote.webelement.WebElement (session="381144fe48efd393c0dbb5cb4d5a4689", element="ecbeb367-1848-4d47-bdba-e7a98ed9578e")>

In [54]:
# Step 2: Finding the link within the `next` class
driver.find_element(By.CLASS_NAME, 'next').find_element(By.TAG_NAME, 'a')

<selenium.webdriver.remote.webelement.WebElement (session="381144fe48efd393c0dbb5cb4d5a4689", element="5f555662-5379-47ab-9a2a-e6e76c2ea298")>

In [55]:
# Step 3: Clicking the link!
driver.find_element(By.CLASS_NAME, 'next').find_element(By.TAG_NAME, 'a').click()

Boom! In step 3, we finally clicked on the link. Just try rerunning this cell with step 3 over and over again. Does iterating through the pages work?!

__Exercise 4.2__

Iterate through the entire set of pages, until there are no new pages left. This time, use `selenium` and click on the next page button. You can start on page 47 (`https://books.toscrape.com/catalogue/category/books_1/page-47.html`) to speed up this exercise a bit.

Make use of the `time.sleep(2)` function to make the code wait a bit after each page load.


__Solution__

In [57]:
import time
urls = []
driver.get('https://books.toscrape.com/catalogue/category/books_1/page-47.html')
time.sleep(1)

while True:
    try:
        driver.find_element(By.CLASS_NAME, 'next').find_element(By.TAG_NAME, 'a').click()
        time.sleep(1)
    except:
        break
urls

[]

## After-class exercises

### Exercise 1

Extending the code written for exercise 3.2 in "Web data 101", please collect seeds from ten self-chosen product categories and store them in a file called `all_seeds.json`.

### Exercise 2

Please use the code written in exercise 3.3 in "Web Data 101" and extend it so capture more information (e.g., not only title and price, but also as other attributes/data points you are interested in. In particular, try getting the product description!

Try running your code and store the product data in a JSON dictionary called `all_books.json`.

### Exercise 3

Please complete an entire data collection project in a `.py` file, capturing data for 10 product categories and all products contained on all of the pages. You can proceed in two steps: first collect the seeds, then obtain all data. In addition, parse all retrieved data to a CSV file (with rows and columns), using `pd.read_json(filename, lines = True)` for reading in the JSON data, and `pd.to_csv(filename)` for saving the data in tabular format.

Run your data collection from the terminal.

The final deliverable is
- `all_seeds.json`
- `all_books.json`
- `all_books.csv`




## Backup: Executing Python Files

### Jupyter Notebooks versus editors such as Visual Studio Code, PyCharm, or Spyder

Jupyter Notebooks are ideal for combining programming and markdown (e.g., text, plots, equations), making it the default choice for sharing and presenting reproducible data analyses. Since we can execute code blocks one by one, it's suitable for developing and debugging code on the fly. 

That said, Jupyter Notebooks also have some severe limitations when using them in production environments. That's where an "Integrated Development Environment" (IDE) comes in, such as Visual Studio Code or PyCharm. Let's revisit the most important differences.

First, the order in which you run cells within a notebook may affect the results. While prototyping, you may lose sight of the top-down hierarchy, which can cause problems once you restart the kernel (e.g., a library is imported after it is being used). Second, there is no easy way to browse through directories and files within a Jupyter Notebook. Third, notebooks cannot handle large codebases nor big data remarkably well. 

That's why we recommend starting in Jupyter Notebooks, moving code into functions along the way, and once all seems to be running well, save your Jupyter Notebook as a `.py` file and continue working with it in Visual Studio Code.

Below, we introduce you to the IDE (here, Spyder, but VS Code looks very similar), and show you how to run Python files from the command line. 

### Introduction to Spyder
The first time you need to click on the green "Install" button in Anaconda Navigator, after which you start Spyder by clicking on the blue "Launch" button (alternatively, type `spyder` in the terminal). 

<img src="https://raw.githubusercontent.com/hannesdatta/course-odcm/master/content/docs/tutorials/webscraping101/images/anaconda_navigator.png" width=90% align="left" style="border: 1px solid black" />


The main interface consists of three panels: 
1. **Code editor** = where you write Python code (i.e., the content of code cells in a notebook)
2. **Variable / files** = depending on which tab you choose either an overview of all declared variables (e.g. look up their type or change their values) or a file explorer (e.g., to open other Python files)
3. **Console** = the output of running the Python script from the code editor (what normally appears below each cell in a notebook)

<img src="https://raw.githubusercontent.com/hannesdatta/course-odcm/master/content/docs/tutorials/webscraping101/images/spyder.png" width=90% align="left" style="border: 1px solid black" />

**Let's try it out!**     
Copy the solution from exercise 3.3 to a new file, called `webscraping_101.py`. To run the script you can

- click on the green play button to run all code, or
- highlight the parts of the script you want to execute and then click the run selection button.

<img src="https://raw.githubusercontent.com/hannesdatta/course-odcm/master/content/docs/tutorials/webscraping101/images/toolbar.png" width=40% align="left" style="border: 1px solid black" />

Once the script is running, you may need to interrupt the execution because it is simply taking too long or you spotted a bug somewhere. Click on the red rectangular in the console to stop the execution. 

<img src="https://raw.githubusercontent.com/hannesdatta/course-odcm/master/content/docs/tutorials/webscraping101/images/interrupt.gif" width=80% align="left" style="border: 1px solid black" />

### Run Python Files 

__For Mac and Linux users__

1. Open the terminal and navigate to the folder in which the `.py` file has been saved (use `cd` to change directories and `ls` to list all files).
2. Run the Python script by typing `python <FILENAME.py>` (e.g., `python webscraping_101.py`).

<img src="https://raw.githubusercontent.com/hannesdatta/course-odcm/master/content/docs/tutorials/webscraping101/images/running_python.gif" width=60% align="left" style="border: 1px solid black" />

__For Windows users__

1. Open Windows explorer and navigate to the folder in which the `.py` file has been saved. Type `cmd` to open the command prompt. Alternatively, open the command prompt from the start menu (and use `cd` to change directories and `dir` to list files).
2. Activate Anaconda by typing `conda activate`.
3. Run the Python script by typing `python <FILENAME.py>` (e.g., `python webscraping_101.py`).