# Chapter 13 - Web Scraping

## Notes

### Project 6: Run a Program with the webbrowser Module

Let’s learn about Python’s webbrowser module by using it in a programming project. The webbrowser module’s open() function can launch a new browser to a specified URL. Enter the following into the interactive shell:  

```
>>> import webbrowser
>>> webbrowser.open('https://inventwithpython.com/')
```

A web browser tab will open to the URL https://inventwithpython.com. This is about the only thing the webbrowser module can do. Even so, the open() function does make some interesting things possible.

For example, it’s tedious to copy a street address to the clipboard every time you’d like to bring up a map of it on OpenStreetMap. You could eliminate a few steps from this task by writing a simple script to automatically launch the map in your browser using the contents of your clipboard. This way, you’d only have to copy the address to the clipboard and run the script for the map to load. Because the address can go directly into the OpenStreetMap URL, all we need is the webbrowser.open() function.

This is what your program does:

- Gets a street address from the command line arguments or clipboard
- Opens the web browser to the OpenStreetMap page for that address

This means your code needs to do the following:

- Read the command line arguments from sys.argv.
- Read the clipboard contents.
- Call the webbrowser.open() function to open the web browser.

Open a new file editor tab and save it as showmap.py.

#### Step 1: Figure Out the URL
By following the instructions in Chapter 12, set up a showmap.py file so that when you run it from the command line, like so

`C:\Users\al> showmap 777 Valencia St, San Francisco, CA 94110`  
the script will use the command line arguments instead of the clipboard. If there are no command line arguments, then the program will know to use the contents of the clipboard.

To do so, you need to figure out what URL to use for a given street address. When you load https://www.openstreetmap.org in the browser and search for an address, the URL in the address bar looks something like this: https://www.openstreetmap.org/search?query=777%20Valencia%20St%2C%20San%20Francisco%2C%20CA%2094110#map=19/37.75897/-122.42142.

We can test that the URL doesn’t need the #map part by taking it out of the address bar and visiting that site to confirm it still loads properly. So, your program can be set to open a web browser to https://www.openstreetmap.org/search?query=<your_address_string> (where <your_address_string> is the address you want to map). Note that your browser automatically handles any necessary URL encoding, such as converting space characters in the URL to %20.
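If you would rather not rely on the browser to encode the address, the standard library can do it before the URL is opened. The following is a minimal sketch, not part of showmap.py: the helper name `build_map_url` is hypothetical, and it assumes the `search?query=` URL format described above.

```python
from urllib.parse import quote

OSM_SEARCH_URL = 'https://www.openstreetmap.org/search?query='

def build_map_url(address):
    """Return the OpenStreetMap search URL for a street address."""
    # quote() percent-encodes spaces as %20 and commas as %2C.
    return OSM_SEARCH_URL + quote(address, safe='')

print(build_map_url('777 Valencia St, San Francisco, CA 94110'))
```

The printed URL matches the one from the address bar with the #map part removed.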

#### Step 2: Handle the Command Line Arguments
Make your code look like this:
```
# showmap.py - Launches a map in the browser using an address from the
# command line or clipboard

import webbrowser, sys
if len(sys.argv) > 1:
    # Get address from command line.
    address = ' '.join(sys.argv[1:])

# TODO: Get address from clipboard.

# TODO: Open the web browser.
```

First, you need to import the webbrowser module for launching the browser and the sys module for reading the potential command line arguments. The sys.argv variable stores the program’s filename and command line arguments as a list. If this list has more than just the filename in it, then len(sys.argv) evaluates to an integer greater than 1, meaning that command line arguments have indeed been provided.

Command line arguments are usually separated by spaces, but in this case, you’ll want to interpret all of the arguments as a single string. Because sys.argv is a list of strings, you can pass it to the join() method, which returns a single string value. You don’t want the program name in this string, so you should pass sys.argv[1:] instead of sys.argv to chop off the first element of the list. The final string that this expression evaluates to is stored in the address variable.

If you run the program by entering this into the command line

`showmap 777 Valencia St, San Francisco, CA 94110`

the `sys.argv` variable will contain this list value:

`['showmap.py', '777', 'Valencia', 'St,', 'San', 'Francisco,', 'CA', '94110']`


After you’ve joined `sys.argv[1:]` with a space character, the address variable will contain the string `'777 Valencia St, San Francisco, CA 94110'`.
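You can check the joining step without touching the command line. This sketch uses a hand-written list as a stand-in for `sys.argv`; the function name `address_from_argv` is hypothetical.

```python
def address_from_argv(argv):
    """Join every argument after the script name into one address string."""
    return ' '.join(argv[1:])

# A stand-in for sys.argv as the shell would split the command line.
example_argv = ['showmap.py', '777', 'Valencia', 'St,', 'San',
                'Francisco,', 'CA', '94110']
print(address_from_argv(example_argv))
```

Note that the commas stay attached to the words before them, because the shell splits only on spaces.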

#### Step 3: Retrieve the Clipboard Content
To fetch the address from the clipboard, make your code look like the following:
```
# showmap.py - Launches a map in the browser using an address from the
# command line or clipboard

import webbrowser, sys, pyperclip
if len(sys.argv) > 1:
    # Get address from command line.
    address = ' '.join(sys.argv[1:])
else:
    # Get address from clipboard.
    address = pyperclip.paste()

# Open the web browser.
webbrowser.open('https://www.openstreetmap.org/search?query=' + address)
```
If there are no command line arguments, the program will assume the address is stored on the clipboard. You can get the clipboard content with `pyperclip.paste()` and store it in a variable named address. Finally, to launch a web browser with the OpenStreetMap URL, call `webbrowser.open()`.

While some of the programs you write will perform huge tasks that save you hours, it can be just as satisfying to use a program that conveniently saves you a few seconds each time you perform a common task, such as getting a map of an address. Table 13-1 compares the steps needed to display a map with and without showmap.py.

We’re fortunate that the OpenStreetMap website doesn’t require any interaction to get a map; we can just put the address information directly into the URL. The showmap.py script makes this task less tedious, especially if you do it frequently.

### A Review of File Downloading and Saving

To review, here’s the complete process for downloading and saving a file:

Call `requests.get()` to download the file.  
Call `open()` with `'wb'` to create a new file in write binary mode.  
Loop over the Response object’s `iter_content()` method.  
Call `write()` on each iteration to write the content to the file.  
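The write loop in the last two steps can be tried without a network connection. In this sketch, the hypothetical helper `save_chunks` accepts any iterable of bytes objects; in a real download you would pass it the return value of `iter_content()` called on the Response object.

```python
import os
import tempfile

def save_chunks(chunks, filename):
    """Write an iterable of bytes chunks to filename in binary mode."""
    total = 0
    with open(filename, 'wb') as out_file:  # 'wb' is write binary mode.
        for chunk in chunks:
            total += out_file.write(chunk)
    return total  # Number of bytes written

# Fake chunks standing in for the values iter_content() would yield.
fake_chunks = [b'Hello, ', b'world!']
demo_path = os.path.join(tempfile.mkdtemp(), 'demo.bin')
bytes_written = save_chunks(fake_chunks, demo_path)
print(bytes_written)
```

Using a `with` statement closes the file automatically, even if one of the writes fails.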

### Project 7: Open All Search Results
When I look up a topic on a search engine, I don’t look at just one search result at a time. By middle-clicking a search result link (or clicking it while holding CTRL), I open the first several links in a bunch of new tabs to read later. I search the internet often enough that this workflow—opening my browser, searching for a topic, and middle-clicking several links one by one—is tedious. It would be nice if I could simply enter a term on the command line and have my computer automatically open the top search results in new browser tabs.

Let’s write a script to do this for the search results page of the Python Package Index at https://pypi.org. You could adapt a program like this to many other websites, although Google, DuckDuckGo, Amazon, and other large websites often employ measures that make scraping their search results pages difficult.

This is what the program should do:

- Get search keywords from the command line arguments  
- Retrieve the search results page  
- Open a browser tab for each result  

This means your code needs to do the following:  

- Read the command line arguments from sys.argv.
- Fetch the search results page with the requests module.
- Find the links to each search result.
- Call the webbrowser.open() function to open the web browser.

Open a new file editor tab and save it as searchpypi.py.

#### Step 1: Get the Search Page
Before writing code, you first need to know the URL of the search results page. By looking at the browser’s address bar after doing a search, you can see that the results page has a URL that looks like this: https://pypi.org/search/?q=<SEARCH_TERM_HERE>. The requests module can download this page; then, you can use Beautiful Soup to find the search result links in the HTML. Finally, you’ll use the webbrowser module to open those links in browser tabs.

Make your code look like the following:
```
# searchpypi.py - Opens several search results on pypi.org

import requests, sys, webbrowser, bs4

print('Searching...')  # Display text while downloading the search results page.
res = requests.get('https://pypi.org/search/?q=' + ' '.join(sys.argv[1:]))
res.raise_for_status()

# TODO: Retrieve top search result links.

# TODO: Open a browser tab for each result.
```
The user will specify the search terms as command line arguments when launching the program, and the code stores these arguments as strings in a list in `sys.argv`.

#### Step 2: Find All Results
Now you need to use Beautiful Soup to extract the top search result links from your downloaded HTML. But how do you figure out the right selector for the job? For example, you can’t just search for all `<a>` tags, because there are lots of links you don’t care about in the HTML. Instead, you must inspect the search results page with the browser’s Developer Tools to try to find a selector that will pick out only the links you want.

After doing a search for pyautogui, you can open the browser’s Developer Tools and inspect some of the link elements on the page. They can look complicated, with pages of markup like this: `<a class="package-snippet" href="/project/pyautogui">`. That complexity doesn’t matter; you just need to find the pattern that all the search result links share.

Make your code look like the following:
```
# searchpypi.py - Opens several search results on pypi.org
import requests, sys, webbrowser, bs4
--snip--
# Retrieve top search result links.
soup = bs4.BeautifulSoup(res.text, 'html.parser')
# Open a browser tab for each result.
link_elems = soup.select('.package-snippet')
```
If you look at the `<a>` elements, you’ll see that the search result links all have `class="package-snippet"`. Looking through the rest of the HTML source, it looks like the package-snippet class is used only for search result links. You don’t have to know what the CSS class package-snippet is or what it does. You’re just going to use it as a marker for the `<a>` element you’re looking for.

You can create a BeautifulSoup object from the downloaded page’s HTML text and then use the selector '.package-snippet' to find all elements that have the package-snippet CSS class, which here are the search result `<a>` elements. Note that if the PyPI website changes its layout, you may need to update this program with a new CSS selector string to pass to `soup.select()`. The rest of the program should continue to work unchanged.
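To see which elements a class selector like `'.package-snippet'` picks out without installing anything, here is a sketch that swaps in the standard library’s `html.parser.HTMLParser` for Beautiful Soup. The sample HTML, the `pyautogui-helper` project name, and the class name `SnippetLinkParser` are made up for illustration; the project itself uses `soup.select()`.

```python
from html.parser import HTMLParser

class SnippetLinkParser(HTMLParser):
    """Collect href values from tags whose class list has package-snippet."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get('class') or '').split()
        if 'package-snippet' in classes and 'href' in attrs:
            self.hrefs.append(attrs['href'])

# Made-up HTML: one link to ignore and two search result links.
SAMPLE_HTML = '''
<a href="/help/">Help</a>
<a class="package-snippet" href="/project/pyautogui/">PyAutoGUI</a>
<a class="package-snippet" href="/project/pyautogui-helper/">Helper</a>
'''

parser = SnippetLinkParser()
parser.feed(SAMPLE_HTML)
snippet_hrefs = parser.hrefs
print(snippet_hrefs)
```

The Help link is skipped because it lacks the class, which is the same filtering the selector performs.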

#### Step 3: Open Web Browsers for Each Result
Finally, you must tell the program to open web browser tabs for the results. Add the following to the end of your program:
```
# searchpypi.py - Opens several search results on pypi.org
import requests, sys, webbrowser, bs4
--snip--
# Open a browser tab for each result.
link_elems = soup.select('.package-snippet')
num_open = min(5, len(link_elems))
for i in range(num_open):
    url_to_open = 'https://pypi.org' + link_elems[i].get('href')
    print('Opening', url_to_open)
    webbrowser.open(url_to_open)
```
By default, the program opens the first five search results in new tabs using the webbrowser module. However, the user may have searched for something that turned up fewer than five results. The `soup.select()` call returns a list of all the elements that matched your '.package-snippet' selector, so the number of tabs you want to open is either 5 or the length of this list (whichever is smaller).

The built-in Python function min() returns the smallest of the integer or float arguments it is passed. (There is also a built-in max() function that returns the largest argument it is passed.) You can use min() to find out whether there are fewer than five links in the list and store the number of links to open in a variable named num_open. Then, you can run through a for loop by calling `range(num_open)`.

On each iteration of the loop, the code uses webbrowser.open() to open a new tab in the web browser. Note that the href attribute’s value in the returned `<a>` elements doesn’t have the initial https://pypi.org part, so you have to concatenate that to the href attribute’s string value.
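The `min()` logic and the URL concatenation can be checked on plain strings before any browser tabs are involved. This is a sketch under assumptions: the function name `pick_urls` is hypothetical, and the href values are examples rather than real search results.

```python
def pick_urls(hrefs, limit=5, base='https://pypi.org'):
    """Return full URLs for at most `limit` of the relative hrefs."""
    num_open = min(limit, len(hrefs))
    return [base + hrefs[i] for i in range(num_open)]

few = ['/project/pyautogui/', '/project/pyperclip/']
many = [f'/project/example-{n}/' for n in range(8)]  # Made-up hrefs

print(pick_urls(few))         # Fewer than five results: all are kept.
print(len(pick_urls(many)))   # More than five results: capped at five.
```

Each returned URL could then be passed to `webbrowser.open()`.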

Now you can instantly open the first five PyPI search results for, say, boring stuff by running searchpypi boring stuff on the command line! See Chapter 12 for how to easily run programs on your operating system.

### Project 8: Download XKCD Comics

Blogs, web comics, and other regularly updating websites usually have a front page with the most recent post, as well as a Previous button on the page that takes you to the previous post. That post will also have a Previous button, and so on, creating a trail from the most recent page to the first post on the site. If you wanted a copy of the site’s content to read when you’re not online, you could manually navigate over every page and save each one. But this is pretty boring work, so let’s write a program to do it instead.

XKCD, shown in Figure 13-5, is a popular geek webcomic with a website that fits this structure. The front page at https://xkcd.com has a Prev button that guides the user back through prior comics. Downloading each comic by hand would take forever, but you can write a script to do this in a couple of minutes.

Here’s what your program should do:

- Load the XKCD home page.
- Save the comic image on that page.
- Follow the Previous Comic link.
- Repeat until it reaches the first comic or the max download limit.


This means your code will need to do the following:

- Download pages with the requests module.
- Find the URL of the comic image for a page using Beautiful Soup.
- Download and save the comic image to the hard drive with iter_content().
- Find the URL of the Previous Comic link, and repeat.

Open a new file editor tab and save it as downloadXkcdComics.py.

#### Step 1: Design the Program
If you open the browser’s Developer Tools and inspect the elements on the page, you should find the following to be true:

- The `src` attribute of an `<img>` element stores the URL of the comic’s image file.
- The `<img>` element is inside a `<div id="comic">` element.
- The Prev button has a rel HTML attribute with the value prev.
- The oldest comic’s Prev button links to the https://xkcd.com/# URL, indicating that there are no more previous pages.

To prevent the readers of this book from eating up too much of the XKCD website’s bandwidth, let’s limit the number of downloads we make to 10 by default. Make your code look like the following:
```
# downloadXkcdComics.py - Downloads XKCD comics

import requests, os, bs4, time

url = 'https://xkcd.com'  # Starting URL
os.makedirs('xkcd', exist_ok=True)  # Store comics in ./xkcd
num_downloads = 0
MAX_DOWNLOADS = 10
while not url.endswith('#') and num_downloads < MAX_DOWNLOADS:
    # TODO: Download the page.

    # TODO: Find the URL of the comic image.

    # TODO: Download the image.

    # TODO: Save the image to ./xkcd.

    # TODO: Get the Prev button's url.

print('Done.')
```
The program creates a url variable that starts with the value 'https://xkcd.com' and repeatedly updates it (in a while loop) with the URL of the current page’s Prev link. At every step in the loop, you’ll download the comic at url. The loop stops when url ends with '#' or you have downloaded MAX_DOWNLOADS comics.
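To see how the two stopping conditions interact, you can run the same loop against a made-up chain of Prev links instead of the real website. The dictionary `FAKE_PREV_LINKS` and the function `crawl` are hypothetical stand-ins; nothing is downloaded.

```python
# Made-up Prev links standing in for real pages; '#' marks the oldest comic.
FAKE_PREV_LINKS = {
    'https://xkcd.com': 'https://xkcd.com/3/',
    'https://xkcd.com/3/': 'https://xkcd.com/2/',
    'https://xkcd.com/2/': 'https://xkcd.com/1/',
    'https://xkcd.com/1/': 'https://xkcd.com/#',
}

def crawl(start_url, max_downloads):
    """Return the pages the loop would visit, newest first."""
    visited = []
    url = start_url
    num_downloads = 0
    while not url.endswith('#') and num_downloads < max_downloads:
        visited.append(url)          # The real program downloads here.
        url = FAKE_PREV_LINKS[url]   # Follow the Prev link.
        num_downloads += 1
    return visited

print(crawl('https://xkcd.com', 10))  # Stops at the oldest comic.
print(crawl('https://xkcd.com', 2))   # Stops at the download limit.
```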

You’ll download the image files to a folder in the current working directory named xkcd. The call `os.makedirs()` ensures that this folder exists, and the `exist_ok=True` keyword argument prevents the function from throwing an exception if this folder has already been created.
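The effect of `exist_ok=True` is easy to confirm in a scratch folder. This sketch creates the folder inside a temporary directory (an assumption made so the demo doesn’t touch your working directory) and then calls `os.makedirs()` again with and without the keyword argument.

```python
import os
import tempfile

base = tempfile.mkdtemp()               # Scratch folder for the demo
comics_dir = os.path.join(base, 'xkcd')

os.makedirs(comics_dir, exist_ok=True)  # Creates the folder.
os.makedirs(comics_dir, exist_ok=True)  # No error: it already exists.

try:
    os.makedirs(comics_dir)             # Without exist_ok, this raises.
    raised = False
except FileExistsError:
    raised = True

print(os.path.isdir(comics_dir), raised)
```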

#### Step 2: Download the Web Page
Let’s implement the code for downloading the page. Make your code look like the following:
```
# downloadXkcdComics.py - Downloads XKCD comics

import requests, os, bs4, time

url = 'https://xkcd.com'  # Starting URL
os.makedirs('xkcd', exist_ok=True)  # Store comics in ./xkcd
num_downloads = 0
MAX_DOWNLOADS = 10
while not url.endswith('#') and num_downloads < MAX_DOWNLOADS:
    # Download the page.
    print(f'Downloading page {url}...')
    res = requests.get(url)
    res.raise_for_status()

    soup = bs4.BeautifulSoup(res.text, 'html.parser')

    # TODO: Find the URL of the comic image.

    # TODO: Download the image.

    # TODO: Save the image to ./xkcd.

    # TODO: Get the Prev button's url.

print('Done.')
```
First, print url so that the user knows which URL the program is about to download; then, use the requests module’s `requests.get()` function to download it. As always, you should immediately call the Response object’s `raise_for_status()` method to throw an exception and end the program if something went wrong with the download. Otherwise, create a BeautifulSoup object from the text of the downloaded page.

#### Step 3: Find and Download the Comic Image
To download the comic on each page, make your code look like the following:
```
# downloadXkcdComics.py - Downloads XKCD comics

import requests, os, bs4, time

--snip--

    # Find the URL of the comic image.
    comic_elem = soup.select('#comic img')
    if comic_elem == []:
        print('Could not find comic image.')
    else:
        comic_URL = 'https:' + comic_elem[0].get('src')
        # Download the image.
        print(f'Downloading image {comic_URL}...')
        res = requests.get(comic_URL)
        res.raise_for_status()

    # TODO: Save the image to ./xkcd.

    # TODO: Get the Prev button's url.

print('Done.')
```
Because you inspected the XKCD home page with your Developer Tools, you know that the `<img>` element for the comic image is inside another element with the id attribute set to comic, so the selector `'#comic img'` will get you the correct `<img>` element from the BeautifulSoup object.

A few XKCD pages have special content that isn’t a simple image file. That’s fine; you’ll just skip those. If your selector doesn’t find any elements, `soup.select('#comic img')` will return an empty ResultSet object (a list-like value that is equal to `[]`). When that happens, the program can just print an error message and move on without downloading the image.

Otherwise, the selector will return a list containing one `<img>` element. You can get the src attribute from this `<img>` element and pass it to `requests.get()` to download the comic’s image file.
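The `'https:' + ...` concatenation in the code assumes the src value is protocol-relative (that is, it begins with `//`). As an alternative sketch, not the approach used in this project, `urllib.parse.urljoin()` handles that case as well as src values that are already absolute or that are relative to the site. The function name `image_url` and the `/static/logo.png` path are made up for illustration.

```python
from urllib.parse import urljoin

PAGE_URL = 'https://xkcd.com/1354/'

def image_url(src, page_url=PAGE_URL):
    """Resolve an <img> src value against the URL of the page it is on."""
    return urljoin(page_url, src)

# Protocol-relative src: the scheme is taken from the page URL.
print(image_url('//imgs.xkcd.com/comics/heartbleed_explanation.png'))
# Already absolute: returned unchanged.
print(image_url('https://imgs.xkcd.com/comics/nro.png'))
# Site-relative path (made up): resolved against the page's host.
print(image_url('/static/logo.png'))
```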

#### Step 4: Save the Image and Find the Previous Comic
At this point, the comic’s image file is stored in the res variable. You need to write this image data to a file on the hard drive. Make your code look like the following:
```
# downloadXkcdComics.py - Downloads XKCD comics

import requests, os, bs4, time

--snip--

        # Save the image to ./xkcd.
        image_file = open(os.path.join('xkcd', os.path.basename(comic_URL)), 'wb')
        for chunk in res.iter_content(100000):
            image_file.write(chunk)
        image_file.close()

    # Get the Prev button's URL.
    prev_link = soup.select('a[rel="prev"]')[0]
    url = 'https://xkcd.com' + prev_link.get('href')
    num_downloads += 1
    time.sleep(1)  # Pause so we don't hammer the web server.

print('Done.')
```
You’ll also need a filename for the local image file to pass to open(). The comic_URL will have a value like 'https://imgs.xkcd.com/comics/heartbleed_explanation.png', which you might have noticed looks a lot like a filepath. In fact, you can call `os.path.basename()` with comic_URL to return just the last part of the URL, `'heartbleed_explanation.png'`, and use this as the filename when saving the image to your hard drive. Join this name with the name of your xkcd folder using `os.path.join()` so that your program uses backslashes (\) on Windows and forward slashes (/) on macOS and Linux. Now that you finally have the filename, you can call open() to open a new file in 'wb' mode.
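The filename step can be tried on its own. This sketch wraps it in a hypothetical helper, `local_path_for`, and adds one assumption beyond the text: it runs the URL through `urllib.parse.urlsplit()` first so that any query string after the filename is dropped.

```python
import os
from urllib.parse import urlsplit

def local_path_for(image_url, folder='xkcd'):
    """Return the local path where the image at image_url would be saved."""
    # Keep only the path part of the URL, then take its last component.
    filename = os.path.basename(urlsplit(image_url).path)
    return os.path.join(folder, filename)

comic_url = 'https://imgs.xkcd.com/comics/heartbleed_explanation.png'
print(os.path.basename(comic_url))
print(local_path_for(comic_url))
```

The second line of output uses whichever path separator your operating system expects.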

Remember from earlier in this chapter that, to save files you’ve downloaded using requests, you need to loop over the return value of the `iter_content()` method. The code in the for loop writes chunks of the image data to the file. Then, the code closes the file, saving the image to your hard drive.

Afterward, the selector `'a[rel="prev"]'` identifies the `<a>` element with the rel attribute set to prev. You can use this `<a>` element’s href attribute to get the previous comic’s URL, which gets stored in url.

The last part of the loop’s code increments `num_downloads` by 1 so that it doesn’t download all of the comics by default. It also introduces a one-second pause with `time.sleep(1)` to prevent the script from “hammering” the site (that is, impolitely downloading comics as fast as possible, which may cause performance issues for other website visitors). Then, the while loop begins the entire download process again.

The output of this program will look like this:
```
Downloading page https://xkcd.com...
Downloading image https://imgs.xkcd.com/comics/phone_alarm.png...
Downloading page https://xkcd.com/1358/...
Downloading image https://imgs.xkcd.com/comics/nro.png...
Downloading page https://xkcd.com/1357/...
Downloading image https://imgs.xkcd.com/comics/free_speech.png...
Downloading page https://xkcd.com/1356/...
Downloading image https://imgs.xkcd.com/comics/orbital_mechanics.png...
Downloading page https://xkcd.com/1355/...
Downloading image https://imgs.xkcd.com/comics/airplane_message.png...
Downloading page https://xkcd.com/1354/...
Downloading image https://imgs.xkcd.com/comics/heartbleed_explanation.png...
--snip--
```
This project is a good example of a program that can automatically follow links to scrape large amounts of data from the web. You can learn about Beautiful Soup’s other features from its documentation at https://www.crummy.com/software/BeautifulSoup/bs4/doc/.

## Practice Questions

1. Briefly describe the differences between the webbrowser, requests, and bs4 modules.  
    **Answer:**  
    `webbrowser` - comes with Python and opens a browser to a specific page.  
    `requests` - downloads files and webpages from the internet.  
    `bs4` - parses HTML

2. What type of object is returned by requests.get()? How can you access the downloaded content as a string value?  
    **Answer:** A Response object is returned. You can access it with `response.text`.

3. What requests method checks that the download worked?  
    **Answer:** `response.raise_for_status()`.

4. How can you get the HTTP status code of a requests response?  
    **Answer:** `response.status_code`

5. How do you save a requests response to a file?  
    **Answer:** 

    ```
    import requests
    response = requests.get('https://automatetheboringstuff.com/files/rj.txt')
    response.raise_for_status()
    with open('./output/RomeoAndJuliet.txt', 'wb') as play_file:
        for chunk in response.iter_content(100000):
            play_file.write(chunk)
    ```

6. What two formats do most online APIs return their responses in?  
    **Answer:** JSON and XML

7. What is the keyboard shortcut for opening a browser’s Developer Tools?  
    **Answer:** F12

8. How can you view (in the Developer Tools) the HTML of a specific element on a web page?  
    **Answer:** Right-click any part of the web page and select `Inspect Element` from the context menu to bring up the HTML responsible for that part of the page. This will help you parse HTML for your web scraping programs.

9. What CSS selector string would find the element with an id attribute of main?  
    **Answer:** `soup.select('#main')`

10. What CSS selector string would find the elements with an id attribute of highlight?  
    **Answer:** `soup.select('#highlight')`

11. Say you have a Beautiful Soup Tag object stored in the variable spam for the element `<div>Hello, world!</div>`. How could you get a string 'Hello, world!' from the Tag object?  
    **Answer:** `spam.get_text()` (also available as `spam.getText()`)

12. How would you store all the attributes of a Beautiful Soup Tag object in a variable named link_elem?  
    **Answer:** `link_elem = elems[0].attrs`, where `elems[0]` is the Tag object

13. Running import selenium doesn’t work. How do you properly import Selenium?  
    **Answer:** `from selenium import webdriver`

14. What’s the difference between the `find_element()` and `find_elements()` methods in Selenium?  
    **Answer:** `find_element()` returns only the first element that matches the query, while `find_elements()` returns all the matches as a list of WebElement objects.

15. What methods do Selenium’s WebElement objects have for simulating mouse clicks and keyboard keys?  
    **Answer:** `click()` and `send_keys()`

16. In Playwright, what locator method call simulates pressing CTRL-A to select all the text on the page?  
    **Answer:** `page.locator('html').press('Control+A')`

17. How can you simulate clicking a browser’s Forward, Back, and Refresh buttons with Selenium?  
    **Answer:** `browser.back()`, `browser.forward()`, `browser.refresh()`

18. How can you simulate clicking a browser’s Forward, Back, and Refresh buttons with Playwright?  
    **Answer:** `page.go_back()`, `page.go_forward()`, `page.reload()`

## Practice Programs

For practice, write programs to do the following tasks.

### Image Site Downloader
Write a program that goes to a photo-sharing site like Flickr or Imgur, searches for a category of photo, and then downloads all the resulting images. You could write a program that works with any photo site that has a search feature.

### 2048
The game 2048 is a simple game in which you combine tiles by sliding them up, down, left, or right with the arrow keys. You can actually get a fairly high score by sliding tiles in random directions. Write a program that will open the game at https://play2048.co and keep sending up, right, down, and left keystrokes to automatically play the game.

### Link Verification
Write a program that, given the URL of a web page, will find every `<a>` link on the page and test whether the linked URL results in a “404 Not Found” status code. The program should print out any broken links.