# Section 10. Python Web Scraping with Beautiful Soup

#### Instructor: Pierre Biscaye 

The objective of this notebook is to introduce you to extracting data from the web by scraping HTML with Python, using the Illinois General Assembly website as a case study. The content of this notebook is taken from UC Berkeley D-Lab's Python Web Scraping [course](https://github.com/dlab-berkeley/Python-Web-Scraping).

### Learning Objectives
1. Extract and parse HTML using Beautiful Soup
2. Understand the difference between tags, attributes, and attribute values
3. Apply these tools in the context of scraping information about the Illinois General Assembly
4. Practice scraping downloadable files from a website

# Introduction

When we'd like to access data from the web, we first check whether the website we are interested in offers a Web API. Platforms like Twitter, Reddit, and the New York Times offer APIs. In the previous notebook we introduced APIs, using the NY Times API as a case study.

However, there are often cases when a Web API does not exist. In these cases, we may have to resort to web scraping, where we extract the underlying HTML from a web page and directly obtain the information we want. There are several packages in Python we can use to accomplish these tasks. We'll focus on two packages: Requests and Beautiful Soup.

Our case study will be scraping information on the [state senators of Illinois](http://www.ilga.gov/senate), as well as the [list of bills](http://www.ilga.gov/senate/SenatorBills.asp?MemberID=1911&GA=98&Primary=True) each senator has sponsored. 

Before we get started, let's peruse these websites to take a look at their structure. 

**Question**: What do you observe, both about the structure of the web pages and about the structure of the URLs?

## Package installation and loading

We will use two main packages: [Requests](http://docs.python-requests.org/en/latest/user/quickstart/) and [Beautiful Soup](http://www.crummy.com/software/BeautifulSoup/bs4/doc/). Go ahead and install these packages, if you haven't already:

In [None]:
%pip install requests

In [None]:
%pip install beautifulsoup4

We'll also install the `lxml` package, which helps support some of the parsing that Beautiful Soup performs:

In [None]:
%pip install lxml

In [None]:
# Import required libraries
from bs4 import BeautifulSoup
from datetime import datetime
import requests
import time

# 1. Extracting and Parsing HTML 

To successfully scrape and analyse HTML, we'll go through the following 4 steps:
1. Make a GET request
2. Parse the page with Beautiful Soup
3. Search for HTML elements
4. Get attributes and text of these elements

## Step 1: Make a GET Request to Obtain a Page's HTML

We can use the Requests library to:

1. Make a GET request to the page, and
2. Read in the webpage's HTML code.

The process of making a request and obtaining a result resembles that of the Web API workflow. Now, however, we're making a request directly to the website, and we're going to have to parse the HTML ourselves. This is in contrast to being provided data organized into a more straightforward `JSON` or `XML` output.

In [None]:
# Make a GET request
req = requests.get('http://www.ilga.gov/senate/default.asp')
# Read the content of the server’s response
src = req.text
# View some output
print(src[:1000])

## Step 2: Parse the Page with Beautiful Soup

Now, we use the `BeautifulSoup` constructor to parse the response into an HTML tree. This returns an object (called a **soup object**) which contains all of the HTML in the original document.

If you run into an error about a parser library, make sure you've installed the `lxml` package to provide Beautiful Soup with the necessary parsing tools.

In [None]:
# Parse the response into an HTML tree
soup = BeautifulSoup(src, 'lxml')
# Take a look
print(soup.prettify()[:1000])

The output looks pretty similar to the above, but now it's organized in a `soup` object which allows us to more easily traverse the page.
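To see what a soup object offers without depending on the live site, here is a minimal sketch on a made-up HTML snippet. The `toy_html` string and everything in it are invented for illustration, and the built-in `html.parser` is used so the example runs even if `lxml` is not installed.

```python
from bs4 import BeautifulSoup

# A tiny, made-up HTML document (not taken from the ILGA site)
toy_html = """
<html>
  <head><title>Toy page</title></head>
  <body>
    <p class="intro">Hello, <a href="/about">about us</a>.</p>
  </body>
</html>
"""

# Python's built-in parser; no extra install needed
toy_soup = BeautifulSoup(toy_html, "html.parser")

# The first tag of a given name can be reached as an attribute of the soup
page_title = toy_soup.title.text
toy_link_href = toy_soup.a["href"]

print(page_title)
print(toy_link_href)
```

Attribute-style access such as `toy_soup.a` only returns the first matching tag, which is why the search methods in the next step are usually more useful.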

## Step 3: Search for HTML Elements

Beautiful Soup has a number of functions to find useful components on a page. Beautiful Soup lets you find elements by their:

1. HTML tags
2. HTML Attributes
3. CSS Selectors

Let's search first for **HTML tags**. 

The function `find_all` searches the `soup` tree to find all the elements with a particular HTML tag, and returns all of those elements.

What does the example below do?

In [None]:
# Find all elements with a certain tag
a_tags = soup.find_all("a")
print(a_tags[:10])

Because `find_all()` is the most popular method in the Beautiful Soup search API, you can use a shortcut for it. If you treat the BeautifulSoup object as though it were a function, then it’s the same as calling `find_all()` on that object. 

These two lines of code are equivalent:

In [None]:
a_tags = soup.find_all("a")
a_tags_alt = soup("a")
print(a_tags[0])
print(a_tags_alt[0])

How many links did we obtain?

In [None]:
print(len(a_tags))

That's a lot! Many elements on a page will have the same HTML tag. For instance, if you search for everything with the `a` tag, you're likely to get many hits, most of which you might not want. Remember, the `a` tag defines a hyperlink, so you'll usually find many on any given page.

What if we wanted to search for HTML tags with certain attributes, such as particular CSS classes? What classes do you see in the above list of the first set of HTML tags?

We can restrict our search to certain classes by adding an additional argument to `find_all`. In the example below, we find all the `a` tags and keep only those with `class_="sidemenu"`.

In [None]:
# Get only the 'a' tags in 'sidemenu' class
side_menus = soup("a", class_="sidemenu")
side_menus[:5]

A more concise way to search for elements on a website is via a **CSS selector**. For this we use a different method called `select()`. Pass a CSS selector string to `select()` to get all elements that match it.

In the example above, we can use `"a.sidemenu"` as a CSS selector, which returns all `a` tags with class `sidemenu`.

In [None]:
# Get elements with "a.sidemenu" CSS Selector.
selected = soup.select("a.sidemenu")
selected[:5]
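The equivalence between the two search styles can be checked offline. The sketch below uses a made-up snippet (the `menu_html` string and its links are invented for illustration) and compares `find_all` with a class filter against the matching CSS selector.

```python
from bs4 import BeautifulSoup

# Made-up snippet with two link classes (illustrative only)
menu_html = """
<div>
  <a class="sidemenu" href="/members">Members</a>
  <a class="sidemenu" href="/bills">Bills</a>
  <a class="footer" href="/contact">Contact</a>
</div>
"""
menu_soup = BeautifulSoup(menu_html, "html.parser")

# Search by tag name plus class filter
by_find_all = menu_soup.find_all("a", class_="sidemenu")
# Search with the equivalent CSS selector
by_select = menu_soup.select("a.sidemenu")

# Both approaches pick out the same two tags
print(list(by_find_all) == list(by_select))
print([a.text for a in by_select])
```

CSS selectors tend to pay off once the condition involves nesting or several attributes, since the whole condition fits in one string.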

## Step 4: Get Attributes and Text of Elements

Once we identify elements, we want to access the information in those elements. Usually, this means two things:

1. Text
2. Attributes

Getting the text inside an element is easy. All we have to do is use the `text` member of a `tag` object:

In [None]:
# Get all sidemenu links as a list
side_menu_links = soup.select("a.sidemenu")

# Examine the first link
first_link = side_menu_links[0]
print(first_link)

# What class is this variable?
print('Class: ', type(first_link))

It's a Beautiful Soup tag! This means it has a `text` attribute.

In [None]:
print(first_link.text)

Sometimes we want the value of certain attributes. This is particularly relevant for `a` tags, or links, where the `href` attribute tells us where the link goes. You can access a tag’s attributes by treating the tag like a dictionary.

In [None]:
print(first_link['href'])
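Dictionary-style access raises a `KeyError` when a tag lacks the attribute. As a minimal sketch on made-up HTML (the link targets here are invented), the snippet below shows the `attrs` dictionary and the more forgiving `.get()` method.

```python
from bs4 import BeautifulSoup

# Two made-up links: one with an href, one without
link_soup = BeautifulSoup(
    '<a class="sidemenu" href="/senate/today.asp">Today</a><a>No target</a>',
    "html.parser",
)
with_href, without_href = link_soup.find_all("a")

# All attributes of a tag, as a dictionary
print(with_href.attrs)
# Dictionary-style access to a single attribute
print(with_href["href"])
# .get() returns None for a missing attribute instead of raising KeyError
print(without_href.get("href"))
```

Note that `class` comes back as a list, because an HTML element can carry several classes at once.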

# 2. Scraping members of the Illinois General Assembly

Believe it or not, those are really the fundamental tools you need to scrape a website. Once you spend more time familiarizing yourself with HTML and CSS, then it's simply a matter of understanding the structure of a particular website and intelligently applying the tools of Beautiful Soup and Python.

Let's apply these skills to scrape the members of the [Illinois 98th General Assembly](http://www.ilga.gov/senate/default.asp?GA=98).

Specifically, our goal is to scrape information on each senator, including their name, district, and party.

## Scrape and Soup the Webpage

Let's scrape and parse the webpage, using the tools we learned in the previous section.

In [None]:
# Make a GET request
req = requests.get('http://www.ilga.gov/senate/default.asp?GA=98')
# Read the content of the server’s response
src = req.text
# Soup it
soup = BeautifulSoup(src, "lxml")

## Search for the Table Elements

Our goal is to obtain the elements in the table on the webpage. Remember: rows are identified by the `tr` tag. Let's use `find_all` to obtain these elements.

In [None]:
# Get all table row elements
rows = soup.find_all("tr")
len(rows)

Keep in mind: `find_all` gets *all* the elements with the `tr` tag. We only want some of them. 

Let's look at the source code for the page and look carefully to see if we can use some CSS selectors to get just the rows we're interested in. What do we observe?

Well, we can observe that the table we are interested in is **nested**. If we scroll to the bottom of the inspector of the source code, we can see that it ends by closing out two levels of tables and table rows. In fact the whole webpage is formatted as a giant table. This implies that the information we want is several levels of table rows deep. 

CSS selectors allow us to query particular levels of nesting by repeating a tag. Here we will use `tr tr tr`, which selects only rows that sit at least 3 levels deep within the overall tabular page structure.

In [None]:
# Return every element matching the 'tr tr tr' CSS selector
rows = soup.select('tr tr tr')

for row in rows[:5]:
    print(row, '\n')
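To make the effect of repeating `tr` concrete, here is a small offline sketch. The three nested tables in `nested_html` are made up for illustration and are much simpler than the real page.

```python
from bs4 import BeautifulSoup

# Three tables nested inside one another (made-up structure)
nested_html = """
<table>
  <tr><td>
    <table>
      <tr><td>
        <table>
          <tr><td>innermost row</td></tr>
        </table>
      </td></tr>
    </table>
  </td></tr>
</table>
"""
nested_soup = BeautifulSoup(nested_html, "html.parser")

# find_all returns rows at every level of nesting
all_rows = nested_soup.find_all("tr")
# 'tr tr tr' keeps only rows that have two 'tr' ancestors
deep_rows = nested_soup.select("tr tr tr")

print(len(all_rows))
print(len(deep_rows))
print(deep_rows[0].get_text(strip=True))
```

Only the innermost row survives the selector, because it is the only `tr` nested inside two other `tr` elements.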

It looks like we want everything after the first two rows. Let's work with a single row to start, and build our loop from there.

In [None]:
example_row = rows[2]
print(example_row.prettify())

Let's break this row down into its component cells/columns using the `select` method with CSS selectors. Looking closely at the HTML, there are a couple of ways we could do this.

* We could identify the cells by their tag `td`.
* We could use the class name `.detail`.
* We could combine both and use the selector `td.detail`.

In [None]:
for cell in example_row.select('td'):
    print(cell)
print()

for cell in example_row.select('.detail'):
    print(cell)
print()

for cell in example_row.select('td.detail'):
    print(cell)
print()

We can confirm that these are all the same.

In [None]:
assert example_row.select('td') == example_row.select('.detail') == example_row.select('td.detail')

Let's use the selector `td.detail` to be as specific as possible.

In [None]:
# Select only those 'td' tags with class 'detail' 
detail_cells = example_row.select('td.detail')
detail_cells

Most of the time, we're interested in the actual **text** of a website, not its tags. Recall that to get the text of an HTML element, we use the `text` attribute.

In [None]:
# Keep only the text in each of those cells
row_data = [cell.text for cell in detail_cells]

print(row_data)
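Cell text scraped from real pages often carries stray whitespace. As a hedged aside, the sketch below uses made-up cells (the name and number are placeholders) to contrast `.text` with `get_text(strip=True)`, which trims each piece of text.

```python
from bs4 import BeautifulSoup

# Made-up row with padded cell contents
cell_soup = BeautifulSoup(
    '<tr><td class="detail">\n   <a href="#">Example Senator</a>  </td>'
    '<td class="detail"> 12 </td></tr>',
    "html.parser",
)
toy_cells = cell_soup.select("td.detail")

# .text keeps the surrounding whitespace
raw_text = [cell.text for cell in toy_cells]
# get_text(strip=True) strips each text fragment and drops empty ones
clean_text = [cell.get_text(strip=True) for cell in toy_cells]

print(raw_text)
print(clean_text)
```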

Looks good! Now we just use our basic Python knowledge to get the elements of this list that we want. Remember, we want the senator's name, their district, and their party.

In [None]:
print(row_data[0]) # Name
print(row_data[3]) # District
print(row_data[4]) # Party

We're also interested in accessing information about the senator's bills. How can we retrieve this? Look at the structure of `detail_cells` above. We know we want an `a` tag and the value of its `href` attribute. We can also observe that there are several `a` tags, and that the one we want is the second. We can therefore index into the list of `a` tags to get the one we want, then extract its `href` attribute.

Notice that the string in the `href` attribute contains the **relative** link we are after. We will need to construct the full URL to be able to scrape bill information.

In [None]:
print(example_row.select('a')[1]['href'])
# Create full path to bills
print("http://www.ilga.gov/senate/" + example_row.select('a')[1]['href'] + "&Primary=True")
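String concatenation works here because we know the site's layout. A more general option, sketched below, is the standard library's `urljoin`, which resolves a relative link against the URL of the page it came from. The `relative_href` value is an assumed example modeled on the bill URL format shown earlier, not output copied from the page.

```python
from urllib.parse import urljoin

# The page we scraped the link from
page_url = "http://www.ilga.gov/senate/default.asp?GA=98"
# Assumed example of a relative href (illustrative only)
relative_href = "SenatorBills.asp?MemberID=1911&GA=98"

# Resolve the relative link, then add the extra query parameter
bills_url = urljoin(page_url, relative_href) + "&Primary=True"
print(bills_url)
```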

## Getting Rid of Junk Rows

We saw earlier that not all of the rows we got actually correspond to a senator. We'll need to do some cleaning before we can proceed forward. Take a look at some examples:

In [None]:
print('Row 0:\n', rows[0], '\n')
print('Row 1:\n', rows[1], '\n')
print('Last Row:\n', rows[-1])

When we write our for loop, we only want it to apply to the relevant rows. So we'll need to filter out the irrelevant rows. The way to do this is to compare some of these to the rows we do want, see how they differ, and then formulate that in a conditional.

As you can imagine, there are a lot of possible ways to do this, and the right one will depend on the website. We'll show some here to give you an idea of how to do this.

In [None]:
# Bad rows
print(len(rows[0]))
print(len(rows[1]))

# Good rows
print(len(rows[2]))
print(len(rows[3]))

Perhaps good rows have a length of 5. Let's check:

In [None]:
good_rows = [row for row in rows if len(row) == 5]

# Let's check some rows
print(good_rows[0], '\n')
print(good_rows[-2], '\n')
print(good_rows[-1])

We found a footer row in our list that we'd like to avoid. Let's try something else:

In [None]:
rows[2].select('td.detail') 

In [None]:
# Bad row
print(rows[-1].select('td.detail'), '\n')

# Good row
print(rows[5].select('td.detail'), '\n')

# How about this?
good_rows = [row for row in rows if row.select('td.detail')]

print("Checking rows...\n")
print(good_rows[0], '\n')
print(good_rows[-1])

Looks like we found something that worked!

## Loop it All Together

Now that we've seen how to get the data we want from one row, as well as filter out the rows we don't want, let's put it all together into a loop.

In [None]:
# Define storage list
members = []

# Get rid of junk rows
valid_rows = [row for row in rows if row.select('td.detail')]

# Loop through all rows
for row in valid_rows:
    # Select only those 'td' tags with class 'detail'
    detail_cells = row.select('td.detail')
    # Keep only the text in each of those cells
    row_data = [cell.text for cell in detail_cells]
    # Collect information
    name = row_data[0]
    district = int(row_data[3])
    party = row_data[4]
    # Store in a tuple
    senator = (name, district, party)
    # Append to list
    members.append(senator)

In [None]:
# Create empty list to store our data
members = []

# Get rid of junk rows
valid_rows = [row for row in rows if row.select('td.detail')]

# Loop through all rows
for row in valid_rows:
    # Select only those 'td' tags with class 'detail'
    detail_cells = row.select('td.detail') 
    # Keep only the text in each of those cells
    row_data = [cell.text for cell in detail_cells]
    # Collect senator information
    name = row_data[0]
    district = int(row_data[3])
    party = row_data[4]
    # Extract href for bills
    href = row.select('a')[1]['href']
    # Create full path
    full_path = "http://www.ilga.gov/senate/" + href + "&Primary=True"
    # Store in a tuple
    senator = (name, district, party, full_path)
    # Append to list
    members.append(senator)

In [None]:
# Should be 61
len(members)

Let's take a look at what we have in `members`.

In [None]:
print(members[:5])
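Tuples are compact, but named fields are easier to work with later. The sketch below is one possible way to turn rows with the same layout as `members` into dictionaries and CSV text. The `sample_members` values are placeholders invented for illustration, not scraped data.

```python
import csv
import io

# Placeholder rows in the same (name, district, party, url) layout as `members`
sample_members = [
    ("Senator A", 1, "D", "http://example.com/a"),
    ("Senator B", 2, "R", "http://example.com/b"),
]

fieldnames = ["name", "district", "party", "bills_url"]
# Pair each field name with the matching tuple entry
records = [dict(zip(fieldnames, member)) for member in sample_members]

# Write to an in-memory buffer; a real file opened with newline="" works the same way
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=fieldnames)
writer.writeheader()
writer.writerows(records)

print(buffer.getvalue())
```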

# 3. Scraping Illinois senate members' bills 

In the code above we retrieved the URL for each senator's list of bills, taking advantage of the fact that each URL follows a specific format: 

`http://www.ilga.gov/senate/SenatorBills.asp?GA=98&MemberID=[MEMBER_ID]&Primary=True`

We want to scrape the webpages corresponding to bills sponsored by each senator. Let's look at [one of them](https://www.ilga.gov/senate/SenatorBills.asp?GA=98&MemberID=1911&Primary=True) to see the structure.

We will write a function called `get_bills(url)` to scrape the page at a given senator's bills URL. This will involve:

  - requesting the URL using the <a href="http://docs.python-requests.org/en/latest/">`requests`</a> library
  - using the features of the `BeautifulSoup` library to find all of the relevant elements
      - we will again take advantage of the nested structure of tables
      - we also take advantage of our exploration of the source code to recognize that all relevant rows have class `billlist`
  - return a _list_ of tuples, each with:
      - the bill ID (1st column)
      - description (2nd column)
      - chamber (S or H) (3rd column)
      - the last action (4th column)
      - the last action date (5th column)
    

In [None]:
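def get_bills(url):
    """Scrape a senator's bills page and return a list of bill tuples."""
    # Request the page and parse it, naming the parser explicitly
    src = requests.get(url).text
    soup = BeautifulSoup(src, "lxml")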
    rows = soup.select('tr tr tr')
    # create an empty list to store the data
    bills = []
    # Iterate over rows
    for row in rows:
        # Grab all bill list cells
        cells = row.select('td.billlist')
        # Relevant entries will have length 5 corresponding to the 5 columns
        # Note that the name of the senator has a class billlist but not td class billlist, so it is not included
        if len(cells) == 5:
            row_text = [cell.text for cell in cells]
            # Extract info from row text
            bill_id = row_text[0]
            description = row_text[1]
            chamber = row_text[2]
            last_action = row_text[3]
            last_action_date = row_text[4]
            # Consolidate bill info
            bill = (bill_id, description, chamber, last_action, last_action_date)
            bills.append(bill)
    return bills

In [None]:
# Let's test it!
print(members[0][3])
test_url = members[0][3]
get_bills(test_url)[0:5]
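The parsing logic can also be checked without touching the network by separating it from the request. The sketch below is a hypothetical variant, `parse_bills(html)`, that applies the same selectors to an HTML string. The `toy_page` fragment is made up to mimic the nested-table layout and contains no real bill data. Unlike `get_bills`, it uses `get_text(strip=True)` to trim whitespace.

```python
from bs4 import BeautifulSoup


def parse_bills(html):
    """Return (bill_id, description, chamber, last_action, date) tuples."""
    page = BeautifulSoup(html, "html.parser")
    bills = []
    for row in page.select("tr tr tr"):
        cells = row.select("td.billlist")
        # Bill rows have exactly 5 'td.billlist' cells; skip everything else
        if len(cells) == 5:
            bills.append(tuple(cell.get_text(strip=True) for cell in cells))
    return bills


# Made-up fragment with one header row and one bill row
toy_page = """
<table><tr><td>
  <table><tr><td>
    <table>
      <tr><td class="header">Bill</td></tr>
      <tr>
        <td class="billlist">SB0001</td>
        <td class="billlist">Example description</td>
        <td class="billlist">S</td>
        <td class="billlist">Example action</td>
        <td class="billlist">1/1/2013</td>
      </tr>
    </table>
  </td></tr></table>
</td></tr></table>
"""

parsed = parse_bills(toy_page)
print(parsed)
```

Keeping the fetch and the parse apart makes it possible to save a page once and rerun the parser while debugging, rather than requesting the page again each time.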

### Scrape All Bills

As a challenge, you can create a dictionary which maps a senator's district (the key) onto a list of bills (the value) coming from that district. You can do this by looping over all of the senate members in `members` and calling `get_bills()` for each of their associated bill URLs.
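One possible shape for this loop is sketched below; it is not the only solution. The helper `collect_bills` is a hypothetical name, and it takes the fetching function as an argument so the example can run offline with a stand-in. The pause between requests is an added courtesy to the server, and the demo data is invented.

```python
import time


def collect_bills(members, fetch_bills, delay_seconds=1.0):
    """Map each senator's district to the list returned by fetch_bills(url)."""
    bills_by_district = {}
    for _name, district, _party, url in members:
        bills_by_district[district] = fetch_bills(url)
        # Pause between requests to avoid overloading the server
        time.sleep(delay_seconds)
    return bills_by_district


def fake_fetch(url):
    """Stand-in for get_bills that returns one placeholder bill."""
    return [("SB0001", "Example", "S", "Example action", "1/1/2013")]


# Offline demonstration with placeholder members
demo_members = [("Senator A", 1, "D", "url-a"), ("Senator B", 2, "R", "url-b")]
demo_bills = collect_bills(demo_members, fake_fetch, delay_seconds=0)
print(sorted(demo_bills))
```

With the objects built in this notebook, the call would be `collect_bills(members, get_bills)`, which takes at least a minute for 61 senators at the default one-second delay.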

After you've compiled this information, you can move on to conducting analysis on questions like:
- What is the distribution of bill proposals across senators?
- Are the dates of last actions concentrated over time? 
- Does the number of bill proposals by Democrats and Republicans vary by whether there are more members of their party in the Senate?
- Does the last action of bill proposals by Democrats and Republicans vary by whether there are more members of their party in the Senate?

# 4. Scraping downloadable files

Another useful application of web scraping is bulk downloading files that are stored following some predictable format. This can save time if you know you have to download many files and don't want to go through the process of navigating to each page and manually clicking on download links.

We'll apply these tools to the case of downloading country boundary shape files from [GADM](https://gadm.org/download_country.html). Let's first look at the website.

Now we will set up our code to download shapefiles.

In [None]:
# GADM download page URL
base_url = "https://gadm.org/download_country.html"

# Start a session
session = requests.Session()

# Get the HTML content of the page
response = session.get(base_url)
soup = BeautifulSoup(response.text, "html.parser")
soup

In [None]:
# Extract country codes from the dropdown menu
# Observe that the dropdown is formatted with name="Country" as an attribute
# Observe that each country is associated with an option value giving the GADM country code
country_options = soup.select("select[name=country] option")
country_dict = {option.text.strip(): option["value"] for option in country_options if option["value"]}
country_dict

We now have a dictionary of country names and associated GADM country values. But these are not exactly what we need to create the URLs for the shapefiles. We need just the first three letters of each value, which is straightforward to extract. Then, we can loop through the countries we want, build the appropriate download URL, and download the shapefiles.

In [None]:
country_dict['Brazil'][:3]
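The dropdown parsing and the three-letter slice can be tried offline on a made-up menu. In the sketch below, `dropdown_html` is invented, and the `CODE_Name_N` format of the option values is an assumption for illustration; check the live page for the actual format.

```python
from bs4 import BeautifulSoup

# Made-up dropdown; the value format is an assumption for illustration
dropdown_html = """
<select name="country">
  <option value="">Select a country</option>
  <option value="BGD_Bangladesh_4">Bangladesh</option>
  <option value="BRA_Brazil_3">Brazil</option>
</select>
"""
dropdown_soup = BeautifulSoup(dropdown_html, "html.parser")

options = dropdown_soup.select("select[name=country] option")
# Skip the empty placeholder option, then keep the first three characters
codes = {
    opt.text.strip(): opt["value"][:3]
    for opt in options
    if opt.get("value")
}
print(codes)
```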

In [None]:
import os

# Create a directory to store downloaded shapefiles
output_dir = "Data"
os.makedirs(output_dir, exist_ok=True)

# List of countries to download shapefiles for
countries = ["Bangladesh", "Brazil", "Burundi"]

# Iterate over the list of desired countries
for country in countries:
    if country in country_dict:
        country_code = country_dict[country][:3]
        download_url = f"https://geodata.ucdavis.edu/gadm/gadm4.1/shp/gadm41_{country_code}_shp.zip"

        # File path for saving the zip file
        file_path = os.path.join(output_dir, f"gadm41_{country_code}_shp.zip")

        print(f"Downloading {country} shapefile")

        # Download and save the file
        response = session.get(download_url, stream=True)
        if response.status_code == 200:
            with open(file_path, "wb") as f:
                for chunk in response.iter_content(chunk_size=1024):
                    f.write(chunk)
            print(f"Successfully downloaded: {country}\n")
        else:
            print(f"Failed to download: {country}\n")
    else:
        print(f"Country not found in dropdown: {country}\n")

print("All downloads complete!")

We're done! We've successfully downloaded all the shapefiles we wanted.

You can see how this is a powerful tool for bulk downloading data where the URLs follow a common format.
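A status code of 200 does not guarantee that the saved file is a usable archive. As an optional check, sketched here with in-memory stand-ins rather than real downloads, the standard library's `zipfile.is_zipfile` can confirm that a file looks like a zip archive before you try to open it.

```python
import io
import zipfile

# Build a tiny zip archive in memory to stand in for a downloaded file
good_bytes = io.BytesIO()
with zipfile.ZipFile(good_bytes, "w") as archive:
    archive.writestr("example.txt", "placeholder contents")

# Stand-in for a failed download that saved an HTML error page instead
bad_bytes = io.BytesIO(b"<html>Not Found</html>")

is_good_zip = zipfile.is_zipfile(good_bytes)
is_bad_zip = zipfile.is_zipfile(bad_bytes)

print(is_good_zip)
print(is_bad_zip)
```

For files on disk, `zipfile.is_zipfile` also accepts a path, so it could be called on each `file_path` after the download loop.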