# Lab 5: Web Scraping with Python
## ENGL 6701 Spring 2024

Contact:
Lindsay Thomas, lthomas@cornell.edu

For more information about this lab and the context in which it was used in ENGL 6701, see [the Lab 5 page on the ENGL 6701 Spring 2024 course website](https://lindsaythomas.net/engl6701s24/labs/lab-5.html).

### 0. Create and Save a Notes Document

As with Lab 4, you will want to create and save a notes document before beginning this lab. Starting with section 2, you will be asked to write some simple code. As you complete these portions of the lab, copy and paste the code you write into your notes document.

Since we are once again using Binder to work with this notebook, please remember that none of the changes you make will be saved after your browser session. This includes outputs that are displayed to the notebook, as well as things like inserting filenames. When you shut this tab down or your laptop loses its connection to the internet or server, you will need to restart this notebook, and it will be like restarting from the beginning. That's why you should copy and paste the code you write into a notes document. You should also know that you may need to rerun cells from the beginning of the notebook to get later sections of the notebook to function correctly.

Sections 1 and 2 of this notebook are drawn from the ["Web Scraping -- Part 1" section of Chapter 4, "Data Collection,"](https://melaniewalsh.github.io/Intro-Cultural-Analytics/04-Data-Collection/02-Web-Scraping-Part1.html) from Melanie Walsh's free online textbook, *Introduction to Cultural Analytics & Python*.

### 1. What is Web Scraping?

To illustrate what web scraping is and how it's useful, let's look at a dataset collected by Cornell CIS faculty Cristian Danescu-Niculescu-Mizil and Lillian Lee over ten years ago now. These researchers used this corpus, the [Cornell Movie-Dialogs Corpus](https://www.cs.cornell.edu/~cristian/Cornell_Movie-Dialogs_Corpus.html), in their paper ["Chameleons in Imagined Conversations"](https://www.cs.cornell.edu/~cristian/papers/chameleons.pdf). To create this dataset, they scraped movie scripts from various websites and kept track of each URL they used. Let's look at these URLs.

First, we're going to import [pandas](https://pypi.org/project/pandas/), which is a package for handling tabular data within Python. 

In [None]:
import pandas as pd

Next, we're going to read in a spreadsheet that lists the titles and urls where the researchers found each script. In Python-speak, we're using `pandas` to create a DataFrame to store this tabular data.

In [None]:
urls = pd.read_csv("raw_script_urls.csv", delimiter='\t', encoding='utf-8')

Display the DataFrame.

In [None]:
urls

We can see from this output that each row in this DataFrame is a movie, and the script URL is listed in the third column. We could use this information to manually navigate to each URL and copy and paste each script into a txt file, but that would be labor-intensive. We would also risk losing information that may be useful to us but that isn't displayed on the web page itself or that is structured in an unusual way. So instead, we're going to programmatically access the scripts. 
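
If the `delimiter='\t'` argument above is unfamiliar, here is a minimal sketch of what it does, using two made-up rows rather than the real spreadsheet. Only the `script_url` column name comes from this lab; the other column names and all of the values are assumptions for illustration.

```python
import io

import pandas as pd

# two made-up rows in a tab-separated layout; `movie_id` and `title` are assumed column names
sample_tsv = (
    "movie_id\ttitle\tscript_url\n"
    "m0\tExample Movie One\thttp://example.com/one.txt\n"
    "m1\tExample Movie Two\thttp://example.com/two.txt\n"
)

# io.StringIO lets read_csv treat the string as if it were a file on disk
sample_df = pd.read_csv(io.StringIO(sample_tsv), delimiter="\t")

# each tab marks the boundary between two columns
print(sample_df.shape)
print(sample_df["script_url"].tolist())
```

Without the `delimiter` argument, pandas would look for commas instead of tabs and would read each line as a single column.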

#### Request and Response

When you type in a url to the address bar in your browser, you are sending an HTTP **request** for a web page. The server that stores that web page then sends back a **response**, which is the web page data that your browser renders.

We can use a Python library called [requests](https://requests.readthedocs.io/en/latest/) to programmatically access the data sent via those responses. Let's import requests.

In [None]:
import requests

#### Get HTML Data

Head over to <http://www.scifiscripts.com/scripts/Ghostbusters.txt> in your browser. When we look at this webpage, we can see that it's just a plain-text file that contains the script for the movie *Ghostbusters*. 

We can capture the data contained in that plain-text file by using the `.get()` function associated with the requests library. We will store what we get in a variable called `response`.

In [None]:
response = requests.get("http://www.scifiscripts.com/scripts/Ghostbusters.txt")

However, if we check this variable, we see that it just gives us the [HTTP response code](https://developer.mozilla.org/en-US/docs/Web/HTTP/Status), which tells us if the request was successful or not.

In [None]:
response

In this case, it was: a status code of 200 means the request succeeded.

But let's see what happens if we change the url in our request to a webpage that doesn't exist.

In [None]:
bad_response = requests.get("http://www.scifiscripts.com/scripts/Ghostboogers.txt")

In [None]:
bad_response

404, on the other hand, is a common "Page Not Found" error. If you head over to <http://www.scifiscripts.com/scripts/Ghostboogers.txt>, you'll see what this looks like.
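
Status codes are just numbers with agreed-upon meanings. As a small sketch, Python's built-in `http` module can translate a code into its standard label; the helper functions `describe_status` and `is_success` are hypothetical names invented for this example. (With the requests library, the number itself is available as `response.status_code`.)

```python
from http import HTTPStatus


def describe_status(code):
    """Return a short label for a numeric status code, built from its standard phrase."""
    status = HTTPStatus(code)
    return f"{status.value} {status.phrase}"


def is_success(code):
    """Codes in the 200-299 range mean the request succeeded."""
    return 200 <= code < 300


# the two codes we have seen so far in this lab
for code in (200, 404):
    print(describe_status(code), "| success:", is_success(code))
```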

#### Extract Text from a Web Page

In order to actually read the text data in our response, we need to use `.text`, which we will save in a variable called `html_string`. We will use the data stored in the `response` variable we created above.

In [None]:
html_string = response.text

Now, if we print `html_string`, we will be able to see the text data stored on the screenplay's website.

In [None]:
print(html_string)

#### Extract Text from Multiple Web Pages

But how could we grab the screenplay for every movie in the DataFrame of movie scripts we created above? We can write a function that will do this. To demonstrate this, let's first create a smaller version of our movie script dataframe, one including only 10 scripts.

In [None]:
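# in this line of code, we are using Python's built-in slicing functionality to tell the computer to take 
# only the first 10 rows in the urls DataFrame we created above. `.copy()` makes this an independent
# DataFrame, so that adding a column to it later doesn't trigger a pandas SettingWithCopyWarning.
sample_urls = urls[:10].copy()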

In [None]:
sample_urls

Now, we need to write a function that 1) gets the data from each web page; and 2) stores the text data.

In [None]:
# first, we define the function and tell it to act on a single url
def scrape_screenplay(url):
    # then we get the data from that url
    response = requests.get(url)
    # then we store the text data in a variable called `html_string`
    html_string = response.text
    # then we return the `html_string` variable as output
    return html_string


Then we apply this function to the “script_url” column of the DataFrame and create a new column for the resulting extracted text. Pandas makes this easy.

In [None]:
# this code says: Apply the 'scrape_screenplay' function to each row of the 'sample_urls' dataframe.
# use whatever is in the 'script_url' column as the input. 
# then, store the output for each row in a new column titled 'text'.
sample_urls['text'] = sample_urls['script_url'].apply(scrape_screenplay)

In [None]:
sample_urls

If we print out every row in the column, we can see that we successfully extracted text for each URL (though some of these URLs returned 404 errors). This text is encoded in HTML (hence the tags you see in the 'text' column above).

In [None]:
# this for loop says: for each row in the 'sample_urls' dataframe, print out the value of the 'text' column
for text in sample_urls['text']:
    print(text)
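
Because some of the sample URLs return 404 errors, the 'text' column ends up holding error pages alongside real scripts. One possible way to handle this, sketched below, is a variant of `scrape_screenplay` that returns `None` unless the request succeeds. The function name `scrape_screenplay_checked` and the 10-second `timeout` are assumptions made for this example, not part of the original lab.

```python
import requests


def scrape_screenplay_checked(url, timeout=10):
    """Return the page text, or None if the request fails or is not a 200."""
    try:
        response = requests.get(url, timeout=timeout)
    except requests.exceptions.RequestException:
        # covers connection problems, timeouts, and malformed urls
        return None
    # anything other than 200 (for example, 404) is treated as "no script found"
    if response.status_code != 200:
        return None
    return response.text
```

Applied with `sample_urls['script_url'].apply(scrape_screenplay_checked)`, this would leave `None` in the rows whose URLs no longer work, which makes them easy to spot and filter out.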

### 2. Working with HTML Data

In [None]:
# this code block will import all of the packages we need for this section.
import requests
from bs4 import BeautifulSoup

Most pages we will want to scrape, however, won't be as simple as our movie script urls. What's more, we will sometimes want to collect only some information included on a page, or we will want to restructure the data included in a web page for our own ends. To do this, we need to be familiar with how to programmatically extract specific pieces of information from a web page. We will use the Python library [BeautifulSoup](https://beautiful-soup-4.readthedocs.io/en/latest/), which parses HTML (HyperText Markup Language) documents, to do this (last week's lab also used BeautifulSoup; it can also parse TEI).

To get a handle on using BeautifulSoup to parse HTML, we're going to examine a toy website made by the poet, programmer, and professor Allison Parrish for purposes of teaching BeautifulSoup. 

Here's the website: <https://static.decontextualize.com/kittens.html>.

#### Scraping the Kittens Web Page

Let's use the requests library to scrape this Kitten TV website.

In [None]:
# first we get the data on the website
response = requests.get("http://static.decontextualize.com/kittens.html")

# then we specify we want to see the text data
html_string = response.text

# then we print out what we got
print(html_string)

If we examine the above output, we can see that HTML uses "tags" to represent different elements of the page, such as the `<h1>` tag, or the main header tag, which marks the first line we see on the kittens web page. HTML tags usually, though not always, also require closing tags. For example, the main header "Kittens and the TV Shows They Love" is surrounded by an "opening" `<h1>` tag and a "closing" `</h1>` tag.

You can see an alphabetized list of HTML tags on this page: <https://www.w3schools.com/tags/>.

HTML elements sometimes come with even more information inside a tag, such as attributes, classes, and IDs. This information will often consist of a keyword (like class or id) followed by an equals sign = and a further descriptor such as `<div class="kitten">` or `<ul class="tvshows">`.

We need to know about tags as well as attributes, classes, and IDs because this is how we’re going to extract specific HTML data with BeautifulSoup.
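
To make the relationship between tags and attributes concrete, here is a small sketch that uses BeautifulSoup (introduced properly in the next subsection) on a made-up HTML fragment. The fragment imitates the structure described above, but its contents, including the `id` value, are assumptions for illustration.

```python
from bs4 import BeautifulSoup

# a made-up fragment that imitates the structure described above
sample_html = """
<div class="kitten" id="first">
  <h2>Example Cat</h2>
  <ul class="tvshows"><li>Example Show</li></ul>
</div>
"""

sample_doc = BeautifulSoup(sample_html, "html.parser")
div = sample_doc.find("div")

# the tag name
print(div.name)
# all of the tag's attributes, as a dictionary
print(div.attrs)
# class comes back as a list, because one tag can have several classes
print(div["class"])
# id comes back as a plain string
print(div["id"])
```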

#### Using BeautifulSoup to Extract Data from HTML Documents

First, import BeautifulSoup if you haven't already done so.

In [None]:
from bs4 import BeautifulSoup

To make a BeautifulSoup document, we call `BeautifulSoup()` with two parameters: the `html_string` from our HTTP request above and the kind of parser that we want to use, which will always be `html.parser` when we're dealing with HTML.

In [None]:
# first, get the data from the kittens website
response = requests.get("http://static.decontextualize.com/kittens.html")

# then, specify that we want just the text
html_string = response.text

# finally, call `BeautifulSoup()` and tell it to parse the `html_string` variable using the html parser
document = BeautifulSoup(html_string, "html.parser")

In [None]:
document

The above output looks pretty similar to what we got above when we were just using requests. However, now we can use BeautifulSoup to further parse this data and extract specific elements. We will use `find()` for this.

Run the two following cells. How are their outputs different and why are they different?

In [None]:
document.find("h1")

In [None]:
document.find("h1").text

We can also use `find()` to grab the first tag that matches the specific element we request. For example, if we wanted to find the first image in the Kittens web page, we could run the following:

In [None]:
document.find("img")

#### Now You Try It: Question 1

What if we wanted to find the data embedded in the first [`<li>` tag](https://www.w3schools.com/tags/tag_li.asp) above? Write the code to do this below, and then copy and paste it into your notes document.

In [None]:
#write your code below


#### Using BeautifulSoup to Extract Multiple HTML Elements

Notice how the above image file is the first value embedded within an `<img>` tag in our HTML document. But what if we wanted to find *all* of the images on the website?

In [None]:
document.find_all("img")

We can also specify that we want to find elements with specific attributes. For example:

In [None]:
document.find_all("div", attrs={"class": "kitten"})
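
Later in this notebook, the Goodreads code uses a shorthand for the same kind of search: the `class_` keyword (with a trailing underscore, because `class` is a reserved word in Python). The sketch below, run on a made-up HTML fragment, suggests that the two spellings find the same elements.

```python
from bs4 import BeautifulSoup

# a made-up fragment with two different classes
sample_html = """
<p class="note">first note</p>
<p class="warning">a warning</p>
<p class="note">second note</p>
"""
sample_doc = BeautifulSoup(sample_html, "html.parser")

# the long form, using an attrs dictionary
with_attrs = sample_doc.find_all("p", attrs={"class": "note"})
# the short form, using the class_ keyword
with_keyword = sample_doc.find_all("p", class_="note")

# compare the results of the two searches
print(with_attrs == with_keyword)
print(len(with_keyword))
```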

#### Now You Try It: Question 2

Let's find all of the data in `<ul>` tags whose `class="tvshows"`. Write the code to do this below, and then copy and paste it into your notes document.

In [None]:
# write your code below


#### Using For Loops to Extract Multiple HTML Elements

Ok, now let's try to extract text from all of the `<h2>` elements in our document. First, we find them.

In [None]:
document.find_all("h2")

Great, we can see from the above output that there are 2 different `<h2>` values in the document. Let's extract just the text ("Fluffy," "Monsieur Whiskeurs") from both.

(Note: The code in the cell below is supposed to cause an error.)

In [None]:
document.find_all("h2").text

Whoops, looks like we need a `for` loop so that we can cycle through the value for each `<h2>` tag and extract the text.

In [None]:
# first, find all the h2 values and store them in the `all_h2_headers` variable
all_h2_headers = document.find_all("h2")

all_h2_headers

In [None]:
# then, create an empty list where we will put the text we extract
h2_headers = []

# now, create a for loop in which, for each header in `all_h2_headers`, we grab the text, 
# put it in a variable called `header_contents`, then append that value to our `h2_headers` list
for header in all_h2_headers:
    header_contents = header.text
    h2_headers.append(header_contents)

# finally, print out what's in `h2_headers`
h2_headers
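
As an aside, Python also offers a more compact way to write this kind of loop, called a list comprehension. The sketch below uses a made-up HTML string rather than the kittens page; it is an alternative spelling of the same logic, not a replacement for the `for` loop above.

```python
from bs4 import BeautifulSoup

# a made-up fragment with two <h2> headers
sample_html = "<h2>First Heading</h2><p>text</p><h2>Second Heading</h2>"
sample_doc = BeautifulSoup(sample_html, "html.parser")

# same logic as a for loop with .append(), written on one line:
# for each header found, take its text and collect the results in a list
sample_headers = [header.text for header in sample_doc.find_all("h2")]
print(sample_headers)
```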

#### Now You Try It: Question 3

Let's write a `for` loop that will allow us to put all of the TV show names into a list. Write the code to do this in the below cells, and then copy and paste it into your notes document once it's working.

First, find all of the TV show names. What element do we need to `find_all()` of?

In [None]:
# enter your code below


Then, create an empty list to store the show names, and write a `for` loop that cycles through each element and extracts the text of the TV show names. Hint: Your code should look very similar to the `for` loop above -- you might just change the variable names.

In [None]:
# enter your code below



#### Want an Additional Challenge?: Question 3.5

Write code that will extract all of the links to each TV show's IMDB page and put those links into a list. We want just the links here, not any of the tags. As before, write the code to do this in the below cell, and then copy and paste it into your notes document once it's working.

Hint: We need to use a method other than `.text` to accomplish this. This page might be helpful: <https://www.educative.io/answers/beautiful-soup-get-href>. 

In [None]:
# enter your code below












### 3. Scraping Goodreads

The code in this section is drawn from [Adesua Ayomitan's Goodreads webscraping notebook](https://github.com/Adesuaayo/goodreads_webscraper/blob/main/Goodreads_webscraper.ipynb). The section is inspired by Melanie Walsh's and Maria Antoniak's article ["The Goodreads 'Classics'"](https://culturalanalytics.org/article/22221-the-goodreads-classics-a-computational-study-of-readers-amazon-and-crowdsourced-amateur-criticism).

In [None]:
# this code block will import all of the packages we need for this section if you haven't already
# loaded them.
import pandas as pd
import requests
from bs4 import BeautifulSoup
import time
import re

We're now going to move on to scraping a more complicated website: Goodreads. However, we're still going to keep it as simple as possible, and we're going to scrape the top 100 books that Goodreads users have shelved as "classics."

Specifically, we're going to scrape this page: <https://www.goodreads.com/search?q=classics&qid=>.

If you head to that page and take a look, you'll see that it lists the "top" books users have shelved as "classics," based on users' ratings and the number of ratings. We can see that Goodreads displays 10 books per page, and that we can page through well over 100 pages of results.
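
Since the results are split across pages, each page has its own URL. The code in section 4 builds these URLs by joining strings together; as a rough sketch of the same idea, Python's built-in `urllib.parse` module can assemble them from a dictionary of parameters. The parameter names `page`, `q`, and `search_type` are taken from the URL used in section 4. The helper name `build_page_url` is hypothetical, and it is an assumption that Goodreads returns the same results when the other parameters in that URL are left out.

```python
from urllib.parse import urlencode

base_url = "https://www.goodreads.com/search"


def build_page_url(page):
    """Build the url for one page of 'classics' search results."""
    params = {"page": page, "q": "classics", "search_type": "books"}
    # urlencode turns the dictionary into "page=1&q=classics&search_type=books"
    return base_url + "?" + urlencode(params)


# for example, the urls for the first ten pages of results
page_urls = [build_page_url(page) for page in range(1, 11)]
print(page_urls[0])
print(page_urls[-1])
```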

We want to collect the top 100 books, so that means collecting data from the first 10 pages of results. And we want to collect the following information about each book:
- title
- author
- average rating
- year published
- number of editions (available on Goodreads)

If we look at the search results page again, we can see that that information is included in the text that appears about each book. 

So now we need to figure out how to scrape it using BeautifulSoup. To do that, we first need to figure out what tags are being used to encode the data we want to collect from these pages. We could, as we did before, grab all of the data and extract the text to see the tags. However, for a page like this, that would be pretty unwieldy. Instead, we're going to visually inspect the search results page using our browsers. I recommend using Chrome or Firefox for this part of the lab.

#### Inspecting the Search Results Page

Head over to the "classics" search results page: <https://www.goodreads.com/search?q=classics&qid=>. Right- or control-click on the page, and select "Inspect" on the menu that comes up. An inspector panel should pop up. This panel reveals the HTML structuring the page. Mousing over each line of the HTML highlights that area of the web page.

Mouse over each line of HTML until the table displaying the search results is highlighted. It's defined by the `<tbody>` tag (you may need to select the little arrows next to lines to reveal hierarchically-nested elements). This is how it looks in Firefox:

![inspect tbody tag in firefox](https://github.com/lcthomas/engl6701s24-lab45/firefox.png)

#### Extracting the Whole Table

Let's try to extract the information in this table.

In [None]:
# first, we define the url we want to scrape
url = "https://www.goodreads.com/search?q=classics&qid="

# next, we extract just the text from the url
response = requests.get(url).text

# then, we call BeautifulSoup to parse the text
soup = BeautifulSoup(response, "html5lib")

# finally, we find the information in the <tbody> tag and store it in the variable 'table'
table = soup.find("tbody")
# table = soup.find_all("tbody")[0]

# then we print 'table'
table

Ok, we're on the right track, but this is still messy and hard to understand. Let's try to locate and extract information from just one row of this table instead of the whole thing.

Head back to the "classics" search results page that you are inspecting in your browser (https://www.goodreads.com/search?q=classics&qid=, right- or control-click and select Inspect). The next level under the `<tbody>` tag is the `<tr>` tag; in HTML `<tr>` designates a table row. Let's extract the text of the first table row (for *Sense and Sensibility*).

#### Extracting Just One Row

#### Now You Try It: Question 4

Looking at the above code, what do we need to do next to extract just one row from this table? Store this information in a variable named `first_row`. Write the code to do this in the below cell, and then copy and paste it into your notes document once it's working.

Hint: Code that will answer this question is included in the large code block in section 4 of this notebook.

In [None]:
# enter your code below


In [None]:
# take a look at what you stored in 'first_row'
first_row

Ok, now we've narrowed it down to just one row of the table, but there's still way more information here than we want. 

#### Extracting Part of Just One Row

Looking back at the Goodreads results page using our browser inspector, we can see that each row is also composed of cells that contain specific information. These cells are encoded with the `<td>` tag. If we mouse over the first `<td>` tag, we see that it refers to the image of the book cover. We don't want that. But if we mouse over the second `<td>` tag, we see that it highlights the second cell of the row, which does contain all the information we want. Let's extract just the information in the *second* cell of this first row.

In [None]:
# in this line, we are using `.find_all()` to grab all of the <td> values stored in the 'first_row' variable,
# but we are also using Python's built-in indexing functionality to select just the information stored in the
# second <td> tag (Python indexing begins at 0).
second_cell = first_row.find_all("td")[1]

In [None]:
# run this cell to see the value of 'second_cell'
second_cell

Still a lot of stuff we don't need here, but we can now zero in on what we do want: the book title, author, average rating, year published, and number of editions.

#### Now You Try It: Question 5

Take a look at the below code. In your notes document, write comments that describe what each line of code is doing, except for the `print` lines. If you're working directly in this notebook on your own computer, you can just write the comments above each line (make sure to begin each comment line with a hash sign, `#`, to set it off as a comment!).

In [None]:
title = second_cell.find("a").find("span").text
print(title)

author = second_cell.find("a", class_="authorName").text
print(author)

all_ratings = second_cell.find_all('span', class_ = 'minirating')
print(all_ratings)

year_info = second_cell.find("span", class_="greyText smallText uitext").text.split()
print(year_info)

editions = second_cell.find("span", class_="greyText smallText uitext").text.split()[-2]
print(editions)

As you can see from the printed values above, there is a bit more processing we need to do to isolate the average rating, number of ratings, and year published values, but we are nearly there now.
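
To give a sense of what that extra processing can look like, here is a minimal sketch that uses regular expressions and string methods on two made-up strings. The strings only approximate the general shape of the values printed above; they are not real Goodreads output, and the patterns are one possible choice rather than the only one.

```python
import re

# made-up strings in the general shape of the values printed above; not real Goodreads output
sample_rating_text = "4.09 avg rating - 1,234,567 ratings"
sample_year_text = "published 1811 - 5,000 editions"

# the average rating is the number that comes right before "avg"
avg_match = re.search(r"(\d\.?\d*)\savg", sample_rating_text)
# the number of ratings is the run of digits and commas right before "rating"
count_match = re.search(r"([\d,]+) rating", sample_rating_text)

avg_rating = float(avg_match.group(1))
# remove the thousands separators before converting to an integer
num_ratings = int(count_match.group(1).replace(",", ""))

# the year is the first whitespace-separated piece that is exactly four digits
year = next(
    (piece for piece in sample_year_text.split() if piece.isdigit() and len(piece) == 4),
    None,
)

# the number of editions is the second-to-last piece of the text
num_editions = sample_year_text.split()[-2]

print(avg_rating, num_ratings, year, num_editions)
```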

### 4. Putting It All Together (and then some)

Now that we've walked through the process of extracting (most of) the information we want for just one book, let's put it all together and see what it looks like to extract this information not only for *all* of the results on the first page, but also for the next 9 pages. Read through the five code blocks below and do your best to understand what's happening in each line. The second code block will take several moments to run; wait until it completes running before moving on to the next cell.

In [None]:
# first, we define some empty lists where we will place the values we extract
book_titles = []
authors = []
avg_ratings = []
ratings = []
published_years = []
editions = []

In [None]:
# this is the big one!

# first, we define how many pages we want to scrape data from
pages_to_scrape = 10

# specify the delay between requests in seconds (here, 3 seconds)
# the delay mimics the behavior of a human paging through results so that the goodreads servers don't
# shut you down
request_delay = 3

# for each page
for page in range(1, pages_to_scrape + 1):
    
    # Construct the url for the current page
    url = "https://www.goodreads.com/search?page=" + str(page) + "&q=classics&qid=mXUTlUsh6g&search_type=books&tab=books&utf8=✓"
   
    # wrap the code in a `try`/`except` block to help with error handling and so that the whole thing doesn't
    # shut down if it encounters errors
    try:
        # send an http .get() request to the url constructed above 
        response = requests.get(url).text

        # Parse the html content using BeautifulSoup
        soup = BeautifulSoup(response, "html5lib")
    
        # check for server errors or maintenance
        # if there's a server error, inform the user and then skip that page
        if soup.title and "service unavailable" in soup.title.text.lower():
            print(f"Server error on page {page}. Skipping...")
            continue

        # select the table containing the list of books
        table = soup.find("tbody")

        # for each row
        for row in table.find_all("tr"):
            cells = row.find_all("td")[1]

            # extract book title
            title = cells.find("a").find("span").text
            # append the title to the 'book_titles' list we created above
            book_titles.append(title)

            # extract author's name
            author = cells.find("a", class_="authorName").text
            # append author's name to the 'authors' list we created above
            authors.append(author)
            

            # extract ratings
            all_ratings = cells.find_all('span', class_ = 'minirating')
            # do some string operations and regular expressions work to isolate the average
            # rating value and append it to the 'avg_ratings' list
            all_ratings_text = all_ratings[0].text.strip()
            pattern_2 = re.compile(r"(\d\.?\d*)\savg")
            avg_ratings.append(pattern_2.search(all_ratings_text).group(1))

            # extract number of ratings from the data extracted above and stored in 'all_ratings_text'
            # do it using regular expressions; `[\d,]+` matches the whole number, including any
            # thousands separators (e.g., "1,234,567")
            pattern_4 = re.compile(r"([\d,]+) rating")
            ratings_matches = pattern_4.search(all_ratings_text)
            ratings.append(ratings_matches.group(1) if ratings_matches else 0)  

            # extract published year, handling cases where it may not be in the expected format
            year_info = cells.find("span", class_="greyText smallText uitext").text.split()
            year = None
            for item in year_info:
                if item.isdigit() and len(item) == 4:
                    year = item
                    break
            if year:
                published_years.append(year)   # append to the 'published_years' list
            else:
                published_years.append(0)  # handle cases where year is not found

            # extract edition information and append to the 'editions' list
            edition = cells.find("span", class_="greyText smallText uitext").text.split()[-2]
            editions.append(edition)

        # sleep to add a delay between requests
        time.sleep(request_delay)
    
    except requests.exceptions.RequestException as e:
        # handle http request errors (e.g., connection issues)
        print(f"Error on page {page}: {e}")

    except IndexError as e:
        # handle "list index out of range" error
        print(f"Index error on page {page}: {e}")

    except Exception as e:
        # handle other unexpected errors
        print(f"Unexpected error on page {page}: {e}")

#### Question 6

There are two `for` loops in the code block above (well, there are actually three, but let's just look at the first two): one begins on line 12, and one on line 36. What is the code looping through in each one?

In [None]:
# after scraping all pages, we create a dictionary to store the collected data
data = {
    "Title": book_titles,
    "Author": authors,
    "Average Rating": avg_ratings,
    "Rating": ratings,
    "Year Published": published_years,
    "Editions": editions
}

In [None]:
# finally, we use pandas to create a dataframe to display the data
goodreads = pd.DataFrame(data)

In [None]:
# display the first and last five rows of the dataframe
goodreads

If you're running this notebook on your own computer, you can uncomment the below line of code to save this dataframe as a `.csv` file to your computer. It will save to the same directory where you stored this notebook file. Then, you can open up this file using Excel on your own computer. I've also placed a copy of this file in the lab 5 folder on our Canvas site.

In [None]:
# goodreads.to_csv("Goodreads_classics_top100.csv", index=False)