# HTML and Web Scraping

**HTML**, which stands for "hypertext markup language", is an XML-like language for describing the structure and content of web pages. Each tag in HTML corresponds to a specific page element. For example:

- `<img>` specifies an image. The path to the image file is specified in the `src=` attribute.
- `<a>` specifies a hyperlink. The text enclosed between `<a>` and `</a>` is the text of the link that appears, while the URL is specified in the `href=` attribute of the tag.
- `<table>` specifies a table. The rows of the table are specified by `<tr>` tags nested inside the `<table>` tag, while the cells in each row are specified by `<td>` tags nested inside each `<tr>` tag.
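
To make the tags above concrete, here is a minimal sketch that feeds a made-up HTML fragment to the `html.parser` module from Python's standard library and lists each opening tag with its attributes. The fragment, the file name `logo.png`, and the class name `TagLister` are all invented for illustration; this is only a way to see tags and attributes, not the scraping approach used later in this section.

```python
from html.parser import HTMLParser

# A made-up HTML fragment containing the three kinds of tags described above.
page = """
<img src="logo.png">
<a href="https://example.com/">Example link</a>
<table>
  <tr><td>row 1, cell 1</td><td>row 1, cell 2</td></tr>
  <tr><td>row 2, cell 1</td><td>row 2, cell 2</td></tr>
</table>
"""


class TagLister(HTMLParser):
    """Record every opening tag along with its attributes."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs, e.g. [("src", "logo.png")]
        self.tags.append((tag, dict(attrs)))


lister = TagLister()
lister.feed(page)

for tag, attrs in lister.tags:
    print(tag, attrs)
```

The `<img>` and `<a>` tags carry their information in attributes (`src`, `href`), while the table carries its information in the nesting of `<tr>` and `<td>` tags.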

Our goal in this section is not to teach you HTML to make a web page. You will learn just enough HTML to be able to scrape data programmatically from a web page.

# Inspecting HTML Source Code

Suppose we want to scrape faculty information from the [Cal Poly Statistics Department directory](https://statistics.calpoly.edu/content/directory) (`https://statistics.calpoly.edu/content/directory`). Once we have identified a web page that we want to scrape, the next step is to study the HTML source code. All web browsers have a "View Source" or "Page Source" feature that will display the HTML source of a web page.

Visit the web page above, and view the HTML source of that page. (You may have to search online to figure out how to view the page source in your favorite browser.) Scroll down until you find the HTML code for the table containing information about the name, office, phone, e-mail, and office hours of the faculty members.

Notice how difficult it can be to find a page element in the HTML source. Many browsers allow you to right-click on a page element and jump to the part of the HTML source corresponding to that element.

# Web Scraping Using BeautifulSoup

`BeautifulSoup` is a Python library that makes it easy to navigate an HTML document. As with XML, we can query tags by name or attribute, and we can narrow our search to the ancestors and descendants of specific tags. In addition, many websites contain malformed HTML, which `BeautifulSoup` handles gracefully.
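
As a rough sketch of these kinds of queries, the example below parses a small made-up HTML fragment (the `id`, `class`, and link targets are invented for illustration) and searches it by tag name, by attribute, within the descendants of a tag, and upward to an ancestor.

```python
from bs4 import BeautifulSoup

# A made-up fragment; none of these names come from a real page.
page = """
<div id="people">
  <table class="directory">
    <tr><td><a href="/Ada">Ada</a></td><td>25-100</td></tr>
  </table>
</div>
<a href="/contact">Contact</a>
"""
soup = BeautifulSoup(page, "html.parser")

# Query by tag name: every <a> tag in the document.
all_links = soup.find_all("a")

# Query by attribute: the first table whose class is "directory".
table = soup.find("table", class_="directory")

# Narrow the search to descendants: only <a> tags inside that table.
table_links = table.find_all("a")

# Move up to an ancestor: the nearest enclosing <div>.
container = table.find_parent("div")

print(len(all_links), len(table_links), container["id"])
```

Calling `.find_all()` on a tag rather than on the whole document is what restricts the search to that tag's descendants.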

First, we issue an HTTP request to the URL to get the HTML source code.

In [1]:
import requests
response = requests.get("https://statistics.calpoly.edu/content/directory")

The HTML source is stored in the `.content` attribute of the response object. We pass this HTML source into `BeautifulSoup` to obtain a tree-like representation of the HTML document.

In [2]:
from bs4 import BeautifulSoup
soup = BeautifulSoup(response.content, "html.parser")

Now we can search for tags within this HTML document, using methods like `.find_all()`. For example, we can find all tables on this page.

In [3]:
tables = soup.find_all("table")
len(tables)

3

As a visual inspection of [the web page](https://statistics.calpoly.edu/content/directory) would confirm, there are 3 tables on the page (chair and staff, faculty, emeritus faculty), and we are interested in the second one (for faculty).

In [4]:
table = tables[1]
table

<table cellspacing="0" class="directory backwhite" height="2667" summary="COSAM Dean's Office Staff" width="596"><tbody><tr><th scope="col" width="18%">
<p>Name</p>
</th>
<th scope="col" width="8%">Office</th>
<th scope="col" width="10%">Phone (805)</th>
<th scope="col" width="20%">Email</th>
<th scope="col" width="20%">
<p>Office HoURS </p>
</th>
</tr><tr><td><a href="/Kelly-Bodwin"><strong>Kelly Bodwin</strong></a></td>
<td><a href="https://maps.calpoly.edu/">25-106</a></td>
<td>756-2450</td>
<td><a href="mailto:kbodwin@calpoly.edu">kbodwin@calpoly.edu</a></td>
<td>
<p><b>MW </b>12:30-2pm</p>
<p><strong>F</strong> 1:10-2pm Zoom</p>
</td>
</tr><tr><td><a href="/Matt-Carlton"><strong>Matt Carlton</strong></a></td>
<td><a href="https://maps.calpoly.edu/">25-101</a></td>
<td>756-7076</td>
<td><a href="mailto:rmlau@calpoly.edu">mcarlton@calpoly.edu</a></td>
<td>
<p><strong>MW</strong> 5:10-6:30pm</p>
<p><strong>F</strong> 3:10-4pm Zoom</p>
</td>
</tr><tr><td>
<p><a href="/Beth-Chance"><st

There is one faculty member per row (`<tr>`), except for the first row, which is the header. We iterate over all rows except the first, extracting information about each faculty member and appending it to `rows`, which we will eventually turn into a `DataFrame`. As you read the code below, refer to the HTML source above, so that you understand what each line is doing. For example, `cells[0]` represents a `<td>` tag; why do we need to call `.find("strong")` within this tag to get the name of the faculty member?


You are encouraged to add `print()` statements inside the `for` loop to check your understanding of each line of code.

Note that for most faculty members, the name and office location contain links, but not for all of them (see Nian Cheng). This is the reason for the `or` expressions below: when `.find()` does not find the tag, we fall back to the cell itself.

In [5]:
# initialize an empty list
rows = []

# iterate over all rows in the faculty table
for faculty in table.find_all("tr")[1:]:

    # Get all the cells (<td>) in the row.
    cells = faculty.find_all("td")

    # The information we need is the text between tags.

    # Find the name of the faculty member in cells[0],
    # which for most faculty is contained in a <strong> tag.
    name_tag = cells[0].find("strong") or cells[0]
    name = name_tag.text

    # Find the office of the faculty member in cells[1],
    # which for most faculty is contained in an <a> tag.
    link = cells[1].find("a") or cells[1]
    office = link.text

    # Find the email of the faculty member in cells[3],
    # which for most faculty is contained in an <a> tag.
    email_tag = cells[3].find("a") or cells[3]
    email = email_tag.text

    # Append this data.
    rows.append({
        "name": name,
        "office": office,
        "email": email
    })

In the code above, observe that `.find_all()` returns a list with all matching tags, while `.find()` returns only the first matching tag. If no matching tags are found, then `.find_all()` will return an empty list `[]`, while `.find()` will return `None`.
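
The difference between the two methods, and the reason the `or` fallback works, can be seen in a small sketch. The two-cell table below is made up for illustration: one cell wraps its text in `<strong>`, and the other does not.

```python
from bs4 import BeautifulSoup

# Two made-up cells: only the first wraps its text in <strong>.
page = "<table><tr><td><strong>Ada</strong></td><td>Grace</td></tr></table>"
soup = BeautifulSoup(page, "html.parser")
cells = soup.find_all("td")

found = cells[0].find("strong")            # a tag: <strong>Ada</strong>
missing = cells[1].find("strong")          # None, since there is no match
missing_all = cells[1].find_all("strong")  # an empty list

names = []
for cell in cells:
    # .find() returns None when nothing matches, so `or` falls back to the cell.
    tag = cell.find("strong") or cell
    names.append(tag.text)

print(missing, missing_all, names)
```

Without the fallback, calling `.text` on the `None` returned for the second cell would raise an `AttributeError`.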

Finally, we turn `rows` into a `DataFrame`.

In [6]:
import pandas as pd
pd.DataFrame(rows)

|   | name | office | email |
|---|------|--------|-------|
| 0 | Kelly Bodwin | 25-106 | kbodwin@calpoly.edu |
| 1 | Matt Carlton | 25-101 | mcarlton@calpoly.edu |
| 2 | Beth Chance | 25-235 | bchance@calpoly.edu |
| 3 | Nian Cheng | Remote | ncheng@calpoly.edu |
| 4 | Olga Dekhtyar | 25-225 | odekhtya@calpoly.edu |
| 5 | Sinem Demirci | 25-213 | sdemirci@calpoly.edu |
| 6 | Jimmy Doi | 25-108 | jdoi@calpoly.edu |
| 7 | Samuel Frame | 25-121 | sframe@calpoly.edu |
| 8 | Frank Giron | 25-207 | frgiron@calpoly.edu |
| 9 | Hunter Glanz | 25-111 | hglanz@calpoly.edu |


Now this data is ready for further processing.

# Ethical tidbit: `robots.txt`

Web robots are crawling the web all the time. A website may want to restrict robots from crawling specific pages. One reason is financial: each visit to a web page, by a human or a robot, costs the website money, and the website may prefer to save its limited bandwidth for human visitors. Another reason is privacy: a website may not want a search engine to preserve a snapshot of a page for all eternity.

To specify what a web robot is and isn't allowed to crawl, most websites will place a text file named `robots.txt` in the top-level directory of the web server. For example, the Statistics department web page has a `robots.txt` file: https://statistics.calpoly.edu/robots.txt

The format of the `robots.txt` file should be self-explanatory, but you can read a full specification of the standard here: http://www.robotstxt.org/robotstxt.html. As you scrape websites using your web robot, always check the `robots.txt` file first, to make sure that you are respecting the wishes of the website owner.
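
This check can also be done in code. The sketch below uses `urllib.robotparser` from Python's standard library on a set of made-up rules; the rules and the `example.com` URLs are invented for illustration and are not the contents of any real site's `robots.txt`. The helper name `allowed` is likewise hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Made-up rules for illustration; NOT the contents of any real robots.txt.
rules = """
User-agent: *
Disallow: /admin/
Disallow: /search
"""

robots = RobotFileParser()
robots.parse(rules.splitlines())


def allowed(url, user_agent="*"):
    """Return True if the rules permit `user_agent` to fetch `url`."""
    return robots.can_fetch(user_agent, url)


print(allowed("https://example.com/content/directory"))
print(allowed("https://example.com/admin/settings"))
```

To check a live site instead, `RobotFileParser` also provides `set_url()` and `read()`, which download and parse the site's actual `robots.txt` file before you call `can_fetch()`.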