# Scraping and crawling the web: Extra challenges

## Challenges: Making requests

1. Write a function called `get_html` that takes a URL as an argument and returns the HTML contents as a string. Test your function on the page for [Sir Tim Berners-Lee](https://en.wikipedia.org/wiki/Tim_Berners-Lee).
2. What happens if the request doesn't go so smoothly? Add a defensive measure to your function to check that the response received was successful.

### Part 1

In [51]:
# solution
import requests

def get_html(url):
    response = requests.get(url)
    return response.text

url = 'https://en.wikipedia.org/wiki/Tim_Berners-Lee'
html = get_html(url)

html[:1000]

'<!DOCTYPE html>\n<html class="client-nojs" lang="en" dir="ltr">\n<head>\n<meta charset="UTF-8"/>\n<title>Tim Berners-Lee - Wikipedia</title>\n<script>document.documentElement.className="client-js";RLCONF={"wgBreakFrames":!1,"wgSeparatorTransformTable":["",""],"wgDigitTransformTable":["",""],"wgDefaultDateFormat":"dmy","wgMonthNames":["","January","February","March","April","May","June","July","August","September","October","November","December"],"wgRequestId":"062c039e-507e-42b9-98bd-1d8acfcc4b0a","wgCSPNonce":!1,"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":!1,"wgNamespaceNumber":0,"wgPageName":"Tim_Berners-Lee","wgTitle":"Tim Berners-Lee","wgCurRevisionId":1013601970,"wgRevisionId":1013601970,"wgArticleId":30034,"wgIsArticle":!0,"wgIsRedirect":!1,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["Pages containing links to subscription-only content","Wikipedia indefinitely move-protected pages","Articles with short description","Short description is di

### Part 2

In [52]:
# solution
def get_html(url):
    response = requests.get(url)
    assert response.ok, "Whoops, this request didn't go as planned!"
    return response.text
    
url = 'https://en.wikipedia.org/wiki/Tim_Berners-Lee'
html = get_html(url)

html[:1000]

'<!DOCTYPE html>\n<html class="client-nojs" lang="en" dir="ltr">\n<head>\n<meta charset="UTF-8"/>\n<title>Tim Berners-Lee - Wikipedia</title>\n<script>document.documentElement.className="client-js";RLCONF={"wgBreakFrames":!1,"wgSeparatorTransformTable":["",""],"wgDigitTransformTable":["",""],"wgDefaultDateFormat":"dmy","wgMonthNames":["","January","February","March","April","May","June","July","August","September","October","November","December"],"wgRequestId":"062c039e-507e-42b9-98bd-1d8acfcc4b0a","wgCSPNonce":!1,"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":!1,"wgNamespaceNumber":0,"wgPageName":"Tim_Berners-Lee","wgTitle":"Tim Berners-Lee","wgCurRevisionId":1013601970,"wgRevisionId":1013601970,"wgArticleId":30034,"wgIsArticle":!0,"wgIsRedirect":!1,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["Pages containing links to subscription-only content","Wikipedia indefinitely move-protected pages","Articles with short description","Short description is di
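One caveat with the `assert` approach: Python skips `assert` statements when it is run with the `-O` flag, so the check can silently disappear. Below is a minimal sketch of an explicit check instead. The names `RequestFailed` and `ensure_ok` are hypothetical, and the stand-in objects only mimic the `ok` and `status_code` attributes of a real `requests` response so the example runs without a network connection.

```python
from types import SimpleNamespace


class RequestFailed(Exception):
    """Hypothetical exception type used only in this sketch."""


def ensure_ok(response):
    """Return `response` unchanged if it succeeded, otherwise raise."""
    if not response.ok:
        raise RequestFailed(
            f"Request failed with status code {response.status_code}"
        )
    return response


# Stand-ins for real responses: they carry just the attributes we check.
ok_response = SimpleNamespace(ok=True, status_code=200)
bad_response = SimpleNamespace(ok=False, status_code=404)

ensure_ok(ok_response)  # passes silently
try:
    ensure_ok(bad_response)
except RequestFailed as error:
    print(error)
```

With a real response object, `requests` also offers `response.raise_for_status()`, which raises an `HTTPError` for 4xx and 5xx status codes and could replace the hand-written check.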

## Challenges: Parsing HTML

Imagine you work in the field of education, and your specialty is studying higher education institutions. You're wondering how different disciplines change over time. Is it true that disciplines are incorporating more computational techniques as the years go on? Is that true for all disciplines or only some? Can we spot emerging themes across a whole university?

To answer these questions, we're going to need data. We're going to collect a dataset of all courses registered at UC Berkeley, not just those being taught this semester but all courses currently approved to be taught. These are listed on [this page](http://guide.berkeley.edu/courses/), called the Academic Guide. Well, actually they're not directly listed on that page. That page lists the departments/programs/units that teach currently approved courses. If we click on each department (for the sake of brevity, I'm just going to call them all "departments"), we can see the list of all courses they're approved to teach. For example, [here's](http://guide.berkeley.edu/courses/aerospc/) the page for Aerospace Studies. We'll call these pages departmental pages.

### Challenge

View the source HTML of [the page listing all departments](http://guide.berkeley.edu/courses/), and see if you can find the part of the HTML where the departments are listed. There's a lot of other stuff in the file that we don't care too much about. You could try `Ctrl-F`ing for the name of a department you can see on the webpage.







**Solution**

You should see something like this:


```
<div id="atozindex">
<h2 class="letternav-head" id='A'><a name='A'>A</a></h2>
<ul>
<li><a href="/courses/aerospc/">Aerospace Studies (AEROSPC)</a></li>
<li><a href="/courses/africam/">African American Studies (AFRICAM)</a></li>
<li><a href="/courses/a,resec/">Agricultural and Resource Economics (A,RESEC)</a></li>
<li><a href="/courses/amerstd/">American Studies (AMERSTD)</a></li>
<li><a href="/courses/ahma/">Ancient History and Mediterranean Archaeology (AHMA)</a></li>
<li><a href="/courses/anthro/">Anthropology (ANTHRO)</a></li>
<li><a href="/courses/ast/">Applied Science and Technology (AST)</a></li>
<li><a href="/courses/arabic/">Arabic (ARABIC)</a></li>
<li><a href="/courses/arch/">Architecture (ARCH)</a></li>
```

This is HTML. HTML uses "tags", code that surrounds the raw text and indicates the structure of the content. The tags are enclosed in `<` and `>` symbols. The `<li>` says "this is a new thing in a list" and `</li>` says "that's the end of that new thing in the list". Similarly, the `<a ...>` and the `</a>` say, "everything between us is a hyperlink". In this HTML file, each department is listed in a list with `<li>...</li>` and is also linked to its own page using `<a>...</a>`. In our browser, if we click on the name of the department, it takes us to that department's own page. The browser knows where to go because the `<a>...</a>` tag tells it what page to go to. You'll see inside the `<a>` bit, there's a `href=...`. That tells us the (relative) location of the page it's linked to.
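To make the role of the `<a>` tag and its `href` attribute concrete, here is a small sketch that walks over two of the list items above and collects each link's target and text. It uses `html.parser` from Python's standard library, which is an assumption for illustration only and not the approach taken in the rest of this section. The class name `LinkCollector` is hypothetical.

```python
from html.parser import HTMLParser

snippet = """
<ul>
<li><a href="/courses/aerospc/">Aerospace Studies (AEROSPC)</a></li>
<li><a href="/courses/africam/">African American Studies (AFRICAM)</a></li>
</ul>
"""


class LinkCollector(HTMLParser):
    """Collect (href, text) pairs for every <a> element it sees."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._inside_link = False
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs, e.g. [('href', '/courses/aerospc/')]
        if tag == "a":
            self._inside_link = True
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        # Only keep text that sits between <a ...> and </a>
        if self._inside_link:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._inside_link:
            self.links.append((self._href, "".join(self._text)))
            self._inside_link = False


collector = LinkCollector()
collector.feed(snippet)
for href, text in collector.links:
    print(href, "->", text)
```

Each printed line pairs the relative location from `href` with the clickable text between the opening and closing tags.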

### Challenge

Look at the HTML source of [the page for the Aerospace Studies department](http://guide.berkeley.edu/courses/aerospc/), and try to find the part of the file where the information on each course is. Again, try searching for it using `Ctrl-F`.

**Solution**


```
<div class="courseblock">

<button class="btn_toggleCoursebody" aria-expanded="false" aria-controls="cb_aerospc1a" data-toggle="#cb_aerospc1a">

<a name="spanaerospc1aspanspanfoundationsoftheu.s.airforcespanspan1unitspan"></a>
<h3 class="courseblocktitle">
<span class="code">AEROSPC 1A</span> 
<span class="title">Foundations of the U.S. Air Force</span> 
<span class="hours">1 Unit</span>
</h3>
```

The content that we care about is enclosed within HTML tags. It looks like the course code is enclosed in a `span` tag, which has a `class` attribute with the value `"code"`. What we'll have to do is extract out the information we care about by specifying what tag it's enclosed in.

But first, we're going to need to get the HTML of the first page.

### Challenge

Get the HTML content of `http://guide.berkeley.edu/courses/` and store it in a variable called `academic_guide_html`. You can use the `get_html` function you wrote before.

Print the first 500 characters to see what we got back.

In [53]:
# your solution here

In [54]:
# solution
academic_guide_url = 'http://guide.berkeley.edu/courses/'
academic_guide_html = get_html(academic_guide_url)
print(academic_guide_html[:500])

<!doctype html>
<html xml:lang="en" lang="en" dir="ltr">

<head>
<title>Courses &lt; University of California, Berkeley</title>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<link rel="search" type="application/opensearchdescription+xml"
			href="/search/opensearch.xml" title="Catalog" />
<meta name="viewport" content="width=device-width, initial-scale=1.0, minimum-scale=1.0" />
<link href="/favicon.ico" rel="shortcut icon" />
<link rel="stylesheet" type="text/css" href="


Great, we've got the HTML contents of the Academic Guide site we want to scrape. Now we can parse it. ["Parsing"](https://en.wikipedia.org/wiki/Parsing) means to turn a string of data into a structured representation. When we're parsing HTML, we're taking the Python string and turning it into a tree. The Python package `BeautifulSoup` does all our HTML parsing for us. We give it our HTML as a string and it returns a parsed HTML tree. Here, we're also telling BeautifulSoup to use the `lxml` parser behind the scenes.

In [55]:
from bs4 import BeautifulSoup

academic_guide_soup = BeautifulSoup(academic_guide_html, 'lxml')
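To see what "turning a string into a tree" looks like in practice, here is a small sketch on a hand-written snippet rather than the live page. It assumes `bs4` is installed and uses Python's built-in `'html.parser'` backend instead of `lxml`, purely so the example has one fewer dependency.

```python
from bs4 import BeautifulSoup

snippet = """
<div id="atozindex">
<ul>
<li><a href="/courses/aerospc/">Aerospace Studies (AEROSPC)</a></li>
<li><a href="/courses/africam/">African American Studies (AFRICAM)</a></li>
</ul>
</div>
"""

# 'html.parser' ships with Python; the notebook itself uses 'lxml'.
soup = BeautifulSoup(snippet, "html.parser")

first_link = soup.find("a")            # first <a> element in the tree
print(first_link.name)                 # the tag's own name
print(first_link.parent.name)          # the element that encloses it
print(first_link.parent.parent.name)   # one level further up the tree
print(len(soup.find_all("li")))        # number of list items found
```

Because the result is a tree, every element knows its parent and its children, which is what lets us ask questions like "find every `a` inside this `div`".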

We said before that all the departments were listed on the Academic Guide page with links to their departmental page, where the actual courses are listed. So we can find all the departments by looking in our parsed HTML for all the links. Remember that the links are represented in the HTML with the `<a>...</a>` tag, so we ask our `academic_guide_soup` to find us all the tags called `a`. What we get back is a list of all the `a` elements in the HTML page.

In [56]:
links = academic_guide_soup.find_all('a')
# print an arbitrary link element
links[48]

<a name="A">A</a>

So now we have a list of `a` elements, each of which represents a link on the Academic Guide page. But there are other links on this page in addition to the ones we care about, for example, a link back to the UC Berkeley home page. How can we filter out all the links we don't care about?

### Challenge

Look through the list `links`, or the HTML source, and figure out how we can identify just the links that we care about, namely the links to departmental pages.

In [57]:
# your solution here

In [58]:
# solution
import re

def is_departmental_page(link):
    """
    Return true if `link` points to a departmental page.
    
    By examining the source HTML by eye, I noticed that 
    the links we care about (i.e. the departmental pages) 
    all point to a relative path that starts with "/courses/".
    This function uses that idea to determine if the link is 
    a departmental page.
    """
    # some links don't have a href attribute, only a name attribute
    # we don't care about them
    try:
        href = link.attrs['href'] 
    except KeyError:
        return False
    pattern = r'/courses/(.*)/'
    match = re.search(pattern, href)
    return bool(match)

print(links[0])
print(is_departmental_page(links[0]))
print()
print(links[48])
print(is_departmental_page(links[48]))

<a href="#main" rel="section">Skip to Content</a>
False

<a name="A">A</a>
False
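The pattern used in `is_departmental_page` can be tried on plain strings to see what it accepts. The sketch below is illustrative only: apart from `/courses/aerospc/` and `#main`, the sample paths are made up. It also contrasts `re.search`, which looks for the pattern anywhere in the string, with `re.match`, which only matches at the start of the string.

```python
import re

pattern = r'/courses/(.*)/'

sample_hrefs = [
    '/courses/aerospc/',        # a departmental page
    '#main',                    # an in-page anchor
    '/undergraduate/',          # hypothetical non-course path
    '/archive/courses/old/',    # hypothetical path that merely contains "/courses/"
]

results = {}
for href in sample_hrefs:
    found_anywhere = bool(re.search(pattern, href))
    found_at_start = bool(re.match(pattern, href))
    results[href] = (found_anywhere, found_at_start)
    print(href, found_anywhere, found_at_start)
```

On the real page the two approaches may well agree, but if a link ever contained `/courses/` in the middle of its path, only `re.match` would enforce the "starts with" rule described in the docstring.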


Let's use our new `is_departmental_page` function to filter out the links we don't care about. How many departments do we have?

In [59]:
departmental_page_links = [link for link in links if is_departmental_page(link)]
len(departmental_page_links)

185

Each item in our `departmental_page_links` list is an HTML element representing a link. Each element contains not only the relative location of the link but also the text that is linked (i.e. the words on the page that are underlined and that you can click on to go to the linked page). In BeautifulSoup, we can get that text by asking for it with `element.text`, like this:

In [60]:
departmental_page_links[0].text

'Aerospace Studies (AEROSPC)'

### Challenge

From the `departmental_page_links`, we can extract out the name and the code for each department. Try doing this.

In [61]:
# your solution here

In [62]:
# solution
import re

def extract_department_name_and_code(departmental_link):
    """
    Return the (name, code) for a department.
    
    The easiest way to do this is to use regular expressions. 
    We're not going to cover regular expressions in this workshop, 
    but here's how to do it anyway.
    """
    text = departmental_link.text
    pattern = r'([^(]+) \((.*)\)'
    match = re.search(pattern, text)
    if match:
        return match.group(1), match.group(2)

extract_department_name_and_code(departmental_page_links[0])
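Since regular expressions aren't covered in this workshop, here is a short sketch of how the pattern behaves on plain strings. Reading it left to right: `([^(]+)` captures one or more characters that are not an opening parenthesis (the name), ` \(` matches a space followed by a literal `(`, `(.*)` captures the code, and `\)` matches the closing parenthesis. The helper name `split_name_and_code` is hypothetical.

```python
import re

pattern = r'([^(]+) \((.*)\)'


def split_name_and_code(text):
    """Return (name, code) if `text` looks like 'Name (CODE)', else None."""
    match = re.search(pattern, text)
    if match:
        return match.group(1), match.group(2)
    return None


print(split_name_and_code('Aerospace Studies (AEROSPC)'))
# ('Aerospace Studies', 'AEROSPC')
print(split_name_and_code('Agricultural and Resource Economics (A,RESEC)'))
# ('Agricultural and Resource Economics', 'A,RESEC')
print(split_name_and_code('A'))
# None
```

The last call shows why it matters which links we pass in: text without a parenthesized code, such as the letter headings, produces `None` rather than a name and code.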

From each link in our `departmental_page_links` list, we can get the relative link that it points to like this:

In [63]:
departmental_page_links[0].attrs['href']

'/courses/aerospc/'

### Challenge

Write a function that extracts out the relative link of a link element.

*Hint: This has a similar solution to our `is_departmental_page` function from before.*

In [64]:
# your solution here

In [65]:
# solution
def extract_relative_link(departmental_link):
    """
    We noted above that all the departmental links point to "/courses/something/", 
    where the "something" looks a lot like their code. This function 
    extracts out that "something", so we can add it to the base URL of 
    the Academic Guide page and get full paths to each departmental page.
    """
    href = departmental_link.attrs['href']
    pattern = r'/courses/(.*)/'
    match = re.search(pattern, href)
    if match:
        return match.group(1)

extract_relative_link(departmental_page_links[0])

'aerospc'

All right! Now we've identified all the departmental links on the Academic Guide page, we've found their name and code, and we know the relative link they point to. Next, we can use this relative link to construct the full URL they point to, which we'll then use to scrape the HTML for each departmental page.

Let's write a function that takes a departmental link and returns the absolute URL of its departmental page.

In [66]:
def construct_absolute_url(departmental_link):
    relative_link = extract_relative_link(departmental_link)
    return academic_guide_url + relative_link

construct_absolute_url(departmental_page_links[37])

'http://guide.berkeley.edu/courses/civ_eng'
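As an alternative to extracting part of the path and concatenating strings, the standard library's `urllib.parse.urljoin` can resolve a relative link against a base URL directly. This is a sketch of that option, not the method used in the rest of this section.

```python
from urllib.parse import urljoin

academic_guide_url = 'http://guide.berkeley.edu/courses/'

# A link that starts with "/" is resolved against the site root.
from_root = urljoin(academic_guide_url, '/courses/aerospc/')
print(from_root)

# A link without a leading "/" is resolved against the base path.
from_base = urljoin(academic_guide_url, 'civ_eng/')
print(from_base)
```

This works with the raw `href` value, so no regular expression is needed, and it keeps the trailing slash from the original link.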

To summarize so far: starting from the URL of the Academic Guide website, we found all the departments that offer approved courses and identified each one's name, code, and the link to its departmental page, which lists all the courses it teaches.

Now we want to get the HTML for each departmental page and scrape it for all the courses it offers. Let's focus on one page for now, the Aerospace Studies page. Once we select the link, we use our functions from above to get the name (I guess we already know it's Aerospace, but whatever) and code, construct the full URL, get the HTML for that URL, and then parse the HTML.

In [67]:
aerospace_link = departmental_page_links[0]
aerospace_name, aerospace_code = extract_department_name_and_code(aerospace_link)
aerospace_url = construct_absolute_url(aerospace_link)
aerospace_html = get_html(aerospace_url)
aerospace_soup = BeautifulSoup(aerospace_html, 'lxml')
print(aerospace_html[:500])

<!doctype html>
<html xml:lang="en" lang="en" dir="ltr">

<head>
<title>Aerospace Studies (AEROSPC) &lt; University of California, Berkeley</title>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<meta name="description" content="Aerospace Studies Courses" />
<link rel="search" type="application/opensearchdescription+xml"
			href="/search/opensearch.xml" title="Catalog" />
<meta name="viewport" content="width=device-width, initial-scale=1.0, minimum-scale=1.0" />
<link href


Right at the start of this section on parsing HTML, we saw the HTML for a departmental page. Here it is again.

```
<div class="courseblock">

<button class="btn_toggleCoursebody" aria-expanded="false" aria-controls="cb_aerospc1a" data-toggle="#cb_aerospc1a">

<a name="spanaerospc1aspanspanfoundationsoftheu.s.airforcespanspan1unitspan"></a>
<h3 class="courseblocktitle">
<span class="code">AEROSPC 1A</span> 
<span class="title">Foundations of the U.S. Air Force</span> 
<span class="hours">1 Unit</span>
</h3>
```

It looks like each course is listed in a `div` element that has a `class` attribute with value `"courseblock"`. We can use this information to identify all the courses on a page and then extract out the information from them. You've seen how to do this before, but here it is again:

In [68]:
aerospace_courseblocks = aerospace_soup.find_all(class_='courseblock')
len(aerospace_courseblocks)

7

Looks like the Aerospace department has seven current courses they're approved to teach (at the time of writing). Looking at the page in our browser, that looks right to me! So now we have a list called `aerospace_courseblocks` that holds seven elements that each refer to one course taught by the Aerospace department. Now we can extract out any information we care about. We just have to look at the page in our browser, decide what information we care about, then look at the HTML source to see where that information is kept in the HTML structure. Finally, we write a function for each piece of information we want to extract out of a course.

### Challenge
Write functions to take a courseblock and extract:
- The course code (e.g. AEROSPC 1A)
- The course name
- The number of units
- The textual description of the course

In [69]:
# your solution here

In [70]:
# solution
def extract_course_code(courseblock):
    span = courseblock.find(class_='code')
    return span.text

def extract_course_title(courseblock):
    span = courseblock.find(class_='title')
    return span.text

def extract_course_units(courseblock):
    span = courseblock.find(class_='hours')
    return span.text

def extract_course_description(courseblock):
    span = courseblock.find(class_='coursebody')
    return span.text

def extract_one_course(courseblock):
    course = {}
    course['course_code'] = extract_course_code(courseblock)
    course['course_title'] = extract_course_title(courseblock)
    course['course_units'] = extract_course_units(courseblock)
    course['course_description'] = extract_course_description(courseblock)
    return course

first_aerospace_course = extract_one_course(aerospace_courseblocks[0])
for value in first_aerospace_course.values():
    print(value)
    print()

AEROSPC 1A

Foundations of the U.S. Air Force

1 Unit


Terms offered: Spring 2021, Fall 2020, Spring 2020
This course introduces students to the United States Air Force (USAF) and Air Force Reserve Officer Training Corps (AFROTC) with an overview of the basic characteristics, missions, and organization of the Air Force; additional topics include officership and professionalism, Air Force career opportunities, military customs and courtesies, and an introduction to USAF basic communication skills. Additionally, AFROTC cadets must attend weekly Leadership Lab. Leadership Lab is a weekly laboratory that touches on the topics of Air Force customs and courtesies, health and physical fitness, and drills and ceremonies.
Foundations of the U.S. Air Force: Read More [+]

Hours & FormatFall and/or spring: 15 weeks - 1 hour of lecture per week
Additional DetailsSubject/Course Level: Aerospace Studies/UndergraduateGrading/Final exam status: Letter grade. Final exam required. 
Foundations of the U
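The extraction functions above assume every courseblock contains all four elements. In BeautifulSoup, `find` returns `None` when nothing matches, so calling `.text` on the result would raise an `AttributeError`. Here is a hedged sketch of a more forgiving helper. The name `extract_text_or_none` is hypothetical, the snippet is a cut-down copy of the HTML shown earlier with no course body, and `'html.parser'` is used instead of `lxml` only to reduce dependencies.

```python
from bs4 import BeautifulSoup

snippet = """
<div class="courseblock">
<h3 class="courseblocktitle">
<span class="code">AEROSPC 1A</span>
<span class="title">Foundations of the U.S. Air Force</span>
<span class="hours">1 Unit</span>
</h3>
</div>
"""


def extract_text_or_none(courseblock, class_name):
    """Return the stripped text of the first matching element, or None."""
    element = courseblock.find(class_=class_name)
    if element is None:
        return None
    return element.get_text(strip=True)


soup = BeautifulSoup(snippet, "html.parser")
courseblock = soup.find(class_="courseblock")

code = extract_text_or_none(courseblock, "code")
description = extract_text_or_none(courseblock, "coursebody")
print(code)
print(description)
```

Returning `None` for a missing field lets a long scrape continue and leaves the gap visible in the final dataset, rather than crashing partway through.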

Let's write a function to scrape these four pieces of information from every course in every department and collect the results in a pandas `DataFrame`, which can then be saved as a CSV file.

In [71]:
def scrape_one_department(department_link):
    department_name, department_code = extract_department_name_and_code(department_link)
    department_url = construct_absolute_url(department_link)
    department_html = get_html(department_url)
    department_soup = BeautifulSoup(department_html, 'lxml')
    department_courseblocks = department_soup.find_all(class_='courseblock')
    result = []
    for courseblock in department_courseblocks:
        course = extract_one_course(courseblock)
        course['department_name'] = department_name
        course['department_code'] = department_code
        result.append(course)
    return result

aerospace_courses = scrape_one_department(aerospace_link)
for value in aerospace_courses[0].values():
    print(value)
    print()

AEROSPC 1A

Foundations of the U.S. Air Force

1 Unit


Terms offered: Spring 2021, Fall 2020, Spring 2020
This course introduces students to the United States Air Force (USAF) and Air Force Reserve Officer Training Corps (AFROTC) with an overview of the basic characteristics, missions, and organization of the Air Force; additional topics include officership and professionalism, Air Force career opportunities, military customs and courtesies, and an introduction to USAF basic communication skills. Additionally, AFROTC cadets must attend weekly Leadership Lab. Leadership Lab is a weekly laboratory that touches on the topics of Air Force customs and courtesies, health and physical fitness, and drills and ceremonies.
Foundations of the U.S. Air Force: Read More [+]

Hours & FormatFall and/or spring: 15 weeks - 1 hour of lecture per week
Additional DetailsSubject/Course Level: Aerospace Studies/UndergraduateGrading/Final exam status: Letter grade. Final exam required. 
Foundations of the U

In [72]:
import time

def scrape_all_departments(be_nice=True):
    academic_guide_url = 'http://guide.berkeley.edu/courses/'
    academic_guide_html = get_html(academic_guide_url)
    academic_guide_soup = BeautifulSoup(academic_guide_html, 'lxml')
    links = academic_guide_soup.find_all('a')
    departmental_page_links = [link for link in links if is_departmental_page(link)]
    
    result = []
    for departmental_page_link in departmental_page_links:
        department_result = scrape_one_department(departmental_page_link)
        result.extend(department_result)
        if be_nice:
            time.sleep(1)
    return result
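With almost two hundred departmental pages to fetch, a single failed request would stop the whole loop above. The sketch below shows one way to pause between requests and record failures instead of stopping. The names `scrape_politely` and `fake_scrape` are hypothetical, and `fake_scrape` stands in for `scrape_one_department` so the example runs without touching the network.

```python
import time


def scrape_politely(items, scrape_one, delay_seconds=1.0):
    """Apply `scrape_one` to each item, pausing between calls.

    Returns (results, failures), where failures is a list of
    (item, exception) pairs for the items that raised an error.
    """
    results, failures = [], []
    for item in items:
        try:
            results.extend(scrape_one(item))
        except Exception as error:  # broad on purpose: keep the crawl going
            failures.append((item, error))
        time.sleep(delay_seconds)
    return results, failures


def fake_scrape(name):
    """Stand-in scraper: fails for one made-up department."""
    if name == 'broken':
        raise AssertionError("Whoops, this request didn't go as planned!")
    return [{'department_code': name}]


results, failures = scrape_politely(
    ['aerospc', 'broken', 'anthro'], fake_scrape, delay_seconds=0
)
print(len(results), 'courses scraped,', len(failures), 'departments failed')
```

Keeping the failed items means they can be retried later without repeating the requests that already succeeded.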

In [73]:
import pandas as pd
result = scrape_all_departments(be_nice=False)
df = pd.DataFrame(result)
print(str(len(df)) + ' courses scraped')
df.head()

9842 courses scraped


| | course_code | course_title | course_units | course_description | department_name | department_code |
|---|---|---|---|---|---|---|
| 0 | AEROSPC 1A | Foundations of the U.S. Air Force | 1 Unit | \nTerms offered: Spring 2021, Fall 2020, Sprin... | Aerospace Studies | AEROSPC |
| 1 | AEROSPC 1B | Foundations of the U.S. Air Force | 1 Unit | \nTerms offered: Spring 2021, Spring 2020, Spr... | Aerospace Studies | AEROSPC |
| 2 | AEROSPC 2A | The Evolution of Air and Space Power | 1 Unit | \nTerms offered: Spring 2021, Fall 2020, Fall ... | Aerospace Studies | AEROSPC |
| 3 | AEROSPC 2B | The Evolution of Air and Space Power | 1 Unit | \nTerms offered: Spring 2021, Spring 2020, Spr... | Aerospace Studies | AEROSPC |
| 4 | AEROSPC 100 | Leadership Laboratory | 0.0 Units | \nTerms offered: Spring 2021, Fall 2020, Sprin... | Aerospace Studies | AEROSPC |


Almost 10,000 courses scraped! (The exact count depends on when you run the scraper.) Wow, that was a lot easier than doing it by hand!
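To keep the scraped data around, it can be written to a CSV file. With the `DataFrame` above, `df.to_csv('courses.csv', index=False)` would do this in one line (the filename is just an example). The sketch below shows the same idea using only the standard library's `csv` module on a list of dictionaries shaped like our `result`. The two sample rows are copied from the table above with the description column omitted for brevity, and the helper name `write_courses_csv` is hypothetical.

```python
import csv
import io

courses = [
    {
        'course_code': 'AEROSPC 1A',
        'course_title': 'Foundations of the U.S. Air Force',
        'course_units': '1 Unit',
        'department_name': 'Aerospace Studies',
        'department_code': 'AEROSPC',
    },
    {
        'course_code': 'AEROSPC 2A',
        'course_title': 'The Evolution of Air and Space Power',
        'course_units': '1 Unit',
        'department_name': 'Aerospace Studies',
        'department_code': 'AEROSPC',
    },
]


def write_courses_csv(courses, file_obj):
    """Write a list of course dictionaries to an open text file as CSV."""
    fieldnames = list(courses[0].keys())
    writer = csv.DictWriter(file_obj, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(courses)


# An in-memory buffer stands in for a real file opened with
# open('courses.csv', 'w', newline='').
buffer = io.StringIO()
write_courses_csv(courses, buffer)
print(buffer.getvalue())
```

The `csv` module takes care of quoting, so fields that contain commas or line breaks, like the course descriptions, would still round-trip correctly.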