# Scraping and crawling the web: Extra challenges

## Challenges: Making requests

1. Write a function called `get_html` that takes a URL as an argument and returns the HTML contents as a string. Test your function on the page for [Sir Tim Berners-Lee](https://en.wikipedia.org/wiki/Tim_Berners-Lee).
2. What happens if the request doesn't go so smoothly? Add a defensive measure to your function to check that the response received was successful.

### Part 1

In [None]:
# Your solution here


### Part 2

In [None]:
# Your solution here
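
If you get stuck, here is one possible sketch covering both parts. It assumes the `requests` library, which the challenge text does not name; the 10-second timeout is also an arbitrary choice of ours.

```python
import requests


def get_html(url):
    """Return the HTML at `url` as a string, raising if the request failed."""
    # Assumption: a 10-second timeout, so a stalled server cannot hang the notebook.
    response = requests.get(url, timeout=10)
    # raise_for_status() raises requests.HTTPError for 4xx and 5xx responses.
    response.raise_for_status()
    return response.text
```

Calling `get_html('https://en.wikipedia.org/wiki/Tim_Berners-Lee')` should then return the page source as a string, while a missing page would raise an error instead of silently returning an error page.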


## Challenge: Parsing HTML

Scrape a clean version of the explanation from [this claim review by FactCheck.org](https://www.factcheck.org/2019/10/trumps-claims-about-hunter-biden-in-china/). As usual, use your browser to inspect the page's HTML and identify any element types and/or classes that uniquely enclose the explanation (and nothing else).

In [None]:
# Your solution here


## More Challenges: Extended parsing case study

Imagine you work in the field of education and your specialty is studying higher education institutions. You're wondering how different disciplines change over time. Is it true that disciplines are incorporating more computational techniques as the years go on? Is that true for all disciplines or only some? Can we spot emerging themes across a whole university?

To answer these questions, we're going to need data. We're going to collect a dataset of all courses registered at UC Berkeley, not just those being taught this semester but all courses currently approved to be taught. These are listed on [this page](http://guide.berkeley.edu/courses/), called the Academic Guide. Well, actually they're not directly listed on that page. That page lists the departments/programs/units that teach currently approved courses. If we click on each department (for the sake of brevity, I'm just going to call them all "departments"), we can see the list of all courses they're approved to teach. For example, [here's](http://guide.berkeley.edu/courses/aerospc/) the page for Aerospace Studies. We'll call these pages departmental pages.

### Challenge

View the source HTML of [the page listing all departments](http://guide.berkeley.edu/courses/), and see if you can find the part of the HTML where the departments are listed. There's a lot of other stuff in the file that we don't care too much about. You could try `Ctrl-F`ing for the name of a department you can see on the webpage.


### Challenge

Look at the HTML source of [the page for the Aerospace Studies department](http://guide.berkeley.edu/courses/aerospc/), and try to find the part of the file where the information on each course is. Again, try searching for it using `Ctrl-F`.


### Challenge

Get the HTML content of `http://guide.berkeley.edu/courses/` and store it in a variable called `academic_guide_html`. You can use the `get_html` function you wrote before.

Print the first 500 characters to see what we got back.

In [None]:
# your solution here

We said before that all the departments are listed on the Academic Guide page, each with a link to its departmental page, where the actual courses are listed. So we can find all the departments by looking in our parsed HTML for all the links. Remember that links are represented in HTML with the `<a>...</a>` tag, so once we have parsed the page into `academic_guide_soup`, we ask it to find all the tags called `a`. What we get back is a list of all the `a` elements in the HTML page.

In [None]:
from bs4 import BeautifulSoup

academic_guide_soup = BeautifulSoup(academic_guide_html, 'lxml')

links = academic_guide_soup.find_all('a')
# look at one link element (index 48 is an arbitrary choice)
links[48]

So now we have a list of `a` elements, each of which represents a link on the Academic Guide page. But there are other links on this page in addition to the ones we care about, such as a link back to the UC Berkeley home page. How can we filter out all the links we don't care about?

### Challenge

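Look through the list `links`, or the HTML source, and figure out how we can identify just the links that we care about, namely the links to departmental pages. Then write a function called `is_departmental_page` that takes a link element and returns `True` only for those links.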

In [None]:
# your solution here
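
One way to approach this is sketched below. It rests on an assumption you should check against the real page: that departmental links are root-relative and look like `/courses/<code>/`. The sample HTML is made up for illustration, and the sketch uses the built-in `'html.parser'` so that it runs without `lxml`.

```python
import re

from bs4 import BeautifulSoup

# Assumption: departmental links are root-relative and look like "/courses/<code>/".
DEPARTMENT_HREF_PATTERN = re.compile(r'^/courses/[^/]+/$')


def is_departmental_page(link):
    """Return True if an `a` element seems to point at a departmental page."""
    # Some `a` elements have no href at all, so fall back to an empty string.
    href = link.attrs.get('href', '')
    return DEPARTMENT_HREF_PATTERN.match(href) is not None


# Made-up HTML for illustration, not copied from the real page.
sample_html = """
<a href="https://www.berkeley.edu">UC Berkeley</a>
<a href="/courses/">Courses</a>
<a href="/courses/aerospc/">Aerospace Studies (AEROSPC)</a>
<a name="top"></a>
"""
sample_links = BeautifulSoup(sample_html, 'html.parser').find_all('a')
print([is_departmental_page(link) for link in sample_links])
```

Only the third sample link fits the pattern. If the real `href` values turn out to look different, the regular expression is the only part that should need changing.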

Let's use our new `is_departmental_page` function to filter out the links we don't care about. How many departments do we have?

In [None]:
departmental_page_links = [link for link in links if is_departmental_page(link)]
len(departmental_page_links)

Each item in our `departmental_page_links` list is an HTML element representing a link. Each element contains not only the relative location of the link but also the text that is linked (i.e. the words on the page that are underlined and that you can click on to go to the linked page). In BeautifulSoup, we can get that text by asking for it with `element.text`, like this:

In [None]:
departmental_page_links[0].text

### Challenge

From the `departmental_page_links`, we can extract the name and the code for each department. Try doing this with a function called `extract_department_name_and_code`, which we'll use again below.

In [None]:
# your solution here
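
A possible sketch follows. It assumes the link text has the form `Department Name (CODE)`, which you should confirm by looking at a few elements of `departmental_page_links`; the sample link is made up, and `'html.parser'` is used only so the sketch runs without `lxml`.

```python
import re

from bs4 import BeautifulSoup

# Assumption: the link text looks like "Department Name (CODE)".
NAME_AND_CODE_PATTERN = re.compile(r'^(?P<name>.+?)\s*\((?P<code>[^()]+)\)$')


def extract_department_name_and_code(link):
    """Return a (name, code) tuple; code is None if the text does not fit the pattern."""
    text = link.text.strip()
    match = NAME_AND_CODE_PATTERN.match(text)
    if match is None:
        return text, None
    return match.group('name'), match.group('code')


# Made-up link element for illustration.
sample_link = BeautifulSoup(
    '<a href="/courses/aerospc/">Aerospace Studies (AEROSPC)</a>', 'html.parser'
).a
print(extract_department_name_and_code(sample_link))
```

Returning a tuple lets the caller unpack both values in one line, as in `name, code = extract_department_name_and_code(link)`.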

From each link in our `departmental_page_links` list, we can get the relative link that it points to like this:

In [None]:
departmental_page_links[0].attrs['href']

### Challenge

Write a function called `extract_relative_link` that extracts the relative link from a link element.

*Hint: This has a similar solution to our `is_departmental_page` function from before.*

In [None]:
# your solution here
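
A minimal sketch, assuming you are happy to get `None` back for an `a` element that has no `href` (indexing with `attrs['href']` would raise a `KeyError` instead). The sample link is made up for illustration.

```python
from bs4 import BeautifulSoup


def extract_relative_link(link):
    """Return the href of an `a` element, or None if it has no href."""
    return link.attrs.get('href')


# Made-up link element for illustration.
sample_link = BeautifulSoup(
    '<a href="/courses/aerospc/">Aerospace Studies (AEROSPC)</a>', 'html.parser'
).a
print(extract_relative_link(sample_link))
```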

All right! Now we've identified all the departmental links on the Academic Guide page, we've found their name and code, and we know the relative link they point to. Next, we can use this relative link to construct the full URL they point to, which we'll then use to scrape the HTML for each departmental page.

Let's write a function that takes a departmental link and returns the absolute URL of its departmental page.

In [None]:
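from urllib.parse import urljoin

academic_guide_url = 'http://guide.berkeley.edu/courses/'

def construct_absolute_url(departmental_link):
    relative_link = extract_relative_link(departmental_link)
    # urljoin resolves the link against the base URL, even if it starts with /courses/
    return urljoin(academic_guide_url, relative_link)

construct_absolute_url(departmental_page_links[37])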
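
How the relative link should be combined with the base URL depends on what the `href` values look like. The sketch below compares plain string concatenation with `urllib.parse.urljoin` for two assumed `href` forms; neither value is copied from the real page.

```python
from urllib.parse import urljoin

base_url = 'http://guide.berkeley.edu/courses/'
# Assumed href values, for illustration only.
root_relative = '/courses/aerospc/'
page_relative = 'aerospc/'

# Plain concatenation repeats the /courses/ segment for a root-relative href.
concatenated = base_url + root_relative
# urljoin resolves either form against the base URL.
joined_root = urljoin(base_url, root_relative)
joined_page = urljoin(base_url, page_relative)

print(concatenated)
print(joined_root)
print(joined_page)
```

Both `urljoin` calls produce the same departmental URL, which is why it tends to be the safer choice when you are not sure which form a site uses.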

To summarize so far: starting from the URL of the Academic Guide website, we've found all the departments that offer approved courses and identified each one's name, code, and the link to its departmental page, which lists all the courses it teaches.

Now we want to get the HTML for each departmental page and scrape it for all the courses the department offers. Let's focus on one page for now, the Aerospace Studies page. Once we select the link, we use our functions from above to get the name (I guess we already know it's Aerospace, but whatever) and code, get the full URL, get the HTML for that URL, and then parse the HTML.

In [None]:
aerospace_link = departmental_page_links[0]
aerospace_name, aerospace_code = extract_department_name_and_code(aerospace_link)
aerospace_url = construct_absolute_url(aerospace_link)
aerospace_html = get_html(aerospace_url)
aerospace_soup = BeautifulSoup(aerospace_html, 'lxml')
print(aerospace_html[:500])

Right at the start of this section on parsing HTML, we saw the HTML for a departmental page. Here it is again.

```
<div class="courseblock">

<button class="btn_toggleCoursebody" aria-expanded="false" aria-controls="cb_aerospc1a" data-toggle="#cb_aerospc1a">

<a name="spanaerospc1aspanspanfoundationsoftheu.s.airforcespanspan1unitspan"></a>
<h3 class="courseblocktitle">
<span class="code">AEROSPC 1A</span> 
<span class="title">Foundations of the U.S. Air Force</span> 
<span class="hours">1 Unit</span>
</h3>
```

It looks like each course is listed in a `div` element that has a `class` attribute with value `"courseblock"`. We can use this information to identify all the courses on a page and then extract the information from them. You've seen how to do this before, but here it is again:

In [None]:
aerospace_courseblocks = aerospace_soup.find_all(class_='courseblock')
len(aerospace_courseblocks)

Looks like the Aerospace department has seven current courses they're approved to teach (at the time of writing). Looking at the page in our browser, that looks right to me! So now we have a list called `aerospace_courseblocks` that holds seven elements that each refer to one course taught by the Aerospace department. Now we can extract out any information we care about. We just have to look at the page in our browser, decide what information we care about, then look at the HTML source to see where that information is kept in the HTML structure. Finally, we write a function for each piece of information we want to extract out of a course.

### Challenge
Write functions to take a courseblock and extract:
- The course code (e.g. AEROSPC 1A)
- The course name
- The number of units
- The textual description of the course

In [None]:
# your solution here
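
Here is one possible sketch. The `code`, `title`, and `hours` classes come from the excerpt above. The `courseblockdesc` class for the description, the dictionary keys, and the description text in the sample are assumptions made for illustration, so check them against the real page source.

```python
from bs4 import BeautifulSoup


def extract_text_by_class(courseblock, class_name):
    """Return the stripped text of the first element with this class, or None."""
    element = courseblock.find(class_=class_name)
    return element.text.strip() if element is not None else None


def extract_course_code(courseblock):
    return extract_text_by_class(courseblock, 'code')


def extract_course_name(courseblock):
    return extract_text_by_class(courseblock, 'title')


def extract_course_units(courseblock):
    return extract_text_by_class(courseblock, 'hours')


def extract_course_description(courseblock):
    # Assumption: the description sits in an element with class "courseblockdesc".
    return extract_text_by_class(courseblock, 'courseblockdesc')


def extract_one_course(courseblock):
    """Bundle the four fields into a dictionary for one course."""
    return {
        'course_code': extract_course_code(courseblock),
        'course_name': extract_course_name(courseblock),
        'units': extract_course_units(courseblock),
        'description': extract_course_description(courseblock),
    }


# Trimmed-down sample based on the excerpt above; the description is made up.
sample_html = """
<div class="courseblock">
<h3 class="courseblocktitle">
<span class="code">AEROSPC 1A</span>
<span class="title">Foundations of the U.S. Air Force</span>
<span class="hours">1 Unit</span>
</h3>
<p class="courseblockdesc">A made-up description for illustration.</p>
</div>
"""
sample_courseblock = BeautifulSoup(sample_html, 'html.parser').find(class_='courseblock')
print(extract_one_course(sample_courseblock))
```

Returning `None` for a missing element, rather than raising, means one oddly formatted course should not stop a scrape of the whole catalogue.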

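Let's write functions to scrape these four pieces of information for every course in every department, so that we can save the results as a CSV file. The code below assumes an `extract_one_course` function that combines your four extractors and returns a dictionary for a single course.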

In [None]:
def scrape_one_department(department_link):
    department_name, department_code = extract_department_name_and_code(department_link)
    department_url = construct_absolute_url(department_link)
    department_html = get_html(department_url)
    department_soup = BeautifulSoup(department_html, 'lxml')
    department_courseblocks = department_soup.find_all(class_='courseblock')
    result = []
    for courseblock in department_courseblocks:
        course = extract_one_course(courseblock)
        course['department_name'] = department_name
        course['department_code'] = department_code
        result.append(course)
    return result

aerospace_courses = scrape_one_department(aerospace_link)
for value in aerospace_courses[0].values():
    print(value)
    print()

In [None]:
import time

def scrape_all_departments(be_nice=True):
    academic_guide_url = 'http://guide.berkeley.edu/courses/'
    academic_guide_html = get_html(academic_guide_url)
    academic_guide_soup = BeautifulSoup(academic_guide_html, 'lxml')
    links = academic_guide_soup.find_all('a')
    departmental_page_links = [link for link in links if is_departmental_page(link)]
    
    result = []
    for departmental_page_link in departmental_page_links:
        department_result = scrape_one_department(departmental_page_link)
        result.extend(department_result)
        if be_nice:
            time.sleep(1)
    return result

In [None]:
import pandas as pd
result = scrape_all_departments(be_nice=False)
df = pd.DataFrame(result)
print(f'{len(df)} courses scraped')
df.head()

9842 courses scraped (at the time of writing)! Wow, that was a lot easier than doing it by hand!
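
The cells above stop at building the DataFrame, so saving it as a CSV file is still left to do. A minimal sketch with `DataFrame.to_csv` follows; the records, column names, and output path are made up for illustration and stand in for the real `result` list.

```python
import os
import tempfile

import pandas as pd

# Made-up records standing in for the list returned by scrape_all_departments().
sample_result = [
    {'course_code': 'AEROSPC 1A', 'course_name': 'Foundations of the U.S. Air Force',
     'units': '1 Unit', 'department_code': 'AEROSPC'},
    {'course_code': 'AEROSPC 999', 'course_name': 'A made-up course name',
     'units': '1 Unit', 'department_code': 'AEROSPC'},
]
sample_df = pd.DataFrame(sample_result)

# Hypothetical output path; a temporary directory keeps the example tidy.
csv_path = os.path.join(tempfile.mkdtemp(), 'courses.csv')
# index=False leaves out the DataFrame's row numbers.
sample_df.to_csv(csv_path, index=False)

# Reading the file back is a quick check that nothing was lost.
reloaded_df = pd.read_csv(csv_path)
print(reloaded_df.shape)
```

With the real data you would call `df.to_csv(...)` on the DataFrame built above, using whatever file name suits your project.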