# DSCI 511: Data acquisition and pre-processing<br>Chapter 5: Harvesting Content from the World Wide Web
## Exercises
Note: section numbers refer to the main notes.

#### 5.2.1.0 Example: Parsing a BeautifulSoup document
Access the content from `'http://catalog.drexel.edu/'`, and find the title.
#### Discussion: System check
That's all this exercise really is. To get going, you'll need to install the `bs4` module and work out which parser you have available. The notes use `'html.parser'`, but in past versions `'lxml'` has worked out of the box, too.

In [2]:
from bs4 import BeautifulSoup
import urllib.request  # We'll still need this to download webpages

URL = 'http://catalog.drexel.edu/'

# This just downloads the raw html text, markup and all
html_text = urllib.request.urlopen(URL).read()

# We want a parsed (structured) version of the html using BeautifulSoup:
soup = BeautifulSoup(html_text, 'html.parser')

title = soup.find('title')
print(title)

<title>Drexel University &lt; 2018-2019 Catalog | Drexel University</title>
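
As a quick follow-up to the system check, the sketch below tries a few parser names and reports which ones `BeautifulSoup` can load on this machine, then pulls the title text rather than the whole tag. The inline `sample_html` string stands in for the downloaded page, and the `available_parsers` helper is a hypothetical name, not something from the notes.

```python
from bs4 import BeautifulSoup, FeatureNotFound

# A stand-in for the downloaded page (assumption: same title as the output above)
sample_html = (
    "<html><head>"
    "<title>Drexel University &lt; 2018-2019 Catalog | Drexel University</title>"
    "</head><body></body></html>"
)

def available_parsers(candidates=("html.parser", "lxml", "html5lib")):
    """Return the parser names from `candidates` that BeautifulSoup can load."""
    usable = []
    for name in candidates:
        try:
            BeautifulSoup("<p>test</p>", name)
            usable.append(name)
        except FeatureNotFound:
            # Raised when the requested parser library is not installed
            pass
    return usable

parsers = available_parsers()
print(parsers)

# .get_text() returns the text inside the tag, with entities such as &lt; decoded
soup_check = BeautifulSoup(sample_html, "html.parser")
title_text = soup_check.title.get_text()
print(title_text)
```

Since `'html.parser'` ships with Python, it should always appear in the list; the other two only appear if their packages are installed.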


#### 5.2.1.3 Understanding the body
Find and print the body content of the University Catalog (`'http://catalog.drexel.edu/'`) and review the structure of the html. What might get us into the actual catalog data? Can you find the code for the Undergraduate and Graduate 'buttons'?

#### Discussion: If you didn't write it, it can be hard to know what's going on!
However, it looks like the buttons are in `'p'`-tagged blocks with a clear `'class'` attribute, e.g., `<p class="btn-ugrad">`.

In [3]:
print(soup.find('body'))

<body>
<noscript><iframe height="0" src="//www.googletagmanager.com/ns.html?id=GTM-TJCKC2" style="display:none;visibility:hidden" width="0"></iframe></noscript>
<div id="top">
<div class="clearfix" id="header">
<div class="wrap">
<a href="http://catalog.drexel.edu"><img alt="University Catalog 2011-2012" src="/images/header.png"/></a>
</div>
</div>
</div>
<div id="wrapper">
<div id="content_wrapper">
<div id="content">
<!--htdig_noindex-->
<div id="navigation">
<ul>
<li class="nav1"><a href="/colleges/"><span>Colleges and Schools</span></a></li>
<li class="nav2"><a href="/majors/"><span>Majors</span></a></li>
<li class="nav3"><a href="/minors/"><span>Minors</span></a></li>
<li class="nav4"><a href="/graduateprograms/"><span>Graduate Programs</span></a></li>
<li class="nav5"><a href="/certificates/"><span>Certificate Programs</span></a></li>
<li class="nav6"><a href="http://www.drexel.edu/catalog/archive/" target="_blank"><span>Archive</span></a></li>
</ul>
</div>
<!--/htdig_noindex-->
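
Once a candidate block such as `<p class="btn-ugrad">` has been spotted in the printed body, one way to confirm it is to search for that class directly and read the link it wraps. The sketch below runs on a small hand-written fragment rather than the live page; the nesting of an `'a'` tag inside each button paragraph, the intro text, and the names `fragment` and `button_links` are assumptions made for illustration.

```python
from bs4 import BeautifulSoup

# Hand-written fragment (assumption: each button paragraph wraps a single link)
fragment = """
<div id="content">
  <p class="intro">Welcome to the catalog.</p>
  <p class="btn-ugrad"><a href="/undergraduate/"><span>Undergraduate</span></a></p>
  <p class="btn-grad"><a href="/graduate/"><span>Graduate</span></a></p>
</div>
"""
fragment_soup = BeautifulSoup(fragment, "html.parser")

button_links = {}
for css_class in ("btn-ugrad", "btn-grad"):
    # class_ is BeautifulSoup's keyword for the html 'class' attribute
    para = fragment_soup.find("p", class_=css_class)
    link = para.find("a")
    button_links[link.get_text()] = link["href"]

print(button_links)
```

On the real page, `soup.find('p', class_='btn-ugrad')` would play the same role, provided the markup matches this assumption.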


#### 5.2.1.5 Exercise: Examining tag attributes
Find all `'p'`-tagged blocks in the University Catalog (`'http://catalog.drexel.edu/'`) and examine any tag attributes. What sort of information is being conveyed?

#### Discussion: Some mysterious and not-so mysterious attributes.
Sure enough, if we look at all of the paragraph attributes, three are grouped systematically as buttons, with `'class'` attributes starting with `'btn'`. However, there also appears to be a paragraph with an `'id'` of `'breadcrumbs'`; what do you think this is?

In [4]:
paras = soup.find_all('p')
for para in paras:
    print(para.attrs)

{'id': 'breadcrumbs'}
{'class': ['intro']}
{'class': ['disclaimer']}
{'class': ['disclaimer']}
{'class': ['btn-ugrad']}
{'class': ['btn-grad']}
{'class': ['btn-other']}
{}
{}
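
Attribute dictionaries like the ones printed above can be filtered with plain Python. Note that `'class'` comes back as a list (a tag may carry several classes), while `'id'` is a single string. The sketch below copies the printed dictionaries in by hand, so it needs no download; `attr_dicts` and `button_classes` are illustrative names.

```python
# The attribute dictionaries printed above, copied in by hand
attr_dicts = [
    {'id': 'breadcrumbs'},
    {'class': ['intro']},
    {'class': ['disclaimer']},
    {'class': ['disclaimer']},
    {'class': ['btn-ugrad']},
    {'class': ['btn-grad']},
    {'class': ['btn-other']},
    {},
    {},
]

# 'class' is multi-valued, so it is stored as a list;
# .get('class', []) avoids a KeyError for paragraphs without one
button_classes = [
    css_class
    for attrs in attr_dicts
    for css_class in attrs.get('class', [])
    if css_class.startswith('btn')
]
print(button_classes)  # → ['btn-ugrad', 'btn-grad', 'btn-other']

# Paragraphs with no attributes at all
unattributed = sum(1 for attrs in attr_dicts if not attrs)
print(unattributed)  # → 2
```

With a live `soup`, the same comprehension works over `para.attrs` for each `para` in `soup.find_all('p')`.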


#### 5.2.1.6 Exercise: Filtering tags by attribute
Returning to `'http://catalog.drexel.edu/'`, can you collect only those hyperlinks which don't have the attribute `target="_blank"`? \[__Hint__: Think sets and exclusion.\]

#### Discussion: Don't forget how useful Python's built-in objects are!
By casting the full and blank-target collections of links as sets separately, we can perform a set difference on the `BeautifulSoup` `Tag` objects returned by the `.find_all()` method to leave only those links which didn't have `"_blank"` `'target'`s.

In [5]:
set(soup.find_all('a')) - set(soup.find_all('a', {'target': "_blank"}))

{<a href="/">Drexel University</a>,
 <a href="/aboutdrexel/">About Drexel</a>,
 <a href="/accreditation/">Accreditation</a>,
 <a href="/additionalacademicprograms/"><span>Additional Academic Programs</span></a>,
 <a href="/catalogcontents/">Site Map</a>,
 <a href="/certificates/"><span>Certificate Programs</span></a>,
 <a href="/colleges/"><span>Colleges and Schools</span></a>,
 <a href="/graduate/"><span>Graduate</span></a>,
 <a href="/graduateprograms/"><span>Graduate Programs</span></a>,
 <a href="/majors/"><span>Majors</span></a>,
 <a href="/minors/"><span>Minors</span></a>,
 <a href="/undergraduate/"><span>Undergraduate</span></a>,
 <a href="http://catalog.drexel.edu"><img alt="University Catalog 2011-2012" src="/images/header.png"/></a>,
 <a href="mailto:catalog@drexel.edu">catalog@drexel.edu</a>}
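
A set difference does the job, but sets do not keep document order. If order matters, a list comprehension over a single `.find_all('a')` call is one alternative. The sketch below uses a hand-written fragment in place of the catalog page; `links_html` and `same_window` are hypothetical names.

```python
from bs4 import BeautifulSoup

# Hand-written fragment standing in for the catalog page
links_html = """
<ul>
  <li><a href="/colleges/"><span>Colleges and Schools</span></a></li>
  <li><a href="/majors/"><span>Majors</span></a></li>
  <li><a href="http://www.drexel.edu/catalog/archive/" target="_blank"><span>Archive</span></a></li>
  <li><a href="/minors/"><span>Minors</span></a></li>
</ul>
"""
links_soup = BeautifulSoup(links_html, "html.parser")

# Keep every link whose 'target' attribute is missing or is not "_blank"
same_window = [
    a for a in links_soup.find_all("a")
    if a.get("target") != "_blank"
]
hrefs = [a["href"] for a in same_window]
print(hrefs)  # → ['/colleges/', '/majors/', '/minors/']
```

Because `.get()` returns `None` when the attribute is absent, links without any `target` pass the test and are kept in the order they appear on the page.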

#### 5.3.1.3 Extracting hyperlinks for a crawl
From the Graduate Quarter course descriptions page (`'http://catalog.drexel.edu/coursedescriptions/quarter/grad/'`), extract a list of URLs to the various program course descriptions pages. Can you keep them grouped by academic unit? \[__Hint__: To do so, you can use the page layout and regular expressions, i.e., `re.compile()`, to manage the unit headings.\]

#### Discussion: Flexible searching for content structure
With `BeautifulSoup`, you can pass compiled regular expressions to the `.find()` and `.find_all()` methods. This is essential for grouping by academic unit: the page reads from top to bottom, with each academic unit's heading sitting positionally above the links that belong to it. Do you see any other solutions using page tags and nesting?

In [6]:
from collections import defaultdict
import re
URL = 'http://catalog.drexel.edu/coursedescriptions/quarter/grad/'

# This just downloads the raw html text, markup and all
html_text = urllib.request.urlopen(URL).read()

# We want a parsed (structured) version of the html using BeautifulSoup:
soup = BeautifulSoup(html_text, 'html.parser')

colleges = defaultdict(list)
college = ""
for x in soup.find_all(re.compile('^(a|h2)$')):  # match exactly 'a' or 'h2' tags
    if x.name == 'h2':
        college = x.text
    elif re.search("/coursedescriptions/quarter/.", x.get('href', "")):
        colleges[college].append(x['href'])
colleges

defaultdict(list,
            {'Antoinette Westphal College of Media Arts & Design (A)': ['/coursedescriptions/quarter/grad/anim/',
              '/coursedescriptions/quarter/grad/arch/',
              '/coursedescriptions/quarter/grad/arth/',
              '/coursedescriptions/quarter/grad/aadm/',
              '/coursedescriptions/quarter/grad/aaml/',
              '/coursedescriptions/quarter/grad/dsre/',
              '/coursedescriptions/quarter/grad/digm/',
              '/coursedescriptions/quarter/grad/fash/',
              '/coursedescriptions/quarter/grad/gmap/',
              '/coursedescriptions/quarter/grad/intr/',
              '/coursedescriptions/quarter/grad/musl/',
              '/coursedescriptions/quarter/grad/rmer/',
              '/coursedescriptions/quarter/grad/tvmn/',
              '/coursedescriptions/quarter/grad/urbs/',
              '/coursedescriptions/quarter/grad/vsst/',
              '/coursedescriptions/quarter/grad/west/'],
             'College of Ar
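
One detail worth checking when a compiled pattern is handed to `.find_all()` is how the anchors combine with alternation. The sketch below uses only the standard `re` module on a hand-picked list of tag names, and assumes `BeautifulSoup` tests each tag name with search semantics; the names `loose` and `strict` are illustrative.

```python
import re

# A hand-picked list of html tag names to test against
tag_names = ["a", "h2", "abbr", "article", "h1", "p"]

loose = re.compile(r"^a|h2$")      # reads as (^a) OR (h2$)
strict = re.compile(r"^(a|h2)$")   # exactly 'a' or exactly 'h2'

loose_hits = [name for name in tag_names if loose.search(name)]
strict_hits = [name for name in tag_names if strict.search(name)]

print(loose_hits)   # → ['a', 'h2', 'abbr', 'article']
print(strict_hits)  # → ['a', 'h2']
```

Extra matches such as `abbr` carry no `href`, so they would fail the link test in the loop anyway, but the grouped pattern states the intent exactly. Passing a list of names, as in `.find_all(['a', 'h2'])`, is another way to ask for the same tags.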

#### Alternative solution: using tag nesting to control data associations
Since the catalogue website is nicely set up, the other way we can associate the academic units with their programs is through tag nesting and attribute filtering. Each unit's programs are separated into their own `'div'`, tagged with the attribute `class="qugcourses"`. Within each `'div'`, the academic unit is contained in a heading (`'h2'`) tag, and its programs are contained in `'a'` tags, both of which are easily extracted.

In [7]:
import requests
from bs4 import BeautifulSoup

URL = "http://catalog.drexel.edu/coursedescriptions/quarter/grad/"

response = requests.get(URL)

soup = BeautifulSoup(response.text, 'html.parser')

divs = soup.find_all('div', {"class": "qugcourses"})
program_data = {}
for div in divs:
    unit_data = {
        "name": [],
        "href": []
    }
    
    unit = div.find('h2').text
    
    a_tags = div.find_all('a')
    for a_tag in a_tags:
        unit_data['href'].append(a_tag['href'])
        unit_data['name'].append(a_tag.text)
    
    program_data[unit] = unit_data
    
program_data

{'Antoinette Westphal College of Media Arts & Design (A)': {'name': ['Animation (ANIM)',
   'Architecture (ARCH)',
   'Art History (ARTH)',
   'Arts Administration (AADM)',
   'Arts Administration and Museum Leadership   (AAML)',
   'Design Research  (DSRE)',
   'Digital Media (DIGM)',
   'Fashion Design (FASH)',
   'Game Art and Production (GMAP)',
   'Interior Design (INTR)',
   'Museum Leadership (MUSL)',
   'Retail & Merchandising (RMER)',
   'Television Management (TVMN)',
   'Urban Strategy (URBS)',
   'Visual Studies (VSST)',
   'Westphal Studies (WEST)'],
  'href': ['/coursedescriptions/quarter/grad/anim/',
   '/coursedescriptions/quarter/grad/arch/',
   '/coursedescriptions/quarter/grad/arth/',
   '/coursedescriptions/quarter/grad/aadm/',
   '/coursedescriptions/quarter/grad/aaml/',
   '/coursedescriptions/quarter/grad/dsre/',
   '/coursedescriptions/quarter/grad/digm/',
   '/coursedescriptions/quarter/grad/fash/',
   '/coursedescriptions/quarter/grad/gmap/',
   '/coursedescri
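
The links collected above are relative to the site root, so a crawl needs them resolved into full URLs first. A minimal sketch using the standard library's `urljoin` follows; `BASE_URL`, `relative_hrefs`, and `crawl_urls` are illustrative names, and only the first three links from the output are copied in.

```python
from urllib.parse import urljoin

# The page the links were found on
BASE_URL = "http://catalog.drexel.edu/coursedescriptions/quarter/grad/"

# A few of the relative links collected above
relative_hrefs = [
    "/coursedescriptions/quarter/grad/anim/",
    "/coursedescriptions/quarter/grad/arch/",
    "/coursedescriptions/quarter/grad/arth/",
]

# urljoin resolves each href against the page it was found on
crawl_urls = [urljoin(BASE_URL, href) for href in relative_hrefs]
for url in crawl_urls:
    print(url)
```

Using `urljoin` rather than string concatenation means that hrefs which are already absolute pass through unchanged.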

#### 5.3.2.9 Extracting the data
Write a function that scrapes the course descriptions from a given program page. For example, Drexel's MSDS course descriptions page is `'http://catalog.drexel.edu/coursedescriptions/quarter/grad/dsci/'`. Retain as much structure as possible!

#### Discussion: Taking encapsulation when it's there
The really important feature of this approach is using the `'courseblock'` class attribute of the divs. This provides separate access to each course on the page. All else follows from accessing the appropriate tags for each course.

In [8]:
URL = 'http://catalog.drexel.edu/coursedescriptions/quarter/grad/dsci/'
def course_descriptions(URL):
    # This just downloads the raw html text, markup and all
    html_text = urllib.request.urlopen(URL).read()

    # We want a parsed (structured) version of the html using BeautifulSoup:
    soup = BeautifulSoup(html_text, 'html.parser')

    courses = []
    for div in soup.find_all('div'):
        if "courseblock" in div.get('class', []):
            number, title, credits, description = "", "", "", ""
            for para in div.find_all('p'):
                if 'courseblocktitle' in para.get('class', []):
                    # Split the title line into course number, title, and credits
                    match = re.search(r"([A-Z]+\s+\d+)\s+([^0-9]+)\s+(\d\.\d Credits)", para.text)
                    if match:
                        number, title, credits = match.groups()
                elif 'courseblockdesc' in para.get('class', []):
                    description = para.text
            # re.search() returns None when a field is absent,
            # so .groups() raises AttributeError and we fall back to ""
            try:
                unit = re.search(r"College/Department:\s+(.*?)\n", div.text).groups()[0]
            except AttributeError:
                unit = ""
            try:
                repeat = re.search(r"Repeat Status:\s+(.*?)\n", div.text).groups()[0]
            except AttributeError:
                repeat = ""
            try:
                prerequesites = re.search(r"Prerequisites:\s+(.*?)\n", div.text).groups()[0]
            except AttributeError:
                prerequesites = ""

            courses.append({
                "number": number,
                "title": title,
                "credits": credits,
                "description": description,
                "unit": unit,
                "repeat": repeat,
                "prerequesites": prerequesites
            })
    return courses
course_descriptions(URL)

[{'number': 'DSCI\xa0511',
  'title': 'Data Acquisition and Pre-Processing',
  'credits': '3.0 Credits',
  'description': '\nIntroduces the breadth of data science through a project lifecycle perspective. Covers early-stage data-life cycle activities in depth for the development and dissemination of data sets. Provides technical experience with data harvesting, acquisition, pre-processing, and curation.  Concludes with an open-ended term project where students explore data availability, scale, variability, and reliability.\n',
  'unit': 'College of Computing and Informatics',
  'repeat': 'Not repeatable for credit',
  'prerequesites': ''},
 {'number': 'DSCI\xa0521',
  'title': 'Data Analysis and Interpretation',
  'credits': '3.0 Credits',
  'description': '\nIntroduces methods for data analysis and their quantitative foundations in application to pre-processed data. Covers reproducibility and interpretation for project life cycle activities, including data exploration, hypothesis gene
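
The scraped records still carry some residue from the page, such as the non-breaking space (`'\xa0'`) in the course number and the newlines around each description. One possible clean-up pass is sketched below on a single hand-copied record; `tidy_course` is a hypothetical helper, the description is shortened by hand, and converting credits to a number is an assumption about what downstream code would want.

```python
# One record in the shape returned above (description shortened by hand)
raw_course = {
    'number': 'DSCI\xa0511',
    'title': 'Data Acquisition and Pre-Processing',
    'credits': '3.0 Credits',
    'description': '\nIntroduces the breadth of data science through a project lifecycle perspective.\n',
    'unit': 'College of Computing and Informatics',
    'repeat': 'Not repeatable for credit',
    'prerequesites': '',
}

def tidy_course(course):
    """Return a copy with non-breaking spaces replaced and whitespace trimmed."""
    cleaned = {
        key: value.replace('\xa0', ' ').strip()
        for key, value in course.items()
    }
    # Store credits as a number when present, e.g. '3.0 Credits' becomes 3.0
    if cleaned['credits']:
        cleaned['credits'] = float(cleaned['credits'].split()[0])
    return cleaned

tidy = tidy_course(raw_course)
print(tidy['number'])   # → DSCI 511
print(tidy['credits'])  # → 3.0
```

The function returns a new dictionary, so the raw scrape stays available if a different clean-up is needed later.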