# DSCI 511: Data acquisition and pre-processing<br>Chapter 5: Harvesting Content from the World Wide Web
## Exercises
Note: the exercise numbers below refer to the corresponding sections of the main notes.

#### 5.2.1.0 Example: Parsing a BeautifulSoup document
Access the content from `'http://catalog.drexel.edu/'`, and find the title.

In [5]:
import requests
from bs4 import BeautifulSoup

URL = 'http://catalog.drexel.edu/'

html = requests.get(URL)

soup = BeautifulSoup(html.text, 'html.parser')

print(soup.find("title"))

<title>Drexel University &lt; 2018-2019 Catalog | Drexel University</title>
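
When only the title's text is needed rather than the whole tag, `get_text()` returns it as a plain string. The following is a minimal sketch; `SAMPLE_HTML` is a hypothetical stand-in for the downloaded page, used so the example runs without network access.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for response.text; this is not the real catalog page.
SAMPLE_HTML = (
    "<html><head><title>Sample Catalog | Example University</title></head>"
    "<body></body></html>"
)

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

title_tag = soup.find("title")     # a Tag object, printed together with its markup
title_text = title_tag.get_text()  # only the text inside the tag

print(title_tag)
print(title_text)
```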


#### 5.2.1.3 Understanding the body
Find and print the body content of the University Catalog (`'http://catalog.drexel.edu/'`) and review the structure of the HTML. What might lead us to the actual catalog data? Can you find the code for the Undergraduate and Graduate 'buttons'?

In [11]:
import requests
from bs4 import BeautifulSoup

response = requests.get('http://catalog.drexel.edu/')

soup = BeautifulSoup(response.text, 'html.parser')

for link in soup.find('body').find_all('a'):
    print(link)

<a href="http://catalog.drexel.edu"><img alt="University Catalog 2011-2012" src="/images/header.png"/></a>
<a href="/colleges/"><span>Colleges and Schools</span></a>
<a href="/majors/"><span>Majors</span></a>
<a href="/minors/"><span>Minors</span></a>
<a href="/graduateprograms/"><span>Graduate Programs</span></a>
<a href="/certificates/"><span>Certificate Programs</span></a>
<a href="http://www.drexel.edu/catalog/archive/" target="_blank"><span>Archive</span></a>
<a href="/undergraduate/"><span>Undergraduate</span></a>
<a href="/graduate/"><span>Graduate</span></a>
<a href="/additionalacademicprograms/"><span>Additional Academic Programs</span></a>
<a href="http://drexel.edu/webtms" target="_blank"><img alt="Schedule of Classes" src="/images/sidebar_scheduleclasses.jpg"/></a>
<a href="/coursedescriptions/" target="_blank"><img alt="All Course Descriptions" src="/images/sidebar_coursedescriptions.jpg"/></a>
<a href="/undergraduate/coop/" target="_blank"><img alt="Co-op" src="/images/si
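
One possible way to pick out the Undergraduate and Graduate 'buttons' is to filter the links by their visible text. The sketch below assumes a small fragment (`SAMPLE_HTML`) built from three anchors copied from the output above, so it runs offline; the real page contains many more links.

```python
from bs4 import BeautifulSoup

# Three anchors copied from the printed output, wrapped in a bare <body>.
SAMPLE_HTML = """
<body>
<a href="/undergraduate/"><span>Undergraduate</span></a>
<a href="/graduate/"><span>Graduate</span></a>
<a href="/additionalacademicprograms/"><span>Additional Academic Programs</span></a>
</body>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

wanted = {"Undergraduate", "Graduate"}

# Map each matching link's visible text to its href.
buttons = {
    a.get_text(strip=True): a["href"]
    for a in soup.find("body").find_all("a")
    if a.get_text(strip=True) in wanted
}

print(buttons)
```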

#### 5.2.1.5 Exercise: Examining tag attributes
Find all `'p'`-tagged blocks in the University Catalog (`'http://catalog.drexel.edu/'`) and examine any tag attributes. What sort of information is being conveyed?

In [2]:
import requests
from bs4 import BeautifulSoup

URL = "http://catalog.drexel.edu/"

response = requests.get(URL)

In [14]:
soup = BeautifulSoup(response.text, "html.parser")

for paragraph in soup('p'):
    print(paragraph.attrs)

# print(soup)

{'id': 'breadcrumbs'}
{'class': ['intro']}
{'class': ['disclaimer']}
{'class': ['disclaimer']}
{'class': ['btn-ugrad']}
{'class': ['btn-grad']}
{'class': ['btn-other']}
{}
{}
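
The output shows two kinds of values: `id` is a single string, while `class` is a list, because an HTML element may carry several classes. A hedged sketch of reading both safely follows; the fragment `SAMPLE_HTML` is hypothetical, with attribute values taken from the output above and made-up paragraph text.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: attribute values mirror the output, the text is invented.
SAMPLE_HTML = """
<p id="breadcrumbs">Home</p>
<p class="intro">Welcome</p>
<p class="btn-ugrad">Undergraduate</p>
<p>No attributes here</p>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

for paragraph in soup("p"):
    # .get() returns a default instead of raising KeyError when an attribute is absent.
    tag_id = paragraph.get("id")          # a string, or None
    classes = paragraph.get("class", [])  # a list, possibly empty
    print(tag_id, classes)

ids = [p.get("id") for p in soup("p")]
class_lists = [p.get("class", []) for p in soup("p")]
```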


#### 5.2.1.6 Exercise: Filtering tags by attribute
Returning to `'http://catalog.drexel.edu/'`, can you collect only those hyperlinks which don't have the attribute `target="_blank"`? \[__Hint__: Think sets and exclusion.\]

In [19]:
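# Every hyperlink on the page
all_links = set(soup.find_all('a'))

# Hyperlinks that carry target="_blank"
bad_links = set(soup.find_all('a', {'target': '_blank'}))

# Set difference keeps only the links without target="_blank"
good_links = all_links - bad_links

good_links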

{<a href="/">Drexel University</a>,
 <a href="/aboutdrexel/">About Drexel</a>,
 <a href="/accreditation/">Accreditation</a>,
 <a href="/additionalacademicprograms/"><span>Additional Academic Programs</span></a>,
 <a href="/catalogcontents/">Site Map</a>,
 <a href="/certificates/"><span>Certificate Programs</span></a>,
 <a href="/colleges/"><span>Colleges and Schools</span></a>,
 <a href="/graduate/"><span>Graduate</span></a>,
 <a href="/graduateprograms/"><span>Graduate Programs</span></a>,
 <a href="/majors/"><span>Majors</span></a>,
 <a href="/minors/"><span>Minors</span></a>,
 <a href="/undergraduate/"><span>Undergraduate</span></a>,
 <a href="http://catalog.drexel.edu"><img alt="University Catalog 2011-2012" src="/images/header.png"/></a>,
 <a href="mailto:catalog@drexel.edu">catalog@drexel.edu</a>}

In [20]:
bad_links

{<a href="/coursedescriptions/" target="_blank"><img alt="All Course Descriptions" src="/images/sidebar_coursedescriptions.jpg"/></a>,
 <a href="/undergraduate/coop/" target="_blank"><img alt="Co-op" src="/images/sidebar_coop.jpg"/></a>,
 <a href="/undergraduate/tuitionandfees/" target="_blank"><img alt="Tuition &amp; Fees" src="/images/sidebar_tuitionfees.jpg"/></a>,
 <a href="http://drexel.edu/" target="_blank">Drexel Home</a>,
 <a href="http://drexel.edu/provost/aard/advising/undergraduate-advisors/" target="_blank"><img alt="Academic Advising" src="/images/sidebar_advising.jpg"/></a>,
 <a href="http://drexel.edu/provost/policies/overview/" target="_blank">Academic Policies</a>,
 <a href="http://drexel.edu/webtms" target="_blank"><img alt="Schedule of Classes" src="/images/sidebar_scheduleclasses.jpg"/></a>,
 <a href="http://www.catalog.drexel.edu/admissions/" target="_blank"><img alt="Admissions" src="/images/sidebar_admissions.jpg"/></a>,
 <a href="http://www.drexel.edu/catalog/ar
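
Sets discard document order and collapse duplicate links. If order matters, a list comprehension is one alternative to the set-based approach. The sketch below assumes a short fragment (`SAMPLE_HTML`) assembled from four anchors that appear in the outputs above.

```python
from bs4 import BeautifulSoup

# Four anchors copied from the printed outputs.
SAMPLE_HTML = """
<a href="/majors/"><span>Majors</span></a>
<a href="http://drexel.edu/" target="_blank">Drexel Home</a>
<a href="/minors/"><span>Minors</span></a>
<a href="http://drexel.edu/provost/policies/overview/" target="_blank">Academic Policies</a>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Keep links whose target attribute is missing or is anything other than "_blank".
same_tab_links = [a for a in soup.find_all("a") if a.get("target") != "_blank"]
hrefs = [a["href"] for a in same_tab_links]

print(hrefs)
```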

#### 5.3.1.3 Extracting hyperlinks for a crawl
From the Graduate Quarter course descriptions page (`'http://catalog.drexel.edu/coursedescriptions/quarter/grad/'`), extract a list of URLs for the individual program course description pages. Can you keep them grouped by academic unit? \[__Hint__: To do so, you can combine the page layout with regular expressions, e.g., `re.compile()`, to manage the unit headings.\]

In [21]:
import requests
from bs4 import BeautifulSoup

URL = "http://catalog.drexel.edu/coursedescriptions/quarter/grad/"

response = requests.get(URL)

In [32]:
soup = BeautifulSoup(response.text, 'html.parser')

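# The code treats each <div class="qugcourses"> block as one academic unit
divs = soup.find_all('div', {"class": "qugcourses"})
program_data = {}
for div in divs:
    unit_data = {
        "name": [],
        "href": []
    }
    
    # The unit's <h2> heading serves as the grouping key
    unit = div.find('h2').text
    
    # Collect the label and relative link of every program in this unit
    a_tags = div.find_all('a')
    for a_tag in a_tags:
        unit_data['href'].append(a_tag['href'])
        unit_data['name'].append(a_tag.text)
    
    program_data[unit] = unit_data
    
program_data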

{'Antoinette Westphal College of Media Arts & Design (A)': {'name': ['Animation (ANIM)',
   'Architecture (ARCH)',
   'Art History (ARTH)',
   'Arts Administration (AADM)',
   'Arts Administration and Museum Leadership   (AAML)',
   'Design Research  (DSRE)',
   'Digital Media (DIGM)',
   'Fashion Design (FASH)',
   'Game Art and Production (GMAP)',
   'Interior Design (INTR)',
   'Museum Leadership (MUSL)',
   'Retail & Merchandising (RMER)',
   'Television Management (TVMN)',
   'Urban Strategy (URBS)',
   'Visual Studies (VSST)',
   'Westphal Studies (WEST)'],
  'href': ['/coursedescriptions/quarter/grad/anim/',
   '/coursedescriptions/quarter/grad/arch/',
   '/coursedescriptions/quarter/grad/arth/',
   '/coursedescriptions/quarter/grad/aadm/',
   '/coursedescriptions/quarter/grad/aaml/',
   '/coursedescriptions/quarter/grad/dsre/',
   '/coursedescriptions/quarter/grad/digm/',
   '/coursedescriptions/quarter/grad/fash/',
   '/coursedescriptions/quarter/grad/gmap/',
   '/coursedescri
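
The extracted `href` values are relative to the site root, so they need to be resolved against the page URL before they can be requested in a crawl. A minimal sketch using the standard library's `urljoin`, with the first three hrefs from the output above (the names `BASE_URL`, `relative_hrefs`, and `absolute_urls` are illustrative):

```python
from urllib.parse import urljoin

BASE_URL = "http://catalog.drexel.edu/coursedescriptions/quarter/grad/"

# First three hrefs from the printed output.
relative_hrefs = [
    "/coursedescriptions/quarter/grad/anim/",
    "/coursedescriptions/quarter/grad/arch/",
    "/coursedescriptions/quarter/grad/arth/",
]

# urljoin resolves each relative path against the scheme and host of BASE_URL.
absolute_urls = [urljoin(BASE_URL, href) for href in relative_hrefs]

for url in absolute_urls:
    print(url)
```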

#### 5.3.2.9 Extracting the data
Write a function that scrapes the course descriptions from a given program page. For example, Drexel's MSDS course descriptions page is `'http://catalog.drexel.edu/coursedescriptions/quarter/grad/dsci/'`. Retain as much structure as possible!

In [35]:
import requests
from bs4 import BeautifulSoup
import re

URL = "http://catalog.drexel.edu/coursedescriptions/quarter/grad/dsci/"

new_response = requests.get(URL)

new_soup = BeautifulSoup(new_response.text, 'html.parser')

data = []

courseblocks = new_soup.find_all('div', {"class": "courseblock"})
for div in courseblocks:

    heading = div.find('p')
    number, name = heading.find_all('span')[:2]

    # Raw string avoids invalid-escape warnings for \d and \s
    credits = re.search(r"(\d+\.\d+\s+[cC]redits)", heading.text).group(1)
    
    description = div.find('p', {'class': "courseblockdesc"})

    course_data = {
        "name": name.text,
        "number": number.text,
        'credits': credits,
        'description': description.text
    }
    
    # Optional fields: when present, each appears as "Label: value" on its own line
    for label in ("College/Department", "Repeat Status", "Prerequisites"):
        match = re.search(rf"{re.escape(label)}:\s+([^\n]+)\n", div.text)
        if match:
            course_data[label] = match.group(1)

    data.append(course_data)
    
data

[{'name': 'Data Acquisition and Pre-Processing',
  'number': 'DSCI\xa0511  ',
  'credits': '3.0 Credits',
  'description': '\nIntroduces the breadth of data science through a project lifecycle perspective. Covers early-stage data-life cycle activities in depth for the development and dissemination of data sets. Provides technical experience with data harvesting, acquisition, pre-processing, and curation.  Concludes with an open-ended term project where students explore data availability, scale, variability, and reliability.\n',
  'College/Department': 'College of Computing and Informatics',
  'Repeat Status': 'Not repeatable for credit'},
 {'name': 'Data Analysis and Interpretation',
  'number': 'DSCI\xa0521  ',
  'credits': '3.0 Credits',
  'description': '\nIntroduces methods for data analysis and their quantitative foundations in application to pre-processed data. Covers reproducibility and interpretation for project life cycle activities, including data exploration, hypothesis gene
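
The scraped fields still carry page whitespace, such as the non-breaking space (`\xa0`) in `'number'` and the newlines around `'description'`. A possible post-processing step is sketched below; `clean_course` is a hypothetical helper (not part of the exercise solution), and `raw_course` is one record shaped like the output above with the description shortened to its first sentence.

```python
import re

# One record shaped like the scraped output (description shortened).
raw_course = {
    "name": "Data Acquisition and Pre-Processing",
    "number": "DSCI\xa0511  ",
    "credits": "3.0 Credits",
    "description": "\nIntroduces the breadth of data science through a project lifecycle perspective.\n",
}


def clean_course(course):
    """Return a copy with whitespace normalized and credits as a float."""
    cleaned = dict(course)
    # str.split() with no arguments treats \xa0 as whitespace, so this collapses
    # it (and the trailing spaces) into single ordinary spaces.
    cleaned["number"] = " ".join(course["number"].split())
    cleaned["description"] = course["description"].strip()
    match = re.match(r"(\d+\.\d+)", course["credits"])
    cleaned["credits"] = float(match.group(1)) if match else None
    return cleaned


clean = clean_course(raw_course)
print(clean["number"], clean["credits"])
```

Keeping the raw record and returning a cleaned copy leaves the original scrape untouched, which can help when checking the cleaning rules later.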