# Web scraping

Agenda:
  * use GET to grab a page (Endicott's CSC majors page)
    - look at response
  * use GET to grab a CSV response
  * use POST to grab a CSV response
  * cookies
  * extract course requirements from Endicott's CSC majors page
    
## `requests` module
This module will make working with HTTP requests much easier.

In [1]:
import requests

## A simple get request

Let's grab the Computer Science Major page from Endicott's catalog website: `https://catalog.endicott.edu/preview_program.php?catoid=46&poid=5251`

In [2]:
response = requests.get('https://catalog.endicott.edu/preview_program.php?catoid=46&poid=5251', timeout=2)
response

<Response [200]>

Look at the content from the response as plain text:

In [3]:
response.text

'<!DOCTYPE html><html lang="en">\n\t<head>\n\t\t<title>Program: Computer Science Major (Bachelor of Science) - Endicott College - Modern Campus Catalog™</title>\n\t\t\t\t<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">\n\t\t\t\t\t\t\t<link rel="shortcut icon" href="//acalog-clients.s3.amazonaws.com/production/endicott/img/favicon/favicon.ico" />\n\t\t<meta name="description" content="Endicott College Academic Catalog. Endicott College, 376 Hale Street, Beverly, MA, USA. Endicott offers Bachelor of Arts, B.A., Bachelor of Fine Arts, B.F.A., Bachelor of Science, B.S., Master of Education, M.Ed., Master of Business Administration, M.B.A., Master of Science, M.S., Master of Arts, M.A., Master of Fine Arts, M.F.A., and Associate in Science (A.S.) degrees.Most students select Endicott after a personal tour of our oceanfront campus ends their search for the ideal academic experience."><meta name="keywords" content="academic, catalogs, undergraduate, Endicott College, progra

Look at the content as a byte string (notice the `b` before the starting quote):

In [4]:
response.content

b'<!DOCTYPE html><html lang="en">\n\t<head>\n\t\t<title>Program: Computer Science Major (Bachelor of Science) - Endicott College - Modern Campus Catalog\xe2\x84\xa2</title>\n\t\t\t\t<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">\n\t\t\t\t\t\t\t<link rel="shortcut icon" href="//acalog-clients.s3.amazonaws.com/production/endicott/img/favicon/favicon.ico" />\n\t\t<meta name="description" content="Endicott College Academic Catalog. Endicott College, 376 Hale Street, Beverly, MA, USA. Endicott offers Bachelor of Arts, B.A., Bachelor of Fine Arts, B.F.A., Bachelor of Science, B.S., Master of Education, M.Ed., Master of Business Administration, M.B.A., Master of Science, M.S., Master of Arts, M.A., Master of Fine Arts, M.F.A., and Associate in Science (A.S.) degrees.Most students select Endicott after a personal tour of our oceanfront campus ends their search for the ideal academic experience."><meta name="keywords" content="academic, catalogs, undergraduate, Endicott Col
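The relationship between the two views can be sketched without a network call. The snippet below takes a short byte string (a trimmed, hand-typed stand-in for `response.content`) and decodes it as UTF-8, the charset the page declares. `requests` does a similar decode to produce `response.text`, picking the encoding from the response headers when one is given.

```python
# A trimmed, hand-typed stand-in for `response.content`.
raw = b'Modern Campus Catalog\xe2\x84\xa2'

# Decoding turns the three bytes \xe2\x84\xa2 into a single character.
text = raw.decode('utf-8')
print(text)                 # Modern Campus Catalog™
print(len(raw), len(text))  # 24 22
```

The byte string is what you would write to disk for binary data such as images; the decoded text is what you would search or parse.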

Let's look at the headers:

In [5]:
# import json
# json.dumps(response.headers, indent=2)
response.headers

{'Date': 'Wed, 12 Feb 2025 14:32:52 GMT', 'Content-Type': 'text/html; charset=UTF-8', 'Content-Length': '11367', 'Connection': 'keep-alive', 'Set-Cookie': 'AWSALB=IOs0B907MELvKaWgsj+DTtrhzVcZFo3diBGeo1ftQMo9GNuKcFJxkOXmVxG70QCcxPBBf1b0vnDIBkkCrkgRbpUo9EJEgg5RPwps7XqEOVB88G+2/+/7G6ClCfkH; Expires=Wed, 19 Feb 2025 14:32:52 GMT; Path=/; secure, AWSALBCORS=IOs0B907MELvKaWgsj+DTtrhzVcZFo3diBGeo1ftQMo9GNuKcFJxkOXmVxG70QCcxPBBf1b0vnDIBkkCrkgRbpUo9EJEgg5RPwps7XqEOVB88G+2/+/7G6ClCfkH; Expires=Wed, 19 Feb 2025 14:32:52 GMT; Path=/; secure; SameSite=None, acalog_theme=1; expires=Tue, 15-Jun-3024 14:33:52 GMT; Max-Age=31536000060; path=/; secure, PHPSESSID=659277948b9eea5f8b2c206fe4d8dc06; path=/; secure, PHPSESSID=659277948b9eea5f8b2c206fe4d8dc06; path=/; secure; HttpOnly, ADRUM_BT=R%3A0%7Cg%3A51479b13-1423-42e2-9d67-708bcc133f3e1857%7Cn%3Adigarc_881d5e4b-64f1-425e-8ceb-5e44d2b69b37%7Cd%3A98; expires=Wed, 12-Feb-2025 14:33:22 GMT; Max-Age=30; path=/; secure', 'Expires': 'Thu, 19 Nov 1981 08:52:00
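`response.headers` is a case-insensitive mapping rather than a plain `dict`, so the commented-out `json.dumps` call in the cell above would fail without a conversion. A minimal sketch, using a small hand-built `CaseInsensitiveDict` as a stand-in for real response headers:

```python
import json
from requests.structures import CaseInsensitiveDict

# Hand-built stand-in for `response.headers` (two values from the output above).
headers = CaseInsensitiveDict({
    'Content-Type': 'text/html; charset=UTF-8',
    'Content-Length': '11367',
})

# Lookups ignore case.
print(headers['content-type'])   # text/html; charset=UTF-8

# Convert to a plain dict before serializing; json.dumps() does not
# accept a CaseInsensitiveDict directly.
pretty = json.dumps(dict(headers), indent=2)
print(pretty)
```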

In [6]:
response.status_code

200

In [7]:
response.reason

'OK'

Rather than hard-coding the number, e.g., `response.status_code == 200`, we can compare against the built-in named constants in `requests.codes`:

In [8]:
if response.status_code == requests.codes.ALL_OK:
    print("Success!")
    ## Process response...

Success!


In [9]:
requests.codes.ALL_OK

200
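`requests` also offers two shortcuts for the same check: `response.ok`, which is true for any status code below 400, and `response.raise_for_status()`, which raises `requests.exceptions.HTTPError` for 4xx and 5xx codes. The sketch below wraps the second one in a hypothetical helper, `fetch_text` (not part of the original notebook), that returns `None` instead of crashing when a request fails.

```python
import requests

def fetch_text(url, params=None, timeout=2):
    """Return the body of `url` as text, or None if the request fails.

    Hypothetical helper for illustration only.
    """
    try:
        response = requests.get(url, params=params, timeout=timeout)
        response.raise_for_status()  # Raises HTTPError on 4xx/5xx.
    except requests.exceptions.RequestException as err:
        # Covers timeouts, connection errors, malformed URLs, and HTTP errors.
        print(f'Request failed: {err}')
        return None
    return response.text

# A malformed URL is rejected before any network traffic is sent.
print(fetch_text('not-a-url'))
```

Catching `RequestException` is a deliberately broad choice here; in a longer script you may prefer to handle timeouts and HTTP errors separately.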

## POST vs. GET + arguments

I've put up a page where we can get a few different versions of a data file from a study of mouse weights: https://digdug.cs.endicott.edu/~hfeild/csc440/mouse.php

The page supports one GET *or* POST argument, `format`, which can take one of three values: `csv`, `json`, or `tsv`. What happens if we don't provide the argument?


In [10]:
response = requests.get('https://digdug.cs.endicott.edu/~hfeild/ds303/mouse.php', timeout=2)

In [11]:
response.status_code

400

In [12]:
response.reason

'Bad Request'

In [13]:
response.headers

{'Date': 'Wed, 12 Feb 2025 14:32:53 GMT', 'Server': 'Apache/2.4.58 (Ubuntu)', 'Content-Length': '20', 'Connection': 'close', 'Content-Type': 'text/html; charset=UTF-8'}

In [14]:
response.text

'No format specified.'

What happens if we provide an incorrect argument?

In [15]:
response = requests.get('https://digdug.cs.endicott.edu/~hfeild/csc440/mouse.php',
                        params={'format': 'cv'}, timeout=2)

In [16]:
response.status_code

400

In [17]:
response.reason

'Bad Request'

In [18]:
response.headers

{'Date': 'Wed, 12 Feb 2025 14:32:53 GMT', 'Server': 'Apache/2.4.58 (Ubuntu)', 'Content-Length': '28', 'Connection': 'close', 'Content-Type': 'text/html; charset=UTF-8'}

In [19]:
response.text

'Invalid format specified: cv'

Let's grab the CSV version of the data. We'll talk about parsing tabular data later in the course.

In [20]:
## TODO -- get the data in CSV format using GET.
response = requests.get('https://digdug.cs.endicott.edu/~hfeild/csc440/mouse.php',
                        params={'format': 'csv'}, timeout=2)

if response.status_code == requests.codes.ALL_OK:
    print('Success!')
else:
    print('ERROR!')
print(response.text)

Success!
strain,sex,weight
C57Bl/6J,M,27.59675331333915
C57Bl/6J,M,29.635837290427975
C57Bl/6J,M,28.4606850523734
C57Bl/6J,M,29.248224513188948
C57Bl/6J,M,25.702118112188177
C57Bl/6J,M,26.972695857278342
C57Bl/6J,M,30.352503586931082
C57Bl/6J,M,29.05602355618237
C57Bl/6J,M,27.869611829423285
C57Bl/6J,M,31.025557869722743
C57Bl/6J,M,26.4514244620284
C57Bl/6J,M,29.015105617301778
C57Bl/6J,M,24.88924725446606
C57Bl/6J,M,29.874311671891547
C57Bl/6J,M,25.52512438476657
C57Bl/6J,M,24.670567191691454
C57Bl/6J,M,26.742198725332102
C57Bl/6J,M,28.131760610169305
C57Bl/6J,M,28.72482798481937
C57Bl/6J,M,25.277441061646307
C57Bl/6J,M,27.774556974580587
C57Bl/6J,M,28.393142210576897
C57Bl/6J,M,24.775898723580788
C57Bl/6J,M,25.042079443710662
C57Bl/6J,M,25.44514160586376
C57Bl/6J,M,27.805460215504564
C57Bl/6J,M,26.598269218252614
C57Bl/6J,M,25.744341266714354
C57Bl/6J,M,28.248257070355297
C57Bl/6J,M,26.010734389148837
C57Bl/6J,M,28.431915788937303
C57Bl/6J,M,27.87918279182886
C57Bl/6J,M,28.9009761961

Let's do the same thing, but use POST instead of GET.

In [21]:
## TODO -- request the CSV version of the data using POST.
response = requests.post('https://digdug.cs.endicott.edu/~hfeild/csc440/mouse.php',
                         data={'format': 'csv'}, timeout=2)
if response.status_code == requests.codes.ALL_OK:
    print('Success!')
else:
    print('ERROR!')
print(response.text)

Success!
strain,sex,weight
C57Bl/6J,M,27.59675331333915
C57Bl/6J,M,29.635837290427975
C57Bl/6J,M,28.4606850523734
C57Bl/6J,M,29.248224513188948
C57Bl/6J,M,25.702118112188177
C57Bl/6J,M,26.972695857278342
C57Bl/6J,M,30.352503586931082
C57Bl/6J,M,29.05602355618237
C57Bl/6J,M,27.869611829423285
C57Bl/6J,M,31.025557869722743
C57Bl/6J,M,26.4514244620284
C57Bl/6J,M,29.015105617301778
C57Bl/6J,M,24.88924725446606
C57Bl/6J,M,29.874311671891547
C57Bl/6J,M,25.52512438476657
C57Bl/6J,M,24.670567191691454
C57Bl/6J,M,26.742198725332102
C57Bl/6J,M,28.131760610169305
C57Bl/6J,M,28.72482798481937
C57Bl/6J,M,25.277441061646307
C57Bl/6J,M,27.774556974580587
C57Bl/6J,M,28.393142210576897
C57Bl/6J,M,24.775898723580788
C57Bl/6J,M,25.042079443710662
C57Bl/6J,M,25.44514160586376
C57Bl/6J,M,27.805460215504564
C57Bl/6J,M,26.598269218252614
C57Bl/6J,M,25.744341266714354
C57Bl/6J,M,28.248257070355297
C57Bl/6J,M,26.010734389148837
C57Bl/6J,M,28.431915788937303
C57Bl/6J,M,27.87918279182886
C57Bl/6J,M,28.9009761961
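Both requests return the same data, but they carry the `format` argument differently: GET appends it to the URL as a query string, while POST sends it form-encoded in the request body. One way to see this without contacting the server is to build each request and inspect it before it is sent. This is a sketch using `requests.Request(...).prepare()`; the variable names are illustrative.

```python
import requests

url = 'https://digdug.cs.endicott.edu/~hfeild/csc440/mouse.php'

# Build (but do not send) a GET and a POST carrying the same argument.
get_req = requests.Request('GET', url, params={'format': 'csv'}).prepare()
post_req = requests.Request('POST', url, data={'format': 'csv'}).prepare()

print(get_req.url)    # The argument appears after the '?'.
print(get_req.body)   # None: this GET has no body.

print(post_req.url)   # No query string is added.
print(post_req.body)  # format=csv
print(post_req.headers['Content-Type'])
```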

## Cookies & sessions

In some situations, you may be accessing data from a site that stores information in cookies (e.g., for authentication). If that's the case, use a session, which keeps the cookies it receives and sends them back on later requests:

In [22]:
currentSession = requests.Session()
response = currentSession.get('https://catalog.endicott.edu/preview_program.php?catoid=46&poid=5251', timeout=2)

You can view the cookies received in the response by examining its `cookies` attribute:

In [23]:
# This holds the cookies sent from the server.
response.cookies


<RequestsCookieJar[Cookie(version=0, name='AWSALB', value='T/PKqoY/PGIrvvDRZi89xXSYkbPsox28u8DoMOtPvSoaKS6bz30PeGr1hbydFngGl5fiuNdt6fKalWxiUx5bDTKlNb54jKpAIbtNd3wJC7t4i22GrzQzALWwzLa7', port=None, port_specified=False, domain='catalog.endicott.edu', domain_specified=False, domain_initial_dot=False, path='/', path_specified=True, secure=True, expires=1739975574, discard=False, comment=None, comment_url=None, rest={}, rfc2109=False), Cookie(version=0, name='AWSALBCORS', value='T/PKqoY/PGIrvvDRZi89xXSYkbPsox28u8DoMOtPvSoaKS6bz30PeGr1hbydFngGl5fiuNdt6fKalWxiUx5bDTKlNb54jKpAIbtNd3wJC7t4i22GrzQzALWwzLa7', port=None, port_specified=False, domain='catalog.endicott.edu', domain_specified=False, domain_initial_dot=False, path='/', path_specified=True, secure=True, expires=1739975574, discard=False, comment=None, comment_url=None, rest={'SameSite': 'None'}, rfc2109=False), Cookie(version=0, name='acalog_theme', value='1', port=None, port_specified=False, domain='catalog.endicott.edu', domain_spec
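The raw `RequestsCookieJar` output is hard to read. The sketch below shows two friendlier views. It fills a jar by hand so that it runs without a network connection; the session ID is a made-up placeholder, not a real cookie value. With a live response you would use `response.cookies` or `currentSession.cookies` in place of `jar`.

```python
from requests.cookies import RequestsCookieJar

# Hand-built stand-in for `response.cookies`.
jar = RequestsCookieJar()
jar.set('acalog_theme', '1', domain='catalog.endicott.edu', path='/')
jar.set('PHPSESSID', 'example-session-id', domain='catalog.endicott.edu', path='/')

# Name -> value pairs as a plain dict.
print(jar.get_dict())

# Each item in the jar is a Cookie object with more detail.
for cookie in jar:
    print(cookie.name, cookie.domain, cookie.path)
```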

## Scraping

Let's use [BeautifulSoup](https://beautiful-soup-4.readthedocs.io/en/latest/) to extract the list of CSC courses a major should take in their Sophomore year.

In [24]:
from bs4 import BeautifulSoup

# We'll continue using `currentSession` from before.
response = currentSession.get('https://catalog.endicott.edu/preview_program.php?catoid=46&poid=5251', timeout=2)
content = response.text

soup = BeautifulSoup(content, "html.parser")
soup

<!DOCTYPE html>
<html lang="en">
<head>
<title>Program: Computer Science Major (Bachelor of Science) - Endicott College - Modern Campus Catalog™</title>
<meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
<link href="//acalog-clients.s3.amazonaws.com/production/endicott/img/favicon/favicon.ico" rel="shortcut icon"/>
<meta content="Endicott College Academic Catalog. Endicott College, 376 Hale Street, Beverly, MA, USA. Endicott offers Bachelor of Arts, B.A., Bachelor of Fine Arts, B.F.A., Bachelor of Science, B.S., Master of Education, M.Ed., Master of Business Administration, M.B.A., Master of Science, M.S., Master of Arts, M.A., Master of Fine Arts, M.F.A., and Associate in Science (A.S.) degrees.Most students select Endicott after a personal tour of our oceanfront campus ends their search for the ideal academic experience." name="description"/><meta content="academic, catalogs, undergraduate, Endicott College, programs, courses, Endicott, endicott college, endicott, n

In [25]:
soup.title

<title>Program: Computer Science Major (Bachelor of Science) - Endicott College - Modern Campus Catalog™</title>

In [26]:
soup.find_all("h3")

[<h3><a name="FirstYearCredits34"></a><a id="core_27952" name="firstyearcredits34"></a>First Year - Credits: 34</h3>,
 <h3><a name="SophomoreCredits32"></a><a id="core_27953" name="sophomorecredits32"></a>Sophomore - Credits: 32</h3>,
 <h3><a name="JuniorCredits31"></a><a id="core_27954" name="juniorcredits31"></a>Junior - Credits: 31</h3>,
 <h3><a name="SeniorCredits30"></a><a id="core_27955" name="seniorcredits30"></a>Senior - Credits: 30</h3>,
 <h3><a name="ComputerScienceElectives"></a><a id="core_28958" name="computerscienceelectives"></a>Computer Science Electives</h3>,
 <h3><a name="LearningOutcomes"></a><a id="core_28431" name="learningoutcomes"></a>Learning Outcomes</h3>]

In [27]:
soup.find_all("a", attrs={'name': 'SophomoreCredits32'})

[<a name="SophomoreCredits32"></a>]

In [28]:
import re # For regular expressions (e.g., pattern and partial text matching).

soup.find_all(string=re.compile('Sophomore'))

['Sophomore - Credits: 32']

In [29]:
soup.find_all(string=re.compile('Sophomore'))[0].parent

<h3><a name="SophomoreCredits32"></a><a id="core_27953" name="sophomorecredits32"></a>Sophomore - Credits: 32</h3>

In [30]:
soup.find_all(string=re.compile('Sophomore'))[0].parent.parent

<div class="acalog-core"><h3><a name="SophomoreCredits32"></a><a id="core_27953" name="sophomorecredits32"></a>Sophomore - Credits: 32</h3><hr/><ul> <li>Aesthetic Awareness and Creative Expression General Education Requirement   (Cr: 3)</li> <li>Global Issues General Education Elective (Cr. 3)</li> <li>Elective (Cr: 3)</li> <li>Computer Science Elective (Cr: 6)</li> </ul><ul><li class="acalog-course"><span><a aria-expanded="false" href="#" onclick="showCourse('46', '54139',this, 'a:2:{s:8:~location~;s:7:~program~;s:4:~core~;s:5:~27953~;}'); return false;">CSC 251 - Network Fundamentals</a> (Cr: 3)</span></li><li class="acalog-course"><span><a aria-expanded="false" href="#" onclick="showCourse('46', '54130',this, 'a:2:{s:8:~location~;s:7:~program~;s:4:~core~;s:5:~27953~;}'); return false;">CSC 260 - Visual Programming I</a> (Cr: 3)</span></li><li class="acalog-course"><span><a aria-expanded="false" href="#" onclick="showCourse('46', '54131',this, 'a:2:{s:8:~location~;s:7:~program~;s:4:~
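The chain of calls above can be tried on a much smaller document. The HTML below is a hand-written miniature that mimics the catalog's structure (it is not the real page, and the `CSC 101` and `MTH 225` entries are invented): find the text node, climb to its `<h3>`, climb again to the enclosing `<div>`, then search downward from there.

```python
import re
from bs4 import BeautifulSoup

# Hand-written miniature of the catalog markup, for illustration only.
html = '''
<div class="acalog-core"><h3><a name="FirstYear"></a>First Year</h3>
  <ul><li><a href="#">CSC 101 - Intro</a></li></ul></div>
<div class="acalog-core"><h3><a name="Sophomore"></a>Sophomore</h3>
  <ul><li><a href="#">CSC 251 - Network Fundamentals</a></li>
      <li><a href="#">MTH 225 - Probability</a></li></ul></div>
'''
mini = BeautifulSoup(html, 'html.parser')

heading_text = mini.find(string=re.compile('Sophomore'))  # The text node.
heading = heading_text.parent                             # Its <h3>.
section = heading.parent                                  # The enclosing <div>.

# Searching from `section` ignores courses listed under other years.
print(section.find_all(string=re.compile('CSC')))  # ['CSC 251 - Network Fundamentals']
```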

In [31]:
soup.find_all(string=re.compile('Sophomore'))[0].parent.parent.find_all(string=re.compile('CSC'))

['CSC 251\xa0-\xa0Network Fundamentals',
 'CSC 260\xa0-\xa0Visual Programming I',
 'CSC 261\xa0-\xa0Visual Programming II and Object-Oriented Design',
 'CSC 265\xa0-\xa0Discrete Structures',
 'CSC 280\xa0-\xa0Computer Architecture']
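Notice the `\xa0` characters in these strings: the catalog separates the course number from the title with non-breaking spaces rather than ordinary ones, so splitting on `' - '` will not find a match. A small sketch of the problem and two ways around it, using one of the strings above:

```python
import re

course = 'CSC 251\xa0-\xa0Network Fundamentals'

# An ordinary-space separator never matches, so nothing is split.
print(course.split(' - '))   # ['CSC 251\xa0-\xa0Network Fundamentals']

# Option 1: replace non-breaking spaces with regular spaces first.
print(course.replace('\xa0', ' ').split(' - ', 1))   # ['CSC 251', 'Network Fundamentals']

# Option 2: split on a dash surrounded by any whitespace; for str
# patterns in Python 3, \s also matches \xa0.
print(re.split(r'\s+-\s+', course, maxsplit=1))      # ['CSC 251', 'Network Fundamentals']
```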

In [32]:
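coursesStr = soup.find_all(string=re.compile('Sophomore'))[0].parent.parent.find_all(string=re.compile('CSC'))

courses = []

for course in coursesStr:
    # The catalog pads the dash with non-breaking spaces (\xa0), so convert
    # them to regular spaces before splitting.
    course = course.replace('\xa0', ' ').strip()
    if len(course) == 0:
        continue

    majorCourse, title = course.split(' - ', 1)
    major, courseNo = majorCourse.split()

    courses.append({
        'major': major,
        'courseNo': courseNo,
        'title': title.strip()
    })

courses

[{'major': 'CSC', 'courseNo': '251', 'title': 'Network Fundamentals'},
 {'major': 'CSC', 'courseNo': '260', 'title': 'Visual Programming I'},
 {'major': 'CSC',
  'courseNo': '261',
  'title': 'Visual Programming II and Object-Oriented Design'},
 {'major': 'CSC', 'courseNo': '265', 'title': 'Discrete Structures'},
 {'major': 'CSC', 'courseNo': '280', 'title': 'Computer Architecture'}]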

# HTML example (Prof. Feild's homepage)

In [38]:
homepageHTML = requests.get('https://hank.feild.org', timeout=2).text
homepageSoup = BeautifulSoup(homepageHTML, 'html.parser')
homepageHead = homepageSoup.find('head')
homepageHead

<head>
<title>
        Henry Feild
    </title>
<link href="/currentCSS.css" rel="stylesheet" type="text/css"/>
<link href="/favicon.ico" rel="icon" type="image/ico"/>
</head>

In [44]:
homepageHead.find_all('link')[0].attrs

{'rel': ['stylesheet'], 'type': 'text/css', 'href': '/currentCSS.css'}

In [52]:
# Find link to dissertation.
homepageSoup.find(string=re.compile(r'Ph\.D\.')).parent.find_all('a')[2].attrs['href']

'http://scholarworks.umass.edu/open_access_dissertations/790/'

## Exercises 

### Junior Applied Math Major courses

Write the code to extract the Junior year courses for Applied Math majors at Endicott. Use this URL: `https://catalog.endicott.edu/preview_program.php?catoid=46&poid=5333`



In [None]:
# TODO Code goes here.

## Synonyms

Let's see if we can extract synonyms for a word from `https://www.merriam-webster.com/thesaurus/`. 

### Questions
  1. How do we specify the URL of the page for the word we want to find synonyms for?
  2. How can we extract synonyms?
  3. Can we extract the strength of synonyms? How?

In [None]:
def extractSynonyms(word):
    # TODO implement
    pass

## Extracting birthdays from Wikipedia

In [6]:
# TODO Write code that will extract the birthday from a Wikipedia page about a person. E.g., Pedro Pascal.

wikiPage = requests.get('https://en.m.wikipedia.org/wiki/Pedro_Pascal', timeout=2)
assert wikiPage.status_code == requests.codes.ALL_OK
soup = BeautifulSoup(wikiPage.content, 'html.parser')

In [11]:
# Three ways to find the birthday element; only the last expression's result is displayed.
soup.find_all('span', class_='bday')
soup.find_all('span', attrs={'class': 'bday'})
soup.select('td.infobox-data span.bday')

[<span class="bday">1975-04-02</span>]
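The first two calls match any `<span class="bday">` on the page, while the CSS selector also requires the span to sit inside a `td.infobox-data` cell. A sketch of the difference on a hand-written fragment (a simplified stand-in for Wikipedia's infobox markup, with an extra invented date outside the infobox):

```python
from bs4 import BeautifulSoup

# Simplified, hand-written stand-in for a page with an infobox.
html = '''
<table class="infobox"><tr>
  <th class="infobox-label">Born</th>
  <td class="infobox-data"><span class="bday">1975-04-02</span></td>
</tr></table>
<p>Unrelated text with a <span class="bday">2000-01-01</span> outside the infobox.</p>
'''
page = BeautifulSoup(html, 'html.parser')

by_keyword = page.find_all('span', class_='bday')
by_attrs = page.find_all('span', attrs={'class': 'bday'})
by_css = page.select('td.infobox-data span.bday')

print(by_keyword == by_attrs)          # True
print([s.text for s in by_keyword])    # ['1975-04-02', '2000-01-01']
print([s.text for s in by_css])        # ['1975-04-02']
```

The narrower selector is the safer choice when a page might mention more than one date.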

In [12]:
bday = soup.select('td.infobox-data span.bday')[0].text
bday

'1975-04-02'

In [15]:
import datetime
print(datetime.datetime.now())

2025-02-14 09:21:09.660907


In [20]:
print(datetime.datetime.strptime(bday, '%Y-%m-%d'))

1975-04-02 00:00:00


In [23]:
(datetime.datetime.now() - datetime.datetime.strptime(bday, '%Y-%m-%d')).days // 365

49
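Dividing the day count by 365 ignores leap years, so the result can be off by one shortly before a birthday. A sketch of an exact alternative compares month and day directly; `age_on` is a hypothetical helper name, and the first reference date is the one printed by the earlier cell.

```python
import datetime

def age_on(birthday, today):
    """Whole years between two dates, counting a year only once the birthday has passed."""
    had_birthday = (today.month, today.day) >= (birthday.month, birthday.day)
    return today.year - birthday.year - (0 if had_birthday else 1)

born = datetime.datetime.strptime('1975-04-02', '%Y-%m-%d').date()
print(age_on(born, datetime.date(2025, 2, 14)))  # 49
print(age_on(born, datetime.date(2025, 4, 2)))   # 50

# A week before the 50th birthday the two approaches disagree.
near = datetime.date(2025, 3, 25)
print((near - born).days // 365)  # 50
print(age_on(born, near))         # 49
```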

# Web scraping dynamically loaded pages

Many web pages include JavaScript that loads additional HTML and data after the page is initially loaded. Using a regular `requests.get()` call will not trigger that JavaScript to run and therefore will not give you the complete page.

To get this kind of content, we need a real web browser plus a bridge (a *driver*) that lets our Python code control it: we ask the browser to load the page the way it normally would, then read the HTML after the JavaScript has run. `Selenium` is a browser automation tool with bindings for Python, Java, and other languages. I'm using Firefox with its bridge, geckodriver. This code will not work on your system as-is; follow the instructions in [this article](https://scrape.do/blog/how-to-scrape-javascript-rendered-web-pages-with-python/) to set up your browser (Firefox, Chrome, etc.) and the bridge for that specific browser. You will likely need to hardcode the path to the bridge in your code.


In [24]:
from bs4 import BeautifulSoup 
import requests

url = 'https://canvas.endicott.edu/courses/51797'
elementSelector = '.module-item-title span.external_link_icon'

# Question 1: Write the code that will download the HTML for the given URL and parse it as a BeautifulSoup object.
canvasPage = requests.get(url, timeout=2)
assert canvasPage.status_code == requests.codes.ALL_OK
soup = BeautifulSoup(canvasPage.content, 'html.parser')


# Display the selector.
soup.select(elementSelector)

[]

In [27]:
from selenium import webdriver
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from bs4 import BeautifulSoup

opts = webdriver.FirefoxOptions()
# opts.add_argument('-headless') # This says: don't open a browser window; may not be available for every browser.
service = webdriver.FirefoxService( executable_path='/snap/bin/geckodriver' )

ffox_driver = webdriver.Firefox(options=opts, service=service)

ffox_driver.get(url)
html1 = ffox_driver.page_source  # HTML as first loaded, possibly before the JavaScript has finished.

try:
    # Wait up to 10 seconds for at least one element matching the selector to appear.
    WebDriverWait(ffox_driver, 10).until(
        EC.presence_of_all_elements_located((By.CSS_SELECTOR, elementSelector))
    )
    print("Found it!")

except TimeoutException:
    print("Elements not found within the time frame")

html = ffox_driver.page_source
# # TODO output the element with BeautifulSoup.
# soup = BeautifulSoup(html, 'html.parser')
# soup.select(elementSelector)
    
# ffox_driver.quit() # Call this when you're all done.

Found it!


In [28]:
# HTML captured right after the page loaded (before the wait).
soup = BeautifulSoup(html1, 'html.parser')
soup.select(elementSelector)

# HTML captured after the wait; only this last result is displayed.
soup = BeautifulSoup(html, 'html.parser')
soup.select(elementSelector)

[<span class="external_link_icon" role="presentation" style="margin-inline-start: 5px; display: inline-block; text-indent: initial; "><svg style="width:1em; height:1em; vertical-align:middle; fill:currentColor" viewbox="0 0 1920 1920" xmlns="http://www.w3.org/2000/svg">
 <path d="M1226.667 267c88.213 0 160 71.787 160 160v426.667H1280v-160H106.667v800C106.667 1523 130.56 1547 160 1547h1066.667c29.44 0 53.333-24 53.333-53.333v-213.334h106.667v213.334c0 88.213-71.787 160-160 160H160c-88.213 0-160-71.787-160-160V427c0-88.213 71.787-160 160-160Zm357.706 442.293 320 320c20.8 20.8 20.8 54.614 0 75.414l-320 320-75.413-75.414 228.907-228.906H906.613V1013.72h831.254L1508.96 784.707l75.413-75.414Zm-357.706-335.626H160c-29.44 0-53.333 24-53.333 53.333v160H1280V427c0-29.333-23.893-53.333-53.333-53.333Z" fill-rule="evenodd"></path>
 </svg>
 <span class="screenreader-only">Links to an external site.</span></span>,
 <span class="external_link_icon" role="presentation" style="margin-inline-start: 5px; 

In [29]:
ffox_driver.get('https://www.endicott.edu')