Remember to periodically save!

# Web Scraping with Requests-HTML

Documentation: https://requests.readthedocs.io/projects/requests-html/en/latest/

If you've used Python to fetch web pages, connect to an API, or download files from the web, odds are you've encountered the requests library. The library we're using today, Requests-HTML (note the different name), was developed by the author of requests, Kenneth Reitz. It's intended to be an easy-to-learn web scraping library that builds on requests and on parsing libraries such as BeautifulSoup. BeautifulSoup is another good basic Python library for parsing HTML, and there are lots of tutorials and documentation on it.


## Part 1: Scraping headlines from the GW Hatchet

Before we start scraping, let's look at the robots.txt file for the site. This is a list of instructions to crawlers--usually search engines--about how to crawl the site and which parts to stay away from. 

https://www.gwhatchet.com/robots.txt

Note that this site is very open about crawling. That doesn't mean there aren't still legal and ethical considerations, but we're respecting their instructions about the site. 
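
Python's standard library can also read these rules for us. The sketch below is an illustration, not part of the workshop code: it parses a small made-up robots.txt (the rules shown are hypothetical, not the Hatchet's actual file) with `urllib.robotparser` and asks whether two URLs may be fetched.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, for illustration only.
sample_rules = """\
User-agent: *
Disallow: /wp-admin/
"""

parser = RobotFileParser()
parser.parse(sample_rules.splitlines())

# Ask whether a generic crawler ('*') may fetch each URL under these rules.
news_allowed = parser.can_fetch('*', 'https://www.gwhatchet.com/section/news/')
admin_allowed = parser.can_fetch('*', 'https://www.gwhatchet.com/wp-admin/')
print(news_allowed, admin_allowed)
```

To check a live site instead, `set_url()` followed by `read()` on the same parser object downloads and parses the real file.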


**Getting started**

First we need to install the `requests_html` Python library using the pip command.

In [None]:
! pip install requests_html

In [None]:
from requests_html import HTMLSession

We need to make an HTTP GET request to get the website's HTML. To do this we create an HTMLSession. That session object has methods to make GET requests.

The get() method takes the URL as an argument and sends an HTTP GET request to it.



In [None]:
session = HTMLSession()

r = session.get('https://www.gwhatchet.com/section/news/')

In [None]:
r.html

<HTML url='https://www.gwhatchet.com/section/news/'>

`r` is a Response object and it has an html attribute which contains all of the HTML from the page in Unicode. We can look at the web page's text using the text attribute.  This is like looking at the text content of all of the HTML tags. 

In [None]:
r.html.text

'News – The GW Hatchet\nwindow._wpemojiSettings = {"baseUrl":"https:\\/\\/s.w.org\\/images\\/core\\/emoji\\/12.0.0-1\\/72x72\\/","ext":".png","svgUrl":"https:\\/\\/s.w.org\\/images\\/core\\/emoji\\/12.0.0-1\\/svg\\/","svgExt":".svg","source":{"concatemoji":"https:\\/\\/www.gwhatchet.com\\/wp-includes\\/js\\/wp-emoji-release.min.js?ver=5.4"}}; /*! This file is auto-generated */ !function(e,a,t){var r,n,o,i,p=a.createElement("canvas"),s=p.getContext&&p.getContext("2d");function c(e,t){var a=String.fromCharCode;s.clearRect(0,0,p.width,p.height),s.fillText(a.apply(this,e),0,0);var r=p.toDataURL();return s.clearRect(0,0,p.width,p.height),s.fillText(a.apply(this,t),0,0),r===p.toDataURL()}function l(e){if(!s||!s.fillText)return!1;switch(s.textBaseline="top",s.font="600 32px Arial",e){case"flag":return!c([127987,65039,8205,9895,65039],[127987,65039,8203,9895,65039])&&(!c([55356,56826,55356,56819],[55356,56826,8203,55356,56819])&&!c([55356,57332,56128,56423,56128,56418,56128,56421,56128,56430,5

This html attribute also has other attributes that are shortcuts to parts of the content. There is a `links` attribute, which extracts all of the links in the page. Note that these are returned as a Python set, so duplicates are dropped and the links are not in the order they appear on the page.

In [None]:
r.html.links
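
Since `links` is a Python set, any sorting or filtering is up to us. Here is a minimal sketch, using a small hand-written set of URLs (hypothetical examples standing in for `r.html.links`) and the standard library's `urlparse`:

```python
from urllib.parse import urlparse

# Hypothetical stand-in for r.html.links (a set of URL strings).
links = {
    'https://www.gwhatchet.com/section/news/',
    'https://www.gwhatchet.com/section/sports/',
    'https://twitter.com/gwhatchet',
}

# Keep only links on the Hatchet's own domain, sorted for stable output.
internal = sorted(
    link for link in links
    if urlparse(link).netloc == 'www.gwhatchet.com'
)
print(internal)
```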

In [None]:
r.html.find('footer')

[<Element 'footer' >]

Now let's work at extracting certain parts of the page. Suppose we want all of the headlines. First we need to examine the page and determine if there is any markup that helps us identify headlines.  

*Take a look at the HTML of the page.*

We use CSS selectors to specify tags, classes, ids, or other attributes. 
https://www.w3schools.com/cssref/css_selectors.asp

We can see that all of the headlines are in h2 tags with a class of post-title. If there is a class that identifies the elements we want, that's ideal. 

In this case, we can use the h2 tag OR we could also use the class. *Show both parameters.*

We can now use the html object's find() method to provide a CSS selector. 

In [None]:
titles = r.html.find('h2')
# But note that there are <h2> tags in the footer so should use the post-title class instead.
titles = r.html.find('.post-title')

In [None]:
type(titles)

list

In [None]:
titles

[<Element 'h2' class=('post-title',)>,
 <Element 'h2' class=('post-title',)>,
 <Element 'h2' class=('post-title',)>,
 <Element 'h2' class=('post-title',)>,
 <Element 'h2' class=('post-title',)>,
 <Element 'h2' class=('post-title',)>,
 <Element 'h2' class=('post-title',)>,
 <Element 'h2' class=('post-title',)>,
 <Element 'h2' class=('post-title',)>,
 <Element 'h2' class=('post-title',)>]

`titles` is a list of Elements (capital E). We can access just the text content of those Elements using the `text` attribute. Let's look at just the first one to start.

In [None]:
titles[0].text

'Greek life fall recruitment expanded to freshmen this year'

Now let's grab the text of all of the headlines. We'll iterate through the Elements in the titles list and add to our headlines list the text attribute from each one:

In [None]:
headlines = [t.text for t in titles]
print(headlines)

['Greek life fall recruitment expanded to freshmen this year', 'Two professors receive NSA cyber scholarship grant', "SA Senate overrides Hill's veto ahead of court hearing", 'Global computer chip shortage delays technology order', 'GWSB center reviews plans to expand financial literacy', 'SMPA professor receives grant to protect female journalists', 'Trustees to announce Colonials decision this academic year: Speights', 'Interim communications vice president departing for American University', 'Thousands march to Supreme Court against abortion restrictions', 'Staff member transported to hospital after assault']


That's fine for one page, but if we want to get a lot more headlines, we'll need to check multiple pages. Let's look at the Hatchet website and see which pages have new articles. 

*Return to Hatchet website and page through news section.*

Note: there are only 10 headlines per page, and then we need to click Next to go to another page. Clicking on that, we can see that there is a page number in the URL. Knowing that, we can use Python to get the HTML of each page and then get the headlines from each one.

Before we jump in and do that, we're at the point where we should consider the most polite way to get this content from the Hatchet website. When you're web scraping, you don't want to interfere with a website's regular traffic by requesting lots of content all at once. We've got a lot of us online in this workshop, so let's give our requests a little space. We'll pause a second between requesting each page.

To do that pausing, we need to import the `sleep` function from the `time` module in the Python standard library.

In [None]:
from time import sleep

In [None]:
news = []

# get the first 4 pages
for i in range(1,5):
    resp = session.get('https://www.gwhatchet.com/section/news/page/{}'.format(i))
    titles = resp.html.find('.post-title')
    headlines = [t.text for t in titles]
    news.extend(headlines)
    sleep(1)

print(news)

['Greek life fall recruitment expanded to freshmen this year', 'Two professors receive NSA cyber scholarship grant', "SA Senate overrides Hill's veto ahead of court hearing", 'Global computer chip shortage delays technology order', 'GWSB center reviews plans to expand financial literacy', 'SMPA professor receives grant to protect female journalists', 'Trustees to announce Colonials decision this academic year: Speights', 'Interim communications vice president departing for American University', 'Thousands march to Supreme Court against abortion restrictions', 'Staff member transported to hospital after assault', 'Crime log: Woman assaults student, police officer', 'Law professor finds ankle monitors as restrictive as incarceration', 'Officials launch fundraising initiative at Bicentennial Bash', 'Students protest Title IX case handling at Commencement', 'Writing Center to continue virtual appointments with in-person service', 'History department boosting diversity one year after Krug',

In [None]:
len(news)

40
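
The loop above can be wrapped in a reusable function. The following is a sketch, not the workshop's own code: `scrape_pages` and its `fetch` parameter are hypothetical names, and `fetch` stands for any function that takes a URL and returns a list of headlines (for example, one that calls `session.get` and `find`). Passing the fetcher in lets us try the logic without touching the network.

```python
from time import sleep

def scrape_pages(fetch, urls, delay=1.0):
    """Call fetch(url) for each URL, pausing `delay` seconds between requests."""
    results = []
    for i, url in enumerate(urls):
        if i > 0:
            sleep(delay)  # pause before every request except the first
        results.extend(fetch(url))
    return results

base = 'https://www.gwhatchet.com/section/news/page/{}'
urls = [base.format(i) for i in range(1, 5)]

# A fake fetcher for demonstration; it makes no network request.
def fake_fetch(url):
    return ['headline from ' + url]

demo = scrape_pages(fake_fetch, urls, delay=0)
print(len(demo))
```

Keeping the delay as a parameter makes it easy to slow down further when many people are scraping the same site at once.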

Now that you have this data, you could go on to do other analysis of it. Maybe you want to look for mentions of certain entities or do sentiment analysis or topic modelling of the headlines. There's a workshop on Natural Language Processing (NLP) Wednesday, October 13 at 11:30 if you want to learn more about working with text data: https://library.gwu.edu/news-events/events/python-natural-language-processing-0

**Exercise:**

Collect the bylines from the news articles on the page. Start with one page of news articles.

Hints:
1. Determine what tag or class identifies the text you want.
2. Use the find() method to identify the elements to collect and put them into a list. Assign that to a variable called bylines.
3. Extract the text from each element and assign the results to a list called authors.


In [None]:
# Note that if you use just byline-author as below you will also pick up the text "By". 
# bylines = r.html.find('.byline-author')
# bylines[0].text

In [None]:
bylines = r.html.find('.byline-author a')

In [None]:
bylines[0].text

'Abby Kennedy'

In [None]:
authors = [b.text for b in bylines]
authors

['Abby Kennedy',
 'Lauren Sforza',
 'Eddie Herzig',
 'Henry Huvos',
 'Nicholas Pasion',
 'Daniel Patrick Galgano',
 'Katelyn Aluise',
 'Cristina Stassis',
 'Isha Trivedi',
 'Isha Trivedi',
 'Sejal Govindarao',
 'Zachary Blackburn',
 'Abby Kennedy',
 'Isha Trivedi',
 'Nicholas Pasion',
 'Zachary Blackburn']

If you want to use this data to get a count of how many times each author has been published, you could count them. There is a count() method in Python.

In [None]:
authors.count("Zachary Blackburn")

2

The .count() method is a bit inefficient for our purposes because you have to call it once for each possible author's name. It's better to use Python's Counter from the collections module. We can give Counter a list and it will return a count of each value that appears.

In [None]:
from collections import Counter
a = Counter(authors)
a

Counter({'Abby Kennedy': 2,
         'Cristina Stassis': 1,
         'Daniel Patrick Galgano': 1,
         'Eddie Herzig': 1,
         'Henry Huvos': 1,
         'Isha Trivedi': 3,
         'Katelyn Aluise': 1,
         'Lauren Sforza': 1,
         'Nicholas Pasion': 2,
         'Sejal Govindarao': 1,
         'Zachary Blackburn': 2})

In [None]:
b = Counter(authors)
print(b.most_common(3))
#print(b["Abby Kennedy"])


[('Isha Trivedi', 3), ('Abby Kennedy', 2), ('Nicholas Pasion', 2)]
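
A `Counter` behaves much like a dictionary, so individual counts can be looked up by key, and a name that never appears gives 0 rather than raising an error. Here is a short sketch with a shortened, hand-typed list of names (`sample_authors` is a stand-in for the full `authors` list):

```python
from collections import Counter

sample_authors = ['Isha Trivedi', 'Abby Kennedy', 'Isha Trivedi', 'Zachary Blackburn']
counts = Counter(sample_authors)

print(counts['Isha Trivedi'])   # count for one author
print(counts['Nobody Here'])    # a missing name counts as zero

# Sort by count (highest first), then by name to break ties.
ranked = sorted(counts.items(), key=lambda pair: (-pair[1], pair[0]))
print(ranked)
```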


## Part 2: More scraping! Scraping course listings from the Schedule of Classes

**Have you saved lately? Be sure to save your notebook file!**

*Bring up the GW Schedule of Classes in your browser* 
https://my.gwu.edu/mod/pws/subjects.cfm?campId=1&termId=202103

We recommend starting with one page and then building up as you determine how to access the data you need. We'll start with just the American Studies courses for Spring. *Drill down to the Main Campus Spring semester and then select American Studies*.

Let's suppose we want to have a spreadsheet listing all of the courses in our department for our major. 

We will need to parse the HTML to get the information in each of these rows. (We'll ignore the comments and Find Books links.)

### robots.txt
Before doing anything, let's look at the robots.txt

https://my.gwu.edu/robots.txt
Our page is in /mod/pws and there are no prohibitions about crawling that content. 

**Now we can move on and look at the HTML for the page we want:**

*View Page Source.*

The relevant markup includes the `tr` tags in the tables. It looks like there is a class on each row that we could use, `.crseRow1`. We could look for all of the `td`s within `.crseRow1`, but then we would lose the context of which row they were in.

Let's first see what we get if we select every cell inside the rows with that class.



In [None]:
session_new = HTMLSession()

amst_url = 'https://my.gwu.edu/mod/pws/courses.cfm?campId=1&termId=202103&subjId=AMST'
response = session_new.get(amst_url)

In [None]:
tds = response.html.find('.crseRow1 td')
tds[0].text

'WAITLIST'

In [None]:
for td in tds[:20]:
    print(td.text)

WAITLIST
68081
AMST 1050
11
What is Democracy?
3.00
Anker, E
MON B32
T
12:45PM - 03:15PM
08/30/21 - 12/11/21

OPEN
62945
AMST 1100
10
Politics and Film
3.00
Anker, E
FNGR 108
AND
FNGR 108
M
12:45PM - 02:00PM
AND
M
07:10PM - 09:40PM


In [None]:
len(tds)
# We have a long flat list of cells, with no record of which row each one came from.

638

Instead, let's try to work with each row and the data within it. 


In [None]:
rows = response.html.find('.crseRow1')
rows[0].text

'WAITLIST\n68081\nAMST 1050\n11\nWhat is Democracy?\n3.00\nAnker, E\nMON B32\nT\n12:45PM - 03:15PM\n08/30/21 - 12/11/21'

We get back a single string, with each cell's text joined together. It looks like the text of each cell, as gathered by requests_html's `text` attribute, is separated from the next by a `\n` newline character.

We still have access to the elements within the table rows object. For example, we can look at the cells (`td` tags) in our first row as follows: 

In [None]:
cells = rows[0].find('td')

In [None]:
cells[0].text

'WAITLIST'

In [None]:
cells[8].text

'T\n12:45PM - 03:15PM'

Looking at the site, we can see that the time for the class is indeed on a different line from the day of the week, INSIDE THE SAME CELL. Let's take a look at each cell: 

In [None]:
for c in cells:
    print("td: ", c.text)

td:  WAITLIST
td:  68081
td:  AMST 1050
td:  11
td:  What is Democracy?
td:  3.00
td:  Anker, E
td:  MON B32
td:  T
12:45PM - 03:15PM
td:  08/30/21 - 12/11/21
td:  
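
This is why splitting the whole row's text on newlines is not a reliable way to recover the columns. Here is a small sketch using the row text shown earlier, typed in as a plain string:

```python
row_text = ('WAITLIST\n68081\nAMST 1050\n11\nWhat is Democracy?\n3.00\n'
            'Anker, E\nMON B32\nT\n12:45PM - 03:15PM\n08/30/21 - 12/11/21')

pieces = row_text.split('\n')

# The day and the time come from one cell but land in separate pieces,
# so later pieces no longer line up with the table's columns.
print(pieces[8], '|', pieces[9])
```

Working cell by cell, as we do next, keeps each value attached to its column.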


Now that we know how to access the text of each `td` and that there are some extra newlines in there, let's create a function that will strip out newlines in the text of the td tags. Our function will return a list that has the text of all of the cells. 

In [None]:
def get_tds(tr):
    """Return the text of each td in a row, with newlines replaced by spaces."""
    row = [td.text.replace('\n', ' ') for td in tr.find('td')]
    return row
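
To see what that clean-up does, here it is applied to a plain string. The `clean` helper is a hypothetical variation, not part of the workshop code: it collapses any run of whitespace (newlines, tabs, repeated spaces) into a single space, which may be handy if a cell contains more than just newlines.

```python
cell_text = 'T\n12:45PM - 03:15PM'

# The approach used in get_tds: swap each newline for a space.
replaced = cell_text.replace('\n', ' ')
print(replaced)

def clean(text):
    """Collapse all runs of whitespace into single spaces."""
    return ' '.join(text.split())

collapsed = clean('FNGR 108\nAND\n  FNGR 108')
print(collapsed)
```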

Now we can use this to get all the relevant text in all of the course tables on the page, and collect them into a courses list.

In [None]:
courses = []
for tr in rows:
    courses.append(get_tds(tr))

In [None]:
# Look at courses. Note that there are some here that don't appear on the web page! 
# That's because there is display:none as a style on some of the tables. 

courses

Now that we've extracted this data, we want to hold onto it. We'll create a CSV file to hold this data. 

In [None]:
import csv

with open('amst.csv', 'w', newline='') as csvfile:
    writer = csv.writer(csvfile, delimiter=',')
    for course in courses:
        writer.writerow(course)

In [None]:
!head amst.csv

WAITLIST,68081,AMST 1050,11,What is Democracy?,3.00,"Anker, E",MON B32,T 12:45PM - 03:15PM,08/30/21 - 12/11/21,
OPEN,62945,AMST 1100,10,Politics and Film,3.00,"Anker, E",FNGR 108 AND FNGR 108,M 12:45PM - 02:00PM AND M 07:10PM - 09:40PM,08/30/21 - 12/11/21,Linked
CLOSED,62983,AMST 1100,30,Discussion,0.00,"Anker, E",GELM B04,W 09:35AM - 10:25AM,08/30/21 - 12/11/21,Find Books
CLOSED,62984,AMST 1100,31,Discussion,0.00,"Anker, E",1957 E B16,W 11:10AM - 12:00PM,08/30/21 - 12/11/21,Find Books
OPEN,62985,AMST 1100,32,Discussion,0.00,"Anker, E",1957 E B17,W 09:35AM - 10:25AM,08/30/21 - 12/11/21,Find Books
CLOSED,62986,AMST 1100,33,Discussion,0.00,"Anker, E",FNGR 108,W 12:45PM - 01:35PM,08/30/21 - 12/11/21,Find Books
CLOSED,64060,AMST 1100,34,Discussion,0.00,"Anker, E",1957 E B17,W 12:45PM - 01:35PM,08/30/21 - 12/11/21,Find Books
CLOSED,64061,AMST 1100,35,Discussion,0.00,"Anker, E",1957 E 214,W 02:20PM - 03:10PM,08/30/21 - 12/11/21,Find Books
WAITLIST,68545,AMST 1100,36,Discussion,0.00,"
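
Notice that the instructor names are wrapped in double quotes in the file. The `csv` module does this because the value itself contains a comma, and `csv.reader` undoes it when reading. The sketch below uses an in-memory buffer instead of a file and adds a header row, which the file written above does not have. The header names are the same column labels used for the DataFrame later in this notebook.

```python
import csv
import io

header = ['STATUS', 'CRN', 'SUBJECT', 'SECT', 'COURSE', 'CREDIT',
          'INSTR.', 'BLDG/RM', 'DAY/TIME', 'FROM / TO', 'Books']
row = ['WAITLIST', '68081', 'AMST 1050', '11', 'What is Democracy?', '3.00',
       'Anker, E', 'MON B32', 'T 12:45PM - 03:15PM', '08/30/21 - 12/11/21', '']

# Write to an in-memory text buffer rather than a file on disk.
buffer = io.StringIO(newline='')
writer = csv.writer(buffer)
writer.writerow(header)
writer.writerow(row)

# Read the text back; the quoted field comes out as one value again.
buffer.seek(0)
rows_read = list(csv.reader(buffer))
print(rows_read[1][6])
```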

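We could also load this data into pandas, if we wanted to use some of its filtering and analysis methods. Here we build the DataFrame directly from the `courses` list rather than reading the CSV file back in.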

In [None]:
import pandas as pd


In [None]:
df = pd.DataFrame(courses, columns = ['STATUS', 'CRN', 'SUBJECT', 'SECT', 'COURSE', 'CREDIT', 'INSTR.', 'BLDG/RM', 'DAY/TIME', 'FROM / TO', 'Books'])

In [None]:
df

In [None]:
df[df['STATUS'] == "OPEN"]


Instructions for mounting your Google Drive: 
https://www.marktechpost.com/2019/06/07/how-to-connect-google-colab-with-google-drive/

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
!cp amst.csv /content/drive/MyDrive


Now that we've done this on one page, what might be next steps?

* Create a function to modularize the work we did. Then apply this to other pages (Note that there were 2 pages for AMST) and other departments. 
* Consider how you can be respectful of the website. Where could you insert some sleep statements? 
* Maybe there are better times of day to run this kind of scraping?
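
As a starting point for the first suggestion, the URL for any department can be built from its parts. This is a sketch under assumptions: `course_url` is a hypothetical helper, and the parameter names `campId`, `termId`, and `subjId` are copied from the American Studies URL used above. How the site addresses the second page of results is not covered here.

```python
from urllib.parse import urlencode

BASE = 'https://my.gwu.edu/mod/pws/courses.cfm'

def course_url(subj_id, camp_id=1, term_id='202103'):
    """Build a Schedule of Classes URL for one subject code."""
    query = urlencode({'campId': camp_id, 'termId': term_id, 'subjId': subj_id})
    return BASE + '?' + query

amst = course_url('AMST')
print(amst)
```

Each URL built this way could then be passed to `session_new.get`, with a `sleep` between requests to stay polite.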