Remember to periodically save!

# Web Scraping with Requests-HTML

Documentation: https://requests.readthedocs.io/projects/requests-html/en/latest/

If you've used Python for getting web pages, connecting to an API, or getting files off the web, odds are you've encountered the requests library. The library we're using today, Requests-HTML (note the different name), was developed by the author of requests, Kenneth Reitz. It's intended to be an easy-to-learn web scraping library; it's basically a layer on top of requests and HTML parsing libraries such as PyQuery and lxml. BeautifulSoup is another good basic Python library for parsing HTML, and there are lots of tutorials and documentation for it.


## Part 1: Scraping headlines from the GW Hatchet

Before we start scraping, let's look at the robots.txt file for the site. This is a list of instructions to crawlers--usually search engines--about how to crawl the site and which parts to stay away from. 

https://www.gwhatchet.com/robots.txt

Note that this site is very open about crawling. That doesn't mean there aren't still legal and ethical considerations, but we're respecting their instructions about the site. 
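
If you would like to check robots.txt rules from Python rather than by eye, the standard library's `urllib.robotparser` can help. Here is a minimal sketch. The rules in `example_robots` are made up for illustration and are not the actual contents of the Hatchet's robots.txt file.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, for illustration only.
# This is NOT the actual file served by gwhatchet.com.
example_robots = """\
User-agent: *
Disallow: /wp-admin/
"""

parser = RobotFileParser()
parser.parse(example_robots.splitlines())

# Ask whether a generic crawler ("*") may fetch each URL under these rules.
news_allowed = parser.can_fetch("*", "https://www.gwhatchet.com/section/news/")
admin_allowed = parser.can_fetch("*", "https://www.gwhatchet.com/wp-admin/")
print(news_allowed, admin_allowed)
```

To check a live site, you could call `parser.set_url(...)` and `parser.read()` instead of `parse()`, which downloads the real file.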


**Getting started**

First we need to install the `requests_html` Python library using the pip command.

In [None]:
!pip install requests_html

In [71]:
from requests_html import HTMLSession

We need to make an HTTP GET request to get the website's HTML. To do this we create an HTMLSession. That session object has methods for making HTTP requests.

The get() method takes the URL as an argument and sends an HTTP GET request.



In [5]:
session = HTMLSession()

r = session.get('https://www.gwhatchet.com/section/news/')

In [6]:
r.html

<HTML url='https://www.gwhatchet.com/section/news/'>

`r` is a Response object, and it has an `html` attribute which contains all of the HTML from the page in Unicode. We can look at the web page's text using the `text` attribute of `r.html`. This is like looking at the text content of all of the HTML tags.

In [None]:
r.html.text

This html attribute also has many other attributes which are shortcuts to parts of the content. There is a links attribute which extracts all of the links in the page. Note that these are returned as a Python set, so they are not in the order in which they appear on the page.

In [None]:
r.html.links

In [9]:
r.html.find('footer')

[<Element 'footer' >]

Now let's work at extracting certain parts of the page. Suppose we want all of the headlines. First we need to examine the page and determine if there is any markup that helps us identify headlines.  

*Bring up the HTML of the page.*

We use CSS selectors to specify tags, classes, ids, or other attributes. 
https://www.w3schools.com/cssref/css_selectors.asp

We can see that all of the headlines are in h2 tags with a class of post-title. If there is a class that identifies the elements we want, that's ideal. 

In this case, we can use the h2 tag OR we could also use the class. *Show both parameters.*

We can now use the html object's find() method to provide a CSS selector. 

In [14]:
titles = r.html.find('h2')
#titles = r.html.find('.post-title')  # a class selector needs a leading dot

In [15]:
type(titles)

list

In [None]:
titles

`titles` is a list of Elements (capital E). We can access just the text content of those Elements using the `text` attribute. Let's look at just the first one to start.

In [17]:
titles[0].text

'DDOT releases first-ever Foggy Bottom traffic crash data'

Now let's grab the text of all of the headlines. We'll iterate through the Elements in the titles list and add to our headlines list the text attribute from each one:

In [19]:
headlines = [t.text for t in titles]
print(headlines)

['DDOT releases first-ever Foggy Bottom traffic crash data', "Faculty call on officials to enact a 'cluster hire' of minority faculty", 'GW marks 200 years, begins monthslong celebration', "Board delays decision on next year's cost of attendance", 'Students rely on hobbies to maintain mental health during pandemic', "Officials debut task force on the future of GW's academics", "GW Hillel launches program to aid senior citizens' vaccine registrations", 'COVID-19 a factor in higher volume of research output this year', 'School Without Walls reopens with quarter of student body', 'RHA expands programming as more students return to campus', 'Sections', 'Reader Services', 'Learn More', 'Get in Touch']


That's fine for one page, but if we want to get a lot more headlines, we'll need to check multiple pages. Let's look at the Hatchet website and see which pages have new articles. 

*Return to Hatchet website and page through news section.*

Note: there are only 10 headlines per page, and then we need to click Next to go to another page. Clicking on that, we can see that there is a page number in the URL. Knowing that, we can use Python to get the HTML of each page and then get the headlines from each one.

Before we jump in and do that, we're at the point where we should consider the most polite way to get this content from the Hatchet website. When you're web scraping, you don't want to interfere with a website's regular traffic by requesting lots of content all at once. We've got a lot of us online in this workshop, so let's give our requests a little space. We'll pause a second between requesting each page.

To do that pausing, we need to import the sleep function from the time module in the Python standard library.

In [20]:
from time import sleep

In [33]:
news = []

# get the first 4 pages
for i in range(1,5):
    resp = session.get('https://www.gwhatchet.com/section/news/page/{}'.format(i))
    #titles = resp.html.find('h2')
    titles = resp.html.find('.post-title')
    headlines = [t.text for t in titles]
    news.extend(headlines)
    sleep(1)

print(news)

['DDOT releases first-ever Foggy Bottom traffic crash data', "Faculty call on officials to enact a 'cluster hire' of minority faculty", 'GW marks 200 years, begins monthslong celebration', "Board delays decision on next year's cost of attendance", 'Students rely on hobbies to maintain mental health during pandemic', "Officials debut task force on the future of GW's academics", "GW Hillel launches program to aid senior citizens' vaccine registrations", 'COVID-19 a factor in higher volume of research output this year', 'School Without Walls reopens with quarter of student body', 'LGBTQ business student group seeks to build community', 'RHA expands programming as more students return to campus', 'Crosswalk among renovations installed at intersection', 'Crime log: Nine Title IX cases filed with GWPD last week', 'Human rights experts analyze ethnic minority experiences in China', 'Panel addresses vaccine misinformation, hesitancy', 'Officials exploring options for in-person Commencement in 

In [23]:
len(news)

40
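
The loop above can also be split into two small helpers, which makes the pause easier to adjust. This is only a sketch: `page_urls`, `collect_headlines`, and `fake_get_headlines` are names invented here, and the fake function stands in for the real request so the example runs without touching the Hatchet website.

```python
from time import sleep

def page_urls(base_url, first_page, last_page):
    """Build the numbered page URLs, inclusive of both ends."""
    return [f"{base_url}{n}" for n in range(first_page, last_page + 1)]

def collect_headlines(urls, get_headlines, delay=1.0):
    """Call get_headlines(url) for each URL, pausing between requests."""
    collected = []
    for position, url in enumerate(urls):
        if position > 0:
            sleep(delay)  # pause before every request except the first
        collected.extend(get_headlines(url))
    return collected

urls = page_urls("https://www.gwhatchet.com/section/news/page/", 1, 4)

# Stand-in for the real request, so this sketch runs without network access.
def fake_get_headlines(url):
    return [f"headline from {url}"]

demo = collect_headlines(urls, fake_get_headlines, delay=0)
print(len(urls), len(demo))
```

In the notebook, the function passed as `get_headlines` could wrap the same calls used above: `session.get(url)` followed by `find('.post-title')` and the list comprehension over `t.text`.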

Now that you have this data, you could go on to do other analysis of it. Maybe you want to look for mentions of certain entities or do sentiment analysis or topic modelling of the headlines. There's a workshop on Natural Language Processing, or NLP, tomorrow at 1PM if you want to learn more about working with text data: https://library.gwu.edu/news-events/events/python-natural-language-processing-0
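
As a very small first step in that direction, here is a sketch that looks for headlines mentioning a term. The helper name `mentioning` is made up for this example, and it uses simple substring matching, which is much cruder than real entity recognition (a short term can also match inside longer words).

```python
# A few headlines copied from the output above.
sample_headlines = [
    'DDOT releases first-ever Foggy Bottom traffic crash data',
    'GW marks 200 years, begins monthslong celebration',
    "Officials debut task force on the future of GW's academics",
    "GW Hillel launches program to aid senior citizens' vaccine registrations",
    'School Without Walls reopens with quarter of student body',
]

def mentioning(headlines, term):
    """Return the headlines that contain term, ignoring case."""
    return [h for h in headlines if term.lower() in h.lower()]

gw_headlines = mentioning(sample_headlines, 'GW')
print(len(gw_headlines))
```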

**Exercise:**

Collect the bylines from the news articles on the page. Start with one page of news articles.

Hints:
1. Determine what tag or class identifies the text you want.
2. Use the find() method to identify the elements and collect them into a list. Assign that to a variable called bylines.
3. Extract the text from each element and assign the results to a list called authors.


In [25]:
# Note that if you use just byline-author as below you will also pick up the text "By". 
# bylines = r.html.find('.byline-author')
# bylines[0].text

In [26]:
bylines = r.html.find('.byline-author > a')

In [27]:
bylines[0].text

'Chow Paueksakon'

In [28]:
authors = [b.text for b in bylines]
authors

['Chow Paueksakon',
 'Ishani Chettri',
 'Nicholas Pasion',
 'Zach Schonfeld',
 'Zach Schonfeld',
 'Abby Kennedy',
 'Yankun Zhao',
 'Brennan Fiske',
 'Lia DeGroot',
 'Michelle Vassilev',
 'Clara Duhon',
 'Jarrod Wardwell',
 'Yutong Jiang']

If you want to use this data to see how many times each author has been published, you could count them. Python lists have a count() method.

In [29]:
authors.count("Jarrod Wardwell")

1

The .count() method is a bit inefficient for our purposes because you have to use the count method with each possible author's name. Better to use Python's Counter from the collections module. We can give Counter a list and it will return a count of each value that appears. 

In [30]:
from collections import Counter
a = Counter(authors)
a

Counter({'Abby Kennedy': 1,
         'Brennan Fiske': 1,
         'Chow Paueksakon': 1,
         'Clara Duhon': 1,
         'Ishani Chettri': 1,
         'Jarrod Wardwell': 1,
         'Lia DeGroot': 1,
         'Michelle Vassilev': 1,
         'Nicholas Pasion': 1,
         'Yankun Zhao': 1,
         'Yutong Jiang': 1,
         'Zach Schonfeld': 2})
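
A Counter can also tell you directly who appears most often, via its `most_common()` method. A short sketch, using the authors list from the output above:

```python
from collections import Counter

# The authors list from the output above.
authors = ['Chow Paueksakon', 'Ishani Chettri', 'Nicholas Pasion',
           'Zach Schonfeld', 'Zach Schonfeld', 'Abby Kennedy',
           'Yankun Zhao', 'Brennan Fiske', 'Lia DeGroot',
           'Michelle Vassilev', 'Clara Duhon', 'Jarrod Wardwell',
           'Yutong Jiang']

author_counts = Counter(authors)

# most_common(1) returns a list with one (value, count) pair.
top_author, top_count = author_counts.most_common(1)[0]
print(top_author, top_count)
```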

## Part 2: More scraping! Scraping course listings from the Schedule of Classes

**Have you saved lately? Be sure to save your notebook file!**

*Bring up the GW Schedule of Classes* 
https://my.gwu.edu/mod/pws/subjects.cfm?campId=1&termId=202101

We recommend starting with one page, and then building up as you determine how to access the data you need. We'll start with just the American Studies courses for Spring. *Drill down to the Main Campus Spring semester and then select American Studies*.

Let's suppose we want to have a spreadsheet listing all of the courses in our department for our major. 

We will need to parse the HTML to get the information in each of these rows. (We'll ignore the comments and Find Books links.)

### robots.txt
Before doing anything, let's look at the robots.txt file.

https://my.gwu.edu/robots.txt

Our page is in /mod/pws and there are no prohibitions about crawling that content.

**Now we can move on and look at the HTML for the page we want:**
*View Page Source. Discuss HTML tags used for tables: table, tr, td, etc.*

The relevant tags include the `tr` tags in the tables. It looks like there is a class on each row which we could use, `.crseRow1`. We could look for all of the `td`s within `.crseRow1`, but then we would lose the context of which row they were in.

Let's see what we get with the class that's on each row in each table. 



In [53]:
session_new = HTMLSession()

amst_url = 'https://my.gwu.edu/mod/pws/courses.cfm?campId=1&termId=202101&subjId=AMST'
response = session_new.get(amst_url)

In [42]:
# Might be thinking we could look for all of the td's within .crseRow1
rows = response.html.find('.crseRow1 > td')
rows[0].text

'OPEN'

In [43]:
len(rows)
# We have a long list of cells, and we would lose the context of which row they were in.

495

Instead, let's try to work with each row, so we have a list of rows, and drill down to extract the td content.

In [36]:
trs = response.html.find('.crseRow1')
trs[0].text

'OPEN\n15114\nAMST 1000\n10\nBodies of Work\n3.00\nIvy, N\nREMOTE INSTR\nM\n12:45PM - 03:15PM\n01/11/21 - 04/26/21'

We get back a string, with each cell's text concatenated. It looks like the text for each cell is separated by a `\n` newline character. We still have access to the elements within the table rows object. For example, we can look at the cells (`td` tags) in our first row as follows: 

In [44]:
cells = trs[0].find('td')

In [46]:
cells[0].text

'OPEN'

It appears that there is a newline between each cell, which could be useful as a delimiter to separate the td's in each row. However, looking at the site, we can see that the time for the class is on a different line from the day of the week. Let's take a look at each cell:

In [52]:
for c in cells:
    print("td: ",c.text)

td:  OPEN
td:  15114
td:  AMST 1000
td:  10
td:  Bodies of Work
td:  3.00
td:  Ivy, N
td:  REMOTE INSTR
td:  M
12:45PM - 03:15PM
td:  01/11/21 - 04/26/21
td:  
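
To see concretely why splitting the row text on newlines is not a reliable way to separate cells, here is a small sketch using the row string from the output above. The variable names are just for this example.

```python
# The text of the first row, copied from the output above.
row_text = ('OPEN\n15114\nAMST 1000\n10\nBodies of Work\n3.00\nIvy, N\n'
            'REMOTE INSTR\nM\n12:45PM - 03:15PM\n01/11/21 - 04/26/21')

pieces = row_text.split('\n')

# The row has ten cells with text, but the split gives one extra piece,
# because the day and the time share a single cell.
print(len(pieces))
print(pieces[8:10])
```

The day (`'M'`) and the time (`'12:45PM - 03:15PM'`) end up as separate items, so the columns after them would be shifted by one.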


Now that we know how to access the text of each `td` and that there are some extra newlines in there, let's create a function that will strip out newlines in the text of the td tags. Our function will return a list that has the text of all of the cells. 

In [55]:
def get_tds(tr):
    row = [td.text.replace('\n', ' ') for td in tr.find('td')]
    return row

Now we can use this to get all the relevant text in all of the course tables on the page, and collect them into a courses list.

In [57]:
courses = []
for tr in trs:
    courses.append(get_tds(tr))

In [None]:
# Look at courses. Note that there are some here that don't appear on the web page! 
# That's because there is display:none as a style on some of the tables. 

courses

Now that we've extracted this data, we want to hold onto it. We'll create a CSV file to hold this data. 

In [69]:
import csv

with open('amst.csv', 'w', newline='') as csvfile:
    writer = csv.writer(csvfile, delimiter=',')
    for course in courses:
        writer.writerow(course)

In [70]:
!head amst.csv

OPEN,15114,AMST 1000,10,Bodies of Work,3.00,"Ivy, N",REMOTE INSTR,M 12:45PM - 03:15PM,01/11/21 - 04/26/21,
WAITLIST,17707,AMST 1000,11,Media Culture & COVID,3.00,"McAlister, M",REMOTE INSTR,MW 02:20PM - 03:35PM,01/11/21 - 04/26/21,
OPEN,18875,AMST 1000,12,Nature & Culture of Children,3.00,"Cohen-Cole, J",REMOTE INSTR,W 03:30PM - 06:00PM,01/11/21 - 04/26/21,
WAITLIST,17948,AMST 1050,10,Race and Racism in US History,3.00,"Guglielmo, T",REMOTE INSTR,MW 09:35AM - 10:50AM,01/11/21 - 04/26/21,
CLOSED,17949,AMST 1200,10,The Sixties in America,3.00,"Osman, S",REMOTE INSTR,TR 11:10AM - 12:00PM,01/11/21 - 04/26/21,Linked
WAITLIST,17950,AMST 1200,30,Discussion,0.00,"Osman, S",REMOTE INSTR,R 09:35AM - 10:25AM,01/11/21 - 04/26/21,Find Books
WAITLIST,17951,AMST 1200,31,Discussion,0.00,"Osman, S",REMOTE INSTR,R 12:45PM - 01:35PM,01/11/21 - 04/26/21,Find Books
WAITLIST,17952,AMST 1200,32,Discussion,0.00,"Osman, S",REMOTE INSTR,R 02:20PM - 03:10PM,01/11/21 - 04/26/21,Find Books
WAITLIST,17953,A
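
Note that the file has no header row. If you want one, you can write the column names first. The sketch below uses an in-memory buffer in place of the file so that it runs anywhere, and the column names are the ones used for the DataFrame later in this notebook. It also reads the data back to show that the quoted `"Ivy, N"` comes through as a single value.

```python
import csv
import io

header = ['STATUS', 'CRN', 'SUBJECT', 'SECT', 'COURSE', 'CREDIT',
          'INSTR.', 'BLDG/RM', 'DAY/TIME', 'FROM / TO', 'Books']

# One row copied from the output above.
sample_courses = [
    ['OPEN', '15114', 'AMST 1000', '10', 'Bodies of Work', '3.00', 'Ivy, N',
     'REMOTE INSTR', 'M 12:45PM - 03:15PM', '01/11/21 - 04/26/21', ''],
]

# In-memory stand-in for open('amst.csv', 'w', newline='').
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(header)           # column names first
writer.writerows(sample_courses)  # then all of the data rows

# Read it back; DictReader uses the first row as the keys.
buffer.seek(0)
rows_back = list(csv.DictReader(buffer))
print(rows_back[0]['INSTR.'])
```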

We could also load this data into pandas, if we wanted to use some of its filtering and analysis methods. Here we build the DataFrame directly from our `courses` list.

In [62]:
import pandas as pd


In [63]:
df = pd.DataFrame(courses, columns = ['STATUS', 'CRN', 'SUBJECT', 'SECT', 'COURSE', 'CREDIT', 'INSTR.', 'BLDG/RM', 'DAY/TIME', 'FROM / TO', 'Books'])

In [None]:
df

In [None]:
df[df['STATUS'] == "OPEN"]
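
If you would rather start from the saved CSV than from the `courses` list, `pd.read_csv` can supply the column names at load time. This is a sketch under a few assumptions: it reads two rows (copied from the CSV output above) from an in-memory string instead of `amst.csv`, and it uses `dtype=str` so that values like the CRN stay as text.

```python
import io
import pandas as pd

columns = ['STATUS', 'CRN', 'SUBJECT', 'SECT', 'COURSE', 'CREDIT',
           'INSTR.', 'BLDG/RM', 'DAY/TIME', 'FROM / TO', 'Books']

# Two rows copied from the CSV output above; a stand-in for 'amst.csv'.
csv_text = (
    'OPEN,15114,AMST 1000,10,Bodies of Work,3.00,"Ivy, N",REMOTE INSTR,'
    'M 12:45PM - 03:15PM,01/11/21 - 04/26/21,\n'
    'WAITLIST,17707,AMST 1000,11,Media Culture & COVID,3.00,"McAlister, M",'
    'REMOTE INSTR,MW 02:20PM - 03:35PM,01/11/21 - 04/26/21,\n'
)

# header=None because the file has no header row.
df_from_csv = pd.read_csv(io.StringIO(csv_text), header=None,
                          names=columns, dtype=str)

# Count how many sections have each status.
status_counts = df_from_csv['STATUS'].value_counts()
print(status_counts)
```

With the real file, you would pass `'amst.csv'` in place of the `io.StringIO(...)` object.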


Now that we've done this on one page, what might be next steps?

* Create a function to modularize the work we did. Then apply this to other pages (note that there were 2 pages for AMST) and other departments.
* Consider how you can be respectful of the website. Where could you insert some sleep statements? 
* Maybe there are better times of day to run this kind of scraping?
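
As a starting point for the first suggestion, here is one way the URL building could be turned into a function. It is only a sketch: `course_url` is a name invented here, `'HIST'` is an assumed subject code used for illustration, and how the second page of results is addressed is not covered.

```python
from urllib.parse import urlencode

BASE_URL = 'https://my.gwu.edu/mod/pws/courses.cfm'

def course_url(subject, campus=1, term=202101):
    """Build a Schedule of Classes URL for one subject code."""
    query = urlencode({'campId': campus, 'termId': term, 'subjId': subject})
    return f'{BASE_URL}?{query}'

# 'AMST' comes from this notebook; 'HIST' is an assumed subject code.
for subject in ['AMST', 'HIST']:
    print(course_url(subject))
```

Each URL could then be passed to `session_new.get()`, with a `sleep()` call between requests, and the rows extracted with `get_tds()` as before.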