# Scraping Papers Data from the OU Papers Web

### Part 1: Extract plain text from HTML with BeautifulSoup

In [35]:
# If you want to make a comment or describe what something is doing, start the line with a #. 
# These don't do anything, but can be helpful for explaining context!

In [36]:
# Import 'requests', a library to help retrieve web pages
import requests
# Import the csv library for outputting the table data
import csv
# BeautifulSoup can parse web page structure, much like a web browser does
from bs4 import BeautifulSoup

In [37]:
# We store a URL, giving it the variable name 'url'
# This is a 'string variable' - the type of data we are mainly concerned with here.
# Choose a subject page for the first search - this is the English one
url = 'https://www.otago.ac.nz/courses/papers/index.html?subjcode=*&papercode=engl&keywords=&period=&year=&distance=&lms=&submit=Search'
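
The search settings live in the part of the URL after the ```?```. As a minimal sketch, the same URL can be assembled from its parts with the standard library. The helper name ```build_search_url``` is made up for illustration, and whether the site accepts other subject codes in the same way is an assumption.

```python
from urllib.parse import urlencode

BASE_URL = 'https://www.otago.ac.nz/courses/papers/index.html'

def build_search_url(papercode):
    """Hypothetical helper: build a paper-search URL for one subject code."""
    params = {
        'subjcode': '*',
        'papercode': papercode,
        'keywords': '',
        'period': '',
        'year': '',
        'distance': '',
        'lms': '',
        'submit': 'Search',
    }
    # safe='*' keeps the asterisk unescaped, matching the URL stored in 'url'
    return BASE_URL + '?' + urlencode(params, safe='*')

english_url = build_search_url('engl')
print(english_url)
```

Calling ```build_search_url('engl')``` reproduces the URL stored in ```url``` above.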

In [38]:
# The requests library will retrieve various information about the page
# By convention, we use 'r' to denote the object storing this information
r = requests.get(url)

# Print the status code and first 300 characters of the webpage text to check that it's working
print(r.status_code)
print(r.text[:300])

200
<!DOCTYPE html>
<!--[if lt IE 7]>      <html class="no-js lt-ie9 lt-ie8 lt-ie7" lang="en-NZ"> <![endif]-->
<!--[if IE 7]>         <html class="no-js lt-ie9 lt-ie8 lang="en-NZ""> <![endif]-->
<!--[if IE 8]>         <html class="no-js lt-ie9" lang="en-NZ"> <![endif]-->
<!--[if gt IE 8]><!--> <html


#### Making Beautiful Soup

In [39]:
# We create a BeautifulSoup object, using a parser library called 'lxml'
# lxml isn't imported directly, but is a dependency (or requirement) here
soup = BeautifulSoup(r.text, 'lxml')

In [40]:
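# Select elements of the web page for display
# soup.a is the first <a> (link) tag; find_all('a') returns every <a> tag on the page
# A notebook cell only displays the value of its last line
soup.a
soup.find_all('a')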

[<a id="top"></a>,
 <a href="/accessibility/">Accessibility</a>,
 <a href="#global_nav">Skip to Global Navigation</a>,
 <a href="#localnav">Skip to Local Navigation</a>,
 <a href="#content">Skip to Content</a>,
 <a href="#globalsearch">Skip to Search</a>,
 <a href="#footer_bg">Skip to Site Map</a>,
 <a href="#menu" id="menu-link">Menu</a>,
 <a href="#" id="close_mobile_nav">Close menu</a>,
 <a accesskey="0" class="homepage_link" href="/" title="Otago homepage"><img alt="University of Otago - Te Whare Wānanga o Otāgo" height="80" src="/_assets/_gfx/logo.png" srcset="/_assets/_gfx/logo@2x.png 2x" width="160"/></a>,
 <a class="homepage_link" href="/" id="global_nav_home">Show Otago menu</a>,
 <a data-link="true" href="/" id="home_link">Otago home</a>,
 <a data-link="true" href="/future-students/" id="menu_link">Future students</a>,
 <a data-slide-out-menu="mega_2" href="/currentstudents/">Current students</a>,
 <a data-slide-out-menu="mega_3" href="/staff/">Otago staff</a>,
 <a data-slide

In [41]:
#print (soup)

Like ```r```, ```soup``` is an object. Objects _bundle together_ many variables (properties of an object) as well as methods (actions we can perform on the object).

Tags are available as properties of ```soup```: ```soup.title``` gives the page title, and ```soup.a``` gives the first link on the page. We can access other webpage elements you're probably familiar with, too.
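
Here is a small sketch of properties and methods in action. It uses a made-up page rather than the Otago one, and the built-in ```html.parser``` instead of ```lxml```, so the names ```sample_html``` and ```sample_soup``` are assumptions for illustration only.

```python
from bs4 import BeautifulSoup

# A made-up page, small enough to read in full
sample_html = """
<html>
  <head><title>Sample papers page</title></head>
  <body>
    <a href="/accessibility/" id="first">Accessibility</a>
    <a href="#content">Skip to Content</a>
  </body>
</html>
"""

# 'html.parser' ships with Python, so this sketch needs no extra parser library
sample_soup = BeautifulSoup(sample_html, 'html.parser')

page_title = sample_soup.title.get_text()  # a method: returns the text inside <title>
first_link = sample_soup.a                 # attribute access: the first <a> tag on the page
link_target = first_link['href']           # tag attributes are read like dictionary keys

print(page_title)
print(first_link.name, link_target)
```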

#### Try this out

Change the last line of cell 40 above to display the following page elements (or 'tags', for short) stored in the ```soup``` object, for example ```soup.img```:

- img
- table
- tr
- a

A property such as ```soup.a``` returns only the first instance of a given tag. The ```find_all()``` method extracts all instances of a given element. Learn more about BeautifulSoup [here](https://www.crummy.com/software/BeautifulSoup/bs4/doc/).
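
To see the difference between the first tag and all tags side by side, here is a short sketch on a made-up list of links. The paper codes are taken from the results table on the page; the surrounding HTML is invented for the example.

```python
from bs4 import BeautifulSoup

sample_html = """
<ul>
  <li><a href="index.html?papercode=ENGL120">ENGL120</a></li>
  <li><a href="index.html?papercode=ENGL121">ENGL121</a></li>
  <li><a href="index.html?papercode=ENGL126">ENGL126</a></li>
</ul>
"""
sample_soup = BeautifulSoup(sample_html, 'html.parser')

first_only = sample_soup.a             # just the first <a> tag
all_links = sample_soup.find_all('a')  # a list-like collection of every <a> tag

# Pull the visible text out of each link
link_texts = [a.get_text() for a in all_links]

print(first_only.get_text())
print(len(all_links), link_texts)
```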

#### Collect data from the table
Let's get a bit more data. Say we want the list of papers shown in the search results table; we could go about it like this.

In [42]:
# Pull the table out of the page and store it in a variable
# We can get away with soup.table here because there's only one table on the page
paper_results_table = soup.table
print(paper_results_table)

# To select the table by its class instead:
#paper_results_table = soup.find('table', {'class': 'paper_search_results'})

# Another way to do this, if there were multiple tables on the page and we wanted all of them:
#papers = soup.find_all('table')
#for table in papers:
#    print(table)

<table class="paper_search_results" xmlns:fn="http://www.w3.org/2005/xpath-functions" xmlns:functx="http://www.functx.com" xmlns:ns1="https://esb.otago.ac.nz/xsd/101/paperOffering" xmlns:ns16="https://esb.otago.ac.nz/xsd/101/serverResponse" xmlns:ns2="https://esb.otago.ac.nz/xsd/101/paper" xmlns:ns3="https://esb.otago.ac.nz/xsd/101/offering" xmlns:xs="http://www.w3.org/2001/XMLSchema">
<tr>
<th>Paper code</th>
<th>Year</th>
<th>Title</th>
<th>Points</th>
<th>Teaching period</th>
</tr>
<tr class="new_papercode">
<td><a href="index.html?papercode=ENGL120">ENGL120</a></td>
<td>2020</td>
<td><a href="index.html?papercode=ENGL120#2020">Creative Writing: How to Captivate and Persuade</a></td>
<td>18 points</td>
<td>First Semester</td>
</tr>
<tr class="new_papercode">
<td><a href="index.html?papercode=ENGL121">ENGL121</a></td>
<td>2020</td>
<td><a href="index.html?papercode=ENGL121#2020">English Literature: The Remix</a></td>
<td>18 points</td>
<td>First Semester</td>
</tr>
<tr class="new_pap

In [43]:
#Pulling out the table row data from the page and storing it in a different variable
paper_results_table_rows = paper_results_table.find_all('tr')
print (paper_results_table_rows)

[<tr>
<th>Paper code</th>
<th>Year</th>
<th>Title</th>
<th>Points</th>
<th>Teaching period</th>
</tr>, <tr class="new_papercode">
<td><a href="index.html?papercode=ENGL120">ENGL120</a></td>
<td>2020</td>
<td><a href="index.html?papercode=ENGL120#2020">Creative Writing: How to Captivate and Persuade</a></td>
<td>18 points</td>
<td>First Semester</td>
</tr>, <tr class="new_papercode">
<td><a href="index.html?papercode=ENGL121">ENGL121</a></td>
<td>2020</td>
<td><a href="index.html?papercode=ENGL121#2020">English Literature: The Remix</a></td>
<td>18 points</td>
<td>First Semester</td>
</tr>, <tr class="new_papercode">
<td><a href="index.html?papercode=ENGL126">ENGL126</a></td>
<td>2020</td>
<td><a href="index.html?papercode=ENGL126#2020">English for University Purposes</a></td>
<td>18 points</td>
<td>Second Semester</td>
</tr>, <tr class="new_papercode">
<td><a href="index.html?papercode=ENGL127">ENGL127</a></td>
<td>2020</td>
<td><a href="index.html?papercode=ENGL127#2020">Effective Wri

### Preparing the csv file
Another useful task is to extract this kind of information and store it in a spreadsheet. 
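
As a rough sketch of what the ```csv``` library does, the example below writes a header and two rows from the results into an in-memory buffer instead of a real file, then reads them back. The use of ```io.StringIO``` is only a stand-in for an opened file so that nothing is written to disk.

```python
import csv
import io

header = ["Paper code", "Year", "Title", "Points", "Teaching period"]
sample_rows = [
    ["ENGL120", "2020", "Creative Writing: How to Captivate and Persuade", "18 points", "First Semester"],
    ["ENGL127", "2020", "Effective Writing", "18 points", "Second Semester, Summer School"],
]

# io.StringIO stands in for a file opened with open(..., "w", newline="")
buffer = io.StringIO(newline="")
writer = csv.writer(buffer)
writer.writerow(header)        # one row at a time
writer.writerows(sample_rows)  # or several rows at once

csv_text = buffer.getvalue()
print(csv_text)

# Reading the text back shows that the comma inside the last cell survived,
# because the writer wrapped that cell in double quotes
read_back = list(csv.reader(io.StringIO(csv_text, newline="")))
print(read_back[2][4])
```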

In [46]:
# Create (or overwrite) a file called Paper_Results_Test1.csv
# and write the column headers as its first line.
# newline='' lets the csv library control the line endings itself
with open("Paper_Results_Test1.csv", "w", encoding="utf-8", newline='') as f:
    csvwriter = csv.writer(f)
    csvwriter.writerow(["Paper code", "Year", "Title", "Points", "Teaching period"])

### Pulling out the td data and putting it into cells

In [47]:
# A for loop repeats the same steps for every row - the slice [1:] skips the first (header) row
for tr in paper_results_table_rows[1:]:
    td = tr.find_all('td')

    # Pick out each cell by its column position in the table
    # get_text() returns the text inside the cell as a string
    Paper_code = td[0].get_text()
    Year = td[1].get_text()
    Title = td[2].get_text()
    Points = td[3].get_text()
    Teaching_period = td[4].get_text()

    print(Paper_code, Year, Title, Points, Teaching_period)

    # Append ("a") this row to the csv file created earlier
    with open("Paper_Results_Test1.csv", "a", encoding="utf-8", newline='') as f:
        csvwriter = csv.writer(f)
        csvwriter.writerow([Paper_code, Year, Title, Points, Teaching_period])

ENGL120 2020 Creative Writing: How to Captivate and Persuade 18 points First Semester
ENGL121 2020 English Literature: The Remix 18 points First Semester
ENGL126 2020 English for University Purposes 18 points Second Semester
ENGL127 2020 Effective Writing 18 points Second Semester, Summer School
ENGL128 2020 Effective Communication 18 points First Semester
ENGL131 2020 Controversial Classics 18 points Second Semester
ENGL214 2020 Medieval Literature 1 18 points Second Semester
ENGL216 2020 A Topic in English Language 18 points Not offered in 2020
ENGL217 2020 Creative Writing: Poetry 18 points Second Semester
ENGL218 2020 Shakespeare: Stage, Page and Screen 18 points First Semester
ENGL219 2020 Poetry and Music 18 points Not offered in 2020
ENGL220 2020 Creative Writing: Reading for Writers 18 points First Semester
ENGL222 2020 Contemporary American Fiction 18 points First Semester
ENGL223 2020 Fantasy and the Imagination 18 points Summer School
ENGL227 2020 Essay and Feature Writing 1
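
The loop above reopens the csv file once per row. An alternative sketch, not the method used in this notebook, collects all the rows first and writes them in one go. It runs on a two-row copy of the results table and writes to an in-memory buffer; ```sample_table```, ```records``` and ```buffer``` are names invented for the example.

```python
import csv
import io
from bs4 import BeautifulSoup

# The header row plus two data rows copied from the results table
sample_table = """
<table class="paper_search_results">
<tr><th>Paper code</th><th>Year</th><th>Title</th><th>Points</th><th>Teaching period</th></tr>
<tr class="new_papercode">
<td><a href="index.html?papercode=ENGL120">ENGL120</a></td>
<td>2020</td>
<td><a href="index.html?papercode=ENGL120#2020">Creative Writing: How to Captivate and Persuade</a></td>
<td>18 points</td>
<td>First Semester</td>
</tr>
<tr class="new_papercode">
<td><a href="index.html?papercode=ENGL121">ENGL121</a></td>
<td>2020</td>
<td><a href="index.html?papercode=ENGL121#2020">English Literature: The Remix</a></td>
<td>18 points</td>
<td>First Semester</td>
</tr>
</table>
"""

table = BeautifulSoup(sample_table, 'html.parser').find('table', class_='paper_search_results')
rows = table.find_all('tr')

# The header cells are <th>, the data cells are <td>
header = [th.get_text(strip=True) for th in rows[0].find_all('th')]
records = [[td.get_text(strip=True) for td in tr.find_all('td')] for tr in rows[1:]]

# Write everything in one pass; swap the buffer for open(..., "w", newline="") to get a real file
buffer = io.StringIO(newline="")
writer = csv.writer(buffer)
writer.writerow(header)
writer.writerows(records)
print(buffer.getvalue())
```

Opening the file once keeps the header and the rows together, so running the code twice cannot append duplicate rows.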

In [25]:
# A bytes literal: the b prefix means raw bytes rather than text
b"abcde"

b'abcde'

In [26]:
# decode() turns bytes into a text string using the named encoding
b"abcde".decode("utf-8")

'abcde'