# Scraping Papers Data for Multiple Subjects from the OU Papers Web

![OU Subjects and Papers Web Search Form](https://www.otago.ac.nz/_assets/_gfx/logo.png)

## Overview

This notebook is a series of Python script sections that scrape data from the OU 'Search for papers' webpage.

It is broken down into steps, with comments along the way. Comments are the lines starting with a `#`; they don't **do** anything, but they can be helpful for explaining context!

### A little about Jupyter notebooks

Interactive online notebooks - Jupyter notebooks like this one - make it easy to run small pieces of code and better understand each step in the program. They allow everyone to work in the same environment and get the same results (we hope!), and also allow text, headings, links etc. to be interspersed with the code.

If you want to use Jupyter notebooks (as we are here), I recommend you set this up, along with many of the Python libraries you might want, by installing the Anaconda Community distribution.

#### Running code cells
The notebook is made up of cells, which can be run individually or together. You must run them in order as the programs are sequential - later parts rely on the completion of earlier parts.

To run a cell, you need to click on it to select it, then press the 'Run' button on the menu bar, or press Shift+Enter (the latter is much more convenient, once you get used to it). Working cell by cell helps break the program into pieces which can be understood more easily.

### Start by importing the Libraries
A *library* in Python is a collection of functions and methods that allows you to perform many actions without writing the code yourself. We need to import three for this script:

`requests` which enables us to retrieve web pages

`csv` which enables us to output the data we collect as an Excel-readable csv file

`BeautifulSoup` which can interpret and process html and xml - so can read a web page, much like a web browser does

In [35]:
# Importing a library makes its commands and tools available to this particular program
import requests
import csv
from bs4 import BeautifulSoup

### Set our subjects
Store the subject codes you want to check in a variable called `subjects`. The codes can be edited to suit different subject librarians - the current iteration is for **Social Sciences**.

In [40]:
subjects = ("finc", "acct", "econ", "info", "mart", "tour")

# Print them out to confirm that it's working
print(subjects)

('finc', 'acct', 'econ', 'info', 'mart', 'tour')


### Set up the Search URLs for all your Subjects
Next we are going to use an accumulator pattern, built with a [for loop](https://en.wikipedia.org/wiki/For_loop), to assemble a list of the search URLs we need, one for each subject whose papers we want to find.


In [41]:
# Make an empty list to hold the URLs
subject_papers_search_urls = []

# Build the URLs and add them to the list one by one
for item in subjects:
    url_ending = item + '&keywords=&period=&year=&distance=&lms=&submit=Search'
    subject_papers_search_urls.append('https://www.otago.ac.nz/courses/papers/index.html?subjcode=*&papercode=' + url_ending)

print(subject_papers_search_urls)

['https://www.otago.ac.nz/courses/papers/index.html?subjcode=*&papercode=finc&keywords=&period=&year=&distance=&lms=&submit=Search', 'https://www.otago.ac.nz/courses/papers/index.html?subjcode=*&papercode=acct&keywords=&period=&year=&distance=&lms=&submit=Search', 'https://www.otago.ac.nz/courses/papers/index.html?subjcode=*&papercode=econ&keywords=&period=&year=&distance=&lms=&submit=Search', 'https://www.otago.ac.nz/courses/papers/index.html?subjcode=*&papercode=info&keywords=&period=&year=&distance=&lms=&submit=Search', 'https://www.otago.ac.nz/courses/papers/index.html?subjcode=*&papercode=mart&keywords=&period=&year=&distance=&lms=&submit=Search', 'https://www.otago.ac.nz/courses/papers/index.html?subjcode=*&papercode=tour&keywords=&period=&year=&distance=&lms=&submit=Search']
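The same URLs can also be assembled from a dictionary of query parameters rather than by joining strings by hand. This is only a sketch of an alternative, not a replacement for the cell above: the helper name `build_search_url` is hypothetical, and it assumes the `*` wildcard should stay unescaped so that the result matches the URLs printed above.

```python
from urllib.parse import urlencode

BASE_URL = "https://www.otago.ac.nz/courses/papers/index.html"


def build_search_url(subject):
    """Return the search URL for one subject code (hypothetical helper)."""
    # The parameter names and their order mirror the hand-built URL above
    params = {
        "subjcode": "*",
        "papercode": subject,
        "keywords": "",
        "period": "",
        "year": "",
        "distance": "",
        "lms": "",
        "submit": "Search",
    }
    # safe="*" keeps the wildcard unescaped; empty values become 'name='
    return BASE_URL + "?" + urlencode(params, safe="*")


subjects = ("finc", "acct", "econ", "info", "mart", "tour")
subject_papers_search_urls = [build_search_url(item) for item in subjects]
print(subject_papers_search_urls[0])
```

Naming each parameter makes it easier to see which part of the URL changes from one subject to the next.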


## Create a csv file to hold the search results
Create and open a file called `CombinedPaper_Results_Test.csv` in write mode `'w'`, with utf-8 encoding, specifying the newline character.

Write the header row of the file to match the fields in the search results table.

In [44]:
with open("CombinedPaper_Results_Test.csv", "w", encoding="utf-8", newline='\n') as fd:
    # The column headings are plain strings, matching the fields in the search results table
    fieldnames = ["Paper code", "Year", "Title", "Points", "Teaching period"]
    csvwriter = csv.writer(fd)
    csvwriter.writerow(fieldnames)
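
The way `csv.writer` handles a field that itself contains a comma (such as a teaching period of "First Semester, Second Semester") can be checked without touching the disk. This is a minimal sketch, assuming an in-memory `io.StringIO` buffer stands in for the file; the sample row is copied from the output shown later in this notebook.

```python
import csv
import io

# Column headings matching the search results table
header = ["Paper code", "Year", "Title", "Points", "Teaching period"]

# An in-memory text buffer stands in for the csv file (for illustration only)
buffer = io.StringIO(newline="")
csvwriter = csv.writer(buffer)
csvwriter.writerow(header)
# Sample row: note the comma inside the last field
csvwriter.writerow(["FINC102", "2020", "Business Mathematics", "18 points",
                    "First Semester, Second Semester"])

# Read the text back to confirm the field with the comma is still one field
buffer.seek(0)
rows = list(csv.reader(buffer))
print(rows[1][4])  # prints: First Semester, Second Semester
```

The writer wraps that field in quotes, so a spreadsheet program still reads it as a single column.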

## Gather the paper data for each subject and add them to a csv file
Create a `for` loop to process the successive pages iteratively, which involves:
* using `requests` to fetch the search results page and then peek at the first part to ensure it has worked
* processing the search results webpages using BeautifulSoup
* pulling out the table data from the page and storing it in a variable

Then we create another `for` loop within that (yes, that's a nested `for` loop!) which uses BeautifulSoup to:
* find all the rows `tr` in the table, then
* find all the cells `td`, and then
* pull out the data for each `td` cell into one of five variables `Paper_code`, `Year`, `Title`, `Points`, `Teaching_period` depending on its position in the row, and finally
* add all that data to the csv file that we created earlier
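
Before running this against the live site, the row-by-row extraction can be tried on a small hand-written table. This is a sketch under assumptions: the `sample_html` markup below is invented for illustration and is not the real page's HTML (only the two paper rows are copied from the output further down), and it uses Python's built-in `html.parser` so that `lxml` is not needed.

```python
from bs4 import BeautifulSoup

# A tiny hand-written table (illustrative only, not the real page markup)
sample_html = """
<table>
  <tr><th>Paper code</th><th>Year</th><th>Title</th><th>Points</th><th>Teaching period</th></tr>
  <tr><td>FINC203</td><td>2020</td><td>Financial Data Analysis</td><td>18 points</td><td>First Semester</td></tr>
  <tr><td>FINC204</td><td>2020</td><td>Personal Finance</td><td>18 points</td><td>Summer School</td></tr>
</table>
"""

# 'html.parser' ships with Python, so this sketch does not depend on lxml
soup = BeautifulSoup(sample_html, "html.parser")

extracted_rows = []
# [1:] misses out the first row, which here holds th cells rather than td cells
for tr in soup.find_all("tr")[1:]:
    cells = [td.get_text(strip=True) for td in tr.find_all("td")]
    extracted_rows.append(cells)

print(extracted_rows)
```

Each inner list holds the five values for one paper, in the same order as the columns of the table.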

In [45]:
for subject_url in subject_papers_search_urls:
    
    # Pull the data from the page
    r = requests.get(subject_url)
    # Print the status code and first 300 characters of the webpage text to check that it's working
    print(r.status_code)
    print(r.text[:300])

    # Process the html page with Beautiful Soup, using the lxml parser
    soupextra = BeautifulSoup(r.text, 'lxml')

    # Pull out all the table rows from the page and store them in a variable
    # We can get away with searching the whole page for tr rows here because there's only one table on the page
    paper_results_rows = soupextra.find_all('tr')
    
    # Open the csv file once per page in append mode 'a', so the header row written earlier is kept
    with open("CombinedPaper_Results_Test.csv", "a", encoding="utf-8", newline='\n') as fd:
        csvwriter = csv.writer(fd)

        # A loop to pull out all the relevant data - the slice [1:] means miss out one row at the start
        for tr in paper_results_rows[1:]:
            td = tr.find_all('td')
            # Skip any row that doesn't have the five cells we expect
            if len(td) < 5:
                continue

            # Isolate each piece of row data by its column position in the table; get_text() returns a string
            Paper_code = td[0].get_text()
            Year = td[1].get_text()
            Title = td[2].get_text()
            Points = td[3].get_text()
            Teaching_period = td[4].get_text()

            # Check that it's working
            print(Paper_code, Year, Title, Points, Teaching_period)

            # Write the results to the csv file
            csvwriter.writerow([Paper_code, Year, Title, Points, Teaching_period])

200
<!DOCTYPE html>
<!--[if lt IE 7]>      <html class="no-js lt-ie9 lt-ie8 lt-ie7" lang="en-NZ"> <![endif]-->
<!--[if IE 7]>         <html class="no-js lt-ie9 lt-ie8 lang="en-NZ""> <![endif]-->
<!--[if IE 8]>         <html class="no-js lt-ie9" lang="en-NZ"> <![endif]-->
<!--[if gt IE 8]><!--> <html
FINC102 2020 Business Mathematics 18 points First Semester, Second Semester
FINC202 2020 Investment Analysis and Portfolio Management 18 points First Semester, Second Semester
FINC203 2020 Financial Data Analysis 18 points First Semester
FINC204 2020 Personal Finance 18 points Summer School
FINC206 2020 Fundamentals of Corporate Finance 18 points Second Semester
FINC302 2020 Applied Investments 18 points First Semester
FINC303 2020 Financial Management 18 points Second Semester
FINC304 2020 Financial Markets and Institutions 18 points First Semester
FINC305 2020 International Financial Management 18 points First Semester
FINC306 2020 Derivatives 18 points Second Semester
FINC308 2020 Financ

200
<!DOCTYPE html>
<!--[if lt IE 7]>      <html class="no-js lt-ie9 lt-ie8 lt-ie7" lang="en-NZ"> <![endif]-->
<!--[if IE 7]>         <html class="no-js lt-ie9 lt-ie8 lang="en-NZ""> <![endif]-->
<!--[if IE 8]>         <html class="no-js lt-ie9" lang="en-NZ"> <![endif]-->
<!--[if gt IE 8]><!--> <html
MART112 2020 Marketing 18 points First Semester, Second Semester
MART201 2020 Integrated Marketing Communications 18 points Second Semester
MART205 2020 Marketing the Professional Practice 18 points Second Semester
MART207 2020 Sports Marketing 18 points First Semester
MART210 2020 Consumer Behaviour 18 points First Semester
MART211 2020 Products to Market 18 points Second Semester
MART212 2020 Understanding Markets 18 points First Semester
MART301 2020 Strategic Marketing 18 points Second Semester
MART304 2020 Sales and Sales Management 18 points First Semester
MART305 2020 Societal Issues in Marketing 18 points First Semester
MART306 2020 Innovation and New Product Development 18 points F