# Papers Past Scraper

This notebook scrapes a specific [periodical](https://paperspast.natlib.govt.nz/periodicals) from [Papers Past](https://paperspast.natlib.govt.nz/) and builds a corpus from the OCR'd text. The DigitalNZ API is generally a better way to capture data from Papers Past, but not everything is available through it.

In this notebook I am scraping the [New Zealand Police Gazette](https://paperspast.natlib.govt.nz/periodicals/new-zealand-police-gazette). Note that each periodical's page gives background information and details of copyright and/or Creative Commons licences (the New Zealand Police Gazette page linked above is an example).

## Configuration and setup

The main dependency that needs configuration is [Selenium](https://selenium.dev/). Selenium is used to automate a web browser, because requests made with the `requests` library were blocked.

Documentation for installing Selenium and a driver is available at the [Selenium PyPI page](https://pypi.org/project/selenium/).

In this case I am using the [Geckodriver](https://github.com/mozilla/geckodriver) to automate Firefox. Note: [Geckodriver binaries available here](https://github.com/mozilla/geckodriver/releases).

In [None]:
import requests
import os
import time
import glob
import bs4 as bs
from selenium import webdriver
import urllib.parse

In [None]:
# configure the driver path
# Geckodriver must be installed first (see the note above about Geckodriver)
driver_path = 'C:/applications/geckodriver'

In [None]:
# set directory paths
issues_dir = r'cache/issues'
contents_dir = r'cache/contents'
corpus_dir = r'corpus'

# make sure the required directories exist
for directory in (issues_dir, contents_dir, corpus_dir):
    os.makedirs(directory, exist_ok=True)

### Configure the start url

This should be the first issue of the publication in question. The remaining issues are reached automatically by following the 'Next issue' link on each issue page.

In [None]:
start_url = 'https://paperspast.natlib.govt.nz/periodicals/new-zealand-police-gazette/1877/07/02'

# this sets a base_url for use in resolving urls extracted from the page ...
url = urllib.parse.urlparse(start_url)
base_url = url.scheme + '://' + url.netloc
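
The `base_url` is needed because links extracted from a page may be relative to the site root. A minimal sketch of how `urllib.parse.urljoin` resolves such a link; the `example_href` value is a made-up illustration, not a link taken from the site:

```python
import urllib.parse

start_url = 'https://paperspast.natlib.govt.nz/periodicals/new-zealand-police-gazette/1877/07/02'
parsed = urllib.parse.urlparse(start_url)
base_url = parsed.scheme + '://' + parsed.netloc

# hypothetical href, assumed to be site-relative like the links on an issue page
example_href = '/periodicals/new-zealand-police-gazette/1877/07/16'
resolved = urllib.parse.urljoin(base_url, example_href)
print(resolved)

# an href that is already absolute is returned unchanged
absolute = urllib.parse.urljoin(base_url, 'https://example.org/page')
print(absolute)
```

Because `urljoin` handles both cases, the same call works whether the page uses relative or absolute links.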

## Retrieve and save all issue pages
Starting at the `start_url`, this code steps through every issue of the publication and caches each issue page.

In [None]:
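# Selenium 4 takes the driver path via a Service object;
# the older executable_path argument to Firefox() was deprecated and later removed
from selenium.webdriver.firefox.service import Service

driver = webdriver.Firefox(service=Service(executable_path=driver_path))
next_url = start_url
try:
    while next_url is not None:
        print('Retrieve', next_url)
        driver.get(next_url)
        page_source = driver.page_source

        # build filename for cache from the URL path
        url = urllib.parse.urlparse(next_url)
        filename = url.path.strip('/').replace('/', '-') + '.txt'

        # cache file
        with open(os.path.join(issues_dir, filename), 'w', encoding='utf-8') as f:
            f.write(page_source)

        # extract the 'Next issue' link using Beautiful Soup
        # (naming the parser avoids the "no parser specified" warning)
        soup = bs.BeautifulSoup(page_source, 'html.parser')
        link = soup.select_one("div.show-for-medium div.pager__right a")
        if link is None:
            # no 'Next issue' link, so this was the last issue
            next_url = None
        else:
            next_url = urllib.parse.urljoin(base_url, link['href'])
            time.sleep(5)  # pause between requests
finally:
    driver.quit()  # close the browser even if the crawl fails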
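
Each page is cached under a filename derived from its URL path. A small sketch of that naming rule in isolation; the helper name `cache_filename` is my own and does not appear in the notebook:

```python
import urllib.parse

def cache_filename(page_url):
    """Turn a page URL into a flat cache filename (hypothetical helper)."""
    path = urllib.parse.urlparse(page_url).path
    # drop leading/trailing slashes, then flatten the remaining ones
    return path.strip('/').replace('/', '-') + '.txt'

example_name = cache_filename(
    'https://paperspast.natlib.govt.nz/periodicals/new-zealand-police-gazette/1877/07/02'
)
print(example_name)
# periodicals-new-zealand-police-gazette-1877-07-02.txt
```

Only the path is used, so any query string or fragment in the URL does not affect the filename.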

## Retrieve and save all contents pages
This code extracts the links to the content pages from the cached issue pages, then requests each content page and saves it.

In [None]:
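# Selenium 4 takes the driver path via a Service object;
# the older executable_path argument to Firefox() was deprecated and later removed
from selenium.webdriver.firefox.service import Service

driver = webdriver.Firefox(service=Service(executable_path=driver_path))
try:
    for issue_file in glob.glob(os.path.join(issues_dir, '*.txt')):
        # open cached issue page
        with open(issue_file, 'r', encoding='utf-8') as f:
            contents = f.read()

        # extract links to the content pages
        soup = bs.BeautifulSoup(contents, 'html.parser')
        for link in soup.select("ul.issue__contents li a"):
            next_url = urllib.parse.urljoin(base_url, link['href'])
            print('Retrieve', next_url)
            driver.get(next_url)

            # build filename for cache from the URL path
            url = urllib.parse.urlparse(next_url)
            cache_name = url.path.strip('/').replace('/', '-') + '.txt'

            # cache file
            with open(os.path.join(contents_dir, cache_name), 'w', encoding='utf-8') as f:
                f.write(driver.page_source)
            time.sleep(5)  # pause between requests
finally:
    driver.quit()  # close the browser even if a request fails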
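
Retrieving every content page takes a long time with a five-second pause per request, and the cell above downloads everything again if it is re-run. One possible refinement, sketched here as an assumption rather than something the notebook does, is to skip pages that are already in the cache. The helpers `cache_path` and `is_cached` and the demo URL are hypothetical:

```python
import os
import tempfile
import urllib.parse

def cache_path(page_url, cache_dir):
    """Cache location for a URL, using the same naming rule as the notebook."""
    path = urllib.parse.urlparse(page_url).path
    return os.path.join(cache_dir, path.strip('/').replace('/', '-') + '.txt')

def is_cached(page_url, cache_dir):
    """True if a non-empty cache file already exists for this URL."""
    target = cache_path(page_url, cache_dir)
    return os.path.exists(target) and os.path.getsize(target) > 0

# demonstration in a temporary directory with a made-up URL
with tempfile.TemporaryDirectory() as demo_dir:
    demo_url = 'https://paperspast.natlib.govt.nz/periodicals/example-page'
    before = is_cached(demo_url, demo_dir)
    with open(cache_path(demo_url, demo_dir), 'w', encoding='utf-8') as f:
        f.write('<html></html>')
    after = is_cached(demo_url, demo_dir)

print(before, after)
```

In the retrieval loop, a check such as `if is_cached(next_url, contents_dir): continue` placed before `driver.get(next_url)` would let an interrupted run resume without repeating requests.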

## Write corpus
This reads through the scraped content files, extracts the article body from each, strips the tags and writes one corpus file per page. If the content for a page is blank (e.g. for the masthead image), no file is written for it.

In [None]:
file_counter = 0
for filename in glob.glob(os.path.join(contents_dir, '*.txt')):
    # open file
    with open(filename, 'r', encoding='utf-8') as f:
        contents = f.read()

    # extract content and strip tags
    soup = bs.BeautifulSoup(contents, 'html.parser')
    content = soup.find("div", itemprop="articleBody")
    # a page with no articleBody div is treated as blank instead of raising an error
    text = content.get_text("\n").strip() if content is not None else ''

    if text != '':
        # write corpus file
        file_counter += 1
        with open(os.path.join(corpus_dir, os.path.basename(filename)), 'w', encoding='utf-8') as f:
            f.write(text)

print(file_counter, 'corpus files written')
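
Once the corpus is written, a quick size check can confirm the output looks plausible. This is an optional sketch and not part of the original workflow; the function name `summarise_corpus` is my own, and tokens are counted by splitting on whitespace only:

```python
import glob
import os
import tempfile

def summarise_corpus(directory):
    """Return (number of .txt files, total whitespace-separated tokens)."""
    n_files = 0
    n_tokens = 0
    for path in glob.glob(os.path.join(directory, '*.txt')):
        with open(path, 'r', encoding='utf-8') as f:
            n_tokens += len(f.read().split())
        n_files += 1
    return n_files, n_tokens

# demonstration on a temporary directory with made-up text
with tempfile.TemporaryDirectory() as demo_dir:
    with open(os.path.join(demo_dir, 'a.txt'), 'w', encoding='utf-8') as f:
        f.write('first sample page')
    with open(os.path.join(demo_dir, 'b.txt'), 'w', encoding='utf-8') as f:
        f.write('second\nsample')
    demo_summary = summarise_corpus(demo_dir)

print(demo_summary)
```

Calling `summarise_corpus(corpus_dir)` after the cell above should report the same number of files as `file_counter`, provided the corpus directory was empty beforehand.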