My first challenge was gathering a copy of the available documents within the geographic and temporal parameters of my study. 

The source base for my study is the collection of scanned periodicals made available by the [Office of Archives, Statistics, and Research of the Seventh-day Adventist Church](http://documents.adventistarchives.org/). One of the advantages of working with these sources is that they are openly available on the web. This removes the need to navigate through the firewalls (and legal land-mines) of using text from major library databases, a major boon for the digital project. And, although the site does not provide an API for accessing the documents, the structure of the pages is regular, which makes the site a good candidate for web scraping. 

To determine the list of titles that applied to my time and regions of study, I browsed through all of the titles in the [periodicals section of the site](http://documents.adventistarchives.org/Periodicals/Forms/AllFolders.aspx) and compiled a list of titles that fit my geographic and temporal constraints. These are: 

* [Training School Advocate (ADV)](http://documents.adventistarchives.org/Periodicals/ADV)
* [American Sentinel (AmSn)](http://documents.adventistarchives.org/Periodicals/AmSn)
* [Advent Review and Sabbath Herald (ARAI)](http://documents.adventistarchives.org/Periodicals/ARAI)
* [Christian Education (CE)](http://documents.adventistarchives.org/Periodicals/CE)
* [Welcome Visitor (Columbia Union Visitor) (CUV)](http://documents.adventistarchives.org/Periodicals/CUV)
* [Christian Educator (EDU)](http://documents.adventistarchives.org/Periodicals/EDU)
* [General Conference Bulletin (GCB)](http://documents.adventistarchives.org/Periodicals/GCSessionBulletins)
* [Gospel Herald (GH)](http://documents.adventistarchives.org/Periodicals/GH)
* [Gospel of Health (GOH)](http://documents.adventistarchives.org/Periodicals/GOH)
* [Gospel Sickle (GS)](http://documents.adventistarchives.org/Periodicals/GS)
* [Home Missionary (HM)](http://documents.adventistarchives.org/Periodicals/HM)
* [Health Reformer (HR)](http://documents.adventistarchives.org/Periodicals/HR)
* [Indiana Reporter (IR)](http://documents.adventistarchives.org/Periodicals/IR)
* [Life Boat (LB)](http://documents.adventistarchives.org/Periodicals/LB)
* [Life and Health (LH)](http://documents.adventistarchives.org/Periodicals/LH)
* [Liberty (LibM)](http://documents.adventistarchives.org/Periodicals/LibM)
* [Lake Union Herald (LUH)](http://documents.adventistarchives.org/Periodicals/LUH)
* [North Michigan News Sheet (NMN)](http://documents.adventistarchives.org/Periodicals/NMN)
* [Pacific Health Journal and Temperance Advocate (PHJ)](http://documents.adventistarchives.org/Periodicals/PHJ)
* [Present Truth (Advent Review) (PT-AR)](http://documents.adventistarchives.org/Periodicals/PT-AR) (renamed to PTAR)
* [Pacific Union Recorder (PUR)](http://documents.adventistarchives.org/Periodicals/PUR)
* [Review and Herald (RH)](http://documents.adventistarchives.org/Periodicals/RH)
* [Sabbath School Quarterly (SSQ)](http://documents.adventistarchives.org/SSQ)
* [Sligonian (Sligo)](http://documents.adventistarchives.org/Periodicals/Sligo)
* [Sentinel of Liberty (SOL)](http://documents.adventistarchives.org/Periodicals/SOL)
* [Signs of the Times (ST)](http://documents.adventistarchives.org/Periodicals/ST)
* [Report of Progress, Southern Union Conference (SUW)](http://documents.adventistarchives.org/Periodicals/SUW)
* [The Church Officer's Gazette (TCOG)](http://documents.adventistarchives.org/Periodicals/TCOG)
* [The Missionary Magazine (TMM)](http://documents.adventistarchives.org/Periodicals/TMM)
* [West Michigan Herald (WMH)](http://documents.adventistarchives.org/Periodicals/WMH)
* [Youth's Instructor (YI)](http://documents.adventistarchives.org/Periodicals/YI)

As this was my first technical task for the dissertation, my initial method for identifying the URLs of the documents I wanted to download was rather manual. I saved an .html file for each web page that listed documents I wanted to download. I then passed those .html files to a script (similar to the one recorded here) that used `BeautifulSoup` to extract the PDF ids, reconstruct the URLs, and write the URLs to a new text file. After manually deleting the URLs of any documents that were out of range, I passed the file of URLs to `wget` using the following syntax: 

```bash
wget -i scrapeList.txt -w 2 --limit-rate=200k
```

I ran this process for each of the periodical titles included in this study. It took approximately a week to download all 13,000 files to my local machine.
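
The first half of that manual workflow (saved pages in, a URL list for `wget -i` out) can be sketched roughly as follows. This is not the original script: it uses the standard library's `html.parser` in place of `BeautifulSoup` so that it runs without extra dependencies, it assumes the saved pages mark each document link with the same `ms-vb-title` table cell that the functions in this notebook look for, and the function names and directory layout are hypothetical.

```python
from html.parser import HTMLParser
from pathlib import Path


class TitleCellParser(HTMLParser):
    """Collect the text of links found inside <td class="ms-vb-title"> cells."""

    def __init__(self):
        super().__init__()
        self.pdf_ids = []
        self._in_cell = False
        self._in_link = False

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "td" and "ms-vb-title" in classes:
            self._in_cell = True
        elif tag == "a" and self._in_cell:
            self._in_link = True

    def handle_endtag(self, tag):
        if tag == "a":
            self._in_link = False
        elif tag == "td":
            self._in_cell = False

    def handle_data(self, data):
        # Only text inside a link inside a title cell is treated as a PDF id.
        if self._in_link and data.strip():
            self.pdf_ids.append(data.strip())


def urls_from_saved_pages(html_dir, baseurl):
    """Build PDF URLs from every saved .html file in html_dir."""
    urls = []
    for page in sorted(Path(html_dir).glob("*.html")):
        parser = TitleCellParser()
        parser.feed(page.read_text(encoding="utf-8", errors="replace"))
        for pdf_id in parser.pdf_ids:
            urls.append("{}/{}.pdf".format(baseurl.rstrip("/"), pdf_id))
    return urls


def write_scrape_list(urls, out_path="scrapeList.txt"):
    """Write one URL per line, the format that `wget -i` reads."""
    Path(out_path).write_text("\n".join(urls) + "\n", encoding="utf-8")
```

The resulting text file would still need the manual pass described above to remove out-of-range documents before being handed to `wget`.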

This notebook reflects a more automated version of that process, created in 2017 to verify my collection and download any missing documents. The example recorded here downloads the Sabbath School Quarterly collection, which I missed during my initial collection phase. 

These scripts use the [`requests`](http://docs.python-requests.org/en/master/) library to retrieve the HTML from the document directory pages and [`BeautifulSoup4`](https://www.crummy.com/software/BeautifulSoup/) to locate the filenames. I am using [`wget`](https://pypi.python.org/pypi/wget) to download the files.

In [None]:
from bs4 import BeautifulSoup
from os.path import join
import re
import requests
import wget

In [None]:
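def get_html_page(url):
    """Use the requests library to get HTML content from URL
    
    Args:
        url (str): URL of webpage with content to download.

    Returns:
        str: The decoded HTML of the page.
    """
    r = requests.get(url)
    # Stop on HTTP errors (4xx/5xx) rather than parsing an error page.
    r.raise_for_status()

    return r.text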


def filename_from_html(content):
    """Use Beautiful Soup to extract the PDF ids from the HTML page. 

    This script is customized to the structure of the archive pages at
    http://documents.adventistarchives.org/Periodicals/Forms/AllFolders.aspx.

    Args:
        content (str): Content is retrieved from a URL using the `get_html_page` 
            function.
    """
    soup = BeautifulSoup(content, "lxml")
    buttons = soup.find_all('td', class_="ms-vb-title")

    pdfIDArray = []

    for each in buttons:
        links = each.find('a')
        pdfID = links.get_text()
        pdfIDArray.append(pdfID)

    return pdfIDArray


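def check_year(pdfID):
    """Use regex to extract the year from the PDF filename and check
    that it falls before 1921.

    Args:
        pdfID (str): The filename of the PDF object, formatted as 
            PREFIXYYYYMMDD-V00-00

    Returns:
        bool: True if the year is earlier than 1921, otherwise False.
    """
    # Keep only the PREFIXYYYYMMDD portion that precedes the first hyphen.
    title_date = pdfID.split('-')[0]
    # The first run of digits is the date; its first four digits are the year.
    date = re.findall(r'[0-9]+', title_date)
    year = date[0][:4]
    return int(year) < 1921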
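
`check_year` raises an `IndexError` when an id has no digits before its first hyphen, which could happen with an unexpected link in the index page. A more defensive variant might look like the sketch below. The function names, the `cutoff` parameter, and the assumption that every id starts with a run of letters followed by an eight-digit date are mine, not part of the original scripts.

```python
import re

# Assumed pattern: a letter-only prefix followed by an eight-digit date (YYYYMMDD).
DATE_PATTERN = re.compile(r"^[A-Za-z]+(\d{4})(\d{2})(\d{2})")


def year_from_pdf_id(pdf_id):
    """Return the four-digit year as an int, or None if the id does not match."""
    match = DATE_PATTERN.match(pdf_id)
    return int(match.group(1)) if match else None


def in_study_period(pdf_id, cutoff=1921):
    """True only when a year is found and it falls before the cutoff."""
    year = year_from_pdf_id(pdf_id)
    return year is not None and year < cutoff


# The id below is taken from the paged index URL used later in this notebook.
print(year_from_pdf_id("SS19121001-04"))
print(in_study_period("SS19121001-04"))
```

Ids that do not match the pattern are skipped rather than stopping the run, which trades a loud failure for the risk of silently missing a misnamed file.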

The first step is to set the directory where the downloaded documents will be saved, as well as the root URL where the PDF documents can be found. The current setup requires `baseurl` to be changed manually for each title.

In [None]:
download_directory = ""
baseurl = "http://documents.adventistarchives.org/SSQ/"
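
One way to avoid editing `baseurl` by hand would be to derive it from the title code. The sketch below is an illustration only: `base_url_for` and the two constants are hypothetical names, the override entries are copied from the links in the title list above, and it assumes the PDFs for each title sit directly under that title's folder, which I have only relied on here for the SSQ collection.

```python
ARCHIVE_ROOT = "http://documents.adventistarchives.org"

# Titles whose folder does not follow the /Periodicals/<CODE>/ pattern,
# taken from the links in the title list above.
BASE_URL_OVERRIDES = {
    "SSQ": ARCHIVE_ROOT + "/SSQ/",
    "GCB": ARCHIVE_ROOT + "/Periodicals/GCSessionBulletins/",
}


def base_url_for(code):
    """Return the assumed root URL for a title code such as 'RH' or 'SSQ'."""
    default = "{}/Periodicals/{}/".format(ARCHIVE_ROOT, code)
    return BASE_URL_OVERRIDES.get(code, default)


for code in ["SSQ", "RH", "GCB"]:
    print(code, base_url_for(code))
```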

My next step when downloading a collection of documents is to generate a list of the files that I want to download. Here I use two of the functions defined above, `get_html_page` and `filename_from_html`, to request the HTML from each URL in `index_page_urls` and to extract the document ids from that HTML. Finally, to avoid downloading any files outside of my study, I check the year in the document id with `check_year` before adding it to my list of documents to download.


In [None]:
index_page_urls = ["http://documents.adventistarchives.org/SSQ/Forms/AllItems.aspx?View={44c9b385-7638-47af-ba03-cddf16ec3a94}&SortField=DateTag&SortDir=Asc",
              "http://documents.adventistarchives.org/SSQ/Forms/AllItems.aspx?Paged=TRUE&p_SortBehavior=0&p_DateTag=1912-10-01&p_FileLeafRef=SS19121001-04%2epdf&p_ID=457&PageFirstRow=101&SortField=DateTag&SortDir=Asc&&View={44C9B385-7638-47AF-BA03-CDDF16EC3A94}"
             ]

In [None]:
docs_to_download = []

for url in index_page_urls: 
    content = get_html_page(url)
    pdfs = filename_from_html(content)
    
    for pdf in pdfs:
        if check_year(pdf):
            print("Adding {} to download list".format(pdf))
            docs_to_download.append(pdf)
        else:
            pass

Finally, I loop through all of the filenames, create the URL to the PDF, and use `wget` to download a copy of the document into my directory for processing.

In [None]:
for doc_name in docs_to_download:
    # Build the URL with "/" explicitly; os.path.join would insert "\" on Windows.
    url = "{}/{}.pdf".format(baseurl.rstrip("/"), doc_name)
    print(url)
    wget.download(url, download_directory)
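
Because this notebook was written to fill gaps in an existing collection, a variant of the loop that skips files already on disk and pauses between requests (as `-w 2` did in the earlier `wget` command) may be useful. The sketch below is one possible shape for that, not the code I ran: `download_missing` and its parameters are hypothetical, and it defaults to the standard library's `urllib.request.urlretrieve`, although `wget.download` could be passed as `fetch` since it accepts the same two arguments.

```python
import os
import time
import urllib.request


def download_missing(doc_names, baseurl, directory, delay=2,
                     fetch=urllib.request.urlretrieve):
    """Download each PDF that is not already on disk; return the names fetched."""
    if directory:
        os.makedirs(directory, exist_ok=True)
    fetched = []
    for doc_name in doc_names:
        filename = "{}.pdf".format(doc_name)
        target = os.path.join(directory, filename)
        if os.path.exists(target):
            continue  # already downloaded in an earlier run
        url = "{}/{}".format(baseurl.rstrip("/"), filename)
        fetch(url, target)
        fetched.append(doc_name)
        time.sleep(delay)  # pause between requests to avoid straining the server
    return fetched
```

Calling `download_missing(docs_to_download, baseurl, download_directory)` would then retrieve only the files that are absent from the download directory.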