My first challenge for this digital project was to collect a copy of the available documents within the parameters of my study. The source for my study is the collection of scanned periodicals made available by the [Office of Archives, Statistics, and Research of the Seventh-day Adventist Church](http://documents.adventistarchives.org/). Downloading files was one of the first computational tasks I tackled as part of the dissertation. Although their website provides tools for searching and filtering the documents, I instead examined all of the titles in the [periodicals section of the site](http://documents.adventistarchives.org/Periodicals/Forms/AllFolders.aspx) and compiled a list of titles that fit my geographic and temporal constraints. From there, I created lists of documents to download.

My early methods for identifying the URLs to the documents and downloading the files were largely manual. I saved copies of the HTML from the directory pages that listed the documents I wanted to download, and passed them to a [script]() that used `Beautiful Soup` to extract the PDF ids, reconstruct the URLs, and write the URLs to a new text file. After deleting the URLs of any documents that were out of range, I passed the file of URLs to `wget` using the following syntax:

`wget -i scrapeList.txt -w 2 --limit-rate=200k`

I ran this process for each of the periodical titles included in this study. It took approximately a week to download all 13,000 files to my local machine.
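The extraction step described above might look something like the following sketch. It is not the original script: it uses the standard library's `html.parser` in place of `Beautiful Soup` so that it runs without extra packages, and the names `PdfLinkParser`, `extract_pdf_ids`, and `write_url_list` are hypothetical. It also assumes that each document appears in the saved HTML as an `<a>` link whose path ends in `.pdf`, and that the base URL ends in a slash.

```python
from html.parser import HTMLParser
from posixpath import basename, splitext
from urllib.parse import urlparse


class PdfLinkParser(HTMLParser):
    """Collect the ids (filenames without the .pdf extension) of linked PDFs."""

    def __init__(self):
        super().__init__()
        self.pdf_ids = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        # An attribute without a value is reported as None, hence the `or ""`.
        href = dict(attrs).get("href") or ""
        path = urlparse(href).path
        if path.lower().endswith(".pdf"):
            self.pdf_ids.append(splitext(basename(path))[0])


def extract_pdf_ids(html):
    """Return the PDF ids found in a saved directory page, in page order."""
    parser = PdfLinkParser()
    parser.feed(html)
    return parser.pdf_ids


def write_url_list(pdf_ids, baseurl, out_path):
    """Write one reconstructed PDF URL per line (assumes baseurl ends in '/')."""
    with open(out_path, "w") as f:
        for pdf_id in pdf_ids:
            f.write("{}{}.pdf\n".format(baseurl, pdf_id))
```

The resulting text file has one URL per line, which is the format `wget -i` expects.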

This notebook reflects a more automated version of that process, created in 2017 to verify whether my local corpus still matched the web holdings, and to download any missing documents. The example recorded here is for downloading the Sabbath School Quarterly collection, which I missed during my initial collection phase. These scripts use the `requests` library to retrieve the HTML from the document directory pages and `Beautiful Soup` to locate the filenames. I again use `wget` to download the files, this time through the Python `wget` package from within the notebook.
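One way to check a local corpus against the web holdings is to compare the document ids listed online with the PDF filenames already on disk. The sketch below illustrates that idea only; the function names `local_doc_ids` and `find_missing` are hypothetical and are not part of the `GoH` module, and it assumes the local files are named `<doc id>.pdf`.

```python
from pathlib import Path


def local_doc_ids(directory):
    """Return the ids (filename stems) of the PDFs already in `directory`."""
    return {p.stem for p in Path(directory).glob("*.pdf")}


def find_missing(remote_ids, directory):
    """Return the remote ids with no matching local PDF, in remote order."""
    have = local_doc_ids(directory)
    return [doc_id for doc_id in remote_ids if doc_id not in have]
```

Passing the list of ids scraped from an index page together with the download directory would return only the documents that still need to be fetched.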

The `GoH` module is one I wrote, and contains a number of scripts used within the dissertation project. The documentation for the `compile` submodule is available in the [dissertation documentation (appendices?)]().

In [1]:
import wget
from GoH import compile
from os.path import join

The first step was to set the directory to download the documents into, as well as the root URL where the PDF documents can be found. The current setup requires `baseurl` to be changed manually for each title.

In [2]:
download_directory = "/Users/jeriwieringa/Dissertation/text/corpora-incoming/periodicals/"
baseurl = "http://documents.adventistarchives.org/SSQ/"

My next step when downloading a collection of documents was to generate a list of the files that I wanted to download. Here I use two functions within the `compile` module to request the HTML from each URL in `index_urls` and to extract the document ids from the HTML. Finally, to avoid downloading any files outside of my study, I check the year in the doc ID before adding it to my list of documents to download.


In [3]:
index_urls = ["http://documents.adventistarchives.org/SSQ/Forms/AllItems.aspx?View={44c9b385-7638-47af-ba03-cddf16ec3a94}&SortField=DateTag&SortDir=Asc",
        "http://documents.adventistarchives.org/SSQ/Forms/AllItems.aspx?Paged=TRUE&p_SortBehavior=0&p_DateTag=1912-10-01&p_FileLeafRef=SS19121001-04%2epdf&p_ID=457&PageFirstRow=101&SortField=DateTag&SortDir=Asc&&View={44C9B385-7638-47AF-BA03-CDDF16EC3A94}"
       ]

missing_docs = []

for url in index_urls: 
    content = compile.get_html_page(url)
    pdfs = compile.filename_from_html(content)
    
    # Keep only the documents whose ids fall within the study's date range.
    for pdf in pdfs:
        if compile.check_year(pdf):
            print("Adding {} to download list".format(pdf))
            missing_docs.append(pdf)

Adding SS18880101-01 to download list
Adding SS18880701-03 to download list
Adding SS18890101-01 to download list
Adding SS18890701-03 to download list
Adding SS18891001-04 to download list
Adding SS18900104-01 to download list
Adding SS18900301-01e to download list
Adding SS18900405-02 to download list
Adding SS18900415-02e1 to download list
Adding SS18900501-02e2 to download list
Adding SS18900705-03 to download list
Adding SS18901004-04 to download list
Adding SS18910103-01 to download list
Adding SS18910201-02 to download list
Adding SS18910601-03 to download list
Adding SS18911001-04 to download list
Adding SS18920101-01 to download list
Adding SS18920401-02 to download list
Adding SS18920701-03 to download list
Adding SS18921001-04 to download list
Adding SS18930101-01 to download list
Adding SS18930401-02 to download list
Adding SS18930701-03 to download list
Adding SS18931001-04 to download list
Adding SS18940101-01 to download list
Adding SS18940401-02 to download list
Adding 
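The implementation of `compile.check_year` is not shown in this notebook. As a rough illustration of the kind of check involved, the sketch below pulls the year out of a doc ID and tests it against a range. It assumes that ids follow the pattern visible in the output above (a letter prefix followed by a `YYYYMMDD` date), and the names `year_from_doc_id` and `in_study_range`, along with the `first_year` and `last_year` parameters, are hypothetical.

```python
import re

# Letter prefix (e.g. "SS"), then a four-digit year, then month and day.
DOC_ID_PATTERN = re.compile(r"^[A-Za-z]+(\d{4})\d{4}")


def year_from_doc_id(doc_id):
    """Return the year encoded in a doc ID, or None if the id does not match."""
    match = DOC_ID_PATTERN.match(doc_id)
    return int(match.group(1)) if match else None


def in_study_range(doc_id, first_year, last_year):
    """True when the id's year falls within first_year..last_year inclusive."""
    year = year_from_doc_id(doc_id)
    return year is not None and first_year <= year <= last_year
```

Returning `None` for ids that do not match keeps unexpected filenames out of the download list rather than raising an error partway through a long run.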

Finally, I loop through all of the filenames, create the URL to the PDF, and use `wget` to download a copy of the document into my directory for processing.

In [4]:
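for doc_name in missing_docs:
    # baseurl already ends in "/", so string formatting is enough to build
    # the URL; os.path.join would insert a backslash on Windows.
    url = "{}{}.pdf".format(baseurl, doc_name)
    print(url)
    wget.download(url, download_directory)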

http://documents.adventistarchives.org/SSQ/SS18880101-01.pdf
http://documents.adventistarchives.org/SSQ/SS18880701-03.pdf
http://documents.adventistarchives.org/SSQ/SS18890101-01.pdf
http://documents.adventistarchives.org/SSQ/SS18890701-03.pdf
http://documents.adventistarchives.org/SSQ/SS18891001-04.pdf
http://documents.adventistarchives.org/SSQ/SS18900104-01.pdf
http://documents.adventistarchives.org/SSQ/SS18900301-01e.pdf
http://documents.adventistarchives.org/SSQ/SS18900405-02.pdf
http://documents.adventistarchives.org/SSQ/SS18900415-02e1.pdf
http://documents.adventistarchives.org/SSQ/SS18900501-02e2.pdf
http://documents.adventistarchives.org/SSQ/SS18900705-03.pdf
http://documents.adventistarchives.org/SSQ/SS18901004-04.pdf
http://documents.adventistarchives.org/SSQ/SS18910103-01.pdf
http://documents.adventistarchives.org/SSQ/SS18910201-02.pdf
http://documents.adventistarchives.org/SSQ/SS18910601-03.pdf
http://documents.adventistarchives.org/SSQ/SS18911001-04.pdf
http://documents.ad
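The loop above could also be wrapped in a small helper that builds each URL with `urllib.parse.urljoin` and pauses between requests, much as the `-w 2` flag did in the earlier command-line workflow. This is a sketch under stated assumptions rather than the code used for the dissertation: `build_pdf_url`, `download_all`, and the `fetch` and `pause_seconds` parameters are hypothetical names, and `fetch` stands for any callable that takes a URL and a target directory.

```python
import time
from urllib.parse import urljoin


def build_pdf_url(baseurl, doc_id):
    """Join a collection's base URL and a doc ID into the URL of the PDF."""
    # urljoin replaces the last path segment unless the base ends in "/",
    # so normalize the base to exactly one trailing slash first.
    return urljoin(baseurl.rstrip("/") + "/", "{}.pdf".format(doc_id))


def download_all(doc_ids, baseurl, directory, fetch, pause_seconds=2):
    """Fetch each document in turn, waiting between requests.

    Returns the list of URLs that were requested.
    """
    urls = []
    for i, doc_id in enumerate(doc_ids):
        if i:
            time.sleep(pause_seconds)  # be polite to the archive's server
        url = build_pdf_url(baseurl, doc_id)
        fetch(url, directory)
        urls.append(url)
    return urls
```

With the variables defined earlier in this notebook, a call such as `download_all(missing_docs, baseurl, download_directory, fetch=wget.download)` would reproduce the loop above with a two-second pause between files.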

In [5]:
# %load ../shared_elements/system_info.py
import IPython
print (IPython.sys_info())
!pip freeze

{'commit_hash': '5c9c918',
 'commit_source': 'installation',
 'default_encoding': 'UTF-8',
 'ipython_path': '/Users/jeriwieringa/miniconda3/envs/dissertation2/lib/python3.5/site-packages/IPython',
 'ipython_version': '5.1.0',
 'os_name': 'posix',
 'platform': 'Darwin-16.5.0-x86_64-i386-64bit',
 'sys_executable': '/Users/jeriwieringa/miniconda3/envs/dissertation2/bin/python',
 'sys_platform': 'darwin',
 'sys_version': '3.5.2 |Continuum Analytics, Inc.| (default, Jul  2 2016, '
                '17:52:12) \n'
                '[GCC 4.2.1 Compatible Apple LLVM 4.2 (clang-425.0.28)]'}
alabaster==0.7.10
anaconda-client==1.5.5
appdirs==1.4.3
appnope==0.1.0
argh==0.26.1
Babel==2.3.4
beautifulsoup4==4.5.3
blinker==1.4
bokeh==0.12.4
boto==2.43.0
brewer2mpl==1.4.1
bz2file==0.98
chest==0.2.3
cleanOCR==0.1
cloudpickle==0.2.2
clyent==1.2.2
cycler==0.10.0
dask==0.12.0
datashader==0.4.0
datashape==0.5.2
decorator==4.0.11
docutils==0.13.1
doit==0.30.3
gensim==0.12.4
geoplotlib==0.3.2
ggplot==0.11.5
Ghos