Now that we have a collection of texts [selected and downloaded](http://jeriwieringa.com/2017/04/25/gathering-sources/), and have [extracted the text](http://jeriwieringa.com/2017/04/27/Extract-Text-from-PDFs/), we need to spend some time identifying what the corpus contains, both in terms of coverage and quality. As I describe in the [project overview](http://jeriwieringa.com/2017/04/21/updated-dissertation-description), I will be using these texts to make arguments about the development of the community's discourses around health and salvation. While the corpus makes that analysis possible, it also sets the limits of what we can claim from text analysis alone. Without an understanding of what those limits are, we run the risk of claiming more than the sources can sustain, and in doing so, minimizing the very complexities that historical research seeks to reveal. 

In [1]:
"""My usual practice for gathering the filenames is to read 
them in from a directory. So that this code can be run locally 
without the full corpus downloaded, I exported the list 
of filenames to an index file for use in this notebook.
"""
with open("data/2017-05-05-corpus-index.txt", "r") as f:
    corpus = f.read().splitlines()

In [2]:
len(corpus)

197943

To create an overview of the corpus, I will use the document filenames along with some descriptive metadata that I created.

Filenames are an often underestimated feature of digital files, but one that can be used to great effect. For my corpus, the team that digitized the periodicals did an excellent job of providing the files with descriptive names. Overall, the files conform to the following pattern:

`PrefixYYYYMMDD-V00-00.pdf`

I discovered a few files that deviated from the pattern, but renamed those so that the pattern held throughout the corpus. When splitting the PDF documents into pages, I preserved the structure, adding `-page0.txt` to the end. 

The advantage of this format is that the filenames contain the metadata I need to place each file within its context. By isolating the different sections of the filename, I can quickly place any file with reference to the periodical title and the publication date.
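As a rough illustration of how the sections of such a filename can be isolated, here is a minimal sketch using a single regular expression with named groups. The pattern, the `parse_filename` helper, and the example filenames are all assumptions made for illustration; in particular, I am assuming that the page suffix replaces the `.pdf` extension and that the volume and issue are purely numeric, which may not match every file in the corpus.

```python
import re

# Assumed layout: letters, an 8-digit date, volume, issue, and an
# optional page suffix. This pattern is illustrative, not authoritative.
FILENAME_PATTERN = re.compile(
    r"^(?P<prefix>[A-Za-z]+)"
    r"(?P<year>\d{4})(?P<month>\d{2})(?P<day>\d{2})"
    r"-V(?P<volume>\d+)-(?P<issue>\d+)"
    r"(?:-page(?P<page>\d+))?"
    r"\.(?:pdf|txt)$"
)


def parse_filename(filename):
    """Return the named parts of a filename, or None if it does not fit."""
    match = FILENAME_PATTERN.match(filename)
    return match.groupdict() if match else None


# Hypothetical filenames, made up for this example.
print(parse_filename("ADV18990101-V01-01.pdf"))
print(parse_filename("ADV18990101-V01-01-page3.txt"))
print(parse_filename("notes.txt"))
```

Returning `None` for names that do not fit makes it easy to list the files that deviate from the pattern before renaming them.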

In [3]:
import pandas as pd
import re

In [4]:
def extract_pub_info(doc_list):
    """Use regex to extract metadata from filename.
    
    Note:
        Assumes that the filename is formatted as::
            
            `PrefixYYYYMMDD-V00-00.pdf`
    
    Args:
        doc_list (list): List of the filenames in the corpus.
    Returns:
        dict: Dictionary with the year and title abbreviation for each filename.
    """
    
    corpus_info = {}
    
    for doc_id in doc_list:
        
        # Split the ID into three parts on the '-'
        split_doc_id = doc_id.split('-')
        
        # Get the prefix by matching the first set of letters 
        # in the first part of the filename.
        title = re.match("[A-Za-z]+", split_doc_id[0])
        # Get the dates by grabbing all of the number elements 
        # in the first part of the filename.
        dates = re.search(r'[0-9]+', split_doc_id[0])
        # The first four digits are the publication year.
        year = dates.group()[:4]
        
        # Update the dictionary with the title and year 
        # for the filename.
        corpus_info[doc_id] = {'title': title.group(), 'year': year}
    
    return corpus_info

In [5]:
corpus_info = extract_pub_info(corpus)

One of the most useful libraries in Python for working with data is [Pandas](http://pandas.pydata.org/). With Pandas, Python users gain much of the functionality that our colleagues who work with R have long celebrated as the benefits of that domain-specific language. 

By transforming our `corpus_info` dictionary into a dataframe, we can quickly filter and tabulate a number of different statistics on our corpus.
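To make the shape of that transformation concrete, here is a small sketch on a toy dictionary. The filenames and the `sample_*` names are invented for illustration and stand in for the real `corpus_info`; only the Pandas calls themselves are standard.

```python
import pandas as pd

# Toy stand-in for corpus_info; the filenames are hypothetical.
sample_info = {
    "ADV18990101-V01-01-page1.txt": {"title": "ADV", "year": "1899"},
    "ADV18990101-V01-01-page2.txt": {"title": "ADV", "year": "1899"},
    "ADV19000101-V02-01-page1.txt": {"title": "ADV", "year": "1900"},
    "GH18990101-V01-01-page1.txt": {"title": "GH", "year": "1899"},
}

# Each outer key becomes a row label; each inner key becomes a column.
sample_df = pd.DataFrame.from_dict(sample_info, orient="index")
sample_df.index.name = "docs"
sample_df = sample_df.reset_index()

# Filter: pages from one title in one year.
adv_1899 = sample_df[(sample_df["title"] == "ADV") & (sample_df["year"] == "1899")]

# Tabulate: number of pages per title.
pages_per_title = sample_df["title"].value_counts()

print(len(adv_1899))
print(pages_per_title.to_dict())
```

Note that `year` is a string here, because it was sliced out of the filename, so comparisons need to use `"1899"` rather than `1899` unless the column is converted first.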

In [6]:
df = pd.DataFrame.from_dict(corpus_info, orient='index')

In [7]:
df.index.name = 'docs'
df = df.reset_index()

You can preview the initial dataframe by uncommenting the cell below.

In [8]:
# df

In [9]:
df = df.groupby(["title", "year"], as_index=False).docs.count()

In [12]:
df

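| | title | year | docs |
|---|---|---|---|
| 0 | ADV | 1898 | 26 |
| 1 | ADV | 1899 | 674 |
| 2 | ADV | 1900 | 463 |
| 3 | ADV | 1901 | 389 |
| 4 | ADV | 1902 | 440 |
| 5 | ADV | 1903 | 428 |
| 6 | ADV | 1904 | 202 |
| 7 | ADV | 1905 | 20 |
| 8 | ARAI | 1909 | 64 |
| 9 | ARAI | 1919 | 32 |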


With nearly 500 rows, the data table is too large to give a good sense of the corpus's coverage from reading alone, so it is necessary to create some visualizations of the records. For a quick prototyping tool, I am using the [`Bokeh`](http://bokeh.pydata.org/en/latest/) library.

In [13]:
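# Requires a Bokeh release that still ships bokeh.charts (the module was
# later split out of Bokeh into the separate bkcharts package).
from bokeh.charts import Bar, show
from bokeh.charts import defaults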
from bokeh.palettes import viridis
from bokeh.io import output_notebook

In [14]:
output_notebook()

In [15]:
defaults.width = 900
defaults.height = 950

In this first graph, I am showing the total number of pages per title, per year in the corpus.

In [16]:
p = Bar(df, 
        'year', 
        values='docs',
        agg='sum', 
        stack='title',
        palette= viridis(30), 
        title="Pages per Title per Year")

In [17]:
show(p)

This graph of the corpus reflects the historical development of the denomination's publication efforts. Starting with a single publication in 1849, the publishing efforts of the denomination expand in the 1860s as they launch their health reform efforts, expand again in the 1880s as they start a publishing house in California and address concerns about Sunday observance laws, and again at the turn of the century as the denomination reorganizes and regional publications expand. The chart also reveals some holes in the corpus. The *Youth's Instructor* (shown here in yellow) is one of the oldest continuous denominational publications, but the pages available for the years from 1850 to 1899 are inconsistent.

In interpreting the results of mining these texts, it will be important to factor in the relative difference in size and diversity of publication venues between the early years of the denomination and the later years of this study. 
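One way to keep those differences in view is to tabulate, for each year, the number of pages, the number of distinct titles, and the share of the whole corpus that year represents. The sketch below assumes a dataframe with the same `title`, `year`, and `docs` columns as the grouped table above; the title abbreviations and page counts are invented for illustration and are not taken from the corpus.

```python
import pandas as pd

# Toy counts in the same shape as the grouped dataframe.
# Abbreviations and numbers are made up for this example.
sample_counts = pd.DataFrame(
    {
        "title": ["RH", "RH", "HR", "RH", "HR", "ST"],
        "year": ["1860", "1870", "1870", "1880", "1880", "1880"],
        "docs": [400, 800, 200, 1000, 300, 700],
    }
)

# Size (pages) and diversity (distinct titles) for each year.
per_year = sample_counts.groupby("year").agg(
    pages=("docs", "sum"),
    titles=("title", "nunique"),
)

# Fraction of all pages in the sample that fall in each year.
per_year["share"] = per_year["pages"] / per_year["pages"].sum()

print(per_year)
```

Dividing later word or topic counts by the `pages` column for the matching year is one simple way to avoid mistaking a larger corpus for a growing interest in a subject.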

In [18]:
by_title = df.groupby(["title"], as_index=False).docs.sum()

In [19]:
p = Bar(by_title, 
        'title', 
        values='docs', 
        color='title', 
        palette=viridis(30), 
        title="Total Pages by Title"
       )

In [20]:
show(p)

Another way to view the coverage of the corpus is by total pages per periodical title. The *Advent Review and Sabbath Herald* dominates the corpus in number of pages, with *The Health Reformer*, *Signs of the Times*, and the *Youth's Instructor* making up the next largest share of the corpus. In terms of scale, these publications will have (and had) a prominent role in shaping the discourse of the SDA community. At the same time, it will be informative to look to the smaller publications to see if we can surface alternative and dissonant ideas.

In [21]:
topic_metadata = pd.read_csv('data/2017-05-05-periodical-topics.csv')

In [22]:
topic_metadata

| | periodicalTitle | title | startYear | endYear | initialPubLocation | topic |
|---|---|---|---|---|---|---|
| 0 | Training School Advocate | ADV | 1898 | 1905 | Battle Creek, MI | Education |
| 1 | American Sentinel | AmSn | 1886 | 1900 | Oakland, CA | Religious Liberty |
| 2 | Advent Review and Sabbath Herald | ARAI | 1909 | 1919 | Washington, D.C. | Denominational |
| 3 | Christian Education | CE | 1909 | 1920 | Washington, D.C. | Education |
| 4 | Welcome Visitor (Columbia Union Visitor) | CUV | 1901 | 1920 | Academia, OH | Regional |
| 5 | Christian Educator | EDU | 1897 | 1899 | Battle Creek, MI | Education |
| 6 | General Conference Bulletin | GCB | 1863 | 1918 | Battle Creek, MI | Denominational |
| 7 | Gospel Herald | GH | 1898 | 1920 | Yazoo City, MS | Regional |
| 8 | Gospel of Health | GOH | 1897 | 1899 | Battle Creek, MI | Health |
| 9 | Gospel Sickle | GS | 1886 | 1888 | Battle Creek, MI | Missions |


We can generate another view by adding some external metadata for the titles. The "topics" listed here are ones I assigned when skimming the different titles. "Denominational" refers to centrally produced publications covering a wide array of topics. "Education" refers to periodicals focused on education, and "Health" to publications focused on health. "Missions" refers to publications focused on outreach and evangelism, and "Religious Liberty" to those focused on governmental concerns over Sabbath laws. Finally, "Regional" refers to periodicals produced by local union conferences, which, like the denominational titles, cover a wide range of topics.

In [23]:
by_topic = pd.merge(topic_metadata, df, on='title')
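Because `pd.merge` performs an inner join by default, any title abbreviation that is missing from the metadata file would silently drop out of the merged table. A quick check for that, sketched here on toy data, is to run a left join with `indicator=True` and look for unmatched rows. The `sample_*` names, the `XYZ` abbreviation, and the page counts are made up for illustration.

```python
import pandas as pd

# Toy page counts; "XYZ" is a made-up abbreviation with no metadata row.
sample_counts = pd.DataFrame(
    {
        "title": ["ADV", "GH", "XYZ"],
        "year": ["1899", "1899", "1899"],
        "docs": [674, 120, 50],
    }
)
sample_topics = pd.DataFrame(
    {"title": ["ADV", "GH"], "topic": ["Education", "Regional"]}
)

# A left join keeps every row of counts; the _merge column records
# whether a matching metadata row was found.
checked = pd.merge(
    sample_counts, sample_topics, on="title", how="left", indicator=True
)
unmatched = checked.loc[checked["_merge"] == "left_only", "title"].unique().tolist()

print(unmatched)
```

An empty list here would suggest that every title in the counts has a topic assigned, so the inner join loses nothing.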

In [24]:
p = Bar(by_topic, 
        'year', 
        values='docs',
        agg='sum', 
        stack='topic',
        palette= viridis(6), 
        title="Pages per Topic per Year")

In [25]:
show(p)

Here we can see the diversification of periodical subjects over time, especially around the turn of the century.

In [26]:
p = Bar(by_topic, 
        'topic', 
        values='docs',
        agg='sum', 
        stack='title',
        palette= viridis(30), 
        title="Pages per Topic by Title")

In [27]:
p.left[0].formatter.use_scientific = False
p.legend.location = "top_right"

In [28]:
show(p)

Grouping by category allows us to see that our corpus is dominated by the denominational, health, and regionally focused publications. These topics match with our research concerns, increasing our confidence that we will have enough information to determine meaningful patterns about those topics. But, due to the focus of the corpus, we should proceed cautiously before making any claims about the relative importance of those topics within the community. 

Now that we have a sense of the temporal and topical coverage of our corpus, we will next turn our attention to evaluating the quality of the data that we have gathered from the scanned PDF files.