# An analysis of thirty years of articles in the _William and Mary Quarterly_, JAH, AHR, EH, and Ethnohistory

The _William and Mary Quarterly_ is the premier scholarly journal in early American studies and history.  It is published by the [Omohundro Institute of Early American History and Culture](https://oieahc.wm.edu).

The purpose of this experiment is to analyze a dataset obtained through the [JSTOR API](http://dfr.jstor.org): an XML file of metadata for around 1,000 WMQ articles published between 1976 and 2010. The dataset is this size because the API allows researchers to request data for up to 1,000 articles at a time; this search returned n = 973 articles.

My goal in this project is to perform some basic text analysis, and other kinds of analysis, on this dataset using some rudimentary Python code.

## First steps
The JSTOR API sent me a series of XML files, which I have unzipped and stored in a directory (called ```data```) in the same directory/repository as this Jupyter notebook. My first experiment is to try to use ```xml.etree.ElementTree``` to parse some of this data.

In [2]:
import xml.etree.ElementTree as ET

In [3]:
tree = ET.parse('data/citations.xml')
root = tree.getroot()

In [4]:
root.tag

'citations'

## Success
The above code imports Python's standard-library module for XML parsing, ElementTree (etree for short). Then I feed the ```citations.xml``` file into the parse method and store the resulting object in the ```tree``` variable. Finally I look up the root tag of the XML tree, which shows me that I've successfully imported the file.
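
To make the shape of the tree easier to picture, here is a minimal sketch that parses a tiny stand-in document from a string. The sample XML is an assumption: it uses the tag names that appear in this notebook (```citations```, ```article```, ```title```, ```author```, ```pubdate```), but the real ```citations.xml``` has more fields per article.

```python
import xml.etree.ElementTree as ET

# Hypothetical two-article sample; the real citations.xml has more fields.
SAMPLE_XML = """
<citations>
  <article>
    <title>Trivia</title>
    <author>Paul M. Smith</author>
    <pubdate>1984-01-01T00:00:00Z</pubdate>
  </article>
  <article>
    <title>Interpretive Frameworks: The Quest for Intellectual Order in Early American History</title>
    <author>Jack P. Greene</author>
    <pubdate>1991-10-01T00:00:00Z</pubdate>
  </article>
</citations>
"""

# fromstring() parses text and returns the root Element directly
sample_root = ET.fromstring(SAMPLE_XML)

print(sample_root.tag)   # tag of the root element
print(len(sample_root))  # number of direct children (articles)

# Each child of an article is itself an Element with a tag and some text
for child in sample_root[0]:
    print(child.tag, '->', child.text)
```

With a file on disk, ```ET.parse(path).getroot()``` gives the same kind of root element that ```ET.fromstring()``` returns here.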

## Next step
Now that we have the XML tree parsed into the ET object, we can experiment with the various functions of the ElementTree module. For instance, we can write a loop to print out the authors of the articles.

In [17]:
for child in root.iter('author'):
    author = child.text
    print(author)
    

Jeffrey H. Richards
Paul M. Smith
Jack P. Greene
Ada van Gastel
Benjamin Schmidt
Edward Countryman
Ronald M. Gephart
Deborah A. Rosen
William E. Nelson
William M. Wiecek
Kenneth Morgan
Edwin S. Gaustad
Patricia U. Bonomi
Ian Steele
Bruce H. Mann
Douglas L. Wilson
Mary E. Fissell
Daniel Blake Smith
Kevin R. McNamara
David Armitage
Robert C. H. Sweeny
Jennifer L. Morgan
J. R. Pole
Jeffrey J. Crow
Stephen Innes
Alden T. Vaughan
Max M. Edling
Mark D. Kaplanoff
Ruth H. Bloch
Philip F. Gura
Toby L. Ditz
Walter E. Minchinton
Whitfield J. Bell, Jr.
Stuart Lee Butler
Julie Clarfield
Malick W. Ghachem
Lisa Wilson Waciega
Randy J. Sparks
Sarah Rivett
Rachel Hope Cleves
Martin H. Quitt
J. F. Bosher
Paul G. E. Clemens
Douglas Edward Leach
Philip F. Gura
Peter Charles Hoffer
David Raynor
Andrew Skinner
James Alexander Dun
Donna Merwick
None
None
T. H. Breen
Gideon Mailer
None
Joan R. Gundersen
Gwen Victor Gampel
John L. Brooke
H. James Henderson
Francis J. Bremer
Joyce E. Chaplin
Gerald F. Moran
Non

Problem: I cannot figure out how to parse this XML. The process is similar to Beautiful Soup, where you have to work your way down the tree. I seem to wind up at the right child, but nothing prints out. I have looked at the following pages:

- [stack overflow](http://stackoverflow.com/questions/1912434/how-do-i-parse-xml-in-python)
- p. 156 in Severance.
- [official documentation for python](https://docs.python.org/2/library/xml.etree.elementtree.html)

I am a bit stuck so I'll take a break. My sense is that I'm almost there.  Etree should not be very hard.

## Possible Breakthrough
I think I have found a good article about this dilemma in the book [_Dive Into Python_](https://docs.python.org/2/library/xml.etree.elementtree.html).  Will return and work on this next.  

In [56]:
for child in root:
    tit = child[1].text
    print(tit)
    pubdat = child[6].text
    print(pubdat)

Revolution, Domestic Life, and the End of "Common Mercy" in Crévecoeur's "Landscapes"
1998-04-01T00:00:00Z
Trivia
1984-01-01T00:00:00Z
Interpretive Frameworks: The Quest for Intellectual Order in Early American History
1991-10-01T00:00:00Z
Van der Donck's Description of the Indians: Additions and Corrections
1990-07-01T00:00:00Z
Mapping an Empire: Cartographic and Colonial Rivalry in Seventeenth-Century Dutch and English North America
1997-07-01T00:00:00Z
Indians, the Colonial Order, and the Social Significance of the American Revolution
1996-04-01T00:00:00Z
Who Wrote "The North American" Essays?
1997-04-01T00:00:00Z
Women and Property across Colonial America: A Comparison of Legal Systems in New Mexico and New York
2003-04-01T00:00:00Z
Reason and Compromise in the Establishment of the Federal Constitution, 1787-1801
1987-07-01T00:00:00Z
The Statutory Law of Slavery and Race in the Thirteen Mainland Colonies of British America
1977-04-01T00:00:00Z
Shipping Patterns and the Atlantic Tra

## Moving on...
This is one way to parse: working "from the top down." Since an Element behaves much like a list of its children, you can locate elements by their index numbers, as above. In that example the root is ```citations```, its first child (```root[0]```) is an ```article```, and we drill into each article with a for loop:
```
for article in root:
    titl = article[1]  # the second child of each article is its title
```

Or whatever.
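
Positional indexing only works as long as every article lists its children in the same order. As a hedged sketch, the snippet below compares indexing with a lookup by tag name using ```findtext()```. The sample XML is hypothetical, and its second article deliberately puts the author first; I have not checked whether the real file ever does this.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: the second article lists its children in a different order.
SAMPLE_XML = """
<citations>
  <article>
    <title>Trivia</title>
    <author>Paul M. Smith</author>
    <pubdate>1984-01-01T00:00:00Z</pubdate>
  </article>
  <article>
    <author>Jack P. Greene</author>
    <title>Interpretive Frameworks: The Quest for Intellectual Order in Early American History</title>
    <pubdate>1991-10-01T00:00:00Z</pubdate>
  </article>
</citations>
"""
sample_root = ET.fromstring(SAMPLE_XML)

# Index-based access picks up whatever happens to sit in position 0
by_index = [article[0].text for article in sample_root]

# findtext() returns the text of the first direct child with the given tag
by_name = [article.findtext('title') for article in sample_root]

print(by_index)
print(by_name)
```

In this sample the index-based list picks up an author name for the second article, while the name-based list contains only titles.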

But there's another way to do it, which is to use the findall() method on elements within the etree.  

In [57]:
alltitls = tree.findall('.//title')
print(alltitls[0:10])
print(len(alltitls))

[<Element 'title' at 0x104ba8cc8>, <Element 'title' at 0x104bb1188>, <Element 'title' at 0x104bb15e8>, <Element 'title' at 0x104bb1a48>, <Element 'title' at 0x104bb1ea8>, <Element 'title' at 0x104bb5368>, <Element 'title' at 0x104bb57c8>, <Element 'title' at 0x104bb5c28>, <Element 'title' at 0x104bb80e8>, <Element 'title' at 0x104bb8548>]
973


In [9]:
for titl in alltitls[0:20]:
    print(titl.text)

Revolution, Domestic Life, and the End of "Common Mercy" in Crévecoeur's "Landscapes"
Trivia
Interpretive Frameworks: The Quest for Intellectual Order in Early American History
Van der Donck's Description of the Indians: Additions and Corrections
Mapping an Empire: Cartographic and Colonial Rivalry in Seventeenth-Century Dutch and English North America
Indians, the Colonial Order, and the Social Significance of the American Revolution
Who Wrote "The North American" Essays?
Women and Property across Colonial America: A Comparison of Legal Systems in New Mexico and New York
Reason and Compromise in the Establishment of the Federal Constitution, 1787-1801
The Statutory Law of Slavery and Race in the Thirteen Mainland Colonies of British America
Shipping Patterns and the Atlantic Trade of Bristol, 1749-1770
When in the Course
Lord Cornbury Redressed: The Governor and the Problem Portrait
Governors or Generals?: A Note on Martial Law and the Revolution of 1689 in English America
The Evoluti

In [54]:
! ls -a

.                  .git               README.md          test.md
..                 .gitignore         WMQ_analysis.ipynb
.DS_Store          .ipynb_checkpoints data


## Next step:
My next step will be to:
- Build a pandas DataFrame to store the data from this output
- Write regex code to pull out every run of four digits ([0-9]) in a row, which is to say all calendar years, and then build a simple frequency table or a histogram
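
The regex step could look something like the sketch below. It uses a handful of titles copied from the output above in place of the full list, and it assumes that any standalone four-digit number in a title is a calendar year, which will not always be true. The names ```sample_titles```, ```YEAR_PATTERN``` and ```year_counts``` are mine.

```python
import re
from collections import Counter

# A few titles copied from the output above, standing in for the full list
sample_titles = [
    'Reason and Compromise in the Establishment of the Federal Constitution, 1787-1801',
    'Shipping Patterns and the Atlantic Trade of Bristol, 1749-1770',
    'Governors or Generals?: A Note on Martial Law and the Revolution of 1689 in English America',
    'Trivia',
]

# \b keeps the match to exactly four digits, so a longer number such as 17890 is skipped
YEAR_PATTERN = re.compile(r'\b\d{4}\b')

# Tally every four-digit match across all titles
year_counts = Counter()
for title in sample_titles:
    year_counts.update(YEAR_PATTERN.findall(title))

# A simple frequency table, sorted by year
for year, count in sorted(year_counts.items()):
    print(year, count)
```

The resulting ```Counter``` is already a frequency table; a histogram would only need the years converted to integers first.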

In [8]:
import pandas as pd
import numpy as np

In [12]:
titles = []
dates = []
# authors = []
alltitls = tree.findall('.//title')
alldates = tree.findall('.//pubdate')
allauths = tree.findall('.//author')
for titl in alltitls:
    titles.append(titl.text)

for date in alldates:
    dates.append(date.text)
    
# for auth in allauths:
#     authors.append(auth.text)
# This technique did not work due to the problem of multiple authors; see below.
print(titles[:10], dates[:10])  # authors is not built in this cell; see below

['Revolution, Domestic Life, and the End of "Common Mercy" in Crévecoeur\'s "Landscapes"', 'Trivia', 'Interpretive Frameworks: The Quest for Intellectual Order in Early American History', "Van der Donck's Description of the Indians: Additions and Corrections", 'Mapping an Empire: Cartographic and Colonial Rivalry in Seventeenth-Century Dutch and English North America', 'Indians, the Colonial Order, and the Social Significance of the American Revolution', 'Who Wrote "The North American" Essays?', 'Women and Property across Colonial America: A Comparison of Legal Systems in New Mexico and New York', 'Reason and Compromise in the Establishment of the Federal Constitution, 1787-1801', 'The Statutory Law of Slavery and Race in the Thirteen Mainland Colonies of British America'] ['1998-04-01T00:00:00Z', '1984-01-01T00:00:00Z', '1991-10-01T00:00:00Z', '1990-07-01T00:00:00Z', '1997-07-01T00:00:00Z', '1996-04-01T00:00:00Z', '1997-04-01T00:00:00Z', '2003-04-01T00:00:00Z', '1987-07-01T00:00:00Z',

## The problem of multiple authors
The code doesn't work because it relies on ```.findall``` to generate a list of _all_ authors in the XML. The problem is that several essays have multiple authors, so the list of authors does not have exactly one entry per article. When it comes time to concatenate the lists into a single data frame, they are different lengths and the rows no longer line up. We need a bit of code to account for that.
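
The mismatch is easy to reproduce on a toy document. The sketch below uses a hypothetical three-article sample with placeholder titles and author names, in which one article has two authors.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample with placeholder names: the first article has two authors
SAMPLE_XML = """
<citations>
  <article>
    <title>First hypothetical title</title>
    <author>Author A</author>
    <author>Author B</author>
  </article>
  <article>
    <title>Second hypothetical title</title>
    <author>Author C</author>
  </article>
  <article>
    <title>Third hypothetical title</title>
    <author>Author D</author>
  </article>
</citations>
"""
sample_root = ET.fromstring(SAMPLE_XML)

# Flat searches over the whole tree lose track of which article each hit came from
flat_titles = sample_root.findall('.//title')
flat_authors = sample_root.findall('.//author')
print(len(flat_titles), len(flat_authors))  # the two counts differ

# Counting authors per article shows where the extra entry comes from
authors_per_article = [len(a.findall('author')) for a in sample_root.iter('article')]
print(authors_per_article)
```

Two lists of different lengths cannot be lined up row by row, which is why the authors have to be collected per article rather than from the whole tree at once.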

In [51]:
article_count = 0
for article in root.iter('article'):
    authors = article.findall('author')
    for author in authors:
        print(author.text)
    article_count += 1
    print(article_count)
        
        
        
        

Jeffrey H. Richards
1
Paul M. Smith
2
Jack P. Greene
3
Ada van Gastel
4
Benjamin Schmidt
5
Edward Countryman
6
Ronald M. Gephart
7
Deborah A. Rosen
8
William E. Nelson
9
William M. Wiecek
10
Kenneth Morgan
11
Edwin S. Gaustad
12
Patricia U. Bonomi
13
Ian Steele
14
Bruce H. Mann
15
Douglas L. Wilson
16
Mary E. Fissell
17
Daniel Blake Smith
18
Kevin R. McNamara
19
David Armitage
20
Robert C. H. Sweeny
21
Jennifer L. Morgan
22
J. R. Pole
23
Jeffrey J. Crow
24
Stephen Innes
25
Alden T. Vaughan
26
Max M. Edling
Mark D. Kaplanoff
27
Ruth H. Bloch
28
Philip F. Gura
29
Toby L. Ditz
30
Walter E. Minchinton
31
Whitfield J. Bell, Jr.
Stuart Lee Butler
Julie Clarfield
32
Malick W. Ghachem
33
Lisa Wilson Waciega
34
Randy J. Sparks
35
Sarah Rivett
36
Rachel Hope Cleves
37
Martin H. Quitt
38
J. F. Bosher
39
Paul G. E. Clemens
40
Douglas Edward Leach
41
Philip F. Gura
42
Peter Charles Hoffer
43
David Raynor
Andrew Skinner
44
James Alexander Dun
45
Donna Merwick
46
None
47
None
48
T. H. Breen
49
Gideon

## This isn't working
It seems like I'm successfully iterating through the articles. My logic is that, while inside each "article" node of the tree, I can pick up every instance of "author" by using ```.findall()```. The problem is that the authors do not stay grouped with their article, as I expected; they simply print to the output one after another. In other words, I cannot seem to collect the multiple authors of each "article" together. Instead they come out one by one, as if from the whole corpus, which is not what I want.

## Figured it out
Okay, now I have it working. Now I just need to get the authors into the dataframe.

In [55]:
authors_list = []
for article in root.iter('article'):
    # findall() returns a list (possibly empty), never None
    authors = article.findall('author')
    # Start a fresh list for each article so co-authors stay together
    author_list = []
    for author in authors:
        author_list.append(author.text)
    authors_list.append(author_list)
len(authors_list)

973

In [58]:
dates_ser = pd.Series(dates)
titles_ser = pd.Series(titles)
authors_ser = pd.Series(authors_list)
# authors_ser = pd.Series(authors)
wmq = pd.concat([dates_ser, titles_ser, authors_ser], axis = 1)
wmq.columns = ['date', 'title', 'author']
wmq.shape

(973, 3)
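
The three parallel lists above only line up if every article has exactly one title and one pubdate. One alternative, sketched here under assumptions, is to build a single record per article in one pass so that the fields cannot drift apart. The sample XML and the ```article_records``` helper are hypothetical, and joining co-authors with "; " is just one possible choice; keeping a list per article, as above, works too.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample with placeholder names: two co-authors, then an empty author tag
SAMPLE_XML = """
<citations>
  <article>
    <title>First hypothetical title</title>
    <author>Author A</author>
    <author>Author B</author>
    <pubdate>1998-04-01T00:00:00Z</pubdate>
  </article>
  <article>
    <title>Second hypothetical title</title>
    <author/>
    <pubdate>1984-01-01T00:00:00Z</pubdate>
  </article>
</citations>
"""


def article_records(root):
    """Return one dict per <article>, with co-authors joined into one string."""
    records = []
    for article in root.iter('article'):
        # Skip author elements whose text is None (these print as "None" above)
        names = [a.text for a in article.findall('author') if a.text]
        records.append({
            'date': article.findtext('pubdate'),
            'title': article.findtext('title'),
            'author': '; '.join(names),
        })
    return records


records = article_records(ET.fromstring(SAMPLE_XML))
for record in records:
    print(record)
```

Since pandas accepts a list of dicts, ```pd.DataFrame(records)``` should give one row per article with the same three columns as ```wmq```.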

In [59]:
wmq.head()

| | date | title | author |
|---|---|---|---|
| 0 | 1998-04-01T00:00:00Z | Revolution, Domestic Life, and the End of "Com... | [Jeffrey H. Richards] |
| 1 | 1984-01-01T00:00:00Z | Trivia | [Paul M. Smith] |
| 2 | 1991-10-01T00:00:00Z | Interpretive Frameworks: The Quest for Intelle... | [Jack P. Greene] |
| 3 | 1990-07-01T00:00:00Z | Van der Donck's Description of the Indians: Ad... | [Ada van Gastel] |
| 4 | 1997-07-01T00:00:00Z | Mapping an Empire: Cartographic and Colonial R... | [Benjamin Schmidt] |
