<img align="right" width="300" src="libraries_short_color.png" alt="NYU Libraries Logo">

# Data in/and the Humanities
**[NYU Abu Dhabi Winter Institute in Digital Humanities](https://wp.nyu.edu/widh/)**
## Course Session \#5-2 -- Text as Data API Activity

**Nicholas Wolf**<br/>
[ORCID 0000-0001-5512-6151](https://orcid.org/0000-0001-5512-6151)

This lesson is licensed under a [Creative Commons Attribution-NonCommercial 4.0 International License](https://creativecommons.org/licenses/by-sa/4.0/).

**Overview**

In this session we take advantage of machine-readable full text made accessible via an Application Programming Interface (API) endpoint. Specifically, we'll make use of a tremendous resource: the Library of Congress's [Chronicling America](https://chroniclingamerica.loc.gov/about/). See also the [description of its API](https://chroniclingamerica.loc.gov/about/api/).

In addition to revisiting our JSON data structures, we'll use some advanced techniques in Google Sheets to add to our ability to munge data.

**Materials**

 - Web browser, Google Sheets
 

### 1. Reflection

*Chronicling America* reports that the project contains millions of [digitized pages](https://chroniclingamerica.loc.gov/search/pages/results/#tab=tab_search) from 140,000 newspaper titles spanning 1789 to 1963. This project was built on a long-term NEH grant to select and digitize local newspapers for preservation.

1. What general cautions should we take when drawing conclusions from these sources?

Consider:

 - What limitations might we expect in the process of turning *digitized* newspapers into *machine-readable* texts?
 - How have newspapers changed over time?
 

### 2. Using a Browser and API

*Chronicling America* provides a helpful basic description of how to query its newspaper metadata and full text. Like many APIs, this one works by sending a properly formed URL via HTTP (the protocol that governs passing web content to local applications such as browsers).

It notes that we need to send a URL request formed like this:

[https://chroniclingamerica.loc.gov/search/titles/results/?terms=michigan](https://chroniclingamerica.loc.gov/search/titles/results/?terms=michigan)

...to query all newspaper titles with "michigan" present. Click on the query above to try it in your own browser. Note the use of a "?" to mark the start of the set of parameters sent to the API endpoint (the base URL).

We can also ask for a different format, JSON:

[https://chroniclingamerica.loc.gov/search/titles/results/?terms=michigan&format=json](https://chroniclingamerica.loc.gov/search/titles/results/?terms=michigan&format=json)
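The two URLs above follow one pattern: the endpoint, a "?", and then the parameters joined by "&". As a minimal sketch of that pattern in Python (the function name `build_query_url` is a hypothetical helper for illustration, not part of the *Chronicling America* API), the standard library can assemble them for us:

```python
from urllib.parse import urlencode

# Endpoint (base URL) taken from the examples above.
BASE_URL = "https://chroniclingamerica.loc.gov/search/titles/results/"


def build_query_url(base_url, params):
    """Join an endpoint and a dict of parameters into one request URL.

    Hypothetical helper for illustration: "?" marks the start of the
    parameters, and urlencode() joins them with "&".
    """
    return base_url + "?" + urlencode(params)


# The default (HTML) query, then the same query asking for JSON.
html_url = build_query_url(BASE_URL, {"terms": "michigan"})
json_url = build_query_url(BASE_URL, {"terms": "michigan", "format": "json"})

print(html_url)
print(json_url)
```

Building the URL from a dictionary also takes care of escaping spaces and punctuation in a search term, which is easy to get wrong when typing a URL by hand.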

If you view the JSON results in Firefox, you'll get a nice human-readable version of your JSON. Otherwise, you may need to copy and paste the raw JSON into a text editor to understand its contents.

<img align="left" width="500" src="screencaptures/json-page-example.png" alt="Paginated Results"><br clear="both"><br/>


Note that the results are given to you 20 at a time. This prevents your application or browser from crashing by receiving thousands of results (or more) at once. Instead, the system (like many APIs) relies on paging: results are grouped into batches of 20, and you must request successive batches to eventually download the full set. Pagination can be set like this:

[https://chroniclingamerica.loc.gov/search/titles/results/?terms=michigan&format=json](https://chroniclingamerica.loc.gov/search/titles/results/?terms=michigan&format=json)

[https://chroniclingamerica.loc.gov/search/titles/results/?terms=michigan&format=json&page=2](https://chroniclingamerica.loc.gov/search/titles/results/?terms=michigan&format=json&page=2)

[https://chroniclingamerica.loc.gov/search/titles/results/?terms=michigan&format=json&page=3](https://chroniclingamerica.loc.gov/search/titles/results/?terms=michigan&format=json&page=3)

...etc.

We do have another option, however, which is to explicitly tell the API how many results we want:

[https://chroniclingamerica.loc.gov/search/titles/results/?terms=michigan&format=json&rows=200](https://chroniclingamerica.loc.gov/search/titles/results/?terms=michigan&format=json&rows=200)
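To make the paging arithmetic concrete, here is a small Python sketch. It only builds the list of URLs and sends nothing to the server. The helper names `pages_needed` and `page_urls` are hypothetical, and the total number of results is passed in by hand rather than read from the API.

```python
import math
from urllib.parse import urlencode

BASE_URL = "https://chroniclingamerica.loc.gov/search/titles/results/"


def pages_needed(total_results, per_page=20):
    """Number of batches required to cover every result."""
    return math.ceil(total_results / per_page)


def page_urls(terms, last_page, rows=None):
    """Build one URL per page of results (hypothetical helper).

    Mirrors the links above: the first page carries no page parameter,
    and later pages add page=2, page=3, and so on.
    """
    urls = []
    for page in range(1, last_page + 1):
        params = {"terms": terms, "format": "json"}
        if rows is not None:
            params["rows"] = rows  # ask for a larger batch size
        if page > 1:
            params["page"] = page
        urls.append(BASE_URL + "?" + urlencode(params))
    return urls


# 45 is a made-up total, used only to show the arithmetic: 20 + 20 + 5.
for url in page_urls("michigan", pages_needed(45)):
    print(url)
```

Raising `rows` reduces the number of requests needed, at the cost of a larger response each time.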





### 3. Exploring Full Text and Parsing JSON

Now that we know how this works, let's try to pull some interesting full-text data. The API documentation tells us that we can search for full-text results like this:

[https://chroniclingamerica.loc.gov/search/pages/results/?andtext=thomas&format=json&rows=200](https://chroniclingamerica.loc.gov/search/pages/results/?andtext=thomas&format=json&rows=200)

Let's try an interesting term like "suffrage" (or something of interest to you) and generate some JSON. Take a glance at your results.

JSON is for robots. Let’s try this again in a more readable form using Google Sheets. Visit the Sheet below and make a copy of it for you to use: https://docs.google.com/spreadsheets/d/1AiqNl54veUn1lrVl8WTqMSN6RBA5qVWPLZPOCmeXtyo/edit?usp=sharing

After opening it, click on File >> Make a Copy. This is a special premade Sheet with a built-in custom function that can pull JSON from a website and parse it as a table (thanks to this [neat little tutorial](https://medium.com/@paulgambill/how-to-import-json-data-into-google-spreadsheets-in-less-than-5-minutes-a3fede1a014a); read the details if you want to know how it works). To use it, type this function into a cell:

=ImportJSON("https://chroniclingamerica.loc.gov/search/pages/results/?andtext=thomas&format=json&rows=200")

You should see the cell read "Loading" before producing a tabular version of the 200 results.
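For a sense of what "parsing JSON as a table" involves, here is a rough Python sketch. It is not how the Sheet's ImportJSON function is implemented. The sample response is made up, and its field names (`items`, `title`, `date`, `state`) are assumptions for illustration, so compare them against the real JSON in your browser.

```python
import json

# Made-up sample standing in for an API response; field names are assumed.
SAMPLE_RESPONSE = """
{
  "totalItems": 2,
  "items": [
    {"title": "Example Gazette", "date": "19120305", "state": ["Michigan"]},
    {"title": "Sample Courier", "date": "18990714", "state": ["Ohio"]}
  ]
}
"""


def json_to_rows(raw_json, columns):
    """Flatten a list of JSON records into rows, with a header row first."""
    data = json.loads(raw_json)
    rows = [list(columns)]
    for item in data.get("items", []):
        row = []
        for column in columns:
            value = item.get(column, "")
            # A list of values is squeezed into a single cell.
            if isinstance(value, list):
                value = "; ".join(str(v) for v in value)
            row.append(value)
        rows.append(row)
    return rows


for row in json_to_rows(SAMPLE_RESPONSE, ["title", "date", "state"]):
    print(row)
```

Each record becomes one row and each chosen field becomes one column, which is the same shape you see in the Sheet.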

We could now shift to OpenRefine for some cleanup, but Google Sheets is itself a solid tool for some data cleaning: it protects the encoding of your data and exports cleanly to CSV without disrupting data types (e.g. forcing a date-like number to be a date without asking).

For example, let's say that we want to 

### 4. Building our GeoJSON and Adding to a Map

There a

This week we are thinking about the power of annotation. Consider:

 - [Named Entity Recognition](https://en.wikipedia.org/wiki/Named-entity_recognition)
   - Example: [Matt Wilkens and Elizabeth Evans, “Nation, Ethnicity, and the Geography of British Fiction, 1880-1940,” Journal of Cultural Analytics (2018)](http://culturalanalytics.org/2018/07/nation-ethnicity-and-the-geography-of-british-fiction-1880-1940/); see [summary here](https://mattwilkens.com/) and [visualization here](https://plot.ly/~mattwilkens/119/british-literary-geography-1880-1940/#/).
 - [Parts of Speech (POS)](https://nlp.stanford.edu/software/tagger.shtml) tagging
   - Example: Topical/Local Classifier, which combines the meaning of a word as understood in its sentence with its meaning in the context of the surrounding words, e.g. Chodorow, Leacock, & Miller, [“A Topical/Local Classifier for Word Sense Identification.”](http://www.jstor.org/stable/30204796) Computers and the Humanities (2000).
 - Wait...how old is this concept?
 - [Cardiff Special Collections examples of marginalia](https://scolarcardiff.wordpress.com/2012/04/30/a-well-used-book-marginalia-and-manuscript-notes-in-an-early-16th-century-herbal/)
 - Leo Kramer’s [Hart Crane’s “The Bridge”: An Annotated Edition](https://www.jstor.org/stable/j.ctt1c5chkg)
 - [Markup](https://en.wikipedia.org/wiki/Markup_language) computing languages as annotation? Where does the text end and annotation begin? Alan Liu and LNRP article in JSTOR?
 - Editing
 - Also...you are probably annotating the text “ENG-GA 2971” right now.


Let’s see what literary artifacts we can dig up in the LoC Labs’ Chronicling America...

https://chroniclingamerica.loc.gov/about/api/


Step 1

Give the API a try. It is an unauthenticated, JSON-serving, HTTP request endpoint. So anyone can use their browser to send a request via a URL with the parameters contained in it.

Point your browser to the URL below, substituting an interesting search term for the word TERM below:

https://chroniclingamerica.loc.gov/search/pages/results/?proxtext=TERM&format=json

Note that this is a typical way of passing parameters in a URL. At the end of an HTTP address, a question mark is appended to mark the start of the parameters to pass; the parameters are separated by “&”. Below, we ask it to search for the word “dog” and to give us the results in JSON:

https://chroniclingamerica.loc.gov/search/pages/results/?proxtext=dog&format=json
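To see those pieces separately, we can take the URL above apart with Python's standard library. This is just an illustrative sketch of how the "?" and "&" divide the address. Nothing is sent to the server.

```python
from urllib.parse import urlsplit, parse_qs

url = "https://chroniclingamerica.loc.gov/search/pages/results/?proxtext=dog&format=json"

parts = urlsplit(url)

# Everything before the "?" is the endpoint.
endpoint = f"{parts.scheme}://{parts.netloc}{parts.path}"

# Everything after the "?" is the query string; parse_qs splits it on "&"
# and returns each parameter name mapped to a list of its values.
params = parse_qs(parts.query)

print(endpoint)
print(params)
```

Each value comes back as a list because a URL may legally repeat the same parameter name more than once.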

If you view the results in Firefox, it has a nice way of handling JSON.

Step 2

What is JSON? Let’s discuss….

What is the concept of paging….

Step 3

JSON is for robots. Let’s put it into a more readable form using Google Sheets. Visit the Sheet below and make a copy of it for you to use: https://docs.google.com/spreadsheets/d/1AiqNl54veUn1lrVl8WTqMSN6RBA5qVWPLZPOCmeXtyo/edit?usp=sharing

After opening it, click on File >> Make a Copy. This is a special premade Sheet with a built-in custom function that can pull JSON from a website and parse it as a table (thanks to this [neat little tutorial](https://medium.com/@paulgambill/how-to-import-json-data-into-google-spreadsheets-in-less-than-5-minutes-a3fede1a014a)). To use it, type this function into a cell:

=ImportJSON("https://URL-SOURCE")

Step 4

Try it out. Let’s see if we can gather all mentions of “Melville” in the first 200 results from the API. Put this in cell A1:

=ImportJSON("https://chroniclingamerica.loc.gov/search/pages/results/?proxtext=melville&format=json&rows=200")

Step 5

We can do some quick investigative work. Note that we have the date in column K, consisting of an 8-digit number whose leading 4 characters are the year. Let’s slice out that info and check the frequency of mentions over time. Create a new column at the end of the sheet called year. Enter the following formula in the second row:

=LEFT(K2, 4)

This tells it to populate this column with the first 4 characters of the value in column K. Once you see the first value appear, select the cell with the formula, scroll to the bottom of the sheet to row 201, hold down Shift, and click the last row in the same column (highlighting all cells below your initial formula cell). Then press Control + D (or Command + D) to fill the formula down.
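For comparison, the same "take the first 4 characters" step can be sketched in Python. The date values below are made up for illustration, and `year_from_date` is a hypothetical helper name.

```python
def year_from_date(date_value):
    """Return the leading 4 characters of an 8-digit date such as 19120305.

    The value is converted to text first, so it works whether the date
    arrives as a number or as a string (much like LEFT in a spreadsheet).
    """
    return str(date_value)[:4]


# Made-up dates standing in for the values in column K.
dates = ["19120305", 18990714, "18510101"]
years = [year_from_date(d) for d in dates]

print(years)
```

Note that the result is text, not a number, which is also true of the output of LEFT in the Sheet.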

Step 6

Highlight this entire year column and press Control + C (or Command + C) to copy it. At the bottom of the window, select the plus button to create a new sheet, and open that new sheet. In cell A1, click on Edit >> Paste Special and select paste values only. We now have a column of year values (not formulas).

Next, click on the column header and find the “funnel” filter button at the top. This activates quick filtering. Use it to sort the values in your column A-Z, which puts them in chronological order.

Step 7

Highlight the entire year column, and select from the main menu Insert >> Chart. This will get us a quick frequency count of the mentions of the word “Melville” over time.
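The counting behind such a chart can also be sketched in a few lines of Python. The list of years here is made up for illustration, and `mentions_per_year` is a hypothetical helper name. In practice the list would come from the year column built in the earlier steps.

```python
from collections import Counter


def mentions_per_year(years):
    """Count how often each year appears, returned in chronological order."""
    counts = Counter(years)
    return sorted(counts.items())


# Made-up year values standing in for the pasted year column.
sample_years = ["1891", "1851", "1891", "1919", "1851", "1891"]

for year, count in mentions_per_year(sample_years):
    print(year, count)
```

Keep in mind that with only the first 200 results, these counts describe the sample the API returned, not every mention in the collection.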

What can we conclude?
   