By <a href="https://nkelber.com">Nathan Kelber</a> <br />
For questions/comments/improvements, email nathan.kelber@ithaka.org.

This notebook describes how to create a dataset using the [Digital Scholar Workbench](https://tdm-pilot.org/). The dataset generated is compatible with the following notebooks:

Digital Scholar Workbench Compatible Notebooks
* [Metadata](2-metadata.ipynb)
* [Word Frequencies](3-word-frequencies.ipynb)
* [Significant Terms](4-significant-terms.ipynb)
* [Topic Modeling](5-topic-modeling.ipynb)

# Notebook Table of Contents

* [Introduction](#build-intro)
* [Search Tools](#search-tools)
* [Visualization Tools](#visualization-tools)
* [Building Your Dataset](#building-dataset)
* [Technical Details of Your Dataset](#details-dataset)
___

# Introduction 
<a name="build-intro"></a>

Pick out a set of texts to analyze. Choose from texts within JSTOR and Portico with primary and secondary sources (1700-present) spanning disciplines including:

* Agriculture
* Anthropology
* Education
* Fine Arts
* Geography
* History
* Language and Literature
* Law
* Medicine
* Music
* Philosophy
* Political Science
* Psychology
* Religion
* Science
* Social Sciences
* Technology

Design your dataset to fit your personal research interests using powerful search and visualization tools.
___

# Search Tools
<a name="search-tools"></a>

![Search and Visualization Interfaces](https://ithaka-labs.s3.amazonaws.com/static-files/images/tdm/tdmdocs/searchandvisualization.png)

**The search tools (left) and visualization tools (right) for creating a dataset.**

## Keyword

In the keyword searchbox, users may:

* Enter individual keywords separated by a space. (No commas are necessary.)
* Match exact phrases using quotation marks (“prison education”) for titles only. (Phrase-matching is not supported for the body text.)
* Use boolean operators (prison AND education) (prison OR education) (education NOT prison). 

![Keyword User Interface](https://ithaka-labs.s3.amazonaws.com/static-files/images/tdm/tdmdocs/searchui.png)

**The Search interface contains filters for keywords, publication title(s), publication dates, language(s), discipline(s), and provider(s).**

## Publication Title(s)
Users may additionally filter by individual titles if they are interested in a particular journal or other data source.

## Publication Dates
Sources may be filtered by year.

## Language(s)
Choose the language(s) you would like returned in search results. The Digital Scholar Workbench supports dozens of languages.

## Discipline and Provider
![Disciplines](https://ithaka-labs.s3.amazonaws.com/static-files/images/tdm/tdmdocs/disciplines.png)

The discipline and provider areas allow users to choose what disciplines to include or exclude for their search. The disciplines are ordered by Library of Congress subject headings.

## Creating your Dataset

When the dataset reflects your intended research, click “Build.” A pop-up will prompt you for an email address to notify you when the dataset is ready. (Depending on the size of your dataset, this process may take 15-30 minutes.) 

![Build Button](https://ithaka-labs.s3.amazonaws.com/static-files/images/tdm/tdmdocs/buildbutton.png)

**When the dataset meets your specifications, click "build".** For next steps, see [Building the Dataset](#building-dataset).

___

# Visualization Tools
<a name="visualization-tools"></a>

The search tools for creating a dataset are coupled with visualizations in order to help researchers understand what data is available and where it is coming from.

## Your Customized Dataset
This section displays the number of issues/volumes in your prospective dataset and their sources. The makeup is displayed in a pie chart.

![Visualizing a customized dataset](https://ithaka-labs.s3.amazonaws.com/static-files/images/tdm/tdmdocs/yourCustomizedDataset.png)

## Term Frequency
The term frequency graph gives a raw count of the number of times a word is mentioned per year in the current dataset. The bulk of data in JSTOR, Portico, and HathiTrust is from the 20th century to the present, so it is likely that the graph will be similar to the one below.

![The term frequency visualization](https://ithaka-labs.s3.amazonaws.com/static-files/images/tdm/tdmdocs/termFrequency.png)

## Publication Dates
The Publication Dates graph gives a raw count of the number of publications from each year of the dataset. The bulk of data in JSTOR, Portico, and HathiTrust is from the 20th century to the present, so it is likely that the graph will be similar to the one below.

![The publication dates visualization](https://ithaka-labs.s3.amazonaws.com/static-files/images/tdm/tdmdocs/publicationDates.png)

## Discipline Treemap
The discipline tree map displays the makeup of the dataset by discipline. Hovering over a particular discipline will indicate the percentage of the dataset represented.

![The discipline treemap visualization](https://ithaka-labs.s3.amazonaws.com/static-files/images/tdm/tdmdocs/disciplineTreemap.png)

**When the dataset meets your specifications, click "build".**
___



# Building the Dataset 
<a name="building-dataset"></a>

![The visualization of dataset processing](https://ithaka-labs.s3.amazonaws.com/static-files/images/tdm/tdmdocs/processingDataset.png)

Depending on the size and complexity of your dataset, the build process can take from 5-30 minutes. If you prefer not to wait, enter your email address and you will be contacted automatically as soon as your dataset is ready.

![The email prompt for a dataset](https://ithaka-labs.s3.amazonaws.com/static-files/images/tdm/tdmdocs/emailPrompt.png)

The resulting email will contain:
* A link to a page that summarizes your dataset and allows you to explore it
* Your dataset ID that can be copied into specialized Jupyter Notebooks for analysis
* A download link for your dataset

![Email received after processing](https://ithaka-labs.s3.amazonaws.com/static-files/images/tdm/tdmdocs/emailWorksetID.png)

**Your dataset ID can be used in any of the following Digital Scholar Workbench Notebooks:**

* [Metadata](2-metadata.ipynb)
* [Word Frequencies](3-word-frequencies.ipynb)
* [Significant Terms](4-significant-terms.ipynb)
* [Topic Modeling](5-topic-modeling.ipynb)

___

# What's in my dataset?
<a name="details-dataset"></a>
The JSTOR & Portico corpus builder creates datasets that are [non-consumptive](./key-terms.ipynb#non-consumptive) [bags of words](./key-terms.ipynb#bag-of-words). Each [dataset](./key-terms.ipynb#dataset) is a single file that contains all of the information for each document on a single line. This information includes:
* [Bibliographic Metadata](./key-terms.ipynb#metadata)
    * An id containing a stable JSTOR URL for the article or book
    * The title of the book or journal article
    * The title of the journal
    * The author(s)
    * The type of publication
    * The publication date
    * The publisher
* Document and Word Counts
    * Total word count
    * Unique words and their frequencies
    * Page numbers
    
We are still refining the structure of the dataset file. We anticipate adding additional “features” (such as named entity recognition) in the future. Please reach out to Ted Lawless <Ted.Lawless@ithaka.org> if you have comments or suggestions.

# The data format and structure

Each dataset is represented by a single [JSON Lines file](./key-terms.ipynb#jsonl) (file extension ".jsonl"). The data for each document in the [corpus](./key-terms.ipynb#corpus) is written on a single line. (If there are 1,245 documents in the corpus, the file will be 1,245 lines long.) Each line contains a list of key/value pairs that map a **key** concept to a matching **value**. The basic structure looks like

> "Key": Value

![View of the top of a sample file](https://ithaka-labs.s3.amazonaws.com/static-files/images/tdm/tdmdocs/headSectionOfJSONL.png)

*This JSON Lines file has been broken down into a nested hierarchy on separate lines using [JSON Editor Online](https://jsoneditoronline.org/). This visualization makes it much easier for human readers to see the key/value pairs that make up a portion of the textual metadata information. The original JSONL is very difficult for human readers to read, but makes it easy to add or subtract individual texts by adding or removing a single line at a time.*
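
The one-document-per-line structure can be sketched in Python with the standard `json` module. This is a minimal illustration rather than the Workbench's own loading code: the two sample lines are invented, the second title and author are hypothetical, and only the `title` and `creators` keys come from the example described in this notebook.

```python
import io
import json

# Two made-up lines standing in for a real ".jsonl" dataset file.
# Each line is one complete JSON object describing one document.
sample_jsonl = io.StringIO(
    '{"title": "Shakespeare and the Middling Sort", "creators": ["Theodore B. Leinwand"]}\n'
    '{"title": "A Hypothetical Second Article", "creators": ["A. N. Author"]}\n'
)


def read_jsonl(lines):
    """Parse an iterable of JSON Lines text into a list of dictionaries."""
    documents = []
    for line in lines:
        line = line.strip()
        if line:  # skip any blank lines
            documents.append(json.loads(line))
    return documents


documents = read_jsonl(sample_jsonl)
print(len(documents))         # the number of lines equals the number of documents
print(documents[0]["title"])  # look up a value by its key
```

For a downloaded dataset, the same function should work on an open file object, for example `open("my-dataset.jsonl", encoding="utf-8")`, where the filename is a placeholder.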

In the sample file pictured above, we can see a portion of the metadata for the text. Here are a few items of interest:

* The title is "Shakespeare and the Middling Sort" ("title": "Shakespeare and the Middling Sort")
* The author is "Theodore B. Leinwand" ("creators": ["Theodore B. Leinwand"])
* The text is a journal article ("doctypeType": "article")
* The journal is *Shakespeare Quarterly* ("isPartOf": "Shakespeare Quarterly")
* Identifiers such as ISSN, OCLC, and DOI
* PageCount and WordCount
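
Once a line has been parsed, the items listed above can be read like entries in a Python dictionary. The sketch below assumes a record containing only the `title`, `creators`, and `isPartOf` keys quoted in the list; the `describe` helper is hypothetical and exists only for illustration.

```python
# A hand-written record using keys quoted in the list above.
record = {
    "title": "Shakespeare and the Middling Sort",
    "creators": ["Theodore B. Leinwand"],
    "isPartOf": "Shakespeare Quarterly",
}


def describe(record):
    """Return a one-line, citation-style summary of a document record."""
    title = record.get("title", "Untitled")
    # "creators" holds a list, so multiple authors are joined with commas.
    authors = ", ".join(record.get("creators", [])) or "Unknown author"
    journal = record.get("isPartOf", "Unknown source")
    return f'{authors}, "{title}," {journal}'


summary = describe(record)
print(summary)
```

Using `.get()` with a default avoids an error if a particular document is missing one of these keys.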

If you look closely, you'll discover additional metadata such as the publication date, DOI, page numbers, ISSN, and more. The frequency of each word in the text is found within the "unigramCount" section. In this context, the word "[unigram](./key-terms.ipynb#unigram)" describes a single word construction like the word "chicken." There are also [bigrams](./key-terms.ipynb#bigram) (e.g. "chicken stock"), [trigrams](./key-terms.ipynb#trigram) ("homemade chicken stock"), and [n-grams](./key-terms.ipynb#n-gram) of any length. At the present time, our tools are only counting unigrams.

![The JSONL file section that lists unigrams](https://ithaka-labs.s3.amazonaws.com/static-files/images/tdm/tdmdocs/unigramCountFromJSONL.png)

*The start of the section of the JSONL file that lists the [unigrams](./key-terms.ipynb#unigram) for the text*
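
Because "unigramCount" maps each word to its frequency, it can be treated as a Python dictionary of counts. The following sketch uses invented counts (not taken from any real document) to show how the most frequent unigrams and the total number of counted tokens might be found with `collections.Counter`.

```python
from collections import Counter

# Invented frequencies standing in for one document's "unigramCount" section.
unigram_count = {"the": 120, "chicken": 7, "stock": 3, "12": 4}

counts = Counter(unigram_count)

# The two most frequent unigrams, as (word, frequency) pairs.
top_two = counts.most_common(2)

# The total number of tokens counted in this document.
total = sum(counts.values())

print(top_two)
print(total)
```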

Notice that the beginning of the "unigramCount" section mostly contains numbers (represented as [strings](./key-terms.ipynb#string)). Each text is a raw representation of what is on the published page, so it seems likely that these numbers are in fact page numbers. If these numbers are not useful for your analysis, they can be filtered out with [stopwords](./key-terms.ipynb#stop-words). JSTOR and Portico do not pre-filter any words or numbers from corpora.
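
One way this kind of filtering might look is sketched below. The counts and the three-word stopword set are invented for illustration, and `filter_unigrams` is a hypothetical helper, not part of the Workbench. A real analysis would normally use a much longer stopword list.

```python
# Invented frequencies, including number strings like those described above.
unigram_count = {"12": 4, "13": 4, "the": 120, "tiger": 5, "1600s": 1}

# A deliberately tiny stopword set; real lists are far longer.
stop_words = {"the", "and", "of"}


def filter_unigrams(unigram_count, stop_words):
    """Drop purely numeric tokens and stopwords, keeping everything else."""
    filtered = {}
    for token, count in unigram_count.items():
        if token.isdigit():  # "12" is dropped, "1600s" is kept
            continue
        if token.lower() in stop_words:
            continue
        filtered[token] = count
    return filtered


filtered = filter_unigrams(unigram_count, stop_words)
print(filtered)
```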

![JSON section showing words](https://ithaka-labs.s3.amazonaws.com/static-files/images/tdm/tdmdocs/unigramCountWords.png)
*On each line, a **key** on the left is matched to a **value** representing its frequency on the right*

Each word is treated as a [string](./key-terms.ipynb#string). Python strings are case-sensitive, which means that "Tiger" is a different string than "tiger". Counting all the occurrences of the word "tiger" would therefore require combining the counts for the two strings. These methods are covered in later notebooks.
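
As a brief preview, combining the counts can be sketched by lowercasing every key and adding the frequencies together. The counts here are invented, and the later notebooks may take a different approach.

```python
from collections import Counter

# Invented frequencies where the same word appears with different capitalization.
unigram_count = {"Tiger": 2, "tiger": 5, "TIGER": 1, "forest": 3}

merged = Counter()
for token, count in unigram_count.items():
    # Lowercasing the key folds "Tiger", "tiger", and "TIGER" into one entry.
    merged[token.lower()] += count

print(merged["tiger"])
print(merged["forest"])
```

Lowercasing treats proper nouns and common nouns alike, so whether it is appropriate depends on the research question.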

# Can I analyze an individual chapter of a book?

This question mainly applies to Portico content where we do have metadata at the level of book chapters. We hope to support the analysis of individual chapters in the future. (We do have the metadata to implement this feature, so it is mostly a technical challenge at this point.)

# What about open content in JSTOR?

Open content is currently served as [bags of words](./key-terms.ipynb#bag-of-words), but we are planning to supply full-text. We understand that many text analysis methods require full-text, and we plan to share full-text documents to the greatest extent we are able to. One of the future sources for this content will be the [HathiTrust Digital Library](https://www.hathitrust.org/). 

# How is the text sourced?

Most of the JSTOR content is produced through [optical character recognition (OCR)](./key-terms.ipynb#ocr), which means there are gaps and errors in the content. We are considering methods for assessing the accuracy of the OCR.

Some of the Portico content is sourced from the original [XML](./key-terms.ipynb#XML) (the highest reasonable level of accuracy). 

# What's the difference between your data format and the Extracted Features format used by HathiTrust?

We worked closely with HathiTrust to try to develop a shared data format. Ultimately, we decided to improve the [HathiTrust Extracted Features format](https://worksets.htrc.illinois.edu/context/ef_context.json) to support key features our users sought (for example, the ability to analyze texts at the level of individual journal articles instead of at the issue-level). We are excited by the volume of content found within HathiTrust and plan to include more material from them in the future.

 



