By <a href="https://nkelber.com">Nathan Kelber</a> <br />
For questions/comments/improvements, email nathan.kelber@ithaka.org.

This notebook describes how to create a dataset using the [Digital Scholar Workbench](https://tdm-pilot.org/). The dataset generated is compatible with the following notebooks:

Digital Scholar Workbench Compatible Notebooks
* [Metadata](2-metadata.ipynb)
* [Word Frequencies](3-word-frequencies.ipynb)
* [Significant Terms](4-significant-terms.ipynb)
* [Topic Modeling](5-topic-modeling.ipynb)

# Notebook Table of Contents

* [Introduction](#build-intro)
* [Search Tools](#search-tools)
* [Visualization Tools](#visualization-tools)
* [Using Your Dataset](#your-dataset)
* [Technical Details of Your Dataset](#details-dataset)
___

# Introduction 
<a name="build-intro"></a>

Pick out a set of texts to analyze. Choose from texts within JSTOR and Portico with primary and secondary sources (1700-present) spanning disciplines including:

* Agriculture
* Anthropology
* Education
* Fine Arts
* Geography
* History
* Language and Literature
* Law
* Medicine
* Music
* Philosophy
* Political Science
* Psychology
* Religion
* Science
* Social Sciences
* Technology

Design your dataset to fit your personal research interests using powerful search and visualization tools.
___

# Search Tools
<a name="search-tools"></a>

![Search and Visualization Interfaces](https://ithaka-labs.s3.amazonaws.com/static-files/images/tdm/tdmdocs/searchandvisualization.png)

**The search tools (left) and visualization tools (right) for creating a dataset.**

## Keyword

In the keyword searchbox, users may:

* Enter individual keywords separated by a space. (No commas are necessary.)
* Match exact phrases in titles using quotation marks (“prison education”). (Phrase matching is not supported for body text.)
* Use boolean operators: (prison AND education), (prison OR education), (education NOT prison).
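
As a conceptual aid, the three boolean operators can be thought of as set operations on the documents that match each keyword. The sketch below uses made-up document IDs and Python sets; it is only a mental model and does not describe how the Workbench search is actually implemented.

```python
# Hypothetical document IDs matched by each keyword (illustrative data only).
prison = {"doc1", "doc2", "doc3"}
education = {"doc2", "doc3", "doc4"}

both = prison & education            # prison AND education
either = prison | education          # prison OR education
education_only = education - prison  # education NOT prison

print(sorted(both))
print(sorted(either))
print(sorted(education_only))
```

AND narrows the results, OR widens them, and NOT removes anything that matches the second keyword.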

![Keyword User Interface](https://ithaka-labs.s3.amazonaws.com/static-files/images/tdm/tdmdocs/searchui.png)

**The Search interface contains filters for keywords, publication title(s), publication dates, language(s), discipline(s), and provider(s).**

## Publication Title(s)
Users may also filter by individual titles if they are interested in a particular journal or other data source.

## Publication Dates
Sources may be filtered by year.

## Language(s)
Choose the language(s) you would like returned in search results. The Digital Scholar Workbench supports dozens of languages.

## Discipline and Provider
![Disciplines](https://ithaka-labs.s3.amazonaws.com/static-files/images/tdm/tdmdocs/disciplines.png)

The discipline and provider areas allow users to choose what disciplines to include or exclude for their search. The disciplines are ordered by Library of Congress subject headings.

## Creating your Dataset

When the dataset reflects your intended research, click “Build.” A pop-up will prompt you for an email address so that you can be notified when the dataset is ready. (Depending on the size of your dataset, this process may take up to 30 minutes.)

![Build Button](https://ithaka-labs.s3.amazonaws.com/static-files/images/tdm/tdmdocs/buildbutton.png)

**When the dataset meets your specifications, click “Build.”** For next steps, see [Using Your Dataset](#your-dataset).

___

# Visualization Tools
<a name="visualization-tools"></a>

The search tools for creating a dataset are coupled with visualizations in order to help researchers understand what data is available and where it is coming from.

## Your Customized Dataset
This section displays the number of issues/volumes in your prospective dataset and their sources. The makeup is displayed in a pie chart.

![Visualizing a customized dataset](https://ithaka-labs.s3.amazonaws.com/static-files/images/tdm/tdmdocs/yourCustomizedDataset.png)

## Term Frequency
The term frequency graph gives a raw count of the number of times a word is mentioned per year in the current dataset. The bulk of data in JSTOR, Portico, and HathiTrust is from the 20th century to the present, so it is likely that the graph will be similar to the one below.

![The term frequency visualization](https://ithaka-labs.s3.amazonaws.com/static-files/images/tdm/tdmdocs/termFrequency.png)
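
A raw count per year is simply the sum of a word's occurrences across every document published in that year. The following is a minimal sketch of that calculation, assuming a hypothetical list of `(year, word counts)` records; the data and the function name `term_frequency_by_year` are invented for illustration and are not part of the Workbench.

```python
from collections import Counter

# Hypothetical records: (publication year, word counts for one document).
documents = [
    (1950, {"prison": 3, "education": 1}),
    (1950, {"prison": 2}),
    (1975, {"education": 4, "prison": 1}),
]

def term_frequency_by_year(docs, term):
    """Sum the raw occurrences of `term` for each publication year."""
    totals = Counter()
    for year, counts in docs:
        totals[year] += counts.get(term, 0)
    return dict(sorted(totals.items()))

prison_by_year = term_frequency_by_year(documents, "prison")
print(prison_by_year)
```

Because these are raw counts rather than proportions, years with more published material will tend to show higher values.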

## Publication Dates
The Publication Dates graph gives a raw count of the number of publications from each year of the dataset. The bulk of data in JSTOR, Portico, and HathiTrust is from the 20th century to the present, so it is likely that the graph will be similar to the one below.

![The publication dates visualization](https://ithaka-labs.s3.amazonaws.com/static-files/images/tdm/tdmdocs/publicationDates.png)

## Discipline Treemap
The discipline tree map displays the makeup of the dataset by discipline. Hovering over a particular discipline will indicate the percentage of the dataset represented.

![The discipline treemap visualization](https://ithaka-labs.s3.amazonaws.com/static-files/images/tdm/tdmdocs/disciplineTreemap.png)

**When the dataset meets your specifications, click “Build.”**
___

# Using Your Dataset
<a name="your-dataset"></a>

## Building the Dataset
![The visualization of dataset processing](https://ithaka-labs.s3.amazonaws.com/static-files/images/tdm/tdmdocs/processingDataset.png)

Depending on the size and complexity of your dataset, the build process can take from 5 to 30 minutes. If you prefer not to wait, enter your email address and you will be contacted automatically as soon as your dataset is ready.

![The email prompt for a dataset](https://ithaka-labs.s3.amazonaws.com/static-files/images/tdm/tdmdocs/emailPrompt.png)

The resulting email will contain:
* A link to a page that summarizes your dataset and allows you to explore it
* Your dataset ID that can be copied into specialized Jupyter Notebooks for analysis
* A download link for your dataset

![Email received after processing](https://ithaka-labs.s3.amazonaws.com/static-files/images/tdm/tdmdocs/emailWorksetID.png)

**Your dataset ID can be used in any of the following Digital Scholar Workbench Notebooks:**

* [Metadata](2-metadata.ipynb)
* [Word Frequencies](3-word-frequencies.ipynb)
* [Significant Terms](4-significant-terms.ipynb)
* [Topic Modeling](5-topic-modeling.ipynb)

___

# Technical Details of Your Dataset
<a name="details-dataset"></a>

The following section contains technical details that describe the format of Digital Scholar Workbench datasets. Reading and understanding this information is not required for doing text analysis with the Digital Scholar Workbench Notebooks, but it may be useful for advanced users who are curious.

## Format

The JSTOR & Portico text mining platform delivers to researchers a non-consumptive “bag-of-words,” formally referred to as extracted features. Each journal issue or book in the researcher’s requested dataset is represented by a single JSON-LD file that contains 1) bibliographic metadata for the articles and chapters within the journal issues and books, 2) the unique set of words on each page, 3) the part of speech of each word, and 4) the number of times the word occurs on the page. In the future, we may choose to expand extracted features with additional derived data (for example, named entity recognition to identify the proper names in the document). The JSTOR & Portico extracted features will never contain enough information to recreate the full text of the original content; they will always be non-consumptive.
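
To see why a bag-of-words cannot be turned back into the original text, consider the small sketch below. It uses Python's `collections.Counter` on two invented sentences (not taken from any dataset) that contain the same words in a different order.

```python
from collections import Counter

# Two invented "pages" with the same words in a different order.
page_a = "the dog bit the man".split()
page_b = "the man bit the dog".split()

# A bag-of-words keeps each word and its count, but discards word order.
bag_a = Counter(page_a)
bag_b = Counter(page_b)

print(bag_a)
print(bag_a == bag_b)
```

Both pages produce identical counts, so the counts alone cannot tell us which sentence was originally written.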

Some content providers do not have the article-level and chapter-level detail that JSTOR and Portico have. To remain compatible with them, the extracted features include bibliographic metadata at the issue or book level in addition to bibliographic metadata for the individual articles and book chapters. Researchers can therefore run analytics on the article-level and chapter-level metadata wherever it is available.

As can be seen below, the metadata section of the sample extracted features file starts with information about the journal issue:
![JSON Journal Issue code](https://ithaka-labs.s3.amazonaws.com/static-files/images/tdm/tdmdocs/JSON1.png)

The extracted features file references both the journal of which this issue is a part and each of the articles within it. Below is an example of how the file refers to the parent journal of the issue it represents:

![JSON Journal Issue code](https://ithaka-labs.s3.amazonaws.com/static-files/images/tdm/tdmdocs/JSON2.png)

And below is an example of the bibliographic metadata of an article published within this issue:

![JSON Journal Issue code](https://ithaka-labs.s3.amazonaws.com/static-files/images/tdm/tdmdocs/JSON3.png)

Within the extracted features JSON-LD file, below all the bibliographic metadata, is a list of pages. For each page, the file identifies the unique set of words (more accurately, tokens), along with each word’s part of speech and frequency. Below is a snippet of the information about a page.

![JSON Journal Issue code](https://ithaka-labs.s3.amazonaws.com/static-files/images/tdm/tdmdocs/JSON4.png)
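
Page-level counts like these can be added together to get word frequencies for a whole document. The sketch below assumes a simplified, hypothetical page structure in which each page maps a token to its part-of-speech counts; the key names (`seq`, `tokenPosCount`) and the sample values are assumptions for illustration and should be checked against an actual extracted features file.

```python
from collections import Counter

# Hypothetical, simplified pages; key names are assumptions, not the real schema.
pages = [
    {"seq": 1, "tokenPosCount": {"prison": {"NN": 2}, "reform": {"NN": 1, "VB": 1}}},
    {"seq": 2, "tokenPosCount": {"prison": {"NN": 1}, "the": {"DT": 4}}},
]

def document_word_counts(pages):
    """Add up per-page token counts, ignoring part of speech."""
    totals = Counter()
    for page in pages:
        for token, pos_counts in page["tokenPosCount"].items():
            totals[token] += sum(pos_counts.values())
    return totals

word_counts = document_word_counts(pages)
print(word_counts.most_common(2))
```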

We worked closely with HathiTrust to develop the extracted features JSON-LD data format, which builds upon their previous work. Our format is compatible with HathiTrust extracted features 2.0 (expected to be released in 2020). For those technically minded, the JSON-LD context document can be found at https://worksets.htrc.illinois.edu/context/ef_context.json.

HathiTrust, Portico, and JSTOR all provide volume level metadata in the extracted features. Portico and JSTOR also provide article level metadata.

Our JSON files include:
* Bibliographic metadata
* Page level details including:
  * Unigrams on each page
  * The frequency of those unigrams on the page
  * The part-of-speech of those unigrams
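
Because each unigram carries a part-of-speech tag, counts can be restricted to one word class. The following is a rough sketch that keeps only nouns, assuming Penn Treebank-style tags (where noun tags begin with `NN`) and a hypothetical page record; the key name `tokenPosCount`, the tags, and the helper `counts_for_tag_prefix` are assumptions for illustration rather than a description of the actual files.

```python
# Hypothetical page record; structure and tag names are assumptions.
page = {
    "tokenPosCount": {
        "prison": {"NN": 2},
        "reform": {"NN": 1, "VB": 1},
        "quickly": {"RB": 1},
    }
}

def counts_for_tag_prefix(page, prefix):
    """Return {token: count} for tags that start with `prefix` (e.g. "NN")."""
    result = {}
    for token, pos_counts in page["tokenPosCount"].items():
        n = sum(c for tag, c in pos_counts.items() if tag.startswith(prefix))
        if n:  # skip tokens that never occur with a matching tag
            result[token] = n
    return result

nouns = counts_for_tag_prefix(page, "NN")
print(nouns)
```

Here "reform" is counted once, because only its noun usage matches, and "quickly" is dropped entirely.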

HathiTrust, JSTOR, and Portico all used the Apache OpenNLP natural language processor to tokenize, count, and determine part-of-speech. The source code can be found at: https://github.com/htrc/JSTOR-FeatureExtractor. Researchers have made it clear that they prefer all extracted features in the service to be built using the same tool set.

We anticipate adding additional “features” to our JSON files (for example, named entity recognition) in the future.