By <a href="https://nkelber.com">Nathan Kelber</a> <br />
For questions/comments/improvements, email nathan.kelber@ithaka.org.
___

# Table of Contents

* [About *JSTOR TDM*](#about)
* [Why learn text mining?](#why-learn)
* Text Mining Methods by Scholarly Question
  * What's it about?
    * [Word Frequency](#wf-method)
    * Collocation
    * TF-IDF
    * Topic Analysis
  * How are they connected?
    * Concordance
    * Network Analysis
  * How does it feel?
    * Sentiment Analysis
  * What names are in here?
    * Named Entity Recognition
  * How are they similar?
    * Authorship Attribution
    * Clustering
    * Supervised Machine Learning
* [Why should I learn Python? (And how much?)](#intro-to-python)
* [What is a Jupyter Notebook?](#intro-to-jupyter)
___

# About JSTOR TDM 
<a id ="about"></a>


![JSTOR and Portico Logos](https://ithaka-labs.s3.amazonaws.com/static-files/images/tdm/tdmdocs/JSTORandPorticoLogo.png)

Text mining, or the process of deriving new information from pattern and trend analysis of the written word, has the potential to revolutionize research across disciplines, but there is a massive hurdle facing those eager to unleash its power: the coding skills and statistical knowledge it requires can take years to develop.   All too often, researchers are shown the promise of text mining, but then told that promise can only be realized by the select few with the necessary technical skills.  Ted Underwood, Professor of English at the University of Illinois, analogizes this challenge by saying that analytics techniques are a “deceptively gentle welcome mat, followed by a trapdoor.”   

JSTOR and Portico are addressing this problem by building a text and data mining platform aimed at teaching and enabling a generation of researchers to text mine.  The JSTOR & Portico text mining platform includes a user interface to allow researchers, students, and instructors to curate, visualize, and save custom datasets.  Researchers may download the extracted features of their curated datasets -- a non-consumptive “bag-of-words” where each journal issue or book in the custom dataset is represented with bibliographic metadata for the articles and chapters, the unique set of words on each page, the part of speech of each word, and the number of times the word occurs on the page. The platform includes a teaching and development environment (a Jupyter Hub) which will be populated with easy-to-use code tutorials and templates where new text miners can analyze their custom datasets and learn to modify the Python or R code to better suit their own research purposes.  Researchers may download and locally hold the extracted features of any content and the full-text of open content, while the full-text of rights restricted content will be available for analysis in a secure computing environment.

The content in the text mining platform will include, at a minimum, all of JSTOR and the content from those Portico publishers who choose to participate (currently, 30 publishers including John Wiley & Sons, Inc., Project Muse, Thieme Publishing Group, and Hindawi). In addition, we are in discussions with third-party content providers about contributing content, and the service will include the ability for researchers to upload their own content for analysis.

The JSTOR & Portico text mining service will provide both free tools and tools accessible exclusively to institutional participants.

We are working with a set of ten reference institutions from late 2019 through 2020 to identify and build all of the necessary features, with an aim to release the service in 2021.

More information, including access to the prototype, may be found at: https://tdm-pilot.org.

Any questions, discussion items, or requests for a demonstration may be sent to Amy Kirchhoff, Text and Data Mining Business Manager (amy.kirchhoff@ithaka.org).



# Why learn text mining?
<a id="why-learn"></a>

You may have heard buzz on campus about text analysis, artificial intelligence, or big data. But if you’re a humanities scholar, librarian, or someone who has never used data in their research, it is not obvious why text analysis matters. Many disciplines have eschewed text analysis for decades (and we all know a few faculty members likely to continue for decades more), but there is no doubt that interest in the field is growing. Why should you learn text mining when you could be writing your next article?

For researchers, the primary advantage that text and data mining offer is an ability to consider knowledge at non-human scales (both very big and very small). Text analysis can enable us to consider a million books across a thousand-dimensional space, revealing aspects of our records that are not obvious to human readers whether those aspects are imperceptibly small, diffused across centuries, or simply within records never read. What does that mean in practice though? The short answer is more evidence (and more kinds of evidence) for interrogating humanities problems. In ["Searching for the Victorians"](http://dancohen.org/2010/10/04/searching-for-the-victorians/) (2010), Dan Cohen asks, 'how much evidence is enough?': 

>Many humanities scholars have been satisfied, perhaps unconsciously, with the use of a limited number of cases or examples to prove a thesis. Shouldn’t we ask, like the Victorians, what can we do to be most certain about a theory or interpretation? If we use intuition based on close reading, for instance, is that enough?
>
>Should we be worrying that our scholarship might be anecdotally correct but comprehensively wrong? Is 1 or 10 or 100 or 1000 books an adequate sample to know the Victorians? What we might do with all of Victorian literature—not a sample, or a few canonical texts, as in Houghton’s work, but all of it.

To operate as a researcher in the 21st century is to be confronted with the challenges and opportunities of data: at once overwhelmed by too much of it and yet left with not nearly enough of *the right kind*. As Miriam Posner has pointed out, "...even if they don’t call their sources data, traditional humanists do have pretty pressing data-management needs" (["Humanities Data: A Necessary Contradiction"](https://miriamposner.com/blog/humanities-data-a-necessary-contradiction/) 2015). Tom Scheinfeldt suggests that data concerns are becoming the primary concern of the humanities:

>The new technology of the Internet has shifted the work of a rapidly growing number of scholars away from thinking big thoughts to forging new tools, methods, materials, techniques, and modes or work that will enable us to harness the still unwieldy, but obviously game-changing, information technologies now sitting on our desktops and in our pockets. These concerns touch all scholars. ["Sunset for Ideology, Sunrise for Methodology?"](http://foundhistory.org/2008/03/sunset-for-ideology-sunrise-for-methodology/) (2008)

Indeed, humanists cannot afford to ignore computational methods since they are, for better or worse, at the heart of modern culture and industry. Future humanists will not be able to study our digital present without becoming adept at reading and manipulating the burgeoning data of our historical record. Ted Underwood describes this new horizon:

>It is becoming clear that we have narrated literary history as a sequence of discrete movements and periods because chunks of that size are about as much of the past as a single person could remember and discuss at one time. Apparently, longer arcs of change have been hidden from us by their sheer scale—just as you can drive across a continent noticing mountains and political boundaries but never the curvature of the earth. A single pair of eyes at ground level can't grasp the curve of the horizon, and arguments limited by a single reader's memory can't reveal the largest patterns organizing literary history. <br /> [*Distant Horizons: Digital Evidence and Literary Change*](https://www.press.uchicago.edu/ucp/books/book/chicago/D/bo35853783.html) (2019)

For many scholars, text analysis sounds *potentially* powerful and useful, but the reality remains that learning text analysis is not a trivial task. Most humanities coursework does not prepare students to work with data. The good news is that text analysis, like any skill, can be learned to a greater or lesser degree. For historians to study the early modern period, it is very helpful to have a command of Latin. Still, there are plenty of successful early modern scholars who never learn the language (or learn only enough to navigate the resources significant to their research).

Depending on your research question, *you may not need to learn any coding* to do text analysis. The problem for many scholars is that the possible applications for text analysis are not clear, so they are not in a good position to decide what to learn (and how much). At the same time, the sophistication needed for doing text analysis is a moving target. Topic modeling was once a very complicated task, requiring an understanding of the command line. Today, it can be accomplished in minutes using just a mouse. 

This notebook is intended to help researchers get started with text analysis by addressing the fundamental opportunity-cost questions for doing this kind of research:

* How can text analysis improve scholarship, and how has it already done so?
* What can I do with the knowledge I already have?
* What method(s) could I learn quickly to advance my research?
* What tools and resources are available to help?

In this introduction, we explain the various kinds of text analysis for a general scholarly audience. What are they? Why would you use them? How long will it take to apply them? (The methods presented here are among the most well-known but certainly not exhaustive.) Afterward, you'll be prepared to decide how much or how little text analysis may be useful to your research. As you read about these methods, it will be helpful to keep in mind the current, intractable problems that face your field. Could you use one of these methods to address them? Along the way, we will reference recent scholarly arguments as examples and models. 

## What are these texts about?
* **Word Frequency** *Little or no coding required* <br />
Counting the frequency of a word in any given text. This includes Bag of Words and TF-IDF. **Example:** "Which of these texts focus on women?"

* **Collocation** *Little or no coding required* <br />
Examining where words occur close to one another. **Example:** "Where are women mentioned in relation to home ownership?"

* **Topic Analysis (or Topic Modeling)** *Little or no coding required* <br />
Discovering the topics within a group of texts. **Example:** "What are the most frequent topics discussed in this newspaper?"

* **TF-IDF** *Little or no coding required* <br />
Finding the significant words within a text. **Example:** "What language is most significant within 1970s political speech?"

## How are these texts connected?
* **Concordance** *Little or no coding required* <br />
Where is this word or phrase used in these documents? **Example:** "Which journal articles mention Maya Angelou's phrase, 'If you're for the right thing, then you do it without thinking'?"
* **Network Analysis** *Moderate coding required* <br />
How are the authors of these texts connected? **Example:** "What local communities formed around civil rights in 1963?"

## What emotions (or affects) are found within these texts?
* **Sentiment Analysis** *Moderate coding required* <br />
Does the author use positive or negative language? **Example:** "How do presidents describe gun control?"

## What names are used in these texts?
* **Named Entity Recognition** *Moderate coding required* <br />
List every example of a kind of entity from these texts. **Example:** "What are all of the geographic locations mentioned by Tolstoy?"

## Which of these texts are most similar?

* **Authorship Attribution** *Moderate coding required* <br />
Find the author of an anonymous document. **Example:** "Who wrote The Federalist Papers?"
* **Clustering** *Moderate coding required* <br />
Which texts are the most similar? **Example:** "Is this play closer to comedy or tragedy?"
* **Supervised Machine Learning** *Moderate coding required* <br />
Are there other texts similar to this? **Example:** "Are there other Jim Crow laws like these we have already identified?"

The next section will examine each method in greater depth.
___

Cohen, Dan. ["Searching for the Victorians"](http://dancohen.org/2010/10/04/searching-for-the-victorians/). (2010).

Posner, Miriam. ["Humanities Data: A Necessary Contradiction"](https://miriamposner.com/blog/humanities-data-a-necessary-contradiction/). (2015).

Scheinfeldt, Tom. ["Sunset for Ideology, Sunrise for Methodology?"](http://foundhistory.org/2008/03/sunset-for-ideology-sunrise-for-methodology/). (2008).

Underwood, Ted. [*Distant Horizons: Digital Evidence and Literary Change*](https://www.press.uchicago.edu/ucp/books/book/chicago/D/bo35853783.html). (2019).

# Text Mining Methods by Scholarly Question

  * What's it about?
    * Word Frequency
    * Collocation
    * Topic Analysis
    * TF-IDF
  * How are they connected?
    * Concordance
    * Network Analysis
  * How does it feel?
    * Sentiment Analysis
  * What names are in here?
    * Named Entity Recognition
  * How are they similar?
    * Authorship Attribution
    * Clustering
    * Supervised Machine Learning


## What's it about?

The most common reason to use text analysis is to summarize and/or describe the content of a collection of texts, also known as a [corpus](./key-terms.ipynb#corpus). The following methods can help researchers by:

* Giving a broad overview of topics, themes, and language
* Locating and ranking the significance of particular words and phrases
* Discovering language in each document surrounding particular topics

These methods are useful for getting a quick, high-level view of a large variety of materials. Assuming your dataset is ready to analyze (like those created by [JSTOR's Digital Scholar Workbench](https://tdm-pilot.org)), these methods can be executed easily with text analysis software and require little or no coding expertise. 



### Word Frequency
<a id ="wf-method"></a>

The [Word Frequency](./key-terms.ipynb#word-frequency) method counts the number of occurrences of individual words within a particular text. Each document is described as a set of words and their counts. [Word Frequency](./key-terms.ipynb#word-frequency) uses a [bag of words](./key-terms.ipynb#bag-of-words) model where the order of words is not significant. Just as the letters of a Scrabble game are tossed into a bag without order, word frequency merely records the number of occurrences with no regard to where a particular word occurs within a document. 
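
As a rough illustration of the bag of words idea, the following Python sketch counts the words in a short sample sentence. The function name `word_frequencies`, the simple tokenizing rule (lowercase runs of letters and apostrophes), and the sample sentence are assumptions made for this example; real projects often use more careful tokenization.

```python
import re
from collections import Counter

def word_frequencies(text):
    """Return a Counter mapping each lowercased word to its number of occurrences."""
    # Lowercase the text and pull out runs of letters and apostrophes.
    # The position of each word is thrown away, as in a bag of words model.
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words)

# Hypothetical sample text chosen only for illustration.
sample = "To be, or not to be, that is the question."
counts = word_frequencies(sample)

# Show the three most frequent words and their counts.
print(counts.most_common(3))
```

The resulting `Counter` behaves like a dictionary, so `counts["be"]` looks up the count for a single word.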

For example, the ten most common words in Shakespeare's Hamlet can be represented in the following table:

|  Word  | Count|
| -------|:----:|
| the    | 1148 |
| and    | 970  | 
| to     | 764  |  
| of     | 671  |
| i      | 573  |
| a      | 550  |
| you    | 550  |
| my     | 514  |
| hamlet | 485  |
| in     | 437  |

To represent the whole text, our table would have to include all of the 4,728 unique words in the play. While these counts can be useful, the above list contains mostly [function words](./key-terms.ipynb#function-words), common words with little lexical meaning like articles, prepositions, and conjunctions. We can remove the [function words](./key-terms.ipynb#function-words) from our analysis using a [stop words](./key-terms.ipynb#stop-words) list. This is a list of common words we would like to filter out of our analysis. If we filter out common [function words](./key-terms.ipynb#function-words) in English, the result is:

|   Word  | Count|
| --------|:----:|
| hamlet  | 485  |
| lord    | 313  | 
| king    | 199  |  
| horatio | 159  |
| polonius| 124  |
| claudius| 122  |
| queen   | 121  |
| shall   | 114  |
| good    | 109  |
| come    | 106  |

Our new list contains more [content words](./key-terms.ipynb#content-words), yet it is dominated by character names. The reason for this is that a play-text contains speech headings before each line. The word "hamlet" is the most common word in the play because it is counted every time Hamlet has a speaking line. This may be useful information if we are interested in determining who has the most lines in the play. If that is not our goal, we can filter out character names by adding them to our [stop words](./key-terms.ipynb#stop-words) list. Then we get:

|  Word |Count|
| ------|:---:|
| good  | 109 |
| come  | 106 | 
| let   | 95  |  
| like  | 85  |
| sir   | 75  |
| know  | 74  |
| enter | 72  |
| love  | 68  |
| speak | 63  |
| make  | 56  |

Far from a truly objective viewpoint, text analysis is often about refining and tailoring your analysis. Refining a [stop words](./key-terms.ipynb#stop-words) list is an important part of obtaining useful results using [word frequencies](./key-terms.ipynb#word-frequency).  The researcher must decide whether the filtering is adequate and appropriate. The answer always depends on the context of the argument. Is the above table appropriately filtered? The high frequency of the word "enter" is likely from stage directions instead of speaking lines. That may or may not be appropriate depending on the argument at hand.
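
The two filtering passes described above could be sketched in Python as follows. The counts are copied from the Hamlet tables in this section (only a subset of rows is used); the stop words list is deliberately tiny, and the names `FUNCTION_WORDS`, `NAMES_AND_TITLES`, and `filter_counts` are assumptions made for this example.

```python
from collections import Counter

# A tiny illustrative stop words list; real lists contain hundreds of words.
FUNCTION_WORDS = {"the", "and", "to", "of", "i", "a", "you", "my", "in"}

# Character names and titles we may also choose to remove (a research decision).
NAMES_AND_TITLES = {"hamlet", "lord", "king", "horatio", "polonius", "claudius", "queen"}

def filter_counts(counts, stop_words):
    """Return a new Counter that leaves out every word in stop_words."""
    return Counter({word: n for word, n in counts.items() if word not in stop_words})

# A subset of the Hamlet counts shown in the tables above.
counts = Counter({"the": 1148, "and": 970, "hamlet": 485, "lord": 313, "good": 109, "come": 106})

# First pass: remove common function words.
without_function_words = filter_counts(counts, FUNCTION_WORDS)

# Second pass: also remove character names and titles.
without_names = filter_counts(without_function_words, NAMES_AND_TITLES)

print(without_function_words.most_common())
print(without_names.most_common())
```

Keeping the two lists separate makes it easy to rerun the analysis with or without the character names, depending on the argument at hand.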


### Examples of Word Frequency

One of the most popular kinds of visualization associated with the digital humanities is the tag cloud (or word cloud). A tag cloud visualizes word frequency by connecting the size of a word to its frequency. More common words are larger. 
![Word cloud of The Narrative of the Life of Frederick Douglass, An American Slave](https://ithaka-labs.s3.amazonaws.com/static-files/images/tdm/tdmdocs/tagCloudDouglass.png)

**Tag Cloud of *The Narrative of the Life of Frederick Douglass, An American Slave* created using Voyant.**

In the visualization above, the ten most common words listed below are the largest words in the tag cloud.

|Word   | Count |
|----   |:-----:|
| mr    |  168  |
|slaves | 125   |
|master | 124   |
|slave  | 122   |
|time   | 117   |
|man    | 78    |
|slavery| 69    |
|covey  | 61    |
|old    | 58    |
|said   | 56    |

In the example above, the word frequencies are counted from a single work, *The Narrative of the Life of Frederick Douglass, An American Slave*. Depending on the size of our dataset, we can choose the appropriate size of the source text (e.g. chapter, book, volume, journal, month of articles, year of books, etc.). Text analysis can help us understand long-term trends in writing and publishing. Let us consider the very large [Early English Books Online - Text Creation Partnership (EEBO-TCP) corpus](https://textcreationpartnership.org/tcp-texts/eebo-tcp-early-english-books-online/), which contains 25,368 early English print books (with an additional 34,963 to be made public in 2021). Using the [N-gram Browser of earlyprint.org](https://earlyprint.org/lab/tool_ngram_browser.html?requestFromClient={%221%22:{%22spe%22:%22love,loue%22,%22reg%22:%22%22,%22lem%22:%22%22,%22pos%22:%22%22,%22originalPos%22:%22%22},%222%22:{%22spe%22:%22%22,%22reg%22:%22%22,%22lem%22:%22%22,%22pos%22:%22%22,%22originalPos%22:%22%22},%223%22:{%22spe%22:%22%22,%22reg%22:%22%22,%22lem%22:%22%22,%22pos%22:%22%22,%22originalPos%22:%22%22},%22databaseType%22:%22unigrams%22,%22smoothing%22:%22True%22,%22rollingAverage%22:%2220_year%22,%20%22instructionToggle%22:%20%22show%22}), we can visualize the historical standardization of the spelling of the word "love" in the 17th century. 

![Love vs. loue graph from 1450-1700](https://ithaka-labs.s3.amazonaws.com/static-files/images/tdm/tdmdocs/earlyprint-love-loue.png)

From this graph, we can see that the approximate crossover point is 1630, when the spelling "love" became more popular than "loue."
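
To make the idea of a crossover point concrete, here is a minimal Python sketch that groups counts of two spellings by decade and finds the first decade in which one spelling makes up more than half of the total. The numbers in `records` are invented toy values, not EEBO-TCP data, and the function name `share_by_decade` is an assumption made for this example.

```python
from collections import defaultdict

# Hypothetical toy records: (publication year, count of "love", count of "loue").
# These numbers are invented for illustration and are NOT taken from EEBO-TCP.
records = [
    (1601, 2, 8),
    (1608, 3, 7),
    (1625, 4, 6),
    (1634, 6, 4),
    (1647, 9, 1),
]

def share_by_decade(records):
    """Return {decade: share of 'love' among both spellings}, ordered by decade."""
    totals = defaultdict(lambda: [0, 0])
    for year, love, loue in records:
        decade = year - year % 10  # e.g. 1608 falls in the 1600 decade
        totals[decade][0] += love
        totals[decade][1] += loue
    return {d: love / (love + loue) for d, (love, loue) in sorted(totals.items())}

shares = share_by_decade(records)

# The crossover is the first decade where "love" is the majority spelling.
crossover = next(decade for decade, share in shares.items() if share > 0.5)

print(shares)
print(crossover)
```

Working with shares rather than raw counts matters because the number of surviving books differs greatly from decade to decade.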

### When should I use word frequency?

Generally, word frequency is used as an initial form of data exploration. The kinds of scholarly arguments that can be mounted from word frequency analysis are limited in scope, but the technique can help us understand larger patterns in the overall dataset that may not be obvious beforehand. While a word frequency analysis may not form the basis of an entire article or chapter, word frequency is the foundation for many advanced text mining methods such as TF-IDF and Topic Analysis.

One of the shortcomings of word frequency is that sometimes individual words are not useful units of analysis. For example, a social scientist may be interested in the relationship between nuclear families and drug offenses. Word frequency would allow us to search a thousand recent journal articles for the occurrences of "nuclear," "family," "drug," and "offense." But occurrences of "nuclear" could be about power plants and occurrences of "offense" may be about many other kinds of crime (or even sports). 

"Nuclear family" and "drug offense" are collocations. To address them, we'll need the method found in the next section.

### Collocation

An alternative to this approach is using [n-grams](./key-terms.ipynb#n-gram) which can capture phrases in addition to individual words.
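
As a small sketch of how n-grams capture phrases such as "nuclear family," the Python below counts runs of adjacent words. The function name `ngram_counts`, the simple tokenizer, and the sample sentences are assumptions made for this example.

```python
import re
from collections import Counter

def ngram_counts(text, n=2):
    """Count every run of n adjacent words (n=2 gives bigrams)."""
    words = re.findall(r"[a-z']+", text.lower())
    # Zipping n staggered copies of the word list yields each run of n neighbors.
    # Note: this simple version also pairs words across sentence boundaries.
    grams = zip(*(words[i:] for i in range(n)))
    return Counter(" ".join(gram) for gram in grams)

# Hypothetical sample text chosen only for illustration.
sample = ("The nuclear family remains common. Critics of nuclear power "
          "rarely mention the nuclear family.")

bigrams = ngram_counts(sample)
print(bigrams["nuclear family"], bigrams["nuclear power"])

# With n=1 the same function falls back to plain word frequency.
print(ngram_counts(sample, n=1)["nuclear"])
```

Counting single words lumps every use of "nuclear" together, while the bigram counts keep "nuclear family" and "nuclear power" apart.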

### TF-IDF

### Topic Analysis



## How are they connected?


### Concordance

### Network Analysis

## How does it feel?

### Sentiment Analysis

## What names are in here? 

### Named Entity Recognition

## How are they similar?

### Authorship Attribution

### Clustering

### Supervised Machine Learning

# Why learn Python?
<a id ="intro-to-python"></a>


## Hands-on Learning Opportunities

### Intensive Institutes (Travel required)
* [Digital Humanities Summer Institute](https://dhsi.org)
* [Digital Humanities Research Institute](http://dhinstitutes.org/)
* [Humanities Intensive Learning and Teaching](http://dhtraining.org/hilt/)
* [Data Matters](http://datamatters.org)


## Python Tutorials by Discipline
### Humanities
* [Python Programming for the Humanities (Folgert Karsdorp)](http://www.karsdorp.io/python-course/)
* [Intro to Python Workshop (Digital Humanities Research Institute)](https://github.com/DHRI-Curriculum/python)
* [Intro to Python I and II (University Libraries UNC Chapel Hill)](https://unc-libraries-data.github.io/Python/)
* [Python Lessons (The Programming Historian)](https://programminghistorian.org/en/lessons/?topic=python)

### Libraries
* [Python Intro for Libraries (Library Carpentry)](https://librarycarpentry.org/lc-python-intro/)














# Why use Jupyter Notebooks?
<a id="intro-to-jupyter"></a>

Jupyter notebooks are documents that contain both computer code (e.g. Python or R) and rich text elements (i.e. explanatory text, figures, links). Jupyter notebooks have two significant advantages for teaching and learning to code:
* Minimal Setup
  * Traditional code editors may require students to become familiar with terminals, environments, libraries, etc. 
  * With a Jupyter notebook, users can run code immediately.
* Rich Support Content
  * Traditional code supports plain text commenting, but has a limited ability to include supporting explanation and information.
  * Jupyter notebooks allow instructors to embed many kinds of supporting content including text, images, equations, links, and videos.
  


"At present, one of [the] best ways of sharing or publishing these workflows might be with the free and open-source Jupyter Notebook." (Dobson 39)






Examples from
Programming Historian
Ted Underwood's Book
DHRI
DHSI
HILT



Essentially, a Jupyter notebook is a file (.ipynb) that can be easily saved, uploaded, downloaded, and converted. (For example, a notebook file (.ipynb) can be converted into a Python file (.py), HTML file (.html), or PDF file (.pdf).) Users edit Jupyter notebook files (.ipynb) with specialized software such as the Jupyter Notebook application. Unfortunately, both the file type and the application are named "Jupyter Notebook," which is confusing. 
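
Under the hood, a notebook file is plain JSON text, which is part of why conversion is straightforward. The Python sketch below builds a minimal notebook by hand and pulls out its code cells, roughly what a conversion to a .py file does. The helper name `code_cells_to_script` and the cell contents are assumptions made for this example; in practice the `jupyter nbconvert` command handles conversion.

```python
import json

# A minimal notebook in the nbformat 4 layout, written by hand for illustration.
notebook = {
    "nbformat": 4,
    "nbformat_minor": 4,
    "metadata": {},
    "cells": [
        {"cell_type": "markdown", "metadata": {}, "source": ["# Word counts"]},
        {
            "cell_type": "code",
            "metadata": {},
            "execution_count": None,
            "outputs": [],
            "source": ["words = 'to be or not to be'.split()\n", "print(len(words))"],
        },
    ],
}

def code_cells_to_script(nb):
    """Join the source of every code cell into one Python script string."""
    chunks = ["".join(cell["source"]) for cell in nb["cells"] if cell["cell_type"] == "code"]
    return "\n\n".join(chunks) + "\n"

# This JSON text is what would be saved to disk as a .ipynb file.
ipynb_text = json.dumps(notebook, indent=1)

# Reading the JSON back and keeping only the code cells gives a plain script.
script = code_cells_to_script(json.loads(ipynb_text))
print(script)
```

The markdown cell is left out of the script, which is why explanatory text disappears when a notebook is exported as a plain .py file without comments.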

*The* Jupyter Notebook is a browser application that runs *a* Jupyter notebook. To keep the two apart, you may hear people refer to them as "the Jupyter Notebook app" and "a Jupyter notebook." Fortunately, future versions of the application will be called "JupyterLab" instead of "The Jupyter Notebook," which should alleviate some of this confusion.

While Jupyter notebook files can be saved, viewed, and edited on your local machine using The Jupyter Notebook application, it is often helpful to connect Jupyter notebooks to a Jupyter server that contains the unique environment (e.g. the right kernel, dependencies) to execute the notebook’s code. Using a server ensures that the environment for executing the code is consistent and correct.

If all of this is confusing, 


Other Jupyter servers include:

* JupyterLab (Jupyter's new replacement for "The Jupyter Notebook")
* Microsoft Azure Notebooks
* Google Colab
* Kaggle (popular with the data science community)

The interfaces differ slightly between these Jupyter servers, but they are essentially the same software.

**How can I use a Jupyter notebook file?**


**What computer languages do Jupyter notebooks support?**

The Jupyter system supports over 100 programming languages including Python, Java, R, Julia, MATLAB, Octave, Scheme, Processing, Scala, and more. The most common languages used for text and data mining are Python and R.

**How does the Digital Scholar Workbench connect to the TDM Jupyter notebooks?**

The Digital Scholar Workbench 


To use Jupyter notebooks (or any coding environment) for text and data mining, a user must have facility with the command line and install the appropriate dependencies. To make this work more accessible, we have deployed a JupyterHub server with kernels and dependencies designed for text and data mining. Teachers and students simply sign in to the Digital Scholar Workbench using their JSTOR login, and they can write, edit, and run code immediately. Whether you are a novice or an advanced programmer, you can:

* Build replicable and shareable datasets in minutes
* Start complex analyses like Topic Modeling or TF-IDF in minutes
* Create, edit, run, and share notebooks focused on particular TDM methods

---
Dobson, James E. [*Critical Digital Humanities*](https://www.press.uillinois.edu/books/catalog/48xfp2zp9780252042270.html). (2019).