<img align="left" src="https://ithaka-labs.s3.amazonaws.com/static-files/images/tdm/tdmdocs/CC_BY.png"><br />

Created by [Nathan Kelber](http://nkelber.com) for [JSTOR Labs](https://labs.jstor.org/) under [Creative Commons CC BY License](https://creativecommons.org/licenses/by/4.0/)<br />
For questions/comments/improvements, email nathan.kelber@ithaka.org.<br />
**This notebook is a conceptual overview that links to notebooks with hands-on code.**
___

# Why Learn Text Analysis?: A Guide for Absolute Beginners


**Description:** This lesson introduces key concepts in text analysis for a general academic audience. If you are completely new to text analysis, this is the place to start.

**Use Case:** For learners who wonder if they should learn text analysis

**Difficulty:** Beginner with no coding experience

**Completion time:** 45 minutes

**Knowledge Required:** None

**Knowledge Recommended:** None

**Data Format:** None

**Libraries Used:** None

**Research Pipeline:** None
___

## Introduction

This notebook is a high-level overview of common text analysis methods used in academic research. In particular, this notebook describes:

* Common text analysis methods useful for academic research
* The kind of research questions each method can answer
* How difficult each method is to learn
* Constellate events and open educational resources
* Additional learning resources

Text analysis has emerged from many research areas (computer science, linguistics, information science, social sciences, and computational humanities, to name a few) to help address the over-abundance of data. When a single scholar—or even a team of dozens of scholars—is confronted with millions of documents, it becomes impossible to read all the available evidence. This is true whether the documents are full-length monographs or tweets. The challenge of informational synthesis becomes more difficult as the research record grows larger with each passing day.

Text analysis is part of a generational shift in the toolkit of 21st-century researchers. Textual research methods play an important role in helping address our post-modern data deluge. No matter the field, all academics are increasingly confronted by data overload. From students to academic leaders, the "data problem" has taken center stage: ChatGPT, Large Language Models, Machine Learning, Big Data, Neural Networks. Whether academics use these technologies for their research or not, they increasingly underpin the systems we use for knowledge-making. 

Thriving in this universe of information requires learning digital literacies, ways of seeing the size, scope, and shape of human thought. Text analysis, then, is not just a field or set of methods, but a growing body of textual literacies that help us connect, discover, and interpret humanity's knowledge at a superhuman scale. Like a researcher operating a telescope, knowing the right methods can make the difference between a shot in the dark and a glimpse into the richness of newfound galaxies and constellations.

## Text Analysis Data

Text analysis creates insights from a large number of texts (also known as a corpus). Some texts may be ready for analysis and others may require thousands of hours of preparation. All research has time and funding constraints, so understanding the difference is important for determining whether a project is feasible.

Research projects that start with printed documents can be prohibitively expensive in time and labor. Assuming the printed documents can be scanned accurately into a suitable image (`.jpg`, `.tiff`, etc.) or portable document format (`.pdf`), the researcher will then need to convert the file using a process known as Optical Character Recognition (OCR). The process can be semi-automated, but high-quality, accurate results at scale almost always require significant human labor.

Since it is easier to start with "born-digital" materials, many researchers gather data from:

* Pre-existing datasets
* Websites (through scraping)
* Application Programming Interfaces (APIs)

If the data has a consistent structure, the relevant text can probably be extracted at scale with some Python literacy.

In most cases, the ideal scenario is having texts in an easily readable format, such as a plaintext file (a file with a `.txt` extension). Document formats such as Extensible Markup Language (`.xml`) or JavaScript Object Notation (`.json`) can be even richer sources of data, especially if they contain relevant metadata such as titles, authors, publishers, dates, etc.

Finally, since text analysis methods rely on statistical models, it is generally true that having more texts will improve the outcomes of your analysis.

### What about copyright law?

Text analysis on copyrighted materials poses an additional challenge. It is rarely practical to secure permission from the copyright holders of millions of documents. There are two common approaches to this problem:

1. Analyze the data within a **secure computing environment**
2. Use a **non-consumptive dataset**

For example, the HathiTrust Research Center (HTRC) offers both approaches. Researchers who need full-text access can use a secure computing environment known as an HTRC [data capsule](https://wiki.htrc.illinois.edu/display/COM/HTRC+Data+Capsule+Environment). The primary benefit is that researchers can work with full-text data directly in the environment. The capsule monitors data that is exported, ensuring that copyright is observed.

There are some significant downsides to this secure computing approach. Most notably, researchers must bring their tools into the environment, and they are limited by the technical specifications of the secure compute environment.

To address these issues, HTRC also offers a non-consumptive dataset called ["Extracted Features"](https://analytics.hathitrust.org/datasets). Users can download the non-consumptive dataset to their machine (~4 TB in size). The dataset does not contain full text (ensuring copyright is observed), but it does contain [n-gram](https://constellate.org/docs/key-terms/#n-gram) counts, which enable a significant amount of useful text analysis.

The downside to the non-consumptive dataset approach is that some text analysis methods, such as Named Entity Recognition (NER), require full-text data for the most accurate results.

### Constellate Data

Constellate's [dataset builder](https://constellate.org/builder) offers full-text when legally permissible and non-consumptive text for copyrighted materials. In the cases where Constellate cannot supply full-text due to copyright laws (JSTOR and Portico content), the datasets include three n-gram counts for each document:

* Unigrams: A single-word construction, for example: "vegetable".
* Bigrams: A two-word construction, for example: "vegetable stock".
* Trigrams: A three-word construction, for example: "homemade vegetable stock".

While having the full texts for the documents in your corpus is ideal, a great deal can still be discovered through the use of unigrams. Even when researchers have access to the full texts of a corpus, it is common for them to create a list of n-gram counts for analysis.
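To make the idea of n-gram counts concrete, here is a minimal sketch in plain Python. The helper name `ngram_counts` and the sample sentence are hypothetical, and the whitespace tokenizer is a simplifying assumption; this is not how Constellate necessarily produces its counts.

```python
from collections import Counter

def ngram_counts(text, n):
    """Count n-grams (runs of n consecutive words) in a text."""
    # Simplifying assumption: lowercase the text and split on whitespace.
    tokens = text.lower().split()
    # Zip the token list against shifted copies of itself to form n-grams.
    ngrams = zip(*(tokens[i:] for i in range(n)))
    return Counter(" ".join(gram) for gram in ngrams)

# Hypothetical sample text for illustration only.
sample = "homemade vegetable stock beats canned vegetable stock"

unigrams = ngram_counts(sample, 1)
bigrams = ngram_counts(sample, 2)
trigrams = ngram_counts(sample, 3)

print(unigrams.most_common(3))
print(bigrams.most_common(3))
print(trigrams.most_common(3))
```

Notice that the counts keep how often each word or phrase occurs but discard the order in which the sentences appeared, which is why the original text cannot be reconstructed from them.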

As a non-profit, Constellate gives the maximum access (as allowed by law) to its textual data without charge. We offer additional services for a modest fee in the form of educational classes, community events, and a specialized lab designed for teaching, learning, and research. We do not resell content to libraries that have purchased licenses for those materials. In fact, we offer access to all of our available materials for text analysis—whether or not the institution has licensed them. 

Constellate does not offer a secure compute environment for working with full-text documents. Researchers tell us they have little interest in being forced to conduct research within a secure compute environment. Instead, we offer full-text datasets for researchers willing to sign a legal agreement through [Data For Research](https://jstor.libwizard.com/f/dfr-request). 

___
<font color="red">Read more</font>
* [Constellate Dataset Builder: full-text and n-gram content](https://constellate.org/docs/data-sources)
* [Bring your own data into Constellate](https://constellate.org/docs/import-data-into-constellate)

### I have my own data. What will it take to get it ready?

For a major text analysis project, such as UNC Chapel Hill's [On the Books: Jim Crow and Algorithms of Resistance](https://onthebooks.lib.unc.edu/), about 90% of the labor is creating the corpus. One of the most significant benefits of using Constellate to teach, learn, and do research is that the dataset builder removes *the vast majority* of the effort in doing text analysis. Of course, we recognize researchers' scholarly interests may require building their own dataset, and we support researchers by offering educational classes and events on building a dataset. We also offer the Constellate Lab for conducting and sharing research.

If you have your own data, you will need to assess what it will take to make it ready for analysis. Constellate offers classes on preparing your own dataset based on our open educational resources. We can help you answer the following questions:

* How can I convert my data into plain text? 
    * <font color="red">Start learning</font> [Optical Character Recognition Basics](./ocr-basics.ipynb)
* How can I tokenize my texts (separate the individual words)? 
    * <font color="red">Start learning</font> [Tokenize Text Files](./tokenizing-text-files.ipynb)
    * <font color="red">Start learning</font> [Tokenize Text Files with NLTK](./tokenize-text-files-with-nltk.ipynb)
    
Consider the data's current form as well as the size and skill of your project staff. The corpus creation process could take anywhere from a few hours to many years of labor. If there is a significant amount of labor, you may need to write a grant proposal to hire help. *If writing a grant, contact your library early in the process since funding agencies will require a data preservation plan in the grant application, which will likely include committing your dataset to your institutional repository.*

In addition to the cleaned-up texts for your corpus, you will also need a strategy for dealing with textual metadata, information such as author, year, etc. Some of this is discussed in [Tokenize Text Files with NLTK](./tokenize-text-files-with-nltk.ipynb), but it would also help to have some experience working with data at scale. The Python [Pandas](https://pandas.pydata.org/) library is one of the best ways to work with data at scale. (Compared to Excel, Pandas is faster, more flexible, works at larger scales, and can be automated.) Constellate offers classes for those interested in learning Pandas.
___

<font color="red">Start learning</font>
* [Pandas 1](./pandas-1.ipynb)

## Common Research Questions

Here are a few common research questions that text analysis can help answer:

1. What are these texts about?
2. What emotions are expressed?
3. What key names can I find?
4. Which texts are similar?

Let's consider the methods to answer each of these questions.

### What are these texts about?

The most common problem for researchers is trying to sort through a large pile of data: "What are these texts about?" There are a variety of approaches for answering this question, varying from basic word frequency counts to machine learning methods such as topic modeling and topic classification. The following approaches try to answer:

**What are the words, topics, concepts, and significant terms in these documents?**
___

**Word Frequency** (Beginner Friendly)

Researchers often begin to explore a corpus by counting the frequency of each word in each document. Almost all text analysis tools feature word frequency analysis, often including visualizations such as a [word cloud](https://constellate.org/docs/key-terms/#tag-cloud). When exploring a corpus using word frequencies, it is often helpful to make a distinction between [content words](https://constellate.org/docs/key-terms/#content-words) (generally nouns and verbs) and [function words](https://constellate.org/docs/key-terms/#function-words) (grammatical word constructions like "the", "of", and "or"). Researchers may refine a [stop words list](https://constellate.org/docs/key-terms/#stop-words) to improve their data output.
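As a rough illustration of word frequency counting with a stop words list, consider the following sketch. The tiny `STOP_WORDS` set, the function name `word_frequencies`, and the sample sentence are all hypothetical; real projects typically use a much longer stop words list tailored to the corpus.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop words list of function words.
STOP_WORDS = {"the", "of", "or", "and", "a", "in", "to", "is"}

def word_frequencies(text, stop_words=STOP_WORDS):
    """Count the content words in a text, skipping stop words."""
    # Lowercase the text and keep runs of letters (and apostrophes) as tokens.
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(token for token in tokens if token not in stop_words)

# Hypothetical sample text for illustration only.
text = "The stock of the soup is the soul of the kitchen, and the soup is hot."
freqs = word_frequencies(text)

print(freqs.most_common(3))
```

Without the stop words filter, "the" would dominate the results; removing function words lets the content words rise to the top.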

<font color="red">Start learning</font> 
* [Word Frequency Analysis](./exploring-word-frequencies.ipynb) Create a word cloud
* [Creating a Stop Words List](./creating-stopwords-list.ipynb) Improve your research results
___

**Significant Terms** (Beginner Friendly)

Search engines use significant terms analysis to match a user query with a list of appropriate documents. This method could be useful if you want to search your corpus for the most significant texts based on a word (or set of words). It can also be useful in reverse. For a given document, you could create a list of the ten most significant terms. This can be useful for summarizing the content of a document. 
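One widely used measure of term significance is TF-IDF (term frequency multiplied by inverse document frequency). The text above does not name a specific measure, so treating TF-IDF as the method is an assumption here. The sketch below uses only the standard library, and the three miniature "documents" and the function name `tf_idf` are hypothetical.

```python
import math
from collections import Counter

# Hypothetical miniature corpus for illustration only.
docs = {
    "sports": "the team won the game and the player scored",
    "finance": "the market fell and the fund lost value",
    "politics": "the campaign won the vote and the state",
}

def tf_idf(docs):
    """Score each term in each document by term frequency * inverse document frequency."""
    tokenized = {name: text.split() for name, text in docs.items()}
    n_docs = len(tokenized)

    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter()
    for tokens in tokenized.values():
        doc_freq.update(set(tokens))

    scores = {}
    for name, tokens in tokenized.items():
        counts = Counter(tokens)
        scores[name] = {
            # Terms found in every document get log(1) = 0, so they score zero.
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        }
    return scores

scores = tf_idf(docs)

# The three highest-scoring terms for each document.
top_terms = {
    name: sorted(term_scores, key=term_scores.get, reverse=True)[:3]
    for name, term_scores in scores.items()
}
print(top_terms)
```

A word like "the" appears in every document, so its score is zero everywhere, while a word that appears in only one document scores highest there. Ranking documents by their score for a query word is the "simple search engine" direction; ranking words within one document is the summarizing direction.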

<font color="red">Start learning</font> 
* [Significant Terms Analysis](./finding-significant-terms.ipynb) Create a simple search engine
___
**Topic Analysis** or Topic Modeling (Intermediate)

While significant terms analysis reveals terms commonly found in a given document, a topic analysis can tell us what words tend to cluster together across a corpus. For example, if we were to study newspapers, we would expect that certain words would cluster together into topics that match the sections of the newspaper. We might see something like:

* Topic 1: baseball, ball, player, trade, score, win, defeat
* Topic 2: market, dow, bull, trade, run, fund, stock
* Topic 3: campaign, democratic, polls, red, vote, defeat, state

We can recognize that these words tend to cluster together within newspaper sections such as "Sports", "Finance", and "Politics". If we have never read a set of documents, we might use a topic analysis to get a sense of what topics are in a given corpus. Given that Topic Analysis is an exploratory technique, it may require some expertise to fine-tune and get good results for a given corpus. However, if the topics can be discovered then they could potentially be used to train a model using [machine learning](https://constellate.org/docs/key-terms/#machine-learning) to categorize the topics in a given document automatically using Topic Classification.

<font color="red">Read more</font>
* Keli Du's "A Survey on LDA Topic Modeling in Digital Humanities"

<font color="red">Start learning</font> 
* [Topic Analysis](./finding-significant-terms.ipynb) Create an interactive topic visualization
___
**Concordance** (Beginner Friendly)

The concordance has a long history in the computational humanities, and Roberto Busa's concordance *Index Thomisticus* (started in 1946) is arguably the first digital humanities project. Before computers were common, concordances were printed in large volumes such as John Bartlett's 1894 reference book *A Complete Concordance to Shakespeare*, which was 1909 pages long! A concordance gives the context of a given word or phrase in a body of texts. For example, a literary scholar might ask: how often and in what context does Shakespeare use the phrase "honest Iago" in Othello? A historian or sociologist may examine social media data to discover the contextual use of dog whistles, while a medical researcher may use a concordance to identify journal articles that mention "remdesivir" in the context of "Covid-19" for a systematic review.
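A concordance is often displayed as "keyword in context" lines: each occurrence of the search word shown with a few words on either side. The following is a minimal sketch of that idea; the function name `concordance`, the `window` parameter, and the sample text are hypothetical, and the whitespace tokenizer is a simplifying assumption.

```python
def concordance(text, keyword, window=2):
    """Return each occurrence of keyword with `window` words of context on each side."""
    tokens = text.split()
    lines = []
    for i, token in enumerate(tokens):
        # Ignore case and surrounding punctuation when matching the keyword.
        if token.lower().strip('.,;:!?"') == keyword.lower():
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            lines.append(f"{left} [{token}] {right}")
    return lines

# Hypothetical sample text for illustration only.
sample = "The cook made vegetable stock. Good stock takes time, and stock freezes well."

hits = concordance(sample, "stock")
for line in hits:
    print(line)
```

Widening the `window` shows more context per occurrence, which is the same trade-off a printed concordance makes between page space and readability.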

<font color="red">Start learning</font> 
* [Concordance](./concordance.ipynb) View context windows with journal data and visualize terms with a lexical dispersion plot

<font color="red">Read more</font>

* Steven E. Jones [Roberto Busa, S.J., and the Emergence of Humanities Computing](https://www.routledge.com/Roberto-Busa-S-J-and-the-Emergence-of-Humanities-Computing-The-Priest/Jones/p/book/9781138587250) (2016)
* Julianne Nyhan and Marco Passarotti, eds. [One Origin of Digital Humanities: Fr Roberto Busa in His Own Words](https://www.amazon.com/One-Origin-Digital-Humanities-Roberto/dp/3030183114/) (2019)

### What emotions are expressed?

**Sentiment Analysis** (Intermediate)

Sentiment analysis can help determine the emotions expressed in a given text. This can be determined using grammar-based algorithms, [Machine Learning](https://constellate.org/docs/key-terms/#machine-learning), or both. This is important for many business use-cases such as market research, consumer sentiment, and recommender systems.
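As a toy illustration of the rule-based approach, the sketch below scores a text with a small word list and flips the score of a word that follows a negation. Everything here is hypothetical: the `LEXICON` values, the `NEGATIONS` set, and the function names are invented for illustration, and real tools rely on far larger lexicons and many more rules.

```python
# Hypothetical toy lexicon: each word maps to a polarity score.
LEXICON = {"good": 1, "great": 2, "love": 2, "bad": -1, "terrible": -2, "hate": -2}

# Hypothetical negation words that flip the polarity of the next word.
NEGATIONS = {"not", "never", "no"}

def sentiment_score(text):
    """Sum the polarity of known words, flipping any word that follows a negation."""
    tokens = [token.strip('.,;:!?"').lower() for token in text.split()]
    score = 0
    for i, token in enumerate(tokens):
        value = LEXICON.get(token, 0)
        if i > 0 and tokens[i - 1] in NEGATIONS:
            value = -value
        score += value
    return score

def classify(text):
    """Label a text as positive, negative, or neutral based on its score."""
    score = sentiment_score(text)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

# Hypothetical product review snippets for illustration only.
for review in ["I love this great blender!", "This was not good.", "It arrived on Tuesday."]:
    print(review, "->", classify(review))
```

The negation rule shows why word lists alone are not enough: "not good" contains a positive word but expresses a negative opinion.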

<font color="red">Start learning</font> 
* [Sentiment Analysis with VADER](./finding-significant-terms.ipynb) Classify positive and negative product reviews

### What key names can I find?

**Named Entity Recognition** or NER (Intermediate)

Named Entity Recognition (NER) automatically identifies entities within a text and can be helpful for extracting certain kinds of entities such as proper nouns. For example, NER could identify names of organizations, people, and places. It might also help identify things like dates, times, or dollar amounts. Like sentiment analysis, NER relies on grammar rules and/or machine learning.

NER is very prominent in molecular biology and bioinformatics, particularly for identifying genes and gene products. NER can help summarize and/or extract data by identifying key phrases in relevant documents. Some examples include tweets, resumes, novels, journal articles, oral histories, and reviews.

<font color="red">Read more</font>
* Miguel Won, Patricia Murrieta-Flores, and Bruno Martins [Ensemble Named Entity Recognition (NER): Evaluating NER Tools in the Identification of Place Names in Historical Corpora](https://www.frontiersin.org/articles/10.3389/fdigh.2018.00002/full) (2018)

<font color="red">Start learning</font>
* [Named Entity Recognition](./NER-1.ipynb)
* [Multilingual NER](https://github.com/wjbmattingly/tap-2022-multilingual-ner) by William Mattingly (2022 TAP Institute)
* [Named Entity Recognition](https://melaniewalsh.github.io/Intro-Cultural-Analytics/05-Text-Analysis/12-Named-Entity-Recognition.html) by Melanie Walsh (2021 TAP Institute)
* [Named Entity Recognition](https://nkelber.github.io/tapi2021/book/courses/ner.html) by Zoe LeBlanc (2021 TAP Institute)

### Which texts are similar?

**Stylometrics and Authorship Attribution** (Intermediate to Advanced)

The digital humanities, and its precursor "humanities computing," have a long history in the analysis of literature, particularly for analyzing genre and authorship. For example, the New Oxford Shakespeare surprised many scholars by crediting Christopher Marlowe as a co-author of Shakespeare's "Henry VI," Parts 1, 2, and 3. It also lists as co-authors many Shakespeare contemporaries such as Thomas Nashe, George Peele, Thomas Heywood, Ben Jonson, George Wilkins, Thomas Middleton, and John Fletcher.


<font color="red">Read more</font>
* Patrick Juola [How a Computer Program Helped Show J.K. Rowling Wrote A Cuckoo's Calling](https://www.scientificamerican.com/article/how-a-computer-program-helped-show-jk-rowling-write-a-cuckoos-calling/) (2013)
* Ros Barber [Big data or not enough? Zeta test reliability and the attribution of Henry VI](https://academic.oup.com/dsh/article-abstract/36/3/542/5918973?redirectedFrom=fulltext)

___

**Supervised Machine Learning** (Intermediate to Advanced)

Supervised machine learning is an excellent choice for classification problems, identifying whether a text belongs to one group or another. These methods "train" computers to identify and classify similar items based on data that has been labeled or tagged by experts. For example, **On the Books: Jim Crow and Algorithms of Resistance** was able to use machine learning to identify 1939 North Carolina Jim Crow laws enacted between Reconstruction and the Civil Rights Movement. 

<font color="red">Start learning</font>
* [Introduction to Machine Learning](https://github.com/wjbmattingly/intro-to-ml) by William Mattingly (2022 TAP Institute)
* [Introduction to Machine Learning](https://github.com/Grantglass/intro_to_ml) by Grant Glass (2022 TAP Institute)

<font color="red">Read more</font>
* [Project Outcomes for On the Books](https://onthebooks.lib.unc.edu/otb-research/project-outcomes/)


## More learning materials

### Digital Humanities Resources

* [TAP Institute Open Course Materials](https://labs.jstor.org/projects/text-analysis-pedagogy-institute-2/)
* [PythonHumanities.com](https://pythonhumanities.com/about/) by [William Mattingly](https://wjbmattingly.com/)
* [Programming Historian](https://programminghistorian.org/en/lessons/) by various authors
* [The Carpentries](https://carpentries.org/workshops-curricula/) by various authors
* [Digital Humanities Research Institutes](https://www.dhinstitutes.org/curricula/) by various authors
* [Computational Humanities Research](https://discourse.computational-humanities-research.org/)
* [YaleDHLab Lab Workshops](https://github.com/YaleDHLab/lab-workshops)
* [Jupyter notebooks for digital humanities](https://github.com/quinnanya/dh-jupyter/blob/master/README.md) curated by [Quinn Dombrowski](https://quinndombrowski.com/)
* [Data Sitter's Club](https://datasittersclub.github.io/site/) by various authors
* [HathiTrust Digital Library Collections and Tools](https://www.hathitrust.org/htrc_collections_tools)
* [Documenting the Now](https://github.com/DocNow)
  
**Books on Python, Text Analysis, and DH**
* *Automate the Boring Stuff with Python: Practical Programming for Total Beginners* (2019) by Al Sweigart
* *Python Crash Course: A Hands-On, Project-Based Introduction to Programming* (2019) by Eric Matthes
* *Machine Learning with Python Cookbook* (2018) by Chris Albon
* *Natural Language Processing in Action* (2019) by Hobson Lane, Cole Howard, and Hannes Max Hapke
* [*Humanities Data Analysis: Case Studies with Python*](https://www.humanitiesdataanalysis.org/) by Folgert Karsdorp, Mike Kestemont, and Allen Riddell
* [Technical Textbooks List](https://cmu-lib.github.io/dhlg/global-resources/educational-resources/textbooks/) by [Scott B. Weingart](https://scottbot.github.io/)
* [Introduction to Named Entity Recognition](https://ner.pythonhumanities.com/intro.html) by [William Mattingly](https://wjbmattingly.com/)

**Books on Data Ethics**
* *Algorithms of Oppression* (2018) by Safiya Noble
* *Race After Technology* (2019) by Ruha Benjamin
* *Data Feminism* (2020) by Catherine D'Ignazio and Lauren F. Klein

**Instructional Video**
* [DH, Coding, and Book History](https://www.youtube.com/user/pvierth/videos)
* [Python Tutorials for Digital Humanities](https://www.youtube.com/@python-programming)

**Course Examples**
* [Humanities Analytics](https://humanitiesanalytics.com/) by [Matt Lavin](https://matthew-lavin.com/)
* [Introduction to Cultural Analytics and Python](https://melaniewalsh.github.io/Intro-Cultural-Analytics/welcome.html) by [Melanie Walsh](https://melaniewalsh.org/)
* [CodeLab](https://github.com/ZoeLeBlanc/CodeLab) by [Shane Lin](https://www.library.virginia.edu/staff/ssl2ab), [Zoe LeBlanc](https://zoeleblanc.com/), and [Brandon Walsh](https://scholarslab.lib.virginia.edu/people/brandon-walsh/)
* [Computational and Inferential Thinking: The Foundations of Data Science](https://inferentialthinking.com/chapters/intro.html) by Ani Adhikari, John DeNero, David Wagner