# TDM Key Terms and Concepts

## Application Programming Interface (API) <a id="api"></a>
A protocol that defines communication between a client and server, often used to request data. APIs can help retrieve data from remote repositories, anything from weather to Twitter and Facebook. 
## Artificial Intelligence <a id="artificial-intelligence"></a>
The science of making intelligent machines, especially machines that react to input data in a way similar to a human being. Historically, artificial intelligence has tended to rely on simple if-then statements (e.g. if the user mentions their mother, ask how she is doing), but recent advancements in artificial intelligence have focused on [machine learning](#machine-learning): the ability of machines to rewrite their own algorithms to improve their accuracy.
## Bag of Words (Model) <a id ="bag-of-words"></a>
A model of texts that counts individual words without regard to grammatical location or phrases. Just as the letters of a Scrabble game are tossed into a bag without order, a "bag of words" model gathers all the words of a text into a "bag" with no regard to where a particular word occurs within the document. In this model, the reader knows every word and its frequency within the text but does not have the context of the word's use.
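As a minimal sketch of the idea, the snippet below builds a bag of words with Python's built-in `collections.Counter`. The sample sentences are made up for illustration, and splitting on whitespace is a simplifying assumption (real projects usually tokenize more carefully).

```python
from collections import Counter

# Hypothetical sample sentence, used only for illustration.
text = "the cat sat on the mat and the dog sat too"

# Split on whitespace and count each word; word order is discarded.
bag = Counter(text.split())

print(bag["the"])  # 3
print(bag["sat"])  # 2

# Two texts with the same words in a different order produce the same bag.
same_bag = Counter("dog bites man".split()) == Counter("man bites dog".split())
print(same_bag)  # True
```

The last comparison shows what the model gives up: "dog bites man" and "man bites dog" are indistinguishable once the words are in the bag.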
## Bibliographic Metadata <a id="bibliographic-metadata"></a>
Also known as "descriptive metadata," informational [metadata](#metadata) that describes a published item such as a book or journal article.  Bibliographic metadata contains data elements to help users identify and retrieve the published items.   It often has a formalized bibliographic format.
## Bigram <a id="bigram"></a>
An [n-gram](#n-gram) with a length of two. For example, "chicken stock" is a word bigram.
## Bayesian Classification
A classification method based on [Bayes' Theorem](https://en.wikipedia.org/wiki/Bayes%27_theorem) that describes the probability of an event based on available prior knowledge. For example, given a dataset of the historical weather conditions (temperature, humidity, windspeed) from December 25th for every year over the last century, will it snow on December 25th, 2027?
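The calculation underneath this kind of classification can be sketched in a few lines. The probabilities below are invented for illustration and are not real weather data, and the function name `posterior` is a hypothetical helper. This applies Bayes' Theorem once; it is not a full classifier.

```python
def posterior(prior, likelihood, evidence):
    """Bayes' Theorem: P(A|B) = P(B|A) * P(A) / P(B)."""
    return likelihood * prior / evidence

# Made-up numbers for illustration only.
p_snow = 0.20             # P(snow): share of past December 25ths with snow
p_cold_given_snow = 0.90  # P(cold | snow): how often snowy days were cold
p_cold = 0.45             # P(cold): share of past December 25ths that were cold

# Probability of snow, given that the day is known to be cold.
p_snow_given_cold = posterior(p_snow, p_cold_given_snow, p_cold)
print(round(p_snow_given_cold, 2))  # 0.4
```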
## Clean Data<a id="clean-data"></a>
Data that has been standardized and corrected for accurate results. This phrase can also be used as a verb such as "to clean data" or "cleaning data." In practice, data cleaning makes up the bulk of text analysis work. 
## Clustering
## Collocation

## Concordance
## Content Words <a id="content-words"></a>
As opposed to [function words](#function-words) (e.g. articles, pronouns, conjunctions), content words (e.g. nouns, verbs, and adjectives) carry greater lexical meaning. Word frequency analysis typically attempts to filter out function words, in order to make content words more prominent. This filtering is accomplished with a [stop words](#stop-words) list.
## Corpus <a id="corpus"></a>
A large (and often structured) collection of texts used for analysis. For example, all of the plays written by Shakespeare. A simple example might be a set of plain text files in a folder on your computer. A more complicated example may use [JSON](#json), [XML](#xml), or another form of markup, to allow for deeper analysis. The plural form is corpora.

See also [TEI XML](#tei-xml). 
## Counter (in Python) <a id="python-counter"></a>
A data type similar to a [Python dictionary](#python-dictionary) with a few key differences:
* A counter can hold counts of zero or less, but addition, subtraction, union, and intersection drop any result whose count is zero or less
* When a key is requested that doesn't exist in the counter, it returns 0 instead of raising an error as a dictionary would
* A counter object has additional methods for counting, including `.most_common(x)`, which returns the x most common elements along with their counts
* Counter objects can be added and subtracted, as well as combined through unions (`|`) and intersections (`&`)
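The following sketch exercises these behaviors with `collections.Counter` from the Python standard library. The fruit names are arbitrary sample data.

```python
from collections import Counter

# Arbitrary sample data for illustration.
a = Counter(["apple", "apple", "pear"])
b = Counter(["apple", "fig"])

print(a["kiwi"])         # 0, a missing key does not raise a KeyError
print(a.most_common(1))  # [('apple', 2)]

print(a + b)  # Counter({'apple': 3, 'pear': 1, 'fig': 1})
print(a - b)  # Counter({'apple': 1, 'pear': 1})
print(a | b)  # union keeps the larger count: Counter({'apple': 2, 'pear': 1, 'fig': 1})
print(a & b)  # intersection keeps the smaller count: Counter({'apple': 1})
```

Note that `a - b` contains no entry for "fig": its count would be negative, so the subtraction drops it.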

## CSV (file) <a id="csv-file"></a>
A .csv file, or Comma-Separated Values file, is a simple format for storing structured data where each entry in the file is separated by a comma. Similarly, a [TSV file](#tsv-file) uses tabs to separate individual data entries.
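A minimal sketch of reading CSV data with Python's built-in `csv` module follows. The rows are hypothetical, and `io.StringIO` stands in for a file on disk so the example is self-contained.

```python
import csv
import io

# Hypothetical CSV content; io.StringIO stands in for an open file.
csv_text = (
    "title,author,year\n"
    "Hamlet,William Shakespeare,1603\n"
    "Emma,Jane Austen,1815\n"
)

# DictReader uses the first row as column names.
rows = list(csv.DictReader(io.StringIO(csv_text)))

print(rows[0]["title"])  # Hamlet
print(rows[1]["year"])   # 1815 (read as a string, not an integer)
```

The same module can read tab-separated data by passing `delimiter="\t"` to the reader.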
## Dataframe (in [Pandas](#pandas)) <a id="pandas-dataframe"></a>
The primary data structure for analysis, manipulation, and presentation of data with the [Pandas library](#pandas).
## Dataset <a id="dataset"></a>
A collection of information, usually computer files, used for statistical analysis. Most datasets are digital text (either numbers, words, or both), but they can also be other formats such as image, audio, and/or video content. Datasets are usually referred to as structured, semi-structured, or unstructured.

* Structured data fits into a predetermined format and can usually be represented by a table, spreadsheet, or relational database.
* Unstructured data is more freeform. For example, longform texts, audio, or video content are unstructured.
* Semi-structured data uses tags or elements to mark out structures within an otherwise unstructured dataset. Email files, for example, have structured aspects (Sender, Subject, etc.), but the body of an email is usually unstructured.
## Dataset ID <a id="dataset-id"></a>
A unique identifier for a [dataset](#dataset) created using the [corpus](#corpus) builder. A copy of your dataset ID will be sent to you in an email.
## Descriptive Metadata <a id="descriptive-metadata"></a>
See [bibliographic metadata](#bibliographic-metadata).
## Dictionary (Python) <a id="python-dictionary"></a>
A variable in [Python](#python) that stores data in [key/value pairs](#key-value-pair). This differs from a [Python](#python) [list](#python-list) which stores data in numerical order beginning with item 0.
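The difference can be seen in a short sketch. The values are sample data chosen for illustration.

```python
# A dictionary looks items up by key.
play = {"title": "Hamlet", "author": "William Shakespeare"}

# A list looks items up by position, starting at 0.
characters = ["Hamlet", "Ophelia", "Horatio"]

print(play["title"])   # Hamlet
print(characters[0])   # Hamlet
print(characters[2])   # Horatio
```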
## Discipline
An academic field or body of knowledge taught and studied within colleges or universities. Generally academic disciplines are divided into three large groups: 
* The Humanities include disciplines like English, History, and Law
* The Sciences include disciplines like Physics, Biology, and Mathematics
* The Social Sciences include disciplines like Anthropology, Economics, and Sociology

Academic disciplines as divisions are matters of convenience for organizing departments, but many, if not most, professors research in two or more disciplines at a time. 
## Environment
## Extracted Features <a id="extracted-features"></a>
The [JSON](#json) data format for [non-consumptive research](#non-consumptive) used by [HathiTrust Research Center](#htrc) tools. The format is similar to that used by [JSTOR](#jstor) and [Portico](#portico). [Read more](https://wiki.htrc.illinois.edu/display/COM/Extracted+Features+Dataset) on the HathiTrust Research Center website.
## Floating-point number (float) <a id="float"></a>
A float is a data type that contains a decimal number that can be assigned to a variable as a value. (Other kinds of data types in Python include [strings](#string) and [integers](#integer).)
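Python's built-in `type()` function reports which data type a value has. The short sketch below uses arbitrary sample values (the variable names are illustrative) to show that a whole number written with a decimal point is a float, not an integer.

```python
whole = 506          # no decimal point: an integer
decimal = 506.0      # decimal point: a floating-point number
words = "24 pizzas"  # quotation marks: a string, even though it contains digits

print(type(whole).__name__)    # int
print(type(decimal).__name__)  # float
print(type(words).__name__)    # str

# The two numbers are equal in value even though their types differ.
print(whole == decimal)  # True
```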

|  Data type             | Examples                                      |
| -----------------------|:---------------------------------------------:|
| Integers               | -5, -3, 0, 5, 201                             |
| Floating-point numbers | -3.74, -3.14, 0.0, 503.4, 506.0               | 
| Strings                | 'potatoes', 'Hello world!', 'no', '24 pizzas' |
## Function Words <a id="function-words"></a>
The words in a sentence that have little lexical meaning and express grammatical relationships. Function words include articles, pronouns, and conjunctions. When using a [word frequency](#word-frequency) approach, function words are often filtered out in favor of [content words](#content-words) using a [stop words](#stop-words) list.
## Gensim <a id="gensim"></a>


## Google Colab <a id="google-colab"></a>
An online service for reading, writing, sharing, and editing [Jupyter notebooks](#jupyter-notebook). [Google Colab](https://colab.research.google.com/) is compatible with Google Drive and can be installed through the G Suite Marketplace.
## HathiTrust <a id="hathitrust"></a>
A partnership of academic and research institutions, offering access to millions of digitized volumes through the [HathiTrust Digital Library](#hathitrust-digital-library). 
## HathiTrust Digital Library <a id="hathitrust-digital-library"></a>
An online library of ~14 million volumes created by [HathiTrust](#hathitrust). Most of the materials are from research libraries, including content digitized via [Google Books](https://books.google.com/) and [Internet Archive](https://archive.org/).
## HathiTrust Research Center (HTRC)<a id="htrc"></a>
A partnership between Indiana University and the University of Illinois at Urbana-Champaign that develops a suite of tools and services for text data mining the [HathiTrust Digital Library](#hathitrust-digital-library). 
## Integer <a id="integer"></a>
An integer is a data type that contains a whole number that can be assigned to a variable as a value. (Other kinds of data types in Python include [strings](#string) and [floating point numbers](#float).)

|  Data type             | Examples                                      |
| -----------------------|:---------------------------------------------:|
| Integers               | -5, -3, 0, 5, 201                             |
| Floating-point numbers | -3.74, -3.14, 0.0, 503.4, 506.0               | 
| Strings                | 'potatoes', 'Hello world!', 'no', '24 pizzas' |

## JavaScript (Programming Language) <a id="javascript"></a>
An object-oriented computer programming language often used to create interactive effects within web browsers. Learn more at [w3schools](https://www.w3schools.com/js/default.asp).
## JSON (JavaScript Object Notation)<a id="json"></a>
An open-standard file format for storing and exchanging data that is intended to be easy for both humans and machines to read and write. Like [XML](#xml), JSON is often used by [APIs](#api) to transmit data from a remote repository (say weather data or Twitter data) to a local machine.
## JSON Lines<a id="jsonl"></a>
Also called newline-delimited [JSON](#json), JSON Lines (file extension .jsonl) is structured so that it may be processed one record at a time. Each line is a valid JSON value.
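Processing one record at a time can be sketched with the standard-library `json` module. The two records are hypothetical, and `io.StringIO` stands in for an open .jsonl file.

```python
import io
import json

# Hypothetical records; io.StringIO stands in for an open .jsonl file.
jsonl_file = io.StringIO(
    '{"title": "Hamlet", "year": 1603}\n'
    '{"title": "Emma", "year": 1815}\n'
)

records = []
for line in jsonl_file:
    line = line.strip()
    if line:  # skip any blank lines
        records.append(json.loads(line))  # each line is parsed on its own

print(len(records))         # 2
print(records[1]["title"])  # Emma
```

Because each line is parsed independently, a large file never has to be loaded into memory all at once.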
## json (Python library)<a id="json-python-library"></a>
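A library in the [Python](#python) standard library for converting [JSON](#json) text into Python objects (such as [dictionaries](#python-dictionary) and [lists](#python-list)) and back again.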
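A minimal sketch of the two most common calls, `json.loads` and `json.dumps`, is shown here. The JSON text is a made-up example.

```python
import json

# Made-up JSON text for illustration.
json_text = '{"title": "Hamlet", "acts": 5, "genres": ["tragedy", "drama"]}'

data = json.loads(json_text)  # JSON text -> Python dictionary
print(data["title"])                  # Hamlet
print(type(data["genres"]).__name__)  # list

round_trip = json.dumps(data)  # Python dictionary -> JSON text
print(json.loads(round_trip) == data)  # True
```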
## JSTOR <a id="jstor"></a>
A not-for-profit that collaborates with the academic community and manages the [JSTOR Digital Library](#jstor-digital-library), a digital library for scholars, researchers, and students featuring more than 12 million academic journal articles, books, and primary sources in 75 disciplines. JSTOR is part of [ITHAKA](http://ithaka.org), a not-for-profit organization that also includes [Artstor](http://artstor.org), [Ithaka S+R](http://www.sr.ithaka.org/), and [Portico](http://portico.org).
## JSTOR Digital Library <a id="jstor-digital-library"></a>
A digital library for scholars, researchers, and students featuring more than 12 million academic journal articles, books, and primary sources in 75 disciplines.
## JupyterHub <a id="jupyterhub"></a>
A multi-user version of [The Jupyter Notebook](#the-jupyter-notebook), ideal for teaching environments.
## JupyterLab <a id="jupyterlab"></a>
The newest software from [Project Jupyter](#project-jupyter), intended to replace [The Jupyter Notebook](#the-jupyter-notebook), for executing and editing [Jupyter notebook](#jupyter-notebook) files.
## Jupyter Notebook, The (software) <a id="the-jupyter-notebook"></a>
A single-user web application for executing and editing [Jupyter notebook files](#jupyter-notebook). Will be replaced by [JupyterLab](#jupyterlab).
## Jupyter notebook (file) <a id="jupyter-notebook"></a>
A file with extension .ipynb that contains computer code (e.g. [Python](#python) or R) alongside other explanatory media (text, images, video). 
## Jupyter Server <a id="jupyter-serve"></a>
A server with the appropriate software environment (e.g. [JupyterHub](#jupyterhub), [JupyterLab](#jupyterlab), [Google Colab](#google-colab)) for running and editing [Jupyter notebooks](#jupyter-notebook).
## Key/Value Pair<a id="key-value-pair"></a>
A key/value pair is a unit where the key provides a category for an item and the value provides informational data.  [JSON](#json) is commonly used to encode key/value pairs. 

>"title": "Hamlet"

The **key** is “title” and the **value** is “Hamlet”. 

In [Python](#python), [dictionaries](#python-dictionary) and [Counters](#python-counter) also use key/value pairs.

## Keyword Extraction

## Latent Dirichlet Allocation (LDA)

## Lemmatization <a id="lemmatization"></a>

## Library (in Python) <a id="library"></a>
A collection of methods and functions for achieving certain tasks (e.g. image manipulation, web scraping). Using a library saves time, since code for a specific group of tasks can be added quickly and all at once. The [Natural Language Toolkit (NLTK)](#nltk) is a common library used in [natural language processing](#nlp).

## List (in Python) <a id="python-list"></a>
A variable that stores items in numbered order beginning with item 0. This is different from a Python [dictionary](#python-dictionary) variable, which stores data in [key/value pairs](#key-value-pair).

## Machine Learning <a id="machine-learning"></a>
A subset of [artificial intelligence](#artificial-intelligence) that focuses on algorithms that improve their accuracy when exposed to additional data, without being explicitly reprogrammed by a human.

## Metadata <a id="metadata"></a>
Data that describes data. In the humanities and library contexts, this often refers to [bibliographic metadata](#bibliographic-metadata) that describes information such as author, publication date, medium, etc. It may also describe other kinds of data like files, for example "date created" or "file size."

## Modulo (in Python)<a id="modulo"></a>
Notated as "%", an arithmetic operation that gives the remainder of a division. For example, `34 % 6` gives 4, since 34 = 5 × 6 + 4.
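The operator can be tried directly in Python. The `is_even` helper below is a hypothetical example of a common use, not part of any library.

```python
print(34 % 6)         # 4, the remainder
print(34 // 6)        # 5, the whole-number quotient
print(divmod(34, 6))  # (5, 4), quotient and remainder together

# A common use of modulo: testing whether a number is even.
def is_even(n):
    return n % 2 == 0

print(is_even(34))  # True
print(is_even(7))   # False
```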

## N-gram <a id ="n-gram"></a>
A sequence of n items from a given sample of text or speech. Most often, this refers to a sequence of words, but it can also be used to analyze text at the level of syllables, letters, or phonemes. N-grams are often described by their length. For example, word n-grams might include:
* stock (a 1-gram, or unigram)
* chicken stock (a 2-gram, or [bigram](#bigram))
* homemade chicken stock (a 3-gram, or [trigram](#trigram))

A text analysis approach that looks only at unigrams at the word level will not be able to differentiate between the "stock" in "stock market" and "chicken stock."
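Word n-grams can be generated with a few lines of Python. This is a minimal sketch: the function name `word_ngrams` is a hypothetical helper, and splitting on whitespace is a simplifying assumption.

```python
def word_ngrams(text, n):
    """Return the word n-grams of `text` as a list of strings."""
    words = text.split()
    # Slide a window of n words across the text, one word at a time.
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

phrase = "homemade chicken stock"
print(word_ngrams(phrase, 1))  # ['homemade', 'chicken', 'stock']
print(word_ngrams(phrase, 2))  # ['homemade chicken', 'chicken stock']
print(word_ngrams(phrase, 3))  # ['homemade chicken stock']
```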

One of the most popular examples of text analysis with n-grams is the [Google N-Gram Viewer](https://books.google.com/ngrams).

See also [Natural Language Processing](#nlp). 

## Named Entity Recognition (NER)

## Natural Language Processing (NLP) <a id="nlp"></a>

## Natural Language Toolkit (NLTK) <a id="nltk"></a>
A suite of libraries and programs for [Natural Language Processing](#nlp) written in [Python](#python). NLTK includes libraries for tokenizing, collocation, n-grams, Part of Speech (POS) Tagging, and Named Entity Recognition (NER).

See the [project documentation](https://www.nltk.org/) and book [Natural Language Processing with Python](http://www.nltk.org/book/).

## Neural Network <a id="neural-network"></a>

## Non-consumptive Research<a id="non-consumptive"></a>
Non-consumptive research allows analysts to do text analysis without displaying or reading substantial portions of copyrighted materials. In practice, this usually means giving analysts a [bag of words](./key-terms.ipynb#bag-of-words) that describes the frequency of every word in a text but not the order in which they occur. 

## Optical Character Recognition (OCR)<a id="ocr"></a>
The process of turning printed text into machine-readable digital text. Physical materials are scanned into digital images, then specialized software attempts to turn the image into text. Two popular examples of OCR software are [Tesseract (Open Source)](https://github.com/tesseract-ocr/tesseract) and [ABBYY FineReader (Proprietary)](https://www.abbyy.com/en-us/finereader/).

## Package

## Pandas (Python)<a id="pandas"></a>
A library for visualizing, analyzing, and manipulating data in Python. Learn more about [Pandas at pydata.org](https://pandas.pydata.org/pandas-docs/version/0.15/tutorials.html).

## Part of Speech (POS) Tagging <a id="pos-tagging"></a>

## Plain text <a id="plain-text"></a>
A file that contains only text and can be easily read in a text editor (as opposed to a binary or executable file).

## Portico

## Primary Source

## Project Jupyter <a id="project-jupyter"></a>
A non-profit that develops open-source software, open standards, and services across many programming languages. They are most well-known for software such as [The Jupyter Notebook](#the-jupyter-notebook), [JupyterLab](#jupyterlab), and [JupyterHub](#jupyterhub). All three of these programs are used to create, edit, and share programming notebooks, known as [Jupyter notebooks](#jupyter-notebook).

## Python (Programming Language) <a id="python"></a>

## R (Programming Language) <a id="r"></a>

## Secondary Source

## Sentiment Analysis

## Stop Words (List) <a id="stop-words"></a>
A stop words list is a set of words or phrases that are ignored in [word frequency](#word-frequency) analysis. It is common for a researcher who is interested in prominent nouns and verbs to remove [function words](#function-words) (e.g. the, and, I, to, of, a). A stop words list may also include other common words, such as character names, which are usually the most common words in the text of a play.
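Filtering with a stop words list can be sketched as follows. The list here is tiny and hypothetical (real lists are much longer), and the sample sentence is made up.

```python
from collections import Counter

# A tiny, hypothetical stop words list; real lists are much longer.
stop_words = {"the", "and", "i", "to", "of", "a"}

text = "the king and the queen of denmark speak to the prince"
words = text.split()

# Without filtering, a function word is the most frequent.
print(Counter(words).most_common(1))  # [('the', 3)]

# Keep only the words that are not in the stop words list.
content = [w for w in words if w not in stop_words]
print(content)  # ['king', 'queen', 'denmark', 'speak', 'prince']
```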

## String (Python) <a id="string"></a>
A string is a data type that contains a sequence of characters that can be assigned to a variable as a value. (Other kinds of data types in Python include [integers](#integer) and [floating point numbers](#float).)

|  Data type             | Examples                                      |
| -----------------------|:---------------------------------------------:|
| Integers               | -5, -3, 0, 5, 201                             |
| Floating-point numbers | -3.74, -3.14, 0.0, 503.4, 506.0               | 
| Strings                | 'potatoes', 'Hello world!', 'no', '24 pizzas' |

## Tag Cloud (or Word Cloud)<a id ="tag-cloud"></a>
A tag cloud is a visualization of the relative word frequencies in a [corpus](#corpus). The relative size of each word in a tag cloud depends on its frequency within a text. Larger words occur more frequently.

![Tag Cloud of The Narrative of the Life of Frederick Douglass, An American Slave](https://ithaka-labs.s3.amazonaws.com/static-files/images/tdm/tdmdocs/tagCloudDouglass.png)

**A Tag Cloud of *The Narrative of the Life of Frederick Douglass, An American Slave* generated using Voyant.**



## TEI XML <a id ="tei-xml"></a>
A form of [XML Markup](#xml), or tagging, created by the [Text Encoding Initiative](https://tei-c.org/) to describe digital documents. This markup can help computers recognize particular aspects of the text. Text analysis often requires explicit marking, even for textual aspects that a human reader can easily pick out:
* Title
* Author Name
* Name of the speaker in a play
* A paragraph
* Stage directions
* A stanza

See also [Parts of Speech Tagging](#pos-tagging), [Lemmatization](#lemmatization), [Tokenization](#tokenization).

## Term Frequency-Inverse Document Frequency (TF-IDF)<a id="tf-idf"></a>
A statistical method that intends to reflect how important a particular word is to a document within a [corpus](#corpus). A simple measurement of "term frequency" is multiplied by "inverse document frequency," a weight that shrinks as a word appears in more documents, limiting the weight of common words like "the", "of", and "to".
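The sketch below shows one common textbook formulation: term frequency (a word's share of a document) multiplied by the logarithm of the total number of documents divided by the number of documents containing the word. The three-document corpus and the function name `tf_idf` are hypothetical, and libraries often use smoothed variants of this formula, so exact scores may differ.

```python
import math

# Hypothetical three-document corpus, already split into words.
corpus = [
    ["the", "chicken", "stock"],
    ["the", "stock", "market"],
    ["the", "market", "of", "ideas"],
]

def tf_idf(term, document, corpus):
    """One simple TF-IDF variant: (count / length) * log(N / document frequency)."""
    tf = document.count(term) / len(document)
    docs_with_term = sum(1 for doc in corpus if term in doc)
    if docs_with_term == 0:
        return 0.0  # the term never occurs, so it carries no weight
    idf = math.log(len(corpus) / docs_with_term)
    return tf * idf

first = corpus[0]
print(tf_idf("the", first, corpus))  # 0.0, because "the" is in every document
print(tf_idf("chicken", first, corpus) > tf_idf("stock", first, corpus))  # True
```

In the first document, "chicken" scores higher than "stock" because it appears in only one document of the corpus, while "stock" appears in two.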

## Text Extraction

## Token <a id="token"></a>
A chunk or [string](#string) of text, most often a single word. 
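Two simple ways of breaking a text into tokens are sketched below, using only built-in Python. The sample sentence is made up, and the regular expression `\w+` is one simplifying choice among many; dedicated tokenizers handle cases such as contractions and hyphens more carefully.

```python
import re

text = "Hello world! No, 24 pizzas."

# Splitting on whitespace leaves punctuation attached to the tokens.
whitespace_tokens = text.split()
print(whitespace_tokens)  # ['Hello', 'world!', 'No,', '24', 'pizzas.']

# Lowercasing and keeping only runs of letters, digits, and underscores.
word_tokens = re.findall(r"\w+", text.lower())
print(word_tokens)  # ['hello', 'world', 'no', '24', 'pizzas']
```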

## Tokenization <a id="tokenization"></a>

## Topic Modeling (or Topic Analysis)

## Tree Map

## Trigram <a id="trigram"></a>
An [n-gram](#n-gram) with a length of three. For example, "homemade chicken stock" is a word trigram.

## TSV (file) <a id="tsv-file"></a>
A .tsv file, or Tab-Separated Value file, is a simple format for storing structured data where each entry in the file is separated by a tab. Similarly, a [CSV file](#csv-file) uses commas to separate individual data entries.

## Unigram <a id="unigram"></a>
An [n-gram](#n-gram) with a length of one. For example, "chicken" is a unigram.

## Voyant <a id="voyant"></a>
A flexible, web-based platform for text analysis that can also be run locally. [Voyant](https://voyant-tools.org/) has many kinds of visualizations, supports saving, and creates embeddable HTML objects. To learn more, see [the documentation](https://voyant-tools.org/docs/#!/guide/start).

## Word2vec <a id="word2vec"></a>

## Word Cloud<a id="word-cloud"></a>
See [Tag Cloud](#tag-cloud).

## Word Co-Occurrence Matrix <a id="word-co-occurrence-matrix"></a>

## Word Embedding
A collective name for [Natural Language Processing](#nlp) techniques that map words to vectors of real numbers, using methods such as [neural networks](#neural-network) or dimensionality reduction on a [word co-occurrence matrix](#word-co-occurrence-matrix). [Word2vec](#word2vec) is a common model for producing word embeddings.

## Word Frequency <a id="word-frequency"></a>
A text analysis method that counts the number of occurrences of individual words within a particular text. Word frequency uses a [bag of words](#bag-of-words) model where the order of words is not significant. Just as the letters of a Scrabble game are tossed into a bag without order, word frequency merely records the number of occurrences with no regard to where a particular word occurs within a document.

An alternative to this approach is using [n-grams](#n-gram) which can capture phrases in addition to individual words.

Read more about [Word Frequency](./0-why-text-mining.ipynb#wf-method).

## XML <a id="xml"></a>
Short for eXtensible Markup Language, XML uses tags to identify parts of a document for a machine to understand. Like HTML, XML marks each element with an opening tag (e.g. `<l>`) and a closing tag marked by a forward slash (e.g. `</l>`). Unlike HTML, these tags can be freely created according to whatever standard the creator needs. One prominent standard comes from the [Text Encoding Initiative](https://tei-c.org/). The example below uses [TEI-XML](#tei-xml) to describe Shakespeare's Sonnet 130 by labeling lines, quatrains, and the final couplet. This kind of markup enables computers to do complex analysis quickly, such as comparing every couplet, quatrain, or line in Shakespeare's sonnets.
```
<text>
 <body>
  <lg>
   <lg type="quatrain">
    <l>My Mistres eyes are nothing like the Sunne,</l>
    <l>Currall is farre more red, then her lips red</l>
    <l>If snow be white, why then her brests are dun:</l>
    <l>If haires be wiers, black wiers grown on her head:</l>
   </lg>
   <lg type="quatrain">
    <l>I have seene Roses damaskt, red and white,</l>
    <l>But no such Roses see I in her cheekes,</l>
    <l>And in some perfumes is there more delight,</l>
    <l>Then in the breath that from my Mistres reekes.</l>
   </lg>
   <lg type="quatrain">
    <l>I love to heare her speake, yet well I know,</l>
    <l>That Musicke hath a farre more pleasing sound:</l>
    <l>I graunt I never saw a goddesse goe,</l>
    <l>My Mistres when shee walkes treads on the ground.</l>
   </lg>
  </lg>
  <lg type="couplet">
   <l>And yet by heaven I think my love as rare,</l>
   <l>As any she beli'd with false compare.</l>
  </lg>
 </body>
</text>
```
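As a rough sketch of how a program might use this markup, the snippet below parses a shortened copy of the example (one quatrain plus the couplet) with `xml.etree.ElementTree` from the Python standard library and pulls out the couplet. The variable names are illustrative.

```python
import xml.etree.ElementTree as ET

# A shortened copy of the markup above: one quatrain plus the couplet.
tei = """
<text>
 <body>
  <lg type="quatrain">
   <l>My Mistres eyes are nothing like the Sunne,</l>
   <l>Currall is farre more red, then her lips red</l>
   <l>If snow be white, why then her brests are dun:</l>
   <l>If haires be wiers, black wiers grown on her head:</l>
  </lg>
  <lg type="couplet">
   <l>And yet by heaven I think my love as rare,</l>
   <l>As any she beli'd with false compare.</l>
  </lg>
 </body>
</text>
"""

root = ET.fromstring(tei.strip())

# Find the line group tagged as the couplet and collect its lines.
couplet = root.find(".//lg[@type='couplet']")
couplet_lines = [line.text for line in couplet.findall("l")]
print(couplet_lines[1])  # As any she beli'd with false compare.

# Count every line in the poem, regardless of which group it belongs to.
total_lines = len(root.findall(".//l"))
print(total_lines)  # 6
```

Because the couplet is explicitly tagged, the program does not have to guess where it begins from line counts or rhyme.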