# TDM Key Terms and Concepts

## Artificial Intelligence <a id="artificial-intelligence"></a>
The science of making intelligent machines, especially machines that react to input data in a way similar to a human being. Historically, artificial intelligence has tended to rely on simple if-then statements (e.g. if the user mentions their mother, ask how she is doing), but recent advancements in artificial intelligence have focused on [machine learning](#machine-learning): the ability of machines to rewrite their own algorithms to improve their accuracy.
## Bag of Words (Model) <a id ="bag-of-words"></a>
A model of texts that counts individual words without regard to grammatical location or phrases. Just as the letters of a Scrabble game are tossed into a bag without order, a "bag of words" model gathers all the words of a text into a "bag" with no regard to where a particular word occurs within the document. 
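
The idea can be sketched in a few lines of Python. This is a minimal illustration, not a full tokenizer: the helper name `bag_of_words` and the two sample sentences are made up for this example, and the punctuation handling is deliberately naive.

```python
from collections import Counter

def bag_of_words(text):
    """Return a Counter mapping each lowercased word to its count."""
    # Naive tokenization (an assumption): split on whitespace and
    # strip common punctuation from the edges of each word.
    words = [w.strip('.,;:!?"\'').lower() for w in text.split()]
    return Counter(w for w in words if w)

# Two hypothetical sentences with the same words in a different order
bag_a = bag_of_words("The dog chased the cat.")
bag_b = bag_of_words("The cat chased the dog.")

# The bags are equal because word order is discarded.
print(bag_a == bag_b)
print(bag_a["the"])
```

Because only counts are kept, the model cannot tell which animal did the chasing.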
## Bigram <a id="bigram"></a>
An [n-gram](#n-gram) with a length of two. For example, "chicken stock" is a word bigram.
## Bayesian Classification
A classification method based on [Bayes' Theorem](https://en.wikipedia.org/wiki/Bayes%27_theorem), which describes the probability of an event based on available prior knowledge. For example, given a dataset of the historical weather conditions (temperature, humidity, wind speed) on December 25th for every year over the last century, how likely is snow on December 25th, 2027?
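
As a rough sketch of the underlying arithmetic, the snippet below applies Bayes' Theorem to a single weather feature. All of the probabilities are invented for illustration and do not come from any real dataset; a real classifier would estimate them from historical records and would usually combine several features.

```python
# Hypothetical probabilities (assumptions, not real weather data)
p_snow = 0.2                # prior: share of past December 25ths with snow
p_cold_given_snow = 0.9     # share of snowy days that were cold
p_cold_given_no_snow = 0.3  # share of snow-free days that were cold

# Total probability of observing a cold day
p_cold = p_cold_given_snow * p_snow + p_cold_given_no_snow * (1 - p_snow)

# Bayes' Theorem: P(snow | cold) = P(cold | snow) * P(snow) / P(cold)
p_snow_given_cold = p_cold_given_snow * p_snow / p_cold

# Classify by choosing the more probable outcome given the evidence
prediction = "snow" if p_snow_given_cold > 0.5 else "no snow"
print(round(p_snow_given_cold, 3), prediction)
```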
## Cleaning (Data)

## Clustering
## Collocation

## Concordance
## Content Words
As opposed to [function words](#function-words) (e.g. articles, pronouns, conjunctions), content words (e.g. nouns, verbs, and adjectives) carry greater lexical meaning. Word frequency analysis typically attempts to filter out function words in order to make content words more prominent. This filtering is accomplished with a [stop words](#stop-words) list.
## Corpus <a id="corpus"></a>
A large (and often structured) collection of texts used for analysis. For example, all of the plays written by Shakespeare. A simple example might be a set of plain text files in a folder on your computer. A more complicated example may use XML, or another form of markup, to allow for deeper analysis. The plural form is corpora.

See also [TEI XML](#tei-xml). 
## CSV (file) <a id="csv-file"></a>
A .csv file, or Comma-Separated Value file, is a simple format for storing structured data where each entry in the file is separated by a comma. Similarly, a [TSV file](#tsv-file) uses tabs to separate individual data entries. 
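
Both formats can be read with the `csv` module from Python's standard library. The sketch below uses small in-memory strings in place of files; the column names and values are made up for illustration.

```python
import csv
import io

# Hypothetical data: the same table written as CSV and as TSV
csv_text = "word,count\nstock,3\n"
tsv_text = "word\tcount\nstock\t3\n"

# csv.reader splits on commas by default; pass delimiter="\t" for tabs
csv_rows = list(csv.reader(io.StringIO(csv_text)))
tsv_rows = list(csv.reader(io.StringIO(tsv_text), delimiter="\t"))

print(csv_rows)
print(csv_rows == tsv_rows)
```

Note that every value is read as a string, so numeric columns need to be converted explicitly.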
## Dataset
A collection of information, usually computer files, used for statistical analysis. Most datasets are digital text (either numbers, words, or both), but they can also be other formats such as image, audio, and/or video content. Datasets are usually referred to as structured, semi-structured, or unstructured.
Structured data fits into a predetermined format and can usually be represented by a table, spreadsheet, or relational database. 
Unstructured data is more freeform. For example, longform texts, audio, or video content are unstructured. 
Semi-structured data uses tags or elements to mark out structures within an otherwise unstructured dataset. Email files, for example, have structured aspects (Sender, Subject, etc.), but the body of an email is usually unstructured.
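
The email case can be illustrated with the `email` package from Python's standard library. This is only a sketch: the message below is invented, and real email files typically carry many more headers.

```python
from email import message_from_string

# A hypothetical raw email: headers, a blank line, then the body
raw = "From: reader@example.com\nSubject: Hello\n\nFree-form body text."

msg = message_from_string(raw)

# Structured part: headers can be looked up by name
print(msg["From"])
print(msg["Subject"])

# Unstructured part: the body is returned as plain text
print(msg.get_payload())
```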
## Discipline
An academic field or body of knowledge taught and studied within colleges or universities. Generally academic disciplines are divided into three large groups: 
* The Humanities include disciplines like English, History, Law
* The Sciences include disciplines like Physics, Biology, Mathematics
* The Social Sciences include disciplines like Anthropology, Economics, and Sociology

Academic disciplines as divisions are matters of convenience for organizing departments, but many, if not most, professors research in two or more disciplines at a time. 
## Environment
## Extracted Features
## Function Words <a id="function-words"></a>
The words in a sentence that have little lexical meaning and express grammatical relationships. Function words include articles, pronouns, and conjunctions. When using a [word frequency](#word-frequency) approach, function words are often filtered out in favor of content words using a [stop words](#stop-words) list. 
## Gensim
## Google Colab <a id="google-colab"></a>
## HathiTrust
## HathiTrust Research Center (HTRC)
## JSTOR
## JupyterHub <a id="jupyterhub"></a>
A multi-user version of [The Jupyter Notebook](#the-jupyter-notebook), ideal for teaching environments.
## JupyterLab <a id="jupyterlab"></a>
The newest software from [Project Jupyter](#project-jupyter), intended to replace [The Jupyter Notebook](#the-jupyter-notebook), for executing and editing [Jupyter notebook](#jupyter-notebook) files.
## Jupyter Notebook, The (software) <a id="the-jupyter-notebook"></a>
A single-user web application for executing and editing [Jupyter notebook files](#jupyter-notebook). Will be replaced by [JupyterLab](#jupyterlab).
## Jupyter notebook (file) <a id="jupyter-notebook"></a>
A file with extension .ipynb that contains computer code (e.g. [Python](#python) or R) alongside other explanatory media (text, images, video). 
## Jupyter Server <a id="jupyter-serve"></a>
A server with the appropriate software environment (e.g. [JupyterHub](#jupyterhub), [JupyterLab](#jupyterlab), [Google Colab](#google-colab)) for running and editing [Jupyter notebooks](#jupyter-notebook).
## Keyword Extraction
## Latent Dirichlet Allocation (LDA)
## Lemmatization <a id="lemmatization"></a>
## Library (in Python)
A collection of methods and functions for achieving certain tasks (e.g. image manipulation, web scraping). A library saves time because ready-made code for a specific group of tasks can be added quickly and all at once. The [Natural Language Toolkit (NLTK)](#nltk) is a common library used in [natural language processing](#nlp).
## Machine Learning <a id="machine-learning"></a>
A subset of [artificial intelligence](#artificial-intelligence) that focuses on algorithms that improve their accuracy when exposed to additional data, without being explicitly reprogrammed by a human.
## N-gram <a id ="n-gram"></a>
A sequence of n items from a given sample of text or speech. Most often, this refers to a sequence of words, but it can also be used to analyze text at the level of syllables, letters, or phonemes. N-grams are often described by their length. For example, word n-grams might include:
* stock (a 1-gram, or unigram)
* chicken stock (a 2-gram, or [bigram](#bigram))
* homemade chicken stock (a 3-gram, or [trigram](#trigram))

A text analysis approach that looks only at unigrams at the word level will not be able to differentiate between the "stock" in "stock market" and "chicken stock."
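
Word n-grams can be generated with a short sliding-window function. The helper name `ngrams` below is chosen for this sketch and is written in plain Python; splitting on whitespace is a simplifying assumption about tokenization.

```python
def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "homemade chicken stock".split()

print(ngrams(tokens, 1))  # unigrams
print(ngrams(tokens, 2))  # bigrams
print(ngrams(tokens, 3))  # trigrams
```

In the bigram output, "stock" always travels with its neighbor, which is what lets an analysis separate "chicken stock" from "stock market."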

One of the most popular examples of text analysis with n-grams is the [Google N-Gram Viewer](https://books.google.com/ngrams).

See also [Natural Language Processing](#nlp). 
## Named Entity Recognition (NER)
## Natural Language Processing (NLP) <a id="nlp"></a>
## Natural Language Toolkit (NLTK) <a id="nltk"></a>
A suite of libraries and programs for [Natural Language Processing](#nlp) written in [Python](#python). NLTK includes libraries for tokenization, collocation, n-grams, Part of Speech (POS) Tagging, and Named Entity Recognition (NER).

See the [project documentation](https://www.nltk.org/) and book [Natural Language Processing with Python](http://www.nltk.org/book/).
## Neural Net
## Optical Character Recognition (OCR)
## Package
## Part of Speech (POS) Tagging <a id="pos-tagging"></a>
## Plain text
## Portico
## Primary Source
## Project Jupyter <a id="project-jupyter"></a>
A non-profit that develops open-source software, open standards, and services across many programming languages. They are most well-known for software such as [The Jupyter Notebook](#the-jupyter-notebook), [JupyterLab](#jupyterlab), and [JupyterHub](#jupyterhub). All three of these programs are used to create, edit, and share programming notebooks, known as [Jupyter notebooks](#jupyter-notebook).
## Python (Programming Language) <a id="python"></a>

## R (Programming Language) <a id="r"></a>
## Secondary Source
## Sentiment Analysis
## Stop Words (List) <a id="stop-words"></a>
A stop words list is a set of words or phrases that are ignored in [word frequency](#word-frequency) analysis. A researcher who is interested in prominent nouns and verbs will commonly remove [function words](#function-words) (e.g. the, and, I, to, of, a). A stop words list may also include other common words, such as character IDs, which are usually the most common words in the text of a play.
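
The filtering step can be sketched as follows. The stop words list here is a tiny, hypothetical one built from the examples above, and the sample sentence is invented; real lists are much longer and are often supplied by a library.

```python
from collections import Counter

# A tiny, hypothetical stop words list (lowercased)
stop_words = {"the", "and", "i", "to", "of", "a"}

# Hypothetical sample text, tokenized naively on whitespace
text = "the cat and the dog went to the park"
tokens = text.lower().split()

# Count only the tokens that are not on the stop words list
content_counts = Counter(t for t in tokens if t not in stop_words)

print(content_counts.most_common())
```

Without the filter, "the" would top the frequency list with three occurrences.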
## Tag Cloud (or Word Cloud) <a id ="tag-cloud"></a>
A tag cloud is a visualization of the relative word frequencies in a [corpus](#corpus). The relative size of each word in a tag cloud depends on its frequency within a text. Larger words occur more frequently.

![Tag Cloud of The Narrative of the Life of Frederick Douglass, An American Slave](https://ithaka-labs.s3.amazonaws.com/static-files/images/tdm/tdmdocs/tagCloudDouglass.png)
**A Tag Cloud of *The Narrative of the Life of Frederick Douglass, An American Slave* generated using Voyant.**
## TEI XML <a id ="tei-xml"></a>
A form of [XML Markup](#xml), or tagging, created by the [Text Encoding Initiative](https://tei-c.org/) to describe digital documents. This markup can help computers recognize particular aspects of the text. Text analysis often requires explicit marking, even for textual aspects that a human reader can easily pick out:
* Title
* Author Name
* Name of the speaker in a play
* A paragraph
* Stage directions
* A stanza

See also [Parts of Speech Tagging](#pos-tagging), [Lemmatization](#lemmatization), [Tokenization](#tokenization).
## Term Frequency
## Term Frequency-Inverse Document Frequency (TFIDF)
## Text Extraction
## Token
## Tokenization <a id="tokenization"></a>
## Topic Modeling (or Topic Analysis)
## Tree Map
## Trigram <a id="trigram"></a>
An [n-gram](#n-gram) with a length of three. For example, "homemade chicken stock" is a word trigram.
## TSV (file) <a id="tsv-file"></a>
A .tsv file, or Tab-Separated Value file, is a simple format for storing structured data where each entry in the file is separated by a tab. Similarly, a [CSV file](#csv-file) uses commas to separate individual data entries.
## Unigram
An [n-gram](#n-gram) with a length of one. For example, "stock" is a word unigram.
## Voyant
## Word2vec
## Word Cloud <a id="word-cloud"></a>
See [Tag Cloud](#tag-cloud).
## Word Embedding
## Word Frequency <a id="word-frequency"></a>
A text analysis method that counts the number of occurrences of individual words within a particular text. Word frequency uses a [bag of words](#bag-of-words) model where the order of words is not significant. Just as the letters of a Scrabble game are tossed into a bag without order, word frequency merely records the number of occurrences with no regard to where a particular word occurs within a document. 

An alternative to this approach is using [n-grams](#n-gram) which can capture phrases in addition to individual words.

Read more about [Word Frequency](./0-why-text-mining.ipynb#wf-method). 
## XML <a id="xml"></a>
Short for eXtensible Markup Language, XML uses tags to identify parts of a document so that a machine can interpret them. Like HTML, XML pairs an opening tag (e.g. `<l>`) with a closing tag marked by a forward slash (e.g. `</l>`). Unlike HTML, XML tags can be freely created according to whatever standard the creator needs. One prominent standard comes from the [Text Encoding Initiative](https://tei-c.org/). The example below uses [TEI XML](#tei-xml) to describe Shakespeare's Sonnet 130 by labeling lines, quatrains, and the final couplet. This kind of markup enables computers to do complex analysis quickly, such as comparing every couplet, quatrain, or line in Shakespeare's sonnets.
```
<text>
 <body>
  <lg>
   <lg type="quatrain">
    <l>My Mistres eyes are nothing like the Sunne,</l>
    <l>Currall is farre more red, then her lips red</l>
    <l>If snow be white, why then her brests are dun:</l>
    <l>If haires be wiers, black wiers grown on her head:</l>
   </lg>
   <lg type="quatrain">
    <l>I have seene Roses damaskt, red and white,</l>
    <l>But no such Roses see I in her cheekes,</l>
    <l>And in some perfumes is there more delight,</l>
    <l>Then in the breath that from my Mistres reekes.</l>
   </lg>
   <lg type="quatrain">
    <l>I love to heare her speake, yet well I know,</l>
    <l>That Musicke hath a farre more pleasing sound:</l>
    <l>I graunt I never saw a goddesse goe,</l>
    <l>My Mistres when shee walkes treads on the ground.</l>
   </lg>
  </lg>
  <lg type="couplet">
   <l>And yet by heaven I think my love as rare,</l>
   <l>As any she beli'd with false compare.</l>
  </lg>
 </body>
</text>
```
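
As a sketch of how a program might use this markup, the snippet below pulls out the lines of the couplet with `xml.etree.ElementTree` from Python's standard library. To keep it short, the sample is an abbreviated copy of the sonnet markup above (one quatrain trimmed to two lines), and the variable names are chosen for this illustration.

```python
import xml.etree.ElementTree as ET

# Abbreviated sample in the same shape as the markup above
sample = """
<text>
 <body>
  <lg>
   <lg type="quatrain">
    <l>My Mistres eyes are nothing like the Sunne,</l>
    <l>Currall is farre more red, then her lips red</l>
   </lg>
  </lg>
  <lg type="couplet">
   <l>And yet by heaven I think my love as rare,</l>
   <l>As any she beli'd with false compare.</l>
  </lg>
 </body>
</text>
"""

root = ET.fromstring(sample.strip())

# iter("lg") visits every <lg> element at any depth;
# keep the <l> children of those marked type="couplet"
couplet_lines = [
    line.text
    for group in root.iter("lg")
    if group.get("type") == "couplet"
    for line in group.findall("l")
]

print(couplet_lines)
```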
___