<img align="left" src="https://ithaka-labs.s3.amazonaws.com/static-files/images/tdm/tdmdocs/CC_BY.png"><br />

Created by [Nathan Kelber](http://nkelber.com) for [JSTOR Labs](https://labs.jstor.org/) under [Creative Commons CC BY License](https://creativecommons.org/licenses/by/4.0/)<br />
For questions/comments/improvements, email nathan.kelber@ithaka.org.<br />
____


# Tokenization Basics

**Description:**
This notebook focuses on the basic concepts surrounding tokenization. It includes material on the following concepts:

* Word segmentation
* n-grams
* Stemming
* Lemmatization
* Tokenizers

**Use Case:** For Learners (Detailed explanation, not ideal for researchers)

**Difficulty:** Beginner

**Completion time:** 60-90 minutes

**Knowledge Required:** 
* Python Basics ([Start Python Basics 1](./python-basics-1.ipynb))

**Knowledge Recommended:** 
* [Working with Dataset Files](./working-with-dataset-files.ipynb)

**Data Format:** None

**Libraries Used:**
* re

**Research Pipeline:**

1. Scan documents
2. OCR files
3. Clean up texts
4. **Tokenize text files** (this notebook)
___

## What is a word?

The concept of a word makes intuitive sense in everyday language, but it starts to break down when we try to formalize it for analysis with computer programs. Linguists have spent decades creating formal rules for breaking down texts into smaller parts for analysis, dealing in great detail with the normally unspoken rules of grammar. In this lesson, we consider what a word is and how we could write a program to collect the words within a text.

Let's take a look at an example sentence:

> Now that summer's here, we're going to visit the beach at Lake Michigan and eat ice cream.

How many words are in this sentence? We could start by simply looking at words that are separated by spaces. 

> Now, that, summer's, here, we're, going, to, visit, the, beach, at, Lake, Michigan, and, eat, ice, cream.

That would give us 17 words. But we could ask a few questions about this count. For example, is 'Lake Michigan' one word or two words? The words 'lake' and 'Michigan' each have their own meaning, but 'Lake Michigan' means something different from either word alone. Similarly, what about 'ice cream'?
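The space-based count above can be reproduced with Python's built-in `str.split()`. This is only a minimal sketch of the naive approach, and the variable names are illustrative:

```python
# The example sentence from this lesson
sentence = "Now that summer's here, we're going to visit the beach at Lake Michigan and eat ice cream."

# split() with no arguments breaks the string wherever whitespace occurs
tokens = sentence.split()

print(tokens)
print(len(tokens))
```

Notice that splitting on whitespace leaves punctuation attached to the neighboring word, so the list contains `'here,'` and `'cream.'` rather than `'here'` and `'cream'`.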

What about contractions? Is 'we're' a single word or two words: 'we' and 'are'? If our goal is to count how many times a given word occurs in the sentence, does 'we' occur in the sentence? Does the word 'summer' occur in our sentence?
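One way to see how the contraction question changes a count is to compare two regular expressions with `re.findall`. This is a rough sketch rather than a recommended tokenizer; both patterns and all variable names are assumptions chosen for illustration:

```python
import re

sentence = "Now that summer's here, we're going to visit the beach at Lake Michigan and eat ice cream."
lowered = sentence.lower()

# Illustrative pattern 1: letters only, so an apostrophe ends a token
split_tokens = re.findall(r"[a-z]+", lowered)

# Illustrative pattern 2: letters and apostrophes, so contractions stay whole
whole_tokens = re.findall(r"[a-z']+", lowered)

print(len(split_tokens), len(whole_tokens))

# Does 'summer' occur in the sentence? The answer depends on the pattern.
print("summer" in split_tokens, "summer" in whole_tokens)
print("we" in split_tokens, "we" in whole_tokens)
```

Under the first pattern, 'summer' and 'we' are found, but the leftover fragments 's' and 're' are counted as tokens too. Under the second pattern, the contractions stay intact, so a search for 'summer' or 'we' finds nothing.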

Verb conjugations pose yet another problem. Should the word 'going' be counted separately from 'go'? What about 'went'? From a computational linguistics perspective, we could 'stem' words, simply lopping off the 'ing' from 'going' to get 'go'. But that would pose some serious programming challenges for words like 'running', where the base form is 'run' instead of 'runn'. We would also run into issues with words like 'sing' and 'singing': the 'ing' should not be removed from the former at all, and should be removed only once from the latter. And how could we distinguish between verbs that are conjugated ('sings') and nouns that are plural ('wings')? Sometimes an -s ending marks a plural (fens) and other times it does not (lens).
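To make these stemming pitfalls concrete, here is a deliberately naive sketch. The function `naive_stem` is hypothetical, written only for this illustration, and it is not how real stemmers work:

```python
def naive_stem(word):
    """Hypothetical, deliberately naive stemmer: strip a final 'ing' or 's'."""
    if word.endswith("ing"):
        return word[:-3]
    if word.endswith("s"):
        return word[:-1]
    return word

# Words taken from the discussion above
for word in ["going", "running", "sing", "singing", "sings", "wings", "fens", "lens"]:
    print(word, "->", naive_stem(word))
```

The rule works for 'going', 'singing', 'sings', and 'fens', but it turns 'running' into 'runn', 'sing' into 's', and 'lens' into 'len'. It also treats the verb 'sings' and the plural noun 'wings' identically, because it looks only at the spelling.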

Tokenization, or segmenting a text into word chunks, is the first part of a Natural Language Processing pipeline, and it can have significant effects on the results of your analysis. We will focus on tokenizing a text to create a bag of words model. A bag of words approach will help us break down our text into one-, two-, and three-word constructions. The general name for these is n-grams.

An n-gram is a sequence of n items from a given sample of text or speech. Most often, this refers to a sequence of words, but it can also be used to analyze text at the level of syllables, letters, or phonemes. N-grams are often described by their length. For example, word n-grams might include:

* stock (a 1-gram, or unigram)
* vegetable stock (a 2-gram, or bigram)
* homemade vegetable stock (a 3-gram, or trigram)
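The examples above can be generated with a few lines of Python. This is a minimal sketch that assumes the text has already been split into a list of word tokens; the helper name `ngrams` is an illustrative choice, not a library function:

```python
def ngrams(tokens, n):
    """Return every run of n consecutive tokens, joined with spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "homemade vegetable stock".split()

print(ngrams(tokens, 1))  # unigrams
print(ngrams(tokens, 2))  # bigrams
print(ngrams(tokens, 3))  # trigrams
```

A text with three words yields three unigrams, two bigrams, and one trigram. If `n` is larger than the number of tokens, the function returns an empty list.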

A text analysis approach that looks only at unigrams at the word level will not be able to differentiate between the "stock" in "stock market" and "chicken stock." One of the most popular examples of text analysis with n-grams is the [Google N-Gram Viewer](https://books.google.com/ngrams).
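The difference shows up as soon as we count. In this sketch, the sample text is invented for illustration and the counting uses `collections.Counter` from the standard library:

```python
from collections import Counter

# Invented sample text containing both senses of 'stock'
text = "the stock market fell while the chicken stock simmered"
words = text.split()

# Unigram counts: both senses of 'stock' are merged into one entry
unigram_counts = Counter(words)

# Bigram counts: pair each word with the word that follows it
bigram_counts = Counter(zip(words, words[1:]))

print(unigram_counts["stock"])
print(bigram_counts[("stock", "market")])
print(bigram_counts[("chicken", "stock")])
```

The unigram count reports that 'stock' occurs twice without saying anything about context, while the bigram counts keep 'stock market' and 'chicken stock' as separate entries.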

Before we examine several Python tokenizers, it is important to understand more about how Python strings work. Tokenization is a process that segments strings into smaller chunks, and part of getting good results is understanding the decisions that tokenizers make. A deeper understanding of strings will help not just with tokenization, but with the whole text analysis pipeline.