# Word n-grams with zip and with NLTK

by Koenraad De Smedt at UiB

---
*N-grams* are sequences of *n* consecutive tokens (words or letters) in a text, such as the following word *trigrams* (3-grams) for *Je pense, donc je suis* (a famous line by René Descartes).

> ```
> Je pense donc
> pense donc je
> donc je suis
> ```

or, if we also consider punctuation to be tokens,

> ```
> Je pense ,
> pense , donc
> , donc je
> donc je suis
> je suis .
> ```

N-grams show words in their (limited) contexts. Computing all n-grams in a text or corpus can be useful for several NLP purposes, such as translation, error correction, finding collocations and document classification. For example, suppose you want to correct a misspelled word and you have two candidate corrections: one occurs in an n-gram attested in a large corpus, while the other does not occur in such a context. You might then prefer the candidate that occurs in the n-gram.
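
The correction idea can be sketched as follows. This is only a minimal illustration under stated assumptions: the tiny `corpus`, the candidate words and the helper name `pick_correction` are all made up for this example, and a realistic system would rely on n-gram counts from a large corpus.

```python
# Hypothetical toy corpus; a real application would use a large corpus.
corpus = ['Je', 'pense', 'donc', 'je', 'suis']

# All word bigrams attested in the corpus, built with plain indexing.
known_bigrams = {(corpus[i], corpus[i + 1]) for i in range(len(corpus) - 1)}

def pick_correction(previous_word, candidates):
    """Hypothetical helper: return the first candidate that forms an
    attested bigram with previous_word, or None if no candidate does."""
    for candidate in candidates:
        if (previous_word, candidate) in known_bigrams:
            return candidate
    return None

# Suppose 'pnse' was typed after 'Je' and could be corrected to 'panse' or 'pense'.
print(pick_correction('Je', ['panse', 'pense']))  # pense
```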

This notebook shows how we can compute n-grams as a list of tuples by using `zip`. Also the `ngrams` function in NLTK is demonstrated.

---

Let's start with a list of tokens, such as the following famous quote from René Descartes. Since we want to make trigrams (3-grams), we set `n` to `3`.

In [None]:
tokens = ['Je', 'pense', 'donc', 'je', 'suis']
n = 3

We can make partial copies of this list starting at item 0, 1, 2, ..., up to but not including *n*.


In [None]:
partlists = [tokens[i:] for i in range(n)]
partlists

If you write the lists from the previous result side by side as columns, as in the diagram below, and read each column from top to bottom, you see the trigrams appear. So how do we combine the first elements of each of these lists, then their second elements, etc.?

```
Je        pense     donc
↓         ↓         ↓
pense     donc      je
↓         ↓         ↓
donc      je        suis
```

The solution is to use `zip`. If we unpack `partlists` and give the contained lists as arguments to `zip`, we get a zip object, which is an iterator over tuples. These tuples are the word n-grams (in this case, trigrams). Unpacking the zip object into a list displays them.

In [None]:
tri = zip(*partlists)
[*tri]
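
Note that the partial lists have different lengths (5, 4 and 3 items here). `zip` stops as soon as its shortest input is exhausted, which is why no incomplete trigrams appear at the end. A small self-contained check, repeating the definitions from above:

```python
tokens = ['Je', 'pense', 'donc', 'je', 'suis']
n = 3
partlists = [tokens[i:] for i in range(n)]

# Each partial list is one item shorter than the previous one.
print([len(part) for part in partlists])  # [5, 4, 3]

# zip stops at the shortest list, so we get len(tokens) - n + 1 trigrams.
trigrams = list(zip(*partlists))
print(trigrams)
# [('Je', 'pense', 'donc'), ('pense', 'donc', 'je'), ('donc', 'je', 'suis')]
```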

We are now ready to define a function that operates on a list of tokens and produces n-gram tuples. It has an extra argument for *n* with a default of 3.

In [None]:
def n_grams(seq, n=3):
    """Return an iterator over the n-grams (tuples of n consecutive items) in seq."""
    return zip(*[seq[i:] for i in range(n)])

ng = n_grams(tokens)
ng

Unpack the zip items into a list if wanted.

In [None]:
[*ng]
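
One thing to keep in mind: a zip object is an iterator, so it can be consumed only once. The following sketch (repeating the definition of `n_grams` so that it runs on its own) shows that a second unpacking yields an empty list. Call `n_grams` again, or store the list, if the n-grams are needed more than once.

```python
def n_grams(seq, n=3):
    return zip(*[seq[i:] for i in range(n)])

tokens = ['Je', 'pense', 'donc', 'je', 'suis']
ng = n_grams(tokens)

first = [*ng]   # consumes the iterator
second = [*ng]  # nothing is left to unpack

print(len(first))  # 3
print(second)      # []
```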

Override the default in order to produce bigrams.

In [None]:
[*n_grams(tokens, n=2)]
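
As a cross-check, the same n-grams can be produced with an explicit sliding window over indices. The helper name `n_grams_by_index` below is introduced only for this comparison; it is not part of the notebook's own code or of any library.

```python
def n_grams(seq, n=3):
    return zip(*[seq[i:] for i in range(n)])

def n_grams_by_index(seq, n=3):
    """Hypothetical helper: take a window of length n at every start position."""
    return [tuple(seq[i:i + n]) for i in range(len(seq) - n + 1)]

tokens = ['Je', 'pense', 'donc', 'je', 'suis']

# Both approaches agree, also when n exceeds the number of tokens (no n-grams).
for n in (1, 2, 3, 5, 6):
    assert list(n_grams(tokens, n)) == n_grams_by_index(tokens, n)

print(n_grams_by_index(tokens, 2))
# [('Je', 'pense'), ('pense', 'donc'), ('donc', 'je'), ('je', 'suis')]
```

The `zip` version avoids index arithmetic, while the index version makes the count of `len(seq) - n + 1` n-grams explicit.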

---
## N-grams with NLTK

Now that you understand how n-grams can be computed, let's look at NLTK, which also provides a tool for n-grams. If we want to tokenize first, we need to import the `word_tokenize` function as well as the `ngrams` function.

In [None]:
import nltk
nltk.download('punkt')  # tokenizer models; recent NLTK releases may ask for 'punkt_tab' instead
from nltk import word_tokenize
from nltk.util import ngrams

In the `ngrams` function provided by NLTK, the second argument is obligatory; it has no default. The result is again a lazy iterator over tuples (a zip object or a generator, depending on the NLTK version).

In [None]:
cogito = 'Je pense, donc je suis fatigué...'
ng = ngrams(word_tokenize(cogito), n=3)
ng

Unpack the items into a list if wanted. The result has the same form as the one above, but note that the tokenizer also treats punctuation marks as tokens.

In [None]:
[*ng]

### User interactions to set parameters (optional)

Google Colab offers user interactions to set parameters. These are written as special comments starting with the `#@` characters. The following illustrates the use of a *slider* to choose the length of the n-grams. See the [forms example](https://colab.research.google.com/notebooks/forms.ipynb) for more possibilities. This may not work outside of Google Colab, but [ipywidgets](https://towardsdatascience.com/interactive-controls-for-jupyter-notebooks-f5c94829aee6) offers something similar.

In [None]:
#@title Choose the length of the n-grams
N_gram_length = 2 #@param {type:"slider", min:2, max:5, step:1}
ng2 = ngrams(word_tokenize(cogito), n=N_gram_length)
[*ng2][:10]

### Exercises

1.   Use a larger text, tokenize and compute the word bigrams.
2.   Compute the number of *different* bigrams in your result.
3.   Convert the n-gram tuples to strings with spaces, for instance, `'Je pense donc'`.
4.   Make a frequency list of the n-grams, for instance by using a Counter. Get the 5 most common n-grams.
5.   Compute *character* n-grams from a single text string, using `n_grams` or `ngrams`. Compare the result with the approach in the notebook on Ranges.
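
As a hint for exercises 2 to 4, here is a minimal sketch on a toy token list (the list itself is made up for illustration). It uses `collections.Counter` from the standard library; your own solution on a larger text may well look different.

```python
from collections import Counter

# Hypothetical toy data standing in for a larger tokenized text.
tokens = ['to', 'be', 'or', 'not', 'to', 'be']
bigrams = list(zip(tokens, tokens[1:]))

# Exercise 2: a set keeps only the different bigrams.
print(len(set(bigrams)))  # 4

# Exercise 3: join each tuple into a string with spaces.
strings = [' '.join(bigram) for bigram in bigrams]

# Exercise 4: count the strings and list the most frequent ones.
freq = Counter(strings)
print(freq.most_common(1))  # [('to be', 2)]
```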