# Deriving N-Grams from Text

Based on [N-Gram-Based Text Categorization: Categorizing Text With Python](https://web.archive.org/web/20240714175307/https://blog.alejandronolla.com/2013/05/20/n-gram-based-text-categorization-categorizing-text-with-python/) by Alejandro Nolla.

If you are unfamiliar with n-grams, I recommend reading the article [Language Detection using N-Grams](http://cloudmark.github.io/Language-Detection/) by Mark Galea before proceeding with this tutorial.

In [1]:
s = "Le temps est un grand maître, dit-on, le malheur est qu’il tue ses élèves."
s = s.lower()

## 1. Tokenization

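We will use `RegexpTokenizer`, which lets us supply our own regular expression so that accented letters and apostrophes stay inside tokens. Note that the character class below lists only the accented letters that occur in this sentence (`é`, `è`, `î`); other text may need more.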

In [2]:
from nltk.tokenize import RegexpTokenizer

tokenizer = RegexpTokenizer("[a-zA-Z'’`éèî]+")
s_tokenized = tokenizer.tokenize(s)
s_tokenized

['le',
 'temps',
 'est',
 'un',
 'grand',
 'maître',
 'dit',
 'on',
 'le',
 'malheur',
 'est',
 'qu’il',
 'tue',
 'ses',
 'élèves']
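
For text with other accented letters, a Unicode-aware pattern avoids listing each character by hand. The following is a minimal sketch using only the standard `re` module; the pattern and the `tokenize` helper are assumptions made for illustration and are not the tokenizer used in the rest of this tutorial. Unlike the pattern above, it only keeps apostrophes that sit between two letters.

```python
import re

# Assumption: a token is a run of Unicode letters, optionally joined by
# internal apostrophes (', ’ or `). [^\W\d_] matches any Unicode letter.
TOKEN_RE = re.compile(r"[^\W\d_]+(?:['’`][^\W\d_]+)*")


def tokenize(text):
    """Lowercase the text and return its word tokens (hypothetical helper)."""
    return TOKEN_RE.findall(text.lower())


tokens = tokenize("Où est le garçon, dit-on, qu’il a vu à Noël ?")
print(tokens)
```

On the example sentence `s` this sketch should give the same fifteen tokens as the `RegexpTokenizer` call above.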

In [3]:
from nltk.util import ngrams

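# Build the character 4-grams of each token. Padding both ends with '_'
# lets the n-grams record where a word starts and ends.
generated_4grams = [
    list(
        ngrams(
            word, 4, pad_left=True, pad_right=True, left_pad_symbol='_', right_pad_symbol='_'
        )
    )
    for word in s_tokenized
]
generated_4grams[:2]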

[[('_', '_', '_', 'l'),
  ('_', '_', 'l', 'e'),
  ('_', 'l', 'e', '_'),
  ('l', 'e', '_', '_'),
  ('e', '_', '_', '_')],
 [('_', '_', '_', 't'),
  ('_', '_', 't', 'e'),
  ('_', 't', 'e', 'm'),
  ('t', 'e', 'm', 'p'),
  ('e', 'm', 'p', 's'),
  ('m', 'p', 's', '_'),
  ('p', 's', '_', '_'),
  ('s', '_', '_', '_')]]

`generated_4grams` is a list of lists (one list per token), so we flatten it into a single list of 4-grams.

In [4]:
ng_list_4grams = [ngram for sublist in generated_4grams for ngram in sublist]
ng_list_4grams[:13]

[('_', '_', '_', 'l'),
 ('_', '_', 'l', 'e'),
 ('_', 'l', 'e', '_'),
 ('l', 'e', '_', '_'),
 ('e', '_', '_', '_'),
 ('_', '_', '_', 't'),
 ('_', '_', 't', 'e'),
 ('_', 't', 'e', 'm'),
 ('t', 'e', 'm', 'p'),
 ('e', 'm', 'p', 's'),
 ('m', 'p', 's', '_'),
 ('p', 's', '_', '_'),
 ('s', '_', '_', '_')]

## 2. Obtaining n-grams (`n` = 4)

Each 4-gram is still a tuple of characters, so we join every tuple into a single string.

In [5]:
ng_list_4grams = [''.join(ng) for ng in ng_list_4grams]
ng_list_4grams[:13]

['___l',
 '__le',
 '_le_',
 'le__',
 'e___',
 '___t',
 '__te',
 '_tem',
 'temp',
 'emps',
 'mps_',
 'ps__',
 's___']
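
To make the padding explicit, the padded 4-grams of a word can also be computed with plain string slicing. This is a minimal sketch, not how NLTK implements `ngrams`; the function name `padded_char_ngrams` and its parameters are assumptions chosen for illustration.

```python
def padded_char_ngrams(word, n, pad="_"):
    """Return the character n-grams of `word`, padded with n - 1 symbols per side."""
    padded = pad * (n - 1) + word + pad * (n - 1)
    # Slide a window of width n across the padded word.
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


print(padded_char_ngrams("le", 4))
# ['___l', '__le', '_le_', 'le__', 'e___']
```

A word of length `k` yields `k + n - 1` n-grams, which is why `'le'` gives five 4-grams and `'temps'` gives eight.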

## 3. Sorting the n-grams by frequency

We now sort the n-grams by frequency and keep the 300 most frequent ones. This cut-off was suggested in an early paper on the technique, [N-Gram-Based Text Categorization](https://www.let.rug.nl/vannoord/TextCat/textcat.pdf) by Cavnar and Trenkle. Our single sentence produces far fewer than 300 distinct 4-grams, so the cut-off has no effect here.

In [6]:
from operator import itemgetter

# Count how often each 4-gram occurs.
freq_4grams = {}

for ngram in ng_list_4grams:
    freq_4grams[ngram] = freq_4grams.get(ngram, 0) + 1

freq_4grams_sorted = sorted(freq_4grams.items(), key=itemgetter(1), reverse=True)[0:300]
freq_4grams_sorted[:5]

[('e___', 4), ('s___', 3), ('t___', 3), ('___l', 2), ('__le', 2)]
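
The counting and sorting can also be done with `collections.Counter` from the standard library. The sketch below uses a small made-up list (`sample_ngrams` is an assumption, not data from this tutorial) so that every count is distinct; we make no claim here about how `most_common` orders n-grams with equal counts.

```python
from collections import Counter

# Hypothetical input: a short list of 4-grams with distinct frequencies.
sample_ngrams = ['e___', 's___', 'e___', 't___', 'e___', 's___']

# most_common(k) returns the k most frequent items as (item, count) pairs.
top_ngrams = Counter(sample_ngrams).most_common(2)
print(top_ngrams)
# [('e___', 3), ('s___', 2)]
```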

## 4. Obtaining n-grams for multiple values of n

To get n-grams for `n` = 1, 2, 3, and 4 in one pass, we can use `everygrams`. Here we give it the cleaned sentence as a single string (punctuation removed) rather than the list of tokens.

In [7]:
from nltk import everygrams

s_clean = ' '.join(s_tokenized)
s_clean

'le temps est un grand maître dit on le malheur est qu’il tue ses élèves'

In [8]:
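def ngram_extractor(sent):
    # Turn each space into '_ _' so the words on both sides get a '_'
    # boundary marker, then drop every n-gram that spans the space
    # (or a newline) and the bare '_' unigram.
    return [
        ''.join(ng)
        for ng in everygrams(sent.replace(' ', '_ _'), 1, 4)
        if ' ' not in ng and '\n' not in ng and ng != ('_',)
    ]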


ngram_extractor(s_clean)[:25]

['l',
 'le',
 'le_',
 'e',
 'e_',
 '_t',
 '_te',
 '_tem',
 't',
 'te',
 'tem',
 'temp',
 'e',
 'em',
 'emp',
 'emps',
 'm',
 'mp',
 'mps',
 'mps_',
 'p',
 'ps',
 'ps_',
 's',
 's_']
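
To see what the extractor does without NLTK, the same idea can be sketched with string slicing: for each start position, take every substring of length 1 to 4, then filter out those that cross a space. The names `char_everygrams` and `extract` are assumptions for illustration, and the sketch simply reproduces the start-position ordering seen in the output above.

```python
def char_everygrams(text, min_len=1, max_len=4):
    """Return all substrings of length min_len..max_len, grouped by start position."""
    return [
        text[i:i + n]
        for i in range(len(text))
        for n in range(min_len, max_len + 1)
        if i + n <= len(text)
    ]


def extract(sent):
    """Mark word boundaries with '_' and drop n-grams that span a space."""
    marked = sent.replace(' ', '_ _')
    return [ng for ng in char_everygrams(marked) if ' ' not in ng and ng != '_']


print(extract('le temps')[:8])
# ['l', 'le', 'le_', 'e', 'e_', '_t', '_te', '_tem']
```

As in `ngram_extractor`, the first word gets no leading `_` and the last word gets no trailing `_`, because the markers are only added where spaces occur.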