As discussed earlier, one of the problems with Naive Bayes is that it ignores the order in which words appear and the context around them. One way to address this is to use 'n-grams', rather than single words, as features for training our classifier.

A 2-gram (also known as a bigram) is a sequence of two adjacent words. The bigrams of a sentence are all such pairs, taken in order. For example, in the sentence:

'Wolves run faster than dogs'

the bigrams would be: 'Wolves run', 'run faster', 'faster than', 'than dogs'.

Note that every word except the first and the last appears twice: once at the end of one bigram and once at the start of the next.

A 3-gram (or trigram) is a sequence of three adjacent words. For our sentence the trigrams would be: 'Wolves run faster', 'run faster than', 'faster than dogs'.

We can extend this to sequences of any length: sequences of 7 words are 7-grams, and sequences of 11 words are 11-grams. However, it's unusual to go beyond trigrams in most applications.
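To make the definition concrete, here is a minimal sketch in plain Python that builds n-grams by sliding a window of n words over the sentence. The helper name `make_ngrams` is made up for illustration and is not part of NLTK.

```python
def make_ngrams(tokens, n):
    """Return every run of n adjacent tokens as a tuple (hypothetical helper)."""
    # The window starts at each position that still leaves room for n tokens
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

words = "Wolves run faster than dogs".split()

bigrams = make_ngrams(words, 2)
trigrams = make_ngrams(words, 3)

print(bigrams)
print(trigrams)
# A sentence of N words has N - n + 1 n-grams (when n <= N)
print(len(words), len(bigrams), len(trigrams))
```

For the five-word example sentence this gives four bigrams and three trigrams, matching the lists written out above.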

### Bigrams in NLTK

NLTK contains a function bigrams() (called as nltk.bigrams()) which returns an iterator of bigrams.

However, if you want a list of bigrams, you need to pass the iterator to list(). The function also expects a sequence of items to generate bigrams from, so you have to split the text into words before passing it:

In [1]:
import nltk
from nltk import ngrams

In [2]:
text = "The Force Awakens brings back the Old Trilogy 's heart , humor , mystery , and fun ."

In [11]:
bigram_list = list(nltk.bigrams(text.split()))

To print them out separated with commas, join each bigram into a single string and unpack the results into print():

In [12]:
print(*map(' '.join, bigram_list), sep=', ') 

The Force, Force Awakens, Awakens brings, brings back, back the, the Old, Old Trilogy, Trilogy 's, 's heart, heart ,, , humor, humor ,, , mystery, mystery ,, , and, and fun, fun .


In [13]:
# Alternatively, instead of split(), you can tokenize the text before creating the bigrams.
# word_tokenize requires NLTK's Punkt tokenizer data to be downloaded first.
from nltk.tokenize import word_tokenize
tokens = word_tokenize(text)
blist = list(nltk.bigrams(tokens))
blist

[('The', 'Force'),
 ('Force', 'Awakens'),
 ('Awakens', 'brings'),
 ('brings', 'back'),
 ('back', 'the'),
 ('the', 'Old'),
 ('Old', 'Trilogy'),
 ('Trilogy', "'s"),
 ("'s", 'heart'),
 ('heart', ','),
 (',', 'humor'),
 ('humor', ','),
 (',', 'mystery'),
 ('mystery', ','),
 (',', 'and'),
 ('and', 'fun'),
 ('fun', '.')]

How would you print out 3-grams? NLTK provides nltk.trigrams(), which works the same way as nltk.bigrams():

In [14]:
trigram_list = list(nltk.trigrams(text.split()))
print(*map(' '.join, trigram_list), sep=', ') 

The Force Awakens, Force Awakens brings, Awakens brings back, brings back the, back the Old, the Old Trilogy, Old Trilogy 's, Trilogy 's heart, 's heart ,, heart , humor, , humor ,, humor , mystery, , mystery ,, mystery , and, , and fun, and fun .


What if you want n-grams for a different value of n, e.g. 6-grams? The general ngrams() function takes n as its second argument:

In [15]:
n = 6
# ngrams() returns an iterator, not a list, so we loop over it directly
sixgrams = ngrams(text.split(), n)
for gram in sixgrams:
    print(gram)

('The', 'Force', 'Awakens', 'brings', 'back', 'the')
('Force', 'Awakens', 'brings', 'back', 'the', 'Old')
('Awakens', 'brings', 'back', 'the', 'Old', 'Trilogy')
('brings', 'back', 'the', 'Old', 'Trilogy', "'s")
('back', 'the', 'Old', 'Trilogy', "'s", 'heart')
('the', 'Old', 'Trilogy', "'s", 'heart', ',')
('Old', 'Trilogy', "'s", 'heart', ',', 'humor')
('Trilogy', "'s", 'heart', ',', 'humor', ',')
("'s", 'heart', ',', 'humor', ',', 'mystery')
('heart', ',', 'humor', ',', 'mystery', ',')
(',', 'humor', ',', 'mystery', ',', 'and')
('humor', ',', 'mystery', ',', 'and', 'fun')
(',', 'mystery', ',', 'and', 'fun', '.')
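
Returning to the motivation at the start of this section, the sketch below shows one way n-grams could be turned into classifier features. It assumes the classifier accepts a dictionary of feature names per document, which is the format NLTK's NaiveBayesClassifier takes. The helper `ngram_features` and its feature-name scheme are made up for illustration, and it is written in plain Python so it runs without NLTK.

```python
def ngram_features(tokens, n=2):
    """Map a token list to a {feature_name: True} dict (hypothetical helper)."""
    features = {}
    # Single words, as in a plain bag-of-words model
    for word in tokens:
        features["word({})".format(word)] = True
    # Runs of n adjacent words, which keep some local word order
    for i in range(len(tokens) - n + 1):
        gram = " ".join(tokens[i:i + n])
        features["{}-gram({})".format(n, gram)] = True
    return features

example = ngram_features("Wolves run faster than dogs".split())
for name in example:
    print(name)
```

Keeping the single-word features alongside the bigrams is a design choice in this sketch: bigrams are much sparser than single words, so the word features still give the classifier something to work with when a bigram was never seen in training.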
