# Ngrams

The ngrams() function returns a generator in which each item is a tuple of n tokens: 1 word for unigrams, 2 for bigrams, etc. A unigram is therefore a one-element tuple, which looks strange, so to extract the word itself you need unigram[0].

First we do the imports.

In [1]:
from nltk.util import ngrams

In [2]:
text = "Mary had a little lamb ."
tokens = text.split()
unigrams = ngrams(tokens, 1)
for unigram in unigrams:
    print(unigram)

('Mary',)
('had',)
('a',)
('little',)
('lamb',)
('.',)
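
Because every n-gram is a tuple, a unigram needs one more step before you have a plain string. Here is a minimal sketch of two ways to do that, using hand-written tuples in place of real ngrams() output (the variable names are only illustrative):

```python
# Hand-written stand-ins for the tuples that ngrams(tokens, 1) yields.
unigrams = [('Mary',), ('had',), ('a',)]

# Option 1: index into the one-element tuple.
words_by_index = [unigram[0] for unigram in unigrams]

# Option 2: unpack the one-element tuple in the loop header.
words_by_unpacking = [word for (word,) in unigrams]

print(words_by_index)      # ['Mary', 'had', 'a']
print(words_by_unpacking)  # ['Mary', 'had', 'a']
```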


In [3]:
bigrams = ngrams(tokens, 2)
for bigram in bigrams:
    print(bigram)

# a generator can only be consumed once, so regenerate it to iterate again
print("reconstruct the bigrams")
bigrams = ngrams(tokens, 2)
for bigram in bigrams:
    print(bigram[0] + ' ' + bigram[1])


('Mary', 'had')
('had', 'a')
('a', 'little')
('little', 'lamb')
('lamb', '.')
reconstruct the bigrams
Mary had
had a
a little
little lamb
lamb .
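
The need to regenerate comes from the one-pass nature of generators. The sketch below illustrates it with zip(), used here only as a stand-in for ngrams(tokens, 2) because it also returns a one-pass iterator. If you need several passes, one option is to materialize the items in a list first.

```python
tokens = "Mary had a little lamb .".split()

# Stand-in for ngrams(tokens, 2): zip() also returns a one-pass iterator.
bigrams = zip(tokens, tokens[1:])

first_pass = [' '.join(bigram) for bigram in bigrams]
second_pass = [' '.join(bigram) for bigram in bigrams]  # already exhausted

print(len(first_pass))   # 5
print(len(second_pass))  # 0

# Materializing once into a list allows repeated passes.
bigram_list = list(zip(tokens, tokens[1:]))
pass_a = [' '.join(bigram) for bigram in bigram_list]
pass_b = [' '.join(bigram) for bigram in bigram_list]
print(pass_a == pass_b)  # True
```

The trade-off is memory: the list holds every bigram at once, which is fine for a sentence but may not be for a very large corpus.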


# iterators and generators

In the examples above we printed the results of NLTK's ngrams() function. Those results are actually generator objects, which brings us to a discussion of Python iterators and generators.

## iterators

We can view iterables as any Python type that can be used in a for loop. Built-in iterables include lists, tuples, dicts and sets. They work in for loops because they implement the __iter__ method. This method returns an iterator: an object that has a __next__ method, which hands back one item at a time.

In [4]:
for fruit in ['apple', 'banana', 'kiwi']:
    print("I want a", fruit)

I want a apple
I want a banana
I want a kiwi


The above code worked because, on each pass, the for loop asks the list's iterator for its **next** item and assigns it to our variable *fruit*.
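
What the for loop does behind the scenes can be sketched by hand with the built-in iter() and next() functions. The variable names are only illustrative.

```python
fruits = ['apple', 'banana', 'kiwi']

fruit_iterator = iter(fruits)    # calls fruits.__iter__()
print(next(fruit_iterator))      # apple
print(next(fruit_iterator))      # banana
print(next(fruit_iterator))      # kiwi

# A fourth call raises StopIteration, which is how a for loop knows to stop.
try:
    next(fruit_iterator)
    exhausted = False
except StopIteration:
    exhausted = True
print(exhausted)                 # True

# The list is an iterable, not an iterator: it has __iter__ but no __next__.
print(hasattr(fruits, '__next__'))          # False
print(hasattr(fruit_iterator, '__next__'))  # True
```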

### generators

Python generators are also iterators as shown below:

In [5]:
for ngram in ngrams('apple banana cucumber dill'.split(), 3):
    print(ngram)

('apple', 'banana', 'cucumber')
('banana', 'cucumber', 'dill')


### writing our own generators

In the next 2 cells we see a Fibonacci function first done the regular way and second with a generator.

When a generator function is called, its body is not executed; instead, a generator-iterator object is returned, and the body runs only as that object is iterated. Python knows the function is a generator because its body uses the **yield** keyword.


In [6]:
def fib(limit):
    # 'limit' avoids shadowing the built-in max()
    numbers = []
    a, b = 0, 1
    while a < limit:
        numbers.append(a)
        a, b = b, a + b
    return numbers

fib(5)

[0, 1, 1, 2, 3]

In [7]:
def fib_gen(limit):
    # 'limit' avoids shadowing the built-in max()
    a, b = 0, 1
    while a < limit:
        yield a
        a, b = b, a + b
        
fib_gen(5)

<generator object fib_gen at 0x7f8fb43df200>

In [8]:
for item in fib_gen(5):
    print(item)

0
1
1
2
3
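
To see that the body of a generator function really does not run until the first item is requested, here is a small sketch. The function fib_gen_logged and its log argument are hypothetical additions for illustration; they record when the body starts.

```python
def fib_gen_logged(limit, log):
    """Like fib_gen, but records when the body actually runs."""
    log.append('body started')
    a, b = 0, 1
    while a < limit:
        yield a
        a, b = b, a + b

log = []
gen = fib_gen_logged(5, log)
print(log)        # [] because nothing has run yet

first = next(gen)
print(first)      # 0
print(log)        # ['body started']

rest = list(gen)  # resume after the first yield and drain the generator
print(rest)       # [1, 1, 2, 3]
```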


This may look strange, so why use generators? They are useful when you are processing a **huge** amount of data. A generator produces one item at a time instead of building the whole collection first, so it takes up less memory.
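
A rough way to see the memory difference is to compare sys.getsizeof() for a list and for an equivalent generator expression. This is only a sketch: getsizeof() reports the size of the container object itself, not of the items it refers to, and exact numbers vary by Python version, so none are shown.

```python
import sys

n = 100_000
squares_list = [i * i for i in range(n)]   # all values stored at once
squares_gen = (i * i for i in range(n))    # values produced on demand

list_size = sys.getsizeof(squares_list)    # grows with n
gen_size = sys.getsizeof(squares_gen)      # small and independent of n

print(gen_size < list_size)                   # True
print(sum(squares_gen) == sum(squares_list))  # True
```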

### pickle

What if you need to save a Python object to disk and read it back in later? One way would be to write it out to disk item by item. When you read it back in, you may then have to do some processing, for example to identify the key and the value of each dict entry.

A better way is to use pickle. The pickle module can serialize and deserialize a Python object. The pickling process converts the object to a byte stream and unpickling does the reverse. 

In [9]:
import pickle

my_dict = {'a':5, 'b':7, 'c':2}
with open('example.pickle', 'wb') as handle:
    pickle.dump(my_dict, handle)
    
with open('example.pickle', 'rb') as handle:
    new_dict = pickle.load(handle)
    
print(new_dict)

{'a': 5, 'b': 7, 'c': 2}
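
Pickle also preserves types that are awkward to store as plain text, such as tuple keys. The sketch below uses pickle.dumps() and pickle.loads(), the in-memory counterparts of dump() and load(), on a hypothetical dict of bigram counts.

```python
import pickle

# Hypothetical bigram counts keyed by tuples, as ngrams() would yield them.
bigram_counts = {('Mary', 'had'): 1, ('had', 'a'): 1, ('little', 'lamb'): 2}

payload = pickle.dumps(bigram_counts)   # bytes, same format pickle.dump writes
restored = pickle.loads(payload)

print(type(payload).__name__)           # bytes
print(restored == bigram_counts)        # True
print(restored[('little', 'lamb')])     # 2
```

One caution: unpickling can execute arbitrary code, so only load pickle files that you created yourself or otherwise trust.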


### Preprocessing

Let's look some more at creating bigram and unigram iterator objects.

In [10]:
import nltk
from nltk.util import ngrams

nltk.corpus.gutenberg.fileids()

['austen-emma.txt',
 'austen-persuasion.txt',
 'austen-sense.txt',
 'bible-kjv.txt',
 'blake-poems.txt',
 'bryant-stories.txt',
 'burgess-busterbrown.txt',
 'carroll-alice.txt',
 'chesterton-ball.txt',
 'chesterton-brown.txt',
 'chesterton-thursday.txt',
 'edgeworth-parents.txt',
 'melville-moby_dick.txt',
 'milton-paradise.txt',
 'shakespeare-caesar.txt',
 'shakespeare-hamlet.txt',
 'shakespeare-macbeth.txt',
 'whitman-leaves.txt']

In [11]:
emma_raw = nltk.corpus.gutenberg.raw("austen-emma.txt")
emma_raw[:500]

"[Emma by Jane Austen 1816]\n\nVOLUME I\n\nCHAPTER I\n\n\nEmma Woodhouse, handsome, clever, and rich, with a comfortable home\nand happy disposition, seemed to unite some of the best blessings\nof existence; and had lived nearly twenty-one years in the world\nwith very little to distress or vex her.\n\nShe was the youngest of the two daughters of a most affectionate,\nindulgent father; and had, in consequence of her sister's marriage,\nbeen mistress of his house from a very early period.  Her mother\nhad died t"

In [12]:
emma_text = emma_raw.replace('\n', ' ')
emma_text[:500]

"[Emma by Jane Austen 1816]  VOLUME I  CHAPTER I   Emma Woodhouse, handsome, clever, and rich, with a comfortable home and happy disposition, seemed to unite some of the best blessings of existence; and had lived nearly twenty-one years in the world with very little to distress or vex her.  She was the youngest of the two daughters of a most affectionate, indulgent father; and had, in consequence of her sister's marriage, been mistress of his house from a very early period.  Her mother had died t"

In [13]:
from nltk.tokenize import word_tokenize
emma_tokens = word_tokenize(emma_text)
emma_tokens[40:50]

['and',
 'had',
 'lived',
 'nearly',
 'twenty-one',
 'years',
 'in',
 'the',
 'world',
 'with']

### create ngrams

The ngrams(tokens, n) function takes two arguments. The first is a list of tokens, the second is n for the ngram desired.
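
To make the idea concrete, here is a minimal pure-Python sketch of what producing n-grams involves. The helper simple_ngrams is hypothetical and is not how NLTK implements ngrams(); it only illustrates the sliding window over the token list.

```python
def simple_ngrams(tokens, n):
    """Yield each run of n consecutive tokens as a tuple (illustrative only)."""
    for start in range(len(tokens) - n + 1):
        yield tuple(tokens[start:start + n])

tokens = 'apple banana cucumber dill'.split()
print(list(simple_ngrams(tokens, 3)))
# [('apple', 'banana', 'cucumber'), ('banana', 'cucumber', 'dill')]

# Asking for more tokens per n-gram than the list holds yields nothing.
print(list(simple_ngrams(tokens, 5)))
# []
```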

In [14]:
unigrams = ngrams(emma_tokens, 1)
count = 1
for unigram in unigrams:
    print(unigram)
    count += 1
    if count > 5:
        break


('[',)
('Emma',)
('by',)
('Jane',)
('Austen',)


In [15]:
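unigram_dict = {}
# Regenerate the unigrams: the generator was partly consumed by the loop in In [14].
# Caution: str.count() counts substring matches in the raw text, not whole tokens.
for unigram in set(ngrams(emma_tokens, 1)):
    unigram_dict[unigram[0]] = emma_text.count(unigram[0])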
    
count = 1
for uni in unigram_dict.keys():
    print(uni, '->', unigram_dict[uni])
    count += 1
    if count > 5:
        break

apple-dumplings -> 1
deposit -> 2
lists -> 2
agitation -> 16
inconstancy -> 2


In [16]:
bigrams = ngrams(emma_tokens, 2)
count = 1
for bigram in bigrams:
    print(bigram)
    count += 1
    if count > 10:
        break

('[', 'Emma')
('Emma', 'by')
('by', 'Jane')
('Jane', 'Austen')
('Austen', '1816')
('1816', ']')
(']', 'VOLUME')
('VOLUME', 'I')
('I', 'CHAPTER')
('CHAPTER', 'I')


In [17]:
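bigram_dict = {}
# Regenerate the bigrams: the generator was partly consumed by the loop in In [16].
for bigram in set(ngrams(emma_tokens, 2)):
    # set() already removed duplicates, so no membership check is needed
    bi = bigram[0] + ' ' + bigram[1]
    # Caution: str.count() counts substring matches in the raw text, not token pairs.
    bigram_dict[bi] = emma_text.count(bi)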
    
count = 1
for bi in bigram_dict.keys():
    print(bi, '->', bigram_dict[bi])
    count += 1
    if count > 10:
        break

, overcome -> 1
but correctly -> 1
beyond expression -> 1
consider the -> 2
the wind -> 17
CHAPTER XVI -> 9
an egg -> 2
months later -> 1
-- Still -> 2
plague you -> 1


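As you can see, building these dictionaries is pretty slow. Most of the time goes to emma_text.count(), which rescans the entire text once for every distinct n-gram; ngrams() itself is lazy and cheap. In our homework we will therefore build the dictionaries in one program and pickle them, then unpickle them in a second program.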

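Let's see the most common bigrams. Keep in mind that these are substring counts from emma_text.count(), so ', a' also matches the start of ', an' and ', and', which is why those three head the list.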

In [18]:
count = 1
for bi in sorted(bigram_dict, key=bigram_dict.get, reverse=True):
    print(bi, '->', bigram_dict[bi])
    count += 1
    if count > 20:
        break


, a -> 2547
, an -> 1909
, and -> 1880
; a -> 918
; an -> 878
; and -> 867
of the -> 672
to be -> 649
, I -> 570
in the -> 505
he had -> 499
he was -> 486
as a -> 474
. We -> 441
; but -> 427
. Weston -> 426
and a -> 397
I am -> 395
, the -> 387
. Elton -> 381
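
If the counts are held in a collections.Counter, the sort-and-break loop can be replaced by most_common(). The counts below are hypothetical and are not taken from the Emma output.

```python
from collections import Counter

# Hypothetical counts, for illustration only.
bigram_counts = Counter({'of the': 7, 'to be': 5, 'in the': 4, 'I am': 2})

top_three = bigram_counts.most_common(3)
for bigram, count in top_three:
    print(bigram, '->', count)
# of the -> 7
# to be -> 5
# in the -> 4
```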
