# Collocations

Collocations are words that appear together to form a meaning greater than the sum of their parts. For example, the collocation *disk drive* means more than its individual words can convey.

In [1]:
import nltk
from nltk.book import *
text6


*** Introductory Examples for the NLTK Book ***
Loading text1, ..., text9 and sent1, ..., sent9
Type the name of the text or sentence to view it.
Type: 'texts()' or 'sents()' to list the materials.
text1: Moby Dick by Herman Melville 1851
text2: Sense and Sensibility by Jane Austen 1811
text3: The Book of Genesis
text4: Inaugural Address Corpus
text5: Chat Corpus
text6: Monty Python and the Holy Grail
text7: Wall Street Journal
text8: Personals Corpus
text9: The Man Who Was Thursday by G . K . Chesterton 1908


<Text: Monty Python and the Holy Grail>

In [2]:
# get collocations
text6.collocations()

BLACK KNIGHT; clop clop; HEAD KNIGHT; mumble mumble; Holy Grail;
squeak squeak; FRENCH GUARD; saw saw; Sir Robin; Run away; CARTOON
CHARACTER; King Arthur; Iesu domine; Pie Iesu; DEAD PERSON; Round
Table; clap clap; OLD MAN; dramatic chord; dona eis


The code above calls the collocations() method on an NLTK Text object. We don't have to convert our text to an NLTK Text object to find collocations, as shown below. You can score candidate bigrams with various metrics, as shown on the [NLTK how-to page](http://www.nltk.org/howto/collocations.html).

In [3]:
from nltk.collocations import BigramCollocationFinder
from nltk.metrics import BigramAssocMeasures

def bigram_word_feats(words, score_fn=BigramAssocMeasures.chi_sq, n=200):
    # from_words accepts any iterable of tokens, not just an nltk Text
    bigram_finder = BigramCollocationFinder.from_words(words)
    # keep the n highest-scoring bigrams under the chosen association measure
    bigrams = bigram_finder.nbest(score_fn, n)
    # map each selected bigram to True
    return {bigram: True for bigram in bigrams}

bigram_word_feats(text6)

{('ALL', 'HEADS'): True,
 ('Anybody', 'armed'): True,
 ('Attila', 'raised'): True,
 ('Badon', 'Hill'): True,
 ('Bon', 'magne'): True,
 ('Chapter', 'Two'): True,
 ('Clark', 'Gable'): True,
 ('DEAD', 'PERSON'): True,
 ('Divine', 'Providence'): True,
 ('Eternal', 'Peril'): True,
 ('Fetchez', 'la'): True,
 ('Great', 'scott'): True,
 ('Hand', 'Grenade'): True,
 ('Iesu', 'domine'): True,
 ('Most', 'kind'): True,
 ('Olfin', 'Bedwere'): True,
 ('PRINCESS', 'LUCKY'): True,
 ('Pie', 'Iesu'): True,
 ('Recently', 'Said'): True,
 ('Round', 'Table'): True,
 ('Tall', 'Tower'): True,
 ('Thou', 'hast'): True,
 ('Thy', 'mer'): True,
 ('Til', 'Recently'): True,
 ('Too', 'late'): True,
 ('Uther', 'Pendragon'): True,
 ('absolutely', 'necessary'): True,
 ('advancing', 'behaviour'): True,
 ('aptly', 'named'): True,
 ('aquatic', 'ceremony'): True,
 ('autonomous', 'collective'): True,
 ('basic', 'medical'): True,
 ('became', 'convinced'): True,
 ('binding', 'sense'): True,
 ('bowels', 'unplugged'): True,
 ('br
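
The `score_fn` parameter above accepts other association measures from `BigramAssocMeasures`, not only chi-squared. As a minimal sketch, the helper below ranks bigrams by PMI after discarding bigrams seen fewer than `min_freq` times, since PMI tends to favor rare pairs. The name `top_bigrams`, its defaults, and the toy sentence are assumptions made for illustration.

```python
from nltk.collocations import BigramCollocationFinder
from nltk.metrics import BigramAssocMeasures


def top_bigrams(words, score_fn=BigramAssocMeasures.pmi, min_freq=2, n=10):
    """Return up to n bigrams ranked by score_fn (hypothetical helper)."""
    finder = BigramCollocationFinder.from_words(words)
    # Drop bigrams that occur fewer than min_freq times before scoring
    finder.apply_freq_filter(min_freq)
    return finder.nbest(score_fn, n)


# Toy token list (made up for illustration, not taken from text6)
words = "the holy grail is the holy grail and the knight seeks the holy grail".split()
print(top_bigrams(words))
```

Passing `BigramAssocMeasures.likelihood_ratio` or `BigramAssocMeasures.chi_sq` as `score_fn` would rank the same candidates under a different metric.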

Collocation identification can be a useful component of larger NLP applications such as sentiment analysis, machine translation, and natural language generation.

### Mutual information

Let's calculate mutual information for a couple of the collocations that NLTK identified. (Note that NLTK doesn't necessarily use this metric; the function above scores bigrams with chi-squared.)

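Recall that the (pointwise) mutual information of two words x and y is the log of the ratio between their joint probability and the product of their individual probabilities:

PMI(x, y) = log2( P(x,y) / [P(x) * P(y)] )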

In [12]:
text = ' '.join(text6.tokens)
text[:50]

'SCENE 1 : [ wind ] [ clop clop clop ] KING ARTHUR '

In [13]:
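import math
# Denominator is the vocabulary size (distinct tokens), not the total token count.
# This shifts every PMI value by the same constant, so comparisons between pairs still hold.
vocab = len(set(text6))
# str.count matches substrings of the space-joined text, not whole tokens
hg = text.count('Holy Grail')/vocab
print("p(Holy Grail) = ",hg )
h = text.count('Holy')/vocab
print("p(Holy) = ", h)
g = text.count('Grail')/vocab
print('p(Grail) = ', g)
pmi = math.log2(hg / (h * g))
print('pmi = ', pmi)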


p(Holy Grail) =  0.008771929824561403
p(Holy) =  0.0110803324099723
p(Grail) =  0.01569713758079409
pmi =  5.6563196990804165


In [14]:
ot = text.count('of the')/vocab
print("p(of the) = ",ot )
o = text.count('of')/vocab
print("p(of) = ", o)
t = text.count('the ')/vocab # space so it doesn't capture 'their' etc.
print('p(the) = ', t)
pmi = math.log2(ot / (o * t))
print('pmi = ', pmi)


p(of the) =  0.018928901200369344
p(of) =  0.08448753462603878
p(the) =  0.13804247460757155
pmi =  0.6986680197442634


We see that 'Holy Grail' has a much higher mutual information than 'of the', indicating that it is more likely to be a collocation.
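
The counts above come from substring matches on the joined text, divided by the vocabulary size. A token-level variant might look like the sketch below, which counts whole tokens and adjacent token pairs and divides by the number of tokens and bigrams. The function name `token_pmi` and the toy sentence are hypothetical, chosen only to illustrate the calculation.

```python
import math
from collections import Counter


def token_pmi(tokens, w1, w2):
    """PMI (base 2) of the adjacent pair (w1, w2), estimated from a token list."""
    n = len(tokens)
    unigrams = Counter(tokens)
    # Adjacent pairs; a list of n tokens contains n - 1 bigrams
    bigrams = Counter(zip(tokens, tokens[1:]))
    p_xy = bigrams[(w1, w2)] / max(n - 1, 1)
    p_x = unigrams[w1] / n
    p_y = unigrams[w2] / n
    if p_xy == 0:
        # The pair never occurs together, so the log is undefined
        return float("-inf")
    return math.log2(p_xy / (p_x * p_y))


# Toy token list (made up for illustration)
tokens = "the holy grail of the king of the land the holy grail".split()
print("pmi(holy, grail) =", token_pmi(tokens, "holy", "grail"))
print("pmi(of, the)     =", token_pmi(tokens, "of", "the"))
```

With a real corpus, the same function could be called as `token_pmi(text6.tokens, 'Holy', 'Grail')`; the resulting values will differ from those printed above because the denominators differ.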