# Collocation

## Data Preparation

The Brown corpus, which ships with NLTK, is used as the example.

In [1]:
import nltk
from nltk.corpus import brown
nltk.download('brown')

# Get all words in the Brown Corpus.
tokens = brown.words()
print("The first 20 tokens: ")
print(tokens[:20])

print("The number of words in the Brown corpus: %d" % len(tokens))

[nltk_data] Downloading package brown to /Users/hhhuang/nltk_data...
[nltk_data]   Package brown is already up-to-date!


The first 20 tokens: 
['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', 'Friday', 'an', 'investigation', 'of', "Atlanta's", 'recent', 'primary', 'election', 'produced', '``', 'no', 'evidence', "''", 'that']
The number of words in the Brown corpus: 1161192


### Look up the Brown corpus at different linguistic units

In [3]:
print("The first five sentences:")
print(brown.sents()[0:5])
print("Number of sentences: %d" % len(brown.sents()))

print("\n")
# The Brown corpus does not provide the paragraph information. 
print("The first five paragraphs:")
print(brown.paras()[0:5])
print("Number of paragraphs: %d" % len(brown.paras()))

The first five sentences:
[['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', 'Friday', 'an', 'investigation', 'of', "Atlanta's", 'recent', 'primary', 'election', 'produced', '``', 'no', 'evidence', "''", 'that', 'any', 'irregularities', 'took', 'place', '.'], ['The', 'jury', 'further', 'said', 'in', 'term-end', 'presentments', 'that', 'the', 'City', 'Executive', 'Committee', ',', 'which', 'had', 'over-all', 'charge', 'of', 'the', 'election', ',', '``', 'deserves', 'the', 'praise', 'and', 'thanks', 'of', 'the', 'City', 'of', 'Atlanta', "''", 'for', 'the', 'manner', 'in', 'which', 'the', 'election', 'was', 'conducted', '.'], ['The', 'September-October', 'term', 'jury', 'had', 'been', 'charged', 'by', 'Fulton', 'Superior', 'Court', 'Judge', 'Durwood', 'Pye', 'to', 'investigate', 'reports', 'of', 'possible', '``', 'irregularities', "''", 'in', 'the', 'hard-fought', 'primary', 'which', 'was', 'won', 'by', 'Mayor-nominate', 'Ivan', 'Allen', 'Jr.', '.'], ['``', 'Only', 'a', 'relative', 'ha

### Tagged Tokens

The Brown corpus also provides a part-of-speech tag for every word.

In [4]:
# Get all (word, tag) pairs in the Brown corpus.
print("The first 20 tokens with tag: ")
print(brown.tagged_words()[:20])


The first 20 tokens with tag: 
[('The', 'AT'), ('Fulton', 'NP-TL'), ('County', 'NN-TL'), ('Grand', 'JJ-TL'), ('Jury', 'NN-TL'), ('said', 'VBD'), ('Friday', 'NR'), ('an', 'AT'), ('investigation', 'NN'), ('of', 'IN'), ("Atlanta's", 'NP$'), ('recent', 'JJ'), ('primary', 'NN'), ('election', 'NN'), ('produced', 'VBD'), ('``', '``'), ('no', 'AT'), ('evidence', 'NN'), ("''", "''"), ('that', 'CS')]


### Fundamental text processing (Last week)

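Alternatively, you can run the part-of-speech tagger yourself, following the procedure introduced last week. Note that `nltk.pos_tag` produces Penn Treebank tags (e.g. `DT`, `NNP`), which differ from the Brown tags shown above (e.g. `AT`, `NP-TL`).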

In [5]:
import nltk
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

print("The first sentence in the Brown corpus with POS tagged:")
print(nltk.pos_tag(brown.sents()[0]))

The first sentence in the Brown corpus with POS tagged:


[nltk_data] Downloading package punkt to /Users/hhhuang/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /Users/hhhuang/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


[('The', 'DT'), ('Fulton', 'NNP'), ('County', 'NNP'), ('Grand', 'NNP'), ('Jury', 'NNP'), ('said', 'VBD'), ('Friday', 'NNP'), ('an', 'DT'), ('investigation', 'NN'), ('of', 'IN'), ("Atlanta's", 'NNP'), ('recent', 'JJ'), ('primary', 'JJ'), ('election', 'NN'), ('produced', 'VBD'), ('``', '``'), ('no', 'DT'), ('evidence', 'NN'), ("''", "''"), ('that', 'IN'), ('any', 'DT'), ('irregularities', 'NNS'), ('took', 'VBD'), ('place', 'NN'), ('.', '.')]


## Most Frequent Tokens

In [6]:
from collections import Counter
word_counts = Counter(tokens)
for w, c in word_counts.most_common(20):
    print("%s\t%d" % (w, c))

the	62713
,	58334
.	49346
of	36080
and	27915
to	25732
a	21881
in	19536
that	10237
is	10011
was	9777
for	8841
``	8837
''	8789
The	7258
with	7012
it	6723
as	6706
he	6566
his	6466


## Most Frequent Tokens excluding Punctuation Marks

In [7]:
from collections import Counter
word_counts = Counter(tokens)
for w, c in word_counts.most_common(20):
    if w.isalpha():
        print("%s\t%d" % (w, c))

the	62713
of	36080
and	27915
to	25732
a	21881
in	19536
that	10237
is	10011
was	9777
for	8841
The	7258
with	7012
it	6723
as	6706
he	6566
his	6466
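
The cell above takes the 20 most frequent tokens first and drops punctuation afterwards, which is why only 16 rows are printed. A minimal sketch of the alternative, filtering before ranking so that exactly `k` words come back, is shown here on a toy token list; `top_alpha_words` and `toy_tokens` are hypothetical names, not part of the original notebook.

```python
from collections import Counter

def top_alpha_words(tokens, k):
    """Return the k most frequent purely alphabetic tokens."""
    # Filter first, then rank, so punctuation never takes up a slot.
    counts = Counter(t for t in tokens if t.isalpha())
    return counts.most_common(k)

# Toy data standing in for brown.words().
toy_tokens = ["the", ",", "cat", ".", "the", ",", "dog", ".", "the", "cat"]
print(top_alpha_words(toy_tokens, 2))
```

Applied to the Brown tokens, the same function would presumably be called as `top_alpha_words(tokens, 20)`.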


## Most Frequent Collocations

In [8]:
word_pair_counts = Counter()
for i in range(len(tokens) - 1):
    (w1, w2) = (tokens[i], tokens[i + 1])
    if w1.isalpha() and w2.isalpha():
        word_pair_counts[(w1, w2)] += 1
    
for pair, c in word_pair_counts.most_common(20):
    print("%s\t%s\t%d" % (pair[0], pair[1], c))

of	the	9625
in	the	5546
to	the	3426
on	the	2297
and	the	2136
for	the	1759
to	be	1697
at	the	1506
with	the	1472
of	a	1461
that	the	1368
from	the	1351
in	a	1316
by	the	1310
as	a	896
with	a	881
it	is	881
is	a	864
of	his	806
is	the	782


Inspect the contents of `word_pair_counts`. Each key is a `(w1, w2)` tuple and each value is its count.

In [9]:
print(word_pair_counts.most_common(1)[0])

(('of', 'the'), 9625)


In [10]:
print(word_pair_counts[('of', 'the')])

9625


An alternative way to unpack the two words of each pair directly in the loop header:

In [11]:
for (w1, w2), c in word_pair_counts.most_common(20):
    print("%s\t%s\t%d" % (w1, w2, c))

of	the	9625
in	the	5546
to	the	3426
on	the	2297
and	the	2136
for	the	1759
to	be	1697
at	the	1506
with	the	1472
of	a	1461
that	the	1368
from	the	1351
in	a	1316
by	the	1310
as	a	896
with	a	881
it	is	881
is	a	864
of	his	806
is	the	782
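
The index-based loop used above can also be written by pairing the token list with a copy of itself shifted by one position. The following is a sketch under the assumption that the tokens are held in an ordinary Python list; `count_bigrams` and `toy_tokens` are hypothetical names introduced for illustration.

```python
from collections import Counter

def count_bigrams(tokens):
    """Count adjacent token pairs in which both tokens are alphabetic."""
    pairs = zip(tokens, tokens[1:])  # (t0, t1), (t1, t2), ...
    return Counter((w1, w2) for w1, w2 in pairs
                   if w1.isalpha() and w2.isalpha())

# Toy data standing in for brown.words().
toy_tokens = ["of", "the", "people", ",", "by", "the", "people",
              "of", "the", "city"]
bigram_counts = count_bigrams(toy_tokens)
for (w1, w2), c in bigram_counts.most_common(2):
    print("%s\t%s\t%d" % (w1, w2, c))
```

Pairs that include the comma are skipped, just as in the original loop.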


## Filtering with POS Tag Patterns


### Check the POS tags in the Brown corpus

In [21]:
tag_counter = Counter()
tagged_tokens = brown.tagged_words()
for (word, tag) in tagged_tokens:
    tag_counter[tag] += 1
    
print(tag_counter.most_common())

[('NN', 152470), ('IN', 120557), ('AT', 97959), ('JJ', 64028), ('.', 60638), (',', 58156), ('NNS', 55110), ('CC', 37718), ('RB', 36464), ('NP', 34476), ('VB', 33693), ('VBN', 29186), ('VBD', 26167), ('CS', 22143), ('PPS', 18253), ('VBG', 17893), ('PP$', 16872), ('TO', 14918), ('PPSS', 13802), ('CD', 13510), ('NN-TL', 13372), ('MD', 12431), ('PPO', 11181), ('BEZ', 10066), ('BEDZ', 9806), ('AP', 9522), ('DT', 8957), ('``', 8837), ("''", 8789), ('QL', 8735), ('VBZ', 7373), ('BE', 6360), ('RP', 6009), ('WDT', 5539), ('HVD', 4895), ('*', 4603), ('WRB', 4509), ('BER', 4379), ('JJ-TL', 4107), ('NP-TL', 4019), ('HV', 3928), ('WPS', 3924), ('--', 3405), ('BED', 3282), ('ABN', 3010), ('DTI', 2921), ('PN', 2573), ('NP$', 2565), ('BEN', 2470), ('DTS', 2435), ('HVZ', 2433), (')', 2273), ('(', 2264), ('NNS-TL', 2226), ('EX', 2164), ('JJR', 1958), ('OD', 1935), ('NR', 1566), (':', 1558), ('NN$', 1480), ('IN-TL', 1477), ('NN-HL', 1471), ('DO', 1353), ('NPS', 1275), ('PPL', 1233), ('RBR', 1182), ('DOD'

### Filtering the results with the pattern A + N

In [22]:
word_pair_counts = Counter()

for i in range(len(tagged_tokens) - 1):
    (w1, w2) = (tagged_tokens[i][0], tagged_tokens[i + 1][0])
    (tag1, tag2) = (tagged_tokens[i][1], tagged_tokens[i + 1][1])
    if tag1 == 'JJ' and tag2[0] == 'N':
        word_pair_counts[(w1, w2)] += 1
    
for pair, c in word_pair_counts.most_common(20):
    print("%s\t%s\t%d" % (pair[0], pair[1], c))

fiscal	year	56
high	school	54
old	man	52
young	man	47
great	deal	43
long	time	39
new	members	30
young	men	29
dominant	stress	28
real	estate	27
foreign	policy	27
good	deal	27
recent	years	25
large	number	25
small	business	24
American	people	22
human	beings	22
nuclear	weapons	21
front	door	20
big	man	17


### Filtering the results with the pattern N + N

In [23]:
word_pair_counts = Counter()

for i in range(len(tagged_tokens) - 1):
    (w1, w2) = (tagged_tokens[i][0], tagged_tokens[i + 1][0])
    (tag1, tag2) = (tagged_tokens[i][1], tagged_tokens[i + 1][1])
    if tag1[0] == 'N' and tag2[0] == 'N':
        word_pair_counts[(w1, w2)] += 1
    
for pair, c in word_pair_counts.most_common(20):
    print("%s\t%s\t%d" % (pair[0], pair[1], c))

Rhode	Island	90
World	War	60
U.	S.	57
Peace	Corps	52
Los	Angeles	47
President	Kennedy	40
General	Motors	40
San	Francisco	39
Mr.	Kennedy	34
Du	Pont	34
St.	Louis	32
St.	John	28
Soviet	Union	27
Sam	Rayburn	26
Air	Force	26
York	City	25
home	runs	25
State	Department	24
Kansas	City	24
Jesus	Christ	24


### Filtering the results with the pattern N + N + N

In [24]:
word_trio_counts = Counter()

for i in range(len(tagged_tokens) - 2):
    (w1, w2, w3) = (tagged_tokens[i][0], tagged_tokens[i + 1][0], tagged_tokens[i + 2][0])
    (tag1, tag2, tag3) = (tagged_tokens[i][1], tagged_tokens[i + 1][1], tagged_tokens[i + 2][1])
    if tag1[0] == 'N' and tag2[0] == 'N' and tag3[0] == 'N':
        word_trio_counts[(w1, w2, w3)] += 1
    
for trio, c in word_trio_counts.most_common(20):
    print("%s\t%s\t%s\t%d" % (trio[0], trio[1], trio[2], c))

Drug's	chemical	name	15
John	A.	Notte	15
General	Motors	stock	10
Prairie	Du	Chien	8
hearing	officer's	report	8
John	F.	Kennedy	7
tax	collection	year	7
combustion	chamber	volume	6
Lauro	Di	Bosis	6
labor	surplus	areas	6
Mr.	Justice	Frankfurter	6
home	rule	charter	5
Lord	Jesus	Christ	5
Peace	Corps	volunteers	5
Field	Marshal	Slim	5
Hudson's	Bay	Company	5
Dwight	D.	Eisenhower	4
Mrs.	William	H.	4
potato	chip	industry	4
Wall	Street	Journal	4


### Top Bigram Collocations 

In [25]:
def match_patterns(tag1, tag2):
    if tag1[0] == 'N' and tag2[0] == 'N':
        return True
    elif tag1 == 'JJ' and tag2[0] == 'N':
        return True
    return False

word_pair_counts = Counter()

for i in range(len(tagged_tokens) - 1):
    (w1, w2) = (tagged_tokens[i][0], tagged_tokens[i + 1][0])
    (tag1, tag2) = (tagged_tokens[i][1], tagged_tokens[i + 1][1])
    if match_patterns(tag1, tag2):
        word_pair_counts[(w1, w2)] += 1
    
for pair, c in word_pair_counts.most_common(15):
    print("%s\t%s\t%d" % (pair[0], pair[1], c))

Rhode	Island	90
World	War	60
U.	S.	57
fiscal	year	56
high	school	54
Peace	Corps	52
old	man	52
young	man	47
Los	Angeles	47
great	deal	43
President	Kennedy	40
General	Motors	40
long	time	39
San	Francisco	39
Mr.	Kennedy	34


### Top Trigram Collocations

In [26]:
def match_patterns(tag1, tag2, tag3):
    if tag1 == 'JJ' and tag2 == 'JJ' and tag3[0] == 'N':
        return True
    elif tag1 == 'JJ' and tag2[0] == 'N' and tag3[0] == 'N':
        return True
    elif tag1[0] == 'N' and tag2 == 'JJ' and tag3[0] == 'N':
        return True
    elif tag1[0] == 'N' and tag2[0] == 'N' and tag3[0] == 'N':
        return True
    elif tag1[0] == 'N' and tag2 == 'IN' and tag3[0] == 'N':
        return True
    return False

word_trio_counts = Counter()

for i in range(len(tagged_tokens) - 2):
    (w1, w2, w3) = (tagged_tokens[i][0], tagged_tokens[i + 1][0], tagged_tokens[i + 2][0])
    (tag1, tag2, tag3) = (tagged_tokens[i][1], tagged_tokens[i + 1][1], tagged_tokens[i + 2][1])
    if match_patterns(tag1, tag2, tag3):
        word_trio_counts[(w1, w2, w3)] += 1
    
for trio, c in word_trio_counts.most_common(15):
    print("%s\t%s\t%s\t%d" % (trio[0], trio[1], trio[2], c))

way	of	life	28
point	of	view	26
time	to	time	24
period	of	time	20
matter	of	fact	17
basic	wage	rate	16
Drug's	chemical	name	15
John	A.	Notte	15
number	of	people	13
years	of	age	12
number	of	years	12
small	business	concerns	12
couple	of	weeks	11
uniform	fiscal	year	11
side	by	side	10


### Collocations in the pattern powerful + N

In [27]:
word_pair_counts = Counter()

powerful_partners = set()

for i in range(len(tagged_tokens) - 1):
    (w1, w2) = (tagged_tokens[i][0], tagged_tokens[i + 1][0])
    (tag1, tag2) = (tagged_tokens[i][1], tagged_tokens[i + 1][1])
    if w1.lower() == 'powerful' and tag2[0] == 'N':
        word_pair_counts[(w1.lower(), w2.lower())] += 1
        powerful_partners.add(w2.lower())
    
for pair, c in word_pair_counts.most_common(20):
    print("%s\t%s\t%d" % (pair[0], pair[1], c))

powerful	engines	2
powerful	arms	2
powerful	transmitter	1
powerful	man	1
powerful	nations	1
powerful	nation	1
powerful	mirror	1
powerful	efforts	1
powerful	weapon	1
powerful	glasses	1
powerful	divine	1
powerful	victory	1
powerful	music	1
powerful	act	1
powerful	indictment	1
powerful	congressmen	1
powerful	influence	1
powerful	opposition	1
powerful	orwell	1
powerful	factor	1


### Collocations in the pattern strong + N

In [28]:
word_pair_counts = Counter()

strong_partners = set()
for i in range(len(tagged_tokens) - 1):
    (w1, w2) = (tagged_tokens[i][0], tagged_tokens[i + 1][0])
    (tag1, tag2) = (tagged_tokens[i][1], tagged_tokens[i + 1][1])
    if w1.lower() == 'strong' and tag2[0] == 'N':
        word_pair_counts[(w1.lower(), w2.lower())] += 1
        strong_partners.add(w2.lower())
    
for pair, c in word_pair_counts.most_common(20):
    print("%s\t%s\t%d" % (pair[0], pair[1], c))

strong	stress	11
strong	hands	7
strong	opposition	3
strong	men	3
strong	arm	2
strong	pressures	2
strong	oil	2
strong	point	2
strong	feeling	2
strong	woman	2
strong	stresses	2
strong	encouragement	1
strong	fight	1
strong	advocate	1
strong	executives	1
strong	christianity	1
strong	reactions	1
strong	pressure	1
strong	dissents	1
strong	conviction	1


In [29]:
print("The words only occur with powerful:")
print(list(powerful_partners - strong_partners))

print("The words only occur with strong:")
print(list(strong_partners - powerful_partners))

print("The words occur with both strong and powerful:")
print(list(strong_partners & powerful_partners))

The words only occur with powerful:
['transmitter', 'representative', 'force', 'bond', 'nations', 'act', 'glasses', 'factor', 'efforts', 'union', 'music', 'indictment', 'congressmen', 'sources', 'microphone', 'victory', 'orwell', 'jab', 'societies', 'nation', 'man', 'weapon', 'civilizations', 'introject', 'streams', 'aid', 'mirror', 'arms', 'engines', 'body', 'divine']
The words only occur with strong:
['af', 'compulsion', 'dissents', 'explanation', 'temperance', 'listener', 'suspicions', 'fight', 'potions', 'pressures', 'features', 'light', 'convictions', 'nose', 'executives', 'home-blend', 'hints', 'stresses', 'position', 'jaws', 'liking', 'feeling', 'support', 'delights', 'winds', 'hand', 'men', 'reactions', 'determination', 'activity', 'stress', 'emotion', 'recovery', 'possibility', 'signals', 'demand', 'poland', 'tea', 'wave', 'wine', 'material', 'backing', 'adhesive', 'credit', 'contrast', 'point', 'performance', 'arm', 'transom', 'protest', 'branches', 'oil', 'apple', 'christian

### Verb Particles

In [69]:
word_pair_counts = Counter()

for i in range(len(tagged_tokens) - 1):
    (w1, w2) = (tagged_tokens[i][0], tagged_tokens[i + 1][0])
    (tag1, tag2) = (tagged_tokens[i][1], tagged_tokens[i + 1][1])
    if tag1[0] == 'V' and tag2 == 'IN':
        word_pair_counts[(w1.lower(), w2.lower())] += 1
        
for pair, c in word_pair_counts.most_common(20):
    print("%s\t%s\t%d" % (pair[0], pair[1], c))

went	to	109
go	to	96
came	to	96
used	in	93
based	on	82
looked	at	81
found	in	77
come	to	72
think	of	72
look	at	70
shown	in	62
interested	in	61
related	to	59
thought	of	59
made	by	56
made	in	56
followed	by	56
live	in	54
returned	to	54
concerned	with	52


## Distant Collocations

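Most frequent word pairs at each distance k. With `window_size = 9`, the inner loop covers distances 1 to 8, where a distance of 1 means the two words are adjacent.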

In [36]:
window_size = 9

word_pair_counts = Counter()
word_pair_distance_counts = Counter()
for i in range(len(tokens) - 1):
    w1 = tokens[i]
    if not w1.isalpha():
        continue
    for distance in range(1, window_size):
        if i + distance < len(tokens):
            w2 = tokens[i + distance]
            if not w2.isalpha():
                continue
            word_pair_distance_counts[(w1.lower(), w2.lower(), distance)] += 1
            word_pair_counts[(w1.lower(), w2.lower())] += 1
print(len(tokens))
for (w1, w2, distance), c in word_pair_distance_counts.most_common(20):
    print("%s\t%s\t%d\t%d" % (w1, w2, distance, c))

1161192
the	of	2	10822
of	the	1	9717
the	the	3	6529
in	the	1	6025
the	the	4	5648
the	the	8	4984
the	the	7	4955
the	the	6	4806
the	the	5	4346
the	of	3	4304
to	the	1	3484
a	of	2	2806
of	the	4	2510
of	the	5	2506
the	and	2	2468
on	the	1	2466
the	of	8	2430
of	the	8	2379
of	the	6	2376
of	the	7	2370


Show some entries in `word_pair_distance_counts`

In [37]:
print(word_pair_distance_counts.most_common(1)[0])

print(word_pair_distance_counts['the', 'of', 1])
print(word_pair_distance_counts['the', 'of', 100])


for distance in range(1, window_size):
    print("Occurrences of the word pair (%s, %s) with a distance of %d: %d" % (
        'the', 'of', distance, word_pair_distance_counts['the', 'of', distance]))

# A distance of 2 means exactly one token lies between the two words.
print("Occurrences of the usage 'the * of'")
print(word_pair_distance_counts['the', 'of', 2])

print("Occurrences of the usage 'of * the'")
print(word_pair_distance_counts['of', 'the', 2])

(('the', 'of', 2), 10822)
0
0
Occurrences of the word pair (the, of) with a distance of 1: 0
Occurrences of the word pair (the, of) with a distance of 2: 10822
Occurrences of the word pair (the, of) with a distance of 3: 4304
Occurrences of the word pair (the, of) with a distance of 4: 1662
Occurrences of the word pair (the, of) with a distance of 5: 2080
Occurrences of the word pair (the, of) with a distance of 6: 2101
Occurrences of the word pair (the, of) with a distance of 7: 2123
Occurrences of the word pair (the, of) with a distance of 8: 2430
Occurrences of the usage 'the * of'
10822
Occurrences of the usage 'of * the'
327
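
To make the distance bookkeeping concrete, here is a small sketch of the same windowed counting on a toy sentence. `count_pairs_by_distance` is a hypothetical helper; unlike `window_size` above, its `max_distance` argument is inclusive, so `max_distance=8` would correspond to `window_size = 9`.

```python
from collections import Counter

def count_pairs_by_distance(tokens, max_distance):
    """Count (w1, w2, distance) for alphabetic tokens up to max_distance apart."""
    counts = Counter()
    for i, w1 in enumerate(tokens):
        if not w1.isalpha():
            continue
        for distance in range(1, max_distance + 1):
            if i + distance >= len(tokens):
                break  # ran past the end of the token list
            w2 = tokens[i + distance]
            if w2.isalpha():
                counts[(w1.lower(), w2.lower(), distance)] += 1
    return counts

# Toy data standing in for brown.words().
toy_tokens = ["The", "end", "of", "the", "day", ",",
              "the", "king", "of", "Spain"]
distance_counts = count_pairs_by_distance(toy_tokens, 3)
print(distance_counts[("the", "of", 2)])  # "The end of" and "the king of"
print(distance_counts[("the", "of", 1)])  # "the of" never occurs adjacently
```

As in the Brown results, "the ... of" shows up at distance 2 but never at distance 1.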


Filtering the collocations with mean distance

In [38]:
pair_mean_distances = Counter()

for (w1, w2, distance), c in word_pair_distance_counts.most_common():
    pair_mean_distances[(w1, w2)] += distance * (c / word_pair_counts[(w1, w2)])

for (w1, w2), distance in pair_mean_distances.most_common(20):
    print("%s\t%s\t%f\t%d" % (w1, w2, distance, word_pair_counts[(w1, w2)]))

i	governor	8.000000	7
governor	providence	8.000000	7
state	proclaim	8.000000	7
testimony	hand	8.000000	7
whereof	and	8.000000	7
have	seal	8.000000	7
hereunto	of	8.000000	7
caused	affixed	8.000000	7
proclamation	governor	8.000000	6
independence	governor	8.000000	6
paragraphs	c	8.000000	6
a	machines	8.000000	5
made	saw	8.000000	5
guam	to	8.000000	5
pay	would	8.000000	4
carcass	the	8.000000	4
he	machines	8.000000	4
controversial	the	8.000000	4
populated	to	8.000000	4
talked	that	8.000000	4


Filtering out infrequent pairs (keeping only pairs observed more than 5 times)

In [39]:
pair_mean_distances = Counter()

print(len(word_pair_counts))
for (w1, w2, distance), c in word_pair_distance_counts.most_common():
    if word_pair_counts[(w1, w2)] > 5:
        pair_mean_distances[(w1, w2)] += distance * (c / word_pair_counts[(w1, w2)])

for (w1, w2), distance in pair_mean_distances.most_common(20):
    print("%s\t%s\t%f\t%d" % (w1, w2, distance, word_pair_counts[(w1, w2)]))

2532836
i	governor	8.000000	7
governor	providence	8.000000	7
state	proclaim	8.000000	7
testimony	hand	8.000000	7
whereof	and	8.000000	7
have	seal	8.000000	7
hereunto	of	8.000000	7
caused	affixed	8.000000	7
proclamation	governor	8.000000	6
independence	governor	8.000000	6
paragraphs	c	8.000000	6
thousand	independence	7.714286	7
by	discovered	7.666667	6
of	constants	7.666667	6
court	from	7.666667	6
should	were	7.666667	6
at	paper	7.666667	6
more	very	7.666667	6
were	should	7.666667	6
going	were	7.571429	7


In [40]:
for (w1, w2), distance in pair_mean_distances.most_common()[-20:]:
    print("%s\t%s\t%f\t%d" % (w1, w2, distance, word_pair_counts[(w1, w2)]))

image	intensifiers	1.000000	6
electron	optical	1.000000	6
drove	home	1.000000	6
hour	later	1.000000	6
go	ahead	1.000000	6
damn	it	1.000000	6
kent	house	1.000000	6
jean	jacques	1.000000	6
rector	said	1.000000	6
backed	off	1.000000	6
his	cheek	1.000000	6
shell	people	1.000000	6
brannon	said	1.000000	6
tom	horn	1.000000	6
benson	said	1.000000	6
herr	schaffner	1.000000	6
miss	jen	1.000000	6
gratt	shafer	1.000000	6
eddie	lee	1.000000	6
hanford	college	1.000000	6


In [41]:
num_pairs = len(pair_mean_distances)
mid = num_pairs // 2
for (w1, w2), distance in pair_mean_distances.most_common()[mid-10:mid+10]:
    print("%s\t%s\t%f\t%d" % (w1, w2, distance, word_pair_counts[(w1, w2)]))

by	faculty	4.454545	11
who	over	4.454545	11
then	has	4.454545	11
diet	and	4.454545	11
have	find	4.454545	11
children	i	4.454545	11
am	my	4.454545	11
china	in	4.454545	11
than	used	4.454545	11
and	hearts	4.454545	11
you	yet	4.454545	11
well	your	4.454545	11
and	operating	4.454545	11
and	departments	4.454545	11
a	stands	4.454545	11
and	discover	4.454545	11
testimony	of	4.454545	11
better	there	4.454545	11
for	act	4.454545	11
relax	the	4.454545	11


Filtering by the standard deviation of the offset

In [42]:
pair_deviations = Counter()
for (w1, w2, distance), c in word_pair_distance_counts.most_common():
    if word_pair_counts[(w1, w2)] > 5:
        pair_deviations[(w1, w2)] += c * ((distance - pair_mean_distances[(w1, w2)]) ** 2)
    
for (w1, w2), dev_tmp in pair_deviations.most_common():
    s_2 = dev_tmp / (word_pair_counts[(w1, w2)] - 1)
    pair_deviations[(w1, w2)] = s_2 ** 0.5
    
for (w1, w2), dev in pair_deviations.most_common(20):
    print("%s\t%s\t%f\t%f\t%d" % (w1, w2, pair_mean_distances[(w1, w2)], dev, word_pair_counts[(w1, w2)]))

the	denver	4.500000	3.834058	6
very	center	4.500000	3.834058	6
may	never	4.500000	3.834058	6
be	started	4.000000	3.741657	7
his	objective	4.000000	3.741657	7
and	belly	5.000000	3.741657	7
more	reason	4.000000	3.741657	7
placing	a	4.000000	3.741657	7
year	because	4.333333	3.669696	6
engineer	for	4.333333	3.669696	6
mechanism	is	4.333333	3.669696	6
frequently	has	4.333333	3.669696	6
and	expectations	4.333333	3.669696	6
to	tie	4.333333	3.669696	6
a	wider	4.333333	3.669696	6
policy	will	4.666667	3.669696	6
devices	to	4.333333	3.669696	6
the	fierce	4.333333	3.669696	6
in	applying	4.333333	3.669696	6
operating	policy	4.333333	3.669696	6


In [43]:
for (w1, w2), dev in pair_deviations.most_common()[-20:]:
    print("%s\t%s\t%f\t%f\t%d" % (w1, w2, pair_mean_distances[(w1, w2)], dev, word_pair_counts[(w1, w2)]))    

hour	later	1.000000	0.000000	6
go	ahead	1.000000	0.000000	6
damn	it	1.000000	0.000000	6
kent	house	1.000000	0.000000	6
got	feet	3.000000	0.000000	6
jean	jacques	1.000000	0.000000	6
rector	said	1.000000	0.000000	6
backed	off	1.000000	0.000000	6
make	feel	2.000000	0.000000	6
his	cheek	1.000000	0.000000	6
to	jeep	2.000000	0.000000	6
shell	people	1.000000	0.000000	6
brannon	said	1.000000	0.000000	6
tom	horn	1.000000	0.000000	6
benson	said	1.000000	0.000000	6
herr	schaffner	1.000000	0.000000	6
miss	jen	1.000000	0.000000	6
gratt	shafer	1.000000	0.000000	6
eddie	lee	1.000000	0.000000	6
hanford	college	1.000000	0.000000	6


With a higher support threshold (pairs observed more than 10 times)

In [44]:
pair_deviations = Counter()
for (w1, w2, distance), c in word_pair_distance_counts.most_common():
    if word_pair_counts[(w1, w2)] > 10:
        pair_deviations[(w1, w2)] += c * ((distance - pair_mean_distances[(w1, w2)]) ** 2)
    
for (w1, w2), dev_tmp in pair_deviations.most_common():
    s_2 = dev_tmp / (word_pair_counts[(w1, w2)] - 1)
    pair_deviations[(w1, w2)] = s_2 ** 0.5
    
for (w1, w2), dev in pair_deviations.most_common()[-20:]:
    print("%s\t%s\t%f\t%f\t%d" % (w1, w2, pair_mean_distances[(w1, w2)], dev, word_pair_counts[(w1, w2)]))

wall	street	1.000000	0.000000	11
western	europe	1.000000	0.000000	11
great	britain	1.000000	0.000000	11
status	quo	1.000000	0.000000	11
taken	place	1.000000	0.000000	11
declaration	independence	2.000000	0.000000	11
strong	enough	1.000000	0.000000	11
fat	man	1.000000	0.000000	11
word	god	2.000000	0.000000	11
room	temperature	1.000000	0.000000	11
ballistic	missile	1.000000	0.000000	11
be	maintained	1.000000	0.000000	11
middle	ages	1.000000	0.000000	11
good	luck	1.000000	0.000000	11
less	developed	1.000000	0.000000	11
motor	pool	1.000000	0.000000	11
strong	stress	1.000000	0.000000	11
in	operand	2.000000	0.000000	11
oxygen	transfer	1.000000	0.000000	11
blue	throat	1.000000	0.000000	11


## Pearson's Chi-Square Test

Back to bigrams

In [46]:
word_pair_counts = Counter()
word_counts = Counter(tokens)
num_bigrams = 0

for i in range(len(tokens) - 1):
    w1 = tokens[i]
    w2 = tokens[i + 1]
    if w1.isalpha() and w2.isalpha():
        word_pair_counts[(w1, w2)] += 1
        num_bigrams += 1

Chi-Square function

In [47]:
def chisquare(o11, o12, o21, o22):
    n = o11 + o12 + o21 + o22
    x_2 = (n * ((o11 * o22 - o12 * o21)**2)) / ((o11 + o12) * (o11 + o21) * (o12 + o22) * (o21 + o22)) 
    return x_2
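
In standard notation, the function evaluates the closed form of Pearson's chi-square statistic for a 2×2 contingency table. The reading of the cells given here follows how the function is called in the next cell: $O_{11}$ is the count of the bigram $w_1 w_2$, $O_{12}$ the remaining occurrences of $w_1$, $O_{21}$ the remaining occurrences of $w_2$, and $O_{22}$ everything else, with $N = O_{11} + O_{12} + O_{21} + O_{22}$.

$$
\chi^2 = \frac{N \, (O_{11} O_{22} - O_{12} O_{21})^2}{(O_{11} + O_{12})(O_{11} + O_{21})(O_{12} + O_{22})(O_{21} + O_{22})}
$$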

Now we can compute the chi-squares.

In [48]:
pair_chi_squares = Counter()
for (w1, w2), w1_w2_count in word_pair_counts.most_common():
    w1_only_count = word_counts[w1] - w1_w2_count
    w2_only_count = word_counts[w2] - w1_w2_count
    rest_count = num_bigrams - w1_only_count - w2_only_count - w1_w2_count
    pair_chi_squares[(w1, w2)] = chisquare(w1_w2_count, w1_only_count, w2_only_count, rest_count)

for (w1, w2), x_2 in pair_chi_squares.most_common(20):
    print("%s\t%s\t%d\t%f" % (w1, w2, word_pair_counts[(w1, w2)], x_2))
    

Lo	Shu	21	838076.000000
Hong	Kong	11	838076.000000
deja	vue	7	838076.000000
Notre	Dame	6	838076.000000
Baton	Rouge	5	838076.000000
Dolce	Vita	5	838076.000000
Duncan	Phyfe	4	838076.000000
Hwang	Pah	4	838076.000000
Chemische	Krystallographie	4	838076.000000
Beech	Pasture	4	838076.000000
Ku	Klux	3	838076.000000
Klux	Klan	3	838076.000000
Sancho	Panza	3	838076.000000
bel	canto	3	838076.000000
Estate	Boards	3	838076.000000
Sultan	Ahmet	3	838076.000000
Planned	Parenthood	3	838076.000000
Furious	Overfall	3	838076.000000
Grands	Crus	3	838076.000000
Agreeable	Autocracies	3	838076.000000
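
Every row above reports the same score, 838076. This follows from the formula: when two words never occur without each other, `o12` and `o21` are both zero and the statistic collapses to `n`, which in the cell above always equals `num_bigrams`. The sketch below repeats the `chisquare` function so that it runs on its own and checks the two boundary cases on made-up counts.

```python
def chisquare(o11, o12, o21, o22):
    n = o11 + o12 + o21 + o22
    numerator = n * (o11 * o22 - o12 * o21) ** 2
    denominator = (o11 + o12) * (o11 + o21) * (o12 + o22) * (o21 + o22)
    return numerator / denominator

# Perfect association: the two words only ever appear together (n = 100).
perfect = chisquare(10, 0, 0, 90)

# Independence: the observed counts equal the expected counts exactly.
independent = chisquare(1, 9, 9, 81)

print(perfect)      # equals n
print(independent)  # no evidence of association
```

This is why rare pairs such as names dominate the unfiltered ranking: a pair seen three times scores as high as one seen hundreds of times, as long as neither word appears elsewhere.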


Focus on more frequent bigrams

In [55]:
pair_chi_squares = Counter()
for (w1, w2), w1_w2_count in word_pair_counts.most_common():
    if w1_w2_count > 10:
        w1_only_count = word_counts[w1] - w1_w2_count
        w2_only_count = word_counts[w2] - w1_w2_count
        rest_count = num_bigrams - w1_only_count - w2_only_count - w1_w2_count
        pair_chi_squares[(w1, w2)] = chisquare(w1_w2_count, w1_only_count, w2_only_count, rest_count)

for (w1, w2), x_2 in pair_chi_squares.most_common(20):
    print("%s\t%s\t%d\t%f" % (w1, w2, word_pair_counts[(w1, w2)], x_2))
    

Lo	Shu	21	838076.000000
Hong	Kong	11	838076.000000
Los	Angeles	47	772340.862538
Puerto	Rico	21	733313.874934
Viet	Nam	13	680934.312462
United	States	392	624942.825079
Simms	Purdew	12	591579.529361
Armed	Forces	22	518035.039321
carbon	tetrachloride	18	510399.811865
Air	Force	26	484209.555146
Rhode	Island	90	478857.709261
San	Francisco	39	444131.903948
New	York	296	422705.994262
Peace	Corps	52	415016.950298
Du	Pont	34	413123.371544
Saxon	Shore	12	352865.684075
per	cent	146	311401.424614
Virgin	Islands	13	302628.444270
minimal	polynomial	16	283779.555274
Linda	Kay	17	274594.311495


In [56]:
for (w1, w2), x_2 in pair_chi_squares.most_common()[-20:]:
    print("%s\t%s\t%d\t%f" % (w1, w2, word_pair_counts[(w1, w2)], x_2))

use	a	15	0.003443
is	like	15	0.003435
use	the	42	0.003194
those	in	18	0.002944
in	use	13	0.002915
all	I	17	0.002725
the	mind	24	0.002673
to	these	38	0.002399
the	fine	11	0.002171
city	of	11	0.002115
business	to	11	0.001959
got	the	35	0.001839
feel	the	16	0.001782
not	one	15	0.001756
nor	the	12	0.000947
work	and	25	0.000900
that	other	20	0.000815
right	in	14	0.000515
to	John	11	0.000265
brought	the	19	0.000265


List the top collocations with a threshold.

In [65]:
rank = Counter()

for (w1, w2), x_2 in pair_chi_squares.most_common():
    if x_2 > 50000:
        rank[(w1, w2)] = word_pair_counts[(w1, w2)]
for (w1, w2), c in rank.most_common(20):        
    print("%s\t%s\t%d" % (w1, w2, c))
    

United	States	392
New	York	296
per	cent	146
years	ago	136
Rhode	Island	90
White	House	65
World	War	60
Peace	Corps	52
United	Nations	49
Los	Angeles	47
General	Motors	41
New	Orleans	40
San	Francisco	39
Civil	War	36
Du	Pont	34
nineteenth	century	30
dominant	stress	28
Supreme	Court	27
real	estate	27
Air	Force	26


## Mutual Information

Define a function for computing the (pointwise) mutual information of a word pair, log2(P(w1, w2) / (P(w1) P(w2)))

In [66]:
import math
def mutual_information(w1_w2_prob, w1_prob, w2_prob):
    return math.log2(w1_w2_prob / (w1_prob * w2_prob))

In [67]:
num_unigrams = sum(word_counts.values())

pair_mutual_information_scores = Counter()
for (w1, w2), w1_w2_count in word_pair_counts.most_common():
    if w1_w2_count > 0:
        w1_prob = word_counts[w1] / num_unigrams
        w2_prob = word_counts[w2] / num_unigrams
        w1_w2_prob = w1_w2_count / num_bigrams
        pair_mutual_information_scores[(w1, w2)] = mutual_information(w1_w2_prob, w1_prob, w2_prob)

for (w1, w2), mi in pair_mutual_information_scores.most_common(20):
    print("%s\t%s\t%d\t%f" % (w1, w2, word_pair_counts[(w1, w2)], mi))

Durwood	Pye	1	20.617629
Decries	joblessness	1	20.617629
Resentment	welled	1	20.617629
Audrey	Knecht	1	20.617629
Frankford	Elevated	1	20.617629
vouchers	certifying	1	20.617629
Inspections	Barnet	1	20.617629
Barnet	Lieberman	1	20.617629
Cedvet	Sunay	1	20.617629
Rosy	Fingered	1	20.617629
Mal	Whitfield	1	20.617629
Abner	Haynes	1	20.617629
Patti	Waggin	1	20.617629
Rip	Randall	1	20.617629
Camilo	Carreon	1	20.617629
Harmon	Killebrew	1	20.617629
Rocky	Colavito	1	20.617629
routed	loser	1	20.617629
Shipman	Payson	1	20.617629
Deane	Beman	1	20.617629
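
All rows above share one score because each pair consists of two words that occur exactly once in the corpus. In that case the score depends only on the corpus sizes. The sketch below reproduces the value under two assumptions: `num_unigrams` is the token count printed at the start (1161192), and `num_bigrams` is 838076, a figure inferred from the chi-square output rather than printed directly.

```python
import math

def mutual_information(w1_w2_prob, w1_prob, w2_prob):
    return math.log2(w1_w2_prob / (w1_prob * w2_prob))

# Independent words: joint probability equals the product of the marginals.
independent = mutual_information(0.25, 0.5, 0.5)

# Assumed corpus sizes (see the note above about where they come from).
num_unigrams = 1161192
num_bigrams = 838076

# Two words that each occur once, and occur together that one time.
hapax_score = mutual_information(1 / num_bigrams,
                                 1 / num_unigrams,
                                 1 / num_unigrams)

print(independent)
print(round(hapax_score, 6))
```

Because the score rewards rarity this strongly, a minimum-count threshold such as the one in the next cell is commonly applied before ranking.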


In [68]:
num_unigrams = sum(word_counts.values())

pair_mutual_information_scores = Counter()
for (w1, w2), w1_w2_count in word_pair_counts.most_common():
    if w1_w2_count > 5:
        w1_prob = word_counts[w1] / num_unigrams
        w2_prob = word_counts[w2] / num_unigrams
        w1_w2_prob = w1_w2_count / num_bigrams
        pair_mutual_information_scores[(w1, w2)] = mutual_information(w1_w2_prob, w1_prob, w2_prob)

for (w1, w2), mi in pair_mutual_information_scores.most_common(20):
    print("%s\t%s\t%d\t%f" % (w1, w2, word_pair_counts[(w1, w2)], mi))

Notre	Dame	6	18.032666
deja	vue	7	17.810274
Pulley	Bey	6	17.617629
Walnut	Trees	6	17.617629
Gratt	Shafer	6	17.617629
Yugoslav	Claims	7	17.447704
Hong	Kong	11	17.158197
bacterial	diarrhea	7	16.917189
Herr	Schaffner	6	16.710738
Farm	Credit	7	16.670096
Viet	Nam	13	16.617629
Monroe	Doctrine	6	16.617629
aerated	lagoon	7	16.554619
Simms	Purdew	12	16.530166
Pathet	Lao	10	16.530166
crown	gall	7	16.530166
Adlai	Stevenson	6	16.488346
Skeletal	Age	6	16.447704
Morton	Foods	7	16.348168
El	Paso	10	16.295701
