# PMI

The first step is to load the corpus, *The Jungle Book*, as a flat list of tokens.

In [107]:
import math
from collections import Counter

In [114]:
with open('../corpora/tokenized/junglebook_tokenized.txt') as jb_tk:
    # Split the whole file on whitespace to get one flat list of tokens.
    jb_list = jb_tk.read().split()

### Unigram Occurrences
Now we use a ```Counter``` to obtain the number of occurrences of each word, and list the 10 most frequent words.

In [196]:
uni_count = Counter(jb_list)
print(uni_count.most_common(10))

[('the', 3695), ('and', 2321), ('of', 1347), ('to', 1261), ('a', 1147), ('he', 1071), ('in', 742), ('his', 659), ('that', 659), ('i', 580)]


We collect the less frequent words, those occurring fewer than 10 times, in a set so that the membership tests in the next step are fast.

In [159]:
less_freq = {word for word, count in uni_count.items() if count < 10}

Then we build a filtered copy of ```jb_list``` that omits these less frequent words, using a list comprehension.

In [96]:
jb_list_filtered = [token for token in jb_list if token not in less_freq]

43980
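
The filtering step can be illustrated on a small scale. The following is a minimal sketch, assuming a made-up token list and a threshold of 2 instead of the 10 used above; the names ```tokens```, ```min_count```, ```rare``` and ```filtered``` are hypothetical and do not appear in the notebook.

```python
from collections import Counter

# Hypothetical toy corpus; the threshold of 2 stands in for the notebook's 10.
tokens = "the wolf and the cub and the tiger".split()
min_count = 2

counts = Counter(tokens)

# Words seen fewer than min_count times are treated as rare.
rare = {word for word, count in counts.items() if count < min_count}

# Keep every occurrence of every word that is not rare.
filtered = [token for token in tokens if token not in rare]

print(filtered)
```

Because every occurrence of a rare word is dropped and every occurrence of a kept word survives, the unigram counts of the kept words are the same before and after filtering.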


We now create tuples from each pair of adjacent words, [i-1] and [i], and store them in a list. Afterwards we create another Counter dict to store the occurrences of these tuples.

In [160]:
# Pair each token with its successor. zip stops at the end of the shorter
# sequence, so no wrap-around pair (last token, first token) is created.
list_of_tup = list(zip(jb_list_filtered, jb_list_filtered[1:]))

# Count how often each word pair occurs.
bi_count = Counter(list_of_tup)
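
As a small check of the pairing logic, here is a sketch on a made-up token list (the list and the names ```tokens```, ```bigrams``` and ```bigram_counts``` are assumptions for illustration). It uses ```zip``` to form adjacent pairs, which is one way to get n-1 pairs from n tokens without a pair that wraps around from the last token to the first.

```python
from collections import Counter

# Hypothetical toy token list, not taken from the corpus.
tokens = ["the", "man", "cub", "and", "the", "man"]

# zip pairs tokens[i] with tokens[i + 1] and stops one short of the end.
bigrams = list(zip(tokens, tokens[1:]))
bigram_counts = Counter(bigrams)

print(len(tokens), len(bigrams))
print(bigram_counts[("the", "man")])
```

Note that the pairs are formed on the filtered list, so two tokens that were separated by a removed rare word are counted as adjacent.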

We then calculate the **PMI** for each word pair in the previously defined Counter object ```bi_count``` and store the results in a dictionary, where the key is the word pair and the value is the PMI value for this word pair. The next cell shows an example item of the iterable we loop over.

In [184]:
list(bi_count.items())[:1]

[(('the', 'project'), 36)]

To obtain the **PMI** value we use the length of the list ```jb_list_filtered``` as the size of the corpus. The absolute frequency of each word pair is the value of its item in ```bi_count```, and the absolute frequencies of the individual words are looked up in the ```uni_count``` dict.
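
Written out, the quantity computed in the next cell is the following. The symbols are introduced here only for illustration: $c(\cdot)$ is an absolute frequency and $N$ is the corpus size, and each probability is approximated by a count divided by $N$. Since ```math.log``` is the natural logarithm, the resulting values are in nats.

```latex
\mathrm{PMI}(w_1, w_2)
  = \ln \frac{P(w_1, w_2)}{P(w_1)\,P(w_2)}
  \approx \ln \frac{c(w_1, w_2)/N}{\bigl(c(w_1)/N\bigr)\bigl(c(w_2)/N\bigr)}
  = \ln \frac{c(w_1, w_2)\,N}{c(w_1)\,c(w_2)}
```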

In [185]:
pmi_values = {}
corpus_size = len(jb_list_filtered)

for (w1, w2), pair_count in bi_count.items():
    # PMI = ln( c(w1, w2) * N / (c(w1) * c(w2)) )
    pmi_values[(w1, w2)] = math.log((pair_count * corpus_size) / (uni_count[w1] * uni_count[w2]))
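
The whole computation can be followed end to end on a tiny example. This is a minimal sketch under stated assumptions: the token list is made up, and ```pmi_from_counts``` is a hypothetical helper that does not exist in the notebook; it simply wraps the same expression used in the cell above.

```python
import math
from collections import Counter

def pmi_from_counts(pair_count, count_x, count_y, corpus_size):
    """Natural-log PMI from absolute frequencies (hypothetical helper)."""
    return math.log((pair_count * corpus_size) / (count_x * count_y))

# Hypothetical toy corpus, not taken from the book.
tokens = "hind legs and hind legs and the legs".split()
uni = Counter(tokens)
bi = Counter(zip(tokens, tokens[1:]))
n = len(tokens)

# One PMI value per observed word pair.
pmi = {pair: pmi_from_counts(count, uni[pair[0]], uni[pair[1]], n)
       for pair, count in bi.items()}

# Print the pairs from highest to lowest PMI.
for pair, value in sorted(pmi.items(), key=lambda item: item[1], reverse=True):
    print(pair, round(value, 3))
```

In this toy corpus the top-ranked pair contains the rarest word. PMI tends to give high scores to pairs that involve rare words, and filtering low-frequency words, as done earlier, reduces this effect.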


The values are then sorted using the built-in ```sorted()``` function, giving us a list of tuples in ascending order of PMI, with the word pair at index 0 of each tuple and its **PMI** value at index 1.

In [164]:
sorted_pmi_values = sorted(pmi_values.items(), key=lambda x:x[1])

To establish what a zero, or near-zero, **PMI** looks like, the word pairs with a PMI strictly between -0.01 and 0.01 were stored in a list, and some of them are shown here.

In [198]:
near_zero_pmi = [word_pair for word_pair, pmi in sorted_pmi_values if -0.01 < pmi < 0.01]
print(len(near_zero_pmi))
print(near_zero_pmi[:10])

70
[('can', 'with'), ('page', 'the'), ('forget', 'the'), ('the', 'wonderful'), ('perhaps', 'the'), ('the', 'jump'), ('tree', 'the'), ('quickly', 'the'), ('high', 'the'), ('makes', 'the')]
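
A PMI of zero means that a pair occurs exactly as often as it would if the two words were independent. The sketch below shows how a pair with 'the' can land in the near-zero range. It uses the count of 'the' from the unigram output, takes the bare 43980 shown earlier to be the length of the filtered list, and assumes a hypothetical partner word that occurs 12 times and follows or precedes 'the' exactly once.

```python
import math

# Figures taken from the notebook's output.
corpus_size = 43980   # assumed to be len(jb_list_filtered)
count_the = 3695      # unigram count of 'the'

# Hypothetical partner word occurring 12 times (an assumption for illustration).
count_word = 12

# Number of pairs expected if the two words were statistically independent.
expected_pairs = count_the * count_word / corpus_size

# PMI of a pair that was observed exactly once.
pmi_once = math.log(1 * corpus_size / (count_the * count_word))

print(expected_pairs, pmi_once)
```

The expected count is very close to 1, so a single observed occurrence gives a PMI just below zero. This is consistent with most of the near-zero pairs above containing 'the'.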


## The highest and lowest PMI values

In [174]:
for word_pair, value in sorted_pmi_values[-20:]:
    print('Word pair{0}'.format(word_pair)+'\t\t\t'+ 'pmi value = {0}'.format(value))

Word pair('council', 'rock')			pmi value = 6.5018355220823265
Word pair('mans', 'cub')			pmi value = 6.603224747963034
Word pair('bring', 'news')			pmi value = 6.61961855773871
Word pair('years', 'ago')			pmi value = 6.637967696406907
Word pair('master', 'words')			pmi value = 6.666138573373603
Word pair('hind', 'flippers')			pmi value = 6.684157078876281
Word pair('electronic', 'works')			pmi value = 6.702506217544478
Word pair('whole', 'line')			pmi value = 6.714928737543035
Word pair('twenty', 'yoke')			pmi value = 6.8466760083740565
Word pair('fore', 'paws')			pmi value = 6.907300630190491
Word pair('hind', 'legs')			pmi value = 6.984911232895415
Word pair('petersen', 'sahib')			pmi value = 7.127484139876224
Word pair('stretched', 'myself')			pmi value = 7.184932366788771
Word pair('gutenberg', 'literary')			pmi value = 7.290292882446597
Word pair('cold', 'lairs')			pmi value = 7.501601976113804
Word pair('archive', 'foundation')			pmi value = 7.6004478107504365
Word pair('darzees'

In [175]:
for word_pair, value in sorted_pmi_values[:20]:
    print('Word pair{0}'.format(word_pair)+'\t\t\t'+ 'pmi value = {0}'.format(value))

Word pair('he', 'of')			pmi value = -3.4904929827493607
Word pair('his', 'the')			pmi value = -3.320821923216112
Word pair('the', 'not')			pmi value = -3.2435573458809617
Word pair('little', 'the')			pmi value = -3.0038844926155415
Word pair('the', 'a')			pmi value = -2.9587127739688204
Word pair('the', 'be')			pmi value = -2.851121738063131
Word pair('a', 'his')			pmi value = -2.8441383875231256
Word pair('of', 'was')			pmi value = -2.7945407512618066
Word pair('said', 'of')			pmi value = -2.60545479437931
Word pair('he', 'he')			pmi value = -2.5680586962268
Word pair('the', 'no')			pmi value = -2.533880863369806
Word pair('in', 'in')			pmi value = -2.5272082222260086
Word pair('and', 'is')			pmi value = -2.4897993524999444
Word pair('a', 'the')			pmi value = -2.4887091447230847
Word pair('the', 'if')			pmi value = -2.4720054596517183
Word pair('very', 'the')			pmi value = -2.4504992544307544
Word pair('they', 'of')			pmi value = -2.4490391079211995
Word pair('of', 'they')			pmi value

# Analyzing the Results

The 20 word pairs with the highest PMI are bigrams that seem semantically plausible. We have, for example, the pair ('united', 'states'), which names a country, and the pair ('council', 'rock'), which is a location from the book. We also see particular parts of speech, such as nouns (wife, legs, paws) and adjectives (electronic, fore), and some proper collocations, such as ('years', 'ago') and ('bring', 'news').

On the other hand, the 20 word pairs with the lowest PMI consist mostly of prepositions (at, in, of), determiners (a, the) and pronouns (he, his, they). These words have a high absolute frequency, but they occur next to each other far less often than their individual frequencies would predict.
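
The lowest-ranked pair, ('he', 'of'), makes this concrete. The sketch below uses the unigram counts from the notebook's output and takes the bare 43980 shown earlier to be the corpus size. The pair count of 1 is an assumption: it is not printed in the notebook, but it is the count that reproduces the reported PMI.

```python
import math

# Figures taken from the notebook's output.
corpus_size = 43980   # assumed to be len(jb_list_filtered)
count_he = 1071
count_of = 1347

# Assumption: ('he', 'of') was observed exactly once.
pair_count = 1

# Number of ('he', 'of') pairs expected if the two words were independent.
expected_pairs = count_he * count_of / corpus_size

pmi_he_of = math.log(pair_count * corpus_size / (count_he * count_of))

print(expected_pairs, pmi_he_of)
```

Under independence the pair would be expected roughly 33 times, so a single occurrence yields a strongly negative PMI of about -3.49.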

Using such a qualitative and quantitative analysis, we are able to infer that in naturally occurring texts there is a dependency relation between a word and the word that follows it.