# Distributional Similarity Practice

This practice sheet should help you gain a little familiarity with creating a distributional representation for a word, and how to query it later.

## 1. Getting our Data

Our first step in creating a distributional representation for our vocabulary is to get our data set. This should be pretty familiar by now; find a resource with many sentences, and tokenize those sentences.

*(NOTE: You may need to first launch a python interpreter and run the following:)*

    >>> import nltk
    >>> nltk.download('brown')

In [1]:
import nltk
from nltk.corpus import stopwords 

brown_sents = nltk.corpus.brown.sents()
print(brown_sents[0])

['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', 'Friday', 'an', 'investigation', 'of', "Atlanta's", 'recent', 'primary', 'election', 'produced', '``', 'no', 'evidence', "''", 'that', 'any', 'irregularities', 'took', 'place', '.']


## 2. Designing Our Data Store

Now that we have the data that we'll be using for an input, we need to figure out the best way to store this data.

Our target structure will let us look up a pair of words, $\langle w_1, w_2 \rangle$ and see how many times $w_2$ occurred within some window $n$ of $w_1$.

A common, efficient way to store such information is to create a 2-dimensional matrix, where each row index is correlated to a unique vocabulary word which will be $w_1$, and every column index will represent $w_2$.

Let's go ahead and make a data structure that allows us to easily increment counts of collocations, as well as allow us to look up a unique ID for a word, whether we've seen it, or wheter it's a new vocab item. We can do this with a standard matrix, or with nested dictionaries, if we prefer.

In [2]:
from collections import defaultdict

class CollocationMatrix(dict):
    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        self._word_mapping = {}  # Where we'll store string->int mapping.        
    
    def word_id(self, word, store_new=False):
        """
        Return the integer ID for the given vocab item. If we haven't
        seen this vocab item before, give ia a new ID. We can do this just
        as 1, 2, 3... based on how many words we've seen before.
        """
        if word not in self._word_mapping:
            if store_new:
                self._word_mapping[word] = len(self._word_mapping)
                self[self._word_mapping[word]] = defaultdict(int)  # Also add a new row for this new word.
            else:
                return None
        return self._word_mapping[word]
    
    def add_pair(self, w_1, w_2):
        """
        Add a pair of colocated words into the coocurrence matrix.
        """
        w_id_1 = self.word_id(w_1, store_new=True)
        w_id_2 = self.word_id(w_2, store_new=True)
        self[w_id_1][w_id_2] += 1  # Increment the count for this collocation
        
    def get_pair(self, w_1, w_2):
        """
        Return the colocation for w_1, w_2
        """
        w_1_id = self.word_id(w_1)
        w_2_id = self.word_id(w_2)
        if w_1_id and w_2_id:
            return self[w_1_id][w_2_id]
        else:
            return 0
        
    def get_row(self, word):
        word_id = self.word_id(word)
        if word_id is not None:
            return self.get(word_id)
        else:
            return defaultdict(int)
    
    def get_row_sum(self, word):
        """
        Get the number of total contexts available for a given word        
        """
        return sum(self.get_row(word).values())
    
    def get_col_sum(self, word):
        """
        Get the number of total contexts a given word occurs in
        """
        f_id = self.word_id(word)
        return sum([self[w][f_id] for w in self.keys()])
    
    @property
    def total_sum(self):
        return sum([self.get_row_sum(w) for w in self._word_mapping.keys()])

## 3. Populating Our Colocation Data

Now that we've got the data to store our colocations in, we need to populate it!

This simple code steps through our sentences up to `sent_limit` using a window size of `window_size` and grabs the words within that window to add to to colocation matrix.

Note that no special treatment is made for word-initial or word-final tokens here, but it's possible to create such a modification!

In [None]:
import string
window_size = 3
sent_limit = 1000
matrix = CollocationMatrix()
stopwords = nltk.corpus.stopwords.words('english')

for sent in brown_sents[:sent_limit]:
    
    print(sent)
       
    sent = [w for w in sent if w.lower() not in stopwords]
    
    sent = [w for w in sent if w.lower() not in string.punctuation]
    
    print(sent)
        
    for i, word in enumerate(sent):
        # Increment the count of words we've seen.
        for j in range(-window_size, window_size+1):
            # Skip counting the word itself.
            if j == 0:
                continue
                
            # At the beginning and end of the sentence,
            # you can either skip counting, or add a
            # unique "<START>" or "<END>" token to indicate
            # the word being colocated at the beginning or
            # end of sentences.            
            if len(sent) > i+j > 0:
                word_1 = sent[i].lower()
                word_2 = sent[i+j].lower()
                
                matrix.add_pair(word_1, word_2)  
                
    

def print_colocate(w_1, w_2):
    print('"{}" and "{}" seen together {} times.'.format(w_1, w_2,
                                                     matrix.get_pair(w_1, w_2)))
    
def print_count(word):
    print('"{}" has {} contexts in the data.'.format(word, 
                                                     matrix.get_row_sum(word)))

print_count('the')                
print_colocate('jury', 'grand')
print_colocate('primary', 'election')
print_colocate('midterm', 'election')

## 3. Calculating PMI

Now, having collocation counts is handy, as we've discussed in class, but recall that "the" is going to collocate with all sorts of words, and so isn't all that helpful.

Instead, we should try calculating Pointwise Mutual Information (PMI), which tells us, in essence, how likely it is to see two things in the same place, compared to seeing them independently.

The formula is: $\log \frac{p(w,f)}{p(w)\cdot p(f)}$, where $p(w,f)$ is the probability of seeing context $f$ for word $w$ out of all possible contexts; $p(w)$, is the probability of seeing word $w$ in any context, and $p(f)$ is the probability for the context $f$ across all words.

If we have access to the counts for each of these factors individually:

* Sum of all contexts (`matrix.total_sum`)
* All contexts for a given word (`matrix.get_row_sum(word)`)
* All contexts a given word appears in (`matrix.get_col_sum(word)`)

Write a function that calculates PPMI for word $w$ and context word $f$.

In [27]:
# Code to calculate PPMI for two words, using the
# colocation matrix calculated above.

import math

def calculate_ppmi(w, f):
    sum_all_context = matrix.total_sum
    word_count = matrix.get_row_sum(w)
    context_count = matrix.get_col_sum(f)
    joint_count = matrix.get_pair(w,f)
    
    if sum_all_context:    
        p_w = word_count/ float(sum_all_context)
        p_f = context_count/float(sum_all_context)
        p_w_f = joint_count/float(sum_all_context)
    
    if p_w * p_f > 0:
        ratio = (p_w_f)/(p_w * p_f)
        if ratio > 0:
            return max(math.log2(ratio),0)
        else:
            return 0
    
    else:
        return 0

#change tor eturn only positive#watch video

In [9]:
calculate_ppmi('grand', 'jury')

7.497352638218837

In [None]:
def calculate_freq(w, f):   
    return matrix.get_pair(w,f)

In [13]:
print(matrix)

{0: defaultdict(<class 'int'>, {1: 1, 2: 1, 3: 1}), 1: defaultdict(<class 'int'>, {2: 1, 3: 1, 4: 1}), 2: defaultdict(<class 'int'>, {1: 1, 3: 1, 4: 1, 5: 1}), 3: defaultdict(<class 'int'>, {1: 1, 2: 1, 4: 2, 5: 1, 6: 1, 18: 1, 19: 1}), 4: defaultdict(<class 'int'>, {1: 1, 2: 1, 3: 1, 5: 1, 6: 1, 7: 1, 18: 1, 19: 1, 20: 1}), 5: defaultdict(<class 'int'>, {2: 1, 3: 1, 4: 1, 6: 1, 7: 1, 8: 1}), 6: defaultdict(<class 'int'>, {3: 1, 4: 1, 5: 1, 7: 1, 8: 1, 9: 1}), 7: defaultdict(<class 'int'>, {4: 1, 5: 1, 6: 1, 8: 1, 9: 1, 10: 1}), 8: defaultdict(<class 'int'>, {5: 1, 6: 1, 7: 1, 9: 1, 10: 1, 11: 1}), 9: defaultdict(<class 'int'>, {6: 1, 7: 1, 8: 1, 10: 1, 11: 1, 12: 1}), 10: defaultdict(<class 'int'>, {7: 1, 8: 1, 9: 1, 11: 1, 12: 2, 13: 1, 22: 1, 23: 1, 24: 1, 25: 1, 26: 1, 28: 1, 14: 1, 29: 1, 30: 1}), 11: defaultdict(<class 'int'>, {8: 1, 9: 1, 10: 1, 12: 1, 13: 1, 14: 1}), 12: defaultdict(<class 'int'>, {9: 1, 10: 2, 11: 1, 13: 1, 14: 1, 15: 1, 23: 1, 24: 1, 25: 1, 26: 1, 27: 1}), 13

In [12]:
import numpy as np
ppi_weighted =  np.zeros((len(matrix), len(matrix[0])))
print(ppi_weighted.shape)

(31, 3)


In [14]:
print(matrix[30])

defaultdict(<class 'int'>, {14: 1, 29: 1, 10: 1})


In [15]:
print(matrix[30][0])

0


In [16]:
print(matrix[30][10])

1


In [17]:
print(matrix._word_mapping)

{'fulton': 0, 'county': 1, 'grand': 2, 'jury': 3, 'said': 4, 'friday': 5, 'investigation': 6, "atlanta's": 7, 'recent': 8, 'primary': 9, 'election': 10, 'produced': 11, '``': 12, 'evidence': 13, "''": 14, 'irregularities': 15, 'took': 16, 'place': 17, 'term-end': 18, 'presentments': 19, 'city': 20, 'executive': 21, 'committee': 22, 'over-all': 23, 'charge': 24, 'deserves': 25, 'praise': 26, 'thanks': 27, 'atlanta': 28, 'manner': 29, 'conducted': 30}


In [18]:
print(matrix._word_mapping.keys())

dict_keys(['fulton', 'county', 'grand', 'jury', 'said', 'friday', 'investigation', "atlanta's", 'recent', 'primary', 'election', 'produced', '``', 'evidence', "''", 'irregularities', 'took', 'place', 'term-end', 'presentments', 'city', 'executive', 'committee', 'over-all', 'charge', 'deserves', 'praise', 'thanks', 'atlanta', 'manner', 'conducted'])


In [29]:
vocab_size = len(matrix._word_mapping.keys())
print(vocab_size)
ppmi_matrix = np.zeros((vocab_size,vocab_size))

vocab = list(matrix._word_mapping.keys())

for word_1 in vocab:
    for word_2 in vocab:        
        ppmi = calculate_ppmi(word_1,word_2)
        w_id_1 = matrix.word_id(word_1, store_new=True)
        w_id_2 = matrix.word_id(word_2, store_new=True)
        ppmi_matrix[w_id_1][w_id_2] = ppmi
        

print(ppmi_matrix)

31
[[ 0.          0.          0.          0.          0.          0.          0.
   0.          0.          0.          0.          0.          0.          0.
   0.          0.          0.          0.          0.          0.          0.
   0.          0.          0.          0.          0.          0.          0.
   0.          0.          0.        ]
 [ 0.          0.          3.72246602  3.45943162  2.72246602  0.          0.
   0.          0.          0.          0.          0.          0.          0.
   0.          0.          0.          0.          0.          0.          0.
   0.          0.          0.          0.          0.          0.          0.
   0.          0.          0.        ]
 [ 0.          3.62935662  0.          3.04439412  2.30742853  3.04439412
   0.          0.          0.          0.          0.          0.          0.
   0.          0.          0.          0.          0.          0.          0.
   0.          0.          0.          0.          0.          0.

In [33]:
with open(r'C:\Users\priyak.REDMOND\PycharmProjects\Ling571_hw7_distributed_semantics\data\mc_similarity.txt','r') as hj_file:
    judgements = hj_file.readlines()
    for line in judgements:
        print(line)
        word1 = line[0]
        word2 = line[1]        
        w_id_1 = matrix.word_id(word_1, store_new=True)
        w_id_2 = matrix.word_id(word_2, store_new=True)
        arr_w1 = ppmi_matrix[w_id_1]        
        a1 = arr_w1.argsort()[-10:][::-1]
        print(a1)
        arr_w2 = ppmi_matrix[w_id_2]        
        a2 = arr_w2.argsort()[-10:][::-1]
        print(a2)

car,automobile,3.92

[ 0.          0.          0.          0.          0.          0.          0.
  0.          0.          0.          2.04439412  0.          0.          0.
  2.45943162  0.          0.          0.          0.          0.          0.
  0.          0.          0.          0.          0.          0.          0.
  0.          3.72246602  0.        ]
[ 0.          0.          0.          0.          0.          0.          0.
  0.          0.          0.          2.04439412  0.          0.          0.
  2.45943162  0.          0.          0.          0.          0.          0.
  0.          0.          0.          0.          0.          0.          0.
  0.          3.72246602  0.        ]
journey,voyage,3.84

[ 0.          0.          0.          0.          0.          0.          0.
  0.          0.          0.          2.04439412  0.          0.          0.
  2.45943162  0.          0.          0.          0.          0.          0.
  0.          0.          0.       

In [None]:
from sklearn.metrics.pairwise import cosine_similarity
cosine_similarity([1, 0, -1], [-1,-1, 0])

from numpy import dot
from numpy.linalg import norm

cos_sim = dot(a, b)/(norm(a)*norm(b))

#https://masongallo.github.io/machine/learning,/python/2016/07/29/cosine-similarity.html