# Word similarity

**Goal:** Quantifying the similarity between pairs of words using the structure of WordNet and word co-occurrence in the Brown corpus, using PMI, LSA, and word2vec. The objective is to quantify how well these methods work by comparing to a carefully filtered human annotated gold-standard.

**Dependencies:** NLTK, Numpy, Scipy, Matplotlib, Scikit-Learn, and Gensim

For this workshop, we will compare the methods against a popular dataset of word similaritites called Similarity-353. The file will be downloaded and unzipped automatically. 

The dataset contains many rare words that should be filtered. The first goal is to generate a smaller test set where you will evaluate the word similarity methods.

The filtering is meant to be based on document frequencies in the Brown corpus, in order to remove rare words. The paragraphs of Brown are meant to be considered as _documents_, they are iterable using the `paras` method of the corpus reader. The treatment is to lower-case and then lemmatize before adding the elements to the set. Then, using the information in this corpus, calculate document frequencies and remove from your test set any word pairs where at least one of the two words has a document frequency of less than 10 in this corpus. 

The second filter is based on words with highly ambiguous senses and involves using the NLTK WordNet corpus. The first step is to remove any words which don't have a _single primary sense_ (i.e., either having only one sense, or synset, or where the count of the most common sense is at least five and at least five times larger than the next most common sense). Also, it's expected to remove any words where the primary sense is not a noun. Removing any word pairs from the test set where at least one of the words doesn't cotain a single primary sense or if the single primary sense is not a noun.

Finally print out all the pairs in the filtered test set.

In [1]:
!rm -rf *
!wget http://www.cs.technion.ac.il/~gabr/resources/data/wordsim353/wordsim353.zip
!unzip wordsim353.zip -d data/
!pip3 install nltk numpy scipy matplotlib sklearn gensim
import nltk
nltk.download('all')

--2019-05-27 02:36:56--  http://www.cs.technion.ac.il/~gabr/resources/data/wordsim353/wordsim353.zip
Resolving www.cs.technion.ac.il (www.cs.technion.ac.il)... 132.68.32.15
Connecting to www.cs.technion.ac.il (www.cs.technion.ac.il)|132.68.32.15|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 23257 (23K) [application/zip]
Saving to: ‘wordsim353.zip’


2019-05-27 02:36:59 (9,98 KB/s) - ‘wordsim353.zip’ saved [23257/23257]

Archive:  wordsim353.zip
  inflating: data/combined.csv       
  inflating: data/set1.csv           
  inflating: data/set2.csv           
  inflating: data/combined.tab       
  inflating: data/set1.tab           
  inflating: data/set2.tab           
  inflating: data/instructions.txt   
Collecting nltk
  Downloading https://files.pythonhosted.org/packages/73/56/90178929712ce427ebad179f8dc46c8deef4e89d4c853092bee1efd57d05/nltk-3.4.1.zip (3.1MB)
[K    100% |████████████████████████████████| 3.1MB 422kB/s ta 0:00:01
[?25hCollecting numpy
  U

[nltk_data] Downloading collection 'all'
[nltk_data]    | 
[nltk_data]    | Downloading package abc to /home/mesi/nltk_data...
[nltk_data]    |   Package abc is already up-to-date!
[nltk_data]    | Downloading package alpino to /home/mesi/nltk_data...
[nltk_data]    |   Package alpino is already up-to-date!
[nltk_data]    | Downloading package biocreative_ppi to
[nltk_data]    |     /home/mesi/nltk_data...
[nltk_data]    |   Package biocreative_ppi is already up-to-date!
[nltk_data]    | Downloading package brown to /home/mesi/nltk_data...
[nltk_data]    |   Package brown is already up-to-date!
[nltk_data]    | Downloading package brown_tei to
[nltk_data]    |     /home/mesi/nltk_data...
[nltk_data]    |   Package brown_tei is already up-to-date!
[nltk_data]    | Downloading package cess_cat to
[nltk_data]    |     /home/mesi/nltk_data...
[nltk_data]    |   Package cess_cat is already up-to-date!
[nltk_data]    | Downloading package cess_esp to
[nltk_data]    |     /home/mesi/nltk_data

[nltk_data]    |   Package timit is already up-to-date!
[nltk_data]    | Downloading package toolbox to
[nltk_data]    |     /home/mesi/nltk_data...
[nltk_data]    |   Package toolbox is already up-to-date!
[nltk_data]    | Downloading package treebank to
[nltk_data]    |     /home/mesi/nltk_data...
[nltk_data]    |   Package treebank is already up-to-date!
[nltk_data]    | Downloading package twitter_samples to
[nltk_data]    |     /home/mesi/nltk_data...
[nltk_data]    |   Package twitter_samples is already up-to-date!
[nltk_data]    | Downloading package udhr to /home/mesi/nltk_data...
[nltk_data]    |   Package udhr is already up-to-date!
[nltk_data]    | Downloading package udhr2 to /home/mesi/nltk_data...
[nltk_data]    |   Package udhr2 is already up-to-date!
[nltk_data]    | Downloading package unicode_samples to
[nltk_data]    |     /home/mesi/nltk_data...
[nltk_data]    |   Package unicode_samples is already up-to-date!
[nltk_data]    | Downloading package universal_treebanks

True

In [2]:
import re
import nltk
from nltk.corpus import brown
from nltk.stem import WordNetLemmatizer
import nltk.corpus.reader.plaintext as reader
from nltk.wsd import wordnet as wn

dataset_path = 'data/combined.tab'
head = str('Word 1\tWord 2\tHuman (mean)\n')

def con_dic(dataset_path):
  dataset = open(dataset_path, "r")
  next(dataset)
  testset = {}
  #key = (word1,word2) word pairs in combined.tag
  for l in dataset:
    k1 = re.sub(r'^([a-zA-Z]+)	[a-zA-Z]+	[0-9.]+$',r'\1',l)
    k2 = re.sub(r'^[a-zA-Z]+	([a-zA-Z]+)	[0-9.]+$',r'\1',l)
    key = (k1.strip('\n'),k2.strip('\n'))
    value = re.sub(r'^[a-zA-Z]+	[a-zA-Z]+	([0-9.]+)$',r'\1',l)
    testset[key] = float(value)
  return testset

def doc_setup(dictionary):
  ds = []
  paras = brown.paras()
  wordnet_lemmatizer = WordNetLemmatizer()
  for doc in paras:
    d = []
    for sent in doc:
      for word in sent:
        w = wordnet_lemmatizer.lemmatize(word.lower())
        d.append(w)
    ds.append(set(d))
  return ds

def doc_frequency(w, docs):
  df = 0
  for p in docs:
    if w in p:
      df += 1
  return df

def first_filter(dictionary, remove_degree):
  docs = doc_setup(dictionary)
  for (k1,k2) in list(dictionary):
    if doc_frequency(k1,docs) < remove_degree or doc_frequency(k2,docs) < remove_degree:
      del dictionary[(k1,k2)]
  print("testset length after 1st filter:")
  print(len(dictionary))
  return dictionary, docs

def second_max(a):
  hi = mid = 0
  for x in a:
    if x > hi:
      mid = hi
      hi = x
    elif x < hi and x > mid:
      lo = mid
      mid = x
  return mid

def primary_filter(word, filter_degree):
  synsets = wn.synsets(word)
  # only one synset
  if len(synsets) == 1:
    if synsets[0].pos() != 'n':
      return None
    else:
      return synsets[0].name()
  #more than one synset
  elif len(synsets) > 1:
    return lemma_filter(word, filter_degree)

def lemma_filter(word, filter_degree):
  synsets = wn.synsets(word)
  count = {}
  for index,s in enumerate(synsets):
    #record compare most common n next most common
    c = 0
    for lemma in s.lemmas():
      lemma_name = lemma.name()
      if lemma_name == word:
        c = lemma.count()
        count[c] = index
    secmax = second_max(list(count))
    c = max(list(count))
    
    if synsets[count.get(c)].pos() == 'n':
      #at least five or five times bigger
      if c >= filter_degree and c >= filter_degree * secmax:
        return synsets[count.get(c)].name()
      else:
        return None
    else:
      return None

def second_filter(dictionary, filter_degree, add_synset):
  for (k1,k2) in list(dictionary):
    f1 = primary_filter(k1,filter_degree)
    f2 = primary_filter(k2,filter_degree)
    if f1 and f2:
      synsets = [f1,f2]
      if add_synset is True:
        dictionary[(k1,k2)] = synsets
    else:
      del dictionary[(k1,k2)]
  if add_synset is True:
    print("second processed dataset length:")
    print(len(dictionary))
    print(dictionary)
  return dictionary

testset_dic = con_dic(dataset_path)
testset_dic, docs = first_filter(testset_dic, 10)
testset_dic = second_filter(testset_dic, 5, True)

testset length after 1st filter:
222
second processed dataset length:
170


**Wu-Palmer:** Create several dictionaries with similarity scores for pairs of words in the test set derived. Use [Wu-Palmer](https://blog.thedigitalgroup.com/words-similarityrelatedness-using-wupalmer-algorithm) scores derived from the hypernym relationships in WordNet, which should be calculated using the primary sense for each word derived above. Print out the dictionary of word pair/similarity mappings.


In [3]:
def WP_scores(dictionary):
  similarity_dict = {}
  for (k1,k2) in list(dictionary):
    synsets = dictionary.get((k1,k2))
    w1 = wn.synset(synsets[0])
    w2 = wn.synset(synsets[1])
    score=w1.wup_similarity(w2)
    similarity_dict[(k1,k2)] = score
  print(similarity_dict)
  return similarity_dict

WP_sim = WP_scores(testset_dic)
print(len(testset_dic))

170


**Positive Point Mutual Information (PPMI):** Calculate PPMI for the word pairs using statistics derived from the Brown corpus: use the same set up as for the calculation of document frequency above: paragraphs as documents, lemmatized, lower-cased, and with term frequency information removed by conversion to Python sets. Avoid building the entire co-occurrence matrix, instead keep track of the sums needed for the probabilities as going. When PMI is calculated for all the pairs, the code should print out the Python dictionary of word pair/PPMI similarity mappings.

In [4]:
import math
def PPMI(word1, word2, documents):
  w1_count = 0
  w2_count = 0
  total_count = 0
  both_count = 0
  for doc in documents:
    for i in doc:
      total_count += 1
      if i == word1:
        w1_count += 1
        for w in doc:
          if w == word2:
            both_count += 1
      elif i == word2:
        w2_count += 1
  base = (both_count/total_count)/((w1_count/total_count)*(w2_count/total_count))
  if base > 0:
    PMI = math.log((both_count/total_count)/((w1_count/total_count)*(w2_count/total_count)), 2)
    return PMI
  else:
    return float(0)
  
def dic_PPMI(dictionary, docs):
  pmi_dic = {}
  for (k1,k2) in list(dictionary):
    pmi = PPMI(k1,k2,docs)
    pmi_dic[(k1,k2)] = pmi
  print(pmi_dic)
  return pmi_dic

PPMI_sim = dic_PPMI(testset_dic, docs)



**[Latent Semantic Analysis (LSA)](https://en.wikipedia.org/wiki/Latent_semantic_analysis):** Derive similarity scores using LSA, applying [Singular Value Decomposition (SVD)](https://en.wikipedia.org/wiki/Singular_value_decomposition) and truncate to get a dense vector and then use cosine similarity between the two vectors for each word pair. As result we'll have a matrix where rows (non-sparse) correspond to words in vocabulary, and the columns (sparse) correspond to the texts where they appear. Using the Brown corpus, in the same format as with PPMI and document frequency. After having a matrix in the correct format, use truncated SVD in the Sci-Kit lear pack to produce dense vectors of length 500, and then use the cosine similarity to produce similarities for the word pairs and print the resulting dictionary.

In [5]:
from sklearn.feature_extraction import DictVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.decomposition import TruncatedSVD
from scipy.spatial.distance import cosine as cos_distance

def get_BOW(text):
  BOW = {}
  for word in text:
    BOW[word.lower()] = BOW.get(word.lower(),0) + 1
  return BOW

def cons_feature_matrix(docs):
  texts = []
  for doc in docs:
    texts.append(get_BOW(doc))
  vectorizer = DictVectorizer()
  #Tfidf is optional
  transformer = TfidfTransformer(smooth_idf=False,norm=None)
  svd = TruncatedSVD(n_components=500)
  # create truncated transposed document-word feature matrix 
  # as word-doc feature matrix
  feature_matrix = svd.fit_transform(vectorizer.fit_transform(texts).T)
  return feature_matrix,vectorizer

def cos_sim(testset,docs,feature_matrix, vectorizer):
  look_up = {}
  cos_dic = {}
  for index,w in enumerate(vectorizer.get_feature_names()):
    look_up[w]=index
  for (k1,k2) in list(testset):
    v1 = look_up.get(k1)
    v2 = look_up.get(k2)
    cos = cos_distance(feature_matrix[v1],feature_matrix[v2])
    # cosine_similary = 1-cosine_distance
    cos_dic[(k1,k2)]=1-cos
  print(cos_dic)
  return cos_dic

feature_matrix,vectorizer = cons_feature_matrix(docs)
cos_sim = cos_sim(testset_dic,docs,feature_matrix, vectorizer)



**Word2Vec:** Derive a similarity score from word2vec vectors using the Gensim pack (tutorial [here](https://radimrehurek.com/gensim/models/word2vec.html)). Again, use the Brown corpus. In this case train the model at the sentence level rather than in paragraphs. The vectors should be 500 length (same dimensions as LSA), running 50 iterations (**this is going to take several minutes**). Extract similarities directly from the Gensim model, put them into a dictionary and print them out.

In [9]:
from gensim.models import Word2Vec

def train_w2v():
    sentences = brown.sents()   
    model = Word2Vec(sentences, size=500, iter=50)
    return model

def w2v_sim_w(word1, word2, model):
    similarity = model.wv.similarity(word1, word2)
    #print(similarity)
    return similarity
    
    
def w2v_sim(dataset, model):
    w2v_sim = {}
    for (k1,k2) in list(dataset):
        sim = w2v_sim_w(k1,k2,model)
        w2v_sim[(k1,k2)] = sim
    print(w2v_sim)
    return w2v_sim
        
    
model = train_w2v()    
W2V = w2v_sim(testset_dic, model)

KeyError: "word 'medal' not in vocabulary"

**Comparison:** Finally, compare all the similarities created to the gold standard in the first step. Use the [Pearson correlation coefficient](https://en.wikipedia.org/wiki/Pearson_correlation_coefficient) `pearsonr`, included in `scipy.stats`.  Pay attention to dictionaries conversion to lists, data must be in the same order for correct comparison via correlation. Write a general function, apply it to each of the similarity score dictionaries and print out the result for each. 

In [8]:
from scipy import stats

ori_testset = con_dic(dataset_path)
ori_test, docs = first_filter(ori_testset, 10)
testset_tem = second_filter(ori_testset,5,False)

def dic2list(dic1, dic2):
  l1 = []
  l2 = []
  for (k1,k2) in list(dic1):
    l1.append(dic1.get((k1,k2)))
    l2.append(dic2.get((k1,k2)))
  return l1,l2



olist, WPlist   = dic2list(ori_testset, WP_sim)
olist, PPMIlist = dic2list(ori_testset,PPMI_sim)
olist, COSlist  = dic2list(ori_testset,cos_sim)
#olist, W2Vlist  = dic2list(ori_testset, W2V)
print("wu-Palmer:")
print(stats.pearsonr(olist, WPlist))
print("PPMI:")
print(stats.pearsonr(olist, PPMIlist))
print("Cosine:")
print(stats.pearsonr(olist,COSlist))
print("word2vector:")
#print(stats.pearsonr(olist,W2Vlist))

testset length after 1st filter:
222
wu-Palmer:
(0.3565521640169933, 1.8207081504242935e-06)
PPMI:
(0.4213300206094628, 1.0555076721264572e-08)
Cosine:
(0.41040558417467576, 2.7226407050210468e-08)
word2vector:
