## Introduction to NLP course (2017-2018).

### Homework 3: Distributional semantic models.

### Jonatan Piñol & Peter Weber

Objectives:

1) Obtain co-occurrence vector representations with the following properties:
- window size 1, pmi, svd (50)
- window size 3, no modifications
- window size 3, pmi, no svd
- window size 3, no pmi, svd (50)
- window size 3, pmi, svd (50)
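
The window size controls which neighbouring tokens count as context for a target word. The sketch below illustrates this with a symmetric, sentence-bounded window. It is only an assumption about how such counts can be built: the `gutenberg_surface_*.sm` files used later are loaded ready-made, and `cooccurrence_counts` and the toy sentence are hypothetical names and data introduced here.

```python
from collections import Counter

def cooccurrence_counts(sentences, window):
    """Count (target, context) pairs inside a symmetric window, per sentence."""
    counts = Counter()
    for tokens in sentences:
        for i, target in enumerate(tokens):
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:  # a token is not its own context
                    counts[(target, tokens[j])] += 1
    return counts

# Hypothetical toy corpus: "man" and "woman" are three tokens apart
toy = [["the", "man", "saw", "the", "woman"]]
c1 = cooccurrence_counts(toy, window=1)
c3 = cooccurrence_counts(toy, window=3)
print(c1[("man", "woman")], c3[("man", "woman")])
```

With window size 1 the pair ("man", "woman") is never counted, whereas window size 3 reaches it, so wider windows give denser vectors.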

2) Obtain word2vec embeddings with the following properties:
- window size 1, 50 dimensions
- window size 1, 200 dimensions
- window size 3, 50 dimensions
- window size 3, 200 dimensions
- window size 5, 50 dimensions

3) Compare the performance of the 10 representations in 1 and 2 on the following tasks:
- similarity between "man" and "woman"
- the 5 most similar words to "car"
- for DISSECT representations, correlation with the gold standard
- for Word2Vec, the similarity between "queen" and "king + woman - man"
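
All of these tasks reduce to cosine similarity between vectors. As a minimal sketch, assuming hand-made 2-dimensional vectors (the values, `cosine` and `nearest` are hypothetical and not taken from any trained model), the similarity, nearest-neighbour and analogy computations look like this:

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two 1-d vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def nearest(word, vectors, topn=5):
    """Rank all other words by cosine similarity to `word`."""
    scores = [(w, cosine(vectors[word], v))
              for w, v in vectors.items() if w != word]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]

# Hypothetical toy vectors: axis 0 ~ "royalty", axis 1 ~ "gender"
vectors = {
    "king":  np.array([1.0, 1.0]),
    "queen": np.array([1.0, -1.0]),
    "man":   np.array([0.0, 1.0]),
    "woman": np.array([0.0, -1.0]),
}

# "king - man + woman" should land close to "queen"
analogy = vectors["king"] - vectors["man"] + vectors["woman"]
print(cosine(analogy, vectors["queen"]))
print(nearest("man", vectors, topn=2))
```

The sketch leaves the query word out of its own neighbour list, as the Word2Vec output further down does. The Dissect output, in contrast, lists `'car'` itself as the first neighbour.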

In [2]:
# Import section
import nltk
from nltk.corpus import gutenberg
from nltk import FreqDist
from nltk.collocations import *
import re
from collections import Counter
import numpy as np
import operator
from scipy import spatial

# Dissect
from composes.semantic_space.space import Space
from composes.utils import io_utils
from composes.transformation.scaling.ppmi_weighting import PpmiWeighting
from composes.transformation.dim_reduction.svd import Svd
from composes.similarity.cos import CosSimilarity
from composes.utils import scoring_utils

# Gensim
import gensim 

In [8]:
# Path to the folder where the data files are
my_path = "../Data/"

# Loading the matrix from the three different files
my_space_1 = Space.build(data = my_path + "gutenberg_surface_1.sm",
                       rows = my_path + "gutenberg_surface_1.rows",
                       cols = my_path + "gutenberg_surface_1.cols",
                       format = "sm")

Progress...1000000


In [7]:
# Path to the folder where the data files are
my_path = "../Data/"

# Loading the matrix from the three different files
my_space_3 = Space.build(data = my_path + "gutenberg_surface_3.sm",
                       rows = my_path + "gutenberg_surface_3.rows",
                       cols = my_path + "gutenberg_surface_3.cols",
                       format = "sm")

Progress...1000000
Progress...2000000
Progress...3000000


In [12]:
# window size 1, pmi, svd (50)
my_ppmi_svd_space_1 = my_space_1.apply(PpmiWeighting()).apply(Svd(50))

# window size 3, no modifications
my_space_3

# window size 3, pmi, no svd
my_ppmi_space_3 = my_space_3.apply(PpmiWeighting())

# window size 3, no pmi, svd (50)
my_svd_space_3 = my_space_3.apply(Svd(50))

# window size 3, pmi, svd (50)
my_ppmi_svd_space_3 = my_ppmi_space_3.apply(Svd(50))
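
To make the two transformations above more concrete, here is a minimal NumPy sketch of positive PMI weighting followed by a truncated SVD on a small dense matrix. It is an illustration under assumptions, not Dissect's implementation: `ppmi`, `truncated_svd` and the toy counts are hypothetical, and whether Dissect scales the reduced rows by the singular values in the same way was not checked here.

```python
import numpy as np

def ppmi(counts):
    """Positive PMI of a dense co-occurrence count matrix."""
    total = counts.sum()
    p_row = counts.sum(axis=1, keepdims=True) / total
    p_col = counts.sum(axis=0, keepdims=True) / total
    p_joint = counts / total
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log(p_joint / (p_row * p_col))
    pmi[~np.isfinite(pmi)] = 0.0   # zero counts give log(0) = -inf
    return np.maximum(pmi, 0.0)    # keep only positive associations

def truncated_svd(matrix, k):
    """Keep the top-k singular directions; rows become k-dimensional."""
    u, s, _ = np.linalg.svd(matrix, full_matrices=False)
    return u[:, :k] * s[:k]

# Hypothetical 3x3 count matrix (rows = targets, columns = contexts)
counts = np.array([[0.0, 4.0, 1.0],
                   [4.0, 0.0, 1.0],
                   [1.0, 1.0, 2.0]])
weighted = ppmi(counts)
reduced = truncated_svd(weighted, k=2)
print(weighted.shape, reduced.shape)
```

PPMI zeroes every cell whose words co-occur no more often than chance would predict. This fits the results below, where the PPMI-only space gives far lower cosine values than the raw or SVD-reduced spaces.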



In [14]:
## Load the corpus
corpus = gutenberg.sents()

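# Set up the word2vec models.
# NB: word2vec requires a tokenized (and possibly sentence- or document-
# segmented) corpus: a list of tokens or a list of lists of tokens.
# A raw corpus must be pre-processed before it is passed to word2vec.
# Passing the corpus to the constructor already builds the vocabulary and
# trains the model; the explicit train() call then runs 10 further epochs.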


# window size 1, 50 dimensions
model_1_50 = gensim.models.Word2Vec (corpus, size=50, window=1, min_count=2, workers=10)
model_1_50.train(corpus,total_examples=len(corpus),epochs=10)


# window size 1, 200 dimensions
model_1_200 = gensim.models.Word2Vec (corpus, size=200, window=1, min_count=2, workers=10)
model_1_200.train(corpus,total_examples=len(corpus),epochs=10)

# window size 3, 50 dimensions
model_3_50 = gensim.models.Word2Vec (corpus, size=50, window=3, min_count=2, workers=10)
model_3_50.train(corpus,total_examples=len(corpus),epochs=10)

# window size 3, 200 dimensions
model_3_200 = gensim.models.Word2Vec (corpus, size=200, window=3, min_count=2, workers=10)
model_3_200.train(corpus,total_examples=len(corpus),epochs=10)

# window size 5, 50 dimensions
model_5_50 = gensim.models.Word2Vec (corpus, size=50, window=5, min_count=2, workers=10)
model_5_50.train(corpus,total_examples=len(corpus),epochs=10)


(18421219, 26217850)

Performance on the Dissect representations

In [13]:
# Comparing similarity between "man" and "woman"
print "Calculating similarity between man and woman","\n"
print "PPMI and SVD matrix window size 1",my_ppmi_svd_space_1.get_sim("man", "woman", CosSimilarity())
print "No modifications, window size 3",my_space_3.get_sim("man", "woman", CosSimilarity())
print "PPMI matrix window size 3",my_ppmi_space_3.get_sim("man", "woman", CosSimilarity())
print "SVD matrix window size 3",my_svd_space_3.get_sim("man", "woman", CosSimilarity())
print "PPMI and SVD matrix window size 3",my_ppmi_svd_space_3.get_sim("man", "woman", CosSimilarity())
print "----------------------------------------------\n"

# Comparing the 5 most similar words to "car"
print "Obtaining the 5 most similar words to 'car'\n"
print "PPMI and SVD matrix window size 1",my_ppmi_svd_space_1.get_neighbours("car", 5, CosSimilarity()),"\n"
print "No modifications, window size 3",my_space_3.get_neighbours("car", 5, CosSimilarity()),"\n"
print "PPMI matrix window size 3",my_ppmi_space_3.get_neighbours("car", 5, CosSimilarity()),"\n"
print "SVD matrix window size 3",my_svd_space_3.get_neighbours("car", 5, CosSimilarity()),"\n"
print "PPMI and SVD matrix window size 3",my_ppmi_svd_space_3.get_neighbours("car", 5, CosSimilarity()),"\n"
print "----------------------------------------------\n"


# Comparing the similarity with "gold standard"
print "Comparing similarity with 'gold standard'"
fname = my_path + "synonyms.txt"
# Load the pairs
word_pairs = io_utils.read_tuple_list(fname, fields=[0,1])
# Load the score
gold = io_utils.read_list(fname, field=2)
# Predict similarity
predicted_ppmi_svd = [round(sim,2) for sim in my_ppmi_svd_space_1.get_sims(word_pairs, CosSimilarity())]
print "Pairs:",word_pairs
print "Gold scores",gold
print "\n PPMI and SVD matrix window size 1:"
print "Predicted scores",predicted_ppmi_svd
print "Spearman correlation:",scoring_utils.score(gold, predicted_ppmi_svd, "spearman")
print "Pearson correlation:",scoring_utils.score(gold, predicted_ppmi_svd, "pearson")
print "\n ---------- \n"

predicted_ppmi_svd = [round(sim,2) for sim in my_space_3.get_sims(word_pairs, CosSimilarity())]
print "\n No modifications, window size 3:"
print "Predicted scores",predicted_ppmi_svd
print "Spearman correlation:",scoring_utils.score(gold, predicted_ppmi_svd, "spearman")
print "Pearson correlation:",scoring_utils.score(gold, predicted_ppmi_svd, "pearson")
print "\n ---------- \n"

predicted_ppmi_svd = [round(sim,2) for sim in my_ppmi_space_3.get_sims(word_pairs, CosSimilarity())]
print "\n PPMI matrix window size 3:"
print "Predicted scores",predicted_ppmi_svd
print "Spearman correlation:",scoring_utils.score(gold, predicted_ppmi_svd, "spearman")
print "Pearson correlation:",scoring_utils.score(gold, predicted_ppmi_svd, "pearson")
print "\n ---------- \n"

predicted_ppmi_svd = [round(sim,2) for sim in my_svd_space_3.get_sims(word_pairs, CosSimilarity())]
print "\n SVD matrix window size 3:"
print "Predicted scores",predicted_ppmi_svd
print "Spearman correlation:",scoring_utils.score(gold, predicted_ppmi_svd, "spearman")
print "Pearson correlation:",scoring_utils.score(gold, predicted_ppmi_svd, "pearson")
print "\n ---------- \n"

predicted_ppmi_svd = [round(sim,2) for sim in my_ppmi_svd_space_3.get_sims(word_pairs, CosSimilarity())]
print "\n PPMI and SVD matrix window size 3:"
print "Predicted scores",predicted_ppmi_svd
print "Spearman correlation:",scoring_utils.score(gold, predicted_ppmi_svd, "spearman")
print "Pearson correlation:",scoring_utils.score(gold, predicted_ppmi_svd, "pearson")
print "\n ---------- \n"


Calculating similarity between man and woman 

PPMI and SVD matrix window size 1 0.9019762337114995
No modifications, window size 3 0.9686231105706885
PPMI matrix window size 3 0.10418482739863029
SVD matrix window size 3 0.9812513957585547
PPMI and SVD matrix window size 3 0.7836632623853524
----------------------------------------------

Obtaining the 5 most similar words to 'car'

PPMI and SVD matrix window size 1 [('car', 1.0), ('table', 0.8509212821582576), ('floor', 0.8463456622201355), ('chimney', 0.8439582387251383), ('dining', 0.8411947312022016)] 

No modifications, window size 3 [('car', 1.0), ('key', 0.9330561909016643), ('wall', 0.9303721008975641), ('street', 0.930343415021141), ('sea', 0.9249835568137234)] 

PPMI matrix window size 3 [('car', 1.0), ('bicycles', 0.12817727288119293), ('popping', 0.12467056194780535), ('corpusants', 0.1071900718295629), ('stoical', 0.10669112901298419)] 

SVD matrix window size 3 [('car', 0.9999999999999998), ('key', 0.9870374477494949), (
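
The gold-standard comparison above relies on `scoring_utils.score`. As a rough cross-check, the same two correlation measures can be computed with SciPy. This is a sketch with made-up numbers: the `gold` and `predicted` lists below are hypothetical and are not the contents of `synonyms.txt`.

```python
from scipy import stats

# Hypothetical gold ratings and model similarities for four word pairs
gold = [9.0, 7.5, 3.0, 1.0]
predicted = [0.81, 0.90, 0.35, 0.10]

# Both functions return (coefficient, p-value); only the coefficient is used
rho = stats.spearmanr(gold, predicted)[0]
r = stats.pearsonr(gold, predicted)[0]
print("Spearman: {:.2f}".format(rho))
print("Pearson:  {:.2f}".format(r))
```

Spearman depends only on the rank order of the scores, so here it is 0.8 because only the top two pairs swap places. Rounding the predicted similarities to two decimals, as the cell above does, can introduce ties and thereby shift the Spearman value slightly.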

Performance on the Word2Vec representations

In [18]:

print("\nwindow size 1, 50 dimensions")
# Calculate the similarity between "man" and "woman"
print("\nThe similarity between ``man'' and ``woman'' is: {}".format (model_1_50.wv.similarity(w1="man",w2="woman")))

# Get the 5 words most similar to "Car"
print ("\nThe five words most similar to car are: {}".format(model_1_50.wv.most_similar (positive="car", topn=5)))

# Subtract "man" from "king", add "woman" and compare with "queen"
custom_queen = np.add(np.subtract(model_1_50.wv["king"],model_1_50.wv["man"]),model_1_50.wv["woman"])
print("\nThe similarity between ``queen'' and ``king - man + woman'' is {}".format(1 - spatial.distance.cosine(custom_queen,model_1_50.wv["queen"])))

print("\n-------------\n")

print("window size 1, 200 dimensions")
# Calculate the similarity between "man" and "woman"
print("\nThe similarity between ``man'' and ``woman'' is: {}".format (model_1_200.wv.similarity(w1="man",w2="woman")))

# Get the 5 words most similar to "Car"
print ("\nThe five words most similar to car are: {}".format(model_1_200.wv.most_similar (positive="car", topn=5)))

# Subtract "man" from "king", add "woman" and compare with "queen"
custom_queen = np.add(np.subtract(model_1_200.wv["king"],model_1_200.wv["man"]),model_1_200.wv["woman"])
print("\nThe similarity between ``queen'' and ``king - man + woman'' is {}".format(1 - spatial.distance.cosine(custom_queen,model_1_200.wv["queen"])))

print("\n-------------\n")

print("window size 3, 50 dimensions")
# Calculate the similarity between "man" and "woman"
print("\nThe similarity between ``man'' and ``woman'' is: {}".format (model_3_50.wv.similarity(w1="man",w2="woman")))

# Get the 5 words most similar to "Car"
print ("\nThe five words most similar to car are: {}".format(model_3_50.wv.most_similar (positive="car", topn=5)))

# Subtract "man" from "king", add "woman" and compare with "queen"
custom_queen = np.add(np.subtract(model_3_50.wv["king"],model_3_50.wv["man"]),model_3_50.wv["woman"])
print("\nThe similarity between ``queen'' and ``king - man + woman'' is {}".format(1 - spatial.distance.cosine(custom_queen,model_3_50.wv["queen"])))

print("\n-------------\n")

print("window size 3, 200 dimensions")
# Calculate the similarity between "man" and "woman"
print("\nThe similarity between ``man'' and ``woman'' is: {}".format (model_3_200.wv.similarity(w1="man",w2="woman")))

# Get the 5 words most similar to "Car"
print ("\nThe five words most similar to car are: {}".format(model_3_200.wv.most_similar (positive="car", topn=5)))

# Subtract "man" from "king", add "woman" and compare with "queen"
custom_queen = np.add(np.subtract(model_3_200.wv["king"],model_3_200.wv["man"]),model_3_200.wv["woman"])
print("\nThe similarity between ``queen'' and ``king - man + woman'' is {}".format(1 - spatial.distance.cosine(custom_queen,model_3_200.wv["queen"])))

print("\n-------------\n")

print("window size 5, 50 dimensions")
# Calculate the similarity between "man" and "woman"
print("\nThe similarity between ``man'' and ``woman'' is: {}".format (model_5_50.wv.similarity(w1="man",w2="woman")))

# Get the 5 words most similar to "Car"
print ("\nThe five words most similar to car are: {}".format(model_5_50.wv.most_similar (positive="car", topn=5)))

# Subtract "man" from "king", add "woman" and compare with "queen"
custom_queen = np.add(np.subtract(model_5_50.wv["king"],model_5_50.wv["man"]),model_5_50.wv["woman"])
print("\nThe similarity between ``queen'' and ``king - man + woman'' is {}".format(1 - spatial.distance.cosine(custom_queen,model_5_50.wv["queen"])))


window size 1, 50 dimensions

The similarity between ``man'' and ``woman'' is: 0.882613873325

The five words most similar to car are: [(u'lane', 0.7859511375427246), (u'cabin', 0.7837317585945129), (u'breeze', 0.7645677328109741), (u'bushes', 0.7635990381240845), (u'shutters', 0.7610712051391602)]

The similarity between ``queen'' and ``king - man + woman'' is 0.555125713348

-------------

window size 1, 200 dimensions

The similarity between ``man'' and ``woman'' is: 0.66602621464

The five words most similar to car are: [(u'shopman', 0.7213327288627625), (u'bushes', 0.7186962366104126), (u'lane', 0.7164766788482666), (u'shutters', 0.7156256437301636), (u'dial', 0.7134419679641724)]

The similarity between ``queen'' and ``king - man + woman'' is 0.418639153242

-------------

window size 3, 50 dimensions

The similarity between ``man'' and ``woman'' is: 0.833102187214

The five words most similar to car are: [(u'lane', 0.858578085899353), (u'coach', 0.8423855304718018), (u'boat', 0

