# **CS 6120: Natural Language Processing - Prof. Ahmad Uzair** 

### Assignment 4: SVD, Cross-Language Word Embeddings and Pointwise Mutual Information

### **Total points: 100**



# Q1. SVD (30 Points) 


 - A. Implement a Singular Value Decomposition (SVD) based distributed representation of text and documents. You may use Python libraries for matrix decomposition (e.g., scipy). To demonstrate your work, use the example dataset (Table 2) of S. Deerwester, S. T. Dumais, G. W. Furnas, T. K. Landauer, and R. A. Harshman (1990), "Indexing by latent semantic analysis," Journal of the American Society for Information Science. (10 Points)

 - B. Visualize the documents and terms in 2-D using a library of your choice. (10 Points)

 - C. Implement a function that converts a query string to its distributed representation and retrieves the relevant documents. Visualize the results as shown in Fig. 1 of the paper. (10 Points)
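
As a rough orientation for parts A to C, here is a minimal sketch of the rank-2 pipeline on a hypothetical toy term-document matrix. It is not Table 2 of the paper, and the names `terms`, `docs`, `fold_in_query`, and `rank_documents` are illustrative assumptions. It uses `numpy.linalg.svd`; `scipy.linalg.svd` returns the same three factors. The query is folded in as `q_hat = q^T U_k S_k^{-1}` and compared to the document rows of `V_k` by cosine similarity.

```python
import numpy as np

# Hypothetical toy counts (rows: terms, columns: documents); NOT the paper's Table 2.
terms = ["human", "interface", "computer", "graph", "trees"]
docs = ["d1", "d2", "d3", "d4"]
X = np.array([
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
], dtype=float)

# Thin SVD: X = U @ diag(s) @ Vt, singular values sorted in decreasing order.
U, s, Vt = np.linalg.svd(X, full_matrices=False)

k = 2  # keep two dimensions so terms and documents can be plotted in 2-D
U_k, s_k, V_k = U[:, :k], s[:k], Vt[:k, :].T

# One common choice of plotting coordinates: scale by the singular values.
term_coords = U_k * s_k
doc_coords = V_k * s_k

def fold_in_query(query):
    """Map a whitespace-tokenized query into the k-dimensional document space."""
    q = np.array([query.split().count(t) for t in terms], dtype=float)
    return q @ U_k / s_k

def rank_documents(query):
    """Return (document, cosine similarity) pairs, best match first."""
    q_hat = fold_in_query(query)
    scored = []
    for name, d in zip(docs, V_k):
        denom = np.linalg.norm(q_hat) * np.linalg.norm(d)
        scored.append((name, float(q_hat @ d / denom) if denom > 0 else 0.0))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

ranking = rank_documents("human computer")
```

The arrays `term_coords` and `doc_coords` each have two columns, so they can be passed directly to a scatter plot in the library of your choice.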

# Q2. Cross-Language Word Embeddings (30 Points)

Different modeling choices for word embeddings may be ultimately evaluated by the effectiveness of classifiers, parsers, and other inference models that use those embeddings.<br>

In this assignment, however, we will consider another common method of evaluating word embeddings: by judging the usefulness of pairwise distances between words in the embedding space.<br>

Follow along with the examples in this notebook, and implement the sections of code flagged with TODO.

In [None]:
import gensim
import numpy as np
from gensim.test.utils import datapath, get_tmpfile
from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence

We'll start by downloading a plain-text version of the Shakespeare plays.

In [None]:
!wget http://www.ccs.neu.edu/home/dasmith/courses/cs6120/shakespeare_plays.txt
lines = [s.split() for s in open('shakespeare_plays.txt')]

Then, we'll estimate a simple word2vec model on the Shakespeare texts.

In [None]:
model = Word2Vec(None)

Even with such a small training set size, you can perform some standard analogy tasks.

In [None]:
model.wv.most_similar(positive=None, negative=None)

For the rest of this assignment, we will focus on finding words with similar embeddings, both within and across languages. For example, what words are similar to the name of the title character of Othello?

In [None]:
model.wv.most_similar(positive=None)

This search uses cosine similarity. In the default API, you should see the same similarity between the words othello and desdemona as in the search results above.

In [None]:
model.wv.similarity(None, None)

TODO: Your first task, therefore, is to implement your own cosine similarity function so that you can reuse it outside of the context of the gensim model object.
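
For reference, the cosine similarity of two vectors is their dot product divided by the product of their Euclidean norms:

$$
\operatorname{cosim}(v_1, v_2) = \frac{v_1 \cdot v_2}{\lVert v_1 \rVert \, \lVert v_2 \rVert}
$$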

In [None]:
def cosim(v1, v2): 
    return None

cosim(model.wv['othello'], model.wv['desdemona'])

<h3>Evaluation: </h3>

We could collect a lot of human judgments about how similar pairs of words, or pairs of Shakespearean characters, are. Then we could compare different word-embedding models by their ability to replicate these human judgments.<br>

If we extend our ambition to multiple languages, however, we can use a word translation task to evaluate word embeddings.<br>

We will use a subset of Facebook AI's FastText cross-language embeddings for several languages. Your task will be to compare English both to French, and to one more language from the following set: Arabic, German, Portuguese, Russian, Spanish, Vietnamese, and Chinese.<br>

In [None]:
!wget http://www.ccs.neu.edu/home/dasmith/courses/cs6120/30k.en.vec
!wget http://www.ccs.neu.edu/home/dasmith/courses/cs6120/30k.fr.vec

# TODO: uncomment at least one of these to work with another language
!wget http://www.ccs.neu.edu/home/dasmith/courses/cs6120/30k.ar.vec
#!wget http://www.ccs.neu.edu/home/dasmith/courses/cs6120/30k.de.vec
# !wget http://www.ccs.neu.edu/home/dasmith/courses/cs6120/30k.pt.vec
# !wget http://www.ccs.neu.edu/home/dasmith/courses/cs6120/30k.ru.vec
# !wget http://www.ccs.neu.edu/home/dasmith/courses/cs6120/30k.es.vec
# !wget http://www.ccs.neu.edu/home/dasmith/courses/cs6120/30k.vi.vec
# !wget http://www.ccs.neu.edu/home/dasmith/courses/cs6120/30k.zh.vec

We'll start by loading the word vectors from their textual file format to a dictionary mapping words to numpy arrays.

In [None]:
def vecref(s):
    (word, srec) = s.split(' ', 1)
    return (word, np.fromstring(srec, sep=' '))

def ftvectors(fname):
    return { k:v for (k, v) in [vecref(s) for s in open(fname)] if len(v) > 1} 

# Load vectors for English and French.
envec = ftvectors('30k.en.vec')
frvec = ftvectors('30k.fr.vec')

# TODO: load vectors for one more language (e.g., zhvec for Chinese), just as for English and French
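
To see why the loader keeps only entries with `len(v) > 1`, here is a small sketch. It assumes the `.vec` files follow the usual fastText text layout, in which the first line is a header giving the vocabulary size and the dimension. The sample text and the helper name `parse_line` are hypothetical, and the parsing uses `str.split` rather than the `np.fromstring` call in the cell above.

```python
import io
import numpy as np

# Hypothetical file contents with made-up numbers: a header line, then one word per line.
sample = io.StringIO("2 3\ncat 0.1 0.2 0.3\ndog 0.4 0.5 0.6\n")

def parse_line(s):
    """Split a line into (word, vector), mirroring what vecref does."""
    word, rest = s.split(' ', 1)
    return word, np.array(rest.split(), dtype=float)

parsed = [parse_line(line) for line in sample]

# The header "2 3" parses to the pair ('2', array([3.])), a length-1 "vector",
# so the len(v) > 1 filter drops it and keeps only real word vectors.
vectors = {k: v for k, v in parsed if len(v) > 1}
```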


In [None]:
## TODO: implement this search function
def mostSimilar(vec, vecDict):
    ## Use your cosim function from above
    mostSimilar = ''
    similarity = 0
    for row in vecDict.items():
        csm = cosim(None, None)
        if None:
            similarity = None
            mostSimilar = None
    return (mostSimilar, similarity)

## some example searches
[mostSimilar(envec[e], frvec) for e in ['computer', 'germany', 'matrix', 'physics', 'yeast']]

TODO: Your next task is to write a simple function that takes a vector and a dictionary of vectors and finds the most similar item in the dictionary. For this assignment, a linear scan through the dictionary using your cosim function from above is acceptable.<br>

Some matches make more sense than others. Note that computer most closely matches informatique, the French term for computer science. If you looked further down the list, you would see ordinateur, the term for computer. This is one weakness of a focus only on embeddings for word types independent of context.<br>

To evaluate cross-language embeddings more broadly, we'll look at a dataset of links between Wikipedia articles.<br>

In [None]:
!wget http://www.ccs.neu.edu/home/dasmith/courses/cs6120/links.tab
links = [s.split() for s in open('links.tab')]

In [None]:
links[302]

TODO: Evaluate the English and French embeddings by computing the proportion of English Wikipedia articles whose corresponding French article is also the closest word in embedding space. Skip English articles not covered by the word embedding dictionary. Since many articles, e.g., those about named entities, have the same title in English and French, also compute the baseline accuracy achieved by simply echoing the English title as if it were French. Remember to iterate only over English Wikipedia articles, not the entire embedding dictionary.

In [None]:
## TODO: Compute English-French Wikipedia retrieval accuracy.
t = 0
b = 0
g = 0
for row in links:
    if row[1] == 'fr':
        if row[0] in envec.keys():
            t += 1
            if row[0] == row[2]:
                b += 1
            similar, _ = mostSimilar(None, None)
            if None:
                g += 1

baselineAccuracy = b/t
accuracy = g/t

In [None]:
print(baselineAccuracy, accuracy)

TODO: Compute accuracy and baseline (identity function) accuracy for English and another language besides French. Although the baseline will be lower for languages not written in the Roman alphabet (e.g., Arabic or Chinese), there are still many articles in those languages with headwords written in Roman characters.

In [None]:
## TODO: Compute English-X Wikipedia retrieval accuracy.
#Follow the above procedure to do this task.

print(baselineAccuracy, accuracy)

# Q3. Mutual Information (40 Points)

Please read this paper https://aclanthology.org/J92-4003.pdf to answer the following questions.

**"A quick fox jumps over the lazy dog. A quick fox jumps over the lazy dog. A quick fox jumps over the lazy dog. A quick fox jumps over the lazy dog. A quick fox jumps over the lazy dog. "**

1. Implement a function to compute the mutual information between any pair of adjacent words (w1, w2) in the above text. (10 Points)

2. What is meant by semantically sticky pairs? Write a function that extracts semantically sticky pairs. (10 Points)


3. What is the relation between the stickiness of a pair of words and their mutual information? (10 Points)

4. Explain the differences between semantically sticky words and adjacent sticky words. (10 Points)
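
As a starting point for question 1, here is a minimal sketch. It assumes the mutual information of an adjacent pair is defined as log Pr(w1 w2) / (Pr(w1) Pr(w2)), with probabilities estimated from raw unigram and bigram counts. The toy token list, the function name `adjacent_mi`, and the choice of base-2 logarithms are illustrative assumptions, not requirements of the assignment.

```python
import math
from collections import Counter

# Hypothetical toy text, deliberately different from the assignment sentence.
tokens = "the cat sat on the mat the cat ran".split()

unigrams = Counter(tokens)
bigrams = Counter(zip(tokens, tokens[1:]))
n_unigrams = len(tokens)
n_bigrams = len(tokens) - 1

def adjacent_mi(w1, w2):
    """Mutual information (in bits) of w1 immediately followed by w2."""
    p12 = bigrams[(w1, w2)] / n_bigrams
    if p12 == 0:
        return float("-inf")  # the pair never occurs adjacently
    p1 = unigrams[w1] / n_unigrams
    p2 = unigrams[w2] / n_unigrams
    return math.log2(p12 / (p1 * p2))

mi_the_cat = adjacent_mi("the", "cat")
mi_cat_the = adjacent_mi("cat", "the")
```

Here "the cat" occurs twice among eight bigrams, so its score is positive, while "cat the" never occurs and is scored as negative infinity. With so little data these maximum-likelihood estimates are noisy, which is worth keeping in mind when interpreting stickiness on the short assignment text.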