## Part 1: Synchronic word embedding

**Step 1**: Download vectors

**Step 2**: Using gensim, extract embeddings for the words in Table 1 of RG65 that also appeared in the set W from the earlier exercise, so that the word pairs are identical across all analyses.


In [None]:
import gensim
from gensim.models import KeyedVectors

model_W2V = KeyedVectors.load_word2vec_format("data/GoogleNews-vectors-negative300.bin", binary=True)

**Step 3**: Calculate cosine distance between each pair of word embeddings you have extracted, and report the Pearson correlation between word2vec-based and human similarities. [1 point] Comment on this value in comparison to those from LSA and word-context vectors from analyses in the earlier exercise. [1 point]

In [19]:
import common
from sklearn.metrics.pairwise import cosine_similarity
from scipy.stats import pearsonr


def get_sim(embedding_dict, word_pairs):
    result = []
    for w1, w2 in word_pairs:
        w1_vec = embedding_dict[w1]
        w2_vec = embedding_dict[w2]
        result.append(cosine_similarity([w1_vec], [w2_vec])[0][0])
    return result


# Load LSA embeddings from exercise.
M2300 = common.load_embedding_dict("data/M2300.pickle")

# Filter RG65 word pairs, based on what words have LSA embeddings
# (so results are comparable).
rg65_words, rg65_word_pairs = common.load_rg65()
P = [(w1, w2) for w1, w2 in rg65_word_pairs.keys()
     if w1 in M2300 and w2 in M2300]
S = [rg65_word_pairs[word_pair] for word_pair in P]

# Compute cosine similarity for word pairs, based on word2vec embeddings.
SW2V = get_sim(model_W2V, P)

# Compare w2v similarity to human annotations.
r, p = pearsonr(SW2V, S)
print(f"r={r:.4f}, p={p}")




r=0.7878, p=3.579379128582004e-13
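
Step 3 asks for cosine distance, while the cell above computes cosine similarity. The two are related by distance = 1 - similarity, so switching between them should only flip the sign of the Pearson correlation, not its magnitude. The sketch below illustrates this with toy vectors and made-up ratings (none of these numbers come from RG65); the helpers `cosine_similarity_1d` and `pearson` are hypothetical stand-ins written with the standard library only.

```python
import math


def cosine_similarity_1d(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Toy vectors and made-up "human" ratings, for illustration only.
toy_pairs = [
    ([1.0, 0.0], [1.0, 0.1]),
    ([1.0, 0.0], [1.0, 1.0]),
    ([1.0, 0.0], [0.0, 1.0]),
]
toy_human = [3.9, 2.1, 0.3]

sims = [cosine_similarity_1d(u, v) for u, v in toy_pairs]
dists = [1.0 - s for s in sims]

r_sim = pearson(sims, toy_human)
r_dist = pearson(dists, toy_human)
print(f"r (similarity) = {r_sim:.4f}")
print(f"r (distance)   = {r_dist:.4f}")
```

Whichever quantity is reported, it helps to state which one was used, since a negative r for distance and a positive r for similarity describe the same fit.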


**Step 4**: Perform the analogy test based on data here (or as provided) with the pre-trained word2vec embeddings. Report the accuracy on the semantic analogy test and the syntactic analogy test (see Note below).

Repeat the analysis with LSA vectors (300 dimensions) from the earlier exercise, and comment on the results in comparison to those from word2vec. [1 point]

Note: The number of entries you can test with LSA is expected to be smaller than the number you can test with word2vec. For a fair comparison, consider reporting both models' accuracies on the smaller test set, i.e., the entries that both word2vec and LSA can handle.
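
One way to read the Note is to keep only those analogy entries whose four words appear in both vocabularies. The sketch below shows that filtering on toy data; `filter_analogies` and the toy vocabularies are hypothetical, and the notebook's own `common.load_analogy_task(vocab=...)` may already do something similar (its implementation is not shown here).

```python
def filter_analogies(entries, *vocabs):
    """Keep analogy entries (a, b, c, d) whose words appear in every vocabulary."""
    shared = set.intersection(*(set(v) for v in vocabs))
    return [entry for entry in entries if all(w in shared for w in entry)]


# Toy data for illustration only.
toy_entries = [
    ("man", "king", "woman", "queen"),
    ("walk", "walked", "jump", "jumped"),
]
toy_w2v_vocab = {"man", "king", "woman", "queen",
                 "walk", "walked", "jump", "jumped"}
toy_lsa_vocab = {"man", "king", "woman", "queen", "walk", "jump"}

shared_entries = filter_analogies(toy_entries, toy_w2v_vocab, toy_lsa_vocab)
print(shared_entries)
```

With gensim 4.x, the word2vec vocabulary should be available as `model_W2V.key_to_index`, which could be passed in place of `toy_w2v_vocab`.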

In [14]:
import common
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity


def run_analogy_task(embedding_dict, analogy_dataset):
    # Stack all embeddings into one matrix; row i corresponds to words[i].
    words = list(embedding_dict.keys())
    embedding_matrix = np.array(
        [embedding_dict[word] for word in words])
    correct = 0
    for a, b, c, d_true in analogy_dataset:
        # Assuming each entry reads "a is to b as c is to d",
        # the predicted vector for d is b - a + c.
        d_pred_vector = embedding_dict[b] - embedding_dict[a] + embedding_dict[c]

        # Take the top 4 candidates so that at least one remains
        # after excluding the three query words.
        sim = cosine_similarity([d_pred_vector], embedding_matrix)[0]
        d_rankings = np.argsort(-sim)[:4]
        d_pred = [words[item] for item in d_rankings
                  if words[item] not in {a, b, c}][0]

        if d_pred == d_true:
            correct += 1
        print(a, b, c, d_true, "\t", d_pred)
    accuracy = correct / len(analogy_dataset)
    print(f"accuracy = {accuracy:.4f}")
    return accuracy


M2300 = common.load_embedding_dict("data/M2300.pickle")
analogy_dataset = common.load_analogy_task(vocab=set(M2300.keys()))
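
Step 4 asks for separate accuracies on the semantic and syntactic tests, so the entries need to be grouped before scoring. The sketch below assumes the analogy file follows the layout of the Google `questions-words.txt` file, where section headers start with `:` and syntactic sections have names beginning with `gram`; that layout is an assumption about the provided data, and `parse_analogy_sections` is a hypothetical helper, not part of `common`.

```python
def parse_analogy_sections(lines):
    """Group analogy entries into 'semantic' and 'syntactic' test sets.

    Assumes headers look like ': gram3-comparative' and that sections whose
    name starts with 'gram' are syntactic (an assumption about the file
    layout, not something the notebook confirms).
    """
    groups = {"semantic": [], "syntactic": []}
    current = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(":"):
            name = line[1:].strip()
            current = "syntactic" if name.startswith("gram") else "semantic"
            continue
        words = tuple(line.split())
        # Skip malformed lines and any entries that precede the first header.
        if current is not None and len(words) == 4:
            groups[current].append(words)
    return groups


# Toy file contents for illustration only.
toy_file = """\
: capital-common-countries
Athens Greece Baghdad Iraq
: gram3-comparative
bad worse big bigger
good better cold colder
"""
groups = parse_analogy_sections(toy_file.splitlines())
for name, entries in groups.items():
    print(name, len(entries))
```

Each group could then be filtered to the shared vocabulary and passed to `run_analogy_task` separately.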



In [25]:
# Delete me once you get run_analogy_task working

from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

words = ["a", "b", "c", "d"]

embedding_matrix = np.array([[0.1, 0.2, 0.3], [0.3, 0.2, 0.1], [0, 0, 4], [1, 1, 1]])
d_pred_vector = np.array([0.1, 0.15, 0.5])
vocab = {"a": 0, "b": 1, "c": 2, "d": 3}

a, b, c = "a", "b", "c"



sim = cosine_similarity([d_pred_vector], embedding_matrix)[0]
print(sim)
d_rankings = np.argsort(-sim)[:4]
print(d_rankings)
d_pred = [words[item] for item in d_rankings
          if words[item] not in {a, b, c}][0]
print(d_pred)


[0.95538926 0.5531201  0.94072087 0.81468817]
[0 2 3 1]
d


**Step 5**: Suggest a way to improve the existing set of vector-based models in capturing word similarities in general, and provide justifications for your suggestion. [2 points]

## Part 2: Diachronic word embedding

**Step 1**. Download the diachronic word2vec embeddings from the course syllabus page. These embeddings capture historical usage of a small subset of English words over the past century.

**Step 2**. Propose three different methods for measuring degree of semantic change for individual words and report the top 20 most and least changing words in table(s) from each measure. Measure the intercorrelations (of semantic change in all words, given the embeddings from Step 1) among the three methods you have proposed and summarize the Pearson correlations in a 3-by-3 table. [3 points]
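
As a rough illustration of what such measures and their intercorrelation table might look like, the sketch below scores toy time series with two candidate measures (first-versus-last cosine distance and mean step-to-step cosine distance) and builds a Pearson correlation table. It assumes each word has one vector per time slice and that the slices are aligned so that cosine distance across time is meaningful; all names and data are hypothetical, and only two of the three requested measures are shown.

```python
import math


def cosine_distance(u, v):
    """1 - cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return 1.0 - dot / norm


def first_last_change(series):
    """Cosine distance between the earliest and latest embedding."""
    return cosine_distance(series[0], series[-1])


def mean_step_change(series):
    """Average cosine distance between consecutive time slices."""
    steps = [cosine_distance(u, v) for u, v in zip(series, series[1:])]
    return sum(steps) / len(steps)


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def correlation_table(score_lists):
    """Square table of Pearson correlations between lists of scores."""
    return [[pearson(a, b) for b in score_lists] for a in score_lists]


# Toy 2-d "embeddings" over three time slices, for illustration only.
toy_series = {
    "stable": [[1.0, 0.0], [1.0, 0.05], [1.0, 0.0]],
    "drifting": [[1.0, 0.0], [1.0, 0.5], [1.0, 1.0]],
    "shifted": [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0]],
}
words = list(toy_series)
measures = {"first_last": first_last_change, "mean_step": mean_step_change}
scores = {name: [fn(toy_series[w]) for w in words]
          for name, fn in measures.items()}

# Rank words from most to least changing under one measure.
ranked = sorted(words, key=lambda w: first_last_change(toy_series[w]),
                reverse=True)
print("most to least changing (first_last):", ranked)

table = correlation_table(list(scores.values()))
for name, row in zip(scores, table):
    print(name, [f"{r:.3f}" for r in row])
```

If a word has an all-zero vector in some time slice (for example because it was too rare in that period), those slices would need to be dropped first, since cosine distance is undefined for zero vectors.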

**Step 3**. Propose and justify a procedure for evaluating the accuracy of the methods you have proposed in Step 2, and then evaluate the three methods following this proposed procedure and report Pearson correlations or relevant test statistics. [2 points]

**Step 4**. Extract the top 3 changing words using the best method from Steps 2 and 3. Propose and implement a simple way of detecting the point(s) of semantic change in each word based on its diachronic embedding time course—visualize the time course and the detected change point(s). [3 points]
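
One simple detector, sketched below under stated assumptions, computes the cosine distance between each pair of consecutive time slices and reports either the single largest step or every step above a chosen threshold. The function `detect_change_points`, the toy decades, and the toy vectors are all hypothetical; the threshold value is a free parameter that would need justification on the real data.

```python
import math


def cosine_distance(u, v):
    """1 - cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return 1.0 - dot / norm


def detect_change_points(series, labels, threshold=None):
    """Return (change point labels, step distances).

    A change point is labelled by the time slice at which a large step ends.
    If threshold is None, only the single largest step is reported;
    otherwise every step with distance above threshold is reported.
    """
    steps = [cosine_distance(u, v) for u, v in zip(series, series[1:])]
    if threshold is None:
        largest = max(range(len(steps)), key=steps.__getitem__)
        return [labels[largest + 1]], steps
    flagged = [labels[i + 1] for i, s in enumerate(steps) if s > threshold]
    return flagged, steps


# Toy time course for one word, for illustration only.
toy_decades = [1900, 1910, 1920, 1930, 1940]
toy_series = [[1.0, 0.0], [1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.0, 1.0]]

points, steps = detect_change_points(toy_series, toy_decades)
print("largest step ends at:", points)
print("step distances:", [f"{s:.3f}" for s in steps])
```

For the visualization, the returned step distances could be plotted against `toy_decades[1:]` with the detected points marked, for instance as vertical lines.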