### **1.1 Introduction**

Vocabulary in a language is always evolving: new words are introduced as society progresses, and relating these new words to classes in a thesaurus manually is simply too slow. The task of taxonomy induction takes these new words and attaches them to a hierarchical graph next to words with similar meanings. I chose to use Word2Vec and Node2Vec embeddings for this task, though it can also be approached with transformer-based models such as BERT. As my understanding of the Russian language, and thus its vocabulary, is close to null :), I chose to do this task for English. The suggested competition, [SemEval-2015 Task 17](https://alt.qcri.org/semeval2015/task17/), has no test set available for such a task; it is locked. I asked for help in the Telegram group, but no one responded. Since it is the last assignment and everyone is in a festive mood, I would like to use the test data from the seminar demonstration for this task, hoping I will be excused. I would also like to mention that I found our TA's paper on the same topic, [Studying Taxonomy Enrichment on Diachronic WordNet Versions](https://arxiv.org/abs/2011.11536), but when I looked for the dataset I again could not find it ([paper dataset](https://paperswithcode.com/paper/studying-taxonomy-enrichment-on-diachronic)). I hope this notebook will provide some proof of concept of my understanding of the task and of the lectures.

### **1.2 Methodology**

1) We first generate the Word2Vec training corpus using random walks on the noun graph, with walk length = 30 and total walks = 200, split across 20 parallel workers.\
2) We then train Word2Vec on this corpus to create our own Node2Vec model.\
3) We then import the nouns_w2v_all_2.0-3.0.txt file, which contains out-of-vocabulary (OOV) words, and use our trained Node2Vec model to find mutual lemmas and create a dictionary.\
4) We use this dictionary to learn a transform from Word2Vec space to Node2Vec space and obtain Node2Vec representations of the OOV words.\
5) We then import the evaluate function and the test set with reference hypernyms, together with the same words without labels.\
6) We convert the unlabeled words to Node2Vec and evaluate the mAP and mRR scores.\
7) I also used the nouns_w2v_all_2.0-3.0.txt Word2Vec embeddings directly to find the most similar embeddings for the OOV words and ran the evaluate function again.\
8) Finally, the HCHModel class is used to find similar words (associates) and the top 10 hypernyms of those similar words.



In [1]:
import numpy as np
from tqdm import tqdm
import random
from joblib import Parallel, delayed

In [2]:
class ProbabilityGenerator:
    def __init__(self, p_return_param, q_unseen_param):
        self.p_return_param = p_return_param
        self.q_unseen_param = q_unseen_param

    def generate_probabilities(self, source_id, prev_value, previous_starts, possible_starts):
        result_weights = self.__compute_weights(source_id, prev_value, 
                                                previous_starts, 
                                                possible_starts)
        sum_weights = sum(result_weights)
        if sum_weights == 0:
            return None
        else:
            return [weight / sum_weights for weight in result_weights]

    def __compute_weights(self, source_id, prev_value, previous_starts, 
                          possible_starts):
        if prev_value:
            weights = [self.__compute_weight(source_id, possible_start, 
                                             prev_value, previous_starts)
                for possible_start in possible_starts]
        else:  # equal weights
            weights = [1]*len(possible_starts)
        
        return weights

    def __compute_weight(self, source_id, possible_start, prev_value, previous_starts):
        # source_id is the current node; prev_value is the node visited one step before
        if possible_start == prev_value:
            # candidate is the node one step before (return step)
            return 1 / self.p_return_param
        elif possible_start in previous_starts:
            # candidate is also a neighbour of the previous node
            return 1
        else:
            # not seen yet (moves away from the previous node)
            return 1 / self.q_unseen_param
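
To make the biased transition rule concrete, here is a small standalone sketch of the usual node2vec weighting on a toy neighbourhood: weight `1/p` for stepping back to the previous node, `1` for a candidate that is also a neighbour of the previous node, and `1/q` otherwise. The function name `transition_probabilities` and the toy node names are illustrative assumptions and are not part of the notebook; `p` and `q` play the role of `p_return_param` and `q_unseen_param`.

```python
def transition_probabilities(prev_node, prev_neighbors, candidates, p, q):
    """Normalised node2vec-style weights for each candidate next node (toy helper)."""
    weights = []
    for candidate in candidates:
        if candidate == prev_node:
            weights.append(1.0 / p)   # step back to where the walk came from
        elif candidate in prev_neighbors:
            weights.append(1.0)       # stays adjacent to the previous node
        else:
            weights.append(1.0 / q)   # moves further away from the previous node
    total = sum(weights)
    return [w / total for w in weights]


# Toy situation (invented): the walk went dog -> canine,
# and the neighbours of "canine" are the candidates for the next step.
toy_candidates = ["dog", "carnivore", "wolf"]
probs = transition_probabilities(
    prev_node="dog",
    prev_neighbors={"canine", "domestic_animal"},
    candidates=toy_candidates,
    p=4,
    q=0.5,
)
print(dict(zip(toy_candidates, probs)))
```

With `p=4` and `q=0.5`, the values used in the commented-out walk configuration further down, the walk rarely steps back and prefers nodes it has not been close to yet.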

In [3]:
def choose_node(possible_nodes, probabilities):
    random_index = np.random.choice(len(possible_nodes), p=probabilities)
    chosen_link_with_anchor = possible_nodes[random_index]
    return chosen_link_with_anchor

In [4]:
def go_for_a_walk(node, walk_length, prob_generator):
    source_id = node
    prev_start = None
    prev_possible_nodes = []
    sequence = [source_id]
    
    while len(sequence) < walk_length:
        possible_nodes = list(G.neighbors(source_id))
        if possible_nodes:
            probabilities = prob_generator.generate_probabilities(source_id, prev_start, 
                                                   prev_possible_nodes,
                                                   possible_nodes)
            if probabilities:
                chosen_node = choose_node(possible_nodes, probabilities)
                prev_start = source_id
                prev_possible_nodes = possible_nodes
                source_id = chosen_node
                sequence.append(source_id)
            else:
                break
        else:
            break
    return sequence

In [5]:
def parallel_walk(graph, walk_length, num_walks, prob_gen):
    walks = []
    for n_walk in tqdm(range(num_walks)):
        shuffled_nodes = list(graph.nodes())
        random.shuffle(shuffled_nodes)
        for source in shuffled_nodes:
            walks.append(go_for_a_walk(source, walk_length, prob_gen))
    return walks

In [6]:
import networkx as nx

def build_graph(wordnet, pos, directed=False):
    if directed:
        G = nx.DiGraph()
    else:
        G = nx.Graph()
    for synset in wordnet.all_synsets(pos):
        for hypernym in synset.hypernyms():
            G.add_edge(synset.name(), hypernym.name())
        if len(synset.hypernyms()) == 0:
            G.add_node(synset.name())
    return G

In [12]:
!wget http://wordnetcode.princeton.edu/2.0/WordNet-2.0.tar.gz

--2021-12-24 10:59:52--  http://wordnetcode.princeton.edu/2.0/WordNet-2.0.tar.gz
Resolving wordnetcode.princeton.edu (wordnetcode.princeton.edu)... 128.112.136.61
Connecting to wordnetcode.princeton.edu (wordnetcode.princeton.edu)|128.112.136.61|:80... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://wordnetcode.princeton.edu/2.0/WordNet-2.0.tar.gz [following]
--2021-12-24 10:59:53--  https://wordnetcode.princeton.edu/2.0/WordNet-2.0.tar.gz
Connecting to wordnetcode.princeton.edu (wordnetcode.princeton.edu)|128.112.136.61|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 12847598 (12M) [application/x-gzip]
Saving to: ‘WordNet-2.0.tar.gz’


2021-12-24 10:59:55 (5.98 MB/s) - ‘WordNet-2.0.tar.gz’ saved [12847598/12847598]



In [13]:
!tar -xf WordNet-2.0.tar.gz

In [14]:
from nltk.corpus import WordNetCorpusReader

In [15]:
wn = WordNetCorpusReader('./WordNet-2.0/dict', None)

In [16]:
POS = 'n'
G = build_graph(wn, POS, False)

I left the random-walk generation running for 1 hr 20 min and it still had not finished, so I interrupted it and downloaded precomputed walks instead (next cells).

In [19]:
# prob_gen = ProbabilityGenerator(p_return_param=4, q_unseen_param=0.5)
# num_walks = 200
# walk_length = 30
# n_workers = 20

# num_walks_lists = np.array_split(range(num_walks), n_workers)
# flatten = lambda l: [item for sublist in l for item in sublist]

# walk_results = Parallel(n_jobs=n_workers, temp_folder=None, require=None, verbose=200)(
#         delayed(parallel_walk)(G, walk_length, len(num_walks), prob_gen) for idx, 
#         num_walks in enumerate(num_walks_lists))

# walks = flatten(walk_results)

[Parallel(n_jobs=20)]: Using backend LokyBackend with 20 concurrent workers.




[Parallel(n_jobs=20)]: Done   1 tasks      | elapsed: 77.1min
[Parallel(n_jobs=20)]: Done   2 out of  20 | elapsed: 77.1min remaining: 694.0min
[Parallel(n_jobs=20)]: Done   3 out of  20 | elapsed: 77.1min remaining: 437.0min
[Parallel(n_jobs=20)]: Done   4 out of  20 | elapsed: 77.1min remaining: 308.5min
[Parallel(n_jobs=20)]: Done   5 out of  20 | elapsed: 77.1min remaining: 231.3min
[Parallel(n_jobs=20)]: Done   6 out of  20 | elapsed: 77.1min remaining: 179.9min
[Parallel(n_jobs=20)]: Done   7 out of  20 | elapsed: 77.1min remaining: 143.2min
[Parallel(n_jobs=20)]: Done   8 out of  20 | elapsed: 77.1min remaining: 115.7min
[Parallel(n_jobs=20)]: Done   9 out of  20 | elapsed: 77.1min remaining: 94.2min
[Parallel(n_jobs=20)]: Done  10 out of  20 | elapsed: 77.1min remaining: 77.1min
[Parallel(n_jobs=20)]: Done  11 out of  20 | elapsed: 77.1min remaining: 63.1min
[Parallel(n_jobs=20)]: Done  12 out of  20 | elapsed: 77.1min remaining: 51.4min
[Parallel(n_jobs=20)]: Done  13 out of  

KeyboardInterrupt: ignored

In [21]:
!wget --load-cookies /tmp/cookies.txt "https://docs.google.com/uc?export=download&confirm=$(wget --quiet --save-cookies /tmp/cookies.txt --keep-session-cookies --no-check-certificate 'https://docs.google.com/uc?export=download&id=1rBgK260arEeUe42oyJ8n_PfVqHJUIKFU' -O- | sed -rn 's/.*confirm=([0-9A-Za-z_]+).*/\1\n/p')&id=1rBgK260arEeUe42oyJ8n_PfVqHJUIKFU" -O noun_random_walks.pkl && rm -rf /tmp/cookies.txt

--2021-12-24 12:21:48--  https://docs.google.com/uc?export=download&confirm=y4UH&id=1rBgK260arEeUe42oyJ8n_PfVqHJUIKFU
Resolving docs.google.com (docs.google.com)... 64.233.187.102, 64.233.187.139, 64.233.187.138, ...
Connecting to docs.google.com (docs.google.com)|64.233.187.102|:443... connected.
HTTP request sent, awaiting response... 302 Moved Temporarily
Location: https://doc-0o-9s-docs.googleusercontent.com/docs/securesc/2vhhujfvaorqgiuf2mtr2osvng92c0gk/23cqap5uf5c1vtcifrjc6283udgp80p9/1640348475000/06548341279621459266/12347628958155703582Z/1rBgK260arEeUe42oyJ8n_PfVqHJUIKFU?e=download [following]
--2021-12-24 12:21:49--  https://doc-0o-9s-docs.googleusercontent.com/docs/securesc/2vhhujfvaorqgiuf2mtr2osvng92c0gk/23cqap5uf5c1vtcifrjc6283udgp80p9/1640348475000/06548341279621459266/12347628958155703582Z/1rBgK260arEeUe42oyJ8n_PfVqHJUIKFU?e=download
Resolving doc-0o-9s-docs.googleusercontent.com (doc-0o-9s-docs.googleusercontent.com)... 108.177.125.132, 2404:6800:4008:c01::84
Connectin

In [23]:
import pickle

with open("noun_random_walks.pkl", 'rb') as f:
    random_walks = pickle.load(f)

In [24]:
# from gensim.models import Word2Vec
# sg_model = Word2Vec(random_walks, min_count=0, workers=100, negative=20)

KeyboardInterrupt: ignored

Again, training could not finish within 30 minutes and led to RAM overflow multiple times, so I load pretrained Node2Vec vectors instead.

In [26]:
!wget --load-cookies /tmp/cookies.txt "https://docs.google.com/uc?export=download&confirm=$(wget --quiet --save-cookies /tmp/cookies.txt --keep-session-cookies --no-check-certificate 'https://docs.google.com/uc?export=download&id=1IcjBFgr011eEkefo6kMANAUSDhS6Otyk' -O- | sed -rn 's/.*confirm=([0-9A-Za-z_]+).*/\1\n/p')&id=1IcjBFgr011eEkefo6kMANAUSDhS6Otyk" -O gensim_node2vec.txt && rm -rf /tmp/cookies.txt

--2021-12-24 12:52:24--  https://docs.google.com/uc?export=download&confirm=yS48&id=1IcjBFgr011eEkefo6kMANAUSDhS6Otyk
Resolving docs.google.com (docs.google.com)... 74.125.204.100, 74.125.204.139, 74.125.204.113, ...
Connecting to docs.google.com (docs.google.com)|74.125.204.100|:443... connected.
HTTP request sent, awaiting response... 302 Moved Temporarily
Location: https://doc-00-1g-docs.googleusercontent.com/docs/securesc/g0aq72hut5dtu23km53kdo4ogj8itgaj/9nd7i0d1g0jlpmlk4qnb4ga80on0ttnb/1640350275000/06548341279621459266/08203973431494720941Z/1IcjBFgr011eEkefo6kMANAUSDhS6Otyk?e=download [following]
--2021-12-24 12:52:24--  https://doc-00-1g-docs.googleusercontent.com/docs/securesc/g0aq72hut5dtu23km53kdo4ogj8itgaj/9nd7i0d1g0jlpmlk4qnb4ga80on0ttnb/1640350275000/06548341279621459266/08203973431494720941Z/1IcjBFgr011eEkefo6kMANAUSDhS6Otyk?e=download
Resolving doc-00-1g-docs.googleusercontent.com (doc-00-1g-docs.googleusercontent.com)... 108.177.125.132, 2404:6800:4008:c01::84
Connectin

In [27]:
from gensim.models import KeyedVectors

sg_model = KeyedVectors.load_word2vec_format("gensim_node2vec.txt")

In [58]:
sg_model.similar_by_word("train.n.01")


[('freight_liner.n.01', 0.9477183818817139),
 ('freight_train.n.01', 0.9414694905281067),
 ('bullet_train.n.01', 0.9110889434814453),
 ('passenger_train.n.01', 0.9105428457260132),
 ('commuter.n.01', 0.8963304758071899),
 ('mail_train.n.01', 0.8922033905982971),
 ('hospital_train.n.01', 0.8852114081382751),
 ('car_train.n.01', 0.8846890330314636),
 ('subway_train.n.01', 0.8838037252426147),
 ('streamliner.n.01', 0.8825638294219971)]

In [29]:
!wget --load-cookies /tmp/cookies.txt "https://docs.google.com/uc?export=download&confirm=$(wget --quiet --save-cookies /tmp/cookies.txt --keep-session-cookies --no-check-certificate 'https://docs.google.com/uc?export=download&id=14K2HKaa1Qg5MJzumyX2wPUqvWfHJhX8X' -O- | sed -rn 's/.*confirm=([0-9A-Za-z_]+).*/\1\n/p')&id=14K2HKaa1Qg5MJzumyX2wPUqvWfHJhX8X" -O nouns_w2v_all_2.0-3.0.txt && rm -rf /tmp/cookies.txt

--2021-12-24 12:59:04--  https://docs.google.com/uc?export=download&confirm=rBou&id=14K2HKaa1Qg5MJzumyX2wPUqvWfHJhX8X
Resolving docs.google.com (docs.google.com)... 64.233.188.138, 64.233.188.102, 64.233.188.139, ...
Connecting to docs.google.com (docs.google.com)|64.233.188.138|:443... connected.
HTTP request sent, awaiting response... 302 Moved Temporarily
Location: https://doc-0s-ag-docs.googleusercontent.com/docs/securesc/obens7ffvs9fc9na3ftc1fa1ni7odj25/g8ake993v0d7e6b7sh6p508h5nm7e3d1/1640350725000/06548341279621459266/08097295962724454176Z/14K2HKaa1Qg5MJzumyX2wPUqvWfHJhX8X?e=download [following]
--2021-12-24 12:59:04--  https://doc-0s-ag-docs.googleusercontent.com/docs/securesc/obens7ffvs9fc9na3ftc1fa1ni7odj25/g8ake993v0d7e6b7sh6p508h5nm7e3d1/1640350725000/06548341279621459266/08097295962724454176Z/14K2HKaa1Qg5MJzumyX2wPUqvWfHJhX8X?e=download
Resolving doc-0s-ag-docs.googleusercontent.com (doc-0s-ag-docs.googleusercontent.com)... 108.177.125.132, 2404:6800:4008:c01::84
Connectin

In [30]:
import numpy as np

def normalized(a, axis=-1, order=2):
    """Utility function to normalize the rows of a numpy array."""
    l2 = np.atleast_1d(np.linalg.norm(a, order, axis))
    l2[l2==0] = 1
    return a / np.expand_dims(l2, axis)

def make_training_matrices(source_dictionary, target_dictionary, bilingual_dictionary):
    """
    Source and target dictionaries are the FastVector objects of
    source/target languages. bilingual_dictionary is a list of 
    translation pair tuples [(source_word, target_word), ...].
    """
    source_matrix = []
    target_matrix = []

    for (source, target) in bilingual_dictionary:
        if source in source_dictionary and target in target_dictionary:
            source_matrix.append(source_dictionary[source])
            target_matrix.append(target_dictionary[target])

    # return training matrices
    return np.array(source_matrix), np.array(target_matrix)

def learn_transformation(source_matrix, target_matrix, normalize_vectors=True):
    """
    Source and target matrices are numpy arrays, shape
    (dictionary_length, embedding_dimension). These contain paired
    word vectors from the bilingual dictionary.
    """
    # optionally normalize the training vectors
    if normalize_vectors:
        source_matrix = normalized(source_matrix)
        target_matrix = normalized(target_matrix)

    # perform the SVD
    product = np.matmul(source_matrix.transpose(), target_matrix)
    U, s, V = np.linalg.svd(product)

    # return orthogonal transformation which aligns source language to the target
    return np.matmul(U, V)
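
For reference, `learn_transformation` matches the closed-form solution of the orthogonal Procrustes problem. The symbols below are notation introduced here, not taken from the notebook: $X$ stands for `source_matrix`, $Y$ for `target_matrix` (rows are paired vectors), and $W$ for the returned transform.

```latex
W^{*} = \arg\min_{W^{\top} W = I} \lVert X W - Y \rVert_F^{2}
      = \arg\max_{W^{\top} W = I} \operatorname{tr}\!\left( W^{\top} X^{\top} Y \right),
\qquad
X^{\top} Y = U \Sigma V^{\top}
\;\Longrightarrow\;
W^{*} = U V^{\top}.
```

The second equality holds because $\lVert X W \rVert_F = \lVert X \rVert_F$ for orthogonal $W$, so only the cross term depends on $W$. Note that `np.linalg.svd` returns $V^{\top}$ as its third value, so `np.matmul(U, V)` in the code is $U V^{\top}$ in this notation, and `apply_transform` then computes $X W^{*}$.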

In [31]:
class FastVector:
    def __init__(self, vector_file='', transform=None):
        self.word2id = {}
        self.id2word = []

        print('reading word vectors from %s' % vector_file)
        with open(vector_file, 'r', encoding="utf-8") as f:
            (self.n_words, self.n_dim) = \
                (int(x) for x in f.readline().rstrip('\n').split(' '))
            self.embed = np.zeros((self.n_words, self.n_dim))
            for i, line in enumerate(f):
                elems = line.rstrip('\n').split(' ')
                self.word2id[elems[0]] = i
                self.embed[i] = elems[1:self.n_dim+1]
                self.id2word.append(elems[0])

        if transform is not None:
            print('Applying transformation to embedding')
            self.apply_transform(transform)

    def apply_transform(self, transform):
        transmat = np.loadtxt(transform) if isinstance(transform, str) else transform
        self.embed = np.matmul(self.embed, transmat)

    def export(self, outpath):
        # write vectors in word2vec text format: a header line, then one token per line
        with open(outpath, "w", encoding="utf-8") as fout:
            fout.write(str(self.n_words) + " " + str(self.n_dim) + "\n")
            for token in self.id2word:
                vector_components = ["%.6f" % number for number in self[token]]
                vector_as_string = " ".join(vector_components)
                fout.write(token + " " + vector_as_string + "\n")


    @classmethod
    def cosine_similarity(cls, vec_a, vec_b):
        """Compute cosine similarity between vec_a and vec_b"""
        return np.dot(vec_a, vec_b) / \
            (np.linalg.norm(vec_a) * np.linalg.norm(vec_b))

    def __contains__(self, key):
        return key in self.word2id

    def __getitem__(self, key):
        return self.embed[self.word2id[key]]

In [32]:
word2vec_vectors = FastVector(vector_file="./nouns_w2v_all_2.0-3.0.txt")
node2vec_vectors = FastVector(vector_file="gensim_node2vec.txt")

reading word vectors from ./nouns_w2v_all_2.0-3.0.txt
reading word vectors from gensim_node2vec.txt


In [33]:
def check_lemmas(synset):
    final_lemmas = []
    for lemma in synset.lemmas():
        if len(wn.synsets(lemma.name())) == 1 and "_" not in lemma.name():
            final_lemmas.append(lemma.name())
    return final_lemmas

In [34]:
common_lemmas = [i.name() for i in wn.all_synsets(POS) if len(check_lemmas(i)) > 0]
bilingual_dictionary = [(entry, entry) for entry in common_lemmas]
print(len(bilingual_dictionary))

30294


In [36]:
src_matrix, trg_matrix = make_training_matrices(word2vec_vectors, node2vec_vectors, bilingual_dictionary)

In [37]:
transform = learn_transformation(src_matrix, trg_matrix)
word2vec_vectors.apply_transform(transform)

In [38]:
!wget --load-cookies /tmp/cookies.txt "https://docs.google.com/uc?export=download&confirm=$(wget --quiet --save-cookies /tmp/cookies.txt --keep-session-cookies --no-check-certificate 'https://docs.google.com/uc?export=download&id=1qR-ujvbRe2LW-KS54PyhVQeSMrbAmMQB' -O- | sed -rn 's/.*confirm=([0-9A-Za-z_]+).*/\1\n/p')&id=1qR-ujvbRe2LW-KS54PyhVQeSMrbAmMQB" -O nouns_en.2.0-3.0.tsv && rm -rf /tmp/cookies.txt
!wget --load-cookies /tmp/cookies.txt "https://docs.google.com/uc?export=download&confirm=$(wget --quiet --save-cookies /tmp/cookies.txt --keep-session-cookies --no-check-certificate 'https://docs.google.com/uc?export=download&id=18efUTnAA2ZhhZC2i3_hR_4UX-CfmbN6H' -O- | sed -rn 's/.*confirm=([0-9A-Za-z_]+).*/\1\n/p')&id=18efUTnAA2ZhhZC2i3_hR_4UX-CfmbN6H" -O no_labels_nouns_en.2.0-3.0.tsv && rm -rf /tmp/cookies.txt

--2021-12-24 13:57:25--  https://docs.google.com/uc?export=download&confirm=&id=1qR-ujvbRe2LW-KS54PyhVQeSMrbAmMQB
Resolving docs.google.com (docs.google.com)... 64.233.189.100, 64.233.189.138, 64.233.189.101, ...
Connecting to docs.google.com (docs.google.com)|64.233.189.100|:443... connected.
HTTP request sent, awaiting response... 302 Moved Temporarily
Location: https://doc-10-1g-docs.googleusercontent.com/docs/securesc/vsj7dnt08d55jl6mq53n7f1adv4dqvcj/86o8344onvbklcknppu6klobu1rtfsvb/1640354175000/06548341279621459266/04175200007555094289Z/1qR-ujvbRe2LW-KS54PyhVQeSMrbAmMQB?e=download [following]
--2021-12-24 13:57:26--  https://doc-10-1g-docs.googleusercontent.com/docs/securesc/vsj7dnt08d55jl6mq53n7f1adv4dqvcj/86o8344onvbklcknppu6klobu1rtfsvb/1640354175000/06548341279621459266/04175200007555094289Z/1qR-ujvbRe2LW-KS54PyhVQeSMrbAmMQB?e=download
Resolving doc-10-1g-docs.googleusercontent.com (doc-10-1g-docs.googleusercontent.com)... 108.177.125.132, 2404:6800:4008:c01::84
Connecting to

In [39]:
!wget --load-cookies /tmp/cookies.txt "https://docs.google.com/uc?export=download&confirm=$(wget --quiet --save-cookies /tmp/cookies.txt --keep-session-cookies --no-check-certificate 'https://docs.google.com/uc?export=download&id=1phKUhamxVSa8f68FKgs5lOIUQMnlt58l' -O- | sed -rn 's/.*confirm=([0-9A-Za-z_]+).*/\1\n/p')&id=1phKUhamxVSa8f68FKgs5lOIUQMnlt58l" -O evaluate.py && rm -rf /tmp/cookies.txt

--2021-12-24 13:58:09--  https://docs.google.com/uc?export=download&confirm=pc_M&id=1phKUhamxVSa8f68FKgs5lOIUQMnlt58l
Resolving docs.google.com (docs.google.com)... 74.125.204.101, 74.125.204.139, 74.125.204.100, ...
Connecting to docs.google.com (docs.google.com)|74.125.204.101|:443... connected.
HTTP request sent, awaiting response... 302 Moved Temporarily
Location: https://doc-0c-24-docs.googleusercontent.com/docs/securesc/n4i0jccn2khd3a24fmlud00hma5lfk85/ns4aak217ov5hek6mu4t7shc5d86meec/1640354250000/06548341279621459266/07306670677730082107Z/1phKUhamxVSa8f68FKgs5lOIUQMnlt58l?e=download [following]
--2021-12-24 13:58:09--  https://doc-0c-24-docs.googleusercontent.com/docs/securesc/n4i0jccn2khd3a24fmlud00hma5lfk85/ns4aak217ov5hek6mu4t7shc5d86meec/1640354250000/06548341279621459266/07306670677730082107Z/1phKUhamxVSa8f68FKgs5lOIUQMnlt58l?e=download
Resolving doc-0c-24-docs.googleusercontent.com (doc-0c-24-docs.googleusercontent.com)... 108.177.125.132, 2404:6800:4008:c01::84
Connectin

In [40]:
from collections import defaultdict
import json

def read_dataset(data_path, read_fn=lambda x: json.loads(x), sep='\t'):
    vocab = defaultdict(list)
    with open(data_path, 'r', encoding='utf-8') as f:
        for line in f:
            line_split = line.replace("\n", '').split(sep)
            word = line_split[0]
            hypernyms = read_fn(line_split[1])
            vocab[word].append(hypernyms)
    return vocab
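
Judging from `read_dataset` and the reference entries printed later in the notebook, each line of the reference file appears to hold a word, a tab, and a JSON list of hypernym synsets. The sketch below applies the same grouping to two in-memory lines of that assumed shape instead of a file. The helper `parse_lines` and the sample lines are illustrative (reconstructed from the printed output, not copied from the dataset).

```python
import json
from collections import defaultdict


def parse_lines(lines, sep="\t"):
    """Same grouping as read_dataset, but over an iterable of lines (toy helper)."""
    vocab = defaultdict(list)
    for line in lines:
        word, raw_hypernyms = line.rstrip("\n").split(sep)[:2]
        # each line contributes one group of hypernyms for its word
        vocab[word].append(json.loads(raw_hypernyms))
    return vocab


# Assumed line format: <word> TAB <JSON list of synset names>
sample_lines = [
    'sanguine\t["chromatic_color.n.01", "red.n.01"]\n',
    'smoothbore\t["firearm.n.01", "gun.n.01"]\n',
]
reference_demo = parse_lines(sample_lines)
print(reference_demo["sanguine"])
```

A word that occurs on several lines ends up with several hypernym groups, which is why each value is a list of lists.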

In [41]:
reference = read_dataset("./nouns_en.2.0-3.0.tsv")

with open("no_labels_nouns_en.2.0-3.0.tsv", 'r') as f:
    new_words = f.read().split("\n")[:-1][:200]

In [42]:
predicted_node2vec = {}

for word in new_words:
    predicted_node2vec[word] = [i[0] for i in sg_model.similar_by_vector(word2vec_vectors[word])]

In [43]:
from evaluate import get_score

get_score(reference, predicted_node2vec)

(0.01125595238095238, 0.011791666666666667)
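
`evaluate.py` is not shown in this notebook, so the following is only a simplified sketch of how a mean average precision and a mean reciprocal rank can be computed for ranked hypernym candidates. It assumes that any synset in any reference group counts as a hit and that the ranked list has no duplicates; the real `get_score` may treat hypernym groups differently. All function names here are hypothetical.

```python
def average_precision(gold, ranked, k=10):
    """Precision averaged over the ranks where a gold hypernym appears (top k only)."""
    hits, precision_sum = 0, 0.0
    for rank, candidate in enumerate(ranked[:k], start=1):
        if candidate in gold:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / min(len(gold), k) if gold else 0.0


def reciprocal_rank(gold, ranked, k=10):
    """1 / rank of the first gold hypernym in the top k, or 0 if there is none."""
    for rank, candidate in enumerate(ranked[:k], start=1):
        if candidate in gold:
            return 1.0 / rank
    return 0.0


def mean_scores(reference, predicted, k=10):
    """Average both metrics over the reference words (assumption: missing words score 0)."""
    ap_values, rr_values = [], []
    for word, groups in reference.items():
        gold = {hypernym for group in groups for hypernym in group}
        ranked = predicted.get(word, [])
        ap_values.append(average_precision(gold, ranked, k))
        rr_values.append(reciprocal_rank(gold, ranked, k))
    return sum(ap_values) / len(ap_values), sum(rr_values) / len(rr_values)


# Toy data in the same shape as `reference` and the prediction dictionaries
toy_reference = {"smoothbore": [["firearm.n.01", "gun.n.01"]]}
toy_predicted = {"smoothbore": ["weapon.n.01", "firearm.n.01", "gun.n.01"]}
toy_map, toy_mrr = mean_scores(toy_reference, toy_predicted)
print(toy_map, toy_mrr)
```

In the toy case the first correct hypernym sits at rank 2, so the reciprocal rank is 1/2, and the average precision is (1/2 + 2/3) / 2.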

In [44]:
for i, v in list(predicted_node2vec.items())[:10]:
    print(f"word: {i}")
    print(f"true: {reference[i]}")
    print(f"predicted: {v}")
    print("=====")

word: sanguine
true: [['chromatic_color.n.01', 'red.n.01']]
predicted: ['petit_bourgeois.n.01', 'lay_reader.n.01', 'pip-squeak.n.01', 'bourgeois.n.02', 'plebeian.n.01', 'philistine.n.01', 'apiary.n.01', 'cipher.n.04', 'layman.n.01', 'muse.n.01']
=====
word: arccosine
true: [['trigonometric_function.n.01', 'function.n.01']]
predicted: ['anserinae.n.01', 'synodontidae.n.01', 'poeciliidae.n.01', 'dasyatis.n.01', 'peristediinae.n.01', 'ardeidae.n.01', 'toxotidae.n.01', 'artamidae.n.01', 'catostomidae.n.01', 'genyonemus.n.01']
=====
word: smoothbore
true: [['firearm.n.01', 'gun.n.01']]
predicted: ['weapon.n.01', 'firearm.n.01', 'gun.n.01', 'autoloader.n.01', 'knife.n.02', 'muzzle_loader.n.01', 'crossbow.n.01', 'semiautomatic_firearm.n.01', "cupid's_bow.n.02", 'garand_rifle.n.01']
=====
word: arthrodesis
true: [['operation.n.06', 'arthroplasty.n.01']]
predicted: ['sigmoidectomy.n.01', 'nephrectomy.n.01', 'oophorosalpingectomy.n.01', 'thyroidectomy.n.01', 'sympathectomy.n.01', 'prostatectomy.

In [45]:
wv = KeyedVectors.load_word2vec_format("./nouns_w2v_all_2.0-3.0.txt")

In [46]:
wv_predict = {}

for test_name in new_words:
    wv_predict[test_name] = [i[0] for i in wv.similar_by_word(test_name)]

In [47]:
get_score(reference, wv_predict)

(0.022936507936507936, 0.024186507936507937)

In [48]:
wn_names = [i.name() for i in wn.all_synsets()]

In [49]:
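from operator import itemgetter  # used by compute_hchs below

class HCHModel:
    def __init__(self, wv):
        self.w2v_synsets = wv

    def compute_candidates(self, neologism, topn=10):
        # hypernyms of the topn nearest associates, truncated to topn candidates
        return self.compute_hchs(neologism, topn=topn)[:topn]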

    def compute_hchs(self, neologism, topn=100) -> list:
        associates = map(itemgetter(0), self.generate_associates(neologism, topn))
        associates = [i for i in associates if '.n.' in i and i in wn_names]
        hchs = [hypernym.name() for associate in associates for hypernym in wn.synset(associate).hypernyms()]
        return hchs
    
    def generate_associates(self, neologism, topn=10) -> list:
        return self.w2v_synsets.similar_by_word(neologism, topn)
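
The idea behind `HCHModel` can be seen on a toy example without WordNet or embeddings: take the nearest neighbours ("associates") of a new word, then collect their direct hypernyms as candidates. The dictionaries `toy_hypernyms` and `toy_associates` below are invented stand-ins for `wn.synset(...).hypernyms()` and `similar_by_word`, and "webcam" is a hypothetical new word. The order-preserving deduplication at the end is an optional extra, not something the class above does.

```python
# Invented stand-in for WordNet: associate synset -> its direct hypernyms
toy_hypernyms = {
    "television_camera.n.01": ["television_equipment.n.01"],
    "flash_camera.n.01": ["camera.n.01"],
    "digital_camera.n.01": ["camera.n.01"],
    "camera_tripod.n.01": ["tripod.n.01"],
}

# Invented stand-in for embedding neighbours of a new word, most similar first
toy_associates = {
    "webcam": [
        "television_camera.n.01",
        "flash_camera.n.01",
        "digital_camera.n.01",
        "camera_tripod.n.01",
    ],
}


def hypernym_candidates(word, topn=10):
    """Collect hypernyms of the word's associates, keeping the associates' order."""
    candidates = []
    for associate in toy_associates.get(word, []):
        candidates.extend(toy_hypernyms.get(associate, []))
    return candidates[:topn]


def deduplicated(candidates):
    """Drop repeated candidates while keeping the first occurrence of each."""
    return list(dict.fromkeys(candidates))


raw_candidates = hypernym_candidates("webcam")
unique_candidates = deduplicated(raw_candidates)
print(raw_candidates)
print(unique_candidates)
```

Several associates often share a hypernym, so the raw list can contain the same synset more than once; removing repeats would leave room for more distinct candidates in the top 10.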


In [50]:
from operator import itemgetter
hch = HCHModel(wv)

In [51]:
wv_predict = {}

for test_name in new_words:
    wv_predict[test_name] = hch.compute_candidates(test_name)
    
get_score(reference, wv_predict)

(0.0874781746031746, 0.08976984126984126)

In [53]:
wn.synset("gal.n.01").hypernyms()

[Synset('united_states_liquid_unit.n.01')]

In [56]:
hch.generate_associates("camera.n.01")    

[('television_camera.n.01', 0.9472144246101379),
 ('camera_obscura.n.01', 0.9435300827026367),
 ('camera_lucida.n.01', 0.9417228102684021),
 ('camera_tripod.n.01', 0.8724722266197205),
 ('flash_camera.n.01', 0.856145977973938),
 ('digital_camera.n.01', 0.8518924117088318),
 ('portrait_camera.n.01', 0.8179097175598145),
 ('sound_camera.n.01', 0.8039113283157349),
 ('camera_angle.n.01', 0.7917401790618896),
 ('candid_camera.n.01', 0.7882478833198547)]

In [57]:
hch.compute_candidates("camera.n.01")

['television_equipment.n.01',
 'chamber.n.01',
 'optical_device.n.01',
 'tripod.n.01',
 'camera.n.01',
 'camera.n.01',
 'camera.n.01',
 'motion-picture_camera.n.01',
 'point_of_view.n.02',
 'camera.n.01']

### **1.3 Results and Discussion**

1) The training data for SemEval-2015 Task 17 is missing, so I used the test set attached in the rar file.\
2) The mAP score on the new_words set with our Node2Vec model was 0.01125595238095238, while taking the most similar words directly from nouns_w2v_all_2.0-3.0.txt gave 0.022936507936507936.\
3) The final mAP score using the top 10 similar words (associates) and their hypernyms was 0.0874781746031746.\
4) The mRR score on the new_words set with our Node2Vec model was 0.011791666666666667, while taking the most similar words directly from nouns_w2v_all_2.0-3.0.txt gave 0.024186507936507937.\
5) The final mRR score using the top 10 similar words (associates) and their hypernyms was 0.08976984126984126.