# GloVe Assignment

<b>Name:</b> Min Set Aung <b>Student Id:</b> st122825

## Introduction

In this assignment, the implemented embedding models (Skip-gram, Skip-gram with negative sampling, CBOW, and GloVe) were trained and compared based on their syntactic and semantic accuracy. Word-analogy task dataset was used for the assessment. Brown corpus was used for training the models. To reduce the number of vocabularies due to memeory limitation, only news category files were used. Then the first 2000 sentences were used from the  corpus category to further reduce the vocabulary size. For the models to train more more effectively, stopwords (commonly used words like "the", "a", and "an") and punctuations are filtered out.

Also, another assessment was performed using a similarity dataset (WordSim353). WordSim353 contains pairs of words and a score is given by a human judge on how similar/related the pair are. In this assessment, only the similarity part is checked. The main purpose of the evaluation is to check the correlation of the similarity given by the trained models with the similarity score given by the judge for word pairs similarity. Spearman correlation coefficient with associated p-value is used as a correlation metric for this purpose.

In [10]:
import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim
import matplotlib.pyplot as plt

## 1) Load Data

In [11]:
import nltk

In [12]:
 nltk.download('brown')

[nltk_data] Downloading package brown to
[nltk_data]     C:\Users\Asus\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\brown.zip.


True

In [13]:
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Asus\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\stopwords.zip.


True

https://stackoverflow.com/questions/15547409/how-to-get-rid-of-punctuation-using-nltk-tokenizer

In [14]:
import string

In [15]:
string.punctuation

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

In [16]:
punctuations = []
for i in string.punctuation:
    punctuations.append(i)
punctuations.append("''")
punctuations.append('""')
punctuations.append('``')

https://www.geeksforgeeks.org/removing-stop-words-nltk-python/

In [17]:
from nltk.corpus import stopwords

In [18]:
stop_words = set(stopwords.words('english'))

In [19]:
from nltk.corpus import brown

In [20]:
brown.categories()

['adventure',
 'belles_lettres',
 'editorial',
 'fiction',
 'government',
 'hobbies',
 'humor',
 'learned',
 'lore',
 'mystery',
 'news',
 'religion',
 'reviews',
 'romance',
 'science_fiction']

In [21]:
corpus_tokenized = []
sentence = []
for word in brown.words(categories='news'):
    if word == ".":
        corpus_tokenized.append(sentence)
        sentence = []
    elif word in punctuations or word.lower() in stop_words:
        continue
    sentence.append(word.lower())

In [22]:
len(corpus_tokenized)

4030

In [23]:
corpus_tokenized = corpus_tokenized[:2000]

In [24]:
corpus_tokenized[0]

['fulton',
 'county',
 'grand',
 'jury',
 'said',
 'friday',
 'investigation',
 "atlanta's",
 'recent',
 'primary',
 'election',
 'produced',
 'evidence',
 'irregularities',
 'took',
 'place']

## 2) Prepare Data for Training

### Creating vocabulary

In [25]:
#we want to flatten this (basically merge all list)
flatten = lambda l: [item for sublist in l for item in sublist]
vocabs  = list(set(flatten(corpus_tokenized)))  #vocabs is a term defining all unique words your system know

In [26]:
print(vocabs[:10])

["kowalski's", '44', 'bay-front', 'detriment', '8-4', 'containing', '4-homer', 'fumble', 'next', '6.5']


In [27]:
print("Vocabulary size:", len(vocabs))

Vocabulary size: 8029


### Numericalization

In [28]:
word2index = {v: idx for idx, v in enumerate(vocabs)}

In [29]:
#add <UNK>, which is a very normal token exists in the world
vocabs.append('<UNK>') #chaky, can it be ##UNK, or UNKKKKKK, or anything

In [30]:
#now we have a way to know what is the id of <UNK>
word2index['<UNK>'] = 6  #usually <UNK> is 0

In [31]:
#create index2word dictionary
#2 min    
index2word = {v:k for k, v in word2index.items()}

In [32]:
word2index["greece"]

4689

In [33]:
index2word[7974]

'$4'

### Creating (context word, outside word) tuples for Skip-gram, Skip-gram with negative sampling, GloVe

In [34]:
window_size = 2

In [35]:
skipgrams = []

#for each corpus
for sent in corpus_tokenized:
    #for each sent
    for i in range(window_size, len(sent) - window_size):
        center_word = word2index[sent[i]]
        outside_words = []
        for j in range(1, window_size + 1):
            outside_words.append(word2index[sent[i-j]])
            outside_words.append(word2index[sent[i+j]])
            
        for o in outside_words:
            skipgrams.append([center_word, o])

skipgrams[:10]

[[4235, 472],
 [4235, 4121],
 [4235, 4034],
 [4235, 5184],
 [4121, 4235],
 [4121, 5184],
 [4121, 472],
 [4121, 4378],
 [5184, 4121],
 [5184, 4378]]

### Unigram distribution
$$P(w)=U(w)^{3/4}/Z$$

Defining the probability of sampling negative words

In [36]:
from collections import Counter

In [37]:
z = 0.001

In [38]:
word_count = Counter(flatten(corpus_tokenized))

In [39]:
num_total_words = sum([c for w, c in word_count.items()])
num_total_words

28276

In [40]:
unigram_table = []

for v in vocabs:
    uw = word_count[v]/num_total_words
    uw_alpha = uw ** 0.75
    uw_alpha_dividebyz = int(uw_alpha / z)
    # print("vocab: ", v)
    # print("distribution: ", uw_alpha_dividebyz)
    unigram_table.extend([v] * uw_alpha_dividebyz)

In [41]:
unigram_table[:5]

['containing', 'next', 'next', 'next', 'next']

### Creating (context word, outside word) tuples for CBOW

In [42]:
skipgrams_CBOW = []

#for each corpus
for sent in corpus_tokenized:
    for i in range(window_size, len(sent) - window_size):
        center_word   = word2index[sent[i]]
        outside_words = []
        
        low  = i - window_size
        high = i + window_size
        for j in range(low, high + 1):
            if j == i:
                continue
            outside_words.append(word2index[sent[j]])
        skipgrams_CBOW.append([center_word, outside_words])

skipgrams_CBOW[:10]

[[4235, [4034, 472, 4121, 5184]],
 [4121, [472, 4235, 5184, 4378]],
 [5184, [4235, 4121, 4378, 4358]],
 [4378, [4121, 5184, 4358, 1611]],
 [4358, [5184, 4378, 1611, 4794]],
 [1611, [4378, 4358, 4794, 3043]],
 [4794, [4358, 1611, 3043, 6551]],
 [3043, [1611, 4794, 6551, 602]],
 [6551, [4794, 3043, 602, 1470]],
 [602, [3043, 6551, 1470, 5900]]]

### Co-occurrence matrix

In [43]:
#count the frequency of each word....
X_i = Counter(flatten(corpus_tokenized)) #merge all list

In [44]:
gloVes = []

#loop through each corpus
for sent in corpus_tokenized: 
    #loop through each word from 1 to n-1 (because 0 and n has no context window)
    for i in range(1, len(sent)-1):
        target  = sent[i]
        context = [sent[i+1], sent[i-1]]
        #append(i, i+1) and append(i, i-1)
        for c in context:
            gloVes.append((target, c))

In [45]:
gloVe_id = [(word2index[gloVe[0]], word2index[gloVe[1]]) for gloVe in gloVes]

In [46]:
print(gloVe_id[:5])

[(472, 4235), (472, 4034), (4235, 4121), (4235, 472), (4121, 5184)]


In [47]:
#since we have these occurrences, we can count, to make our co-occurrence matrix!!!
X_ik_skipgram = Counter(gloVes)

In [48]:
X_ik_skipgram[("fulton", "county")]

5

### Weighting function f

In [49]:
def weighting(w_i, w_j, X_ik):   #why we need w_i and w_j, because we can try its co-occurrences, if it's too big, we scale it down
    
    #check whether the co-occurrences between these two word exists???
    try:
        x_ij = X_ik[(w_i, w_j)]
    except:
        x_ij = 1  #why one, so that the probability thingy won't break...(label smoothing)
        
    #maximum co-occurrences; we follow the paper
    x_max = 100
    alpha = 0.75
    
    #if the co-occurrences does not exceed x_max, scale it down based on some alpha
    if x_ij < x_max:
        result = (x_ij/x_max) ** alpha
    else:
        result = 1 #this is the maximum probability you can have
        
    return result

In [50]:
w_i  = 'fulton'
w_j  = 'county'
w_j2 = 'chaky'

print(weighting(w_i, w_j, X_ik_skipgram))   #scales from 1 to 0.0316
print(weighting(w_i, w_j2, X_ik_skipgram))  #the paper says that f(0) = 0

0.10573712634405642
0.0


In [51]:
#now apply this weighting to alcorpus_tokenizedible pairs
from itertools import combinations_with_replacement

X_ik = {} #for keeping the co-occurrences
weighting_dic = {} #for keeping all the probability after passing through the weighting function

for bigram in combinations_with_replacement(vocabs, 2):  #we need to also think its reverse
    #if this bigram exists in X_ik_skipgrams
    #we gonna add this to our co-occurence matrix
    if X_ik_skipgram.get(bigram) is not None:
        cooc = X_ik_skipgram[bigram]  #get the co-occurrence
        X_ik[bigram] = cooc + 1 #this is again basically label smoothing....(stability issues (especially when divide something))
        X_ik[(bigram[1], bigram[0])] = cooc + 1  #trick to get all pairs
    else: #otherwise, do nothing
        pass
    
    #apply the weighting function using this co-occurrence matrix thingy    
    weighting_dic[bigram] = weighting(bigram[0], bigram[1], X_ik)
    weighting_dic[(bigram[1], bigram[0])] = weighting(bigram[1], bigram[0], X_ik)

In [52]:
len(X_ik_skipgram)

43854

### Create random batch sampler function

In [53]:
def random_batch(batch_size, corpus, skip_grams):
    random_inputs = []
    random_labels = []
    random_index = np.random.choice(range(len(skip_grams)), batch_size, replace=False) #randomly pick without replacement
        
    for i in random_index:
        random_inputs.append([skip_grams[i][0]])  # target, e.g., 2
        random_labels.append([skip_grams[i][1]])  # context word, e.g., 3
            
    return np.array(random_inputs), np.array(random_labels)

<b>Test Function</b>

In [54]:
input, label = random_batch(5, corpus_tokenized, skipgrams)

print("Input sample:", input)
print(f"Input shape: {input.shape}", end="\n\n")
print(f"Label sample: {label=}")
print("Label shape:", label.shape)

Input sample: [[4896]
 [  76]
 [1244]
 [1952]
 [7725]]
Input shape: (5, 1)

Label sample: label=array([[7347],
       [3483],
       [3165],
       [5455],
       [5677]])
Label shape: (5, 1)


In [55]:
def random_batch_cbow(batch_size, corpus, cbow):
    random_inputs = []
    random_labels = []
    random_index = np.random.choice(range(len(cbow)), batch_size, replace=False) #randomly pick without replacement
        
    for i in random_index:
        random_inputs.append([cbow[i][0]])  # target, e.g., 2
        random_labels.append([cbow[i][1]])  # context word, e.g., 3
            
    return np.array(random_inputs), np.array(random_labels).squeeze()

<b>Test Function</b>

In [56]:
input, label = random_batch_cbow(5, corpus_tokenized, skipgrams_CBOW)

print("Input sample:", input)
print(f"Input shape: {input.shape}", end="\n\n")
print(f"Label sample: {label=}")
print("Label shape:", label.shape)

Input sample: [[  30]
 [7674]
 [7398]
 [ 518]
 [5912]]
Input shape: (5, 1)

Label sample: label=array([[7261, 6148, 7661, 7638],
       [7212, 2729, 2250,   48],
       [4410, 5184, 4842, 3483],
       [6288, 3106, 7690, 4601],
       [7017, 4456, 3944, 2597]])
Label shape: (5, 4)


In [57]:
import math

def random_batch_gloVe(batch_size, word_sequence, skip_grams, X_ik, weighting_dic):
    
    #loop through this skipgram, and change it id  because when sending model, it must number
    skip_grams_id = [(word2index[skip_gram[0]], word2index[skip_gram[1]]) for skip_gram in skip_grams]
    
    #randomly pick "batch_size" indexes
    number_of_choices = len(skip_grams_id)
    random_index = np.random.choice(number_of_choices, batch_size, replace=False) #no repeating indexes among these random indexes
    
    random_inputs = [] #xi, wi (in batches)
    random_labels = [] #xj, wj (in batches)
    random_coocs  = [] #Xij (in batches)
    random_weighting = [] #f(Xij) (in batches)
    #for each of the sample in these indexes
    for i in random_index:
        random_inputs.append([skip_grams_id[i][0]]) #same reason why i put bracket here....
        random_labels.append([skip_grams_id[i][1]])
        
        #get cooc
        #first check whether it exists...
        pair = skip_grams[i]  #e.g., ('banana', 'fruit)
        try:
            cooc = X_ik[pair]
        except:
            cooc = 1 #label smoothing
            
        random_coocs.append([math.log(cooc)])  #1. why log, #2, why bracket -> size ==> (, 1)  #my neural network expects (, 1)
        
        #get weighting
        weighting = weighting_dic[pair]  #why not use try....maybe it does not exist....
        random_weighting.append(weighting)

        
    return np.array(random_inputs), np.array(random_labels), np.array(random_coocs), np.array(random_weighting)
    

<b>Test Function</b>

In [58]:
input, target, cooc, weightin = random_batch_gloVe(5, corpus_tokenized, gloVes, X_ik, weighting_dic)

In [59]:
print("Input sample:", input)
print(f"Input shape: {input.shape}", end="\n\n")
print(f"Label sample: {label=}")
print("Label shape:", label.shape)

Input sample: [[1896]
 [7092]
 [7166]
 [5666]
 [3683]]
Input shape: (5, 1)

Label sample: label=array([[7261, 6148, 7661, 7638],
       [7212, 2729, 2250,   48],
       [4410, 5184, 4842, 3483],
       [6288, 3106, 7690, 4601],
       [7017, 4456, 3944, 2597]])
Label shape: (5, 4)


In [60]:
input, target, cooc, weightin

(array([[1896],
        [7092],
        [7166],
        [5666],
        [3683]]),
 array([[6780],
        [6060],
        [5134],
        [4922],
        [ 405]]),
 array([[0.69314718],
        [0.69314718],
        [0.69314718],
        [2.30258509],
        [0.69314718]]),
 array([0.05318296, 0.05318296, 0.05318296, 0.17782794, 0.05318296]))

### Helper functions for skipgram with negative sampling

In [61]:
def prepare_sequence(seq, word2index):
    #map(function, list of something)
    #map will look at each of element in this list, and apply this function
    idxs = list(map(lambda w: word2index[w] if word2index.get(w) is not None else word2index["<UNK>"], seq))
    return torch.LongTensor(idxs)

In [62]:
import random
#you don't want to pick samples = targets, basically negative samples
#k = number of negative samples - how many? they found 10 is the best
#will be run during training
#after random_batch, 
def negative_sampling(targets, unigram_table, k):
    #targets is already in id.....
    #but the unigram_table is in word....
    #1. get the batch size of this targets
    batch_size = targets.shape[0]
    neg_samples = []
    #2. for each batch
    for i in range(batch_size):
        #randomly pick k negative words from unigram_table
        target_index = targets[i].item()  #looping each of the batch....
        nsample = []
        while len(nsample) < k:
            neg = random.choice(unigram_table)
            #if this word == target, skip this word
            if word2index[neg] == target_index:
                continue
            nsample.append(neg)
        #append this word to some list
        neg_samples.append(prepare_sequence(nsample, word2index).reshape(1, -1))  #tensor[], tensor[]
    return torch.cat(neg_samples)  #tensor[[], []]

<b>Test Functions</b>

In [63]:
input_batch, label_batch = random_batch(2, corpus_tokenized, skipgrams)

input_batch, label_batch

(array([[4358],
        [7212]]),
 array([[2375],
        [6786]]))

In [64]:
input_batch = torch.LongTensor(input_batch)
label_batch = torch.LongTensor(label_batch)

In [65]:
num_neg = 5  #in the real code, we gonna use 10 (like in the paper)
neg_samples = negative_sampling(label_batch, unigram_table, num_neg)

In [66]:
neg_samples.shape

torch.Size([2, 5])

## 3) Model

### Skip-gram

$$J(\theta) = -\frac{1}{T}\sum_{t=1}^{T}\sum_{\substack{-m \leq j \leq m \\ j \neq 0}}\log P(w_{t+j} | w_t; \theta)$$

where $P(w_{t+j} | w_t; \theta) = $

$$P(o|c)=\frac{\exp(\mathbf{u_o^{\top}v_c})}{\sum_{w=1}^V\exp(\mathbf{u_w^{\top}v_c})}$$

where $o$ is the outside words and $c$ is the center word

In [67]:
#the model will accept three vectors - u_o, v_c, u_w
#u_o - vector for outside words
#v_c - vector for center word
#u_w - vectors of all vocabs

class Skipgram(nn.Module):
    
    def __init__(self, voc_size, emb_size):
        super(Skipgram, self).__init__()
        self.embedding_center_word  = nn.Embedding(voc_size, emb_size)
        self.embedding_outside_word = nn.Embedding(voc_size, emb_size)
    
    def forward(self, center_word, outside_word, all_vocabs):
        #center_word, outside_word: (batch_size, 1)
        #all_vocabs: (batch_size, voc_size)
        
        #convert them into embedding
        center_word_embed  = self.embedding_center_word(center_word)     #(batch_size, 1, emb_size)
        outside_word_embed = self.embedding_outside_word(outside_word)   #(batch_size, 1, emb_size)
        all_vocabs_embed   = self.embedding_outside_word(all_vocabs)     #(batch_size, voc_size, emb_size)
        
        #bmm is basically @ or .dot , but across batches (i.e., ignore the batch dimension)
        top_term = outside_word_embed.bmm(center_word_embed.transpose(1, 2)).squeeze(2)
        #(batch_size, 1, emb_size) @ (batch_size, emb_size, 1) = (batch_size, 1, 1) ===> (batch_size, 1)
        
        top_term_exp = torch.exp(top_term)  #exp(uo vc)
        #(batch_size, 1)
        
        lower_term = all_vocabs_embed.bmm(center_word_embed.transpose(1, 2)).squeeze(2)
         #(batch_size, voc_size, emb_size) @ (batch_size, emb_size, 1) = (batch_size, voc_size, 1) = (batch_size, voc_size)
         
        lower_term_sum = torch.sum(torch.exp(lower_term), 1) #sum exp(uw vc)
        #(batch_size, 1)
        
        loss_fn = -torch.mean(torch.log(top_term_exp / lower_term_sum))
        #(batch_size, 1) / (batch_size, 1) ==mean==> scalar
        
        return loss_fn

In [68]:
#preparing all_vocabs

batch_size = 5

voc_size = len(vocabs)
voc_size

def prepare_sequence(seq, word2index):
    #map(function, list of something)
    #map will look at each of element in this list, and apply this function
    idxs = list(map(lambda w: word2index[w] if word2index.get(w) is not None else word2index["<UNK>"], seq))
    return torch.LongTensor(idxs)

all_vocabs = prepare_sequence(list(vocabs), word2index).expand(batch_size, voc_size)
all_vocabs.shape

torch.Size([5, 8030])

<b>Test Model</b>

In [69]:
input, label = random_batch(batch_size, corpus_tokenized, skipgrams)

In [70]:
input, label

(array([[7766],
        [3372],
        [7626],
        [5870],
        [4034]]),
 array([[2758],
        [5885],
        [6506],
        [3256],
        [5865]]))

In [71]:
emb_size = 2
model = Skipgram(voc_size, emb_size)

In [72]:
input_tensor = torch.LongTensor(input)  
label_tensor = torch.LongTensor(label) 

In [73]:
loss = model(input_tensor, label_tensor, all_vocabs)

In [74]:
loss

tensor(10.6690, grad_fn=<NegBackward0>)

### Skip-gram with negative sampling

$$\mathbf{J}_{\text{neg-sample}}(\mathbf{v}_c,o,\mathbf{U})=-\log(\sigma(\mathbf{u}_o^T\mathbf{v}_c))-\sum_{k=1}^K\log(\sigma(-\mathbf{u}_k^T\mathbf{v}_c))$$

In [75]:
class SkipgramNeg(nn.Module):
    
    def __init__(self, voc_size, emb_size):
        super(SkipgramNeg, self).__init__()
        self.embedding_center_word  = nn.Embedding(voc_size, emb_size)
        self.embedding_outside_word = nn.Embedding(voc_size, emb_size)
        self.logsigmoid = nn.LogSigmoid()
        
    def forward(self, center_words, outside_words, negative_words):
        #center_words, outside_words: (batch_size, 1)
        #negative_words:  (batch_size, k)
        
        center_embed  = self.embedding_center_word(center_words)    #(batch_size, 1, emb_size)
        outside_embed = self.embedding_outside_word(outside_words)  #(batch_size, 1, emb_size)
        neg_embed     = self.embedding_outside_word(negative_words) #(batch_size, k, emb_size)
        
        uovc          =  outside_embed.bmm(center_embed.transpose(1, 2)).squeeze(2)  #(batch_size, 1)
        ukvc          = -neg_embed.bmm(center_embed.transpose(1, 2)).squeeze(2)  #(batch_size, k)
        ukvc_sum      =  torch.sum(ukvc, 1).view(-1, 1) #(batch_size, 1)
        
        loss = self.logsigmoid(uovc) + self.logsigmoid(ukvc_sum)  #(batch_size, 1) + (batch_size, 1)
                
        return -torch.mean(loss)  #scalar, loss should be scalar, to call backward()

<b>Test Model</b>

In [76]:
input, label = random_batch(batch_size, corpus_tokenized, skipgrams)
input_tensor = torch.LongTensor(input)  
label_tensor = torch.LongTensor(label)

In [77]:
emb_size = 2
voc_size = len(vocabs)
model = SkipgramNeg(voc_size, emb_size)

In [78]:
neg_tensor = negative_sampling(label_tensor, unigram_table, 5)

In [79]:
input_tensor.shape, label_tensor.shape#, neg_tensor.shape

(torch.Size([5, 1]), torch.Size([5, 1]))

In [80]:
loss = model(input_tensor, label_tensor, neg_tensor)

In [81]:
loss

tensor(2.0161, grad_fn=<NegBackward0>)

### CBOW

In [82]:
#the model will accept three vectors - u_c, u_j, v_mean
#u_c - vector for center word from the output word matrix
#u_j - vectors for all vocab from the output word matrix
#v_mean - mean of the vectors of context words from the input word matrix 

class CBOW(nn.Module):
    
    def __init__(self, voc_size, emb_size):
        super(CBOW, self).__init__()
        self.emb_size = emb_size
        self.input_word  = nn.Embedding(voc_size, emb_size) #v
        self.output_word = nn.Embedding(voc_size, emb_size) #u
    
    def forward(self, center_word, outside_word, all_vocabs):
        #center_word: (batch_size, 1)
        #context_words: (batch_size, window_size * 2)
        #all_vocabs: (batch_size, voc_size)
        batch_size = center_word.shape[0]
        
        #convert them into embedding
        center_word_embed  = self.output_word(center_word)   #(batch_size, 1, emb_size)
        outside_word_embed = self.input_word(outside_word)   #(batch_size, window_size * 2, emb_size)
        all_vocabs_embed   = self.output_word(all_vocabs)    #(batch_size, voc_size, emb_size)        
        
        # mean of input word embeddings
        v_mean = torch.sum(outside_word_embed, 1) / len(outside_word) #(batch_size, emb_size)
        
        ucv = center_word_embed.bmm(v_mean.reshape(batch_size, 1, self.emb_size).transpose(1, 2)).squeeze()
        #(batch_size, 1, emb_size) @ (batch_size, emb_size, 1) = (batch_size, 1, 1) ==> (batch_size, 1)
        
        ujv = all_vocabs_embed.bmm(v_mean.reshape(batch_size, 1, self.emb_size).transpose(1, 2)).squeeze(2)
        #(batch_size, voc_size, emb_size) @ (batch_size, emb_size, 1) = (batch_size, voc_size)
        
        ujv_log_exp = torch.log(torch.exp(ujv))
        # (batch_size, voc_size) -> (batch_size,)
        
        loss_fn = - ucv + torch.sum(ujv_log_exp, 1)
        # - (batch_size, 1) + (batch_size, 1) = (batch_size, 1)
        
        return torch.mean(loss_fn) # scaler for back-propagation

<b>Test Model</b>

In [83]:
all_vocabs = prepare_sequence(list(vocabs), word2index).expand(5, voc_size)
model = CBOW(voc_size, emb_size=2)

In [84]:
input, label = random_batch_cbow(5, corpus_tokenized, skipgrams_CBOW)
input_tensor = torch.LongTensor(input)  
label_tensor = torch.LongTensor(label)
input_tensor.shape, label_tensor.shape

(torch.Size([5, 1]), torch.Size([5, 4]))

In [85]:
input_tensor[0], label_tensor[0]

(tensor([1695]), tensor([6908, 2849, 7007, 4609]))

In [86]:
model(input_tensor, label_tensor, all_vocabs)

tensor(4.0261, grad_fn=<MeanBackward0>)

### GloVe

In [87]:
class GloVe(nn.Module):
    
    def __init__(self, vocab_size,embed_size):
        super(GloVe,self).__init__()
        self.embedding_v = nn.Embedding(vocab_size, embed_size) # center embedding
        self.embedding_u = nn.Embedding(vocab_size, embed_size) # out embedding
        
        self.v_bias = nn.Embedding(vocab_size, 1)
        self.u_bias = nn.Embedding(vocab_size, 1)
        
    def forward(self, center_words, target_words, coocs, weighting):
        center_embeds = self.embedding_v(center_words) # [batch_size, 1, emb_size]
        target_embeds = self.embedding_u(target_words) # [batch_size, 1, emb_size]
        
        center_bias = self.v_bias(center_words).squeeze(1)
        target_bias = self.u_bias(target_words).squeeze(1)
        
        inner_product = target_embeds.bmm(center_embeds.transpose(1, 2)).squeeze(2)
        #[batch_size, 1, emb_size] @ [batch_size, emb_size, 1] = [batch_size, 1, 1] = [batch_size, 1]
        
        #note that coocs already got log
        loss = weighting*torch.pow(inner_product +center_bias + target_bias - coocs, 2)
        
        return torch.sum(loss)

<b>Test Model</b>

In [88]:
input, target, cooc, weightin = random_batch_gloVe(5, corpus_tokenized, gloVes, X_ik, weighting_dic)

In [89]:
emb_size = 2
voc_size = len(vocabs)
model = GloVe(voc_size, emb_size)

In [90]:
input_batch    = torch.LongTensor(input)
target_batch   = torch.LongTensor(target)
cooc_batch     = torch.FloatTensor(cooc)
weightin_batch = torch.FloatTensor(weightin)

In [91]:
loss = model(input_batch, target_batch, cooc_batch, weightin_batch)

In [92]:
loss

tensor(2.3809, grad_fn=<SumBackward0>)

## 4) Training Model

In [93]:
import time

In [94]:
num_epochs = 5000

### Skip-gram

In [95]:
emb_size   = 50
model1     = Skipgram(voc_size, emb_size)
optimizer  = optim.Adam(model.parameters(), lr=0.001)
batch_size = 25

In [96]:
all_vocabs = prepare_sequence(list(vocabs), word2index).expand(batch_size, voc_size)
all_vocabs.shape

torch.Size([25, 8030])

In [97]:
train_start_time = time.time()

#for epoch
for epoch in range(num_epochs):
    start_time = time.time()
    #get random batch
    input_batch, label_batch = random_batch(batch_size, corpus_tokenized, skipgrams)
    input_batch = torch.LongTensor(input_batch)
    label_batch = torch.LongTensor(label_batch)
    
    # print(input_batch.shape, label_batch.shape, all_vocabs.shape)
    
    #loss = model
    loss = model1(input_batch, label_batch, all_vocabs)
    
    #backpropagate
    loss.backward()
    
    #update alpha
    optimizer.step()
    
    total_time = time.time() - start_time
    #print epoch loss
    if (epoch + 1) % 250 == 0:
        print(f"Epoch {epoch+1} | Loss: {loss:.6f} | Time: {total_time}")
        
total_training_time = time.time() - train_start_time
print("Total trainig time:", total_training_time)

Epoch 250 | Loss: 27.133650 | Time: 0.0624232292175293
Epoch 500 | Loss: 25.043547 | Time: 0.05271458625793457
Epoch 750 | Loss: 30.269293 | Time: 0.05938458442687988
Epoch 1000 | Loss: 25.299688 | Time: 0.10424232482910156
Epoch 1250 | Loss: 28.746025 | Time: 0.051509857177734375
Epoch 1500 | Loss: 28.257807 | Time: 0.0509343147277832
Epoch 1750 | Loss: 25.788721 | Time: 0.05037236213684082
Epoch 2000 | Loss: 26.517616 | Time: 0.05024242401123047
Epoch 2250 | Loss: 28.925938 | Time: 0.07238578796386719
Epoch 2500 | Loss: 28.873737 | Time: 0.05432248115539551
Epoch 2750 | Loss: 24.266396 | Time: 0.0474543571472168
Epoch 3000 | Loss: 30.304535 | Time: 0.054558753967285156
Epoch 3250 | Loss: 29.937788 | Time: 0.053843021392822266
Epoch 3500 | Loss: 25.381868 | Time: 0.05387735366821289
Epoch 3750 | Loss: 27.632574 | Time: 0.05071067810058594
Epoch 4000 | Loss: 25.153982 | Time: 0.05628061294555664
Epoch 4250 | Loss: 28.964128 | Time: 0.050932884216308594
Epoch 4500 | Loss: 24.550501 | Ti

### Skip-gram with Negative Sampling

In [98]:
emb_size  = 50
model2 = SkipgramNeg(voc_size, emb_size)
optimizer = optim.Adam(model.parameters(), lr=0.001)
batch_size = 25

In [99]:
train_start_time = time.time()

#for epoch
for epoch in range(num_epochs):
    start_time = time.time()
    
    #get random batch
    input_batch, label_batch = random_batch(batch_size, corpus_tokenized, skipgrams)
    input_batch = torch.LongTensor(input_batch)
    label_batch = torch.LongTensor(label_batch)
    neg_batch   = negative_sampling(label_batch, unigram_table, 5)    
    
    #loss = model
    loss = model2(input_batch, label_batch, neg_batch)
    
    #backpropagate
    loss.backward()
    
    #update alpha
    optimizer.step()
    
    total_time = time.time() - start_time
    #print epoch loss
    if (epoch + 1) % 250 == 0:
        print(f"Epoch {epoch+1} | Loss: {loss:.6f} | Time: {total_time}")
        
total_training_time = time.time() - train_start_time
print("Total trainig time:", total_training_time)

Epoch 250 | Loss: 8.840124 | Time: 0.011084318161010742
Epoch 500 | Loss: 12.489389 | Time: 0.011999368667602539
Epoch 750 | Loss: 14.173753 | Time: 0.009119749069213867
Epoch 1000 | Loss: 9.965052 | Time: 0.014000892639160156
Epoch 1250 | Loss: 11.300850 | Time: 0.016988754272460938
Epoch 1500 | Loss: 4.757360 | Time: 0.0112457275390625
Epoch 1750 | Loss: 4.753320 | Time: 0.011295318603515625
Epoch 2000 | Loss: 7.002831 | Time: 0.01408839225769043
Epoch 2250 | Loss: 11.369360 | Time: 0.013231754302978516
Epoch 2500 | Loss: 9.133273 | Time: 0.010119199752807617
Epoch 2750 | Loss: 10.411145 | Time: 0.013116598129272461
Epoch 3000 | Loss: 12.896254 | Time: 0.013008832931518555
Epoch 3250 | Loss: 8.752337 | Time: 0.011004924774169922
Epoch 3500 | Loss: 10.880651 | Time: 0.011614084243774414
Epoch 3750 | Loss: 9.299257 | Time: 0.014136075973510742
Epoch 4000 | Loss: 9.748012 | Time: 0.009064912796020508
Epoch 4250 | Loss: 5.862792 | Time: 0.010156631469726562
Epoch 4500 | Loss: 6.698127 | 

### CBOW

In [100]:
emb_size  = 50
model3 = CBOW(voc_size, emb_size)
optimizer = optim.Adam(model.parameters(), lr=0.001)
batch_size = 25

In [101]:
all_vocabs = prepare_sequence(list(vocabs), word2index).expand(batch_size, voc_size)
all_vocabs.shape

torch.Size([25, 8030])

In [102]:
train_start_time = time.time()

#for epoch
for epoch in range(num_epochs):
    start_time = time.time()
    #get random batch
    input_batch, label_batch = random_batch_cbow(batch_size, corpus_tokenized, skipgrams_CBOW)
    input_batch = torch.LongTensor(input_batch)
    label_batch = torch.LongTensor(label_batch)
    
    #loss = model
    loss = model3(input_batch, label_batch, all_vocabs)
    
    #backpropagate
    loss.backward()
    
    #update alpha
    optimizer.step()
    
    total_time = time.time() - start_time
    #print epoch loss
    if (epoch + 1) % 250 == 0:
        print(f"Epoch {epoch+1} | Loss: {loss:.6f} | Time: {total_time}")
        
total_training_time = time.time() - train_start_time
print("Total trainig time:", total_training_time)

Epoch 250 | Loss: 0.235588 | Time: 0.04146552085876465
Epoch 500 | Loss: 2.079187 | Time: 0.04926276206970215
Epoch 750 | Loss: 2.090953 | Time: 0.04590153694152832
Epoch 1000 | Loss: 4.416236 | Time: 0.04063844680786133
Epoch 1250 | Loss: 3.061728 | Time: 0.046878814697265625
Epoch 1500 | Loss: 7.287611 | Time: 0.048037052154541016
Epoch 1750 | Loss: 3.786499 | Time: 0.03947854042053223
Epoch 2000 | Loss: 11.088496 | Time: 0.04648327827453613
Epoch 2250 | Loss: -2.276132 | Time: 0.046859025955200195
Epoch 2500 | Loss: -6.187428 | Time: 0.043962717056274414
Epoch 2750 | Loss: 6.884033 | Time: 0.045606136322021484
Epoch 3000 | Loss: -14.870471 | Time: 0.043202877044677734
Epoch 3250 | Loss: -11.567373 | Time: 0.04187822341918945
Epoch 3500 | Loss: 10.264377 | Time: 0.04459786415100098
Epoch 3750 | Loss: -4.122556 | Time: 0.048424482345581055
Epoch 4000 | Loss: 7.903654 | Time: 0.03985166549682617
Epoch 4250 | Loss: 13.996526 | Time: 0.04264545440673828
Epoch 4500 | Loss: 8.725114 | Time

### GloVe

In [103]:
emb_size  = 50
model4 = GloVe(voc_size, emb_size)
optimizer = optim.Adam(model.parameters(), lr=0.001)
batch_size = 25

In [104]:
train_start_time = time.time()

#for epoch
for epoch in range(num_epochs):
    start_time = time.time()
    
    #get random batch
    input, target, cooc, weightin = random_batch_gloVe(batch_size, corpus_tokenized, gloVes, X_ik, weighting_dic)
    input_batch    = torch.LongTensor(input)
    target_batch   = torch.LongTensor(target)
    cooc_batch     = torch.FloatTensor(cooc)
    weightin_batch = torch.FloatTensor(weightin)
        
    #loss = model
    loss = model4(input_batch, target_batch, cooc_batch, weightin_batch)
    
    #backpropagate
    loss.backward()
    
    #update alpha
    optimizer.step()
    
    total_time = time.time() - start_time
    
    #print epoch loss
    if (epoch + 1) % 250 == 0:
        print(f"Epoch {epoch+1} | Loss: {loss:.6f} | Time: {total_time}")
        
total_training_time = time.time() - train_start_time
print("Total trainig time:", total_training_time)

Epoch 250 | Loss: 2264.771240 | Time: 0.02010178565979004
Epoch 500 | Loss: 2286.709961 | Time: 0.020184993743896484
Epoch 750 | Loss: 1489.277466 | Time: 0.02032613754272461
Epoch 1000 | Loss: 3218.607666 | Time: 0.019252777099609375
Epoch 1250 | Loss: 1753.649780 | Time: 0.017228126525878906
Epoch 1500 | Loss: 1367.238281 | Time: 0.016101360321044922
Epoch 1750 | Loss: 1028.348022 | Time: 0.017450571060180664
Epoch 2000 | Loss: 1810.755615 | Time: 0.014273881912231445
Epoch 2250 | Loss: 1304.259888 | Time: 0.017575979232788086
Epoch 2500 | Loss: 2069.218994 | Time: 0.01716160774230957
Epoch 2750 | Loss: 1608.393921 | Time: 0.016108274459838867
Epoch 3000 | Loss: 1967.837524 | Time: 0.017348766326904297
Epoch 3250 | Loss: 2499.779785 | Time: 0.01599574089050293
Epoch 3500 | Loss: 1666.710938 | Time: 0.0143280029296875
Epoch 3750 | Loss: 1845.511963 | Time: 0.014918327331542969
Epoch 4000 | Loss: 2660.739746 | Time: 0.017229795455932617
Epoch 4250 | Loss: 1971.471680 | Time: 0.01420402

## 5) Evaluating Embeddings

Since an analogy task dataset is being used, it contains questions in form of  "a is to b as c is to ?". By identifying the fourth word, accuracy for both semantic and syntactic parts will check. Suppose A, B, and C are the embeddings of words a, b, and c. The word whose embedding gives the highest cosine similarity (B - A + C) would be the missing term to the question. Therefore, every word in the vocabulary is tried. 

After doing an evalution, unfortunately it was found that none of the models identified the missing term correctly for both semantic and syntactic parts. This may be due to limited corpus size and vocabulary. Due to this, another accuracy based on how the correct missing term is ranked based on similarity is used. The average was taken from very analogy task questions for both semantic and syntactic parts.

In [105]:
def get_embed(word, model):
    try:
        index = word2index[word]
    except:
        index = word2index['<UNK>']
    
    word = torch.LongTensor([index])

    center_embed  = model.embedding_center_word(word)
    outside_embed = model.embedding_outside_word(word)
    
    embed = (center_embed + outside_embed) / 2
    
    return embed
    # return  embed[0][0].item(), embed[0][1].item()

In [106]:
get_embed("man", model1).shape

torch.Size([1, 50])

In [107]:
def get_embed_CBOW(word, model):
    try:
        index = word2index[word]
    except:
        index = word2index['<UNK>']
    
    word = torch.LongTensor([index])

    center_embed  = model.input_word(word)
    outside_embed = model.output_word(word)
    
    embed = (center_embed + outside_embed) / 2
    
    return embed
    # return  embed[0][0].item(), embed[0][1].item()

In [108]:
get_embed_CBOW("man", model3).shape

torch.Size([1, 50])

In [109]:
def get_embed_GloVe(word, model):
    id_tensor = torch.LongTensor([word2index[word]])
    v_embed = model.embedding_v(id_tensor)
    u_embed = model.embedding_u(id_tensor) 
    word_embed = (v_embed + u_embed) / 2 
    
    return word_embed

    # x, y = word_embed[0][0].item(), word_embed[0][1].item()
    # return x, y

In [110]:
get_embed_GloVe("man", model4).shape

torch.Size([1, 50])

In [111]:
from numpy import dot
from numpy.linalg import norm

def cos_sim(a, b):
    cos_sim = dot(a, b)/(norm(a)*norm(b))
    return cos_sim

### Load and Prepare Test Dataset

https://www.pythontutorial.net/python-basics/python-read-text-file/

In [112]:
with open('questions-words.txt') as f:
    testDataset = [line.strip() for line in f.readlines() if line[0] != ":" ]

In [113]:
testDataset = {}
key         = None
value       = []
with open('questions-words.txt') as f:
    for line in f.readlines():
        if line[0] == ":":
            if key != None:
                testDataset[key] = value
            key   = line.strip()
            value = []
            continue
        value.append(line.strip().lower())
    testDataset[key] = value

In [114]:
len(testDataset)

14

In [115]:
testDataset.keys()

dict_keys([': capital-common-countries', ': capital-world', ': currency', ': city-in-state', ': family', ': gram1-adjective-to-adverb', ': gram2-opposite', ': gram3-comparative', ': gram4-superlative', ': gram5-present-participle', ': gram6-nationality-adjective', ': gram7-past-tense', ': gram8-plural', ': gram9-plural-verbs'])

In [116]:
testDataset[': capital-common-countries'][:10]

['athens greece baghdad iraq',
 'athens greece bangkok thailand',
 'athens greece beijing china',
 'athens greece berlin germany',
 'athens greece bern switzerland',
 'athens greece cairo egypt',
 'athens greece canberra australia',
 'athens greece hanoi vietnam',
 'athens greece havana cuba',
 'athens greece helsinki finland']

In [117]:
semantic  = [': capital-common-countries', ': capital-world', ': currency', ': family']
syntactic = [': gram1-adjective-to-adverb', ': gram2-opposite', ': gram3-comparative', ': gram4-superlative', 
             ': gram5-present-participle', ': gram6-nationality-adjective', ': gram7-past-tense', ': gram8-plural', 
             ': gram9-plural-verbs']

### Semantic Accuracy

In [118]:
def semantic_syntatic_eval(model, testDict, CBOWMode = False, GloVeMode = False):
    total_corr = 0
    pairs_used = 0
    acc_sum    = 0
    
    for key in testDict:
        pairs = testDataset[key]
        # print(pairs[0].split(" "))
        for pair in pairs:
            pair_tokenized = pair.split(" ")
            word_a, word_b, word_c, word_d = pair_tokenized  
            if word_a not in word2index or word_b not in word2index or word_c not in word2index or word_d not in word2index:
                continue
                
            word_d_index = vocabs.index(word_d)
            
            if GloVeMode:
                a_embedding = get_embed_GloVe(word_a, model) 
                b_embedding = get_embed_GloVe(word_b, model)
                c_embedding = get_embed_GloVe(word_c, model)
            elif CBOWMode:
                a_embedding = get_embed_CBOW(word_a, model) 
                b_embedding = get_embed_CBOW(word_b, model)
                c_embedding = get_embed_CBOW(word_c, model)
            else:
                a_embedding = get_embed(word_a, model) 
                b_embedding = get_embed(word_b, model)
                c_embedding = get_embed(word_c, model)
            
            AminusBplusC = (b_embedding - a_embedding + c_embedding).squeeze()
            
            count = -1
            similarity_arr = [0] * len(vocabs)
            for vocab in vocabs:
                count += 1
                if vocab in pair_tokenized[:3]:
                    continue
                
                if GloVeMode:
                    current = get_embed_GloVe(vocab, model).squeeze()
                elif CBOWMode:
                    current = get_embed_CBOW(vocab, model).squeeze()
                else:
                    current = get_embed(vocab, model).squeeze()
                similarity_arr[count] = cos_sim(AminusBplusC.detach().numpy(), current.detach().numpy())
            
            similarity_arr_sorted_index = np.argsort(similarity_arr)
            rank                        = np.where(similarity_arr_sorted_index == word_d_index)[0][0]
            acc_sum += (rank + 1) / len(vocabs)
            
            pairs_used += 1
            predicted_word = np.argmax(similarity_arr)
            if predicted_word == word_d:
                total_corr+= 1
                
    total_acc = total_corr / pairs_used
    avg_acc = acc_sum / pairs_used
    return total_acc, total_corr, avg_acc, pairs_used

##### Skip-gram

In [119]:
total_acc, total_corr, avg_acc, pairs_used = semantic_syntatic_eval(model1, semantic)

In [120]:
print(f"Overall Accuracy: {total_acc}%")
print(f"Average accuracy according to the ranking: {round(avg_acc, 2)}%")
print(f"Total pairs used:", pairs_used)

Overall Accuracy: 0.0%
Average accuracy according to the ranking: 0.49%
Total pairs used: 165


##### Skip-gram with negative sampling

In [121]:
total_acc_neg, total_corr_neg, avg_acc_neg, pairs_used_neg = semantic_syntatic_eval(model2, semantic)

In [122]:
print(f"Overall Accuracy: {total_acc_neg}%")
print(f"Average accuracy according to the ranking: {round(avg_acc_neg, 2)}%")
print(f"Total pairs used:", pairs_used_neg)

Overall Accuracy: 0.0%
Average accuracy according to the ranking: 0.57%
Total pairs used: 165


##### CBOW

In [123]:
total_acc_cbow, total_corr_cbow, avg_acc_cbow, pairs_used_cbow = semantic_syntatic_eval(model3, semantic, CBOWMode=True)

In [124]:
print(f"Overall Accuracy: {total_acc_cbow}%")
print(f"Average accuracy according to the ranking: {round(avg_acc_cbow, 2)}%")
print(f"Total pairs used:", pairs_used_cbow)

Overall Accuracy: 0.0%
Average accuracy according to the ranking: 0.45%
Total pairs used: 165


##### GloVe

In [125]:
total_acc_gloVe, total_corr_gloVe, avg_acc_gloVe, pairs_used_gloVe = semantic_syntatic_eval(model4, semantic, GloVeMode=True)

In [126]:
print(f"Overall Accuracy: {total_acc_gloVe}%")
print(f"Average accuracy according to the ranking: {round(avg_acc_gloVe, 2)}%")
print(f"Total pairs used:", pairs_used_gloVe)

Overall Accuracy: 0.0%
Average accuracy according to the ranking: 0.4%
Total pairs used: 165


### Syntatic Accuracy

##### Skip-gram

In [127]:
total_acc2, total_corr2, avg_acc2, pairs_used2 = semantic_syntatic_eval(model1, syntactic)

In [128]:
print(f"Overall Accuracy: {total_acc2}%")
print(f"Average accuracy according to the ranking: {round(avg_acc2, 2)}%")
print(f"Total pairs used:", pairs_used2)

Overall Accuracy: 0.0%
Average accuracy according to the ranking: 0.48%
Total pairs used: 1266


##### Skip-gram with negative sampling

In [129]:
total_acc_neg2, total_corr_neg2, avg_acc_neg2, pairs_used_neg2 = semantic_syntatic_eval(model2, syntactic)

In [130]:
print(f"Overall Accuracy: {total_acc_neg2}%")
print(f"Average accuracy according to the ranking: {round(avg_acc_neg2, 2)}%")
print(f"Total pairs used:", pairs_used_neg2)

Overall Accuracy: 0.0%
Average accuracy according to the ranking: 0.5%
Total pairs used: 1266


##### CBOW

In [131]:
total_acc_cbow2, total_corr_cbow2, avg_acc_cbow2, pairs_used_cbow2 = semantic_syntatic_eval(model3, syntactic, CBOWMode=True)

In [None]:
print(f"Overall Accuracy: {total_acc_cbow2}%")
print(f"Average accuracy according to the ranking: {round(avg_acc_cbow2, 2)}%")
print(f"Total pairs used:", pairs_used_cbow2)

Overall Accuracy: 0.0%
Average accuracy according to the ranking: 0.51%
Total pairs used: 1266


##### GloVe

In [None]:
total_acc_gloVe2, total_corr_gloVe2, avg_acc_gloVe2, pairs_used_gloVe2 = semantic_syntatic_eval(model4, syntactic, GloVeMode=True)

In [None]:
print(f"Overall Accuracy: {total_acc_gloVe2}%")
print(f"Average accuracy according to the ranking: {round(avg_acc_gloVe2, 2)}%")
print(f"Total pairs used:", pairs_used_gloVe2)

Overall Accuracy: 0.0%
Average accuracy according to the ranking: 0.48%
Total pairs used: 1266


### ----- Semantic and Syntatic Results Summary -----

https://www.geeksforgeeks.org/how-to-make-a-table-in-python/

In [None]:
from tabulate import tabulate

In [None]:
total_pairs_semantic = 0
for key in semantic:
    total_pairs_semantic += len(testDataset[key])

In [None]:
print("Total pairs for semantic evaluation:", total_pairs_semantic)
print("Pairs used for semantic evaluation:", pairs_used)

Total pairs for semantic evaluation: 6402
Pairs used for semantic evaluation: 165


In [None]:
total_pairs_syntactic = 0
for key in syntactic:
    total_pairs_syntactic += len(testDataset[key])

In [None]:
print("Total pairs for syntactic evaluation:", total_pairs_syntactic)
print("Pairs used for syntactic evaluatio:", pairs_used2)

Total pairs for syntactic evaluation: 10675
Pairs used for syntactic evaluatio: 1266


### Total Accuracy (%)

In [None]:
head    = ["Model", "Semantic", "Syntatic"]
content = [
    ["Skip-gram", total_acc, total_acc2],
    ["Skip-gram with negative sampling", total_acc_neg, total_acc_neg2],
    ["CBOW", total_acc_cbow, total_acc_cbow2],
    ["GloVe", total_acc_gloVe, total_acc_gloVe2]
] 

In [None]:
print(tabulate(content, headers=head, tablefmt="grid"))

+----------------------------------+------------+------------+
| Model                            |   Semantic |   Syntatic |
| Skip-gram                        |          0 |          0 |
+----------------------------------+------------+------------+
| Skip-gram with negative sampling |          0 |          0 |
+----------------------------------+------------+------------+
| CBOW                             |          0 |          0 |
+----------------------------------+------------+------------+
| GloVe                            |          0 |          0 |
+----------------------------------+------------+------------+


### Average Accuracy (%) based on Ranking

In [None]:
head    = ["Model", "Semantic", "Syntatic"]
content = [
    ["Skip-gram", avg_acc, avg_acc2],
    ["Skip-gram with negative sampling", avg_acc_neg, avg_acc_neg2],
    ["CBOW", avg_acc_cbow, avg_acc_cbow2],
    ["GloVe", avg_acc_gloVe, avg_acc_gloVe2]
] 

In [None]:
print(tabulate(content, headers=head, tablefmt="grid"))

+----------------------------------+------------+------------+
| Model                            |   Semantic |   Syntatic |
| Skip-gram                        |   0.510768 |   0.487203 |
+----------------------------------+------------+------------+
| Skip-gram with negative sampling |   0.55887  |   0.48972  |
+----------------------------------+------------+------------+
| CBOW                             |   0.581599 |   0.514867 |
+----------------------------------+------------+------------+
| GloVe                            |   0.532616 |   0.479572 |
+----------------------------------+------------+------------+


### --- Similarity Test on Word Pairs from WordSim353 Dataset ---

https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.spearmanr.html

In [None]:
from scipy import stats

### Load Similarity Dataset

In [None]:
with open('wordsim353_sim_rel/wordsim_relatedness_goldstandard.txt') as f:
    relatedness_db = [line.strip().lower().split("\t") for line in f.readlines()]

In [None]:
relatedness_db[:10]

[['computer', 'keyboard', '7.62'],
 ['jerusalem', 'israel', '8.46'],
 ['planet', 'galaxy', '8.11'],
 ['canyon', 'landscape', '7.53'],
 ['opec', 'country', '5.63'],
 ['day', 'summer', '3.94'],
 ['day', 'dawn', '7.53'],
 ['country', 'citizen', '7.31'],
 ['planet', 'people', '5.75'],
 ['environment', 'ecology', '8.81']]

In [None]:
with open('wordsim353_sim_rel/wordsim_similarity_goldstandard.txt') as f:
    similarity_db = [line.strip().lower().split("\t") for line in f.readlines()]

In [None]:
similarity_db[:10]

[['tiger', 'cat', '7.35'],
 ['tiger', 'tiger', '10.00'],
 ['plane', 'car', '5.77'],
 ['train', 'car', '6.31'],
 ['television', 'radio', '6.77'],
 ['media', 'radio', '7.42'],
 ['bread', 'butter', '6.19'],
 ['cucumber', 'potato', '5.92'],
 ['doctor', 'nurse', '7.00'],
 ['professor', 'doctor', '6.62']]

In [None]:
def similarity_eval(dataset, model, CBOWMode = False, GloVeMode = False):
    valid_idx     = []
    cos_sim_arr   = []
    human_val_sim = []
    
    for idx, pair in enumerate(dataset):
        word1, word2, evalHuman = pair
        if word1 not in vocabs or word2 not in vocabs:
            continue
        
        if GloVeMode:
            word1_embedding = get_embed_GloVe(word1, model)
            word2_embedding = get_embed_GloVe(word2, model)
        elif CBOWMode:
            word1_embedding = get_embed_CBOW(word1, model)
            word2_embedding = get_embed_CBOW(word2, model)
        else:
            word1_embedding = get_embed(word1, model)
            word2_embedding = get_embed(word2, model)
        
        sim_eval = cos_sim(word1_embedding.detach().numpy().squeeze(), word2_embedding.detach().numpy().squeeze())
        valid_idx.append(idx)
        cos_sim_arr.append(sim_eval)
        human_val_sim.append(float(evalHuman))
        
        res = stats.spearmanr(cos_sim_arr, human_val_sim)
    
    return valid_idx, cos_sim_arr, human_val_sim, res

### Spearman Correlation Evaluation

https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.spearmanr.html

https://www.youtube.com/watch?v=JwNwbu-g2m0

Spearman correlation test is the measurement of the "monotonicity of the relationship between two datasets". From this test, two paramters, Spearman correlation coefficient (rs) and p-value are returned. Spearman correlation coefficient (rs) gives information on how strong two dataset are un-linearlu correlated with range -1 to 1. On the order hand,  p-value suggests how strongly two datasets are not correlated with range 0 to 1. Since null hypothesis is used, values closer to 1 would strongly indicate that the two datasets are uncorrelated.

##### Skip-gram

In [None]:
valid_idx, cos_sim_arr, human_val_sim, res = similarity_eval(similarity_db, model1)

In [None]:
print("Total samples:", len(similarity_db))
print("Samples used:", len(valid_idx))
print("Spearmanr correlation coefficient (rs):", round(res.correlation,4))
print("p-value:", round(res.pvalue,4))

Total samples: 203
Samples used: 88
Spearmanr correlation coefficient (rs): 0.1381
p-value: 0.1995


##### Skip-gram with negative sampling

In [None]:
valid_idx_neg, cos_sim_arr_neg, human_val_sim_neg, res_neg = similarity_eval(similarity_db, model2)

In [None]:
print("Total samples:", len(similarity_db))
print("Samples used:", len(valid_idx_neg))
print("Spearmanr correlation coefficient (rs):", round(res_neg.correlation,4))
print("p-value:", round(res_neg.pvalue,4))

Total samples: 203
Samples used: 88
Spearmanr correlation coefficient (rs): 0.1189
p-value: 0.27


##### CBOW

In [None]:
valid_idx_CBOW, cos_sim_arr_CBOW, human_val_sim_CBOW, res_CBOW = similarity_eval(similarity_db, model2)

In [None]:
print("Total samples:", len(similarity_db))
print("Samples used:", len(valid_idx_CBOW))
print("Spearmanr correlation coefficient (rs)::", round(res_CBOW.correlation,4))
print("p-value of Spearmanr:", round(res_CBOW.pvalue,4))

Total samples: 203
Samples used: 88
Spearmanr correlation coefficient (rs):: 0.1189
p-value of Spearmanr: 0.27


##### GloVe

In [None]:
valid_idx_gloVe, cos_sim_arr_gloVe, human_val_sim_gloVe, res_gloVe = similarity_eval(similarity_db, model1)

In [None]:
print("Total samples:", len(similarity_db))
print("Samples used:", len(valid_idx_gloVe))
print("Spearmanr correlation coefficient (rs)::", round(res_gloVe.correlation,4))
print("p-value:", round(res_gloVe.pvalue,4))

Total samples: 203
Samples used: 88
Spearmanr correlation coefficient (rs):: 0.1381
p-value: 0.1995


### ----- Similarity Test Results -----

In [None]:
head    = ["Model", "rs", "p-value"]
content = [
    ["Skip-gram", round(res.correlation,4), round(res.pvalue,4)],
    ["Skip-gram with negative sampling", round(res_neg.correlation,4), round(res_neg.pvalue,4)],
    ["CBOW", round(res_CBOW.correlation,4), round(res_CBOW.pvalue,4)],
    ["GloVe", round(res_gloVe.correlation,4), round(res_gloVe.pvalue,4)]
] 

In [None]:
print(tabulate(content, headers=head, tablefmt="grid"))
print("Where rs is Spearman correlation coefficient")

+----------------------------------+--------+-----------+
| Model                            |     rs |   p-value |
| Skip-gram                        | 0.1381 |    0.1995 |
+----------------------------------+--------+-----------+
| Skip-gram with negative sampling | 0.1189 |    0.27   |
+----------------------------------+--------+-----------+
| CBOW                             | 0.1189 |    0.27   |
+----------------------------------+--------+-----------+
| GloVe                            | 0.1381 |    0.1995 |
+----------------------------------+--------+-----------+
Where rs is Spearman correlation coefficient


## 6) Conclusion

For semantic and syntactic evaluation, none of the models identified the missing term correctly for analogy-task questions. Due to this, another accuracy based on how the correct missing term is ranked based on similarity is used. Because of the vocabulary limitation, only a fraction of analogy-task questions for semantic and syntactic parts were used. Based on the summary results for semantic and syntactic, the correct missing term tends to be ranked at around the 50th percentile of all words in the vocabulary. For semantic accuracy, Skip-gram with negative sampling gave the best result while CBOW gave the best for syntactic accuracy.

From the results of the evaluation of the similarity dataset, the Spearman correlation coefficient tends to suggest that there is a weak correlation. But the p-value also tends to suggest that tends to be more uncorrelated. Due to these, none of the models gave the optimal performance. This is mainly due to the insufficient amount of data in the corpus the models were trained on vocabulary limitation as only 88 out of 203 sample points were used.