# 1. Cosine similarity  
Given pre-trained embeddings of Vietnamese words, implement a function that calculates the cosine similarity between word pairs. Test your program on the word pairs in the ViSim-400 dataset (in directory Datasets/ViSim-400). Use the Pearson correlation coefficient (https://en.wikipedia.org/wiki/Pearson_correlation_coefficient) and Spearman's rank correlation coefficient (https://en.wikipedia.org/wiki/Spearman%27s_rank_correlation_coefficient) to evaluate the correlation between your results and the similarity scores in the dataset.

In [17]:
import numpy as np
import scipy
from scipy import stats

In [18]:
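def read_file_to_list(path):
    # Read the whole file and return its lines (trailing newlines included).
    # The context manager closes the file even if reading fails.
    with open(path, encoding='utf-8') as file:
        return file.readlines()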

In [19]:
w2v_path = 'W2V_150.txt'
dataset = read_file_to_list(w2v_path)

num_of_words = int(dataset[0])  # 77021 words in the embedding file
num_of_dimensions = int(dataset[1])  # each word maps to a 150-dimensional vector
dataset = dataset[2:]

vector = [None]*num_of_dimensions
embeddings = {}

for line in dataset:
    s = line.split(' ')
    word = s[0]
    for i in range(num_of_dimensions):
        vector[i] = float(s[i+2])  # skip the first item (the word) and the second (an empty string)
    embeddings[word] = np.array(vector)
# dict(list(embeddings.items())[0:2])
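
The loop above depends on every line having an empty string at index 1, that is, two spaces after the word. As a sketch only, the same parsing can avoid that fixed offset by taking the last `num_of_dimensions` fields of each line. The helper name `parse_embeddings` and the three-dimensional sample below are made up for illustration. The layout (word count, then dimensionality, then one word per line) is assumed to match what the cell above reads.

```python
import numpy as np

def parse_embeddings(lines):
    """Parse word2vec-style text lines into a {word: vector} dict.

    Assumes line 0 holds the word count, line 1 the dimensionality,
    and every later line is a word followed by its vector components.
    """
    num_words = int(lines[0])
    dim = int(lines[1])
    table = {}
    for line in lines[2:]:
        fields = line.split()          # split on any run of whitespace
        if len(fields) < dim + 1:      # skip blank or truncated lines
            continue
        word = fields[0]
        # Take the last `dim` fields so extra separators do not matter.
        table[word] = np.array(fields[-dim:], dtype=float)
    return num_words, dim, table

# Hypothetical sample, not taken from W2V_150.txt.
sample = ["2", "3", "mèo  0.1 0.2 0.3", "chó  0.4 0.5 0.6"]
n, dim, toy_embeddings = parse_embeddings(sample)
print(n, dim, toy_embeddings["chó"])
```

Building a fresh array per word also avoids sharing one reusable `vector` list across iterations.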

Implement a function for calculating the cosine similarity between word pairs.

In [20]:
def cosine_similarity(A, B):
    # Dot product of A and B divided by the product of their Euclidean norms.
    return np.sum(A*B) / np.sqrt(np.sum(A**2)*np.sum(B**2))
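
A quick way to check a cosine similarity function is to try it on vectors whose angle is known: parallel vectors should give 1, orthogonal vectors 0, and opposite vectors -1. The sketch below uses an equivalent formulation based on `np.dot` and `np.linalg.norm`. The name `cosine_similarity_dot` and the toy vectors are illustrative only and not part of the assignment.

```python
import numpy as np

def cosine_similarity_dot(a, b):
    """Cosine similarity via the dot product and Euclidean norms."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

a = np.array([1.0, 2.0, 3.0])

same_direction = cosine_similarity_dot(a, 2 * a)     # parallel vectors
orthogonal = cosine_similarity_dot(np.array([1.0, 0.0]),
                                   np.array([0.0, 1.0]))
opposite = cosine_similarity_dot(a, -a)              # antiparallel vectors

print(same_direction, orthogonal, opposite)
```

Both formulations divide by the product of the norms, so neither is defined for an all-zero vector.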

Normalize the result to \[0, 1\]

In [21]:
(1 + cosine_similarity(embeddings['wallpaper'], embeddings['sâu']) ) / 2

0.48198830650084357

Test the program on the word pairs in the ViSim-400 dataset.  
Results are stored in `results`.

In [22]:
visim_path = "Visim-400.txt"
lines = read_file_to_list(visim_path)[1:]

results = []

for line in lines:
    s = line.split()
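    w1 = s[0].strip()  # word 1
    if w1 not in embeddings:
        # Out-of-vocabulary word: fall back to an all-ones vector.
        # Note that this also adds the word to `embeddings`.
        embeddings[w1] = np.ones(num_of_dimensions)
    w2 = s[1].strip()  # word 2
    if w2 not in embeddings:
        embeddings[w2] = np.ones(num_of_dimensions)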
    sim = (1+cosine_similarity(embeddings[w1], embeddings[w2]))/2
    results.append(sim)

results[:5]

[0.497543830265165,
 0.5412615916460588,
 0.6385429799341388,
 0.5883993141781336,
 0.4706390934550245]
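
The cell above gives every out-of-vocabulary word the same all-ones vector, so two unknown words get similarity 1 with each other. One alternative, shown here only as a sketch on made-up data, is to skip pairs with a missing word and report how many were skipped. The names `score_pairs`, `toy_table`, and `toy_pairs` are hypothetical. The correlations reported in this notebook were computed with the all-ones fallback, not with this variant.

```python
import numpy as np

def cosine_similarity(a, b):
    return np.sum(a * b) / np.sqrt(np.sum(a ** 2) * np.sum(b ** 2))

def score_pairs(pairs, table):
    """Score covered pairs; collect pairs with an out-of-vocabulary word."""
    scored, skipped = [], []
    for w1, w2 in pairs:
        if w1 in table and w2 in table:
            sim = (1 + cosine_similarity(table[w1], table[w2])) / 2
            scored.append((w1, w2, sim))
        else:
            skipped.append((w1, w2))
    return scored, skipped

# Toy two-dimensional embeddings, not taken from the real data.
toy_table = {
    "a": np.array([1.0, 0.0]),
    "b": np.array([0.0, 1.0]),
    "c": np.array([1.0, 1.0]),
}
toy_pairs = [("a", "b"), ("a", "c"), ("a", "missing")]

scored, skipped = score_pairs(toy_pairs, toy_table)
print(len(scored), "scored;", len(skipped), "skipped")
```

If pairs are skipped, the gold scores must be filtered to the same pairs before computing any correlation.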

## Evaluate the correlation between the results and the similarity scores in the dataset
### Similarity scores in the dataset

In [23]:
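l = [line.split() for line in lines]
# Column 3 of each row is used as the gold similarity score.
# np.float was removed in NumPy 1.24; the built-in float is equivalent.
scores = np.array(l)[:,3].astype(float)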

Using the Pearson correlation coefficient

In [24]:
stats.pearsonr(results, scores)

(0.3787184443819735, 4.337055364178592e-15)

Using Spearman's rank correlation coefficient

In [25]:
stats.spearmanr(results, scores)

SpearmanrResult(correlation=0.3286625921444831, pvalue=1.5746522506040425e-11)
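
The (1 + cos) / 2 rescaling used for `results` is an increasing linear map, so it should leave both coefficients unchanged. The sketch below checks this on synthetic data with small NumPy-only implementations. The helpers `pearson` and `spearman_no_ties` are illustrative stand-ins, not the SciPy functions, and the Spearman version assumes there are no tied values.

```python
import numpy as np

def pearson(x, y):
    """Pearson correlation: covariance over the product of standard deviations."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xc, yc = x - x.mean(), y - y.mean()
    return np.sum(xc * yc) / np.sqrt(np.sum(xc ** 2) * np.sum(yc ** 2))

def spearman_no_ties(x, y):
    """Spearman correlation as Pearson on ranks (valid only without ties)."""
    rank_x = np.argsort(np.argsort(x))
    rank_y = np.argsort(np.argsort(y))
    return pearson(rank_x, rank_y)

# Synthetic stand-ins for raw cosine values and human ratings.
rng = np.random.default_rng(0)
raw_cos = rng.uniform(-1, 1, size=50)
gold = raw_cos + rng.normal(0, 0.5, size=50)
rescaled = (1 + raw_cos) / 2

print(np.isclose(pearson(raw_cos, gold), pearson(rescaled, gold)))
print(np.isclose(spearman_no_ties(raw_cos, gold),
                 spearman_no_ties(rescaled, gold)))
```

In other words, the normalization affects how the scores read, not how well they correlate with the dataset.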

In [26]:
l1 = [1,2,3,4,5]
l2 = [2.5,3,5,10,16]
stats.pearsonr(l1,l2)

(0.9437165488107271, 0.015892932682276078)

# 2. K-nearest words  
Given a word w, find the k most similar words to w using the function implemented in section 1.

In [29]:
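def knearest_words(k, word):
    # Rescaled cosine similarity between `word` and every word in the vocabulary.
    rs = []
    for w in embeddings:
        sim = (1+cosine_similarity(embeddings[word], embeddings[w])) / 2
        rs.append(sim)
    rs = np.array(rs)
    # Indices (in dictionary order) of the k largest similarities, in no
    # particular order; the query word itself is included.
    ind = np.argpartition(rs, -k)[-k:]
    return ind
knearest_words(4, 'xinh')  # k=4 matches the four indices in the recorded output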

array([  212,  2384, 55449,  1519])
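
`np.argpartition` returns positions rather than words, in no particular order, and the query word itself is among them. As a sketch under those observations, the variant below stacks the vectors into a matrix, computes all similarities with one matrix-vector product, drops the query word, and returns words sorted by similarity. The name `k_nearest_words`, the `table` argument, and the toy vectors are hypothetical, and all vectors are assumed to be non-zero.

```python
import numpy as np

def k_nearest_words(k, word, table):
    """Return up to k (word, similarity) pairs, most similar first."""
    words = list(table.keys())
    k = min(k, len(words) - 1)
    if k <= 0:
        return []
    matrix = np.stack([table[w] for w in words])        # shape (n_words, dim)
    unit = matrix / np.linalg.norm(matrix, axis=1, keepdims=True)
    query_index = words.index(word)
    sims = (1 + unit @ unit[query_index]) / 2           # rescaled cosine
    sims[query_index] = -np.inf                         # exclude the query word
    top = np.argpartition(sims, -k)[-k:]                # unordered top-k indices
    top = top[np.argsort(sims[top])[::-1]]              # sort by similarity
    return [(words[i], float(sims[i])) for i in top]

# Toy two-dimensional embeddings, not taken from the real data.
toy_table = {
    "east": np.array([1.0, 0.0]),
    "north": np.array([0.0, 1.0]),
    "northeast": np.array([1.0, 1.0]),
    "west": np.array([-1.0, 0.0]),
}
nearest = k_nearest_words(2, "east", toy_table)
print(nearest)
```

For repeated queries, the normalized matrix could be built once and reused instead of being rebuilt on every call.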

In [37]:
# Map each word to its position in the embeddings dictionary.
word_to_index = {w: i for i, w in enumerate(embeddings)}

In [45]:
# print(word_to_index.keys())
index_to_word = dict(zip(word_to_index.values(), word_to_index.keys()))

In [46]:
len(index_to_word)

77072

In [53]:
# similarity_table = np.zeros((num_of_words, num_of_words), dtype=np.int8)
# similarity_table.shape

(77021, 77021)