<a href="https://colab.research.google.com/github/heinohen/tko_7095_i2hlt/blob/main/Week6_ex1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Exercise 11: Top-K accuracy

In the lecture, we learned to align embedding spaces, which allowed us to "compute" word translations. We marveled at how well this worked, but did not really evaluate this properly, even though we do have a nice test set.

In this exercise, your task is to evaluate the method using the simple "top-k accuracy" metric, which measures whether the correct target is among the first K nearest neighbors. In other words, for a source-target word pair (s, t), we consider the transfer successful if t is among the K nearest neighbors of the embedding we obtain by transforming the embedding of s with the matrix M. Top-K accuracy is then the proportion of successfully transferred pairs out of all pairs, expressed as a percentage.
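
As a minimal sketch of the metric described above, the following toy example computes top-k accuracy from already-ranked candidate lists. The function name `top_k_accuracy` and the toy words are hypothetical and only illustrate the counting; the real nearest-neighbor lists come from the embeddings later in the notebook.

```python
def top_k_accuracy(ranked_candidates, gold_targets, k):
    """Percentage of items whose gold target is among the first k candidates."""
    hits = sum(
        1
        for candidates, gold in zip(ranked_candidates, gold_targets)
        if gold in candidates[:k]
    )
    return 100.0 * hits / len(gold_targets)


# Hypothetical toy data: ranked English candidates for three source words.
ranked = [
    ["dog", "cat", "horse"],    # gold target "dog" is at rank 1
    ["tree", "house", "home"],  # gold target "house" is at rank 2
    ["car", "bus", "train"],    # gold target "boat" is not listed at all
]
gold = ["dog", "house", "boat"]

acc_at_1 = top_k_accuracy(ranked, gold, k=1)  # one hit out of three pairs
acc_at_2 = top_k_accuracy(ranked, gold, k=2)  # two hits out of three pairs
print(acc_at_1, acc_at_2)
```

Increasing K can only keep or raise the score, since every pair counted as a hit at K is also a hit at any larger K.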

In [1]:
import gensim

In [8]:
#!wget http://vectors.nlpl.eu/repository/20/12.zip
#!wget http://vectors.nlpl.eu/repository/20/42.zip 650 KB/s LOL

## Try these if the download above is too slow, I mirrored these:
!wget http://dl.turkunlp.org/TKO_7095_2023/12.zip
!wget http://dl.turkunlp.org/TKO_7095_2023/42.zip # 22 MB/s, much better...

--2024-04-16 18:04:23--  http://dl.turkunlp.org/TKO_7095_2023/12.zip
Resolving dl.turkunlp.org (dl.turkunlp.org)... 195.148.30.23
Connecting to dl.turkunlp.org (dl.turkunlp.org)|195.148.30.23|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 613577258 (585M) [application/zip]
Saving to: ‘12.zip’


2024-04-16 18:04:53 (19.6 MB/s) - ‘12.zip’ saved [613577258/613577258]

--2024-04-16 18:04:53--  http://dl.turkunlp.org/TKO_7095_2023/42.zip
Resolving dl.turkunlp.org (dl.turkunlp.org)... 195.148.30.23
Connecting to dl.turkunlp.org (dl.turkunlp.org)|195.148.30.23|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1849124328 (1.7G) [application/zip]
Saving to: ‘42.zip’


2024-04-16 18:06:23 (19.8 MB/s) - ‘42.zip’ saved [1849124328/1849124328]



In [9]:
!unzip -o 12.zip
!mv model.bin en.bin
!unzip -o 42.zip
!mv model.bin fi.bin

Archive:  12.zip
  inflating: meta.json               
  inflating: model.bin               
  inflating: model.txt               
  inflating: README                  
Archive:  42.zip
  inflating: LIST                    
  inflating: meta.json               
  inflating: model.bin               
  inflating: model.txt               
  inflating: README                  


In [10]:
from gensim.models import KeyedVectors # https://radimrehurek.com/gensim/models/keyedvectors.html


"""
fname == The file path to the saved word2vec-format file.
limit == limit (int, optional) – Sets a maximum number of word-vectors to read from the file. The default, None, means read all.
binary == binary (bool, optional) – If True, indicates whether the data is in binary word2vec format.
"""

wv_embeddings_en = KeyedVectors.load_word2vec_format(fname = 'en.bin', limit = 100000, binary = True)
wv_embeddings_fi = KeyedVectors.load_word2vec_format(fname = 'fi.bin', limit = 100000, binary = True)

In [11]:
# https://github.com/codogogo/xling-eval


# Grab the data
!wget https://raw.githubusercontent.com/codogogo/xling-eval/master/bli_datasets/en-fi/yacle.test.freq.2k.en-fi.tsv
!wget https://raw.githubusercontent.com/codogogo/xling-eval/master/bli_datasets/en-fi/yacle.train.freq.5k.en-fi.tsv


--2024-04-16 18:13:18--  https://raw.githubusercontent.com/codogogo/xling-eval/master/bli_datasets/en-fi/yacle.test.freq.2k.en-fi.tsv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.111.133, 185.199.110.133, 185.199.109.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.111.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 35770 (35K) [text/plain]
Saving to: ‘yacle.test.freq.2k.en-fi.tsv’


2024-04-16 18:13:19 (7.73 MB/s) - ‘yacle.test.freq.2k.en-fi.tsv’ saved [35770/35770]

--2024-04-16 18:13:19--  https://raw.githubusercontent.com/codogogo/xling-eval/master/bli_datasets/en-fi/yacle.train.freq.5k.en-fi.tsv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 82957 (81K) [tex

In [12]:
!cat yacle.test.freq.2k.en-fi.tsv | head -n 10

dedication	omistautuminen
desires	toiveet
dismissed	hylätty
psychic	psyykkinen
cracks	halkeamia
establishments	laitokset
efficacy	tehokkuus
prestige	arvovalta
cocaine	kokaiini
accelerated	kiihtyi


In [15]:
pairs_train = [] #These will be pairs of (source,target) i.e. (Finnish, English) words used to induce the matrix M
pairs_test = [] #same but for testing, so we should make sure there is absolutely no overlap between the train and test data
               #let's do it so that not one word in the test set is seen in any capacity in the training data


import csv

def get_vectors(fname) -> list:
  """
  Read the pairs from the file 'fname'
  """
  pairs = []

  with open(fname, encoding = 'utf-8') as f: # the Finnish words contain non-ASCII characters
    r = csv.reader(f, delimiter = '\t') # tab-separated values

    for en_word, fi_word in r:
      #I will reverse the order here, go from Finnish as the source, to English as the target
      #That way it will be easier to check how this works using English as the target, which we all understand.
      pairs.append((fi_word, en_word))
  return pairs


train_data = get_vectors('yacle.train.freq.5k.en-fi.tsv')
test_data = get_vectors('yacle.test.freq.2k.en-fi.tsv')

print(train_data[:10])
print(len(train_data))

print(test_data[:10])
print(len(test_data))


[('of', 'of'), ('että', 'to'), ('sisään', 'in'), ('varten', 'for'), ('on', 'is'), ('päällä', 'on'), ('että', 'that'), ('mennessä', 'by'), ('Tämä', 'this'), ('kanssa', 'with')]
5000
[('omistautuminen', 'dedication'), ('toiveet', 'desires'), ('hylätty', 'dismissed'), ('psyykkinen', 'psychic'), ('halkeamia', 'cracks'), ('laitokset', 'establishments'), ('tehokkuus', 'efficacy'), ('arvovalta', 'prestige'), ('kokaiini', 'cocaine'), ('kiihtyi', 'accelerated')]
2000


## Get the embeddings

* Now we have the word pairs
* We need the embeddings, so we can build our S and T matrices
* Not all words will be in our W2V embeddings
* Plus, we want to be 100% sure there is absolutely no overlap between the training and test data
* This means not one word seen in the training data will be in the test data
* The general approach will be to gather the vectors into a list, and then vstack (vertical stack) these to get a 2D array, i.e. a matrix
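
The filtering and stacking steps listed above can be sketched on toy data as follows. Plain dicts stand in for the `KeyedVectors` objects here, and all names and vectors (`toy_fi`, `toy_en`, `toy_pairs`) are hypothetical; the point is only to show how a list of 1-D vectors becomes a 2-D matrix with one row per kept pair.

```python
import numpy as np

# Hypothetical toy embeddings; plain dicts stand in for the KeyedVectors objects.
toy_fi = {"koira": np.array([1.0, 0.0]), "kissa": np.array([0.0, 1.0])}
toy_en = {"dog": np.array([1.0, 0.0, 0.0]), "cat": np.array([0.0, 1.0, 0.0])}
toy_pairs = [("koira", "dog"), ("kissa", "cat"), ("hevonen", "horse")]

# Keep only the pairs for which both sides have an embedding.
kept = [(fi, en) for fi, en in toy_pairs if fi in toy_fi and en in toy_en]

# Stack the 1-D vectors row by row into 2-D matrices, keeping the pair order.
S = np.vstack([toy_fi[fi] for fi, _ in kept])
T = np.vstack([toy_en[en] for _, en in kept])

print(kept)
print(S.shape, T.shape)
```

Row i of `S` and row i of `T` belong to the same word pair, which is why the order of the lists must be preserved when stacking.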

In [33]:
import numpy as np

def build_arrays(pairs, emb1, emb2, avoid = frozenset()): # immutable default, avoids the mutable-default-argument pitfall
  """
  pairs == pairs of (fi,en) words
  emb1 == source side (here Finnish) embeddings
  emb2 == target side (here English) embeddings
  avoid == a set of words to avoid or ignore (will be used when building test data, to avoid train data)
  """
  source_vecs, target_vecs, filtered_pairs = [], [], []
  for word1, word2 in pairs: # iterate through all pairs
    # check if both vectors are available, and none of the words is to be avoided
    if word1 in emb1 and word2 in emb2 and word1 not in avoid and word2 not in avoid:
      # let's go
      source_vecs.append(emb1[word1]) # source-side embedding, the KeyedVectors object can be queried as if it was a dict,
                                      # returns the embedding as 1-dim array
      target_vecs.append(emb2[word2])
      filtered_pairs.append((word1,word2)) # remember the pair as tuple
  return np.vstack(source_vecs),np.vstack(target_vecs),filtered_pairs


# Gather the train data first
array_train_fi, array_train_en, pairs_train = build_arrays(train_data, wv_embeddings_fi, wv_embeddings_en) # keep these in order!

# Now build the set of all words seen in training, so we can avoid them when building the test set. Note that "|" is set union operator
everything_in_train = set(s for s,t in pairs_train)|set(t for s,t in pairs_train)

# Test data next, with avoid as the everything_in_train to ignore
array_test_fi,array_test_en,pairs_test = build_arrays(test_data, wv_embeddings_fi, wv_embeddings_en, avoid = everything_in_train)

In [21]:
# Check for absolutely no overlap

# Let's be super-sure there absolutely is no overlap of any kind!
print("Overlap between train pairs and test pairs:",len(set(pairs_train) & set(pairs_test))) # & is set intersection operator, intersection between train and test should be empty
src_train=set(src_w for src_w,tgt_w in pairs_train) #train source words
tgt_train=set(tgt_w for src_w,tgt_w in pairs_train) #train target words
src_test=set(src_w for src_w,tgt_w in pairs_test)   #test source words
tgt_test=set(tgt_w for src_w,tgt_w in pairs_test)   #test target words
print("Overlap between train fi words and test fi words:",len(src_train & src_test))
print("Overlap between train en words and test en words:",len(tgt_train & tgt_test))

Overlap between train pairs and test pairs: 0
Overlap between train fi words and test fi words: 0
Overlap between train en words and test en words: 0


## Mapping matrix

* Next we need to induce the transformation matrix
* I.e., implement the least-squares method from the lecture
* GPT-4 was used for help
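
As a small sanity check of the least-squares idea, the sketch below uses random toy data (not the real embeddings) to compare three ways of solving for M: the closed form (S^T S)^-1 S^T T, the pseudo-inverse, and `np.linalg.lstsq`. The dimensions and the names `M_true`, `M_normal`, `M_pinv` and `M_lstsq` are assumptions made for illustration, and the closed form presumes S has full column rank.

```python
import numpy as np

rng = np.random.default_rng(0)
S = rng.normal(size=(50, 4))      # 50 hypothetical source vectors, 4 dimensions
M_true = rng.normal(size=(4, 3))  # hypothetical ground-truth mapping, 4 -> 3 dims
T = S @ M_true                    # targets generated exactly by the mapping

# Closed-form normal equations: (S^T S)^-1 S^T T
M_normal = np.linalg.inv(S.T @ S) @ S.T @ T

# Pseudo-inverse of S multiplied by T from the right
M_pinv = np.linalg.pinv(S) @ T

# NumPy's dedicated least-squares solver
M_lstsq, *_ = np.linalg.lstsq(S, T, rcond=None)

print(np.allclose(M_normal, M_pinv))
print(np.allclose(M_pinv, M_lstsq))
print(np.allclose(M_pinv, M_true))
```

Because the toy targets are an exact linear function of the sources, all three solutions recover `M_true` up to floating-point error. With real embeddings the fit is only approximate, but the three routes should still agree with each other as long as S has full column rank.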

In [23]:
# This code was written by GPT4, but in a bit of a twisted form, so I modified it
# to better correspond to the formulae in the lecture

def learn_transformation_matrix(source, target):
    # Compute the pseudo-inverse of the source matrix
    source_pseudo_inverse = np.linalg.pinv(source) # This implements (S^T S)^-1 S^T  needed in the least-squares formula in the lecture slides
    # Compute the transformation matrix M using least squares method
    M = np.matmul(source_pseudo_inverse,target)  #...and this multiplies by T from right completing the formula in the slides ... two lines(!)
    return M

# fi -> en matrix
M=learn_transformation_matrix(array_train_fi,array_train_en)



In [26]:
print(f'Source (finnish) shape {array_train_fi.shape}')
print(f'Target (english) shape {array_train_en.shape}')
print(f'M shape {M.shape}')

Source (finnish) shape (4506, 100)
Target (english) shape (4506, 300)
M shape (100, 300)


In [27]:
# And now we transform the source (Finnish) test embeddings into the English embedding space
# using the matrix M

test_fi_transformed = np.matmul(array_test_fi, M)
print(f'Transformed shape: {test_fi_transformed.shape}')
# Mean squared error between the transformed Finnish and the actual English test embeddings
np.square(np.subtract(test_fi_transformed, array_test_en)).mean()

Transformed shape: (1285, 300)


0.002326297

## HOW TO EVALUATE

1) Go over the test word pairs (fi, en)

2) Use the transformed Finnish embedding as a query into the English space

3) List the top-N English words nearest to this transformed embedding
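
To make the query step concrete, here is a minimal NumPy sketch of a nearest-neighbor lookup by cosine similarity, which is the measure gensim uses for ranking in `similar_by_vector`. The helper `top_n_by_cosine` and the 2-dimensional toy vocabulary are hypothetical; the notebook itself relies on gensim for this step.

```python
import numpy as np

def top_n_by_cosine(query, vocab_matrix, vocab_words, n):
    """Return the n (word, similarity) pairs with the highest cosine similarity to query."""
    # Normalize rows and the query to unit length, so a dot product equals cosine similarity.
    unit_vocab = vocab_matrix / np.linalg.norm(vocab_matrix, axis=1, keepdims=True)
    unit_query = query / np.linalg.norm(query)
    sims = unit_vocab @ unit_query
    best = np.argsort(-sims)[:n]  # indices sorted by descending similarity
    return [(vocab_words[i], float(sims[i])) for i in best]


# Hypothetical 2-dimensional "English" vocabulary.
words = ["north", "east", "northeast", "south"]
vectors = np.array([[0.0, 1.0], [1.0, 0.0], [1.0, 1.0], [0.0, -1.0]])

# Stands in for one transformed Finnish embedding.
query = np.array([0.2, 1.0])

neighbours = top_n_by_cosine(query, vectors, words, n=2)
print(neighbours)
```

The query points mostly "north", so `north` ranks first and `northeast` second, while `south` (pointing the opposite way) gets a negative similarity and ranks last.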

In [37]:
print(len(pairs_test))

for i, (word1, word2) in enumerate(pairs_test[:10]):
  print(f'{word1} --ENGLISH-> {word2}:')
  # KeyedVectors.similar_by_vector(vector, topn=10, restrict_vocab=None)
  nn = wv_embeddings_en.similar_by_vector(test_fi_transformed[i]) # nearest neighbours
  eng_words = [word for word, score in nn] # comes as tuples, need only words
  print("   ", ", ".join(eng_words)) # ...and print them comma-separated
  print()


1285
toiveet --ENGLISH-> desires:
    desires, importantly, Certainly, qualities, ideas, perspectives, desire, indeed, sense, notions

psyykkinen --ENGLISH-> psychic:
    cognitive, physiological, behavioral, physical, neurological, mental, disorders, empathy, therapy, interpersonal

halkeamia --ENGLISH-> cracks:
    crevices, vegetation, gullies, surfaces, ridges, walls, limestone, reddish, mottled, sediment

kokaiini --ENGLISH-> cocaine:
    additives, pesticides, substances, caffeine, foods, carcinogenic, medications, drugs, side-effects, chemicals

kiihtyi --ENGLISH-> accelerated:
    slowed, worsened, accelerated, surged, exacerbated, spurred, stagnated, slackened, fueled, ebbed

huippu --ENGLISH-> pinnacle:
    magnificent, breathtaking, ideal, marvelous, majestic, perfect, beautiful, gorgeous, fabulous, awesome

edellä --ENGLISH-> supra:
    therefore, although, instances, indeed, simply, Furthermore, merely, Consequently, fact, actually

päärynä --ENGLISH-> pear:
    melon, tom

In [41]:
# Build a function for this so it can be run for several values of K

def nearest_top_k(value: int) -> float:
  """
  Returns the top-K accuracy as a percentage (float).
  value == K, i.e. how many nearest neighbours are considered for each query
  """
  corr = 0 # test pairs whose target is within the top-K defined by value
  total = 0 # all test pairs ('all' would shadow the Python builtin)
  # KeyedVectors.similar_by_vector(vector, topn=10, restrict_vocab=None)
  # use the 'topn' parameter to limit the results to the K nearest neighbours
  for i, (word1, word2) in enumerate(pairs_test):
    nn = wv_embeddings_en.similar_by_vector(test_fi_transformed[i], topn = value)
    # only membership matters, so store the neighbour words in a set (fast, hash-based lookup)
    words = set(word for word, score in nn) # not interested in the score of an individual word
    # the English word of the pair is in 'word2', so that is what we look for
    if word2 in words: # the target is within the top-K, so this pair counts as correct
      corr += 1
    # whether correct or not, count the pair
    total += 1

  # All done, return the percentage
  return 100 * corr / total






## PRINTS

In [42]:
import tabulate

one = nearest_top_k(1)
five = nearest_top_k(5)
ten = nearest_top_k(10)
fifty = nearest_top_k(50)

data = [
    ["1", one],
    ["5", five],
    ["10", ten],
    ["50", fifty]
]

print(tabulate.tabulate(data, headers=["TOP-K ", "ACCURACY"]))


  TOP-K     ACCURACY
--------  ----------
       1     17.821
       5     34.0078
      10     42.4125
      50     61.0895
