# How to evaluate word embedding models

Existing approaches to word embedding evaluation fall into two major classes: _intrinsic_ and _extrinsic_ evaluation.

_Extrinsic metrics_ evaluate embeddings by using them as features for specific downstream tasks, such as question deduplication, Part-of-Speech (POS) tagging, language modelling, and Named Entity Recognition (NER). Extrinsic metrics have several drawbacks:
1. they are computationally heavy;
2. gold standard datasets for the downstream tasks are complex to create;
3. performance is not consistent across different tasks.

_Intrinsic metrics_ evaluate the quality of a vector model per se, independently of specific downstream tasks. They measure syntactic or semantic relationships between words directly, typically using a gold benchmark of semantic similarity between pairs of words.
The gold benchmark can be obtained by collecting human judgements directly or by using automated semantic similarity measures.

Human-annotated benchmarks have several limitations:
1. they suffer from word sense ambiguity (faced by the human tester) and subjectivity;
2. the small size and low inter-annotator agreement of existing datasets make it difficult to find significant differences between models;
3. judgement datasets must be constructed for each language and each domain, because word meanings can change considerably across domains.

Moreover, updating and scaling up handcrafted resources such as human similarity scores is costly, time-consuming, and error-prone.

A well-known __intrinsic evaluation method__ is Semantic Relatedness (or Similarity) (see [Baroni et al.](https://aclanthology.org/P14-1023.pdf) and [Schnabel et al.](https://aclanthology.org/D15-1036.pdf)), which evaluates a vector model by the correlation between the cosine similarity of pairs of word vectors and the semantic relatedness (or similarity) assigned to the same pairs in the gold benchmark.
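
The idea can be illustrated with a minimal, dependency-free sketch. The toy two-dimensional vectors, the gold scores, and all function names below are made up for illustration; a real evaluation would use a trained model and an established benchmark.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def average_ranks(values):
    """Rank values from 1 to n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical toy embeddings and gold similarity scores
vectors = {
    "cat": [0.9, 0.1],
    "dog": [0.8, 0.3],
    "car": [0.1, 0.9],
    "truck": [0.2, 0.8],
}
gold = [
    ("cat", "dog", 0.9),
    ("car", "truck", 0.8),
    ("cat", "car", 0.1),
    ("dog", "truck", 0.3),
]

model_scores = [cosine_similarity(vectors[a], vectors[b]) for a, b, _ in gold]
gold_scores = [score for _, _, score in gold]
rho = spearman(gold_scores, model_scores)
print(f"Spearman correlation: {rho:.2f}")
```

The closer the correlation is to 1, the better the ordering of pairs by cosine similarity agrees with the ordering in the gold benchmark.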

## Intrinsic evaluations

Gensim lets you evaluate word embedding models on two of the most widely used intrinsic evaluation tasks: semantic similarity and word analogy.

You can evaluate your models on those tasks with the `KeyedVectors` methods `evaluate_word_pairs(pairs)` and `evaluate_word_analogies(analogies)`, passing the path of the benchmark file you want to use as the first argument.

### How to evaluate word embedding models with taxonomy‐based semantic similarity measures

Evaluation with taxonomy-based semantic similarity measures is another type of intrinsic evaluation. It exploits the information encoded in an existing taxonomy to build a benchmark for the evaluation of word embeddings.

To perform this type of evaluation, you can use the benchmark HSS4570.tsv (available at test/test_data/HSS4570.tsv) with the Gensim function `evaluate_word_pairs()`, which compares the similarities in the dataset with the similarities produced by the model and returns a tuple containing:

* the Pearson correlation coefficient with its 2-tailed p-value;

* the Spearman rank-order correlation coefficient with its 2-tailed p-value;

* the percentage of benchmark pairs that contain out-of-vocabulary words.

In [1]:
from gensim.models import KeyedVectors
path_to_model = 'models/ft_vectors_0_10_50_5.txt'
model = KeyedVectors.load_word2vec_format(path_to_model, binary=False)

In [2]:
# Path to the benchmark file; adjust it to where HSS4570.tsv is stored
similarities = model.evaluate_word_pairs('benchmark/HSS4570.tsv', delimiter='\t')
similarities

((0.42422276164431727, 4.739561944305236e-193),
 SpearmanrResult(correlation=0.4326345056769725, pvalue=1.508555569557493e-201),
 3.063457330415755)
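
To make the three returned elements easier to read, the tuple can be unpacked into named values. The sketch below reuses the numbers from the output above instead of calling a model, and the variable names are illustrative only. The third value is treated as a percentage, which matches my understanding of Gensim's behaviour but is worth checking against the documentation of your installed version.

```python
# Values copied from the output above; with a real model you would use
# result = model.evaluate_word_pairs(...) instead.
result = (
    (0.42422276164431727, 4.739561944305236e-193),
    (0.4326345056769725, 1.508555569557493e-201),
    3.063457330415755,
)

# Each correlation entry is a (coefficient, p-value) pair
(pearson_r, pearson_p), (spearman_r, spearman_p), oov_percent = result

summary = (
    f"Pearson r={pearson_r:.3f}, "
    f"Spearman rho={spearman_r:.3f}, "
    f"OOV pairs={oov_percent:.2f}%"
)
print(summary)
```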

You can also use the `TaxoSS` library to build your own benchmark from a taxonomy of your choice that better suits the topic of your work, and then select the word embedding model that best encodes the taxonomic relationships between the concepts in that taxonomy.

You can compute the semantic similarity in the following way:

1. create a list of pairs of words for comparison, for example:

In [3]:
w1 = ['cat', 'bird', 'mammal']
w2 = ['dog', 'fish', 'vertebrate']
words = [[x, y] for x, y in zip(w1, w2)]
words

[['cat', 'dog'], ['bird', 'fish'], ['mammal', 'vertebrate']]

2. for each pair compute the similarity between the words:

In [4]:
from TaxoSS.functions import semantic_similarity
# Compute the HSS score for each pair of words
hss = [semantic_similarity(w1, w2, 'hss') for w1, w2 in words]
hss

[2.5578320203297156, 1.4267224907237086, 1.4852139609136645]

3.  create your benchmark as a dataframe:

In [5]:
import pandas as pd
benchmark = pd.DataFrame({'word1':[x[0] for x in words], 'word2':[x[1] for x in words], 'hss':hss})
benchmark

|   | word1 | word2 | hss |
|---|-------|-------|-----|
| 0 | cat | dog | 2.557832 |
| 1 | bird | fish | 1.426722 |
| 2 | mammal | vertebrate | 1.485214 |


The function `semantic_similarity(word1, word2, kind, ic)` has these options for the argument `kind`:

* *hss* -> HSS (_default_)
* *wup* -> WUP (Wu-Palmer)
* *lcs* -> LC (Leacock-Chodorow)
* *path_sim* -> Shortest Path
* *resnik* -> Resnik
* *jcn* -> Jiang-Conrath
* *lin* -> Lin
* *seco* -> Seco

You can choose whichever measure you prefer. For more information about them, see https://link.springer.com/article/10.1007/s12559-021-09987-7.

Now you can evaluate your word embedding model using the benchmark you just created:

1. load your word embedding model:

In [6]:
from gensim.models import KeyedVectors
path_to_model = 'models/ft_vectors_0_10_50_5.txt'
model = KeyedVectors.load_word2vec_format(path_to_model, binary=False)

2. use the model to compute the cosine similarity of each pair of words in your benchmark:

In [7]:
cos_sim = [model.similarity(x, y) for x, y in zip(benchmark.word1, benchmark.word2)]
benchmark['cosine_similarity'] = cos_sim
benchmark

|   | word1 | word2 | hss | cosine_similarity |
|---|-------|-------|-----|-------------------|
| 0 | cat | dog | 2.557832 | 0.878396 |
| 1 | bird | fish | 1.426722 | 0.709674 |
| 2 | mammal | vertebrate | 1.485214 | 0.784841 |
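
A model may not contain every word in a custom benchmark, and looking up the similarity of an unknown word typically fails. The following sketch shows one way to skip such pairs. The helper `similarities_skipping_oov` and the toy vocabulary and scores are hypothetical and stand in for a real model.

```python
def similarities_skipping_oov(pairs, in_vocab, similarity):
    """Score only pairs whose two words are both known; collect the rest."""
    scored, skipped = [], []
    for w1, w2 in pairs:
        if in_vocab(w1) and in_vocab(w2):
            scored.append((w1, w2, similarity(w1, w2)))
        else:
            skipped.append((w1, w2))
    return scored, skipped


# Toy stand-ins for a model: a vocabulary and precomputed similarities
toy_vocab = {"cat", "dog", "bird", "fish"}
toy_scores = {("cat", "dog"): 0.88, ("bird", "fish"): 0.71}

scored, skipped = similarities_skipping_oov(
    [("cat", "dog"), ("bird", "fish"), ("mammal", "vertebrate")],
    in_vocab=lambda word: word in toy_vocab,
    similarity=lambda a, b: toy_scores[(a, b)],
)
print(scored)
print(skipped)
```

With a Gensim `KeyedVectors` model, the two callables could be `lambda word: word in model` and `model.similarity`, assuming your Gensim version supports membership tests on the model.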


3. compute the Spearman (or Pearson) correlation between `benchmark['hss']` and `benchmark['cosine_similarity']`:

In [9]:
import scipy.stats
print(scipy.stats.pearsonr(benchmark['hss'], benchmark['cosine_similarity']))
print(scipy.stats.spearmanr(benchmark['hss'], benchmark['cosine_similarity']))

(0.9151906835681539, 0.2640797648114273)
SpearmanrResult(correlation=1.0, pvalue=0.0)
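
As an alternative to the manual steps, the custom benchmark can be written to a tab-separated file with one `word1`, `word2`, `score` row per line, which is the layout `evaluate_word_pairs()` is used with above. The sketch below uses only the standard library. The file name `my_benchmark.tsv` is hypothetical, and the rows repeat the HSS scores computed earlier.

```python
import csv
import os
import tempfile

# Word pairs and HSS scores from the benchmark built above
rows = [
    ("cat", "dog", 2.5578320203297156),
    ("bird", "fish", 1.4267224907237086),
    ("mammal", "vertebrate", 1.4852139609136645),
]

# Hypothetical output location in a fresh temporary directory
tsv_path = os.path.join(tempfile.mkdtemp(), "my_benchmark.tsv")

# Write one tab-separated pair per line, without a header row
with open(tsv_path, "w", newline="", encoding="utf8") as handle:
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerows(rows)

# Read the file back to check its layout
with open(tsv_path, encoding="utf8") as handle:
    lines = handle.read().splitlines()
print(lines[0])
```

The resulting file could then be passed to `model.evaluate_word_pairs(tsv_path, delimiter='\t')`. Note that three pairs are far too few for a meaningful correlation, so a benchmark used for model selection should contain many more pairs.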
