# Evaluating Word (and Concept) Embeddings

In the previous notebooks we have seen how to generate word, knowledge graph and joint (word-concept) embeddings.

We also saw that it is easy to explore the resulting embedding spaces using cosine similarity and selecting the *k-nearest neighbours*.

In this notebook we look further into how (word) embeddings are evaluated. In particular, we look into the following methods:
  - **Visual Exploration**: whereby (a subset of) the embeddings is projected to two or three dimensions and displayed.
  - **Intrinsic Evaluation**: whereby the embeddings are used to perform a token-based task and the results are compared with a gold standard.
    + **Word Prediction**: whereby we look into using a test corpus to evaluate the embeddings by defining a word prediction task.
  - **Extrinsic Evaluation**: whereby a new model is learned (using the embeddings as inputs) to perform a complex task. 
  
KG embeddings tend to be evaluated using **graph completion** tasks, which we will also discuss briefly.

## Recommended papers in this area

[Schnabel, T., Labutov, I., Mimno, D., & Joachims, T. (2015). Evaluation methods for unsupervised word embeddings. In EMNLP (pp. 298–307). Association for Computational Linguistics.](http://anthology.aclweb.org/D/D15/D15-1036.pdf) Provides a good overview of methods and introduces terminology to refer to different types of evaluations.

[Baroni, M., Dinu, G., & Kruszewski, G. (2014). Don’t count, predict! A systematic comparison of context-counting vs. context-predicting semantic vectors. In ACL (pp. 238–247).](http://anthology.aclweb.org/P/P14/P14-1023.pdf) Focuses mostly on *intrinsic* evaluations. Showed that predictive models (like word2vec) produced better results than count models (based on co-occurrence counting).

[Levy, O., Goldberg, Y., & Dagan, I. (2015). Improving Distributional Similarity with Lessons Learned from Word Embeddings. Transactions of the Association for Computational Linguistics, 3(0), 211–225.](https://www.transacl.org/ojs/index.php/tacl/article/view/570) Studies how various implementation or optimization 'details' used in predictive models, which were not needed or used in count models, affect the performance of the resulting embeddings. Examples of such details are negative sampling, dynamic context windows, subsampling and vector normalization. The paper shows that once such details are taken into account, the difference between count and predictive models is not that large.

In [15]:
%cd /content/tutorial
!git pull
%cd /content/

/content/tutorial
remote: Enumerating objects: 9, done.
remote: Counting objects: 100% (9/9), done.
remote: Compressing objects: 100% (1/1), done.
remote: Total 5 (delta 4), reused 5 (delta 4), pack-reused 0
Unpacking objects: 100% (5/5), done.
From https://github.com/HybridNLP2018/tutorial
   c433662..e57213f  master     -> origin/master
Updating c433662..e57213f
Fast-forward
 scripts/swivel/wordsim.py | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
/content


In [1]:
!git clone https://github.com/HybridNLP2018/tutorial.git

fatal: destination path 'tutorial' already exists and is not an empty directory.


## Visual Exploration

Use dimensionality reduction algorithms such as t-SNE and PCA to project (a subset of) the embedding space onto a 2-D or 3-D space, which can then be visualized.

[Embedding Projector](http://projector.tensorflow.org/)

 - Pros:
   - Can give you a sense of whether the model has correctly learned meaningful relations, especially if you have a small number of pre-categorized words.
   - Easy to explore the space
 - Cons:
   - Subjective: neighbourhoods may look good, but are they? There is no gold standard.
   - Works best for a small subset of the embedding space. But who decides which subset?
   - The resulting projection can be deceiving: what looks close in 3-D space can be far apart in 300-D space (and vice versa).
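
As a rough illustration of the projection step, the following sketch implements PCA with a plain NumPy SVD and projects a handful of vectors to 2-D. The word list and the random vectors are hypothetical stand-ins for real embeddings, and `pca_project` is a name introduced here for illustration only.

```python
import numpy as np

def pca_project(vectors, n_components=2):
    """Project row vectors onto their top principal components."""
    X = np.asarray(vectors, dtype=float)
    X_centered = X - X.mean(axis=0)  # PCA works on mean-centered data
    # Rows of Vt are the principal directions, ordered by explained variance
    _, _, Vt = np.linalg.svd(X_centered, full_matrices=False)
    return X_centered @ Vt[:n_components].T

# Hypothetical toy data: 6 words with random 50-D vectors
rng = np.random.default_rng(0)
words = ["cat", "dog", "horse", "car", "bus", "train"]
vectors = rng.normal(size=(len(words), 50))

points_2d = pca_project(vectors, n_components=2)
for word, (x, y) in zip(words, points_2d):
    print(f"{word}: ({x:.2f}, {y:.2f})")
```

The resulting 2-D points can be passed to any plotting library. Because only the first two components are kept, distances in the plot are an approximation of the distances in the original space.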

## Intrinsic Evaluation

**Intrinsic** evaluations are those where you can use embeddings to perform relatively simple, word-related tasks.

Schnabel et al. distinguish between:
 - **Absolute intrinsic**: you have a (human annotated) gold standard for a particular task and use the embeddings to make predictions.
 - **Comparative intrinsic**: you use the embedding space to present predictions to humans, who then rate them. Mostly used when there is no gold standard available.
 
Tasks:
 - **Relatedness**: How well do embeddings capture human-perceived word similarity? Datasets typically consist of triples: two words and a similarity score (e.g. between 0.0 and 1.0). Several datasets are available, although their interpretation of 'word similarity' can vary.
 - **Synonym detection**: Can embeddings select a synonym for a given word and a set of options? Datasets are n-tuples where the first word is the input word and the other `n-1` words are the options. Only one of the options is a synonym.
 - **Analogy**: Do embeddings encode relations between words? Datasets are 4-tuples: the first two words define the relation, the third word is the source of the query and the fourth word is the solution. Good embeddings should predict an embedding close to the solution word.
 - **Categorization**: Can embeddings be clustered into hand-annotated categories? Datasets are word-category pairs. Standard clustering algorithms can then be used to generate k-clusters and the purity of the clusters can be computed.
 - **Selectional preference**: Can embeddings predict whether a noun-verb pair is more likely to represent a verb-subject or a verb-object relation? E.g. people-eat is more likely to be found as a verb-subject.
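
Two of the tasks above, analogy and synonym detection, can be sketched with a few lines of NumPy. This is a minimal sketch under stated assumptions: the analogy is solved with the common vector-offset approach (`b - a + c`), the toy 3-D embeddings are made up for illustration, and the function names `solve_analogy` and `pick_synonym` are hypothetical.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def solve_analogy(emb, a, b, c):
    """Answer 'a is to b as c is to ?' using the vector offset b - a + c."""
    target = emb[b] - emb[a] + emb[c]
    # The query words themselves are excluded from the candidates
    candidates = [w for w in emb if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(emb[w], target))

def pick_synonym(emb, word, options):
    """Select the option whose embedding is closest to the input word."""
    return max(options, key=lambda w: cosine(emb[word], emb[w]))

# Hypothetical toy embeddings (not taken from any trained model)
emb = {
    "king":    np.array([1.0, 1.0, 0.0]),
    "queen":   np.array([1.0, -1.0, 0.0]),
    "man":     np.array([0.0, 1.0, 0.0]),
    "woman":   np.array([0.0, -1.0, 0.0]),
    "apple":   np.array([0.0, 0.0, 1.0]),
    "monarch": np.array([1.0, 0.2, 0.0]),
}

analogy_answer = solve_analogy(emb, "man", "king", "woman")
synonym_answer = pick_synonym(emb, "king", ["monarch", "apple", "woman"])
print(analogy_answer, synonym_answer)
```

In a real evaluation the predicted word is compared with the fourth word of each analogy 4-tuple (or the annotated synonym), and accuracy is reported over the whole dataset.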

### Compute Relatedness Score

Swivel comes with an `eval.mk` script that downloads and unzips various relatedness and analogy datasets. The script also compiles an `analogy` executable. It assumes you have a unix environment and tools such as `wget`, `tar`, `unzip` and `egrep`, as well as `make` and a `c++` compiler.

For convenience, we have included various relatedness datasets as part of this repo in `datasamples/relatedness`. We assume you have generated vectors as part of previous notebooks, which we will test here.

In [0]:
import os

In [5]:
%ls /content/tutorial/datasamples/relatedness/

rarewords.ws.tab  simverb3500.ws.tab  ws353sim.ws.tab
simlex999.ws.tab  ws353rel.ws.tab


In [0]:
%cp /content/umbc/coocs/tlgs_wnscd_5K_ls_f/row_vocab.txt /content/umbc/vec/tlgs_wnscd_5k_ls_f/vocab.txt
umbc_5k_vec = '/content/umbc/vec/tlgs_wnscd_5k_ls_f/'
umbc_full_vec = '/content/umbc/vec/vecsi_tlgs_wnscd_ls_f_6e_160d/'

You can use Swivel's `wordsim.py` to produce metrics for the k-cap embeddings we produced in previous notebooks:

In [16]:
!python /content/tutorial/scripts/swivel/wordsim.py --vocab={umbc_5k_vec}vocab.txt \
  --embeddings={umbc_5k_vec}vecs.bin \
  --word_prefix="lem_" \
  /content/tutorial/datasamples/relatedness/*.ws.tab  

Opening vector with expected size 5632 from file /content/umbc/vec/tlgs_wnscd_5k_ls_f/vocab.txt
vocab size 5632 (unique 5632)
read rows
65 of 2034 pairs found
0.576 /content/tutorial/datasamples/relatedness/rarewords.ws.tab
288 of 999 pairs found
0.066 /content/tutorial/datasamples/relatedness/simlex999.ws.tab
1126 of 3500 pairs found
0.073 /content/tutorial/datasamples/relatedness/simverb3500.ws.tab
92 of 252 pairs found
0.371 /content/tutorial/datasamples/relatedness/ws353rel.ws.tab
57 of 203 pairs found
0.459 /content/tutorial/datasamples/relatedness/ws353sim.ws.tab


In [25]:
%ls {umbc_full_vec}vocab.txt
!python /content/tutorial/scripts/swivel/wordsim.py --vocab=/content/umbc/vec/vecsi_tlgs_wnscd_ls_f_6e_160d/vocab.txt \
  --embeddings={umbc_full_vec}vecs.bin \
  --word_prefix="lem_" \
  /content/tutorial/datasamples/relatedness/*.ws.tab

/content/umbc/vec/vecsi_tlgs_wnscd_ls_f_6e_160d/vocab.txt
Opening vector with expected size 1499136 from file /content/umbc/vec/vecsi_tlgs_wnscd_ls_f_6e_160d/vocab.txt
vocab size 1499136 (unique 1499125)
read rows
1433 of 2034 pairs found
0.401 /content/tutorial/datasamples/relatedness/rarewords.ws.tab
999 of 999 pairs found
0.276 /content/tutorial/datasamples/relatedness/simlex999.ws.tab
3494 of 3500 pairs found
0.191 /content/tutorial/datasamples/relatedness/simverb3500.ws.tab
250 of 252 pairs found
0.529 /content/tutorial/datasamples/relatedness/ws353rel.ws.tab
202 of 203 pairs found
0.649 /content/tutorial/datasamples/relatedness/ws353sim.ws.tab


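The numbers show that the 5K-vocabulary embeddings cover only a small part of the evaluation datasets (for example, 288 of the 999 pairs in `simlex999`), whereas the full-vocabulary embeddings cover nearly all pairs in every dataset except `rarewords` (1433 of 2034). The correlation scores range from 0.066 to 0.576 for the 5K embeddings and from 0.191 to 0.649 for the full embeddings. Scores obtained with low coverage, such as the 0.576 computed over only 65 `rarewords` pairs, are not comparable with scores computed over a whole dataset. On `simlex999` and `simverb3500` both embedding spaces score below 0.3, which is very poor, but expected given the size of the corpus.

For comparison, state-of-the-art results are in the range of 0.65 to 0.8.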
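
To make the two reported quantities concrete (pairs found and correlation), here is a minimal sketch of a relatedness evaluation. It assumes the score is a Spearman rank correlation between the cosine similarities and the human scores; it does not reproduce the internals of `wordsim.py`. The toy embeddings, the word pairs and the names `average_ranks` and `relatedness_score` are hypothetical.

```python
import numpy as np

def average_ranks(values):
    """Ranks starting at 1; tied values share the mean of their ranks."""
    values = np.asarray(values, dtype=float)
    order = np.argsort(values, kind="mergesort")
    ranks = np.empty(len(values), dtype=float)
    ranks[order] = np.arange(1, len(values) + 1)
    for v in np.unique(values):
        tied = values == v
        ranks[tied] = ranks[tied].mean()
    return ranks

def relatedness_score(emb, pairs):
    """Return (Spearman correlation, pairs found, total pairs)."""
    gold, predicted = [], []
    for w1, w2, score in pairs:
        if w1 in emb and w2 in emb:  # out-of-vocabulary pairs reduce coverage
            u, v = emb[w1], emb[w2]
            predicted.append(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))
            gold.append(score)
    # Spearman correlation = Pearson correlation of the ranks
    rho = np.corrcoef(average_ranks(gold), average_ranks(predicted))[0, 1]
    return float(rho), len(gold), len(pairs)

# Hypothetical toy embeddings and human-annotated pairs
emb = {
    "cat": np.array([1.0, 0.0]),
    "dog": np.array([0.9, 0.3]),
    "car": np.array([0.0, 1.0]),
    "bus": np.array([0.1, 0.9]),
}
pairs = [
    ("cat", "dog", 8.0),
    ("car", "bus", 9.0),
    ("cat", "car", 1.0),
    ("dog", "bus", 2.0),
    ("cat", "unicorn", 5.0),  # 'unicorn' is not in the vocabulary
]

rho, found, total = relatedness_score(emb, pairs)
print(f"{found} of {total} pairs found")
print(f"{rho:.3f}")
```

Reporting the number of pairs found next to the correlation matters because a correlation computed over a small fraction of a dataset says little about the embedding space as a whole.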


### Conclusion for Intrinsic Evaluation

Intrinsic evaluations are the most direct way of evaluating (word) embeddings.

Pros:
 - they provide a single objective metric that enables easy comparison between different embeddings
 - there are several readily available evaluation datasets (for English)
 - if you have an existing, manually crafted, knowledge graph, you can generate your own evaluation datasets
 
Cons:
 - evaluation datasets are small and can be biased in terms of word selection and annotation
 - you need to take coverage into account (besides final metric)
 - existing datasets mostly cover English words (there are few datasets in other languages, and few that include compound words or concepts)
 - tasks are low level and thus somewhat artificial: people care about document classification, but not about word categories or word similarities.

## Word Prediction (plots)

This can be seen as a task for intrinsic evaluation. However, the task is very close to the original training task used to derive the embeddings in the first place.

Recall that *predictive models* (such as `word2vec`) try to minimize the distance between a word embedding and the embeddings of its context words, over a whole corpus.

![word2vec diagrams](https://github.com/hybridNLP2018/tutorial/blob/master/images/word2vec_diagrams.png?raw=1)

This means that, if we have a **test corpus**, we can use the embeddings to try to predict words based on their contexts. Assuming the test corpus and the training corpus contain similar language we should expect better embeddings to produce better predictions on average.

A major advantage of this approach is that we do not need human annotation. Also, we can reuse the tokenization pipeline used for training to produce similar tokens as those in our embedding space. E.g. we can use word-sense-disambiguation to generate a test corpus including lemmas and concepts.

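The algorithm can be sketched as follows, where `embedding`, `predict_embedding` and `cosine_similarity` are helper functions supplied by the caller:

``` python
from collections import defaultdict

def word_prediction_score(corpus, embedding, predict_embedding, cosine_similarity):
  similarities = defaultdict(list)  # focus word -> similarities over its windows
  for window in corpus:
    focus_word, context_words = window
    focus_vector = embedding(focus_word)
    # vector predicted for the focus word from its context words
    context_vector = predict_embedding(context_words, focus_word)
    similarities[focus_word].append(cosine_similarity(focus_vector, context_vector))
  # average over every window in the test corpus
  all_sims = [s for sims in similarities.values() for s in sims]
  return sum(all_sims) / len(all_sims)
```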

The result is a single number that tells you how far the prediction embedding was from the actual word embedding over the whole test corpus. When using cosine similarity this should be a number between -1 and 1.
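
The helpers used by the algorithm above are not defined in this notebook. The following is one possible sketch, under the assumption that the prediction for a focus word is simply the mean of its context vectors (other weightings are possible). The `windows` generator, the toy sentence and the random vectors are hypothetical and only serve to make the example runnable.

```python
import numpy as np

def cosine_similarity(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def make_helpers(vectors):
    """vectors: dict mapping each token to a 1-D numpy array."""
    def embedding(word):
        return vectors[word]

    def predict_embedding(context_words, focus_word):
        # Assumption: predict the focus vector as the plain mean of the
        # context vectors; focus_word is unused in this simple variant.
        return np.mean([vectors[w] for w in context_words], axis=0)

    return embedding, predict_embedding

def windows(tokens, win_size=2):
    """Yield (focus_word, context_words) pairs from a list of tokens."""
    for i, focus in enumerate(tokens):
        context = tokens[max(0, i - win_size):i] + tokens[i + 1:i + 1 + win_size]
        if context:
            yield focus, context

# Hypothetical test corpus with random (untrained) embeddings
rng = np.random.default_rng(0)
tokens = "the cat sat on the mat".split()
vectors = {t: rng.normal(size=8) for t in sorted(set(tokens))}

embedding, predict_embedding = make_helpers(vectors)
sims = [cosine_similarity(embedding(f), predict_embedding(c, f))
        for f, c in windows(tokens)]
score = sum(sims) / len(sims)
print(f"average cosine similarity: {score:.3f}")
```

With trained embeddings, out-of-vocabulary tokens would also need to be skipped or mapped to a fallback vector before calling these helpers.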

#### Word prediction plots

We can also use the intermediate `similarities` dictionary to plot diagrams which can provide further insight. For example, random embeddings result in the following plot:

![Word prediction plot for random embeddings](https://github.com/hybridNLP2018/tutorial/blob/master/images/Avg_cosine_similarities_for_random_words_at_different_winSizes_recentered.PNG?raw=1)

The horizontal axis is the rank of the `focus_word`, with words sorted by their frequency in the training corpus. For example, frequent words such as 'be' and 'the' would be close to the origin, while infrequent words would be towards the end of the axis.

The plot shows that, when words have random embeddings, the cosine similarity between the prediction for each word and the word's own embedding is close to 0 on average.
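
The data behind such a plot can be derived from the `similarities` dictionary and the token frequencies of the training corpus. The sketch below is an assumption about how this could be done, not the code used to produce the figures; the function name `similarity_by_frequency_rank` and the toy inputs are hypothetical.

```python
from collections import Counter

def similarity_by_frequency_rank(similarities, training_tokens):
    """Return (ranks, averages): rank 1 is the most frequent training word."""
    freq = Counter(training_tokens)
    rank_of = {w: r for r, (w, _) in enumerate(freq.most_common(), start=1)}
    points = sorted(
        (rank_of[w], sum(sims) / len(sims))
        for w, sims in similarities.items()
        if w in rank_of and sims  # skip words unseen in training
    )
    ranks = [r for r, _ in points]
    averages = [a for _, a in points]
    return ranks, averages

# Hypothetical inputs: a tiny training corpus and per-word similarities
training_tokens = "the cat and the dog and the bird".split()
similarities = {"the": [0.2, 0.4], "cat": [0.6], "bird": [0.1, 0.3]}

ranks, averages = similarity_by_frequency_rank(similarities, training_tokens)
print(ranks)
```

The two lists can then be drawn with, for example, matplotlib's `plt.plot(ranks, averages)`, giving frequency rank on the horizontal axis and average cosine similarity on the vertical axis.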

These plots can be useful for detecting implementation bugs. For example, when we were implementing the `CogitoPrep` utility for counting co-occurrences for lemmas and concepts, we generated the following plot:

![Buggy embeddings](https://github.com/hybridNLP2018/tutorial/blob/master/images/correlationbug-avg_token_cosine_similarity_skipgram_10.PNG?raw=1)

This showed that we were learning to predict frequent words and some non-frequent words, but that we were not learning most non-frequent words correctly.

After fixing the bug, we got the following plot:

![uncentered](https://github.com/hybridNLP2018/tutorial/blob/master/images/uncentered-avg_token_cosine_similarity_skipgram_4.PNG?raw=1)

This shows that now we were able to learn embeddings that improved word prediction across the whole vocabulary. But it also showed that prediction for the most frequent words lagged behind more uncommon words.

After applying some vector normalization techniques to Swivel and re-centering the vectors (we noticed that the centroid of all the vocabulary embeddings was not the origin), we got:

![recentered](https://github.com/hybridNLP2018/tutorial/blob/master/images/recentered-es10k-avg_token_cosine_similarity_average_rowcol__skipgram__harmonic__5.PNG?raw=1)

This shows better overall prediction.
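
The re-centering step mentioned above can be sketched as follows. The notebook does not state which normalization techniques were applied, so the L2 normalization shown here is an assumption, and the function names and random vectors are hypothetical.

```python
import numpy as np

def recenter(vectors):
    """Shift all vectors so that their centroid lies at the origin."""
    X = np.asarray(vectors, dtype=float)
    return X - X.mean(axis=0)

def l2_normalize(vectors):
    """Scale every row to unit length (all-zero rows are left unchanged)."""
    X = np.asarray(vectors, dtype=float)
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    return X / np.where(norms == 0, 1.0, norms)

# Hypothetical embeddings whose centroid is clearly not the origin
rng = np.random.default_rng(0)
X = rng.normal(loc=0.5, scale=1.0, size=(100, 16))

X_centered = recenter(X)
X_unit = l2_normalize(X_centered)
print("centroid norm before:", float(np.linalg.norm(X.mean(axis=0))))
print("centroid norm after: ", float(np.linalg.norm(X_centered.mean(axis=0))))
```

Note that normalizing after re-centering moves the centroid slightly away from the origin again, so the order of the two steps is a design choice.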

### Conclusion for Word Prediction

Pros:
 - provides a single objective metric
 - does not require human annotation (although it may require pre-processing of the test corpus)
 - allows re-using the tokenization steps used during embedding creation
 - can be used to generate plots, which can provide insights about implementation or representation issues 
 
 
Cons:
 - there are no standard test corpora
 - can be slow to generate the metric for a large test corpus. We recommend balancing the size of the test corpus to maximise the vocabulary coverage, while minimising the time required to process the corpus.

## Extrinsic Evaluation

In Extrinsic Evaluations, we have a more complex task we are interested in (e.g. text classification, text translation, image captioning), whereby we can use embeddings as a way to represent words (or tokens). Assuming we have:
 - a model architecture and 
 - a corpus for training and evaluation (for which the embeddings provide adequate coverage), 
 
we can then train the model using different embeddings and evaluate its overall performance. The idea is that better embeddings will make it easier for the model to learn the overall task.
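
The comparison described above can be illustrated with a deliberately small example. This is a sketch under stated assumptions: the "model architecture" is a nearest-centroid classifier over averaged word vectors, and the two embedding tables, the tiny labelled corpus and the function names are all hypothetical. A real extrinsic evaluation would use a proper model and a much larger corpus.

```python
import numpy as np

def doc_vector(tokens, emb, dim):
    """Represent a document as the mean of its known word vectors."""
    vecs = [emb[t] for t in tokens if t in emb]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

def extrinsic_accuracy(emb, dim, train, test):
    """Fit a nearest-centroid classifier on train, return accuracy on test."""
    labels = sorted({label for _, label in train})
    centroids = {
        label: np.mean(
            [doc_vector(toks, emb, dim) for toks, l in train if l == label], axis=0
        )
        for label in labels
    }
    correct = 0
    for toks, label in test:
        v = doc_vector(toks, emb, dim)
        predicted = min(labels, key=lambda l: np.linalg.norm(v - centroids[l]))
        correct += int(predicted == label)
    return correct / len(test)

# Hypothetical labelled corpus
train = [(["cat", "dog"], "animal"), (["car", "bus"], "vehicle")]
test = [(["dog"], "animal"), (["bus", "car"], "vehicle"), (["cat"], "animal")]

# Two hypothetical embedding tables: one informative, one uninformative
good_emb = {
    "cat": np.array([1.0, 0.0]), "dog": np.array([0.9, 0.1]),
    "car": np.array([0.0, 1.0]), "bus": np.array([0.1, 0.9]),
}
flat_emb = {w: np.array([1.0, 1.0]) for w in good_emb}

good_acc = extrinsic_accuracy(good_emb, 2, train, test)
flat_acc = extrinsic_accuracy(flat_emb, 2, train, test)
print(f"informative embeddings:   {good_acc:.2f}")
print(f"uninformative embeddings: {flat_acc:.2f}")
```

The model and the data are held fixed, so any difference in accuracy can be attributed to the embeddings.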
