In [None]:
import sys

sys.path.append("code")
import nbutils

We first tell the notebook logic whether we have a full bokeh server. This *is* the case for local jupyter installations, but is *not* the case for notebooks running on mybinder - in the latter case we have some limits on interactivity.

In [None]:
import os

nbutils.initialize(has_bokeh_server="auto")

# Overview

In this notebook we will set forth the full stack of decisions that need to be taken in order to compute multi-token (i.e. sentence or document) similarities from embeddings. Not only will we work with different embeddings but also with different ways of leveraging these embeddings to compute phrase similarities. While our examples are following single paths of computation, we contextualize our decisions and allow interactive readers to take different decisions in terms of embeddings, algorithms and parameters.

We will experiment with four classes of embeddings (see the diagram below for a classification):

* Static token embeddings: these operate on the token level. We experiment with GloVe (Pennington et al., 2014), fastText (Mikolov et al., 2017) and Numberbatch (Speer et al., 2018). We use these three to compute token similarity and combine this with alignment algorithms (such as Waterman-Smith-Beyer) to compute document similarity. We also investigate the effect of stacking two static embeddings (fastText and Numberbatch).
* Contextual token embeddings: these also operate on the token level, but the embedding of a token changes according to the context of each specific token instance. In this notebook we experiment with such token embeddings from a Sentence-BERT model.
* Document embeddings derived from specially trained models. Document embeddings represent one document via one single embedding. We use document embeddings obtained from a BERT model. More specifically, we use a Sentence-BERT model trained for the semantic textual similarity (STS) task (Reimers and Gurevych, 2019).
* Document embeddings derived from token embeddings. We also experiment with averaging different kinds of token embeddings (static and contextual) to derive document embeddings.

![Different kinds of embeddings](miscellaneous/diagram_embeddings.svg)

For reasons of limited RAM and download times, we use small or compressed versions of the static embeddings we work with. For GloVe, we use the official 50-dimensional version of the 6B variant. For fastText, we use a version that was compressed using the standard settings in https://github.com/avidale/compress-fasttext. For Numberbatch, we use a 50-dimensional version that was reduced using a standard PCA.
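
The PCA reduction mentioned for Numberbatch can be sketched in a few lines of NumPy. This is only an illustration under assumptions: the random matrix stands in for real embedding vectors, the input dimensionality of 300 is chosen for the example, and `pca_reduce` is a hypothetical helper, not the script that produced the files used here.

```python
import numpy as np


def pca_reduce(vectors, n_components):
    """Project row vectors onto their top principal components."""
    centered = vectors - vectors.mean(axis=0)
    # Rows of vt are the principal axes, ordered by decreasing variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T


rng = np.random.default_rng(0)
full = rng.normal(size=(1000, 300))  # stand-in for higher-dimensional word vectors
reduced = pca_reduce(full, 50)
print(reduced.shape)  # (1000, 50)
```

The trade-off is the usual one: fewer dimensions mean less memory and faster similarity computations, at the cost of discarding the variance carried by the dropped components.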

# Technical Setup

In [None]:
import bokeh.io
bokeh.io.output_notebook()

# Choosing Static Word Embeddings

First we need:

* a set of documents to search over (i.e. our corpus)
* a set of word embeddings to employ for these searches
    
For the latter, we turn to Vectorian's embedding zoo, which offers a number of pretrained word embeddings.

In [None]:
from vectorian.embeddings import Zoo

#Zoo.list()

Let's load the static embeddings as described above from Vectorian's model zoo.

In [None]:
from vectorian.embeddings import StackedEmbedding

emb_glove = Zoo.load('glove-6B-50')
emb_numberbatch = Zoo.load('numberbatch-19.08-en-50')
emb_fasttext = Zoo.load('fasttext-en-mini')
emb_fasttext_numberbatch = StackedEmbedding([emb_fasttext, emb_numberbatch])
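
To give an intuition for what stacking does to similarities, here is a minimal sketch. It assumes that stacking means concatenating the unit-normalized vectors of both embeddings for each token; the internals of `StackedEmbedding` are not shown in this notebook, so treat `stack` and the toy vectors as hypothetical.

```python
import numpy as np


def stack(vec_a, vec_b):
    """Concatenate two unit-normalized token vectors into one."""
    a = vec_a / np.linalg.norm(vec_a)
    b = vec_b / np.linalg.norm(vec_b)
    return np.concatenate([a, b])


def cosine(u, v):
    """Cosine similarity of two vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))


# Hypothetical toy vectors for two tokens under two embeddings.
hot_a, cold_a = np.array([1.0, 0.0]), np.array([0.0, 1.0])  # embedding A: cosine 0
hot_b, cold_b = np.array([1.0, 1.0]), np.array([1.0, 1.0])  # embedding B: cosine 1

stacked_sim = cosine(stack(hot_a, hot_b), stack(cold_a, cold_b))
print(stacked_sim)
```

Under this assumption, the cosine similarity of the stacked vectors is the mean of the two individual cosine similarities (0.5 in the toy case), so stacking lets both embeddings contribute equally to the token similarity.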

We also instantiate an NLP parser based on Sentence-BERT and a shim to use this model's token embeddings in the Vectorian.

In [None]:
import importlib
importlib.reload(nbutils)

nlp = nbutils.make_nlp()

from vectorian.embeddings import SentenceBertEmbedding
emb_sbert = SentenceBertEmbedding(nlp)

# Loading Documents

First we load our gold standard, which contains our queries.

In [None]:
import json

with open("data/raw_data/gold.json", "r") as f:
    gold = nbutils.Gold(json.load(f))

In [None]:
gold.phrases[:5]

In [None]:
gold.matches('to be or not to be')[:1]

We are now ready to build a Vectorian session that contains our documents and embeddings. We use preprocessed corpus data. For details on how this was achieved, see `code/prepare_corpus.ipynb`.

In [None]:
from vectorian.session import LabSession
from vectorian.corpus import Corpus

session = LabSession(
    Corpus.load("data/processed_data/corpus"),
    embeddings=[
        emb_sbert,
        emb_glove,
        emb_numberbatch,
        emb_fasttext,
        emb_fasttext_numberbatch],
    normalizers="default")

Let's take a look at the gold standard we imported and whose documents now live inside `session`. We have 20 queries (blue circles), and for each query we have a number of documents (green circles) that we regard as correct matches to this query. All documents that are not attached to a query are, by definition, not correct matches for it.

Note for interactive users: you can hover your mouse over the nodes to see their content.

In [None]:
nbutils.plot_gold(gold)

# What are Word Embeddings?

We now turn to single word embeddings.

In [None]:
session.word_vec(emb_glove, "hot")

In [None]:
from vectorian.metrics import TokenSimilarity, CosineSimilarity

token_sim = TokenSimilarity(
    emb_numberbatch,
    CosineSimilarity()
)

session.similarity(token_sim, "hot", "cold")

In [None]:
token_sim = TokenSimilarity(
    emb_glove,
    CosineSimilarity())

session.similarity(token_sim, "hot", "cold")

In [None]:
token_sim = TokenSimilarity(
    emb_sbert,
    CosineSimilarity())

a = list(session.documents[0].spans(session.partition("document")))[0][3]
b = list(session.documents[3].spans(session.partition("document")))[0][2]
session.similarity(token_sim, a, b)

# Exploring Word Embeddings (and our data)

In [None]:
nbutils.browse(gold, "rest is silence", "Fig for Fortune")

This is an example where similarity, and therefore embeddings, won't help us much. The syntactic structure is mirrored, and "silence" is replaced with "all but wind". Even if we focus on nouns only, "silence" and "wind" are not generally similar. Still, an embedding approach should be able to recognize that the words at the beginning of the phrase are exact matches.

If we inspect the cosine similarity of the token "silence" with other tokens in the context under three of our embeddings, we see that there is more connection between "silence" and "wind" than we expected. Still, the absolute value of 0.3 for numberbatch is low. Interestingly, glove associates "silence" with "action", i.e. an opposite. The phenomenon that embeddings sometimes cluster opposites is a common observation and can be an issue when wanting to differentiate between these.

In [None]:
nbutils.plot_token_similarity(session, nlp, gold, "silence", "Fig for Fortune", n_figures=3)

In [None]:
nbutils.browse(gold, "sea of troubles", "Book of Common Prayer")

This is a different example, where similarity computation might help. Here, "sea" is replaced by "waves", and "troubles" by "troublesome". We should expect embeddings to give reasonable results on this instance.

Indeed, by inspecting the cosine similarity of the token "sea" with other tokens in the context, we see that this is true.

In [None]:
nbutils.plot_token_similarity(session, nlp, gold, "sea", "The Book of Common Prayer", n_figures=3)

In [None]:
nbutils.plot_token_similarity(session, nlp, gold, "troubles", "The Book of Common Prayer", n_figures=3)

Note how out-of-vocabulary words like "troublesomest" will produce zero similarities under standard key-value embeddings, whereas fastText is still able to produce a vector thanks to subword information.

In [None]:
nbutils.plot_token_similarity(session, nlp, gold, "troublesomest", "The Book of Common Prayer", n_figures=3)
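
The difference between key-value lookup and subword composition for out-of-vocabulary words can be illustrated with a toy sketch. Everything here is an assumption made for illustration: the random bucket vectors stand in for trained n-gram vectors, the n-gram range and the CRC32 hash are arbitrary choices, and the real fastText model differs in these details.

```python
import zlib

import numpy as np

DIM, BUCKETS = 8, 1000
rng = np.random.default_rng(0)
bucket_vectors = rng.normal(size=(BUCKETS, DIM))  # toy stand-in for n-gram vectors
word_vectors = {"troublesome": rng.normal(size=DIM)}  # toy key-value vocabulary


def char_ngrams(word, n_min=3, n_max=5):
    """Character n-grams of the word wrapped in boundary markers."""
    marked = f"<{word}>"
    return [marked[i:i + n]
            for n in range(n_min, n_max + 1)
            for i in range(len(marked) - n + 1)]


def key_value_vector(word):
    """Plain lookup: unknown words get a zero vector."""
    return word_vectors.get(word, np.zeros(DIM))


def subword_vector(word):
    """Average the vectors of all hashed character n-grams of the word."""
    rows = [zlib.crc32(g.encode("utf-8")) % BUCKETS for g in char_ngrams(word)]
    return bucket_vectors[rows].mean(axis=0)


oov = "troublesomest"
shared = set(char_ngrams("troublesome")) & set(char_ngrams(oov))
print(np.linalg.norm(key_value_vector(oov)))  # zero vector, hence zero similarity
print(np.linalg.norm(subword_vector(oov)))    # non-zero vector built from n-grams
print(len(shared), "n-grams shared with 'troublesome'")
```

Because "troublesomest" shares most of its character n-grams with "troublesome", a subword model places the two words close together even though only one of them is in the vocabulary.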

# Exploring Document Embeddings

Before we turn to alignment strategies to match sentences token by token, we first look at representing each document with one embedding in order to gather an understanding of how different embedding strategies relate to the nearness of documents. We will later turn to individual token embeddings.

We first prepare additional sentence embeddings using SBERT that we will show in our first big visualization.

In [None]:
from vectorian.embeddings import CachedPartitionEncoder, SpanEncoder

sbert_encoder = CachedPartitionEncoder(SpanEncoder(
    lambda texts: [nlp(t).vector for t in texts]))

sbert_encoder.try_load("data/processed_data/doc_embeddings")
sbert_encoder.cache(session.documents, session.partition("document"))
sbert_encoder.save("data/processed_data/doc_embeddings")

sbert_encoder_name = nlp.meta["name"]

Now we construct an Explorer instance. In addition to providing the SBERT encoder we just built, we configure the Explorer to use averaging to build document embeddings from token embeddings.

Note to interactive readers: you can change the "mean" (averaging) method to other methods for computing document embeddings as well.

In [None]:
doc_embedding_explorer = nbutils.DocEmbeddingExplorer(
    session=session, nlp=nlp, gold=gold, extra_encoders={sbert_encoder_name: sbert_encoder})
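
The "mean" method mentioned above amounts to averaging the token vectors of a document. The following is a minimal sketch, not the Explorer's actual code: `mean_doc_embedding` is a hypothetical helper, and skipping all-zero (out-of-vocabulary) vectors is a design choice made here for illustration.

```python
import numpy as np


def mean_doc_embedding(token_vectors):
    """Average token vectors, skipping all-zero (out-of-vocabulary) rows."""
    vecs = np.asarray(token_vectors, dtype=float)
    known = vecs[np.any(vecs != 0, axis=1)]
    if len(known) == 0:
        # No known token at all: fall back to a zero document vector.
        return np.zeros(vecs.shape[1])
    return known.mean(axis=0)


# Toy document with three tokens; the last one is out of vocabulary.
doc = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
doc_vector = mean_doc_embedding(doc)
print(doc_vector)  # [0.5 0.5]
```

Averaging discards word order entirely, which is one reason why documents that merely share prominent words can end up close together.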

In [None]:
doc_embedding_explorer.plot([
    {"encoder": "paraphrase_distilroberta", "locator": ("fixed", "carry coals"), 'has_tok_emb': False},
    {"encoder": "paraphrase_distilroberta", "locator": ("fixed", "an old man is twice"), 'has_tok_emb': False}
]);

In the t-SNE visualization above, dots are documents and the color of each dot indicates the query that yields that document in our gold standard. By hovering over dots with the mouse you get details on the document and query the dot represents. Nearby dots of the same color indicate that the embedding tends to cluster documents in a way similar to our gold standard.

On the left above, we see that the phrase "we will not carry coals" (large green-yellow circle with cross) is located close to the documents associated with that query (smaller green-yellow circles). Similarly, on the right we see that the phrase "an old man is twice a child" clusters with the actual (green) documents we associate with it in our gold standard.

For these phrases and documents, the `paraphrase_distilroberta` model does a good job of producing a document embedding that actually separates inherent topics (without us telling it to do it).

In [None]:
doc_embedding_explorer.plot([
    {"encoder": "numberbatch", "selection": [
        'ww_32c26a7909c83bda',
        'ww_b5b8083a6a1282bc',
        'ww_9a6cb20b0b157545',
        'ww_a6f4b0e3428ad510',
        'ww_8e68a517bc3ecceb']}
]);

In the plot above we look at the document embedding produced by a **token-based** embedding. This has the advantage that we can actually look at token embeddings that make up the document embedding (through averaging). On the right side, we see a TSNE plot of all token embeddings that occur in the documents that are selected on the left. The hope is that this visualization will give us a clue why the documents on the left might be clustered.

The red circles on the left represent contexts that match the phrase "a horse, a horse, my kingdom for a horse". If we look at the token embeddings (which include documents from other classes), we indeed see that a grouping happens due to word embeddings clustering around "horse" (right side), but we also see a cluster around "boat", "sail" and "river" on the left.

In fact, context 1 contains "muscle boat", context 2 contains "To swim the river villain", and context 3 contains "A boat, a boat". We thereby see that this kind of unsupervised document clustering groups items by inherent qualities that might not actually match our query criteria.

Interactive note: you can compute different token embedding plots by selecting different documents with the mouse (drag the mouse to lasso).

# Understanding Alignments (WSB vs WMD)

## A Search Query using Alignment over Similar Tokens

In [None]:
def make_index_builder(**kwargs):
    return nbutils.InteractiveIndexBuilder(session, nlp, partition_encoders={
        sbert_encoder_name: sbert_encoder
    }, **kwargs)

In [None]:
index_builder = make_index_builder()

What you see above is the description of a search strategy that we will employ in the following sections of this notebook. Interactive readers can switch to the "Edit" part and actually explore the setting in more detail and even change it to something completely different.

In [None]:
gold.phrases[0]

In [None]:
index_builder.build_index().find(gold.phrases[0], n=1)

## Plotting the NDCG over the Corpus

We first define a strategy for searching the corpus. In the summary below you will find the strategy used for the non-interactive version of this text. In the interactive version, you can click on "Edit" and change these settings and rerun the following sections of the notebook accordingly.

In [None]:
index_builder_a = make_index_builder()

In [None]:
import vectorian.alignment

index_builder_b = make_index_builder(
    strategy="Alignment",
    strategy_options={"alignment": vectorian.alignment.WordMoversDistance.wmd("kusner")})

In [None]:
index_builder_c = make_index_builder(
    strategy="Alignment",
    strategy_options={"alignment": vectorian.alignment.WordMoversDistance.wmd("vectorian")})

In [None]:
index_builder_d = make_index_builder(
    strategy="Partition Embedding")

We now get an overview of the quality of the results obtained with the indexes configured above by computing the NDCG over all queries in our gold standard with regard to the known optimal results.

In [None]:
nbutils.plot_ndcgs(gold, {
    "wsb": index_builder_a.build_index(),
    "wmd (kus.)": index_builder_b.build_index(),
    "wmd (vec.)": index_builder_c.build_index(),
    "doc emb.": index_builder_d.build_index()
})
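
For readers unfamiliar with NDCG, here is a minimal sketch of how it can be computed for one query. It assumes binary relevance (a result either is a gold match or is not); whether `nbutils.plot_ndcgs` uses exactly this gain and discount is an assumption, and the document ids are made up.

```python
import math


def dcg(relevances):
    """Discounted cumulative gain of a ranked list of relevance values."""
    return sum(rel / math.log2(rank + 1)
               for rank, rel in enumerate(relevances, start=1))


def ndcg(ranked_ids, relevant_ids):
    """NDCG with binary relevance: 1 if the result is a gold match, else 0."""
    gains = [1.0 if doc_id in relevant_ids else 0.0 for doc_id in ranked_ids]
    # The ideal ranking lists all relevant documents first.
    ideal = [1.0] * min(len(relevant_ids), len(ranked_ids))
    best = dcg(ideal)
    return dcg(gains) / best if best > 0 else 0.0


relevant = {"d1", "d2"}
perfect = ndcg(["d1", "d2", "d3", "d4"], relevant)
mixed = ndcg(["d3", "d1", "d4", "d2"], relevant)
print(perfect, round(mixed, 3))
```

A score of 1.0 (100%) means that all gold matches occupy the top ranks; the logarithmic discount penalizes relevant results more the further down the list they appear.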

We see that some queries obtain 100%, i.e. the top results match the optimal ones given in our gold standard. Waterman-Smith-Beyer (WSB) tends to perform a tad better than Word Mover's Distance (WMD), with the exception of "though this be madness...", where WMD outperforms WSB. In general, the Vectorian modification of WMD, which does not use nbow, performs better than Kusner's original description of WMD. The one exception here is "livers white as milk".
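
To make the role of nbow (normalized bag-of-words) weights more concrete, here is a sketch of the relaxed WMD, a simple lower bound on WMD in which every word sends all of its mass to its nearest counterpart. This is deliberately not the full transport problem (which needs a linear programming solver) and not the Vectorian's implementation; the token vectors are hypothetical toy values.

```python
import numpy as np


def nbow(tokens):
    """Normalized bag-of-words: unique tokens and their relative frequencies."""
    vocab = sorted(set(tokens))
    counts = np.array([tokens.count(t) for t in vocab], dtype=float)
    return vocab, counts / counts.sum()


def relaxed_wmd(tokens_a, tokens_b, vectors):
    """Lower bound on WMD: each word moves its whole mass to the nearest word."""
    vocab_a, w_a = nbow(tokens_a)
    vocab_b, w_b = nbow(tokens_b)
    xa = np.array([vectors[t] for t in vocab_a])
    xb = np.array([vectors[t] for t in vocab_b])
    # Pairwise Euclidean distances between the word vectors of both documents.
    dist = np.linalg.norm(xa[:, None, :] - xb[None, :, :], axis=2)
    a_to_b = float(w_a @ dist.min(axis=1))
    b_to_a = float(w_b @ dist.min(axis=0))
    return max(a_to_b, b_to_a)


toy_vectors = {
    "sea": np.array([0.0, 0.0]),
    "waves": np.array([0.0, 1.0]),
    "troubles": np.array([3.0, 0.0]),
    "troublesome": np.array([3.0, 1.0]),
}
distance = relaxed_wmd(["sea", "troubles"], ["waves", "troublesome"], toy_vectors)
print(distance)  # 1.0
```

Note that nothing prevents several words from sending their mass to the same target word, so the resulting flow is in general not a one-to-one mapping between tokens.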

One advantage of WSB over the full WMD variants is its easy interpretability. A WSB alignment can be understood as a bijective mapping between tokens, namely between a subset of the query and a subset of the document. For WMD, this assumption of bijection often breaks down. We use this property of WSB in the following section to illustrate which mappings actually occur.
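
The one-to-one character of such an alignment can be seen in a small sketch of a local alignment with arbitrary gap costs in the style of Waterman-Smith-Beyer. The similarity matrix, the linear gap cost and the function name `wsb_align` are assumptions for illustration; the Vectorian's own implementation and default parameters may differ.

```python
import numpy as np


def wsb_align(sim, gap_cost=lambda k: 0.5 * k):
    """Local alignment with arbitrary gap costs (Waterman-Smith-Beyer style).

    sim[i, j] is the similarity of query token i and document token j.
    Returns the best score and the matched (query, document) index pairs.
    """
    n, m = sim.shape
    score = np.zeros((n + 1, m + 1))
    back = {}  # cell -> predecessor cell on the best path
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Option 1: match query token i-1 with document token j-1.
            best, prev = score[i - 1, j - 1] + sim[i - 1, j - 1], (i - 1, j - 1)
            # Option 2: skip k query tokens (a gap of length k).
            for k in range(1, i + 1):
                cand = score[i - k, j] - gap_cost(k)
                if cand > best:
                    best, prev = cand, (i - k, j)
            # Option 3: skip k document tokens.
            for k in range(1, j + 1):
                cand = score[i, j - k] - gap_cost(k)
                if cand > best:
                    best, prev = cand, (i, j - k)
            # Local alignment: scores never drop below zero.
            if best > 0:
                score[i, j] = best
                back[(i, j)] = prev
    # Trace back from the best cell until the path reaches a zero score.
    cell = tuple(int(x) for x in np.unravel_index(np.argmax(score), score.shape))
    best_score = float(score[cell])
    pairs = []
    while cell in back:
        prev = back[cell]
        if prev == (cell[0] - 1, cell[1] - 1):  # diagonal step = token match
            pairs.append((cell[0] - 1, cell[1] - 1))
        cell = prev
    return best_score, pairs[::-1]


# Hypothetical similarities: query "sea of troubles" vs.
# document "waves of this troublesome world".
sim = np.array([
    [0.7, 0.0, 0.0, 0.0, 0.1],
    [0.0, 1.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.8, 0.0],
])
best_score, pairs = wsb_align(sim)
print(best_score, pairs)
```

Each query token appears in at most one pair and each document token in at most one pair; the document token "this" is skipped at the price of one gap.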

Let's look at some queries, where the performance for WSB is bad, and try to understand why our search fails to obtain the optimal results at the top of the result list.

## Focussing on single queries

In [None]:
index_builder = make_index_builder()

For this, we turn to the query with the lowest score "though this be madness, yet there is a method in it", and look at its results in more detail.

In [None]:
plot_a = nbutils.plot_results(gold, index_builder.build_index(), "though this be madness", rank=6)

The best match obtained here - on rank 6 - is anchored on two word matches, namely `madness` (a 100% match) and `methods` (a 72% match). The other words are quite different and there is no good alignment.

In [None]:
plot_b = nbutils.plot_results(gold, index_builder.build_index(), "though this be madness", rank=3)

Above we see the rank 3 result from the same query, which is a false positive - i.e. our search proclaims it is a better result than the one we saw before, but in fact this result is not relevant according to our gold standard. If we analyze why this result gets such a high score nonetheless, we see that "is" and "in" both contribute big 100% scores. In contrast to the scores before, 100% for "madness" and 72% for "methods", this partially explains the higher overall score (if we assume for now that the contributions from the other tokens are somewhat similar).

Let us try to understand why the "correct" (true positive) results are ranked rather low and what we could do about it by looking at the composition of scores for each result we obtain for this query:

In [None]:
nbutils.vis_token_scores(plot_b.matches[:50], highlight={
    "token": ["madness", "method"],
    "rank": [6, 20, 35, 45]
});

The relevant (true positive) results are marked with black triangles. We see that our current search strategy isn't doing a very good job of ranking them highly.

Looking at the score composition of the relevant results, we can make out two distinct features: all relevant results show a rather large contribution of either "madness" (look at rank 6 and rank 35, for example) and/or a rather large contribution of "method" (ranks 45 and 6).

However, these contributions do not necessarily lead to higher ranks, since other words such as "is", "this" and "though" score higher for other results: for example, look at the contribution of "in" for rank 1.

In the plot below, we visualize this observation using ranks 1, 6 and 35. Compare the rank 1 result on the left, which is a false positive, with the two relevant results on the right. Again, we see that "in", "though" and "is" make up large parts of the score for rank 1, whereas "madness" is a considerable factor for the two relevant matches. Unfortunately, this contribution is not sufficient to bring these results to higher ranks.

In [None]:
import ipywidgets as widgets

@widgets.interact(plot_as=widgets.ToggleButtons(options=['bar', 'pie'], value='pie'))
def plot(plot_as):
    nbutils.vis_token_scores(plot_b.matches, kind=plot_as, ranks=[1, 6, 35], plot_width=800)

The distributions of score contributions we just observed are the motivation for our approach to tag-weighted alignments, which are described in (Liebl and Burghardt, 2020). We demonstrate it now by using a tag-weighted alignment that weights nouns like "madness" and "method" 3 times more than other word types. Let's set it up ("NN" is a Penn Treebank tag and identifies singular nouns):

In [None]:
tag_weighted_index_builder = make_index_builder(
    strategy="Tag-Weighted Alignment",
    strategy_options={"tag_weights": {
        'NN': 3
    }})
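
The effect of such weights on a ranking can be illustrated with a toy computation. This sketch assumes that the score is a weighted mean of per-token similarities with weights chosen by POS tag; the exact formula used by the Vectorian is not shown in this notebook, and the similarity values below are made up.

```python
def weighted_score(token_sims, tags, tag_weights, default=1.0):
    """Weighted mean of per-token similarities, with weights chosen by POS tag."""
    weights = [tag_weights.get(tag, default) for tag in tags]
    return sum(w * s for w, s in zip(weights, token_sims)) / sum(weights)


# Query tokens "is", "madness", "in" with their Penn Treebank tags.
tags = ["VBZ", "NN", "IN"]
false_positive = [1.0, 0.2, 1.0]  # function words match exactly, the noun does not
true_positive = [0.5, 1.0, 0.5]   # the noun matches exactly

for weights in ({}, {"NN": 3}):
    fp = weighted_score(false_positive, tags, weights)
    tp = weighted_score(true_positive, tags, weights)
    print(weights, round(fp, 3), round(tp, 3))
```

Without weights the false positive scores higher (about 0.73 versus 0.67); with nouns weighted three times as much, the order flips (0.52 versus 0.8).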

In [None]:
nbutils.plot_results(gold, tag_weighted_index_builder.build_index(), "though this be madness");

This tag weighting moves the correct results far toward the top, namely to ranks 1, 2, 4 and 6.

Note that we can bring rank 73 to rank 15 by increasing the NN weight to 5. But this is sort of an extreme measure and we will not follow it here.

Instead we wonder: how will the weighting affect the other queries? Let's re-run the NDCG computation and compare it against unweighted WSB.

In [None]:
index_builder_unweighted = make_index_builder()

In [None]:
nbutils.plot_ndcgs(gold, {
    "wsb_unweighted": index_builder_unweighted.build_index(),
    "wsb_weighted": tag_weighted_index_builder.build_index()
})

So we have shown that we considerably increased our accuracy through employing weighting.

# Interactive Searches with Your Own Data

First specify the text documents you want to search through by an upload widget:

In [None]:
import ipywidgets as widgets

upload = widgets.FileUpload(
    accept='.txt',
    multiple=True
)

upload

From the contents of this upload widget, we now build a Vectorian session that we can search through. As always with a Vectorian session, we need to specify the embeddings we want to employ for searching. We also need an `nlp` instance for importing the text documents. Depending on the size and number of documents, this step can take some time.

In [None]:
from vectorian.importers import StringImporter
from vectorian.session import LabSession

import codecs


def files_to_session(upload):
    im = StringImporter(nlp)
    
    if not upload.value:
        raise RuntimeError("cannot run on empty upload")

    docs = []
    for k, data in upload.value.items():
        docs.append(im(
            codecs.decode(data["content"], encoding="utf-8"),
            title=k,
            unique_id=k))

    return LabSession(
        docs,
        embeddings=[
            emb_numberbatch,
            emb_fasttext],
        normalizers="default")

upload_session = files_to_session(upload)

Now we present the full interactive search interface the Vectorian offers (we have hidden it so far and focussed on a subset). Note that in contrast to our experiments earlier, we do not search on the *document* level by default, but rather the *sentence* level - i.e. we split each document into sentences and then search on each sentence. You can change this in the "Partition" dropdown.

In [None]:
upload_session.interact(nlp)

# References

Pennington, Jeffrey, et al. “GloVe: Global Vectors for Word Representation.” Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), Association for Computational Linguistics, 2014, pp. 1532–43. doi:10.3115/v1/D14-1162.

Mikolov, Tomas, et al. “Advances in Pre-Training Distributed Word Representations.” ArXiv:1712.09405 [Cs], Dec. 2017. arXiv.org, http://arxiv.org/abs/1712.09405.

Speer, Robyn, et al. “ConceptNet 5.5: An Open Multilingual Graph of General Knowledge.” ArXiv:1612.03975 [Cs], Dec. 2018. arXiv.org, http://arxiv.org/abs/1612.03975.

Reimers, Nils, and Iryna Gurevych. “Sentence-BERT: Sentence Embeddings Using Siamese BERT-Networks.” ArXiv:1908.10084 [Cs], Aug. 2019. arXiv.org, http://arxiv.org/abs/1908.10084.

Liebl, Bernhard, and Manuel Burghardt. “‘Shakespeare in the Vectorian Age’ – An Evaluation of Different Word Embeddings and NLP Parameters for the Detection of Shakespeare Quotes.” Proceedings of the 4th Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature, 2020, pp. 56–58.