In [None]:
import vectorian
import sys

import ipywidgets as widgets
from ipywidgets import interact

sys.path.append("code")
import nbutils
import gold

We first tell the notebook logic whether we have a full bokeh server. This *is* the case for local jupyter installations, but is *not* the case for notebooks running on mybinder - in the latter case we have some limits on interactivity.

In [None]:
import os

import importlib
importlib.reload(nbutils)
importlib.reload(gold)

nbutils.initialize("auto")

In [None]:
import bokeh.io
bokeh.io.output_notebook()

# Overview

In this notebook we will set forth the full stack of decisions that need to be taken in order to compute multi-token (i.e. sentence or document) similarities from embeddings. Not only will we work with different embeddings but also with different ways of leveraging these embeddings to compute phrase similarities. While our examples are following single paths of computation, we contextualize our decisions and allow interactive readers to take different decisions in terms of embeddings, algorithms and parameters.

There are now various established ways to compute embeddings for similarity tasks. A first important distinction is between *token* embeddings and *document* embeddings (see diagram below) - note that we use the terms "token embeddings" and "word embeddings" interchangeably. While the former imply one embedding (i.e. numeric vector) per token, the latter operate by mapping a whole document (a set of tokens) into one single embedding.

There are two common ways to compute document embeddings. One way is to derive them from token embeddings - for example by averaging over them. More complex approaches train dedicated models that are optimized to produce good document embeddings.
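
The averaging approach can be sketched in a few lines of plain Python. This is only a minimal illustration, not how the Vectorian computes it internally; the helper `mean_pool` and the toy vectors are hypothetical.

```python
def mean_pool(token_vectors):
    """Average a list of equal-length token vectors into one document vector."""
    if not token_vectors:
        raise ValueError("need at least one token vector")
    n = len(token_vectors[0])
    return [sum(vec[i] for vec in token_vectors) / len(token_vectors)
            for i in range(n)]


# Hypothetical 3-dimensional token embeddings for a two-token document.
toy_tokens = [[1.0, 0.0, 2.0], [3.0, 2.0, 0.0]]
doc_vector = mean_pool(toy_tokens)
print(doc_vector)  # [2.0, 1.0, 1.0]
```

Note that averaging discards token order, so two documents with the same tokens in a different order receive the same embedding.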

So on this level, we can differentiate between three kinds of embeddings: pure token embeddings, document embeddings derived from token embeddings, and - finally - document embeddings from dedicated document embedding models (e.g. SBERT).

![Different kinds of embeddings](miscellaneous/diagram_embeddings_1.svg)


For token embeddings, there are also various options, as the diagram below illustrates. The most recent option is contextual token embeddings (sometimes also called *dynamic* embeddings), which incorporate a specific token's context and can be obtained from architectures like ELMo or BERT. Another option is static token embeddings, which map one token to one embedding, independent of its specific occurrence in a text. For an overview of static and contextual embeddings and their differences, see (Wang et al. 2020).

For static embeddings there is now a variety of established options like fastText or GloVe. We can also combine embeddings or stack them (i.e. concatenate embedding vectors) to simply create new embeddings from existing ones.
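
Stacking in this sense is plain vector concatenation. The following sketch uses a hypothetical helper `stack` and made-up vectors; it does not reproduce the Vectorian's `StackedEmbedding` class.

```python
def stack(*vectors):
    """Concatenate several embedding vectors of the same token into one."""
    stacked = []
    for vec in vectors:
        stacked.extend(vec)
    return stacked


vec_a = [0.1, 0.2]        # hypothetical 2-dimensional embedding of a token
vec_b = [0.3, 0.4, 0.5]   # hypothetical 3-dimensional embedding of the same token
stacked = stack(vec_a, vec_b)
print(len(stacked))  # 5
```

The dot product of two stacked vectors is the sum of the dot products of their parts, so an embedding with larger vector norms contributes more to a similarity computed on the stacked vector.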

![Different kinds of embeddings](miscellaneous/diagram_embeddings_2.svg)


In this notebook, we pick four classes of embeddings from the wide range of possibilities just described:

* Static token embeddings: these operate on the token level. We experiment with GloVe (Pennington et al., 2014), fastText (Mikolov et al., 2017) and Numberbatch (Speer et al., 2018). We use these three to compute token similarity and combine this with alignment algorithms (such as Waterman-Smith-Beyer) to compute document similarity. We also investigate the effect of stacking two static embeddings (fastText and Numberbatch).
* Contextual token embeddings: these also operate on the token level, but the embedding changes according to a specific token instance's context. In this notebook we experiment with using such token embeddings from a Sentence-BERT model.
* Document embeddings derived from specially trained models. Document embeddings represent one document via one single embedding. We use document embeddings obtained from a BERT model. More specifically, we use a Sentence-BERT model trained for the semantic textual similarity (STS) task (Reimers and Gurevych, 2019).
* Document embeddings derived from token embeddings. We also experiment with averaging different kinds of token embeddings (static and contextual) to derive document embeddings.


# The Gold Standard

First, we load our gold standard, which contains our queries.

In [None]:
gold_data = gold.Data("data/raw_data/gold.json")

Our gold standard consists of a number of `Pattern`s. Each `Pattern` is associated with a phrase, e.g. "to be or not to be", which occurs in a rephrased form in other works and contexts. These reoccurrences, which model text reuse, are called `Occurrence`s in our data. Each such `Occurrence` carries the actual phrase and a larger context in which it occurs, which together we call the `Evidence`. The data layout for our gold standard looks as follows:

![UML of gold standard data](miscellaneous/gold_uml.svg)

One specific example in this data is the occurrence of the pattern "to be or not to be" in "The Phoenix" as "to be named or not be named" (the latter is the `Evidence`, which consists of the actual phrase and a larger context):

In [None]:
nbutils.Browser(gold_data, "to be or not to be", "The Phoenix");

Below is a visualization of the full gold standard, with patterns indicated as blue circles and evidence indicated as green circles. Matching evidence and patterns are connected via edges, and each bouquet consists of one pattern and the matching instances of text reuse. Interactive readers may want to hover the mouse over the nodes to see their content.

In [None]:
nbutils.plot_gold(gold_data)

# The Vectorian

For performing our actual investigations we rely on a framework called The Vectorian, which we first introduced in 2020 (Liebl and Burghardt, 2020). By employing highly optimized algorithms and data structures, the Vectorian allows us to perform rapid searches over the gold standard texts using a variety of approaches and strategies. 

In order to use the Vectorian, we need to map the gold standard concepts to Vectorian concepts as follows: the `context` texts in the gold standard `Evidence` items, i.e. the texts and contexts that contain an example of text reuse, are created in the Vectorian as `Document`s. A `Document` in Vectorian terminology is something we can perform a search on. `Document`s in the Vectorian are created using different kinds of `Importer`s that perform necessary natural language processing tasks using an additional `NLP` class (see diagram below). Since this step is time-consuming, we precomputed it and use the `Corpus` class to quickly load the preprocessed `Document`s into this notebook. For details about the full preprocessing, see `code/prepare_corpus.ipynb`.

Using the loaded `Document`s and a set of `Embedding`s we want to work with, we can then create a `Session`, that allows us to perform searches. The specific steps we will take are (see top of diagram below):

* create a `Partition` (which specifies how Documents should be split into searchable units, e.g. sentences)
* create an `Index` for that partition (which specifies the strategy/algorithm we employ for searching)
* perform a search on that `Index` (using a query text)
* retrieve the `Result` and the `Match`es for that search

Later, we will additionally create an `Index` with document embeddings by employing a `PartitionEncoder`, which allows us to specify an embedding that does not operate on the token level, but on the document level.

![UML of important Vectorian classes](miscellaneous/vectorian_session.svg)

## Loading Word Embeddings

We load the static embeddings that were described above from Vectorian's model zoo. This zoo contains a number of prebuilt embeddings for various languages. Given below are some examples:

In [None]:
from vectorian.embeddings import Zoo
Zoo.list()[-15:]

For reasons of limited RAM in the interactive Binder environment (and to limit download times), we use small or compressed versions of the static embeddings we work with:

* for GloVe, we use the official 50-dimensional version of the 6B variant.
* for fastText we use a version that was compressed using the standard settings in https://github.com/avidale/compress-fasttext.
* for Numberbatch we use a 50-dimensional version that was reduced using a standard PCA.

In [None]:
the_embeddings = {}

the_embeddings['glove'] = Zoo.load('glove-6B-50')
the_embeddings['numberbatch'] = Zoo.load('numberbatch-19.08-en-50')
the_embeddings['fasttext'] = Zoo.load('fasttext-en-mini')

We also use one stacked embedding, in which we combine fasttext and numberbatch.

In [None]:
from vectorian.embeddings import StackedEmbedding

the_embeddings['fasttext_numberbatch'] = StackedEmbedding([
    the_embeddings['fasttext'], the_embeddings['numberbatch']])

We instantiate an NLP parser that is able to provide embeddings based on Sentence-BERT (Reimers and Gurevych, 2019).

In [None]:
nlp = nbutils.make_nlp()

Finally, we add a shim that allows us to use Sentence-BERT's contextual token embeddings in the Vectorian.

In [None]:
from vectorian.embeddings import SentenceBertEmbedding

the_embeddings['sbert'] = SentenceBertEmbedding(nlp)

## Creating the Session

The Vectorian `Session` is created with the specified embeddings and the preprocessed documents, which are loaded via the `Corpus` class:

In [None]:
from vectorian.session import LabSession
from vectorian.corpus import Corpus

session = LabSession(
    Corpus.load("data/processed_data/corpus"),
    embeddings=the_embeddings.values(),
    normalizers="default")

The session now contains all embeddings we will work with as well as the list of documents that contain the texts from the gold standard `Evidence` items.

## Word Embeddings and Similarity

We now take a short look at what constitutes word embeddings. A word embedding is a vector **x** of dimension $n$, i.e. a vector consisting of $n$ scalars.

\begin{equation*}
\mathbf{x}=(x_1, x_2, ..., x_{n-1}, x_n)
\end{equation*}

For example, the compressed numberbatch embedding we use has $n=50$ and thus represents the word "coffee" with the following 50 numbers:

In [None]:
widgets.GridBox(
    [widgets.Label(f"{x:.2f}") for x in session.word_vec(the_embeddings["numberbatch"], "coffee")],
    layout=widgets.Layout(grid_template_columns="repeat(10, 50px)"))

Since the above representation is hard to grasp, we visualize the values of

\begin{equation*}x_1, x_2, ..., x_{n-1}, x_n\end{equation*}

through different colors (optionally normalizing by $\mathbf{||x||_2}$):

In [None]:
@interact(embedding=widgets.Dropdown(
    options=[(k, v) for k, v in the_embeddings.items() if not v.is_contextual],
    value=the_embeddings["numberbatch"]), normalize=False)
def plot(embedding, normalize):
    nbutils.plot_embedding_vectors_val(
        ["sail", "boat", "coffee", "tea", "guitar", "piano"],
        get_vec=lambda w: session.word_vec(embedding, w),
        normalize=normalize)

Looking at these color patterns, we can gain some intuitive understanding of why and how word embeddings are suitable for word similarity computations. For example, *sail* and *boat* both show a strong activation on dimension 27. Similarly, *guitar* and *piano* share similar values around dimension 24. The words *coffee* and *tea* also share some similar patterns around dimension 2 and dimension 49, which set them apart from the other four words.

A common approach to compute the similarity between two word vectors **u** and **v** in this kind of high-dimensional vector spaces is to compute the cosine of the angle $\theta$ between the vectors, which is called cosine similarity:

\begin{equation*}
\cos \theta = \frac{\mathbf{u} \cdot \mathbf{v}}{||\mathbf{u}||_2 ||\mathbf{v}||_2} = \frac{\sum_{i=1}^n u_i v_i}{\sqrt{\sum_{i=1}^n u_i^2} \sqrt{\sum_{i=1}^n v_i^2}} = \sum_{i=1}^n \left( \frac{\mathbf{u}}{||\mathbf{u}||_2} \right)_i \left( \frac{\mathbf{v}}{||\mathbf{v}||_2} \right)_i
\end{equation*}

A large positive value (i.e. a small $\theta$ between **u** and **v**) indicates high similarity, whereas a small or even negative value (i.e. a large $\theta$) indicates low similarity. For a discussion of issues with this notion of similarity, see (Faruqui et al., 2016).

The visualization below encodes

\begin{equation*}
\left( \frac{\mathbf{u}}{||\mathbf{u}||_2} \right)_i \left( \frac{\mathbf{v}}{||\mathbf{v}||_2} \right)_i 
\end{equation*}

for different $i, 1 \le i \le n$ through colors to illustrate how different components contribute to the cosine similarity for two words.
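
As a minimal sketch of the formulas above, the following plain Python computes the cosine similarity and the per-component terms that sum up to it. The function names and the two toy vectors are hypothetical and are not part of `nbutils` or the Vectorian.

```python
import math


def l2_norm(vec):
    """Euclidean length of a vector."""
    return math.sqrt(sum(x * x for x in vec))


def cosine_similarity(u, v):
    """Dot product of u and v divided by the product of their lengths."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (l2_norm(u) * l2_norm(v))


def contributions(u, v):
    """Per-component products of the two length-normalized vectors."""
    nu, nv = l2_norm(u), l2_norm(v)
    return [(a / nu) * (b / nv) for a, b in zip(u, v)]


u = [3.0, 4.0]   # toy 2-dimensional "embeddings"
v = [4.0, 3.0]
sim = cosine_similarity(u, v)
parts = contributions(u, v)
print(sim, parts)  # sim is about 0.96, and the parts sum to the same value
```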

In [None]:
@interact(embedding=widgets.Dropdown(
    options=[(k, v) for k, v in the_embeddings.items() if not v.is_contextual],
    value=the_embeddings["numberbatch"]))
def plot(embedding):
    nbutils.plot_embedding_vectors_mul([
        ("sail", "boat"),
        ("coffee", "tea"),
        ("guitar", "piano")], get_vec=lambda w: session.word_vec(embedding, w))

A similar investigation into fastText shows similar spots of positive contribution. The situation is more complex due to the higher number of dimensions.

In [None]:
@interact(embedding=widgets.Dropdown(
    options=[(k, v) for k, v in the_embeddings.items() if not v.is_contextual],
    value=the_embeddings["fasttext"]))
def plot(embedding):
    nbutils.plot_embedding_vectors_mul([
        ("sail", "boat"),
        ("coffee", "tea"),
        ("guitar", "piano")], get_vec=lambda w: session.word_vec(embedding, w))

As shown above, computing the cosine similarity is mathematically equivalent to summing up the terms in the diagram above. The overall similarity between *guitar* and *piano* is measured at about 68% with the fastText embedding we use.

In [None]:
from vectorian.metrics import TokenSimilarity, CosineSimilarity

token_sim = TokenSimilarity(
    the_embeddings["fasttext"],
    CosineSimilarity()
)

session.similarity(token_sim, "guitar", "piano")

In [None]:
def plot_mik(words):
    from sklearn.decomposition import PCA
    import bokeh.models
    import bokeh.plotting
    import bokeh.io
    
    vectors = [session.word_vec(the_embeddings["fasttext"], word) for word in words]

    pca = PCA(n_components=2, whiten=True)
    v2d = pca.fit(vectors).transform(vectors)
    
    source = bokeh.models.ColumnDataSource({
        'x': v2d[:, 0],
        'y': v2d[:, 1]
    })
    

    p = bokeh.plotting.figure(
        plot_width=800,
        plot_height=400,
        title="",
        toolbar_location=None, tools="")
    
    p.circle(source=source, size=10)
    
    for i in range(0, len(words), 2):
        p.add_layout(bokeh.models.Arrow(end=bokeh.models.NormalHead(line_color="black", line_width=1),
            x_start=v2d[i, 0], y_start=v2d[i, 1], x_end=v2d[i + 1, 0], y_end=v2d[i + 1, 1]))
    
    bokeh.io.show(p)
    
    
plot_mik(["man", "woman", "king", "queen", "prince", "princess"])

In [None]:
token_sim = TokenSimilarity(
    the_embeddings["sbert"],
    CosineSimilarity())

a = list(session.documents[0].spans(session.partition("document")))[0][3]
b = list(session.documents[3].spans(session.partition("document")))[0][2]
session.similarity(token_sim, a, b)

# Word Embeddings for the Specific Task

In [None]:
vis = nbutils.TokenSimVis(session, nlp, gold_data)

In [None]:
vis.goto("rest is silence", "Fig for Fortune")

The example above is a situation where token similarity - and therefore embeddings - might not help much. While the syntactic structure is mirrored, the term "silence" is replaced with "all but wind". Even if we focus on nouns only, we would not expect "silence" and "wind" to be understood as similar. Still, an embedding approach should be able to recognize that the words at the beginning of the phrase are exact matches.

If we inspect the cosine similarity of the token "silence" with other tokens in the context under three of our embeddings, we see that there is more connection between "silence" and "wind" than we expected - esp. with numberbatch. Still, the absolute value of 0.3 for numberbatch is low. Interestingly, glove associates "silence" with "action", i.e. an opposite. The phenomenon that embeddings sometimes cluster opposites is a common observation and can be an issue when wanting to differentiate between synonyms and antonyms.

In [None]:
vis.plot("silence")

In [None]:
vis.goto("sea of troubles", "Book of Common Prayer")

This is a different situation, where similarity computation might help. Here, "sea" is replaced by "waves", and "troubles" by "troublesome". We should expect to get reasonable results with embeddings on this instance.

Indeed, by inspecting the cosine similarity of the token "sea" with other tokens in the context, we see that this is true.

In [None]:
vis.plot("sea")

In [None]:
vis.plot("troubles")

Note how out-of-vocabulary words like "troublesomest" will produce zero similarities under standard key-value embeddings, whereas fastText is still able to produce a vector thanks to subword information.
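
To illustrate why subword information helps here, the sketch below extracts character n-grams in the way fastText describes them (word wrapped in `<` and `>`, n-gram lengths 3 to 6 by default) and counts how many n-grams of the unseen word are shared with a known word. It is a simplification: the real model hashes n-grams into buckets and combines learned n-gram vectors, and the compressed model used in this notebook may differ in its details. The helper `char_ngrams` is hypothetical.

```python
def char_ngrams(word, n_min=3, n_max=6):
    """Character n-grams of a word wrapped in the boundary markers '<' and '>'."""
    wrapped = f"<{word}>"
    return {
        wrapped[i:i + n]
        for n in range(n_min, n_max + 1)
        for i in range(len(wrapped) - n + 1)
    }


seen = char_ngrams("troublesome")      # assumed to be in the vocabulary
unseen = char_ngrams("troublesomest")  # assumed to be out of vocabulary
shared = seen & unseen
print(len(shared), "of", len(unseen), "n-grams are shared")
```

Because most n-grams of "troublesomest" also occur in "troublesome", a vector built from n-gram vectors can land close to the vector of the known word.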

In [None]:
vis.plot("troublesomest")

# Exploring Document Embeddings

Before we turn to alignment strategies to match sentences token by token, we first look at representing each document with one single embedding in order to gather an understanding how different embedding strategies relate to the nearness of documents. We will later return to individual token embeddings.

We will use two strategies for computing document embeddings:

* averaging over token embeddings
* computing document embeddings through a dedicated model

In order to achieve the latter, we compute document embeddings using Sentence-BERT.

In [None]:
from vectorian.embeddings import CachedPartitionEncoder, SpanEncoder

sbert_encoder = CachedPartitionEncoder(SpanEncoder(
    lambda texts: [nlp(t).vector for t in texts]))

sbert_encoder.try_load("data/processed_data/doc_embeddings")
sbert_encoder.cache(session.documents, session.partition("document"))
sbert_encoder.save("data/processed_data/doc_embeddings")

sbert_encoder_name = nlp.meta["name"]

In order to achieve the former, we configure a helper class instance to use averaging to build document embeddings from token embeddings.

Interactive readers may want to change the "mean" (averaging) method to other methods for computing document embeddings as well.

In [None]:
doc_embedding_explorer = nbutils.DocEmbeddingExplorer(
    session=session, nlp=nlp, gold=gold_data, extra_encoders={sbert_encoder_name: sbert_encoder})

In [None]:
doc_embedding_explorer.plot([
    {"encoder": "paraphrase_distilroberta", "locator": ("fixed", "carry coals"), 'has_tok_emb': False},
    {"encoder": "paraphrase_distilroberta", "locator": ("fixed", "an old man is twice"), 'has_tok_emb': False}
]);

In the t-SNE visualization above, dots are documents and the colors indicate the query that yields that document in our gold standard. By hovering over dots with the mouse you get details on the document and query the dot represents. Nearby dots of the same color indicate that the embedding tends to cluster documents similarly to our gold standard.

On the left above, we see that the phrase "we will not carry coals" (large green-yellow circle with cross) is located close to the documents associated with that query (smaller green-yellow circles). Similarly, on the right we see that the phrase "an old man is twice a child" clusters with the actual (green) documents we associate with it in our gold standard.

For these phrases and documents, the `paraphrase_distilroberta` model does a good job of producing a document embedding that actually separates inherent topics (without us telling it to do it).

In [None]:
doc_embedding_explorer.plot([
    {"encoder": "numberbatch", "selection": [
        'ww_32c26a7909c83bda',
        'ww_b5b8083a6a1282bc',
        'ww_9a6cb20b0b157545',
        'ww_a6f4b0e3428ad510',
        'ww_8e68a517bc3ecceb']
    }
]);

In the plot above we look at the document embedding produced by a **token-based** embedding. This has the advantage that we can actually look at the token embeddings that make up the document embedding (through averaging). On the right side, we see a t-SNE plot of all token embeddings that occur in the documents that are selected on the left. The hope is that this visualization will give us a clue why the documents on the left might be clustered.

The red circles on the left represent contexts that match the phrase "a horse, a horse, my kingdom for a horse". If we look at the token embeddings (which include documents from other classes), we indeed see that a grouping happens due to word embeddings clustering around "horse" (right side), but we also see a cluster around "boat", "sail" and "river" on the left.

In fact context 1 contains "muscle boat", context 2 contains "To swim the river villain", and context 3 contains "A boat, a boat". We thereby see that this kind of unsupervised document clustering clusters items due to inherent qualities that might not actually match our query criteria.

Interactive note: you can compute different token embedding plots by selecting different documents with the mouse (drag the mouse to lasso).

# Understanding Alignments (WSB vs WMD)

Various approaches to sequence alignment have been proposed. For an overview of sequence alignment algorithms as well as adjacent approaches like Dynamic Time Warping, see (Kruskal, 1983). In this section, we use the Waterman-Smith-Beyer (WSB) algorithm, which produces optimal local alignments and allows a general (e.g. non-affine) cost function (Waterman et al., 1976). Other commonly used alignment algorithms - such as Smith-Waterman and Gotoh - can be regarded as special cases of WSB. In contrast to Needleman-Wunsch, which produces global alignments, WSB produces local alignments. In contrast to classic formulations of WSB - which often use a fixed substitution cost - we use the word distance from word embeddings to compute the substitution penalty for specific pairs of words.
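
To make the idea of a local alignment over embedding similarities concrete, here is a deliberately simplified sketch. It uses the Smith-Waterman recurrence with a linear gap cost, i.e. one of the special cases mentioned above, and not the general WSB algorithm or the Vectorian's implementation. The similarity table, the gap cost of 0.5 and the token lists are made-up assumptions.

```python
def local_alignment_score(query, doc, sim, gap_cost=0.5):
    """Best local alignment score (Smith-Waterman recurrence, linear gap cost)."""
    rows, cols = len(query) + 1, len(doc) + 1
    H = [[0.0] * cols for _ in range(rows)]
    best = 0.0
    for i in range(1, rows):
        for j in range(1, cols):
            H[i][j] = max(
                0.0,                                            # start a new local alignment
                H[i - 1][j - 1] + sim(query[i - 1], doc[j - 1]),  # align two tokens
                H[i - 1][j] - gap_cost,                         # skip a query token
                H[i][j - 1] - gap_cost,                         # skip a document token
            )
            best = max(best, H[i][j])
    return best


# Hypothetical token similarities standing in for embedding-based cosine similarity.
TOY_SIM = {("sea", "waves"): 0.6, ("troubles", "troublesome"): 0.8}


def toy_sim(a, b):
    if a == b:
        return 1.0
    return TOY_SIM.get((a, b), TOY_SIM.get((b, a), 0.0))


query = ["sea", "of", "troubles"]
doc = ["the", "waves", "of", "this", "troublesome", "world"]
score = local_alignment_score(query, doc, toy_sim)
print(score)  # about 1.9: 0.6 + 1.0 - 0.5 (gap over "this") + 0.8
```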

A different approach to computing a measure of similarity between bags of words is the so-called Word Mover's Distance (Kusner et al., 2015).
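
The exact Word Mover's Distance requires solving an optimal transport problem. As a rough, dependency-free illustration we sketch only the relaxed variant that Kusner et al. describe as a lower bound, in which every word moves all of its mass to its nearest counterpart. This is not the Vectorian's implementation; the toy vectors and function names are hypothetical.

```python
import math
from collections import Counter


def nbow(tokens):
    """Normalized bag of words: token -> relative frequency."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {tok: c / total for tok, c in counts.items()}


def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def one_sided_cost(src, dst, vectors):
    """Every source word moves all its mass to its nearest destination word."""
    return sum(
        weight * min(euclidean(vectors[s], vectors[d]) for d in dst)
        for s, weight in nbow(src).items()
    )


def relaxed_wmd(doc_a, doc_b, vectors):
    """Larger of the two one-sided costs; a lower bound on the exact WMD."""
    return max(one_sided_cost(doc_a, doc_b, vectors),
               one_sided_cost(doc_b, doc_a, vectors))


# Hypothetical 2-dimensional word vectors.
toy_vectors = {
    "sea": [1.0, 0.0], "waves": [1.0, 1.0],
    "troubles": [0.0, 3.0], "troublesome": [0.0, 4.0],
}
dist = relaxed_wmd(["sea", "troubles"], ["waves", "troublesome"], toy_vectors)
print(dist)  # 1.0
```

In contrast to an alignment, this measure ignores token order entirely, which is why it is described as operating on bags of words.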

## A Search Query using Alignment over Similar Tokens

In [None]:
def make_index_builder(**kwargs):
    return nbutils.InteractiveIndexBuilder(session, nlp, partition_encoders={
        sbert_encoder_name: sbert_encoder
    }, **kwargs)

In [None]:
index_builder = make_index_builder()
index_builder

What you see above is the description of a search strategy that we will employ in the following sections of this notebook. Interactive readers can switch to the "Edit" part and actually explore the setting in more detail and even change it to something completely different.

In [None]:
gold_data.patterns[0].phrase

In [None]:
index_builder.build_index().find(gold_data.patterns[0].phrase, n=1)

## Plotting the NDCG over the Corpus

We first define a strategy for searching the corpus. In the summary below you will find the strategy used for the non-interactive version of this text. In the interactive version, you can click on "Edit" and change these settings and rerun the following sections of the notebook accordingly.

In [None]:
import collections
import ipywidgets as widgets

index_builders = collections.OrderedDict({
    "wsb": make_index_builder(
        strategy="Alignment",
        strategy_options={"alignment": vectorian.alignment.WatermanSmithBeyer()}),
    "wmd nbow": make_index_builder(
        strategy="Alignment",
        strategy_options={"alignment": vectorian.alignment.WordMoversDistance.wmd("nbow")}),
    "wmd bow": make_index_builder(
        strategy="Alignment",
        strategy_options={"alignment": vectorian.alignment.WordMoversDistance.wmd("bow")}),
    "doc embedding": make_index_builder(
        strategy="Partition Embedding")
})

accordion = widgets.Accordion(children=[x.displayable for x in index_builders.values()])
for i, k in enumerate(index_builders.keys()):
    accordion.set_title(i, k)
accordion

In [None]:
vectorian.alignment.WordMoversDistance.wmd("nbow").to_args(None)

In [None]:
vectorian.alignment.WordMoversDistance.wmd("bow").to_args(None)

We now get an overview of the quality of the results we obtain when using the indices configured in `index_builders` by computing the NDCG over all queries in our gold standard with regard to the known optimal results.

In [None]:
nbutils.plot_ndcgs(gold_data, dict((k, v.build_index()) for k, v in index_builders.items()))

We see that some queries obtain 100%, i.e. the top results match the optimal ones given in our gold standard. We see that Waterman-Smith-Beyer (WSB) tends to perform a tad better than Word Mover's Distance (WMD), with the exception of "though this be madness...", where WMD outperforms WSB. In general the Vectorian modification of WMD, which does not use nbow, performs better than Kusner's original description of WMD. The one exception here is "livers white as milk."
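
For readers unfamiliar with the measure, the following sketch shows one common formulation of NDCG with binary relevance and a logarithmic rank discount. Whether `nbutils.plot_ndcgs` uses exactly this variant (gain function, cutoff) is an assumption we do not verify here.

```python
import math


def dcg(relevances):
    """Discounted cumulative gain of a ranked list of relevance values."""
    return sum(rel / math.log2(rank + 1)
               for rank, rel in enumerate(relevances, start=1))


def ndcg(relevances):
    """DCG divided by the DCG of the ideal (relevance-sorted) ranking."""
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0


# 1 marks a result that is relevant according to a gold standard, 0 one that is not.
perfect = ndcg([1, 1, 0, 0])   # relevant results at ranks 1 and 2
flawed = ndcg([0, 1, 0, 1])    # relevant results at ranks 2 and 4
print(perfect, flawed)  # 1.0 and roughly 0.65
```

A score of 100% therefore means that all relevant results appear before any irrelevant one.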

One advantage of WSB over the full WMD variants is its easy interpretability. WSB allows us to understand an alignment as a bijective mapping between tokens, namely between a subset of the query and a subset of the document. For WMD, this assumption of bijection often breaks down. We use this property of WSB in the following section to illustrate which mappings actually occur.

Let's look at some queries, where the performance for WSB is bad, and try to understand why our search fails to obtain the optimal results at the top of the result list.

## Focussing on Single Queries

In [None]:
index_builder = make_index_builder()
index_builder

For this, we turn to the query with the lowest score, "though this be madness, yet there is a method in it", and look at its results in more detail.

In [None]:
plot_a = nbutils.plot_results(gold_data, index_builder.build_index(), "though this be madness", rank=6)

The best match obtained here - on rank 6 - is anchored on two word matches, namely `madness` (a 100% match) and `methods` (a 72% match). The other words are quite different and there is no good alignment.

In [None]:
plot_b = nbutils.plot_results(gold_data, index_builder.build_index(), "though this be madness", rank=3)

Above we see the rank 3 result from the same query, which is a false positive - i.e. our search proclaims it is a better result than the one we saw before, but in fact this result is not relevant according to our gold standard. If we analyze why this result gets such a high score nonetheless, we see that "is" and "in" both contribute big 100% scores. In contrast to the scores before, 100% for "madness" and 72% for "methods", this partially explains the higher overall score (if we assume for now that the contributions from the other tokens are somewhat similar).

Let us try to understand why the "correct" (true positive) results are ranked rather low and what we could do about it by looking at the composition of scores for each result we obtain for this query:

In [None]:
nbutils.vis_token_scores(plot_b.matches[:50], highlight={
    "token": ["madness", "method"],
    "rank": [6, 20, 35, 45]
});

The relevant (true positive) results are marked with black triangles. We see that our current search strategy isn't doing a very good job of ranking them highly.

Looking at the score composition of the relevant results, we can make out two distinct features: all relevant results show a rather large contribution of "madness" (look at rank 6 and rank 35, for example) and/or a rather large contribution of "method" (ranks 45 and 6).

However, these contributions do not lead to higher ranks necessarily, since other words such as "is", "this" and "though" score higher for other results: for example, look at the contribution of "in" for rank 1.

In the next plot below, we visualize this observation using ranks 1, 6 and 35. Compare the rank 1 result on the left - which is a false positive - with the two relevant results on the right. Again, we see that "in", "through" and "is" make up large parts of the score for rank 1, whereas "madness" is a considerable factor for the two relevant matches. Unfortunately, this contribution is not sufficient to bring these results to higher ranks.

In [None]:
@widgets.interact(plot_as=widgets.ToggleButtons(options=['bar', 'pie'], value='pie'))
def plot(plot_as):
    nbutils.vis_token_scores(plot_b.matches, kind=plot_as, ranks=[1, 6, 35], plot_width=800)

The distributions of score contributions we just observed are the motivation for our approach to tag-weighted alignments, which are described in (Liebl and Burghardt, 2020). We demonstrate it now by using a tag-weighted alignment that weights nouns like "madness" and "method" 3 times more than other word types. Let's set it up ("NN" is a Penn Treebank tag and identifies singular nouns):

In [None]:
tag_weighted_index_builder = make_index_builder(
    strategy="Tag-Weighted Alignment",
    strategy_options={"tag_weights": {
        'NN': 3
    }})
tag_weighted_index_builder

In [None]:
nbutils.plot_results(gold_data, tag_weighted_index_builder.build_index(), "though this be madness");

This tag weighting moves the correct results far to the top, namely to ranks 1, 2, 4 and 6.
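
The effect can be illustrated with a toy computation. The sketch below scores an alignment as a weighted mean of token similarities, with weights looked up by POS tag. This normalization is an assumption made for illustration and may differ from what the Vectorian actually does; the similarity values and the helper `weighted_score` are made up.

```python
def weighted_score(aligned, tag_weights, default_weight=1.0):
    """Weighted mean of token similarities; weights are looked up by POS tag."""
    total = 0.0
    weight_sum = 0.0
    for similarity, tag in aligned:
        w = tag_weights.get(tag, default_weight)
        total += w * similarity
        weight_sum += w
    return total / weight_sum if weight_sum else 0.0


# Hypothetical aligned tokens as (similarity, Penn Treebank tag) pairs.
noun_match = [(1.0, "NN"), (0.2, "VBZ"), (0.2, "IN")]      # strong match on a noun
function_match = [(0.2, "NN"), (1.0, "VBZ"), (1.0, "IN")]  # strong matches on "is", "in"

unweighted = (weighted_score(noun_match, {}),
              weighted_score(function_match, {}))
weighted = (weighted_score(noun_match, {"NN": 3}),
            weighted_score(function_match, {"NN": 3}))
print(unweighted)  # the function-word match scores higher
print(weighted)    # with NN weighted 3 times, the noun match scores higher
```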

Note that we can bring rank 73 to rank 15 by increasing the NN weight to 5. But this is sort of an extreme measure and we will not follow it here.

Instead we wonder: how will the weighting affect the other queries? Let's re-run the NDCG computation and compare it against unweighted WSB.

In [None]:
index_builder_unweighted = make_index_builder()
index_builder_unweighted

In [None]:
nbutils.plot_ndcgs(gold_data, {
    "wsb_unweighted": index_builder_unweighted.build_index(),
    "wsb_weighted": tag_weighted_index_builder.build_index()
})

We have thus shown that employing weighting considerably increases our accuracy.

# The Influence of Embeddings

In [None]:
index_builders = {}

for e in the_embeddings.values():
    index_builders[e.name] = make_index_builder(
        strategy="Tag-Weighted Alignment",
        strategy_options={
            "tag_weights": {
                'NN': 3
            },
            "similarity": {
                "embedding": e
            }
        })
    
accordion = widgets.Accordion(children=[x.displayable for x in index_builders.values()])
for i, k in enumerate(index_builders.keys()):
    accordion.set_title(i, k)
accordion

In [None]:
nbutils.plot_ndcgs(gold_data, dict((k, v.build_index()) for k, v in index_builders.items()))

The caveat here is that we are using compressed embeddings, i.e. we would need to verify these results with uncompressed embeddings. Still, the performance of compressed fasttext seems very solid.

In a few queries ("llo, ho, ho my lord", "frailty, thy name is woman", "hell itself should gape"), GloVe gives slightly better results than fastText, but this does not generalize to the overall performance.

For some queries ("I do bear a brain.", "O all you host of heaven!") the embedding does not matter at all.

A real competitor to fastText is the contextual embeddings from Sentence-BERT - however, these are much more expensive in terms of computation time, storage space and code complexity.

# Interactive Searches with Your Own Data

First specify the text documents you want to search through by an upload widget:

In [None]:
import ipywidgets as widgets

upload = widgets.FileUpload(
    accept='.txt',
    multiple=True
)

upload

From the contents of this upload widget, we now build a Vectorian session that we can search through. As always with a Vectorian session, we need to specify the embeddings we want to employ for searching. We also need an `nlp` instance for importing the text documents. Depending on the size and number of documents, this step can take some time.

In [None]:
from vectorian.importers import StringImporter
from vectorian.session import LabSession

import codecs


def files_to_session(upload):
    im = StringImporter(nlp)
    
    if not upload.value:
        raise RuntimeError("cannot run on empty upload")

    docs = []
    for k, data in upload.value.items():
        docs.append(im(
            codecs.decode(data["content"], encoding="utf-8"),
            title=k,
            unique_id=k))

    return LabSession(
        docs,
        embeddings=[
            the_embeddings["numberbatch"],
            the_embeddings["fasttext"]],
        normalizers="default")

upload_session = files_to_session(upload)

Now we present the full interactive search interface the Vectorian offers (we have hidden it so far and focussed on a subset). Note that in contrast to our experiments earlier, we do not search on the *document* level by default, but rather the *sentence* level - i.e. we split each document into sentences and then search on each sentence. You can change this in the "Partition" dropdown.

In [None]:
upload_session.interact(nlp)

# References

Pennington, Jeffrey, et al. “Glove: Global Vectors for Word Representation.” Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), Association for Computational Linguistics, 2014, pp. 1532–43. DOI.org (Crossref), doi:10.3115/v1/D14-1162.

Mikolov, Tomas, et al. “Advances in Pre-Training Distributed Word Representations.” ArXiv:1712.09405 [Cs], Dec. 2017. arXiv.org, http://arxiv.org/abs/1712.09405.

Speer, Robyn, et al. “ConceptNet 5.5: An Open Multilingual Graph of General Knowledge.” ArXiv:1612.03975 [Cs], Dec. 2018. arXiv.org, http://arxiv.org/abs/1612.03975.

Reimers, Nils, and Iryna Gurevych. “Sentence-BERT: Sentence Embeddings Using Siamese BERT-Networks.” ArXiv:1908.10084 [Cs], Aug. 2019. arXiv.org, http://arxiv.org/abs/1908.10084.

Liebl, Bernhard, and Manuel Burghardt. “‘Shakespeare in the Vectorian Age’ – An Evaluation of Different Word Embeddings and NLP Parameters for the Detection of Shakespeare Quotes.” Proceedings of the 4th Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature, 2020, pp. 56–58.

Kruskal, Joseph B. “An Overview of Sequence Comparison: Time Warps, String Edits, and Macromolecules.” SIAM Review, vol. 25, no. 2, Apr. 1983, pp. 201–37. DOI.org (Crossref), doi:10.1137/1025045.

Kusner, Matt J., et al. “From Word Embeddings to Document Distances.” Proceedings of the 32nd International Conference on Machine Learning - Volume 37, JMLR.org, 2015, pp. 957–66.

Waterman, M. S., et al. “Some Biological Sequence Metrics.” Advances in Mathematics, vol. 20, no. 3, June 1976, pp. 367–87. DOI.org (Crossref), doi:10.1016/0001-8708(76)90202-4.

Faruqui, Manaal, et al. “Problems With Evaluation of Word Embeddings Using Word Similarity Tasks.” ArXiv:1605.02276 [Cs], May 2016. arXiv.org, http://arxiv.org/abs/1605.02276.

Wang, Yuxuan, et al. “From Static to Dynamic Word Representations: A Survey.” International Journal of Machine Learning and Cybernetics, vol. 11, no. 7, July 2020, pp. 1611–30. DOI.org (Crossref), doi:10.1007/s13042-020-01069-8.

Nagoudi, El Moatez Billah, and Didier Schwab. “Semantic Similarity of Arabic Sentences with Word Embeddings.” Proceedings of the Third Arabic Natural Language Processing Workshop, Association for Computational Linguistics, 2017, pp. 18–24. DOI.org (Crossref), doi:10.18653/v1/W17-1303.