In [6]:
from pathlib import Path

import os
import sys

os.environ['VECTORIAN_CPP_IMPORT'] = "1"
vectorian_path = Path("/Users/arbeit/Projects/vectorian-2021")
sys.path.append(str(vectorian_path))

sys.path.append("code")
import nbutils

import importlib
importlib.reload(nbutils)

Matching checksum for /Users/arbeit/Projects/vectorian-2021/vectorian/core/cpp/core.cpp --> not compiling


<module 'nbutils' from 'code/nbutils.py'>

# Overview

Let's first get a high-level overview of what we are aiming to do technically. In this notebook we will experiment with four classes of embeddings (see the diagram below for a classification):

* Static token embeddings: these operate on the token level and assign a token the same vector regardless of its context. We experiment with GloVe (Pennington et al., 2014), fastText (Mikolov et al., 2017) and Numberbatch (Speer et al., 2018). We use these three to compute token similarity and combine this with alignment algorithms (such as Waterman-Smith-Beyer) to compute document similarity. We also investigate the effect of stacking two static embeddings (fastText and Numberbatch).
* Contextual token embeddings: these also operate on the token level, but the embedding changes according to a specific token instance's context. In this notebook we experiment with token embeddings from a Sentence-BERT model.
* Document embeddings derived from specially trained models. Document embeddings represent one document via one single embedding. We use document embeddings obtained from a BERT model. More specifically, we use a Sentence-BERT model trained for the semantic textual similarity (STS) task (Reimers and Gurevych, 2019).
* Document embeddings derived from token embeddings. We also experiment with averaging different kinds of token embeddings (static and contextual) to derive document embeddings.

![Different kinds of embeddings](miscellaneous/diagram_embeddings.svg)

For reasons of limited RAM and download times, we use small or compressed versions of the static embeddings we work with. For GloVe, we use the official 50-dimensional version of the 6B variant. For fastText, we use a version that was compressed using the standard settings in https://github.com/avidale/compress-fasttext. For Numberbatch, we use a 50-dimensional version that was reduced using a standard PCA.

# Technical Setup

In [46]:
from bokeh.io import output_notebook
output_notebook()

# Choosing Static Word Embeddings

First we need:

* a set of documents to search over (i.e. our corpus)
* a set of word embeddings to employ for these searches

For the latter, we turn to Vectorian's embedding zoo, which offers a number of pretrained word embeddings.

In [8]:
from vectorian.embeddings import Zoo

#Zoo.list()

Let's load the static embeddings as described above from Vectorian's model zoo.

In [9]:
from vectorian.embeddings import StackedEmbedding

emb_glove = Zoo.load('glove-6B-50')
emb_numberbatch = Zoo.load('numberbatch-19.08-en-50')
emb_fasttext = Zoo.load('fasttext-en-mini')
emb_fasttext_numberbatch = StackedEmbedding([emb_fasttext, emb_numberbatch])
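
To make the idea of stacking more concrete, here is a minimal sketch that assumes stacking means concatenating the unit-normalized vectors of each embedding. Whether Vectorian's `StackedEmbedding` normalizes or weights its parts in this way is an assumption; the helper names `normalize`, `stack` and `cosine` as well as the toy vectors are purely illustrative.

```python
import math

def normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)

def stack(*vectors):
    """Concatenate unit-normalized vectors into one longer vector (assumed behavior)."""
    stacked = []
    for vec in vectors:
        stacked.extend(normalize(vec))
    return stacked

def cosine(u, v):
    """Cosine similarity of two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u > 0 and norm_v > 0 else 0.0

# Toy vectors standing in for two different embeddings of the same two words.
hot_a, cold_a = [1.0, 0.0], [0.0, 1.0]            # embedding A: orthogonal
hot_b, cold_b = [1.0, 2.0, 2.0], [2.0, 4.0, 4.0]  # embedding B: same direction

hot = stack(hot_a, hot_b)
cold = stack(cold_a, cold_b)
stacked_similarity = cosine(hot, cold)
```

Under this assumption, the cosine similarity of two stacked vectors is the average of the cosine similarities of the individual embeddings, so each embedding gets an equal vote.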

We also instantiate an NLP parser based on Sentence-BERT and a shim to use this model's token embeddings in the Vectorian.

In [10]:
import spacy_sentence_bert
nlp = spacy_sentence_bert.load_model('en_paraphrase_distilroberta_base_v1')

from vectorian.embeddings import SentenceBertEmbedding
emb_sbert = SentenceBertEmbedding(nlp)

# Loading Documents

First, we load our gold standard, which contains our queries.

In [11]:
import json

with open("data/raw_data/gold.json", "r") as f:
    gold = nbutils.Gold(json.loads(f.read()))

In [12]:
gold.phrases[:5]

['to be or not to be',
 'sea of troubles',
 'pampered jades of Asia',
 'The rest is silence.',
 'an old man is twice a child']

In [13]:
gold.matches('to be or not to be')[:1]

[{'id': 'ww_594e076e93ed4ccf',
  'context': 'Perchance I have not told you all that I think; for not to be when you have been, I think is the greatest misery that may be.',
  'quote': 'not to be when you have been, ',
  'work': 'Those Five Questions (Tusculanae) (1561)',
  'author': 'Marcus Tullius Cicero',
  'lexia': 'to be or not to be',
  'formal_class': 'Snowclone',
  'complexity': 3}]

We are now ready to build a Vectorian session that contains our documents and embeddings. We use preprocessed corpus data. For details on how this was achieved, see `code/prepare_corpus.ipynb`.

In [14]:
from vectorian.session import LabSession
from vectorian.corpus import Corpus

session = LabSession(
    Corpus.load("data/processed_data/corpus"),
    embeddings=[
        emb_sbert,
        emb_glove,
        emb_numberbatch,
        emb_fasttext,
        emb_fasttext_numberbatch],
    normalizers="default")

Opening glove-6B-50: 100%|██████████
Opening numberbatch-19.08-en-50: 100%|██████████
1587it [00:00, 23735.76it/s]
1587it [00:00, 20531.71it/s]


Let's take a look at the documents we imported and that now live inside `session`.

In [15]:
from ipywidgets import interact
from IPython.display import display
import ipywidgets as widgets

@interact(
    doc_index=widgets.IntSlider(min=1, max=len(session.documents)))
def browse_docs(doc_index):
    doc = session.documents[doc_index - 1]
    display(widgets.HTML(nbutils.DocFormatter(gold)(doc)))

interactive(children=(IntSlider(value=1, description='doc_index', min=1), Output()), _dom_classes=('widget-int…

# Comparing Sentence Embeddings

As a first step, let's look at representing each document with one embedding in order to understand how different embedding strategies relate to the nearness of documents. We will later turn to individual token embeddings.

We first prepare additional sentence embeddings using SBERT that we will show in our first big visualization.

In [11]:
sbert_encoder = None

In [50]:
from vectorian.embeddings import CachedPartitionEncoder, SpanEncoder

sbert_encoder = CachedPartitionEncoder(SpanEncoder(
    lambda texts: [nlp(t).vector for t in texts]))

sbert_encoder.try_load("data/processed_data/doc_embeddings")
sbert_encoder.cache(session.documents, session.partition("document"))
sbert_encoder.save("data/processed_data/doc_embeddings")

sbert_encoder_name = nlp.meta["name"]

We now show SBERT and a number of sentence embeddings we derive from word embeddings by simply averaging over the vectors (according to Mikolov et al., 2013).
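
The averaging step can be sketched as follows. This is a minimal illustration with made-up 3-dimensional vectors, not the code Vectorian or `nbutils` actually runs; the function name `mean_pool` and the toy lookup table are assumptions.

```python
def mean_pool(token_vectors):
    """Average a list of equally sized token vectors into one document vector."""
    if not token_vectors:
        raise ValueError("cannot average an empty list of token vectors")
    dim = len(token_vectors[0])
    count = len(token_vectors)
    return [sum(vec[i] for vec in token_vectors) / count for i in range(dim)]

# Toy lookup table standing in for a static word embedding.
toy_embedding = {
    "sea": [1.0, 0.0, 2.0],
    "of": [0.0, 0.0, 0.0],
    "troubles": [2.0, 3.0, 1.0],
}

tokens = "sea of troubles".split()
doc_vector = mean_pool([toy_embedding[t] for t in tokens])
```

Note that averaging discards word order: any permutation of the same tokens yields the same document vector.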

In the t-SNE visualization below, dots are documents and the colors indicate the query that yields that document in our gold standard. By hovering over dots with the mouse you get details on the document and query the dot represents. Nearby dots of the same color indicate that the embedding tends to cluster documents similarly to our gold standard.

You can also add an intruder text by entering it into the text field and pressing RETURN (to refresh the plot). This will move the larger crossed circle to where the currently selected embedding places the given text relative to the other documents.

In some cases, we can clearly make out clusters visually. For example, in the fastText embedding the blue "to be or not to be" documents are clustered nicely. SBERT shows a green cluster of "an old man is twice a child". Numberbatch reveals a brown cluster of "llo, ho, ho, my lord".

With spaCy transformers, we see some complex but noisy clustering. One example query: "born naturally".

Finally, you can switch between different embeddings using the radio buttons. You can also enable "free_text" and enter custom queries that are not in our gold corpus.

In [37]:
from ipywidgets import interact
import ipywidgets as widgets

plotter = nbutils.EmbeddingPlotter(session, nlp, gold)
plotter.encoders[sbert_encoder_name] = sbert_encoder

def mk_text_widget(gold, free):
    if not free:
        return widgets.Dropdown(options=gold.phrases)
    else:
        return widgets.Text("horse", continuous_update=False)
    
@interact(free_text=widgets.Checkbox())
def plot_docs(free_text):
    @interact(
        query=mk_text_widget(gold, free_text),
        embedding=widgets.RadioButtons(
            options=sorted(plotter.encoders.keys()),
            layout={'width': 'max-content'}),
        show_legend=widgets.Checkbox())
    def inner(query, embedding, show_legend):
        plotter.plot(embedding, query, show_legend)
    return inner

interactive(children=(Checkbox(value=False, description='free_text'), Output()), _dom_classes=('widget-interac…

Now let's run an actual search using Vectorian.

In [40]:
from vectorian.metrics import PartitionEmbeddingSimilarity, CosineSimilarity

@interact(free_text=widgets.Checkbox())
def plot_docs(free_text):
    @interact(
        query=mk_text_widget(gold, free_text),
        embedding=widgets.RadioButtons(
            options=sorted(plotter.encoders.keys()), layout={'width': 'max-content'}))
    def inner(query, embedding):
        sent_sim = PartitionEmbeddingSimilarity(
            plotter.encoders[embedding],
            CosineSimilarity())
        index = session.partition("document").index(sent_sim, nlp)
        return index.find(query, n=3)

interactive(children=(Checkbox(value=False, description='free_text'), Output()), _dom_classes=('widget-interac…
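
Conceptually, a search over document embeddings with cosine similarity amounts to scoring every document against the query and keeping the best `n`. The following sketch illustrates this with toy 2-dimensional vectors; the function `top_n` and the document ids are hypothetical and do not reflect how Vectorian's index is implemented internally.

```python
import math

def cosine(u, v):
    """Cosine similarity of two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u > 0 and norm_v > 0 else 0.0

def top_n(query_vec, doc_vecs, n=3):
    """Return the n (doc_id, score) pairs with the highest cosine similarity."""
    scored = [(doc_id, cosine(query_vec, vec)) for doc_id, vec in doc_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:n]

# Toy document embeddings keyed by hypothetical document ids.
docs = {
    "doc_a": [1.0, 0.0],
    "doc_b": [1.0, 1.0],
    "doc_c": [0.0, 1.0],
    "doc_d": [-1.0, 0.0],
}

results = top_n([2.0, 0.0], docs, n=3)
```

Because cosine similarity ignores vector length, the query `[2.0, 0.0]` scores exactly like `[1.0, 0.0]` would.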

# Exploring Word Embeddings

We now turn to single word embeddings.

In [42]:
session.word_vec(emb_glove, "hot")

memmap([-7.6663e-01,  6.9023e-01,  7.5462e-02,  1.1688e-01, -7.9722e-01,
        -1.9606e-01, -7.7409e-01,  1.7351e-01,  2.6248e-01,  5.5295e-01,
        -2.9190e-01, -2.4505e-01,  5.9885e-01,  1.2445e+00,  2.6401e-01,
         2.0211e-01,  4.2139e-02,  5.1844e-01, -8.1704e-01, -1.0801e+00,
         2.2864e-01,  9.1212e-02,  1.5638e+00,  7.5056e-01, -6.1206e-02,
        -6.9001e-01, -5.3558e-01,  1.1311e+00,  1.3871e+00,  3.6151e-01,
         2.8475e+00,  1.0733e-01, -1.7073e-02,  4.5358e-01, -7.1374e-03,
         1.1177e-01, -1.5955e-01,  3.0205e-01,  5.4222e-01, -5.4103e-01,
         2.3276e-01,  2.1756e-01, -4.1444e-02,  1.7056e-03,  7.6265e-01,
         6.6241e-01, -4.5484e-02, -8.1479e-01,  4.6763e-02,  3.1134e-01],
       dtype=float32)

In [43]:
from vectorian.metrics import TokenSimilarity, CosineSimilarity

token_sim = TokenSimilarity(
    emb_numberbatch,
    CosineSimilarity()
)

session.similarity(token_sim, "hot", "cold")

0.70502234

In [44]:
token_sim = TokenSimilarity(
    emb_glove,
    CosineSimilarity())

session.similarity(token_sim, "hot", "cold")

0.8010528

The following interactive board allows you to search for a custom token inside a document. You can choose different documents by changing `doc_index`. The plot gives you the similarity of the entered token with the tokens in the chosen document under the selected embedding.

Note that out-of-vocabulary words like "fasterer" will produce zero similarities under standard key-value embeddings, whereas fastText is still able to produce a vector thanks to subword information.
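
The subword mechanism can be illustrated with a short sketch. fastText represents a word through its character n-grams (by default of length 3 to 6, with `<` and `>` marking the word boundaries), so an unseen word still shares many n-grams with known words. The function `char_ngrams` below is an illustrative helper, not part of fastText or Vectorian, and it leaves out the hashing and vector summation that fastText performs on these n-grams.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """List the character n-grams of a word, including boundary markers."""
    marked = f"<{word}>"
    grams = []
    for n in range(min_n, max_n + 1):
        for start in range(len(marked) - n + 1):
            grams.append(marked[start:start + n])
    return grams

# n-grams of a word we assume to be in the vocabulary ...
known = set(char_ngrams("faster"))
# ... and of an out-of-vocabulary word.
unseen = char_ngrams("fasterer")

# The overlap is what allows a subword model to still produce a vector.
shared = [gram for gram in unseen if gram in known]
```

A plain key-value embedding has no such fallback: if "fasterer" is not a key, there is no vector to return.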

In [23]:
from ipywidgets import interact
import ipywidgets as widgets

@interact(
    token=widgets.Text(value='high'),
    doc_index=widgets.IntSlider(min=1, max=len(session.documents)),
    embedding=widgets.RadioButtons(
        options=sorted(session.embeddings.keys()), layout={'width': 'max-content'}))
def show_tokens(token, doc_index, embedding):
    token_sim = TokenSimilarity(
        session.embeddings[embedding].factory,
        CosineSimilarity())

    nbutils.plot_token_similarity(
        session, session.documents[doc_index - 1], token_sim, token)

interactive(children=(Text(value='high', description='token'), IntSlider(value=1, description='doc_index', min…

# A Search Query using Alignment over Similar Tokens

In [17]:
from vectorian.metrics import TokenSimilarity, CosineSimilarity
from vectorian.metrics import AlignmentSimilarity
from vectorian.alignment import WatermanSmithBeyer, ExponentialGapCost

token_sim = TokenSimilarity(
    emb_glove,  # the GloVe embedding we loaded earlier
    CosineSimilarity()  # a standard cosine similarity
)

sent_sim = AlignmentSimilarity(
    token_sim=token_sim,
    alignment=WatermanSmithBeyer(gap=ExponentialGapCost(5), zero=0.25))

index = session.partition("document").index(sent_sim, nlp)

In [20]:
gold.phrases[0]

'to be or not to be'

In [21]:
index.find(gold.phrases[0], n=1)
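
To give an intuition for what such an alignment computes, here is a textbook-style sketch of local alignment with a general gap cost function, in the spirit of Waterman-Smith-Beyer. It is not Vectorian's implementation. In particular, the way `zero` is used here (as an offset subtracted from each token similarity) and the shape of the gap cost `exp_gap` are assumptions made for illustration, and exact string matching stands in for an embedding-based token similarity.

```python
def wsb_score(query, doc, sim, gap_cost, zero=0.25):
    """Best local alignment score between two token lists.

    sim(a, b) is a token similarity in [0, 1]; gap_cost(k) is the
    penalty for skipping k consecutive tokens on either side.
    """
    n, m = len(query), len(doc)
    H = [[0.0] * (m + 1) for _ in range(n + 1)]
    best = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Align query[i-1] with doc[j-1]; weak similarities become negative.
            match = H[i - 1][j - 1] + sim(query[i - 1], doc[j - 1]) - zero
            # Skip k query tokens or k document tokens in one gap.
            skip_query = max(H[i - k][j] - gap_cost(k) for k in range(1, i + 1))
            skip_doc = max(H[i][j - k] - gap_cost(k) for k in range(1, j + 1))
            # The 0.0 lets a local alignment restart anywhere.
            H[i][j] = max(0.0, match, skip_query, skip_doc)
            best = max(best, H[i][j])
    return best

def exact_match(a, b):
    """Stand-in token similarity: 1.0 for identical tokens, else 0.0."""
    return 1.0 if a == b else 0.0

def exp_gap(k, cutoff=5):
    """Illustrative gap cost that grows with gap length and saturates at 1."""
    return 1.0 - 2.0 ** (-k / cutoff)

query = "to be or not to be".split()
doc_hit = "for not to be when you have been".split()
doc_miss = "the rest is silence".split()

hit = wsb_score(query, doc_hit, exact_match, exp_gap)
miss = wsb_score(query, doc_miss, exact_match, exp_gap)
```

In this sketch, the document that shares the contiguous run "not to be" with the query scores higher than the document without any shared tokens, which scores zero.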

# Plotting the NDCG over the Corpus

In [61]:

import importlib
importlib.reload(nbutils)

import json

with open("data/raw_data/gold.json", "r") as f:
    gold = nbutils.Gold(json.loads(f.read()))

In [62]:
nbutils.interact_plot(session, nlp, nbutils.NDCGPlotter(gold), partition_encoders={
    sbert_encoder_name: sbert_encoder
})

VBox(children=(HBox(children=(ToggleButtons(options=('Result', 'Settings'), value='Result'), Button(button_sty…
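
For readers unfamiliar with the metric, the following sketch shows one common way to compute the normalized discounted cumulative gain (NDCG) of a ranked result list. How exactly `nbutils.NDCGPlotter` derives relevance values from the gold standard is not shown in this notebook, so the binary relevance values and the function names below are assumptions.

```python
import math

def dcg(relevances):
    """Discounted cumulative gain: later ranks contribute less."""
    return sum(
        rel / math.log2(rank + 1)
        for rank, rel in enumerate(relevances, start=1))

def ndcg(ranked_relevances, all_relevances=None):
    """DCG of a ranking divided by the DCG of the best possible ranking.

    all_relevances lists the relevance of every judged document; if it
    is omitted, the ideal ranking is built from the retrieved ones only.
    """
    pool = ranked_relevances if all_relevances is None else all_relevances
    ideal = sorted(pool, reverse=True)[:len(ranked_relevances)]
    ideal_dcg = dcg(ideal)
    return dcg(ranked_relevances) / ideal_dcg if ideal_dcg > 0 else 0.0

# 1 = document is a gold match for the query, 0 = it is not.
perfect = ndcg([1, 1, 0])
late_hit = ndcg([0, 1])
partial = ndcg([1, 0], all_relevances=[1, 1, 0])
```

A value of 1 means that all relevant documents are ranked before all irrelevant ones; relevant documents that appear late in the ranking, or not at all, lower the score.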

# Exploring WSB Parameters 1

In [90]:
# see above (e.g. explore "gap type" and "zero")

# Exploring WSB Parameters 2

In [89]:
# see above (e.g. explore "cosine" vs "euclidean")

# Focusing on One Query: Understanding Scores and Intrusive Results

In [63]:
nbutils.interact_plot(session, nlp, nbutils.ResultScoresPlotter(gold), partition_encoders={
    sbert_encoder_name: sbert_encoder
})

VBox(children=(IntSlider(value=1, description='query', max=20, min=1), HBox(children=(ToggleButtons(options=('…

# References

Pennington, Jeffrey, et al. “GloVe: Global Vectors for Word Representation.” Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), Association for Computational Linguistics, 2014, pp. 1532–43. doi:10.3115/v1/D14-1162.

Mikolov, Tomas, et al. “Advances in Pre-Training Distributed Word Representations.” ArXiv:1712.09405 [Cs], Dec. 2017. arXiv.org, http://arxiv.org/abs/1712.09405.

Speer, Robyn, et al. “ConceptNet 5.5: An Open Multilingual Graph of General Knowledge.” ArXiv:1612.03975 [Cs], Dec. 2018. arXiv.org, http://arxiv.org/abs/1612.03975.

Reimers, Nils, and Iryna Gurevych. “Sentence-BERT: Sentence Embeddings Using Siamese BERT-Networks.” ArXiv:1908.10084 [Cs], Aug. 2019. arXiv.org, http://arxiv.org/abs/1908.10084.