---
title: "Semantic Search with Sentence Transformers and FAISS"
date: "2023-01-12"
format:
  html:
    code-fold: show
    code-line-numbers: false
jupyter: python3
---

Semantic search means retrieving documents from a corpus with natural-language queries. Instead of carefully assembling a long list of keywords, we describe what we are looking for in plain language and get back the results whose meaning is closest to the query.

Below, we'll have a look at how we can leverage the language understanding capabilities of LLMs to quickly build a semantic search engine that allows us to query a dataset of speeches in the German parliament (the German Bundestag).

In [None]:
#| code-fold: true
#| output: false

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import torch
from datasets import Dataset, load_from_disk
from transformers import AutoTokenizer, AutoModel
from sentence_transformers import SentenceTransformer
from pathlib import Path

## Loading the data

Each plenary sitting of the Bundestag is documented by shorthand writers, and the stenographic records are made available online. The dataset used here is provided by the folks behind [Open Discourse](https://opendiscourse.de/), who went through the work of scraping and parsing the official minutes of plenary proceedings (not the easiest task, e.g., because interjections have to be attributed correctly). It is available via the [Harvard Dataverse](https://dataverse.harvard.edu/dataverse/opendiscourse).

In [4]:
#| code-fold: true
!wget -O speeches.feather -q https://dataverse.harvard.edu/api/access/datafile/4745913
!wget -O factions.feather -q https://dataverse.harvard.edu/api/access/datafile/4549632

In [6]:
#| code-fold: true
speeches = pd.read_feather("speeches.feather")
factions = pd.read_feather("factions.feather")

As it turns out, the dataset contains all speeches (or to be more precise: all documented utterances) from the beginning of the first electoral term (September 1949) until May 2021 (some months before the end of the 19th electoral term).

Here, we'll only work with data from the 19th electoral term (which began in October 2017 and ended four years later in October 2021). Let's do some quick data wrangling: we add data on party affiliation, delete some columns that are unnecessary for our purposes, and sort the data chronologically.

In [8]:
id2abb = dict(zip(factions.id, factions.abbreviation))

speeches = (
    speeches.query("electoralTerm == 19")
            .assign(faction=lambda df_: df_.factionId.map(id2abb),
                    date=lambda df_: pd.to_datetime(df_.date))
            .drop(columns=["electoralTerm", "politicianId", "factionId", "documentUrl", "positionLong"])
            .rename(columns={"positionShort": "position"})
            .sort_values(by="id")
            .reset_index(drop=True)
)

In [9]:
speeches.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 60958 entries, 0 to 60957
Data columns (total 8 columns):
 #   Column         Non-Null Count  Dtype         
---  ------         --------------  -----         
 0   id             60958 non-null  int64         
 1   session        60958 non-null  int64         
 2   firstName      60958 non-null  object        
 3   lastName       60958 non-null  object        
 4   speechContent  60958 non-null  object        
 5   position       60958 non-null  object        
 6   date           60958 non-null  datetime64[ns]
 7   faction        60958 non-null  object        
dtypes: datetime64[ns](1), int64(2), object(5)
memory usage: 3.7+ MB


In total, there have been 60,958 utterances (not accounting for some erroneous entries like the second row below, whose `speechContent` is empty).

In [10]:
speeches

| | id | session | firstName | lastName | speechContent | position | date | faction |
|---|---|---|---|---|---|---|---|---|
| 0 | 1000000 | 1 | Alterspräsident Dr. Hermann | Otto Solms | \n\nGuten Morgen, liebe Kolleginnen und Kolleg... | Presidium of Parliament | 2017-10-24 | not found |
| 1 | 1000001 | 1 | Alterspräsident Dr. Hermann | Otto Solms | | Presidium of Parliament | 2017-10-24 | not found |
| 2 | 1000002 | 1 | Carsten | Schneider | \n\nSehr geehrter Herr Präsident! Sehr geehrte... | Member of Parliament | 2017-10-24 | SPD |
| 3 | 1000003 | 1 | Dr. Hermann Otto | Solms | \n\nDas Wort hat jetzt der Kollege Dr. Bernd B... | Presidium of Parliament | 2017-10-24 | not found |
| 4 | 1000004 | 1 | Bernd | Baumann | \n\nHerr Präsident! Meine Damen und Herren! Im... | Member of Parliament | 2017-10-24 | AfD |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 60953 | 1060953 | 228 | Helge | Lindh | \n\nWie bitte? | Member of Parliament | 2021-05-07 | SPD |
| 60954 | 1060954 | 228 | Petra | Pau | \n\nSie müssen das jetzt verkürzen und zum Pun... | Presidium of Parliament | 2021-05-07 | not found |
| 60955 | 1060955 | 228 | Helge | Lindh | \n\nGut, ich beschränke mich auf einen Satz:\n... | Member of Parliament | 2021-05-07 | SPD |
| 60956 | 1060956 | 228 | Petra | Pau | \n\nDas Wort hat der Kollege Tankred Schipansk... | Presidium of Parliament | 2021-05-07 | not found |


Let's have a look at the remarks of Alterspräsident (president by right of age) Hermann Otto Solms, who presided over the Bundestag at the beginning of the first session of the new electoral term.

In [11]:
speeches.iloc[0].speechContent

'\n\nGuten Morgen, liebe Kolleginnen und Kollegen! Nehmen Sie bitte Platz.\n\nMeine sehr verehrten Damen und Herren! Liebe Kolleginnen und Kollegen! Ich begrüße Sie zur konstituierenden Sitzung des 19.\xa0Deutschen Bundestages. Es entspricht der ständigen Übung, zu Beginn der konstituierenden Sitzung nach den Regelungen der bisherigen Geschäftsordnung des Deutschen Bundestages zu verfahren.\n\n§\xa01 Absatz\xa02 der Geschäftsordnung des Deutschen Bundestages sieht vor, dass das am längsten dem Bundestag angehörende Mitglied, das hierzu bereit ist, den Vorsitz übernimmt, bis der Deutsche Bundestag einen Präsidenten gewählt hat.\n\nDie Fraktion der AfD widerspricht diesem Verfahren und hat auf Drucksache\xa019/2 beantragt, einen Versammlungsleiter zu wählen, der die konstituierende Sitzung eröffnen soll. Über diesen Antrag lasse ich sofort abstimmen. Wer dem Antrag der Fraktion der AfD zustimmt, den bitte ich um sein Handzeichen.\xa0– Gegenstimmen?\xa0–\n\n({0})\n\nEnthaltungen?\xa0– Der

We can immediately see some issues (like the `({0})` placeholders, the `\n\n` line breaks, or the `\xa0` non-breaking spaces), but the tokenizer will handle some of them and we should be able to get some sound embeddings anyway.
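If one preferred to clean the text explicitly instead of relying on the tokenizer, a minimal sketch could look like the following. The helper name `clean_speech` and the choice to drop the `({0})`-style placeholders entirely are assumptions for illustration; this step is not part of the pipeline used in this post.

```python
import re

# Matches placeholders such as ({0}) or ({12}) that appear in the speech text.
PLACEHOLDER = re.compile(r"\(\{\d+\}\)")


def clean_speech(text: str) -> str:
    """Drop placeholders, replace non-breaking spaces, collapse whitespace."""
    text = PLACEHOLDER.sub(" ", text)
    text = text.replace("\xa0", " ")
    # split()/join collapses runs of spaces and line breaks into single spaces
    return " ".join(text.split())


sample = "\n\nGegenstimmen?\xa0–\n\n({0})\n\nEnthaltungen?\xa0– Der"
print(clean_speech(sample))  # Gegenstimmen? – Enthaltungen? – Der
```

Whether removing the placeholders actually improves the embeddings would have to be checked empirically.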

Let's convert our pandas DataFrame to a HuggingFace Dataset:

In [12]:
dataset = Dataset.from_pandas(speeches)
dataset

Dataset({
    features: ['id', 'session', 'firstName', 'lastName', 'speechContent', 'position', 'date', 'faction'],
    num_rows: 60958
})

## Computing embeddings

Getting embeddings should now be very easy. To see what's going on under the hood, we'll first have a look at plain HuggingFace transformers before we make use of the dedicated [Sentence-Transformers](https://www.sbert.net/) library. Note that we'll use the `T-Systems-onsite/cross-en-de-roberta-sentence-transformer` model checkpoint (see [here](https://huggingface.co/T-Systems-onsite/cross-en-de-roberta-sentence-transformer) for the model card) that was fine-tuned on English and German text and is intended for computing text embeddings.

In [13]:
model_ckpt = "T-Systems-onsite/cross-en-de-roberta-sentence-transformer"

In [14]:
#| output: false
tokenizer = AutoTokenizer.from_pretrained(model_ckpt)
model = AutoModel.from_pretrained(model_ckpt)

In [15]:
encoded = tokenizer(speeches.iloc[0].speechContent, padding=True, truncation=True)
tokens = tokenizer.convert_ids_to_tokens(encoded.input_ids)
print(tokens[:100])
print(len(tokens))

['<s>', '▁Gut', 'en', '▁Morgen', ',', '▁liebe', '▁Kolle', 'ginn', 'en', '▁und', '▁Kollegen', '!', '▁Nehmen', '▁Sie', '▁bitte', '▁Platz', '.', '▁Meine', '▁sehr', '▁ver', 'ehrt', 'en', '▁Damen', '▁und', '▁Herren', '!', '▁Liebe', '▁Kolle', 'ginn', 'en', '▁und', '▁Kollegen', '!', '▁Ich', '▁be', 'gr', 'üß', 'e', '▁Sie', '▁zur', '▁konstitu', 'ierenden', '▁Sitzung', '▁des', '▁19.', '▁Deutschen', '▁Bundestag', 'es', '.', '▁Es', '▁entspricht', '▁der', '▁ständig', 'en', '▁Übung', ',', '▁zu', '▁Beginn', '▁der', '▁konstitu', 'ierenden', '▁Sitzung', '▁nach', '▁den', '▁Regelung', 'en', '▁der', '▁bisherige', 'n', '▁Geschäfts', 'ordnung', '▁des', '▁Deutschen', '▁Bundestag', 'es', '▁zu', '▁', 'verfahren', '.', '▁§', '▁1', '▁Absatz', '▁2', '▁der', '▁Geschäfts', 'ordnung', '▁des', '▁Deutschen', '▁Bundestag', 'es', '▁sieht', '▁vor', ',', '▁dass', '▁das', '▁am', '▁längst', 'en', '▁dem', '▁Bundestag']
436


In [16]:
tokenizer.model_max_length

512

The tokenizer splits our example into 436 tokens, which is less than the model's maximum context size of 512. Longer inputs are simply truncated (`truncation=True`); a more sophisticated approach would split them into overlapping windows, but let's keep it simple here.
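For reference, the window-based alternative could be sketched roughly as follows. The function name `sliding_windows` and the stride of 256 are hypothetical choices; a real implementation would also have to reserve room for the special tokens `<s>` and `</s>` in every window and decide how to combine the per-window embeddings (averaging them is one option).

```python
def sliding_windows(token_ids, max_len=512, stride=256):
    """Split a sequence of token ids into overlapping windows of at most max_len."""
    if max_len <= 0 or not 0 < stride <= max_len:
        raise ValueError("need max_len > 0 and 0 < stride <= max_len")
    windows = []
    start = 0
    while True:
        windows.append(token_ids[start:start + max_len])
        # Stop once the current window reaches the end of the sequence.
        if start + max_len >= len(token_ids):
            break
        start += stride
    return windows


ids = list(range(1200))  # stand-in for the token ids of a long speech
chunks = sliding_windows(ids)
print([(c[0], c[-1], len(c)) for c in chunks])
# [(0, 511, 512), (256, 767, 512), (512, 1023, 512), (768, 1199, 432)]
```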

In [17]:
decoded = tokenizer.decode(encoded.input_ids)
decoded[:1000]

'<s> Guten Morgen, liebe Kolleginnen und Kollegen! Nehmen Sie bitte Platz. Meine sehr verehrten Damen und Herren! Liebe Kolleginnen und Kollegen! Ich begrüße Sie zur konstituierenden Sitzung des 19. Deutschen Bundestages. Es entspricht der ständigen Übung, zu Beginn der konstituierenden Sitzung nach den Regelungen der bisherigen Geschäftsordnung des Deutschen Bundestages zu verfahren. § 1 Absatz 2 der Geschäftsordnung des Deutschen Bundestages sieht vor, dass das am längsten dem Bundestag angehörende Mitglied, das hierzu bereit ist, den Vorsitz übernimmt, bis der Deutsche Bundestag einen Präsidenten gewählt hat. Die Fraktion der AfD widerspricht diesem Verfahren und hat auf Drucksache 19/2 beantragt, einen Versammlungsleiter zu wählen, der die konstituierende Sitzung eröffnen soll. Über diesen Antrag lasse ich sofort abstimmen. Wer dem Antrag der Fraktion der AfD zustimmt, den bitte ich um sein Handzeichen. – Gegenstimmen? – ({0}) Enthaltungen? – Der Antrag ist damit mit den Stimmen al

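As we can see, the tokenizer's preprocessing takes care of some of the issues we've seen earlier: line breaks and non-breaking spaces are normalized to plain spaces, although the `({0})` placeholders remain.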

We can now apply the model and have a look at its outputs:

In [18]:
encoded = tokenizer(speeches.iloc[0].speechContent, 
                    padding=True, truncation=True, return_tensors="pt")

with torch.no_grad():
    model_output = model(**encoded)

In [19]:
model_output["last_hidden_state"].shape, len(tokens)

(torch.Size([1, 436, 768]), 436)

As expected, we get an embedding vector of size 768 for every token in our input (including the special tokens `<s>` and `</s>` at the beginning and at the end). To get a single embedding for the whole input, we apply mean pooling to the model's output, i.e., we average the token embeddings, using the attention mask so that padding tokens are ignored. Thus, we have to define a corresponding function:

In [20]:
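def mean_pooling(model_output, attention_mask):
    # First element of the model output: token embeddings of shape (batch, tokens, hidden)
    token_embeddings = model_output[0]
    # Broadcast the attention mask over the hidden dimension so padding tokens contribute zero
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    # Sum the unmasked embeddings and divide by the number of real tokens (clamped to avoid division by zero)
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)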

With this function, we can finally compute the embedding:

In [21]:
speech_embedding = mean_pooling(model_output, encoded["attention_mask"])
speech_embedding.shape

torch.Size([1, 768])
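
To make the role of the attention mask concrete, here is a dependency-free toy version of the same computation on plain Python lists. The name `masked_mean` and the numbers are made up for illustration; it is a sketch of the idea, not the function used above.

```python
def masked_mean(token_embeddings, attention_mask):
    """Average token vectors, skipping positions where the mask is 0."""
    dim = len(token_embeddings[0])
    total = [0.0] * dim
    count = 0
    for vector, keep in zip(token_embeddings, attention_mask):
        if keep:
            count += 1
            for i, value in enumerate(vector):
                total[i] += value
    count = max(count, 1e-9)  # same guard against division by zero
    return [t / count for t in total]


tokens = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]  # last row plays the padding token
mask = [1, 1, 0]
print(masked_mean(tokens, mask))  # [2.0, 3.0]
```

Without the mask, the padding row would pull the average far away from the two real tokens.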

So far, so good. We can achieve the very same, though, using the Sentence-Transformers library which is dedicated to computing embeddings.

To load our model, we now use the `SentenceTransformer` class. To obtain the embedding, we simply call `model.encode()`.

In [22]:
#| output: false

model = SentenceTransformer(model_ckpt)

In [23]:
speech_embedding_st = model.encode(speeches.iloc[0].speechContent)
speech_embedding_st.shape

(768,)

This way, we get the same embedding (up to numerical precision) as we did with the HuggingFace Transformers library.
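
One way to check such a claim is an element-wise comparison with a small tolerance. The helper `embeddings_match`, the tolerance of `1e-5`, and the toy numbers below are assumptions for illustration; with the real objects one could pass `speech_embedding[0].tolist()` and `speech_embedding_st.tolist()` instead.

```python
import math


def embeddings_match(a, b, abs_tol=1e-5):
    """True if two equally long vectors agree element-wise within abs_tol."""
    return len(a) == len(b) and all(
        math.isclose(x, y, rel_tol=0.0, abs_tol=abs_tol) for x, y in zip(a, b)
    )


# Toy stand-ins for the manually pooled and the library-computed embedding.
manual = [0.0645722, 0.1432550, -0.1062605]
library = [0.0645723, 0.1432551, -0.1062606]
print(embeddings_match(manual, library))  # True
```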

Now we can compute embeddings for all samples in our dataset. What would have taken many hours on a CPU gets [processed in batches](https://huggingface.co/docs/datasets/process#batch-processing) in a few minutes on a GPU; just set `batched=True` when calling [`map()`](https://huggingface.co/docs/datasets/v2.10.0/en/package_reference/main_classes#datasets.Dataset.map). Note that we could also have set the `batch_size` parameter (we use the default batch size of 1000).
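
There are two levels of batching here: `map()` hands 1000 rows at a time to our lambda, and `model.encode()` splits each of those into smaller batches for the model. The arithmetic below sketches how many batches that yields, assuming that `encode()` uses a batch size of 32 when none is passed (that value is an assumption here, not something set in this post).

```python
import math

n_rows = 60958          # number of utterances in the dataset
map_batch_size = 1000   # default batch_size of Dataset.map(batched=True)
encode_batch_size = 32  # assumed default batch_size of SentenceTransformer.encode()

outer_batches = math.ceil(n_rows / map_batch_size)
inner_per_full_batch = math.ceil(map_batch_size / encode_batch_size)
last_rows = n_rows - (outer_batches - 1) * map_batch_size
inner_in_last_batch = math.ceil(last_rows / encode_batch_size)

print(outer_batches, inner_per_full_batch, last_rows, inner_in_last_batch)
# 61 32 958 30
```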

In [24]:
#| output: false

embeddings_dataset = dataset.map(
    lambda batch: {"embeddings": model.encode(batch["speechContent"])}, batched=True
)

Now that we added the embeddings to our dataset, we can make use of the [FAISS library](https://github.com/facebookresearch/faiss) (FAISS is short for Facebook AI Similarity Search) which provides efficient algorithms for searching and clustering embedding vectors. Computing a FAISS index for a HuggingFace Dataset is easy:

In [25]:
embeddings_dataset.add_faiss_index(column="embeddings")

Dataset({
    features: ['id', 'session', 'firstName', 'lastName', 'speechContent', 'position', 'date', 'faction', 'embeddings'],
    num_rows: 60958
})
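
Conceptually, a flat (exhaustive) index compares the query with every stored vector and keeps the closest ones. The pure-Python sketch below illustrates that idea with squared L2 distances. It assumes that the index created above is such a flat L2 index, and the names `squared_l2` and `nearest_examples` are made up for this example; FAISS itself relies on heavily optimized implementations.

```python
def squared_l2(a, b):
    """Squared Euclidean distance between two equally long vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest_examples(vectors, query, k=3):
    """Return (scores, indices) of the k vectors closest to the query."""
    scored = sorted((squared_l2(v, query), i) for i, v in enumerate(vectors))
    top = scored[:k]
    return [score for score, _ in top], [i for _, i in top]


corpus = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [3.0, 3.0]]
toy_scores, toy_indices = nearest_examples(corpus, [0.9, 0.1], k=2)
print(toy_indices)  # [1, 0]
```

The smallest distance comes first, so the best match is the first element of the result.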

## Performing semantic search

### Querying the data

Having the FAISS index now allows us to perform queries with the `Dataset.get_nearest_examples()` function. Let's try this with a simple sentence that we embed and then compare against the whole corpus to find the speeches with the most similar embeddings.

In [26]:
q = "Der Kampf gegen den Klimawandel muss oberste Priorität haben!"
q_emb = model.encode(q)

In [27]:
scores, samples = embeddings_dataset.get_nearest_examples("embeddings", q_emb, k=10)

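`get_nearest_examples()` returns a tuple of scores and the corresponding samples (the `k` best matches). Since we built the index with the default settings, the scores should be L2 distances between the query and each example, so a smaller score means a closer match. We can put them into a pandas DataFrame for easier analysis (note that sorting in descending order puts the closest match in the last row):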

In [28]:
samples_df = pd.DataFrame.from_dict(samples)
samples_df["scores"] = scores
samples_df.sort_values("scores", ascending=False, inplace=True)

In [29]:
samples_df

| | id | session | firstName | lastName | speechContent | position | date | faction | embeddings | scores |
|---|---|---|---|---|---|---|---|---|---|---|
| 9 | 1060666 | 227 | Julia | Verlinden | \n\nFangen Sie endlich damit an, den Klimaschu... | Member of Parliament | 2021-05-06 | Grüne | [0.06457225978374481, 0.14325504004955292, 0.1... | 39.370491 |
| 8 | 1029268 | 118 | Anja | Weisgerber | \n\nOkay. – Wir schaffen erst Anreize, und in ... | Member of Parliament | 2019-10-17 | CDU/CSU | [-0.012752844020724297, 0.19681210815906525, 0... | 39.225136 |
| 7 | 1032505 | 129 | Marie-Luise | Dött | \n\nSehr geehrter Herr Präsident! Meine Damen ... | Member of Parliament | 2019-11-26 | CDU/CSU | [-0.06372411549091339, 0.17748013138771057, 0.... | 39.218792 |
| 6 | 1001346 | 7 | Lisa | Badum | \n\nNein. – Sie müssen sich erstens endlich an... | Member of Parliament | 2018-01-18 | Grüne | [-0.10626056790351868, 0.23017993569374084, -0... | 38.357986 |
| 5 | 1019612 | 80 | Carsten | Träger | \n\nSehr geehrte Frau Präsidentin! Kolleginnen... | Member of Parliament | 2019-02-14 | SPD | [-0.07200361043214798, 0.14576080441474915, 0.... | 37.939503 |
| 4 | 1032183 | 128 | Andreas | Jung | \n\nWir werden mit diesem Klimapaket unserer V... | Member of Parliament | 2019-11-15 | CDU/CSU | [-0.05022718012332916, 0.02807101048529148, 0.... | 37.559578 |
| 3 | 1008459 | 38 | Svenja | Schulze | \n\nHerr Abgeordneter, die Forschungsseite ist... | Minister | 2018-06-13 | not found | [0.03885478153824806, -0.16286222636699677, -0... | 37.429123 |
| 2 | 1039271 | 156 | Anja | Weisgerber | \n\nSehr geehrter Herr Kollege Hilse! Klar ist... | Member of Parliament | 2020-04-23 | CDU/CSU | [0.034136462956666946, 0.05767284706234932, -0... | 32.63641 |
| 1 | 1028351 | 115 | Andreas | Jung | \n\nHerr Präsident! Liebe Kolleginnen und Koll... | Member of Parliament | 2019-09-26 | CDU/CSU | [0.08285392075777054, 0.06978157162666321, 0.0... | 30.107765 |
| 0 | 1051468 | 197 | Sven-Christian | Kindler | \n\nWir brauchen endlich eine Bundesregierung,... | Member of Parliament | 2020-12-08 | Grüne | [0.10869952291250229, 0.29951632022857666, -0.... | 24.333567 |


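Let's have a look at three of the ten best matches. Because of the descending sort, `iloc[0]` and `iloc[1]` are the two most distant of the ten, and `iloc[-1]` is the closest one: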

In [30]:
samples_df.iloc[0].speechContent[:1000]

'\n\nFangen Sie endlich damit an, den Klimaschutz ernst zu nehmen, und überarbeiten Sie das Bundesberggesetz grundlegend, auch im Sinne des Umweltschutzes und zum Schutz der Menschen!\n\n({0})'

In [31]:
samples_df.iloc[1].speechContent[:1000]

'\n\nOkay.\xa0– Wir schaffen erst Anreize, und in einem zweiten Schritt, wie gesagt, steigt der Preis an, und das ist unsere Antwort im Punkt Bepreisung.\n\nWenn die AfD von der Aufgabe aller Klima- und Energieziele spricht, dann sage ich Ihnen ganz klar: Ja, Deutschland alleine kann das Klima der Welt nicht retten; daraus machen wir auch gar keinen Hehl. Deswegen ist die Arbeit von Gerd Müller, des Entwicklungshilfeministers, auch an der Stelle so wichtig. Und deswegen ist es auch so wichtig, dass wir bei den Weltklimakonferenzen immer wieder dafür kämpfen, dass die Staaten der Welt sich auch an die Klimaziele halten, die sie sich selbst gesteckt haben.\n\n({0})\n\nDa ist das Pariser Abkommen ein Riesenschritt nach vorn.\n\nAber ich sage auch: Wir haben doch eine Verantwortung in der Welt. Wir müssen doch als Vorbild vorangehen und müssen zeigen, dass es gelingt, Klimaschutz und Wirtschaftswachstum nicht als Gegensätze zu begreifen, sondern auch als Chance für die Wirtschaft, sich auf

In [32]:
samples_df.iloc[-1].speechContent[:1000]

'\n\nWir brauchen endlich eine Bundesregierung, die den Klimaschutz zur Toppriorität macht.\n\nVielen Dank.\n\n({0})'

If you know German, you'll see that the results for our query look really good.

Now remember that our model has been fine-tuned on both English and German text. That is, English and German texts with similar meaning should have similar embeddings. Thus, we should be able to query our dataset with English queries just as well. Let's use the English translation of our first query and see what we get:

In [33]:
q_en = "The fight against climate change must have top priority!"
q_en_emb = model.encode(q_en)

In [34]:
scores_en, samples_en = embeddings_dataset.get_nearest_examples("embeddings", model.encode(q_en), k=10)

In [35]:
samples_df_en = pd.DataFrame.from_dict(samples_en)
samples_df_en["scores"] = scores_en
samples_df_en.sort_values("scores", ascending=False, inplace=True)

In [36]:
samples_df_en

| | id | session | firstName | lastName | speechContent | position | date | faction | embeddings | scores |
|---|---|---|---|---|---|---|---|---|---|---|
| 9 | 1006817 | 31 | Anja | Weisgerber | \n\nSehr geehrter Herr Präsident! Werte Kolleg... | Member of Parliament | 2018-05-15 | CDU/CSU | [-0.026334811002016068, 0.15515056252479553, -... | 41.945499 |
| 8 | 1032505 | 129 | Marie-Luise | Dött | \n\nSehr geehrter Herr Präsident! Meine Damen ... | Member of Parliament | 2019-11-26 | CDU/CSU | [-0.06372411549091339, 0.17748013138771057, 0.... | 41.637802 |
| 7 | 1029268 | 118 | Anja | Weisgerber | \n\nOkay. – Wir schaffen erst Anreize, und in ... | Member of Parliament | 2019-10-17 | CDU/CSU | [-0.012752844020724297, 0.19681210815906525, 0... | 41.366852 |
| 6 | 1001346 | 7 | Lisa | Badum | \n\nNein. – Sie müssen sich erstens endlich an... | Member of Parliament | 2018-01-18 | Grüne | [-0.10626056790351868, 0.23017993569374084, -0... | 40.811142 |
| 5 | 1008459 | 38 | Svenja | Schulze | \n\nHerr Abgeordneter, die Forschungsseite ist... | Minister | 2018-06-13 | not found | [0.03885478153824806, -0.16286222636699677, -0... | 40.576042 |
| 4 | 1032183 | 128 | Andreas | Jung | \n\nWir werden mit diesem Klimapaket unserer V... | Member of Parliament | 2019-11-15 | CDU/CSU | [-0.05022718012332916, 0.02807101048529148, 0.... | 39.967426 |
| 3 | 1019612 | 80 | Carsten | Träger | \n\nSehr geehrte Frau Präsidentin! Kolleginnen... | Member of Parliament | 2019-02-14 | SPD | [-0.07200361043214798, 0.14576080441474915, 0.... | 39.91975 |
| 2 | 1039271 | 156 | Anja | Weisgerber | \n\nSehr geehrter Herr Kollege Hilse! Klar ist... | Member of Parliament | 2020-04-23 | CDU/CSU | [0.034136462956666946, 0.05767284706234932, -0... | 35.132324 |
| 1 | 1028351 | 115 | Andreas | Jung | \n\nHerr Präsident! Liebe Kolleginnen und Koll... | Member of Parliament | 2019-09-26 | CDU/CSU | [0.08285392075777054, 0.06978157162666321, 0.0... | 32.600567 |
| 0 | 1051468 | 197 | Sven-Christian | Kindler | \n\nWir brauchen endlich eine Bundesregierung,... | Member of Parliament | 2020-12-08 | Grüne | [0.10869952291250229, 0.29951632022857666, -0.... | 26.413271 |


Indeed: using the English query gives us almost the same samples as our German query (nine of the ten matches are identical)!
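
A quick check on the `id` columns of the two result tables shows how close the two result sets are. The ids below are copied by hand from the tables above; with the DataFrames at hand, `set(samples_df["id"])` and `set(samples_df_en["id"])` would presumably give the same sets.

```python
# ids of the ten nearest examples for the German and the English query
german_ids = {1060666, 1029268, 1032505, 1001346, 1019612,
              1032183, 1008459, 1039271, 1028351, 1051468}
english_ids = {1006817, 1032505, 1029268, 1001346, 1008459,
               1032183, 1019612, 1039271, 1028351, 1051468}

shared = german_ids & english_ids
print(len(shared))                       # 9
print(sorted(german_ids - english_ids))  # [1060666]
print(sorted(english_ids - german_ids))  # [1006817]
```

Nine of the ten matches are shared, and the closest match is the same for both queries.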

### Finding similar samples

Now let's see what happens when we use one of the samples to query the dataset. In particular, we'll use the one below:

In [37]:
speeches.iloc[-4].speechContent

'\n\nSie müssen das jetzt verkürzen und zum Punkt kommen.'

This is one of the Bundestag's vice presidents telling a member who exceeded his speaking time to come straight to the point. Can we find similar occurrences even though the wording used might have been entirely different? Let's get the 500 nearest examples:

In [38]:
q = speeches.iloc[-4].speechContent
q_emb = model.encode(q)

In [39]:
scores, samples = embeddings_dataset.get_nearest_examples("embeddings", q_emb, k=500)
samples_df = pd.DataFrame.from_dict(samples)
samples_df["scores"] = scores
samples_df.sort_values("scores", ascending=False, inplace=True)

In [41]:
samples_df.tail(5)

| | id | session | firstName | lastName | speechContent | position | date | faction | embeddings | scores |
|---|---|---|---|---|---|---|---|---|---|---|
| 4 | 1001913 | 11 | Thomas | Oppermann | \n\nSie müssen langsam zum Ende kommen, Herr K... | Presidium of Parliament | 2018-02-01 | not found | [-0.10636333376169205, 0.29633229970932007, 0.... | 26.62343 |
| 3 | 1030098 | 121 | Wolfgang | Kubicki | \n\nAber jetzt müssen Sie zum Ende kommen. | Presidium of Parliament | 2019-10-24 | not found | [0.03086794912815094, 0.2639564275741577, 0.02... | 26.31995 |
| 2 | 1023532 | 95 | Petra | Pau | \n\nSie müssen jetzt den Punkt setzen, bitte. | Presidium of Parliament | 2019-04-11 | not found | [0.2779744863510132, 0.04356134310364723, -0.0... | 25.8615 |
| 1 | 1032602 | 129 | Petra | Pau | \n\nSie müssen jetzt einen Punkt setzen. | Presidium of Parliament | 2019-11-26 | not found | [0.15792761743068695, 0.06023293733596802, -0.... | 25.58862 |
| 0 | 1060954 | 228 | Petra | Pau | \n\nSie müssen das jetzt verkürzen und zum Pun... | Presidium of Parliament | 2021-05-07 | not found | [0.2084154337644577, 0.5121742486953735, -0.12... | 1.12309e-11 |


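Again, the results look very promising: the closest match (with a score of practically zero) is the query itself, and the next-closest utterances are differently worded requests to wrap up.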
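
When a sample from the corpus serves as the query, it will usually come back as its own nearest neighbour. A minimal sketch of dropping that self-match and listing the remaining hits from closest to farthest is shown below; the `(id, score)` pairs are copied from the five rows above, and the variable names are made up for illustration.

```python
query_id = 1060954  # id of the utterance that was used as the query

# (id, score) pairs taken from the last five rows of the result table
hits = [
    (1001913, 26.62343),
    (1030098, 26.31995),
    (1023532, 25.8615),
    (1032602, 25.58862),
    (1060954, 1.12309e-11),
]

# Drop the query itself and sort by score so the closest neighbour comes first.
neighbours = sorted((score, id_) for id_, score in hits if id_ != query_id)
print([id_ for _, id_ in neighbours])
# [1032602, 1023532, 1030098, 1001913]
```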

Of course we merely scratched the surface of what is possible with semantic search, but it hopefully became apparent how embeddings can help us search for examples with a particular meaning.