# Searching for Relevant Texts within an Embedding Space

This notebook demonstrates how to use a pre-trained multilingual embedding model downloaded from Hugging Face to search for relevant texts across languages.
We'll load the model, embed some texts, and measure the cosine similarity between possible matches.


## 1. Install Dependencies

First, we need to install `sentence-transformers`.


In [1]:
!pip install sentence-transformers

Collecting sentence-transformers
  Downloading sentence_transformers-3.2.1-py3-none-any.whl.metadata (10 kB)
Downloading sentence_transformers-3.2.1-py3-none-any.whl (255 kB)
Installing collected packages: sentence-transformers
Successfully installed sentence-transformers-3.2.1


## 2. Model Information

In this example, we use an off-the-shelf multilingual embedding model hosted on Hugging Face: `gte-multilingual-base`.

Note: A newer Impresso version of the model is in the works.

This model predicts an embedding representation (a list of numbers that stores the "meaning") of a given text (sentence, paragraph, or article), which can be used to measure the similarity between two texts.


## 3. Loading the embedding model

The `SentenceTransformer` class downloads the model from Hugging Face and loads it, ready for prediction. We use the SentenceTransformers library to benefit from its functionality and documentation.

In [2]:
from sentence_transformers import SentenceTransformer
from scipy.spatial.distance import cosine

embedding_model = SentenceTransformer("Alibaba-NLP/gte-multilingual-base", trust_remote_code=True)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


modules.json:   0%|          | 0.00/349 [00:00<?, ?B/s]

README.md:   0%|          | 0.00/123k [00:00<?, ?B/s]

sentence_bert_config.json:   0%|          | 0.00/55.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/1.43k [00:00<?, ?B/s]

configuration.py:   0%|          | 0.00/7.13k [00:00<?, ?B/s]

A new version of the following files was downloaded from https://huggingface.co/Alibaba-NLP/new-impl:
- configuration.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.


modeling.py:   0%|          | 0.00/59.0k [00:00<?, ?B/s]

A new version of the following files was downloaded from https://huggingface.co/Alibaba-NLP/new-impl:
- modeling.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.


model.safetensors:   0%|          | 0.00/611M [00:00<?, ?B/s]

Some weights of the model checkpoint at Alibaba-NLP/gte-multilingual-base were not used when initializing NewModel: ['classifier.bias', 'classifier.weight']
- This IS expected if you are initializing NewModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing NewModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


tokenizer_config.json:   0%|          | 0.00/1.15k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/17.1M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/964 [00:00<?, ?B/s]

1_Pooling/config.json:   0%|          | 0.00/190 [00:00<?, ?B/s]

### Simple Test

In [3]:
sentence1_en = "This is an example test sentence"
sentence2_en = "This constitutes a sample sentence"

In [4]:
embedding1_en = embedding_model.encode(sentence1_en)
embedding2_en = embedding_model.encode(sentence2_en)

print("Embedding Representation of sentence1 starts with ... " + str(embedding1_en[:5]))
print("Embedding Representation of sentence2 starts with ... " + str(embedding2_en[:5]))

Embedding Representation of sentence1 starts with ... [-0.03899017  0.05060311 -0.05112957  0.04195484 -0.01264444]
Embedding Representation of sentence2 starts with ... [-0.03233603  0.05431601 -0.04902024  0.01571231 -0.00125108]


Those numbers look cool, but am I meant to understand anything from them?

Answer: Calculate the similarity between the two representations.

In [5]:
similarity_value = round(1 - cosine(embedding1_en, embedding2_en),2)
print("Sentence1 and Sentence2 have a cosine similarity of " + str(similarity_value))

Sentence1 and Sentence2 have a cosine similarity of 0.87


FAQ:

Cosine similarity is not the percentage of similarity between the two texts.

The higher the cosine similarity, the more similar the two texts are. The range of what is considered "high" varies per model and domain.

Based on our experiments on contemporary texts, a cosine similarity of 0.85 or higher means the two texts are mostly equivalent.
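
To make the first two points concrete, here is a minimal sketch that computes cosine similarity by hand on made-up 3-dimensional vectors (not real model embeddings). The helper name `cosine_similarity` is an assumption for illustration; it should agree with the `1 - cosine(a, b)` expression from SciPy used above.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy vectors, made up for illustration (not model output)
same_direction = cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
orthogonal = cosine_similarity([1.0, 0.0, 0.0], [0.0, 1.0, 0.0])
opposite = cosine_similarity([1.0, 2.0, 3.0], [-1.0, -2.0, -3.0])

print(round(same_direction, 2))  # 1.0
print(round(orthogonal, 2))      # 0.0
print(round(opposite, 2))        # -1.0
```

The score depends only on the direction of the vectors, not on their length, and it ranges from -1 to 1. That is why it should be read as a relative score rather than a percentage.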

### Simple Test Across Languages

In [6]:
sentence1_de = "Das ist ein Beispieltestsatz"

In [7]:
embedding1_de = embedding_model.encode(sentence1_de)

In [8]:
similarity_value = round(1 - cosine(embedding1_en, embedding1_de),2)
print("Sentence1 in English and Sentence1 in German have a cosine similarity of " + str(similarity_value))

Sentence1 in English and Sentence1 in German have a cosine similarity of 0.73


### Try your own similarity calculation

In [9]:
input1 = input()

test1


In [10]:
input2 = input()

test2


In [11]:
embedding1 = embedding_model.encode(input1)
embedding2 = embedding_model.encode(input2)

In [12]:
# Compare the two user-provided inputs (not the earlier English/German example sentences)
similarity_value = round(1 - cosine(embedding1, embedding2),2)
print("Input1 and Input2 have a cosine similarity of " + str(similarity_value))


It works! Not in every case, though, and this example was not cherry-picked. Still, measuring similarity across languages is possible with these models!

## 4. Finding similar texts within collections using the embedding model

Now that we have seen how the model creates a representation and how we can use it to get the similarity of two texts, let's apply it to a couple of collections.

### Setting up utilities

In [19]:
from scipy.spatial.distance import cosine

def find_best_match_in_collection(source_collection, target_collection, matches_sorted=False, link=False):
    """
    Finds the most similar sentence in the target collection for each sentence in the source collection
    based on cosine similarity of their embeddings.

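    Args:
        source_collection (list of tuples): A list of tuples where each tuple contains a source sentence and its corresponding embedding (plus a content item ID as third element if `link` is True).
        target_collection (list of tuples): A list of tuples where each tuple contains a target sentence and its corresponding embedding (plus a content item ID as third element if `link` is True).
        matches_sorted (bool, optional): If True, sort the results by similarity in descending order. Defaults to False.
        link (bool, optional): If True, add links to the source and matched content items to each result. Defaults to False.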

    Returns:
        list of dict: A list of dictionaries, each containing:
            - 'source_text' (str): The original sentence from the source collection.
            - 'best_match' (str): The most similar sentence from the target collection.
            - 'similarity' (float): The cosine similarity score between the source sentence and the best match, rounded to two decimal places.

    Example:
        source = [("sentence in German", embedding_vector)]
        target = [("sentence in French", embedding_vector)]
        matches = find_best_match_in_collection(source, target)

    Note:
        The function assumes the embeddings are already precomputed and provided in the source and target collections.
    """

    matches = []  # List to store the best matches
    for source in source_collection:
        source_embedding = source[1]  # Get the embedding of the source sentence
        best_match_text = ""  # Initialize the best match text
        best_match_similarity = 0  # Initialize the best match similarity score
        best_match_content_id = ""
        # Iterate through the target collection to find the best match
        for target in target_collection:
            if source[0] == target[0]:  # Skip if comparing the same sentence
                continue
            target_embedding = target[1]  # Get the embedding of the target sentence
            # Calculate cosine similarity between source and target embeddings
            similarity_value = 1 - cosine(source_embedding, target_embedding)
            # Update if the current similarity is higher than the previous best
            if similarity_value > best_match_similarity:
                best_match_similarity = similarity_value
                best_match_text = target[0]
                if link:
                    best_match_content_id = target[2]

        # Append the source sentence, best match, and similarity score to the results list
        if link:
            matches.append({
                "source_text": source[0],
                "best_match": best_match_text,
                "similarity": round(best_match_similarity, 2),  # Round to 2 decimal places
                "Content Item Source": f"https://impresso-project.ch/app/article/{source[2]}",  # Link to the source article
                "Content Item Matched": f"https://impresso-project.ch/app/article/{best_match_content_id}"  # Link to the best-matching article
            })
        else:
            matches.append({
                "source_text": source[0],
                "best_match": best_match_text,
                "similarity": round(best_match_similarity, 2)  # Round to 2 decimal places
            })


    if matches_sorted:
        # Sort the matches by similarity in descending order if specified
        matches = sorted(matches, key=lambda x: x['similarity'], reverse=True)

    return matches

# Function to print the matches in a nicely formatted way
def print_matches_formatted(matches, link=False, threshold=0):
    for match_dict in matches:
        if threshold > match_dict['similarity']: # if not a good match, skip the match
            continue
        print("Source Text:")
        print(f"  {match_dict['source_text']}\n")

        print("Best Match:")
        print(f"  {match_dict['best_match']}\n")

        print(f"Similarity: {match_dict['similarity']}\n")

        # If 'link' is True, print the URLs for the source and matched content
        if link and 'Content Item Source' in match_dict and 'Content Item Matched' in match_dict:
            print("Content Item Source:")
            print(f"  {match_dict['Content Item Source']}\n")

            print("Content Item Matched:")
            print(f"  {match_dict['Content Item Matched']}\n")

        print("-" * 50)  # Print a separator line for readability

def create_embedding_collection(texts, embedding_model):
    # Encode the sentences using the model
    embeddings = embedding_model.encode(texts)

    # Zip the sentences with their corresponding embeddings
    texts_embedding_collection = list(zip(texts, embeddings))

    return texts_embedding_collection
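
The function `find_best_match_in_collection` keeps only the single best match per source text. If you would rather see a few candidates, the same idea can be sketched as a top-k search. This is only an illustration under assumptions: `find_top_k_matches` and `cosine_similarity` are hypothetical helpers that are not part of the notebook, and the 2-dimensional "embeddings" are made up so the example runs without the model.

```python
import heapq
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def find_top_k_matches(source_item, target_collection, k=3):
    """Return up to k (text, similarity) pairs, most similar first.

    Hypothetical helper: expects the same (text, embedding) tuple
    layout as the collections built above.
    """
    source_text, source_embedding = source_item[0], source_item[1]
    scored = [
        (target[0], round(cosine_similarity(source_embedding, target[1]), 2))
        for target in target_collection
        if target[0] != source_text  # skip identical texts, as above
    ]
    return heapq.nlargest(k, scored, key=lambda pair: pair[1])

# Toy 2-dimensional "embeddings", made up for illustration
source = ("query", [1.0, 0.0])
targets = [
    ("close", [0.9, 0.1]),
    ("far", [0.0, 1.0]),
    ("closest", [1.0, 0.01]),
]

top_matches = find_top_k_matches(source, targets, k=2)
print(top_matches)
```

With real collections you would pass one element of `german_collection` as `source_item` and `french_collection` as `target_collection`.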

## 4.1 Searching in a Dummy Sentence Level Collection

In [20]:
german_sentences = [
    "Mit diesen drei Kernkraftwerken wird die Schweiz 1972 die höchste installierte nukleare Kapazität pro Kopf der Bevölkerung aller kontinentaleuropäischer Länder aufweisen .",
    "In anderen Gegenden wiederum wirbt man gegen den Bau von Wasserkraftwerken aus Gründen des Natur- und Heimatschutzes und wäre mit der Aufstellung thermischer Kraftwerke mittlerer Leistung einverstanden , vorausgesetzt , dass deren Abgase keine unzulässige Verschmutzung der Luft verursachen",
    "In Baden beabsichtigt , mit einem Kostenaufwand von 480 Millionen Franken iii Kaiseraugst ein Kernkraftwerk mit einer Leistung von 500 Megawatt zu errichten",
    "Der Bürger akzeptiert das Prinzip des doppelten politischen Programms nicht mehr, bei dem das erste dazu dient, gewählt zu werden, und das zweite zum Regieren verwendet wird.",
]

french_sentences = [
    "Indemnités pour Kaiseraugst Contestées par une minorité au National Le Conseil national a entamé hier de débat sur l'abandon du projet de la centrale nucléaire de Kaiseraugst mais n'a pas encore voté l'entrée enmatière.",
    "Certains milieux voués à la protection de l'environnement s'élèvent non seulement contre la construction de centrales mais même contre l'accroissement de la consommation d'énergie.",
    "En revanche, le projet retenu pour Kaiseraugst, soit un réacteur à eau légère de 500 MW, coûterait 480 millions de francs et produirait de l'électricité au prix de 2,5 centimes par kwh. ",
    "La Suisse disposera en 1972, par habitant, de la capacité nucléaire installée la plus élevée de tous les pays d'Europe continentale."
]

german_collection = create_embedding_collection(german_sentences, embedding_model)
french_collection = create_embedding_collection(french_sentences, embedding_model)

In [21]:
# Example of finding and printing the best matches
matches = find_best_match_in_collection(source_collection=german_collection, target_collection=french_collection)  # Find best matches
print_matches_formatted(matches)  # Print the matches

Source Text:
  Mit diesen drei Kernkraftwerken wird die Schweiz 1972 die höchste installierte nukleare Kapazität pro Kopf der Bevölkerung aller kontinentaleuropäischer Länder aufweisen .

Best Match:
  La Suisse disposera en 1972, par habitant, de la capacité nucléaire installée la plus élevée de tous les pays d'Europe continentale.

Similarity: 0.92

--------------------------------------------------
Source Text:
  In anderen Gegenden wiederum wirbt man gegen den Bau von Wasserkraftwerken aus Gründen des Natur- und Heimatschutzes und wäre mit der Aufstellung thermischer Kraftwerke mittlerer Leistung einverstanden , vorausgesetzt , dass deren Abgase keine unzulässige Verschmutzung der Luft verursachen

Best Match:
  Certains milieux voués à la protection de l'environnement s'élèvent non seulement contre la construction de centrales mais même contre l'accroissement de la consommation d'énergie.

Similarity: 0.73

--------------------------------------------------
Source Text:
  In Baden

## 4.2 Searching in an Article collection exported from the interface

In [22]:
def interface_exported_csv_to_collection(df, embedding_model, batch_size=16, minimum_characters_in_article=2000):
    """
    Converts a DataFrame into a collection of (text, embedding, uid) tuples,
    keeping only rows whose 'content' column has at least
    `minimum_characters_in_article` characters.
    The encoding is done in batches to handle large datasets efficiently.

    Args:
        df (pd.DataFrame): Input DataFrame containing 'content' (text) and 'uid' (unique identifier).
        embedding_model (object): Embedding model with an `encode()` method to generate text embeddings.
        batch_size (int, optional): The size of batches for the encoding process. Defaults to 16.
        minimum_characters_in_article (int, optional): Minimum number of characters an article must have to be kept. Defaults to 2000.

    Returns:
        list of tuples: Each tuple contains (source text, embedding, uid) for rows whose 'content' is long enough.
    """

    # Filter rows where 'content' has at least the minimum number of characters
    df_filtered = df[df['content'].apply(lambda x: len(str(x)) >= minimum_characters_in_article)]

    # Extract the 'content' and 'uid' columns
    texts = df_filtered['content'].tolist()
    uids = df_filtered['uid'].tolist()

    # Initialize an empty list to hold the embeddings
    embeddings = []

    # Process the texts in batches
    for i in range(0, len(texts), batch_size):
        batch_texts = texts[i:i + batch_size]
        batch_embeddings = embedding_model.encode(batch_texts)
        embeddings.extend(batch_embeddings)

    # Create a collection of tuples (source text, embedding, uid)
    collection = list(zip(texts, embeddings, uids))

    return collection
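
The batching loop above can be tried without downloading a model. Below is a minimal sketch, assuming a stand-in `FakeModel` (hypothetical, it simply returns the text length as a one-number "embedding") and a hypothetical helper `encode_in_batches`. The only point is that batch-wise encoding keeps texts, embeddings and uids aligned.

```python
class FakeModel:
    """Stand-in for an embedding model; NOT a real encoder."""

    def __init__(self):
        self.calls = 0  # counts how many batches were encoded

    def encode(self, texts):
        self.calls += 1
        # One-number "embedding" per text: its length
        return [[float(len(text))] for text in texts]

def encode_in_batches(texts, model, batch_size=16):
    """Encode texts batch by batch, preserving the input order."""
    embeddings = []
    for start in range(0, len(texts), batch_size):
        batch_texts = texts[start:start + batch_size]
        embeddings.extend(model.encode(batch_texts))
    return embeddings

# Made-up texts and ids for illustration
texts = ["a", "bb", "ccc", "dddd", "eeeee"]
uids = ["id-1", "id-2", "id-3", "id-4", "id-5"]

model = FakeModel()
embeddings = encode_in_batches(texts, model, batch_size=2)
collection = list(zip(texts, embeddings, uids))

print(model.calls)     # 3 batches: 2 + 2 + 1 texts
print(collection[-1])  # ('eeeee', [5.0], 'id-5')
```

As far as I know, `SentenceTransformer.encode` also accepts its own `batch_size` argument, so the explicit loop is mainly useful when you want control over each batch, for example to report progress or to save intermediate results.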


In [23]:
!pip install gdown
import gdown
# Step 1: Set the file IDs
file_id_french = '1-xHR5Oxlo1iVBpAKnwsxYxPGBZ0r86-3'
file_id_german = '1LGdbdvJVl1tIouYoPJPcL0u45_GftwoB'

# Step 2: Download the files using gdown
gdown.download(f"https://drive.google.com/uc?export=download&id={file_id_french}", "mariecurie_french.csv", quiet=True)
gdown.download(f"https://drive.google.com/uc?export=download&id={file_id_german}", "mariecurie_german.csv", quiet=True)



'mariecurie_german.csv'

In [24]:
import pandas as pd

marie_curie_df_french = pd.read_csv("mariecurie_french.csv", sep=";")

marie_curie_df_german = pd.read_csv("mariecurie_german.csv", sep=";")

In [25]:
marie_curie_german_collection = interface_exported_csv_to_collection(marie_curie_df_german, embedding_model, minimum_characters_in_article=2000)
print("German articles prepared: " + str(len(marie_curie_german_collection)))
marie_curie_french_collection = interface_exported_csv_to_collection(marie_curie_df_french, embedding_model, minimum_characters_in_article=2000)
print("French articles prepared: " + str(len(marie_curie_french_collection)))

German articles prepared: 141
French articles prepared: 828


In [26]:
# Example of finding and printing the best matches
matches = find_best_match_in_collection(source_collection=marie_curie_french_collection[:15], target_collection=marie_curie_german_collection, link=True)  # Find best matches
print_matches_formatted(matches, link=True, threshold=0.70)  # Print the matches

Source Text:
  Le merveilleux radium Parmi tous les miracles dont pourra s'enorgueillir un siècle fécond en découvertes et en inventions de toutes sortes, le plus merveilleux peut-être est celui du radium. Or, écrire l'histoire de celuici, c'est écrire l'histoire de Mme Curie, à laquelle, assure-t-on, le gouvernement français doit décerner sous peu la plus haute récompense nationale qui ait jamais été accordée à une femme : la croix de commandeur de la Légion d'honneur. Nous ne les séparerons donc pas l'un de l'autre. Qu'est-ce que le radium, dont on parle beaucoup, mais que bien peu de personnes connaissent ? C'est le dernier état chimique de l'uranium, qui constitue avec le thorium et l'actinium, les trois corps radio-actifs. Quels sont ses effets ? Il excite la phosphorescence des substances alcalines, du verre, du diamant, de l'épiderme. Il est conducteur de l'électricité. Il est « auto-lumineux ». Chimiquement, il colore le verre, l'eau, le diamant, décompose l'eau, coagule l'albu

## 4.3 Searching in an Article collection sourced from DataLab (Current Substitute API)

This is not how it will look within Datalab; for details about that, please ask Daniele. Here I am just showcasing programmatic access through our API.

In [27]:
%pip install --upgrade --force-reinstall impresso
import impresso
impresso_session = impresso.connect()

Collecting impresso
  Downloading impresso-0.9.8-py3-none-any.whl.metadata (4.6 kB)
Collecting PyJWT<3.0.0,>=2.8.0 (from impresso)
  Downloading PyJWT-2.9.0-py3-none-any.whl.metadata (3.0 kB)
Collecting PyYAML<7.0.0,>=6.0.2 (from impresso)
  Downloading PyYAML-6.0.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (2.1 kB)
Collecting attrs<24.0.0,>=23.2.0 (from impresso)
  Downloading attrs-23.2.0-py3-none-any.whl.metadata (9.5 kB)
Collecting httpx<0.28.0,>=0.27.0 (from impresso)
  Downloading httpx-0.27.2-py3-none-any.whl.metadata (7.1 kB)
Collecting matplotlib<4.0.0,>=3.7.0 (from impresso)
  Downloading matplotlib-3.9.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (11 kB)
Collecting pandas<3.0.0,>=2.1.0 (from impresso)
  Downloading pandas-2.2.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (89 kB)


Click on the following link to access the login page: https://impresso-project.ch/datalab/token
 - 🔤 Enter your email/password on this page.
 - 🔑 Once logged in, a secret token will be generated for you.
 - 📋 Copy this token and paste it into the input field below. Then press "Enter". 👇🏼.

🔑 Enter your token: ··········
🎉 You are now connected to the Impresso API!  🎉


In [30]:
# some search and get data
fr_result = impresso_session.search.find(
    q="recette",
    order_by="date")

print(fr_result.df.index[:5])

# fetch the content of the first 20 articles
for uri in fr_result.df.index[:20]:
    article = impresso_session.content_items.get(uri)

Index(['EXP-1768-05-05-a-i0001', 'EXP-1768-05-11-a-i0001',
       'EXP-1770-12-13-a-i0008', 'EXP-1771-01-17-a-i0012',
       'EXP-1771-01-24-a-i0009'],
      dtype='object', name='uid')
EXP-1768-05-05-a-i0001
                      uid type       title  size  nbPages  \
0  EXP-1768-05-05-a-i0001   ar  [REDACTED]   199        4   

                                               pages  isCC     excerpt  \
0  [{'uid': 'EXP-1768-05-05-a-p0001', 'num': 1, '...  True  [REDACTED]   

      labels  accessRight  ...  newspaper.firstIssue.date  \
0  [article]  OpenPrivate  ...  1738-10-02T00:00:00+00:00   

  newspaper.firstIssue.year newspaper.lastIssue.uid newspaper.lastIssue.cover  \
0                      1738        EXP-2017-10-31-a                             

  newspaper.lastIssue.labels newspaper.lastIssue.fresh  \
0                    [issue]                     False   

  newspaper.lastIssue.accessRights   newspaper.lastIssue.date  \
0                       NotDefined  2017-10-31T00:0

## 5. Summary and Next Steps

Feel free to try other texts or even other models; the pipeline remains the same.
