# Searching for Relevant Texts within an Embedding Space

## What is this notebook about?
This notebook demonstrates how to use a pre-trained multilingual embedding model downloaded from Hugging Face to search for relevant texts across languages.
We'll load the model, embed some texts and measure cosine similarity between possible matches.

Model Information

In this example, we are using an off-the-shelf multilingual embedding model hosted on Hugging Face: `gte-multilingual-base`.

Note: A newer Impresso version of the model is in the works.

This model predicts an embedding representation (a list of numbers that stores the "meaning") of a given text (sentence, paragraph, or article). Embeddings of two texts can then be compared to measure how similar the texts are.

---

## What will you learn?

In this notebook, you will learn how to:

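- Load a pre-trained multilingual embedding model with `sentence-transformers`
- Embed texts and compare them using cosine similarity
- Find the most similar text across languages, first in small sentence collections and then in article collections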



---
## Prerequisites

First, we need to install `sentence-transformers`:


In [None]:
%pip install sentence-transformers

## Loading the embedding model

This class downloads the model from Hugging Face and loads it ready for prediction. We use the SentenceTransformers library to benefit from their functionality and documentation.

In [None]:
from sentence_transformers import SentenceTransformer
from scipy.spatial.distance import cosine

embedding_model = SentenceTransformer("Alibaba-NLP/gte-multilingual-base", trust_remote_code=True)

### Simple Test

In [3]:
sentence1_en = "This is an example test sentence"
sentence2_en = "This constitutes a sample sentence"

In [None]:
embedding1_en = embedding_model.encode(sentence1_en)
embedding2_en = embedding_model.encode(sentence2_en)

print("Embedding Representation of sentence1 starts with ... " + str(embedding1_en[:5]))
print("Embedding Representation of sentence2 starts with ... " + str(embedding2_en[:5]))

Those numbers look cool, but am I meant to understand anything from them?

Answer: Not directly. Calculate the similarity between the two representations instead.

In [None]:
similarity_value = round(1 - cosine(embedding1_en, embedding2_en),2)
print("Sentence1 and Sentence2 have a cosine similarity of " + str(similarity_value))

FAQ:

Cosine similarity is not a percentage of similarity between the two texts.

The higher the cosine similarity, the more similar the two texts are. What counts as "high" varies by model and domain.

Based on our experiments on contemporary texts, a cosine similarity of 0.85 or higher means the two texts are mostly equivalent.
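
To make the FAQ concrete, here is a minimal, dependency-free sketch of what `1 - cosine(a, b)` computes: the cosine of the angle between the two vectors, which ranges from -1 to 1 rather than from 0% to 100%. The function name `cosine_similarity` and the toy 3-dimensional vectors are made up for illustration; real embeddings have hundreds of dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy vectors (illustrative only, not produced by the model)
same_direction = cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
orthogonal = cosine_similarity([1.0, 0.0, 0.0], [0.0, 1.0, 0.0])
opposite = cosine_similarity([1.0, 2.0, 3.0], [-1.0, -2.0, -3.0])

print(round(same_direction, 2), round(orthogonal, 2), round(opposite, 2))
# prints: 1.0 0.0 -1.0
```

Only the direction of the vectors matters, not their length, which is why scaling a vector by 2 still gives a similarity of 1.0.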

### Simple Test Across Languages

In [6]:
sentence1_de = "Das ist ein Beispieltestsatz"

In [7]:
embedding1_de = embedding_model.encode(sentence1_de)

In [8]:
similarity_value = round(1 - cosine(embedding1_en, embedding1_de),2)
print("Sentence1 in English and Sentence1 in German have a cosine similarity of " + str(similarity_value))

Sentence1 in English and Sentence1 in German have a cosine similarity of 0.73


### Try your own similarity calculation

In [9]:
input1 = input()

test1


In [10]:
input2 = input()

test2


In [11]:
embedding1 = embedding_model.encode(input1)
embedding2 = embedding_model.encode(input2)

In [12]:
# Compare the embeddings of the two user inputs
similarity_value = round(1 - cosine(embedding1, embedding2),2)
print("Input1 and Input2 have a cosine similarity of " + str(similarity_value))


It works! Not in every case, and I did not cherry-pick this example. However, similarity across languages is possible with these models!
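
One way to avoid mixing up variables between cells is to wrap the two steps (embed, then compare) in a small helper. The sketch below is only an illustration under stated assumptions: `similarity_between` is a hypothetical name, and `VowelCountEncoder` is a stand-in object with an `encode()` method so that the block runs without downloading a model. The stand-in does not capture meaning in any way.

```python
import math

def similarity_between(model, text_a, text_b):
    """Embed two texts with model.encode() and return their cosine similarity (2 decimals)."""
    emb_a, emb_b = model.encode([text_a, text_b])
    dot = sum(float(x) * float(y) for x, y in zip(emb_a, emb_b))
    norm_a = math.sqrt(sum(float(x) ** 2 for x in emb_a))
    norm_b = math.sqrt(sum(float(y) ** 2 for y in emb_b))
    return round(dot / (norm_a * norm_b), 2)

class VowelCountEncoder:
    """Stand-in encoder (assumption, for offline runs): one dimension per vowel count."""
    def encode(self, texts):
        return [[text.lower().count(v) for v in "aeiou"] for text in texts]

score = similarity_between(VowelCountEncoder(), "test one", "test two")
print(score)
```

With the model loaded earlier in this notebook, the call would be `similarity_between(embedding_model, input1, input2)`.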

## Finding similar texts within collections using the embedding model

Now that we have seen how the model creates a representation and how we can use it to get the similarity of two texts, let's apply it to a couple of collections.

### Setting up utilities

In [19]:
from scipy.spatial.distance import cosine

def find_best_match_in_collection(source_collection, target_collection, matches_sorted=False, link=False):
    """
    Finds the most similar sentence in the target collection for each sentence in the source collection
    based on cosine similarity of their embeddings.

    Args:
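        source_collection (list of tuples): A list of tuples where each tuple contains a source sentence and its corresponding embedding (plus a content item ID as third element when `link` is True).
        target_collection (list of tuples): Same structure as `source_collection`, for the target sentences.
        matches_sorted (bool, optional): If True, return the matches sorted by similarity in descending order. Defaults to False.
        link (bool, optional): If True, add Impresso web app URLs for the source and matched content items to each result. Defaults to False.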

    Returns:
        list of dict: A list of dictionaries, each containing:
            - 'source_text' (str): The original sentence from the source collection.
            - 'best_match' (str): The most similar sentence from the target collection.
            - 'similarity' (float): The cosine similarity score between the source sentence and the best match, rounded to two decimal places.

    Example:
        source = [("sentence in German", embedding_vector)]
        target = [("sentence in French", embedding_vector)]
        matches = find_best_match_in_collection(source, target)

    Note:
        The function assumes the embeddings are already precomputed and provided in the source and target collections.
    """

    matches = []  # List to store the best matches
    for source in source_collection:
        source_embedding = source[1]  # Get the embedding of the source sentence
        best_match_text = ""  # Initialize the best match text
        best_match_similarity = 0  # Initialize the best match similarity score
        best_match_content_id = ""
        # Iterate through the target collection to find the best match
        for target in target_collection:
            if source[0] == target[0]:  # Skip if comparing the same sentence
                continue
            target_embedding = target[1]  # Get the embedding of the target sentence
            # Calculate cosine similarity between source and target embeddings
            similarity_value = 1 - cosine(source_embedding, target_embedding)
            # Update if the current similarity is higher than the previous best
            if similarity_value > best_match_similarity:
                best_match_similarity = similarity_value
                best_match_text = target[0]
                if link:
                    best_match_content_id = target[2]

        # Append the source sentence, best match, and similarity score to the results list
        if link:
            matches.append({
                "source_text": source[0],
                "best_match": best_match_text,
                "similarity": round(best_match_similarity, 2),  # Round to 2 decimal places
                "Content Item Source": f"https://impresso-project.ch/app/article/{source[2]}",  # URL of the source content item
                "Content Item Matched": f"https://impresso-project.ch/app/article/{best_match_content_id}"  # URL of the matched content item
            })
        else:
            matches.append({
                "source_text": source[0],
                "best_match": best_match_text,
                "similarity": round(best_match_similarity, 2)  # Round to 2 decimal places
            })


    if matches_sorted:
        # Sort the matches by similarity in descending order if specified
        matches = sorted(matches, key=lambda x: x['similarity'], reverse=True)

    return matches

# Function to print the matches in a nicely formatted way
def print_matches_formatted(matches, link=False, threshold=0):
    for match_dict in matches:
        if threshold > match_dict['similarity']: # if not a good match, skip the match
            continue
        print("Source Text:")
        print(f"  {match_dict['source_text']}\n")

        print("Best Match:")
        print(f"  {match_dict['best_match']}\n")

        print(f"Similarity: {match_dict['similarity']}\n")

        # If 'link' is True, print the URLs for the source and matched content
        if link and 'Content Item Source' in match_dict and 'Content Item Matched' in match_dict:
            print("Content Item Source:")
            print(f"  {match_dict['Content Item Source']}\n")

            print("Content Item Matched:")
            print(f"  {match_dict['Content Item Matched']}\n")

        print("-" * 50)  # Print a separator line for readability

def create_embedding_collection(texts, embedding_model):
    # Encode the sentences using the model
    embeddings = embedding_model.encode(texts)

    # Zip the sentences with their corresponding embeddings
    texts_embedding_collection = list(zip(texts, embeddings))

    return texts_embedding_collection
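
To see the data structure these utilities work on, here is a compact, dependency-free sketch of the same nested search on hand-made 2-dimensional "embeddings". The names `cosine_similarity` and `best_match`, and all vectors, are invented for illustration; the real collections hold model embeddings instead.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def best_match(source_item, target_collection):
    """Return (text, similarity) of the target most similar to source_item."""
    source_text, source_embedding = source_item
    scored = [
        (target_text, cosine_similarity(source_embedding, target_embedding))
        for target_text, target_embedding in target_collection
        if target_text != source_text  # skip identical texts, as the utility above does
    ]
    return max(scored, key=lambda pair: pair[1])

# Each collection is a list of (text, embedding) tuples; vectors are made up
german = [("Kernkraftwerk", [0.9, 0.1]), ("Politik", [0.1, 0.9])]
french = [("centrale nucléaire", [0.8, 0.2]), ("politique", [0.2, 0.8])]

results = {text: best_match((text, emb), french)[0] for text, emb in german}
print(results)
# prints: {'Kernkraftwerk': 'centrale nucléaire', 'Politik': 'politique'}
```

Every source text is compared against every target text, so the cost grows with the product of the two collection sizes. That is fine for a few hundred articles but becomes slow for large collections.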

### Searching in a Dummy Sentence Level Collection

In [20]:
german_sentences = [
    "Mit diesen drei Kernkraftwerken wird die Schweiz 1972 die höchste installierte nukleare Kapazität pro Kopf der Bevölkerung aller kontinentaleuropäischer Länder aufweisen .",
    "In anderen Gegenden wiederum wirbt man gegen den Bau von Wasserkraftwerken aus Gründen des Natur- und Heimatschutzes und wäre mit der Aufstellung thermischer Kraftwerke mittlerer Leistung einverstanden , vorausgesetzt , dass deren Abgase keine unzulässige Verschmutzung der Luft verursachen",
    "In Baden beabsichtigt , mit einem Kostenaufwand von 480 Millionen Franken iii Kaiseraugst ein Kernkraftwerk mit einer Leistung von 500 Megawatt zu errichten",
    "Der Bürger akzeptiert das Prinzip des doppelten politischen Programms nicht mehr, bei dem das erste dazu dient, gewählt zu werden, und das zweite zum Regieren verwendet wird.",
]

french_sentences = [
    "Indemnités pour Kaiseraugst Contestées par une minorité au National Le Conseil national a entamé hier de débat sur l'abandon du projet de la centrale nucléaire de Kaiseraugst mais n'a pas encore voté l'entrée enmatière.",
    "Certains milieux voués à la protection de l'environnement s'élèvent non seulement contre la construction de centrales mais même contre l'accroissement de la consommation d'énergie.",
    "En revanche, le projet retenu pour Kaiseraugst, soit un réacteur à eau légère de 500 MW, coûterait 480 millions de francs et produirait de l'électricité au prix de 2,5 centimes par kwh. ",
    "La Suisse disposera en 1972, par habitant, de la capacité nucléaire installée la plus élevée de tous les pays d'Europe continentale."
]

german_collection = create_embedding_collection(german_sentences, embedding_model)
french_collection = create_embedding_collection(french_sentences, embedding_model)

In [None]:
# Example of finding and printing the best matches
matches = find_best_match_in_collection(source_collection=german_collection, target_collection=french_collection)  # Find best matches
print_matches_formatted(matches)  # Print the matches

### Searching in an Article collection exported from the interface

In [22]:
def interface_exported_csv_to_collection(df, embedding_model, batch_size=16, minimum_characters_in_article=2000):
    """
    Converts a DataFrame into a collection of (text, embedding, uid) tuples,
    filtering rows where the 'content' column has at least
    `minimum_characters_in_article` characters.
    The encoding is done in batches to handle large datasets efficiently.

    Args:
        df (pd.DataFrame): Input DataFrame containing 'content' (text) and 'uid' (unique identifier).
        embedding_model (object): Embedding model with an `encode()` method to generate text embeddings.
        batch_size (int, optional): The size of batches for the encoding process. Defaults to 16.
        minimum_characters_in_article (int, optional): Minimum length of 'content', in characters, for a row to be kept. Defaults to 2000.

    Returns:
        list of tuples: Each tuple contains (source text, embedding, uid) for rows whose 'content' has at least `minimum_characters_in_article` characters.
    """

    # Filter rows where 'content' has at least `minimum_characters_in_article` characters
    df_filtered = df[df['content'].apply(lambda x: len(str(x)) >= minimum_characters_in_article)]

    # Extract the 'content' and 'uid' columns
    texts = df_filtered['content'].tolist()
    uids = df_filtered['uid'].tolist()

    # Initialize an empty list to hold the embeddings
    embeddings = []

    # Process the texts in batches
    for i in range(0, len(texts), batch_size):
        batch_texts = texts[i:i + batch_size]
        batch_embeddings = embedding_model.encode(batch_texts)
        embeddings.extend(batch_embeddings)

    # Create a collection of tuples (source text, embedding, uid)
    collection = list(zip(texts, embeddings, uids))

    return collection
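
The batching loop in the function above can be looked at in isolation. The sketch below uses a hypothetical helper name, `iter_batches`, and placeholder strings instead of real articles; it only shows how `range(0, len(texts), batch_size)` slices a list into consecutive chunks.

```python
def iter_batches(items, batch_size):
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

# Placeholder texts (illustrative only)
texts = [f"article {i}" for i in range(5)]
batches = list(iter_batches(texts, batch_size=2))

print([len(batch) for batch in batches])
# prints: [2, 2, 1]
```

The last batch is simply shorter when the number of texts is not a multiple of `batch_size`. As far as I know, `SentenceTransformer.encode` also accepts its own `batch_size` argument, so the manual loop is a choice rather than a necessity.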


In [None]:
!pip install gdown
import gdown
# Step 1: Set the file IDs
file_id_french = '1-xHR5Oxlo1iVBpAKnwsxYxPGBZ0r86-3'
file_id_german = '1LGdbdvJVl1tIouYoPJPcL0u45_GftwoB'

# Step 2: Download the files using gdown
gdown.download(f"https://drive.google.com/uc?export=download&id={file_id_french}", "mariecurie_french.csv", quiet=True)
gdown.download(f"https://drive.google.com/uc?export=download&id={file_id_german}", "mariecurie_german.csv", quiet=True)

In [24]:
import pandas as pd

marie_curie_df_french = pd.read_csv("mariecurie_french.csv", sep=";")

marie_curie_df_german = pd.read_csv("mariecurie_german.csv", sep=";")

In [None]:
marie_curie_german_collection = interface_exported_csv_to_collection(marie_curie_df_german, embedding_model, minimum_characters_in_article=2000)
print("German articles prepared: " + str(len(marie_curie_german_collection)))
marie_curie_french_collection = interface_exported_csv_to_collection(marie_curie_df_french, embedding_model, minimum_characters_in_article=2000)
print("French articles prepared: " + str(len(marie_curie_french_collection)))

In [None]:
# Example of finding and printing the best matches
matches = find_best_match_in_collection(source_collection=marie_curie_french_collection[:15], target_collection=marie_curie_german_collection, link=True)  # Find best matches
print_matches_formatted(matches, link=True, threshold=0.70)  # Print the matches
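
The `threshold` argument of `print_matches_formatted` keeps only matches whose similarity is at least the threshold. The same filter, combined with the descending sort that `matches_sorted=True` applies, can be sketched on hand-written match dictionaries (the texts and scores below are invented for illustration):

```python
# Invented match dictionaries with the same keys the utilities produce
matches = [
    {"source_text": "a", "best_match": "x", "similarity": 0.91},
    {"source_text": "b", "best_match": "y", "similarity": 0.64},
    {"source_text": "c", "best_match": "z", "similarity": 0.78},
]
threshold = 0.70

# Keep matches at or above the threshold, best first
kept = sorted(
    (m for m in matches if m["similarity"] >= threshold),
    key=lambda m: m["similarity"],
    reverse=True,
)

print([m["source_text"] for m in kept])
# prints: ['a', 'c']
```

A threshold that works for one model or collection may be too strict or too lenient for another, so it is worth inspecting a few matches just below and just above the chosen value.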

### Searching in an Article collection sourced from DataLab (Current Substitute API)

This is not how it will look within Datalab (questions about that should go to Daniele). This section only showcases programmatic access through our API.

In [None]:
%pip install --upgrade --force-reinstall impresso
import impresso
impresso_session = impresso.connect()

In [None]:
# Search for a term and retrieve the results, ordered by date
fr_result = impresso_session.search.find(
    term="recette",
    order_by="date")

# Show the identifiers of the first 5 results
print(fr_result.df.index[:5])

# Retrieve the full content item for each of the first 20 results
for uri in fr_result.df.index[:20]:
    article = impresso_session.content_items.get(uri)

## Conclusion

Feel free to try other texts or even other models; the pipeline remains the same.


---
## Project and License info

### Impresso project

[Impresso - Media Monitoring of the Past](https://impresso-project.ch) is an interdisciplinary research project that aims to develop and consolidate tools for processing and exploring large collections of media archives across modalities, time, languages and national borders. The first project (2017-2021) was funded by the Swiss National Science Foundation under grant No. [CRSII5_173719](http://p3.snf.ch/project-173719) and the second project (2023-2027) by the SNSF under grant No. [CRSII5_213585](https://data.snf.ch/grants/grant/213585) and the Luxembourg National Research Fund under grant No. 17498891.

### Copyright

Copyright (C) 2024 The Impresso team.

### License

This program is provided as open source under the [GNU Affero General Public License](https://github.com/impresso/impresso-pyindexation/blob/master/LICENSE) v3 or later.

---

<p align="center">
  <img src="https://github.com/impresso/impresso.github.io/blob/master/assets/images/3x1--Yellow-Impresso-Black-on-White--transparent.png?raw=true" width="350" alt="Impresso Project Logo"/>
</p>