<a target="_blank" href="https://colab.research.google.com/github/impresso/impresso-datalab-notebooks/blob/main/workshop_resources/ws4-embeddings/Linking2impresso.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>


# ⚓️ Linking2Impresso: Connecting Your Data with Impresso

## What is this notebook about?

This notebook shows how to link an external collection of texts (and optionally images) to the Impresso historical media archive using semantic embeddings.  
It walks through preparing a small example dataset, generating embeddings, querying the
Impresso API for similar items, and visualising the results in interactive semantic
graphs.

## Why is this useful?

Semantic linking helps researchers contextualise their own materials within a large historical corpus. By comparing your texts with Impresso documents, you can identify related themes, trace conceptual connections, and explore the broader historical discourse.  
This notebook shows how embedding-based search offers an efficient way to connect local
datasets to digital archives.

## How does it work?

The workflow is straightforward:

1. Prepare or upload your text collection.
2. Generate embeddings with the Impresso embedding model.
3. Query the Impresso API using vector-based similarity search.
4. Visualise input texts and retrieved items using discrete and similarity graphs.

Impresso stores text and image embeddings in a compact Base64-encoded format.  
For local similarity computations, these embeddings must be decoded into numeric vectors.  
The notebook provides helper functions to convert between Impresso's internal format and standard floating-point vectors.

Both text and image embeddings follow the same principle but rely on different
underlying models.

## What will you learn?

In this notebook, you will learn how to:

- Load or provide your own text collection.
- Generate text embeddings using the Impresso API.
- Perform semantic similarity search against the Impresso archive.
- Retrieve and inspect matched historical documents.
- Build and interpret two types of interactive semantic graphs.
- Optionally embed and link your own images.


## Prerequisites

Install the necessary Python packages used throughout this notebook.

Note: The `impresso-py` installation uses a specific branch that provides the embedding-based search functionality.


In [None]:
%pip install git+https://github.com/impresso/impresso-py.git@embeddings-search
%pip install pyvis scikit-learn umap-learn

## Input Data: Example Collections

This notebook includes two small example datasets.  
One consists of French Assembly notes, and the other contains short English texts related to the Geneva Conventions.

The French set is kept for experimentation but is not used by default.  
The English set serves as the main example because the Impresso embedding model performs reliably on English input.

You may replace these samples with your own files, paste text directly, or upload documents through Colab/Drive.
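
As a minimal sketch of the "own files" route, the helper below reads every `.txt` file in a folder into a list of strings that could stand in for `parts`. The function name `load_text_folder`, the folder name `my_texts`, and the `max_chars` cut-off are illustrative assumptions, not part of the notebook or of `impresso-py`.

```python
from pathlib import Path


def load_text_folder(folder, pattern="*.txt", max_chars=None):
    """Read every file matching `pattern` in `folder` into a list of strings."""
    texts = []
    for path in sorted(Path(folder).glob(pattern)):
        # Collapse whitespace so line breaks do not end up in the embedded text
        text = " ".join(path.read_text(encoding="utf-8").split())
        if not text:
            continue  # skip empty files
        texts.append(text[:max_chars] if max_chars else text)
    return texts


# Hypothetical usage: `my_texts` is a placeholder folder name
# parts = load_text_folder("my_texts", max_chars=2000)
```

Files are read in sorted order, so the position of each text in the list stays stable between runs.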

- **Example Collection I: French Assembly Notes.** _This dataset is truncated for demonstration purposes. The goal is simply to illustrate how to prepare a collection._
- **Example Collection II: Geneva Convention Pieces.** _This English dataset is used in the remainder of this notebook as our running example._


### Example Collection I: French Assembly Notes

This is an example collection provided by Martin Grandjean.

The texts are in **French**, so this collection is not used by default. It is kept here for further experiments.


In [None]:
%%bash
Collectionlink="https://gist.githubusercontent.com/flipz357/e22f1f9df1b263d29927ec21440daeab/raw/199869d9ffce9b4121f7a97f00dd57eec4c8e763/gistfile1.txt"
wget $Collectionlink -O collection.md

In [None]:
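import re

# Read the downloaded Markdown file
with open("collection.md", "r", encoding="utf-8") as f:
    collection = f.read()

# Split on Markdown heading lines; text before the first heading is titled "notitle"
heading_pattern = r"^#.*$"
titles = ["notitle"] + re.findall(heading_pattern, collection, flags=re.MULTILINE)
parts = re.split(heading_pattern, collection, flags=re.MULTILINE)
# Flatten line breaks and keep only the first 200 characters of each part
parts = [part.replace("\n", " ")[:200] for part in parts]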
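
To make the heading-based split easier to follow, here is a small self-contained sketch on an invented Markdown string. The helper name `split_markdown_sections` and the sample text are assumptions for illustration. The pairing of titles and bodies follows the same idea as the cell above: text before the first heading receives the placeholder title `notitle`.

```python
import re

HEADING = re.compile(r"^#.*$", flags=re.MULTILINE)


def split_markdown_sections(markdown, max_chars=200):
    """Return (title, text) pairs, one per heading-delimited section."""
    titles = ["notitle"] + HEADING.findall(markdown)
    bodies = HEADING.split(markdown)
    sections = []
    for title, body in zip(titles, bodies):
        # Collapse whitespace and truncate, as in the notebook cell
        text = " ".join(body.split())[:max_chars]
        if text:  # drop sections without any body text
            sections.append((title.lstrip("# ").strip(), text))
    return sections


# Invented sample, only for demonstration
sample = "Preamble\n# First session\nBody one\n# Second session\nBody two\n"
for title, text in split_markdown_sections(sample):
    print(f"{title!r}: {text!r}")
```

Unlike the cell above, this variant drops empty sections, which avoids sending blank strings to the embedding endpoint.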

### Example Collection II: Geneva Convention Pieces

This is an example collection of **English texts**, which we use as the running example in the rest of the notebook. It is short and simple.


In [None]:
parts = [
    (
        "Persons in the hands of the enemy are entitled at all times to respect for"
        " their life and for their physical and mental integrity."
    ),
    (
        "Under the first and second Geneva Conventions of 1949, the belligerents must"
        " protect the sick, wounded and shipwrecked as well as medical personnel,"
        " ambulances and hospitals. All persons protected under these conventions must"
        " be given shelter and cared for by the party to the conflict that holds power"
        " over them."
    ),
    (
        "The third Geneva Convention contains detailed rules on the treatment of"
        " prisoners of war."
    ),
    (
        "The fourth Geneva Convention protects civilians in the hands of the enemy,"
        " whether in their own or in occupied territory."
    ),
    (
        "The first Additional Protocol of 1977 supplements the rules applying to"
        " international armed conflicts contained in the four Geneva Conventions. It"
        " imposes restrictions on the conduct of hostilities; for example, it prohibits"
        " attacks against civilians and civilian objects and restricts the means and"
        " methods of warfare"
    ),
]

## Initialising an Impresso Session

We now establish a connection to the Impresso API.  
The endpoint used here is the development API suitable for demonstrations.  
If you work with privileged or production data, you may need to provide an API token and adjust the base URL accordingly.

The session object gives access to several components:

- `tools` for embedding text or images
- `search` for semantic retrieval
- `content_items` for fetching metadata and full texts


In [None]:
from impresso import connect

impresso_session = connect("https://dev.impresso-project.ch/public-api/v1")

## Embedding the Input Texts and Semantic Search

To link our documents with the Impresso archive, we first generate embeddings for each input text using the Impresso text model.  
Semantic search requires that both the query embeddings and the archive embeddings originate from the same model.

For each input text, we:

1. Compute an embedding.
2. Query the Impresso archive for the most similar documents.
3. Collect the results for further inspection and visualisation.

If your texts are very short or highly domain-specific, the retrieved matches may vary in quality or be sparse.  
Increasing the `limit` parameter may be useful in such cases.
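
To build intuition for what a vector-based similarity search does, the sketch below ranks a toy corpus by cosine similarity to a query and shows how a larger `k` (playing the role of `limit`) returns more, but weaker, matches. This is a local illustration with invented 3-dimensional vectors. It is not how the Impresso API is implemented, and `top_k_cosine` is a hypothetical helper name.

```python
import numpy as np


def top_k_cosine(query, corpus, k=5):
    """Return (index, similarity) pairs for the k corpus vectors closest to query."""
    query = np.asarray(query, dtype=float)
    corpus = np.asarray(corpus, dtype=float)
    # Normalise to unit length so the dot product equals cosine similarity
    query = query / np.linalg.norm(query)
    corpus = corpus / np.linalg.norm(corpus, axis=1, keepdims=True)
    sims = corpus @ query
    order = np.argsort(-sims)[:k]
    return [(int(i), float(sims[i])) for i in order]


# Toy 3-dimensional "embeddings" (invented for illustration)
corpus = [[1.0, 0.0, 0.0], [0.8, 0.6, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
query = [1.0, 0.1, 0.0]

for limit in (2, 4):
    hits = top_k_cosine(query, corpus, k=limit)
    print(limit, [(i, round(s, 3)) for i, s in hits])
```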


In [None]:
# ---embed and search---

# this is to collect the embeddings of our input texts
embeddings = []
# this is to collect the search results
matches = []
for i, part in enumerate(parts):
    # embedding one of your documents
    embedding = impresso_session.tools.embed_text(text=part, target="text")
    # using the embedding to search in impresso
    matches.append(impresso_session.search.find(embedding=embedding, limit=5))
    # storing embedding for later
    embeddings.append(embedding)
    print(f"retrieved search results for text  {i+1}/{len(parts)}")

# ---some post-processing for convenience---

# get ids of search results
matches_uids = [[datum["uid"] for datum in m.raw["data"]] for m in matches]
# get texts of search results
articles = [
    [impresso_session.content_items.get(uid) for uid in uids] for uids in matches_uids
]
# concatenate title and text of search results (either field may be missing)
articles = [
    [
        (article.raw.get("title") or "") + " " + (article.raw.get("transcript") or "")
        for article in a
    ]
    for a in articles
]

## Post-processing Retrieved Matches

The semantic search returns a list of content item identifiers.  
To analyse the results, we fetch each item's title and transcript and combine them into short previews.  
Note that OCR quality varies across newspapers in Impresso; retrieved text may occasionally include OCR noise.

This post-processing step prepares the data for visualisation and supports quick manual inspection of semantic matches.
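
A possible way to produce compact previews is sketched below. It tolerates missing titles or transcripts and trims long texts at a word boundary. The helper name `make_preview`, the 300-character default, and the sample strings are assumptions made for this example.

```python
import textwrap


def make_preview(title, transcript, max_chars=300):
    """Join title and transcript into a one-line preview, tolerating missing fields."""
    pieces = [piece.strip() for piece in (title, transcript) if piece and piece.strip()]
    if not pieces:
        return "(no text available)"
    # shorten() collapses runs of whitespace and cuts at a word boundary
    return textwrap.shorten(" ".join(pieces), width=max_chars, placeholder=" ...")


# Invented sample strings
print(make_preview("Genève", "La  conférence\ndiplomatique s'est ouverte hier."))
print(make_preview(None, None))
```

Shorter previews also keep the hover tooltips in the graphs readable.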


## Exploring the Links: Bipartite Overview Graph

The first visualisation is a bipartite graph connecting each input text with the documents returned by semantic search.  
Input texts and retrieved items are displayed as two node types, using colours to distinguish them.

This representation helps you see, at a glance:

- which Impresso items are linked to each input text
- how many results each query retrieves
- how input data clusters conceptually

Hovering over a node reveals a tooltip with the text preview and a link to the corresponding item in the Impresso interface.
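
The same information can be summarised without drawing anything. The sketch below counts results per input and lists documents retrieved by more than one input, which hints at inputs that are conceptually close. The identifiers are made up, and `summarise_links` is a hypothetical helper that expects the same nested shape as `matches_uids`.

```python
from collections import defaultdict


def summarise_links(result_uids):
    """Count results per input and find results retrieved by more than one input."""
    per_input = [len(uids) for uids in result_uids]
    inputs_by_result = defaultdict(list)
    for input_idx, uids in enumerate(result_uids):
        # dict.fromkeys removes duplicates within one result list, keeping order
        for uid in dict.fromkeys(uids):
            inputs_by_result[uid].append(input_idx)
    shared = {uid: idxs for uid, idxs in inputs_by_result.items() if len(idxs) > 1}
    return per_input, shared


# Made-up identifiers: one list of result ids per input text
toy_uids = [["doc-a", "doc-b"], ["doc-b", "doc-c"], ["doc-d"]]
per_input, shared = summarise_links(toy_uids)
print(per_input)
print(shared)
```

In the graph cell of this notebook every result node is named after its input and rank, so a document returned for two inputs appears as two separate nodes. A summary like this one is a simple way to spot such overlaps.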


### 🕸️ **Discrete Graph**: Overview of input texts and search results


In [None]:
import os
from pyvis.network import Network
from IPython.display import display, HTML

# Create the network
net = Network(height="600px", width="100%", notebook=True, cdn_resources="in_line")

# Add input nodes
for i, text in enumerate(parts):
    net.add_node(f"I{i}", label=f"Input {i+1}", title=text, color="lightblue", size=25)

# Add result nodes and edges
for i, res_list in enumerate(articles):
    for j, rtext in enumerate(res_list):
        node_id = f"R{i}_{j}"
        url = "https://dev.impresso-project.ch/app/article/" + matches_uids[i][j]
        title = rtext + f"<br><a href={url} target='_blank'>Link</a>"
        net.add_node(node_id, label=f"R{i+1}.{j+1}", title=title, color="lightgreen")
        net.add_edge(f"I{i}", node_id)

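# Save
net.save_graph("graph.html")
abs_path = os.path.abspath("graph.html")
print(f"Graph saved to {abs_path}")
# Display inline in Colab: pass the path via `filename`, otherwise the
# literal string "graph.html" is rendered instead of the file's content
display(HTML(filename="graph.html"))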

## Similarity Graph: Structure Beyond Direct Matches

Whereas the bipartite graph shows only input→result connections,  
the similarity graph reveals relationships among all documents: inputs and retrieved results alike.

This graph is constructed as follows:

- All embeddings are mapped into a 2D space using UMAP.
- Cosine similarity is computed between every pair of documents.
- Edges are drawn only when the similarity exceeds a chosen threshold.

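UMAP provides an approximate visual clustering of documents.  
It is a stochastic algorithm, so layouts can differ between runs; the code in this notebook fixes `random_state` to keep results more reproducible, but small variations across environments are still possible.  
For very small datasets, the projection may appear unstable; this is normal behaviour.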
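
Choosing the threshold is a judgment call. The sketch below, which uses random stand-in vectors rather than real embeddings, shows how the number of edges shrinks as the threshold rises. The helper name `similarity_edges` is an assumption for this example; it computes the same pairwise cosine similarities with plain NumPy.

```python
import numpy as np


def similarity_edges(vectors, threshold):
    """Return (i, j, similarity) for every pair whose cosine similarity exceeds threshold."""
    vectors = np.asarray(vectors, dtype=float)
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    sims = unit @ unit.T
    # Upper triangle only (k=1): each pair appears once and self-loops are skipped
    rows, cols = np.triu_indices(len(vectors), k=1)
    return [
        (int(i), int(j), float(sims[i, j]))
        for i, j in zip(rows, cols)
        if sims[i, j] > threshold
    ]


# Random stand-in vectors; real embeddings would come from the archive
rng = np.random.default_rng(0)
toy = rng.normal(size=(8, 16))
for threshold in (0.0, 0.3, 0.7):
    print(threshold, len(similarity_edges(toy, threshold)))
```

Running a quick sweep like this on your own embeddings before drawing the graph can help pick a threshold that yields a readable number of edges.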


#### **Before we start**: Handy functions for mapping Impresso embeddings to vectors

- We need vectors to compute the similarity function
- Impresso embeddings are encoded with a specialized string format
- So we build and apply functions that map from the string format to an actual vector, and back


In [None]:
import base64
import struct


# --- our handy mapping functions ---
def string2vector(embedding_string):
    # convert base64 string to a float array
    _, arr = embedding_string.split(":")
    arr = base64.b64decode(arr)
    embedding_vector = [
        struct.unpack("f", arr[i : i + 4])[0] for i in range(0, len(arr), 4)
    ]
    return embedding_vector


# the inverse function; not used here, included for completeness
def vector2string(vec, prefix="gte-768"):
    # pack floats into bytes
    arr = b"".join(struct.pack("f", x) for x in vec)
    # encode bytes to base64 string
    encoded = base64.b64encode(arr).decode("utf-8")
    # return in same format as original ("prefix:encoded_string")
    return f"{prefix}:{encoded}"


# --- applying the mapping functions ---
# first we get the embeddings for the matches (in Impresso format, note that we already have the embedding for the input)
matches_embeddings = [
    [impresso_session.content_items.get_embeddings(uid)[0] for uid in match_uid]
    for match_uid in matches_uids
]
# map input embedding string format to vector
part_embeddings = [string2vector(emb) for emb in embeddings]
# map match embedding string format to vector
article_embeddings = [
    [string2vector(emb) for emb in embs] for embs in matches_embeddings
]

#### Build the similarity graph

Idea:

- The graph above is a _"discrete graph"_: it only shows the matches for every input document.
- Now we build a "softer", more connected graph that:
  - weights each connection by its similarity strength,
  - shows connectivity _among the returned documents_,
  - captures similarity relations between any pair of texts, whether input texts or search results.

For a proper layout (similar nodes placed close together), we rely on UMAP.

- UMAP (Uniform Manifold Approximation and Projection) is a dimensionality reduction technique that projects high-dimensional data (like embeddings) into 2D or 3D space while preserving the data's underlying structure and similarity relationships.


In [None]:
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
from pyvis.network import Network
import umap


def prepare_flat_results(article_embeddings, articles, matches_uids):
    """
    Flatten nested embeddings, texts, and UIDs and ensure alignment.
    """
    flat_embs = np.vstack(article_embeddings)
    flat_texts = [txt for group in articles for txt in group]
    flat_uris = [uid for group in matches_uids for uid in group]

    assert len(flat_embs) == len(flat_texts) == len(flat_uris), "Alignment error"
    return flat_embs, flat_texts, flat_uris


def reduce_to_2d(embeddings, metric="cosine", random_state=42):
    """
    Apply UMAP to reduce embeddings to 2D coordinates.
    """
    reducer = umap.UMAP(n_components=2, metric=metric, random_state=random_state)
    return reducer.fit_transform(embeddings)


def build_graph(
    embeddings,
    texts,
    node_types,
    uris,
    coords,
    threshold=0.7,
    outfile="embedding_similarity_graph.html",
):
    """
    Build and save a PyVis graph from embeddings, metadata and 2D coordinates.
    """
    net = Network(height="700px", width="100%", notebook=True, cdn_resources="in_line")

    # Scale node coordinates to visible ranges
    scale = float(np.std(coords) * 150) or 150.0

    result_idx = 1
    for i, (text, ntype) in enumerate(zip(texts, node_types)):
        is_input = ntype == "input"
        color = "lightblue" if is_input else "lightgreen"
        size = 25 if is_input else 15
        label = f"I{i+1}" if is_input else f"R{result_idx}"
        if not is_input:
            result_idx += 1

        x, y = coords[i] * scale
        url = (
            None
            if is_input
            else f"https://dev.impresso-project.ch/app/article/{uris[i]}"
        )
        title = (
            text if is_input else text + f"<br><a href={url} target='_blank'>Link</a>"
        )

        net.add_node(
            i,
            label=label,
            title=title,
            color=color,
            size=size,
            x=float(x),
            y=float(y),
            fixed={"x": True, "y": True},
        )

    # Add edges from cosine similarity
    sim_matrix = cosine_similarity(embeddings)
    for i in range(len(embeddings)):
        for j in range(i + 1, len(embeddings)):
            sim = sim_matrix[i, j]
            if sim > threshold:
                net.add_edge(i, j, value=float(sim), title=f"Similarity: {sim:.2f}")

    net.set_options(
        """
    {"physics": {"enabled": false}}
    """
    )

    net.save_graph(outfile)
    abs_path = os.path.abspath(outfile)
    print(f"Graph saved to {abs_path}")

In [None]:
# Example usage: building the similarity graph

# 1. Flatten Impresso results
a_flat, texts_flat, uris_flat = prepare_flat_results(
    article_embeddings, articles, matches_uids
)

# 2. Prepare combined data
all_embeddings = np.vstack([part_embeddings, a_flat])
all_texts = parts + texts_flat
node_types = ["input"] * len(part_embeddings) + ["result"] * len(a_flat)
full_uris = [None] * len(part_embeddings) + uris_flat

# 3. Dimensionality reduction (UMAP)
coords = reduce_to_2d(all_embeddings)

# 4. Build and save graph (one call, no extra parameters needed)
build_graph(
    all_embeddings,
    all_texts,
    node_types,
    full_uris,
    coords,
    threshold=0.7,
    outfile="embedding_similarity_graph.html",
)

# 5. Display inline in Colab
display(HTML(filename="embedding_similarity_graph.html"))

## Bonus: Linking Images through Similarity

The Impresso API also supports embeddings for images.  
The process mirrors the text workflow:

1. Provide an image (URL or local file).
2. Generate an embedding via `tools.embed_image`.
3. Retrieve visually related images from the Impresso collection.
4. Build a similarity graph illustrating the connections.

Image embeddings are based on a different underlying model than text embeddings.  
They capture visual properties and may respond strongly to layout, contrast, or style.  
You may replace the example image with any of your own files.


#### Choosing an image


In [None]:
from IPython.display import Image

# picture of the notebook's author, feel free to exchange with any other image
image_url = "https://www.juriopitz.com/assets/img/me.png"
display(Image(url=image_url))

#### Embedding this image and searching similar ones in Impresso

Similar to how we embedded texts, we can also embed images.


In [None]:
img_embedding = impresso_session.tools.embed_image(image=image_url, target="image")
matches = impresso_session.images.find(embedding=img_embedding, limit=4)

#### Visualizing the result as a graph

Just as we built a connected network of input texts and texts in Impresso, we can build a connected network of the input image and images in Impresso, based solely on the similarity of their embeddings.


In [None]:
# --- pre-processing for convenience ---
matches_uids = [datum["uid"] for datum in matches.raw["data"]]
links_uids = [datum["contentItemUid"] for datum in matches.raw["data"]]
matches_embeddings = [
    impresso_session.images.get_embeddings(uid)[0] for uid in matches_uids
]
# use the image embedding computed above, not the last text embedding
part_embeddings = [string2vector(emb) for emb in [img_embedding]]
article_embeddings = [
    [string2vector(emb) for emb in embs] for embs in [matches_embeddings]
]
matches_uids = [matches_uids]
links_uids = [links_uids]

flat_article_embeddings = np.vstack(article_embeddings)
# images carry no text, so use one placeholder string per retrieved image
flat_articles = ["Impresso Image"] * len(flat_article_embeddings)
flat_uris = [None] * len(part_embeddings) + [
    r for sublist in links_uids for r in sublist
]

# Combine everything for similarity computation
all_embeddings = np.vstack([part_embeddings, flat_article_embeddings])
all_texts = ["Input Image"] * len(part_embeddings) + flat_articles
node_types = ["input"] * len(part_embeddings) + ["result"] * len(
    flat_article_embeddings
)

# Dimensionality reduction (UMAP)
reducer = umap.UMAP(n_components=2, random_state=42)
coords = reducer.fit_transform(all_embeddings)

# Compute cosine similarities
sim_matrix = cosine_similarity(all_embeddings)

#  Build the PyVis network
net = Network(height="700px", width="100%", notebook=True, cdn_resources="in_line")
scale = 150
# Add nodes
for i, (text, ntype) in enumerate(zip(all_texts, node_types)):
    color = "lightblue" if ntype == "input" else "lightgreen"
    size = 25 if ntype == "input" else 15
    # result labels count from 1, offset by the number of input nodes
    label = f"I{i+1}" if ntype == "input" else f"R{i - len(part_embeddings) + 1}"
    x, y = coords[i] * scale  # scale coordinates
    url = (
        None
        if ntype == "input"
        else "https://dev.impresso-project.ch/app/article/" + flat_uris[i]
    )
    title = (
        "Input Image"
        if ntype == "input"
        else "Impresso Image" + f"<br><a href={url} target='_blank'>Link</a>"
    )
    net.add_node(
        i,
        label=label,
        title=title,
        color=color,
        size=size,
        x=float(x),
        y=float(y),
        fixed={"x": True, "y": True},  # properly fix both axes
    )

# Add edges based on similarity
threshold = 0.0  # Adjust to make the graph denser/sparser
for i in range(len(all_embeddings)):
    for j in range(i + 1, len(all_embeddings)):
        sim = float(sim_matrix[i, j])
        if sim > threshold:
            net.add_edge(i, j, value=sim, title=f"Similarity: {sim:.2f}")


# --- Display inline in Colab ---
net.set_options(
    """
{
  "physics": {
    "enabled": false
  }
}
"""
)
net.save_graph("image_embedding_similarity_graph.html")
abs_path = os.path.abspath("image_embedding_similarity_graph.html")
# pass the path via `filename` so the saved file is rendered
display(HTML(filename="image_embedding_similarity_graph.html"))
print(f"Graph saved to {abs_path}")

## Using an External Embedding Model

You can also load the same embedding model used in Impresso directly from the SentenceTransformers library.
This allows you to compute embeddings offline, integrate them into your own workflows, or compare Impresso API outputs with embeddings generated locally.

In this example, we load the `gte-multilingual-base` model, which underlies the Impresso text embedding system.
We then compute an embedding for a simple test string and compare it with the corresponding embedding returned by the Impresso API.
This verification step confirms that both sources produce consistent representations.

Note that transformer models downloaded through SentenceTransformers are cached automatically within the current Colab runtime. Subsequent runs in the same runtime will therefore reuse the cached model without additional downloads, but a new Colab session starts with an empty cache and will trigger a fresh download.


In [None]:
%pip install 'sentence-transformers>=3.0.0'

In [None]:
# These locally computed embeddings can replace the API-based
# `part_embeddings` and `article_embeddings` used for the similarity graph:
from sentence_transformers import SentenceTransformer

model_name_or_path = "Alibaba-NLP/gte-multilingual-base"
model = SentenceTransformer(model_name_or_path, trust_remote_code=True)
part_embeddings = model.encode(parts, normalize_embeddings=True)
article_embeddings = [
    model.encode(articleset, normalize_embeddings=True) for articleset in articles
]

# verify that the embedding model works as intended:
print(model.encode(["hello"], normalize_embeddings=True)[0][:4])
print(string2vector(impresso_session.tools.embed_text(text="hello", target="text"))[:4])
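
Printing the first few values gives a quick visual check. A slightly more systematic comparison is sketched below, using toy vectors in place of a local and an API embedding. The helper name `compare_embeddings` and the tolerance `atol=1e-3` are assumptions; a suitable tolerance depends on numeric precision and on whether both sources normalise their vectors.

```python
import numpy as np


def compare_embeddings(local_vec, api_vec, atol=1e-3):
    """Report cosine similarity and the largest element-wise gap between two vectors."""
    a = np.asarray(local_vec, dtype=float)
    b = np.asarray(api_vec, dtype=float)
    if a.shape != b.shape:
        raise ValueError(f"shape mismatch: {a.shape} vs {b.shape}")
    cosine = float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    max_gap = float(np.max(np.abs(a - b)))
    return {"cosine": cosine, "max_abs_diff": max_gap, "close": bool(max_gap <= atol)}


# Toy vectors standing in for a local and an API embedding
local = np.array([0.6, 0.8, 0.0])
remote = local + np.array([1e-4, -1e-4, 0.0])
print(compare_embeddings(local, remote))
```

A cosine similarity very close to 1 together with a small maximum difference suggests that both sources produce the same representation. A high cosine with a large difference may point to a normalisation mismatch.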

## Conclusion

This notebook presented a complete workflow for linking external text and image collections to the Impresso archive using semantic embeddings.  
You learned how to generate embeddings, run similarity searches, examine retrieved material, and explore connections through interactive visualisations.

Embedding-based methods are effective for exploration, but their results should be interpreted with care.  
Similarity scores reflect model behaviour and the quality of the underlying data, including OCR variation.  
The visualisations are best viewed as an aid for discovery rather than as evidence of direct historical or semantic relationships.


## Next Steps

You can continue exploring:

- Notebook: Introduction to semantic similarity search
- Notebook: Visualising newspaper collections
- Notebook: Working with Impresso OCR and NLP tools

Feel free to adapt this workflow to your own datasets.
