<a target="_blank" href="https://colab.research.google.com/github/impresso/impresso-datalab-notebooks/blob/main/workshop_resources/ws4-embeddings/Linking2impresso.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

# ⚓️ Linking2Impresso: Connecting your Data with Impresso

This notebook demonstrates how to "dock" an external text collection (bonus: image collection) into the Impresso historical media archive using semantic embeddings. By transforming textual data into high-dimensional vector representations, we can measure conceptual similarity between new materials and the vast Impresso corpus. The workflow illustrates the full pipeline: from **preparing a small example collection**, through **embedding and semantic search via the Impresso API**, to **exploring connections through an interactive network visualization**. The resulting graph allows us to quickly see how our own texts resonate with historical documents, opening pathways for contextualization, cross-referencing, and discovery across linguistic and temporal boundaries.

In summary, we are going to:
1. Load a custom text collection  
2. Compute embeddings using Impresso’s model  
3. Search for similar content in the Impresso archive  
4. Visualize the matches as an interactive bipartite graph

*Author of this notebook: Juri Opitz. **©Impresso***

## Installing necessities

We install the Impresso Python library as well as some tools for visualization.

In [None]:
!pip install git+https://github.com/impresso/impresso-py.git@embeddings-search
# pyvis draws the graphs; umap-learn provides the `umap` module imported further below
!pip install pyvis umap-learn

## Input Data: Our Example Collections

We are using two example collections here, one of French texts and one of English texts.

Feel free to replace the example collections with any of your texts of interest.

### Example Collection I: French Assembly Notes

This is an example collection provided by Martin Grandjean.

These are **French texts**, so we do not use this collection by default, but we leave it here for further experiments.

In [None]:
%%bash
Collectionlink="https://gist.githubusercontent.com/flipz357/e22f1f9df1b263d29927ec21440daeab/raw/199869d9ffce9b4121f7a97f00dd57eec4c8e763/gistfile1.txt"
wget $Collectionlink -O collection.md

In [None]:
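import re

with open("collection.md", "r", encoding="utf-8") as f:
    collection = f.read()
# split the document at its markdown headings; the text before the first heading gets the placeholder title "notitle"
parts = re.split("^#.*$", collection, flags=re.MULTILINE)
titles = ["notitle"] + re.findall("^#.*$", collection, flags=re.MULTILINE)
# flatten each part to a single line and keep only its first 200 characters
parts = [part.replace("\n", " ")[:200] for part in parts]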
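
To see what this split does, here is a minimal sketch on a made-up markdown string (the toy text and the `toy_` names are assumptions for illustration, not part of the collection). Each heading line acts as a separator, and the `notitle` placeholder keeps titles and parts aligned:

```python
import re

# hypothetical toy document; any markdown text with "#" headings behaves the same way
toy_collection = "preamble\n# First speech\nbody one\n# Second speech\nbody two\n"

# a heading is any line that starts with "#"
heading = re.compile(r"^#.*$", flags=re.MULTILINE)
toy_parts = heading.split(toy_collection)
toy_titles = ["notitle"] + heading.findall(toy_collection)

# every part now has exactly one title
for title, part in zip(toy_titles, toy_parts):
    print(repr(title), "->", repr(part.strip()))
```

Using separate `toy_` variables keeps the notebook's own `parts` and `titles` untouched.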

### Example Collection II: Geneva Convention Pieces

This is an example collection of **English texts**, which we use as the basis for our experiments. It is fairly short and simple.

In [None]:
parts = ["Persons in the hands of the enemy are entitled at all times to respect for their life and for their physical and mental integrity.",
         "Under the first and second Geneva Conventions of 1949, the belligerents must protect the sick, wounded and shipwrecked as well as medical personnel, ambulances and hospitals. All persons protected under these conventions must be given shelter and cared for by the party to the conflict that holds power over them.",
         "The third Geneva Convention contains detailed rules on the treatment of prisoners of war.",
         "The fourth Geneva Convention protects civilians in the hands of the enemy, whether in their own or in occupied territory.",
         "The first Additional Protocol of 1977 supplements the rules applying to international armed conflicts contained in the four Geneva Conventions. It imposes restrictions on the conduct of hostilities; for example, it prohibits attacks against civilians and civilian objects and restricts the means and methods of warfare"]

## Initialize an Impresso session

We connect with the Impresso API 🔥

In [None]:
from impresso import connect

impresso_session = connect('https://dev.impresso-project.ch/public-api/v1')

## Linking! 🔗

### Use Impresso embedding model for embedding your collection

We are leveraging the Impresso embedding model for embedding our input texts.

🖊 It's important that we apply the same embedding model to our texts and to the texts in the database (here, Impresso). Only then are the similarities meaningful.

In [None]:
# ---embed and search---

# this is to collect the embeddings of our input texts
embeddings = []
# this is to collect the search results
matches = []
for i, part in enumerate(parts):
    # embedding one of your documents
    embedding = impresso_session.tools.embed_text(text=part, target="text")
    # using the embedding to search in impresso
    matches.append(impresso_session.search.find(embedding=embedding, limit=5))
    # storing embedding for later
    embeddings.append(embedding)
    print(f"retrieved search results for text  {i+1}/{len(parts)}")

# ---some post-processing for convenience---

# get ids of search results
matches_uids = [[datum["uid"] for datum in m.raw["data"]] for m in matches]
# get texts of search results
articles = [[impresso_session.content_items.get(uid) for uid in uids] for uids in matches_uids]
# concatenate title and text of search results (a missing title or transcript becomes an empty string)
articles = [[(article.raw.get("title") or "") + " " + (article.raw.get("transcript") or "") for article in a] for a in articles]
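
Conceptually, searching by embedding means ranking archive items by how similar their vectors are to the query vector. How the Impresso API does this on the server is not shown in this notebook, so the following is only a toy sketch: the 3-dimensional vectors are made up, `cosine` and `top_k` are hypothetical helpers, and cosine similarity as the ranking score is an assumption.

```python
import math

def cosine(u, v):
    # cosine similarity: dot product divided by the product of the vector lengths
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def top_k(query, corpus, k=2):
    # hypothetical helper: (index, score) pairs of the k corpus vectors most similar to the query
    scored = [(idx, cosine(query, vec)) for idx, vec in enumerate(corpus)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]

# made-up stand-ins for archive embeddings and for one query embedding
toy_corpus = [[1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0],
              [1.0, 1.0, 0.0]]
toy_query = [1.0, 0.2, 0.0]

toy_hits = top_k(toy_query, toy_corpus, k=2)
print(toy_hits)
```

The `limit=5` argument in the cell above plays the role of `k` here.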

## 🕸️📊 Exploring the links with bipartite graphs

### 🕸️: **Discrete Graph**: Overview of input texts and search results

In [None]:
from pyvis.network import Network
from IPython.display import display, HTML

# Create the network
net = Network(height="600px", width="100%", notebook=True, cdn_resources='in_line')

# Add input nodes
for i, text in enumerate(parts):
    net.add_node(f"I{i}", label=f"Input {i+1}", title=text, color="lightblue", size=25)

# Add result nodes and edges
for i, res_list in enumerate(articles):
    for j, rtext in enumerate(res_list):
        node_id = f"R{i}_{j}"
        url = "https://dev.impresso-project.ch/app/article/" + matches_uids[i][j]
        title = rtext + f"<br><a href={url} target='_blank'>Link</a>"
        net.add_node(node_id, label=f"R{i+1}.{j+1}", title=title, color="lightgreen")
        net.add_edge(f"I{i}", node_id)

# Save and embed in Colab
net.save_graph("graph.html")

# Display inline in Colab
display(HTML("graph.html"))


### 🕸️🧠 **Similarity Graph**: Revealing additional relations, and relation strengths

#### **Before we start**: Handy functions for mapping Impresso embeddings to vectors

- We need vectors to compute the similarity function
- Impresso embeddings are encoded with a specialized string format
- So we build and apply functions that map from the string format to an actual vector, and back

In [None]:
import base64
import struct

# --- our handy mapping functions ---
def string2vector(embedding_string):
    # convert base64 string to a float array
    _, arr = embedding_string.split(':')
    arr = base64.b64decode(arr)
    embedding_vector = [struct.unpack('f', arr[i:i+4])[0] for i in range(0, len(arr), 4)]
    return embedding_vector

# the inverse function; not used here, included for completeness' sake
def vector2string(vec, prefix="gte-768"):
    # pack floats into bytes
    arr = b''.join(struct.pack('f', x) for x in vec)
    # encode bytes to base64 string
    encoded = base64.b64encode(arr).decode('utf-8')
    # return in same format as original ("prefix:encoded_string")
    return f"{prefix}:{encoded}"

# --- applying the mapping functions ---
# first we get the embeddings for the matches (in Impresso format, note that we already have the embedding for the input)
matches_embeddings = [[impresso_session.content_items.get_embeddings(uid)[0] for uid in match_uid] for match_uid in matches_uids]
# map input embedding string format to vector
part_embeddings = [string2vector(emb) for emb in embeddings]
# map match embedding string format to vector
article_embeddings = [[string2vector(emb) for emb in embs] for embs in matches_embeddings]
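
The same mapping can be written so that the whole buffer is unpacked in one call and the byte order is stated explicitly. This is a sketch under the assumption that the payload is a plain sequence of little-endian 32-bit floats, which is what the functions above read on common platforms. The helper names `decode_embedding` and `encode_embedding` and the toy vector are made up for illustration; the `gte-768` prefix is copied from the default used above.

```python
import base64
import struct

def decode_embedding(embedding_string):
    # split "prefix:payload" once, then decode the base64 payload into raw bytes
    prefix, payload = embedding_string.split(":", 1)
    raw = base64.b64decode(payload)
    count = len(raw) // 4  # 4 bytes per 32-bit float
    return prefix, list(struct.unpack(f"<{count}f", raw))

def encode_embedding(vec, prefix="gte-768"):
    # pack all floats at once (little-endian), then base64-encode the bytes
    raw = struct.pack(f"<{len(vec)}f", *vec)
    return f"{prefix}:{base64.b64encode(raw).decode('utf-8')}"

# round trip with values that 32-bit floats represent exactly
toy_vector = [0.5, -1.25, 2.0]
toy_string = encode_embedding(toy_vector)
toy_prefix, toy_decoded = decode_embedding(toy_string)
print(toy_string)
print(toy_prefix, toy_decoded)
```

Returning the prefix alongside the vector makes it easy to check that two embeddings come from the same model before comparing them.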

#### Build the similarity graph

Idea:

- The graph above is a *"discrete graph"*:
    - It only shows which matches were returned for each input document
- Now we would like to build a graph that is more connected and that shows:
    - The strength of each connection
    - The connectivity *among the returned documents*
- So this time we build a softer graph:
    - It captures similarity relations between any two texts,
    - Whether they are input texts or search results

To position the nodes so that similar texts appear close together, we rely on UMAP.
- UMAP (Uniform Manifold Approximation and Projection) is a dimensionality reduction technique that projects high-dimensional data (like embeddings) into 2D or 3D space while preserving the data’s underlying structure and similarity relationships.
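
The edge rule of the softer graph can be sketched without any libraries: compute the cosine similarity for every pair of vectors and keep a pair as an edge only if it exceeds a threshold. The 2-dimensional vectors and the `toy_` names below are made up for illustration; the threshold of 0.7 mirrors the value used in the cell that follows.

```python
import math

def cosine(u, v):
    # cosine similarity: dot product divided by the product of the vector lengths
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

# made-up stand-ins for embeddings: the middle vector lies between the other two
toy_vectors = [[1.0, 0.0],
               [1.0, 1.0],
               [0.0, 1.0]]

toy_threshold = 0.7
toy_edges = [(i, j)
             for i in range(len(toy_vectors))
             for j in range(i + 1, len(toy_vectors))
             if cosine(toy_vectors[i], toy_vectors[j]) > toy_threshold]
print(toy_edges)  # → [(0, 1), (1, 2)]
```

Vectors 0 and 2 are orthogonal (similarity 0), so they stay unconnected, while both are linked to the vector in between (similarity of about 0.71). Raising the threshold above 0.71 would remove all edges.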

In [None]:
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
from pyvis.network import Network
import umap

# Flatten results for convenience
flat_article_embeddings = np.vstack(article_embeddings)
flat_articles = [r for sublist in articles for r in sublist]
flat_uris = [None] * len(part_embeddings) + [r for sublist in matches_uids for r in sublist]

# Combine everything for similarity computation
all_embeddings = np.vstack([part_embeddings, flat_article_embeddings])
all_texts = parts + flat_articles
node_types = ['input'] * len(part_embeddings) + ['result'] * len(flat_article_embeddings)

# Dimensionality reduction (UMAP)
reducer = umap.UMAP(n_components=2, random_state=42)
coords = reducer.fit_transform(all_embeddings)

# Compute cosine similarities
sim_matrix = cosine_similarity(all_embeddings)

#  Build the PyVis network
net = Network(height="700px", width="100%", notebook=True, cdn_resources='in_line')
scale = 150
# Add nodes
for i, (text, ntype) in enumerate(zip(all_texts, node_types)):
    color = "lightblue" if ntype == "input" else "lightgreen"
    size = 25 if ntype == "input" else 15
    # result nodes are numbered from 1, counting from the first node after the inputs
    label = f"I{i+1}" if ntype == "input" else f"R{i - len(part_embeddings) + 1}"
    x, y = coords[i] * scale  # scale coordinates
    url = None if ntype == "input" else "https://dev.impresso-project.ch/app/article/" + flat_uris[i]
    title = text if ntype == "input" else text + f"<br><a href={url} target='_blank'>Link</a>"
    net.add_node(
        i,
        label=label,
        title=title,
        color=color,
        size=size,
        x=float(x),
        y=float(y),
        fixed={'x': True, 'y': True}  # properly fix both axes
    )

# Add edges based on similarity
threshold = 0.7  # Adjust to make the graph denser/sparser
for i in range(len(all_embeddings)):
    for j in range(i + 1, len(all_embeddings)):
        sim = float(sim_matrix[i, j])
        if sim > threshold:
            net.add_edge(i, j, value=sim, title=f"Similarity: {sim:.2f}")


# --- Display inline in Colab ---
net.set_options("""
{
  "physics": {
    "enabled": false
  }
}
""")
net.save_graph("embedding_similarity_graph.html")
display(HTML("embedding_similarity_graph.html"))

## 🐣 Bonuses

This section is pure bonus.

- We learn how to dock any of our images similarly to texts.
- And we learn more about the text embedding model that is used in Impresso, verifying its results.

### Docking images

What we did above with texts also works with images. Here we take an image and link it to Impresso through a similarity graph.

#### Defining any image

In [None]:
from IPython.display import Image
# picture of the notebook's author, feel free to exchange with any other image
image_url = "https://www.juriopitz.com/assets/img/me.png"
display(Image(url=image_url))

#### Embedding this image and searching similar ones in Impresso

Similar to how we embedded texts, we can also embed images.

In [None]:
img_embedding = impresso_session.tools.embed_image(image=image_url, target="image")
matches = impresso_session.images.find(
  embedding=img_embedding,
  limit=4
)

#### Visualizing the result as a graph

Similar to how we created the connected network of input texts and texts in Impresso, we can create a connected network of the input image and images in Impresso, all based solely on the semantic similarity of embeddings.

In [None]:
# --pre-processing-- for convenience
matches_uids = [datum["uid"] for datum in matches.raw["data"]]
links_uids = [datum["contentItemUid"] for datum in matches.raw["data"]]
matches_embeddings = [impresso_session.images.get_embeddings(uid)[0] for uid in matches_uids]
# use the image embedding here, not the last text embedding left over from the text loop
part_embeddings = [string2vector(emb) for emb in [img_embedding]]
article_embeddings = [[string2vector(emb) for emb in embs] for embs in [matches_embeddings]]
matches_uids = [matches_uids]
links_uids = [links_uids]

flat_article_embeddings = np.vstack(article_embeddings)
flat_uris = [None] * len(part_embeddings) + [r for sublist in links_uids for r in sublist]

# Combine everything for similarity computation
all_embeddings = np.vstack([part_embeddings, flat_article_embeddings])
# one placeholder text per node: the input image first, then the retrieved images
all_texts = ["Input Image"] * len(part_embeddings) + ["Impresso Image"] * len(flat_article_embeddings)
node_types = ['input'] * len(part_embeddings) + ['result'] * len(flat_article_embeddings)

# Dimensionality reduction (UMAP)
reducer = umap.UMAP(n_components=2, random_state=42)
coords = reducer.fit_transform(all_embeddings)

# Compute cosine similarities
sim_matrix = cosine_similarity(all_embeddings)

#  Build the PyVis network
net = Network(height="700px", width="100%", notebook=True, cdn_resources='in_line')
scale = 150
# Add nodes
for i, (text, ntype) in enumerate(zip(all_texts, node_types)):
    color = "lightblue" if ntype == "input" else "lightgreen"
    size = 25 if ntype == "input" else 15
    # result nodes are numbered from 1, counting from the first node after the input image
    label = f"I{i+1}" if ntype == "input" else f"R{i - len(part_embeddings) + 1}"
    x, y = coords[i] * scale  # scale coordinates
    url = None if ntype == "input" else "https://dev.impresso-project.ch/app/article/" + flat_uris[i]
    title = "Input Image" if ntype == "input" else "Impresso Image" + f"<br><a href={url} target='_blank'>Link</a>"
    net.add_node(
        i,
        label=label,
        title=title,
        color=color,
        size=size,
        x=float(x),
        y=float(y),
        fixed={'x': True, 'y': True}  # properly fix both axes
    )

# Add edges based on similarity
threshold = 0.0  # Adjust to make the graph denser/sparser
for i in range(len(all_embeddings)):
    for j in range(i + 1, len(all_embeddings)):
        sim = float(sim_matrix[i, j])
        if sim > threshold:
            net.add_edge(i, j, value=sim, title=f"Similarity: {sim:.2f}")


# --- Display inline in Colab ---
net.set_options("""
{
  "physics": {
    "enabled": false
  }
}
""")
net.save_graph("embedding_similarity_graph.html")
display(HTML("embedding_similarity_graph.html"))

### Externally applying the Embedding Model

The last cells of this notebook show how we can load an external embedding model to embed any texts. As an example, we load the same model that is currently used in Impresso. We download it from its original source and verify that it delivers the same embeddings as the one in Impresso.


In [None]:
# the quotes keep the shell from treating ">" as an output redirection
!pip install "sentence-transformers>=3.0.0"

In [None]:
# These embeddings can replace the ones computed in the first cell of the Similarity Graph section:
from sentence_transformers import SentenceTransformer

model_name_or_path="Alibaba-NLP/gte-multilingual-base"
model = SentenceTransformer(model_name_or_path, trust_remote_code=True)
part_embeddings = model.encode(parts, normalize_embeddings=True)
article_embeddings = [model.encode(articleset, normalize_embeddings=True) for articleset in articles]

# verify that the embedding model works as intended:
print(model.encode(["hello"], normalize_embeddings=True)[0][:4])
print(string2vector(impresso_session.tools.embed_text(text="hello", target="text"))[:4])
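
Comparing the first four numbers by eye is a quick check. A slightly more systematic comparison looks at the largest element-wise difference and at the cosine similarity of the full vectors. The sketch below uses made-up stand-in vectors and a hypothetical helper `embeddings_agree`; the tolerance is an assumption and may need loosening, since small numerical differences between a local model and a remote service are normal.

```python
import math

def embeddings_agree(local_vec, remote_vec, atol=1e-3):
    # hypothetical helper: returns (agree?, largest absolute difference, cosine similarity)
    max_abs_diff = max(abs(a - b) for a, b in zip(local_vec, remote_vec))
    dot = sum(a * b for a, b in zip(local_vec, remote_vec))
    norms = math.sqrt(sum(a * a for a in local_vec)) * math.sqrt(sum(b * b for b in remote_vec))
    return max_abs_diff <= atol, max_abs_diff, dot / norms

# made-up stand-ins for a locally computed and a remotely computed embedding
toy_local = [0.6, 0.8, 0.0]
toy_remote = [0.6001, 0.7999, 0.0]

toy_ok, toy_diff, toy_cos = embeddings_agree(toy_local, toy_remote)
print(toy_ok, toy_diff, toy_cos)
```

In the notebook, the two arguments would be `model.encode(["hello"], normalize_embeddings=True)[0]` and the `string2vector(...)` result from the cell above.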