# ⚓️ Linking2Impresso: Connecting your data with Impresso

<a target="_blank" href="https://colab.research.google.com/github/impresso/impresso-datalab-notebooks/blob/main/workshop_resources/ws4-embeddings/Linking2impresso.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

If something doesn't work, you can [report a problem](https://github.com/impresso/impresso-datalab-notebooks/blob/main/reporting-problems.md).

## What is this notebook about?

This notebook demonstrates how to "dock" an external text collection (bonus: image collection) into the Impresso historical media archive using semantic embeddings. 
By transforming textual data into high-dimensional vector representations, we can measure conceptual similarity between new materials and the vast Impresso corpus. 

The workflow illustrates the full pipeline: from **preparing a small example collection**, through **embedding and semantic search via the Impresso API**, to **exploring connections through an interactive network visualization**. 
The resulting graph allows us to quickly see how our own texts resonate with historical documents, opening pathways for contextualization, cross-referencing, and discovery across linguistic and temporal boundaries.

## What will you learn?

In summary, we are going to:
- Load a custom text collection  
- Compute embeddings using Impresso’s model  
- Search for similar content in the Impresso archive  
- Visualize the matches as an interactive bipartite graph

## Useful resources

- [Impresso Python Library](https://impresso.github.io/impresso-py/)
- [Impresso Hugging Face](https://ipyleaflet.readthedocs.io/en/latest/index.html)


## Prerequisites

Run the following cells to install the required package and to connect to the Impresso API:

> If you are working with Google Colab, you may need to restart the kernel. Go to *Runtime* and select *Restart session*. 

In [None]:
# Impresso Python package with embeddings search feature

!pip install --force-reinstall git+https://github.com/impresso/impresso-py.git@embeddings-search

In [None]:
# Connecting to Impresso API

from impresso import connect
impresso = connect('https://dev.impresso-project.ch/public-api/v1')

## Embed text and image with Impresso model

### Embed text

In this example, we use text passages from the **Geneva Convention Pieces**. We **leverage the Impresso embedding model to embed our input texts**.
*Feel free to replace the example collections with any of your texts of interest*.

> It's important that we apply the same embedding model to our input texts and to the texts in Impresso's database. Only then are the resulting similarities meaningful.
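
To make "similarity" concrete, here is a minimal sketch of cosine similarity, the measure used later in this notebook, written in plain Python. The helper name `cosine` and the three-dimensional toy vectors are made up for illustration; real embeddings have many more dimensions and come from the Impresso model.

```python
import math

def cosine(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy "embeddings" (hypothetical values, for illustration only)
query = [1.0, 2.0, 0.0]
close_doc = [2.0, 4.0, 0.0]  # points in the same direction as the query
far_doc = [0.0, 0.0, 3.0]    # orthogonal to the query

print(cosine(query, close_doc))  # close to 1: very similar
print(cosine(query, far_doc))    # close to 0: unrelated
```

Scores like these are only comparable when both vectors come from the same model, which is why the inputs are embedded with Impresso's own model.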

In [None]:
parts = ["Persons in the hands of the enemy are entitled at all times to respect for their life and for their physical and mental integrity.",
         "Under the first and second Geneva Conventions of 1949, the belligerents must protect the sick, wounded and shipwrecked as well as medical personnel, ambulances and hospitals. All persons protected under these conventions must be given shelter and cared for by the party to the conflict that holds power over them.",
         "The third Geneva Convention contains detailed rules on the treatment of prisoners of war.",
         "The fourth Geneva Convention protects civilians in the hands of the enemy, whether in their own or in occupied territory.",
         "The first Additional Protocol of 1977 supplements the rules applying to international armed conflicts contained in the four Geneva Conventions. It imposes restrictions on the conduct of hostilities; for example, it prohibits attacks against civilians and civilian objects and restricts the means and methods of warfare"]

In [None]:
# ---embed and search---

embeddings = []
matches = []

for i, part in enumerate(parts):
    embedding = impresso.tools.embed_text(text=part, target="text")
    matches.append(impresso.search.find(embedding=embedding, limit=5))
    embeddings.append(embedding)
    print(f"retrieved search results for text  {i+1}/{len(parts)}")


In [None]:
# ---post-processing for convenience---

matches_uids = [[datum["uid"] for datum in m.raw["data"]] for m in matches]
articles = [[impresso.content_items.get(uid) for uid in uids] for uids in matches_uids]
# Join title and transcript; fall back to "" when a field is missing (None)
articles = [[(article.raw.get("title") or "") + " " + (article.raw.get("transcript") or "") for article in a] for a in articles]

### Embed image

In [None]:
from IPython.display import Image

image_url = "https://gallica.bnf.fr/iiif/ark:/12148/bpt6k6069079/f2/775,369,1303,887/max/0/default.jpg"
display(Image(url=image_url))

In [None]:
img_embedding = impresso.tools.embed_image(image=image_url, target="image")

matches = impresso.images.find(
  embedding=img_embedding,
  limit=4
)

matches

> For more information on text-image embeddings, see the notebook on [multimodal](https://github.com/impresso/impresso-datalab-notebooks/blob/main/workshop_resources/ws4-embeddings/multimodal_on_radio.ipynb).

## Exploring the links with graphs

### Bipartite graph: Overview of input texts and search results

In [None]:
from pyvis.network import Network
from IPython.display import display, HTML

# Create the network
net = Network(height="600px", width="100%", notebook=True, cdn_resources='in_line')

# Add input nodes
for i, text in enumerate(parts):
    net.add_node(f"I{i}", label=f"Input {i+1}", title=text, color="lightblue", size=25)

# Add result nodes and edges
for i, res_list in enumerate(articles):
    for j, rtext in enumerate(res_list):
        node_id = f"R{i}_{j}"
        url = "https://dev.impresso-project.ch/app/article/" + matches_uids[i][j]
        title = rtext + f"<br><a href={url} target='_blank'>Link</a>"
        net.add_node(node_id, label=f"R{i+1}.{j+1}", title=title, color="lightgreen")
        net.add_edge(f"I{i}", node_id)

# Save and embed in Colab
net.save_graph("graph.html")

# Display inline in Colab
display(HTML("graph.html"))


### Similarity Graph: Revealing additional relations, and relation strengths

The first graph above is a *"discrete graph"*: it only shows direct matches for each input document. Now we want a more connected representation, with:
- Weighted links that reflect the strength of similarity
- Connections among the retrieved documents themselves

To achieve this, we construct a softer graph that captures similarity relations between any two texts, whether they are:
- Input texts
- Texts among the search results
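
The edge-building idea can be sketched in a few lines of plain Python, assuming a symmetric similarity matrix is already available. The function name `similarity_edges` and the toy matrix are hypothetical; the notebook itself uses `cosine_similarity` from scikit-learn and PyVis further below.

```python
def similarity_edges(sim_matrix, threshold):
    """Return (i, j, weight) for each unordered pair whose similarity exceeds the threshold."""
    edges = []
    n = len(sim_matrix)
    for i in range(n):
        # Upper triangle only: skips self-pairs and duplicate (j, i) pairs
        for j in range(i + 1, n):
            if sim_matrix[i][j] > threshold:
                edges.append((i, j, sim_matrix[i][j]))
    return edges

# Hypothetical symmetric similarity matrix for three nodes
toy_sims = [
    [1.0, 0.9, 0.2],
    [0.9, 1.0, 0.75],
    [0.2, 0.75, 1.0],
]

edges = similarity_edges(toy_sims, threshold=0.7)
print(edges)
```

Raising the threshold keeps only the strongest links; lowering it yields a denser graph.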

To position nodes so that similar items sit close together, we use **UMAP** (Uniform Manifold Approximation and Projection), a **dimensionality reduction technique** that projects high-dimensional data, such as embeddings, into 2D or 3D space while preserving much of the data’s underlying structure and similarity relationships.

> To compute similarities, we first convert Impresso’s string-encoded embeddings into usable numeric vectors, then apply our functions to map between the string format and vector space in both directions.
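
As a small illustration of the string format, the sketch below round-trips a toy vector. It assumes the layout is `<prefix>:<base64 of packed 32-bit floats>` in little-endian byte order, as the mapping functions in the next cell suggest; the helper names and the prefix `demo-3` are made up for this example.

```python
import base64
import struct

def encode(vec, prefix):
    # Pack the floats as little-endian float32 (assumed byte order), then base64-encode
    packed = struct.pack(f"<{len(vec)}f", *vec)
    return f"{prefix}:{base64.b64encode(packed).decode('ascii')}"

def decode(embedding_string):
    # Split off the prefix, decode base64, and unpack 4 bytes per float
    prefix, payload = embedding_string.split(":", 1)
    raw = base64.b64decode(payload)
    return prefix, list(struct.unpack(f"<{len(raw) // 4}f", raw))

encoded = encode([0.5, -1.25, 2.0], prefix="demo-3")
prefix, vec = decode(encoded)
print(encoded)
print(prefix, vec)
```

The toy values are exactly representable as 32-bit floats, so they survive the round trip unchanged; arbitrary Python floats would come back rounded to float32 precision.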

In [None]:
# Converting string-encoded embeddings to numeric vectors

import base64
import struct

# --- our handy mapping functions ---
def string2vector(embedding_string):
    # convert base64 string to a float array
    _, arr = embedding_string.split(':')
    arr = base64.b64decode(arr)
    embedding_vector = [struct.unpack('f', arr[i:i+4])[0] for i in range(0, len(arr), 4)]
    return embedding_vector

# the inverse function (not used here, included for completeness)
def vector2string(vec, prefix="gte-768"):
    # pack floats into bytes
    arr = b''.join(struct.pack('f', x) for x in vec)
    # encode bytes to base64 string
    encoded = base64.b64encode(arr).decode('utf-8')
    # return in same format as original ("prefix:encoded_string")
    return f"{prefix}:{encoded}"

# --- applying the mapping functions ---
# first we get the embeddings for the matches (in Impresso format, note that we already have the embedding for the input)
matches_embeddings = [[impresso.content_items.get_embeddings(uid)[0] for uid in match_uid] for match_uid in matches_uids]
# map input embedding string format to vector
part_embeddings = [string2vector(emb) for emb in embeddings]
# map match embedding string format to vector
article_embeddings = [[string2vector(emb) for emb in embs] for embs in matches_embeddings]

In [None]:
# Building an embedding similarity graph

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
from pyvis.network import Network
import umap

# Flatten results for convenience
flat_article_embeddings = np.vstack(article_embeddings)
flat_articles = [r for sublist in articles for r in sublist]
flat_uris = [None] * len(part_embeddings) + [r for sublist in matches_uids for r in sublist]

# Combine everything for similarity computation
all_embeddings = np.vstack([part_embeddings, flat_article_embeddings])
all_texts = parts + flat_articles
node_types = ['input'] * len(part_embeddings) + ['result'] * len(flat_article_embeddings)

# Dimensionality reduction (UMAP)
reducer = umap.UMAP(n_components=2, random_state=42)
coords = reducer.fit_transform(all_embeddings)

# Compute cosine similarities
sim_matrix = cosine_similarity(all_embeddings)

#  Build the PyVis network
net = Network(height="700px", width="100%", notebook=True, cdn_resources='in_line')
scale = 150
# Add nodes
for i, (text, ntype) in enumerate(zip(all_texts, node_types)):
    color = "lightblue" if ntype == "input" else "lightgreen"
    size = 25 if ntype == "input" else 15
    label = f"I{i+1}" if ntype == "input" else f"R{i - len(part_embeddings) + 1}"  # number results from 1
    x, y = coords[i] * scale  # scale coordinates
    url = None if ntype == "input" else "https://dev.impresso-project.ch/app/article/" + flat_uris[i]
    title = text if ntype == "input" else text + f"<br><a href={url} target='_blank'>Link</a>"
    net.add_node(
        i,
        label=label,
        title=title,
        color=color,
        size=size,
        x=float(x),
        y=float(y),
        fixed={'x': True, 'y': True}  # properly fix both axes
    )

# Add edges based on similarity
threshold = 0.7  # Adjust to make the graph denser/sparser
for i in range(len(all_embeddings)):
    for j in range(i + 1, len(all_embeddings)):
        sim = float(sim_matrix[i, j])
        if sim > threshold:
            net.add_edge(i, j, value=sim, title=f"Similarity: {sim:.2f}")


# --- Display inline in Colab ---
net.set_options("""
{
  "physics": {
    "enabled": false
  }
}
""")
net.save_graph("embedding_similarity_graph.html")
display(HTML("embedding_similarity_graph.html"))

## Bonus

### Visualizing the result as a graph

Just as we built a connected network for input texts and their matches in Impresso, we can do the same for the input image and related images—entirely based on the semantic similarity of their embeddings.

In [None]:
# --pre-processing-- for convenience
matches_uids = [datum["uid"] for datum in matches.raw["data"]]
links_uids = [datum["contentItemUid"] for datum in matches.raw["data"]]
matches_embeddings = [impresso.images.get_embeddings(uid)[0] for uid in matches_uids]
# use the image embedding computed above, not the last text embedding from the loop
part_embeddings = [string2vector(img_embedding)]
article_embeddings = [[string2vector(emb) for emb in embs] for embs in [matches_embeddings]]
matches_uids = [matches_uids]
links_uids = [links_uids]

flat_article_embeddings = np.vstack(article_embeddings)
# Image nodes have no text: use placeholder labels so texts and embeddings stay aligned
flat_articles = ["Impresso Image"] * len(flat_article_embeddings)
flat_uris = [None] * len(part_embeddings) + [r for sublist in links_uids for r in sublist]

# Combine everything for similarity computation
all_embeddings = np.vstack([part_embeddings, flat_article_embeddings])
all_texts = ["Input Image"] * len(part_embeddings) + flat_articles
node_types = ['input'] * len(part_embeddings) + ['result'] * len(flat_article_embeddings)

# Dimensionality reduction (UMAP)
reducer = umap.UMAP(n_components=2, random_state=42)
coords = reducer.fit_transform(all_embeddings)

# Compute cosine similarities
sim_matrix = cosine_similarity(all_embeddings)

#  Build the PyVis network
net = Network(height="700px", width="100%", notebook=True, cdn_resources='in_line')
scale = 150
# Add nodes
for i, (text, ntype) in enumerate(zip(all_texts, node_types)):
    color = "lightblue" if ntype == "input" else "lightgreen"
    size = 25 if ntype == "input" else 15
    label = f"I{i+1}" if ntype == "input" else f"R{i - len(part_embeddings) + 1}"  # number results from 1
    x, y = coords[i] * scale  # scale coordinates
    url = None if ntype == "input" else "https://dev.impresso-project.ch/app/article/" + flat_uris[i]
    title = "Input Image" if ntype == "input" else "Impresso Image" + f"<br><a href={url} target='_blank'>Link</a>"
    net.add_node(
        i,
        label=label,
        title=title,
        color=color,
        size=size,
        x=float(x),
        y=float(y),
        fixed={'x': True, 'y': True}  # properly fix both axes
    )

# Add edges based on similarity
threshold = 0.0  # Adjust to make the graph denser/sparser
for i in range(len(all_embeddings)):
    for j in range(i + 1, len(all_embeddings)):
        sim = float(sim_matrix[i, j])
        if sim > threshold:
            net.add_edge(i, j, value=sim, title=f"Similarity: {sim:.2f}")

# --- Display inline in Colab ---
net.set_options("""
{
  "physics": {
    "enabled": false
  }
}
""")
net.save_graph("embedding_similarity_graph.html")
display(HTML("embedding_similarity_graph.html"))

### Externally applying the embedding model

In the final cells of this notebook, we show how to load an external embedding model to embed any texts. As an example, we download the same model used in Impresso from its original source and verify that it produces the same embeddings as those returned by Impresso.
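
When comparing a locally computed embedding with one returned by the API, exact equality is too strict, since small floating-point differences are expected across hardware and numeric precision. A minimal sketch of a tolerance-based check follows; the helper name `vectors_match`, the tolerance value, and the sample numbers are assumptions for illustration, not values produced by the model.

```python
import math

def vectors_match(a, b, abs_tol=1e-4):
    """True if both vectors have the same length and agree element-wise within abs_tol."""
    return len(a) == len(b) and all(
        math.isclose(x, y, abs_tol=abs_tol) for x, y in zip(a, b)
    )

# Hypothetical first components of a local and a remote embedding
local = [0.01234, -0.05678, 0.09012]
remote = [0.01235, -0.05677, 0.09011]

print(vectors_match(local, remote))                      # differences of 1e-5: within tolerance
print(vectors_match(local, [0.5, -0.05677, 0.09011]))    # first component clearly differs
```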


In [None]:
# Quote the requirement so the shell does not treat ">=" as an output redirect
!pip install "sentence-transformers>=3.0.0"

In [None]:
# Drop-in replacement for the embedding step of the similarity-graph section:
# computes part_embeddings and article_embeddings locally instead of via the Impresso API
from sentence_transformers import SentenceTransformer

model_name_or_path="Alibaba-NLP/gte-multilingual-base"
model = SentenceTransformer(model_name_or_path, trust_remote_code=True)
part_embeddings = model.encode(parts, normalize_embeddings=True)
article_embeddings = [model.encode(articleset, normalize_embeddings=True) for articleset in articles]

# verify that the embedding model works as intended:
print(model.encode(["hello"], normalize_embeddings=True)[0][:4])
print(string2vector(impresso.tools.embed_text(text="hello", target="text"))[:4])

## Conclusion

In this notebook, we learned **how to connect external texts and images to the Impresso ecosystem through embeddings**. 
We showed how to embed user-provided inputs with the same model used internally by Impresso, making it possible to compare personal data with historical content in the collection. 
We explored two types of visualisation: a **bipartite graph** that highlights direct matches between inputs and search results, and a **similarity graph** that reveals deeper relational structures, strengths of connections, and latent proximity among retrieved items. 
In the bonus, we demonstrated how to load the embedding model locally to reproduce the same representations outside the platform. 

---
## Project and License info

### Notebook credits [CreditLogo.png](https://credit.niso.org/)

**Writing - Original draft:**  Roman Kalyakin. **Conceptualization:** Marten Düring. **Software:** Roman Kalyakin. **Writing - Review & Editing**: Juri Opitz, Simon Clematide, Cao Vy. **Validation:** Maud Ehrmann, Kirill Veprikov. **Datalab editorial board:** Caio Mello (Managing), Pauline Conti, Emanuela Boros, Marten Düring, Juri Opitz, Martin Grandjean, Estelle Bunout, Cao Vy. **Data curation & Formal analysis:** Maud Ehrmann, Emanuela Boros, Pauline Conti, Simon Clematide, Juri Opitz, Andrianos Michail. **Methodology:** Roman Kalyakin. **Supervision:** Marten Düring. **Funding acquisition:** Maud Ehrmann, Simon Clematide, Marten Düring, Raphaëlle Ruppen Coutaz.

<br><a target="_blank" href="https://creativecommons.org/licenses/by/4.0/">
  <img src="https://mirrors.creativecommons.org/presskit/buttons/88x31/png/by.png"  width="100" alt="CC BY 4.0 License"/>
</a> 

This notebook is published under [CC BY 4.0 License](https://creativecommons.org/licenses/by/4.0/)

For feedback on this notebook, please send an email to info@impresso-project.ch

### Impresso project

[Impresso - Media Monitoring of the Past](https://impresso-project.ch) is an interdisciplinary research project that aims to develop and consolidate tools for processing and exploring large collections of media archives across modalities, time, languages and national borders. The first project (2017-2021) was funded by the Swiss National Science Foundation under grant No. [CRSII5_173719](http://p3.snf.ch/project-173719) and the second project (2023-2027) by the SNSF under grant No. [CRSII5_213585](https://data.snf.ch/grants/grant/213585) and the Luxembourg National Research Fund under grant No. 17498891.
<br>

### License

All Impresso code is published open source under the [GNU Affero General Public License](https://github.com/impresso/impresso-pyindexation/blob/master/LICENSE) v3 or later.


---

<p align="center">
  <img src="https://github.com/impresso/impresso.github.io/blob/master/assets/images/3x1--Yellow-Impresso-Black-on-White--transparent.png?raw=true" width="350" alt="Impresso Project Logo"/>
</p>
