# 🧗‍♂️ Impresso's Embedding Basecamp: Essentials

<a target="_blank" href="https://colab.research.google.com/drive/1jiSYVMjUFsYdzCoAGd_pdMMV4sgBLnPO?usp=sharing">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

If something doesn't work, you can [report a problem](https://github.com/impresso/impresso-datalab-notebooks/blob/main/reporting-problems.md).

## What is this notebook about?

This notebook is designed as a **practical introduction to the Impresso embedding ecosystem**. 

The **first section** introduces the idea of text embeddings and demonstrates how textual content can be transformed into vector representations. This serves as a conceptual foundation for the rest of the material.

The **second section** guides users through two fundamental questions: how embeddings can be inspected to better understand their structure, and how similarity functions work in practice. These two exercises help make the abstract notion of vectors tangible, before moving toward more advanced retrieval workflows.

The **third section** focuses on Impresso's embedding search capabilities. It explains how to perform text retrieval across multiple languages, how to limit searches by language, and how images can also be retrieved via their own embeddings. Users are then shown how to access both text and image embeddings directly from the Impresso platform, as well as how to retrieve the associated media objects.

## What will you learn?

 - Embed texts and images;
 - Retrieve embeddings directly from the Impresso database;
 - Search for related texts and images within Impresso.

## Useful resources

- [Impresso Python Library](https://impresso.github.io/impresso-py/)
- [Impresso Hugging Face](https://ipyleaflet.readthedocs.io/en/latest/index.html)


## Prerequisites

Run the following cells to install the required package and connect to the Impresso API:

> If you are working with Google Colab, you may need to restart the kernel. Go to *Runtime* and select *Restart session*. 

In [None]:
# Impresso Python package with embeddings search feature

!pip install --force-reinstall git+https://github.com/impresso/impresso-py.git@embeddings-search

In [None]:
# Connecting to Impresso API

from impresso import connect
impresso = connect('https://dev.impresso-project.ch/public-api/v1')

## ✨ Embedding a text

Embedding a text means using a **neural network** to turn it into a **high-dimensional vector** that allows efficient comparison across languages and beyond surface wording. Let's start by generating such an embedding right away!

In [None]:
embedding = impresso.tools.embed_text(text="Schumann the politician", target="text")
embedding

### 🧐 Does it look like a vector?

Having inspected the generated embedding, one might wonder what these weird characters and numbers mean: ```gte-768:9wisvV3v1D0Pb1M7CWzzPEv727stV8u9ttGePBKJNTsd...```

The reason this embedding does not look like a vector of numbers is simple: **it is encoded in a data-efficient format**, a model prefix followed by the base64-encoded bytes of the float values.
Let's introduce two helper functions to map back and forth between vectors and this encoding.

In [None]:
# We will convert our embedding string to a vector with 768 dimensions:

import base64
import struct

def string2vector(embedding_string):
    # convert base64 string to a float array
    _, arr = embedding_string.split(':')
    arr = base64.b64decode(arr)
    embedding_vector = [struct.unpack('f', arr[i:i+4])[0] for i in range(0, len(arr), 4)]
    return embedding_vector

def vector2string(vec, prefix="gte-768"):
    # pack floats into bytes
    arr = b''.join(struct.pack('f', x) for x in vec)
    # encode bytes to base64 string
    encoded = base64.b64encode(arr).decode('utf-8')
    # return in same format as original ("prefix:encoded_string")
    return f"{prefix}:{encoded}"

string2vector(embedding)[:5]

In [None]:
# Now, is 'embedding' a vector? Let's try to answer with a boolean expression:

embedding == vector2string(string2vector(embedding))
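
The encoding can also be explored without calling the API. The following is a minimal sketch on a toy three-dimensional vector. The names `encode_vector` and `decode_vector` and the `toy-3` prefix are made up for illustration, and the explicit little-endian format (`<f`) is an assumption (the helpers above use the platform's native byte order).

```python
import base64
import struct

def encode_vector(vec, prefix="toy-3"):
    # Pack every value as a 4-byte little-endian float, then base64-encode the bytes.
    raw = struct.pack(f"<{len(vec)}f", *vec)
    return f"{prefix}:{base64.b64encode(raw).decode('ascii')}"

def decode_vector(embedding_string):
    # Drop the prefix, decode the base64 payload, and unpack the 4-byte floats.
    _, payload = embedding_string.split(":", 1)
    raw = base64.b64decode(payload)
    return list(struct.unpack(f"<{len(raw) // 4}f", raw))

toy = [0.5, -1.25, 2.0]  # values chosen to be exactly representable as 4-byte floats
encoded = encode_vector(toy)
decoded = decode_vector(encoded)

print(encoded)
print(decoded == toy)
```

Each float takes 4 bytes and base64 turns every 3 bytes into 4 characters, so under this scheme a 768-dimensional vector would occupy 3072 bytes, or 4096 characters after the prefix.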

### 🧩 What is a Similarity Function?

To understand **why embeddings are powerful for comparing and searching texts**, we'll define a simple similarity function and use it to compare three examples.

In [None]:
def vector_similarity(vector1, vector2):
    # this is a simple vector similarity function called "dot product"
    return sum([x*y for x, y in zip(vector1, vector2)])

text1 = "Schumann the politician"
text2 = "Schumann der politiker"
text3 = "Schumann the composer"

# Embed each text once and decode it into a vector
v1, v2, v3 = (string2vector(impresso.tools.embed_text(text=t, target="text"))
              for t in (text1, text2, text3))

s1 = vector_similarity(v1, v2)  # text1 vs text2: same meaning, different language
s2 = vector_similarity(v1, v3)  # text1 vs text3: same language, different meaning

s1, s2

📝 See? Even though `text1` and `text3` are in the same language and share more tokens ("Schumann", "the"), they are less similar in the semantic vector space than `text1` and `text2`, which are in different languages but have the same meaning.

📝 Such a similarity function is used in Impresso internally when we search texts with embeddings.
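
The dot product grows with both the alignment and the length of the vectors. A minimal sketch with toy two-dimensional vectors (not real embeddings) shows how it relates to cosine similarity, which only measures direction. Whether Impresso embeddings are length-normalized is not stated here, so treat the comparison as illustrative.

```python
import math

def dot(u, v):
    # Sum of element-wise products
    return sum(x * y for x, y in zip(u, v))

def cosine(u, v):
    # Dot product divided by the product of the two vector lengths
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))

a = [3.0, 4.0]
b = [6.0, 8.0]   # same direction as a, twice as long
c = [4.0, -3.0]  # perpendicular to a

print(dot(a, b), cosine(a, b))  # 50.0 1.0
print(dot(a, c), cosine(a, c))  # 0.0 0.0
```

For vectors of length 1 the two measures coincide, which is one reason a plain dot product is often sufficient for embedding search.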

**Speaking of searching, that's what we'll do next, watch out!** 🤠

## 🕵️ Using Impresso's Embedding Search

The idea of search in a nutshell: First, we embed our text, and then the similarity magic finds us the most similar documents in a dataset. Thanks to embeddings, **search can be performed *across languages and through all documents in Impresso*!**
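
To make the idea concrete, here is a rough sketch of embedding search as a brute-force ranking over a tiny in-memory corpus. The document IDs and three-dimensional vectors are invented for illustration, and this is not a description of how Impresso implements search internally.

```python
import heapq

def dot(u, v):
    # Dot-product similarity between two vectors
    return sum(x * y for x, y in zip(u, v))

def top_k(query, corpus, k=2):
    # Score every document against the query and keep the k best matches.
    scored = ((dot(query, vec), doc_id) for doc_id, vec in corpus.items())
    return [doc_id for _, doc_id in heapq.nlargest(k, scored)]

# Hypothetical corpus: document ID -> toy embedding
toy_corpus = {
    "doc-politics": [0.9, 0.1, 0.0],
    "doc-music": [0.1, 0.9, 0.0],
    "doc-sports": [0.0, 0.1, 0.9],
}

query = [1.0, 0.2, 0.0]  # stands in for the embedded search text
results = top_k(query, toy_corpus, k=2)
print(results)  # ['doc-politics', 'doc-music']
```

The `limit` argument used in the cells below plays the same role as `k` in this sketch.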

### 📜 Text Retrieval Across Languages 🇫🇷🇩🇪🇬🇧🇱🇺

Our text embeddings are cross-lingual, which means we can search across languages!
You can embed any text you like, so feel **free to overwrite the string in the `text` variable**.

In [None]:
text = "Schumann the politician"
embedding = impresso.tools.embed_text(text=text, target="text")
impresso.search.find(
  embedding=embedding,
  limit=5
)

### Setting Language Restrictions 🇫🇷

Yes, our embeddings can retrieve all kinds of data across languages!
But this is not always what we want.
Sometimes, we want to **find texts only from a target language of our choice**.
Here's how that can be done in Impresso: set `country` to one of `FR`, `DE`, `EN`, or `LB` (Luxembourgish):

In [None]:
impresso.search.find(
  country='FR',
  embedding=embedding,
  limit=5
)
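
One way to picture a restricted search is "filter first, then rank". The sketch below uses a hypothetical in-memory corpus with a made-up `lang` field and toy two-dimensional vectors. It only illustrates the principle and makes no claim about how the Impresso API applies the restriction.

```python
def dot(u, v):
    # Dot-product similarity between two vectors
    return sum(x * y for x, y in zip(u, v))

def search(query, corpus, lang=None, limit=5):
    # Keep only documents in the requested language (if any), then rank by similarity.
    candidates = [d for d in corpus if lang is None or d["lang"] == lang]
    ranked = sorted(candidates, key=lambda d: dot(query, d["vector"]), reverse=True)
    return [d["id"] for d in ranked[:limit]]

# Hypothetical documents with a language tag and a toy embedding
toy_corpus = [
    {"id": "fr-1", "lang": "fr", "vector": [0.8, 0.2]},
    {"id": "de-1", "lang": "de", "vector": [0.9, 0.1]},
    {"id": "fr-2", "lang": "fr", "vector": [0.1, 0.9]},
]

query = [1.0, 0.0]
all_hits = search(query, toy_corpus)            # best match overall is the German document
fr_hits = search(query, toy_corpus, lang="fr")  # the German document is excluded
print(all_hits, fr_hits)
```

Note that the best unrestricted match can disappear once a restriction is set, so restricted and unrestricted result lists are not simply subsets ranked identically at the top.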

### 🖼️ Image Retrieval

Say we have an image or a text and want to find out whether there are any images *within* Impresso that are similar to it.
We can use the **Impresso image search**.

#### Use-case A: We want to find similar images to our text

This is also called a "multi-modal" search, since input and retrieved data are from two different modalities (image, text).
Hence, we set `target="multimodal"`:

In [None]:
impresso.images.find(
  embedding=impresso.tools.embed_text(text="Berlin", target="multimodal"),
  limit=3
)

#### Use-case B: We want to find images similar to our embedded image

In [None]:
impresso.images.find(
  embedding=impresso.tools.embed_image(image="https://impresso-project.ch/assets/images/posts/rep-thomas.png", target="image"),
  limit=3
)

## 🎣 Retrieve text and image embeddings from Impresso


### Retrieve embeddings from a text

Say we found a matching article in Impresso, but **how would we get its embedding**?
Here's a one-liner to make that work.

> Remember, it will be returned in the encoded format; see above for how to decode it into a vector.

In [None]:
article_id = "oeuvre-1938-02-25-a-i0003"
original_embedding = impresso.content_items.get_embeddings(article_id)[0]
original_embedding

### Retrieve embeddings from an image

We can also **get embeddings for any image in Impresso**.
Be ready! They also come in the quirky encoded format.

In [None]:
image_id = "luxwort-1930-09-26-a-i0036"
embeddings = impresso.images.get_embeddings(image_id)
embeddings[0]

### Retrieve an image from embeddings

Tired of looking at encodings and vectors?
Let's learn **how we can directly retrieve the image that's represented by the embedding**!

In [None]:
image_id = "luxwort-1930-09-26-a-i0036"
image = impresso.images.get(image_id, include_embeddings=True)
image

## Conclusion

In this notebook, we explored **how Impresso's embedding ecosystem transforms text and images into meaningful vector representations**, enabling:

- cross-lingual retrieval
- similarity search
- multimodal linking

By experimenting with simple similarity functions and the Impresso API, you have seen how embeddings reveal latent relationships beyond surface wording. These techniques provide a foundation for more advanced workflows, such as **document clustering, cross-media analysis, and building richer historical research pipelines**.

---
## Project and License info

### Notebook credits [CreditLogo.png](https://credit.niso.org/)

**Writing - Original draft:**  Roman Kalyakin. **Conceptualization:** Marten Düring. **Software:** Roman Kalyakin. **Writing - Review & Editing**: Juri Opitz, Cao Vy. **Validation:** Martin Grandjean, Kirill Veprikov. **Datalab editorial board:** Caio Mello (Managing), Pauline Conti, Emanuela Boros, Marten Düring, Juri Opitz, Martin Grandjean, Estelle Bunout, Cao Vy. **Data curation & Formal analysis:** Maud Ehrmann, Emanuela Boros, Pauline Conti, Simon Clematide, Juri Opitz, Andrianos Michail. **Methodology:** Roman Kalyakin. **Supervision:** Marten Düring. **Funding acquisition:** Maud Ehrmann, Simon Clematide, Marten Düring, Raphaëlle Ruppen Coutaz.

<br><a target="_blank" href="https://creativecommons.org/licenses/by/4.0/">
  <img src="https://mirrors.creativecommons.org/presskit/buttons/88x31/png/by.png"  width="100" alt="CC BY 4.0"/>
</a> 

This notebook is published under [CC BY 4.0 License](https://creativecommons.org/licenses/by/4.0/)

For feedback on this notebook, please send an email to info@impresso-project.ch

### Impresso project

[Impresso - Media Monitoring of the Past](https://impresso-project.ch) is an interdisciplinary research project that aims to develop and consolidate tools for processing and exploring large collections of media archives across modalities, time, languages and national borders. The first project (2017-2021) was funded by the Swiss National Science Foundation under grant No. [CRSII5_173719](http://p3.snf.ch/project-173719) and the second project (2023-2027) by the SNSF under grant No. [CRSII5_213585](https://data.snf.ch/grants/grant/213585) and the Luxembourg National Research Fund under grant No. 17498891.
<br></br>
### License

All Impresso code is published open source under the [GNU Affero General Public License](https://github.com/impresso/impresso-pyindexation/blob/master/LICENSE) v3 or later.


---

<p align="center">
  <img src="https://github.com/impresso/impresso.github.io/blob/master/assets/images/3x1--Yellow-Impresso-Black-on-White--transparent.png?raw=true" width="350" alt="Impresso Project Logo"/>
</p>
