# Exploring Image and Text-to-Image Embeddings

<a target="_blank" href="https://colab.research.google.com/github/impresso/impresso-datalab-notebooks/blob/main/workshop_resources/ws4-embeddings/explore_multimodal-embeddings.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

If something doesn't work, you can [report a problem](https://github.com/impresso/impresso-datalab-notebooks/blob/main/reporting-problems.md).

## What is this notebook about?

This notebook demonstrates how to explore historical image collections using Impresso's text-to-image and image-only embeddings, from keyword search to visual similarity retrieval.

In the **first section**, we begin with Open-CLIP to perform text-to-image search. We start by choosing a small set of keywords and using them to retrieve relevant images. Next, we expand the keywords into longer textual descriptions and repeat the process, allowing us to observe how richer language produces more nuanced and precise results.

In the **second section**, we work with DINOv2 image-only embeddings to identify visual similarities within the collection. Given a single reference image, we search for visually related items and interpret what features the model captures.

We will explore **how radio is represented both in images and in program listings**. This lets us examine visual and textual elements using both types of embeddings.

## What will you learn?

- Perform keyword-based image retrieval using image captions and Open-CLIP, and convert a text query into an embedding for text-to-image similarity search;
- Use DINOv2 to search for visual similarities directly from a reference image;
- Compare how multimodal embeddings and visual-only embeddings support different research strategies.

## Useful resources

- [Impresso Python Library](https://impresso.github.io/impresso-py/)
- [Impresso Hugging Face](https://ipyleaflet.readthedocs.io/en/latest/index.html)

## Prerequisites

Run the following cells to install the required package and connect to the Impresso API:

> If you are working with Google Colab, you may need to restart the kernel. Go to *Runtime* and select *Restart session*.

In [None]:
# Impresso Python package with embeddings search feature
!pip install impresso




In [1]:

# Connecting to Impresso API
from impresso import connect, OR, AND, DateRange
impresso_session = connect('https://dev.impresso-project.ch/public-api/v1')

🎉 You are now connected to the Impresso API!  🎉
🔗 Using API: https://dev.impresso-project.ch/public-api/v1


> In this notebook, we will often move back and forth between the notebook and the Impresso App, so a small function for constructing links is useful. Due to copyright restrictions, images might not be fully displayed here but you can access them via the Impresso Web App.

In [48]:
# Function to generate webapp URLs for images

def img_webapp_url(uid, issue_mode=True):
  mode = "issue" if issue_mode else "search/images"
  pre, suf = uid.split('-a-')
  suffix = f"{pre}-a/view?articleId={suf}" if issue_mode else uid
  return f'https://dev.impresso-project.ch/app/{mode}/{suffix}'
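
The helper above splits the ID on `-a-`, which works for the IDs used in this notebook. As a hedged sketch, the hypothetical variant below (`webapp_url_sketch` is not part of the Impresso library) splits on the last hyphen instead, so it would also handle an ID whose issue part ends with a different edition letter. The URL patterns are the same ones used by `img_webapp_url`.

```python
def webapp_url_sketch(uid, issue_mode=True, base="https://dev.impresso-project.ch/app"):
    """Build a web app link from an ID shaped like '<issue id>-<item id>'."""
    # Split on the last hyphen only, so the issue part keeps its edition letter
    issue_id, item_id = uid.rsplit("-", 1)
    if issue_mode:
        return f"{base}/issue/{issue_id}/view?articleId={item_id}"
    return f"{base}/search/images/{uid}"


# Content item link (issue view) and image link (image search view)
print(webapp_url_sketch("EXP-2009-01-06-a-i0088"))
print(webapp_url_sketch("EXP-2009-01-06-a-i0096", issue_mode=False))
```

Both calls should produce the same links as `img_webapp_url` for these two IDs.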

# Text-to-Image embeddings with Open-CLIP

First, we examine how the system retrieves images from simple keywords or short phrases, providing an initial sense of the most similar results.

## 1. Keyword search on image captions

In [None]:
kw_radio = 'radio'

result = impresso_session.images.find(term=kw_radio)
result

| uid | issueUid | previewImage | date | caption | pageNumbers | mediaSourceRef.uid | mediaSourceRef.name | mediaSourceRef.type | imageTypes.visualContent | imageTypes.visualContentType | previewUrl | contentItemUid |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| oeuvre-1944-07-15-a-i0076 | oeuvre-1944-07-15-a | | 1944-07-15 | RADIO-PARIS | [3] | oeuvre | L'Œuvre (Paris) | newspaper | Not an Image | | https://gallica.bnf.fr/iiif/ark:/12148/bpt6k4622965v/f3/4925,8756,747,153/max/0/default.jpg | |
| oeuvre-1944-08-12-a-i0054 | oeuvre-1944-08-12-a | | 1944-08-12 | RADIO-PARIS | [2] | oeuvre | L'Œuvre (Paris) | newspaper | Not an Image | | https://gallica.bnf.fr/iiif/ark:/12148/bpt6k4622988v/f2/5305,7413,758,168/max/0/default.jpg | |
| oeuvre-1929-01-03-a-i0134 | oeuvre-1929-01-03-a | | 1929-01-03 | Radio-paris | [8] | oeuvre | L'Œuvre (Paris) | newspaper | Image | Non-Figurative Visual Content | https://gallica.bnf.fr/iiif/ark:/12148/bpt6k4617832d/f8/4892,1942,885,550/max/0/default.jpg | |


## 2. Text-to-image similarity search with Open-CLIP

Next, we embed the same keyword with Open-CLIP and use this embedding to search through the Open-CLIP image embeddings, enabling text-to-image similarity retrieval.
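
To build intuition for what a similarity search does, here is a minimal sketch on toy vectors. It assumes that results are ranked by cosine similarity between the query embedding and each image embedding; the notebook does not state which metric the Impresso API actually uses, and the three-dimensional vectors and `img_*` names are invented for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy "text" embedding and toy "image" embeddings (hypothetical values)
query = [1.0, 0.0, 1.0]
images = {
    "img_a": [0.9, 0.1, 0.8],
    "img_b": [0.0, 1.0, 0.0],
    "img_c": [0.5, 0.5, 0.5],
}

# Rank images from most to least similar to the query
ranked = sorted(images, key=lambda uid: cosine_similarity(query, images[uid]), reverse=True)
print(ranked)
```

The image whose vector points in nearly the same direction as the query comes first, regardless of vector length.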

In [None]:
kw_embedding = impresso_session.tools.embed_text(text=kw_radio, target='multimodal')
kw_embedding

'openclip-768:1zuRvAvlzLvJJxw9tRKdO5yyaryrbIK9wuLOvDO12bz3ezo9+5ogPaXcCj2bbCm9IogYPOXjBz7Sfvq8xCccvbjvbD1F2iW7ZQ0sO5CEf7wH7Qc9HWC9OzApX7y8ThO8DDckPH6dhbzIxBi82Z5RPCbzerwiIo48XpKUvDHtHTxUjcA8QgXJPDXBBzyDrhY7L78DPRhOhLviam487tW/PMJyX7sz6q68v+MnvMDEqTzdOHG8yJnkuz7OEr1EjZw8CFgMPS6dYryzMFm8xeU4PMxlPDwQw1S7eOjDOnfO3bwdgtc7LYsDPe+Nwrtt6Sq8fwOsuQwTG73vtTk7KE7Fu7nfU73P+UY805+dPJ83G7x5Xha9Tc+HPJdgijuPY5i7ks6fPAq1Db0ZFzO7ElLTPH09AD0S11Y6QIPxPDrNYbxr5W49h0ROPJPSo7w6D968ojfHO0MknbwNthM9wETYPHILhrlRnIY7oQgBvUhytTw+knU79xI0vUDxZ7x68IQ88KJUPC/i9bsW3Qq9og6GPNra7Lzj34Y93LM2u/jz1DsP1Ci9u4PRvMz+ar00TmK8PFYjvLRAuTyRy2g8a2dWvIldPjw+M5C7qzV7PJAJgrwlr648xUYBvTjmzDsGqCE8Z7kqPSygbTrLL6I8b2jnPNTtMjzkhAe8HjbRPKM6Ar3ZUlG8hgjyO/x+nbyPkhO894XCPFVBtTtRfrc8HXLXvFtcwrwvFig9Ih4kva8OXL16lnc7YBtzPOVVyTxvxwo79CsEPTe85jyLgLO84b1+O7IkNr1lroa89P5uPeW4kDy7Joe7Pkr5vNmWHjtEYJi8BqypPN4AMzxaats8mJBFvHK39rwDUzS9uDwUua7iIjzYUco8ym26vEPG5bw8VKq88mqRPM1cHjyB0VO9e9GkvIccWTwW5ai8je+pvCTcnzy2t0I8KbndPOLVq7xpAfs8D0aNvDValDyD+yC83DgoPXwU6L

> Having inspected the generated embedding, you may wonder what this long string of characters means: ```openclip-768:1zuRvAvlzLvJJxw9tRKdO5yyaryrbIK9wuLOvDO12bz...```
The explanation is simple: **the vector is encoded in a compact, data-efficient text format**. The prefix (`openclip-768`) names the model and the vector size, and the part after the colon holds the encoded values.
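
If you want to look at the numbers themselves, the sketch below shows one way to unpack such a string. It rests on an assumption that is not confirmed in this notebook: that the part after the colon is a base64 encoding of little-endian 32-bit floats. The helper `decode_embedding` is hypothetical, and the example round-trips a synthetic three-value vector so that it runs without the API.

```python
import base64
import struct


def decode_embedding(encoded):
    """Split '<label>:<payload>' and decode the payload.

    Assumption: the payload is base64-encoded little-endian float32 values.
    """
    label, payload = encoded.split(":", 1)
    raw = base64.b64decode(payload)
    count = len(raw) // 4  # four bytes per float32
    values = struct.unpack(f"<{count}f", raw[: count * 4])
    return label, list(values)


# Synthetic example: encode three floats the same way, then decode them again
demo_values = [0.25, -0.5, 1.0]
demo_payload = base64.b64encode(struct.pack("<3f", *demo_values)).decode("ascii")
label, vector = decode_embedding(f"demo-3:{demo_payload}")
print(label, vector)
```

If the assumption holds, calling `decode_embedding(kw_embedding)` should return a list whose length matches the number in the prefix.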

In [None]:
# Searching images similar to the keyword embedding
results = impresso_session.images.find(
  embedding=kw_embedding,
  limit=6
)
results

| uid | issueUid | previewImage | date | contentItemUid | pageNumbers | mediaSourceRef.uid | mediaSourceRef.name | mediaSourceRef.type | imageTypes.visualContent | previewUrl | imageTypes.visualContentType |
|---|---|---|---|---|---|---|---|---|---|---|---|
| EXP-2009-01-06-a-i0096 | EXP-2009-01-06-a | | 2009-01-06 | EXP-2009-01-06-a-i0088 | [12] | EXP | L'Express | newspaper | Image | https://dev.impresso-project.ch/media/iiif/EXP-2009-01-06-a-p0012/97,236,195,200/max/0/default.jpg | Object |
| IMP-2009-01-06-a-i0088 | IMP-2009-01-06-a | | 2009-01-06 | IMP-2009-01-06-a-i0080 | [12] | IMP | L'Impartial | newspaper | Image | https://dev.impresso-project.ch/media/iiif/IMP-2009-01-06-a-p0012/106,242,183,196/max/0/default.jpg | Object |
| EXP-1960-03-31-a-i0096 | EXP-1960-03-31-a | | 1960-03-31 | EXP-1960-03-31-a-i0087 | [4] | EXP | L'Express | newspaper | Image | https://dev.impresso-project.ch/media/iiif/EXP-1960-03-31-a-p0004/2206,1724,392,150/max/0/default.jpg | Non-Figurative Visual Content |


In [46]:
results.df[['contentItemUid', 'imageTypes.visualContentType']].index

Index(['EXP-2009-01-06-a-i0096', 'IMP-2009-01-06-a-i0088',
       'EXP-1960-03-31-a-i0096', 'JDG-1995-10-06-a-i0264',
       'EXP-1958-06-11-a-i0118', 'IMP-2010-02-02-a-i0126'],
      dtype='object', name='uid')

In [47]:
import numpy as np
import pandas as pd

# Print the URLs for the first 5 images 
for uid, r in results.df.head(5).iterrows():
  print(f"Result {uid} - link to image CI {r.previewUrl} - type {r['imageTypes.visualContentType']}")
  if str(r.contentItemUid)!='nan':
    print(f"       {r.contentItemUid} - link to corresponding CI {img_webapp_url(r.contentItemUid)}")

Result EXP-2009-01-06-a-i0096 - link to image CI https://dev.impresso-project.ch/media/iiif/EXP-2009-01-06-a-p0012/97,236,195,200/max/0/default.jpg - type Object
       EXP-2009-01-06-a-i0088 - link to corresponding CI https://impresso-project.ch/app/issue/EXP-2009-01-06-a/view?articleId=i0088
Result IMP-2009-01-06-a-i0088 - link to image CI https://dev.impresso-project.ch/media/iiif/IMP-2009-01-06-a-p0012/106,242,183,196/max/0/default.jpg - type Object
       IMP-2009-01-06-a-i0080 - link to corresponding CI https://impresso-project.ch/app/issue/IMP-2009-01-06-a/view?articleId=i0080
Result EXP-1960-03-31-a-i0096 - link to image CI https://dev.impresso-project.ch/media/iiif/EXP-1960-03-31-a-p0004/2206,1724,392,150/max/0/default.jpg - type Non-Figurative Visual Content
       EXP-1960-03-31-a-i0087 - link to corresponding CI https://impresso-project.ch/app/issue/EXP-1960-03-31-a/view?articleId=i0087
Result JDG-1995-10-06-a-i0264 - link to image CI https://dev.impresso-project.ch/media/i

> The retrieved images feature either illustrations of radios (physical radio sets) or illustrated headers of radio sections.
We can try to filter by image type, such as `Object`, `Non-Figurative Visual Content` and `Ornament or Illustrated Title`.
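
Before filtering, it can help to see which image types a result set contains. The sketch below tallies the types of the first three results shown earlier, using plain Python so that it runs without the API. With the live results, `results.df['imageTypes.visualContentType'].value_counts()` should give the same kind of summary, assuming that column is present in the DataFrame.

```python
from collections import Counter

# (uid, visual content type) pairs copied from the first three results above
rows = [
    ("EXP-2009-01-06-a-i0096", "Object"),
    ("IMP-2009-01-06-a-i0088", "Object"),
    ("EXP-1960-03-31-a-i0096", "Non-Figurative Visual Content"),
]

# Count how often each image type occurs
type_counts = Counter(content_type for _, content_type in rows)
for content_type, count in type_counts.most_common():
    print(f"{count} x {content_type}")
```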


In [None]:
# Filter results based on the image content type "Object"
object_results = impresso_session.images.find(
  content_type="Object",
  embedding=kw_embedding,
  limit=5
)

# Print the URLs for the first 5 images
print(f"Results for images of type Object")
for uid, r in object_results.df.head(5).iterrows():
  print(f"Result {uid} - link to image CI {img_webapp_url(uid, issue_mode=False)}")
  if str(r.contentItemUid)!='nan':
    print(f"       {r.contentItemUid} - link to corresponding CI {img_webapp_url(r.contentItemUid)}")

Results for images of type Object
Result EXP-2009-01-06-a-i0096 - link to image CI https://dev.impresso-project.ch/app/search/images/EXP-2009-01-06-a-i0096
       EXP-2009-01-06-a-i0088 - link to corresponding CI https://dev.impresso-project.ch/app/issue/EXP-2009-01-06-a/view?articleId=i0088
Result IMP-2009-01-06-a-i0088 - link to image CI https://dev.impresso-project.ch/app/search/images/IMP-2009-01-06-a-i0088
       IMP-2009-01-06-a-i0080 - link to corresponding CI https://dev.impresso-project.ch/app/issue/IMP-2009-01-06-a/view?articleId=i0080
Result JDG-1995-10-06-a-i0264 - link to image CI https://dev.impresso-project.ch/app/search/images/JDG-1995-10-06-a-i0264
       JDG-1995-10-06-a-i0257 - link to corresponding CI https://dev.impresso-project.ch/app/issue/JDG-1995-10-06-a/view?articleId=i0257
Result IMP-2010-02-02-a-i0126 - link to image CI https://dev.impresso-project.ch/app/search/images/IMP-2010-02-02-a-i0126
       IMP-2010-02-02-a-i0120 - link to corresponding CI https://de

> If you open all the links, you will notice that most of the items from the previous search appear again, with the exception of the image `EXP-1960-03-31-a-i0096`, which is not of type `Object`. In addition, the images `EXP-2009-01-06-a-i0096` and `IMP-2009-01-06-a-i0088` were reused by the editors about a year later in content item `IMP-2010-02-02-a-i0120`.

In [None]:
# Filter results based on the image content type "Non-Figurative Visual Content" OR "Ornament or Illustrated Title"
non_fig_results = impresso_session.images.find(
  content_type=OR("Non-Figurative Visual Content", "Ornament or Illustrated Title"),
  embedding=kw_embedding,
  limit=5
)

# Print the URLs for the first 5 images
print(f"Results for images of type Non-Figurative Visual Content")
for uid, r in non_fig_results.df.head(5).iterrows():
  print(f"Result {uid} - link to image CI {img_webapp_url(uid, issue_mode=False)}")
  if str(r.contentItemUid)!='nan':
    print(f"       {r.contentItemUid} - link to corresponding CI {img_webapp_url(r.contentItemUid)}")

Results for images of type Non-Figurative Visual Content
Result EXP-1960-03-31-a-i0096 - link to image CI https://dev.impresso-project.ch/app/search/images/EXP-1960-03-31-a-i0096
       EXP-1960-03-31-a-i0087 - link to corresponding CI https://dev.impresso-project.ch/app/issue/EXP-1960-03-31-a/view?articleId=i0087
Result EXP-1958-06-11-a-i0118 - link to image CI https://dev.impresso-project.ch/app/search/images/EXP-1958-06-11-a-i0118
       EXP-1958-06-11-a-i0114 - link to corresponding CI https://dev.impresso-project.ch/app/issue/EXP-1958-06-11-a/view?articleId=i0114
Result EXP-1960-06-21-a-i0096 - link to image CI https://dev.impresso-project.ch/app/search/images/EXP-1960-06-21-a-i0096
       EXP-1960-06-21-a-i0104 - link to corresponding CI https://dev.impresso-project.ch/app/issue/EXP-1960-06-21-a/view?articleId=i0104


> There are far fewer results of this type, as they are generally rarer in the data. However, in both cases the model identifies the Radio section logo, likely because it also contains text.

## 3. Complex search queries with embeddings

Next, we refine our search by embedding a more **descriptive query** that targets the radio program section of a newspaper.

In [None]:
program_query = "Weekly radio program"

# Embed program_query prompt using Open-Clip model
pgm_embedding = impresso_session.tools.embed_text(text=program_query, target='multimodal')
pgm_embedding


# Print the URLs for images similar to the query embeddings
pgm_results = impresso_session.images.find(
  embedding=pgm_embedding,
  limit=6
)

for uid, r in pgm_results.df.head(5).iterrows():
  print(f"Result {uid} - link to image CI {img_webapp_url(uid, issue_mode=False)} - type {r['imageTypes.visualContentType']}")
  if str(r.contentItemUid)!='nan':
    print(f"       {r.contentItemUid} - link to corresponding CI {img_webapp_url(r.contentItemUid)}")

Result IMP-1938-09-08-a-i0065 - link to image CI https://dev.impresso-project.ch/app/search/images/IMP-1938-09-08-a-i0065 - type Ornament or Illustrated Title
       IMP-1938-09-08-a-i0061 - link to corresponding CI https://dev.impresso-project.ch/app/issue/IMP-1938-09-08-a/view?articleId=i0061
Result IMP-1938-01-24-a-i0046 - link to image CI https://dev.impresso-project.ch/app/search/images/IMP-1938-01-24-a-i0046 - type Ornament or Illustrated Title
       IMP-1938-01-24-a-i0043 - link to corresponding CI https://dev.impresso-project.ch/app/issue/IMP-1938-01-24-a/view?articleId=i0043
Result IMP-1941-10-11-a-i0120 - link to image CI https://dev.impresso-project.ch/app/search/images/IMP-1941-10-11-a-i0120 - type Ornament or Illustrated Title
       IMP-1941-10-11-a-i0117 - link to corresponding CI https://dev.impresso-project.ch/app/issue/IMP-1941-10-11-a/view?articleId=i0117
Result IMP-1938-09-10-a-i0064 - link to image CI https://dev.impresso-project.ch/app/search/images/IMP-1938-09-1

> We have successfully retrieved the illustrated section titles!
This query captures many of the radio program pages from L'Impartial in the late 1930s and early 1940s.
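
Every query in this section follows the same two steps: embed the text, then search with the embedding. As a convenience, these steps can be wrapped in a small helper. This is only a sketch: `search_images_by_text` is a hypothetical name, and it relies solely on the two calls already used in this notebook (`tools.embed_text` and `images.find`).

```python
def search_images_by_text(session, query, limit=6, **filters):
    """Embed a text query with the multimodal model, then search images with it.

    `session` is expected to be the object returned by `impresso.connect`;
    extra keyword arguments (e.g. content_type=...) are passed on to `images.find`.
    """
    embedding = session.tools.embed_text(text=query, target="multimodal")
    return session.images.find(embedding=embedding, limit=limit, **filters)


# Example usage (requires a connected session):
# pgm_results = search_images_by_text(impresso_session, "Weekly radio program")
# obj_results = search_images_by_text(impresso_session, "radio", limit=5, content_type="Object")
```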

> Since **the Open-CLIP model used here is multilingual**, we can try the same search using a query in French.

In [None]:
program_query_fr = "Programme Radio de la semaine"

# Embed program_query_fr prompt using Open-Clip model
pgm_fr_embedding = impresso_session.tools.embed_text(text=program_query_fr, target='multimodal')
pgm_fr_embedding


# Print the URLs for images similar to the query embeddings
pgm_fr_results = impresso_session.images.find(
  embedding=pgm_fr_embedding,
  limit=6
)

for uid, r in pgm_fr_results.df.head(5).iterrows():
  print(f"Result {uid} - link to image CI {img_webapp_url(uid, issue_mode=False)} - type {r['imageTypes.visualContentType']}")
  if str(r.contentItemUid)!='nan':
    print(f"       {r.contentItemUid} - link to corresponding CI {img_webapp_url(r.contentItemUid)}")

Result IMP-1980-05-24-a-i0269 - link to image CI https://dev.impresso-project.ch/app/search/images/IMP-1980-05-24-a-i0269 - type Graph
       IMP-1980-05-24-a-i0266 - link to corresponding CI https://dev.impresso-project.ch/app/issue/IMP-1980-05-24-a/view?articleId=i0266
Result IMP-1980-03-25-a-i0230 - link to image CI https://dev.impresso-project.ch/app/search/images/IMP-1980-03-25-a-i0230 - type Graph
       IMP-1980-03-25-a-i0229 - link to corresponding CI https://dev.impresso-project.ch/app/issue/IMP-1980-03-25-a/view?articleId=i0229
Result IMP-1981-03-04-a-i0226 - link to image CI https://dev.impresso-project.ch/app/search/images/IMP-1981-03-04-a-i0226 - type nan
       IMP-1981-03-04-a-i0225 - link to corresponding CI https://dev.impresso-project.ch/app/issue/IMP-1981-03-04-a/view?articleId=i0225
Result IMP-1980-04-05-a-i0232 - link to image CI https://dev.impresso-project.ch/app/search/images/IMP-1980-04-05-a-i0232 - type Graph
       IMP-1980-04-05-a-i0231 - link to correspondi

> We obtain very similar results: most of them are program pages, but this time they are more recent and often list TV programs (note that the Swiss national radio and TV share the same name).
Now let's see if we can go further and **retrieve actual images of radio sets**, ideally with people listening to the radio.


In [None]:
radio_query_fr = "Personnes écoutant la radio à côté du poste de radio."

# Embed radio_query_fr prompt using Open-Clip model
radio_fr_embedding = impresso_session.tools.embed_text(text=radio_query_fr, target='multimodal')
radio_fr_embedding


# Print the URLs for images similar to the query embeddings
radio_fr_results = impresso_session.images.find(
  embedding=radio_fr_embedding,
  limit=6
)

for uid, r in radio_fr_results.df.head(5).iterrows():

  print(f"Result {uid} - link to image CI {r.previewUrl} - type {r['imageTypes.visualContentType']}")
  if 'contentItemUid' in r and str(r.contentItemUid)!='nan':
    print(f"       {r.contentItemUid} - link to corresponding CI {img_webapp_url(r.contentItemUid)}")

Result LLE-1952-10-25-a-i0236 - link to image CI https://dev.impresso-project.ch/media/iiif/LLE-1952-10-25-a-p0017/237,3718,1114,668/max/0/default.jpg - type Non-Figurative Visual Content
Result LLE-1952-11-15-a-i0300 - link to image CI https://dev.impresso-project.ch/media/iiif/LLE-1952-11-15-a-p0023/2619,186,1109,665/max/0/default.jpg - type Non-Figurative Visual Content
Result oeuvre-1935-04-05-a-i0187 - link to image CI https://gallica.bnf.fr/iiif/ark:/12148/bpt6k46197689/f8/2299,595,838,608/max/0/default.jpg - type Object
Result lepetitparisien-1941-03-15-a-i0082 - link to image CI https://gallica.bnf.fr/iiif/ark:/12148/bpt6k684309k/f1/69,1788,815,1211/max/0/default.jpg - type Human Representation - Scene
Result oeuvre-1938-11-04-a-i0187 - link to image CI https://gallica.bnf.fr/iiif/ark:/12148/bpt6k4621627t/f7/4642,3797,911,1083/max/0/default.jpg - type Non-Figurative Visual Content


> We retrieve more images of actual radio sets, often with people present. It can be useful to compare this with a similar sentence in English, or to refine the query to explicitly require a human figure in the scene.

In [None]:
radio_query_en = "People listening to a radio monitor."

# Embed radio_query_en prompt using Open-Clip model
radio_en_embedding = impresso_session.tools.embed_text(text=radio_query_en, target='multimodal')
radio_en_embedding


# Print the URLs for images similar to the query embeddings
radio_en_results = impresso_session.images.find(
  embedding=radio_en_embedding,
  limit=6
)

for uid, r in radio_en_results.df.head(5).iterrows():

  print(f"Result {uid} - link to image CI {r.previewUrl} - type {r['imageTypes.visualContentType']}")
  if 'contentItemUid' in r and str(r.contentItemUid)!='nan':
    print(f"       {r.contentItemUid} - link to corresponding CI {img_webapp_url(r.contentItemUid)}")

Result IMP-1997-07-21-a-i0042 - link to image CI https://dev.impresso-project.ch/media/iiif/IMP-1997-07-21-a-p0004/1267,690,1710,1149/max/0/default.jpg - type Human Representation - Scene
       IMP-1997-07-21-a-i0038 - link to corresponding CI https://impresso-project.ch/app/issue/IMP-1997-07-21-a/view?articleId=i0038
Result tageblatt-1941-02-15-a-i0113 - link to image CI https://iiif.eluxemburgensia.lu/image/iiif/2/ark:70795%2fgwq95j%2fpages%2f12/2913,229,794,1027/max/0/default.jpg - type Human Representation - Scene
Result EXP-1967-03-29-a-i0055 - link to image CI https://dev.impresso-project.ch/media/iiif/EXP-1967-03-29-a-p0003/497,637,824,564/max/0/default.jpg - type Human Representation - Scene
       EXP-1967-03-29-a-i0050 - link to corresponding CI https://impresso-project.ch/app/issue/EXP-1967-03-29-a/view?articleId=i0050
Result IMP-1966-10-19-a-i0091 - link to image CI https://dev.impresso-project.ch/media/iiif/IMP-1966-10-19-a-p0009/50,824,641,458/max/0/default.jpg - type Hu

> Querying in English seems to have done the trick here: as you can see, most images are of the type `Human Representation - Scene`!
**Don't hesitate to explore further with different queries, both simple and more complex, in various languages, and with the help of the image type filter** to specify more precisely what is of interest!
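
When comparing two queries, a quick way to see how much their results overlap is the Jaccard index (shared items divided by all distinct items). The sketch below applies it to the result IDs printed above for the French and English "people listening" queries, as far as they are visible in the output. The `jaccard` helper is illustrative and not part of the Impresso library.

```python
def jaccard(a, b):
    """Share of distinct items that two collections have in common (0.0 to 1.0)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Result IDs copied from the printed outputs above
french_uids = [
    "LLE-1952-10-25-a-i0236",
    "LLE-1952-11-15-a-i0300",
    "oeuvre-1935-04-05-a-i0187",
    "lepetitparisien-1941-03-15-a-i0082",
    "oeuvre-1938-11-04-a-i0187",
]
english_uids = [
    "IMP-1997-07-21-a-i0042",
    "tageblatt-1941-02-15-a-i0113",
    "EXP-1967-03-29-a-i0055",
    "IMP-1966-10-19-a-i0091",
]

shared = sorted(set(french_uids) & set(english_uids))
print(f"Shared results: {shared}")
print(f"Jaccard overlap: {jaccard(french_uids, english_uids):.2f}")
```

The two lists share no item, which illustrates how strongly the query language can steer the top results.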

# Image-only embeddings with DINOv2
Let's take a closer look at **image-to-image embeddings** and search for images similar to ones that are of particular interest to us.

## 1. Searching for similar images with a reference image
Suppose we are interested in studying the **spread of new technologies in the 1980s**: in that case, the image `EXP-1983-08-31-a-i0208` from content item `EXP-1983-08-31-a-i0195` could serve as a useful reference to discover similar articles.


In [None]:
example_image_id = 'EXP-1983-08-31-a-i0208'

example_embedding = impresso_session.images.get_embeddings(example_image_id)
example_embedding[1]

'dinov2-1024:g2p8vPJhgzu3OYQ8xNlnOl9Y/7yLUXy8ilwMO41QibztvAG7P4VLvD6UpjzlDA09m3b3vKnwWD0n3RY8wgWvOt2mQT0bFLC99ts5vEMeET26t5w9Kan/PDX/MT2HbZ48GZievMIdUbze90g82RkRvGKzBL3vCxg82M7SO2mSbDzja189NDnsvJpeDbxVBCG7ghYQPSGZJTwI1Ne7d6xNPSDFBLzafuO7bYOpPNxI2bwM4tE8F8+zPKHTcb3rOC49yzgOOrOkvTyiRSO88tqHPOq887xNct27zMjDu7ADGr3qQBU8B5mFPcdh2bxoYCQ9rBXfPL2JHr1PZ8y8jEKGPNRRFTwXZlG8INDdPLNMtTyVA9e8/XajPWMwOb1gW+Q8G4gVPetcA7yOdbA8NC3aPJbqQrz1G4a80SOoPB1rcj31FnG9w1YNvS6M3jwFSFW7xc1LPcdcAjwFrq86TYDUuy2ikb3JBKw8meAxO5SMFLwje1U8OvgGPVCC9zwdAPA83A4YO1FIAj2mLGq9NbUDPZgOVTx6Yq68EDi+O+hA2j2v0hC93qTdPEBr5jyRc5M7CsOgvCulXT2DYuY8hW+wOznEiDsGH6q8L2qoPMp+GLywICM8V7lmvQ/KsrzOmWO9hGwLvWrTOb0v68+8GSGXu0dpE70TD5U754uivGcqKT1tSQs9C4rsvIaM2DwiPci8Ge77vHEdl7nKpeo89COFvfKYfb3Sjxg9B/FGPQJG7Dxth2q8TzZovadgEb23k5W8m6HYuxWDbz1SZCs9ACQRPOikN7xMl968mscMvaZpLLzqhSi9t1dMPWo1j7w0xbw8D2OZPQh1CLxYGwK9DsE+PFY2wDyWqMm8Dzl1vCR6eroy7sI8CqTzO8EQ5DxhA3U6wA6fPIBBH7yiz4I627ZKvPZxEjwewDO9BICivPc+pjx+Xyq7LXTpO2iX7zyU5na7VE4CvX9UATw7RgG9RLS3OndcLb1

In [None]:
dino_results = impresso_session.images.find(
  embedding=example_embedding[1],
  limit=7
)

# Print the URLs for images similar to the query embeddings
for uid, r in dino_results.df.head(6).iterrows():
  if uid != example_image_id:
    print(f"Result {uid} - link to image CI {r.previewUrl} - type {r['imageTypes.visualContentType']}")
    if 'contentItemUid' in r and str(r.contentItemUid)!='nan':
      print(f"       {r.contentItemUid} - link to corresponding CI {img_webapp_url(r.contentItemUid)}")

Result EXP-1969-09-04-a-i0031 - link to image CI https://dev.impresso-project.ch/media/iiif/EXP-1969-09-04-a-p0002/317,665,557,361/max/0/default.jpg - type Human Representation - Scene
       EXP-1969-09-04-a-i0021 - link to corresponding CI https://dev.impresso-project.ch/app/issue/EXP-1969-09-04-a/view?articleId=i0021
Result VHT-1976-04-07-a-i0023 - link to image CI https://dev.impresso-project.ch/media/iiif/VHT-1976-04-07-a-p0005/2135,3109,1335,935/max/0/default.jpg - type Human Representation - Scene
Result luxland-2007-02-09-a-i0136 - link to image CI https://iiif.eluxemburgensia.lu/image/iiif/2/ark:70795%2fx5v85z%2fpages%2f24/2554,2782,975,643/max/0/default.jpg - type Human Representation - Scene
Result EXP-1995-08-24-a-i0149 - link to image CI https://dev.impresso-project.ch/media/iiif/EXP-1995-08-24-a-p0017/76,4308,1335,900/max/0/default.jpg - type Human Representation - Scene
       EXP-1995-08-24-a-i0146 - link to corresponding CI https://dev.impresso-project.ch/app/issue/EXP

> We can see that the retrieved images are different, yet they share several key characteristics with our base image: they depict what appear to be mid- to late-twentieth-century technologies and show people interacting with them.
However, the publication dates of the corresponding articles vary widely, ranging from the late 1950s to the mid-1990s. To narrow the results, we can apply a date filter to restrict the search to images published in issues from the mid-1970s to the mid-1990s. We can further refine the query by using the type filter introduced earlier to ensure that people are present in the images.
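
To check the date spread of a result list without opening each link, the publication date can be read from the ID itself. The sketch below assumes that every ID embeds the issue date as `YYYY-MM-DD`, which holds for the IDs shown in this notebook. The `uid_date` helper is hypothetical, and the date window matches the filter applied in the next section.

```python
import re
from datetime import date

# Matches the '-YYYY-MM-DD-' part of IDs such as 'EXP-1969-09-04-a-i0031'
UID_DATE = re.compile(r"-(\d{4})-(\d{2})-(\d{2})-")


def uid_date(uid):
    """Extract the issue date embedded in an image or content item ID."""
    match = UID_DATE.search(uid)
    if match is None:
        raise ValueError(f"no date found in {uid!r}")
    year, month, day = map(int, match.groups())
    return date(year, month, day)


# IDs copied from the printed results above
uids = [
    "EXP-1969-09-04-a-i0031",
    "VHT-1976-04-07-a-i0023",
    "luxland-2007-02-09-a-i0136",
    "EXP-1995-08-24-a-i0149",
]

start, end = date(1975, 1, 1), date(1995, 1, 1)
in_range = [uid for uid in uids if start <= uid_date(uid) <= end]
print(in_range)
```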

## 2. Searching for similar images with complex filters

In [None]:
# Filter results based on the image content type "Human Representation - Scene" OR "Human Representation - Portrait" AND date range 1975-1995
filter_results = impresso_session.images.find(
  content_type=OR("Human Representation - Scene", "Human Representation - Portrait"),
  embedding=example_embedding[1],
  date_range=DateRange("1975-01-01", "1995-01-01"),
  limit=7
)

# Print the URLs for images similar to the query embeddings
for uid, r in filter_results.df.head(6).iterrows():
  if uid != example_image_id:
    print(f"Result {uid} - link to image CI {r.previewUrl} - type {r['imageTypes.visualContentType']}")
    if 'contentItemUid' in r and str(r.contentItemUid)!='nan':
      print(f"       {r.contentItemUid} - link to corresponding CI {img_webapp_url(r.contentItemUid)}")

Result VHT-1976-04-07-a-i0023 - link to image CI https://dev.impresso-project.ch/media/iiif/VHT-1976-04-07-a-p0005/2135,3109,1335,935/max/0/default.jpg - type Human Representation - Scene
Result IMP-1993-02-06-a-i0181 - link to image CI https://dev.impresso-project.ch/media/iiif/IMP-1993-02-06-a-p0017/699,641,902,706/max/0/default.jpg - type Human Representation - Scene
       IMP-1993-02-06-a-i0177 - link to corresponding CI https://dev.impresso-project.ch/app/issue/IMP-1993-02-06-a/view?articleId=i0177
Result GDL-1986-05-17-a-i0004 - link to image CI https://dev.impresso-project.ch/media/iiif/GDL-1986-05-17-a-p0001/1565,2158,1281,694/max/0/default.jpg - type Human Representation - Scene
       GDL-1986-05-17-a-i0001 - link to corresponding CI https://dev.impresso-project.ch/app/issue/GDL-1986-05-17-a/view?articleId=i0001
Result GDL-1978-06-20-a-i0086 - link to image CI https://dev.impresso-project.ch/media/iiif/GDL-1978-06-20-a-p0012/1452,1412,2017,975/max/0/default.jpg - type Human 

## 3. Searching for similar images with an external URL

Finally, if we find an image online that fits our research interests, **we can also use it directly as input for our search**. We simply need to embed the image using the same model (in this case, DINOv2) and then run the same similarity search as before.

For example, from the Wikimedia Commons category [People listening to radios](https://commons.wikimedia.org/wiki/Category:People_listening_to_radios), we selected an [image](https://commons.wikimedia.org/wiki/File:REA,_%22Little_girl_by_radio%22_-_NARA_-_195876.tif) of a girl listening to the radio. To use it, we only need to choose a version of the image at an appropriate resolution, for instance the 527 × 628 pixel version available on the image's Wikimedia Commons page, and use its link as input for the embedding step.

In [None]:
# Embedding an image from a URL

image_url = 'https://gallica.bnf.fr/iiif/ark:/12148/bpt6k6069079/f2/775,369,1303,887/max/0/default.jpg'
external_embedding = impresso_session.tools.embed_image(image=image_url, target="image")
external_embedding
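
The URL used in the cell above, like the `previewUrl` values printed earlier, follows the IIIF Image API pattern `{base}/{region}/{size}/{rotation}/{quality}.{format}`, where the region is `x,y,width,height` in pixels. The sketch below takes such a URL apart and builds a variant with a different size. The helper names are hypothetical, and whether a given server accepts the `!512,512` (best fit) size is an assumption to verify before use.

```python
def parse_iiif_url(url):
    """Split an IIIF Image API URL into its five trailing components."""
    base, region, size, rotation, filename = url.rsplit("/", 4)
    quality, fmt = filename.rsplit(".", 1)
    return {
        "base": base,
        "region": region,
        "size": size,
        "rotation": rotation,
        "quality": quality,
        "format": fmt,
    }


def with_size(url, size):
    """Return the same IIIF URL with a different size component."""
    p = parse_iiif_url(url)
    return f"{p['base']}/{p['region']}/{size}/{p['rotation']}/{p['quality']}.{p['format']}"


image_url = "https://gallica.bnf.fr/iiif/ark:/12148/bpt6k6069079/f2/775,369,1303,887/max/0/default.jpg"
parts = parse_iiif_url(image_url)

# The region is the cropped rectangle on the page: x, y, width, height
x, y, width, height = (int(v) for v in parts["region"].split(","))
print(f"Crop of {width} x {height} pixels at ({x}, {y})")
print(with_size(image_url, "!512,512"))
```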


In [None]:
# Searching similar images from the embedded image URL

results = impresso_session.images.find(
  embedding=external_embedding,
  limit=5
)

results

# Conclusion

In this notebook, we explored how Impresso's models - **Open-CLIP for text-to-image search** and **DINOv2 for image-to-image similarity** - can be used to navigate historical visual collections.
Starting from simple and more descriptive queries, we saw how Open-CLIP retrieves radio programs and illustrated section titles across languages, before turning to DINOv2 to find visually similar images from a single reference example.
Together, these approaches show **how multimodal and visual embeddings can help us move beyond keyword search**.

---
## Project and License info

### Notebook credits [CreditLogo.png](https://credit.niso.org/)

**Writing - Original draft:**  Roman Kalyakin. **Conceptualization:** Marten Düring. **Software:** Roman Kalyakin. **Writing - Review & Editing**: Pauline Conti, Cao Vy. **Validation:** Marten Düring. **Datalab editorial board:** Caio Mello (Managing), Cao Vy, Pauline Conti, Emanuela Boros, Marten Düring, Juri Opitz, Martin Grandjean, Estelle Bunout. **Data curation & Formal analysis:** Maud Ehrmann, Emanuela Boros, Pauline Conti, Simon Clematide, Juri Opitz, Andrianos Michail. **Methodology:** Roman Kalyakin. **Supervision:** Marten Düring. **Funding acquisition:** Maud Ehrmann, Simon Clematide, Marten Düring, Raphaëlle Ruppen Coutaz.

<br><a target="_blank" href="https://creativecommons.org/licenses/by/4.0/">
  <img src="https://mirrors.creativecommons.org/presskit/buttons/88x31/png/by.png"  width="100" alt="CC BY 4.0"/>
</a>

This notebook is published under [CC BY 4.0 License](https://creativecommons.org/licenses/by/4.0/)

For feedback on this notebook, please send an email to info@impresso-project.ch

### Impresso project

[Impresso - Media Monitoring of the Past](https://impresso-project.ch) is an interdisciplinary research project that aims to develop and consolidate tools for processing and exploring large collections of media archives across modalities, time, languages and national borders. The first project (2017-2021) was funded by the Swiss National Science Foundation under grant No. [CRSII5_173719](http://p3.snf.ch/project-173719) and the second project (2023-2027) by the SNSF under grant No. [CRSII5_213585](https://data.snf.ch/grants/grant/213585) and the Luxembourg National Research Fund under grant No. 17498891.
<br></br>
### License

All Impresso code is published open source under the [GNU Affero General Public License](https://github.com/impresso/impresso-pyindexation/blob/master/LICENSE) v3 or later.


---

<p align="center">
  <img src="https://github.com/impresso/impresso.github.io/blob/master/assets/images/3x1--Yellow-Impresso-Black-on-White--transparent.png?raw=true" width="350" alt="Impresso Project Logo"/>
</p>
