# Exploring Sentence, Chunk, and Entity Embeddings for Retrieval  


This notebook provides a **proof of concept (PoC)** for working with embeddings in the *Impresso* corpus across three levels of granularity:  

1. **Sentence embeddings** – fine-grained retrieval at the level of individual sentences (e.g., `lepetitparisien-1912-11-13-a-i0001-s-11`).
2. **Chunk embeddings** – broader retrieval at the level of aggregated text chunks (e.g., `lepetitparisien-1912-11-13-a-i0001-c-1`).
3. **Entity embeddings** – retrieval of linked entities (e.g., `Q380083` or `Jonas Furrer`).

We refer to sentences and chunks as _subdocuments_ and we test these *subdoc* embeddings with two complementary query scenarios:  

- **In-corpus queries** – selecting a query directly from the *Impresso* corpus.
- **Out-of-corpus queries** – embedding an external query (e.g., manually formulated or from another source).

For the purpose of this PoC:

👉 The **subdocs** are the sentence and chunk embeddings for the **front pages** of all newspapers in **1912 (Titanic)** and **1986 (Chernobyl)**.  
👉 The **entities** are the entity embeddings for **person entities** within ±5 years of those same two years.

From now on, we refer to this PoC as a set of **experiments**. The experiments use direct queries to our internal retrieval system ([Solr](https://solr.apache.org/)).

Let's first connect to Impresso:

In [1]:
from impresso import connect

impresso = connect('https://dev.impresso-project.ch/public-api/v1')

🎉 You are now connected to the Impresso API!  🎉
ðŸ”— Using API: https://dev.impresso-project.ch/public-api/v1


## Get all experiments

In [2]:
impresso.experiments.find()

| id | name | description |
|---|---|---|
| subdoc-embeddings | Experiment with sentence and character level e... | \n Generates embeddings for subdocuments usin... |
| entity-profiles | Experiment with entity profiles and their embe... | \n Generates embeddings for subdocuments usin... |


### Subdoc Embeddings Experiments

#### Sentence Embeddings - In-corpus queries

Let's run a keyword search for a few sentences, take the embedding of the first hit, and then use that embedding to search Impresso by vector similarity.

In [3]:
sentence = "Le congrès international s'est tenu à Paris pour discuter des avancées scientifiques de la décennie."

result = impresso.experiments.execute(
    experiment_id="subdoc-embeddings",
    body={
        "solrPayload": {
            "query": f"content_txt_fr:({sentence}) AND type_s:s", # type_s:s restricts search to sentences
            "limit": 5,
            "params": {"hl": False}
        }
    }
)

docs = result["solrResponse"]["response"]["docs"]
print(f"✅ Got {len(docs)} Solr document(s)\n")


✅ Got 5 Solr document(s)
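
The keyword query above interpolates free text directly into the Solr query string. If the text contains query-parser metacharacters such as `:`, `(` or `?`, the query may be parsed differently than intended. The sketch below shows one way to backslash-escape such characters first. `escape_solr_text` is a hypothetical helper, not part of the `impresso` library, and it assumes the endpoint uses the standard Solr/Lucene query parser.

```python
import re

# Characters with special meaning in the standard Solr/Lucene query parser.
_SOLR_SPECIAL = re.compile(r'([+\-!(){}\[\]^"~*?:\\/&|])')


def escape_solr_text(text: str) -> str:
    """Backslash-escape query-parser metacharacters in free text (hypothetical helper)."""
    return _SOLR_SPECIAL.sub(r"\\\1", text)


# Illustrative input containing parentheses, a colon and a question mark.
sentence = "Paris (1912): congrès international?"
query = f"content_txt_fr:({escape_solr_text(sentence)}) AND type_s:s"
print(query)
```

The escaped string can then be passed as the `"query"` value of `solrPayload`, exactly as in the cell above.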



In [4]:
for i, d in enumerate(docs, 1):
    print(f"--- Result {i} ---")
    print(d.get("content_txt_fr", "[No text]"))

--- Result 1 ---
Mais c’ est une question qui ne peut se régler en congrès internationaux et c’ est pourquoi le pays cjui ne présente pas une natalité suffisante sera étranglé, ce qui ne sera d’ ailleurs qu’ une avance sur son sui- cide.
--- Result 2 ---
ÉDITION DE PARIS Les Eves nouvelles Les suffragettes françaises on * tenu, dimanche, une réunion où fut discutée l' intéressante question du vote municipal des femmes.
--- Result 3 ---
Le 20 juillet 1889, au Congrès socialiste intermfci mal de Paris, il proposa la résolution suivante : « Il sera organisé une grande manifestation internationale à dr.
--- Result 4 ---
Le congrès socialiste de 19 10 a réservé la question de principe, mais il avait stipulé que conformément aux résolutions des congrès internationaux de Paris et d’ Amsterdam, il n’ admettait pas comme possible la participation individuelle de certains socialistes, sans l’ assentiment du parti ouvrier, à un ministère quelconque ; le con

In [26]:
list(docs[0].keys())


In [6]:
docs[0]['content_txt_fr'][:200]  # first 200 characters of the document content

'Mais c’ est une question qui ne peut se régler en congrès internationaux et c’ est pourquoi le pays cjui ne présente pas une natalité suffisante sera étranglé, ce qui ne sera d’ ailleurs qu’ une avanc'

Let's take the first returned document's embedding.

In [7]:
embedding = docs[0]['gte_multi_v768']

embedding[:3]

[-0.081427164, 0.064372316, -0.045108054]

In [8]:
embedding = result['solrResponse']['response']['docs'][0]['gte_multi_v768']

In [9]:
result = impresso.experiments.execute(
    experiment_id="subdoc-embeddings",
    body={
        "solrPayload": {
            "query": "{!knn f=gte_multi_v768 topK=3}" + str(embedding),
            "limit": 3,
            "params": {
                "fq": "type_s:s",   # type_s:s restricts search to sentences (s=sentence)
                # "fl": "id,score,content_txt_fr,ci_id_s", -- add these later if you want to return only specific fields
                # for now let's return everything
                "hl": False
            }
        }
    }
)

docs = result["solrResponse"]["response"]["docs"]
print(f"✅ Got {len(docs)} Solr document(s)\n")

✅ Got 3 Solr document(s)
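
The kNN request above has a fixed shape: a `{!knn f=<field> topK=<k>}` prefix followed by the vector literal, plus an optional `fq` filter on `type_s`. A minimal sketch of a helper that assembles this request body is shown below. `knn_body` is a hypothetical name introduced only for illustration. The field name and filter values are the ones used in this notebook.

```python
def knn_body(vector, field="gte_multi_v768", top_k=3, subdoc_type=None):
    """Build the `body` argument for a kNN request (hypothetical helper)."""
    # Solr expects the query vector as a bracketed, comma-separated list.
    vector_literal = "[" + ", ".join(repr(float(x)) for x in vector) + "]"
    params = {"hl": False}
    if subdoc_type is not None:
        # "s" restricts the search to sentences, "c" to chunks.
        params["fq"] = f"type_s:{subdoc_type}"
    return {
        "solrPayload": {
            "query": f"{{!knn f={field} topK={top_k}}}{vector_literal}",
            "limit": top_k,
            "params": params,
        }
    }


body = knn_body([-0.081427164, 0.064372316, -0.045108054], subdoc_type="s")
print(body["solrPayload"]["query"])
```

The resulting dictionary can be passed as `body=` to `impresso.experiments.execute(experiment_id="subdoc-embeddings", ...)`, as in the cell above.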



In [10]:
# docs

In [11]:
for i, d in enumerate(docs):
    print(f"--- Result {i} ---")
    print(d.get("content_txt_fr", "[No text]"))
    print(d.get("content_txt_de", "[No text]"))

--- Result 0 ---
Mais c’ est une question qui ne peut se régler en congrès internationaux et c’ est pourquoi le pays cjui ne présente pas une natalité suffisante sera étranglé, ce qui ne sera d’ ailleurs qu’ une avance sur son sui- cide.
[No text]
--- Result 1 ---
[No text]
weil sie nicht die ganze Nation in dem Parlament vertreten sehen wolle.
--- Result 2 ---
[No text]
Es steht zu hoffen, daß damit die vom Bundesrate getanen völlig gewesen Schritte nicht nutzlos sind, und daß in nicht allzu ferner Zeit doch noch eine solche internationale Konferenz sich mit dein Problem beschäftigen wird.


#### Sentence Embeddings - Out-of-corpus queries


In [12]:
sentence = "Le congrès international s'est tenu à Paris pour discuter des avancées scientifiques de la décennie."

embedding = impresso.tools.embed_text(text=sentence, target="text")
embedding

'gte-768:SldQPEKgFj3hqBq9KbZDO44Ntr2m3B28HGQiPBZYHDzoyi89VZGjvdn/jD376Ia83lSSO/TikzyG1fu9RWKxPOTYkj0jF8k99JncPaxsarx18VI92R20PTlvrbxuzWS9A4uUO6lYFTsWTQ09+QKVvVQiXL3mpiE9FbOxvRhr/DyplVE8vIZ6PQ/ivj0660e90aF9PE0Yrzx12ZC9Awv/O0xEortNwYO8vr0rvTctNrtTsAu87z18PJPXGL2lOO08N3NSPORAgz1FtBY9YC4cPY3tFT0JM8g8aQ5zvHa067v9YS+9wol3vAzUtj2SRWG9jPGMO4I2M7uCBLy8qxM3PaXMYb3Rs1+9ukY5POf2lbycXIc9j/XyvM9A4L29ToA8lBeBPAsP6DwO+3i9y0m4PQqOAL3f1eC8NpOEPDLQpzzHMxK9MQc1PLpK9rwbOxI9EH4ZvZiKrrzGDFi8qVTbuztLPDr/IYQ9HwPZPcMXfT1SmI06B6WOvNNM9jwBFWo4X+24vJDvoL1biHC8vca5O6NWYbySuqW8KkXkvG8SirzGp+w8J0WSvCIjuLtRdjG9irEYvYnl8juWY/Y7iURPvEciEL1BHzm9gkebvSwx3Dz4smQ8f1lKPRGzYD0NzEa9ujSDOyduHLyrzYI78vxhvT/5xb24sBi9IgSzPBhvsr1Yufc8kyFIvYcaGT2KvtK69kKCPI9oD73T+bq8vFyDvW2jp7wwuTM9+FC6PFAzlbuWlns7sEdiPbj6ibxZYEI9Cr4dPMEeXT2Qsom8FObQvPubDT3o8xs9N69IOwMPsLtImW09uayXvFIUMDz8ssK7eXa7vIYX6zw2G3O9R5eKvCH+kz0plv48P516PSE+lb1heNW8rjr2O8RRrTvoQ8O84cOaPX0YbjwHJ008qCejPNlKFDy5BPO85f6RvDaqWzsYDvU8DCqAPck/ib15Z549f/jiO5MYiL2aEFi9g2uNvdMusDuMeWW

In [13]:
import base64
import struct

# convert base64 string to a float array
_, arr = embedding.split(':')
arr = base64.b64decode(arr)
outof_corpus_emb = [struct.unpack('f', arr[i:i+4])[0] for i in range(0, len(arr), 4)]

len(outof_corpus_emb)

768

In [14]:
outof_corpus_emb[:5]

[0.012716123834252357,
 0.0367739275097847,
 -0.037758711725473404,
 0.002986321458593011,
 -0.08889304101467133]
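
The decoding step above can be wrapped in a small reusable function. The sketch below assumes the string has the form `<tag>:<base64>` and that the payload is a sequence of little-endian 32-bit floats, which is consistent with the values printed above. `decode_embedding` is a hypothetical helper name, not part of the `impresso` library.

```python
import base64
import struct


def decode_embedding(encoded: str) -> list[float]:
    """Decode a '<tag>:<base64>' embedding string into a list of floats."""
    # Split off the model tag (e.g. "gte-768") from the base64 payload.
    _, payload = encoded.split(":", 1)
    raw = base64.b64decode(payload)
    if len(raw) % 4 != 0:
        raise ValueError("payload length is not a multiple of 4 bytes")
    # Unpack all little-endian float32 values in one call.
    return list(struct.unpack(f"<{len(raw) // 4}f", raw))


# The first 16 base64 characters of the embedding string shown above
# encode 12 bytes, i.e. the first three vector components.
first_values = decode_embedding("gte-768:SldQPEKgFj3hqBq9")
print(first_values)
```

Applied to the full string returned by `impresso.tools.embed_text`, the function should return the same 768 values as the cell above.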

In [15]:
result = impresso.experiments.execute(
  experiment_id="subdoc-embeddings",
  body={
    "solrPayload": {
      "query": "{!knn f=gte_multi_v768 topK=3}" + str(outof_corpus_emb),
      "limit": 3,
      "params": {
        # No "fq" filter here, so hits are not restricted to sentences (type_s:s)
        "hl": False
      }
    }
  }
)
docs = result["solrResponse"]["response"]["docs"]
print(f"✅ Got {len(docs)} Solr document(s)\n")

✅ Got 3 Solr document(s)



In [16]:
for i, d in enumerate(docs):
    print(f"--- Result {i} ---")
    print(d.get("content_txt_fr", "[No text]"))
    print(d.get("content_txt_de", "[No text]"))

--- Result 0 ---
[No text]
Bei der internationalen Delegierten konferenz der Notare han.
--- Result 1 ---
[No text]
l " " " erwähnten Kongreß « « « gesetzte internatio. " "
--- Result 2 ---
Paris, 10 janvier.
[No text]
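
To inspect how close an out-of-corpus query vector is to a retrieved document vector, the similarity can also be computed client-side. The sketch below uses cosine similarity, which is a common choice for sentence embeddings. Which similarity function the Solr field is actually configured with is not stated in this notebook, so treat that as an assumption. `cosine_similarity` is an illustrative helper.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Toy vectors: identical, orthogonal and opposite directions.
print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))
print(cosine_similarity([1.0, 0.0], [-1.0, 0.0]))
```

With the notebook's variables, a call such as `cosine_similarity(outof_corpus_emb, docs[0]["gte_multi_v768"])` would compare the query vector with the top hit, provided the returned document includes that field.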


#### Chunk Embeddings - In-corpus queries


In [17]:
chunk = (
    "Le congrès international s'est tenu à Paris pour discuter des avancées scientifiques de la décennie. "
    "Des chercheurs venus de nombreux pays ont présenté leurs travaux les plus récents dans les domaines de la physique, "
    "de la biologie et des sciences sociales. "
    "Les débats ont mis en lumière les progrès réalisés grâce à la collaboration entre institutions européennes et américaines, "
    "ainsi que les défis à venir pour une recherche plus ouverte et interdisciplinaire. "
    "La rencontre s’est conclue par l’adoption d’une résolution encourageant la diffusion libre des connaissances scientifiques.")

result = impresso.experiments.execute(
    experiment_id="subdoc-embeddings",
    body={
        "solrPayload": {
            "query": f"content_txt_fr:({chunk}) AND type_s:c",
            "limit": 5,
            "params": {"hl": False}
        }
    }
)

docs = result["solrResponse"]["response"]["docs"]
print(f"✅ Got {len(docs)} Solr document(s)\n")

✅ Got 5 Solr document(s)



In [18]:
for i, d in enumerate(docs):
    print(f"--- Result {i} ---")
    print(d.get("content_txt_fr", "[No text]"))
    print(d.get("content_txt_de", "[No text]"))

--- Result 0 ---
et leurs gouvernements respectifs. '.. ".'•••En outre, la nouvelle convention accorde la franchise de port pour la correspondance drdniaire des institutions nationales ayant un caractère scientifique et d' intérêt général ; ainsi qu' aux congrès scientifiques sud-américains composés de la majorité des pays de ce continent.Cette sage disposition constitue pour les sciences un bel encouragement et elle contribuera à répandre avec plus de facilité sur les immenses territoires de l' Amérique du Sud les progrès dus aux- efforts du génie humain.D' autre pari, le congrès postal dont nous analysons les travaux n' a pas blié les services rendus jpar la presse dans le domaine de l' éducation - des peuples.Il a cru devoir également favoriser la diffusion des idées, dans le louable désir d' accélérer - l' avènement de la turité - politique des citoyens.C' est pourquoi la franchise de.port est accordée aux éditeurs de journaux quotidiens et de publi

In [19]:
embedding = docs[0]['gte_multi_v768']

embedding[:3]

[-0.039117645, 0.062711135, -0.060027212]

In [20]:
result = impresso.experiments.execute(
    experiment_id="subdoc-embeddings",
    body={
        "solrPayload": {
            "query": "{!knn f=gte_multi_v768 topK=3}" + str(embedding),
            "limit": 3,
            "params": {
                "fq": "type_s:c",   # type_s:c restricts search to chunks! (c=chunk)
                # "fl": "id,score,content_txt_fr,ci_id_s", -- add these later if you want to return only specific fields
                # for now let's return everything
                "hl": False
            }
        }
    }
)

docs = result["solrResponse"]["response"]["docs"]
print(f"✅ Got {len(docs)} Solr document(s)\n")

✅ Got 3 Solr document(s)



In [21]:
for i, d in enumerate(docs):
    print(f"--- Result {i} ---")
    print(d.get("content_txt_fr", "[No text]"))
    print(d.get("content_txt_de", "[No text]"))

--- Result 0 ---
[No text]
gskosten Briefbeföid «.auf die eigentliche Beförderung zwei Drittel aber auf den den Schalterdienst « nd Bestelldienst fallen.Der große Gedanke macht « Schule : er iibec » schritt die Grenzen des Landes ud n versuchte sich auch auf internationalem Boden.Nebst dem Weltpostverein, der uns im Jahre 1875 das internationale Einheitsporto für Briefe brachte, entstanden eine Anzahl Sonderpost der.eine, die Hauptsächlich den Zweck hatten, d « Inlandsporto der Vertragsstaaten auch auf ihren ieg « nseitiaen Auhenveikehr auszudehnen.Selbst uor dem Meere macht « das Einheitsporlo « ich » Halt.Im Jahre 1898 Antrag führte England auf den seines Abgeordneten Henniker-Heaton im Verkehr mit seinen Kolonien das Penny.Porto ein ; Deutschland folgte diesem Beispiele 1N99. Seit 19N8 19l ) 9 verkehrt Großbritannien unö seit auch Deutschland mit den Vereinigten Staaten von Amerika zum Inlandssahe.Begehren Auch in der Schweiz wurden schon laut, öfters die ähnliche

### Entity Profiles Experiment

In [22]:
result = impresso.experiments.execute(
  experiment_id="entity-profiles",
  body={
    "solrPayload": {
      "query": "wiki_url_s:*Simone*de*Beauvoir*",
      "limit": 1,
      "params": {
        "hl": False
      }
    }
  }
)
entity_doc = result['solrResponse']['response']['docs'][0]

list(entity_doc.keys())

['id',
 'imp_ids_ss',
 'surfaces_ss',
 'ci_ids_ss',
 'mention_keys_ss',
 'ci_lg_s',
 'wiki_masterlabel_s',
 'wiki_url_s',
 'date_of_birth_dt',
 'date_of_death_dt',
 'wkd_occupations_ss',
 'wkd_occupation_qids_ss',
 'wkd_entity_types_ss',
 'wiki_summaries_t',
 'contexts_ss',
 'entity_mixed_emb_v768',
 'entity_encyc_emb_v768',
 'entity_media_emb_v768',
 '_version_',
 '_root_']

In [23]:
entity_doc['wkd_occupations_ss']

['political philosopher',
 'journalist',
 'novelist',
 'autobiographer',
 'essayist',
 'political activist',
 'diarist',
 'women letter writer',
 'philosopher',
 'literary critic',
 'writer',
 'author',
 'feminist',
 'philosophy teacher']

In [24]:
result = impresso.experiments.execute(
  experiment_id="entity-profiles",
  body={
    "solrPayload": {
      "query": "{!knn f=entity_mixed_emb_v768 topK=5}" + str(entity_doc['entity_mixed_emb_v768']),
      "filter": [
          f"-id:{entity_doc['id']}" # exclude target entity itself
      ],
      "limit": 5,
      "params": {
        "hl": False
      }
    }
  }
)
docs = result['solrResponse']['response']['docs']
for doc in docs:
    print(doc['wiki_url_s'])

https://fr.wikipedia.org/wiki/Hélène_de_Beauvoir
https://fr.wikipedia.org/wiki/Jean_Beauvoir
https://fr.wikipedia.org/wiki/Simone_Chalon
https://fr.wikipedia.org/wiki/Sylvia_Earle
https://fr.wikipedia.org/wiki/Gustave_Simon
