# Exploring Sentence, Chunk, and Entity Embeddings for Retrieval  


This notebook provides a **proof of concept (PoC)** for working with embeddings in the *Impresso* corpus across three levels of granularity:  

1. **Sentence embeddings** ‚Äì fine-grained retrieval at the level of individual sentences (e.g., `lepetitparisien-1912-11-13-a-i0001-s-11`).  
2. **Chunk embeddings** ‚Äì broader retrieval at the level of aggregated text chunks (e.g., `lepetitparisien-1912-11-13-a-i0001-c-1`).  
3. **Entity embeddings** ‚Äì retrieval of linked entities (e.g., `Q380083` or `Jonas Furrer`).  

We refer to sentences and chunks as _subdocuments_ and we test these *subdoc* embeddings with two complementary query scenarios:  

- **In-corpus queries** ‚Äì selecting a query directly from the *Impresso* corpus.  
- **Out-of-corpus queries** ‚Äì embedding an external query (e.g., manually formulated or from another source).  

For the purpose of this PoC:

üëâ The **subdocs** = sentence and chunk embeddings for **front pages** of all newspapers in **1912 (Titanic)** and **1986 (Tchernobyl)**.  
üëâ The **entities** = entity embeddings for **person entities** in the years [+/- 5y] around the same years.

From now on, we refer to this PoC as a set of **experiments**. The experiments use direct queries to our internal retrieval system ([Solr](https://solr.apache.org/)).

Let's first connect to Impresso:

In [32]:
from impresso import connect

impresso = connect('https://dev.impresso-project.ch/public-api/v1')

üéâ You are now connected to the Impresso API!  üéâ
üîó Using API: https://dev.impresso-project.ch/public-api/v1


## Get all experiments

In [33]:
impresso.experiments.find()

Unnamed: 0_level_0,name,description
id,Unnamed: 1_level_1,Unnamed: 2_level_1
subdoc-embeddings,Experiment with sentence and character level e...,\n Generates embeddings for subdocuments usin...
entity-profiles,Experiment with entity profiles and their embe...,\n Generates embeddings for subdocuments usin...


### Subdoc Embeddings Experiments

#### Sentence Embeddings - In-corpus queries

Let's search for some documents, take their embeddings and then search by embedding in Impresso.

In [34]:
sentence = "Le congr√®s international s'est tenu √† Paris pour discuter des avanc√©es scientifiques de la d√©cennie."

result = impresso.experiments.execute(
    experiment_id="subdoc-embeddings",
    body={
        "solrPayload": {
            "query": f"content_txt_fr:({sentence}) AND type_s:s", # type_s:s restricts search to sentences
            "limit": 5,
            "params": {"hl": False}
        }
    }
)

docs = result["solrResponse"]["response"]["docs"]
print(f"‚úÖ Got {len(docs)} Solr document(s)\n")


‚úÖ Got 5 Solr document(s)



In [35]:
for i, d in enumerate(docs, 1):
    print(f"--- Result {i} ---")
    print(d.get("content_txt_fr", "[No text]"))

--- Result 1 ---
Mais c‚Äô est une question qui ne peut se r√©gler en congr√®s internationaux et c‚Äô est pourquoi le pays cjui ne pr√©sente pas une natalit√© suffisante sera √©trangl√©, ce qui ne sera d‚Äô ailleurs qu‚Äô une avance sur son sui- cide.
--- Result 2 ---
√âDITION DE PARIS Les Eves nouvelles Les suffragettes fran√ßaises on * tenu, dimanche, une r√©union o√π fut discut√©e l' int√©ressante question du vote municipal des femmes.
--- Result 3 ---
Le 20 juillet 1889, au Congr√®s socialiste intermfci mal de Paris, il proposa la r√©solution suivante : ¬´ Il sera organis√© une grande manifestation internationale √† dr.
--- Result 4 ---
Le congr√®s socialiste de 19 10 a r√©serv√© la question de principe, mais il avait stipul√© que conform√©ment aux r√©solutions des congr√®s internationaux de Paris et d‚Äô Amsterdam, il n‚Äô admettait pas comme possible la participation individuelle de certains socialistes, sans l‚Äô assentiment du parti ouvrier, √† un minist√®re quelconque ; le con

In [36]:
docs[0]

{'id': 'indeplux-1912-06-14-a-i0001-s-27',
 'type_s': 's',
 'content_txt_fr': 'Mais c‚Äô est une question qui ne peut se r√©gler en congr√®s internationaux et c‚Äô est pourquoi le pays cjui ne pr√©sente pas une natalit√© suffisante sera √©trangl√©, ce qui ne sera d‚Äô ailleurs qu‚Äô une avance sur son sui- cide.',
 'ci_id_s': 'indeplux-1912-06-14-a-i0001',
 'gte_multi_v768': [-0.081427164,
  0.064372316,
  -0.045108054,
  0.08742539,
  -0.016204905,
  -0.0032661548,
  -0.08083391,
  -0.030488549,
  0.01922982,
  -0.066497795,
  0.04242271,
  -0.036173802,
  -0.0028986486,
  0.034270063,
  -0.03690813,
  0.09999781,
  0.025211757,
  -0.004169398,
  0.04152064,
  0.032867514,
  0.11520032,
  0.09198125,
  0.005486816,
  0.015264476,
  0.017409217,
  -0.005969191,
  0.10060896,
  -0.066738024,
  -0.051272426,
  0.038004458,
  -0.10215557,
  -0.016743958,
  0.017888524,
  -0.00096381706,
  0.052417602,
  -0.073400766,
  0.047756754,
  -0.018304978,
  -0.03343914,
  0.015744846,
  -0.035610

In [37]:
docs[0]['content_txt_fr'][:200]  # first 200 characters of the document content

'Mais c‚Äô est une question qui ne peut se r√©gler en congr√®s internationaux et c‚Äô est pourquoi le pays cjui ne pr√©sente pas une natalit√© suffisante sera √©trangl√©, ce qui ne sera d‚Äô ailleurs qu‚Äô une avanc'

Let's take the first returned document's embedding.

In [38]:
embedding = docs[0]['gte_multi_v768']

embedding[:3]

[-0.081427164, 0.064372316, -0.045108054]

In [39]:
embedding = result['solrResponse']['response']['docs'][0]['gte_multi_v768']

In [75]:
result = impresso.experiments.execute(
    experiment_id="subdoc-embeddings",
    body={
        "solrPayload": {
            "query": "{!knn f=gte_multi_v768 topK=3}" + str(embedding),
            "limit": 3,
            "params": {
                "fq": "type_s:s",   # type_s:s restricts search to sentences (s=sentence)
                # "fl": "id,score,content_txt_fr,ci_id_s", -- add these later if you want to return only specific fields
                # for now let's return everything
                "hl": False
            }
        }
    }
)

docs = result["solrResponse"]["response"]["docs"]
print(f"‚úÖ Got {len(docs)} Solr document(s)\n")

‚úÖ Got 3 Solr document(s)



In [76]:
docs

[{'id': 'GDL-1912-01-25-a-i0001-s-65',
  'type_s': 's',
  'content_txt_fr': "En outre, la nouvelle convention accorde la franchise de port pour la correspondance drdniaire des institutions nationales ayant un caract√®re scientifique et d' int√©r√™t g√©n√©ral ; ainsi qu' aux congr√®s scientifiques sud-am√©ricains compos√©s de la majorit√© des pays de ce continent.",
  'ci_id_s': 'GDL-1912-01-25-a-i0001',
  'gte_multi_v768': [-0.051615752,
   0.05201664,
   -0.049590472,
   0.03735869,
   -0.05054572,
   0.0066682235,
   0.017149948,
   0.029176751,
   0.1008355,
   -0.06555545,
   0.02282818,
   -0.004284408,
   0.06918294,
   0.05266427,
   -0.1323649,
   0.07952411,
   0.10921394,
   0.061787345,
   0.026983595,
   -0.03820048,
   0.06531311,
   0.04358564,
   -0.022574957,
   -0.055607993,
   -0.017880114,
   0.08863045,
   0.075027466,
   0.00598791,
   -0.029623808,
   0.04483322,
   -0.07739359,
   0.013161964,
   0.06096676,
   0.07653891,
   0.020618537,
   0.066804744,
   0.003

In [77]:
for i, d in enumerate(docs):
    print(f"--- Result {i} ---")
    print(d.get("content_txt_fr", "[No text]"))
    print(d.get("content_txt_de", "[No text]"))

--- Result 0 ---
En outre, la nouvelle convention accorde la franchise de port pour la correspondance drdniaire des institutions nationales ayant un caract√®re scientifique et d' int√©r√™t g√©n√©ral ; ainsi qu' aux congr√®s scientifiques sud-am√©ricains compos√©s de la majorit√© des pays de ce continent.
[No text]
--- Result 1 ---
Certaines dispositions ins√©r√©es dans la nouvelle convention sud-am√©ricaine ont sans contredit une port√©e autre que simplement postale.
[No text]
--- Result 2 ---
port est accord√©e aux √©diteurs de journaux quotidiens et de publications p√©riodiques sud-am√©ricains pour les exemplaires jusqu' au nombre de deux √©chang√©s par.
[No text]


#### Sentence Embeddings - Out-of-corpus queries


In [78]:
sentence = "Le congr√®s international s'est tenu √† Paris pour discuter des avanc√©es scientifiques de la d√©cennie."

embedding = impresso.tools.embed_text(text=sentence, target="text")
embedding

'gte-768:SldQPEKgFj3hqBq9KbZDO44Ntr2m3B28HGQiPBZYHDzoyi89VZGjvdn/jD376Ia83lSSO/TikzyG1fu9RWKxPOTYkj0jF8k99JncPaxsarx18VI92R20PTlvrbxuzWS9A4uUO6lYFTsWTQ09+QKVvVQiXL3mpiE9FbOxvRhr/DyplVE8vIZ6PQ/ivj0660e90aF9PE0Yrzx12ZC9Awv/O0xEortNwYO8vr0rvTctNrtTsAu87z18PJPXGL2lOO08N3NSPORAgz1FtBY9YC4cPY3tFT0JM8g8aQ5zvHa067v9YS+9wol3vAzUtj2SRWG9jPGMO4I2M7uCBLy8qxM3PaXMYb3Rs1+9ukY5POf2lbycXIc9j/XyvM9A4L29ToA8lBeBPAsP6DwO+3i9y0m4PQqOAL3f1eC8NpOEPDLQpzzHMxK9MQc1PLpK9rwbOxI9EH4ZvZiKrrzGDFi8qVTbuztLPDr/IYQ9HwPZPcMXfT1SmI06B6WOvNNM9jwBFWo4X+24vJDvoL1biHC8vca5O6NWYbySuqW8KkXkvG8SirzGp+w8J0WSvCIjuLtRdjG9irEYvYnl8juWY/Y7iURPvEciEL1BHzm9gkebvSwx3Dz4smQ8f1lKPRGzYD0NzEa9ujSDOyduHLyrzYI78vxhvT/5xb24sBi9IgSzPBhvsr1Yufc8kyFIvYcaGT2KvtK69kKCPI9oD73T+bq8vFyDvW2jp7wwuTM9+FC6PFAzlbuWlns7sEdiPbj6ibxZYEI9Cr4dPMEeXT2Qsom8FObQvPubDT3o8xs9N69IOwMPsLtImW09uayXvFIUMDz8ssK7eXa7vIYX6zw2G3O9R5eKvCH+kz0plv48P516PSE+lb1heNW8rjr2O8RRrTvoQ8O84cOaPX0YbjwHJ008qCejPNlKFDy5BPO85f6RvDaqWzsYDvU8DCqAPck/ib15Z549f/jiO5MYiL2aEFi9g2uNvdMusDuMeWW

In [79]:
import base64
import struct

# convert base64 string to a float array
_, arr = embedding.split(':')
arr = base64.b64decode(arr)
outof_corpus_emb = [struct.unpack('f', arr[i:i+4])[0] for i in range(0, len(arr), 4)]

len(outof_corpus_emb)

768

In [80]:
outof_corpus_emb[:5]

[0.012716123834252357,
 0.0367739275097847,
 -0.037758711725473404,
 0.002986321458593011,
 -0.08889304101467133]

In [81]:
result = impresso.experiments.execute(
  experiment_id="subdoc-embeddings",
  body={
    "solrPayload": {
      "query": "{!knn f=gte_multi_v768 topK=3}" + str(outof_corpus_emb),
      "limit": 3,
      "params": {
        "hl": False
      }
    }
  }
)
docs = result["solrResponse"]["response"]["docs"]
print(f"‚úÖ Got {len(docs)} Solr document(s)\n")

‚úÖ Got 3 Solr document(s)



In [82]:
for i, d in enumerate(docs):
    print(f"--- Result {i} ---")
    print(d.get("content_txt_fr", "[No text]"))
    print(d.get("content_txt_de", "[No text]"))

--- Result 0 ---
[No text]
Bei der internationalen Delegierten konferenz der Notare han.
--- Result 1 ---
[No text]
l " " " erw√§hnten Kongre√ü ¬´ ¬´ ¬´ gesetzte internatio. " "
--- Result 2 ---
Paris, 10 janvier.
[No text]


#### Chunk Embeddings - In-corpus queries


In [83]:
chunk = (
    "Le congr√®s international s'est tenu √† Paris pour discuter des avanc√©es scientifiques de la d√©cennie. "
    "Des chercheurs venus de nombreux pays ont pr√©sent√© leurs travaux les plus r√©cents dans les domaines de la physique, "
    "de la biologie et des sciences sociales. "
    "Les d√©bats ont mis en lumi√®re les progr√®s r√©alis√©s gr√¢ce √† la collaboration entre institutions europ√©ennes et am√©ricaines, "
    "ainsi que les d√©fis √† venir pour une recherche plus ouverte et interdisciplinaire. "
    "La rencontre s‚Äôest conclue par l‚Äôadoption d‚Äôune r√©solution encourageant la diffusion libre des connaissances scientifiques.")

result = impresso.experiments.execute(
    experiment_id="subdoc-embeddings",
    body={
        "solrPayload": {
            "query": f"content_txt_fr:({chunk}) AND type_s:c",
            "limit": 5,
            "params": {"hl": False}
        }
    }
)

docs = result["solrResponse"]["response"]["docs"]
print(f"‚úÖ Got {len(docs)} Solr document(s)\n")

‚úÖ Got 5 Solr document(s)



In [84]:
for i, d in enumerate(docs):
    print(f"--- Result {i} ---")
    print(d.get("content_txt_fr", "[No text]"))
    print(d.get("content_txt_de", "[No text]"))

--- Result 0 ---
et leurs gouvernements respectifs. '.. ".'‚Ä¢‚Ä¢‚Ä¢En outre, la nouvelle convention accorde la franchise de port pour la correspondance drdniaire des institutions nationales ayant un caract√®re scientifique et d' int√©r√™t g√©n√©ral ; ainsi qu' aux congr√®s scientifiques sud-am√©ricains compos√©s de la majorit√© des pays de ce continent.Cette sage disposition constitue pour les sciences un bel encouragement et elle contribuera √† r√©pandre avec plus de facilit√© sur les immenses territoires de l' Am√©rique du Sud les progr√®s dus aux- efforts du g√©nie humain.D' autre pari, le congr√®s postal dont nous analysons les travaux n' a pas bli√© les services rendus jpar la presse dans le domaine de l' √©ducation - des peuples.Il a cru devoir √©galement favoriser la diffusion des id√©es, dans le louable d√©sir d' acc√©l√©rer - l' av√®nement de la turit√© - politique des citoyens.C' est pourquoi la franchise de.port est accord√©e aux √©diteurs de journaux quotidiens et de publi

In [86]:
embedding = docs[0]['gte_multi_v768']

embedding[:3]

[-0.039117645, 0.062711135, -0.060027212]

In [87]:
result = impresso.experiments.execute(
    experiment_id="subdoc-embeddings",
    body={
        "solrPayload": {
            "query": "{!knn f=gte_multi_v768 topK=3}" + str(embedding),
            "limit": 3,
            "params": {
                "fq": "type_s:c",   # type_s:c restricts search to chunks! (c=chunk)
                # "fl": "id,score,content_txt_fr,ci_id_s", -- add these later if you want to return only specific fields
                # for now let's return everything
                "hl": False
            }
        }
    }
)

docs = result["solrResponse"]["response"]["docs"]
print(f"‚úÖ Got {len(docs)} Solr document(s)\n")

‚úÖ Got 3 Solr document(s)



In [88]:
for i, d in enumerate(docs):
    print(f"--- Result {i} ---")
    print(d.get("content_txt_fr", "[No text]"))
    print(d.get("content_txt_de", "[No text]"))

--- Result 0 ---
[No text]
gskosten Briefbef√∂id ¬´.auf die eigentliche Bef√∂rderung zwei Drittel aber auf den den Schalterdienst ¬´ nd Bestelldienst fallen.Der gro√üe Gedanke macht ¬´ Schule : er iibec ¬ª schritt die Grenzen des Landes ud n versuchte sich auch auf internationalem Boden.Nebst dem Weltpostverein, der uns im Jahre 1875 das internationale Einheitsporto f√ºr Briefe brachte, entstanden eine Anzahl Sonderpost der.eine, die Haupts√§chlich den Zweck hatten, d ¬´ Inlandsporto der Vertragsstaaten auch auf ihren ieg ¬´ nseitiaen Auhenveikehr auszudehnen.Selbst uor dem Meere macht ¬´ das Einheitsporlo ¬´ ich ¬ª Halt.Im Jahre 1898 Antrag f√ºhrte England auf den seines Abgeordneten Henniker-Heaton im Verkehr mit seinen Kolonien das Penny.Porto ein ; Deutschland folgte diesem Beispiele 1N99. Seit 19N8 19l ) 9 verkehrt Gro√übritannien un√∂ seit auch Deutschland mit den Vereinigten Staaten von Amerika zum Inlandssahe.Begehren Auch in der Schweiz wurden schon laut, √∂fters die √§hnliche

## Entities profiles experiment

In [92]:
result = impresso.experiments.execute(
  experiment_id="entity-profiles",
  body={
    "solrPayload": {
      "query": "wiki_url_s:*Simone*de*Beauvoir*",
      "limit": 1,
      "params": {
        "hl": False
      }
    }
  }
)
entity_doc = result['solrResponse']['response']['docs'][0]

list(entity_doc.keys())

['id',
 'imp_ids_ss',
 'surfaces_ss',
 'ci_ids_ss',
 'mention_keys_ss',
 'ci_lg_s',
 'wiki_masterlabel_s',
 'wiki_url_s',
 'date_of_birth_dt',
 'date_of_death_dt',
 'wkd_occupations_ss',
 'wkd_occupation_qids_ss',
 'wkd_entity_types_ss',
 'wiki_summaries_t',
 'contexts_ss',
 'entity_mixed_emb_v768',
 'entity_encyc_emb_v768',
 'entity_media_emb_v768',
 '_version_',
 '_root_']

In [93]:
entity_doc['wkd_occupations_ss']

['political philosopher',
 'journalist',
 'novelist',
 'autobiographer',
 'essayist',
 'political activist',
 'diarist',
 'women letter writer',
 'philosopher',
 'literary critic',
 'writer',
 'author',
 'feminist',
 'philosophy teacher']

In [98]:
result = impresso.experiments.execute(
  experiment_id="entity-profiles",
  body={
    "solrPayload": {
      "query": "{!knn f=entity_mixed_emb_v768 topK=5}" + str(entity_doc['entity_mixed_emb_v768']),
      "filter": [
          f"-id:{einstein_doc['id']}" # exclude target entity itself
      ],
      "limit": 5,
      "params": {
        "hl": False
      }
    }
  }
)
docs = result['solrResponse']['response']['docs']
for doc in docs:
    print(doc['wiki_url_s'])

https://de.wikipedia.org/wiki/Simone_de_Beauvoir
https://fr.wikipedia.org/wiki/H√©l√®ne_de_Beauvoir
https://fr.wikipedia.org/wiki/Jean_Beauvoir
https://fr.wikipedia.org/wiki/Simone_Chalon
https://fr.wikipedia.org/wiki/Sylvia_Earle
