# Exploring Impresso Tools (NER, NEL, article embeddings)  


This notebook demonstrates the core tools integrated into the Impresso application, focusing on the components that power text understanding, entity recognition, and retrieval across the corpus.

These are production-level tools that are a permanent part of the Impresso infrastructure.

Specifically, we cover three major components:

* **Named Entity Recognition (NER)** – identifying and classifying named entities (e.g., people, places, organizations) in historical newspaper text using the [impresso-project/ner-stacked-bert-multilingual](https://huggingface.co/impresso-project/ner-stacked-bert-multilingual) model.
* **Named Entity Linking (NEL)** – resolving recognized entities to canonical entries in knowledge bases such as Wikidata, using the [impresso-project/nel-mgenre-multilingual](https://huggingface.co/impresso-project/nel-mgenre-multilingual) model.
* **Article Embeddings** – generating embeddings of full articles using [gte-multilingual-base](https://huggingface.co/Alibaba-NLP/gte-multilingual-base) to enable semantic search with:
    - **In-corpus queries** – selecting a query directly from the *Impresso* corpus.  
    - **Out-of-corpus queries** – embedding an external query (e.g., manually formulated or from another source).  


In [None]:
from impresso import connect

impresso = connect()

### Named entity recognition


In [2]:
text = """
    Jean-Baptiste Nicolas Robert Schuman ( 
    29 June 1886 – 4 September 1963) was a Luxembourg-born French 
    statesman. Schuman was a Christian democratic (Popular 
    Republican Movement) political thinker and activist. 
    """
result = impresso.tools.ner(
    text=text
)
result

id,type,surfaceForm,function,name,confidence.ner,offset.start,offset.end,wikidata.id,wikidata.wikipediaPageName
2:41:pers:ner-stacked-2-bert-medium-historic-multilingual|ner-mgenre-multilingual,pers,Jean-Baptiste Nicolas Robert Schuman,,Baptiste Nicolas Robert Schuman,91.25,2,41,,
46:80:time:ner-stacked-2-bert-medium-historic-multilingual|ner-mgenre-multilingual,time,29 June 1886 – 4 September 1963,,,77.9,46,80,,
88:98:org:ner-stacked-2-bert-medium-historic-multilingual|ner-mgenre-multilingual,org,Luxembourg,,,25.12,88,98,,
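The `confidence.ner` column can be used to discard uncertain detections, such as the `Luxembourg` mention tagged `org` with a score of 25.12 above. A minimal sketch, assuming the rows have been copied into plain dictionaries; the `filter_by_confidence` helper and the threshold of 50 are illustrative choices and not part of the Impresso library:

```python
# Rows copied by hand from the NER output above (only the columns needed here).
entities = [
    {"surfaceForm": "Jean-Baptiste Nicolas Robert Schuman", "type": "pers", "confidence.ner": 91.25},
    {"surfaceForm": "29 June 1886 – 4 September 1963", "type": "time", "confidence.ner": 77.9},
    {"surfaceForm": "Luxembourg", "type": "org", "confidence.ner": 25.12},
]


def filter_by_confidence(rows, threshold=50.0):
    """Keep rows whose NER confidence is at or above the threshold (hypothetical helper)."""
    return [row for row in rows if row["confidence.ner"] >= threshold]


confident = filter_by_confidence(entities)
for row in confident:
    print(row["type"], row["surfaceForm"], row["confidence.ner"])
```

A suitable threshold depends on the corpus and the task, so it is worth inspecting a sample of discarded rows before fixing a value.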


### Named entity linking

For the system to know which entity to link, surround it with the markers `[START]` and `[END]`, leaving a space between the entity and each marker.

In [3]:
text = """
    [START] Jean-Baptiste Nicolas Robert Schuman [END] ( 
    29 June 1886 – 4 September 1963) was a Luxembourg-born French 
    statesman. Schuman was a Christian democratic (Popular 
    Republican Movement) political thinker and activist. 
    """

result = impresso.tools.nel(
    text=text,
)
result

id,type,confidence.nel,wikidata.id,wikidata.wikipediaPageName,wikidata.wikipediaPageUrl
,unk,99.93,Q15981,Robert Schuman,https://en.wikipedia.org/wiki/Robert_Schuman
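The marker convention described above can also be applied programmatically. The sketch below wraps a character span in `[START]` and `[END]` with the required spaces; `mark_entity` is a hypothetical helper, not an Impresso function, and the offsets are supplied by hand:

```python
def mark_entity(text, start, end):
    """Wrap text[start:end] in [START] ... [END], with a space on each side of the entity."""
    if not 0 <= start < end <= len(text):
        raise ValueError("offsets must satisfy 0 <= start < end <= len(text)")
    return f"{text[:start]}[START] {text[start:end]} [END]{text[end:]}"


sentence = "Robert Schuman was a Luxembourg-born French statesman."
marked = mark_entity(sentence, 0, len("Robert Schuman"))
print(marked)
```

If the offsets come from `impresso.tools.ner`, check them against the input first: in the NER output above, the span 2 to 41 is longer than the 36-character surface form, so the offsets may not index the raw string directly.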


### Named entity recognition and linking

The `ner_nel` method performs entity recognition and entity linking in a single call.

In [4]:
text = """
    Jean-Baptiste Nicolas Robert Schuman ( 
    29 June 1886 – 4 September 1963) was a Luxembourg-born French 
    statesman. Schuman was a Christian democratic (Popular 
    Republican Movement) political thinker and activist. 
    """
result = impresso.tools.ner_nel(
    text=text,
)
result

id,type,surfaceForm,function,name,confidence.ner,confidence.nel,offset.start,offset.end,wikidata.id,wikidata.wikipediaPageName,wikidata.wikipediaPageUrl
2:41:pers:nel-mgenre-multilingual,pers,Jean-Baptiste Nicolas Robert Schuman,,Baptiste Nicolas Robert Schuman,91.25,96.76,2,41,Q15981,Robert Schuman,https://en.wikipedia.org/wiki/Robert_Schuman
46:80:time:nel-mgenre-multilingual,time,29 June 1886 – 4 September 1963,,,77.9,86.46,46,80,Q15981,Robert Schuman,https://en.wikipedia.org/wiki/Robert_Schuman
88:98:org:nel-mgenre-multilingual,org,Luxembourg,,,25.12,100.0,88,98,Q32,Luxembourg,https://en.wikipedia.org/wiki/Luxembourg
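Grouping the linked mentions by Wikidata ID makes it easier to review what each identifier was attached to. In the output above, the date range is linked to Q15981 (Robert Schuman) even though it is a `time` mention, so links for non-person, non-place types deserve a second look. A minimal sketch on rows copied by hand from that output; `group_by_wikidata_id` is a hypothetical helper:

```python
from collections import defaultdict

# Rows copied by hand from the ner_nel output above (only the columns needed here).
linked = [
    {"surfaceForm": "Jean-Baptiste Nicolas Robert Schuman", "type": "pers", "wikidata.id": "Q15981"},
    {"surfaceForm": "29 June 1886 – 4 September 1963", "type": "time", "wikidata.id": "Q15981"},
    {"surfaceForm": "Luxembourg", "type": "org", "wikidata.id": "Q32"},
]


def group_by_wikidata_id(rows):
    """Map each Wikidata ID to the list of (type, surface form) pairs linked to it."""
    groups = defaultdict(list)
    for row in rows:
        groups[row["wikidata.id"]].append((row["type"], row["surfaceForm"]))
    return dict(groups)


groups = group_by_wikidata_id(linked)
for qid, mentions in groups.items():
    print(qid, mentions)
```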


### Article embeddings

All content items of type _article_ in the Impresso data that are longer than 800 characters were embedded using [gte-multilingual-base](https://huggingface.co/Alibaba-NLP/gte-multilingual-base), the latest model in the GTE (General Text Embedding) family, which has strong multilingual capabilities.

In [5]:
text = """
    Jean-Baptiste Nicolas Robert Schuman ( 
    29 June 1886 – 4 September 1963) was a Luxembourg-born French 
    statesman. Schuman was a Christian democratic (Popular 
    Republican Movement) political thinker and activist. 
    """

embedding = impresso.tools.embed_text(text=text, target="text")
embedding

'gte-768:Fao1vWPioj3Msv+7LwqfPLicTb2kPgO+pkpAPVhbkL1cdi891oY9vYv1dDzSHTC9dJ0MvYGTZT3d87C90kt6PZ81gT0pg4c7x6XRPdfOC70xllU9WJrxO/pBwDwDI1C8YveXvZ3gyrzDjek9FkkhvXFAoL2LjYo9W0XwvKLdFLsg10C7QGhjPcAUHD1Z4YC8yFZhPbxhnToLYza9cFl8vMkdYTw0DZG8Mr10vX8zIjo/ZTs8AYC1PGyAn70CFaS7PAyNvNqxPT0AUyC9eiAyPXVJo7y1OGM9r/8TPdFwez1HIaS9lXUHvGb4Aj4N9oK9wTlNvQBmOrtQG9E7WqmPvUjIYLx6jZa8hOEFPV6vkLzIEoM9latiPej5i7s72xI8NmLBurlZuDzdEyw8gD7mPcUoQLytEUk6CfLAPfnsOD1UKy29vVunvBrRozzb8jU9op9OPJnZqTyJeE+8bchbvT3uhLxL2WU9toKuPfffBrwuxZk8H2PLvG6TWby+ei09OKuLvAUJTL0mvh08oOuDvYTpiTsBNQ47CNTAPB3frbzvPi88CUQ9vcza/LwKvRs91lUCPPlTCjx/bH+9dEkovK43K70N5bC8+BMxvcgJBz1vxoo9JQygvHvAkz3VDry8nWWzuwuSJ72RL+a8lJP5vJPTur1O3uG8Ph27vGo/LL0q7ew8mvK9vBMQmD3kaIU93Lg1PejcIb1ffAg9y4wuvDBo2LxlEzc9SWfWunvsKb228fy8F7iKPBLTSL2OFiO9T4+JPOr5jzuas5K8grgMuy6EQzzGCF69W2DZPKX7JT3C/908EIxYvaKCMrwC3e28KIJJvLBgGj0P+Im8LMJuvGClrT2C4W67DOx6u/g0rbummEI8UquLPKSzvrwj1nk8OyIoPXR3ILxpzxy98WElPW2ejDxP2jK8vsGBvKCdBD1QD/K8sieVPc6Cjr0mm6k7U8SCvEEc9LyS4Iq9STFGvGc7uLsB0bg

#### Search with an out-of-corpus embedding from `tools.embed_text`

In [6]:
# Embed the external text and use it as the query; return at most 10 results.
result = impresso.search.find(
    embedding=impresso.tools.embed_text(text=text, target="text"),
    limit=10,
)

result

uid,copyrightStatus,type,sourceMedium,title,topics,transcriptLength,totalPages,languageCode,isOnFrontPage,publicationDate,...,pageNumbers,collectionUids,entities.locations,entities.persons,entities.organisations,entities.newsAgencies,mentions.locations,mentions.persons,mentions.organisations,mentions.newsAgencies
EXP-1951-11-23-a-i0001,in_cpy,ar,print,M. ADENAUER A PARIS,"[{'uid': 'tm-fr-all-v2.0_tp87_fr', 'relevance'...",621,1,fr,True,1951-11-23T00:00:00+00:00,...,[1],[],"[{'uid': '2-54-Paris', 'count': 2}, {'uid': '2...","[{'uid': '2-50-Aristide_Briand', 'count': 1}, ...",[],[],"[{'surfaceForm': 'PARIS', 'mentionConfidence':...","[{'surfaceForm': 'Schuman', 'mentionConfidence...",[],[]
GDL-1998-02-11-a-i0031,in_cpy,ar,print,Une voix s'est tue: celle de Maurice Schumann,"[{'uid': 'tm-fr-all-v2.0_tp55_fr', 'relevance'...",283,1,fr,False,1998-02-11T00:00:00+00:00,...,[5],[],"[{'uid': '2-54-Londres', 'count': 1}, {'uid': ...","[{'uid': '2-50-René_Payot', 'count': 1}]",[],[],"[{'surfaceForm': 'Londres', 'mentionConfidence...","[{'surfaceForm': 'René Payot', 'mentionConfide...",[],[]
IMP-1949-01-20-a-i0001,in_cpy,ar,print,Les voyages de M. Robert Schuman,"[{'uid': 'tm-fr-all-v2.0_tp29_fr', 'relevance'...",643,1,fr,True,1949-01-20T00:00:00+00:00,...,[1],[],,,,,,,,


In [7]:
result.df.loc['EXP-1951-11-23-a-i0001']

copyrightStatus                                                      in_cpy
type                                                                     ar
sourceMedium                                                          print
title                                                   M. ADENAUER A PARIS
topics                    [{'uid': 'tm-fr-all-v2.0_tp87_fr', 'relevance'...
transcriptLength                                                        621
totalPages                                                                1
languageCode                                                             fr
isOnFrontPage                                                          True
publicationDate                                   1951-11-23T00:00:00+00:00
issueUid                                                   EXP-1951-11-23-a
countryCode                                                              CH
providerCode                                                            SNL
mediaUid    

In [8]:
impresso.content_items.get_embeddings("EXP-1951-11-23-a-i0001")[0]

'gte-768:pRhKvXXfWz0f+gq9dFx+PZVbvb04PCC9mMWnOj6kQrwie6A8JTVevWjFYbxStA48ECQLvZTBtD2Qece8ramIPSPfTz2Pm8Q6GAS5PTq0q7umh0s9hTmdPRPpmjyLlAK8mz0zvJFJKb2Ra6Y9drPwvdbBnb1stXg9cP1lvVgzXLuFOLk8YzDwPN7XpD3gqXG8XJJ0PeEK0jwSC5g6TGOXPIP51TuAIGq8HGPRu3p0TTxk1co8fjaOu6fJfbyBgUq8LW1iPARMQ7uD+dW7ExdxPbT5m7rYnHQ8jwpGPSW6Azzpnfu9tqpPPRh9KD6/fFy9hIeFvHqhHLtONsi8igZTO1X/SrxB+lC9EF0zPebY67y75Js9468sPd4au7yx8+A8DxuBO67497zAeQ29XzQAPeOCXbyrxOY8LpnNPaY6RzyNe4+924uEPG7/Lb2UR+E8BzxaPK0a9byFcWG7aJeLOycyDzzN+Jo9/9hOPc5xCj197Ng7Ky5/vR7Pgzz+zfw8CH0FvSpwjr337U48AEUBvbiOjbpHS0g9q5YQPa0a9TyA3O+8mnbbvK6/z7yX3ho93JfdvPlZgbxTRKm9nEY9vSUeM73ucxi9JqTfvMuk9zwteDQ93tbAvG+kiLsPfzA97Admu5pqgrxKDQm9zdeBvK1Rsr2nslK8ySCTu6jHkr0JnTo9fxSRPMdFXz04avY8+XqaPUv35Lzcl908KZR2vfJBMr10tFQ9O1kGO+nTVLz0Sjy9ZXqlu/3PRL3i0am8OsvWvN4DED1dyM28VegfvGBAWT1Mvrw7KcFFvDW7Cj2/hy49j5tEvHz3qjxJc4C8xcoEvT9V9jzmzBI6H9gNvTFznT0kFSk7a/eHPVVDRTvGvzI98kEyPMCbCjxOHx08T+d7PWASg7zHtOC8DwVdPNpBzzyYxSe5ml+wvGWoezmwN1s8Twn5PIQ6Ab1ASR06PL01PbHFCr0MsLK9x3u4vDMhgj0SIsO

#### Search with an in-corpus embedding from `content_items.get_embeddings`

In [9]:
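# Use the stored embedding of an existing article as the query,
# restricted to country "FR"; return at most 2 results.
result = impresso.search.find(
    country="FR",
    embedding=impresso.content_items.get_embeddings("EXP-1951-11-23-a-i0001")[0],
    limit=2,
)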

result

uid,copyrightStatus,type,sourceMedium,title,topics,transcriptLength,totalPages,languageCode,isOnFrontPage,publicationDate,...,pageNumbers,collectionUids,entities.locations,entities.persons,entities.organisations,entities.newsAgencies,mentions.locations,mentions.persons,mentions.organisations,mentions.newsAgencies
oeuvre-1939-06-06-a-i0060,in_cpy,ar,print,Un important article de la « Gazette de Cologn...,"[{'uid': 'tm-fr-all-v2.0_tp29_fr', 'relevance'...",593,1,fr,False,1939-06-06T00:00:00+00:00,...,[4],[],"[{'uid': '2-54-Berlin', 'count': 1}, {'uid': '...","[{'uid': '2-50-Georges_Clemenceau', 'count': 1...",[],[],"[{'surfaceForm': 'Berlin', 'mentionConfidence'...","[{'surfaceForm': 'Clemenceau', 'mentionConfide...","[{'surfaceForm': 'Cittorio', 'mentionConfidenc...",[]
lematin-1911-09-20-a-i0041,pbl,ar,print,JOURNAUX DE FRANCE ET DE L'ÉTRANGER,"[{'uid': 'tm-fr-all-v2.0_tp85_fr', 'relevance'...",141,1,fr,False,1911-09-20T00:00:00+00:00,...,[3],[],,,,,,,,
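To compare two embeddings locally, a similarity measure is needed. The notebook does not state which measure the search backend uses; cosine similarity is a common choice for dense text embeddings and is used here purely as an illustration. The toy vectors below stand in for real ones, which could be obtained with `embedding_to_vector` (shown under "Other tools"):

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)


# Toy 3-dimensional vectors (assumed values, for illustration only).
query = [0.1, 0.3, -0.2]
same_direction = [0.2, 0.6, -0.4]  # query scaled by 2
orthogonal = [0.3, -0.1, 0.0]      # dot product with query is 0

print(cosine_similarity(query, same_direction))
print(cosine_similarity(query, orthogonal))
```

Vectors pointing in the same direction score close to 1 and orthogonal vectors score close to 0, regardless of their length.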


### Other tools

#### Convert embedding to an array of floats and back

In [14]:
original_embedding = impresso.content_items.get_embeddings("lepetitparisien-1928-05-07-a-i0047")[0]
original_embedding

'gte-768:8gU7vdWAjz3oWoK9o8o+PUHElL3dkT+9zky+vFKAfT34Ejg9S8Y/vSqUE71SgSE98ArSvMs/AT4x0a68X4zWPXoJnD2/+Rs7OJ/IPS4awLz4VjI9nRS0PedVKz3oMy69kP8Qve/ujzyU9L49KpSTvVwUS73ilTI9rkowvTgcaz11fhg9Q++bPUEzlj3ZLRC8UXdzPQKYwDw1x4C7CEn0PBl2pju5K4K886TavJmmFr2Zg/U8IMFiOy66g7zq/dS8QUG3PFI0Hb3fHSe8PIE+PYrQljy7ZKq7piDNPNNeUjwkbuO9Byf3PFWAPT7+w2u9pQNnvHniR7tFl0W8Qh86PDd9S7wGjW482+mVPCM6krw0AY09J6xiPSQXcb1gatm8fbX4O6JNnLtH0ZE7IT8pPccBAr0wFSm8x5KAPUfy6jyCKRE8ep5NPIPCdTwWsHI9EkzDOu4x5rySWja8q1RevCF9aDzkX1k9uSpePNbW3btIQBM9ez1tu+t7GzypaDo993ivvLvEZr37rEC8jUf+u0lq9rrQheY8MXBOu07ejjq+G5k8fracvACdV7wuWP88zKQUvQ230TgY6r69iz8Yvc7dPL0YyEG9AtejPJBRrD22msM9f3F+vN3CgbwTRyw9HO6xvPSfQ7wQSJC91Ll3vJaOx73uMWa8VKyoPPD8sL3Mc9I8ndonvanXOz12+zo9es8PPTSqmrs7NLo83IL6vIi4x7w7lZo9tXSTPCwf1zywO6u9S0mdPAsFur2/GvU6GJd/vALXIz3BVMG8C95lvDzAIT0o60U8oAqGvRgbATzG43c9hMOZvDc/jDrCjg27o8q+vOm/FTz0YQQ7CYPAvDFwTj3K2ck6g8L1PFTrC73zgt06L7nfPF7fFT0zjbS7OfuRPBJMwzu6R8S8PY/fPH702zw2Q/+7cDvCvDyBPj0W4TS7TKTCPGjSe70+6yg85+apvPWHtL37bV297hANPVw2yD0g8iS

In [15]:
from impresso.util.embeddings import embedding_to_vector, vector_to_embedding

array = embedding_to_vector(original_embedding)
array[:10]  # show first 10 values

[-0.04565996676683426,
 0.07006994634866714,
 -0.06364995241165161,
 0.04657996818423271,
 -0.07263994961977005,
 -0.046769965440034866,
 -0.02322998270392418,
 0.061889953911304474,
 0.04493996500968933,
 -0.04681996628642082]

In [16]:
recreated_embedding = vector_to_embedding(array, 'gte-768')
recreated_embedding

'gte-768:8gU7vdWAjz3oWoK9o8o+PUHElL3dkT+9zky+vFKAfT34Ejg9S8Y/vSqUE71SgSE98ArSvMs/AT4x0a68X4zWPXoJnD2/+Rs7OJ/IPS4awLz4VjI9nRS0PedVKz3oMy69kP8Qve/ujzyU9L49KpSTvVwUS73ilTI9rkowvTgcaz11fhg9Q++bPUEzlj3ZLRC8UXdzPQKYwDw1x4C7CEn0PBl2pju5K4K886TavJmmFr2Zg/U8IMFiOy66g7zq/dS8QUG3PFI0Hb3fHSe8PIE+PYrQljy7ZKq7piDNPNNeUjwkbuO9Byf3PFWAPT7+w2u9pQNnvHniR7tFl0W8Qh86PDd9S7wGjW482+mVPCM6krw0AY09J6xiPSQXcb1gatm8fbX4O6JNnLtH0ZE7IT8pPccBAr0wFSm8x5KAPUfy6jyCKRE8ep5NPIPCdTwWsHI9EkzDOu4x5rySWja8q1RevCF9aDzkX1k9uSpePNbW3btIQBM9ez1tu+t7GzypaDo993ivvLvEZr37rEC8jUf+u0lq9rrQheY8MXBOu07ejjq+G5k8fracvACdV7wuWP88zKQUvQ230TgY6r69iz8Yvc7dPL0YyEG9AtejPJBRrD22msM9f3F+vN3CgbwTRyw9HO6xvPSfQ7wQSJC91Ll3vJaOx73uMWa8VKyoPPD8sL3Mc9I8ndonvanXOz12+zo9es8PPTSqmrs7NLo83IL6vIi4x7w7lZo9tXSTPCwf1zywO6u9S0mdPAsFur2/GvU6GJd/vALXIz3BVMG8C95lvDzAIT0o60U8oAqGvRgbATzG43c9hMOZvDc/jDrCjg27o8q+vOm/FTz0YQQ7CYPAvDFwTj3K2ck6g8L1PFTrC73zgt06L7nfPF7fFT0zjbS7OfuRPBJMwzu6R8S8PY/fPH702zw2Q/+7cDvCvDyBPj0W4TS7TKTCPGjSe70+6yg85+apvPWHtL37bV297hANPVw2yD0g8iS

In [17]:
f"Original and recreated embeddings are the same: {original_embedding == recreated_embedding}"

'Original and recreated embeddings are the same: True'
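The outputs above suggest that an embedding string is a model label, a colon, and a base64 payload of little-endian 32-bit floats. This layout is inferred from the values shown in this notebook, not from the library documentation, so treat the sketch below as an illustration and prefer `embedding_to_vector` in practice. `decode_embedding` is a hypothetical helper, and the input is only the first 16 base64 characters of the embedding shown above (12 bytes, that is, three floats):

```python
import base64
import struct


def decode_embedding(embedding):
    """Split '<model>:<base64>' and unpack the payload as little-endian float32 values.

    Assumed layout, inferred from the notebook outputs (hypothetical helper).
    """
    model, _, payload = embedding.partition(":")
    raw = base64.b64decode(payload)
    count = len(raw) // 4  # 4 bytes per float32
    values = list(struct.unpack(f"<{count}f", raw[: count * 4]))
    return model, values


model, values = decode_embedding("gte-768:8gU7vdWAjz3oWoK9")
print(model)
print(values)
```

The three decoded numbers can be compared with the first three entries of `array[:10]` printed earlier.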

#### Embed an image with `tools.embed_image`

In [19]:
impresso.tools.embed_image("https://impresso-project.ch/assets/images/posts/rep-thomas.png", target="multimodal")

'openclip-768:/aKPPipmmT7ziZ0/+jaWPqb4lb5up0w/pYmevSLlvz5p4TW9XbhBv3uSO7+PATE+6IP3Ppwugb4e6cU+GDAwvxzPLD/ddM4+vOPyPkfXhD4O6kW+sNy4PRitcD9NsP8+4Marvfd2zL4ao5W+qgGaPomXBT/XNxG/RN8LP++jhr8U/kK/WCjjPuqsPj+5nfm+byGPvpGSL7/JmJ69ReE1P1Ymojypa9W+iN6/v2qMAD7EYRu/IBDDPVoOFL/YSHW/jljfPrBmMj9kQOu+duvivl4Scb8Q120/bMMxPtwLvL63tu49xDiJPHekFT/hLCW/4a5iv9TBhr482H++Jj6NPkLkhr6+XiQ/w2caveqZmD4wGQG/JKOxPpYZCz//JVa/mGMoP7CVrb6SMbe9cssZPwXSuD/3NGK/RRvYvWeeib4r7Po9FlfsPUAHJT/MZ6w/HI+fPs7yL780EJK+Kf3tvgUtxb60Gdo+HmIfPvAg7Tx6sL4+lku2P00+xT5fF8k+1ABAvuLUlj6H+Vg/L8TvvrxA1b4dnkw/T+oAPwV4Bz+ndBG/iPGrPzAgsL6tjve+bxvhPcivr78RMXe/fqrzvgRYtz5J6CW/aij8vQPKgb57ewu//I75vfD2UT44VLK+0iEdPlo4yz8HQXs/uqeUP2zWDT65m2G/MNTjvUzhCD/bZuk+gFh0PuFsabzAHSC+ewifPwikgz5iPzc+Dq+dvTRjlT/yI8M+PpK2v2/8Ib3bteG+eYIzvSzmjL/+GjG/tCofvi7gdD7c5am+RpGsPj6il77sJtM+LaZOQDlDGb5bUxk/2UIuvyxn4r40VVK/Upc6vrTzzb5WPo09nRiGPfuqLr8BhSu9dh4nP2rVgD8biU+/jMJVPsO5o741cUS+C+UNPVC+Yr9D4uC+WpTUvTRwE79uMDa/WWbDvUaAP7+uC2W+J834PeCKCr8shTA+xgJjv+qEGr/fRJW++CoYv+ybjz