# From Text to Triples

In this notebook we draft a knowledge extraction process, using a museum artifact description as the example.


## 0. Preliminaries

In [1]:
import sys
sys.path.append("..")

In [None]:
from dotenv import load_dotenv
import os
from openai import OpenAI
from pydantic import BaseModel
import json

In [None]:
# TODO: put your OpenAI API key in a .env file (OPENAI_API_KEY=...) or paste it below
load_dotenv()
OPENAI_API_KEY = os.getenv("OPENAI_API_KEY", "XXXXXXXX")

In [None]:
client = OpenAI(api_key=OPENAI_API_KEY)

In [None]:
MODEL = "gpt-4o-mini"
#TEXT = "They marched from [Alexandria](LOCATION) through [Memphis](LOCATION) via the [Nile](LOCATION) to [Thebes](LOCATION)."


### Scenario Specification
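The scenario cell under this heading assigns `ARTIFACT_ID` and `TEXT` three times, so only the last assignment (Example 3) is active when the notebook runs top to bottom. A minimal sketch of a more explicit way to switch examples, assuming a hypothetical `EXAMPLES` dictionary and `select_example` helper (neither is part of the notebook); the descriptions are shortened excerpts of the records in that cell:

```python
# Hypothetical registry of examples; texts are shortened excerpts of the records.
EXAMPLES = {
    "Stela_of_Meh": "Stela of Meh, scribe of the treasury in the temple of Ramesses II",
    "The_Bride": "The Bride; the bride's mother on the right hands the young woman a robe and chain",
    "P_SL-5217-334": "Venus and Cupid, one of 425 drawings from the 1637 album",
}

NAMESPACE = "http://www.example.org/collection/"


def select_example(artifact_id):
    """Return (artifact_id, text, artifact_uri) for a registered example."""
    if artifact_id not in EXAMPLES:
        raise KeyError(f"Unknown artifact: {artifact_id}")
    return artifact_id, EXAMPLES[artifact_id], NAMESPACE + artifact_id


ARTIFACT_ID, TEXT, ARTIFACT_URI = select_example("P_SL-5217-334")
print(ARTIFACT_URI)
```

Selecting by ID keeps the choice of example in one place instead of relying on cell order.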

In [None]:
LABELS = "YEAR_RANGE, YEAR, PERSON, PLACE, AGE, CREATIVE_WORK"
NAMESPACE = "http://www.example.org/collection/"

### Example 1
ARTIFACT_ID = "Stela_of_Meh"
TEXT = """
Stela of Meh, scribe of the treasury in the temple of Ramesses II in the sanctuary of Re (in Heliopolis). Limestone. New Kingdom, 19th Dynasty, second half of the reign of Ramesses II-early reign of Merenptah, ca. 1240-1210 BC. Abydos (?).. Acquired befor"""

### Example 2
ARTIFACT_ID = "The_Bride"
TEXT = """
The Bride; the bride's mother on the right hands the young woman a robe and chain, while Death fastens a collar of bones around her neck; in architectural border, with two bearded old men as caryatides, each holding a large orb surmounted by a cross and string of jester's bells; at top a skull and hourglass and two small boys, and at bottom a skull with cross-bones and two small boys each with an hourglass; after Hans Holbein the Younger; the border from a  separate plate after Abraham van Diepenbeeck; first state.  1651 Etching, from two plates
"""

### Example 3
ARTIFACT_ID = "P_SL-5217-334"
TEXT = """
Venus and Cupid, one of 425 drawings from the 1637 album; Venus seated astride a log at l, her arms raised as if to threaten Cupid, standing before her at r Pen and brown ink, and grey-brown wash, over black chalk
"""

## 1. Named Entity Recognition (NER)
In this step, we look for entities in the collection record text.

In [None]:
prompt = """
You are an expert Information Extraction (IE) system with access to grounding tools.
Your task is to identify entities in the given text and use search engines and grounding to find accurate Wikidata IDs corresponding to them.

Instructions:
1. Identify entities in the text that match the provided labels
2. Only use the labels provided by the user: {labels}
3. Be precise and only annotate clear, unambiguous entities
4. BEWARE: an entity can span multiple words
5. BEWARE: you can associate an entity with one or many labels


ENTITIES FOUND:
For each entity, provide:
- entity: [entity text]
- Label: [entity type]
- Description: [description from grounding search results]
- As XSD date: [representation of the entity as one or many XSD dates, comma separated, if applicable]
- As XSD int: [representation of the entity as one or more XSD integers, comma separated, if applicable]

Only return the JSON output, nothing else. Do so with the following schema:

class Entity(BaseModel):
    entity_text: str
    label: str
    description: str
    xsd_date: str
    xsd_int: str
    sources: list[str]

entities: list[Entity]

Text to analyze:
{text}

"""

In [None]:
formatted_prompt = prompt.format(labels=LABELS, text=TEXT)

In [None]:
## Uncomment the below to see the prompt
#print(formatted_prompt)

In [None]:
# Invoke the OpenAI service with the web search tool enabled
response = client.responses.create(
    model="gpt-4o",
    tools=[{"type": "web_search"}],
    input=formatted_prompt,
)

output_text = response.output_text

In [None]:
## Uncomment the below to see the output
#print(output_text)

In [None]:
# Clean the model output and load it as structured JSON.
# Any text that follows the closing fence is returned as `sources`.
def parse_json_with_sources(text):
    if "```json" not in text:
        json_data, sources = text, ""
    else:
        json_data = text.split("```json", 1)[1]
        # partition() tolerates a missing closing fence and extra fences later on
        json_data, _, sources = json_data.partition("```")
    return json.loads(json_data), sources.strip()

json_output, sources = parse_json_with_sources(output_text)
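The model may answer with a fenced JSON block followed by free text, or with bare JSON. A minimal sketch of both reply shapes, assuming a hypothetical regex-based helper `split_reply` and hand-written sample replies (neither comes from an actual API call):

```python
import json
import re

# Build the fence marker programmatically to keep this example easy to embed.
TICKS = "`" * 3
FENCE = re.compile(TICKS + r"json\s*(.*?)" + TICKS, re.DOTALL)


def split_reply(reply):
    """Return (parsed_json, trailing_text) for a fenced or bare JSON reply."""
    match = FENCE.search(reply)
    if match is None:
        # No fence: assume the whole reply is JSON.
        return json.loads(reply), ""
    return json.loads(match.group(1)), reply[match.end():].strip()


# Hand-written sample replies in the two shapes.
fenced = (
    TICKS + 'json\n{"entities": [{"entity_text": "1637", "label": "YEAR"}]}\n'
    + TICKS + "\nSources: none"
)
bare = '{"entities": []}'

print(split_reply(fenced))
print(split_reply(bare))
```

A reply that is not valid JSON still raises `json.JSONDecodeError`, so callers may want to catch it.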


In [None]:
## Uncomment to see the json object
#print(json_output)
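The logged run of this notebook links a `Mercury` entity even though that word does not occur in the Example 3 description, so a cheap sanity check before linking may help. A minimal sketch, assuming hypothetical helpers `is_valid` and `check_entities` (not part of the notebook) and a hand-written payload in the shape the prompt requests:

```python
ALLOWED = {"YEAR_RANGE", "YEAR", "PERSON", "PLACE", "AGE", "CREATIVE_WORK"}
REQUIRED_KEYS = {"entity_text", "label"}


def is_valid(entity, text):
    """True if the entity has the required keys, known labels and a verbatim span."""
    if not REQUIRED_KEYS.issubset(entity):
        return False
    # The prompt allows several labels per entity; assume they are comma separated.
    labels = {part.strip() for part in entity["label"].split(",")}
    return labels <= ALLOWED and entity["entity_text"] in text


def check_entities(payload, text):
    """Split the payload's entities into (kept, rejected)."""
    kept, rejected = [], []
    for entity in payload.get("entities", []):
        (kept if is_valid(entity, text) else rejected).append(entity)
    return kept, rejected


text = "Venus and Cupid, one of 425 drawings from the 1637 album"
payload = {"entities": [
    {"entity_text": "1637", "label": "YEAR"},
    {"entity_text": "Venus", "label": "CREATIVE_WORK"},
    {"entity_text": "Mercury", "label": "CREATIVE_WORK"},
]}
kept, rejected = check_entities(payload, text)
print([e["entity_text"] for e in kept])
print([e["entity_text"] for e in rejected])
```

The verbatim-span test is deliberately strict; it would also reject entities the model has normalized or translated.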

In [None]:
# Now, for each entity found, if that is an entity of type XXXX, look it up on Wikidata

## 2. Entity Linking
We link each recognized entity to its Wikidata identifier.

In [None]:
wikidataLinkPrompt = """
Query the web to identify this entity in Wikidata:

{entity}

The entity is of type {label}.

It is within the context of the following text:

{text}

Only return the JSON output, nothing else. Do so with the following schema:

class Entity(BaseModel):
    entity_text: str
    type: str
    label: str
    wikidata_id: str
    sources: list[str]
"""

In [None]:
answers = []
for entity in json_output['entities']:
    wikiprompt = wikidataLinkPrompt.format(entity=entity['entity_text'], text=TEXT, label=entity['label'])
    print("Linking", entity['entity_text'] + " " + entity['label'] )
    response = client.responses.create(
        model="gpt-4o",
        tools=[{"type": "web_search",}],
        input=wikiprompt,
    )
    txt = response.output_text
    answers.append(txt)


Linking 1637 YEAR
Linking Venus CREATIVE_WORK
Linking Cupid CREATIVE_WORK
Linking Mercury CREATIVE_WORK


In [None]:
dicts = []
for answer in answers:
    #print("Answer:",answer)
    jsono, sources = parse_json_with_sources(answer)
    dicts.append(jsono)

In [None]:
## Uncomment to see the links
#json.dumps(dicts, indent=2)
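The `wikidata_id` field comes back as free text, so it may hold a bare ID, a prefixed ID, a URL, or a sentence saying nothing was found. A minimal sketch of normalizing it, assuming a hypothetical helper `normalize_qid` (not part of the notebook); the two IDs in the usage lines are the ones that appear in this notebook's logged output:

```python
import re

# A Wikidata item ID is "Q" followed by a positive integer; look for one at the end.
QID_AT_END = re.compile(r"(Q[1-9]\d*)$")


def normalize_qid(raw):
    """Return a bare QID from 'Q42', 'wd:Q42' or an entity URL, else None."""
    if not isinstance(raw, str):
        return None
    value = raw.strip()
    match = QID_AT_END.search(value)
    if match is None:
        return None
    prefix = value[: match.start()]
    # Accept an empty prefix or one that ends in a separator (':' or '/').
    if prefix and prefix[-1] not in ":/":
        return None
    return match.group(1)


print(normalize_qid("Q29421989"))
print(normalize_qid("https://www.wikidata.org/wiki/Q577"))
print(normalize_qid("No corresponding entity"))
```

This only checks the shape of the ID; whether the item is the right match still needs a human or a lookup against Wikidata.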

## 3. Relations
We generate a possible relationship from the described artifact to each of the retrieved entities.

In [None]:
## Find relations between the museum item and each one of the entities
relationshipsPrompt = """
Considering the following description of a museum artifact:

{text}

retrieve the relationship between such artifact and the following entity, mentioned in the description:

{entity}

Express the relationship with a maximum of three words.
Make sure that the museum artifact is the subject of the relationship.
Ignore relationships between other entities in the text.
Only return the JSON output, nothing else. Do so with the following schema:

class Entity(BaseModel):
    entity_text: str
    relationship: str
    subject_of_relationship: str
    object_of_relationship: str
    description: str
    sources: list[str]
"""

In [None]:
relations = []
for entity in json_output['entities']:
    wikiprompt = relationshipsPrompt.format(entity=entity['entity_text'], text=TEXT)
    print("Getting relationship to", entity['entity_text'] + " " + entity['label'] )
    response = client.responses.create(
        model="gpt-4o",
        tools=[{"type": "web_search",}],
        input=wikiprompt,
    )
    txt = response.output_text
    relations.append(txt)


Getting relationship to 1637 YEAR
Getting relationship to Venus CREATIVE_WORK
Getting relationship to Cupid CREATIVE_WORK
Getting relationship to Mercury CREATIVE_WORK


In [None]:
#print(relations)
reldicts = []
for rel in relations:
    #print("Answer:",answer)
    jsono, sources = parse_json_with_sources(rel)
    reldicts.append(jsono)

In [None]:
## Uncomment to inspect the content
#reldicts
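The prompt asks for a relationship of at most three words whose object is the requested entity, but nothing enforces this on the reply. A minimal sketch of a check, assuming a hypothetical helper `relation_problems` (not part of the notebook); the second record is invented to show a failing case:

```python
MAX_WORDS = 3  # the prompt asks for a maximum of three words


def relation_problems(record):
    """Return a list of human-readable problems found in one relation record."""
    problems = []
    relationship = record.get("relationship", "").strip()
    if not relationship:
        problems.append("empty relationship")
    elif len(relationship.split()) > MAX_WORDS:
        problems.append("relationship longer than three words")
    # Assumption: the object of the relationship should be the entity we asked about.
    if record.get("object_of_relationship") != record.get("entity_text"):
        problems.append("object is not the requested entity")
    return problems


good = {"entity_text": "Venus", "relationship": "depicts",
        "object_of_relationship": "Venus"}
# Invented record that breaks both rules.
bad = {"entity_text": "Cupid", "relationship": "shows a scene with",
       "object_of_relationship": "Venus"}

print(relation_problems(good))
print(relation_problems(bad))
```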

## 4. Relation Linking
We link the identified relations to Wikidata properties.

In [None]:
## Find the Wikidata property that corresponds to each relationship
linkWikidataPropertyPrompt = """
Considering the following description of a relationship:

{relationship}

retrieve the WikiData property ID for this relationship.

Only return the JSON output, nothing else. Do so with the following schema:

class Entity(BaseModel):
    relationship_text: str
    wikidata_property_id: str
    sources: list[str]
"""

In [None]:
property_ids = []
for relation in reldicts:
    wikiprompt = linkWikidataPropertyPrompt.format(relationship=relation['relationship'])
    print("Getting PID of ", relation['relationship'] )
    response = client.responses.create(
        model="gpt-4o",
        tools=[{"type": "web_search",}],
        input=wikiprompt,
    )
    txt = response.output_text
    property_ids.append(txt)


Getting PID of  from album
Getting PID of  depicts
Getting PID of  depicts
Getting PID of  not mentioned
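The log above shows `depicts` being looked up twice. A minimal sketch that resolves each distinct label only once, assuming a hypothetical `resolve_properties` helper and a `fake_lookup` stand-in for the model call (neither is part of the notebook); the two property IDs are the ones reported in this notebook's logged run:

```python
def resolve_properties(relationships, lookup):
    """Call lookup() once per distinct relationship label, in first-seen order."""
    cache = {}
    for label in relationships:
        if label not in cache:
            cache[label] = lookup(label)
    return cache


calls = []


def fake_lookup(label):
    # Stand-in for the API call; records each invocation so we can count them.
    calls.append(label)
    return {"from album": "P658", "depicts": "P180"}.get(label)


labels = ["from album", "depicts", "depicts", "not mentioned"]
pid_by_label = resolve_properties(labels, fake_lookup)
print(pid_by_label)
print(len(calls))
```

Caching by label also guarantees that two records with the same relationship text receive the same property ID.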


In [None]:
#print(relations)
pids = []
for rel in property_ids:
    print("Answer:",rel)
    try:
        jsono, sources = parse_json_with_sources(rel)
        pids.append(jsono)
    except ValueError:
        # json.JSONDecodeError is a ValueError: skip replies that are not valid JSON
        print("Error parsing JSON", rel)
pids

Answer: ```json
{
  "relationship_text": "from album",
  "wikidata_property_id": "P658",
  "sources": [
    "EntitySchema:E248 (tracklist includes property P658) ([wikidata.org](https://www.wikidata.org/wiki/EntitySchema%3AE248?utm_source=openai))"
  ]
}
```
Answer: {"relationship_text":"depicts","wikidata_property_id":"P180","sources":["Wikidata property page for 'depicts' indicates property ID P180","Wikidata:WikiProject Commons documentation includes 'Depicts  depicts (P180)']"}
Error parsing JSON {"relationship_text":"depicts","wikidata_property_id":"P180","sources":["Wikidata property page for 'depicts' indicates property ID P180","Wikidata:WikiProject Commons documentation includes 'Depicts  depicts (P180)']"}
Answer: ```json
{
  "relationship_text": "depicts",
  "wikidata_property_id": "P180",
  "sources": [
    "Wikidata property page for 'depicts' P180 (entity visually depicted in an image…) ([wikidata.org](https://www.wikidata.org/wiki/Property%3AP180?utm_source=openai))"
  ]

[{'relationship_text': 'from album',
  'wikidata_property_id': 'P658',
  'sources': ['EntitySchema:E248 (tracklist includes property P658) ([wikidata.org](https://www.wikidata.org/wiki/EntitySchema%3AE248?utm_source=openai))']},
 {'relationship_text': 'depicts',
  'wikidata_property_id': 'P180',
  'sources': ["Wikidata property page for 'depicts' P180 (entity visually depicted in an image…) ([wikidata.org](https://www.wikidata.org/wiki/Property%3AP180?utm_source=openai))"]},
 {'relationship_text': 'not mentioned',
  'wikidata_property_id': 'No corresponding property',
  'sources': ["Wikidata properties list and help documentation (shows no property labeled 'not mentioned')",
   "Search results indicating no Wikidata property exists matching 'not mentioned'"]}]

In [None]:
# Entities: dicts
# Relationships: reldicts
# Wikidata property ids: pids
for dic in reldicts:
    relationship = dic['relationship']
    obj = dic['object_of_relationship']
    # Attach the property ID of the first matching relationship label
    for pid in pids:
        if pid['relationship_text'] == relationship:
            dic['wikidata_property_id'] = pid['wikidata_property_id']
            break
    # Attach the type and Wikidata ID of the first matching linked entity
    for entity in dicts:
        if entity['entity_text'] == obj:
            dic['object_label'] = entity['type']
            dic['object_wikidata_id'] = entity['wikidata_id']
            break
reldicts


[{'entity_text': '1637',
  'relationship': 'from album',
  'subject_of_relationship': 'museum artifact',
  'object_of_relationship': '1637',
  'description': "The museum artifact, 'Venus and Cupid,' is one of 425 drawings from the 1637 album.",
  'sources': [],
  'wikidata_property_id': 'P658',
  'object_label': 'YEAR',
  'object_wikidata_id': 'Q577'},
 {'entity_text': 'Venus',
  'relationship': 'depicts',
  'subject_of_relationship': 'museum artifact',
  'object_of_relationship': 'Venus',
  'description': 'Venus and Cupid, one of 425 drawings from the 1637 album; Venus seated astride a log at l, her arms raised as if to threaten Cupid, standing before her at r Pen and brown ink, and grey-brown wash, over black chalk',
  'sources': [],
  'wikidata_property_id': 'P180',
  'object_label': 'CREATIVE_WORK',
  'object_wikidata_id': 'Q29421989'},
 {'entity_text': 'Cupid',
  'relationship': 'depicts',
  'subject_of_relationship': 'Venus and Cupid drawing',
  'object_of_relationship': 'Cupid',
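The merge step is a pair of first-match lookups, which can also be written with dictionaries. A minimal sketch, assuming hypothetical helpers `index_first` and `merge` (not part of the notebook) and returning copies rather than updating the records in place; the sample values are taken from the logged output:

```python
def index_first(items, key):
    """Map item[key] to the first item carrying that value."""
    index = {}
    for item in items:
        index.setdefault(item[key], item)
    return index


def merge(relations, pids, entities):
    """Return copies of the relation records enriched with property and entity IDs."""
    pid_by_label = index_first(pids, "relationship_text")
    entity_by_text = index_first(entities, "entity_text")
    merged = []
    for rel in relations:
        row = dict(rel)  # copy so the input records stay untouched
        pid = pid_by_label.get(rel["relationship"])
        if pid is not None:
            row["wikidata_property_id"] = pid["wikidata_property_id"]
        entity = entity_by_text.get(rel["object_of_relationship"])
        if entity is not None:
            row["object_label"] = entity["type"]
            row["object_wikidata_id"] = entity["wikidata_id"]
        merged.append(row)
    return merged


relations = [{"relationship": "depicts", "object_of_relationship": "Venus"}]
pids = [{"relationship_text": "depicts", "wikidata_property_id": "P180"}]
entities = [{"entity_text": "Venus", "type": "CREATIVE_WORK", "wikidata_id": "Q29421989"}]
print(merge(relations, pids, entities))
```

Records without a match are kept unchanged, as in the loop version, so later steps still need to handle missing IDs.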

In [None]:
## Uncomment to inspect
#print(json_output)

## 5. Knowledge Graph Construction

We build triples by merging all the information extracted.

In [None]:
## Build linked data from the annotations
!java -version
!apt-get install openjdk-17-jre-headless -qq > /dev/null
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-17-openjdk-amd64"
!update-alternatives --set java /usr/lib/jvm/java-17-openjdk-amd64/bin/java
!java -version
!pip install pysparql-anything



openjdk version "17.0.16" 2025-07-15
OpenJDK Runtime Environment (build 17.0.16+8-Ubuntu-0ubuntu122.04.1)
OpenJDK 64-Bit Server VM (build 17.0.16+8-Ubuntu-0ubuntu122.04.1, mixed mode, sharing)
openjdk version "17.0.16" 2025-07-15
OpenJDK Runtime Environment (build 17.0.16+8-Ubuntu-0ubuntu122.04.1)
OpenJDK 64-Bit Server VM (build 17.0.16+8-Ubuntu-0ubuntu122.04.1, mixed mode, sharing)


In [None]:
query = """
PREFIX rdf:   <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs:  <http://www.w3.org/2000/01/rdf-schema#>
PREFIX owl:   <http://www.w3.org/2002/07/owl#>
PREFIX fx:    <http://sparql.xyz/facade-x/ns/>
PREFIX xyz:   <http://sparql.xyz/facade-x/data/>
PREFIX xhtml: <http://www.w3.org/1999/xhtml#>
PREFIX lorentz: <http://www.example.org/lorentz/>
PREFIX schema: <https://schema.org/>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
PREFIX wde: <https://www.wikidata.org/entity/>
PREFIX wdt: <https://www.wikidata.org/prop/direct/>
PREFIX wps: <https://www.wikidata.org/prop/statement/>
PREFIX wdno: <https://www.wikidata.org/prop/novalue/>
PREFIX wdp: <https://www.wikidata.org/prop/>
PREFIX wikibase: <http://wikiba.se/ontology#>

 CONSTRUCT {
  ?artifactEntity ?wikidataProperty ?objectEntity .
  ?artifactEntity ?trutyProperty [
      a wikibase:Statement ;
      ?statementProperty ?objectEntity ;
      skos:note ?description
    ]
    .
     ?objectEntity a ?schemaOrg; wdt:P31 ?wikidataType ;
      rdfs:label ?objectLabel ;
      schema:name ?objectLabel
     .

     [] a rdf:Statement ;
       rdf:subject ?artifactEntity ;
       rdf:predicate ?wikidataProperty ;
       rdf:object ?objectEntity ;
      skos:note ?description
     .

 }

WHERE {
  SERVICE <x-sparql-anything:> {
    fx:properties fx:location ?_location ; fx:media-type "application/json" .
    VALUES (?type ?schemaOrg ?wikidataType) {
        ("YEAR_RANGE" schema:DateTime wde:Q386724)
        ("YEAR"  schema:Date wde:Q116880167)
        ("PERSON"  schema:Person wde:Q5)
        ("PLACE"  schema:Place wde:Q2221906)
        ("CREATIVE_WORK"  schema:CreativeWork wde:Q386724)
    }

    [] xyz:relationship ?relationship
    ;
        xyz:subject_of_relationship ?artifactLabel ;
        xyz:object_of_relationship ?objectLabel ;
        xyz:description ?description ;
        xyz:wikidata_property_id ?propertyId ;
        xyz:object_wikidata_id ?objectWikiDataId ;
        xyz:object_label ?type
        .
      FILTER(!CONTAINS(?propertyId, " "))
      BIND(fx:entity(wde:, ?objectWikiDataId) AS ?objectEntity)
      BIND(fx:entity(wdp:, ?propertyId) AS ?trutyProperty)
      BIND(fx:entity(wps:, ?propertyId) AS ?statementProperty)
      BIND(fx:entity(wdt:, ?propertyId) AS ?wikidataProperty)
      BIND(fx:entity(?_namespace) AS ?artifactEntity)
  }
}
"""

In [None]:
#print(type(json.dumps(reldicts)))
import json
with open(ARTIFACT_ID + '.json', 'w') as f:
    json.dump(reldicts, f)
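The CONSTRUCT query reads this JSON file and, for each record, builds URIs by concatenating a namespace with the record's IDs, after dropping records whose property ID contains a space or whose type is not in the `VALUES` table. A minimal Python sketch of just the direct `artifact property object` triple, assuming a hypothetical `direct_triples` helper (not part of the notebook); the namespace strings are copied from the query's `wde:` and `wdt:` prefixes, and the second row's object ID is a placeholder:

```python
WDE = "https://www.wikidata.org/entity/"       # wde: in the query
WDT = "https://www.wikidata.org/prop/direct/"  # wdt: in the query
# Types listed in the query's VALUES table (AGE is absent there).
TYPES = {"YEAR_RANGE", "YEAR", "PERSON", "PLACE", "CREATIVE_WORK"}


def direct_triples(rows, artifact_uri):
    """Yield (subject, predicate, object) URI tuples for usable rows."""
    for row in rows:
        pid = row.get("wikidata_property_id")
        qid = row.get("object_wikidata_id")
        if not pid or not qid:
            continue  # the graph pattern needs both values to match
        if " " in pid:
            continue  # same test as FILTER(!CONTAINS(?propertyId, " "))
        if row.get("object_label") not in TYPES:
            continue  # the join with VALUES drops unknown types
        yield (artifact_uri, WDT + pid, WDE + qid)


rows = [
    {"wikidata_property_id": "P180", "object_wikidata_id": "Q29421989",
     "object_label": "CREATIVE_WORK"},
    # Placeholder object ID; the property text is what the logged run returned.
    {"wikidata_property_id": "No corresponding property",
     "object_wikidata_id": "Q_PLACEHOLDER", "object_label": "CREATIVE_WORK"},
]
uri = "http://www.example.org/collection/P_SL-5217-334"
for triple in direct_triples(rows, uri):
    print(triple)
```

The query additionally emits a `wikibase:Statement` node and an `rdf:Statement` reification carrying the description, which this sketch leaves out.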

In [None]:

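import pysparql_anything as sa

engine = sa.SparqlAnything()
g = engine.construct(
    query=query,
    values={
        'location': "./" + ARTIFACT_ID + '.json',
        'namespace': NAMESPACE + ARTIFACT_ID,
    },
)
# With a destination, serialize() writes the file and returns the graph itself,
# so the printed value is the graph's repr rather than the Turtle text.
print(g.serialize(format="ttl", destination=ARTIFACT_ID + ".ttl"))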

[a rdfg:Graph;rdflib:storage [a rdflib:Store;rdfs:label 'Memory']].


## 6. Output

The resulting knowledge graph is written to a Turtle file named after ARTIFACT_ID (here, P_SL-5217-334.ttl).