# Ontotext GraphDB

>[Ontotext GraphDB](https://graphdb.ontotext.com/) is a graph database and knowledge discovery tool compliant with [RDF](https://www.w3.org/RDF/) and [SPARQL](https://www.w3.org/TR/sparql11-query/).

>This notebook shows how to use LLMs to provide natural language querying (NLQ to SPARQL, also called `text2sparql`) for `Ontotext GraphDB`. 

## GraphDB LLM Functionalities

`GraphDB` supports some LLM integration functionalities as described [here](https://github.com/w3c/sparql-dev/issues/193):

[gpt-queries](https://graphdb.ontotext.com/documentation/10.7/gpt-queries.html)

* magic predicates to ask an LLM for text, list or table using data from your knowledge graph (KG)
* query explanation
* result explanation, summarization, rephrasing, translation

[retrieval-graphdb-connector](https://graphdb.ontotext.com/documentation/10.7/retrieval-graphdb-connector.html)

* Indexing of KG entities in a vector database
* Supports any text embedding algorithm and vector database
* Uses the same powerful connector (indexing) language that GraphDB uses for Elastic, Solr, Lucene
* Automatic synchronization of changes in RDF data to the KG entity index
* Supports nested objects (no UI support in GraphDB version 10.5)
* Serializes KG entities to text like this (e.g. for a Wines dataset):

```
Franvino:
- is a RedWine.
- made from grape Merlo.
- made from grape Cabernet Franc.
- has sugar dry.
- has year 2012.
```

[talk-to-graph](https://graphdb.ontotext.com/documentation/10.7/talk-to-graph.html)

* A simple chatbot using a defined KG entity index


This tutorial does not use the GraphDB LLM integration. Instead, it generates `SPARQL` from NLQ. We'll use the `Star Wars API` (`SWAPI`) ontology and dataset that you can examine [here](https://github.com/Ontotext-AD/langchain-graphdb-qa-chain-demo/blob/main/starwars-data.trig).


## Setting up

You need a running GraphDB instance. This tutorial shows how to run the database locally using the [GraphDB Docker image](https://hub.docker.com/r/ontotext/graphdb). It provides a docker compose set-up, which populates GraphDB with the Star Wars dataset. All necessary files including this notebook can be downloaded from the GitHub repository [langchain-graphdb-qa-chain-demo](https://github.com/Ontotext-AD/langchain-graphdb-qa-chain-demo).

* Install [Docker](https://docs.docker.com/get-docker/).
* Clone the GitHub repository [langchain-graphdb-qa-chain-demo](https://github.com/Ontotext-AD/langchain-graphdb-qa-chain-demo) in a local folder on your machine.
* Start GraphDB with the following script, executed from the same folder:
  
```
docker build --tag graphdb .
docker compose up -d graphdb
```

  Wait a few seconds for the database to start on `http://localhost:7200/`. The Star Wars dataset `starwars-data.trig` is automatically loaded into the `langchain` repository. You can run queries against the local SPARQL endpoint `http://localhost:7200/repositories/langchain`, or open the GraphDB Workbench at `http://localhost:7200/sparql` in your web browser to run queries interactively.
* Set up the working environment

If you use `conda`, create and activate a new conda env (e.g. `conda create -n graph_ontotext_graphdb_qa python=3.12`, then `conda activate graph_ontotext_graphdb_qa`).

Install the following libraries:

```
pip install jupyter==1.0.0
pip install sparqlwrapper==2.0.0
pip install rdflib==7.0.0
pip install langchain-openai
pip install langchain
```

Run Jupyter with
```
jupyter notebook
```
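
Because the container needs a few seconds to start, a notebook cell that polls the server can help avoid connection errors in the next section. The following is a minimal sketch using only the standard library. It assumes that an HTTP 200 response from the base URL means the server is up, which does not prove that the dataset has finished loading. `is_up` and `wait_until_up` are hypothetical helper names, not part of GraphDB or LangChain.

```python
import time
import urllib.error
import urllib.request


def is_up(url: str) -> bool:
    """Return True if the URL answers with HTTP 200 within two seconds."""
    try:
        with urllib.request.urlopen(url, timeout=2) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or an HTTP error status.
        return False


def wait_until_up(url: str, timeout: float = 60.0, interval: float = 2.0, probe=is_up) -> bool:
    """Poll `probe(url)` until it succeeds or `timeout` seconds have passed."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if probe(url):
            return True
        time.sleep(interval)
    return False
```

For example, `wait_until_up("http://localhost:7200/")` returns `True` as soon as the server responds, or `False` once the timeout expires.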

## Connect to GraphDB

```python
from langchain_community.graphs import OntotextGraphDBGraph

graph = OntotextGraphDBGraph(
    gdb_repository="http://localhost:7200/repositories/langchain",
)
```

If you're running a secured GraphDB, you can pass the authentication header in `custom_http_headers`:
```
graph = OntotextGraphDBGraph(
    gdb_repository=...,
    custom_http_headers={"Authorization": "<auth-scheme> <authorization-parameters>"},
)
```

Alternatively, if you're using basic authentication, you can set the environment variables `GRAPHDB_USERNAME` and `GRAPHDB_PASSWORD` before initializing `OntotextGraphDBGraph`:
```
import os

os.environ["GRAPHDB_USERNAME"] = "graphdb-user"
os.environ["GRAPHDB_PASSWORD"] = "graphdb-password"

graph = OntotextGraphDBGraph(
    gdb_repository=...,
)
```
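
For basic authentication, the `<auth-scheme>` is `Basic` and the parameter is the Base64-encoded `username:password` pair. The sketch below builds that header by hand, in case you prefer an explicit header over environment variables. `basic_auth_header` is a hypothetical helper, and passing its result to `custom_http_headers` assumes that the parameter accepts a mapping of header names to values.

```python
import base64


def basic_auth_header(username: str, password: str) -> dict[str, str]:
    """Build an HTTP Basic `Authorization` header (RFC 7617)."""
    credentials = f"{username}:{password}".encode("utf-8")
    token = base64.b64encode(credentials).decode("ascii")
    return {"Authorization": f"Basic {token}"}


# Placeholder credentials, matching the example above.
headers = basic_auth_header("graphdb-user", "graphdb-password")
```

Avoid hard-coding real credentials in a notebook; reading them from the environment or a secrets store is usually safer.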

## Setup Ontotext GraphDB QA chain

```python
import os

from langchain.chains import OntotextGraphDBQAChain
from langchain_openai import ChatOpenAI

# We'll be using an OpenAI model which requires an OpenAI API Key.
# However, other models are available as well:
# https://python.langchain.com/docs/integrations/chat/

# Set the environment variable `OPENAI_API_KEY` to your OpenAI API key
os.environ["OPENAI_API_KEY"] = "sk-***"

# Any available OpenAI model can be used here.
# We use 'gpt-4o' because of the bigger context window.
# Check the OpenAI models here https://platform.openai.com/docs/models

chain = OntotextGraphDBQAChain.from_llm(
    ChatOpenAI(
        temperature=0,
        model_name="gpt-4o-2024-05-13",
        seed=123,
    ),
    graph=graph,
    verbose=True,
)
```

## Specifying the ontology

In order for the LLM to be able to generate SPARQL, it needs to know the knowledge graph schema (the ontology). The ontology schema dump should:

* Include enough information about classes, properties, property attachment to classes (using rdfs:domain, schema:domainIncludes or OWL restrictions), and taxonomies (important individuals).
* Not include overly verbose and irrelevant definitions and examples that do not help SPARQL construction.

```python
from pathlib import Path

ontology_schema = Path(
    "/path/to/langchain-graphdb-qa-chain-demo/starwars-ontology.ttl"
).read_text(encoding="utf-8")
```
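
Since the whole schema is sent to the LLM with every question, it can be worth trimming obvious noise first. The rough sketch below drops blank lines and full-line `#` comments from Turtle text. `compact_turtle` is a hypothetical helper, and it assumes the file contains no multi-line string literals with lines that start with `#`. It does not remove verbose `rdfs:comment` values, which would require a real RDF parser. The sample text is made up for illustration.

```python
def compact_turtle(text: str) -> str:
    """Drop blank lines and full-line `#` comments from Turtle text."""
    kept = []
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue
        # A `#` inside an IRI such as <http://...owl#> is left untouched,
        # because only lines that *start* with `#` are skipped.
        kept.append(line.rstrip())
    return "\n".join(kept) + "\n"


# Illustrative excerpt, not the actual content of starwars-ontology.ttl.
sample = """\
# Star Wars vocabulary (excerpt made up for illustration)
@prefix owl: <http://www.w3.org/2002/07/owl#> .

<https://swapi.co/vocabulary/Film> a owl:Class .
"""
compacted = compact_turtle(sample)
print(f"{len(sample)} -> {len(compacted)} characters")
```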

## Question Answering against the StarWars dataset

We can now use the `OntotextGraphDBQAChain` to ask some questions. Let's ask a simple one.

```python
chain.invoke(
    {
        "question": "What is the climate on Tatooine?",
        "ontology_schema": ontology_schema,
    }
)[chain.output_key_answer]
```

```
> Entering new OntotextGraphDBQAChain chain...
Generated query:
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX dc: <http://purl.org/dc/elements/1.1/>

SELECT ?climate
WHERE {
  ?planet rdfs:label "Tatooine" .
  ?planet <https://swapi.co/vocabulary/climate> ?climate .
}
LIMIT 5
Query results:
[{"climate": "arid"}]
Finished chain for 3.26 seconds

> Finished chain.

'The climate on Tatooine is arid.'
```
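
With `verbose=True` the generated query is printed, so you can re-run it yourself to check the answer against the raw data. The sketch below uses only the standard library and the SPARQL 1.1 Protocol (a GET request with a `query` parameter). The helper names are hypothetical, and it assumes the unsecured local endpoint from the set-up section.

```python
import json
import urllib.parse
import urllib.request

ENDPOINT = "http://localhost:7200/repositories/langchain"


def build_select_request(endpoint: str, query: str) -> urllib.request.Request:
    """Prepare a SPARQL Protocol GET request that asks for JSON results."""
    url = endpoint + "?" + urllib.parse.urlencode({"query": query})
    return urllib.request.Request(
        url, headers={"Accept": "application/sparql-results+json"}
    )


def flatten_bindings(results: dict) -> list[dict[str, str]]:
    """Reduce SPARQL JSON results to plain {variable: value} rows."""
    return [
        {name: term["value"] for name, term in row.items()}
        for row in results["results"]["bindings"]
    ]


def run_select(endpoint: str, query: str) -> list[dict[str, str]]:
    """Send the query and return flattened rows (needs a running GraphDB)."""
    with urllib.request.urlopen(build_select_request(endpoint, query)) as response:
        return flatten_bindings(json.load(response))
```

Calling `run_select(ENDPOINT, query)` with the generated query text returns rows in a shape similar to the "Query results" line in the verbose output.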

We can also ask more complicated questions, such as one that requires aggregation:

```python
chain.invoke(
    {
        "question": "What is the average box office revenue for all the Star Wars movies?",
        "ontology_schema": ontology_schema,
    }
)[chain.output_key_answer]
```

```
> Entering new OntotextGraphDBQAChain chain...
Generated query:
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX owl: <http://www.w3.org/2002/07/owl#>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
PREFIX dc: <http://purl.org/dc/elements/1.1/>

SELECT (AVG(?boxOffice) AS ?averageBoxOffice)
WHERE {
  ?film rdf:type <https://swapi.co/vocabulary/Film> ;
        <https://swapi.co/vocabulary/boxOffice> ?boxOffice .
}
Query results:
[{"averageBoxOffice": "754147643.8"}]
Finished chain for 8.08 seconds

> Finished chain.

'The average box office revenue for all the Star Wars movies is $754,147,643.80.'
```

## Chain prompts

The Ontotext GraphDB QA chain lets you refine its prompts to improve answer quality and the overall user experience of your app.


### "SPARQL Generation" prompt

This prompt is used to generate the SPARQL query from the user question and the KG schema.

- `sparql_generation_prompt`

    Default value:
  ````python
    GRAPHDB_SPARQL_GENERATION_TEMPLATE = """
    Write a SPARQL SELECT query to answer the user question delimited by triple backticks:\n```{question}```\n
    The ontology schema delimited by triple backticks in Turtle format is:\n```{ontology_schema}```\n
    Use only the classes and properties provided in the schema to construct the SPARQL query. 
    Do not use any classes or properties that are not explicitly provided in the SPARQL query. 
    Include all necessary prefixes. 
    Do not include any explanations or apologies in your responses. 
    Do not wrap the query in backticks. 
    Do not include any text except the SPARQL query generated. 
    For queries without aggregation, apply LIMIT 5 unless otherwise specified. 
    For queries with aggregation, don't apply limit unless otherwise specified. \n
    """
    GRAPHDB_SPARQL_GENERATION_PROMPT = PromptTemplate(
        input_variables=["ontology_schema", "question"],
        template=GRAPHDB_SPARQL_GENERATION_TEMPLATE,
    )
  ````

Note that if you replace the default prompt, you must provide a value for each of the prompt's input variables when you call the chain.
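
A quick way to catch a forgotten input before calling the chain is to compare the placeholders in your template with the keys you plan to pass. The sketch below assumes an f-string style template with `{name}` placeholders, as in the default prompt above. `template_variables` and `missing_inputs` are hypothetical helpers, not part of LangChain.

```python
from string import Formatter


def template_variables(template: str) -> set[str]:
    """Return the placeholder names used in an f-string style template."""
    return {
        field_name
        for _, field_name, _, _ in Formatter().parse(template)
        if field_name
    }


def missing_inputs(template: str, inputs: dict) -> set[str]:
    """Return the placeholders that have no matching key in `inputs`."""
    return template_variables(template) - set(inputs)
```

An empty set from `missing_inputs(template, inputs)` means every placeholder in the template has a value.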

### "SPARQL Fix" prompt

Sometimes the LLM generates a SPARQL query with syntax errors, missing prefixes, or similar problems. The chain tries to repair the query by prompting the LLM to correct it, up to `max_fix_retries` times.

- `sparql_fix_prompt`

    Default value:
  ````python
    GRAPHDB_SPARQL_FIX_TEMPLATE = """
    The following SPARQL query delimited by triple backticks
    ```
    {generated_sparql}
    ```
    is not valid.
    The error delimited by triple backticks is
    ```
    {error_message}
    ```
    Give me a correct version of the SPARQL query.
    Do not change the logic of the query.
    Do not include any explanations or apologies in your responses.
    Do not wrap the query in backticks.
    Do not include any text except the SPARQL query generated.
    The ontology schema delimited by triple backticks in Turtle format is:
    ```
    {ontology_schema}
    ```
    """
    GRAPHDB_SPARQL_FIX_PROMPT = PromptTemplate(
        input_variables=["error_message", "generated_sparql", "ontology_schema"],
        template=GRAPHDB_SPARQL_FIX_TEMPLATE,
    )
  ````

Note that the input values passed to the prompt include the error message `error_message`, the generated SPARQL query `generated_sparql`, and the input values passed to the chain. If you want to include extra variables, you must pass them as input to the chain.

- `max_fix_retries`
  
    Default value: `5`
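
To make the retry behaviour concrete, here is a generic sketch of a validate-and-fix loop. It is not the chain's actual implementation: `repair_query`, `validate`, and `fix` are hypothetical names, where `validate` returns an error message (or `None` for a valid query) and `fix` stands in for the LLM call made with the fix prompt.

```python
from typing import Callable, Optional


def repair_query(
    query: str,
    validate: Callable[[str], Optional[str]],
    fix: Callable[[str, str], str],
    max_fix_retries: int = 5,
) -> str:
    """Ask `fix` for a corrected query until `validate` reports no error."""
    error = validate(query)
    retries = 0
    while error is not None and retries < max_fix_retries:
        # `fix` receives the invalid query and the error message,
        # mirroring `generated_sparql` and `error_message` in the prompt.
        query = fix(query, error)
        retries += 1
        error = validate(query)
    if error is not None:
        raise ValueError(f"Query still invalid after {retries} fixes: {error}")
    return query
```

Capping the number of retries bounds both latency and LLM cost when a query cannot be repaired.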

### "Answering" prompt

This prompt is used to answer the initial user question based on the results returned from the database. By default, the LLM is instructed to use only the information in the returned result(s). If the result set is empty, the LLM should state that it can't answer the question.

- `qa_prompt`
  
  Default value:
  ````python
    GRAPHDB_QA_TEMPLATE = """Task: Generate a natural language response from the results of a SPARQL query.
    You are an assistant that creates well-written and human understandable answers.
    The information part contains the information provided, which you can use to construct an answer.
    The information provided is authoritative, you must never doubt it or try to use your internal knowledge to correct it.
    Make your response sound like the information is coming from an AI assistant, but don't add any information.
    Don't use internal knowledge to answer the question, just say you don't know if no information is available.
    Information:
    {context}
    
    Question: {question}
    Helpful Answer:"""
    GRAPHDB_QA_PROMPT = PromptTemplate(
        input_variables=["context", "question"], template=GRAPHDB_QA_TEMPLATE
    )
  ````

Note that the input values passed to the prompt include the SPARQL query results `context`, and the input values passed to the chain. If you want to include extra variables, you must pass them as input to the chain.

## Modifying "SPARQL Generation" prompt example

````python
from langchain_core.prompts.prompt import PromptTemplate

template = """
Write a SPARQL SELECT query to answer the user question 
delimited by triple backticks:\n```{question}```\n
The question mentions the following concepts in JSON format 
delimited by triple backticks\n```{named_entities}```\n
The ontology schema delimited by triple backticks in 
Turtle format is:\n```{ontology_schema}```\n
Use only the classes and properties provided in the schema 
to construct the SPARQL query.
Do not use any classes or properties 
that are not explicitly provided in the SPARQL query.
Include all necessary prefixes.
Do not include any explanations or apologies in your responses.
Do not wrap the query in backticks.
Do not include any text except the SPARQL query generated.
"""
chain = OntotextGraphDBQAChain.from_llm(
    ChatOpenAI(
        temperature=0,
        model_name="gpt-4o-2024-05-13",
        seed=123,
    ),
    graph=graph,
    sparql_generation_prompt=PromptTemplate(
        input_variables=["question", "named_entities", "ontology_schema"],
        template=template,
    ),
)

chain.invoke(
    {
        "question": "What is the name of Luke Skywalker's home planet?",
        "ontology_schema": ontology_schema,
        "named_entities": [
            {
                "class": "https://swapi.co/vocabulary/Human",
                "inst": "https://swapi.co/resource/human/1",
            },
        ],
    }
)[chain.output_key_answer]
````

```
"Luke Skywalker's home planet is Tatooine."
```

Once you're finished playing with QA with GraphDB, you can shut down the Docker environment by running
```
docker compose down -v --remove-orphans
```
from the directory with the Docker compose file.