# GraphDB QA Chain

This notebook shows how to use LLMs to provide natural language querying (NLQ to SPARQL, also called text2sparql) for [GraphDB](https://graphdb.ontotext.com/). Ontotext GraphDB is a graph database and knowledge discovery tool compliant with [RDF](https://www.w3.org/RDF/) and [SPARQL](https://www.w3.org/TR/sparql11-query/).

## GraphDB LLM Functionalities

GraphDB supports some LLM integration functionalities as described in https://github.com/w3c/sparql-dev/issues/193:

[gpt-queries](https://graphdb.ontotext.com/documentation/10.5/gpt-queries.html)
- magic predicates to ask an LLM for text, list or table using data from your knowledge graph (KG)
- query explanation
- result explanation, summarization, rephrasing, translation

[retrieval-graphdb-connector](https://graphdb.ontotext.com/documentation/10.5/retrieval-graphdb-connector.html)
- Indexing of KG entities in a vector database
- Supports any text embedding algorithm and vector database
- Uses the same powerful connector (indexing) language that GraphDB uses for Elastic, Solr, Lucene
- Automatic synchronization of changes in RDF data to the KG entity index
- Supports nested objects (no UI support in GraphDB versions <= 10.5 as of yet)
- Serializes KG entities to text like this (e.g. for a Wines dataset):

        Franvino:
        - is a RedWine.
        - made from grape Merlo.
        - made from grape Cabernet Franc.
        - has sugar dry.
        - has year 2012.

[talk-to-graph](https://graphdb.ontotext.com/documentation/10.5/talk-to-graph.html)
- A simple chatbot using a defined KG entity index

## Querying the GraphDB Database

For this tutorial, we won't use the GraphDB LLM integration; instead, we'll generate SPARQL from NLQ. We'll use the Star Wars API (SWAPI) ontology and dataset that you can examine [here](https://drive.google.com/file/d/1wQ2K4uZp4eq3wlJ6_F_TxkOolaiczdYp/view?usp=drive_link).

You will need to have a running GraphDB instance. This tutorial shows how to run the database locally using the [GraphDB Docker image](https://hub.docker.com/r/ontotext/graphdb). It provides a Docker Compose set-up, which populates GraphDB with the Star Wars dataset. All necessary files, including this notebook, can be downloaded from GDrive.

### Set-up

- Install [Docker](https://docs.docker.com/get-docker/). This tutorial was created using Docker version `24.0.7`, which bundles [Docker Compose](https://docs.docker.com/compose/). For earlier Docker versions you may need to install Docker Compose separately.
- Download all files from [GDrive](https://drive.google.com/drive/folders/18dN7WQxfGu26Z9C9HUU5jBwDuPnVTLbl) in a local folder on your machine.
- Start GraphDB with the following script, executed from this folder:
```
docker build --tag graphdb .
docker compose up -d graphdb
```
You need to wait a couple of seconds for the database to start on `http://localhost:7200/`. The Star Wars dataset `starwars-data.trig` is automatically loaded into the `langchain` repository. You can run queries against the local SPARQL endpoint `http://localhost:7200/repositories/langchain`, or open the GraphDB Workbench at `http://localhost:7200/sparql` in your favourite web browser and run queries interactively.
- Working environment

If you use `conda`, create and activate a new conda env (e.g. `conda create -n graph_graphdb_qa python=3.9.18`).
Install the following libraries:

```
pip install jupyter==1.0.0
pip install openai==1.6.1
pip install rdflib==6.3.2
pip install langchain-openai==0.0.2.post1
pip install langchain
```

Run Jupyter with
```
jupyter notebook
```
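
Once the container is running, the SPARQL endpoint mentioned above can be smoke-tested from Python before involving any LLM. The following is a minimal sketch using only the standard library; it assumes the endpoint accepts a standard SPARQL 1.1 Protocol `GET` request with a `query` parameter and returns SPARQL JSON results. The helper names `build_sparql_request` and `run_select` are hypothetical and not part of GraphDB or LangChain.

```python
import json
import urllib.parse
import urllib.request

# Endpoint of the `langchain` repository created by the Docker Compose set-up.
ENDPOINT = "http://localhost:7200/repositories/langchain"

# A simple query that counts all statements in the repository.
COUNT_QUERY = "SELECT (COUNT(*) AS ?count) WHERE { ?s ?p ?o }"


def build_sparql_request(endpoint: str, query: str) -> urllib.request.Request:
    """Build a SPARQL Protocol GET request that asks for JSON results."""
    url = endpoint + "?" + urllib.parse.urlencode({"query": query})
    return urllib.request.Request(
        url, headers={"Accept": "application/sparql-results+json"}
    )


def run_select(endpoint: str, query: str, timeout: float = 10.0) -> list:
    """Send the query and return the list of result bindings (needs a running GraphDB)."""
    request = build_sparql_request(endpoint, query)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        payload = json.load(response)
    return payload["results"]["bindings"]


# Building the request does not touch the network, so it is safe to inspect.
request = build_sparql_request(ENDPOINT, COUNT_QUERY)
print(request.full_url)
```

With the database up, calling `run_select(ENDPOINT, COUNT_QUERY)` should return a single binding holding the statement count; a connection error usually means the container has not finished starting.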

### Specifying the Ontology

In order for the LLM to be able to generate SPARQL, it needs to know the knowledge graph schema (the ontology). It can be provided using one of two parameters on the `GraphDBGraph` class:
- `query_ontology`: a `CONSTRUCT` query that is executed on the SPARQL endpoint and returns the KG schema statements. We recommend that you store the ontology in its own named graph, which makes it easier to get only the relevant statements (as in the example below). `DESCRIBE` queries are not supported, because `DESCRIBE` returns the Symmetric Concise Bounded Description (SCBD), i.e. also the incoming class links. For large graphs with millions of instances, this is not efficient. See https://github.com/eclipse-rdf4j/rdf4j/issues/4857
- `local_file`: a local RDF ontology file. Supported file formats are `.ttl`, `.trig`, `.xml`, `.n3`, `.nt`, `.nq` and `.jsonld`.

In either case, the ontology dump should:
- Include enough information about classes, properties, property attachment to classes (using rdfs:domain, schema:domainIncludes or OWL restrictions), and taxonomies (important individuals).
- Not include overly verbose and irrelevant definitions and examples that do not help SPARQL construction.
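
One rough way to check whether an ontology dump carries the property-attachment information listed above is to count which predicates it uses. The sketch below does this for N-Triples input with the standard library only, relying on the fact that N-Triples is line-based with the predicate as the second term. The function name `count_predicates` and the sample triples are made up for illustration (the `rdfs:domain` statement in particular is not taken from the Star Wars ontology); for real work a proper RDF parser is the safer choice.

```python
from collections import Counter
from typing import Iterable

RDFS_DOMAIN = "http://www.w3.org/2000/01/rdf-schema#domain"


def count_predicates(nt_lines: Iterable[str]) -> Counter:
    """Count predicate IRIs in N-Triples lines (naive, line-based parsing)."""
    counts: Counter = Counter()
    for line in nt_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        parts = line.split(None, 2)  # subject, predicate, rest of the line
        if len(parts) == 3 and parts[1].startswith("<") and parts[1].endswith(">"):
            counts[parts[1][1:-1]] += 1
    return counts


# Hypothetical sample triples, for illustration only.
SAMPLE = [
    "<https://swapi.co/vocabulary/Aleena> "
    "<http://www.w3.org/2000/01/rdf-schema#subClassOf> "
    "<https://swapi.co/vocabulary/Reptile> .",
    "<https://swapi.co/vocabulary/climate> "
    "<http://www.w3.org/2000/01/rdf-schema#domain> "
    "<https://swapi.co/vocabulary/Planet> .",
    '<https://swapi.co/vocabulary/Aleena> '
    '<http://www.w3.org/2000/01/rdf-schema#label> "Aleena" .',
]

counts = count_predicates(SAMPLE)
print(counts[RDFS_DOMAIN])
```

To run it on a local file, pass an open file handle, e.g. `count_predicates(open(path, encoding="utf-8"))`. A count of zero for `rdfs:domain` (and for whatever other attachment predicates your ontology uses) suggests the LLM will have little to go on when choosing properties.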

In [1]:
from langchain_community.graphs import GraphDBGraph

# feeding the schema using a user construct query

graph = GraphDBGraph(
    query_endpoint="http://localhost:7200/repositories/langchain",
    query_ontology="CONSTRUCT {?s ?p ?o} FROM <https://swapi.co/ontology/> WHERE {?s ?p ?o}",
)

In [2]:
# feeding the schema using a local RDF file

graph = GraphDBGraph(
    query_endpoint="http://localhost:7200/repositories/langchain",
    local_file="/path/to/langchain_graphdb_tutorial/starwars-ontology.nt",  # change the path here
)

Either way, the ontology (schema) is fed to the LLM in `.ttl` since `.ttl` with appropriate prefixes is most compact and easiest for the LLM to remember.

The Star Wars ontology is a bit unusual in that it includes a lot of specific triples about classes, e.g. that the species :Aleena live on <planet/38>, they are a subclass of :Reptile, have certain typical characteristics (height, lifespan, skinColor), and specific individuals (characters such as <aleena/47>) are representatives of that class:


```
@prefix : <https://swapi.co/vocabulary/> .
@prefix owl: <http://www.w3.org/2002/07/owl#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

:Aleena a owl:Class,
        :Species ;
    rdfs:label "Aleena" ;
    rdfs:isDefinedBy <https://swapi.co/ontology/> ;
    rdfs:subClassOf :Reptile,
        :Sentient ;
    :averageHeight 80.0 ;
    :averageLifespan "79" ;
    :character <https://swapi.co/resource/aleena/47> ;
    :film <https://swapi.co/resource/film/4> ;
    :language "Aleena" ;
    :planet <https://swapi.co/resource/planet/38> ;
    :skinColor "blue",
        "gray" .

:Award a owl:Class ;
    rdfs:isDefinedBy <https://swapi.co/ontology/> .

:AwardRecognition a owl:Class ;
    rdfs:isDefinedBy <https://swapi.co/ontology/> .

:Besalisk a owl:Class,
        :Species ;
    rdfs:label "Besalisk" ;
    rdfs:isDefinedBy <https://swapi.co/ontology/> ;
    rdfs:subClassOf :Amphibian,
        :Sentient ;
    :averageHeight 178.0 ;
    :averageLifespan "75" ;
    :character <https://swapi.co/resource/besalisk/71> ;
    :eyeColor "yellow" ;
    :film <https://swapi.co/resource/film/5> ;
    :language "besalisk" ;
    :planet <https://swapi.co/resource/planet/55> ;
    :skinColor "brown" .

    ...

 ```


In order to keep this tutorial simple, we use an unsecured GraphDB. If GraphDB is secured, you should set the environment variables `GRAPHDB_USERNAME` and `GRAPHDB_PASSWORD` before initializing `GraphDBGraph`.

```python
import os

# Credentials must be set before GraphDBGraph is instantiated.
os.environ["GRAPHDB_USERNAME"] = "graphdb-user"
os.environ["GRAPHDB_PASSWORD"] = "graphdb-password"

graph = GraphDBGraph(
    query_endpoint=...,
    query_ontology=...
)
```


### Question Answering against the StarWars Dataset

We can now use the `GraphDBQAChain` to ask some questions.

In [3]:
import os

from langchain.chains import GraphDBQAChain
from langchain_openai import ChatOpenAI

# We'll be using an OpenAI model which requires an OpenAI API Key.
# However, other models are available as well:
# https://python.langchain.com/docs/integrations/chat/

# Set the environment variable `OPENAI_API_KEY` to your OpenAI API key
os.environ["OPENAI_API_KEY"] = "sk-***"

# Any available OpenAI model can be used here.
# We use 'gpt-4-1106-preview' because of its bigger context window.
# The 'gpt-4-1106-preview' model name will be deprecated in the future and replaced by 'gpt-4-turbo' or similar,
# so be sure to consult the OpenAI API docs https://platform.openai.com/docs/models for the correct naming.

chain = GraphDBQAChain.from_llm(
    ChatOpenAI(temperature=0, model_name="gpt-4-1106-preview"),
    graph=graph,
    verbose=True,
)

In [5]:
chain.invoke("What is the average height of the Ewok?")



> Entering new GraphDBQAChain chain...
Generated SPARQL:
PREFIX ns1: <https://swapi.co/vocabulary/>

SELECT ?averageHeight
WHERE {
    ns1:Ewok ns1:averageHeight ?averageHeight .
}
Query results:
[(rdflib.term.Literal('100.0', datatype=rdflib.term.URIRef('http://www.w3.org/2001/XMLSchema#decimal')),)]

> Finished chain.


{'query': 'What is the average height of the Ewok?',
 'result': 'The average height of an Ewok is 100.0 centimeters.'}

In [6]:
chain.invoke("Can you describe the Wookiees?")



> Entering new GraphDBQAChain chain...
Generated SPARQL:
PREFIX ns1: <https://swapi.co/vocabulary/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

SELECT ?description
WHERE {
    ns1:Wookiee rdfs:label "Wookiee" .
    ns1:Wookiee ns1:desc ?description .
}
Query results:
[(rdflib.term.Literal('Wookiees (/ˈwʊkiːz/) are a fictional species of intelligent bipeds from the planet Kashyyyk in the Star Wars universe. They are taller, stronger, and hairier than humans and most (if not all) other humanoid species. The most notable Wookiee is Chewbacca, the copilot of Han Solo, who first appeared in the 1977 film Star Wars Episode IV: A New Hope.'),)]

> Finished chain.


{'query': 'Can you describe the Wookiees?',
 'result': 'Wookiees are a fictional species from the Star Wars universe known for being intelligent bipeds originating from the planet Kashyyyk. They are characterized by their towering height, superior strength, and hairier appearance compared to humans and most other humanoid species. The most famous Wookiee is undoubtedly Chewbacca, who is recognized as the co-pilot of Han Solo. Chewbacca made his first appearance in the 1977 film "Star Wars Episode IV: A New Hope."'}

In [11]:
chain.invoke("What is the climate on Tatooine?")



> Entering new GraphDBQAChain chain...
Generated SPARQL:
PREFIX ns1: <https://swapi.co/vocabulary/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

SELECT ?climate
WHERE {
    ?planet rdfs:label "Tatooine" .
    ?planet ns1:climate ?climate .
}
Query results:
[(rdflib.term.Literal('arid'),)]

> Finished chain.


{'query': 'What is the climate on Tatooine?',
 'result': 'The climate on Tatooine is arid.'}

### Chain Modifiers

The GraphDB QA chain lets you refine its prompts to improve the quality of the answers and the overall user experience of your app.


#### "SPARQL Generation" Prompt

The prompt is used for the SPARQL query generation based on the user question and the KG schema.

- `sparql_select_prompt`

    Default value - ``sparql_select_prompt=GRAPHDB_GENERATION_SELECT_PROMPT`` :  

  ```
    GRAPHDB_GENERATION_SELECT_TEMPLATE = """Task: Generate a SPARQL SELECT statement for querying a graph database.
    For instance, to find all email addresses of John Doe, the following query would be suitable:
    
    PREFIX foaf: <http://xmlns.com/foaf/0.1/>
    SELECT ?email
    WHERE {{
        ?person foaf:name "John Doe" .
        ?person foaf:mbox ?email .
    }}
    
    Instructions:
    Use only the node types and properties provided in the schema.
    Do not use any node types and properties that are not explicitly provided.
    Include all necessary prefixes.
    Schema in turtle format:
    {schema}
    Note: Be as concise as possible.
    Do not include any explanations or apologies in your responses.
    Do not include '```sparql'.
    Do not respond to any questions that ask for anything else than for you to construct a SPARQL query.
    Do not include any text except the SPARQL query generated.
    
    The question is:
    {prompt}"""
    
    GRAPHDB_GENERATION_SELECT_PROMPT = PromptTemplate(
        input_variables=["schema", "prompt"],
        template=GRAPHDB_GENERATION_SELECT_TEMPLATE,
    )
  ```
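
The doubled braces (`{{` and `}}`) in the template above are there because the template is rendered with Python format-string substitution: single braces mark the `{schema}` and `{prompt}` placeholders, while doubled braces come out as the literal braces SPARQL needs. The sketch below illustrates this with plain `str.format` on a shortened, made-up template; it is meant to show the escaping rule to follow when writing your own prompt, not to reproduce how `PromptTemplate` works internally.

```python
# Shortened, hypothetical template: literal SPARQL braces are doubled,
# placeholders use single braces.
template = (
    "Example query:\n"
    "SELECT ?email WHERE {{ ?person foaf:mbox ?email . }}\n"
    "Schema in turtle format:\n"
    "{schema}\n"
    "The question is:\n"
    "{prompt}"
)

rendered = template.format(
    schema=":Planet a owl:Class .",
    prompt="What is the climate on Tatooine?",
)
print(rendered)
```

If a custom prompt contains SPARQL with single braces, formatting will typically fail with a `KeyError` or `ValueError`, because the text between the braces is treated as a placeholder name.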

#### "SPARQL Fix" Prompt

Sometimes, the LLM may generate a SPARQL query with syntactic errors. The chain will try to amend this by prompting the LLM to correct it a certain number of times.

- `sparql_fix_select_prompt`

    Default value - ``sparql_fix_select_prompt=GRAPHDB_FIX_SELECT_PROMPT`` :
  ```
    GRAPHDB_FIX_SELECT_TEMPLATE = """Task: This query returns a syntactic error: 
    {parse_exception}
    Give me an improved query that works without any explanations or apologies. Do not change the logic of the query.
    The query is: 
    {generated_sparql}"""
    
    GRAPHDB_FIX_SELECT_PROMPT = PromptTemplate(
        input_variables=["parse_exception", "generated_sparql"],
        template=GRAPHDB_FIX_SELECT_TEMPLATE,
    )
  ```

- `max_regeneration_attempts`
  
    Default value - ``max_regeneration_attempts=5``
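
The generate-then-fix behaviour described above can be pictured as a bounded retry loop. The following is a simplified sketch, not the chain's actual implementation: `generate_valid_query` and the toy `generate`, `fix` and `validate` callables are hypothetical stand-ins for the LLM calls and the SPARQL parser, and the brace-counting "validator" only exists to make the example runnable.

```python
from typing import Callable, Optional


def generate_valid_query(
    generate: Callable[[str], str],
    fix: Callable[[str, str], str],
    validate: Callable[[str], Optional[str]],
    question: str,
    max_regeneration_attempts: int = 5,
) -> str:
    """Generate a query, then ask for fixes until it validates or attempts run out."""
    query = generate(question)
    attempts = 0
    while True:
        error = validate(query)  # None means the query parsed successfully
        if error is None:
            return query
        if attempts >= max_regeneration_attempts:
            raise ValueError(f"Could not produce a valid query: {error}")
        query = fix(error, query)  # corresponds to the "SPARQL Fix" prompt
        attempts += 1


# Toy stand-ins so the loop can run without an LLM or a SPARQL parser.
def toy_validate(query: str) -> Optional[str]:
    if query.count("{") != query.count("}"):
        return "Unbalanced braces"
    return None


def toy_generate(question: str) -> str:
    # Deliberately missing the closing brace.
    return "SELECT ?climate WHERE { ?planet <https://swapi.co/vocabulary/climate> ?climate ."


def toy_fix(error: str, query: str) -> str:
    return query + " }"


fixed = generate_valid_query(
    toy_generate, toy_fix, toy_validate, "What is the climate on Tatooine?"
)
print(fixed)
```

Raising `max_regeneration_attempts` trades extra LLM calls for a better chance of ending up with a syntactically valid query; it does nothing for queries that parse but are semantically wrong.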

#### "Answering" Prompt

The prompt is used for answering the question based on the results returned from the database and the initial user question. By default, the LLM is instructed to only use the information from the returned result(s). If the result set is empty, the LLM should say that it can't answer the question.

- `qa_prompt`
  
  Default value - ``qa_prompt=GRAPHDB_QA_PROMPT`` :
  
  ```
    GRAPHDB_QA_TEMPLATE = """Task: Generate a natural language response from the results of a SPARQL query.
    You are an assistant that creates well-written and human understandable answers.
    The information part contains the information provided, which you can use to construct an answer.
    The information provided is authoritative, you must never doubt it or try to use your internal knowledge to correct it.
    Make your response sound like the information is coming from an AI assistant, but don't add any information.
    Don't use internal knowledge to answer the question, just say you don't know if no information is available.
    Information:
    {context}
    
    Question: {prompt}
    Helpful Answer:"""
    GRAPHDB_QA_PROMPT = PromptTemplate(
        input_variables=["context", "prompt"], template=GRAPHDB_QA_TEMPLATE
    )
  ```

In [12]:
# Example of setting the `qa_prompt` so that the LLM falls back to common/general knowledge if the SPARQL query returns an empty set.

from langchain_core.prompts.prompt import PromptTemplate

qa_prompt = PromptTemplate(
    template="Task: Generate a natural language response from the results of a SPARQL query. "
    "If the query returns an empty list, you can use common knowledge. "
    "Indicate if you've used common knowledge or information from the database. "
    "Information: {context} "
    "Question: {prompt}",
    input_variables=["context", "prompt"],
)

chain_common_knowledge = GraphDBQAChain.from_llm(
    ChatOpenAI(temperature=0, model_name="gpt-4-1106-preview"),
    graph=graph,
    verbose=True,
    qa_prompt=qa_prompt,
    max_regeneration_attempts=10,
)

In [13]:
chain_common_knowledge.invoke("What is the climate on Luke Skywalker's home planet?")



> Entering new GraphDBQAChain chain...
Generated SPARQL:
PREFIX ns1: <https://swapi.co/vocabulary/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

SELECT ?climate
WHERE {
    ?person ns1:name "Luke Skywalker" .
    ?person ns1:homeworld ?planet .
    ?planet ns1:climate ?climate .
}
Query results:
[]

> Finished chain.


{'query': "What is the climate on Luke Skywalker's home planet?",
 'result': "Luke Skywalker's home planet is Tatooine. Tatooine is known for its harsh, desert climate, characterized by extremely high temperatures, dry conditions, and a lack of surface water. The planet has two suns, which contribute to its arid environment. This information is based on common knowledge from the Star Wars universe, as the SPARQL query results were empty."}

Once you're finished playing with QA with GraphDB, you can shut down the Docker environment by running
```
docker compose down -v --remove-orphans
```
from the directory with the Docker compose file.