## Preparation for NebulaGraph

Install the dependencies and set up the contexts for Llama Index.

In [2]:
# %pip install openai ipython-ngql llama_index==0.8.9 pyvis python-dotenv datasets pandas

In [1]:
from dotenv import load_dotenv, find_dotenv
import openai
import os

_ = load_dotenv(find_dotenv())
openai.api_key = os.getenv('OPENAI_API_KEY')

In [2]:
os.environ["GRAPHD_HOST"] = "127.0.0.1"
os.environ["NEBULA_USER"] = "root"
os.environ["NEBULA_PASSWORD"] = "nebula"
os.environ["NEBULA_ADDRESS"] = "127.0.0.1:9669"
%reload_ext ngql
# Build the connection arguments from the environment variables set above
connection_string = f"--address {os.environ['GRAPHD_HOST']} --port 9669 --user {os.environ['NEBULA_USER']} --password {os.environ['NEBULA_PASSWORD']}"
%ngql {connection_string}

Connection Pool Created


Unnamed: 0,Name
0,llamaindex
1,phillies_rag


In [3]:
%ngql CREATE SPACE IF NOT EXISTS llamaindex(vid_type=FIXED_STRING(256), partition_num=1, replica_factor=1);

In [4]:
%ngql SHOW SPACES;

Unnamed: 0,Name
0,llamaindex
1,phillies_rag


In [5]:
%ngql USE llamaindex;

In [8]:
%ngql CREATE TAG IF NOT EXISTS  entity(name string);

In [9]:
%ngql SHOW TAGS;

Unnamed: 0,Name
0,entity


In [10]:
%ngql CREATE EDGE IF NOT EXISTS  relationship(relationship string);

In [11]:
%ngql SHOW EDGES;

Unnamed: 0,Name
0,relationship


In [12]:
%ngql CREATE TAG INDEX IF NOT EXISTS  entity_index ON entity(name(256));
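
The schema cells above can also be collected in one place, which helps when the same space has to be recreated from a script. The following is a minimal sketch: `schema_statements` is a hypothetical helper (not part of `ipython-ngql` or Llama Index) that only builds the nGQL strings used above and does not talk to NebulaGraph. NebulaGraph applies schema changes asynchronously, so a short pause after `CREATE SPACE` may be needed before `USE` succeeds when the statements are run back to back.

```python
def schema_statements(space, tag="entity", edge="relationship", vid_len=256):
    """Return the nGQL DDL statements used above, in execution order."""
    return [
        f"CREATE SPACE IF NOT EXISTS {space}"
        f"(vid_type=FIXED_STRING({vid_len}), partition_num=1, replica_factor=1);",
        f"USE {space};",
        f"CREATE TAG IF NOT EXISTS {tag}(name string);",
        f"CREATE EDGE IF NOT EXISTS {edge}(relationship string);",
        f"CREATE TAG INDEX IF NOT EXISTS {tag}_index ON {tag}(name({vid_len}));",
    ]


# Print the statements so they can be pasted into %ngql cells or a console
for statement in schema_statements("llamaindex"):
    print(statement)
```

Each printed statement can be run with `%ngql` exactly as in the cells above.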

In [23]:
from llama_index.storage.storage_context import StorageContext
from llama_index.graph_stores import NebulaGraphStore

# Storage_context with Graph_Store
space_name = "llamaindex"
edge_types, rel_prop_names = ["relationship"], ["relationship"]
tags = ["entity"]

graph_store = NebulaGraphStore(
    space_name=space_name,
    edge_types=edge_types,
    rel_prop_names=rel_prop_names,
    tags=tags,
)
storage_context = StorageContext.from_defaults(graph_store=graph_store)

## 🏗️ KG Building with Llama Index

### Preprocess Data with data connectors

with `SimpleDirectoryReader`

We will download and preprocess a 1,000-row sample of the XSum news summarization dataset, loaded with the Hugging Face `datasets` library.

In [16]:
from datasets import load_dataset
import pandas as pd

In [17]:
xsum_dataset = load_dataset(
    "xsum", version="1.2.0"
)

# Taking a sample of 1000 rows
xsum_sample = xsum_dataset["train"].select(range(1000)).to_pandas()
xsum_sample.head(2)

You can avoid this message in future by passing the argument `trust_remote_code=True`.
Passing `trust_remote_code=True` will be mandatory to load this dataset from the next major release of `datasets`.


Unnamed: 0,document,summary,id
0,"The full cost of damage in Newton Stewart, one...",Clean-up operations are continuing across the ...,35232142
1,A fire alarm went off at the Holiday Inn in Ho...,Two tourist buses have been destroyed by fire ...,40143035


In [18]:
# Combining 'document' and 'summary' columns
xsum_sample["combined"] = (
    "Document: " + xsum_sample.document.str.strip() + "; Summary: " + xsum_sample.summary.str.strip()
)

In [19]:
!mkdir -p 'document/'
# Write every combined record into a single text file, one record after another
joined_documents = '\n'.join(xsum_sample["combined"])
with open('document/documents.txt', 'w', encoding='utf-8') as file:
    file.write(joined_documents)
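
Joining all records with `'\n'` puts the whole sample into one file, and because the articles contain newlines themselves, the record boundaries are no longer recoverable from the text. If one loaded document per article is preferred, a possible alternative is to write one file per record. The sketch below is an assumption, not what this notebook ran: `write_records` and the `doc_0000.txt` naming are hypothetical, and toy tuples stand in for the rows of `xsum_sample`.

```python
from pathlib import Path
import tempfile


def write_records(records, out_dir):
    """Write each (document, summary) pair to its own UTF-8 text file."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    paths = []
    for i, (document, summary) in enumerate(records):
        # Same "Document: ...; Summary: ..." layout as the combined column
        text = f"Document: {document.strip()}; Summary: {summary.strip()}"
        path = out / f"doc_{i:04d}.txt"
        path.write_text(text, encoding="utf-8")
        paths.append(path)
    return paths


# Toy records standing in for rows of xsum_sample
records = [
    ("Storm Frank caused flooding.\nRepair work is ongoing. ", "Clean-up continues."),
    ("A fire alarm went off at the hotel.", "Two buses were destroyed by fire."),
]
with tempfile.TemporaryDirectory() as tmp:
    paths = write_records(records, tmp)
    print([p.name for p in paths])
    print(paths[0].read_text(encoding="utf-8"))
```

Pointing `SimpleDirectoryReader` at such a directory should then give one entry per article, at the cost of many small files.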

In [21]:
from llama_index import SimpleDirectoryReader

loader = SimpleDirectoryReader(input_dir="./document/")
documents = loader.load_data()

In [22]:
# if you want to see what the text looks like
documents[0].text[:1000]

'Document: The full cost of damage in Newton Stewart, one of the areas worst affected, is still being assessed.\nRepair work is ongoing in Hawick and many roads in Peeblesshire remain badly affected by standing water.\nTrains on the west coast mainline face disruption due to damage at the Lamington Viaduct.\nMany businesses and householders were affected by flooding in Newton Stewart after the River Cree overflowed into the town.\nFirst Minister Nicola Sturgeon visited the area to inspect the damage.\nThe waters breached a retaining wall, flooding many commercial properties on Victoria Street - the main shopping thoroughfare.\nJeanette Tate, who owns the Cinnamon Cafe which was badly affected, said she could not fault the multi-agency response once the flood hit.\nHowever, she said more preventative work could have been carried out to ensure the retaining wall did not fail.\n"It is difficult but I do think there is so much publicity for Dumfries and the Nith - and I totally appreciate 

### Indexing: Extract Triplets and Save to NebulaGraph

with `KnowledgeGraphIndex`

This call takes some time: it extracts entities and relationships from each chunk and stores them in NebulaGraph.
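
To make the extraction step more concrete, the sketch below parses `(subject, predicate, object)` lines of the kind the LLM returns for a chunk. It is an illustration only and not Llama Index's own parser: `parse_triplets` and `TRIPLET_RE` are hypothetical names, and the sample text is made up to resemble the output format.

```python
import re

# "(subject, predicate, object)"; the object may itself contain commas
TRIPLET_RE = re.compile(r"^\(([^,]+),\s*([^,]+),\s*(.+)\)$")


def parse_triplets(text, max_triplets=15):
    """Return up to max_triplets well-formed triplets found in text."""
    triplets = []
    for line in text.splitlines():
        match = TRIPLET_RE.match(line.strip())
        if not match:
            continue  # skip malformed or truncated lines
        subj, pred, obj = (part.strip() for part in match.groups())
        triplets.append((subj, pred, obj))
        if len(triplets) == max_triplets:
            break
    return triplets


llm_output = """(Storm Frank, caused, flooding)
(Jeanette Tate, owns, the Cinnamon Cafe)
(tour groups, from, C"""
print(parse_triplets(llm_output))
```

The last sample line has no closing parenthesis, so it is skipped rather than stored as a partial triplet.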

In [25]:
from llama_index import ServiceContext
from llama_index.llms import OpenAI

# define LLM
llm = OpenAI(temperature=0.1, model="gpt-3.5-turbo")
service_context = ServiceContext.from_defaults(llm=llm, chunk_size=512)

In [26]:
from llama_index import KnowledgeGraphIndex

kg_index = KnowledgeGraphIndex.from_documents(
    documents=documents,
    storage_context=storage_context,
    max_triplets_per_chunk=15,
    service_context=service_context,
    space_name=space_name,
    edge_types=edge_types,
    rel_prop_names=rel_prop_names,
    tags=tags,
    include_embeddings=True,
)

kg_index.storage_context.persist(persist_dir='./storage_graph')

(Damage, is being assessed in, Newton Stewart)
(Repair work, is ongoing in, Hawick)
(Roads, remain affected by, standing water in Peeblesshire)
(Trains, face disruption due to, damage at the Lamington Viaduct)
(Businesses and householders, were affected by flooding in, Newton Stewart)
(Nicola Sturgeon, visited, the area)
(Waters, breached a retaining wall, flooding commercial properties on Victoria Street)
(Jeanette Tate, owns, the Cinnamon Cafe)
(Preventative work, could have been carried out to ensure, the retaining wall did not fail)
(Flood alert, remains in place across, the Borders)
(Peebles, was badly hit by, problems)
(Scottish Borders Council, has put a list on, its website)
(Alex Rowley, was in, Hawick on Monday)
(Damage, has been done, amount of)
(People, have been forced out of, their homes)
(Storm Frank, caused, flooding)
(Holiday Inn, located in, Hope Street)
(guests, asked to leave, hotel)
(two buses, parked in, car park)
(tour groups, from, Germany)
(tour groups, from, C

### Persist storage context

In [29]:
#kg_index.storage_context.persist(persist_dir='./storage_graph')

!ls ./storage_graph

docstore.json     index_store.json  vector_store.json


### Restore storage_context from disk

In [None]:
# from llama_index import load_index_from_storage

# storage_context = StorageContext.from_defaults(persist_dir='./storage_graph', graph_store=graph_store)
# kg_index = load_index_from_storage(
#     storage_context=storage_context,
#     service_context=service_context,
#     max_triplets_per_chunk=10,
#     space_name=space_name,
#     edge_types=edge_types,
#     rel_prop_names=rel_prop_names,
#     tags=tags,
#     verbose=True,
# )

In [62]:
# KG vector-based entity retrieval
kg_query_engine = kg_index.as_query_engine()

In [63]:
response = kg_query_engine.query("I'm looking for the information of Harry Potter. What could you suggest to me?")
print(response)

You may want to explore details about the play "Harry Potter and the Cursed Child," which has been described as a thrilling theatrical production with impressive performances and magical elements.


In [64]:
# KG keyword-based entity retrieval
kg_keyword_query_engine = kg_index.as_query_engine(
    # setting to false uses the raw triplets instead of adding the text from the corresponding nodes
    include_text=False,
    retriever_mode="keyword",
    response_mode="tree_summarize",
)

In [65]:
response = kg_keyword_query_engine.query("I'm looking for the information of Harry Potter. What could you suggest to me?")
print(response)

I would suggest looking into the relationships and interactions involving Harry in the provided information. This includes the various individuals Harry interacted with, the actions he took, and the information exchanged between him and others. By examining these relationships and interactions, you may gain a better understanding of Harry's involvement and the context surrounding him.


In [32]:
# KG hybrid entity retrieval
kg_hybrid_query_engine = kg_index.as_query_engine(
    include_text=True,
    response_mode="tree_summarize",
    embedding_mode="hybrid",
    similarity_top_k=3,
    explore_global_knowledge=True,
)

In [33]:
response = kg_hybrid_query_engine.query("I'm looking for the information of Harry Potter. What could you suggest to me?")
print(response)

I would recommend exploring the play "Harry Potter and the Cursed Child," the original books by J.K. Rowling, and the film adaptations to fully immerse yourself in the magical world of Harry Potter. You can find information about Harry Potter in popular literature databases, bookstores, or online platforms specializing in fiction books. Additionally, official Harry Potter websites, fan forums, and social media pages dedicated to the series could provide you with detailed information about the character and the enchanting universe created by J.K. Rowling.


In [34]:
# using KnowledgeGraphQueryEngine
from llama_index.query_engine import KnowledgeGraphQueryEngine

kgqe_query_engine = KnowledgeGraphQueryEngine(
    storage_context=storage_context,
    service_context=service_context,
    llm=llm,
    verbose=True,
)

In [35]:
response = kgqe_query_engine.query("I'm looking for the information of Harry Potter. What could you suggest to me?")
print(response)

Graph Store Query:
MATCH (h:`entity`)-[:relationship]->(p:`entity`)
WHERE h.`entity`.`name` == 'Harry Potter'
RETURN p.`entity`.`name`;
Graph Store Response:
{'p.entity.name': []}
Final Response: Harry Potter's information is not available in the database based on the query and response provided.
Harry Potter's information is not available in the database based on the query and response provided.


In [36]:
# using KnowledgeGraphRAGRetriever
from llama_index.query_engine import RetrieverQueryEngine
from llama_index.retrievers import KnowledgeGraphRAGRetriever

graph_rag_retriever = KnowledgeGraphRAGRetriever(
    storage_context=storage_context,
    service_context=service_context,
    llm=llm,
    verbose=True,
)

kg_rag_query_engine = RetrieverQueryEngine.from_args(
    graph_rag_retriever, service_context=service_context
)

In [37]:
response = kg_rag_query_engine.query("I'm looking for the information of Harry Potter. What could you suggest to me?")
print(response)

Entities processed: ['Harry Potter', 'information', 'Harry', 'Potter', 'suggest']
Entities processed: ['Information', 'Harry Potter', 'Harry', 'Potter', 'Suggest']
Graph RAG context:
The following are knowledge sequence in max depth 2 in the form of directed graph like:
`subject -[predicate]->, object, <-[predicate_next_hop]-, object_next_hop ...` extracted based on key entities as subject:
information{name: information} <-[relationship:{relationship: provided}]- Mr Platini{name: Mr Platini} <-[relationship:{relationship: made to}]- payment{name: payment}
information{name: information} <-[relationship:{relationship: provided}]- Mr Platini{name: Mr Platini} <-[relationship:{relationship: contradicted}]- Swiss attorney general{name: Swiss attorney general}
information{name: information} <-[relationship:{relationship: help gather}]- system{name: system} -[relationship:{relationship: could deliver}]-> what{name: what}
information{name: informa

In [38]:
response = kg_rag_query_engine.query("Tell me some news about Harry Potter.")
print(response)

Entities processed: ['Harry Potter', 'news', 'Potter', 'Harry']
Entities processed: ['Harry Potter', 'Potter', 'Harry', 'News']
Graph RAG context:
The following are knowledge sequence in max depth 2 in the form of directed graph like:
`subject -[predicate]->, object, <-[predicate_next_hop]-, object_next_hop ...` extracted based on key entities as subject:
news{name: news} <-[relationship:{relationship: seemed unperturbed by}]- voters{name: voters} -[relationship:{relationship: turn out}]-> for the general election in May{name: for the general election in May}
news{name: news} <-[relationship:{relationship: seemed unperturbed by}]- voters{name: voters} <-[relationship:{relationship: appealing to}]- Zanu-PF{name: Zanu-PF}
news{name: news} <-[relationship:{relationship: seemed unperturbed by}]- voters{name: voters} <-[relationship:{relationship: meeting}]- David Cameron{name: David Cameron}
news{name: news} <-[relationship:{relationship: rece

A vanilla LLM with prompts does not work well on every question; fine-tuning or few-shot prompting could push it further.

Graph RAG is easier because:
- Composing the query does not rely on a highly capable LLM
- It is easier to start from approximately matching entities
- It is easier to push CoT-like task breakdown into the orchestration layer
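
The points above can be illustrated with a small in-memory sketch of the Graph RAG retrieval idea: match keywords to entities loosely (here, case-insensitively), then collect the triplets within two hops as context for the LLM. This is a toy stand-in under stated assumptions, not the `KnowledgeGraphRAGRetriever` implementation; `build_adjacency`, `subgraph_context`, and the `TRIPLETS` sample are hypothetical.

```python
from collections import defaultdict

# A few toy triplets in (subject, predicate, object) form
TRIPLETS = [
    ("Mr Platini", "provided", "information"),
    ("payment", "made to", "Mr Platini"),
    ("Jeanette Tate", "owns", "the Cinnamon Cafe"),
]


def build_adjacency(triplets):
    """Index every triplet under both of its endpoints."""
    adjacency = defaultdict(list)
    for subj, pred, obj in triplets:
        adjacency[subj].append((subj, pred, obj))
        adjacency[obj].append((subj, pred, obj))
    return adjacency


def subgraph_context(keywords, adjacency, depth=2):
    """Collect triplets reachable from the keyword entities within depth hops."""
    # Case-insensitive lookup gives approximate starting entities
    lookup = {name.lower(): name for name in adjacency}
    frontier = {lookup[k.lower()] for k in keywords if k.lower() in lookup}
    seen_entities, context = set(frontier), []
    for _ in range(depth):
        next_frontier = set()
        for entity in sorted(frontier):
            for triplet in adjacency[entity]:
                if triplet not in context:
                    context.append(triplet)
                for endpoint in (triplet[0], triplet[2]):
                    if endpoint not in seen_entities:
                        seen_entities.add(endpoint)
                        next_frontier.add(endpoint)
        frontier = next_frontier
    return [f"{s} -[{p}]-> {o}" for s, p, o in context]


adjacency = build_adjacency(TRIPLETS)
print(subgraph_context(["Information", "Harry Potter"], adjacency))
```

No query language has to be generated here: the only LLM-dependent step is producing the keyword list, which is why the retrieval is less sensitive to model capability.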


## 🧠 Graph RAG



### KG_Index as **Query Engine**

In [43]:
kg_index_query_engine = kg_index.as_query_engine(
    retriever_mode="keyword",
    verbose=True,
    response_mode="tree_summarize",
)

In [46]:
response_graph_rag = kg_index_query_engine.query("I'm looking for the information of Harry Potter. What could you suggest to me?")

print(response_graph_rag)

Extraced keywords: ['Harry Potter', 'information', 'Harry', 'Potter', 'suggest']
KG context:
The following are knowledge sequence in max depth 2 in the form of directed graph like:
`subject -[predicate]->, object, <-[predicate_next_hop]-, object_next_hop ...`
information{name: information} <-[relationship:{relationship: provided}]- Mr Platini{name: Mr Platini} <-[relationship:{relationship: made to}]- payment{name: payment}
information{name: information} <-[relationship:{relationship: provided}]- Mr Platini{name: Mr Platini} <-[relationship:{relationship: contradicted}]- Swiss attorney general{name: Swiss attorney general}
information{name: information} <-[relationship:{relationship: help gather}]- system{name: system} -[relationship:{relationship: could deliver}]-> what{name: what}
information{name: information} <-[relationship:{relationship: help gather}]- system{name: system} -[relationship:{relationship: worked on}]-> basis{name: basis}
information{nam

In [48]:
%ngql USE llamaindex; MATCH p=(n)-[e:relationship*1..2]-() WHERE id(n) in ['Harry Potter', 'Harry', 'Potter'] RETURN p

Unnamed: 0,p
0,"(""Harry"" :entity{name: ""Harry""})-[:relationshi..."
1,"(""Harry"" :entity{name: ""Harry""})-[:relationshi..."
2,"(""Harry"" :entity{name: ""Harry""})-[:relationshi..."
3,"(""Harry"" :entity{name: ""Harry""})-[:relationshi..."
4,"(""Harry"" :entity{name: ""Harry""})-[:relationshi..."
5,"(""Harry"" :entity{name: ""Harry""})-[:relationshi..."
6,"(""Harry"" :entity{name: ""Harry""})-[:relationshi..."
7,"(""Harry"" :entity{name: ""Harry""})-[:relationshi..."
8,"(""Harry"" :entity{name: ""Harry""})-[:relationshi..."
9,"(""Harry"" :entity{name: ""Harry""})-[:relationshi..."



See also the following comparisons of text2cypher and Graph RAG:
- https://user-images.githubusercontent.com/1651790/260617657-102d00bc-6146-4856-a81f-f953c7254b29.mp4
- https://siwei.io/en/demos/text2cypher/

> Another idea is to retrieve in both ways and combine the contexts to fit more use cases.
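
A minimal sketch of that combination, assuming both retrievers return their context as plain strings: merge the two lists, drop blanks and duplicates, and cap the size before building the prompt. `combine_contexts` is a hypothetical helper, not a Llama Index API, and the sample strings are simplified.

```python
def combine_contexts(text2cypher_rows, graph_rag_paths, limit=10):
    """Merge both retrieval results, dropping blanks and duplicates, keeping order."""
    merged, seen = [], set()
    for item in [*text2cypher_rows, *graph_rag_paths]:
        key = item.strip().lower()
        if key and key not in seen:
            seen.add(key)
            merged.append(item.strip())
    return merged[:limit]


text2cypher_rows = []  # stands in for a generated query that returned no rows
graph_rag_paths = [
    "information <-[provided]- Mr Platini",
    "information <-[provided]- Mr Platini",
    "information <-[help gather]- system",
]
print("\n".join(combine_contexts(text2cypher_rows, graph_rag_paths)))
```

When the generated query returns nothing, the Graph RAG paths still provide context, and when both return results the exact-match rows come first.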


### Graph RAG on any existing KGs

with `KnowledgeGraphRAGRetriever`.

REF: https://gpt-index.readthedocs.io/en/stable/examples/query_engine/knowledge_graph_rag_query_engine.html#perform-graph-rag-query

In [52]:
from llama_index.query_engine import RetrieverQueryEngine
from llama_index.retrievers import KnowledgeGraphRAGRetriever

graph_rag_retriever = KnowledgeGraphRAGRetriever(
    storage_context=storage_context,
    service_context=service_context,
    llm=llm,
    verbose=True,
)

query_engine = RetrieverQueryEngine.from_args(
    graph_rag_retriever, service_context=service_context
)

In [55]:
response = query_engine.query(
    "I'm looking for the information of Harry Potter. What could you suggest to me?",
)
print(response)

Entities processed: ['Harry Potter', 'information', 'Harry', 'Potter', 'suggest']
Entities processed: ['Information', 'Harry Potter', 'Harry', 'Potter', 'Suggest']
Graph RAG context:
The following are knowledge sequence in max depth 2 in the form of directed graph like:
`subject -[predicate]->, object, <-[predicate_next_hop]-, object_next_hop ...` extracted based on key entities as subject:
information{name: information} <-[relationship:{relationship: provided}]- Mr Platini{name: Mr Platini} <-[relationship:{relationship: made to}]- payment{name: payment}
information{name: information} <-[relationship:{relationship: provided}]- Mr Platini{name: Mr Platini} <-[relationship:{relationship: contradicted}]- Swiss attorney general{name: Swiss attorney general}
information{name: information} <-[relationship:{relationship: help gather}]- system{name: system} -[relationship:{relationship: could deliver}]-> what{name: what}
information{name: informa

### Example of Graph RAG Chat Engine

#### The context mode

In [57]:
from llama_index.memory import ChatMemoryBuffer

memory = ChatMemoryBuffer.from_defaults(token_limit=1500)

chat_engine = kg_index.as_chat_engine(
    chat_mode="context",
    memory=memory,
    verbose=True
)

In [60]:
response = chat_engine.chat("I'm looking for the information of Harry Potter. What could you suggest to me?")
print(response)

Extraced keywords: ['Harry Potter', 'information', 'Harry', 'Potter', 'suggest']
KG context:
The following are knowledge sequence in max depth 2 in the form of directed graph like:
`subject -[predicate]->, object, <-[predicate_next_hop]-, object_next_hop ...`
information{name: information} <-[relationship:{relationship: may have}]- Det Insp Larry Johnson{name: Det Insp Larry Johnson} -[relationship:{relationship: from}]-> Thames Valley Police{name: Thames Valley Police}
information{name: information} <-[relationship:{relationship: may have}]- Det Insp Larry Johnson{name: Det Insp Larry Johnson} -[relationship:{relationship: could assist}]-> investigation{name: investigation}
information{name: information} <-[relationship:{relationship: provided}]- Mr Platini{name: Mr Platini} <-[relationship:{relationship: contradicted}]- Swiss attorney general{name: Swiss attorney general}
information{name: information} <-[relationship:{relationship: may have}]- Det Insp 