# Custom Knowledge Graph Creation

In this example, a custom knowledge graph is created from wikipedia documents. We use a custom knowledge graph extractor:

```python
kg_extractor = DynamicLLMPathExtractor(
    llm=Settings.llm,
    max_triplets_per_chunk=20,
    num_workers=4,
    allowed_entity_types=[
        "IDEAS",
        "TECHNOLOGIES",
        "FRAMEWORKS",
        "TECHNIQUES",
        "USE_CASES",
    ],
    allowed_relation_types=["CREATED_BY", "IMPLEMENTED_IN", "USED_BY", "HELPS_WITH"],
)
```

## NebulaGraph Property Graph Index
NebulaGraph is an open-source distributed graph database built for super large-scale graphs with milliseconds of latency.

If you already have an existing graph, please skip to the end of this notebook.

If you're opening this Notebook on colab, you will probably need to install LlamaIndex 🦙.

```python
%pip install llama-index-graph-stores-nebula
%pip install llama-index-llms-vertex
%pip install llama-index-embeddings-vertex
%pip install llama-index
```


Before we start the `Knowledge Graph RAG QueryEngine` demo, let's first get ready for basic preparation of Llama Index.

In [23]:
# For Vertex

import os


import logging
import google.auth
import google.auth.transport.requests
from google.cloud import aiplatform
import vertexai
from vertexai.generative_models import HarmBlockThreshold, HarmCategory, SafetySetting
import sys
from dotenv import load_dotenv
import logging

logging.basicConfig(level=logging.WARN)

# Import the Secret Manager client library.
from google.cloud import secretmanager

load_dotenv()  # this loads the .env script for use below
PROJECT_ID = os.getenv("PROJECT_ID")
LOCATION = os.getenv("LOCATION")

# try to get the secret manager stored public IP
# Create the Secret Manager client.
client = secretmanager.SecretManagerServiceClient()

# Access the secret version.
try:
    response = client.access_secret_version(
        request={"name": "projects/679926387543/secrets/nebula-ip/versions/latest"}
    )

    # Print the secret payload.
    #
    # WARNING: Do not print the secret in a production environment - this
    # snippet is showing how to access the secret material.
    payload = response.payload.data.decode("UTF-8")

    NEBULA_SERVER_ADDRESS = payload
except:
    print("no secret found, using 127.0.0.1")
    NEBULA_SERVER_ADDRESS = "127.0.0.1"

credentials = google.auth.default(quota_project_id=PROJECT_ID)[0]
request = google.auth.transport.requests.Request()
credentials.refresh(request)

safety_config = [
    SafetySetting(
        category=HarmCategory.HARM_CATEGORY_DANGEROUS_CONTENT,
        threshold=HarmBlockThreshold.BLOCK_NONE,
    ),
    SafetySetting(
        category=HarmCategory.HARM_CATEGORY_HARASSMENT,
        threshold=HarmBlockThreshold.BLOCK_NONE,
    ),
    SafetySetting(
        category=HarmCategory.HARM_CATEGORY_SEXUALLY_EXPLICIT,
        threshold=HarmBlockThreshold.BLOCK_NONE,
    ),
]

logging.basicConfig(
    stream=sys.stdout, level=logging.INFO
)  # logging.DEBUG for more verbose output

vertexai.init(project=PROJECT_ID, location=LOCATION)

I0000 00:00:1730827193.918877  257055 fork_posix.cc:77] Other threads are currently calling into gRPC, skipping fork() handlers
I0000 00:00:1730827194.968899  257055 fork_posix.cc:77] Other threads are currently calling into gRPC, skipping fork() handlers


## Define the LLM


In [24]:
from llama_index.llms.vertex import Vertex
from llama_index.embeddings.vertex import VertexTextEmbedding
from llama_index.core import Settings

Settings.llm = Vertex(
    temperature=0,
    model="gemini-1.5-flash",
    credentials=credentials,
    safety_settings=safety_config,
)
Settings.embed_model = VertexTextEmbedding(
    model_name="text-embedding-004", credentials=credentials
)
Settings.chunk_size = 256

# Nested Asyncio needed for async ops

In [25]:
import nest_asyncio

nest_asyncio.apply()

## Prepare for NebulaGraph

In [26]:
space_name = "rag"

Then we could instiatate a `NebulaGraphStore`, in order to create a `StorageContext`'s `graph_store` as it.

In [44]:
%load_ext ngql
%ngql --address $NEBULA_SERVER_ADDRESS --port 9669 --user root --password <password>
%ngql CREATE SPACE IF NOT EXISTS $space_name(vid_type=FIXED_STRING(256))

The ngql extension is already loaded. To reload it, use:
  %reload_ext ngql
[1;3;38;2;0;135;107m[OK] Connection Pool Created[0m
INFO:nebula3.logger:Get connection to ('34.55.198.242', 9669)
INFO:nebula3.logger:Get connection to ('34.55.198.242', 9669)


# **One time Operation to Load the DB**

In [46]:
from llama_index.core import StorageContext
from llama_index.graph_stores.nebula import NebulaPropertyGraphStore

graph_store = NebulaPropertyGraphStore(
    space=space_name, overwrite=True, url=f"nebula://{NEBULA_SERVER_ADDRESS}:9669"
)

Here, we assumed to have the same Knowledge Graph from [this turtorial](https://gpt-index.readthedocs.io/en/latest/examples/query_engine/knowledge_graph_query_engine.html#optional-build-the-knowledge-graph-with-llamaindex)

Let's follow on this tutorial:

# With the help of Llama Index and LLM defined, we could build Knowledge Graph from given documents.

If we have a Knowledge Graph on NebulaGraphStore already, this step could be skipped

Load data from Wikipedia for "Guardians of the Galaxy Vol. 3"

In [47]:
from llama_index.core import download_loader

from llama_index.readers.wikipedia import WikipediaReader

loader = WikipediaReader()

documents = loader.load_data(
    pages=["Retrieval-augmented generation"], auto_suggest=False
)

# Next, Generate a KnowledgeGraphIndex with NebulaGraph as graph_store
Then, we will create a KnowledgeGraphIndex to enable Graph based RAG, apart from that, we have a Knowledge Graph up and running for other purposes, too!

## Create a vector store

In [48]:
from llama_index.core.vector_stores.simple import SimpleVectorStore

vec_store = SimpleVectorStore()

# Use custom graph entity and relationships

You can define custom entities and relationships with `PropertyGraphIndex`, using either the `DynamicLLMPathExtractor` [link](https://docs.llamaindex.ai/en/stable/module_guides/indexing/lpg_index_guide/#dynamicllmpathextractor), or the more rigid `SchemaLLMPathExtractor` [link](https://docs.llamaindex.ai/en/stable/module_guides/indexing/lpg_index_guide/#schemallmpathextractor)

Here we are going to create a custom knowledge graph about retreival augment generation (sourced from wikipedia)

In [None]:
from llama_index.core.indices.property_graph import DynamicLLMPathExtractor

kg_extractor = DynamicLLMPathExtractor(
    llm=Settings.llm,
    max_triplets_per_chunk=20,
    num_workers=4,
    allowed_entity_types=[
        "IDEAS",
        "TECHNOLOGIES",
        "FRAMEWORKS",
        "TECHNIQUES",
        "USE_CASES",
    ],
    allowed_relation_types=["CREATED_BY", "IMPLEMENTED_IN", "USED_BY", "HELPS_WITH"],
)

In [50]:
from llama_index.core.indices.property_graph import PropertyGraphIndex

index = PropertyGraphIndex.from_documents(
    documents,
    property_graph_store=graph_store,
    vector_store=vec_store,
    show_progress=True,
    kg_extractor=kg_extractor,  # this can take a list!
)

index.storage_context.vector_store.persist("./data/nebula_vec_store.json")

Parsing nodes:   0%|          | 0/1 [00:00<?, ?it/s]

Extracting paths from text: 100%|██████████| 9/9 [00:03<00:00,  2.88it/s]
Extracting implicit paths: 100%|██████████| 9/9 [00:00<00:00, 18586.28it/s]
Generating embeddings: 100%|██████████| 1/1 [00:00<00:00,  4.49it/s]
Generating embeddings: 100%|██████████| 18/18 [00:00<00:00, 26.19it/s]


# Loading from an existing graph
**Important** Go here if skipping data creation step above

In [51]:
from llama_index.graph_stores.nebula import NebulaPropertyGraphStore
from llama_index.core.indices.property_graph import PropertyGraphIndex


graph_store = NebulaPropertyGraphStore(
    space=space_name, url=f"nebula://{NEBULA_SERVER_ADDRESS}:9669"
)

from llama_index.core.vector_stores.simple import SimpleVectorStore

vec_store = SimpleVectorStore.from_persist_path("./data/nebula_vec_store.json")

index = PropertyGraphIndex.from_existing(
    property_graph_store=graph_store,
    vector_store=vec_store,
)

### Now that the graph is created, we can explore it with [jupyter-nebulagraph](https://github.com/wey-gu/jupyter_nebulagraph)

In [52]:
# Query some random Relationships with Cypher

%ngql USE $space_name;
%ngql MATCH (p)-[e]->(q) RETURN p, e, q LIMIT 3

INFO:nebula3.logger:Get connection to ('34.55.198.242', 9669)
INFO:nebula3.logger:Get connection to ('34.55.198.242', 9669)


Unnamed: 0,p,e,q
0,"(""Retrieval"" :Props__{_node_content: __NULL__, _node_type: __NULL__, creation_date: __NULL__, doc_id: __NULL__, document_id: __NULL__, file_name: __NULL__, file_path: __NULL__, file_size: __NULL__, file_type: __NULL__, last_modified_date: __NULL__, ref_doc_id: __NULL__, triplet_source_id: ""77a5817c-27b8-4eb6-af01-3e06553d9332""} :Node__{label: ""entity""} :Entity__{name: ""Retrieval""})","(""Retrieval"")-[:Relation__@0{_node_content: __NULL__, _node_type: __NULL__, creation_date: __NULL__, doc_id: __NULL__, document_id: __NULL__, file_name: __NULL__, file_path: __NULL__, file_size: __NULL__, file_type: __NULL__, label: ""Uses"", last_modified_date: __NULL__, ref_doc_id: __NULL__, triplet_source_id: ""d2968930-9650-4b9a-8e17-65aab1d4fbdc""}]->(""Document retriever"")","(""Document retriever"" :Props__{_node_content: __NULL__, _node_type: __NULL__, creation_date: __NULL__, doc_id: __NULL__, document_id: __NULL__, file_name: __NULL__, file_path: __NULL__, file_size: __NULL__, file_type: __NULL__, last_modified_date: __NULL__, ref_doc_id: __NULL__, triplet_source_id: ""d2968930-9650-4b9a-8e17-65aab1d4fbdc""} :Node__{label: ""entity""} :Entity__{name: ""Document retriever""})"
1,"(""Retrieval"" :Props__{_node_content: __NULL__, _node_type: __NULL__, creation_date: __NULL__, doc_id: __NULL__, document_id: __NULL__, file_name: __NULL__, file_path: __NULL__, file_size: __NULL__, file_type: __NULL__, last_modified_date: __NULL__, ref_doc_id: __NULL__, triplet_source_id: ""77a5817c-27b8-4eb6-af01-3e06553d9332""} :Node__{label: ""entity""} :Entity__{name: ""Retrieval""})","(""Retrieval"")-[:Relation__@0{_node_content: __NULL__, _node_type: __NULL__, creation_date: __NULL__, doc_id: __NULL__, document_id: __NULL__, file_name: __NULL__, file_path: __NULL__, file_size: __NULL__, file_type: __NULL__, label: ""Is"", last_modified_date: __NULL__, ref_doc_id: __NULL__, triplet_source_id: ""d2968930-9650-4b9a-8e17-65aab1d4fbdc""}]->(""Process"")","(""Process"" :Props__{_node_content: __NULL__, _node_type: __NULL__, creation_date: __NULL__, doc_id: __NULL__, document_id: __NULL__, file_name: __NULL__, file_path: __NULL__, file_size: __NULL__, file_type: __NULL__, last_modified_date: __NULL__, ref_doc_id: __NULL__, triplet_source_id: ""d2968930-9650-4b9a-8e17-65aab1d4fbdc""} :Node__{label: ""entity""} :Entity__{name: ""Process""})"
2,"(""Retrieval"" :Props__{_node_content: __NULL__, _node_type: __NULL__, creation_date: __NULL__, doc_id: __NULL__, document_id: __NULL__, file_name: __NULL__, file_path: __NULL__, file_size: __NULL__, file_type: __NULL__, last_modified_date: __NULL__, ref_doc_id: __NULL__, triplet_source_id: ""77a5817c-27b8-4eb6-af01-3e06553d9332""} :Node__{label: ""entity""} :Entity__{name: ""Retrieval""})","(""Retrieval"")-[:Relation__@0{_node_content: __NULL__, _node_type: __NULL__, creation_date: __NULL__, doc_id: __NULL__, document_id: __NULL__, file_name: __NULL__, file_path: __NULL__, file_size: __NULL__, file_type: __NULL__, label: ""Be"", last_modified_date: __NULL__, ref_doc_id: __NULL__, triplet_source_id: ""77a5817c-27b8-4eb6-af01-3e06553d9332""}]->(""Slow"")","(""Slow"" :Props__{_node_content: __NULL__, _node_type: __NULL__, creation_date: __NULL__, doc_id: __NULL__, document_id: __NULL__, file_name: __NULL__, file_path: __NULL__, file_size: __NULL__, file_type: __NULL__, last_modified_date: __NULL__, ref_doc_id: __NULL__, triplet_source_id: ""77a5817c-27b8-4eb6-af01-3e06553d9332""} :Node__{label: ""entity""} :Entity__{name: ""Slow""})"


In [53]:
TAGS = %ngql SHOW TAGS
%ngql SHOW TAGS

INFO:nebula3.logger:Get connection to ('34.55.198.242', 9669)
INFO:nebula3.logger:Get connection to ('34.55.198.242', 9669)


Unnamed: 0,Name
0,Chunk__
1,Entity__
2,Node__
3,Props__


In [54]:
EDGES = %ngql SHOW EDGES
%ngql SHOW EDGES

INFO:nebula3.logger:Get connection to ('34.55.198.242', 9669)
INFO:nebula3.logger:Get connection to ('34.55.198.242', 9669)


Unnamed: 0,Name
0,Relation__
1,__meta__node_label__
2,__meta__rel_label__


In [55]:
# Index the schema for querying
%ngql CREATE TAG INDEX IF NOT EXISTS entity_index on Entity__();
%ngql CREATE TAG INDEX IF NOT EXISTS props_index on Props__();
%ngql CREATE EDGE INDEX IF NOT EXISTS relation_index on Relation__();

INFO:nebula3.logger:Get connection to ('34.55.198.242', 9669)
INFO:nebula3.logger:Get connection to ('34.55.198.242', 9669)
INFO:nebula3.logger:Get connection to ('34.55.198.242', 9669)


In [56]:
%ngql MATCH q=(p:Node__:Entity__) RETURN p.Entity__.name LIMIT 2

INFO:nebula3.logger:Get connection to ('34.55.198.242', 9669)


Unnamed: 0,p.Entity__.name
0,Augment query
1,Heavy cost training runs


In [57]:
%ngql MATCH p=(v:Entity__)-[r]->(t:Entity__) RETURN v.Entity__.name AS src, r.label AS relation, t.Entity__.name AS dest LIMIT 15;

INFO:nebula3.logger:Get connection to ('34.55.198.242', 9669)


Unnamed: 0,src,relation,dest
0,Information,Fed into,Llm
1,Chunking,Break up,Data
2,Chunking,Find,Details
3,Chunking,Strategy,File format based
4,Chunking,Strategy,Fixed length
5,Chunking,Strategy,Overlap
6,Chunking,Involves,Strategies
7,Chunking,Strategy,Syntax based
8,Llms,Include,Hallucination
9,Performance,Can be improved,Centroid searches


In [128]:
%ng_draw

<class 'pyvis.network.Network'> |N|=16 |E|=15

### The rendered output should look like this:

![](./graph_output.png)

# Querying and Retrieval
1. Getting a simple graph from a query

In [58]:
subgraph_retriever = index.as_retriever(
    include_text=False,  # include source text in returned nodes, default True
)

In [None]:
from IPython.display import display, Markdown

nodes = subgraph_retriever.retrieve(
    "Tell me about chunking in retrieval augmented generation",
)
node_text = ""
for node in nodes:
    display(Markdown(f"{node.text}"))

Chunking -> Break up -> Data

Chunking -> Find -> Details

Chunking -> Strategy -> File format based

Chunking -> Strategy -> Fixed length

Chunking -> Strategy -> Overlap

Chunking -> Involves -> Strategies

Chunking -> Strategy -> Syntax based

Rag -> Eliminate -> Challenges

Rag -> Uses -> Information

Rag -> Grants -> Information retrieval capabilities

Rag -> Modifies -> Interactions

Rag -> Can be used on -> Semi-structured data

Rag -> Can be used on -> Structured data

Rag -> Is -> Technique

Rag -> Can be used on -> Unstructured data

Progressive data augmentation -> Use -> Methods

Augmentation modules -> Incorporates -> Augmentation

Augmentation modules -> Have abilities -> Expanding queries

Augmentation modules -> Use -> Memory and self-improvement

Retriever using inverse cloze task -> Pre-train -> Methods

2. Getting a NL query over the graph

In [52]:
query_engine = index.as_query_engine(
    streaming=True,
    include_text=True,  # include source text in returned nodes, default True
)

In [None]:
response = query_engine.query(
    "Tell me about chunking in retrieval augmented generation",
)
display(Markdown(response.print_response_stream()))

Chunking is a strategy used in retrieval augmented generation to break up data into smaller units called vectors. This allows the retriever to efficiently find relevant details within the data. There are several chunking strategies, including fixed length with overlap, syntax-based chunking, and file format-based chunking. 


<IPython.core.display.Markdown object>

#### Check on source nodes

In [55]:
response.source_nodes

[NodeWithScore(node=TextNode(id_='b5e73cba-d219-4f0a-9b04-32bf32af7d40', embedding=None, metadata={}, excluded_embed_metadata_keys=[], excluded_llm_metadata_keys=[], relationships={<NodeRelationship.SOURCE: '1'>: RelatedNodeInfo(node_id='75229858', node_type=<ObjectType.DOCUMENT: '4'>, metadata={}, hash='5878f62255c46891cdc11c5953435e70e6522322cec36216e0147761467a710e'), <NodeRelationship.PREVIOUS: '2'>: RelatedNodeInfo(node_id='e7d4e4b3-1434-4731-9709-ed170a9def49', node_type=<ObjectType.TEXT: '1'>, metadata={}, hash='7fe778aa2a357082c3de2557e5b78c8468a9bc0cffa0af79cae1d11da97ad31a')}, text="Here are some facts extracted from the provided text:\n\nChunking -> Break up -> Data\nChunking -> Find -> Details\nChunking -> Strategy -> File format based\nChunking -> Strategy -> Fixed length\nChunking -> Strategy -> Overlap\nChunking -> Involves -> Strategies\nChunking -> Strategy -> Syntax based\nRag -> Eliminate -> Challenges\nRag -> Includes -> Giving factual information\nRag -> Uses -> In

In [56]:
response.get_formatted_sources()

'> Source (Node id: b5e73cba-d219-4f0a-9b04-32bf32af7d40): Here are some facts extracted from the provided text:\n\nChunking -> Break up -> Data\nChunking -> F...\n\n> Source (Node id: 69b78dbd-e3ec-4600-a20b-541bc65f4b20): Here are some facts extracted from the provided text:\n\nProgressive data augmentation -> Use -> Me...\n\n> Source (Node id: b5e73cba-d219-4f0a-9b04-32bf32af7d40): Here are some facts extracted from the provided text:\n\nChunking -> Break up -> Data\nChunking -> F...\n\n> Source (Node id: 59d3a92e-b240-4dcb-9ea3-56133ef4ec61): Here are some facts extracted from the provided text:\n\nAugmentation modules -> Incorporates -> Au...'

3. Using a hybrid query engine

In [57]:
query_engine = index.as_query_engine(
    include_text=True,
    response_mode="tree_summarize",
    embedding_mode="hybrid",
    similarity_top_k=5,
)

In [None]:
response = query_engine.query(
    "Tell me about chunking in retrieval augmented generation",
)
display(Markdown(f"<b>{response}</b>"))

<b>Chunking is a strategy used in retrieval augmented generation to break up data into smaller units, called vectors. This allows the retriever to efficiently find specific details within the data. There are several chunking strategies, including fixed length with overlap, syntax-based chunking, and file format-based chunking. 
</b>

4. Use `TextToCypherRetriever` to generate queries

In [44]:
from llama_index.core.indices.property_graph import TextToCypherRetriever


DEFAULT_RESPONSE_TEMPLATE = (
    "Generated Cypher query:\n{query}\n\n" "Cypher Response:\n{response}"
)
DEFAULT_ALLOWED_FIELDS = ["text", "label", "type"]

DEFAULT_TEXT_TO_CYPHER_TEMPLATE = (index.property_graph_store.text_to_cypher_template,)


cypher_retriever = TextToCypherRetriever(
    index.property_graph_store,
    # customize the LLM, defaults to Settings.llm
    llm=Settings.llm,
    # customize the text-to-cypher template.
    # Requires `schema` and `question` template args
    text_to_cypher_template=index.property_graph_store.text_to_cypher_template,
    # customize how the cypher result is inserted into
    # a text node. Requires `query` and `response` template args
    response_template=DEFAULT_RESPONSE_TEMPLATE,
    # an optional callable that can clean/verify generated cypher
    cypher_validator=None,
    # allowed fields in the resulting
    allowed_output_field=DEFAULT_ALLOWED_FIELDS,
)

In [61]:
SCHEMA = index.property_graph_store.get_schema_str()

In [64]:
SCHEMA

'Node properties:\n\nRelationship properties:\n\nThe relationships:\n'

In [63]:
nodes = cypher_retriever.retrieve(
    index.property_graph_store.text_to_cypher_template.format(
        question="Tell me about chunking and how it applies to RAG",
        schema=str(SCHEMA),
    )
)

Exception: ('NebulaGraph query failed:', "SyntaxError: syntax error near ```cypher'", 'Statement:', '```cypher\nMATCH (c:Concept {name: "chunking"})<-[:HAS_CONCEPT]-(d:Document)<-[:DESCRIBES]-(r:Resource {name: "RAG"})\nRETURN c, d, r\n```', 'Params:', None)

In [52]:
for node in nodes:
    print(node)

Node ID: 753c3a00-4b9d-49cc-842f-3644dceeb50b
Text: Generated Cypher query: MATCH (m:Entity__ {name: "Guardians of
the Galaxy 3"})<-[:Relation__]-(p:Props__) RETURN p   Cypher Response:
[]
Score:  1.000



# Cleanup 
%ngql
CLEAR/DROP SPACE $space_name; 