# GraphRAG Python Package End-to-End Example

This notebook contains an end-to-end worked example using the [GraphRAG Python package](https://neo4j.com/docs/neo4j-graphrag-python/current/index.html) for Neo4j. It starts with unstructured documents (in this case PDFs) and progresses through knowledge graph construction, knowledge graph retriever design, and complete GraphRAG pipelines.

Research papers on Lupus are used as the data source. We design a couple of different retrievers based on different knowledge graph retrieval patterns.

For more details and explanations on each of the steps below, see the [corresponding blog post](https://neo4j.com/blog/graphrag-python-package/), which contains a full write-up, an in-depth comparison of the retrieval patterns, and additional learning resources.

## Pre-Requisites

1. __Create a Neo4j Database__: To work through this RAG example, you need a database for storing and retrieving data. There are many options for this. You can quickly start a free Neo4j Graph Database using [Neo4j AuraDB](https://neo4j.com/product/auradb/?ref=neo4j-home-hero). You can use __AuraDB Free__ or start an __AuraDB Professional (Pro) free trial__ for higher ingestion and retrieval performance. The Pro instances have a bit more RAM; we recommend them for the best user experience.
2. __Obtain an OpenAI Key__: This example requires an OpenAI API key to use language models and embedders. The cost should be minimal. If you do not yet have an OpenAI API key, you can [create an OpenAI account](https://platform.openai.com/signup) or [sign in](https://platform.openai.com/login). Next, navigate to the [API key page](https://platform.openai.com/account/api-keys) and click "Create new secret key". Naming the key is optional.
3. __Fill in Credentials__: Either copy the [`.env.template`](.env.template) file, name it `.env`, and fill in the appropriate credentials, or manually enter the credentials in the code cell below that loads them. You will need:
    1. The Neo4j URI, username, and password variables from when you created the database. If you created your database on AuraDB, they are in the file you downloaded.
    2. Your OpenAI API key.



## Setup

In [2]:
%%capture
%pip install fsspec langchain-text-splitters tiktoken openai python-dotenv numpy torch neo4j-graphrag

In [4]:
%%python --version

Python 3.10.12


In [None]:
%pip list

In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [10]:
from importlib.metadata import version

# neo4j_graphrag does not expose a __version__ attribute,
# so read the version from the installed package metadata instead
print(version("neo4j-graphrag"))

In [4]:
from dotenv import load_dotenv
import os

# load neo4j credentials (and openai api key in background).
load_dotenv('/content/drive/MyDrive/secrets/dot.env', override=True)
NEO4J_URI = os.getenv('NEO4J_URI')
NEO4J_USERNAME = os.getenv('NEO4J_USERNAME')
NEO4J_PASSWORD = os.getenv('NEO4J_PASSWORD')

#uncomment this line if you aren't using a .env file
# os.environ['OPENAI_API_KEY'] = 'copy_paste_the_openai_key_here'
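
Once the credentials are loaded, a quick check can confirm that nothing is missing before we connect to Neo4j or call OpenAI. The sketch below is only an illustration: the helper name `missing_credentials` is hypothetical, and the variable names are the ones this notebook reads.

```python
import os

# Environment variables this notebook relies on.
REQUIRED_VARS = ["NEO4J_URI", "NEO4J_USERNAME", "NEO4J_PASSWORD", "OPENAI_API_KEY"]

def missing_credentials(env=None):
    """Return the required variable names that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]

print("Missing credentials:", missing_credentials() or "none")
```

If the list is not empty, revisit the `.env` file or set the missing values manually before continuing.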

## Knowledge Graph Building

The `SimpleKGPipeline` class allows you to automatically build a knowledge graph with a few key inputs, including
- a driver to connect to Neo4j,
- an LLM for entity extraction, and
- an embedding model to create vectors on text chunks for similarity search.

There are also some optional inputs, such as node labels, relationship types, and a custom prompt template, which we will use to improve the quality of the knowledge graph. For full details on this, see [the blog](https://neo4j.com/blog/graphrag-python-package/).


In [7]:
import neo4j
from neo4j_graphrag.llm import OpenAILLM
from neo4j_graphrag.embeddings.openai import OpenAIEmbeddings

driver = neo4j.GraphDatabase.driver(NEO4J_URI, auth=(NEO4J_USERNAME, NEO4J_PASSWORD))

ex_llm=OpenAILLM(
    model_name="gpt-4o-mini",
    model_params={
        "response_format": {"type": "json_object"}, # use json_object formatting for best results
        "temperature": 0 # turning temperature down for more deterministic results
    }
)

# create the text embedder (the pipeline and retrievers below depend on it)
embedder = OpenAIEmbeddings()

TypeError: Client.__init__() got an unexpected keyword argument 'proxies'

In [None]:
#define node labels
basic_node_labels = ["Object", "Entity", "Group", "Person", "Organization", "Place"]

academic_node_labels = ["ArticleOrPaper", "PublicationOrJournal"]

medical_node_labels = ["Anatomy", "BiologicalProcess", "Cell", "CellularComponent",
                       "CellType", "Condition", "Disease", "Drug",
                       "EffectOrPhenotype", "Exposure", "GeneOrProtein", "Molecule",
                       "MolecularFunction", "Pathway"]

node_labels = basic_node_labels + academic_node_labels + medical_node_labels

# define relationship types
rel_types = ["ACTIVATES", "AFFECTS", "ASSESSES", "ASSOCIATED_WITH", "AUTHORED",
    "BIOMARKER_FOR", "CAUSES", "CITES", "CONTRIBUTES_TO", "DESCRIBES", "EXPRESSES",
    "HAS_REACTION", "HAS_SYMPTOM", "INCLUDES", "INTERACTS_WITH", "PRESCRIBED",
    "PRODUCES", "RECEIVED", "RESULTS_IN", "TREATS", "USED_FOR"]

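To make the role of these lists concrete, here is a minimal sketch of what restricting a result to an allowed schema looks like. It works on a dictionary shaped like the JSON format requested in the prompt template used in this notebook. The helper `filter_to_schema` and the sample data are hypothetical and not part of the package; the pipeline's own schema handling may differ.

```python
def filter_to_schema(extraction, allowed_labels, allowed_rel_types):
    """Keep only nodes and relationships that match the allowed schema."""
    nodes = [n for n in extraction.get("nodes", []) if n["label"] in allowed_labels]
    kept_ids = {n["id"] for n in nodes}
    relationships = [
        r for r in extraction.get("relationships", [])
        if r["type"] in allowed_rel_types
        and r["start_node_id"] in kept_ids
        and r["end_node_id"] in kept_ids
    ]
    return {"nodes": nodes, "relationships": relationships}

# Made-up sample in the same shape as the extraction JSON.
sample = {
    "nodes": [
        {"id": "0", "label": "Disease", "properties": {"name": "Example disease"}},
        {"id": "1", "label": "Drug", "properties": {"name": "Example drug"}},
        {"id": "2", "label": "Weather", "properties": {"name": "Rain"}},
    ],
    "relationships": [
        {"type": "TREATS", "start_node_id": "1", "end_node_id": "0", "properties": {}},
        {"type": "LIKES", "start_node_id": "2", "end_node_id": "0", "properties": {}},
    ],
}
filtered = filter_to_schema(sample, {"Disease", "Drug"}, {"TREATS"})
print(len(filtered["nodes"]), "nodes,", len(filtered["relationships"]), "relationships kept")
```
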

In [None]:
prompt_template = '''
You are a medical researcher tasked with extracting information from papers
and structuring it in a property graph to inform further medical and research Q&A.

Extract the entities (nodes) and specify their type from the following Input text.
Also extract the relationships between these nodes. The relationship direction goes from the start node to the end node.


Return result as JSON using the following format:
{{"nodes": [ {{"id": "0", "label": "the type of entity", "properties": {{"name": "name of entity" }} }}],
  "relationships": [{{"type": "TYPE_OF_RELATIONSHIP", "start_node_id": "0", "end_node_id": "1", "properties": {{"details": "Description of the relationship"}} }}] }}

- Use only the information from the Input text.  Do not add any additional information.
- If the input text is empty, return empty JSON.
- Make sure to create as many nodes and relationships as needed to offer rich medical context for further research.
- An AI knowledge assistant must be able to read this graph and immediately understand the context to inform detailed research questions.
- Multiple documents will be ingested from different sources and we are using this property graph to connect information, so make sure entity types are fairly general.

Use only the following nodes and relationships (if provided):
{schema}

Assign a unique ID (string) to each node, and reuse it to define relationships.
Do respect the source and target node types for relationship and
the relationship direction.

Do not return any additional information other than the JSON in it.

Examples:
{examples}

Input text:

{text}
'''
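
The doubled braces in the template are deliberate. Assuming the template is filled with Python's `str.format` (an assumption about the package internals), `{{` and `}}` come out as literal braces, while `{schema}`, `{examples}`, and `{text}` are substituted. A minimal sketch with a shortened, hypothetical template:

```python
# Shortened stand-in for the full prompt template above.
mini_template = (
    "Return result as JSON using the following format:\n"
    '{{"nodes": [], "relationships": []}}\n'
    "Schema: {schema}\n"
    "Input text: {text}\n"
)

# Doubled braces become single literal braces; named fields are replaced.
filled = mini_template.format(schema="Disease, Drug", text="Some input text.")
print(filled)
```

A single `{` around the JSON example would be read as a placeholder and raise an error when the template is formatted this way.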

In [None]:
from neo4j_graphrag.experimental.components.text_splitters.fixed_size_splitter import FixedSizeSplitter
from neo4j_graphrag.experimental.pipeline.kg_builder import SimpleKGPipeline

kg_builder_pdf = SimpleKGPipeline(
    llm=ex_llm,
    driver=driver,
    text_splitter=FixedSizeSplitter(chunk_size=500, chunk_overlap=100),
    embedder=embedder,
    entities=node_labels,
    relations=rel_types,
    prompt_template=prompt_template,
    from_pdf=True
)
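
To see what `chunk_size=500` and `chunk_overlap=100` imply, here is a minimal sketch of fixed-size splitting with overlap. It assumes character-based windows; the function `fixed_size_chunks` is hypothetical, and the package's `FixedSizeSplitter` may treat boundaries differently.

```python
def fixed_size_chunks(text, chunk_size, chunk_overlap):
    """Split text into character windows of chunk_size that overlap by chunk_overlap."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # each window starts this far after the previous one
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks

# 1,200 characters of filler text, split with the settings used above.
chunks = fixed_size_chunks("abcdefghij" * 120, chunk_size=500, chunk_overlap=100)
print([len(c) for c in chunks])
```

With these settings each new chunk starts 400 characters after the previous one, so neighboring chunks share 100 characters of context.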

Below, we run the `SimpleKGPipeline` to construct our knowledge graph from three PDF documents and store it in Neo4j.

In [None]:
pdf_file_paths = ['truncated-pdfs/biomolecules-11-00928-v2-trunc.pdf',
             'truncated-pdfs/GAP-between-patients-and-clinicians_2023_Best-Practice-trunc.pdf',
             'truncated-pdfs/pgpm-13-39-trunc.pdf']

for path in pdf_file_paths:
    print(f"Processing : {path}")
    pdf_result = await kg_builder_pdf.run_async(file_path=path)
    print(f"Result: {pdf_result}")

## Knowledge Graph Retrieval

We will leverage Neo4j's vector search capabilities here. To do this, we need to begin by creating a vector index on the text chunks from the PDFs, which are stored on `Chunk` nodes in our knowledge graph.

In [None]:
from neo4j_graphrag.indexes import create_vector_index

create_vector_index(driver, name="text_embeddings", label="Chunk",
                    embedding_property="embedding", dimensions=1536, similarity_fn="cosine")
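
For intuition on what the `cosine` similarity function does, the sketch below ranks a few toy vectors against a query vector. It is an illustration only: it uses three dimensions instead of 1536, compares against every vector rather than using an index, and Neo4j may scale its scores differently. The names `cosine_similarity`, `top_k`, and `toy_vectors` are hypothetical.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, vectors, k):
    """Return the k (name, score) pairs most similar to query_vec."""
    scored = [(name, cosine_similarity(query_vec, vec)) for name, vec in vectors.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]

toy_vectors = {
    "chunk_a": [1.0, 0.0, 0.0],
    "chunk_b": [0.9, 0.1, 0.0],
    "chunk_c": [0.0, 1.0, 0.0],
}
print(top_k([1.0, 0.0, 0.0], toy_vectors, k=2))
```
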

Now that the index is set up, we will start simple with a __VectorRetriever__.  The __VectorRetriever__ just queries `Chunk` nodes via vector search, bringing back the text and some metadata.

In [None]:
from neo4j_graphrag.retrievers import VectorRetriever

vector_retriever = VectorRetriever(
    driver,
    index_name="text_embeddings",
    embedder=embedder,
    return_properties=["text"],
)

Below we visualize the context we get back when submitting a search prompt.

In [None]:
import json

vector_res = vector_retriever.get_search_results(query_text = "How is precision medicine applied to Lupus?",
                                                 top_k=3)
for record in vector_res.records:
    print("====\n" + json.dumps(record.data(), indent=4))

The GraphRAG Python Package offers [a wide range of useful retrievers](https://neo4j.com/docs/neo4j-graphrag-python/current/user_guide_rag.html#retriever-configuration), each covering different knowledge graph retrieval patterns.

Below we will use the __`VectorCypherRetriever`__, which allows you to run a graph traversal after finding nodes with vector search.  This uses Cypher, Neo4j's graph query language, to define the logic for traversing the graph.

As a simple starting point, we'll traverse up to 3 hops out from each Chunk, capture the relationships encountered, and include them in the response alongside our text chunks.


In [None]:
from neo4j_graphrag.retrievers import VectorCypherRetriever

vc_retriever = VectorCypherRetriever(
    driver,
    index_name="text_embeddings",
    embedder=embedder,
    retrieval_query="""
//1) Go out 2-3 hops in the entity graph and get relationships
WITH node AS chunk
MATCH (chunk)<-[:FROM_CHUNK]-()-[relList:!FROM_CHUNK]-{1,2}()
UNWIND relList AS rel

//2) collect relationships and text chunks
WITH collect(DISTINCT chunk) AS chunks,
  collect(DISTINCT rel) AS rels

//3) format and return context
RETURN '=== text ===\n' + apoc.text.join([c in chunks | c.text], '\n---\n') + '\n\n=== kg_rels ===\n' +
  apoc.text.join([r in rels | startNode(r).name + ' - ' + type(r) + '(' + coalesce(r.details, '') + ')' +  ' -> ' + endNode(r).name ], '\n---\n') AS info
"""
)
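
The traversal in the retrieval query can be mimicked on a small in-memory graph. The sketch below is a plain-Python approximation of the pattern, not how Neo4j executes it: starting from the entities linked to a chunk by `FROM_CHUNK`, it follows any other relationship type in either direction for one or two hops and collects the distinct relationships it crosses. The function `rels_near_chunk` and the toy edges are hypothetical.

```python
def rels_near_chunk(edges, chunk, max_hops=2):
    """Collect relationships on paths of 1..max_hops hops from the chunk's entities.

    edges is a list of (start, rel_type, end) tuples. FROM_CHUNK links entities
    to chunks; other types are followed in either direction, and a relationship
    is never reused within a single path.
    """
    entities = [s for s, t, e in edges if t == "FROM_CHUNK" and e == chunk]
    found = set()
    frontier = [(entity, frozenset()) for entity in entities]
    for _ in range(max_hops):
        next_frontier = []
        for node, used in frontier:
            for idx, (start, rel_type, end) in enumerate(edges):
                if rel_type == "FROM_CHUNK" or idx in used or node not in (start, end):
                    continue
                found.add(idx)
                other = end if node == start else start
                next_frontier.append((other, used | {idx}))
        frontier = next_frontier
    return [edges[i] for i in sorted(found)]

toy_edges = [
    ("EntityA", "FROM_CHUNK", "chunk1"),
    ("EntityA", "HAS_SYMPTOM", "EntityB"),
    ("EntityC", "TREATS", "EntityA"),
    ("EntityC", "AFFECTS", "EntityD"),
    ("EntityD", "ASSOCIATED_WITH", "EntityE"),
]
for start, rel_type, end in rels_near_chunk(toy_edges, "chunk1"):
    print(f"{start} - {rel_type} -> {end}")
```

The last toy edge is three hops from `EntityA`, so it falls outside the `{1,2}` range and is not collected.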

Below we visualize the context we get back when submitting a search prompt.

In [None]:
vc_res = vc_retriever.get_search_results(query_text = "How is precision medicine applied to Lupus?", top_k=3)

# print output
kg_rel_pos = vc_res.records[0]['info'].find('\n\n=== kg_rels ===\n')
print("# Text Chunk Context:")
print(vc_res.records[0]['info'][:kg_rel_pos])
print("# KG Context From Relationships:")
print(vc_res.records[0]['info'][kg_rel_pos:])

## GraphRAG

You can construct GraphRAG pipelines with the `GraphRAG` class. At a minimum, you will need to pass the constructor an LLM and a retriever. You can optionally pass a custom prompt template. We will do so here just to provide a bit more guidance for the LLM to stick to information from our data source.

Below we create `GraphRAG` objects for both the vector and vector-cypher retrievers.

In [None]:
from neo4j_graphrag.llm import OpenAILLM as LLM
from neo4j_graphrag.generation import RagTemplate
from neo4j_graphrag.generation.graphrag import GraphRAG

llm = LLM(model_name="gpt-4o",  model_params={"temperature": 0.0})

rag_template = RagTemplate(template='''Answer the Question using the following Context. Only respond with information mentioned in the Context. Do not inject any speculative information not mentioned.

# Question:
{query_text}

# Context:
{context}

# Answer:
''', expected_inputs=['query_text', 'context'])

v_rag  = GraphRAG(llm=llm, retriever=vector_retriever, prompt_template=rag_template)
vc_rag = GraphRAG(llm=llm, retriever=vc_retriever, prompt_template=rag_template)
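
Conceptually, each of these objects retrieves context for a question, fills the prompt template, and sends the prompt to the LLM. The sketch below shows that flow with stand-in functions. It is not the package's implementation: `simple_rag`, `fake_retrieve`, and `fake_generate` are hypothetical names used only for illustration.

```python
def simple_rag(question, retrieve, generate, template, top_k=5):
    """Retrieve context, fill the prompt template, and ask the model."""
    context_items = retrieve(question, top_k)
    prompt = template.format(query_text=question, context="\n".join(context_items))
    return generate(prompt), prompt

# Stand-ins for a retriever and an LLM (hypothetical, for illustration only).
def fake_retrieve(question, top_k):
    corpus = ["context item 1", "context item 2", "context item 3"]
    return corpus[:top_k]

def fake_generate(prompt):
    return f"(model answer based on a {len(prompt)}-character prompt)"

toy_template = "# Question:\n{query_text}\n\n# Context:\n{context}\n\n# Answer:\n"
answer, prompt = simple_rag(
    "What is in the context?", fake_retrieve, fake_generate, toy_template, top_k=2
)
print(prompt)
print(answer)
```

Swapping the retriever changes only the context that reaches the prompt, which is why the two pipelines above can give different answers to the same question.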

Now we can run GraphRAG and examine the outputs.

In [None]:
q = "How is precision medicine applied to Lupus? Provide in list format."
print(f"Vector Response: \n{v_rag.search(q, retriever_config={'top_k':5}).answer}")
print("\n===========================\n")
print(f"Vector + Cypher Response: \n{vc_rag.search(q, retriever_config={'top_k':5}).answer}")

In [None]:
q = "Can you summarize systemic lupus erythematosus (SLE), including common effects, biomarkers, and treatments? Provide in detailed list format."

v_rag_result = v_rag.search(q, retriever_config={'top_k': 5}, return_context=True)
vc_rag_result = vc_rag.search(q, retriever_config={'top_k': 5}, return_context=True)

print(f"Vector Response: \n{v_rag_result.answer}")
print("\n===========================\n")
print(f"Vector + Cypher Response: \n{vc_rag_result.answer}")

In [None]:
import ast  # literal_eval parses the stringified dict without executing code

for i in v_rag_result.retriever_result.items:
    print(json.dumps(ast.literal_eval(i.content), indent=1))

In [None]:
vc_ls = vc_rag_result.retriever_result.items[0].content.split('\\n---\\n')
for i in vc_ls:
    if "biomarker" in i: print(i)

In [None]:
vc_ls = vc_rag_result.retriever_result.items[0].content.split('\\n---\\n')
for i in vc_ls:
    if "treat" in i: print(i)
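
Note that the separator in the two cells above is `'\\n---\\n'`, a literal backslash followed by `n`, rather than a real newline. This appears to be because each item's `content` is a stringified record, in which newlines show up in escaped form (an assumption about how the retriever formats results). A small sketch with a made-up string shows the effect:

```python
# A value containing real newlines, as a query might return it.
info = "first relationship\n---\nsecond relationship"

# Stringifying a container applies repr() to the value, which escapes newlines.
content = str({"info": info})

print(len(content.split("\n---\n")))    # real-newline separator: prints 1 (no split)
print(len(content.split("\\n---\\n")))  # escaped separator: prints 2
```
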

In [None]:
q = "Can you summarize systemic lupus erythematosus (SLE), including common effects, biomarkers, treatments, and current challenges faced by physicians and patients? Provide in list format with details for each item."
print(f"Vector Response: \n{v_rag.search(q, retriever_config={'top_k': 5}).answer}")
print("\n===========================\n")
print(f"Vector + Cypher Response: \n{vc_rag.search(q, retriever_config={'top_k': 5}).answer}")