# GraphRAG Python package
End-to-end example on research papers.

In [None]:
%%capture
%pip install fsspec langchain-text-splitters openai python-dotenv numpy torch neo4j-graphrag

In [1]:
from dotenv import load_dotenv
import os

# load neo4j credentials (and openai api key in background)
load_dotenv('.env', override=True)
NEO4J_URI = os.getenv('NEO4J_URI')
NEO4J_USERNAME = os.getenv('NEO4J_USERNAME')
NEO4J_PASSWORD = os.getenv('NEO4J_PASSWORD')
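
`os.getenv` returns `None` for a variable that is not set, so a missing entry in `.env` only surfaces later when the driver is created. As a minimal sketch (the helper name `missing_settings` is ours, not part of any library), we could check the three names read above before going further:

```python
import os

# Names match the variables read in the cell above.
REQUIRED_VARS = ("NEO4J_URI", "NEO4J_USERNAME", "NEO4J_PASSWORD")


def missing_settings(env, required=REQUIRED_VARS):
    """Return the required names that are absent or empty in `env`."""
    return [name for name in required if not env.get(name)]


# Report problems early instead of failing at connection time.
missing = missing_settings(os.environ)
if missing:
    print("Missing settings: " + ", ".join(missing))
```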

## Knowledge Graph Building


In [2]:
import neo4j
from neo4j_graphrag.llm import OpenAILLM
from neo4j_graphrag.embeddings.openai import OpenAIEmbeddings

driver = neo4j.GraphDatabase.driver(NEO4J_URI, auth=(NEO4J_USERNAME, NEO4J_PASSWORD))

llm = OpenAILLM(
    model_name="gpt-4o-mini",
    model_params={
        "response_format": {"type": "json_object"}, # use json_object formatting for best results
        "temperature": 0 # turning temperature down for more deterministic results
    }
)

# create text embedder
embedder = OpenAIEmbeddings()

In [3]:
# define node labels
basic_node_labels = ["Object", "Entity", "Group", "Person", "Organization", "Place"]

academic_node_labels = ["ArticleOrPaper", "PublicationOrJournal"]

medical_node_labels = ["Anatomy", "BiologicalProcess", "Cell", "CellularComponent", 
                       "CellType", "Condition", "Disease", "Drug",
                       "EffectOrPhenotype", "Exposure", "GeneOrProtein", "Molecule",
                       "MolecularFunction", "Pathway"]

node_labels = basic_node_labels + academic_node_labels + medical_node_labels

# define relationship types
rel_types = ["ACTIVATES", "AFFECTS", "ASSESSES", "ASSOCIATED_WITH", "AUTHORED",
    "BIOMARKER_FOR", "CAUSES", "CITES", "CONTRIBUTES_TO", "DESCRIBES", "EXPRESSES",
    "HAS_REACTION", "HAS_SYMPTOM", "INCLUDES", "INTERACTS_WITH", "PRESCRIBED",
    "PRODUCES", "RECEIVED", "RESULTS_IN", "TREATS", "USED_FOR"]

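Because these lists are typed by hand, a quick sanity check can catch duplicates or inconsistently cased names before they reach the prompt. The sketch below assumes the usual Neo4j naming conventions (PascalCase labels, UPPER_SNAKE_CASE relationship types); `check_schema` is a hypothetical helper, not part of the GraphRAG package.

```python
import re
from collections import Counter

LABEL_RE = re.compile(r"^[A-Z][A-Za-z0-9]*$")               # e.g. GeneOrProtein
REL_TYPE_RE = re.compile(r"^[A-Z][A-Z0-9]*(_[A-Z0-9]+)*$")  # e.g. HAS_SYMPTOM


def check_schema(labels, rel_types):
    """Return a list of human-readable problems; empty means the schema looks fine."""
    problems = []
    # Duplicates across both lists, reported once per name.
    for name, count in Counter(list(labels) + list(rel_types)).items():
        if count > 1:
            problems.append(f"duplicate: {name}")
    problems += [f"bad label: {l}" for l in labels if not LABEL_RE.match(l)]
    problems += [f"bad relationship type: {t}" for t in rel_types if not REL_TYPE_RE.match(t)]
    return problems


print(check_schema(["Person", "Drug"], ["TREATS", "HAS_SYMPTOM"]))
```

In the notebook this could be called as `check_schema(node_labels, rel_types)`.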

In [4]:
prompt_template = '''
You are a medical researcher tasked with extracting information from papers 
and structuring it in a property graph to inform further medical and research Q&A.

Extract the entities (nodes) and specify their type from the following Input text.
Also extract the relationships between these nodes. The relationship direction goes from the start node to the end node. 


Return result as JSON using the following format:
{{"nodes": [ {{"id": "0", "label": "the type of entity", "properties": {{"name": "name of entity", "details": "brief definition of the entity if available" }} }}],
  "relationships": [{{"type": "TYPE_OF_RELATIONSHIP", "start_node_id": "0", "end_node_id": "1", "properties": {{"details": "brief description of the relationship"}} }}] }}

- Use only the information from the Input text.  Do not add any additional information.  
- If the input text is empty, return empty JSON. 
- Make sure to create as many nodes and relationships as needed to offer rich medical context for further research.
- An AI knowledge assistant must be able to read this graph and immediately understand the context to inform detailed research questions. 
- Multiple documents will be ingested from different sources and we are using this property graph to connect information, so make sure entity types are fairly general. 

Use only the following nodes and relationships (if provided):
{schema}

Assign a unique ID (string) to each node, and reuse it to define relationships.
Do respect the source and target node types for relationship and
the relationship direction.

Do not return any additional information other than the JSON.

Examples:
{examples}

Input text:

{text}
'''
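
The doubled braces in the template suggest it is filled with Python `str.format`-style substitution, where `{{` and `}}` become literal braces and only `{schema}`, `{examples}` and `{text}` are replaced. A minimal sketch under that assumption, using a shortened stand-in template rather than the full prompt:

```python
# Shortened, hypothetical stand-in for the prompt template above.
mini_template = '''Return JSON like {{"nodes": [], "relationships": []}}.
Schema: {schema}
Examples: {examples}
Input text:
{text}
'''

# Doubled braces collapse to single literal braces; named fields are filled in.
filled = mini_template.format(
    schema="Drug, Disease, TREATS",
    examples="",
    text="Some input text.",
)
print(filled)
```

A single unescaped `{` or `}` in the JSON example would raise an error at format time, which is why the JSON skeleton in the prompt is written with doubled braces.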

In [5]:
from neo4j_graphrag.experimental.components.text_splitters.fixed_size_splitter import FixedSizeSplitter
from neo4j_graphrag.experimental.pipeline.kg_builder import SimpleKGPipeline

kg_builder_pdf = SimpleKGPipeline(
    llm=llm,
    driver=driver,
    text_splitter=FixedSizeSplitter(chunk_size=800, chunk_overlap=100),
    embedder=embedder,
    entities=node_labels,
    relations=rel_types,
    prompt_template=prompt_template,
    from_pdf=True
)
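
To make `chunk_size=800` and `chunk_overlap=100` concrete, here is a minimal character-based sketch of fixed-size splitting with overlap. It illustrates the idea only; `fixed_size_chunks` is our own function, and the package's `FixedSizeSplitter` may handle boundaries differently.

```python
def fixed_size_chunks(text, chunk_size=800, chunk_overlap=100):
    """Split text into windows of chunk_size characters that overlap by chunk_overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # each window starts 700 characters after the last
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


sample = "".join(str(i % 10) for i in range(2000))
chunks = fixed_size_chunks(sample)
print([len(c) for c in chunks])  # [800, 800, 600]
```

The overlap means the last 100 characters of one chunk reappear at the start of the next, so a sentence cut at a boundary is still seen whole in at least one chunk.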

In [6]:
pdf_file_paths = ['truncated-pdfs/biomolecules-11-00928-v2-trunc.pdf', 
             'truncated-pdfs/GAP-between-patients-and-clinicians_2023_Best-Practice-trunc.pdf', 
             'truncated-pdfs/pgpm-13-39-trunc.pdf']

for path in pdf_file_paths:
    print(f"Processing : {path}")
    pdf_result = await kg_builder_pdf.run_async(file_path=path)
    print(f"PDF Processing Result: {pdf_result}")

Processing : truncated-pdfs/biomolecules-11-00928-v2-trunc.pdf
PDF Processing Result: run_id='d3f5df80-fa1f-4b53-9f5d-6941fb9fe6e9' result={'resolver': {'number_of_nodes_to_resolve': 627, 'number_of_created_nodes': 517}}
Processing : truncated-pdfs/GAP-between-patients-and-clinicians_2023_Best-Practice-trunc.pdf
PDF Processing Result: run_id='6ab54670-cb80-4d78-b2cf-8f0c7a4d0168' result={'resolver': {'number_of_nodes_to_resolve': 872, 'number_of_created_nodes': 780}}
Processing : truncated-pdfs/pgpm-13-39-trunc.pdf
PDF Processing Result: run_id='40b90b40-9eb9-4272-b7ef-c384326117c8' result={'resolver': {'number_of_nodes_to_resolve': 1328, 'number_of_created_nodes': 1211}}
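
The loop above relies on top-level `await`, which notebooks support but plain Python scripts do not. As a sketch of how the same sequential loop could be driven from a script, the example below uses `asyncio.run` with a hypothetical stand-in coroutine (`fake_run_async`) in place of `kg_builder_pdf.run_async`, so it runs without a database or API key.

```python
import asyncio


async def fake_run_async(file_path):
    """Hypothetical stand-in for kg_builder_pdf.run_async; does no real work."""
    await asyncio.sleep(0)
    return f"processed {file_path}"


async def process_all(paths, run):
    """Process the files one after another, like the notebook loop."""
    results = []
    for path in paths:
        results.append(await run(file_path=path))
    return results


# asyncio.run creates an event loop, runs the coroutine, and closes the loop.
results = asyncio.run(process_all(["a.pdf", "b.pdf"], fake_run_async))
print(results)
```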


## Knowledge Graph Retrieval

We will leverage Neo4j's vector search capabilities here. To do this we need to begin by creating a vector index on the text in our Chunk nodes.

In [7]:
from neo4j_graphrag.indexes import create_vector_index

create_vector_index(driver, name="text_embeddings", label="Chunk",
                    embedding_property="embedding", dimensions=1536, similarity_fn="cosine")

Neo4jIndexError: Neo4j vector index creation failed: An equivalent index already exists, 'Index( id=3, name='text_embeddings', type='VECTOR', schema=(:Chunk {embedding}), indexProvider='vector-2.0' )'.

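The error above just reports that an equivalent index already exists, so nothing more is needed. Now that the index is set up, we will start simple with a VectorRetriever. The VectorRetriever just queries Chunk nodes, bringing back the text and some metadata.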

In [8]:
from neo4j_graphrag.retrievers import VectorRetriever

vector_retriever = VectorRetriever(
    driver,
    index_name="text_embeddings",
    embedder=embedder,
    return_properties=["text"],
)
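
Conceptually, the retriever embeds the query and returns the chunks whose embeddings are most similar under the index's cosine similarity. The sketch below shows that ranking as a brute-force scan over made-up 3-dimensional vectors; the real index stores 1536-dimensional embeddings and does not scan every node, so treat this as an illustration only.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec, chunks, k=3):
    """Return the k (score, text) pairs with the highest cosine similarity."""
    scored = [(cosine(query_vec, c["embedding"]), c["text"]) for c in chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Toy data: the texts and vectors are invented for illustration.
toy_chunks = [
    {"text": "chunk about biomarkers", "embedding": [0.9, 0.1, 0.0]},
    {"text": "chunk about journal metadata", "embedding": [0.0, 0.2, 0.9]},
    {"text": "chunk about precision medicine", "embedding": [0.8, 0.3, 0.1]},
]
hits = top_k([1.0, 0.2, 0.0], toy_chunks, k=2)
for score, text in hits:
    print(f"{score:.3f}  {text}")
```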

In [9]:
vector_res = vector_retriever.get_search_results(query_text = "How is precision medicine applied to Lupus?", 
                                                 top_k=3)
for record in vector_res.records:
    print("====\n" + '...' + record.data()['node']['text'][500:700] + '...')

====
...y amongst its clinical features and immunological abnormalities. In this review, we
attempt to capture the major immunological changes linked to the pathophysiology of lupus
and discuss the challenge ...
====
...efore predicting risk
of speci ﬁc organ damage, most adequate treatment, and
would allow better follow-up and ﬂare prediction.
In this review, we will outline the pathological processes
in lupus in ge...
====
...hritis, lupus
genetics, immunosuppression
Introduction
Systemic lupus erythematosus (SLE) is a chronic, complex, auto-immune disease of
unknown origin with multiorgan involvement. Knowledge of its epi...


The GraphRAG Python Package offers a whole host of other useful retrievers covering different patterns.

Below we will use the VectorCypherRetriever which allows you to run a graph traversal after finding text chunks.  We will use the Cypher Query language to define the logic to traverse the graph.  

As a simple starting point, let's traverse 1 to 2 hops out from the entities extracted from each chunk and textualize the different relationships we pick up. We will use something called a quantified path pattern to accomplish this.

In [10]:
from neo4j_graphrag.retrievers import VectorCypherRetriever

vc_retriever = VectorCypherRetriever(
    driver,
    index_name="text_embeddings",
    embedder=embedder,
    retrieval_query="""
// node = Chunk. Go out 1-2 hops in the entity graph (2-3 hops from the chunk)
MATCH (node)<-[:FROM_CHUNK]-()-[rl:!FROM_CHUNK]-{1,2}()
UNWIND rl AS r
WITH DISTINCT r

//Get the source document(s) from which each entity set was extracted (could be multiple)
MATCH (sourceDoc:Document)<-[:FROM_DOCUMENT]-()<-[:FROM_CHUNK]-(n)-[r]->(m)
WITH n,r,m, apoc.text.join(collect(DISTINCT sourceDoc.path), ', ') AS sources

// return textualized relations/triples
RETURN n.name + '(' + coalesce(n.details, '') + ')'+
   ' - ' + type(r) + '(' + coalesce(r.details, '') + ')' +  ' -> ' +
   m.name + '(' + coalesce(m.details, '') + ')' + ' [sourced from: ' + sources + ']' AS fact
"""
)
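
To illustrate what the quantified path pattern `{1,2}` collects, here is a plain-Python sketch over a toy edge list. It gathers every relationship that lies on an undirected path of one or two hops from a set of starting entities, which corresponds to the `UNWIND rl AS r` / `WITH DISTINCT r` step. The node and relationship names are placeholders, and this is not how Neo4j evaluates the query internally.

```python
def relationships_within(edges, start_nodes, max_hops=2):
    """Collect edges on undirected paths of 1..max_hops hops from start_nodes."""
    frontier = set(start_nodes)
    collected = set()
    for _ in range(max_hops):
        next_frontier = set()
        for src, rel, dst in edges:
            # Direction is ignored while traversing, as in the Cypher pattern.
            if src in frontier or dst in frontier:
                collected.add((src, rel, dst))
                next_frontier.update((src, dst))
        frontier = next_frontier
    return collected


# Toy entity graph: A-B, C-A, C-D, D-E.
toy_edges = [("A", "R1", "B"), ("C", "R2", "A"), ("C", "R3", "D"), ("D", "R4", "E")]
found = relationships_within(toy_edges, {"A"})
print(sorted(found))
```

Starting from `A`, the edge `D-E` is three hops away and is therefore left out.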



In [12]:
vc_res = vc_retriever.get_search_results(query_text = "How is precision medicine applied to Lupus?")

print(f"Retrieved {len(vc_res.records)} records. Previewing the first 50:\n")
for record in vc_res.records[:50]:
    print("====\n" + record.data()['fact'])

Retrieved 473 records. Previewing the first 50:

====
Towards Precision Medicine in Systemic Lupus Erythematosus(A review article discussing systemic lupus erythematosus and precision medicine.) - AUTHORED(Elliott Lever is an author of the article.) -> Elliott Lever(One of the authors of the article.) [sourced from: truncated-pdfs/pgpm-13-39-trunc.pdf]
====
Elliott Lever(One of the authors of the article.) - ASSOCIATED_WITH(Elliott Lever is associated with University College Hospital London.) -> University College Hospital London(An institution associated with one of the authors.) [sourced from: truncated-pdfs/pgpm-13-39-trunc.pdf]
====
Towards Precision Medicine in Systemic Lupus Erythematosus(A review article discussing systemic lupus erythematosus and precision medicine.) - AUTHORED(Marta R Alves is an author of the article.) -> Marta R Alves(One of the authors of the article.) [sourced from: truncated-pdfs/pgpm-13-39-trunc.pdf]
====
Marta R Alves(One of the authors of the article.)
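
The fact strings above follow the `RETURN` clause of the retrieval query: name and details of the start node, the relationship type and its details, the end node, then the source documents. A small Python sketch of the same formatting, using plain dicts as hypothetical stand-ins for nodes and relationships:

```python
def textualize(n, rel_type, r, m, sources):
    """Format one triple the way the retrieval query's RETURN clause does."""

    def part(name, props):
        # Mirrors coalesce(x.details, ''): missing details become an empty string.
        return f"{name}({props.get('details') or ''})"

    return (
        part(n["name"], n)
        + " - " + part(rel_type, r)
        + " -> " + part(m["name"], m)
        + " [sourced from: " + ", ".join(sources) + "]"
    )


# Toy values, invented for illustration.
fact = textualize(
    {"name": "Drug X", "details": "a toy drug"},
    "TREATS",
    {},
    {"name": "Disease Y"},
    ["doc.pdf"],
)
print(fact)
```

One difference to keep in mind: in Cypher, concatenating a null `name` makes the whole fact null, whereas this sketch assumes every node has a name.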

## GraphRAG 
You can construct GraphRAG pipelines with the `GraphRAG` class. At minimum, you will need to pass the constructor an LLM and a retriever. You can also pass a custom prompt template, but for now we will just use the default.
 

In [13]:

from neo4j_graphrag.llm import OpenAILLM as LLM
from neo4j_graphrag.generation.graphrag import GraphRAG

llm = LLM(model_name="gpt-4o",  model_params={"temperature": 0.0})

v_rag  = GraphRAG(llm=llm, retriever=vector_retriever)
vc_rag = GraphRAG(llm=llm, retriever=vc_retriever)
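
At a high level, a RAG pipeline retrieves context, places it in a prompt together with the question, and sends that prompt to the LLM. The sketch below shows only the prompt-assembly step with a hypothetical template of our own; it is not the package's default prompt.

```python
def build_rag_prompt(question, context_items):
    """Combine retrieved context and the user question into one prompt string."""
    context = "\n".join(f"- {item}" for item in context_items)
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


prompt = build_rag_prompt(
    "How is precision medicine applied to Lupus?",
    ["retrieved chunk or fact 1", "retrieved chunk or fact 2"],
)
print(prompt)
```

With the vector retriever the context items would be chunk texts; with the vector + Cypher retriever they would be the textualized facts, which is why the two pipelines can answer the same question differently.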

In [14]:
q = "How is precision medicine applied to Lupus?"
print(f"\n\nVector Response: \n{v_rag.search(q, retriever_config={'top_k':3}).answer}")
print(f"\n\nVector + Cypher Response: \n{vc_rag.search(q, retriever_config={'top_k':3}).answer}")



Vector Response: 
Precision medicine in systemic lupus erythematosus (SLE) involves a tailored approach to each patient based on their genetic and epigenetic characteristics, which influence disease pathophysiology and drug response. The goal is to optimally assess SLE patients, predict disease course, and determine treatment response at diagnosis. Ideally, each patient would undergo an initial evaluation to profile their disease, assess the main pathophysiologic pathways through biomarkers, predict the risk of specific organ damage, identify the most suitable treatment, and allow for better follow-up and flare prediction. However, the application of precision medicine in SLE is still in development, as the precise immunopathological abnormalities differ between various organs and systems in lupus patients.


Vector + Cypher Response: 
Precision medicine is applied to Systemic Lupus Erythematosus (SLE) by tailoring treatment approaches based on individual variability in genes, enviro

In [15]:
q = "Can you summarize systemic lupus erythematosus (SLE)? including common effects, biomarkers, treatments, and current challenges faced by Physicians and patients? Provide in list format."
print(f"\n\nVector Response: \n{v_rag.search(q, retriever_config={'top_k':3}).answer}")
print(f"\n\nVector + Cypher Response: \n{vc_rag.search(q, retriever_config={'top_k':3}).answer}")



Vector Response: 
Certainly! Here's a summary of systemic lupus erythematosus (SLE) including common effects, biomarkers, treatments, and current challenges faced by physicians and patients:

- **Common Effects:**
  - SLE is a systemic autoimmune disease characterized by immune system dysfunction.
  - It presents with a wide range of clinical manifestations, including renal, dermatological, neuropsychiatric, and cardiovascular symptoms.

- **Biomarkers:**
  - Clinical and immunological biomarkers are critical for diagnosing and monitoring disease activity in SLE.
  - Novel biomarkers discovered through "omics" research are being reviewed for their potential in improving diagnosis and assessment.

- **Treatments:**
  - A "treat-to-target" strategy is applied to SLE, aiming for remission and low disease activity.
  - Treatment focuses on preventing organ damage and managing symptoms.

- **Current Challenges:**
  - Reaching the targets of remission and low disease activity is not strong