# GraphRAG Python package
End-to-end example on research papers.

In [None]:
%%capture
%pip install fsspec langchain-text-splitters openai python-dotenv numpy torch neo4j-graphrag

In [1]:
from dotenv import load_dotenv
import os

# load Neo4j credentials (the OpenAI API key in .env is loaded into the environment as well)
load_dotenv('.env', override=True)
NEO4J_URI = os.getenv('NEO4J_URI')
NEO4J_USERNAME = os.getenv('NEO4J_USERNAME')
NEO4J_PASSWORD = os.getenv('NEO4J_PASSWORD')
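
If a variable is missing from `.env`, `os.getenv` returns `None` without complaint, and the problem only shows up later when the driver or the LLM client is first used. A minimal sketch of an early check follows. The helper `missing_env_vars` is hypothetical (not part of any library), and the list assumes the OpenAI client reads its key from `OPENAI_API_KEY`.

```python
import os

def missing_env_vars(names, env=None):
    """Return the names that are unset or empty in the given environment mapping."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]

# Assumed variable names: the three Neo4j ones come from the cell above,
# OPENAI_API_KEY is the variable the OpenAI client looks for by default.
REQUIRED = ["NEO4J_URI", "NEO4J_USERNAME", "NEO4J_PASSWORD", "OPENAI_API_KEY"]

# Example with an explicit mapping so the check is easy to try out
example_env = {"NEO4J_URI": "neo4j://localhost:7687", "NEO4J_USERNAME": "neo4j"}
print(missing_env_vars(REQUIRED, example_env))
```

Calling `missing_env_vars(REQUIRED)` with no second argument checks the real process environment after `load_dotenv` has run.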

## Knowledge Graph Building


In [2]:
import neo4j
from neo4j_graphrag.llm import OpenAILLM
from neo4j_graphrag.embeddings.openai import OpenAIEmbeddings

driver = neo4j.GraphDatabase.driver(NEO4J_URI, auth=(NEO4J_USERNAME, NEO4J_PASSWORD))

ex_llm=OpenAILLM(
    model_name="gpt-4o-mini",
    model_params={
        "response_format": {"type": "json_object"}, # use json_object formatting for best results
        "temperature": 0 # turning temperature down for more deterministic results
    }
)

# create text embedder
embedder = OpenAIEmbeddings()

In [3]:
# define node labels
basic_node_labels = ["Object", "Entity", "Group", "Person", "Organization", "Place"]

academic_node_labels = ["ArticleOrPaper", "PublicationOrJournal"]

medical_node_labels = ["Anatomy", "BiologicalProcess", "Cell", "CellularComponent", 
                       "CellType", "Condition", "Disease", "Drug",
                       "EffectOrPhenotype", "Exposure", "GeneOrProtein", "Molecule",
                       "MolecularFunction", "Pathway"]

node_labels = basic_node_labels + academic_node_labels + medical_node_labels

# define relationship types
rel_types = ["ACTIVATES", "AFFECTS", "ASSESSES", "ASSOCIATED_WITH", "AUTHORED",
    "BIOMARKER_FOR", "CAUSES", "CITES", "CONTRIBUTES_TO", "DESCRIBES", "EXPRESSES",
    "HAS_REACTION", "HAS_SYMPTOM", "INCLUDES", "INTERACTS_WITH", "PRESCRIBED",
    "PRODUCES", "RECEIVED", "RESULTS_IN", "TREATS", "USED_FOR"]

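Because the three label lists are concatenated by hand, a duplicate or a mis-cased name is easy to miss. The sketch below is one way to sanity-check a schema before it reaches the LLM. The function `schema_problems` and the casing rules it enforces (PascalCase labels, UPPER_SNAKE_CASE relationship types) are assumptions based on the naming used above, not a requirement of the package.

```python
import re

def schema_problems(labels, rel_types):
    """Collect duplicate names and names that break the casing conventions used above."""
    problems = []
    for group_name, names in (("label", labels), ("relationship type", rel_types)):
        seen = set()
        for name in names:
            if name in seen:
                problems.append(f"duplicate {group_name}: {name}")
            seen.add(name)
    # Labels are expected to look like "GeneOrProtein"
    problems += [f"label not PascalCase: {name}" for name in labels
                 if not re.fullmatch(r"[A-Z][A-Za-z0-9]*", name)]
    # Relationship types are expected to look like "ASSOCIATED_WITH"
    problems += [f"relationship type not UPPER_SNAKE_CASE: {name}" for name in rel_types
                 if not re.fullmatch(r"[A-Z][A-Z0-9]*(_[A-Z0-9]+)*", name)]
    return problems

# Toy input with one duplicate label and one mis-cased relationship type
print(schema_problems(["Disease", "Drug", "Drug"], ["TREATS", "has_symptom"]))
```

Running `schema_problems(node_labels, rel_types)` on the lists defined above should return an empty list.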

In [4]:
prompt_template = '''
You are a medical researcher tasked with extracting information from papers
and structuring it in a property graph to inform further medical and research Q&A.

Extract the entities (nodes) and specify their type from the following Input text.
Also extract the relationships between these nodes. The relationship direction goes from the start node to the end node.


Return result as JSON using the following format:
{{"nodes": [ {{"id": "0", "label": "the type of entity", "properties": {{"name": "name of entity" }} }}],
  "relationships": [{{"type": "TYPE_OF_RELATIONSHIP", "start_node_id": "0", "end_node_id": "1", "properties": {{"details": "Description of the relationship"}} }}] }}

- Use only the information from the Input text.  Do not add any additional information.  
- If the input text is empty, return empty JSON.
- Make sure to create as many nodes and relationships as needed to offer rich medical context for further research.
- An AI knowledge assistant must be able to read this graph and immediately understand the context to inform detailed research questions. 
- Multiple documents will be ingested from different sources and we are using this property graph to connect information, so make sure entity types are fairly general. 

Use only the following nodes and relationships (if provided):
{schema}

Assign a unique ID (string) to each node, and reuse it to define relationships.
Do respect the source and target node types for relationship and
the relationship direction.

Do not return any additional information other than the JSON.

Examples:
{examples}

Input text:

{text}
'''
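
The doubled braces in the JSON example matter: they suggest the template is filled with Python's `str.format`-style substitution, where `{{` and `}}` render as literal braces and `{schema}`, `{examples}`, and `{text}` are placeholders. The sketch below shows that mechanism on a shortened, hypothetical template. How the package renders the prompt internally is an assumption here.

```python
# A shortened stand-in for the prompt above (hypothetical, for illustration only)
mini_template = '''
Return result as JSON using the following format:
{{"nodes": [ {{"id": "0", "label": "Person"}} ]}}

Schema:
{schema}

Input text:
{text}
'''

# Doubled braces become single literal braces; named placeholders are replaced
rendered = mini_template.format(schema="Person, Disease", text="Lupus affects the kidneys.")
print(rendered)
```

A single unescaped `{` in the JSON example would instead raise an error when the template is formatted, which is why any literal JSON in a custom prompt needs the doubled form.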

In [45]:
from neo4j_graphrag.experimental.components.text_splitters.fixed_size_splitter import FixedSizeSplitter
# Alternative splitter (not used here): langchain_text_splitters.TokenTextSplitter(chunk_size=200, chunk_overlap=20)
from neo4j_graphrag.experimental.pipeline.kg_builder import SimpleKGPipeline

kg_builder_pdf = SimpleKGPipeline(
    llm=ex_llm,
    driver=driver,
    text_splitter=FixedSizeSplitter(chunk_size=500, chunk_overlap=100),
    embedder=embedder,
    entities=node_labels,
    relations=rel_types,
    prompt_template=prompt_template,
    from_pdf=True
)
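
To make the `chunk_size=500, chunk_overlap=100` settings concrete, here is a minimal sketch of fixed-size character chunking with overlap. It is a simplified illustration of the idea, not the `FixedSizeSplitter` implementation, and the function name `fixed_size_chunks` is hypothetical.

```python
def fixed_size_chunks(text, chunk_size=500, chunk_overlap=100):
    """Split text into character windows of chunk_size that overlap by chunk_overlap."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap  # each new chunk starts this far after the previous one
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the current window already reaches the end of the text
    return chunks

# Small example: windows of 4 characters that share 1 character with their neighbor
print(fixed_size_chunks("abcdefghij", chunk_size=4, chunk_overlap=1))
```

With the settings used in the pipeline, consecutive chunks start 400 characters apart, so the last 100 characters of one chunk reappear at the start of the next. The overlap reduces the chance that an entity mention is cut in half at a chunk boundary.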

In [6]:
pdf_file_paths = ['truncated-pdfs/biomolecules-11-00928-v2-trunc.pdf', 
             'truncated-pdfs/GAP-between-patients-and-clinicians_2023_Best-Practice-trunc.pdf', 
             'truncated-pdfs/pgpm-13-39-trunc.pdf']

for path in pdf_file_paths:
    print(f"Processing : {path}")
    pdf_result = await kg_builder_pdf.run_async(file_path=path)
    print(f"PDF Processing Result: {pdf_result}")

Processing : truncated-pdfs/biomolecules-11-00928-v2-trunc.pdf
PDF Processing Result: run_id='6f09563e-c0ba-44ce-bd7c-a8ce553f6d59' result={'resolver': {'number_of_nodes_to_resolve': 954, 'number_of_created_nodes': 732}}
Processing : truncated-pdfs/GAP-between-patients-and-clinicians_2023_Best-Practice-trunc.pdf
PDF Processing Result: run_id='1b000f65-2a5d-4c07-aeb6-1f3cd3df8ab8' result={'resolver': {'number_of_nodes_to_resolve': 1042, 'number_of_created_nodes': 968}}
Processing : truncated-pdfs/pgpm-13-39-trunc.pdf
PDF Processing Result: run_id='260c1760-a78d-45ab-9d37-1b78ac8778e1' result={'resolver': {'number_of_nodes_to_resolve': 1814, 'number_of_created_nodes': 1621}}


## Knowledge Graph Retrieval

We will leverage Neo4j's vector search capabilities here. To do this, we first need to create a vector index on the `embedding` property of our `Chunk` nodes.

In [7]:
from neo4j_graphrag.indexes import create_vector_index

create_vector_index(driver, name="text_embeddings", label="Chunk",
                    embedding_property="embedding", dimensions=1536, similarity_fn="cosine")

Neo4jIndexError: Neo4j vector index creation failed: An equivalent index already exists, 'Index( id=3, name='text_embeddings', type='VECTOR', schema=(:Chunk {embedding}), indexProvider='vector-2.0' )'.
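
The error above only means that an index with this name and schema was already created by an earlier run, so it is safe to continue. One way to make the step re-runnable is to issue the Cypher statement directly with `IF NOT EXISTS`. The sketch below builds that statement, assuming Neo4j 5.x vector index syntax. The helper `vector_index_statement` is hypothetical and interpolates identifiers into the query text, so it should only be given trusted names.

```python
def vector_index_statement(name, label, embedding_property, dimensions, similarity_fn):
    """Build a CREATE VECTOR INDEX ... IF NOT EXISTS statement (assumes Neo4j 5.x syntax)."""
    return (
        f"CREATE VECTOR INDEX {name} IF NOT EXISTS "
        f"FOR (n:{label}) ON n.{embedding_property} "
        "OPTIONS {indexConfig: {"
        f"`vector.dimensions`: {int(dimensions)}, "
        f"`vector.similarity_function`: '{similarity_fn}'"
        "}}"
    )

statement = vector_index_statement("text_embeddings", "Chunk", "embedding", 1536, "cosine")
print(statement)

# To run it against the database (not executed in this sketch):
# driver.execute_query(statement)
```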

Now that the index is set up, we will start simple with a VectorRetriever. The VectorRetriever just queries Chunk nodes, bringing back the text and some metadata.

In [8]:
from neo4j_graphrag.retrievers import VectorRetriever

vector_retriever = VectorRetriever(
    driver,
    index_name="text_embeddings",
    embedder=embedder,
    return_properties=["text"],
)
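
Conceptually, the retriever embeds the query text and asks the index for the chunks whose embeddings are most similar under the cosine function configured earlier. The sketch below shows that idea as a brute-force search over toy 3-dimensional vectors. It is an illustration only: the database uses an approximate nearest-neighbor index over 1536-dimensional embeddings, and the names and vectors here are made up.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, chunks, k=3):
    """Return the k (score, text) pairs with the highest cosine similarity."""
    scored = [(cosine_similarity(query_vec, vec), text) for text, vec in chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]

# Toy (text, embedding) pairs; real embeddings come from the embedder
toy_chunks = [
    ("precision medicine in SLE", [0.9, 0.1, 0.0]),
    ("conflict of interest statement", [0.0, 0.2, 0.9]),
    ("biomarkers for lupus", [0.7, 0.6, 0.1]),
]
for score, text in top_k([1.0, 0.2, 0.0], toy_chunks, k=2):
    print(f"{score:.3f}  {text}")
```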

In [11]:
vector_res = vector_retriever.get_search_results(query_text = "How is precision medicine applied to Lupus?", 
                                                 top_k=3)
for i in vector_res.records: print("====\n" + i.data()['node']['text'])

====
precise and systematic fashion as suggested here.
Future care will involve molecular diagnostics throughout
the patient timecourse to drive the least toxic combination
of therapies. Recent evidence suggests a paradigm shift is
on the way but it is hard to predict how fast it will come.
Disclosure
The authors report no con ﬂicts of interest in this work.
References
1. Lisnevskaia L, Murphy G, Isenberg DA. Systemic lupus
erythematosus. Lancet .2014 ;384:1878 –1888. doi:10.1016/S0140-
6736(14)60128
====
d IS agents.
Precision medicine consists of a tailored approach to
each patient, based on genetic and epigenetic singularities,
which in ﬂuence disease pathophysiology and drug
response. Precision medicine in SLE is trying to address
the need to assess SLE patients optimally, predict disease
course and treatment response at diagnosis. Ideally every
patient would undergo an initial evaluation that would
proﬁle his/her disease, assessing the main pathophysiolo-
gic pathway through bioma

The GraphRAG Python package offers a whole host of other useful retrievers covering different retrieval patterns.

Below we will use the VectorCypherRetriever, which allows you to run a graph traversal after finding text chunks. We will use the Cypher query language to define the traversal logic.

As a simple starting point, let's traverse 2-3 hops out from each chunk (one hop to an entity extracted from the chunk, then one or two hops through the entity graph) and textualize the relationships we pick up. We will use a quantified path pattern to accomplish this.

In [12]:
from neo4j_graphrag.retrievers import VectorCypherRetriever

vc_retriever = VectorCypherRetriever(
    driver,
    index_name="text_embeddings",
    embedder=embedder,
    retrieval_query="""
//node = Chunk.  Go out 2-3 hops in the entity graph 
MATCH (node)<-[:FROM_CHUNK]-()-[rl:!FROM_CHUNK]-{1,2}()
UNWIND rl AS r
WITH DISTINCT r

//Get the source document(s) from which each entity set was extracted (could be multiple)
MATCH (sourceDoc:Document)<-[:FROM_DOCUMENT]-()<-[:FROM_CHUNK]-(n)-[r]->(m)
WITH n,r,m, apoc.text.join(collect(DISTINCT sourceDoc.path), ', ') AS sources

// return textualized relations/triples
RETURN n.name +
   ' - ' + type(r) + '(' + coalesce(r.details, '') + ')' +  ' -> ' +
   m.name AS fact
"""
)
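
The quantified path pattern `-[rl:!FROM_CHUNK]-{1,2}` can be read as "follow one or two relationships of any type except `FROM_CHUNK`, in either direction, and collect them". The sketch below mimics that on an in-memory edge list and then renders each relationship in the same `start - TYPE(details) -> end` form as the `RETURN` clause. It is a plain-Python illustration with made-up placeholder data. The function `facts_within_hops` is hypothetical and does not reproduce Cypher's exact path-matching semantics.

```python
def facts_within_hops(seed_ids, edges, max_hops=2):
    """Collect distinct edges lying on undirected paths of 1..max_hops hops from the seeds."""
    frontier = set(seed_ids)   # nodes reachable so far
    seen = set()               # indexes of edges already collected
    collected = []             # edges in the order they were first reached
    for _ in range(max_hops):
        next_frontier = set()
        for index, (start, rel_type, details, end) in enumerate(edges):
            if start in frontier or end in frontier:  # direction is ignored, as in the query
                if index not in seen:
                    seen.add(index)
                    collected.append(edges[index])
                next_frontier.update((start, end))
        frontier = next_frontier
    # Textualize each edge; a missing description becomes an empty string (like coalesce)
    return [f"{s} - {t}({d or ''}) -> {e}" for s, t, d, e in collected]

# Placeholder edges: (start, type, details, end)
toy_edges = [
    ("A", "TREATS", "a treats b", "B"),
    ("B", "CAUSES", None, "C"),
    ("C", "AFFECTS", None, "D"),   # three hops from A, so it is not collected
    ("X", "CITES", None, "Y"),     # not connected to A at all
]
for fact in facts_within_hops({"A"}, toy_edges):
    print(fact)
```

In the retriever, the seeds correspond to the entities linked to a retrieved chunk through `FROM_CHUNK`, which is why the comment in the query counts 2-3 hops from the chunk itself.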



In [None]:
from neo4j_graphrag.retrievers import VectorCypherRetriever

vc_retriever = VectorCypherRetriever(
    driver,
    index_name="text_embeddings",
    embedder=embedder,
    retrieval_query="""
//node = Chunk.  Go out 2-3 hops in the entity graph 
MATCH  (sourceDoc:Document)<-[:FROM_DOCUMENT]-(node)<-[:FROM_CHUNK]-()-[rl:!FROM_CHUNK]-{1,2}()
UNWIND rl AS r
WITH DISTINCT r, apoc.text.join(collect(DISTINCT sourceDoc.path), ', ') AS sources

MATCH (n)-[r]->(m)

// return textualized relations/triples
RETURN n.name +
   ' - ' + type(r) + '(' + coalesce(r.details, '') + ')' +  ' -> ' +
   m.name AS fact
"""
)

In [13]:
vc_res = vc_retriever.get_search_results(query_text = "How is precision medicine applied to Lupus?")

print(f"Retrieved {len(vc_res.records)} records. Previewing the first 50:\n")
for i in vc_res.records[:50]: print("====\n" + i.data()['fact']) 

Retrieved 764 records. Previewing the first 50:

====
Systemic lupus erythematosus - AUTHORED(Published in) -> N. Engl. J. Med.
====
Lisnevskaia L - AUTHORED() -> Systemic lupus erythematosus
====
Murphy G - AUTHORED() -> Systemic lupus erythematosus
====
Isenberg DA - AUTHORED() -> Systemic lupus erythematosus
====
Systemic lupus erythematosus - CITES(Published in) -> Lancet
====
Systemic lupus erythematosus - CITES(Systemic lupus erythematosus is discussed in the Lancet publication.) -> Lancet
====
Systemic lupus erythematosus - ASSOCIATED_WITH(SLE is characterized by aberrant activity of the immune system) -> Aberrant activity of the immune system
====
Immunological biomarkers - USED_FOR(Immunological biomarkers could diagnose and monitor disease activity in SLE) -> Systemic lupus erythematosus
====
Novel SLE biomarkers - DESCRIBES(Novel SLE biomarkers have been discovered through omics research) -> Systemic lupus erythematosus
====
Lindblom et al. post-hoc analysis of BLISS-52 and 

## GraphRAG 
You can construct GraphRAG pipelines with the `GraphRAG` class. At minimum, you will need to pass the constructor an LLM and a retriever. You can also pass a custom prompt template, but for now we will just use the default.
 

In [27]:
from neo4j_graphrag.llm import OpenAILLM as LLM
from neo4j_graphrag.generation.graphrag import GraphRAG

llm = LLM(model_name="gpt-4o",  model_params={"temperature": 0.0})

v_rag  = GraphRAG(llm=llm, retriever=vector_retriever)
vc_rag = GraphRAG(llm=llm, retriever=vc_retriever)
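
At a high level, a pipeline like this retrieves context for the question, places it in a prompt, and passes the prompt to the LLM. The sketch below shows that flow with plain functions standing in for the retriever and the model. It is an illustration of the pattern only: the function names and the prompt wording are made up and do not reflect the default template or internals of the `GraphRAG` class.

```python
def answer_with_context(question, retrieve, generate, top_k=5):
    """Retrieve context items, build a grounded prompt, and return the model's answer."""
    context_items = retrieve(question, top_k)
    context = "\n".join(context_items)
    prompt = (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )
    return generate(prompt)

# Stand-ins so the flow can run without a database or an API key
def toy_retrieve(question, top_k):
    corpus = ["fact one", "fact two", "fact three"]
    return corpus[:top_k]

def toy_generate(prompt):
    return prompt  # echo the prompt so the assembled context is visible

print(answer_with_context("What is known?", toy_retrieve, toy_generate, top_k=2))
```

Swapping `toy_retrieve` for a vector-only or a vector-plus-traversal retriever changes only the context the model sees, which is the difference between `v_rag` and `vc_rag` above.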

In [23]:
q = "How is precision medicine applied to Lupus?"
print(f"Vector Response: \n{v_rag.search(q, retriever_config={'top_k':5}).answer}")
print("\n===========================\n")
print(f"Vector + Cypher Response: \n{vc_rag.search(q, retriever_config={'top_k':5}).answer}")

Vector Response: 
Precision medicine in systemic lupus erythematosus (SLE) involves a tailored approach to each patient, based on their genetic and epigenetic characteristics, which influence disease pathophysiology and drug response. This approach aims to optimally assess SLE patients, predict disease course, and determine treatment response at diagnosis. Ideally, every patient would undergo an initial evaluation to profile their disease, assessing the main pathophysiologic pathway through biomarkers. The goal is to minimize the use of potent medications and their side effects while proactively treating autoimmunity and preventing damage. Future care is expected to involve molecular diagnostics throughout the patient's time course to drive the least toxic combination of therapies.


Vector + Cypher Response: 
Precision medicine is applied to lupus, specifically systemic lupus erythematosus (SLE), by utilizing biomarkers to assess organ-specific involvement, which is crucial for precis

In [24]:
q = "Can you summarize systemic lupus erythematosus (SLE)? including common effects, biomarkers, treatments, and current challenges faced by Physicians and patients? provide in list format with details for each item."
print(f"Vector Response: \n{v_rag.search(q, retriever_config={'top_k': 5}).answer}")
print("\n===========================\n")
print(f"Vector + Cypher Response: \n{vc_rag.search(q, retriever_config={'top_k': 5}).answer}")

Vector Response: 
Certainly! Here's a summary of Systemic Lupus Erythematosus (SLE) including its common effects, biomarkers, treatments, and current challenges faced by physicians and patients:

1. **Common Effects:**
   - SLE imposes a significant burden on patients' lives, affecting their Health-Related Quality of Life (HRQoL).
   - It is characterized by a wide range of symptoms due to immune reactivity and inflammation in various organs.

2. **Biomarkers:**
   - SLE is diagnosed and classified based on clinical symptoms, signs, and laboratory biomarkers.
   - Common biomarkers reflect immune reactivity and inflammation, and are used to assess disease activity and therapeutic effects.
   - Ideal biomarkers should reflect the underlying pathophysiology, have high reliability, validity, predictive values, sensitivity, and specificity.

3. **Treatments:**
   - A "treat-to-target" strategy is applied, aiming for remission and low disease activity.
   - Despite this strategy, achieving 

In [42]:
q = "Can you summarize systemic lupus erythematosus (SLE)? including common effects, biomarkers, and treatments? Provide in detailed list format"
print(f"Vector Response: \n{v_rag.search(q, retriever_config={'top_k': 5}).answer}")
print("\n===========================\n")
print(f"Vector + Cypher Response: \n{vc_rag.search(q, retriever_config={'top_k': 5}).answer}")

Vector Response: 
Certainly! Here's a detailed summary of systemic lupus erythematosus (SLE), including its common effects, biomarkers, and treatments:

### Common Effects of SLE:
1. **Immune System Dysfunction**: SLE is characterized by aberrant activity of the immune system.
2. **Clinical Manifestations**:
   - **Renal Symptoms**: Kidney involvement is common, leading to lupus nephritis.
   - **Dermatological Symptoms**: Skin rashes, including the classic butterfly rash on the face.
   - **Neuropsychiatric Symptoms**: Cognitive dysfunction, headaches, and mood disorders.
   - **Cardiovascular Symptoms**: Increased risk of heart disease and inflammation of the heart or blood vessels.

### Biomarkers for SLE:
1. **Clinical and Immunological Biomarkers**:
   - Used for diagnosing and monitoring disease activity.
   - Reflect immune reactivity and inflammation in various organs.
2. **Common Biomarkers**:
   - **Anti-dsDNA**: Antibodies against double-stranded DNA.
   - **Anti-SSA**: Anti

In [43]:
q = "Can you summarize systemic lupus erythematosus (SLE)? including common effects, biomarkers, treatments, and current challenges faced by Physicians and patients? provide in list format with details for each item."
print(f"Vector Response: \n{v_rag.search(q, retriever_config={'top_k': 5}).answer}")
print("\n===========================\n")
print(f"Vector + Cypher Response: \n{vc_rag.search(q, retriever_config={'top_k': 5}).answer}")

Vector Response: 
Certainly! Here's a summary of systemic lupus erythematosus (SLE) including its common effects, biomarkers, treatments, and current challenges faced by physicians and patients:

1. **Common Effects:**
   - SLE is a systemic autoimmune disease characterized by immune system dysfunction.
   - It presents with a wide range of clinical manifestations, including renal, dermatological, neuropsychiatric, and cardiovascular symptoms.
   - The disease is chronic, inflammatory, and involves multiple organs with varying severity.
   - SLE has an unpredictable relapsing and remitting course.

2. **Biomarkers:**
   - Clinical and immunological biomarkers are critical for diagnosing and monitoring disease activity in SLE.
   - These biomarkers help assess pathophysiological processes and control the disease, with or without organ-specific injury.
   - Novel biomarkers discovered through "omics" research are being reviewed for their potential use in SLE management.

3. **Treatments: