# GraphRAG Python
End-to-End Example

In [None]:
%%capture
%pip install fsspec langchain-text-splitters openai python-dotenv numpy torch

In [1]:
%%capture
%pip install -U git+https://github.com/neo4j/neo4j-graphrag-python

In [1]:
#%capture
#%pip install -U ../demo/neo4j-graphrag-python

In [2]:
from dotenv import load_dotenv
import os

# load neo4j credentials (and openai api key in background)
load_dotenv('.env', override=True)
NEO4J_URI = os.getenv('NEO4J_URI')
NEO4J_USERNAME = os.getenv('NEO4J_USERNAME')
NEO4J_PASSWORD = os.getenv('NEO4J_PASSWORD')
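`os.getenv` returns `None` for any variable missing from `.env`, and that usually surfaces only later as a confusing connection error. Below is a minimal sketch of an early check. `require_env` is a hypothetical helper written for this notebook, not part of python-dotenv, and the values in the example are placeholders.

```python
import os

def require_env(names, env=None):
    """Return {name: value} for the requested variables; raise if any is unset or empty."""
    env = os.environ if env is None else env
    missing = [n for n in names if not env.get(n)]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    return {n: env[n] for n in names}

# Example with an explicit mapping (placeholder values) instead of the real environment
settings = require_env(
    ["NEO4J_URI", "NEO4J_USERNAME", "NEO4J_PASSWORD"],
    env={
        "NEO4J_URI": "neo4j://localhost:7687",
        "NEO4J_USERNAME": "neo4j",
        "NEO4J_PASSWORD": "placeholder-password",
    },
)
print(settings["NEO4J_URI"])
```

In the notebook you would call `require_env([...])` without the `env` argument, right after `load_dotenv`.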

## KG Building


In [3]:
import neo4j
from neo4j_graphrag.experimental.components.embedder import TextChunkEmbedder
from neo4j_graphrag.llm import OpenAILLM
from neo4j_graphrag.embeddings.openai import OpenAIEmbeddings

driver = neo4j.GraphDatabase.driver(NEO4J_URI, auth=(NEO4J_USERNAME, NEO4J_PASSWORD))

extractor_llm=OpenAILLM(
    model_name="gpt-4o-mini",
    model_params={
        "response_format": {"type": "json_object"}, # use json_object formatting for best results
        "temperature": 0 # turning temperature down for more deterministic results
    }
)

#create text embedder
embedder = OpenAIEmbeddings()

In [4]:
#define node labels
basic_node_labels = ["Object",
                     "Entity",
                     "Group",
                     "Person",
                     "Organization",
                     "Place"]

academic_node_labels = ["ArticleOrPaper",
                        "PublicationOrJournal"]

medical_node_labels = ["Anatomy",
                       "BiologicalProcess",
                       "Cell",
                       "CellularComponent",
                       "CellType",
                       "Condition",
                       "Disease",
                       "Drug",
                       "EffectOrPhenotype",
                       "Exposure",
                       "GeneOrProtein",
                       "Molecule",
                       "MolecularFunction",
                       "Pathway"]

node_labels = basic_node_labels + academic_node_labels + medical_node_labels

# define relationship types
rel_types = ["ACTIVATES",
             "AFFECTS",
             "ASSESSES",
             "ASSOCIATED_WITH",
             "AUTHORED",
             "CAUSES",
             "CITES",
             "CLASSIFIES",
             "COLLABORATES_WITH",
             "CONTRIBUTES_TO",
             "CORRELATES_WITH",
             "DESCRIBES",
             "DEVELOPED",
             "DISCUSSES",
             "EXHIBITS",
             "EXPRESSES",
             "HAS_EFFECT",
             "HAS_SYMPTOM",
             "INCLUDES",
             "INDUCES",
             "INTERACTS_WITH",
             "INVOLVES",
             "LEADS_TO",
             "LINKED_TO",
             "LOCATED_IN",
             "MANIFESTS_AS",
             "OBSERVED_IN",
             "PARTICIPATES_IN",
             "PART_OF",
             "PRODUCES",
             "PUBLISHED_IN",
             "REACTS_WITH",
             "REDUCES",
             "RELATED_TO",
             "RESULTS_IN",
             "TARGETS",
             "TREATMENT_FOR",
             "TRIGGERS",
             "USED_FOR",
             "USED_WITH",
             "USES"]
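Long literal lists like these are easy to mistype, and Python silently concatenates adjacent string literals when a comma is missing between them. The sketch below is one way to sanity-check a schema list before passing it to the pipeline. `check_rel_types` and the `max_len=20` threshold are assumptions made for illustration, not part of the GraphRAG package.

```python
import re

# Relationship types are expected to look like PART_OF or USES
LABEL_PATTERN = re.compile(r"^[A-Z][A-Z_]*[A-Z]$")

def check_rel_types(rel_types, max_len=20):
    """Return a list of human-readable problems found in a relationship-type list."""
    problems = []
    seen = set()
    for rel_type in rel_types:
        if rel_type in seen:
            problems.append(f"duplicate: {rel_type}")
        seen.add(rel_type)
        if not LABEL_PATTERN.match(rel_type):
            problems.append(f"not UPPER_SNAKE_CASE: {rel_type}")
        if len(rel_type) > max_len:
            problems.append(f"suspiciously long (missing comma?): {rel_type}")
    return problems

# Toy list with a missing comma after "COLLABORATES_WITH" and a repeated entry
sample = ["CAUSES", "COLLABORATES_WITH" "CONTRIBUTES_TO", "CAUSES"]
problems = check_rel_types(sample)
for problem in problems:
    print(problem)
```

An empty result from `check_rel_types(rel_types)` is a cheap signal that the list has the number of entries you intended.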


In [5]:
prompt_template = '''
You are a medical researcher tasked with extracting information from papers
and structuring it in a property graph to inform further medical and research Q&A.

Extract the entities (nodes) and specify their type from the following Input text.
Also extract the relationships between these nodes. The relationship direction goes from the start node to the end node.


Return result as JSON using the following format:
{{"nodes": [ {{"id": "0", "label": "the type of entity", "properties": {{"name": "name of entity", "details": "brief description of entity (don't include info about relationships)" }} }}],
  "relationships": [{{"type": "TYPE_OF_RELATIONSHIP", "start_node_id": "0", "end_node_id": "1", "properties": {{"details": "brief description of relationship if needed"}} }}] }}

- Use only the information from the Input text. Do not add any additional information.
- If the input text is empty, return empty JSON.
- Make sure to create as many nodes and relationships as needed to offer rich medical context for further research.
- An AI knowledge assistant must be able to read this graph and immediately understand the context to inform detailed research questions. 
- Multiple documents will be ingested from different sources and we are using this property graph to connect information, so make sure entity types are fairly general. 

Use only the following nodes and relationships (if provided):
{schema}

Assign a unique ID (string) to each node, and reuse it to define relationships.
Respect the source and target node types for each relationship, as well as
the relationship direction.

Do not return any additional information other than the JSON.

Examples:
{examples}

Input text:

{text}
'''
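The doubled braces in the template matter. Assuming the pipeline fills `{schema}`, `{examples}`, and `{text}` with `str.format`-style substitution (which the `{{ ... }}` escaping suggests), single braces mark placeholders and doubled braces come out as literal JSON braces. A minimal sketch with a shortened, hypothetical template:

```python
# Shortened stand-in for the real prompt; only the brace handling is of interest here
mini_template = '''Return JSON in this format:
{{"nodes": [{{"id": "0", "label": "Person"}}]}}

Schema:
{schema}

Input text:
{text}
'''

prompt = mini_template.format(
    schema="Person, Disease",
    text="Toy input sentence.",
)
print(prompt)
```

If you add your own JSON examples to the template, remember to double every literal brace. Otherwise formatting fails with a `KeyError` or `ValueError`.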

In [6]:
from langchain_text_splitters import CharacterTextSplitter
from neo4j_graphrag.experimental.components.text_splitters.langchain import LangChainTextSplitterAdapter
from neo4j_graphrag.experimental.pipeline.kg_builder import SimpleKGPipeline

kg_builder_pdf = SimpleKGPipeline(
    llm=extractor_llm,
    driver=driver,
    embedder=embedder,
    entities=node_labels,
    relations=rel_types,
    prompt_template=prompt_template,
    #text_splitter=LangChainTextSplitterAdapter(CharacterTextSplitter(chunk_size=500, chunk_overlap=100, separator=".")),
    from_pdf=True
)
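The commented-out `text_splitter` argument controls how each document is cut into chunks before extraction. To build intuition for `chunk_size` and `chunk_overlap`, here is a simplified sliding-window sketch. It is not how `CharacterTextSplitter` is implemented (that splitter works on a separator and then merges pieces); it only illustrates what the two parameters mean.

```python
def split_with_overlap(text, chunk_size=500, chunk_overlap=100):
    """Cut text into windows of chunk_size characters that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
    return chunks

chunks = split_with_overlap("abcdefghij", chunk_size=4, chunk_overlap=1)
print(chunks)  # ['abcd', 'defg', 'ghij']
```

Larger overlaps reduce the chance that an entity and its relationship are split across two chunks, at the cost of more LLM calls.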

In [8]:
pdf_file_paths = ['truncated-pdfs/biomolecules-11-00928-v2-trunc.pdf', 
             'truncated-pdfs/GAP-between-patients-and-clinicians_2023_Best-Practice-trunc.pdf', 
             'truncated-pdfs/pgpm-13-39-trunc.pdf']

for path in pdf_file_paths:
    print(f"Processing : {path}")
    pdf_result = await kg_builder_pdf.run_async(file_path=path)
    print(f"PDF Processing Result: {pdf_result}")

Processing : truncated-pdfs/biomolecules-11-00928-v2-trunc.pdf




PDF Processing Result: run_id='59d8838f-2032-40e9-9362-6fe145384700' result={'resolver': {'number_of_nodes_to_resolve': 117, 'number_of_created_nodes': 110}}
Processing : truncated-pdfs/GAP-between-patients-and-clinicians_2023_Best-Practice-trunc.pdf




PDF Processing Result: run_id='3263a97b-861b-4fb1-be8b-13bd22f77244' result={'resolver': {'number_of_nodes_to_resolve': 194, 'number_of_created_nodes': 179}}
Processing : truncated-pdfs/pgpm-13-39-trunc.pdf




PDF Processing Result: run_id='df3a182b-a2f0-4429-837c-f7a4449976df' result={'resolver': {'number_of_nodes_to_resolve': 303, 'number_of_created_nodes': 273}}
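Each result reports what the entity resolution step did. Assuming `number_of_created_nodes` is the node count left after resolution, the difference between the two numbers is how many duplicate entities were merged away. A small sketch using the figures printed above:

```python
# Resolver statistics copied from the run output above
resolver_results = {
    "biomolecules-11-00928-v2-trunc.pdf": {
        "number_of_nodes_to_resolve": 117, "number_of_created_nodes": 110},
    "GAP-between-patients-and-clinicians_2023_Best-Practice-trunc.pdf": {
        "number_of_nodes_to_resolve": 194, "number_of_created_nodes": 179},
    "pgpm-13-39-trunc.pdf": {
        "number_of_nodes_to_resolve": 303, "number_of_created_nodes": 273},
}

merged = {
    name: stats["number_of_nodes_to_resolve"] - stats["number_of_created_nodes"]
    for name, stats in resolver_results.items()
}
for name, count in merged.items():
    print(f"{name}: {count} nodes merged")
```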


## KG Retrieval
Now let's build some knowledge graph retrievers, which we will later use in a GraphRAG pipeline.

We will leverage Neo4j's vector search capabilities here. To do this, we first need to create a vector index on the text embeddings stored on our Chunk nodes.

In [9]:
from neo4j_graphrag.indexes import create_vector_index

create_vector_index(driver, name="text_embeddings", label="Chunk", embedding_property="embedding", dimensions=1536, similarity_fn="cosine")

Neo4jIndexError: Neo4j vector index creation failed: An equivalent index already exists, 'Index( id=3, name='text_embeddings', type='VECTOR', schema=(:Chunk {embedding}), indexProvider='vector-2.0' )'.
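The error above only means the index was already created by an earlier run, so it is safe to continue. If you prefer a cell that can be re-run without raising, one option is to issue the index statement yourself with `IF NOT EXISTS`. The sketch below just builds that Cypher string. `vector_index_statement` is a hypothetical helper, and the statement follows the Neo4j 5 vector index syntax as I understand it, so check it against your server version.

```python
import re

_IDENTIFIER = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")

def vector_index_statement(name, label, embedding_property, dimensions, similarity_fn):
    """Build an idempotent CREATE VECTOR INDEX statement (assumed Neo4j 5 syntax)."""
    for value in (name, label, embedding_property):
        if not _IDENTIFIER.match(value):
            raise ValueError(f"unsafe identifier: {value!r}")
    if similarity_fn not in ("cosine", "euclidean"):
        raise ValueError(f"unsupported similarity function: {similarity_fn!r}")
    return (
        f"CREATE VECTOR INDEX {name} IF NOT EXISTS "
        f"FOR (n:{label}) ON n.{embedding_property} "
        "OPTIONS {indexConfig: {"
        f"`vector.dimensions`: {int(dimensions)}, "
        f"`vector.similarity_function`: '{similarity_fn}'"
        "}}"
    )

statement = vector_index_statement("text_embeddings", "Chunk", "embedding", 1536, "cosine")
print(statement)
# In the notebook this could be sent with the driver, e.g. driver.execute_query(statement)
```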

Now that the index is set up, we will start simple with a VectorRetriever. The VectorRetriever just queries Chunk nodes, bringing back the text and some metadata.

In [10]:
from neo4j_graphrag.retrievers import VectorRetriever

from neo4j_graphrag.embeddings.openai import OpenAIEmbeddings

embedder = OpenAIEmbeddings()

vector_retriever = VectorRetriever(
    driver,
    index_name="text_embeddings",
    embedder=embedder,
    return_properties=["text"],
)
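Conceptually, the retriever embeds the query text and asks the index for the chunks whose embeddings are most similar to it. The brute-force sketch below shows that idea with cosine similarity on toy 3-dimensional vectors. It is an illustration only: the real index uses approximate nearest-neighbour search over 1536-dimensional embeddings, and `top_k` here is a hypothetical helper.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, chunks, k=3):
    """Return the k (score, text) pairs with the highest similarity to query_vec."""
    scored = [(cosine_similarity(query_vec, vec), text) for text, vec in chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]

# Toy "embeddings"; real ones come from the embedder
toy_chunks = [
    ("chunk about renal biomarkers", [1.0, 0.0, 0.0]),
    ("chunk about skin lesions", [0.0, 1.0, 0.0]),
    ("chunk about kidney damage", [0.9, 0.1, 0.0]),
]
results = top_k([1.0, 0.0, 0.0], toy_chunks, k=2)
for score, text in results:
    print(f"{score:.3f}  {text}")
```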

In [11]:
vector_res = vector_retriever.get_search_results(query_text = "Which biomarkers are associated with predicting organ damage in lupus patients?")

for i in vector_res.records: print("====\n" + i.data()['node']['text'])

====
lar
adhesion molecules 1; VGLL-3: vestigial-like family member 3.
5.1. Biomarkers in Lupus Nephritis (LN)
Renal biopsy is the gold standard for diagnosing, classifying, and prognosing LN.
However, it cannot be widely employed due to certain disadvantages, including it being
an intrinsically invasive procedure, the risk of bleeding, and the possibility of sampling
error [ 87]. Furthermore, a 10% to 20% misclassiﬁcation risk may occur when conducting
a ﬁne needle percutaneous renal biopsy because of the possibility of not being able to
penetrate the pathological location of renal or pathological error analysis [ 88]. In addition,
serial biopsies cannot be conducted due to the invasive nature and potential complications
associated with the procedure [ 89,90]. For these reasons, routine renal biopsy has been
considered controversial and a question has been raised about whether it is absolutely
required to diagnose LN [91].
5.1.1. Serum Anti-dsDNA Antibodies
Anti-dsDNA antibodies are b

The GraphRAG Python package offers a whole host of other useful retrievers covering different GraphRAG patterns (text2cypher, vector and/or full text + Cypher template, etc.), and if none of those fit perfectly, you can implement your own custom retriever.

Below we will use the VectorCypherRetriever, which allows you to run a graph traversal after finding text chunks. We will use the Cypher query language to define the traversal logic.

As a simple starting point, let's traverse up to 2 hops out from each chunk and textualize the different relationships we pick up. We will use a quantified path pattern to accomplish this.
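A quantified path pattern such as `-[r]-{1,2}` matches paths made of one or two relationships. To make that concrete, here is a plain-Python sketch that collects every relationship within two hops of a set of starting entities, ignoring direction. The graph, entity names, and `relationships_within` are toy assumptions; in the notebook the database does this work.

```python
from collections import defaultdict

def relationships_within(edges, start_nodes, max_hops=2):
    """Collect every edge on some path of length 1..max_hops from a start node (undirected)."""
    adjacency = defaultdict(list)
    for edge in edges:
        src, _, dst = edge
        adjacency[src].append((dst, edge))
        adjacency[dst].append((src, edge))
    found = set()
    frontier = set(start_nodes)
    for _ in range(max_hops):
        next_frontier = set()
        for node in frontier:
            for neighbor, edge in adjacency[node]:
                found.add(edge)
                next_frontier.add(neighbor)
        frontier = next_frontier
    return found

# Toy (start, TYPE, end) triples
toy_edges = [
    ("A", "AFFECTS", "B"),
    ("C", "TREATMENT_FOR", "A"),
    ("B", "PART_OF", "D"),
    ("D", "INCLUDES", "E"),  # three hops from A, so it should be excluded
]
facts = relationships_within(toy_edges, start_nodes={"A"}, max_hops=2)
for fact in sorted(facts):
    print(fact)
```

Raising `max_hops` pulls in more context per chunk, but the number of returned relationships can grow quickly on a densely connected graph.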

In [12]:
from neo4j_graphrag.retrievers import VectorCypherRetriever

vc_retriever=VectorCypherRetriever(
    driver,
    index_name="text_embeddings",
    embedder=embedder,
    retrieval_query="""
MATCH (node)<-[:FROM_CHUNK]-()-[rl:!FROM_CHUNK]-{1,2}()
UNWIND rl AS r
WITH DISTINCT r
MATCH (sourceDoc:Document)<-[:FROM_DOCUMENT]-()<-[:FROM_CHUNK]-(n)-[r]->(m)
WITH n,r,m, apoc.text.join(collect(DISTINCT sourceDoc.path), ', ') AS sources
// return textualized relationships
RETURN n.name + '(' + coalesce(n.details, '') + ')'+ 
    ' - ' + type(r) + '(' + coalesce(r.details, '') + ')' +  ' -> ' + 
    m.name + '(' + coalesce(m.details, '') + ')' + ' [sourced from: ' + sources + ']' AS fact
        """,
)
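The `RETURN` clause turns each relationship into one line of text that the LLM can read. The Python sketch below mirrors that string layout so you can see the shape of a `fact` without running the query. `textualize` and the sample values are hypothetical. Unlike Cypher, where a null `name` makes the whole concatenation null, this sketch assumes names are always present.

```python
def textualize(n, rel_type, rel_details, m, sources):
    """Format one relationship the same way the retrieval query's RETURN clause does."""
    def describe(name, details):
        # coalesce(details, '') in Cypher: a missing description becomes an empty string
        return f"{name}({details or ''})"
    return (
        f"{describe(n['name'], n.get('details'))} - "
        f"{describe(rel_type, rel_details)} -> "
        f"{describe(m['name'], m.get('details'))} "
        f"[sourced from: {', '.join(sources)}]"
    )

fact = textualize(
    {"name": "Drug X", "details": "a toy drug"},
    "TREATMENT_FOR",
    None,
    {"name": "Disease Y"},
    ["a.pdf", "b.pdf"],
)
print(fact)
```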


In [13]:
vc_res = vc_retriever.get_search_results(query_text = "How do environmental factors influence systemic lupus erythematosus?")
for i in vc_res.records: print("====\n" + i.data()['fact']) 

====
Environmental factors, toxicants and systemic lupus erythematosus(A paper discussing the impact of environmental factors and toxicants on systemic lupus erythematosus.) - DESCRIBES(The paper discusses environmental factors affecting systemic lupus erythematosus.) -> Systemic lupus erythematosus(An autoimmune disease characterized by the body's immune system attacking its own tissues.) [sourced from: truncated-pdfs/pgpm-13-39-trunc.pdf]
====
SLE: another autoimmune disorder influenced by microbes and diet?(A paper investigating the influence of microbes and diet on systemic lupus erythematosus.) - DESCRIBES(The paper investigates the influence of diet and microbes on systemic lupus erythematosus.) -> Systemic lupus erythematosus(An autoimmune disease characterized by the body's immune system attacking its own tissues.) [sourced from: truncated-pdfs/pgpm-13-39-trunc.pdf]
====
Derivation and validation of the systemic lupus erythematosus international collaborating clinics classifica

In [14]:
len(vc_res.records)

174

## GraphRAG Pipelines
 You can construct GraphRAG pipelines with the `GraphRAG` class.  At minimum, you will need to pass the constructor an LLM and a retriever. You can also pass a custom prompt template, but for now we will just use the default.
 

In [15]:
from neo4j_graphrag.llm import OpenAILLM as LLM
from neo4j_graphrag.generation.graphrag import GraphRAG

llm = LLM(model_name="gpt-4o",  model_params={"temperature": 0})

v_rag  = GraphRAG(llm=llm, retriever=vector_retriever)
vc_rag = GraphRAG(llm=llm, retriever=vc_retriever)
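At a high level, a RAG pipeline retrieves context for the question, places it in a prompt, and asks the LLM to answer from that context. The sketch below shows the flow with stand-in functions. It is not the `GraphRAG` class implementation; `rag_answer`, `fake_retrieve`, `fake_generate`, and the prompt wording are all assumptions for illustration.

```python
def rag_answer(question, retrieve, generate, top_k=3):
    """Retrieve context, build a grounded prompt, and return the generator's answer."""
    context_items = retrieve(question, top_k)
    context = "\n".join(context_items)
    prompt = (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )
    return generate(prompt)

# Stand-ins so the sketch runs without a database or an API key
def fake_retrieve(question, top_k):
    corpus = ["toy chunk about kidneys", "toy chunk about skin", "toy chunk about diet"]
    return corpus[:top_k]

seen_prompts = []

def fake_generate(prompt):
    seen_prompts.append(prompt)
    return "stub answer"

answer = rag_answer("How are the kidneys affected?", fake_retrieve, fake_generate, top_k=2)
print(answer)
print(seen_prompts[0])
```

Swapping the retriever changes only the context the LLM sees, which is why the two pipelines above can give different answers to the same question.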

In [19]:
q = "Can you summarize systemic lupus erythematosus (SLE), including common effects, biomarkers for identifying, treatments, and current challenges faced by physicians and patients? Provide in list format"
print(f"\n\nVector Response: \n{v_rag.search(q, retriever_config={'top_k':3}).answer}")
print(f"\n\nVector + Cypher Response: \n{vc_rag.search(q, retriever_config={'top_k':3}).answer}")



Vector Response: 
Certainly! Here's a summary of systemic lupus erythematosus (SLE) including its common effects, biomarkers for identifying, treatments, and current challenges faced by physicians and patients:

1. **Common Effects of SLE:**
   - SLE is a systemic autoimmune disease characterized by immune system dysfunction.
   - It presents with a wide range of clinical manifestations, including renal, dermatological, neuropsychiatric, and cardiovascular symptoms.
   - Patients often experience fatigue, pain, and limitations in daily life activities.

2. **Biomarkers for Identifying SLE:**
   - Clinical and immunological biomarkers are critical for diagnosing and monitoring SLE.
   - Biomarkers help in assessing pathophysiological processes and disease activity.
   - Novel biomarkers have been discovered through "omics" research, although many lack sufficient sensitivity, specificity, or predictive power for clinical use.

3. **Treatments for SLE:**
   - The treatment approach is m

Let's start with a simple question and compare the answers from each pipeline.
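The comparison cells in this section repeat the same two `print` calls. A small helper can run one question through several pipelines and keep the answers for later inspection. The sketch below uses stub search functions so it runs on its own. `compare_answers` is a hypothetical helper, not part of the GraphRAG package.

```python
def compare_answers(question, pipelines, top_k=3):
    """Run the same question through each pipeline and return {name: answer}."""
    answers = {}
    for name, search in pipelines.items():
        answers[name] = search(question, top_k)
        print(f"\n\n{name} Response: \n{answers[name]}")
    return answers

# Stub pipelines; each takes (question, top_k) and returns an answer string
stub_pipelines = {
    "Vector": lambda question, top_k: f"vector answer from {top_k} chunks",
    "Vector + Cypher": lambda question, top_k: f"graph answer from {top_k} chunks",
}
answers = compare_answers("What organ systems are affected?", stub_pipelines, top_k=3)
```

In the notebook, each entry could wrap a real pipeline, for example `lambda q, k: v_rag.search(q, retriever_config={'top_k': k}).answer`.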

In [22]:
q = "What organ systems are most commonly affected by lupus? and how are they treated? provide in list format"
print(f"\n\nVector Response: \n{v_rag.search(q, retriever_config={'top_k':3}).answer}")
print(f"\n\nVector + Cypher Response: \n{vc_rag.search(q, retriever_config={'top_k':3}).answer}")



Vector Response: 
1. **Skin**: 
   - Treatment: Topical and oral steroids, hydroxychloroquine. Severe cases may require rituximab.

2. **Renal (Kidneys)**:
   - Treatment: High-dose steroids, cyclophosphamide, mycophenolate mofetil (MMF), or azathioprine.

3. **Neuropsychiatric (Cerebral)**:
   - Treatment: High-dose steroids and immunosuppressive agents like cyclophosphamide.

4. **Musculoskeletal (Polyarthritis)**:
   - Treatment: Hydroxychloroquine, corticosteroids, methotrexate.

5. **Hematological (Thrombocytopenia)**:
   - Treatment: Corticosteroids, hydroxychloroquine, azathioprine, or mycophenolate mofetil.

6. **Cardiovascular (Lupus Antiphospholipid Syndrome - APS)**:
   - Treatment: Anticoagulation therapy such as warfarin or low molecular weight heparin (LMWH).

7. **Pulmonary (Pleurisy/Pericarditis)**:
   - Treatment: Corticosteroids and immunosuppressive agents.

8. **Gastrointestinal**:
   - Treatment: Not specifically captured in SLEDAI scoring, but managed based on s

Now let's try a slightly more complex question.

In [33]:
q = "How does precision medicine help in treating systemic lupus erythematosus (SLE)?"
print(f"\n\nVector Response: \n{v_rag.search(q, retriever_config={'top_k':3}).answer}")
print(f"\n\nVector + Cypher Response: \n{vc_rag.search(q).answer}")



Vector Response: 
Precision medicine in treating systemic lupus erythematosus (SLE) involves tailoring treatment to individual patients based on their genetic and epigenetic characteristics, which influence disease pathophysiology and drug response. This approach aims to optimally assess SLE patients, predict disease course, and determine treatment response at diagnosis. Ideally, each patient would undergo an initial evaluation to profile their disease, assess the main pathophysiologic pathway through biomarkers, predict the risk of specific organ damage, and identify the most suitable treatment. This would allow for better follow-up and flare prediction. Although there is exciting potential for improving precision in lupus care, achieving this goal universally in the short term is unlikely. Current treatments are individualized to some extent based on organ involvement, with the primary objective being remission of disease signs and symptoms to prevent organ damage. However, the dev

In [24]:
q = "Which biomarkers are associated with predicting organ damage in lupus patients? - provide a detailed list"
print(f"\n\nVector Response: \n{v_rag.search(q, retriever_config={'top_k':3}).answer}")
print(f"\n\nVector + Cypher Response: \n{vc_rag.search(q, retriever_config={'top_k':3}).answer}")



Vector Response: 
Biomarkers associated with predicting organ damage in lupus patients include:

1. **Lupus Nephritis (LN):**
   - **Anti-dsDNA antibodies (Serum):** Associated with SLE disease activity and can predict the development of LN. High specificity (96%) but low diagnostic sensitivity (52–70%).
   - **Anti-Sm antibodies (Serum):** Correlates with SLE disease activity and LN. Highly specific diagnostic biomarker for SLE with a specificity of 99% but low sensitivity (5–30%). High titers predict silent LN and early poor outcomes in LN.
   - **Anti-C1q antibodies (Serum):** Increased titers predict renal flares in LN with 81–97% sensitivity and 71–95% specificity. Absence of anti-C1q is associated with a nearly 100% negative predictive value for LN development.
   - **Proteinuria; Protein/creatinine ratio; 24-h urine protein (Urine):** Conventional urinary biomarkers for LN.
   - **Chemokines and Cytokines (Urine):** Evaluated as potential SLE biomarkers, but few have been inde

In [25]:
q = "How are the kidneys affected by Lupus? - provide a detailed list"
print(f"\n\nVector Response: \n{v_rag.search(q, retriever_config={'top_k':1}).answer}")
print(f"\n\nVector + Cypher Response: \n{vc_rag.search(q, retriever_config={'top_k':1}).answer}")



Vector Response: 
Lupus can significantly affect the kidneys, leading to a condition known as lupus nephritis (LN). Here is a detailed list of how the kidneys are affected:

1. **Glomerulonephritis (GN):** Up to 50% of SLE patients develop clinically significant GN, which is a major cause of morbidity and mortality. GN involves inflammation of the glomeruli, the filtering units of the kidney.

2. **Classification of GN:** The International Society of Nephrology/Renal Pathology Society (ISN/RPS) classifies GN into different classes based on glomerular lesions:
   - **Class I and II:** Mesangial nephritis with a better prognosis due to the high regenerative capacity of mesangial cells.
   - **Class III (Focal) and IV (Diffuse):** Proliferative GN with subendothelial deposition of immune complexes (ICs) in glomerular capillaries, leading to endothelial activation and germinal center formation.
   - **Class V:** Membranous IC deposition in the subepithelial space, causing local inflammat

In [41]:
q = "Which clinical biomarkers are relevant to both adults and children with systemic lupus erythematosus, and how do treatment approaches differ based on these biomarkers?"
print(f"\n\nVector Response: \n{v_rag.search(q, retriever_config={'top_k':3}).answer}")
print(f"\n\nVector + Cypher Response: \n{vc_rag.search(q, retriever_config={'top_k':3}).answer}")



Vector Response: 
The context provided does not specifically differentiate between clinical biomarkers relevant to adults and children with systemic lupus erythematosus (SLE). However, it does list several biomarkers associated with SLE and its manifestations, such as lupus nephritis (LN), skin lesions, and neuropsychiatric SLE (NPSLE). These biomarkers include:

1. **Anti-dsDNA antibodies**: Associated with SLE disease activity and can predict the development of LN.
2. **Anti-Sm antibodies**: Correlate with SLE disease activity and LN.
3. **Anti-C1q antibodies**: Predict renal flares in LN.
4. **Proteinuria and urine protein/creatinine ratio**: Conventional urinary biomarkers for LN.
5. **Chemokines and cytokines in urine**: Evaluated as potential SLE biomarkers.
6. **AhR ratio**: Associated with SLE activity and skin lesions.
7. **Anti-SSA antibodies**: Associated with subacute cutaneous lupus.
8. **Antiphospholipid antibodies**: Associated with NPSLE manifestations.
9. **Anti-NMDA