# Unstructured Text Data: GraphRAG Python package End-to-End Example

This notebook contains an end-to-end worked example using the [GraphRAG Python package](https://neo4j.com/docs/neo4j-graphrag-python/current/index.html) for Neo4j. It starts with highly unstructured text data (in this case, PDFs) and progresses through knowledge graph construction, knowledge graph retriever design, and complete GraphRAG pipelines.

The data source is a set of research papers on lupus. We design two retrievers, each based on a different knowledge graph retrieval pattern.

For more detail on each of the steps below, see the [corresponding blog post](https://neo4j.com/blog/graphrag-python-package/), which contains a full write-up, an in-depth comparison of the retrieval patterns, and additional learning resources.

## Pre-Requisites

1. __Configure your Environment to Access the Bedrock API__ following this [guide]().
2. __Create a Neo4j Database__: To work through this RAG example, you need a database for storing and retrieving data. There are many options for this. You can quickly start a free Neo4j Graph Database using [Neo4j AuraDB](https://neo4j.com/product/auradb/?ref=neo4j-home-hero). You can use __AuraDB Free__ or start an __AuraDB Professional (Pro) free trial__ for higher ingestion and retrieval performance. The Pro instances have a bit more RAM; we recommend them for the best user experience.
3. __Fill in Credentials__: Either copy the [`cred.env.template`](cred.env.template) file, name it `cred.env` (the file name that the second code cell below loads), and fill in the appropriate credentials, or enter the credentials manually in that code cell. You will need the Neo4j URI, username, and password from when you created the database. If you created your database on AuraDB, they are in the file you downloaded.
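
Before connecting, it can help to confirm that all three credentials were actually loaded. The following is a minimal sketch and not part of the original notebook: `missing_credentials` is a hypothetical helper, the variable names are the ones read in the Setup cell, and the URI in the example is a placeholder.

```python
import os

# Names of the variables the Setup cell reads with os.getenv
REQUIRED_VARS = ("NEO4J_URI", "NEO4J_USERNAME", "NEO4J_PASSWORD")


def missing_credentials(env=None, required=REQUIRED_VARS):
    """Return the names of required variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]


# Example with an explicit mapping instead of the real environment.
# The URI is a placeholder, not a real database address.
example_env = {
    "NEO4J_URI": "neo4j+s://example.databases.neo4j.io",
    "NEO4J_USERNAME": "neo4j",
}
print(missing_credentials(example_env))
```

Calling `missing_credentials()` with no arguments checks the real environment after `load_dotenv` has run.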
   



## Setup

In [1]:
%%capture
%pip install fsspec langchain-text-splitters tiktoken boto3 python-dotenv numpy torch neo4j-graphrag pypdf

In [2]:
from dotenv import load_dotenv
import os

# load neo4j credentials
load_dotenv('cred.env', override=True)
NEO4J_URI = os.getenv('NEO4J_URI')
NEO4J_USERNAME = os.getenv('NEO4J_USERNAME')
NEO4J_PASSWORD = os.getenv('NEO4J_PASSWORD')

## Knowledge Graph Building

The `SimpleKGPipeline` class allows you to automatically build a knowledge graph with a few key inputs, including
- a driver to connect to Neo4j,
- an LLM for entity extraction, and
- an embedding model to create vectors on text chunks for similarity search.

There are also some optional inputs, such as node labels, relationship types, and a custom prompt template, which we will use to improve the quality of the knowledge graph. For full details on this, see [the blog](https://neo4j.com/blog/graphrag-python-package/).


In [3]:
import json
from typing import Optional, Any
from neo4j_graphrag.embeddings import Embedder
from neo4j_graphrag.llm import LLMInterface, LLMResponse
import boto3

client = boto3.client("bedrock-runtime", region_name="us-east-1")


class BedrockLLM(LLMInterface):
 
    def invoke(self, input: str) -> LLMResponse:
 
        response = client.converse(
            modelId=self.model_name,
            messages=[{
                'role': 'user',
                'content': [{"text": input}],
            }],
            inferenceConfig=self.model_params
        )      
        return LLMResponse(
            content=response["output"]["message"]["content"][0]["text"]
        )

    async def ainvoke(self, input: str) -> LLMResponse:
        # Delegates to the synchronous call, so this blocks the event loop while Bedrock responds
        return self.invoke(input)


class BedrockEmbedder(Embedder):
    def __init__(self, model_id):
        self.model_id = model_id

    def embed_query(self, text: str) -> list[float]:
        # Create the request for the model.
        native_request = {"inputText": text}
        # Convert the native request to JSON.
        request = json.dumps(native_request)
        # Invoke the model with the request.
        response = client.invoke_model(modelId=self.model_id, body=request)
        # Decode the model's native response body.
        model_response = json.loads(response["body"].read())
        # Return the embedding vector from the response.
        return model_response["embedding"]
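
The request and response shapes used by `embed_query` can be exercised without AWS access. The sketch below is an illustration only: `FakeBedrockClient` is a hypothetical stand-in for the boto3 client, and its toy four-element vector is not what the real model returns.

```python
import io
import json


class FakeBedrockClient:
    """Hypothetical stand-in for the boto3 client; returns a fixed toy embedding."""

    def invoke_model(self, modelId, body):
        payload = json.loads(body)
        # Toy "embedding": the input length repeated four times
        vector = [float(len(payload["inputText"]))] * 4
        raw = json.dumps({"embedding": vector}).encode("utf-8")
        # Mimic the streaming body that the notebook reads with .read()
        return {"body": io.BytesIO(raw)}


def embed_query(client, model_id, text):
    # Same steps as BedrockEmbedder.embed_query: build JSON, invoke, decode
    request = json.dumps({"inputText": text})
    response = client.invoke_model(modelId=model_id, body=request)
    model_response = json.loads(response["body"].read())
    return model_response["embedding"]


vec = embed_query(FakeBedrockClient(), "amazon.titan-embed-text-v1", "lupus")
print(vec)
```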

In [4]:
import neo4j

driver = neo4j.GraphDatabase.driver(NEO4J_URI, auth=(NEO4J_USERNAME, NEO4J_PASSWORD))

ex_llm = BedrockLLM(
    model_name="us.anthropic.claude-3-5-haiku-20241022-v1:0",
    model_params={
        "temperature": 0 # turning temperature down for more deterministic results
    }
)

# create text embedder
embedder = BedrockEmbedder(model_id="amazon.titan-embed-text-v1")

In [5]:
# define node labels
basic_node_labels = ["Object", "Entity", "Group", "Person", "Organization", "Place"]

academic_node_labels = ["ArticleOrPaper", "PublicationOrJournal"]

medical_node_labels = ["Anatomy", "BiologicalProcess", "Cell", "CellularComponent", 
                       "CellType", "Condition", "Disease", "Drug",
                       "EffectOrPhenotype", "Exposure", "GeneOrProtein", "Molecule",
                       "MolecularFunction", "Pathway"]

node_labels = basic_node_labels + academic_node_labels + medical_node_labels

# define relationship types
rel_types = ["ACTIVATES", "AFFECTS", "ASSESSES", "ASSOCIATED_WITH", "AUTHORED",
    "BIOMARKER_FOR", "CAUSES", "CITES", "CONTRIBUTES_TO", "DESCRIBES", "EXPRESSES",
    "HAS_REACTION", "HAS_SYMPTOM", "INCLUDES", "INTERACTS_WITH", "PRESCRIBED",
    "PRODUCES", "RECEIVED", "RESULTS_IN", "TREATS", "USED_FOR"]


In [6]:
prompt_template = '''
You are a medical researcher tasked with extracting information from papers 
and structuring it in a property graph to inform further medical and research Q&A.

Extract the entities (nodes) and specify their type from the following Input text.
Also extract the relationships between these nodes. The relationship direction goes from the start node to the end node. 


Return result as JSON using the following format:
{{"nodes": [ {{"id": "0", "label": "the type of entity", "properties": {{"name": "name of entity" }} }}],
  "relationships": [{{"type": "TYPE_OF_RELATIONSHIP", "start_node_id": "0", "end_node_id": "1", "properties": {{"details": "Description of the relationship"}} }}] }}

- Use only the information from the Input text.  Do not add any additional information.  
- If the input text is empty, return empty JSON. 
- Make sure to create as many nodes and relationships as needed to offer rich medical context for further research.
- An AI knowledge assistant must be able to read this graph and immediately understand the context to inform detailed research questions. 
- Multiple documents will be ingested from different sources and we are using this property graph to connect information, so make sure entity types are fairly general. 

Use only the following nodes and relationships (if provided):
{schema}

Assign a unique ID (string) to each node, and reuse it to define relationships.
Do respect the source and target node types for relationship and
the relationship direction.

Do not return any additional information other than the JSON in it.

Examples:
{examples}

Input text:

{text}
'''
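
The doubled braces in the JSON example matter. Assuming the template is filled with Python `str.format`-style substitution (which the `{schema}`, `{examples}`, and `{text}` placeholders suggest), `{{` and `}}` are emitted as literal braces while single-brace names are replaced. The sketch below uses a shortened, hypothetical `mini_template` with toy values; the real schema string produced by the pipeline will look different.

```python
# Shortened, hypothetical version of the prompt above, for illustration only
mini_template = '''Return result as JSON using the following format:
{{"nodes": [ {{"id": "0", "label": "the type of entity"}} ]}}

Use only the following nodes and relationships (if provided):
{schema}

Input text:
{text}
'''

# Toy values; the pipeline supplies its own schema and chunk text
filled = mini_template.format(
    schema="Disease, Drug; TREATS",
    text="Toy input sentence.",
)
print(filled)
```

Forgetting to double a brace in the JSON example would raise a `KeyError` or `ValueError` at format time, so it is worth testing a custom template this way before running the pipeline.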

In [7]:
from neo4j_graphrag.experimental.components.text_splitters.fixed_size_splitter import FixedSizeSplitter
from neo4j_graphrag.experimental.pipeline.kg_builder import SimpleKGPipeline

kg_builder_pdf = SimpleKGPipeline(
    llm=ex_llm,
    driver=driver,
    text_splitter=FixedSizeSplitter(chunk_size=2000, chunk_overlap=200),
    embedder=embedder,
    entities=node_labels,
    relations=rel_types,
    prompt_template=prompt_template,
    from_pdf=True
)
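
To make the `chunk_size` and `chunk_overlap` settings concrete, here is a minimal sketch of fixed-size chunking with overlap. It assumes plain character-based slicing; `fixed_size_chunks` is a hypothetical helper, and the library's `FixedSizeSplitter` may differ in its details.

```python
def fixed_size_chunks(text, chunk_size, chunk_overlap):
    """Split text into character windows that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the current window already reaches the end of the text
        start += step
    return chunks


chunks = fixed_size_chunks("abcdefghij", chunk_size=4, chunk_overlap=1)
print(chunks)
```

With the settings used above (2000 and 200), each window advances by 1800 characters, so a 5000-character document yields three chunks under this scheme.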

Below, we run the `SimpleKGPipeline` to construct our knowledge graph from three PDF documents and store it in Neo4j.

In [8]:
import json

pdf_file_paths = ['truncated-pdfs/biomolecules-11-00928-v2-trunc.pdf', 
             'truncated-pdfs/GAP-between-patients-and-clinicians_2023_Best-Practice-trunc.pdf', 
             'truncated-pdfs/pgpm-13-39-trunc.pdf']

for path in pdf_file_paths:
    print(f"Processing : {path}")
    pdf_result = await kg_builder_pdf.run_async(file_path=path)
    print(f"Result: {pdf_result}")

Processing : truncated-pdfs/biomolecules-11-00928-v2-trunc.pdf
Result: run_id='80f616a6-eb45-49bb-b0d5-905b56a5e988' result={'resolver': {'number_of_nodes_to_resolve': 213, 'number_of_created_nodes': 181}}
Processing : truncated-pdfs/GAP-between-patients-and-clinicians_2023_Best-Practice-trunc.pdf
Result: run_id='d45dd36d-6da3-437f-8e08-d5cefe9da53a' result={'resolver': {'number_of_nodes_to_resolve': 323, 'number_of_created_nodes': 269}}
Processing : truncated-pdfs/pgpm-13-39-trunc.pdf
Result: run_id='395b532b-1569-43e6-8015-c1a3cf24f10a' result={'resolver': {'number_of_nodes_to_resolve': 492, 'number_of_created_nodes': 429}}
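
The `resolver` counts indicate that fewer nodes are created than were submitted for resolution (for example, 181 versus 213 in the first run). One plausible reading, stated here as an assumption rather than the package's documented behavior, is that nodes sharing a label and a name are merged. The sketch below illustrates that idea with a hypothetical helper and toy nodes.

```python
from collections import defaultdict


def resolve_by_label_and_name(nodes):
    """Group node ids that share the same (label, name) pair."""
    groups = defaultdict(list)
    for node in nodes:
        groups[(node["label"], node["name"])].append(node["id"])
    return dict(groups)


# Toy nodes, as if extracted from two different chunks
extracted = [
    {"id": "c1:0", "label": "Disease", "name": "SLE"},
    {"id": "c2:0", "label": "Disease", "name": "SLE"},
    {"id": "c2:1", "label": "Drug", "name": "Belimumab"},
]
merged = resolve_by_label_and_name(extracted)
print(f"{len(extracted)} nodes to resolve, {len(merged)} after merging")
```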


## Knowledge Graph Retrieval

We will leverage Neo4j's vector search capabilities here. To do this, we need to begin by creating a vector index on the text chunks from the PDFs, which are stored on `Chunk` nodes in our knowledge graph.

In [9]:
from neo4j_graphrag.indexes import create_vector_index

create_vector_index(driver, name="text_embeddings", label="Chunk",
                    embedding_property="embedding", dimensions=1536, similarity_fn="cosine")
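
For intuition about what `similarity_fn="cosine"` ranks by, the following sketch computes cosine similarity and a brute-force top-k over toy two-dimensional vectors. It is an illustration only: the helper names are hypothetical, and a database vector index does not scan every vector this way.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as dissimilar to everything
    return dot / (norm_a * norm_b)


def top_k(query_vec, chunk_vecs, k):
    """Return the ids of the k vectors most similar to query_vec."""
    scored = [(cosine_similarity(query_vec, v), cid) for cid, v in chunk_vecs.items()]
    scored.sort(reverse=True)
    return [cid for _, cid in scored[:k]]


# Toy embeddings keyed by chunk id
toy = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
ranked = top_k([1.0, 0.1], toy, k=2)
print(ranked)
```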

Now that the index is set up, we will start simple with a __VectorRetriever__. It queries `Chunk` nodes via vector search and returns their text along with some metadata.

In [10]:
from neo4j_graphrag.retrievers import VectorRetriever

vector_retriever = VectorRetriever(
    driver,
    index_name="text_embeddings",
    embedder=embedder,
    return_properties=["text"],
)

Below we inspect the context we get back when submitting a search prompt. 

In [11]:
import json

vector_res = vector_retriever.get_search_results(query_text = "How is precision medicine applied to Lupus?", 
                                                 top_k=3)
for i in vector_res.records: print("====\n" + json.dumps(i.data(), indent=4))

====
{
    "node": {
        "text": "REVIEW\nT owards Precision Medicine in Systemic Lupus\nErythematosus\nThis article was published in the following Dove Press journal:\nPharmacogenomics and Personalized Medicine\nElliott Lever 1\nMarta R Alves2\nDavid A Isenberg 1\n1Centre for Rheumatology, Division of\nMedicine, University College Hospital\nLondon, London, UK;2Internal Medicine,\nDepartment of Medicine, Centro\nHospitalar do Porto, Porto, Portugal\nAbstract: Systemic lupus erythematosus (SLE) is a remarkable condition characterised by\ndiversity amongst its clinical features and immunological abnormalities. In this review, we\nattempt to capture the major immunological changes linked to the pathophysiology of lupus\nand discuss the challenge it presents in moving towards the concept of precision medicine.\nCurrently broadly similar types of drugs, e.g., steroids, immunosuppressives, hydroxychlor-\noquine are used to treat many of the diverse clinical features of SLE. We suspect th

The GraphRAG Python Package offers [a wide range of useful retrievers](https://neo4j.com/docs/neo4j-graphrag-python/current/user_guide_rag.html#retriever-configuration), each covering different knowledge graph retrieval patterns.

Below we will use the __`VectorCypherRetriever`__, which allows you to run a graph traversal after finding nodes with vector search.  This uses Cypher, Neo4j's graph query language, to define the logic for traversing the graph. 

As a simple starting point, we'll traverse 2 to 3 hops out from each Chunk (one hop to an extracted entity, then one or two more between entities), capture the relationships encountered, and include them in the response alongside our text chunks.


In [12]:
from neo4j_graphrag.retrievers import VectorCypherRetriever

vc_retriever = VectorCypherRetriever(
    driver,
    index_name="text_embeddings",
    embedder=embedder,
    retrieval_query="""
//1) Go out 2-3 hops in the entity graph and get relationships
WITH node AS chunk
MATCH (chunk)<-[:FROM_CHUNK]-()-[relList:!FROM_CHUNK]-{1,2}()
UNWIND relList AS rel

//2) collect relationships and text chunks
WITH collect(DISTINCT chunk) AS chunks, 
  collect(DISTINCT rel) AS rels

//3) format and return context
RETURN '=== text ===\n' + apoc.text.join([c in chunks | c.text], '\n---\n') + '\n\n=== kg_rels ===\n' +
  apoc.text.join([r in rels | startNode(r).name + ' - ' + type(r) + '(' + coalesce(r.details, '') + ')' +  ' -> ' + endNode(r).name ], '\n---\n') AS info
"""
)
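
Step 3 of the retrieval query builds a single string with a text section and a relationship section. The sketch below mirrors that formatting in plain Python so the layout is easier to see; `format_context` and the toy relationships are hypothetical and are not produced by the package.

```python
def format_context(chunk_texts, rels):
    """Mirror the string layout built by step 3 of the Cypher query."""
    text_part = "\n---\n".join(chunk_texts)
    rel_part = "\n---\n".join(
        # Missing details become an empty string, like coalesce(r.details, '')
        f"{r['start']} - {r['type']}({r.get('details') or ''}) -> {r['end']}"
        for r in rels
    )
    return "=== text ===\n" + text_part + "\n\n=== kg_rels ===\n" + rel_part


info = format_context(
    ["chunk one", "chunk two"],
    [
        {"start": "Belimumab", "type": "TREATS", "details": "toy detail", "end": "SLE"},
        {"start": "SLE", "type": "HAS_SYMPTOM", "end": "Fatigue"},
    ],
)
print(info)
```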

Below we inspect the context we get back when submitting a search prompt. 

In [13]:
vc_res = vc_retriever.get_search_results(query_text = "How is precision medicine applied to Lupus?", top_k=3)

# print output
kg_rel_pos = vc_res.records[0]['info'].find('\n\n=== kg_rels ===\n')
print("# Text Chunk Context:")
print(vc_res.records[0]['info'][:kg_rel_pos])
print("# KG Context From Relationships:")
print(vc_res.records[0]['info'][kg_rel_pos:])

# Text Chunk Context:
=== text ===
REVIEW
T owards Precision Medicine in Systemic Lupus
Erythematosus
This article was published in the following Dove Press journal:
Pharmacogenomics and Personalized Medicine
Elliott Lever 1
Marta R Alves2
David A Isenberg 1
1Centre for Rheumatology, Division of
Medicine, University College Hospital
London, London, UK;2Internal Medicine,
Department of Medicine, Centro
Hospitalar do Porto, Porto, Portugal
Abstract: Systemic lupus erythematosus (SLE) is a remarkable condition characterised by
diversity amongst its clinical features and immunological abnormalities. In this review, we
attempt to capture the major immunological changes linked to the pathophysiology of lupus
and discuss the challenge it presents in moving towards the concept of precision medicine.
Currently broadly similar types of drugs, e.g., steroids, immunosuppressives, hydroxychlor-
oquine are used to treat many of the diverse clinical features of SLE. We suspect that, as the
precise im

## GraphRAG
 
You can construct GraphRAG pipelines with the `GraphRAG` class. At a minimum, you will need to pass the constructor an LLM and a retriever. You can optionally pass a custom prompt template. We will do so here just to provide a bit more guidance for the LLM to stick to information from our data source.
 
Below we create `GraphRAG` objects for both the vector and vector-cypher retrievers. 

In [14]:
from neo4j_graphrag.generation import RagTemplate
from neo4j_graphrag.generation.graphrag import GraphRAG

llm = BedrockLLM(model_name="anthropic.claude-3-5-sonnet-20240620-v1:0",  model_params={"temperature": 0.0})

rag_template = RagTemplate(template='''Answer the Question using the following Context. Only respond with information mentioned in the Context. Do not inject any speculative information not mentioned. 

# Question:
{query_text}
 
# Context:
{context}

# Answer:
''', expected_inputs=['query_text', 'context'])

v_rag  = GraphRAG(llm=llm, retriever=vector_retriever, prompt_template=rag_template)
vc_rag = GraphRAG(llm=llm, retriever=vc_retriever, prompt_template=rag_template)
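
Conceptually, a RAG call retrieves context, fills the template, and passes the prompt to the LLM. The sketch below shows that flow with stub functions. It is not the `GraphRAG` class's implementation: `rag_search`, `toy_retrieve`, and `toy_generate` are hypothetical names, and the template is a shortened copy of the one above.

```python
# Shortened copy of the template used in the cell above
RAG_TEMPLATE = """# Question:
{query_text}

# Context:
{context}

# Answer:
"""


def rag_search(query_text, retrieve, generate, top_k=5):
    """Retrieve context, fill the prompt template, and call the generator."""
    items = retrieve(query_text, top_k)
    context = "\n".join(items)
    prompt = RAG_TEMPLATE.format(query_text=query_text, context=context)
    return generate(prompt), prompt


def toy_retrieve(query_text, top_k):
    # Stub retriever: ignores the query and returns the first top_k chunks
    corpus = ["chunk A", "chunk B", "chunk C"]
    return corpus[:top_k]


def toy_generate(prompt):
    # Stub LLM: reports only the prompt size instead of generating text
    return f"(stub answer built from a prompt of {len(prompt)} characters)"


answer, prompt = rag_search("toy question", toy_retrieve, toy_generate, top_k=2)
print(prompt)
print(answer)
```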

Now we can run GraphRAG and examine the outputs. 

In [15]:
q = "How is precision medicine applied to Lupus? provide in list format."
print(f"Vector Response: \n{v_rag.search(q, retriever_config={'top_k':5}).answer}")

Vector Response: 
Based on the provided context, precision medicine is applied to Lupus in the following ways:

1. Using molecular tools to predict response to treatments like methotrexate or azathioprine.

2. Utilizing gene signatures to assist in response prediction, as has been shown in rheumatoid arthritis.

3. Targeting specific molecules with biologic drugs to provide a more precise approach.

4. Considering racial differences in treatment response, such as black lupus patients not responding as well to Cyclophosphamide compared to Caucasians.

5. Potentially treating patients prophylactically based on rising dsDNA antibody levels and falling C3, which can anticipate a flare.

6. Using biomarkers for SLE diagnosis and monitoring, as defined in various criteria (ACR-1997, SLICC-2012, and EULAR/ACR-2019).

7. Considering the role of the gut microbiome in SLE and potential interventions based on microbiota findings.

8. Tailoring treatment based on the specific organ or system invol

In [16]:
print(f"Vector + Cypher Response: \n{vc_rag.search(q, retriever_config={'top_k':5}).answer}")

Vector + Cypher Response: 
Based on the context provided, precision medicine is applied to Lupus in the following ways:

1. Use of biomarkers for diagnosis and monitoring disease activity, including:
   - Anti-dsDNA antibodies
   - Antiphospholipid antibodies
   - Complement proteins (C3, C4)
   - SLE-specific antibodies
   - Anti-C1q antibodies

2. Genetic studies:
   - Genome-wide association studies used for studying SLE genetics
   - Polygenic risk scores to predict SLE risk

3. Targeted treatments based on specific pathways:
   - Belimumab targeting B cell pathway
   - Rituximab for B cell depletion therapy
   - Anifrolumab as a potential treatment

4. Use of scoring systems to assess disease activity:
   - BILAG scoring system
   - SLE Disease Activity Index (SLEDAI)

5. Patient-reported outcomes (PROs) used for monitoring and treatment evaluation

6. Consideration of individual patient factors:
   - Ethnicity (e.g., black lupus patients responding differently to cyclophosphamide

In [17]:
q = "Can you summarize systemic lupus erythematosus (SLE)? including common effects, biomarkers, and treatments? Provide in detailed list format."

v_rag_result = v_rag.search(q, retriever_config={'top_k': 5}, return_context=True)
print(f"Vector Response: \n{v_rag_result.answer}")

Vector Response: 
Based on the provided context, here is a summary of systemic lupus erythematosus (SLE) in a detailed list format:

1. General characteristics:
   - Chronic inflammatory systemic disease
   - Multi-organ involvement
   - Complex clinical picture
   - Wide range of manifestations with varying severity
   - Unpredictable relapsing and remitting course

2. Common effects:
   - Constitutional symptoms (e.g., fatigue, fever)
   - Skin involvement (e.g., malar rash, discoid lupus erythematosus)
   - Polyarthritis
   - Thrombocytopenia
   - Nephritis
   - Pleurisy
   - Pericarditis
   - Neurological involvement (e.g., psychosis, mononeuritis multiplex)

3. Biomarkers:
   - Anti-dsDNA antibodies
   - Ro60 antigen

4. Treatments:
   - First-line treatments:
     - Hydroxychloroquine (HCQ)
     - Corticosteroids (CS)
     - Mycophenolate (MMF)
     - Cyclophosphamide (IV Cyclo)
     - Warfarin/Low molecular weight heparin (for antiphospholipid syndrome)

   - Second-line treatme

In [18]:
import ast

# Parse each item's dict-style string with literal_eval rather than eval
for i in v_rag_result.retriever_result.items: print(json.dumps(ast.literal_eval(i.content), indent=1))

{
 "text": "scular risk.\nOur approach as detailed in Table 1 does not help in\ndistinguishing subgroups of patients for methotrexate or\nazathioprine. The molecular tools for this are on their way\nto the clinic. In rheumatoid arthritis gene signature has\nbeen shown to assist response prediction.77\nIn the current treatment of lupus longterm clinical and\nimmunological remission is only achieved in a proportion\nof patients. The inherent pathophysiological complexity of\nlupus described at the cellular and molecular level cannot\nbe overstated. Using targeted drug therapy and improving\nclinical assessment remains vital to improving disease\ncontrol and patient outcomes.\nThe Challenge of Moving T owards\nPrecision Medicine in SLE\nAs the foregoing text has con\ufb01rmed the pathophysiological\naspects of SLE are highly diverse and the combination of\nfactors leading to say skin, kidney and renal disease are very\ndifferent. In spite of these differences, a rather restricted set of\n

In [22]:
vc_rag_result = vc_rag.search(q, retriever_config={'top_k': 5}, return_context=True)
print(f"Vector + Cypher Response: \n{vc_rag_result.answer}")

Vector + Cypher Response: 
Here is a detailed summary of systemic lupus erythematosus (SLE) based on the provided context:

Common Effects:
1. Multi-organ involvement
2. Skin disease
3. Kidney disease/renal involvement
4. Neuropsychiatric involvement
5. Hemolytic anemia
6. Joint pain
7. Chronic pain
8. Fatigue
9. Cardiovascular events/increased cardiovascular risk
10. Damage accrual
11. Increased mortality

Biomarkers:
1. Anti-dsDNA antibodies
2. Anti-Sm antibodies  
3. Antinuclear antibody (ANA)
4. Anti-SSA antibodies
5. Anticardiolipin antibodies
6. Anti-β2-glycoprotein I antibodies
7. Anti-NMDAR antibodies
8. Anti-RibP antibodies
9. IFI44L promoter methylation
10. Proteinuria
11. Urinary casts
12. White blood cell count
13. Platelet count
14. Cardiac troponin T
15. IgG-anticardiolipin antibodies
16. L-valine
17. Pyrimidine  
18. Erucamide
19. L-leucine

Treatments:
1. First-line:
   - Hydroxychloroquine
   - Corticosteroids (e.g. prednisolone)

2. Second-line:
   - Methotrexate
   -

In [23]:
vc_ls = vc_rag_result.retriever_result.items[0].content.split('\\n---\\n')
for i in vc_ls:
    if "biomarker" in i: print(i)

Anticardiolipin Antibodies - BIOMARKER_FOR(Anticardiolipin antibodies as biomarker for SLE) -> Systemic Lupus Erythematosus
Anti-β2-glycoprotein I Antibodies - BIOMARKER_FOR(Anti-β2-glycoprotein I antibodies as biomarker for SLE) -> Systemic Lupus Erythematosus
SLE - PUBLISHED_IN(Research about SLE biomarkers) -> Biomolecules
anti-RibP antibodies - BIOMARKER_FOR(Specific biomarker for diagnosing SLE) -> SLE
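
Note that the substring test `"biomarker" in i` is case-sensitive and matches anywhere in the line, which is why the `PUBLISHED_IN` relationship appears in the output above. A stricter alternative, sketched here under the assumption that lines follow the `start - TYPE(details) -> end` layout from the retrieval query, is to parse each line and filter on the relationship type. `parse_rel` and `filter_by_type` are hypothetical helpers.

```python
import re

# Matches lines shaped like: start - TYPE(details) -> end
REL_LINE = re.compile(
    r"^(?P<start>.+?) - (?P<type>[A-Z_]+)\((?P<details>.*)\) -> (?P<end>.+)$"
)


def parse_rel(line):
    """Return the parts of a relationship line, or None if it does not match."""
    match = REL_LINE.match(line.strip())
    return match.groupdict() if match else None


def filter_by_type(lines, rel_type):
    """Keep only parsed relationships whose type equals rel_type."""
    parsed = (parse_rel(line) for line in lines)
    return [p for p in parsed if p and p["type"] == rel_type]


# Three lines copied from the output above
lines = [
    "Anticardiolipin Antibodies - BIOMARKER_FOR(Anticardiolipin antibodies as biomarker for SLE) -> Systemic Lupus Erythematosus",
    "SLE - PUBLISHED_IN(Research about SLE biomarkers) -> Biomolecules",
    "anti-RibP antibodies - BIOMARKER_FOR(Specific biomarker for diagnosing SLE) -> SLE",
]
biomarkers = filter_by_type(lines, "BIOMARKER_FOR")
for rel in biomarkers:
    print(rel["start"], "->", rel["end"])
```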


In [24]:
vc_ls = vc_rag_result.retriever_result.items[0].content.split('\\n---\\n')
for i in vc_ls:
    if "treat" in i: print(i)

<Record info="=== text ===\nscular risk.\nOur approach as detailed in Table 1 does not help in\ndistinguishing subgroups of patients for methotrexate or\nazathioprine. The molecular tools for this are on their way\nto the clinic. In rheumatoid arthritis gene signature has\nbeen shown to assist response prediction.77\nIn the current treatment of lupus longterm clinical and\nimmunological remission is only achieved in a proportion\nof patients. The inherent pathophysiological complexity of\nlupus described at the cellular and molecular level cannot\nbe overstated. Using targeted drug therapy and improving\nclinical assessment remains vital to improving disease\ncontrol and patient outcomes.\nThe Challenge of Moving T owards\nPrecision Medicine in SLE\nAs the foregoing text has conﬁrmed the pathophysiological\naspects of SLE are highly diverse and the combination of\nfactors leading to say skin, kidney and renal disease are very\ndifferent. In spite of these differences, a rather restrict

In [25]:
q = "Can you summarize systemic lupus erythematosus (SLE)? including common effects, biomarkers, treatments, and current challenges faced by Physicians and patients? provide in list format with details for each item."
print(f"Vector Response: \n{v_rag.search(q, retriever_config={'top_k': 5}).answer}")

Vector Response: 
Based on the provided context, here is a summary of systemic lupus erythematosus (SLE) in list format:

1. Common effects:
   - Constitutional symptoms
   - Discoid lupus erythematosus (skin involvement)
   - Polyarthritis
   - Thrombocytopenia
   - Nephritis
   - Mononeuritis multiplex

2. Biomarkers:
   - Proteinuria
   - Urinary casts
   - Hemolytic anemia
   - Low white blood cell count
   - Low platelet count
   - Direct Coombs' test

3. Treatments:
   - Hydroxychloroquine
   - Corticosteroids
   - Immunosuppressives (e.g., methotrexate, azathioprine, mycophenolate)
   - Cyclophosphamide
   - Rituximab
   - Belimumab
   - Anticoagulation or antiplatelet therapy for antiphospholipid syndrome

4. Current challenges faced by physicians and patients:
   - Difficulty in diagnosis due to heterogeneous presentation
   - Suboptimal sensitivity and specificity of current biomarkers
   - High level of physician skill and experience required for diagnosis
   - Limited sympt

In [26]:
print(f"Vector + Cypher Response: \n{vc_rag.search(q, retriever_config={'top_k': 5}).answer}")

Vector + Cypher Response: 
Here is a summary of systemic lupus erythematosus (SLE) based on the provided context:

1. Common effects:
   - Multi-organ involvement 
   - Skin disease
   - Kidney disease
   - Renal disease
   - Neuropsychiatric involvement
   - Fatigue
   - Joint pain
   - Chronic pain

2. Biomarkers:
   - Anti-dsDNA antibodies
   - Antinuclear antibody (ANA)
   - Anti-SSA antibodies  
   - Anticardiolipin antibodies
   - Smith antibody
   - Complement proteins (C3, C4)
   - Proteinuria
   - Urinary casts
   - White blood cell count
   - Platelet count

3. Treatments:
   - First-line: Hydroxychloroquine, Corticosteroids
   - Second-line: Methotrexate, Azathioprine, Mycophenolate
   - Third-line: Belimumab, Rituximab, Cyclophosphamide
   - Other: Antimalarials, Immunosuppressants

4. Current challenges:
   - Diagnosis can be challenging due to varied symptoms
   - Sensitivity and specificity of current biomarkers are not ideal
   - High levels of physician skill and exper