# From Unstructured Data Ingestion to Question Answering: an end-to-end example

This notebook demonstrates how to use the neo4j-graphrag package to ingest a PDF document into Neo4j, extracting entities and relations to build a Knowledge Graph (KG). The second part shows how to answer questions with an LLM, grounded in the content of the KG.

Note: this notebook uses OpenAI embeddings and the GPT-4o model, and therefore requires `OPENAI_API_KEY` to be set in a `.env` file. Other providers are supported; see the [documentation](https://neo4j.com/docs/neo4j-graphrag-python/current/user_guide_rag.html#using-another-llm-model).


In [52]:
import neo4j
from dotenv import load_dotenv

from neo4j_graphrag.experimental.components.pdf_loader import PdfLoader, DocumentInfo
from neo4j_graphrag.experimental.components.text_splitters.fixed_size_splitter import FixedSizeSplitter
from neo4j_graphrag.embeddings import OpenAIEmbeddings as Embedder
from neo4j_graphrag.experimental.components.embedder import TextChunkEmbedder
from neo4j_graphrag.experimental.components.schema import (
    SchemaBuilder, SchemaEntity, SchemaRelation, SchemaProperty,
)
from neo4j_graphrag.llm import OpenAILLM as LLM
from neo4j_graphrag.experimental.components.entity_relation_extractor import (
    LLMEntityRelationExtractor, OnError,
)
from neo4j_graphrag.experimental.components.kg_writer import Neo4jWriter
from neo4j_graphrag.experimental.pipeline.pipeline import Pipeline

In [2]:
load_dotenv()

driver = neo4j.GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

In [3]:
filepath = "~/Downloads/pgpm-13-39.pdf"

# Building the Knowledge Graph

## Understanding the components

### Extract text from PDF file

In [4]:
loader = PdfLoader()
loader_res = await loader.run(filepath=filepath)
type(loader_res), loader_res.text[:100]

(neo4j_graphrag.experimental.components.pdf_loader.PdfDocument,
 'REVIEW\nT owards Precision Medicine in Systemic Lupus\nErythematosus\nThis article was published in the')

### Split the text into chunks small enough for the LLM context window


In [5]:
splitter = FixedSizeSplitter(chunk_size=2000)
splitter_res = await splitter.run(text=loader_res.text)
type(splitter_res), len(splitter_res.chunks), splitter_res.chunks[0]

(neo4j_graphrag.experimental.components.types.TextChunks,
 33,
 TextChunk(text='REVIEW\nT owards Precision Medicine in Systemic Lupus\nErythematosus\nThis article was published in the following Dove Press journal:\nPharmacogenomics and Personalized Medicine\nElliott Lever1\nMarta R Alves2\nDavid A Isenberg1\n1Centre for Rheumatology, Division of\nMedicine, University College Hospital\nLondon, London, UK;2Internal Medicine,\nDepartment of Medicine, Centro\nHospitalar do Porto, Porto, PortugalAbstract: Systemic lupus erythematosus (SLE) is a remarkable condition characterised by\ndiversity amongst its clinical features and immunological abnormalities. In this review, we\nattempt to capture the major immunological changes linked to the pathophysiology of lupus\nand discuss the challenge it presents in moving towards the concept of precision medicine.\nCurrently broadly similar types of drugs, e.g., steroids, immunosuppressives, hydroxychlor-\noquine are used to treat many of the diverse c
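
To make the idea of fixed-size splitting concrete, here is a minimal pure-Python sketch. It is not the `FixedSizeSplitter` implementation: `split_fixed_size` and its `chunk_overlap` argument are names chosen for this illustration, and it cuts on raw character offsets with no attempt to respect word or sentence boundaries.

```python
def split_fixed_size(text: str, chunk_size: int = 2000, chunk_overlap: int = 0) -> list[str]:
    """Cut `text` into consecutive windows of at most `chunk_size` characters.

    Hypothetical helper for illustration only; each window starts
    `chunk_size - chunk_overlap` characters after the previous one.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap
    return [text[start:start + chunk_size] for start in range(0, len(text), step)]


# Tiny example: windows of 4 characters that share 1 character with their neighbour.
chunks = split_fixed_size("abcdefghij", chunk_size=4, chunk_overlap=1)
```

Smaller chunks keep each LLM call well inside the context window, at the cost of more calls and of entities whose description spans a chunk boundary.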

### Embed each chunk's text

In [6]:
embedder = Embedder()

In [7]:
embedder_component = TextChunkEmbedder(embedder=embedder)
embedder_res = await embedder_component.run(text_chunks=splitter_res)
type(embedder_res), embedder_res.chunks[0]

(neo4j_graphrag.experimental.components.types.TextChunks,
 TextChunk(text='REVIEW\nT owards Precision Medicine in Systemic Lupus\nErythematosus\nThis article was published in the following Dove Press journal:\nPharmacogenomics and Personalized Medicine\nElliott Lever1\nMarta R Alves2\nDavid A Isenberg1\n1Centre for Rheumatology, Division of\nMedicine, University College Hospital\nLondon, London, UK;2Internal Medicine,\nDepartment of Medicine, Centro\nHospitalar do Porto, Porto, PortugalAbstract: Systemic lupus erythematosus (SLE) is a remarkable condition characterised by\ndiversity amongst its clinical features and immunological abnormalities. In this review, we\nattempt to capture the major immunological changes linked to the pathophysiology of lupus\nand discuss the challenge it presents in moving towards the concept of precision medicine.\nCurrently broadly similar types of drugs, e.g., steroids, immunosuppressives, hydroxychlor-\noquine are used to treat many of the diverse clinic

### Create a schema

The schema grounds the LLM by listing the specific entities and relations it should extract.

In [59]:
schema_builder = SchemaBuilder()
entities = [
    SchemaEntity(label="Disease", properties=[SchemaProperty(name="name", type="STRING")]),
    SchemaEntity(label="Gene", properties=[SchemaProperty(name="name", type="STRING")]),
    SchemaEntity(label="Cell", properties=[SchemaProperty(name="name", type="STRING")]),
    SchemaEntity(label="Food", properties=[SchemaProperty(name="name", type="STRING")]),
    SchemaEntity(label="Protein", properties=[SchemaProperty(name="name", type="STRING")]),
    SchemaEntity(label="Compound", properties=[SchemaProperty(name="name", type="STRING")]),
    SchemaEntity(label="Book", properties=[SchemaProperty(name="name", type="STRING"), SchemaProperty(name="year", type="STRING")]),
]
relations = [
    # TODO: update relations and potential_schema to something more realistic
    SchemaRelation(label="CORRELATES_WITH"),
    SchemaRelation(label="SOURCE"),
    SchemaRelation(label="ASSOCIATED_WITH"),
]
potential_schema = [
    ("Disease", "CORRELATES_WITH", "Disease"),
    ("Food", "SOURCE", "Compound"),
    ("Compound", "ASSOCIATED_WITH", "Disease"),
]

schema = await schema_builder.run(
    entities=entities, 
    relations=relations,
    potential_schema=potential_schema,
)
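
Each triple in `potential_schema` reads as (source entity label, relation label, target entity label), so every label in a triple should also appear in `entities` or `relations`. The sketch below shows one way to check that by hand, independently of neo4j-graphrag; `check_potential_schema` is a hypothetical helper written for this example, not part of the package.

```python
def check_potential_schema(entity_labels, relation_labels, potential_schema):
    """Return the triples that reference an undeclared entity or relation label."""
    entity_labels = set(entity_labels)
    relation_labels = set(relation_labels)
    return [
        (source, relation, target)
        for source, relation, target in potential_schema
        if source not in entity_labels
        or relation not in relation_labels
        or target not in entity_labels
    ]


# Labels copied from the schema cell above.
entity_labels = ["Disease", "Gene", "Cell", "Food", "Protein", "Compound", "Book"]
relation_labels = ["CORRELATES_WITH", "SOURCE", "ASSOCIATED_WITH"]
triples = [
    ("Disease", "CORRELATES_WITH", "Disease"),
    ("Food", "SOURCE", "Compound"),
    ("Compound", "ASSOCIATED_WITH", "Disease"),
]
invalid = check_potential_schema(entity_labels, relation_labels, triples)
```

An empty `invalid` list means every triple only uses declared labels.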

### Extract entities and relations

In [54]:
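# Request JSON output so the extractor can parse nodes and relationships from the response.
llm = LLM(model_name="gpt-4o", model_params={"response_format": {"type": "json_object"}})
# OnError.IGNORE: do not fail the whole run when the response for one chunk cannot be parsed.
extractor = LLMEntityRelationExtractor(llm=llm, on_error=OnError.IGNORE)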
extractor_res = await extractor.run(chunks=embedder_res, document_info=DocumentInfo(path=filepath), schema=schema)
type(extractor_res), len(extractor_res.nodes), len(extractor_res.relationships)

LLM response has improper format {'nodes': [{'id': '0', 'label': 'Disease', 'properties': {'name': 'systemic lupus erythematosus'}}, {'id': '1', 'label': 'Disease', 'properties': {'name': 'SLE'}}, {'id': '2', 'label': 'Cell', 'properties': {'name': 'dendritic cells'}}, {'id': '3', 'label': 'Cell', 'properties': {'name': 'T-cell'}}, {'id': '4', 'label': 'Cell', 'properties': {'name': 'B-cell'}}, {'id': '5', 'label': 'Cell', 'properties': {'name': 'T helper 17 cell'}}, {'id': '6', 'label': 'Cell', 'properties': {'name': 'Regulatory T cells'}}, {'id': '7', 'label': 'Cell', 'properties': {'name': 'T follicular helper cells'}}, {'id': '8', 'label': 'Cell', 'properties': {'name': 'neutrophils'}}, {'id': '9', 'label': 'Disease', 'properties': {'name': 'rheumatic autoimmune diseases'}}], 'relationships': [{'type': 'CORRELATES_WITH', 'start_node_id': '0', 'end_node_id': '1', 'properties': []}, {'type': 'CORRELATES_WITH', 'start_node_id': '1', 'end_node_id': '0', 'properties': []}]} for chunk_in

(neo4j_graphrag.experimental.components.types.Neo4jGraph, 271, 464)

### Finally, write everything to Neo4j

In [55]:
writer = Neo4jWriter(driver=driver)
writer_res = await writer.run(graph=extractor_res)
writer_res

KGWriterModel(status='SUCCESS', metadata={'node_count': 271, 'relationship_count': 464})

## Using a pipeline

In [61]:
pipeline = Pipeline()
pipeline.add_component(loader, "loader")
pipeline.add_component(splitter, "splitter")
pipeline.add_component(embedder_component, "embedder")
pipeline.add_component(schema_builder, "schema")
pipeline.add_component(extractor, "extractor")
pipeline.add_component(writer, "writer")
pipeline.connect("loader", "splitter", {"text": "loader.text"})
pipeline.connect("splitter", "embedder", {"text_chunks": "splitter"})
pipeline.connect("embedder", "extractor", {"chunks": "embedder"})
pipeline.connect("schema", "extractor", {"schema": "schema"})
pipeline.connect("extractor", "writer", {"graph": "extractor"})
pipeline_res = await pipeline.run({
    "loader": {"filepath": filepath}, 
    "schema":  {"entities": entities, "relations": relations, "potential_schema": potential_schema},
})
pipeline_res

Starting pipeline
PIPELINE START data={'loader': {'filepath': '~/Downloads/pgpm-13-39.pdf'}, 'schema': {'entities': [SchemaEntity(label='Disease', description='', properties=[SchemaProperty(name='name', type='STRING', description='')]), SchemaEntity(label='Gene', description='', properties=[SchemaProperty(name='name', type='STRING', description='')]), SchemaEntity(label='Cell', description='', properties=[SchemaProperty(name='name', type='STRING', description='')]), SchemaEntity(label='Food', description='', properties=[SchemaProperty(name='name', type='STRING', description='')]), SchemaEntity(label='Protein', description='', properties=[SchemaProperty(name='name', type='STRING', description='')]), SchemaEntity(label='Compound', description='', properties=[SchemaProperty(name='name', type='STRING', description='')]), SchemaEntity(label='Book', description='', properties=[SchemaProperty(name='name', type='STRING', description=''), SchemaProperty(name='year', type='STRING', description='

In [62]:
pipeline_res

PipelineResult(run_id='6d873584-5ee9-4ffc-a7f1-366bab349d58', result={'writer': {'status': 'SUCCESS', 'metadata': {'node_count': 279, 'relationship_count': 466}}})
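
In the `pipeline.connect` calls above, the mapping `{"text": "loader.text"}` feeds one field of the loader's output into the splitter's `text` parameter, while `{"text_chunks": "splitter"}` passes the splitter's whole output. The toy model below illustrates that convention only. It is not how the `Pipeline` class is implemented; `resolve_inputs` is a hypothetical function, and component results are modelled as plain dictionaries.

```python
def resolve_inputs(input_config: dict, results: dict) -> dict:
    """Turn {"param": "component" or "component.field"} into concrete argument values."""
    resolved = {}
    for param, source in input_config.items():
        # "loader.text" -> ("loader", "text"); "splitter" -> ("splitter", "")
        component, _, field = source.partition(".")
        output = results[component]
        resolved[param] = output[field] if field else output
    return resolved


# Made-up results standing in for what earlier components produced.
results = {
    "loader": {"text": "REVIEW ...", "document_info": {"path": "doc.pdf"}},
    "splitter": {"chunks": ["REVIEW", "..."]},
}
splitter_inputs = resolve_inputs({"text": "loader.text"}, results)
embedder_inputs = resolve_inputs({"text_chunks": "splitter"}, results)
```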

# DB operation

In [33]:
from neo4j_graphrag.indexes import create_vector_index
create_vector_index(driver, name="chunk-embeddings", label="Chunk", embedding_property="embedding", dimensions=1536, similarity_fn="cosine")

Creating vector index named 'chunk-embeddings'
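
The index is created with `similarity_fn="cosine"`, and `dimensions` must match the length of the vectors produced by the embedder. As a rough picture of what a cosine-based lookup computes, the sketch below ranks a few toy vectors by brute force. It is an illustration under simplifying assumptions: `cosine_similarity` and `top_k` are hypothetical helpers, the vectors are made up, and a database vector index avoids scanning every vector this way.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query, vectors, k=5):
    """Return the k (name, score) pairs most similar to `query`, best first."""
    scored = [(name, cosine_similarity(query, vec)) for name, vec in vectors.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Three-dimensional toy vectors; real embeddings here have 1536 dimensions.
toy_vectors = {
    "chunk_0": [1.0, 0.0, 0.0],
    "chunk_1": [0.0, 1.0, 0.0],
    "chunk_2": [1.0, 1.0, 0.0],
}
ranked = top_k([1.0, 0.1, 0.0], toy_vectors, k=2)
```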


# RAG

## Vector search

In [44]:
from neo4j_graphrag.retrievers import VectorRetriever, VectorCypherRetriever
from neo4j_graphrag.generation.graphrag import GraphRAG

In [36]:
llm = LLM(model_name="gpt-4o")

vector_rag = GraphRAG(
   llm=llm,
   retriever=VectorRetriever(
       driver,
       index_name="chunk-embeddings",
       embedder=embedder,
       return_properties=["text"],
   )
)
vector_rag.search("what is Systemic lupus?")

VectorRetriever Cypher parameters: {'top_k': 5, 'vector_index_name': 'chunk-embeddings', 'query_vector': [-0.009282542392611504, 0.017388710752129555, -0.005547214765101671, -0.019660329446196556, -0.03329005092382431, 0.014211146160960197, -0.018510999158024788, -0.035886187106370926, -0.011804311536252499, -0.019024817273020744, 0.007085290737450123, 0.003229959635064006, -0.0014239880256354809, 0.016293464228510857, 0.021282916888594627, 0.019876675680279732, 0.051517095416784286, -0.011743464507162571, 0.0010014396393671632, -0.021580390632152557, -0.01422466803342104, -0.030802085995674133, 0.003897586138918996, 0.0012160942424088717, -0.014576228335499763, -0.00344968494027853, 0.003998997621238232, -0.03393908590078354, -0.014657357707619667, -0.006179347168654203, 0.01496835332363844, -0.015428085811436176, -0.017388710752129555, -0.0016656856751069427, 0.006574852392077446, -0.005036776419728994, 0.025461073964834213, -0.002087388886138797, -0.024960776790976524, -0.0004445208

RagResultModel(answer='Systemic lupus erythematosus (SLE) is a chronic, complex autoimmune disease characterized by its multiorgan involvement and highly variable course. It is marked by the loss of self-immune tolerance, leading to the production of self-reacting antibodies and the formation of immune complexes that precipitate in tissues. This causes chronic systemic inflammation and organ damage. SLE predominantly affects women (90%) during their childbearing years and manifests with diverse clinical features and immunological abnormalities. The disease has an unpredictable prognosis, primarily influenced by disease activity severity, organ damage, and response to treatment.', retriever_result=None)

## Graph RAG

In [37]:
graph_rag = GraphRAG(
   llm=llm,
   retriever=VectorCypherRetriever(
       driver,
       index_name="chunk-embeddings",
       embedder=embedder,
       retrieval_query="WITH node MATCH (node)<-[:FROM_CHUNK]-(n:`__Entity__`) RETURN n",
   )
)
graph_rag.search("how is HLA-DR3 related to lupus?", retriever_config={"top_k": 2})

VectorCypherRetriever Cypher parameters: {'top_k': 2, 'vector_index_name': 'chunk-embeddings', 'query_vector': [-0.009320869110524654, 0.0019455356523394585, -0.0010451084235683084, -0.03807326406240463, -0.017538756132125854, 0.04204944148659706, -0.0161906685680151, -0.006794906686991453, -0.010144700296223164, -0.007884270511567593, -0.005821287631988525, 0.010600871406495571, 0.009429804980754852, -0.00938214547932148, -0.014297899790108204, 0.029685163870453835, 0.030120909214019775, -0.01850556768476963, 0.02272685244679451, -0.01023321133106947, -0.00960001815110445, 0.003038303693756461, 0.001120002125389874, -0.01098895724862814, -0.006461288779973984, 0.011458745226264, -0.008211079984903336, -0.034342192113399506, -0.016476627439260483, -0.011622149497270584, 0.030828995630145073, 0.0016297904076054692, -0.008551505394279957, 0.01975833624601364, 0.016040882095694542, -0.018165141344070435, 0.031646016985177994, 0.029358353465795517, 0.012200874276459217, 0.00355745363049209

RagResultModel(answer='HLA-DR3 is a gene that has been associated with systemic lupus erythematosus (SLE), a type of lupus. This connection suggests that specific gene variants, such as HLA-DR3, may play a role in the susceptibility to developing lupus.', retriever_result=None)

In [38]:
graph_rag.search("how is HLA-DR3 related to lupus?", retriever_config={"top_k": 2}, return_context=True)

VectorCypherRetriever Cypher parameters: {'top_k': 2, 'vector_index_name': 'chunk-embeddings', 'query_vector': [-0.009233240969479084, 0.0019525309326127172, -0.0010358457220718265, -0.037995196878910065, -0.017608527094125748, 0.0420534648001194, -0.016178598627448082, -0.006744487676769495, -0.010166098363697529, -0.007864597253501415, -0.005784394219517708, 0.010574648156762123, 0.009423897601664066, -0.009423897601664066, -0.01424479391425848, 0.0296062920242548, 0.03009655326604843, -0.018384771421551704, 0.022715408354997635, -0.010200143791735172, -0.009607745334506035, 0.003021571319550276, 0.0010503152152523398, -0.011051290668547153, -0.006509571336209774, 0.01141898613423109, -0.008184628561139107, -0.03434547781944275, -0.016464585438370705, -0.011636880226433277, 0.030831944197416306, 0.0016324996249750257, -0.008552324026823044, 0.01971936970949173, 0.016042416915297508, -0.018112406134605408, 0.03170351684093475, 0.029279451817274094, 0.012256515212357044, 0.003574816742

RagResultModel(answer='HLA-DR3 is a gene that has been associated with Systemic Lupus Erythematosus (SLE), which is also known as lupus.', retriever_result=RetrieverResult(items=[RetrieverResultItem(content="<Record n=<Node element_id='4:a7411f47-9112-4baf-995a-de5abe0a4a7e:988' labels=frozenset({'GeneticVariant', '__Entity__'}) properties={'chunk_index': 12, 'name': 'de novo methylation', 'id': '1727344509.57838:12:11'}>>", metadata=None), RetrieverResultItem(content="<Record n=<Node element_id='4:a7411f47-9112-4baf-995a-de5abe0a4a7e:987' labels=frozenset({'GeneticVariant', '__Entity__'}) properties={'chunk_index': 12, 'name': 'CpG island methylation', 'id': '1727344509.57838:12:10'}>>", metadata=None), RetrieverResultItem(content="<Record n=<Node element_id='4:a7411f47-9112-4baf-995a-de5abe0a4a7e:986' labels=frozenset({'Gene', '__Entity__'}) properties={'chunk_index': 12, 'name': 'FOXP3', 'id': '1727344509.57838:12:9'}>>", metadata=None), RetrieverResultItem(content="<Record n=<Node 