In [1]:
# Copyright (c) 2024 Microsoft Corporation.
# Licensed under the MIT License.

# Neo4j Import of GraphRAG Result Parquet files

This notebook imports the results of the GraphRAG indexing process into the Neo4j Graph database for further processing, analysis or visualization. 

You can also build your own GenAI applications using Neo4j and a number of RAG strategies with LangChain, LlamaIndex, Haystack, and many other frameworks.
See: https://neo4j.com/labs/genai-ecosystem

Here is what the end result looks like:

![](https://dev.assets.neo4j.com/wp-content/uploads/graphrag-neo4j-visualization.png)

## How does it work?

The notebook loads the parquet files from the `output` folder of your indexing process and loads them into Pandas dataframes.
It then uses a batching approach to send a slice of the data into Neo4j to create nodes and relationships and add relevant properties. The id-arrays on most entities are turned into relationships. 

All operations use MERGE, so they are idempotent, and you can run the script multiple times.

If you need to clean out the database, you can run the following statement

```cypher
MATCH (n)
CALL { WITH n DETACH DELETE n } IN TRANSACTIONS OF 25000 ROWS;
```

In [2]:
GRAPHRAG_FOLDER = "./output"

### Depedendencies

We only need Pandas and the neo4j Python driver with the rust extension for faster network transport.

In [3]:
%pip install --quiet pandas neo4j-rust-ext python-dotenv

Note: you may need to restart the kernel to use updated packages.


In [9]:
%pip install pyarrow fastparquet

Collecting pyarrow
  Downloading pyarrow-20.0.0-cp311-cp311-win_amd64.whl.metadata (3.4 kB)
Collecting fastparquet
  Downloading fastparquet-2024.11.0-cp311-cp311-win_amd64.whl.metadata (4.3 kB)
Collecting cramjam>=2.3 (from fastparquet)
  Downloading cramjam-2.10.0-cp311-cp311-win_amd64.whl.metadata (5.1 kB)
Collecting fsspec (from fastparquet)
  Downloading fsspec-2025.3.2-py3-none-any.whl.metadata (11 kB)
Downloading pyarrow-20.0.0-cp311-cp311-win_amd64.whl (25.8 MB)
   ---------------------------------------- 0.0/25.8 MB ? eta -:--:--
   ---------------------------------------- 0.0/25.8 MB ? eta -:--:--
   ---------------------------------------- 0.0/25.8 MB ? eta -:--:--
   ---------------------------------------- 0.0/25.8 MB ? eta -:--:--
   ---------------------------------------- 0.0/25.8 MB ? eta -:--:--
   ---------------------------------------- 0.0/25.8 MB ? eta -:--:--
   ---------------------------------------- 0.0/25.8 MB ? eta -:--:--
   --------------------------------

In [4]:
import time

import pandas as pd
from neo4j import GraphDatabase
from dotenv import load_dotenv
import os

## Neo4j Installation

You can create a free instance of Neo4j [online](https://console.neo4j.io). You get a credentials file that you can use for the connection credentials. You can also get an instance in any of the cloud marketplaces.

If you want to install Neo4j locally either use [Neo4j Desktop](https://neo4j.com/download) or 
the official Docker image: `docker run -e NEO4J_AUTH=neo4j/password -p 7687:7687 -p 7474:7474 neo4j` 

In [5]:
load_dotenv()
NEO4J_URI = os.getenv("NEO4J_URI")
NEO4J_USERNAME = os.getenv("NEO4J_USERNAME") 
NEO4J_PASSWORD = os.getenv("NEO4J_PASSWORD") 
NEO4J_DATABASE = "neo4j"

# Create a Neo4j driver
driver = GraphDatabase.driver(NEO4J_URI, auth=(NEO4J_USERNAME, NEO4J_PASSWORD))

## Batched Import

The batched import function takes a Cypher insert statement (needs to use the variable `value` for the row) and a dataframe to import.
It will send by default 1k rows at a time as query parameter to the database to be inserted.

In [6]:
def batched_import(statement, df, batch_size=1000):
    """
    Import a dataframe into Neo4j using a batched approach.

    Parameters: statement is the Cypher query to execute, df is the dataframe to import, and batch_size is the number of rows to import in each batch.
    """
    total = len(df)
    start_s = time.time()
    for start in range(0, total, batch_size):
        batch = df.iloc[start : min(start + batch_size, total)]
        result = driver.execute_query(
            "UNWIND $rows AS value " + statement,
            rows=batch.to_dict("records"),
            database_=NEO4J_DATABASE,
        )
        print(result.summary.counters)
    print(f"{total} rows in {time.time() - start_s} s.")
    return total

## Indexes and Constraints

Indexes in Neo4j are only used to find the starting points for graph queries, e.g. quickly finding two nodes to connect.
Constraints exist to avoid duplicates, we create them mostly on id's of Entity types.

We use some Types as markers with two underscores before and after to distinguish them from the actual entity types.

The default relationship type here is `RELATED` but we could also infer a real relationship-type from the description or the types of the start and end-nodes.

* `__Entity__`
* `__Document__`
* `__Chunk__`
* `__Community__`
* `__Covariate__`

In [7]:
# create constraints, idempotent operation

statements = [
    "\ncreate constraint chunk_id if not exists for (c:__Chunk__) require c.id is unique",
    "\ncreate constraint document_id if not exists for (d:__Document__) require d.id is unique",
    "\ncreate constraint entity_id if not exists for (c:__Community__) require c.community is unique",
    "\ncreate constraint entity_id if not exists for (e:__Entity__) require e.id is unique",
    "\ncreate constraint entity_title if not exists for (e:__Entity__) require e.name is unique",
    "\ncreate constraint entity_title if not exists for (e:__Covariate__) require e.title is unique",
    "\ncreate constraint related_id if not exists for ()-[rel:RELATED]->() require rel.id is unique",
    "\n",
]

for statement in statements:
    if len((statement or "").strip()) > 0:
        print(statement)
        driver.execute_query(statement)


create constraint chunk_id if not exists for (c:__Chunk__) require c.id is unique

create constraint document_id if not exists for (d:__Document__) require d.id is unique

create constraint entity_id if not exists for (c:__Community__) require c.community is unique

create constraint entity_id if not exists for (e:__Entity__) require e.id is unique

create constraint entity_title if not exists for (e:__Entity__) require e.name is unique

create constraint entity_title if not exists for (e:__Covariate__) require e.title is unique

create constraint related_id if not exists for ()-[rel:RELATED]->() require rel.id is unique


## Import Process

### Importing the Documents

We're loading the parquet file for the documents and create nodes with their ids and add the title property.
We don't need to store text_unit_ids as we can create the relationships and the text content is also contained in the chunks.

In [10]:
doc_df = pd.read_parquet(
    f"{GRAPHRAG_FOLDER}/documents.parquet", columns=["id", "title"]
)
doc_df.head(2)

Unnamed: 0,id,title
0,594b3fec36c9d27c75bc5b0d5f4884bde37ba3108bd1fb...,study_rules_removed_first_page.txt


In [11]:
# Import documents
statement = """
MERGE (d:__Document__ {id:value.id})
SET d += value {.title}
"""

batched_import(statement, doc_df)

{'_contains_updates': True, 'labels_added': 1, 'nodes_created': 1, 'properties_set': 2}
1 rows in 0.3626978397369385 s.


1

### Loading Text Units

We load the text units, create a node per id and set the text and number of tokens.
Then we connect them to the documents that we created before.

In [12]:
text_df = pd.read_parquet(
    f"{GRAPHRAG_FOLDER}/text_units.parquet",
    columns=["id", "text", "n_tokens", "document_ids"],
)
text_df.head(2)

Unnamed: 0,id,text,n_tokens,document_ids
0,4f63e423d748ad1be8673335fdb848623d947c7f5c9b38...,Study Rules\nat Gdańsk University of Technolog...,1200,[594b3fec36c9d27c75bc5b0d5f4884bde37ba3108bd1f...
1,188e3ca3bac1c13b8ee58dfb1e2499653df915db2aba7b...,: odd and even; in the bachelor’s degree studi...,1200,[594b3fec36c9d27c75bc5b0d5f4884bde37ba3108bd1f...


In [13]:
statement = """
MERGE (c:__Chunk__ {id:value.id})
SET c += value {.text, .n_tokens}
WITH c, value
UNWIND value.document_ids AS document
MATCH (d:__Document__ {id:document})
MERGE (c)-[:PART_OF]->(d)
"""

batched_import(statement, text_df)

{'_contains_updates': True, 'labels_added': 12, 'relationships_created': 12, 'nodes_created': 12, 'properties_set': 36}
12 rows in 0.4093461036682129 s.


12

### Loading Nodes

For the nodes we store id, name, description, embedding (if available), human readable id.

In [15]:
df = pd.read_parquet(f"{GRAPHRAG_FOLDER}/entities.parquet")
print(df.head())

                                     id  human_readable_id  \
0  999c9a93-d720-4bc0-ad38-c907021b0cfe                  0   
1  d0444414-309f-485b-b421-a61086329dae                  1   
2  5da9f778-19e6-401f-8377-73e24c4d6c04                  2   
3  dc6b12a3-1da8-460b-b235-10dfbf41d048                  3   
4  402ae422-491c-4b3d-a3e5-3f712874a6c0                  4   

                                               title          type  \
0                                             GDAŃSK           GEO   
1                                            MOJA PG  ORGANIZATION   
2            GDAŃSK UNIVERSITY OF TECHNOLOGY STATUTE         EVENT   
3  ACT OF 20 JULY 2018 - LAW ON HIGHER EDUCATION ...         EVENT   
4                                               ECTS  ORGANIZATION   

                                         description  \
0   Gdańsk is a city where the university is located   
1  Here is the comprehensive description:\n\nMoja...   
2   The statute is a legal basis for t

In [None]:
entity_df = pd.read_parquet(
    f"{GRAPHRAG_FOLDER}/entities.parquet",
    columns=[
        "title",
        "type",
        "description",
        "human_readable_id",
        "id",
        "text_unit_ids",
    ],
)
entity_df.head(2)

Unnamed: 0,title,type,description,human_readable_id,id,text_unit_ids
0,GDAŃSK,GEO,Gdańsk is a city where the university is located,0,999c9a93-d720-4bc0-ad38-c907021b0cfe,[4f63e423d748ad1be8673335fdb848623d947c7f5c9b3...
1,MOJA PG,ORGANIZATION,Here is the comprehensive description:\n\nMoja...,1,d0444414-309f-485b-b421-a61086329dae,[4f63e423d748ad1be8673335fdb848623d947c7f5c9b3...


In [21]:
entity_statement = """
MERGE (e:__Entity__ {id:value.id})
SET e += value {.human_readable_id, .description, title:replace(value.title,'"','')}
WITH e, value
CALL apoc.create.addLabels(e, case when coalesce(value.type,"") = "" then [] else [apoc.text.upperCamelCase(replace(value.type,'"',''))] end) yield node
UNWIND value.text_unit_ids AS text_unit
MATCH (c:__Chunk__ {id:text_unit})
MERGE (c)-[:HAS_ENTITY]->(e)
"""

batched_import(entity_statement, entity_df)

{'_contains_updates': True, 'labels_added': 159, 'relationships_created': 223, 'nodes_created': 159, 'properties_set': 636}
159 rows in 0.7827091217041016 s.


159

### Import Relationships

For the relationships we find the source and target node by name, using the base `__Entity__` type.
After creating the `RELATED` relationships, we set the description as attribute.

In [23]:
df = pd.read_parquet(f"{GRAPHRAG_FOLDER}/relationships.parquet")
print(df.head())

                                     id  human_readable_id  \
0  0dc35283-a2e8-49ac-8dd5-337c34cd2eab                  0   
1  f2b901a3-81ee-4f0c-8430-1c0f8009c364                  1   
2  779482e1-afc2-4cd1-9578-f46d1784aeda                  2   
3  5d2cd25d-04ad-45b7-bb9a-120b5e4f1b96                  3   
4  0a660ff3-8f1f-417d-ad54-ce94cb76e89c                  4   

                                              source  \
0                                             GDAŃSK   
1                                            MOJA PG   
2            GDAŃSK UNIVERSITY OF TECHNOLOGY STATUTE   
3  ACT OF 20 JULY 2018 - LAW ON HIGHER EDUCATION ...   
4                                               ECTS   

                            target  \
0  GDAŃSK UNIVERSITY OF TECHNOLOGY   
1  GDAŃSK UNIVERSITY OF TECHNOLOGY   
2  GDAŃSK UNIVERSITY OF TECHNOLOGY   
3  GDAŃSK UNIVERSITY OF TECHNOLOGY   
4  GDAŃSK UNIVERSITY OF TECHNOLOGY   

                                         description  weight 

In [24]:
rel_df = pd.read_parquet(
    f"{GRAPHRAG_FOLDER}/relationships.parquet",
    columns=[
        "source",
        "target",
        "id",
        "combined_degree",
        "weight",
        "human_readable_id",
        "description",
        "text_unit_ids",
    ],
)
rel_df.head(2)

Unnamed: 0,source,target,id,combined_degree,weight,human_readable_id,description,text_unit_ids
0,GDAŃSK,GDAŃSK UNIVERSITY OF TECHNOLOGY,0dc35283-a2e8-49ac-8dd5-337c34cd2eab,90,32.0,0,Here is the comprehensive description:\n\nGdań...,[4f63e423d748ad1be8673335fdb848623d947c7f5c9b3...
1,MOJA PG,GDAŃSK UNIVERSITY OF TECHNOLOGY,f2b901a3-81ee-4f0c-8430-1c0f8009c364,92,16.0,1,Moja PG is an electronic system of Gdańsk Univ...,[4f63e423d748ad1be8673335fdb848623d947c7f5c9b3...


In [25]:
rel_statement = """
    MATCH (source:__Entity__ {name:replace(value.source,'"','')})
    MATCH (target:__Entity__ {name:replace(value.target,'"','')})
    // not necessary to merge on id as there is only one relationship per pair
    MERGE (source)-[rel:RELATED {id: value.id}]->(target)
    SET rel += value {.combined_degree, .weight, .human_readable_id, .description, .text_unit_ids}
    RETURN count(*) as createdRels
"""

batched_import(rel_statement, rel_df)

{}
218 rows in 0.6588599681854248 s.


218

### Importing Communities

For communities we import their id, title, level.
We connect the `__Community__` nodes to the start and end nodes of the relationships they refer to.

Connecting them to the chunks they orignate from is optional, as the entites are already connected to the chunks.

In [26]:
community_df = pd.read_parquet(
    f"{GRAPHRAG_FOLDER}/communities.parquet",
    columns=["id", "level", "title", "text_unit_ids", "relationship_ids"],
)

community_df.head(2)

Unnamed: 0,id,level,title,text_unit_ids,relationship_ids
0,8d3a360a-038f-4638-af40-e2d6fd21347b,0,Community 0,[5c1b2586f59902f7f368a4dfacbc66a328bc9a5510e9e...,"[0d079949-9156-44c7-86eb-aa99308ccd85, 0f91ca3..."
1,cf5c8c46-2a5b-4ea8-a86d-fe2df78f4c46,0,Community 1,[18e6b531e771de65151463aa769492340cf4579b7cc4e...,"[194c060c-5754-448b-82e8-a6f0be5a4b10, 35e790d..."


In [27]:
statement = """
MERGE (c:__Community__ {community:value.id})
SET c += value {.level, .title}
/*
UNWIND value.text_unit_ids as text_unit_id
MATCH (t:__Chunk__ {id:text_unit_id})
MERGE (c)-[:HAS_CHUNK]->(t)
WITH distinct c, value
*/
WITH *
UNWIND value.relationship_ids as rel_id
MATCH (start:__Entity__)-[:RELATED {id:rel_id}]->(end:__Entity__)
MERGE (start)-[:IN_COMMUNITY]->(c)
MERGE (end)-[:IN_COMMUNITY]->(c)
RETURn count(distinct c) as createdCommunities
"""

batched_import(statement, community_df)

{'_contains_updates': True, 'labels_added': 17, 'nodes_created': 17, 'properties_set': 51}
17 rows in 0.3049497604370117 s.


17

### Importing Community Reports

Fo the community reports we create nodes for each communitiy set the id, community, level, title, summary, rank, and rank_explanation and connect them to the entities they are about.
For the findings we create the findings in context of the communities.

In [29]:
df = pd.read_parquet(f"{GRAPHRAG_FOLDER}/community_reports.parquet")
print(df.head())

                                 id  human_readable_id  community  level  \
0  8e08bd79b7bd45858faadba5dee15a33                 14         14      2   
1  1c7e4ed7253d479c819bc62ac72ad9de                 15         15      2   
2  be7c1720df964201891e26b4143ccab9                 16         16      2   
3  d174963a45ef49c4b8833f942ca97162                  6          6      1   
4  5991d8e1543d46ec970745fb4fe6493c                  7          7      1   

   parent children                                              title  \
0       9       []            ECTS and Industrial Research Internship   
1      12       []           Gdańsk University of Technology Entities   
2      12       []           Gdańsk University of Technology Entities   
3       1       []  Gdańsk University of Technology and its Associ...   
4       1       []                  Moja PG and University Repository   

                                             summary  \
0  The community revolves around ECTS, a system 

In [30]:
community_report_df = pd.read_parquet(
    f"{GRAPHRAG_FOLDER}/community_reports.parquet",
    columns=[
        "id",
        "community",
        "level",
        "title",
        "summary",
        "findings",
        "rank",
        "rating_explanation",
        "full_content",
    ],
)
community_report_df.head(2)

Unnamed: 0,id,community,level,title,summary,findings,rank,rating_explanation,full_content
0,8e08bd79b7bd45858faadba5dee15a33,14,2,ECTS and Industrial Research Internship,"The community revolves around ECTS, a system u...",[{'explanation': 'ECTS is a system used by Gda...,2.0,The impact severity rating is low due to the e...,# ECTS and Industrial Research Internship\n\nT...
1,1c7e4ed7253d479c819bc62ac72ad9de,15,2,Gdańsk University of Technology Entities,The community revolves around Gdańsk Universit...,[{'explanation': 'The Rector of Gdańsk Univers...,6.5,The impact severity rating is moderate due to ...,# Gdańsk University of Technology Entities\n\n...


In [31]:
# Import communities
community_statement = """
MERGE (c:__Community__ {community:value.community})
SET c += value {.level, .title, .rank, .rating_explanation, .full_content, .summary}
WITH c, value
UNWIND range(0, size(value.findings)-1) AS finding_idx
WITH c, value, finding_idx, value.findings[finding_idx] as finding
MERGE (c)-[:HAS_FINDING]->(f:Finding {id:finding_idx})
SET f += finding
"""
batched_import(community_statement, community_report_df)

{'_contains_updates': True, 'labels_added': 54, 'relationships_created': 43, 'nodes_created': 54, 'properties_set': 206}
11 rows in 0.6558113098144531 s.


11

### Importing Covariates

Covariates are for instance claims on entities, we connect them to the chunks where they originate from.

In [34]:
df = pd.read_parquet(f"{GRAPHRAG_FOLDER}/text_units.parquet")
print(df.head())

                                                  id  human_readable_id  \
0  4f63e423d748ad1be8673335fdb848623d947c7f5c9b38...                  1   
1  188e3ca3bac1c13b8ee58dfb1e2499653df915db2aba7b...                  2   
2  7108828b52154a4861c6d71e0b6c3c76ffe9281b38e051...                  3   
3  25af1d5852767cf31e7bec97e71349a5463abbb5757d55...                  4   
4  cc75819dce37316839dbab9b964c47184be2f96c04f33f...                  5   

                                                text  n_tokens  \
0  Study Rules\nat Gdańsk University of Technolog...      1200   
1  : odd and even; in the bachelor’s degree studi...      1200   
2   of Gdańsk Tech, and announced on the website ...      1200   
3   copy of the agreement referred to in paragrap...      1200   
4  5.\n\nFor each subject an online course in the...      1200   

                                        document_ids  \
0  [594b3fec36c9d27c75bc5b0d5f4884bde37ba3108bd1f...   
1  [594b3fec36c9d27c75bc5b0d5f4884bde37b

In [32]:
cov_df = (pd.read_parquet(f"{GRAPHRAG_FOLDER}/covariates.parquet"),)
#                         columns=["id","text_unit_id"])
cov_df.head(2)
# Subject id do not match entity ids

FileNotFoundError: [Errno 2] No such file or directory: './output/covariates.parquet'

In [None]:
# Import covariates
cov_statement = """
MERGE (c:__Covariate__ {id:value.id})
SET c += apoc.map.clean(value, ["text_unit_id", "document_ids", "n_tokens"], [NULL, ""])
WITH c, value
MATCH (ch:__Chunk__ {id: value.text_unit_id})
MERGE (ch)-[:HAS_COVARIATE]->(c)
"""
batched_import(cov_statement, cov_df)

{'_contains_updates': True, 'labels_added': 89, 'relationships_created': 89, 'nodes_created': 89, 'properties_set': 1061}
89 rows in 0.13370895385742188 s.


89

### Visualize your data

You can now [Open] Neo4j on Aura, you need to log in with either SSO or your credentials.

Or open https://workspace-preview.neo4j.io and connect to your local instance, remember the URI is `neo4j://localhost` and `neo4j` as username and `password` as password.

In "Explore" you can explore by using visual graph patterns and then explore and expand further.

In "Query", you can open the left sidebar and explore by clicking on the nodes and relationships.
You can also use the co-pilot to generate Cypher queries for your, here are some examples.

#### Show a few `__Entity__` nodes and their relationships (Entity Graph)

```cypher
MATCH path = (:__Entity__)-[:RELATED]->(:__Entity__)
RETURN path LIMIT 200
```

#### Show the Chunks and the Document (Lexical Graph)

```cypher
MATCH (d:__Document__) WITH d LIMIT 1
MATCH path = (d)<-[:PART_OF]-(c:__Chunk__)
RETURN path LIMIT 100
```

####  Show a Community and it's Entities

```cypher
MATCH (c:__Community__) WITH c LIMIT 1
MATCH path = (c)<-[:IN_COMMUNITY]-()-[:RELATED]-(:__Entity__)
RETURN path LIMIT 100
```

#### Show everything

```cypher
MATCH (d:__Document__) WITH d LIMIT 1
MATCH path = (d)<-[:PART_OF]-(:__Chunk__)-[:HAS_ENTIY]->()-[:RELATED]-()-[:IN_COMMUNITY]->()
RETURN path LIMIT 250
```

We showed the visualization of this last query at the beginning.

If you have questions, feel free to reach out in the GraphRAG discord server: 
https://discord.gg/graphrag