# Neo4j Hello World (Notebook)

This notebook connects to a local Neo4j **Community** instance (via Docker), creates a tiny graph, and queries it into a pandas DataFrame.

**Assumes** 
 
 
- Neo4j service is running at `bolt://localhost:${URI_PORT}` with the user and password set in the `.env` file. **Run `docker compose up -d`**.
- Ollama service is up on `http://localhost:11434` (ollama default). **Run `ollama serve` and pull the model `ollama pull nomic-embed-text`** (if not pulled yet).



In [1]:
import os
from dotenv import load_dotenv  
import yaml
from pathlib import Path
from pprint import pprint
from termcolor import cprint
import ollama
import requests
from typing import Literal

from langchain_neo4j import Neo4jGraph


In [2]:
load_dotenv()  # Load local environment variables

URI = "bolt://localhost:" + os.environ.get("URI_PORT")
NEO4J_USER = os.environ.get("NEO4J_USER")
NEO4J_PWD = os.environ.get("NEO4J_PASSWORD")
NEO4J_DB = os.getenv("NEO4J_DATABASE", "neo4j")    # 👈 choose DB here
EMBED_MODEL = "nomic-embed-text:latest"

cprint(f"Connecting to Neo4j at {URI} with user {NEO4J_USER} and password {NEO4J_PWD}", "green")

[32mConnecting to Neo4j at bolt://localhost:7687 with user neo4j and password test1234[0m


In [3]:
# load cypher queries from yaml file
queries = yaml.safe_load(Path("queries.yaml").read_text())
queries.keys()  # list available queries

dict_keys(['constraints', 'create_seed', 'show_people', 'show_companies', 'match_adjacency', 'show_info', 'add_info', 'create_vector_indexes', 'delete_all'])

## Using Langchain wrapper for Neo4j

In [4]:
kg = Neo4jGraph(url=URI, username=NEO4J_USER, password=NEO4J_PWD, database=NEO4J_DB)

### 1. Create sample data

<p align="center">
  <img src="media/KG_step1_populate_graph.svg" width="550">
</p>


- **Entities**: Person and Company nodes with unique constraints (unique name and uuid)
- **Relationships**: KNOWS (person-to-person) and WORKS_AT (person-to-company)
- **Properties**: Basic attributes (name, age, education, industry)


In [5]:
# KG

wipe_at_init = True # delete everything at the start 

# Interact directly with KG, no need for driver context.

cprint(f"\n== Connected to Neo4j database: {NEO4J_DB}", "green")

cprint("\n== Creating constraints (if not exist)", "green")
for q in queries["constraints"]:
    kg.query(q)
print(" ok")

cprint("\n== Init Cleanup.", "green")
if wipe_at_init:
    for q in queries["delete_all"]:
        kg.query(q)
    print(" ok")
else:
    print(" skipped")
    
cprint("\n== Creating sample data", "green")
kg.query(queries["create_seed"])
print(" ok")

cprint("\n== Query: list all people", "green")
records = kg.query(queries["show_people"]) # <class 'list'>
for r in records:
    print(r)
    
cprint("\n== Query: list all companies", "green")
records = kg.query(queries["show_companies"]) # <class 'list'>
for r in records:
    print(r)

cprint("\n== Query: adjacency (who knows whom)", "green")
records = kg.query(queries["match_adjacency"]) # <class 'list'>
for r in records:
    print(r)

[32m
== Connected to Neo4j database: neo4j[0m
[32m
== Creating constraints (if not exist)[0m
 ok
[32m
== Init Cleanup.[0m
 ok
[32m
== Creating sample data[0m
 ok
[32m
== Query: list all people[0m
{'name': 'Paula', 'age': 25, 'p.gender': 'female', 'education': 'Computer Engineering'}
{'name': 'Guillermo', 'age': 26, 'p.gender': 'male', 'education': 'Industrial Engineering'}
{'name': 'Gabriela', 'age': 26, 'p.gender': 'female', 'education': 'Physics'}
{'name': 'Iria', 'age': 27, 'p.gender': 'female', 'education': 'Physics'}
{'name': 'Cristina', 'age': 27, 'p.gender': 'female', 'education': 'Physics'}
[32m
== Query: list all companies[0m
{'name': 'Indra', 'industry': 'Engineering'}
{'name': 'CIEMAT', 'industry': 'Scientific Research'}
{'name': 'CBM', 'industry': 'Scientific Research'}
[32m
== Query: adjacency (who knows whom)[0m
{'person': 'Cristina', 'knows': ['Gabriela', 'Iria'], 'works_at': 'CBM'}
{'person': 'Gabriela', 'knows': ['Cristina', 'Iria'], 'works_at': 'CIEMAT'}

### 2. Add rich text info

<p align="center">
  <img src="media/KG_step2_generate_rich_descriptions.svg" width="750">
</p>


In [6]:
cprint("\n== Query: Adding descriptions, appearance and summaries", "green")
for q in queries["add_info"]:
    kg.query(q)
print(" ok")

records = kg.query(queries["show_info"])
for r in records:
    print(r)


[32m
== Query: Adding descriptions, appearance and summaries[0m
 ok
{'labels(n)': ['Person'], 'n.name': 'Iria', 'n.info': 'Iria is a female of 27 years old and studied Physics.Iria has blue eyes and long brunette and wavy hair. She likes to paint her nails in red or purple colours. She usually wears long earrings.'}
{'labels(n)': ['Person'], 'n.name': 'Guillermo', 'n.info': 'Guillermo is a male of 26 years old and studied Industrial Engineering.Guillermo has brown eyes and short hair. He has a very fancy shirt that he takes to all important events. He shaved his head this summer.'}
{'labels(n)': ['Person'], 'n.name': 'Gabriela', 'n.info': "Gabriela is a female of 26 years old and studied Physics.Gabriela has long curly hair with babylights. She's petite and likes to wear hippie-style clothes."}
{'labels(n)': ['Person'], 'n.name': 'Paula', 'n.info': 'Paula is a female of 25 years old and studied Computer Engineering.Paula short hair in a wolfcut style. She wears long and wide pants an

### 3. Create property embeddings (first step into RAG) 

<p align="center">
  <img src="media/KG_step3_generate_property_embeddings.svg" width="750">
</p>

**RAG** implementation requires selecting a **property to embed and use for similarity searches**. 

Description properties containing **rich text** work well for this purpose, as they provide richer semantic information. In our example, we'll use *info* and *info*.

In order to do so, we create two vector indexes in Neo4j:

- **Vector index *person_node_info_idx***: based on property ***info_emb*** for nodes of type "Person"
-  **Vector index *company_node_info_idx***: based on property ***info_emb*** for nodes of type "Company"

After that, we create the embeddings (this happens for both Person nodes and Company nodes):

- **Property *info*** ---`nomic-embed-text`---> **Property *info_emb***


In [7]:
def embed_property(element: Literal["node", "relationship"],type_name:str, property_name:str):
        
    if element == "node":
        cprint(f"\nGenerating embeddings for (n:{type_name}) on n.{property_name}", "green")
        records = list(kg.query(f"""
            MATCH (n:{type_name})
            WHERE n.{property_name} IS NOT NULL AND n.{property_name} <> ''
            AND n.{property_name}_emb IS NULL
            RETURN n.uuid AS uuid, n.{property_name} AS txt
            """))
        for r in records:
            vec = ollama.embed(model="nomic-embed-text", input=r["txt"])["embeddings"][0]
            kg.query(
                f"""
                MATCH (n:{type_name} {{uuid: $uuid}})
                SET n.{property_name}_emb = $vec
                """,
                params={"uuid": r["uuid"], "vec": vec},
            )
            print(f"  text: {r['txt']}\n  vec: {vec[:3]}")
            
    elif element == "relationship":
        
        cprint(f"\nGenerating embeddings for [r:{type_name}] on r.{property_name}", "green")
        records = list(kg.query(f"""
            MATCH ()-[r:{type_name}]-()
            WHERE r.{property_name} IS NOT NULL AND r.{property_name} <> ''
            AND r.{property_name}_emb IS NULL
            RETURN r.uuid AS uuid, r.{property_name} AS txt
            """))
        for r in records:
            vec = ollama.embed(model="nomic-embed-text", input=r["txt"])["embeddings"][0]
            kg.query(
                f"""
                MATCH ()-[r:{type_name} {{uuid: $uuid}}]-()
                SET r.{property_name}_emb = $vec
                """,
                params={"uuid": r["uuid"], "vec": vec},
            )
            print(f"  text: {r['txt']}\n  vec: {vec[:3]}")
            
# Create vector indexes (once)

for q in queries["create_vector_indexes"]:
    kg.query(q)

# Show created vector indexes
results = kg.query("SHOW VECTOR INDEXES")
idx = list(results)
cprint(f"\nFound {len(idx)} vector index entries.", "green")
for r in idx:
    cprint("-"*20,"green")
    pprint(r)

[32m
Found 4 vector index entries.[0m
[32m--------------------[0m
{'entityType': 'NODE',
 'id': 15,
 'indexProvider': 'vector-2.0',
 'labelsOrTypes': ['Company'],
 'lastRead': None,
 'name': 'company_node_info_idx',
 'owningConstraint': None,
 'populationPercent': 100.0,
 'properties': ['info_emb'],
 'readCount': None,
 'state': 'ONLINE',
 'type': 'VECTOR'}
[32m--------------------[0m
{'entityType': 'NODE',
 'id': 5,
 'indexProvider': 'vector-2.0',
 'labelsOrTypes': ['Chunk'],
 'lastRead': neo4j.time.DateTime(2025, 9, 22, 14, 20, 17, 666000000, tzinfo=<UTC>),
 'name': 'form_10k_chunks',
 'owningConstraint': None,
 'populationPercent': 100.0,
 'properties': ['text_emb'],
 'readCount': 4,
 'state': 'ONLINE',
 'type': 'VECTOR'}
[32m--------------------[0m
{'entityType': 'RELATIONSHIP',
 'id': 16,
 'indexProvider': 'vector-2.0',
 'labelsOrTypes': ['KNOWS'],
 'lastRead': None,
 'name': 'knows_relationship_info_idx',
 'owningConstraint': None,
 'populationPercent': 100.0,
 'propertie

In [8]:
# (p:PERSON): create embeddings only for nodes missing them
embed_property("node", "Person", "info")

# (c:COMPANY): create embeddings only for nodes missing them
embed_property("node", "Company", "info")

# [r:KNOWS]: create embeddings only for nodes missing them
embed_property("relationship", "KNOWS", "info")

[32m
Generating embeddings for (n:Person) on n.info[0m
  text: Iria is a female of 27 years old and studied Physics.Iria has blue eyes and long brunette and wavy hair. She likes to paint her nails in red or purple colours. She usually wears long earrings.
  vec: [0.031421136, 0.032104157, -0.15766615]
  text: Guillermo is a male of 26 years old and studied Industrial Engineering.Guillermo has brown eyes and short hair. He has a very fancy shirt that he takes to all important events. He shaved his head this summer.
  vec: [-0.023262369, 0.05322417, -0.15030223]
  text: Gabriela is a female of 26 years old and studied Physics.Gabriela has long curly hair with babylights. She's petite and likes to wear hippie-style clothes.
  vec: [0.028215451, 0.02993047, -0.18000375]
  text: Paula is a female of 25 years old and studied Computer Engineering.Paula short hair in a wolfcut style. She wears long and wide pants and sneakers to the laboratory.
  vec: [0.011417965, 0.040603343, -0.16852917]


### 4. Search 

Whenever we query this graph, we can use two different but complementary search techniques:

1. **KG Retreival**: through **Neo4J Cypher Query Language (CQL)** we can query precise entities and relations. The input query must be translated into CQL to get the desired results.

2. **Vector Retrieval**: **embedding the input query**, we can make a vector search against the vector indexes defined above.

The results will be a combination of both searches.

<p align="center">
  <img src="media/KGRAG_schema.svg">
</p>


In [9]:
# From query/question to cypher query language (cql) TODO
def create_question_cql(question:str):
    cql_query = ""
    #cql_query = "MATCH (person)-[:KNOWS]-(:Person {name:'Cristina'})" 
    return cql_query

In [10]:
# From user query/question to question embedding
def create_question_embedding(question:str):
    cprint(f"\nGenerating embeddings for question '{question}'", "green")
    vec = ollama.embed(model="nomic-embed-text", input=question)["embeddings"][0] 
    print(f"  text: {question}\n  vec: {vec[:10]}\n")
    return vec

In [11]:
# Query Nodes

def neo4j_node_vector_search(question, index_name):
    """Search for similar nodes using the Neo4j vector index"""
  
      
    top_k = 5
    vector_search_query = f"""
    CALL db.index.vector.queryNodes($index_name, $top_k, $question_embedding) 
    YIELD node, score
    {create_question_cql(question)}
    RETURN score, node.text AS text
    """

    result = kg.query(
    vector_search_query, 
    {"index_name": index_name, 
        "top_k": top_k,
        "question_embedding": create_question_embedding(question),}
    )

    return result



question = "Who shaved its head this summer?"
index_name = "person_node_info_idx"
result = neo4j_node_vector_search(question, index_name)
for r in result:
        print(dict(r))

question = "Which company investigates Cancer?"
index_name = "company_node_info_idx"
result = neo4j_node_vector_search(question, index_name)
for r in result:
        print(dict(r))


        
        
# Query Relationships

def neo4j_relationship_vector_search(question, index_name):
    """Search for similar nodes using the Neo4j vector index""" 
    top_k = 5
    vector_search_query = f"""
    CALL db.index.vector.queryRelationships($index_name, $top_k, $question_embedding) 
    YIELD relationship, score
    {create_question_cql(question)}
    RETURN score, relationship {{.* , info_emb: ""}}
    ORDER BY score DESC
    """

    result = kg.query(
    vector_search_query, 
    {"index_name": index_name, 
        "top_k": top_k,
        "question_embedding": create_question_embedding(question),}
    )

    return result

question = "Who is helping Iria at work?"
index_name = "knows_relationship_info_idx"
result  = neo4j_relationship_vector_search(question, index_name)
for r in result:
    print(dict(r))
        
print("Driver closed.")

[32m
Generating embeddings for question 'Who shaved its head this summer?'[0m


  text: Who shaved its head this summer?
  vec: [0.022573026, -0.015476549, -0.17212495, 0.0018683294, -0.038688686, 0.044817436, 0.01842398, 0.013745231, 0.050375726, 0.008345599]

{'score': 0.8374781608581543, 'text': None}
{'score': 0.7817301750183105, 'text': None}
{'score': 0.7744836807250977, 'text': None}
{'score': 0.7634782791137695, 'text': None}
{'score': 0.7479281425476074, 'text': None}
[32m
Generating embeddings for question 'Which company investigates Cancer?'[0m
  text: Which company investigates Cancer?
  vec: [0.03314031, 0.030437501, -0.19477448, 0.01064411, 0.07048723, 0.00393343, -0.058452524, 0.020778723, 0.0035936916, 0.003476617]

{'score': 0.7897243499755859, 'text': None}
{'score': 0.7851877212524414, 'text': None}
{'score': 0.7674016952514648, 'text': None}
[32m
Generating embeddings for question 'Who is helping Iria at work?'[0m
  text: Who is helping Iria at work?
  vec: [-0.039116025, 0.022910928, -0.2054694, -0.0038209686, 0.038757026, -0.014443832, 0.