# Neo4j Hello World (Notebook) - SEC Use Case

This notebook connects to a local Neo4j **Community** instance (via Docker), creates a tiny graph, and queries it.

**Assumes**

- The Neo4j service is running at `bolt://localhost:${URI_PORT}` with the user and password set in the `.env` file. **Run `docker compose up -d`** to start it.
- The Ollama service is up on `http://localhost:11434` (the Ollama default). **Run `ollama serve`, then pull the embedding model with `ollama pull nomic-embed-text`** if it has not been pulled yet.

In [None]:
import os
from dotenv import load_dotenv  
import yaml
from pathlib import Path
from pprint import pprint
from termcolor import cprint
import json
from langchain.text_splitter import RecursiveCharacterTextSplitter
from neo4j import GraphDatabase


In [None]:
load_dotenv()  # Load local environment variables

URI = "bolt://localhost:" + os.environ.get("URI_PORT", "7687")  # fall back to Neo4j's default Bolt port
NEO4J_USER = os.environ.get("NEO4J_USER")
NEO4J_PWD = os.environ.get("NEO4J_PASSWORD")
NEO4J_DB = os.getenv("NEO4J_DATABASE", "neo4j")    # 👈 choose DB here

cprint(f"Connecting to Neo4j at {URI} with user {NEO4J_USER} and password {NEO4J_PWD}", "green")

Connecting to Neo4j at bolt://localhost:7687 with user neo4j and password test1234
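
The cell above reads three variables from `.env`, and a missing one is easier to diagnose when it is checked up front. The following is a minimal sketch of such a check; `require_env` and `mask` are hypothetical helpers, not part of this notebook or of `python-dotenv`, and the demo values are the ones shown in the output above.

```python
import os

# Variable names used by the configuration cell above.
REQUIRED_VARS = ("URI_PORT", "NEO4J_USER", "NEO4J_PASSWORD")


def require_env(names, env=None):
    """Return {name: value} for every name, or raise listing the missing ones."""
    env = os.environ if env is None else env
    missing = [name for name in names if not env.get(name)]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    return {name: env[name] for name in names}


def mask(secret):
    """Hide a secret for logging, keeping only its length."""
    return "*" * len(secret)


# Demo with an explicit dict instead of the real environment.
settings = require_env(
    REQUIRED_VARS,
    env={"URI_PORT": "7687", "NEO4J_USER": "neo4j", "NEO4J_PASSWORD": "test1234"},
)
print(
    f"bolt://localhost:{settings['URI_PORT']} as {settings['NEO4J_USER']} "
    f"/ {mask(settings['NEO4J_PASSWORD'])}"
)
```

Calling `require_env(REQUIRED_VARS)` without the `env` argument checks the real environment after `load_dotenv()` has run.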


In [3]:
# load cypher queries from yaml file
queries = yaml.safe_load(Path("queries_SEC.yaml").read_text())
queries.keys()  # list available queries

dict_keys(['constraints', 'create_chunks', 'create_vector_indexes', 'delete_all'])
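
The contents of `queries_SEC.yaml` are not shown in this notebook. Judging from how the keys are used further down, three of them are iterated as lists of statements and `create_chunks` is passed to `session.run` as a single statement with a `$chunkParamDict` parameter. The sketch below mirrors that shape as a Python dict. Every Cypher statement in it is an assumption written for illustration (including the `chunkId` property name, the constraint name, and the 768-dimension cosine index), not the actual file.

```python
# Hypothetical stand-in for queries_SEC.yaml, with the same four keys.
# All Cypher text below is assumed for illustration only.
hypothetical_queries = {
    "constraints": [
        "CREATE CONSTRAINT unique_chunk IF NOT EXISTS "
        "FOR (c:Chunk) REQUIRE c.chunkId IS UNIQUE",
    ],
    "create_chunks": (
        "MERGE (c:Chunk {chunkId: $chunkParamDict.uuid}) "
        "ON CREATE SET c.text = $chunkParamDict.text, "
        "c.formId = $chunkParamDict.formId, "
        "c.f10kItem = $chunkParamDict.f10kItem, "
        "c.chunkSeqId = $chunkParamDict.chunkSeqId"
    ),
    "create_vector_indexes": [
        "CREATE VECTOR INDEX form_10k_chunks IF NOT EXISTS "
        "FOR (c:Chunk) ON (c.embedding) "
        "OPTIONS {indexConfig: {`vector.dimensions`: 768, "
        "`vector.similarity_function`: 'cosine'}}",
    ],
    "delete_all": ["MATCH (n) DETACH DELETE n"],
}


def check_query_shapes(queries):
    """Verify the shapes the notebook relies on: three lists and one string."""
    for key in ("constraints", "create_vector_indexes", "delete_all"):
        if not isinstance(queries[key], list):
            raise TypeError(f"{key} should be a list of Cypher statements")
    if not isinstance(queries["create_chunks"], str):
        raise TypeError("create_chunks should be a single Cypher statement")
    return True


print(check_query_shapes(hypothetical_queries))
```

Running the same check on the dict returned by `yaml.safe_load` catches a common YAML slip, where a single statement is written as a scalar and the `for q in ...` loops would then iterate over its characters.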

## Using the Neo4j driver

In [4]:
driver = GraphDatabase.driver(uri=URI, auth=(NEO4J_USER, NEO4J_PWD))

### 1+2. Create data with rich text (chunks)

In [5]:
# Load data from file: Form 10-K filing for NetApp

file_name = "./data/form10k/0000950170-23-027948.json"

# LangChain Text splitter for chunking process
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 2000,
    chunk_overlap  = 200,
    length_function = len,
    is_separator_regex = False,
)

def split_form10k_data_from_file(file):
    
    chunks_with_metadata = [] # accumulate chunk records
    
    with open(file) as f: # open the json file and close it after reading
        data = json.load(f)
    for item in ['item1','item1a','item7','item7a']: # pull these keys from the json
        
        print(f'Processing {item} from {file}') 
        
        item_text_chunks = text_splitter.split_text(data[item]) # split the text into chunks
        
        chunk_seq_id = 0
        for chunk in item_text_chunks: # keep every chunk of this item
            
            form_id = file[file.rindex('/') + 1:file.rindex('.')] # extract form id from file name
            
            # finally, construct a record with metadata and the chunk text
            chunks_with_metadata.append({
                'text': chunk, 
                'f10kItem': item,
                'chunkSeqId': chunk_seq_id,
                # constructed metadata...
                'formId': f'{form_id}', # pulled from the filename
                'uuid': f'{form_id}-{item}-chunk{chunk_seq_id:04d}',
                # metadata from file...
                'names': data['names'],
                'cik': data['cik'],
                'cusip6': data['cusip6'],
                'source': data['source'],
            })
            
            chunk_seq_id += 1
            
        print(f'\t{item} split into {chunk_seq_id} chunks')
        
    return chunks_with_metadata


chunks_dicts = split_form10k_data_from_file(file_name)

Processing item1 from ./data/form10k/0000950170-23-027948.json
	item1 split into 254 chunks
Processing item1a from ./data/form10k/0000950170-23-027948.json
	item1a split into 1 chunks
Processing item7 from ./data/form10k/0000950170-23-027948.json
	item7 split into 1 chunks
Processing item7a from ./data/form10k/0000950170-23-027948.json
	item7a split into 1 chunks
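
To make the `chunk_size` and `chunk_overlap` settings concrete, here is a simplified fixed-window sketch. It is not the algorithm used by `RecursiveCharacterTextSplitter`, which prefers to break on separators such as paragraph breaks, newlines, and spaces, so its boundaries will differ. `sliding_window_chunks` is a hypothetical function that only illustrates how consecutive chunks share their edges.

```python
def sliding_window_chunks(text, chunk_size, chunk_overlap):
    """Cut text into fixed-size windows; each window starts with the
    last `chunk_overlap` characters of the previous one."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
chunks = sliding_window_chunks(sample, chunk_size=10, chunk_overlap=3)
print(chunks)
# ['abcdefghij', 'hijklmnopq', 'opqrstuvwx', 'vwxyz']
```

With the notebook's settings (`chunk_size = 2000`, `chunk_overlap = 200`), up to 200 characters can be repeated between neighbouring chunks, which helps a sentence that straddles a boundary stay retrievable from either side.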


In [None]:
with driver.session(database=NEO4J_DB) as session:
   
    dbinfo = session.run("CALL db.info()").single()
    cprint(f"\nConnected to Neo4j database: {dbinfo['name']}", "green")
    
    cprint("\nCreating constraints (if not exist)", "green")
    for q in queries["constraints"]:
        session.run(q)
    
    cprint("\nInit Cleanup.", "green")
    for q in queries["delete_all"]:
        session.run(q)
    
    cprint("\nCreate data", "green")
    node_count = 0
    for chunk_dict in chunks_dicts[:20]:
        print(f"Creating `:Chunk` node for chunk ID {chunk_dict['uuid']}")
        session.run(
            queries["create_chunks"],
            parameters={'chunkParamDict': chunk_dict},
        )
        node_count += 1
        
    print(f"Created {node_count} nodes")

== Connected to Neo4j database: neo4j

== Creating constraints (if not exist)

== Init Cleanup.
 ok

== Creating sample data
Creating `:Chunk` node for chunk ID 0000950170-23-027948-item1-chunk0000
Creating `:Chunk` node for chunk ID 0000950170-23-027948-item1-chunk0001
Creating `:Chunk` node for chunk ID 0000950170-23-027948-item1-chunk0002
Creating `:Chunk` node for chunk ID 0000950170-23-027948-item1-chunk0003
Creating `:Chunk` node for chunk ID 0000950170-23-027948-item1-chunk0004
Creating `:Chunk` node for chunk ID 0000950170-23-027948-item1-chunk0005
Creating `:Chunk` node for chunk ID 0000950170-23-027948-item1-chunk0006
Creating `:Chunk` node for chunk ID 0000950170-23-027948-item1-chunk0007
Creating `:Chunk` node for chunk ID 0000950170-23-027948-item1-chunk0008
Creating `:Chunk` node for chunk ID 0000950170-23-027948-item1-chunk0009
Creating `:Chunk` node for chunk ID 0000950170-23-027948-item1-chunk0010
Creating `:Chunk` node for chunk ID

### 3. Create property embeddings (first step into RAG) 

In [7]:
with driver.session(database=NEO4J_DB) as session:
    
    # Create vector index
    for q in queries["create_vector_indexes"]:
        session.run(q)
    
    # Show created vector indexes
    results = session.run("SHOW VECTOR INDEXES")
    idx = list(results)
    cprint(f"\nFound {len(idx)} vector index entries.", "green")
    for r in idx:
        cprint("-"*20,"green")
        pprint(dict(r))

Found 4 vector index entries.
--------------------
{'entityType': 'NODE',
 'id': 19,
 'indexProvider': 'vector-2.0',
 'labelsOrTypes': ['Company'],
 'lastRead': neo4j.time.DateTime(2025, 9, 22, 16, 18, 56, 637000000, tzinfo=<UTC>),
 'name': 'company_node_info_idx',
 'owningConstraint': None,
 'populationPercent': 100.0,
 'properties': ['embedding'],
 'readCount': 1,
 'state': 'ONLINE',
 'type': 'VECTOR'}
--------------------
{'entityType': 'NODE',
 'id': 16,
 'indexProvider': 'vector-2.0',
 'labelsOrTypes': ['Chunk'],
 'lastRead': None,
 'name': 'form_10k_chunks',
 'owningConstraint': None,
 'populationPercent': 100.0,
 'properties': ['embedding'],
 'readCount': None,
 'state': 'ONLINE',
 'type': 'VECTOR'}
--------------------
{'entityType': 'RELATIONSHIP',
 'id': 14,
 'indexProvider': 'vector-2.0',
 'labelsOrTypes': ['KNOWS'],
 'lastRead': neo4j.time.DateTime(2025, 9, 22, 16, 18, 56, 684000000, tzinfo=<UTC>),
 'name': 'knows_relationship_info_idx',

In [None]:
from helper_neo4j import vectorize_property
    
with driver.session(database=NEO4J_DB) as session:
    
    vectorize_property(element = "node",
                       label = "Chunk",
                       source_property = "text",
                       session = session)

Generating embeddings for (n:Chunk) on n.text
  text: >Item 1.  
Business


Overview


NetApp, Inc. (NetApp, we, us or the Company) is a global cloud-led, data-centric software company. We were incorporated in 1992 and are headquartered in San Jose, California. Building on more than three decades of innovation, we give customers the freedom to manage applications and data across hybrid multicloud environments. Our portfolio of cloud services, and storage infrastructure, powered by intelligent data management software, enables applications to run faster, more reliably, and more securely, all at a lower cost.


Our opportunity is defined by the durable megatrends of data-driven digital and cloud transformations. NetApp helps organizations meet the complexities created by rapid data and cloud growth, multi-cloud management, and the adoption of next-generation technologies, such as AI, Kubernetes, and modern databases. Our modern approach to hybrid, multicloud infrastructure and 
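
The internals of `vectorize_property` live in `helper_neo4j` and are not shown here. As a rough idea of what producing one embedding involves, the sketch below builds a request for the local Ollama server using only the standard library. It assumes Ollama's `/api/embeddings` endpoint, which takes a `model` and a `prompt` and returns an `embedding` list; `build_embedding_request` and `embed` are hypothetical names, not functions from the helper module.

```python
import json
import urllib.request

# Ollama's default address, as listed in the assumptions at the top.
OLLAMA_URL = "http://localhost:11434/api/embeddings"


def build_embedding_request(text, model="nomic-embed-text", url=OLLAMA_URL):
    """Build (without sending) the POST request that asks for one embedding."""
    payload = json.dumps({"model": model, "prompt": text}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def embed(text, timeout=30):
    """Send the request and return the embedding as a list of floats.
    Requires a running Ollama server with the model already pulled."""
    request = build_embedding_request(text)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read())["embedding"]


# Only the request is built here, so this runs without a server.
request = build_embedding_request("NetApp, Inc. is a global cloud-led, data-centric software company.")
print(request.get_method(), request.full_url)
```

A helper like `vectorize_property` would presumably call something similar for each `n.text` and write the returned vector to the `embedding` property that the vector index covers.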

### 4. Search 

In [None]:
from helper_neo4j import neo4j_vector_search

with driver.session(database=NEO4J_DB) as session:

    result = neo4j_vector_search(element = "node",
                                 query = 'In a single sentence, tell me about Netapp.',
                                 index = 'form_10k_chunks',
                                 top_k = 10,
                                 session = session
                                 )

    # Iterate while the session is still open, in case the helper returns a lazy result
    for r in result:
        print(dict(r))


Generating embeddings for question 'In a single sentence, tell me about Netapp.'
  text: In a single sentence, tell me about Netapp.
  vec: [0.023942923, 0.06676347, -0.123865075, -0.024302177, 0.07153227, -0.02668256, 0.007319313, -0.033634715, -0.017225634, -0.0583832]

{'score': 0.8420853614807129, 'node': {'cik': '1002047', 'text': '•\nNetApp Keystone is our pay-as-you-grow, storage-as-a-service (STaaS) offering that delivers a seamless hybrid cloud experience for those preferring operating expense consumption models to upfront capital expense or leasing. With a unified management console and monthly bill for both on-premises and cloud data storage services, Keystone lets organizations provision and monitor, and even move storage spend across their hybrid cloud environment for financial and operational flexibility. \n\n\n•\nNetApp Global Support supplies systems, processes, and people wherever needed to provide continuous operation in complex and critical environments, wi
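
The `score` field in each record ranks how close a chunk's embedding is to the question's embedding. The sketch below assumes the `form_10k_chunks` index was created with the cosine similarity function (the index definition is not shown in this notebook) and that the reported score is the raw cosine mapped from [-1, 1] into [0, 1], as the Neo4j documentation describes for cosine indexes. The function names are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def normalized_score(a, b):
    """Map cosine similarity from [-1, 1] to [0, 1] (assumed score convention)."""
    return (1 + cosine_similarity(a, b)) / 2


print(normalized_score([1.0, 0.0], [1.0, 0.0]))   # same direction
print(normalized_score([1.0, 0.0], [0.0, 1.0]))   # orthogonal
print(normalized_score([1.0, 0.0], [-1.0, 0.0]))  # opposite direction
```

Under that convention, identical directions score 1.0, unrelated (orthogonal) vectors score 0.5, and opposite vectors score 0.0, so a value such as 0.84 indicates a fairly close but not identical match.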