# Neo4j GraphRAG Retriever Notebook

This notebook demonstrates how to use various retrievers and Cypher patterns with Neo4j GraphRAG for asset manager and cybersecurity risk retrieval.


In [None]:
import os
from neo4j import GraphDatabase
from neo4j_graphrag.llm import OpenAILLM
from neo4j_graphrag.embeddings import OpenAIEmbeddings
from neo4j_graphrag.retrievers import VectorRetriever, VectorCypherRetriever, Text2CypherRetriever
from neo4j_graphrag.generation import GraphRAG
from dotenv import load_dotenv
import pandas as pd

from IPython.display import display, HTML
import textwrap

# Wrap long lines in cell output instead of scrolling horizontally
display(HTML("<style>.output_area pre {white-space: pre-wrap; word-break: break-word;}</style>"))

# Load environment variables
load_dotenv()
NEO4J_URI = os.getenv('NEO4J_URI')
NEO4J_USER = os.getenv('NEO4J_USERNAME')
NEO4J_PASSWORD = os.getenv('NEO4J_PASSWORD')
OPENAI_API_KEY = os.getenv('OPENAI_API_KEY')

driver = GraphDatabase.driver(NEO4J_URI, auth=(NEO4J_USER, NEO4J_PASSWORD))


# Initialize LLM and Embedder

This section sets up the Large Language Model (LLM) and the embedding model for use in retrieval-augmented generation (RAG) workflows.

- **LLM**: Uses OpenAI's GPT-4o model via the `OpenAILLM` interface.
- **Embedder**: Uses OpenAI's embedding API via the `OpenAIEmbeddings` class.

> **Note:**  
> - The `OPENAI_API_KEY` must be set in your environment variables or in a `.env` file.
> - To adjust model parameters (such as response format or temperature), uncomment the `model_params` dictionary and pass it to `OpenAILLM` through its `model_params` argument.
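
A missing setting can be easier to diagnose before any client is created. The sketch below is one way to check this; `missing_settings` and `REQUIRED_VARS` are hypothetical helpers, not part of `neo4j_graphrag`, and the variable names are the ones read in the setup cell of this notebook.

```python
import os

# Names read with os.getenv() in the setup cell of this notebook.
REQUIRED_VARS = ["NEO4J_URI", "NEO4J_USERNAME", "NEO4J_PASSWORD", "OPENAI_API_KEY"]


def missing_settings(env, required=REQUIRED_VARS):
    """Return the names in `required` that are absent or empty in `env`."""
    return [name for name in required if not env.get(name)]


# Hypothetical usage: report what is missing in the current environment.
missing = missing_settings(os.environ)
if missing:
    print("Missing settings:", ", ".join(missing))
else:
    print("All required settings are present.")
```

Running this right after `load_dotenv()` would surface a misnamed or empty variable before the driver or LLM is constructed.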


In [None]:
# --- Initialize LLM and Embedder ---
# Optional: uncomment and pass model_params=model_params to OpenAILLM below.
# model_params = {
#     "response_format": {"type": "json_object"},  # use json_object formatting for best results
#     "temperature": 0,  # turning temperature down for more deterministic results
# }

llm = OpenAILLM(model_name='gpt-4o', api_key=OPENAI_API_KEY)
embedder = OpenAIEmbeddings(api_key=OPENAI_API_KEY)

# Initialize Vector Retriever

This section sets up the vector-based retriever for semantic search over your Neo4j knowledge graph.

- **Query:**  
  The example query asks, "What are the risks that Apple faces?"

- **VectorRetriever:**  
  - Connects to the Neo4j database using the provided `driver`.
  - Uses the `chunkEmbeddings` vector index for efficient semantic retrieval.
  - The `embedder` generates embeddings for the query.
  - Returns the `text` property from matching chunks.

> **Tip:**  
> You can modify the `return_properties` list to include additional properties from the retrieved nodes if needed.
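
To build intuition for what the vector index does, here is a toy, pure-Python version of top-k similarity ranking. It assumes cosine similarity (the actual metric depends on how `chunkEmbeddings` was created) and uses made-up three-dimensional vectors in place of real embeddings. `cosine_similarity` and `top_k` are illustrative helpers; the real search runs inside Neo4j.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, chunks, k):
    """Score every (text, vector) pair against the query and keep the k best."""
    scored = [(cosine_similarity(query_vec, vec), text) for text, vec in chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Made-up vectors standing in for real chunk embeddings.
toy_chunks = [
    ("supply chain disruption", [0.9, 0.1, 0.0]),
    ("quarterly dividend policy", [0.0, 0.2, 0.9]),
    ("cybersecurity incidents", [0.7, 0.6, 0.1]),
]
query_vec = [1.0, 0.2, 0.0]  # stand-in for the embedded query

for score, text in top_k(query_vec, toy_chunks, k=2):
    print(f"{score:.3f}  {text}")
```

The ranking depends only on vector similarity, which is why the retriever can return chunks from any company whose text resembles the query.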

In [None]:
# --- Initialize Vector Retriever ---

query = "What are the risks that Apple faces?"

vector_retriever = VectorRetriever(
    driver=driver,
    index_name='chunkEmbeddings',
    embedder=embedder,
    return_properties=['text'])

# Simple Vector Search Diagnostic 

This section performs a diagnostic semantic search using the vector retriever.

- **Purpose:**  
  Quickly test the vector search by retrieving the top 10 most relevant text chunks from the Neo4j knowledge graph for the given query.

- **How it works:**  
  1. `vector_retriever.search()` runs the query and returns the top 10 matches based on vector similarity.
  2. The results are formatted into a pandas DataFrame, displaying:
     - The similarity score (`Score`)
     - A snippet of up to 70 characters from the retrieved content (`Content`)
     - The unique identifier for each chunk (`ID`)

- **Usage:**  
  This diagnostic helps you verify that the vector search is working and inspect the quality of the top results for your query.
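
The row-building step can be tried without a database. This sketch uses hypothetical stand-in objects that mimic the `content` and `metadata` attributes used in the diagnostic cell. It assumes `content` is the string form of the returned properties, which would make the first 10 characters the `{'text': '` prefix that the `[10:80]` slice skips. `to_rows` is an illustrative helper.

```python
from types import SimpleNamespace


def to_rows(items, start=10, end=80):
    """Build (score, snippet, id) tuples, slicing content as the diagnostic cell does."""
    return [
        (item.metadata["score"], item.content[start:end], item.metadata["id"])
        for item in items
    ]


# Hypothetical stand-ins; real items come from vector_retriever.search().items.
fake_items = [
    SimpleNamespace(
        content="{'text': 'Supply constraints may affect product availability.'}",
        metadata={"score": 0.91, "id": "chunk-1"},
    ),
    SimpleNamespace(
        content="{'text': 'Interest rate changes may affect investment income.'}",
        metadata={"score": 0.87, "id": "chunk-2"},
    ),
]

for score, snippet, chunk_id in to_rows(fake_items):
    print(score, chunk_id, snippet)
```

If the snippets in the real table start mid-word, adjusting `start` is a quick way to see how the content string is actually laid out.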


In [None]:
# --- Simple Vector Search Diagnostic ---

result = vector_retriever.search(query_text=query, top_k=10)

# One row per hit: similarity score, a short content snippet, and the node ID
result_table = pd.DataFrame(
    [(item.metadata['score'], item.content[10:80], item.metadata['id']) for item in result.items],
    columns=['Score', 'Content', 'ID'],
)
result_table

# Retrieval-Augmented Generation (RAG) Query

This section demonstrates how to use the `GraphRAG` class to perform a retrieval-augmented generation workflow:

- **GraphRAG**:  
  Combines a Large Language Model (LLM) with a vector-based retriever to answer questions using both semantic search and generative reasoning.

- **How it works:**  
  1. The retriever (`vector_retriever`) finds the most relevant text chunks from the Neo4j graph based on the input query.
  2. The LLM (`llm`) uses the retrieved context to generate a natural language answer.
  3. The answer is printed directly.

- **Usage:**  
  This approach grounds the LLM's answer in text retrieved from your knowledge graph instead of relying on the model's training data alone.
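
Steps 1 and 2 above amount to placing retrieved text into the prompt sent to the LLM. The sketch below illustrates the idea with a hypothetical `build_prompt` helper and made-up chunks. The wording is an assumption for illustration and is not the template that `GraphRAG` uses internally.

```python
def build_prompt(question, chunks):
    """Combine retrieved chunks and the user question into one prompt string."""
    # Number each chunk so the model (and the reader) can tell them apart.
    context = "\n\n".join(f"[{i}] {chunk}" for i, chunk in enumerate(chunks, start=1))
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


# Made-up chunks standing in for retriever output.
example_chunks = [
    "The Company depends on a global supply chain.",
    "The Company is exposed to foreign exchange rate fluctuations.",
]
prompt = build_prompt("What are the risks that Apple faces?", example_chunks)
print(prompt)
```

Printing the assembled prompt makes one limitation visible: the model only sees what the retriever returned, so retrieval quality bounds answer quality.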


In [None]:
rag = GraphRAG(
    llm=llm,
    retriever=vector_retriever
)
#print(textwrap.fill(rag.search(query).answer, width=80))
print(rag.search(query).answer)

## Advanced RAG: Contextual Cypher Retrieval

This section demonstrates how to use a custom Cypher query with the `VectorCypherRetriever` to provide richer, more contextual answers.

- **Custom Cypher Query:**  
  The `detail_context_query` matches text chunks (`node`) to their source documents, associated companies, and the risk factors those companies face.  
  It returns:
  - The company name
  - The context text from the chunk
  - A list of distinct risk factors

- **VectorCypherRetriever:**  
  - Performs semantic search using the `chunkEmbeddings` vector index.
  - Applies the custom Cypher query to retrieve relevant context and associated risk factors.

- **GraphRAG:**  
  - Combines the LLM and the custom retriever to answer the question:  
    _"What are the top risk factors that Apple faces?"_

- **Usage:**  
  This approach attaches graph context (company and risk factors) to each retrieved chunk. The traversal starts from whichever chunks the vector search returns, so the answer is only as entity-specific as those chunks.


In [99]:
# --- VectorCypherRetriever Example: Detailed Search with Context ---  OR WHY THIS IS A BAD QUERY :)

detail_context_query = """
WITH node
MATCH (node)-[:FROM_DOCUMENT]->(doc:Document)-[:FILED]->(company:Company)-[:FACES_RISK]->(risk:RiskFactor)
RETURN company.name AS company,  node.text AS context, collect(DISTINCT risk.name) AS risks
"""

vector_cypher_retriever = VectorCypherRetriever(
    driver=driver,
    index_name='chunkEmbeddings',
    embedder=embedder,
    retrieval_query=detail_context_query
)

rag = GraphRAG(llm=llm, retriever=vector_cypher_retriever)
query = "What are the top risk factors that Apple faces?"
print(rag.search(query).answer)

The top risk factors that Apple faces include:

1. **Macroeconomic and Industry Risks**: These include adverse economic conditions such as slow growth or recession, high unemployment, inflation, tighter credit, higher interest rates, and currency fluctuations affecting consumer confidence and spending.

2. **Global Operations and International Trade**: Apple relies heavily on international operations for sales and manufacturing, making it vulnerable to political events, trade restrictions, international disputes, and other business interruptions.

3. **Supply Chain Disruptions**: Given the complexity and scale of its global supply chain, Apple is exposed to risks from supply shortages, price fluctuations, and disruptions due to natural disasters or other events.

4. **Cybersecurity Attacks**: Apple is at heightened risk of cyber-attacks due to its high profile, which can result in unauthorized access to confidential information.

5. **Regulatory Changes and Legal Risks**: Ongoing regul

In [100]:
result = vector_cypher_retriever.search(query_text=query, top_k=10)
for item in result.items:
    print(item.content)

<Record company='APPLE INC' context='intended to be inactive\ntextual references only.\nApple Inc. | 2023 Form 10-K | 4\nSection: Item1a\n>Item 1A.    Risk Factors\nThe Company\'s business, reputation, results of operations, financial condition and stock price can\nbe affected by a number of factors, whether currently known or unknown, including those described\nbelow. When any one or more of these risks materialize from time to time, the Company\'s business,\nreputation, results of operations, financial condition and stock price can be materially and adversely\naffected.\nBecause of the following factors, as well as other factors affecting the Company\'s results of\noperations and financial condition, past financial performance should not be considered to be a\nreliable indicator of future performance, and investors should not use historical trends to anticipate\nresults or trends in future periods. This discussion of risk factors contains forward-looking\nstatements.\nThis section sh

In [101]:
result = vector_cypher_retriever.search(query_text=query, top_k=20)
for item in result.items:
    print(item.content)


<Record company='APPLE INC' context='intended to be inactive\ntextual references only.\nApple Inc. | 2023 Form 10-K | 4\nSection: Item1a\n>Item 1A.    Risk Factors\nThe Company\'s business, reputation, results of operations, financial condition and stock price can\nbe affected by a number of factors, whether currently known or unknown, including those described\nbelow. When any one or more of these risks materialize from time to time, the Company\'s business,\nreputation, results of operations, financial condition and stock price can be materially and adversely\naffected.\nBecause of the following factors, as well as other factors affecting the Company\'s results of\noperations and financial condition, past financial performance should not be considered to be a\nreliable indicator of future performance, and investors should not use historical trends to anticipate\nresults or trends in future periods. This discussion of risk factors contains forward-looking\nstatements.\nThis section sh

## Why "Apple" Queries Can Fail in Vector-Cypher Retrieval

When you ask a question like "What are the risks that Apple faces?" using a vector-Cypher retriever, you may not get the structured or complete answer you expect. Here’s why:

- **How Vector-Cypher Works:**  
  - The retrieval process first performs a semantic search over all text chunks in the graph.
  - It retrieves the top-k chunks most similar to your query—regardless of which company (or entity) they are about.
  - The Cypher query then starts from each chunk and traverses the graph for related information.

- **The Problem with Entity-Centric Queries:**  
  - If your query is about "Apple," but there are no chunks whose text is semantically similar to your query and also specifically about Apple, the retriever may return:
    - Chunks about other companies.
    - Chunks that mention "risk" but not "Apple."
    - Generic or boilerplate risk factor text.
  - The Cypher query can only traverse from the retrieved chunk—it cannot "filter" or "redirect" to Apple if the chunk isn’t already about Apple.

- **Key Limitation:**  
  - **The chunk is the anchor.** If your query is about an entity (like Apple), but the chunk retrieval is not entity-aware, you may never reach the correct node or context in the graph.
  - This is especially problematic for broad or entity-centric questions, where you want to aggregate or summarize information about a specific node (e.g., a company) rather than just retrieve semantically similar passages.

> **Conclusion:**  
> Vector-Cypher retrieval is powerful for finding relevant context, but it is fundamentally limited by the chunk-centric approach. For entity-centric questions, you need either:
> - Chunks that are explicitly about the entity, or
> - A retrieval/query strategy that starts from the entity node itself, not from arbitrary text chunks.
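
The difference between chunk-first and entity-first retrieval can be illustrated with toy data. In this sketch the chunk texts, the scores, and the helpers `chunk_first` and `entity_first` are all invented for illustration; no database or embedding model is involved.

```python
# Toy data: (chunk text, company that filed it, similarity score to the query).
ranked_chunks = [
    ("Generic risk factor boilerplate", "MICROSOFT CORP", 0.93),
    ("Risks related to regulation", "PAYPAL", 0.91),
    ("Supply chain concentration risk", "APPLE INC", 0.88),
    ("Foreign exchange exposure", "APPLE INC", 0.80),
]


def chunk_first(chunks, k):
    """Take the top-k chunks by score, whatever company they belong to."""
    return sorted(chunks, key=lambda c: c[2], reverse=True)[:k]


def entity_first(chunks, company, k):
    """Restrict to one company's chunks first, then rank those by score."""
    own = [c for c in chunks if c[1] == company]
    return sorted(own, key=lambda c: c[2], reverse=True)[:k]


print("chunk-first :", [c[1] for c in chunk_first(ranked_chunks, k=2)])
print("entity-first:", [c[1] for c in entity_first(ranked_chunks, "APPLE INC", k=2)])
```

With `k=2`, the chunk-first path never reaches an Apple chunk in this toy data, while the entity-first path returns only Apple chunks. Any traversal that starts from the chunk-first results inherits that gap.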

In [110]:
# --- VectorCypherRetriever Example: Detailed Search with Context ---  OR WHY THIS IS A GOOD QUERY :)
asset_manager_query = """
WITH node
MATCH (node)-[:FROM_DOCUMENT]->(doc:Document)-[:FILED]->(company:Company)-[:OWNS]-(manager:AssetManager)
RETURN company.name AS company, manager.managerName AS manager, node.text AS context
"""

vector_cypher_retriever = VectorCypherRetriever(
    driver=driver,
    index_name='chunkEmbeddings',
    embedder=embedder,
    retrieval_query=asset_manager_query
)

rag = GraphRAG(llm=llm, retriever=vector_cypher_retriever)
query = "Who are the asset managers most affected by banking regulations?"
print(rag.search(query).answer)

The asset managers associated with PayPal, such as AllianceBernstein L.P., Ameriprise Financial Inc., Amundi, and others, are likely affected by banking regulations as described in the context, as PayPal's operations are subject to complex and changing laws, rules, and regulations impacting their business.


In [111]:
result = vector_cypher_retriever.search(query_text=query, top_k=3)
for item in result.items:
    print(item.content)

<Record company='PAYPAL' manager='ALLIANCEBERNSTEIN L.P.' context='cryptocurrency assets could be treated as a general unsecured claim against the custodian, in which\ncase our customers could seek to hold us liable for any resulting losses.\nIn addition, our cryptocurrency product offerings could have the effect of heightening or\nexacerbating many of the risk factors described in this "Risk Factors" section.\nLending Regulation\nWe hold a number of U.S. state lending licenses for our U.S. consumer short-term installment loan\nproduct, which is subject to federal and state laws governing consumer credit and debt collection.\nWhile the consumer short-term installment loan products that we offer outside the U.S. are generally\nexempt from primary consumer credit legislation, certain consumer lending laws, consumer\nprotection or banking transparency regulations continue to apply to these products. Increased global\nregulatory focus on short-term installment products and consumer credit 


## VectorCypherRetriever Example: Finding Shared Risks Among Companies

This example demonstrates how to combine semantic search with graph traversal to uncover relationships—specifically, risks that connect major tech companies.

**How this query works:**

- **Semantic Search:**  
  The vector retriever finds the top-k text chunks most relevant to your query ("What risks connect major tech companies?").

- **Graph Traversal:**  
  For each retrieved chunk (`node`):
  - Follows the `:FROM_DOCUMENT` and `:FILED` relationships to a company (`c1`).
  - Finds all risk factors (`risk`) that `c1` faces.
  - Finds other companies (`c2`) that also face the same risk factor.
  - Ensures that `c1` and `c2` are different companies.

- **Returns:**  
  - `source_company`: The company from the retrieved chunk.
  - `related_companies`: Companies sharing at least one risk with the source company.
  - `shared_risks`: The names of the risk factors connecting these companies.

- **Why this is powerful:**  
  - Leverages the chunk as the semantic anchor, but then uses graph logic to discover structured, multi-entity relationships.
  - Surfaces the broader network of shared risks as structured fields, which pure semantic search alone would struggle to do. Note that this query does not return `node.text`; add it to the `RETURN` clause if the chunk's context should also be passed to the LLM.

> **Summary:**  
> This approach is ideal for exploratory questions about relationships in your graph, where you want to start from relevant context but end up with structured, comparative insights.
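
The traversal logic above can be mirrored in plain Python to check one's understanding. This sketch uses an invented `faces_risk` mapping (the company-to-risk assignments are toy data, not results from the graph) and a hypothetical `shared_risks` helper that plays the role of the second `MATCH` together with the `WHERE c1 <> c2` filter.

```python
# Toy stand-in for (:Company)-[:FACES_RISK]->(:RiskFactor); assignments are invented.
faces_risk = {
    "MICROSOFT CORP": {"interest rates", "credit risk", "regulatory scrutiny"},
    "PAYPAL": {"credit risk", "regulatory scrutiny", "climate change"},
    "AMAZON": {"interest rates", "inflation"},
}


def shared_risks(source, graph):
    """Return (related companies, shared risk names) for one source company."""
    related, shared = set(), set()
    for other, risks in graph.items():
        if other == source:  # same role as WHERE c1 <> c2
            continue
        overlap = graph[source] & risks
        if overlap:
            related.add(other)
            shared |= overlap
    # Sorting mimics a stable display order; Cypher's collect() gives no such guarantee.
    return sorted(related), sorted(shared)


related_companies, risks = shared_risks("MICROSOFT CORP", faces_risk)
print("related_companies:", related_companies)
print("shared_risks:", risks)
```

As in the Cypher version, the two collected lists are aggregated separately, so the output does not say which related company shares which risk.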


In [112]:
vector_company_risk_query = """
WITH node
MATCH (node)-[:FROM_DOCUMENT]->(doc:Document)-[:FILED]->(c1:Company)
MATCH (c1)-[:FACES_RISK]->(risk:RiskFactor)<-[:FACES_RISK]-(c2:Company)
WHERE c1 <> c2
RETURN
  c1.name AS source_company,
  collect(DISTINCT c2.name) AS related_companies,
  collect(DISTINCT risk.name) AS shared_risks
LIMIT 10
"""

vector_cypher_retriever = VectorCypherRetriever(
    driver=driver,
    index_name="chunkEmbeddings",
    embedder=embedder,
    retrieval_query=vector_company_risk_query
)

query = "What risks connect major tech companies?"
result = vector_cypher_retriever.search(query_text=query, top_k=5)
for item in result.items:
    print(item.content)

<Record source_company='MICROSOFT CORP' related_companies=['AMAZON', 'PG&E CORP', 'NVIDIA CORPORATION', 'PAYPAL', 'APPLE INC'] shared_risks=['competition rules', 'interest rates', 'COVID-19 pandemic', 'regulatory scrutiny', 'credit risk', 'uncertain tax positions']>
<Record source_company='PAYPAL' related_companies=['NVIDIA CORPORATION', 'PG&E CORP', 'APPLE INC', 'MICROSOFT CORP', 'MCDONALDS CORP'] shared_risks=['increased costs', 'Market Risk', 'natural disasters', 'pandemics', 'climate change', 'unplanned outages', 'regulatory scrutiny', 'credit risk', 'Macroeconomic Conditions', 'Interest Rate Risk']>
<Record source_company='AMAZON' related_companies=['MICROSOFT CORP', 'PG&E CORP', 'APPLE INC', 'NVIDIA CORPORATION', 'INTEL CORP'] shared_risks=['competition rules', 'inflation', 'interest rates', 'economic conditions', 'Natural disasters', 'Climate change', 'market values of investments']>


In [115]:
rag = GraphRAG(llm=llm, retriever=vector_cypher_retriever)
print(rag.search(query).answer)


The major tech companies are connected by several shared risks, including:

1. **Regulatory Scrutiny**: This is a common concern for companies like Microsoft, PayPal, and Amazon, as they navigate complex compliance and regulatory environments across different regions.

2. **Interest Rates**: Fluctuations in interest rates can impact the financial strategies of these companies, affecting borrowing costs and investment returns.

3. **Credit Risk**: The risk of a counterparty failing to meet its financial obligations is a concern shared by these tech giants.

4. **Pandemics and Natural Disasters**: Events like the COVID-19 pandemic and other natural disasters can disrupt operations and markets, posing risks to these companies.

5. **Climate Change**: This environmental risk can impact supply chains, operations, and overall market conditions, affecting tech companies like Amazon and PayPal.

6. **Economic Conditions and Inflation**: Economic volatility, including inflation, can influence m

## Text2CypherRetriever Example

This section demonstrates how to use the `Text2CypherRetriever` to automatically generate Cypher queries from natural language questions.

**How it works:**
- The retriever uses a Large Language Model (LLM) to translate your plain-English query into a Cypher query, based on your Neo4j schema.
- The schema is provided as a string describing the main node types and relationships in your graph (e.g., companies, risk factors, asset managers).

**Example Workflow:**
1. You provide a natural language question, such as:
   > "What are the company names of companies owned by BlackRock Inc."
2. The retriever generates a corresponding Cypher query using the schema and the LLM.
3. The generated Cypher is printed, allowing you to inspect or execute it.

**Why this is powerful:**
- Removes the need to manually write Cypher for each question.
- Makes graph querying accessible to non-technical users.
- Great for rapid prototyping, exploration, and building natural language interfaces to your knowledge graph.
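
Because the LLM only knows the graph through the schema string, a quick sanity check is to confirm that a generated query uses only node labels that appear in that schema. The sketch below does this with a rough regular expression, applied to the schema string and the generated query from this notebook. `labels_in` is a hypothetical helper, and the pattern is an approximation that handles simple `(var:Label)` node patterns only.

```python
import re

# Matches "(:Label" or "(var:Label"; relationship types in [...] are not matched.
LABEL_PATTERN = re.compile(r"\(\s*\w*\s*:\s*(\w+)")


def labels_in(pattern_text):
    """Return the set of node labels found in a Cypher-like pattern string."""
    return set(LABEL_PATTERN.findall(pattern_text))


schema = (
    "(:Chunk)-[]-(:Document)-[:FILED]-(:Company{companyName: str}),"
    "(:Company)-[:FACES_RISK]-(:RiskFactor),"
    "(:Company)-[:OWNS]-(:AssetManager{managerName: str})"
)
generated = (
    "MATCH (:AssetManager {managerName: 'BlackRock Inc.'})<-[:OWNS]-(c:Company)\n"
    "RETURN c.companyName"
)

unknown = labels_in(generated) - labels_in(schema)
print("Labels not in schema:", sorted(unknown))
```

An empty result does not prove the query is correct (property names and relationship directions are not checked), but a non-empty one is a cheap signal that the model invented a label.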


In [116]:
# --- Text2CypherRetriever Example ---

print("\n--- Text2CypherRetriever Example ---")
text2cypher_retriever = Text2CypherRetriever(
    driver=driver,
    llm=llm,
    neo4j_schema="(:Chunk)-[]-(:Document)-[:FILED]-(:Company{companyName: str}),(:Company)-[:FACES_RISK]-(:RiskFactor),(:Company)-[:OWNS]-(:AssetManager{managerName: str})")

query = "what are the company names of companies owned by BlackRock Inc."
cypher_query = text2cypher_retriever.get_search_results(query)
print("\n--- Text2Cypher Output ---")
print("Original Query:", query)
print("Generated Cypher:\n", cypher_query.metadata["cypher"])



--- Text2CypherRetriever Example ---

--- Text2Cypher Output ---
Original Query: what are the company names of companies owned by BlackRock Inc.
Generated Cypher:
 MATCH (:AssetManager {managerName: 'BlackRock Inc.'})<-[:OWNS]-(c:Company)
RETURN c.companyName


## Executing the Generated Cypher Query

This section runs the Cypher query generated by the `Text2CypherRetriever` and displays the results.

**How it works:**
- The Cypher query, generated from your natural language question, is executed directly against the Neo4j database using the `driver`.
- Each record returned by the query is printed for inspection.

**Why this is useful:**
- Allows you to see the actual data returned from your graph for your question, closing the loop from natural language to structured results.
- Lets you verify both the correctness of the generated Cypher and the quality of the graph data.

**Typical workflow:**
1. Use the retriever to generate Cypher from your question.
2. Execute the Cypher and review the results.
3. Iterate on your question or schema as needed to improve answers.
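
Since the Cypher is produced by an LLM, step 2 may benefit from a guard before execution. The sketch below is a rough keyword check. `looks_read_only` is a hypothetical helper, the keyword list is an assumption and is not exhaustive (for example, it does not inspect `CALL` procedures), and it should not be treated as a security boundary.

```python
import re

# Assumed list of write-style keywords; extend as needed for your deployment.
WRITE_CLAUSES = re.compile(r"\b(CREATE|MERGE|DELETE|DETACH|SET|REMOVE|DROP)\b", re.IGNORECASE)

# Single- or double-quoted string literals, removed before scanning for keywords.
QUOTED = re.compile("'[^']*'|\"[^\"]*\"")


def looks_read_only(cypher):
    """Rough check: True when no write-style keyword appears outside quoted strings."""
    without_strings = QUOTED.sub("", cypher)
    return WRITE_CLAUSES.search(without_strings) is None


generated = (
    "MATCH (:AssetManager {managerName: 'BlackRock Inc.'})<-[:OWNS]-(c:Company)\n"
    "RETURN c.companyName"
)
print(looks_read_only(generated))
print(looks_read_only("MATCH (n) DETACH DELETE n"))
```

A check like this could sit between generation and `driver.execute_query`, so that an unexpected write query is inspected by a person before it runs.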


In [117]:
result = driver.execute_query(cypher_query.metadata["cypher"])
for record in result.records:
    print(record)

<Record c.companyName='AMAZON.COM INC'>
<Record c.companyName='PG&E CORP'>
<Record c.companyName='MICROSOFT CORP'>
<Record c.companyName='INTEL CORP'>
<Record c.companyName='PAYPAL HLDGS INC'>
<Record c.companyName='MCDONALDS CORP'>
<Record c.companyName='NVIDIA CORPORATION'>
