# Working with the Full Dataset

In the previous notebooks, you may have noticed that some queries returned results that didn't fully answer the question. For example, when we asked:

**Query: "When does the fiscal year end?"**

The best match was:

```
Best match (score: 0.8934):
  Apple Inc. ("Apple" or the "Company") designs, manufactures and markets smartphones,
  personal computers, tablets, wearables and accessories, and sells a variety of related
  services. The Company's fiscal year is the 52- or 53-week period that ends on the last
  Saturday of September.
```

While this chunk *contains* the answer, it's not idealâ€”the fiscal year information is buried in a paragraph primarily about what Apple does. With more data, vector search can find chunks that are *specifically about* fiscal years.

This happened because we were working with a small sample of text from a single document. In a real GraphRAG application, you'd have:
- Multiple documents (10-K filings from different companies)
- Hundreds or thousands of chunks
- Many more extracted entities and relationships

In this notebook, we'll load the full dataset and re-run our queries to see much richer results.

---

## Load the Full Dataset

First, go to your command line and run the following command to restore the full Neo4j database:

```bash
uv run python scripts/restore_neo4j.py
```

This script loads a pre-built knowledge graph containing:
- SEC 10-K filings from multiple companies (Apple, Microsoft, Nvidia, etc.)
- Chunks with embeddings for semantic search
- Extracted entities (Companies, Products, Services, Executives, etc.)
- Relationships between entities

Wait for the script to complete before continuing.

## Setup

Import required modules and connect to Neo4j.

In [15]:
import sys
sys.path.insert(0, '../solutions')

from neo4j import GraphDatabase

from config import Neo4jConfig, get_embedder

In [16]:
neo4j_config = Neo4jConfig()
driver = GraphDatabase.driver(
    neo4j_config.uri,
    auth=(neo4j_config.username, neo4j_config.password)
)
driver.verify_connectivity()
print("Connected to Neo4j successfully!")

Connected to Neo4j successfully!


## Explore the Full Graph

Let's see what's in the restored database.

In [17]:
def show_graph_summary(driver):
    """Show a summary of the complete graph."""
    with driver.session() as session:
        # Count all node types
        result = session.run("""
            MATCH (n)
            UNWIND labels(n) as label
            RETURN label, count(*) as count
            ORDER BY count DESC
        """)
        print("=== Node Counts ===")
        for record in result:
            print(f"  {record['label']}: {record['count']}")
        
        # Count relationship types
        result = session.run("""
            MATCH ()-[r]->()
            RETURN type(r) as type, count(*) as count
            ORDER BY count DESC
        """)
        print("\n=== Relationship Counts ===")
        for record in result:
            print(f"  {record['type']}: {record['count']}")

show_graph_summary(driver)

=== Node Counts ===
  __KGBuilder__: 9
  __Entity__: 9
  Service: 5
  Chunk: 3
  Product: 3
  Company: 1
  Document: 1

=== Relationship Counts ===
  OFFERS_SERVICE: 5
  FROM_DOCUMENT: 3
  OFFERS_PRODUCT: 3
  NEXT_CHUNK: 2


## Vector Search with Full Dataset

Now let's re-run the same queries from the previous notebooks. With more data, we'll get more relevant and diverse results.

In [18]:
embedder = get_embedder()
print(f"Embedder initialized: {embedder.model}")

Embedder initialized: text-embedding-ada-002


In [19]:
INDEX_NAME = "chunkEmbeddings"

def vector_search(driver, embedder, query: str, top_k: int = 3):
    """Search for chunks similar to the query."""
    query_embedding = embedder.embed_query(query)
    
    with driver.session() as session:
        result = session.run("""
            CALL db.index.vector.queryNodes($index_name, $top_k, $embedding)
            YIELD node, score
            RETURN node.text as text, node.index as idx, score
            ORDER BY score DESC
        """, index_name=INDEX_NAME, top_k=top_k, embedding=query_embedding)
        
        return list(result)

In [20]:
queries = [
    "What products does Apple make?",
    "Tell me about iPhone and Mac computers",
    "What services does the company offer?",
    "When does the fiscal year end?"
]

for query in queries:
    print(f"\nQuery: \"{query}\"")
    print("-" * 50)
    results = vector_search(driver, embedder, query, top_k=3)
    for i, record in enumerate(results):
        print(f"\n[{i+1}] Score: {record['score']:.4f}")
        print(f"    {record['text']}")


Query: "What products does Apple make?"
--------------------------------------------------

[1] Score: 0.9340
    Apple Inc. ("Apple" or the "Company") designs, manufactures and markets smartphones, 
personal computers, tablets, wearables and accessories, and sells a variety of related 
services. The Company's fiscal year is the 52- or 53-week period that ends on the last 
Saturday of September.

Products

iPhone is the Company's line of smartphones based on its iOS operating system. The iPhone 
product line 

[2] Score: 0.9182
    ts iOS operating system. The iPhone 
product line includes iPhone 14 Pro, iPhone 14, iPhone 13 and iPhone SE. Mac is the Company's 
line of personal computers based on its macOS operating system. iPad is the Company's line 
of multi-purpose tablets based on its iPadOS operating system.

Services

Advertising includes third-party licensing arrangements and the Company's own advertising 
platforms. 

[3] Score: 0.9140
    nts and the Company's own advertising

## Explore Entities

With the full dataset, we have many more extracted entities to work with.

In [21]:
def show_entities(driver):
    """Display extracted entities by type."""
    with driver.session() as session:
        # Get entity counts by label (excluding internal labels)
        result = session.run("""
            MATCH (n)
            WHERE NOT n:Chunk AND NOT n:Document AND NOT n:__KGBuilder__
            WITH labels(n) as lbls
            UNWIND lbls as label
            WITH label
            WHERE NOT label STARTS WITH '__'
            RETURN label, count(*) as count
            ORDER BY count DESC
            LIMIT 10
        """)
        print("=== Entity Counts (Top 10) ===")
        for record in result:
            print(f"  {record['label']}: {record['count']}")

show_entities(driver)

=== Entity Counts (Top 10) ===


In [22]:
def list_companies(driver):
    """List all Company entities."""
    with driver.session() as session:
        result = session.run("""
            MATCH (c:Company)
            RETURN c.name as name
            ORDER BY c.name
            LIMIT 20
        """)
        print("=== Companies ===")
        for record in result:
            print(f"  - {record['name']}")

list_companies(driver)

=== Companies ===
  - Apple Inc.


## Find Chunks for Entities

With the full dataset, we can find multiple chunks that mention specific entities across different documents.

In [23]:
def find_chunks_for_entity(driver, entity_name: str, limit: int = 5):
    """Find chunks that mention a specific entity."""
    with driver.session() as session:
        result = session.run("""
            MATCH (e)-[:FROM_CHUNK]->(c:Chunk)
            WHERE e.name CONTAINS $name
            RETURN e.name as entity, labels(e)[0] as type, c.text as chunk_text
            LIMIT $limit
        """, name=entity_name, limit=limit)
        
        records = list(result)
        if records:
            print(f"=== Chunks mentioning '{entity_name}' ===")
            for i, record in enumerate(records):
                print(f"\n[{i+1}] Entity: {record['entity']} ({record['type']})")
                print(f"    Chunk: {record['chunk_text']}")
        else:
            print(f"No chunks found mentioning '{entity_name}'")

# Find chunks mentioning specific entities
find_chunks_for_entity(driver, "iPhone")

No chunks found mentioning 'iPhone'


In [24]:
# Try finding chunks for other entities
find_chunks_for_entity(driver, "Microsoft")

No chunks found mentioning 'Microsoft'


In [25]:
find_chunks_for_entity(driver, "GPU")

No chunks found mentioning 'GPU'


## Explore Relationships

With multiple companies and products, we can see the rich relationship network.

In [26]:
def show_company_products(driver, company_name: str):
    """Show products offered by a company."""
    with driver.session() as session:
        result = session.run("""
            MATCH (c:Company)-[:OFFERS_PRODUCT]->(p:Product)
            WHERE c.name CONTAINS $name
            RETURN c.name as company, p.name as product
            ORDER BY p.name
            LIMIT 20
        """, name=company_name)
        
        records = list(result)
        if records:
            print(f"=== Products from '{company_name}' ===")
            for record in records:
                print(f"  {record['company']} -> {record['product']}")
        else:
            print(f"No products found for '{company_name}'")

show_company_products(driver, "Apple")

=== Products from 'Apple' ===
  Apple Inc. -> Mac
  Apple Inc. -> iPad
  Apple Inc. -> iPhone


In [27]:
show_company_products(driver, "Microsoft")

No products found for 'Microsoft'


## Summary

With the full dataset loaded, you can see:

1. **More diverse search results** - Queries now return relevant chunks from multiple companies and documents
2. **Richer entity network** - Many more extracted companies, products, services, and other entities
3. **Cross-document relationships** - Entities are linked across different SEC filings
4. **Better answers** - More context means more accurate and comprehensive responses

This full dataset is what you'll use in the upcoming retriever and agent notebooks to build powerful GraphRAG applications.

---

**Next:** [Vector Retriever](02_01_vector_retriever.ipynb)

In [28]:
# Cleanup
driver.close()
print("Connection closed.")

Connection closed.
