# Working with the Full Dataset

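In the previous notebooks, we loaded only a small sample of the dataset. This notebook restores the full dataset and repeats the earlier searches and graph queries against it.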

---

## Load the Full Dataset

Run the cell below to restore the full Neo4j database from pre-built data.

This script loads a pre-built knowledge graph containing:
- SEC 10-K filings from multiple companies (Apple, Microsoft, Nvidia, etc.)
- Chunks with embeddings for semantic search
- Extracted entities (Companies, Products, Services, Executives, etc.)
- Relationships between entities

Wait for the script to complete before continuing.

In [None]:
!uv run python ../scripts/restore_neo4j.py --force

## Setup

Import required modules and connect to Neo4j.

In [None]:
import sys
sys.path.insert(0, '../new-workshops/solutions')

from neo4j import GraphDatabase

from config import Neo4jConfig, get_embedder

In [None]:
neo4j_config = Neo4jConfig()
driver = GraphDatabase.driver(
    neo4j_config.uri,
    auth=(neo4j_config.username, neo4j_config.password)
)
driver.verify_connectivity()
print("Connected to Neo4j successfully!")

## Explore the Full Graph

Let's see what's in the restored database.

### Data Model

See [docs/DATA_MODEL.md](../docs/DATA_MODEL.md) for details on the graph schema, node types, relationships, and example queries.

In [None]:
def show_graph_summary(driver):
    """Show a summary of the complete graph."""
    with driver.session() as session:
        # Count all node types - with explicit grouping
        result = session.run("""
            MATCH (n)
            UNWIND labels(n) as label
            WITH label
            RETURN label, count(*) as count
            ORDER BY count DESC
        """)
        print("=== Node Counts ===")
        for record in result:
            print(f"  {record['label']}: {record['count']}")
        
        # Count relationship types - with explicit grouping
        result = session.run("""
            MATCH ()-[r]->()
            WITH type(r) as type
            RETURN type, count(*) as count
            ORDER BY count DESC
        """)
        print("\n=== Relationship Counts ===")
        for record in result:
            print(f"  {record['type']}: {record['count']}")

show_graph_summary(driver)
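
Because the node query unwinds `labels(n)`, a node that carries several labels is counted once per label, so the per-label counts can add up to more than the number of nodes. The sketch below reproduces that grouping in plain Python. The label lists are made up for illustration and are not taken from the restored database.

```python
from collections import Counter

# Hypothetical label lists, one entry per node (illustrative only).
sample_nodes = [
    ["Company", "__KGBuilder__"],
    ["Product", "__KGBuilder__"],
    ["Chunk"],
]

# Same idea as: UNWIND labels(n) AS label RETURN label, count(*)
label_counts = Counter(label for labels in sample_nodes for label in labels)

for label, count in label_counts.most_common():
    print(f"  {label}: {count}")

# The sum of per-label counts exceeds the node count when nodes are multi-labeled.
total_label_occurrences = sum(label_counts.values())
print(f"{len(sample_nodes)} nodes, {total_label_occurrences} label occurrences")
```

Keep this in mind when reading the node counts: they describe label usage, not a partition of the nodes.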

## Vector Search with Full Dataset

Now let's re-run the same queries from the previous notebooks. With more data to draw on, the results should be more relevant and come from a wider range of companies and documents.

In [None]:
embedder = get_embedder()
print(f"Embedder initialized: {embedder.model}")

In [None]:
INDEX_NAME = "chunkEmbeddings"

def vector_search(driver, embedder, query: str, top_k: int = 3):
    """Search for chunks similar to the query."""
    query_embedding = embedder.embed_query(query)
    
    with driver.session() as session:
        result = session.run("""
            CALL db.index.vector.queryNodes($index_name, $top_k, $embedding)
            YIELD node, score
            RETURN node.text as text, node.index as idx, score
            ORDER BY score DESC
        """, index_name=INDEX_NAME, top_k=top_k, embedding=query_embedding)
        
        return list(result)
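
To build intuition for what the `score` column means, here is a brute-force sketch of similarity ranking in plain Python. It assumes cosine similarity and uses tiny made-up vectors. The real index may be configured with a different similarity function, searches approximately rather than exhaustively, and may scale its scores differently, so treat this as an illustration rather than a description of Neo4j internals. The names `cosine_similarity`, `brute_force_top_k`, and `toy_chunks` are hypothetical.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # Treat a zero vector as similar to nothing
    return dot / (norm_a * norm_b)

def brute_force_top_k(query_vec, chunks, top_k=3):
    """Score every (text, vector) pair and keep the top_k best matches."""
    scored = [(text, cosine_similarity(query_vec, vec)) for text, vec in chunks]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]

# Made-up 3-dimensional "embeddings" (real embeddings have far more dimensions).
toy_chunks = [
    ("chunk about phones", [1.0, 0.0, 0.0]),
    ("chunk about services", [0.0, 1.0, 0.0]),
    ("chunk about phones and services", [1.0, 1.0, 0.0]),
]

toy_results = brute_force_top_k([1.0, 0.0, 0.0], toy_chunks, top_k=2)
for text, score in toy_results:
    print(f"{score:.4f}  {text}")
```

A vector index exists to avoid this exhaustive scan, which becomes expensive as the number of chunks grows.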

In [None]:
queries = [
    "What products does Apple make?",
    "Tell me about iPhone and Mac computers",
    "What services does the company offer?",
    "When does the fiscal year end?"
]

for query in queries:
    print(f"\nQuery: \"{query}\"")
    print("-" * 50)
    results = vector_search(driver, embedder, query, top_k=3)
    for i, record in enumerate(results):
        print(f"\n[{i+1}] Score: {record['score']:.4f}")
        print(f"    {record['text']}")
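
With the full dataset, printing every chunk in full can make the output hard to scan. One option is to print a short preview instead. The helper below is a hypothetical addition, not part of the workshop code, and it uses only the standard library. The sample string is made up.

```python
import textwrap

def preview(text, width=200):
    """Collapse whitespace and truncate text to at most `width` characters."""
    return textwrap.shorten(text or "", width=width, placeholder=" ...")

# Made-up sample text with irregular whitespace and a line break.
sample_text = (
    "This is a   made-up chunk of text that stands in for a long passage\n"
    "from a filing and keeps going well past the preview width."
)
print(preview(sample_text, width=60))
```

In the loop above, `print(f"    {preview(record['text'])}")` would replace the line that prints the full chunk text.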

## Explore Entities

With the full dataset, we have many more extracted entities to work with.

In [None]:
def show_entities(driver):
    """Display extracted entities by type."""
    with driver.session() as session:
        # Get entity counts by label (excluding internal labels)
        result = session.run("""
            MATCH (n)
            WHERE NOT n:Chunk AND NOT n:Document AND NOT n:__KGBuilder__
            WITH labels(n) as lbls
            UNWIND lbls as label
            WITH label
            WHERE NOT label STARTS WITH '__'
            RETURN label, count(*) as count
            ORDER BY count DESC
            LIMIT 10
        """)
        print("=== Entity Counts (Top 10) ===")
        for record in result:
            print(f"  {record['label']}: {record['count']}")

show_entities(driver)

In [None]:
def list_companies(driver):
    """List all Company entities."""
    with driver.session() as session:
        result = session.run("""
            MATCH (c:Company)
            WHERE c.name IS NOT NULL
            RETURN c.name as name
            ORDER BY c.name
            LIMIT 20
        """)
        print("=== Companies ===")
        for record in result:
            print(f"  - {record['name']}")

list_companies(driver)

## Find Chunks for Entities

With the full dataset, we can find multiple chunks that mention specific entities across different documents.

In [None]:
def find_chunks_for_entity(driver, entity_name: str, limit: int = 5):
    """Find chunks that mention a specific entity."""
    with driver.session() as session:
        result = session.run("""
            MATCH (e)-[:FROM_CHUNK]->(c:Chunk)
            WHERE e.name CONTAINS $name
            RETURN e.name as entity, [l IN labels(e) WHERE NOT l STARTS WITH '__'][0] as type, c.text as chunk_text
            LIMIT $limit
        """, name=entity_name, limit=limit)
        
        records = list(result)
        if records:
            print(f"=== Chunks mentioning '{entity_name}' ===")
            for i, record in enumerate(records):
                print(f"\n[{i+1}] Entity: {record['entity']} ({record['type']})")
                print(f"    Chunk: {record['chunk_text']}")
        else:
            print(f"No chunks found mentioning '{entity_name}'")

# Find chunks mentioning specific entities
find_chunks_for_entity(driver, "iPhone")

In [None]:
# Try finding chunks for other entities
find_chunks_for_entity(driver, "Microsoft")

In [None]:
find_chunks_for_entity(driver, "GPU")
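
Cypher's `CONTAINS` is case-sensitive, so searching for "iphone" would not match an entity named "iPhone". A minimal sketch of a case-insensitive variant follows. `build_entity_chunk_query` is a hypothetical helper, not part of the workshop code. It only builds the query text and parameters, so it can be inspected without a database connection.

```python
def build_entity_chunk_query(entity_name: str, limit: int = 5):
    """Return (query, params) for a case-insensitive entity-to-chunk lookup."""
    # toLower() on both sides makes the substring match ignore case.
    query = """
        MATCH (e)-[:FROM_CHUNK]->(c:Chunk)
        WHERE toLower(e.name) CONTAINS toLower($name)
        RETURN e.name AS entity, labels(e) AS labels, c.text AS chunk_text
        LIMIT $limit
    """
    params = {"name": entity_name.strip(), "limit": int(limit)}
    return query, params

query, params = build_entity_chunk_query("  iphone ", limit=3)
print(params)
```

Inside a session, the pair can be passed along as `session.run(query, **params)`. Wrapping the property in `toLower()` may keep the database from using an index on `name`, which is usually acceptable for exploratory queries like these.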

## Explore Relationships

With multiple companies and products, we can see the rich relationship network.

In [None]:
def show_company_products(driver, company_name: str):
    """Show products offered by a company."""
    with driver.session() as session:
        result = session.run("""
            MATCH (c:Company)-[:OFFERS_PRODUCT]->(p:Product)
            WHERE c.name CONTAINS $name
              AND p.name IS NOT NULL
            RETURN c.name as company, p.name as product
            ORDER BY p.name
            LIMIT 20
        """, name=company_name)
        
        records = list(result)
        if records:
            print(f"=== Products from '{company_name}' ===")
            for record in records:
                print(f"  {record['company']} -> {record['product']}")
        else:
            print(f"No products found for '{company_name}'")

show_company_products(driver, "Apple")

In [None]:
show_company_products(driver, "Microsoft")
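
Because the query matches companies with `CONTAINS`, a single search term can match more than one `Company` node, and the flat output interleaves their products. The sketch below groups rows by company before printing. The helper name and the sample rows are hypothetical and do not come from the dataset. Each row is assumed to be a mapping with `company` and `product` keys, which is how the records returned by the query above can be accessed.

```python
from collections import defaultdict

def group_products_by_company(rows):
    """Group rows with 'company' and 'product' keys into {company: [products]}."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row["company"]].append(row["product"])
    # Sort each product list so the output is stable.
    return {company: sorted(products) for company, products in grouped.items()}

# Made-up rows for illustration only.
sample_rows = [
    {"company": "Company A", "product": "Widget"},
    {"company": "Company A", "product": "Gadget"},
    {"company": "Company B", "product": "Gizmo"},
]

grouped = group_products_by_company(sample_rows)
for company, products in grouped.items():
    print(f"{company}: {', '.join(products)}")
```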

## Summary

With the full dataset loaded, you can see:

1. **More diverse search results** - Queries now return relevant chunks from multiple companies and documents
2. **Richer entity network** - Many more extracted companies, products, services, and other entities
3. **Cross-document relationships** - Entities are linked across different SEC filings
4. **Better answers** - More context gives retrievers and agents more material from which to ground accurate, comprehensive responses

This full dataset is what you'll use in the upcoming retriever and agent notebooks to build powerful GraphRAG applications.

---

**Next:** [Vector Retriever](02_01_vector_retriever.ipynb)

In [None]:
# Cleanup
driver.close()
print("Connection closed.")