# Working with the Full Dataset

In the previous notebooks we loaded only a small sample of the dataset. This notebook restores the full dataset and re-runs the earlier queries against it.

---

## Load the Full Dataset

Run the cell below to restore the full Neo4j database from a pre-built snapshot.

The script loads a pre-built knowledge graph containing:
- SEC 10-K filings from multiple companies (Apple, Microsoft, Nvidia, etc.)
- Chunks with embeddings for semantic search
- Extracted entities (Companies, Products, Executives, Risk Factors, Financial Metrics, etc.)
- Relationships between entities

Wait for the script to complete before continuing.

In [2]:
!uv run python ../scripts/restore_neo4j.py --force

=== Neo4j Database Restore ===

Source: GitHub (https://media.githubusercontent.com/media/neo4j-partners/workshop-financial-data/main/snapshot/financial_backup.json)

Database: neo4j+s://053c8750.databases.neo4j.io


Streaming from https://media.githubusercontent.com/media/neo4j-partners/workshop-financial-data/main/snapshot/financial_backup.json...
Connected to Neo4j at neo4j+s://053c8750.databases.neo4j.io
  Streamed 10.0 MB...
  Streamed 20.0 MB...
  Streamed 30.0 MB...
  Streamed 40.0 MB...
  Streamed 50.0 MB...
  Streamed 60.0 MB...
  Streamed 70.0 MB...
  Streamed 80.0 MB...
  Total: 90.0 MB
  Parsed 2145 nodes, 5070 relationships
Clearing existing data...
  Deleted 11 nodes...
Dropping existing indexes and constraints...
  Dropped index: __entity__tmp_internal_id
  Dropped index: chunkEmbeddings
Restoring schema (indexes and constraints)...
  Created constraint: managerName_AssetManager_uniq
  Created constraint: name_Company_uniq
  Created constraint: path_Document_uniq
  Creat

## Setup

Import required modules and connect to Neo4j.

In [3]:
import sys
sys.path.insert(0, '../new-workshops/solutions')

from neo4j import GraphDatabase

from config import Neo4jConfig, get_embedder

In [4]:
neo4j_config = Neo4jConfig()
driver = GraphDatabase.driver(
    neo4j_config.uri,
    auth=(neo4j_config.username, neo4j_config.password)
)
driver.verify_connectivity()
print("Connected to Neo4j successfully!")

Connected to Neo4j successfully!


## Explore the Full Graph

Let's see what's in the restored database.

### Data Model

See [docs/DATA_MODEL.md](../docs/DATA_MODEL.md) for details on the graph schema, node types, relationships, and example queries.

In [5]:
def show_graph_summary(driver):
    """Show a summary of the complete graph."""
    with driver.session() as session:
        # Count all node types
        result = session.run("""
            MATCH (n)
            UNWIND labels(n) as label
            RETURN label, count(*) as count
            ORDER BY count DESC
        """)
        print("=== Node Counts ===")
        for record in result:
            print(f"  {record['label']}: {record['count']}")
        
        # Count relationship types
        result = session.run("""
            MATCH ()-[r]->()
            RETURN type(r) as type, count(*) as count
            ORDER BY count DESC
        """)
        print("\n=== Relationship Counts ===")
        for record in result:
            print(f"  {record['type']}: {record['count']}")

show_graph_summary(driver)

=== Node Counts ===
  RiskFactor: 820
  FinancialMetric: 470
  Chunk: 390
  Product: 241
  TimePeriod: 102
  Transaction: 46
  Executive: 29
  AssetManager: 15
  Company: 12
  Document: 11
  StockType: 9

=== Relationship Counts ===
  FROM_CHUNK: 2414
  FACES_RISK: 836
  HAS_METRIC: 526
  FROM_DOCUMENT: 390
  NEXT_CHUNK: 381
  MENTIONS: 378
  OWNS: 118
  ISSUED_STOCK: 17
  FILED: 10
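
As a quick sanity check, the counts printed above can be summed and compared with the totals the restore script reported (`Parsed 2145 nodes, 5070 relationships`). The sketch below hard-codes the numbers from this run's output, and the variable names are only illustrative. The node comparison holds only if every node carries exactly one label, because the summary query counts a node once per label.

```python
# Counts copied from the graph summary output of this run.
node_counts = {
    "RiskFactor": 820, "FinancialMetric": 470, "Chunk": 390, "Product": 241,
    "TimePeriod": 102, "Transaction": 46, "Executive": 29, "AssetManager": 15,
    "Company": 12, "Document": 11, "StockType": 9,
}
rel_counts = {
    "FROM_CHUNK": 2414, "FACES_RISK": 836, "HAS_METRIC": 526,
    "FROM_DOCUMENT": 390, "NEXT_CHUNK": 381, "MENTIONS": 378,
    "OWNS": 118, "ISSUED_STOCK": 17, "FILED": 10,
}

# Totals reported by the restore script for the same run.
parsed_nodes, parsed_rels = 2145, 5070

total_nodes = sum(node_counts.values())
total_rels = sum(rel_counts.values())

print(f"nodes: {total_nodes} (restore parsed {parsed_nodes})")
print(f"relationships: {total_rels} (restore parsed {parsed_rels})")
```

If the totals differ after a fresh restore, the restore may have been interrupted or the snapshot may have changed since this run.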


## Vector Search with Full Dataset

Now let's re-run the same queries from the previous notebooks. With more data, the results can draw on chunks from several companies' filings instead of a single small sample.

In [16]:
embedder = get_embedder()
print(f"Embedder initialized: {embedder.model}")

Embedder initialized: text-embedding-ada-002


In [8]:
INDEX_NAME = "chunkEmbeddings"

def vector_search(driver, embedder, query: str, top_k: int = 3):
    """Search for chunks similar to the query."""
    query_embedding = embedder.embed_query(query)
    
    with driver.session() as session:
        result = session.run("""
            CALL db.index.vector.queryNodes($index_name, $top_k, $embedding)
            YIELD node, score
            RETURN node.text as text, node.index as idx, score
            ORDER BY score DESC
        """, index_name=INDEX_NAME, top_k=top_k, embedding=query_embedding)
        
        return list(result)
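
To see whether results really come from several filings, it can help to return each chunk's source document alongside the score. The following is a minimal sketch, not part of the workshop code. It assumes that chunks are linked to `Document` nodes by `FROM_DOCUMENT` (the type appears in the relationship counts) and that documents have a `path` property (suggested by the `path_Document_uniq` constraint in the restore log). The function names `vector_search_with_source` and `distinct_documents` are hypothetical.

```python
def vector_search_with_source(driver, embedder, query, top_k=3,
                              index_name="chunkEmbeddings"):
    """Vector search that also returns the path of each chunk's document."""
    query_embedding = embedder.embed_query(query)

    with driver.session() as session:
        result = session.run("""
            CALL db.index.vector.queryNodes($index_name, $top_k, $embedding)
            YIELD node, score
            // Undirected pattern: the direction of FROM_DOCUMENT is not assumed.
            OPTIONAL MATCH (node)-[:FROM_DOCUMENT]-(d:Document)
            // Assumption: Document nodes expose a `path` property.
            RETURN d.path AS document, node.index AS idx, score
            ORDER BY score DESC
        """, index_name=index_name, top_k=top_k, embedding=query_embedding)

        # Materialize plain tuples while the session is still open.
        return [(r["document"], r["idx"], r["score"]) for r in result]


def distinct_documents(rows):
    """Return distinct document paths in result order, skipping missing ones."""
    seen = []
    for document, _idx, _score in rows:
        if document is not None and document not in seen:
            seen.append(document)
    return seen
```

A call such as `distinct_documents(vector_search_with_source(driver, embedder, "What products does Apple make?", top_k=10))` would then list which filings contributed to the top results.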

In [9]:
queries = [
    "What products does Apple make?",
    "Tell me about iPhone and Mac computers",
    "What services does the company offer?",
    "When does the fiscal year end?"
]

for query in queries:
    print(f"\nQuery: \"{query}\"")
    print("-" * 50)
    results = vector_search(driver, embedder, query, top_k=3)
    for i, record in enumerate(results):
        print(f"\n[{i+1}] Score: {record['score']:.4f}")
        print(f"    {record['text']}")


Query: "What products does Apple make?"
--------------------------------------------------

[1] Score: 0.8985
    
components used by the Company, including those that are available from multiple sources, are at
times subject to industry-wide shortage and significant commodity pricing fluctuations.
The Company uses some custom components that are not commonly used by its competitors, and
new products introduced by the Company often utilize custom components available from only one
source. When a component or product uses new technologies, initial capacity constraints may exist
until the suppliers' yields have matured or their manufacturing capacities have increased. The
continued availability of these components at acceptable prices, or at all, may be affected if
suppliers decide to concentrate on the production of common components instead of components
customized to meet the Company's requirements.
The Company has entered into agreements for the supply of many components; however, t
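
Chunk text is long, so printing it in full makes the results hard to scan. One option is a small preview helper, sketched below with the standard library's `textwrap.shorten`. The name `preview` and the default width are illustrative choices, not part of the workshop code.

```python
import textwrap


def preview(text, width=200):
    """Collapse runs of whitespace and truncate to at most `width` characters."""
    # shorten() normalizes whitespace first, then cuts at a word boundary
    # and appends the placeholder if the text does not fit.
    return textwrap.shorten(text or "", width=width, placeholder=" ...")
```

In the loop above, `print(f"    {preview(record['text'])}")` would print a one-line summary of each chunk instead of the full text.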

## Explore Entities

With the full dataset, we have many more extracted entities to work with.

In [10]:
def show_entities(driver):
    """Display extracted entities by type."""
    with driver.session() as session:
        # Get entity counts by label (excluding internal labels)
        result = session.run("""
            MATCH (n)
            WHERE NOT n:Chunk AND NOT n:Document AND NOT n:__KGBuilder__
            WITH labels(n) as lbls
            UNWIND lbls as label
            WITH label
            WHERE NOT label STARTS WITH '__'
            RETURN label, count(*) as count
            ORDER BY count DESC
            LIMIT 10
        """)
        print("=== Entity Counts (Top 10) ===")
        for record in result:
            print(f"  {record['label']}: {record['count']}")

show_entities(driver)

=== Entity Counts (Top 10) ===
  RiskFactor: 820
  FinancialMetric: 470
  Product: 241
  TimePeriod: 102
  Transaction: 46
  Executive: 29
  AssetManager: 15
  Company: 12
  StockType: 9


In [11]:
def list_companies(driver):
    """List all Company entities."""
    with driver.session() as session:
        result = session.run("""
            MATCH (c:Company)
            RETURN c.name as name
            ORDER BY c.name
            LIMIT 20
        """)
        print("=== Companies ===")
        for record in result:
            print(f"  - {record['name']}")

list_companies(driver)

=== Companies ===
  - ALPHABET INC
  - AMAZON
  - AMERICAN INTL GROUP
  - APPLE INC
  - Activision Blizzard, Inc.
  - INTEL CORP
  - MCDONALDS CORP
  - MICROSOFT CORP
  - NVIDIA CORPORATION
  - PAYPAL
  - PAYPAL HLDGS INC
  - PG&E CORP
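
Most of these names are stored in upper case, which matters for substring filters: Cypher's `CONTAINS` is case-sensitive, so a filter on `"Apple"` does not match `APPLE INC`. The sketch below mirrors that behavior in plain Python using the names from the output above. The helper names are illustrative. In Cypher, the case-insensitive variant would typically wrap both sides in `toLower()`.

```python
# Company names copied from the output above.
companies = [
    "ALPHABET INC", "AMAZON", "AMERICAN INTL GROUP", "APPLE INC",
    "Activision Blizzard, Inc.", "INTEL CORP", "MCDONALDS CORP",
    "MICROSOFT CORP", "NVIDIA CORPORATION", "PAYPAL", "PAYPAL HLDGS INC",
    "PG&E CORP",
]


def contains_case_sensitive(names, fragment):
    """Mimic a case-sensitive substring filter."""
    return [name for name in names if fragment in name]


def contains_case_insensitive(names, fragment):
    """Compare case-folded strings so 'Apple' matches 'APPLE INC'."""
    needle = fragment.casefold()
    return [name for name in names if needle in name.casefold()]


print(contains_case_sensitive(companies, "Apple"))    # prints []
print(contains_case_insensitive(companies, "Apple"))  # prints ['APPLE INC']
```

The list also contains both `PAYPAL` and `PAYPAL HLDGS INC`, so a case-insensitive filter on `"paypal"` returns two Company nodes.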


## Find Chunks for Entities

With the full dataset, we can find multiple chunks that mention specific entities across different documents.

In [12]:
def find_chunks_for_entity(driver, entity_name: str, limit: int = 5):
    """Find chunks that mention a specific entity."""
    with driver.session() as session:
        result = session.run("""
            MATCH (e)-[:FROM_CHUNK]->(c:Chunk)
            WHERE e.name CONTAINS $name
            RETURN e.name as entity, labels(e)[0] as type, c.text as chunk_text
            LIMIT $limit
        """, name=entity_name, limit=limit)
        
        records = list(result)
        if records:
            print(f"=== Chunks mentioning '{entity_name}' ===")
            for i, record in enumerate(records):
                print(f"\n[{i+1}] Entity: {record['entity']} ({record['type']})")
                print(f"    Chunk: {record['chunk_text']}")
        else:
            print(f"No chunks found mentioning '{entity_name}'")

# Find chunks mentioning specific entities
find_chunks_for_entity(driver, "iPhone")

=== Chunks mentioning 'iPhone' ===

[1] Entity: iPhone (Product)
    Chunk: goal of maximizing each of the business' potential. The anticipated benefits of this reorganization
may not be obtained if circumstances prevent us from taking advantage of the strategic and
business opportunities that we expect it may afford us. As a result, we may incur the costs of a
holding company structure without realizing the anticipated benefits, which could adversely affect
our reputation, financial condition, and operating results. 
Alphabet's management is dedicating significant effort to the new operating structure. These efforts
may divert management's focus and resources from Alphabet's business, corporate initiatives, or
strategic opportunities, which could have an adverse effect on our businesses, results of operations,
financial condition, or prospects. Additionally, our subsidiaries may be restricted in their ability to pay
cash dividends or to make other distributions to Alphabet, as the new

In [13]:
# Try finding chunks for other entities
find_chunks_for_entity(driver, "Microsoft")

=== Chunks mentioning 'Microsoft' ===

[1] Entity: Microsoft Cloud (Product)
    Chunk: 10-K Filing Data
Section: Item1
>ITEM 1. B
USINESS
GENERAL
Embracing Our Future
Microsoft is a technology company whose mission is to empower every person and every
organization on the planet to achieve more. We strive to create local opportunity, growth, and impact
in every country around the world. We are creating the platforms and tools
,
 powered by artificial intelligence ("
AI
"), that deliver better, faster, and more effective solutions to support small and large business
competitiveness, improve educational and health outcomes, grow public-sector efficiency,  and
empower human ingenuity. From infrastructure and data, to business applications and collaboration,
we provide unique, differentiated value to customers.
 
In a world of increasing economic complexity, AI has the power to revolutionize many types of work.
Microsoft is now innovating and expanding our portfolio with AI capabilities to

In [14]:
find_chunks_for_entity(driver, "GPU")

=== Chunks mentioning 'GPU' ===

[1] Entity: NVIDIA Hopper GPU architecture (Product)
    Chunk: 11,132 
$
7,434 
Up 50%
Income from operations
$
4,224 
$
10,041 
Down 58%
Net income
$
4,368 
$
9,752 
Down 55%
Net income per diluted share
$
1.74 
$
3.85 
Down 55%
We specialize in markets where our computing platforms can provide tremendous acceleration for
applications. These platforms incorporate processors, interconnects, software, algorithms, systems,
and services to deliver unique value. Our platforms address four large markets where our expertise
is critical: Data Center, Gaming, Professional Visualization, and Automotive.
Revenue for fiscal year 2023 revenue was $26.97 billion, flat compared with a year ago.
37
Table of Contents
Data Center revenue was up 41% from a year ago led by strong growth from hyperscale customers
and also reflects purchases made by several CSP partners to support multi-year cloud service
agreements for our new NVIDIA AI cloud service offerings and our res

## Explore Relationships

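With multiple companies and products in the graph, we can query how they are connected. Note that the query below matches an `OFFERS_PRODUCT` relationship, which is not among the relationship types listed in the graph summary above, so it returns no products for this snapshot.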

In [15]:
def show_company_products(driver, company_name: str):
    """Show products offered by a company."""
    with driver.session() as session:
        result = session.run("""
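            MATCH (c:Company)-[:OFFERS_PRODUCT]->(p:Product)
            // Company names are mostly stored in upper case (e.g. "APPLE INC"),
            // so compare case-insensitively.
            WHERE toLower(c.name) CONTAINS toLower($name)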
            RETURN c.name as company, p.name as product
            ORDER BY p.name
            LIMIT 20
        """, name=company_name)
        
        records = list(result)
        if records:
            print(f"=== Products from '{company_name}' ===")
            for record in records:
                print(f"  {record['company']} -> {record['product']}")
        else:
            print(f"No products found for '{company_name}'")

show_company_products(driver, "Apple")

No products found for 'Apple'


In [None]:
show_company_products(driver, "Microsoft")

No products found for 'Microsoft'
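
When a relationship query returns nothing, a useful next step is to ask the database which relationship types actually connect two labels. The helper below is a minimal sketch. The name `relationship_types_between` is hypothetical, and it makes no assumption about which types exist in this snapshot.

```python
def relationship_types_between(driver, source_label, target_label):
    """Return (type, count) pairs for relationships from one label to another."""
    with driver.session() as session:
        result = session.run("""
            MATCH (a)-[r]->(b)
            WHERE $source_label IN labels(a) AND $target_label IN labels(b)
            RETURN type(r) AS type, count(*) AS count
            ORDER BY count DESC
        """, source_label=source_label, target_label=target_label)

        # Materialize plain tuples while the session is still open.
        return [(record["type"], record["count"]) for record in result]
```

Calling `relationship_types_between(driver, "Company", "Product")` lists the types that point from companies to products. An empty list means no direct relationship exists in that direction, in which case swapping the two labels checks the reverse direction.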


## Summary

With the full dataset loaded, you can see:

1. **More diverse search results** - Queries can now return chunks from multiple companies and documents
2. **Richer entity network** - Many more extracted companies, products, executives, risk factors, and other entities
3. **Cross-document relationships** - Entities are linked across different SEC filings
4. **More context for answers** - Retrievers have more chunks and entities to draw on when grounding their responses

This full dataset is what you'll use in the upcoming retriever and agent notebooks to build powerful GraphRAG applications.

---

**Next:** [Vector Retriever](02_01_vector_retriever.ipynb)

In [None]:
# Cleanup
driver.close()
print("Connection closed.")

Connection closed.
