# Entity Extraction Basics

This notebook demonstrates how to use an LLM to extract structured entities and relationships from text, building a knowledge graph.

**Background:** The previous notebooks (01_01, 01_02) taught manual approaches to chunking and embeddings. This notebook uses `SimpleKGPipeline`, which handles splitting, embedding, entity extraction, and storage in a single step.

**Learning Objectives:**
- Understand the difference between lexical graphs and semantic graphs
- Define a schema with entity and relationship types
- Use `SimpleKGPipeline` to extract entities from text
- Query the combined graph (chunks + entities)

---

## Lexical vs Semantic Graphs

A **lexical graph** represents document structure:
```
(:Document) <-[:FROM_DOCUMENT]- (:Chunk) -[:NEXT_CHUNK]-> (:Chunk)
```

A **semantic graph** captures extracted entities and their relationships:
```
(:Chunk)<-[:FROM_CHUNK]-(:Company)-[:OFFERS_PRODUCT]->(:Product)
                                 \-[:OFFERS_SERVICE]->(:Service)
```

Combining the two lets a single query start from chunks found by content similarity and then follow structured relationships to the entities extracted from those chunks.
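
As a rough illustration of how the two layers connect, the sketch below models both graphs as plain Python tuples, with no database involved. The node identifiers (`chunk-0`, `doc-0`) and the helpers `sources_of` and `targets_of` are hypothetical names chosen for this example. The relationship directions follow the queries used later in this notebook, where entities point to the chunk they were extracted from.

```python
# Toy in-memory model of the combined graph (illustrative only, not Neo4j).
# Each edge is (source, relationship_type, target).
LEXICAL_EDGES = [
    ("chunk-0", "FROM_DOCUMENT", "doc-0"),
    ("chunk-1", "FROM_DOCUMENT", "doc-0"),
    ("chunk-0", "NEXT_CHUNK", "chunk-1"),
]

SEMANTIC_EDGES = [
    ("Apple", "FROM_CHUNK", "chunk-0"),
    ("iPhone", "FROM_CHUNK", "chunk-0"),
    ("AppleCare", "FROM_CHUNK", "chunk-1"),
    ("Apple", "OFFERS_PRODUCT", "iPhone"),
    ("Apple", "OFFERS_SERVICE", "AppleCare"),
]

ALL_EDGES = LEXICAL_EDGES + SEMANTIC_EDGES


def sources_of(target, rel_type):
    """Return nodes that have an edge of rel_type pointing at target."""
    return [s for s, r, t in ALL_EDGES if r == rel_type and t == target]


def targets_of(source, rel_type):
    """Return nodes that source points at via rel_type."""
    return [t for s, r, t in ALL_EDGES if r == rel_type and s == source]


# Start from a chunk (for example, one returned by a similarity search),
# hop to the entities extracted from it, then to what those entities offer.
entities_in_chunk = sources_of("chunk-0", "FROM_CHUNK")
offerings = {
    entity: targets_of(entity, "OFFERS_PRODUCT") + targets_of(entity, "OFFERS_SERVICE")
    for entity in entities_in_chunk
}
print(entities_in_chunk)
print(offerings)
```

The lexical edges answer "where did this text come from?", while the semantic edges answer "what does this text talk about?". The traversal uses both.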

## Setup

Import required modules.

In [None]:
import sys
sys.path.insert(0, '../new-workshops/solutions')

from neo4j import GraphDatabase
from neo4j_graphrag.experimental.pipeline.kg_builder import SimpleKGPipeline

from config import Neo4jConfig, get_llm, get_embedder

## Sample Data

We use sample SEC 10-K text that contains clear entities (a company, its products, and its services) for extraction.

In [None]:
SAMPLE_TEXT = """
Apple Inc. ("Apple" or the "Company") designs, manufactures and markets smartphones, 
personal computers, tablets, wearables and accessories, and sells a variety of related 
services. The Company's fiscal year is the 52- or 53-week period that ends on the last 
Saturday of September.

Products

iPhone is the Company's line of smartphones based on its iOS operating system. The iPhone 
product line includes iPhone 14 Pro, iPhone 14, iPhone 13 and iPhone SE. Mac is the Company's 
line of personal computers based on its macOS operating system. iPad is the Company's line 
of multi-purpose tablets based on its iPadOS operating system.

Services

Advertising includes third-party licensing arrangements and the Company's own advertising 
platforms. AppleCare offers a portfolio of fee-based service and support products. Cloud 
Services store and keep customers' content up-to-date across all devices. Digital Content 
operates various platforms for discovering, purchasing, streaming and downloading digital 
content and apps. Payment Services include Apple Card and Apple Pay.
""".strip()

print(f"Sample text length: {len(SAMPLE_TEXT)} characters")

## Connect to Neo4j

In [None]:
neo4j_config = Neo4jConfig()
driver = GraphDatabase.driver(
    neo4j_config.uri,
    auth=(neo4j_config.username, neo4j_config.password)
)
driver.verify_connectivity()
print("Connected to Neo4j successfully!")

## Clear Existing Data

Delete all nodes (and their relationships) left over from previous runs. Note that this empties the entire database. `SimpleKGPipeline` will then create everything fresh: Documents, Chunks with embeddings, and extracted entities.

In [None]:
def clear_graph(driver):
    """Remove all nodes from previous runs."""
    with driver.session() as session:
        result = session.run("""
            MATCH (n)
            DETACH DELETE n
            RETURN count(n) as deleted
        """)
        count = result.single()["deleted"]
        print(f"Deleted {count} nodes")

clear_graph(driver)

## Define Schema

Define the entity types and relationship types we want to extract. This tells the LLM what to look for.

The schema is built from simple dictionaries and tuples:
- **Entity types**: `{"label": "...", "description": "..."}`
- **Relationship types**: `{"label": "...", "description": "..."}`
- **Patterns**: Tuples of `(source_entity, relationship, target_entity)`
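
Because patterns refer to entity and relationship labels by name, a typo in any of them leaves a pattern that matches nothing in the schema. A minimal sketch of a consistency check follows. `validate_schema` is a hypothetical helper written for this example, not part of `neo4j_graphrag`, and it assumes every type is given as a dict with a `label` key.

```python
def validate_schema(entity_types, relationship_types, patterns):
    """Return a list of problems found in the patterns (empty if consistent)."""
    entity_labels = {e["label"] for e in entity_types}
    rel_labels = {r["label"] for r in relationship_types}
    problems = []
    for source, rel, target in patterns:
        if source not in entity_labels:
            problems.append(f"unknown source entity: {source}")
        if rel not in rel_labels:
            problems.append(f"unknown relationship: {rel}")
        if target not in entity_labels:
            problems.append(f"unknown target entity: {target}")
    return problems


entity_types = [
    {"label": "Company", "description": "A company or organization"},
    {"label": "Product", "description": "A product offered by a company"},
]
relationship_types = [
    {"label": "OFFERS_PRODUCT", "description": "Company offers a product"},
]

good = [("Company", "OFFERS_PRODUCT", "Product")]
bad = [("Company", "OFFERS_PRODUCT", "Prodcut")]  # deliberate typo

print(validate_schema(entity_types, relationship_types, good))
print(validate_schema(entity_types, relationship_types, bad))
```

Running a check like this before creating the pipeline catches mismatches cheaply, before any LLM calls are made.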

In [None]:
# Define entity types (can be simple strings or dicts with more detail)
ENTITY_TYPES = [
    {"label": "Company", "description": "A company or organization"},
    {"label": "Product", "description": "A product offered by a company"},
    {"label": "Service", "description": "A service offered by a company"},
]

# Define relationship types
RELATIONSHIP_TYPES = [
    {"label": "OFFERS_PRODUCT", "description": "Company offers a product"},
    {"label": "OFFERS_SERVICE", "description": "Company offers a service"},
]

# Define valid patterns (source, relationship, target)
PATTERNS = [
    ("Company", "OFFERS_PRODUCT", "Product"),
    ("Company", "OFFERS_SERVICE", "Service"),
]

print("Schema defined:")
print(f"  Entity types: {[e['label'] for e in ENTITY_TYPES]}")
print(f"  Relationship types: {[r['label'] for r in RELATIONSHIP_TYPES]}")
print(f"  Patterns: {PATTERNS}")

## Initialize LLM and Embedder

The pipeline needs:
- **LLM** - To extract entities from text
- **Embedder** - To generate chunk embeddings

In [None]:
llm = get_llm()
embedder = get_embedder()

print(f"LLM: {llm.model_name}")
print(f"Embedder: {embedder.model}")

## Create SimpleKGPipeline

The `SimpleKGPipeline` combines all the steps we've learned:
1. Text splitting
2. Embedding generation
3. Entity extraction (via LLM)
4. Writing to Neo4j

It handles everything automatically based on the schema you provide.

In [None]:
pipeline = SimpleKGPipeline(
    llm=llm,
    driver=driver,
    embedder=embedder,
    schema={
        "node_types": ENTITY_TYPES,
        "relationship_types": RELATIONSHIP_TYPES,
        "patterns": PATTERNS,
    },
    from_pdf=False,  # We're using text, not PDF
    on_error="IGNORE",  # Skip errors for demo
)

print("SimpleKGPipeline created successfully!")

## Run Entity Extraction

Run the pipeline on our sample text. This will:
1. Split text into chunks
2. Generate embeddings for each chunk
3. Use the LLM to extract entities and relationships
4. Write everything to Neo4j
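
To make step 1 concrete, here is a minimal sketch of fixed-size splitting with overlap. It illustrates the general idea only and is not the splitter `SimpleKGPipeline` uses internally. `split_text`, `chunk_size`, and `overlap` are names chosen for this example.

```python
def split_text(text, chunk_size=200, overlap=20):
    """Split text into character windows of chunk_size that overlap by `overlap`."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


sample = "abcdefghij" * 5  # 50 characters
chunks = split_text(sample, chunk_size=20, overlap=5)
for i, chunk in enumerate(chunks):
    print(i, len(chunk), chunk)
```

The overlap means the end of one chunk is repeated at the start of the next, which reduces the chance that an entity mention is cut in half at a chunk boundary.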

In [None]:
print("Running entity extraction...")
print("(This may take 30-60 seconds)\n")

result = await pipeline.run_async(text=SAMPLE_TEXT)

print("Entity extraction complete!")

## Explore Extracted Entities

Query the graph to see what entities were extracted.
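
When a label has to be interpolated into a Cypher string, as in the per-type listing in the cell below, it is safer to check it against a known set first. The sketch below assumes the three labels from this notebook's schema. `entity_list_query` and `ALLOWED_LABELS` are hypothetical names, not part of the `neo4j` driver.

```python
ALLOWED_LABELS = {"Company", "Product", "Service"}


def entity_list_query(label, limit=10):
    """Build a Cypher query listing entity names for a whitelisted label."""
    if label not in ALLOWED_LABELS:
        raise ValueError(f"unexpected label: {label!r}")
    if not isinstance(limit, int) or limit < 1:
        raise ValueError("limit must be a positive integer")
    return (
        f"MATCH (n:{label}) "
        "RETURN n.name AS name "
        "ORDER BY n.name "
        f"LIMIT {limit}"
    )


print(entity_list_query("Product"))

# A string that is not a schema label is rejected before it reaches the database.
try:
    entity_list_query("Product) DETACH DELETE (n")
except ValueError as exc:
    print(f"rejected: {exc}")
```

With hard-coded labels, as in this notebook, the check is redundant. It starts to matter once the label comes from user input or configuration.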

In [None]:
def show_entities(driver):
    """Display extracted entities by type."""
    with driver.session() as session:
        # Get entity counts by label
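        result = session.run("""
            MATCH (n)
            WHERE n:Company OR n:Product OR n:Service
            // Pick the schema label explicitly: nodes may carry extra labels
            // added by the pipeline, so labels(n)[0] is not reliable
            RETURN [l IN labels(n) WHERE l IN ['Company', 'Product', 'Service']][0] AS label,
                   count(n) AS count
            ORDER BY count DESC
        """)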
        print("=== Entity Counts ===")
        for record in result:
            print(f"  {record['label']}: {record['count']}")
        
        # List entities by type
        for label in ["Company", "Product", "Service"]:
            result = session.run(f"""
                MATCH (n:{label})
                RETURN n.name as name
                ORDER BY n.name
                LIMIT 10
            """)
            names = [record["name"] for record in result]
            if names:
                print(f"\n=== {label} Entities ===")
                for name in names:
                    print(f"  - {name}")

show_entities(driver)

## Explore Relationships

See the relationships between extracted entities.

In [None]:
def show_relationships(driver):
    """Display extracted relationships."""
    with driver.session() as session:
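        result = session.run("""
            MATCH (c:Company)-[r]->(target)
            WHERE type(r) IN ['OFFERS_PRODUCT', 'OFFERS_SERVICE']
            // Select the schema label explicitly: labels(target)[0] could be
            // an extra label added by the pipeline
            RETURN c.name AS company, type(r) AS relationship,
                   [l IN labels(target) WHERE l IN ['Product', 'Service']][0] AS target_type,
                   target.name AS target_name
            ORDER BY c.name, type(r)
            LIMIT 20
        """)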
        
        print("=== Extracted Relationships ===")
        for record in result:
            print(f"  ({record['company']}) -[{record['relationship']}]-> ({record['target_type']}: {record['target_name']})")

show_relationships(driver)

## Query the Combined Graph

Now we can query both the lexical graph (Documents, Chunks) and semantic graph (Entities) together.

In [None]:
def show_graph_summary(driver):
    """Show a summary of the complete graph."""
    with driver.session() as session:
        # Count all node types
        result = session.run("""
            MATCH (n)
            UNWIND labels(n) as label
            RETURN label, count(*) as count
            ORDER BY count DESC
        """)
        print("=== Node Counts ===")
        for record in result:
            print(f"  {record['label']}: {record['count']}")
        
        # Count relationship types
        result = session.run("""
            MATCH ()-[r]->()
            RETURN type(r) as type, count(*) as count
            ORDER BY count DESC
        """)
        print("\n=== Relationship Counts ===")
        for record in result:
            print(f"  {record['type']}: {record['count']}")

show_graph_summary(driver)

## Find Chunks Containing Entities

Query chunks that are connected to specific entities via `FROM_CHUNK` relationships.

In [None]:
def find_chunks_for_entity(driver, entity_name: str):
    """Find chunks that mention a specific entity."""
    with driver.session() as session:
        # Entities point TO chunks via FROM_CHUNK
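        result = session.run("""
            MATCH (e)-[:FROM_CHUNK]->(c:Chunk)
            WHERE e.name CONTAINS $name
            // Select the schema label explicitly: labels(e)[0] could be
            // an extra label added by the pipeline
            RETURN e.name AS entity,
                   [l IN labels(e) WHERE l IN ['Company', 'Product', 'Service']][0] AS type,
                   c.text AS chunk_text
            LIMIT 5
        """, name=entity_name)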
        
        records = list(result)
        if records:
            print(f"Chunks mentioning '{entity_name}':")
            for record in records:
                print(f"\n  Entity: {record['entity']} ({record['type']})")
                print(f"  Chunk: {record['chunk_text']}")
        else:
            print(f"No chunks found mentioning '{entity_name}'")

# Try finding chunks for a product
find_chunks_for_entity(driver, "iPhone")

## Summary

In this notebook, you learned:

1. **Lexical vs Semantic graphs** - Documents/Chunks vs Entities/Relationships
2. **Schema definition** - Telling the LLM what entities and relationships to extract
3. **SimpleKGPipeline** - Automated pipeline for text → knowledge graph
4. **Combined queries** - Leveraging both document structure and extracted entities

You now have a complete knowledge graph with:
- **Documents** - Source file metadata
- **Chunks** - Text segments with embeddings
- **Entities** - Extracted companies, products, services
- **Relationships** - Connections between entities

Note that we've been working with a small sample of text. In the next notebook, you'll load the full dataset to see much richer results.

---

**Next:** [Working with the Full Dataset](01_04_full_dataset.ipynb)

In [None]:
# Cleanup
driver.close()
print("Connection closed.")