# GraphRAG Step 1: Document Processing & Knowledge Extraction

This notebook implements the first stage of the GraphRAG indexing pipeline (chunking and knowledge extraction) described in Microsoft's paper ["From Local to Global: A Graph RAG Approach to Query-Focused Summarization"](https://arxiv.org/abs/2404.16130).

## Pipeline Steps
1. **Load document** - Sample text for processing
2. **Chunk document** - Split into chunks of up to 600 characters with a 100-character overlap (GraphRAG itself uses 600 tokens with a 100-token overlap)
3. **Entity extraction** - Use LLM to identify named entities
4. **Relationship extraction** - Extract connections between entities
5. **Claims extraction** - Extract factual statements about entities

## Model
Using locally deployed Ollama with `qwen2.5:3b`

## Setup

In [1]:
import httpx
import json
from typing import Any
from dataclasses import dataclass, field
from langchain_text_splitters import RecursiveCharacterTextSplitter

OLLAMA_BASE_URL = "http://localhost:11434"
MODEL = "qwen2.5:3b"


In [2]:
# Verify Ollama is running
response = httpx.get(f"{OLLAMA_BASE_URL}/api/tags")
models = [m["name"] for m in response.json().get("models", [])]
print(f"Available models: {models}")
assert MODEL in models, f"Model {MODEL} not found. Please run: ollama pull {MODEL}"

Available models: ['qwen2.5:3b']


## Step 1: Load Sample Document

Using a sample tech news article for demonstration.

In [3]:
# Sample document - a tech news article about AI companies
SAMPLE_DOCUMENT = """
OpenAI Announces GPT-5 Partnership with Microsoft

San Francisco, February 2026 - OpenAI, the artificial intelligence research company led by CEO Sam Altman, 
announced today a major expansion of its partnership with Microsoft. The deal, reportedly worth $10 billion, 
will see Microsoft integrate GPT-5, OpenAI's latest large language model, across its entire product suite 
including Azure, Office 365, and GitHub Copilot.

Sam Altman stated in the press conference: "This partnership represents the next chapter in our mission to 
ensure artificial general intelligence benefits all of humanity. Microsoft's enterprise reach combined with 
our research capabilities creates unprecedented opportunities."

Microsoft CEO Satya Nadella emphasized the strategic importance of the deal: "AI is the defining technology 
of our era. By deepening our collaboration with OpenAI, we're positioning Microsoft at the forefront of this 
transformation."

The announcement sent Microsoft shares up 4.2% in after-hours trading. Industry analysts at Goldman Sachs 
raised their price target for Microsoft stock to $450, citing the competitive advantage in enterprise AI.

Google, Microsoft's primary competitor in cloud services, responded by announcing accelerated deployment of 
its Gemini Ultra model. Google CEO Sundar Pichai said the company would invest an additional $5 billion in 
AI infrastructure through 2027.

The deal comes amid increased regulatory scrutiny of AI companies. The Federal Trade Commission (FTC) has 
opened an inquiry into whether the Microsoft-OpenAI relationship constitutes an anti-competitive arrangement. 
FTC Chair Lina Khan stated that regulators are "closely monitoring the concentration of AI capabilities 
among a small number of tech giants."

OpenAI board member and former Treasury Secretary Larry Summers defended the partnership, arguing that 
collaboration between research organizations and large technology companies accelerates beneficial AI 
development while maintaining necessary safety guardrails.

The GPT-5 model, which OpenAI claims achieves human-level performance on complex reasoning tasks, will 
begin rolling out to Microsoft enterprise customers in Q3 2026. Consumer products powered by GPT-5 are 
expected to launch by early 2027.
"""

print(f"Document length: {len(SAMPLE_DOCUMENT)} characters")

Document length: 2281 characters


## Step 2: Chunk Document

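The GraphRAG methodology uses chunks of ~600 tokens with a 100-token overlap.
The splitter below measures length in characters (`length_function=len`), so its limits of 600 and 100 are character counts, roughly 150 and 25 tokens at 1 token ≈ 4 characters. To match the paper's token budget, multiply both values by 4.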

In [5]:
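# GraphRAG uses 600-token chunks with a 100-token overlap.
# length_function=len counts characters, so the limits below are in characters
# (about 150 and 25 tokens at 1 token ≈ 4 characters).
CHUNK_SIZE = 600  # characters
CHUNK_OVERLAP = 100  # characters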

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=CHUNK_SIZE,
    chunk_overlap=CHUNK_OVERLAP,
    length_function=len,
    separators=["\n\n", "\n", ". ", " ", ""]
)

chunks = text_splitter.split_text(SAMPLE_DOCUMENT)

print(f"Created {len(chunks)} chunks")
for i, chunk in enumerate(chunks):
    print(f"\n--- Chunk {i+1} ({len(chunk)} chars) ---")
    print((chunk[:200] + "...") if len(chunk) > 200 else chunk)

Created 5 chunks

--- Chunk 1 (424 chars) ---
OpenAI Announces GPT-5 Partnership with Microsoft

San Francisco, February 2026 - OpenAI, the artificial intelligence research company led by CEO Sam Altman, 
announced today a major expansion of its ...

--- Chunk 2 (517 chars) ---
Sam Altman stated in the press conference: "This partnership represents the next chapter in our mission to 
ensure artificial general intelligence benefits all of humanity. Microsoft's enterprise reac...

--- Chunk 3 (462 chars) ---
The announcement sent Microsoft shares up 4.2% in after-hours trading. Industry analysts at Goldman Sachs 
raised their price target for Microsoft stock to $450, citing the competitive advantage in en...

--- Chunk 4 (360 chars) ---
The deal comes amid increased regulatory scrutiny of AI companies. The Federal Trade Commission (FTC) has 
opened an inquiry into whether the Microsoft-OpenAI relationship constitutes an anti-competit...

--- Chunk 5 (508 chars) ---
OpenAI board member an
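
The relationship between a token budget and a character budget, and the way an overlap makes neighbouring chunks share text, can be sketched in plain Python. This is a minimal illustration, not how `RecursiveCharacterTextSplitter` works internally (that splitter prefers to break on the configured separators). The names `CHARS_PER_TOKEN`, `tokens_to_chars`, and `sliding_window_chunks` are hypothetical, and the 4-characters-per-token figure is only the rough heuristic used in this notebook.

```python
CHARS_PER_TOKEN = 4  # rough heuristic from this notebook, not a real tokenizer


def tokens_to_chars(n_tokens: int, chars_per_token: int = CHARS_PER_TOKEN) -> int:
    """Convert a token budget into an approximate character budget."""
    return n_tokens * chars_per_token


def sliding_window_chunks(text: str, size: int, overlap: int) -> list[str]:
    """Split text into fixed-size character windows sharing `overlap` characters."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    step = size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + size])
        if start + size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks


# A 600-token / 100-token budget expressed in characters.
print(tokens_to_chars(600), tokens_to_chars(100))  # 2400 400

# Each window repeats the last character of the previous one.
print(sliding_window_chunks("abcdefghij", size=4, overlap=1))  # ['abcd', 'defg', 'ghij']
```

Under this heuristic, passing `chunk_size=2400` and `chunk_overlap=400` to the splitter would approximate the paper's token-based settings, at the cost of putting the whole 2281-character sample into a single chunk.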

## Helper: Ollama Chat Function

In [6]:
def chat_ollama(prompt: str, system: str = "", temperature: float = 0.0) -> str:
    """Send a chat request to Ollama and return the response."""
    messages = []
    if system:
        messages.append({"role": "system", "content": system})
    messages.append({"role": "user", "content": prompt})
    
    response = httpx.post(
        f"{OLLAMA_BASE_URL}/api/chat",
        json={
            "model": MODEL,
            "messages": messages,
            "stream": False,
            "options": {"temperature": temperature}
        },
        timeout=120.0
    )
    response.raise_for_status()
    return response.json()["message"]["content"]

# Test the connection
test_response = chat_ollama("Say 'Hello GraphRAG!' and nothing else.")
print(f"Ollama test: {test_response}")

Ollama test: Hello GraphRAG!
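
Because every extraction step below expects JSON back, one option is to ask Ollama for JSON directly through the `format` field of the `/api/chat` request body. The sketch below only builds the payload so it can be inspected without a running server. `build_chat_payload` and its `json_mode` flag are hypothetical names introduced here, not part of the notebook's `chat_ollama` helper.

```python
from typing import Any


def build_chat_payload(
    model: str,
    prompt: str,
    system: str = "",
    temperature: float = 0.0,
    json_mode: bool = False,
) -> dict[str, Any]:
    """Build a request body for Ollama's /api/chat endpoint."""
    messages = []
    if system:
        messages.append({"role": "system", "content": system})
    messages.append({"role": "user", "content": prompt})

    payload: dict[str, Any] = {
        "model": model,
        "messages": messages,
        "stream": False,
        "options": {"temperature": temperature},
    }
    if json_mode:
        # Ask Ollama to constrain the reply to syntactically valid JSON.
        payload["format"] = "json"
    return payload


payload = build_chat_payload("qwen2.5:3b", "List two colors as JSON.", json_mode=True)
print(payload["format"], len(payload["messages"]))  # json 1
```

JSON mode constrains syntax only. It does not guarantee that the top-level value is an array or that the expected keys are present, so the parsing code still needs to check the shape of what comes back.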


## Step 3: Entity Extraction

Extract named entities with their types and descriptions.
Entity types: PERSON, ORGANIZATION, LOCATION, EVENT, PRODUCT, DATE, MONEY, CONCEPT

In [7]:
@dataclass
class Entity:
    name: str
    type: str
    description: str
    source_chunk: int = 0

ENTITY_EXTRACTION_PROMPT = """
You are an expert at extracting named entities from text.

Extract all named entities from the following text. For each entity provide:
1. name: The entity name (use UPPERCASE for consistency)
2. type: One of [PERSON, ORGANIZATION, LOCATION, EVENT, PRODUCT, DATE, MONEY, CONCEPT]
3. description: A brief description of the entity based on the text

Return ONLY valid JSON array. Example format:
[
  {{"name": "JOHN SMITH", "type": "PERSON", "description": "CEO of Example Corp who announced the merger"}},
  {{"name": "EXAMPLE CORP", "type": "ORGANIZATION", "description": "Technology company acquiring StartupXYZ"}}
]

TEXT:
{text}

JSON OUTPUT:
"""

def extract_entities(text: str, chunk_id: int = 0) -> list[Entity]:
    """Extract entities from a text chunk using the LLM."""
    prompt = ENTITY_EXTRACTION_PROMPT.format(text=text)
    response = chat_ollama(prompt)
    
    # Parse JSON from response (handle potential markdown code blocks)
    json_str = response.strip()
    if json_str.startswith("```"):
        json_str = json_str.split("```")[1]
        if json_str.startswith("json"):
            json_str = json_str[4:]
    json_str = json_str.strip()
    
    try:
        entities_data = json.loads(json_str)
        return [
            Entity(
                name=e.get("name", "").upper(),
                type=e.get("type", "UNKNOWN"),
                description=e.get("description", ""),
                source_chunk=chunk_id
            )
            for e in entities_data
        ]
    except json.JSONDecodeError as ex:
        print(f"Failed to parse JSON: {ex}")
        print(f"Raw response: {response}")
        return []
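
The three extraction functions in this notebook each repeat the same fence-stripping and `json.loads` logic, and each assumes the parsed value is a list of dicts. A shared helper could centralize that and tolerate a few more failure modes (prose around the array, a single object instead of an array, non-dict items). This is a sketch under those assumptions; `parse_json_array` is a hypothetical name, not part of the original notebook.

```python
import json
import re
from typing import Any

FENCE = "`" * 3  # a Markdown code fence, built this way to keep the source readable
_FENCE_RE = re.compile(
    re.escape(FENCE) + r"(?:json)?\s*(.*?)" + re.escape(FENCE),
    re.DOTALL | re.IGNORECASE,
)


def parse_json_array(response: str) -> list[dict[str, Any]]:
    """Best-effort extraction of a JSON array of objects from an LLM reply."""
    text = response.strip()
    fenced = _FENCE_RE.search(text)
    if fenced:
        text = fenced.group(1).strip()
    else:
        # Fall back to the outermost [...] span if the model added prose around it.
        start, end = text.find("["), text.rfind("]")
        if start != -1 and end > start:
            text = text[start:end + 1]
    try:
        data = json.loads(text)
    except json.JSONDecodeError:
        return []
    if isinstance(data, dict):
        data = [data]  # a lone object is treated as a one-element array
    if not isinstance(data, list):
        return []
    return [item for item in data if isinstance(item, dict)]


reply = f'{FENCE}json\n[{{"name": "OPENAI", "type": "ORGANIZATION"}}]\n{FENCE}'
print(parse_json_array(reply))  # [{'name': 'OPENAI', 'type': 'ORGANIZATION'}]
print(parse_json_array("Sorry, I cannot help with that."))  # []
```

Returning an empty list on failure mirrors the behaviour of the extraction functions here; logging the raw response before returning would keep their debugging output as well.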

In [8]:
# Extract entities from all chunks
all_entities: list[Entity] = []

for i, chunk in enumerate(chunks):
    print(f"Processing chunk {i+1}/{len(chunks)}...")
    entities = extract_entities(chunk, chunk_id=i)
    all_entities.extend(entities)
    print(f"  Found {len(entities)} entities")

print(f"\nTotal entities extracted: {len(all_entities)}")

Processing chunk 1/5...
  Found 4 entities
Processing chunk 2/5...
  Found 2 entities
Processing chunk 3/5...
  Found 5 entities
Processing chunk 4/5...
  Found 4 entities
Processing chunk 5/5...
  Found 5 entities

Total entities extracted: 20


In [9]:
# Display extracted entities
print("\n=== EXTRACTED ENTITIES ===")
for entity in all_entities:
    print(f"\n[{entity.type}] {entity.name}")
    print(f"  Description: {entity.description}")
    print(f"  Source: Chunk {entity.source_chunk + 1}")


=== EXTRACTED ENTITIES ===

[ORGANIZATION] OPENAI
  Description: Artificial intelligence research company
  Source: Chunk 1

[ORGANIZATION] MICROSOFT
  Description: Technology company
  Source: Chunk 1

[PERSON] SAM ALTMAN
  Description: CEO of OpenAI
  Source: Chunk 1

[PRODUCT] GPT-5
  Description: Large language model developed by OpenAI
  Source: Chunk 1

[PERSON] SATYA NADELLA
  Description: CEO of Microsoft
  Source: Chunk 2

[ORGANIZATION] MICROSOFT
  Description: Technology company
  Source: Chunk 2

[ORGANIZATION] MICROSOFT
  Description: Company whose shares increased after announcement
  Source: Chunk 3

[ORGANIZATION] GOLDMAN SACHS
  Description: Financial institution providing price target for Microsoft stock
  Source: Chunk 3

[ORGANIZATION] GOOGLE
  Description: Competitor in cloud services with CEO Sundar Pichai
  Source: Chunk 3

[ORGANIZATION] MICROSOFT
  Description: Company whose shares increased after announcement
  Source: Chunk 3

[PERSON] SUNDAR PICHAI
  Descri

In [10]:
# Deduplicate entities by name (merge descriptions)
def deduplicate_entities(entities: list[Entity]) -> list[Entity]:
    """Merge duplicate entities, combining their descriptions."""
    entity_map: dict[str, Entity] = {}
    
    for entity in entities:
        key = entity.name
        if key in entity_map:
            # Merge descriptions if different
            existing = entity_map[key]
            if entity.description not in existing.description:
                existing.description = f"{existing.description} | {entity.description}"
        else:
            entity_map[key] = Entity(
                name=entity.name,
                type=entity.type,
                description=entity.description,
                source_chunk=entity.source_chunk
            )
    
    return list(entity_map.values())

unique_entities = deduplicate_entities(all_entities)
print(f"Unique entities after deduplication: {len(unique_entities)}")

Unique entities after deduplication: 14
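
`deduplicate_entities` keeps only the first `source_chunk` for each name and uses a substring test to decide whether a description is new. A variant that records every chunk an entity appeared in, and compares descriptions exactly, might look like the sketch below. `MergedEntity`, `normalize_name`, and `merge_entities` are hypothetical names, and the input is simplified to plain tuples so the example is self-contained.

```python
from dataclasses import dataclass, field


@dataclass
class MergedEntity:
    """Hypothetical variant of Entity that remembers every chunk it appeared in."""
    name: str
    type: str
    descriptions: list[str] = field(default_factory=list)
    source_chunks: set[int] = field(default_factory=set)


def normalize_name(name: str) -> str:
    """Uppercase and collapse whitespace so 'Sam  Altman' matches 'SAM ALTMAN'."""
    return " ".join(name.upper().split())


def merge_entities(records: list[tuple[str, str, str, int]]) -> dict[str, MergedEntity]:
    """Merge (name, type, description, chunk_id) records by normalized name."""
    merged: dict[str, MergedEntity] = {}
    for name, etype, description, chunk_id in records:
        key = normalize_name(name)
        # The first occurrence decides the type, as in deduplicate_entities.
        entity = merged.setdefault(key, MergedEntity(name=key, type=etype))
        if description and description not in entity.descriptions:
            entity.descriptions.append(description)
        entity.source_chunks.add(chunk_id)
    return merged


records = [
    ("MICROSOFT", "ORGANIZATION", "Technology company", 0),
    ("Microsoft ", "ORGANIZATION", "Technology company", 1),
    ("MICROSOFT", "ORGANIZATION", "Company whose shares increased after announcement", 2),
]
merged = merge_entities(records)
print(len(merged), sorted(merged["MICROSOFT"].source_chunks))  # 1 [0, 1, 2]
print(len(merged["MICROSOFT"].descriptions))  # 2
```

Keeping the chunk IDs makes it possible to trace a merged entity back to every passage that mentioned it, which the single `source_chunk` field cannot do.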


## Step 4: Relationship Extraction

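Extract relationships between entities with descriptions and strength scores. The model rates strength on a 1-10 scale, which the code divides by 10 to give values from 0.1 to 1.0.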

In [11]:
@dataclass
class Relationship:
    source: str
    target: str
    description: str
    strength: float = 1.0
    source_chunk: int = 0

RELATIONSHIP_EXTRACTION_PROMPT = """
You are an expert at extracting relationships between entities.

Given the following text and list of entities, extract all relationships between them.
For each relationship provide:
1. source: The source entity name (UPPERCASE)
2. target: The target entity name (UPPERCASE)
3. description: A description of how these entities are related
4. strength: A score from 1-10 indicating relationship strength (10 = very strong)

Return ONLY valid JSON array. Example format:
[
  {{"source": "JOHN SMITH", "target": "EXAMPLE CORP", "description": "John Smith is the CEO of Example Corp", "strength": 9}},
  {{"source": "EXAMPLE CORP", "target": "STARTUPXYZ", "description": "Example Corp is acquiring StartupXYZ", "strength": 8}}
]

ENTITIES:
{entities}

TEXT:
{text}

JSON OUTPUT:
"""

def extract_relationships(text: str, entities: list[Entity], chunk_id: int = 0) -> list[Relationship]:
    """Extract relationships between entities from a text chunk."""
    entity_list = ", ".join([e.name for e in entities])
    prompt = RELATIONSHIP_EXTRACTION_PROMPT.format(text=text, entities=entity_list)
    response = chat_ollama(prompt)
    
    # Parse JSON from response
    json_str = response.strip()
    if json_str.startswith("```"):
        json_str = json_str.split("```")[1]
        if json_str.startswith("json"):
            json_str = json_str[4:]
    json_str = json_str.strip()
    
    try:
        rels_data = json.loads(json_str)
        return [
            Relationship(
                source=r.get("source", "").upper(),
                target=r.get("target", "").upper(),
                description=r.get("description", ""),
                strength=float(r.get("strength", 5)) / 10.0,  # Normalize to 0-1
                source_chunk=chunk_id
            )
            for r in rels_data
        ]
    except json.JSONDecodeError as ex:
        print(f"Failed to parse JSON: {ex}")
        print(f"Raw response: {response}")
        return []

In [12]:
# Extract relationships from all chunks
all_relationships: list[Relationship] = []

for i, chunk in enumerate(chunks):
    print(f"Processing chunk {i+1}/{len(chunks)} for relationships...")
    # Get entities relevant to this chunk
    chunk_entities = [e for e in all_entities if e.source_chunk == i]
    if len(chunk_entities) < 2:
        print("  Skipping - need at least 2 entities")
        continue
    
    relationships = extract_relationships(chunk, chunk_entities, chunk_id=i)
    all_relationships.extend(relationships)
    print(f"  Found {len(relationships)} relationships")

print(f"\nTotal relationships extracted: {len(all_relationships)}")

Processing chunk 1/5 for relationships...
  Found 2 relationships
Processing chunk 2/5 for relationships...
  Found 0 relationships
Processing chunk 3/5 for relationships...
  Found 3 relationships
Processing chunk 4/5 for relationships...
  Found 2 relationships
Processing chunk 5/5 for relationships...
  Found 0 relationships

Total relationships extracted: 7


In [13]:
# Display extracted relationships
print("\n=== EXTRACTED RELATIONSHIPS ===")
for rel in all_relationships:
    print(f"\n{rel.source} --> {rel.target}")
    print(f"  Description: {rel.description}")
    print(f"  Strength: {rel.strength:.1f}")


=== EXTRACTED RELATIONSHIPS ===

OPENAI --> GPT-5
  Description: OpenAI develops GPT-5
  Strength: 0.8

MICROSOFT --> GPT-5
  Description: Microsoft integrates GPT-5 into its product suite
  Strength: 0.9

MICROSOFT --> GOLDMAN SACHS
  Description: Goldman Sachs raised their price target for Microsoft stock
  Strength: 0.8

MICROSOFT --> GOOGLE
  Description: Microsoft's primary competitor in cloud services is Google
  Strength: 0.9

GOOGLE --> SUNDAR PICHAI
  Description: Sundar Pichai is the CEO of Google
  Strength: 1.0

FEDERAL TRADE COMMISSION --> LINA KAHN
  Description: FEDERAL TRADE COMMISSION appoints Lina Kahn as Chair
  Strength: 0.8

FEDERAL TRADE COMMISSION --> MICROSOFT-OPENAI RELATIONSHIP
  Description: FEDERAL TRADE COMMISSION is investigating the Microsoft-OpenAI relationship
  Strength: 0.9
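
The output above includes the endpoint `LINA KAHN`, whereas the article spells the name Lina Khan, and `MICROSOFT-OPENAI RELATIONSHIP`, which reads more like a phrase than a named entity. Edges like these may not match any extracted entity and would dangle once a graph is built. One possible safeguard is to separate relationships whose endpoints are both known entities from the rest. The sketch below uses plain dicts and a hypothetical `filter_relationships` helper. The entity set in the demo is illustrative, not the notebook's actual entity list.

```python
def filter_relationships(
    relationships: list[dict],
    known_entities: set[str],
) -> tuple[list[dict], list[dict]]:
    """Split relationships into (kept, rejected) by whether both endpoints are known."""
    kept, rejected = [], []
    for rel in relationships:
        if rel["source"] in known_entities and rel["target"] in known_entities:
            # Clamp strength into [0, 1] in case the model ignored the 1-10 scale.
            strength = min(max(float(rel.get("strength", 0.5)), 0.0), 1.0)
            kept.append({**rel, "strength": strength})
        else:
            rejected.append(rel)
    return kept, rejected


# Illustrative data only: a small entity set and two candidate edges.
known = {"GOOGLE", "SUNDAR PICHAI", "FEDERAL TRADE COMMISSION"}
candidates = [
    {"source": "GOOGLE", "target": "SUNDAR PICHAI", "strength": 1.0},
    {"source": "FEDERAL TRADE COMMISSION", "target": "LINA KAHN", "strength": 0.8},
]
kept, rejected = filter_relationships(candidates, known)
print(len(kept), len(rejected))  # 1 1
print(rejected[0]["target"])  # LINA KAHN
```

Returning the rejected edges rather than silently dropping them leaves room for a later repair step, such as fuzzy-matching names against the entity list.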


## Step 5: Claims Extraction

Extract factual claims/statements about entities, including dates, events, and specific facts.

In [14]:
@dataclass
class Claim:
    subject: str  # Entity the claim is about
    claim_type: str  # Type: FACT, EVENT, STATEMENT, METRIC, PREDICTION
    description: str  # The actual claim
    date: str = ""  # Associated date if any
    source_chunk: int = 0

CLAIMS_EXTRACTION_PROMPT = """
You are an expert at extracting factual claims from text.

Extract all specific factual claims from the following text. For each claim provide:
1. subject: The entity the claim is about (UPPERCASE)
2. claim_type: One of [FACT, EVENT, STATEMENT, METRIC, PREDICTION]
3. description: The specific claim or fact
4. date: Associated date/timeframe if mentioned (otherwise empty string)

Focus on:
- Numerical facts (prices, percentages, amounts)
- Events (announcements, launches, decisions)
- Quotes and statements by people
- Predictions and forecasts

Return ONLY valid JSON array. Example format:
[
  {{"subject": "EXAMPLE CORP", "claim_type": "METRIC", "description": "Stock rose 15% in after-hours trading", "date": "2026-02-10"}},
  {{"subject": "JOHN SMITH", "claim_type": "STATEMENT", "description": "Stated that the merger will create 1000 new jobs", "date": ""}}
]

TEXT:
{text}

JSON OUTPUT:
"""

def extract_claims(text: str, chunk_id: int = 0) -> list[Claim]:
    """Extract factual claims from a text chunk."""
    prompt = CLAIMS_EXTRACTION_PROMPT.format(text=text)
    response = chat_ollama(prompt)
    
    # Parse JSON from response
    json_str = response.strip()
    if json_str.startswith("```"):
        json_str = json_str.split("```")[1]
        if json_str.startswith("json"):
            json_str = json_str[4:]
    json_str = json_str.strip()
    
    try:
        claims_data = json.loads(json_str)
        return [
            Claim(
                subject=c.get("subject", "").upper(),
                claim_type=c.get("claim_type", "FACT"),
                description=c.get("description", ""),
                date=c.get("date", ""),
                source_chunk=chunk_id
            )
            for c in claims_data
        ]
    except json.JSONDecodeError as ex:
        print(f"Failed to parse JSON: {ex}")
        print(f"Raw response: {response}")
        return []

In [15]:
# Extract claims from all chunks
all_claims: list[Claim] = []

for i, chunk in enumerate(chunks):
    print(f"Processing chunk {i+1}/{len(chunks)} for claims...")
    claims = extract_claims(chunk, chunk_id=i)
    all_claims.extend(claims)
    print(f"  Found {len(claims)} claims")

print(f"\nTotal claims extracted: {len(all_claims)}")

Processing chunk 1/5 for claims...
  Found 0 claims
Processing chunk 2/5 for claims...
  Found 0 claims
Processing chunk 3/5 for claims...
  Found 4 claims
Processing chunk 4/5 for claims...
  Found 1 claims
Processing chunk 5/5 for claims...
  Found 0 claims

Total claims extracted: 5


In [16]:
# Display extracted claims
print("\n=== EXTRACTED CLAIMS ===")
for claim in all_claims:
    date_str = f" [{claim.date}]" if claim.date else ""
    print(f"\n[{claim.claim_type}] {claim.subject}{date_str}")
    print(f"  {claim.description}")


=== EXTRACTED CLAIMS ===

[METRIC] MICROSOFT [2023-10-05]
  Microsoft shares rose 4.2% in after-hours trading

[STATEMENT] GOLDMAN SCHACS INDUSTRY ANALYSTS
  Cited competitive advantage in enterprise AI

[EVENT] GOOGLE
  Announced accelerated deployment of its Gemini Ultra model

[STATEMENT] GOOGLE CEO SUNDAR PICHAI
  Stated the company would invest an additional $5 billion in AI infrastructure through 2027

[STATEMENT] LINA KAHN
  Stated that regulators are 'closely monitoring the concentration of AI capabilities among a small number of tech giants'
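
The first claim above carries the date `2023-10-05`, which does not appear in the sample article (the article is dated February 2026 and otherwise mentions only 2026 and 2027). A cheap grounding check is to keep a claim's date only when every year it contains also occurs in the chunk the claim came from. This is a heuristic sketch with hypothetical helper names (`date_is_grounded`, `ground_date`). It checks years only, so it cannot catch a wrong month or day.

```python
import re

YEAR_RE = re.compile(r"\b(?:19|20)\d{2}\b")


def date_is_grounded(date: str, source_text: str) -> bool:
    """Return True if the date names at least one year and all of them occur in the text."""
    years = YEAR_RE.findall(date)
    return bool(years) and all(year in source_text for year in years)


def ground_date(date: str, source_text: str) -> str:
    """Keep the date if it is grounded in the source text, otherwise blank it."""
    return date if date_is_grounded(date, source_text) else ""


chunk_text = (
    "Google CEO Sundar Pichai said the company would invest an additional "
    "$5 billion in AI infrastructure through 2027."
)
print(repr(ground_date("2023-10-05", chunk_text)))  # ''
print(repr(ground_date("2027", chunk_text)))  # '2027'
```

Blanking an ungrounded date matches the prompt's convention of an empty string when no date is mentioned.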


## Summary: Extraction Results

Consolidate all extracted knowledge elements.

In [17]:
print("="*60)
print("GRAPHRAG EXTRACTION SUMMARY")
print("="*60)
print(f"\nDocument: {len(SAMPLE_DOCUMENT)} characters")
print(f"Chunks: {len(chunks)}")
print(f"\nEntities: {len(all_entities)} total, {len(unique_entities)} unique")
print(f"Relationships: {len(all_relationships)}")
print(f"Claims: {len(all_claims)}")

# Entity type breakdown
print("\n--- Entity Types ---")
type_counts: dict[str, int] = {}
for e in unique_entities:
    type_counts[e.type] = type_counts.get(e.type, 0) + 1
for t, count in sorted(type_counts.items(), key=lambda x: -x[1]):
    print(f"  {t}: {count}")

# Claim type breakdown
print("\n--- Claim Types ---")
claim_type_counts: dict[str, int] = {}
for c in all_claims:
    claim_type_counts[c.claim_type] = claim_type_counts.get(c.claim_type, 0) + 1
for t, count in sorted(claim_type_counts.items(), key=lambda x: -x[1]):
    print(f"  {t}: {count}")

GRAPHRAG EXTRACTION SUMMARY

Document: 2281 characters
Chunks: 5

Entities: 20 total, 14 unique
Relationships: 7
Claims: 5

--- Entity Types ---
  ORGANIZATION: 8
  PERSON: 5
  PRODUCT: 1

--- Claim Types ---
  STATEMENT: 3
  METRIC: 1
  EVENT: 1
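
The manual counting loops in the summary cell can also be written with `collections.Counter`, whose `most_common()` method returns counts in descending order. A minimal sketch, using a hard-coded list of claim types that mirrors the breakdown printed above:

```python
from collections import Counter

# Claim types in the order the claims were extracted (copied from the output above).
claim_types = ["METRIC", "STATEMENT", "EVENT", "STATEMENT", "STATEMENT"]

claim_type_counter = Counter(claim_types)
for claim_type, count in claim_type_counter.most_common():
    print(f"  {claim_type}: {count}")
```

Ties keep the order in which the types were first seen, so this prints STATEMENT first, followed by METRIC and EVENT.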


## Next Steps

In the next notebook we will:
1. **Build the knowledge graph** - Store entities and relationships in a graph structure
2. **Apply community detection** - Use Leiden algorithm to find topic clusters
3. **Generate community summaries** - Create hierarchical summaries for each cluster
4. **Store in SQLite** - Persist the graph for retrieval

In [18]:
# Export extracted data for next notebook
extraction_results = {
    "document_length": len(SAMPLE_DOCUMENT),
    "chunks": chunks,
    "entities": [{"name": e.name, "type": e.type, "description": e.description} for e in unique_entities],
    "relationships": [{"source": r.source, "target": r.target, "description": r.description, "strength": r.strength} for r in all_relationships],
    "claims": [{"subject": c.subject, "claim_type": c.claim_type, "description": c.description, "date": c.date} for c in all_claims]
}

with open("extraction_results.json", "w", encoding="utf-8") as f:
    json.dump(extraction_results, f, indent=2)

print("Results saved to extraction_results.json")

Results saved to extraction_results.json
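
A later notebook that consumes this file could load it and check that the expected top-level keys are present before building anything on top of it. The sketch below is one way to do that. `load_extraction_results` and `REQUIRED_KEYS` are hypothetical names, and the demo writes a tiny placeholder file to a temporary directory rather than reading the real export.

```python
import json
import tempfile
from pathlib import Path
from typing import Any

# Top-level keys written by the export cell in this notebook.
REQUIRED_KEYS = {"document_length", "chunks", "entities", "relationships", "claims"}


def load_extraction_results(path: Path) -> dict[str, Any]:
    """Load an extraction export and verify its top-level keys."""
    with open(path, encoding="utf-8") as f:
        data = json.load(f)
    missing = REQUIRED_KEYS - set(data)
    if missing:
        raise ValueError(f"{path} is missing keys: {sorted(missing)}")
    return data


with tempfile.TemporaryDirectory() as tmp:
    sample_path = Path(tmp) / "extraction_results.json"
    placeholder = {
        "document_length": 0,
        "chunks": [],
        "entities": [],
        "relationships": [],
        "claims": [],
    }
    sample_path.write_text(json.dumps(placeholder), encoding="utf-8")
    loaded = load_extraction_results(sample_path)
    print(sorted(loaded))
```

Failing early on a missing key gives a clearer error than a `KeyError` deep inside graph construction.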
