# Day 3 – Multi‑Agent Long‑Text Knowledge Extractor (Gemini‑style)

A compact multi-agent notebook that demonstrates chunking, summarization, and synthesis for long texts. This notebook mirrors the structure of your bootcamp example but implements a working 'Knowledge Extractor' pipeline.

**Agents:** ChunkerAgent, SummarizerAgent, SynthesizerAgent, ValidationAgent (optional).

You can run this locally or in Colab. If you have a Gemini / Google Cloud Vertex AI API key, set it in the configuration cell below to use the Gemini model. Otherwise the notebook uses a lightweight local summarizer fallback.

## Install dependencies



In [1]:
!pip install -q google-genai langchain-google-genai langchain

[?25l   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m0.0/63.6 kB[0m [31m?[0m eta [36m-:--:--[0m[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m63.6/63.6 kB[0m [31m2.5 MB/s[0m eta [36m0:00:00[0m
[?25h[?25l   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m0.0/475.3 kB[0m [31m?[0m eta [36m-:--:--[0m[2K   [91m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m[91m╸[0m [32m471.0/475.3 kB[0m [31m51.7 MB/s[0m eta [36m0:00:01[0m[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m475.3/475.3 kB[0m [31m13.1 MB/s[0m eta [36m0:00:00[0m
[?25h

## 2. Configure Gemini API Key

In [2]:

import os
# Example (uncomment + replace to use):
# os.environ['GOOGLE_API_KEY'] = "YOUR_GOOGLE_API_KEY"

print("GOOGLE_API_KEY set:", 'GOOGLE_API_KEY' in os.environ)


GOOGLE_API_KEY set: False


## 3. This notebook implements a three-stage multi-agent pipeline:

1. **ChunkerAgent** — Split the long text into manageable chunks.
2. **SummarizerAgent** — Summarize each chunk (calls Gemini if available, else uses local heuristic summarizer).
3. **SynthesizerAgent** — Merge chunk summaries into a single coherent knowledge digest.

There's also a **ValidationAgent** to do simple checks on the final summary.

### ChunkerAgent
Splits the text into chunks by characters or sentences, with optional overlap.

In [8]:

from typing import List
import re

def chunk_text(text: str, max_chars: int = 1500, overlap: int = 200) -> List[str]:
    """Split text into chunks of max_chars with simple sentence-aware boundaries where possible."""
    text = re.sub(r'\s+', ' ', text).strip()
    if len(text) <= max_chars:
        return [text]
    chunks = []
    start = 0
    while start < len(text):
        end = start + max_chars
        if end >= len(text):
            chunks.append(text[start:])
            break
        snippet = text[start:end]
        last_period = snippet.rfind('. ')
        if last_period != -1 and last_period > int(max_chars*0.5):
            cut = start + last_period + 1
        else:
            cut = end
        chunks.append(text[start:cut].strip())
        start = max(cut - overlap, cut)
    return chunks

# quick sanity check
if __name__ == '__main__':
    sample = "Lorem Ipsum Dolor Sit amen." * 80
    print('Chunks:', len(chunk_text(sample, max_chars=200, overlap=40)))


Chunks: 11


### SummarizerAgent
Attempts to use Gemini if configured; otherwise falls back to a lightweight extractive summarizer.

The Gemini call is shown as a template — uncomment and configure your credentials to enable it.

In [11]:

def local_extractive_summarizer(text: str, max_sentences: int = 5) -> str:
    """A tiny extractive summarizer: score sentences by term frequency and pick top ones."""
    import math, re
    from collections import Counter

    sentences = re.split(r'(?<=[.!?])\s+', text)
    if len(sentences) <= max_sentences:
        return ' '.join(sentences).strip()
    words = re.findall(r"\w+", text.lower())
    freq = Counter(words)
    scores = []
    for i, s in enumerate(sentences):
        ws = re.findall(r"\w+", s.lower())
        if not ws:
            scores.append((i, 0.0))
            continue
        score = sum(freq[w] for w in ws) / math.sqrt(len(ws))
        scores.append((i, score))
    scores.sort(key=lambda x: x[1], reverse=True)
    top_idx = sorted([i for i,_ in scores[:max_sentences]])
    summary = ' '.join([sentences[i] for i in top_idx])
    return summary.strip()

def summarize_chunk(chunk: str, use_gemini: bool = False, gemini_client=None, max_sentences: int = 5) -> str:
    if use_gemini and gemini_client is not None:
        # Example pseudo-call for Gemini (uncomment and adapt to real client)
        # prompt = f"""Summarize the following text concisely, preserving key ideas and important details.\n\n{chunk}\n\nCONCISE SUMMARY:"""
        # response = gemini_client.generate_text(prompt)  # pseudocode
        # return response.text
        raise NotImplementedError("Gemini client integration is left as an exercise; set use_gemini=False to use fallback.")
    else:
        return local_extractive_summarizer(chunk, max_sentences=max_sentences)

# demo
if __name__ == '__main__':
    t = ("""As the large language models powering generative AI tools slurp up ever more data across the web,
    Cloudflare cofounder and CEO Matthew Prince said at WIRED’s Big Interview event in San Francisco on Thursday that the internet infrastructure company has blocked more than 400 billion AI bot requests for its customers since July 1.

The action comes after the company announced a Content Independence Day in July—an initiative with prominent publishers
and AI firms to block AI crawlers by default on content creators’ work unless the AI companies pay for access.
Since July 2024, Cloudflare has offered customers tools to block AI bots from scraping their content.
Cloudflare told WIRED that the number of AI bots blocked since July 1, 2025 is 416 billion.

""")
    print(local_extractive_summarizer(t, max_sentences=3))


As the large language models powering generative AI tools slurp up ever more data across the web, Cloudflare cofounder and CEO Matthew Prince said at WIRED’s Big Interview event in San Francisco on Thursday that the internet infrastructure company has blocked more than 400 billion AI bot requests for its customers since July 1. The action comes after the company announced a Content Independence Day in July—an initiative with prominent publishers and AI firms to block AI crawlers by default on content creators’ work unless the AI companies pay for access. Cloudflare told WIRED that the number of AI bots blocked since July 1, 2025 is 416 billion.


In [12]:

from typing import List
def summarize_all(chunks: List[str], use_gemini: bool = False, gemini_client=None, max_sentences: int = 5) -> List[str]:
    summaries = []
    for i, c in enumerate(chunks):
        s = summarize_chunk(c, use_gemini=use_gemini, gemini_client=gemini_client, max_sentences=max_sentences)
        summaries.append(s)
        print(f"Summarized chunk {i+1}/{len(chunks)} — {len(c)} chars -> {len(s)} chars")
    return summaries


### SynthesizerAgent
Takes the chunk summaries and merges them into a single coherent summary. Optionally asks Gemini to polish the final summary.

In [15]:

def synthesize_summaries(summaries: List[str], use_gemini: bool = False, gemini_client=None) -> str:
    joined = "\n\n".join(summaries)
    if use_gemini and gemini_client is not None:
        # pseudo code for Gemini polishing
        prompt = f"""You are an AI assistant. Combine the following chunk summaries into one coherent, structured summary. Use bullets or short paragraphs and highlight key insights.\n\n{joined}\n\nFINAL SUMMARY:"""
        response = gemini_client.generate_text(prompt)
        return response.text
        # raise NotImplementedError("Gemini polishing not implemented in this template.")
    import re
    sentences = re.split(r'(?<=[.!?])\s+', joined)
    seen = set()
    out = []
    for s in sentences:
        key = s.strip().lower()
        if not key or key in seen:
            continue
        seen.add(key)
        out.append(s.strip())
    final = ' '.join(out[:10])
    return final.strip()


### ValidationAgent
Simple checks to ensure the summary length and that it contains keywords from the original text.

In [16]:

from typing import Tuple
def validate_summary(original: str, summary: str, min_coverage_ratio: float = 0.05) -> Tuple[bool, dict]:
    import re
    orig_words = set(re.findall(r"\w+", original.lower()))
    summ_words = set(re.findall(r"\w+", summary.lower()))
    if not orig_words:
        return False, {'reason': 'original empty'}
    coverage = len(summ_words & orig_words) / len(orig_words)
    ok = coverage >= min_coverage_ratio
    return ok, {'coverage': coverage, 'summary_len': len(summary)}


## 4. Orchestrator / Run the pipeline
This cell demonstrates running all agents together on a sample long text. Replace `sample_text` with your document or load a `.txt` file.

In [17]:

# Example orchestration
def run_pipeline(text: str, max_chars: int = 1500, overlap: int = 200, max_sentences: int = 5, use_gemini: bool = False, gemini_client=None):
    print('Chunking...')
    chunks = chunk_text(text, max_chars=max_chars, overlap=overlap)
    print(f'Created {len(chunks)} chunks.')
    summaries = summarize_all(chunks, use_gemini=use_gemini, gemini_client=gemini_client, max_sentences=max_sentences)
    print('\nSynthesizing final summary...')
    final = synthesize_summaries(summaries, use_gemini=use_gemini, gemini_client=gemini_client)
    ok, info = validate_summary(text, final)
    print('\nValidation:', ok, info)
    return {'chunks': chunks, 'summaries': summaries, 'final_summary': final}

# Load sample text (or replace with your own)
sample_text = """Deep learning is a subset of machine learning concerned with algorithms inspired by the structure and function of the brain called artificial neural networks. The field has grown rapidly in the last decade, driven by cheaper compute and larger datasets. This notebook demonstrates how to split long documents into chunks, summarize each piece, and then recombine the results into a coherent summary. The approach is robust for long blog posts, technical reports, and meeting transcripts. It also serves as a good demo for agentic systems where small specialized 'agents' perform isolated tasks and pass results to the next agent.""" * 8

result = run_pipeline(sample_text, max_chars=800, overlap=120, max_sentences=3)
print('\nFINAL SUMMARY:\n', result['final_summary'])


Chunking...
Created 7 chunks.
Summarized chunk 1/7 — 787 chars -> 553 chars
Summarized chunk 2/7 — 723 chars -> 489 chars
Summarized chunk 3/7 — 774 chars -> 591 chars
Summarized chunk 4/7 — 716 chars -> 475 chars
Summarized chunk 5/7 — 628 chars -> 540 chars
Summarized chunk 6/7 — 628 chars -> 540 chars
Summarized chunk 7/7 — 771 chars -> 587 chars

Synthesizing final summary...

Validation: True {'coverage': 1.0, 'summary_len': 929}

FINAL SUMMARY:
 Deep learning is a subset of machine learning concerned with algorithms inspired by the structure and function of the brain called artificial neural networks. The field has grown rapidly in the last decade, driven by cheaper compute and larger datasets. It also serves as a good demo for agentic systems where small specialized 'agents' perform isolated tasks and pass results to the next agent.Deep learning is a subset of machine learning concerned with algorithms inspired by the structure and function of the brain called artificial neural 