# Notebook 01: Ingest & Clean - "The Garbage-In Fix"

## Innocenti Risk Management Enablement Kit

---

### Prerequisites

Before running this notebook, you'll need:

1. **Jina API Key** (free tier available)
   - Sign up at [jina.ai](https://jina.ai/api-dashboard/)
   - Create an API key from the dashboard
   - You'll be prompted to enter it when running the notebook

---

### About Jina ReaderLM

[ReaderLM](https://jina.ai/reader/) is a vision-language model purpose-built for document reading:
- **Visual understanding** - Processes documents as images, not raw text extraction
- **Layout-aware** - Handles tables, columns, headers, footers intelligently
- **Clean output** - Returns structured markdown, not messy OCR text
- **No setup** - Simple API call, no model hosting required

---

### The Problem

Legal documents like the **EU AI Act** are notoriously hard to search:

1. **PDFs are messy** - Headers, footers, page numbers, and weird formatting
2. **OCR is expensive** - Traditional extraction requires heavy compute
3. **Context gets lost** - Naive chunking breaks legal clauses mid-sentence

### The Solution: Jina Reader (ReaderLM)

Jina Reader is a specialized model that "sees" document layout and extracts clean, structured text without traditional OCR.

**What we'll do:**
1. Fetch the EU AI Act PDF via Jina Reader API
2. Parse the markdown output
3. Intelligently chunk by **Article** (preserving legal context)
4. Save structured JSON for indexing

---

## 1. Setup & Dependencies

In [None]:
# Install dependencies
!pip install -q requests python-dotenv

# Check if running in Google Colab
import os
IN_COLAB = 'COLAB_GPU' in os.environ or 'COLAB_RELEASE_TAG' in os.environ

if IN_COLAB:
    print("📍 Running in Google Colab")
    # Always start from /content to avoid "directory not found" issues
    os.chdir('/content')
    # Clone or update the repo
    if not os.path.exists('/content/Tom-Innocenti-Risk-Management'):
        !git clone https://github.com/jeffvestal/Tom-Innocenti-Risk-Management.git
    else:
        # Pull latest changes
        !cd /content/Tom-Innocenti-Risk-Management && git pull
    # Change to notebooks directory
    os.chdir('/content/Tom-Innocenti-Risk-Management/notebooks')
    print(f"   Working directory: {os.getcwd()}")
else:
    print("📍 Running locally")

In [None]:
import requests
import re
import json
from pathlib import Path

# Import our credential helper
# (Path is already set correctly in previous cell for both Colab and local)
from utils.credentials import setup_notebook, get_credentials

print("✓ Libraries loaded successfully!")

In [None]:
# Setup credentials (will prompt on first run)
# For this notebook, we only need the Jina API key
creds = get_credentials(require_elastic=False, require_jina=True)

## 2. Fetch PDF via Jina Reader

The Jina Reader API converts any URL to clean markdown. For PDFs, it uses ReaderLM to "see" the layout.

**Key headers:**
- `x-respond-with: markdown` - Get markdown output (vs. plain text)
- `Authorization: Bearer <key>` - Your Jina API key

In [None]:
# EU AI Act PDF URL (official EUR-Lex source)
PDF_URL = "https://eur-lex.europa.eu/legal-content/EN/TXT/PDF/?uri=CELEX:32024R1689"

# Jina Reader endpoint
READER_URL = f"https://r.jina.ai/{PDF_URL}"

print(f"Source PDF: {PDF_URL}")
print(f"Reader URL: {READER_URL[:60]}...")

In [None]:
import time

def fetch_with_jina_reader(url: str, api_key: str, max_retries: int = 3) -> str:
    """
    Fetch a URL via Jina Reader and return clean markdown.
    
    Includes retry logic for transient failures (empty responses).
    
    Args:
        url: The Jina Reader URL (https://r.jina.ai/<target_url>)
        api_key: Your Jina API key
        max_retries: Number of retry attempts if response is empty
    
    Returns:
        Clean markdown text
    """
    headers = {
        "Authorization": f"Bearer {api_key}",
        "x-respond-with": "markdown",
        "Accept": "text/plain"
    }
    
    print("Fetching PDF via Jina Reader...")
    print("(This may take 30-60 seconds for a large document)")
    
    for attempt in range(max_retries):
        response = requests.get(url, headers=headers, timeout=120)
        response.raise_for_status()
        
        content = response.text.strip()
        
        # Check if we got actual content (not empty or just whitespace)
        if len(content) > 100:
            print(f"✓ Received {len(response.text):,} characters")
            return response.text
        
        # Empty response - retry
        if attempt < max_retries - 1:
            wait_time = (attempt + 1) * 5  # 5s, 10s, 15s
            print(f"⚠ Empty response (attempt {attempt + 1}/{max_retries}). Retrying in {wait_time}s...")
            time.sleep(wait_time)
        else:
            print(f"✗ Failed after {max_retries} attempts - received empty content")
            raise ValueError(
                "Jina Reader returned empty content after multiple retries. "
                "This can happen due to rate limiting. Wait a minute and try again."
            )
    
    return response.text  # Shouldn't reach here, but just in case
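
The retry loop above only retries on empty responses; an HTTP error raised by `raise_for_status()` ends the run immediately. As a rough sketch of one alternative, the helper below retries on both empty text and exceptions, with exponentially growing waits. The name `fetch_with_backoff` and its parameters (`base_delay`, `min_chars`, `sleep`) are hypothetical and not part of this notebook or of any Jina library; catching every `Exception` is a simplification you would likely narrow in practice.

```python
import time
from typing import Callable


def fetch_with_backoff(
    fetch: Callable[[], str],
    max_retries: int = 3,
    base_delay: float = 5.0,
    min_chars: int = 100,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Call `fetch` until it returns non-trivial text or the retries run out.

    `fetch` is any zero-argument callable that returns the response text
    (hypothetical interface, chosen so the helper can be tested offline).
    """
    last_error = None
    for attempt in range(max_retries):
        try:
            text = fetch()
            if len(text.strip()) > min_chars:
                return text
            last_error = ValueError("empty or near-empty response")
        except Exception as exc:  # assumption: treat any failure as transient
            last_error = exc
        if attempt < max_retries - 1:
            # Delay doubles each time: base_delay, 2x, 4x, ...
            sleep(base_delay * 2 ** attempt)
    raise RuntimeError(f"gave up after {max_retries} attempts") from last_error
```

Passing `sleep` as a parameter keeps the helper testable without real waiting; in the notebook it could wrap a function that performs the `requests.get` call and returns `response.text`.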

In [None]:
# Fetch the document
raw_markdown = fetch_with_jina_reader(READER_URL, creds["JINA_API_KEY"])

# Preview the first 1000 characters
print("\n--- Preview (first 1000 chars) ---")
print(raw_markdown[:1000])

## 3. Parse & Chunk by Article

Legal documents have structure. The EU AI Act is organized into **Articles**. 

**Chunking Strategy:**
- Split on `Article \d+` pattern
- Capture the article number and title
- Keep entire article text together (no mid-sentence breaks)

This preserves legal context that would be lost with naive character-based chunking.
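
To make the strategy concrete, here is a minimal sketch that runs the same two regular expressions on a toy string. The sample text is invented for illustration and only imitates the shape of the Reader output; it is not text from the Act.

```python
import re

# Toy input (invented): a preamble followed by two article headings,
# one plain and one written as a markdown heading.
sample = (
    "Preamble text\n"
    "\n"
    "Article 1\n"
    "Subject matter\n"
    "\n"
    "1. First paragraph.\n"
    "\n"
    "## Article 2\n"
    "Scope\n"
    "\n"
    "1. Second paragraph.\n"
)

# Zero-width lookahead: split *before* each heading so the heading
# stays attached to the body that follows it.
boundary = re.compile(r'(?=^(?:#+ )?Article\s+\d+)', flags=re.MULTILINE)
# Captures the article number and, if present, the title on the next line.
header = re.compile(r'^(?:#+ )?Article\s+(\d+)\s*\n+([^\n]+)?')

chunks = [c for c in boundary.split(sample) if c.strip()]

parsed = []
for chunk in chunks:
    m = header.match(chunk)
    if m:  # the preamble chunk has no heading, so it is skipped
        parsed.append((m.group(1), (m.group(2) or "").strip()))

print(parsed)  # [('1', 'Subject matter'), ('2', 'Scope')]
```

Note that the boundary pattern fires on any line that begins with `Article <number>`, so a body line that happens to start that way would also open a new chunk.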

In [None]:
def parse_articles(markdown_text: str) -> list[dict]:
    """
    Parse EU AI Act markdown into structured article chunks.
    
    Args:
        markdown_text: Raw markdown from Jina Reader
    
    Returns:
        List of article dictionaries with id, article_number, title, text, url
    """
    articles = []
    
    # Pattern to match article headers
    # Matches: "Article 1", "Article 5", "## Article 10", etc.
    article_pattern = r'^(?:#+ )?Article\s+(\d+)\s*\n+([^\n]+)?'
    
    # Split the document by article boundaries
    splits = re.split(r'(?=^(?:#+ )?Article\s+\d+)', markdown_text, flags=re.MULTILINE)
    
    for chunk in splits:
        if not chunk.strip():
            continue
            
        # Try to extract article number and title
        match = re.match(article_pattern, chunk, re.MULTILINE)
        if match:
            article_num = match.group(1)
            # Title is the line after "Article X"; fall back to "Article X" if it is missing or blank
            title = (match.group(2) or "").strip() or f"Article {article_num}"
            
            # Get the body text (everything after the header)
            body_start = match.end()
            body = chunk[body_start:].strip()
            
            # Clean up the body text
            body = re.sub(r'\n{3,}', '\n\n', body)  # Collapse multiple newlines
            body = body.strip()
            
            if body:  # Only add if there's actual content
                articles.append({
                    "id": f"en_art_{article_num}",
                    "article_number": article_num,
                    "title": title,
                    "text": body,
                    "language": "en",
                    "url": f"https://eur-lex.europa.eu/legal-content/EN/TXT/?uri=CELEX:32024R1689#Art{article_num}"
                })
    
    return articles

In [None]:
# Parse the document into articles
articles = parse_articles(raw_markdown)

print(f"✓ Extracted {len(articles)} articles")
print("\n--- Article Numbers Found ---")
print([a['article_number'] for a in articles[:20]], "..." if len(articles) > 20 else "")

In [None]:
# Preview a sample article (Article 5 - Prohibited Practices is a key one)
sample_article = next((a for a in articles if a['article_number'] == '5'), articles[0])

print(f"--- Sample: Article {sample_article['article_number']} ---")
print(f"Title: {sample_article['title']}")
print(f"ID: {sample_article['id']}")
print(f"\nText (first 500 chars):")
print(sample_article['text'][:500])

## 4. Save Structured JSON

We'll save the parsed articles as JSON for use in Notebook 02 (Indexing).

**Output Schema:**
```json
{
  "id": "en_art_5",
  "article_number": "5",
  "title": "Prohibited artificial intelligence practices",
  "text": "The following AI practices shall be prohibited...",
  "language": "en",
  "url": "https://eur-lex.europa.eu/..."
}
```
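
Before writing the file, it may be worth checking that every record actually follows this schema. The sketch below is one way to do that; `REQUIRED_FIELDS` and `validate_records` are hypothetical names introduced here, not helpers from this repo. It also flags duplicate `id` values, which could plausibly appear if the parser splits on a body line that starts with `Article <number>`.

```python
REQUIRED_FIELDS = ("id", "article_number", "title", "text", "language", "url")


def validate_records(records: list[dict]) -> list[str]:
    """Return human-readable problems; an empty list means nothing was flagged."""
    problems = []
    seen_ids = set()
    for i, rec in enumerate(records):
        # A field counts as missing if it is absent or empty.
        missing = [f for f in REQUIRED_FIELDS if not rec.get(f)]
        if missing:
            problems.append(f"record {i}: missing or empty {missing}")
        rec_id = rec.get("id")
        if rec_id in seen_ids:
            problems.append(f"record {i}: duplicate id {rec_id!r}")
        seen_ids.add(rec_id)
    return problems
```

In the notebook this could be called as `validate_records(articles)` just before the save cell, printing any problems it returns.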

In [None]:
# Create output directory if it doesn't exist
# Works for both local (notebooks/../data) and Colab (/content/.../data)
output_dir = Path.cwd().parent / "data"
output_dir.mkdir(exist_ok=True)

output_file = output_dir / "eu_ai_act_clean.json"

# Save to JSON
with open(output_file, 'w', encoding='utf-8') as f:
    json.dump(articles, f, indent=2, ensure_ascii=False)

print(f"✓ Saved {len(articles)} articles to {output_file}")
print(f"  File size: {output_file.stat().st_size / 1024:.1f} KB")

# In Colab, also save to /content for easy access
if 'IN_COLAB' in dir() and IN_COLAB:
    colab_output = Path('/content/eu_ai_act_clean.json')
    with open(colab_output, 'w', encoding='utf-8') as f:
        json.dump(articles, f, indent=2, ensure_ascii=False)
    print(f"✓ Also saved to {colab_output} (for easy Colab access)")

## 5. Verification & Stats

Let's verify the output and gather some statistics about our dataset.
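
One simple check is whether the article numbering has gaps, which would suggest that a heading was missed during parsing. The helper below is a minimal sketch; `find_missing_numbers` is a hypothetical name, and `expected_last` is an assumption you supply yourself (the published Act is generally cited as having 113 articles, but confirm that against the source text).

```python
def find_missing_numbers(article_numbers: list[str], expected_last: int) -> list[int]:
    """Return the article numbers in 1..expected_last that were not parsed."""
    found = {int(n) for n in article_numbers}
    return [n for n in range(1, expected_last + 1) if n not in found]


# Toy usage with invented numbers: 3 and 5 are absent.
print(find_missing_numbers(["1", "2", "4"], expected_last=5))  # [3, 5]
```

With the notebook's data, the call would look like `find_missing_numbers([a["article_number"] for a in articles], expected_last=113)`.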

In [None]:
# Calculate statistics
total_chars = sum(len(a['text']) for a in articles)
avg_chars = total_chars / len(articles) if articles else 0
min_chars = min(len(a['text']) for a in articles) if articles else 0
max_chars = max(len(a['text']) for a in articles) if articles else 0

print("=" * 50)
print("  EU AI Act Dataset Summary")
print("=" * 50)
print(f"  Total Articles:     {len(articles)}")
print(f"  Total Characters:   {total_chars:,}")
print(f"  Avg per Article:    {avg_chars:,.0f} chars")
print(f"  Smallest Article:   {min_chars:,} chars")
print(f"  Largest Article:    {max_chars:,} chars")
print("=" * 50)

In [None]:
# Show the five longest articles (length is a rough proxy for detail, not necessarily importance)
sorted_by_length = sorted(articles, key=lambda x: len(x['text']), reverse=True)

print("\n--- Top 5 Longest Articles ---")
for i, article in enumerate(sorted_by_length[:5], 1):
    print(f"{i}. Article {article['article_number']}: {article['title'][:50]}... ({len(article['text']):,} chars)")

---

## Next Steps

You've successfully:
1. ✅ Fetched the EU AI Act PDF via Jina Reader
2. ✅ Parsed it into structured article chunks
3. ✅ Saved clean JSON for indexing

**Continue to Notebook 02** to index this data in Elasticsearch with `semantic_text` and Jina Embeddings v3.

---

### Key Takeaways

| Concept | What We Learned |
|---------|----------------|
| **ReaderLM** | Jina Reader "sees" PDF layout without OCR |
| **Smart Chunking** | Split by semantic boundaries (Articles), not character count |
| **Metadata Preservation** | Keep article numbers, titles, URLs for filtering & display |