# Notebook 1: Data Preparation

This notebook demonstrates:
1. Downloading the GDPR PDF
2. Parsing with LangChain loaders
3. Implementing different chunking strategies
4. Generating embeddings with OpenAI
5. Building and persisting a FAISS index

## Setup

Make sure to set your `OPENAI_API_KEY` environment variable:
```bash
export OPENAI_API_KEY='your-key-here'
```

Or create a `.env` file with:
```
OPENAI_API_KEY=your-key-here
```

In [None]:
# Import required modules
import sys
sys.path.append('..')

from src.data_prep import (
    download_gdpr_pdf,
    load_and_split,
    build_and_persist_faiss,
    get_chunk_statistics
)
from dotenv import load_dotenv
import os

# Load environment variables
load_dotenv()

print("✓ Imports successful")
print(f"OpenAI API Key set: {bool(os.getenv('OPENAI_API_KEY'))}")
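
The check above only reports whether the variable is truthy, so a key that is whitespace-only would still count as set. A minimal sketch of a stricter check follows; `api_key_status` is a hypothetical helper and is not part of `src.data_prep`.

```python
import os


def api_key_status(var_name: str = "OPENAI_API_KEY") -> str:
    """Describe whether an environment variable holds a non-blank value.

    Hypothetical helper for illustration; it never prints the key itself.
    """
    value = os.getenv(var_name, "").strip()
    if not value:
        return f"{var_name} is not set"
    return f"{var_name} is set ({len(value)} chars)"


print(api_key_status())
```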

## Step 1: Download GDPR PDF

Download the GDPR PDF from the official EU source.

In [None]:
# Download GDPR PDF
pdf_path = download_gdpr_pdf("../data/gdpr.pdf")
print(f"PDF saved to: {pdf_path}")

## Step 2: Load and Split Document

Chunking strategies to compare (the cells below run the paragraph-based and token-based ones):
- Paragraph-based
- Article-based
- Token-based

In [None]:
# Try paragraph-based chunking
print("\n=== Paragraph-based Chunking ===")
docs_paragraph = load_and_split(pdf_path, strategy="paragraph")
stats_paragraph = get_chunk_statistics(docs_paragraph)
print(f"Chunks: {stats_paragraph['count']}")
print(f"Avg length: {stats_paragraph['avg_length']:.0f} chars")
print(f"\nFirst chunk preview:\n{docs_paragraph[0]['content'][:200]}...")
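
To make the idea of paragraph-based chunking concrete, here is a minimal sketch that splits text on blank lines. It is an illustration under stated assumptions, not the implementation behind `load_and_split`: `split_into_paragraphs` and the sample text are hypothetical, and the real function may clean or merge paragraphs differently.

```python
import re


def split_into_paragraphs(text: str, min_chars: int = 1) -> list[str]:
    """Split text on blank lines and drop fragments shorter than min_chars."""
    # One or more blank lines (possibly containing whitespace) end a paragraph.
    raw_parts = re.split(r"\n\s*\n", text)
    # Collapse the hard line breaks that PDF extraction tends to leave behind.
    cleaned = [" ".join(part.split()) for part in raw_parts]
    return [p for p in cleaned if len(p) >= min_chars]


# Made-up sample text, not a quotation from the regulation.
sample = "First paragraph, line one\nline two\n\nSecond paragraph.\n\n\nThird paragraph."
paragraphs = split_into_paragraphs(sample)
for i, paragraph in enumerate(paragraphs):
    print(i, paragraph)
```

Raising `min_chars` is one simple way to drop stray headers or page numbers that end up as their own tiny paragraphs.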

In [None]:
# Try token-based chunking
print("\n=== Token-based Chunking ===")
docs_token = load_and_split(pdf_path, strategy="token", chunk_size=500, chunk_overlap=100)
stats_token = get_chunk_statistics(docs_token)
print(f"Chunks: {stats_token['count']}")
print(f"Avg length: {stats_token['avg_length']:.0f} chars")
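
The `chunk_size` and `chunk_overlap` arguments describe a sliding window: each chunk starts `chunk_size - chunk_overlap` units after the previous one, so neighbouring chunks share `chunk_overlap` units. The sketch below shows that mechanic on a toy token list. `sliding_window_chunks` is a hypothetical helper, and whether `load_and_split` counts model tokens, words, or characters is not something this notebook shows.

```python
def sliding_window_chunks(
    tokens: list[str], chunk_size: int, chunk_overlap: int
) -> list[list[str]]:
    """Cut a token list into windows of chunk_size that overlap by chunk_overlap."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # distance between window starts
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        # Stop once a window has reached the end, to avoid a redundant tail chunk.
        if start + chunk_size >= len(tokens):
            break
    return chunks


# Toy input: 12 placeholder tokens, windows of 5 that overlap by 2.
toy_tokens = [f"t{i}" for i in range(12)]
toy_chunks = sliding_window_chunks(toy_tokens, chunk_size=5, chunk_overlap=2)
for chunk in toy_chunks:
    print(chunk)
```

With the values used in the cell above (`chunk_size=500`, `chunk_overlap=100`), the same logic would advance the window by 400 units per chunk.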

## Step 3: Analyze Chunk Statistics

Compare the paragraph-based and token-based results side by side.

In [None]:
# Compare strategies
import pandas as pd

comparison = pd.DataFrame([
    {"strategy": "paragraph", **stats_paragraph},
    {"strategy": "token", **stats_token}
])

print("\n=== Chunking Strategy Comparison ===")
print(comparison)
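
For reference, the kind of summary being compared can be sketched in a few lines. This is a guess at the general shape, not the actual `get_chunk_statistics`: the `count` and `avg_length` keys and the `content` field match what the cells above read, while `summarize_chunks`, `min_length`, and `max_length` are assumptions added for illustration.

```python
from statistics import mean


def summarize_chunks(chunks: list[dict]) -> dict:
    """Return simple length statistics (in characters) for a list of chunk dicts."""
    lengths = [len(chunk["content"]) for chunk in chunks]
    if not lengths:
        # Avoid calling mean/min/max on an empty list.
        return {"count": 0, "avg_length": 0.0, "min_length": 0, "max_length": 0}
    return {
        "count": len(lengths),
        "avg_length": mean(lengths),
        "min_length": min(lengths),
        "max_length": max(lengths),
    }


# Made-up chunks with the same dict shape the notebook indexes into.
toy_stats = summarize_chunks([{"content": "abcd"}, {"content": "ab"}])
print(toy_stats)
```

Looking at minimum and maximum lengths alongside the average helps spot strategies that produce a few very large or very small chunks.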

## Step 4: Build FAISS Index

Generate embeddings and build a FAISS vector store.

**Note:** Generating real embeddings requires a valid OpenAI API key. Without one, a mock index is created instead.

In [None]:
# Build FAISS index with paragraph chunks
faiss_path = build_and_persist_faiss(
    docs_paragraph,
    "../faiss_index",
    openai_api_key=os.getenv("OPENAI_API_KEY")
)

print(f"\nFAISS index saved to: {faiss_path}")
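
Conceptually, a vector index answers one question: which stored embeddings are closest to a query embedding? The sketch below does this by brute force in plain Python on toy 3-dimensional vectors. It is not FAISS and not what `build_and_persist_faiss` does internally; `l2_distance`, `nearest_chunks`, and the vectors are hypothetical, and real embeddings have far more dimensions.

```python
import math


def l2_distance(a: list[float], b: list[float]) -> float:
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest_chunks(
    query: list[float], vectors: list[list[float]], k: int = 2
) -> list[tuple[int, float]]:
    """Return (index, distance) for the k stored vectors closest to the query."""
    scored = sorted((l2_distance(query, v), i) for i, v in enumerate(vectors))
    return [(i, d) for d, i in scored[:k]]


# Toy "embeddings": vectors 0 and 2 point in nearly the same direction.
toy_vectors = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.9, 0.1, 0.0],
]
toy_query = [1.0, 0.0, 0.0]
toy_results = nearest_chunks(toy_query, toy_vectors, k=2)
for index, distance in toy_results:
    print(f"chunk {index}: distance {distance:.3f}")
```

A dedicated index library exists because this exhaustive scan becomes slow as the number of chunks and dimensions grows.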

## Summary

In this notebook, we:
- ✓ Downloaded the GDPR PDF
- ✓ Tested different chunking strategies
- ✓ Analyzed chunk statistics
- ✓ Built and persisted a FAISS index

Next: Move to `02_rag_baseline.ipynb` to implement the baseline RAG pipeline.