# Milestone 1: Data Preparation

This notebook demonstrates the data preparation pipeline for the GDPR RAG system:
1. Download GDPR PDF document
2. Parse and extract text using LangChain loaders
3. Implement chunking strategies (paragraph, article, token-based)
4. Generate embeddings using OpenAI
5. Build FAISS vector store
6. Persist index to disk

## Setup

Make sure you have set your `OPENAI_API_KEY` environment variable or in a `.env` file.

In [None]:
# Import required modules
import sys
sys.path.append('..')

from src import data_prep
from dotenv import load_dotenv
import os

# Load environment variables
load_dotenv()

print("Data preparation module loaded successfully!")
print(f"OpenAI API Key present: {bool(os.getenv('OPENAI_API_KEY'))}")

## Step 1: Download GDPR PDF

Download the official GDPR regulation PDF from the EU website.

In [None]:
# Download GDPR PDF
pdf_path = data_prep.download_gdpr_pdf("../data/gdpr.pdf")
print(f"PDF path: {pdf_path}")

# Note: In dry-run mode (no network), this returns a placeholder path
# TODO: Implement actual download for production use

## Step 2: Load and Parse PDF

Use LangChain's document loaders to extract text from the PDF.

In [None]:
# Load and split the PDF
# Try different chunking strategies

# Strategy 1: Paragraph-based chunking
chunks_paragraph = data_prep.load_and_split(
    pdf_path, 
    strategy="paragraph"
)
print(f"\nParagraph chunks: {len(chunks_paragraph)}")

# Strategy 2: Article-based chunking
chunks_article = data_prep.load_and_split(
    pdf_path,
    strategy="article"
)
print(f"Article chunks: {len(chunks_article)}")

# Strategy 3: Token-based chunking
chunks_token = data_prep.load_and_split(
    pdf_path,
    strategy="token",
    chunk_size=500,
    chunk_overlap=100
)
print(f"Token chunks: {len(chunks_token)}")

## Step 3: Examine Chunks

Let's examine the structure and content of our chunks.

In [None]:
# Display first few chunks
print("Sample chunks:")
for i, chunk in enumerate(chunks_paragraph[:3], 1):
    print(f"\nChunk {i}:")
    print(f"Content: {chunk['content'][:100]}...")
    print(f"Metadata: {chunk['metadata']}")

## Step 4: Build FAISS Vector Store

Generate embeddings and build the FAISS index for efficient retrieval.

In [None]:
# Build and persist FAISS index
faiss_path = "../faiss_index"

# Use the paragraph chunks for our index
success = data_prep.build_and_persist_faiss(
    chunks_paragraph,
    faiss_path=faiss_path
)

if success:
    print(f"\n✅ FAISS index built and saved to {faiss_path}")
else:
    print("\n❌ Failed to build FAISS index")

print("\nNote: Full functionality requires OPENAI_API_KEY")
print("Without API key, the system runs in dry-run mode with placeholders")

## Step 5: Test Retrieval (Optional)

If FAISS index was built successfully, test retrieval.

In [None]:
# Get example chunks for testing without API key
example_chunks = data_prep.get_example_chunks()

print(f"Example chunks loaded: {len(example_chunks)}")
for i, chunk in enumerate(example_chunks, 1):
    print(f"\n{i}. {chunk['content'][:80]}...")

## Summary

In this notebook, we:
- ✅ Downloaded/prepared GDPR PDF data
- ✅ Implemented multiple chunking strategies
- ✅ Generated embeddings (with API key) or placeholders
- ✅ Built and persisted FAISS vector store
- ✅ Verified the data preparation pipeline

Next: Proceed to `02_rag_baseline.ipynb` to build the baseline RAG system.