# Notebook 1: Data Preparation

This notebook demonstrates:
1. Downloading the GDPR PDF from official EU sources
2. Parsing the PDF with LangChain loaders
3. Implementing chunking strategies (paragraph, article, token-based)
4. Generating embeddings with OpenAI
5. Building and persisting a FAISS index

**Note**: Set your `OPENAI_API_KEY` environment variable before running. Without it, the code runs in dry-run mode with placeholder outputs.

In [None]:
# Import required modules
import os
from dotenv import load_dotenv
from src.data_prep import (
    build_and_persist_faiss,
    download_gdpr_pdf,
    get_chunking_stats,
    load_and_split,
)

# Load environment variables
load_dotenv()

print("✓ Imports successful")
print(f"OpenAI API Key configured: {'Yes' if os.getenv('OPENAI_API_KEY') else 'No (dry-run mode)'}")

## Step 1: Download GDPR PDF

Download the official GDPR PDF from EU sources and save it locally.
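The body of `download_gdpr_pdf` lives in `src.data_prep` and is not shown in this notebook. As a rough illustration of the idea, here is a minimal sketch of a download helper that skips the request when the file is already on disk. The name `fetch_if_missing` and its `url` parameter are assumptions made for this example, not part of the project's API, and no specific EU URL is assumed.

```python
import shutil
from pathlib import Path
from urllib.request import urlopen


def fetch_if_missing(url: str, dest: str) -> str:
    """Download `url` to `dest` unless a non-empty file already exists there.

    Hypothetical helper for illustration; the real download_gdpr_pdf may differ.
    """
    path = Path(dest)
    if path.exists() and path.stat().st_size > 0:
        # Reuse the cached copy instead of downloading again.
        return str(path)

    # Create the target directory (e.g. "data/") if it does not exist yet.
    path.parent.mkdir(parents=True, exist_ok=True)

    # Stream the response to disk so the whole PDF is never held in memory.
    with urlopen(url) as response, open(path, "wb") as out:
        shutil.copyfileobj(response, out)
    return str(path)
```

Checking for an existing file first keeps repeated notebook runs from requesting the same document again.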

In [None]:
# Download GDPR PDF
pdf_path = "data/gdpr.pdf"
result_path = download_gdpr_pdf(pdf_path)

print(f"\n✓ PDF saved to: {result_path}")
print(f"File exists: {os.path.exists(result_path)}")

## Step 2: Load and Chunk Documents

Load the PDF and split it using different strategies:
- **Paragraph**: Natural text breaks
- **Article**: GDPR article boundaries
- **Token**: Fixed-size chunks with overlap
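To make the three strategies concrete, the sketch below applies each one to a plain string. It is an illustration under stated assumptions, not the implementation behind `load_and_split`: the function names are hypothetical, article boundaries are assumed to appear as lines starting with `Article <number>`, and whitespace-separated words stand in for model tokens.

```python
import re


def split_paragraphs(text: str) -> list[str]:
    """Split on blank lines, dropping empty pieces."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]


def split_articles(text: str) -> list[str]:
    """Split before each line that starts with 'Article <number>' (assumed layout)."""
    parts = re.split(r"(?m)^(?=Article\s+\d+\b)", text)
    return [p.strip() for p in parts if p.strip()]


def split_tokens(text: str, size: int = 200, overlap: int = 20) -> list[str]:
    """Fixed-size windows over whitespace tokens; neighbours share `overlap` tokens."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be >= 0 and smaller than size")
    tokens = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + size]))
        if start + size >= len(tokens):
            break  # the last window already reaches the end of the text
    return chunks


# Illustrative sample text, not an exact excerpt of the regulation.
sample = (
    "Article 1\nSubject-matter\n\nThis Regulation lays down rules.\n\n"
    "Article 2\nMaterial scope\n\nThis Regulation applies."
)
paragraph_chunks = split_paragraphs(sample)
article_chunks = split_articles(sample)
token_chunks = split_tokens("a b c d e f g h i j", size=4, overlap=1)
```

The overlap in the token strategy repeats a few tokens at each boundary, so a sentence cut by one window still appears intact in a neighbouring chunk.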

In [None]:
# Test different chunking strategies
strategies = ["paragraph", "article", "token"]

for strategy in strategies:
    print(f"\n{'='*60}")
    print(f"Strategy: {strategy.upper()}")
    print(f"{'='*60}")
    
    docs = load_and_split(pdf_path, strategy=strategy)
    stats = get_chunking_stats(docs)
    
    print(f"Number of chunks: {stats['num_chunks']}")
    print(f"Average length: {stats['avg_length']:.0f} characters")
    print(f"Min length: {stats['min_length']} characters")
    print(f"Max length: {stats['max_length']} characters")
    
    # Show sample chunk
    if docs:
        print(f"\nSample chunk:\n{docs[0]['page_content'][:200]}...")
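
The statistics printed in Step 2 come from `get_chunking_stats`, whose implementation is not shown in this notebook. A minimal sketch of how the same four fields could be computed, assuming the chunks are plain strings; `chunk_length_stats` is a hypothetical name used only for this example.

```python
def chunk_length_stats(chunks: list[str]) -> dict:
    """Summarise chunk lengths in characters (hypothetical stand-in helper)."""
    lengths = [len(chunk) for chunk in chunks]
    if not lengths:
        # Avoid a division by zero and min()/max() errors on an empty list.
        return {"num_chunks": 0, "avg_length": 0.0, "min_length": 0, "max_length": 0}
    return {
        "num_chunks": len(lengths),
        "avg_length": sum(lengths) / len(lengths),
        "min_length": min(lengths),
        "max_length": max(lengths),
    }


example_stats = chunk_length_stats(["ab", "abcd"])
```

A large gap between `min_length` and `max_length` usually points to uneven chunks, which is worth checking before choosing a strategy.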

## Step 3: Select Optimal Strategy

Using the chunk counts and length statistics from Step 2, select one strategy for the rest of the pipeline. This notebook uses the paragraph strategy as a balanced default.

In [None]:
# Use paragraph strategy (balanced approach)
selected_strategy = "paragraph"
documents = load_and_split(pdf_path, strategy=selected_strategy)

print(f"Selected strategy: {selected_strategy}")
print(f"Total documents: {len(documents)}")
print("\nReady for embedding generation!")

## Step 4: Generate Embeddings and Build FAISS Index

Create embeddings using OpenAI and build a FAISS index for efficient similarity search.
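Similarity search means finding the stored vectors closest to a query vector. The sketch below shows that idea as a brute-force search in pure Python over toy 2-D vectors. It is a conceptual illustration only: it does not use the FAISS API, the function names are hypothetical, and real embeddings have far more dimensions.

```python
import math


def l2_distance(a: list[float], b: list[float]) -> float:
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest(query: list[float], vectors: list[list[float]], k: int = 2):
    """Return (index, distance) pairs for the k vectors closest to `query`."""
    scored = [(i, l2_distance(query, v)) for i, v in enumerate(vectors)]
    scored.sort(key=lambda pair: pair[1])  # smallest distance first
    return scored[:k]


# Toy vectors standing in for chunk embeddings.
toy_vectors = [[0.0, 0.0], [1.0, 0.0], [5.0, 5.0]]
hits = nearest([0.9, 0.1], toy_vectors, k=2)
hit_indices = [index for index, _ in hits]
```

This loop compares the query against every stored vector, which becomes slow as the collection grows. A dedicated index such as FAISS is designed to answer the same question much faster.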

In [None]:
# Build and persist FAISS index
faiss_path = "faiss_index/"
openai_api_key = os.getenv("OPENAI_API_KEY")

index_path = build_and_persist_faiss(
    docs=documents,
    faiss_path=faiss_path,
    openai_api_key=openai_api_key
)

print(f"\n✓ FAISS index created and saved to: {index_path}")
print("\nNext steps:")
print("1. Proceed to Notebook 2 for baseline RAG")
print("2. Use the FAISS index for retrieval")

## Summary

In this notebook, we:
- ✓ Downloaded the GDPR PDF
- ✓ Tested multiple chunking strategies
- ✓ Generated document embeddings
- ✓ Built and persisted the FAISS index

The FAISS index is now ready for retrieval in the RAG pipeline!