# Milestone 1: Data Preparation

This notebook demonstrates:
1. Downloading the GDPR PDF from official sources
2. Parsing with LangChain loaders
3. Implementing chunking strategies (paragraph, article, token-based)
4. Generating embeddings using OpenAI
5. Building and persisting a FAISS vector store

## Prerequisites

Set your OpenAI API key:
```bash
export OPENAI_API_KEY='your-key-here'
```

Or create a `.env` file:
```
OPENAI_API_KEY=your-key-here
```

In [None]:
# Import required modules
import os
import sys

from dotenv import load_dotenv

# Add the parent directory to the import path so that `src` can be imported
sys.path.insert(0, '..')

from src.data_prep import (
    build_and_persist_faiss,
    download_gdpr_pdf,
    get_embedding_stats,
    load_and_split,
)

# Load environment variables
load_dotenv()

print("✓ Modules imported successfully")
print(f"✓ OpenAI API key set: {bool(os.environ.get('OPENAI_API_KEY'))}")
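
If you would rather stop early than fall back to dry-run mode when the key is missing, a small guard can make that explicit. This is a minimal sketch; `require_env` is a hypothetical helper and is not part of `src.data_prep`.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or raise a clear error.

    Hypothetical helper for illustration; not part of src.data_prep.
    """
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(
            f"{name} is not set; export it or add it to a .env file first"
        )
    return value
```

Calling `require_env("OPENAI_API_KEY")` after `load_dotenv()` would then either return the key or fail with a readable message before any embedding call is made.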

## Step 1: Download GDPR PDF

Download the official GDPR text from the EU website.

In [None]:
# Download GDPR PDF
pdf_path = download_gdpr_pdf("gdpr.pdf")
print(f"PDF path: {pdf_path}")

# Note: In dry-run mode (no API key), this returns a placeholder path
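
The internals of `download_gdpr_pdf` are not shown in this notebook. As a rough illustration of the download step, the sketch below fetches a file once and reuses the local copy on later runs. `fetch_once` is a hypothetical name, the source URL is left to the caller, and the `%PDF` check is only a light sanity test, not full validation.

```python
from pathlib import Path
from urllib.request import urlopen


def fetch_once(url: str, dest: str, timeout: float = 30.0) -> Path:
    """Download `url` to `dest` unless a non-empty file already exists there.

    Hypothetical helper for illustration; not the notebook's actual downloader.
    """
    path = Path(dest)
    if path.exists() and path.stat().st_size > 0:
        return path  # reuse the cached copy instead of downloading again

    path.parent.mkdir(parents=True, exist_ok=True)
    with urlopen(url, timeout=timeout) as response:
        data = response.read()

    # PDF files start with the magic bytes "%PDF"; reject anything else
    # (for example an HTML error page) before writing it to disk.
    if not data.startswith(b"%PDF"):
        raise ValueError("response does not look like a PDF")

    path.write_bytes(data)
    return path
```

Caching the file this way keeps repeated notebook runs from hitting the remote server each time.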

## Step 2: Load and Split Document

Parse the PDF and split it into chunks using different strategies.

In [None]:
# Strategy 1: Paragraph-based splitting
docs_paragraph = load_and_split(pdf_path, strategy="paragraph", chunk_size=1000)
print(f"Paragraph strategy: {len(docs_paragraph)} chunks")

# Strategy 2: Token-based splitting
docs_token = load_and_split(pdf_path, strategy="token", chunk_size=512, chunk_overlap=50)
print(f"Token strategy: {len(docs_token)} chunks")

# Strategy 3: Article-based splitting (GDPR-specific)
docs_article = load_and_split(pdf_path, strategy="article")
print(f"Article strategy: {len(docs_article)} chunks")
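
To make the strategies more concrete, here is a simplified sketch of two of the underlying ideas: splitting at article headings, and fixed-size windows with overlap. It is not the implementation behind `load_and_split`. Both function names are hypothetical, the article pattern assumes headings of the form `Article <number>` at the start of a line, and the window counts characters rather than tokens.

```python
import re


def split_by_article(text: str) -> list[str]:
    """Split text at lines that begin with 'Article <number>'."""
    # The lookahead keeps each heading attached to the chunk it introduces.
    parts = re.split(r"(?m)^(?=Article\s+\d+\b)", text)
    return [part.strip() for part in parts if part.strip()]


def split_with_overlap(
    text: str, chunk_size: int = 1000, chunk_overlap: int = 100
) -> list[str]:
    """Cut text into character windows that share `chunk_overlap` characters."""
    if chunk_size <= 0 or not 0 <= chunk_overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= chunk_overlap < chunk_size")
    if not text:
        return []
    step = chunk_size - chunk_overlap
    # Stop early enough that the last window is not a pure repeat of the
    # tail of the previous one.
    stop = max(len(text) - chunk_overlap, 1)
    return [text[i:i + chunk_size] for i in range(0, stop, step)]
```

The overlap means a sentence that straddles a boundary appears in full in at least one chunk, at the cost of embedding some text twice.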

## Step 3: Analyze Chunks

Get statistics about the document chunks.

In [None]:
# Get statistics
stats = get_embedding_stats(docs_paragraph)
print("Statistics for paragraph-based chunks:")
print(f"  Total chunks: {stats['count']}")
print(f"  Average length: {stats['avg_length']:.0f} chars")
print(f"  Min length: {stats['min_length']} chars")
print(f"  Max length: {stats['max_length']} chars")
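
For reference, statistics of this shape can be computed in a few lines. The sketch below is an assumption about what such a helper might do, not the actual `get_embedding_stats`. It takes plain strings and returns the same four keys printed above; `chunk_length_stats` is a hypothetical name.

```python
from statistics import mean


def chunk_length_stats(texts: list[str]) -> dict:
    """Summarize chunk lengths (in characters) for a list of strings."""
    lengths = [len(text) for text in texts]
    if not lengths:
        # Avoid errors from mean/min/max on an empty list
        return {"count": 0, "avg_length": 0.0, "min_length": 0, "max_length": 0}
    return {
        "count": len(lengths),
        "avg_length": mean(lengths),
        "min_length": min(lengths),
        "max_length": max(lengths),
    }
```

Assuming the chunks are LangChain `Document` objects, the input could be built with `[doc.page_content for doc in docs_paragraph]`. Running the same summary on `docs_token` and `docs_article` makes the strategies easier to compare.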

## Step 4: Build FAISS Vector Store

Generate embeddings and build the FAISS index.

**Note**: This step calls the OpenAI embeddings API, so it requires an API key and incurs usage costs.
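
Before running the cell, it can help to get a rough idea of the size of the job. The sketch below estimates token count and cost from character counts. The ratio of about four characters per token is only a rule of thumb for English text, and the price is a parameter you must fill in from the current OpenAI pricing page. `estimate_embedding_cost` is a hypothetical helper, not part of `src.data_prep`.

```python
def estimate_embedding_cost(
    texts: list[str],
    usd_per_million_tokens: float,
    chars_per_token: float = 4.0,
) -> tuple[float, float]:
    """Return (estimated_tokens, estimated_cost_usd) for embedding `texts`.

    Rough heuristic only: the token count is approximated from characters,
    and the price must be supplied by the caller.
    """
    total_chars = sum(len(text) for text in texts)
    estimated_tokens = total_chars / chars_per_token
    estimated_cost = estimated_tokens / 1_000_000 * usd_per_million_tokens
    return estimated_tokens, estimated_cost
```

For an exact count you would tokenize the chunks with the tokenizer that matches the embedding model; this estimate is only meant to catch order-of-magnitude surprises.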

In [None]:
# Build and persist FAISS index
# Use the paragraph-based chunks here, as they provide a good balance
faiss_path = "../faiss_index"

index_path = build_and_persist_faiss(
    docs_paragraph,
    faiss_path,
    openai_api_key=os.environ.get('OPENAI_API_KEY')
)

print(f"✓ FAISS index built and saved to: {index_path}")
print("\nNext: Open 02_rag_baseline.ipynb to use this index for retrieval")

## Summary

In this notebook, we:
- ✓ Downloaded the GDPR PDF
- ✓ Parsed and split the document using multiple strategies
- ✓ Analyzed chunk statistics
- ✓ Built and persisted a FAISS vector store

The FAISS index is now ready for use in the RAG pipeline.