# 01: Data Preparation

Load, preprocess, and chunk GDPR documents for use in a retrieval-augmented generation (RAG) system.

**Learning Objectives:**
- Load documents from various sources
- Implement text preprocessing
- Create chunks with overlap
- Extract metadata


## Setup

This notebook demonstrates data preparation for the GDPR RAG system.

**API Keys Required:**
- None. Without an API key, this notebook runs in dry-run mode.

To run in production mode, set `OPENAI_API_KEY` in `.env`.

In [None]:
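import os
import sys

# Make the repository root importable so that `src` resolves from this notebook
sys.path.insert(0, '..')

from src.data_prep import DataPreprocessor

print('✓ Imports successful')
# Dry-run mode is active whenever OPENAI_API_KEY is missing from the environment
print(f'Dry-run mode: {not bool(os.getenv("OPENAI_API_KEY"))}')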

## Load and Preprocess GDPR Documents


In [None]:
# Initialize preprocessor
preprocessor = DataPreprocessor(chunk_size=1000, chunk_overlap=200)

# Load documents (works in dry-run mode with sample data)
documents = preprocessor.load_documents('gdpr_documents/')
print(f'Loaded {len(documents)} documents')

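# Preview the first 200 characters of the first document, if any were loaded
if documents:
    print('\nFirst document preview:')
    print(documents[0]['content'][:200] + '...')
else:
    print('\nNo documents loaded; check the input path.')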

## Chunk Documents

TODO: Experiment with different chunking strategies, for example by varying `chunk_size` and `chunk_overlap`.
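
To make the effect of `chunk_size` and `chunk_overlap` concrete, here is a minimal sketch of fixed-size chunking with overlap. It is not the `DataPreprocessor` implementation: the helper name `chunk_text` is hypothetical, and it assumes sizes are measured in characters, which the preprocessor may or may not do.

```python
def chunk_text(text: str, chunk_size: int = 1000, chunk_overlap: int = 200) -> list[str]:
    """Split text into fixed-size character windows that overlap.

    Hypothetical helper for illustration only; sizes are assumed to be characters.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")

    # Each new chunk starts (chunk_size - chunk_overlap) characters after the previous one
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window has reached the end of the text
        if start + chunk_size >= len(text):
            break
    return chunks


sample = "x" * 2500
pieces = chunk_text(sample)
print([len(p) for p in pieces])  # → [1000, 1000, 900]
```

With the defaults used in this notebook, consecutive chunks start 800 characters apart, so the last 200 characters of one chunk repeat as the first 200 of the next.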

In [None]:
# Chunk documents
chunks = preprocessor.chunk_documents(documents)
print(f'Created {len(chunks)} chunks')

# Preview the first chunk, if any were created
if chunks:
    print('\nSample chunk:')
    print(chunks[0])
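
The learning objectives also mention metadata extraction. As a rough sketch of what that could look like for GDPR text, the snippet below pulls article numbers out of a chunk with a regular expression. The function `extract_metadata`, the returned field names, and the example source path are all assumptions made for illustration; they are not taken from `src.data_prep`.

```python
import re

# Matches references such as "Article 17" (assumed citation style for this sketch)
ARTICLE_RE = re.compile(r"\bArticle\s+(\d+)\b")


def extract_metadata(chunk_text: str, source: str) -> dict:
    """Build a small metadata record for one chunk (hypothetical schema)."""
    # Deduplicate and sort the article numbers mentioned in the chunk
    articles = sorted({int(n) for n in ARTICLE_RE.findall(chunk_text)})
    return {
        "source": source,
        "articles": articles,
        "char_count": len(chunk_text),
    }


example = "Article 17 grants the right to erasure; see also Article 6."
meta = extract_metadata(example, "gdpr_documents/sample.txt")  # hypothetical path
print(meta["articles"])  # → [6, 17]
```

Storing fields like these alongside each chunk makes it possible to filter retrieval results by article later on.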