# Data Preparation for Self-RAG

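This notebook walks through preparing a legal corpus and Q&A data for training the Self-RAG system: loading documents, inspecting their statistics, testing chunking, and creating train/test splits.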

In [1]:
import sys
sys.path.append('..')

import json
import pandas as pd
from pathlib import Path
from src.retrieval.chunking import DocumentChunker

## 1. Load Sample Documents

Start with the provided sample data or load your own legal documents.

In [2]:
# Load sample documents
with open('../data/samples/sample_documents.json', 'r', encoding='utf-8') as f:
    documents = json.load(f)

print(f"Loaded {len(documents)} documents")
print("\nFirst document:")
print(json.dumps(documents[0], indent=2))

Loaded 10 documents

First document:
{
  "text": "To establish negligence, a plaintiff must prove four essential elements: (1) duty of care, (2) breach of that duty, (3) causation, and (4) damages. Each element must be proven by a preponderance of the evidence, meaning it is more likely than not that the defendant was negligent. The duty of care arises from the relationship between the parties and the foreseeability of harm.",
  "source": "negligence_basics.txt",
  "title": "Elements of Negligence",
  "doc_id": 1
}


## 2. Explore Document Statistics

In [3]:
# Convert to DataFrame for analysis
df = pd.DataFrame(documents)

# Add text length
df['text_length'] = df['text'].str.len()
df['word_count'] = df['text'].str.split().str.len()

print("Document Statistics:")
print(df[['text_length', 'word_count']].describe())

print(f"\nSources: {df['source'].unique()}")
print(f"Average words per document: {df['word_count'].mean():.0f}")

Document Statistics:
       text_length  word_count
count    10.000000   10.000000
mean    352.000000   54.300000
std      23.911875    6.481598
min     319.000000   47.000000
25%     332.500000   49.250000
50%     357.000000   53.000000
75%     367.250000   60.500000
max     389.000000   63.000000

Sources: ['negligence_basics.txt' 'causation.txt' 'damages.txt' 'duty_of_care.txt'
 'res_ipsa.txt' 'negligence_per_se.txt' 'defenses.txt'
 'professional_negligence.txt']
Average words per document: 54
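
The output above lists 8 unique sources for 10 documents, so at least one source file contributes more than one document. A minimal sketch for seeing which ones, using only the standard library. The `sample_docs` records and the `documents_per_source` helper are hypothetical stand-ins; in the notebook you would pass the loaded `documents` list instead.

```python
from collections import Counter

# Hypothetical stand-in records; in the notebook, use the loaded `documents` list.
sample_docs = [
    {"doc_id": 1, "source": "negligence_basics.txt"},
    {"doc_id": 2, "source": "negligence_basics.txt"},
    {"doc_id": 3, "source": "causation.txt"},
]


def documents_per_source(docs):
    """Count how many documents each source file contributes."""
    return Counter(doc["source"] for doc in docs)


counts = documents_per_source(sample_docs)
for source, n in counts.most_common():
    print(f"{source}: {n}")
```

With the DataFrame from the cell above, `df['source'].value_counts()` should give the same tally.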


## 3. Test Document Chunking

Run the chunker with one configuration, then vary `chunk_size` and `chunk_overlap` to see how the number of chunks and their boundaries change.

In [4]:
# Create chunker
chunker = DocumentChunker({
    'chunk_size': 256,
    'chunk_overlap': 30
})

# Chunk all documents
all_chunks = chunker.chunk_documents(documents)

print(f"Total chunks created: {len(all_chunks)}")
print(f"Average chunks per document: {len(all_chunks) / len(documents):.1f}")

# Show example chunks
print("\nExample chunks from first document:")
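# Assumption: chunk doc_ids are numbered from 0 by DocumentChunker
# (the sample JSON itself uses doc_id 1 for the first document).
doc_0_chunks = [c for c in all_chunks if c['doc_id'] == 0]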
for i, chunk in enumerate(doc_0_chunks[:2]):
    print(f"\nChunk {i+1}:")
    print(chunk['text'][:150] + "...")

Total chunks created: 21
Average chunks per document: 2.1

Example chunks from first document:

Chunk 1:
To establish negligence, a plaintiff must prove four essential elements: (1) duty of care, (2) breach of that duty, (3) causation, and (4) damages. ...

Chunk 2:
Each element must be proven by a preponderance of the evidence, meaning it is more likely than not that the defendant was negligent. . The duty of car...
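
To build intuition for how `chunk_size` and `chunk_overlap` interact, here is a rough sketch of a character-based sliding-window splitter. It is not the implementation of `DocumentChunker`, whose internals and units (characters, words, or tokens) are not shown in this notebook; `chunk_text` and `sample_text` are hypothetical names used only for illustration.

```python
def chunk_text(text, chunk_size, chunk_overlap):
    """Split text into windows of at most chunk_size characters.

    Each window starts chunk_size - chunk_overlap characters after the
    previous one, so consecutive windows share chunk_overlap characters.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once a window reaches the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


# Hypothetical filler text, 400 characters long.
sample_text = "word " * 80

# Compare a few settings side by side.
for size, overlap in [(128, 16), (256, 30), (512, 50)]:
    chunks = chunk_text(sample_text, size, overlap)
    print(f"chunk_size={size}, chunk_overlap={overlap}: {len(chunks)} chunks")
```

Smaller windows produce more chunks, and a larger overlap repeats more text across neighboring chunks. A sentence-aware chunker would pick different boundaries than this fixed-width sketch.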


## 4. Load and Prepare Q&A Data

In [5]:
# Load Q&A data
with open('../data/samples/sample_qa_data.json', 'r', encoding='utf-8') as f:
    qa_data = json.load(f)

print(f"Loaded {len(qa_data)} Q&A pairs")

# Convert to DataFrame
qa_df = pd.DataFrame(qa_data)

print("\nSample Q&A:")
for _, row in qa_df.head(2).iterrows():
    print(f"\nQ: {row['question']}")
    print(f"A: {row['answer'][:100]}...")

Loaded 10 Q&A pairs

Sample Q&A:

Q: What are the four elements that must be proven to establish negligence?
A: To establish negligence, a plaintiff must prove: (1) duty of care - the defendant owed a legal duty ...

Q: What is the standard for determining breach of duty in negligence cases?
A: The standard for breach of duty is objective and based on what a reasonable person would do in simil...


## 5. Split Data for Training and Testing

In [6]:
from sklearn.model_selection import train_test_split

# Split 80/20; the fixed random_state makes the shuffle reproducible
train_qa, test_qa = train_test_split(qa_data, test_size=0.2, random_state=42)

print(f"Training set: {len(train_qa)} examples")
print(f"Test set: {len(test_qa)} examples")

# Save splits
Path('../data/training').mkdir(parents=True, exist_ok=True)
with open('../data/training/train_qa.json', 'w', encoding='utf-8') as f:
    json.dump(train_qa, f, indent=2)

with open('../data/training/test_qa.json', 'w', encoding='utf-8') as f:
    json.dump(test_qa, f, indent=2)

print("\nSaved train and test splits!")

Training set: 8 examples
Test set: 2 examples

Saved train and test splits!
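
With a split this small, it can be worth checking that no question ends up in both sets, for example when the source data contains near-duplicate entries. The sketch below is one way to do that; `find_leaked_questions` and the sample records are hypothetical, and only the `question` key is taken from the Q&A data shown above.

```python
def find_leaked_questions(train_records, test_records):
    """Return questions present in both splits, ignoring case and extra whitespace."""
    def normalize(question):
        return " ".join(question.lower().split())

    train_questions = {normalize(r["question"]) for r in train_records}
    test_questions = {normalize(r["question"]) for r in test_records}
    return sorted(train_questions & test_questions)


# Hypothetical records; in the notebook, pass train_qa and test_qa instead.
sample_train = [
    {"question": "What is negligence?", "answer": "..."},
    {"question": "What is duty of care?", "answer": "..."},
]
sample_test = [
    {"question": "what is  negligence?", "answer": "..."},
    {"question": "What is causation?", "answer": "..."},
]

leaked = find_leaked_questions(sample_train, sample_test)
print(f"Questions in both splits: {leaked}")
```

An empty result means the two splits share no questions under this simple normalization; paraphrased duplicates would still slip through.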


## 6. Prepare Your Own Data

Use this template to load your own legal documents:

In [7]:
# Template for loading your own documents
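def load_custom_documents(directory_path):
    """
    Load every .txt file in a directory as one document.

    Returns a list of dicts with the same keys as the sample documents
    ('text', 'source', 'title', 'doc_id'). Modify this function to match
    your data format.
    """
    documents = []

    # Sort the paths so doc_id assignment is stable across runs
    for file_path in sorted(Path(directory_path).glob('*.txt')):
        with open(file_path, 'r', encoding='utf-8') as f:
            text = f.read()

        documents.append({
            'text': text,
            'source': file_path.name,
            'title': file_path.stem,
            'doc_id': len(documents)  # sequential ID, starting at 0
        })

    return documents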

# Uncomment to use:
# my_documents = load_custom_documents('../data/my_legal_corpus')
# print(f"Loaded {len(my_documents)} custom documents")
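
Before passing custom documents to the chunker, a quick sanity check can catch records that are missing fields or have empty text. The sketch below assumes the four keys seen in the sample documents (`text`, `source`, `title`, `doc_id`) are the ones that matter; whether `DocumentChunker` actually requires all of them is not confirmed here, and `validate_documents` is a hypothetical helper.

```python
# Assumed field set, taken from the sample document shown earlier.
REQUIRED_FIELDS = ("text", "source", "title", "doc_id")


def validate_documents(docs):
    """Return a list of problem descriptions; an empty list means no issues were found."""
    problems = []
    seen_ids = set()
    for index, doc in enumerate(docs):
        missing = [field for field in REQUIRED_FIELDS if field not in doc]
        if missing:
            problems.append(f"record {index}: missing fields {missing}")
            continue
        if not str(doc["text"]).strip():
            problems.append(f"record {index}: empty text")
        if doc["doc_id"] in seen_ids:
            problems.append(f"record {index}: duplicate doc_id {doc['doc_id']}")
        seen_ids.add(doc["doc_id"])
    return problems


# Hypothetical records covering each failure case.
sample_docs = [
    {"text": "Duty of care arises from...", "source": "a.txt", "title": "a", "doc_id": 0},
    {"text": "   ", "source": "b.txt", "title": "b", "doc_id": 0},
    {"text": "Causation requires...", "source": "c.txt"},
]

problems = validate_documents(sample_docs)
for problem in problems:
    print(problem)
```

In the notebook, calling `validate_documents(my_documents)` right after loading would surface these issues before chunking.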

## Summary

Data preparation complete! You now have:
- ✅ Loaded and explored documents
- ✅ Tested chunking strategies
- ✅ Prepared Q&A data
- ✅ Created train/test splits

**Next Steps:**
1. Go to `02_retrieval_pipeline.ipynb` to build the retrieval system
2. Or proceed to `03_self_rag_training.ipynb` to train models