# Data Validation with jupyter-lab-progress

This notebook demonstrates the `LabValidator` class, which validates student work in lab exercises. It is particularly useful for data science and machine learning workshops.

## Table of Contents
1. Embedding Shape Validation
2. Validating Different Embedding Formats
3. DataFrame Validation
4. Validating Data Processing Steps
5. Real-World Example: Vector Search Lab
6. Creating Custom Validation Functions

## Setup

Import necessary modules and create sample data:

In [None]:
from jupyter_lab_progress import LabValidator, LabProgress, show_info, show_warning
import pandas as pd
import numpy as np

# Create validator instance
validator = LabValidator()

## 1. Embedding Shape Validation

Validate that embeddings have the correct dimensions for vector search:

In [None]:
# Create sample embeddings
correct_embeddings = np.random.rand(10, 384)  # 10 embeddings of dimension 384
wrong_embeddings = np.random.rand(10, 256)    # Wrong dimension

print("Correct embeddings shape:", correct_embeddings.shape)
print("Wrong embeddings shape:", wrong_embeddings.shape)

In [None]:
# Validate correct embeddings
is_valid = validator.check_embedding_shape(
    embeddings=correct_embeddings,
    expected_dim=384
)

if is_valid:
    show_info("✅ Embeddings have the correct shape!")
else:
    show_warning("❌ Embeddings have incorrect dimensions")

In [None]:
# Validate wrong embeddings
is_valid = validator.check_embedding_shape(
    embeddings=wrong_embeddings,
    expected_dim=384
)

if is_valid:
    show_info("✅ Embeddings have the correct shape!")
else:
    show_warning(f"❌ Embeddings have incorrect dimensions. Expected 384, got {wrong_embeddings.shape[-1]}.")
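
The library's internals are not shown in this notebook, so the following is only a sketch of the kind of check `check_embedding_shape()` appears to perform, written with plain NumPy. The helper name `embedding_dim_matches` is hypothetical and is not part of `jupyter_lab_progress`. It assumes the embedding dimension is the last axis of a 1D or 2D array.

```python
import numpy as np


def embedding_dim_matches(embeddings, expected_dim):
    """Return True if the last axis of `embeddings` has length `expected_dim`.

    Hypothetical helper for illustration; not the library's implementation.
    """
    arr = np.asarray(embeddings)
    # Accept a single vector (1D) or a batch of vectors (2D) only.
    if arr.ndim not in (1, 2):
        return False
    return arr.shape[-1] == expected_dim


batch_ok = embedding_dim_matches(np.zeros((10, 384)), 384)
batch_bad = embedding_dim_matches(np.zeros((10, 256)), 384)
single_ok = embedding_dim_matches(np.zeros(384), 384)
print(batch_ok, batch_bad, single_ok)  # True False True
```

Checking only the last axis lets the same helper cover both a single vector and a batch.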

## 2. Validating Different Embedding Formats

Handle various embedding formats students might create:

In [None]:
# Single embedding (1D array)
single_embedding = np.random.rand(384)
print("Single embedding shape:", single_embedding.shape)

is_valid = validator.check_embedding_shape(single_embedding, 384)
print(f"Single embedding valid: {is_valid}")

# Batch of embeddings (2D array)
batch_embeddings = np.random.rand(5, 384)
print("\nBatch embeddings shape:", batch_embeddings.shape)

is_valid = validator.check_embedding_shape(batch_embeddings, 384)
print(f"Batch embeddings valid: {is_valid}")

In [None]:
# List of embeddings
list_embeddings = [np.random.rand(384) for _ in range(3)]
print("List of embeddings, each shape:", list_embeddings[0].shape)

# Convert to numpy array for validation
array_embeddings = np.array(list_embeddings)
is_valid = validator.check_embedding_shape(array_embeddings, 384)

if is_valid:
    show_info(f"✅ All {len(list_embeddings)} embeddings have the correct dimension!")
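
Converting a list to an array only works cleanly when every vector has the same length. A ragged list either raises an error or produces an object array, depending on the NumPy version. A small pre-check, sketched below with a hypothetical helper `find_wrong_lengths`, reports which entries have the wrong length before the conversion is attempted.

```python
def find_wrong_lengths(embeddings, expected_dim):
    """Return (index, length) pairs for entries whose length differs from expected_dim."""
    return [
        (idx, len(emb))
        for idx, emb in enumerate(embeddings)
        if len(emb) != expected_dim
    ]


# Plain lists stand in for 1D NumPy arrays here; len() works the same on both.
ragged = [[0.0] * 384, [0.0] * 256, [0.0] * 384]
problems = find_wrong_lengths(ragged, expected_dim=384)
print(problems)  # [(1, 256)]
```

Reporting the index and the actual length gives students a concrete place to look, rather than a bare pass or fail.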

## 3. DataFrame Validation

Validate that DataFrames contain expected columns and values:

In [None]:
# Create sample DataFrame
df = pd.DataFrame({
    'product_id': ['P001', 'P002', 'P003', 'P004', 'P005'],
    'name': ['Laptop', 'Mouse', 'Keyboard', 'Monitor', 'Headphones'],
    'category': ['Electronics', 'Accessories', 'Accessories', 'Electronics', 'Audio'],
    'price': [999.99, 29.99, 79.99, 299.99, 149.99],
    'embedding': [np.random.rand(384) for _ in range(5)]
})

print("DataFrame shape:", df.shape)
print("\nColumns:", df.columns.tolist())
print("\nFirst few rows:")
df.head()

In [None]:
# Validate that specific products exist
try:
    validator.assert_in_dataframe(
        df=df,
        column='name',
        values=['Laptop', 'Mouse'],
        context='Product catalog validation'
    )
    show_info("✅ Required products found in DataFrame!")
except AssertionError as e:
    show_warning(f"❌ Validation failed: {e}")

In [None]:
# Try to validate missing products
try:
    validator.assert_in_dataframe(
        df=df,
        column='name',
        values=['Laptop', 'Tablet'],  # Tablet doesn't exist
        context='Product catalog validation'
    )
    show_info("✅ Required products found in DataFrame!")
except AssertionError as e:
    show_warning(f"❌ Validation failed: {e}")
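
When a check like this fails, it helps to tell students exactly which values are missing. The error text produced by `assert_in_dataframe()` is not shown in this notebook, so the sketch below computes the missing values directly. `missing_values` is a hypothetical helper, not part of the library. It accepts any iterable of column values, so a pandas Series such as `df['name']` can be passed as-is.

```python
def missing_values(column_values, required):
    """Return the required values absent from column_values, in input order."""
    present = set(column_values)
    return [value for value in required if value not in present]


names = ['Laptop', 'Mouse', 'Keyboard', 'Monitor', 'Headphones']
missing = missing_values(names, ['Laptop', 'Tablet'])
print(missing)  # ['Tablet']
```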

## 4. Validating Data Processing Steps

Use validation to ensure students complete data processing correctly:

In [None]:
# Create a lab progress tracker
data_lab = LabProgress(
    steps=[
        "Load raw data",
        "Clean missing values",
        "Create embeddings",
        "Validate embeddings",
        "Save processed data"
    ],
    title="📊 Data Processing Lab"
)

show_info("Let's process some product data for vector search!")
data_lab.display()

In [None]:
# Step 1: Load raw data
raw_data = pd.DataFrame({
    'product': ['Laptop Pro', 'Wireless Mouse', None, 'USB Keyboard', 'Gaming Monitor'],
    'description': [
        'High-performance laptop for professionals',
        'Ergonomic wireless mouse with precision tracking',
        None,
        'Mechanical keyboard with RGB lighting',
        '4K gaming monitor with 144Hz refresh rate'
    ],
    'price': [1299.99, 49.99, 89.99, None, 599.99]
})

print("Raw data loaded:")
print(raw_data)
data_lab.mark_completed("Load raw data")
data_lab.display()

In [None]:
# Step 2: Clean missing values
show_warning("⚠️ Found missing values in the data. Cleaning required!")

# Remove rows with missing critical data
# (.copy() avoids a SettingWithCopyWarning when the price column is assigned below)
cleaned_data = raw_data.dropna(subset=['product', 'description']).copy()

# Fill missing prices with median
median_price = cleaned_data['price'].median()
cleaned_data['price'] = cleaned_data['price'].fillna(median_price)

print("\nCleaned data:")
print(cleaned_data)

# Validate no missing values remain
if cleaned_data.isnull().sum().sum() == 0:
    show_info("✅ All missing values have been handled!")
    data_lab.mark_completed("Clean missing values")
    data_lab.display()

In [None]:
# Step 3: Create embeddings (simulated)
import zlib

def create_embedding(text, dim=384):
    """Simulate creating embeddings from text."""
    # In a real scenario, use sentence-transformers or OpenAI
    # zlib.crc32 gives a stable seed; built-in hash() of a str changes between sessions
    seed = zlib.crc32(text.encode("utf-8"))
    return np.random.default_rng(seed).random(dim)  # Reproducible "embeddings"

# Create embeddings for descriptions
cleaned_data['embedding'] = cleaned_data['description'].apply(create_embedding)

print("Embeddings created!")
print(f"Embedding shape: {cleaned_data['embedding'].iloc[0].shape}")
data_lab.mark_completed("Create embeddings")
data_lab.display()

In [None]:
# Step 4: Validate embeddings
show_info("Validating embedding dimensions...")

# Check all embeddings
all_valid = True
for idx, embedding in enumerate(cleaned_data['embedding']):
    if not validator.check_embedding_shape(embedding, expected_dim=384):
        show_warning(f"❌ Invalid embedding at index {idx}")
        all_valid = False

if all_valid:
    show_info("✅ All embeddings validated successfully!")
    data_lab.mark_completed("Validate embeddings")
    
# Also validate required products exist
try:
    validator.assert_in_dataframe(
        df=cleaned_data,
        column='product',
        values=['Laptop Pro', 'Gaming Monitor'],
        context='Final data validation'
    )
    show_info("✅ Required products present in final dataset!")
except AssertionError as e:
    show_warning(f"Missing required products: {e}")

data_lab.display()
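
One way to tie progress tracking to validation is to mark a step complete only when every check for it passes. The sketch below uses a hypothetical `run_checks` helper that accepts (name, callable) pairs. It treats an `AssertionError` or a falsy return value as a failure and a return value of `None` as a pass, so boolean checks and assert-style checks can be mixed. That treatment of `None` is an assumption about how assert-style checks behave on success.

```python
def run_checks(checks):
    """Run (name, callable) pairs and return {name: passed}.

    A check fails if it raises AssertionError or returns a falsy value
    other than None; returning None counts as a pass.
    """
    results = {}
    for name, check in checks:
        try:
            outcome = check()
            results[name] = outcome is None or bool(outcome)
        except AssertionError:
            results[name] = False
    return results


def failing_check():
    raise AssertionError("required product missing")


results = run_checks([
    ("dimension", lambda: 384 == 384),
    ("products", failing_check),
])
all_passed = all(results.values())
print(results)     # {'dimension': True, 'products': False}
print(all_passed)  # False
```

In this notebook, `data_lab.mark_completed("Validate embeddings")` would then be called only when `all_passed` is true.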

## 5. Real-World Example: Vector Search Lab

Complete example combining all validation features:

In [None]:
# Vector Search Lab Setup
show_info("""🔍 Vector Search Lab: Building a Semantic Search System

In this lab, you'll build a product search system using vector embeddings.""")

# Initialize tracking
vector_lab = LabProgress(
    steps=[
        "Prepare product data",
        "Generate embeddings",
        "Validate embeddings",
        "Create search index",
        "Test search queries"
    ],
    title="🔍 Vector Search Lab Progress"
)
vector_lab.display()

In [None]:
# Prepare comprehensive product catalog
products_df = pd.DataFrame({
    'id': range(1, 11),
    'name': [
        'MacBook Pro 16"', 'Dell XPS 15', 'ThinkPad X1 Carbon',
        'iPad Pro', 'Samsung Galaxy Tab', 'Microsoft Surface',
        'AirPods Pro', 'Sony WH-1000XM5', 'Bose QuietComfort',
        'Magic Mouse'
    ],
    'category': [
        'Laptop', 'Laptop', 'Laptop',
        'Tablet', 'Tablet', 'Tablet',
        'Audio', 'Audio', 'Audio',
        'Accessories'
    ],
    'description': [
        'Professional laptop with M3 Pro chip and stunning display',
        'Powerful Windows laptop with 4K OLED screen',
        'Business ultrabook with exceptional keyboard',
        'Tablet with M2 chip for creative professionals',
        'Android tablet with S Pen for productivity',
        '2-in-1 device running Windows 11',
        'Wireless earbuds with active noise cancellation',
        'Premium over-ear headphones with industry-leading ANC',
        'Comfortable headphones for all-day listening',
        'Wireless mouse with multi-touch surface'
    ]
})

print("Product catalog prepared:")
print(products_df[['name', 'category']].head())
vector_lab.mark_completed("Prepare product data")
vector_lab.display()

In [None]:
# Generate embeddings with different models (simulated)
import zlib

def generate_embeddings(texts, model='base', dim=384):
    """Simulate different embedding models."""
    if model not in ('base', 'large'):
        raise ValueError(f"Unknown model: {model!r}")
    embeddings = []
    for text in texts:
        # Seed from a stable checksum of model + text; built-in hash() of a
        # str changes between sessions, so it is not reproducible
        seed = zlib.crc32(f"{model}:{text}".encode("utf-8"))
        embeddings.append(np.random.default_rng(seed).random(dim))
    return np.array(embeddings)

# Generate embeddings
show_info("Generating embeddings using base model...")
embeddings = generate_embeddings(products_df['description'].tolist())
products_df['embedding'] = list(embeddings)

print(f"Generated {len(embeddings)} embeddings")
print(f"Embedding dimension: {embeddings.shape[1]}")
vector_lab.mark_completed("Generate embeddings")
vector_lab.display()

In [None]:
# Comprehensive validation
show_info("Running comprehensive validation checks...")

# 1. Validate embedding dimensions
embedding_array = np.array(products_df['embedding'].tolist())
if validator.check_embedding_shape(embedding_array, expected_dim=384):
    show_info("✅ Embedding dimensions: PASS")
else:
    show_warning("❌ Embedding dimensions: FAIL")

# 2. Validate required categories
try:
    validator.assert_in_dataframe(
        df=products_df,
        column='category',
        values=['Laptop', 'Tablet', 'Audio'],
        context='Product categories'
    )
    show_info("✅ Product categories: PASS")
except AssertionError as e:
    show_warning(f"❌ Product categories: FAIL ({e})")

# 3. Validate specific products
try:
    validator.assert_in_dataframe(
        df=products_df,
        column='name',
        values=['MacBook Pro 16"', 'AirPods Pro'],
        context='Required products'
    )
    show_info("✅ Required products: PASS")
except AssertionError as e:
    show_warning(f"❌ Required products: FAIL ({e})")

# 4. Custom validation: Check that no embedding is a zero vector
norms = np.linalg.norm(embedding_array, axis=1)
if np.all(norms > 0):
    show_info("✅ Non-zero embeddings: PASS")
else:
    show_warning("❌ Non-zero embeddings: FAIL (found zero vectors)")

vector_lab.mark_completed("Validate embeddings")
vector_lab.display()
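
The fourth check only rules out zero vectors. If the search index expects unit-length vectors, as cosine-similarity setups often do, a stricter check compares each norm to 1. The sketch below is an assumption about what such a check could look like. `is_unit_normalized` and `normalize_rows` are hypothetical helpers, and the tolerance is arbitrary.

```python
import numpy as np


def is_unit_normalized(embeddings, atol=1e-6):
    """Return True if every row of a 2D array has L2 norm close to 1."""
    norms = np.linalg.norm(embeddings, axis=1)
    return bool(np.allclose(norms, 1.0, atol=atol))


def normalize_rows(embeddings):
    """Scale each row to unit L2 norm; assumes no row is all zeros."""
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    return embeddings / norms


rng = np.random.default_rng(0)
raw = rng.random((10, 384))      # values in [0, 1), so norms are far above 1
unit = normalize_rows(raw)

print(is_unit_normalized(raw))   # False
print(is_unit_normalized(unit))  # True
```

Running the non-zero check first matters here, because `normalize_rows` would divide by zero on an all-zero row.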

## 6. Creating Custom Validation Functions

Extend validation for specific use cases:

In [None]:
def validate_mongodb_ready(df, validator):
    """Validate data is ready for MongoDB insertion."""
    checks_passed = []
    
    # Check 1: No null values
    if df.isnull().sum().sum() == 0:
        checks_passed.append("No null values")
    else:
        show_warning("❌ Found null values in DataFrame")
        return False
    
    # Check 2: Embeddings are lists or arrays
    if 'embedding' in df.columns:
        valid_embeddings = all(
            isinstance(emb, (list, np.ndarray)) 
            for emb in df['embedding']
        )
        if valid_embeddings:
            checks_passed.append("Valid embedding format")
        else:
            show_warning("❌ Invalid embedding format")
            return False
    
    # Check 3: Has required fields
    required_fields = ['id', 'name', 'description']
    missing_fields = [field for field in required_fields if field not in df.columns]
    if not missing_fields:
        checks_passed.append("Required fields present")
    else:
        show_warning(f"❌ Missing required fields: {missing_fields}")
        return False
    
    show_info(f"✅ MongoDB validation passed: {', '.join(checks_passed)}")
    return True

# Test the custom validation
is_ready = validate_mongodb_ready(products_df, validator)
print(f"\nData ready for MongoDB: {is_ready}")
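
The check above accepts NumPy arrays as embeddings, but an array usually has to be converted to a plain list before a database driver can store it. The following sketch converts a DataFrame into a list of plain-Python documents. `to_documents` is a hypothetical helper, and no database call is made. JSON serialization is used only as a rough stand-in to confirm that no NumPy types remain.

```python
import json

import numpy as np
import pandas as pd


def to_documents(df):
    """Convert DataFrame rows to dicts holding only plain Python values."""
    documents = []
    for record in df.to_dict("records"):
        documents.append({
            # .tolist() on an ndarray or NumPy scalar returns built-in types
            key: value.tolist() if isinstance(value, (np.ndarray, np.generic)) else value
            for key, value in record.items()
        })
    return documents


sample_df = pd.DataFrame({
    "id": [1, 2],
    "name": ["Laptop", "Mouse"],
    "embedding": [np.zeros(3), np.ones(3)],
})
documents = to_documents(sample_df)
json.dumps(documents)  # raises TypeError if any NumPy value slipped through
print(documents[1])    # {'id': 2, 'name': 'Mouse', 'embedding': [1.0, 1.0, 1.0]}
```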

In [None]:
# Create a validation report
def generate_validation_report(df, validator):
    """Generate a comprehensive validation report."""
    report = {
        'total_records': len(df),
        'columns': list(df.columns),
        'missing_values': df.isnull().sum().to_dict(),
        'data_types': df.dtypes.to_dict(),
        'validation_checks': []
    }
    
    # Run various checks
    if 'embedding' in df.columns:
        emb_array = np.array(df['embedding'].tolist())
        emb_valid = validator.check_embedding_shape(emb_array, 384)
        report['validation_checks'].append({
            'check': 'embedding_dimensions',
            'passed': emb_valid,
            'details': f'Shape: {emb_array.shape}'
        })
    
    # Display report
    show_info("📊 Validation Report Generated")
    print(f"\nTotal Records: {report['total_records']}")
    print(f"Columns: {', '.join(report['columns'])}")
    print("\nValidation Checks:")
    for check in report['validation_checks']:
        status = "✅ PASS" if check['passed'] else "❌ FAIL"
        print(f"- {check['check']}: {status} ({check['details']})")
    
    return report

# Generate report
validation_report = generate_validation_report(products_df, validator)
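
The last two steps of the lab, "Create search index" and "Test search queries", are not shown in this notebook. As a rough illustration of what they involve, the sketch below runs a brute-force cosine-similarity search over a small matrix of vectors using NumPy. The function name `cosine_top_k` and the toy vectors are assumptions for illustration. A real lab would typically use a vector database or search index rather than this linear scan.

```python
import numpy as np


def cosine_top_k(query, matrix, k=3):
    """Return (indices, scores) of the k rows most similar to query."""
    matrix_unit = matrix / np.linalg.norm(matrix, axis=1, keepdims=True)
    query_unit = query / np.linalg.norm(query)
    scores = matrix_unit @ query_unit
    top = np.argsort(scores)[::-1][:k]  # highest score first
    return top, scores[top]


vectors = np.array([
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 0.0, 1.0],
])
query = np.array([1.0, 0.0, 0.0])

indices, scores = cosine_top_k(query, vectors, k=2)
print(indices.tolist())  # [0, 2]
```

The same non-zero requirement from the validation step applies here, since both normalizations divide by the vector norm.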

## Lab Completion

Wrap up with final validation and next steps:

In [None]:
# Final lab completion (the index and query steps themselves are not shown in this notebook)
vector_lab.mark_completed("Create search index")
vector_lab.mark_completed("Test search queries")

show_info("""🎉 Congratulations! You've completed the Vector Search Lab!

You've successfully:
✅ Prepared and cleaned product data
✅ Generated vector embeddings
✅ Validated data integrity
✅ Created MongoDB-ready documents
✅ Implemented comprehensive validation

Your data is now ready for vector search!""")

vector_lab.display()

# Show final statistics
show_info(f"""📈 Final Statistics:
• Products indexed: {len(products_df)}
• Embedding dimension: 384
• Categories: {products_df['category'].nunique()}
• Validation checks passed: All ✅""")

## Summary

The `LabValidator` class provides two validation helpers that you can combine with your own checks:

### Key Features:
- **`check_embedding_shape()`** - Validates embedding dimensions
- **`assert_in_dataframe()`** - Ensures required data exists
- **Custom validation** - Extend for specific use cases

### Best Practices:
1. Validate early and often
2. Provide clear error messages
3. Combine with progress tracking
4. Create custom validators for complex requirements
5. Generate validation reports for debugging

Use validation to confirm that students complete labs correctly and to catch common mistakes early.