# Quality Assurance and Validation

This notebook performs comprehensive quality assurance checks on the generated synthetic dataset.

**Workshop**: AI/ML Pipeline - Synthetic Data Generation  
**Date**: January 23, 2026  
**Platform**: CyVerse Jupyter Lab PyTorch GPU

## QA Checklist

1. **Image Validation**: Verify all images are valid and readable
2. **Metadata Completeness**: Check that every image has a caption, labels, and comments
3. **Duplicate Detection**: Identify potentially duplicate images
4. **Label Distribution Analysis**: Check for bias in labels
5. **Human Review Interface**: Sample random images for manual inspection
6. **QA Report Generation**: Create comprehensive quality report

## Setup and Imports

In [None]:
import sys
from pathlib import Path
import json
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from IPython.display import display, HTML
from collections import Counter

# Add parent directory to path
parent_dir = Path.cwd().parent
if str(parent_dir) not in sys.path:
    sys.path.insert(0, str(parent_dir))

from src import config, output_handler, validation

print("✓ All modules imported successfully")

# Set plot style
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (12, 6)

## 1. Load Configuration and Dataset

In [None]:
# Load configuration
cfg = config.load_config()

# Initialize output handler
output_dir = cfg.get_output_path()
handler = output_handler.OutputHandler(
    output_dir=output_dir,
    image_format=cfg.output['image_format']
)

# Get dataset summary
summary = handler.get_summary()

print("Dataset Summary:")
print("=" * 80)
print(f"Output directory: {summary['output_directory']}")
print(f"Images: {summary['images_saved']}")
print(f"Captions: {summary['captions_saved']}")
print(f"Labels: {summary['labels_saved']}")
print(f"Comments: {summary['comments_saved']}")
print("=" * 80)

## 2. Automated Validation

Run comprehensive automated validation checks.

In [None]:
print("Running automated validation...\n")

# Initialize dataset validator
validator = validation.DatasetValidator(
    output_dir=output_dir,
    min_image_size_kb=cfg.validation.get('min_image_size_kb', 10),
    duplicate_threshold=cfg.validation.get('duplicate_threshold', 95)
)

# Run validation
validation_report = validator.validate_dataset()

print("✓ Validation complete")

### Validation Results Summary

In [None]:
summary_data = validation_report['summary']

print("Validation Summary:")
print("=" * 80)
print(f"\nTotal Images: {summary_data['total_images']}")
print(f"Valid Images: {summary_data['valid_images']}")
print(f"Validation Rate: {summary_data['validation_rate']:.1f}%")
print(f"\nMetadata Completeness:")
print(f"  Metadata: {'✓ Complete' if summary_data['metadata_complete'] else '✗ Incomplete'}")
print(f"  Captions: {'✓ Complete' if summary_data['captions_complete'] else '✗ Incomplete'}")
print(f"  Labels: {'✓ Complete' if summary_data['labels_complete'] else '✗ Incomplete'}")
print(f"  Comments: {'✓ Complete' if summary_data['comments_complete'] else '✗ Incomplete'}")
print(f"\nDuplicate Detection:")
print(f"  Duplicates Found: {summary_data['duplicates_found']}")
print(f"\nOverall Quality: {summary_data['dataset_quality']}")
print("=" * 80)
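
The completeness flags above come from `validation.DatasetValidator`, whose internals are not shown in this notebook. As a rough illustration only, a completeness check could look like the sketch below. It assumes the `<image_id>_caption.json`, `<image_id>_labels.json`, and `<image_id>_comments.json` naming used later in the review cell; the function name `find_missing_metadata` and its directory arguments are hypothetical.

```python
from pathlib import Path


def find_missing_metadata(image_ids, captions_dir, labels_dir, comments_dir):
    """Return {image_id: [missing kinds]} for images lacking a metadata file.

    Hypothetical helper: the directory arguments stand in for the handler's
    captions, labels, and comments directories.
    """
    expected = {
        "caption": (Path(captions_dir), "_caption.json"),
        "labels": (Path(labels_dir), "_labels.json"),
        "comments": (Path(comments_dir), "_comments.json"),
    }
    missing = {}
    for image_id in image_ids:
        # Collect every metadata kind whose file does not exist for this image
        absent = [
            kind
            for kind, (directory, suffix) in expected.items()
            if not (directory / f"{image_id}{suffix}").exists()
        ]
        if absent:
            missing[image_id] = absent
    return missing
```

An empty result means every image has all three files. In this notebook the arguments would likely be `handler.captions_dir`, `handler.labels_dir`, and `handler.comments_dir`.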

### Image Validation Details

In [None]:
images_report = validation_report['images']

print(f"\nImage Validation Details:")
print(f"  Total: {images_report['total_images']}")
print(f"  Valid: {images_report['valid_images']}")
print(f"  Invalid: {images_report['invalid_images']}")
print(f"  Rate: {images_report['validation_rate']:.1f}%")

# Show any invalid images
if images_report['invalid_images'] > 0:
    print("\n⚠ Invalid Images:")
    for result in images_report['results']:
        if not result['valid']:
            print(f"  - {Path(result['file_path']).name}: {', '.join(result['issues'])}")
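
The `issues` strings printed above are produced inside `src.validation`, which this notebook does not show. The sketch below illustrates the kind of per-file check that could yield such a list, using only the standard library: a minimum file size (mirroring `min_image_size_kb`) and a look at the file's leading signature bytes. The function name `basic_image_checks` and the issue wording are assumptions. A fuller check would also decode the image, for example with Pillow's `Image.verify()`.

```python
from pathlib import Path

# Leading "magic" bytes for two common image formats.
SIGNATURES = {
    "png": b"\x89PNG\r\n\x1a\n",
    "jpeg": b"\xff\xd8\xff",
}


def basic_image_checks(file_path, min_size_kb=10):
    """Return a list of issue strings; an empty list means the checks passed."""
    path = Path(file_path)
    if not path.is_file():
        return ["file not found"]

    issues = []
    size_kb = path.stat().st_size / 1024
    if size_kb < min_size_kb:
        issues.append(f"file too small ({size_kb:.1f} KB < {min_size_kb} KB)")

    # Compare the first bytes of the file against the known signatures
    with open(path, "rb") as f:
        header = f.read(8)
    if not any(header.startswith(sig) for sig in SIGNATURES.values()):
        issues.append("unrecognized image signature")
    return issues
```

A signature match only shows that the file starts like a PNG or JPEG. It does not prove the rest of the file decodes cleanly.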

### Duplicate Detection

In [None]:
duplicates = validation_report['duplicates']

print(f"\nDuplicate Detection:")
print(f"  Number of duplicate pairs: {duplicates['num_duplicates']}")

if duplicates['num_duplicates'] > 0:
    print("\n⚠ Potential Duplicates:")
    for img1, img2, similarity in duplicates['duplicate_pairs'][:10]:  # Show first 10
        print(f"  - {Path(img1).name} ↔ {Path(img2).name} (similarity: {similarity:.1f}%)")
else:
    print("\n✓ No duplicates detected")
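
The similarity percentages above depend on how `DatasetValidator` compares images, which is not documented here. One common approach, shown below purely as an illustrative sketch and not as the validator's actual method, is an average hash: shrink each image to a small grayscale grid, set one bit per pixel depending on whether it is brighter than the mean, and report the share of matching bits. All function names are hypothetical, and the input is a plain nested list of grayscale values so the example runs without an imaging library.

```python
def average_hash(pixels):
    """Hash a small grayscale grid: one bit per pixel, set if above the mean."""
    flat = [value for row in pixels for value in row]
    mean = sum(flat) / len(flat)
    return [1 if value > mean else 0 for value in flat]


def similarity_percent(hash_a, hash_b):
    """Percentage of matching bits between two equal-length hashes."""
    if len(hash_a) != len(hash_b):
        raise ValueError("hashes must have the same length")
    matches = sum(a == b for a, b in zip(hash_a, hash_b))
    return 100.0 * matches / len(hash_a)


def find_duplicate_pairs(hashes, threshold=95):
    """Return (name_a, name_b, similarity) for pairs at or above the threshold."""
    names = sorted(hashes)
    pairs = []
    for i, name_a in enumerate(names):
        # Compare each image only with the names that follow it
        for name_b in names[i + 1:]:
            similarity = similarity_percent(hashes[name_a], hashes[name_b])
            if similarity >= threshold:
                pairs.append((name_a, name_b, similarity))
    return pairs
```

With Pillow, the grid could come from something like `img.convert("L").resize((8, 8))`. The pairwise comparison is quadratic in the number of images, which is acceptable for workshop-sized datasets but slow for very large ones.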

## 3. Statistical Analysis

Analyze label distribution and caption characteristics.

### Label Distribution Analysis

In [None]:
# Load all labels
labels_csv = output_dir / "all_labels.csv"

if labels_csv.exists():
    labels_df = pd.read_csv(labels_csv)
    
    print("Label Distribution Analysis:")
    print("=" * 80)
    
    # Analyze each label category
    label_cols = [col for col in labels_df.columns if col != 'image_id']
    
    fig, axes = plt.subplots(2, 3, figsize=(15, 10))
    fig.suptitle('Label Distribution by Category', fontsize=16)
    
    for idx, col in enumerate(label_cols[:6]):  # Plot first 6 categories
        row = idx // 3
        col_idx = idx % 3
        
        # Count values
        value_counts = labels_df[col].value_counts()
        
        # Plot
        value_counts.plot(kind='bar', ax=axes[row, col_idx])
        axes[row, col_idx].set_title(col.replace('_', ' ').title())
        axes[row, col_idx].set_xlabel('')
        axes[row, col_idx].tick_params(axis='x', rotation=45)
    
    # Hide unused subplots
    for idx in range(len(label_cols), 6):
        row = idx // 3
        col_idx = idx % 3
        axes[row, col_idx].axis('off')
    
    plt.tight_layout()
    plt.show()
    
    # Print statistics
    print("\nLabel Statistics:")
    for col in label_cols:
        print(f"\n{col.replace('_', ' ').title()}:")
        print(labels_df[col].value_counts().to_string())
else:
    print("⚠ Labels CSV not found. Please run notebook 04 to generate metadata.")
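
The bar charts show label bias visually. A numeric summary can make it easier to compare categories. The sketch below is one possible measure, not part of the workshop's `src` package: it reports the majority class share and a normalized Shannon entropy, where 1.0 means the classes are perfectly balanced and values near 0 mean one class dominates. The function name `label_balance` is hypothetical.

```python
import math
from collections import Counter


def label_balance(values):
    """Summarize how evenly a list of labels is spread across its classes."""
    counts = Counter(values)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no labels to analyze")
    shares = {label: count / total for label, count in counts.items()}

    if len(counts) == 1:
        # A single class carries no uncertainty
        normalized = 0.0
    else:
        # Shannon entropy divided by its maximum, log(number of classes)
        entropy = -sum(p * math.log(p) for p in shares.values())
        normalized = entropy / math.log(len(counts))

    return {
        "majority_label": counts.most_common(1)[0][0],
        "majority_share": max(shares.values()),
        "normalized_entropy": normalized,
    }
```

Applied to this notebook's data, a call might look like `label_balance(labels_df[col].dropna().tolist())` for each column in `label_cols`.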

### Caption Length Analysis

In [None]:
# Load all captions
captions_csv = output_dir / "all_captions.csv"

if captions_csv.exists():
    captions_df = pd.read_csv(captions_csv)
    
    # Calculate caption lengths
    captions_df['caption_length'] = captions_df['caption'].str.len()
    captions_df['word_count'] = captions_df['caption'].str.split().str.len()
    
    print("Caption Statistics:")
    print("=" * 80)
    print(f"Total captions: {len(captions_df)}")
    print(f"\nCharacter Length:")
    print(f"  Mean: {captions_df['caption_length'].mean():.1f}")
    print(f"  Median: {captions_df['caption_length'].median():.1f}")
    print(f"  Min: {captions_df['caption_length'].min()}")
    print(f"  Max: {captions_df['caption_length'].max()}")
    print(f"\nWord Count:")
    print(f"  Mean: {captions_df['word_count'].mean():.1f}")
    print(f"  Median: {captions_df['word_count'].median():.1f}")
    print(f"  Min: {captions_df['word_count'].min()}")
    print(f"  Max: {captions_df['word_count'].max()}")
    
    # Plot distributions
    fig, axes = plt.subplots(1, 2, figsize=(15, 5))
    
    captions_df['caption_length'].hist(bins=20, ax=axes[0])
    axes[0].set_title('Caption Length Distribution (characters)')
    axes[0].set_xlabel('Characters')
    axes[0].set_ylabel('Frequency')
    
    captions_df['word_count'].hist(bins=20, ax=axes[1])
    axes[1].set_title('Caption Word Count Distribution')
    axes[1].set_xlabel('Words')
    axes[1].set_ylabel('Frequency')
    
    plt.tight_layout()
    plt.show()
else:
    print("⚠ Captions CSV not found. Please run notebook 04 to generate metadata.")
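
The histograms show the overall shape of the caption lengths. To pull out individual captions worth a second look, one option is to flag word counts outside the usual interquartile-range fences. This is a sketch of a generic outlier rule, not something the notebook's validator is known to do. The function name `flag_caption_outliers` and the default multiplier `k=1.5` are assumptions.

```python
import statistics


def flag_caption_outliers(captions, k=1.5):
    """Return indices of captions whose word count falls outside the IQR fences.

    Requires at least two captions, because statistics.quantiles needs
    two or more data points.
    """
    word_counts = [len(caption.split()) for caption in captions]
    q1, _, q3 = statistics.quantiles(word_counts, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [i for i, n in enumerate(word_counts) if n < low or n > high]
```

In this notebook it could be called as `flag_caption_outliers(captions_df['caption'].dropna().tolist())`. Note that the returned indices refer to positions in that list, not to the DataFrame index.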

## 4. Human Review Interface

Display a random sample of images, with their metadata, for manual inspection.

In [None]:
import random
from PIL import Image

# Get all images
image_files = list(handler.images_dir.glob(f"*.{cfg.output['image_format']}"))

# Sample for review: 5% of the dataset, but at least 10 images (capped at the dataset size)
sample_size = max(10, int(len(image_files) * 0.05))
sample_images = random.sample(image_files, min(sample_size, len(image_files)))

print(f"Reviewing {len(sample_images)} randomly sampled images:")
print("=" * 80)

# Create review interface
for idx, image_file in enumerate(sample_images, 1):
    image_id = image_file.stem
    
    print(f"\n{'='*80}")
    print(f"Review {idx}/{len(sample_images)}: {image_id}")
    print(f"{'='*80}")
    
    # Load and display image (the context manager closes the file handle)
    with Image.open(image_file) as img:
        display(img.resize((512, 512)))
    
    # Load metadata
    metadata = handler.load_metadata(image_id)
    if metadata:
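        # Only append an ellipsis when the prompt was actually truncated
        prompt = metadata.get('prompt', 'N/A')
        excerpt = prompt[:200] + ('...' if len(prompt) > 200 else '')
        print(f"\nOriginal Prompt (excerpt):")
        print(f"  {excerpt}")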
        print(f"\nSource Theme: {metadata.get('source_data', {}).get('atropia', {}).get('theme', 'N/A')}")
    
    # Load caption
    caption_file = handler.captions_dir / f"{image_id}_caption.json"
    if caption_file.exists():
        with open(caption_file, 'r') as f:
            caption_data = json.load(f)
        print(f"\nGenerated Caption:")
        print(f"  {caption_data['caption']}")
    
    # Load labels
    labels_file = handler.labels_dir / f"{image_id}_labels.json"
    if labels_file.exists():
        with open(labels_file, 'r') as f:
            labels_data = json.load(f)
        print(f"\nGenerated Labels:")
        for category, label in labels_data['labels'].items():
            print(f"  {category}: {label}")
    
    # Load sample comments
    comments_file = handler.comments_dir / f"{image_id}_comments.json"
    if comments_file.exists():
        with open(comments_file, 'r') as f:
            comments_data = json.load(f)
        print(f"\nSample Comments (first 3):")
        for i, comment in enumerate(comments_data['comments'][:3], 1):
            print(f"  {i}. {comment}")
    
    print(f"\n{'='*80}\n")
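
The review cell above draws a different sample on every run, which makes it hard for two reviewers to compare notes on the same images. A minimal sketch of a reproducible variant follows. It applies the same rule (5% of the dataset, at least 10 images, capped at the dataset size) but uses a seeded generator. The function name `review_sample` and the default seed are assumptions.

```python
import random


def review_sample(items, fraction=0.05, minimum=10, seed=42):
    """Pick a reproducible random subset of items for manual review."""
    # Sort first so the seed alone determines the sample,
    # regardless of the order in which the files were listed
    items = sorted(items)
    size = min(len(items), max(minimum, int(len(items) * fraction)))
    return random.Random(seed).sample(items, size)
```

Calling `review_sample(image_files)` with the same seed returns the same images each time. Changing the seed gives a fresh sample for a second review pass.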

### Manual Review Checklist

For each image reviewed above, consider:
- ✓ Image quality and clarity
- ✓ Relevance to social movement theme
- ✓ Caption accuracy and descriptiveness
- ✓ Label appropriateness
- ✓ Comment realism and diversity

**Note any issues below:**

(Space for notes)

## 5. Save QA Report

Save the validation report produced in Section 2 as the quality assurance report.

In [None]:
# Save validation report
report_path = validator.save_report(validation_report)

print(f"✓ QA report saved to: {report_path}")
print(f"\nReport includes:")
print(f"  - Image validation results")
print(f"  - Metadata completeness checks")
print(f"  - Duplicate detection results")
print(f"  - Overall quality assessment")

## 6. Final Summary and Recommendations

In [None]:
print("\n" + "=" * 80)
print("QUALITY ASSURANCE COMPLETE")
print("=" * 80)

quality = summary_data['dataset_quality']
validation_rate = summary_data['validation_rate']

print(f"\nOverall Dataset Quality: {quality}")
print(f"Validation Rate: {validation_rate:.1f}%")

# Only announce readiness when the quality rating supports it
if quality in ['Excellent', 'Good']:
    print("\nDataset is ready for:")
    print("  ✓ Model training and development")
    print("  ✓ Semantic similarity analysis")
    print("  ✓ Network structure analysis")
    print("  ✓ Publication and sharing")
else:
    print("\nDataset is not ready yet:")
    print("  ⚠ Review and address issues before proceeding")
    print("  ⚠ Consider regenerating problematic images")

print("\nNext Steps:")
print("  1. Review QA report for detailed findings")
print("  2. Address any issues identified")
print("  3. Package dataset for Model Development phase")
print("  4. Document generation methodology")

print("\n" + "=" * 80)
print(f"Dataset Location: {output_dir}")
print(f"QA Report: {report_path}")
print("=" * 80)

## Workshop Complete!

Congratulations! You have successfully:

1. ✓ Set up and tested the environment
2. ✓ Prepared source data from three sources
3. ✓ Generated synthetic social movement images
4. ✓ Created captions, labels, and comments
5. ✓ Validated and assessed dataset quality

Your synthetic dataset is ready for the **Model Development** phase of the workshop!

### Resources

- Dataset: `data/generated/`
- QA Report: `data/qa/`
- Documentation: `README.md` and `CLAUDE.md`
- Support: CyVerse and workshop instructors