# Setup and Environment Test

This notebook validates your environment and tests API connectivity before you run the full pipeline.

**Workshop**: AI/ML Pipeline - Synthetic Data Generation  
**Date**: January 23, 2026  
**Platform**: CyVerse Jupyter Lab PyTorch GPU

## What This Notebook Does

1. Verifies all required packages are installed
2. Imports the custom modules from `src/`
3. Tests configuration loading
4. Validates the API key and API connectivity
5. Generates a test image
6. Estimates costs for different batch sizes
7. Checks the output directory structure
8. Provides troubleshooting guidance

## 1. Package Verification

First, let's verify all required packages are installed.

In [None]:
import sys
from pathlib import Path

# Check Python version
print(f"Python version: {sys.version}")
print(f"Python executable: {sys.executable}")

# List of required packages
required_packages = [
    'google.generativeai',
    'pandas',
    'numpy',
    'PIL',
    'cv2',
    'dotenv',
    'yaml',
    'tqdm',
    'matplotlib',
    'seaborn',
    'requests',
    'bs4'
]

print("\nChecking required packages...")
missing_packages = []

for package in required_packages:
    try:
        __import__(package)
        print(f"  ✓ {package}")
    except ImportError:
        print(f"  ✗ {package} - MISSING")
        missing_packages.append(package)

if missing_packages:
    print(f"\n⚠ Warning: {len(missing_packages)} packages missing!")
    print("Please run: pip install -r requirements.txt")
else:
    print("\n✓ All required packages are installed!")
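
The check above imports every package, which can be slow for large libraries. As a minimal sketch of an alternative, the helper below uses `importlib.util.find_spec` to locate packages without importing them, and maps import names to the pip names they are commonly installed under. `find_missing`, `pip_hint`, and the `PIP_NAMES` table are illustrative names, not part of the workshop code; the project's `requirements.txt` remains the authoritative list.

```python
import importlib.util

# Import name -> pip distribution name, for packages where the two differ.
# These are the commonly used distributions; requirements.txt may pin others.
PIP_NAMES = {
    "PIL": "Pillow",
    "cv2": "opencv-python",
    "dotenv": "python-dotenv",
    "yaml": "PyYAML",
    "bs4": "beautifulsoup4",
    "google.generativeai": "google-generativeai",
}


def find_missing(import_names):
    """Return the import names that cannot be located, without importing them."""
    missing = []
    for name in import_names:
        try:
            spec = importlib.util.find_spec(name)
        except (ImportError, ValueError):
            # Raised when a parent package (e.g. "google") is itself absent.
            spec = None
        if spec is None:
            missing.append(name)
    return missing


def pip_hint(missing):
    """Build a pip install command for the missing import names."""
    return "pip install " + " ".join(PIP_NAMES.get(n, n) for n in missing)
```

Calling `find_missing(required_packages)` with the list from the cell above would return only the names that need installing, and `pip_hint` turns that list into a command to copy.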

## 2. Import Custom Modules

Import our custom modules from the src/ directory.

In [None]:
# Add parent directory to path
parent_dir = Path.cwd().parent
if str(parent_dir) not in sys.path:
    sys.path.insert(0, str(parent_dir))

try:
    from src import config, gemini_client, data_loader, prompt_builder, output_handler, validation
    print("✓ All custom modules imported successfully!")
except ImportError as e:
    print(f"✗ Failed to import custom modules: {e}")
    print("\nTroubleshooting:")
    print("1. Make sure you're running from the notebooks/ directory")
    print("2. Verify src/ directory exists with all .py files")
    print("3. Check for syntax errors in src modules")

## 3. Configuration Test

Test loading configuration from files.

In [None]:
try:
    # Load configuration
    cfg = config.load_config()
    print("✓ Configuration loaded successfully!")
    print(f"\nConfiguration: {cfg}")
    
    # Display key settings
    print("\nGeneration Settings:")
    print(f"  Number of images: {cfg.generation['num_images']}")
    print(f"  Batch size: {cfg.generation['batch_size']}")
    print(f"  Resolution: {cfg.generation['resolution']}")
    print(f"  Model: {cfg.generation['model']}")
    
    print("\nRate Limiting:")
    print(f"  Requests/minute: {cfg.rate_limiting['requests_per_minute']}")
    print(f"  Requests/day: {cfg.rate_limiting['requests_per_day']}")
    
except FileNotFoundError as e:
    print(f"✗ Configuration file not found: {e}")
    print("\nTroubleshooting:")
    print("1. Verify config/generation_config.yaml exists")
    print("2. Check file permissions")
except Exception as e:
    print(f"✗ Configuration error: {e}")
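
The cell above fails with a `KeyError` if a setting is absent from the YAML file. As a hedged sketch, the helper below checks a plain nested dictionary for the keys this notebook reads. The key names come from the cell above; `REQUIRED_KEYS` and `missing_keys` are hypothetical, and the sketch assumes the settings can be viewed as a dict of dicts, which the real `config` module may or may not expose.

```python
# Keys this notebook reads from the configuration, grouped by section.
REQUIRED_KEYS = {
    "generation": ["num_images", "batch_size", "resolution", "model"],
    "rate_limiting": ["requests_per_minute", "requests_per_day"],
}


def missing_keys(settings, required=REQUIRED_KEYS):
    """Return dotted names of required keys absent from a nested dict."""
    missing = []
    for section, keys in required.items():
        # Treat a missing or empty section as having no keys at all.
        section_values = settings.get(section) or {}
        for key in keys:
            if key not in section_values:
                missing.append(f"{section}.{key}")
    return missing
```

An empty return value means every setting used later in the notebook is present; otherwise the dotted names point at the entries to add to the configuration file.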

## 4. API Key Validation

Check if Google Gemini API key is configured.

In [None]:
try:
    api_key = cfg.api_key
    print("✓ API key found!")
    # Show only a short preview so the key does not leak into saved notebook output
    print(f"  Key preview: {api_key[:4]}...{api_key[-4:]}")
    
except ValueError as e:
    print(f"✗ API key not found: {e}")
    print("\nSetup Instructions:")
    print("1. Get your API key from: https://makersuite.google.com/app/apikey")
    print("2. Copy config/.env.example to config/.env")
    print("3. Add your API key to config/.env: GOOGLE_API_KEY=your_key_here")
    print("4. Restart this notebook")
    
    # Stop execution if no API key
    raise
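
How `cfg.api_key` locates the key is not shown in this notebook. A minimal sketch of one plausible approach follows, assuming the key ends up in the `GOOGLE_API_KEY` environment variable once `config/.env` has been loaded. `read_api_key` and `mask_key` are hypothetical helpers; the `ValueError` mirrors the exception the cell above catches.

```python
import os


def mask_key(key, visible=4):
    """Return a preview that shows only the first and last few characters."""
    if len(key) <= 2 * visible:
        # Too short to preview safely, so hide it completely.
        return "*" * len(key)
    return f"{key[:visible]}...{key[-visible:]}"


def read_api_key(env=os.environ, name="GOOGLE_API_KEY"):
    """Read the key from a mapping of environment variables (assumed layout)."""
    value = env.get(name, "").strip()
    # Reject an unset variable and the placeholder from .env.example.
    if not value or value == "your_key_here":
        raise ValueError(f"{name} is not set")
    return value
```

Passing `env` as a parameter keeps the helper easy to test with a plain dictionary instead of the real environment.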

## 5. API Connection Test

Test connection to Google Gemini API.

In [None]:
import google.generativeai as genai

try:
    # Configure API
    genai.configure(api_key=cfg.api_key)
    
    # List available models
    print("Testing API connection...")
    models = [m.name for m in genai.list_models()]
    
    print("\n✓ Successfully connected to Gemini API!")
    print(f"  Available models: {len(models)}")
    
    # Check if our model is available
    target_model = cfg.generation['model']
    if any(target_model in m for m in models):
        print(f"  ✓ Target model '{target_model}' is available")
    else:
        print(f"  ⚠ Warning: Target model '{target_model}' not found in available models")
        print(f"  Available image models: {[m for m in models if 'image' in m.lower()]}")
        
except Exception as e:
    print(f"✗ API connection failed: {e}")
    print("\nTroubleshooting:")
    print("1. Verify your API key is correct")
    print("2. Check your internet connection")
    print("3. Ensure you have API access enabled")
    raise
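
The availability check above uses a substring match, so a short target name also matches longer variants. Names returned by `genai.list_models()` carry a `models/` prefix, and the sketch below compares names exactly after stripping it. `normalize_model_name` and `is_model_available` are hypothetical helpers, and the model names in the usage note are placeholders rather than real Gemini models.

```python
MODEL_PREFIX = "models/"


def normalize_model_name(name):
    """Strip the leading 'models/' prefix if it is present."""
    if name.startswith(MODEL_PREFIX):
        return name[len(MODEL_PREFIX):]
    return name


def is_model_available(target, available):
    """Return True only when the target matches a listed model exactly."""
    wanted = normalize_model_name(target)
    return any(normalize_model_name(name) == wanted for name in available)
```

With `available = ["models/example-image", "models/example-image-pro"]`, the target `"example-image"` matches, while `"example"` does not, even though it is a substring of both names.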

## 6. Generate Test Image

Generate a single test image to verify everything works.

In [None]:
from IPython.display import display
import time

print("Generating test image...")
print("(This may take 10-30 seconds)\n")

try:
    # Initialize rate limiter
    rate_limiter = gemini_client.RateLimiter(
        requests_per_minute=cfg.rate_limiting['requests_per_minute'],
        requests_per_day=cfg.rate_limiting['requests_per_day']
    )
    
    # Initialize image generator
    generator = gemini_client.GeminiImageGenerator(
        api_key=cfg.api_key,
        rate_limiter=rate_limiter,
        model=cfg.generation['model'],
        resolution=cfg.generation['resolution']
    )
    
    # Simple test prompt
    test_prompt = (
        "Photorealistic image of a peaceful civic gathering in an urban setting. "
        "Diverse crowd of people holding signs, organized demonstration, "
        "clear daytime lighting, high quality."
    )
    
    print(f"Test prompt: {test_prompt}\n")
    
    # Generate image
    start_time = time.time()
    result = generator.generate_image(test_prompt)
    elapsed = time.time() - start_time
    
    print(f"\n✓ Test image generated successfully in {elapsed:.1f}s!")
    print(f"  Image size: {result['metadata']['image_size']}")
    print(f"  Image mode: {result['metadata']['image_mode']}")
    
    # Display image
    print("\nGenerated Image:")
    display(result['image'])
    
except Exception as e:
    print(f"\n✗ Image generation failed: {e}")
    print("\nTroubleshooting:")
    print("1. Check API quota limits")
    print("2. Verify model name is correct")
    print("3. Try a simpler prompt")
    raise
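
The source of `gemini_client.RateLimiter` is not shown here. To illustrate what a requests-per-minute limit involves, the sketch below implements one common approach, a sliding-window limiter. `SlidingWindowLimiter` and its parameters are hypothetical and may differ from the workshop's implementation; the injectable `clock` and `sleep` exist only so the behavior can be tested without waiting.

```python
import time
from collections import deque


class SlidingWindowLimiter:
    """Allow at most max_calls in any window of `period` seconds."""

    def __init__(self, max_calls, period=60.0, clock=time.monotonic, sleep=time.sleep):
        self.max_calls = max_calls
        self.period = period
        self._clock = clock
        self._sleep = sleep
        self._calls = deque()  # timestamps of recent calls, oldest first

    def wait_time(self):
        """Return how many seconds to wait before the next call is allowed."""
        now = self._clock()
        # Drop timestamps that have aged out of the window.
        while self._calls and now - self._calls[0] >= self.period:
            self._calls.popleft()
        if len(self._calls) < self.max_calls:
            return 0.0
        return self.period - (now - self._calls[0])

    def acquire(self):
        """Block until a call is allowed, then record it."""
        delay = self.wait_time()
        while delay > 0:
            self._sleep(delay)
            delay = self.wait_time()
        self._calls.append(self._clock())
```

Calling `acquire()` before each API request keeps the request rate under the limit; a daily cap could be handled by a second instance with `period=86400`.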

## 7. Cost Estimation

Estimate costs for different batch sizes before running the full pipeline.

In [None]:
import pandas as pd

print("Cost Estimation for Different Batch Sizes")
print("=" * 80)

# Test different image counts
test_counts = [10, 20, 50, 100, 200]
estimates = []

for count in test_counts:
    # Temporarily set count
    cfg.set('generation.num_images', count)
    cost_est = cfg.estimate_cost()
    estimates.append(cost_est)

# Create comparison table
df = pd.DataFrame(estimates)
df = df[['num_images', 'resolution', 'image_generation', 'captions', 'labels', 'comments', 'total_estimated']]
df.columns = ['Images', 'Resolution', 'Image Gen', 'Captions', 'Labels', 'Comments', 'Total (USD)']

print(df.to_string(index=False))

print("\n" + "=" * 80)
print("\nNotes:")
print("- Costs are estimates based on current Gemini API pricing")
print("- Actual costs may vary based on prompt complexity and API changes")
print("- Free tier has usage limits - start with smaller batches")
print("\nRecommendation: Start with 10-20 images for testing")
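
The arithmetic behind a table like the one above is a per-image unit price multiplied by the image count, summed across the four cost components. The sketch below shows that calculation under the assumption that costs scale linearly with image count. `estimate_cost_sketch` is a hypothetical stand-in for `cfg.estimate_cost()`, the caller supplies the unit prices, and resolution is left out because its effect on price is not shown in this notebook.

```python
COST_ITEMS = ["image_generation", "captions", "labels", "comments"]


def estimate_cost_sketch(num_images, unit_prices):
    """Return a cost row assuming each item costs unit_price per image (USD)."""
    row = {"num_images": num_images}
    for item in COST_ITEMS:
        row[item] = round(num_images * unit_prices[item], 4)
    # The total is the sum of the four per-item costs.
    row["total_estimated"] = round(sum(row[item] for item in COST_ITEMS), 4)
    return row
```

The unit prices must come from the current Gemini API price list; none are hard-coded here because any value in this sketch would be a guess.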

## 8. Directory Structure Check

Verify all output directories are ready.

In [None]:
print("Checking directory structure...\n")

# Reset configuration
cfg = config.load_config()

# Check directories
directories = {
    'Output': cfg.get_output_path(),
    'Images': cfg.get_output_path('images'),
    'Captions': cfg.get_output_path('captions'),
    'Labels': cfg.get_output_path('labels'),
    'Comments': cfg.get_output_path('comments'),
    'Metadata': cfg.get_output_path('metadata'),
    'Raw Data': cfg.get_data_path('raw'),
    'QA': cfg.get_data_path('../qa')
}

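missing_dirs = []
for name, path in directories.items():
    exists = path.exists()
    status = "✓" if exists else "✗"
    print(f"{status} {name}: {path}")
    if not exists:
        missing_dirs.append(name)

# Report success only when every directory was found
if missing_dirs:
    print(f"\n⚠ Missing directories: {', '.join(missing_dirs)}")
else:
    print("\n✓ Directory structure ready!")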
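
If any directory is reported missing, it can be created with `Path.mkdir`. The sketch below is one way to do that; `ensure_directories` is a hypothetical helper, and it assumes the pipeline does not already create its output directories on first use.

```python
from pathlib import Path


def ensure_directories(paths):
    """Create any missing directories and return the ones that were created."""
    created = []
    for path in paths:
        path = Path(path)
        if not path.is_dir():
            # parents=True also creates intermediate directories.
            path.mkdir(parents=True, exist_ok=True)
            created.append(path)
    return created
```

Passing `directories.values()` from the cell above would create whatever is missing and return the list of new paths for logging.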

## 9. System Summary

Complete system check summary.

In [None]:
print("\n" + "=" * 80)
print("ENVIRONMENT SETUP COMPLETE")
print("=" * 80)

print("\n✓ Python packages installed")
print("✓ Custom modules imported")
print("✓ Configuration loaded")
print("✓ API key configured")
print("✓ API connection tested")
print("✓ Test image generated")
print("✓ Directory structure ready")

print("\n" + "=" * 80)
print("NEXT STEPS")
print("=" * 80)

print("\n1. Run notebook 02_prepare_source_data.ipynb to fetch source data")
print("2. Run notebook 03_generate_images.ipynb to generate synthetic images")
print("3. Run notebook 04_generate_metadata.ipynb for captions/labels/comments")
print("4. Run notebook 05_quality_assurance.ipynb for QA checks")

print("\n💡 Tip: Start with a small batch (10-20 images) to test the full pipeline")
print("   You can increase the num_images in config/generation_config.yaml later")

## Troubleshooting Guide

### Common Issues

**1. ModuleNotFoundError**
- Run: `pip install -r requirements.txt` from the project root
- Ensure you're using the correct Python environment

**2. API Key Error**
- Get API key from: https://makersuite.google.com/app/apikey
- Copy config/.env.example to config/.env
- Add your key: GOOGLE_API_KEY=your_key_here

**3. API Connection Failed**
- Check internet connection
- Verify API key is correct
- Check API quota limits

**4. Image Generation Timeout**
- Increase timeout in rate limiter settings
- Check Google Cloud service status: https://status.cloud.google.com/
- Try simpler prompts

**5. Out of Memory**
- Reduce batch size in config
- Close other applications
- Restart Jupyter kernel
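
For the timeout case in item 4, transient failures often clear on a second attempt. A minimal sketch of retrying with exponential backoff follows. `retry_with_backoff` is a hypothetical helper, not part of `gemini_client`, which may already retry internally.

```python
import time


def retry_with_backoff(func, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call func(); after a failure wait base_delay * 2**attempt, then retry."""
    for attempt in range(max_attempts):
        try:
            return func()
        except Exception:
            # Re-raise the last error once the attempts are used up.
            if attempt == max_attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)
```

In this notebook the call could be wrapped as `retry_with_backoff(lambda: generator.generate_image(test_prompt))`. Catching a narrower exception type than `Exception` would be preferable once the errors raised by the client are known.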

### Getting Help

- Workshop support: Contact instructors
- Documentation: Check README.md and CLAUDE.md
- CyVerse support: https://cyverse.org/support