# Setup and Environment Test

This notebook validates your environment and tests API connectivity before running the full pipeline.

**Workshop**: AI/ML Pipeline - Synthetic Data Generation  
**Platform**: CyVerse Jupyter Lab PyTorch GPU

## What This Notebook Does

1. Verifies all required packages are installed
2. Tests configuration loading
3. Validates API authentication
4. Generates a test image
5. Estimates costs for different batch sizes
6. Provides troubleshooting guidance

## 1. Package Verification

First, let's verify all required packages are installed.

In [1]:
import sys
import os

from pathlib import Path

# Check Python version
print(f"Python version: {sys.version}")
print(f"Python executable: {sys.executable}")

# src_path = os.path.join(os.getcwd(), '..', 'src')

# # Insert it at the beginning of the system path
# sys.path.insert(0, src_path)

Python version: 3.12.3 (tags/v3.12.3:f6650f9, Apr  9 2024, 14:05:25) [MSC v.1938 64 bit (AMD64)]
Python executable: c:\Users\lwert\OneDrive - University of Arizona\Documents\Fellowships\Jetstream\AI-ML_PipelineWorkshop\ai_workshop2026\Scripts\python.exe


In [2]:


# List of required packages
required_packages = [
    'google.generativeai',
    'pandas',
    'numpy',
    'PIL',
    'cv2',
    'dotenv',
    'yaml',
    'tqdm',
    'matplotlib',
    'seaborn',
    'requests',
    'bs4'
]

print("\nChecking required packages...")
missing_packages = []

for package in required_packages:
    try:
        __import__(package)
        print(f"  ✓ {package}")
    except ImportError:
        print(f"  ✗ {package} - MISSING")
        missing_packages.append(package)

if missing_packages:
    print(f"\n⚠ Warning: {len(missing_packages)} packages missing!")
    print("Please run: pip install -r requirements.txt")
else:
    print("\nAll required packages are installed!")


Checking required packages...



All support for the `google.generativeai` package has ended. It will no longer be receiving 
updates or bug fixes. Please switch to the `google.genai` package as soon as possible.
See README for more details:

https://github.com/google-gemini/deprecated-generative-ai-python/blob/main/README.md

  __import__(package)


  ✓ google.generativeai
  ✓ pandas
  ✓ numpy
  ✓ PIL
  ✓ cv2
  ✓ dotenv
  ✓ yaml
  ✓ tqdm
  ✓ matplotlib
  ✓ seaborn
  ✓ requests
  ✓ bs4

✓ All required packages are installed!
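
The check above imports every package, which is also what triggers the `google.generativeai` deprecation warning. As a minimal sketch of an alternative, `importlib.util.find_spec` can locate a module without executing it (parent packages of a dotted name are still imported). The helper name `find_missing` is hypothetical and not part of the workshop code.

```python
import importlib.util


def find_missing(packages):
    """Return the names in `packages` that cannot be located."""
    missing = []
    for name in packages:
        try:
            spec = importlib.util.find_spec(name)
        except (ImportError, ValueError):
            # Raised when a parent package of a dotted name is absent
            # or a module has no usable spec.
            spec = None
        if spec is None:
            missing.append(name)
    return missing


print(find_missing(["json", "no_such_package_xyz"]))  # ['no_such_package_xyz']
```

Passing `required_packages` from the cell above would give the same list that `missing_packages` collects.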


## 2. Import Custom Modules

Import the workshop's custom modules from the `src/` directory.

In [4]:
# Add parent directory to path
parent_dir = Path.cwd().parent
if str(parent_dir) not in sys.path:
    sys.path.insert(0, str(parent_dir))

try:
    from src import config, gemini_client, data_loader, prompt_builder, output_handler, validation
    print("All custom modules imported successfully!")
except ImportError as e:
    print(f"Failed to import custom modules: {e}")
    print("\nTroubleshooting:")
    print("1. Make sure you're running from the notebooks/ directory")
    print("2. Verify src/ directory exists with all .py files")
    print("3. Check for syntax errors in src modules")

All custom modules imported successfully!
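
The cell above assumes the notebook is started from `notebooks/`, so that `Path.cwd().parent` is the project root. A hedged sketch of a more forgiving lookup walks upward until a directory containing `src/` is found. The function name `find_project_root` and the `marker` parameter are illustrative assumptions.

```python
from pathlib import Path


def find_project_root(start, marker="src"):
    """Walk upward from `start` until a directory containing `marker` is found."""
    start = Path(start).resolve()
    for candidate in (start, *start.parents):
        if (candidate / marker).is_dir():
            return candidate
    return None  # no ancestor contains the marker directory


root = find_project_root(Path.cwd())
print(root)
```

If `root` is not `None`, inserting `str(root)` into `sys.path` plays the same role as `parent_dir` in the cell above.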


## 3. Configuration Test

Test loading configuration from files.

In [5]:
try:
    # Load configuration
    cfg = config.load_config()
    print("✓ Configuration loaded successfully!")
    print(f"\nConfiguration: {cfg}")
    
    # Display key settings
    print("\nGeneration Settings:")
    print(f"  Number of images: {cfg.generation['num_images']}")
    print(f"  Batch size: {cfg.generation['batch_size']}")
    print(f"  Resolution: {cfg.generation['resolution']}")
    print(f"  Model: {cfg.generation['model']}")
    
    print("\nRate Limiting:")
    print(f"  Requests/minute: {cfg.rate_limiting['requests_per_minute']}")
    print(f"  Requests/day: {cfg.rate_limiting['requests_per_day']}")
    
except FileNotFoundError as e:
    print(f"✗ Configuration file not found: {e}")
    print("\nTroubleshooting:")
    print("1. Verify config/generation_config.yaml exists")
    print("2. Check file permissions")
except Exception as e:
    print(f"✗ Configuration error: {e}")

2026-01-21 18:21:04,333 - src.config - INFO - Logging configured successfully


✓ Configuration loaded successfully!

Configuration: Config(images=50, model=gemini-2.5-flash-image, resolution=1K)

Generation Settings:
  Number of images: 50
  Batch size: 10
  Resolution: 1K
  Model: gemini-2.5-flash-image

Rate Limiting:
  Requests/minute: 10
  Requests/day: 1000
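
The cell above reads six settings by key, so a missing key would only surface as a generic configuration error. The sketch below checks for them up front. It assumes the settings are available as nested dictionaries; `REQUIRED_KEYS` and `missing_settings` are hypothetical names, and the key names are taken from the cell above.

```python
REQUIRED_KEYS = {
    "generation": ["num_images", "batch_size", "resolution", "model"],
    "rate_limiting": ["requests_per_minute", "requests_per_day"],
}


def missing_settings(settings, required=REQUIRED_KEYS):
    """Return dotted names of required settings absent from `settings`."""
    missing = []
    for section, keys in required.items():
        block = settings.get(section) or {}
        for key in keys:
            if key not in block:
                missing.append(f"{section}.{key}")
    return missing


example = {
    "generation": {"num_images": 50, "batch_size": 10, "resolution": "1K"},
    "rate_limiting": {"requests_per_minute": 10, "requests_per_day": 1000},
}
print(missing_settings(example))  # ['generation.model']
```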


## 4. API Key Validation

Check that the Google Gemini API key is configured.

In [6]:
try:
    api_key = cfg.api_key
    print("API key found!")
    print(f"  Key preview: {api_key[:10]}...{api_key[-4:]}")
    
except ValueError as e:
    print(f"✗ API key not found: {e}")
    print("\nSetup Instructions:")
    print("1. Get your API key from: https://aistudio.google.com/app/apikey")
    print("2. Copy config/.env.example to config/.env")
    print("3. Add your API key to config/.env: GOOGLE_API_KEY=your_key_here")
    print("4. Restart this notebook")
    
    # Stop execution if no API key
    raise

API key found!
  Key preview: AIzaSyAHQN...B2Qs
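
The preview above exposes 14 characters of the key, and that output is saved with the notebook. A minimal sketch of a more conservative preview shows only the last few characters. `mask_secret` is a hypothetical helper, not part of `src/config`.

```python
def mask_secret(secret, visible=4):
    """Replace all but the last `visible` characters of `secret` with asterisks."""
    if len(secret) <= visible:
        # Too short to reveal anything safely.
        return "*" * len(secret)
    return "*" * (len(secret) - visible) + secret[-visible:]


print(mask_secret("example-key-1234"))  # ************1234
```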


## 5. API Connection Test

Test the connection to the Google Gemini API.

In [7]:
import google.generativeai as genai

try:
    # Configure API
    genai.configure(api_key=cfg.api_key)
    
    # List available models
    print("Testing API connection...")
    models = [m.name for m in genai.list_models()]
    
    print("\n✓ Successfully connected to Gemini API!")
    print(f"  Available models: {len(models)}")
    
    # Check if our model is available
    target_model = cfg.generation['model']
    if any(target_model in m for m in models):
        print(f"  ✓ Target model '{target_model}' is available")
    else:
        print(f"  Warning: Target model '{target_model}' not found in available models")
        print(f"  Available image models: {[m for m in models if 'image' in m.lower()]}")
        
except Exception as e:
    print(f"API connection failed: {e}")
    print("\nTroubleshooting:")
    print("1. Verify your API key is correct")
    print("2. Check your internet connection")
    print("3. Ensure you have API access enabled")
    raise

Testing API connection...

✓ Successfully connected to Gemini API!
  Available models: 53
  ✓ Target model 'gemini-2.5-flash-image' is available
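
The availability check above uses a substring test, which can also succeed on a differently named variant of the model. The sketch below compares the final path segment exactly. It assumes listed names carry a `models/` prefix. The names in the example are illustrative, not the actual output of `genai.list_models()`.

```python
def model_available(target, model_names):
    """True if `target` equals the final path segment of any listed name."""
    return any(name.rsplit("/", 1)[-1] == target for name in model_names)


# Hypothetical listing used only to show the difference between the two checks.
names = ["models/gemini-2.5-flash-image-preview", "models/gemini-2.5-flash"]
target = "gemini-2.5-flash-image"

print(any(target in n for n in names))  # True  (substring match)
print(model_available(target, names))   # False (exact match)
```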


## 6. Generate Test Image

Generate a single test image to verify everything works.

In [8]:
from IPython.display import display
import time

print("Generating test image...")
print("(This may take 10-30 seconds)\n")

try:
    # Initialize rate limiter
    rate_limiter = gemini_client.RateLimiter(
        requests_per_minute=cfg.rate_limiting['requests_per_minute'],
        requests_per_day=cfg.rate_limiting['requests_per_day']
    )
    
    # Initialize image generator
    generator = gemini_client.GeminiImageGenerator(
        api_key=cfg.api_key,
        rate_limiter=rate_limiter,
        model=cfg.generation['model'],
        resolution=cfg.generation['resolution']
    )
    
    # Simple test prompt
    test_prompt = (
        "Photorealistic image of a peaceful civic gathering in an urban setting. "
        "Diverse crowd of people holding signs, organized demonstration, "
        "clear daytime lighting, high quality."
    )
    
    print(f"Test prompt: {test_prompt}\n")
    
    # Generate image
    start_time = time.time()
    result = generator.generate_image(test_prompt)
    elapsed = time.time() - start_time
    
    print(f"\n✓ Test image generated successfully in {elapsed:.1f}s!")
    print(f"  Image size: {result['metadata']['image_size']}")
    print(f"  Image mode: {result['metadata']['image_mode']}")
    
    # Display image
    print("\nGenerated Image:")
    display(result['image'])
    
except Exception as e:
    print(f"\n✗ Image generation failed: {e}")
    print("\nTroubleshooting:")
    print("1. Check API quota limits")
    print("2. Verify model name is correct")
    print("3. Try a simpler prompt")
    raise

2026-01-21 18:21:33,339 - src.gemini_client - INFO - Gemini API client initialized
2026-01-21 18:21:33,343 - src.gemini_client - INFO - Initialized gemini-2.5-flash-image for image generation
2026-01-21 18:21:33,350 - src.gemini_client - INFO - Generating image with prompt length: 173


Generating test image...
(This may take 10-30 seconds)

Test prompt: Photorealistic image of a peaceful civic gathering in an urban setting. Diverse crowd of people holding signs, organized demonstration, clear daytime lighting, high quality.



* Quota exceeded for metric: generativelanguage.googleapis.com/generate_content_free_tier_input_token_count, limit: 0, model: gemini-2.5-flash-preview-image
* Quota exceeded for metric: generativelanguage.googleapis.com/generate_content_free_tier_requests, limit: 0, model: gemini-2.5-flash-preview-image
* Quota exceeded for metric: generativelanguage.googleapis.com/generate_content_free_tier_requests, limit: 0, model: gemini-2.5-flash-preview-image
Please retry in 26.541605749s. [links {
  description: "Learn more about Gemini API quotas"
  url: "https://ai.google.dev/gemini-api/docs/rate-limits"
}
, violations {
  quota_metric: "generativelanguage.googleapis.com/generate_content_free_tier_input_token_count"
  quota_id: "GenerateContentInputTokensPerModelPerMinute-FreeTier"
  quota_dimensions {
    key: "model"
    value: "gemini-2.5-flash-preview-image"
  }
  quota_dimensions {
    key: "location"
    value: "global"
  }
}
violations {
  quota_metric: "generativelanguage.googleapis.co


✗ Image generation failed: 429 You exceeded your current quota, please check your plan and billing details. For more information on this error, head to: https://ai.google.dev/gemini-api/docs/rate-limits. To monitor your current usage, head to: https://ai.dev/rate-limit. 
* Quota exceeded for metric: generativelanguage.googleapis.com/generate_content_free_tier_requests, limit: 0, model: gemini-2.5-flash-preview-image
* Quota exceeded for metric: generativelanguage.googleapis.com/generate_content_free_tier_requests, limit: 0, model: gemini-2.5-flash-preview-image
* Quota exceeded for metric: generativelanguage.googleapis.com/generate_content_free_tier_input_token_count, limit: 0, model: gemini-2.5-flash-preview-image
Please retry in 20.31769879s. [links {
  description: "Learn more about Gemini API quotas"
  url: "https://ai.google.dev/gemini-api/docs/rate-limits"
}
, violations {
  quota_metric: "generativelanguage.googleapis.com/generate_content_free_tier_requests"
  quota_id: "Gene

ResourceExhausted: 429 You exceeded your current quota, please check your plan and billing details. For more information on this error, head to: https://ai.google.dev/gemini-api/docs/rate-limits. To monitor your current usage, head to: https://ai.dev/rate-limit. 
* Quota exceeded for metric: generativelanguage.googleapis.com/generate_content_free_tier_requests, limit: 0, model: gemini-2.5-flash-preview-image
* Quota exceeded for metric: generativelanguage.googleapis.com/generate_content_free_tier_requests, limit: 0, model: gemini-2.5-flash-preview-image
* Quota exceeded for metric: generativelanguage.googleapis.com/generate_content_free_tier_input_token_count, limit: 0, model: gemini-2.5-flash-preview-image
Please retry in 20.31769879s. [links {
  description: "Learn more about Gemini API quotas"
  url: "https://ai.google.dev/gemini-api/docs/rate-limits"
}
, violations {
  quota_metric: "generativelanguage.googleapis.com/generate_content_free_tier_requests"
  quota_id: "GenerateRequestsPerDayPerProjectPerModel-FreeTier"
  quota_dimensions {
    key: "model"
    value: "gemini-2.5-flash-preview-image"
  }
  quota_dimensions {
    key: "location"
    value: "global"
  }
}
violations {
  quota_metric: "generativelanguage.googleapis.com/generate_content_free_tier_requests"
  quota_id: "GenerateRequestsPerMinutePerProjectPerModel-FreeTier"
  quota_dimensions {
    key: "model"
    value: "gemini-2.5-flash-preview-image"
  }
  quota_dimensions {
    key: "location"
    value: "global"
  }
}
violations {
  quota_metric: "generativelanguage.googleapis.com/generate_content_free_tier_input_token_count"
  quota_id: "GenerateContentInputTokensPerModelPerMinute-FreeTier"
  quota_dimensions {
    key: "model"
    value: "gemini-2.5-flash-preview-image"
  }
  quota_dimensions {
    key: "location"
    value: "global"
  }
}
, retry_delay {
  seconds: 20
}
]
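
The 429 message above includes a "Please retry in ...s" hint. A hedged sketch of honoring that hint is shown below. `retry_delay_seconds` and `call_with_retry` are hypothetical helpers, not part of `gemini_client`, and matching on the text "429" is an assumption about how the error is rendered. A quota reported as `limit: 0` suggests the free tier grants no requests for this model, in which case waiting alone is unlikely to help.

```python
import re
import time

RETRY_HINT = re.compile(r"retry in (\d+(?:\.\d+)?)s")


def retry_delay_seconds(message, default=30.0):
    """Extract the suggested wait from an error message, or fall back to `default`."""
    match = RETRY_HINT.search(message)
    return float(match.group(1)) if match else default


def call_with_retry(func, attempts=3, sleep=time.sleep):
    """Call `func`, waiting and retrying when the error text mentions 429."""
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except Exception as exc:
            if "429" not in str(exc) or attempt == attempts:
                raise  # not a quota error, or out of attempts
            sleep(retry_delay_seconds(str(exc)))


print(retry_delay_seconds("Please retry in 20.31769879s. [links {"))  # 20.31769879
```

In the notebook this could wrap the generation call, for example `call_with_retry(lambda: generator.generate_image(test_prompt))`.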

## 7. Cost Estimation

Estimate costs for different batch sizes before running the full pipeline.

In [9]:
import pandas as pd

print("Cost Estimation for Different Batch Sizes")
print("=" * 80)

# Test different image counts
test_counts = [10, 20, 50, 100, 200]
estimates = []

for count in test_counts:
    # Temporarily set count
    cfg.set('generation.num_images', count)
    cost_est = cfg.estimate_cost()
    estimates.append(cost_est)

# Create comparison table
df = pd.DataFrame(estimates)
df = df[['num_images', 'resolution', 'image_generation', 'captions', 'labels', 'comments', 'total_estimated']]
df.columns = ['Images', 'Resolution', 'Image Gen', 'Captions', 'Labels', 'Comments', 'Total (USD)']

print(df.to_string(index=False))

print("\n" + "=" * 80)
print("\nNotes:")
print("- Costs are estimates based on current Gemini API pricing")
print("- Actual costs may vary based on prompt complexity and API changes")
print("- Free tier has usage limits - start with smaller batches")
print("\nRecommendation: Start with 10-20 images for testing")

Cost Estimation for Different Batch Sizes
 Images Resolution  Image Gen  Captions  Labels  Comments  Total (USD)
     10         1K       0.01     0.001   0.001     0.005        0.017
     20         1K       0.02     0.002   0.002     0.010        0.034
     50         1K       0.05     0.005   0.005     0.025        0.085
    100         1K       0.10     0.010   0.010     0.050        0.170
    200         1K       0.20     0.020   0.020     0.100        0.340


Notes:
- Costs are estimates based on current Gemini API pricing
- Actual costs may vary based on prompt complexity and API changes
- Free tier has usage limits - start with smaller batches

Recommendation: Start with 10-20 images for testing
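
The table above scales linearly with the number of images. The sketch below reproduces that arithmetic using per-image rates read off the table itself (for example, 0.01 USD for 10 images gives 0.001 USD per image). These rates are inferred from the output, not taken from official pricing, and this `estimate_cost` is a stand-in rather than the actual `cfg.estimate_cost()` implementation.

```python
# Per-image rates in USD, inferred from the table above (assumption).
PER_IMAGE_USD = {
    "image_generation": 0.001,
    "captions": 0.0001,
    "labels": 0.0001,
    "comments": 0.0005,
}


def estimate_cost(num_images, rates=PER_IMAGE_USD):
    """Return per-component and total cost estimates for `num_images`."""
    costs = {name: round(rate * num_images, 6) for name, rate in rates.items()}
    costs["total_estimated"] = round(sum(costs.values()), 6)
    return costs


print(estimate_cost(50)["total_estimated"])  # 0.085
```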


## 8. Directory Structure Check

Verify all output directories are ready.

In [10]:
print("Checking directory structure...\n")

# Reset configuration
cfg = config.load_config()

# Check directories
directories = {
    'Output': cfg.get_output_path(),
    'Images': cfg.get_output_path('images'),
    'Captions': cfg.get_output_path('captions'),
    'Labels': cfg.get_output_path('labels'),
    'Comments': cfg.get_output_path('comments'),
    'Metadata': cfg.get_output_path('metadata'),
    'Raw Data': cfg.get_data_path('raw'),
    'QA': cfg.get_data_path('../qa')
}

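missing = []
for name, path in directories.items():
    exists = path.exists()
    status = "✓" if exists else "✗"
    print(f"{status} {name}: {path}")
    if not exists:
        missing.append(name)

# Report success only when every directory was found
if missing:
    print(f"\n✗ Missing directories: {', '.join(missing)}")
else:
    print("\n✓ Directory structure ready!")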

2026-01-21 18:24:00,166 - src.config - INFO - Loaded environment variables from c:\Users\lwert\OneDrive - University of Arizona\Documents\Fellowships\Jetstream\AI-ML_PipelineWorkshop\DataCollection\config\.env
2026-01-21 18:24:00,202 - src.config - INFO - Loaded configuration from c:\Users\lwert\OneDrive - University of Arizona\Documents\Fellowships\Jetstream\AI-ML_PipelineWorkshop\DataCollection\config\generation_config.yaml
2026-01-21 18:24:00,211 - src.config - INFO - Logging configured successfully


Checking directory structure...

✓ Output: c:\Users\lwert\OneDrive - University of Arizona\Documents\Fellowships\Jetstream\AI-ML_PipelineWorkshop\DataCollection\data\generated
✓ Images: c:\Users\lwert\OneDrive - University of Arizona\Documents\Fellowships\Jetstream\AI-ML_PipelineWorkshop\DataCollection\data\generated\images
✓ Captions: c:\Users\lwert\OneDrive - University of Arizona\Documents\Fellowships\Jetstream\AI-ML_PipelineWorkshop\DataCollection\data\generated\captions
✓ Labels: c:\Users\lwert\OneDrive - University of Arizona\Documents\Fellowships\Jetstream\AI-ML_PipelineWorkshop\DataCollection\data\generated\labels
✓ Comments: c:\Users\lwert\OneDrive - University of Arizona\Documents\Fellowships\Jetstream\AI-ML_PipelineWorkshop\DataCollection\data\generated\comments
✓ Metadata: c:\Users\lwert\OneDrive - University of Arizona\Documents\Fellowships\Jetstream\AI-ML_PipelineWorkshop\DataCollection\data\generated\metadata
✓ Raw Data: c:\Users\lwert\OneDrive - University
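
If any directory is reported missing, it can be created before the pipeline runs. This is a minimal sketch using `Path.mkdir`. `ensure_directories` is a hypothetical helper, and whether the workshop's `config` module already creates these paths is not shown in this notebook.

```python
from pathlib import Path


def ensure_directories(paths):
    """Create any missing directories and return the ones that were created."""
    created = []
    for path in map(Path, paths):
        if not path.exists():
            # parents=True also creates intermediate directories.
            path.mkdir(parents=True, exist_ok=True)
            created.append(path)
    return created
```

Calling `ensure_directories(directories.values())` with the dictionary from the cell above would create whatever is absent and leave existing directories untouched.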

## 9. System Summary

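Print a summary of the system checks. This cell prints every check mark unconditionally, so confirm that each earlier cell actually succeeded. In the run shown here, the test image generation in Section 6 failed with a 429 quota error even though the summary lists it as complete.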

In [11]:
print("\n" + "=" * 80)
print("ENVIRONMENT SETUP COMPLETE")
print("=" * 80)

print("\n✓ Python packages installed")
print("✓ Custom modules imported")
print("✓ Configuration loaded")
print("✓ API key configured")
print("✓ API connection tested")
print("✓ Test image generated")
print("✓ Directory structure ready")

print("\n" + "=" * 80)
print("NEXT STEPS")
print("=" * 80)

print("\n1. Run notebook 02_prepare_source_data.ipynb to fetch source data")
print("2. Run notebook 03_generate_images.ipynb to generate synthetic images")
print("3. Run notebook 04_generate_metadata.ipynb for captions/labels/comments")
print("4. Run notebook 05_quality_assurance.ipynb for QA checks")

print("\n💡 Tip: Start with a small batch (10-20 images) to test the full pipeline")
print("   You can increase the num_images in config/generation_config.yaml later")


ENVIRONMENT SETUP COMPLETE

✓ Python packages installed
✓ Custom modules imported
✓ Configuration loaded
✓ API key configured
✓ API connection tested
✓ Test image generated
✓ Directory structure ready

NEXT STEPS

1. Run notebook 02_prepare_source_data.ipynb to fetch source data
2. Run notebook 03_generate_images.ipynb to generate synthetic images
3. Run notebook 04_generate_metadata.ipynb for captions/labels/comments
4. Run notebook 05_quality_assurance.ipynb for QA checks

💡 Tip: Start with a small batch (10-20 images) to test the full pipeline
   You can increase the num_images in config/generation_config.yaml later
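
The summary above is a fixed list of print statements, so it reports success regardless of what happened in earlier cells. A hedged sketch of a summary driven by recorded results follows. `run_checks` and `format_summary` are hypothetical helpers, and the two demo checks stand in for the notebook's real steps.

```python
def run_checks(checks):
    """Run each named callable and record True on success, False if it raises."""
    results = {}
    for name, check in checks.items():
        try:
            check()
            results[name] = True
        except Exception:
            results[name] = False
    return results


def format_summary(results):
    """Render one line per check with a pass or fail mark."""
    return "\n".join(f"{'✓' if ok else '✗'} {name}" for name, ok in results.items())


def failing_check():
    # Stand-in for a step that fails, such as a 429 quota error.
    raise RuntimeError("429 quota exceeded")


demo = run_checks({
    "Configuration loaded": lambda: None,
    "Test image generated": failing_check,
})
print(format_summary(demo))
# ✓ Configuration loaded
# ✗ Test image generated
```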


## Troubleshooting Guide

### Common Issues

**1. ModuleNotFoundError**
- Run: `pip install -r requirements.txt` from the project root
- Ensure you're using the correct Python environment

**2. API Key Error**
- Get an API key from: https://aistudio.google.com/app/apikey
- Copy `config/.env.example` to `config/.env`
- Add your key: `GOOGLE_API_KEY=your_key_here`

**3. API Connection Failed**
- Check internet connection
- Verify API key is correct
- Check API quota limits (a `429` error reporting `limit: 0` indicates the free tier grants no quota for that model)

**4. Image Generation Timeout**
- Increase timeout in rate limiter settings
- Check Google Cloud service status: https://status.cloud.google.com/
- Try simpler prompts

**5. Out of Memory**
- Reduce batch size in config
- Close other applications
- Restart Jupyter kernel

### Getting Help

- Workshop support: Contact instructors
- Documentation: Check README.md and CLAUDE.md
- CyVerse support: https://cyverse.org/support