# Generate Synthetic Images

This notebook generates synthetic social movement images using the Google Gemini API, combining Atropia data, World Bank demographics, and visual references.

**Workshop**: AI/ML Pipeline - Synthetic Data Generation  
**Date**: January 23, 2026  
**Platform**: CyVerse Jupyter Lab PyTorch GPU

## Pipeline Overview

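1. Load configuration and estimate costs
2. Load source data
3. Build prompts from the combined data sources
4. Initialize the API client, rate limiter, output handler, and checkpoint manager
5. Check for a checkpoint to resume from
6. Generate images in batches with checkpoints
7. Save the generation log
8. Preview generated images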

## Setup and Imports

In [1]:
import sys
from pathlib import Path
import time
from datetime import datetime
from IPython.display import display, clear_output
from tqdm.notebook import tqdm

# Add parent directory to path
parent_dir = Path.cwd().parent
if str(parent_dir) not in sys.path:
    sys.path.insert(0, str(parent_dir))

from src import config, gemini_client, data_loader, prompt_builder, output_handler

print("All modules imported successfully")
print(f"Working directory: {Path.cwd()}")


All support for the `google.generativeai` package has ended. It will no longer be receiving 
updates or bug fixes. Please switch to the `google.genai` package as soon as possible.
See README for more details:

https://github.com/google-gemini/deprecated-generative-ai-python/blob/main/README.md

  import google.generativeai as genai


All modules imported successfully
Working directory: c:\Users\lwert\OneDrive - University of Arizona\Documents\Fellowships\Jetstream\AI-ML_PipelineWorkshop\DataCollection\notebooks


## 1. Load Configuration

Review and adjust generation parameters if needed.

In [2]:
# Load configuration
cfg = config.load_config()

print("Current Configuration:")
print("=" * 80)
print(f"\nImages to generate: {cfg.generation['num_images']}")
print(f"Batch size: {cfg.generation['batch_size']}")
print(f"Resolution: {cfg.generation['resolution']}")
print(f"Model: {cfg.generation['model']}")
print(f"\nPrompt style: {cfg.prompts['style']}")
print(f"Prompt complexity: {cfg.prompts['complexity']}")
print(f"\nRate limit: {cfg.rate_limiting['requests_per_minute']} requests/minute")

print("\n" + "=" * 80)

2026-01-22 15:38:02,607 - src.config - INFO - Logging configured successfully


Current Configuration:

Images to generate: 50
Batch size: 10
Resolution: 1K
Model: gemini-2.5-pro-image

Prompt style: realistic
Prompt complexity: medium

Rate limit: 10 requests/minute



### Adjust Settings (Optional)

You can modify settings here if needed. Otherwise, skip this cell.

In [None]:
# OPTIONAL: Adjust number of images
# cfg.set('generation.num_images', 20)  # Change to desired number

# OPTIONAL: Adjust batch size
# cfg.set('generation.batch_size', 5)  # Smaller batches for testing

print("Settings adjusted (if any changes made above)")
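
The optional `cfg.set('generation.num_images', 20)` call addresses a nested setting through a dotted key. The `src.config` module is not shown in this notebook, so the following is only a minimal sketch of how such a setter could work over a nested dictionary. `set_dotted` and `get_dotted` are hypothetical helper names, not the workshop's actual API.

```python
from typing import Any


def set_dotted(settings: dict, dotted_key: str, value: Any) -> None:
    """Set settings['a']['b'] = value for dotted_key 'a.b', creating levels as needed."""
    *parents, leaf = dotted_key.split(".")
    node = settings
    for name in parents:
        node = node.setdefault(name, {})
    node[leaf] = value


def get_dotted(settings: dict, dotted_key: str, default: Any = None) -> Any:
    """Return the nested value for dotted_key, or default if any level is missing."""
    node: Any = settings
    for name in dotted_key.split("."):
        if not isinstance(node, dict) or name not in node:
            return default
        node = node[name]
    return node


# Illustrative values copied from the configuration printed earlier
settings = {"generation": {"num_images": 50, "batch_size": 10}}
set_dotted(settings, "generation.num_images", 20)
```

A dotted key keeps the call site short while the settings themselves stay grouped by section, which matches how the notebook reads them (`cfg.generation[...]`, `cfg.prompts[...]`).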

### Cost Estimation

Review estimated costs before proceeding.

In [3]:
cost_estimate = cfg.estimate_cost()

print("Cost Estimation:")
print("=" * 80)
print(f"\nNumber of images: {cost_estimate['num_images']}")
print(f"Resolution: {cost_estimate['resolution']}")
print(f"\nEstimated Costs (USD):")
print(f"  Image generation: ${cost_estimate['image_generation']:.4f}")
print(f"  Captions: ${cost_estimate['captions']:.4f}")
print(f"  Labels: ${cost_estimate['labels']:.4f}")
print(f"  Comments: ${cost_estimate['comments']:.4f}")
print(f"\n  TOTAL: ${cost_estimate['total_estimated']:.4f}")
print("\n" + "=" * 80)
print("\nNote: Actual costs may vary. These are estimates based on typical usage.")
print("Free tier limits apply - start with small batches to avoid quota issues.")

Cost Estimation:

Number of images: 50
Resolution: 1K

Estimated Costs (USD):
  Image generation: $0.0500
  Captions: $0.0050
  Labels: $0.0050
  Comments: $0.0250

  TOTAL: $0.0850


Note: Actual costs may vary. These are estimates based on typical usage.
Free tier limits apply - start with small batches to avoid quota issues.
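
The estimate above is consistent with a flat per-image rate for each component. The sketch below reproduces that arithmetic. The rates are back-calculated from the printed output (for example, $0.0500 / 50 images) and are assumptions for illustration, not published Gemini API prices. `estimate_cost` here is a hypothetical stand-in, not the `cfg.estimate_cost()` implementation.

```python
# Per-image rates in USD, back-calculated from the estimate printed above.
# They are assumptions for illustration, not published Gemini API prices.
ASSUMED_RATES_USD = {
    "image_generation": 0.001,
    "captions": 0.0001,
    "labels": 0.0001,
    "comments": 0.0005,
}


def estimate_cost(num_images, rates=None):
    """Multiply each per-image rate by the image count and add a total."""
    rates = ASSUMED_RATES_USD if rates is None else rates
    estimate = {name: rate * num_images for name, rate in rates.items()}
    estimate["total_estimated"] = sum(estimate.values())
    return estimate


estimate = estimate_cost(50)
for name, usd in estimate.items():
    print(f"{name}: ${usd:.4f}")
```

Under these assumed rates the cost scales linearly with `num_images`, so halving the image count for a test run halves the estimate.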


## 2. Load Source Data

Load data from all three sources: Atropia, World Bank, and social media references.

In [4]:
print("Loading source data...\n")

data_dir = cfg.get_data_path('raw')

# Initialize data loaders
atropia_loader = data_loader.AtropiaDataLoader(data_dir=data_dir)
worldbank_loader = data_loader.WorldBankDataLoader(data_dir=data_dir)
socialmedia_loader = data_loader.SocialMediaDataLoader(data_dir=data_dir)

# Load data
atropia_data = atropia_loader.load_data()
print(f"Loaded {len(atropia_data)} Atropia samples")

worldbank_data = worldbank_loader.load_data()
print(f"Loaded {len(worldbank_data)} World Bank profiles")

socialmedia_data = socialmedia_loader.load_descriptions()
print(f"Loaded {len(socialmedia_data)} visual references")

# Initialize combiner
combiner = data_loader.DataCombiner(
    atropia_loader=atropia_loader,
    worldbank_loader=worldbank_loader,
    socialmedia_loader=socialmedia_loader
)

print("\nAll source data loaded and ready")

2026-01-22 15:38:13,099 - src.data_loader - INFO - Loading Atropia data from c:\Users\lwert\OneDrive - University of Arizona\Documents\Fellowships\Jetstream\AI-ML_PipelineWorkshop\DataCollection\data\raw\atropia_samples.json
2026-01-22 15:38:13,109 - src.data_loader - INFO - Loading World Bank data from c:\Users\lwert\OneDrive - University of Arizona\Documents\Fellowships\Jetstream\AI-ML_PipelineWorkshop\DataCollection\data\raw\worldbank_demographics.csv
2026-01-22 15:38:13,166 - src.data_loader - INFO - Loading visual descriptions from c:\Users\lwert\OneDrive - University of Arizona\Documents\Fellowships\Jetstream\AI-ML_PipelineWorkshop\DataCollection\data\raw\imagepath_labels_descp.json


Loading source data...

Loaded 100 Atropia samples
Loaded 50 World Bank profiles
Loaded 16567 visual references

All source data loaded and ready


## 3. Build Prompts

Generate prompts by combining data from all sources.

In [None]:
print("Building prompts...\n")

# Initialize prompt builder
builder = prompt_builder.PromptBuilder(
    style=cfg.prompts['style'],
    complexity=cfg.prompts['complexity'],
    include_temporal=cfg.prompts['include_temporal_context'],
    include_demographics=cfg.prompts['include_demographics'],
    themes=cfg.prompts['themes'],
    settings=cfg.prompts['settings']
)

# Generate combined data samples
num_images = cfg.generation['num_images']
combined_samples = combiner.sample_combined(n=num_images)

# Build prompts
prompts_data = builder.build_batch_prompts(combined_samples)

print(f"✓ Built {len(prompts_data)} prompts")
print(f"  Style: {cfg.prompts['style']}")
print(f"  Complexity: {cfg.prompts['complexity']}")

2026-01-22 15:38:20,452 - src.data_loader - INFO - Loading Atropia data from c:\Users\lwert\OneDrive - University of Arizona\Documents\Fellowships\Jetstream\AI-ML_PipelineWorkshop\DataCollection\data\raw\atropia_samples.json
2026-01-22 15:38:20,460 - src.data_loader - INFO - Loading World Bank data from c:\Users\lwert\OneDrive - University of Arizona\Documents\Fellowships\Jetstream\AI-ML_PipelineWorkshop\DataCollection\data\raw\worldbank_demographics.csv
2026-01-22 15:38:20,518 - src.data_loader - INFO - Loading visual descriptions from c:\Users\lwert\OneDrive - University of Arizona\Documents\Fellowships\Jetstream\AI-ML_PipelineWorkshop\DataCollection\data\raw\imagepath_labels_descp.json
2026-01-22 15:38:20,623 - src.data_loader - INFO - Loading Atropia data from c:\Users\lwert\OneDrive - University of Arizona\Documents\Fellowships\Jetstream\AI-ML_PipelineWorkshop\DataCollection\data\raw\atropia_samples.json


Building prompts...



2026-01-22 15:38:20,630 - src.data_loader - INFO - Loading World Bank data from c:\Users\lwert\OneDrive - University of Arizona\Documents\Fellowships\Jetstream\AI-ML_PipelineWorkshop\DataCollection\data\raw\worldbank_demographics.csv
2026-01-22 15:38:20,644 - src.data_loader - INFO - Loading visual descriptions from c:\Users\lwert\OneDrive - University of Arizona\Documents\Fellowships\Jetstream\AI-ML_PipelineWorkshop\DataCollection\data\raw\imagepath_labels_descp.json
2026-01-22 15:38:20,717 - src.data_loader - INFO - Loading Atropia data from c:\Users\lwert\OneDrive - University of Arizona\Documents\Fellowships\Jetstream\AI-ML_PipelineWorkshop\DataCollection\data\raw\atropia_samples.json
2026-01-22 15:38:20,717 - src.data_loader - INFO - Loading World Bank data from c:\Users\lwert\OneDrive - University of Arizona\Documents\Fellowships\Jetstream\AI-ML_PipelineWorkshop\DataCollection\data\raw\worldbank_demographics.csv
2026-01-22 15:38:20,732 - src.data_loader - INFO - Loading visual de

✓ Built 50 prompts
  Style: realistic
  Complexity: medium
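
The sample prompts previewed in the next section start with a style prefix, followed by "A scene depicting <theme> with <visual description>". The sketch below shows one way records from the three sources could be combined into a prompt of that shape. It is an illustration under assumptions: `combine_sample`, `build_prompt`, and the `description` field name are hypothetical, and the real `DataCombiner` and `PromptBuilder` logic is not shown in this notebook.

```python
import random

# Assumed style prefix, taken from the start of the "realistic" sample prompts
STYLE_PREFIXES = {"realistic": "photorealistic, high detail, natural lighting"}


def combine_sample(atropia, worldbank, visuals, rng):
    """Pick one record from each source and bundle them (hypothetical structure)."""
    return {
        "atropia": rng.choice(atropia),
        "demographics": rng.choice(worldbank),
        "visual": rng.choice(visuals),
    }


def build_prompt(sample, style="realistic"):
    """Join style prefix, theme, and visual description into a single prompt."""
    theme = sample["atropia"]["theme"].replace("_", " ")
    description = sample["visual"]["description"]
    return f"{STYLE_PREFIXES[style]}. A scene depicting {theme} with {description}"


# Tiny made-up records standing in for the three loaded datasets
atropia = [{"theme": "political_unrest"}]
worldbank = [{"age_group": "25-34", "occupation": "manufacturing"}]
visuals = [{"description": "a crowd holding handwritten signs"}]

rng = random.Random(0)  # seeded so the sampling is repeatable
sample = combine_sample(atropia, worldbank, visuals, rng)
prompt = build_prompt(sample)
```

Keeping the sampled source records next to each prompt, as `prompts_data` does with its `source_data` field, makes it possible to trace any generated image back to its inputs.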


### Preview Sample Prompts

Let's review a few prompts before generation.

In [6]:
print("Sample Prompts:")
print("=" * 80)

for i in range(min(3, len(prompts_data))):
    prompt_info = prompts_data[i]
    print(f"\nPrompt {i+1}:")
    print(f"  {prompt_info['prompt'][:200]}...")
    print(f"\n  Source - Theme: {prompt_info['source_data']['atropia']['theme']}")
    print(f"  Source - Demographics: Age {prompt_info['source_data']['demographics']['age_group']}, "
          f"{prompt_info['source_data']['demographics']['occupation']}")
    print("-" * 80)

Sample Prompts:

Prompt 1:
  photorealistic, high detail, natural lighting. A scene depicting political unrest with a digital image with a purple background featuring a white illustration of a megaphone and text in spanish. the t...

  Source - Theme: political_unrest
  Source - Demographics: Age 35-44, student
--------------------------------------------------------------------------------

Prompt 2:
  photorealistic, high detail, natural lighting. A scene depicting political unrest with a black square image with gold-colored text that reads 'tomás eduardo bravo gutiérrez' and '17 febrero 2021'.. Di...

  Source - Theme: political_unrest
  Source - Demographics: Age 25-34, manufacturing
--------------------------------------------------------------------------------

Prompt 3:
  photorealistic, high detail, natural lighting. A scene depicting economic conditions with a bar graph showing the number of victims in different countries, including colombia, mexico, guatemala, argen...

  So

## 4. Initialize Generation Pipeline

Set up API client, rate limiter, and output handler.

In [7]:
print("Initializing generation pipeline...\n")

# Initialize rate limiter
rate_limiter = gemini_client.RateLimiter(
    requests_per_minute=cfg.rate_limiting['requests_per_minute'],
    requests_per_day=cfg.rate_limiting['requests_per_day'],
    enable_backoff=cfg.rate_limiting['enable_exponential_backoff'],
    initial_delay=cfg.rate_limiting['initial_retry_delay'],
    backoff_multiplier=cfg.rate_limiting['backoff_multiplier'],
    max_retries=cfg.rate_limiting['max_retries']
)
print("✓ Rate limiter initialized")

# Initialize image generator
generator = gemini_client.GeminiImageGenerator(
    api_key=cfg.api_key,
    rate_limiter=rate_limiter,
    model=cfg.generation['model'],
    resolution=cfg.generation['resolution'],
    aspect_ratio=cfg.generation['aspect_ratio']
)
print(f"✓ Image generator initialized (model: {cfg.generation['model']})")

# Initialize output handler
output_dir = cfg.get_output_path()
handler = output_handler.OutputHandler(
    output_dir=output_dir,
    image_format=cfg.output['image_format'],
    metadata_format=cfg.output['metadata_format'],
    export_csv=cfg.output['export_csv_summaries'],
    date_organized=cfg.output['date_organized']
)
print(f"✓ Output handler initialized (output: {output_dir})")

# Initialize checkpoint manager
checkpoint_path = parent_dir / cfg.generation['checkpoint_file']
checkpoint_mgr = gemini_client.CheckpointManager(checkpoint_path)
print("✓ Checkpoint manager initialized")

print("\n✓ Pipeline ready to generate images")

2026-01-22 15:38:44,655 - src.gemini_client - INFO - Gemini API client initialized
2026-01-22 15:38:44,658 - src.gemini_client - INFO - Initialized gemini-2.5-pro-image for image generation
2026-01-22 15:38:44,676 - src.output_handler - INFO - Output directories created at c:\Users\lwert\OneDrive - University of Arizona\Documents\Fellowships\Jetstream\AI-ML_PipelineWorkshop\DataCollection\data\generated


Initializing generation pipeline...

✓ Rate limiter initialized
✓ Image generator initialized (model: gemini-2.5-pro-image)
✓ Output handler initialized (output: c:\Users\lwert\OneDrive - University of Arizona\Documents\Fellowships\Jetstream\AI-ML_PipelineWorkshop\DataCollection\data\generated)
✓ Checkpoint manager initialized

✓ Pipeline ready to generate images
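
The rate limiter is configured with a per-minute request limit, an initial retry delay, a backoff multiplier, and a maximum retry count. The following is a minimal sketch of exponential backoff with those parameters, not the `src.gemini_client.RateLimiter` implementation. `backoff_delays`, `call_with_backoff`, and `min_interval_seconds` are hypothetical names, and the default values (2.0 s initial delay, multiplier 2, 2 retries) are assumptions inferred from the "Retrying in 2.0s" and "Retrying in 4.0s" log lines later in this notebook.

```python
import time


def backoff_delays(initial_delay, multiplier, max_retries):
    """Return the wait before each retry: initial_delay * multiplier ** attempt."""
    return [initial_delay * multiplier ** attempt for attempt in range(max_retries)]


def call_with_backoff(func, initial_delay=2.0, multiplier=2.0, max_retries=2,
                      sleep=time.sleep):
    """Call func(); on failure wait and retry, re-raising after the last retry."""
    delays = backoff_delays(initial_delay, multiplier, max_retries)
    for attempt in range(max_retries + 1):
        try:
            return func()
        except Exception:
            if attempt == max_retries:
                raise  # retries exhausted, surface the last error
            sleep(delays[attempt])


def min_interval_seconds(requests_per_minute):
    """Smallest spacing between requests that stays within a per-minute limit."""
    return 60.0 / requests_per_minute


# Usage with a stand-in function that fails twice before succeeding
attempts = {"count": 0}
waits = []


def flaky_request():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise RuntimeError("temporary failure")
    return "ok"


# Passing waits.append as `sleep` records the delays instead of actually waiting
result = call_with_backoff(flaky_request, sleep=waits.append)
```

Backoff only helps with transient failures such as quota or network errors. A request that fails for a permanent reason will fail again after every wait.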


## 5. Check for Resume

Check whether a previous, interrupted generation run can be resumed.

In [8]:
# Check for existing checkpoint
checkpoint = checkpoint_mgr.load_checkpoint()

if checkpoint:
    print("Found existing checkpoint!")
    print(f"  Completed: {checkpoint.get('completed', 0)}/{checkpoint.get('total', 0)} images")
    print(f"  Last checkpoint: {checkpoint.get('timestamp', 'unknown')}")
    
    # Ask user if they want to resume
    print("\n⚠ Set RESUME = True in the next cell to continue, or False to start fresh")
else:
    print("No existing checkpoint found. Starting fresh generation.")

No existing checkpoint found. Starting fresh generation.


In [9]:
# Set this to True to resume from checkpoint, False to start fresh
RESUME = False

if RESUME and checkpoint:
    start_index = checkpoint.get('completed', 0)
    print(f"Resuming from image {start_index + 1}")
else:
    start_index = 0
    checkpoint_mgr.clear_checkpoint()
    print("Starting fresh generation")

Starting fresh generation
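
The resume logic relies on three checkpoint operations: load, save, and clear. The `CheckpointManager` source is not included in this notebook, so the following is only a sketch of how a JSON-file checkpoint could provide them. `JsonCheckpoint` and its method names are hypothetical. The dictionary keys mirror the ones this notebook reads and writes.

```python
import json
import tempfile
from pathlib import Path


class JsonCheckpoint:
    """Hypothetical stand-in for a checkpoint manager backed by one JSON file."""

    def __init__(self, path):
        self.path = Path(path)

    def save(self, data):
        # Write to a temporary file first, then swap it in, so an interrupted
        # write does not leave a half-written checkpoint behind.
        tmp_path = self.path.with_name(self.path.name + ".tmp")
        tmp_path.write_text(json.dumps(data, indent=2), encoding="utf-8")
        tmp_path.replace(self.path)

    def load(self):
        """Return the saved dict, or None when no checkpoint exists."""
        if not self.path.exists():
            return None
        return json.loads(self.path.read_text(encoding="utf-8"))

    def clear(self):
        self.path.unlink(missing_ok=True)  # missing_ok requires Python 3.8+


# Round trip in a throwaway directory
with tempfile.TemporaryDirectory() as tmp_dir:
    ckpt = JsonCheckpoint(Path(tmp_dir) / "checkpoint.json")
    before = ckpt.load()
    ckpt.save({"completed": 10, "total": 50, "last_index": 9})
    restored = ckpt.load()
    ckpt.clear()
    after_clear = ckpt.load()
```

Returning `None` for a missing file lets the caller use a plain `if checkpoint:` test, as the resume cell above does.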


## 6. Generate Images

Main generation loop with progress tracking and checkpoints.

In [10]:
print("\n" + "=" * 80)
print("STARTING IMAGE GENERATION")
print("=" * 80)

# Generation tracking
batch_size = cfg.generation['batch_size']
total_images = len(prompts_data)
generated_count = start_index
error_count = 0
start_time = time.time()

print(f"\nGenerating {total_images - start_index} images in batches of {batch_size}")
print(f"Started at: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")
print("\nProgress:")

# Main generation loop
for i in tqdm(range(start_index, total_images), initial=start_index, total=total_images):
    try:
        prompt_data = prompts_data[i]
        
        # Generate image
        result = generator.generate_image(prompt_data['prompt'])
        
        # Save image and metadata
        image_path = handler.save_image(
            image=result['image'],
            index=i + 1,
            prompt_data=prompt_data
        )
        
        generated_count += 1
        
        # Display latest image (every 5 images)
        if (i + 1) % 5 == 0:
            clear_output(wait=True)
            print(f"\nProgress: {generated_count}/{total_images} images")
            print(f"Latest image: {image_path.name}")
            display(result['image'].resize((256, 256)))  # Display smaller preview
        
        # Save checkpoint after each batch
        if (i + 1) % batch_size == 0:
            checkpoint_data = {
                'completed': generated_count,
                'total': total_images,
                'timestamp': datetime.now().isoformat(),
                'last_index': i
            }
            checkpoint_mgr.save_checkpoint(checkpoint_data)
            print(f"\n✓ Checkpoint saved: {generated_count}/{total_images} images")
        
    except Exception as e:
        error_count += 1
        print(f"\n✗ Error generating image {i+1}: {e}")
        
        # Continue with next image
        continue

# Final statistics
elapsed_time = time.time() - start_time
minutes = int(elapsed_time // 60)
seconds = int(elapsed_time % 60)

print("\n" + "=" * 80)
print("GENERATION COMPLETE")
print("=" * 80)
print(f"\nSuccessfully generated: {generated_count} images")
print(f"Errors encountered: {error_count}")
print(f"Total time: {minutes}m {seconds}s")
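# Average over images generated in this run; skip when none succeeded
images_this_run = generated_count - start_index
if images_this_run > 0:
    print(f"Average time per image: {elapsed_time / images_this_run:.1f}s")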

# Clear checkpoint on successful completion
if error_count == 0:
    checkpoint_mgr.clear_checkpoint()
    print("\n✓ Checkpoint cleared (generation completed successfully)")


STARTING IMAGE GENERATION

Generating 50 images in batches of 10
Started at: 2026-01-22 15:38:57

Progress:


  0%|          | 0/50 [00:00<?, ?it/s]

2026-01-22 15:38:58,064 - src.gemini_client - INFO - Generating image with prompt length: 724
2026-01-22 15:38:58,525 - src.gemini_client - INFO - Retrying in 2.0s
2026-01-22 15:39:00,618 - src.gemini_client - INFO - Retrying in 4.0s
2026-01-22 15:39:04,710 - src.gemini_client - ERROR - All retry attempts failed: 404 models/gemini-2.5-pro-image is not found for API version v1beta, or is not supported for generateContent. Call ListModels to see the list of available models and their supported methods.
2026-01-22 15:39:04,715 - src.gemini_client - INFO - Generating image with prompt length: 514
2026-01-22 15:39:04,828 - src.gemini_client - INFO - Retrying in 2.0s



✗ Error generating image 1: 404 models/gemini-2.5-pro-image is not found for API version v1beta, or is not supported for generateContent. Call ListModels to see the list of available models and their supported methods.


2026-01-22 15:39:06,911 - src.gemini_client - INFO - Retrying in 4.0s
2026-01-22 15:39:11,009 - src.gemini_client - ERROR - All retry attempts failed: 404 models/gemini-2.5-pro-image is not found for API version v1beta, or is not supported for generateContent. Call ListModels to see the list of available models and their supported methods.
2026-01-22 15:39:11,017 - src.gemini_client - INFO - Generating image with prompt length: 615
2026-01-22 15:39:11,133 - src.gemini_client - INFO - Retrying in 2.0s



✗ Error generating image 2: 404 models/gemini-2.5-pro-image is not found for API version v1beta, or is not supported for generateContent. Call ListModels to see the list of available models and their supported methods.


2026-01-22 15:39:13,215 - src.gemini_client - INFO - Retrying in 4.0s
2026-01-22 15:39:17,296 - src.gemini_client - ERROR - All retry attempts failed: 404 models/gemini-2.5-pro-image is not found for API version v1beta, or is not supported for generateContent. Call ListModels to see the list of available models and their supported methods.
2026-01-22 15:39:17,299 - src.gemini_client - INFO - Generating image with prompt length: 492
2026-01-22 15:39:17,409 - src.gemini_client - INFO - Retrying in 2.0s



✗ Error generating image 3: 404 models/gemini-2.5-pro-image is not found for API version v1beta, or is not supported for generateContent. Call ListModels to see the list of available models and their supported methods.


2026-01-22 15:39:19,498 - src.gemini_client - INFO - Retrying in 4.0s
2026-01-22 15:39:23,576 - src.gemini_client - ERROR - All retry attempts failed: 404 models/gemini-2.5-pro-image is not found for API version v1beta, or is not supported for generateContent. Call ListModels to see the list of available models and their supported methods.
2026-01-22 15:39:23,583 - src.gemini_client - INFO - Generating image with prompt length: 431
2026-01-22 15:39:23,701 - src.gemini_client - INFO - Retrying in 2.0s



✗ Error generating image 4: 404 models/gemini-2.5-pro-image is not found for API version v1beta, or is not supported for generateContent. Call ListModels to see the list of available models and their supported methods.


2026-01-22 15:39:25,791 - src.gemini_client - INFO - Retrying in 4.0s
2026-01-22 15:39:29,865 - src.gemini_client - ERROR - All retry attempts failed: 404 models/gemini-2.5-pro-image is not found for API version v1beta, or is not supported for generateContent. Call ListModels to see the list of available models and their supported methods.
2026-01-22 15:39:29,878 - src.gemini_client - INFO - Generating image with prompt length: 528
2026-01-22 15:39:29,982 - src.gemini_client - INFO - Retrying in 2.0s



✗ Error generating image 5: 404 models/gemini-2.5-pro-image is not found for API version v1beta, or is not supported for generateContent. Call ListModels to see the list of available models and their supported methods.


2026-01-22 15:39:32,075 - src.gemini_client - INFO - Retrying in 4.0s
2026-01-22 15:39:36,293 - src.gemini_client - ERROR - All retry attempts failed: 404 models/gemini-2.5-pro-image is not found for API version v1beta, or is not supported for generateContent. Call ListModels to see the list of available models and their supported methods.
2026-01-22 15:39:36,302 - src.gemini_client - INFO - Generating image with prompt length: 507
2026-01-22 15:39:36,416 - src.gemini_client - INFO - Retrying in 2.0s



✗ Error generating image 6: 404 models/gemini-2.5-pro-image is not found for API version v1beta, or is not supported for generateContent. Call ListModels to see the list of available models and their supported methods.


2026-01-22 15:39:38,499 - src.gemini_client - INFO - Retrying in 4.0s


KeyboardInterrupt: 
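
In the run above, every request fails with the same 404 for `gemini-2.5-pro-image`, yet the loop retries each prompt in turn until it is interrupted manually. One way to avoid this is to stop after several consecutive failures. The sketch below illustrates that idea. `generate_all` and `max_consecutive_errors` are hypothetical names and not part of the workshop code.

```python
def generate_all(prompts, generate, max_consecutive_errors=3):
    """Run generate(prompt) for each prompt, aborting on a streak of failures."""
    results, errors = [], []
    consecutive = 0
    for index, prompt in enumerate(prompts):
        try:
            results.append(generate(prompt))
            consecutive = 0  # any success resets the streak
        except Exception as exc:
            errors.append((index, str(exc)))
            consecutive += 1
            if consecutive >= max_consecutive_errors:
                # The same failure keeps repeating (for example an unknown
                # model name), so stop instead of retrying every prompt.
                break
    return results, errors


def always_404(prompt):
    raise RuntimeError("404 model not found")  # stand-in for the API error


results, errors = generate_all([f"prompt {n}" for n in range(50)], always_404)
print(f"Stopped after {len(errors)} consecutive errors; {len(results)} images generated")
```

A guard like this only limits wasted requests. The underlying fix is to set `generation.model` to a name the API reports as available; the error message itself points to `ListModels` for that list.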

## 7. Save Generation Log

Save complete generation log for record keeping.

In [None]:
# Save generation log
log_path = handler.save_generation_log()
print(f"✓ Generation log saved to: {log_path}")

# Get and display summary
summary = handler.get_summary()

print("\nGeneration Summary:")
print("=" * 80)
print(f"Output directory: {summary['output_directory']}")
print(f"Images saved: {summary['images_saved']}")
print(f"Image format: {summary['image_format']}")
print(f"\nDirectory structure:")
for name, path in summary['directories'].items():
    print(f"  {name}: {path}")

## 8. Preview Generated Images

Display a few random samples from the generated dataset.

In [None]:
import random
from PIL import Image
import matplotlib.pyplot as plt

# Get all generated images
image_files = list(handler.images_dir.glob(f"*.{cfg.output['image_format']}"))

if image_files:
    # Sample 6 random images
    sample_files = random.sample(image_files, min(6, len(image_files)))
    
    # Display in grid
    fig, axes = plt.subplots(2, 3, figsize=(15, 10))
    fig.suptitle('Sample Generated Images', fontsize=16)
    
    # Turn off every axis up front so unused grid cells stay blank
    # when fewer than 6 images are available
    for ax in axes.flat:
        ax.axis('off')

    for ax, image_file in zip(axes.flat, sample_files):
        img = Image.open(image_file)
        ax.imshow(img)
        ax.set_title(image_file.name)
    
    plt.tight_layout()
    plt.show()
else:
    print("No images found to display")

## Next Steps

Once the images have been generated successfully, continue with:

1. **Generate Metadata**: Run notebook `04_generate_metadata.ipynb` to create captions, labels, and comments
2. **Quality Assurance**: Run notebook `05_quality_assurance.ipynb` to validate the dataset

Your generated images are saved in: `data/generated/images/`