# Prepare Source Data

This notebook fetches and prepares data from three sources:
1. **Atropia Data**: Fictional country news samples for context
2. **World Bank Demographics**: Synthetic demographic profiles
3. **Social Media References**: Visual descriptions for image generation

**Workshop**: AI/ML Pipeline - Synthetic Data Generation  
**Date**: January 23, 2026  
**Platform**: CyVerse Jupyter Lab PyTorch GPU

## Setup

Import required modules and configure paths.

In [None]:
import sys
from pathlib import Path
import json
import pandas as pd

# Add parent directory to path to import src modules
parent_dir = Path.cwd().parent
if str(parent_dir) not in sys.path:
    sys.path.insert(0, str(parent_dir))

from src import config, data_loader

print("✓ Modules imported successfully")
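
The setup cell assumes the notebook is started from a folder directly below the project root. If that may not hold, one option is to walk up the directory tree until a folder containing `src` is found. The sketch below illustrates the idea; `find_project_root` and its `marker` parameter are hypothetical names and are not part of the workshop code.

```python
import sys
from pathlib import Path


def find_project_root(start, marker="src"):
    """Return the nearest directory at or above `start` that contains `marker`, or None."""
    start = Path(start).resolve()
    for candidate in (start, *start.parents):
        if (candidate / marker).is_dir():
            return candidate
    return None


# Hypothetical usage: add the detected root to sys.path so `from src import ...` can resolve
root = find_project_root(Path.cwd())
if root is None:
    print("No 'src' directory found at or above the current directory")
elif str(root) not in sys.path:
    sys.path.insert(0, str(root))
    print(f"Added {root} to sys.path")
```

Returning `None` instead of raising lets the caller decide whether a missing `src` folder is fatal.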

## Load Configuration

Load configuration settings for data fetching.

In [None]:
# Load configuration
cfg = config.load_config()

# Get source data configuration
source_config = cfg.source_data

print("Configuration loaded:")
print(f"  Atropia samples: {source_config['atropia']['num_samples']}")
print(f"  World Bank profiles: {source_config['worldbank']['num_profiles']}")
print("\n✓ Configuration ready")
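
The two lookups above imply that `cfg.source_data` behaves like a nested mapping with `atropia.num_samples` and `worldbank.num_profiles`. As a rough sketch, a check like the one below could catch a missing or invalid entry before any data is fetched. The example values and the helper `check_source_config` are assumptions for illustration; the real values come from the project's configuration.

```python
# Hypothetical stand-in for cfg.source_data; the numbers are illustrative only
example_source_config = {
    "atropia": {"num_samples": 50},
    "worldbank": {"num_profiles": 100},
}

# Section name -> key that the notebook reads from that section
REQUIRED_KEYS = {
    "atropia": "num_samples",
    "worldbank": "num_profiles",
}


def check_source_config(source_config):
    """Return a list of problems found in the source-data configuration."""
    problems = []
    for section, key in REQUIRED_KEYS.items():
        if section not in source_config:
            problems.append(f"missing section: {section}")
            continue
        value = source_config[section].get(key)
        # bool is a subclass of int, so it is rejected explicitly
        if not isinstance(value, int) or isinstance(value, bool) or value <= 0:
            problems.append(f"{section}.{key} should be a positive integer, got {value!r}")
    return problems


print(check_source_config(example_source_config) or "Configuration looks consistent")
```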

## 1. Fetch Atropia Data

Atropia is a fictional country used in U.S. military training scenarios. We'll fetch (or generate) news samples about political events, protests, and civil society activities.

In [None]:
# Initialize Atropia data loader
data_dir = cfg.get_data_path('raw')
atropia_loader = data_loader.AtropiaDataLoader(
    data_dir=data_dir,
    num_samples=source_config['atropia']['num_samples']
)

# Fetch data
print("Fetching Atropia data...")
atropia_data = atropia_loader.fetch_data()

print(f"\n✓ Loaded {len(atropia_data)} Atropia samples")
print(f"  Saved to: {data_dir / 'atropia_samples.json'}")

### Preview Atropia Data

Let's look at a few samples to understand the data structure.

In [None]:
# Display first 3 samples
print("Sample Atropia News:")
print("=" * 80)

for i, sample in enumerate(atropia_data[:3], 1):
    print(f"\n{i}. {sample['title']}")
    print(f"   Theme: {sample['theme']}")
    print(f"   Location: {sample['location']}")
    print(f"   Excerpt: {sample['excerpt']}")
    print("-" * 80)
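
The preview reads four fields from each sample: `title`, `theme`, `location`, and `excerpt`. A record missing any of them would raise a `KeyError` in the loop above. The following sketch shows one way to find such records up front; the helper name and the example records are hypothetical and only stand in for `atropia_data`.

```python
REQUIRED_FIELDS = ("title", "theme", "location", "excerpt")


def find_incomplete_samples(samples, required=REQUIRED_FIELDS):
    """Map sample index -> list of required fields that are missing or empty."""
    incomplete = {}
    for index, sample in enumerate(samples):
        missing = [field for field in required if not sample.get(field)]
        if missing:
            incomplete[index] = missing
    return incomplete


# Hypothetical records standing in for atropia_data
example_samples = [
    {"title": "Example headline", "theme": "protest",
     "location": "Example City", "excerpt": "Example text."},
    {"title": "Another headline", "theme": "", "location": "Example Town"},
]

print(find_incomplete_samples(example_samples))  # {1: ['theme', 'excerpt']}
```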

### Analyze Atropia Themes

Let's see the distribution of themes in our dataset.

In [None]:
import matplotlib.pyplot as plt
from collections import Counter

# Count themes
themes = [sample['theme'] for sample in atropia_data]
theme_counts = Counter(themes)

# Plot
plt.figure(figsize=(10, 6))
plt.bar(theme_counts.keys(), theme_counts.values())
plt.xlabel('Theme')
plt.ylabel('Count')
plt.title('Atropia Data: Theme Distribution')
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.show()

print("Theme distribution:")
for theme, count in theme_counts.most_common():
    print(f"  {theme}: {count}")
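
Raw counts are hard to compare across runs that use a different `num_samples`, so it can help to look at each theme's share of the dataset as well. A minimal sketch, using hypothetical theme labels in place of `atropia_data`:

```python
from collections import Counter


def theme_shares(samples):
    """Return {theme: fraction of samples}, ordered from most to least common."""
    counts = Counter(sample["theme"] for sample in samples)
    total = sum(counts.values())
    return {theme: count / total for theme, count in counts.most_common()}


# Hypothetical records; only the 'theme' key matters here
example_samples = [
    {"theme": "protest"},
    {"theme": "protest"},
    {"theme": "election"},
    {"theme": "civil society"},
]

for theme, share in theme_shares(example_samples).items():
    print(f"  {theme}: {share:.0%}")
```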

## 2. Fetch World Bank Demographics

Generate synthetic demographic profiles based on World Bank data patterns.

In [None]:
# Initialize World Bank data loader
worldbank_loader = data_loader.WorldBankDataLoader(
    data_dir=data_dir,
    num_profiles=source_config['worldbank']['num_profiles']
)

# Fetch data
print("Fetching World Bank demographics...")
worldbank_data = worldbank_loader.fetch_data()

print(f"\n✓ Loaded {len(worldbank_data)} demographic profiles")
print(f"  Saved to: {data_dir / 'worldbank_demographics.csv'}")
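
The internals of `WorldBankDataLoader` are not shown in this notebook. One common way to generate synthetic profiles that follow a known pattern is weighted random sampling per field, sketched below with the standard library. The categories and weights are hypothetical placeholders, not World Bank figures and not the loader's actual values.

```python
import random

# Hypothetical categories and weights, for illustration only
CATEGORY_WEIGHTS = {
    "age_group": {"18-29": 0.35, "30-44": 0.30, "45-59": 0.20, "60+": 0.15},
    "setting": {"urban": 0.6, "rural": 0.4},
    "household_size": {2: 0.20, 3: 0.25, 4: 0.30, 5: 0.25},
}


def sample_profiles(n, seed=None):
    """Draw n profiles, sampling each field independently by its weights."""
    rng = random.Random(seed)  # a seeded generator makes the draw reproducible
    profiles = []
    for _ in range(n):
        profile = {}
        for field, weights in CATEGORY_WEIGHTS.items():
            values = list(weights)
            field_weights = [weights[value] for value in values]
            profile[field] = rng.choices(values, weights=field_weights, k=1)[0]
        profiles.append(profile)
    return profiles


for profile in sample_profiles(5, seed=42):
    print(profile)
```

Sampling each field independently ignores correlations between fields (for example, between setting and household size); a loader that models real data patterns may need to handle those jointly.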

### Preview World Bank Demographics

In [None]:
# Display first few profiles
print("Sample Demographic Profiles:")
print(worldbank_data.head(10))

print("\nDataset Info:")
worldbank_data.info()  # info() prints its report directly and returns None

### Analyze Demographics Distribution

In [None]:
# Create subplots for different demographic categories
fig, axes = plt.subplots(2, 3, figsize=(15, 10))
fig.suptitle('World Bank Demographics: Distribution Analysis', fontsize=16)

# Age groups
worldbank_data['age_group'].value_counts().plot(kind='bar', ax=axes[0, 0])
axes[0, 0].set_title('Age Groups')
axes[0, 0].set_xlabel('')

# Occupations
worldbank_data['occupation'].value_counts().plot(kind='bar', ax=axes[0, 1])
axes[0, 1].set_title('Occupations')
axes[0, 1].set_xlabel('')
axes[0, 1].tick_params(axis='x', rotation=45)

# Education levels
worldbank_data['education_level'].value_counts().plot(kind='bar', ax=axes[0, 2])
axes[0, 2].set_title('Education Levels')
axes[0, 2].set_xlabel('')

# Settings
worldbank_data['setting'].value_counts().plot(kind='bar', ax=axes[1, 0])
axes[1, 0].set_title('Settings (Urban/Rural)')
axes[1, 0].set_xlabel('')

# Household sizes
worldbank_data['household_size'].value_counts().sort_index().plot(kind='bar', ax=axes[1, 1])
axes[1, 1].set_title('Household Sizes')
axes[1, 1].set_xlabel('')

# Hide last subplot
axes[1, 2].axis('off')

plt.tight_layout()
plt.show()

## 3. Load Social Media References

Load visual descriptions or reference images for prompt construction.

In [None]:
# Initialize social media data loader
socialmedia_loader = data_loader.SocialMediaDataLoader(data_dir=data_dir)

# Load descriptions
print("Loading social media visual references...")
socialmedia_data = socialmedia_loader.load_descriptions()

print(f"\n✓ Loaded {len(socialmedia_data)} visual references")

### Preview Visual References

In [None]:
# Display first few references
print("Sample Visual References:")
print("=" * 80)

for i, ref in enumerate(socialmedia_data[:5], 1):
    print(f"\n{i}. {ref['description']}")
    print(f"   Setting: {ref.get('setting', 'N/A')}")
    print(f"   Activity: {ref.get('activity_level', 'N/A')}")
    print("-" * 80)
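
The preview uses `.get(..., 'N/A')` for `setting` and `activity_level`, which suggests those fields may be absent from some references. To see how often that happens, a small coverage count can be useful. This is a sketch with hypothetical records standing in for `socialmedia_data`; the helper name is also an assumption.

```python
OPTIONAL_FIELDS = ("setting", "activity_level")


def optional_field_coverage(references, fields=OPTIONAL_FIELDS):
    """Count how many references provide a non-empty value for each optional field."""
    coverage = {field: 0 for field in fields}
    for ref in references:
        for field in fields:
            if ref.get(field):
                coverage[field] += 1
    return coverage


# Hypothetical records standing in for socialmedia_data
example_refs = [
    {"description": "Example street scene", "setting": "urban", "activity_level": "high"},
    {"description": "Example market scene", "setting": "urban"},
    {"description": "Example field scene"},
]

coverage = optional_field_coverage(example_refs)
for field, count in coverage.items():
    print(f"  {field}: {count}/{len(example_refs)} references")
```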

## 4. Combine Data Sources

Demonstrate how data from all three sources can be combined for prompt generation.

In [None]:
# Initialize data combiner
combiner = data_loader.DataCombiner(
    atropia_loader=atropia_loader,
    worldbank_loader=worldbank_loader,
    socialmedia_loader=socialmedia_loader
)

# Generate 5 sample combinations
print("Generating sample data combinations...")
combined_samples = combiner.sample_combined(n=5)

print(f"\n✓ Generated {len(combined_samples)} combined samples")
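
The implementation of `DataCombiner.sample_combined` lives in `src/data_loader` and is not shown here. As a rough mental model, combining the sources can be as simple as drawing one record from each source at random and bundling them under the keys used in the preview below (`atropia`, `demographics`, `visual_reference`). The function and toy records in this sketch are hypothetical.

```python
import random


def sample_combined(atropia, demographics, visual_refs, n, seed=None):
    """Draw n combinations, each pairing one random record from every source."""
    rng = random.Random(seed)
    return [
        {
            "atropia": rng.choice(atropia),
            "demographics": rng.choice(demographics),
            "visual_reference": rng.choice(visual_refs),
        }
        for _ in range(n)
    ]


# Hypothetical toy records for each source
toy_atropia = [{"theme": "protest", "location": "Example City"},
               {"theme": "election", "location": "Example Town"}]
toy_demographics = [{"age_group": "18-29", "occupation": "student"},
                    {"age_group": "30-44", "occupation": "teacher"}]
toy_visual_refs = [{"description": "Example street scene"},
                   {"description": "Example market scene"}]

combined = sample_combined(toy_atropia, toy_demographics, toy_visual_refs, n=5, seed=0)
print(f"Generated {len(combined)} combinations")
```

Independent draws with replacement mean the same record can appear in several combinations, which is usually acceptable when the goal is variety in downstream prompts.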

### Preview Combined Data

In [None]:
# Display combined samples
print("Sample Combined Data:")
print("=" * 80)

for i, sample in enumerate(combined_samples, 1):
    print(f"\nCombination {i}:")
    print(f"  Atropia Theme: {sample['atropia']['theme']}")
    print(f"  Atropia Location: {sample['atropia']['location']}")
    print(f"  Demographics: Age {sample['demographics']['age_group']}, {sample['demographics']['occupation']}")
    print(f"  Visual Reference: {sample['visual_reference']['description']}")
    print("-" * 80)
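
To make the link to prompt generation concrete, the sketch below turns one combined sample into a single text prompt. The template wording and the `build_prompt` helper are assumptions for illustration only; the prompt format used later in the workshop may differ.

```python
def build_prompt(sample):
    """Assemble a text prompt from one combined sample (hypothetical template)."""
    atropia = sample["atropia"]
    demographics = sample["demographics"]
    visual = sample["visual_reference"]
    return (
        f"{visual['description']}. "
        f"Scene: {atropia['theme']} in {atropia['location']}. "
        f"Subject: {demographics['occupation']}, age group {demographics['age_group']}."
    )


# Hypothetical combined sample with the same keys as the preview above
example_sample = {
    "atropia": {"theme": "protest", "location": "Example City"},
    "demographics": {"age_group": "18-29", "occupation": "student"},
    "visual_reference": {"description": "Example street scene"},
}

print(build_prompt(example_sample))
```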

## Summary

All source data has been prepared and is ready for image generation!

In [None]:
print("\n" + "=" * 80)
print("SOURCE DATA PREPARATION COMPLETE")
print("=" * 80)
print(f"\n✓ Atropia samples: {len(atropia_data)}")
print(f"✓ World Bank profiles: {len(worldbank_data)}")
print(f"✓ Visual references: {len(socialmedia_data)}")
print(f"\nAll data saved to: {data_dir}")
print("\nNext step: Run notebook 03_generate_images.ipynb to create synthetic images")