## Configuration

Set your parameters for the pipeline run.

In [None]:
# Pipeline Configuration
config = {
    "topic": "Films and TV Shows",                # Broad topic to explore
    "simulated_date": "2024-07-01",     # The "present" we're simulating (July 2024)
    "num_questions": 100,                # How many questions to generate
    "num_categories": 20,                # Number of main categories
    "buffer_days": 7,                   # Safety buffer (Question_Date = Resolution_Date - buffer_days)
    
    # Model Selection
    # Available models: llama70, alcf_oss120b, alcf_oss20b, gemma-3-27b-it
    "architect_model": "llama70",        # Model for taxonomy expansion
    "historian_model": "llama70",        # Model for question generation
    "critic_model": "llama70",           # Model for validation
    
    # Output files
    "output_file": "data/movie_questions.jsonl",
    "discarded_file": "data/discarded_movie_questions.jsonl",
    "checkpoint_file": "logs/completed_log.txt"
}

print("Configuration:")
print("="*60)
for key, value in config.items():
    print(f"{key:20s}: {value}")
print("="*60)

Configuration:
topic               : Politics
simulated_date      : 2024-07-01
num_questions       : 100
num_categories      : 20
buffer_days         : 7
architect_model     : llama70
historian_model     : llama70
critic_model        : llama70
output_file         : data/generated_questions.jsonl
discarded_file      : data/discarded_questions.jsonl
checkpoint_file     : logs/completed_log.txt
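
The `buffer_days` comment in the configuration describes the relation `Question_Date = Resolution_Date - buffer_days`. A minimal sketch of that date arithmetic follows; `question_date` is a hypothetical helper written for illustration and is not part of the pipeline's API.

```python
from datetime import date, timedelta


def question_date(resolution_date: str, buffer_days: int = 7) -> str:
    """Return the ISO date that lies `buffer_days` before `resolution_date`.

    Hypothetical helper: it only illustrates the relation stated in the
    config comment, not the pipeline's internal implementation.
    """
    resolved = date.fromisoformat(resolution_date)
    return (resolved - timedelta(days=buffer_days)).isoformat()


# A resolution date of 2024-07-21 with a 7-day buffer
print(question_date("2024-07-21", buffer_days=7))  # 2024-07-14
```

Working with `datetime.date` rather than string manipulation keeps month and year boundaries correct.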


## Initialize Pipeline

Import the pipeline and create an instance with your chosen models.

In [2]:
from forecast_kag.question_generation_pipeline import ForecastingQuestionPipeline

# Initialize pipeline
pipeline = ForecastingQuestionPipeline(
    architect_model=config['architect_model'],
    historian_model=config['historian_model'],
    critic_model=config['critic_model'],
    buffer_days=config['buffer_days'],
    output_file=config['output_file'],
    discarded_file=config['discarded_file'],
    checkpoint_file=config['checkpoint_file']
)

2025-11-30 14:41:01 - INFO - FORECASTING QUESTION GENERATION PIPELINE INITIALIZED
2025-11-30 14:41:01 - INFO - Architect Model: llama70
2025-11-30 14:41:01 - INFO - Historian Model: llama70
2025-11-30 14:41:01 - INFO - Critic Model: llama70
2025-11-30 14:41:01 - INFO - Safety Buffer: 7 days
2025-11-30 14:41:01 - INFO - Min Resolution Buffer: 30 days (ensures information is available)


## Run Pipeline

Execute the complete pipeline. This will:
1. Expand the topic into categories/subcategories
2. Generate questions for each subcategory
3. Validate each question independently
4. Save validated questions to JSONL

**Note**: This may take several minutes depending on `num_questions` and API response times.

In [None]:
# Run the pipeline
try:
    stats = pipeline.run(
        topic=config['topic'],
        simulated_date=config['simulated_date'],
        num_questions=config['num_questions'],
        num_categories=config['num_categories']
    )
except Exception as e:
    print(f"\n{'='*80}")
    print("ERROR: Pipeline execution failed")
    print(f"{'='*80}")
    print(f"Error: {e}")
    print("\nTroubleshooting:")
    print("1. Check that your model servers are running and accessible")
    print("2. Verify models/model_servers.yaml has correct endpoints")
    print("3. Try reducing num_questions to a smaller number (e.g., 3)")
    print("4. Check your internet connection for search functionality")
    print(f"{'='*80}")
    raise

2025-11-30 14:41:01 - INFO - 
2025-11-30 14:41:01 - INFO - PIPELINE EXECUTION STARTED
2025-11-30 14:41:01 - INFO - Topic: Politics
2025-11-30 14:41:01 - INFO - Simulated Date: 2024-07-01
2025-11-30 14:41:01 - INFO - Current Date: 2025-11-30
2025-11-30 14:41:01 - INFO - Target Questions: 100

2025-11-30 14:41:01 - INFO - [MODULE A: THE ARCHITECT] Taxonomy Expansion
2025-11-30 14:41:01 - INFO - Topic: Politics
2025-11-30 14:41:01 - INFO - Generating 20 categories with 5 subcategories each
2025-11-30 14:41:01 - INFO - Calling LLM for Taxonomy Expansion...
2025-11-30 14:41:30 - INFO - Taxonomy Expansion completed
2025-11-30 14:41:30 - INFO - Generated 19 categories
2025-11-30 14:41:30 - INFO -   - Elections: complexity=9, subcategories=5
2025-11-30 14:41:30 - INFO -   - Government Policies: complexity=8, subcategories=5
2025-11-30 14:41:30 - INFO -   - International Relations: complexity=9, subcategories=5
2025-11-30 14:41:30 - INFO -   - Political Scandals: complexity=7, subcategories=5

## View Statistics

Review the pipeline performance metrics.

In [4]:
print("\n" + "="*80)
print("PIPELINE STATISTICS")
print("="*80)
for key, value in stats.items():
    print(f"{key:30s}: {value}")
print("="*80)


PIPELINE STATISTICS
total_attempted               : 191
total_generated               : 134
total_validated               : 87
total_discarded               : 47
categories_processed          : 19
subcategories_processed       : 95
subcategories_skipped         : 0
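
The raw counts are easier to compare as rates. The sketch below derives two ratios from the `stats` dictionary; `summarize_rates` is a hypothetical helper, and it assumes that `total_generated` counts questions produced out of `total_attempted` and that `total_validated` counts those that passed the critic.

```python
def summarize_rates(stats: dict) -> dict:
    """Derive simple ratios from the pipeline counters (hypothetical helper)."""
    attempted = stats.get("total_attempted", 0)
    generated = stats.get("total_generated", 0)
    validated = stats.get("total_validated", 0)
    return {
        # Share of attempts that produced a question
        "generation_rate": generated / attempted if attempted else 0.0,
        # Share of generated questions that passed validation
        "validation_rate": validated / generated if generated else 0.0,
    }


# Counts copied from the statistics output above
example_stats = {
    "total_attempted": 191,
    "total_generated": 134,
    "total_validated": 87,
    "total_discarded": 47,
}

rates = summarize_rates(example_stats)
for name, value in rates.items():
    print(f"{name:30s}: {value:.1%}")
```

In this run the validated and discarded counts (87 and 47) add up to the 134 generated questions.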


## View Generated Questions

Load and display the validated questions from the JSONL file.

In [5]:
import json
import os

# Load generated questions
questions = []
if os.path.exists(config['output_file']):
    with open(config['output_file'], 'r', encoding='utf-8') as f:
        for line in f:
            if line.strip():  # skip blank lines, which json.loads would reject
                questions.append(json.loads(line))

print("\n" + "="*80)
print(f"GENERATED QUESTIONS ({len(questions)} total)")
print("="*80)

for i, q in enumerate(questions, 1):
    print(f"\n{i}. {q['question']}")
    print(f"   Category: {q['category']} → {q['subcategory']}")
    print(f"   Question Date: {q['t_ask']}")
    print(f"   Resolution Date: {q['t_resolve']}")
    print(f"   Answer: {q['answer']}")
    print(f"   Complexity: {q['meta']['complexity_score']}/10")
    # Print the event line only when a description is present
    event = q['meta'].get('event_description')
    if event:
        print(f"   Event: {event[:100]}...")

print("\n" + "="*80)


GENERATED QUESTIONS (87 total)

1. Will Joe Biden announce his candidacy for the 2024 US Presidential Election before December 31, 2024?
   Category: Elections → US Presidential Election Results
   Question Date: 2024-07-14
   Resolution Date: 2024-07-21
   Answer: No
   Complexity: 9/10
   Event: Joe Biden announced his withdrawal from the 2024 presidential election...

2. Will Gavin Newsom announce his candidacy for California Governor before October 31, 2024?
   Category: Elections → State Governor Elections
   Question Date: 2024-10-24
   Resolution Date: 2024-10-31
   Answer: No
   Complexity: 9/10
   Event: Gavin Newsom's term as governor ends and he is not running for re-election due to term limits...

3. Will Eric Adams win the New York City mayoral election before November 4, 2024?
   Category: Elections → Local Mayoral Elections
   Question Date: 2024-10-28
   Resolution Date: 2024-11-04
   Answer: No
   Complexity: 9/10
   Event: Eric Adams dropped out of the New York mayor

## View Discarded Questions

Review questions that failed validation (useful for debugging and understanding failure modes).

In [6]:
# Load discarded questions
discarded = []
if os.path.exists(config['discarded_file']):
    with open(config['discarded_file'], 'r', encoding='utf-8') as f:
        for line in f:
            if line.strip():  # skip blank lines, which json.loads would reject
                discarded.append(json.loads(line))

print("\n" + "="*80)
print(f"DISCARDED QUESTIONS ({len(discarded)} total)")
print("="*80)

for i, d in enumerate(discarded[:5], 1):  # Show first 5
    print(f"\n{i}. {d['question_data']['question']}")
    print(f"   Reason: {d['reason']}")
    print(f"   Historian Answer: {d['question_data'].get('answer', 'N/A')}")
    print(f"   Critic Answer: {d['validation_data'].get('answer', 'N/A')}")

if len(discarded) > 5:
    print(f"\n... and {len(discarded) - 5} more discarded questions")

print("\n" + "="*80)


DISCARDED QUESTIONS (47 total)

1. Will the European People's Party win the most seats in the European Parliament before June 1, 2025?
   Reason: Answer mismatch between Historian and Critic
   Historian Answer: No
   Critic Answer: Yes

2. Will Joe Biden announce his candidacy for the 2024 United States presidential election before October 31, 2024?
   Reason: Answer mismatch between Historian and Critic
   Historian Answer: No
   Critic Answer: Yes

3. Will the Biden Administration announce a new federal budget proposal for 2026 before March 15, 2025?
   Reason: Answer mismatch between Historian and Critic
   Historian Answer: Yes
   Critic Answer: No

4. Will the European Commission announce a new trade agreement with the United Kingdom before December 31, 2024?
   Reason: Answer mismatch between Historian and Critic
   Historian Answer: No
   Critic Answer: Yes

5. Will NATO announce an expansion to include North Macedonia and Finland before January 31, 2025?
   Reason: Answer mis
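
To see which failure modes dominate, the discarded records can be tallied by their `reason` field. This is a minimal sketch: `count_discard_reasons` is a hypothetical helper, and the `sample` list is made-up stand-in data that mirrors only the `reason` key shown above. In the notebook you would pass the `discarded` list instead.

```python
from collections import Counter


def count_discard_reasons(records):
    """Count discarded records per `reason` value (hypothetical helper)."""
    return Counter(record.get("reason", "unknown") for record in records)


# Stand-in records for illustration only; not real pipeline output
sample = [
    {"reason": "Answer mismatch between Historian and Critic"},
    {"reason": "Answer mismatch between Historian and Critic"},
    {"reason": "some other reason (hypothetical)"},
]

reason_counts = count_discard_reasons(sample)
for reason, count in reason_counts.most_common():
    print(f"{count:4d}  {reason}")
```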

## Example: Data Schema

Here's what a single question looks like in the JSONL file:

In [7]:
if questions:
    print("\nExample Question Record:")
    print("="*80)
    print(json.dumps(questions[0], indent=2))
    print("="*80)


Example Question Record:
{
  "id": "cc06d066-3231-48ed-930a-218dff5057d9",
  "topic": "Politics",
  "category": "Elections",
  "subcategory": "US Presidential Election Results",
  "question": "Will Joe Biden announce his candidacy for the 2024 US Presidential Election before December 31, 2024?",
  "t_ask": "2024-07-14",
  "t_resolve": "2024-07-21",
  "answer": "No",
  "meta": {
    "complexity_score": 9,
    "validation_confidence": "High",
    "event_description": "Joe Biden announced his withdrawal from the 2024 presidential election"
  }
}


## Understanding the Data Schema

Each question in `generated_questions.jsonl` has this structure:

```json
{
  "id": "uuid_v4_string",           // Unique identifier
  "topic": "Technology",            // Main topic
  "category": "AI",                 // Category from taxonomy
  "subcategory": "LLM Releases",    // Specific subcategory
  "question": "Will...",            // Binary Yes/No question
  "t_ask": "2024-03-15",            // Question date (simulated past)
  "t_resolve": "2024-03-22",        // Resolution date (actual event)
  "answer": "Yes",                  // Ground truth answer
  "meta": {
    "complexity_score": 8,          // Topic complexity (1-10)
    "validation_confidence": "High", // Critic's confidence
    "event_description": "..."      // What happened
  }
}
```
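
A loaded record can be checked against this structure before it is used downstream. The sketch below is a hypothetical validator, not part of the pipeline: the field names and types are taken from the schema above, and the `Yes`/`No` check assumes answers are always one of those two strings.

```python
REQUIRED_FIELDS = {
    "id": str, "topic": str, "category": str, "subcategory": str,
    "question": str, "t_ask": str, "t_resolve": str, "answer": str,
    "meta": dict,
}
REQUIRED_META = ("complexity_score", "validation_confidence", "event_description")


def schema_problems(record: dict) -> list:
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    for field, expected in REQUIRED_FIELDS.items():
        if field not in record:
            problems.append(f"missing field: {field}")
        elif not isinstance(record[field], expected):
            problems.append(f"{field} should be {expected.__name__}")
    meta = record.get("meta")
    if isinstance(meta, dict):
        for key in REQUIRED_META:
            if key not in meta:
                problems.append(f"missing meta field: {key}")
    # Assumption: answers are binary "Yes"/"No" strings
    if "answer" in record and record["answer"] not in ("Yes", "No"):
        problems.append("answer should be 'Yes' or 'No'")
    return problems


# The example record printed earlier in this notebook
example = {
    "id": "cc06d066-3231-48ed-930a-218dff5057d9",
    "topic": "Politics",
    "category": "Elections",
    "subcategory": "US Presidential Election Results",
    "question": "Will Joe Biden announce his candidacy for the 2024 US "
                "Presidential Election before December 31, 2024?",
    "t_ask": "2024-07-14",
    "t_resolve": "2024-07-21",
    "answer": "No",
    "meta": {
        "complexity_score": 9,
        "validation_confidence": "High",
        "event_description": "Joe Biden announced his withdrawal from the "
                             "2024 presidential election",
    },
}
print(schema_problems(example))  # []
```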

### Key Properties:

- **No News Context Saved**: Each record keeps the answer and a short `event_description`, not the news context used to produce it (prevents model bias)
- **Time Safety**: `t_ask = t_resolve - 7 days` (configurable buffer)
- **Append-Only**: JSONL format allows safe concurrent writes
- **Crash-Proof**: Each write is flushed with `os.fsync()`
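
The append-and-flush pattern behind the last two properties can be sketched as follows. `append_record` is a hypothetical helper that shows one common way to combine append mode with `os.fsync()`; the pipeline's actual writer may differ. The demo writes to a temporary directory rather than to the configured output file.

```python
import json
import os
import tempfile


def append_record(path: str, record: dict) -> None:
    """Append one JSON record as a single line and force it to disk."""
    directory = os.path.dirname(path)
    if directory:
        os.makedirs(directory, exist_ok=True)
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")
        f.flush()              # push Python's buffer to the OS
        os.fsync(f.fileno())   # ask the OS to write it to storage


# Demo in a throwaway directory
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "data", "demo.jsonl")
    append_record(demo_path, {"question": "Will...", "answer": "Yes"})
    append_record(demo_path, {"question": "Will...", "answer": "No"})
    with open(demo_path, encoding="utf-8") as f:
        lines = [json.loads(line) for line in f]

print(len(lines))
```

Calling `flush()` before `os.fsync()` matters: `fsync` only covers data that has already left Python's own buffer.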

## Resuming from Checkpoint

If the pipeline crashes or is interrupted, it will automatically resume from where it left off.

The `completed_log.txt` file tracks processed subcategories:

```
Technology|Artificial Intelligence|LLM Releases
Technology|Artificial Intelligence|AI Startups
Technology|Cybersecurity|Data Breaches
...
```

Simply re-run the pipeline with the same configuration, and it will skip completed subcategories.
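
The skip logic can be pictured with a short sketch. This is an assumption about how such a checkpoint might be consumed, not the pipeline's actual code: `load_completed` and `pending_subcategories` are hypothetical helpers, and the `taxonomy` dictionary contains made-up subcategories ("AI Regulation", "Ransomware") added only to show what remains unprocessed.

```python
def load_completed(log_lines):
    """Return the set of 'topic|category|subcategory' keys already processed."""
    return {line.strip() for line in log_lines if line.strip()}


def pending_subcategories(topic, taxonomy, completed):
    """Yield (category, subcategory) pairs whose key is not in `completed`."""
    for category, subcategories in taxonomy.items():
        for subcategory in subcategories:
            key = f"{topic}|{category}|{subcategory}"
            if key not in completed:
                yield category, subcategory


# Lines in the same format as completed_log.txt
log_lines = [
    "Technology|Artificial Intelligence|LLM Releases\n",
    "Technology|Artificial Intelligence|AI Startups\n",
    "Technology|Cybersecurity|Data Breaches\n",
]

# Hypothetical taxonomy; the last entry in each list is invented for the demo
taxonomy = {
    "Artificial Intelligence": ["LLM Releases", "AI Startups", "AI Regulation"],
    "Cybersecurity": ["Data Breaches", "Ransomware"],
}

todo = list(pending_subcategories("Technology", taxonomy, load_completed(log_lines)))
print(todo)
```

Keeping the completed keys in a set makes each membership check constant time, which matters little at 95 subcategories but keeps the approach simple for larger taxonomies.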

In [8]:
# View checkpoint file
if os.path.exists(config['checkpoint_file']):
    with open(config['checkpoint_file'], 'r') as f:
        completed = f.readlines()
    
    print("\n" + "="*80)
    print(f"COMPLETED SUBCATEGORIES ({len(completed)} total)")
    print("="*80)
    for line in completed[:10]:  # Show first 10
        print(f"  ✓ {line.strip()}")
    
    if len(completed) > 10:
        print(f"  ... and {len(completed) - 10} more")
    print("="*80)
else:
    print("No checkpoint file found.")


COMPLETED SUBCATEGORIES (95 total)
  ✓ Politics|Elections|US Presidential Election Results
  ✓ Politics|Elections|European Parliamentary Election Outcomes
  ✓ Politics|Elections|National Leadership Elections
  ✓ Politics|Elections|State Governor Elections
  ✓ Politics|Elections|Local Mayoral Elections
  ✓ Politics|Government Policies|US Federal Budget Approvals
  ✓ Politics|Government Policies|European Union Trade Agreements
  ✓ Politics|Government Policies|National Healthcare Reform Initiatives
  ✓ Politics|Government Policies|Environmental Protection Legislation
  ✓ Politics|Government Policies|Tax Reform Bills
  ... and 85 more
