# GDELT Dataset Generation

This notebook demonstrates how to use GDELT (Global Database of Events, Language, and Tone) as a data source for generating forecasting questions. GDELT exposes a very large database of global news articles through BigQuery, which gives broader coverage than a standard news search.

In [None]:
%pip install lightningrod-ai

from IPython.display import clear_output
clear_output()

In [None]:
import os
from datetime import datetime
from lightningrod import (
    LightningRod,
    GdeltSeedGenerator,
    QuestionGenerator,
    QuestionPipeline,
    WebSearchLabeler,
    FilterCriteria,
    AnswerType,
    AnswerTypeEnum,
)
from lightningrod._generated.models import QuestionRenderer

# Read the key from the environment; replace the placeholder if the variable is not set.
api_key = os.getenv("LIGHTNINGROD_API_KEY", "your-api-key-here")

client = LightningRod(api_key=api_key)

## Configure GDELT Seed Generator

GDELT provides access to a much larger corpus of news articles than standard Google News search. It's particularly useful for:
- Global events and international news
- Historical analysis
- Large-scale dataset generation

The `GdeltSeedGenerator` queries BigQuery to fetch articles, so each interval can supply thousands of articles. The configuration below covers January 2025 in 7-day intervals with `articles_per_interval=1000`.

In [None]:
gdelt_seed_generator = GdeltSeedGenerator(
    start_date=datetime(2025, 1, 1),
    end_date=datetime(2025, 1, 31),
    interval_duration_days=7,
    articles_per_interval=1000,
)
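
To get a feel for how much data this configuration can pull, the sketch below splits the date range into consecutive windows and computes an upper bound on the number of fetched articles. It assumes that windows start at `start_date`, that the last window is clipped to `end_date`, and that `articles_per_interval` acts as a per-window cap. The library's actual windowing may differ, and `split_intervals` is a hypothetical helper, not part of `lightningrod`.

```python
from datetime import datetime, timedelta


def split_intervals(start, end, interval_days):
    """Split the range from start to end into windows of at most interval_days days.

    Hypothetical helper: it mirrors the assumed windowing, not the library's code.
    """
    if interval_days <= 0:
        raise ValueError("interval_days must be positive")
    windows = []
    cursor = start
    step = timedelta(days=interval_days)
    while cursor < end:
        window_end = min(cursor + step, end)  # clip the final window to end
        windows.append((cursor, window_end))
        cursor = window_end
    return windows


windows = split_intervals(datetime(2025, 1, 1), datetime(2025, 1, 31), 7)
# articles_per_interval=1000 is treated as a per-window cap (assumption).
max_articles = len(windows) * 1000

for window_start, window_end in windows:
    print(f"{window_start:%Y-%m-%d} -> {window_end:%Y-%m-%d}")
print(f"{len(windows)} windows, at most {max_articles} articles")
```

Under these assumptions the January range yields four full 7-day windows plus a two-day remainder.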

## When to Use GDELT vs Google News

**Use GDELT when:**
- You need access to a very large number of articles
- You're analyzing global or international events
- You need historical data
- You want broader coverage across many sources

**Use Google News when:**
- You need recent, curated news articles
- You want more control over search queries
- You're working with smaller, focused datasets
- You need faster iteration on specific topics

In [None]:
answer_type = AnswerType(answer_type=AnswerTypeEnum.BINARY)

question_generator = QuestionGenerator(
    instructions=(
        "Generate forward-looking questions about global events and international news. "
        "Questions should focus on future outcomes that can be verified."
    ),
    examples=[
        "Will the conflict in region X escalate in the next month?",
        "Will country Y sign the trade agreement this quarter?",
        "Will the international summit achieve its stated goals?",
    ],
    bad_examples=[
        "What happened in the conflict?",
        "When was the trade agreement signed?",
        "Who attended the summit?",
    ],
    filter_=FilterCriteria(
        rubric="The question should be forward-looking and about future global events",
        min_score=0.7
    ),
    answer_type=answer_type,
)
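
The `FilterCriteria` above pairs a rubric with `min_score=0.7`. As a rough mental model, assuming each candidate question receives a rubric score between 0 and 1 and that the threshold is inclusive, filtering reduces to the comparison sketched here. The scores are made up for illustration and `passes_filter` is a hypothetical helper; this notebook does not show how the service actually scores questions.

```python
def passes_filter(score, min_score=0.7):
    """Return True when a rubric score meets the threshold (inclusive, by assumption)."""
    return score >= min_score


# Illustrative scores only; real scores would come from the service.
candidates = [
    ("Will country Y sign the trade agreement this quarter?", 0.92),
    ("Who attended the summit?", 0.15),
    ("Will the international summit achieve its stated goals?", 0.70),
]

kept = [question for question, score in candidates if passes_filter(score)]
for question in kept:
    print(question)
```

Raising `min_score` in this model trades dataset size for stricter adherence to the rubric.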

In [None]:
labeler = WebSearchLabeler(
    answer_type=answer_type,
    confidence_threshold=0.5,
)

renderer = QuestionRenderer(
    answer_type=answer_type,
)

## Run the Pipeline

The pipeline works the same way as with Google News; GDELT is just a different data source.

In [None]:
pipeline_config = QuestionPipeline(
    seed_generator=gdelt_seed_generator,
    question_generator=question_generator,
    labeler=labeler,
    renderer=renderer,
)

dataset = client.transforms.run(pipeline_config, max_questions=100)

In [None]:
print(f"Generated dataset with {dataset.num_rows} samples from GDELT\n")

samples = dataset.to_samples()
for i, sample in enumerate(samples[:5]):
    print(f"Sample {i+1}:")
    if sample.question:
        print(f"  Question: {sample.question.question_text}")
    if sample.label:
        print(f"  Answer: {sample.label.label}")
        print(f"  Confidence: {sample.label.label_confidence:.2f}")
    print()
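
Beyond printing a few samples, it can help to check how the labels are distributed. The sketch below does this on stand-in objects that mimic only the attributes used above (`label.label` and `label.label_confidence`). `FakeLabel`, `FakeSample`, and `summarize_labels` are hypothetical, the label values are invented, and the real sample type may differ.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional


@dataclass
class FakeLabel:
    """Stand-in for a label; only the two attributes printed above are modeled."""
    label: str
    label_confidence: float


@dataclass
class FakeSample:
    """Stand-in for a dataset sample (hypothetical)."""
    label: Optional[FakeLabel] = None


def summarize_labels(samples, min_confidence=0.5):
    """Count labels among samples whose confidence is at least min_confidence."""
    counts = Counter()
    for sample in samples:
        if sample.label is None:
            continue  # unlabeled samples are skipped
        if sample.label.label_confidence >= min_confidence:
            counts[sample.label.label] += 1
    return counts


demo_samples = [
    FakeSample(FakeLabel("yes", 0.9)),
    FakeSample(FakeLabel("no", 0.8)),
    FakeSample(FakeLabel("no", 0.3)),  # below the confidence cutoff
    FakeSample(),                      # no label at all
]
label_counts = summarize_labels(demo_samples)
print(dict(label_counts))
```

With real data, passing `dataset.to_samples()` to `summarize_labels` should behave the same way, provided the samples expose those two attributes.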