# Top Aggregated News as Data Source

This notebook demonstrates how to use an aggregated dataset of top news and events (based on [GDELT](https://www.gdeltproject.org/)) as a data source for generating forecasting questions. The dataset gives access to a very large database of global news articles, with broader coverage than standard news search.

In [1]:
%pip install -e ..
%pip install python-dotenv

from IPython.display import clear_output
clear_output()

import os
from dotenv import load_dotenv
from lightningrod import LightningRod

load_dotenv()

api_key = os.getenv("LIGHTNINGROD_API_KEY")
base_url = os.getenv("LIGHTNINGROD_BASE_URL", "https://api.lightningrod.ai/api/public/v1")

if not api_key:
    raise ValueError("LIGHTNINGROD_API_KEY is not set")

# Note: base_url param can be omitted
client = LightningRod(api_key=api_key, base_url=base_url)

## Configure GDELT Seed Generator


The `GdeltSeedGenerator` fetches articles at intervals defined by `interval_duration_days`. It does not fetch articles for every day unless you set `interval_duration_days=1`; instead, it steps forward by the specified interval between batches.
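To make the stepping behavior concrete, here is a minimal sketch of how interval start dates could be derived for the January 2025 range used in this notebook. The helper `interval_starts` is hypothetical and not part of the `lightningrod` package; it also assumes that `end_date` is inclusive and that a trailing partial interval still counts, neither of which is confirmed here.

```python
from datetime import datetime, timedelta


def interval_starts(start_date, end_date, interval_duration_days):
    """Yield the start of each interval, stepping forward by the interval length.

    Hypothetical helper: the real generator's boundary handling may differ.
    """
    step = timedelta(days=interval_duration_days)
    current = start_date
    while current <= end_date:  # assumption: end_date is inclusive
        yield current
        current += step


starts = list(interval_starts(datetime(2025, 1, 1), datetime(2025, 1, 31), 7))
for start in starts:
    print(start.date())
```

Under these assumptions a 7-day interval over January yields five interval starts, so `articles_per_interval=1000` would cap the run at roughly 5,000 seed articles rather than 1,000 per day.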

In [2]:
from datetime import datetime
from lightningrod import GdeltSeedGenerator, AnswerType, AnswerTypeEnum, QuestionGenerator, FilterCriteria, WebSearchLabeler, QuestionRenderer, QuestionPipeline

gdelt_seed_generator = GdeltSeedGenerator(
    start_date=datetime(2025, 1, 1),
    end_date=datetime(2025, 1, 31),
    interval_duration_days=7,
    articles_per_interval=1000,
)

answer_type = AnswerType(answer_type=AnswerTypeEnum.BINARY)

question_generator = QuestionGenerator(
    instructions=(
        "Generate forward-looking questions about global events and international news. "
        "Questions should focus on future outcomes that can be verified."
    ),
    examples=[
        "Will the conflict in region X escalate in the next month?",
        "Will country Y sign the trade agreement this quarter?",
        "Will the international summit achieve its stated goals?",
    ],
    bad_examples=[
        "What happened in the conflict?",
        "When was the trade agreement signed?",
        "Who attended the summit?",
    ],
    filter_=FilterCriteria(
        rubric="The question should be forward-looking and about future global events",
        min_score=0.7
    ),
    answer_type=answer_type,
)

# Labeler automatically finds answers to questions using web search
labeler = WebSearchLabeler(
    answer_type=answer_type,
    confidence_threshold=0.5,
)

# Renderer formats the question output
renderer = QuestionRenderer(
    answer_type=answer_type,
)

pipeline_config = QuestionPipeline(
    seed_generator=gdelt_seed_generator,
    question_generator=question_generator,
    labeler=labeler,
    renderer=renderer,
)

## Run the Pipeline

The pipeline runs the same way as it does with Google News; GDELT is just a different data source.

> Note: The run can take a few minutes to complete.

In [3]:
dataset = client.transforms.run(pipeline_config, max_questions=10) # keep max questions low when testing

Exception: Transform job 2b66d89d-9ac3-4251-bd5c-025ec920ff89 failed (error: None)
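Transform jobs can fail, as the output above shows. One possible way to handle this is a small retry wrapper, sketched below. `run_with_retries` and `flaky_job` are hypothetical names introduced for illustration, and the stand-in job replaces the real API call so the sketch runs without an API key. Whether retrying helps depends on the cause of the failure, which the error message above does not state.

```python
import time


def run_with_retries(run_job, max_attempts=3, delay_seconds=0.0):
    """Call run_job() until it succeeds or max_attempts is reached."""
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            return run_job()
        except Exception as error:  # the failure above surfaced as a plain Exception
            last_error = error
            print(f"Attempt {attempt} failed: {error}")
            if attempt < max_attempts:
                time.sleep(delay_seconds)
    raise last_error


# Stand-in for the real call: fails once, then succeeds.
calls = {"count": 0}


def flaky_job():
    calls["count"] += 1
    if calls["count"] < 2:
        raise Exception("Transform job failed (error: None)")
    return "dataset"


result = run_with_retries(flaky_job, max_attempts=3)

# With the real client, the call from the cell above could be wrapped like this:
# dataset = run_with_retries(
#     lambda: client.transforms.run(pipeline_config, max_questions=10)
# )
```

Keeping `max_attempts` small avoids repeatedly launching jobs that each take a few minutes.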

In [None]:
%pip install pandas

from IPython.display import clear_output
clear_output()

In [None]:
import pandas as pd

# Download samples to memory
samples = dataset.download()
print(f"Generated {dataset.num_rows} samples\n")

# Convert cached samples to a list of dictionaries
rows = dataset.flattened()

df = pd.DataFrame(rows)
df

Generated 1 samples

                              question.question_text label.label  \
0  Will Donald Trump be sentenced in the hush mon...           1   

   label.label_confidence label.resolution_date  \
0                     1.0   2025-01-10T00:00:00   

                                     label.reasoning  \
0  Donald Trump was sentenced in the hush money c...   

                                label.answer_sources  \
0  https://vertexaisearch.cloud.google.com/ground...   

                                              prompt  \
0  QUESTION:\nWill Donald Trump be sentenced in t...   

                                      seed.seed_text  \
0  Title: ABC News – Breaking News, Latest News a...   

                                            seed.url seed.seed_creation_date  \
0  https://abcnews.go.com/Politics/wireStory/trum...     2025-01-10T00:00:00   

   is_valid                        meta.sample_id  \
0      True  548dd97c-b733-4f8c-9f72-c754e7be4d35   
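Once the samples are in a DataFrame, a common next step is to keep only valid, confidently labeled questions. The sketch below uses hypothetical rows that reuse the column names from the output above (`label.label_confidence`, `is_valid`); the threshold of 0.9 and the column dtypes are assumptions chosen for illustration, not values taken from the library.

```python
import pandas as pd

# Hypothetical rows that reuse the column names from the output above.
sample_df = pd.DataFrame(
    {
        "question.question_text": ["Question A", "Question B", "Question C"],
        "label.label": [1, 0, 1],
        "label.label_confidence": [1.0, 0.6, 0.95],
        "is_valid": [True, True, False],
    }
)

min_confidence = 0.9  # illustrative threshold, not a library default

# Keep rows that are marked valid and meet the confidence threshold.
high_confidence = sample_df[
    sample_df["is_valid"] & (sample_df["label.label_confidence"] >= min_confidence)
]
print(high_confidence[["question.question_text", "label.label"]])
```

The same boolean mask could be applied to `df` from the cell above, provided its columns have the numeric and boolean dtypes assumed here.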

                 

## When to Use Top Aggregated News vs News Search

**Use `GdeltSeedGenerator` when:**
- You need access to a very large number of articles
- You're analyzing global or international events
- You need historical data
- You want broader coverage across many sources

**Use `NewsSeedGenerator` when:**
- You need recent, curated news articles
- You want more control over search queries
- You're working with smaller, focused datasets
- You need faster iteration on specific topics