# News Data Sources for Forecasting Questions

This notebook demonstrates two approaches to using news articles as data sources for generating forecasting questions:

1. **News Search** (`NewsSeedGenerator`) - Uses Google News to search for recent articles matching specific queries
2. **Top Aggregated News** (`GdeltSeedGenerator`) - Uses [GDELT](https://www.gdeltproject.org/) to access a massive database of global news articles

Both approaches follow the same pipeline pattern but offer different trade-offs in coverage, control, and use cases.

## Set up the client

Sign up at [dashboard.lightningrod.ai](https://dashboard.lightningrod.ai/?redirect=/api) to get your API key and **$50 of free credits**.

In Google Colab, go to the Secrets section (🔑 icon in left sidebar) and add a secret named `LIGHTNINGROD_API_KEY` with your API key.

In [None]:
%pip install lightningrod-ai

from IPython.display import clear_output
clear_output()

from google.colab import userdata
from lightningrod import LightningRod

api_key = userdata.get("LIGHTNINGROD_API_KEY")
lr = LightningRod(api_key=api_key)

## Build the pipeline

First, define the configuration shared by both pipelines: a binary answer type, a labeler, and a renderer. Together they turn news sources into labeled binary questions.

In [2]:
from lightningrod import AnswerType, AnswerTypeEnum, WebSearchLabeler, QuestionRenderer

answer_type = AnswerType(answer_type=AnswerTypeEnum.BINARY)

labeler = WebSearchLabeler(
    answer_type=answer_type,
    confidence_threshold=0.5,
)

renderer = QuestionRenderer(
    answer_type=answer_type,
)

### Approach 1: News Search (Google News)

The `NewsSeedGenerator` searches Google News for articles matching your query. You can specify date ranges, search queries, and how many articles to fetch per interval.

In [None]:
from datetime import datetime
from lightningrod import NewsSeedGenerator, QuestionGenerator, QuestionPipeline

news_seed_generator = NewsSeedGenerator(
    start_date=datetime(2025, 1, 1),
    end_date=datetime(2025, 1, 31),
    interval_duration_days=7,  # Split date range into intervals of this many days
    articles_per_search=20,  # Maximum number of articles to fetch per search query per interval
    search_query="AI technology announcements",
)

question_generator = QuestionGenerator(
    instructions=(
        "Generate forward-looking questions about AI technology announcements. "
        "Questions should be about future events or outcomes that can be verified later."
    ),
    examples=[
        "Will OpenAI release a new model in Q2 2025?",
        "Will Google announce a new AI product this month?",
        "Will Apple integrate AI features into iOS 19?",
    ],
    bad_examples=[
        "What did OpenAI announce?",
        "Who is the CEO of Google?",
        "When was ChatGPT released?",
    ],
    answer_type=answer_type,
)


news_search_pipeline_config = QuestionPipeline(
    seed_generator=news_seed_generator,
    question_generator=question_generator,
    labeler=labeler,
    renderer=renderer,
)

### Approach 2: Top Aggregated News (GDELT)

The `GdeltSeedGenerator` draws on GDELT's massive database of global news articles. It splits the date range into intervals of `interval_duration_days` and fetches one batch of up to `articles_per_interval` articles per interval. It does not fetch articles for every day unless you set `interval_duration_days=1`; instead, it steps forward by the specified interval between batches.
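
To make the interval stepping concrete, here is a minimal sketch of how a date range might be split into fixed-size intervals. The helper `split_into_intervals` is hypothetical and not part of the `lightningrod` package; how the library treats interval boundaries, or a shorter final interval, is an assumption.

```python
from datetime import datetime, timedelta


def split_into_intervals(start, end, interval_days):
    """Yield (interval_start, interval_end) pairs that step through [start, end].

    Hypothetical helper for illustration only; the library's own
    boundary handling may differ.
    """
    step = timedelta(days=interval_days)
    cursor = start
    while cursor < end:
        # Clip the last interval so it never runs past the end date.
        yield cursor, min(cursor + step, end)
        cursor += step


intervals = list(
    split_into_intervals(datetime(2025, 1, 1), datetime(2025, 1, 31), 7)
)
for interval_start, interval_end in intervals:
    print(interval_start.date(), "->", interval_end.date())

# Upper bound on fetched articles if each interval returns its maximum.
max_articles = len(intervals) * 1000
print("intervals:", len(intervals), "max articles:", max_articles)
```

Under these assumptions, the January 2025 range with a 7-day interval yields five batches (the last one shorter), so the `articles_per_interval=1000` setting in the cell below caps the run at roughly 5,000 seed articles.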

In [None]:

from lightningrod import GdeltSeedGenerator, FilterCriteria

gdelt_seed_generator = GdeltSeedGenerator(
    start_date=datetime(2025, 1, 1),
    end_date=datetime(2025, 1, 31),
    interval_duration_days=7,  # Split date range into intervals of this many days
    articles_per_interval=1000,  # Maximum number of articles to fetch per interval (e.g., per 7-day period)
)

gdelt_question_generator = QuestionGenerator(
    instructions=(
        "Generate forward-looking questions about global events and international news. "
        "Questions should focus on future outcomes that can be verified."
    ),
    examples=[
        "Will the conflict in region X escalate in the next month?",
        "Will country Y sign the trade agreement this quarter?",
        "Will the international summit achieve its stated goals?",
    ],
    bad_examples=[
        "What happened in the conflict?",
        "When was the trade agreement signed?",
        "Who attended the summit?",
    ],
    filter_=FilterCriteria(
        rubric="The question should be forward-looking and about future global events",
        min_score=0.7
    ),
    answer_type=answer_type,
)

gdelt_pipeline_config = QuestionPipeline(
    seed_generator=gdelt_seed_generator,
    question_generator=gdelt_question_generator,
    labeler=labeler,
    renderer=renderer,
)

## Run the Pipelines

You can run either pipeline configuration. Both run the same way; only the data source differs.

In [5]:
search_dataset = lr.transforms.run(news_search_pipeline_config, max_questions=10)

gdelt_dataset = lr.transforms.run(gdelt_pipeline_config, max_questions=10)

> Note: This can take a few minutes to complete processing.

## View Results

Inspect the generated questions and answers. Each sample contains `seed`, `question`, `label`, `prompt`, and optional `context` and `meta` fields.
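
The flattened outputs in this section use dotted column names such as `question.question_text` and `label.label`. The sketch below shows one way nested fields can map to such names. The `flatten` helper and the sample values are hypothetical, and whether `flattened()` works this way internally is an assumption; `pandas.json_normalize` produces similar dotted names for nested dictionaries.

```python
def flatten(nested, prefix=""):
    """Recursively flatten a nested dict into dotted keys (illustrative helper)."""
    flat = {}
    for key, value in nested.items():
        name = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, name))
        else:
            flat[name] = value
    return flat


# Hypothetical sample; the shape is assumed from the field names listed above.
sample = {
    "seed": {
        "seed_text": "Title: Example article",
        "url": "https://example.com/article",
    },
    "question": {"question_text": "Will X happen by 2025-06-30?"},
    "label": {"label": "1", "label_confidence": 0.9},
    "prompt": "QUESTION:\nWill X happen by 2025-06-30?",
}

row = flatten(sample)
for column in sorted(row):
    print(column)
```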

In [6]:
%pip install pandas

from IPython.display import clear_output
clear_output()

In [7]:
import pandas as pd

news_samples = search_dataset.download()
print(f"Generated {search_dataset.num_rows} samples\n")

news_rows = search_dataset.flattened()
news_df = pd.DataFrame(news_rows)
news_df

Generated 10 samples



Unnamed: 0,question.question_text,label.label,label.label_confidence,label.reasoning,label.answer_sources,seed.seed_text,seed.url,seed.seed_creation_date,seed.search_query,is_valid,meta.sample_id,meta.filter_reason,meta.parent_sample_id,meta.processing_time_ms,label.resolution_date,prompt
0,"Will Arteria AI's research arm, Arteria Café, ...",Undetermined,0.7,"Arteria AI launched its research arm, Arteria ...",https://vertexaisearch.cloud.google.com/ground...,Title: Arteria AI Launches New Research Arm to...,https://www.businesswire.com/news/home/2025012...,2025-01-29T00:00:00,AI technology announcements,False,5c7456df-1990-4dbf-8554-da1cd05f8dc6,Undetermined label,5e931acc-90e4-42a6-9097-93917c3ab5ec,19261.285,,
1,Will Alibaba's newest AI model maintain its le...,0,0.9,DeepSeek-V3 was initially launched in December...,https://vertexaisearch.cloud.google.com/ground...,Title: reuters.com\n\nURL Source: https://www....,https://www.reuters.com/technology/artificial-...,2025-01-29T00:00:00,AI technology announcements,True,19c51d3f-5b4e-4bb9-b844-57e28618fd3d,,3cac9f1d-f3c0-4fa0-899f-be8539ca4e6a,23904.055,2025-06-30T00:00:00,QUESTION:\nWill Alibaba's newest AI model main...
2,Will an independent research team or AI lab re...,Undetermined,1.0,The question asks about a report being release...,,Title: China's DeepSeek faces questions over c...,https://www.aljazeera.com/news/2025/1/29/ai-ga...,2025-01-29T00:00:00,AI technology announcements,False,feaa1ed2-d1bb-44ae-bd87-9715e95affcb,Undetermined label,08a48049-6917-414f-9836-860c3c9debe3,4331.688,,
3,Will the NYCEDC Startup Internship Program pla...,1,1.0,The NYCEDC Startup Internship Program will exp...,https://vertexaisearch.cloud.google.com/ground...,"Title: Mayor Adams, NYCEDC Release First-Of-It...",https://www.nyc.gov/mayors-office/news/2025/01...,2025-01-31T00:00:00,AI technology announcements,True,7c9adaf3-e599-47a7-ae4a-9b9f64adecfe,,deed4613-946a-429a-bdf0-4f6c78e5c6a1,7687.722,2025-01-31T00:00:00,QUESTION:\nWill the NYCEDC Startup Internship ...
4,Will the Asian Journal of Law and Society rece...,Undetermined,1.0,The Asian Journal of Law and Society's special...,https://vertexaisearch.cloud.google.com/ground...,Title: Special Issue on AI Sovereignty and Int...,https://www.cambridge.org/core/journals/asian-...,2025-01-31T00:00:00,AI technology announcements,False,5692869a-246d-4dc8-8faa-28476540221e,Undetermined label,f2b679f1-f6e9-43d4-aa25-7f952d52c82c,9505.701,2025-10-15T00:00:00,
5,Will Microsoft and Pearson release a new co-br...,1,0.9,Microsoft and Pearson announced a strategic co...,https://vertexaisearch.cloud.google.com/ground...,Title: Advancing education to prepare for an A...,https://www.microsoft.com/en-us/education/blog...,2025-01-30T00:00:00,AI technology announcements,True,b49cf391-2c0c-40d7-bd88-58f4a232ee2c,,17340c20-72b1-49b7-b47d-1fee13619e13,8358.511,2025-01-14T00:00:00,QUESTION:\nWill Microsoft and Pearson release ...
6,Will NVIDIA's quarterly revenue exceed $35 bil...,Undetermined,1.0,The question asks whether NVIDIA's quarterly r...,https://vertexaisearch.cloud.google.com/ground...,Title: Access to this page has been denied.\n\...,https://www.investors.com/news/technology/ai-s...,2025-01-30T00:00:00,AI technology announcements,False,635ac95b-5020-42d4-bb4e-9fa6d25aa94e,Undetermined label,8f0e0d68-7588-4ac8-ad8d-0f833764eea7,23168.364,,
7,Will Alibaba's Qwen2.5-Max model maintain a hi...,Undetermined,1.0,The question asks for a definitive ranking of ...,,Title: Alibaba claims its AI model trounces De...,https://www.livescience.com/technology/artific...,2025-01-29T00:00:00,AI technology announcements,False,77efa48f-ff92-499d-8496-8b65c4d3b45e,Undetermined label,3aef6a90-f851-4ddd-aa92-2badb5d93e55,17836.828,,
8,Will Microsoft invest at least $80 billion in ...,1,1.0,"Microsoft's President, Brad Smith, announced o...",https://vertexaisearch.cloud.google.com/ground...,Title: Tech Giants Have Been Spending Big on A...,https://www.investopedia.com/will-msft-meta-am...,2025-01-29T00:00:00,AI technology announcements,True,c4afc8c7-1d45-4568-9d3a-a33479e767f4,,a91bee96-4257-48c5-ad08-5ec071230d1d,10196.047,2025-01-06T00:00:00,QUESTION:\nWill Microsoft invest at least $80 ...
9,Will Oracle release additional AI agent featur...,1,1.0,Oracle made several announcements and releases...,https://vertexaisearch.cloud.google.com/ground...,Title: Oracle debuts new AI agents as artifici...,https://finance.yahoo.com/news/oracle-debuts-n...,2025-01-30T00:00:00,AI technology announcements,True,cdc620ec-e28e-4078-a451-37451b337ebd,,d84c0f40-5f5f-4bee-9915-52c861f36c3d,13227.272,2025-10-15T00:00:00,QUESTION:\nWill Oracle release additional AI a...


In [8]:
gdelt_samples = gdelt_dataset.download()
print(f"Generated {gdelt_dataset.num_rows} samples\n")

gdelt_rows = gdelt_dataset.flattened()
gdelt_df = pd.DataFrame(gdelt_rows)
gdelt_df

Generated 6 samples



Unnamed: 0,question.question_text,label.label,label.label_confidence,label.resolution_date,label.reasoning,label.answer_sources,prompt,seed.seed_text,seed.url,seed.seed_creation_date,is_valid,meta.sample_id,meta.filter_score,meta.parent_sample_id,meta.processing_time_ms,meta.filter_reason
0,Will Melania Trump announce a formal public po...,0,1.0,2018-05-07T00:00:00,"Melania Trump officially launched her ""Be Best...",https://vertexaisearch.cloud.google.com/ground...,QUESTION:\nWill Melania Trump announce a forma...,Title: What is the role of the First Lady in A...,https://economictimes.indiatimes.com/news/inte...,2025-01-20T00:00:00,True,d966cf0f-7996-4f5c-84f8-ce1a1528580a,1.0,65e0068d-071d-4c8d-9078-4eefc6f0d494,8614.052,
1,Will Donald Trump's inaugural address on Janua...,Undetermined,1.0,,The question asks about the specific content o...,,,"Title: Sundar Pichai, Elon Musk, Jeff Bezos, M...",https://economictimes.indiatimes.com/news/inte...,2025-01-20T00:00:00,False,5bab6bd5-e33b-462e-9a95-88ecc39b2cc9,1.0,ae6ccc15-56c3-4c07-b1a4-8a2e73edd1ec,5344.261,Undetermined label
2,Will the U.S. Department of Justice under Dona...,Undetermined,1.0,,The question asks about a definitive future ac...,https://vertexaisearch.cloud.google.com/ground...,,Title: Who has Donald Trump threatened to pros...,https://economictimes.indiatimes.com/news/inte...,2025-01-20T00:00:00,False,dfbb10ff-9c2f-464b-a112-001851fdb1fa,1.0,b974d8a0-3a3d-40c6-a976-f626c407bf7c,15665.574,Undetermined label
3,Will Donald Trump sign an executive order rega...,1,1.0,2025-01-20T00:00:00,Donald Trump won the 2024 US Presidential Elec...,https://vertexaisearch.cloud.google.com/ground...,QUESTION:\nWill Donald Trump sign an executive...,Title: Trump returns to White House with power...,https://economictimes.indiatimes.com/news/inte...,2025-01-20T00:00:00,True,f4e756d6-751a-44c9-9978-ff0d2db93718,1.0,98e85a64-2ed5-4915-b24f-ee65b3067c8d,20281.909,
4,Will Donald Trump revoke at least two dozen ex...,0,1.0,2025-01-20T00:00:00,The question asks whether Donald Trump would r...,,QUESTION:\nWill Donald Trump revoke at least t...,"Title: 'We love you, America': Joe Biden share...",https://economictimes.indiatimes.com/news/inte...,2025-01-20T00:00:00,True,d2988a9e-065f-47f9-bc24-a30b4ebd45ac,1.0,4c14f6e3-794d-4968-bc94-3897d3773765,18996.587,
5,Will Amazon release a documentary or series fe...,1,1.0,2025-01-07T00:00:00,Amazon MGM Studios has purchased the rights to...,https://vertexaisearch.cloud.google.com/ground...,QUESTION:\nWill Amazon release a documentary o...,Title: How Melania Trump is preparing for life...,https://economictimes.indiatimes.com/news/inte...,2025-01-20T00:00:00,True,9caf8a4c-abfe-4405-8192-01dc1340bb74,1.0,8a474804-9d3c-4097-b7b6-289751da6782,9827.678,
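
In the outputs above, rows labeled `Undetermined` have `is_valid` set to `False`, and some valid rows carry a `label.resolution_date` earlier than their `seed.seed_creation_date` (for example, GDELT row 0 resolves on 2018-05-07 against a seed dated 2025-01-20). A possible post-processing step, sketched here with pandas on toy rows, keeps only valid samples that resolve on or after the seed date. The column names come from the outputs above; the values, the boolean dtype of `is_valid`, and the filtering rule itself are assumptions for illustration.

```python
import pandas as pd

# Toy rows that reuse column names from the outputs above.
# The values are illustrative, not real pipeline results.
df = pd.DataFrame(
    {
        "question.question_text": [
            "Will A happen?",
            "Will B happen?",
            "Will C happen?",
        ],
        "label.label": ["1", "Undetermined", "0"],
        "is_valid": [True, False, True],
        "seed.seed_creation_date": [
            "2025-01-29T00:00:00",
            "2025-01-29T00:00:00",
            "2025-01-20T00:00:00",
        ],
        "label.resolution_date": [
            "2025-06-30T00:00:00",
            None,
            "2018-05-07T00:00:00",
        ],
    }
)

# Step 1: drop samples the pipeline marked invalid (e.g. Undetermined labels).
valid = df[df["is_valid"]].copy()

# Step 2: keep samples whose resolution date is not before the seed date.
seed_date = pd.to_datetime(valid["seed.seed_creation_date"])
resolution_date = pd.to_datetime(valid["label.resolution_date"])
forward_looking = valid[resolution_date >= seed_date]

print(len(df), len(valid), len(forward_looking))
```

The same two steps could be applied to `news_df` or `gdelt_df`, provided their columns have the dtypes assumed here.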


## When to Use Each Approach

**Use `NewsSeedGenerator` (Google News) when:**
- You need recent, curated news articles
- You want more control over search queries
- You're working with smaller, focused datasets
- You need faster iteration on specific topics
- You want to target specific keywords or themes

**Use `GdeltSeedGenerator` (Top Aggregated News) when:**
- You need access to a very large number of articles
- You're analyzing global or international events
- You need historical data
- You want broader coverage across many sources
- You're working with large-scale datasets