# Quick Start Guide

Generate forecasting datasets from real-world data sources in minutes.

## What is Lightning Rod?

Lightning Rod transforms raw data (news articles, documents, reports) into high-quality forecasting training datasets. Instead of manually labeling data, you define data sources and let the system automatically:

1. **Collect seeds** - Gather raw data from news, documents, or custom sources
2. **Generate questions** - Create forecasting questions using AI
3. **Label questions** - Find ground truth answers automatically
4. **Export datasets** - Get ready-to-use datasets for model training

## Use Cases

- **Train forecasting models** - Generate training data for predicting future events
- **Domain-specific datasets** - Create custom datasets from your documents or industry news
- **Research & experimentation** - Quickly prototype with different data sources and question types
- **Continuous data generation** - Set up pipelines that generate fresh datasets regularly

## Core Concepts

**Seeds** - Raw data sources (news articles, documents, etc.) that serve as the foundation for question generation.

**Questions** - Forecasting questions generated from seeds. Can be binary (Yes/No), continuous (numeric), multiple choice, or free response.

**Labels** - Ground truth answers to questions, found automatically via web search or other methods. Include confidence scores and resolution dates.

**Samples** - Complete data units containing seed, question, label, formatted prompt, and optional context/metadata.

**Datasets** - Collections of samples stored efficiently. Can be downloaded, used as pipeline inputs, or exported for training.
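
To make the relationship between these concepts concrete, here is a minimal sketch of a sample modelled with plain dataclasses. The `Label` and `Sample` classes below are hypothetical stand-ins written for illustration, not the SDK's own types; the field names follow the sample fields and flattened columns shown later in this guide, and the actual SDK objects may be shaped differently.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class Label:
    """Hypothetical stand-in for a ground-truth answer (not an SDK class)."""
    label: str                              # e.g. "1", "0", or "Undetermined"
    label_confidence: float                 # confidence score between 0.0 and 1.0
    reasoning: str = ""                     # why the labeler chose this answer
    resolution_date: Optional[str] = None   # may be missing for unresolved questions


@dataclass
class Sample:
    """Hypothetical stand-in for one complete data unit (not an SDK class)."""
    seed: Dict[str, Any]                    # raw source data the question came from
    question_text: str                      # the generated forecasting question
    label: Label                            # ground-truth answer
    prompt: str                             # formatted prompt for model input
    context: List[str] = field(default_factory=list)      # optional extra context
    meta: Dict[str, Any] = field(default_factory=dict)    # optional custom metadata


sample = Sample(
    seed={"url": "https://example.com/article", "text": "Example article text."},
    question_text="Will the example product ship before the end of the quarter?",
    label=Label(label="1", label_confidence=0.9, reasoning="Illustrative only."),
    prompt="Question: Will the example product ship before the end of the quarter?",
)
print(sample.label.label_confidence)
```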

> **Tip:** See [API.md](../API.md) for complete API reference and alternative configurations.

## Step 1: Install SDK

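```python
# Install the SDK in editable mode from the repository root (notebook magic)
%pip install -e ../

# Hide the verbose installer output
from IPython.display import clear_output
clear_output()
```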

## Step 2: Import and Initialize Client

The `LightningRod` client is your main entry point to the SDK. It provides access to:

- **`transforms`** - Run data generation pipelines
- **`datasets`** - Access and download datasets
- **`files`** - Upload files for use as data sources
- **`filesets`** - Manage collections of files

Set your API key as an environment variable `LIGHTNINGROD_API_KEY` or pass it directly to the constructor.

```python
import os
from lightningrod import LightningRod

# Read the API key from the environment and fail early if it is missing
api_key = os.getenv("LIGHTNINGROD_API_KEY")
if not api_key:
    raise ValueError("LIGHTNINGROD_API_KEY is not set")

client = LightningRod(api_key=api_key)
```

## Step 3: Configure Seed Generator

Seed generators collect raw data that will be transformed into questions. 

**`NewsSeedGenerator`** searches Google News for articles matching your query. It:
- Searches within a date range (`start_date` to `end_date`)
- Divides the range into intervals (`interval_duration_days`)
- Fetches articles per search query
- Optionally filters by source domain or applies content filters
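
The interval splitting described above can be pictured with a small standard-library sketch. It assumes the date range is cut into contiguous windows of `interval_duration_days` days, with the last window truncated at `end_date`; the helper `split_into_intervals` is hypothetical, and the SDK may handle boundaries differently.

```python
from datetime import datetime, timedelta
from typing import Iterator, Tuple


def split_into_intervals(
    start: datetime, end: datetime, interval_days: int
) -> Iterator[Tuple[datetime, datetime]]:
    """Yield contiguous (window_start, window_end) pairs covering [start, end)."""
    if interval_days <= 0:
        raise ValueError("interval_days must be positive")
    step = timedelta(days=interval_days)
    cursor = start
    while cursor < end:
        # Truncate the final window so it never runs past the end date
        upper = min(cursor + step, end)
        yield cursor, upper
        cursor = upper


windows = list(split_into_intervals(datetime(2026, 1, 1), datetime(2026, 1, 31), 7))
for window_start, window_end in windows:
    print(window_start.date(), "->", window_end.date())
```

Under this assumption, a 30-day range with 7-day intervals yields four full windows and one shorter final window, each of which would be searched separately.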

**Alternative seed generators:**
- `GdeltSeedGenerator` - Large-scale datasets from GDELT global news database
- `FileSetSeedGenerator` - Chunk documents from uploaded files
- `FileSetQuerySeedGenerator` - RAG queries against your document collection

See examples 02-04 for different data sources.

```python
from datetime import datetime, timedelta
from lightningrod import NewsSeedGenerator

# Search the last 30 days of news, one 7-day interval at a time
end_date = datetime.now()
start_date = end_date - timedelta(days=30)

seed_generator = NewsSeedGenerator(
    start_date=start_date,
    end_date=end_date,
    interval_duration_days=7,
    search_query="technology announcements",
)
```

## Step 4: Configure Question Generator

Question generators transform seeds into forecasting questions using AI.

**`QuestionGenerator`** creates questions from seeds with:
- Custom instructions for question style and format
- Example questions to guide generation
- Bad examples to avoid
- Answer type specification (binary, continuous, multiple choice, free response)

**Answer Types:**
- **`BINARY`** - Yes/No questions (this example)
- **`CONTINUOUS`** - Numeric predictions (e.g., "What will the price be?")
- **`MULTIPLE_CHOICE`** - Questions with predefined options
- **`FREE_RESPONSE`** - Open-ended text answers

See examples 05-08 for different answer types.

```python
from lightningrod import QuestionGenerator, AnswerType, AnswerTypeEnum

# Binary (Yes/No) questions; the same answer type is reused by the labeler and renderer
answer_type = AnswerType(answer_type=AnswerTypeEnum.BINARY)

question_generator = QuestionGenerator(
    instructions="Generate forward-looking questions about technology announcements.",
    answer_type=answer_type,
)
```

## Step 5: Configure Labeler and Renderer

**`WebSearchLabeler`** automatically finds ground truth answers by:
- Searching the web for information about each question
- Extracting answers based on the answer type
- Assigning confidence scores (0.0 to 1.0)
- Providing reasoning and source URLs

**`QuestionRenderer`** formats questions into prompts ready for model input:
- Combines question text, context, and answer instructions
- Creates the `prompt` field in each sample
- Supports custom templates or uses intelligent defaults

Both components use the `answer_type` to ensure proper formatting and labeling.

```python
from lightningrod import WebSearchLabeler, QuestionRenderer

# Both components share the answer type defined in Step 4
labeler = WebSearchLabeler(answer_type=answer_type)
renderer = QuestionRenderer(answer_type=answer_type)
```
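
As a rough picture of what rendering involves, the sketch below combines question text, optional context, and answer instructions into a single prompt string. The template, the instruction wording, and the `render_prompt` helper are all assumptions made for illustration; they are not the defaults used by `QuestionRenderer`.

```python
from typing import Sequence

# Hypothetical template and instructions, not the SDK's defaults
DEFAULT_TEMPLATE = "{context_block}Question: {question}\n{answer_instructions}"

ANSWER_INSTRUCTIONS = {
    "binary": "Answer with Yes or No.",
    "continuous": "Answer with a single number.",
    "multiple_choice": "Answer with one of the listed options.",
    "free_response": "Answer in one or two sentences.",
}


def render_prompt(
    question: str,
    context: Sequence[str] = (),
    answer_kind: str = "binary",
    template: str = DEFAULT_TEMPLATE,
) -> str:
    """Combine context, question text, and answer instructions into one prompt."""
    context_block = ""
    if context:
        # Each context item becomes one bullet ahead of the question
        bullets = "\n".join(f"- {item}" for item in context)
        context_block = f"Context:\n{bullets}\n\n"
    return template.format(
        context_block=context_block,
        question=question,
        answer_instructions=ANSWER_INSTRUCTIONS[answer_kind],
    )


prompt = render_prompt("Will X happen?", context=["Article A"])
print(prompt)
```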

## Step 6: Create Pipeline and Run

**`QuestionPipeline`** combines all components into a complete data generation pipeline:

- `seed_generator` - Collects raw data
- `question_generator` - Creates questions from seeds
- `labeler` - Finds answers (optional if using `QuestionAndLabelGenerator`)
- `renderer` - Formats questions into final prompts for model input (optional)

**`transforms.run()`** submits the pipeline, waits for completion, and returns a dataset. For long-running jobs, use `transforms.submit()` to submit without waiting, then check status with `transforms.jobs.get(job_id)`.

The `max_questions` parameter limits the number of questions generated (useful for testing).

```python
from lightningrod import QuestionPipeline

# Assemble the components configured in Steps 3-5
pipeline_config = QuestionPipeline(
    seed_generator=seed_generator,
    question_generator=question_generator,
    labeler=labeler,
    renderer=renderer,
)

# Submit the pipeline and block until the dataset is ready
dataset = client.transforms.run(pipeline_config, max_questions=10)
```
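
For the submit-then-check pattern mentioned above, a generic polling loop might look like the sketch below. It deliberately avoids SDK calls: `wait_for_job` is a hypothetical helper that takes a `fetch_job` callable (with the SDK this could wrap `client.transforms.jobs.get(job_id)`) and an `is_finished` predicate. How the job ID is obtained from `transforms.submit()` and what status values a job reports are not covered in this guide, so the status strings used here are placeholders.

```python
import time
from typing import Any, Callable


def wait_for_job(
    fetch_job: Callable[[], Any],
    is_finished: Callable[[Any], bool],
    poll_seconds: float = 5.0,
    timeout_seconds: float = 600.0,
) -> Any:
    """Call fetch_job repeatedly until is_finished(job) is true or the timeout expires."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        job = fetch_job()
        if is_finished(job):
            return job
        if time.monotonic() >= deadline:
            raise TimeoutError("job did not finish before the timeout")
        time.sleep(poll_seconds)


# Stand-in for a real job lookup: the status strings are placeholders, not SDK values
fake_statuses = iter(["PENDING", "RUNNING", "COMPLETED"])
final_status = wait_for_job(
    fetch_job=lambda: next(fake_statuses),
    is_finished=lambda status: status == "COMPLETED",
    poll_seconds=0.0,
)
print(final_status)
```

Passing the lookup and the completion check as callables keeps the loop independent of any particular job object, so it can be adapted once the real job fields are known from the API reference.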

> **Tip:** You can also track pipeline status on our dashboard (Dataset Generation -> Datasets): https://dashboard.lightningrod.ai

## Step 7: View Results

The dataset contains all generated samples. Each sample includes:

- **`seed`** - Original raw data (news article text, URL, etc.)
- **`question`** - Generated forecasting question
- **`label`** - Ground truth answer with confidence score and reasoning
- **`prompt`** - Formatted prompt ready for model input
- **`context`** - Optional additional context (news articles, RAG results)
- **`meta`** - Optional custom metadata

**`dataset.download()`** fetches all samples into memory. **`dataset.flattened()`** converts samples to dictionaries suitable for pandas DataFrames or other analysis tools.
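
The flattened column names in the output further down (for example `question.question_text` and `label.label_confidence`) suggest that nested fields are joined with dots. The sketch below shows that idea on a plain dictionary. The `flatten` helper is hypothetical and is not the implementation behind `dataset.flattened()`, which this guide does not show.

```python
from typing import Any, Dict


def flatten(nested: Dict[str, Any], prefix: str = "") -> Dict[str, Any]:
    """Flatten nested dictionaries into a single level with dot-separated keys."""
    flat: Dict[str, Any] = {}
    for key, value in nested.items():
        dotted = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            # Recurse into nested dictionaries, carrying the dotted prefix along
            flat.update(flatten(value, dotted))
        else:
            flat[dotted] = value
    return flat


# Illustrative sample, not real SDK output
nested_sample = {
    "question": {"question_text": "Will the example event occur?"},
    "label": {"label": "1", "label_confidence": 1.0},
    "prompt": "Question: Will the example event occur?",
}
row = flatten(nested_sample)
print(sorted(row))
```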

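```python
# pandas is only needed for viewing the results as a DataFrame
%pip install pandas

# Hide the verbose installer output
from IPython.display import clear_output
clear_output()
```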

> **Tip:** You can also view the pipeline results in our dashboard (Dataset Generation -> Datasets): https://dashboard.lightningrod.ai

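```python
import pandas as pd

# Fetch all samples into memory
samples = dataset.download()
print(f"Generated {dataset.num_rows} samples\n")

# Flatten nested samples into dictionaries and load them into a DataFrame
rows = dataset.flattened()
df = pd.DataFrame(rows)

print("\n\nFull sample structure:")
print(df.head())
```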

```
Generated 10 samples

                              question.question_text   label.label  \
0  Will Samsung announce a specific retail availa...             0   
1  Will the ROG Swift OLED PG27UCWM support a 480...             1   
2  Will Meta integrate Manus’s autonomous agent t...             1   
3  Will Intel officially launch its Panther Lake ...             1   
4  Will Gemini 3 Flash remain the default model f...  Undetermined   

   label.label_confidence label.resolution_date  \
0                     1.0   2026-01-07T00:00:00   
1                     1.0   2026-01-04T00:00:00   
2                     1.0   2025-12-29T00:00:00   
3                     1.0   2026-01-05T00:00:00   
4                     0.9                   NaN   

                                     label.reasoning  \
0  Samsung did not announce a specific retail ava...   
1  The ASUS ROG Swift OLED PG27UCWM monitor expli...   
2  Meta announced its acquisition of Manus in Dec...   
3  Multiple sources confir
```
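
Some rows may come back unresolved, like the `Undetermined` label in the output above. Before training, you may want to drop those rows and any with low confidence. The sketch below works on plain dictionaries and assumes the flattened column names `label.label` and `label.label_confidence` seen in that output; the `keep_resolved` helper, the sample rows, and the 0.8 threshold are illustrative choices, not SDK behavior.

```python
from typing import Any, Dict, List

# Illustrative rows using the flattened column names from the output above
example_rows = [
    {"question.question_text": "Q1", "label.label": 0, "label.label_confidence": 1.0},
    {"question.question_text": "Q2", "label.label": 1, "label.label_confidence": 1.0},
    {"question.question_text": "Q3", "label.label": "Undetermined", "label.label_confidence": 0.9},
    {"question.question_text": "Q4", "label.label": 1, "label.label_confidence": 0.6},
]


def keep_resolved(rows: List[Dict[str, Any]], min_confidence: float = 0.8) -> List[Dict[str, Any]]:
    """Keep rows with a resolved label and a confidence at or above the threshold."""
    kept = []
    for row in rows:
        # Labels can be mixed types (0, 1, or a string), so compare as strings
        label = str(row.get("label.label", ""))
        confidence = float(row.get("label.label_confidence") or 0.0)
        if label != "Undetermined" and confidence >= min_confidence:
            kept.append(row)
    return kept


usable = keep_resolved(example_rows)
print([row["question.question_text"] for row in usable])
```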

## Next Steps

### Explore Different Data Sources
- **Example 02**: Google News with custom filters
- **Example 03**: GDELT for large-scale datasets
- **Example 04**: Custom documents from file sets

### Try Different Question Types
- **Example 05**: Binary (Yes/No) questions
- **Example 06**: Continuous (numeric) predictions
- **Example 07**: Multiple choice questions
- **Example 08**: Free response questions

### Advanced Features
- Add context generators to enrich questions with relevant news
- Use filter criteria to improve question quality
- Generate rollouts from multiple models
- Chain pipelines together using datasets as inputs

### Example Configurations by Use Case

**Financial Forecasting**
```python
seed_generator = NewsSeedGenerator(
    search_query="earnings reports",
    source_domain=["https://reuters.com/finance"]
)
```

**Product Launch Predictions**
```python
seed_generator = NewsSeedGenerator(
    search_query="product announcements",
    filter_criteria=FilterCriteria(
        rubric="Focus on specific product launches with dates"
    )
)
```

**Custom Document Analysis**
```python
file_set = client.filesets.create("SEC Filings")
# Upload files...
seed_generator = FileSetSeedGenerator(file_set_id=file_set.id)
```

For complete API reference, see [API.md](../API.md).