# Custom Document Dataset Generation

This notebook demonstrates how to generate forecasting questions from your own documents and files. You can upload PDFs, reports, internal documents, or any text files and use them as a data source for question generation.

In [4]:
# Install the package in the parent directory in editable mode
%pip install -e ..

# Hide the verbose pip output
from IPython.display import clear_output
clear_output()

In [None]:
import os
from lightningrod import LightningRod

api_key = os.getenv("LIGHTNINGROD_API_KEY")
if not api_key:
    raise ValueError("LIGHTNINGROD_API_KEY is not set")

client = LightningRod(api_key=api_key)

## Create a File Set

A file set is a collection of files that can be used together for dataset generation. Files are automatically indexed so they can be queried with retrieval-augmented generation (RAG) during question generation.

In [6]:
file_set = client.filesets.create(
    name="Company Reports 2025",
    description="Quarterly earnings reports and company documents"
)

print(f"Created file set: {file_set.name} (ID: {file_set.id})")

Created file set: Company Reports 2025 (ID: 1ff91902-cee4-45eb-bc6f-1677e010b1f5)


## Upload Files

Upload your documents to the file set. You can add metadata to help organize and filter files later.

In [7]:
uploaded_file = client.filesets.files.upload(
    file_set_id=file_set.id,
    file_path="./data/sample_earnings_report.txt",
    metadata={
        "document_type": "earnings_report",
        "quarter": "Q1",
        "year": 2025,
    }
)
print(f"Uploaded: {uploaded_file.original_file_name}")

Uploaded: sample_earnings_report.txt
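
To upload several documents at once, the same call can be wrapped in a loop. The sketch below is a minimal illustration, not part of the SDK: `upload_directory` is a hypothetical helper, and it takes the upload function as an argument so it can be tried without network access. It assumes the upload function accepts the `file_set_id`, `file_path`, and `metadata` keyword arguments used in the cell above.

```python
from pathlib import Path
from typing import Any, Callable, List, Optional


def upload_directory(
    upload_fn: Callable[..., Any],
    file_set_id: str,
    directory: str,
    pattern: str = "*.txt",
    metadata: Optional[dict] = None,
) -> List[Any]:
    """Upload every file in `directory` matching `pattern`, in sorted order.

    Hypothetical helper: `upload_fn` is expected to behave like
    `client.filesets.files.upload` from the cell above.
    """
    results = []
    for path in sorted(Path(directory).glob(pattern)):
        if not path.is_file():
            continue  # skip directories that happen to match the pattern
        results.append(
            upload_fn(
                file_set_id=file_set_id,
                file_path=str(path),
                metadata=dict(metadata or {}),  # copy so calls do not share state
            )
        )
    return results
```

With the client from this notebook, a call might look like `upload_directory(client.filesets.files.upload, file_set.id, "./data")`.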


## List Files in File Set

Verify that your files are in the file set.

In [8]:
files_response = client.filesets.files.list(file_set.id)
print(f"Files in set: {files_response.total}")
for file in files_response.files:
    print(f"  - {file.original_file_name} ({file.size_bytes} bytes)")

Files in set: 1
  - sample_earnings_report.txt (3355 bytes)


## Configure Document-Based Seed Generator

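The `FileSetSeedGenerator` extracts text from your uploaded files and splits it into chunks, each of which becomes a seed for question generation. The `FileSetQuerySeedGenerator` instead runs the given prompts as RAG queries against the file set and uses the answers as seeds. Files are processed and indexed automatically.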
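
To make the `chunk_size` and `chunk_overlap` parameters concrete, here is a minimal sketch of overlapping fixed-size chunking. It is an illustration only: it assumes sizes are measured in characters and that each chunk starts `chunk_size - chunk_overlap` characters after the previous one. The library's actual splitting strategy (units, boundary handling) may differ. `chunk_text` is a hypothetical helper, not part of `lightningrod`.

```python
from typing import List


def chunk_text(text: str, chunk_size: int, chunk_overlap: int) -> List[str]:
    """Split `text` into windows of at most `chunk_size` characters.

    Consecutive windows share `chunk_overlap` characters, so content near a
    boundary appears in both neighbouring chunks.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks


chunks = chunk_text("abcdefghij", chunk_size=4, chunk_overlap=1)
print(chunks)  # ['abcd', 'defg', 'ghij']
```

Under these assumptions, a 3,355-character text with `chunk_size=1000` and `chunk_overlap=100` would yield four chunks, starting at offsets 0, 900, 1800, and 2700.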

In [None]:
from lightningrod import FileSetSeedGenerator, FileSetQuerySeedGenerator

file_set_chunk_seed_generator = FileSetSeedGenerator(
    file_set_id=file_set.id,
    chunk_size=1000,
    chunk_overlap=100,
)

file_set_rag_seed_generator = FileSetQuerySeedGenerator(
    file_set_id=file_set.id,
    prompts=[
        "What is the company's revenue for the first quarter of 2025?",
        "What is the company's revenue for the second quarter of 2025?",
        "What is the company's revenue for the third quarter of 2025?",
        "What is the company's revenue for the fourth quarter of 2025?",
    ],
)
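
The four prompts above differ only in the quarter, so they could also be built programmatically. A small sketch, where `QUARTERS` and `build_revenue_prompts` are names made up for this example; the resulting list could be passed as the `prompts` argument.

```python
from typing import List

QUARTERS = ["first", "second", "third", "fourth"]


def build_revenue_prompts(year: int) -> List[str]:
    """Return one revenue query per quarter of `year`."""
    return [
        f"What is the company's revenue for the {quarter} quarter of {year}?"
        for quarter in QUARTERS
    ]


prompts = build_revenue_prompts(2025)
print(prompts[0])  # What is the company's revenue for the first quarter of 2025?
```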

## Configure Question Generator

Generate questions based on the content of your documents.

In [10]:
from lightningrod import AnswerType, AnswerTypeEnum, QuestionGenerator

answer_type = AnswerType(answer_type=AnswerTypeEnum.BINARY)

question_generator = QuestionGenerator(
    instructions=(
        "Generate binary forecasting questions based on earnings report content. "
        "Focus on specific future outcomes mentioned in guidance, product launches, market expansion plans, or strategic initiatives. "
        "Questions must reference concrete metrics, timelines, or events from the document. "
        "Avoid questions about past performance or historical facts."
    ),
    examples=[
        "Will Acme Corporation meet its Q2 2025 revenue guidance of $2.5-2.6 billion?",
        "Will the AI-powered analytics platform launch in Q2 2025 as scheduled?",
        "Will Acme's Asia-Pacific expansion generate at least $50 million in revenue in the second half of 2025?",
        "Will the Consumer Products Division recover in Q2 2025 as management expects?",
        "Will EU regulatory changes affect more than 5% of Acme's revenue base if implemented?",
    ],
    bad_examples=[
        "Will Acme's cloud services division continue to grow? (too vague - missing specific metrics, timeline, or threshold)",
        "Will the company face competition in the future? (too generic - not grounded in specific document claims or guidance)",
        "Will Acme's strategic initiatives be successful? (too vague - doesn't reference concrete outcomes or measurable criteria)",
        "Will the TechStart acquisition contribute to revenue growth? (missing specific metrics - document mentions $120M annualized revenue but question doesn't reference it)",
        "Will Acme expand into new markets? (too vague - document specifies Asia-Pacific expansion in Q2 with $50-75M revenue target, but question lacks these details)",
    ],
    answer_type=answer_type,
)

In [11]:
from lightningrod import WebSearchLabeler, QuestionRenderer

labeler = WebSearchLabeler(answer_type=answer_type)
renderer = QuestionRenderer(answer_type=answer_type)

## Run the Chunk-Based Pipeline

Generate questions from your documents by splitting them into chunks; each chunk acts as a seed for the dataset generation pipeline.

In [None]:
from lightningrod import QuestionPipeline

chunk_pipeline_config = QuestionPipeline(
    seed_generator=file_set_chunk_seed_generator,
    question_generator=question_generator,
    labeler=labeler,
    renderer=renderer,
)

# Keep max_questions low while testing
chunk_dataset = client.transforms.run(chunk_pipeline_config, max_questions=10)

In [22]:
%pip install pandas

from IPython.display import clear_output
clear_output()

In [None]:
import pandas as pd

# Download samples to memory
samples = chunk_dataset.download()
print(f"Generated {chunk_dataset.num_rows} samples\n")

# Convert cached samples to a list of dictionaries
rows = chunk_dataset.flattened()

df = pd.DataFrame(rows)
print(df.head())

Generated 4 samples

                              question.question_text   label.label  \
0  Will Acme Corporation's Cloud Services Divisio...  Undetermined   
1  Will Acme Corporation's reported revenue for Q...  Undetermined   
2  Will pending regulatory changes in the Europea...  Undetermined   
3  Will Acme's Asia-Pacific expansion generate at...  Undetermined   

   label.label_confidence                                    label.reasoning  \
0                     1.0  The search results provide financial informati...   
1                     1.0  The search results predominantly point to "Acm...   
2                     0.1  The question asks about the specific financial...   
3                     1.0  The question asks whether "Acme's Asia-Pacific...   

                                label.answer_sources  \
0  https://vertexaisearch.cloud.google.com/ground...   
1  https://vertexaisearch.cloud.google.com/ground...   
2  https://vertexaisearch.cloud.google.com/ground...   
3  
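
Rows labeled `Undetermined` have no resolved answer, so it can be useful to separate them from resolved rows before further analysis. The sketch below assumes that `flattened()` returns a list of dictionaries keyed by the dotted column names visible in the DataFrame (`label.label`, `label.label_confidence`). `split_by_resolution` is a hypothetical helper, and the sample rows are illustrative stand-ins, not real pipeline output.

```python
from typing import Dict, List, Tuple

Row = Dict[str, object]


def split_by_resolution(
    rows: List[Row], min_confidence: float = 0.0
) -> Tuple[List[Row], List[Row]]:
    """Split rows into (resolved, unresolved).

    A row counts as resolved when its label is not "Undetermined" and its
    confidence is at least `min_confidence`.
    """
    resolved, unresolved = [], []
    for row in rows:
        label = row.get("label.label")
        confidence = float(row.get("label.label_confidence") or 0.0)
        is_resolved = label is not None and label != "Undetermined"
        if is_resolved and confidence >= min_confidence:
            resolved.append(row)
        else:
            unresolved.append(row)
    return resolved, unresolved


# Illustrative rows only; real rows would come from `chunk_dataset.flattened()`.
example_rows = [
    {"label.label": "Undetermined", "label.label_confidence": 1.0},
    {"label.label": "0", "label.label_confidence": 1.0},
    {"label.label": "1", "label.label_confidence": 0.4},
]
resolved, unresolved = split_by_resolution(example_rows, min_confidence=0.5)
print(len(resolved), len(unresolved))  # 1 2
```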

## Run the RAG-Based Pipeline

Generate questions from your documents by running RAG queries against them; the answer to each query acts as a seed for the dataset generation pipeline.

In [28]:
from lightningrod import QuestionPipeline

rag_pipeline_config = QuestionPipeline(
    seed_generator=file_set_rag_seed_generator,
    question_generator=question_generator,
    labeler=labeler,
    renderer=renderer,
)

# Keep max_questions low while testing
rag_dataset = client.transforms.run(rag_pipeline_config, max_questions=10)

In [29]:
import pandas as pd

# Download samples to memory
samples = rag_dataset.download()
print(f"Generated {rag_dataset.num_rows} samples\n")

# Convert cached samples to a list of dictionaries
rows = rag_dataset.flattened()

df = pd.DataFrame(rows)
print(df.head())

Generated 4 samples

                              question.question_text   label.label  \
0  Will Acme Corporation report revenue of $2.5 b...             0   
1  Will Acme Corporation generate at least $50 mi...  Undetermined   
2  Will Acme Corporation report revenue of at lea...             0   
3  Will Acme Corporation report revenue of at lea...             0   

   label.label_confidence label.resolution_date  \
0                     1.0   2025-07-23T00:00:00   
1                     0.9                   NaN   
2                     1.0   2025-07-23T00:00:00   
3                     1.0   2025-07-23T00:00:00   

                                     label.reasoning  \
0  Acme United Corporation reported net sales (re...   
1  The search results provide financial informati...   
2  Acme United Corporation reported net sales (re...   
3  Acme United Corporation (NYSE American: ACU) r...   

                                label.answer_sources  \
0  https://vertexaisearch.cloud.goo