# Custom Document Dataset Generation

This notebook demonstrates how to generate forecasting questions from your own documents. Upload PDFs, reports, internal documents, or other text files and use them as the data source for question generation.

In [None]:
%pip install lightningrod-ai

from IPython.display import clear_output
clear_output()

In [None]:
import os
from pathlib import Path
from lightningrod import (
    LightningRod,
    FileSetSeedGenerator,
    QuestionGenerator,
    QuestionPipeline,
    WebSearchLabeler,
    AnswerType,
    AnswerTypeEnum,
)
from lightningrod._generated.models import QuestionRenderer

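# Read the API key from the environment. The fallback string is only a
# placeholder and must be replaced with a real key before making requests.
api_key = os.getenv("LIGHTNINGROD_API_KEY", "your-api-key-here")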

client = LightningRod(api_key=api_key)
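
Because the cell above falls back to a placeholder string, a missing environment variable only shows up later as a failed request. A minimal sketch of a stricter check follows; `require_api_key` is a hypothetical helper, not part of the `lightningrod` package.

```python
import os


def require_api_key(env_var: str = "LIGHTNINGROD_API_KEY") -> str:
    """Return the API key from the environment, or raise if it is unset or blank.

    Hypothetical helper for this notebook; not provided by lightningrod.
    """
    value = os.environ.get(env_var, "").strip()
    if not value:
        raise RuntimeError(
            f"Set the {env_var} environment variable before creating the client."
        )
    return value
```

With this in place, the client could be created as `LightningRod(api_key=require_api_key())`, so a missing key stops the notebook at this cell.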

## Create a File Set

A file set is a collection of files that can be used together for dataset generation. Files are automatically indexed for RAG-based question generation.

In [None]:
file_set = client.filesets.create(
    name="Company Reports 2025",
    description="Quarterly earnings reports and company documents"
)

print(f"Created file set: {file_set.name} (ID: {file_set.id})")

## Upload Files

Upload your documents to the file set. You can add metadata to help organize and filter files later.

In [None]:
file_path = Path("sample_earnings_report.txt")

if file_path.exists():
    uploaded_file = client.filesets.files.upload(
        file_set_id=file_set.id,
        file_path=file_path,
        metadata={
            "document_type": "earnings_report",
            "quarter": "Q1",
            "year": 2025,
        }
    )
    print(f"Uploaded: {uploaded_file.original_file_name}")
else:
    print(f"Note: Create a {file_path} file to test file upload")
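
To upload more than one document, it may help to gather the candidate paths first. The sketch below uses only `pathlib`; `collect_documents` and its default suffix list are assumptions for illustration, and the file types the service actually accepts should be checked against its documentation.

```python
from pathlib import Path


def collect_documents(directory, suffixes=(".txt", ".pdf", ".md")):
    """Return files under `directory` whose suffix is in `suffixes`, sorted by path.

    Hypothetical helper; the suffix list is an assumption, not a list of
    formats confirmed by the service.
    """
    root = Path(directory)
    if not root.is_dir():
        return []
    wanted = {s.lower() for s in suffixes}
    # rglob("*") walks subdirectories too; suffix matching is case-insensitive.
    return sorted(
        p for p in root.rglob("*") if p.is_file() and p.suffix.lower() in wanted
    )
```

Each returned path could then be passed as `file_path` to `client.filesets.files.upload(...)` in a loop, with per-file metadata as in the cell above.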

## List Files in File Set

Verify that your files are in the file set.

In [None]:
files_response = client.filesets.files.list(file_set.id)
print(f"Files in set: {files_response.total}")
for file in files_response.files:
    print(f"  - {file.original_file_name} ({file.size_bytes} bytes)")

## Configure Document-Based Seed Generator

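The `FileSetSeedGenerator` extracts text from your uploaded files and splits it into chunks, which serve as seeds for question generation. `chunk_size` and `chunk_overlap` control the size of each chunk and how much consecutive chunks share. Files are automatically processed and indexed.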

In [None]:
file_set_seed_generator = FileSetSeedGenerator(
    file_set_id=file_set.id,
    chunk_size=1000,
    chunk_overlap=100,
)
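
To build intuition for `chunk_size` and `chunk_overlap`, here is a minimal sliding-window sketch. It assumes both values are measured in characters and that chunks are cut at fixed offsets; the library's actual units and splitting strategy are not stated in this notebook, so `chunk_text` is only an illustration.

```python
def chunk_text(text: str, chunk_size: int = 1000, chunk_overlap: int = 100) -> list[str]:
    """Split text into windows of up to chunk_size characters.

    Each window starts chunk_size - chunk_overlap characters after the
    previous one, so neighbours share chunk_overlap characters.
    Illustrative only; not the library's implementation.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once a window reaches the end, to avoid a redundant tail chunk.
        if start + chunk_size >= len(text):
            break
    return chunks
```

Under these assumptions, a 2,500-character document with the settings above yields three chunks starting at offsets 0, 900, and 1,800. The overlap keeps a sentence that straddles a boundary fully visible in at least one chunk.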

## Configure Question Generator

Generate questions based on the content of your documents.

In [None]:
answer_type = AnswerType(answer_type=AnswerTypeEnum.BINARY)

question_generator = QuestionGenerator(
    instructions=(
        "Generate forward-looking questions based on the document content. "
        "Questions should be about future events or outcomes mentioned or implied in the documents."
    ),
    examples=[
        "Will the company meet its revenue target for Q2?",
        "Will the new product launch be delayed?",
        "Will the merger be completed this year?",
    ],
    answer_type=answer_type,
)

In [None]:
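# The labeler and renderer take the same answer type as the question
# generator, so all three pipeline stages agree on the answer format.
labeler = WebSearchLabeler(answer_type=answer_type)
renderer = QuestionRenderer(answer_type=answer_type)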

## Run the Pipeline

Generate questions from your custom documents.

In [None]:
pipeline_config = QuestionPipeline(
    seed_generator=file_set_seed_generator,
    question_generator=question_generator,
    labeler=labeler,
    renderer=renderer,
)

dataset = client.transforms.run(pipeline_config, max_questions=20)

In [None]:
print(f"Generated dataset with {dataset.num_rows} samples from custom documents\n")

samples = dataset.to_samples()
for i, sample in enumerate(samples[:5]):
    print(f"Sample {i+1}:")
    if sample.question:
        print(f"  Question: {sample.question.question_text}")
    if sample.label:
        print(f"  Answer: {sample.label.label}")
    print()
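
Beyond printing a few samples, a quick tally of the labels can show whether the generated set is heavily skewed toward one answer. The sketch below reads only the `sample.label.label` attribute used in the cell above; `tally_labels` is a hypothetical helper, and the stand-in objects and their "Yes"/"No" values are placeholders, since the real label values are not shown here.

```python
from collections import Counter
from types import SimpleNamespace


def tally_labels(samples):
    """Count label values, skipping samples that have no label."""
    return Counter(
        str(sample.label.label)
        for sample in samples
        if getattr(sample, "label", None)
    )


# Stand-in objects that mimic only the attributes read in the cell above.
demo_samples = [
    SimpleNamespace(label=SimpleNamespace(label="Yes")),
    SimpleNamespace(label=SimpleNamespace(label="No")),
    SimpleNamespace(label=SimpleNamespace(label="Yes")),
    SimpleNamespace(label=None),
]
print(tally_labels(demo_samples))
```

On a real run, `tally_labels(dataset.to_samples())` would give the same kind of summary for the generated dataset.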