# Custom Document Dataset Generation

This notebook demonstrates how to generate forecasting questions from your own documents and files. You can use PDFs, plain text, CSV files, or other file types as the data source for question generation.
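
As a rough illustration of turning different file types into plain text before chunking, the sketch below reads `.txt`, `.md`, and `.csv` files using only the standard library. The helper name `load_document_text` and the "column: value" row format are assumptions made for this example, not part of the lightningrod SDK. PDFs need a separate text-extraction library, which is left out here.

```python
import csv
from pathlib import Path


def load_document_text(path: str) -> str:
    """Return the contents of a .txt, .md, or .csv file as one plain-text string.

    Hypothetical helper for illustration; not part of the lightningrod SDK.
    """
    file_path = Path(path)
    suffix = file_path.suffix.lower()

    if suffix == ".csv":
        # Flatten each CSV row into a "column: value" line so the text
        # stays readable after it is split into chunks.
        with file_path.open(newline="", encoding="utf-8") as f:
            reader = csv.DictReader(f)
            lines = [
                ", ".join(f"{key}: {value}" for key, value in row.items())
                for row in reader
            ]
        return "\n".join(lines)

    if suffix in {".txt", ".md"}:
        return file_path.read_text(encoding="utf-8")

    raise ValueError(f"Unsupported file type: {suffix}")
```

The returned string can be passed to whichever text splitter you choose, in the same way as `document_text` later in this notebook.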

In [1]:
%pip install -e ..
%pip install python-dotenv

from IPython.display import clear_output
clear_output()

In [2]:
from utils import get_secret
from lightningrod import LightningRod

api_key = get_secret("LIGHTNINGROD_API_KEY")
base_url = get_secret("LIGHTNINGROD_BASE_URL", "https://api.lightningrod.ai/api/public/v1")

lr = LightningRod(api_key=api_key, base_url=base_url)

## File Chunking and Input Dataset Creation

To generate questions from your own documents, first split them into chunks and map each chunk to a "seed". The collection of seeds serves as the input dataset for the generation pipeline.

This approach gives you full control over how your documents are processed and chunked.

In [3]:
%pip install langchain-text-splitters

from IPython.display import clear_output
clear_output()

from langchain_text_splitters import RecursiveCharacterTextSplitter

with open("./data/sample_earnings_report.txt", "r", encoding="utf-8") as f:
    document_text = f.read()

# Use RecursiveCharacterTextSplitter for more intelligent chunking.
# It tries the separators in order: paragraphs, lines, sentences, words,
# and finally individual characters.
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=100,
    length_function=len,
    separators=["\n\n", "\n", ". ", " ", ""]
)

chunks = text_splitter.split_text(document_text)
print(f"Created {len(chunks)} chunks from document\n")
print(f"First chunk preview: {chunks[0][:200]}...")

Created 4 chunks from document

First chunk preview: ACME CORPORATION
Q1 2025 EARNINGS REPORT

EXECUTIVE SUMMARY

Acme Corporation (NASDAQ: ACME) today reported financial results for the first quarter ended March 31, 2025. Revenue increased 18% year-ove...
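
To make the roles of `chunk_size` and `chunk_overlap` concrete, here is a deliberately simplified sketch of fixed-size chunking with overlap. It is not how `RecursiveCharacterTextSplitter` works internally: that splitter also prefers to break at the configured separators, so its chunk boundaries will differ. The function name `chunk_with_overlap` is made up for this example.

```python
def chunk_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into fixed-size character windows that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")

    # Each new window starts chunk_size - chunk_overlap characters after the previous one.
    step = chunk_size - chunk_overlap
    windows = []
    for start in range(0, len(text), step):
        windows.append(text[start:start + chunk_size])
        # Stop once a window reaches the end of the text.
        if start + chunk_size >= len(text):
            break
    return windows


demo_chunks = chunk_with_overlap("abcdefghij", chunk_size=4, chunk_overlap=1)
print(demo_chunks)  # ['abcd', 'defg', 'ghij']
```

Each chunk repeats the last character of the previous one, which is the effect `chunk_overlap` has at a larger scale: context that straddles a boundary appears in both neighbouring chunks.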


## Create Samples from Chunks

Next, we'll wrap each chunk in a Sample object, attaching metadata that records where the chunk came from. These samples are what we upload to create the input dataset for question generation.

> Sample is the core data structure that flows through our pipelines. It can serve as either the input or the output of a dataset generation pipeline.

In [4]:
from lightningrod import Sample, SampleMeta, Seed

samples = []
for idx, chunk_text in enumerate(chunks):
    seed = Seed(seed_text=chunk_text)
    sample_meta = SampleMeta.from_dict({
        "source_file": "sample_earnings_report.txt",
        "chunk_index": idx,
        "total_chunks": len(chunks),
        "document_type": "earnings_report",
    })
    sample = Sample(seed=seed, meta=sample_meta)
    samples.append(sample)

print(f"Created {len(samples)} samples from chunks")

Created 4 samples from chunks
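
If you process several documents, it can help to build the per-chunk metadata in one place. The sketch below produces plain dictionaries with the same keys used in the cell above; the helper name `build_chunk_metadata` and the second file name are hypothetical. Each dictionary could then be passed to `SampleMeta.from_dict` as shown earlier.

```python
def build_chunk_metadata(source_file: str, chunks: list[str], document_type: str) -> list[dict]:
    """Return one metadata dictionary per chunk, using the keys from the cell above."""
    return [
        {
            "source_file": source_file,
            "chunk_index": idx,
            "total_chunks": len(chunks),
            "document_type": document_type,
        }
        for idx in range(len(chunks))
    ]


# Hypothetical pre-chunked documents, for illustration only.
documents = {
    "sample_earnings_report.txt": ["chunk a", "chunk b"],
    "another_report.txt": ["chunk c"],
}

all_metadata = []
for file_name, file_chunks in documents.items():
    all_metadata.extend(build_chunk_metadata(file_name, file_chunks, "earnings_report"))

print(len(all_metadata))  # 3
```

Keeping `chunk_index` and `total_chunks` per source file makes it possible to trace a generated question back to its position in the original document.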


## Upload Samples to Create an Input Dataset

Upload the samples to create an input dataset for question generation. The returned dataset object exposes its ID and row count.

In [5]:
input_dataset = lr.datasets.create_from_samples(samples, batch_size=1000)
print(f"Created input dataset: {input_dataset.id}")
print(f"Total samples: {input_dataset.num_rows}")

Created input dataset: c6af2c5b-ef20-4baf-acb0-d6335379f750
Total samples: 4


## Configure Question Generator

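Configure the generator to produce binary forecasting questions grounded in the content of your documents. The instructions, examples, and bad examples steer it toward questions that reference specific metrics, timelines, or events.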

In [6]:
from lightningrod import AnswerType, AnswerTypeEnum, QuestionAndLabelGenerator

answer_type = AnswerType(answer_type=AnswerTypeEnum.BINARY)

qa_config = QuestionAndLabelGenerator(
    answer_type=answer_type,
    questions_per_seed=1,
    instructions=(
        "Generate binary forecasting questions based on earnings report content. "
        "Focus on specific future outcomes mentioned in guidance, product launches, market expansion plans, or strategic initiatives. "
        "Questions must reference concrete metrics, timelines, or events from the document. "
        "Avoid questions about past performance or historical facts."
    ),
    examples=[
        "Will Acme Corporation meet its Q2 2025 revenue guidance of $2.5-2.6 billion?",
        "Will the AI-powered analytics platform launch in Q2 2025 as scheduled?",
        "Will Acme's Asia-Pacific expansion generate at least $50 million in revenue in the second half of 2025?",
        "Will the Consumer Products Division recover in Q2 2025 as management expects?",
        "Will EU regulatory changes affect more than 5% of Acme's revenue base if implemented?",
    ],
    bad_examples=[
        "Will Acme's cloud services division continue to grow? (too vague - missing specific metrics, timeline, or threshold)",
        "Will the company face competition in the future? (too generic - not grounded in specific document claims or guidance)",
        "Will Acme's strategic initiatives be successful? (too vague - doesn't reference concrete outcomes or measurable criteria)",
        "Will the TechStart acquisition contribute to revenue growth? (missing specific metrics - document mentions $120M annualized revenue but question doesn't reference it)",
        "Will Acme expand into new markets? (too vague - document specifies Asia-Pacific expansion in Q2 with $50-75M revenue target, but question lacks these details)",
    ],
)
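
One quick way to sanity-check your own example lists is a crude heuristic: questions that reference concrete metrics or timelines usually contain at least one digit. The sketch below applies that check to two strings taken from the configuration above. The helper names are hypothetical, and the digit test is only a rough proxy for "concrete", not something the generator itself uses.

```python
import re


def question_part(example: str) -> str:
    """Return the text up to and including the first question mark,
    dropping any trailing parenthetical explanation."""
    head, sep, _ = example.partition("?")
    return (head + sep).strip()


def looks_concrete(question: str) -> bool:
    """Crude heuristic: treat a question as concrete if it contains a digit
    (for example a quarter, a year, a dollar amount, or a percentage)."""
    return bool(re.search(r"\d", question))


good = "Will Acme Corporation meet its Q2 2025 revenue guidance of $2.5-2.6 billion?"
bad = (
    "Will Acme expand into new markets? (too vague - document specifies "
    "Asia-Pacific expansion in Q2 with $50-75M revenue target, but question lacks these details)"
)

print(looks_concrete(question_part(good)))  # True
print(looks_concrete(question_part(bad)))   # False
```

`question_part` matters for the bad examples, because their parenthetical explanations can mention figures that the question itself omits.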

## Run the Pipeline with Input Dataset

Pass our input dataset to the transform pipeline. The samples from the input dataset will be used as seeds for question/answer generation.

In [7]:
chunk_dataset = lr.transforms.run(
    qa_config,
    input_dataset=input_dataset,
    max_questions=10,  # keep max questions low when testing
)

> Note: Processing can take a few minutes to complete.

## View the Results

In [8]:
%pip install pandas

from IPython.display import clear_output
clear_output()

In [9]:
import pandas as pd

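# Download the generated samples into memory.
# A separate name avoids overwriting the input `samples` list from earlier.
output_samples = chunk_dataset.download()
print(f"Generated {chunk_dataset.num_rows} samples\n")

# Convert the cached samples to a list of dictionaries, one per DataFrame row
rows = chunk_dataset.flattened()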

df = pd.DataFrame(rows)
df.head()

Generated 4 samples



| | question.question_text | label.label | label.label_confidence | seed.seed_text | is_valid | meta.sample_id | meta.chunk_index | meta.source_file | meta.total_chunks | meta.document_type | meta.parent_sample_id | meta.processing_time_ms |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Will Acme host its conference call to discuss ... | Yes | 1.0 | RISK FACTORS\n\nThe company faces several risk... | True | 6528be77-89d9-46b7-a268-90dedc86a46f | 3 | sample_earnings_report.txt | 4 | earnings_report | 9a7a6f3f-8b96-47dd-8680-5490bd08f76b | 0.942 |
| 1 | Will Acme Corporation report total revenue of ... | Yes | 1.0 | M&A Activity\nIn March, we completed the acqui... | True | b6926f63-45b1-4e67-aa1b-1190486f0b47 | 2 | sample_earnings_report.txt | 4 | earnings_report | f1b931f7-e62d-477e-9cc4-536486e87870 | 0.706 |
| 2 | Will Acme's Asia-Pacific expansion generate at... | Yes | 1.0 | Enterprise Software Division\nRevenue increase... | True | 37c09ef5-5aee-4923-a6b8-1ad887c24e17 | 1 | sample_earnings_report.txt | 4 | earnings_report | 619154b2-3cc7-41eb-a303-3d0d02b41637 | 0.549 |
| 3 | Will Acme Corporation's Cloud Services Divisio... | Yes | 0.65 | ACME CORPORATION\nQ1 2025 EARNINGS REPORT\n\nE... | True | 25778a8f-61d9-498c-98bb-120666c3597b | 0 | sample_earnings_report.txt | 4 | earnings_report | 77803252-1fb8-4108-b98d-b7299732a2b5 | 1.208 |
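
Before using the generated questions, it can be worth checking label confidence and label balance. The sketch below works on plain dictionaries with the same column names as the output above; the rows reuse the values shown there with the other columns omitted, and both the helper name `keep_confident` and the 0.9 threshold are assumptions chosen for illustration.

```python
from collections import Counter

# Illustrative rows: column names and values are copied from the output above,
# with the remaining columns omitted for brevity.
example_rows = [
    {"meta.chunk_index": 3, "label.label": "Yes", "label.label_confidence": 1.0, "is_valid": True},
    {"meta.chunk_index": 2, "label.label": "Yes", "label.label_confidence": 1.0, "is_valid": True},
    {"meta.chunk_index": 1, "label.label": "Yes", "label.label_confidence": 1.0, "is_valid": True},
    {"meta.chunk_index": 0, "label.label": "Yes", "label.label_confidence": 0.65, "is_valid": True},
]


def keep_confident(rows: list[dict], min_confidence: float = 0.9) -> list[dict]:
    """Keep rows that are valid and whose label confidence meets the threshold."""
    return [
        row
        for row in rows
        if row["is_valid"] and row["label.label_confidence"] >= min_confidence
    ]


confident_rows = keep_confident(example_rows)
label_counts = Counter(row["label.label"] for row in example_rows)

print(len(confident_rows))  # 3
print(dict(label_counts))   # {'Yes': 4}
```

In this small run every label is "Yes", so a larger dataset would be needed to judge whether the generated questions are balanced between outcomes.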
