# Custom Document Dataset Generation

This notebook demonstrates how to generate forecasting questions from your own documents. PDFs, plain text, CSV, and other file types can all serve as data sources for question generation.

In [None]:
%pip install lightningrod-ai

from IPython.display import clear_output
clear_output()

## Set up the client

Sign up at [dashboard.lightningrod.ai](https://dashboard.lightningrod.ai/?redirect=/api) to get your API key and **$50 of free credits**.

In Google Colab, go to the Secrets section (🔑 icon in the left sidebar) and add a secret named `LIGHTNINGROD_API_KEY` with your API key.

In [None]:
from google.colab import userdata
from lightningrod import LightningRod

api_key = userdata.get("LIGHTNINGROD_API_KEY")
lr = LightningRod(api_key=api_key)
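
If you run this notebook outside Google Colab, `google.colab` is not available. The sketch below is one possible fallback, not part of the SDK: `load_api_key` is a hypothetical helper that tries Colab Secrets first and then reads an environment variable of the same name (an assumed convention).

```python
import os


def load_api_key(name="LIGHTNINGROD_API_KEY"):
    """Return the API key from Colab Secrets if possible, else from the environment.

    Hypothetical helper for illustration; the environment-variable fallback
    is an assumption, not something the SDK requires.
    """
    try:
        # Only importable inside Google Colab.
        from google.colab import userdata
    except ImportError:
        userdata = None

    if userdata is not None:
        try:
            key = userdata.get(name)
            if key:
                return key
        except Exception:
            # Secret missing or notebook access not granted; fall through.
            pass

    key = os.environ.get(name)
    if not key:
        raise RuntimeError(f"No API key found: set the {name} secret or environment variable")
    return key
```

The returned value can then be passed to `LightningRod(api_key=...)` exactly as in the cell above.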

## Create Input Dataset

Now we'll create Sample objects from our input files and upload them to create an input dataset that can be used for question generation.

> Tip: if you prefer more control, you can also generate these samples manually by chunking the input file(s) according to your needs.
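
As a rough illustration of manual chunking, the plain-Python sketch below splits a document into fixed-size, overlapping character windows. The function name `chunk_text` and its default sizes are assumptions chosen for this example. Wrapping each chunk in a Sample object depends on the SDK and is not shown here.

```python
def chunk_text(text, chunk_size=1000, overlap=100):
    """Split text into character windows of chunk_size that overlap by `overlap`."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")

    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window reaches the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


# Example: 4-character windows that share 1 character with their neighbor.
print(chunk_text("abcdefghij", chunk_size=4, overlap=1))
```

Character-based windows are the simplest option. Splitting on paragraph or section boundaries usually gives more coherent seeds for documents with clear structure, such as earnings reports.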

In [6]:
from lightningrod import preprocessing

%pip install langchain-text-splitters
!git clone https://github.com/lightning-rod-labs/lightningrod-python-sdk

from IPython.display import clear_output
clear_output()

samples = preprocessing.files_to_samples("./data/sample_earnings_report.txt")
print(f"Created {len(samples)} samples from chunks")

input_dataset = lr.datasets.create_from_samples(samples, batch_size=1000)
print(f"Created input dataset: {input_dataset.id}")
print(f"Total samples: {input_dataset.num_rows}")

Created 4 samples from chunks
Created input dataset: 9ae0ba05-56dd-4a2d-8876-2ba4a182594f
Total samples: 4


## Configure Question Generator

Generate questions based on the content of your documents.

In [7]:
from lightningrod import AnswerType, AnswerTypeEnum, QuestionAndLabelGenerator

answer_type = AnswerType(answer_type=AnswerTypeEnum.BINARY)

qa_config = QuestionAndLabelGenerator(
    answer_type=answer_type,
    questions_per_seed=1,
    instructions=(
        "Generate binary forecasting questions based on earnings report content. "
        "Focus on specific future outcomes mentioned in guidance, product launches, market expansion plans, or strategic initiatives. "
        "Questions must reference concrete metrics, timelines, or events from the document. "
        "Avoid questions about past performance or historical facts."
    ),
    examples=[
        "Will Acme Corporation meet its Q2 2025 revenue guidance of $2.5-2.6 billion?",
        "Will the AI-powered analytics platform launch in Q2 2025 as scheduled?",
        "Will Acme's Asia-Pacific expansion generate at least $50 million in revenue in the second half of 2025?",
        "Will the Consumer Products Division recover in Q2 2025 as management expects?",
        "Will EU regulatory changes affect more than 5% of Acme's revenue base if implemented?",
    ],
    bad_examples=[
        "Will Acme's cloud services division continue to grow? (too vague - missing specific metrics, timeline, or threshold)",
        "Will the company face competition in the future? (too generic - not grounded in specific document claims or guidance)",
        "Will Acme's strategic initiatives be successful? (too vague - doesn't reference concrete outcomes or measurable criteria)",
        "Will the TechStart acquisition contribute to revenue growth? (missing specific metrics - document mentions $120M annualized revenue but question doesn't reference it)",
        "Will Acme expand into new markets? (too vague - document specifies Asia-Pacific expansion in Q2 with $50-75M revenue target, but question lacks these details)",
    ],
)

## Run the Pipeline with Input Dataset

Pass our input dataset to the transform pipeline. The samples from the input dataset will be used as seeds for question/answer generation.

In [8]:
chunk_dataset = lr.transforms.run(
    qa_config,
    input_dataset=input_dataset,
    max_questions=10,  # keep max_questions low while testing
)

> Note: This can take a few minutes to complete processing.

## View the Results

In [9]:
%pip install pandas

from IPython.display import clear_output
clear_output()

In [10]:
import pandas as pd

# Download output samples to memory
samples = chunk_dataset.download()
print(f"Generated {chunk_dataset.num_rows} samples\n")

# Convert cached samples to a list of dictionaries
rows = chunk_dataset.flattened()

df = pd.DataFrame(rows)
df.head()

Generated 4 samples



| | question.question_text | label.label | label.label_confidence | seed.seed_text | is_valid | meta.file_name | meta.file_type | meta.sample_id | meta.chunk_index | meta.total_chunks | meta.parent_sample_id | meta.processing_time_ms |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Will Acme Corporation's full year 2025 earning... | Yes | 1.0 | M&A Activity\nIn March, we completed the acqui... | True | sample_earnings_report.txt | txt | 5791c15f-9a28-40f4-beda-377da0a0eb23 | 2 | 4 | 2764d081-3330-4dfd-a903-0f0ffeb55b0d | 0.849 |
| 1 | Will Acme's conference call to discuss its res... | Yes | 1.0 | RISK FACTORS\n\nThe company faces several risk... | True | sample_earnings_report.txt | txt | 2d6cf31e-a61c-43e8-91fb-fbff535223d4 | 3 | 4 | 25048467-3ab8-459f-8ae3-eb21160dcb7e | 0.596 |
| 2 | Will Acme's AI-powered analytics platform laun... | Yes | 1.0 | Enterprise Software Division\nRevenue increase... | True | sample_earnings_report.txt | txt | bf53272d-25b9-4592-98b2-61a96a15d781 | 1 | 4 | 2596565d-c5b8-44d7-bda7-86ce549a1b49 | 0.822 |
| 3 | Will Acme Corporation's Cloud Services divisio... | Yes | 0.85 | ACME CORPORATION\nQ1 2025 EARNINGS REPORT\n\nE... | True | sample_earnings_report.txt | txt | 2802e6b2-53bf-48cb-bff6-911121d40207 | 0 | 4 | a35e6adb-2e9e-4c3b-b0cf-91a3c1035530 | 0.527 |
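
Before using the generated questions, you may want to drop invalid rows or rows with low label confidence. The sketch below assumes each row is a dictionary keyed by the column names shown in the output above (`is_valid`, `label.label_confidence`). `filter_confident` and the 0.9 threshold are hypothetical choices for illustration, and the example rows copy only a few fields from that output.

```python
def filter_confident(rows, min_confidence=0.9):
    """Keep rows that are marked valid and meet the confidence threshold."""
    kept = []
    for row in rows:
        if not row.get("is_valid", False):
            continue
        confidence = row.get("label.label_confidence")
        if confidence is None or confidence < min_confidence:
            continue
        kept.append(row)
    return kept


# A few fields copied from the output above (question text is truncated there).
example_rows = [
    {"question.question_text": "Will Acme Corporation's full year 2025 earning...",
     "label.label_confidence": 1.0, "is_valid": True},
    {"question.question_text": "Will Acme's conference call to discuss its res...",
     "label.label_confidence": 1.0, "is_valid": True},
    {"question.question_text": "Will Acme's AI-powered analytics platform laun...",
     "label.label_confidence": 1.0, "is_valid": True},
    {"question.question_text": "Will Acme Corporation's Cloud Services divisio...",
     "label.label_confidence": 0.85, "is_valid": True},
]

confident_rows = filter_confident(example_rows)
print(f"Kept {len(confident_rows)} of {len(example_rows)} rows")
```

In the notebook, the same function could be applied to `rows` before building the DataFrame.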
