# Seed Data Creation

## Contribution Overview

### What is a Contribution?

To add knowledge to a model, a user groups the source documents that contain the knowledge into knowledge contributions. A knowledge contribution is made up of:

1. One or more PDF documents that can be described by a contribution summary.
2. A contribution summary.
3. A contribution domain.
4. A unique name used to create a directory in the workspace for artifacts created by each step for the contribution.

Once contributions are set up, a user can go through the data pre-processing workflow.

### What is a Contribution Summary?

In the synthetic data generation step, a model (known as the teacher model) generates synthetic data based on the source document.
The contribution summary and domain are used in the prompts that are sent to the teacher model to create data.

The document gets broken up into [chunks](#Chunking), and each chunk is included in a prompt sent to the teacher model.
The contribution summary provides additional context for each chunk of a source document, ensuring the teacher model has the necessary background information.

Contribution summaries should be specific, avoid acronyms or other vague references, and represent the documents' focus areas.
When a contribution includes many versions of the same document, the contribution summary should include publication dates, volume numbers, or any other identifiers that distinguish between versions.

Here is an example of a contribution summary from a recent paper on [inference-time scaling](https://arxiv.org/pdf/2502.01618):

```
"A Probabilistic Inference Approach to Inference-Time Scaling of Large Language Models (LLMs)"
```

Since the title of the paper does a good job of summarizing the paper, the summary is based on the title, with the acronym LLM spelled out.

Usually, a contribution has only one document. A contribution has multiple documents when the subject matter and format are similar across a group of documents.

An example of a contribution with multiple documents is teaching a model an organization's bylaws for the years 2021, 2022, 2023, and 2024, with a different PDF for each year.

A contribution summary in this case might look like:

`Bylaws of organization Foo from 2021 - 2024`

In the case that there was only one source document from the year 2023, the contribution summary would be:

`2023 Bylaws of organization Foo`

Documents that share the same format can also be grouped into one contribution, such as a furniture company's instruction manuals. The format and layout of the manuals are the same, but each manual covers a different piece of furniture. A contribution summary in this case might look like:

`Furniture company Foo's assembly instructions for tables, desks, and nightstands`

If the contribution only contained a PDF of the assembly instructions for an oak dining table, the summary would be:

`Assembly instructions for furniture company Foo's oak dining table`

### What is a Contribution Domain?

A contribution's domain is the overarching subject or scope of the source document(s). The domain provides critical context to guide the teacher model in generating synthetic data that is relevant and grounded.

The domain should be brief: ideally 1-2 words, and no more than 3.

To determine the domain, users should review the document's primary subject and identify its main topic or purpose.
Consider the intended use of the document and align the domain with the use case or audience. For example, a tech manual for developers might fall under the “software development” domain.

For the contribution summary examples discussed in the previous sections, domains could be `Artificial Intelligence Research`, `Bylaws`, and `Furniture Assembly`.

**Note:** Different contributions can have the same domain.
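
The length guideline for domains can be checked mechanically. The following is a minimal sketch, not part of the notebook's utilities: `check_domain` is a hypothetical helper, and counting words by splitting on whitespace is an assumption about how "words" should be measured.

```python
def check_domain(domain: str, max_words: int = 3) -> int:
    """Return the word count of a domain, raising if it is empty or too long.

    Hypothetical helper: words are counted by splitting on whitespace.
    """
    words = domain.split()
    if not words:
        raise ValueError("domain must not be empty")
    if len(words) > max_words:
        raise ValueError(
            f"domain {domain!r} has {len(words)} words; the limit is {max_words}"
        )
    return len(words)


# The example domains from this section all pass the check
for domain in ["Artificial Intelligence Research", "Bylaws", "Furniture Assembly"]:
    print(domain, "->", check_domain(domain), "word(s)")
```

A check like this could run before the contribution dictionaries are built, so that an overly long domain is caught before it reaches the teacher model prompts.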

In [None]:
from pathlib import Path

# Populated later on
contributions = []
WORKSPACE_DIR = Path("data/sample-contributions")

# Inference Time Scaling Contribution
contribution_name = "inference-time-scaling"
contribution_domain = "Artificial Intelligence Research" 
contribution_summary = "A Probabilistic Inference Approach to Inference-Time Scaling of Large Language Models (LLMs)"

# Collect the contribution's details in a dictionary and add it to the list
knowledge_contribution = {"name": contribution_name, "domain": contribution_domain, "summary": contribution_summary}
contributions.append(knowledge_contribution)

# NFL Rules Contribution
contribution2_name = "nfl"
contribution2_domain = "sports rules" 
contribution2_summary = "Official playing rules of the National Football League 2022, 2023"
knowledge_contribution2 = {"name": contribution2_name, "domain": contribution2_domain, "summary": contribution2_summary}
contributions.append(knowledge_contribution2)

for contribution in contributions:
    contribution_dir = WORKSPACE_DIR / contribution["name"]
    contribution["dir"] = contribution_dir

To create contributions, define a `name`, `domain`, and `summary` for each one. These values go into a dictionary called `knowledge_contribution`, which gets added to a list called `contributions`.

## Seed Data Creation

To start the synthetic data generation process, users need to prepare a diverse set of questions and answers based on chunks from each source document. A chunk together with its question-and-answer pairs is called a seed example.

### Install docling-sdg

The [docling-sdg](https://github.com/docling-project/docling-sdg) project is used to generate question-and-answer pairs for seed examples.

In [None]:
!pip install -qq docling-sdg

### Select the chunks for the seed examples

Chunks for seed examples should be diverse in style. They can be selected by hand, or a diverse subset can be selected from all of the chunks using the [subset selection notebook](https://github.com/instructlab/examples/blob/main/notebooks/instructlab-knowledge/subset-selection.ipynb).

If users are selecting chunks by hand, chunks should be taken directly from lines in `chunks.jsonl`. These lines have `chunk`, `file`, and `metadata` fields for each entry.
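
Hand selection can be as simple as copying chosen lines from `chunks.jsonl` into a new file. The sketch below illustrates that idea under stated assumptions: `select_chunks_by_line` is a hypothetical helper that is not part of `utils.qna_gen`, line numbers are 0-based, and the demo entries are made up.

```python
import json
import tempfile
from pathlib import Path


def select_chunks_by_line(
    chunks_jsonl_path: Path, line_numbers: list[int], output_path: Path
) -> int:
    """Copy the chosen 0-based lines of chunks.jsonl, unchanged, into output_path."""
    lines = chunks_jsonl_path.read_text(encoding="utf-8").splitlines()
    selected = [lines[i] for i in line_numbers]
    for line in selected:
        # Each entry is expected to carry the three fields described above
        missing = {"chunk", "file", "metadata"} - json.loads(line).keys()
        if missing:
            raise ValueError(f"entry is missing fields: {sorted(missing)}")
    output_path.write_text("\n".join(selected) + "\n", encoding="utf-8")
    return len(selected)


# Demo on a throwaway file with made-up entries
with tempfile.TemporaryDirectory() as tmp:
    chunks_path = Path(tmp) / "chunks.jsonl"
    demo = [{"chunk": f"text {i}", "file": "demo.pdf", "metadata": {}} for i in range(5)]
    chunks_path.write_text(
        "\n".join(json.dumps(entry) for entry in demo) + "\n", encoding="utf-8"
    )
    count = select_chunks_by_line(chunks_path, [0, 3], Path(tmp) / "selected_chunks.jsonl")
    print(f"copied {count} chunks")
```

Copying the lines verbatim, rather than re-serializing them, keeps the selected entries identical to the originals.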

The code below randomly selects a preset number of chunks and saves them in a JSONL file for the next step.

In [None]:
from utils.qna_gen import save_random_chunk_selection

NUM_SEED_EXAMPLES = 7

for contribution in contributions:
    chunks_jsonl_path = contribution["dir"] / "chunks.jsonl"
    output_dir = contribution["dir"]

    selected_chunks_jsonl = save_random_chunk_selection(chunks_jsonl_path,
                           output_dir,
                           NUM_SEED_EXAMPLES)
    print(f"selected_chunks.jsonl saved to: {selected_chunks_jsonl}")

### Initialize QA generator model & Number of Seed examples

To generate seed examples, you need to set:
1. The OpenAI-compatible endpoint for the model generating question-and-answer pairs. This endpoint can be local or remote.
2. The model's API key.
3. The model's name.
4. The number of seed examples you wish to generate for each contribution.

In [None]:
import os

API_KEY = os.getenv("MODEL_API_KEY") or ""  # the API access key for your account (cannot be empty)
ENDPOINT_URL = os.getenv("MODEL_ENDPOINT_URL") or "" # the URL of your model's API. URL can end in "/v1"
MODEL_NAME = os.getenv("MODEL_NAME") or "mistralai/Mixtral-8x7B-Instruct-v0.1" # the name of your model
NUM_SEED_EXAMPLES = 7 # set to 7 so that users can pick the 5 they like the most
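
Because the settings above fall back to empty strings when the environment variables are unset, it can help to fail early with a clear message. This is a minimal sketch: `missing_settings` is a hypothetical helper, and the values in `example` are placeholders rather than real credentials.

```python
def missing_settings(settings: dict[str, str]) -> list[str]:
    """Return the names of settings whose values are empty or only whitespace."""
    return [name for name, value in settings.items() if not value.strip()]


# Placeholder values for illustration; in the notebook these would come from
# API_KEY, ENDPOINT_URL, and MODEL_NAME
example = {
    "MODEL_API_KEY": "",
    "MODEL_ENDPOINT_URL": "http://localhost:8000/v1",
    "MODEL_NAME": "mistralai/Mixtral-8x7B-Instruct-v0.1",
}

missing = missing_settings(example)
if missing:
    print(f"Set these before generating seed examples: {', '.join(missing)}")
```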

#### [OPTIONAL] Prompt customization for Q&A Generation

Optionally, insert your own stylistic customization statement below. If `customization_str` is `None`, no customization is attempted and the default QA generation prompt is used.

In [None]:
customization_str = None 

# Example: 
# customization_str = "Write at the fifth grade level."

#### Generate questions and answers and create qna.yaml file

In [None]:
from utils.qna_gen import generate_seed_examples

for contribution in contributions:
    output_dir = contribution["dir"]
    selected_chunks_path = contribution["dir"] / "selected_chunks.jsonl"

    qna_output_path = generate_seed_examples(contribution["name"],
                           selected_chunks_path,
                           output_dir,
                           API_KEY,
                           ENDPOINT_URL,
                           MODEL_NAME,
                           contribution["domain"],
                           contribution["summary"],                  
                           customization_str)
    print(f"qna.yaml saved to: {qna_output_path}")

### Review and Revise Seed Examples

A quality set of seed examples has diverse contexts and question-and-answer pairs across every seed example. You can assess the `qna.yaml` files in your preferred text editor to check the quality, diversity, and style of the generated questions and answers, and modify them accordingly.

In [None]:
from utils.qna_gen import view_seed_example

index = 0 # index of seed example to view. Value must be lower than number of seed examples

# pass in path to qna.yaml file and seed example index to view single seed example
view_seed_example(qna_output_path, index)

After assessment, the `qna.yaml` files can be quickly reviewed to ensure they include the required elements and the correct number of each. It is recommended to have at least 5 seed examples. Each seed example must have 3 question-and-answer pairs.
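
The counts described above can be sketched as a check over an already-parsed `qna.yaml`. This is an illustration only, not the implementation of `review_seed_examples_file`: `review_seed_examples` is a hypothetical function, and the `context` and `questions_and_answers` key names are assumptions about the file layout (only `seed_examples` is named in this notebook).

```python
def review_seed_examples(
    qna: dict, min_seed_examples: int = 5, num_qa_pairs: int = 3
) -> list[str]:
    """Return a list of problems found in a parsed qna.yaml document."""
    problems = []
    seed_examples = qna.get("seed_examples", [])
    if len(seed_examples) < min_seed_examples:
        problems.append(
            f"found {len(seed_examples)} seed examples, "
            f"expected at least {min_seed_examples}"
        )
    for i, example in enumerate(seed_examples):
        # Assumed keys: "context" holds the chunk, "questions_and_answers" the pairs
        if not example.get("context"):
            problems.append(f"seed example {i} has no context")
        pairs = example.get("questions_and_answers", [])
        if len(pairs) != num_qa_pairs:
            problems.append(
                f"seed example {i} has {len(pairs)} question-and-answer pairs, "
                f"expected {num_qa_pairs}"
            )
    return problems


# Made-up data: four complete seed examples and one with only two pairs
pair = {"question": "q", "answer": "a"}
complete = {"context": "chunk text", "questions_and_answers": [pair] * 3}
incomplete = {"context": "chunk text", "questions_and_answers": [pair] * 2}
qna = {"seed_examples": [complete] * 4 + [incomplete]}

for problem in review_seed_examples(qna):
    print(problem)
```

Returning a list of problems, instead of raising on the first one, lets a reviewer fix everything in a single pass over the file.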

In [None]:
from utils.qna_gen import review_seed_examples_file

for contribution in contributions:
    qna_path = contribution["dir"] / "qna.yaml"
    review_seed_examples_file(qna_path, min_seed_examples=5, num_qa_pairs=3)

## Create Seed Dataset for SDG

This step creates the seed data for SDG. This data is a JSON Lines file that contains a combination of the `seed_examples` in the `qna.yaml` and the chunks from the source document.

Intermediate seed data files are created for each contribution, with the contribution's name included in the file name. For example, for the `nfl` contribution, a seed data file called `seed_data-nfl.jsonl` is created. This file contains a combination of all of the chunks from the NFL source documents and the seed examples in the `qna.yaml` for `nfl`.

After seed data files are created for each contribution, a final `seed_data.jsonl` is created. This file is a concatenation of all of the intermediate `seed_data-{contribution name}.jsonl` files and should be used as an input to SDG.
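
At the file level, concatenating JSONL files amounts to appending their lines. The sketch below shows only that idea with the standard library; it is not how `safe_concatenate_datasets` works (that function operates on dataset objects), and `concatenate_jsonl` and the demo rows are hypothetical.

```python
import tempfile
from pathlib import Path


def concatenate_jsonl(input_paths: list[Path], output_path: Path) -> int:
    """Write every non-empty line of the input files to output_path; return the row count."""
    rows = 0
    with output_path.open("w", encoding="utf-8") as out:
        for path in input_paths:
            with path.open(encoding="utf-8") as src:
                for line in src:
                    if line.strip():
                        # Normalize line endings so each row ends with one newline
                        out.write(line.rstrip("\n") + "\n")
                        rows += 1
    return rows


# Demo with two made-up intermediate files
with tempfile.TemporaryDirectory() as tmp:
    first = Path(tmp) / "seed_data-a.jsonl"
    second = Path(tmp) / "seed_data-b.jsonl"
    first.write_text('{"row": 1}\n{"row": 2}\n', encoding="utf-8")
    second.write_text('{"row": 3}\n', encoding="utf-8")
    total = concatenate_jsonl([first, second], Path(tmp) / "seed_data.jsonl")
    print(f"final file has {total} rows")
```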

In [None]:
!pip install -qq datasets transformers

In [None]:
from utils.create_seed_dataset import get_seed_dataset, safe_concatenate_datasets

contribution_datasets = []
for contribution in contributions:
    chunks_dir = contribution["dir"]
    qna_dir = contribution["dir"]
    seed_data = get_seed_dataset(chunks_dir, qna_dir)
    # Write each contribution's seed data into its own directory, named after the contribution
    output_path = f'{contribution["dir"]}/seed_data-{contribution["name"]}.jsonl'
    seed_data.to_json(output_path, orient='records', lines=True)
    contribution_datasets.append(seed_data)
    print(f"Intermediate results saved to: {output_path}")

final_seed_data = safe_concatenate_datasets(contribution_datasets)
output_path = f'{WORKSPACE_DIR}/seed_data.jsonl'
final_seed_data.to_json(output_path, orient='records', lines=True)

print(f"Final seed data contains {final_seed_data.data.num_rows} rows")
print(f"Final seed data for SDG saved to: {output_path}")

### Inspect the seed data

In [None]:
print(seed_data.data.table.slice(length=1))