# Knowledge Contribution

## Knowledge Contribution Overview

To add knowledge to a model, a user groups source documents that contain the knowledge into knowledge contributions. A knowledge contribution is made up of:

1. Chunks of one or more documents.
2. A one-sentence contribution summary that describes the documents.
3. A few-word contribution domain that describes the field of the documents.
4. 5 or more seed examples from the documents.

The contribution summary, domain, and seed examples for a contribution are written to a YAML file. In this notebook, that YAML file will be called `qna.yaml`.

## Initializing a Knowledge Contribution

To get started we need to pick the documents that will be used for our contribution.

There are documents in `sample-contributions/sample-pdfs` that can be grouped into an `nfl` contribution: `2022-nfl-rulebook.pdf` and `2023-nfl-rulebook.pdf`.

Chunks of the two documents live in `sample-contributions/nfl/chunks.jsonl`.

In [1]:
from pathlib import Path

WORKSPACE_DIR = Path("sample-contributions")
contribution_name = "nfl"
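A file with the `.jsonl` extension conventionally holds one JSON value per line. As a minimal sketch, assuming `chunks.jsonl` follows that convention and assuming nothing about the fields inside each record, such a file could be loaded as shown below. The helper name `read_jsonl` is hypothetical and is not part of the notebook's `utils` module.

```python
import json
from pathlib import Path


def read_jsonl(path):
    """Return one parsed JSON value per non-empty line of a JSON Lines file."""
    records = []
    with Path(path).open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # skip blank lines instead of failing on them
                records.append(json.loads(line))
    return records
```

Calling `read_jsonl(WORKSPACE_DIR / contribution_name / "chunks.jsonl")` would then return a list that can be inspected before any chunks are selected.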

## Knowledge Contribution Summary

The chunks as well as the contribution information are the inputs to generating synthetic data. This generated data is used to create a dataset for fine-tuning the model on the contribution.
When generating synthetic data, a model (known as the teacher model) generates data based on the source document.
The contribution summary and domain are used in the prompts that are sent to the teacher model to create data.

Each document is broken up into chunks, and each chunk is included in a prompt sent to the teacher model.
The contribution summary provides additional context for each chunk of a source document, ensuring the teacher model has the necessary background information.
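To make the role of these fields concrete, here is a minimal sketch of how a domain, a summary, and one chunk might be combined into a single prompt. The template wording and the `build_prompt` helper are assumptions made for illustration; they are not the prompts that the synthetic data generation tooling actually uses.

```python
def build_prompt(domain, summary, chunk):
    """Combine contribution metadata and one chunk into a single prompt string.

    The template below is hypothetical and only illustrates the idea.
    """
    return (
        f"Domain: {domain}\n"
        f"Document summary: {summary}\n\n"
        f"Document excerpt:\n{chunk.strip()}\n\n"
        "Generate question and answer pairs grounded in the excerpt above."
    )


prompt = build_prompt(
    "Sports Rules",
    "Official playing rules of the National Football League 2022, 2023",
    "Placeholder text standing in for one chunk of a source document.",
)
print(prompt)
```

Because the summary travels with every chunk, a chunk that never names its source document still reaches the teacher model with that background attached.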

Contribution summaries should be specific, avoid acronyms and vague references, and represent the documents' focus areas.
When a contribution includes many versions of the same document, publication dates, volume numbers, or any other identifiers to distinguish between versions should be included in the contribution summary.

Here is an example of a contribution summary from a recent paper on [inference-time scaling](https://arxiv.org/pdf/2502.01618):

`A Probabilistic Inference Approach to Inference-Time Scaling of Large Language Models (LLMs)`

Since the title of the paper does a good job of summarizing the paper, the summary is based on the title, with the acronym LLM spelled out.

Usually contributions only have one document. Contributions with multiple documents happen when the subject matter and format are similar among a group of documents. 

An example of a contribution having multiple documents would be the desire to teach a model an organization's bylaws over the years 2021, 2022, 2023, 2024, with a different PDF for each year.

A contribution summary in this case might look like:

`Bylaws of organization Foo from 2021 - 2024`

In the case that there was only one source document from the year 2023, the contribution summary would be:

`2023 Bylaws of organization Foo`

Another example of having multiple documents within the same contribution would be if the documents had the same format. An example here could be grouping together a furniture company's instruction manuals. The format and layout of the instruction manuals would be the same across different pieces of furniture, but each manual covers different furniture.

`Furniture company Foo's assembly instructions for tables, desks, and nightstands`

If the contribution only contained a PDF for the assembly instructions for an oak dining table the summary would be:

`Assembly instructions for furniture company Foo's oak dining table`

Based on this overview, let's set the contribution summary of the `nfl` contribution.

In [2]:
contribution_summary = "Official playing rules of the National Football League 2022, 2023"

## Knowledge Contribution Domain

A contribution's domain is the overarching subject or scope of the source document(s). The domain provides critical context to guide the teacher model in generating synthetic data that is relevant and grounded.

The domain should be brief: ideally 1-2 words, and never more than 3 words.

To determine the domain, users should review the document's primary subject and identify its main topic or purpose.
Consider the intended use of the document and align the domain with its use case or audience. For example, a tech manual for developers might fall under the “software development” domain.

For the contribution summary examples discussed in the previous sections, domains could be `Artificial Intelligence Research`, `Bylaws`, and `Furniture Assembly`.

**Note:** Different contributions can have the same domain.

Here is the contribution domain for the `nfl` contribution.

In [3]:
contribution_domain = "Sports Rules" 
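The length guideline for domains is easy to check mechanically. The following is a minimal sketch; `check_domain` is a hypothetical helper, not part of the notebook's `utils` module, and it simply counts whitespace-separated words.

```python
def check_domain(domain, max_words=3):
    """Return the number of words in the domain, raising if it is empty or too long."""
    words = domain.split()
    if not words:
        raise ValueError("The domain must not be empty.")
    if len(words) > max_words:
        raise ValueError(
            f"The domain has {len(words)} words; use at most {max_words}."
        )
    return len(words)


print(check_domain("Sports Rules"))  # prints 2
```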

## Knowledge Contribution Seed Examples

Each knowledge contribution should have at least 5 seed examples. A seed example consists of a `context` and 3 `question` and `answer` pairs. The questions should be typical of what a user of the trained model would ask, and the answers should be in the style that user would like to receive.

### Contexts

A context is a part of the source document that is no more than 500 tokens long. Contexts should vary by text type, which guides the teacher model in handling the varied text types present in the source documents during synthetic data generation.

For example if you have a document with paragraphs of text, bulleted lists, and tables, there should be a context for each type.

After chunking a document, using different kinds of chunks as contexts is a simple way to get contexts for seed examples.
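A candidate context can be screened against the 500-token limit before it is used. The sketch below counts whitespace-separated words as a rough stand-in for tokens. This is an assumption made to keep the example dependency-free: a real tokenizer usually produces a higher count, so a context close to the limit should be re-checked with the tokenizer of the model in use. Both helper names are hypothetical.

```python
def approx_token_count(text):
    """Rough token estimate: the number of whitespace-separated words."""
    return len(text.split())


def fits_context_limit(text, max_tokens=500):
    """Return True when the rough estimate stays within the limit."""
    return approx_token_count(text) <= max_tokens


sample = "Fourth-and-2 on B41. The officials spot the ball at the B39."
print(approx_token_count(sample), fits_context_limit(sample))
```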

Below is a chunk from the NFL rules documents that can be used.

In [4]:
context_1 = """
A.R. 15.260 Line to gain on fourth down. Fourth-and-2 on B41. 
With 3:43 remaining in the fourth quarter, back A2 takes a handoff and runs to the B39 where he is hit and driven backward.
The officials spot the ball at the B39 and award possession to Team B.
Ruling: Reviewable. A's ball first-and-10 on B39, and wind on the ready.
Team A coach must challenge this play outside two minutes of either half.
Replay Official is not responsible for initiating a review for a turnover on downs for plays that start before the two-minute warning.
"""

### Selecting chunks for contexts

Various methods can be used to find diverse chunks to use as contexts. The code below gets a user started by randomly selecting a preset number of chunks and returning them for the next step.

In [5]:
from utils import get_random_chunks

chunks_jsonl_path = WORKSPACE_DIR / contribution_name / "chunks.jsonl"

random_chunks = get_random_chunks(chunks_jsonl_path, num_chunks=4)
print(random_chunks[0])

A.R. 15.266 UNR/UNS enforcement
First-and-10 on A30. QBA1 throws a low pass that is ruled intercepted by B2 at the A43-yard line. B2 returns the ball to the A10yard line where he is tackled by the facemask by A3. Replays show that the ball hit the ground before B2 intercepted it.
Ruling: Reviewable. A's ball second-and-25 on A15, reset the clock to the time when the ball hit the ground. Pass is incomplete but the facemask penalty must be enforced. This applies to any UNR or UNS foul, and it is enforced as a dead ball penalty. Only the Replay Official can initiate a review of this play.
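For readers curious about what random selection involves, here is a minimal sketch that works on an in-memory list of chunk strings. The function `sample_chunks` is a hypothetical stand-in and is not the implementation of `get_random_chunks`. The optional `seed` parameter is an addition made here so that the same chunks can be drawn again on a later run.

```python
import random


def sample_chunks(chunks, num_chunks, seed=None):
    """Pick up to num_chunks distinct chunks at random, reproducibly when seed is set."""
    chunks = list(chunks)
    rng = random.Random(seed)  # a private generator leaves global random state untouched
    return rng.sample(chunks, k=min(num_chunks, len(chunks)))


example_chunks = [f"chunk {number}" for number in range(10)]
print(sample_chunks(example_chunks, num_chunks=4, seed=0))
```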


### Question and Answer Pairs

Each context must have 3 question and answer pairs. The purpose of these question and answer pairs is to demonstrate to the teacher model the style and structure of questions and answers to generate for the given text type. 

The questions and the answers must be grounded in the context. Answers should be detailed rather than short, and single-word answers should be avoided, because detailed answers set the tone for the teacher model to generate comprehensive responses during synthetic data generation. Detailed answers also cover more information from the context.

Here is a context and 3 question and answer pairs for the `nfl` documents.

In [6]:
nfl_qna1 = [
    {
        "question": "What is the distance Team A needs to gain on fourth down?",
        "answer": "Team A needs to gain 2 yards on fourth down."
    },
    {
        "question": "What is the final spot of the ball and the team that is awarded possession after the play?",
        "answer": "The ball is spotted at the B39 and Team B is awarded possession."
    },
    {
        "question": "Why might the Team A coach choose to challenge the play?",
        "answer": "The coach might choose to challenge the play because it results in a turnover on downs, which is reviewable, and the replay official will not initiate a review for such plays that start before the two-minute warning."
    }
]
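Hand-written pairs can be screened for the guidelines above before they are used. The sketch below flags a wrong number of pairs, empty questions, and very short answers. The helper `find_qa_problems` is hypothetical, and the four-word minimum is an arbitrary threshold chosen for illustration. Whether an answer is grounded in the context still needs a human reviewer.

```python
def find_qa_problems(qa_pairs, expected_pairs=3, min_answer_words=4):
    """Return a list of human-readable problems; an empty list means none were found."""
    problems = []
    if len(qa_pairs) != expected_pairs:
        problems.append(f"expected {expected_pairs} pairs, found {len(qa_pairs)}")
    for number, pair in enumerate(qa_pairs, start=1):
        question = (pair.get("question") or "").strip()
        answer = (pair.get("answer") or "").strip()
        if not question:
            problems.append(f"pair {number}: the question is missing")
        if len(answer.split()) < min_answer_words:
            problems.append(f"pair {number}: the answer is too short")
    return problems


print(find_qa_problems([{"question": "Who gets the ball?", "answer": "Team B."}]))
```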

### Autogenerating Question and Answer Pairs

Contexts can be used to generate question and answer pairs. Here we take 4 more contexts and send them to an LLM served on an OpenAI-compatible endpoint. First we need to ensure the `openai` library is installed.

In [7]:
!pip install -qq openai

Next we need to set the URL, key, and model name for the OpenAI-compatible endpoint.

In [8]:
import os

API_KEY = os.getenv("MODEL_API_KEY") or ""  # the API access key for your account (cannot be empty)
ENDPOINT_URL = os.getenv("MODEL_ENDPOINT_URL") or ""  # the URL of your model's API; the URL can end in "/v1"
MODEL_NAME = os.getenv("MODEL_NAME") or "" # the name of your model
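Since empty values only surface as errors once a request is sent, it can help to check the settings up front. This is a minimal sketch; `missing_settings` is a hypothetical helper, and the environment variable names are the ones read in the cell above.

```python
import os


def missing_settings(settings):
    """Return the names of settings whose values are empty, None, or whitespace."""
    return [name for name, value in settings.items() if not (value or "").strip()]


settings = {
    name: os.getenv(name, "")
    for name in ("MODEL_API_KEY", "MODEL_ENDPOINT_URL", "MODEL_NAME")
}
missing = missing_settings(settings)
if missing:
    print("Set these environment variables before continuing:", ", ".join(missing))
```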

Finally we need to set the prompt used to generate the question and answer pairs. The simple prompt provided should be used as a starting point.

In [13]:
from utils import generate_qa_pairs

qa_gen_prompt = """
Read the following paragraph and generate exactly 3 question-answer pairs.
Format your response as a JSON list of objects, each with 'question' and 'answer'.
"""
seed_examples = [{"context": context_1, "questions_and_answers": nfl_qna1}]

for chunk in random_chunks:
    # TODO:
    # 1. ensure what is returned by generate_qa_pairs is a dict with the key "questions_and_answers"
    # 2. pretty print context + question and answer pairs for users to view
    qa_pairs = generate_qa_pairs(qa_gen_prompt, chunk, API_KEY, ENDPOINT_URL, MODEL_NAME)
    seed_examples.append({"context": chunk, "questions_and_answers": qa_pairs.get("questions_and_answers")})
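The first TODO item above concerns the shape of the returned data. As a minimal sketch, assuming the model replies with exactly the JSON list that the prompt requests, a raw reply could be parsed and validated as shown below. The helper `parse_qa_response` is hypothetical and is not part of `utils`. Models sometimes wrap JSON in extra text, which this sketch rejects rather than repairs.

```python
import json


def parse_qa_response(raw_text, expected_pairs=3):
    """Parse a JSON list of question/answer objects into a dict keyed by "questions_and_answers"."""
    items = json.loads(raw_text)  # raises json.JSONDecodeError (a ValueError) on non-JSON replies
    if not isinstance(items, list) or len(items) != expected_pairs:
        raise ValueError(f"expected a JSON list of {expected_pairs} objects")
    pairs = []
    for item in items:
        if not isinstance(item, dict) or "question" not in item or "answer" not in item:
            raise ValueError("each item needs 'question' and 'answer' keys")
        pairs.append(
            {
                "question": str(item["question"]).strip(),
                "answer": str(item["answer"]).strip(),
            }
        )
    return {"questions_and_answers": pairs}
```

Stripping whitespace here also removes stray leading spaces that would otherwise end up in the YAML file.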

### Creating the YAML file

Now we will combine all of the contribution information to create a YAML file called `qna.yaml` that contains all of the seed examples. This file is needed to create the seed_data input for SDG-Hub.

In [14]:
from utils import create_knowledge_qna_yaml

qna_yaml_path = create_knowledge_qna_yaml(WORKSPACE_DIR / contribution_name, contribution_domain, contribution_summary, seed_examples)

The `qna.yaml` file can be quickly reviewed to ensure it includes the required elements and the correct number of each. It is recommended to have at least 5 seed examples. Each seed example must have 3 question and answer pairs.

In [15]:
from utils import review_seed_examples_file

print(qna_yaml_path)
qna_yaml_path = Path("sample-contributions/nfl/qna.yaml")
review_seed_examples_file(qna_yaml_path, min_seed_examples=5, num_qa_pairs=3)

sample-contributions/nfl/qna.yaml
Reviewing seed examples file at /Users/amaredia/dev/odh-data-processing/notebooks/model-customization-data-preprocessing/knowledge-contribution/sample-contributions/nfl/qna.yaml
Found contribution summary...
Found 'domain'...
Found 5 'contexts' in 'seed_examples'. Minimum expected number is 5...
Seed Example 1 contains expected number (3) of 'question_and_answers'...
Seed Example 2 contains expected number (3) of 'question_and_answers'...
Seed Example 3 contains expected number (3) of 'question_and_answers'...
Seed Example 4 contains expected number (3) of 'question_and_answers'...
Seed Example 5 contains expected number (3) of 'question_and_answers'...
Seed Examples YAML /Users/amaredia/dev/odh-data-processing/notebooks/model-customization-data-preprocessing/knowledge-contribution/sample-contributions/nfl/qna.yaml is valid :)
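The same counts can be checked on the in-memory `seed_examples` list before anything is written to disk. The sketch below is a hypothetical helper, not the implementation of `review_seed_examples_file`. It relies only on the `context` and `questions_and_answers` keys used when the list was built earlier in this notebook.

```python
def review_seed_examples(seed_examples, min_seed_examples=5, num_qa_pairs=3):
    """Return a list of problems found; an empty list means the checks passed."""
    problems = []
    if len(seed_examples) < min_seed_examples:
        problems.append(
            f"found {len(seed_examples)} seed examples, "
            f"expected at least {min_seed_examples}"
        )
    for number, example in enumerate(seed_examples, start=1):
        if not str(example.get("context") or "").strip():
            problems.append(f"seed example {number}: missing context")
        # A failed generation step may leave None here, so fall back to an empty list.
        pairs = example.get("questions_and_answers") or []
        if len(pairs) != num_qa_pairs:
            problems.append(
                f"seed example {number}: found {len(pairs)} "
                f"question and answer pairs, expected {num_qa_pairs}"
            )
    return problems
```

Running this check right after the generation loop catches a missing or short set of pairs early, while the offending chunk is still easy to regenerate.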


