# Seed Data Creation for SDG

This step creates the seed data for synthetic data generation (SDG). The seed data is a JSON Lines (`.jsonl`) file that combines the `seed_examples` from the `qna.yaml` with the chunks from the source document.
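To make the combination concrete, here is a minimal sketch. It is not the notebook's `get_seed_dataset` implementation. It assumes that every chunk is paired with every seed example, and that each seed example carries a `context` plus a `questions_and_answers` list with `question` and `answer` keys. The helper `build_seed_rows` and all sample values are hypothetical; the output column names follow the schema printed in the inspection step of this notebook.

```python
from itertools import product


def build_seed_rows(chunks, seed_examples, domain, document_title, document_outline):
    """Pair every document chunk with every seed example (hypothetical helper)."""
    rows = []
    for chunk, example in product(chunks, seed_examples):
        row = {
            "document": chunk,
            "document_outline": document_outline,
            "document_title": document_title,
            "domain": domain,
            "icl_document": example["context"],
        }
        # Number the question/answer pairs from 1: icl_query_1, icl_response_1, ...
        for i, qa in enumerate(example["questions_and_answers"], start=1):
            row[f"icl_query_{i}"] = qa["question"]
            row[f"icl_response_{i}"] = qa["answer"]
        rows.append(row)
    return rows


# Made-up inputs: two chunks and two seed examples
chunks = ["chunk A", "chunk B"]
seed_examples = [
    {"context": "ctx 1", "questions_and_answers": [{"question": "q1", "answer": "a1"}]},
    {"context": "ctx 2", "questions_and_answers": [{"question": "q2", "answer": "a2"}]},
]

rows = build_seed_rows(chunks, seed_examples, "Sports Rules", "example-title", "example outline")
print(len(rows))  # 2 chunks x 2 seed examples = 4 rows
```

Under this assumption the row count grows as the number of chunks times the number of seed examples, which is worth keeping in mind when a contribution has many source documents.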

An intermediate seed data file is created for each contribution, with the contribution's name included in the file name. For example, the `nfl` contribution produces a seed data file called `seed_data-nfl.jsonl`. This file combines all of the chunks from the NFL source documents with the seed examples in the `qna.yaml` for `nfl`.

After a seed data file has been created for each contribution, a final `seed_data.jsonl` is created. This file is the concatenation of all of the intermediate `seed_data-{contribution name}.jsonl` files and should be used as the input to SDG.
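The concatenation step can be sketched with the standard library alone. This is an illustration, not the notebook's `safe_concatenate_datasets`, which works on dataset objects and may do more than append rows. The helper `concatenate_jsonl`, the second contribution name, and the sample records are hypothetical.

```python
import json
import tempfile
from pathlib import Path


def concatenate_jsonl(input_paths, output_path):
    """Append the records of several JSONL files into one file; return the row count."""
    total = 0
    with open(output_path, "w", encoding="utf-8") as out:
        for path in input_paths:
            with open(path, encoding="utf-8") as src:
                for line in src:
                    line = line.strip()
                    if not line:
                        continue  # skip blank lines
                    json.loads(line)  # fail early on a malformed record
                    out.write(line + "\n")
                    total += 1
    return total


# Demo with two tiny made-up contributions in a temporary directory
workdir = Path(tempfile.mkdtemp())
(workdir / "seed_data-nfl.jsonl").write_text(
    '{"document": "a"}\n{"document": "b"}\n', encoding="utf-8"
)
(workdir / "seed_data-other.jsonl").write_text('{"document": "c"}\n', encoding="utf-8")

# The hyphen in the pattern keeps the final seed_data.jsonl out of the inputs
inputs = sorted(workdir.glob("seed_data-*.jsonl"))
total_rows = concatenate_jsonl(inputs, workdir / "seed_data.jsonl")
print(total_rows)  # 3
```

A plain line-by-line append like this only makes sense when every intermediate file shares the same columns.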

In [1]:
!pip install -qq datasets transformers

In [9]:
from utils.create_seed_dataset import get_seed_dataset, safe_concatenate_datasets
from pathlib import Path

chunks_dir = Path("sample-contributions/nfl")
yaml_dir = Path("sample-contributions/nfl")

seed_data = get_seed_dataset(chunks_dir, yaml_dir)
output_path = Path("sample-contributions/nfl/seed_data-nfl.jsonl")
seed_data.to_json(output_path, orient='records', lines=True)

# If you have more than one contribution, collect each contribution's dataset
# in a list and concatenate them into the final seed data file.
# (This assumes WORKSPACE_DIR has been defined as your workspace path.)
#contribution_datasets = [seed_data]
#final_seed_data = safe_concatenate_datasets(contribution_datasets)
#output_path = f'{WORKSPACE_DIR}/seed_data.jsonl'
#final_seed_data.to_json(output_path, orient='records', lines=True)

print(f"Final seed data contains {seed_data.data.num_rows} rows")
print(f"Final seed data for SDG saved to: {output_path}")

Final seed data contains 245 rows
Final seed data for SDG saved to: sample-contributions/nfl/seed_data-nfl.jsonl
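Before passing the file to SDG, it can be worth reading it back and checking that every row has the columns you expect. The following is a minimal sketch using only the standard library. The helper `check_seed_file` is hypothetical, the demo records are made up, and the column names are taken from the schema printed in the inspection step of this notebook.

```python
import json
import tempfile
from pathlib import Path


def check_seed_file(path, required_columns):
    """Return the row count; raise ValueError if a row lacks a required column."""
    count = 0
    with open(path, encoding="utf-8") as f:
        for line_number, line in enumerate(f, start=1):
            if not line.strip():
                continue  # skip blank lines
            record = json.loads(line)
            missing = [c for c in required_columns if c not in record]
            if missing:
                raise ValueError(f"line {line_number} is missing {missing}")
            count += 1
    return count


# Demo on a small made-up file; point `path` at your own seed data file instead
path = Path(tempfile.mkdtemp()) / "seed_data-demo.jsonl"
records = [
    {"document": "chunk A", "domain": "Sports Rules", "icl_document": "ctx 1"},
    {"document": "chunk B", "domain": "Sports Rules", "icl_document": "ctx 1"},
]
path.write_text("".join(json.dumps(r) + "\n" for r in records), encoding="utf-8")

row_count = check_seed_file(path, ["document", "domain", "icl_document"])
print(row_count)  # 2
```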


### Inspect the seed data

In [8]:
print(seed_data.data.table.slice(length=1))

pyarrow.Table
document: string
document_outline: string
document_title: string
domain: string
icl_document: string
icl_query_1: string
icl_response_1: string
icl_query_2: string
icl_response_2: string
icl_query_3: string
icl_response_3: string
----
document: [["2022 OFFICIAL PLAYING RULES OF THE NATIONAL FOOTBALL LEAGUE
Roger Goodell, Commissioner"]]
document_outline: [["Official playing rules of the National Football League 2022, 2023"]]
document_title: [["2022-nfl-rulebook"]]
domain: [["Sports Rules"]]
icl_document: [["
A.R. 15.260 Line to gain on fourth down. Fourth-and-2 on B41. 
With 3:43 remaining in the fourth  (... 443 chars omitted)"]]
icl_query_1: [["What is the distance Team A needs to gain on fourth down?"]]
icl_response_1: [["Team A needs to gain 2 yards on fourth down."]]
icl_query_2: [[" What is the final spot of the ball and the team that is awarded possession
after the play?"]]
icl_response_2: [["The ball is spotted at the B39 and Team B is awarded possession."]]
icl_q