<a href="https://colab.research.google.com/github/lmassaron/fine-tuning-workshop/blob/main/02_synthetic_data_preparation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

The **`wikipedia-api`** package is a Python library that makes it easy to access and retrieve data from Wikipedia.

It is a convenient **wrapper** around Wikipedia's official API. Instead of having to deal with complex web requests and raw data formats (like JSON), this package provides simple Python functions to:

*   **Get a Wikipedia page:** Fetch the full text, summary, and other details of a specific article.
*   **Search for pages:** Find articles related to a search query.
*   **Handle multiple languages:** Easily switch between different language editions of Wikipedia (e.g., 'en' for English, 'es' for Spanish).
*   **Manage categories and links:** List all pages in a category or all links on a page.

The **`synthetic-data-kit`** is a powerful command-line interface (CLI) tool, developed by Meta, designed to streamline and accelerate the process of creating high-quality, synthetic datasets for fine-tuning Large Language Models (LLMs).

The primary problem it solves is the lack of high-quality, domain-specific data needed for customizing a general-purpose LLM for a particular task or industry (like law, medicine, or finance).

The package is built around a simple and modular 4-command workflow that takes you from raw documents to a ready-to-use fine-tuning dataset.

1.  **`ingest`**: This first step takes your raw data from various sources and converts it into a standardized text format. It can handle a wide range of file types, including:
    *   PDFs (`.pdf`)
    *   Word Documents (`.docx`)
    *   PowerPoint Presentations (`.ppt`)
    *   Web pages (`.html`)
    *   YouTube video transcripts
    *   Plain text (`.txt`)

2.  **`create`**: This is the core data generation step. It takes the ingested text, intelligently splits it into manageable chunks, and then uses a powerful LLM (like Llama 3) to generate new, synthetic data based on that text. You can instruct it to create different types of datasets, such as:
    *   **Question-Answer (QA) pairs**: Ideal for building chatbots and Q&A systems.
    *   **Reasoning Traces / Chain-of-Thought (CoT)**: Creates examples that show the step-by-step reasoning process, which is useful for improving a model's logical abilities.
    *   **Summaries**: Generates summaries of the text chunks.

3.  **`curate`**: To ensure high quality, this command uses an LLM as a "judge" to review the synthetically generated examples. It filters out low-quality or irrelevant pairs, ensuring that the final dataset is clean and effective for fine-tuning.

4.  **`save-as`**: The final step is to export the curated data into a format that is compatible with popular fine-tuning libraries and workflows. You can save the dataset as:
    *   Hugging Face Datasets
    *   JSON or JSONL files

In essence, the `synthetic-data-kit` provides an end-to-end, customizable pipeline for turning your private documents or domain-specific knowledge into a structured dataset that can be used to make a powerful, general LLM an expert in your specific area of interest.

**`unsloth`** is a high-performance Python library designed to make **fine-tuning Large Language Models (LLMs) dramatically faster and more memory-efficient.**

It is a powerful optimization layer that sits on top of popular libraries like Hugging Face's `transformers`, PyTorch, and PEFT (for LoRA).

Unsloth achieves its incredible performance by re-implementing the most computationally intensive parts of the training process from scratch.

1.  **Custom GPU Kernels:** It replaces standard PyTorch operations with its own highly optimized code written in Triton (a language for writing efficient GPU code).
2.  **Manual Autograd Engine:** Instead of using PyTorch's general-purpose automatic differentiation (autograd), Unsloth uses a specialized, manual backpropagation engine that is tailored *specifically* for training LLMs with LoRA. This eliminates a massive amount of overhead.

This results in:

*   **Massive Speedup:** It can make your fine-tuning process **2-5 times faster** than a standard implementation. A training job that took 10 hours might now take only 2-5 hours.
*   **Drastic Memory Reduction:** It reduces VRAM usage by **up to 80%**. This is its most significant advantage. It allows you to:
    *   Fine-tune much larger models on consumer-grade GPUs (like an RTX 3090 or a free Google Colab T4).
    *   Use a much larger batch size, which can further speed up training and improve model performance.
*   **No Performance Loss:** These optimizations are achieved without sacrificing the final model's accuracy or performance.
*   **Easy to Use:** It's designed as a "drop-in" replacement. You typically only need to change a few lines of your existing Hugging Face training script to enable it. For example, you replace `AutoModelForCausalLM` with Unsloth's `FastLanguageModel`.

In [None]:
%%capture
import os
!pip install --upgrade -qqq uv
if "COLAB_" not in "".join(os.environ.keys()):
    # If you're not in Colab, just use pip install!
    !pip install unsloth vllm synthetic-data-kit==0.0.3
else:
    try: import numpy; get_numpy = f"numpy=={numpy.__version__}"
    except: get_numpy = "numpy"
    try: import subprocess; is_t4 = "Tesla T4" in str(subprocess.check_output(["nvidia-smi"]))
    except: is_t4 = False
    get_vllm, get_triton = ("vllm==0.9.2", "triton==3.2.0") if is_t4 else ("vllm==0.10.2", "triton")
    !uv pip install -qqq --upgrade         unsloth {get_vllm} {get_numpy} torchvision bitsandbytes xformers
    !uv pip install -qqq {get_triton}
    !uv pip install synthetic-data-kit==0.0.3
!uv pip install transformers==4.55.4
!uv pip install --no-deps trl==0.22.2
!uv pip install wikipedia-api

In [None]:
import os
import time
import re
import itertools
from collections import Counter

import numpy as np
import pandas as pd
from tqdm import tqdm
import wikipediaapi
from unsloth.dataprep import SyntheticDataKit
import huggingface_hub
from datasets import Dataset, DatasetDict, ClassLabel

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
INFO 09-25 08:31:23 [__init__.py:244] Automatically detected platform cuda.
ERROR 09-25 08:31:30 [fa_utils.py:57] Cannot use FA version 2 is not supported due to FA2 is only supported on devices with compute capability >= 8
🦥 Unsloth Zoo will now patch everything to make training faster!


In [None]:
DEMO = True

In [None]:
# Pre-compile the regular expression pattern for better performance
BRACES_PATTERN = re.compile(r'\{.*?\}|\}')

def remove_braces_and_content(text):
    """Remove all occurrences of curly braces and their content from the given text"""
    return BRACES_PATTERN.sub('', text)

def clean_string(input_string):
    """Clean the input string."""

    # Collapse every run of whitespace (spaces, tabs, \r, \n) into a single space
    # and strip leading/trailing whitespace
    cleaned_string = ' '.join(input_string.split())

    # Remove all occurrences of curly braces and their content from the cleaned string
    cleaned_string = remove_braces_and_content(cleaned_string)

    # Return the cleaned string
    return cleaned_string
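A quick check of what the cleaning helpers do to a messy string. The helpers are restated in compact form so the snippet runs on its own, and the sample string is made up for illustration.

```python
import re

# Same pattern as above: a brace group with its content, or a stray closing brace
BRACES_PATTERN = re.compile(r'\{.*?\}|\}')

def clean_string(input_string):
    """Collapse whitespace, then drop brace groups and stray closing braces."""
    collapsed = ' '.join(input_string.split())
    return BRACES_PATTERN.sub('', collapsed)

# Hypothetical sample with extra spaces, a line break, a LaTeX-style group and a stray brace
sample = "Holmes  lived\r\n at {\\displaystyle x} 221B }Baker Street"
cleaned = clean_string(sample)
print(repr(cleaned))
```

Because the braces are removed after the whitespace is collapsed, the spot where a brace group used to be can keep two adjacent spaces. Running `' '.join(text.split())` once more on the result would tidy that up if it matters for your data.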

This function **finds a specific category on Wikipedia and returns the titles of all its members**: both articles and subcategories (subcategory titles keep their `Category:` prefix).


1.  **`category = wiki_wiki.page("Category:" + category_name)`**: It first finds the specific "Category" page on Wikipedia (e.g., "Category:Physics").

2.  **`if category.exists():`**: It checks to make sure this category page actually exists to avoid errors.

3.  **`for article in category.categorymembers.values():`**: If the category exists, it loops through all the "members" (the articles and subcategories) listed on that page.

4.  **`pages.append(article.title)`**: Inside the loop, it takes the `title` of each member and adds it to the `pages` list.

5.  **`return pages`**: Finally, it returns the complete list of collected titles.

In [None]:
def extract_wikipedia_pages(wiki_wiki, category_name):
    """Extract all references from a category on Wikipedia"""

    # Get the Wikipedia page corresponding to the provided category name
    category = wiki_wiki.page("Category:" + category_name)

    # Initialize an empty list to store page titles
    pages = []

    # Check if the category exists
    if category.exists():
        # Iterate through each article in the category and append its title to the list
        for article in category.categorymembers.values():
            pages.append(article.title)

    # Return the list of page titles
    return pages

The main purpose of this function is to build a custom dataset of text about specific topics by **crawling Wikipedia categories and their subcategories.** It starts with a few main topics, collects their member articles and subcategories, and then extracts and cleans the text from every unique article it discovers.

It works in four main phases:

### Phase 1: Initial Crawl
It begins by looping through the initial list of `categories` you provide (e.g., "Physics"). For each one, it grabs the titles of all the immediate member articles and subcategories.

### Phase 2: Cleanup and Preparation
After the first pass, it separates the articles from the subcategories it found. It puts the subcategories (anything with "Category:" in the title) into a new "to-do" list called `categories_to_explore` and makes sure the main `wikipedia_pages` list only contains unique article titles.

### Phase 3: Subcategory Crawl
This is the core crawling logic. It uses a `while` loop that continues as long as there are subcategories left in the `categories_to_explore` list. In each loop, it:
1.  Pops a subcategory off the list.
2.  Finds all *its* members.
3.  Adds any new articles it finds to the master `wikipedia_pages` list.
4.  Records any *sub-subcategories* it finds in `explored_categories`. As written, the code does not add them to `categories_to_explore`, so they are not crawled.

The loop ends once every subcategory collected in Phase 2 has been processed, leaving a list of unique article titles from the starting categories and their direct subcategories.

### Phase 4: Text Extraction and Cleaning
Finally, once it has the complete list of article titles, it loops through each one and:
1.  Downloads the page content.
2.  Performs a simple check to filter out unwanted topics (it skips pages whose summary mentions "Biden" or "Trump").
3.  Extracts the text from the page's **summary** and each of its top-level **sections**, skipping any text that is not longer than the page title.
4.  Prefixes each text with the page title, cleans it, and adds it to the final `extracted_texts` list, which is then returned.

In [None]:
def get_wikipedia_pages(categories):
    """Retrieve Wikipedia pages from a list of categories and extract their content"""

    # Create a Wikipedia object
    wiki_wiki = wikipediaapi.Wikipedia('Gemma AI Assistant (gemma@example.com)', 'en')

    # Initialize lists to store explored categories and Wikipedia pages
    explored_categories = []
    wikipedia_pages = []

    # Iterate through each category
    print("- Processing Wikipedia categories:")
    for category_name in categories:
        print(f"\tExploring {category_name} on Wikipedia")

        # Collect the titles of the category's members (articles and subcategories)
        wikipedia_pages.extend(extract_wikipedia_pages(wiki_wiki, category_name))

        # Add the explored category to the list
        explored_categories.append(category_name)

    # Extract subcategories and remove duplicate categories
    categories_to_explore = [item.replace("Category:", "") for item in wikipedia_pages if "Category:" in item]
    wikipedia_pages = list(set([item for item in wikipedia_pages if "Category:" not in item]))

    # Explore the subcategories collected above (one level below the starting categories)
    while categories_to_explore:
        category_name = categories_to_explore.pop()
        print(f"\tExploring {category_name} on Wikipedia")

        # Extract the members of the subcategory
        more_refs = extract_wikipedia_pages(wiki_wiki, category_name)

        # Iterate through the members
        for ref in more_refs:
            # Check if the member is itself a category
            if "Category:" in ref:
                new_category = ref.replace("Category:", "")
                # Record the nested category; it is not queued in
                # categories_to_explore, so the crawl does not descend into it
                if new_category not in explored_categories:
                    explored_categories.append(new_category)
            else:
                # Add the article title to the Wikipedia pages list
                if ref not in wikipedia_pages:
                    wikipedia_pages.append(ref)

    # Initialize a list to store extracted texts
    extracted_texts = []

    # Iterate through each Wikipedia page
    print("- Processing Wikipedia pages:")
    for page_title in tqdm(wikipedia_pages):
        try:
            # Make a request to the Wikipedia page
            page = wiki_wiki.page(page_title)

            # Check if the page summary does not contain certain keywords
            if "Biden" not in page.summary and "Trump" not in page.summary:
                # Append the page title and summary to the extracted texts list
                if len(page.summary) > len(page.title):
                    extracted_texts.append(page.title + " : " + clean_string(page.summary))

                # Iterate through the sections in the page
                for section in page.sections:
                    # Append the page title and section text to the extracted texts list
                    if len(section.text) > len(page.title):
                        extracted_texts.append(page.title + " : " + clean_string(section.text))

        except Exception as e:
            # Use page_title here: `page` may be unset if the lookup itself failed
            print(f"Error processing page {page_title}: {e}")

    # Return the extracted texts
    return extracted_texts
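If you want the crawl to descend further than one level, a depth-limited breadth-first traversal is one way to do it. The sketch below is an illustration, not part of the notebook: `crawl_categories`, `max_depth`, `FakeWiki` and `FakePage` are hypothetical names, and the stub only mimics the three members the notebook relies on (`page()`, `exists()` and `categorymembers`). A real `wikipediaapi.Wikipedia` object is assumed to work in its place.

```python
from collections import deque

def crawl_categories(wiki, root_categories, max_depth=2):
    """Breadth-first crawl: collect article titles, descending at most max_depth levels."""
    queue = deque((name, 0) for name in root_categories)
    seen_categories = set(root_categories)   # guards against cycles between categories
    seen_articles = set()
    articles = []
    while queue:
        name, depth = queue.popleft()
        page = wiki.page("Category:" + name)
        if not page.exists():
            continue
        for title in page.categorymembers:   # dict keyed by member title
            if title.startswith("Category:"):
                sub = title[len("Category:"):]
                # Queue the subcategory only if we may go deeper and have not seen it
                if depth < max_depth and sub not in seen_categories:
                    seen_categories.add(sub)
                    queue.append((sub, depth + 1))
            elif title not in seen_articles:
                seen_articles.add(title)
                articles.append(title)
    return articles

class FakePage:
    """Hypothetical stand-in for a wikipediaapi page object."""
    def __init__(self, members):
        self._members = members
    def exists(self):
        return self._members is not None
    @property
    def categorymembers(self):
        return {title: None for title in self._members}

class FakeWiki:
    """Hypothetical stand-in for wikipediaapi.Wikipedia, backed by a dict."""
    def __init__(self, graph):
        self.graph = graph
    def page(self, title):
        return FakePage(self.graph.get(title))

# Made-up category graph with a cycle (B links back to A)
graph = {
    "Category:A": ["Article 1", "Category:B"],
    "Category:B": ["Article 2", "Category:C", "Category:A"],
    "Category:C": ["Article 3"],
}
wiki = FakeWiki(graph)
one_level = crawl_categories(wiki, ["A"], max_depth=1)
two_levels = crawl_categories(wiki, ["A"], max_depth=2)
print(one_level)
print(two_levels)
```

A depth cap matters on Wikipedia because category trees can fan out quickly and sometimes loop back on themselves; the `seen_categories` set handles the loops and `max_depth` bounds the fan-out.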

In [None]:
categories = [
    "Sherlock_Holmes",
    "Arthur_Conan_Doyle",
    "A_Scandal_in_Bohemia",
    "The_Adventures_of_Sherlock_Holmes",
    "A_Study_in_Scarlet",
    "The_Sign_of_the_Four",
    "The_Memoirs_of_Sherlock_Holmes",
    "The_Hound_of_the_Baskervilles",
    "The_Return_of_Sherlock_Holmes",
    "The_Valley_of_Fear",
    "His_Last_Bow",
    "The_Case-Book_of_Sherlock_Holmes",
    "Canon_of_Sherlock_Holmes",
    "Dr._Watson",
    "221B_Baker_Street",
    "Mrs._Hudson",
    "Professor_Moriarty",
    "The_Strand_Magazine",
    "Minor_Sherlock_Holmes_characters",
    "Inspector_Lestrade",
    "Mycroft_Holmes",
    "Irene_Adler",
    "Colonel_Moran",
    "Baker_Street_Irregulars",
    "Giant_rat_of_Sumatra",
    "The_Story_of_the_Lost_Special",
    "How_Watson_Learned_the_Trick",
    "Diogenes_Club",
    "The_Dynamics_of_an_Asteroid",
    "Reichenbach_Falls",
    "A_Treatise_on_the_Binomial_Theorem",
    "Sherlockian_game",
    "List_of_Holmesian_studies",
    "The_New_Annotated_Sherlock_Holmes",
    "The_Private_Life_of_Sherlock_Holmes_(book)",
    "The_Great_Detective_(book)",
    "Naked_Is_the_Best_Disguise",
    "Sherlock_Holmes_fandom",
    "Sherlockiana",
    "Sherlock_Holmes_Museum",
    "The_Sherlock_Holmes",
    "The_Baker_Street_Irregulars",
    "The_Baker_Street_Journal",
    "Sidney_Paget",
    "Undershaw",
    "Adaptations_of_Sherlock_Holmes",
    "Sherlock_Holmes_pastiches",
    "Popular_culture_references_to_Sherlock_Holmes",
]

if DEMO:
    categories = ["Sherlock_Holmes"]

extracted_texts = get_wikipedia_pages(categories)
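# Note: this count is the number of extracted text passages (summaries and sections), not distinct pages
print("Found", len(extracted_texts), "Wikipedia pages")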

- Processing Wikipedia categories:
	Exploring Sherlock_Holmes on Wikipedia
	Exploring Writers of Sherlock Holmes pastiches on Wikipedia
	Exploring Works based on Sherlock Holmes on Wikipedia
	Exploring Sherlock Holmes short story collections on Wikipedia
	Exploring Sherlock Holmes short stories on Wikipedia
	Exploring Sherlock Holmes audio adaptations on Wikipedia
	Exploring Sherlock Holmes scholars on Wikipedia
	Exploring Sherlock Holmes novels on Wikipedia
	Exploring Sherlock Holmes navigational boxes on Wikipedia
	Exploring Sherlock Holmes lists on Wikipedia
	Exploring Dartmoor on Wikipedia
	Exploring Sherlock Holmes characters on Wikipedia
	Exploring Baker Street on Wikipedia
- Processing Wikipedia pages:


100%|██████████| 459/459 [01:04<00:00,  7.08it/s]

Found 2042 Wikipedia pages





In [None]:
output_dir = 'data/output'
os.makedirs(output_dir, exist_ok=True)

for k, text in enumerate(extracted_texts):
    file_path = os.path.join(output_dir, f'sherlock_{k}.txt')
    with open(file_path, 'w', encoding='utf-8') as f:
        f.write(text)
    if DEMO and k > 9:
        break

print("All texts have been saved successfully.")

All texts have been saved successfully.


In [None]:
filenames = [f"data/output/{file}" for file in os.listdir("data/output")]

This cell initializes the `SyntheticDataKit` tool:

1.  **Loads the "engine":** It downloads and sets up the language model that will act as the "engine" for creating the data. It's using **`"unsloth/Llama-3.2-3B-Instruct"`**, the copy of Meta's Llama 3.2 3B Instruct model hosted on **Unsloth**'s Hugging Face page.

2.  **Sets a Limit:** It configures the `max_seq_length` to **2048 tokens**. This is the maximum context length the model is served with (it appears as `max_model_len` in the system check output), so it caps how much text can be handled in one request. Longer limits leave room for bigger chunks but are slower.

In [None]:
generator = SyntheticDataKit.from_pretrained(
    # Choose any model from https://huggingface.co/unsloth
    model_name = "unsloth/Llama-3.2-3B-Instruct",
    max_seq_length = 2048, # Longer sequence lengths will be slower!
)

Setting up for **Question-Answer (QA) pair** generation with the following rules:

*   **`output_folder = "data"`**: Specifies that the final dataset should be saved in a folder named `"data"`.
*   **`temperature = 0.7` & `top_p = 0.95`**: These settings control the randomness of the LLM's sampling. A higher temperature lets the model generate a more varied set of questions and answers, and `top_p = 0.95` restricts sampling to the most likely tokens whose cumulative probability reaches 95%.
*   **`overlap = 64`**: When the tool chops up your long source documents into smaller pieces, this ensures that each piece overlaps with the previous one by 64 tokens. This helps maintain context and prevents ideas from being cut off at the edges.
*   **`max_generation_tokens = 512`**: This sets the maximum length for each generated question and answer pair, preventing them from becoming too long.
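To make the `overlap` idea concrete, here is a minimal sketch of overlapping chunking on a toy token list. It is not the library's implementation: `chunk_with_overlap` and its parameters are hypothetical, and small numbers stand in for real tokens.

```python
def chunk_with_overlap(tokens, chunk_size, overlap):
    """Split tokens into chunks where each chunk repeats the last `overlap` tokens of the previous one."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        # Stop once the current chunk reaches the end of the sequence
        if start + chunk_size >= len(tokens):
            break
    return chunks

# Toy example: 10 "tokens", chunks of 4 with an overlap of 1
toy_tokens = list(range(10))
toy_chunks = chunk_with_overlap(toy_tokens, chunk_size=4, overlap=1)
for chunk in toy_chunks:
    print(chunk)
```

Each chunk starts with the last token of the previous one, so a sentence that straddles a boundary is still seen in full by at least one chunk when the overlap is large enough.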

In [None]:
generator.prepare_qa_generation(
    output_folder = "data", # Output location of synthetic data
    temperature = 0.7, # Higher temp makes more diverse datasets
    top_p = 0.95,
    overlap = 64, # Overlap portion during chunking
    max_generation_tokens = 512, # Can increase for longer QA pairs
)

This cell is just a confirmation step to ensure that the necessary components are active before you start the main data generation process:

1.  **`VLLM server is running`**: This tells you that **vLLM**, the high-speed engine used by the `synthetic-data-kit` to run the LLM, has started successfully.
2.  **`Available models: ... 'unsloth/Llama-3.2-3B-Instruct'`**: This is the crucial part. It confirms that the vLLM server has successfully loaded the correct model (`unsloth/Llama-3.2-3B-Instruct`) and it's ready to be used for data generation.


**vLLM** is a high-performance Python library designed to make **running and serving Large Language Models (LLMs) for inference incredibly fast and efficient.**

Think of it as a specialized, high-speed engine that replaces the standard inference methods in libraries like Hugging Face's `transformers`.

Standard LLM inference is often inefficient, especially when handling many users or requests at once. Much GPU memory is wasted managing the **KV Cache**, a dynamically growing memory region that stores the attention keys and values of the tokens processed so far in each request. This leads to low throughput (fewer requests served per second) and higher costs.

vLLM's key innovation is an algorithm called **PagedAttention**. Inspired by how operating systems use virtual memory and paging to manage computer memory, PagedAttention does the same for the GPU's KV Cache:

*   It breaks the large, clunky KV Cache into a collection of small, fixed-size "blocks."
*   These blocks can be stored anywhere in the GPU's memory, eliminating wasted space and fragmentation.
*   This allows vLLM to pack many more user requests onto a single GPU and manage them with extreme efficiency, much like a well-organized file system.

In summary, **vLLM is a specialized engine for LLM *serving***. While a library like Unsloth makes *training* models faster, vLLM makes *using* them in a production environment faster and cheaper.
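The block idea behind PagedAttention can be illustrated with a toy allocator. This is only a conceptual sketch, not vLLM's code: `PagedKVCache`, `append_token` and `release` are hypothetical names, and the "cache" tracks block ids rather than real tensors.

```python
class PagedKVCache:
    """Toy model of a block-based KV cache: requests borrow fixed-size blocks on demand."""

    def __init__(self, num_blocks, block_size):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))  # physical block ids not in use
        self.block_tables = {}                      # request id -> list of block ids
        self.lengths = {}                           # request id -> tokens stored

    def append_token(self, request_id):
        """Reserve room for one more token, allocating a new block only when needed."""
        length = self.lengths.get(request_id, 0)
        table = self.block_tables.setdefault(request_id, [])
        if length % self.block_size == 0:           # no block yet, or the last one is full
            if not self.free_blocks:
                raise MemoryError("no free KV blocks")
            table.append(self.free_blocks.pop())
        self.lengths[request_id] = length + 1

    def release(self, request_id):
        """Return all blocks of a finished request to the free pool."""
        self.free_blocks.extend(self.block_tables.pop(request_id, []))
        self.lengths.pop(request_id, None)

cache = PagedKVCache(num_blocks=4, block_size=2)
for _ in range(3):
    cache.append_token("request-a")   # 3 tokens need 2 blocks
cache.append_token("request-b")       # 1 token needs 1 block
blocks_for_a = len(cache.block_tables["request-a"])
free_before_release = len(cache.free_blocks)
cache.release("request-a")
free_after_release = len(cache.free_blocks)
print(blocks_for_a, free_before_release, free_after_release)
```

The point of the sketch is that a request never reserves more than one partly filled block, and the blocks it holds do not have to be contiguous, so freed blocks can be reused by any other request right away.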

In [None]:
!synthetic-data-kit system-check

vLLM STDOUT: INFO:     127.0.0.1:39226 - "GET /v1/models HTTP/1.1" 200 OK
VLLM server is running at http://localhost:8000/v1
Checking VLLM server at http://localhost:8000/v1...
Available models: {'object': 'list', 'data': [{'id': 
'unsloth/Llama-3.2-3B-Instruct', 'object': 'model', 'created': 1758789636, 
'owned_by': 'vllm', 'root': 'unsloth/Llama-3.2-3B-Instruct', 'parent': None, 
'max_model_len': 2048, 'permission': [{'id': 
'modelperm-836372cf29c34ca2953745621762a2cd', 'object': 'model_permission', 
'created': 1758789636, 'allow_create_engine': False, 'allow_sampling': True, 
'allow_logprobs': True, 'allow_sea

This cell is the **main data generation step**. It iterates through your source text files and uses the `synthetic-data-kit` to create the question-answer dataset.

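**`--num-pairs 25 --type "qa"`**: These flags are the specific instructions. They ask for **Question-Answer pairs** and set a target of **25 pairs** for the text in the current file. Treat the number as a target rather than a guarantee: the model may return fewer usable pairs for short or repetitive texts.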

**`time.sleep(2)`**: After generating the data for one file, the code pauses for 2 seconds. This is a small safety measure to give the system a moment to finish processing and avoid any potential issues before starting the next file.
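Outside a notebook the `!` shell magic is not available. One possible alternative, sketched here under that assumption, is to assemble the same command as an argument list and run it with `subprocess`. The helper names `build_create_command` and `run_create` are hypothetical; the flags are the ones used in this section.

```python
import shlex
import subprocess

def build_create_command(filename, config="synthetic_data_kit_config.yaml", num_pairs=25):
    """Return the argv list for one `create` call, mirroring the flags used in this notebook."""
    return [
        "synthetic-data-kit",
        "-c", config,
        "create", filename,
        "--num-pairs", str(num_pairs),
        "--type", "qa",
    ]

def run_create(filename):
    """Run the command; raises CalledProcessError if the CLI exits with an error code."""
    subprocess.run(build_create_command(filename), check=True)

# Only build and display the command here; call run_create() to actually execute it
command = build_create_command("data/output/sherlock_0.txt")
print(shlex.join(command))
```

Passing a list instead of a single string avoids shell quoting problems with unusual filenames, and `check=True` makes a failed generation stop the loop instead of passing unnoticed.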

In [None]:
# Process chunks
for filename in filenames:
    !synthetic-data-kit \
        -c synthetic_data_kit_config.yaml \
        create {filename} \
        --num-pairs 25 \
        --type "qa"
    time.sleep(2) # Sleep some time to leave some room for processing

vLLM STDOUT: INFO:     127.0.0.1:39228 - "GET /v1/models HTTP/1.1" 200 OK
vLLM STDOUT: INFO:     127.0.0.1:39242 - "GET /v1/models HTTP/1.1" 200 OK
Generating qa content from data/output/sherlock_5.txt...
vLLM STDOUT: INFO 09-25 08:40:38 [chat_utils.py:444] Detected the chat template content format to be 'string'. You can set `--chat-template-content-format` to override this.
vLLM STDOUT: INFO 09-25 08:40:38 [logger.py:43] Received request chatcmpl-34ad77e7c5d1478087d9517f0bfe1467: prompt: '<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nCutting Knowledge Date: December 2023\nToday Date: 25 Sep 2025\n\nSummarize this document in 3-5 sentences, focusing on the main topic and key concepts.<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nSherlockiana : The Baker Street Journal, an Irregular Quarterly of Sherlockiana Sherlock Holmes and Sherlockiana Collection at the Harry Ransom Center Digital Collections The Universal Sherlock Holmes at the University of

This cell performs an automated **quality check** on the synthetic data you just created. Its purpose is to **filter out and remove any low-quality or irrelevant question-answer pairs**, ensuring the final dataset is clean and effective for fine-tuning.

Here's how it works:

1.  **The Loop:** The code loops through each of the `_qa_pairs.json` files that were generated in the previous step.

2.  **The `curate` Command:** For each file, it runs the `synthetic-data-kit curate` command. This command uses the LLM served by vLLM as a "judge": the model reads each question-answer pair and assigns it a rating from 1 to 10, as the prompt in the log output shows.

3.  **The Threshold:** The `--threshold 5.0` setting is the crucial instruction: it tells the tool to **discard any QA pair with a quality score below 5.0**.
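The thresholding step itself is simple once the ratings exist. The sketch below assumes the rated pairs have the shape the judge prompt asks for (`question`, `answer`, `rating`); `filter_by_rating` is a hypothetical helper and the ratings are made up for illustration, so this is not the tool's own code.

```python
def filter_by_rating(rated_pairs, threshold=5.0):
    """Keep only the pairs whose rating is at least `threshold`."""
    return [pair for pair in rated_pairs if pair.get("rating", 0) >= threshold]

# Made-up ratings on the 1-10 scale used by the judge prompt
rated_pairs = [
    {"question": "Who created the fictional detective Sherlock Holmes?",
     "answer": "Arthur Conan Doyle", "rating": 9},
    {"question": "What does Sherlockiana encompass?",
     "answer": "Not specified", "rating": 3},
]
kept = filter_by_rating(rated_pairs, threshold=5.0)
print(len(kept), "of", len(rated_pairs), "pairs kept")
```

Pairs without a rating are treated as 0 here and dropped, which is a conservative choice for this sketch.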

In [None]:
QUALITY_CHECK = True

if QUALITY_CHECK:
    qa_pairs_filenames = [
        f"data/generated/sherlock_{i}_qa_pairs.json"
        for i in range(len(filenames))
    ]
    for filename in qa_pairs_filenames:
        !synthetic-data-kit \
            -c synthetic_data_kit_config.yaml \
            curate --threshold 5.0 \
            {filename}

vLLM STDOUT: INFO:     127.0.0.1:56882 - "GET /v1/models HTTP/1.1" 200 OK
vLLM STDOUT: INFO:     127.0.0.1:56898 - "GET /v1/models HTTP/1.1" 200 OK
vLLM STDOUT: INFO 09-25 08:44:42 [logger.py:43] Received request chatcmpl-f1895ef507f340e6b2eb87864a987a6e: prompt: '<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nCutting Knowledge Date: December 2023\nToday Date: 25 Sep 2025\n\nRate each of these question-answer pairs for quality and return exactly this JSON format:\n\n[\n  {"question": "same question text", "answer": "same answer text", "rating": n}\n]\n\nWhere n is a number from 1-10.\n\nDO NOT include any text outside of the JSON array, just return valid JSON:\n\n[\n  {\n    "question": "What does Sherlockiana encompass?",\n    "answer": "various categories of materials and content related to the fictional detective Sherlock Holmes"\n  },\n  {\n    "question": "Who created the fictional detective Sherlock Holmes?",\n    "answer": "Arthur Conan Doyle"\n  },\n  {\n    "qu

This cell performs the **final step of the data preparation pipeline**: it **formats the synthetic dataset for fine-tuning.**

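1.  **Gather Files:** First, it creates a list of the `_qa_pairs.json` files in `data/generated`. These are the same paths that were passed to `curate` above, so if you want only the curated pairs in the final dataset, check where `curate` wrote its filtered output and point the list there instead.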

2.  **Run the `save-as` Command:** It then loops through each of these files and runs the `synthetic-data-kit save-as` command.

3.  **Format for Fine-Tuning (`-f ft`):** This is the key part. The `-f ft` flag is a specific instruction that tells the tool to convert the simple QA pairs into the structured format that fine-tuning libraries (like Hugging Face's TRL) expect. This usually means organizing each entry into a conversational format with distinct "user" (the question) and "assistant" (the answer) roles.
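As an illustration of the target format, the sketch below turns one QA pair into the `messages` structure seen in the final dataset of this notebook. `qa_to_messages` is a hypothetical helper, not the tool's implementation; the system prompt is the one that appears in the generated data.

```python
def qa_to_messages(question, answer, system_prompt="You are a helpful assistant."):
    """Wrap one QA pair in a chat-style record with system, user and assistant roles."""
    return {
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": question},
            {"role": "assistant", "content": answer},
        ]
    }

record = qa_to_messages(
    "Who created the fictional detective Sherlock Holmes?",
    "Arthur Conan Doyle",
)
for message in record["messages"]:
    print(message["role"], "->", message["content"])
```

Keeping the question in the `user` turn and the answer in the `assistant` turn lets a chat template render each record the same way the model will see conversations at inference time.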

In [None]:
qa_pairs_filenames = [
    f"data/generated/sherlock_{i}_qa_pairs.json"
    for i in range(len(filenames))
]
for filename in qa_pairs_filenames:
    !synthetic-data-kit \
        -c synthetic_data_kit_config.yaml \
        save-as {filename} -f ft

Converting data/generated/sherlock_0_qa_pairs.json to ft format with json storage...
Converted to ft format and saved to data/final/sherlock_0_qa_pairs_ft.json
Converting data/generated/sherlock_1_qa_pairs.json to ft format with json storage...
Converted to ft format and saved to data/final/sherlock_1_qa_pairs_ft.json
Converting data/generated/sherlock_2_qa_pairs.json to ft format with json storage...
Converted to ft format and saved to data/final/sherlock_2_qa_pairs_ft.json
Converting data/generated/sherlock_3_qa_pairs.json to ft format with json storage...
Converted to ft format and saved to data/final/sherlock_3_qa_pairs_ft.json
Converting data/generated/sherlock_4_qa_pairs.json to ft format with json storage...
Converted to ft

In [None]:
final_filenames = os.listdir("data/final")

conversations = pd.concat(
    [pd.read_json(f"data/final/{name}") for name in final_filenames]
).reset_index(drop=True)

In [None]:
all_contents = list(
    itertools.chain.from_iterable(
        [
            [message["content"] for message in conversation]
            for conversation in conversations["messages"]
        ]
    )
)

content_counts = Counter(all_contents)

most_common_content = content_counts.most_common()

In [None]:
print(most_common_content[:50])

[('You are a helpful assistant.', 165), ('Yes', 12), ('Sherlockiana', 6), ('Not specified', 6), ('Arthur Conan Doyle', 4), ('Not specified in the text', 3), ('Professor James Moriarty', 3), ('What is Sherlockiana?', 2), ('1987', 2), ('Charles Spencer', 2), ('Sherlock Holmes', 2), ('Sherlock Holmes pastiches in print and other media such as films', 2), ('memorabilia associated with Sherlock Holmes', 2), ('anything about, inspired by, or tangentially concerning Sherlock Holmes', 2), ('What is a Sherlock Holmes pastiche?', 2), ('not mentioned in the text', 2), ("Who wrote the short story 'The Ultimate Crime'?", 2), ('Isaac Asimov', 2), ('The Dynamics of an Asteroid', 2), ('Who won the 1887 celestial mechanics contest?', 2), ('Henri Poincaré', 2), ('The Baker Street Journal', 2), ('1994', 2), ('2015', 2), ('1914', 2), ('Sherlock Holmes fans', 2), ('What type of publication is The Baker Street Journal?', 2), ('Publication', 2), ('An Asteroid', 2), ('What institution is intrigued by the orig

In [None]:
dataset = Dataset.from_pandas(conversations)

In [None]:
final_dataset = DatasetDict({
    'train': dataset,
})

print("\nFinal Hugging Face Dataset object:")
print(final_dataset)

# You can inspect an example
print("\nExample from the training set:")
print(final_dataset['train'][0])


Final Hugging Face Dataset object:
DatasetDict({
    train: Dataset({
        features: ['messages'],
        num_rows: 165
    })
})

Example from the training set:
{'messages': [{'content': 'You are a helpful assistant.', 'role': 'system'}, {'content': 'What institution is intrigued by the original location of 221B Baker Street?', 'role': 'user'}, {'content': 'The United States Smithsonian Museums', 'role': 'assistant'}]}


In [None]:
try:
    from huggingface_hub import login
    from google.colab import userdata

    # Retrieve your Hugging Face token from Colab's secrets manager
    # The name 'HF_TOKEN' should match the name you used in the secrets tab
    hf_token = userdata.get('HF_TOKEN')

    # Check if the token was successfully retrieved
    if hf_token:
        # Log in to Hugging Face using the retrieved token
        # The `add_to_git_credential=True` argument is optional and useful if you plan to push models to the Hub
        login(token=hf_token, add_to_git_credential=True)
        print("Hugging Face login successful using Google Colab secrets!")
    else:
        print("Error: HF_TOKEN not found in Google Colab secrets or is empty.")
        print("Please ensure you have created a secret named 'HF_TOKEN' in the 'Secrets' tab (🔑) on the left sidebar.")
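except Exception as e:
    # Outside Colab the google.colab import fails; report it instead of failing silently
    print(f"Skipping Hugging Face login via Colab secrets: {e}")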

Hugging Face login successful using Google Colab secrets!


In [None]:
if not DEMO:

    # Your final_dataset object from the script above is ready
    repo_id = "lmassaron/Sherlock_QA"
    print(f"\nUploading dataset to the Hub at {repo_id}...")

    # This command uploads the dataset. It will create the repo if it doesn't exist.
    final_dataset.push_to_hub(repo_id)
    print("Upload complete!")