# Datafast Quickstart
Datafast can generate several types of datasets, including:
* Text Classification Dataset
* Raw Text Dataset for pre-training
* Instruction Dataset
* Preference Dataset
* Multiple Choice Questions Dataset

*This notebook walks you through a text classification example using OpenAI's GPT-4.1-mini.*

💪 We'll demonstrate datafast's capabilities by creating a trail conditions classification dataset with the following characteristics:

- Multi-class: the report belongs to one of four trail condition categories
- Multi-lingual: the reports in the dataset will be in 2 different languages
- Multi-LLM: we generate examples using multiple LLM providers to boost diversity
- Published: the finished dataset can be pushed to your Hugging Face Hub account

In [1]:
# You can ignore 'ERROR: pip's dependency resolver does not currently take into account all the packages that are installed.'
!pip install datafast --quiet


In [2]:
import datafast
print(datafast.__version__)

0.0.13


To use OpenAI models, you'll need to create an API key and configure it in your Google Colab Secrets.
1. Create an OpenAI API key [here](https://platform.openai.com/settings/organization/api-keys) (you'll need an account on the OpenAI platform, but you don't need a ChatGPT subscription).
2. Open your Colab secrets (click the key icon in the left sidebar).
3. Give the secret a name, for instance `OPENAI_API_KEY`, and paste the key into `Value`.
4. Toggle `Notebook access` to give this notebook access to the API key.

💸 You will be charged for using OpenAI models! Use a small, cheap model such as `gpt-4.1-nano` for testing and learning, then switch to a more capable model if a more complex task requires it.

In [3]:
from google.colab import userdata
import os

os.environ['HF_TOKEN'] = userdata.get('HF_TOKEN') # set HF_TOKEN if you want to publish your dataset to your HF Hub
os.environ['OPENAI_API_KEY'] = userdata.get('OPENAI_API_KEY') # if you want to use OpenAI models for generation

## Uncomment if you want to try it with Anthropic or Gemini models
# os.environ['ANTHROPIC_API_KEY'] = userdata.get('ANTHROPIC_API_KEY')
# os.environ['GEMINI_API_KEY'] = userdata.get('GEMINI_API_KEY')
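If some secrets are optional (for example `HF_TOKEN` when you don't plan to publish), or if you run the notebook outside Colab, a small helper can make the setup more forgiving. The sketch below is an illustration only: `get_secret` and `export_secrets` are hypothetical helper names, and the fallback to plain environment variables is an assumption about how you might store keys locally.

```python
import os
from typing import List, Optional


def get_secret(name: str) -> Optional[str]:
    """Return a secret from Colab Secrets if possible, else from the environment."""
    try:
        from google.colab import userdata  # only importable inside Colab
    except ImportError:
        return os.environ.get(name)
    try:
        return userdata.get(name)
    except Exception:
        # The secret is missing or notebook access was not granted.
        return os.environ.get(name)


def export_secrets(names: List[str]) -> List[str]:
    """Copy each available secret into os.environ; return the names that are missing."""
    missing = []
    for name in names:
        value = get_secret(name)
        if value:
            os.environ[name] = value
        else:
            missing.append(name)
    return missing


missing = export_secrets(["HF_TOKEN", "OPENAI_API_KEY"])
if missing:
    print(f"Missing secrets: {', '.join(missing)}")
```

With this approach a missing optional secret produces a message instead of stopping the cell with an error.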

## Step 1: Import Required Modules

Generating a dataset with `datafast` requires 3 types of imports:

* Dataset
* Configs
* LLM Providers

In [4]:
# Imports
from datafast.datasets import ClassificationDataset
from datafast.schema.config import ClassificationDatasetConfig, PromptExpansionConfig
from datafast.llms import OpenAIProvider, AnthropicProvider, GeminiProvider
from dotenv import load_dotenv

## Step 2: Configure Your Dataset

The `ClassificationDatasetConfig` class defines all parameters for your text classification dataset.

- **`classes`**: List of dictionaries defining your classification labels. Each dictionary represents a class and should include:
    - `name`: Label identifier (required)
    - `description`: Detailed description of what this class represents (required)

In [5]:
CLASSES = [
        {
            "name": "trail_obstruction",
            "description": "Conditions where the trail is partially or fully blocked by obstacles like fallen trees, landslides, snow, or dense vegetation, and other forms of obstruction."
        },
        {
            "name": "infrastructure_issues",
            "description": "Problems related to trail structures and amenities, including damaged bridges, signage, stairs, handrails, or markers, and other forms of infrastructure issues."
        },
        {
            "name": "hazards",
            "description": "Trail conditions posing immediate risks to hiker safety, such as slippery surfaces, dangerous wildlife, hazardous crossings, or unstable terrain, and other forms of hazards."
        },
        {
            "name": "positive_conditions",
            "description": "Conditions highlighting clear, safe, and enjoyable hiking experiences, including well-maintained trails, reliable infrastructure, clear markers, or scenic features, and other forms of positive conditions."
        }
    ]
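Before spending API credits, it can help to check that every class dictionary has the two required keys. The helper below is a minimal sketch in plain Python: `validate_classes` is a hypothetical name and not part of `datafast`, and the duplicate-name check is an extra assumption about what makes a sensible label set. The library may perform its own validation.

```python
def validate_classes(classes):
    """Check that each class has a non-empty 'name' and 'description'.

    Returns the list of class names in order, or raises ValueError.
    """
    seen = set()
    for i, cls in enumerate(classes):
        for key in ("name", "description"):
            value = cls.get(key)
            if not isinstance(value, str) or not value.strip():
                raise ValueError(f"class #{i} needs a non-empty '{key}'")
        if cls["name"] in seen:
            raise ValueError(f"duplicate class name: {cls['name']}")
        seen.add(cls["name"])
    return [cls["name"] for cls in classes]


# Shortened example classes, for illustration only.
example_classes = [
    {"name": "hazards", "description": "Conditions posing immediate risks to hikers."},
    {"name": "positive_conditions", "description": "Clear, safe, and enjoyable trails."},
]
print(validate_classes(example_classes))
```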

- **`num_samples_per_prompt`**: Number of examples to generate in a single LLM call.

- **`output_file`**: Path where the generated dataset will be saved (JSONL format).

- **`languages`**: Dictionary mapping language codes to their names (e.g., `{"en": "English"}`).
    - You can use any language code and name you want. However, make sure that the underlying LLM provider you'll be using supports the language you're using.
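Since the output is a JSONL file (one JSON object per line), it can be inspected with the standard library alone. The sketch below writes a tiny demo file and reads it back. The field names `text` and `label` are assumptions made for illustration; check the actual keys in your generated file.

```python
import json
import tempfile
from collections import Counter
from pathlib import Path


def read_jsonl(path):
    """Parse a JSON Lines file: one JSON object per non-empty line."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# Hypothetical rows: the real field names written by datafast may differ.
demo_rows = [
    {"text": "Fallen tree across the path after the storm.", "label": "trail_obstruction"},
    {"text": "Bridge handrail is loose near the creek.", "label": "infrastructure_issues"},
    {"text": "Trail is clear and well marked.", "label": "positive_conditions"},
]

# Write the demo rows to a temporary JSONL file.
demo_path = Path(tempfile.mkdtemp()) / "demo.jsonl"
with open(demo_path, "w", encoding="utf-8") as f:
    for row in demo_rows:
        f.write(json.dumps(row, ensure_ascii=False) + "\n")

# Read the file back and count rows per label.
rows = read_jsonl(demo_path)
label_counts = Counter(row["label"] for row in rows)
print(len(rows), dict(label_counts))
```

To inspect your own dataset, point `read_jsonl` at the path you set in `output_file`.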

## Step 3: Prompt Expansion for Diverse Examples

Prompt expansion is a key concept in the `datafast` library. It generates multiple variations of a base prompt to increase the diversity of the generated data.

Two optional placeholders illustrate prompt expansion in this example:

* `{{style}}`: generates hiker reports in different writing styles
* `{{trail_type}}`: generates reports about different types of trails

To keep the run small, the configuration cell in this notebook only uses `{{trail_type}}`.


You can use different variables depending on your actual use case. For example, you can use `{{experience_level}}` to generate reports from hikers with different experience levels or `{{season}}` to generate reports across different seasons.

🔥 By default, prompt expansion generates every combination of the placeholder values (for example, every pairing of style and trail type) to maximize diversity. To limit the number of combinations, set `combinatorial` to `False` in your `PromptExpansionConfig` and provide a value for `num_random_samples`.
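To make the idea concrete, here is a minimal sketch of prompt expansion in plain Python. It is not the `datafast` implementation: `expand_prompt` is a hypothetical function, its arguments only mirror the parameter names mentioned above, and sampling combinations without replacement is an assumption about what the non-combinatorial mode does.

```python
import itertools
import random


def expand_prompt(template, placeholders, combinatorial=True, num_random_samples=1, seed=None):
    """Fill {{name}} placeholders, returning one prompt per chosen combination."""
    names = list(placeholders)
    # Cartesian product of all placeholder values.
    combos = list(itertools.product(*(placeholders[n] for n in names)))
    if not combinatorial:
        # Keep only a random subset of the combinations.
        rng = random.Random(seed)
        combos = rng.sample(combos, min(num_random_samples, len(combos)))
    prompts = []
    for combo in combos:
        prompt = template
        for name, value in zip(names, combo):
            prompt = prompt.replace("{{" + name + "}}", value)
        prompts.append(prompt)
    return prompts


template = "Write a {{style}} hiker report about a {{trail_type}}."
placeholders = {
    "style": ["brief", "detailed"],
    "trail_type": ["mountain trail", "coastal path", "forest walk"],
}

all_prompts = expand_prompt(template, placeholders)
print(len(all_prompts))  # 2 styles x 3 trail types = 6 prompts

sampled = expand_prompt(template, placeholders, combinatorial=False, num_random_samples=2, seed=0)
print(len(sampled))  # 2 randomly chosen combinations
```

The number of prompts grows multiplicatively with each placeholder, which is why random sampling becomes useful once you add several placeholders with many values.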

In [6]:
config = ClassificationDatasetConfig(
    # Define your classification classes
    classes=CLASSES,
    # Number of examples to generate per prompt
    num_samples_per_prompt=5,

    # Output file path
    output_file="trail_conditions_classification.jsonl",

    # Languages to generate data for
    languages={
        "en": "English",
        # "fr": "French", # Uncomment to generate in multiple languages
    },

    # Custom prompts (optional - otherwise defaults will be used)
    prompts=[
        (
            "Generate {num_samples} hiker reports in {language_name} which are diverse "
            "and representative of a '{label_name}' trail condition category. "
            "{label_description}. The reports should be brief and about a {{trail_type}}."
        )
    ],
    expansion=PromptExpansionConfig(
        placeholders={
            "trail_type": [
                "mountain trail",
                "coastal path",
                "forest walk",
            ]
        },
))

## Step 4: Set Up LLM Providers

Configure one or more LLM providers to generate your dataset. Using multiple providers helps create more diverse and robust datasets.


In [7]:
providers = [
    OpenAIProvider(model_id="gpt-4.1-mini"),
]

## Step 5: Generate the Dataset

In [8]:
# Initialize dataset with your configuration
dataset = ClassificationDataset(config)

# Check how many rows are expected
num_expected_rows = dataset.get_num_expected_rows(providers)
print(f"Expected number of rows: {num_expected_rows}")

# Generate examples using configured providers
dataset.generate(providers)

Expected number of rows: 60
 Generated and saved 5 examples total
 Generated and saved 10 examples total
 Generated and saved 15 examples total
 Generated and saved 20 examples total
 Generated and saved 25 examples total
 Generated and saved 30 examples total
 Generated and saved 35 examples total
 Generated and saved 40 examples total
 Generated and saved 45 examples total
 Generated and saved 50 examples total
 Generated and saved 55 examples total
 Generated and saved 60 examples total


<datafast.datasets.ClassificationDataset at 0x787f987cf050>
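The figure of 60 expected rows is consistent with a simple product of the configuration dimensions. The sketch below reproduces that arithmetic. `expected_rows` is a hypothetical helper, and the assumption that the count is a plain product is inferred from this run rather than taken from the library's source.

```python
def expected_rows(num_classes, num_languages, num_prompts, num_expansions,
                  num_samples_per_prompt, num_providers):
    """Multiply the configuration dimensions to estimate the dataset size."""
    return (num_classes * num_languages * num_prompts * num_expansions
            * num_samples_per_prompt * num_providers)


# This notebook: 4 classes, 1 language, 1 prompt, 3 trail types,
# 5 samples per prompt, 1 provider.
this_run = expected_rows(4, 1, 1, 3, 5, 1)
print(this_run)  # 60

# Uncommenting French in the config would double the estimate.
with_french = expected_rows(4, 2, 1, 3, 5, 1)
print(with_french)  # 120
```

Running this kind of estimate before calling `generate` gives a rough idea of how many LLM calls, and therefore how much cost, a configuration implies.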

## Optional: Push to Hugging Face Hub
*Note that you must have configured an `HF_TOKEN` in your Google Colab secrets.*

In [9]:
USERNAME = "patrickfleith"  # <--- Your Hugging Face username
DATASET_NAME = "outdoor_colab"  # <--- Your Hugging Face dataset name

dataset.push_to_hub(
    repo_id=f"{USERNAME}/{DATASET_NAME}",
    train_size=0.6
)

'https://huggingface.co/datasets/patrickfleith/outdoor_colab'
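As a rough illustration of what `train_size=0.6` implies, the sketch below splits 60 rows into two parts in plain Python. It assumes that `train_size` is the fraction of rows placed in the training split and that the remainder forms a test split. `split_rows` is a hypothetical helper, and the notebook does not show how the library actually shuffles or splits the data.

```python
import random


def split_rows(rows, train_size, seed=42):
    """Shuffle a copy of the rows and split it into train and test parts."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * train_size)
    return shuffled[:n_train], shuffled[n_train:]


rows = list(range(60))  # stand-ins for the 60 generated examples
train, test = split_rows(rows, train_size=0.6)
print(len(train), len(test))  # 36 24
```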

More guides are available [here](https://patrickfleith.github.io/datafast/guides/).