This notebook shows a simple example of how to use some of the `flamingo` utilities to pre-process a dataset
and upload it as a W&B artifact.

Generally, this workflow will be performed in a dev environment on a cluster so that the dataset files
can be saved on a shared volume.
However, this notebook can also be run locally to illustrate the basic functions.

(1) Load and pre-process the base dataset from HuggingFace

In [12]:
from datasets import load_dataset

base_dataset = "fka/awesome-chatgpt-prompts"
dataset = load_dataset(base_dataset, split="train")

dataset

Dataset({
    features: ['act', 'prompt'],
    num_rows: 153
})

In [11]:
def preprocess_func(examples):
    # With `batched=True`, `examples` maps each column name to a list of values
    # Dummy transformation: reverse each prompt string
    examples["text"] = [prompt[::-1] for prompt in examples["prompt"]]
    return examples

# Map some preprocessing function over the base dataset (e.g., for prompt formatting)
dataset = dataset.map(
    preprocess_func, 
    batched=True,
    remove_columns=dataset.column_names
)

dataset


Dataset({
    features: ['text'],
    num_rows: 153
})
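
To make the batched calling convention concrete: with `batched=True`, `map` passes the function a dictionary of column names to lists of values rather than a single row. The sketch below calls a copy of the reversing function on a hand-built batch. The two sample rows are made up for illustration, and the `datasets` library is not needed to run it.

```python
def preprocess_func(examples):
    # Same dummy transformation as above: reverse each prompt string
    examples["text"] = [prompt[::-1] for prompt in examples["prompt"]]
    return examples

# Hypothetical two-row batch, shaped like what `map(..., batched=True)` supplies
batch = {
    "act": ["Linux Terminal", "Translator"],
    "prompt": ["Act as a terminal.", "Translate this."],
}

result = preprocess_func(batch)
print(result["text"])  # ['.lanimret a sa tcA', '.siht etalsnarT']
```

In the real call, `remove_columns=dataset.column_names` then drops `act` and `prompt`, which is why only `text` remains in the output above.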

(2) Save the dataset to disk

In [16]:
from pathlib import Path

# Replace this with the path on shared storage where the dataset should live
dataset_save_path = str(Path("example_dataset").absolute())

dataset.save_to_disk(dataset_save_path)

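
`Path("example_dataset").absolute()` anchors the relative name at the current working directory, so the save location depends on where the notebook is launched. The following sketch shows that behavior using only the standard library. The `/mnt/shared/datasets` root is a hypothetical placeholder, not a path from this project.

```python
from pathlib import Path

# A relative path is resolved against the current working directory
relative = Path("example_dataset")
resolved = relative.absolute()
print(resolved == Path.cwd() / "example_dataset")  # True

# Hypothetical shared-volume root (an assumption, for illustration only)
shared_root = Path("/mnt/shared/datasets")
dataset_save_path = str(shared_root / "example_dataset")
print(dataset_save_path)
```

Using an explicit root like this keeps the saved location stable regardless of the directory the notebook runs from.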

(3) Log the dataset directory as a W&B artifact for later use

In [21]:
from flamingo.jobs.utils import FlamingoJobType
from flamingo.integrations.wandb import WandbRunConfig, wandb_init_from_config
from flamingo.integrations.wandb import ArtifactType, ArtifactURIScheme, log_artifact_from_path

In [20]:
run_config = WandbRunConfig(
    name="dataset-preprocessing-example",
    project="sfriedowitz-dev",
    entity="mozilla-ai"
)

In [22]:
with wandb_init_from_config(config=run_config, job_type=FlamingoJobType.PREPROCESSING):
    # `uri_scheme=ArtifactURIScheme.FILE` marks the path as a reference to local files,
    # so a reference artifact pointing at the dataset is logged instead of uploading the directory contents
    log_artifact_from_path(
        name="example-dataset",
        path=dataset_save_path,
        artifact_type=ArtifactType.DATASET,
        uri_scheme=ArtifactURIScheme.FILE
    )

wandb: Currently logged in as: sfriedowitz (mozilla-ai). Use `wandb login --relogin` to force relogin
wandb: Generating checksum for up to 10000 files in "/Users/sfriedowitz/workspace/flamingo/examples/example_dataset"...
wandb: Done. 0.0s


0.018 MB of 0.018 MB uploaded
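
Two strings matter when this artifact is used later: the `file://` URI that a reference artifact records in place of the file contents, and the qualified name under which W&B addresses the artifact. The sketch below derives both with the standard library. That `flamingo` builds the URI in exactly this way is an assumption. The `entity/project/name:alias` form is W&B's usual addressing convention, with `latest` pointing at the newest version.

```python
from pathlib import Path

# Path saved in step (2); a reference artifact records a URI to it, not its contents
dataset_save_path = Path("example_dataset").absolute()
reference_uri = dataset_save_path.as_uri()  # assumed form of the FILE-scheme reference
print(reference_uri)

# Values taken from the run config and the logging call above
entity, project, artifact_name = "mozilla-ai", "sfriedowitz-dev", "example-dataset"
qualified_name = f"{entity}/{project}/{artifact_name}:latest"
print(qualified_name)  # mozilla-ai/sfriedowitz-dev/example-dataset:latest
```

A string in this form can be passed to `wandb.Api().artifact(...)` or `run.use_artifact(...)` in a later job. Because the artifact only references files, the consuming job needs access to the same filesystem path, which is the reason for saving to shared storage.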