# Creating an instruction dataset using SelfInstruct

In this Jupyter notebook, we create an instruction dataset from a set of seeds about medicaments and drugs, obtained from a public dataset. Both the instructions and the responses to those instructions are generated with [distilabel](https://github.com/argilla-io/distilabel), an AI Feedback (AIF) framework for building datasets with and for LLMs. The datasets are then stored and visualized with [Argilla](https://github.com/argilla-io/argilla), a collaborative platform for uploading and annotating datasets. Argilla can run locally or, as in this case, in a HuggingFace Space.

## Installations, imports and environment

In [None]:
#!pip install "distilabel[argilla]" argilla datasets vllm

Collecting distilabel[argilla]
  Downloading distilabel-0.6.0-py3-none-any.whl (132 kB)
Collecting argilla
  Downloading argilla-1.25.0-py3-none-any.whl (415 kB)
Collecting datasets
  Downloading datasets-2.18.0-py3-none-any.whl (510 kB)
Collecting vllm
  Downloading vllm-0.3.3-cp310-cp310-manylinux1_x86_64.whl (44.3 MB)
Collecting multiprocess (from distilabel[argilla])
  Downloading multiprocess-0.70.16-py310-none-any.whl (134 kB)

In [1]:
import os

import argilla as rg
from datasets import Dataset, load_dataset
from distilabel.dataset import CustomDataset
from distilabel.llm import vLLM
from distilabel.pipeline import Pipeline
from distilabel.tasks import Task, TextGenerationTask, SelfInstructTask
import torch
from vllm import LLM  # vLLM engine class, wrapped below by distilabel's vLLM

In [None]:
!huggingface-cli login


    _|    _|  _|    _|    _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|_|_|_|    _|_|      _|_|_|  _|_|_|_|
    _|    _|  _|    _|  _|        _|          _|    _|_|    _|  _|            _|        _|    _|  _|        _|
    _|_|_|_|  _|    _|  _|  _|_|  _|  _|_|    _|    _|  _|  _|  _|  _|_|      _|_|_|    _|_|_|_|  _|        _|_|_|
    _|    _|  _|    _|  _|    _|  _|    _|    _|    _|    _|_|  _|    _|      _|        _|    _|  _|        _|
    _|    _|    _|_|      _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|        _|    _|    _|_|_|  _|_|_|_|

    To login, `huggingface_hub` requires a token generated from https://huggingface.co/settings/tokens .
Token: 
Add token as git credential? (Y/n) n
Token is valid (permission: write).
Your token has been saved to /root/.cache/huggingface/token
Login successful


In [2]:
# Public variables
os.environ["ARGILLA_API_KEY"] = "admin.apikey"
os.environ["ARGILLA_API_URL"] = "https://ignacioct-argilla.hf.space"
os.environ["ARGILLA_WORKSPACE"] = "admin"
os.environ["HF_NAMESPACE"] = "argilla"

# Secrets
os.environ["HF_API_KEY"] = "hf_..."
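
Hardcoding keys in a notebook makes them easy to leak. As a minimal sketch, the helper below reads a variable from the environment and fails early when it is missing. The name `require_env` and the demo variable `ARGILLA_API_URL_DEMO` are hypothetical and are not part of distilabel or Argilla.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or raise if it is unset or empty."""
    value = os.environ.get(name, "")
    if not value:
        raise RuntimeError(f"Environment variable {name!r} is not set")
    return value


# Demo only: set a placeholder value, then read it back through the helper.
os.environ["ARGILLA_API_URL_DEMO"] = "https://example.invalid"
print(require_env("ARGILLA_API_URL_DEMO"))
```

With a helper like this, a missing key surfaces as a clear error before any model is loaded.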

In [3]:
rg.init(api_url=os.environ["ARGILLA_API_URL"], api_key=os.environ["ARGILLA_API_KEY"])

This may lead to potential compatibility issues during your experience.
To ensure a seamless and optimized connection, we highly recommend aligning your client version with the server version.


## Dataset Preprocessing

To create the dataset, we will use [wiki_medical_terms](https://huggingface.co/datasets/gamino/wiki_medical_terms), a dataset containing 6000 medical terms and their Wikipedia text. It was originally intended for downstream tasks that require medical terms and their Wikipedia explanations, but here we use the medical terms, together with a short excerpt of their text, as seeds to generate our instruction dataset.

### Downloading the dataset

In [None]:
dataset = load_dataset("gamino/wiki_medical_terms", split='train[:100]')

print(dataset)
print(dataset[0])

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


Dataset({
    features: ['page_title', 'page_text', '__index_level_0__'],
    num_rows: 100
})
{'page_title': 'Paracetamol poisoning', 'page_text': 'Paracetamol poisoning, also known as acetaminophen poisoning, is caused by excessive use of the medication paracetamol (acetaminophen). Most people have few or non-specific symptoms in the first 24 hours following overdose. These include feeling tired, abdominal pain, or nausea. This is typically followed by a couple of days without any symptoms, after which yellowish skin, blood clotting problems, and confusion occurs as a result of liver failure. Additional complications may include kidney failure, pancreatitis, low blood sugar, and lactic acidosis. If death does not occur, people tend to recover fully over a couple of weeks. Without treatment, death from toxicity occurs 4 to 18 days later.Paracetamol poisoning can occur accidentally or as an attempt to die by suicide. Risk factors for toxicity include alcoholism, malnutrition, and the t

### Obtaining a subset of the dataset to generate instructions

In [None]:
seed_list = []

for record in dataset:
    # Append the medical term and the description, cropped to the first 300 characters
    seed_list.append(record["page_title"] + ": " + record["page_text"][:300])

instructions_dataset = Dataset.from_dict({"input": seed_list})

print(instructions_dataset)
print(instructions_dataset[0]["input"])

Dataset({
    features: ['input'],
    num_rows: 100
})
Paracetamol poisoning: Paracetamol poisoning, also known as acetaminophen poisoning, is caused by excessive use of the medication paracetamol (acetaminophen). Most people have few or non-specific symptoms in the first 24 hours following overdose. These include feeling tired, abdominal pain, or nausea. This is typically fo
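
The seed construction above can also be written as a small helper, which makes the truncation length easy to change. This is only a sketch: `build_seed` and `max_chars` are illustrative names, and the records are made-up stand-ins for rows of the real dataset.

```python
def build_seed(title: str, text: str, max_chars: int = 300) -> str:
    """Join a term and its description, keeping at most max_chars of the description."""
    return f"{title}: {text[:max_chars]}"


# Stand-in records with the same keys as the dataset rows.
records = [
    {"page_title": "Term A", "page_text": "x" * 500},
    {"page_title": "Term B", "page_text": "short text"},
]

seeds = [build_seed(r["page_title"], r["page_text"]) for r in records]
print(len(seeds[0]))  # title, separator and 300 characters of text
print(seeds[1])
```

Each seed is therefore at most `len(title) + 2 + max_chars` characters long.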


## SelfInstruct Task

Let's create a `SelfInstructTask` with an application description that guides the LLM towards the desired use.

In [None]:
instructions_task = SelfInstructTask(
    application_description="An assistant that can answer questions about medicaments and drugs and what their uses are."
)

In [None]:
def load_model(task: Task) -> vLLM:
    # Restrict vLLM to the first GPU.
    os.environ["CUDA_VISIBLE_DEVICES"] = "0"

    return vLLM(
        model=LLM(model="microsoft/phi-2"),
        task=task,
        max_new_tokens=512,
        temperature=0.7,
    )

In [None]:
instructions_pipeline = Pipeline(generator=load_model(instructions_task))

INFO 03-07 16:50:34 llm_engine.py:87] Initializing an LLM engine with config: model='microsoft/phi-2', tokenizer='microsoft/phi-2', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=2048, download_dir=None, load_format=auto, tensor_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, seed=0)


Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


INFO 03-07 16:50:40 weight_utils.py:163] Using model weights format ['*.safetensors']
INFO 03-07 16:51:03 llm_engine.py:357] # GPU blocks: 1509, # CPU blocks: 819
INFO 03-07 16:51:05 model_runner.py:684] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
INFO 03-07 16:51:05 model_runner.py:688] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
INFO 03-07 16:51:24 model_runner.py:756] Graph capturing finished in 18 secs.


In [None]:
generated_instructions = instructions_pipeline.generate(
    dataset=instructions_dataset, num_generations=2, batch_size=4, display_progress_bar=True
)

INFO:distilabel:Executing dry-run...
INFO:distilabel:Processing batch 1 of 1...
INFO:distilabel:Calling generator for batch 1...
  prompts = self._generate_prompts(inputs, default_format=None)


Flattening the indices:   0%|          | 0/1 [00:00<?, ? examples/s]

INFO:distilabel:Dry-run executed with no issues. Starting the actual generation...


Output()

INFO:distilabel:Processing batch 1 of 25...
INFO:distilabel:Calling generator for batch 1...


INFO:distilabel:Processing batch 2 of 25...
INFO:distilabel:Calling generator for batch 2...
INFO:distilabel:Processing batch 3 of 25...
INFO:distilabel:Calling generator for batch 3...
INFO:distilabel:Processing batch 4 of 25...
INFO:distilabel:Calling generator for batch 4...
INFO:distilabel:Calling generator for batch 5...
INFO:distilabel:Processing batch 6 of 25...
INFO:distilabel:Calling generator for batch 6...
INFO:distilabel:Processing batch 7 of 25...
INFO:distilabel:Calling generator for batch 7...
INFO:distilabel:Processing batch 8 of 25...
INFO:distilabel:Calling generator for batch 8...
INFO:distilabel:Processing batch 9 of 25...
INFO:distilabel:Calling generator for batch 9...
INFO:distilabel:Processing batch 10 of 25...
INFO:distilabel:Calling generator for batch 10...
INFO:distilabel:Processing batch 11 of 25...
INFO:distilabel:Calling generator for batch 11...
INFO:distilabel:Processing batch 12 of 25...
INFO:distilabel:Calling generator for batch 12...
INFO:distilabel

Flattening the indices:   0%|          | 0/100 [00:00<?, ? examples/s]

Saving the dataset (0/1 shards):   0%|          | 0/100 [00:00<?, ? examples/s]

INFO:distilabel:Checkpoint saved to disk: /content/ckpt.
INFO:distilabel:Final dataset saved at /content/ckpt
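
The batch counts in the log follow directly from the dataset size and `batch_size`. A minimal sketch of that arithmetic, assuming the pipeline splits the records into consecutive batches and keeps a final partial batch (`num_batches` is an illustrative helper, not a distilabel function):

```python
import math


def num_batches(num_records: int, batch_size: int) -> int:
    """Number of batches needed to cover num_records, counting a final partial batch."""
    return math.ceil(num_records / batch_size)


print(num_batches(100, 4))  # 100 seeds with batch_size=4
print(num_batches(888, 8))  # 888 instructions with batch_size=8
```

The first call gives the 25 batches seen in the log above; the second corresponds to the answer-generation run later in this notebook.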


Let's take a look at the instructions that we have generated.

In [None]:
instructions = []
for generations in generated_instructions["instructions"]:
    for generation in generations:
        instructions.extend(generation)

print(f"Number of generated instructions: {len(instructions)}")

for instruction in instructions[:10]:
    print(instruction)

Number of generated instructions: 1147
What is Paracetamol poisoning and what are its symptoms?
What are the recommended treatments for Paracetamol poisoning?
Can Paracetamol poisoning be fatal?
How can Paracetamol poisoning be prevented?
What is the difference between Paracetamol and other pain-relieving medications?
What are the common side effects of paracetamol overdose?
How can a paracetamol overdose be treated?
What are the symptoms of paracetamol poisoning?
What should a person do if they suspect a paracetamol overdose?
How can paracetamol poisoning be prevented?
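
Some of the generated instructions differ only in capitalization, such as the two "How can paracetamol poisoning be prevented?" variants above. If such near-duplicates are unwanted, one possible cleanup is a case-insensitive deduplication. This is a sketch under that assumption; `dedupe_instructions` is a hypothetical helper and is not something distilabel is known to do with this exact logic.

```python
def dedupe_instructions(items):
    """Drop instructions that repeat an earlier one, ignoring case and extra whitespace."""
    seen = set()
    unique = []
    for item in items:
        key = " ".join(item.lower().split())  # normalized form used only for comparison
        if key and key not in seen:
            seen.add(key)
            unique.append(item.strip())
    return unique


sample = [
    "How can Paracetamol poisoning be prevented?",
    "How can paracetamol poisoning be prevented?",
    "Can Paracetamol poisoning be fatal?",
]
print(dedupe_instructions(sample))
```

The first occurrence of each instruction is kept with its original capitalization.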


### Pushing the Instruction dataset to Argilla to visualize and annotate

In [None]:
%pip install -q -U "distilabel[hf-inference-endpoints, argilla]" argilla

In [None]:
instructions_rg_dataset = generated_instructions.to_argilla()
instructions_rg_dataset[0]


modules.json:   0%|          | 0.00/229 [00:00<?, ?B/s]

config_sentence_transformers.json:   0%|          | 0.00/123 [00:00<?, ?B/s]

README.md:   0%|          | 0.00/65.5k [00:00<?, ?B/s]

sentence_bert_config.json:   0%|          | 0.00/53.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/745 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/34.8M [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/1.56k [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/712k [00:00<?, ?B/s]

added_tokens.json:   0%|          | 0.00/82.0 [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/228 [00:00<?, ?B/s]

1_Pooling/config.json:   0%|          | 0.00/190 [00:00<?, ?B/s]

Batches:   0%|          | 0/28 [00:00<?, ?it/s]

Batches:   0%|          | 0/28 [00:00<?, ?it/s]



ℹ The specified spaCy model "en_core_web_md" was not found on disk. Downloading...
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_md')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.


Output()

FeedbackRecord(fields={'input': 'Paracetamol poisoning: Paracetamol poisoning, also known as acetaminophen poisoning, is caused by excessive use of the medication paracetamol (acetaminophen). Most people have few or non-specific symptoms in the first 24 hours following overdose. These include feeling tired, abdominal pain, or nausea. This is typically fo', 'instructions': 'What is Paracetamol poisoning and what are its symptoms?'}, metadata={'length-input': 323, 'length-instructions': 56, 'generation-model': 'microsoft/phi-2', 'input_n_tokens': 46, 'input_n_unique_tokens': 38, 'input_n_sentences': 4, 'input_perplexity': 4.59, 'input_entropy': 1.52, 'input_flesch_reading_ease': 60.91, 'instructions_n_tokens': 9, 'instructions_n_unique_tokens': 8, 'instructions_n_sentences': 1, 'instructions_perplexity': 1.23, 'instructions_entropy': 0.2, 'instructions_flesch_reading_ease': 84.9}, vectors={'input': [-0.0694112628698349, -0.21183548867702484, -0.08680713176727295, 0.0717763751745224, 0.37

In [None]:
instructions_rg_dataset.push_to_argilla(name="wiki_medical_terms_instructions")

Output()

INFO:argilla.client.feedback.dataset.local.mixins:✓ Dataset succesfully pushed to Argilla
INFO:argilla.client.feedback.dataset.local.mixins:RemoteFeedbackDataset(
   id=04e3ff73-a079-4f96-82f3-d62d9ab3dad8
   name=wiki_medical_terms_instructions
   workspace=Workspace(id=8e7ba4eb-0f20-4214-97d2-1af479517625, name=admin, inserted_at=2024-02-12 10:33:22.586390, updated_at=2024-02-12 10:33:22.586390)
   url=https://ignacioct-argilla.hf.space/dataset/04e3ff73-a079-4f96-82f3-d62d9ab3dad8/annotation-mode
   fields=[RemoteTextField(id=UUID('49b074b4-b1ba-47f4-9821-1323b4246682'), client=None, name='input', title='input', required=True, type='text', use_markdown=True), RemoteTextField(id=UUID('9466bea0-55c0-400e-aea8-cfc3dc5c99b1'), client=None, name='instructions', title='instructions', required=True, type='text', use_markdown=False)]
   questions=[RemoteRatingQuestion(id=UUID('b92aa445-496b-4ca2-a7b2-8f373552ce00'), client=None, name='instruction-rating', title='How would you rate the genera

RemoteFeedbackDataset(
   id=04e3ff73-a079-4f96-82f3-d62d9ab3dad8
   name=wiki_medical_terms_instructions
   workspace=Workspace(id=8e7ba4eb-0f20-4214-97d2-1af479517625, name=admin, inserted_at=2024-02-12 10:33:22.586390, updated_at=2024-02-12 10:33:22.586390)
   url=https://ignacioct-argilla.hf.space/dataset/04e3ff73-a079-4f96-82f3-d62d9ab3dad8/annotation-mode
   fields=[RemoteTextField(id=UUID('49b074b4-b1ba-47f4-9821-1323b4246682'), client=None, name='input', title='input', required=True, type='text', use_markdown=True), RemoteTextField(id=UUID('9466bea0-55c0-400e-aea8-cfc3dc5c99b1'), client=None, name='instructions', title='instructions', required=True, type='text', use_markdown=False)]
   questions=[RemoteRatingQuestion(id=UUID('b92aa445-496b-4ca2-a7b2-8f373552ce00'), client=None, name='instruction-rating', title='How would you rate the generated instruction?', description=None, required=True, type='rating', values=[1, 2, 3, 4, 5, 6, 7, 8, 9, 10])]
   guidelines=None
   metadata_p

The dataset can be visualized [here](https://ignacioct-argilla.hf.space/dataset/04e3ff73-a079-4f96-82f3-d62d9ab3dad8/annotation-mode), under the name **wiki_medical_terms_instructions**. The username is *admin* and the password is *12345678*.

## Obtaining a dataset with the instructions and automatically generated answers

The next step in creating our dataset is to automatically generate answers to those instructions using LLMs. These answers, alongside the instructions, can then be curated and used to align LLMs.

In [4]:
remote_dataset = rg.FeedbackDataset.from_argilla(
    "wiki_medical_terms_instructions", workspace="admin"
)
instructions_dataset = remote_dataset.pull()  # pull the records into a local dataset

instructions_dataset = instructions_dataset.format_as("datasets")
instructions_dataset

Dataset({
    features: ['input', 'instructions', 'instruction-rating', 'instruction-rating-suggestion', 'instruction-rating-suggestion-metadata', 'external_id', 'metadata', 'vectors'],
    num_rows: 888
})

In [5]:
instructions_dataset = instructions_dataset.remove_columns(["input", "instruction-rating", "instruction-rating-suggestion", "instruction-rating-suggestion-metadata", "external_id", "vectors", "metadata"])
instructions_dataset = instructions_dataset.rename_columns({"instructions": "input"})

In [6]:
instructions_dataset

Dataset({
    features: ['input'],
    num_rows: 888
})

In [7]:
pipeline = Pipeline(
    generator=vLLM(
        model=LLM(model="abacaj/phi-2-super", dtype="float16"),
        task=TextGenerationTask(),
        max_new_tokens=512,
        temperature=0.3,
    ),
)

INFO 03-07 23:16:37 llm_engine.py:87] Initializing an LLM engine with config: model='abacaj/phi-2-super', tokenizer='abacaj/phi-2-super', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=2048, download_dir=None, load_format=auto, tensor_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, seed=0)


Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


INFO 03-07 23:16:41 weight_utils.py:163] Using model weights format ['*.safetensors']
INFO 03-07 23:17:00 llm_engine.py:357] # GPU blocks: 1509, # CPU blocks: 819
INFO 03-07 23:17:02 model_runner.py:684] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
INFO 03-07 23:17:02 model_runner.py:688] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
INFO 03-07 23:17:19 model_runner.py:756] Graph capturing finished in 17 secs.


In [8]:
preference_dataset = pipeline.generate(
    instructions_dataset,  # type: ignore
    num_generations=1,
    batch_size=8,
    display_progress_bar=True,
)

INFO:distilabel:Executing dry-run...
INFO:distilabel:Processing batch 1 of 1...
INFO:distilabel:Calling generator for batch 1...
  prompts = self._generate_prompts(inputs, default_format=None)


Flattening the indices:   0%|          | 0/1 [00:00<?, ? examples/s]

INFO:distilabel:Dry-run executed with no issues. Starting the actual generation...


Output()

INFO:distilabel:Processing batch 1 of 111...
INFO:distilabel:Calling generator for batch 1...


INFO:distilabel:Processing batch 2 of 111...
INFO:distilabel:Calling generator for batch 2...
INFO:distilabel:Processing batch 3 of 111...
INFO:distilabel:Calling generator for batch 3...
INFO:distilabel:Processing batch 4 of 111...
INFO:distilabel:Calling generator for batch 4...
INFO:distilabel:Processing batch 5 of 111...
INFO:distilabel:Calling generator for batch 5...
INFO:distilabel:Processing batch 6 of 111...
INFO:distilabel:Calling generator for batch 6...
INFO:distilabel:Processing batch 7 of 111...
INFO:distilabel:Calling generator for batch 7...
INFO:distilabel:Processing batch 8 of 111...
INFO:distilabel:Calling generator for batch 8...
INFO:distilabel:Processing batch 9 of 111...
INFO:distilabel:Calling generator for batch 9...
INFO:distilabel:Processing batch 10 of 111...
INFO:distilabel:Calling generator for batch 10...
INFO:distilabel:Processing batch 11 of 111...
INFO:distilabel:Calling generator for batch 11...
INFO:distilabel:Processing batch 12 of 111...
INFO:disti

Flattening the indices:   0%|          | 0/888 [00:00<?, ? examples/s]

Saving the dataset (0/1 shards):   0%|          | 0/888 [00:00<?, ? examples/s]

INFO:distilabel:Checkpoint saved to disk: /content/ckpt.
INFO:distilabel:Final dataset saved at /content/ckpt


In [9]:
preference_dataset[3]

{'input': 'How can Paracetamol poisoning be prevented?',
 'generation_model': ['abacaj/phi-2-super'],
 'generation_prompt': ["You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe. Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature.\nIf a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information.\nHow can Paracetamol poisoning be prevented?"],
 'raw_generation_responses': ['\n\nParacetamol poisoning can be prevented by following these guidelines:\n\n1. Always read and follow the instructions on the medication label and package.\n2. Do not exceed the recommended dosage.\n3. Avoid taking more than one paracetamol product at a time, as this can increase the r

In [10]:
preference_dataset.save_to_disk("preference_dataset")

Saving the dataset (0/1 shards):   0%|          | 0/888 [00:00<?, ? examples/s]

In [4]:
preference_dataset = CustomDataset.load_from_disk("preference_dataset")

In [5]:
preference_dataset[5]

{'input': 'What are the common side effects of paracetamol overdose?',
 'generation_model': ['abacaj/phi-2-super'],
 'generation_prompt': ["You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe. Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature.\nIf a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information.\nWhat are the common side effects of paracetamol overdose?"],
 'raw_generation_responses': ['\nAssistant: Paracetamol overdose can cause serious liver damage and even be fatal. The most common side effects of paracetamol overdose include nausea, vomiting, abdominal pain, and dark urine. Other symptoms may include loss of appetite, fever, and jaundice
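
The raw responses shown above start with newlines and, in some cases, an `Assistant:` role prefix. If the answers are to be curated, it may help to normalize them first. The following is a minimal sketch assuming these are the only artifacts to strip; `clean_response` is a hypothetical helper, not part of distilabel.

```python
import re

# Matches an optional leading "Assistant:" role marker and the whitespace around it.
_ROLE_PREFIX = re.compile(r"^\s*Assistant:\s*", flags=re.IGNORECASE)


def clean_response(raw: str) -> str:
    """Remove a leading 'Assistant:' marker and surrounding whitespace from a response."""
    return _ROLE_PREFIX.sub("", raw).strip()


print(clean_response("\nAssistant: Paracetamol overdose can cause serious liver damage."))
print(clean_response("\n\nParacetamol poisoning can be prevented by following these guidelines:"))
```

Text in the middle of a response is left untouched, since the pattern is anchored at the start of the string.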

In [6]:
# Uploading the Preference Dataset
preference_rg_dataset = preference_dataset.to_argilla()

# Adding the context as a metadata property in the new Feedback dataset, as this
# information will be useful later.
# for record_feedback, record_huggingface in zip(
#     preference_rg_dataset, preference_dataset
# ):
#     record_feedback.metadata["context"] = record_huggingface["context"]

preference_rg_dataset.push_to_argilla(name="wiki_medical_terms_instructions_answers")



Batches:   0%|          | 0/28 [00:00<?, ?it/s]

Batches:   0%|          | 0/23 [00:00<?, ?it/s]



Output()

Output()

INFO:argilla.client.feedback.dataset.local.mixins:✓ Dataset succesfully pushed to Argilla
INFO:argilla.client.feedback.dataset.local.mixins:RemoteFeedbackDataset(
   id=80ecd19d-a563-45c3-8f84-0b2d127c7820
   name=wiki_medical_terms_instructions_answers
   workspace=Workspace(id=8e7ba4eb-0f20-4214-97d2-1af479517625, name=admin, inserted_at=2024-02-12 10:33:22.586390, updated_at=2024-02-12 10:33:22.586390)
   url=https://ignacioct-argilla.hf.space/dataset/80ecd19d-a563-45c3-8f84-0b2d127c7820/annotation-mode
   fields=[RemoteTextField(id=UUID('0e080f2a-cb4b-433b-93b6-3d88736635f3'), client=None, name='input', title='input', required=True, type='text', use_markdown=True), RemoteTextField(id=UUID('f34c9d26-7286-4397-8b9f-66425a83f311'), client=None, name='generations-1', title='generations-1', required=True, type='text', use_markdown=True)]
   questions=[RemoteRatingQuestion(id=UUID('25b9ba9c-f4df-4f85-9d9f-488a1219df5f'), client=None, name='generations-1-rating', title='How would you rate

RemoteFeedbackDataset(
   id=80ecd19d-a563-45c3-8f84-0b2d127c7820
   name=wiki_medical_terms_instructions_answers
   workspace=Workspace(id=8e7ba4eb-0f20-4214-97d2-1af479517625, name=admin, inserted_at=2024-02-12 10:33:22.586390, updated_at=2024-02-12 10:33:22.586390)
   url=https://ignacioct-argilla.hf.space/dataset/80ecd19d-a563-45c3-8f84-0b2d127c7820/annotation-mode
   fields=[RemoteTextField(id=UUID('0e080f2a-cb4b-433b-93b6-3d88736635f3'), client=None, name='input', title='input', required=True, type='text', use_markdown=True), RemoteTextField(id=UUID('f34c9d26-7286-4397-8b9f-66425a83f311'), client=None, name='generations-1', title='generations-1', required=True, type='text', use_markdown=True)]
   questions=[RemoteRatingQuestion(id=UUID('25b9ba9c-f4df-4f85-9d9f-488a1219df5f'), client=None, name='generations-1-rating', title='How would you rate the generation at `generations-1`?', description=None, required=True, type='rating', values=[1, 2, 3, 4, 5, 6, 7, 8, 9, 10])]
   guidelines

The dataset can be visualized [here](https://ignacioct-argilla.hf.space/dataset/80ecd19d-a563-45c3-8f84-0b2d127c7820/annotation-mode), under the name **wiki_medical_terms_instructions_answers**. The username is *admin* and the password is *12345678*.
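
Once annotators have rated the answers on the 1 to 10 scale, the curated subset can be selected with a simple threshold. The sketch below assumes each annotated record has already been flattened into a plain dict with hypothetical keys `input`, `generation` and `rating`; the actual structure returned by Argilla is richer and would need to be unpacked first. The threshold of 7 is also only an example.

```python
def select_curated(records, min_rating=7):
    """Keep instruction/answer pairs whose rating is at least min_rating."""
    curated = []
    for record in records:
        rating = record.get("rating")  # None means the record was not annotated
        if rating is not None and rating >= min_rating:
            curated.append(
                {"instruction": record["input"], "response": record["generation"]}
            )
    return curated


# Made-up annotated records for illustration.
annotated = [
    {"input": "Question 1", "generation": "Answer 1", "rating": 9},
    {"input": "Question 2", "generation": "Answer 2", "rating": 3},
    {"input": "Question 3", "generation": "Answer 3", "rating": None},
]
print(select_curated(annotated))
```

Only the first record passes the filter here; unrated records are skipped rather than treated as low quality.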