# SelfInstruct with Distilabel and OpenAI

[<img src="https://raw.githubusercontent.com/argilla-io/distilabel/main/docs/assets/distilabel-badge-light.png" alt="Built with Distilabel" width="200" height="32"/>](https://github.com/argilla-io/distilabel)

This notebook shows how to generate instructions with the **SelfInstruct approach**
- It is based on this paper: https://arxiv.org/pdf/2212.10560

**Table of contents**
- Start from a list of very specific topics.
- Create a distilabel pipeline with SelfInstruct.
- Generate a dataset of instructions related to these topics.

# Imports and Setup of distilabel with OpenAI
- Since we use `openai`, you'll need to set up your `OPENAI_API_KEY` (see this [guide](https://github.com/patrickfleith/datapipes/blob/main/notebooks/How_to_use_an_OpenAI_Chat_model.ipynb)).
- We will not use the `openai` client directly but the distilabel wrapper around it, which plugs into the `SelfInstruct` task.

In [1]:
!pip install distilabel --upgrade --quiet


In [2]:
from distilabel.steps.tasks import SelfInstruct
from distilabel.llms import OpenAILLM
from google.colab import userdata
import distilabel
import openai

In [3]:
# print my versions so you can compare them with yours in case of trouble
print(f"Open AI version: {openai.__version__}")
print(f"Distilabel version: {distilabel.__version__}")

Open AI version: 1.57.4
Distilabel version: 1.4.2


In [4]:
# load your API key
OPENAI_API_KEY = userdata.get('OPENAI_API_KEY')
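
The cell above relies on `google.colab.userdata`, which only exists inside Colab. Below is a minimal sketch, assuming you may also want to run the notebook locally with the key exported as an environment variable. `load_openai_api_key` is a hypothetical helper name, not part of distilabel or openai.

```python
import os


def load_openai_api_key(name: str = "OPENAI_API_KEY") -> str:
    """Return the key from Colab secrets when available, else from the environment."""
    try:
        # Only importable inside Google Colab.
        from google.colab import userdata
        key = userdata.get(name)
    except ImportError:
        # Assumption: outside Colab the key was exported as an environment variable.
        key = os.environ.get(name)
    if not key:
        raise RuntimeError(f"{name} is not set")
    return key
```

With this helper, the cell above could become `OPENAI_API_KEY = load_openai_api_key()`.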

Documentation for `SelfInstruct` is here:
- `https://distilabel.argilla.io/dev/components-gallery/tasks/selfinstruct/`

# Creating the Instruction Generation pipeline with `SelfInstruct`

For the inner workings of `SelfInstruct`, read the [paper](https://arxiv.org/pdf/2212.10560): it is short and very informative. In this notebook we focus on applying it and shipping a dataset ⚡

### So what is SelfInstruct?

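In distilabel, `SelfInstruct` is a task that prompts an LLM to write user queries (instructions) for a given AI application and a given context (the seed).

The mandatory pieces are:
- **`llm`**: the model that generates the instructions. It can be a closed model behind an API (OpenAI, Gemini, Anthropic) or an open-source one such as Llama or Mistral.
- A list of seed topics, typically something like `topics = ['spacewalk', 'french cheese', 'hiking in winter']`.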




In [5]:
llm = OpenAILLM(model="gpt-4o-mini", api_key=OPENAI_API_KEY)
seeds = ['spacewalk', 'french cheese', 'hiking in winter']


self_instruct = SelfInstruct(
    llm=llm
)

self_instruct.load()



In [6]:
result = next(self_instruct.process([{"input": seed} for seed in seeds]))

By default, a `SelfInstruct` task generates one result object per seed.
- Each result contains 5 generated instructions (the default, which can be changed; see below).

In [7]:
# there are as many results as seeds
len(result), len(seeds)

(3, 3)

**Let's look at the first item in result:**
- It is a dictionary → convenient for further processing.
- It keeps the input seed for traceability.
- It includes the list of generated instructions under the `instructions` key.
- It keeps track of the model used to generate them.
- It keeps track of the full criteria list.
- It also exposes the full prompt (including the criteria used).

In [8]:
result[0]

{'input': 'spacewalk',
 'instructions': ['What challenges do astronauts face during a spacewalk?  ',
  'Explain the steps involved in preparing for a spacewalk mission.  ',
  'Describe the tools and equipment typically used in a spacewalk.  ',
  'Illustrate the importance of teamwork during a spacewalk operation.  ',
  'Summarize the historical milestones in spacewalks since their inception.'],
 'distilabel_metadata': {'raw_output_self_instruct_0': 'What challenges do astronauts face during a spacewalk?  \nExplain the steps involved in preparing for a spacewalk mission.  \nDescribe the tools and equipment typically used in a spacewalk.  \nIllustrate the importance of teamwork during a spacewalk operation.  \nSummarize the historical milestones in spacewalks since their inception.',
  'raw_input_self_instruct_0': [{'role': 'user',
    'content': '# Task Description\nDevelop 5 user queries that can be received by the given AI application and applicable to the provided context. Emphasize 

Let's look at the instructions generated for the seed `spacewalk`:

In [9]:
# let's look at the instructions
for idx, inst in enumerate(result[0]['instructions']):
    print(f"{idx+1}. {inst}")

1. What challenges do astronauts face during a spacewalk?  
2. Explain the steps involved in preparing for a spacewalk mission.  
3. Describe the tools and equipment typically used in a spacewalk.  
4. Illustrate the importance of teamwork during a spacewalk operation.  
5. Summarize the historical milestones in spacewalks since their inception.
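
Several of the raw instructions in `result[0]` end with trailing spaces. Below is a minimal sketch for tidying them before building a dataset. `clean_instructions` is a hypothetical helper, not a distilabel function, and dropping exact duplicates is an extra assumption about what you may want.

```python
def clean_instructions(instructions):
    """Strip surrounding whitespace, drop empty strings and exact duplicates (order kept)."""
    cleaned = []
    seen = set()
    for text in instructions:
        text = text.strip()
        if text and text not in seen:
            seen.add(text)
            cleaned.append(text)
    return cleaned


sample = [
    "What challenges do astronauts face during a spacewalk?  ",
    "Explain the steps involved in preparing for a spacewalk mission.  ",
    "",
]
for line in clean_instructions(sample):
    print(repr(line))
```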


We can also check the full prompt used under the hood 👀

In [11]:
print(result[0]['distilabel_metadata']['raw_input_self_instruct_0'][0]['content'])

# Task Description
Develop 5 user queries that can be received by the given AI application and applicable to the provided context. Emphasize diversity in verbs and linguistic structures within the model's textual capabilities.

# Criteria for Queries
Incorporate a diverse range of verbs, avoiding repetition.
Ensure queries are compatible with AI model's text generation functions and are limited to 1-2 sentences.
Design queries to be self-contained and standalone.
Blend interrogative (e.g., "What is the significance of x?") and imperative (e.g., "Detail the process of x.") styles.
Write each query on a separate line and avoid using numbered lists or bullet points.

# AI Application
AI assistant

# Context
spacewalk

# Output



# Customise the pipeline

To customise the pipeline, we can change the following parameters:

- **`application_description`**: (Optional) A description of the AI application one wants to build with these instructions. Defaults to `AI assistant`.
- **`criteria_for_query_generation`**: (Optional) The criteria inserted into the prompt to steer the model towards good-quality instructions.
- **`num_instructions`**: (Optional) The number of instructions to generate. Defaults to 5.
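
To see where each parameter ends up, the sketch below re-creates the layout of the prompt printed earlier in this notebook. `render_prompt` is a hypothetical function written for illustration only. It is not distilabel's template code, and the real task may assemble its prompt differently.

```python
# Hypothetical re-creation of the prompt layout printed earlier in this notebook.
# It only mirrors the visible sections; it is NOT distilabel's implementation.
DEFAULT_CRITERIA = (
    "Incorporate a diverse range of verbs, avoiding repetition.\n"
    "Ensure queries are compatible with AI model's text generation functions "
    "and are limited to 1-2 sentences.\n"
    "Design queries to be self-contained and standalone.\n"
    'Blend interrogative (e.g., "What is the significance of x?") and '
    'imperative (e.g., "Detail the process of x.") styles.\n'
    "Write each query on a separate line and avoid using numbered lists or bullet points."
)


def render_prompt(
    seed,
    num_instructions=5,
    application_description="AI assistant",
    criteria_for_query_generation=DEFAULT_CRITERIA,
):
    """Assemble a prompt with the same sections as the one printed earlier."""
    return (
        "# Task Description\n"
        f"Develop {num_instructions} user queries that can be received by the given "
        "AI application and applicable to the provided context. Emphasize diversity in "
        "verbs and linguistic structures within the model's textual capabilities.\n\n"
        "# Criteria for Queries\n"
        f"{criteria_for_query_generation.strip()}\n\n"
        "# AI Application\n"
        f"{application_description}\n\n"
        "# Context\n"
        f"{seed}\n\n"
        "# Output\n"
    )


prompt = render_prompt(
    "hat",
    num_instructions=7,
    application_description="AI assistant for a clothing shop",
)
print(prompt)
```

Each of the three parameters maps to one section of the prompt, and the seed fills the `# Context` section.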

In [14]:
llm = OpenAILLM(model="gpt-4o-mini", api_key=OPENAI_API_KEY)

application_description = "AI assistant for customer advice in a clothing shop"

custom_criteria_for_query_generation = """
Incorporate a diverse range of verbs, avoiding repetition.
Ensure queries are compatible with AI model's text generation functions and are limited to 1-2 sentences.
Design queries to be self-contained and standalone.
Blend interrogative (e.g., "What is the significance of x?") and imperative (e.g., "Detail the process of x.") styles.
All queries shall absolutely be written in English
"""

seeds = ['new trends', 'hat', 'hiking in winter']

self_instruct = SelfInstruct(
    llm=llm,
    num_instructions=7,
    application_description=application_description,
    criteria_for_query_generation=custom_criteria_for_query_generation
)

self_instruct.load()

result = next(self_instruct.process([{"input": seed} for seed in seeds]))



In [15]:
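# use distinct loop variables so the inner loop does not shadow the outer index
for seed_idx, seed in enumerate(seeds):
    print(seed, '\n')
    for inst_idx, inst in enumerate(result[seed_idx]['instructions']):
        print(f"{inst_idx+1}. {inst}")
    print('\n')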

new trends 

1. What are the latest clothing trends for this season?  
2. Describe how to incorporate bold colors into a wardrobe this year.  
3. Can you summarize the emerging styles in activewear?  
4. Suggest some outfit combinations that align with the current fashion trends.  
5. Highlight the key fabrics that are gaining popularity right now.  
6. Explore the influence of streetwear on contemporary fashion trends.  
7. Outline the must-have accessories that complement this season’s styles.  


hat 

1. What are the latest trends in hats for this season? 
2. Explain the differences between fedoras and baseball caps in terms of style and occasion. 
3. Suggest some outfit combinations that would work well with a wide-brimmed hat. 
4. How should I care for and maintain my wool felt hat to ensure it lasts? 
5. Illustrate the process of choosing the right hat size for different head shapes. 
6. Can you recommend any eco-friendly hat brands that prioritize sustainable materials? 
7. Des

In [16]:
def transform_to_dict(data):
    """Flatten the results into two parallel lists: one row per (seed, instruction) pair."""
    result_dict = {"input": [], "instruction": []}

    for entry in data:
        input_value = entry['input']
        # repeat the seed once per generated instruction
        for instruction in entry['instructions']:
            result_dict["input"].append(input_value)
            result_dict["instruction"].append(instruction)

    return result_dict

dataset_as_dict = transform_to_dict(result)
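
To make the flattening concrete, here is a small self-contained run of the same function on toy data. The seeds and instructions below are made up for illustration and are not model outputs.

```python
def transform_to_dict(data):
    """Flatten results into parallel lists: one row per (seed, instruction) pair."""
    result_dict = {"input": [], "instruction": []}
    for entry in data:
        for instruction in entry["instructions"]:
            result_dict["input"].append(entry["input"])
            result_dict["instruction"].append(instruction)
    return result_dict


# Toy data shaped like the SelfInstruct results (illustrative values only).
toy_result = [
    {"input": "hat", "instructions": ["Suggest a hat for summer.", "How do I clean a felt hat?"]},
    {"input": "scarf", "instructions": ["Describe how to tie a scarf."]},
]

flat = transform_to_dict(toy_result)
print(flat["input"])             # ['hat', 'hat', 'scarf']
print(len(flat["instruction"]))  # 3
```

Each seed is repeated once per instruction, so both lists have the same length and can be used as dataset columns.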

In [17]:
from datasets import Dataset

HF_USERNAME = 'patrickfleith'
DATASET_NAME = 'selfinstruct_demo'
PRIVATE = False  # if False, the dataset will be public on the HF Hub

DATASET_ID = HF_USERNAME + '/' + DATASET_NAME

# build a Hugging Face dataset with one (input, instruction) pair per row
ds = Dataset.from_dict(dataset_as_dict, split='train')

In [18]:
ds.push_to_hub(DATASET_ID, private=PRIVATE)

CommitInfo(commit_url='https://huggingface.co/datasets/patrickfleith/selfinstruct_demo/commit/132a9d407cd53793165b3a4097b2dc0cd7c2f63d', commit_message='Upload dataset', commit_description='', oid='132a9d407cd53793165b3a4097b2dc0cd7c2f63d', pr_url=None, repo_url=RepoUrl('https://huggingface.co/datasets/patrickfleith/selfinstruct_demo', endpoint='https://huggingface.co', repo_type='dataset', repo_id='patrickfleith/selfinstruct_demo'), pr_revision=None, pr_num=None)