# Self-Instruct Data Generation Example
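This notebook demonstrates a simplified Self-Instruct-style pipeline for generating synthetic data for two tasks: **Sentiment Classification** and **Question Answering (QA)**. Each seed instruction is sent to a language model several times, and the sampled outputs are saved as a JSONL dataset.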

## Setup
Install the `openai` library and set your OpenAI API key.

In [None]:
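# The `openai.Completion` interface used below was removed in openai 1.0,
# so pin an earlier release.
!pip install "openai<1.0"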

import os
import json
import openai

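# Set your OpenAI API key. Prefer exporting OPENAI_API_KEY before starting the
# notebook; the placeholder is used only if the variable is not already set.
os.environ.setdefault('OPENAI_API_KEY', 'YOUR_API_KEY_HERE')
openai.api_key = os.getenv('OPENAI_API_KEY')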
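
Hard-coding a key in a notebook makes it easy to leak when the notebook is shared. As one possible alternative, the sketch below reads the key from the environment and falls back to an interactive prompt. The helper name `resolve_api_key` and its parameters are illustrative assumptions, not part of the `openai` package.

```python
import getpass
import os


def resolve_api_key(env=None, prompt_fn=getpass.getpass):
    """Return the API key from `env`, or ask for it if it is missing.

    Hypothetical helper for this notebook, not an `openai` function.
    """
    env = os.environ if env is None else env
    key = env.get("OPENAI_API_KEY", "").strip()
    if not key or key == "YOUR_API_KEY_HERE":
        # Nothing usable in the environment: prompt without echoing the input.
        key = prompt_fn("OpenAI API key: ").strip()
    if not key:
        raise ValueError("No OpenAI API key provided")
    return key
```

With this helper, `openai.api_key = resolve_api_key()` could stand in for the two assignment lines in the cell above.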

## Define Seed Instructions
We start with two seed instructions, one for sentiment classification and one for QA.

In [None]:
seed_instructions = [
    {
        "instruction": "Classify the sentiment of the following review: positive, neutral, or negative.",
        "input": "I absolutely loved this product, it works flawlessly!"
    },
    {
        "instruction": "Answer the question based on the context provided.",
        "input": "Context: The Eiffel Tower is located in Paris.\nQuestion: Where is the Eiffel Tower located?"
    }
]

# Number of synthetic examples per seed
num_samples = 5
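
Because every later cell indexes seeds by the keys `instruction` and `input`, a malformed seed only fails once an API call is already under way. A minimal sketch of an up-front check follows; `validate_seeds` and `REQUIRED_KEYS` are hypothetical names introduced here for illustration.

```python
REQUIRED_KEYS = ("instruction", "input")


def validate_seeds(seeds):
    """Raise ValueError if any seed lacks a non-empty string for a required key.

    Returns the number of seeds checked. Hypothetical helper for this notebook.
    """
    for index, seed in enumerate(seeds):
        for key in REQUIRED_KEYS:
            value = seed.get(key)
            if not isinstance(value, str) or not value.strip():
                raise ValueError(f"seed {index} has no usable '{key}' field")
    return len(seeds)


example_seeds = [
    {
        "instruction": "Answer the question based on the context provided.",
        "input": "Context: The Eiffel Tower is located in Paris.\nQuestion: Where is the Eiffel Tower located?",
    },
]
print(validate_seeds(example_seeds))  # prints 1
```

Calling `validate_seeds(seed_instructions)` before generation would catch a missing or empty field early.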

## Generation Function
Define a function that calls the OpenAI API to generate an output for a given seed instruction and input.

In [None]:
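def generate_example(seed, model="gpt-3.5-turbo-instruct"):
    """Generate one output for a seed instruction/input pair."""
    # `text-davinci-003` has been retired by OpenAI; `gpt-3.5-turbo-instruct`
    # is a completions-compatible replacement.
    prompt = f"Instruction: {seed['instruction']}\nInput: {seed['input']}\nOutput:"
    response = openai.Completion.create(
        model=model,
        prompt=prompt,
        max_tokens=64,
        temperature=0.7,  # some randomness, so repeated calls can differ
        n=1,
        stop=["\n"]  # keep only the first line of the completion
    )
    text = response.choices[0].text.strip()
    return {
        "instruction": seed['instruction'],
        "input": seed['input'],
        "output": text
    }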
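
API calls can fail transiently, for example on rate limits or network timeouts, and a single failure would abort the whole generation loop. The sketch below shows one way to retry with exponential backoff. `call_with_retries` is a hypothetical helper, and catching every `Exception` is a simplifying assumption; in practice the `except` clause would be narrowed to the client library's rate-limit and timeout errors.

```python
import time


def call_with_retries(fn, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call `fn()` and retry on failure, doubling the wait after each attempt.

    Hypothetical helper; `sleep` is injectable so the wait can be skipped in tests.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            # Wait base_delay, then 2 * base_delay, then 4 * base_delay, ...
            sleep(base_delay * 2 ** (attempt - 1))
```

A call such as `call_with_retries(lambda: generate_example(seed))` would wrap the function defined above without changing it.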

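## Generate Synthetic Dataset
Call the generation function `num_samples` times for each seed and save the results as a JSON Lines file.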

In [None]:
dataset = []
for seed in seed_instructions:
    for _ in range(num_samples):
        example = generate_example(seed)
        dataset.append(example)

# Save to JSONL file
output_path = 'self_instruct_dataset.jsonl'
with open(output_path, 'w', encoding='utf-8') as f:
    for item in dataset:
        f.write(json.dumps(item, ensure_ascii=False) + '\n')
print(f"Saved {len(dataset)} examples to {output_path}")
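
Since the same seed is sampled several times, the model may return the same output more than once, and a completion can also come back empty. A minimal sketch of a filter that could run before saving is shown below; `deduplicate` is a hypothetical helper, and the sample records are made up for illustration rather than actual model output.

```python
def deduplicate(examples):
    """Drop examples with an empty output or a repeated instruction/input/output.

    Outputs are compared case-insensitively after stripping whitespace.
    Hypothetical helper for this notebook.
    """
    seen = set()
    unique = []
    for ex in examples:
        output = ex["output"].strip()
        key = (ex["instruction"], ex["input"], output.lower())
        if not output or key in seen:
            continue
        seen.add(key)
        unique.append(ex)
    return unique


# Made-up records for illustration only, not real model output.
sample = [
    {"instruction": "Classify", "input": "Great!", "output": "Positive"},
    {"instruction": "Classify", "input": "Great!", "output": "positive "},
    {"instruction": "Classify", "input": "Great!", "output": ""},
]
print(len(deduplicate(sample)))  # prints 1
```

Applying `dataset = deduplicate(dataset)` before the file is written would keep only distinct, non-empty examples.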

## Inspect Generated Examples
Print a few records to check that the outputs look reasonable.

In [None]:
# Display the first 5 examples (with num_samples = 5, these all come from the first seed)
for ex in dataset[:5]:
    print(json.dumps(ex, ensure_ascii=False, indent=2))
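
Printing a handful of records shows the format but not how varied the outputs are. As a rough complement, the sketch below reloads a JSONL file and counts how often each output occurs per instruction. `summarize_outputs` is a hypothetical helper, and it assumes each line holds the `instruction` and `output` keys written by the cell above.

```python
import json
from collections import Counter


def summarize_outputs(path):
    """Count how often each output occurs for every instruction in a JSONL file.

    Returns a dict mapping instruction text to a Counter of outputs.
    Hypothetical helper for this notebook.
    """
    counts = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            if not line.strip():
                continue  # tolerate blank lines
            record = json.loads(line)
            per_instruction = counts.setdefault(record["instruction"], Counter())
            per_instruction[record["output"]] += 1
    return counts
```

Running `summarize_outputs(output_path)` after generation gives a quick view of label balance for the sentiment seed and of answer variety for the QA seed.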