# Self-Instruct Pipelines (MVP)

**Project:** Synthetic Data Creation: Survey and Synthesis  
**Method Group:** Language-Model and Cognitive Generation  
**Sub-method:** Self-Instruct Pipelines  
**Author:** Prajna Penmetsa

**Goal:** Implement an iterative synthetic-data generation pipeline in which a language model extends its own **instruction–response** dataset.  
- The pipeline calls the Gemini 2.5 Flash REST API. It starts from a few manually written *seed examples*, prompts the model to generate new tasks and solutions from a random sample of the current pool, and repeats this process across multiple rounds.  
- The aim is to show how self-instruct pipelines can improve dataset diversity and quality for learning-middleware applications without manual labeling.

In [1]:
from dotenv import load_dotenv
import os, json, requests, time, random

load_dotenv()
API_KEY = os.getenv("GEMINI_API_KEY")
MODEL = "gemini-2.5-flash"
GEMINI_URL = f"https://generativelanguage.googleapis.com/v1beta/models/{MODEL}:generateContent?key={API_KEY}"

os.makedirs("outputs/sip", exist_ok=True)

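def call_gemini(prompt_text):
    """Send one prompt to Gemini; return the generated text, or None on failure."""
    payload = {"contents": [{"parts": [{"text": prompt_text}]}]}
    try:
        r = requests.post(GEMINI_URL, json=payload, timeout=60)
        if not r.ok:
            print("❌ Error:", r.status_code, r.text)
            return None
        return r.json()["candidates"][0]["content"]["parts"][0]["text"]
    except requests.RequestException as e:
        print("❌ Request failed:", e)
    except (KeyError, IndexError, ValueError) as e:
        # Blocked or empty responses may carry no candidate text.
        print("❌ Unexpected response shape:", e)
    return None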

In [2]:
seed_examples = [
    {"instruction": "Solve for x: 3x + 9 = 21", 
     "response": "Subtract 9 → 3x = 12 → x = 4."},
    {"instruction": "Explain photosynthesis in one sentence.", 
     "response": "Plants convert sunlight, carbon dioxide, and water into glucose and oxygen."},
    {"instruction": "Translate 'Good morning' to Spanish.", 
     "response": "'Buenos días'."},
    {"instruction": "Find the perimeter of a square with side 8 cm.", 
     "response": "Perimeter = 4 × 8 = 32 cm."},
]
print(f"Seed examples: {len(seed_examples)}")


Seed examples: 4


In [3]:
def make_meta_prompt(examples, n_new=5):
    sample_text = json.dumps(examples, indent=2)
    prompt = f"""
You are a dataset generator improving your own instruction set.
Below are example instruction–response pairs.

{sample_text}

Generate {n_new} new, diverse, high-quality pairs that follow the same JSON format:
[
  {{
    "instruction": "...",
    "response": "..."
  }},
  ...
]

Output only the JSON array. No commentary or markdown.
"""
    return prompt


In [8]:
import re

all_data = seed_examples.copy()
num_rounds = 3
n_per_round = 5

for round_i in range(num_rounds):
    print(f"\n=== Round {round_i+1} ===")
    # Prompt with up to 5 examples sampled from the current pool (seeds + generated).
    prompt = make_meta_prompt(random.sample(all_data, min(5, len(all_data))), n_new=n_per_round)
    gen_text = call_gemini(prompt)
    if not gen_text:
        continue
    # Extract the outermost [...] span in case the model wraps the JSON in extra text.
    match = re.search(r"\[.*\]", gen_text, re.DOTALL)
    if match:
        try:
            new_items = json.loads(match.group(0))
            all_data.extend(new_items)
            print(f"✅ Added {len(new_items)} new examples. Total now: {len(all_data)}")
        except json.JSONDecodeError as e:
            print("⚠️ Parse error:", e)
    else:
        print("⚠️ No JSON found in response.")

# Save dataset
with open("outputs/self_instruct_dataset.json", "w", encoding="utf-8") as f:
    json.dump(all_data, f, indent=2, ensure_ascii=False)
print(f"\n✅ Saved full dataset with {len(all_data)} examples.")



=== Round 1 ===
✅ Added 5 new examples. Total now: 9

=== Round 2 ===
✅ Added 5 new examples. Total now: 14

=== Round 3 ===
✅ Added 5 new examples. Total now: 19

✅ Saved full dataset with 19 examples.


### Observations & Reflection (From Generated Output)

The self-instruct pipeline using **Gemini 2.5 Flash** expanded the 4 seed examples into 19 instruction–response pairs (15 generated over 3 rounds) across multiple subject areas. The observations below are based on a review of the generated output.

**1. Structural and Format Quality**  
- Every entry strictly followed the required JSON schema (`instruction`, `response`).  
- Responses were concise and aligned directly with the question intent.  
- No invalid or empty fields were detected.
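
The structural checks above could be automated along the following lines. This is a minimal sketch, not the check used for this run: `validate_pairs` is a hypothetical helper, the second and third sample entries are made up for illustration, and the sketch assumes each entry should be an object with exactly the two non-empty string fields `instruction` and `response`.

```python
REQUIRED_KEYS = {"instruction", "response"}

def validate_pairs(items):
    """Return (index, reason) tuples for entries that break the assumed schema."""
    problems = []
    for i, item in enumerate(items):
        if not isinstance(item, dict):
            problems.append((i, "not an object"))
            continue
        if set(item) != REQUIRED_KEYS:
            problems.append((i, "unexpected or missing keys"))
            continue
        for key in sorted(REQUIRED_KEYS):
            value = item[key]
            if not isinstance(value, str) or not value.strip():
                problems.append((i, f"empty or non-string {key}"))
    return problems

# First entry is a seed example; the other two are invented to trigger failures.
sample = [
    {"instruction": "Solve for x: 3x + 9 = 21", "response": "Subtract 9 → 3x = 12 → x = 4."},
    {"instruction": "Name a primary color.", "response": "   "},
    {"instruction": "Missing the response field"},
]
print(validate_pairs(sample))
# [(1, 'empty or non-string response'), (2, 'unexpected or missing keys')]
```

Running such a check on each round's `new_items` before extending the pool would keep malformed entries from being sampled into later prompts.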

**2. Diversity and Domain Spread**  
- Tasks spanned math (algebra, geometry, units), science (photosynthesis, supernova, human heart), language (translation, spelling, grammar), and general knowledge (capitals, leap year).  
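- The seeds covered only algebra, photosynthesis, translation, and geometry, so the remaining topics (for example supernovae, the human heart, capitals, and leap years) were introduced by the model rather than copied from the seed examples.  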
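
The diversity observation above is qualitative. One rough way to put a number on it is the mean pairwise Jaccard similarity between instruction word sets, where lower values suggest more lexical variety. The sketch below is an assumption about how this could be measured, not something computed for this run; `tokens`, `jaccard`, and `mean_pairwise_similarity` are hypothetical helpers, and word overlap is only a crude proxy for topical diversity.

```python
import re
from itertools import combinations

def tokens(text):
    """Lowercased alphanumeric word tokens of a string, as a set."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def jaccard(a, b):
    """Jaccard similarity of two token sets (0 = disjoint, 1 = identical)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def mean_pairwise_similarity(instructions):
    """Average Jaccard similarity over all pairs of instructions."""
    sets = [tokens(t) for t in instructions]
    pairs = list(combinations(sets, 2))
    if not pairs:
        return 0.0
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)

# The four seed instructions from this notebook.
seed_instructions = [
    "Solve for x: 3x + 9 = 21",
    "Explain photosynthesis in one sentence.",
    "Translate 'Good morning' to Spanish.",
    "Find the perimeter of a square with side 8 cm.",
]
print(mean_pairwise_similarity(seed_instructions))
```

Comparing this score for the seeds against the score for the full dataset, or tracking it per round, would show whether later rounds drift toward near-duplicate phrasing.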

**3. Pedagogical Strength**  
- Questions were simple, factual, and clear enough for middle-school or general-knowledge learning contexts.  
- Responses were short, direct, and suitable for automated tutoring or curriculum generation.  

**4. Fidelity and Correctness**  
- All mathematical computations and factual statements were correct.  
- Examples: *3x – 7 = 11 → x = 6* and *25 °C → 77 °F* were both verified as correct.  
- No hallucinations or off-topic outputs were observed.  
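
The two spot checks cited above can be reproduced with exact arithmetic. This is a small illustrative sketch; `solve_linear` and `celsius_to_fahrenheit` are hypothetical helper names, and only answers of these two forms are covered.

```python
from fractions import Fraction

def solve_linear(a, b, c):
    """Solve a*x + b = c exactly for x (requires a != 0)."""
    return Fraction(c - b, a)

def celsius_to_fahrenheit(celsius):
    """Convert Celsius to Fahrenheit with F = C * 9/5 + 32."""
    return Fraction(celsius) * Fraction(9, 5) + 32

print(solve_linear(3, -7, 11))      # 3x - 7 = 11  → 6
print(celsius_to_fahrenheit(25))    # 25 °C        → 77
```

`Fraction` avoids floating-point rounding, so the comparison against the model's stated answer can be an exact equality test.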

**5. Evaluation Summary**

| Metric | Observation |
|:--|:--|
| Structural fidelity | Excellent – consistent JSON output |
| Content accuracy | High – correct and verifiable responses |
| Diversity | Strong – multiple subjects and skills |
| Pedagogical clarity | High – readable, age-appropriate phrasing |

**6. Overall Insight**  
The self-instruct pipeline (SIP) method effectively expanded a small seed set into a broader, balanced dataset without manual authoring.  
It demonstrates the ability of large language models to *self-bootstrap* instructional data that can later be filtered, tagged by difficulty, or used to fine-tune models for adaptive learning systems.

**Next Steps**  
- Add quality-filtering heuristics (remove duplicates, flag too-short responses).  
- Annotate domain and difficulty metadata for each pair.  
- Iterate for more rounds to grow dataset size and complexity.
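
The first next step could start from something like the sketch below. It is an assumption about how the filter might look, not part of the current pipeline: `normalize` and `filter_pairs` are hypothetical helpers, the 10-character threshold is arbitrary, and the second and third sample entries are made up for illustration.

```python
def normalize(text):
    """Lowercase and collapse whitespace so trivial variants compare equal."""
    return " ".join(text.lower().split())

def filter_pairs(items, min_response_chars=10):
    """Drop duplicate instructions and split off pairs with very short responses.

    min_response_chars is an illustrative threshold, not a tuned value.
    """
    seen = set()
    kept, flagged = [], []
    for item in items:
        key = normalize(item["instruction"])
        if key in seen:
            continue  # duplicate instruction after normalization
        seen.add(key)
        if len(item["response"].strip()) < min_response_chars:
            flagged.append(item)
        else:
            kept.append(item)
    return kept, flagged

sample = [
    {"instruction": "Find the perimeter of a square with side 8 cm.",
     "response": "Perimeter = 4 × 8 = 32 cm."},
    {"instruction": "find the perimeter of a square  with side 8 cm.",
     "response": "32 cm."},
    {"instruction": "What is the capital of France?", "response": "Paris."},
]
kept, flagged = filter_pairs(sample)
print(len(kept), len(flagged))  # 1 1
```

Short responses are flagged rather than deleted because a brief answer such as a single city name can still be correct; a reviewer or a second heuristic can decide what to do with them.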


### Run Metadata
- Date: October 24th, 2025  
- Model: `gemini-2.5-flash`  
- Endpoint: `v1beta REST API`  
- Iterations: 3 rounds × 5 new examples per round  
- Total Output Examples: 19 (4 seeds + 15 generated)  
- Output File: `outputs/self_instruct_dataset.json`  
- Temperature: API default (no `generationConfig` was set in the request)  
- Author: Prajna Penmetsa  