# Knowledge Distillation for Data Generation (MVP)

**Project:** Synthetic Data Creation: Survey and Synthesis  
**Method Group:** Language-Model and Cognitive Generation  
**Sub-method:** Knowledge Distillation for Data Generation  
**Author:** Prajna Penmetsa

**Goal:**  Generate a pseudo-labeled synthetic dataset using a **teacher–student framework**.  
- A large model (Gemini 2.5 Flash) acts as the teacher to annotate or label unlabeled examples, producing structured data pairs for downstream model training or evaluation.  
- This MVP demonstrates how knowledge distillation can serve as a data-generation method, creating instructional datasets without manual annotation.

In [1]:
from dotenv import load_dotenv
import os, json, requests, time

load_dotenv()
API_KEY = os.getenv("GEMINI_API_KEY")
if not API_KEY:
    raise ValueError("❌ GEMINI_API_KEY not found. Check your .env file.")
else:
    print("✅ GEMINI_API_KEY loaded successfully.")

MODEL = "gemini-2.5-flash"
GEMINI_URL = f"https://generativelanguage.googleapis.com/v1beta/models/{MODEL}:generateContent?key={API_KEY}"

os.makedirs("outputs", exist_ok=True)  # the dataset is saved directly under outputs/

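def call_gemini(prompt_text, timeout=60):
    """Send a prompt to Gemini and return the text output, or None on failure."""
    payload = {"contents": [{"parts": [{"text": prompt_text}]}]}
    try:
        r = requests.post(GEMINI_URL, json=payload, timeout=timeout)
    except requests.RequestException as e:
        print("❌ Request failed:", e)
        return None
    if not r.ok:
        print("❌ Error:", r.status_code, r.text)
        return None
    try:
        # Blocked or empty responses may lack candidates or parts.
        return r.json()["candidates"][0]["content"]["parts"][0]["text"]
    except (KeyError, IndexError, ValueError) as e:
        print("❌ Unexpected response shape:", repr(e))
        return None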

✅ GEMINI_API_KEY loaded successfully.
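
Calls to a hosted teacher model can fail transiently (rate limits, timeouts), and `call_gemini` signals failure by returning `None`. One way to make the labeling loop more robust is a retry wrapper with exponential backoff. The sketch below is an assumption layered on top of the notebook, not part of the original run: `with_retries`, its parameters, and the injectable `sleep` argument are hypothetical names.

```python
import time


def with_retries(fn, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call fn() until it returns a non-None value or attempts run out.

    Waits base_delay * 2**attempt seconds between attempts.
    Returns the first non-None result, or None if every attempt fails.
    """
    for attempt in range(max_attempts):
        result = fn()
        if result is not None:
            return result
        # Back off before the next attempt, but not after the last one.
        if attempt < max_attempts - 1:
            sleep(base_delay * (2 ** attempt))
    return None
```

In the labeling loop this could be used as `output = with_retries(lambda: call_gemini(prompt))`. Passing `sleep` as an argument keeps the helper easy to test without real delays.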


In [2]:
# Sample unlabeled inputs (could be any domain — e.g., math, science, grammar)
unlabeled_inputs = [
    "Convert 50°F to Celsius.",
    "Summarize the purpose of DNA in one line.",
    "Translate 'How are you?' to French.",
    "What is the formula for the area of a circle?",
    "Who wrote 'Romeo and Juliet'?",
    "What is the square root of 81?",
    "List two renewable energy sources.",
    "Define the term 'ecosystem'.",
    "Simplify the expression: 2(x + 3) + 4.",
    "What is the capital of Italy?"
]
print(f"Loaded {len(unlabeled_inputs)} unlabeled samples.")


Loaded 10 unlabeled samples.


In [5]:
pseudo_labeled = []

for i, sample in enumerate(unlabeled_inputs):
    prompt = f"You are a teacher model. Provide the correct answer or explanation for this input:\n\nInput: {sample}\n\nOutput:"
    output = call_gemini(prompt)
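    if output:
        pseudo_labeled.append({"input": sample, "label": output.strip()})
        print(f"✅ Processed {i+1}/{len(unlabeled_inputs)}")
    else:
        # Report failures instead of counting them as processed
        print(f"⚠️ No label for sample {i+1}/{len(unlabeled_inputs)}; skipped")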

# Save pseudo-labeled data
with open("outputs/pseudo_labeled_dataset.json", "w", encoding="utf-8") as f:
    json.dump(pseudo_labeled, f, indent=2, ensure_ascii=False)

print(f"\n✅ Saved pseudo-labeled dataset with {len(pseudo_labeled)} entries.")

✅ Processed 1/10
✅ Processed 2/10
✅ Processed 3/10
✅ Processed 4/10
✅ Processed 5/10
✅ Processed 6/10
✅ Processed 7/10
✅ Processed 8/10
✅ Processed 9/10
✅ Processed 10/10

✅ Saved pseudo-labeled dataset with 10 entries.
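
The saved JSON array is convenient to inspect, but many fine-tuning tools expect JSON Lines (one record per line). A minimal conversion sketch follows. The function name `records_to_jsonl` and the `prompt`/`completion` field names are assumptions chosen for illustration; the target trainer may expect different keys.

```python
import json


def records_to_jsonl(records, prompt_key="prompt", completion_key="completion"):
    """Serialize (input, label) records as JSON Lines text.

    Each record becomes one line; an empty list yields an empty string.
    """
    lines = [
        json.dumps(
            {prompt_key: rec["input"], completion_key: rec["label"]},
            ensure_ascii=False,  # keep accented characters readable
        )
        for rec in records
    ]
    return "\n".join(lines) + ("\n" if lines else "")
```

The returned string can be written to disk with `open(path, "w", encoding="utf-8")`, mirroring how the notebook saves its JSON file.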


### Quick Evaluation
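Checks:
- Structural validity (`input`, `label` fields present), verified by the cell below
- Completeness (no missing labels): failed or empty teacher calls are skipped during labeling, so every saved record carries a label
- Diversity of label types (definition, numeric, translation, factual), assessed by reading the labels rather than by code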

In [4]:
# Structural check: every record must carry both fields
valid = all("input" in ex and "label" in ex for ex in pseudo_labeled)
print("✅ Structure valid for all entries" if valid else "⚠️ Some entries missing fields")
print(f"Total examples: {len(pseudo_labeled)}")

# Example analysis (skipped for an empty dataset to avoid dividing by zero)
if pseudo_labeled:
    avg_label_len = sum(len(ex["label"]) for ex in pseudo_labeled) / len(pseudo_labeled)
    print(f"Average label length: {avg_label_len:.1f} characters")

✅ Structure valid for all entries
Total examples: 10
Average label length: 417.4 characters
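
The cell above checks only that both keys exist. A slightly stricter pass could also count empty labels and repeated inputs. The sketch below is one possible version; `check_records` and the report keys are hypothetical names, not part of the original notebook.

```python
def check_records(records):
    """Count structural and completeness problems in a pseudo-labeled dataset."""
    report = {
        "total": len(records),
        "missing_fields": 0,
        "empty_labels": 0,
        "duplicate_inputs": 0,
    }
    seen_inputs = set()
    for rec in records:
        # Records without both keys are counted and excluded from later checks.
        if "input" not in rec or "label" not in rec:
            report["missing_fields"] += 1
            continue
        # A label made only of whitespace counts as empty.
        if not str(rec["label"]).strip():
            report["empty_labels"] += 1
        if rec["input"] in seen_inputs:
            report["duplicate_inputs"] += 1
        seen_inputs.add(rec["input"])
    return report
```

Calling `check_records(pseudo_labeled)` returns a dictionary of counts, so a clean dataset shows zeros everywhere except `total`.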


### Observations & Reflection (From Generated Output)

The knowledge-distillation pipeline using **Gemini 2.5 Flash** produced a high-quality pseudo-labeled dataset where each unlabeled input received a rich, context-appropriate teacher response.

**1. Structural & Format Quality**  
- All records conformed to the (`input`, `label`) schema with clean JSON formatting.  
- No parsing or encoding errors occurred.  
- Labels often included Markdown or LaTeX-style math, demonstrating structured reasoning and stepwise explanation.

**2. Content Accuracy & Fidelity**  
- Mathematical conversions and simplifications were correct and explicitly justified (e.g., *50 °F → 10 °C*, *2(x + 3) + 4 → 2x + 10*).  
- Factual items such as capitals, definitions, and biological concepts were precise (*Rome is the capital of Italy*, *DNA stores genetic instructions*).  
- Multi-paragraph responses remained coherent and logically consistent.

**3. Pedagogical Depth**  
- Many labels contained step-by-step reasoning, formulas, and explanatory scaffolding — ideal for tutoring or training smaller student models.  
- Scientific and mathematical topics were presented in didactic form, reflecting genuine “teacher model” behavior.

**4. Diversity & Domain Spread**  
- Covered diverse categories:  
  - **Mathematics:** algebra, arithmetic, geometry, unit conversion.  
  - **Science:** biology (DNA, ecosystem), renewable energy.  
  - **Language:** translation.  
  - **General Knowledge:** geography, literature.  
- Showed that the teacher model handled every subject type present in this small sample.

**5. Evaluation Summary**

| Metric | Observation |
|:--|:--|
| Structural fidelity | Excellent – consistent JSON with no malformed entries |
| Content accuracy | Very high – all verified examples were correct |
| Pedagogical clarity | Excellent – detailed stepwise explanations |
| Domain diversity | Strong – multi-subject coverage |
| Label richness | High – blend of concise and elaborated answers |

**6. Overall Insight**  
The experiment suggests that knowledge-distillation-based data generation can yield **high-fidelity, pedagogically rich pseudo-labels** suitable for student-model fine-tuning or educational middleware, although a 10-sample run is too small to establish this in general.  
The teacher model effectively acts as an automated annotator, generating both factual and reasoning-based supervision data.

**Next Steps**  
- Normalize label formatting (e.g., Markdown/LaTeX to plain text if required).  
- Quantitatively compare teacher output depth vs. prompt length.  
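- Introduce a small student model to learn from these labels and measure downstream performance (e.g., a compact sequence-to-sequence model such as T5-small; an encoder-only model such as DistilBERT would first need the free-text labels mapped to classes).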
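
The first two next steps can be prototyped with a few lines of standard-library Python. The sketch below is a heuristic under stated assumptions: `strip_markup` removes only a handful of common Markdown and LaTeX delimiters (headings, bold, inline code, `$...$`) and is lossy (for example, it would also strip dollar signs used as currency), and `length_table` uses word counts as a crude proxy for "depth". Both function names are hypothetical.

```python
import re


def strip_markup(text):
    """Roughly convert Markdown/LaTeX-flavored text to plain text (heuristic, lossy)."""
    # Remove leading '#' heading markers at the start of each line.
    text = re.sub(r"^[ ]{0,3}#{1,6}[ \t]+", "", text, flags=re.MULTILINE)
    # Unwrap **bold** and __bold__.
    text = re.sub(r"(\*\*|__)(.+?)\1", r"\2", text)
    # Unwrap `inline code`.
    text = re.sub(r"`([^`\n]+)`", r"\1", text)
    # Drop $...$ and $$...$$ math delimiters, keeping the contents.
    text = re.sub(r"\${1,2}([^$\n]+)\${1,2}", r"\1", text)
    return text.strip()


def length_table(records):
    """Return (input_words, label_words) per record for a depth-vs-prompt comparison."""
    return [(len(r["input"].split()), len(r["label"].split())) for r in records]
```

For instance, `strip_markup("**Rome** is the capital of `Italy`.")` yields `Rome is the capital of Italy.`, and the pairs from `length_table(pseudo_labeled)` can be plotted or summarized to see whether longer prompts draw longer teacher answers.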

### Run Metadata
- Date: October 25th, 2025 
- Model: `gemini-2.5-flash`  
- Endpoint: `v1beta REST API`  
- Input Samples: 10  
- Teacher Prompts: Single-turn labeling per input  
- Output File: `outputs/pseudo_labeled_dataset.json`  
- Temperature: API default (not set explicitly in the request)  
- Author: Prajna Penmetsa  
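
Because the run relied on the default sampling temperature, repeated runs may produce different labels. For more repeatable pseudo-labels, the request body could pin the temperature explicitly. The sketch below assumes the `generationConfig` object with `temperature` and `maxOutputTokens` fields as used by the `generateContent` REST endpoint; `build_payload` is a hypothetical helper, and the field names should be checked against the current API reference before use.

```python
def build_payload(prompt_text, temperature=0.0, max_output_tokens=None):
    """Build a generateContent request body with an explicit sampling temperature."""
    generation_config = {"temperature": temperature}
    # Only include the token cap when the caller asks for one.
    if max_output_tokens is not None:
        generation_config["maxOutputTokens"] = max_output_tokens
    return {
        "contents": [{"parts": [{"text": prompt_text}]}],
        "generationConfig": generation_config,
    }
```

The result could replace the hand-written `payload` dictionary inside `call_gemini`, and the chosen temperature could then be recorded in the run metadata.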