# Prompt-Based Synthetic Data Generation (MVP)

**Project:** Synthetic Data Creation: Survey and Synthesis  
**Method Group:** Language-Model and Cognitive Generation  
**Sub-method:** Prompt-Based Generation  
**Author:** Prajna Penmetsa 

**Goal:**  
Use the Gemini API to generate small synthetic learning datasets (e.g., quiz questions, student responses, or explanations) that demonstrate prompt-based synthetic data creation for learning middleware.


In [18]:
from dotenv import load_dotenv
import os, json, requests, time

# Load .env file (should contain: GEMINI_API_KEY=<your_key>)
load_dotenv()

API_KEY = os.getenv("GEMINI_API_KEY")
if not API_KEY:
    raise ValueError("❌ GEMINI_API_KEY not found. Please check .env file.")
else:
    print("✅ GEMINI_API_KEY loaded successfully.")

# --- Model + Endpoint (latest available for your key) ---
MODEL = "gemini-2.5-flash"
GEMINI_URL = f"https://generativelanguage.googleapis.com/v1beta/models/{MODEL}:generateContent?key={API_KEY}"

# Output directory
os.makedirs("outputs", exist_ok=True)

✅ GEMINI_API_KEY loaded successfully.


In [19]:
payload = {"contents": [{"parts": [{"text": "Say hello in one sentence."}]}]}

t0 = time.time()
r = requests.post(GEMINI_URL, json=payload, timeout=30)  # fail instead of hanging on network issues
print("Status:", r.status_code, "| Time:", round(time.time() - t0, 2), "s")

if r.ok:
    data = r.json()
    text = data["candidates"][0]["content"]["parts"][0]["text"]
    print("✅ Connected successfully.\nResponse:", text)
else:
    print("❌ Error:", r.text)

Status: 200 | Time: 2.24 s
✅ Connected successfully.
Response: Hello there!


### Prompt Design

We will generate **10 math word problems** for middle school students, each with:

1. A question text  
2. Four options labeled A, B, C, and D  
3. The correct answer key (A/B/C/D)  

The model will output a **valid JSON list** for easy parsing and benchmarking.

In [20]:
prompt_text = """
You are an educational content creator.
Generate 10 math word problems for middle school students.
Each problem must include:
1. The question text.
2. Four answer options labeled A, B, C, and D.
3. The correct answer key (A/B/C/D).
Return only the JSON array. No explanations, markdown, or commentary.
Format exactly like this example:

[
  {
    "question": "A train travels 60 km in 1.5 hours. What is its speed?",
    "options": {"A": "30 km/h", "B": "40 km/h", "C": "50 km/h", "D": "60 km/h"},
    "answer": "B"
  }
]
Now generate 10 new, unique questions.
"""

payload = {"contents": [{"parts": [{"text": prompt_text}]}]}

t0 = time.time()
r = requests.post(GEMINI_URL, json=payload, timeout=120)  # generation can take well over 10 s
print("Status:", r.status_code, "| Time:", round(time.time() - t0, 2), "s")

if not r.ok:
    raise RuntimeError(f"❌ Generation failed: {r.text}")

generated = r.json()["candidates"][0]["content"]["parts"][0]["text"]
print(generated[:500])  # preview first 500 chars

Status: 200 | Time: 13.84 s
[
  {
    "question": "Sarah bought a new bicycle that was originally priced at $240. If it was on sale for 25% off, what was the discounted price of the bicycle?",
    "options": {"A": "$180", "B": "$60", "C": "$160", "D": "$200"},
    "answer": "A"
  },
  {
    "question": "A baker used 3/4 cup of flour for one recipe and 1/2 cup of flour for another recipe. How much flour did the baker use in total?",
    "options": {"A": "1 cup", "B": "1 1/4 cups", "C": "1 1/2 cups", "D": "3/8 cup"},
    "an


In [21]:
import re

# Save raw output
raw_path = "outputs/synthetic_questions_raw.txt"
with open(raw_path, "w") as f:
    f.write(generated)
print(f"✅ Saved raw output to {raw_path}")

# --- Robust JSON extraction ---
json_pattern = re.compile(r"\[.*\]", re.DOTALL)
match = json_pattern.search(generated)

if match:
    json_text = match.group(0)
    try:
        data = json.loads(json_text)
        print("✅ JSON parsed successfully. Sample:")
        for i, q in enumerate(data[:2]):
            print(f"{i+1}. {q['question']} → {q['answer']}")
        with open("outputs/synthetic_questions.json", "w") as f:
            json.dump(data, f, indent=2)
        print("✅ Saved clean JSON to outputs/synthetic_questions.json")
    except Exception as e:
        print("⚠️ Parsing failed after extraction:", e)
else:
    print("❌ No JSON-like block detected in model output.")

✅ Saved raw output to outputs/synthetic_questions_raw.txt
✅ JSON parsed successfully. Sample:
1. Sarah bought a new bicycle that was originally priced at $240. If it was on sale for 25% off, what was the discounted price of the bicycle? → A
2. A baker used 3/4 cup of flour for one recipe and 1/2 cup of flour for another recipe. How much flour did the baker use in total? → B
✅ Saved clean JSON to outputs/synthetic_questions.json


### Quick Evaluation Checklist
We’ll check for:
- Structural validity (question, options, answer present)
- Diversity of questions
- Coherence and logical correctness (manual inspection)
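
The diversity item in this checklist can be approximated automatically. The following is a minimal sketch, not part of the original run: `diversity_report` and the three `sample` items are hypothetical, and the only assumption about the data is that each item is a dict with `question` and `answer` keys, as in the generated JSON.

```python
from collections import Counter

def diversity_report(items):
    """Summarize duplicate questions and answer-key balance for a list of question dicts."""
    # Normalize case and whitespace so trivial differences do not hide duplicates
    normalized = [" ".join(q["question"].lower().split()) for q in items]
    counts = Counter(normalized)
    duplicates = [text for text, n in counts.items() if n > 1]
    answer_counts = Counter(q["answer"] for q in items)
    return {
        "total": len(items),
        "unique_questions": len(counts),
        "duplicates": duplicates,
        "answer_distribution": dict(answer_counts),
    }

# Illustrative items only (not taken from the model output)
sample = [
    {"question": "What is 2 + 3?", "options": {"A": "4", "B": "5", "C": "6", "D": "7"}, "answer": "B"},
    {"question": "what is  2 + 3?", "options": {"A": "4", "B": "5", "C": "6", "D": "7"}, "answer": "B"},
    {"question": "What is 10 - 4?", "options": {"A": "6", "B": "5", "C": "4", "D": "3"}, "answer": "A"},
]
report = diversity_report(sample)
print(report)
```

Exact-match deduplication only catches near-verbatim repeats. Paraphrased duplicates would still need manual review or a similarity measure.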

In [23]:
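# `data` is also assigned in the connectivity test, so confirm it is the parsed question list
if isinstance(locals().get("data"), list):
    required = {"question", "options", "answer"}
    valid = all(isinstance(q, dict) and required <= q.keys() for q in data)
    print("✅ Structure valid for all entries" if valid else "⚠️ Some entries missing fields")
    print(f"Total generated questions: {len(data)}")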

✅ Structure valid for all entries
Total generated questions: 10


### Observations & Analysis (From Generated Output)

The prompt-based generation using **Gemini 2.5 Flash** produced ten math word problems that were:

**1. Coherent and Logically Sound**  
- On manual inspection, every question had correct numerical reasoning (percentages, ratios, averages, etc.).  
- The answer key matched the correct option in each question.  
- Example: *“Sarah bought a bicycle for $240 at 25% off” → $240 × 0.75 = $180, which is option A.*
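
The arithmetic behind the two sample items printed earlier can be re-checked in a few lines. This is a small sketch using exact fractions. The variable names are illustrative, and it covers only those two items, not the full set of ten.

```python
from fractions import Fraction

# Item 1: bicycle priced at $240, on sale for 25% off
discounted = Fraction(240) * (1 - Fraction(25, 100))

# Item 2: 3/4 cup of flour plus 1/2 cup of flour
total_flour = Fraction(3, 4) + Fraction(1, 2)

print(discounted)   # 180, matching option A ("$180")
print(total_flour)  # 5/4, i.e. 1 1/4 cups, matching option B
```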

**2. Diverse in Topic Coverage**  
- Covered a range of middle-school math skills:  
  - Arithmetic (addition, ratios, averages)  
  - Geometry (area, volume)  
  - Algebraic reasoning (variable solving)  
  - Everyday math (scales, discounts, temperature changes)

**3. Pedagogically Appropriate**  
- Language was clear and grade-level appropriate.  
- Each problem involved realistic, interpretable scenarios.  
- Balanced use of units (km, °F, $, cups).

**4. Structural Integrity**  
- All entries followed the requested JSON format.  
- Each item contained the three requested fields: `question`, `options`, `answer`.  
- The regex extraction step isolated the JSON array, and `json.loads` parsed it without errors.
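
A stricter structural check than field presence could look like the sketch below. `validate_item`, `REQUIRED_FIELDS`, and `OPTION_LABELS` are hypothetical names introduced here. The rules (exactly three fields, option labels A to D, answer key among those labels) are taken from the prompt's stated format.

```python
REQUIRED_FIELDS = {"question", "options", "answer"}
OPTION_LABELS = {"A", "B", "C", "D"}

def validate_item(item):
    """Return a list of problems found in one question dict (an empty list means valid)."""
    if not isinstance(item, dict):
        return ["item is not a JSON object"]
    problems = []
    # The item must contain exactly the three requested fields
    if set(item) != REQUIRED_FIELDS:
        problems.append(f"unexpected field set: {sorted(item)}")
    # Options must be a dict keyed by exactly A, B, C, D
    options = item.get("options")
    if not isinstance(options, dict) or set(options) != OPTION_LABELS:
        problems.append("options must have exactly the labels A, B, C, D")
    # The answer key must point at one of the option labels
    if item.get("answer") not in OPTION_LABELS:
        problems.append("answer must be one of A, B, C, D")
    return problems

good = {
    "question": "Sarah bought a new bicycle that was originally priced at $240. "
                "If it was on sale for 25% off, what was the discounted price of the bicycle?",
    "options": {"A": "$180", "B": "$60", "C": "$160", "D": "$200"},
    "answer": "A",
}
bad = {"question": "x", "options": {"A": "1"}, "answer": "E"}

print(validate_item(good))
print(validate_item(bad))
```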

**5. Evaluation Summary**  
| Metric | Observation |
|:--|:--|
| Fidelity | High – no factual or logical inconsistencies detected. |
| Diversity | Good – topics spanned different math domains. |
| Structure | Excellent – fully consistent JSON output. |
| Pedagogical value | Strong – readable and relevant for learning middleware. |

**6. Next Steps / Improvements**  
- Introduce *metadata fields* (e.g., topic, difficulty).  
- Extend to multi-subject domains (science, reading comprehension).  
- Add automated evaluation scripts to check numerical accuracy.  
- Explore controlled prompt variants to generate graded difficulty levels.
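
As one possible starting point for the metadata and graded-difficulty ideas, the prompt could be built from parameters. This is only a sketch: `build_prompt`, the `topic` and `difficulty` field names, and the example values are assumptions, and it has not been run against the model.

```python
import json

def build_prompt(n_questions, topic, difficulty):
    """Build a generation prompt that also requests topic and difficulty metadata."""
    # One worked example showing the extended item format (60 km / 1.5 h = 40 km/h)
    example = [{
        "question": "A train travels 60 km in 1.5 hours. What is its speed?",
        "options": {"A": "30 km/h", "B": "40 km/h", "C": "50 km/h", "D": "60 km/h"},
        "answer": "B",
        "topic": "rates",
        "difficulty": "easy",
    }]
    return (
        "You are an educational content creator.\n"
        f"Generate {n_questions} math word problems for middle school students.\n"
        f"All problems must be about the topic '{topic}' at '{difficulty}' difficulty.\n"
        "Each problem must include the fields question, options (A-D), answer, topic, and difficulty.\n"
        "Return only the JSON array. No explanations, markdown, or commentary.\n"
        "Format exactly like this example:\n\n"
        f"{json.dumps(example, indent=2)}\n"
    )

prompt = build_prompt(5, "geometry", "medium")
print(prompt)
```

Calling the function once per (topic, difficulty) pair would give controlled prompt variants whose outputs can be compared side by side.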

---

**Conclusion:**  
This run suggests that prompt-based generation with Gemini 2.5 Flash can produce well-structured educational items from a single prompt.  
In this 10-item sample, the output met the fidelity, diversity, and format checks described above. Larger runs and automated accuracy checks are needed before relying on it in synthetic learning middleware pipelines.


### Run Metadata
- Date: October 24th, 2025  
- Model: `gemini-2.5-flash`  
- Endpoint: `v1beta REST API`  
- Temperature: API default (no `generationConfig` was set in the request)  
- Output Files:  
  - `outputs/synthetic_questions_raw.txt`  
  - `outputs/synthetic_questions.json`  
- Author: Prajna Penmetsa  
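
Because the run used default sampling settings, repeated runs will likely produce different questions. A follow-up run could pin the settings in the request body, as sketched below. `build_payload` is a hypothetical helper, and the `generationConfig` fields `temperature` and `responseMimeType` should be checked against the current API reference for the model in use. A low temperature reduces variation but does not guarantee identical output.

```python
def build_payload(prompt, temperature=0.0):
    """Build a generateContent request body with explicit sampling settings."""
    return {
        "contents": [{"parts": [{"text": prompt}]}],
        "generationConfig": {
            # Lower temperature makes sampling less random
            "temperature": temperature,
            # Assumption: asks the API to return raw JSON rather than prose or markdown
            "responseMimeType": "application/json",
        },
    }

payload = build_payload("Generate 10 math word problems as a JSON array.")
print(payload["generationConfig"])
```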
