# MCQ Generator 

This notebook generates **high-quality multiple-choice questions (MCQs)** from PDF documents via **LangChain** using:
- **openai/gpt-oss-120b** for MCQ generation + scoring
- **Sentence Transformers** for fast semantic filtering
- **Async + batching** for speed
- **Early stopping** so generation halts once enough MCQs are collected


## Features
- Input: PDF document
- Scalable chunking of text
- MCQ generation (1 correct + 3–4 distractors)
- SentenceTransformer-based semantic filtering (saves API cost)
- Early stopping (don’t process all chunks unnecessarily)
- LLM-based scoring (batched for speed)
- Difficulty management (`easy`, `medium`, `hard`)
- Output: JSON file with all generated MCQs


### Import

In [1]:

import os
import json
import logging
from tqdm import tqdm
from dotenv import load_dotenv
from concurrent.futures import ThreadPoolExecutor
import asyncio

from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

from langchain_community.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_groq import ChatGroq

from langchain.chains import LLMChain
from langchain.prompts import PromptTemplate




### Logging and API Key

In [2]:

logging.basicConfig(level=logging.INFO, format="%(asctime)s | %(levelname)s | %(message)s")

# Load env vars from .env in project root
load_dotenv()

# Read GROQ_API_KEY (used by langchain-groq) from the environment / .env file
groq_api_key = os.getenv("GROQ_API_KEY")

if not groq_api_key:
    # Assigning None to os.environ raises TypeError, so only warn here
    logging.warning("GROQ_API_KEY is not set. Create a .env file with GROQ_API_KEY=your_key")
else:
    os.environ["GROQ_API_KEY"] = groq_api_key
    logging.info("GROQ_API_KEY loaded from .env")

2025-09-07 17:25:21,923 | INFO | GROQ_API_KEY loaded from .env


### Configuration

In [3]:

cfg = {
    "pdf_path": "notes.pdf",        # your PDF data
    "max_questions": 10,            # final cap after filtering and removing duplicate questions
    "chunk_size": 500,             
    "chunk_overlap": 50,         
    "model": "openai/gpt-oss-120b",
    "temperature": 0.6,
    "batch_size": 5,                # MCQs per scoring request (batched)
    "max_workers": 5,               # parallel LLM calls for generation
    "skip_pages": 5,                # Number of pages to skip from the start (e.g., TOC)

    # Difficulty MCQ set (Bloom’s Taxonomy)
    "difficulty_distribution": {
        "easy": 2,
        "medium": 4,
        "hard": 4
    }
}
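
In the pipeline below, `main()` requests the per-difficulty counts from `difficulty_distribution` and never reads `max_questions`, so the two settings can drift apart. A minimal sketch of a consistency check, assuming the counts are meant to add up to the cap (`check_distribution` is a hypothetical helper, not part of the notebook):

```python
# Local copy of the relevant settings so the snippet runs on its own
cfg = {
    "max_questions": 10,
    "difficulty_distribution": {"easy": 2, "medium": 4, "hard": 4},
}


def check_distribution(config: dict) -> int:
    """Hypothetical helper: return the total number of requested MCQs and
    raise if it does not match config["max_questions"]."""
    total = sum(config["difficulty_distribution"].values())
    if total != config["max_questions"]:
        raise ValueError(
            f"difficulty_distribution sums to {total}, "
            f"but max_questions is {config['max_questions']}"
        )
    return total


total = check_distribution(cfg)
print(f"Requesting {total} MCQs in total")
```

Whether a mismatch should raise or only log a warning is a design choice; raising keeps a misconfigured run from spending API calls.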


###  Load and Chunk PDF

In [4]:

def load_and_chunk_pdf(path, chunk_size, overlap, skip_pages=0):
    """
    Load a PDF and split it into overlapping text chunks, skipping unwanted
    leading pages (e.g., the table of contents) so that MCQs come only from
    Chapter One onward.
    
    Args:
        path(str): Path to PDF file.
        chunk_size(int): Maximum characters per chunk.
        overlap(int): Number of characters to overlap between chunks.
        skip_pages (int): Number of initial pages to skip.


    Returns:
        List of chunked documents.
    """

    loader = PyPDFLoader(path)
    docs = loader.load()

    # Drop the first `skip_pages` pages. metadata["page"] is 0-based, so
    # skip_pages=5 drops pages 0–4 and keeps the sixth page onward.
    docs = [doc for doc in docs if doc.metadata["page"] >= skip_pages]

    splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size, chunk_overlap=overlap
    )
    chunks = splitter.split_documents(docs)
    logging.info(f"Loaded {len(chunks)} chunks from {path} (skipped first {skip_pages} pages)")
    return chunks
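
To make the roles of `chunk_size` and `chunk_overlap` concrete, here is a simplified sketch using fixed-size character windows. This is an illustration only: `RecursiveCharacterTextSplitter` prefers to break on separators such as paragraphs and sentences, so its real chunk boundaries will differ. `chunk_text` is a hypothetical helper.

```python
def chunk_text(text: str, chunk_size: int, overlap: int) -> list[str]:
    """Split text into character windows of at most chunk_size characters,
    where each window repeats the last `overlap` characters of the previous one."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window already reaches the end of the text
        if start + chunk_size >= len(text):
            break
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
pieces = chunk_text(sample, chunk_size=10, overlap=3)
for piece in pieces:
    print(piece)
```

The overlap means a sentence cut at one chunk boundary still appears whole, or nearly whole, in the neighboring chunk, at the cost of more chunks to process.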


### LLM Setup (Prompt Generation)

In [5]:

# Difficulty hints guide LLM to produce appropriate question types
DIFFICULTY_HINT = {
    "easy": "Recall and understanding; avoid tricky distractors.",
    "medium": "Application and analysis; include plausible but incorrect distractors.",
    "hard": "Evaluation and synthesis; nuanced distractors testing deeper understanding."
}

# Initialize the Groq-hosted LLM via the LangChain wrapper
llm = ChatGroq(model=cfg["model"], temperature=cfg["temperature"])

# Prompt: instruct the model to return strict JSON only
mcq_prompt = PromptTemplate(
    input_variables=["context", "difficulty_hint"],
    template="""
You are an expert educator. Generate up to 3 multiple-choice questions from this text:

{context}

Difficulty instruction: {difficulty_hint}

Return STRICT JSON:
[
  {{
    "question": "string",
    "choices": ["string", "string", "string", "string"],
    "answer": "string",
    "explanation": "string"
  }}
]
Rules:
- One correct answer from the text
- 3–4 plausible distractors (no 'All/None of the above')
- The choices should start with A., B., C., and D.
- Answer must appear in choices
- Avoid True/False
- Return JSON only (no extra text)
"""
)

mcq_chain = LLMChain(llm=llm, prompt=mcq_prompt)



### Validation & Deduplication Helpers

In [6]:

def validate_item(it):
    """
    Basic validation of an MCQ dict.

    Returns:
        bool: True if the item has required fields and is sane.
    """
    
    if not all(k in it for k in ["question", "choices", "answer", "explanation"]):
        return False
    if not isinstance(it["choices"], list) or len(it["choices"]) < 3:
        return False
    if it["answer"] not in it["choices"]:
        return False
    # basic text sanity
    if not it["question"].strip():
        return False
    return True

def dedupe_items(items):
    """
    Remove duplicate questions by normalized question text.

    Args:
        items (list): list of MCQ dicts

    Returns:
        list: deduplicated list preserving first occurrences
    """
    seen, out = set(), []
    for it in items:
        q = " ".join(it["question"].split()).strip().lower()
        if q not in seen:
            seen.add(q)
            out.append(it)
    return out


### Semantic Filter (SentenceTransformer Optimized)

In [7]:
# Initialize sentence Transformer
sent_embed = SentenceTransformer("all-MiniLM-L6-v2")

def semantic_filter(items, context, threshold=0.25):
    """
    Keep MCQs where at least one choice is semantically related to the context.
    Uses sentence-transformer embeddings.

    Args:
        items (list): MCQ dicts
        context (str): chunk text
        threshold (float): cosine similarity threshold (0-1)

    Returns:
        list: filtered MCQs
        
    """
    if not items:
        return items

    context_vec = sent_embed.encode([context])[0]
    out = []
    for it in items:
        # evaluate all choices vs context; allow one sufficiently-related choice
        choice_vecs = sent_embed.encode(it["choices"])
        sims = cosine_similarity([context_vec], choice_vecs).flatten()
        if max(sims) >= threshold:
            out.append(it)
    return out


2025-09-07 17:25:22,206 | INFO | Use pytorch device_name: cpu
2025-09-07 17:25:22,210 | INFO | Load pretrained SentenceTransformer: all-MiniLM-L6-v2
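
The keep/drop rule in `semantic_filter` can be illustrated without loading a model. The sketch below uses toy 2-D vectors in place of real sentence embeddings and a plain-Python cosine similarity; `cosine` and `passes_filter` are hypothetical names used only for this illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def passes_filter(context_vec, choice_vecs, threshold=0.25):
    """Keep an MCQ if its best-matching choice reaches the threshold."""
    best = max((cosine(context_vec, c) for c in choice_vecs), default=0.0)
    return best >= threshold


context_vec = [1.0, 0.0]
related = [[0.0, 1.0], [1.0, 1.0]]     # second choice is 45° from the context
unrelated = [[0.0, 1.0], [-1.0, 0.0]]  # orthogonal and opposite to the context

print(passes_filter(context_vec, related))
print(passes_filter(context_vec, unrelated))
```

Because only the single best choice has to clear the threshold, one on-topic option is enough to keep a question; raising `threshold` makes the filter stricter.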


### Parallel MCQ Generation (with Early Stopping)

In [8]:
async def generate_mcqs_async(chunks, difficulty, max_questions):
    """
    Generate MCQs in parallel across chunks using a thread pool.
    Stops early once max_questions is reached.
    """
    # Inside a coroutine, get_running_loop() is the recommended way to get the loop
    loop = asyncio.get_running_loop()
    collected = []

    def process_chunk(chunk):
        """
        Blocking processing for a single chunk 
        Returns a list of filtered MCQs for that chunk.
        """
        try:
            res = mcq_chain.run({
                "context": chunk.page_content,
                "difficulty_hint": DIFFICULTY_HINT[difficulty]
            })
            data = json.loads(res)
            if isinstance(data, dict):
                data = [data]
            data = [it for it in data if validate_item(it)]
            data =  semantic_filter(data, chunk.page_content)
            
            for it in data:
                it["difficulty"] = difficulty
            return data
        except Exception as e:
            logging.warning(f"MCQ generation failed: {e}")
            return []

    with ThreadPoolExecutor(max_workers=cfg["max_workers"]) as executor:
        # Process chunks in batches to allow early stopping
        for i in range(0, len(chunks), cfg["max_workers"]):
            batch = chunks[i:i + cfg["max_workers"]]
            tasks = [loop.run_in_executor(executor, process_chunk, c) for c in batch]
            results = await asyncio.gather(*tasks)
            for sublist in results:
                for q in sublist:
                    if len(collected) < max_questions:
                        collected.append(q)
                    else:
                        logging.info(f"Reached {max_questions} MCQs. Stopping early.")
                        return dedupe_items(collected)

    return dedupe_items(collected)
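
`process_chunk` calls `json.loads` directly on the model reply, so a reply wrapped in a Markdown code fence raises an exception and that chunk yields no MCQs. A minimal sketch of a more tolerant parser, assuming the only wrapping to handle is a single surrounding fence (`extract_json_list` is a hypothetical helper, not part of the notebook):

```python
import json
import re

FENCE = "`" * 3  # three backticks, built this way to keep the example readable

# Matches a reply that is entirely wrapped in one code fence, optionally tagged "json"
FENCED_REPLY = re.compile(rf"^{FENCE}(?:json)?\s*(.*?)\s*{FENCE}$", flags=re.DOTALL)


def extract_json_list(raw: str) -> list:
    """Parse a model reply into a list, tolerating a surrounding code fence."""
    text = raw.strip()
    fenced = FENCED_REPLY.match(text)
    if fenced:
        text = fenced.group(1)
    data = json.loads(text)
    # Mirror the notebook's behavior: a single object becomes a one-item list
    return [data] if isinstance(data, dict) else data


wrapped = FENCE + 'json\n[{"question": "Q?"}]\n' + FENCE
print(extract_json_list(wrapped))
print(extract_json_list('{"question": "Q?"}'))
```

Replies that contain prose around the JSON would still fail to parse; those continue to fall through to the existing `except` branch.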


### Batch Scoring

In [9]:
# Scoring prompt: LLM rates multiple MCQs in a single call and returns JSON
scoring_prompt = PromptTemplate(
    input_variables=["questions"],
    template="""
You are grading multiple-choice questions for clarity, quality, and challenge.

For each MCQ, assign an integer score 1–5:
1 = poor (unclear, trivial, incorrect)
3 = acceptable (clear, somewhat useful, minor issues)
5 = excellent (clear, challenging, plausible distractors, grounded)

Return STRICT JSON (no commentary):
[
  {{"question": "...", "score": 1}},
  ...
]

MCQs to score (JSON list):
{questions}
"""
)
scoring_chain = LLMChain(llm=llm, prompt=scoring_prompt)


def batch_score(items, batch_size=5):
    """
    Score MCQs in batches to reduce the number of LLM calls.
    Scores are merged back into the items by matching 'question' text.

    Args:
        items (list): MCQ dicts
        batch_size (int): number of MCQs per scoring call

    Returns:
        list: MCQs with 'score' field added
    """
    if not items:
        return items

    scored = []
    for i in tqdm(range(0, len(items), batch_size), desc="Scoring batches"):
        batch = items[i:i+batch_size]
        try:
            res = scoring_chain.run({"questions": json.dumps(batch)})
            scores = json.loads(res)
            # Merge scores back by question text
            for it in batch:
                match = next((s for s in scores if s.get("question") == it["question"]), None)
                it["score"] = int(match.get("score", 1)) if match else 1
            scored.extend(batch)
        except Exception as e:
            logging.warning(f"Batch scoring failed, defaulting scores to 1: {e}")
            for it in batch:
                it["score"] = 1
            scored.extend(batch)
    return scored
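
`batch_score` matches scores to items by exact question text, so a question echoed back with different spacing or capitalization falls through to the default score of 1. One possible hardening, sketched here under the assumption that whitespace and case differences should be ignored, reuses the normalization from `dedupe_items` and clamps scores to the 1–5 range (`normalize` and `merge_scores` are hypothetical helpers):

```python
def normalize(text: str) -> str:
    """Collapse whitespace and lowercase, as dedupe_items does for questions."""
    return " ".join(text.split()).lower()


def merge_scores(batch, scores, default=1):
    """Attach a 1-5 integer 'score' to each item, matching on normalized question text."""
    by_question = {
        normalize(s["question"]): s.get("score", default)
        for s in scores
        if isinstance(s, dict) and isinstance(s.get("question"), str)
    }
    for item in batch:
        raw = by_question.get(normalize(item["question"]), default)
        try:
            score = int(raw)
        except (TypeError, ValueError):
            score = default
        item["score"] = min(5, max(1, score))  # clamp out-of-range values
    return batch


batch = [{"question": "What is  X?"}, {"question": "Unmatched?"}]
scores = [{"question": "what is x?", "score": "4"}, {"question": "Other", "score": 9}]
for item in merge_scores(batch, scores):
    print(item)
```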


### Save output

In [10]:

def save_mcqs(items, out_path):
    """
    Save MCQs to a JSON file.
    
    Args:
        items(list): List of MCQ dicts.
        out_path(str): File path to saved output in JSON.
    """
    with open(out_path, "w", encoding="utf-8") as f:
        json.dump(items, f, indent=2, ensure_ascii=False)
    logging.info(f"Saved {len(items)} MCQs to {out_path}")


### Run Pipeline

In [11]:
async def main():
    """
    Orchestrator: load chunks, generate MCQs with early stopping, score them in batches,
    sort by score, and save the output. Generate balanced MCQs across difficulties
    (based on Bloom’s Taxonomy).
    """
    chunks = load_and_chunk_pdf(
        cfg["pdf_path"], 
        cfg["chunk_size"], 
        cfg["chunk_overlap"],
        cfg["skip_pages"]
    )

    all_mcqs = []
    # Loop over requested difficulty distribution
    for diff, num_q in cfg["difficulty_distribution"].items():
        logging.info(f"Generating {num_q} {diff} questions...")
        subset = await generate_mcqs_async(chunks, diff, num_q)
        all_mcqs.extend(subset)

    # Score and sort combined questions
    all_mcqs = batch_score(all_mcqs, cfg["batch_size"])
    all_mcqs = sorted(all_mcqs, key=lambda x: x.get("score", 0), reverse=True)
    
    save_mcqs(all_mcqs, "generated_mcqs.json")
    logging.info(f"Generated {len(all_mcqs)} balanced MCQs across difficulties.")

await main()



2025-09-07 17:25:44,139 | INFO | Loaded 728 chunks from notes.pdf (skipping pages 5 pages)
2025-09-07 17:25:44,139 | INFO | Generating 2 easy questions...
2025-09-07 17:25:48,408 | INFO | HTTP Request: POST https://api.groq.com/openai/v1/chat/completions "HTTP/1.1 200 OK"
2025-09-07 17:25:48,413 | INFO | HTTP Request: POST https://api.groq.com/openai/v1/chat/completions "HTTP/1.1 200 OK"
2025-09-07 17:25:48,415 | INFO | HTTP Request: POST https://api.groq.com/openai/v1/chat/completions "HTTP/1.1 200 OK"


In [12]:
def preview_mcqs(items, n=5):
    """
    Print a preview of the top N MCQs with their score and difficulty.

    Args:
        items (list): List of MCQ dicts
        n (int): Number of MCQs to preview
    """
    for i, it in enumerate(items[:n], 1):
        print(f"\nQ{i}: {it['question']} (Score: {it.get('score', 'N/A')}, Difficulty: {it.get('difficulty', 'N/A')})")
        for idx, choice in enumerate(it['choices'], 1):
            marker = "✅" if choice == it["answer"] else " "
            print(f"   {idx}. {choice} {marker}")
        print(f"Explanation: {it['explanation']}")

# Load JSON and preview sample questions
with open("generated_mcqs.json", "r", encoding="utf-8") as f:
    mcqs = json.load(f)

preview_mcqs(mcqs, n=5)



Q1: According to the text, what is the rectangular (Cartesian) form of the phasor at 45°? (Score: 4, Difficulty: easy)
   1. A. √2/2 + j √2/2 ✅
   2. B. 1 + j0  
   3. C. 1/2 + j √3/2  
   4. D. -√2/2 + j √2/2  
Explanation: The table lists the 45° phasor as ε₄ (1↑2 + j 1↑2), which corresponds to cos 45° = √2/2 and sin 45° = √2/2, giving √2/2 + j √2/2.

Q2: Given the values in Table 1.1, what is the exact value of \(\tan\frac{\pi}{6}\)? (Score: 4, Difficulty: medium)
   1. A. \(\sqrt{3}\)  
   2. B. \(\frac{1}{\sqrt{3}}\)  
   3. C. \(\frac{\sqrt{3}}{3}\) ✅
   4. D. \(\frac{1}{2}\)  
Explanation: From the table, \(\sin\frac{\pi}{6}=\frac{1}{2}\) and \(\cos\frac{\pi}{6}=\frac{\sqrt{3}}{2}\). Using \(\tan\theta=\frac{\sin\theta}{\cos\theta}\), \(\tan\frac{\pi}{6}=\frac{\frac{1}{2}}{\frac{\sqrt{3}}{2}}=\frac{1}{\sqrt{3}}=\frac{\sqrt{3}}{3}\).

Q3: Given a = 3 and b = 4, what are the magnitude r and angle ϖ according to the formulas r = √(a² + b²) and ϖ = tan⁻¹(b/a)? (Score: 4, Difficulty

### ⚡ Performance Notes: Async & Batching

This pipeline is optimized for speed using:

1. **Async MCQ Generation**  
   - Multiple chunks are processed in parallel using `ThreadPoolExecutor` + `asyncio`.  
   - Configurable via `cfg["max_workers"]`.  
   - Example: `max_workers=5` → 5 LLM calls at once.  

2. **Batch Scoring**  
   - Instead of scoring one MCQ at a time, MCQs are grouped into batches.  
   - Reduces API calls significantly.  
   - Configurable via `cfg["batch_size"]`.  
   - Example: `batch_size=5` → 5 MCQs scored per LLM call.  

3. **Sentence Transformer Filtering**  
   - Semantic filtering is done **locally** with `all-MiniLM-L6-v2`.  
   - This avoids extra API calls and speeds up processing.  

4. **Early stopping**
   - Halts as soon as `max_questions` are generated.  
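
The interaction between the thread pool and early stopping can be seen in isolation with a stand-in for the LLM call. This is a simplified sketch, not the notebook's function: `fake_llm_call` and `collect` are hypothetical, and each fake call simply returns two placeholder questions.

```python
import asyncio
import time
from concurrent.futures import ThreadPoolExecutor


def fake_llm_call(chunk_id: int) -> list[str]:
    """Stand-in for a blocking LLM request; returns two placeholder questions."""
    time.sleep(0.05)
    return [f"q{chunk_id}-a", f"q{chunk_id}-b"]


async def collect(chunk_ids, max_questions, max_workers=5):
    """Process chunks max_workers at a time and stop once enough questions exist."""
    loop = asyncio.get_running_loop()
    collected, processed = [], 0
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        for i in range(0, len(chunk_ids), max_workers):
            batch = chunk_ids[i:i + max_workers]
            # Run the blocking calls for this batch concurrently in worker threads
            results = await asyncio.gather(
                *(loop.run_in_executor(pool, fake_llm_call, c) for c in batch)
            )
            processed += len(batch)
            for questions in results:
                collected.extend(questions)
            if len(collected) >= max_questions:
                break  # early stop: remaining chunks are never submitted
    return collected[:max_questions], processed


# In a notebook cell, use `await collect(...)` instead of asyncio.run(...)
questions, processed = asyncio.run(collect(list(range(20)), max_questions=7))
print(len(questions), "questions from", processed, "of 20 chunks")
```

Early stopping is checked between batches, so up to `max_workers` calls may already be in flight when the target is reached; the surplus results are discarded.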