📘 Step 1: Text Extraction and Structuring Pipeline
This notebook defines a pipeline that:

1. Extracts raw text from .txt, .docx, or .pdf files.

2. Cleans the raw text to remove formatting noise.

3. Identifies headers and segments the content into structured sections.

4. Outputs the result as a JSON file and previews the first few sections.

### Imports

In [2]:
import os
import re
import json
import fitz  # PyMuPDF
from docx import Document

### 📄 Text Extraction Functions for Different File Types

This section defines utility functions to extract raw text content from `.txt`, `.docx`, and `.pdf` files. These functions handle different formats and ensure the text is cleaned and returned as a unified string.

- **`extract_text_from_txt(path)`**  
  Opens a plain text file and reads its contents using UTF-8 encoding.

- **`extract_text_from_docx(path)`**  
  Uses `python-docx` to read paragraph texts from a Word document (.docx), removing empty lines.

- **`extract_text_from_pdf(path)`**  
  Uses `PyMuPDF` (imported as `fitz`) to extract text from each page of a PDF document. If reading fails, a warning is printed.

- **`load_text(file_path)`**  
  Detects the file type based on extension and calls the corresponding extraction function. Raises an error if the file format is unsupported.


These functions are essential for preprocessing text data from various file formats in the MCQ generation pipeline.


In [3]:
def extract_text_from_txt(path):
    with open(path, 'r', encoding='utf-8', errors='ignore') as file:
        return file.read()

def extract_text_from_docx(path):
    doc = Document(path)
    return '\n'.join(para.text.strip() for para in doc.paragraphs if para.text.strip())

def extract_text_from_pdf(path):
    text = ''
    try:
        with fitz.open(path) as doc:
            for page in doc:
                text += page.get_text()
    except Exception as e:
        print(f"Warning: Failed to read PDF '{path}': {e}")
    return text

def load_text(file_path):
    ext = os.path.splitext(file_path)[1].lower()
    loaders = {'.pdf': extract_text_from_pdf, '.docx': extract_text_from_docx, '.txt': extract_text_from_txt}
    if ext not in loaders:
        raise ValueError(f"Unsupported file format: {ext}")
    return loaders[ext](file_path)

### 🧹 Text Cleaning and Structuring Functions

This section provides utilities to clean raw extracted text and organize it into meaningful sections for downstream processing (e.g., question generation).

---

#### **`clean_raw_text(raw_text)`**
Performs basic cleaning on raw text:
- Removes page numbers and common bullet symbols.
- Normalizes spacing and newlines.
- Filters out OCR artifacts and control characters.
- Strips leading/trailing whitespace from each line.

---

#### **`is_probable_header(line)`**
Heuristic to detect section headers:
- Excludes list items (e.g., `1)`, `a)`).
- Detects lines ending in `:` or `-`.
- Detects mostly capitalized short lines or fully uppercase headings.

---

#### **`structure_text_to_sections(text)`**
Splits the text into sections based on probable headers and paragraph structure:
- Groups lines under detected headers.
- If no header is detected before content, assigns a default `"Document"` header.
- Ignores empty lines and formats content as paragraphs.

Returns a list of sections, each with:
```json
{
  "header": "Section Title",
  "content": "Associated paragraph text"
}


In [4]:
def clean_raw_text(raw_text):
    # Basic cleaning: remove page numbers, bullets, weird characters, multiple spaces, multiple newlines
    text = re.sub(r'\n\d+\n', '\n', raw_text)  # Remove isolated page numbers
    text = re.sub(r'[•●▪■\u2022\uf0b7]', '', text)  # Remove common bullets
    text = re.sub(r'[ \t]+', ' ', text)  # Replace multiple spaces/tabs with single space
    text = re.sub(r'\n{3,}', '\n\n', text)  # Limit newlines to max two

    # Remove obvious OCR garbage or corrupted words (customize as needed)
    text = re.sub(r'\bCo\s*i\s*ant\b', '', text, flags=re.IGNORECASE)
    text = re.sub(r'[\x00-\x08\x0B\x0C\x0E-\x1F]+', '', text)  # Remove control chars

    # Strip trailing/leading spaces on each line
    lines = [line.strip() for line in text.splitlines()]
    return "\n".join(lines)

def is_probable_header(line):
    line = line.strip()
    # Exclude numbered or lettered list items like '1) ...', 'a) ...', '2. ...'
    if re.match(r'^(\d+[\.\)]|[a-zA-Z][\.\)])\s', line):
        return False

    # Ends with colon or dash
    if re.search(r'[:\-]\s*$', line):
        return True

    # Mostly capitalized words, short line
    tokens = line.split()
    if 1 <= len(tokens) <= 15:
        upper_words = sum(1 for t in tokens if t and t[0].isupper())
        if upper_words / len(tokens) > 0.7:
            return True

    # Fully uppercase (ignore short lines)
    if len(line) > 3 and line.isupper():
        return True

    return False


def structure_text_to_sections(text):
    lines = text.splitlines()
    sections = []
    current_header = None
    current_content = []

    for idx, line in enumerate(lines):
        if not line.strip():
            # Empty line, consider as paragraph break - add current content if any
            if current_content:
                # Append paragraph (join with space)
                current_content.append('')  # Add paragraph break as empty string
            continue

        if is_probable_header(line):
            # Save previous section if exists
            if current_header or current_content:
                content_text = " ".join(p for p in current_content if p).strip()
                if current_header is None:
                    # No header before content? Use default header
                    current_header = "Document"
                sections.append({
                    "header": current_header,
                    "content": content_text
                })
                current_content = []
            current_header = line.rstrip(':-').strip()
        else:
            current_content.append(line.strip())

    # Add last section
    if current_header or current_content:
        content_text = " ".join(p for p in current_content if p).strip()
        if current_header is None:
            current_header = "Document"
        sections.append({
            "header": current_header,
            "content": content_text
        })

    # Remove empty content sections if any
    sections = [s for s in sections if s['content'].strip() != '']

    return sections

def save_json(data, path):
    with open(path, 'w', encoding='utf-8') as f:
        json.dump(data, f, ensure_ascii=False, indent=2)

### ⚙️ Document Preprocessing Pipeline

#### **`preprcess(filepath)`**

This is the main orchestration function for processing input files and transforming them into structured section-based JSON.

**Steps performed:**
1. **Validation**: Checks if the file exists; raises an error if not.
2. **Extraction**: Loads text using `load_text()` depending on the file type (.txt, .docx, .pdf).
3. **Cleaning**: Cleans the raw text using `clean_raw_text()` to remove noise, bullets, OCR artifacts, and normalize spacing.
4. **Structuring**: Organizes the cleaned text into logical sections using `structure_text_to_sections()`.
5. **Saving**: Outputs the final structured data into `structured_output.json`.
6. **Preview**: Prints the first 3 sections of the structured result for verification.

**Output Example:**
```json
[
  {
    "header": "Introduction",
    "content": "This section provides an overview of the topic..."
  },
  ...
]


In [5]:
def preprcess(filepath):
    if not os.path.exists(filepath):
        raise FileNotFoundError(f"File '{filepath}' does not exist.")

    print(f"Processing file: {filepath}")

    raw_text = load_text(filepath)
    cleaned_text = clean_raw_text(raw_text)
    sections = structure_text_to_sections(cleaned_text)

    if not sections:
        print("⚠️ Warning: No sections found after processing.")
    else:
        print(f"✅ Found {len(sections)} sections.")

    # Save to JSON
    save_json(sections, "structured_output.json")
    print("Structured JSON saved to 'structured_output.json'")

    # Print first 3 sections as preview
    print("\n--- Sample output (first 3 sections) ---\n")
    print(json.dumps(sections[:3], ensure_ascii=False, indent=2))


📘 **Step 2: Sentence-Level Content Enrichment and Relevance Scoring**

This notebook defines a pipeline that:

1. **Loads** structured JSON output from Step 1.

2. **Segments** each section’s content into individual sentences using spaCy.

3. **Cleans** each sentence by removing numbering, bullet points, and extra whitespace.

4. **Extracts**:
   - **Keywords** using KeyBERT (up to 5 per sentence).
   - **Named Entities** using a BERT-based NER model.

5. **Computes Relevance** of each sentence against the global document context using SentenceTransformer embeddings and cosine similarity.

6. **Combines** all results into a structured format with:
   - Sentence metadata
   - Extracted keywords and entities
   - Relevance scores
   - Answer candidate list

7. **Filters** sentences by a relevance score threshold (≥ 0.5).

8. **Saves** the processed output to `processed_stage2.json` for use in MCQ generation.


### Imports

In [6]:
import json
import re
import spacy
from keybert import KeyBERT
from transformers import pipeline
from sentence_transformers import SentenceTransformer, util

  from .autonotebook import tqdm as notebook_tqdm


### 🤖 Load NLP Models (One-Time Setup)

This cell initializes and loads all required models used for Stage 2 of the MCQ generation pipeline: keyword extraction, NER, and sentence relevance scoring.

---

#### **Model Initializations:**

- **spaCy (`en_core_web_sm`)**  
  For general-purpose NLP tasks such as tokenization, POS tagging, and sentence segmentation.

- **KeyBERT (`KeyBERT()`)**  
  Used for unsupervised keyword/keyphrase extraction based on document embeddings.

- **SentenceTransformer (`all-MiniLM-L6-v2`)**  
  Efficient sentence embedding model used for measuring similarity and relevance between sentences and keywords.

- **Hugging Face NER Pipeline (`dslim/bert-base-NER`)**  
  Transformer-based Named Entity Recognition pipeline that identifies entities (e.g., person, organization, location) and aggregates overlapping spans.

These models are loaded once and reused throughout the pipeline to optimize performance and reduce initialization time.


In [7]:
# Load models once
nlp = spacy.load("en_core_web_sm")
kw_model = KeyBERT()
sentence_model = SentenceTransformer('all-MiniLM-L6-v2')
ner_pipeline = pipeline(
    "ner",
    model="dslim/bert-base-NER",
    tokenizer="dslim/bert-base-NER",
    aggregation_strategy="simple"
)

Some weights of the model checkpoint at dslim/bert-base-NER were not used when initializing BertForTokenClassification: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Device set to use cpu


### ✂️ Sentence-Level Processing Functions

These functions operate on individual sentences to clean, segment, extract keywords/entities, and compute relevance scores. They form the core of Stage 2 in the MCQ generation pipeline.

---

#### **`clean_sentence(sentence)`**
- Removes leading bullets, numbers, or formatting characters.
- Replaces newline characters with spaces.
- Returns a clean, normalized sentence string.

---

#### **`segment_sentences(text)`**
- Uses `spaCy` to split the cleaned text into sentences.
- Filters out very short sentences (< 15 characters) to retain only meaningful content.

---

#### **`extract_keywords(sentence, top_n=5)`**
- Uses `KeyBERT` to extract up to `top_n` keywords or keyphrases (1-3 grams).
- Stop words are excluded using the `'english'` list.
- Returns a list of top-ranked keywords.

---

#### **`extract_entities(sentence)`**
- Uses the Hugging Face `dslim/bert-base-NER` pipeline to extract named entities from the sentence.
- Removes duplicate and whitespace-only entities.
- Returns a list of unique entity mentions.

---

#### **`compute_relevance(sentence, context_emb)`**
- Encodes the sentence using `SentenceTransformer` (`all-MiniLM-L6-v2`).
- Computes cosine similarity with a precomputed context embedding.
- Returns a rounded relevance score (e.g., `0.8437`).

---

> 🧠 These functions are essential for understanding sentence significance, extracting key concepts, and preparing data for high-quality question generation.


In [8]:
def clean_sentence(sentence):
    # Remove leading numbering or bullets in sentences and newlines inside
    sentence = re.sub(r'^[\d\)\.\-\s]+', '', sentence)
    sentence = sentence.replace('\n', ' ').strip()
    return sentence

def segment_sentences(text):
    doc = nlp(text)
    return [sent.text.strip() for sent in doc.sents if len(sent.text.strip()) > 15]

def extract_keywords(sentence, top_n=5):
    kws = kw_model.extract_keywords(sentence, keyphrase_ngram_range=(1, 3), stop_words='english', top_n=top_n)
    return [kw[0] for kw in kws]

def extract_entities(sentence):
    entities = ner_pipeline(sentence)
    # Aggregate unique entity texts, filter punctuation and spaces
    unique_entities = list({ent['word'].strip() for ent in entities if ent['word'].strip() and not ent['word'].isspace()})
    return unique_entities

def compute_relevance(sentence, context_emb):
    sent_emb = sentence_model.encode(sentence, convert_to_tensor=True)
    score = util.pytorch_cos_sim(sent_emb, context_emb).item()
    return round(score, 4)

### 🧠 Section-Level Processing & Global Context

These functions are responsible for analyzing each structured section from the document. They generate per-sentence metadata including relevance scores, keywords, and entities, which are crucial for targeted question generation.

---

#### **`process_section(section, global_context_emb)`**
Processes a single structured section (`{"header", "content"}`):
- Segments the section content into sentences.
- Cleans each sentence and extracts:
  - **Keywords** using `KeyBERT`
  - **Named Entities** using a transformer-based NER model
  - **Answer Candidates** from entities or fallback keywords
  - **Relevance Score** based on cosine similarity with a global context embedding
- Returns a list of enriched sentence records like:
```json
{
  "header": "Introduction",
  "sentence_num": 1,
  "sentence": "Machine learning is a subset of AI.",
  "clean_sentence": "Machine learning is a subset of AI.",
  "keywords": ["machine learning", "AI"],
  "entities": ["AI"],
  "answer_candidates": ["AI"],
  "relevance_score": 0.8731
}


In [9]:
def process_section(section, global_context_emb):
    """
    Process a single section {header, content}:
    - Segment content to sentences
    - For each sentence, extract keywords, entities, compute relevance vs global context
    """
    header = section.get('header', '').strip()
    content = section.get('content', '').strip()
    combined_text = f"{header}. {content}"  # Use header+content as local context if needed
    
    # Compute local context embedding for relevance comparison (optional: could use global context too)
    local_context_emb = sentence_model.encode(combined_text, convert_to_tensor=True)
    
    sentences = segment_sentences(content)
    results = []
    for i, sent in enumerate(sentences):
        clean_sent = clean_sentence(sent)
        keywords = extract_keywords(clean_sent)
        entities = extract_entities(clean_sent)
        answer_candidates = entities if entities else keywords
        relevance_score = compute_relevance(clean_sent, global_context_emb)
        
        results.append({
            "header": header,
            "sentence_num": i + 1,
            "sentence": sent,
            "clean_sentence": clean_sent,
            "keywords": keywords,
            "entities": entities,
            "answer_candidates": answer_candidates,
            "relevance_score": relevance_score
        })
    return results

def load_structured_json(path="structured_output.json"):
    with open(path, "r", encoding="utf-8") as f:
        data = json.load(f)
    return data

def build_global_context(sections, max_chars=2048):
    # Combine headers + contents to build a global context string for relevance
    combined_text = " ".join(f"{sec.get('header','')} {sec.get('content','')}" for sec in sections)
    return combined_text[:max_chars]

### 📤 Stage 2: Content Extraction Pipeline (`extractions()`)

This function runs the full **Stage 2** processing flow of the MCQ generation pipeline. It enriches structured document sections by extracting semantic metadata per sentence.

---

#### **Steps Performed:**

1. **Load Structured Data**
   - Reads `structured_output.json` generated from Stage 1.

2. **Build Global Context**
   - Constructs a truncated global context string for semantic relevance comparison.
   - Embeds the global context using `SentenceTransformer`.

3. **Process Each Section**
   - For each section, it:
     - Segments content into sentences.
     - Cleans, extracts keywords and named entities.
     - Computes cosine similarity to global context (relevance).
     - Collects enriched metadata per sentence.

4. **Filter by Relevance**
   - Filters results to only keep sentences with a relevance score ≥ `0.5`.

5. **Save Results**
   - Writes enriched and filtered sentence metadata to `processed_stage2.json`.

---

#### **Output:**
- ✅ Raw processed sentences: `len(all_results)`
- 🔍 Filtered sentences by relevance ≥ `0.5`: `len(filtered_results)`
- 💾 Output file: `processed_stage2.json`

---

> 🧩 This output serves as the foundation for **Stage 3** – Multiple Choice Question (MCQ) generation.


In [10]:
def extractions():
    structured_data = load_structured_json("structured_output.json")
    global_context_text = build_global_context(structured_data)
    global_context_emb = sentence_model.encode(global_context_text, convert_to_tensor=True)

    all_results = []
    for section in structured_data:
        section_results = process_section(section, global_context_emb)
        all_results.extend(section_results)

    # Optionally filter by relevance threshold (e.g., 0.5)
    filtered_results = [r for r in all_results if r["relevance_score"] >= 0.5]

    # Save output for MCQ generation
    with open("processed_stage2.json", "w", encoding="utf-8") as f:
        json.dump(filtered_results, f, indent=2, ensure_ascii=False)

    print(f"Processed {len(all_results)} sentences, filtered to {len(filtered_results)} by relevance >= 0.5")
    print("Output saved to 'processed_stage2.json'")

📘 **Step 3: MCQ Generation from Enriched Sentences**

This notebook defines a pipeline that:

1. **Loads** processed sentence-level data from `processed_stage2.json`.

2. **Selects** key answer candidates (entities or keywords) from each sentence.

3. **Generates Questions** using the `valhalla/t5-small-qg-hl` model by highlighting the answer in the sentence.

4. **Handles Abbreviations**:
   - Detects acronyms related to answers in parentheses.
   - Replaces abbreviations in the generated question with the full answer for clarity.

5. **Cleans Questions** to:
   - Ensure minimum length and readability.
   - Prevent the answer from appearing directly in the question.

6. **Generates Distractors** using:
   - **Embedding-based similarity** with `SentenceTransformer`.
   - **WordNet** as a fallback for semantic distractors.

7. **Shuffles and Formats** questions with 1 correct answer + 3 distractors as options.

8. **Filters** out low-quality or incomplete MCQs.

9. **Saves** the final set of MCQs to `improved_mcqs.json` for use in educational applications.

✅ The script ensures each MCQ is contextually grounded, diverse, and avoids answer leakage.


### Imports

In [11]:
import json
import random
from typing import List
import torch
from transformers import pipeline
from nltk.corpus import wordnet
from sentence_transformers import SentenceTransformer, util
import nltk
import re

This cell prepares the environment and loads models required for generating Multiple Choice Questions (MCQs) from the processed content.

---

#### 📥 1. Download NLTK WordNet Resources
These are required for generating high-quality distractors using synonyms and semantic relationships.

---

#### 🤖 2. Load Question Generation Model
- **Model:** `valhalla/t5-small-qg-hl`  
- **Task:** `text2text-generation`  
- Used to generate **cloze-style questions** by highlighting the answer in the sentence context.

---

#### 🧠 3. Load Sentence Embedding Model for Distractor Similarity
- **Model:** `all-MiniLM-L6-v2` via `SentenceTransformer`  
- Used to **rank and filter distractor options** based on semantic similarity to the correct answer.

---

Each record includes:
- ✅ Clean sentence  
- 🧩 Answer candidates (entities or keywords)  
- 📊 Relevance score  
- 🏷️ Keywords and entities

---

In [12]:
# Ensure nltk data is downloaded
nltk.download("wordnet")
nltk.download("omw-1.4")

# Load the QG model with proper task
qg_pipeline = pipeline("text2text-generation", model="valhalla/t5-small-qg-hl")

# Load embedding model for distractors
embedding_model = SentenceTransformer("all-MiniLM-L6-v2")

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\jasse\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     C:\Users\jasse\AppData\Roaming\nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!
You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565
Device set to use cpu


This section defines the core logic used in **Stage 3** of the pipeline: generating cloze-style multiple-choice questions (MCQs) using a pre-trained question generation model and creating distractors with lexical and semantic techniques.

---

#### `generate_question_qg(context, answer)`
- Uses `valhalla/t5-small-qg-hl` (T5-based model) to generate a question.
- Replaces the `answer` in the `context` with `<hl>` tags.
- Returns the generated question text or an empty string if:
  - The answer is not in context.
  - An error occurs during generation.

---

#### `get_wordnet_distractors(word)`
- Uses **WordNet** (NLTK) to retrieve lexical distractors:
  - Extracts alternative lemmas from synsets.
  - Cleans and filters out duplicates or exact matches.
  - Returns up to 5 distractor candidates.

---

#### `get_embedding_distractors(correct_answer, context, top_k=5)`
- Uses **sentence embeddings** for semantic distractor generation:
  - Extracts candidate words from context (filtered for length, punctuation, and duplication).
  - Computes cosine similarity between the correct answer and candidate words.
  - Sorts by similarity and returns the top `k` most relevant distractors.

---

> ✅ This hybrid method ensures that generated MCQs are not only contextually relevant but also pedagogically useful with diverse distractors.


In [13]:
def generate_question_qg(context: str, answer: str) -> str:
    """Generate a question using valhalla/t5-small-qg-hl with <hl> highlighting."""
    if answer not in context:
        return ""  # Avoid incorrect highlighting
    highlighted = context.replace(answer, f"<hl> {answer} <hl>")
    prompt = f"generate question: {highlighted}"
    try:
        result = qg_pipeline(prompt, max_length=64, do_sample=False)
        return result[0]["generated_text"].strip() if result else ""
    except Exception as e:
        print(f"QG error: {e}")
        return ""


def get_wordnet_distractors(word: str) -> List[str]:
    """Generate distractors using WordNet."""
    distractors = set()
    for syn in wordnet.synsets(word):
        for lemma in syn.lemmas():
            name = lemma.name().replace("_", " ").lower()
            if name != word.lower():
                distractors.add(name)
    return list(distractors)[:5]


def get_embedding_distractors(correct_answer: str, context: str, top_k=5) -> List[str]:
    """Find better distractors using sentence embeddings and word filtering."""
    words = list(set(context.lower().split()))
    words = [w.strip(".,()[]") for w in words if len(w) > 3 and w.lower() != correct_answer.lower()]
    words = [w for w in words if w.isalpha()]  # Remove punctuation, numbers, etc.

    if not words:
        return []

    try:
        correct_embedding = embedding_model.encode(correct_answer, convert_to_tensor=True)
        candidate_embeddings = embedding_model.encode(words, convert_to_tensor=True)
        similarities = util.pytorch_cos_sim(correct_embedding, candidate_embeddings)[0]
        sorted_indices = similarities.argsort(descending=True)  # Get most similar
        distractors = []
        for idx in sorted_indices:
            candidate = words[idx]
            if candidate.lower() != correct_answer.lower() and candidate not in distractors:
                distractors.append(candidate)
            if len(distractors) == top_k:
                break
        return distractors
    except Exception as e:
        print(f"Distractor generation error: {e}")
        return []


### 🧹 Question Cleaning and Abbreviation Handling

This section refines the generated questions to ensure they are **clean**, **contextually correct**, and **free of leakage** (i.e., directly including the answer in the question).

---

#### `clean_question(question, answer)`
- Ensures that the correct answer is **not explicitly present** in the question.
- If found, it replaces the answer with a blank (`"_____"`).
- Useful to prevent question-answer leakage in MCQs.

---

#### `replace_abbreviation_in_question(question, answer, abbreviation_candidates)`
- Detects and replaces abbreviations or acronyms in the question with the **full answer form**.
- Uses regex for **whole word replacement** to prevent partial matches.
- Helps improve clarity when abbreviations are present in questions.

---

#### `find_abbreviations(sentence, answer)`
- Heuristically detects abbreviations **following the full form** in parentheses.
  - Example: `"Random Access Memory (RAM)" → ['RAM']`
- Filters for **uppercase or short acronyms** (≤ 6 characters).
- Returns a list of valid abbreviation candidates to be replaced.

---

> ✨ This cleaning and replacement step ensures that generated questions are educationally effective and do not confuse learners by revealing answers prematurely or using unclear acronyms.


In [14]:
def clean_question(question: str, answer: str) -> str:
    """Ensure the answer is not directly embedded in the question."""
    q_lower = question.lower()
    a_lower = answer.lower()
    if a_lower in q_lower:
        return question.replace(answer, "_____").strip()
    return question.strip()


def replace_abbreviation_in_question(question: str, answer: str, abbreviation_candidates: List[str]) -> str:
    """
    Replace any detected abbreviation or acronym in the question with the full answer.

    abbreviation_candidates: list of abbreviations/acronyms extracted from the sentence that relate to the answer.
    """
    # To avoid partial replacement, do whole word match with regex
    for abbr in abbreviation_candidates:
        pattern = re.compile(r'\b' + re.escape(abbr) + r'\b', flags=re.IGNORECASE)
        if pattern.search(question):
            question = pattern.sub(answer, question)
    return question


def find_abbreviations(sentence: str, answer: str) -> List[str]:
    """
    Heuristic: find abbreviations in sentence inside parentheses next to answer.

    Example: "Network Interface Card (NIC)" → extract "NIC"
    """
    pattern = re.compile(rf"{re.escape(answer)}\s*\(([^)]+)\)", flags=re.IGNORECASE)
    match = pattern.search(sentence)
    if match:
        # Return all uppercase abbreviations or short acronyms split by comma/semicolon if any
        abbrs = [abbr.strip() for abbr in re.split(r'[;,]', match.group(1))]
        # Filter to plausible abbreviations (usually uppercase or short)
        return [a for a in abbrs if len(a) <= 6 and a.isupper()]
    return []


### 🎯 MCQ Generation: `generate()`

This function creates high-quality **multiple-choice questions (MCQs)** from structured and filtered educational content.

---

#### 🔧 Function: `generate(input_path, output_path, mcq_target)`
- **Inputs:**
  - `input_path`: Path to processed sentence-level JSON (default: `"processed_stage2.json"`).
  - `output_path`: Filepath to save generated MCQs.
  - `mcq_target`: Number of MCQs to generate (default: 7).

---

#### 🚀 Steps:
1. **Load Cleaned Sentences** from stage 2 output.
2. For each sentence:
   - Extract first answer candidate.
   - Generate question using `<hl>`-based T5 pipeline.
   - Replace abbreviations (e.g., “NIC”) with full answer.
   - Clean question to avoid leakage.
   - Generate **distractors**:
     - Prefer sentence embedding-based similarity.
     - Fall back to **WordNet** synonyms.
   - Shuffle options and format as MCQ.

3. **Save** results in JSON format with:
   - `question`, `options`, `answer`, `header`, `source_sentence`.

---

> ⚠️ Automatically skips low-quality entries (e.g., short questions, poor distractors, or leakage).
> Output: `"improved_mcqs.json"` (or custom path).

---


In [15]:
def generate(input_path="processed_stage2.json", output_path="final_mcqs.json", mcq_target=7):
    """Generate multiple-choice questions from processed data."""
    if not os.path.exists(input_path):
        raise FileNotFoundError(f"Input file not found: {input_path}")
    
    with open(input_path, "r", encoding="utf-8") as f:
        data = json.load(f)

    output_mcqs = []
    skipped_entries = 0

    for entry in data:
        if len(output_mcqs) >= mcq_target:
            break

        sentence = entry.get("clean_sentence", "").strip()
        answer = entry.get("answer_candidates", [])[0] if entry.get("answer_candidates") else ""

        if not sentence or not answer:
            skipped_entries += 1
            continue

        question = generate_question_qg(sentence, answer)
        if not question:
            skipped_entries += 1
            continue

        # Detect abbreviations related to answer
        abbreviations = find_abbreviations(sentence, answer)

        # Replace abbreviation in question with full answer
        if abbreviations:
            question = replace_abbreviation_in_question(question, answer, abbreviations)

        question = clean_question(question, answer)

        if len(question.split()) < 3 or answer.lower() in question.lower():
            skipped_entries += 1
            continue

        # Generate distractors
        distractors = get_embedding_distractors(answer, sentence)
        if len(distractors) < 3:
            distractors += get_wordnet_distractors(answer)
        distractors = list(set(distractors))[:3]

        if len(distractors) < 3:
            skipped_entries += 1
            continue

        # Shuffle options
        options = distractors + [answer]
        random.shuffle(options)

        output_mcqs.append({
            "header": entry.get("header", "General"),
            "question": question,
            "options": options,
            "answer": answer,
            "source_sentence": sentence
        })

    # Save output
    with open(output_path, "w", encoding="utf-8") as f:
        json.dump(output_mcqs, f, indent=2, ensure_ascii=False)

    print(f"✅ MCQ generation complete. Generated {len(output_mcqs)} MCQs, skipped {skipped_entries} entries due to quality/filtering.")
    print(f"Output saved to: {output_path}")


# The main function 

In [16]:
if __name__ == "__main__": 
    preprcess("ex1.pdf")  # Change to your file path
    extractions()
    generate()

Processing file: ex1.pdf
✅ Found 9 sections.
Structured JSON saved to 'structured_output.json'

--- Sample output (first 3 sections) ---

[
  {
    "header": "Software Engineering Section 1",
    "content": "Difference between Software and Computer programs Software engineering is intended to support professional software development, rather than individual programming. It includes techniques that support program specification, design, and evolution, none of which are normally relevant for personal software development. Many people think that software is simply another word for computer programs. However, when we are talking about software engineering, software is not just the programs themselves but also all associated documentation and configuration data that is required to make these programs operate correctly. A professionally developed software system is often more than a single program. The system usually consists of a number of separate programs and configuration files that are 

Both `max_new_tokens` (=256) and `max_length`(=64) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)


Processed 76 sentences, filtered to 21 by relevance >= 0.5
Output saved to 'processed_stage2.json'


Both `max_new_tokens` (=256) and `max_length`(=64) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)
Both `max_new_tokens` (=256) and `max_length`(=64) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)
Both `max_new_tokens` (=256) and `max_length`(=64) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)
Both `max_new_tokens` (=256) and `max_length`(=64) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)
Both

✅ MCQ generation complete. Generated 7 MCQs, skipped 11 entries due to quality/filtering.
Output saved to: final_mcqs.json
