# Introduction to the AI Prototype

## Overview

Welcome to the AI Research prototype notebook. This notebook is designed to demonstrate the **core AI infrastructure and workflows**, including:

- **Retrieval-Augmented Generation (RAG) Systems**: Efficiently retrieving and integrating educational content across multiple subjects and curricula.
- **Multimodal AI Systems**: Processing and combining both text and visual data for advanced educational tools.
- **Agentic AI Frameworks**: Coordinating multiple AI agents to automate and optimize educational planning and assessment workflows.
- **Foundation Model Integration**: Leveraging pre-trained language models for curriculum-aligned content generation and adaptive learning.

This project is aligned with MESO’s mission to create the **world’s most advanced educational planning tools**.

---

## Purpose of this Notebook

The goal of this notebook is to:

1. **Provide a structured framework** for developing AI models, data pipelines, and multimodal processing workflows.
2. **Demonstrate practical implementations** of RAG systems, embeddings, vector search, and multimodal AI.
3. **Enable reproducible AI experiments** aligned with the technical requirements outlined in the MESO AI Researcher role.
4. **Prepare for platform integration**, including API deployment, real-time content generation, and global scalability.




# Importing Libraries and Utilities

## Overview

This cell loads all the essential Python libraries, utilities, and frameworks needed to **develop MESO’s AI infrastructure**, including RAG systems, multimodal AI pipelines, and data processing workflows. These imports are foundational for handling **text, embeddings, vector search, PDFs, images, and optional LLM integrations**.

By organizing all imports at the beginning, we ensure:

- Code readability and maintainability.
- Easy access to essential libraries for AI, NLP, and data processing.
- Flexibility to switch between different tools (e.g., PyMuPDF vs pdfminer) depending on availability.
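
The tool-switching idea in the last bullet can be sketched without importing anything heavy. This is a minimal illustration, not the notebook's own mechanism (the import cell uses `try`/`except` instead); `first_available` is a hypothetical helper name introduced only for this example.

```python
import importlib.util
from typing import Optional, Sequence


def first_available(modules: Sequence[str]) -> Optional[str]:
    """Return the first top-level module name that is installed, or None."""
    for name in modules:
        # find_spec returns None for a missing top-level module and does not import it
        if importlib.util.find_spec(name) is not None:
            return name
    return None


# Preference order mirrors the notebook: PyMuPDF (imported as `fitz`) first, then pdfminer.
pdf_backend = first_available(["fitz", "pdfminer"])
print("PDF backend:", pdf_backend or "none installed")
```

Checking availability first keeps the choice of backend in one place, which may be easier to log and test than nested `try`/`except` blocks.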

---


In [None]:
# =============================
# Install Dependencies
# =============================
!pip install --upgrade pip
!pip install torch torchvision torchaudio
!pip install faiss-cpu
!pip install sentence-transformers
!pip install transformers
!pip install rank_bm25
!pip install scikit-learn
!pip install fastapi uvicorn
!pip install streamlit
!pip install pillow
!pip install opencv-python
!pip install timm
!pip install trafilatura
!pip install PyMuPDF pdfminer.six
!pip install pydantic

# Optional: for multimodal (BLIP)
!pip install git+https://github.com/salesforce/BLIP.git

Collecting pip
  Downloading pip-25.2-py3-none-any.whl.metadata (4.7 kB)
Downloading pip-25.2-py3-none-any.whl (1.8 MB)
Installing collected packages: pip
  Attempting uninstall: pip
    Found existing installation: pip 24.1.2
    Uninstalling pip-24.1.2:
      Successfully uninstalled pip-24.1.2
Successfully installed pip-25.2
Collecting faiss-cpu
  Downloading faiss_cpu-1.12.0-cp312-cp312-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl.metadata (5.1 kB)
Downloading faiss_cpu-1.12.0-cp312-cp312-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl (31.4 MB)
Installing collected packages: faiss-cpu
Successfully installed faiss-cpu-1.12.0
Collecting rank_bm25
  Downloading rank_bm25-0.2.2-py3-none-any.whl.metadata (3.2 kB)
Downloading rank_bm25-0.2.2-

In [None]:
import os, re, io, json, pickle, math, random, time, warnings
from pathlib import Path
from typing import List, Dict, Tuple, Optional, Any
from collections import Counter

import numpy as np
import pandas as pd
from tqdm.auto import tqdm

# sentence-transformers and cross-encoder
from sentence_transformers import SentenceTransformer
from sentence_transformers import CrossEncoder

# FAISS
import faiss

# PDF libs (prefer PyMuPDF/fitz, fallback to pdfminer)
try:
    import fitz  # PyMuPDF
    PDF_LIB = 'pymupdf'
except Exception:
    from pdfminer.high_level import extract_text as pdfminer_extract_text
    PDF_LIB = 'pdfminer'

# Text utilities
import nltk
nltk.download('punkt')
from nltk.tokenize import sent_tokenize

# Image libs
from PIL import Image

# Optional OpenAI
try:
    import openai
except Exception:
    openai = None

# Silence warnings
warnings.filterwarnings('ignore')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


# Cell 2 — Configuration, Persistence Paths, and Lazy Model Initialization

## Overview

This cell sets up the **core configuration, folder structure, and model initialization routines**. It ensures that all files, embeddings, and indexes are stored in a consistent location and that **models are only loaded when needed** (lazy loading), which saves memory and speeds up notebook execution. It also checks for **multimodal support**, enabling image-text processing if the required libraries are installed.

---

1. `BASE_DIR`: Root directory of the project.
2. `UPLOAD_DIR`: Location for storing uploaded educational materials (PDFs, images, etc.).
3. `OUTPUTS`: Main directory to store results, embeddings, and indexes.
4. `FAISS_DIR`: Directory for storing FAISS vector indexes for semantic search.
5. `CLIP_DIR`: Directory for storing CLIP embeddings (used for multimodal image-text tasks).
6. `CORPUS_PKL`: Pickle file for storing the text corpus and its metadata.

---
### Purpose: Load the models lazily

Benefits:

- Saves RAM until a model is required.
- Avoids long initial startup times.
- Allows switching models easily via environment variables.

Loader functions:

- `load_encoder()`: Loads the SentenceTransformer model for generating embeddings.
- `load_reranker()`: Loads the CrossEncoder model for relevance scoring or reranking search results.
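
The lazy-loading pattern can be seen in isolation with a stand-in loader. This is a sketch under stated assumptions: `get_model`, `load_demo_encoder`, and the `DEMO_ENCODER_MODEL` variable are hypothetical names for illustration, and the "model" is a plain dict rather than a real SentenceTransformer. It uses `functools.lru_cache` where the notebook uses module-level globals; both achieve load-once behavior.

```python
import os
from functools import lru_cache

load_calls = []  # records each real load so the caching is observable


@lru_cache(maxsize=None)
def get_model(name: str) -> dict:
    """Stand-in loader: a real version would construct the model here."""
    load_calls.append(name)
    return {"name": name}


def load_demo_encoder() -> dict:
    # DEMO_ENCODER_MODEL is a hypothetical variable used only in this sketch
    return get_model(os.getenv("DEMO_ENCODER_MODEL", "demo-encoder"))


first = load_demo_encoder()
second = load_demo_encoder()
print("same object:", first is second, "| loads performed:", len(load_calls))
```

Nothing is loaded at import time; the first call pays the cost and later calls reuse the cached object.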

In [None]:
# CELL 2 — Config, persistence paths, lazy model init
# =====================
# Paths
BASE_DIR = Path('.').absolute()
UPLOAD_DIR = BASE_DIR / 'outputs' / 'uploads'
OUTPUTS = BASE_DIR / 'outputs'
FAISS_DIR = OUTPUTS / 'faiss'
CLIP_DIR = OUTPUTS / 'clip_index'
CORPUS_PKL = OUTPUTS / 'corpus_meta.pkl'

# Create all required directories up front.
# parents=True creates missing parent directories;
# exist_ok=True avoids errors if a directory already exists.
for d in [UPLOAD_DIR, OUTPUTS, FAISS_DIR, CLIP_DIR]:
    d.mkdir(parents=True, exist_ok=True)

# Models
# ENCODER_MODEL: Default sentence transformer model used to generate embeddings for semantic search.
# RERANKER_MODEL: Cross-encoder model used to rerank candidate search results for better relevance.
# EMBED_BATCH: Number of documents processed per batch when generating embeddings; balances speed and memory usage.

ENCODER_MODEL = os.getenv('ENCODER_MODEL', 'sentence-transformers/all-MiniLM-L6-v2')
RERANKER_MODEL = os.getenv('RERANKER_MODEL', 'cross-encoder/ms-marco-MiniLM-L-6-v2')
EMBED_BATCH = int(os.getenv('EMBED_BATCH', '32'))

# LLM settings (set the OPENAI_API_KEY environment variable or edit below)
# USE_LLM: Flag to indicate whether to use OpenAI GPT models.
# OPENAI_API_KEY: Read from the environment so the key is not hard-coded.
# The key is applied only when the optional `openai` import above succeeded;
# otherwise `openai` is None and setting an attribute on it would raise.
USE_LLM = False
OPENAI_API_KEY = os.getenv('OPENAI_API_KEY', None)
if OPENAI_API_KEY and openai is not None:
    openai.api_key = OPENAI_API_KEY

# Lazy-loaded model holders
# _encoder and _reranker: Placeholders for the embedding and cross-encoder models.
# They are initialized only when load_encoder() or load_reranker() is called,
# saving memory until the models are needed.

_encoder: Optional[SentenceTransformer] = None
_reranker: Optional[CrossEncoder] = None


def load_encoder():
    global _encoder
    if _encoder is None:
        print('Loading encoder:', ENCODER_MODEL)
        _encoder = SentenceTransformer(ENCODER_MODEL)
    return _encoder


def load_reranker():
    global _reranker
    if _reranker is None:
        print('Loading reranker:', RERANKER_MODEL)
        _reranker = CrossEncoder(RERANKER_MODEL)
    return _reranker

# Multimodal flags (models are lazy-loaded if used)
# Purpose: Check whether the multimodal libraries are installed.
# CLIP / BLIP: Used for vision-language tasks, such as matching images to text or captioning images.
# MULTIMODAL_AVAILABLE: Boolean flag indicating whether image-text processing is supported,
# so the notebook can skip multimodal operations if dependencies are missing.

MULTIMODAL_AVAILABLE = False
try:
    from transformers import CLIPProcessor, CLIPModel, BlipProcessor, BlipForConditionalGeneration
    import torch
    MULTIMODAL_AVAILABLE = True
except Exception:
    MULTIMODAL_AVAILABLE = False


## Upload helpers (Colab / Jupyter / Streamlit)

The functions below provide three upload interfaces. Each writes to `outputs/uploads/` and returns the saved paths.

---

In [None]:


# 1) Colab upload helper (if running in Colab)
def colab_upload_handler():
    try:
        from google.colab import files
    except Exception:
        print('Colab environment not detected.')
        return []
    uploaded = files.upload()
    paths=[]
    for name, blob in uploaded.items():
        p = UPLOAD_DIR / name
        with open(p, 'wb') as f:
            f.write(blob)
        paths.append(str(p))
    print('Uploaded:', paths)
    return paths

# 2) Jupyter ipywidgets (if running in Jupyter)
def jupyter_file_upload_widget():
    try:
        import ipywidgets as widgets
        from IPython.display import display
    except Exception:
        print('ipywidgets not available in this environment.')
        return None
    upload = widgets.FileUpload(accept='.pdf', multiple=True)
    display(upload)
    print('Use the widget to upload. After uploading, run: handle_widget_upload(upload)')
    return upload


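def handle_widget_upload(widget):
    paths=[]
    value = widget.value
    # ipywidgets 7.x exposes a dict {name: info}; 8.x exposes a tuple of dicts with a 'name' key
    items = value.items() if isinstance(value, dict) else [(info['name'], info) for info in value]
    for name, file_info in items:
        p = UPLOAD_DIR / name
        with open(p, 'wb') as f:
            f.write(file_info['content'])
        paths.append(str(p))
    print('Saved uploaded files:', paths)
    return paths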

# 3) Streamlit uploader (for demo)
# In streamlit app: uploaded_files = st.file_uploader('Upload PDFs', type=['pdf'], accept_multiple_files=True)
# Then call streamlit_handle_upload(uploaded_files)
def streamlit_handle_upload(uploaded_files):
    paths=[]
    for f in uploaded_files:
        # f is a UploadedFile-like object
        p = UPLOAD_DIR / f.name
        with open(p, 'wb') as out:
            out.write(f.getbuffer())
        paths.append(str(p))
    return paths

# Unified helper: call with a list of file-like or paths -> copy into outputs/uploads and return list

def ingest_files(filepaths: List[str]) -> List[str]:
    """Copy given file paths (e.g. user-selected) to UPLOAD_DIR and return new paths."""
    saved=[]
    for fp in filepaths:
        src = Path(fp)
        if not src.exists():
            print('[WARN] source does not exist:', fp)
            continue
        dst = UPLOAD_DIR / src.name
        with open(src,'rb') as r, open(dst,'wb') as w:
            w.write(r.read())
        saved.append(str(dst))
    print('Ingested files ->', saved)
    return saved

# Notify UX helper

def notify_upload_success(paths: List[str]):
    if not paths:
        print('No files uploaded.')
        return
    print('Successfully uploaded and stored the following files:')
    for p in paths:
        print(' -', p)
    print('\nNext: building corpus from these PDFs (call build_corpus_from_uploads)')

# Robust PDF Extraction (Handles Images Too)

## Overview

This cell defines **robust functions for extracting text and images from PDFs**. It is designed to:

1. Handle **both text and images** in PDF pages.
2. Support **two different PDF libraries** (`PyMuPDF` and `pdfminer`) to maximize compatibility.
3. Allow **batch processing of PDFs in a folder**, generating structured output for downstream RAG and multimodal AI workflows.

These functions are essential for MESO’s AI system because educational materials often come in PDF format and may include diagrams, charts, or scanned pages.
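
The shape of the extraction result can be explored without any PDF library. The sketch below builds the same `{'pages': [...], 'meta': {...}}` structure from plain text, assuming pages are separated by a form feed character (`\x0c`), which is what the text-only fallback path in this notebook relies on. `pages_from_text` is a hypothetical helper and the sample text is made up. Unlike the notebook's fallback, it skips empty segments.

```python
from typing import Any, Dict


def pages_from_text(text: str, path: str = "example.pdf") -> Dict[str, Any]:
    """Build the {'pages': [...], 'meta': {...}} structure from form-feed-separated text."""
    out: Dict[str, Any] = {"pages": [], "meta": {"path": path}}
    for i, page_text in enumerate(text.split("\x0c")):
        if not page_text.strip():
            continue  # skip empty segments, e.g. after a trailing form feed
        out["pages"].append({"page": i + 1, "text": page_text, "images": []})
    return out


sample = "Intro to fractions.\x0cWorked examples.\x0c"
result = pages_from_text(sample)
for page in result["pages"]:
    print(page["page"], repr(page["text"]))
```

Downstream cells only need the `page` number and `text` of each entry, so a stub like this may be handy for testing the cleaning and chunking steps in isolation.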

---

## 1. Function: `extract_text_and_images_from_pdf`

```python
def extract_text_and_images_from_pdf(pdf_path: str, extract_images: bool=True) -> Dict[str, Any]:
    """Return {'pages': [{'page':n,'text':str,'images':[{'name','data','ext'}]}], 'meta': {...}}"""
```


In [None]:

# 1. Function: `extract_text_and_images_from_pdf`
'''Reads a single PDF file.
Extracts all text from each page.
Optionally extracts embedded images.
Returns a dictionary containing:
pages: List of pages, each with page number, text, and images.
meta: Metadata about the PDF (e.g., file path).'''

def extract_text_and_images_from_pdf(pdf_path: str, extract_images: bool=True) -> Dict[str, Any]:
    """Return {'pages': [{'page':n,'text':str,'images':[{'name','data','ext'}]}], 'meta': {...}}"""
    pdf_path = str(pdf_path)
    out = {'pages':[], 'meta':{'path':pdf_path}}
    if PDF_LIB == 'pymupdf':
        doc = fitz.open(pdf_path)
        for i in range(len(doc)):
            page = doc[i]
            text = page.get_text('text') or ''
            images = []
            if extract_images:
                for img_index, img in enumerate(page.get_images(full=True)):
                    xref = img[0]
                    base_image = doc.extract_image(xref)
                    img_bytes = base_image['image']
                    ext = base_image.get('ext','png')
                    name = f"{Path(pdf_path).stem}_p{i+1}_img{img_index}.{ext}"
                    images.append({'name':name, 'data':img_bytes, 'ext':ext})
            out['pages'].append({'page': i+1, 'text': text, 'images': images})
        doc.close()
    else:
        # pdfminer fallback: extract text only
        text = pdfminer_extract_text(pdf_path)
        # basic split into pages by form feed if present
        pages = text.split('\x0c')
        for i, ptxt in enumerate(pages):
            out['pages'].append({'page': i+1, 'text': ptxt, 'images': []})
    return out

# Robust wrapper for a folder of PDFs
# 2. Function: extract_from_folder
'''Processes all PDFs in a folder.
Returns two lists:
corpus_pages: All pages and extracted content.
pdf_metas: Metadata for each PDF (path, number of pages).'''

def extract_from_folder(folder: str, extract_images: bool=True) -> Tuple[List[Dict], List[Dict]]:
    folder = Path(folder)
    pdfs = sorted([p for p in folder.glob('*.pdf')])
    corpus_pages=[]
    pdf_metas=[]
    for pdf in pdfs:
        try:
            res = extract_text_and_images_from_pdf(str(pdf), extract_images=extract_images)
            corpus_pages.append({'pdf': str(pdf), 'pages': res['pages']})
            pdf_metas.append({'path': str(pdf), 'n_pages': len(res['pages'])})
        except Exception as e:
            print('[ERROR] Failed to process', pdf, e)
    return corpus_pages, pdf_metas


# Header/Footer Detection & Cleaning

## Overview

This cell defines **functions to automatically detect and remove repeating headers and footers** from PDFs.  

In educational PDFs, it is common for each page to contain:

- Document titles, chapter names, or school names at the **top** (headers)
- Page numbers, footnotes, or copyright notes at the **bottom** (footers)

These repeated elements can interfere with **text embeddings, semantic search, and RAG systems**, so cleaning them is essential.

The two main functions are:

1. `detect_repeating_headers_footers` — identifies lines that are repeated across pages and are likely headers or footers.
2. `remove_repeating_headers` — removes these detected lines from page text to produce cleaner, content-focused text.
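
A toy run makes the detection idea concrete. The sketch below is a simplified, self-contained variant of the approach (it is not the notebook's function): it looks at the first and last line of each page, strips digits so that "Page 1" and "Page 2" count as the same line, and keeps lines that appear on at least half of the pages. The page texts and the name `repeated_edge_lines` are made up for illustration.

```python
import re
from collections import Counter
from typing import List, Set


def repeated_edge_lines(pages: List[str], edge: int = 1, threshold: float = 0.5) -> Set[str]:
    """Return normalized lines that appear at the top or bottom of enough pages."""
    counts: Counter = Counter()
    for text in pages:
        lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
        # A set, so a line is counted at most once per page
        normalized = {re.sub(r"\d+", "", ln.lower()).strip() for ln in lines[:edge] + lines[-edge:]}
        counts.update(n for n in normalized if len(n) > 3)
    return {line for line, n in counts.items() if n / len(pages) >= threshold}


pages = [
    "Example Academy\nFractions are parts of a whole.\nA half is one of two equal parts.\nPage 1",
    "Example Academy\nDecimals use place value.\nTenths come after the point.\nPage 2",
    "Example Academy\nPercent means per hundred.\nFifty percent equals one half.\nPage 3",
]
repeated = repeated_edge_lines(pages)
print(sorted(repeated))
```

Because digits are removed before counting, any later removal step has to apply the same normalization to each line, otherwise numbered footers are detected but never matched.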

---

In [None]:
# CELL 5 — Header/footer detection & cleaning
# =====================

def detect_repeating_headers_footers(pages: List[Dict], top_lines=4, bottom_lines=4, threshold=0.25) -> set:
    """Given pages [{'text':...}], return set of repeated line fragments likely headers/footers"""
    total = len(pages)
    c = Counter()
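    for p in pages:
        lines = [ln.strip() for ln in p['text'].splitlines() if ln.strip()]
        top = lines[:top_lines]
        bottom = lines[-bottom_lines:]
        # count each normalized line once per page, so short pages whose top and
        # bottom windows overlap do not inflate the frequency
        seen = set()
        for ln in top+bottom:
            norm = re.sub(r'\d+','', ln.lower()).strip()
            if len(norm) > 3 and norm not in seen:
                seen.add(norm)
                c[norm]+=1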
    candidates = {k for k,v in c.items() if v/total >= threshold}
    return candidates

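def remove_repeating_headers(text: str, repeating: set) -> str:
    """Drop lines whose normalized form (lowercased, digits removed) was flagged as a header/footer.

    Lines are normalized exactly as in detect_repeating_headers_footers, so numbered
    lines such as 'Page 3' match the detected fragment 'page'. Whitespace is then collapsed.
    """
    kept = []
    for ln in text.splitlines():
        norm = re.sub(r'\d+', '', ln.lower()).strip()
        if norm in repeating:
            continue
        kept.append(ln)
    out = re.sub(r'\s+', ' ', ' '.join(kept)).strip()
    return out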

# Semantic Chunking (Heading Detection + Sentence-Aware Overlap)

## Overview

This cell defines **functions to split large PDF/text content into semantically meaningful chunks** for embedding and retrieval.  

Key goals:

1. Detect **logical headings** or sections in the text.
2. Preserve **sentence boundaries** to maintain context.
3. Introduce **overlap between chunks** to avoid breaking semantic continuity.
4. Produce chunks of **manageable size** for vector embedding models (like SentenceTransformers or LLMs).

This is critical for MESO’s AI system because educational content often comes as **long PDFs**, and splitting them properly ensures:

- High-quality embeddings.
- Improved retrieval accuracy in RAG systems.
- Preservation of contextual information across chunk boundaries.
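
The overlap idea is easiest to see on a tiny input. The sketch below is a simplified stand-in for the notebook's chunker: it splits sentences with a regular expression instead of NLTK, uses a small word budget so the effect is visible, and carries a fixed number of trailing sentences into the next chunk. `chunk_sentences` and the sample text are illustrative only.

```python
import re
from typing import List


def chunk_sentences(text: str, chunk_words: int = 12, overlap_sents: int = 1) -> List[str]:
    """Group sentences into chunks of about chunk_words, repeating the last sentences of each chunk."""
    sentences = [s for s in re.split(r"(?<=[.?!])\s+", text.strip()) if s]
    chunks: List[str] = []
    bucket: List[str] = []
    for sent in sentences:
        size = sum(len(s.split()) for s in bucket)
        if bucket and size + len(sent.split()) > chunk_words:
            chunks.append(" ".join(bucket))
            # carry the tail of the finished chunk into the next one
            bucket = bucket[-overlap_sents:] if overlap_sents > 0 else []
        bucket.append(sent)
    if bucket:
        chunks.append(" ".join(bucket))
    return chunks


text = (
    "Plants make food by photosynthesis. They need light, water, and carbon dioxide. "
    "Chlorophyll absorbs the light. Oxygen is released as a by-product."
)
chunks = chunk_sentences(text)
for c in chunks:
    print("-", c)
```

Each chunk after the first begins with the sentence that closed the previous one, so a query that matches a boundary sentence can still retrieve its surrounding context.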

---

In [None]:
# CELL 6 — Semantic chunking (heading detection + sentence-aware overlap)
# =====================

def heuristic_heading_lines(text: str, top_n=6) -> List[str]:
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    cand = []
    top = lines[:top_n]
    for ln in top:
        # heading heuristics: ALL CAPS or short line (<=8 words) or contains 'Chapter' 'Section'
        if ln.isupper() or len(ln.split()) <= 8 or re.search(r'chapter|section|unit|lesson', ln, flags=re.I):
            cand.append(ln)
    return cand


def chunk_by_section_and_sentences(text: str, chunk_words=500, overlap_words=100, min_words=40) -> List[str]:
    # Split text into sections by two newlines or headings
    # keep sentence boundaries
    sections = re.split(r'\n\s*\n', text)
    chunks=[]
    for sec in sections:
        sec = sec.strip()
        if not sec: continue
        sents = sent_tokenize(sec)
        if not sents: continue
        bucket=[]
        bucket_words=0
        for s in sents:
            w = len(s.split())
            if bucket_words + w <= chunk_words or not bucket:
                bucket.append(s); bucket_words += w
            else:
                chunk = ' '.join(bucket).strip()
                if len(chunk.split()) >= min_words:
                    chunks.append(chunk)
                # compute overlap in sentences to keep semantic continuity
                avg = max(1, bucket_words/len(bucket))
                overlap_sent = max(1, int(overlap_words/avg))
                bucket = bucket[-overlap_sent:].copy()
                bucket_words = sum(len(ss.split()) for ss in bucket)
                bucket.append(s); bucket_words += w
        if bucket:
            chunk = ' '.join(bucket).strip()
            if len(chunk.split()) >= min_words:
                chunks.append(chunk)
    return chunks

# Build Corpus from Uploaded PDFs and Persist

## Overview

This cell defines a **function to process uploaded PDFs into a clean, chunked, and structured corpus**, ready for embeddings and retrieval.  

Key responsibilities of this cell:

1. **Load uploaded PDFs** from a folder.
2. **Extract text** (and optionally images) from PDFs.
3. **Detect and remove repeating headers/footers** for clean content.
4. **Chunk text semantically** using sentence-aware overlap.
5. **Fallback sliding window** for unchunked text to avoid losing data.
6. **Generate metadata** for each chunk for tracking.
7. **Persist the corpus** to disk for future use.

This step **consolidates PDF preprocessing, cleaning, and chunking** into a reusable corpus for MESO’s AI RAG system.
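
The fallback sliding window in step 5 can be checked with simple arithmetic. The sketch below assumes the same numbers the code cell uses (windows of 500 words, advancing 300 words at a time); `sliding_windows` is a hypothetical helper and the word list is synthetic.

```python
from typing import List


def sliding_windows(words: List[str], window: int = 500, stride: int = 300) -> List[List[str]]:
    """Return overlapping word windows; consecutive full windows share window - stride words."""
    return [words[i:i + window] for i in range(0, len(words), stride)]


words = [f"w{i}" for i in range(1000)]
windows = sliding_windows(words)
print([len(w) for w in windows])
```

With these settings, neighbouring windows share 200 words. Note that the final short window can be fully contained in the one before it, so a deduplication step may be worth considering for long pages.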

---


In [None]:
# CELL 7 — Build corpus from uploaded PDFs and persist
# =====================

def build_corpus_from_uploads(upload_folder: str, persist=True) -> Tuple[List[str], List[Dict]]:
    """Extracts, cleans headers, chunks, and returns corpus list and meta list. Persists to CORPUS_PKL by default."""
    upload_folder = Path(upload_folder)
    pdfs = sorted([p for p in upload_folder.glob('*.pdf')])
    # process each PDF individually so every chunk keeps its document/page mapping
    corpus=[]
    meta=[]
    for pdf in pdfs:
        try:
            res = extract_text_and_images_from_pdf(str(pdf), extract_images=False)
        except Exception as e:
            print('[ERROR]', e)
            continue
        pages = res['pages']
        # detect headers across pages for this pdf
        repeating = detect_repeating_headers_footers(pages, top_lines=4, bottom_lines=4, threshold=0.25)
        for p in pages:
            txt = p['text']
            if not txt or not txt.strip():
                continue
            txt = remove_repeating_headers(txt, repeating)
            chunks = chunk_by_section_and_sentences(txt, chunk_words=500, overlap_words=100, min_words=40)
            if not chunks:
                # fallback sliding window
                words = txt.split()
                if len(words) > 60:
                    for i in range(0, len(words), 300):
                        sub = ' '.join(words[i:i+500])
                        corpus.append(sub)
                        meta.append({'doc': str(pdf.name), 'page': p['page'], 'chunk_id': len(corpus)-1})
                else:
                    corpus.append(txt)
                    meta.append({'doc': str(pdf.name), 'page': p['page'], 'chunk_id': len(corpus)-1})
                continue
            for cid, ch in enumerate(chunks):
                corpus.append(ch)
                mm = {'doc': str(pdf.name), 'page': p['page'], 'chunk_id': cid, 'n_words': len(ch.split()), 'first_sentence': ch.split('. ')[0][:200]}
                meta.append(mm)
    print('Built corpus chunks:', len(corpus))
    if persist:
        with open(CORPUS_PKL, 'wb') as f:
            pickle.dump({'corpus':corpus,'meta':meta}, f)
        print('Persisted corpus ->', CORPUS_PKL)
    return corpus, meta

# Cell 8 — Build Embeddings + FAISS Index (Batched)

## Overview

This cell defines functions to **convert the processed corpus into embeddings and build a FAISS index**, enabling **fast semantic search**.  

Key responsibilities:

1. Encode each text chunk in the corpus into **vector embeddings**.
2. Build a **FAISS index** for efficient similarity search.
3. Persist embeddings, index, and metadata to disk.
4. Provide a loader function to **reload the FAISS index and metadata**.

This is a critical step for MESO’s AI retrieval system, allowing **rapid, accurate retrieval of educational content** from large document corpora.
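
Why normalized embeddings pair well with an inner-product index can be shown with plain Python. For unit-length vectors the inner product equals cosine similarity, so an exhaustive inner-product search ranks by cosine. The sketch below is a brute-force illustration with made-up two-dimensional vectors, not a replacement for FAISS; `normalize` and `top_k_inner_product` are hypothetical helper names.

```python
import math
from typing import List, Sequence, Tuple


def normalize(vec: Sequence[float]) -> List[float]:
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


def top_k_inner_product(query: Sequence[float], vectors: Sequence[Sequence[float]], k: int = 2) -> List[Tuple[int, float]]:
    """Score every vector against the query and return the k best (index, score) pairs."""
    q = normalize(query)
    scored = [(i, sum(a * b for a, b in zip(q, normalize(v)))) for i, v in enumerate(vectors)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
hits = top_k_inner_product([2.0, 0.1], vectors, k=2)
for idx, score in hits:
    print(idx, round(score, 3))
```

The scores stay within [-1, 1] regardless of the original vector lengths, which makes them comparable across queries.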

---


In [None]:
# CELL 8 — Build embeddings + FAISS index (batched)
# =====================

def build_faiss_index(corpus: List[str], meta: List[Dict], encoder_model=None, batch_size=EMBED_BATCH, persist_dir: Path=FAISS_DIR):
    if not corpus:
        raise ValueError('Cannot build a FAISS index from an empty corpus.')
    if encoder_model is None:
        encoder_model = load_encoder()
    all_emb=[]
    for i in tqdm(range(0, len(corpus), batch_size), desc='Embedding corpus'):
        batch = corpus[i:i+batch_size]
        emb = encoder_model.encode(batch, convert_to_numpy=True, normalize_embeddings=True, show_progress_bar=False)
        all_emb.append(emb)
    emb = np.vstack(all_emb).astype('float32')
    dim = emb.shape[1]
    index = faiss.IndexIDMap(faiss.IndexFlatIP(dim))
    ids = np.arange(len(corpus)).astype('int64')
    index.add_with_ids(emb, ids)
    # persist
    faiss.write_index(index, str(persist_dir / 'faiss.index'))
    np.save(str(persist_dir / 'embeddings.npy'), emb)
    with open(str(persist_dir / 'meta.pkl'), 'wb') as f:
        pickle.dump(meta, f)
    print('Saved FAISS index to', persist_dir)
    return index, emb

# Loader

def load_index_and_meta(persist_dir: Path=FAISS_DIR):
    idx_path = persist_dir / 'faiss.index'
    if not idx_path.exists():
        raise FileNotFoundError('No FAISS index found. Run build_faiss_index first.')
    index = faiss.read_index(str(idx_path))
    emb = np.load(str(persist_dir / 'embeddings.npy'))
    with open(str(persist_dir / 'meta.pkl'), 'rb') as f:
        meta = pickle.load(f)
    print('Loaded FAISS index and meta. n_chunks=', len(meta))
    return index, emb, meta


# Search + Rerank + Filters + Confidence

## Overview

This cell defines functions to **search the FAISS index, rerank retrieved chunks using a cross-encoder, apply optional filters, and compute confidence scores**.  

It is a **core retrieval step** in the MESO AI pipeline, allowing semantic search over educational content with **high precision and contextual relevance**.

The two main functions are:

1. `search_and_rerank`: retrieve candidate chunks from the FAISS index, apply optional metadata filters, and rerank the survivors with the cross-encoder.
2. `safe_confidence`: average the reranker scores and clip the result to the range [0, 1].
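
Clipping a mean score to [0, 1] works well when the reranker already returns probabilities. If it returns unbounded logits instead (this depends on the model and library version, so it is worth printing a few raw scores to check), clipping tends to saturate at 0 or 1. One possible alternative, shown here as a sketch and not as the notebook's method, is to squash each score with a sigmoid before averaging. `logit_confidence` is a hypothetical name.

```python
import math
from typing import List


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def logit_confidence(scores: List[float]) -> float:
    """Mean of sigmoid-squashed scores; 0.0 when there are no results."""
    if not scores:
        return 0.0
    return sum(sigmoid(s) for s in scores) / len(scores)


print(round(logit_confidence([-3.0, 4.5]), 3))
```

The result always lies strictly between 0 and 1 for non-empty input, so differences between strong and weak matches remain visible.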

---

In [None]:
# CELL 9 — Search + rerank + filters + confidence
# =====================

def search_and_rerank(index, query: str, corpus: List[str], meta: List[Dict], top_k=5, extra_k=12, filters: Optional[Dict]=None):
    encoder = load_encoder()
    reranker = load_reranker()
    qv = encoder.encode([query], convert_to_numpy=True, normalize_embeddings=True)
    D, I = index.search(qv, extra_k)
    cand = []
    for score, idx in zip(D[0], I[0]):
        if int(idx) < 0: continue
        m = meta[int(idx)]
        if filters:
            skip=False
            for k,v in filters.items():
                mv = m.get(k)
                if mv is None or str(mv).lower() != str(v).lower():
                    skip=True; break
            if skip: continue
        cand.append((int(idx), float(score)))
    if not cand:
        return [], [], []
    chunks = [corpus[i] for i,_ in cand]
    pairs = [(query, c) for c in chunks]
    rerank_scores = reranker.predict(pairs)
    ranked = sorted(zip(cand, chunks, rerank_scores), key=lambda x: x[2], reverse=True)[:top_k]
    final_chunks = [x[1] for x in ranked]
    final_meta = [meta[x[0][0]] for x in ranked]
    final_scores = [float(x[2]) for x in ranked]
    return final_chunks, final_meta, final_scores


def safe_confidence(scores: List[float]) -> float:
    if not scores: return 0.0
    return float(np.clip(np.mean(scores), 0.0, 1.0))

# Guardrails & Validator Agent

## Overview

This cell introduces a **validator agent**, which acts as a **content guardrail** in the MESO AI pipeline.  

The purpose of this function is to **filter retrieved or generated text chunks**, ensuring they:

1. Meet **minimum content length**.
2. Avoid **banned or harmful terms**.
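3. Optionally respect **readability constraints** (e.g., Flesch-Kincaid grade level); the `max_readability_grade` parameter is accepted for this, but the code below does not yet enforce it.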

This is crucial for **educational applications**, where outputs must be safe, clean, and appropriate for learners.
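
A readability check could be built on the Flesch-Kincaid grade formula, `0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59`. The sketch below is one possible implementation under a loud assumption: syllables are estimated with a rough vowel-group heuristic, so the grades are approximate. `count_syllables` and `flesch_kincaid_grade` are hypothetical helpers, not part of the validator in this notebook.

```python
import re


def count_syllables(word: str) -> int:
    """Rough heuristic: count vowel groups and discount a trailing silent 'e'."""
    word = word.lower()
    count = len(re.findall(r"[aeiouy]+", word))
    if word.endswith("e") and not word.endswith("le") and count > 1:
        count -= 1
    return max(1, count)


def flesch_kincaid_grade(text: str) -> float:
    """Approximate Flesch-Kincaid grade level; returns 0.0 for empty input."""
    sentences = [s for s in re.split(r"[.?!]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        return 0.0
    syllables = sum(count_syllables(w) for w in words)
    return 0.39 * len(words) / len(sentences) + 11.8 * syllables / len(words) - 15.59


simple = "The cat sat. The dog ran."
dense = "Photosynthesis converts electromagnetic radiation into biochemical energy."
print(round(flesch_kincaid_grade(simple), 2), round(flesch_kincaid_grade(dense), 2))
```

A validator could then skip any chunk whose grade exceeds the configured maximum, keeping in mind that the heuristic is noisy on very short texts.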

---

In [None]:
# CELL 10 — Guardrails & validator agent
# =====================

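def validator_agent(chunks: List[str], metas: List[Dict], min_words=20, banned_terms: Optional[set]=None, max_readability_grade: Optional[float]=None) -> Tuple[List[str], List[Dict]]:
    # NOTE: max_readability_grade is accepted but not enforced yet.
    if banned_terms is None:
        banned_terms = {'suicide','bomb','explosive','terror','hate'}
    # Match banned terms only at the start of a word, so 'hate' does not flag
    # 'whatever' while 'terror' still flags 'terrorism'.
    banned_re = None
    if banned_terms:
        banned_re = re.compile(r'\b(?:' + '|'.join(re.escape(b) for b in banned_terms) + r')', flags=re.I)
    filtered_chunks=[]
    filtered_meta=[]
    for ch,m in zip(chunks, metas):
        if len(ch.split()) < min_words:
            continue
        if banned_re is not None and banned_re.search(ch):
            continue
        filtered_chunks.append(ch)
        filtered_meta.append(m)
    return filtered_chunks, filtered_meta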


# MCQ Generation (Deterministic + LLM-based)

## Overview

This cell implements the **Multiple Choice Question (MCQ) generation pipeline** for educational content in MESO.  
It combines:

1. **Deterministic MCQ generation** from extracted text chunks.
2. **LLM-based MCQ generation** (e.g., OpenAI ChatCompletion or Google Gemini) for more natural, diverse, and pedagogically rich questions.
3. A **wrapper function** to seamlessly choose between deterministic or LLM-based generation.

The goal is to automatically generate **contextually accurate MCQs** suitable for teachers and students.
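
LLM replies that are supposed to be "JSON only" sometimes arrive wrapped in a Markdown code fence or with malformed items. The sketch below shows one defensive way to parse such a reply before using it. It assumes the key names requested by the prompt template in this notebook (`question_text`, `options`, `correct_option`); `parse_mcq_json` is a hypothetical helper and the sample reply is made up.

```python
import json
import re
from typing import Any, Dict, List

FENCE = "`" * 3  # Markdown code fence marker, built this way to keep the example readable
REQUIRED_KEYS = {"question_text", "options", "correct_option"}


def parse_mcq_json(raw: str) -> List[Dict[str, Any]]:
    """Parse a model reply into a list of MCQ dicts, dropping malformed items."""
    # Remove an optional leading fence (with or without a 'json' tag) and a trailing fence
    pattern = rf"^\s*{FENCE}(?:json)?\s*|\s*{FENCE}\s*$"
    cleaned = re.sub(pattern, "", raw.strip())
    try:
        data = json.loads(cleaned)
    except json.JSONDecodeError:
        return []
    if not isinstance(data, list):
        return []
    return [q for q in data if isinstance(q, dict) and REQUIRED_KEYS.issubset(q)]


reply = (
    FENCE + "json\n"
    '[{"question_text": "2 + 2 = ?", "options": {"A": "3", "B": "4"}, "correct_option": "B"},'
    ' {"oops": 1}]\n' + FENCE
)
mcqs = parse_mcq_json(reply)
print(len(mcqs), mcqs[0]["correct_option"])
```

Returning an empty list on failure lets the caller fall back to the deterministic generator without a separate error path.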

---

In [None]:
# CELL 11 — MCQ generation (deterministic fallback + LLM wrapper)
# =====================
from sklearn.feature_extraction.text import TfidfVectorizer


def deterministic_mcqs_from_chunks(chunks: List[str], n_q=5):
    text = ' '.join(chunks)
    vect = TfidfVectorizer(stop_words='english', max_features=2000)
    X = vect.fit_transform([text])
    terms = list(vect.get_feature_names_out())
    candidates = [t for t in terms if len(t)>4]
    sents = re.split(r'(?<=[.?!])\s+', text)
    random.shuffle(candidates)
    questions=[]
    used_terms=set()
    for term in candidates:
        if len(questions)>=n_q: break
        if term in used_terms: continue
        found = [s for s in sents if re.search(r'\b'+re.escape(term)+r'\b', s, flags=re.I)]
        if not found: continue
        sent = found[0]
        blank = re.sub(r'(?i)\b'+re.escape(term)+r'\b', '_____', sent, count=1)
        # distractors: pick other candidate terms or generate via simple perturbation
        distractors = [c for c in candidates if c!=term]
        if len(distractors) >= 3:
            choices = random.sample(distractors, 3) + [term]
        else:
            # simple morphological variants
            choices = [term+'s', term.upper(), term+'_alt'][:3] + [term]
        random.shuffle(choices)
        questions.append({'question': blank, 'options': choices, 'answer': term, 'rationale': 'Taken from source sentence.', 'difficulty': 'medium'})
        used_terms.add(term)
    return questions

# Prompt template for LLM-based generation (the implementation below calls Gemini and reads GEMINI_API_KEY)
LLM_PROMPT_TEMPLATE = '''You are an educational content generator. Based only on the CONTEXT below, generate {n_q} multiple-choice questions (A-D).
For each question include JSON keys: question_text, options (A-D), correct_option (A/B/C/D), rationale (one brief sentence), difficulty (easy/medium/hard).
Context:
{context}
Respond ONLY with a JSON array.'''


# LLM-based MCQ generator
import requests

def llm_generate_mcqs_gemini(chunks: List[str], n_q=5, api_key: Optional[str]=None, model_name='gemini-1.5'):
    """
    Generate MCQs with the Gemini REST API (generateContent) instead of OpenAI.
    Falls back to deterministic MCQs when no key is set or the request fails.
    model_name must be a model ID the API currently serves; the default here is
    kept from the original notebook and may need to be changed.
    """
    # Get API key from argument or environment variable
    key = api_key or os.getenv("GEMINI_API_KEY")
    if not key:
        print("[WARN] No Gemini API key found. Falling back to deterministic MCQs.")
        return deterministic_mcqs_from_chunks(chunks, n_q=n_q)

    prompt = LLM_PROMPT_TEMPLATE.format(n_q=n_q, context='\n\n'.join(chunks[:4]))

    # The Gemini API authenticates with an API key header, not a Bearer token
    headers = {
        "x-goog-api-key": key,
        "Content-Type": "application/json"
    }

    data = {
        "contents": [{"parts": [{"text": prompt}]}],
        "generationConfig": {"temperature": 0.0, "maxOutputTokens": 800}
    }

    try:
        url = f"https://generativelanguage.googleapis.com/v1beta/models/{model_name}:generateContent"
        resp = requests.post(url, headers=headers, json=data, timeout=60)
        resp.raise_for_status()
        txt = resp.json()['candidates'][0]['content']['parts'][0]['text']
        return json.loads(txt)
    except Exception as e:
        print(f"[WARN] Gemini LLM generation failed: {e}. Using deterministic fallback.")
        return deterministic_mcqs_from_chunks(chunks, n_q=n_q)



def generate_mcqs(chunks: List[str], n_q=5, use_llm=False, api_key: Optional[str]=None):
    if not chunks: return []
    if use_llm:
        return llm_generate_mcqs_gemini(chunks, n_q=n_q, api_key=api_key)
    else:
        return deterministic_mcqs_from_chunks(chunks, n_q=n_q)
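
The prompt template asks the model for `question_text`, `options` (A-D), and `correct_option`, while the deterministic generator returns `question`, `options` (a list), and `answer`. A minimal sketch of one way to map the LLM shape onto the deterministic one follows. `normalize_llm_mcq` is a hypothetical helper, not part of the notebook, and it assumes `options` arrives either as a list or as a dict keyed by uppercase letter. The sample item is illustrative, not model output.

```python
from typing import Any, Dict

def normalize_llm_mcq(item: Dict[str, Any]) -> Dict[str, Any]:
    """Map an LLM-style MCQ dict onto the keys used by the deterministic generator."""
    raw_opts = item.get("options", [])
    if isinstance(raw_opts, dict):
        # {"A": "...", "B": "..."} -> list ordered by letter
        letters = sorted(raw_opts)
        options = [raw_opts[k] for k in letters]
    else:
        options = list(raw_opts)
        letters = [chr(ord("A") + i) for i in range(len(options))]

    # Translate the letter in correct_option into the option text itself
    correct = str(item.get("correct_option", "")).strip().upper()
    answer = options[letters.index(correct)] if correct in letters else None

    return {
        "question": item.get("question_text", item.get("question", "")),
        "options": options,
        "answer": answer,
        "rationale": item.get("rationale", ""),
        "difficulty": item.get("difficulty", "medium"),
    }

# Illustrative input in the shape requested by LLM_PROMPT_TEMPLATE
sample = {
    "question_text": "Who called economics 'the study of man in the ordinary business of life'?",
    "options": {"A": "Adam Smith", "B": "Alfred Marshall", "C": "J. M. Keynes", "D": "David Ricardo"},
    "correct_option": "B",
    "rationale": "Stated in the source chunk.",
    "difficulty": "easy",
}
normalized = normalize_llm_mcq(sample)
print(normalized["answer"])
```

Normalizing at this boundary would let the report and worksheet utilities treat both approaches uniformly; an unknown `correct_option` yields `answer=None` rather than raising.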

# Multimodal: CLIP Indexing + BLIP Captioning

## Overview

This cell adds **multimodal support** to the MESO pipeline, enabling the system to handle **images** alongside text.  
It combines:

1. **CLIP embeddings** for image semantic indexing and similarity search.
2. **BLIP captions** to generate textual descriptions of images, which can then be searched or linked with text chunks.
3. **FAISS indexing** to store and retrieve image embeddings efficiently.

> Multimodal functionality is **optional** and only works if the required libraries (`transformers`, `torch`, `PIL`) are installed.
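
The indexing function in this cell L2-normalizes every CLIP vector and stores it in an inner-product index. For unit-length vectors the inner product equals cosine similarity, which is why that combination works for semantic search. The sketch below shows the idea in plain Python on toy 2-D vectors; the helper names and the data are assumptions made for illustration, and FAISS performs the same ranking far more efficiently.

```python
import math
from typing import List, Sequence, Tuple

def l2_normalize(vec: Sequence[float]) -> List[float]:
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]

def inner_product(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))

def top_k(query: Sequence[float], vectors: List[Sequence[float]], k: int = 2) -> List[Tuple[int, float]]:
    """Rank vectors by inner product with the query after normalizing both sides."""
    q = l2_normalize(query)
    scored = [(i, inner_product(q, l2_normalize(v))) for i, v in enumerate(vectors)]
    return sorted(scored, key=lambda t: t[1], reverse=True)[:k]

# Toy "embeddings"; real CLIP vectors have hundreds of dimensions
toy_embeddings = [[1.0, 0.0], [0.0, 2.0], [3.0, 3.0]]
results = top_k([2.0, 1.0], toy_embeddings, k=2)
print(results)
```

Vector length does not affect the ranking here: `[3.0, 3.0]` wins because of its direction, not its magnitude.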

---

In [None]:
from pathlib import Path
from PIL import Image, UnidentifiedImageError
import numpy as np
import pickle
import faiss
from tqdm import tqdm

# Make sure MULTIMODAL_AVAILABLE is True before running this
def multimodal_index_images(image_folder: str, persist_dir: Path):
    if not MULTIMODAL_AVAILABLE:
        print('Multimodal libs not available. Skipping image indexing.')
        return None

    # Load models
    clip_model = CLIPModel.from_pretrained('openai/clip-vit-base-patch32')
    clip_processor = CLIPProcessor.from_pretrained('openai/clip-vit-base-patch32')
    blip_model = BlipForConditionalGeneration.from_pretrained('Salesforce/blip-image-captioning-base')
    blip_processor = BlipProcessor.from_pretrained('Salesforce/blip-image-captioning-base')

    # Only pick valid image file types
    valid_exts = ['.png', '.jpg', '.jpeg', '.bmp', '.tiff']
    img_paths = sorted([str(p) for p in Path(image_folder).glob('*') if p.suffix.lower() in valid_exts])

    if not img_paths:
        print('No valid images found in', image_folder)
        return None

    emb_list = []
    meta = []

    for p in tqdm(img_paths, desc='Indexing images'):
        try:
            im = Image.open(p).convert('RGB')
        except UnidentifiedImageError:
            print(f"⚠️ Skipping non-image file: {p}")
            continue
        except Exception as e:
            print(f"⚠️ Skipping file {p} due to error: {e}")
            continue

        # CLIP embedding
        inputs = clip_processor(images=im, return_tensors='pt')
        with torch.no_grad():
            out = clip_model.get_image_features(**inputs)
            vec = out.cpu().numpy().reshape(-1)
            vec = vec / np.linalg.norm(vec)  # normalize
        emb_list.append(vec.astype('float32'))

        # BLIP captioning
        blip_in = blip_processor(images=im, return_tensors='pt').to(blip_model.device)
        with torch.no_grad():
            ids = blip_model.generate(**blip_in, max_new_tokens=40)
            caption = blip_processor.decode(ids[0], skip_special_tokens=True)
        meta.append({'path': p, 'caption': caption})

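    # Save embeddings with FAISS
    if emb_list:
        persist_dir = Path(persist_dir)
        persist_dir.mkdir(parents=True, exist_ok=True)  # make sure the target folder exists
        emb = np.vstack(emb_list)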
        dim = emb.shape[1]
        img_index = faiss.IndexIDMap(faiss.IndexFlatIP(dim))
        ids = np.arange(len(emb)).astype('int64')
        img_index.add_with_ids(emb, ids)
        faiss.write_index(img_index, str(persist_dir / 'img.index'))
        np.save(str(persist_dir / 'img_emb.npy'), emb)
        with open(str(persist_dir / 'img_meta.pkl'), 'wb') as f:
            pickle.dump(meta, f)
        print(f"✅ Saved image index with {len(emb_list)} images to {persist_dir}")
        return True
    else:
        print("⚠️ No valid images were indexed.")
        return None
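
The function stores one `{'path', 'caption'}` dict per image in `img_meta.pkl`, so the BLIP captions can be searched as plain text. A minimal sketch of a keyword lookup over that metadata is shown below. `search_captions` is a hypothetical helper, the scoring is naive token overlap (no stop-word handling), and the sample paths and captions are made up for illustration.

```python
from typing import Dict, List

def search_captions(meta: List[Dict[str, str]], query: str, top_k: int = 3) -> List[Dict[str, str]]:
    """Return the entries whose captions share the most tokens with the query."""
    q_tokens = set(query.lower().split())
    scored = []
    for entry in meta:
        cap_tokens = set(entry["caption"].lower().split())
        overlap = len(q_tokens & cap_tokens)
        if overlap:
            scored.append((overlap, entry))
    # Highest overlap first; ties keep their original order
    scored.sort(key=lambda t: t[0], reverse=True)
    return [entry for _, entry in scored[:top_k]]

# Illustrative metadata in the same shape the indexing function pickles
sample_meta = [
    {"path": "outputs/uploads/chart.png", "caption": "a graph showing supply and demand"},
    {"path": "outputs/uploads/leaf.png", "caption": "a diagram of a plant leaf"},
    {"path": "outputs/uploads/room.jpg", "caption": "a photo of a classroom"},
]
hits = search_captions(sample_meta, "diagram of a leaf", top_k=2)
print([h["path"] for h in hits])
```

In practice the captions could instead be embedded with the same text encoder as the PDF chunks so that images and text share one retrieval path.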


# Orchestration: Retriever → Validator → Quiz Generator → Finalizer

## Overview

This cell represents the **pipeline orchestration layer** of the MESO system.  
It coordinates all previous components — text/image retrieval, validation, MCQ generation, and packaging — into a **single workflow** that transforms a user query into a ready-to-use quiz payload.

The main functions are:

1. `finalizer_agent`: Combines MCQs, metadata, and optionally image data into a structured output.
2. `run_pipeline`: End-to-end orchestrator that:
   - Searches and reranks relevant chunks
   - Applies validator guardrails
   - Generates MCQs (deterministic or LLM-based)
   - Returns a final payload.

---


In [None]:
# CELL 13 — Orchestration: Retriever -> Validator -> QuizGen -> Finalizer
# =====================

def finalizer_agent(mcqs: List[Dict], metas: List[Dict], images: Optional[List[Dict]]=None) -> Dict:
    payload = {'quiz': mcqs, 'sources': metas}
    if images:
        payload['images'] = images
    payload['generated_at'] = time.strftime('%Y-%m-%d %H:%M:%S')
    return payload


def run_pipeline(query: str, index=None, corpus=None, meta=None, top_k=5, use_llm=False, api_key: Optional[str]=None, filters: Optional[Dict]=None):
    # Load index & corpus if not provided
    if index is None or corpus is None or meta is None:
        index, emb, meta = load_index_and_meta()
        with open(CORPUS_PKL, 'rb') as f:
            d = pickle.load(f)
            corpus = d['corpus']
            meta = d['meta']

    # Search and rerank
    chunks, metas, scores = search_and_rerank(index, query, corpus, meta, top_k=top_k, extra_k=top_k*3, filters=filters)

    if not chunks:
        print('No chunks retrieved.')
        return {'error':'no_retrieval'}

    # Show raw reranker scores
    print("Top-k raw reranker scores:", scores)

    # Convert scores to 0-1 confidence using softmax
    exp_scores = np.exp(scores - np.max(scores))
    conf = float(np.max(exp_scores / exp_scores.sum()))
    print(f"Retrieved {len(chunks)} chunks. Confidence: {conf:.3f}")

    # Validate chunks
    v_chunks, v_metas = validator_agent(chunks, metas)
    if not v_chunks:
        print('All chunks filtered by validator.')
        return {'error':'all_filtered'}

    # Generate MCQs: deterministic vs LLM
    approach1 = deterministic_mcqs_from_chunks(v_chunks, n_q=5)
    approach2 = None
    if use_llm:
        try:
            approach2 = llm_generate_mcqs_gemini(v_chunks, n_q=5, api_key=api_key)
        except Exception as e:
            print('[WARN] LLM failed, falling back:', e)
            approach2 = deterministic_mcqs_from_chunks(v_chunks, n_q=5)
    else:
        approach2 = deterministic_mcqs_from_chunks(v_chunks, n_q=5)

    payload = finalizer_agent({'approach1': approach1, 'approach2': approach2}, v_metas)
    return payload
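
`run_pipeline` turns the raw reranker scores into a confidence value by taking a softmax over the top-k scores and reporting the largest probability. The sketch below reproduces that calculation with the standard library, using the five reranker scores printed in this notebook's sample run; `softmax_confidence` is a name chosen for illustration.

```python
import math
from typing import List

def softmax_confidence(scores: List[float]) -> float:
    """Largest softmax probability over the given scores."""
    m = max(scores)                      # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    return max(exps) / sum(exps)

scores = [1.859816551208496, -2.0850226879119873, -2.245616912841797,
          -2.7816452980041504, -3.3252665996551514]
conf = softmax_confidence(scores)
print(f"{conf:.3f}")  # 0.951
```

This value is relative to the other retrieved chunks: it says how far the best chunk stands out from the rest of the top-k, not how relevant it is in absolute terms. A single retrieved chunk always gets a confidence of 1.0.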


# Cell 14 — FastAPI Endpoint and Streamlit Demo Skeleton

## Overview

This cell sets up a **minimal web interface** and **API endpoint** for the MESO system.  
It enables external users (or a front-end) to **query the pipeline** and receive MCQs in real time.  
Additionally, it provides a **Streamlit demo skeleton** for interactive testing.

The main components are:

1. **FastAPI endpoint** (`/generate`): Accepts a query, retrieves relevant chunks, validates, generates MCQs, and returns the payload.
2. **Streamlit demo** (`streamlit_app.py`): Simple front-end that allows a user to input queries and view the results.

---

In [None]:
# CELL 14 — FastAPI endpoint and Streamlit demo skeleton
# =====================
# FastAPI:
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class QueryIn(BaseModel):
    query: str
    top_k: int = 5
    use_llm: bool = False

@app.post('/generate')
def generate(q: QueryIn):
    idx, emb, meta = load_index_and_meta()
    with open(CORPUS_PKL,'rb') as f:
        d = pickle.load(f)
        corpus = d['corpus']
        meta = d['meta']
    res = run_pipeline(q.query, index=idx, corpus=corpus, meta=meta, top_k=q.top_k, use_llm=q.use_llm, api_key=os.getenv('GEMINI_API_KEY'))
    return res

# Streamlit app (save as streamlit_app.py)
streamlit_code = r"""
import streamlit as st
import requests
st.title('EduRAG - Demo')
u = st.text_input('Query', 'Photosynthesis')
use_llm = st.checkbox('Use LLM for MCQs (requires API key on server)', False)
if st.button('Generate'):
    res = requests.post('http://localhost:8000/generate', json={'query':u, 'top_k':5, 'use_llm': use_llm})
    st.json(res.json())
"""
with open('streamlit_app.py','w') as f:
    f.write(streamlit_code)
print('Wrote streamlit_app.py — run with: uvicorn this_script:app --reload  and streamlit run streamlit_app.py')


Wrote streamlit_app.py — run with: uvicorn this_script:app --reload  and streamlit run streamlit_app.py


# Utilities: Compare Approach 1 vs 2 and Save Report

## Overview

This cell provides a **utility function** to generate a human-readable **comparison report** between two MCQ generation approaches:

1. **Approach 1** — Deterministic MCQs (TF-IDF / heuristic-based).
2. **Approach 2** — LLM-based MCQs (e.g., OpenAI or Gemini).

It saves the comparison as a **text file**, which can be shared, analyzed, or archived for educational evaluation.

---


In [None]:
# CELL 15 — Utilities: compare Approach 1 vs 2 and save report
# =====================
from pathlib import Path

def compare_and_save_report(query: str, res: dict, filename: str = "mcq_report.txt"):
    """
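    Generate a text report comparing Approach1 (deterministic) and Approach2 (LLM),
    and save to a file. `res` must be a dict with 'approach1' and 'approach2' keys,
    e.g. payload['quiz'] from run_pipeline (not the full payload).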
    """
    report_lines = []
    report_lines.append(f"MCQ Generation Report for Query: '{query}'\n")
    report_lines.append("="*60 + "\n")

    # Approach 1 Section
    report_lines.append("APPROACH 1: Deterministic MCQs\n")
    report_lines.append("-"*60 + "\n")
    approach1 = res.get('approach1', [])
    if approach1:
        for i, q in enumerate(approach1, 1):
            report_lines.append(f"{i}. {q}\n")
    else:
        report_lines.append("No MCQs generated.\n")

    report_lines.append("\n")

    # Approach 2 Section
    report_lines.append("APPROACH 2: LLM-based MCQs\n")
    report_lines.append("-"*60 + "\n")
    approach2 = res.get('approach2', [])
    if approach2:
        for i, q in enumerate(approach2, 1):
            report_lines.append(f"{i}. {q}\n")
    else:
        report_lines.append("No MCQs generated.\n")

    # Save to file
    report_path = Path(filename)
    with open(report_path, 'w', encoding='utf-8') as f:
        f.writelines(report_lines)

    print(f"Report saved to {report_path.resolve()}")
    return report_path

# Example usage (run_pipeline nests both approaches under payload['quiz'])
# report_file = compare_and_save_report('Explain Economics', payload['quiz'])
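
`compare_and_save_report` reads `approach1` and `approach2` from the top level of its `res` argument, whereas `run_pipeline` nests both lists under `quiz`. The sketch below shows the unwrapping step together with a small numeric summary. `summarize_approaches` is a hypothetical helper, and the sample payload is trimmed down to the answers that appear in this notebook's sample outputs.

```python
from typing import Any, Dict

def summarize_approaches(payload: Dict[str, Any]) -> Dict[str, Any]:
    """Count MCQs per approach and list the answers both approaches share."""
    quiz = payload.get("quiz", {})
    a1 = quiz.get("approach1") or []
    a2 = quiz.get("approach2") or []
    answers1 = {q.get("answer") for q in a1 if q.get("answer") is not None}
    answers2 = {q.get("answer") for q in a2 if q.get("answer") is not None}
    return {
        "n_approach1": len(a1),
        "n_approach2": len(a2),
        "shared_answers": sorted(answers1 & answers2),
    }

# Trimmed payload in the shape returned by run_pipeline
payload = {
    "quiz": {
        "approach1": [{"question": "... _____ special studies ...", "answer": "certain"}],
        "approach2": [
            {"question": "... describe _____ of a single person ...", "answer": "attributes"},
            {"question": "... _____ special studies ...", "answer": "certain"},
        ],
    },
    "sources": [],
}
summary = summarize_approaches(payload)
print(summary)
```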


# Colab-only PDF Upload Helper

## Overview

This cell provides a **user-friendly helper** for Google Colab that allows users to **upload PDFs interactively**. It automatically saves the files to a designated folder (`outputs/uploads`) and provides a visual summary of the uploaded files. This helper is designed to streamline the **initial data ingestion step** in the workflow.
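
Outside Colab the `google.colab` import is unavailable, so the helper cannot run. A minimal sketch of a local alternative, assuming the PDFs already sit in some folder on disk, is to copy them into the same upload directory. `collect_local_pdfs` is a hypothetical function, and the demo uses a temporary directory rather than the real `outputs/uploads` path.

```python
import shutil
import tempfile
from pathlib import Path
from typing import List

def collect_local_pdfs(source_dir: Path, upload_dir: Path) -> List[str]:
    """Copy every *.pdf from source_dir into upload_dir and return the saved paths."""
    upload_dir.mkdir(parents=True, exist_ok=True)
    saved = []
    for pdf in sorted(source_dir.glob("*.pdf")):
        dest = upload_dir / pdf.name
        shutil.copy2(pdf, dest)
        saved.append(str(dest))
    return saved

# Demo in a throwaway directory so nothing on disk is touched permanently
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "incoming"
    src.mkdir()
    (src / "sample.pdf").write_bytes(b"%PDF-1.4\n")
    (src / "notes.txt").write_text("not a pdf")
    saved = collect_local_pdfs(src, Path(tmp) / "outputs" / "uploads")
    saved_names = [Path(p).name for p in saved]

print(saved_names)  # ['sample.pdf']
```

The return value has the same shape as the Colab helper's (a list of saved path strings), so later steps would not need to change.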

---


In [None]:
# =====================
# CELL 3 — Colab-only PDF Upload Helper
# =====================
from google.colab import files
from IPython.display import display, HTML
from pathlib import Path
from tqdm.notebook import tqdm

UPLOAD_DIR = Path('outputs/uploads')
UPLOAD_DIR.mkdir(parents=True, exist_ok=True)

def colab_upload_handler_interactive():
    print("📂 Click 'Choose Files' to select PDFs from your computer...")
    uploaded = files.upload()
    if not uploaded:
        print("⚠️ No files uploaded.")
        return []

    saved_paths = []
    print("\n💾 Saving files to", UPLOAD_DIR, "...")
    for name, blob in tqdm(uploaded.items(), desc="Uploading PDFs"):
        dest = UPLOAD_DIR / name
        with open(dest, 'wb') as f:
            f.write(blob)
        saved_paths.append(str(dest))

    # Interactive HTML summary
    html_list = "<ul>"
    for p in saved_paths:
        html_list += f"<li>{p}</li>"
    html_list += "</ul>"
    display(HTML(f"<b>✅ Successfully uploaded {len(saved_paths)} file(s):</b> {html_list}"))

    print("\nNext steps:")
    print("1️⃣ Build corpus: corpus, meta = build_corpus_from_uploads('outputs/uploads')")
    print("2️⃣ Build FAISS index: index, emb = build_faiss_index(corpus, meta)")
    print("3️⃣ Run queries: run_pipeline('Your query here', index=index, corpus=corpus, meta=meta, top_k=5)")

    return saved_paths

# Usage:
# uploaded_files = colab_upload_handler_interactive()


# Execution Flow

In [None]:
# Step 1: Upload PDFs to process
# Example (when everything is ready):
uploaded = colab_upload_handler_interactive() # Returns list of saved PDF paths


📂 Click 'Choose Files' to select PDFs from your computer...


Saving kest101.pdf to kest101 (1).pdf

💾 Saving files to outputs/uploads ...




Next steps:
1️⃣ Build corpus: corpus, meta = build_corpus_from_uploads('outputs/uploads')
2️⃣ Build FAISS index: index, emb = build_faiss_index(corpus, meta)
3️⃣ Run queries: run_pipeline('Your query here', index=index, corpus=corpus, meta=meta, top_k=5)


In [None]:
!pip install --upgrade nltk




In [None]:
import nltk

# Punkt sentence-tokenizer models (legacy pickle format)
nltk.download('punkt')

# Punkt models in the tab-separated format that newer NLTK releases load for sentence tokenization
nltk.download('punkt_tab')

# Verify availability
try:
    nltk.data.find('tokenizers/punkt_tab/english')
    print("✅ punkt_tab is now available")
except LookupError:
    print("❌ punkt_tab still not found. Try upgrading NLTK.")


✅ punkt_tab is now available


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
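
The Punkt models downloaded above are used for sentence splitting during PDF chunking. As a rough illustration of sentence-based chunking, the sketch below groups sentences into chunks under a word budget. It is not the notebook's chunker: the regex splitter is a naive stand-in for `nltk.sent_tokenize`, and `chunk_sentences` and `max_words` are assumptions made for this example.

```python
import re
from typing import List

def split_sentences(text: str) -> List[str]:
    """Naive splitter: break after ., ! or ? followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def chunk_sentences(sentences: List[str], max_words: int = 40) -> List[str]:
    """Greedily pack whole sentences into chunks of at most max_words words."""
    chunks, current, count = [], [], 0
    for sentence in sentences:
        n = len(sentence.split())
        if current and count + n > max_words:
            chunks.append(" ".join(current))
            current, count = [], 0
        current.append(sentence)
        count += n
    if current:
        chunks.append(" ".join(current))
    return chunks

text = ("Economics studies choices. Data help describe an economy. "
        "Statistics summarises data. Policies rely on evidence.")
chunks = chunk_sentences(split_sentences(text), max_words=8)
for c in chunks:
    print(c)
```

A sentence longer than `max_words` still becomes its own chunk, since sentences are never split.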


In [None]:
# Step 2: Retrieve and rerank chunks for a sample query
# Assumes corpus, meta, and index already exist (corpus built from the uploaded PDFs, index from the FAISS build step)
chunks, metas, scores = search_and_rerank(index, 'Understanding Economics', corpus, meta)
print("Chunks retrieved:", len(chunks))
for i, c in enumerate(chunks[:3]):
    print(f"Chunk {i+1}:", c[:200], "...\n")


Chunks retrieved: 5
Chunk 1: told this subject is mainly around what Alfred Marshall (one of the founders of modern economics) called “the study of man in the ordinary business of life”. Let us understand what that means. When yo ...

Chunk 2: 5 be possible without data on various factors underlying an economic problem? And, that, in such a situation, no policies can be formulated to solve it. If yes, then you have, to a large extent, under ...

Chunk 3: 4 may want to know how many are illiterate, who will not get jobs, requiring education, how many are highly educated and will have the best job opportunities and so on. In other words, you may want to ...



In [None]:
# Step 3: Convert corpus to embeddings and build FAISS index for semantic search
index, emb = build_faiss_index(corpus, meta, encoder_model=None, batch_size=EMBED_BATCH, persist_dir=FAISS_DIR)

# Verify embedding dimension
print("Embedding matrix shape:", emb.shape)



Saved FAISS index to /content/outputs/faiss
Embedding matrix shape: (8, 384)


In [None]:
# Step 3b: Apply the validator guardrails to the retrieved chunks
v_chunks, v_metas = validator_agent(chunks, metas)
print("Chunks after guardrails:", len(v_chunks))
for i, c in enumerate(v_chunks[:3]):
    print(f"Validated Chunk {i+1}:", c[:200], "...\n")

Chunks after guardrails: 4
Validated Chunk 1: told this subject is mainly around what Alfred Marshall (one of the founders of modern economics) called “the study of man in the ordinary business of life”. Let us understand what that means. When yo ...

Validated Chunk 2: 5 be possible without data on various factors underlying an economic problem? And, that, in such a situation, no policies can be formulated to solve it. If yes, then you have, to a large extent, under ...

Validated Chunk 3: 4 may want to know how many are illiterate, who will not get jobs, requiring education, how many are highly educated and will have the best job opportunities and so on. In other words, you may want to ...



In [None]:
# Step 4: Run a query to retrieve relevant chunks
query = "Understanding Economics"

# Run pipeline: retrieves chunks, validates, generates MCQs
payload = run_pipeline(query=query, index=index, corpus=corpus, meta=meta, top_k=5, use_llm=False)

# Check what we got
print("Keys in payload:", payload.keys())
print("Sample MCQ from deterministic approach:")
print(payload['quiz']['approach1'][0])


Top-k raw reranker scores: [1.859816551208496, -2.0850226879119873, -2.245616912841797, -2.7816452980041504, -3.3252665996551514]
Retrieved 5 chunks. Confidence: 0.951
Keys in payload: dict_keys(['quiz', 'sources', 'generated_at'])
Sample MCQ from deterministic approach:
{'question': 'STATISTICS IN ECONOMICS In the previous section you were told about _____ special studies that concern the basic problems facing a country.', 'options': ['lines', 'finance', 'certain', 'formulation'], 'answer': 'certain', 'rationale': 'Taken from source sentence.', 'difficulty': 'medium'}
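
Before a quiz is handed to users, each MCQ dict can be sanity-checked. The sketch below is a hypothetical checker, not part of the notebook, applied to the sample MCQ printed above. It also flags a case the fallback distractor logic can produce: for an all-uppercase term, `term.upper()` equals the term itself, so the option list contains a duplicate.

```python
from typing import Any, Dict, List

def mcq_problems(mcq: Dict[str, Any], n_options: int = 4) -> List[str]:
    """Return a list of human-readable problems; an empty list means the MCQ looks well formed."""
    problems = []
    options = mcq.get("options", [])
    if len(options) != n_options:
        problems.append(f"expected {n_options} options, got {len(options)}")
    if len(set(options)) != len(options):
        problems.append("duplicate options")
    if mcq.get("answer") not in options:
        problems.append("answer not among options")
    if "_____" not in mcq.get("question", ""):
        problems.append("question has no blank")
    return problems

# The sample MCQ from the run above
good = {
    "question": "STATISTICS IN ECONOMICS In the previous section you were told about _____ "
                "special studies that concern the basic problems facing a country.",
    "options": ["lines", "finance", "certain", "formulation"],
    "answer": "certain",
}
# Hypothetical fallback output for the uppercase term 'GDP'
bad = {"question": "_____ measures output.", "options": ["GDPs", "GDP", "GDP_alt", "GDP"], "answer": "GDP"}

print(mcq_problems(good))  # []
print(mcq_problems(bad))   # ['duplicate options']
```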


In [None]:
# Step 5: If you have an LLM API key, enable LLM MCQ generation
# Make sure GEMINI_API_KEY is set in the environment (the LLM generator in this notebook calls Gemini)
payload_llm = run_pipeline(query=query, index=index, corpus=corpus, meta=meta, top_k=5, use_llm=True, api_key=os.getenv('GEMINI_API_KEY'))

print("Sample MCQ from LLM approach:")
print(payload_llm['quiz']['approach2'][0])


Top-k raw reranker scores: [1.859816551208496, -2.0850226879119873, -2.245616912841797, -2.7816452980041504, -3.3252665996551514]
Retrieved 5 chunks. Confidence: 0.951
[WARN] LLM failed, falling back: name 'llm_generate_mcqs' is not defined
Sample MCQ from LLM approach:
{'question': 'The chief characteristic of such information is that they describe _____ of a single person or a group of persons that is important to record as accurately as possible even though they cannot be measured in quantitative terms.', 'options': ['enormously', 'concerned', 'resources', 'attributes'], 'answer': 'attributes', 'rationale': 'Taken from source sentence.', 'difficulty': 'medium'}


In [None]:
# Step 6: You can test the validator on retrieved chunks
chunks, metas, scores = search_and_rerank(index, query, corpus, meta, top_k=5)
v_chunks, v_metas = validator_agent(chunks, metas)
print(f"Chunks before validator: {len(chunks)}, after validator: {len(v_chunks)}")


Chunks before validator: 5, after validator: 4


In [None]:
# Step 7: Generate MCQs from validated chunks (deterministic approach)
mcqs = deterministic_mcqs_from_chunks(v_chunks, n_q=5)
for i, q in enumerate(mcqs, 1):
    print(f"Q{i}: {q['question']}")
    print(f"Options: {q['options']}, Answer: {q['answer']}\n")


Q1: Statistics also helps in condensing mass data into a few numerical measures (such as mean, variance etc., about which you will _____ later).
Options: ['learn', 'overall', 'courses', 'marshall'], Answer: learn

Q2: At this stage you are probably _____ to know more about Statistics.
Options: ['variance', 'knowing', 'ready', 'distinguishes'], Answer: ready

Q3: Would you now agree with the following _____ of economics that many economists use?
Options: ['goods', 'information', 'increased', 'definition'], Answer: definition

Q4: An economist may be interested in finding out what _____ to the demand for a commodity when its price increases or decreases?
Options: ['price', 'physics', 'happens', 'statement'], Answer: happens

Q5: The data, then, are summarised by calculating various numerical indices, such as mean, _____, standard deviation, etc., that represent the broad characteristics of the collected set of information.
Options: ['variance', 'includes', 'distribute', 'business'], Answ
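
The "deterministic" generator still draws distractors with `random.sample` and orders them with `random.shuffle`, so two runs over the same chunks can produce different options. A minimal sketch of making that step repeatable is to pass in a seeded `random.Random` instance. `pick_options` is a hypothetical stand-in for the distractor logic, and the word list is taken from the options printed above.

```python
import random
from typing import List

def pick_options(term: str, distractors: List[str], rng: random.Random) -> List[str]:
    """Choose three distractors plus the answer, then shuffle, using the given RNG."""
    choices = rng.sample(distractors, 3) + [term]
    rng.shuffle(choices)
    return choices

distractors = ["overall", "courses", "marshall", "variance", "knowing"]

# Two generators seeded identically yield identical option lists
first = pick_options("learn", distractors, random.Random(42))
second = pick_options("learn", distractors, random.Random(42))
print(first == second)  # True
```

Using a local `random.Random` rather than `random.seed` keeps the reproducibility scoped to MCQ generation without affecting other code that relies on the global generator.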

In [None]:
# Step 8: If you want to index images alongside PDFs (optional)
image_indexed = multimodal_index_images(image_folder='outputs/uploads', persist_dir=CLIP_DIR)
print("Multimodal image indexing done:", image_indexed)


No valid images found in outputs/uploads
Multimodal image indexing done: None


In [None]:
# Step 9: Compare deterministic vs LLM MCQs for the same query
# run_pipeline nests the MCQ lists under payload['quiz'], so pull out the LLM list explicitly
llm_mcqs = payload_llm.get('quiz', {}).get('approach2') or []

# Prepare payload in the format expected by the report function
report_payload = {
    "approach1": mcqs,       # deterministic MCQs
    "approach2": llm_mcqs    # LLM MCQs (may be empty)
}

# Check if LLM MCQs are empty
llm_empty = not llm_mcqs
if llm_empty:
    print("⚠️ Warning: LLM MCQs are empty. Using deterministic MCQs only.")

# Function to create nicely formatted MCQ worksheet
def create_mcq_worksheet(query, deterministic_mcqs, llm_mcqs, filename="mcq_worksheet.txt"):
    with open(filename, "w", encoding="utf-8") as f:
        f.write(f"MCQ Generation Worksheet for Query: '{query}'\n")
        f.write("="*80 + "\n\n")

        # Deterministic MCQs
        f.write("APPROACH 1: Deterministic MCQs\n")
        f.write("-"*80 + "\n")
        for i, q in enumerate(deterministic_mcqs, 1):
            f.write(f"{i}. {q.get('question','N/A')}\n")
            for j, opt in enumerate(q.get('options', []), 1):
                f.write(f"   {chr(64+j)}. {opt}\n")
            f.write(f"Answer: {q.get('answer','N/A')}\n")
            f.write(f"Rationale: {q.get('rationale', 'N/A')}\n\n")

        # LLM MCQs
        f.write("APPROACH 2: LLM-based MCQs\n")
        f.write("-"*80 + "\n")

        # Check if LLM MCQs are valid list of dicts
        if not llm_mcqs or not isinstance(llm_mcqs, list) or not all(isinstance(q, dict) for q in llm_mcqs):
            f.write("⚠️ No valid LLM-based MCQs were generated for this query.\n")
        else:
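            for i, q in enumerate(llm_mcqs, 1):
                # Accept both the deterministic keys and the keys requested in LLM_PROMPT_TEMPLATE
                opts = q.get('options', [])
                if isinstance(opts, dict):
                    opts = [opts[k] for k in sorted(opts)]
                f.write(f"{i}. {q.get('question') or q.get('question_text', 'N/A')}\n")
                for j, opt in enumerate(opts, 1):
                    f.write(f"   {chr(64+j)}. {opt}\n")
                f.write(f"Answer: {q.get('answer') or q.get('correct_option', 'N/A')}\n")
                f.write(f"Rationale: {q.get('rationale', 'N/A')}\n\n")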

    return filename

# Save worksheet
worksheet_file = create_mcq_worksheet(query=query,
                                      deterministic_mcqs=mcqs,
                                      llm_mcqs=payload_llm.get('quiz', {}).get('approach2') or [],
                                      filename="mcq_worksheet3.txt")

print("Worksheet saved at:", worksheet_file)


Worksheet saved at: mcq_worksheet3.txt


In [None]:
# Step 10: Load persisted FAISS index and embeddings to confirm
index_loaded, emb_loaded, meta_loaded = load_index_and_meta(persist_dir=FAISS_DIR)
print("Number of chunks in loaded index:", len(meta_loaded))


Loaded FAISS index and meta. n_chunks= 8
Number of chunks in loaded index: 8


In [None]:
# Step 11: If you want to run the interactive demo
# FastAPI endpoint is already created in your code
# Run this in terminal:
# uvicorn this_script:app --reload
# Then run streamlit:
# streamlit run streamlit_app.py
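
Once the server is running, the endpoint can also be called without Streamlit. The sketch below builds the POST request with the standard library; the JSON fields mirror the `QueryIn` model, while `build_generate_request` and the default `base_url` (uvicorn's default port on localhost) are assumptions. The actual send is left commented out because it needs a live server.

```python
import json
from urllib import request

def build_generate_request(query: str, top_k: int = 5, use_llm: bool = False,
                           base_url: str = "http://localhost:8000") -> request.Request:
    """Build (but do not send) a POST request for the /generate endpoint."""
    body = json.dumps({"query": query, "top_k": top_k, "use_llm": use_llm}).encode("utf-8")
    return request.Request(
        f"{base_url}/generate",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = build_generate_request("Photosynthesis")
print(req.get_method(), req.full_url)

# To send it once uvicorn is running:
# with request.urlopen(req, timeout=30) as resp:
#     payload = json.loads(resp.read().decode("utf-8"))
```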
