<a href="https://colab.research.google.com/github/pushpendra-saini-pks/MCQ_Generation-using-RAG/blob/main/rag_mcq_notebook.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


# RAG-based MCQ Generation from a Python Language PDF (Notebook)

**What this notebook does (overview):**
1. Downloads a free Python book PDF (links provided below).  
2. Extracts text from the PDF and splits it into searchable chunks.  
3. Builds embeddings (instructions + placeholder code using `sentence-transformers`).  
4. Creates a RAG retrieval pipeline (using FAISS locally) and a prompt to a text generator to produce MCQs.  
5. Includes a small **toy demo** which runs without external APIs to show sample MCQ outputs.

**Free PDF recommendations used in this notebook (you can change to any PDF):**
- *Think Python* (Allen B. Downey), the official free edition under a Creative Commons license: https://greenteapress.com/thinkpython2/thinkpython2.pdf
- *A Byte of Python* (Swaroop C H), free PDF: https://www.ibiblio.org/swaroopch/byteofpython.pdf

> Notes: This notebook contains both full pipeline code and a minimal toy demo that **runs without internet or API keys** (so you can see example outputs immediately). For full-scale execution (embeddings + LLM), install required libraries and provide API keys as indicated in the cells.



## Installation
The full pipeline uses these Python packages. Install them before running the corresponding cells:
```bash
pip install pdfplumber sentence-transformers faiss-cpu google-generativeai
```




In [14]:
# Install required libraries
!pip install -q pdfplumber sentence-transformers faiss-cpu google-generativeai


In [15]:

# 1) Download a PDF (optional) - uncomment and change the URL to fetch a book directly from the web.
# Example (uncomment to run):
# import requests
# url = "https://greenteapress.com/thinkpython2/thinkpython2.pdf"
# r = requests.get(url, timeout=60)
# r.raise_for_status()  # stop here if the download failed
# with open("thinkpython2.pdf", "wb") as f:
#     f.write(r.content)
#
# Alternatively, place a PDF named 'thinkpython2.pdf' next to this notebook (the name the loading cell expects).
print('Skip downloading in this demo. Place your PDF as "thinkpython2.pdf" or uncomment the download code.')


Skip downloading in this demo. Place your PDF as "thinkpython2.pdf" or uncomment the download code.


## LLM Configuration

In [16]:
# Configure Google Generative AI (Gemini)
import google.generativeai as genai
from google.colab import userdata

GOOGLE_API_KEY = userdata.get('GOOGLE_API_KEY')   # Load your stored key

genai.configure(api_key=GOOGLE_API_KEY)

GEN_MODEL = genai.GenerativeModel("gemini-flash-latest")
print("Configured Gemini model:", GEN_MODEL)



Configured Gemini model: genai.GenerativeModel(
    model_name='models/gemini-flash-latest',
    generation_config={},
    safety_settings={},
    tools=None,
    system_instruction=None,
    cached_content=None
)
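
The configuration cell above reads the key through `google.colab.userdata`, which is only importable inside Colab. As a rough sketch for running the notebook elsewhere, the hypothetical helper `load_api_key` below tries the Colab secret first and then falls back to an environment variable of the same name. The fallback order and the error message are assumptions, not part of the original notebook.

```python
import os


def load_api_key(name="GOOGLE_API_KEY"):
    """Return the key from Colab secrets if available, else from the environment."""
    try:
        # Only importable inside Google Colab.
        from google.colab import userdata
    except ImportError:
        userdata = None

    if userdata is not None:
        try:
            key = userdata.get(name)
            if key:
                return key
        except Exception:
            # Secret missing or access not granted: fall through to the environment.
            pass

    key = os.environ.get(name)
    if not key:
        raise RuntimeError(f"No API key found: set the {name} secret or environment variable.")
    return key
```

With this helper, the configuration line would become `genai.configure(api_key=load_api_key())`.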


## Document loading

In [17]:
# 1) Extract text from thinkpython2.pdf (uploaded)
import os

import pdfplumber

pdf_path = 'thinkpython2.pdf'
if not os.path.exists(pdf_path):
    raise FileNotFoundError(f"PDF not found at {pdf_path}. Please upload it to the notebook directory.")


def extract_text_from_pdf(path, max_pages=None):
    """Return the text of the first max_pages pages (all pages if None), joined by blank lines."""
    text_chunks = []
    with pdfplumber.open(path) as pdf:
        pages = pdf.pages if max_pages is None else pdf.pages[:max_pages]
        for p in pages:
            # Fall back to an empty string for pages with no extractable text
            page_text = p.extract_text() or ''
            text_chunks.append(page_text)
    return '\n\n'.join(text_chunks)

raw_text = extract_text_from_pdf(pdf_path, max_pages=None)
print('Extracted characters:', len(raw_text))
# show a short preview
print(raw_text[:1000].replace('\n',' '))


Extracted characters: 402460
Think Python How to Think Like a Computer Scientist 2ndEdition,Version2.4.0    Think Python How to Think Like a Computer Scientist 2ndEdition,Version2.4.0 Allen Downey Green Tea Press Needham,Massachusetts  Copyright©2015AllenDowney. GreenTeaPress 9WashburnAve NeedhamMA02492 Permission is granted to copy, distribute, and/or modify this document under the terms of the CreativeCommonsAttribution-NonCommercial3.0UnportedLicense,whichisavailableathttp: //creativecommons.org/licenses/by-nc/3.0/. TheoriginalformofthisbookisLATEXsourcecode.CompilingthisLATEXsourcehastheeffectofgen- eratingadevice-independentrepresentationofatextbook,whichcanbeconvertedtootherformats andprinted. TheLATEXsourceforthisbookisavailablefromhttp://www.thinkpython.com  Preface The strange history of this book InJanuary1999IwaspreparingtoteachanintroductoryprogrammingclassinJava. Ihad taughtitthreetimesandIwasgettingfrustrated. Thefailurerateintheclasswastoohigh and,evenforstudentswhosucce

## Chunking

In [18]:

# 2) Chunk the text into overlapping windows (simple word-based chunking)
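def chunk_text(text, chunk_size=300, overlap=50):
    """Split text into windows of chunk_size words, each sharing `overlap` words with the previous one."""
    if not 0 <= overlap < chunk_size:
        # Otherwise the step below would be zero or negative and the loop would never end.
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    tokens = text.split()  # whitespace-separated words, not model tokens
    chunks = []
    i = 0
    while i < len(tokens):
        chunk = ' '.join(tokens[i:i+chunk_size])
        chunks.append(chunk)
        i += chunk_size - overlap  # advance by the non-overlapping part of the window
    return chunks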

chunks = chunk_text(raw_text, chunk_size=300, overlap=50)
print('Number of chunks:', len(chunks))
print('Sample chunk (first 500 chars):\n', chunks[0][:500])


Number of chunks: 157
Sample chunk (first 500 chars):
 Think Python How to Think Like a Computer Scientist 2ndEdition,Version2.4.0 Think Python How to Think Like a Computer Scientist 2ndEdition,Version2.4.0 Allen Downey Green Tea Press Needham,Massachusetts Copyright©2015AllenDowney. GreenTeaPress 9WashburnAve NeedhamMA02492 Permission is granted to copy, distribute, and/or modify this document under the terms of the CreativeCommonsAttribution-NonCommercial3.0UnportedLicense,whichisavailableathttp: //creativecommons.org/licenses/by-nc/3.0/. Theorigi
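
To make the overlap visible, here is a small self-contained sketch of the same sliding-window idea on a ten-word toy string. `chunk_words` is a hypothetical stand-in written for this illustration (it mirrors the windowing logic of `chunk_text`), and the tiny `chunk_size` and `overlap` values are chosen only so the result fits on screen.

```python
def chunk_words(text, chunk_size, overlap):
    """Split text into word windows of chunk_size that share `overlap` words."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap  # how far each window start moves forward
    return [" ".join(words[i:i + chunk_size]) for i in range(0, len(words), step)]


toy_text = " ".join(f"w{n}" for n in range(10))  # "w0 w1 ... w9"
toy_chunks = chunk_words(toy_text, chunk_size=4, overlap=1)
for c in toy_chunks:
    print(c)
# w0 w1 w2 w3
# w3 w4 w5 w6
# w6 w7 w8 w9
# w9
```

Each window repeats the last word of the previous one. The final window can be a short tail that is already contained in the window before it, which is harmless for retrieval but slightly inflates the chunk count.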



## Embedding & Indexing (instructions)

Below is example code to create embeddings using `sentence-transformers` and index them with FAISS.


In [19]:
# 3) Create embeddings locally using sentence-transformers and build FAISS index
from sentence_transformers import SentenceTransformer
import numpy as np
import faiss

embed_model = SentenceTransformer('all-MiniLM-L6-v2')  # small & fast
# encode all chunks
embeddings = embed_model.encode(chunks, show_progress_bar=True, convert_to_numpy=True)
print('Embeddings shape:', embeddings.shape)

# Build FAISS index (L2)
dim = embeddings.shape[1]
index = faiss.IndexFlatL2(dim)
index.add(embeddings.astype('float32'))
print('FAISS index built with', index.ntotal, 'vectors')



Embeddings shape: (157, 384)
FAISS index built with 157 vectors
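
For intuition, `IndexFlatL2` performs an exhaustive search and reports squared Euclidean distances. The pure-Python sketch below shows the same idea on four toy 2-D vectors. `flat_l2_search` and the toy data are illustrative only and are not how FAISS is implemented internally.

```python
def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def flat_l2_search(vectors, query, top_k):
    """Return (distances, ids) of the top_k vectors closest to query."""
    scored = sorted((squared_l2(v, query), i) for i, v in enumerate(vectors))
    top = scored[:top_k]
    return [d for d, _ in top], [i for _, i in top]


toy_vectors = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [3.0, 3.0]]
distances, ids = flat_l2_search(toy_vectors, query=[1.0, 1.0], top_k=2)
print(distances, ids)  # [1.0, 2.0] [1, 0]
```

A brute-force scan like this is exact and works well at the scale of this notebook (157 vectors). Approximate index types become relevant only for much larger collections.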



## RAG retrieval + MCQ generation (instructions + example prompt)

After building an index, the typical flow is:
1. For a prompt or chapter, retrieve top-k chunks from the index.  
2. Construct a generation prompt that includes those chunks as context and asks the LLM to produce MCQs.  
3. Send the prompt to a text-generation model (Gemini, OpenAI, a local transformer, etc.).

**Example prompt template (to send to an LLM):**

```
Context: <retrieved text chunks>

Task: Generate 4 multiple-choice questions (each with 4 options) based only on the context.
Mark the correct answer with (Correct).
Keep each question short and focused on facts from the context.
```
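
One possible way to fill this template from retrieved chunks is sketched below. `build_mcq_prompt` and its `max_context_chars` budget are hypothetical names introduced for illustration. The budget is an assumption meant to keep the context from growing without bound, not a limit taken from any particular model.

```python
def build_mcq_prompt(retrieved_chunks, num_questions=4, max_context_chars=8000):
    """Fill the template with as many whole chunks as fit in max_context_chars."""
    selected, used = [], 0
    for chunk in retrieved_chunks:
        if selected and used + len(chunk) > max_context_chars:
            break  # always keep the first chunk, then stop at the budget
        selected.append(chunk)
        used += len(chunk)
    context = "\n\n".join(selected)
    return (
        f"Context: {context}\n\n"
        f"Task: Generate {num_questions} multiple-choice questions "
        "(each with 4 options) based only on the context.\n"
        "Mark the correct answer with (Correct).\n"
        "Keep each question short and focused on facts from the context."
    )


print(build_mcq_prompt(["Variables store values.", "Assignment uses =."], num_questions=2))
```

Dropping whole chunks rather than cutting one mid-sentence keeps every passage the model sees intact, at the cost of sometimes using less than the full budget.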




In [20]:
# 4) Retrieval helper: embed a query with the same model and return top-k chunks
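def retrieve(query, top_k=4):
    """Return the top_k chunks nearest to the query as dicts with 'chunk' and 'chunk_id'."""
    q_emb = embed_model.encode([query], convert_to_numpy=True)
    # D holds squared L2 distances, I the matching row indices (both shaped (1, top_k))
    D, I = index.search(q_emb.astype('float32'), top_k)
    results = []
    for idx in I[0]:
        if idx < 0:  # FAISS pads with -1 when the index holds fewer than top_k vectors
            continue
        results.append({'chunk': chunks[idx], 'chunk_id': int(idx)})
    return results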

# quick test
print('Top chunk(s) for "What is Python?":')
for r in retrieve('What is Python?', top_k=3):
    print('--- chunk id', r['chunk_id'], 'len', len(r['chunk']))
    print(r['chunk'][:400].replace('\n',' '))
    print()


Top chunk(s) for "What is Python?":
--- chunk id 0 len 4522
Think Python How to Think Like a Computer Scientist 2ndEdition,Version2.4.0 Think Python How to Think Like a Computer Scientist 2ndEdition,Version2.4.0 Allen Downey Green Tea Press Needham,Massachusetts Copyright©2015AllenDowney. GreenTeaPress 9WashburnAve NeedhamMA02492 Permission is granted to copy, distribute, and/or modify this document under the terms of the CreativeCommonsAttribution-NonComm

--- chunk id 1 len 7413
youwantto. • ForChapter4.1Iswitchedfrommyownturtlegraphicspackage,calledSwampy,toa morestandardPythonmodule,turtle,whichiseasiertoinstallandmorepowerful. • I added a new chapter called “The Goodies”, which introduces some additional Pythonfeaturesthatarenotstrictlynecessary,butsometimeshandy. Ihopeyouenjoyworkingwiththisbook,andthatithelpsyoulearntoprogramandthink likeacomputerscientist,atleastali

--- chunk id 122 len 3444
two birth dates and computes their DoubleDay. 4. For a little more challenge, write th

### Generate MCQ using Gemini

In [21]:
# 5) Generation: use Gemini to create MCQs from retrieved context
def generate_mcqs_with_gemini(context, num_questions=4):
    prompt = f"""You are a helpful MCQ generator.
Based ONLY on the context below, generate {num_questions} multiple-choice questions.
Each question should have 4 options (A-D). Mark the correct option by appending ' (Correct)' after it.
Be factual and keep each question concise.

Context:
{context}
"""
    # Use the generative model to create text output
    response = GEN_MODEL.generate_content(prompt)
    # The response object usually has .text; if not, print the response to inspect available fields.
    try:
        return response.text
    except Exception:
        # fallback: return full response repr for debugging
        return repr(response)

# Example end-to-end: retrieve, build context, generate
query = 'Variables in Python and assignment statements'
retrieved = retrieve(query, top_k=4)
context = '\n\n'.join([r['chunk'] for r in retrieved])
print('Context length (chars):', len(context))

mcq_text = generate_mcqs_with_gemini(context, num_questions=4)
print('\n=== Generated MCQs ===\n')
print(mcq_text)


Context length (chars): 13106

=== Generated MCQs ===

**Question 1:**
In Python, why is the assignment statement (`a = b`) fundamentally different from a mathematical proposition of equality?
A. Mathematical equality must be explicitly initialized.
B. A proposition of equality is always symmetric, but assignment is not. (Correct)
C. Assignment means the claim that `a` and `b` are temporarily equal.
D. Mathematical equality changes over time, while assignment is static.

**Question 2:**
What is the conventional indentation used for the body of a function definition in Python?
A. Two spaces
B. Three spaces
C. Four spaces (Correct)
D. A single tab

**Question 3:**
If an expression other than a variable name is placed on the left side of the assignment operator (e.g., `hours * 60 = minutes`), what type of error results?
A. `ValueError`
B. `NameError`
C. `SyntaxError` (Correct)
D. `TypeError`

**Question 4:**
If a programmer wishes to reassign a global variable inside a function, what spec
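
The printed model configuration earlier shows `generation_config={}`, so the call above runs with the service defaults. A minimal sketch of setting a temperature and an output cap follows. `make_generation_config` is a hypothetical helper, its default values are illustrative rather than tuned, and it assumes the `google-generativeai` client accepts a plain dict with the keys `temperature` and `max_output_tokens` for the `generation_config` argument.

```python
def make_generation_config(temperature=0.3, max_output_tokens=1024):
    """Build a settings dict for generation; the defaults are illustrative only."""
    if temperature < 0:
        raise ValueError("temperature must be non-negative")
    if max_output_tokens <= 0:
        raise ValueError("max_output_tokens must be positive")
    return {"temperature": temperature, "max_output_tokens": max_output_tokens}


config = make_generation_config()
print(config)
# In the notebook, the dict would be passed alongside the prompt, for example:
# response = GEN_MODEL.generate_content(prompt, generation_config=config)
```

A lower temperature tends to make the questions more repeatable between runs, which helps when comparing prompt variants.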


## Scaling up


- Use a small, fast embedding model (e.g., `all-MiniLM-L6-v2`) for retrieval; keep chunk size ≈ 200–500 tokens.  
- Use FAISS for local dense retrieval or Pinecone/Weaviate for managed vector DBs.  
- For generation, you can use Gemini (as in this notebook), OpenAI, Anthropic, or a local model such as Llama. Restrict generation length and use a temperature of 0.2–0.7.  
- Always include clear instructions in the prompt to keep MCQs factual and avoid hallucinations (include the source chunks and ask the model to use only them).  
- Validate generated MCQs by running an automated heuristic or a small human-review pass.
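
As one example of such an automated heuristic, the sketch below parses text in the format shown in the generated output (a `**Question N:**` header, a question line, then options `A.` to `D.` with one marked `(Correct)`) and flags structural problems. `parse_mcqs` and `check_mcq` are hypothetical helpers. They check form only and cannot tell whether the marked answer is actually right.

```python
import re

OPTION_RE = re.compile(r"^([A-D])[.)]\s+(.*)$")
HEADER_RE = re.compile(r"^\**\s*question\s+\d+", re.IGNORECASE)


def parse_mcqs(text):
    """Parse generated text into dicts with 'stem', 'options' and 'correct'."""
    questions, current = [], None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if HEADER_RE.match(line):
            current = {"stem": "", "options": {}, "correct": []}
            questions.append(current)
            continue
        if current is None:
            continue  # ignore any preamble before the first question
        m = OPTION_RE.match(line)
        if m:
            label, body = m.groups()
            if body.endswith("(Correct)"):
                body = body[: -len("(Correct)")].rstrip()
                current["correct"].append(label)
            current["options"][label] = body
        elif not current["options"]:
            current["stem"] = (current["stem"] + " " + line).strip()
    return questions


def check_mcq(q):
    """Return a list of structural problems; an empty list means the MCQ looks well-formed."""
    problems = []
    if not q["stem"]:
        problems.append("missing question text")
    if sorted(q["options"]) != ["A", "B", "C", "D"]:
        problems.append("expected options A-D")
    if len(q["correct"]) != 1:
        problems.append("expected exactly one option marked (Correct)")
    if len(set(q["options"].values())) != len(q["options"]):
        problems.append("duplicate option text")
    return problems


sample = """**Question 2:**
What is the conventional indentation used for the body of a function definition in Python?
A. Two spaces
B. Three spaces
C. Four spaces (Correct)
D. A single tab

**Question 4:**
If a programmer wishes to reassign a global variable inside a function, what spec
"""
for q in parse_mcqs(sample):
    print(q["stem"][:40], "->", check_mcq(q) or "ok")
```

On this sample, the complete question passes and the cut-off one is flagged for missing options and a missing correct answer. Questions that fail could be regenerated or routed to the human-review pass.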

---
