# PDF & Lecture Summarizer + Question Bank Generator

## Setup

In [None]:
!pip install transformers sentencepiece torch



## Summarization Prototype

In [None]:
from transformers import pipeline

In [None]:
# summarization model
summarizer = pipeline(
    "summarization",
    model="sshleifer/distilbart-cnn-12-6",
    device=-1   # CPU only
)


# Example text (to be replaced with your PDF text later)
text = """
Natural language processing (NLP) is a subfield of
linguistics, computer science, and artificial intelligence
concerned with the interactions between computers
and human language, how to program computers to
process and analyze large amounts of natural language
data. By “natural language” we mean a language that is used
for everyday communication by humans, such as
Arabic, English, Spanish….etc. NLP is not to be confused with the abbreviation that
stands for Neuro-Linguistic Programming
 which is a psychological approach that involves
analyzing strategies used by successful individuals and
applying them to reach a personal goal
"""

# Summarize
summary = summarizer(text, max_length=60, min_length=20, do_sample=False)
print("Summary:", summary[0]['summary_text'])


Device set to use cpu


Summary:  Natural language processing (NLP) is a subfield of computer science, computer science and artificial intelligence . By “natural language” we mean a language that is used for everyday communication by humans . NLP is not to be confused with the abbreviation Neuro-Linguistic


## Question Generation Prototype

In [None]:
from transformers import pipeline

In [None]:
# question-generation model
qg = pipeline("text2text-generation", model="valhalla/t5-base-qg-hl", device=-1)


# Example text
context = """
NLP enables computers to understand and generate human language.
One key technique is word embeddings, which represent words as vectors.
Applications include chatbots, translation, and sentiment analysis.
"""

# Generate questions
questions = qg("generate questions: " + context, max_length=64, num_return_sequences=3)
for i, q in enumerate(questions):
    print(f"Q{i+1}:", q["generated_text"])


Device set to use cpu
Both `max_new_tokens` (=256) and `max_length`(=64) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)


Q1: What does NLP allow computers to understand and generate human language?
Q2: What does NLP enable computers to understand and generate human language?
Q3: What does NLP allow computers to understand?
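
The three questions above are close paraphrases of one another. One way to thin them out is a similarity filter applied after generation. The sketch below is a minimal illustration using the standard library's `difflib`; the function name `dedupe_questions` and the `0.8` threshold are assumptions for this example, not part of the original notebook, and the threshold would likely need tuning on real output.

```python
from difflib import SequenceMatcher

def dedupe_questions(questions, threshold=0.8):
    """Keep a question only if it is not too similar to one already kept."""
    kept = []
    for q in questions:
        normalized = " ".join(q.lower().split())
        is_duplicate = any(
            SequenceMatcher(None, normalized, " ".join(k.lower().split())).ratio() >= threshold
            for k in kept
        )
        if not is_duplicate:
            kept.append(q)
    return kept

sample = [
    "What does NLP allow computers to understand and generate human language?",
    "What does NLP enable computers to understand and generate human language?",
    "What does NLP allow computers to understand?",
]
print(dedupe_questions(sample))
```

Character-level similarity is a crude measure: it catches reworded duplicates but cannot tell whether two differently phrased questions ask for the same fact.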


## Extracting Text from Lecture PDFs

PyPDF2 extracts the text layer from each page. The second cell adds an OCR fallback for scanned PDFs that have no text layer.

In [None]:
!pip install PyPDF2
import PyPDF2

Collecting PyPDF2
  Downloading pypdf2-3.0.1-py3-none-any.whl.metadata (6.8 kB)
Downloading pypdf2-3.0.1-py3-none-any.whl (232 kB)
Installing collected packages: PyPDF2
Successfully installed PyPDF2-3.0.1


In [None]:
!pip install pytesseract pillow pdf2image
import pytesseract
from PIL import Image
from pdf2image import convert_from_path
import PyPDF2

# Extract text with fallback OCR
def extract_text_with_ocr(pdf_path):
    text = ""
    with open(pdf_path, "rb") as f:
        reader = PyPDF2.PdfReader(f)
        for page in reader.pages:
            page_text = page.extract_text()
            if page_text and page_text.strip():
                text += page_text + "\n"

    # If text is empty -> run OCR
    if not text.strip():
        print("⚠️ No text found, using OCR...")
        pages = convert_from_path(pdf_path)   # convert PDF to images
        for page in pages:
            ocr_text = pytesseract.image_to_string(page)
            text += ocr_text + "\n"

    return text

# Usage
from google.colab import files
uploaded = files.upload()
pdf_path = list(uploaded.keys())[0]

pdf_text = extract_text_with_ocr(pdf_path)
print(pdf_text[:1000])  # preview
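
`extract_text_with_ocr` runs OCR only when the whole document yields no text, so a PDF that mixes text pages with scanned pages would lose the scanned ones. A per-page fallback is one possible refinement. The sketch below shows only the merging logic; `merge_pages_with_ocr` and the `ocr_page` callback are hypothetical names introduced here, and the stand-in callback does not call Tesseract.

```python
def merge_pages_with_ocr(page_texts, ocr_page):
    """Combine per-page text, calling ocr_page(i) only for pages with no text.

    page_texts: list of strings (or None) from the PDF text layer, one per page.
    ocr_page:   hypothetical callback that returns OCR text for page index i.
    """
    merged = []
    for i, page_text in enumerate(page_texts):
        if page_text and page_text.strip():
            merged.append(page_text)
        else:
            merged.append(ocr_page(i))
    return "\n".join(merged)

# Stand-in callback for illustration; a real one might render page i with
# pdf2image and pass the image to pytesseract.image_to_string.
fake_ocr = lambda i: f"[OCR text for page {i}]"
print(merge_pages_with_ocr(["Slide one", "", None, "Slide four"], fake_ocr))
```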




Saving Intro for NLP_(NTI Lec 1).pdf to Intro for NLP_(NTI Lec 1) (6).pdf
01

Course Outline
1.Introduction to NLP course and Basic Concepts
2.NLP Basic Concepts
3.NLP Basic Concepts
4.Simple Processing
02•Tokenization •Sentence
•Segmentation •POS Tagging
•Stemming •Lemmatization
•Named Entity 
Recognition•Stop Words
•Matchers •Text Visualization
•Syntax Structure
•Bag of Words •Text Vectors
•TF-IDF5.Simple Processing
6.Advanced Processing
7.Modeling & Text Generation
8.Modern NLP Architectures•Word Embedding •Word2Vec
•Text Similarity •Distance Similarity
• Text Classification •Text Clustering
•LDA • N-Grams
•Text Generation
•Attention Mechanism •Transformer
9.Large Language Models
• LLMs (BERT, GPT) •Fine -tuning LLMs
03•Introduction to Natural Language Processing
•  What is Natural Language Processing (NLP)?
• Natural Language Understanding(NLU) and Natural Language 
Generation(NLG) 
•Applications of Natural Language Processing(NLP) 
•Applications of Natural Language Understanding(N
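
The preview shows slide bullets fused onto shared lines (for example `02•Tokenization •Sentence`). A light cleanup pass before chunking may give the models tidier input. The helper below is a sketch under that assumption; `tidy_slide_text` is a name introduced here, and it only normalizes bullets and whitespace. It cannot repair phrases that the slide layout split across columns.

```python
import re

def tidy_slide_text(raw):
    """Put each bullet on its own line and collapse repeated whitespace."""
    text = raw.replace("•", "\n• ")
    lines = [re.sub(r"\s+", " ", line).strip() for line in text.splitlines()]
    return "\n".join(line for line in lines if line and line != "•")

raw = "02•Tokenization •Sentence\n•Segmentation •POS Tagging"
print(tidy_slide_text(raw))
```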

## Chunking Before Summarizing

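Long lectures are split into smaller chunks so that each piece fits within the model's input limit. Note that `max_tokens` here counts whitespace-separated words, not model tokens, so a chunk can still exceed the limit after tokenization.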

In [None]:
def chunk_text(text, max_tokens=500):
    words = text.split()
    for i in range(0, len(words), max_tokens):
        yield " ".join(words[i:i+max_tokens])
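
Fixed-size chunks can cut a sentence or an idea in half at the boundary. A common mitigation is to let consecutive chunks overlap by a few words. The variant below is a sketch of that idea; `chunk_text_overlap` and its default sizes are assumptions for illustration, and like `chunk_text` it counts words rather than model tokens.

```python
def chunk_text_overlap(text, max_words=300, overlap=50):
    """Yield word-based chunks where each chunk repeats the last `overlap`
    words of the previous one."""
    if not 0 <= overlap < max_words:
        raise ValueError("overlap must be >= 0 and smaller than max_words")
    words = text.split()
    step = max_words - overlap
    for start in range(0, len(words), step):
        yield " ".join(words[start:start + max_words])
        if start + max_words >= len(words):
            break  # the last chunk already reaches the end of the text

demo = " ".join(f"w{i}" for i in range(10))
print(list(chunk_text_overlap(demo, max_words=4, overlap=1)))
```

Overlap increases the number of chunks and therefore the number of model calls, so the trade-off is worth checking on a real lecture.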


Each chunk is summarized with the Hugging Face DistilBART model, and the chunk summaries are then joined into one final summary.

In [None]:
chunks = list(chunk_text(pdf_text, max_tokens=500))

summaries = []
for chunk in chunks:
    summary = summarizer(chunk, max_length=120, min_length=40, do_sample=False)
    summaries.append(summary[0]['summary_text'])

final_summary = " ".join(summaries)
print("Final Summary:\n", final_summary)


Your max_length is set to 120, but your input_length is only 100. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=50)


Final Summary:
  Course Outline: Introduction to NLP course and Basic Concepts . Introduction to Natural Language Processing . What is AI? What is Intelligence? How does natural language processing work? NLP is a difficult task because it involves a lot of unstructured data .  NLP is a field of Artificial Intelligence that enables computers to understand, interpret, and generate human language . NLP powers applications such as chatbots, translation services, sentiment analysis, and voice assistants like Siri and Alexa .  Low-resource language: Different businesses and industries often use very different language . NLP is word level analysis including: word segmentation, part-of-speech tagging (POS) oNamed Entity Recognition (NER) oStop Words Removal oStemming oLemmatization .  Semantic analysis would help computer learn about less literal meanings that go beyond the standard lexicon . Sentiment analysis is a way of measuring tone and intent in social media comments or reviews . It is o
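
The `max_length` warning in the log appears because the last chunk is shorter than the fixed `max_length=120`. One option is to scale the length bounds to each chunk. The helper below is a sketch; `summary_length_bounds` and its `ratio`, `floor`, and `cap` defaults are illustrative assumptions, not tuned values.

```python
def summary_length_bounds(n_input_tokens, ratio=0.5, floor=10, cap=120):
    """Return (min_length, max_length) scaled to the input size.

    ratio, floor, and cap are illustrative defaults, not tuned values.
    """
    max_length = max(floor, min(cap, int(n_input_tokens * ratio)))
    min_length = max(1, min(40, max_length // 3))
    return min_length, max_length

for n in (14, 100, 700):
    print(n, summary_length_bounds(n))
```

In the loop above, the token count for a chunk could be taken from `len(summarizer.tokenizer(chunk)["input_ids"])` and the returned pair passed as `min_length` and `max_length`.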

## Summarize the Lecture

This cell runs the full pipeline in one place: it summarizes the lecture as before, then uses the Hugging Face `valhalla/t5-base-qg-hl` model to generate several questions from the lecture content.

In [None]:
!pip install transformers sentencepiece torch PyPDF2
from transformers import pipeline
import PyPDF2

# ----------------
# 1. Load models
# ----------------
summarizer = pipeline("summarization", model="sshleifer/distilbart-cnn-12-6", device=-1)  # CPU mode for safety
qg = pipeline("text2text-generation", model="valhalla/t5-base-qg-hl", device=-1)

# ----------------
# 2. PDF Extract
# ----------------
def extract_text_from_pdf(pdf_path):
    text = ""
    with open(pdf_path, "rb") as f:
        reader = PyPDF2.PdfReader(f)
        for page in reader.pages:
            text += (page.extract_text() or "") + "\n"  # guard against pages with no text
    return text

from google.colab import files
uploaded = files.upload()
pdf_path = list(uploaded.keys())[0]
pdf_text = extract_text_from_pdf(pdf_path)

# ----------------
# 3. Chunking
# ----------------
def chunk_text(text, max_tokens=500):
    words = text.split()
    for i in range(0, len(words), max_tokens):
        yield " ".join(words[i:i+max_tokens])

chunks = list(chunk_text(pdf_text))

# ----------------
# 4. Summarization
# ----------------
summaries = []
for chunk in chunks:
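    try:
        summary = summarizer(chunk, max_length=120, min_length=40, do_sample=False)
        summaries.append(summary[0]['summary_text'])
    except Exception as e:  # catch model errors only, and report the skipped chunk
        print("Skipping chunk during summarization:", e)
        continue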

final_summary = " ".join(summaries)
print("📌 Final Summary:\n", final_summary)

# ----------------
# 5. Question Generation
# ----------------
questions = []
for chunk in chunks[:3]:   # limit to first 3 chunks for speed
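    try:
        qset = qg("generate questions: " + chunk, max_length=64, num_return_sequences=3)
        questions.extend([q["generated_text"] for q in qset])
    except Exception as e:  # catch model errors only, and report the skipped chunk
        print("Skipping chunk during question generation:", e)
        continue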

print("\n📌 Generated Questions:")
for i, q in enumerate(questions):
    print(f"Q{i+1}:", q)




Device set to use cpu
Device set to use cpu


Saving Intro for NLP_(NTI Lec 1).pdf to Intro for NLP_(NTI Lec 1) (7).pdf


Your max_length is set to 120, but your input_length is only 100. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=50)
Token indices sequence length is longer than the specified maximum sequence length for this model (940 > 512). Running this sequence through the model will result in indexing errors


📌 Final Summary:
  Course Outline: Introduction to NLP course and Basic Concepts . Introduction to Natural Language Processing . What is AI? What is Intelligence? How does natural language processing work? NLP is a difficult task because it involves a lot of unstructured data .  NLP is a field of Artificial Intelligence that enables computers to understand, interpret, and generate human language . NLP powers applications such as chatbots, translation services, sentiment analysis, and voice assistants like Siri and Alexa .  Low-resource language: Different businesses and industries often use very different language . NLP is word level analysis including: word segmentation, part-of-speech tagging (POS) oNamed Entity Recognition (NER) oStop Words Removal oStemming oLemmatization .  Semantic analysis would help computer learn about less literal meanings that go beyond the standard lexicon . Sentiment analysis is a way of measuring tone and intent in social media comments or reviews . It is

Both `max_new_tokens` (=256) and `max_length`(=64) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)



📌 Generated Questions:
Q1: What is the purpose of the NLP curriculum meeting?
Q2: What is the purpose of the curriculum meeting?
Q3: What is the name of the NLP curriculum meeting?
Q4: What is the opposite direction of NLG?
Q5: What is the opposite of NLG?
Q6: What is the opposite direction of Natural Language Generation?
Q7: What is the definition of Ambiguity Low-resource language?
Q8: What is a low-resource language?
Q9: What is the definition of Ambiguity Low-resource?


In [None]:
# === 1) deps
!pip install yake
import re, random, pandas as pd, yake

# === 2) helper: sentence splitter (no extra models needed)
def split_sentences(text: str):
    return [s.strip() for s in re.split(r'(?<=[.?!])\s+', text.strip()) if s.strip()]

# === 3) extract keywords from the summary (answers + distractor pool)
kw_extractor = yake.KeywordExtractor(n=1, top=40)  # unigrams keep it simple/clean
keywords = [kw for kw, _ in kw_extractor.extract_keywords(final_summary)]
# keep only “real” terms
keywords = [k for k in keywords if len(k) > 2 and re.search(r'[A-Za-z]', k)]

# === 4) choose an answer present in a sentence
def pick_answer(sent: str, kws):
    for k in kws:
        if re.search(rf'\b{re.escape(k)}\b', sent, flags=re.I):
            return k
    return None

# === 5) generators for FIB / T-F / MCQ from one sentence + its answer
def make_fib(sent: str, answer: str):
    fib_q = re.sub(rf'\b{re.escape(answer)}\b', '____', sent, flags=re.I)
    return {'type':'FIB', 'question': fib_q, 'answer': answer}

def make_tf(sent: str, answer: str, kws):
    flip = random.random() < 0.5
    if flip and len(kws) > 1:
        # replace answer with a random other keyword to make it false
        distractor = random.choice([k for k in kws if k.lower() != answer.lower()])
        stmt = re.sub(rf'\b{re.escape(answer)}\b', distractor, sent, flags=re.I)
        return {'type':'T/F', 'question': f"True or False: {stmt}", 'answer': 'False'}
    else:
        return {'type':'T/F', 'question': f"True or False: {sent}", 'answer': 'True'}

def make_mcq(sent: str, answer: str, kws):
    pool = [k for k in kws if k.lower() != answer.lower()]
    # safe sample ≤3 items
    k = min(3, len(pool))
    distractors = random.sample(pool, k) if k > 0 else []
    options = distractors + [answer]
    random.shuffle(options)
    stem = re.sub(rf'\b{re.escape(answer)}\b', '_____', sent, flags=re.I)
    # ask to fill the blank via options
    q = f"{stem}\nChoose the best option to fill the blank."
    return {'type':'MCQ', 'question': q, 'options': options, 'answer': answer}

# === 6) build the bank from the SUMMARY
def build_qbank_from_summary(summary_text: str, kws, max_per_type=10):
    sents = split_sentences(summary_text)
    tf_items, fib_items, mcq_items = [], [], []
    for s in sents:
        ans = pick_answer(s, kws)
        if not ans:
            continue
        fib_items.append(make_fib(s, ans))
        tf_items.append(make_tf(s, ans, kws))
        mcq_items.append(make_mcq(s, ans, kws))
        if len(fib_items) >= max_per_type and len(tf_items) >= max_per_type and len(mcq_items) >= max_per_type:
            break
    return tf_items, fib_items, mcq_items

tf_items, fib_items, mcq_items = build_qbank_from_summary(final_summary, keywords, max_per_type=8)

# === 7) flatten to a single CSV (type, question, options, answer)
rows = []
for it in tf_items:
    rows.append({'type': 'T/F', 'question': it['question'], 'option_a': 'True', 'option_b': 'False',
                 'option_c': '', 'option_d': '', 'answer': it['answer']})
for it in fib_items:
    rows.append({'type': 'FIB', 'question': it['question'], 'option_a': '', 'option_b': '',
                 'option_c': '', 'option_d': '', 'answer': it['answer']})
for it in mcq_items:
    opts = (it['options'] + ['','','',''])[:4]
    rows.append({'type': 'MCQ', 'question': it['question'],
                 'option_a': opts[0], 'option_b': opts[1], 'option_c': opts[2], 'option_d': opts[3],
                 'answer': it['answer']})

df = pd.DataFrame(rows)
df.to_csv('question_bank_enhanced.csv', index=False)
print("✅ Saved question_bank_enhanced.csv with", len(df), "items")


✅ Saved question_bank_enhanced.csv with 24 items
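
Because every summary sentence that contains a keyword becomes an item, questions and short headings from the summary also end up as True/False statements (the saved bank includes "True or False: What is Natural?"). A pre-filter on candidate sentences is one way to reduce this. The heuristic below is a sketch; `is_statement` and its `min_words` default are assumptions introduced for this example.

```python
import re

def is_statement(sentence, min_words=5):
    """Heuristic filter: keep declarative sentences long enough to test a fact."""
    s = sentence.strip()
    if not s or s.endswith("?"):
        return False          # questions make poor True/False or blank items
    if s.rstrip(" .").endswith(":"):
        return False          # dangling list headers such as "Examples:"
    return len(re.findall(r"[A-Za-z][A-Za-z\-']*", s)) >= min_words

candidates = [
    "What is AI?",
    "NLP is a field of Artificial Intelligence that enables computers to understand language .",
    "Course Outline .",
]
print([s for s in candidates if is_statement(s)])
```

Inside `build_qbank_from_summary`, the check could sit next to the existing `if not ans: continue` guard.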


## RAG Prototype

In [None]:
!pip install faiss-cpu sentence-transformers

from sentence_transformers import SentenceTransformer
import faiss
import numpy as np

Collecting faiss-cpu
  Downloading faiss_cpu-1.12.0-cp312-cp312-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl.metadata (5.1 kB)
Downloading faiss_cpu-1.12.0-cp312-cp312-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl (31.4 MB)
Installing collected packages: faiss-cpu
Successfully installed faiss-cpu-1.12.0


In [None]:
from sentence_transformers import SentenceTransformer

# load embedding model on CPU
embedder = SentenceTransformer("all-MiniLM-L6-v2", device="cpu")

import faiss, numpy as np

# create embeddings for the chunks from the earlier PDF splitting
# (encode returns a NumPy array here; FAISS expects float32)
embeddings = np.asarray(embedder.encode(chunks, convert_to_numpy=True), dtype="float32")

# === 3) build FAISS index (exact L2 search), once
dim = embeddings.shape[1]
index = faiss.IndexFlatL2(dim)
index.add(embeddings)

print("✅ FAISS index built with", index.ntotal, "chunks")

# === 4) helper: ask a question
from transformers import pipeline
qa_model = pipeline(
    "question-answering",
    model="distilbert-base-cased-distilled-squad",
    device=-1   # force CPU
)


def ask_question(question, top_k=3):
    q_emb = embedder.encode([question], convert_to_tensor=False)
    D, I = index.search(np.array(q_emb), k=top_k)
    retrieved = " ".join([chunks[idx] for idx in I[0]])

    ans = qa_model(question=question, context=retrieved)
    return ans["answer"], retrieved



# === 5) try it
q = "What is NLP and why is it difficult?"
ans, ctx = ask_question(q)
print("Q:", q)
print("Answer:", ans)
print("\nRetrieved Context:", ctx)


✅ FAISS index built with 5 chunks


Device set to use cpu


Q: What is NLP and why is it difficult?
Answer: it involves a lot of unstructured data

Retrieved Context: 01 Course Outline 1.Introduction to NLP course and Basic Concepts 2.NLP Basic Concepts 3.NLP Basic Concepts 4.Simple Processing 02•Tokenization •Sentence •Segmentation •POS Tagging •Stemming •Lemmatization •Named Entity Recognition•Stop Words •Matchers •Text Visualization •Syntax Structure •Bag of Words •Text Vectors •TF-IDF5.Simple Processing 6.Advanced Processing 7.Modeling & Text Generation 8.Modern NLP Architectures•Word Embedding •Word2Vec •Text Similarity •Distance Similarity • Text Classification •Text Clustering •LDA • N-Grams •Text Generation •Attention Mechanism •Transformer 9.Large Language Models • LLMs (BERT, GPT) •Fine -tuning LLMs 03•Introduction to Natural Language Processing • What is Natural Language Processing (NLP)? • Natural Language Understanding(NLU) and Natural Language Generation(NLG) •Applications of Natural Language Processing(NLP) •Applications of Natur
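
The retrieved context above is the course outline, which contains no answer to the question, so some filtering of search results may help. FAISS also pads the result with index `-1` when fewer than `k` vectors are stored, and `chunks[-1]` would then silently return the last chunk. The helper below is a sketch that handles both cases; `select_hits` and the `max_distance` cutoff are hypothetical names and values, and the toy distances are made up for the demonstration.

```python
def select_hits(distances, indices, chunks, max_distance=None):
    """Map one row of FAISS search output to (chunk, distance) pairs.

    Skips the -1 padding FAISS returns when fewer than k vectors exist, and
    optionally drops hits farther than max_distance (a hypothetical cutoff).
    """
    hits = []
    for dist, idx in zip(distances, indices):
        if idx < 0:
            continue
        if max_distance is not None and dist > max_distance:
            continue
        hits.append((chunks[idx], float(dist)))
    return hits

demo_chunks = ["intro", "tokenization", "embeddings"]
# Toy row for k=4 over 3 stored vectors; the last slot stands for padding.
print(select_hits([0.4, 0.9, 1.6, float("inf")], [2, 0, 1, -1], demo_chunks, max_distance=1.0))
```

In `ask_question`, this would be called as `select_hits(D[0], I[0], chunks)` before joining the chunk texts.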

## Multi-Document Mode

In [None]:
# --- Multi-document upload + processing ---
from google.colab import files
uploaded = files.upload()  # select multiple PDFs at once in the dialog

docs = []
for fname in uploaded.keys():
    text = extract_text_with_ocr(fname)             # your OCR-aware function
    if not text.strip():
        continue

    # per-doc chunking (keep chunks small for QA models)
    doc_chunks = list(chunk_text(text, max_tokens=300))
    # per-doc summary
    doc_summaries = [summarizer(c, max_length=120, min_length=40, do_sample=False)[0]['summary_text']
                     for c in doc_chunks]
    doc_final_summary = " ".join(doc_summaries)

    docs.append({
        "name": fname,
        "text": text,
        "chunks": doc_chunks,
        "summary": doc_final_summary
    })

print(f"✅ processed {len(docs)} documents")

# --- Global (cross-docs) summary + question bank ---
overall_summary = " ".join(d["summary"] for d in docs)
print("📌 overall summary (preview):", overall_summary[:800], "...")

# Rebuild the FIB/T-F/MCQ from the OVERALL summary (reuse your YAKE block)
kw_extractor = yake.KeywordExtractor(n=1, top=50)
keywords = [kw for kw,_ in kw_extractor.extract_keywords(overall_summary)]
keywords = [k for k in keywords if len(k) > 2]

tf_items, fib_items, mcq_items = build_qbank_from_summary(overall_summary, keywords, max_per_type=10)

import pandas as pd
rows = []
for it in tf_items:
    rows.append({'doc':'ALL','type':'T/F','question':it['question'],'option_a':'True','option_b':'False',
                 'option_c':'','option_d':'','answer':it['answer']})
for it in fib_items:
    rows.append({'doc':'ALL','type':'FIB','question':it['question'],'option_a':'','option_b':'',
                 'option_c':'','option_d':'','answer':it['answer']})
for it in mcq_items:
    opts = (it['options'] + ['','','',''])[:4]
    rows.append({'doc':'ALL','type':'MCQ','question':it['question'],'option_a':opts[0],'option_b':opts[1],
                 'option_c':opts[2],'option_d':opts[3],'answer':it['answer']})
pd.DataFrame(rows).to_csv("question_bank_multi.csv", index=False)
print("✅ saved question_bank_multi.csv")

# --- Build a single FAISS index over ALL documents (keep doc ids) ---
from sentence_transformers import SentenceTransformer
import numpy as np, faiss

embedder = SentenceTransformer("all-MiniLM-L6-v2", device="cpu")

all_chunks, chunk_docids = [], []
for i, d in enumerate(docs):
    all_chunks.extend(d["chunks"])
    chunk_docids.extend([i]*len(d["chunks"]))

emb = embedder.encode(all_chunks, convert_to_numpy=True)
index = faiss.IndexFlatL2(emb.shape[1])
index.add(emb)
print("✅ FAISS index built over", len(all_chunks), "chunks from", len(docs), "docs")

# --- Multi-doc Q&A helper (returns answer + which docs were used) ---
def ask_question_multi(question, top_k=4):
    q_emb = embedder.encode([question], convert_to_numpy=True)
    D, I = index.search(q_emb, top_k)
    ctx_chunks = [all_chunks[i] for i in I[0]]
    used_doc_ids = sorted(set(chunk_docids[i] for i in I[0]))

    # keep context short enough for QA
    context = " ".join(ctx_chunks)[:2000]
    ans = qa_model(question=question, context=context)
    used_docs = [docs[j]["name"] for j in used_doc_ids]
    return ans["answer"], used_docs, ctx_chunks

# try it
ans, used_docs, ctx = ask_question_multi("What is Word2Vec?", top_k=4)
print("Answer:", ans)
print("From docs:", used_docs)


Saving LLM & LangChain(NTI Lec 11).pdf to LLM & LangChain(NTI Lec 11).pdf
Saving ocr&scraping(NTI Lec 12).pdf to ocr&scraping(NTI Lec 12).pdf


Your max_length is set to 120, but your input_length is only 14. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=7)


✅ processed 2 documents
📌 overall summary (preview):  Large Language Models (LLMs) are advanced AI systems trained on massive text datasets . They understand, generate, translate, and summarize human language . Examples: GPT (by OpenAI), PaLM (by Google), LLaMA (by Meta ).  LangChain is an open -source framework for building applications powered by LLMs . It helps developers connect language models with external tools, data sources, and user interfaces . LangChain helps coordinate multiple LLM components into a single workflow .  LangChain uses LLMChains to chain multiple tasks (e.g., search → summarize → answer) to multiple tasks . Maintains memory/state across user interactions using: ConversationBufferMemory Vector stores + retrievers . Enables context -aware chatbots and assistants .  LangChain provides memory modules to store, retrieve, ...
✅ saved question_bank_multi.csv
✅ FAISS index built over 11 chunks from 2 docs
Answer: Captured word relationships using vectors
From docs: ['
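
`ask_question_multi` reports which documents contributed context but not how much each contributed. A per-document hit count is a small extension that makes the attribution easier to judge. The sketch below uses toy data; `hits_per_document` and the file names are hypothetical, while `chunk_docids` mirrors the list built in the cell above.

```python
from collections import Counter

def hits_per_document(hit_indices, chunk_docids, doc_names):
    """Count how many retrieved chunks came from each document, most first."""
    counts = Counter(doc_names[chunk_docids[i]] for i in hit_indices if i >= 0)
    return counts.most_common()

# Toy data: two documents with three chunks each.
doc_names = ["lec11.pdf", "lec12.pdf"]
chunk_docids = [0, 0, 0, 1, 1, 1]
print(hits_per_document([4, 0, 5, -1], chunk_docids, doc_names))
```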

## Save Summary as PDF

In [None]:
!pip install reportlab

Collecting reportlab
  Downloading reportlab-4.4.3-py3-none-any.whl.metadata (1.7 kB)
Downloading reportlab-4.4.3-py3-none-any.whl (2.0 MB)
Installing collected packages: reportlab
Successfully installed reportlab-4.4.3


In [None]:
from reportlab.lib.pagesizes import letter
from reportlab.pdfgen import canvas

In [None]:
import os, time, pandas as pd
from google.colab import files

In [None]:
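import textwrap

def save_text_as_pdf(text, filename, margin=40, width=95):
    c = canvas.Canvas(filename, pagesize=letter)
    w, h = letter

    def new_text_object():
        t = c.beginText(margin, h - margin)
        t.setFont("Helvetica", 12)
        return t

    t = new_text_object()
    for para in text.split("\n"):
        # wrap on word boundaries; the trailing "" adds a blank line between paragraphs
        for line in textwrap.wrap(para, width=width) + [""]:
            if t.getY() < margin:      # page is full: flush it and start a new one
                c.drawText(t)
                c.showPage()
                t = new_text_object()
            t.textLine(line)
    c.drawText(t)
    c.save()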

ts = time.strftime("%Y%m%d_%H%M")

In [None]:
# 2) pick the summary to export: prefer the multi-document summary, else the single-document one
summary_text = None
summary_name = None
if 'overall_summary' in globals() and overall_summary.strip():
    summary_text = overall_summary
    summary_name = f"summary_overall_{ts}.pdf"
elif 'final_summary' in globals() and final_summary.strip():
    summary_text = final_summary
    summary_name = f"summary_single_{ts}.pdf"

if summary_text:
    save_text_as_pdf(summary_text, summary_name)
    print(f"✅ Saved summary -> {summary_name}")
    files.download(summary_name)


✅ Saved summary -> summary_overall_20250918_1105.pdf



## Save Questions as CSV

In [None]:
# 3) choose the newest question bank
# Priority: (a) structured bank DataFrame df, (b) multi-doc CSV already created,
# (c) fallback to open-ended `questions` list.
if 'df' in globals() and isinstance(df, pd.DataFrame) and len(df):
    qfile = f"question_bank_enhanced_{ts}.csv"
    df.to_csv(qfile, index=False)
    print(f"✅ Saved enhanced question bank -> {qfile}")
    display(pd.read_csv(qfile).head(5))
    files.download(qfile)
elif os.path.exists("question_bank_multi.csv"):
    print("ℹ️ Using existing multi-doc bank: question_bank_multi.csv")
    display(pd.read_csv("question_bank_multi.csv").head(5))
    files.download("question_bank_multi.csv")
elif 'questions' in globals() and len(questions):
    qfile = f"question_bank_open_{ts}.csv"
    pd.DataFrame({'question': questions}).to_csv(qfile, index=False)
    print(f"✅ Saved open-ended question list -> {qfile}")
    display(pd.read_csv(qfile).head(5))
    files.download(qfile)
else:
    print("⚠️ No question bank object found to save.")


✅ Saved enhanced question bank -> question_bank_enhanced_20250918_1105.csv


Unnamed: 0,type,question,option_a,option_b,option_c,option_d,answer
0,T/F,True or False: Course Outline: Parsing to NLP ...,True,False,,,False
1,T/F,True or False: Introduction to Natural Languag...,True,False,,,True
2,T/F,True or False: What is Natural?,True,False,,,False
3,T/F,True or False: How does natural Lemmatization ...,True,False,,,False
4,T/F,True or False: analysis is a difficult task be...,True,False,,,False

