<a href="https://colab.research.google.com/github/madhupathy/PGP-GABA-UTA-JUL25/blob/main/Medical_Assistant_RAG_FullCode_Starter.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Medical Assistant (RAG) — Full‑Code Starter

**Goal:** Answer clinical questions using a Retrieval‑Augmented Generation (RAG) pipeline over the Merck Manual PDF. This notebook covers a GPU check, installs, configuration and model download, PDF ingestion and chunking, embeddings + FAISS, retrieval, prompt sweeps, RAG QA, evaluation, and export.

**Submission:** Export to HTML at the end.

> ⚠️ Educational use only. Not a substitute for professional medical advice.


## 0) Runtime & Setup (Colab)
- *Runtime → Change runtime type → GPU (T4 recommended)*
- If CUDA errors occur, re‑select GPU and **Restart runtime**.

In [43]:
#@title GPU check
!nvidia-smi || echo "No GPU visible — enable GPU in Runtime > Change runtime type"


/bin/bash: line 1: nvidia-smi: command not found
No GPU visible — enable GPU in Runtime > Change runtime type
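
The same check can be made from Python so that later cells adapt to the runtime instead of failing. This is a minimal sketch, assuming that finding `nvidia-smi` on `PATH` is a good enough proxy for a usable GPU; `gpu_visible`, `choose_gpu_layers`, and `N_GPU_LAYERS` are hypothetical names that do not appear elsewhere in this notebook.

```python
import shutil

def gpu_visible() -> bool:
    """Return True if the nvidia-smi binary is on PATH (a proxy for a GPU runtime)."""
    return shutil.which("nvidia-smi") is not None

def choose_gpu_layers(has_gpu: bool) -> int:
    """Pick an n_gpu_layers value for llama-cpp: -1 offloads all layers, 0 keeps the model on CPU."""
    return -1 if has_gpu else 0

N_GPU_LAYERS = choose_gpu_layers(gpu_visible())
print("n_gpu_layers =", N_GPU_LAYERS)
```

The resulting value could be passed as `n_gpu_layers` when the model is loaded in section 6.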


## 1) Installs

In [44]:
#@title Install libraries (may take a few minutes)
!pip -q install --upgrade pip
!pip -q install faiss-cpu sentence-transformers pypdf langchain langchain-community langchain-text-splitters
!pip -q install llama-cpp-python==0.2.90  # works well on Colab T4; adjust if needed
# Optional: reranker
!pip -q install FlagEmbedding                    # BGE reranker


## 2) Configuration

In [45]:
#@title Configuration
from pathlib import Path
PDF_PATH = Path('/content/medical_diagnosis_manual.pdf')
EMBED_MODEL_NAME = 'sentence-transformers/all-MiniLM-L6-v2'
FAISS_DIR = Path('/content/faiss_index')
FAISS_DIR.mkdir(exist_ok=True, parents=True)

# LLM config (llama-cpp); alternatively, plug in OpenAI etc.
LLAMA_MODEL_PATH = '/content/Llama-3.2-3B-Instruct-Q4_K_M.gguf'  # must match the file name downloaded below
TEMPERATURE = 0.2
MAX_TOKENS = 1024
TOP_P = 0.95
TOP_K = 40

!pip -q install -U huggingface_hub
from huggingface_hub import hf_hub_download

MODEL_REPO = "bartowski/Llama-3.2-3B-Instruct-GGUF"
FILENAME   = "Llama-3.2-3B-Instruct-Q4_K_M.gguf"

# local_dir_use_symlinks is deprecated and ignored by recent huggingface_hub releases, so it is omitted
local_path = hf_hub_download(repo_id=MODEL_REPO, filename=FILENAME, local_dir="/content")
print("Saved to:", local_path)



Saved to: /content/Llama-3.2-3B-Instruct-Q4_K_M.gguf


## 3) Load & Chunk PDF

In [46]:
#@title Load PDF and chunk
from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter

assert PDF_PATH.exists(), f"Upload your PDF to {PDF_PATH}"
loader = PyPDFLoader(str(PDF_PATH))
docs = loader.load()
print(f"Pages loaded: {len(docs)}")

splitter = RecursiveCharacterTextSplitter(chunk_size=1200, chunk_overlap=150)
chunks = splitter.split_documents(docs)
print(f"Chunks: {len(chunks)} (size≈1200, overlap=150)")


Pages loaded: 4114
Chunks: 14561 (size≈1200, overlap=150)
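
To make the `chunk_size` and `chunk_overlap` arithmetic concrete, here is a simplified fixed-window sketch. It is not what `RecursiveCharacterTextSplitter` does internally (that splitter prefers paragraph and sentence separators and produces chunks of at most 1200 characters); `sliding_chunks` is a hypothetical helper used only for illustration.

```python
def sliding_chunks(text: str, chunk_size: int = 1200, overlap: int = 150) -> list[str]:
    """Split text into fixed-size windows; consecutive windows share `overlap` characters."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    stride = chunk_size - overlap  # 1200 - 150 = 1050 new characters per chunk
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window reached the end of the text
        start += stride
    return chunks

sample = "".join(chr(97 + i % 26) for i in range(3000))  # 3000 characters of toy text
pieces = sliding_chunks(sample)
print([len(p) for p in pieces])  # [1200, 1200, 900]
```

With a stride of 1050 characters, the tail of each chunk is repeated at the head of the next one, which helps a sentence cut at a boundary remain retrievable from at least one chunk.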


## 4) Embeddings & FAISS

In [47]:
#@title Build FAISS index (in memory; nothing is written to FAISS_DIR in this cell)
from sentence_transformers import SentenceTransformer
import faiss, numpy as np

embedder = SentenceTransformer(EMBED_MODEL_NAME)
embs = embedder.encode([c.page_content for c in chunks], show_progress_bar=True, convert_to_numpy=True, normalize_embeddings=True)
index = faiss.IndexFlatIP(embs.shape[1])
index.add(embs)
print('FAISS index size:', index.ntotal)

# Keep a minimal document store (texts + metadata) in the same order as the index rows
texts = [c.page_content for c in chunks]
metas = [c.metadata for c in chunks]


FAISS index size: 14561
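
The cell above pairs `normalize_embeddings=True` with `IndexFlatIP`. The reason this works is that the inner product of two unit-length vectors equals their cosine similarity. The following dependency-free sketch checks that identity on toy 2-D vectors; the helper names are illustrative only.

```python
import math

def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def l2_normalize(v):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def cosine(a, b):
    """Cosine similarity computed directly from the unnormalized vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

a, b = [3.0, 4.0], [4.0, 3.0]
ip_normalized = dot(l2_normalize(a), l2_normalize(b))
print(round(ip_normalized, 6), round(cosine(a, b), 6))  # 0.96 0.96
```

Skipping the normalization step would make the index rank by raw inner product, which favors longer vectors rather than closer directions.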


## 5) Retriever

In [48]:
#@title Define retriever (similarity search)
import numpy as np

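def retrieve(query, k=6):
    """Return up to k (text, metadata) pairs ranked by cosine similarity to the query."""
    q = embedder.encode([query], convert_to_numpy=True, normalize_embeddings=True)
    _, I = index.search(q, k)  # scores are unused here; I holds row indices into texts/metas
    # FAISS pads the result with -1 when fewer than k vectors are available
    return [(texts[i], metas[i]) for i in I[0] if i >= 0]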

# quick test
print(len(retrieve("sepsis hour-1 bundle", k=3)))


3
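
Conceptually, a flat inner-product index does an exhaustive scan: score every stored vector against the query and keep the `k` best. The sketch below reproduces that idea in plain Python on toy vectors and also returns the scores, which the retriever above discards. `top_k_by_inner_product` is a hypothetical helper, not a FAISS function.

```python
import heapq

def top_k_by_inner_product(query, vectors, k=3):
    """Return up to k (score, row_index) pairs, highest inner product first."""
    scored = (
        (sum(q * x for q, x in zip(query, vec)), i)
        for i, vec in enumerate(vectors)
    )
    return heapq.nlargest(k, scored)

# Toy unit vectors standing in for chunk embeddings
vectors = [[1.0, 0.0], [0.0, 1.0], [0.6, 0.8], [0.8, 0.6]]
hits = top_k_by_inner_product([1.0, 0.0], vectors, k=2)
print(hits)  # [(1.0, 0), (0.8, 3)]
```

Keeping the scores makes it possible to drop weak matches (for example, below some similarity cutoff chosen by inspection) before they reach the prompt.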


## 6) LLM (llama-cpp) — local

In [49]:
#@title Load llama-cpp (provide GGUF path above)
from llama_cpp import Llama
llm = Llama(model_path=LLAMA_MODEL_PATH, n_gpu_layers=50, n_ctx=4096, n_threads=4, verbose=False)



## 7) Prompt Templates

In [50]:
#@title Prompts (baseline, engineered, RAG)
BASELINE_SYS = """You are a careful clinical assistant. If unsure, say so and ask for clarification. Do not fabricate.
Educational use only — not medical advice.
"""

ENGINEERED_SYS = """You are an ICU attending. Answer concisely in bullets and include a short ‘Rationale’. If evidence is insufficient, state that explicitly.
Return JSON with fields: answer, rationale, red_flags.
"""

RAG_SYS = """You are a clinical assistant. Answer ONLY using the provided CONTEXT.
- If the context is insufficient, say "Insufficient context" and request what is missing.
- Cite chunk IDs like [C1], [C2] from the given CONTEXT.
Educational use only — not medical advice.
"""

USER_QS = [
    "What is the protocol for managing sepsis in a critical care unit?",
    "What are the common symptoms of appendicitis, and can it be cured via medicine? If not, what surgical procedure should be followed to treat it?",
    "What are treatments and possible causes for sudden patchy hair loss (localized bald spots on scalp)?",
    "What treatments are recommended for a person with brain tissue injury (TBI) causing temporary or permanent impairment?"
]
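
`ENGINEERED_SYS` asks for JSON with the fields `answer`, `rationale`, and `red_flags`, but small models often wrap the object in extra prose or omit a field. A tolerant parser is one way to handle that. This is a sketch under that assumption; `parse_engineered_reply` and `REQUIRED_FIELDS` are hypothetical names.

```python
import json

REQUIRED_FIELDS = ("answer", "rationale", "red_flags")

def parse_engineered_reply(text: str):
    """Parse the outermost {...} span of a reply; return a dict, or None if it is unusable."""
    start, end = text.find("{"), text.rfind("}")
    if start == -1 or end <= start:
        return None  # no JSON object in the reply
    try:
        data = json.loads(text[start:end + 1])
    except json.JSONDecodeError:
        return None  # malformed JSON
    if not isinstance(data, dict) or any(field not in data for field in REQUIRED_FIELDS):
        return None  # valid JSON but missing a required field
    return data

reply = 'Here is the result:\n{"answer": "x", "rationale": "y", "red_flags": []}\nDone.'
print(parse_engineered_reply(reply))
```

A `None` result can be logged as a formatting failure, which is itself a useful signal when comparing prompt variants.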


## 8) Generation utilities

In [51]:
#@title Generation helpers
from textwrap import shorten

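def chat(llm, system, user, temperature=0.2, top_p=0.95, max_tokens=1024, top_k=40):
    # create_chat_completion applies the chat template stored in the GGUF file, so the
    # prompt matches Llama 3.2 (a hand-built [INST]/<<SYS>> string is the Llama 2 format).
    out = llm.create_chat_completion(
        messages=[{"role": "system", "content": system}, {"role": "user", "content": user}],
        temperature=temperature, top_p=top_p, top_k=top_k, max_tokens=max_tokens)
    return out["choices"][0]["message"]["content"].strip()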

# RAG
def answer_with_rag(question, k=6, temperature=0.2):
    ctxs = retrieve(question, k=k)
    ctx_text = "\n\n".join([f"[C{i+1}] {shorten(t[:1200], width=1200)}" for i,(t,_) in enumerate(ctxs)])
    user = f"CONTEXT:\n{ctx_text}\n\nQUESTION: {question}\n\nAnswer with citations [C#]."
    return chat(llm, RAG_SYS, user, temperature=temperature)


## 9) Baseline & Prompt‑Engineered QA (5+ variants)

In [52]:
#@title Fast + Optimized Baselines & Prompt Sweeps (with final pass)
import itertools, time

# ---------- TUNABLES ----------
# Fast "sweep" on just 1 question to pick params quickly
QUESTIONS_FOR_SWEEP = [USER_QS[0]]   # only sepsis; add USER_QS[1] if you want 2
# Keep the sweep small for speed. Note: itertools.product varies the last key fastest,
# so the [:3] slice below keeps only temperature=0.0 combos (top_p/top_k vary, temperature does not).
PARAM_GRID = {
    'temperature': [0.0, 0.2],
    'top_p': [0.9, 1.0],
    'top_k': [20, 40]
}
COMBOS = list(itertools.product(PARAM_GRID['temperature'], PARAM_GRID['top_p'], PARAM_GRID['top_k']))[:3]

# Shorter output during sweep; long output only in final pass
SWEEP_MAX_TOKENS = 200
FINAL_MAX_TOKENS = 700   # reduce from 1024 to speed up; increase if you truly need longer answers

# Add stop tokens to end early when the answer is complete
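# "<|eot_id|>" is the Llama 3 end-of-turn marker; note that "###" also cuts off markdown headings
STOP_TOKENS = ["<|eot_id|>", "###", "\n\n\n"]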

# ---------- HELPER ----------
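def chat_fast(llm, system, user, temperature=0.2, top_p=0.95, top_k=40, max_tokens=256):
    # create_chat_completion applies the chat template stored in the GGUF file, so the
    # prompt matches Llama 3.2 (a hand-built [INST]/<<SYS>> string is the Llama 2 format).
    out = llm.create_chat_completion(
        messages=[
            {"role": "system", "content": system},
            {"role": "user", "content": user},
        ],
        temperature=temperature,
        top_p=top_p,
        top_k=top_k,
        max_tokens=max_tokens,
        stop=STOP_TOKENS
    )
    return out["choices"][0]["message"]["content"].strip()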

runs = []

# ---------- 1) FAST SWEEP (small outputs, 1 question) ----------
print("=== FAST SWEEP START ===")
best_combo = None
best_score = -1.0

for q in QUESTIONS_FOR_SWEEP:
    # Baseline (short)
    t0 = time.time()
    out = chat_fast(llm, BASELINE_SYS, q, temperature=0.2, max_tokens=SWEEP_MAX_TOKENS)
    dt = time.time() - t0
    print(f"[baseline] '{q[:28]}...' took {dt:.1f}s")
    runs.append({"mode":"baseline_sweep","q":q,"params":{"temperature":0.2,"max_tokens":SWEEP_MAX_TOKENS},"answer":out})

    # Prompt-engineered short sweeps
    for (t,tp,tk) in COMBOS:
        t0 = time.time()
        out = chat_fast(llm, ENGINEERED_SYS, q, temperature=t, top_p=tp, top_k=tk, max_tokens=SWEEP_MAX_TOKENS)
        dt = time.time() - t0
        print(f"[engineered] t={t}, tp={tp}, tk={tk} took {dt:.1f}s")

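        # Simple heuristic score: 0.5 if the answer contains a bullet or hyphen character,
        # plus up to 0.5 for length in characters (capped at SWEEP_MAX_TOKENS characters).
        # This saturates easily, so ties go to the first combo tried; replace it with a real evaluation later.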
        score = (0.5 if ("•" in out or "-" in out) else 0.0) + min(len(out), SWEEP_MAX_TOKENS)/SWEEP_MAX_TOKENS*0.5
        if score > best_score:
            best_score = score
            best_combo = {"temperature": t, "top_p": tp, "top_k": tk}
        runs.append({"mode":"engineered_sweep","q":q,"params":{"temperature":t,"top_p":tp,"top_k":tk,"max_tokens":SWEEP_MAX_TOKENS},"answer":out})

print("Best combo from sweep:", best_combo, "score:", round(best_score,3))
print("=== FAST SWEEP END ===\n")

# ---------- 2) FINAL PASS (all 4 questions, chosen params, longer but capped outputs) ----------
print("=== FINAL PASS START ===")
final_params = best_combo if best_combo else {"temperature":0.2, "top_p":0.95, "top_k":40}

# Baseline (one run per question, fixed default params)
for q in USER_QS:
    t0 = time.time()
    out = chat_fast(llm, BASELINE_SYS, q, temperature=0.2, top_p=0.95, top_k=40, max_tokens=FINAL_MAX_TOKENS)
    print(f"[final-baseline] '{q[:28]}...' took {time.time()-t0:.1f}s")
    runs.append({"mode":"baseline_final","q":q,"params":{"temperature":0.2,"top_p":0.95,"top_k":40,"max_tokens":FINAL_MAX_TOKENS},"answer":out})

# Engineered (best combo)
for q in USER_QS:
    t0 = time.time()
    out = chat_fast(llm, ENGINEERED_SYS, q, **final_params, max_tokens=FINAL_MAX_TOKENS)
    print(f"[final-engineered] '{q[:28]}...' took {time.time()-t0:.1f}s")
    runs.append({"mode":"engineered_final","q":q,"params":{**final_params,"max_tokens":FINAL_MAX_TOKENS},"answer":out})

print("=== FINAL PASS END ===")

len(runs)


=== FAST SWEEP START ===
[baseline] 'What is the protocol for man...' took 41.3s
[engineered] t=0.0, tp=0.9, tk=20 took 39.5s
[engineered] t=0.0, tp=0.9, tk=40 took 32.5s
[engineered] t=0.0, tp=1.0, tk=20 took 33.1s
Best combo from sweep: {'temperature': 0.0, 'top_p': 0.9, 'top_k': 20} score: 1.0
=== FAST SWEEP END ===

=== FINAL PASS START ===
[final-baseline] 'What is the protocol for man...' took 98.6s
[final-baseline] 'What are the common symptoms...' took 64.4s
[final-baseline] 'What are treatments and poss...' took 105.1s
[final-baseline] 'What treatments are recommen...' took 65.3s
[final-engineered] 'What is the protocol for man...' took 124.4s
[final-engineered] 'What are the common symptoms...' took 80.1s
[final-engineered] 'What are treatments and poss...' took 120.8s
[final-engineered] 'What treatments are recommen...' took 81.2s
=== FINAL PASS END ===


12
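
The log above shows that all three engineered runs used `t=0.0`. That follows from how `itertools.product` orders its output: the last parameter varies fastest, so a `[:3]` slice never reaches the second temperature. At temperature 0.0 decoding is usually greedy, so `top_p` and `top_k` have little effect there. One possible alternative, sketched below with a hypothetical `sample_combos` helper, is a seeded random sample of five combinations, which also matches the "5+ variants" in the section title.

```python
import itertools
import random

grid = {"temperature": [0.0, 0.2], "top_p": [0.9, 1.0], "top_k": [20, 40]}
all_combos = list(itertools.product(grid["temperature"], grid["top_p"], grid["top_k"]))

first_three = all_combos[:3]
print({t for t, _, _ in first_three})  # {0.0}

def sample_combos(combos, n, seed=0):
    """Draw n distinct combos reproducibly, without replacement."""
    rng = random.Random(seed)
    return rng.sample(combos, n)

picked = sample_combos(all_combos, 5)
print(sorted({t for t, _, _ in picked}))  # [0.0, 0.2]
```

Each parameter value appears in exactly four of the eight combinations, so any sample of five must include both values of every parameter regardless of the seed.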

## 10) RAG QA

In [None]:
#@title Run RAG answers
for q in USER_QS:
    ans = answer_with_rag(q, k=6, temperature=0.2)
    runs.append({"mode":"rag","q":q,"params":{"k":6,"temperature":0.2},"answer":ans})

len(runs)
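
`RAG_SYS` asks the model to cite chunk IDs such as `[C1]`. A quick sanity check is to confirm that every cited ID refers to a chunk that was actually in the prompt. The sketch below does only that; it does not verify that the cited chunk supports the claim. `cited_ids` and `check_citations` are hypothetical helpers.

```python
import re

def cited_ids(answer: str) -> set[int]:
    """Collect the numeric IDs of all [C#] markers in an answer."""
    return {int(n) for n in re.findall(r"\[C(\d+)\]", answer)}

def check_citations(answer: str, k: int) -> dict:
    """Split cited IDs into those within 1..k and those pointing at nonexistent chunks."""
    ids = cited_ids(answer)
    valid = {i for i in ids if 1 <= i <= k}
    return {
        "cited": sorted(valid),
        "out_of_range": sorted(ids - valid),
        "has_citation": bool(valid),
    }

print(check_citations("Give fluids early [C1][C3]. See also [C9].", k=6))
# {'cited': [1, 3], 'out_of_range': [9], 'has_citation': True}
```

An answer with no valid citations, or with out-of-range IDs, is a candidate for manual review before it is scored.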


## 11) Evaluation — Groundedness, Relevance, Context Precision

In [None]:
#@title Simple automatic rubric (embedding similarity)
import numpy as np

def cosine(a,b):
    return float(np.dot(a,b) / (np.linalg.norm(a)*np.linalg.norm(b) + 1e-8))

# Groundedness: max similarity between answer and its retrieved context chunks (proxy)
# Relevance: similarity between answer and question
# Context precision: average similarity of retrieved chunks to the question (for RAG)

def embed_text(t: str):
    return embedder.encode([t], convert_to_numpy=True, normalize_embeddings=True)[0]

question_vecs = {q: embed_text(q) for q in USER_QS}

evals = []
for r in runs:
    q = r['q']
    ans = r['answer']
    ans_vec = embed_text(ans)
    rel = cosine(ans_vec, question_vecs[q])
    grounded = None
    ctx_prec = None
    if r['mode'] == 'rag':
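        # Re-run retrieval with the k recorded for this run; the flat index search is deterministic
        ctxs = retrieve(q, k=r['params']['k'])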
        ctx_vecs = [embed_text(t) for t,_ in ctxs]
        grounded = max(cosine(ans_vec, v) for v in ctx_vecs)
        ctx_prec = float(np.mean([cosine(question_vecs[q], v) for v in ctx_vecs]))
    evals.append({**r, 'metrics': {'relevance': rel, 'groundedness': grounded, 'context_precision': ctx_prec}})

len(evals)
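
To see how the three numbers relate, here is a worked toy example with hand-picked 2-D unit vectors (illustrative values, not outputs of this notebook). Since the vectors are unit length, a plain dot product stands in for cosine similarity; `rubric` is a hypothetical helper that mirrors the definitions in the comments above.

```python
def dot(a, b):
    """Inner product; equals cosine similarity for unit-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def rubric(ans_vec, q_vec, ctx_vecs):
    """Compute the three proxy metrics from unit-normalized embeddings."""
    ans_sims = [dot(ans_vec, v) for v in ctx_vecs]  # answer vs. each chunk
    q_sims = [dot(q_vec, v) for v in ctx_vecs]      # question vs. each chunk
    return {
        "relevance": dot(ans_vec, q_vec),
        "groundedness": max(ans_sims),
        "context_precision": sum(q_sims) / len(q_sims),
    }

q = [1.0, 0.0]
ans = [0.6, 0.8]
ctxs = [[0.8, 0.6], [0.0, 1.0]]
scores = rubric(ans, q, ctxs)
print({name: round(value, 3) for name, value in scores.items()})
```

Note that groundedness takes the best-matching chunk only, so one closely matching chunk yields a high score even when the rest of the retrieved context is unrelated.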


## 12) Display & Export

In [None]:
#@title Tabulate results and export to JSON
import pandas as pd

df = pd.DataFrame([
    {
        'mode': e['mode'],
        'question': e['q'],
        'params': str(e['params']),
        'relevance': round(e['metrics']['relevance'],3),
        'groundedness': None if e['metrics']['groundedness'] is None else round(e['metrics']['groundedness'],3),
        'context_precision': None if e['metrics']['context_precision'] is None else round(e['metrics']['context_precision'],3),
        'answer': e['answer'][:400]  # preview
    } for e in evals
])

# Just show the DataFrame in Colab
from IPython.display import display
display(df)

# Save to JSON if you want to keep results
df.to_json('/content/qa_eval_results.json', orient='records', indent=2)
print('Saved /content/qa_eval_results.json')
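
For a compact comparison across modes, the per-run metrics can be averaged by `mode` while skipping the `None` values that non-RAG runs carry for groundedness and context precision. The sketch below uses toy rows shaped like the entries of `evals`; the numbers are placeholders, not results from this notebook, and `mean_by_mode` is a hypothetical helper.

```python
from collections import defaultdict
from statistics import mean

def mean_by_mode(rows, metric):
    """Average one metric per mode, ignoring rows where the metric is None or absent."""
    buckets = defaultdict(list)
    for row in rows:
        value = row["metrics"].get(metric)
        if value is not None:
            buckets[row["mode"]].append(value)
    return {mode: round(mean(values), 3) for mode, values in buckets.items()}

# Toy rows with placeholder values (NOT real results)
toy_rows = [
    {"mode": "rag", "metrics": {"relevance": 0.5, "groundedness": 0.75}},
    {"mode": "rag", "metrics": {"relevance": 0.25, "groundedness": 0.25}},
    {"mode": "baseline_final", "metrics": {"relevance": 0.5, "groundedness": None}},
]
print(mean_by_mode(toy_rows, "relevance"))
print(mean_by_mode(toy_rows, "groundedness"))
```

Modes with no values for a metric are simply absent from the result, which avoids averaging over missing data.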


### HTML Export (for submission)

In [None]:
#@title Save this notebook as HTML
!jupyter nbconvert --to html /content/Medical_Assistant_RAG_FullCode_Starter.ipynb --output /content/Medical_Assistant_RAG_Submission.html || echo 'Open: File > Download > .ipynb then convert locally if nbconvert is unavailable.'


## Notes & Citations
- Merck Manual (Professional Edition): Sepsis & Septic Shock; Appendicitis; Alopecia Areata; Traumatic Brain Injury.
- Surviving Sepsis Campaign (SCCM) Adult Guidelines (Hour‑1 Bundle), 2021.
