<a href="https://colab.research.google.com/github/m-adeleke1/PyTorch_Projects/blob/main/PyTorch_Radiology_Error_Report_Checker.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#Radiology Report Error Checker — What This Project Does (and Why)<br>

This project is a beginner-friendly, end-to-end demo that helps you practice PyTorch while building something useful: a tool that flags simple issues in radiology text and drafts clearer, corrected sentences. It’s not for clinical use—think of it as a learning lab where you can see how data, rules, a small neural network, and large language models (LLMs) fit together.

At a high level, the pipeline is:

1) Load reports<br>
2) Split text into sentences<br>
3) Apply simple rules to label “anomalies”<br>
4) Train a small PyTorch model to predict anomalies<br>
5) For flagged sentences, ask an LLM to rewrite in clinical style and explain it ELI5<br>
6) Try it in a tiny web app (Gradio).<br><br>

###The Data We Use

We try to load real radiology reports from the IU X-Ray / OpenI collection (free, public). The code downloads an archive of XML files and extracts two common sections per report:

Findings (descriptive details)

Impression (bottom-line summary)

If downloading isn’t possible in your environment, the notebook falls back to a tiny toy dataset so everything still runs. Either way, you’ll end up with many sentences like “The heart size is normal.” or “No pleural effusion. There is pleural effusion on the left.”<br><br>

###What Counts as an “Anomaly” Here?

Because we don’t have gold labels (human-marked errors), we create weak labels using simple rules. These rules aren’t perfect, but they’re great for learning, and they point your model in the right direction.

We label a sentence as anomalous if it matches any of these patterns:<br><br>

Missing measurement<br><br>
Examples:

“The lesion measures . cm in diameter.” (a dot before “cm” with no number)

“measures” not followed by a number shortly after

units like cm/mm without a preceding number<br><br>

Duplicated phrase<br><br>
Examples:

Exact repetition like “Lungs are clear clear.”<br><br>

Repeated 3-word chunks (“n-grams”) inside the same sentence<br><br>

Simple contradiction<br><br>
Examples:

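A negation plus a later positive mention of the same thing:
“No effusion … effusion present.”
The rule uses a short list of common chest findings (e.g., effusion, pneumonia, nodule) and looks for “no/without/absent/free of TERM” plus a second mention of TERM that the negation does not cover. As implemented, the negation word must sit directly before the term, so “no pleural effusion” (with a modifier in between) does not trigger the rule.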

Each sentence gets a binary label: 1 for “anomalous” if any rule fires, 0 otherwise. These labels drive training.<br><br>

###Why Train a Model If We Already Have Rules?

Two reasons:

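Noise smoothing: Rules are crude. A model can learn patterns that correlate with “problematic style” beyond the exact rule triggers, which may improve recall. With a training set this small and labels this noisy, treat that as something to test rather than assume.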

Generalization: Later you can add more sophisticated inputs (e.g., domain terms, section info) or switch to a transformer. The basic training loop stays the same.

In [1]:
# @title ⬇️ Install libraries
!pip -q install -U datasets transformers gradio openai anthropic scikit-learn tqdm

import os, re, random, json, math
import numpy as np
import torch
from torch import nn
from torch.utils.data import Dataset, DataLoader
from tqdm.auto import tqdm
from datasets import load_dataset
from sklearn.metrics import classification_report

SEED = 42
random.seed(SEED); np.random.seed(SEED); torch.manual_seed(SEED)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
device

device(type='cuda')

In [2]:
# @title 🔑 API keys (read from Colab “User secrets” or env) – optional
try:
    from google.colab import userdata  # available in Colab
    OPENAI_API_KEY = userdata.get('OPENAI_API_KEY') or os.environ.get('OPENAI_API_KEY','')
    ANTHROPIC_API_KEY = userdata.get('ANTHROPIC_API_KEY') or os.environ.get('ANTHROPIC_API_KEY','')
except Exception:
    OPENAI_API_KEY = os.environ.get('OPENAI_API_KEY','')
    ANTHROPIC_API_KEY = os.environ.get('ANTHROPIC_API_KEY','')

if OPENAI_API_KEY: os.environ['OPENAI_API_KEY'] = OPENAI_API_KEY
if ANTHROPIC_API_KEY: os.environ['ANTHROPIC_API_KEY'] = ANTHROPIC_API_KEY

print("OpenAI set:", bool(os.environ.get("OPENAI_API_KEY")))
print("Anthropic set:", bool(os.environ.get("ANTHROPIC_API_KEY")))

OpenAI set: True
Anthropic set: True


##Load IU X-Ray reports directly from OpenI (FINDINGS/IMPRESSION)

This block downloads a public archive of chest X-ray reports from OpenI and unpacks it on disk (so you don’t re-download every run). It walks through the extracted folders, finds all the XML files, and for each file it pulls out the AbstractText sections labeled FINDINGS and IMPRESSION. Those two sections are the most useful: findings = what’s observed, impression = the summary. It returns a list of small dictionaries like {"findings": "...", "impression": "..."}—one per report. If the download fails (e.g., firewall), it falls back to a tiny, hard-coded toy set so the rest of the notebook keeps working.
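
To make the parsing step concrete, here is a minimal sketch that pulls the two sections out of a single XML string. The sample XML is hand-written for illustration and only imitates the `AbstractText`/`Label` layout described above; real OpenI files contain more elements, and `parse_sections` is a hypothetical helper name, not part of the notebook.

```python
import xml.etree.ElementTree as ET

# Hypothetical, hand-written XML that imitates the AbstractText/Label layout.
SAMPLE_XML = """<eCitation>
  <MedlineCitation>
    <Article>
      <Abstract>
        <AbstractText Label="COMPARISON">None.</AbstractText>
        <AbstractText Label="FINDINGS">The heart size is normal. Lungs are clear.</AbstractText>
        <AbstractText Label="IMPRESSION">No acute disease.</AbstractText>
      </Abstract>
    </Article>
  </MedlineCitation>
</eCitation>"""

def parse_sections(xml_text):
    """Collect FINDINGS and IMPRESSION text from one report's XML string."""
    root = ET.fromstring(xml_text)
    out = {"findings": [], "impression": []}
    for node in root.findall(".//AbstractText"):
        label = (node.get("Label") or "").lower()
        text = "".join(node.itertext()).strip()
        # Any other label (e.g. COMPARISON) is skipped.
        if label in out and text:
            out[label].append(text)
    return {key: " ".join(chunks) for key, chunks in out.items()}

record = parse_sections(SAMPLE_XML)
print(record)
```

Sections with other labels are ignored, and a report with neither section yields two empty strings, which is why the loader drops such records.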

In [5]:
# @title 📥 Load IU X-Ray reports directly from OpenI (FINDINGS/IMPRESSION)
import os, tarfile, tempfile, glob, urllib.request, xml.etree.ElementTree as ET, json, random

OPENI_REPORTS_URL = "https://openi.nlm.nih.gov/imgs/collections/NLMCXR_reports.tgz"  # IU chest X-ray reports (XML)
CACHE_DIR = "/content/openi_iu_reports"   # cache so we don't re-download every run
os.makedirs(CACHE_DIR, exist_ok=True)

def _download_reports_tgz(dest_dir=CACHE_DIR, url=OPENI_REPORTS_URL, fname="NLMCXR_reports.tgz"):
    tgz_path = os.path.join(dest_dir, fname)
    if not os.path.exists(tgz_path):
        print(f"Downloading {url} → {tgz_path} ...")
        urllib.request.urlretrieve(url, tgz_path)
    else:
        print(f"Found cached tgz at {tgz_path}")
    return tgz_path

def _extract_if_needed(tgz_path, extract_dir=CACHE_DIR):
    # Extract only once
    marker = os.path.join(extract_dir, ".extracted")
    if os.path.exists(marker):
        print("Archive already extracted.")
        return extract_dir
    print(f"Extracting {tgz_path} ...")
    with tarfile.open(tgz_path, "r:gz") as tar:
        # Safe extraction
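        def is_within_directory(directory, target):
            abs_directory = os.path.abspath(directory)
            abs_target = os.path.abspath(target)
            # commonpath compares whole path components; commonprefix compares
            # characters and would accept a sibling such as "/content/openi_iu_reports_evil".
            return os.path.commonpath([abs_directory, abs_target]) == abs_directory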
        def safe_extract(tar, path=".", members=None, *, numeric_owner=False):
            for member in tar.getmembers():
                member_path = os.path.join(path, member.name)
                if not is_within_directory(path, member_path):
                    raise Exception("Attempted Path Traversal in Tar File")
            tar.extractall(path, members, numeric_owner=numeric_owner)
        safe_extract(tar, path=extract_dir)
    open(marker, "w").close()
    return extract_dir

def _iter_reports_xml(root_dir=CACHE_DIR, limit=None):
    # Many distributions extract to a folder named NLMCXR_reports/... with XMLs under ecgen-radiology/
    xml_paths = glob.glob(os.path.join(root_dir, "**", "*.xml"), recursive=True)
    if not xml_paths:
        # Try alternative: sometimes it extracts directly
        xml_paths = glob.glob(os.path.join(root_dir, "*.xml"))
    if limit:
        xml_paths = xml_paths[:limit]
    return xml_paths

def _parse_report_xml(xml_path):
    """Return dict {'findings': str, 'impression': str} from a single IU report XML."""
    try:
        tree = ET.parse(xml_path)
        root = tree.getroot()
        # IU OpenI reports typically use Abstract/AbstractText with Label attributes.
        findings_chunks, impression_chunks = [], []
        for abs_node in root.findall(".//AbstractText"):
            label = (abs_node.attrib.get("Label") or abs_node.attrib.get("label") or "").upper()
            text = "".join(abs_node.itertext()).strip()
            if not text:
                continue
            if label == "FINDINGS":
                findings_chunks.append(text)
            elif label == "IMPRESSION":
                impression_chunks.append(text)
        return {
            "findings": " ".join(findings_chunks).strip(),
            "impression": " ".join(impression_chunks).strip()
        }
    except Exception as e:
        # Bad XML or unexpected format
        return {"findings": "", "impression": ""}

def load_openi_iu_reports(max_reports=None):
    """Download → extract → parse IU X-Ray reports (OpenI).
       Returns a list of dicts with keys: findings, impression."""
    try:
        tgz = _download_reports_tgz()
        _extract_if_needed(tgz)
        rows = []
        for p in _iter_reports_xml(limit=max_reports):
            rec = _parse_report_xml(p)
            if rec["findings"] or rec["impression"]:
                rows.append(rec)
        if rows:
            return rows
    except Exception as e:
        print("⚠️ OpenI download/parse failed:", e)
    return None

# ---- Try OpenI first; then fallback to toy corpus so notebook keeps running ----
rows = load_openi_iu_reports(max_reports=4000)  # adjust to taste; set to None for all (~3,955 reports)
if not rows:
    print("⚠️ Using a tiny toy corpus to keep the notebook runnable.")
    rows = [
        {"findings": "The heart size is normal. The left costophrenic angle is sharp.",
         "impression": "No pneumothorax identified."},
        {"findings": "There is a 2 cm nodular opacity in the right upper lobe. The lesion measures . cm in diameter.",
         "impression": "Findings concerning for neoplasm."},
        {"findings": "No pleural effusion. There is pleural effusion on the left.",
         "impression": "Possible atelectasis."},
        {"findings": "Lungs are clear clear. Cardiomediastinal silhouette within normal limits.",
         "impression": "No acute cardiopulmonary disease."},
    ]

print(f"Loaded {len(rows)} reports.")
print("Sample:", rows[0])  # rows is never empty here because of the toy fallback

Downloading https://openi.nlm.nih.gov/imgs/collections/NLMCXR_reports.tgz → /content/openi_iu_reports/NLMCXR_reports.tgz ...
Extracting /content/openi_iu_reports/NLMCXR_reports.tgz ...

Loaded 3927 reports.
Sample: {'findings': 'The heart and lungs have XXXX XXXX in the interval. Both lungs are clear and expanded. Heart and mediastinum normal.', 'impression': 'No active disease.'}


##Heuristic labelers

We don’t have human-made labels, so we make simple rules (“heuristics”) to mark sentences that look suspicious. Three checks: missing measurement (units like “cm/mm” but no number nearby, or “measures …” with no number), duplicated phrase (exact repeats like “clear clear” or repeated 3-word chunks), and simple contradiction (e.g., “no effusion … effusion”). Each function returns True/False; a wrapper combines them and sets a binary label: 1 = anomaly if any rule fires, else 0. These labels are noisy but good enough to train a first model and to practice PyTorch.

In [6]:
# @title 🧪 Heuristic labelers
SPLIT_RE = re.compile(r'(?<=[\.\?!])\s+')

def split_sentences(text):
    text = (text or "").strip()
    if not text: return []
    sents = [s.strip() for s in SPLIT_RE.split(text) if s.strip()]
    return sents

FINDING_TERMS = [
    "pneumothorax","effusion","edema","pneumonia","consolidation",
    "fracture","atelectasis","nodule","opacity","embolus","infiltrate"
]

NEG_WORDS = r"(?:no|without|absent|free of)"

def has_missing_measurement(s):
    s_l = s.lower()
    if re.search(r'(?<!\d)\s(cm|mm)\b', s_l):  # ' cm' with no preceding digit
        return True
    if re.search(r'\bmeasures\b(?!\s*\d)', s_l):
        # If "measures" not followed by a number within a few tokens
        after = re.split(r'\bmeasures\b', s_l, 1)[-1]
        if not re.search(r'\d', after[:30]):  # crude window
            return True
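    # Bare decimal like ". cm": a dot with no digit before it, then the unit.
    # (The earlier r'\b\.' form needed a word character right before the dot,
    # so it could not match a dot that follows a space.)
    if re.search(r'(?<!\d)\.\s*(cm|mm)\b', s_l):
        return True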
    return False

def has_duplicate_phrase(s):
    toks = re.findall(r"[a-z0-9]+", s.lower())
    if len(toks) < 6:
        # handle obvious repetition like "clear clear"
        return any(toks[i]==toks[i+1] for i in range(len(toks)-1))
    # check repeating 3-grams
    grams = [" ".join(toks[i:i+3]) for i in range(len(toks)-2)]
    seen = set()
    for g in grams:
        if g in seen: return True
        seen.add(g)
    return False

def has_simple_contradiction(s):
    s_l = s.lower()
    for term in FINDING_TERMS:
        neg_pat = rf"\b{NEG_WORDS}\s+{term}\b"  # \b keeps "no" from matching inside a longer word
        pos_pat = rf"\b{term}\b"
        if re.search(neg_pat, s_l) and re.search(pos_pat, s_l):
            # ensure there is a positive mention not covered by the negation
            # crude: remove the first negated occurrence and look again
            rem = re.sub(neg_pat, "", s_l, count=1)
            if re.search(pos_pat, rem):
                return True
    return False

def label_sentence(s):
    mm = has_missing_measurement(s)
    dp = has_duplicate_phrase(s)
    ct = has_simple_contradiction(s)
    any_flag = mm or dp or ct
    label = 1 if any_flag else 0
    return label, {"missing_measurement": mm, "duplicate_phrase": dp, "contradiction": ct}
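
The contradiction rule only sees a negation that sits directly before the term, so phrases such as “no pleural effusion” pass through unflagged. The sketch below is one possible variant, not the notebook's rule: it allows a small number of modifier words between the negation and the term. `NEGATION`, `TERMS`, `MAX_GAP`, and `contradiction_with_gap` are names invented for this example, and the gap size of two is an arbitrary assumption.

```python
import re

# Names and values below are assumptions for this sketch only.
NEGATION = r"(?:no|without|absent|free of)"
TERMS = ["effusion", "pneumothorax", "nodule"]
MAX_GAP = 2  # allow up to two words between the negation and the term

def contradiction_with_gap(sentence, max_gap=MAX_GAP):
    """True if a term is negated once and also mentioned again elsewhere."""
    s = sentence.lower()
    for term in TERMS:
        neg = rf"\b{NEGATION}\s+(?:[a-z]+\s+){{0,{max_gap}}}{term}\b"
        if not re.search(neg, s):
            continue
        # Remove the first negated mention, then look for a remaining mention.
        remainder = re.sub(neg, " ", s, count=1)
        if re.search(rf"\b{term}\b", remainder):
            return True
    return False

print(contradiction_with_gap("No pleural effusion, but small effusion present on the left."))  # True
print(contradiction_with_gap("No pleural effusion."))                                          # False
print(contradiction_with_gap("No effusion or pneumothorax."))                                  # False
```

A wider gap catches more phrasings but also over-negates: “no change in effusion” would count as a negated mention here, so the trade-off is worth checking against real sentences before adopting anything like this.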

##Build sentence-level dataset with weak labels

Reports are long, so we split them into sentences (roughly by punctuation). For every sentence we run the heuristic labelers and store: the text, the final label (0/1), and which rule(s) matched. Because real data has far more normal sentences than anomalous ones, we downsample the negatives to a 1:1 ratio so the model sees a balanced mix during training. The result is a shuffled list you can split into train/validation.
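
The balancing step can be sketched on toy data as follows. This is an illustration under assumptions: `balance_one_to_one` is a hypothetical helper, and it uses its own seeded `random.Random` instance rather than the module-level shuffles in the notebook.

```python
import random

def balance_one_to_one(rows, seed=0):
    """Downsample the larger class so both labels appear equally often."""
    rng = random.Random(seed)          # local RNG; the global seed is untouched
    pos = [r for r in rows if r["label"] == 1]
    neg = [r for r in rows if r["label"] == 0]
    k = min(len(pos), len(neg))
    balanced = rng.sample(pos, k) + rng.sample(neg, k)
    rng.shuffle(balanced)
    return balanced

# Toy data: 20 normal sentences, 3 anomalous ones.
toy = [{"text": f"normal {i}", "label": 0} for i in range(20)]
toy += [{"text": f"odd {i}", "label": 1} for i in range(3)]

balanced = balance_one_to_one(toy)
print(len(balanced), sum(r["label"] for r in balanced))  # 6 3
```

The cost of a strict 1:1 ratio is that most negatives are thrown away, so the training set can be no larger than twice the number of positives.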

In [7]:
# @title 📦 Build sentence-level dataset with weak labels
all_rows = []
for r in rows:
    for field in ("findings","impression"):
        sents = split_sentences(r.get(field,""))
        for s in sents:
            y, details = label_sentence(s)
            all_rows.append({"text": s, "label": y, **details})

# Balance a bit (downsample negatives to 1:1)
pos = [r for r in all_rows if r["label"]==1]
neg = [r for r in all_rows if r["label"]==0]
k = min(len(pos), len(neg)) if len(pos)>0 else min(400, len(neg))
random.shuffle(pos); random.shuffle(neg)
data = (pos[:k] + neg[:k]) if k>0 else (pos + neg)
random.shuffle(data)

len(all_rows), len(data), sum(d['label'] for d in data)

(24426, 188, 94)

##Vocab + encoding

Neural nets work with numbers, not raw text. Here we tokenize each sentence into basic word-like pieces with a regex, count token frequencies, and build a vocabulary that maps tokens to integer IDs. We reserve <pad> (for padding shorter sentences) and <unk> (for out-of-vocab tokens). The encode() function turns new sentences into lists of IDs (capped at MAX_LEN). This is the classic “tokens → IDs” step for simple sequence models; token order is preserved, which the GRU relies on.
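
As a small worked illustration of the same idea, the sketch below builds a vocabulary from three toy sentences and encodes a new one. All `toy_`-prefixed names are invented for this example so they do not clash with the notebook's own variables, and the minimum frequency of 2 and length cap of 8 are illustrative values.

```python
import re
from collections import Counter

def toy_tokenize(text):
    # Same regex idea as the notebook: lowercase word-like pieces.
    return re.findall(r"[a-z0-9]+(?:'[a-z0-9]+)?", text.lower())

toy_corpus = [
    "The lungs are clear.",
    "The heart size is normal.",
    "Lungs are clear clear.",
]
toy_counts = Counter(t for s in toy_corpus for t in toy_tokenize(s))

# Index 0 is padding, index 1 is the unknown token; rare tokens are dropped.
toy_itos = ["<pad>", "<unk>"] + [w for w, c in toy_counts.most_common() if c >= 2]
toy_stoi = {w: i for i, w in enumerate(toy_itos)}

def toy_encode(text, max_len=8):
    ids = [toy_stoi.get(t, 1) for t in toy_tokenize(text)][:max_len]
    return ids or [1]  # never return an empty sequence

print(toy_itos)
print(toy_encode("The lungs are hazy."))
```

Tokens seen only once (“heart”, “size”, “is”, “normal”) and unseen tokens such as “hazy” all map to the `<unk>` ID, so the model cannot tell them apart.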

In [8]:
# @title 🔡 Vocab + encoding
MAX_VOCAB = 20000
MIN_FREQ  = 2
MAX_LEN   = 128

def tok(text):
    text = (text or "").lower()
    return re.findall(r"[a-z0-9]+(?:'[a-z0-9]+)?", text)

from collections import Counter
ctr = Counter()
for r in data:
    ctr.update(tok(r["text"]))
itos = ["<pad>","<unk>"] + [w for w,c in ctr.most_common(MAX_VOCAB) if c>=MIN_FREQ]
stoi = {w:i for i,w in enumerate(itos)}
PAD_ID, UNK_ID = stoi["<pad>"], stoi["<unk>"]

def encode(text):
    ids = [stoi.get(t, UNK_ID) for t in tok(text)][:MAX_LEN]
    return ids if ids else [UNK_ID]

##Dataset / DataLoader

PyTorch expects two things: a Dataset that knows how to fetch one sample, and a DataLoader that batches samples efficiently. Our Dataset returns (ids_tensor, label_tensor) for a given index. The custom collate function pads sequences in the batch to the same length and also returns a lengths tensor so the model knows where padding starts. Finally, we create train and valid loaders with reasonable batch sizes and shuffling for the training set.
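
For comparison, PyTorch ships `torch.nn.utils.rnn.pad_sequence`, which performs the same padding as a hand-written loop. The sketch below is an alternative shown for illustration, not what the notebook's collate function uses; the padding ID of 0 and the toy sequences are assumptions.

```python
import torch
from torch.nn.utils.rnn import pad_sequence

PAD = 0  # assumed padding id for this sketch

# Three "sentences" of different lengths, already encoded as IDs.
seqs = [torch.tensor([5, 6, 7]), torch.tensor([8]), torch.tensor([9, 10])]

# Keep the true lengths so a model can ignore the padded positions later.
seq_lengths = torch.tensor([len(s) for s in seqs])
padded = pad_sequence(seqs, batch_first=True, padding_value=PAD)

print(padded.tolist())       # [[5, 6, 7], [8, 0, 0], [9, 10, 0]]
print(seq_lengths.tolist())  # [3, 1, 2]
```

Padding to the longest sequence in each batch, rather than to a global maximum, keeps batches of short sentences small.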

In [9]:
# @title 🧰 Dataset / DataLoader
class SentDS(Dataset):
    def __init__(self, rows): self.rows=rows
    def __len__(self): return len(self.rows)
    def __getitem__(self, idx):
        r = self.rows[idx]
        return torch.tensor(encode(r["text"]), dtype=torch.long), torch.tensor(r["label"], dtype=torch.long)

def collate(batch):
    xs, ys = zip(*batch)
    lens = torch.tensor([len(x) for x in xs], dtype=torch.long)
    maxlen = int(lens.max().item())
    pad = torch.full((len(xs), maxlen), PAD_ID, dtype=torch.long)
    for i,x in enumerate(xs): pad[i,:len(x)] = x
    return pad, torch.stack(ys), lens

# Split
n = len(data)
i = int(0.85*n)
train_rows, valid_rows = data[:i], data[i:]
train_loader = DataLoader(SentDS(train_rows), batch_size=32, shuffle=True, collate_fn=collate)
valid_loader = DataLoader(SentDS(valid_rows), batch_size=64, shuffle=False, collate_fn=collate)
len(train_rows), len(valid_rows)

(159, 29)

##GRU classifier

This is a small, readable sequence model. It looks up embeddings for tokens (nn.Embedding), runs a bidirectional GRU over the sequence (reads left-to-right and right-to-left), and takes the final hidden states from both directions, concatenated, as a compact sentence summary. A little dropout adds regularization. A final linear layer converts that summary to two numbers (logits) for the classes “OK” and “ANOMALY.” Think of it as “embed → read → summarize → decide.”
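
The “embed → read → summarize → decide” flow is easiest to follow by tracking tensor shapes. The sketch below uses made-up toy sizes and random inputs purely to show the shapes at each step; it is not the notebook's model class.

```python
import torch
from torch import nn

torch.manual_seed(0)
B, T, V, E, H = 3, 5, 50, 8, 4   # toy batch, max length, vocab, embedding, hidden sizes
true_lengths = [5, 2, 3]

# Random token IDs, with id 0 playing the role of <pad> after each true length.
x = torch.randint(1, V, (B, T))
for row, n in enumerate(true_lengths):
    x[row, n:] = 0

emb = nn.Embedding(V, E, padding_idx=0)
gru = nn.GRU(E, H, batch_first=True, bidirectional=True)
fc = nn.Linear(2 * H, 2)

# Packing tells the GRU where each sequence really ends.
packed = nn.utils.rnn.pack_padded_sequence(
    emb(x), torch.tensor(true_lengths), batch_first=True, enforce_sorted=False
)
_, h = gru(packed)                         # h: [2, B, H] (forward, backward)
summary = torch.cat([h[0], h[1]], dim=1)   # [B, 2H]
logits = fc(summary)                       # [B, 2]

print(tuple(h.shape), tuple(summary.shape), tuple(logits.shape))
# (2, 3, 4) (3, 8) (3, 2)
```

Because the sequences are packed, the forward direction's final state comes from each sentence's last real token rather than from trailing padding.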

In [10]:
# @title 🧠 GRU classifier
class GRUClassifier(nn.Module):
    def __init__(self, vocab_size, emb=128, hid=128, pad_idx=0, dr=0.2):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb, padding_idx=pad_idx)
        self.gru = nn.GRU(emb, hid, batch_first=True, bidirectional=True)
        self.dropout = nn.Dropout(dr)
        self.fc = nn.Linear(hid*2, 2)  # binary
    def forward(self, x, lengths):
        e = self.emb(x)
        packed = nn.utils.rnn.pack_padded_sequence(e, lengths.cpu(), batch_first=True, enforce_sorted=False)
        _, h = self.gru(packed)  # h: [2, B, H]
        h = torch.cat([h[0], h[1]], dim=1)  # [B, 2H]
        h = self.dropout(h)
        return self.fc(h)

model = GRUClassifier(len(itos), pad_idx=PAD_ID).to(device)
sum(p.numel() for p in model.parameters())/1e6

0.219266

##Train

The training loop runs for up to six epochs. For each batch we: move tensors to the device (CPU/GPU), compute logits, compute cross-entropy loss, backpropagate (loss.backward()), clip gradients to avoid explosions, and step the optimizer (AdamW). After each epoch we run a validation pass, and whenever accuracy improves we save the weights to gru_best.pt; training stops early once accuracy has failed to improve for three epochs in a row. Note that the model in memory, and the classification report printed at the end, reflect the last epoch run rather than the saved best checkpoint. This is the standard PyTorch training pattern you’ll reuse in many projects.
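
The patience logic can be isolated from the model and replayed on a plain list of numbers. In the sketch below, `run_early_stopping` is a hypothetical helper, and the accuracy list is the sequence of validation accuracies printed by the run recorded in this notebook.

```python
def run_early_stopping(val_accs, patience=3):
    """Replay per-epoch validation accuracies through a patience counter."""
    best, wait, best_epoch = 0.0, 0, None
    for epoch, acc in enumerate(val_accs, start=1):
        if acc > best:
            # Strict improvement: remember it and reset the counter.
            best, wait, best_epoch = acc, 0, epoch
        else:
            wait += 1
            if wait >= patience:
                return {"best": best, "best_epoch": best_epoch, "stopped_at": epoch}
    return {"best": best, "best_epoch": best_epoch, "stopped_at": None}

# Validation accuracies from the recorded run (epochs 1 to 6).
print(run_early_stopping([0.862, 0.862, 0.931, 0.931, 0.931, 0.931]))
# {'best': 0.931, 'best_epoch': 3, 'stopped_at': 6}
```

Because the comparison is strict, a plateau counts as “no improvement”: the best checkpoint dates from epoch 3, and the three flat epochs that follow trigger the stop.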

In [11]:
# @title 🚂 Train
criterion = nn.CrossEntropyLoss()
opt = torch.optim.AdamW(model.parameters(), lr=2e-3)
best, patience, wait = 0.0, 3, 0

def evaluate(loader):
    model.eval()
    tot, correct, y_true, y_pred = 0,0,[],[]
    with torch.no_grad():
        for X,y,L in loader:
            X,y,L = X.to(device), y.to(device), L.to(device)
            logits = model(X,L)
            preds = logits.argmax(1)
            correct += (preds==y).sum().item()
            tot += y.size(0)
            y_true += y.cpu().tolist(); y_pred += preds.cpu().tolist()
    return correct/tot, y_true, y_pred

EPOCHS=6
for ep in range(1, EPOCHS+1):
    model.train()
    pbar = tqdm(train_loader, desc=f"Epoch {ep}/{EPOCHS}")
    for X,y,L in pbar:
        X,y,L = X.to(device), y.to(device), L.to(device)
        opt.zero_grad()
        loss = criterion(model(X,L), y)
        loss.backward()
        nn.utils.clip_grad_norm_(model.parameters(), 1.0)
        opt.step()
        pbar.set_postfix(loss=float(loss.item()))
    acc, yt, yp = evaluate(valid_loader)
    print(f"valid acc: {acc:.3f}")
    if acc>best:
        best, wait = acc, 0
        torch.save(model.state_dict(), "gru_best.pt")
    else:
        wait += 1
        if wait>=patience:
            print("Early stop."); break

print("Best valid acc:", best)
print(classification_report(yt, yp, digits=3, target_names=["OK","ANOMALY"]))

Epoch 1/6:   0%|          | 0/5 [00:00<?, ?it/s]

valid acc: 0.862


Epoch 2/6:   0%|          | 0/5 [00:00<?, ?it/s]

valid acc: 0.862


Epoch 3/6:   0%|          | 0/5 [00:00<?, ?it/s]

valid acc: 0.931


Epoch 4/6:   0%|          | 0/5 [00:00<?, ?it/s]

valid acc: 0.931


Epoch 5/6:   0%|          | 0/5 [00:00<?, ?it/s]

valid acc: 0.931


Epoch 6/6:   0%|          | 0/5 [00:00<?, ?it/s]

valid acc: 0.931
Early stop.
Best valid acc: 0.9310344827586207
              precision    recall  f1-score   support

          OK      0.857     1.000     0.923        12
     ANOMALY      1.000     0.882     0.938        17

    accuracy                          0.931        29
   macro avg      0.929     0.941     0.930        29
weighted avg      0.941     0.931     0.932        29



##Classify a sentence

A small helper to use the trained model on one sentence. It runs the same tokenization/encoding as training, builds a tiny 1-item batch, and calls the model in eval() mode (no gradients). It applies softmax to turn logits into probabilities and returns both the predicted class and the confidence. This is what you’ll call from a UI or API.
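
The logits-to-probabilities step is just a softmax followed by an argmax. The sketch below spells it out in plain Python with made-up logits; the class order `["OK", "ANOMALY"]` follows the notebook, while `softmax` and `CLASS_NAMES` are names chosen for this example.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)                         # subtract the max to avoid overflow
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

CLASS_NAMES = ["OK", "ANOMALY"]
example_logits = [2.0, -1.0]                # made-up values for illustration

probs = softmax(example_logits)
pred = max(range(len(probs)), key=probs.__getitem__)
print(CLASS_NAMES[pred], [round(p, 3) for p in probs])
# OK [0.953, 0.047]
```

The probability here measures how far apart the two logits are, not how often the model is right, so a confident score can still be a wrong prediction.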

In [12]:
# @title 🔎 Classify a sentence
import torch.nn.functional as F

def encode_batch_text(texts):
    xs = [torch.tensor(encode(t), dtype=torch.long) for t in texts]
    lens = torch.tensor([len(x) for x in xs], dtype=torch.long)
    maxlen = int(lens.max().item())
    pad = torch.full((len(xs), maxlen), PAD_ID, dtype=torch.long)
    for i,x in enumerate(xs): pad[i,:len(x)] = x
    return pad, lens

def predict_sentence(text):
    model.eval()
    with torch.no_grad():
        X,L = encode_batch_text([text])
        X,L = X.to(device), L.to(device)
        logits = model(X,L)[0]
        probs = F.softmax(logits, dim=0).cpu().numpy()
    return {"label": int(np.argmax(probs)), "probs": {"OK": float(probs[0]), "ANOMALY": float(probs[1])}}

##OpenAI & Claude helpers (optional)

These functions are post-processing tools driven by LLMs. If a sentence looks anomalous, we ask OpenAI to rewrite it in clean clinical style without changing facts, and ask Claude to explain it in simple language (ELI5). Both are optional: if a key isn’t set, the corresponding helper returns a placeholder string instead of calling the API. Prompts are short and strict to reduce hallucinations (e.g., “don’t invent findings”).

In [13]:
# @title 🤝 OpenAI & Claude helpers (optional)
def rewrite_with_openai(sentence):
    if not os.environ.get("OPENAI_API_KEY"):
        return "(OpenAI key not set)"
    from openai import OpenAI
    client = OpenAI()
    prompt = (
        "You are a radiology copy editor. Rewrite the sentence in precise, conventional clinical style, "
        "keeping the same factual content. Do not invent findings.\n"
        f"Sentence: {sentence}\n"
        "Return only the rewritten sentence."
    )
    rsp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role":"user","content":prompt}],
        temperature=0.2,
    )
    return rsp.choices[0].message.content.strip()

def eli5_with_claude(sentence):
    if not os.environ.get("ANTHROPIC_API_KEY"):
        return "(Anthropic key not set)"
    from anthropic import Anthropic
    cl = Anthropic()
    prompt = (
        "Explain like I'm five what this radiology sentence is trying to say, "
        "without adding medical claims that aren't there. Keep it 2 sentences max.\n"
        f"Sentence: {sentence}"
    )
    rsp = cl.messages.create(
        model="claude-3-haiku-20240307",
        max_tokens=200,
        temperature=0.2,
        messages=[{"role":"user","content":prompt}]
    )
    return rsp.content[0].text.strip()

##End-to-end checker

One function to glue everything together for a single sentence. It runs the heuristics and the model; if either says “anomaly,” it calls the LLM helpers to produce a suggested rewrite and a gentle explanation. The output is a tidy JSON-like dict you can print, log, or feed to a front-end. This is also handy for unit tests or quick experiments.
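
The “if either says anomaly” decision can be exercised without a trained model or API keys by passing the pieces in as functions. The sketch below is a simplified stand-in: `run_check` and the `stub_` functions are hypothetical, and the stub classifier always answers “OK” to show that a rule hit alone is enough to trigger the helpers.

```python
def run_check(sentence, rules, classify, helpers):
    """Combine rule flags and a model prediction; call helpers if either flags."""
    flags = rules(sentence)
    pred = classify(sentence)
    result = {"heuristics": flags, "model": pred}
    if pred["label"] == 1 or any(flags.values()):
        for name, helper in helpers.items():
            result[name] = helper(sentence)
    return result

# Stubs standing in for the real rules, classifier, and LLM calls.
def stub_rules(sentence):
    words = sentence.lower().rstrip(".").split()
    return {"duplicate_phrase": any(a == b for a, b in zip(words, words[1:]))}

def stub_classifier(sentence):
    return {"label": 0}  # always predicts OK

stub_helpers = {"rewrite": lambda s: f"(rewrite of: {s})"}

flagged = run_check("Lungs are clear clear.", stub_rules, stub_classifier, stub_helpers)
clean = run_check("The heart size is normal.", stub_rules, stub_classifier, stub_helpers)
print("rewrite" in flagged, "rewrite" in clean)  # True False
```

Passing the components as arguments keeps the glue logic testable on its own, which matters here because the LLM calls cost money and are not deterministic.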

In [14]:
# @title 🧩 End-to-end checker
def check_sentence(sentence):
    label, details = label_sentence(sentence)
    clf = predict_sentence(sentence)
    out = {
        "heuristics": details,
        "model": clf,
    }
    if clf["label"]==1 or any(details.values()):
        out["openai_rewrite"] = rewrite_with_openai(sentence)
        out["claude_eli5"] = eli5_with_claude(sentence)
    return out

# quick smoke tests
tests = [
    "There is pleural effusion on the left. No pleural effusion is seen.",
    "Lesion measures . cm in diameter.",
    "Lungs are clear clear.",
    "The heart size is normal."
]
for t in tests:
    print(t, "->", check_sentence(t))

There is pleural effusion on the left. No pleural effusion is seen. -> {'heuristics': {'missing_measurement': False, 'duplicate_phrase': False, 'contradiction': False}, 'model': {'label': 0, 'probs': {'OK': 0.7606035470962524, 'ANOMALY': 0.23939652740955353}}}
Lesion measures . cm in diameter. -> {'heuristics': {'missing_measurement': True, 'duplicate_phrase': False, 'contradiction': False}, 'model': {'label': 0, 'probs': {'OK': 0.9987999200820923, 'ANOMALY': 0.0012000707210972905}}, 'openai_rewrite': 'The lesion measures X cm in diameter.', 'claude_eli5': "This sentence is describing the size of a spot or area that the doctor has found. It's telling us how wide that spot is, in centimeters."}
Lungs are clear clear. -> {'heuristics': {'missing_measurement': False, 'duplicate_phrase': True, 'contradiction': False}, 'model': {'label': 0, 'probs': {'OK': 0.9988040924072266, 'ANOMALY': 0.0011959095718339086}}, 'openai_rewrite': 'Lungs are clear.', 'claude_eli5': "The sentence is saying tha

##Gradio demo (paste a sentence)

A minimal web UI so you can try sentences without writing extra code. You paste text into a box; the app calls the end-to-end checker and shows the structured JSON result (rules fired, model prediction, optional rewrite, ELI5). Gradio handles the server and interface, so you can focus on the ML logic. It’s perfect for quick demos and for sanity-checking model changes in real time.

In [15]:
# @title 🖥️ Gradio demo (paste a sentence)
import gradio as gr

def gradio_check(s):
    res = check_sentence(s)
    return json.dumps(res, indent=2)

demo = gr.Interface(
    fn=gradio_check,
    inputs=gr.Textbox(label="Radiology sentence", lines=3, placeholder="Paste a single sentence from a report..."),
    outputs=gr.Code(label="Result (JSON)"),
    title="Radiology Error Checker (demo)",
    description="Educational demo. Flags simple anomalies and drafts a rewrite/ELI5."
)

demo.launch(share=False)

Colab notebook detected. To show errors in colab notebook, set debug=True in launch()
Note: opening Chrome Inspector may crash demo inside Colab notebooks.
* To create a public link, set `share=True` in `launch()`.

In [None]:
# @title 💾 Save vocab + labels + weights
art = {
    "itos": itos,
    "stoi": stoi,
    "pad_id": PAD_ID,
    "unk_id": UNK_ID,
}
with open("vocab.json","w") as f: json.dump(art, f)
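# gru_best.pt was already written by the training loop at the best validation epoch.
# Re-saving model.state_dict() here would overwrite it with the last-epoch weights,
# so only save when no checkpoint exists yet.
if not os.path.exists("gru_best.pt"):
    torch.save(model.state_dict(), "gru_best.pt")
print("Saved vocab.json; weights are in gru_best.pt")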
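
To use the saved artifacts in a later session, the vocabulary has to be read back and turned into a lookup table again. The sketch below round-trips a toy artifact through a temporary file; the four-token vocabulary and the helper name `encode_with` are made up for illustration.

```python
import json
import os
import tempfile

# Toy artifact with the same key layout as vocab.json.
toy_artifact = {"itos": ["<pad>", "<unk>", "lungs", "clear"], "pad_id": 0, "unk_id": 1}
toy_path = os.path.join(tempfile.mkdtemp(), "vocab.json")
with open(toy_path, "w") as f:
    json.dump(toy_artifact, f)

# Later session: load the file and rebuild the token -> id mapping from itos.
with open(toy_path) as f:
    loaded = json.load(f)
loaded_stoi = {token: i for i, token in enumerate(loaded["itos"])}

def encode_with(stoi_map, unk_id, tokens):
    """Map tokens to IDs, sending unknown tokens to unk_id."""
    return [stoi_map.get(t, unk_id) for t in tokens]

ids = encode_with(loaded_stoi, loaded["unk_id"], ["lungs", "hazy", "clear"])
print(ids)  # [2, 1, 3]
```

Since the mapping can be rebuilt from `itos` alone, storing `stoi` as well is redundant but harmless. The weights would be restored separately with `model.load_state_dict(torch.load("gru_best.pt", map_location=device))` on a `GRUClassifier` built with the same vocabulary size.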