Detect medical jargon while you read any article and get simple explanations—right on the page.
- Targets: Medical-domain terms (e.g., diseases, procedures, signs/symptoms, medications)
- Extension: Highlights jargon in webpages and shows tooltips/popovers with plain-language explanations
- Backend (optional): FastAPI endpoint for inference & definitions; or fully offline via exported model weights
- On-page scanning: Extracts article text from the DOM and tokenizes/splits into phrases.
- Jargon detection: SVM classifier trained on labeled biomedical spans.
- Explain in plain English: Uses rule-based templates + public-source glossaries (e.g., MedlinePlus/NHS/NCIt—configure attribution).
- Two deployment modes:
  - Offline mode: Run TF-IDF + SVM inference in the browser from the exported model weights.
  - API mode: Send text to a FastAPI server for prediction + definitions.
- Privacy-first: No data leaves the page in offline mode. Opt-in telemetry only.
.
├─ extension/ # Chrome/Edge extension (React/Plasmo or vanilla)
│ ├─ src/
│ │ ├─ content.tsx # Scans page, highlights terms, popovers
│ │ ├─ ui/ # Tooltip/Popup components
│ │ ├─ ml/
│ │ │ ├─ model.json # Exported SVM weights (coef, intercept)
│ │ │ ├─ vocab.json # TF-IDF vocabulary
│ │ │ └─ tfidf_stats.json # idf, norms, etc.
│ │ ├─ utils/text.ts # DOM extraction, sentence/phrase splitting
│ │ └─ utils/infer.ts # TF-IDF + SVM inference (JS)
│ └─ manifest.json
│
├─ backend/
│ ├─ app.py # FastAPI (predict/define endpoints)
│ ├─ artifacts/
│ │ ├─ vectorizer.pkl
│ │ ├─ svm.pkl
│ │ └─ label_map.json
│ └─ requirements.txt
│
├─ model/
│ ├─ train.ipynb # E2E training notebook
│ ├─ train.py # Scripted training
│ ├─ export_js.py # Export sklearn artifacts → JSON for extension
│ └─ data/ # Your datasets (spans/tokens/labels)
│
└─ README.md
# create env
python -m venv .venv && source .venv/bin/activate # (Windows: .venv\Scripts\activate)
pip install -r backend/requirements.txt # includes scikit-learn, fastapi, uvicorn, pandas, joblib
# train SVM (uses TF-IDF + LinearSVC)
python model/train.py --train data/train.csv --dev data/dev.csv --out backend/artifacts
What train.py does:
- cleans text, lowercases, strips punctuation/stopwords
- builds TfidfVectorizer(ngram_range=(1,3), min_df=2, sublinear_tf=True)
- fits LinearSVC (good margin classifier for sparse features)
- saves: vectorizer.pkl, svm.pkl, label_map.json
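The training step described above can be sketched roughly as follows. This is a minimal illustration, not the actual contents of train.py: the function names `train_jargon_classifier` and `save_artifacts` are hypothetical, the input is assumed to be a list of candidate phrases with one label each, and the custom cleaning/stopword step and label_map.json handling are left out.

```python
import os

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC


def train_jargon_classifier(phrases, labels, min_df=2):
    """Fit TF-IDF + LinearSVC on (phrase, label) pairs.

    `phrases` is a list of strings; `labels` holds the matching class
    names, with "O" standing for non-jargon (an assumed convention).
    """
    vectorizer = TfidfVectorizer(
        lowercase=True,
        ngram_range=(1, 3),  # unigrams, bigrams, trigrams
        min_df=min_df,       # drop n-grams seen in fewer than min_df phrases
        sublinear_tf=True,   # use 1 + log(tf) instead of raw counts
    )
    X = vectorizer.fit_transform(phrases)
    svm = LinearSVC()
    svm.fit(X, labels)
    return vectorizer, svm


def save_artifacts(vectorizer, svm, out_dir):
    """Persist the fitted objects under the file names the README lists."""
    os.makedirs(out_dir, exist_ok=True)
    joblib.dump(vectorizer, os.path.join(out_dir, "vectorizer.pkl"))
    joblib.dump(svm, os.path.join(out_dir, "svm.pkl"))
```

Something like `save_artifacts(*train_jargon_classifier(phrases, labels), "backend/artifacts")` would then produce the two pickles that the export step reads.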
python model/export_js.py --vec backend/artifacts/vectorizer.pkl --svm backend/artifacts/svm.pkl --labels backend/artifacts/label_map.json --out extension/src/ml
This writes:
- vocab.json (term → index)
- tfidf_stats.json (idf, norms, options)
- model.json (SVM coef, intercept, classes)
The extension runs TF-IDF and the linear decision function in JS for zero-server inference.
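As a rough idea of what such an export could look like, the sketch below pulls the fitted attributes out of the scikit-learn objects and converts them to plain JSON types. The function names and the exact keys inside each file are assumptions for illustration; the real export_js.py may lay the files out differently.

```python
import json
import os


def export_artifacts(vectorizer, svm):
    """Turn fitted scikit-learn objects into JSON-serializable dicts."""
    # vocabulary_ maps each n-gram to its column index.
    vocab = {term: int(idx) for term, idx in vectorizer.vocabulary_.items()}
    tfidf_stats = {
        "idf": [float(v) for v in vectorizer.idf_],
        "sublinear_tf": bool(vectorizer.sublinear_tf),
        "ngram_range": [int(n) for n in vectorizer.ngram_range],
        "norm": vectorizer.norm,
    }
    # For a binary problem LinearSVC stores a single coef_ row; a positive
    # decision value then means classes_[1].
    model = {
        "coef": [[float(w) for w in row] for row in svm.coef_],
        "intercept": [float(b) for b in svm.intercept_],
        "classes": [str(c) for c in svm.classes_],
    }
    return {"vocab.json": vocab, "tfidf_stats.json": tfidf_stats, "model.json": model}


def write_artifacts(files, out_dir):
    """Write each dict to out_dir under its file name."""
    os.makedirs(out_dir, exist_ok=True)
    for name, payload in files.items():
        with open(os.path.join(out_dir, name), "w", encoding="utf-8") as fh:
            json.dump(payload, fh)
```

Converting every value with `int`, `float`, or `str` matters because NumPy scalars are not accepted by `json.dump` directly.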
If you use Plasmo:
cd extension
npm i
npm run dev # builds and watches
# Load the generated build in Chrome: chrome://extensions (Developer mode → Load unpacked)
Edge: same as Chrome (edge://extensions).
- DOM Extraction
  - content.tsx finds main article nodes (heuristics: <article>, large <div> blocks).
  - Merges contiguous text nodes, preserves paragraph breaks.
- Tokenization & Candidate Phrases
  - Lowercase, strip punctuation.
  - Generate unigrams/bigrams/trigrams (windowed).
  - Optional filters: stopwords (and, or, of, was, with…) are dropped unless part of a learned phrase.
- Vectorization (TF-IDF)
  - Uses exported vocab.json + tfidf_stats.json.
  - Builds a sparse vector in JS mirroring scikit-learn’s preprocessing.
- Classification (Linear SVM)
  - Applies decision = X · coef.T + intercept.
  - One-vs-rest or direct LinearSVC classes → select jargon label or O (non-jargon).
  - Thresholding: optional margin threshold to trade precision/recall.
- Explanation
  - If API mode is enabled, the extension calls /define?term=… to fetch a short definition from configured sources.
  - Otherwise, local rules + bundled mini-glossary (JSON) generate simplified definitions.
- UI
  - Highlight spans with a subtle underline.
  - Hover/click → tooltip with “What it means” (plain language), “Also called”, and “Learn more” (attribution link).
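The vectorization and classification steps can be sketched without any libraries. The version below is written in Python for readability and is only a reference for what utils/infer.ts would need to reproduce; the function names are hypothetical, and it assumes scikit-learn's default token pattern, the sublinear TF option, and L2 normalization.

```python
import math
import re
from collections import Counter

TOKEN_RE = re.compile(r"\b\w\w+\b")  # scikit-learn's default token_pattern


def ngrams(tokens, max_n=3):
    """Yield all word n-grams (n = 1..max_n) as space-joined strings."""
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            yield " ".join(tokens[i:i + n])


def tfidf_vector(phrase, vocab, idf, sublinear_tf=True):
    """Sparse TF-IDF vector {feature_index: weight}, L2-normalized."""
    tokens = TOKEN_RE.findall(phrase.lower())
    counts = Counter(vocab[g] for g in ngrams(tokens) if g in vocab)
    vec = {}
    for idx, tf in counts.items():
        weight = 1.0 + math.log(tf) if sublinear_tf else float(tf)
        vec[idx] = weight * idf[idx]
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {i: w / norm for i, w in vec.items()} if norm > 0 else vec


def decision_scores(vec, coef, intercept):
    """decision = X · coef.T + intercept, one score per row of coef."""
    return [sum(w * row[i] for i, w in vec.items()) + b
            for row, b in zip(coef, intercept)]


def predict_label(scores, classes, threshold=0.0):
    """Pick a label; fall back to "O" when the best margin is too small."""
    if len(scores) == 1:
        # Binary LinearSVC: a positive score means classes[1].
        if scores[0] > 0:
            label, score = classes[1], scores[0]
        else:
            label, score = classes[0], -scores[0]
    else:
        score = max(scores)
        label = classes[scores.index(score)]
    return (label, score) if score >= threshold else ("O", score)
```

With `vocab = {"dyspnea": 0, "weather": 1}`, `idf = [1.0, 1.0]`, `coef = [[0.9, -0.9]]`, `intercept = [0.0]` and `classes = ["O", "SIGN_SYMPTOM"]`, the phrase "Acute dyspnea" vectorizes to `{0: 1.0}` and is labeled `SIGN_SYMPTOM` with margin 0.9. Raising `threshold` to 1.0 turns the same hit into `O`, which is the precision/recall trade-off mentioned above.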
Use any medically labeled spans dataset you have permission to use. Recommended label focus:
BIOLOGICAL_STRUCTURE, DIAGNOSTIC_PROCEDURE, DISEASE_DISORDER, MEDICATION, SIGN_SYMPTOM, THERAPEUTIC_PROCEDURE
In train.py, map your dataset’s labels to the above set via label_map.json. Everything else can map to O (non-jargon).
python model/train.py --eval_only --train data/train.csv --dev data/dev.csv --out backend/artifacts
Report:
- Precision / Recall / F1 per label
- Macro-F1 and Micro-F1
- Confusion matrix (optional)
- Threshold sweep (if you apply probability calibration via LinearSVC + Platt/CalibratedClassifierCV)
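For reference, the per-label and aggregate scores in the report can be computed as in the sketch below. The function name `prf_report` is hypothetical, and excluding the `O` label from the averages is an assumption (a common convention for span tagging), not something train.py is known to do.

```python
from collections import Counter


def prf_report(gold, pred, ignore_label="O"):
    """Per-label precision/recall/F1 plus macro- and micro-F1."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted p where it was wrong
            fn[g] += 1  # missed the gold label g
    labels = sorted((set(gold) | set(pred)) - {ignore_label})

    def prf(t, f_pos, f_neg):
        precision = t / (t + f_pos) if t + f_pos else 0.0
        recall = t / (t + f_neg) if t + f_neg else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        return precision, recall, f1

    per_label = {lab: prf(tp[lab], fp[lab], fn[lab]) for lab in labels}
    # Macro-F1 averages the per-label F1 values with equal weight.
    macro_f1 = sum(f1 for _, _, f1 in per_label.values()) / len(labels) if labels else 0.0
    # Micro-F1 pools the counts across labels before computing F1.
    micro_f1 = prf(sum(tp[l] for l in labels),
                   sum(fp[l] for l in labels),
                   sum(fn[l] for l in labels))[2]
    return {"per_label": per_label, "macro_f1": macro_f1, "micro_f1": micro_f1}
```

Macro-F1 is the more sensitive of the two to rare labels, so a drop there with a stable micro-F1 usually points at one of the smaller classes.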
FastAPI server for inference + definitions.
uvicorn backend.app:app --reload --port 8000
Endpoints
- POST /predict → { text: str } ⇒ [{ span, label, score }]
- GET /define?term=dyspnea → { term, definition, source_url, attribution }
Extension config
extension/src/config.ts:
export const INFERENCE_MODE: "offline" | "api" = "offline"
export const API_BASE = "http://localhost:8000"
- Linear SVMs are a strong baseline for text classification on high-dimensional, sparse TF-IDF features.
- Fast to train, tiny to ship (just coef/intercept + vocab), easy to run on-device.
- Great baseline for hackathons; can upgrade later to a small transformer if needed.
- Thresholds: extension/src/ml/thresholds.ts (per-label or global margin cut-off)
- Stopwords: extension/src/ml/stopwords.json (e.g., and, or, of, was, with, etc.)
- Glossary: extension/src/data/glossary.json for offline definitions
- Attribution: Add source notes for external definitions (e.g., MedlinePlus/NHS)
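To make the per-label versus global cut-off concrete, here is one way the lookup could work, sketched in Python. The function names and the rule that a per-label value overrides the global one are assumptions; the hit shape `{ span, label, score }` follows the /predict response.

```python
def passes_threshold(label, score, per_label=None, global_cutoff=0.0):
    """Return True when `score` clears the cut-off for `label`.

    A per-label value wins over the global one when both are present.
    """
    cutoff = (per_label or {}).get(label, global_cutoff)
    return score >= cutoff


def filter_hits(hits, per_label=None, global_cutoff=0.0):
    """Keep only hits whose margin clears the applicable cut-off."""
    return [h for h in hits
            if passes_threshold(h["label"], h["score"], per_label, global_cutoff)]
```

A stricter cut-off for a noisy label (for example `{"BIOLOGICAL_STRUCTURE": 0.8}`) reduces false highlights for that label without affecting the others.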
# Python
black model backend
pytest -q
# Extension
npm run dev
npm run build
Packaging for store:
- Chrome Web Store: upload zipped /build (ensure correct manifest.json).
- Edge Add-ons: similar submission; test in Edge beforehand.
- Add probabilistic calibration + confidence bins in tooltips
- Contextual grouping of multi-word spans (merge overlapping hits)
- Caching definitions + inline citations
- On-device tiny transformer (Distil/ELI5-style simplifier) as optional module
- i18n (English ↔ Vietnamese)
PRs welcome! Please:
- Open an issue describing the change.
- Include evaluation diffs (metrics) if you modify the model.
- Keep extension bundle size minimal.