khanhtuanvo/hackathon

MedJar – Medical Jargon Detector & Simplifier (WebExtension + ML)

Detect medical jargon while you read any article and get simple explanations—right on the page.

  • Targets: Medical-domain terms (e.g., diseases, procedures, signs/symptoms, medications)
  • Extension: Highlights jargon in webpages and shows tooltips/popovers with plain-language explanations
  • Backend (optional): FastAPI endpoint for inference & definitions; or fully offline via exported model weights

✨ Features

  • On-page scanning: Extracts article text from the DOM and tokenizes/splits into phrases.
  • Jargon detection: SVM classifier trained on labeled biomedical spans.
  • Explain in plain English: Uses rule-based templates + public-source glossaries (e.g., MedlinePlus/NHS/NCIt—configure attribution).
  • Two deployment modes:
    • API mode: Send text to a FastAPI server for prediction + definitions.
    • Offline mode: Run TF-IDF + linear SVM inference entirely in the browser using exported model weights.
  • Privacy-first: No data leaves the page in offline mode. Opt-in telemetry only.

🧱 Repo Structure

.
├─ extension/                  # Chrome/Edge extension (React/Plasmo or vanilla)
│  ├─ src/
│  │  ├─ content.tsx          # Scans page, highlights terms, popovers
│  │  ├─ ui/                  # Tooltip/Popup components
│  │  ├─ ml/
│  │  │  ├─ model.json        # Exported SVM weights (coef, intercept)
│  │  │  ├─ vocab.json        # TF-IDF vocabulary
│  │  │  └─ tfidf_stats.json  # idf, norms, etc.
│  │  ├─ utils/text.ts        # DOM extraction, sentence/phrase splitting
│  │  └─ utils/infer.ts       # TF-IDF + SVM inference (JS)
│  └─ manifest.json
│
├─ backend/
│  ├─ app.py                  # FastAPI (predict/define endpoints)
│  ├─ artifacts/
│  │  ├─ vectorizer.pkl
│  │  ├─ svm.pkl
│  │  └─ label_map.json
│  └─ requirements.txt
│
├─ model/
│  ├─ train.ipynb             # E2E training notebook
│  ├─ train.py                # Scripted training
│  ├─ export_js.py            # Export sklearn artifacts → JSON for extension
│  └─ data/                   # Your datasets (spans/tokens/labels)
│
└─ README.md

📦 Quick Start

1) Set up Python & train the model

# create env
python -m venv .venv && source .venv/bin/activate  # (Windows: .venv\Scripts\activate)
pip install -r backend/requirements.txt  # includes scikit-learn, fastapi, uvicorn, pandas, joblib

# train SVM (uses TF-IDF + LinearSVC)
python model/train.py \
  --train data/train.csv \
  --dev data/dev.csv \
  --out backend/artifacts

What train.py does

  • cleans text, lowercases, strips punctuation/stopwords
  • builds TfidfVectorizer(ngram_range=(1,3), min_df=2, sublinear_tf=True)
  • fits LinearSVC (good margin classifier for sparse features)
  • saves: vectorizer.pkl, svm.pkl, label_map.json
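The steps above can be sketched roughly as follows (hypothetical CSV column names text and label, and a class-to-index layout for label_map.json; punctuation/stopword stripping is omitted for brevity — adapt to the actual train.py):

```python
import json
import joblib
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

def train(train_csv: str, out_dir: str) -> None:
    # Load labeled spans; "text"/"label" column names are an assumption.
    df = pd.read_csv(train_csv)
    texts, labels = df["text"].str.lower(), df["label"]

    # TF-IDF settings from the README: uni/bi/trigrams, min_df=2, sublinear tf.
    vectorizer = TfidfVectorizer(ngram_range=(1, 3), min_df=2, sublinear_tf=True)
    X = vectorizer.fit_transform(texts)

    # Linear SVM over the sparse TF-IDF features.
    svm = LinearSVC()
    svm.fit(X, labels)

    # Persist the three artifacts listed above.
    joblib.dump(vectorizer, f"{out_dir}/vectorizer.pkl")
    joblib.dump(svm, f"{out_dir}/svm.pkl")
    label_map = {label: i for i, label in enumerate(svm.classes_)}
    with open(f"{out_dir}/label_map.json", "w") as f:
        json.dump(label_map, f)
```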

2) Export artifacts for the WebExtension (offline mode)

python model/export_js.py \
  --vec backend/artifacts/vectorizer.pkl \
  --svm backend/artifacts/svm.pkl \
  --labels backend/artifacts/label_map.json \
  --out extension/src/ml

This writes:

  • vocab.json (term → index)
  • tfidf_stats.json (idf, norms, options)
  • model.json (SVM coef, intercept, classes)

The extension runs TF-IDF and the linear decision function in JS for zero-server inference.
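A minimal sketch of what the export step does — load the pickled scikit-learn artifacts and serialize their raw numbers as JSON (function name and exact JSON keys are assumptions; see export_js.py for the real field names):

```python
import json
import joblib

def export_artifacts(vec_path: str, svm_path: str, out_dir: str) -> None:
    vec = joblib.load(vec_path)
    svm = joblib.load(svm_path)

    # vocab.json: term -> column index (cast numpy ints for JSON).
    with open(f"{out_dir}/vocab.json", "w") as f:
        json.dump({t: int(i) for t, i in vec.vocabulary_.items()}, f)

    # tfidf_stats.json: idf weights plus the options JS must mirror.
    with open(f"{out_dir}/tfidf_stats.json", "w") as f:
        json.dump({"idf": vec.idf_.tolist(),
                   "sublinear_tf": vec.sublinear_tf,
                   "norm": vec.norm}, f)

    # model.json: linear decision function = coef, intercept, class names.
    with open(f"{out_dir}/model.json", "w") as f:
        json.dump({"coef": svm.coef_.tolist(),
                   "intercept": svm.intercept_.tolist(),
                   "classes": [str(c) for c in svm.classes_]}, f)
```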

3) Run the extension in dev

If you use Plasmo:

cd extension
npm i
npm run dev          # builds and watches
# Load the generated build in Chrome: chrome://extensions (Developer mode → Load unpacked)

Edge: same as Chrome (edge://extensions).


🔍 How It Works

  1. DOM Extraction

    • content.tsx finds main article nodes (heuristics: <article>, large <div> blocks).
    • Merges contiguous text nodes, preserves paragraph breaks.
  2. Tokenization & Candidate Phrases

    • Lowercase, strip punctuation.
    • Generate unigrams/bigrams/trigrams (windowed).
    • Optional filters: stopwords (and, or, of, was, with…) are dropped unless part of a learned phrase.
  3. Vectorization (TF-IDF)

    • Uses exported vocab.json + tfidf_stats.json.
    • Builds a sparse vector in JS mirroring scikit-learn’s preprocessing.
  4. Classification (Linear SVM)

    • Applies decision = X · coef.T + intercept.
    • One-vs-rest or direct LinearSVC classes → select jargon label or O (non-jargon).
    • Thresholding: optional margin threshold to trade precision/recall.
  5. Explanation

    • If API mode is enabled, the extension calls /define?term=… to fetch a short definition from configured sources.
    • Otherwise, local rules + bundled mini-glossary (JSON) generate simplified definitions.
  6. UI

    • Highlight spans with a subtle underline.
    • Hover/click → tooltip with “What it means” (plain language), “Also called”, and “Learn more” (attribution link).
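Steps 3–4 can be made concrete with a Python mirror of the on-device math (the real implementation lives in extension/src/utils/infer.ts; the inputs below are hypothetical minimal examples, and the per-class coef layout assumes one-vs-rest):

```python
import math

def tfidf_vector(ngrams, vocab, idf, sublinear_tf=True):
    """Build an L2-normalized TF-IDF vector as {column index: weight}."""
    counts = {}
    for ngram in ngrams:
        if ngram in vocab:  # out-of-vocabulary n-grams are simply dropped
            counts[vocab[ngram]] = counts.get(vocab[ngram], 0) + 1
    vec = {}
    for idx, tf in counts.items():
        tf = 1.0 + math.log(tf) if sublinear_tf else float(tf)
        vec[idx] = tf * idf[idx]
    norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
    return {i: w / norm for i, w in vec.items()}

def decide(vec, coef, intercept, classes):
    """decision = X . coef.T + intercept; return the best class and margin."""
    scores = [sum(coef[k][i] * w for i, w in vec.items()) + intercept[k]
              for k in range(len(classes))]
    best = max(range(len(classes)), key=lambda k: scores[k])
    return classes[best], scores[best]
```

A margin threshold for the precision/recall trade-off is then just a check on the returned score before highlighting the span.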

🧪 Datasets & Labels

Use any dataset of medically labeled spans that you have permission to use. Recommended label focus:

  • BIOLOGICAL_STRUCTURE, DIAGNOSTIC_PROCEDURE, DISEASE_DISORDER
  • MEDICATION, SIGN_SYMPTOM, THERAPEUTIC_PROCEDURE

In train.py, map your dataset’s labels to the above set via label_map.json. Everything else can map to O (non-jargon).
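For example (the source tags on the left are hypothetical — substitute your dataset's actual label names), the mapping with an O fallback can look like:

```python
import json

# Hypothetical dataset-specific tags -> the README's focus labels.
label_map = {
    "Disease_or_Syndrome": "DISEASE_DISORDER",
    "Pharmacologic_Substance": "MEDICATION",
    "Sign_or_Symptom": "SIGN_SYMPTOM",
    "Body_Part": "BIOLOGICAL_STRUCTURE",
}

def remap(label: str) -> str:
    # Anything not explicitly mapped falls through to non-jargon.
    return label_map.get(label, "O")

# label_map.json is just this dict serialized:
print(json.dumps(label_map, indent=2))
```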


📈 Evaluation (suggested)

python model/train.py --eval_only \
  --train data/train.csv \
  --dev data/dev.csv \
  --out backend/artifacts

Report:

  • Precision / Recall / F1 per label
  • Macro-F1 and Micro-F1
  • Confusion matrix (optional)
  • Threshold sweep (if you apply probability calibration via LinearSVC + Platt/CalibratedClassifierCV)
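scikit-learn's metrics cover the first three items directly (the gold/pred lists here are hypothetical stand-ins for the dev-split labels and model predictions):

```python
from sklearn.metrics import classification_report, f1_score

# Toy gold labels and predictions; in practice these come from dev.csv.
gold = ["SIGN_SYMPTOM", "O", "MEDICATION", "O", "SIGN_SYMPTOM"]
pred = ["SIGN_SYMPTOM", "O", "O",          "O", "SIGN_SYMPTOM"]

# Per-label precision / recall / F1 plus macro averages.
print(classification_report(gold, pred, zero_division=0))
print("macro-F1:", f1_score(gold, pred, average="macro", zero_division=0))
print("micro-F1:", f1_score(gold, pred, average="micro", zero_division=0))
```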

🧩 API Mode (optional)

FastAPI server for inference + definitions.

uvicorn backend.app:app --reload --port 8000

Endpoints

  • POST /predict: body { text: str } → returns [{ span, label, score }]
  • GET /define?term=dyspnea → returns { term, definition, source_url, attribution }

Extension config

  • extension/src/config.ts:
    export const INFERENCE_MODE: "offline" | "api" = "offline"
    export const API_BASE = "http://localhost:8000"

🧠 Why SVM?

  • Linear SVMs work extremely well with high-dimensional sparse TF-IDF features.
  • Fast to train, tiny to ship (just coef/intercept + vocab), easy to run on-device.
  • Great baseline for hackathons; can upgrade later to a small transformer if needed.

⚙️ Configuration

  • Thresholds: extension/src/ml/thresholds.ts (per-label or global margin cut-off)
  • Stopwords: extension/src/ml/stopwords.json (e.g., and, or, of, was, with, etc.)
  • Glossary: extension/src/data/glossary.json for offline definitions
  • Attribution: Add source notes for external definitions (e.g., MedlinePlus/NHS)

🛠️ Development Scripts

# Python
black model backend
pytest -q

# Extension
npm run dev
npm run build

Packaging for store:

  • Chrome Web Store: upload zipped /build (ensure correct manifest.json).
  • Edge Add-ons: similar submission; test in Edge beforehand.

🚧 Roadmap

  • Add probabilistic calibration + confidence bins in tooltips
  • Contextual grouping of multi-word spans (merge overlapping hits)
  • Caching definitions + inline citations
  • On-device tiny transformer (Distil/ELI5-style simplifier) as optional module
  • i18n (English ↔ Vietnamese)

🤝 Contributing

PRs welcome! Please:

  1. Open an issue describing the change.
  2. Include evaluation diffs (metrics) if you modify the model.
  3. Keep extension bundle size minimal.
