An Online RLHF (Reinforcement Learning from Human Feedback) Workbench for medical AI. This tool simulates the process of a medical expert ranking model outputs and calculates the optimization updates (DPO Loss and GRPO Advantages) required to steer the model.
Note:
With small models, the probability of generating a correct answer on any single attempt is low. I implemented rejection sampling for honest attempts, with a fallback that conditions generation on the correct answer so the pipeline always completes. In production with a larger model, rejection sampling would succeed more often.
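As a rough illustration, here is a minimal sketch of what that sampling loop could look like; the function name `sample_trace` and the callables it takes are hypothetical, not the actual `sampler.py` API:

```python
from typing import Callable, Tuple

def sample_trace(
    generate: Callable[[str], str],        # free-form generation given a prompt (hypothetical)
    extract_answer: Callable[[str], str],  # pulls the final answer out of a trace (hypothetical)
    case_prompt: str,
    correct_answer: str,
    max_attempts: int = 5,
) -> Tuple[str, str]:
    """Rejection-sample honest traces; fall back to answer-conditioned generation."""
    for _ in range(max_attempts):
        trace = generate(case_prompt)
        if extract_answer(trace) == correct_answer:
            return trace, "rejection_sampled"
    # Fallback: condition on the correct answer so the pipeline still completes.
    hinted = f"{case_prompt}\n\nThe correct answer is {correct_answer}. Explain the reasoning."
    return generate(hinted), "fallback"
```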
Demo video: `trifetch_demo.mp4`
```bash
pip install -r requirements.txt
streamlit run app.py
```

- Select Model — Choose a model from the sidebar
- Select Case — Pick a patient case (1-5)
- Generate Traces — Click to generate 3 reasoning traces
- Rank Traces — Act as the doctor: rank Best, Middle, Worst
- Compute Loss — Click to calculate DPO Loss and GRPO Advantages
Use Clear All in the sidebar to reset.
All settings are in config.yaml:
```yaml
# Change default model (one line switch)
default: "qwen-0.5b"

# DPO hyperparameter
dpo:
  beta: 0.1

# Available models
models:
  smollm-135m:
    name: "HuggingFaceTB/SmolLM-135M-Instruct"
    description: "SmolLM 135M (Fast)"
  qwen-0.5b:
    name: "Qwen/Qwen2-0.5B-Instruct"
    description: "Qwen2 0.5B (Best)"
```

To add a new model:

- Find the model on HuggingFace
- Add it to config.yaml:

```yaml
models:
  my-new-model:
    name: "organization/model-name"
    description: "My New Model"
```

- Set it as default:

```yaml
default: "my-new-model"
```
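For illustration, a sketch of how the app might read this file at startup with PyYAML; the actual loading code in app.py / sampler.py is not shown here and may differ:

```python
# Hypothetical sketch of loading config.yaml; the real code in app.py may differ.
import yaml

with open("config.yaml") as f:
    config = yaml.safe_load(f)

default_key = config["default"]            # e.g. "qwen-0.5b"
model_cfg = config["models"][default_key]  # {"name": ..., "description": ...}
beta = config["dpo"]["beta"]               # DPO temperature, e.g. 0.1

print(f"Loading {model_cfg['name']} ({model_cfg['description']}), beta={beta}")
```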
```
├── app.py            # Streamlit UI
├── sampler.py        # Model interface & trace generation
├── optimizer.py      # DPO & GRPO calculations
├── config.yaml       # Model & hyperparameter settings
├── sample1-5.json    # Patient cases
├── requirements.txt  # Dependencies
└── README.md
```
Calculates how much to adjust the model based on the preferred versus the rejected trace:

```
Loss = -log(sigmoid(β * (policy_margin - reference_margin)))
```

where each margin is the log-probability of the preferred trace minus that of the rejected trace, computed under the policy model and the frozen reference model respectively.
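A small numeric sketch of that formula in PyTorch; variable names are illustrative and not necessarily the ones used in optimizer.py:

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss from summed log-probs of the chosen and rejected traces."""
    policy_margin = policy_chosen_logp - policy_rejected_logp
    reference_margin = ref_chosen_logp - ref_rejected_logp
    # -log(sigmoid(x)) computed stably as softplus(-x)
    return F.softplus(-beta * (policy_margin - reference_margin))

# Example: the policy separates chosen from rejected more than the reference does,
# so the loss is moderate and shrinks as that gap grows.
loss = dpo_loss(torch.tensor(-12.0), torch.tensor(-15.0),
                torch.tensor(-13.0), torch.tensor(-14.0))
print(loss)  # ≈ 0.60
```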
Normalizes rewards across the group of traces:
```
Advantage = (reward - mean) / std
```
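A corresponding sketch, assuming each trace has already been assigned a scalar reward (e.g. from the doctor's ranking):

```python
import numpy as np

def grpo_advantages(rewards, eps=1e-8):
    """Normalize rewards within a group of traces: (r - mean) / std."""
    rewards = np.asarray(rewards, dtype=np.float64)
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Example: the best-ranked trace gets a positive advantage, the worst a negative one.
print(grpo_advantages([1.0, 0.5, 0.0]))  # ≈ [ 1.2247, 0.0, -1.2247 ]
```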
- Python 3.8+
- About 1GB disk space (for model weights)
- Works on CPU, MPS (Mac), or CUDA (GPU)
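A hedged sketch of how device selection could work with PyTorch; sampler.py may handle this differently:

```python
import torch

def pick_device() -> torch.device:
    """Prefer CUDA, then Apple MPS, then fall back to CPU."""
    if torch.cuda.is_available():
        return torch.device("cuda")
    if torch.backends.mps.is_available():
        return torch.device("mps")
    return torch.device("cpu")

print(pick_device())
```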