Search-and-rescue humanoid for the Golden Hour. TreeHacks 2026 · adamrobotics.xyz
ADAM is a humanoid robot first-responder for post-disaster triage: it locates survivors (audio + vision), assesses injuries with LLM + VLM, and streams evidence-based reports to a remote operator dashboard—so responders can act without entering dangerous terrain.
Landing page: ADAM Robotics
- System Overview
- 1.1 Key Features
- 1.2 Screenshots
- Architecture
- Pipelines
- Setup & Installation
- 4.1 Prerequisites
- 4.2 Installation Steps
- Running the System
- How It All Works
- Development Guide
- Limitations
- Credits
This project implements an autonomous humanoid robot for search and rescue operations. The robot can:
- Search & Navigate: Autonomously search for victims, call out, listen for responses, navigate to sound sources
- Medical Triage: Perform visual injury detection using VLM (Vision Language Models), conduct dialogue-based triage assessment
- Documentation: Generate comprehensive medical reports with evidence provenance
- Operator Interface: Stream telemetry, video, and communications to a web-based command center
- Autonomous Decision Making: LLM-driven policy for high-level decisions, reflex controller for safety
- Zero ROS2 Dependencies on Laptop: Python-orchestrator-first architecture
- Multimodal AI: GPT-4o-mini for decisions, YOLO-World for open-vocabulary detection, MediaPipe for pose estimation
- Evidence-Based Reports: Every claim in medical reports is traceable to sensor data with confidence scores
- Real-Time Command Center: React webapp with live video feed, comms log, floor plan, operator messaging
- Keyframe System: Event-triggered frame capture (not continuous recording) for efficient storage and transmission
| Landing page | Operator dashboard |
|---|---|
| ![]() | ![]() |

| YOLO-World debris detection | Generated medical report |
|---|---|
| ![]() | ![]() |
End-to-end flow (building intuition):
┌──────────────────────────────────────────────────────────────────────────────┐
│ OPERATOR CONSOLE (React + Vite) │
│ • Live camera feed │
│ • Communications log (victim/robot/operator messages) │
│ • Floor plan with robot position │
│ • Send messages → robot speaks them │
└────────────────────────────┬─────────────────────────────────────────────────┘
│ HTTP Polling (/latest, /snapshot/latest)
│ POST /operator-message
▼
┌──────────────────────────────────────────────────────────────────────────────┐
│ COMMAND CENTER (FastAPI Server) │
│ • Receives telemetry from robot (phase, detections, map position) │
│ • Stores keyframe snapshots (JPEG) │
│ • Buffers operator messages for robot to poll │
│ • Serves /latest, /snapshot/latest, /operator-messages endpoints │
└────────────────────────────┬─────────────────────────────────────────────────┘
│ HTTP (POST events/snapshots, GET operator msgs)
▼
┌──────────────────────────────────────────────────────────────────────────────┐
│ ORCHESTRATOR (Python Agent on Laptop) │
│ ┌───────────────────────────────────────────────────────────────┐ │
│ │ PERCEPTION LAYER │ │
│ │ • YOLO person detection │ │
│ │ • YOLO-World open-vocab rubble detection │ │
│ │ • MediaPipe pose estimation │ │
│ │ • Audio scanner (direction finding) │ │
│ │ • Medical VLM (injury classification via GPT-4-Vision) │ │
│ └───────────────────────────────────────────────────────────────┘ │
│ ┌───────────────────────────────────────────────────────────────┐ │
│ │ DECISION MAKING (Two-Layer Brain) │ │
│ │ • Reflex Controller (~10-20 Hz): Safety overrides │ │
│ │ • LLM Policy (~1 Hz): GPT-4o-mini for high-level decisions │ │
│ │ - Input: World state (phase, detections, transcript, etc) │ │
│ │ - Output: Action, say/ask text, params, confidence │ │
│ └───────────────────────────────────────────────────────────────┘ │
│ ┌───────────────────────────────────────────────────────────────┐ │
│ │ MEDICAL PIPELINE │ │
│ │ • TriagePipeline: coordinates all medical subsystems │ │
│ │ • MedicalAssessor: VLM-based injury detection │ │
│ │ • EvidenceCollector: rolling buffer of frames │ │
│ │ • QuestionPlanner: dynamic triage questions │ │
│ │ • ReportBuilder: structured medical reports (Markdown + PDF) │ │
│ └───────────────────────────────────────────────────────────────┘ │
│ ┌───────────────────────────────────────────────────────────────┐ │
│ │ COMMS & TELEMETRY │ │
│ │ • EventManager: keyframe capture on events │ │
│ │ • CommandCenterClient: POST events/snapshots │ │
│ │ • Operator message polling loop │ │
│ └───────────────────────────────────────────────────────────────┘ │
└────────────────────────────┬─────────────────────────────────────────────────┘
│ HTTP to Robot Bridge (:9090)
▼
┌──────────────────────────────────────────────────────────────────────────────┐
│ K1 ROBOT (Booster Humanoid) │
│ ┌───────────────────────────────────────────────────────────────┐ │
│ │ ROBOT BRIDGE SERVER (FastAPI on Robot) │ │
│ │ • Camera: ROS2 subscriber → /StereoNetNode/rectified_image │ │
│ │ • Mic: ALSA (arecord) → 16kHz mono WAV │ │
│ │ • Speaker: TTS via espeak | paplay │ │
│ │ • Motion: Booster SDK (B1LocoClient) - walk, turn, head pan │ │
│ │ • Endpoints: /frame, /speak, /record, /velocity, /stop │ │
│ └───────────────────────────────────────────────────────────────┘ │
│ • Robot IP: 192.168.10.102 (USB-C Ethernet) │
│ • Laptop IP: 192.168.10.1 │
└──────────────────────────────────────────────────────────────────────────────┘
treehacks2026/
├── himpublic-py/ # Main Python orchestrator (NO ROS2)
│ ├── src/himpublic/
│ │ ├── main.py # Entrypoint (CLI, signal handling)
│ │ ├── orchestrator/ # Agent, state machine, phases, policy
│ │ │ ├── agent.py # Always-on agent with async tasks
│ │ │ ├── phases.py # Phase enum and definitions
│ │ │ ├── policy.py # ReflexController, LLMPolicy, Decision
│ │ │ ├── llm_adapter.py # GPT-4o-mini integration
│ │ │ ├── dialogue_manager.py # Triage dialogue state machine
│ │ │ ├── events.py # EventManager for keyframes
│ │ │ └── search_phase.py # Audio-guided search implementation
│ │ ├── medical/ # Medical triage CV + dialogue
│ │ │ ├── triage_pipeline.py # Coordinator for all medical subsystems
│ │ │ ├── medical_assessor.py # VLM injury detection (GPT-4-Vision)
│ │ │ ├── evidence_collector.py # Rolling frame buffer
│ │ │ ├── question_planner.py # Dynamic triage questions
│ │ │ ├── report_builder.py # Structured medical reports
│ │ │ └── schemas.py # Finding, TriageReport data models
│ │ ├── perception/ # Computer vision
│ │ │ ├── person_detector.py # YOLOv8 (COCO-80)
│ │ │ ├── openvocab.py # YOLO-World (open-vocabulary rubble)
│ │ │ ├── audio_scanner.py # Audio direction finding
│ │ │ ├── frame_store.py # LatestFrameStore + RingBuffer
│ │ │ └── types.py # Observation dataclass
│ │ ├── io/ # Hardware abstraction
│ │ │ ├── robot_interface.py # Protocol definition
│ │ │ ├── mock_robot.py # Local dev mock
│ │ │ ├── robot_client.py # HTTP client for bridge
│ │ │ ├── video_source.py # Webcam/File/Robot sources
│ │ │ └── audio_io.py # Local/Robot audio I/O
│ │ ├── comms/ # Command center communication
│ │ │ ├── command_center_server.py # FastAPI server
│ │ │ └── command_center_client.py # HTTP client
│ │ ├── reporting/ # Evidence-based report generation
│ │ │ ├── render_medical_report.py
│ │ │ ├── report_builder.py
│ │ │ ├── document_builder.py
│ │ │ └── artifact_store.py
│ │ ├── planner/ # LLM planner-executor (optional)
│ │ │ ├── llm_planner.py
│ │ │ ├── executor.py
│ │ │ └── schema.py
│ │ ├── pipeline/ # Strict sequential mission pipeline
│ │ │ ├── engine.py
│ │ │ ├── phases.py
│ │ │ └── cli.py
│ │ └── utils/
│ ├── scripts/
│ │ ├── run_command_center.py
│ │ ├── smoke_test_robot.py
│ │ └── run_search_demo.py
│ ├── code/ # Demo scripts and experiments
│ │ ├── final_demo.py
│ │ ├── hardcoded_demo.py
│ │ └── walk_demo_*.py
│ └── docs/ # Architecture docs
│ ├── ARCHITECTURE.md
│ ├── PIPELINE.md
│ ├── ROBOT_INTEGRATION.md
│ └── MEDICAL_REPORT_SPEC.md
├── webapp/ # React operator console
│ ├── src/
│ │ ├── App.tsx # Main dashboard
│ │ ├── components/ # Chat, FloorPlan, RobotStatus, etc.
│ │ └── api/client.ts # Command center API client
│ ├── package.json
│ └── vite.config.ts
├── booster_robotics_sdk/ # Official Booster SDK (reference)
├── no-robot-sim/ # Wizard-of-Oz demo (keyboard-driven)
└── docs/
├── STARTUP_GUIDE.md # What to run, in what order
└── ARCHITECTURE.md # Full-stack architecture
Goal: High-level autonomous decision making for navigation, dialogue, and phase transitions.
How It Works:
- World State Construction: Every tick (~1 Hz), the orchestrator builds a compact JSON snapshot:
- Current phase (search, approach, triage, etc.)
- Observations: detected persons, rubble, audio cues
- Conversation history (last 10 exchanges)
- Robot constraints (safety limits, available tools)
- Triage state (questions asked, answers collected)
- LLM Call (GPT-4o-mini):

  ```python
  prompt = f"""
  You are controlling a search and rescue robot.

  Current state:
  {json.dumps(world_state, indent=2)}

  Available actions: {allowed_actions_for_phase}

  Decide next action. Output JSON:
  {{
    "action": "rotate_left|wait|ask|say|...",
    "say": "optional text to speak",
    "ask": "optional question to ask victim",
    "params": {{}},
    "rationale": "why you chose this",
    "confidence": 0.0-1.0
  }}
  """
  response = openai.ChatCompletion.create(
      model="gpt-4o-mini",
      messages=[{"role": "system", "content": prompt}],
      temperature=0.7,
  )
  decision = parse_llm_json(response)
  ```
- Validation & Execution:
  - Check action is allowed for current phase (e.g., `push_obstacle` only in DEBRIS_CLEAR)
  - Clamp params (rotation degrees, wait duration) to safety bounds
  - Dispatch action (e.g., `robot.set_velocity()`, `audio.speak()`)
  - Log decision to `logs/llm_decisions.jsonl`
- Reflex Override: A separate controller (~10-20 Hz) can override the LLM:
- Stop if obstacle too close (< 0.5m)
- Rotate to center person if offset > 0.3
Located In: himpublic-py/src/himpublic/orchestrator/policy.py, llm_adapter.py
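The override rules above are simple enough to sketch directly. This is an illustrative stand-in for the reflex layer, not the project's actual `ReflexController` code; the thresholds (0.5 m obstacle distance, 0.3 person offset) come from the list above, and the action names are assumptions.

```python
# Hypothetical sketch of the reflex layer: deterministic safety checks that can
# override whatever the LLM policy chose. Runs fast (~10-20 Hz), no LLM involved.
from __future__ import annotations
from dataclasses import dataclass

@dataclass
class Observation:
    obstacle_distance_m: float
    person_offset: float  # [-1, 1], 0 = person centered in frame

def reflex_override(obs: Observation) -> str | None:
    """Return an override action, or None to let the LLM decision stand."""
    if obs.obstacle_distance_m < 0.5:
        return "stop"  # hard stop: obstacle too close
    if abs(obs.person_offset) > 0.3:
        # rotate toward the person to keep them centered
        return "rotate_left" if obs.person_offset < 0 else "rotate_right"
    return None
```

Because the reflex check is pure and deterministic, it can be unit-tested without any robot or LLM in the loop.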
Goal: Detect injuries from camera frames using multimodal vision-language models.
Components:
- Detects rubble/obstacles using free-form text prompts
- Why: Standard YOLOv8 only knows 80 COCO classes; "cardboard box", "rubble", "debris" are not included
- Prompts: `["cardboard box", "amazon box", "package", "debris", "rubble", ...]`
- Model: `yolov8s-worldv2` (built into the `ultralytics` package)
- Located In: `himpublic-py/src/himpublic/perception/openvocab.py`
- Analyzes frames for visible injuries (bleeding, burns, fractures, etc.)
- Input: BGR frame + optional pose keypoints (MediaPipe)
- Prompt:

  ```
  Analyze this image for visible injuries. Look for:
  - Bleeding (arterial/venous)
  - Burns
  - Fractures or deformities
  - Contusions
  - Lacerations
  If pose keypoints provided, cross-reference body regions.
  Output JSON:
  [
    {
      "finding_type": "bleeding|burn|fracture|...",
      "body_region": "left_forearm|right_knee|...",
      "severity": "mild|moderate|severe",
      "confidence": 0.0-1.0,
      "rationale": "explanation"
    }
  ]
  ```

- Located In: `himpublic-py/src/himpublic/medical/medical_assessor.py`
- Extracts 33 body landmarks (nose, shoulders, elbows, knees, etc.)
- Used to guide VLM's attention to specific body regions
- Helps disambiguate "left arm" vs "right arm" from camera perspective
Pipeline Flow:
Frame (BGR) → MediaPipe Pose → 33 keypoints
↓
Frame (JPEG) + keypoints → GPT-4-Vision → Findings (JSON)
↓
Findings → TriagePipeline → Evidence collector → Report builder
Located In: himpublic-py/src/himpublic/medical/triage_pipeline.py
Goal: Structured medical triage through dialogue + visual assessment.
Phases:
- Robot captures 4 views: front, left, right, close-up
- Each frame → MedicalAssessor (VLM) → Findings
- Findings accumulated with confidence scores
- If confidence ≥ 0.55: trigger evidence collection
- Question Planner generates questions based on:
- Visual findings (e.g., if bleeding detected → "Where is the bleeding?")
- Standard triage protocol (consciousness, pain scale, allergies)
- Previous answers (follow-ups)
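The planning rule above (follow up on visual findings first, then fall back to the standard protocol, skipping anything already answered) can be sketched like this. The `plan_questions` helper is hypothetical; the real planner lives in `question_planner.py` and is richer than this.

```python
# Illustrative sketch: prioritize questions triggered by visual findings,
# then walk a fixed triage protocol, skipping answered items. Names and the
# protocol list are assumptions, not the project's actual data.
def plan_questions(visual_findings: list[dict], answered: set[str]) -> list[str]:
    questions = []
    # 1) follow-ups driven by what the VLM saw
    for f in visual_findings:
        if f.get("finding_type") == "bleeding" and "bleeding_location" not in answered:
            questions.append("Where is the bleeding?")
    # 2) standard triage protocol, minus anything already covered
    protocol = [
        ("consciousness", "Can you hear me? Are you conscious?"),
        ("pain_scale", "On a scale of 1-10, how much pain are you in?"),
        ("allergies", "Do you have any allergies?"),
    ]
    for key, question in protocol:
        if key not in answered:
            questions.append(question)
    return questions
```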
- Dialogue Manager state machine:
- Tracks current question, retries (if no answer), follow-ups
- Integrates speech recognition (ASR) via robot mic
- Parses free-form answers using LLM:

  ```python
  prompt = f"""
  Question: "{question}"
  Victim's answer: "{transcript}"

  Extract structured data:
  {{
    "pain_scale": 0-10,
    "injury_location": "left knee",
    "mobility": "cannot_walk|limited|mobile",
    ...
  }}
  """
  ```
- Example Flow:

  ```
  Robot:  "Can you hear me? Are you conscious?"
  Victim: "Yes, I'm awake."
          → triage_answers["consciousness"] = "alert"
  Robot:  "On a scale of 1-10, how much pain are you in?"
  Victim: "Maybe a 7."
          → triage_answers["pain_scale"] = 7
  Robot:  "Where are you hurt?"
  Victim: "My left knee is bleeding."
          → triage_answers["injury_location"] = "left_knee"
          → medical_pipeline.set_spoken_body_region("left_knee")
  ```
- Compile all evidence:
- Visual findings (with confidence scores)
- Dialogue transcript
- Triage answers (structured)
- Keyframe images
- ReportBuilder generates:
- Markdown: Human-readable report with sections:
- Incident meta (timestamp, robot ID, location)
- Patient summary (age estimate, consciousness, primary concern)
- ABCDE findings (Airway, Breathing, Circulation, Disability, Exposure)
- Suspected injuries (with evidence IDs)
- Hazards nearby
- Media evidence (images with captions)
- Evidence & Provenance table (every claim traceable to sensor data)
- Recommended actions
- PDF: Generated via `md_to_pdf.py` (uses `markdown` + `pdfkit`)
- Evidence Log (JSONL):
- Every image, audio snippet, model output gets a unique ID (E1, E2, ...)
- Reports cite evidence: "Possible arterial bleed [E12][E14]"
- Enables audit trail and confidence tracking
Located In:
- `himpublic-py/src/himpublic/orchestrator/dialogue_manager.py`
- `himpublic-py/src/himpublic/medical/question_planner.py`
- `himpublic-py/src/himpublic/reporting/render_medical_report.py`
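The evidence-log mechanism described above (sequential IDs, one JSON object per line) is small enough to sketch. This is an illustrative version under assumed field names, not the project's actual logger.

```python
# Minimal sketch of an append-only evidence log: every artifact gets an ID
# (E1, E2, ...) so report claims like "[E12]" can be traced back to sensor data.
import json
import time

class EvidenceLog:
    def __init__(self, path: str):
        self.path = path
        self.counter = 0

    def add(self, kind: str, meta: dict) -> str:
        """Append one evidence record as a JSONL row and return its ID."""
        self.counter += 1
        eid = f"E{self.counter}"
        record = {"id": eid, "kind": kind, "ts": time.time(), **meta}
        with open(self.path, "a") as f:
            f.write(json.dumps(record) + "\n")
        return eid
```

A report builder can then cite the returned IDs inline, e.g. `"Possible arterial bleed [E12][E14]"`.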
Goal: Autonomous search for victims in unknown environment, guided by audio cues.
Phases:
- Robot patrols/rotates, calling out: "Where are you? Can you hear me?"
- Audio Scanner listens for voice responses and localizes sound direction (see Sound-based localization below).
- Person Detector (YOLO) scans video for humans
- Exit Condition: Person detected with confidence ≥ 0.7 OR audio response heard
Sub-states:
- `announce`: Call out, wait 5s
- `listen`: Record audio, run ASR + direction finding
- `navigate_to_sound`: Rotate toward bearing, walk forward
Flow:
announce → listen (3s) → ASR
↓
Voice detected? → Direction finding → bearing
↓
Rotate toward bearing → walk forward
↓
YOLO detects person? → APPROACH_CONFIRM
- Navigate to detected person
- Re-detect to confirm (not a false positive)
- Establish safe standoff distance (~1.5m)
- Announce: "I'm a rescue robot. I'm here to help."
- YOLO-World scans for obstacles between robot and victim
- If movable (confidence check) → attempt to push/clear
- If not movable → mark as "blocked_not_clearable" and proceed
- Scan for hazards: fire, smoke, unstable debris, water
- Choose camera viewpoints for injury scan
Located In:
- `himpublic-py/src/himpublic/orchestrator/search_phase.py`
- `himpublic-py/src/himpublic/perception/audio_scanner.py`
- `himpublic-py/src/himpublic/perception/person_detector.py`
- `himpublic-py/src/himpublic/perception/openvocab.py`
We localize the victim by spin-to-peak loudness, not by stereo microphone array:
- Call out — Robot announces ("Is anyone there? Can you hear me?") and waits for a response.
- 360° scan — Robot rotates in fixed steps (e.g. 30°). At each heading it records a short audio window (~0.4 s), then turns to the next angle until a full rotation is done.
- Per-window signal — For each window we compute RMS (root-mean-square energy) and VAD (voice activity detection via webrtcvad or energy threshold). A composite score combines loudness and voice likelihood.
- Best bearing — Scores are smoothed over adjacent angles (to reduce spikes), then the heading with the highest score is chosen as the estimated bearing to the sound source. Confidence is derived from how much that peak stands out above the mean.
- Navigate — Robot rotates toward the chosen bearing and walks forward; YOLO person detection then refines approach.
So localization is directional hearing by physical rotation: we infer "sound came from that way" by which orientation had the loudest voice-like audio. No stereo phase or time-difference math—just turn, listen, and pick the loudest direction.
Located In: himpublic-py/src/himpublic/perception/audio_scanner.py
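The scoring and peak-picking in steps 3-4 can be sketched as follows. The real scanner computes RMS and VAD from raw audio windows; here each heading's composite score is taken as given, and the smoothing weights and confidence heuristic are assumptions.

```python
# Sketch of spin-to-peak localization: given one loudness/voice score per
# heading (e.g. 12 headings at 30° steps), smooth circularly over neighbors
# to suppress single-window spikes, then pick the peak heading.
import math

def estimate_bearing(scores: list[float], step_deg: float = 30.0):
    n = len(scores)
    # circular [0.25, 0.5, 0.25] smoothing over adjacent angles
    smoothed = [
        0.25 * scores[i - 1] + 0.5 * scores[i] + 0.25 * scores[(i + 1) % n]
        for i in range(n)
    ]
    best = max(range(n), key=lambda i: smoothed[i])
    mean = sum(smoothed) / n
    # confidence heuristic: how far the peak stands above the mean, in (0, 1)
    confidence = max(0.0, 1.0 - math.exp(-(smoothed[best] - mean)))
    return best * step_deg, confidence
```

A flat score profile yields near-zero confidence, which is the desired behavior: the robot should keep announcing rather than commit to a random heading.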
Goal: Abstract all robot hardware (camera, mic, speaker, motion) behind HTTP endpoints. Laptop orchestrator has ZERO ROS2 dependencies.
Why:
- K1 robot runs Ubuntu 22.04 with ROS2 Humble + Booster SDK
- Camera is locked by `booster-daemon-perception.service` (cannot use OpenCV directly)
- Audio requires specific ALSA/PulseAudio incantations
- We want to develop on laptop (macOS/Windows/WSL) without ROS2
Architecture:
Laptop Orchestrator K1 Robot (192.168.10.102)
-------------------- --------------------------
RobotBridgeClient FastAPI Server (:9090)
↓ HTTP GET /frame ↓ ROS2 Subscriber
↓ /StereoNetNode/rectified_image
↓ ↓ YUV NV12 → BGR
↓ ↓ cv2.imencode(.jpg)
↓ ← JPEG bytes ↑ Return JPEG
↓ HTTP POST /speak {"text": "..."} ↓ espeak --stdout | paplay
↓ HTTP POST /record {"duration": 5} ↓ arecord -d 5 /tmp/audio.wav
↓ ← WAV bytes ↑ Return WAV
↓ HTTP POST /velocity {"vx": 0.2} ↓ Booster SDK B1LocoClient
Bridge Endpoints:
| Endpoint | Method | Purpose | Implementation |
|---|---|---|---|
| `/health` | GET | Status check | Returns camera_ok, SDK availability, uptime |
| `/state` | GET | Telemetry | Robot joint angles, IMU, SDK members |
| `/frame?quality=80` | GET | Camera frame | ROS2 subscriber → JPEG (544×448) |
| `/speak` | POST | Text-to-speech | `espeak --stdout "text" \| paplay` |
| `/play_audio` | POST | Play WAV | `paplay` from bytes |
| `/record` | POST | Record mic | `arecord -f S16_LE -r 16000 -c 1 -d N` |
| `/velocity` | POST | Motion command | `B1LocoClient.Move(vx, vy, vyaw)` (requires `--allow-motion` flag) |
| `/stop` | POST | Emergency stop | Always allowed, no flag needed |
Key Implementation Details:
- Camera (ROS2 Subscriber):

  ```python
  from rclpy.node import Node
  from sensor_msgs.msg import Image

  class CameraManager(Node):
      def __init__(self):
          super().__init__('camera_bridge')
          self.subscription = self.create_subscription(
              Image,
              '/StereoNetNode/rectified_image',  # NO /booster_camera_bridge/ prefix!
              self.image_callback,
              10
          )

      def image_callback(self, msg):
          # YUV NV12 → BGR; the NV12 buffer holds height*3/2 rows
          # (full-res Y plane followed by interleaved, subsampled UV)
          yuv = np.frombuffer(msg.data, dtype=np.uint8).reshape(
              (msg.height * 3 // 2, msg.width)
          )
          bgr = cv2.cvtColor(yuv, cv2.COLOR_YUV2BGR_NV12)
          self.latest_frame = bgr
  ```
- TTS (espeak + paplay):

  ```python
  # WRONG (hangs forever):
  subprocess.run(["espeak", text])

  # CORRECT:
  espeak_proc = subprocess.Popen(
      ["espeak", "--stdout", "-a", "200", text],
      stdout=subprocess.PIPE
  )
  paplay_proc = subprocess.Popen(
      ["paplay"], stdin=espeak_proc.stdout
  )
  paplay_proc.wait()
  ```
- Mic (arecord):

  ```python
  subprocess.run([
      "arecord", "-D", "default",
      "-f", "S16_LE", "-r", "16000", "-c", "1",
      "-d", str(duration), wav_path
  ])
  with open(wav_path, "rb") as f:
      return f.read()
  ```
- Motion (Booster SDK):

  ```python
  import booster_robotics_sdk_python as sdk

  channel = sdk.ChannelFactory.Instance().CreateSendChannel(
      sdk.ChannelPublisher, sdk.SERV_LOCO
  )
  loco_client = sdk.B1LocoClient(channel)
  loco_client.SetApiId(sdk.B1LocoApiId.kMove)
  req = sdk.B1LocoRequest()
  req.vx, req.vy, req.vyaw = vx, vy, vyaw
  req.duration = 0  # continuous
  loco_client.Move(req)
  ```
Deployment:

```bash
# Copy server to robot (FROM LAPTOP, NOT SSH):
scp himpublic-py/src/robot_bridge/server.py booster@192.168.10.102:~/server.py

# Start bridge (ON ROBOT):
ssh booster@192.168.10.102
source /opt/ros/humble/setup.bash && python3 ~/server.py
```

Located In:

- Bridge server: `himpublic-py/src/robot_bridge/server.py`
- Client: `himpublic-py/src/himpublic/io/robot_client.py`
- Smoke test: `himpublic-py/scripts/smoke_test_robot.py`
Purpose: Hub for telemetry, keyframes, operator messaging.
Endpoints:
| Endpoint | Method | From | To | Purpose |
|---|---|---|---|---|
| `/event` | POST | Orchestrator | Server | Telemetry (`heartbeat` with phase, detections, robot_map_x/y), `heard_response` (transcript), `robot_said` (TTS text) |
| `/snapshot` | POST | Orchestrator | Server | Upload JPEG keyframe; server stores to disk and sets `latest_snapshot_path` |
| `/report` | POST | Orchestrator | Server | Upload medical report JSON |
| `/operator-message` | POST | Webapp | Server | Operator text; server appends to comms log and operator queue |
| `/latest` | GET | Webapp | Server | Last event, snapshot_path, report, comms (victim/robot/operator), robot_map_x/y |
| `/snapshot/latest` | GET | Webapp | Server | Latest JPEG for live feed |
| `/operator-messages` | GET | Orchestrator | Server | Queue of operator messages for robot to speak |
| `/operator-messages/ack` | POST | Orchestrator | Server | Clear messages up to index after speaking |
Data Flow (Comms):
1. Robot hears victim:
Orchestrator → POST /event {"event": "heard_response", "transcript": "..."}
→ Server appends to comms: {"role": "victim", "text": "..."}
→ Webapp polls GET /latest → displays as VICTIM message
2. Robot says something:
Orchestrator → TTS.speak(text) + POST /event {"event": "robot_said", "text": "..."}
→ Server appends to comms: {"role": "robot", "text": "..."}
→ Webapp polls GET /latest → displays as ROBOT 1 message
3. Operator sends message:
Webapp → POST /operator-message {"text": "Tell them help is on the way"}
→ Server appends to comms: {"role": "operator", "text": "..."}
→ Server appends to operator queue
→ Orchestrator polls GET /operator-messages → TTS.speak(text)
→ Orchestrator → POST /operator-messages/ack
Located In: himpublic-py/src/himpublic/comms/command_center_server.py
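The operator-message flow (step 3 above) boils down to a poll-speak-ack loop on the orchestrator side. The sketch below is hypothetical: `client` and `tts` stand in for the project's command-center client and TTS wrapper, and the method names are assumptions.

```python
# Sketch of the orchestrator's operator-message loop: poll the queue
# (GET /operator-messages), speak each message through the robot, then
# ack (POST /operator-messages/ack) so the server can clear them.
def drain_operator_messages(client, tts) -> int:
    msgs = client.get_operator_messages()        # GET /operator-messages
    for m in msgs:
        tts.speak(m["text"])                     # robot voices the operator
    if msgs:
        client.ack_operator_messages(len(msgs))  # POST /operator-messages/ack
    return len(msgs)
```

Acking only after speaking means a crash mid-loop re-delivers the messages rather than silently dropping them.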
Tech Stack:
- React 19
- TypeScript
- Vite (dev server + build)
- Tailwind CSS 4
Components:
- Chat (`components/Chat.tsx`):
  - Displays comms log: victim (green), robot (blue), operator (gray)
  - Quick-reply buttons: "Help is coming", "Stay calm", "Can you move?"
  - Text input for custom messages
- FloorPlan (`components/FloorPlan.tsx`):
  - SVG floor plan with rooms, corridors
  - Robot marker (blue circle) at `(robot_map_x, robot_map_y)` from telemetry
  - Path line showing robot's trajectory
  - Map position is simulated (no real odometry):
    - SEARCH_LOCALIZE: robot drifts along corridor
    - APPROACH_CONFIRM: robot moves toward victim at (68, 58)
- RobotFeed (`components/RobotFeed.tsx`):
  - Polls `GET /snapshot/latest?t={timestamp}` every 2s
  - Displays live JPEG from robot camera
  - Keyframes are event-triggered (FOUND_PERSON, HEARD_RESPONSE, HEARTBEAT)
- RobotStatus (`components/RobotStatus.tsx`):
  - Phase badge (SEARCH, APPROACH, TRIAGE, REPORT)
  - Detections: persons, rubble, audio cues
  - Last decision from LLM
- API Client (`api/client.ts`):

  ```typescript
  export async function getLatest() {
    const res = await fetch('/api/latest');
    return res.json();
  }

  export async function sendOperatorMessage(text: string) {
    await fetch('/api/operator-message', {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify({ text })
    });
  }

  export function getSnapshotUrl() {
    return `/api/snapshot/latest?t=${Date.now()}`;
  }
  ```
Dev Server Proxy (`vite.config.ts`):

```typescript
export default defineConfig({
  server: {
    proxy: {
      '/api': {
        target: 'http://127.0.0.1:8000',
        changeOrigin: true,
        rewrite: (path) => path.replace(/^\/api/, '')
      }
    }
  }
})
```

Located In: `webapp/src/`
- Model: Booster K1 Humanoid
- OS: Ubuntu 22.04.2 LTS (aarch64)
- ROS: ROS2 Humble
- SDK: `booster_robotics_sdk_python` (compiled .so, not pip-installable)
- Camera: Stereo camera (544×448), published on a ROS2 topic
- Audio: USB Audio Device (card 0), PulseAudio sink
- Joints: 23 DOF (head, arms, waist, legs, hands)
- Robot IP: 192.168.10.102 (USB-C Ethernet)
- Laptop IP: 192.168.10.1
- SSH: `ssh booster@192.168.10.102` (password: `123456`)
Why Not Continuous Recording?
- Continuous video at 30 FPS → ~2 GB/hour (H.264)
- Unnecessary: only need frames at decision points / events
- Network overhead: USB-C Ethernet is fast, but streaming HD video continuously is wasteful
How It Works:
- RingBuffer (Rolling Frame Store):
  - Configurable window: `--ring-seconds 10` (default)
  - Sample rate: `--ring-fps 2` (default)
  - Stores: `(timestamp, jpeg_bytes, observation_summary)` tuples
  - Memory-efficient: JPEG compression (quality 85), ~60 KB per frame
  - 10s window at 2 FPS = 20 frames ≈ 1.2 MB RAM
- EventManager (Triggers):
  - `FOUND_PERSON`: First confident person detection
  - `HEARD_RESPONSE`: Victim voice detected
  - `POSSIBLE_INJURY`: Visual injury detection confidence ≥ 0.55
  - `OPERATOR_REQUEST`: Operator message received
  - `HEARTBEAT`: Throttled (every 30s by default)
- Keyframe Selection (on event):

  ```python
  def get_keyframes(self, k=3, strategy="spread"):
      # Get last 10s of frames from RingBuffer
      window = self.ring_buffer.get_window(seconds_back=10)
      if strategy == "spread":
          # Evenly space k frames across window
          indices = np.linspace(0, len(window) - 1, k, dtype=int)
          return [window[i] for i in indices]
      elif strategy == "confidence":
          # Pick frames with highest person detection confidence
          sorted_frames = sorted(window, key=lambda f: f.obs.confidence, reverse=True)
          return sorted_frames[:k]
  ```
- Storage & Transmission:
  - Save to disk: `data/snapshots/{timestamp}_{event}.jpg`
  - POST to command center: `/snapshot` (multipart/form-data)
  - Command center stores: `./data/snapshots/` (served via `/snapshot/latest`)
Example Event:
[PERCEPTION] Person detected (confidence 0.82)
→ EventManager.emit(FOUND_PERSON, meta={"confidence": 0.82})
→ get_keyframes(k=3, strategy="spread")
→ Save: data/snapshots/1708022345_found_person_0.jpg (3 frames)
→ POST /snapshot (3 requests)
→ Command center: latest_snapshot_path = ".../found_person_2.jpg"
→ Webapp: GET /snapshot/latest → displays keyframe
Located In:

- RingBuffer: `himpublic-py/src/himpublic/perception/frame_store.py`
- EventManager: `himpublic-py/src/himpublic/orchestrator/events.py`
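The rolling frame store described above maps naturally onto a fixed-capacity deque sized from `--ring-seconds × --ring-fps`. This sketch is illustrative, not `frame_store.py`'s actual implementation; the class and method names are assumptions.

```python
# Minimal sketch of a rolling frame store: a deque that holds at most
# seconds × fps frames (oldest drop off automatically), plus a
# time-windowed getter used at keyframe-selection time.
import time
from collections import deque

class RingBuffer:
    def __init__(self, seconds: int = 10, fps: int = 2):
        self.frames = deque(maxlen=seconds * fps)  # capacity bounds RAM use

    def add(self, jpeg_bytes: bytes, ts: float = None):
        self.frames.append((ts if ts is not None else time.time(), jpeg_bytes))

    def get_window(self, seconds_back: float, now: float = None):
        """Return frames whose timestamp falls within the last seconds_back."""
        now = now if now is not None else time.time()
        return [f for f in self.frames if now - f[0] <= seconds_back]
```

At the documented defaults (10 s × 2 FPS, ~60 KB JPEGs), the deque holds 20 frames, which is where the ~1.2 MB RAM figure comes from.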
The orchestrator maintains a `SharedState` dataclass that represents the robot's understanding of the world:

```python
@dataclass
class SharedState:
    # Phase & timing
    phase: Phase
    phase_entered_at: float
    tick: int

    # Perception
    latest_observation: Observation | None
    num_persons: int
    person_confidence: float
    primary_person_offset: float  # [-1, 1], 0 = centered
    obstacle_distance_m: float

    # Audio
    last_heard_transcript: str | None
    audio_bearing_deg: float | None  # Direction to sound source

    # Medical triage
    triage_step_index: int
    triage_answers: dict  # {"pain_scale": 7, "injury_location": "left_knee", ...}
    consent_for_photos: bool | None
    visual_findings: list[Finding]

    # Dialogue
    pending_question_id: str | None
    pending_question_text: str | None
    pending_question_asked_at: float | None
    pending_question_retries: int
    conversation_transcript: list[str]

    # Navigation (simulated in local mode)
    robot_map_x: float
    robot_map_y: float
    robot_heading_deg: float

    # Degraded mode
    degraded_mode: bool
    sensor_failures: set[str]
```

The mission follows a strict, reproducible phase order:
BOOT
↓ (sensors OK)
SEARCH_LOCALIZE
↓ (person detected OR voice heard)
APPROACH_CONFIRM
↓ (standoff established)
SCENE_SAFETY_TRIAGE
↓ (hazards assessed)
DEBRIS_ASSESSMENT (optional)
↓ (access improved OR not movable)
SCAN_CAPTURE
↓ (4 views captured: front, left, right, close-up)
TRIAGE_INTERVIEW
↓ (questions answered OR timeout)
REPORT_SEND
↓ (report sent)
MONITOR_WAIT
↓ (operator handoff OR mission command)
DONE
No skipping, reordering, or ad-hoc transitions unless a `--force_phase X` override is used (logged as a deliberate override).
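One way to enforce the ordering above is a small transition guard. This is a sketch under assumptions, not the project's actual pipeline engine: phase names mirror the diagram, and `can_transition` is a hypothetical helper.

```python
# Sketch of a strict-order transition guard: each phase may only advance to
# the next phase in the list; DEBRIS_ASSESSMENT is optional and may be
# skipped; anything else requires an explicit forced override.
PHASE_ORDER = [
    "BOOT", "SEARCH_LOCALIZE", "APPROACH_CONFIRM", "SCENE_SAFETY_TRIAGE",
    "DEBRIS_ASSESSMENT", "SCAN_CAPTURE", "TRIAGE_INTERVIEW",
    "REPORT_SEND", "MONITOR_WAIT", "DONE",
]

def can_transition(current: str, target: str, force: bool = False) -> bool:
    if force:
        return True  # --force_phase override, logged as deliberate
    i, j = PHASE_ORDER.index(current), PHASE_ORDER.index(target)
    if j == i + 1:
        return True  # normal path: the immediate successor
    # allow skipping only over the optional debris phase
    return PHASE_ORDER[i + 1] == "DEBRIS_ASSESSMENT" and j == i + 2
```

Rejecting every other transition at a single choke point is what makes mission runs reproducible and the phase log trustworthy.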
Layer 1: Reflex Controller (~10-20 Hz)
- Input: Latest observation (obstacle distance, person offset)
- Output: Safety override (STOP if obstacle < 0.5m, ROTATE to center person)
- Purpose: Fast, deterministic safety responses
- No LLM involved
Layer 2: LLM Policy (~1 Hz)
- Input: World state JSON (phase, observations, transcript, triage state)
- Output: Decision (action, say/ask text, params, confidence)
- Purpose: High-level decisions (what to say, where to navigate, when to transition)
- Model: GPT-4o-mini (fast, cheap, good enough for SAR dialogue)
Layer 3: VLM Specialist (on-demand)
- Input: Frame + prompt (injury detection, scene understanding)
- Output: Structured findings (JSON)
- Purpose: Visual intelligence (injury classification, hazard detection)
- Model: GPT-4-Vision (or GPT-4o with vision)
Layer 4: Planner-Executor (optional)
- Input: World state
- Output: Multi-step plan (1-5 actions)
- Purpose: Lookahead planning for complex scenarios
- Currently: Stub (future work)
- Fast reflex, slow intelligence: Safety-critical responses don't wait for LLM
- Graceful degradation: If LLM fails, reflex controller keeps robot safe
- Explainability: Every decision logged with rationale and confidence
- Testability: Mock components allow deterministic unit tests
- Modularity: Swap LLM provider, VLM model, or perception backend without changing orchestrator logic
- Python 3.10+ (we use the conda `base` environment)
- Node.js 18+ (for webapp)
- OpenAI API Key (for LLM + VLM)
- Hardware (optional):
- Webcam (for local mode)
- Booster K1 robot (for robot mode)
```bash
git clone <repo-url> treehacks2026
cd treehacks2026
```

```bash
# Option A: Use conda (recommended)
conda activate base
cd himpublic-py
pip install ultralytics opencv-python requests fastapi uvicorn numpy openai pyttsx3 SpeechRecognition

# Option B: Use venv
cd himpublic-py
python3 -m venv .venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate
pip install ultralytics opencv-python requests fastapi uvicorn numpy openai pyttsx3 SpeechRecognition
```

Create `.env` in `himpublic-py/` or the repo root:
```bash
# himpublic-py/.env
OPENAI_API_KEY=sk-proj-...
```

```bash
cd ../webapp
npm install
```

Only needed if you have the Booster K1 robot:

```bash
# FROM LAPTOP (not from SSH!):
scp himpublic-py/src/robot_bridge/server.py booster@192.168.10.102:~/server.py
```

Use this for testing without the robot.
```bash
cd himpublic-py
PYTHONPATH=src python -m uvicorn himpublic.comms.command_center_server:app --host 127.0.0.1 --port 8000
```

Expected Output:

```
INFO:     Uvicorn running on http://127.0.0.1:8000
```
```bash
cd himpublic-py
PYTHONPATH=src python -m himpublic.main --io local --command-center http://127.0.0.1:8000
```

Expected Output:

```
[BOOT] Self-check passed
[PHASE] search_localize
[PERCEPTION] Persons: 0, confidence: 0.00
[POLICY] Decision: rotate_left (confidence: 0.75)
[TELEMETRY] Posted to command center
```
CLI Flags:
- `--io local`: Use webcam/file
- `--video webcam` (default) or `--video file --video-path <path>`
- `--show` (default): OpenCV preview window
- `--no-show`: Headless (no preview)
- `--no-tts`: Disable text-to-speech
- `--no-mic`: Disable microphone (use keyboard input)
- `--start-phase search_localize`: Skip boot
```bash
cd webapp
npm run dev
```

Expected Output:

```
VITE ready in 342 ms
➜  Local:   http://localhost:5173/
```
Open http://localhost:5173 in browser.
Use this when SSH'd into the K1 robot.
```bash
ssh booster@192.168.10.102  # Password: 123456
source /opt/ros/humble/setup.bash && python3 ~/server.py
```

Expected Output:

```
INFO: Trying ROS2 camera on topic /StereoNetNode/rectified_image ...
INFO: ROS2 camera: first frame received (544x448)
INFO: Camera backend: ROS2
INFO: Motion DISABLED (safe read-only mode). Use --allow-motion to enable.
INFO: Uvicorn running on http://0.0.0.0:9090
```
Verify from laptop:

```bash
curl http://192.168.10.102:9090/health
```

```bash
cd himpublic-py
PYTHONPATH=src python -m uvicorn himpublic.comms.command_center_server:app --host 127.0.0.1 --port 8000
```

```bash
cd himpublic-py
PYTHONPATH=src python -m himpublic.main \
  --io robot \
  --robot-bridge-url http://192.168.10.102:9090 \
  --command-center http://127.0.0.1:8000 \
  --no-show
```

Expected Output:

```
[IO] RobotBridgeClient connected to http://192.168.10.102:9090
[BOOT] Robot bridge health check: OK
[PHASE] search_localize
[SEARCH] Calling out: "Where are you? Can you hear me?"
[AUDIO] Recording 3s from robot mic...
[ASR] Transcript: "I'm over here!"
[SEARCH] Voice detected, bearing: 45 degrees
[POLICY] Decision: rotate_left(degrees=45)
[TELEMETRY] Posted snapshot to command center
```
```bash
cd webapp
npm run dev
```

Before running the full pipeline, verify the bridge works:

```bash
cd himpublic-py
python3 scripts/smoke_test_robot.py --host 192.168.10.102
```

Tests:
- ✅ GET /health (bridge reachable)
- ✅ GET /state (telemetry)
- ✅ GET /frame ×3 (camera frames)
- ✅ POST /speak (TTS)
- ✅ POST /record (mic recording)
Output: missions/smoke_001/results.json + 3 frame images + audio.wav
- Startup:
- Orchestrator connects to robot bridge (or webcam)
- Boots: self-check sensors (camera, mic, speaker)
- Phase → SEARCH_LOCALIZE
- Search:
- Robot rotates, calling out: "Where are you?"
- Audio scanner listens for voice, runs ASR
- If voice detected → direction finding → bearing
- YOLO scans for person
- If person detected with confidence ≥ 0.7 → emit FOUND_PERSON event
- RingBuffer → keyframes → POST /snapshot to command center
- Phase → APPROACH_CONFIRM
- Approach:
- LLM policy: "I detected a person. I should navigate closer."
- Robot moves forward, re-detects person to confirm
- Establishes standoff distance (~1.5m)
- Says: "I'm a rescue robot. I'm here to help."
- Phase → SCENE_SAFETY_TRIAGE
- Safety Check:
- YOLO-World scans for hazards (fire, debris, unstable structures)
- If rubble detected → Phase → DEBRIS_ASSESSMENT (optional)
- If safe → Phase → SCAN_CAPTURE
- Visual Scan:
- Robot captures 4 views: front, left, right, close-up
- Each frame → MedicalAssessor (VLM)
- VLM output:
[{"finding_type": "bleeding", "body_region": "left_forearm", "confidence": 0.78}]
- Findings accumulated
- Phase → TRIAGE_INTERVIEW
- Triage Dialogue:
- QuestionPlanner generates questions based on visual findings
- Robot asks: "Can you hear me? Are you conscious?"
- Victim responds: "Yes, I'm awake."
- ASR → transcript → LLM parsing → triage_answers["consciousness"] = "alert"
- Robot: "On a scale of 1-10, how much pain are you in?"
- Victim: "Maybe a 7."
- triage_answers["pain_scale"] = 7
- Robot: "Where are you hurt?"
- Victim: "My left knee is bleeding."
- triage_answers["injury_location"] = "left_knee"
- medical_pipeline.set_spoken_body_region("left_knee")
- Phase → REPORT_SEND
- Report Generation:
- ReportBuilder compiles:
- Visual findings (with evidence IDs)
- Triage transcript
- Keyframe images
- Generates Markdown + PDF
- POST /report to command center
- Phase → MONITOR_WAIT
- Monitor:
- Robot stays with victim
- Operator can send messages via webapp
- Robot polls GET /operator-messages, speaks them
- Awaits handoff to human responders
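The walkthrough above follows a strict phase order with one optional detour. A minimal sketch of that sequencing, with phase names taken from the walkthrough (the real orchestrator in himpublic-py may differ in detail):

```python
# Sketch of the mission phase order described above. Phase names mirror
# the walkthrough; `advance` is an illustrative helper, not the real API.
from enum import Enum

class Phase(Enum):
    SEARCH_LOCALIZE = "search_localize"
    APPROACH_CONFIRM = "approach_confirm"
    SCENE_SAFETY_TRIAGE = "scene_safety_triage"
    DEBRIS_ASSESSMENT = "debris_assessment"   # optional branch
    SCAN_CAPTURE = "scan_capture"
    TRIAGE_INTERVIEW = "triage_interview"
    REPORT_SEND = "report_send"
    MONITOR_WAIT = "monitor_wait"

# Strict happy-path ordering; DEBRIS_ASSESSMENT is an optional detour.
NEXT = {
    Phase.SEARCH_LOCALIZE: Phase.APPROACH_CONFIRM,
    Phase.APPROACH_CONFIRM: Phase.SCENE_SAFETY_TRIAGE,
    Phase.SCENE_SAFETY_TRIAGE: Phase.SCAN_CAPTURE,
    Phase.SCAN_CAPTURE: Phase.TRIAGE_INTERVIEW,
    Phase.TRIAGE_INTERVIEW: Phase.REPORT_SEND,
    Phase.REPORT_SEND: Phase.MONITOR_WAIT,
}

def advance(phase: Phase, rubble_detected: bool = False) -> Phase:
    """Advance one step; branch to debris assessment when rubble is seen."""
    if phase is Phase.SCENE_SAFETY_TRIAGE and rubble_detected:
        return Phase.DEBRIS_ASSESSMENT
    if phase is Phase.DEBRIS_ASSESSMENT:
        return Phase.SCAN_CAPTURE
    return NEXT.get(phase, Phase.MONITOR_WAIT)
```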
Camera Frame (BGR)
↓
YOLO Person Detection → Observation(num_persons, confidence, offset)
↓
LatestFrameStore.update(frame, obs)
↓
RingBuffer.push(frame_jpeg, obs) [every 0.5s at 2 FPS]
↓
Policy Loop (1 Hz):
obs = LatestFrameStore.get()
world_state = build_world_state(obs, state, transcript)
decision = LLMPolicy.decide(world_state) # GPT-4o-mini call
if decision.action == "say":
audio_io.speak(decision.say)
cc_client.post_event({"event": "robot_said", "text": decision.say})
elif decision.action == "rotate_left":
robot.set_velocity(vyaw=0.3)
...
Event Trigger (e.g., FOUND_PERSON)
↓
EventManager.emit(FOUND_PERSON, meta={"confidence": 0.82})
↓
keyframes = RingBuffer.get_keyframes(k=3, strategy="spread")
↓
for frame_jpeg in keyframes:
save_to_disk("data/snapshots/{timestamp}_found_person_{i}.jpg")
cc_client.post_snapshot(frame_jpeg, filename="...", meta={...})
↓
Command Center: latest_snapshot_path = "data/snapshots/..."
↓
Webapp polls GET /latest every 2s:
{
"event": {"phase": "approach_confirm", "num_persons": 1, ...},
"snapshot_path": "data/snapshots/1708022345_found_person_2.jpg"
}
↓
Webapp: GET /snapshot/latest → displays JPEG in RobotFeed component
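The `RingBuffer.get_keyframes(k=3, strategy="spread")` call above can be sketched as follows — a short rolling window of JPEG frames from which, on an event, k frames are picked evenly across the buffer. The class internals here are an illustration, not the exact himpublic implementation:

```python
# Hedged sketch of a ring buffer with a "spread" keyframe strategy:
# keep the last N (frame, observation) pairs and sample k of them
# evenly from oldest to newest when an event fires.
from collections import deque

class RingBuffer:
    def __init__(self, maxlen: int = 20):        # e.g. 10 s at 2 FPS
        self._frames = deque(maxlen=maxlen)

    def push(self, frame_jpeg: bytes, obs: dict) -> None:
        self._frames.append((frame_jpeg, obs))

    def get_keyframes(self, k: int = 3, strategy: str = "spread"):
        frames = list(self._frames)
        if not frames or k <= 0:
            return []
        if strategy == "spread" and len(frames) > k:
            # Evenly spaced indices across the buffer, oldest to newest.
            step = (len(frames) - 1) / (k - 1) if k > 1 else 0
            idxs = sorted({round(i * step) for i in range(k)})
            return [frames[i] for i in idxs]
        return frames[-k:]                       # fallback: most recent k
```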
Webapp: User types "Tell them help is on the way"
↓
POST /operator-message {"text": "Tell them help is on the way"}
↓
Command Center:
_comms.append({"role": "operator", "text": "Tell them help is on the way"})
_operator_messages.append({"text": "Tell them help is on the way", "index": N})
↓
Orchestrator: _operator_message_loop (async task, polls every 5s):
messages = GET /operator-messages
for msg in messages:
audio_io.speak(msg["text"])
POST /operator-messages/ack {"after_index": N}
↓
Command Center: clears operator_messages queue
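The `_operator_message_loop` described above boils down to a poll-speak-ack cycle. A testable sketch with the transport injected as callables (the real orchestrator talks to the command-center endpoints directly; names and payload shapes here are illustrative):

```python
# Sketch of the operator-message poll/speak/ack loop. fetch stands in
# for GET /operator-messages, ack for POST /operator-messages/ack, and
# speak for robot TTS; max_iters exists only to make the loop testable.
import asyncio
from typing import Callable, Optional

async def operator_message_loop(
    fetch: Callable[[], list],        # returns [{"text": ..., "index": N}, ...]
    ack: Callable[[int], None],       # acknowledge everything up to index N
    speak: Callable[[str], None],     # robot TTS
    poll_s: float = 5.0,
    max_iters: Optional[int] = None,  # None = run forever
) -> None:
    n = 0
    while max_iters is None or n < max_iters:
        msgs = fetch()
        for msg in msgs:
            speak(msg["text"])
        if msgs:
            # Ack past the newest index so the server clears its queue.
            ack(msgs[-1]["index"])
        await asyncio.sleep(poll_s)
        n += 1
```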
- himpublic-py: Production-style pipeline for real deployment
- no-robot-sim: Wizard-of-Oz demo (keyboard-driven, no robot)
- webapp: Operator dashboard (React)
- booster_robotics_sdk: Official SDK reference (read-only)
- Python-Orchestrator-First (not ROS-first):
- Develop on laptop without ROS2
- Robot hardware abstracted behind HTTP bridge
- Easy to swap backends (mock → robot)
- Keyframes, Not Continuous Recording:
- Event-triggered capture saves storage and bandwidth
- 10s rolling buffer at 2 FPS = ~1.2 MB RAM
- Keyframes on FOUND_PERSON, HEARD_RESPONSE, HEARTBEAT
- Two-Layer Brain:
- Fast reflex controller (safety)
- Slow LLM policy (intelligence)
- Reflex can override LLM
- Evidence-Based Reports:
- Every claim traceable to sensor data
- Evidence log (JSONL) with unique IDs
- Reports cite evidence: "Possible bleed [E12][E14]"
- Strict Phase Sequencing:
- No ad-hoc transitions
- Reproducible mission flow
- Override only via `--force_phase` (logged)
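The "Two-Layer Brain" decision above — fast reflex controller for safety, slow LLM policy for intelligence, with the reflex able to override — can be sketched in a few lines. The hazard thresholds and the `Decision` shape are illustrative assumptions, not the actual himpublic types:

```python
# Sketch of the two-layer brain: a deterministic reflex check runs every
# tick and can veto the LLM policy's decision. Field names and thresholds
# are assumptions for illustration.
from dataclasses import dataclass
from typing import Optional

@dataclass
class Decision:
    action: str
    reason: str

def reflex_check(obs: dict) -> Optional[Decision]:
    """Fast, deterministic safety layer; returns an override or None."""
    if obs.get("tilt_deg", 0) > 25:
        return Decision("stop", "excessive tilt: fall risk")
    if obs.get("obstacle_dist_m", 99) < 0.3:
        return Decision("stop", "obstacle inside standoff distance")
    return None

def decide(obs: dict, llm_decision: Decision) -> Decision:
    """Reflex overrides the slow LLM policy; otherwise the LLM stands."""
    return reflex_check(obs) or llm_decision
```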
- Define phase enum (himpublic-py/src/himpublic/orchestrator/phases.py):

  ```python
  class Phase(Enum):
      # ... existing phases ...
      NEW_PHASE = "new_phase"
  ```
- Add phase handler (agent.py or separate file):

  ```python
  async def _new_phase_handler(state: SharedState) -> PhaseResult:
      # Do work...
      return PhaseResult(
          status=PhaseStatus.SUCCESS,
          outputs={"key": "value"},
          reason="completed successfully",
      )
  ```
- Register in phase list (pipeline/phases.py):

  ```python
  PIPELINE_PHASES.append(
      PhaseDefinition(
          name="NEW_PHASE",
          handler=_new_phase_handler,
          preconditions=["previous_phase_output"],
          retry_policy=RetryPolicy(max_attempts=3, cooldown_s=2.0),
      )
  )
  ```
- Create detector (himpublic-py/src/himpublic/perception/new_detector.py):

  ```python
  class NewDetector:
      def __init__(self, model_name: str = "model.pt"):
          self.model = load_model(model_name)

      def detect(self, frame_bgr: np.ndarray) -> list[Detection]:
          results = self.model(frame_bgr)
          return parse_results(results)
  ```
- Integrate in agent (agent.py):

  ```python
  from himpublic.perception.new_detector import NewDetector

  new_detector = NewDetector()
  detections = new_detector.detect(frame)
  ```
- Create adapter (himpublic-py/src/himpublic/orchestrator/llm_adapters/new_provider.py):

  ```python
  def call_new_provider(prompt: str, model: str) -> dict:
      response = new_provider.complete(prompt, model=model)
      return parse_json(response)
  ```
- Update config (config.py):

  ```python
  @dataclass
  class OrchestratorConfig:
      llm_provider: str = "openai"  # or "new_provider"
      llm_model: str = "gpt-4o-mini"
  ```
- Use in policy (policy.py):

  ```python
  if config.llm_provider == "openai":
      response = call_openai(...)
  elif config.llm_provider == "new_provider":
      response = call_new_provider(...)
  ```
cd himpublic-py
pytest tests/

# Test with video file (no robot)
python -m himpublic.main --io local --video file --video-path test.mp4 --no-command-center

# Test command center (no orchestrator)
python scripts/run_command_center.py
curl http://127.0.0.1:8000/latest

# Test robot bridge (smoke test)
python scripts/smoke_test_robot.py --host 192.168.10.102

# Terminal 1: Command center
python scripts/run_command_center.py
# Terminal 2: Orchestrator with webcam
python -m himpublic.main --io local
# Terminal 3: Webapp
cd webapp && npm run dev
# Open http://localhost:5173, walk in front of webcam, verify:
# - Person detected → phase transitions to APPROACH
# - Keyframes appear in webapp RobotFeed
# - Comms log shows robot speech

| Problem | Cause | Fix |
|---|---|---|
| `ModuleNotFoundError: himpublic` | Not in himpublic-py/ or forgot PYTHONPATH | `cd himpublic-py && PYTHONPATH=src python -m himpublic.main` |
| Command center not receiving events | Wrong URL | Check --command-center http://127.0.0.1:8000 |
| Webapp shows no feed | Orchestrator not running | Start orchestrator first |
| Robot bridge "no frame within 5s" | Forgot to source ROS2 | source /opt/ros/humble/setup.bash before starting server |
| `espeak` hangs on robot | PulseAudio issue | Fixed in server.py: use `espeak --stdout \| paplay` |
| YOLO "model not found" | First run, downloading weights | Wait ~1 min, will auto-download yolov8n.pt |
| LLM "API key not found" | Missing .env file | Create himpublic-py/.env with OPENAI_API_KEY=... |
Inference (orchestrator on laptop)
- LLM (GPT-4o-mini): ~200–500 ms per decision call; drives high-level policy at ~1 Hz.
- VLM (GPT-4-Vision): ~1–2 s per frame for injury assessment; used for multi-view triage.
- CV — YOLO (person): ~30 FPS on laptop GPU, ~10 FPS on CPU; runs every frame for search/approach.
- CV — YOLO-World (open-vocab): ~15 FPS on GPU, ~5 FPS on CPU; used for rubble/obstacles.
- Keyframe transmission: ~60 KB per frame at quality 80.
Edge AI (K1 robot)
The Booster K1 is built on a Jetson Orin NX (Tensor Cores GPU @ 1173 MHz, 117 TOPS). In our current setup the robot runs the bridge (camera, mic, speaker, motion) and heavy CV/LLM inference runs on the laptop. The K1’s on-board tensor GPUs support on-edge inference: running YOLO or other lightweight models directly on the robot would reduce latency and bandwidth and allow operation with intermittent connectivity. Our pipeline is structured so that perception and policy can be moved on-robot for future edge deployment.
- LLM (GPT-4o-mini): $0.15 / 1M input tokens, $0.60 / 1M output tokens
- ~500 tokens/decision → ~200 decisions/$1
- VLM (GPT-4-Vision): ~$0.01 per image analysis
- 4 views per victim = $0.04
- ASR (Whisper): $0.006 / minute
- 1 min dialogue = $0.006
Estimated cost per mission: $0.10 - $0.50 (depending on dialogue length)
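The per-mission estimate above can be reproduced as a back-of-envelope calculation from the listed rates. Mission length, decision rate, and per-call token counts below are assumptions for illustration:

```python
# Back-of-envelope mission cost using the API rates listed above.
# Assumed: 10-minute mission, ~1 LLM decision/s with text-only prompts
# (~400 input / ~100 output tokens), 4 VLM views, 1 minute of dialogue.
LLM_IN_PER_M = 0.15    # $ per 1M input tokens  (GPT-4o-mini)
LLM_OUT_PER_M = 0.60   # $ per 1M output tokens (GPT-4o-mini)
VLM_PER_IMAGE = 0.01   # $ per image analysis   (GPT-4-Vision, approx.)
ASR_PER_MIN = 0.006    # $ per minute           (Whisper)

def mission_cost(minutes: float = 10, decisions_per_s: float = 1.0,
                 tokens_in: int = 400, tokens_out: int = 100,
                 vlm_images: int = 4, dialogue_min: float = 1.0) -> float:
    decisions = minutes * 60 * decisions_per_s
    llm = decisions * (tokens_in * LLM_IN_PER_M +
                       tokens_out * LLM_OUT_PER_M) / 1e6
    return llm + vlm_images * VLM_PER_IMAGE + dialogue_min * ASR_PER_MIN
```

With these assumptions a 10-minute mission lands near the low end of the $0.10-$0.50 range; longer dialogue or image-bearing decision prompts push it toward the high end.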
- Voice-dependent search: Sound-based localization assumes the victim can respond by voice. Silent or unresponsive victims are only found by visual search (YOLO) as the robot patrols.
- Environment: Reverberant or very noisy spaces can degrade audio bearing accuracy; the spin-to-peak method works best when the victim’s voice clearly dominates in one direction.
- API latency: LLM and VLM calls go to the cloud (OpenAI); round-trip latency (200 ms–2 s) affects how quickly the robot reacts. Edge deployment on the K1’s tensor GPUs would reduce this for CV.
- CV false positives/negatives: YOLO can miss persons in clutter or at odd angles; YOLO-World rubble detection depends on prompt and lighting. Medical VLM findings are probabilistic and should be confirmed by medics.
- Single victim focus: The current triage flow is optimized for one victim at a time; multi-victim prioritization is not fully implemented.
- Connectivity: Orchestrator and command center assume a reliable link between laptop and robot (e.g. USB Ethernet). Offline or bandwidth-limited operation would require moving inference on-robot.
ADAM — Autonomous Disaster Assistance Manager · TreeHacks 2026
Robot: Booster K1 Humanoid
Models: OpenAI (GPT-4o-mini, GPT-4-Vision), Ultralytics (YOLO, YOLO-World), Google MediaPipe
Acknowledgments: Thank you to HIM: Humans in Motion for lending us the K1 robot for TreeHacks 2026 at Stanford.
MIT




