SignFlow is the first real-time bidirectional ASL interpreter built on Stream's Vision Agents SDK. It bridges communication between deaf/hard-of-hearing signers and hearing speakers inside a live video call.
Built for the WeMakeDevs "Vision Possible: Agent Protocol" hackathon.
A deaf user signs to their webcam. The AI agent watches the video feed, recognizes ASL signs using Gemini's multimodal vision + YOLO Pose skeleton tracking, and speaks the English translation aloud. The translation also appears as text in the side panel.
A hearing user speaks normally. The agent transcribes their speech, converts it to ASL sign-by-sign descriptions (handshape, location, movement, facial expressions), and displays the instructions in the UI so the deaf user knows what was said.
| Layer | Technology | Role |
|---|---|---|
| Video Transport | Stream Edge Network | WebRTC, <30ms latency |
| Sign Recognition | Gemini Realtime (5 FPS) | Sees and interprets signs from video |
| Pose Detection | YOLO Pose (yolo11n-pose.pt) | Skeleton overlay + structural data |
| Speech I/O | Gemini Realtime | Native audio in/out |
| Backend | Vision Agents SDK (Python) | Orchestrates everything |
| Frontend | React + @stream-io/video-react-sdk | Video call UI + interpreter panels |
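The backend wiring stays small because the SDK handles transport and model plumbing. Below is a condensed, hypothetical sketch of what `agent.py` looks like, modeled on Stream's published Vision Agents examples; the import paths, class names, call ID, and constructor arguments are assumptions and may differ from the real file:

```python
# Hypothetical sketch of agent.py, modeled on Stream's Vision Agents examples.
# Import paths and constructor arguments are assumptions, not the repo's exact code.
import asyncio

from vision_agents.core import Agent, User
from vision_agents.plugins import getstream, gemini, ultralytics

async def main() -> None:
    agent = Agent(
        edge=getstream.Edge(),                        # Stream Edge Network (WebRTC transport)
        agent_user=User(name="SignFlow Interpreter"),
        instructions=open("instructions.md").read(),  # ASL knowledge base system prompt
        llm=gemini.Realtime(fps=5),                   # watches video at 5 FPS, speaks replies
        processors=[ultralytics.YOLOPoseProcessor(model_path="yolo11n-pose.pt")],
    )
    call = agent.edge.client.video.call("default", "signflow-demo")  # call ID is a placeholder
    async with await agent.join(call):                # agent joins as a regular participant
        await agent.finish()                          # run until the call ends

if __name__ == "__main__":
    asyncio.run(main())
```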
| Service | Sign Up |
|---|---|
| Stream | getstream.io |
| Google AI (Gemini) | aistudio.google.com |
```bash
cd backend

# Create .env with your API keys
cp .env.example .env   # then edit .env with real keys

# Install dependencies
uv sync

# Generate a frontend token
uv run generate_token.py

# Start the agent
uv run agent.py run
```
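The backend `.env` should look roughly like this; the variable names below are assumptions (they follow Stream's and Google's usual conventions), so treat `.env.example` as the authoritative list:

```env
# Hypothetical backend .env — confirm variable names against .env.example
STREAM_API_KEY=your_stream_api_key
STREAM_API_SECRET=your_stream_api_secret
GOOGLE_API_KEY=your_gemini_api_key
```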
```bash
cd frontend

# Add the token from generate_token.py to .env:
# edit .env with VITE_STREAM_API_KEY, VITE_STREAM_TOKEN, VITE_STREAM_USER_ID

# Install dependencies
npm install

# Start the dev server
npm run dev
```
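For example (the variable names come from the setup notes above; the values are placeholders):

```env
# frontend/.env — values are placeholders
VITE_STREAM_API_KEY=your_stream_api_key
VITE_STREAM_TOKEN=token_printed_by_generate_token.py
VITE_STREAM_USER_ID=demo-user
```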
- Open the frontend at http://localhost:5173
- Enter a Call ID and click "Join Call"
- The agent will automatically join the same call
- Start signing or speaking!
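`generate_token.py` mints the Stream user token the frontend needs. Stream user tokens are ordinary HS256 JWTs signed with your API secret, so a minimal stand-in looks like this (the real script may use an SDK helper instead; the env var name and user ID here are placeholders):

```python
# Minimal stand-in for generate_token.py — the repo's actual script may differ.
# Stream user tokens are HS256 JWTs signed with the app's API secret.
import os
import time

import jwt  # pip install PyJWT

api_secret = os.environ["STREAM_API_SECRET"]  # assumed env var name
user_id = "demo-user"                         # must match VITE_STREAM_USER_ID

token = jwt.encode(
    {"user_id": user_id, "iat": int(time.time())},
    api_secret,
    algorithm="HS256",
)
print(token)  # paste into frontend/.env as VITE_STREAM_TOKEN
```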
```
WeMakeDevs4/
├── backend/
│   ├── agent.py              # Main Vision Agent
│   ├── instructions.md       # ASL knowledge base system prompt
│   ├── sign_processor.py     # Custom YOLO Pose processor
│   ├── generate_token.py     # Stream token generator
│   └── pyproject.toml        # Python dependencies
│
├── frontend/
│   ├── src/
│   │   ├── App.tsx           # Root with StreamVideo provider
│   │   ├── components/
│   │   │   ├── CallSetup.tsx           # Join/create call UI
│   │   │   ├── VideoCall.tsx           # Main layout (video + panels)
│   │   │   ├── SignToSpeechPanel.tsx   # Detected signs + translation
│   │   │   ├── SpeechToSignPanel.tsx   # Sign instructions
│   │   │   ├── TranscriptLog.tsx       # Conversation history
│   │   │   ├── ModeToggle.tsx          # Switch modes
│   │   │   └── StatusIndicator.tsx     # Connection state
│   │   └── hooks/
│   │       ├── useSignEvents.ts        # Custom event subscription
│   │       └── useTranscript.ts        # Transcript state management
│   └── package.json
│
└── README.md
```
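`sign_processor.py` is the custom YOLO Pose processor from the stack table. The processor hook below (`process_frame`) is an illustrative assumption about the SDK's processor interface, but the Ultralytics calls (`YOLO`, `keypoints`, `plot`) are the library's real API:

```python
# Illustrative sketch in the spirit of sign_processor.py.
# process_frame is an assumed hook name; the Ultralytics usage is the real API.
from ultralytics import YOLO

class SignPoseProcessor:
    def __init__(self, model_path: str = "yolo11n-pose.pt"):
        self.model = YOLO(model_path)  # 17-keypoint COCO pose model

    def process_frame(self, frame):
        """Run pose estimation on one video frame (numpy BGR array)."""
        results = self.model(frame, verbose=False)
        keypoints = results[0].keypoints   # per-person (x, y, confidence) keypoints
        annotated = results[0].plot()      # frame with the skeleton overlay drawn
        # Structural keypoint data accompanies raw frames to Gemini;
        # the annotated frame feeds the skeleton overlay seen in the UI.
        return annotated, keypoints
```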
```
┌─────────────┐    WebRTC    ┌──────────────┐    Gemini    ┌─────────────┐
│  React App  │◄────────────►│ Stream Edge  │◄────────────►│  AI Agent   │
│  (Browser)  │  Video/Audio │   Network    │  Video/Audio │  (Python)   │
└──────┬──────┘              └──────────────┘              └──────┬──────┘
       │                                                          │
       │  Custom Events                        YOLO Pose +        │
       │  (sign_detected,                      Gemini Vision      │
       │   sign_guide,                                            │
       │   transcript)                                            │
       └──────────────────────◄───────────────────────────────────┘
```
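The three custom event types on the left edge of the diagram are what `useSignEvents.ts` subscribes to. The payload shapes below are illustrative assumptions, not the repo's actual schema; the agent would publish objects like these to the call as Stream custom events:

```python
# Illustrative event payloads — field names are assumptions, not the repo's schema.
sign_detected = {
    "type": "sign_detected",
    "gloss": "HELLO",            # recognized ASL gloss
    "translation": "Hello",      # spoken English output
    "confidence": 0.91,
}

sign_guide = {
    "type": "sign_guide",
    "english": "Thank you",
    "steps": [                   # handshape / location / movement / expression
        "Flat hand, fingertips at chin",
        "Move hand forward and down",
        "Neutral-to-warm facial expression",
    ],
}

transcript = {"type": "transcript", "speaker": "hearing-user", "text": "Thank you"}
```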
- Sign "HELLO" → Agent speaks "Hello" + text appears in Sign→Speech panel
- Sign "MY NAME" + fingerspell → Agent speaks your name
- Sign "HOW ARE YOU" → Agent translates with proper grammar
- Switch to Speech→Sign → Speak "Thank you" → Panel shows THANK-YOU sign instructions
- Switch to Both mode → Full bidirectional conversation
MIT