An AI-powered voice assistant that controls your computer through natural-language commands, using Google's Gemini 2.5 Flash via the Live API.
- Move mouse to specific coordinates (absolute or relative)
- Click (left/right, single/double/triple)
- Hold and release mouse buttons
- Smooth animated mouse movements
- Scroll in any direction
- AI-powered coordinate detection from screen descriptions
- Press individual keys (including special keys)
- Type text with optional select-all-first
- Execute keyboard shortcuts (Cmd+C, Cmd+V, etc.)
- Replace text in fields
- Smart Coordinate Detection: Tell the AI "click on Discord" and it finds it!
- Quiz Generator: Generates fun quizzes based on what's on your screen
- Visual Recognition: Analyzes screen content with Gemini 2.5 Pro
- Real-time voice interaction with audio feedback
- Translucent modal windows for quiz display
- Screen sharing with cursor overlay
- Visual grid system for precise coordinate detection
- OS: macOS (tested on macOS 14+)
- Python: 3.12 or higher
- Microphone & Speakers: Required for voice interaction
- Screen Recording Permission: For screen capture
- Accessibility Permission: For mouse/keyboard control
- Google AI API Key from Google AI Studio
Using uv (recommended):
cd voice-chat
uv pip install python-dotenv google-genai pyaudio pillow opencv-python mss pynput pyautogui numpy
Or using pip:
pip install python-dotenv google-genai pyaudio pillow opencv-python mss pynput pyautogui numpy
Create a .env file in the voice-chat directory:
echo "GOOGLE_API_KEY=your-api-key-here" > .env
Or export it in your shell:
export GOOGLE_API_KEY="your-api-key-here"
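A minimal sketch of how a script could read the key at startup, assuming it is exposed through the `GOOGLE_API_KEY` environment variable as shown above. The `load_api_key` helper is hypothetical and is not taken from test3.py; if python-dotenv is used, calling its `load_dotenv()` first would copy the `.env` values into the environment before this check runs.

```python
import os
from typing import Mapping, Optional


def load_api_key(env: Optional[Mapping[str, str]] = None) -> str:
    """Return the Google AI API key, or raise a clear error if it is missing.

    `env` defaults to os.environ; passing a plain dict makes the helper easy to test.
    """
    env = os.environ if env is None else env
    key = env.get("GOOGLE_API_KEY", "").strip()
    # Treat the placeholder from the setup instructions as "not configured".
    if not key or key == "your-api-key-here":
        raise RuntimeError(
            "GOOGLE_API_KEY is not set; add it to .env or export it in your shell"
        )
    return key
```

Failing early with a readable message tends to be easier to debug than an authentication error raised later by the API client.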
- Open System Settings
- Go to Privacy & Security → Accessibility
- Click the + button
- Add Terminal (or your IDE like VS Code, PyCharm)
- Enable the checkbox
- Restart Terminal
- Open System Settings
- Go to Privacy & Security → Screen Recording
- Enable Terminal (or your IDE)
- Restart Terminal
Run the permission test script:
uv run python test_permissions.py
Expected output:
✅ Mouse controller created
✅ Mouse moved successfully!
✅ Accessibility permission is WORKING
✅ Screen captured successfully!
✅ Screen Recording permission is WORKING
🎉 ALL PERMISSIONS ARE WORKING!
Screen Share Mode (recommended):
uv run python test3.py --mode screen
Camera Mode:
uv run python test3.py --mode camera
Audio Only (no video):
uv run python test3.py --mode none
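For reference, a `--mode` flag like the one above can be parsed with the standard library's `argparse`. This is only a sketch: the three choices come from the commands above, but the default value and the `parse_args` helper name are assumptions, not details confirmed from test3.py.

```python
import argparse
from typing import Optional, Sequence


def parse_args(argv: Optional[Sequence[str]] = None) -> argparse.Namespace:
    """Parse the launcher's command line; argv=None means 'use sys.argv'."""
    parser = argparse.ArgumentParser(description="Voice assistant launcher (sketch)")
    parser.add_argument(
        "--mode",
        choices=["screen", "camera", "none"],
        default="screen",  # assumption: screen mode is the default because it is recommended
        help="video source streamed to the model: screen share, camera, or audio only",
    )
    return parser.parse_args(argv)


if __name__ == "__main__":
    print(f"Selected mode: {parse_args().mode}")
```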
- "Move the mouse to the center of the screen"
- "Click on Discord" (AI finds and clicks Discord icon)
- "Double click here"
- "Right click on the file"
- "Scroll down"
- "Move mouse 100 pixels to the right"
- "Type hello world"
- "Press enter"
- "Copy this" (Cmd+C)
- "Paste it" (Cmd+V)
- "Select all and replace with: new text here"
- "Press command shift S" (Save As)
- "What's on my screen?"
- "Click the close button"
- "Find the search box"
- "Generate a quiz from my screen"
- "Show me a quiz" (Generates interactive quiz from screen)
- "Create quiz questions"
Type commands when not speaking:
message > [Type your command here]
Type q to quit.
| Tool | Description | Parameters |
|---|---|---|
| move_mouse_absolute | Move to exact coordinates | x, y |
| move_mouse_relative | Move relative to current position | x, y |
| left_click_mouse | Left click | count (default: 1) |
| right_click_mouse | Right click | count (default: 1) |
| hold_left_mouse_button | Press and hold left button | - |
| release_left_mouse_button | Release left button | - |
| hold_right_mouse_button | Press and hold right button | - |
| release_right_mouse_button | Release right button | - |
| scroll_mouse_by | Scroll | dx, dy |
| get_mouse_position | Get current position | - |
| Tool | Description | Parameters |
|---|---|---|
| press_key | Press single key | key |
| type_text | Type text | text, select_all_first |
| select_all_and_replace | Replace all text | text |
| press_key_combination | Keyboard shortcuts | keys (array) |
Special Keys: space, enter, shift, ctrl, alt, cmd, tab, esc, up, down, left, right, backspace, delete
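A keyboard shortcut is typically sent by pressing the keys in order and releasing them in reverse order. The sketch below illustrates that idea for a tool like press_key_combination; it is not the implementation from test3.py. The `press` and `release` callbacks are hypothetical stand-ins for whatever backend is used (with pynput, each key name would first need to be mapped to the corresponding key object).

```python
from typing import Callable, List, Sequence

# Key names from the "Special Keys" list; single characters such as "c" are also accepted.
SPECIAL_KEYS = {
    "space", "enter", "shift", "ctrl", "alt", "cmd", "tab", "esc",
    "up", "down", "left", "right", "backspace", "delete",
}


def press_key_combination(
    keys: Sequence[str],
    press: Callable[[str], None],
    release: Callable[[str], None],
) -> None:
    """Press keys in order, then release them in reverse order."""
    normalized: List[str] = [key.strip().lower() for key in keys]
    for key in normalized:
        if key not in SPECIAL_KEYS and len(key) != 1:
            raise ValueError(f"unknown key name: {key!r}")

    pressed: List[str] = []
    try:
        for key in normalized:
            press(key)
            pressed.append(key)
    finally:
        # Always release what was pressed, so a failure cannot leave a modifier stuck.
        for key in reversed(pressed):
            release(key)
```

For example, `press_key_combination(["cmd", "shift", "s"], press, release)` presses cmd, shift, s and then releases s, shift, cmd.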
| Tool | Description | Parameters |
|---|---|---|
| smart_detect_screen_coordinates | AI finds UI element | prompt |
| generate_quiz_from_screen | Create interactive quiz | - |
| get_screen_size | Get screen dimensions | - |
When you say "Click on Discord", the AI follows this sequence:
- Detect: Calls smart_detect_screen_coordinates("Discord")
  - Captures screen with grid overlay
  - Sends to Gemini 2.5 Pro
  - Gets coordinates: x=250, y=575
- Move: Calls move_mouse_absolute(250, 575)
  - Smoothly moves mouse in 20 steps
  - 5 ms delay between steps for smooth animation
- Click: Calls left_click_mouse()
  - Clicks at the current position
- Respond: AI confirms verbally: "I've clicked on Discord"
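The smooth movement in the Move step (20 steps, 5 ms apart) can be sketched as a linear interpolation between the current and target positions. This is an illustration under assumptions rather than the code in test3.py: `interpolate_path` and `move_smoothly` are hypothetical names, and `set_position` stands in for the real backend (with pynput, assigning to the mouse controller's `position` attribute would play that role).

```python
import time
from typing import Callable, List, Tuple

Point = Tuple[int, int]


def interpolate_path(start: Point, end: Point, steps: int = 20) -> List[Point]:
    """Return `steps` evenly spaced points from just after `start` up to `end`."""
    if steps < 1:
        raise ValueError("steps must be at least 1")
    (x0, y0), (x1, y1) = start, end
    return [
        (round(x0 + (x1 - x0) * i / steps), round(y0 + (y1 - y0) * i / steps))
        for i in range(1, steps + 1)
    ]


def move_smoothly(
    set_position: Callable[[Point], None],
    start: Point,
    end: Point,
    steps: int = 20,
    delay_s: float = 0.005,  # 5 ms between steps, as described above
    sleep: Callable[[float], None] = time.sleep,
) -> None:
    """Walk the pointer along the interpolated path, pausing between steps."""
    for point in interpolate_path(start, end, steps):
        set_position(point)
        sleep(delay_s)
```

With these defaults a full move takes roughly 20 × 5 ms = 100 ms, plus whatever time the backend needs per position update.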
When you say "Generate a quiz":
- Capture: Takes screenshot of current screen
- Analyze: Sends to Gemini 2.5 Pro with prompt
- Generate: Creates 3 questions:
  - 2 about visible content
  - 1 fun/creative question
- Display: Shows in translucent modal window
  - 95% opacity
  - Always on top
  - Press ESC or click button to close
MODEL = "gemini-2.5-flash-native-audio-preview-09-2025"
SEND_SAMPLE_RATE = 16000 # Input: 16kHz
RECEIVE_SAMPLE_RATE = 24000 # Output: 24kHz
CHUNK_SIZE = 1024 # Audio buffer size
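To get a feel for what these values mean in practice, the short sketch below works out how much audio one buffer holds. It assumes 16-bit mono PCM and that the same buffer size is used in both directions; neither is stated in the configuration above, so treat both as assumptions.

```python
SEND_SAMPLE_RATE = 16000     # Input: 16 kHz
RECEIVE_SAMPLE_RATE = 24000  # Output: 24 kHz
CHUNK_SIZE = 1024            # Frames per audio buffer

BYTES_PER_SAMPLE = 2  # assumption: 16-bit PCM
CHANNELS = 1          # assumption: mono


def chunk_duration_ms(frames: int, sample_rate: int) -> float:
    """Milliseconds of audio contained in one buffer of `frames` frames."""
    return 1000.0 * frames / sample_rate


def chunk_bytes(frames: int, bytes_per_sample: int = BYTES_PER_SAMPLE,
                channels: int = CHANNELS) -> int:
    """Size in bytes of one buffer of `frames` frames."""
    return frames * bytes_per_sample * channels


if __name__ == "__main__":
    print(f"Input chunk:  {chunk_duration_ms(CHUNK_SIZE, SEND_SAMPLE_RATE):.1f} ms")
    print(f"Output chunk: {chunk_duration_ms(CHUNK_SIZE, RECEIVE_SAMPLE_RATE):.1f} ms")
    print(f"Bytes per chunk: {chunk_bytes(CHUNK_SIZE)}")
```

Under these assumptions, each 1024-frame input buffer carries 64 ms of speech in 2048 bytes, which gives a rough lower bound on the capture latency added per chunk.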
The AI follows a strict workflow to ensure reliable operation:
- Always detect coordinates before clicking
- Never guess coordinates
- Move mouse before clicking
- Confirm actions verbally
- Break complex tasks into steps
┌─────────────────────────────────────────────┐
│ User Voice Input (Microphone) │
└──────────────────┬──────────────────────────┘
│
▼
┌─────────────────────────────────────────────┐
│ Gemini 2.5 Flash Live API (WebSocket) │
│ • Speech-to-text │
│ • Natural language understanding │
│ • Tool/function calling │
│ • Text-to-speech │
└──────────────────┬──────────────────────────┘
│
▼
┌─────────────────────────────────────────────┐
│ Tool Execution Layer │
│ • Mouse control (pynput) │
│ • Keyboard control (pynput) │
│ • Screen capture (mss) │
│ • Vision AI (Gemini 2.5 Pro) │
│ • GUI (tkinter) │
└──────────────────┬──────────────────────────┘
│
▼
┌─────────────────────────────────────────────┐
│ macOS System (with permissions) │
│ • Mouse/keyboard events │
│ • Screen capture │
│ • Audio I/O │
└─────────────────────────────────────────────┘
Symptom: Tools execute but AI doesn't speak back
Fix: Check line 951 in test3.py:
await asyncio.sleep(0.05) # This must be uncommented!
Symptoms:
- Clicks happen but mouse doesn't move
- Commands detected but no action
Fixes:
- Check Accessibility Permission:
  uv run python test_permissions.py
- Verify global mouse controller (line 81):
  _MOUSE_CONTROLLER = mouse.Controller()
- Ensure functions use global controller:
  # Correct ✅
  _MOUSE_CONTROLLER.click(mouse.Button.left, count)
  # Wrong ❌
  mouse.Controller().click(mouse.Button.left, count)
Error: CoreGraphics.CGWindowListCreateImage() failed
Fix: Grant Screen Recording permission (see Installation step 3)
Error: ModuleNotFoundError: No module named 'pynput'
Fix:
# Run with uv
uv run python test3.py --mode screen
# Or install dependencies
uv pip install pynput pyautogui mss opencv-python
Cause: Tkinter requires main thread on macOS
Current Behavior: Quiz prints to console if not on main thread (this is safe)
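The fallback described above can be sketched as a main-thread check before any window is created. This is a simplified illustration, not the code from test3.py: `show_quiz` and the `show_modal` callback are hypothetical, with `show_modal` standing in for whatever function builds the Tkinter window.

```python
import threading
from typing import Callable, List


def show_quiz(
    questions: List[str],
    show_modal: Callable[[List[str]], None],
    fallback: Callable[[str], None] = print,
) -> str:
    """Show the quiz in a modal on the main thread, otherwise print it.

    Returns "modal" or "console" so callers can log which path was taken.
    """
    if threading.current_thread() is threading.main_thread():
        show_modal(questions)
        return "modal"
    # Off the main thread: avoid touching Tkinter and print to the console instead.
    for number, question in enumerate(questions, start=1):
        fallback(f"Q{number}: {question}")
    return "console"
```

An alternative design would queue the quiz from the worker thread and let the main thread display it, at the cost of running an event loop there.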
Solution: Use headphones! The script doesn't have echo cancellation.
- Accessibility: Controls mouse/keyboard
- Screen Recording: Captures screen content
- Microphone: Records voice commands
- Internet: Sends audio/video to Google AI
- Voice audio (for speech recognition)
- Screen images (for coordinate detection and quiz generation)
- Text transcripts
- Screenshots saved in screens_* folders
- Quiz screenshots in quiz_screens_* folders
- Audio frames processed in memory (not saved)
- Use Headphones: Prevent audio feedback loops
- Speak Clearly: Better recognition accuracy
- Be Specific: "Click on Discord" works better than "click there"
- Break Down Tasks: Complex tasks work better in steps
- Check Permissions: Run test script if things stop working
- Monitor Console: Watch for tool execution logs
- Use Screen Mode: Better for UI interaction than camera mode
You: "Click on Chrome"
AI: [Detects coordinates]
AI: [Moves mouse]
AI: [Clicks]
AI: "I've clicked on Chrome for you"
You: "Select all this text"
AI: [Presses Cmd+A]
You: "Copy it"
AI: [Presses Cmd+C]
You: "Now paste it in the search box"
AI: [Finds search box, clicks, presses Cmd+V]
You: "Type my email address"
AI: [Types text]
You: "Press tab"
AI: [Moves to next field]
You: "Type my password"
AI: [Types text]
You: "Press enter"
AI: [Submits form]
- test_permissions.py - Test macOS permissions
- test2.py - Earlier version (basic features)
- BUGFIX_SUMMARY.md - List of bugs fixed
- CRITICAL_FIXES.md - Technical details of fixes
- NEW_FEATURES_SUMMARY.md - New features added
To add new tools:
- Define the function:
def my_new_tool(param1: str, param2: int):
    """Description of what the tool does."""
    # Implementation
    return {"result": "success"}
- Add to func_names_dict:
func_names_dict = {
    # ... existing tools
    "my_new_tool": my_new_tool,
}
- Add to tools declaration:
types.FunctionDeclaration(
    name="my_new_tool",
    description="What the AI needs to know about this tool",
    parameters=types.Schema(
        type=types.Type.OBJECT,
        properties={
            "param1": types.Schema(type=types.Type.STRING),
            # INTEGER matches the `param2: int` annotation on the function
            "param2": types.Schema(type=types.Type.INTEGER),
        },
    ),
),
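Once a tool is registered, an incoming function call from the model has to be routed to the matching Python function. The sketch below shows one plausible way to do that with a name-to-function dictionary like func_names_dict. The `dispatch_tool_call` helper and its error format are assumptions for illustration and may differ from how test3.py handles tool calls.

```python
from typing import Any, Callable, Dict


def my_new_tool(param1: str, param2: int) -> Dict[str, Any]:
    """Placeholder tool matching the example above."""
    return {"result": "success"}


func_names_dict: Dict[str, Callable[..., Dict[str, Any]]] = {
    "my_new_tool": my_new_tool,
}


def dispatch_tool_call(
    name: str,
    args: Dict[str, Any],
    registry: Dict[str, Callable[..., Dict[str, Any]]] = func_names_dict,
) -> Dict[str, Any]:
    """Look up a tool by name and call it with the model-supplied arguments."""
    func = registry.get(name)
    if func is None:
        return {"error": f"unknown tool: {name}"}
    try:
        return func(**args)
    except TypeError as exc:
        # Usually a missing or unexpected argument; note that a TypeError raised
        # inside the tool itself is reported the same way.
        return {"error": f"bad arguments for {name}: {exc}"}
```

Returning an error dictionary instead of raising lets the model hear about the failure and retry or explain it to the user.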
Part of the GEMINI_HACK project.
- Google Gemini 2.5 Flash Live API
- Google Gemini 2.5 Pro Vision API
- pynput for cross-platform input control
- mss for fast screen capture
- PyAudio for audio streaming
Issues? Check troubleshooting section above or run:
uv run python test_permissions.py
Status: ✅ All features tested and working on macOS
Version: 1.0
Last Updated: October 19, 2025