Autonomous web agent that uses DeepSeek R1 for reasoning and LLaVA for vision to navigate and interact with websites.
- Vision-based navigation: Uses LLaVA 7B to analyze screenshots
- Chain-of-thought reasoning: DeepSeek R1:8b reasons step by step to choose the next action
- Autonomous execution: Performs clicks, form fills, and navigation automatically
- Cookie extraction: Extracts authentication cookies for automation
- GPU optimized: Runs on RTX 3090 24GB VRAM
- NVIDIA GPU with 24GB VRAM (tested on RTX 3090)
- Ollama installed and running
- Python 3.10+
- Playwright for browser automation
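The requirements above can be checked before the first run. The following is a minimal sketch, not part of the repository: it assumes Ollama is listening on its default local port (11434) and uses its `/api/tags` endpoint to list pulled models. The function names `python_ok`, `missing_models`, and `fetch_tags` are hypothetical helpers introduced for illustration.

```python
import json
import sys
import urllib.request

# Default models named in this README.
REQUIRED_MODELS = ["deepseek-r1:8b", "llava:7b"]


def python_ok(version=sys.version_info, minimum=(3, 10)):
    """Return True when the interpreter meets the Python 3.10+ requirement."""
    return tuple(version[:2]) >= minimum


def missing_models(tags_payload, required=REQUIRED_MODELS):
    """Return the required models that are absent from an /api/tags payload."""
    installed = {m.get("name", "") for m in tags_payload.get("models", [])}
    return [name for name in required if name not in installed]


def fetch_tags(base_url="http://localhost:11434", timeout=5):
    """Ask a running Ollama server which models it has pulled.

    Assumes the default Ollama address; raises URLError if the server is down.
    """
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
        return json.load(resp)
```

Calling `missing_models(fetch_tags())` while `ollama serve` is running should return an empty list once both models have been pulled.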
# 1. Clone repo
git clone https://github.com/YOUR_USERNAME/web-agent.git
cd web-agent
# 2. Run setup script
./setup-ollama-agent.sh
# 3. Start Ollama (in separate terminal)
ollama serve

./run-web-agent-turboscribe.sh

This will:
- Download DeepSeek R1:8b (~8GB) and LLaVA 7B (~4.5GB)
- Open browser and navigate to turboscribe.ai
- Use vision to detect "Sign in with Google" button
- Wait for you to complete Google OAuth manually
- Extract and save cookies to turboscribe-mcp/cookies.json
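The last two steps (manual login, then cookie extraction) could look roughly like the sketch below. It is an illustration using Playwright's sync API, not the code of `run-web-agent-turboscribe.sh`; the helpers `save_cookies` and `extract_cookies` are hypothetical names, and the button detection by the vision model is left out.

```python
import json
from pathlib import Path


def save_cookies(cookies, path):
    """Write a list of cookie dicts to a JSON file, creating parent folders."""
    target = Path(path)
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(json.dumps(cookies, indent=2), encoding="utf-8")
    return target


def extract_cookies(url, out_path, wait_for_login=input):
    """Open a visible browser, pause for manual login, then save the cookies."""
    # Imported lazily so save_cookies() works without Playwright installed.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=False)
        context = browser.new_context()
        page = context.new_page()
        page.goto(url)
        # Block until the user confirms that the OAuth flow is finished.
        wait_for_login("Complete the Google login in the browser, then press Enter: ")
        cookies = context.cookies()  # list of dicts (name, value, domain, expires, ...)
        browser.close()
    return save_cookies(cookies, out_path)
```

For example, `extract_cookies("https://turboscribe.ai", "turboscribe-mcp/cookies.json")` would write the session cookies to the path mentioned above.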
python3 ollama-web-agent-reasoning.py \
--task "Search for 'AI reasoning models' on Google" \
--url "https://google.com" \
--max-steps 10

┌─────────────────────────────────────────────────┐
│  1. Screenshot capture                          │
└────────────────────┬────────────────────────────┘
                     ↓
┌─────────────────────────────────────────────────┐
│  2. Vision Analysis (LLaVA 7B)                  │
│     "I see a blue button labeled 'Sign in'      │
│      at coordinates (640, 200)"                 │
└────────────────────┬────────────────────────────┘
                     ↓
┌─────────────────────────────────────────────────┐
│  3. Reasoning (DeepSeek R1:8b)                  │
│     💭 "I need to login"                        │
│     💭 "I see a sign in button"                 │
│     💭 "Best action: click"                     │
│     ✅ Decision: {"action": "click", ...}       │
└────────────────────┬────────────────────────────┘
                     ↓
┌─────────────────────────────────────────────────┐
│  4. Execution (Playwright)                      │
│     🖱️ Clicks button at (640, 200)              │
└─────────────────────────────────────────────────┘
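The four stages in the diagram can be sketched as a loop. This is a simplified illustration under stated assumptions, not the code in `ollama-web-agent-reasoning.py`: the stage functions are passed in as callables, the `"done"` and `"wait"` actions are hypothetical, and the decision is assumed to arrive as a JSON object after an optional `<think>...</think>` block, which is how DeepSeek R1 models usually emit their reasoning.

```python
import json
import re


def parse_decision(raw):
    """Extract the JSON action from a reasoning model's reply.

    The chain of thought inside <think> tags is removed first, then the
    remaining text is searched for a JSON object.
    """
    answer = re.sub(r"<think>.*?</think>", "", raw, flags=re.DOTALL)
    match = re.search(r"\{.*\}", answer, flags=re.DOTALL)
    if not match:
        return {"action": "wait"}  # hypothetical fallback when no JSON is found
    try:
        return json.loads(match.group(0))
    except json.JSONDecodeError:
        return {"action": "wait"}


def run_agent(task, capture, describe, decide, execute, max_steps=15):
    """Run the four-stage loop until the model reports 'done' or steps run out."""
    history = []
    for _ in range(max_steps):
        screenshot = capture()                   # 1. screenshot capture
        observation = describe(screenshot)       # 2. vision analysis
        reply = decide(task, observation, history)
        decision = parse_decision(reply)         # 3. reasoning
        history.append(decision)
        if decision.get("action") == "done":     # hypothetical stop signal
            break
        execute(decision)                        # 4. execution
    return history
```

In a real run, `describe` might send the base64-encoded screenshot to Ollama's `/api/generate` endpoint through the `images` field, and `execute` might map a click decision to Playwright's `page.mouse.click(x, y)`. Keeping the stages as plain callables makes the loop easy to exercise with fakes before any model is loaded.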
- Vision: LLaVA 7B (~4.5GB VRAM)
- Reasoning: DeepSeek R1:8b (~8GB VRAM)
- Total: ~12.5GB VRAM (fits comfortably in 24GB)
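The budget above is simple addition, and the same arithmetic can be reused when swapping models. A small sketch, using the approximate per-model figures quoted in this README (the `vram_budget` helper is hypothetical):

```python
# Approximate VRAM figures quoted in this README, in GB.
MODEL_VRAM_GB = {"llava:7b": 4.5, "deepseek-r1:8b": 8.0}


def vram_budget(models, available_gb=24.0, table=MODEL_VRAM_GB):
    """Return (total GB, fraction of available VRAM, whether it fits)."""
    total = sum(table[name] for name in models)
    return total, total / available_gb, total <= available_gb


total, fraction, fits = vram_budget(["llava:7b", "deepseek-r1:8b"])
print(f"{total} GB, {fraction:.0%} of 24 GB, fits: {fits}")
```

These numbers cover model weights only as stated here; actual usage also depends on context length and whatever else is running on the GPU.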
python3 ollama-web-agent-reasoning.py \
--task "Your task description" \
--url "https://example.com" \
--reasoning-model "deepseek-r1:8b" \
--vision-model "llava:7b" \
--max-steps 20 \
--save-cookies "/path/to/cookies.json" \
--headless  # Run without visible browser

| Argument | Description | Default |
|---|---|---|
| --task | Task description for the agent | Required |
| --url | Starting URL | https://turboscribe.ai |
| --reasoning-model | Ollama reasoning model | deepseek-r1:8b |
| --vision-model | Ollama vision model | llava:7b |
| --max-steps | Maximum steps to execute | 15 |
| --save-cookies | Path to save cookies JSON | None |
| --headless | Run browser in headless mode | False |
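For readers who want to see how these flags and defaults fit together, here is a sketch of an equivalent `argparse` setup. It mirrors the table above but is an assumption about how such a parser might be written, not a copy of the script's actual code.

```python
import argparse


def build_parser():
    """Build a parser whose flags and defaults mirror the argument table."""
    parser = argparse.ArgumentParser(description="Vision + reasoning web agent")
    parser.add_argument("--task", required=True,
                        help="Task description for the agent")
    parser.add_argument("--url", default="https://turboscribe.ai",
                        help="Starting URL")
    parser.add_argument("--reasoning-model", default="deepseek-r1:8b",
                        help="Ollama reasoning model")
    parser.add_argument("--vision-model", default="llava:7b",
                        help="Ollama vision model")
    parser.add_argument("--max-steps", type=int, default=15,
                        help="Maximum steps to execute")
    parser.add_argument("--save-cookies", default=None,
                        help="Path to save cookies JSON")
    parser.add_argument("--headless", action="store_true",
                        help="Run browser in headless mode")
    return parser


# Only --task is mandatory; everything else falls back to its default.
args = build_parser().parse_args(["--task", "Your task description"])
print(args.url, args.max_steps, args.headless)
```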
python3 ollama-web-agent-reasoning.py \
--task "Login to example.com using Google OAuth" \
--url "https://example.com/login"

python3 ollama-web-agent-reasoning.py \
--task "Fill contact form with name 'John Doe' and email 'john@example.com'" \
--url "https://example.com/contact"

python3 ollama-web-agent-reasoning.py \
--task "Search for 'Claude AI' and click first result" \
--url "https://google.com"

| File | Description |
|---|---|
| ollama-web-agent-reasoning.py | Main agent with DeepSeek R1 + LLaVA |
| ollama-web-agent.py | Simple version (single model) |
| run-web-agent-turboscribe.sh | TurboScribe cookie extraction script |
| setup-ollama-agent.sh | Setup script for dependencies |
| extract-turboscribe-cookies.py | Alternative Playwright-based extractor |
| GUIA-WEB-AGENT.md | Complete guide (Spanish) |
# Start Ollama in separate terminal
ollama serve

# Download models manually
ollama pull deepseek-r1:8b
ollama pull llava:7b

pip3 install playwright httpx
python3 -m playwright install chromium

# Check GPU usage
nvidia-smi
# Use smaller models
python3 ollama-web-agent-reasoning.py \
--task "Your task description" \
--reasoning-model deepseek-r1:1.5b \
--vision-model llava:7b

- First run: ~10-30 min (downloads ~12.5GB models)
- Subsequent runs: ~2-5 min per task
- Per step: ~5-10 seconds (vision + reasoning + execution)
- VRAM usage: ~12.5GB / 24GB (52%)
- First time: Run without --headless to see how it works
- Google OAuth: Agent detects button but you complete login manually
- Debugging: Agent shows detailed reasoning for each step
- Cookie persistence: Saved cookies work for ~30 days
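Since saved cookies eventually expire, it can help to inspect the saved file before reusing it. The sketch below assumes the file holds cookies in the format returned by Playwright's `context.cookies()`, where `expires` is a Unix timestamp in seconds and `-1` marks a session cookie. The `cookie_status` helper is hypothetical.

```python
import time


def cookie_status(cookies, now=None):
    """Group cookie names into session, expired, and still-valid buckets."""
    now = time.time() if now is None else now
    report = {"session": [], "expired": [], "valid": []}
    for cookie in cookies:
        expires = cookie.get("expires", -1)
        if expires is None or expires < 0:
            report["session"].append(cookie["name"])   # no fixed expiry
        elif expires <= now:
            report["expired"].append(cookie["name"])   # needs a fresh login
        else:
            report["valid"].append(cookie["name"])
    return report
```

Loading the file with `json.load` and passing the list to `cookie_status` shows whether a new manual login is needed.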
# More capable but slower
ollama pull deepseek-r1:14b  # Requires 14GB VRAM
python3 ollama-web-agent-reasoning.py --task "Your task description" --reasoning-model deepseek-r1:14b

# More accurate vision
ollama pull llava:13b  # Requires 8GB VRAM
python3 ollama-web-agent-reasoning.py --task "Your task description" --vision-model llava:13b

- Complete Guide - Detailed documentation (Spanish)
- Ollama Documentation - Ollama setup and models
- Playwright Documentation - Browser automation
- DeepSeek R1 - Reasoning model info
- ✅ Cookie extraction for automation
- ✅ Form filling and submission
- ✅ OAuth login flows
- ✅ Web scraping with authentication
- ✅ E2E testing with AI reasoning
- ✅ Social media automation
- ✅ Data entry automation
- Google OAuth requires manual completion (anti-bot protection)
- CAPTCHAs cannot be solved automatically
- Complex SPAs may need more steps
- Rate limiting on some websites
MIT License - See LICENSE file for details
- DeepSeek AI - DeepSeek R1 reasoning model
- Haotian Liu - LLaVA vision model
- Ollama - Local LLM inference
- Microsoft - Playwright browser automation
Created with Claude Code 🤖
Generated: 2025-11-16