Skip to content

Autonomous web agent using DeepSeek R1 reasoning + LLaVA vision for browser automation and cookie extraction

License

Notifications You must be signed in to change notification settings

larancibia/011-BrowserBot

Repository files navigation

๐Ÿค– Web Agent with DeepSeek R1 + Vision

Autonomous web agent that uses DeepSeek R1 for reasoning and LLaVA for vision to navigate and interact with websites.

๐ŸŽฏ Features

  • Vision-based navigation: Uses LLaVA 7B to analyze screenshots
  • Chain-of-thought reasoning: DeepSeek R1:8b decides actions intelligently
  • Autonomous execution: Executes clicks, form fills, navigation automatically
  • Cookie extraction: Extracts authentication cookies for automation
  • GPU optimized: Runs on RTX 3090 24GB VRAM

๐Ÿš€ Quick Start

Prerequisites

  • NVIDIA GPU with 24GB VRAM (tested on RTX 3090)
  • Ollama installed and running
  • Python 3.10+
  • Playwright for browser automation

Installation

# 1. Clone repo
git clone https://github.com/YOUR_USERNAME/web-agent.git
cd web-agent

# 2. Run setup script
./setup-ollama-agent.sh

# 3. Start Ollama (in separate terminal)
ollama serve

Usage

Extract TurboScribe Cookies

./run-web-agent-turboscribe.sh

This will:

  1. Download DeepSeek R1:8b (~8GB) and LLaVA 7B (~4.5GB)
  2. Open browser and navigate to turboscribe.ai
  3. Use vision to detect "Sign in with Google" button
  4. Wait for you to complete Google OAuth manually
  5. Extract and save cookies to turboscribe-mcp/cookies.json

Custom Task

python3 ollama-web-agent-reasoning.py \
    --task "Search for 'AI reasoning models' on Google" \
    --url "https://google.com" \
    --max-steps 10

๐Ÿง  How It Works

Two-Stage Architecture

โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
โ”‚  1. Screenshot capture                          โ”‚
โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜
                   โ†“
โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
โ”‚  2. Vision Analysis (LLaVA 7B)                  โ”‚
โ”‚     "I see a blue button labeled 'Sign in'      โ”‚
โ”‚      at coordinates (640, 200)"                 โ”‚
โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜
                   โ†“
โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
โ”‚  3. Reasoning (DeepSeek R1:8b)                  โ”‚
โ”‚     ๐Ÿ’ญ "I need to login"                         โ”‚
โ”‚     ๐Ÿ’ญ "I see a sign in button"                  โ”‚
โ”‚     ๐Ÿ’ญ "Best action: click"                      โ”‚
โ”‚     โ†’ Decision: {"action": "click", ...}        โ”‚
โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜
                   โ†“
โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
โ”‚  4. Execution (Playwright)                      โ”‚
โ”‚     ๐Ÿ–ฑ๏ธ Clicks button at (640, 200)              โ”‚
โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜

Models

  • Vision: LLaVA 7B (~4.5GB VRAM)
  • Reasoning: DeepSeek R1:8b (~8GB VRAM)
  • Total: ~12.5GB VRAM (fits comfortably in 24GB)

๐Ÿ“Š Command Line Options

python3 ollama-web-agent-reasoning.py \
    --task "Your task description" \
    --url "https://example.com" \
    --reasoning-model "deepseek-r1:8b" \
    --vision-model "llava:7b" \
    --max-steps 20 \
    --save-cookies "/path/to/cookies.json" \
    --headless  # Run without visible browser

Available Arguments

Argument Description Default
--task Task description for the agent Required
--url Starting URL https://turboscribe.ai
--reasoning-model Ollama reasoning model deepseek-r1:8b
--vision-model Ollama vision model llava:7b
--max-steps Maximum steps to execute 15
--save-cookies Path to save cookies JSON None
--headless Run browser in headless mode False

๐ŸŽฎ Example Tasks

Login to Website

python3 ollama-web-agent-reasoning.py \
    --task "Login to example.com using Google OAuth" \
    --url "https://example.com/login"

Fill Form

python3 ollama-web-agent-reasoning.py \
    --task "Fill contact form with name 'John Doe' and email 'john@example.com'" \
    --url "https://example.com/contact"

Search and Navigate

python3 ollama-web-agent-reasoning.py \
    --task "Search for 'Claude AI' and click first result" \
    --url "https://google.com"

๐Ÿ“ Files

File Description
ollama-web-agent-reasoning.py Main agent with DeepSeek R1 + LLaVA
ollama-web-agent.py Simple version (single model)
run-web-agent-turboscribe.sh TurboScribe cookie extraction script
setup-ollama-agent.sh Setup script for dependencies
extract-turboscribe-cookies.py Alternative Playwright-based extractor
GUIA-WEB-AGENT.md Complete guide (Spanish)

๐Ÿ”ง Troubleshooting

"Error connecting to Ollama"

# Start Ollama in separate terminal
ollama serve

"Model not found"

# Download models manually
ollama pull deepseek-r1:8b
ollama pull llava:7b

"Playwright not installed"

pip3 install playwright httpx
python3 -m playwright install chromium

VRAM issues

# Check GPU usage
nvidia-smi

# Use smaller models
python3 ollama-web-agent-reasoning.py \
    --reasoning-model deepseek-r1:1.5b \
    --vision-model llava:7b

๐ŸŽฏ Performance

  • First run: ~10-30 min (downloads ~12.5GB models)
  • Subsequent runs: ~2-5 min per task
  • Per step: ~5-10 seconds (vision + reasoning + execution)
  • VRAM usage: ~12.5GB / 24GB (52%)

๐Ÿ’ก Tips

  1. First time: Don't use --headless to see how it works
  2. Google OAuth: Agent detects button but you complete login manually
  3. Debugging: Agent shows detailed reasoning for each step
  4. Cookie persistence: Saved cookies work for ~30 days

๐Ÿ†š Alternative Models

Larger reasoning models

# More capable but slower
ollama pull deepseek-r1:14b  # Requires 14GB VRAM
python3 ollama-web-agent-reasoning.py --reasoning-model deepseek-r1:14b

Better vision models

# More accurate vision
ollama pull llava:13b  # Requires 8GB VRAM
python3 ollama-web-agent-reasoning.py --vision-model llava:13b

๐Ÿ“š Documentation

๐Ÿค Use Cases

  • โœ… Cookie extraction for automation
  • โœ… Form filling and submission
  • โœ… OAuth login flows
  • โœ… Web scraping with authentication
  • โœ… E2E testing with AI reasoning
  • โœ… Social media automation
  • โœ… Data entry automation

โš ๏ธ Limitations

  • Google OAuth requires manual completion (anti-bot protection)
  • CAPTCHAs cannot be solved automatically
  • Complex SPAs may need more steps
  • Rate limiting on some websites

๐Ÿ“„ License

MIT License - See LICENSE file for details

๐Ÿ™ Credits

  • DeepSeek AI - DeepSeek R1 reasoning model
  • Haotian Liu - LLaVA vision model
  • Ollama - Local LLM inference
  • Microsoft - Playwright browser automation

Created with Claude Code ๐Ÿค–

Generated: 2025-11-16

About

Autonomous web agent using DeepSeek R1 reasoning + LLaVA vision for browser automation and cookie extraction

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Contributors 2

  •  
  •