[FEATURE] Add scroll skill for page scrolling and element positioning #39

@edenreich

Description

Summary

Add a dedicated scroll skill to enable fine-grained control over page and element scrolling. While Playwright automatically scrolls elements into view before interactions, many automation scenarios require explicit scroll control for triggering dynamic content loading, positioning elements for screenshots, and handling infinite scroll patterns.

Motivation

Current Limitations

  • No way to trigger scroll-based events (infinite scroll, lazy loading)
  • Cannot position page before taking screenshots
  • Unable to scroll without interacting with elements
  • No support for progressive content loading patterns

Use Cases

  1. Infinite Scroll / Lazy Loading

    • Scroll to bottom to trigger more content loading
    • Load images that only appear when scrolled into view
    • Trigger scroll-based animations and transitions
  2. Screenshot Positioning

    • Scroll to specific sections before capturing
    • Position elements optimally in viewport
    • Take focused screenshots of specific page areas
    • Enable smaller, more efficient screenshot files for vision models
  3. Multi-Page Navigation

    • Reset scroll position to top when navigating between pages
    • Ensure consistent starting position for automation flows
  4. Data Extraction

    • Scroll through entire page to ensure all dynamic content is loaded
    • Trigger rendering of lazy-loaded elements before extraction
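The infinite-scroll and data-extraction cases above share one loop: scroll to the bottom, check whether the page grew, and stop once it no longer does. Below is a minimal sketch of that loop. The `scroller` interface, `scrollUntilStable`, and `fakePage` are hypothetical names introduced only for illustration; a real implementation would back the interface with browser calls and would likely wait for content to load between rounds.

```go
package main

import "fmt"

// scroller is a hypothetical abstraction over a browser page.
type scroller interface {
	ScrollToBottom() error
	ScrollHeight() (int, error)
}

// scrollUntilStable scrolls to the bottom until the page height stops
// growing or maxRounds is reached. It returns the number of rounds
// in which the page grew.
func scrollUntilStable(s scroller, maxRounds int) (int, error) {
	prev, err := s.ScrollHeight()
	if err != nil {
		return 0, err
	}
	for i := 0; i < maxRounds; i++ {
		if err := s.ScrollToBottom(); err != nil {
			return i, err
		}
		h, err := s.ScrollHeight()
		if err != nil {
			return i, err
		}
		if h == prev {
			return i, nil // nothing new was loaded
		}
		prev = h
	}
	return maxRounds, nil
}

// fakePage simulates a page that loads 1000px of content per scroll
// until loadsLeft reaches zero.
type fakePage struct {
	height    int
	loadsLeft int
}

func (p *fakePage) ScrollToBottom() error {
	if p.loadsLeft > 0 {
		p.height += 1000
		p.loadsLeft--
	}
	return nil
}

func (p *fakePage) ScrollHeight() (int, error) { return p.height, nil }

func main() {
	page := &fakePage{height: 1000, loadsLeft: 3}
	rounds, err := scrollUntilStable(page, 10)
	fmt.Println(rounds, page.height, err)
}
```

The `maxRounds` cap is a design choice to keep the loop bounded on pages that never stop loading.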

Proposed API

- id: scroll
  name: scroll
  description: Scroll the page or element to a specific position or into view
  schema:
    type: object
    properties:
      target:
        type: string
        description: "What to scroll: 'page', 'element', or 'coordinates'"
        enum: [page, element, coordinates]
      selector:
        type: string
        description: "Element selector (required if target=element)"
      behavior:
        type: string
        description: "Scroll behavior: 'smooth' or 'instant'"
        enum: [smooth, instant]
        default: smooth
      block:
        type: string
        description: "Vertical alignment: 'start', 'center', 'end', 'nearest'"
        enum: [start, center, end, nearest]
        default: start
      inline:
        type: string
        description: "Horizontal alignment: 'start', 'center', 'end', 'nearest'"
        enum: [start, center, end, nearest]
        default: nearest
      x:
        type: integer
        description: "X coordinate for scrolling (required if target=coordinates)"
      y:
        type: integer
        description: "Y coordinate for scrolling (required if target=coordinates)"
      direction:
        type: string
        description: "Direction to scroll: 'up', 'down', 'left', 'right', 'top', 'bottom'"
        enum: [up, down, left, right, top, bottom]
      amount:
        type: integer
        description: "Amount to scroll in pixels (for directional scrolling)"
    required:
      - target

Example Usage

// Scroll to bottom to trigger infinite scroll
scroll({ target: "page", direction: "bottom" })

// Scroll element into view before screenshot
scroll({ target: "element", selector: "#product-gallery", block: "center" })

// Reset to top of page
scroll({ target: "page", direction: "top" })

// Scroll down 500px to load lazy images
scroll({ target: "page", direction: "down", amount: 500 })

// Scroll to specific coordinates
scroll({ target: "coordinates", x: 0, y: 1000 })
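One way the calls above could be carried out is by translating them into standard DOM scroll calls (`window.scrollTo`, `window.scrollBy`, `Element.scrollIntoView`) and evaluating the resulting expression in the page, for example through the browser driver's evaluate call. The sketch below only builds the expression strings; `pageScrollExpr` and `elementScrollExpr` are hypothetical helpers, and this is not necessarily how the skill will be implemented.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// pageScrollExpr builds a JavaScript expression for a directional
// page scroll. JSON encoding is used to quote the behavior safely.
func pageScrollExpr(direction string, amount int, behavior string) (string, error) {
	b, err := json.Marshal(behavior)
	if err != nil {
		return "", err
	}
	switch direction {
	case "top":
		return fmt.Sprintf("window.scrollTo({top: 0, behavior: %s})", b), nil
	case "bottom":
		return fmt.Sprintf(
			"window.scrollTo({top: document.documentElement.scrollHeight, behavior: %s})", b), nil
	case "down":
		return fmt.Sprintf("window.scrollBy({top: %d, left: 0, behavior: %s})", amount, b), nil
	case "up":
		return fmt.Sprintf("window.scrollBy({top: %d, left: 0, behavior: %s})", -amount, b), nil
	case "right":
		return fmt.Sprintf("window.scrollBy({top: 0, left: %d, behavior: %s})", amount, b), nil
	case "left":
		return fmt.Sprintf("window.scrollBy({top: 0, left: %d, behavior: %s})", -amount, b), nil
	}
	return "", fmt.Errorf("invalid direction %q", direction)
}

// elementScrollExpr builds a scrollIntoView call for a CSS selector.
// The selector is JSON-encoded so quotes inside it cannot break out
// of the string literal.
func elementScrollExpr(selector, behavior, block, inline string) (string, error) {
	sel, err := json.Marshal(selector)
	if err != nil {
		return "", err
	}
	opts, err := json.Marshal(map[string]string{
		"behavior": behavior,
		"block":    block,
		"inline":   inline,
	})
	if err != nil {
		return "", err
	}
	return fmt.Sprintf("document.querySelector(%s).scrollIntoView(%s)", sel, opts), nil
}

func main() {
	expr, err := pageScrollExpr("down", 500, "smooth")
	fmt.Println(expr, err)
	expr, err = elementScrollExpr("#product-gallery", "smooth", "center", "nearest")
	fmt.Println(expr, err)
}
```

Note that `document.querySelector` returns `null` when nothing matches, so an implementation along these lines would need to check for a missing element before calling `scrollIntoView`.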

Acceptance Criteria

  • Add scroll skill definition to agent.yaml
  • Implement scroll skill in skills/scroll.go with support for:
    • Page scrolling (top, bottom, directional)
    • Element scrolling (scroll element into view with alignment options)
    • Coordinate-based scrolling
    • Smooth vs instant behavior
  • Add comprehensive tests in skills/scroll_test.go
  • Update demo site examples in example/README.md to demonstrate scroll usage
  • Document integration with screenshot skill for optimal positioning
  • Handle edge cases:
    • Invalid selectors
    • Non-scrollable elements/pages
    • Out-of-bounds coordinates
  • Run task generate to regenerate codebase from updated agent.yaml
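For the out-of-bounds and non-scrollable edge cases listed above, one possible policy is to clamp the requested coordinates to the scrollable range instead of returning an error. The sketch below assumes that policy; `clampScroll` and its parameters are hypothetical, and the issue does not specify whether clamping or an error is the intended behavior.

```go
package main

import "fmt"

// clamp restricts v to the range [lo, hi].
func clamp(v, lo, hi int) int {
	if v < lo {
		return lo
	}
	if v > hi {
		return hi
	}
	return v
}

// clampScroll limits a requested scroll position to what the page can
// actually reach. scrollW/scrollH are the full content size and
// viewW/viewH the viewport size, all in pixels.
func clampScroll(x, y, scrollW, scrollH, viewW, viewH int) (int, int) {
	maxX := scrollW - viewW
	if maxX < 0 {
		maxX = 0 // content narrower than viewport: not scrollable
	}
	maxY := scrollH - viewH
	if maxY < 0 {
		maxY = 0 // content shorter than viewport: not scrollable
	}
	return clamp(x, 0, maxX), clamp(y, 0, maxY)
}

func main() {
	// Request y=5000 on a 3000px-tall page with a 720px viewport.
	x, y := clampScroll(0, 5000, 1280, 3000, 1280, 720)
	fmt.Println(x, y) // 0 2280
}
```

On a non-scrollable page both maxima are zero, so any request resolves to `(0, 0)`.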

Benefits for Future Vision Integration

Once the inference gateway SDK supports vision/multimodal content (PR in progress), the scroll skill will enable:

  • Self-aware workflows: The agent can scroll to position content, take a screenshot, and analyze it with a vision model
  • Progressive screenshot capture: Break long pages into viewport-sized chunks for better LLM comprehension
  • Targeted visual validation: Scroll to specific sections before visual analysis
  • Smaller file sizes: Capture focused areas instead of full-page screenshots (5-50 MB → 500 KB)
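The progressive capture idea above amounts to computing a list of vertical scroll offsets, one per viewport-sized chunk, and taking a screenshot at each. A minimal sketch of that arithmetic follows; `chunkOffsets` is a hypothetical helper, and aligning the final chunk with the page bottom (so it may overlap the previous one) is an assumed design choice.

```go
package main

import "fmt"

// chunkOffsets returns the y offsets to scroll to so that consecutive
// viewport-sized screenshots cover the whole page. The last offset is
// aligned with the bottom of the page and may overlap the previous chunk.
func chunkOffsets(pageHeight, viewportHeight int) []int {
	if pageHeight <= 0 || viewportHeight <= 0 {
		return nil
	}
	if pageHeight <= viewportHeight {
		return []int{0} // the whole page fits in one screenshot
	}
	last := pageHeight - viewportHeight
	var offsets []int
	for y := 0; y < last; y += viewportHeight {
		offsets = append(offsets, y)
	}
	return append(offsets, last)
}

func main() {
	// A 2000px page with a 720px viewport needs three captures.
	fmt.Println(chunkOffsets(2000, 720)) // [0 720 1280]
}
```

Each offset could then be passed to the skill as `scroll({ target: "coordinates", x: 0, y: offset })` before a screenshot is taken.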

Related

  • Complements existing take_screenshot skill
  • Enables better extract_data workflows for dynamic content
  • Foundation for future vision-based automation capabilities

Metadata

Assignees

No one assigned

    Labels

    enhancement: New feature or request
