Summary
Add a dedicated scroll skill to enable fine-grained control over page and element scrolling. While Playwright automatically scrolls elements into view before interactions, many automation scenarios require explicit scroll control for triggering dynamic content loading, positioning elements for screenshots, and handling infinite scroll patterns.
Motivation
Current Limitations
- No way to trigger scroll-based events (infinite scroll, lazy loading)
- Cannot position page before taking screenshots
- Unable to scroll without interacting with elements
- No support for progressive content loading patterns
Use Cases
- Infinite Scroll / Lazy Loading
  - Scroll to bottom to trigger more content loading
  - Load images that only appear when scrolled into view
  - Trigger scroll-based animations and transitions
- Screenshot Positioning
  - Scroll to specific sections before capturing
  - Position elements optimally in viewport
  - Take focused screenshots of specific page areas
  - Enable smaller, more efficient screenshot files for vision models
- Multi-Page Navigation
  - Reset scroll position to top when navigating between pages
  - Ensure consistent starting position for automation flows
- Data Extraction
  - Scroll through entire page to ensure all dynamic content is loaded
  - Trigger rendering of lazy-loaded elements before extraction
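The infinite-scroll use case boils down to "scroll to the bottom until nothing new appears". Below is a minimal sketch of that loop. The `Page` interface, `ScrollUntilStable`, and `fakePage` are hypothetical names introduced for illustration only; they are not part of Playwright or of this codebase. A real implementation would also need to wait for network or rendering activity between scrolls, which this sketch leaves out.

```go
package main

import "fmt"

// Page is a hypothetical abstraction over the browser page (not a
// Playwright type). A real skill would back it with the browser driver.
type Page interface {
	ScrollToBottom() error
	ContentHeight() (int, error)
}

// ScrollUntilStable scrolls to the bottom repeatedly until the content
// height stops growing or maxRounds is reached. It returns the number of
// scrolls performed.
func ScrollUntilStable(p Page, maxRounds int) (int, error) {
	prev, err := p.ContentHeight()
	if err != nil {
		return 0, err
	}
	for round := 1; round <= maxRounds; round++ {
		if err := p.ScrollToBottom(); err != nil {
			return round - 1, err
		}
		cur, err := p.ContentHeight()
		if err != nil {
			return round, err
		}
		if cur <= prev {
			return round, nil // no new content was loaded
		}
		prev = cur
	}
	return maxRounds, nil
}

// fakePage simulates a feed that grows by 1000px per scroll up to a limit.
type fakePage struct{ height, limit int }

func (f *fakePage) ScrollToBottom() error {
	if f.height < f.limit {
		f.height += 1000
	}
	return nil
}

func (f *fakePage) ContentHeight() (int, error) { return f.height, nil }

func main() {
	p := &fakePage{height: 1000, limit: 4000}
	rounds, err := ScrollUntilStable(p, 10)
	fmt.Println(rounds, p.height, err)
}
```

The `maxRounds` cap is a design choice to keep truly endless feeds from looping forever.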
Proposed API
- id: scroll
  name: scroll
  description: Scroll the page or element to a specific position or into view
  schema:
    type: object
    properties:
      target:
        type: string
        description: "What to scroll: 'page', 'element', or 'coordinates'"
        enum: [page, element, coordinates]
      selector:
        type: string
        description: "Element selector (required if target=element)"
      behavior:
        type: string
        description: "Scroll behavior: 'smooth' or 'instant'"
        enum: [smooth, instant]
        default: smooth
      block:
        type: string
        description: "Vertical alignment: 'start', 'center', 'end', 'nearest'"
        enum: [start, center, end, nearest]
        default: start
      inline:
        type: string
        description: "Horizontal alignment: 'start', 'center', 'end', 'nearest'"
        enum: [start, center, end, nearest]
        default: nearest
      x:
        type: integer
        description: "X coordinate for scrolling (if target=coordinates)"
      y:
        type: integer
        description: "Y coordinate for scrolling (if target=coordinates)"
      direction:
        type: string
        description: "Direction to scroll: 'up', 'down', 'left', 'right', 'top', 'bottom'"
        enum: [up, down, left, right, top, bottom]
      amount:
        type: integer
        description: "Amount to scroll in pixels (for directional scrolling)"
    required:
      - target
Example Usage
// Scroll to bottom to trigger infinite scroll
scroll({ target: "page", direction: "bottom" })
// Scroll element into view before screenshot
scroll({ target: "element", selector: "#product-gallery", block: "center" })
// Reset to top of page
scroll({ target: "page", direction: "top" })
// Scroll down 500px to load lazy images
scroll({ target: "page", direction: "down", amount: 500 })
// Scroll to specific coordinates
scroll({ target: "coordinates", x: 0, y: 1000 })
Acceptance Criteria
- Add `scroll` skill definition to `agent.yaml`
- Implement scroll skill in `skills/scroll.go` with support for:
  - Page scrolling (top, bottom, directional)
  - Element scrolling (scroll element into view with alignment options)
  - Coordinate-based scrolling
  - Smooth vs instant behavior
- Add comprehensive tests in `skills/scroll_test.go`
- Update demo site examples in `example/README.md` to demonstrate scroll usage
- Document integration with screenshot skill for optimal positioning
- Handle edge cases:
  - Invalid selectors
  - Non-scrollable elements/pages
  - Out-of-bounds coordinates
- Run `task generate` to regenerate codebase from updated `agent.yaml`
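Several of the edge cases above can be rejected before touching the browser. The sketch below shows one possible up-front validation of the proposed parameters. `ScrollParams` and `Validate` are illustrative names, not existing code. The rules that directional scrolling needs a positive `amount` and that coordinates must be non-negative are assumptions; the schema only marks `target` as required, and an implementation could just as well apply a default amount or clamp coordinates.

```go
package main

import (
	"errors"
	"fmt"
)

// ScrollParams mirrors the proposed schema fields that affect validation.
// The struct is illustrative, not generated code.
type ScrollParams struct {
	Target    string
	Selector  string
	Direction string
	Amount    int
	X, Y      *int // pointers so "not provided" can be told apart from 0
}

// Validate checks parameter combinations before any browser call is made.
func Validate(p ScrollParams) error {
	switch p.Target {
	case "page":
		switch p.Direction {
		case "top", "bottom":
			return nil
		case "up", "down", "left", "right":
			// Assumption: directional scrolling requires an explicit amount.
			if p.Amount <= 0 {
				return errors.New("amount must be a positive number of pixels")
			}
			return nil
		default:
			return fmt.Errorf("invalid direction %q", p.Direction)
		}
	case "element":
		if p.Selector == "" {
			return errors.New("selector is required when target=element")
		}
		return nil
	case "coordinates":
		if p.X == nil || p.Y == nil {
			return errors.New("x and y are required when target=coordinates")
		}
		// Assumption: negative coordinates are rejected rather than clamped.
		if *p.X < 0 || *p.Y < 0 {
			return errors.New("coordinates must be non-negative")
		}
		return nil
	default:
		return fmt.Errorf("invalid target %q", p.Target)
	}
}

func main() {
	x, y := 0, 1000
	fmt.Println(Validate(ScrollParams{Target: "coordinates", X: &x, Y: &y}))
	fmt.Println(Validate(ScrollParams{Target: "element"}))
}
```

Selectors that are syntactically valid but match nothing, and non-scrollable pages, can only be detected at runtime, so they would still need handling inside the skill itself.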
Benefits for Future Vision Integration
Once the inference gateway SDK supports vision/multimodal content (PR in progress), the scroll skill will enable:
- Self-aware workflows: The agent can scroll to position content, take a screenshot, and analyze it with a vision model
- Progressive screenshot capture: Break long pages into viewport-sized chunks for better LLM comprehension
- Targeted visual validation: Scroll to specific sections before visual analysis
- Smaller file sizes: Capture focused areas instead of full-page screenshots (for example, around 500KB instead of 5-50MB)
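The progressive capture idea above comes down to computing a list of scroll offsets, one per viewport-sized chunk. A minimal sketch of that arithmetic follows. `ChunkOffsets` is a hypothetical helper, and the choice to let the last chunk overlap the previous one (instead of scrolling past the end of the page) is an assumption about the desired behavior.

```go
package main

import "fmt"

// ChunkOffsets returns the vertical scroll positions needed to cover a
// page of pageHeight pixels with viewport-sized screenshots. The final
// offset is capped at the maximum scrollable position, so the last chunk
// may overlap the one before it.
func ChunkOffsets(pageHeight, viewportHeight int) []int {
	if pageHeight <= 0 || viewportHeight <= 0 {
		return nil
	}
	maxY := pageHeight - viewportHeight
	if maxY < 0 {
		maxY = 0 // page is shorter than the viewport
	}
	offsets := []int{0}
	for y := viewportHeight; y < pageHeight; y += viewportHeight {
		if y > maxY {
			offsets = append(offsets, maxY)
			break
		}
		offsets = append(offsets, y)
	}
	return offsets
}

func main() {
	// A 2500px page viewed through a 1000px viewport.
	fmt.Println(ChunkOffsets(2500, 1000))
}
```

Each offset could then be passed to a coordinate-based scroll (target "coordinates", x: 0, y: offset) followed by a viewport screenshot.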
Related
- Complements existing `take_screenshot` skill
- Enables better `extract_data` workflows for dynamic content
- Foundation for future vision-based automation capabilities