Web Connector reindexes all documents on each refresh instead of incremental indexing #4567

@ZhipengHe

Description

I am using the MIT community version of Onyx, self-hosted on my own server.

Onyx version: v0.27.0-beta.1

When using the Web Connector with sitemap as the Scrape Method, the system reindexes all pages on each refresh attempt instead of only processing new or changed documents. This leads to unnecessary embedding token usage.

  • The connector is configured to refresh once per day (1440 minutes)
  • Each indexing attempt processes all documents (5400+) instead of just the new or changed ones; see the Total Docs count in the screenshot below
  • Other connector types (Google Drive, Wikipedia) properly perform incremental indexing
  • This behaviour significantly increases embedding token consumption, because I call Cohere embedding models through their API
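
To make the token concern concrete, here is a minimal sketch of the kind of check I would expect before a page is sent for embedding. It is not Onyx code: `content_fingerprint`, `select_changed_pages`, and the idea of a stored fingerprint per URL are all hypothetical names I made up for illustration.

```python
import hashlib


def content_fingerprint(text):
    """Hash the page text after collapsing whitespace, so trivial
    formatting differences do not count as a change."""
    normalized = " ".join(text.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def select_changed_pages(pages, stored):
    """Return (changed_urls, updated_fingerprints).

    pages:  dict mapping URL -> freshly scraped text (hypothetical input)
    stored: dict mapping URL -> fingerprint saved by the previous run
    Only URLs in changed_urls would need to be re-embedded.
    """
    changed = []
    updated = dict(stored)  # copy, so the caller's dict is not mutated
    for url, text in pages.items():
        fingerprint = content_fingerprint(text)
        if stored.get(url) != fingerprint:
            changed.append(url)
            updated[url] = fingerprint
    return changed, updated
```

A check like this still has to fetch every page, so it would not reduce crawl time, but it would avoid paying for embeddings on pages whose text did not change.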

Image

Questions

  1. Is this behavior specific to the sitemap Scrape Method or does it affect all Web Connector methods (recursive, single, sitemap)?
  2. Is there a way to configure the Web Connector to perform true incremental indexing?
  3. If not possible currently, can this feature be implemented to reduce token usage?
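
For the sitemap case specifically, the sitemap protocol has an optional `<lastmod>` element per URL, which could be compared against the time of the last successful indexing attempt. Below is a rough sketch of that idea using only the Python standard library. I do not know how the Web Connector is implemented internally, so `urls_to_reindex`, `parse_lastmod`, and the `last_success` parameter are assumptions for illustration, not existing Onyx functions.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timezone

# Namespace defined by the sitemap protocol (sitemaps.org)
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def parse_lastmod(text):
    """Parse a <lastmod> value such as '2025-01-01' or
    '2025-03-10T08:00:00Z' into a timezone-aware UTC datetime."""
    value = text.strip().replace("Z", "+00:00")
    parsed = datetime.fromisoformat(value)
    if parsed.tzinfo is None:
        # Assumption: treat values without an offset as UTC
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed.astimezone(timezone.utc)


def urls_to_reindex(sitemap_xml, last_success):
    """Return the URLs modified after last_success (an aware datetime).

    URLs without <lastmod> are always returned, because the sitemap
    alone cannot tell us whether they changed.
    """
    root = ET.fromstring(sitemap_xml)
    changed = []
    for url in root.findall("sm:url", SITEMAP_NS):
        loc = url.findtext("sm:loc", default="", namespaces=SITEMAP_NS).strip()
        if not loc:
            continue
        lastmod = url.findtext("sm:lastmod", default=None, namespaces=SITEMAP_NS)
        if lastmod is None or parse_lastmod(lastmod) > last_success:
            changed.append(loc)
    return changed
```

With a daily refresh, `last_success` would be the start time of the previous successful attempt. Sites that omit `<lastmod>` or set it unreliably would still need another signal, such as comparing page content between runs.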
