Web Connector reindexes all documents on each refresh instead of incremental indexing

#### Description

I am using the MIT community version of Onyx, self-hosted on my own server.

Onyx version: v0.27.0-beta.1

When using the Web Connector with sitemap as the Scrape Method, the system reindexes all pages on each refresh attempt instead of only processing new or changed documents. This leads to unnecessary embedding token usage.
- The connector is configured to refresh once per day (1440 minutes)
- Each indexing attempt processes all documents (~5400+) instead of just the new or changed ones (see the Total Docs count in the screenshot below)
- Other connector types (Google Drive, Wikipedia) perform incremental indexing correctly
- This behavior significantly increases embedding token consumption, because I call Cohere embedding models through their paid API

![Image](https://github.com/user-attachments/assets/9465be52-ff99-45ef-af68-09019f88697c)

#### Questions

1. Is this behavior specific to the sitemap Scrape Method or does it affect all Web Connector methods (recursive, single, sitemap)?
2. Is there a way to configure the Web Connector to perform true incremental indexing?
3. If this is not currently possible, could the feature be implemented to reduce token usage?

Web Connector reindexes all documents on each refresh instead of incremental indexing #4567

Description

Questions

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Web Connector reindexes all documents on each refresh instead of incremental indexing #4567

Description

Description

Questions

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions