fix: connector sync and override feature by edwinjosechittilappilly · Pull Request #1663 · langflow-ai/openrag

edwinjosechittilappilly · 2026-05-22T15:09:26Z

Add end-to-end support for filename-based duplicate handling on connector ingests.

Frontend: send a new replace_duplicates flag with connector sync requests, perform a pre-sync duplicate check, and show a DuplicateHandlingDialog that lets users overwrite or skip duplicates when uploading from provider UI.

Backend: propagate replace_duplicates through connector_router, request models, and connector services into the file processors. ConnectorFileProcessor and LangflowConnectorFileProcessor now check whether a filename already exists in the index and either fail the file task or delete the existing document before ingesting when replace_duplicates is true.

Utilities/tests: clean_connector_filename now preserves original spacing/slashes and only enforces MIME-mapped extensions; get_filename_aliases adds underscore/sanitized variants so lookups match connector-indexed names. Add unit tests covering filename dedupe logic and filename alias behavior.
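The filename behavior described above can be sketched roughly as follows. This is a minimal illustration, not the code in `src/utils/file_utils.py`: the `MIME_TO_EXT` table, the empty-input result, and the exact alias set are assumptions made for the example, and the extension-swap aliases mentioned in the tests are not modeled here.

```python
from typing import List, Optional

# Hypothetical MIME-to-extension table; the repository's real mapping
# sits behind get_file_extension and may differ.
MIME_TO_EXT = {
    "application/pdf": ".pdf",
    "text/plain": ".txt",
}


def get_file_extension(mimetype: str) -> Optional[str]:
    """Return the mapped extension for a MIME type, or None if unknown."""
    return MIME_TO_EXT.get(mimetype)


def clean_connector_filename(filename: str, mimetype: str) -> str:
    """Keep the name verbatim; only append a MIME-mapped extension if missing."""
    suffix = get_file_extension(mimetype)
    if suffix is None:
        # Unknown type: leave the filename untouched.
        return filename
    if not filename.lower().endswith(suffix.lower()):
        return filename + suffix
    return filename


def get_filename_aliases(filename: str) -> List[str]:
    """Return the original name plus its underscore-sanitized variant."""
    if not filename:
        return []  # assumption: empty input yields no aliases
    aliases = [filename]
    sanitized = filename.replace(" ", "_").replace("/", "_")
    if sanitized not in aliases:
        aliases.append(sanitized)
    return aliases
```

Under these assumptions, a lookup for `Q1 report/final.pdf` would also try `Q1_report_final.pdf`, which is the form an older connector ingest could have indexed.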

Summary by CodeRabbit

Release Notes

New Features
- Added duplicate file detection during connector sync operations. Users can now see which files already exist and choose to overwrite or skip duplicates via an interactive dialog.
- Introduced replace_duplicates flag to control duplicate handling behavior during file uploads and syncs.
Tests
- Added comprehensive test coverage for filename-based duplicate detection and handling logic.


Replace numeric duplicateCount with a duplicateNames string[] across upload and dropdown flows so the UI can show the actual file names that would be overwritten. The duplicate-handling dialog now accepts duplicateNames, derives an effective count, and lists up to 5 duplicate filenames with an "… and N more" indicator; message labels and button text use the effective count. Toast messages and pending state in upload/[provider]/page.tsx and knowledge-dropdown.tsx were updated to pass and consume duplicateNames and to use duplicateNames.length for counts.
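The "up to 5 names, then a remainder line" rule can be sketched as below. The real component is TSX; Python is used here only to keep the examples in one language, and `summarize_duplicates` and `MAX_VISIBLE` are hypothetical names, not identifiers from the PR.

```python
from typing import List

MAX_VISIBLE = 5  # the description says the dialog lists up to 5 names


def summarize_duplicates(duplicate_names: List[str]) -> List[str]:
    """Return the lines a dialog could render for a list of duplicate names."""
    visible = duplicate_names[:MAX_VISIBLE]
    hidden = len(duplicate_names) - len(visible)
    lines = list(visible)
    if hidden > 0:
        # Collapse everything past the limit into one indicator line.
        lines.append(f"… and {hidden} more")
    return lines
```

With seven duplicate names, this yields the first five names followed by a single "… and 2 more" line.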

coderabbitai · 2026-05-22T15:09:41Z

Walkthrough

This PR implements filename-based duplicate detection during connector file ingestion with optional replacement. It threads a replace_duplicates parameter through the frontend, API, and backend layers; refactors filename utilities to preserve original names while enforcing MIME extensions; implements duplicate detection and conditional deletion in processors; and adds a frontend UI flow for checking duplicates before sync and optionally overwriting them.

Changes

Duplicate-Aware Connector Sync

Layer / File(s)	Summary
API Parameter Contract – replace_duplicates Threading `frontend/app/api/mutations/useSyncConnector.ts`, `src/api/connector_router.py`, `src/api/connectors.py`, `src/connectors/service.py`, `src/connectors/langflow_connector_service.py`	The `replace_duplicates: bool` parameter is added to the request body and threaded through the API router, connector service, and Langflow service to reach the processor layer.
Filename Resolution and Aliasing Utilities `src/utils/file_utils.py`	`clean_connector_filename` now preserves original filenames while enforcing MIME-derived extensions. `get_filename_aliases` generates both the original filename and underscore-sanitized variants to ensure duplicate lookups match files indexed via connector ingestion.
Processor Duplicate Detection and Conditional Replacement `src/models/processors.py`	Both `ConnectorFileProcessor` and `LangflowConnectorFileProcessor` check for existing documents by filename via OpenSearch. When duplicates exist, they either fail the task (if `replace_duplicates=False`) or delete existing records and continue (if `replace_duplicates=True`). Langflow hash-based early-exit logic is removed.
Frontend Upload Page – Duplicate Checking Workflow `frontend/app/upload/[provider]/page.tsx`	The upload page refactors sync initiation into a `submitSync` helper and adds an async `handleSync` that checks each selected non-folder file for duplicates. If duplicates exist, it opens `DuplicateHandlingDialog`; if the user confirms overwrite, all files are submitted with `replace_duplicates=true`; otherwise, only non-duplicate files are submitted.
Frontend Dialog and Dropdown Components `frontend/components/duplicate-handling-dialog.tsx`, `frontend/components/knowledge-dropdown.tsx`	`DuplicateHandlingDialog` gains optional `duplicateNames: string[]` and displays up to 5 duplicate filenames with "… and N more" indicator. `KnowledgeDropdown` refactors folder duplicate handling to track `duplicateNames` array instead of numeric count.
Processor Duplicate Handling Tests `tests/unit/test_connector_processor_filename_dedupe.py`	Unit tests validate filename-based deduplication: collision with `replace_duplicates=False` (task fails), collision with `replace_duplicates=True` (delete-and-overwrite), no collision (ingestion proceeds). Langflow tests also validate hash-collision behavior.
File Utilities Tests `tests/unit/test_file_utils_filename_aliases.py`	Unit tests validate `get_filename_aliases` for empty inputs, plain filenames, extension swaps, spaces/slashes normalization, and combined variants. Tests validate `clean_connector_filename` preserves spaces/slashes, enforces MIME extensions for known types, and leaves unknown types unchanged.
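The processor row above describes a three-way outcome per file. A minimal sketch of that decision, using an in-memory stand-in for the OpenSearch index, might look like this. `FakeIndex` and `process_file` are hypothetical names for illustration; the real processors also update task status, counters, and hashes, which are omitted here.

```python
import asyncio
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class FakeIndex:
    """In-memory stand-in for the search index (illustration only)."""

    docs: Dict[str, bytes] = field(default_factory=dict)  # filename -> content

    async def filename_exists(self, filename: str) -> bool:
        return filename in self.docs

    async def delete_by_filename(self, filename: str) -> None:
        self.docs.pop(filename, None)

    async def ingest(self, filename: str, content: bytes) -> None:
        self.docs[filename] = content


async def process_file(
    index: FakeIndex,
    filename: str,
    content: bytes,
    replace_duplicates: bool = False,
) -> str:
    """Return 'failed' on a blocked collision, otherwise 'indexed'."""
    if await index.filename_exists(filename):
        if not replace_duplicates:
            # Collision without permission to overwrite: fail this file only.
            return "failed"
        # Collision with overwrite enabled: drop the old document first.
        await index.delete_by_filename(filename)
    await index.ingest(filename, content)
    return "indexed"
```

The three unit-test scenarios listed in the table (collision without overwrite, collision with overwrite, no collision) map directly onto the three paths through `process_file`.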

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Suggested labels

enhancement

Suggested reviewers

mfortman11

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 43.59% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title 'fix: connector sync and override feature' directly aligns with the PR's main objective of adding filename-based duplicate handling and override functionality for connector ingests.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.


Copilot

Pull request overview

Adds end-to-end support for filename-based duplicate handling during connector ingests, enabling a user-driven “overwrite vs skip” workflow and backend enforcement via OpenSearch filename lookups/deletions.

Changes:

Backend: propagate replace_duplicates through connector sync APIs/services and enforce filename-based dedupe (fail vs delete+ingest) in connector processors.
Frontend: run pre-sync duplicate checks for provider uploads, show an overwrite/skip dialog, and send replace_duplicates on sync requests.
Utils/tests: adjust connector filename normalization + filename aliasing, and add unit tests for alias and dedupe behavior.
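The pre-sync check on the frontend amounts to partitioning the selection and then choosing what to submit. The sketch below expresses that logic in Python for consistency with the backend examples; the actual code is TypeScript in `frontend/app/upload/[provider]/page.tsx`. `plan_sync`, the file dict shape, and the assumption that skipped syncs send `replace_duplicates=False` are all hypothetical.

```python
import asyncio
from typing import Awaitable, Callable, Dict, List, Tuple

FileEntry = Dict[str, object]  # hypothetical shape: {"name": str, "is_folder": bool}


async def plan_sync(
    files: List[FileEntry],
    is_duplicate: Callable[[str], Awaitable[bool]],
    overwrite: bool,
) -> Tuple[List[FileEntry], bool]:
    """Return (files_to_submit, replace_duplicates) for one sync request."""
    # Folders are not checked; only selected non-folder files are probed.
    candidates = [f for f in files if not f["is_folder"]]
    flags = await asyncio.gather(*(is_duplicate(str(f["name"])) for f in candidates))
    duplicate_names = {f["name"] for f, dup in zip(candidates, flags) if dup}

    if not duplicate_names:
        return files, False  # nothing collides, so no dialog is needed
    if overwrite:
        return files, True  # user chose overwrite: submit everything
    # User chose skip: submit only entries that do not collide.
    kept = [f for f in files if f["is_folder"] or f["name"] not in duplicate_names]
    return kept, False
```

In the UI the `overwrite` value would come from the dialog rather than being passed in up front; it is a parameter here only to keep the sketch testable.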

Reviewed changes

Copilot reviewed 12 out of 12 changed files in this pull request and generated 5 comments.

Show a summary per file

File	Description
tests/unit/test_file_utils_filename_aliases.py	Adds tests for filename alias generation and connector filename normalization.
tests/unit/test_connector_processor_filename_dedupe.py	Adds async unit tests covering filename-collision behavior with/without overwrite.
src/utils/file_utils.py	Updates connector filename normalization behavior and expands filename alias matching.
src/models/processors.py	Adds filename-exists gating and overwrite deletion logic to connector processors.
src/connectors/service.py	Passes `replace_duplicates` into `ConnectorFileProcessor` for specific-file syncs.
src/connectors/langflow_connector_service.py	Passes `replace_duplicates` into `LangflowConnectorFileProcessor` for specific-file syncs.
src/api/connectors.py	Extends sync request model to accept `replace_duplicates` and forwards it to services.
src/api/connector_router.py	Threads `replace_duplicates` through the active connector service router.
frontend/components/knowledge-dropdown.tsx	Enhances duplicate dialog state to include duplicate filenames for folder uploads.
frontend/components/duplicate-handling-dialog.tsx	Displays duplicate filenames (up to a limit) and updates messaging/labels.
frontend/app/upload/[provider]/page.tsx	Adds pre-sync duplicate checks + overwrite/skip dialog and sends `replace_duplicates`.
frontend/app/api/mutations/useSyncConnector.ts	Extends sync mutation request type with `replace_duplicates`.

Comments suppressed due to low confidence (3)

src/models/processors.py:571

When replace_duplicates is true you delete by filename before computing/checking the incoming file hash. If the incoming hash already exists (so process_document_standard returns {"status": "unchanged"}), you can delete the old filename document without creating a replacement (data loss). Compute/check hash first and handle the hash-already-exists case explicitly before running delete_document_by_filename.

This issue also appears on line 707 of the same file.

            with auto_cleanup_tempfile(suffix=suffix) as tmp_path:
                # Write content to temp file
                with open(tmp_path, "wb") as f:
                    f.write(document.content)

src/models/processors.py:702

Same as ConnectorFileProcessor: filename lookup/delete uses document.filename even though the task filename is normalized via clean_connector_filename(...) above. If the MIME-mapped extension is enforced, the dedupe query can miss existing docs and the error message can show a different name than what is indexed. Use the normalized filename consistently for lookup/delete/error text (and for original_filename where applicable).

                    return
                await self.delete_document_by_filename(document.filename, opensearch_client)

            # Create temporary file and compute hash to check for duplicates
            suffix = get_file_extension(document.mimetype)

src/models/processors.py:711

delete_document_by_filename(...) runs before the hash duplicate check. If the new content’s hash already exists elsewhere, the processor may return "unchanged" and skip ingest after deleting the prior filename document. Reorder so hash existence is handled before deletion (or otherwise guarantee the replacement will be indexed) to prevent deleting without replacement.


                # Compute hash and check if already exists
                file_hash = hash_id(tmp_path)

                if await self.check_document_exists(file_hash, opensearch_client):
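One way to address the ordering concern raised in these comments is to check the content hash before touching the existing filename document. The sketch below illustrates that ordering with a plain dict as the index. It is not the repository's implementation: `replace_by_filename` and the index layout are hypothetical, and SHA-256 over the bytes stands in for the project's `hash_id`, which operates on a temp file path.

```python
import hashlib
from typing import Dict


def replace_by_filename(
    index: Dict[str, Dict[str, str]],
    filename: str,
    content: bytes,
    replace_duplicates: bool,
) -> str:
    """index holds 'by_hash' (hash -> filename) and 'by_name' (filename -> hash)."""
    file_hash = hashlib.sha256(content).hexdigest()

    # 1. Hash check first: identical content is already indexed, so return
    #    before anything has been deleted.
    if file_hash in index["by_hash"]:
        return "unchanged"

    # 2. Only now handle the filename collision.
    old_hash = index["by_name"].get(filename)
    if old_hash is not None:
        if not replace_duplicates:
            return "failed"
        del index["by_hash"][old_hash]
        del index["by_name"][filename]

    # 3. Index the replacement, so every delete is followed by an ingest.
    index["by_hash"][file_hash] = filename
    index["by_name"][filename] = file_hash
    return "indexed"
```

A side effect of this ordering is that re-syncing byte-identical content reports "unchanged" even when `replace_duplicates` is false; whether that is the desired result for a name collision is a product decision the sketch does not settle.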


+                    file_task.status = TaskStatus.FAILED
+                    file_task.error = f"File with name '{document.filename}' already exists"
+                    file_task.updated_at = time.time()
+                    upload_task.failed_files += 1
+                    return


    """
-    clean_name = filename.replace(" ", "_").replace("/", "_")
    suffix = get_file_extension(mimetype)
    if suffix is None:
-        # Unknown type — keep whatever extension the file already has
-        return clean_name
-    if not clean_name.lower().endswith(suffix.lower()):
-        return clean_name + suffix
-    return clean_name
+        return filename
+    if not filename.lower().endswith(suffix.lower()):


+            {visibleNames.map((name) => (
+              <li key={name} className="break-all">


+    isOverwriteConfirmedRef.current = true;
+    const { connector, allFiles } = pendingSync;
+    submitSync(connector, allFiles, true);
+    setPendingSync(null);


    if (pendingFolderUpload) {
      isFolderOverwriteConfirmedRef.current = true;
-      const { allFiles, duplicateCount, unsupportedCount } =
+      const { allFiles, duplicateNames, unsupportedCount } =
        pendingFolderUpload;
      await uploadFolderBatches(allFiles, true);


coderabbitai

Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)

src/models/processors.py (1)

711-716: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Missing processed_files increment on hash-based early return.

When the document hash already exists and the file is marked "unchanged", the method returns without incrementing processed_files.

🐛 Proposed fix

                 if await self.check_document_exists(file_hash, opensearch_client):
                     file_task.status = TaskStatus.COMPLETED
                     file_task.result = {"status": "unchanged", "id": file_hash}
                     file_task.updated_at = time.time()
                     upload_task.successful_files += 1
+                    upload_task.processed_files += 1
+                    upload_task.updated_at = time.time()
                     return

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/models/processors.py` around lines 711 - 716, The early-return branch
that handles existing document hashes (the if that calls
self.check_document_exists(...)) marks file_task as COMPLETED and increments
upload_task.successful_files but fails to increment upload_task.processed_files;
update that branch in the same block (where file_task.status, file_task.result,
file_task.updated_at are set) to also increment upload_task.processed_files
before returning so processed_files reflects the handled file.

src/api/connectors.py (1)

452-458: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Forward replace_duplicates in the bucket-filter sync path.

replace_duplicates is passed for explicit file selection but dropped for bucket-filter-based syncs, so the same request flag behaves inconsistently across valid ingest paths.

Suggested fix

                 task_id = await connector_service.sync_specific_files(
                     working_connection.connection_id,
                     user.user_id,
                     all_file_ids,
                     jwt_token=jwt_token,
                     ingest_settings=body.settings,
+                    replace_duplicates=body.replace_duplicates,
                 )

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/api/connectors.py` around lines 452 - 458, The bucket-filter sync call to
connector_service.sync_specific_files is dropping the replace_duplicates flag,
causing inconsistent behavior; modify the call in connectors.py to forward the
flag (e.g., add replace_duplicates=body.replace_duplicates or the equivalent
request field) alongside jwt_token and ingest_settings when invoking
connector_service.sync_specific_files with working_connection.connection_id,
user.user_id, and all_file_ids.

🧹 Nitpick comments (1)

src/utils/file_utils.py (1)
129-131: 💤 Low value

Stale comment references removed behavior.

The comment says "Mirror clean_connector_filename's space/slash -> underscore" but clean_connector_filename no longer performs this transformation (it now preserves the filename verbatim). The comment should explain that connector-ingested files may have been sanitized historically or by upstream systems, so aliases must include underscore variants for lookup matching.
📝 Suggested comment update
-    # Mirror clean_connector_filename's space/slash -> underscore so lookups also
-    # match files indexed through a connector ingestion path.
+    # Connector-ingested files may have spaces/slashes replaced with underscores
+    # by upstream systems. Include underscore variants so lookups match both forms.
     aliases.extend(name.replace(" ", "_").replace("/", "_") for name in list(aliases))
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/utils/file_utils.py` around lines 129 - 131, The existing comment above
the aliases.extend(...) line is stale because clean_connector_filename no longer
replaces spaces/slashes with underscores; update that comment to state that
connector-ingested filenames may have been sanitized historically or by upstream
systems so we still generate underscore variants for lookup compatibility, and
reference the aliases.extend(name.replace(" ", "_").replace("/", "_") for name
in list(aliases)) expression and clean_connector_filename to make clear why the
alias variants are kept for matching.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@frontend/app/upload/`[provider]/page.tsx:
- Around line 496-499: The probe File used for duplicateCheck is created without
preserving the original MIME type (see fakeFile and duplicateCheck usage), which
can cause backend filename normalization mismatches; update the probe
construction to include the original file's MIME type (use the file.type when
creating fakeFile) so the duplicate pre-check matches real uploads and
accurately detects duplicates.

In `@src/models/processors.py`:
- Around line 692-698: When filename collision is detected in the block that
calls await self.check_filename_exists(document.filename, opensearch_client) and
you take the early return because not self.replace_duplicates, increment
upload_task.processed_files (same fix as in ConnectorFileProcessor) before
setting file_task.status/updated_at and returning; ensure you update
upload_task.processed_files and persist any state changes to upload_task in the
same branch so processed_files reflects the skipped file.
- Around line 556-562: When a filename collision occurs in the code path that
checks await self.check_filename_exists(document.filename, opensearch_client)
and replace_duplicates is False, the method returns after incrementing
upload_task.failed_files but never increments upload_task.processed_files; fix
this by ensuring upload_task.processed_files is incremented on that early return
(or refactor to use a finally block like DocumentFileProcessor.process_item so
processed_files is always incremented regardless of early exits), updating the
block that sets file_task.status/ error and returns to also increment
upload_task.processed_files (or move the increment into a finally that encloses
the entire processing flow).



📥 Commits

Reviewing files that changed from the base of the PR and between 022527b and 1ba7ee7.

📒 Files selected for processing (12)

frontend/app/api/mutations/useSyncConnector.ts
frontend/app/upload/[provider]/page.tsx
frontend/components/duplicate-handling-dialog.tsx
frontend/components/knowledge-dropdown.tsx
src/api/connector_router.py
src/api/connectors.py
src/connectors/langflow_connector_service.py
src/connectors/service.py
src/models/processors.py
src/utils/file_utils.py
tests/unit/test_connector_processor_filename_dedupe.py
tests/unit/test_file_utils_filename_aliases.py

coderabbitai · 2026-05-22T15:16:39Z

+            const fakeFile = new File([], file.name);
+            const { exists } = await duplicateCheck(fakeFile);
+            return { file, isDuplicate: exists };
+          } catch (err) {


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Preserve MIME type when building the duplicate-check probe file.

The duplicate pre-check currently builds a probe File without MIME type, which can diverge from backend filename normalization and miss real duplicates.

Suggested fix

-            const fakeFile = new File([], file.name);
+            const fakeFile = new File([], file.name, {
+              type: file.mimeType || "application/octet-stream",
+            });
             const { exists } = await duplicateCheck(fakeFile);


🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@frontend/app/upload/`[provider]/page.tsx around lines 496 - 499, The probe File used for duplicateCheck is created without preserving the original MIME type (see fakeFile and duplicateCheck usage), which can cause backend filename normalization mismatches; update the probe construction to include the original file's MIME type (use the file.type when creating fakeFile) so the duplicate pre-check matches real uploads and accurately detects duplicates.

coderabbitai · 2026-05-22T15:16:39Z

+            if await self.check_filename_exists(document.filename, opensearch_client):
+                if not self.replace_duplicates:
+                    file_task.status = TaskStatus.FAILED
+                    file_task.error = f"File with name '{document.filename}' already exists"
+                    file_task.updated_at = time.time()
+                    upload_task.failed_files += 1
+                    return


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Missing processed_files increment on early return.

When the filename collision check fails (replace_duplicates=False), the method returns after incrementing failed_files but doesn't increment processed_files. Compare with DocumentFileProcessor.process_item which uses a finally block to ensure processed_files is always incremented. This inconsistency may cause progress tracking issues where processed_files never equals total_files.

🐛 Proposed fix

             if await self.check_filename_exists(document.filename, opensearch_client):
                 if not self.replace_duplicates:
                     file_task.status = TaskStatus.FAILED
                     file_task.error = f"File with name '{document.filename}' already exists"
                     file_task.updated_at = time.time()
                     upload_task.failed_files += 1
+                    upload_task.processed_files += 1
+                    upload_task.updated_at = time.time()
                     return
                 await self.delete_document_by_filename(document.filename, opensearch_client)

Alternatively, consider adding a finally block like DocumentFileProcessor to ensure processed_files is always incremented.
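The `finally` alternative can be illustrated with a stripped-down counter model. This is a sketch under assumptions: `UploadTask` here has only the counter fields named in the review, and `process_item` is a simplified synchronous stand-in, not `DocumentFileProcessor.process_item` itself.

```python
from dataclasses import dataclass
from typing import Set


@dataclass
class UploadTask:
    """Only the counters discussed in the review; the real model has more fields."""

    total_files: int = 0
    processed_files: int = 0
    successful_files: int = 0
    failed_files: int = 0


def process_item(
    upload_task: UploadTask,
    filename: str,
    existing: Set[str],
    replace_duplicates: bool = False,
) -> str:
    """Process one file; processed_files is bumped on every exit path."""
    try:
        if filename in existing and not replace_duplicates:
            upload_task.failed_files += 1
            return "failed"  # early return, but the finally block still runs
        existing.add(filename)
        upload_task.successful_files += 1
        return "indexed"
    finally:
        # Runs after both returns and after exceptions, so progress
        # tracking can always reach total_files.
        upload_task.processed_files += 1
```

Compared with adding an increment to each early-return branch, this keeps the invariant in one place, at the cost of wrapping the whole processing flow in a `try`.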

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/models/processors.py` around lines 556 - 562, When a filename collision occurs in the code path that checks await self.check_filename_exists(document.filename, opensearch_client) and replace_duplicates is False, the method returns after incrementing upload_task.failed_files but never increments upload_task.processed_files; fix this by ensuring upload_task.processed_files is incremented on that early return (or refactor to use a finally block like DocumentFileProcessor.process_item so processed_files is always incremented regardless of early exits), updating the block that sets file_task.status/ error and returns to also increment upload_task.processed_files (or move the increment into a finally that encloses the entire processing flow).

coderabbitai · 2026-05-22T15:16:39Z

+            if await self.check_filename_exists(document.filename, opensearch_client):
+                if not self.replace_duplicates:
+                    file_task.status = TaskStatus.FAILED
+                    file_task.error = f"File with name '{document.filename}' already exists"
+                    file_task.updated_at = time.time()
+                    upload_task.failed_files += 1
+                    return


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Missing processed_files increment on early return (same issue as ConnectorFileProcessor).

Same as the ConnectorFileProcessor issue—when the filename collision check fails, processed_files is not incremented.

🐛 Proposed fix

             if await self.check_filename_exists(document.filename, opensearch_client):
                 if not self.replace_duplicates:
                     file_task.status = TaskStatus.FAILED
                     file_task.error = f"File with name '{document.filename}' already exists"
                     file_task.updated_at = time.time()
                     upload_task.failed_files += 1
+                    upload_task.processed_files += 1
+                    upload_task.updated_at = time.time()
                     return
                 await self.delete_document_by_filename(document.filename, opensearch_client)

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/models/processors.py` around lines 692 - 698, When filename collision is detected in the block that calls await self.check_filename_exists(document.filename, opensearch_client) and you take the early return because not self.replace_duplicates, increment upload_task.processed_files (same fix as in ConnectorFileProcessor) before setting file_task.status/updated_at and returning; ensure you update upload_task.processed_files and persist any state changes to upload_task in the same branch so processed_files reflects the skipped file.

* Add filename-based duplicate handling for connectors

  Add end-to-end support for filename-based duplicate handling on connector ingests.

  Frontend: send a new replace_duplicates flag with connector sync requests, perform a pre-sync duplicate check, and show a DuplicateHandlingDialog that lets users overwrite or skip duplicates when uploading from provider UI.

  Backend: propagate replace_duplicates through connector_router, request models, and connector services into the file processors. ConnectorFileProcessor and LangflowConnectorFileProcessor now check whether a filename already exists in the index and either fail the file task or delete the existing document before ingesting when replace_duplicates is true.

  Utilities/tests: clean_connector_filename now preserves original spacing/slashes and only enforces MIME-mapped extensions; get_filename_aliases adds underscore/sanitized variants so lookups match connector-indexed names. Add unit tests covering filename dedupe logic and filename alias behavior.

* Use duplicateNames list and display names

  Replace numeric duplicateCount with a duplicateNames string[] across upload and dropdown flows so the UI can show the actual file names that would be overwritten. The duplicate-handling dialog now accepts duplicateNames, derives an effective count, and lists up to 5 duplicate filenames with an "… and N more" indicator; message labels and button text use the effective count. Toast messages and pending state in upload/[provider]/page.tsx and knowledge-dropdown.tsx were updated to pass and consume duplicateNames and to use duplicateNames.length for counts.

* Update page.tsx

* style: ruff autofix (auto)

---------

Co-authored-by: autofix-ci[bot] <114827586+autofix-ci[bot]@users.noreply.github.com>

* fix: Ensure SUCCESS status requires fetchable result in DoclingPollingService

* style: ruff autofix (auto)

* fix: Catch specific DoclingServeError when fetching task result after SUCCESS status

* feat: update style for oss of the failed task in the task panel (#1647)

* update style for oss of the failed task in the task panel

* keep logic on click, remove unecessary useeffect

* fix padding

* wip implementing Saas style

* utils to reshape error until backend provide info we need

* utils to reshape error until backend provide info we need

* utils to reshape error until backend provide info we need and fixinf fallbacks of isTotalFailure

* utils to reshape error until backend provide into

* have Saas style for failed and complete labelstatus and width and border

* few style adjustment to follow codebase pattern

* adjust succeed and partially succeed case

* adding comment for TODO implementation or more clarity

* remove carbon icon package and replace carbon icon

* add incident-reporter-icon

---------

Co-authored-by: Olfa Maslah <olfamaslah@Olfas-MacBook-Pro.local>

* fix: Encode IBM API key as Basic auth header (#1664)

* Encode IBM API key as Basic auth header

  Add base64 encoding for the IBM auth path: import base64, construct a Basic auth token from X-Username and X-Api-Key (username:apikey), and store it in user.jwt_token and user.opensearch_credentials. Also set request.state.user before attaching the DB user ID so downstream code can access the created user object.

* style: ruff autofix (auto)

---------

Co-authored-by: autofix-ci[bot] <114827586+autofix-ci[bot]@users.noreply.github.com>

* fix: restart deployment if env changes (#1665)

* restart deployment if env changes

* unit test

* lint

* fix: Ensure Langflow .env variable definitions from LANGFLOW_VARIABLES_TO_GET_FROM_ENVIRONMENT (#1667)

* Ensure we dynamically update the list of Langflow .env environment variables with default values when the comma separated list defined in LANGFLOW_VARIABLES_TO_GET_FROM_ENVIRONMENT changes

* fix tests

* fix additional linting errors

---------

Co-authored-by: rodageve <rodrigo.geve@datastax.com>

* chore: Retire openrag-mcp; switch docs to streamable HTTP (#1668)

* Retire openrag-mcp; switch docs to streamable HTTP

  Remove the stdio-based MCP server and all in-repo MCP tooling, and update README to mark the package as retired. Deleted module files include the MCP entrypoint, server, config, registry and individual tools (chat, search, documents, settings). The README was rewritten to announce that openrag-mcp is retired, explain migration to the built-in streamable-HTTP /mcp endpoint, update Cursor/Claude examples to use URL+headers auth, list the new v1 API tools, and note that the last PyPI release is final. This change consolidates MCP functionality into the OpenRAG core and removes the subprocess/stdio implementation and its source code.

* Mark MCP SDK retired and clean package metadata

  Update package metadata to reflect retirement and integration into the OpenRAG backend. Bump version to 0.3.0 and replace the project description with a retirement/migration note. Set Development Status to Inactive, remove explicit Python version classifiers, and clear runtime dependencies and the CLI script entrypoint. Also remove the hatch env pip-args setting; build-system and wheel package target remain unchanged.

* chore: update uv.lock files after version bump

* Update uv.lock

---------

Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>

* fix: connector sync and override feature (#1663)

* Add filename-based duplicate handling for connectors

  Add end-to-end support for filename-based duplicate handling on connector ingests.

  Frontend: send a new replace_duplicates flag with connector sync requests, perform a pre-sync duplicate check, and show a DuplicateHandlingDialog that lets users overwrite or skip duplicates when uploading from provider UI.

  Backend: propagate replace_duplicates through connector_router, request models, and connector services into the file processors. ConnectorFileProcessor and LangflowConnectorFileProcessor now check whether a filename already exists in the index and either fail the file task or delete the existing document before ingesting when replace_duplicates is true.

  Utilities/tests: clean_connector_filename now preserves original spacing/slashes and only enforces MIME-mapped extensions; get_filename_aliases adds underscore/sanitized variants so lookups match connector-indexed names. Add unit tests covering filename dedupe logic and filename alias behavior.

* Use duplicateNames list and display names

  Replace numeric duplicateCount with a duplicateNames string[] across upload and dropdown flows so the UI can show the actual file names that would be overwritten. The duplicate-handling dialog now accepts duplicateNames, derives an effective count, and lists up to 5 duplicate filenames with an "… and N more" indicator; message labels and button text use the effective count. Toast messages and pending state in upload/[provider]/page.tsx and knowledge-dropdown.tsx were updated to pass and consume duplicateNames and to use duplicateNames.length for counts.

* Update page.tsx

* style: ruff autofix (auto)

---------

Co-authored-by: autofix-ci[bot] <114827586+autofix-ci[bot]@users.noreply.github.com>

* fix: update OAuth prompt to consent for connector mutation (#1657)

* fix: implement transient error handling for Docling result fetch

* style: ruff autofix (auto)

* refactor: remove unused import of Optional in docling_polling_service.py

* refactor: change PollOutcome to use StrEnum for better type safety

* refactor: enhance task status endpoints with structured failure metadata

* style: ruff autofix (auto)

* revert "style: ruff autofix (auto)"

  This reverts commit bc8be33.

* style: ruff autofix (auto)

* fix: Ensure SUCCESS status requires fetchable result in DoclingPollingService

* style: ruff autofix (auto)

* fix: Catch specific DoclingServeError when fetching task result after SUCCESS status

* fix: implement transient error handling for Docling result fetch

* style: ruff autofix (auto)

* refactor: remove unused import of Optional in docling_polling_service.py

* refactor: change PollOutcome to use StrEnum for better type safety

* refactor: enhance task status endpoints with structured failure metadata

* style: ruff autofix (auto)

* revert "style: ruff autofix (auto)"

  This reverts commit bc8be33.

* style: ruff autofix (auto)

* Update tests/unit/test_task_service_get_task_status2.py

  Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>

* style: ruff autofix (auto)

* fix: handle timeout during Docling result fetch after SUCCESS status

* fix: update task status checks to use enum values for consistency

* fix: enhance failure metadata for duplicate file errors in ingestion

* style: ruff autofix (auto)

---------

Co-authored-by: autofix-ci[bot] <114827586+autofix-ci[bot]@users.noreply.github.com>
Co-authored-by: Wallgau <46035189+Wallgau@users.noreply.github.com>
Co-authored-by: Olfa Maslah <olfamaslah@Olfas-MacBook-Pro.local>
Co-authored-by: Edwin Jose <edwin.jose@datastax.com>
Co-authored-by: ming <itestmycode@gmail.com>
Co-authored-by: rodageve <78763007+rodageve@users.noreply.github.com>
Co-authored-by: rodageve <rodrigo.geve@datastax.com>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
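The IBM auth change listed above (#1664) describes building a Basic auth token from the X-Username and X-Api-Key values as `username:apikey`. A minimal sketch of that encoding follows. The function names are hypothetical, and whether the stored value includes the `Basic ` scheme prefix is an assumption, since the commit message does not say.

```python
import base64


def build_basic_token(username: str, api_key: str) -> str:
    """Base64-encode "username:apikey" as used by HTTP Basic auth."""
    raw = f"{username}:{api_key}".encode("utf-8")
    return base64.b64encode(raw).decode("ascii")


def basic_auth_header(username: str, api_key: str) -> str:
    """Assumed header form: the scheme name, a space, then the token."""
    return f"Basic {build_basic_token(username, api_key)}"
```

Base64 is an encoding, not encryption: anyone who sees the header can recover the API key, so this scheme is only appropriate over TLS.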

edwinjosechittilappilly added 3 commits May 22, 2026 09:03

Update page.tsx

9b1800d

Copilot AI review requested due to automatic review settings May 22, 2026 15:09

github-actions Bot added labels May 22, 2026: frontend 🟨 (issues related to the UI/UX); backend 🔷 (issues related to backend services: OpenSearch, Langflow, APIs); tests

edwinjosechittilappilly requested a review from mfortman11 May 22, 2026 15:09

github-actions Bot added the bug 🔴 Something isn't working. label May 22, 2026

style: ruff autofix (auto)

c9226c2

Copilot started reviewing on behalf of edwinjosechittilappilly May 22, 2026 15:10

edwinjosechittilappilly enabled auto-merge (squash) May 22, 2026 15:10

github-actions Bot added bug 🔴 Something isn't working. and removed bug 🔴 Something isn't working. labels May 22, 2026

lint fix

1ba7ee7

github-actions Bot added bug 🔴 Something isn't working. and removed bug 🔴 Something isn't working. labels May 22, 2026

Copilot AI reviewed May 22, 2026

coderabbitai Bot reviewed May 22, 2026

Merge branch 'main' into files-sharepoint-sync

95fd952

github-actions Bot added bug 🔴 Something isn't working. and removed bug 🔴 Something isn't working. labels May 22, 2026

mfortman11 approved these changes May 22, 2026

edwinjosechittilappilly merged commit 748583b into main May 22, 2026
15 checks passed

github-actions Bot added the lgtm label May 22, 2026

github-actions Bot deleted the files-sharepoint-sync branch May 22, 2026 20:26

coderabbitai Bot mentioned this pull request May 23, 2026

fix: Ensure success status requires fetchable result in DoclingPolling #1660

Open

This was referenced May 25, 2026

fix: make google drive ingestion work #1672

Merged

fix: add replace_duplicates option for Google Drive sync #1674

Merged

fix: Remove indexed chunks when connector file deleted #1677

Merged

coderabbitai Bot mentioned this pull request May 28, 2026

fix: unify connector ingestion pipelines and consolidate service layer #1695

Merged

fix: connector sync and override feature #1663

edwinjosechittilappilly merged 6 commits into main from files-sharepoint-sync

coderabbitai Bot commented May 22, 2026 (edited)

❌ Failed checks (1 warning)

3 participants

		{visibleNames.map((name) => (
		<li key={name} className="break-all">
