fix: connector sync and override feature #1663

Merged: edwinjosechittilappilly merged 6 commits into main from files-sharepoint-sync on May 22, 2026

edwinjosechittilappilly (Collaborator) commented May 22, 2026

Add end-to-end support for filename-based duplicate handling on connector ingests.

Frontend: send a new replace_duplicates flag with connector sync requests, perform a pre-sync duplicate check, and show a DuplicateHandlingDialog that lets users overwrite or skip duplicates when uploading from provider UI.

Backend: propagate replace_duplicates through connector_router, request models, and connector services into the file processors. ConnectorFileProcessor and LangflowConnectorFileProcessor now check whether a filename already exists in the index; on a match they fail the file task when replace_duplicates is false, or delete the existing document before ingesting when it is true.
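
The processor behavior described above can be sketched as follows. This is a minimal illustration, not the code in src/models/processors.py: `FakeIndex` and `ingest` are hypothetical names, and a plain dict stands in for the OpenSearch index.

```python
import asyncio
from dataclasses import dataclass, field


@dataclass
class FakeIndex:
    """Hypothetical in-memory stand-in for the index, keyed by filename."""
    docs: dict[str, str] = field(default_factory=dict)


async def ingest(index: FakeIndex, filename: str, content: str,
                 replace_duplicates: bool) -> str:
    """Fail on a filename collision unless replace_duplicates is set."""
    if filename in index.docs:
        if not replace_duplicates:
            return "failed"  # mirrors marking the file task as failed
        del index.docs[filename]  # mirrors deleting the existing document first
    index.docs[filename] = content
    return "indexed"


index = FakeIndex({"report.pdf": "v1"})
skip_result = asyncio.run(ingest(index, "report.pdf", "v2", replace_duplicates=False))
overwrite_result = asyncio.run(ingest(index, "report.pdf", "v2", replace_duplicates=True))
print(skip_result, overwrite_result, index.docs["report.pdf"])
```

With the flag off the existing document is left untouched; with it on, the old entry is removed before the new content is stored.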

Utilities/tests: clean_connector_filename now preserves original spacing/slashes and only enforces MIME-mapped extensions; get_filename_aliases adds underscore/sanitized variants so lookups match connector-indexed names. Add unit tests covering filename dedupe logic and filename alias behavior.
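
A rough sketch of the two utilities, assuming a small MIME table. The `_MIME_EXTENSIONS` mapping is made up for illustration, and this `get_filename_aliases` covers only the underscore variant; the real function in src/utils/file_utils.py may produce further aliases (the tests mention extension swaps).

```python
from typing import Optional

# Hypothetical MIME-to-extension table; the real mapping sits behind get_file_extension.
_MIME_EXTENSIONS = {"application/pdf": ".pdf", "text/plain": ".txt"}


def get_file_extension(mimetype: str) -> Optional[str]:
    """Return the extension mapped to a MIME type, or None if unknown."""
    return _MIME_EXTENSIONS.get(mimetype)


def clean_connector_filename(filename: str, mimetype: str) -> str:
    """Keep the name verbatim; append the MIME-mapped extension only if missing."""
    suffix = get_file_extension(mimetype)
    if suffix is None:
        return filename  # unknown type: keep whatever extension the file has
    if not filename.lower().endswith(suffix.lower()):
        return filename + suffix
    return filename


def get_filename_aliases(filename: str) -> list[str]:
    """Original name plus an underscore-sanitized variant, without repeats."""
    if not filename:
        return []
    aliases = [filename]
    sanitized = filename.replace(" ", "_").replace("/", "_")
    if sanitized not in aliases:
        aliases.append(sanitized)
    return aliases


print(clean_connector_filename("Q1 plan/draft", "application/pdf"))
print(get_filename_aliases("Q1 plan/draft.pdf"))
```

The underscore alias lets a lookup for a name with spaces or slashes also match a document that was indexed under a sanitized name.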

Summary by CodeRabbit

Release Notes

  • New Features

    • Added duplicate file detection during connector sync operations. Users can now see which files already exist and choose to overwrite or skip duplicates via an interactive dialog.
    • Introduced replace_duplicates flag to control duplicate handling behavior during file uploads and syncs.
  • Tests

    • Added comprehensive test coverage for filename-based duplicate detection and handling logic.

Replace numeric duplicateCount with a duplicateNames string[] across upload and dropdown flows so the UI can show the actual file names that would be overwritten. The duplicate-handling dialog now accepts duplicateNames, derives an effective count, and lists up to 5 duplicate filenames with an "… and N more" indicator; message labels and button text use the effective count. Toast messages and pending state in upload/[provider]/page.tsx and knowledge-dropdown.tsx were updated to pass and consume duplicateNames and to use duplicateNames.length for counts.
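
The dialog's truncation rule can be illustrated with a short sketch. The real component is TypeScript (frontend/components/duplicate-handling-dialog.tsx); Python is used here only to show the logic, and `summarize_duplicates` and `MAX_VISIBLE` are hypothetical names.

```python
MAX_VISIBLE = 5  # the dialog lists up to 5 duplicate filenames


def summarize_duplicates(duplicate_names: list[str]) -> list[str]:
    """Return the lines to display: visible names, then an overflow indicator."""
    visible = duplicate_names[:MAX_VISIBLE]
    hidden = len(duplicate_names) - len(visible)
    lines = list(visible)
    if hidden > 0:
        lines.append(f"… and {hidden} more")
    return lines


names = [f"file{i}.pdf" for i in range(7)]
preview = summarize_duplicates(names)
print(preview)
```

The count shown in labels would come from `len(duplicate_names)`, so the list and the count cannot drift apart.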
Copilot AI review requested due to automatic review settings May 22, 2026 15:09
@github-actions github-actions Bot added the frontend, backend, and tests labels May 22, 2026
coderabbitai Bot (Contributor) commented May 22, 2026

Walkthrough

This PR implements filename-based duplicate detection during connector file ingestion with optional replacement. It threads a replace_duplicates parameter through the frontend, API, and backend layers; refactors filename utilities to preserve original names while enforcing MIME extensions; implements duplicate detection and conditional deletion in processors; and adds a frontend UI flow for checking duplicates before sync and optionally overwriting them.

Changes

Duplicate-Aware Connector Sync

| Layer | File(s) | Summary |
| --- | --- | --- |
| API Parameter Contract – replace_duplicates Threading | frontend/app/api/mutations/useSyncConnector.ts, src/api/connector_router.py, src/api/connectors.py, src/connectors/service.py, src/connectors/langflow_connector_service.py | The replace_duplicates: bool parameter is added to the request body and threaded through the API router, connector service, and Langflow service to reach the processor layer. |
| Filename Resolution and Aliasing Utilities | src/utils/file_utils.py | clean_connector_filename now preserves original filenames while enforcing MIME-derived extensions. get_filename_aliases generates both the original filename and underscore-sanitized variants to ensure duplicate lookups match files indexed via connector ingestion. |
| Processor Duplicate Detection and Conditional Replacement | src/models/processors.py | Both ConnectorFileProcessor and LangflowConnectorFileProcessor check for existing documents by filename via OpenSearch. When duplicates exist, they either fail the task (if replace_duplicates=False) or delete existing records and continue (if replace_duplicates=True). Langflow hash-based early-exit logic is removed. |
| Frontend Upload Page – Duplicate Checking Workflow | frontend/app/upload/[provider]/page.tsx | The upload page refactors sync initiation into a submitSync helper and adds an async handleSync that checks each selected non-folder file for duplicates. If duplicates exist, it opens DuplicateHandlingDialog; if the user confirms overwrite, all files are submitted with replace_duplicates=true; otherwise, only non-duplicate files are submitted. |
| Frontend Dialog and Dropdown Components | frontend/components/duplicate-handling-dialog.tsx, frontend/components/knowledge-dropdown.tsx | DuplicateHandlingDialog gains optional duplicateNames: string[] and displays up to 5 duplicate filenames with "… and N more" indicator. KnowledgeDropdown refactors folder duplicate handling to track duplicateNames array instead of numeric count. |
| Processor Duplicate Handling Tests | tests/unit/test_connector_processor_filename_dedupe.py | Unit tests validate filename-based deduplication: collision with replace_duplicates=False (task fails), collision with replace_duplicates=True (delete-and-overwrite), no collision (ingestion proceeds). Langflow tests also validate hash-collision behavior. |
| File Utilities Tests | tests/unit/test_file_utils_filename_aliases.py | Unit tests validate get_filename_aliases for empty inputs, plain filenames, extension swaps, spaces/slashes normalization, and combined variants. Tests validate clean_connector_filename preserves spaces/slashes, enforces MIME extensions for known types, and leaves unknown types unchanged. |
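
The upload-page workflow summarized in the table can be sketched as a pure function. The actual code is TypeScript in frontend/app/upload/[provider]/page.tsx; `SelectedFile` and `plan_sync` are hypothetical, and sending replace_duplicates=False on the skip path is an assumption, since the walkthrough only says that non-duplicate files are submitted.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class SelectedFile:
    name: str
    is_folder: bool = False


def plan_sync(files: list[SelectedFile], exists: Callable[[str], bool],
              overwrite: bool) -> tuple[list[SelectedFile], bool]:
    """Return (files_to_submit, replace_duplicates) for one sync request."""
    # Only non-folder files are checked for duplicates.
    duplicates = [f for f in files if not f.is_folder and exists(f.name)]
    if not duplicates:
        return list(files), False
    if overwrite:
        return list(files), True  # user confirmed overwrite: submit everything
    # User chose to skip: submit only the files that are not duplicates.
    return [f for f in files if f not in duplicates], False


indexed = {"a.pdf"}
files = [SelectedFile("a.pdf"), SelectedFile("b.pdf"), SelectedFile("docs", True)]
overwrite_plan = plan_sync(files, lambda name: name in indexed, overwrite=True)
skip_plan = plan_sync(files, lambda name: name in indexed, overwrite=False)
print(overwrite_plan)
print(skip_plan)
```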

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Suggested labels

enhancement

Suggested reviewers

  • mfortman11
🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 43.59% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title 'fix: connector sync and override feature' directly aligns with the PR's main objective of adding filename-based duplicate handling and override functionality for connector ingests. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |


@github-actions github-actions Bot added the bug label May 22, 2026
@edwinjosechittilappilly edwinjosechittilappilly enabled auto-merge (squash) May 22, 2026 15:10
@github-actions github-actions Bot added and removed the bug label May 22, 2026
Copilot AI (Contributor) left a comment

Pull request overview

Adds end-to-end support for filename-based duplicate handling during connector ingests, enabling a user-driven “overwrite vs skip” workflow and backend enforcement via OpenSearch filename lookups/deletions.

Changes:

  • Backend: propagate replace_duplicates through connector sync APIs/services and enforce filename-based dedupe (fail vs delete+ingest) in connector processors.
  • Frontend: run pre-sync duplicate checks for provider uploads, show an overwrite/skip dialog, and send replace_duplicates on sync requests.
  • Utils/tests: adjust connector filename normalization + filename aliasing, and add unit tests for alias and dedupe behavior.

Reviewed changes

Copilot reviewed 12 out of 12 changed files in this pull request and generated 5 comments.

| File | Description |
| --- | --- |
| tests/unit/test_file_utils_filename_aliases.py | Adds tests for filename alias generation and connector filename normalization. |
| tests/unit/test_connector_processor_filename_dedupe.py | Adds async unit tests covering filename-collision behavior with/without overwrite. |
| src/utils/file_utils.py | Updates connector filename normalization behavior and expands filename alias matching. |
| src/models/processors.py | Adds filename-exists gating and overwrite deletion logic to connector processors. |
| src/connectors/service.py | Passes replace_duplicates into ConnectorFileProcessor for specific-file syncs. |
| src/connectors/langflow_connector_service.py | Passes replace_duplicates into LangflowConnectorFileProcessor for specific-file syncs. |
| src/api/connectors.py | Extends sync request model to accept replace_duplicates and forwards it to services. |
| src/api/connector_router.py | Threads replace_duplicates through the active connector service router. |
| frontend/components/knowledge-dropdown.tsx | Enhances duplicate dialog state to include duplicate filenames for folder uploads. |
| frontend/components/duplicate-handling-dialog.tsx | Displays duplicate filenames (up to a limit) and updates messaging/labels. |
| frontend/app/upload/[provider]/page.tsx | Adds pre-sync duplicate checks + overwrite/skip dialog and sends replace_duplicates. |
| frontend/app/api/mutations/useSyncConnector.ts | Extends sync mutation request type with replace_duplicates. |
Comments suppressed due to low confidence (3)

src/models/processors.py:571

  • When replace_duplicates is true you delete by filename before computing/checking the incoming file hash. If the incoming hash already exists (so process_document_standard returns {"status": "unchanged"}), you can delete the old filename document without creating a replacement (data loss). Compute/check hash first and handle the hash-already-exists case explicitly before running delete_document_by_filename.

This issue also appears on line 707 of the same file.

            with auto_cleanup_tempfile(suffix=suffix) as tmp_path:
                # Write content to temp file
                with open(tmp_path, "wb") as f:
                    f.write(document.content)

src/models/processors.py:702

  • Same as ConnectorFileProcessor: filename lookup/delete uses document.filename even though the task filename is normalized via clean_connector_filename(...) above. If the MIME-mapped extension is enforced, the dedupe query can miss existing docs and the error message can show a different name than what is indexed. Use the normalized filename consistently for lookup/delete/error text (and for original_filename where applicable).
                    return
                await self.delete_document_by_filename(document.filename, opensearch_client)

            # Create temporary file and compute hash to check for duplicates
            suffix = get_file_extension(document.mimetype)

src/models/processors.py:711

  • delete_document_by_filename(...) runs before the hash duplicate check. If the new content’s hash already exists elsewhere, the processor may return "unchanged" and skip ingest after deleting the prior filename document. Reorder so hash existence is handled before deletion (or otherwise guarantee the replacement will be indexed) to prevent deleting without replacement.

                # Compute hash and check if already exists
                file_hash = hash_id(tmp_path)

                if await self.check_document_exists(file_hash, opensearch_client):
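
One way to order the steps so that nothing is deleted without a replacement is sketched below. It is an illustration of the reordering the comments above ask for, not the processor's code: `replace_safely` is a hypothetical helper, a dict keyed by content hash stands in for the index, and SHA-256 stands in for the project's hash_id.

```python
import hashlib


def replace_safely(index: dict[str, dict], filename: str, content: bytes) -> str:
    """index maps content hash -> {"filename": ...} (hypothetical in-memory model)."""
    file_hash = hashlib.sha256(content).hexdigest()
    # Check the hash first: if the content is already indexed, delete nothing.
    if file_hash in index:
        return "unchanged"
    # Only now remove documents stored under the same filename.
    stale_ids = [doc_id for doc_id, doc in index.items() if doc["filename"] == filename]
    for doc_id in stale_ids:
        del index[doc_id]
    index[file_hash] = {"filename": filename}
    return "indexed"


index = {hashlib.sha256(b"same").hexdigest(): {"filename": "report.pdf"}}
first = replace_safely(index, "report.pdf", b"same")
second = replace_safely(index, "report.pdf", b"new content")
print(first, second, len(index))
```

Re-syncing identical content leaves the existing document in place, while changed content replaces it in one pass.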


Comment thread src/models/processors.py
Comment on lines +558 to +562
file_task.status = TaskStatus.FAILED
file_task.error = f"File with name '{document.filename}' already exists"
file_task.updated_at = time.time()
upload_task.failed_files += 1
return
Comment thread src/utils/file_utils.py
Comment on lines 96 to +100
     """
-    clean_name = filename.replace(" ", "_").replace("/", "_")
     suffix = get_file_extension(mimetype)
     if suffix is None:
         # Unknown type — keep whatever extension the file already has
-        return clean_name
-    if not clean_name.lower().endswith(suffix.lower()):
-        return clean_name + suffix
-    return clean_name
+        return filename
+    if not filename.lower().endswith(suffix.lower()):
Comment on lines +82 to +83
{visibleNames.map((name) => (
<li key={name} className="break-all">
isOverwriteConfirmedRef.current = true;
const { connector, allFiles } = pendingSync;
submitSync(connector, allFiles, true);
setPendingSync(null);
Comment on lines 383 to 387
 if (pendingFolderUpload) {
   isFolderOverwriteConfirmedRef.current = true;
-  const { allFiles, duplicateCount, unsupportedCount } =
+  const { allFiles, duplicateNames, unsupportedCount } =
     pendingFolderUpload;
   await uploadFolderBatches(allFiles, true);
coderabbitai Bot (Contributor) left a comment

Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
src/models/processors.py (1)

711-716: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Missing processed_files increment on hash-based early return.

When the document hash already exists and the file is marked "unchanged", the method returns without incrementing processed_files.

🐛 Proposed fix
                 if await self.check_document_exists(file_hash, opensearch_client):
                     file_task.status = TaskStatus.COMPLETED
                     file_task.result = {"status": "unchanged", "id": file_hash}
                     file_task.updated_at = time.time()
                     upload_task.successful_files += 1
+                    upload_task.processed_files += 1
+                    upload_task.updated_at = time.time()
                     return
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/models/processors.py` around lines 711 - 716, The early-return branch
that handles existing document hashes (the if that calls
self.check_document_exists(...)) marks file_task as COMPLETED and increments
upload_task.successful_files but fails to increment upload_task.processed_files;
update that branch in the same block (where file_task.status, file_task.result,
file_task.updated_at are set) to also increment upload_task.processed_files
before returning so processed_files reflects the handled file.
src/api/connectors.py (1)

452-458: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Forward replace_duplicates in the bucket-filter sync path.

replace_duplicates is passed for explicit file selection but dropped for bucket-filter-based syncs, so the same request flag behaves inconsistently across valid ingest paths.

Suggested fix
                 task_id = await connector_service.sync_specific_files(
                     working_connection.connection_id,
                     user.user_id,
                     all_file_ids,
                     jwt_token=jwt_token,
                     ingest_settings=body.settings,
+                    replace_duplicates=body.replace_duplicates,
                 )
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/api/connectors.py` around lines 452 - 458, The bucket-filter sync call to
connector_service.sync_specific_files is dropping the replace_duplicates flag,
causing inconsistent behavior; modify the call in connectors.py to forward the
flag (e.g., add replace_duplicates=body.replace_duplicates or the equivalent
request field) alongside jwt_token and ingest_settings when invoking
connector_service.sync_specific_files with working_connection.connection_id,
user.user_id, and all_file_ids.
🧹 Nitpick comments (1)
src/utils/file_utils.py (1)

129-131: 💤 Low value

Stale comment references removed behavior.

The comment says "Mirror clean_connector_filename's space/slash -> underscore" but clean_connector_filename no longer performs this transformation (it now preserves the filename verbatim). The comment should explain that connector-ingested files may have been sanitized historically or by upstream systems, so aliases must include underscore variants for lookup matching.

📝 Suggested comment update
-    # Mirror clean_connector_filename's space/slash -> underscore so lookups also
-    # match files indexed through a connector ingestion path.
+    # Connector-ingested files may have spaces/slashes replaced with underscores
+    # by upstream systems. Include underscore variants so lookups match both forms.
     aliases.extend(name.replace(" ", "_").replace("/", "_") for name in list(aliases))
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/utils/file_utils.py` around lines 129 - 131, The existing comment above
the aliases.extend(...) line is stale because clean_connector_filename no longer
replaces spaces/slashes with underscores; update that comment to state that
connector-ingested filenames may have been sanitized historically or by upstream
systems so we still generate underscore variants for lookup compatibility, and
reference the aliases.extend(name.replace(" ", "_").replace("/", "_") for name
in list(aliases)) expression and clean_connector_filename to make clear why the
alias variants are kept for matching.

📥 Commits

Reviewing files that changed from the base of the PR and between 022527b and 1ba7ee7.

📒 Files selected for processing (12)
  • frontend/app/api/mutations/useSyncConnector.ts
  • frontend/app/upload/[provider]/page.tsx
  • frontend/components/duplicate-handling-dialog.tsx
  • frontend/components/knowledge-dropdown.tsx
  • src/api/connector_router.py
  • src/api/connectors.py
  • src/connectors/langflow_connector_service.py
  • src/connectors/service.py
  • src/models/processors.py
  • src/utils/file_utils.py
  • tests/unit/test_connector_processor_filename_dedupe.py
  • tests/unit/test_file_utils_filename_aliases.py

Comment on lines +496 to +499
const fakeFile = new File([], file.name);
const { exists } = await duplicateCheck(fakeFile);
return { file, isDuplicate: exists };
} catch (err) {

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Preserve MIME type when building the duplicate-check probe file.

The duplicate pre-check currently builds a probe File without MIME type, which can diverge from backend filename normalization and miss real duplicates.

Suggested fix
-            const fakeFile = new File([], file.name);
+            const fakeFile = new File([], file.name, {
+              type: file.mimeType || "application/octet-stream",
+            });
             const { exists } = await duplicateCheck(fakeFile);
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@frontend/app/upload/`[provider]/page.tsx around lines 496 - 499, The probe
File used for duplicateCheck is created without preserving the original MIME
type (see fakeFile and duplicateCheck usage), which can cause backend filename
normalization mismatches; update the probe construction to include the original
file's MIME type (use the file.type when creating fakeFile) so the duplicate
pre-check matches real uploads and accurately detects duplicates.

Comment thread src/models/processors.py
Comment on lines +556 to +562
if await self.check_filename_exists(document.filename, opensearch_client):
if not self.replace_duplicates:
file_task.status = TaskStatus.FAILED
file_task.error = f"File with name '{document.filename}' already exists"
file_task.updated_at = time.time()
upload_task.failed_files += 1
return

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Missing processed_files increment on early return.

When the filename collision check fails (replace_duplicates=False), the method returns after incrementing failed_files but doesn't increment processed_files. Compare with DocumentFileProcessor.process_item which uses a finally block to ensure processed_files is always incremented. This inconsistency may cause progress tracking issues where processed_files never equals total_files.

🐛 Proposed fix
             if await self.check_filename_exists(document.filename, opensearch_client):
                 if not self.replace_duplicates:
                     file_task.status = TaskStatus.FAILED
                     file_task.error = f"File with name '{document.filename}' already exists"
                     file_task.updated_at = time.time()
                     upload_task.failed_files += 1
+                    upload_task.processed_files += 1
+                    upload_task.updated_at = time.time()
                     return
                 await self.delete_document_by_filename(document.filename, opensearch_client)

Alternatively, consider adding a finally block like DocumentFileProcessor to ensure processed_files is always incremented.
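
The finally-based alternative might look like the sketch below. It assumes a simplified `UploadTask` with only the counters named in this thread; `process_item` here is a hypothetical stand-in and not the signature of DocumentFileProcessor.process_item.

```python
import time
from dataclasses import dataclass


@dataclass
class UploadTask:
    processed_files: int = 0
    successful_files: int = 0
    failed_files: int = 0
    updated_at: float = 0.0


def process_item(upload_task: UploadTask, is_duplicate: bool,
                 replace_duplicates: bool) -> str:
    """Count every file as processed, whichever branch returns."""
    try:
        if is_duplicate and not replace_duplicates:
            upload_task.failed_files += 1
            return "failed"  # early return on a filename collision
        upload_task.successful_files += 1
        return "completed"
    finally:
        # Runs on every exit path, including early returns and exceptions.
        upload_task.processed_files += 1
        upload_task.updated_at = time.time()


task = UploadTask()
results = [process_item(task, True, False), process_item(task, False, False)]
print(results, task.processed_files, task.failed_files, task.successful_files)
```

Because the increment lives in one place, a new early-return branch cannot leave processed_files short of total_files.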

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/models/processors.py` around lines 556 - 562, When a filename collision
occurs in the code path that checks await
self.check_filename_exists(document.filename, opensearch_client) and
replace_duplicates is False, the method returns after incrementing
upload_task.failed_files but never increments upload_task.processed_files; fix
this by ensuring upload_task.processed_files is incremented on that early return
(or refactor to use a finally block like DocumentFileProcessor.process_item so
processed_files is always incremented regardless of early exits), updating the
block that sets file_task.status/ error and returns to also increment
upload_task.processed_files (or move the increment into a finally that encloses
the entire processing flow).

Comment thread src/models/processors.py
Comment on lines +692 to +698
if await self.check_filename_exists(document.filename, opensearch_client):
if not self.replace_duplicates:
file_task.status = TaskStatus.FAILED
file_task.error = f"File with name '{document.filename}' already exists"
file_task.updated_at = time.time()
upload_task.failed_files += 1
return

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Missing processed_files increment on early return (same issue as ConnectorFileProcessor).

Same as the ConnectorFileProcessor issue—when the filename collision check fails, processed_files is not incremented.

🐛 Proposed fix
             if await self.check_filename_exists(document.filename, opensearch_client):
                 if not self.replace_duplicates:
                     file_task.status = TaskStatus.FAILED
                     file_task.error = f"File with name '{document.filename}' already exists"
                     file_task.updated_at = time.time()
                     upload_task.failed_files += 1
+                    upload_task.processed_files += 1
+                    upload_task.updated_at = time.time()
                     return
                 await self.delete_document_by_filename(document.filename, opensearch_client)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/models/processors.py` around lines 692 - 698, When filename collision is
detected in the block that calls await
self.check_filename_exists(document.filename, opensearch_client) and you take
the early return because not self.replace_duplicates, increment
upload_task.processed_files (same fix as in ConnectorFileProcessor) before
setting file_task.status/updated_at and returning; ensure you update
upload_task.processed_files and persist any state changes to upload_task in the
same branch so processed_files reflects the skipped file.

@github-actions github-actions Bot added and removed the bug label May 22, 2026
@edwinjosechittilappilly edwinjosechittilappilly merged commit 748583b into main May 22, 2026
15 checks passed
@github-actions github-actions Bot added the lgtm label May 22, 2026
@github-actions github-actions Bot deleted the files-sharepoint-sync branch May 22, 2026 20:26
ricofurtado pushed a commit that referenced this pull request May 23, 2026
* Add filename-based duplicate handling for connectors

Add end-to-end support for filename-based duplicate handling on connector ingests.

Frontend: send a new replace_duplicates flag with connector sync requests, perform a pre-sync duplicate check, and show a DuplicateHandlingDialog that lets users overwrite or skip duplicates when uploading from provider UI.

Backend: propagate replace_duplicates through connector_router, request models, and connector services into the file processors. ConnectorFileProcessor and LangflowConnectorFileProcessor now check whether a filename already exists in the index and either fail the file task or delete the existing document before ingesting when replace_duplicates is true.

Utilities/tests: clean_connector_filename now preserves original spacing/slashes and only enforces MIME-mapped extensions; get_filename_aliases adds underscore/sanitized variants so lookups match connector-indexed names. Add unit tests covering filename dedupe logic and filename alias behavior.

* Use duplicateNames list and display names

Replace numeric duplicateCount with a duplicateNames string[] across upload and dropdown flows so the UI can show the actual file names that would be overwritten. The duplicate-handling dialog now accepts duplicateNames, derives an effective count, and lists up to 5 duplicate filenames with an "… and N more" indicator; message labels and button text use the effective count. Toast messages and pending state in upload/[provider]/page.tsx and knowledge-dropdown.tsx were updated to pass and consume duplicateNames and to use duplicateNames.length for counts.

* Update page.tsx

* style: ruff autofix (auto)

---------

Co-authored-by: autofix-ci[bot] <114827586+autofix-ci[bot]@users.noreply.github.com>
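
The alias idea in the commit message above (underscore and sanitized variants of a filename, so a lookup can match names as the connector indexed them) might look roughly like the sketch below. It is an assumption-laden illustration and not the repository's `get_filename_aliases`: the function name `filename_aliases`, the exact sanitizing rule, and the ordering of results are all hypothetical.

```python
import re


def filename_aliases(filename: str) -> list[str]:
    """Return the name plus underscore and sanitized variants, without repeats."""
    candidates = [
        filename,                                   # name exactly as provided
        filename.replace(" ", "_"),                 # spaces turned into underscores
        re.sub(r"[^A-Za-z0-9._-]", "_", filename),  # assumed sanitizing rule
    ]
    aliases: list[str] = []
    for candidate in candidates:
        if candidate not in aliases:
            aliases.append(candidate)
    return aliases
```

A duplicate check could then query the index for any of the returned names rather than only the original one.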
ricofurtado added a commit that referenced this pull request May 26, 2026
* fix: Ensure SUCCESS status requires fetchable result in DoclingPollingService

* style: ruff autofix (auto)

* fix: Catch specific DoclingServeError when fetching task result after SUCCESS status

* feat: update style for oss of the failed task in the task panel (#1647)

* update style for oss of the failed task in the task panel

* keep logic on click, remove unnecessary useEffect

* fix padding

* wip implementing Saas style

* utils to reshape error until backend provides info we need

* utils to reshape error until backend provides info we need and fixing fallbacks of isTotalFailure

* have Saas style for failed and complete labelstatus and width and border

* few style adjustment to follow codebase pattern

* adjust succeed and partially succeed case

* adding comment for TODO implementation or more clarity

* remove carbon icon package and replace carbon icon

* add incident-reporter-icon

---------

Co-authored-by: Olfa Maslah <olfamaslah@Olfas-MacBook-Pro.local>

* fix: Encode IBM API key as Basic auth header (#1664)

* Encode IBM API key as Basic auth header

Add base64 encoding for the IBM auth path: import base64, construct a Basic auth token from X-Username and X-Api-Key (username:apikey), and store it in user.jwt_token and user.opensearch_credentials. Also set request.state.user before attaching the DB user ID so downstream code can access the created user object.

* style: ruff autofix (auto)

---------

Co-authored-by: autofix-ci[bot] <114827586+autofix-ci[bot]@users.noreply.github.com>
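
The Basic auth construction described for #1664 (base64 of `username:apikey`) can be illustrated with the short sketch below. The helper name `basic_auth_token` is hypothetical, and whether the stored value carries the `Basic ` prefix is an assumption; the commit message only states that the token is built from X-Username and X-Api-Key.

```python
import base64


def basic_auth_token(username: str, api_key: str) -> str:
    """Build an HTTP Basic auth value from a username and an API key."""
    raw = f"{username}:{api_key}".encode("utf-8")
    # Base64 output is plain ASCII, so decoding as ASCII is safe.
    return "Basic " + base64.b64encode(raw).decode("ascii")
```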

* fix: restart deployment if env changes (#1665)

* restart deployment if env changes

* unit test

* lint

* fix: Ensure Langflow .env variable definitions from LANGFLOW_VARIABLES_TO_GET_FROM_ENVIRONMENT (#1667)

* Ensure we dynamically update the list of Langflow .env environment variables with default values when the comma separated list defined in LANGFLOW_VARIABLES_TO_GET_FROM_ENVIRONMENT changes

* fix tests

* fix additional linting errors

---------

Co-authored-by: rodageve <rodrigo.geve@datastax.com>

* chore: Retire openrag-mcp; switch docs to streamable HTTP (#1668)

* Retire openrag-mcp; switch docs to streamable HTTP

Remove the stdio-based MCP server and all in-repo MCP tooling, and update README to mark the package as retired. Deleted module files include the MCP entrypoint, server, config, registry and individual tools (chat, search, documents, settings). The README was rewritten to announce that openrag-mcp is retired, explain migration to the built-in streamable-HTTP /mcp endpoint, update Cursor/Claude examples to use URL+headers auth, list the new v1 API tools, and note that the last PyPI release is final. This change consolidates MCP functionality into the OpenRAG core and removes the subprocess/stdio implementation and its source code.

* Mark MCP SDK retired and clean package metadata

Update package metadata to reflect retirement and integration into the OpenRAG backend. Bump version to 0.3.0 and replace the project description with a retirement/migration note. Set Development Status to Inactive, remove explicit Python version classifiers, and clear runtime dependencies and the CLI script entrypoint. Also remove the hatch env pip-args setting; build-system and wheel package target remain unchanged.

* chore: update uv.lock files after version bump

* Update uv.lock

---------

Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>

* fix: connector sync and override feature (#1663)

* fix: update OAuth prompt to consent for connector mutation (#1657)

* fix: implement transient error handling for Docling result fetch

* style: ruff autofix (auto)

* refactor: remove unused import of Optional in docling_polling_service.py

* refactor: change PollOutcome to use StrEnum for better type safety

* refactor: enhance task status endpoints with structured failure metadata

* style: ruff autofix (auto)

* revert "style: ruff autofix (auto)"

This reverts commit bc8be33.

* style: ruff autofix (auto)

* Update tests/unit/test_task_service_get_task_status2.py

Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>

* style: ruff autofix (auto)

* fix: handle timeout during Docling result fetch after SUCCESS status

* fix: update task status checks to use enum values for consistency

* fix: enhance failure metadata for duplicate file errors in ingestion

* style: ruff autofix (auto)

---------

Co-authored-by: autofix-ci[bot] <114827586+autofix-ci[bot]@users.noreply.github.com>
Co-authored-by: Wallgau <46035189+Wallgau@users.noreply.github.com>
Co-authored-by: Olfa Maslah <olfamaslah@Olfas-MacBook-Pro.local>
Co-authored-by: Edwin Jose <edwin.jose@datastax.com>
Co-authored-by: ming <itestmycode@gmail.com>
Co-authored-by: rodageve <78763007+rodageve@users.noreply.github.com>
Co-authored-by: rodageve <rodrigo.geve@datastax.com>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
Labels

backend 🔷 Issues related to backend services (OpenSearch, Langflow, APIs) bug 🔴 Something isn't working. frontend 🟨 Issues related to the UI/UX lgtm tests
