fix: Jieba keyword extraction for hyphenated and underscore identifiers #33701

Open
mkumm14 wants to merge 8 commits into langgenius:main from mkumm14:fix/jieba-hyphenated-keyword-splitting

Conversation

@mkumm14 mkumm14 commented Mar 18, 2026

Summary

Fixes a bug in Economy-mode (keyword-based) indexing where compound identifiers containing hyphens or underscores — such as st-771 or model_function_description — could not be recalled by keyword search.

Root cause: Jieba's tokeniser splits these identifiers into separate tokens (st, 771, function, description) before TF-IDF scoring. The whole term was never indexed as a unit, so an exact keyword search for st-771 returned no results.
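
To make the failure mode concrete, here is a minimal sketch of a keyword table built only from split tokens. It does not call jieba: `split_like_segmenter` is a hypothetical stand-in that breaks on every non-alphanumeric character, which reproduces the kind of split described above (the real tokeniser's output may differ in detail), and `build_keyword_table` is likewise an illustrative name rather than anything in the Dify codebase.

```python
import re
from collections import defaultdict


def split_like_segmenter(text: str) -> list[str]:
    """Hypothetical stand-in for the segmenter: splits on '-' and '_'."""
    return re.findall(r"[a-zA-Z0-9]+", text.lower())


def build_keyword_table(segments: dict[str, str]) -> dict[str, set[str]]:
    """Map each extracted keyword to the ids of the segments containing it."""
    table: dict[str, set[str]] = defaultdict(set)
    for segment_id, text in segments.items():
        for token in split_like_segmenter(text):
            table[token].add(segment_id)
    return table


segments = {"seg-1": "Ticket st-771 covers model_function_description."}
table = build_keyword_table(segments)

print(sorted(table))        # only the loose parts are indexed
print(table.get("st-771"))  # None: the compound was never a key
```

Because the lookup is an exact match on table keys, a query for `st-771` finds nothing even though the segment contains it.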

Three-layer fix in JiebaKeywordTableHandler:

1. Raw-text scan: after TF-IDF runs, scan the original text with r"[a-zA-Z0-9]+(?:[_-][a-zA-Z0-9]+)+" to recover all hyphen- and underscore-joined terms directly, bypassing jieba's segmentation entirely.

2. Subtoken suppression: remove the loose parts (st, 771, function) from the keyword set when the whole compound term is already present, preventing index pollution with overly generic single-token matches.

3. Expand-tokens regex: change r"\w+" to r"\w+(?:-\w+)*" in _expand_tokens_with_subtokens so the expansion step never re-splits a hyphenated term that survived into the keyword set.
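
The first two layers can be sketched roughly as follows. This is not the PR's code: `recover_compounds` is a hypothetical helper, the lower-casing is an assumption, and the real suppression logic may be more selective about which subtokens it drops. Only the regular expression is taken from the description above.

```python
import re

# Pattern quoted from the PR description: two or more ASCII alphanumeric
# runs joined by '-' or '_'.
COMPOUND_RE = re.compile(r"[a-zA-Z0-9]+(?:[_-][a-zA-Z0-9]+)+")


def recover_compounds(text: str, keywords: set[str]) -> set[str]:
    """Hypothetical helper: add compounds from raw text, drop their parts."""
    compounds = {m.lower() for m in COMPOUND_RE.findall(text)}
    subtokens: set[str] = set()
    for compound in compounds:
        subtokens.update(re.split(r"[_-]", compound))
    # Keep anything that is not a loose part of a recovered compound.
    return (keywords - subtokens) | compounds


tfidf_keywords = {"st", "771", "function", "description", "ticket"}
text = "Ticket st-771 covers model_function_description."
print(sorted(recover_compounds(text, tfidf_keywords)))
# ['model_function_description', 'st-771', 'ticket']
```

Scanning the raw text rather than the token stream is what lets the compound survive, since the segmenter has already discarded the joiners by the time TF-IDF scores the tokens.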

Fixes #33621, fixes #28237, related to #11908

Screenshots

Before:
before_fix

After:
after_fix

Checklist

  • This change requires a documentation update, included: Dify Document
  • I understand that this PR may be closed in case there was no previous discussion or issues. (This doesn't apply to typos!)
  • I've added a test for each change that was introduced, and I tried as much as possible to make a single atomic change.
  • I've updated the documentation accordingly.
  • I ran make lint and make type-check (backend) and cd web && npx lint-staged (frontend) to appease the lint gods

Copilot AI review requested due to automatic review settings March 18, 2026 21:51
@dosubot dosubot bot added the size:L label (This PR changes 100-499 lines, ignoring generated files.) Mar 18, 2026
@gemini-code-assist
Contributor

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request addresses a critical bug in the Economy-mode keyword extraction, specifically for compound identifiers containing hyphens or underscores. Previously, Jieba's tokenization would split these identifiers, making them unsearchable as whole units. The changes introduce a robust, multi-layered fix to ensure that such technical terms are correctly indexed and retrievable, significantly improving the accuracy of keyword-based search for these types of identifiers.

Highlights

  • Raw-text scan for compound identifiers: Implemented a raw-text scan using a regular expression to directly recover hyphenated and underscore-joined terms, bypassing Jieba's default segmentation.
  • Subtoken suppression: Added logic to remove individual sub-tokens (e.g., 'st', '771') from the keyword set when the complete compound term (e.g., 'st-771') is already present, preventing index pollution.
  • Expanded tokens regex update: Modified the regex in _expand_tokens_with_subtokens from r"\w+" to r"\w+(?:-\w+)*" to ensure that hyphenated terms are treated as atomic units and are not re-split during keyword expansion.
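
The effect of the expansion-regex change can be checked in isolation with Python's standard `re` module. The constant names below are illustrative only; the two patterns are the ones quoted in the highlight above.

```python
import re

OLD_TOKEN_RE = re.compile(r"\w+")
NEW_TOKEN_RE = re.compile(r"\w+(?:-\w+)*")

for term in ("st-771", "model_function_description"):
    print(term, OLD_TOKEN_RE.findall(term), NEW_TOKEN_RE.findall(term))
# st-771 ['st', '771'] ['st-771']
# model_function_description ['model_function_description'] ['model_function_description']
```

Since `\w` already matches the underscore, only the hyphen needs an explicit branch in this pattern; underscore-joined identifiers were never split at this particular step.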


Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request fixes a bug where compound identifiers with hyphens or underscores were not correctly indexed, by implementing a three-part fix: adjusting regex for tokenization, explicitly extracting compound terms, and suppressing their subtokens. The changes are logical and well-tested. I've found a couple of areas for improvement: one is a performance issue related to stopword checking, and the other is a potential regression in Unicode support in one of the new regular expressions, which is critical for a library like Jieba.
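
The visible thread does not say which expression the reviewer had in mind; the ASCII-only character class in the raw-text scan is one plausible candidate. The sketch below compares that pattern with a hypothetical Unicode-aware variant, which is an assumption for illustration and not necessarily what the PR ended up using.

```python
import re

# Pattern quoted from the PR description (ASCII letters and digits only).
ASCII_COMPOUND_RE = re.compile(r"[a-zA-Z0-9]+(?:[_-][a-zA-Z0-9]+)+")
# Hypothetical Unicode-aware variant: [^\W_] is "any word character except '_'".
UNICODE_COMPOUND_RE = re.compile(r"[^\W_]+(?:[_-][^\W_]+)+")

text = "型号-771 and st-771"
print(ASCII_COMPOUND_RE.findall(text))    # ['st-771']
print(UNICODE_COMPOUND_RE.findall(text))  # ['型号-771', 'st-771']
```

In Python 3, `\w` on `str` patterns matches Unicode word characters by default, so a class built from `\w` picks up CJK-prefixed identifiers that the ASCII class skips.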

Comment thread api/core/rag/datasource/keyword/jieba/jieba_keyword_table_handler.py Outdated
Comment thread api/core/rag/datasource/keyword/jieba/jieba_keyword_table_handler.py Outdated
Contributor

Copilot AI left a comment


Pull request overview

Fixes keyword-based (Economy-mode) indexing/recall for hyphenated/underscore compound identifiers by preserving/recovering them as atomic keywords throughout extraction and query expansion.

Changes:

  • Update fallback tokenization and token expansion regex to keep hyphenated identifiers intact.
  • Enhance extract_keywords() to recover _/- joined compound terms from raw text and suppress their split subtokens.
  • Add unit tests covering hyphenated/underscore identifier extraction and expansion behavior.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| api/core/rag/datasource/keyword/jieba/jieba_keyword_table_handler.py | Preserve hyphenated tokens in fallback/expansion and add raw-text compound-term recovery + subtoken suppression. |
| api/tests/unit_tests/core/rag/datasource/keyword/test_jieba_keyword_table_handler.py | New unit tests validating hyphen/underscore identifier behavior across extraction and expansion. |


@mkumm14 mkumm14 force-pushed the fix/jieba-hyphenated-keyword-splitting branch from 2ae6d1c to 5b36802 Compare March 18, 2026 22:12
Copilot AI review requested due to automatic review settings March 18, 2026 22:15
@mkumm14 mkumm14 force-pushed the fix/jieba-hyphenated-keyword-splitting branch from 5b36802 to 650de2a Compare March 18, 2026 22:15
Contributor

Copilot AI left a comment


Pull request overview

Fixes Economy-mode keyword indexing/retrieval for hyphenated and underscore compound identifiers (e.g., st-771, model_function_description) by preserving compounds through TF-IDF fallback tokenization, raw-text compound recovery, and query/index token expansion.

Changes:

  • Update fallback tokenization and query expansion regex to keep hyphenated identifiers intact.
  • Post-process TF-IDF keywords by recovering _/- compound terms directly from raw text and suppressing their split subtokens.
  • Add unit tests covering hyphen/underscore compound preservation and recovery behavior.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| api/core/rag/datasource/keyword/jieba/jieba_keyword_table_handler.py | Adjusts tokenization/expansion regex and adds raw-text compound recovery + subtoken suppression to improve exact compound recall. |
| api/tests/unit_tests/core/rag/datasource/keyword/test_jieba_keyword_table_handler.py | Adds tests to validate compound preservation in expansion and compound recovery in extraction. |


Comment thread api/core/rag/datasource/keyword/jieba/jieba_keyword_table_handler.py Outdated
@mkumm14 mkumm14 force-pushed the fix/jieba-hyphenated-keyword-splitting branch from 650de2a to 413d7fb Compare March 18, 2026 22:38
@mkumm14 mkumm14 requested a review from Copilot March 18, 2026 22:38
Contributor

Copilot AI left a comment


Pull request overview

Fixes Economy-mode keyword indexing/search for compound identifiers containing hyphens/underscores (e.g. st-771, model_function_description) by preserving them as atomic keywords during extraction and query expansion.

Changes:

  • Update fallback tokenization and token expansion regexes to keep hyphenated identifiers intact.
  • Extend extract_keywords() with a raw-text regex pass to add hyphen/underscore compound terms into the keyword set (plus an attempted subtoken cleanup step).
  • Add unit tests covering hyphen/underscore handling and query-expansion behavior.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| api/core/rag/datasource/keyword/jieba/jieba_keyword_table_handler.py | Preserves hyphenated tokens during fallback/expansion; adds raw-text compound term recovery during keyword extraction. |
| api/tests/unit_tests/core/rag/datasource/keyword/test_jieba_keyword_table_handler.py | New tests for hyphen/underscore keyword handling and expansion behavior. |


@mkumm14 mkumm14 force-pushed the fix/jieba-hyphenated-keyword-splitting branch from 413d7fb to 73662cc Compare March 18, 2026 22:50
…eba keyword extraction

Jieba's tokeniser splits compound identifiers like 'st-771' (hyphen) and
'model_function_description' (underscore) into separate tokens before
TF-IDF scoring. This caused Economy-mode keyword search to fail for any
technical identifier containing these characters — the terms were never
indexed as whole units and could not be recalled by exact keyword match.

Three-layer fix in JiebaKeywordTableHandler:

1. Raw-text scan in extract_keywords: after TF-IDF runs, scan the
   original text with r"[a-zA-Z0-9]+(?:[_-][a-zA-Z0-9]+)+" to recover
   all hyphen- and underscore-joined terms, bypassing jieba's segmentation.

2. Subtoken suppression: remove loose parts ('st', '771', 'function')
   from the keyword set when the whole compound term is already present.

3. Expand-tokens regex: change r"\w+" to r"\w+(?:-\w+)*" in
   _expand_tokens_with_subtokens so expansion never re-splits a hyphenated
   term that survived into the keyword set.

Fixes langgenius#33621, fixes langgenius#28237, related to langgenius#11908
Copilot AI review requested due to automatic review settings March 18, 2026 23:02
@mkumm14 mkumm14 force-pushed the fix/jieba-hyphenated-keyword-splitting branch from 73662cc to 54efab3 Compare March 18, 2026 23:02
Contributor

Copilot AI left a comment


Pull request overview

Fixes Economy-mode keyword indexing/search for compound identifiers containing hyphens/underscores (e.g., st-771, model_function_description) by preventing unintended token splitting and explicitly recovering these terms from raw text so exact keyword recall works.

Changes:

  • Update regex tokenization to keep hyphenated tokens intact (\w+(?:-\w+)*) in both fallback TF-IDF and query/token expansion.
  • Add raw-text regex scan in extract_keywords() to recover hyphen/underscore-joined identifiers into the keyword set.
  • Add unit tests covering hyphen/underscore handling and expansion behavior.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| api/core/rag/datasource/keyword/jieba/jieba_keyword_table_handler.py | Adjusts tokenization/expansion and adds raw-text recovery for compound identifiers during keyword extraction. |
| api/tests/unit_tests/core/rag/datasource/keyword/test_jieba_keyword_table_handler.py | Adds regression/unit tests ensuring hyphenated and underscore identifiers remain searchable and aren’t re-split during expansion. |


crazywoola previously approved these changes Mar 20, 2026
@dosubot dosubot bot added the lgtm label (This PR has been approved by a maintainer) Mar 20, 2026
@github-actions
Contributor

Pyrefly Diff

base → PR
--- /tmp/pyrefly_base.txt	2026-03-20 02:25:43.289769070 +0000
+++ /tmp/pyrefly_pr.txt	2026-03-20 02:25:34.043736758 +0000
@@ -115,11 +115,11 @@
 ERROR Pyrefly detected conflicting types while breaking a dependency cycle: `str | None` is not assignable to `None`. Adding explicit type annotations might possibly help. [bad-assignment]
   --> core/prompt/utils/extract_thread_messages.py:11:5
 ERROR Object of class `NoneType` has no attribute `data_source_type` [missing-attribute]
-   --> core/rag/datasource/keyword/jieba/jieba.py:143:36
+   --> core/rag/datasource/keyword/jieba/jieba.py:137:36
 ERROR Object of class `NoneType` has no attribute `keyword_table` [missing-attribute]
-   --> core/rag/datasource/keyword/jieba/jieba.py:145:13
+   --> core/rag/datasource/keyword/jieba/jieba.py:139:13
 ERROR Cannot index into `set[Any]` [bad-index]
-   --> core/rag/datasource/keyword/jieba/jieba.py:158:29
+   --> core/rag/datasource/keyword/jieba/jieba.py:152:29
 ERROR No matching overload found for function `list.__init__` called with arguments: (object) [no-matching-overload]
   --> core/rag/datasource/keyword/jieba/jieba_keyword_table_handler.py:88:34
 ERROR Class member `AnalyticdbVector.add_texts` overrides parent class `BaseVector` in an inconsistent manner [bad-override]
@@ -399,50 +399,50 @@
 ERROR Argument `dict[str, bytes | str]` is not assignable to parameter `headers` with type `Headers | Mapping[bytes, bytes] | Mapping[str, str] | Sequence[tuple[bytes, bytes]] | Sequence[tuple[str, str]] | None` in function `httpx._api.request` [bad-argument-type]
    --> core/rag/extractor/notion_extractor.py:368:21
 ERROR Cannot index into `Generator[Unknown]` [bad-index]
-   --> core/rag/extractor/watercrawl/client.py:210:54
+   --> core/rag/extractor/watercrawl/client.py:188:54
 ERROR Cannot index into `bytes` [bad-index]
-   --> core/rag/extractor/watercrawl/client.py:210:54
+   --> core/rag/extractor/watercrawl/client.py:188:54
 ERROR Cannot index into `list[Unknown]` [bad-index]
-   --> core/rag/extractor/watercrawl/client.py:210:54
+   --> core/rag/extractor/watercrawl/client.py:188:54
 ERROR `None` is not subscriptable [unsupported-operation]
-   --> core/rag/extractor/watercrawl/client.py:210:54
+   --> core/rag/extractor/watercrawl/client.py:188:54
 ERROR Object of class `Generator` has no attribute `get`
 ERROR Cannot index into `Generator[Unknown]` [bad-index]
-  --> core/rag/extractor/watercrawl/provider.py:70:12
+  --> core/rag/extractor/watercrawl/provider.py:47:12
 ERROR Cannot index into `bytes` [bad-index]
-  --> core/rag/extractor/watercrawl/provider.py:70:12
+  --> core/rag/extractor/watercrawl/provider.py:47:12
 ERROR Cannot index into `list[Unknown]` [bad-index]
-  --> core/rag/extractor/watercrawl/provider.py:70:12
+  --> core/rag/extractor/watercrawl/provider.py:47:12
 ERROR `None` is not subscriptable [unsupported-operation]
-  --> core/rag/extractor/watercrawl/provider.py:70:12
+  --> core/rag/extractor/watercrawl/provider.py:47:12
 ERROR Object of class `Generator` has no attribute `get`
 ERROR Object of class `Generator` has no attribute `get`
 ERROR Object of class `Generator` has no attribute `get`
 ERROR Object of class `Generator` has no attribute `get`
-ERROR Argument `Generator[Unknown] | bytes | dict[Unknown, Unknown] | list[Unknown] | Unknown | None` is not assignable to parameter `result_object` with type `dict[str, Any]` in function `WaterCrawlProvider._structure_data` [bad-argument-type]
-   --> core/rag/extractor/watercrawl/provider.py:110:37
+ERROR Argument `Generator[Unknown] | bytes | dict[Unknown, Unknown] | list[Unknown] | Unknown | None` is not assignable to parameter `result_object` with type `dict[Unknown, Unknown]` in function `WaterCrawlProvider._structure_data` [bad-argument-type]
+  --> core/rag/extractor/watercrawl/provider.py:87:37
 ERROR Cannot index into `Generator[Unknown]` [bad-index]
-   --> core/rag/extractor/watercrawl/provider.py:135:20
+   --> core/rag/extractor/watercrawl/provider.py:110:20
 ERROR Cannot index into `bytes` [bad-index]
-   --> core/rag/extractor/watercrawl/provider.py:135:20
+   --> core/rag/extractor/watercrawl/provider.py:110:20
 ERROR Cannot index into `list[Unknown]` [bad-index]
-   --> core/rag/extractor/watercrawl/provider.py:135:20
+   --> core/rag/extractor/watercrawl/provider.py:110:20
 ERROR `None` is not subscriptable [unsupported-operation]
-   --> core/rag/extractor/watercrawl/provider.py:135:20
+   --> core/rag/extractor/watercrawl/provider.py:110:20
 ERROR Cannot index into `Generator[Unknown]` [bad-index]
-   --> core/rag/extractor/watercrawl/provider.py:141:16
+   --> core/rag/extractor/watercrawl/provider.py:116:16
 ERROR Cannot index into `bytes` [bad-index]
-   --> core/rag/extractor/watercrawl/provider.py:141:16
+   --> core/rag/extractor/watercrawl/provider.py:116:16
 ERROR Cannot index into `list[Unknown]` [bad-index]
-   --> core/rag/extractor/watercrawl/provider.py:141:16
+   --> core/rag/extractor/watercrawl/provider.py:116:16
 ERROR `None` is not subscriptable [unsupported-operation]
-   --> core/rag/extractor/watercrawl/provider.py:141:16
+   --> core/rag/extractor/watercrawl/provider.py:116:16
 ERROR Pyrefly detected conflicting types while breaking a dependency cycle: `str | Any | None` is not assignable to `None`. Adding explicit type annotations might possibly help. [bad-assignment]
-   --> core/rag/extractor/word_extractor.py:372:13
+   --> core/rag/extractor/word_extractor.py:371:13
 ERROR Pyrefly detected conflicting types while breaking a dependency cycle: `str | Any | None` is not assignable to `None`. Adding explicit type annotations might possibly help. [bad-assignment]
-   --> core/rag/extractor/word_extractor.py:393:25
+   --> core/rag/extractor/word_extractor.py:392:25
 ERROR Object of class `BaseOxmlElement` has no attribute `body` [missing-attribute]
-   --> core/rag/extractor/word_extractor.py:426:24
+   --> core/rag/extractor/word_extractor.py:425:24
 ERROR Object of class `Document` has no attribute `score` [missing-attribute]
    --> core/rag/index_processor/processor/paragraph_index_processor.py:194:33
 ERROR Object of class `Document` has no attribute `score` [missing-attribute]
@@ -586,7 +586,7 @@
 ERROR Argument `Literal['normal']` is not assignable to parameter `value` with type `AppStatus | SQLCoreOperations[AppStatus]` in function `sqlalchemy.orm.base.Mapped.__set__` [bad-argument-type]
   --> tests/integration_tests/controllers/console/app/test_chat_message_permissions.py:30:22
 ERROR Argument `Literal['normal']` is not assignable to parameter `value` with type `AppStatus | SQLCoreOperations[AppStatus]` in function `sqlalchemy.orm.base.Mapped.__set__` [bad-argument-type]
-  --> tests/integration_tests/controllers/console/app/test_feedback_export_api.py:32:22
+  --> tests/integration_tests/controllers/console/app/test_feedback_export_api.py:31:22
 ERROR Argument `Literal['normal']` is not assignable to parameter `value` with type `AppStatus | SQLCoreOperations[AppStatus]` in function `sqlalchemy.orm.base.Mapped.__set__` [bad-argument-type]
   --> tests/integration_tests/controllers/console/app/test_model_config_permissions.py:28:22
 ERROR Attribute `current_tenant_id` of class `Account` is a read-only property and cannot be set [read-only]
@@ -606,15 +606,15 @@
 ERROR `unpatch` may be uninitialized [unbound-name]
   --> tests/integration_tests/plugin/__mock/http.py:62:9
 ERROR Object of class `TestWorkflowDraftVariableServiceResetVariable` has no attribute `_test_user_id` [missing-attribute]
-   --> tests/integration_tests/services/test_workflow_draft_variable_service.py:610:21
+   --> tests/integration_tests/services/test_workflow_draft_variable_service.py:609:21
 ERROR Object of class `TestWorkflowDraftVariableServiceResetVariable` has no attribute `_test_user_id` [missing-attribute]
-   --> tests/integration_tests/services/test_workflow_draft_variable_service.py:620:21
+   --> tests/integration_tests/services/test_workflow_draft_variable_service.py:619:21
 ERROR Object of class `TestWorkflowDraftVariableServiceResetVariable` has no attribute `_test_user_id` [missing-attribute]
-   --> tests/integration_tests/services/test_workflow_draft_variable_service.py:631:21
+   --> tests/integration_tests/services/test_workflow_draft_variable_service.py:630:21
 ERROR Object of class `TestWorkflowDraftVariableServiceResetVariable` has no attribute `_test_user_id` [missing-attribute]
-   --> tests/integration_tests/services/test_workflow_draft_variable_service.py:640:21
+   --> tests/integration_tests/services/test_workflow_draft_variable_service.py:639:21
 ERROR Object of class `TestWorkflowDraftVariableServiceResetVariable` has no attribute `_test_user_id` [missing-attribute]
-   --> tests/integration_tests/services/test_workflow_draft_variable_service.py:806:21
+   --> tests/integration_tests/services/test_workflow_draft_variable_service.py:805:21
 ERROR Argument `Index | None` is not assignable to parameter `index` with type `Index` in function `tcvectordb.rpc.model.collection.RPCCollection.__init__` [bad-argument-type]
   --> tests/integration_tests/vdb/__mock/tcvectordb.py:90:13
 ERROR Argument `xinference_client.types.Embedding | None` is not assignable to parameter `embedding` with type `tcvectordb.model.collection_view.Embedding` in function `tcvectordb.rpc.model.collection.RPCCollection.__init__` [bad-argument-type]
@@ -940,31 +940,31 @@
 ERROR Argument `Literal['archive']` is not assignable to parameter `value` with type `SQLCoreOperations[TenantStatus] | TenantStatus` in function `sqlalchemy.orm.base.Mapped.__set__` [bad-argument-type]
     --> tests/test_containers_integration_tests/services/test_account_service.py:3334:25
 ERROR Object of class `NoneType` has no attribute `id` [missing-attribute]
-   --> tests/test_containers_integration_tests/services/test_agent_service.py:134:38
+   --> tests/test_containers_integration_tests/services/test_agent_service.py:133:38
 ERROR `str` is not assignable to attribute `agent_mode` with type `Never` [bad-assignment]
-   --> tests/test_containers_integration_tests/services/test_agent_service.py:138:47
+   --> tests/test_containers_integration_tests/services/test_agent_service.py:137:47
 ERROR Argument `Literal['account']` is not assignable to parameter `created_by_role` with type `CreatorUserRole | SQLCoreOperations[CreatorUserRole]` in function `models.model.MessageAgentThought.__init__` [bad-argument-type]
-   --> tests/test_containers_integration_tests/services/test_agent_service.py:249:29
+   --> tests/test_containers_integration_tests/services/test_agent_service.py:248:29
 ERROR Argument `Literal['account']` is not assignable to parameter `created_by_role` with type `CreatorUserRole | SQLCoreOperations[CreatorUserRole]` in function `models.model.MessageAgentThought.__init__` [bad-argument-type]
-   --> tests/test_containers_integration_tests/services/test_agent_service.py:275:29
+   --> tests/test_containers_integration_tests/services/test_agent_service.py:274:29
 ERROR Argument `Literal['account']` is not assignable to parameter `created_by_role` with type `CreatorUserRole | SQLCoreOperations[CreatorUserRole]` in function `models.model.MessageAgentThought.__init__` [bad-argument-type]
-   --> tests/test_containers_integration_tests/services/test_agent_service.py:564:29
+   --> tests/test_containers_integration_tests/services/test_agent_service.py:563:29
 ERROR Argument `str | None` is not assignable to parameter `created_by` with type `SQLCoreOperations[str] | str` in function `models.model.MessageAgentThought.__init__` [bad-argument-type]
-   --> tests/test_containers_integration_tests/services/test_agent_service.py:565:24
+   --> tests/test_containers_integration_tests/services/test_agent_service.py:564:24
 ERROR Argument `Literal['account']` is not assignable to parameter `created_by_role` with type `CreatorUserRole | SQLCoreOperations[CreatorUserRole]` in function `models.model.MessageAgentThought.__init__` [bad-argument-type]
-   --> tests/test_containers_integration_tests/services/test_agent_service.py:799:29
+   --> tests/test_containers_integration_tests/services/test_agent_service.py:798:29
 ERROR Argument `str | None` is not assignable to parameter `created_by` with type `SQLCoreOperations[str] | str` in function `models.model.MessageAgentThought.__init__` [bad-argument-type]
-   --> tests/test_containers_integration_tests/services/test_agent_service.py:800:24
+   --> tests/test_containers_integration_tests/services/test_agent_service.py:799:24
 ERROR Argument `Literal['account']` is not assignable to parameter `created_by_role` with type `CreatorUserRole | SQLCoreOperations[CreatorUserRole]` in function `models.model.MessageAgentThought.__init__` [bad-argument-type]
-   --> tests/test_containers_integration_tests/services/test_agent_service.py:894:29
+   --> tests/test_containers_integration_tests/services/test_agent_service.py:893:29
 ERROR Argument `Literal['account']` is not assignable to parameter `created_by_role` with type `CreatorUserRole | SQLCoreOperations[CreatorUserRole]` in function `models.model.MessageAgentThought.__init__` [bad-argument-type]
-   --> tests/test_containers_integration_tests/services/test_agent_service.py:963:29
+   --> tests/test_containers_integration_tests/services/test_agent_service.py:962:29
 ERROR Argument `str | None` is not assignable to parameter `created_by` with type `SQLCoreOperations[str] | str` in function `models.model.MessageAgentThought.__init__` [bad-argument-type]
-   --> tests/test_containers_integration_tests/services/test_agent_service.py:964:24
+   --> tests/test_containers_integration_tests/services/test_agent_service.py:963:24
 ERROR Argument `Literal['account']` is not assignable to parameter `created_by_role` with type `CreatorUserRole | SQLCoreOperations[CreatorUserRole]` in function `models.model.MessageAgentThought.__init__` [bad-argument-type]
-    --> tests/test_containers_integration_tests/services/test_agent_service.py:1004:29
+    --> tests/test_containers_integration_tests/services/test_agent_service.py:1003:29
 ERROR Argument `str | None` is not assignable to parameter `created_by` with type `SQLCoreOperations[str] | str` in function `models.model.MessageAgentThought.__init__` [bad-argument-type]
-    --> tests/test_containers_integration_tests/services/test_agent_service.py:1005:24
+    --> tests/test_containers_integration_tests/services/test_agent_service.py:1004:24
 ERROR Object of class `NoneType` has no attribute `id` [missing-attribute]
    --> tests/test_containers_integration_tests/services/test_annotation_service.py:101:38
 ERROR Object of class `NoneType` has no attribute `id` [missing-attribute]
@@ -1162,19 +1162,19 @@
 ERROR Argument `Literal['owner']` is not assignable to parameter `role` with type `SQLCoreOperations[TenantAccountRole] | TenantAccountRole` in function `models.account.TenantAccountJoin.__init__` [bad-argument-type]
   --> tests/test_containers_integration_tests/services/test_end_user_service.py:37:18
 ERROR Argument `Literal['active']` is not assignable to parameter `status` with type `AccountStatus | SQLCoreOperations[AccountStatus]` in function `models.account.Account.__init__` [bad-argument-type]
-  --> tests/test_containers_integration_tests/services/test_file_service.py:69:20
+  --> tests/test_containers_integration_tests/services/test_file_service.py:68:20
 ERROR Argument `Literal['normal']` is not assignable to parameter `status` with type `SQLCoreOperations[TenantStatus] | TenantStatus` in function `models.account.Tenant.__init__` [bad-argument-type]
-  --> tests/test_containers_integration_tests/services/test_file_service.py:78:20
+  --> tests/test_containers_integration_tests/services/test_file_service.py:77:20
 ERROR Argument `str` is not assignable to parameter `role` with type `SQLCoreOperations[TenantAccountRole] | TenantAccountRole` in function `models.account.TenantAccountJoin.__init__` [bad-argument-type]
   --> tests/test_containers_integration_tests/services/test_human_input_delivery_test.py:32:18
 ERROR Argument `Literal['active']` is not assignable to parameter `status` with type `AccountStatus | SQLCoreOperations[AccountStatus]` in function `models.account.Account.__init__` [bad-argument-type]
-  --> tests/test_containers_integration_tests/services/test_message_export_service.py:53:20
+  --> tests/test_containers_integration_tests/services/test_message_export_service.py:52:20
 ERROR Argument `Literal['normal']` is not assignable to parameter `status` with type `SQLCoreOperations[TenantStatus] | TenantStatus` in function `models.account.Tenant.__init__` [bad-argument-type]
-  --> tests/test_containers_integration_tests/services/test_message_export_service.py:58:63
+  --> tests/test_containers_integration_tests/services/test_message_export_service.py:57:63
 ERROR Object of class `NoneType` has no attribute `id` [missing-attribute]
-   --> tests/test_containers_integration_tests/services/test_message_service.py:119:38
+   --> tests/test_containers_integration_tests/services/test_message_service.py:118:38
 ERROR Object of class `NoneType` has no attribute `id` [missing-attribute]
-   --> tests/test_containers_integration_tests/services/test_message_service.py:122:81
+   --> tests/test_containers_integration_tests/services/test_message_service.py:121:81
 ERROR Argument `Literal['active']` is not assignable to parameter `status` with type `AccountStatus | SQLCoreOperations[AccountStatus]` in function `models.account.Account.__init__` [bad-argument-type]
    --> tests/test_containers_integration_tests/services/test_messages_clean_service.py:112:20
 ERROR Argument `Literal['normal']` is not assignable to parameter `status` with type `SQLCoreOperations[TenantStatus] | TenantStatus` in function `models.account.Tenant.__init__` [bad-argument-type]
@@ -1417,6 +1417,28 @@
   --> tests/test_containers_integration_tests/services/tools/test_mcp_tools_manage_service.py:63:20
 ERROR Argument `Literal['normal']` is not assignable to parameter `status` with type `SQLCoreOperations[TenantStatus] | TenantStatus` in function `models.account.Tenant.__init__` [bad-argument-type]
   --> tests/test_containers_integration_tests/services/tools/test_mcp_tools_manage_service.py:72:20
+ERROR Unexpected keyword argument `icon_dark` in function `models.tools.ApiToolProvider.__init__` [unexpected-keyword]
+  --> tests/test_containers_integration_tests/services/tools/test_tools_transform_service.py:51:17
+ERROR Unexpected keyword argument `credentials` in function `models.tools.ApiToolProvider.__init__` [unexpected-keyword]
+  --> tests/test_containers_integration_tests/services/tools/test_tools_transform_service.py:54:17
+ERROR Unexpected keyword argument `provider_type` in function `models.tools.ApiToolProvider.__init__` [unexpected-keyword]
+  --> tests/test_containers_integration_tests/services/tools/test_tools_transform_service.py:55:17
+ERROR Unexpected keyword argument `description` in function `models.tools.BuiltinToolProvider.__init__` [unexpected-keyword]
+  --> tests/test_containers_integration_tests/services/tools/test_tools_transform_service.py:60:17
+ERROR Unexpected keyword argument `icon` in function `models.tools.BuiltinToolProvider.__init__` [unexpected-keyword]
+  --> tests/test_containers_integration_tests/services/tools/test_tools_transform_service.py:61:17
+ERROR Unexpected keyword argument `icon_dark` in function `models.tools.BuiltinToolProvider.__init__` [unexpected-keyword]
+  --> tests/test_containers_integration_tests/services/tools/test_tools_transform_service.py:62:17
+ERROR Unexpected keyword argument `credentials` in function `models.tools.BuiltinToolProvider.__init__` [unexpected-keyword]
+  --> tests/test_containers_integration_tests/services/tools/test_tools_transform_service.py:66:17
+ERROR Unexpected keyword argument `icon_dark` in function `models.tools.WorkflowToolProvider.__init__` [unexpected-keyword]
+  --> tests/test_containers_integration_tests/services/tools/test_tools_transform_service.py:73:17
+ERROR Unexpected keyword argument `workflow_id` in function `models.tools.WorkflowToolProvider.__init__` [unexpected-keyword]
+  --> tests/test_containers_integration_tests/services/tools/test_tools_transform_service.py:76:17
+ERROR Unexpected keyword argument `description` in function `models.tools.MCPToolProvider.__init__` [unexpected-keyword]
+  --> tests/test_containers_integration_tests/services/tools/test_tools_transform_service.py:81:17
+ERROR Unexpected keyword argument `provider_icon` in function `models.tools.MCPToolProvider.__init__` [unexpected-keyword]
+  --> tests/test_containers_integration_tests/services/tools/test_tools_transform_service.py:82:17
 ERROR Object of class `NoneType` has no attribute `id` [missing-attribute]
    --> tests/test_containers_integration_tests/services/tools/test_workflow_tools_manage_service.py:109:38
 ERROR Object of class `NoneType` has no attribute `id` [missing-attribute]
@@ -1512,57 +1534,57 @@
 ERROR Argument `Literal['normal']` is not assignable to parameter `status` with type `SQLCoreOperations[TenantStatus] | TenantStatus` in function `models.account.Tenant.__init__` [bad-argument-type]
   --> tests/test_containers_integration_tests/tasks/test_add_document_to_index_task.py:62:20
 ERROR Argument `Literal['active']` is not assignable to parameter `status` with type `AccountStatus | SQLCoreOperations[AccountStatus]` in function `models.account.Account.__init__` [bad-argument-type]
-  --> tests/test_containers_integration_tests/tasks/test_batch_clean_document_task.py:71:20
+  --> tests/test_containers_integration_tests/tasks/test_batch_clean_document_task.py:70:20
 ERROR Argument `Literal['normal']` is not assignable to parameter `status` with type `SQLCoreOperations[TenantStatus] | TenantStatus` in function `models.account.Tenant.__init__` [bad-argument-type]
-  --> tests/test_containers_integration_tests/tasks/test_batch_clean_document_task.py:80:20
+  --> tests/test_containers_integration_tests/tasks/test_batch_clean_document_task.py:79:20
 ERROR Object of class `NoneType` has no attribute `id` [missing-attribute]
-   --> tests/test_containers_integration_tests/tasks/test_batch_clean_document_task.py:293:23
+   --> tests/test_containers_integration_tests/tasks/test_batch_clean_document_task.py:292:23
 ERROR Object of class `NoneType` has no attribute `id` [missing-attribute]
-   --> tests/test_containers_integration_tests/tasks/test_batch_clean_document_task.py:667:27
+   --> tests/test_containers_integration_tests/tasks/test_batch_clean_document_task.py:666:27
 ERROR Argument `Literal['active']` is not assignable to parameter `status` with type `AccountStatus | SQLCoreOperations[AccountStatus]` in function `models.account.Account.__init__` [bad-argument-type]
-  --> tests/test_containers_integration_tests/tasks/test_batch_create_segment_to_index_task.py:96:20
+  --> tests/test_containers_integration_tests/tasks/test_batch_create_segment_to_index_task.py:95:20
 ERROR Argument `Literal['normal']` is not assignable to parameter `status` with type `SQLCoreOperations[TenantStatus] | TenantStatus` in function `models.account.Tenant.__init__` [bad-argument-type]
-   --> tests/test_containers_integration_tests/tasks/test_batch_create_segment_to_index_task.py:105:20
+   --> tests/test_containers_integration_tests/tasks/test_batch_create_segment_to_index_task.py:104:20
 ERROR `>` is not supported between `None` and `Literal[0]` [unsupported-operation]
-   --> tests/test_containers_integration_tests/tasks/test_batch_create_segment_to_index_task.py:312:16
+   --> tests/test_containers_integration_tests/tasks/test_batch_create_segment_to_index_task.py:311:16
 ERROR `>` is not supported between `None` and `Literal[0]` [unsupported-operation]
-   --> tests/test_containers_integration_tests/tasks/test_batch_create_segment_to_index_task.py:704:16
+   --> tests/test_containers_integration_tests/tasks/test_batch_create_segment_to_index_task.py:703:16
 ERROR Argument `Literal['active']` is not assignable to parameter `status` with type `AccountStatus | SQLCoreOperations[AccountStatus]` in function `models.account.Account.__init__` [bad-argument-type]
-   --> tests/test_containers_integration_tests/tasks/test_clean_dataset_task.py:111:20
+   --> tests/test_containers_integration_tests/tasks/test_clean_dataset_task.py:110:20
 ERROR Argument `Literal['normal']` is not assignable to parameter `status` with type `SQLCoreOperations[TenantStatus] | TenantStatus` in function `models.account.Tenant.__init__` [bad-argument-type]
-   --> tests/test_containers_integration_tests/tasks/test_clean_dataset_task.py:121:20
+   --> tests/test_containers_integration_tests/tasks/test_clean_dataset_task.py:120:20
 ERROR Argument `str | None` is not assignable to parameter `indexing_technique` with type `str` in function `celery.app.task.Task.__call__` [bad-argument-type]
-   --> tests/test_containers_integration_tests/tasks/test_clean_dataset_task.py:296:32
+   --> tests/test_containers_integration_tests/tasks/test_clean_dataset_task.py:295:32
 ERROR Argument `str | None` is not assignable to parameter `doc_form` with type `str` in function `celery.app.task.Task.__call__` [bad-argument-type]
-   --> tests/test_containers_integration_tests/tasks/test_clean_dataset_task.py:299:22
+   --> tests/test_containers_integration_tests/tasks/test_clean_dataset_task.py:298:22
 ERROR Argument `str | None` is not assignable to parameter `indexing_technique` with type `str` in function `celery.app.task.Task.__call__` [bad-argument-type]
-   --> tests/test_containers_integration_tests/tasks/test_clean_dataset_task.py:408:32
+   --> tests/test_containers_integration_tests/tasks/test_clean_dataset_task.py:407:32
 ERROR Argument `str | None` is not assignable to parameter `doc_form` with type `str` in function `celery.app.task.Task.__call__` [bad-argument-type]
-   --> tests/test_containers_integration_tests/tasks/test_clean_dataset_task.py:411:22
+   --> tests/test_containers_integration_tests/tasks/test_clean_dataset_task.py:410:22
 ERROR Argument `str | None` is not assignable to parameter `indexing_technique` with type `str` in function `celery.app.task.Task.__call__` [bad-argument-type]
-   --> tests/test_containers_integration_tests/tasks/test_clean_dataset_task.py:476:36
+   --> tests/test_containers_integration_tests/tasks/test_clean_dataset_task.py:475:36
 ERROR Argument `str | None` is not assignable to parameter `doc_form` with type `str` in function `celery.app.task.Task.__call__` [bad-argument-type]
-   --> tests/test_containers_integration_tests/tasks/test_clean_dataset_task.py:479:26
+   --> tests/test_containers_integration_tests/tasks/test_clean_dataset_task.py:478:26
 ERROR Argument `str | None` is not assignable to parameter `indexing_technique` with type `str` in function `celery.app.task.Task.__call__` [bad-argument-type]
-   --> tests/test_containers_integration_tests/tasks/test_clean_dataset_task.py:531:32
+   --> tests/test_containers_integration_tests/tasks/test_clean_dataset_task.py:530:32
 ERROR Argument `str | None` is not assignable to parameter `doc_form` with type `str` in function `celery.app.task.Task.__call__` [bad-argument-type]
-   --> tests/test_containers_integration_tests/tasks/test_clean_dataset_task.py:534:22
+   --> tests/test_containers_integration_tests/tasks/test_clean_dataset_task.py:533:22
 ERROR Argument `str | None` is not assignable to parameter `indexing_technique` with type `str` in function `celery.app.task.Task.__call__` [bad-argument-type]
-   --> tests/test_containers_integration_tests/tasks/test_clean_dataset_task.py:616:36
+   --> tests/test_containers_integration_tests/tasks/test_clean_dataset_task.py:615:36
 ERROR Argument `str | None` is not assignable to parameter `doc_form` with type `str` in function `celery.app.task.Task.__call__` [bad-argument-type]
-   --> tests/test_containers_integration_tests/tasks/test_clean_dataset_task.py:619:26
+   --> tests/test_containers_integration_tests/tasks/test_clean_dataset_task.py:618:26
 ERROR Argument `str | None` is not assignable to parameter `indexing_technique` with type `str` in function `celery.app.task.Task.__call__` [bad-argument-type]
-   --> tests/test_containers_integration_tests/tasks/test_clean_dataset_task.py:729:32
+   --> tests/test_containers_integration_tests/tasks/test_clean_dataset_task.py:728:32
 ERROR Argument `str | None` is not assignable to parameter `doc_form` with type `str` in function `celery.app.task.Task.__call__` [bad-argument-type]
-   --> tests/test_containers_integration_tests/tasks/test_clean_dataset_task.py:732:22
+   --> tests/test_containers_integration_tests/tasks/test_clean_dataset_task.py:731:22
 ERROR Argument `str | None` is not assignable to parameter `indexing_technique` with type `str` in function `celery.app.task.Task.__call__` [bad-argument-type]
-   --> tests/test_containers_integration_tests/tasks/test_clean_dataset_task.py:815:32
+   --> tests/test_containers_integration_tests/tasks/test_clean_dataset_task.py:814:32
 ERROR Argument `str | None` is not assignable to parameter `doc_form` with type `str` in function `celery.app.task.Task.__call__` [bad-argument-type]
-   --> tests/test_containers_integration_tests/tasks/test_clean_dataset_task.py:818:22
+   --> tests/test_containers_integration_tests/tasks/test_clean_dataset_task.py:817:22
 ERROR Argument `str | None` is not assignable to parameter `indexing_technique` with type `str` in function `celery.app.task.Task.__call__` [bad-argument-type]
-   --> tests/test_containers_integration_tests/tasks/test_clean_dataset_task.py:970:32
+   --> tests/test_containers_integration_tests/tasks/test_clean_dataset_task.py:969:32
 ERROR Argument `str | None` is not assignable to parameter `doc_form` with type `str` in function `celery.app.task.Task.__call__` [bad-argument-type]
-   --> tests/test_containers_integration_tests/tasks/test_clean_dataset_task.py:973:22
+   --> tests/test_containers_integration_tests/tasks/test_clean_dataset_task.py:972:22
 ERROR Object of class `NoneType` has no attribute `id` [missing-attribute]
   --> tests/test_containers_integration_tests/tasks/test_clean_notion_document_task.py:89:23
 ERROR Object of class `NoneType` has no attribute `id` [missing-attribute]
@@ -1944,7 +1966,7 @@
 ERROR Object of class `tuple` has no attribute `json` [missing-attribute]
    --> tests/unit_tests/controllers/console/auth/test_token_refresh.py:192:16
 ERROR Object of class `BadRequest` has no attribute `data` [missing-attribute]
-    --> tests/unit_tests/controllers/console/datasets/test_datasets.py:1635:16
+    --> tests/unit_tests/controllers/console/datasets/test_datasets.py:1634:16
 ERROR Module `services.errors.chunk` exists, but was not imported explicitly. You are relying on other modules to load it. [implicit-import]
    --> tests/unit_tests/controllers/console/datasets/test_datasets_segments.py:891:29
 ERROR Module `services.errors.chunk` exists, but was not imported explicitly. You are relying on other modules to load it. [implicit-import]
@@ -2076,29 +2098,29 @@
 ERROR `SimpleNamespace` is not assignable to attribute `request` with type `Request` [bad-assignment]
    --> tests/unit_tests/controllers/files/test_upload.py:170:26
 ERROR Missing argument `tenant_model` in function `controllers.inner_api.plugin.wraps.decorated_view` [missing-argument]
-   --> tests/unit_tests/controllers/inner_api/plugin/test_plugin_wraps.py:171:44
+   --> tests/unit_tests/controllers/inner_api/plugin/test_plugin_wraps.py:168:44
 ERROR Missing argument `user_model` in function `controllers.inner_api.plugin.wraps.decorated_view` [missing-argument]
-   --> tests/unit_tests/controllers/inner_api/plugin/test_plugin_wraps.py:171:44
+   --> tests/unit_tests/controllers/inner_api/plugin/test_plugin_wraps.py:168:44
 ERROR Missing argument `tenant_model` in function `controllers.inner_api.plugin.wraps.decorated_view` [missing-argument]
-   --> tests/unit_tests/controllers/inner_api/plugin/test_plugin_wraps.py:188:31
+   --> tests/unit_tests/controllers/inner_api/plugin/test_plugin_wraps.py:185:31
 ERROR Missing argument `user_model` in function `controllers.inner_api.plugin.wraps.decorated_view` [missing-argument]
-   --> tests/unit_tests/controllers/inner_api/plugin/test_plugin_wraps.py:188:31
+   --> tests/unit_tests/controllers/inner_api/plugin/test_plugin_wraps.py:185:31
 ERROR Missing argument `tenant_model` in function `controllers.inner_api.plugin.wraps.decorated_view` [missing-argument]
-   --> tests/unit_tests/controllers/inner_api/plugin/test_plugin_wraps.py:203:35
+   --> tests/unit_tests/controllers/inner_api/plugin/test_plugin_wraps.py:200:35
 ERROR Missing argument `user_model` in function `controllers.inner_api.plugin.wraps.decorated_view` [missing-argument]
-   --> tests/unit_tests/controllers/inner_api/plugin/test_plugin_wraps.py:203:35
+   --> tests/unit_tests/controllers/inner_api/plugin/test_plugin_wraps.py:200:35
 ERROR Missing argument `tenant_model` in function `controllers.inner_api.plugin.wraps.decorated_view` [missing-argument]
-   --> tests/unit_tests/controllers/inner_api/plugin/test_plugin_wraps.py:225:44
+   --> tests/unit_tests/controllers/inner_api/plugin/test_plugin_wraps.py:222:44
 ERROR Missing argument `user_model` in function `controllers.inner_api.plugin.wraps.decorated_view` [missing-argument]
-   --> tests/unit_tests/controllers/inner_api/plugin/test_plugin_wraps.py:225:44
+   --> tests/unit_tests/controllers/inner_api/plugin/test_plugin_wraps.py:222:44
 ERROR Argument `type[PluginTestPayload]` is not assignable to parameter `payload_type` with type `type[BaseModel]` in function `controllers.inner_api.plugin.wraps.plugin_data` [bad-argument-type]
-   --> tests/unit_tests/controllers/inner_api/plugin/test_plugin_wraps.py:253:35
+   --> tests/unit_tests/controllers/inner_api/plugin/test_plugin_wraps.py:250:35
 ERROR Argument `type[PluginTestPayload]` is not assignable to parameter `payload_type` with type `type[BaseModel]` in function `controllers.inner_api.plugin.wraps.plugin_data` [bad-argument-type]
-   --> tests/unit_tests/controllers/inner_api/plugin/test_plugin_wraps.py:268:35
+   --> tests/unit_tests/controllers/inner_api/plugin/test_plugin_wraps.py:265:35
 ERROR Argument `type[TestPluginData.test_should_raise_error_on_invalid_payload.InvalidPayload]` is not assignable to parameter `payload_type` with type `type[BaseModel]` in function `controllers.inner_api.plugin.wraps.plugin_data` [bad-argument-type]
-   --> tests/unit_tests/controllers/inner_api/plugin/test_plugin_wraps.py:286:35
+   --> tests/unit_tests/controllers/inner_api/plugin/test_plugin_wraps.py:283:35
 ERROR Argument `type[PluginTestPayload]` is not assignable to parameter `payload_type` with type `type[BaseModel]` in function `controllers.inner_api.plugin.wraps.plugin_data` [bad-argument-type]
-   --> tests/unit_tests/controllers/inner_api/plugin/test_plugin_wraps.py:299:35
+   --> tests/unit_tests/controllers/inner_api/plugin/test_plugin_wraps.py:296:35
 ERROR `SimpleNamespace` is not assignable to attribute `db` with type `SQLAlchemy` [bad-assignment]
   --> tests/unit_tests/controllers/mcp/test_mcp.py:19:17
 ERROR `SimpleNamespace` is not assignable to attribute `mcp_ns` with type `Namespace` [bad-assignment]
@@ -2122,17 +2144,17 @@
 ERROR Missing argument `session` in function `services.conversation_service.ConversationService.pagination_by_last_id` [missing-argument]
    --> tests/unit_tests/controllers/service_api/app/test_conversation.py:383:59
 ERROR Argument value `Literal[0]` violates Pydantic `ge` constraint `Literal[1]` for field `limit` [bad-argument-type]
-  --> tests/unit_tests/controllers/service_api/app/test_message.py:94:63
+  --> tests/unit_tests/controllers/service_api/app/test_message.py:93:63
 ERROR Argument value `Literal[101]` violates Pydantic `le` constraint `Literal[100]` for field `limit` [bad-argument-type]
-   --> tests/unit_tests/controllers/service_api/app/test_message.py:100:63
+  --> tests/unit_tests/controllers/service_api/app/test_message.py:99:63
 ERROR Argument `str | None` is not assignable to parameter `obj` with type `Sized` in function `len` [bad-argument-type]
-   --> tests/unit_tests/controllers/service_api/app/test_message.py:138:20
+   --> tests/unit_tests/controllers/service_api/app/test_message.py:137:20
 ERROR Argument value `Literal[0]` violates Pydantic `ge` constraint `Literal[1]` for field `page` [bad-argument-type]
-   --> tests/unit_tests/controllers/service_api/app/test_message.py:170:31
+   --> tests/unit_tests/controllers/service_api/app/test_message.py:169:31
 ERROR Argument value `Literal[0]` violates Pydantic `ge` constraint `Literal[1]` for field `limit` [bad-argument-type]
-   --> tests/unit_tests/controllers/service_api/app/test_message.py:183:31
+   --> tests/unit_tests/controllers/service_api/app/test_message.py:182:31
 ERROR Argument value `Literal[102]` violates Pydantic `le` constraint `Literal[101]` for field `limit` [bad-argument-type]
-   --> tests/unit_tests/controllers/service_api/app/test_message.py:188:31
+   --> tests/unit_tests/controllers/service_api/app/test_message.py:187:31
 ERROR Argument `list[dict[str, Any]] | None` is not assignable to parameter `obj` with type `Sized` in function `len` [bad-argument-type]
   --> tests/unit_tests/controllers/service_api/app/test_workflow.py:92:20
 ERROR Argument value `Literal[0]` violates Pydantic `ge` constraint `Literal[1]` for field `page` [bad-argument-type]
@@ -2802,7 +2824,7 @@
 ERROR Argument `SimpleNamespace` is not assignable to parameter `user` with type `Account | EndUser` in function `core.app.apps.common.workflow_response_converter.WorkflowResponseConverter.__init__` [bad-argument-type]
   --> tests/unit_tests/core/app/apps/common/test_workflow_response_converter_resumption.py:32:14
 ERROR Object of class `NoneType` has no attribute `data` [missing-attribute]
-   --> tests/unit_tests/core/app/apps/common/test_workflow_response_converter_truncation.py:588:16
+   --> tests/unit_tests/core/app/apps/common/test_workflow_response_converter_truncation.py:543:16
 ERROR Argument `dict[str, dict[str, str]]` is not assignable to parameter `override_config_dict` with type `AppModelConfigDict | None` in function `core.app.apps.completion.app_config_manager.CompletionAppConfigManager.get_app_config` [bad-argument-type]
   --> tests/unit_tests/core/app/apps/completion/test_completion_app_config_manager.py:29:34
 ERROR Object of class `FunctionType` has no attribute `assert_not_called` [missing-attribute]
@@ -3292,7 +3314,7 @@
 ERROR Object of class `BlobChunkMessage` has no attribute `text`
 ERROR Object of class `BlobChunkMessage` has no attribute `json_object`
 ERROR No matching overload found for function `list.__init__` called with arguments: (Generator[Unknown] | None) [no-matching-overload]
-   --> tests/unit_tests/core/datasource/test_datasource_file_manager.py:404:20
+   --> tests/unit_tests/core/datasource/test_datasource_file_manager.py:403:20
 ERROR Object of class `FunctionType` has no attribute `assert_called_once` [missing-attribute]
   --> tests/unit_tests/core/datasource/test_datasource_manager.py:52:5
 ERROR Argument `SimpleNamespace` is not assignable to parameter `datasource_type` with type `DatasourceProviderType` in function `core.datasource.datasource_manager.DatasourceManager.get_datasource_plugin_provider` [bad-argument-type]
@@ -3324,8 +3346,6 @@
    --> tests/unit_tests/core/datasource/test_datasource_manager.py:573:34
 ERROR Object of class `StreamChunkEvent` has no attribute `node_run_result` [missing-attribute]
    --> tests/unit_tests/core/datasource/test_datasource_manager.py:624:12
-ERROR `in` is not supported between `Literal['Single Page']` and `None` [not-iterable]
-   --> tests/unit_tests/core/datasource/test_website_crawl.py:989:16
 ERROR Argument `Iterator[DatasourceMessage]` is not assignable to parameter `messages` with type `Generator[DatasourceMessage]` in function `core.datasource.utils.message_transformer.DatasourceFileMessageTransformer.transform_datasource_invoke_messages` [bad-argument-type]
   --> tests/unit_tests/core/datasource/utils/test_message_transformer.py:28:26
 ERROR Object of class `BlobChunkMessage` has no attribute `text`
@@ -3956,6 +3976,20 @@
    --> tests/unit_tests/core/prompt/test_simple_prompt_transform.py:410:19
 ERROR Argument `list[SimpleNamespace]` is not assignable to parameter `context_files` with type `list[File] | None` in function `core.prompt.simple_prompt_transform.SimplePromptTransform._get_last_user_message` [bad-argument-type]
    --> tests/unit_tests/core/prompt/test_simple_prompt_transform.py:411:27
+ERROR Argument `None` is not assignable to parameter `process_rule` with type `dict[Unknown, Unknown]` in function `core.rag.cleaner.clean_processor.CleanProcessor.clean` [bad-argument-type]
+  --> tests/unit_tests/core/rag/cleaner/test_clean_processor.py:10:60
+ERROR Argument `None` is not assignable to parameter `process_rule` with type `dict[Unknown, Unknown]` in function `core.rag.cleaner.clean_processor.CleanProcessor.clean` [bad-argument-type]
+  --> tests/unit_tests/core/rag/cleaner/test_clean_processor.py:13:60
+ERROR Argument `None` is not assignable to parameter `process_rule` with type `dict[Unknown, Unknown]` in function `core.rag.cleaner.clean_processor.CleanProcessor.clean` [bad-argument-type]
+  --> tests/unit_tests/core/rag/cleaner/test_clean_processor.py:18:56
+ERROR Argument `None` is not assignable to parameter `process_rule` with type `dict[Unknown, Unknown]` in function `core.rag.cleaner.clean_processor.CleanProcessor.clean` [bad-argument-type]
+  --> tests/unit_tests/core/rag/cleaner/test_clean_processor.py:23:54
+ERROR Argument `None` is not assignable to parameter `process_rule` with type `dict[Unknown, Unknown]` in function `core.rag.cleaner.clean_processor.CleanProcessor.clean` [bad-argument-type]
+  --> tests/unit_tests/core/rag/cleaner/test_clean_processor.py:29:43
+ERROR Argument `None` is not assignable to parameter `process_rule` with type `dict[Unknown, Unknown]` in function `core.rag.cleaner.clean_processor.CleanProcessor.clean` [bad-argument-type]
+   --> tests/unit_tests/core/rag/cleaner/test_clean_processor.py:167:41
+ERROR Argument `None` is not assignable to parameter `process_rule` with type `dict[Unknown, Unknown]` in function `core.rag.cleaner.clean_processor.CleanProcessor.clean` [bad-argument-type]
+   --> tests/unit_tests/core/rag/cleaner/test_clean_processor.py:175:43
 ERROR Argument `Literal['manhattan']` is not assignable to parameter `distance_function` with type `Literal['cosine', 'euclidean']` in function `core.rag.datasource.vdb.alibabacloud_mysql.alibabacloud_mysql_vector.AlibabaCloudMySQLVectorConfig.__init__` [bad-argument-type]
    --> tests/unit_tests/core/rag/datasource/vdb/alibabacloud_mysql/test_alibabacloud_mysql_vector.py:682:35
 ERROR Argument `list[str]` is not assignable to parameter `docs` with type `Sequence[Document]` in function `core.rag.docstore.dataset_docstore.DatasetDocumentStore.add_documents` [bad-argument-type]
@@ -4086,28 +4120,10 @@
     --> tests/unit_tests/core/rag/retrieval/test_dataset_retrieval.py:3769:67
 ERROR `None` is not subscriptable [unsupported-operation]
     --> tests/unit_tests/core/rag/retrieval/test_dataset_retrieval.py:4009:16
-ERROR Argument `list[SimpleNamespace]` is not assignable to parameter `available_datasets` with type `list[Dataset]` in function `core.rag.retrieval.dataset_retrieval.DatasetRetrieval.single_retrieve` [bad-argument-type]
-    --> tests/unit_tests/core/rag/retrieval/test_dataset_retrieval.py:4476:40
-ERROR Argument `list[SimpleNamespace]` is not assignable to parameter `available_datasets` with type `list[Dataset]` in function `core.rag.retrieval.dataset_retrieval.DatasetRetrieval.single_retrieve` [bad-argument-type]
-    --> tests/unit_tests/core/rag/retrieval/test_dataset_retrieval.py:4528:40
 ERROR Argument `SimpleNamespace` is not assignable to parameter `metadata_condition` with type `MetadataCondition | None` in function `core.rag.retrieval.dataset_retrieval.DatasetRetrieval.single_retrieve` [bad-argument-type]
     --> tests/unit_tests/core/rag/retrieval/test_dataset_retrieval.py:4533:40
-ERROR Argument `list[SimpleNamespace]` is not assignable to parameter `available_datasets` with type `list[Dataset]` in function `core.rag.retrieval.dataset_retrieval.DatasetRetrieval.single_retrieve` [bad-argument-type]
-    --> tests/unit_tests/core/rag/retrieval/test_dataset_retrieval.py:4549:36
-ERROR Argument `list[SimpleNamespace]` is not assignable to parameter `available_datasets` with type `list[Dataset]` in function `core.rag.retrieval.dataset_retrieval.DatasetRetrieval.single_retrieve` [bad-argument-type]
-    --> tests/unit_tests/core/rag/retrieval/test_dataset_retrieval.py:4579:36
 ERROR Argument `SimpleNamespace` is not assignable to parameter `metadata_condition` with type `MetadataCondition | None` in function `core.rag.retrieval.dataset_retrieval.DatasetRetrieval.single_retrieve` [bad-argument-type]
     --> tests/unit_tests/core/rag/retrieval/test_dataset_retrieval.py:4584:36
-ERROR Argument `list[SimpleNamespace]` is not assignable to parameter `available_datasets` with type `list[Dataset]` in function `core.rag.retrieval.dataset_retrieval.DatasetRetrieval.single_retrieve` [bad-argument-type]
-    --> tests/unit_tests/core/rag/retrieval/test_dataset_retrieval.py:4592:36
-ERROR Argument `list[SimpleNamespace]` is not assignable to parameter `available_datasets` with type `list[Dataset]` in function `core.rag.retrieval.dataset_retrieval.DatasetRetrieval.multiple_retrieve` [bad-argument-type]
-    --> tests/unit_tests/core/rag/retrieval/test_dataset_retrieval.py:4628:36
-ERROR Argument `list[SimpleNamespace]` is not assignable to parameter `available_datasets` with type `list[Dataset]` in function `core.rag.retrieval.dataset_retrieval.DatasetRetrieval.multiple_retrieve` [bad-argument-type]
-    --> tests/unit_tests/core/rag/retrieval/test_dataset_retrieval.py:4656:36
-ERROR Argument `list[SimpleNamespace]` is not assignable to parameter `available_datasets` with type `list[Dataset]` in function `core.rag.retrieval.dataset_retrieval.DatasetRetrieval.multiple_retrieve` [bad-argument-type]
-    --> tests/unit_tests/core/rag/retrieval/test_dataset_retrieval.py:4714:40
-ERROR Argument `list[SimpleNamespace]` is not assignable to parameter `available_datasets` with type `list[Dataset]` in function `core.rag.retrieval.dataset_retrieval.DatasetRetrieval.multiple_retrieve` [bad-argument-type]
-    --> tests/unit_tests/core/rag/retrieval/test_dataset_retrieval.py:4758:44
 ERROR Argument `Iterator[Any]` is not assignable to parameter `invoke_result` with type `Generator[Unknown]` in function `core.rag.retrieval.router.multi_dataset_react_route.ReactMultiDatasetRouter._handle_invoke_result` [bad-argument-type]
    --> tests/unit_tests/core/rag/retrieval/test_multi_dataset_react_route.py:184:52
 ERROR Argument `None` is not assignable to parameter `text` with type `str` in function `core.rag.splitter.text_splitter.RecursiveCharacterTextSplitter.split_text` [bad-argument-type]
@@ -4923,8 +4939,6 @@
    --> tests/unit_tests/core/workflow/graph_engine/test_table_runner.py:443:23
 ERROR `graph_runtime_state` is uninitialized [unbound-name]
    --> tests/unit_tests/core/workflow/graph_engine/test_table_runner.py:444:37
-ERROR Argument `SimpleNamespace` is not assignable to parameter `node` with type `Node[Unknown]` in function `dify_graph.graph_engine.worker.Worker._build_fallback_failure_event` [bad-argument-type]
-  --> tests/unit_tests/core/workflow/graph_engine/test_worker.py:29:50
 ERROR Argument `dict[str, dict[str, str] | str]` is not assignable to parameter `config` with type `NodeConfigDict` in function `dify_graph.nodes.base.node.Node.__init__` [bad-argument-type]
   --> tests/unit_tests/core/workflow/nodes/answer/test_answer.py:82:16
 ERROR Class member `_Version1._get_retry_config` overrides parent class `Node` in an inconsistent manner [bad-override]
@@ -5115,14 +5129,6 @@
    --> tests/unit_tests/core/workflow/nodes/iteration/iteration_node_spec.py:147:16
 ERROR `dict[str, list[str] | str]` is not assignable to TypedDict key `data` with type `BaseNodeData` [bad-typed-dict-key]
   --> tests/unit_tests/core/workflow/nodes/iteration/test_iteration_child_engine_errors.py:51:21
-ERROR `() -> nullcontext[IExecutionContext]` is not assignable to attribute `_capture_execution_context` with type `(self: IterationNode) -> IExecutionContext` [bad-assignment]
-  --> tests/unit_tests/core/workflow/nodes/iteration/test_parallel_iteration_duration.py:24:39
-ERROR No matching overload found for function `contextlib.nullcontext.__init__` called with arguments: () [no-matching-overload]
-  --> tests/unit_tests/core/workflow/nodes/iteration/test_parallel_iteration_duration.py:24:58
-ERROR `(current: LLMUsage, new: Unknown) -> LLMUsage | Unknown` is not assignable to attribute `_merge_usage` with type `(current: LLMUsage, new_usage: LLMUsage | None) -> LLMUsage` [bad-assignment]
-  --> tests/unit_tests/core/workflow/nodes/iteration/test_parallel_iteration_duration.py:26:25
-ERROR `(*, index: int, item: object, execution_context: object) -> tuple[float, list[NodeRunSucceededEvent], str, dict[Unknown, Unknown], LLMUsage]` is not assignable to attribute `_execute_single_iteration_parallel` with type `(self: IterationNode, index: int, item: object, execution_context: IExecutionContext) -> tuple[float, list[GraphNodeEventBase], object | None, dict[str, Variable], LLMUsage]` [bad-assignment]
-  --> tests/unit_tests/core/workflow/nodes/iteration/test_parallel_iteration_duration.py:44:47
 ERROR Argument `dict[str, dict[str, list[str] | str] | str]` is not assignable to parameter `config` with type `NodeConfigDict` in function `core.workflow.nodes.knowledge_index.knowledge_index_node.KnowledgeIndexNode.__init__` [bad-argument-type]
    --> tests/unit_tests/core/workflow/nodes/knowledge_index/test_knowledge_index_node.py:119:20
 ERROR Argument `dict[str, bool]` is not assignable to parameter `summary_index_setting` with type `SummaryIndexSettingDict | None` in function `core.workflow.nodes.knowledge_index.knowledge_index_node.KnowledgeIndexNode._invoke_knowledge_index` [bad-argument-type]

@github-actions
Contributor

Pyrefly Diff

No changes detected.

@crazywoola crazywoola removed the lgtm This PR has been approved by a maintainer label Mar 20, 2026
@crazywoola crazywoola changed the title from "Fix Jieba keyword extraction for hyphenated and underscore identifiers" to "fix: Jieba keyword extraction for hyphenated and underscore identifiers" Mar 20, 2026
Copilot AI review requested due to automatic review settings March 23, 2026 09:55
Contributor

Copilot AI left a comment

Pull request overview

Fixes keyword extraction and expansion for hyphenated and underscore compound identifiers in Jieba-based (economy-mode) keyword indexing, so that exact searches for terms such as st-771 or model_function_description return results.

Changes:

  • Update fallback tokenization regex to preserve hyphenated identifiers as single tokens.
  • Enhance extract_keywords with a raw-text regex pass to add hyphen/underscore compound identifiers into the keyword set.
  • Add unit tests covering hyphen/underscore keyword extraction and query-side expansion behavior.
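
The first two changes can be illustrated with a minimal sketch. Both regular expressions are quoted from the PR description; the helper names `recover_compounds` and `split_tokens` and the sample strings are hypothetical and are not taken from `JiebaKeywordTableHandler` itself.

```python
import re

# Patterns quoted from the PR description; everything else here is illustrative.
COMPOUND_RE = re.compile(r"[a-zA-Z0-9]+(?:[_-][a-zA-Z0-9]+)+")
TOKEN_RE = re.compile(r"\w+(?:-\w+)*")


def recover_compounds(text: str) -> set[str]:
    """Return every hyphen- or underscore-joined identifier found in the raw text."""
    return set(COMPOUND_RE.findall(text))


def split_tokens(keyword: str) -> list[str]:
    """Tokenize a keyword without breaking a hyphenated identifier apart."""
    return TOKEN_RE.findall(keyword)


text = "Error st-771 was raised by model_function_description"
compounds = recover_compounds(text)

# The previous pattern r"\w+" splits on the hyphen; the new one keeps the term whole.
old_tokens = re.findall(r"\w+", "st-771 failed")
new_tokens = split_tokens("st-771 failed")

print(compounds)
print(old_tokens)
print(new_tokens)
```

Under these assumptions the raw-text pass recovers both compound identifiers regardless of how jieba segments the sentence, and the updated token pattern leaves `st-771` as a single token where `r"\w+"` would produce `st` and `771`.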

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| api/core/rag/datasource/keyword/jieba/jieba_keyword_table_handler.py | Adjusts tokenization/keyword extraction to preserve and recover hyphen/underscore compounds. |
| api/tests/unit_tests/core/rag/datasource/keyword/test_jieba_keyword_table_handler.py | Adds tests to validate compound identifier handling in expansion and extraction paths. |

Copilot AI review requested due to automatic review settings March 23, 2026 12:27
Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.


Copilot AI review requested due to automatic review settings March 24, 2026 22:30
Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.


Labels

needs revision for anti-slop needs-revision size:L This PR changes 100-499 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Regarding the problem of keyword matching in the economic model Knowledge Search

3 participants