-
Notifications
You must be signed in to change notification settings - Fork 20.1k
feat(workflow): add document metadata configuration for Knowledge Base node #31438
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
…e node Implement comprehensive document metadata support in Knowledge Base workflow node, allowing users to configure metadata values through both constants and variables. Backend changes: - Add DocMetadata model with support for constant values and variable selectors - Implement metadata processing in KnowledgeIndexNode with variable resolution - Add batch query optimization to prevent N+1 queries - Implement metadata validation and binding creation - Add comprehensive unit tests for node and service layers Frontend changes: - Add MetadataSection component with type-aware input controls - String type: text input - Number type: number input with validation - Time type: date picker (Unix timestamp) - Implement variable filtering based on metadata data type - String metadata: only string variables - Number metadata: only number/integer variables - Time metadata: only time-related number variables (timestamp, *time*, *date*, *at*) - Add VarReferencePicker with 360px min-width for better UX - Standardize font size to text-[13px] across all inputs - Add i18n support for all user-facing strings Technical improvements: - Use SQLAlchemy attributes.flag_modified() for JSON field updates - Optimize logging to follow project standards (warnings and exceptions only) - Add type safety with proper TypeScript definitions - Implement proper error handling with user-friendly messages Co-Authored-By: Claude <noreply@anthropic.com>
Summary of ChangesHello @ZeroZ-lab, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request significantly enhances the Knowledge Base workflow node by introducing robust document metadata configuration. Users can now define and assign custom metadata to documents, leveraging both static values and dynamic variables from the workflow. This feature provides greater flexibility for organizing and enriching knowledge base content, ultimately improving data management and retrieval within workflows. Highlights
🧠 New Feature in Public Preview: You can now enable Memory to help Gemini Code Assist learn from your team's feedback. This makes future code reviews more consistent and personalized to your project's style. Click here to enable Memory in your admin console. Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here. You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension. Footnotes
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code Review
This pull request introduces a comprehensive and well-implemented feature for document metadata in Knowledge Base nodes. The backend changes are robust, including optimizations like batch fetching to avoid N+1 queries. The frontend implementation provides a good user experience with type-aware controls and clear variable selection.
I've identified a few areas for improvement:
- A critical issue in a new unit test that prevents it from testing the intended logic.
- A high-severity issue where a check for in-use metadata only considers draft workflows, potentially allowing deletion of metadata used in published applications.
- A few medium-severity suggestions to improve performance and code quality in the backend and test suite.
Overall, this is a great contribution. Addressing the identified issues will make it even more solid.
I am having trouble creating individual review comments. Click here to see my feedback.
api/tests/unit_tests/core/workflow/nodes/knowledge_index/test_knowledge_index_node.py (108)
This test is intended to verify custom metadata handling, but it mocks _invoke_knowledge_index, which is where the metadata processing logic actually resides. As a result, the test doesn't execute the code it's supposed to be testing, and the assertions will fail.
To fix this, you should remove the line node._invoke_knowledge_index = MagicMock(). The test already mocks IndexProcessorFactory, which should be sufficient to isolate the test from the actual indexing logic while allowing the metadata processing code to run.
api/services/metadata_service.py (113-117)
The check_metadata_used_in_pipeline function only checks the draft version of the workflow for metadata usage (version=Workflow.VERSION_DRAFT). This is risky because it doesn't account for metadata being used in a published workflow. A user could delete metadata that is essential for a live application, causing it to fail.
To prevent this, the check should be extended to include the published workflow as well. You can look at RagPipelineService for examples of how to query for both draft and published workflows.
api/core/workflow/nodes/knowledge_index/knowledge_index_node.py (172-177)
For better readability and conciseness, you can use a dictionary comprehension to create metadata_name_map. This is more idiomatic Python and can be slightly more performant.
metadata_name_map: dict[str, str] = {
md.id: md.name
for md in db.session.scalars(
select(DatasetMetadata).where(DatasetMetadata.dataset_id == dataset.id)
).all()
}
api/services/dataset_service.py (1856-1867)
The current implementation for creating DatasetMetadataBinding records uses a nested loop, which can lead to a large number of individual INSERT statements and database calls, especially during batch document uploads. This could impact performance.
To improve efficiency, I recommend using db.session.bulk_insert_mappings to perform a single bulk insert operation.
if metadata_bindings_to_create and document_ids:
bindings_to_add = [
{
"tenant_id": dataset.tenant_id,
"dataset_id": dataset.id,
"document_id": doc_id,
"metadata_id": metadata_id,
"created_by": account.id,
}
for doc_id in document_ids
for metadata_id, _ in metadata_bindings_to_create
]
db.session.bulk_insert_mappings(DatasetMetadataBinding, bindings_to_add)
db.session.commit()
api/tests/unit_tests/core/workflow/nodes/knowledge_index/test_knowledge_index_node.py (48-55)
This side_effect assignment for pool.get is immediately overridden by another side_effect function definition on line 77. This makes the code confusing and seems to be leftover from development. Please remove these lines to improve test clarity.
api/tests/unit_tests/services/test_dataset_service_metadata.py (137)
The assertion assert mock_dependencies["db"].add.call_count >= 1 is quite weak. It only confirms that something was added to the session, but doesn't verify that a DatasetMetadataBinding was created with the correct attributes.
To make this test more robust, I suggest adding a more specific assertion to check that a DatasetMetadataBinding instance is added to the session and that its attributes (like metadata_id and document_id) are correct. You can iterate through mock_dependencies["db"].add.call_args_list to find the DatasetMetadataBinding instance and assert its properties.
- Fix test_delete_metadata_not_found: use pytest.raises() since service raises ValueError instead of returning None - Fix test_run_with_custom_metadata: remove mock of _invoke_knowledge_index that bypassed metadata processing logic, add missing BATCH/ORIGINAL_DOCUMENT_ID variable mocks, remove redundant side_effect code - Fix test_save_document_with_dataset_id_ignores_lock_not_owned: add missing enable_built_in_metadata and doc_metadata attributes to knowledge_config - Fix test_save_document_with_metadata: add .filter().all() mock chain for DatasetMetadata query - Enhance check_metadata_used_in_pipeline to check both draft and current published workflows (via pipeline.workflow_id) to prevent deletion of metadata actively used in production Co-Authored-By: Claude <noreply@anthropic.com>
Add proper input styling (border, rounded corners, background) to the metadata value container in Knowledge Base node for visual consistency with other form inputs. Co-Authored-By: Claude <noreply@anthropic.com>
dd457bb to
7997ef2
Compare
Rewrite workflow query conditions to avoid type checking error with list.append() having inconsistent types. Co-Authored-By: Claude <noreply@anthropic.com>
Mock attributes.flag_modified to avoid SQLAlchemy internal state requirements. Update assertion to verify flag_modified call instead of db.session.add call.
Resolve conflicts between doc_metadata feature and summary_index feature: - api/core/workflow/nodes/knowledge_index/entities.py: merge both fields - api/core/workflow/nodes/knowledge_index/knowledge_index_node.py: merge imports - web/app/components/workflow/nodes/knowledge-base/hooks/use-config.ts: merge handlers - web/app/components/workflow/nodes/knowledge-base/types.ts: merge type definitions - web/app/components/workflow/nodes/knowledge-base/panel.tsx: merge handler usage Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
…lab/dify into feat-rag-pipline-metadata
- Add system-xs-regular and truncate to date-picker for consistent font size - Fix value container alignment with w-0 grow overflow-hidden pattern - Change delete button from × to trash icon with destructive hover - Align condition-date layout with other condition components Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Code reviewNo issues found. Checked for bugs and CLAUDE.md compliance. 🤖 Generated with Claude Code |
…adataBinding` for new and duplicate documents and improving frontend metadata input handling.
…rkflow nodes, enhance data consistency for document metadata bindings, and improve metadata usage check robustness.
|
the function disable built-in metadata also need to add metadata using check. |
Implement comprehensive document metadata support in Knowledge Base workflow node, allowing users to configure metadata values through both constants and variables.
Backend changes:
Frontend changes:
Technical improvements:
Important
Fixes #<issue number>.Summary
fix #31437
Screenshots
Checklist
make lintandmake type-check(backend) andcd web && npx lint-staged(frontend) to appease the lint gods