Skip to content

Conversation

@ZeroZ-lab
Copy link
Contributor

Implement comprehensive document metadata support in Knowledge Base workflow node, allowing users to configure metadata values through both constants and variables.

Backend changes:

  • Add DocMetadata model with support for constant values and variable selectors
  • Implement metadata processing in KnowledgeIndexNode with variable resolution
  • Add batch query optimization to prevent N+1 queries
  • Implement metadata validation and binding creation
  • Add comprehensive unit tests for node and service layers

Frontend changes:

  • Add MetadataSection component with type-aware input controls
    • String type: text input
    • Number type: number input with validation
    • Time type: date picker (Unix timestamp)
  • Implement variable filtering based on metadata data type
    • String metadata: only string variables
    • Number metadata: only number/integer variables
    • Time metadata: only time-related number variables (timestamp, time, date, at)
  • Add VarReferencePicker with 360px min-width for better UX
  • Standardize font size to text-[13px] across all inputs
  • Add i18n support for all user-facing strings

Technical improvements:

  • Use SQLAlchemy attributes.flag_modified() for JSON field updates
  • Optimize logging to follow project standards (warnings and exceptions only)
  • Add type safety with proper TypeScript definitions
  • Implement proper error handling with user-friendly messages

Important

  1. Make sure you have read our contribution guidelines
  2. Ensure there is an associated issue and you have been assigned to it
  3. Use the correct syntax to link this PR: Fixes #<issue number>.

Summary

fix #31437

Screenshots

Before After
... image
... image
... image
... image

Checklist

  • This change requires a documentation update, included: Dify Document
  • I understand that this PR may be closed in case there was no previous discussion or issues. (This doesn't apply to typos!)
  • I've added a test for each change that was introduced, and I tried as much as possible to make a single atomic change.
  • I've updated the documentation accordingly.
  • I ran make lint and make type-check (backend) and cd web && npx lint-staged (frontend) to appease the lint gods

…e node

Implement comprehensive document metadata support in Knowledge Base workflow node, allowing users to configure metadata values through both constants and variables.

Backend changes:
- Add DocMetadata model with support for constant values and variable selectors
- Implement metadata processing in KnowledgeIndexNode with variable resolution
- Add batch query optimization to prevent N+1 queries
- Implement metadata validation and binding creation
- Add comprehensive unit tests for node and service layers

Frontend changes:
- Add MetadataSection component with type-aware input controls
  - String type: text input
  - Number type: number input with validation
  - Time type: date picker (Unix timestamp)
- Implement variable filtering based on metadata data type
  - String metadata: only string variables
  - Number metadata: only number/integer variables
  - Time metadata: only time-related number variables (timestamp, *time*, *date*, *at*)
- Add VarReferencePicker with 360px min-width for better UX
- Standardize font size to text-[13px] across all inputs
- Add i18n support for all user-facing strings

Technical improvements:
- Use SQLAlchemy attributes.flag_modified() for JSON field updates
- Optimize logging to follow project standards (warnings and exceptions only)
- Add type safety with proper TypeScript definitions
- Implement proper error handling with user-friendly messages

Co-Authored-By: Claude <noreply@anthropic.com>
@github-actions github-actions bot added the web This relates to changes on the web. label Jan 23, 2026
@gemini-code-assist
Copy link
Contributor

Summary of Changes

Hello @ZeroZ-lab, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the Knowledge Base workflow node by introducing robust document metadata configuration. Users can now define and assign custom metadata to documents, leveraging both static values and dynamic variables from the workflow. This feature provides greater flexibility for organizing and enriching knowledge base content, ultimately improving data management and retrieval within workflows.

Highlights

  • Document Metadata Model: Introduced a new DocMetadata model in the backend to define the structure for document metadata, supporting various data types (string, integer, float, list of strings).
  • Knowledge Base Node Integration: Implemented comprehensive metadata processing within the KnowledgeIndexNode to handle document metadata configuration, including variable resolution for dynamic values and batch query optimization to prevent N+1 issues.
  • Frontend Metadata Configuration UI: Developed a new MetadataSection component for the Knowledge Base node, offering a user-friendly interface to configure document metadata. This includes type-aware input controls (text, number, date picker) and intelligent variable filtering based on metadata data types.
  • Metadata Management Enhancements: Added functionality to the DatasetService to save documents with associated metadata and updated the MetadataService to prevent deletion of metadata currently in use by a workflow pipeline.
  • Technical Improvements & Testing: Incorporated SQLAlchemy's attributes.flag_modified() for efficient JSON field updates, optimized logging, enhanced type safety with TypeScript, improved error handling, and added dedicated unit tests for metadata processing in both workflow nodes and dataset services.

🧠 New Feature in Public Preview: You can now enable Memory to help Gemini Code Assist learn from your team's feedback. This makes future code reviews more consistent and personalized to your project's style. Click here to enable Memory in your admin console.

Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature Command Description
Code Review /gemini review Performs a code review for the current pull request in its current state.
Pull Request Summary /gemini summary Provides a summary of the current pull request in its current state.
Comment @gemini-code-assist Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help /gemini help Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

Copy link
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a comprehensive and well-implemented feature for document metadata in Knowledge Base nodes. The backend changes are robust, including optimizations like batch fetching to avoid N+1 queries. The frontend implementation provides a good user experience with type-aware controls and clear variable selection.

I've identified a few areas for improvement:

  • A critical issue in a new unit test that prevents it from testing the intended logic.
  • A high-severity issue where a check for in-use metadata only considers draft workflows, potentially allowing deletion of metadata used in published applications.
  • A few medium-severity suggestions to improve performance and code quality in the backend and test suite.

Overall, this is a great contribution. Addressing the identified issues will make it even more solid.

I am having trouble creating individual review comments. Click here to see my feedback.

api/tests/unit_tests/core/workflow/nodes/knowledge_index/test_knowledge_index_node.py (108)

critical

This test is intended to verify custom metadata handling, but it mocks _invoke_knowledge_index, which is where the metadata processing logic actually resides. As a result, the test doesn't execute the code it's supposed to be testing, and the assertions will fail.

To fix this, you should remove the line node._invoke_knowledge_index = MagicMock(). The test already mocks IndexProcessorFactory, which should be sufficient to isolate the test from the actual indexing logic while allowing the metadata processing code to run.

api/services/metadata_service.py (113-117)

high

The check_metadata_used_in_pipeline function only checks the draft version of the workflow for metadata usage (version=Workflow.VERSION_DRAFT). This is risky because it doesn't account for metadata being used in a published workflow. A user could delete metadata that is essential for a live application, causing it to fail.

To prevent this, the check should be extended to include the published workflow as well. You can look at RagPipelineService for examples of how to query for both draft and published workflows.

api/core/workflow/nodes/knowledge_index/knowledge_index_node.py (172-177)

medium

For better readability and conciseness, you can use a dictionary comprehension to create metadata_name_map. This is more idiomatic Python and can be slightly more performant.

                metadata_name_map: dict[str, str] = {
                    md.id: md.name
                    for md in db.session.scalars(
                        select(DatasetMetadata).where(DatasetMetadata.dataset_id == dataset.id)
                    ).all()
                }

api/services/dataset_service.py (1856-1867)

medium

The current implementation for creating DatasetMetadataBinding records uses a nested loop, which can lead to a large number of individual INSERT statements and database calls, especially during batch document uploads. This could impact performance.

To improve efficiency, I recommend using db.session.bulk_insert_mappings to perform a single bulk insert operation.

                    if metadata_bindings_to_create and document_ids:
                        bindings_to_add = [
                            {
                                "tenant_id": dataset.tenant_id,
                                "dataset_id": dataset.id,
                                "document_id": doc_id,
                                "metadata_id": metadata_id,
                                "created_by": account.id,
                            }
                            for doc_id in document_ids
                            for metadata_id, _ in metadata_bindings_to_create
                        ]
                        db.session.bulk_insert_mappings(DatasetMetadataBinding, bindings_to_add)
                        db.session.commit()

api/tests/unit_tests/core/workflow/nodes/knowledge_index/test_knowledge_index_node.py (48-55)

medium

This side_effect assignment for pool.get is immediately overridden by another side_effect function definition on line 77. This makes the code confusing and seems to be leftover from development. Please remove these lines to improve test clarity.

api/tests/unit_tests/services/test_dataset_service_metadata.py (137)

medium

The assertion assert mock_dependencies["db"].add.call_count >= 1 is quite weak. It only confirms that something was added to the session, but doesn't verify that a DatasetMetadataBinding was created with the correct attributes.

To make this test more robust, I suggest adding a more specific assertion to check that a DatasetMetadataBinding instance is added to the session and that its attributes (like metadata_id and document_id) are correct. You can iterate through mock_dependencies["db"].add.call_args_list to find the DatasetMetadataBinding instance and assert its properties.

ZeroZ-lab and others added 2 commits January 23, 2026 16:29
- Fix test_delete_metadata_not_found: use pytest.raises() since service
  raises ValueError instead of returning None
- Fix test_run_with_custom_metadata: remove mock of _invoke_knowledge_index
  that bypassed metadata processing logic, add missing BATCH/ORIGINAL_DOCUMENT_ID
  variable mocks, remove redundant side_effect code
- Fix test_save_document_with_dataset_id_ignores_lock_not_owned: add missing
  enable_built_in_metadata and doc_metadata attributes to knowledge_config
- Fix test_save_document_with_metadata: add .filter().all() mock chain for
  DatasetMetadata query
- Enhance check_metadata_used_in_pipeline to check both draft and current
  published workflows (via pipeline.workflow_id) to prevent deletion of
  metadata actively used in production

Co-Authored-By: Claude <noreply@anthropic.com>
Add proper input styling (border, rounded corners, background) to the
metadata value container in Knowledge Base node for visual consistency
with other form inputs.

Co-Authored-By: Claude <noreply@anthropic.com>
@ZeroZ-lab ZeroZ-lab force-pushed the feat-rag-pipline-metadata branch from dd457bb to 7997ef2 Compare January 23, 2026 08:40
ZeroZ-lab and others added 9 commits January 23, 2026 16:40
Rewrite workflow query conditions to avoid type checking error with
list.append() having inconsistent types.

Co-Authored-By: Claude <noreply@anthropic.com>
Mock attributes.flag_modified to avoid SQLAlchemy internal state
requirements. Update assertion to verify flag_modified call instead
of db.session.add call.
@ZeroZ-lab ZeroZ-lab marked this pull request as ready for review January 23, 2026 13:18
@dosubot dosubot bot added size:XL This PR changes 500-999 lines, ignoring generated files. 💪 enhancement New feature or request labels Jan 23, 2026
ZeroZ-lab and others added 5 commits January 26, 2026 10:01
Resolve conflicts between doc_metadata feature and summary_index feature:
- api/core/workflow/nodes/knowledge_index/entities.py: merge both fields
- api/core/workflow/nodes/knowledge_index/knowledge_index_node.py: merge imports
- web/app/components/workflow/nodes/knowledge-base/hooks/use-config.ts: merge handlers
- web/app/components/workflow/nodes/knowledge-base/types.ts: merge type definitions
- web/app/components/workflow/nodes/knowledge-base/panel.tsx: merge handler usage

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
ZeroZ-lab and others added 8 commits February 2, 2026 11:24
- Add system-xs-regular and truncate to date-picker for consistent font size
- Fix value container alignment with w-0 grow overflow-hidden pattern
- Change delete button from × to trash icon with destructive hover
- Align condition-date layout with other condition components

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@ZeroZ-lab ZeroZ-lab requested a review from laipz8200 as a code owner February 5, 2026 08:18
@ZeroZ-lab
Copy link
Contributor Author

Code review

No issues found. Checked for bugs and CLAUDE.md compliance.

🤖 Generated with Claude Code

@ZeroZ-lab ZeroZ-lab marked this pull request as draft February 9, 2026 06:05
ZeroZ-lab and others added 5 commits February 9, 2026 14:06
…adataBinding` for new and duplicate documents and improving frontend metadata input handling.
…rkflow nodes, enhance data consistency for document metadata bindings, and improve metadata usage check robustness.
@ZeroZ-lab ZeroZ-lab marked this pull request as ready for review February 9, 2026 06:33
@dosubot dosubot bot added size:XXL This PR changes 1000+ lines, ignoring generated files. and removed size:XL This PR changes 500-999 lines, ignoring generated files. labels Feb 9, 2026
@JohnJyong
Copy link
Contributor

the function disable built-in metadata also need to add metadata using check.
the others LGTM, thanks for your contribution @ZeroZ-lab

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

💪 enhancement New feature or request size:XXL This PR changes 1000+ lines, ignoring generated files. web This relates to changes on the web.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Feature] Add document metadata configuration for Knowledge Base workflow node

2 participants