fix(rag): include built-in metadata fields in automatic filtering #27754

Biaoo · 2025-11-03T05:23:39Z

Summary

Fixed automatic metadata filtering to properly support built-in metadata fields when built_in_field_enabled is set for a dataset.

Problem: The _automatic_metadata_filter_func method only queried custom metadata fields from the dataset_metadatas table, ignoring built-in fields (document_name, uploader, upload_date, last_update_date, source) even when they were enabled.

Solution:

- Added _get_all_metadata_fields() private method to both DatasetRetrieval and KnowledgeRetrievalNode classes
- This method queries both custom metadata fields and built-in fields (when enabled)
- Refactored _automatic_metadata_filter_func to use the new helper method
- Eliminated code duplication between the two modules

Impact: Automatic metadata filtering now works correctly with both custom and built-in metadata fields, enabling queries like "find documents uploaded by user X" or "show files uploaded in 2024".

Changes

Modified api/core/rag/retrieval/dataset_retrieval.py
- Added BuiltInField import
- Added _get_all_metadata_fields() method (lines 966-987)
- Updated _automatic_metadata_filter_func() to use the new method
Modified api/core/workflow/nodes/knowledge_retrieval/knowledge_retrieval_node.py
- Added BuiltInField import
- Added _get_all_metadata_fields() method (lines 524-547)
- Updated _automatic_metadata_filter_func() to use the new method

Checklist

This change requires a documentation update, included: Dify Document
I understand that this PR may be closed in case there was no previous discussion or issues. (This doesn't apply to typos!)
I've added a test for each change that was introduced, and I tried as much as possible to make a single atomic change.
I've updated the documentation accordingly.
I ran `dev/reformat`(backend) and `cd web && npx lint-staged`(frontend) to appease the lint gods

Fixed automatic metadata filtering to support built-in fields (document_name, uploader, upload_date, last_update_date, source) when built_in_field_enabled is true for the dataset.

Changes:
- Added BuiltInField import to both retrieval modules
- Added _get_all_metadata_fields() private method to query both custom and built-in fields
- Refactored _automatic_metadata_filter_func to use the new helper method
- Eliminated code duplication between DatasetRetrieval and KnowledgeRetrievalNode

gemini-code-assist · 2025-11-03T05:23:59Z

Summary of Changes

Hello @Biaoo, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request addresses a critical bug in the automatic metadata filtering functionality where built-in metadata fields were not being properly included in the filtering process, even when enabled. By introducing a dedicated helper method to gather all relevant metadata fields—custom and built-in—and integrating it into the existing filtering logic, this change ensures comprehensive and accurate metadata-based document retrieval. This enhancement significantly improves the flexibility and power of RAG queries, allowing users to leverage a full spectrum of metadata for more precise results.

Highlights

Problem Addressed: Previously, the automatic metadata filtering mechanism only considered custom metadata fields, overlooking built-in fields such as 'document_name', 'uploader', 'upload_date', 'last_update_date', and 'source', even when they were enabled for a dataset.
New Helper Method Introduced: A new private method, _get_all_metadata_fields(), has been added to both DatasetRetrieval and KnowledgeRetrievalNode classes. This method is responsible for querying and returning both custom metadata fields and, if enabled, built-in metadata fields.
Refactored Filtering Logic: The _automatic_metadata_filter_func method in both affected modules has been updated to utilize the new _get_all_metadata_fields() helper. This ensures that the filtering process now comprehensively considers all relevant metadata fields, both custom and built-in.
Enhanced Query Capabilities: The fix enables automatic metadata filtering to work correctly with both custom and built-in metadata fields, allowing for more robust and flexible queries, such as filtering documents by uploader or upload date.

¹ Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

gemini-code-assist

Code Review

This pull request correctly fixes an issue where built-in metadata fields were not included in automatic filtering. The introduction of the _get_all_metadata_fields helper method is a good step towards isolating this logic. However, my review identifies a few areas for improvement. Specifically, I've pointed out an opportunity to make the database queries more efficient, a case of code duplication that could be resolved by further refactoring, and a missing check that could prevent an unnecessary LLM call. Addressing these points will enhance the code's performance and maintainability.

gemini-code-assist · 2025-11-03T05:25:32Z

api/core/rag/retrieval/dataset_retrieval.py

+    def _get_all_metadata_fields(self, dataset_ids: list) -> list[str]:
+        """
+        Get all metadata field names for the given datasets, including both custom and built-in fields.
+
+        :param dataset_ids: list of dataset IDs
+        :return: list of metadata field names
+        """
+        # Get custom metadata fields
        metadata_stmt = select(DatasetMetadata).where(DatasetMetadata.dataset_id.in_(dataset_ids))
        metadata_fields = db.session.scalars(metadata_stmt).all()
        all_metadata_fields = [metadata_field.name for metadata_field in metadata_fields]
+
+        # Check if any dataset has built-in fields enabled
+        datasets_stmt = select(Dataset).where(Dataset.id.in_(dataset_ids))
+        datasets = db.session.scalars(datasets_stmt).all()
+        built_in_enabled = any(dataset.built_in_field_enabled for dataset in datasets)
+
+        # Add built-in fields if enabled
+        if built_in_enabled:
+            built_in_fields = [field.value for field in BuiltInField]
+            all_metadata_fields.extend(built_in_fields)
+
+        return all_metadata_fields


This method is identical to _get_all_metadata_fields in api/core/workflow/nodes/knowledge_retrieval/knowledge_retrieval_node.py. The PR description mentions eliminating code duplication, but this change introduces new duplication. This logic could be extracted into a shared utility function or a static method on the Dataset model to adhere to the DRY principle and improve maintainability.
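One way the shared-helper suggestion could look is sketched below. This is only an illustration under stated assumptions: `DatasetInfo` and `collect_metadata_field_names` are hypothetical names, the `BuiltInField` enum is a stand-in whose values are taken from the PR description, and plain in-memory objects replace the database queries so the example runs on its own.

```python
from dataclasses import dataclass
from enum import Enum


class BuiltInField(str, Enum):
    # Stand-in for the project's enum; values copied from the PR description.
    document_name = "document_name"
    uploader = "uploader"
    upload_date = "upload_date"
    last_update_date = "last_update_date"
    source = "source"


@dataclass
class DatasetInfo:
    # Hypothetical, simplified view of a dataset and its custom metadata names.
    id: str
    custom_field_names: list
    built_in_field_enabled: bool


def collect_metadata_field_names(datasets):
    """Shared helper: custom field names, plus built-in names if any dataset enables them."""
    names = [name for ds in datasets for name in ds.custom_field_names]
    if any(ds.built_in_field_enabled for ds in datasets):
        names.extend(field.value for field in BuiltInField)
    return names


class DatasetRetrieval:
    def _get_all_metadata_fields(self, datasets):
        # Delegates instead of carrying its own copy of the logic.
        return collect_metadata_field_names(datasets)


class KnowledgeRetrievalNode:
    def _get_all_metadata_fields(self, datasets):
        # Same delegation, so both classes stay in sync.
        return collect_metadata_field_names(datasets)
```

With this shape, a later change to how field names are gathered would only need to be made in one place.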

gemini-code-assist · 2025-11-03T05:25:32Z

api/core/rag/retrieval/dataset_retrieval.py

+        datasets_stmt = select(Dataset).where(Dataset.id.in_(dataset_ids))
+        datasets = db.session.scalars(datasets_stmt).all()
+        built_in_enabled = any(dataset.built_in_field_enabled for dataset in datasets)


Fetching all dataset objects just to check if any have built_in_field_enabled is inefficient. This can be optimized by performing the check directly in the database, which avoids transferring unnecessary data and processing it in Python.

Suggested change

-        datasets_stmt = select(Dataset).where(Dataset.id.in_(dataset_ids))
-        datasets = db.session.scalars(datasets_stmt).all()
-        built_in_enabled = any(dataset.built_in_field_enabled for dataset in datasets)
+        datasets_stmt = select(Dataset.id).where(Dataset.id.in_(dataset_ids), Dataset.built_in_field_enabled).limit(1)
+        built_in_enabled = db.session.scalar(datasets_stmt) is not None
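
To illustrate the query shape behind this suggestion, here is a rough, self-contained sketch that uses the standard-library sqlite3 module as a stand-in for SQLAlchemy. The table name `datasets`, its columns, and the helper `any_built_in_enabled` are assumptions made for this example, not the project's actual schema or API.

```python
import sqlite3


def any_built_in_enabled(conn, dataset_ids):
    """Return True if at least one of the given datasets has built-in fields enabled."""
    if not dataset_ids:
        return False
    placeholders = ", ".join("?" for _ in dataset_ids)
    # The database stops at the first matching row; only one id is transferred.
    row = conn.execute(
        f"SELECT id FROM datasets WHERE id IN ({placeholders}) "
        "AND built_in_field_enabled = 1 LIMIT 1",
        list(dataset_ids),
    ).fetchone()
    return row is not None


# Hypothetical in-memory data for demonstration only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE datasets (id TEXT PRIMARY KEY, built_in_field_enabled INTEGER NOT NULL)"
)
conn.executemany("INSERT INTO datasets VALUES (?, ?)", [("a", 0), ("b", 1), ("c", 0)])
```

Compared with loading every dataset row and checking the flag in Python, this returns at most a single id regardless of how many datasets are passed in.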

gemini-code-assist · 2025-11-03T05:25:32Z

api/core/workflow/nodes/knowledge_retrieval/knowledge_retrieval_node.py

+        datasets_stmt = select(Dataset).where(Dataset.id.in_(dataset_ids))
+        datasets = db.session.scalars(datasets_stmt).all()
+        built_in_enabled = any(dataset.built_in_field_enabled for dataset in datasets)


Fetching all dataset objects to check if any have built_in_field_enabled is inefficient. This can be optimized by letting the database perform the check, which avoids transferring unnecessary data and processing it in Python.

Suggested change

-        datasets_stmt = select(Dataset).where(Dataset.id.in_(dataset_ids))
-        datasets = db.session.scalars(datasets_stmt).all()
-        built_in_enabled = any(dataset.built_in_field_enabled for dataset in datasets)
+        datasets_stmt = select(Dataset.id).where(Dataset.id.in_(dataset_ids), Dataset.built_in_field_enabled).limit(1)
+        built_in_enabled = db.session.scalar(datasets_stmt) is not None

gemini-code-assist · 2025-11-03T05:25:32Z

api/core/workflow/nodes/knowledge_retrieval/knowledge_retrieval_node.py

+        all_metadata_fields = self._get_all_metadata_fields(dataset_ids)
+


For consistency and efficiency, you should add a check here to return early if all_metadata_fields is empty, similar to the implementation in dataset_retrieval.py. This avoids an unnecessary LLM call when no metadata fields are available.

Suggested change

-        all_metadata_fields = self._get_all_metadata_fields(dataset_ids)
+        all_metadata_fields = self._get_all_metadata_fields(dataset_ids)
+        if not all_metadata_fields:
+            return [], usage
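
The effect of the early return can be sketched in isolation as follows. This is a minimal illustration, assuming a hypothetical `automatic_metadata_filter` function and a fake LLM callable; the filter dictionary shape is invented for the example, and only the `[], usage` return value comes from the suggestion above.

```python
def automatic_metadata_filter(query, metadata_fields, call_llm, usage=None):
    """Return (filters, usage); skip the LLM call when there are no fields to filter on."""
    if not metadata_fields:
        # Nothing to build a filter from, so avoid the LLM round trip entirely.
        return [], usage
    return call_llm(query, metadata_fields), usage


# Records every query that reaches the fake LLM, to make the short-circuit observable.
calls = []


def fake_llm(query, fields):
    calls.append(query)
    # Hypothetical filter shape, used only for this demonstration.
    return [{"metadata_name": fields[0], "value": query}]
```

Calling `automatic_metadata_filter("q", [], fake_llm)` leaves `calls` empty, while passing a non-empty field list invokes `fake_llm` once.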

dosubot bot added the size:M (This PR changes 30-99 lines, ignoring generated files.) and 👻 feat:rag (Embedding related issue, like qdrant, weaviate, milvus, vector database.) labels Nov 3, 2025

gemini-code-assist bot reviewed Nov 3, 2025

crazywoola and others added 4 commits November 4, 2025 09:40

- Merge branch 'main' into fix/metadata-filter-built-in-fields (8814e2c)
- Merge branch 'main' into fix/metadata-filter-built-in-fields (e8518d9)
- Merge branch 'main' into fix/metadata-filter-built-in-fields (338e1d6)
- Merge branch 'main' into fix/metadata-filter-built-in-fields (9e0fc60)

laipz8200 requested review from JohnJyong and QuantumGhost as code owners December 9, 2025 05:46

laipz8200 force-pushed the main branch from a506b55 to 18601d8 Compare December 9, 2025 05:48

dosubot bot mentioned this pull request Dec 12, 2025

Automatic metadata filtering does not work with custom metadata field #29556

Closed

6 tasks
