
@Biaoo Biaoo commented Nov 3, 2025

Fixes #27753

Summary

Fixed automatic metadata filtering so that it supports built-in metadata fields when built_in_field_enabled is set for a dataset.

Problem: The _automatic_metadata_filter_func method only queried custom metadata fields from the dataset_metadatas table, ignoring built-in fields (document_name, uploader, upload_date, last_update_date, source) even when they were enabled.

Solution:

  • Added _get_all_metadata_fields() private method to both DatasetRetrieval and KnowledgeRetrievalNode classes
  • This method queries both custom metadata fields and built-in fields (when enabled)
  • Refactored _automatic_metadata_filter_func to use the new helper method
  • Eliminated code duplication between the two modules

Impact: Automatic metadata filtering now works correctly with both custom and built-in metadata fields, enabling queries like "find documents uploaded by user X" or "show files uploaded in 2024".

Changes

  • Modified api/core/rag/retrieval/dataset_retrieval.py

    • Added BuiltInField import
    • Added _get_all_metadata_fields() method (lines 966-987)
    • Updated _automatic_metadata_filter_func() to use the new method
  • Modified api/core/workflow/nodes/knowledge_retrieval/knowledge_retrieval_node.py

    • Added BuiltInField import
    • Added _get_all_metadata_fields() method (lines 524-547)
    • Updated _automatic_metadata_filter_func() to use the new method

Checklist

  • This change requires a documentation update, included: Dify Document
  • I understand that this PR may be closed in case there was no previous discussion or issues. (This doesn't apply to typos!)
  • I've added a test for each change that was introduced, and I tried as much as possible to make a single atomic change.
  • I've updated the documentation accordingly.
  • I ran `dev/reformat`(backend) and `cd web && npx lint-staged`(frontend) to appease the lint gods

Fixed automatic metadata filtering to support built-in fields (document_name,
uploader, upload_date, last_update_date, source) when built_in_field_enabled
is true for the dataset.

Changes:
- Added BuiltInField import to both retrieval modules
- Added _get_all_metadata_fields() private method to query both custom and built-in fields
- Refactored _automatic_metadata_filter_func to use the new helper method
- Eliminated code duplication between DatasetRetrieval and KnowledgeRetrievalNode
@dosubot dosubot bot added the labels size:M (This PR changes 30-99 lines, ignoring generated files.) and 👻 feat:rag (Embedding related issue, like qdrant, weaviate, milvus, vector database.) on Nov 3, 2025
@gemini-code-assist
Contributor

Summary of Changes

Hello @Biaoo, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request addresses a critical bug in the automatic metadata filtering functionality where built-in metadata fields were not being properly included in the filtering process, even when enabled. By introducing a dedicated helper method to gather all relevant metadata fields—custom and built-in—and integrating it into the existing filtering logic, this change ensures comprehensive and accurate metadata-based document retrieval. This enhancement significantly improves the flexibility and power of RAG queries, allowing users to leverage a full spectrum of metadata for more precise results.

Highlights

  • Problem Addressed: Previously, the automatic metadata filtering mechanism only considered custom metadata fields, overlooking built-in fields such as 'document_name', 'uploader', 'upload_date', 'last_update_date', and 'source', even when they were enabled for a dataset.
  • New Helper Method Introduced: A new private method, _get_all_metadata_fields(), has been added to both DatasetRetrieval and KnowledgeRetrievalNode classes. This method is responsible for querying and returning both custom metadata fields and, if enabled, built-in metadata fields.
  • Refactored Filtering Logic: The _automatic_metadata_filter_func method in both affected modules has been updated to utilize the new _get_all_metadata_fields() helper. This ensures that the filtering process now comprehensively considers all relevant metadata fields, both custom and built-in.
  • Enhanced Query Capabilities: The fix enables automatic metadata filtering to work correctly with both custom and built-in metadata fields, allowing for more robust and flexible queries, such as filtering documents by uploader or upload date.

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request correctly fixes an issue where built-in metadata fields were not included in automatic filtering. The introduction of the _get_all_metadata_fields helper method is a good step towards isolating this logic. However, my review identifies a few areas for improvement. Specifically, I've pointed out an opportunity to make the database queries more efficient, a case of code duplication that could be resolved by further refactoring, and a missing check that could prevent an unnecessary LLM call. Addressing these points will enhance the code's performance and maintainability.

Comment on lines +966 to +988
    def _get_all_metadata_fields(self, dataset_ids: list) -> list[str]:
        """
        Get all metadata field names for the given datasets, including both custom and built-in fields.

        :param dataset_ids: list of dataset IDs
        :return: list of metadata field names
        """
        # Get custom metadata fields
        metadata_stmt = select(DatasetMetadata).where(DatasetMetadata.dataset_id.in_(dataset_ids))
        metadata_fields = db.session.scalars(metadata_stmt).all()
        all_metadata_fields = [metadata_field.name for metadata_field in metadata_fields]

        # Check if any dataset has built-in fields enabled
        datasets_stmt = select(Dataset).where(Dataset.id.in_(dataset_ids))
        datasets = db.session.scalars(datasets_stmt).all()
        built_in_enabled = any(dataset.built_in_field_enabled for dataset in datasets)

        # Add built-in fields if enabled
        if built_in_enabled:
            built_in_fields = [field.value for field in BuiltInField]
            all_metadata_fields.extend(built_in_fields)

        return all_metadata_fields

medium

This method is identical to _get_all_metadata_fields in api/core/workflow/nodes/knowledge_retrieval/knowledge_retrieval_node.py. The PR description mentions eliminating code duplication, but this change introduces new duplication. This logic could be extracted into a shared utility function or a static method on the Dataset model to adhere to the DRY principle and improve maintainability.
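One way the shared logic could look, as a minimal sketch: the merge step is pulled into a pure function so both modules can call it after running their own queries. The name `merge_metadata_field_names` is hypothetical, and the `BuiltInField` enum below is a stand-in whose values are taken from the field names listed in the PR description, not the project's actual definition.

```python
from enum import Enum
from typing import Iterable


class BuiltInField(str, Enum):
    """Stand-in enum; values mirror the built-in field names named in the PR description."""

    document_name = "document_name"
    uploader = "uploader"
    upload_date = "upload_date"
    last_update_date = "last_update_date"
    source = "source"


def merge_metadata_field_names(custom_names: Iterable[str], built_in_enabled: bool) -> list[str]:
    """Return custom field names, followed by the built-in names when they are enabled."""
    names = list(custom_names)
    if built_in_enabled:
        # Enum members iterate in definition order, so the result is deterministic.
        names.extend(field.value for field in BuiltInField)
    return names
```

Keeping the database access out of this function means it can be unit-tested without a session, while each caller stays responsible for how it loads the custom names and the enabled flag.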

Comment on lines +979 to +981
datasets_stmt = select(Dataset).where(Dataset.id.in_(dataset_ids))
datasets = db.session.scalars(datasets_stmt).all()
built_in_enabled = any(dataset.built_in_field_enabled for dataset in datasets)

medium

Fetching all dataset objects just to check if any have built_in_field_enabled is inefficient. This can be optimized by performing the check directly in the database, which avoids transferring unnecessary data and processing it in Python.

Suggested change
- datasets_stmt = select(Dataset).where(Dataset.id.in_(dataset_ids))
- datasets = db.session.scalars(datasets_stmt).all()
- built_in_enabled = any(dataset.built_in_field_enabled for dataset in datasets)
+ datasets_stmt = select(Dataset.id).where(Dataset.id.in_(dataset_ids), Dataset.built_in_field_enabled).limit(1)
+ built_in_enabled = db.session.scalar(datasets_stmt) is not None
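To illustrate the idea behind this suggestion outside the project's ORM, here is a small sketch using the standard-library `sqlite3` module. The `datasets` table layout and the function name `any_built_in_enabled` are assumptions made for the demo; only the "filter in the database and stop at the first match" pattern is taken from the review comment.

```python
import sqlite3
from typing import Sequence


def any_built_in_enabled(conn: sqlite3.Connection, dataset_ids: Sequence[str]) -> bool:
    """Ask the database whether at least one of the datasets has built-in fields enabled."""
    if not dataset_ids:
        return False
    # One "?" per ID keeps the query parameterized.
    placeholders = ", ".join("?" for _ in dataset_ids)
    row = conn.execute(
        f"SELECT 1 FROM datasets WHERE id IN ({placeholders}) "
        "AND built_in_field_enabled = 1 LIMIT 1",
        list(dataset_ids),
    ).fetchone()
    # At most one tiny row comes back; no full dataset records are loaded.
    return row is not None


# Demo data in an in-memory database (hypothetical schema).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE datasets (id TEXT PRIMARY KEY, built_in_field_enabled INTEGER NOT NULL)"
)
conn.executemany("INSERT INTO datasets VALUES (?, ?)", [("a", 0), ("b", 1), ("c", 0)])
```

With `LIMIT 1` the database can stop at the first matching row, so the amount of data transferred does not grow with the number of datasets passed in.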

Comment on lines +537 to +539
datasets_stmt = select(Dataset).where(Dataset.id.in_(dataset_ids))
datasets = db.session.scalars(datasets_stmt).all()
built_in_enabled = any(dataset.built_in_field_enabled for dataset in datasets)

medium

Fetching all dataset objects to check if any have built_in_field_enabled is inefficient. This can be optimized by letting the database perform the check, which avoids transferring unnecessary data and processing it in Python.

Suggested change
- datasets_stmt = select(Dataset).where(Dataset.id.in_(dataset_ids))
- datasets = db.session.scalars(datasets_stmt).all()
- built_in_enabled = any(dataset.built_in_field_enabled for dataset in datasets)
+ datasets_stmt = select(Dataset.id).where(Dataset.id.in_(dataset_ids), Dataset.built_in_field_enabled).limit(1)
+ built_in_enabled = db.session.scalar(datasets_stmt) is not None

Comment on lines +554 to +555
all_metadata_fields = self._get_all_metadata_fields(dataset_ids)


medium

For consistency and efficiency, you should add a check here to return early if all_metadata_fields is empty, similar to the implementation in dataset_retrieval.py. This avoids an unnecessary LLM call when no metadata fields are available.

Suggested change
- all_metadata_fields = self._get_all_metadata_fields(dataset_ids)
+ all_metadata_fields = self._get_all_metadata_fields(dataset_ids)
+ if not all_metadata_fields:
+     return [], usage
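The effect of such a guard can be shown with a stubbed model call. This is only a sketch: `extract_filters`, `fake_llm`, and the shape of the returned filter dicts are hypothetical and do not reflect the node's real signature or return type.

```python
from typing import Callable

# Type of the (hypothetical) model call: field names and a query in, filter dicts out.
LLMCall = Callable[[list[str], str], list[dict]]


def extract_filters(field_names: list[str], query: str, call_llm: LLMCall) -> list[dict]:
    """Return metadata filters for the query, skipping the model call when no fields exist."""
    if not field_names:
        # Nothing to filter on, so the model is never invoked.
        return []
    return call_llm(field_names, query)


calls: list[str] = []


def fake_llm(field_names: list[str], query: str) -> list[dict]:
    """Stub that records each invocation instead of contacting a model."""
    calls.append(query)
    return [{"metadata_name": field_names[0], "value": "demo"}]
```

Calling `extract_filters([], "any query", fake_llm)` leaves `calls` empty, which is the saving the review comment is after.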

Development

Successfully merging this pull request may close these issues.

Automatic metadata filtering does not work with built-in metadata fields

2 participants