-
Notifications
You must be signed in to change notification settings - Fork 20.1k
fix(rag): include built-in metadata fields in automatic filtering #27754
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
Fixed automatic metadata filtering to support built-in fields (document_name, uploader, upload_date, last_update_date, source) when built_in_field_enabled is true for the dataset. Changes: - Added BuiltInField import to both retrieval modules - Added _get_all_metadata_fields() private method to query both custom and built-in fields - Refactored _automatic_metadata_filter_func to use the new helper method - Eliminated code duplication between DatasetRetrieval and KnowledgeRetrievalNode
Summary of ChangesHello @Biaoo, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request addresses a critical bug in the automatic metadata filtering functionality where built-in metadata fields were not being properly included in the filtering process, even when enabled. By introducing a dedicated helper method to gather all relevant metadata fields—custom and built-in—and integrating it into the existing filtering logic, this change ensures comprehensive and accurate metadata-based document retrieval. This enhancement significantly improves the flexibility and power of RAG queries, allowing users to leverage a full spectrum of metadata for more precise results. Highlights
Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here. You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension. Footnotes
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code Review
This pull request correctly fixes an issue where built-in metadata fields were not included in automatic filtering. The introduction of the _get_all_metadata_fields helper method is a good step towards isolating this logic. However, my review identifies a few areas for improvement. Specifically, I've pointed out an opportunity to make the database queries more efficient, a case of code duplication that could be resolved by further refactoring, and a missing check that could prevent an unnecessary LLM call. Addressing these points will enhance the code's performance and maintainability.
| def _get_all_metadata_fields(self, dataset_ids: list) -> list[str]: | ||
| """ | ||
| Get all metadata field names for the given datasets, including both custom and built-in fields. | ||
|
|
||
| :param dataset_ids: list of dataset IDs | ||
| :return: list of metadata field names | ||
| """ | ||
| # Get custom metadata fields | ||
| metadata_stmt = select(DatasetMetadata).where(DatasetMetadata.dataset_id.in_(dataset_ids)) | ||
| metadata_fields = db.session.scalars(metadata_stmt).all() | ||
| all_metadata_fields = [metadata_field.name for metadata_field in metadata_fields] | ||
|
|
||
| # Check if any dataset has built-in fields enabled | ||
| datasets_stmt = select(Dataset).where(Dataset.id.in_(dataset_ids)) | ||
| datasets = db.session.scalars(datasets_stmt).all() | ||
| built_in_enabled = any(dataset.built_in_field_enabled for dataset in datasets) | ||
|
|
||
| # Add built-in fields if enabled | ||
| if built_in_enabled: | ||
| built_in_fields = [field.value for field in BuiltInField] | ||
| all_metadata_fields.extend(built_in_fields) | ||
|
|
||
| return all_metadata_fields |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This method is identical to _get_all_metadata_fields in api/core/workflow/nodes/knowledge_retrieval/knowledge_retrieval_node.py. The PR description mentions eliminating code duplication, but this change introduces new duplication. This logic could be extracted into a shared utility function or a static method on the Dataset model to adhere to the DRY principle and improve maintainability.
| datasets_stmt = select(Dataset).where(Dataset.id.in_(dataset_ids)) | ||
| datasets = db.session.scalars(datasets_stmt).all() | ||
| built_in_enabled = any(dataset.built_in_field_enabled for dataset in datasets) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fetching all dataset objects just to check if any have built_in_field_enabled is inefficient. This can be optimized by performing the check directly in the database, which avoids transferring unnecessary data and processing it in Python.
| datasets_stmt = select(Dataset).where(Dataset.id.in_(dataset_ids)) | |
| datasets = db.session.scalars(datasets_stmt).all() | |
| built_in_enabled = any(dataset.built_in_field_enabled for dataset in datasets) | |
| datasets_stmt = select(Dataset.id).where(Dataset.id.in_(dataset_ids), Dataset.built_in_field_enabled).limit(1) | |
| built_in_enabled = db.session.scalar(datasets_stmt) is not None |
| datasets_stmt = select(Dataset).where(Dataset.id.in_(dataset_ids)) | ||
| datasets = db.session.scalars(datasets_stmt).all() | ||
| built_in_enabled = any(dataset.built_in_field_enabled for dataset in datasets) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fetching all dataset objects to check if any have built_in_field_enabled is inefficient. This can be optimized by letting the database perform the check, which avoids transferring unnecessary data and processing it in Python.
| datasets_stmt = select(Dataset).where(Dataset.id.in_(dataset_ids)) | |
| datasets = db.session.scalars(datasets_stmt).all() | |
| built_in_enabled = any(dataset.built_in_field_enabled for dataset in datasets) | |
| datasets_stmt = select(Dataset.id).where(Dataset.id.in_(dataset_ids), Dataset.built_in_field_enabled).limit(1) | |
| built_in_enabled = db.session.scalar(datasets_stmt) is not None |
| all_metadata_fields = self._get_all_metadata_fields(dataset_ids) | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For consistency and efficiency, you should add a check here to return early if all_metadata_fields is empty, similar to the implementation in dataset_retrieval.py. This avoids an unnecessary LLM call when no metadata fields are available.
| all_metadata_fields = self._get_all_metadata_fields(dataset_ids) | |
| all_metadata_fields = self._get_all_metadata_fields(dataset_ids) | |
| if not all_metadata_fields: | |
| return [], usage |
Fixes #27753
Summary
Fixed automatic metadata filtering to properly support built-in metadata fields when
built_in_field_enabledis enabled for a dataset.Problem: The
_automatic_metadata_filter_funcmethod only queried custom metadata fields from thedataset_metadatastable, ignoring built-in fields (document_name, uploader, upload_date, last_update_date, source) even when they were enabled.Solution:
_get_all_metadata_fields()private method to bothDatasetRetrievalandKnowledgeRetrievalNodeclasses_automatic_metadata_filter_functo use the new helper methodImpact: Automatic metadata filtering now works correctly with both custom and built-in metadata fields, enabling queries like "find documents uploaded by user X" or "show files uploaded in 2024".
Changes
Modified
api/core/rag/retrieval/dataset_retrieval.pyBuiltInFieldimport_get_all_metadata_fields()method (lines 966-987)_automatic_metadata_filter_func()to use the new methodModified
api/core/workflow/nodes/knowledge_retrieval/knowledge_retrieval_node.pyBuiltInFieldimport_get_all_metadata_fields()method (lines 524-547)_automatic_metadata_filter_func()to use the new methodChecklist