
feat(workflow): pass project/public scope to knowledge retrieval #33855

Open
zt15242 wants to merge 6 commits into langgenius:main from zt15242:feature/project-space

Conversation

@zt15242 zt15242 commented Mar 21, 2026

Summary

Pass project_id and include_public through the workflow knowledge retrieval path so dataset retrieval can enforce project/public space scope correctly.

Changes

  • add project_id and include_public to workflow/app config dataset entity
  • propagate scope fields through workflow converter -> node data -> retrieval request
  • apply dataset availability filtering by project/public scope in dataset retrieval
  • add unit tests for:
    • workflow converter scope propagation
    • knowledge retrieval node request propagation
    • dataset retrieval scope handoff

Verification

Passed:

  • api/tests/unit_tests/services/workflow/test_workflow_converter.py
  • api/tests/unit_tests/core/workflow/nodes/knowledge_retrieval/test_knowledge_retrieval_node.py
  • api/tests/unit_tests/core/rag/retrieval/test_dataset_retrieval_scope_patch.py

Result:

  • 26 passed

Docker validation:

  • docker compose config
  • docker compose build api

@dosubot dosubot bot added the size:L This PR changes 100-499 lines, ignoring generated files. label Mar 21, 2026
@gemini-code-assist
Contributor

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a significant enhancement to knowledge retrieval by implementing project and public scope filtering for datasets. The changes enable more precise control over which datasets are accessible within different contexts, such as specific projects or publicly available data. This ensures that the knowledge retrieval process correctly enforces data visibility rules, improving data governance and relevance through schema updates, API modifications, and core logic adjustments.

Highlights

  • Enhanced Dataset Model with Scoping: The dataset model has been extended to include project_id and space_type fields, allowing for granular control over dataset visibility and access within personal, project, or public scopes. A new DatasetSpaceType enum was introduced to define these categories.
  • Workflow Knowledge Retrieval Scope Propagation: The workflow's knowledge retrieval path now correctly propagates project_id and include_public parameters from the app configuration through the workflow node data to the final retrieval request. This ensures that dataset retrieval respects the defined project and public space scopes.
  • API and Service Layer Updates for Scoping: Dataset listing, creation, and update APIs have been modified to accept and filter by the new project_id and space_type fields. The core DatasetService now includes logic to apply these filters when retrieving datasets.
  • Improved Tenant Isolation for Dataset Access: All dataset retrieval calls across console controllers have been refactored to utilize a new get_dataset_in_tenant method. This change enhances tenant isolation by ensuring that datasets are always retrieved within the context of the current tenant.
  • Database Migration for New Fields: A new Alembic migration script has been added to introduce the project_id and space_type columns to the datasets table in the database, along with appropriate indexes.
  • Comprehensive Unit Test Coverage: New unit tests have been added and existing ones updated to validate the correct propagation of scope parameters through the workflow and the accurate application of the new filtering logic in dataset retrieval.
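
Based on the highlights above, the new enum can be sketched like this (the member values are taken from the validation error message quoted later in the review; the exact class definition in the PR may differ):

```python
from enum import Enum

class DatasetSpaceType(Enum):
    # Values match those listed in the PR's space_type validation message:
    # "Allowed values: personal, project, public."
    PERSONAL = "personal"
    PROJECT = "project"
    PUBLIC = "public"

# Round-trip a raw string from an API request into the canonical value.
print(DatasetSpaceType("project").value)  # project
```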


@gemini-code-assist gemini-code-assist bot left a comment
Code Review

This pull request introduces project and public scoping for knowledge retrieval by adding project_id and space_type fields to datasets. The changes are propagated through various layers of the application, from API controllers to the core retrieval logic, and include necessary database migrations and unit tests. The implementation is mostly solid, but I've identified a logical flaw in the dataset scoping logic that could lead to incorrect retrieval results, and an opportunity to reduce code duplication for better maintainability. My feedback focuses on correcting the scoping bug and improving the code structure.

Comment on lines +1809 to +1820
```python
space_scope = []
if project_id:
    space_scope.append(
        and_(
            Dataset.space_type == DatasetSpaceType.PROJECT.value,
            Dataset.project_id == project_id,
        )
    )
if include_public:
    space_scope.append(Dataset.space_type == DatasetSpaceType.PUBLIC.value)
if not space_scope:
    space_scope.append(Dataset.space_type != DatasetSpaceType.PUBLIC.value)
```

Severity: high

There's a logical issue in how the space_scope is constructed. When project_id is None and include_public is True, the current logic incorrectly filters for only public datasets. The expected behavior should be to retrieve both public datasets and the user's accessible non-public (personal, project) datasets.

The proposed suggestion refactors this logic to be more explicit and correct, ensuring the right combination of scopes is applied in all cases.

```python
space_scope = []
if project_id:
    space_scope.append(
        and_(
            Dataset.space_type == DatasetSpaceType.PROJECT.value,
            Dataset.project_id == project_id,
        )
    )
else:
    # If no project is specified, user can access their non-public datasets.
    space_scope.append(Dataset.space_type != DatasetSpaceType.PUBLIC.value)

if include_public:
    space_scope.append(Dataset.space_type == DatasetSpaceType.PUBLIC.value)
```
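
The behavioral difference can be checked in isolation by restating both versions as plain-Python predicates over a dataset's `(space_type, project_id)` pair. This is a sketch for illustration only; the real code builds SQLAlchemy expressions that are combined with `or_()`:

```python
def visible_original(space_type, ds_project, project_id, include_public):
    # Mirrors the PR's branch structure; clauses are OR-ed together.
    clauses = []
    if project_id:
        clauses.append(space_type == "project" and ds_project == project_id)
    if include_public:
        clauses.append(space_type == "public")
    if not clauses:
        clauses.append(space_type != "public")
    return any(clauses)

def visible_suggested(space_type, ds_project, project_id, include_public):
    # Mirrors the suggested fix: the non-public fallback moves into an
    # else branch so it applies whenever no project scope is given.
    clauses = []
    if project_id:
        clauses.append(space_type == "project" and ds_project == project_id)
    else:
        clauses.append(space_type != "public")
    if include_public:
        clauses.append(space_type == "public")
    return any(clauses)

# The divergent case: no project scope, include_public=True.
# The original drops the user's personal datasets; the suggestion keeps them.
print(visible_original("personal", None, None, True))   # False
print(visible_suggested("personal", None, None, True))  # True
```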

Comment on lines +130 to +136
```python
def _validate_space_type(value: str | None) -> str | None:
    if value is None:
        return None
    try:
        return DatasetSpaceType(value).value
    except ValueError:
        raise ValueError("Invalid space_type. Allowed values: personal, project, public.")
```

Severity: medium

This validation logic for space_type is duplicated in api/controllers/service_api/dataset/dataset.py. To improve maintainability and adhere to the DRY (Don't Repeat Yourself) principle, this logic should be extracted into a shared utility function. This would ensure that any future changes to the validation only need to be made in one place.

