fix: implement Milvus partitions for incremental ingestion #12

Open
haroon0x wants to merge 1 commit into kubeflow:main from haroon0x:fix/milvus-partitions-separate-branch

haroon0x commented Jan 30, 2026

Fixes #10

Problem

Previously, the ingestion pipeline dropped the entire Milvus collection on every run. This caused:

  1. Data Loss: Re-running the pipeline for one repository deleted the data from all other sources.
  2. Blocked Multi-Source RAG: A unified index covering both documentation and issues could not be maintained.

Solution

Modified pipelines/kubeflow-pipeline.py to use Milvus Partitions for source isolation.

Key Changes:

  • Partition Management: Updated store_milvus to create/use partitions derived from the repo_name (e.g., kubeflow/website becomes kubeflow_website).
  • Selective Sync: The pipeline now only drops and rebuilds the specific partition for the repository being indexed, leaving other partitions untouched.
  • Resource Cleanup: Added collection.release() before dropping partitions to avoid "partition is loaded" errors in Milvus.
  • Unified Search: Search logic remains unchanged, because a Milvus search that does not specify partition_names covers every partition in the collection.
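
For reviewers who want the gist without opening the diff, the partition handling described above can be sketched roughly as follows. This is a minimal illustration and not the code in pipelines/kubeflow-pipeline.py: the helper names partition_name_for and rebuild_partition are hypothetical, the exact sanitization rule (replace anything that is not a letter, digit, or underscore, and prefix a leading digit) is an assumption, and collection is assumed to be a pymilvus Collection that already has a vector index so that load() succeeds.

```python
import re


def partition_name_for(repo_name: str) -> str:
    """Map a repo name such as 'kubeflow/website' to a partition name."""
    # Assumed rule: keep letters, digits, and underscores; replace the rest.
    name = re.sub(r"[^A-Za-z0-9_]", "_", repo_name)
    # Assumed rule: avoid a leading digit by prefixing an underscore.
    if name and name[0].isdigit():
        name = "_" + name
    return name


def rebuild_partition(collection, repo_name: str, rows) -> str:
    """Drop and re-create only the partition for repo_name, then insert rows.

    `collection` is expected to behave like pymilvus.Collection; partitions
    that belong to other repositories are never touched.
    """
    name = partition_name_for(repo_name)
    if collection.has_partition(name):
        # Release first so the drop does not fail on a loaded partition.
        collection.release()
        collection.drop_partition(name)
    collection.create_partition(name)
    collection.insert(rows, partition_name=name)
    collection.flush()
    # Reload so searches see both the new and the preserved partitions.
    collection.load()
    return name
```

Calling rebuild_partition(collection, "kubeflow/website", rows) would then rewrite only kubeflow_website, which is the behaviour the E2E logs below exercise.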

Verification: Verbose E2E Logs

The following logs were captured during a real-world simulation using a standalone Milvus instance in Docker. The test demonstrates two independent sources (kubeflow/website and kubeflow/pipelines) being managed in separate partitions.

==================== STARTING VERBOSE E2E PARTITION TEST ====================

==================== PHASE 1: CONNECTION ====================
Connecting to Milvus server at milvus-standalone:19530...
✅ Connected successfully.

==================== PHASE 2: CLEANUP ====================
✅ Environment ready.

==================== PHASE 3: SCHEMA INITIALIZATION ====================
✅ Collection 'verbose_docs_rag_test' created with 8-dim vectors.

==================== PHASE 4: MULTI-SOURCE INGESTION ====================
Creating partition 'kubeflow_website' for kubeflow/website...
✅ Inserted 1 record into 'kubeflow_website'.
Creating partition 'kubeflow_pipelines' for kubeflow/pipelines...
✅ Inserted 1 record into 'kubeflow_pipelines'.

==================== PHASE 5: INDEXING & LOADING ====================
Creating vector index (IVF_FLAT/L2)...
Loading collection into memory...
✅ Collection loaded and ready for search.

==================== PHASE 6: INITIAL STATE VERIFICATION ====================
Total collection entities: 2
Entities in 'kubeflow_website': 1
Entities in 'kubeflow_pipelines': 1
Search results across all partitions:
  - Hit: Found data from repo 'kubeflow/website' (Score: 0.010000001639127731)
  - Hit: Found data from repo 'kubeflow/pipelines' (Score: 0.8100000619888306)

==================== PHASE 7: INCREMENTAL UPDATE EXECUTION ====================
Simulating pipeline run for 'kubeflow/website'...
Action: Releasing collection from memory...
Action: Dropping ONLY partition 'kubeflow_website'...
Action: Re-creating partition 'kubeflow_website'...
Action: Inserting new data into 'kubeflow_website'...
Action: Reloading collection...
✅ Incremental update for 'kubeflow/website' complete.

==================== PHASE 8: FINAL ISOLATION VERIFICATION ====================
Total entities after update: 3
Verifying data preservation in 'kubeflow_pipelines'...
✅ SUCCESS: Found preserved data in kubeflow_pipelines: 'Documentation for Kubeflow Pipelines'
Verifying updated data in 'kubeflow_website'...
✅ Updated content: 'UPDATED Content for Kubeflow Website'

==================== E2E PARTITION TEST PASSED SUCCESSFULLY ====================

Checklist

  • Code is clean and free of unnecessary comments
  • Partition names are sanitized
  • collection.release() added for safe drops
  • E2E verified with real Milvus
  • Search across all partitions confirmed

@google-oss-prow

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign franciscojavierarceo for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Signed-off-by: haroon0x <haroonbmc0@gmail.com>
haroon0x force-pushed the fix/milvus-partitions-separate-branch branch from 2615268 to 681a644 on January 30, 2026 09:52

Development

Successfully merging this pull request may close these issues.

Bug: Pipeline drops entire Milvus collection on every run
