Skip to content

Conversation

@casteryh
Copy link
Contributor

Implement delete functionality across the torchstore distributed storage system to allow removal of stored tensors and objects.

Current problems

  • in my local monarch environment, actor_mesh.slice() seems to accept different arguments. Work around is to use choose instead of explicitly choosing a slice.
  • rdma seems broken (I don't think this is introduced in my code), test fails on put().

See P1955672937 for error logs.

@LucasLLC @kaiyuan-li what is your monarch version?

casteryh and others added 2 commits September 19, 2025 16:27
Summary:
Implement delete functionality across the torchstore distributed storage system to allow removal of stored tensors and objects.

This adds the delete API to all layers of the storage system:
- API layer: Added delete() function with proper documentation
- Client layer: Implemented distributed delete across storage volumes with asyncio.gather
- Controller layer: Added notify_delete() to maintain key index consistency
- Storage layer: Added delete() method to storage implementations with proper error handling

The delete operation ensures data consistency by removing entries from all storage volumes where the key exists and updating the controller's key index accordingly.

Also includes comprehensive test coverage that verifies:
- Sequential deletion of multiple tensors
- Verification that deleted keys no longer exist
- Confirmation that remaining keys are unaffected during incremental deletion
- Proper error handling when attempting to access deleted keys

Test Plan:
Added test_delete() function that:
- Tests deletion across multiple storage volumes and processes
- Verifies each deletion incrementally (delete one, verify it's gone, verify others remain)
- Confirms deleted tensors cannot be retrieved (proper error handling)
- Runs with different transport and strategy configurations like existing tests
@meta-cla meta-cla bot added the CLA Signed This label is managed by the Meta Open Source bot. label Sep 20, 2025
Copy link
Contributor

@LucasLLC LucasLLC left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lgtm! Left a comment on correct order of operations. If we notify of delete first, this presents other storage volumes from initiating new read requests while we're attempting to delete.

@LucasLLC
Copy link
Contributor

Wait for tests to pass please!

@casteryh
Copy link
Contributor Author

@kaiyuan-li
disabled this test because it's causing CI to get stuck. Weirdly, my own tests are passing when it does get stuck but disabling the test will unstuck the CI.
please approve if I can proceed to merge.

@kaiyuan-li
Copy link
Contributor

Feel free to merge and I'll fix the CI tests.

@casteryh casteryh merged commit bef9ba7 into meta-pytorch:main Sep 26, 2025
3 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

CLA Signed This label is managed by the Meta Open Source bot.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants