@casteryh (Contributor) commented Aug 26, 2025

Summary:
This PR proposes an interface to serve as the backend of the replay buffer. It includes the methods that would be useful when building a replay buffer.

This also doubles as a proposal for the actual torchstore API.

Test Plan:
n/a

@meta-cla meta-cla bot added the CLA Signed label Aug 26, 2025
@casteryh casteryh changed the title from "Add BufferView and RawBuffer interfaces" to "[RFC] BufferView and RawBuffer interfaces" Aug 26, 2025
@casteryh casteryh requested a review from joecummings August 26, 2025 23:29
@casteryh casteryh changed the title from "[RFC] BufferView and RawBuffer interfaces" to "Add BufferView and RawBuffer interfaces" Aug 27, 2025
@casteryh casteryh changed the title from "Add BufferView and RawBuffer interfaces" to "[RFC] Add BufferView and RawBuffer interfaces" Aug 27, 2025
@casteryh casteryh requested a review from ebsmothers August 27, 2025 22:04
@casteryh casteryh changed the title from "[RFC] Add BufferView and RawBuffer interfaces" to "[RFC] StoreInterface" Sep 9, 2025
@casteryh casteryh requested review from DNXie and LucasLLC September 9, 2025 22:10
@casteryh (Contributor Author) commented Sep 9, 2025

@LucasLLC

@casteryh (Contributor Author) commented Sep 9, 2025

Also @kaiyuan-li


# TODO(yuxuanh): add this to torchstore.
@abstractmethod
async def release(self, key: str) -> None:
Member

Interesting idea: what's the inspiration here?

Contributor Author

@casteryh casteryh Sep 9, 2025

Say the trainer is at step 10 and will no longer need anything from step 5. A reasonable thing to do would be to simply mark all keys starting with replay_buffer.step_5 as released and move on, instead of waiting for them to be actually deleted.

While from the torchstore side it's probably easier to implement this as instant deletion right now, it would be nice to have these semantics in place for if and when we hit a scale where this matters.
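
To make the proposed semantics concrete, here is a minimal sketch of "release now, reclaim later" on a toy in-memory store. Everything in it is an assumption for illustration: `InMemoryStore`, `put`, `get`, and `collect` are hypothetical names, not torchstore APIs; only the `release` and `release_all` signatures come from this PR.

```python
import asyncio


class InMemoryStore:
    """Hypothetical toy store; not the torchstore implementation."""

    def __init__(self) -> None:
        self._data: dict[str, object] = {}
        self._released: set[str] = set()

    async def put(self, key: str, value: object) -> None:
        self._data[key] = value
        # Writing a key again makes it live, even if it was released before.
        self._released.discard(key)

    async def get(self, key: str) -> object:
        # Released keys behave as if they were already gone.
        if key in self._released or key not in self._data:
            raise KeyError(key)
        return self._data[key]

    async def release(self, key: str) -> None:
        # Cheap: only records intent; the value is reclaimed in collect().
        if key in self._data:
            self._released.add(key)

    async def release_all(self, prefix: str) -> None:
        for key in list(self._data):
            if key.startswith(prefix):
                self._released.add(key)

    def collect(self) -> int:
        """Reclaim every released key; return how many were removed."""
        removed = len(self._released)
        for key in self._released:
            self._data.pop(key, None)
        self._released.clear()
        return removed


async def demo() -> tuple[bool, int, list[str]]:
    store = InMemoryStore()
    await store.put("replay_buffer.step_5.obs", [1, 2])
    await store.put("replay_buffer.step_5.act", [0])
    await store.put("replay_buffer.step_10.obs", [3, 4])

    # The trainer is done with step 5: mark it released and move on.
    await store.release_all("replay_buffer.step_5.")
    try:
        await store.get("replay_buffer.step_5.obs")
        hidden = False
    except KeyError:
        hidden = True

    removed = store.collect()  # could run later, off the hot path
    return hidden, removed, sorted(store._data)


result = asyncio.run(demo())
```

Note the trailing dot in the prefix: a bare `replay_buffer.step_5` would also match `replay_buffer.step_50` under plain prefix matching.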

Contributor Author

@casteryh casteryh Sep 9, 2025

On the other hand, since everything is implemented in Python, it's probably fast enough to just delete instantly, since we don't deallocate memory when deleting. Indeed, all keys are currently held by a single-process Controller actor in torchstore, so it makes less sense to reinvent GC ourselves.

Things do get complicated if we need to shard the controller, and it's much easier to just not make any promises.

In this regard, we should probably remove the delete[_all] methods altogether, as it would be a nightmare to implement them correctly in a distributed setting.

Contributor

There could be some perf gains here if we notify the controller of the delete and then let the storage volumes garbage-collect later.
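
A rough sketch of that split, assuming a controller that owns the key index and volumes that own the payloads. `Controller`, `StorageVolume`, `mark_deleted`, and `garbage_collect` are hypothetical names for illustration and do not reflect the actual torchstore classes; the round-robin placement is likewise just a stand-in.

```python
class StorageVolume:
    """Hypothetical volume holding the actual payloads."""

    def __init__(self) -> None:
        self.blobs: dict[str, object] = {}
        self.pending: set[str] = set()

    def put(self, key: str, value: object) -> None:
        self.blobs[key] = value
        self.pending.discard(key)

    def mark_deleted(self, key: str) -> None:
        # Only a tombstone is recorded here; nothing is freed yet.
        self.pending.add(key)

    def garbage_collect(self) -> int:
        """Free tombstoned payloads; return how many were freed."""
        freed = 0
        for key in self.pending:
            if key in self.blobs:
                del self.blobs[key]
                freed += 1
        self.pending.clear()
        return freed


class Controller:
    """Hypothetical controller holding the key -> volume index."""

    def __init__(self, volumes: list[StorageVolume]) -> None:
        self.volumes = volumes
        self.index: dict[str, int] = {}
        self._next = 0  # round-robin placement, an arbitrary choice

    def put(self, key: str, value: object) -> None:
        if key not in self.index:
            self.index[key] = self._next
            self._next = (self._next + 1) % len(self.volumes)
        self.volumes[self.index[key]].put(key, value)

    def contains(self, key: str) -> bool:
        return key in self.index

    def delete(self, key: str) -> bool:
        # Fast path: drop the index entry and notify the owning volume.
        vol = self.index.pop(key, None)
        if vol is None:
            return False
        self.volumes[vol].mark_deleted(key)
        return True


volumes = [StorageVolume(), StorageVolume()]
controller = Controller(volumes)
for step in (5, 10):
    for name in ("obs", "act"):
        controller.put(f"replay_buffer.step_{step}.{name}", step)

controller.delete("replay_buffer.step_5.obs")
controller.delete("replay_buffer.step_5.act")

visible_after_delete = controller.contains("replay_buffer.step_5.obs")
blobs_before_gc = sum(len(v.blobs) for v in volumes)
freed = [v.garbage_collect() for v in volumes]
blobs_after_gc = sum(len(v.blobs) for v in volumes)
```

In this sketch the delete is visible to readers immediately, since the controller index is the source of truth, while the volumes reclaim memory whenever they next run `garbage_collect`.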


# TODO(yuxuanh): add this to torchstore.
@abstractmethod
async def release_all(self, prefix: str) -> None:
Member

what's the difference between release and delete?

Contributor

And delete?

Member

I meant the difference between the two functions, release and delete.

@LucasLLC
Contributor

I like the ideas here, but as a general practice I think we should only add what we need, as it's needed and used. Great work!

@casteryh casteryh closed this Oct 5, 2025
4 participants