Option to use MMIO for large Datasets #310

Open
handshape opened this issue Dec 21, 2022 · 3 comments
Labels
enhancement New feature or request

Comments

@handshape

Is your feature request related to a problem? Please describe.
I'm frequently frustrated by OOMEs when building (or deserializing) Datasets that don't fit in heap memory (or even physical memory).

Describe the solution you'd like
I'd like to see an option in (or around) org.tribuo.Dataset to use mapped memory for storing Examples rather than on-heap.

Describe alternatives you've considered
I've considered subclassing Dataset and reimplementing everything that makes use of the data member, replacing it with an instance of Jan Kotek's MapDB, and using the existing protobuf implementations to marshal the Examples to/from storage. I also considered rolling my own MMIO-backed ISAM instead of MapDB, given how simple the use case is. (A rough sketch of the MapDB idea is below.)

The reason I've not yet done these is that my Datasets are computationally expensive to prepare; I need to serialize and deserialize them when spinning processes up and down, and the existing protobuf-based implementations all instantiate Datasets with on-heap storage.

I've also considered buying a ton of physical memory. ;)
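A rough sketch of the MapDB-backed storage I have in mind (the file name, map layout, and placeholder byte[] values are illustrative only; in practice the values would be protobuf-encoded Examples):

```java
import org.mapdb.DB;
import org.mapdb.DBMaker;
import org.mapdb.HTreeMap;
import org.mapdb.Serializer;

public final class MMIOExampleStore {
    public static void main(String[] args) {
        // Memory-mapped file store: example data lives on disk, not on the Java heap.
        DB db = DBMaker.fileDB("examples.db")
                .fileMmapEnableIfSupported()
                .make();

        // Index -> serialized example bytes.
        HTreeMap<Integer, byte[]> examples = db
                .hashMap("examples", Serializer.INTEGER, Serializer.BYTE_ARRAY)
                .createOrOpen();

        // Placeholder payload; a real store would put protobuf-encoded Examples here.
        examples.put(0, new byte[]{1, 2, 3});

        db.close();
    }
}
```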

@handshape handshape added the enhancement New feature or request label Dec 21, 2022
@Craigacp
Member

Yes, the current dataset representation isn't as memory efficient as we'd like, particularly when deserializing from protobufs. The protobuf deserialization path doesn't deduplicate the feature name strings on the way through (unlike Dataset.add or Java serialization), which will increase memory consumption unless you're running on a JDK new enough to do string deduplication in the garbage collector. If your data is dense, we're already planning to build a dense example which should have half the memory usage of the current sparse example (as it will only store the feature values, not the names).
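(For reference, the GC-level deduplication mentioned above is the -XX:+UseStringDeduplication flag; it requires the G1 collector, which has been the default since JDK 9, but the deduplication flag itself is off unless you opt in.)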

Could you provide a little detail on the size and shape of the datasets you want to work with? Are they sparse or dense? How many features & examples are there?

Tribuo isn't particularly designed for very large datasets, as most of the training methods create a copy of the data to get it into a more compute-friendly representation, and the original dataset will still have a reference on the stack during a training call, meaning it can't be garbage collected. We're investigating online learning support, which will allow some models to scale much further with respect to data size as you'll only need a portion of the data in memory while it is used for training, but that won't be possible for all model types.

@handshape
Author

My current use case has about 80k examples, with about 1000 dense features each. (Short phrases that have been run through BERT or another embedding, plus a few fistfuls of contextual features.)
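For scale, that's roughly 80,000 × 1,000 × 8 bytes ≈ 640 MB of raw double values alone, before any per-Example object or feature-name overhead.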

I anticipate that my number of examples is going to grow exponentially.

Online learning could be made to work (and would be a fantastic addition for other use cases), but I'd much prefer being able to work with the existing Dataset abstraction, and provide my own backing store.

@Craigacp
Member

Craigacp commented Jan 3, 2023

OK, it sounds like the dense example will help you quite a bit. Moving to memory-mapped IO as a supported Tribuo Dataset class will be hard, as neither the protobuf format we're moving to nor the current java.io.Serializable representation is suitable for that, so we'd need to design a new on-disk representation and then live with it for a long time.

If you're happy writing your own dataset, the protobuf serialization mechanism will accept other classes that implement ProtoSerializable, so you could override the serialization machinery and have your own Example and Dataset that each implement ProtoSerializable for the wrapper types but contain your own implementation of the protobuf Message. The serialization infrastructure will then route deserialization of that message to your class automatically via reflection, because the class name is stored in the message. You could then sidestep deserializing the individual examples and just load the dataset with Dataset.deserialize, which will call through to YourMMIODataset.deserializeFromProto(int version, String className, Any message).
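A rough, Tribuo-independent sketch of that dispatch pattern (the class names, the Int64Value payload, and the reflection plumbing below are illustrative stand-ins, not the actual Tribuo internals):

```java
import com.google.protobuf.Any;
import com.google.protobuf.Int64Value;
import java.lang.reflect.Method;

public final class ReflectiveDispatchSketch {

    /** Hypothetical stand-in for a custom MMIO-backed dataset wrapper. */
    public static final class MMIODatasetStub {
        public final long exampleCount;
        private MMIODatasetStub(long exampleCount) { this.exampleCount = exampleCount; }

        // The static hook described above: (int version, String className, Any message).
        public static MMIODatasetStub deserializeFromProto(int version, String className, Any message)
                throws Exception {
            // A real implementation would unpack its own Message type (e.g. one holding
            // the path to the memory-mapped file); Int64Value is just a placeholder.
            Int64Value payload = message.unpack(Int64Value.class);
            return new MMIODatasetStub(payload.getValue());
        }
    }

    public static void main(String[] args) throws Exception {
        // "Serialize": record the class name alongside an Any-packed payload.
        String className = MMIODatasetStub.class.getName();
        Any payload = Any.pack(Int64Value.of(80_000L));

        // "Deserialize": route to the named class via reflection, mirroring the
        // dispatch the serialization infrastructure performs.
        Class<?> target = Class.forName(className);
        Method m = target.getDeclaredMethod("deserializeFromProto", int.class, String.class, Any.class);
        MMIODatasetStub ds = (MMIODatasetStub) m.invoke(null, 0, className, payload);
        System.out.println("Examples: " + ds.exampleCount);
    }
}
```

In a real MMIO-backed Dataset the packed message would presumably carry the path and layout of the memory-mapped file rather than the example data itself, so Dataset.deserialize would only need to re-open the mapping.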
