
Batch ParquetDataset by event rather than by file #731

Closed
OscarBarreraGithub opened this issue Jul 15, 2024 · 3 comments
Labels
feature New feature or request

Comments

@OscarBarreraGithub
Collaborator

The _calculate_sizes function within the ParquetDataset class (which calculates the number of events in each batch) determines each batch size from the length of whole files, so batches are always aligned with file boundaries.

It would be useful to batch by event rather than by file, so we can process high-energy events (which have many rows per event) without manually chunking the .parquet file beforehand (.parquet is well suited to handling large files anyway).

I am working on a fix by updating the way query_table batches events.
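For illustration, a minimal sketch of what event-based batching of a single Parquet table could look like, assuming events are stored contiguously and identified by an event_no column (both are assumptions for this example, not the actual graphnet implementation):

```python
import numpy as np
import pyarrow.parquet as pq

def iter_event_batches(path, events_per_batch=1000):
    """Yield table slices containing roughly `events_per_batch` whole events."""
    table = pq.read_table(path)
    event_ids = table.column("event_no").to_numpy()
    # Row index at which each new event starts (events assumed contiguous).
    event_starts = np.sort(np.unique(event_ids, return_index=True)[1])
    for i in range(0, len(event_starts), events_per_batch):
        row_start = event_starts[i]
        j = i + events_per_batch
        row_stop = event_starts[j] if j < len(event_starts) else table.num_rows
        yield table.slice(row_start, row_stop - row_start)
```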

@OscarBarreraGithub OscarBarreraGithub added the feature New feature or request label Jul 15, 2024
@RasmusOrsoe
Collaborator

Hey @OscarBarreraGithub!

Could you elaborate a little on what you're suggesting here? The parquet files are packed in shuffled chunks of events and are not designed to provide fast, random access at the per-row level, which is what the SQLite format is well suited for.

@OscarBarreraGithub
Collaborator Author

Hey @RasmusOrsoe,

Ah, I see. I was under the impression that the benefit of Parquet is that we can work with larger files (since they take up ~1/7th of the space of .db SQLite files). But that then means we can no longer batch by file, as it becomes too computationally expensive, especially for very high-energy events.

My current workaround is to chunk my large Parquet files in a preprocessing step and then feed this directory to the trainer. Is this the optimal way to do it (rather than batching by the number of events in each file)?
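A rough sketch of that chunking step with pyarrow could look like the following; the event_no column name and the output file naming are placeholders for illustration only:

```python
import numpy as np
import pyarrow.parquet as pq

def chunk_by_events(src, out_dir, events_per_file=5000):
    """Split one large Parquet file into smaller files of ~events_per_file events each."""
    table = pq.read_table(src)
    event_ids = table.column("event_no").to_numpy()
    # Row index at which each new event starts (events assumed contiguous).
    event_starts = np.sort(np.unique(event_ids, return_index=True)[1])
    for n, i in enumerate(range(0, len(event_starts), events_per_file)):
        row_start = event_starts[i]
        j = i + events_per_file
        row_stop = event_starts[j] if j < len(event_starts) else table.num_rows
        pq.write_table(table.slice(row_start, row_stop - row_start),
                       f"{out_dir}/chunk_{n:03d}.parquet")
```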

@RasmusOrsoe
Collaborator

RasmusOrsoe commented Aug 14, 2024

The trade-off between parquet and sqlite in our context is basically this:

SQLite provides a single (sometimes a few) uncompressed file(s) with very fast, random access to rows. This means that the resulting Dataset class points to individual event ids and streams them individually as you train. So it provides you with the ability to dynamically change which part of the dataset you'd like to train on, i.e. you can bundle muons and neutrinos together in the database and choose as you go whether to train on the full dataset or a subsample of it. This format provides the fastest random access with the smallest memory footprint and is quite useful for downstream analytics/plotting.
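For concreteness, the random-access pattern looks roughly like this (the table and column names here are placeholders, not graphnet's actual schema):

```python
import sqlite3

def load_event(db_path, event_no):
    """Fetch all rows belonging to a single event by its id."""
    with sqlite3.connect(db_path) as conn:
        cur = conn.execute("SELECT * FROM pulses WHERE event_no = ?", (event_no,))
        return cur.fetchall()
```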

The Parquet converter gives you many compressed (~8x smaller than SQLite) parquet files, each of which is a random sample of the entire dataset you converted. Because you do not have fast random access to the individual events, this format does not allow you to dynamically change which events you train on; on the other hand, it allows you to train on datasets that would take up many TB of space in SQLite format. Instead, you load one batch at a time and train on its contents sequentially, meaning that the memory footprint is higher than for the SQLite alternative. In the Parquet dataset, you can choose to train on all batches or some of them.
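By contrast, the sequential access pattern for the Parquet batches is roughly the following (the file layout is an assumption for illustration):

```python
import glob
import pyarrow.parquet as pq

def iter_parquet_batches(batch_dir):
    """Load each pre-shuffled Parquet batch file in turn and yield it whole."""
    for path in sorted(glob.glob(f"{batch_dir}/*.parquet")):
        yield pq.read_table(path)  # one full chunk of events loaded into memory
```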

We would have strongly preferred to provide random access to events in the parquet files as well, but I was unable to provide that access at speeds sufficient for real-time streaming of the data. If you have a local version that appears to be able to do this, I would very much like to see the details :-)

I hope this was helpful. More information here: https://graphnet-team.github.io/graphnet/datasets/datasets.html#sqlitedataset-vs-parquetdataset

@OscarBarreraGithub OscarBarreraGithub closed this as not planned Aug 23, 2024