
Batch ParquetDataset by event rather than by file #731

Closed
OscarBarreraGithub opened this issue Jul 15, 2024 · 3 comments
Labels
feature New feature or request

Comments

@OscarBarreraGithub
Collaborator

The _calculate_sizes function within the ParquetDataset class (which calculates the number of events in each batch) determines each batch size from the length of whole files, so batches are always aligned with file boundaries.

It would be useful to batch by event rather than by file, so we can process high-energy events (which have many rows per event) without manually chunking the .parquet file beforehand (.parquet is well suited to handling large files anyway).

I am working on a fix by updating the way query_table batches events.
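For illustration, a minimal sketch of what event-based batching of a single Parquet table could look like, assuming events are stored contiguously and identified by an event_no column (both are assumptions for this example, not the actual graphnet implementation):

```python
import numpy as np
import pyarrow.parquet as pq

def iter_event_batches(path, events_per_batch=1000):
    """Yield table slices containing roughly `events_per_batch` whole events."""
    table = pq.read_table(path)
    event_ids = table.column("event_no").to_numpy()
    # Row index at which each new event starts (events assumed contiguous).
    event_starts = np.sort(np.unique(event_ids, return_index=True)[1])
    for i in range(0, len(event_starts), events_per_batch):
        row_start = event_starts[i]
        j = i + events_per_batch
        row_stop = event_starts[j] if j < len(event_starts) else table.num_rows
        yield table.slice(row_start, row_stop - row_start)
```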

@OscarBarreraGithub OscarBarreraGithub added the feature New feature or request label Jul 15, 2024
@RasmusOrsoe
Collaborator

Hey @OscarBarreraGithub!

Could you elaborate a little on what you're suggesting here? The parquet files are packed in shuffled chunks of events and are not designed to provide fast, random access at the per-row level, which is what the SQLite format is well suited for.

@OscarBarreraGithub
Collaborator Author

Hey @RasmusOrsoe,

Ah, I see. I was under the impression that the benefit of Parquet is that we can work with larger files (since they take up ~1/7th of the space of .db SQLite files). But that then means we can no longer batch by file, as it becomes too computationally expensive, especially for very high-energy events.

My current workaround is to chunk my large Parquet files in a preprocessing step and then feed this directory to the trainer. Is this the optimal way to do it (rather than batching by the number of events in each file)?
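A rough sketch of that chunking step with pyarrow could look like the following; the event_no column name and the output file naming are placeholders for illustration only:

```python
import numpy as np
import pyarrow.parquet as pq

def chunk_by_events(src, out_dir, events_per_file=5000):
    """Split one large Parquet file into smaller files of ~events_per_file events each."""
    table = pq.read_table(src)
    event_ids = table.column("event_no").to_numpy()
    # Row index at which each new event starts (events assumed contiguous).
    event_starts = np.sort(np.unique(event_ids, return_index=True)[1])
    for n, i in enumerate(range(0, len(event_starts), events_per_file)):
        row_start = event_starts[i]
        j = i + events_per_file
        row_stop = event_starts[j] if j < len(event_starts) else table.num_rows
        pq.write_table(table.slice(row_start, row_stop - row_start),
                       f"{out_dir}/chunk_{n:03d}.parquet")
```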

@RasmusOrsoe
Collaborator

RasmusOrsoe commented Aug 14, 2024

The trade-off between parquet and sqlite in our context is basically this:

SQLite provides a single (sometimes a few) uncompressed file(s) with very fast, random access to rows. This means that the resulting Dataset class points to individual event ids and streams them individually as you train. So it provides you with the ability to dynamically change which part of the dataset you'd like to train on, i.e. you can bundle muons and neutrinos together in the database and choose as you go whether to train on the full dataset or a subsample of it. This format provides the fastest random access with the smallest memory footprint and is quite useful for downstream analytics/plotting.
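For concreteness, the random-access pattern looks roughly like this (the table and column names here are placeholders, not graphnet's actual schema):

```python
import sqlite3

def load_event(db_path, event_no):
    """Fetch all rows belonging to a single event by its id."""
    with sqlite3.connect(db_path) as conn:
        cur = conn.execute("SELECT * FROM pulses WHERE event_no = ?", (event_no,))
        return cur.fetchall()
```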

The Parquet converter gives you many compressed (~8x smaller than SQLite) parquet files, each of which is a random sample of the entire dataset you converted. Because you do not have fast random access to the individual events, this format does not allow you to dynamically change which events you train on; on the other hand, it allows you to train on datasets that would take up many TB of space in SQLite format. Instead, you load one batch at a time and train on its contents sequentially, meaning that the memory footprint is higher than for the SQLite alternative. In the Parquet dataset, you can choose to train on all batches or some of them.
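By contrast, the sequential access pattern for the Parquet batches is roughly the following (the file layout is an assumption for illustration):

```python
import glob
import pyarrow.parquet as pq

def iter_parquet_batches(batch_dir):
    """Load each pre-shuffled Parquet batch file in turn and yield it whole."""
    for path in sorted(glob.glob(f"{batch_dir}/*.parquet")):
        yield pq.read_table(path)  # one full chunk of events loaded into memory
```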

We would have strongly preferred to provide random access to events in the parquet files as well, but I was unable to provide that access at speeds sufficient for real-time streaming of the data. If you have a local version that appears to be able to do this, I would very much like to see the details :-)

I hope this was helpful. More information here: https://graphnet-team.github.io/graphnet/datasets/datasets.html#sqlitedataset-vs-parquetdataset

@OscarBarreraGithub OscarBarreraGithub closed this as not planned Aug 23, 2024