Enhance parquet reader with MMAP and with metadata_cache#13

Open
wolfgang-desalvador wants to merge 1 commit into main from wdesalvador/enhance-parquet-reader

@wolfgang-desalvador

This pull request adds new configuration options to the ParquetReader in dlio_benchmark/reader/parquet_reader.py to improve performance and flexibility. The main changes introduce support for metadata caching and memory-mapped I/O, which can significantly speed up repeated file access and large file reads. The implementation includes new flags, updates to the YAML configuration, and logic to cache and reuse Parquet file metadata.
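To make the two ideas concrete, here is a minimal standard-library sketch of what "footer metadata" and "memory-mapped I/O" mean for a Parquet file. It is not the code from this PR. It relies only on the Parquet file layout, in which an unencrypted file ends with a 4-byte little-endian footer length followed by the magic bytes `PAR1`. The function name `read_footer_bytes` is hypothetical.

```python
import mmap
import struct

PARQUET_MAGIC = b"PAR1"  # trailing magic of an unencrypted Parquet file


def read_footer_bytes(path: str) -> bytes:
    """Return the raw (Thrift-encoded) footer of a Parquet file via mmap.

    Only the pages touched by the slices below are read from storage,
    which is the appeal of memory mapping for large files.
    """
    with open(path, "rb") as f, mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        size = len(mm)
        # Smallest valid layout: leading magic + footer length + trailing magic.
        if size < 12 or mm[size - 4:size] != PARQUET_MAGIC:
            raise ValueError(f"{path} does not look like a Parquet file")
        # The 4 bytes before the trailing magic hold the footer length.
        (footer_len,) = struct.unpack("<I", mm[size - 8:size - 4])
        start = size - 8 - footer_len
        if start < 4:
            raise ValueError(f"{path} has an invalid footer length")
        return bytes(mm[start:size - 8])
```

Decoding these bytes is what a Parquet library does each time it opens a file, so caching the decoded result is what saves work on repeated opens.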

Performance and configuration improvements:

  • Added metadata_cache and memory_map options to the YAML configuration and class, allowing users to enable Parquet footer metadata caching and memory-mapped I/O for faster file access.
  • Implemented a metadata cache (self._metadata_cache) in ParquetReader to store and reuse Parquet file metadata and cumulative row group offsets, reducing redundant reads when metadata_cache is enabled.
  • Updated the open() method to utilize the metadata cache and memory-mapped I/O, improving efficiency when opening files multiple times.
  • Enhanced logging to report the status of the new configuration flags for easier debugging and performance analysis.
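The caching idea in the list above can be sketched as follows. This is an illustration under stated assumptions, not the PR's implementation: the class `FooterCache`, its `enabled` flag, and the `locate` helper are hypothetical names, and the cache here keeps only the cumulative row group offsets, whereas the PR describes caching the file metadata as well. The default loader uses `pyarrow.parquet.read_metadata`, which reads just the footer. A different loader can be injected, so the cache logic also runs without pyarrow.

```python
import bisect
from itertools import accumulate
from typing import Callable, Dict, List, Tuple


def _row_counts_from_pyarrow(path: str) -> List[int]:
    """Default loader: read the Parquet footer and return rows per row group."""
    import pyarrow.parquet as pq  # lazy import; only needed for real files

    md = pq.read_metadata(path)
    return [md.row_group(i).num_rows for i in range(md.num_row_groups)]


class FooterCache:
    """Hypothetical per-file cache of cumulative row group offsets."""

    def __init__(
        self,
        loader: Callable[[str], List[int]] = _row_counts_from_pyarrow,
        enabled: bool = True,
    ) -> None:
        self._loader = loader
        self._enabled = enabled  # plays the role of a metadata_cache flag
        self._cache: Dict[str, List[int]] = {}

    def offsets(self, path: str) -> List[int]:
        """Return [0, n0, n0+n1, ...], loading the footer at most once if enabled."""
        if self._enabled and path in self._cache:
            return self._cache[path]
        offsets = [0, *accumulate(self._loader(path))]
        if self._enabled:
            self._cache[path] = offsets
        return offsets

    def locate(self, path: str, row: int) -> Tuple[int, int]:
        """Map a file-wide row index to (row_group_index, row_within_group)."""
        offsets = self.offsets(path)
        if not 0 <= row < offsets[-1]:
            raise IndexError(f"row {row} out of range for {path}")
        # Last offset <= row marks the containing group; empty groups are skipped.
        group = bisect.bisect_right(offsets, row) - 1
        return group, row - offsets[group]
```

With the cache enabled, repeated `locate` calls on the same path trigger a single footer read. With it disabled, every call reloads the footer, which matches the behaviour the flag is meant to toggle.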

Documentation updates:

  • Updated class and configuration documentation to describe the new options and their effects on file reading behavior.
