Releases: mjakubowski84/parquet4s
v2.19.0
The release introduces several changes to viaParquet:
- the ability to define a default partition value, so that you can partition your data even when the partition column is nullable
- a custom builder for viaParquet in Akka / Pekko, so that you can stream any document format supported by https://github.com/apache/parquet-java (e.g. Avro) and partition it
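A minimal sketch of how the new partitioning capability might be used with the Akka/Pekko module. `ParquetStreams.viaParquet.of[T]`, `partitionBy` and `Col` are part of the library's documented API; the `defaultPartition` method name below is an assumption derived from the release note's description, not a confirmed signature, so it is left commented out:

```scala
import com.github.mjakubowski84.parquet4s.{Col, ParquetStreams, Path}

case class Event(id: Long, region: Option[String], payload: String)

// Hypothetical pipeline: partition by a nullable column, falling back
// to a default partition value when `region` is None.
val flow = ParquetStreams.viaParquet
  .of[Event]
  .partitionBy(Col("region"))
  // assumption: method name inferred from the release note, verify against the API docs
  // .defaultPartition(Col("region"), "unknown")
  .write(Path("/data/events"))
```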
Moreover, to stay compatible with Apache Spark, Parquet4s now URL-encodes partition values during writing and URL-decodes them during reading.
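To illustrate why the encoding matters (plain JVM standard library here; parquet4s's exact escaping rules may differ in detail), a partition value containing a path separator cannot appear literally in a directory name such as `month=2024/01`:

```scala
import java.net.{URLDecoder, URLEncoder}
import java.nio.charset.StandardCharsets

object PartitionEncodingSketch extends App {
  // Written literally, "2024/01" would split the directory name
  // "month=2024/01" into two path segments.
  val raw     = "2024/01"
  val encoded = URLEncoder.encode(raw, StandardCharsets.UTF_8.name) // "2024%2F01"
  val decoded = URLDecoder.decode(encoded, StandardCharsets.UTF_8.name)

  println(encoded)
  assert(decoded == raw) // the round trip is lossless
}
```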
Notable dependency changes:
- Parquet (Java) upgraded to 1.14.1
- Pekko upgraded to 1.0.3
- Slf4j upgraded to 2.0.16
- Protobuf compiler upgraded to 0.11.17
v2.18.0
This release introduces two significant changes:
- Improved internals responsible for reading the content and statistics of Parquet files. The difference is especially noticeable in the case of Stats: it is faster, and you can now also query for the min and max of partition fields.
- Parquet upgraded to 1.14.0. The biggest improvement is support for Hadoop's vectored IO, which you can optionally enable in ParquetReader.Options. It can significantly improve the performance of reading huge files.
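A sketch of enabling vectored IO via the Hadoop configuration. `ParquetReader.Options(hadoopConf = ...)` is part of the library's API; the Hadoop property name below comes from parquet-java and should be verified against the Parquet version you run:

```scala
import com.github.mjakubowski84.parquet4s.ParquetReader
import org.apache.hadoop.conf.Configuration

val conf = new Configuration()
// parquet-java property enabling Hadoop's vectored IO (verify for your setup)
conf.setBoolean("parquet.hadoop.vectored.io.enabled", true)

val options = ParquetReader.Options(hadoopConf = conf)
```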
v2.17.0
Improved reading of partitioned directories
Do you read data from a huge data lake partitioned into lots of directories? You have probably noticed that listing all those directories and the files within takes a lot of time. And then, even when you are interested in just a single partition, you still wait minutes before the files are actually read. Indeed, reading a file can be much faster than locating it in storage. That is why Parquet4s introduces an improvement in listing partitioned directories. When you provide a filter, it is eagerly evaluated against partitions, and partitions that do not match the filter are skipped early. Thanks to that, Parquet4s avoids loading the whole directory tree into memory: it lists only those directories that match the filter. You can expect a huge improvement in the speed of filtering huge data lakes!
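The eager partition pruning kicks in through the regular filter API. A sketch using the library's documented `Col` filters (the case class and paths are illustrative):

```scala
import com.github.mjakubowski84.parquet4s.{Col, ParquetReader, Path}

case class Sale(id: Long, amount: BigDecimal, country: String)

// Only directories matching country=PL are listed and read;
// non-matching partitions are skipped before any file IO happens.
val sales = ParquetReader
  .as[Sale]
  .filter(Col("country") === "PL")
  .read(Path("/lake/sales"))
try sales.foreach(println)
finally sales.close()
```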
Record filter
Parquet4s introduces an experimental RecordFilter. It allows skipping records based on their index in the file. RecordFilter can be used to develop custom low-level solutions.
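A sketch of how an index-based filter might look. `RecordFilter` is named in the release note, but the exact constructor and the way it plugs into the reader are assumptions here, to be checked against the API documentation:

```scala
import com.github.mjakubowski84.parquet4s.{ParquetReader, Path, RecordFilter}

case class Row(id: Long)

// assumption: RecordFilter built from a predicate on the record index
val firstThousand = ParquetReader
  .as[Row]
  .filter(RecordFilter(index => index < 1000))
  .read(Path("/data/rows"))
```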
Other notable changes:
- Fixed a bug in FS2: postWriteHandler now always receives proper counts in the state of the partition
- Various fixes and improvements in examples
- Updated docs
v2.16.1
This small release optimizes the calculation of partition paths in the viaParquet function in the Akka, Pekko and FS2 modules. Resource consumption was lowered and performance significantly improved, especially in applications that utilize multiple nested partitions.
Big thanks to @sndnv for the contribution.
v2.16.0
This release introduces a feature that enables a significant improvement in the performance of reading Parquet files. Parquet storage, like a data lake, usually consists of a huge number of files. How can we speed up the reading of such storage? Simply by reading multiple files in parallel!
Parquet4s by default reads file by file, in sequence. Now, using Akka, Pekko or FS2, you can choose a parallelism level and read multiple files at the same time, while still controlling the utilization of resources. Simply use the option parallelism(n = ???)
when defining your reader.
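For example, with the Akka/Pekko module (the case class and path are illustrative; `parallelism(n = ...)` is the option named in this release):

```scala
import com.github.mjakubowski84.parquet4s.{ParquetStreams, Path}

case class Sale(id: Long, amount: BigDecimal)

// Read up to 4 files concurrently instead of one by one.
val source = ParquetStreams.fromParquet
  .as[Sale]
  .parallelism(n = 4)
  .read(Path("/lake/sales"))
```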
Besides that, there were numerous minor and bugfix dependency updates, e.g. in Pekko, Cats Effect, FS2 and Slf4j.
Big thanks to @calvinlfer for his contribution.
v2.15.1
This release fixes a bug in reading decimal values that are encoded in Parquet as long numbers. Parquet4s used to read such a value as a plain long; now it also applies the scale and the precision.
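The underlying arithmetic, in plain Scala (the values here are illustrative): a decimal stored as an unscaled long must be combined with the scale declared in the schema to recover the real value:

```scala
object DecimalFromLongSketch extends App {
  // Parquet may store decimal(precision = 9, scale = 2) as the unscaled
  // long 123456; the real value is 123456 * 10^(-2).
  val unscaled = 123456L
  val scale    = 2
  val value    = BigDecimal(BigInt(unscaled), scale)
  println(value) // 1234.56
}
```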
v2.15.0
v2.14.2
Versions 2.14.0 and 2.14.1 mistakenly released the parquet4s-scalapb module as parquet4s-scalapb-akka and parquet4s-scalapb-pekko. Version 2.14.2 brings back the sole parquet4s-scalapb module.
v2.14.1
This release fixes generic projection over a group when multiple fields of the group are projected.
v2.14.0
Version 2.14.0 brings a revolution to Parquet4s led mostly by @utkuaydn and @j-madden:
- Parquet4s now supports both Akka and Pekko 🥳
- Upgrade to Scala 2.13.12 and 3.3.1
- Upgrade of SBT to 1.9.x and building project using sbt-projectmatrix
- Support for legacy pyarrow lists when reading files
Big thanks to the contributors!