Releases: mjakubowski84/parquet4s
v2.19.0
The release introduces several changes to viaParquet:
- the ability to define a default partition value, so that you can partition your data even when the partition column is nullable
- a custom builder for viaParquet in Akka / Pekko, so that you can stream any document format supported by https://github.com/apache/parquet-java (e.g. Avro) and partition it
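A minimal sketch of how the new partitioning capability might be used with the Akka/Pekko module. `ParquetStreams.viaParquet.of[T]`, `partitionBy` and `Col` are part of the library's documented API; the `defaultPartition` method name below is an assumption derived from the release note's description, not a confirmed signature, so it is left commented out:

```scala
import com.github.mjakubowski84.parquet4s.{Col, ParquetStreams, Path}

case class Event(id: Long, region: Option[String], payload: String)

// Hypothetical pipeline: partition by a nullable column, falling back
// to a default partition value when `region` is None.
val flow = ParquetStreams.viaParquet
  .of[Event]
  .partitionBy(Col("region"))
  // assumption: method name inferred from the release note, verify against the API docs
  // .defaultPartition(Col("region"), "unknown")
  .write(Path("/data/events"))
```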
Moreover, to stay compatible with Apache Spark, Parquet4s now URL-encodes partition values during writing and URL-decodes them during reading.
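To illustrate why the encoding matters (plain JVM standard library here; parquet4s's exact escaping rules may differ in detail), a partition value containing a path separator cannot appear literally in a directory name such as `month=2024/01`:

```scala
import java.net.{URLDecoder, URLEncoder}
import java.nio.charset.StandardCharsets

object PartitionEncodingSketch extends App {
  // Written literally, "2024/01" would split the directory name
  // "month=2024/01" into two path segments.
  val raw     = "2024/01"
  val encoded = URLEncoder.encode(raw, StandardCharsets.UTF_8.name) // "2024%2F01"
  val decoded = URLDecoder.decode(encoded, StandardCharsets.UTF_8.name)

  println(encoded)
  assert(decoded == raw) // the round trip is lossless
}
```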
Notable dependency changes:
- Parquet (Java) upgraded to 1.14.1
- Pekko upgraded to 1.0.3
- Slf4j upgraded to 2.0.16
- Protobuf compiler upgraded to 0.11.17
v2.18.0
This release introduces two significant changes:
- Improved internals responsible for reading the content and statistics of Parquet files. The difference is especially noticeable in the case of Stats: it is faster, and you can now also query for the min and max of partition fields.
- Parquet upgraded to 1.14.0. The biggest improvement is support for Hadoop's vectored IO, which you can optionally enable in ParquetReader.Options. It can significantly improve the performance of reading huge files.
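A sketch of enabling vectored IO via the Hadoop configuration. `ParquetReader.Options(hadoopConf = ...)` is part of the library's API; the Hadoop property name below comes from parquet-java and should be verified against the Parquet version you run:

```scala
import com.github.mjakubowski84.parquet4s.ParquetReader
import org.apache.hadoop.conf.Configuration

val conf = new Configuration()
// parquet-java property enabling Hadoop's vectored IO (verify for your setup)
conf.setBoolean("parquet.hadoop.vectored.io.enabled", true)

val options = ParquetReader.Options(hadoopConf = conf)
```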
v2.17.0
Improved reading of partitioned directories
Do you read data from a huge data lake partitioned into lots of directories? You have probably noticed that listing all those directories and the files within takes a lot of time. And then, even when you are interested in just a single partition, you still wait minutes before the files are actually read. Indeed, reading a file can be much faster than locating it in storage. That is why Parquet4s introduces an improvement in listing partitioned directories. When you provide a filter, it is eagerly evaluated against partitions, and partitions that do not match the filter are skipped early. Thanks to that, Parquet4s avoids loading the whole directory tree into memory: it lists only those directories that match the filter. You can expect a huge improvement in the speed of filtering huge data lakes!
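The eager partition pruning kicks in through the regular filter API. A sketch using the library's documented `Col` filters (the case class and paths are illustrative):

```scala
import com.github.mjakubowski84.parquet4s.{Col, ParquetReader, Path}

case class Sale(id: Long, amount: BigDecimal, country: String)

// Only directories matching country=PL are listed and read;
// non-matching partitions are skipped before any file IO happens.
val sales = ParquetReader
  .as[Sale]
  .filter(Col("country") === "PL")
  .read(Path("/lake/sales"))
try sales.foreach(println)
finally sales.close()
```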
Record filter
Parquet4s introduces an experimental RecordFilter. It allows skipping records based on their index in the file. RecordFilter can be used to develop custom low-level solutions.
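A sketch of how an index-based filter might look. `RecordFilter` is named in the release note, but the exact constructor and the way it plugs into the reader are assumptions here, to be checked against the API documentation:

```scala
import com.github.mjakubowski84.parquet4s.{ParquetReader, Path, RecordFilter}

case class Row(id: Long)

// assumption: RecordFilter built from a predicate on the record index
val firstThousand = ParquetReader
  .as[Row]
  .filter(RecordFilter(index => index < 1000))
  .read(Path("/data/rows"))
```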
Other notable changes:
- Fixed a bug in FS2: postWriteHandler now always receives proper counts in the state of the partition
- Various fixes and improvements in examples
- Updated docs
v2.16.1
This small release optimizes the calculation of partition paths in the viaParquet function in the Akka, Pekko and FS2 modules. Resource consumption was lowered and performance significantly improved, especially in applications that utilize multiple nested partitions.
Big thanks to @sndnv for the contribution.
v2.16.0
This release introduces a feature that enables a significant improvement in the performance of reading Parquet files. Parquet storage, like a data lake, usually consists of a huge number of files. How can we speed up the reading of such storage? Simply by reading multiple files in parallel!
Parquet4s by default reads file by file, in sequence. Now, using Akka, Pekko or FS2, you can choose a parallelism level and read multiple files at the same time, while still controlling the utilization of resources. Simply use the option parallelism(n = ???)
when defining your reader.
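For example, with the Akka/Pekko module (the case class and path are illustrative; `parallelism(n = ...)` is the option named in this release):

```scala
import com.github.mjakubowski84.parquet4s.{ParquetStreams, Path}

case class Sale(id: Long, amount: BigDecimal)

// Read up to 4 files concurrently instead of one by one.
val source = ParquetStreams.fromParquet
  .as[Sale]
  .parallelism(n = 4)
  .read(Path("/lake/sales"))
```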
Besides that, there were numerous minor and bugfix dependency updates, e.g. in Pekko, Cats Effect, FS2 and Slf4j.
Big thanks to @calvinlfer for his contribution.
v2.15.1
This release fixes a bug in reading decimal values that are encoded in Parquet as long numbers. Parquet4s used to read such a value as a plain long; now it also applies the scale and the precision.
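The underlying arithmetic, in plain Scala (the values here are illustrative): a decimal stored as an unscaled long must be combined with the scale declared in the schema to recover the real value:

```scala
object DecimalFromLongSketch extends App {
  // Parquet may store decimal(precision = 9, scale = 2) as the unscaled
  // long 123456; the real value is 123456 * 10^(-2).
  val unscaled = 123456L
  val scale    = 2
  val value    = BigDecimal(BigInt(unscaled), scale)
  println(value) // 1234.56
}
```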
v2.15.0
v2.14.2
Versions 2.14.0 and 2.14.1 mistakenly released the parquet4s-scalapb module as parquet4s-scalapb-akka and parquet4s-scalapb-pekko. Version 2.14.2 brings back the sole parquet4s-scalapb module.
v2.14.1
This release fixes generic projection over a group when multiple fields of the group are projected.
v2.14.0
Version 2.14.0 brings a revolution to Parquet4s led mostly by @utkuaydn and @j-madden:
- Parquet4s now supports both Akka and Pekko 🥳
- Upgrade to Scala 2.13.12 and 3.3.1
- Upgrade of SBT to 1.9.x and building project using sbt-projectmatrix
- Support for legacy pyarrow lists when reading files
Big thanks to the contributors!