Polars Cloud client 0.9.0

@TNieuwdorp TNieuwdorp released this 01 Jul 13:52
5cb480f

🏆 Highlight

Distributed expression execution

Expression lowering allows expressions to be planned and executed using distributed algorithms, just like LazyFrame operations such as join, group-by, and select. Previously, evaluating these expressions required shuffling the data to a single stage.

In this release, expression lowering is enabled for select, with_columns, and filter. You'll notice the biggest improvements when combining aggregating expressions such as first, max, mean, or unique with other expressions, for example:

lf.filter(pl.col.price > pl.col.price.mean())

We expect this initial rollout to improve execution plans for many queries. We're continuing to add distributed implementations for more expressions and further optimize the generated plans. We welcome any feedback on how this feature performs on your workloads.
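To make the benefit concrete, consider the filter example above: pl.col.price.mean() needs a global value, but that value can be assembled from small per-partition partial results, so the rows themselves never have to be gathered in one place. The following is a conceptual sketch in plain Python, not Polars Cloud's actual implementation; the partition data and all function names are hypothetical and exist only for illustration.

```python
# Conceptual sketch only: plain Python lists stand in for partitions held by
# different workers. This is NOT how Polars Cloud is implemented internally.

def partial_sum_count(partition):
    """Phase 1, per partition: compute a small partial aggregate."""
    return sum(partition), len(partition)


def combine(partials):
    """Phase 2: merge the partial aggregates into the global mean."""
    total = sum(s for s, _ in partials)
    count = sum(n for _, n in partials)
    return total / count


def filter_above(partition, threshold):
    """Phase 3, per partition again: apply the filter locally."""
    return [price for price in partition if price > threshold]


# Hypothetical price data, split across three workers.
partitions = [[10.0, 20.0], [30.0, 40.0, 50.0], [60.0]]

# Only (sum, count) pairs cross partition boundaries, never the rows.
mean_price = combine([partial_sum_count(p) for p in partitions])
filtered = [filter_above(p, mean_price) for p in partitions]
```

Only the tiny (sum, count) pairs and the resulting mean are exchanged between partitions; the filter itself runs where the data already lives.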

🚀 Performance improvements

This release lands "Add bytes-based concurrency control for cloud IO". With it, we've seen a 20% reduction in runtime on the TPC-H SF1000-32x_m6i.xlarge benchmark. Because this benchmark is IO bound, some queries saw their runtime cut in half.
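As a rough illustration of what bytes-based concurrency control means in general, the sketch below limits the total number of bytes in flight rather than the number of concurrent requests. It is an assumption-laden toy built on asyncio, not the implementation used by Polars; ByteBudget, fetch, and the request sizes are hypothetical names and values.

```python
import asyncio


class ByteBudget:
    """Toy limiter: caps total bytes in flight instead of the request count."""

    def __init__(self, max_bytes: int) -> None:
        self._max_bytes = max_bytes
        self._in_flight = 0
        self._cond = asyncio.Condition()
        self.peak = 0  # highest number of bytes ever in flight at once

    async def acquire(self, nbytes: int) -> None:
        # Clamp so one request larger than the budget can still run alone.
        nbytes = min(nbytes, self._max_bytes)
        async with self._cond:
            await self._cond.wait_for(
                lambda: self._in_flight + nbytes <= self._max_bytes
            )
            self._in_flight += nbytes
            self.peak = max(self.peak, self._in_flight)

    async def release(self, nbytes: int) -> None:
        nbytes = min(nbytes, self._max_bytes)
        async with self._cond:
            self._in_flight -= nbytes
            self._cond.notify_all()  # wake waiters so they re-check the budget


async def fetch(budget: ByteBudget, size: int) -> int:
    await budget.acquire(size)
    try:
        await asyncio.sleep(0)  # stand-in for a real cloud read
        return size
    finally:
        await budget.release(size)


async def main():
    budget = ByteBudget(max_bytes=100)
    sizes = [60, 30, 50, 10, 200]  # hypothetical request sizes in bytes
    results = await asyncio.gather(*(fetch(budget, s) for s in sizes))
    return results, budget.peak


results, peak = asyncio.run(main())
```

With a byte budget, many small reads can run side by side while a few large reads are throttled, which a fixed request-count limit cannot express.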

⚠️ Breaking change

In ClusterContext, the keyword argument compute_address="…" has been superseded by uri="…".

✨ Enhancements

Distributed Unions of Python Scans

You can manually distribute multiple Python scans by calling:

pl.union([custom_scan(partition) for partition in partitions])

Improved ClusterContext API

We've updated the ClusterContext API to make it more flexible and easier to configure:

  • Specify a complete URI instead of only a hostname or IP address.
  • SSL connections now use the system's trusted Certificate Authorities (CAs) by default.
  • Configure Scheduler and Observatory connection settings independently. The insecure flag now only affects TLS behavior instead of changing the protocol, allowing you to use HTTPS while optionally ignoring certificate validation errors via tls_options.
  • Attach custom HTTP headers to every request using extra_headers.
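For readers less familiar with the TLS behavior described in the ClusterContext options, the sketch below shows the same two modes using Python's standard ssl module: verification against the system's trusted CAs by default, and an explicit opt-out that keeps encryption but skips certificate validation. This is a generic illustration and does not use the Polars Cloud client; how ClusterContext maps insecure and tls_options onto these settings is not shown here and should not be inferred from it.

```python
import ssl

# Default mode: verify the server certificate and hostname against the
# system's trusted Certificate Authorities.
secure_ctx = ssl.create_default_context()

# Opt-out mode: the connection is still encrypted, but certificate validation
# errors are ignored. Only sensible for testing, for example against a
# self-signed certificate.
insecure_ctx = ssl.create_default_context()
insecure_ctx.check_hostname = False  # must be disabled before verify_mode
insecure_ctx.verify_mode = ssl.CERT_NONE
```

In both modes the protocol stays HTTPS; only the strictness of certificate checking changes.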

Distributed Iceberg sink

You can now use lf.remote().sink_iceberg() to write Iceberg tables directly from distributed queries. This feature is available when pyiceberg is installed on the cluster.

Manual cluster scaling for on-premise deployments

Clusters can now be scaled manually by adding or removing nodes, provided the cluster is not configured with a fixed number of workers.

New disk I/O metrics

Additional storage metrics are now available:

  • Disk utilization (GB used) for clusters with cgroup I/O accounting enabled.
  • Disk read/write throughput when disk I/O metrics are enabled on the cluster.

Try with Polars Cloud - https://cloud.pola.rs/
On-Prem Releases - https://docs.pola.rs/polars-on-premises/releases/