Rust Polars 0.37.0

@github-actions github-actions released this 05 Feb 08:10
f3c4cc5

🏆 Highlights

  • new implementation for String/Binary type. (#13748)

💥 Breaking changes

  • Remove DatetimeChunked::convert_time_zone (#14046)
  • Rename LiteralValue::to_anyvalue to LiteralValue::to_any_value (#14033)
  • Rename drop_columns to drop (#13754)
  • Rename pl.count() to pl.len() (#13719)
  • Rename row_count_name/row_count_offset parameters in IO functions to row_index_* (#13563)
  • Rename with_row_count to with_row_index (#13494)

🚀 Performance improvements

  • prune parquet row groups when is_not_null is used (#14260)
  • use is_between to skip parquet row groups (#14244)
  • Use a compression API that is designed for this use case (#11699) (#14194)
  • Use UnitVec in polars-plan traversal (#14199)
  • use UnitVec in streaming joins (#14197)
  • improve ChunkId (#14175)
  • improve iteration performance (#14126)
  • elide unneeded work in window? (#14108)
  • run window functions more in parallel (#14095)
  • improve skip row group using statistics condition (#14056)
  • improve string/binary reverse performance (#14016)
  • optimize DataFrame.describe by presorting columns (#13822)
  • elide redundant bound checks. (#13909)
  • speedup boolean filter (#13905)
  • speedup binview filter (#13902)
  • improve binview filter (#13878)
  • apply string view GC more conservatively (#13850)
  • add optimized BinaryViewArray comparison kernels (#13839)
  • lazy cache binview bytes len (#13830)
  • fast-path for eager int_range (#13811)
  • Optimize arr.sum for inner non-null bool (#13800)
  • directly embed data ptr in Buffer (#13744)
  • elide parallelism restriction on generic rolling expressions (#13662)
  • ensure time groups are parallelized (#13660)
  • do not eagerly compute bitcount (#13562)
  • optimise SQL engine string concat (#13499)
  • remove lifetime requirement from CategoricalChunkedBuilder (#13319)

✨ Enhancements

  • add u8/i8/u16/i16 parsers to CSV reader (#14241)
  • Implements list.gather_every (#14253)
  • Implements prefix/suffix_fields (#14251)
  • Polish decimal arithmetic (#14172)
  • Introduce arr.to_struct (#14202)
  • Supports map fields name of struct (#14203)
  • make IdxVec generic as UnitVec (#14196)
  • add new arithmetic kernels (#14026)
  • Supports unique and hash_rows for null column (#14111)
  • Implement arithmetic operations for Null columns (#14107)
  • Add strict/non-strict construction of Boolean/Binary series (#14073)
  • Improve Series::from_any_values logic (#14052)
  • Adapt extend_constant to function expr architecture and expressify it (#14058)
  • add integer negation (#14049)
  • list & array measures of dispersion (#13245)
  • gc binview when writing ipc (#14035)
  • When calling convert_time_zone on time-zone-naive datetime, convert as if converting from UTC (#13960)
  • DataFrame supports explode by array column (#13958)
  • improve binary formatting (#13981)
  • preserve Enum information when going to IPC (#13943)
  • support kwargs in plugin 'field' functions and raise error on unsupported binview layout (#13944)
  • support cast decimal to utf8 (#13829)
  • add SQL support for timestamp precision modifier (#13936)
  • support negative indexing and expressions for LEFT, RIGHT and SUBSTR SQL string funcs (#13888)
  • Introduce explode for ArrayNameSpace (#13923)
  • raise better error message for .dt.time on Date column (#13932)
  • List set_operations supports float (#13920)
  • Add ignore_nulls for arr.join (#13919)
  • register 'set_sorted' as batch/elementwise (#13896)
  • move Enum/Categorical categories to binview (#13882)
  • Add ignore_nulls for list.join (#13701)
  • Add ignore_nulls for pl.concat_str (#13877)
  • fix parquet for binview (#13873)
  • support mmap for binview in OOC (#13872)
  • implement ffi for binview (#13871)
  • Support zero fill null strategy for binary and string columns (#13869)
  • Implement/fix unary minus operator -pl.col(...) (#13776)
  • extend SQL EXTRACT with "century", "millennium", and "timezone" parts (#13634)
  • fix binview ipc format (#13842)
  • add SQL support for numeric and/or decimal types (#13739)
  • improve panic message (#13836)
  • Expressify str.zfill (#13790)
  • new implementation for String/Binary type. (#13748)
  • Add nulls_last for Series.sort (#13794)
  • Impl count_matches for array namespace (#13675)
  • Add nulls_last for list/array.sort (#13795)
  • Rename drop_columns to drop (#13754)
  • convert fixed-offset timezones to respective Etc timezone from time zone database (#13738)
  • Expressify str.slice (#13747)
  • implement binview for polars-row (#13736)
  • implement binview for polars-json (#13737)
  • add architecture for polars-flavored IPC (#13734)
  • implement binview comparison kernels (#13715)
  • raise default frame/series repr height from 8 to 10 (#13699)
  • write parquet ColumnOrder (#13672)
  • Impl contains for ArrayNameSpace (#13638)
  • improve rolling() expression formatting (#13657)
  • Implement is_between in Rust (#11945)
  • Expressify pattern of str.extract (#13607)
  • Impl join for ArrayNameSpace (#13586)
  • add SQL engine support for string cast to json (#13624)
  • add SQL engine support for EXTRACT and DATE_PART (#13603)
  • add BinaryView to parquet writer/reader. (#13489)
  • add SQL engine support for POSITION and STRPOS (#13585)
  • is_in support for array dtype (#13559)
  • add new str.find expression, returning the index of a regex pattern or literal substring (#13561)
  • add SQL engine support for LIKE and ILIKE pattern matching (#13522)
  • improve hive partition pruning (#13358) (#13426)
  • don't rechunk by default in lazy scans (#13518)
  • Add cum_count expression function (#13478)
  • add SQL engine support for IF control flow function (#13491)
  • add SQL engine support for MOD function (#13502)
  • return datetime for datetime mean & median (#13417)
  • add SQL engine support for CONCAT_WS string function (#13483)
  • BinaryView/Utf8View IPC support (#13464)
  • Implement wasm Pool::scope (#13476)
  • add SQL engine support for RIGHT and REVERSE string functions (#13461)
  • implement BinaryView and Utf8View in polars-arrow (#13243)
  • add SQL engine support for variadic string CONCAT function (#13428)
  • add support for AND in SQL join-clause context (#13242)
  • Impl ordering ops for array namespace (#13414)
  • add SQL engine support for REPLACE string function (#13431)
  • add SQL engine support for SIGN function (#13429)
  • add SQL engine support for IFNULL function (#13432)
  • additional SQL support for bytes, bit, and hex literals (#13389)

🐞 Bug fixes

  • deduplicate recursive growables (#14264)
  • Fix glimpse overload signature (#14258)
  • allow set operations on list of categoricals (#14110)
  • any/all_horizontal with single input has incorrect type (#14256)
  • load numpy array with np array values #14237 (#14238)
  • Fix join validation for String types (#14229)
  • make csv parser more robust to edge cases (#14210)
  • Fix for set_operations of binary dtype (#14152)
  • fix read_csv date/datetime inference and parsing (#14113)
  • don't see files as hive partitions (#14128)
  • allow eval on list of categoricals (#14132)
  • add missing conditional compile flag for StringFunction::Find (#14129)
  • Forbid casting from Date to Time and vice versa (#14127)
  • preserve old naming convention for multi-value pivot (this will change in 1.0 to no longer redundantly have the column name in the middle) (#14120)
  • Implements gt/lt cmp for null dtype (#14119)
  • ignore comments at beginning of csv if schema provided (#14115)
  • fix pivot when multiple columns are passed. Output is now aligned with what tidyverse / pandas.pivot_table would do (#14048)
  • some temporal conversion errors for datetimes earlier than 1970-01-01 (#14050)
  • Preserve name when casting from categorical (#14085)
  • fix cse bug when window function is nested (#14070)
  • Fix melt panic when there are no value vars (#14057)
  • json_encode should respect the logical type (#14063)
  • improve skip row group using statistics condition (#14056)
  • Raise for .dt.epoch and .dt.timestamp for Duration dtype (#13962)
  • handle SliceSink with empty data (#14025)
  • correct field type schema inference (using read_csv) (#14042)
  • Map AnyValue::Null to datatype Null (#14045)
  • Use int formatter for unsigned ints (#14043)
  • quick fix for multiple chunks binary reverse (#14024)
  • count matches on list categorical (#14021)
  • list.min/max with empty and/or None elements (#14018)
  • allow get access to list of categoricals (#14015)
  • Fix casting from categorical to numeric (#13957)
  • read_csv preserve whitespace and newlines (#13934)
  • append decimal with different scale (#13977)
  • Allow casting integer types to Enum (#13955)
  • arg_min/max on categoricals should respect ordering (#13998)
  • serialize decimal type (#13997)
  • check input type for arr/list.contains (#13959)
  • Allow dtype merge when inner dtype is enum (#13938)
  • recurse less in streaming shared sinks (#13930)
  • ensure order is preserved if streaming from different sources (#13922)
  • Fix is_not_null for Struct columns (#13921)
  • make 100 * pl.col(pl.Boolean).mean() work (#13725)
  • allow extract of numeric from str AnyValue (#13865)
  • single-element .dt.time() and .dt.date() should always preserve sortedness (#13808)
  • prune empty chunks before set operations (#13898)
  • treat null columns as zero in sum_horizontal (#13880)
  • include null count in rolling window validity with min_periods (#13863)
  • don't return NaN as free memory fraction (#13860)
  • parquet hybrid RLE encoding did not always align to bit width (#13883)
  • Add ignore_nulls for list.join (#13701)
  • .dt.time() was panicking for datetimes prior to unix epoch (#13812)
  • Correct err message of check_map_output_len (#13854)
  • allow list creation of decimals (#13851)
  • Implement abs for Decimal, error on Date/Time/Datetime (#13821)
  • decompress the right number of rows when reading compressed CSVs (#13721)
  • rolling nested groups deadlock (#13835)
  • gather_every should work on agg context (#13810)
  • When reading Parquet or Arrow, convert +00:00 timezone to UTC (#13816)
  • Fix segfault of is_in (#13814)
  • don't panic on full null qcut (#13815)
  • do not read data for zero-length compressed buffer (#13791)
  • Fix the non-null test of transpose (#13783)
  • Raise error instead of panic when joining on wildcard/nth (#13742)
  • str.concat correctly ignore single null value (#13751)
  • Selectors by_name and by_dtype should allow empty list as input (#11024)
  • Use NonZeroUsize for batch_size parameter in write_csv/sink_csv/scan_ndjson (#13726)
  • error instead of panicking in sql if empty function (#13691)
  • gather.get schema (#13679)
  • ensure we hit proper cache in nested rolling expressions (#13666)
  • Allow av_buffer cast numeric record to temporal type (#13661)
  • streaming cross join if swapped is hit (#13656)
  • Make sure rolling key is projected when process projection (#13622)
  • fix schema inference for json (#13637)
  • Empty series of AggregatedList should also have list dtype (#13620)
  • fallback to cast kernel if inline_cast AnyValue raise (#13595)
  • LazyFrame::join() no longer ignores 3 JoinArgs parameters (#13570)
  • fix reverse variable row decoding (#13587)
  • Fix scatter for null values (#13578)
  • Fix cum_count with regards to start value / null values (#13535)
  • Fix precision/scale handling and invalid numbers in string-to-decimal conversions. (#13548)
  • Treat Python None as null value for Object dtype (#13564)
  • Expr.replace to single value did not replace NULLs (#13551)
  • AnyValue::StructOwned panic when hashing (#13553)
  • improve hive partition pruning (#13358) (#13426)
  • fix projection pushdown for new outer join schema (#13527)
  • ensure size-hint of TrueIdxIter is correct (#13508)
  • correct 'outer_coalesce' logic in case of duplicate names (#13501)
  • raise for out-of-range datetimes in to_datetime/strptime (#13403)
  • Keep logical type when getting values from list (#13456)
  • Handle duplicate/ambiguous inputs for replace (#13217)
  • skip null/empty values if replace_lit_n_char (#13400)
  • fix is_in operator when comparing string with global categoricals (#13412)
  • use different generics for shift_and_fill parameters (#13379)
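
As a note on the batch_size fix above (#13726): taking a std::num::NonZeroUsize lets the type system rule out a batch size of zero before any I/O starts. The sketch below is a standalone, std-only illustration of that idea and not Polars code; the helper name batch_count is hypothetical and does not appear in the Polars API.

```rust
use std::num::NonZeroUsize;

/// Hypothetical helper (not part of Polars): number of batches needed
/// to cover `rows` rows when each batch holds `batch_size` rows.
fn batch_count(rows: usize, batch_size: NonZeroUsize) -> usize {
    let size = batch_size.get(); // guaranteed to be >= 1
    // Ceiling division; dividing by zero is impossible here.
    (rows + size - 1) / size
}

fn main() {
    // A zero batch size cannot be constructed at all.
    assert!(NonZeroUsize::new(0).is_none());

    // A valid batch size is wrapped once, at the API boundary.
    let size = NonZeroUsize::new(1024).expect("batch size must be non-zero");
    println!("batches needed: {}", batch_count(5000, size));
}
```

Callers that previously passed a plain usize would, under this assumption, wrap the value with NonZeroUsize::new and handle the None case themselves.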

📖 Documentation

  • fix code block in user-guide/lazy/schemas (#14228)
  • Fix typo in contributing guide (#14181)
  • Small improvements Ecosystem page (#14176)
  • fix code blocks in user-guide/concepts/data-structures (#14146)
  • Fix bullet point formatting in CI contributing guide (#14117)
  • Remove outdated reference to horizontal concat feature (#14105)
  • Replace alternatives page with more objective comparison (#13784)
  • Improve structure of user guide (#13951)
  • Improve structure of user guide (#13639)
  • Introduce ecosystem page in user guide (#13903)
  • Mention deltalake write support in README (#13890)
  • Fix typo in deprecation message of with_row_count (#13793)
  • Fix incorrect "coming from pandas" syntax (#13767)
  • Improve streaming section of the user guide (#13750)
  • fix linking to feature flags in user guide (#13644)
  • Improve documentation on broadcasting (#13394)
  • Add note about toolchain issue under native Windows (#13590)
  • update SQL section of the README (#13529)
  • update polars-business > polars-xdt link (#13509)

📦 Build system

  • Enable feature nightly with optional sql feature (#14222)
  • remove horizontal_concat feature (#13390)

🛠️ Other improvements

  • make gather_chunked completely generic (#14195)
  • Add .cargo directory to .gitignore (#14191)
  • take_chunked to polars-ops (#14185)
  • Enable clippy lint to warn on debug macros (#14178)
  • Run cargo update (#14160)
  • merge take kernels (#14137)
  • improve From<Ca> -> Vec (#14123)
  • hoist boolean -> string cast (#14122)
  • Remove DatetimeChunked::convert_time_zone (#14046)
  • More generic way to present an expression tree diagram (#14020)
  • Rename LiteralValue::to_anyvalue to LiteralValue::to_any_value (#14033)
  • make Enums an actual datatype (#14011)
  • update rustc (#13947)
  • move filter to polars-compute (#13897)
  • bump object_store to 0.9 (#13857)
  • Make functions in expr/general non-anonymous (#13832)
  • Fix doctests (#13831)
  • Refactor Python release workflow (#13807)
  • Make pl.duration non-anonymous (#13762)
  • Rename pl.count() to pl.len() (#13719)
  • Deprecate dt.with_time_unit in favor of cast(pl.Int64).cast(pl.Datetime(time_unit, time_zone)) (#13667)
  • Auto-add 'needs triage' label to bugs (#13671)
  • make rolling index column visible to optimizer (#13658)
  • Rename lazy-regex feature to regex to align polars with polars-lazy crate (#13647)
  • Add Documentation / Build system sections to the changelog (#13594)
  • Filter unhelpful messages in make build (#13579)
  • Remove extra line break between checkboxes in GitHub bug report issues (#13576)
  • Rename row_count_name/row_count_offset parameters in IO functions to row_index_* (#13563)
  • Rename with_row_count to with_row_index (#13494)
  • simplify parquet binary ordering function (#13488)
  • don't panic if ambiguous is of wrong type (#13388)

Thank you to all our contributors for making this release possible!
@29antonioac, @Bromeon, @ByteNybbler, @JulianCologne, @MarcNuebel, @MarcoGorelli, @NedJWestern, @ShivMunagala, @Vincenthays, @Wainberg, @aaarrti, @alexander-beedie, @apcamargo, @bchalk101, @braaannigan, @c-peters, @cgevans, @cmdlineluser, @collinprince, @deanm0000, @dependabot, @dependabot[bot], @dpinol, @edavisau, @eitsupi, @flisky, @grinya007, @hamishs, @henryharbeck, @ion-elgreco, @itamarst, @jacksonthall22, @jcrozum, @kstoneriv3, @langestefan, @lukemanley, @mcrumiller, @mkucijan, @nameexhaustion, @orlp, @petrosbar, @r-brink, @reswqa, @ritchie46, @s-banach, @shritesh, @stinodego, @taki-mekhalfa, @thomasaarholt, @tim-stephenson, @universalmind303, @valorien and @wjandrea