What's new in 2.2.0 (Month XX, 2024)

These are the changes in pandas 2.2.0. See the release notes for a full changelog including other versions of pandas.

Enhancements

Calamine engine for read_excel

The calamine engine was added to read_excel. It uses python-calamine, which provides Python bindings for the Rust library calamine. This engine supports Excel files (.xlsx, .xlsm, .xls, .xlsb) and OpenDocument spreadsheets (.ods) (50395).

There are two advantages of this engine:

  1. Calamine is often faster than other engines. Some benchmarks show it reading up to 5x faster than 'openpyxl', 20x faster than 'odf', 4x faster than 'pyxlsb', and 1.5x faster than 'xlrd'. However, 'openpyxl' and 'pyxlsb' are faster when reading only a few rows from large files, because they iterate over rows lazily.
  2. Calamine recognizes datetime values in .xlsb files, unlike 'pyxlsb', which is the only other engine in pandas that can read .xlsb files.

pd.read_excel("path_to_file.xlsb", engine="calamine")

For more, see io.calamine in the user guide on IO tools.
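
Because python-calamine is an optional dependency, a script may want to fall back to the default engine when it is not installed. The following is a minimal sketch, assuming pandas 2.2 or later. The helper names pick_excel_engine and read_workbook are hypothetical and not part of pandas.

```python
import importlib.util

import pandas as pd


def pick_excel_engine():
    """Return "calamine" if python-calamine is importable, else None.

    Passing engine=None lets pandas choose its default engine for the file.
    """
    if importlib.util.find_spec("python_calamine") is not None:
        return "calamine"
    return None


def read_workbook(path, **kwargs):
    """Hypothetical wrapper: read a spreadsheet, preferring calamine."""
    return pd.read_excel(path, engine=pick_excel_engine(), **kwargs)
```

Keeping the engine choice in one place means that switching back to another engine, for example when only a few rows of a large file are needed, is a one-line change.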

Series.struct accessor for PyArrow structured data

The Series.struct accessor provides attributes and methods for processing Series with struct[pyarrow] dtype. For example, Series.struct.explode converts PyArrow structured data to a pandas DataFrame. (54938)

python

import pyarrow as pa

series = pd.Series(
    [
        {"project": "pandas", "version": "2.2.0"},
        {"project": "numpy", "version": "1.25.2"},
        {"project": "pyarrow", "version": "13.0.0"},
    ],
    dtype=pd.ArrowDtype(
        pa.struct([
            ("project", pa.string()),
            ("version", pa.string()),
        ])
    ),
)
series.struct.explode()

Other enhancements

  • read_csv now supports on_bad_lines parameter with engine="pyarrow". (54480)
  • ExtensionArray._explode interface method added to allow extension type implementations of the explode method (54833)
  • DataFrame.apply now allows the usage of numba (via engine="numba") to JIT compile the passed function, allowing for potential speedups (54666)

Notable bug fixes

These are bug fixes that might have notable behavior changes.

merge and DataFrame.join now consistently follow documented sort behavior

In previous versions of pandas, merge and DataFrame.join did not always return a result that followed the documented sort behavior. pandas now follows the documented sort behavior in merge and join operations (54611).

As documented, sort=True sorts the join keys lexicographically in the resulting DataFrame. With sort=False, the order of the join keys depends on the join type (how keyword):

  • how="left": preserve the order of the left keys
  • how="right": preserve the order of the right keys
  • how="inner": preserve the order of the left keys
  • how="outer": sort keys lexicographically

One example with changing behavior is inner joins with non-unique left join keys and sort=False:

python

left = pd.DataFrame({"a": [1, 2, 1]})
right = pd.DataFrame({"a": [1, 2]})
result = pd.merge(left, right, how="inner", on="a", sort=False)

Old Behavior

In [5]: result
Out[5]:
   a
0  1
1  1
2  2

New Behavior

python

result
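
The documented ordering rules can be checked on the same two frames. The following is a small sketch, not taken from the pandas test suite, that compares a left join without sorting against an inner join with sort=True. The variable names left_order and sorted_order are illustrative.

```python
import pandas as pd

left = pd.DataFrame({"a": [1, 2, 1]})
right = pd.DataFrame({"a": [1, 2]})

# how="left" with sort=False keeps the keys in the order they appear in `left`.
left_order = pd.merge(left, right, how="left", on="a", sort=False)["a"].tolist()

# sort=True orders the join keys, whatever the join type.
sorted_order = pd.merge(left, right, how="inner", on="a", sort=True)["a"].tolist()

print(left_order)
print(sorted_order)
```

Code that depends on a particular row order after an inner join can pass sort=True or sort the result explicitly, instead of relying on the order the join happens to produce.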

Backwards incompatible API changes

Increased minimum versions for dependencies

Some minimum supported versions of dependencies were updated. If installed, we now require:

Package Minimum Version Required Changed

X

X

For optional libraries the general recommendation is to use the latest version. The following table lists the lowest version per library that is currently being tested throughout the development of pandas. Optional libraries below the lowest tested version may still work, but are not considered supported.

Package Minimum Version Changed

X

See install.dependencies and install.optional_dependencies for more.

Other API changes

Deprecations

Deprecate alias M in favour of ME for offsets

The alias M is deprecated in favour of ME for offsets. Please use ME for "month end" instead of M. (9586)

For example:

Previous behavior:

In [7]: pd.date_range('2020-01-01', periods=3, freq='M')
Out[7]:
DatetimeIndex(['2020-01-31', '2020-02-29', '2020-03-31'],
              dtype='datetime64[ns]', freq='M')

Future behavior:

python

pd.date_range('2020-01-01', periods=3, freq='ME')
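
Code that has to run on pandas versions from both before and after this rename can avoid the string alias altogether by passing the offset object. This is a sketch of that alternative, not something the deprecation requires. The variable names are illustrative.

```python
import pandas as pd

# The offset object names the frequency directly, so no string alias is involved.
idx = pd.date_range("2020-01-01", periods=3, freq=pd.offsets.MonthEnd())

# Format the dates as plain strings for display.
month_ends = idx.strftime("%Y-%m-%d").tolist()
print(month_ends)
```

The resulting index holds the same month-end dates as the freq='M' call shown under "Previous behavior".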

Other Deprecations

  • Changed Timedelta.resolution_string to return min, s, ms, us, and ns instead of T, S, L, U, and N, for compatibility with respective deprecations in frequency aliases (52536)
  • Deprecated allowing non-keyword arguments in DataFrame.to_clipboard. (54229)
  • Deprecated allowing non-keyword arguments in DataFrame.to_csv except path_or_buf. (54229)
  • Deprecated allowing non-keyword arguments in DataFrame.to_dict. (54229)
  • Deprecated allowing non-keyword arguments in DataFrame.to_excel except excel_writer. (54229)
  • Deprecated allowing non-keyword arguments in DataFrame.to_gbq except destination_table. (54229)
  • Deprecated allowing non-keyword arguments in DataFrame.to_hdf except path_or_buf. (54229)
  • Deprecated allowing non-keyword arguments in DataFrame.to_html except buf. (54229)
  • Deprecated allowing non-keyword arguments in DataFrame.to_json except path_or_buf. (54229)
  • Deprecated allowing non-keyword arguments in DataFrame.to_latex except buf. (54229)
  • Deprecated allowing non-keyword arguments in DataFrame.to_markdown except buf. (54229)
  • Deprecated allowing non-keyword arguments in DataFrame.to_parquet except path. (54229)
  • Deprecated allowing non-keyword arguments in DataFrame.to_pickle except path. (54229)
  • Deprecated allowing non-keyword arguments in DataFrame.to_string except buf. (54229)
  • Deprecated automatic downcasting of object-dtype results in Series.replace and DataFrame.replace; explicitly call result = result.infer_objects(copy=False) instead. To opt in to the future version, use pd.set_option("future.no_silent_downcasting", True) (54710)
  • Deprecated downcasting behavior in Series.where, DataFrame.where, Series.mask, DataFrame.mask, Series.clip, DataFrame.clip; in a future version these will not infer object-dtype columns to non-object dtype, or all-round floats to integer dtype. Call result.infer_objects(copy=False) on the result for object inference, or explicitly cast floats to ints. To opt in to the future version, use pd.set_option("future.no_silent_downcasting", True) (53656)
  • Deprecated including the groups in computations when using DataFrameGroupBy.apply and DataFrameGroupBy.resample; pass include_groups=False to exclude the groups (7155)
  • Deprecated not passing a tuple to DataFrameGroupBy.get_group or SeriesGroupBy.get_group when grouping by a length-1 list-like (25971)
  • Deprecated string A denoting frequency in YearEnd and strings A-DEC, A-JAN, etc. denoting annual frequencies with various fiscal year ends (52536)
  • Deprecated strings S, U, and N denoting units in to_timedelta (52536)
  • Deprecated strings T, S, L, U, and N denoting frequencies in Minute, Second, Milli, Micro, Nano (52536)
  • Deprecated strings T, S, L, U, and N denoting units in Timedelta (52536)
  • Deprecated the extension test classes BaseNoReduceTests, BaseBooleanReduceTests, and BaseNumericReduceTests; use BaseReduceTests instead (54663)
  • Deprecated the option mode.data_manager and the ArrayManager; only the BlockManager will be available in future versions (55043)
  • Deprecated downcasting the results of DataFrame.fillna, Series.fillna, DataFrame.ffill, Series.ffill, DataFrame.bfill, Series.bfill in object-dtype cases. To opt in to the future version, use pd.set_option("future.no_silent_downcasting", True) (54261)
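
Several of the deprecations above come down to small spelling changes at the call site. The following sketch illustrates three of them: keyword-only arguments for the to_* writers, lowercase unit strings, and explicit object inference. The data and variable names are made up for illustration.

```python
import io

import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": ["x", "y"]})

# Only path_or_buf stays positional; every other argument is passed by keyword.
buf = io.StringIO()
df.to_csv(buf, sep=";", index=False)
csv_text = buf.getvalue()

# Lowercase unit strings replace the deprecated T, S, L, U and N spellings.
one_minute = pd.Timedelta(1, unit="min")
five_seconds = pd.to_timedelta(5, unit="s")

# Ask for object inference explicitly instead of relying on silent downcasting.
# The notes above suggest passing copy=False to avoid an extra copy.
obj = pd.Series([1, 2, 3], dtype=object)
inferred = obj.infer_objects()
print(inferred.dtype)
```

Written this way, the calls do not depend on any of the behavior being deprecated, so they should keep working once the old spellings are removed.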

Performance improvements

  • Performance improvement in concat with axis=1 and objects with unaligned indexes (55084)
  • Performance improvement in to_dict on converting DataFrame to dictionary (50990)
  • Performance improvement in DataFrame.groupby when aggregating pyarrow timestamp and duration dtypes (55031)
  • Performance improvement in DataFrame.sort_index and Series.sort_index when indexed by a MultiIndex (54835)
  • Performance improvement in Index.difference (55108)
  • Performance improvement when indexing with more than 4 keys (54550)
  • Performance improvement when localizing time to UTC (55241)

Bug fixes

  • Bug in AbstractHolidayCalendar where timezone data was not propagated when computing holiday observances (54580)
  • Bug in pandas.core.window.Rolling where duplicate datetimelike indexes were treated as consecutive rather than equal with closed='left' and closed='neither' (20712)
  • Bug in DataFrame.apply where passing raw=True ignored args passed to the applied function (55009)
  • Bug in pandas.read_excel with an ODS file without cached formatted cells for float values (55219)

Categorical

  • Bug in Categorical.isin raising InvalidIndexError for categorical containing overlapping Interval values (34974)

Datetimelike

  • Bug in DatetimeIndex.union returning object dtype for tz-aware indexes with the same timezone but different units (55238)

Timedelta

Timezones

Numeric

  • Bug in read_csv with engine="pyarrow" causing rounding errors for large integers (52505)

Conversion

Strings

Interval

  • Bug in Interval __repr__ not displaying UTC offsets for Timestamp bounds. Additionally the hour, minute and second components will now be shown. (55015)
  • Bug in IntervalIndex.get_indexer with datetime or timedelta intervals incorrectly matching on integer targets (47772)
  • Bug in IntervalIndex.get_indexer with timezone-aware datetime intervals incorrectly matching on a sequence of timezone-naive targets (47772)
  • Bug in setting values on a Series with an IntervalIndex using a slice incorrectly raising (54722)

Indexing

  • Bug in Index.difference not returning a unique set of values when other is empty or other is considered non-comparable (55113)
  • Bug in setting Categorical values into a DataFrame with numpy dtypes raising RecursionError (52927)

Missing

MultiIndex

I/O

  • Bug in read_csv where on_bad_lines="warn" would write to stderr instead of raising a Python warning. This now yields an errors.ParserWarning (54296)
  • Bug in read_csv with engine="pyarrow" where usecols wasn't working with a csv with no headers (54459)
  • Bug in read_excel, with engine="xlrd" (xls files) erroring when file contains NaNs/Infs (54564)
  • Bug in to_excel, with OdsWriter (ods files) writing boolean/string value (54994)
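
Because on_bad_lines="warn" now goes through Python's warnings machinery, the messages can be captured with the standard warnings module. The following is a minimal sketch with made-up inline data. On pandas versions before this fix, the captured list may be empty because the message was written to stderr.

```python
import io
import warnings

import pandas as pd

# The second data row has three fields instead of two, so it is a "bad line".
data = "a,b\n1,2\n3,4,5\n6,7\n"

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    df = pd.read_csv(io.StringIO(data), on_bad_lines="warn")

# Keep only the parser warnings; other warning categories are ignored here.
parser_warnings = [
    w for w in caught if issubclass(w.category, pd.errors.ParserWarning)
]

print(df)
print(len(parser_warnings))
```

The same mechanism allows warnings.filterwarnings("error", category=pd.errors.ParserWarning) to turn bad lines into exceptions during testing.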

Period

Plotting

  • Bug in DataFrame.plot.box with vert=False and a matplotlib Axes created with sharey=True (54941)

Groupby/resample/rolling

Reshaping

  • Bug in concat ignoring sort parameter when passed DatetimeIndex indexes (54769)
  • Bug in merge returning columns in incorrect order when left and/or right is empty (51929)

Sparse

ExtensionArray

Styler

Other

  • Bug in cut incorrectly allowing cutting of timezone-aware datetimes with timezone-naive bins (54964)

Contributors