These are the changes in pandas 2.2.0. See the release notes for a full changelog including other versions of pandas.
The calamine engine was added to read_excel. It uses python-calamine, which provides Python bindings for the Rust library calamine. This engine supports Excel files (.xlsx, .xlsm, .xls, .xlsb) and OpenDocument spreadsheets (.ods) (50395).
There are two advantages of this engine:
- Calamine is often faster than other engines. Some benchmarks show results up to 5x faster than 'openpyxl', 20x faster than 'odf', 4x faster than 'pyxlsb', and 1.5x faster than 'xlrd'. However, 'openpyxl' and 'pyxlsb' are faster when reading a few rows from large files because they iterate over rows lazily.
- Calamine supports the recognition of datetime in .xlsb files, unlike 'pyxlsb', which is the only other engine in pandas that can read .xlsb files.
pd.read_excel("path_to_file.xlsb", engine="calamine")
For more, see io.calamine in the user guide on IO tools.
The Series.struct accessor provides attributes and methods for processing data with struct[pyarrow] dtype Series. For example, Series.struct.explode converts PyArrow structured data to a pandas DataFrame (54938).
```python
import pandas as pd
import pyarrow as pa

series = pd.Series(
    [
        {"project": "pandas", "version": "2.2.0"},
        {"project": "numpy", "version": "1.25.2"},
        {"project": "pyarrow", "version": "13.0.0"},
    ],
    dtype=pd.ArrowDtype(
        pa.struct([
            ("project", pa.string()),
            ("version", pa.string()),
        ])
    ),
)
# Expand the struct fields into separate DataFrame columns.
series.struct.explode()
```
- read_csv now supports on_bad_lines parameter with engine="pyarrow". (54480)
- ExtensionArray._explode interface method added to allow extension type implementations of the explode method (54833)
- DataFrame.apply now allows the usage of numba (via engine="numba") to JIT compile the passed function, allowing for potential speedups (54666)
These are bug fixes that might have notable behavior changes.
In previous versions of pandas, merge and DataFrame.join did not always return a result that followed the documented sort behavior. pandas now follows the documented sort behavior in merge and join operations (54611).
As documented, sort=True sorts the join keys lexicographically in the resulting DataFrame. With sort=False, the order of the join keys depends on the join type (how keyword):
- how="left": preserve the order of the left keys
- how="right": preserve the order of the right keys
- how="inner": preserve the order of the left keys
- how="outer": sort keys lexicographically
One example with changing behavior is inner joins with non-unique left join keys and sort=False:
```python
left = pd.DataFrame({"a": [1, 2, 1]})
right = pd.DataFrame({"a": [1, 2]})
result = pd.merge(left, right, how="inner", on="a", sort=False)
```
Old Behavior
In [5]: result
Out[5]:
a
0 1
1 1
2 2
New Behavior
```python
result
```
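As a rough illustration of the ordering rules above, the following sketch merges two small frames with sort=False and records the key order for three join types. The frames and the key_order name are made up for this example, and the inner-join order assumes pandas 2.2 or later.

```python
import pandas as pd

# Illustrative data: the left keys are non-unique and unsorted.
left = pd.DataFrame({"key": ["b", "a", "b"]})
right = pd.DataFrame({"key": ["b", "a"]})

# Record the order of the join keys for each join type with sort=False.
key_order = {
    how: pd.merge(left, right, how=how, on="key", sort=False)["key"].tolist()
    for how in ["left", "right", "inner"]
}

for how, keys in key_order.items():
    print(f"{how:>5}: {keys}")
```

On pandas 2.2 or later, the left and inner results should follow the left key order (b, a, b), while the right result follows the right key order (b, b, a).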
Some minimum supported versions of dependencies were updated. If installed, we now require:
Package | Minimum Version | Required | Changed |
---|---|---|---|
|
|
For optional libraries the general recommendation is to use the latest version. The following table lists the lowest version per library that is currently being tested throughout the development of pandas. Optional libraries below the lowest tested version may still work, but are not considered supported.
Package | Minimum Version | Changed |
---|---|---|
|
See install.dependencies and install.optional_dependencies for more.
The alias M is deprecated in favour of ME for offsets. Please use ME for "month end" instead of M (9586).
For example:
Previous behavior:
In [7]: pd.date_range('2020-01-01', periods=3, freq='M')
Out [7]:
DatetimeIndex(['2020-01-31', '2020-02-29', '2020-03-31'],
dtype='datetime64[ns]', freq='M')
Future behavior:
```python
pd.date_range('2020-01-01', periods=3, freq='ME')
```
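For code that has to run on pandas versions both before and after this alias change, one possible approach (a sketch, not a recommendation taken from these notes) is to pass an offset object instead of an alias string. The month_ends name is illustrative.

```python
import pandas as pd

# An offset object sidesteps the "M" versus "ME" alias strings entirely.
month_ends = pd.date_range("2020-01-01", periods=3, freq=pd.offsets.MonthEnd())

print(month_ends.strftime("%Y-%m-%d").tolist())
# ['2020-01-31', '2020-02-29', '2020-03-31']
```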
- Changed Timedelta.resolution_string to return min, s, ms, us, and ns instead of T, S, L, U, and N, for compatibility with respective deprecations in frequency aliases (52536)
- Deprecated allowing non-keyword arguments in DataFrame.to_clipboard. (54229)
- Deprecated allowing non-keyword arguments in DataFrame.to_csv except path_or_buf. (54229)
- Deprecated allowing non-keyword arguments in DataFrame.to_dict. (54229)
- Deprecated allowing non-keyword arguments in DataFrame.to_excel except excel_writer. (54229)
- Deprecated allowing non-keyword arguments in DataFrame.to_gbq except destination_table. (54229)
- Deprecated allowing non-keyword arguments in DataFrame.to_hdf except path_or_buf. (54229)
- Deprecated allowing non-keyword arguments in DataFrame.to_html except buf. (54229)
- Deprecated allowing non-keyword arguments in DataFrame.to_json except path_or_buf. (54229)
- Deprecated allowing non-keyword arguments in DataFrame.to_latex except buf. (54229)
- Deprecated allowing non-keyword arguments in DataFrame.to_markdown except buf. (54229)
- Deprecated allowing non-keyword arguments in DataFrame.to_parquet except path. (54229)
- Deprecated allowing non-keyword arguments in DataFrame.to_pickle except path. (54229)
- Deprecated allowing non-keyword arguments in DataFrame.to_string except buf. (54229)
- Deprecated automatic downcasting of object-dtype results in Series.replace and DataFrame.replace; explicitly call result = result.infer_objects(copy=False) instead. To opt in to the future version, use pd.set_option("future.no_silent_downcasting", True) (54710)
- Deprecated downcasting behavior in Series.where, DataFrame.where, Series.mask, DataFrame.mask, Series.clip, DataFrame.clip; in a future version these will not infer object-dtype columns to non-object dtype, or all-round floats to integer dtype. Call result.infer_objects(copy=False) on the result for object inference, or explicitly cast floats to ints. To opt in to the future version, use pd.set_option("future.no_silent_downcasting", True) (53656)
- Deprecated including the groups in computations when using DataFrameGroupBy.apply and DataFrameGroupBy.resample; pass include_groups=False to exclude the groups (7155)
- Deprecated not passing a tuple to DataFrameGroupBy.get_group or SeriesGroupBy.get_group when grouping by a length-1 list-like (25971)
- Deprecated string A denoting frequency in YearEnd and strings A-DEC, A-JAN, etc. denoting annual frequencies with various fiscal year ends (52536)
- Deprecated strings S, U, and N denoting units in to_timedelta (52536)
- Deprecated strings T, S, L, U, and N denoting frequencies in Minute, Second, Milli, Micro, Nano (52536)
- Deprecated strings T, S, L, U, and N denoting units in Timedelta (52536)
- Deprecated the extension test classes BaseNoReduceTests, BaseBooleanReduceTests, and BaseNumericReduceTests; use BaseReduceTests instead (54663)
- Deprecated the option mode.data_manager and the ArrayManager; only the BlockManager will be available in future versions (55043)
- Deprecated downcasting the results of DataFrame.fillna, Series.fillna, DataFrame.ffill, Series.ffill, DataFrame.bfill, Series.bfill in object-dtype cases. To opt in to the future version, use pd.set_option("future.no_silent_downcasting", True) (54261)
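The following sketch shows one way to write code that steers clear of a few of the deprecations listed above: keyword arguments for to_csv, lowercase unit strings, and explicit object inference after replace. The sample data and variable names are illustrative and not taken from the release notes.

```python
import io

import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": ["x", "y"]})

# Pass everything after path_or_buf by keyword.
buf = io.StringIO()
df.to_csv(buf, sep=";", index=False)
csv_text = buf.getvalue()

# Use lowercase unit strings rather than the deprecated "T", "S", "L", "U", "N".
one_minute = pd.Timedelta(1, unit="min")
ninety_seconds = pd.to_timedelta(90, unit="s")

# Request object inference explicitly instead of relying on silent downcasting.
ser = pd.Series([1, "a"], dtype=object)
replaced = ser.replace("a", 2).infer_objects()
print(replaced.dtype)
```

Calling infer_objects without arguments is used here only to keep the sketch independent of the copy keyword. The notes above suggest infer_objects(copy=False).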
- Performance improvement in concat with axis=1 and objects with unaligned indexes (55084)
- Performance improvement in to_dict on converting DataFrame to dictionary (50990)
- Performance improvement in DataFrame.groupby when aggregating pyarrow timestamp and duration dtypes (55031)
- Performance improvement in DataFrame.sort_index and Series.sort_index when indexed by a MultiIndex (54835)
- Performance improvement in Index.difference (55108)
- Performance improvement when indexing with more than 4 keys (54550)
- Performance improvement when localizing time to UTC (55241)
- Bug in AbstractHolidayCalendar where timezone data was not propagated when computing holiday observances (54580)
- Bug in pandas.core.window.Rolling where duplicate datetimelike indexes are treated as consecutive rather than equal with closed='left' and closed='neither' (20712)
- Bug in DataFrame.apply where passing raw=True ignored args passed to the applied function (55009)
- Bug in pandas.read_excel with an ODS file without cached formatted cells for float values (55219)
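To illustrate the DataFrame.apply fix listed above, here is a small sketch in which extra positional arguments are forwarded to the applied function while raw=True is set. The scale function and the sample frame are hypothetical.

```python
import pandas as pd


def scale(column, factor):
    # With raw=True, each column arrives as a NumPy ndarray rather than a Series.
    return column.sum() * factor


df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})

# args is forwarded to scale as the extra positional argument "factor".
result = df.apply(scale, raw=True, args=(10,))
print(result.tolist())
```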
- Bug in Categorical.isin raising InvalidIndexError for categorical containing overlapping Interval values (34974)
- Bug in DatetimeIndex.union returning object dtype for tz-aware indexes with the same timezone but different units (55238)
- Bug in read_csv with engine="pyarrow" causing rounding errors for large integers (52505)
- Bug in Interval __repr__ not displaying UTC offsets for Timestamp bounds. Additionally, the hour, minute and second components will now be shown. (55015)
- Bug in IntervalIndex.get_indexer with datetime or timedelta intervals incorrectly matching on integer targets (47772)
- Bug in IntervalIndex.get_indexer with timezone-aware datetime intervals incorrectly matching on a sequence of timezone-naive targets (47772)
- Bug in setting values on a Series with an IntervalIndex using a slice incorrectly raising (54722)
- Bug in Index.difference not returning a unique set of values when other is empty or other is considered non-comparable (55113)
- Bug in setting Categorical values into a DataFrame with numpy dtypes raising RecursionError (52927)
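The Index.difference fix above can be pictured with a tiny sketch. The index values are made up, and the result described assumes a pandas version that includes this fix.

```python
import pandas as pd

idx = pd.Index([1, 1, 2])
empty = pd.Index([], dtype="int64")

# Nothing is removed, but the result should still contain each value only once.
diff = idx.difference(empty)
print(diff.tolist())
```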
- Bug in read_csv where on_bad_lines="warn" would write to stderr instead of raising a Python warning. This now yields an errors.ParserWarning (54296)
- Bug in read_csv with engine="pyarrow" where usecols wasn't working with a CSV with no headers (54459)
- Bug in read_excel, with engine="xlrd" (xls files) erroring when the file contains NaNs/Infs (54564)
- Bug in to_excel, with OdsWriter (ods files) writing boolean/string value (54994)
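Because on_bad_lines="warn" now goes through Python's warnings machinery, the warnings can be captured and inspected like any others. A minimal sketch, assuming pandas 2.2 or later and the default parser engine; the CSV text and variable names are invented for the example.

```python
import io
import warnings

import pandas as pd

# The third line has three fields although the header declares two.
data = "a,b\n1,2\n3,4,5\n6,7\n"

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    df = pd.read_csv(io.StringIO(data), on_bad_lines="warn")

# Keep only the warnings raised by the parser for skipped lines.
parser_warnings = [
    w for w in caught if issubclass(w.category, pd.errors.ParserWarning)
]
print(len(df), len(parser_warnings))
```

The malformed line is skipped, so df holds only the two well-formed rows.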
- Bug in DataFrame.plot.box with vert=False and a matplotlib Axes created with sharey=True (54941)
- Bug in concat ignoring sort parameter when passed DatetimeIndex indexes (54769)
- Bug in merge returning columns in incorrect order when left and/or right is empty (51929)
- Bug in cut incorrectly allowing cutting of timezone-aware datetimes with timezone-naive bins (54964)