These are the changes in pandas 1.5.0. See release
for a full changelog including other versions of pandas.
pandas now implements the DataFrame exchange API spec. See the full details on the API at https://data-apis.org/dataframe-protocol/latest/index.html
The protocol consists of two parts:
- New method DataFrame.__dataframe__ which produces the exchange object. It effectively "exports" the pandas dataframe as an exchange object so any other library which has the protocol implemented can "import" that dataframe without knowing anything about the producer except that it makes an exchange object.
- New function pandas.api.exchange.from_dataframe which can take an arbitrary exchange object from any conformant library and construct a pandas DataFrame out of it.
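As a minimal sketch of the producer side, the snippet below calls DataFrame.__dataframe__ and inspects the resulting exchange object. The accessor names num_rows, num_columns and column_names come from the protocol specification linked above rather than from these notes, and the frame contents are made up for illustration.

```python
import pandas as pd

# Illustrative data; any DataFrame works here.
df = pd.DataFrame({"a": [1, 2, 3], "b": ["x", "y", "z"]})

# "Export" the frame as an exchange object.
exchange_obj = df.__dataframe__()

# Accessors defined by the exchange protocol specification.
n_rows = exchange_obj.num_rows()
n_cols = exchange_obj.num_columns()
names = list(exchange_obj.column_names())

print(n_rows, n_cols, names)
```

A consumer library would receive only the exchange object and rebuild its own dataframe type from it.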
The most notable development is the new method Styler.concat which allows adding customised footer rows to visualise additional calculations on the data, e.g. totals and counts etc. (43875, 46186)
Additionally there is an alternative output method Styler.to_string, which allows using the Styler's formatting methods to create, for example, CSVs (44502).
Minor feature improvements are:
- Adding the ability to render border and border-{side} CSS properties in Excel (42276)
- Making keyword arguments consistent: Styler.highlight_null now accepts color and deprecates null_color although this remains backwards compatible (45907)
The argument group_keys has been added to the method DataFrame.resample. As with DataFrame.groupby, this argument controls whether each group is added to the index in the resample when Resampler.apply is used.
Warning
Not specifying the group_keys argument will retain the previous behavior and emit a warning if the result will change by specifying group_keys=False. In a future version of pandas, not specifying group_keys will default to the same behavior as group_keys=False.
.. ipython:: python

    df = pd.DataFrame(
        {'a': range(6)},
        index=pd.date_range("2021-01-01", periods=6, freq="8H")
    )
    df.resample("D", group_keys=True).apply(lambda x: x)
    df.resample("D", group_keys=False).apply(lambda x: x)

Previously, the resulting index would depend upon the values returned by apply, as seen in the following example.
In [1]: # pandas 1.3
In [2]: df.resample("D").apply(lambda x: x)
Out[2]:
a
2021-01-01 00:00:00 0
2021-01-01 08:00:00 1
2021-01-01 16:00:00 2
2021-01-02 00:00:00 3
2021-01-02 08:00:00 4
2021-01-02 16:00:00 5
In [3]: df.resample("D").apply(lambda x: x.reset_index())
Out[3]:
index a
2021-01-01 0 2021-01-01 00:00:00 0
1 2021-01-01 08:00:00 1
2 2021-01-01 16:00:00 2
2021-01-02 0 2021-01-02 00:00:00 3
1 2021-01-02 08:00:00 4
2 2021-01-02 16:00:00 5
Added new function pandas.from_dummies to convert a dummy coded DataFrame into a categorical DataFrame.
Example:
.. ipython:: python

    import pandas as pd

    df = pd.DataFrame({"col1_a": [1, 0, 1], "col1_b": [0, 1, 0],
                       "col2_a": [0, 1, 0], "col2_b": [1, 0, 0],
                       "col2_c": [0, 0, 1]})

    pd.from_dummies(df, sep="_")
The new method DataFrame.to_orc allows writing to ORC files (43864).
This functionality depends on the pyarrow library. For more details, see the IO docs on ORC.
Warning
- It is highly recommended to install pyarrow using conda due to some issues caused by pyarrow.
- pandas.DataFrame.to_orc requires pyarrow>=7.0.0.
- pandas.DataFrame.to_orc is not supported on Windows yet; you can find valid environments under install optional dependencies.
- For supported dtypes please refer to supported ORC features in Arrow.
- Currently timezones in datetime columns are not preserved when a dataframe is converted into ORC files.
df = pd.DataFrame(data={"col1": [1, 2], "col2": [3, 4]})
df.to_orc("./out.orc")
I/O methods like read_csv or DataFrame.to_json now allow reading and writing directly on TAR archives (44787).
df = pd.read_csv("./movement.tar.gz")
# ...
df.to_csv("./out.tar.gz")
This supports .tar, .tar.gz, .tar.bz2 and .tar.xz archives. The compression method used is inferred from the filename. If the compression method cannot be inferred, use the compression argument:
df = pd.read_csv(some_file_obj, compression={"method": "tar", "mode": "r:gz"}) # noqa F821
(mode being one of tarfile.open's modes: https://docs.python.org/3/library/tarfile.html#tarfile.open)
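The round trip described above can be sketched as follows. This is an illustration only: the temporary directory, the file name out.tar.gz and the frame contents are assumptions, not part of the release.

```python
import os
import tempfile

import pandas as pd

# Illustrative data.
df = pd.DataFrame({"x": [1, 2, 3], "y": ["a", "b", "c"]})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "out.tar.gz")

    # Compression is inferred from the ".tar.gz" suffix when writing.
    df.to_csv(path, index=False)

    # Reading back, again with the method inferred from the filename.
    inferred = pd.read_csv(path)

    # Reading back with the method and tarfile mode spelled out.
    explicit = pd.read_csv(
        path, compression={"method": "tar", "mode": "r:gz"}
    )

print(inferred.equals(df), explicit.equals(df))
```

The explicit form is mainly useful for file objects or paths whose name does not reveal the archive type.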
Similar to other IO methods, pandas.read_xml now supports assigning specific dtypes to columns, applying converter methods, and parsing dates (43567).
.. ipython:: python

    xml_dates = """<?xml version='1.0' encoding='utf-8'?>
    <data>
      <row>
        <shape>square</shape>
        <degrees>00360</degrees>
        <sides>4.0</sides>
        <date>2020-01-01</date>
      </row>
      <row>
        <shape>circle</shape>
        <degrees>00360</degrees>
        <sides/>
        <date>2021-01-01</date>
      </row>
      <row>
        <shape>triangle</shape>
        <degrees>00180</degrees>
        <sides>3.0</sides>
        <date>2022-01-01</date>
      </row>
    </data>"""

    df = pd.read_xml(
        xml_dates,
        dtype={'sides': 'Int64'},
        converters={'degrees': str},
        parse_dates=['date']
    )
    df
    df.dtypes
For very large XML files that can range from hundreds of megabytes to gigabytes, pandas.read_xml now supports parsing such sizeable files using lxml's iterparse and etree's iterparse, which are memory-efficient methods to iterate through XML trees and extract specific elements and attributes without holding the entire tree in memory (45442).
In [1]: df = pd.read_xml(
... "/path/to/downloaded/enwikisource-latest-pages-articles.xml",
... iterparse = {"page": ["title", "ns", "id"]}
... )
In [2]: df
Out[2]:
title ns id
0 Gettysburg Address 0 21450
1 Main Page 0 42950
2 Declaration by United Nations 0 8435
3 Constitution of the United States of America 0 8435
4 Declaration of Independence (Israel) 0 17858
... ... ... ...
3578760 Page:Black cat 1897 07 v2 n10.pdf/17 104 219649
3578761 Page:Black cat 1897 07 v2 n10.pdf/43 104 219649
3578762 Page:Black cat 1897 07 v2 n10.pdf/44 104 219649
3578763 The History of Tom Jones, a Foundling/Book IX 0 12084291
3578764 Page:Shakespeare of Stratford (1926) Yale.djvu/91 104 21450
[3578765 rows x 3 columns]
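For a self-contained illustration of the same call, the sketch below writes a tiny XML file to a temporary directory and reads it back with iterparse. The file name, the element names and the choice of parser="etree" (to avoid depending on lxml) are assumptions made for this example.

```python
import os
import tempfile

import pandas as pd

# A small stand-in for a large XML document.
xml_text = """<?xml version='1.0' encoding='utf-8'?>
<data>
  <row><shape>square</shape><sides>4</sides></row>
  <row><shape>triangle</shape><sides>3</sides></row>
</data>"""

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "shapes.xml")
    with open(path, "w", encoding="utf-8") as fh:
        fh.write(xml_text)

    # The key is the repeating element, the list names the
    # descendants or attributes to extract for each row.
    shapes = pd.read_xml(
        path,
        parser="etree",
        iterparse={"row": ["shape", "sides"]},
    )

print(shapes)
```

Note that iterparse works on a file on local disk, which is why the example writes the document out first.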
- Series.map now raises when arg is dict but na_action is not either None or 'ignore' (46588)
- MultiIndex.to_frame now supports the argument allow_duplicates and raises on duplicate labels if it is missing or False (45245)
- StringArray now accepts array-likes containing nan-likes (None, np.nan) for the values parameter in its constructor in addition to strings and pandas.NA. (40839)
- Improved the rendering of categories in CategoricalIndex (45218)
- DataFrame.plot will now allow the subplots parameter to be a list of iterables specifying column groups, so that columns may be grouped together in the same subplot (29688).
- to_numeric now preserves float64 arrays when downcasting would generate values not representable in float32 (43693)
- Series.reset_index and DataFrame.reset_index now support the argument allow_duplicates (44410)
- GroupBy.min and GroupBy.max now support Numba execution with the engine keyword (45428)
- read_csv now supports defaultdict as a dtype parameter (41574)
- DataFrame.rolling and Series.rolling now support a step parameter with fixed-length windows (15354)
- Implemented a bool-dtype Index, passing a bool-dtype array-like to pd.Index will now retain bool dtype instead of casting to object (45061)
- Implemented a complex-dtype Index, passing a complex-dtype array-like to pd.Index will now retain complex dtype instead of casting to object (45845)
- Series and DataFrame with IntegerDtype now support bitwise operations (34463)
- Add milliseconds field support for DateOffset (43371)
- DataFrame.reset_index now accepts a names argument which renames the index names (6878)
- concat now raises when levels is given but keys is None (46653)
- concat now raises when levels contains duplicate values (46653)
- Added numeric_only argument to DataFrame.corr, DataFrame.corrwith, DataFrame.cov, DataFrame.idxmin, DataFrame.idxmax, DataFrameGroupBy.idxmin, DataFrameGroupBy.idxmax, GroupBy.var, GroupBy.std, GroupBy.sem, and DataFrameGroupBy.quantile (46560)
- An errors.PerformanceWarning is now thrown when using string[pyarrow] dtype with methods that don't dispatch to pyarrow.compute methods (42613, 46725)
- Added validate argument to DataFrame.join (46622)
- Added numeric_only argument to Resampler.sum, Resampler.prod, Resampler.min, Resampler.max, Resampler.first, and Resampler.last (46442)
- times argument in ExponentialMovingWindow now accepts np.timedelta64 (47003)
- DataError, SpecificationError, SettingWithCopyError, SettingWithCopyWarning, NumExprClobberingError, UndefinedVariableError, and IndexingError are now exposed in pandas.errors (27656)
- Added check_like argument to testing.assert_series_equal (47247)
- Allow reading compressed SAS files with read_sas (e.g., .sas7bdat.gz files)
- DatetimeIndex.astype now supports casting timezone-naive indexes to datetime64[s], datetime64[ms], and datetime64[us], and timezone-aware indexes to the corresponding datetime64[unit, tzname] dtypes (47579)
- Series reducers (e.g. min, max, sum, mean) will now successfully operate when the dtype is numeric and numeric_only=True is provided; previously this would raise a NotImplementedError (47500)
- RangeIndex.union now can return a RangeIndex instead of an Int64Index if the resulting values are equally spaced (47557, 43885)
- DataFrame.compare now accepts an argument result_names to allow the user to specify the result's names of both left and right DataFrame which are being compared. This is by default 'self' and 'other' (44354)
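A few of the enhancements listed above can be tried out with the short sketch below. The data and variable names are illustrative assumptions; only the step, names and result_names arguments and the bool-dtype Index behavior come from the list.

```python
import pandas as pd

# rolling(step=...): evaluate the window only at every second row.
s = pd.Series(range(6))
stepped = s.rolling(window=2, step=2).sum()

# reset_index(names=...): choose the column name for the former index.
df = pd.DataFrame({"v": [10, 20]}, index=pd.Index(["r1", "r2"], name="old"))
renamed = df.reset_index(names="key")

# A bool-dtype array-like now keeps bool dtype in an Index.
flags = pd.Index([True, False, True])

# compare(result_names=...): label the two sides of the comparison.
left = pd.DataFrame({"v": [1, 2]})
right = pd.DataFrame({"v": [1, 3]})
diff = left.compare(right, result_names=("left", "right"))

print(stepped, renamed, flags.dtype, diff, sep="\n\n")
```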
These are bug fixes that might have notable behavior changes.
A transform is an operation whose result has the same size as its input. When the result is a DataFrame or Series, it is also required that the index of the result matches that of the input. In pandas 1.4, using DataFrameGroupBy.transform or SeriesGroupBy.transform with null values in the groups and dropna=True gave incorrect results. As demonstrated by the examples below, the incorrect results either contained incorrect values, or the result did not have the same index as the input.
.. ipython:: python

    df = pd.DataFrame({'a': [1, 1, np.nan], 'b': [2, 3, 4]})
Old behavior:
In [3]: # Value in the last row should be np.nan
df.groupby('a', dropna=True).transform('sum')
Out[3]:
b
0 5
1 5
2 5
In [3]: # Should have one additional row with the value np.nan
df.groupby('a', dropna=True).transform(lambda x: x.sum())
Out[3]:
b
0 5
1 5
In [3]: # The value in the last row is np.nan interpreted as an integer
df.groupby('a', dropna=True).transform('ffill')
Out[3]:
b
0 2
1 3
2 -9223372036854775808
In [3]: # Should have one additional row with the value np.nan
df.groupby('a', dropna=True).transform(lambda x: x)
Out[3]:
b
0 2
1 3
New behavior:
.. ipython:: python

    df.groupby('a', dropna=True).transform('sum')
    df.groupby('a', dropna=True).transform(lambda x: x.sum())
    df.groupby('a', dropna=True).transform('ffill')
    df.groupby('a', dropna=True).transform(lambda x: x)
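As a quick check of the corrected behavior, the sketch below rebuilds the frame from this section and inspects the transform result. It assumes pandas 1.5 or later; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1, 1, np.nan], "b": [2, 3, 4]})

# The row whose group key is null is kept in the result and filled with NaN,
# so the output index matches the input index.
summed = df.groupby("a", dropna=True).transform("sum")
summed_udf = df.groupby("a", dropna=True).transform(lambda x: x.sum())

print(summed)
print(summed_udf)
```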
DataFrame.to_json, Series.to_json, and Index.to_json would incorrectly localize DatetimeArrays/DatetimeIndexes with tz-naive Timestamps to UTC. (38760)
Note that this patch does not fix the localization of tz-aware Timestamps to UTC upon serialization. (Related issue 12997)
Old Behavior
.. ipython:: python

    index = pd.date_range(
        start='2020-12-28 00:00:00',
        end='2020-12-28 02:00:00',
        freq='1H',
    )
    a = pd.Series(
        data=range(3),
        index=index,
    )
In [4]: a.to_json(date_format='iso')
Out[4]: '{"2020-12-28T00:00:00.000Z":0,"2020-12-28T01:00:00.000Z":1,"2020-12-28T02:00:00.000Z":2}'
In [5]: pd.read_json(a.to_json(date_format='iso'), typ="series").index == a.index
Out[5]: array([False, False, False])
New Behavior
.. ipython:: python

    a.to_json(date_format='iso')
    # Roundtripping now works
    pd.read_json(a.to_json(date_format='iso'), typ="series").index == a.index
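The round trip can also be written as a standalone sketch. Wrapping the JSON text in StringIO and passing the frequency as a Timedelta are choices made here so the snippet runs across pandas versions; they are not part of the fix itself.

```python
from io import StringIO

import pandas as pd

# Three hourly, timezone-naive timestamps.
index = pd.date_range(
    start="2020-12-28 00:00:00", periods=3, freq=pd.Timedelta(hours=1)
)
a = pd.Series(data=range(3), index=index)

# With the fix, tz-naive timestamps are no longer serialized with a "Z" suffix.
payload = a.to_json(date_format="iso")

# Reading the payload back yields the same tz-naive index.
restored = pd.read_json(StringIO(payload), typ="series")

print(payload)
print(restored.index == a.index)
```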
Some minimum supported versions of dependencies were updated. If installed, we now require:
Package | Minimum Version | Required | Changed |
---|---|---|---|
numpy | 1.20.3 | | |
mypy (dev) | 0.971 | | |
beautifulsoup4 | 4.9.3 | | |
blosc | 1.21.0 | | |
bottleneck | 1.3.2 | | |
fsspec | 2021.05.0 | | |
hypothesis | 6.13.0 | | |
gcsfs | 2021.05.0 | | |
jinja2 | 3.0.0 | | |
lxml | 4.6.3 | | |
numba | 0.53.1 | | |
numexpr | 2.7.3 | | |
openpyxl | 3.0.7 | | |
pandas-gbq | 0.15.0 | | |
psycopg2 | 2.8.6 | | |
pymysql | 1.0.2 | | |
pyreadstat | 1.1.2 | | |
pyxlsb | 1.0.8 | | |
s3fs | 2021.05.0 | | |
scipy | 1.7.1 | | |
sqlalchemy | 1.4.16 | | |
tabulate | 0.8.9 | | |
xarray | 0.19.0 | | |
xlsxwriter | 1.4.3 | | |
For optional libraries the general recommendation is to use the latest version. The following table lists the lowest version per library that is currently being tested throughout the development of pandas. Optional libraries below the lowest tested version may still work, but are not considered supported.
Package | Minimum Version | Changed |
---|---|---|
|
See install.dependencies and install.optional_dependencies for more.
- BigQuery I/O methods read_gbq and DataFrame.to_gbq default to auth_local_webserver = True. Google has deprecated the auth_local_webserver = False "out of band" (copy-paste) flow. The auth_local_webserver = False option is planned to stop working in October 2022. (46312)
- read_json now raises FileNotFoundError (previously ValueError) when input is a string ending in .json, .json.gz, .json.bz2, etc. but no such file exists. (29102)
- Operations with Timestamp or Timedelta that would previously raise OverflowError instead raise OutOfBoundsDatetime or OutOfBoundsTimedelta where appropriate (47268)
- When read_sas previously returned None, it now returns an empty DataFrame (47410)
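The read_json change in the list above can be observed with a small sketch. The file name is a made-up placeholder that is assumed not to exist in the working directory.

```python
import pandas as pd

# A path-like string ending in ".json" that does not point to a real file
# (hypothetical name chosen for this example).
missing = "definitely_missing_file_for_demo.json"

try:
    pd.read_json(missing)
except FileNotFoundError as exc:
    outcome = type(exc).__name__
else:
    outcome = "no error"

print(outcome)
```

Code that previously caught ValueError for this case needs to catch FileNotFoundError instead.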
In a future version, integer slicing on a Series with an Int64Index or RangeIndex will be treated as label-based, not positional. This will make the behavior consistent with other Series.__getitem__ and Series.__setitem__ behaviors (45162).
For example:
.. ipython:: python

    ser = pd.Series([1, 2, 3, 4, 5], index=[2, 3, 5, 7, 11])

In the old behavior, ser[2:4] treats the slice as positional:
Old behavior:
In [3]: ser[2:4]
Out[3]:
5 3
7 4
dtype: int64
In a future version, this will be treated as label-based:
Future behavior:
In [4]: ser.loc[2:4]
Out[4]:
2 1
3 2
dtype: int64
To retain the old behavior, use series.iloc[i:j]. To get the future behavior, use series.loc[i:j].
Slicing on a DataFrame will not be affected.
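A minimal sketch of the two explicit spellings, using the Series from this section, is shown below. Both forms behave the same regardless of how the plain ser[2:4] slice is interpreted.

```python
import pandas as pd

ser = pd.Series([1, 2, 3, 4, 5], index=[2, 3, 5, 7, 11])

# Positional: rows at positions 2 and 3.
positional = ser.iloc[2:4]

# Label-based: labels from 2 through 4, inclusive of both ends.
by_label = ser.loc[2:4]

print(positional)
print(by_label)
```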
All attributes of ExcelWriter were previously documented as not public. However some third party Excel engines documented accessing ExcelWriter.book or ExcelWriter.sheets, and users were utilizing these and possibly other attributes. Previously these attributes were not safe to use; e.g. modifications to ExcelWriter.book would not update ExcelWriter.sheets and conversely. In order to support this, pandas has made some attributes public and improved their implementations so that they may now be safely used. (45572)
The following attributes are now public and considered safe to access.
- book
- check_extension
- close
- date_format
- datetime_format
- engine
- if_sheet_exists
- sheets
- supported_extensions
The following attributes have been deprecated. They now raise a FutureWarning when accessed and will be removed in a future version. Users should be aware that their usage is considered unsafe, and can lead to unexpected results.
- cur_sheet
- handles
- path
- save
- write_cells
See the documentation of ExcelWriter for further details.
In previous versions of pandas, if it was inferred that the function passed to GroupBy.apply was a transformer (i.e. the resulting index was equal to the input index), the group_keys argument of DataFrame.groupby and Series.groupby was ignored and the group keys would never be added to the index of the result. In the future, the group keys will be added to the index when the user specifies group_keys=True.
As group_keys=True is the default value of DataFrame.groupby and Series.groupby, not specifying group_keys with a transformer will raise a FutureWarning. This can be silenced and the previous behavior retained by specifying group_keys=False.
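The effect of passing group_keys explicitly with a transformer can be sketched as below. The frame, the column names g and v, and the doubling function are illustrative assumptions.

```python
import pandas as pd

df = pd.DataFrame({"g": ["x", "x", "y"], "v": [1, 2, 3]})

# The applied function returns a result with the same index as its input,
# so it is treated as a transformer.
with_keys = df.groupby("g", group_keys=True)["v"].apply(lambda s: s * 2)
without_keys = df.groupby("g", group_keys=False)["v"].apply(lambda s: s * 2)

print(with_keys)
print(without_keys)
```

With group_keys=True the group label becomes an extra index level; with group_keys=False the original index is kept.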
Most of the time setting values with frame.iloc attempts to set values inplace, only falling back to inserting a new array if necessary. There are some cases where this rule is not followed, for example when setting an entire column from an array with different dtype:
.. ipython:: python

    df = pd.DataFrame({'price': [11.1, 12.2]}, index=['book1', 'book2'])
    original_prices = df['price']
    new_prices = np.array([98, 99])
Old behavior:
In [3]: df.iloc[:, 0] = new_prices
In [4]: df.iloc[:, 0]
Out[4]:
book1 98
book2 99
Name: price, dtype: int64
In [5]: original_prices
Out[5]:
book1 11.1
book2 12.2
Name: price, dtype: float64
This behavior is deprecated. In a future version, setting an entire column with iloc will attempt to operate inplace.
Future behavior:
In [3]: df.iloc[:, 0] = new_prices
In [4]: df.iloc[:, 0]
Out[4]:
book1 98.0
book2 99.0
Name: price, dtype: float64
In [5]: original_prices
Out[5]:
book1 98.0
book2 99.0
Name: price, dtype: float64
To get the old behavior, use DataFrame.__setitem__ directly:
In [3]: df[df.columns[0]] = new_prices
In [4]: df.iloc[:, 0]
Out[4]:
book1 98
book2 99
Name: price, dtype: int64
In [5]: original_prices
Out[5]:
book1 11.1
book2 12.2
Name: price, dtype: float64
To get the old behavior when df.columns is not unique and you want to change a single column by index, you can use DataFrame.isetitem, which has been added in pandas 1.5:
In [3]: df_with_duplicated_cols = pd.concat([df, df], axis='columns')
In [3]: df_with_duplicated_cols.isetitem(0, new_prices)
In [4]: df_with_duplicated_cols.iloc[:, 0]
Out[4]:
book1 98
book2 99
Name: price, dtype: int64
In [5]: original_prices
Out[5]:
book1 11.1
book2 12.2
Name: price, dtype: float64
Across the DataFrame, DataFrameGroupBy, and Resampler operations such as min, sum, and idxmax, the default value of the numeric_only argument, if it exists at all, was inconsistent. Furthermore, operations with the default value None can lead to surprising results. (46560)
In [1]: df = pd.DataFrame({"a": [1, 2], "b": ["x", "y"]})
In [2]: # Reading the next line without knowing the contents of df, one would
# expect the result to contain the products for both columns a and b.
df[["a", "b"]].prod()
Out[2]:
a 2
dtype: int64
To avoid this behavior, specifying the value numeric_only=None has been deprecated, and will be removed in a future version of pandas. In the future, all operations with a numeric_only argument will default to False. Users should either call the operation only with columns that can be operated on, or specify numeric_only=True to operate only on Boolean, integer, and float columns.
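Both recommended spellings can be sketched with the frame from the example above. The variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": ["x", "y"]})

# Option 1: state explicitly that only numeric columns should be used.
explicit = df.prod(numeric_only=True)

# Option 2: select only the columns the operation can handle.
selected = df[["a"]].prod()

print(explicit)
print(selected)
```

Either form makes it visible at the call site that column b does not take part in the result.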
In order to support the transition to the new behavior, the following methods have gained the numeric_only argument.
- DataFrame.corr
- DataFrame.corrwith
- DataFrame.cov
- DataFrame.idxmin
- DataFrame.idxmax
- DataFrameGroupBy.cummin
- DataFrameGroupBy.cummax
- DataFrameGroupBy.idxmin
- DataFrameGroupBy.idxmax
- GroupBy.var
- GroupBy.std
- GroupBy.sem
- DataFrameGroupBy.quantile
- Resampler.mean
- Resampler.median
- Resampler.sem
- Resampler.std
- Resampler.var
- DataFrame.rolling operations
- DataFrame.expanding operations
- DataFrame.ewm operations
- Deprecated the keyword line_terminator in DataFrame.to_csv and Series.to_csv, use lineterminator instead; this is for consistency with read_csv and the standard library 'csv' module (9568)
- Deprecated behavior of SparseArray.astype, Series.astype, and DataFrame.astype with SparseDtype when passing a non-sparse dtype. In a future version, this will cast to that non-sparse dtype instead of wrapping it in a SparseDtype (34457)
- Deprecated behavior of DatetimeIndex.intersection and DatetimeIndex.symmetric_difference (union behavior was already deprecated in version 1.3.0) with mixed time zones; in a future version both will be cast to UTC instead of object dtype (39328, 45357)
- Deprecated DataFrame.iteritems, Series.iteritems, HDFStore.iteritems in favor of DataFrame.items, Series.items, HDFStore.items (45321)
- Deprecated Series.is_monotonic and Index.is_monotonic in favor of Series.is_monotonic_increasing and Index.is_monotonic_increasing (45422, 21335)
- Deprecated behavior of DatetimeIndex.astype, TimedeltaIndex.astype, PeriodIndex.astype when converting to an integer dtype other than int64. In a future version, these will convert to exactly the specified dtype (instead of always int64) and will raise if the conversion overflows (45034)
- Deprecated the __array_wrap__ method of DataFrame and Series, rely on standard numpy ufuncs instead (45451)
- Deprecated treating float-dtype data as wall-times when passed with a timezone to Series or DatetimeIndex (45573)
- Deprecated the behavior of Series.fillna and DataFrame.fillna with timedelta64[ns] dtype and incompatible fill value; in a future version this will cast to a common dtype (usually object) instead of raising, matching the behavior of other dtypes (45746)
- Deprecated the warn parameter in infer_freq (45947)
- Deprecated allowing non-keyword arguments in ExtensionArray.argsort (46134)
- Deprecated treating all-bool object-dtype columns as bool-like in DataFrame.any and DataFrame.all with bool_only=True, explicitly cast to bool instead (46188)
- Deprecated behavior of method DataFrame.quantile; the argument numeric_only will default to False, including datetime/timedelta columns in the result (7308)
- Deprecated Timedelta.freq and Timedelta.is_populated (46430)
- Deprecated Timedelta.delta (46476)
- Deprecated passing arguments as positional in DataFrame.any and Series.any (44802)
- Deprecated the closed argument in interval_range in favor of inclusive argument; in a future version passing closed will raise (40245)
- Deprecated the methods DataFrame.mad, Series.mad, and the corresponding groupby methods (11787)
- Deprecated positional arguments to Index.join except for other, use keyword-only arguments instead of positional arguments (46518)
- Deprecated positional arguments to StringMethods.rsplit and StringMethods.split except for pat, use keyword-only arguments instead of positional arguments (47423)
- Deprecated indexing on a timezone-naive DatetimeIndex using a string representing a timezone-aware datetime (46903, 36148)
- Deprecated the closed argument in Interval in favor of inclusive argument; in a future version passing closed will raise (40245)
- Deprecated the closed argument in IntervalIndex in favor of inclusive argument; in a future version passing closed will raise (40245)
- Deprecated the closed argument in IntervalDtype in favor of inclusive argument; in a future version passing closed will raise (40245)
- Deprecated the closed argument in IntervalArray in favor of inclusive argument; in a future version passing closed will raise (40245)
- Deprecated IntervalArray.set_closed and IntervalIndex.set_closed in favor of set_inclusive; in a future version set_closed will get removed (40245)
- Deprecated the closed argument in ArrowInterval in favor of inclusive argument; in a future version passing closed will raise (40245)
- Deprecated allowing unit="M" or unit="Y" in Timestamp constructor with a non-round float value (47267)
- Deprecated the display.column_space global configuration option (7576)
- Deprecated the argument na_sentinel in factorize, Index.factorize, and ExtensionArray.factorize; pass use_na_sentinel=True instead to use the sentinel -1 for NaN values and use_na_sentinel=False instead of na_sentinel=None to encode NaN values (46910)
- Deprecated DataFrameGroupBy.transform not aligning the result when the UDF returned DataFrame (45648)
- Clarified warning from to_datetime when delimited dates can't be parsed in accordance to specified dayfirst argument (46210)
- Emit warning from to_datetime when delimited dates can't be parsed in accordance to specified dayfirst argument even for dates where leading zero is omitted (e.g. 31/1/2001) (47880)
- Deprecated Series and Resampler reducers (e.g. min, max, sum, mean) raising a NotImplementedError when the dtype is non-numeric and numeric_only=True is provided; this will raise a TypeError in a future version (47500)
- Deprecated Series.rank returning an empty result when the dtype is non-numeric and numeric_only=True is provided; this will raise a TypeError in a future version (47500)
- Deprecated argument errors for Series.mask, Series.where, DataFrame.mask, and DataFrame.where as errors had no effect on these methods (47728)
- Deprecated arguments *args and **kwargs in Rolling, Expanding, and ExponentialMovingWindow ops. (47836)
- Deprecated unused arguments encoding and verbose in Series.to_excel and DataFrame.to_excel (47912)
- Deprecated producing a single element when iterating over a DataFrameGroupBy or a SeriesGroupBy that has been grouped by a list of length 1; a tuple of length one will be returned instead (42795)
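For a few of the deprecations above, the replacement spellings can be sketched as follows. The data is illustrative; the snippet assumes pandas 1.5 or later, where lineterminator is available.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})

# lineterminator replaces the deprecated line_terminator keyword.
csv_text = df.to_csv(index=False, lineterminator="\n")

# items() replaces the deprecated iteritems().
column_names = [name for name, _ in df.items()]

# is_monotonic_increasing replaces the deprecated is_monotonic.
increasing = pd.Series([1, 2, 3]).is_monotonic_increasing

print(repr(csv_text), column_names, increasing)
```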
- Performance improvement in DataFrame.corrwith for column-wise (axis=0) Pearson and Spearman correlation when other is a Series (46174)
- Performance improvement in GroupBy.transform for some user-defined DataFrame -> Series functions (45387)
- Performance improvement in DataFrame.duplicated when subset consists of only one column (45236)
- Performance improvement in GroupBy.diff (16706)
- Performance improvement in GroupBy.transform when broadcasting values for user-defined functions (45708)
- Performance improvement in GroupBy.transform for user-defined functions when only a single group exists (44977)
- Performance improvement in GroupBy.apply when grouping on a non-unique unsorted index (46527)
- Performance improvement in DataFrame.loc and Series.loc for tuple-based indexing of a MultiIndex (45681, 46040, 46330)
- Performance improvement in DataFrame.to_records when the index is a MultiIndex (47263)
- Performance improvement in MultiIndex.values when the MultiIndex contains levels of type DatetimeIndex, TimedeltaIndex or ExtensionDtypes (46288)
- Performance improvement in merge when left and/or right are empty (45838)
- Performance improvement in DataFrame.join when left and/or right are empty (46015)
- Performance improvement in DataFrame.reindex and Series.reindex when target is a MultiIndex (46235)
- Performance improvement when setting values in a pyarrow backed string array (46400)
- Performance improvement in factorize (46109)
- Performance improvement in DataFrame and Series constructors for extension dtype scalars (45854)
- Performance improvement in read_excel when nrows argument provided (32727)
- Performance improvement in Styler.to_excel when applying repeated CSS formats (47371)
- Performance improvement in MultiIndex.is_monotonic_increasing (47458)
- Performance improvement in BusinessHour str and repr (44764)
- Performance improvement in datetime arrays string formatting when one of the default strftime formats "%Y-%m-%d %H:%M:%S" or "%Y-%m-%d %H:%M:%S.%f" is used. (44764)
- Performance improvement in Series.to_sql and DataFrame.to_sql (SQLiteTable) when processing time arrays. (44764)
- Performance improvements to read_sas (47403, 47404, 47405)
- Performance improvement in argmax and argmin for arrays.SparseArray (34197)
- Bug in Categorical.view not accepting integer dtypes (25464)
- Bug in CategoricalIndex.union when the index's categories are integer-dtype and the index contains NaN values incorrectly raising instead of casting to float64 (45362)
- Bug in DataFrame.quantile with datetime-like dtypes and no rows incorrectly returning float64 dtype instead of retaining datetime-like dtype (41544)
- Bug in to_datetime with sequences of np.str_ objects incorrectly raising (32264)
- Bug in Timestamp construction when passing datetime components as positional arguments and tzinfo as a keyword argument incorrectly raising (31929)
- Bug in Index.astype when casting from object dtype to timedelta64[ns] dtype incorrectly casting np.datetime64("NaT") values to np.timedelta64("NaT") instead of raising (45722)
- Bug in SeriesGroupBy.value_counts index when passing categorical column (44324)
- Bug in DatetimeIndex.tz_localize localizing to UTC failing to make a copy of the underlying data (46460)
- Bug in DatetimeIndex.resolution incorrectly returning "day" instead of "nanosecond" for nanosecond-resolution indexes (46903)
- Bug in Timestamp with an integer or float value and unit="Y" or unit="M" giving slightly-wrong results (47266)
- Bug in DatetimeArray construction when passed another DatetimeArray and freq=None incorrectly inferring the freq from the given array (47296)
- Bug in astype_nansafe where astype("timedelta64[ns]") fails when np.nan is included (45798)
- Bug in constructing a Timedelta with a np.timedelta64 object and a unit sometimes silently overflowing and returning incorrect results instead of raising OutOfBoundsTimedelta (46827)
- Bug in constructing a Timedelta from a large integer or float with unit="W" silently overflowing and returning incorrect results instead of raising OutOfBoundsTimedelta (47268)
- Bug in Timestamp constructor raising when passed a ZoneInfo tzinfo object (46425)
- Bug in operations with array-likes with dtype="boolean" and NA incorrectly altering the array in-place (45421)
- Bug in division, pow and mod operations on array-likes with dtype="boolean" not being like their np.bool_ counterparts (46063)
- Bug in multiplying a Series with IntegerDtype or FloatingDtype by an array-like with timedelta64[ns] dtype incorrectly raising (45622)
- Bug in mean where the optional dependency bottleneck causes precision loss linear in the length of the array. bottleneck has been disabled for mean, improving the loss to log-linear, but this may result in a performance decrease. (42878)
- Bug in DataFrame.astype not preserving subclasses (40810)
- Bug in constructing a Series from a float-containing list or a floating-dtype ndarray-like (e.g. dask.Array) and an integer dtype raising instead of casting like we would with an np.ndarray (40110)
- Bug in Float64Index.astype to unsigned integer dtype incorrectly casting to np.int64 dtype (45309)
- Bug in Series.astype and DataFrame.astype from floating dtype to unsigned integer dtype failing to raise in the presence of negative values (45151)
- Bug in array with FloatingDtype and values containing float-castable strings incorrectly raising (45424)
- Bug when comparing string and datetime64ns objects causing OverflowError exception. (45506)
- Bug in metaclass of generic abstract dtypes causing DataFrame.apply and Series.apply to raise for the built-in function type (46684)
- Bug in DataFrame.to_records returning inconsistent numpy types if the index was a MultiIndex (47263)
- Bug in DataFrame.to_dict for orient="list" or orient="index" was not returning native types (46751)
- Bug in DataFrame.apply that returns a DataFrame instead of a Series when applied to an empty DataFrame and axis=1 (39111)
- Bug when inferring the dtype from an iterable that is not a NumPy ndarray consisting of all NumPy unsigned integer scalars did not result in an unsigned integer dtype (47294)
- Bug in str.startswith and str.endswith when using other series as parameter _pat. Now raises TypeError (3485)
- Bug in Series.str.zfill when strings contain leading signs, padding '0' before the sign character rather than after as str.zfill from standard library (20868)
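The zfill fix can be checked against the standard library with a short sketch. The sample strings are illustrative and the snippet assumes pandas 1.5 or later.

```python
import pandas as pd

values = ["-1", "+1", "1"]

# With the fix, zeros are inserted after a leading sign character.
padded = pd.Series(values).str.zfill(3).tolist()

# Reference result from Python's built-in str.zfill.
builtin = [v.zfill(3) for v in values]

print(padded, builtin)
```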
- Bug in IntervalArray.__setitem__ when setting np.nan into an integer-backed array raising ValueError instead of TypeError (45484)
- Bug in IntervalDtype when using datetime64[ns, tz] as a dtype string (46999)
- Bug in loc.__getitem__ with a list of keys causing an internal inconsistency that could lead to a disconnect between frame.at[x, y] and frame[y].loc[x] (22372)
- Bug in DataFrame.iloc where indexing a single row on a DataFrame with a single ExtensionDtype column gave a copy instead of a view on the underlying data (45241)
- Bug in DataFrame.__getitem__ returning a copy when the DataFrame has duplicated columns even if a unique column is selected (45316, 41062)
- Bug in Series.align not creating a MultiIndex with the union of levels when both MultiIndexes' intersections are identical (45224)
- Bug in setting a NA value (None or np.nan) into a Series with int-based IntervalDtype incorrectly casting to object dtype instead of a float-based IntervalDtype (45568)
- Bug in setting values into an ExtensionDtype column with df.iloc[:, i] = values, with values having the same dtype as df.iloc[:, i], incorrectly inserting a new array instead of setting in-place (33457)
- Bug in Series.__setitem__ with a non-integer Index when using an integer key to set a value that cannot be set inplace where a ValueError was raised instead of casting to a common dtype (45070)
- Bug in Series.__setitem__ when setting incompatible values into a PeriodDtype or IntervalDtype Series raising when indexing with a boolean mask but coercing when indexing with otherwise-equivalent indexers; these now consistently coerce, along with Series.mask and Series.where (45768)
- Bug in DataFrame.where with multiple columns with datetime-like dtypes failing to downcast results consistent with other dtypes (45837)
- Bug in isin upcasting to float64 with unsigned integer dtype and list-like argument without a dtype (46485)
- Bug in Series.loc.__setitem__ and Series.loc.__getitem__ not raising when using multiple keys without using a MultiIndex (13831)
- Bug in Index.reindex raising AssertionError when level was specified but no MultiIndex was given; level is ignored now (35132)
- Bug when setting a value too large for a Series dtype failing to coerce to a common type (26049, 32878)
- Bug in loc.__setitem__ treating range keys as positional instead of label-based (45479)
- Bug in Series.__setitem__ when setting boolean dtype values containing NA incorrectly raising instead of casting to boolean dtype (45462)
- Bug in Series.loc raising with boolean indexer containing NA when Index did not match (46551)
- Bug in Series.__setitem__ where setting NA into a numeric-dtype Series would incorrectly upcast to object-dtype rather than treating the value as np.nan (44199)
- Bug in DataFrame.loc when setting values to a column and the right hand side is a dictionary (47216)
- Bug in DataFrame.loc when setting a DataFrame not aligning index in some cases (47578)
- Bug in Series.__setitem__ with datetime64[ns] dtype, an all-False boolean mask, and an incompatible value incorrectly casting to object instead of retaining datetime64[ns] dtype (45967)
- Bug in Index.__getitem__ raising ValueError when indexer is from boolean dtype with NA (45806)
- Bug in Series.__setitem__ losing precision when enlarging Series with scalar (32346)
- Bug in Series.mask with inplace=True or setting values with a boolean mask with small integer dtypes incorrectly raising (45750)
- Bug in DataFrame.mask with inplace=True and ExtensionDtype columns incorrectly raising (45577)
- Bug in getting a column from a DataFrame with an object-dtype row index with datetime-like values: the resulting Series now preserves the exact object-dtype Index from the parent DataFrame (42950)
- Bug in DataFrame.__getattribute__ raising AttributeError if columns have "string" dtype (46185)
- Bug in indexing on a DatetimeIndex with a np.str_ key incorrectly raising (45580)
- Bug in CategoricalIndex.get_indexer when the index contains NaN values, resulting in elements that are in target but not present in the index being mapped to the index of the NaN element, instead of -1 (45361)
- Bug in setting large integer values into Series with float32 or float16 dtype incorrectly altering these values instead of coercing to float64 dtype (45844)
- Bug in Series.asof and DataFrame.asof incorrectly casting bool-dtype results to float64 dtype (16063)
- Bug in NDFrame.xs, DataFrame.iterrows, DataFrame.loc and DataFrame.iloc not always propagating metadata (28283)
- Bug in DataFrame.sum where min_count changed the dtype if the input contained NaNs (46947)
- Bug in IntervalTree that led to an infinite recursion (46658)
- Bug in PeriodIndex raising AttributeError when indexing on NA, rather than putting NaT in its place (46673)
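To make one of the entries above concrete, here is a small sketch of setting NA into a numeric Series (44199). It assumes a float64 Series with a default integer index, and the values are illustrative.

```python
import pandas as pd

# A plain float64 Series with the default RangeIndex.
s = pd.Series([1.0, 2.0, 3.0])

# Setting pd.NA is treated like np.nan, so the dtype is not upcast to object.
s[0] = pd.NA

print(s.dtype)
print(s.isna().tolist())
```

The Series is expected to keep its float64 dtype, with the first element stored as a missing value.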
- Bug in Series.fillna and DataFrame.fillna with downcast keyword not being respected in some cases where there are no NA values present (45423)
- Bug in Series.fillna and DataFrame.fillna with IntervalDtype and incompatible value raising instead of casting to a common (usually object) dtype (45796)
- Bug in Series.map not respecting na_action argument if mapper is a dict or Series (47527)
- Bug in DataFrame.interpolate with object-dtype column not returning a copy with inplace=False (45791)
- Bug in DataFrame.dropna allowing the incompatible arguments how and thresh to both be set (46575)
- Bug in DataFrame.fillna ignoring axis when the DataFrame is a single block (47713)
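The DataFrame.dropna entry above can be sketched as follows. The frame is a made-up example; the point is only that passing how and thresh together is now rejected while each argument still works alone.

```python
import numpy as np
import pandas as pd

# Illustrative frame: the first row has one valid value, the second has none.
df = pd.DataFrame({"a": [1.0, np.nan], "b": [np.nan, np.nan]})

try:
    df.dropna(how="all", thresh=1)  # incompatible combination
    error = None
except TypeError as err:
    error = err

# Either argument on its own is still accepted.
by_how = df.dropna(how="all")
by_thresh = df.dropna(thresh=1)

print(type(error).__name__, len(by_how), len(by_thresh))
```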
- Bug in DataFrame.loc returning empty result when slicing a MultiIndex with a negative step size and non-null start/stop values (46156)
- Bug in DataFrame.loc raising when slicing a MultiIndex with a negative step size other than -1 (46156)
- Bug in DataFrame.loc raising when slicing a MultiIndex with a negative step size and slicing a non-int labeled index level (46156)
- Bug in Series.to_numpy where multiindexed Series could not be converted to numpy arrays when an na_value was supplied (45774)
- Bug in MultiIndex.equals not being commutative when only one side has extension array dtype (46026)
- Bug in MultiIndex.from_tuples being unable to construct an Index of empty tuples (45608)
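A short sketch of the Series.to_numpy entry above, using a hypothetical two-level index and a fill value of 0.0 chosen for the example.

```python
import numpy as np
import pandas as pd

# Illustrative MultiIndexed Series containing one missing value.
index = pd.MultiIndex.from_tuples([("a", 1), ("a", 2), ("b", 1)])
s = pd.Series([1.0, np.nan, 3.0], index=index)

# Supplying na_value replaces missing entries in the returned array.
values = s.to_numpy(na_value=0.0)

print(values)
```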
- Bug in DataFrame.to_stata where no error is raised if the DataFrame contains -np.inf (45350)
- Bug in read_excel resulting in an infinite loop with certain skiprows callables (45585)
- Bug in DataFrame.info where a new line at the end of the output is omitted when called on an empty DataFrame (45494)
- Bug in read_csv not recognizing line break for on_bad_lines="warn" for engine="c" (41710)
- Bug in DataFrame.to_csv not respecting float_format for Float64 dtype (45991)
- Bug in read_csv not respecting a specified converter to index columns in all cases (40589)
- Bug in read_csv interpreting second row as Index names even when index_col=False (46569)
- Bug in read_parquet when engine="pyarrow" which caused partial write to disk when column of unsupported datatype was passed (44914)
- Bug in DataFrame.to_excel and ExcelWriter raising when writing an empty DataFrame to a .ods file (45793)
- Bug in read_csv ignoring non-existing header row for engine="python" (47400)
- Bug in read_excel raising uncontrolled IndexError when header references non-existing rows (43143)
- Bug in read_html where elements surrounding <br> were joined without a space between them (29528)
- Bug in read_csv when data is longer than header leading to issues with callables in usecols expecting strings (46997)
- Bug in Parquet roundtrip for Interval dtype with datetime64[ns] subtype (45881)
- Bug in read_excel when reading a .ods file with newlines between xml elements (45598)
- Bug in read_parquet when engine="fastparquet" where the file was not closed on error (46555)
- to_html now excludes the border attribute from <table> elements when border keyword is set to False.
- Bug in read_sas with certain types of compressed SAS7BDAT files (35545)
- Bug in read_excel not forward filling MultiIndex when no names were given (47487)
- Bug in read_sas returning None rather than an empty DataFrame for SAS7BDAT files with zero rows (18198)
- Bug in StataWriter where value labels were always written with default encoding (46750)
- Bug in StataWriterUTF8 where some valid characters were removed from variable names (47276)
- Bug in DataFrame.to_excel when writing an empty dataframe with MultiIndex (19543)
- Bug in read_sas with RLE-compressed SAS7BDAT files that contain 0x40 control bytes (31243)
- Bug in read_sas that scrambled column names (31243)
- Bug in read_sas with RLE-compressed SAS7BDAT files that contain 0x00 control bytes (47099)
- Bug in read_parquet with use_nullable_dtypes=True where float64 dtype was returned instead of nullable Float64 dtype (45694)
- Bug in DataFrame.to_json where PeriodDtype would not make the serialization roundtrip when read back with read_json (44720)
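The to_html entry above is easy to verify on a tiny frame. This sketch assumes the default value of the display.html.border option, and the frame itself is illustrative.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2]})

default_html = df.to_html()               # uses the display.html.border option
no_border_html = df.to_html(border=False)  # border attribute is left out

print(default_html.splitlines()[0])
print(no_border_html.splitlines()[0])
```

Only the opening table tag should differ between the two outputs.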
- Bug in subtraction of Period from PeriodArray returning wrong results (45999)
- Bug in Period.strftime and PeriodIndex.strftime where directives %l and %u gave wrong results (46252)
- Bug in inferring an incorrect freq when passing a string to Period with microseconds that are a multiple of 1000 (46811)
- Bug in constructing a Period from a Timestamp or np.datetime64 object with non-zero nanoseconds and freq="ns" incorrectly truncating the nanoseconds (46811)
- Bug in adding np.timedelta64("NaT", "ns") to a Period with a timedelta-like freq incorrectly raising IncompatibleFrequency instead of returning NaT (47196)
- Bug in adding an array of integers to an array with PeriodDtype giving incorrect results when dtype.freq.n > 1 (47209)
- Bug in subtracting a Period from an array with PeriodDtype returning incorrect results instead of raising OverflowError when the operation overflows (47538)
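A minimal sketch of the NaT-addition entry above (47196). It assumes a minute frequency as the timedelta-like freq, and the timestamp is arbitrary.

```python
import numpy as np
import pandas as pd

# A Period with a timedelta-like (fixed-width) frequency; "min" is assumed here.
period = pd.Period("2022-01-01 00:00", freq="min")

# Adding a not-a-time timedelta should propagate NaT instead of raising.
result = period + np.timedelta64("NaT", "ns")

print(result)
```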
- Bug in DataFrame.plot.barh that prevented labeling the x-axis and xlabel updating the y-axis label (45144)
- Bug in DataFrame.plot.box that prevented labeling the x-axis (45463)
- Bug in DataFrame.boxplot that prevented passing in xlabel and ylabel (45463)
- Bug in DataFrame.boxplot that prevented specifying vert=False (36918)
- Bug in DataFrame.plot.scatter that prevented specifying norm (45809)
- The function DataFrame.plot.scatter now accepts color as an alias for c and size as an alias for s for consistency with other plotting functions (44670)
- Fix showing "None" as ylabel in Series.plot when not setting ylabel (46129)
- Bug in DataFrame.plot that led to xticks and vertical grids being improperly placed when plotting a quarterly series (47602)
- Bug in DataFrame.plot that prevented setting y-axis label, limits and ticks for a secondary y-axis (47753)
- Bug in DataFrame.resample ignoring closed="right" on TimedeltaIndex (45414)
- Bug in DataFrameGroupBy.transform failing when func="size" and the input DataFrame has multiple columns (27469)
- Bug in DataFrameGroupBy.size and DataFrameGroupBy.transform with func="size" producing incorrect results when axis=1 (45715)
- Bug in ExponentialMovingWindow.mean with axis=1 and engine='numba' when the DataFrame has more columns than rows (46086)
- Bug where using engine="numba" would return the same jitted function when modifying engine_kwargs (46086)
- Bug in DataFrameGroupBy.transform failing when axis=1 and func is "first" or "last" (45986)
- Bug in DataFrameGroupBy.cumsum with skipna=False giving incorrect results (46216)
- Bug in GroupBy.cumsum with timedelta64[ns] dtype failing to recognize NaT as a null value (46216)
- Bug in GroupBy.cummin and GroupBy.cummax with nullable dtypes incorrectly altering the original data in place (46220)
- Bug in DataFrame.groupby raising error when None is in first level of MultiIndex (47348)
- Bug in GroupBy.cummax with int64 dtype with leading value being the smallest possible int64 (46382)
- Bug in GroupBy.max with empty groups and uint64 dtype incorrectly raising RuntimeError (46408)
- Bug in GroupBy.apply failing when func was a string and args or kwargs were supplied (46479)
- Bug in SeriesGroupBy.apply incorrectly naming its result when there was a unique group (46369)
- Bug in Rolling.sum and Rolling.mean giving incorrect results with a window of same values (42064, 46431)
- Bug in Rolling.var and Rolling.std giving non-zero results with a window of same values (42064)
- Bug in Rolling.skew and Rolling.kurt giving NaN with a window of same values (30993)
- Bug in Rolling.var segfaulting when calculating weighted variance with a window size larger than the data size (46760)
- Bug in Grouper.__repr__ where dropna was not included. Now it is (46754)
- Bug in DataFrame.rolling raising ValueError when center=True, axis=1 and win_type is specified (46135)
- Bug in DataFrameGroupBy.describe and SeriesGroupBy.describe producing inconsistent results for empty datasets (41575)
- Bug in DataFrame.resample reduction methods when used with on would attempt to aggregate the provided column (47079)
- Bug in DataFrame.groupby and Series.groupby not respecting dropna=False when the input DataFrame/Series had NaN values in a MultiIndex (46783)
- Bug in DataFrameGroupBy.resample raising KeyError when getting the result from a key list which misses the resample key (47362)
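The Rolling.var and Rolling.std entry above can be illustrated with a window of identical values that follows a much larger value. The numbers are invented for the example and are not taken from the linked issue.

```python
import pandas as pd

# One large value followed by a run of identical values (illustrative data).
s = pd.Series([1e6, 5.0, 5.0, 5.0, 5.0])

variance = s.rolling(3).var()
std = s.rolling(3).std()

# The last window holds only identical values, so its spread should be zero.
print(variance.iloc[-1], std.iloc[-1])
```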
- Bug in concat between a Series with integer dtype and another with CategoricalDtype with integer categories and containing NaN values casting to object dtype instead of float64 (45359)
- Bug in get_dummies that selected object and categorical dtypes but not string (44965)
- Bug in DataFrame.align when aligning a MultiIndex to a Series with another MultiIndex (46001)
- Bug in concatenation with IntegerDtype or FloatingDtype arrays where the resulting dtype did not mirror the behavior of the non-nullable dtypes (46379)
- Bug in concat losing dtype of columns when join="outer" and sort=True (47329)
- Bug in concat not sorting the column names when None is included (47331)
- Bug in concat with identical key leading to an error when indexing MultiIndex (46519)
- Bug in DataFrame.join with a list when using suffixes to join DataFrames with duplicate column names (46396)
- Bug in DataFrame.pivot_table with sort=False resulting in a sorted index (17041)
- Bug in concat when axis=1 and sort=False where the resulting Index was an Int64Index instead of a RangeIndex (46675)
- Bug in wide_to_long raising when stubnames is missing in columns and i contains string dtype column (46044)
- Bug in DataFrame.join with categorical index resulting in unexpected reordering (47812)
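As a sketch of the get_dummies entry above, the following encodes a column with the "string" extension dtype. The column names and values are hypothetical.

```python
import pandas as pd

# Illustrative frame with one "string"-dtype column and one integer column.
df = pd.DataFrame(
    {
        "color": pd.array(["red", "blue", "red"], dtype="string"),
        "n": [1, 2, 3],
    }
)

# With no columns argument, string columns are now picked up for encoding.
dummies = pd.get_dummies(df)

print(list(dummies.columns))
```

The original color column should be replaced by one indicator column per distinct value, while the integer column passes through unchanged.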
- Bug in Series.where and DataFrame.where with SparseDtype failing to retain the array's fill_value (45691)
- Bug in SparseArray.unique failing to keep the original order of elements (47809)
- Bug in IntegerArray.searchsorted and FloatingArray.searchsorted returning inconsistent results when acting on np.nan (45255)
- Bug when attempting to apply styling functions to an empty DataFrame subset (45313)
- Bug in CSSToExcelConverter leading to TypeError when border color provided without border style for xlsxwriter engine (42276)
- Bug in Styler.set_sticky leading to white text on white background in dark mode (46984)
- Bug in Styler.to_latex causing UnboundLocalError when clines="all;data" and the DataFrame has no rows (47203)
- Bug in Styler.to_excel when using vertical-align: middle; with xlsxwriter engine (30107)
- Bug when applying styles to a DataFrame with boolean column labels (47838)
- Fixed metadata propagation in DataFrame.melt (28283)
- Fixed metadata propagation in DataFrame.explode (28283)
- Bug in assert_index_equal with names=True and check_order=False not checking names (47328)