What's new in 2.2.0 (Month XX, 2024)

These are the changes in pandas 2.2.0. See :ref:`release` for a full changelog including other versions of pandas.

{{ header }}

Upcoming changes in pandas 3.0

pandas 3.0 will bring two bigger changes to the default behavior of pandas.

Copy-on-Write

The currently optional mode Copy-on-Write will be enabled by default in pandas 3.0. There won't be an option to keep the current behavior enabled. The new behavioral semantics are explained in the :ref:`user guide about Copy-on-Write <copy_on_write>`.

The new behavior can be enabled since pandas 2.0 with the following option:

pd.options.mode.copy_on_write = True

This change alters how pandas behaves with respect to copies and views. Some of these changes allow a clear deprecation, such as the changes in chained assignment. Others are more subtle, so their warnings are hidden behind an option that can be enabled in pandas 2.2.

pd.options.mode.copy_on_write = "warn"

This mode will warn in many different scenarios that aren't actually relevant to most queries. We recommend exploring this mode, but it is not necessary to get rid of all of these warnings. The :ref:`migration guide <copy_on_write.migration_guide>` explains the upgrade process in more detail.
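As a minimal sketch of the new semantics (assuming pandas 2.x, where Copy-on-Write is still opt-in; the variable names are illustrative only), the snippet below enables the mode in a limited scope and shows that writing to a Series selected from a DataFrame no longer modifies the parent:

```python
import pandas as pd

# Assumption: pandas 2.x, where Copy-on-Write must be enabled explicitly.
with pd.option_context("mode.copy_on_write", True):
    df = pd.DataFrame({"foo": [1, 2, 3], "bar": [4, 5, 6]})

    subset = df["foo"]    # shares data with df until one of them is written to
    subset.iloc[0] = 100  # the write triggers a copy, so df is left untouched

    parent_value = int(df.loc[0, "foo"])
    subset_value = int(subset.iloc[0])

print(parent_value, subset_value)
```

Using ``pd.option_context`` keeps the experiment local, which can be convenient when only part of a code base has been migrated.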

Dedicated string data type (backed by Arrow) by default

Historically, pandas represented string columns with NumPy object data type. This representation has numerous problems, including slow performance and a large memory footprint. This will change in pandas 3.0. pandas will start inferring string columns as a new string data type, backed by Arrow, which stores strings contiguously in memory. This brings a huge performance and memory improvement.

Old behavior:

In [1]: ser = pd.Series(["a", "b"])
In [2]: ser
Out[2]:
0    a
1    b
dtype: object

New behavior:

In [1]: ser = pd.Series(["a", "b"])
In [2]: ser
Out[2]:
0    a
1    b
dtype: string

The string data type that is used in these scenarios will mostly behave as NumPy object would, including missing value semantics and general operations on these columns.

This change includes a few additional changes across the API:

  • Currently, specifying dtype="string" creates a dtype that is backed by Python strings which are stored in a NumPy array. This will change in pandas 3.0, where this dtype will create an Arrow-backed string column.
  • The column names and the Index will also be backed by Arrow strings.
  • PyArrow will become a required dependency with pandas 3.0 to accommodate this change.

This future dtype inference logic can be enabled with:

pd.options.future.infer_string = True
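Code that tests for strings by comparing against ``object`` will be affected by this change. The following is a hedged sketch of a check that does not depend on which representation is inferred; it relies only on ``pd.api.types.is_string_dtype``, and the variable names are illustrative:

```python
import pandas as pd

inferred = pd.Series(["a", "b"])                  # object today, string dtype later
explicit = pd.Series(["a", "b"], dtype="string")  # dedicated string dtype

# Brittle: only true while strings are inferred as NumPy object dtype.
looks_like_object = inferred.dtype == object

# Works for both object-backed strings and the dedicated string dtype.
inferred_is_string = pd.api.types.is_string_dtype(inferred)
explicit_is_string = pd.api.types.is_string_dtype(explicit)

print(inferred_is_string, explicit_is_string)
```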

Enhancements

ADBC Driver support in to_sql and read_sql

:func:`read_sql` and :meth:`~DataFrame.to_sql` now work with Apache Arrow ADBC drivers. Compared to traditional drivers used via SQLAlchemy, ADBC drivers should provide significant performance improvements, better type support and cleaner nullability handling.

import adbc_driver_postgresql.dbapi as pg_dbapi

df = pd.DataFrame(
    [
        [1, 2, 3],
        [4, 5, 6],
    ],
    columns=['a', 'b', 'c']
)
uri = "postgresql://postgres:postgres@localhost/postgres"
with pg_dbapi.connect(uri) as conn:
    df.to_sql("pandas_table", conn, index=False)

# for round-tripping
with pg_dbapi.connect(uri) as conn:
    df2 = pd.read_sql("pandas_table", conn)

The Arrow type system offers a wider array of types that can more closely match what databases like PostgreSQL can offer. To illustrate, note this (non-exhaustive) listing of types available in different databases and pandas backends:

.. csv-table::
   :header: "numpy/pandas", "arrow", "postgres", "sqlite"

   "int16/Int16", "int16", "SMALLINT", "INTEGER"
   "int32/Int32", "int32", "INTEGER", "INTEGER"
   "int64/Int64", "int64", "BIGINT", "INTEGER"
   "float32", "float32", "REAL", "REAL"
   "float64", "float64", "DOUBLE PRECISION", "REAL"
   "object", "string", "TEXT", "TEXT"
   "bool", "bool\_", "BOOLEAN", ""
   "datetime64[ns]", "timestamp(us)", "TIMESTAMP", ""
   "datetime64[ns,tz]", "timestamp(us,tz)", "TIMESTAMPTZ", ""
   "", "date32", "DATE", ""
   "", "month_day_nano_interval", "INTERVAL", ""
   "", "binary", "BINARY", "BLOB"
   "", "decimal128", "DECIMAL [1]_", ""
   "", "list", "ARRAY [1]_", ""
   "", "struct", "COMPOSITE TYPE [1]_", ""

.. rubric:: Footnotes

.. [1] Not implemented as of writing, but theoretically possible

If you want to preserve database types as faithfully as possible throughout the lifecycle of your DataFrame, use the dtype_backend="pyarrow" argument of :func:`~pandas.read_sql`:

# for round-tripping
with pg_dbapi.connect(uri) as conn:
    df2 = pd.read_sql("pandas_table", conn, dtype_backend="pyarrow")

This will prevent your data from being converted to the traditional pandas/NumPy type system, which often converts SQL types in ways that make them impossible to round-trip.

For a full list of ADBC drivers and their development status, see the ADBC Driver Implementation Status documentation.

to_numpy for NumPy nullable and Arrow types converts to suitable NumPy dtype

to_numpy will now convert NumPy nullable and PyArrow-backed extension dtypes to a suitable NumPy dtype instead of object dtype.

Old behavior:

In [1]: ser = pd.Series([1, 2, 3], dtype="Int64")
In [2]: ser.to_numpy()
Out[2]: array([1, 2, 3], dtype=object)

New behavior:

.. ipython:: python

    ser = pd.Series([1, 2, 3], dtype="Int64")
    ser.to_numpy()

    ser = pd.Series([1, 2, 3], dtype="timestamp[ns][pyarrow]")
    ser.to_numpy()

The default NumPy dtype (without any arguments) is determined as follows:

  • float dtypes are cast to NumPy floats
  • integer dtypes without missing values are cast to NumPy integer dtypes
  • integer dtypes with missing values are cast to NumPy float dtypes and NaN is used as missing value indicator
  • boolean dtypes without missing values are cast to NumPy bool dtype
  • boolean dtypes with missing values keep object dtype
  • datetime and timedelta types are cast to NumPy datetime64 and timedelta64 types respectively and NaT is used as missing value indicator
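When a result must not depend on these defaults, the target dtype and missing value marker can be passed explicitly. This is a small sketch using the existing ``dtype`` and ``na_value`` arguments of ``to_numpy``; choosing ``False`` as the fill value for booleans is an assumption made purely for illustration:

```python
import numpy as np
import pandas as pd

ints = pd.Series([1, None, 3], dtype="Int64")
flags = pd.Series([True, None], dtype="boolean")

# Request float64 and NaN explicitly instead of relying on the inferred default.
int_values = ints.to_numpy(dtype="float64", na_value=np.nan)

# Booleans with missing values need an explicit fill to get a NumPy bool array.
flag_values = flags.to_numpy(dtype="bool", na_value=False)

print(int_values.dtype, flag_values.dtype)
```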

Series.struct accessor for PyArrow structured data

The Series.struct accessor provides attributes and methods for processing a Series with struct[pyarrow] dtype. For example, :meth:`Series.struct.explode` converts PyArrow structured data to a pandas DataFrame. (:issue:`54938`)

.. ipython:: python

    import pyarrow as pa
    series = pd.Series(
        [
            {"project": "pandas", "version": "2.2.0"},
            {"project": "numpy", "version": "1.25.2"},
            {"project": "pyarrow", "version": "13.0.0"},
        ],
        dtype=pd.ArrowDtype(
            pa.struct([
                ("project", pa.string()),
                ("version", pa.string()),
            ])
        ),
    )
    series.struct.explode()

Use :meth:`Series.struct.field` to index into a (possibly nested) struct field.

.. ipython:: python

    series.struct.field("project")

Series.list accessor for PyArrow list data

The Series.list accessor provides attributes and methods for processing a Series with list[pyarrow] dtype. For example, :meth:`Series.list.__getitem__` allows indexing pyarrow lists in a Series. (:issue:`55323`)

.. ipython:: python

    import pyarrow as pa
    series = pd.Series(
        [
            [1, 2, 3],
            [4, 5],
            [6],
        ],
        dtype=pd.ArrowDtype(
            pa.list_(pa.int64())
        ),
    )
    series.list[0]

Calamine engine for :func:`read_excel`

The calamine engine was added to :func:`read_excel`. It uses python-calamine, which provides Python bindings for the Rust library calamine. This engine supports Excel files (.xlsx, .xlsm, .xls, .xlsb) and OpenDocument spreadsheets (.ods) (:issue:`50395`).

There are two advantages of this engine:

  1. Calamine is often faster than other engines. Some benchmarks show it reading up to 5x faster than 'openpyxl', 20x faster than 'odf', 4x faster than 'pyxlsb', and 1.5x faster than 'xlrd'. However, 'openpyxl' and 'pyxlsb' are faster when reading only a few rows from large files, because they iterate over rows lazily.
  2. Calamine recognizes datetimes in .xlsb files, unlike 'pyxlsb', which is the only other engine in pandas that can read .xlsb files.

pd.read_excel("path_to_file.xlsb", engine="calamine")

For more, see :ref:`io.calamine` in the user guide on IO tools.

Other enhancements

Notable bug fixes

These are bug fixes that might have notable behavior changes.

:func:`merge` and :meth:`DataFrame.join` now consistently follow documented sort behavior

In previous versions of pandas, :func:`merge` and :meth:`DataFrame.join` did not always return a result that followed the documented sort behavior. pandas now follows the documented sort behavior in merge and join operations (:issue:`54611`, :issue:`56426`, :issue:`56443`).

As documented, sort=True sorts the join keys lexicographically in the resulting :class:`DataFrame`. With sort=False, the order of the join keys depends on the join type (how keyword):

  • how="left": preserve the order of the left keys
  • how="right": preserve the order of the right keys
  • how="inner": preserve the order of the left keys
  • how="outer": sort keys lexicographically
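The rules above can be sketched with a small, made-up pair of frames (the column names ``key``, ``l`` and ``r`` are illustrative only). The example covers the left and right join cases and the effect of sort=True:

```python
import pandas as pd

left = pd.DataFrame({"key": [3, 1, 2], "l": ["c", "a", "b"]})
right = pd.DataFrame({"key": [1, 2, 3], "r": ["x", "y", "z"]})

# how="left", sort=False: the order of the left keys is preserved.
left_order = pd.merge(left, right, how="left", on="key", sort=False)

# how="right", sort=False: the order of the right keys is preserved.
right_order = pd.merge(left, right, how="right", on="key", sort=False)

# sort=True: the join keys are sorted regardless of input order.
sorted_keys = pd.merge(left, right, how="left", on="key", sort=True)

print(left_order["key"].tolist())
print(right_order["key"].tolist())
print(sorted_keys["key"].tolist())
```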

One example where the behavior changes is an inner join with non-unique left join keys and sort=False:

.. ipython:: python

    left = pd.DataFrame({"a": [1, 2, 1]})
    right = pd.DataFrame({"a": [1, 2]})
    result = pd.merge(left, right, how="inner", on="a", sort=False)

Old Behavior

In [5]: result
Out[5]:
   a
0  1
1  1
2  2

New Behavior

.. ipython:: python

    result

:func:`merge` and :meth:`DataFrame.join` no longer reorder levels when levels differ

In previous versions of pandas, :func:`merge` and :meth:`DataFrame.join` would reorder index levels when joining on two indexes with different levels (:issue:`34133`).

.. ipython:: python

    left = pd.DataFrame({"left": 1}, index=pd.MultiIndex.from_tuples([("x", 1), ("x", 2)], names=["A", "B"]))
    right = pd.DataFrame({"right": 2}, index=pd.MultiIndex.from_tuples([(1, 1), (2, 2)], names=["B", "C"]))
    left
    right
    result = left.join(right)

Old Behavior

In [5]: result
Out[5]:
       left  right
B A C
1 x 1     1      2
2 x 2     1      2

New Behavior

.. ipython:: python

    result

Backwards incompatible API changes

Increased minimum versions for dependencies

For optional dependencies the general recommendation is to use the latest version. Optional dependencies below the lowest tested version may still work but are not considered supported. The following table lists the optional dependencies that have had their minimum tested version increased.

.. csv-table::
   :header: "Package", "New Minimum Version"

   "beautifulsoup4", "4.11.2"
   "blosc", "1.21.3"
   "bottleneck", "1.3.6"
   "fastparquet", "2022.12.0"
   "fsspec", "2022.11.0"
   "gcsfs", "2022.11.0"
   "lxml", "4.9.2"
   "matplotlib", "3.6.3"
   "numba", "0.56.4"
   "numexpr", "2.8.4"
   "qtpy", "2.3.0"
   "openpyxl", "3.1.0"
   "psycopg2", "2.9.6"
   "pyreadstat", "1.2.0"
   "pytables", "3.8.0"
   "pyxlsb", "1.0.10"
   "s3fs", "2022.11.0"
   "scipy", "1.10.0"
   "sqlalchemy", "2.0.0"
   "tabulate", "0.9.0"
   "xarray", "2022.12.0"
   "xlsxwriter", "3.0.5"
   "zstandard", "0.19.0"
   "pyqt5", "5.15.8"
   "tzdata", "2022.7"

See :ref:`install.dependencies` and :ref:`install.optional_dependencies` for more.

Other API changes

Deprecations

Chained assignment

In preparation for larger upcoming changes to the copy / view behaviour in pandas 3.0 (:ref:`copy_on_write`, PDEP-7), we started deprecating chained assignment.

Chained assignment occurs when you try to update a pandas DataFrame or Series through two consecutive indexing operations. Depending on the type and order of those operations, this currently may or may not work.

A typical example is as follows:

df = pd.DataFrame({"foo": [1, 2, 3], "bar": [4, 5, 6]})

# first selecting rows with a mask, then assigning values to a column
# -> this has never worked and raises a SettingWithCopyWarning
df[df["bar"] > 5]["foo"] = 100

# first selecting the column, and then assigning to a subset of that column
# -> this currently works
df["foo"][df["bar"] > 5] = 100

This second example of chained assignment currently works to update the original df. This will no longer work in pandas 3.0, and therefore we started deprecating this:

>>> df["foo"][df["bar"] > 5] = 100
FutureWarning: ChainedAssignmentError: behaviour will change in pandas 3.0!
You are setting values through chained assignment. Currently this works in certain cases, but when using Copy-on-Write (which will become the default behaviour in pandas 3.0) this will never work to update the original DataFrame or Series, because the intermediate object on which we are setting values will behave as a copy.
A typical example is when you are setting values in a column of a DataFrame, like:

df["col"][row_indexer] = value

Use `df.loc[row_indexer, "col"] = values` instead, to perform the assignment in a single step and ensure this keeps updating the original `df`.

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy

You can fix this warning and ensure your code is ready for pandas 3.0 by removing the usage of chained assignment. Typically, this can be done by doing the assignment in a single step using for example .loc. For the example above, we can do:

df.loc[df["bar"] > 5, "foo"] = 100

The same deprecation applies to inplace methods that are done in a chained manner, such as:

>>> df["foo"].fillna(0, inplace=True)
FutureWarning: A value is trying to be set on a copy of a DataFrame or Series through chained assignment using an inplace method.
The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.

For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.

When the goal is to update the column in the DataFrame df, the alternative here is to call the method on df itself, such as df.fillna({"foo": 0}, inplace=True).
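Both alternatives named in the warning message can be sketched as follows. The frame and its values are made up for illustration:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"foo": [1.0, np.nan, 3.0], "bar": [4.0, 5.0, np.nan]})

# Alternative 1: call the method on the DataFrame, restricted to one column.
df.fillna({"foo": 0}, inplace=True)

# Alternative 2: reassign the column from a non-inplace call.
df["bar"] = df["bar"].fillna(0)

print(df)
```

Either form updates ``df`` directly, so it does not rely on an intermediate object behaving as a view.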

See more details in the :ref:`migration guide <copy_on_write.migration_guide>`.

Deprecate aliases M, Q, Y, etc. in favour of ME, QE, YE, etc. for offsets

Deprecated the following frequency aliases (:issue:`9586`):

.. csv-table::
   :header: "offsets", "deprecated aliases", "new aliases"

   ":class:`MonthEnd`", "M", "ME"
   ":class:`BusinessMonthEnd`", "BM", "BME"
   ":class:`SemiMonthEnd`", "SM", "SME"
   ":class:`CustomBusinessMonthEnd`", "CBM", "CBME"
   ":class:`QuarterEnd`", "Q", "QE"
   ":class:`BQuarterEnd`", "BQ", "BQE"
   ":class:`YearEnd`", "Y", "YE"
   ":class:`BYearEnd`", "BY", "BYE"

For example:

Previous behavior:

In [8]: pd.date_range('2020-01-01', periods=3, freq='Q-NOV')
Out[8]:
DatetimeIndex(['2020-02-29', '2020-05-31', '2020-08-31'],
              dtype='datetime64[ns]', freq='Q-NOV')

Future behavior:

.. ipython:: python

    pd.date_range('2020-01-01', periods=3, freq='QE-NOV')
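
Code that has to run on pandas versions both before and after this deprecation can sidestep the alias strings by passing an offset object. This is a sketch, assuming that ``pd.offsets.QuarterEnd(startingMonth=11)`` is the offset behind the ``Q-NOV`` / ``QE-NOV`` aliases:

```python
import pandas as pd

# Offset object used in place of the "Q-NOV" (old) or "QE-NOV" (new) alias.
quarter_end_nov = pd.offsets.QuarterEnd(startingMonth=11)

idx = pd.date_range("2020-01-01", periods=3, freq=quarter_end_nov)
dates = [ts.strftime("%Y-%m-%d") for ts in idx]

print(dates)
```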

Deprecated automatic downcasting

Deprecated the automatic downcasting of object dtype results in a number of methods. These would silently change the dtype in a hard-to-predict manner since the behavior was value dependent. Additionally, pandas is moving away from silent dtype changes (:issue:`54710`, :issue:`54261`).

These methods are:

Explicitly call :meth:`DataFrame.infer_objects` to replicate the current behavior in the future.

result = result.infer_objects(copy=False)

Or explicitly cast all-round floats to ints using astype.

Set the following option to opt into the future behavior:

In [9]: pd.set_option("future.no_silent_downcasting", True)
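
A short sketch of the recommended pattern, using a made-up object-dtype Series and assuming pandas 2.0 or later (where ``infer_objects`` accepts ``copy``). The dtype of the intermediate ``filled`` result depends on the pandas version and on the option above, and on pandas 2.2 the ``fillna`` call may emit the deprecation warning. The explicit ``infer_objects`` call is what makes the final dtype predictable:

```python
import pandas as pd

ser = pd.Series([1, None, 3], dtype=object)

# Intermediate dtype is version dependent: it may be downcast or stay object.
filled = ser.fillna(0)

# Explicitly request inference so the final dtype is the same either way.
result = filled.infer_objects(copy=False)

print(result.dtype)
```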

Other Deprecations

Performance improvements

Bug fixes

Categorical

Datetimelike

Timedelta

Timezones

Numeric

Conversion

Strings

Interval

Indexing

Missing

MultiIndex

I/O

Period

Plotting

Groupby/resample/rolling

Reshaping

Sparse

Other

Contributors