
What's new in 0.25.0 (July 18, 2019)
------------------------------------

.. warning::

   Starting with the 0.25.x series of releases, pandas only supports Python 3.5.3 and higher. See Dropping Python 2.7 for more details.

.. warning::

   The minimum supported Python version will be bumped to 3.6 in a future release.

.. warning::

   ``Panel`` has been fully removed. For N-D labeled data structures, please use xarray.

.. warning::

   :func:`read_pickle` and :func:`read_msgpack` are only guaranteed backwards compatible back to pandas version 0.20.3 (:issue:`27082`)

{{ header }}

These are the changes in pandas 0.25.0. See :ref:`release` for a full changelog including other versions of pandas.

Enhancements
~~~~~~~~~~~~

Groupby aggregation with relabeling
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Pandas has added special groupby behavior, known as "named aggregation", for naming the output columns when applying multiple aggregation functions to specific columns (:issue:`18366`, :issue:`26512`).

.. ipython:: python

   animals = pd.DataFrame({'kind': ['cat', 'dog', 'cat', 'dog'],
                           'height': [9.1, 6.0, 9.5, 34.0],
                           'weight': [7.9, 7.5, 9.9, 198.0]})
   animals
   animals.groupby("kind").agg(
       min_height=pd.NamedAgg(column='height', aggfunc='min'),
       max_height=pd.NamedAgg(column='height', aggfunc='max'),
       average_weight=pd.NamedAgg(column='weight', aggfunc=np.mean),
   )

Pass the desired column names as the ``**kwargs`` to ``.agg``. The values of ``**kwargs`` should be tuples where the first element is the column selection and the second element is the aggregation function to apply. Pandas provides the ``pandas.NamedAgg`` namedtuple to make it clearer what the arguments to the function are, but plain tuples are accepted as well.

.. ipython:: python

   animals.groupby("kind").agg(
       min_height=('height', 'min'),
       max_height=('height', 'max'),
       average_weight=('weight', np.mean),
   )

Named aggregation is the recommended replacement for the deprecated "dict-of-dicts" approach to naming the output of column-specific aggregations (:ref:`whatsnew_0200.api_breaking.deprecate_group_agg_dict`).

A similar approach is now available for Series groupby objects as well. Because there is no need for column selection, the values can just be the functions to apply:

.. ipython:: python

   animals.groupby("kind").height.agg(
       min_height="min",
       max_height="max",
   )


This type of aggregation is the recommended alternative to the deprecated behavior when passing a dict to a Series groupby aggregation (:ref:`whatsnew_0200.api_breaking.deprecate_group_agg_dict`).

See :ref:`groupby.aggregate.named` for more.

Groupby aggregation with multiple lambdas
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

You can now provide multiple lambda functions to a list-like aggregation in :meth:`pandas.core.groupby.GroupBy.agg` (:issue:`26430`).

.. ipython:: python

   animals.groupby('kind').height.agg([
       lambda x: x.iloc[0], lambda x: x.iloc[-1]
   ])

   animals.groupby('kind').agg([
       lambda x: x.iloc[0] - x.iloc[1],
       lambda x: x.iloc[0] + x.iloc[1]
   ])

Previously, these raised a ``SpecificationError``.

Better repr for MultiIndex
^^^^^^^^^^^^^^^^^^^^^^^^^^

Printing of :class:`MultiIndex` instances now shows tuples of each row and ensures that the tuple items are vertically aligned, so it's now easier to understand the structure of the MultiIndex (:issue:`13480`):

The repr now looks like this:

.. ipython:: python

   pd.MultiIndex.from_product([['a', 'abc'], range(500)])

Previously, outputting a :class:`MultiIndex` printed all the levels and codes of the MultiIndex, which was visually unappealing and made the output more difficult to navigate. For example (limiting the range to 5):

.. code-block:: ipython

    In [1]: pd.MultiIndex.from_product([['a', 'abc'], range(5)])
    Out[1]: MultiIndex(levels=[['a', 'abc'], [0, 1, 2, 3, 4]],
       ...:            codes=[[0, 0, 0, 0, 0, 1, 1, 1, 1, 1], [0, 1, 2, 3, 4, 0, 1, 2, 3, 4]])

In the new repr, all values will be shown if the number of rows is smaller than :attr:`options.display.max_seq_items` (default: 100 items). Horizontally, the output will truncate if it's wider than :attr:`options.display.width` (default: 80 characters).

Shorter truncated repr for Series and DataFrame
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Currently, the default display options of pandas ensure that when a Series or DataFrame has more than 60 rows, its repr gets truncated to this maximum of 60 rows (the ``display.max_rows`` option). However, this still gives a repr that takes up a large part of the vertical screen real estate. Therefore, a new option ``display.min_rows`` is introduced with a default of 10, which determines the number of rows shown in the truncated repr:

- For small Series or DataFrames, up to ``max_rows`` rows are shown (default: 60).
- For larger Series or DataFrames with a length above ``max_rows``, only ``min_rows`` rows are shown (default: 10, i.e. the first and last 5 rows).

This dual option allows you to still see the full content of relatively small objects (e.g. ``df.head(20)`` shows all 20 rows), while giving a brief repr for large objects.

To restore the previous behaviour of a single threshold, set ``pd.options.display.min_rows = None``.

``json_normalize`` with ``max_level`` parameter support
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

:func:`json_normalize` normalizes the provided input dict to all nested levels. The new ``max_level`` parameter provides more control over the level at which to end normalization (:issue:`23843`):

.. ipython:: python

    from pandas.io.json import json_normalize
    data = [{
        'CreatedBy': {'Name': 'User001'},
        'Lookup': {'TextField': 'Some text',
                   'UserField': {'Id': 'ID001', 'Name': 'Name001'}},
        'Image': {'a': 'b'}
    }]
    json_normalize(data, max_level=1)


Series.explode to split list-like values to rows
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

:class:`Series` and :class:`DataFrame` have gained the :meth:`Series.explode` and :meth:`DataFrame.explode` methods to transform list-likes into individual rows. See the :ref:`section on Exploding list-like column <reshaping.explode>` in the docs for more information (:issue:`16538`, :issue:`10511`).

Here is a typical use case: you have comma-separated strings in a column.

.. ipython:: python

    df = pd.DataFrame([{'var1': 'a,b,c', 'var2': 1},
                       {'var1': 'd,e,f', 'var2': 2}])
    df

Creating a long-form DataFrame is now straightforward using chained operations:

.. ipython:: python

    df.assign(var1=df.var1.str.split(',')).explode('var1')

Other enhancements
^^^^^^^^^^^^^^^^^^

Backwards incompatible API changes
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Indexing with date strings with UTC offsets
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Indexing a :class:`DataFrame` or :class:`Series` with a :class:`DatetimeIndex` with a date string with a UTC offset would previously ignore the UTC offset. Now, the UTC offset is respected in indexing. (:issue:`24076`, :issue:`16785`)

.. ipython:: python

    df = pd.DataFrame([0], index=pd.DatetimeIndex(['2019-01-01'], tz='US/Pacific'))
    df

Previous behavior:

.. code-block:: ipython

    In [3]: df['2019-01-01 00:00:00+04:00':'2019-01-01 01:00:00+04:00']
    Out[3]:
                               0
    2019-01-01 00:00:00-08:00  0

New behavior:

.. ipython:: python

    df['2019-01-01 12:00:00+04:00':'2019-01-01 13:00:00+04:00']


MultiIndex constructed from levels and codes
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Constructing a :class:`MultiIndex` with ``NaN`` levels or codes values < -1 was allowed previously. Now, construction with codes values < -1 is not allowed, and the codes corresponding to ``NaN`` levels are reassigned to -1 (:issue:`19387`).

Previous behavior:

.. code-block:: ipython

    In [1]: pd.MultiIndex(levels=[[np.nan, None, pd.NaT, 128, 2]],
       ...:               codes=[[0, -1, 1, 2, 3, 4]])
       ...:
    Out[1]: MultiIndex(levels=[[nan, None, NaT, 128, 2]],
                       codes=[[0, -1, 1, 2, 3, 4]])

    In [2]: pd.MultiIndex(levels=[[1, 2]], codes=[[0, -2]])
    Out[2]: MultiIndex(levels=[[1, 2]],
                       codes=[[0, -2]])

New behavior:

.. ipython:: python
    :okexcept:

    pd.MultiIndex(levels=[[np.nan, None, pd.NaT, 128, 2]],
                  codes=[[0, -1, 1, 2, 3, 4]])
    pd.MultiIndex(levels=[[1, 2]], codes=[[0, -2]])


GroupBy.apply on DataFrame evaluates first group only once
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The implementation of :meth:`DataFrameGroupBy.apply() <pandas.core.groupby.DataFrameGroupBy.apply>` previously evaluated the supplied function consistently twice on the first group to infer if it is safe to use a fast code path. Particularly for functions with side effects, this was an undesired behavior and may have led to surprises. (:issue:`2936`, :issue:`2656`, :issue:`7739`, :issue:`10519`, :issue:`12155`, :issue:`20084`, :issue:`21417`)

Now every group is evaluated only a single time.

.. ipython:: python

    df = pd.DataFrame({"a": ["x", "y"], "b": [1, 2]})
    df

    def func(group):
        print(group.name)
        return group

Previous behavior:

.. code-block:: ipython

    In [3]: df.groupby('a').apply(func)
    x
    x
    y
    Out[3]:
       a  b
    0  x  1
    1  y  2

New behavior:

.. ipython:: python

    df.groupby("a").apply(func)


Concatenating sparse values
^^^^^^^^^^^^^^^^^^^^^^^^^^^

When passed DataFrames whose values are sparse, :func:`concat` will now return a :class:`Series` or :class:`DataFrame` with sparse values, rather than a :class:`SparseDataFrame` (:issue:`25702`).

.. ipython:: python

   df = pd.DataFrame({"A": pd.SparseArray([0, 1])})

Previous behavior:

.. code-block:: ipython

    In [2]: type(pd.concat([df, df]))
    pandas.core.sparse.frame.SparseDataFrame

New behavior:

.. ipython:: python

   type(pd.concat([df, df]))


This now matches the existing behavior of :func:`concat` on Series with sparse values. :func:`concat` will continue to return a ``SparseDataFrame`` when all the values are instances of ``SparseDataFrame``.

This change also affects routines using :func:`concat` internally, like :func:`get_dummies`, which now returns a :class:`DataFrame` in all cases (previously a SparseDataFrame was returned if all the columns were dummy encoded, and a :class:`DataFrame` otherwise).
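For instance, a quick check of the :func:`get_dummies` side of this change (``sparse=True`` requests sparse dummy columns; the input values are illustrative):

```python
import pandas as pd

s = pd.Series(["a", "b", "a"])

# The dummy columns hold sparse values, but the container is a plain
# DataFrame rather than the old SparseDataFrame subclass.
dummies = pd.get_dummies(s, sparse=True)
print(type(dummies).__name__)  # DataFrame
```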

Providing any ``SparseSeries`` or ``SparseDataFrame`` to :func:`concat` will cause a ``SparseSeries`` or ``SparseDataFrame`` to be returned, as before.

The .str-accessor performs stricter type checks
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Due to the lack of more fine-grained dtypes, :attr:`Series.str` so far only checked whether the data was of object dtype. :attr:`Series.str` will now infer the dtype data within the Series; in particular, 'bytes'-only data will raise an exception (except for :meth:`Series.str.decode`, :meth:`Series.str.get`, :meth:`Series.str.len`, :meth:`Series.str.slice`), see :issue:`23163`, :issue:`23011`, :issue:`23551`.

Previous behavior:

.. code-block:: ipython

    In [1]: s = pd.Series(np.array(['a', 'ba', 'cba'], 'S'), dtype=object)

    In [2]: s
    Out[2]:
    0      b'a'
    1     b'ba'
    2    b'cba'
    dtype: object

    In [3]: s.str.startswith(b'a')
    Out[3]:
    0     True
    1    False
    2    False
    dtype: bool

New behavior:

.. ipython:: python
    :okexcept:

    s = pd.Series(np.array(['a', 'ba', 'cba'], 'S'), dtype=object)
    s
    s.str.startswith(b'a')

Categorical dtypes are preserved during groupby
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Previously, columns that were categorical, but not the groupby key(s), would be converted to object dtype during groupby operations. Pandas now preserves these dtypes (:issue:`18502`).

.. ipython:: python

   cat = pd.Categorical(["foo", "bar", "bar", "qux"], ordered=True)
   df = pd.DataFrame({'payload': [-1, -2, -1, -2], 'col': cat})
   df
   df.dtypes

Previous behavior:

.. code-block:: ipython

    In [5]: df.groupby('payload').first().col.dtype
    Out[5]: dtype('O')

New behavior:

.. ipython:: python

   df.groupby('payload').first().col.dtype


Incompatible Index type unions
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

When performing :func:`Index.union` operations between objects of incompatible dtypes, the result will be a base :class:`Index` of dtype object. This behavior holds true for unions between :class:`Index` objects that previously would have been prohibited. The dtype of empty :class:`Index` objects will now be evaluated before performing union operations rather than simply returning the other :class:`Index` object. :func:`Index.union` can now be considered commutative, such that ``A.union(B) == B.union(A)`` (:issue:`23525`).

Previous behavior:

.. code-block:: ipython

    In [1]: pd.period_range('19910905', periods=2).union(pd.Int64Index([1, 2, 3]))
    ...
    ValueError: can only call with other PeriodIndex-ed objects

    In [2]: pd.Index([], dtype=object).union(pd.Index([1, 2, 3]))
    Out[2]: Int64Index([1, 2, 3], dtype='int64')

New behavior:

.. ipython:: python

    pd.period_range('19910905', periods=2).union(pd.Int64Index([1, 2, 3]))
    pd.Index([], dtype=object).union(pd.Index([1, 2, 3]))

Note that integer- and floating-dtype indexes are considered "compatible". The integer values are coerced to floating point, which may result in loss of precision. See :ref:`indexing.set_ops` for more.
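A minimal illustration of that compatibility rule (the index values are made up):

```python
import pandas as pd

# Union of an integer index and a float index is computed on floats,
# so very large integers can lose precision in the result.
left = pd.Index([1, 2, 3])
right = pd.Index([0.5, 1.5])
result = left.union(right)
print(result.dtype)  # float64
```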

DataFrame groupby ffill/bfill no longer return group labels
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The ``ffill``, ``bfill``, ``pad`` and ``backfill`` methods of :class:`DataFrameGroupBy <pandas.core.groupby.DataFrameGroupBy>` previously included the group labels in the return value, which was inconsistent with other groupby transforms. Now only the filled values are returned (:issue:`21521`).

.. ipython:: python

    df = pd.DataFrame({"a": ["x", "y"], "b": [1, 2]})
    df

Previous behavior:

.. code-block:: ipython

    In [3]: df.groupby("a").ffill()
    Out[3]:
       a  b
    0  x  1
    1  y  2

New behavior:

.. ipython:: python

    df.groupby("a").ffill()

DataFrame describe on an empty categorical / object column will return top and freq
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

When calling :meth:`DataFrame.describe` with an empty categorical / object column, the 'top' and 'freq' columns were previously omitted, which was inconsistent with the output for non-empty columns. Now the 'top' and 'freq' columns will always be included, with :attr:`numpy.nan` in the case of an empty :class:`DataFrame` (:issue:`26397`)

.. ipython:: python

   df = pd.DataFrame({"empty_col": pd.Categorical([])})
   df

Previous behavior:

.. code-block:: ipython

    In [3]: df.describe()
    Out[3]:
            empty_col
    count           0
    unique          0

New behavior:

.. ipython:: python

    df.describe()

``__str__`` methods now call ``__repr__`` rather than vice versa
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Pandas has until now mostly defined string representations in a pandas object's ``__str__``/``__unicode__``/``__bytes__`` methods, and called ``__str__`` from the ``__repr__`` method, if a specific ``__repr__`` method is not found. This is not needed for Python 3. In pandas 0.25, the string representations of pandas objects are now generally defined in ``__repr__``, and calls to ``__str__`` in general now pass the call on to the ``__repr__``, if a specific ``__str__`` method doesn't exist, as is standard for Python. This change is backward compatible for direct usage of pandas, but if you subclass pandas objects and give your subclasses specific ``__str__``/``__repr__`` methods, you may have to adjust your ``__str__``/``__repr__`` methods (:issue:`26495`).
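The Python fallback this relies on can be sketched with a plain (non-pandas) toy class:

```python
class Box:
    """Toy class for illustration; not a pandas object."""

    def __init__(self, value):
        self.value = value

    def __repr__(self):
        # With only __repr__ defined, str() falls back to it --
        # the standard-Python behavior pandas now follows.
        return "Box({!r})".format(self.value)

print(str(Box(3)))  # Box(3)
```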

Indexing an IntervalIndex with Interval objects
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Indexing methods for :class:`IntervalIndex` have been modified to require exact matches only for :class:`Interval` queries. IntervalIndex methods previously matched on any overlapping Interval. Behavior with scalar points, e.g. querying with an integer, is unchanged (:issue:`16316`).

.. ipython:: python

   ii = pd.IntervalIndex.from_tuples([(0, 4), (1, 5), (5, 8)])
   ii

The ``in`` operator (``__contains__``) now only returns ``True`` for exact matches to Intervals in the IntervalIndex, whereas it previously returned ``True`` for any Interval overlapping an Interval in the IntervalIndex.

Previous behavior:

.. code-block:: ipython

    In [4]: pd.Interval(1, 2, closed='neither') in ii
    Out[4]: True

    In [5]: pd.Interval(-10, 10, closed='both') in ii
    Out[5]: True

New behavior:

.. ipython:: python

   pd.Interval(1, 2, closed='neither') in ii
   pd.Interval(-10, 10, closed='both') in ii

The :meth:`~IntervalIndex.get_loc` method now only returns locations for exact matches to Interval queries, as opposed to the previous behavior of returning locations for overlapping matches. A ``KeyError`` will be raised if an exact match is not found.

Previous behavior:

.. code-block:: ipython

    In [6]: ii.get_loc(pd.Interval(1, 5))
    Out[6]: array([0, 1])

    In [7]: ii.get_loc(pd.Interval(2, 6))
    Out[7]: array([0, 1, 2])

New behavior:

.. code-block:: ipython

    In [6]: ii.get_loc(pd.Interval(1, 5))
    Out[6]: 1

    In [7]: ii.get_loc(pd.Interval(2, 6))
    ---------------------------------------------------------------------------
    KeyError: Interval(2, 6, closed='right')

Likewise, :meth:`~IntervalIndex.get_indexer` and :meth:`~IntervalIndex.get_indexer_non_unique` will also only return locations for exact matches to Interval queries, with -1 denoting that an exact match was not found.
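A minimal sketch of the exact-match semantics of ``get_indexer``. The index here is a made-up non-overlapping one, since ``get_indexer`` requires a non-overlapping IntervalIndex (overlapping indexes need ``get_indexer_non_unique``):

```python
import pandas as pd

# A non-overlapping IntervalIndex (illustrative values).
idx = pd.IntervalIndex.from_tuples([(0, 1), (2, 3), (4, 5)])

# Exact matches return their location; -1 marks queries without an
# exact match, even if they overlap intervals in the index.
result = idx.get_indexer([pd.Interval(2, 3), pd.Interval(2, 4)])
print(list(result))  # [1, -1]
```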

These indexing changes extend to querying a :class:`Series` or :class:`DataFrame` with an IntervalIndex index.

.. ipython:: python

   s = pd.Series(list('abc'), index=ii)
   s

Selecting from a Series or DataFrame using [] (__getitem__) or loc now only returns exact matches for Interval queries.

Previous behavior:

.. code-block:: ipython

    In [8]: s[pd.Interval(1, 5)]
    Out[8]:
    (0, 4]    a
    (1, 5]    b
    dtype: object

    In [9]: s.loc[pd.Interval(1, 5)]
    Out[9]:
    (0, 4]    a
    (1, 5]    b
    dtype: object

New behavior:

.. ipython:: python

   s[pd.Interval(1, 5)]
   s.loc[pd.Interval(1, 5)]

Similarly, a KeyError will be raised for non-exact matches instead of returning overlapping matches.

Previous behavior:

.. code-block:: ipython

    In [9]: s[pd.Interval(2, 3)]
    Out[9]:
    (0, 4]    a
    (1, 5]    b
    dtype: object

    In [10]: s.loc[pd.Interval(2, 3)]
    Out[10]:
    (0, 4]    a
    (1, 5]    b
    dtype: object

New behavior:

.. code-block:: ipython

    In [6]: s[pd.Interval(2, 3)]
    ---------------------------------------------------------------------------
    KeyError: Interval(2, 3, closed='right')

    In [7]: s.loc[pd.Interval(2, 3)]
    ---------------------------------------------------------------------------
    KeyError: Interval(2, 3, closed='right')

The :meth:`~IntervalIndex.overlaps` method can be used to create a boolean indexer that replicates the previous behavior of returning overlapping matches.

New behavior:

.. ipython:: python

   idxr = s.index.overlaps(pd.Interval(2, 3))
   idxr
   s[idxr]
   s.loc[idxr]


Binary ufuncs on Series now align
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Applying a binary ufunc like :func:`numpy.power` now aligns the inputs when both are :class:`Series` (:issue:`23293`).

.. ipython:: python

   s1 = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
   s2 = pd.Series([3, 4, 5], index=['d', 'c', 'b'])
   s1
   s2

Previous behavior:

.. code-block:: ipython

    In [5]: np.power(s1, s2)
    Out[5]:
    a      1
    b     16
    c    243
    dtype: int64

New behavior:

.. ipython:: python

   np.power(s1, s2)

This matches the behavior of other binary operations in pandas, like :meth:`Series.add`. To retain the previous behavior, convert the other Series to an array before applying the ufunc.

.. ipython:: python

   np.power(s1, s2.array)

Categorical.argsort now places missing values at the end
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

:meth:`Categorical.argsort` now places missing values at the end of the array, making it consistent with NumPy and the rest of pandas (:issue:`21801`).

.. ipython:: python

   cat = pd.Categorical(['b', None, 'a'], categories=['a', 'b'], ordered=True)

Previous behavior:

.. code-block:: ipython

    In [2]: cat = pd.Categorical(['b', None, 'a'], categories=['a', 'b'], ordered=True)

    In [3]: cat.argsort()
    Out[3]: array([1, 2, 0])

    In [4]: cat[cat.argsort()]
    Out[4]:
    [NaN, a, b]
    categories (2, object): [a < b]

New behavior:

.. ipython:: python

   cat.argsort()
   cat[cat.argsort()]

Column order is preserved when passing a list of dicts to DataFrame
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Starting with Python 3.7, the key order of a dict is guaranteed. In practice, this has been true since Python 3.6. The :class:`DataFrame` constructor now treats a list of dicts in the same way as it does a list of ``OrderedDict``, i.e. preserving the order of the dicts. This change applies only when pandas is running on Python >= 3.6 (:issue:`27309`).

.. ipython:: python

   data = [
       {'name': 'Joe', 'state': 'NY', 'age': 18},
       {'name': 'Jane', 'state': 'KY', 'age': 19, 'hobby': 'Minecraft'},
       {'name': 'Jean', 'state': 'OK', 'age': 20, 'finances': 'good'}
   ]

Previous behavior:

The columns were lexicographically sorted previously:

.. code-block:: ipython

    In [1]: pd.DataFrame(data)
    Out[1]:
       age finances      hobby  name state
    0   18      NaN        NaN   Joe    NY
    1   19      NaN  Minecraft  Jane    KY
    2   20     good        NaN  Jean    OK

New behavior:

The column order now matches the insertion order of the keys in the dicts, considering all the records from top to bottom. As a consequence, the column order of the resulting DataFrame has changed compared to previous pandas versions.

.. ipython:: python

   pd.DataFrame(data)

Increased minimum versions for dependencies
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Due to dropping support for Python 2.7, a number of optional dependencies have updated minimum versions (:issue:`25725`, :issue:`24942`, :issue:`25752`). Independently, some minimum supported versions of dependencies were updated (:issue:`23519`, :issue:`25554`). If installed, we now require:

=============== =============== ========
Package         Minimum Version Required
=============== =============== ========
numpy           1.13.3          X
pytz            2015.4          X
python-dateutil 2.6.1           X
bottleneck      1.2.1
numexpr         2.6.2
pytest (dev)    4.0.2
=============== =============== ========

For optional libraries the general recommendation is to use the latest version. The following table lists the lowest version per library that is currently being tested throughout the development of pandas. Optional libraries below the lowest tested version may still work, but are not considered supported.

============== ===============
Package        Minimum Version
============== ===============
beautifulsoup4 4.6.0
fastparquet    0.2.1
gcsfs          0.2.2
lxml           3.8.0
matplotlib     2.2.2
openpyxl       2.4.8
pyarrow        0.9.0
pymysql        0.7.1
pytables       3.4.2
scipy          0.19.0
sqlalchemy     1.1.4
xarray         0.8.2
xlrd           1.1.0
xlsxwriter     0.9.8
xlwt           1.2.0
============== ===============

See :ref:`install.dependencies` and :ref:`install.optional_dependencies` for more.

Other API changes
~~~~~~~~~~~~~~~~~

Deprecations
~~~~~~~~~~~~

Sparse subclasses
^^^^^^^^^^^^^^^^^

The ``SparseSeries`` and ``SparseDataFrame`` subclasses are deprecated. Their functionality is better provided by a :class:`Series` or :class:`DataFrame` with sparse values.

Previous way

.. ipython:: python
   :okwarning:

   df = pd.SparseDataFrame({"A": [0, 0, 1, 2]})
   df.dtypes

New way

.. ipython:: python

   df = pd.DataFrame({"A": pd.SparseArray([0, 0, 1, 2])})
   df.dtypes

The memory usage of the two approaches is identical. See :ref:`sparse.migration` for more (:issue:`19239`).
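A rough sketch of why sparse values pay off here: storage scales with the non-fill entries rather than the full length (array sizes are illustrative; ``pd.arrays.SparseArray`` is an equivalent spelling of the constructor, and ``fill_value=0`` marks zeros as the fill value):

```python
import numpy as np
import pandas as pd

# Mostly-zero data: the sparse column stores only the non-fill entries.
dense = pd.DataFrame({"A": np.zeros(10_000)})
sparse = pd.DataFrame({"A": pd.arrays.SparseArray(np.zeros(10_000), fill_value=0)})

print(sparse.memory_usage(deep=True)["A"] < dense.memory_usage(deep=True)["A"])  # True
```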

msgpack format
^^^^^^^^^^^^^^

The msgpack format is deprecated as of 0.25 and will be removed in a future version. It is recommended to use pyarrow for on-the-wire transmission of pandas objects. (:issue:`27084`)

Other deprecations
^^^^^^^^^^^^^^^^^^

Removal of prior version deprecations/changes
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Performance improvements
~~~~~~~~~~~~~~~~~~~~~~~~

Bug fixes
~~~~~~~~~

Categorical
^^^^^^^^^^^

Datetimelike
^^^^^^^^^^^^

Timedelta
^^^^^^^^^

Timezones
^^^^^^^^^

Numeric
^^^^^^^

Conversion
^^^^^^^^^^

Strings
^^^^^^^

Interval
^^^^^^^^

Indexing
^^^^^^^^

Missing
^^^^^^^

MultiIndex
^^^^^^^^^^

I/O
^^^

Plotting
^^^^^^^^

Groupby/resample/rolling
^^^^^^^^^^^^^^^^^^^^^^^^

Reshaping
^^^^^^^^^

Sparse
^^^^^^

Build Changes
^^^^^^^^^^^^^

ExtensionArray
^^^^^^^^^^^^^^

Other
^^^^^

Contributors
~~~~~~~~~~~~

.. contributors:: v0.24.x..HEAD