What's new in 1.3.0 (??)

These are the changes in pandas 1.3.0. See :ref:`release` for a full changelog including other versions of pandas.

{{ header }}

.. warning::

   When reading new Excel 2007+ (``.xlsx``) files, the default argument ``engine=None`` to :func:`~pandas.read_excel` will now result in using the openpyxl engine in all cases when the option :attr:`io.excel.xlsx.reader` is set to ``"auto"``. Previously, some cases would use the xlrd engine instead. See :ref:`What's new 1.2.0 <whatsnew_120>` for background on this change.
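
If you want to be explicit rather than rely on the option-driven default, the engine can always be passed directly. A minimal sketch, assuming an existing ``.xlsx`` file (the file name here is a placeholder):

.. code-block:: python

    # Placeholder file name; pinning engine="openpyxl" avoids any ambiguity
    # about which reader the engine=None default resolves to
    df = pd.read_excel("report.xlsx", engine="openpyxl")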

Enhancements

Custom HTTP(s) headers when reading csv or json files

When reading from a remote URL that is not handled by fsspec (i.e. HTTP and HTTPS), the dictionary passed to ``storage_options`` will be used to create the headers included in the request. This can be used to control the ``User-Agent`` header or send other custom headers (:issue:`36688`). For example:

.. ipython:: python

    headers = {"User-Agent": "pandas"}
    df = pd.read_csv(
        "https://download.bls.gov/pub/time.series/cu/cu.item",
        sep="\t",
        storage_options=headers
    )
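
The same ``storage_options`` dictionary is used by other readers that fetch plain HTTP(S) URLs, such as :func:`pandas.read_json`. A minimal sketch (the URL below is a placeholder):

.. code-block:: python

    # Placeholder URL; the headers dictionary is sent with the HTTP request
    headers = {"User-Agent": "pandas", "X-Custom-Header": "example"}
    df = pd.read_json(
        "https://example.com/data.json",
        storage_options=headers,
    )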

Read and write XML documents

We added I/O support to read and render shallow versions of XML documents with :func:`pandas.read_xml` and :meth:`DataFrame.to_xml`. Using lxml as the parser, both XPath 1.0 and XSLT 1.0 are available. (:issue:`27554`)

In [1]: xml = """<?xml version='1.0' encoding='utf-8'?>
   ...: <data>
   ...:  <row>
   ...:     <shape>square</shape>
   ...:     <degrees>360</degrees>
   ...:     <sides>4.0</sides>
   ...:  </row>
   ...:  <row>
   ...:     <shape>circle</shape>
   ...:     <degrees>360</degrees>
   ...:     <sides/>
   ...:  </row>
   ...:  <row>
   ...:     <shape>triangle</shape>
   ...:     <degrees>180</degrees>
   ...:     <sides>3.0</sides>
   ...:  </row>
   ...:  </data>"""

In [2]: df = pd.read_xml(xml)
In [3]: df
Out[3]:
      shape  degrees  sides
0    square      360    4.0
1    circle      360    NaN
2  triangle      180    3.0

In [4]: df.to_xml()
Out[4]:
<?xml version='1.0' encoding='utf-8'?>
<data>
  <row>
    <index>0</index>
    <shape>square</shape>
    <degrees>360</degrees>
    <sides>4.0</sides>
  </row>
  <row>
    <index>1</index>
    <shape>circle</shape>
    <degrees>360</degrees>
    <sides/>
  </row>
  <row>
    <index>2</index>
    <shape>triangle</shape>
    <degrees>180</degrees>
    <sides>3.0</sides>
  </row>
</data>
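
Writing to and reading back from a file works the same way. A minimal sketch, assuming lxml is installed and the path is writable (the file name and the explicit ``xpath`` are illustrative):

.. code-block:: python

    # Round-trip the frame through a file, dropping the index on write,
    # then select the <row> elements explicitly with an XPath expression
    df.to_xml("shapes.xml", index=False)
    pd.read_xml("shapes.xml", xpath="//row")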

For more, see :ref:`io.xml` in the user guide on IO tools.

DataFrame constructor honors copy=False with dict

When passing a dictionary to :class:`DataFrame` with copy=False, a copy will no longer be made (:issue:`32960`)

.. ipython:: python

    arr = np.array([1, 2, 3])
    df = pd.DataFrame({"A": arr, "B": arr.copy()}, copy=False)
    df

df["A"] remains a view on arr:

.. ipython:: python

    arr[0] = 0
    assert df.iloc[0, 0] == 0

The default behavior when not passing copy will remain unchanged, i.e. a copy will be made.
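
A minimal sketch of the unchanged default, where the constructor still copies the dictionary's arrays, so later writes to ``arr`` are not visible in the frame:

.. code-block:: python

    # copy is not passed, so a copy is made and arr and the frame are independent
    arr = np.array([1, 2, 3])
    df_default = pd.DataFrame({"A": arr})
    arr[0] = 0
    assert df_default.iloc[0, 0] == 1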

Other enhancements

Notable bug fixes

These are bug fixes that might have notable behavior changes.

:meth:`~pandas.DataFrame.combine_first` will now preserve dtypes (:issue:`7509`)

.. ipython:: python

   df1 = pd.DataFrame({"A": [1, 2, 3], "B": [1, 2, 3]}, index=[0, 1, 2])
   df1
   df2 = pd.DataFrame({"B": [4, 5, 6], "C": [1, 2, 3]}, index=[2, 3, 4])
   df2
   combined = df1.combine_first(df2)

pandas 1.2.x

In [1]: combined.dtypes
Out[1]:
A    float64
B    float64
C    float64
dtype: object

pandas 1.3.0

.. ipython:: python

   combined.dtypes


Try operating inplace when setting values with loc and iloc

When setting an entire column using loc or iloc, pandas will try to insert the values into the existing data rather than create an entirely new array.

.. ipython:: python

   df = pd.DataFrame(range(3), columns=["A"], dtype="float64")
   values = df.values
   new = np.array([5, 6, 7], dtype="int64")
   df.loc[[0, 1, 2], "A"] = new

In both the new and old behavior, the data in values is overwritten, but in the old behavior the dtype of df["A"] changed to int64.

pandas 1.2.x

In [1]: df.dtypes
Out[1]:
A    int64
dtype: object
In [2]: np.shares_memory(df["A"].values, new)
Out[2]: False
In [3]: np.shares_memory(df["A"].values, values)
Out[3]: False

In pandas 1.3.0, ``df`` continues to share data with ``values``:

pandas 1.3.0

.. ipython:: python

   df.dtypes
   np.shares_memory(df["A"], new)
   np.shares_memory(df["A"], values)


Never Operate Inplace When Setting frame[keys] = values

When setting multiple columns using ``frame[keys] = values``, new arrays will replace the pre-existing arrays for these keys, which will not be overwritten (:issue:`39510`). As a result, the columns will retain the dtype(s) of ``values``, never casting to the dtypes of the existing arrays.

.. ipython:: python

   df = pd.DataFrame(range(3), columns=["A"], dtype="float64")
   df[["A"]] = 5

In the old behavior, 5 was cast to float64 and inserted into the existing array backing df:

pandas 1.2.x

In [1]: df.dtypes
Out[1]:
A    float64
dtype: object

In the new behavior, we get a new array, and retain an integer-dtyped 5:

pandas 1.3.0

.. ipython:: python

   df.dtypes
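
Since the old array is replaced rather than written into, it is also no longer the backing data for the column. A minimal sketch, assuming ``df.values`` returns a view of the single float64 block as in the example above:

.. code-block:: python

    # The pre-existing float64 array is left untouched; a new integer array
    # replaces it as the backing data for column "A"
    df = pd.DataFrame(range(3), columns=["A"], dtype="float64")
    values = df.values
    df[["A"]] = 5
    assert not np.shares_memory(df["A"].values, values)
    assert (values == [[0.0], [1.0], [2.0]]).all()  # old array is unchanged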


Consistent Casting With Setting Into Boolean Series

Setting non-boolean values into a :class:`Series` with ``dtype=bool`` now consistently casts to ``dtype=object`` (:issue:`38709`)

.. ipython:: python

   orig = pd.Series([True, False])
   ser = orig.copy()
   ser.iloc[1] = np.nan
   ser2 = orig.copy()
   ser2.iloc[1] = 2.0

pandas 1.2.x

In [1]: ser
Out[1]:
0    1.0
1    NaN
dtype: float64

In [2]: ser2
Out[2]:
0    True
1     2.0
dtype: object

pandas 1.3.0

.. ipython:: python

   ser
   ser2


Removed artificial truncation in rolling variance and standard deviation

:meth:`core.window.Rolling.std` and :meth:`core.window.Rolling.var` will no longer artificially truncate results that are less than ~1e-8 and ~1e-15 respectively to zero (:issue:`37051`, :issue:`40448`, :issue:`39872`).

However, floating point artifacts may now exist in the results when rolling over larger values.

.. ipython:: python

   s = pd.Series([7, 5, 5, 5])
   s.rolling(3).var()
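
As a hedged illustration of those artifacts, shifting the same data by a large constant does not change the variance mathematically, but the last window may now come out as a tiny non-zero float rather than an exact 0.0:

.. code-block:: python

    # var([5, 5, 5]) is exactly 0, but after adding a large offset the rolling
    # computation may leave a small floating point residue instead of 0.0
    (s + 1e9).rolling(3).var()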


Increased minimum versions for dependencies

Some minimum supported versions of dependencies were updated. If installed, we now require:

=============== =============== ======== =======
Package         Minimum Version Required Changed
=============== =============== ======== =======
numpy           1.16.5          X
pytz            2017.3          X
python-dateutil 2.7.3           X
bottleneck      1.2.1
numexpr         2.6.8
pytest (dev)    5.0.1
mypy (dev)      0.800                    X
setuptools      38.6.0                   X
=============== =============== ======== =======

For optional libraries the general recommendation is to use the latest version. The following table lists the lowest version per library that is currently being tested throughout the development of pandas. Optional libraries below the lowest tested version may still work, but are not considered supported.

============== =============== =======
Package        Minimum Version Changed
============== =============== =======
beautifulsoup4 4.6.0
fastparquet    0.4.0           X
fsspec         0.7.4
gcsfs          0.6.0
lxml           4.3.0
matplotlib     2.2.3
numba          0.46.0
openpyxl       3.0.0           X
pyarrow        0.15.0
pymysql        0.7.11
pytables       3.5.1
s3fs           0.4.0
scipy          1.2.0
sqlalchemy     1.2.8
tabulate       0.8.7           X
xarray         0.12.0
xlrd           1.2.0
xlsxwriter     1.0.2
xlwt           1.3.0
pandas-gbq     0.12.0
============== =============== =======

See :ref:`install.dependencies` and :ref:`install.optional_dependencies` for more.

Other API changes

Deprecations

Performance improvements

Bug fixes

Categorical

Datetimelike

Timedelta

Timezones

- Bug in different ``tzinfo`` objects representing UTC not being treated as equivalent (:issue:`39216`)
- Bug in ``dateutil.tz.gettz("UTC")`` not being recognized as equivalent to other UTC-representing tzinfos (:issue:`39276`)

Numeric

Conversion

Strings

Interval

Indexing

Missing

MultiIndex

I/O

Period

Plotting

Groupby/resample/rolling

Reshaping

Sparse

ExtensionArray

Other

Contributors