API: honor copy=True when passing dict to DataFrame #38939

Merged: 69 commits, Mar 31, 2021
cbc97f0 ENH: allow non-consolidation in constructors (jbrockmendel, Oct 5, 2020)
a135d96 Merge branch 'master' of https://github.com/pandas-dev/pandas into my… (jbrockmendel, Oct 5, 2020)
5c94129 mypy fixup (jbrockmendel, Oct 5, 2020)
d653c54 ENH: allow non-consolidation in constructors (jbrockmendel, Oct 5, 2020)
e71e319 Merge branch 'myway-init-no-consolidate' of github.com:jbrockmendel/p… (jbrockmendel, Jan 3, 2021)
cafa718 Merge branch 'master' of https://github.com/pandas-dev/pandas into my… (jbrockmendel, Jan 3, 2021)
c706ad6 Merge branch 'master' of https://github.com/pandas-dev/pandas into my… (jbrockmendel, Jan 3, 2021)
3f9195e Merge branch 'master' of https://github.com/pandas-dev/pandas into my… (jbrockmendel, Jan 4, 2021)
1cba671 Merge branch 'master' of https://github.com/pandas-dev/pandas into my… (jbrockmendel, Jan 4, 2021)
396daba BUG: respect copy=False in constructing DataFrame from dict (jbrockmendel, Jan 4, 2021)
11ae1c9 whatsnew (jbrockmendel, Jan 4, 2021)
b505267 clean test (jbrockmendel, Jan 4, 2021)
a1e9b68 Merge branch 'master' of https://github.com/pandas-dev/pandas into my… (jbrockmendel, Jan 4, 2021)
b70d997 fixed xfail (jbrockmendel, Jan 4, 2021)
09213e0 update whatsnew (jbrockmendel, Jan 4, 2021)
e895de0 Merge branch 'master' of https://github.com/pandas-dev/pandas into my… (jbrockmendel, Jan 4, 2021)
31bda58 de-kludge (jbrockmendel, Jan 4, 2021)
37a2c0c remove no-longer-used msg (jbrockmendel, Jan 4, 2021)
b93d7d5 Merge branch 'master' of https://github.com/pandas-dev/pandas into my… (jbrockmendel, Jan 10, 2021)
aa667a6 Merge branch 'master' of https://github.com/pandas-dev/pandas into my… (jbrockmendel, Jan 12, 2021)
185cd99 fix broken test (jbrockmendel, Jan 12, 2021)
701356f Consolidate in tm, update whatsnew (jbrockmendel, Jan 20, 2021)
948ac67 Merge branch 'master' of https://github.com/pandas-dev/pandas into my… (jbrockmendel, Jan 21, 2021)
46f2fcf always copy when data is None (jbrockmendel, Jan 21, 2021)
0bbfec0 Merge branch 'master' into myway-init-no-consolidate (jbrockmendel, Jan 23, 2021)
590c820 update exception message (jbrockmendel, Jan 23, 2021)
17a693c Merge branch 'master' of https://github.com/pandas-dev/pandas into my… (jbrockmendel, Jan 27, 2021)
fb8f32d update exception message (jbrockmendel, Jan 27, 2021)
7835184 typo fixup (jbrockmendel, Jan 28, 2021)
2136289 Merge branch 'master' into myway-init-no-consolidate (jbrockmendel, Feb 2, 2021)
187499c Merge branch 'master' into myway-init-no-consolidate (jbrockmendel, Feb 2, 2021)
6f30beb Merge branch 'master' of https://github.com/pandas-dev/pandas into my… (jbrockmendel, Feb 4, 2021)
e80d57c Merge branch 'master' into myway-init-no-consolidate (jbrockmendel, Feb 11, 2021)
b0a6abd CI: fix broken asv (jbrockmendel, Feb 11, 2021)
bf942ae revert (jbrockmendel, Feb 11, 2021)
95e30a5 Merge branch 'master' into myway-init-no-consolidate (jbrockmendel, Feb 11, 2021)
510f697 Merge branch 'master' into myway-init-no-consolidate (jbrockmendel, Feb 13, 2021)
48e359e Default to copy=True for dict data (jbrockmendel, Feb 13, 2021)
048e826 troubleshoot docbuild (jbrockmendel, Feb 16, 2021)
6a9c9f0 Merge branch 'master' into myway-init-no-consolidate (jbrockmendel, Feb 17, 2021)
a17c728 Merge branch 'master' into myway-init-no-consolidate (jbrockmendel, Feb 18, 2021)
5b3d419 update whatsnew (jbrockmendel, Feb 18, 2021)
f961378 Merge branch 'master' into myway-init-no-consolidate (jbrockmendel, Feb 18, 2021)
fcee44b Merge branch 'master' into myway-init-no-consolidate (jbrockmendel, Feb 27, 2021)
0c60ae8 Merge branch 'master' into myway-init-no-consolidate (jbrockmendel, Feb 27, 2021)
1468e59 Merge branch 'master' into myway-init-no-consolidate (jbrockmendel, Feb 27, 2021)
b6d8b70 Merge branch 'master' into myway-init-no-consolidate (jbrockmendel, Mar 3, 2021)
8b66b11 skip for ArrayManager (jbrockmendel, Mar 3, 2021)
54cacfc Update doc/source/whatsnew/v1.3.0.rst (jbrockmendel, Mar 6, 2021)
5ea7a75 Merge branch 'master' into myway-init-no-consolidate (jbrockmendel, Mar 6, 2021)
e11ea68 requested edits (jbrockmendel, Mar 6, 2021)
65d01c7 Merge branch 'master' into myway-init-no-consolidate (jbrockmendel, Mar 8, 2021)
41c4e7a test test_df_mod_zero_df with and without copy (jbrockmendel, Mar 8, 2021)
7260a72 collect copy-adjusting code in one place (jbrockmendel, Mar 9, 2021)
52344bb Merge branch 'master' into myway-init-no-consolidate (jbrockmendel, Mar 14, 2021)
3ddc3d3 update docstring (jbrockmendel, Mar 14, 2021)
e6bae0f whatsnew, comment (jbrockmendel, Mar 14, 2021)
7cab084 Merge branch 'master' into myway-init-no-consolidate (jbrockmendel, Mar 15, 2021)
e8e3d84 Merge branch 'master' into myway-init-no-consolidate (jbrockmendel, Mar 15, 2021)
5c44953 mypy fixup (jbrockmendel, Mar 15, 2021)
e32f630 Merge branch 'master' into myway-init-no-consolidate (jbrockmendel, Mar 16, 2021)
abd890a Merge branch 'master' into myway-init-no-consolidate (jbrockmendel, Mar 16, 2021)
1b7f7ca Update pandas/core/frame.py (jbrockmendel, Mar 16, 2021)
b326b5f Merge branch 'master' into myway-init-no-consolidate (jbrockmendel, Mar 22, 2021)
b7aed5d Merge branch 'master' into myway-init-no-consolidate (jbrockmendel, Mar 26, 2021)
4d20fe7 Merge branch 'master' into myway-init-no-consolidate (jbrockmendel, Mar 30, 2021)
ad5485a add versionchanged (jbrockmendel, Mar 30, 2021)
6bed6ac Merge branch 'master' into myway-init-no-consolidate (jbrockmendel, Mar 30, 2021)
98b6dff update bc .values has changed to DTA/TDA (jbrockmendel, Mar 30, 2021)
26 changes: 26 additions & 0 deletions doc/source/whatsnew/v1.3.0.rst
@@ -110,6 +110,30 @@ both XPath 1.0 and XSLT 1.0 is available. (:issue:`27554`)

For more, see :ref:`io.xml` in the user guide on IO tools.

.. _whatsnew_130.dataframe_honors_copy_with_dict:

DataFrame constructor honors ``copy=False`` with dict
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

When passing a dictionary to :class:`DataFrame` with ``copy=False``,
a copy will no longer be made (:issue:`32960`)

.. ipython:: python

arr = np.array([1, 2, 3])
df = pd.DataFrame({"A": arr, "B": arr.copy()}, copy=False)
df

``df["A"]`` remains a view on ``arr``:

.. ipython:: python

arr[0] = 0
assert df.iloc[0, 0] == 0

The default behavior when not passing ``copy`` will remain unchanged, i.e.
a copy will be made.

.. _whatsnew_130.enhancements.other:

Other enhancements
@@ -513,6 +537,8 @@ Conversion
- Bug in creating a :class:`DataFrame` from an empty ``np.recarray`` not retaining the original dtypes (:issue:`40121`)
- Bug in :class:`DataFrame` failing to raise ``TypeError`` when constructing from a ``frozenset`` (:issue:`40163`)
- Bug in :class:`Index` construction silently ignoring a passed ``dtype`` when the data cannot be cast to that dtype (:issue:`21311`)
- Bug in :class:`DataFrame` construction with a dictionary containing an arraylike with ``ExtensionDtype`` and ``copy=True`` failing to make a copy (:issue:`38939`)
-

Strings
^^^^^^^
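As a quick check of the behavior described in the whatsnew entry above, the following sketch (assuming pandas 1.3 or later) contrasts ``copy=False`` with the default:

```python
import numpy as np
import pandas as pd

arr = np.array([1, 2, 3])

# copy=False: the frame may hold views on the dict's arrays
# (pandas 1.3+ behavior; earlier versions always copied dict input).
df = pd.DataFrame({"A": arr, "B": arr.copy()}, copy=False)

# Default (copy not passed): dict input is still copied, so the frame
# is insulated from later mutation of the source array.
arr2 = np.array([1, 2, 3])
df2 = pd.DataFrame({"A": arr2})
arr2[0] = 99
print(df2["A"].tolist())  # [1, 2, 3]
```

Note that only the second half (default behavior) is stable across versions; whether ``copy=False`` yields an actual view also depends on the manager backend.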
39 changes: 23 additions & 16 deletions pandas/core/frame.py
@@ -476,8 +476,10 @@ class DataFrame(NDFrame, OpsMixin):
RangeIndex (0, 1, 2, ..., n) if no column labels are provided.
dtype : dtype, default None
Data type to force. Only a single dtype is allowed. If None, infer.
copy : bool, default False
Copy data from inputs. Only affects DataFrame / 2d ndarray input.
copy : bool or None, default None
Copy data from inputs.
For dict data, the default of None behaves like ``copy=True``. For DataFrame
or 2d ndarray input, the default of None behaves like ``copy=False``.
Contributor: can you add a versionchanged 1.3


See Also
--------
@@ -555,8 +557,17 @@ def __init__(
index: Optional[Axes] = None,
columns: Optional[Axes] = None,
dtype: Optional[Dtype] = None,
copy: bool = False,
copy: Optional[bool] = None,
Member: Can you update the docstring above for this change?

Member Author: good idea, will do.

):

if copy is None:
# GH#38939
Contributor: remove this line (as you mention below)

if isinstance(data, dict) or data is None:
# retain pre-GH38939 default behavior
copy = True
else:
copy = False

if data is None:
data = {}
if dtype is not None:
@@ -565,18 +576,13 @@
if isinstance(data, DataFrame):
data = data._mgr

# first check if a Manager is passed without any other arguments
# -> use fastpath (without checking Manager type)
if (
index is None
and columns is None
and dtype is None
and copy is False
and isinstance(data, (BlockManager, ArrayManager))
):
# GH#33357 fastpath
NDFrame.__init__(self, data)
return
if isinstance(data, (BlockManager, ArrayManager)):
# first check if a Manager is passed without any other arguments
# -> use fastpath (without checking Manager type)
if index is None and columns is None and dtype is None and not copy:
# GH#33357 fastpath
NDFrame.__init__(self, data)
return

manager = get_option("mode.data_manager")

@@ -586,7 +592,8 @@
)

elif isinstance(data, dict):
mgr = dict_to_mgr(data, index, columns, dtype=dtype, typ=manager)
# GH#38939 de facto copy defaults to False only in non-dict cases
mgr = dict_to_mgr(data, index, columns, dtype=dtype, copy=copy, typ=manager)
elif isinstance(data, ma.MaskedArray):
import numpy.ma.mrecords as mrecords

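The default-resolution logic at the top of ``__init__`` can be sketched as a standalone helper (``resolve_copy`` is a hypothetical name; the real code inlines this):

```python
from typing import Any, Optional

def resolve_copy(data: Any, copy: Optional[bool]) -> bool:
    # GH#38939: with copy=None, dict (and None) input keeps the old
    # copying default, while DataFrame/ndarray input defaults to no copy.
    if copy is None:
        return isinstance(data, dict) or data is None
    return copy

print(resolve_copy({"a": [1]}, None))   # True
print(resolve_copy(None, None))         # True
print(resolve_copy([[1, 2]], None))     # False
print(resolve_copy({"a": [1]}, False))  # False
```

This keeps the change backward compatible: only callers who explicitly pass ``copy=False`` with a dict see new behavior.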
4 changes: 3 additions & 1 deletion pandas/core/groupby/groupby.py
@@ -1807,7 +1807,9 @@ def describe(self, **kwargs):
result = self.apply(lambda x: x.describe(**kwargs))
if self.axis == 1:
return result.T
return result.unstack()
# FIXME: not being consolidated breaks
# test_describe_with_duplicate_output_column_names
return result._consolidate().unstack()

@final
def resample(self, rule, *args, **kwargs):
34 changes: 30 additions & 4 deletions pandas/core/internals/construction.py
@@ -101,9 +101,11 @@ def arrays_to_mgr(
arr_names,
index,
columns,
*,
dtype: Optional[DtypeObj] = None,
verify_integrity: bool = True,
typ: Optional[str] = None,
consolidate: bool = True,
) -> Manager:
"""
Segregate Series based on type and coerce into matrices.
@@ -131,7 +133,9 @@
axes = [columns, index]

if typ == "block":
return create_block_manager_from_arrays(arrays, arr_names, axes)
return create_block_manager_from_arrays(
arrays, arr_names, axes, consolidate=consolidate
)
elif typ == "array":
if len(columns) != len(arrays):
assert len(arrays) == 0
@@ -181,7 +185,7 @@ def rec_array_to_mgr(
if columns is None:
columns = arr_columns

mgr = arrays_to_mgr(arrays, arr_columns, index, columns, dtype, typ=typ)
mgr = arrays_to_mgr(arrays, arr_columns, index, columns, dtype=dtype, typ=typ)

if copy:
mgr = mgr.copy()
@@ -376,7 +380,13 @@ def maybe_squeeze_dt64tz(dta: ArrayLike) -> ArrayLike:


def dict_to_mgr(
data: Dict, index, columns, dtype: Optional[DtypeObj], typ: str
data: Dict,
index,
columns,
*,
dtype: Optional[DtypeObj] = None,
typ: str = "block",
copy: bool = True,
) -> Manager:
"""
Segregate Series based on type and coerce into matrices.
@@ -414,6 +424,8 @@
val = construct_1d_arraylike_from_scalar(np.nan, len(index), nan_dtype)
arrays.loc[missing] = [val] * missing.sum()

arrays = list(arrays)

else:
keys = list(data.keys())
columns = data_names = Index(keys)
@@ -424,7 +436,21 @@
arrays = [
arr if not is_datetime64tz_dtype(arr) else arr.copy() for arr in arrays
]
return arrays_to_mgr(arrays, data_names, index, columns, dtype=dtype, typ=typ)

if copy:
# arrays_to_mgr (via form_blocks) won't make copies for EAs
# dtype attr check to exclude EADtype-castable strs
arrays = [
x
if not hasattr(x, "dtype") or not isinstance(x.dtype, ExtensionDtype)
else x.copy()
for x in arrays
]
# TODO: can we get rid of the dt64tz special case above?

return arrays_to_mgr(
arrays, data_names, index, columns, dtype=dtype, typ=typ, consolidate=copy
)


def nested_data_to_arrays(
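The extension-array copy pass in ``dict_to_mgr`` can be illustrated in isolation. The helper name below is hypothetical; it mirrors the list comprehension in the diff, which copies only ``ExtensionDtype``-backed arrays because block formation already copies plain ndarrays when consolidating:

```python
import numpy as np
import pandas as pd
from pandas.api.extensions import ExtensionDtype

def copy_extension_arrays(arrays):
    # Copy array-likes backed by an ExtensionDtype; leave everything else
    # (plain ndarrays, lists, strings) untouched.
    return [
        x.copy() if hasattr(x, "dtype") and isinstance(x.dtype, ExtensionDtype) else x
        for x in arrays
    ]

ea = pd.array([1, 2, None], dtype="Int64")   # ExtensionDtype-backed
nd = np.array([1.0, 2.0, 3.0])               # plain ndarray
out = copy_extension_arrays([ea, nd])
print(out[0] is ea, out[1] is nd)  # False True
```

The ``hasattr(x, "dtype")`` guard matters because dict values may be raw lists or strings castable to an extension dtype, which have no ``dtype`` attribute yet.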
65 changes: 51 additions & 14 deletions pandas/core/internals/managers.py
@@ -53,7 +53,10 @@

import pandas.core.algorithms as algos
from pandas.core.arrays.sparse import SparseDtype
from pandas.core.construction import extract_array
from pandas.core.construction import (
ensure_wrapped_if_datetimelike,
extract_array,
)
from pandas.core.indexers import maybe_convert_indices
from pandas.core.indexes.api import (
Float64Index,
@@ -991,6 +994,8 @@ def fast_xs(self, loc: int) -> ArrayLike:
# Any]]"
result = np.empty(n, dtype=dtype) # type: ignore[arg-type]

result = ensure_wrapped_if_datetimelike(result)

for blk in self.blocks:
# Such assignment may incorrectly coerce NaT to None
# result[blk.mgr_locs] = blk._slice((slice(None), loc))
@@ -1715,7 +1720,7 @@ def set_values(self, values: ArrayLike):


def create_block_manager_from_blocks(
blocks: List[Block], axes: List[Index]
blocks: List[Block], axes: List[Index], consolidate: bool = True
) -> BlockManager:
try:
mgr = BlockManager(blocks, axes)
@@ -1725,7 +1730,8 @@
tot_items = sum(arr.shape[0] for arr in arrays)
raise construction_error(tot_items, arrays[0].shape[1:], axes, err)

mgr._consolidate_inplace()
if consolidate:
mgr._consolidate_inplace()
return mgr


@@ -1735,7 +1741,10 @@ def _extract_array(obj):


def create_block_manager_from_arrays(
arrays, names: Index, axes: List[Index]
arrays,
names: Index,
axes: List[Index],
consolidate: bool = True,
) -> BlockManager:
assert isinstance(names, Index)
assert isinstance(axes, list)
Expand All @@ -1744,12 +1753,13 @@ def create_block_manager_from_arrays(
arrays = [_extract_array(x) for x in arrays]

try:
blocks = _form_blocks(arrays, names, axes)
blocks = _form_blocks(arrays, names, axes, consolidate)
mgr = BlockManager(blocks, axes)
mgr._consolidate_inplace()
return mgr
except ValueError as e:
raise construction_error(len(arrays), arrays[0].shape, axes, e)
if consolidate:
mgr._consolidate_inplace()
return mgr


def construction_error(
Expand Down Expand Up @@ -1782,7 +1792,7 @@ def construction_error(


def _form_blocks(
arrays: List[ArrayLike], names: Index, axes: List[Index]
arrays: List[ArrayLike], names: Index, axes: List[Index], consolidate: bool
) -> List[Block]:
# put "leftover" items in float bucket, where else?
# generalize?
Expand All @@ -1808,15 +1818,21 @@ def _form_blocks(

blocks: List[Block] = []
if len(items_dict["NumericBlock"]):
numeric_blocks = _multi_blockify(items_dict["NumericBlock"])
numeric_blocks = _multi_blockify(
items_dict["NumericBlock"], consolidate=consolidate
)
blocks.extend(numeric_blocks)

if len(items_dict["TimeDeltaBlock"]):
timedelta_blocks = _multi_blockify(items_dict["TimeDeltaBlock"])
timedelta_blocks = _multi_blockify(
items_dict["TimeDeltaBlock"], consolidate=consolidate
)
blocks.extend(timedelta_blocks)

if len(items_dict["DatetimeBlock"]):
datetime_blocks = _simple_blockify(items_dict["DatetimeBlock"], DT64NS_DTYPE)
datetime_blocks = _simple_blockify(
items_dict["DatetimeBlock"], DT64NS_DTYPE, consolidate=consolidate
)
blocks.extend(datetime_blocks)

if len(items_dict["DatetimeTZBlock"]):
Expand All @@ -1827,7 +1843,9 @@ def _form_blocks(
blocks.extend(dttz_blocks)

if len(items_dict["ObjectBlock"]) > 0:
object_blocks = _simple_blockify(items_dict["ObjectBlock"], np.object_)
object_blocks = _simple_blockify(
items_dict["ObjectBlock"], np.object_, consolidate=consolidate
)
blocks.extend(object_blocks)

if len(items_dict["CategoricalBlock"]) > 0:
Expand Down Expand Up @@ -1866,11 +1884,14 @@ def _form_blocks(
return blocks


def _simple_blockify(tuples, dtype) -> List[Block]:
def _simple_blockify(tuples, dtype, consolidate: bool) -> List[Block]:
"""
return a single array of a block that has a single dtype; if dtype is
not None, coerce to this dtype
"""
if not consolidate:
return _tuples_to_blocks_no_consolidate(tuples, dtype=dtype)

values, placement = _stack_arrays(tuples, dtype)

# TODO: CHECK DTYPE?
Expand All @@ -1881,8 +1902,12 @@ def _simple_blockify(tuples, dtype) -> List[Block]:
return [block]


def _multi_blockify(tuples, dtype: Optional[Dtype] = None):
def _multi_blockify(tuples, dtype: Optional[DtypeObj] = None, consolidate: bool = True):
""" return an array of blocks that potentially have different dtypes """

if not consolidate:
return _tuples_to_blocks_no_consolidate(tuples, dtype=dtype)

# group by dtype
grouper = itertools.groupby(tuples, lambda x: x[1].dtype)

Expand All @@ -1902,6 +1927,18 @@ def _multi_blockify(tuples, dtype: Optional[Dtype] = None):
return new_blocks


def _tuples_to_blocks_no_consolidate(tuples, dtype: Optional[DtypeObj]) -> List[Block]:
# tuples produced within _form_blocks are of the form (placement, whatever, array)
if dtype is not None:
return [
new_block(
np.atleast_2d(x[1].astype(dtype, copy=False)), placement=x[0], ndim=2
)
for x in tuples
]
return [new_block(np.atleast_2d(x[1]), placement=x[0], ndim=2) for x in tuples]


def _stack_arrays(tuples, dtype: np.dtype):

placement, arrays = zip(*tuples)
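The effect of the new ``consolidate`` plumbing can be observed through the internal block count. This is internal API shown purely for illustration (assuming the default BlockManager backend on pandas 1.3+; exact counts are an implementation detail):

```python
import numpy as np
import pandas as pd

arr = np.arange(4, dtype=np.int64)
df_copy = pd.DataFrame({"a": arr, "b": arr.copy()})              # copied and consolidated
df_view = pd.DataFrame({"a": arr, "b": arr.copy()}, copy=False)  # neither

# A consolidated frame packs same-dtype columns into one 2D block;
# copy=False skips consolidation, keeping one block per input array.
print(df_copy._mgr.nblocks, df_view._mgr.nblocks)

# The user-visible data is identical either way.
print(df_copy.equals(df_view))  # True
```

This is why ``arrays_to_mgr`` threads ``consolidate=copy`` through to ``create_block_manager_from_arrays``: not copying and not consolidating go together, since consolidation itself allocates a new combined array.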
10 changes: 9 additions & 1 deletion pandas/tests/arithmetic/test_numeric.py
@@ -538,7 +538,6 @@ def test_df_div_zero_series_does_not_commute(self):
def test_df_mod_zero_df(self, using_array_manager):
# GH#3590, modulo as ints
df = pd.DataFrame({"first": [3, 4, 5, 8], "second": [0, 0, 0, 3]})

# this is technically wrong, as the integer portion is coerced to float
first = Series([0, 0, 0, 0])
if not using_array_manager:
@@ -551,6 +550,15 @@ def test_df_mod_zero_df(self, using_array_manager):
result = df % df
tm.assert_frame_equal(result, expected)

# GH#38939 If we dont pass copy=False, df is consolidated and
# result["first"] is float64 instead of int64
df = pd.DataFrame({"first": [3, 4, 5, 8], "second": [0, 0, 0, 3]}, copy=False)
first = Series([0, 0, 0, 0], dtype="int64")
second = Series([np.nan, np.nan, np.nan, 0])
expected = pd.DataFrame({"first": first, "second": second})
result = df % df
tm.assert_frame_equal(result, expected)

def test_df_mod_zero_array(self):
# GH#3590, modulo as ints
df = pd.DataFrame({"first": [3, 4, 5, 8], "second": [0, 0, 0, 3]})
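The dtype effect the new test asserts can be reproduced directly (assuming pandas 1.3+ with the BlockManager backend):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"first": [3, 4, 5, 8], "second": [0, 0, 0, 3]}, copy=False)
result = df % df

# "first" has no zero divisors, so with unconsolidated blocks it can stay
# integer; "second" produces NaN for the x % 0 entries and becomes float.
print(result["first"].tolist())   # [0, 0, 0, 0]
print(result["second"].tolist())
```

With the default (copying, consolidating) constructor, both int64 columns share one block, so the NaN in ``second`` forces the whole block, including ``first``, to float64.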
2 changes: 1 addition & 1 deletion pandas/tests/extension/decimal/array.py
@@ -150,7 +150,7 @@ def take(self, indexer, allow_fill=False, fill_value=None):
return self._from_sequence(result)

def copy(self):
return type(self)(self._data.copy())
return type(self)(self._data.copy(), dtype=self.dtype)

def astype(self, dtype, copy=True):
if is_dtype_equal(dtype, self._dtype):
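The one-line fix to the test extension array above matters because ``type(self)(...)`` without ``dtype`` rebuilds the array with a default dtype, silently dropping any per-instance dtype state. A minimal sketch of the pattern with hypothetical stand-in classes:

```python
class Dtype:
    # Stand-in for an ExtensionDtype carrying per-instance state
    # (e.g. a decimal context).
    def __init__(self, context):
        self.context = context

class MiniArray:
    # Stand-in for the DecimalArray pattern: copy() must forward the
    # dtype, or the copy silently reverts to the default context.
    def __init__(self, data, dtype=None):
        self._data = list(data)
        self.dtype = dtype if dtype is not None else Dtype(context="default")

    def copy(self):
        return type(self)(self._data.copy(), dtype=self.dtype)

arr = MiniArray([1, 2], dtype=Dtype(context="custom"))
print(arr.copy().dtype.context)  # custom
```

The PR exercises ``copy()`` on extension arrays far more often (every ``copy=True`` dict construction), which is presumably why this latent test-fixture bug surfaced here.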