Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

API: honor copy=True when passing dict to DataFrame #38939

Merged
merged 69 commits into from Mar 31, 2021

Conversation

jbrockmendel
Copy link
Member

xref #34872 cc @TomAugspurger used the test_dict_nocopy you wrote but it ended up pretty mangled

doc/source/whatsnew/v1.3.0.rst Outdated Show resolved Hide resolved
@@ -1297,7 +1297,7 @@ def test_strings_to_numbers_comparisons_raises(self, compare_operators_no_eq_ne)
f(df, 0)

def test_comparison_protected_from_errstate(self):
missing_df = tm.makeDataFrame()
missing_df = tm.makeDataFrame()._consolidate()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you are already doing this on the creation (I understand by find this fragile)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you are already doing this on the creation

i dont think so

(I understand by find this fragile)

I agree. Silver lining: finding the existing fragility.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

actuallly you are doing this on creation, maybe you recently added. prefer NOT to do this in the tests proper (in pandas/testing is ok)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated to do the consolidation in tm.makeDataFrame

@@ -765,7 +765,7 @@ def test_operators_timedelta64(self):
def test_std_timedelta64_skipna_false(self):
# GH#37392
tdi = pd.timedelta_range("1 Day", periods=10)
df = DataFrame({"A": tdi, "B": tdi})
df = DataFrame({"A": tdi, "B": tdi}, copy=True)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why would copy matter here at all? isn't tdi supposed to be immutable? we need to always copy no?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why would copy matter here at all?

there are a couple of inter-twined issues here. the copy is necesary because we need df["A"] and df["B"] to not share data. We could do df = DataFrame({"A": tdi.values, "B": tdi.values.copy()})

isn't tdi supposed to be immutable?

yes, but thats kinda orthogonal (maybe 45°). e.g. in master

tdi = pd.timedelta_range("1 Day", periods=3)
df = pd.DataFrame(tdi)
df.iloc[0, 0] = pd.NaT
>>> tdi
TimedeltaIndex([NaT, '2 days', '3 days'], dtype='timedelta64[ns]', freq='D')

we need to always copy no?

i think we do something like this for dt64tz motivated by a similar immutability concern. ill see if that can be adapted

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this copy=True can be removed now?

result[rl] = blk.iget((i, loc))
out = blk.iget((i, loc))
if is_dtype_equal(blk.dtype, dtype) and dtype == "m8[ns]":
# FIXME: kludge for NaT -> tdnat
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i know you mention, but this is very kludgy.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yah, im still looking for an alternative here

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

de-kludged. i need to double check, but think this may solve a separate silent bug. if so ill make a separate PR for that

@jreback
Copy link
Contributor

jreback commented Jan 4, 2021

I am leary of this consolidation (or rather non-consolidation). it is going to be quite fragile. why is consolidation tied to the dict copy at all?

@jreback jreback added the API - Consistency Internal Consistency of API/Behavior label Jan 4, 2021
@jbrockmendel
Copy link
Member Author

I am leary of this consolidation (or rather non-consolidation). it is going to be quite fragile. why is consolidation tied to the dict copy at all?

@TomAugspurger can you weigh in on why this is worthwhile

@TomAugspurger
Copy link
Contributor

I think just having the copy keyword apply to all data inputs, rather than just ndarrays.

doc/source/whatsnew/v1.3.0.rst Outdated Show resolved Hide resolved
doc/source/whatsnew/v1.3.0.rst Show resolved Hide resolved
@@ -1297,7 +1297,7 @@ def test_strings_to_numbers_comparisons_raises(self, compare_operators_no_eq_ne)
f(df, 0)

def test_comparison_protected_from_errstate(self):
missing_df = tm.makeDataFrame()
missing_df = tm.makeDataFrame()._consolidate()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

actuallly you are doing this on creation, maybe you recently added. prefer NOT to do this in the tests proper (in pandas/testing is ok)

pandas/tests/frame/test_constructors.py Show resolved Hide resolved
pandas/core/frame.py Outdated Show resolved Hide resolved
df.iloc[0, 1] = 0
if not copy:
# setitem on non-EA values preserves views
assert sum(x is c for x in df._mgr.arrays) == 1
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, yes, that clarifies

@jorisvandenbossche
Copy link
Member

jorisvandenbossche commented Mar 26, 2021

@jbrockmendel can you do a final merge of master here to be sure it's all passing with latest master?

@jreback all good for you? (your "approve" is from before the behaviour changed to preserve the copy=False behaviour for dicts)

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks good. can you merge master, ping on greenish

copy : bool or None, default None
Copy data from inputs.
For dict data, the default of None behaves like ``copy=True``. For DataFrame
or 2d ndarray input, the default of None behaves like ``copy=False``.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you add a versionchanged 1.3

):

if copy is None:
# GH#38939
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remove this line (as you mention below)

@jreback
Copy link
Contributor

jreback commented Mar 30, 2021

hmm, this looks to be failing a bunch of things

@jreback jreback merged commit 66e6ba3 into pandas-dev:master Mar 31, 2021
@jreback
Copy link
Contributor

jreback commented Mar 31, 2021

thanks @jbrockmendel this has been a long road, and @TomAugspurger for the original version.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
API - Consistency Internal Consistency of API/Behavior Copy / view semantics
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Dataframe change alters original array used in creation
5 participants