Related to #53529, about index labels still being shared after an indexing operation.
Currently, a shallow copy of a DataFrame/Series only creates a new DataFrame/Series object, while pointing to the same data and index objects under the hood. Quoting the docstring:
When deep=False, a new object will be created without copying
the calling object's data or index (only references to the data
and index are copied).
and one of the examples in the docstring:
>>> s = pd.Series([1, 2], index=["a", "b"])
>>> deep = s.copy()
>>> shallow = s.copy(deep=False)
# Shallow copy shares data and index with original.
>>> s is shallow
False
>>> s.values is shallow.values and s.index is shallow.index
True
So apart from being the long-standing behaviour, it is also clearly documented that the Index objects (for the row index and the columns) are identical in the returned shallow copy.
However, under Copy-on-Write, we have updated the behaviour of a shallow copy so that it no longer propagates mutations to the values of the Series/DataFrame. It still returns a new Series/DataFrame that shares memory with the original, but that memory is protected by reference tracking to ensure we copy later on if needed.
So the question is, for shallow copies like this, when CoW is enabled, should we also protect from sharing mutable state in the index/columns attributes of the Series/DataFrame?
Current behaviour:
>>> s = pd.Series([1, 2], index=["a", "b"])
>>> shallow = s.copy(deep=False)
>>> shallow.index.name = "some_new_name"
>>> s
some_new_name    # <-- modifying shallow copy also modified the parent series
a    1
b    2
dtype: int64
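To check whether a given pandas installation still shares Index state between two objects, a small probe along these lines could be used. This is only a sketch: `index_state_is_shared` and its sentinel name are hypothetical, not part of pandas, and the result for a shallow copy depends on the pandas version and on whether CoW is enabled.

```python
import pandas as pd


def index_state_is_shared(parent, child):
    """Return True if renaming child's index also renames parent's index."""
    probe = "__probe_name__"  # hypothetical sentinel, assumed unused as a real name
    saved = child.index.name
    child.index.name = probe
    shared = parent.index.name == probe
    child.index.name = saved  # undo the probe so both objects are left unchanged
    return shared


s = pd.Series([1, 2], index=["a", "b"])
shallow = s.copy(deep=False)
# Version-dependent: True means the mutable Index state is still shared.
print(index_state_is_shared(s, shallow))
```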
My proposal would be to extend the notion of "changing one object never updates another object" under CoW that we currently apply to the data values, to also apply this to the Index mutable state.
This can easily be achieved by also shallow copying the Index objects when shallow copying a DataFrame/Series.
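As a rough user-level illustration of that idea (not the actual internal change), one could shallow copy the object and then hand the copy its own Index objects. The helper name `shallow_copy_with_own_index` is hypothetical.

```python
import pandas as pd


def shallow_copy_with_own_index(obj):
    """Shallow copy obj, then give the copy its own Index objects (hypothetical helper)."""
    result = obj.copy(deep=False)
    # A view is a new Index object that still shares the underlying label data.
    result.index = obj.index.view()
    if isinstance(obj, pd.DataFrame):
        result.columns = obj.columns.view()
    return result


s = pd.Series([1, 2], index=["a", "b"])
shallow = shallow_copy_with_own_index(s)
shallow.index.name = "some_new_name"
print(s.index.name)  # the parent's index name stays None
```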
How to shallow copy an Index?
The Index class also has a copy method with a deep argument, so one can do idx.copy(deep=False) to get a new Index object sharing the same data (this option is actually the default for Index). There is also an idx.view() method, which essentially does the same but additionally carries over the _id attribute of the Index, ensuring that equality testing still passes the fast-path check for identity (idx1.is_(idx2) is True).
We can probably use a view instead of an actual shallow copy with copy(deep=False), which will ensure we keep a faster path in several places where we check for Index equality between objects.
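The difference between the two options can be seen with the public `Index.is_` method. This is a minimal sketch using only public API; the variable names are illustrative.

```python
import pandas as pd

idx = pd.Index([0, 1, 2])
shallow = idx.copy(deep=False)  # new Index object with a fresh identity
view = idx.view()               # new Index object that keeps idx's identity

print(shallow is idx, view is idx)      # both are distinct Python objects
print(idx.is_(shallow), idx.is_(view))  # only the view passes the is_ check

view.name = "renamed"
print(idx.name)  # renaming the view leaves the original's name untouched
```

So a view gives each Series/DataFrame its own mutable Index state while keeping the identity-based fast path available.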
Consequence beyond the copy method?
It seems we actually have quite a few places that currently return identical index objects shared by multiple objects, beyond just the typical shallow copy case. Those could / should all change with the proposal above. Some examples:
Any method that currently returns a new Series/DataFrame that preserves the shape. Those that return a shallow copy with CoW will automatically get updated by changing the behaviour of copy(deep=False), such as:
>>> s = pd.Series([1, 2])
>>> s2 = s.infer_objects()
>>> s2.index is s.index  # current behaviour, would become False
True
But any method that returns new data (and currently doesn't care about CoW) while preserving the shape typically reuses the index object as well, and should probably be updated too?
>>> s = pd.Series([1, 2])
>>> s2 = s.diff()
>>> s2.index is s.index  # current behaviour, would become False
True
Binary / unary operations preserve the index in the result:
>>> s = pd.Series([1, 2])
>>> s2 = s + 1
>>> s2.index is s.index  # current behaviour, would become False
True
Reindexing with an index would also end up with not exactly that index object, but a shallow copy of it:
>>> s = pd.Series([1, 2])
>>> idx = pd.Index([0, 1, 2])
>>> s2 = s.reindex(idx)
>>> s2.index is idx  # current behaviour, would become False
True
Maybe more controversial, or more surprising for users if we did this: what about constructors where you pass an Index object?
>>> idx = pd.Index([0, 1, 2])
>>> s = pd.Series(["a", "b", "c"], index=idx)
>>> s.index is idx  # should this be False as well?
True
Here it might be more logical to use exactly that object (and not create a shallow copy of it)? (but that will mean propagating changes if you passed the index object of an existing Series/DataFrame)
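If the constructor keeps using the passed object as-is, a caller who wants to avoid propagating changes could pass a view explicitly. This is only a sketch of that user-side workaround, not a proposed API.

```python
import pandas as pd

existing = pd.Series([1, 2, 3])
# Pass a view rather than existing.index itself, so the new Series
# cannot share mutable Index state with the existing one.
s = pd.Series(["a", "b", "c"], index=existing.index.view())
s.index.name = "labels"
print(existing.index.name)  # the existing Series keeps its unnamed index
```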
jorisvandenbossche changed the title from "API: Shallow copy of DataFrame (copy(deep=False)) to share the index or also shallow copy the index?" to "API: with CoW, should every new Series/DataFrame object also have its own new Index objects?" on Jun 19, 2023