API: with CoW, should every new Series/DataFrame object also have its own new Index objects? #53721

jorisvandenbossche · 2023-06-19T11:00:40Z

Related to #53529, about index labels still being shared after an indexing operation.

Currently, a shallow copy of a DataFrame/Series only creates a new DataFrame/Series object, while pointing to the same data and index objects under the hood. Quoting the docstring:

When deep=False, a new object will be created without copying
the calling object's data or index (only references to the data
and index are copied).

and one of the examples in the docstring:

>>> s = pd.Series([1, 2], index=["a", "b"])
>>> deep = s.copy()
>>> shallow = s.copy(deep=False)

# Shallow copy shares data and index with original.
>>> s is shallow
False
>>> s.values is shallow.values and s.index is shallow.index
True

So apart from being the long-standing behaviour, it's also clearly documentation that the Index objects (for row index and columns) are identical in the returned shallow copy.

However, under Copy-on-Write, we now updated the behaviour of a shallow copy to no longer propagate mutations to the values of the Series/DataFrame, but still returning a new Series/DataFrame with the shared memory, but protected with a reference track to ensure we will copy later on if needed.
So the question is, for shallow copies like this, when CoW is enabled, should we also protect from sharing mutable state in the index/columns attributes of the Series/DataFrame?

Current behaviour:

>>> s = pd.Series([1, 2], index=["a", "b"])
>>> shallow = s.copy(deep=False)
>>> shallow.index.name = "some_new_name"
>>> s
some_new_name      # <-- modifying shallow copy also modified the parent series
a    1
b    2
dtype: int64

My proposal would be to extend the notion of "changing one object never updates another object" under CoW that we currently apply to the data values, to also apply this to the Index mutable state.

This can easily be achieved by also shallow copying the Index objects when shallow copying a DataFrame/Series.

How to shallow copy an Index?
While the Index class also has a copy method with a deep argument, and thus one can do idx.copy(deep=False) to get a new Index object sharing the same data (and this option is actually the default for Index), there is also a idx.view() method, which essentially does the same, but also sets the _id attribute of the Index, ensuring equality testing still pass the fast path check for identity (idx1.is_(idx2) is True).
We can probably use a view instead of an actual shallow copy with copy(deep=False), which will ensure we keep a faster path in several places where we check for Index equality between objects.

Consequence beyond the copy method?
It seems we actually have quite some places that are currently returning identical index objects shared by multiple objects, beyond just the typical shallow copy case. Those could / should all change with the proposal above. Some examples:

Any method that currently returns a new Series/DataFrame that preserves the shape. Those that return a shallow copy with CoW will automatically get updated by changing the behaviour of copy(deep=False), such as:
```
>>> s = pd.Series([1, 2])
>>> s2 = s.infer_objects()
>>> s2.index is s.index    # current behaviour, would become False
True
```
but also any object that returns new data (and currently doesn't care about CoW) but preserves the shape currently typically reuses the index object and should probably be updated?
```
>>> s = pd.Series([1, 2])
>>> s2 = s.diff()
>>> s2.index is s.index    # current behaviour, would become False
True
```

Binary / unary operations preserve the index in the result:

>>> s = pd.Series([1, 2])
>>> s2= s + 1
>>> s2.index is s.index   # current behaviour, would become False
True

Reindexing with an index also would end up with not exactly that index object but with a shallow copy of it:

>>> s = pd.Series([1, 2])
>>> idx = pd.Index([0, 1, 2])
>>> s2 = s.reindex(idx)
>>> s2.index is idx   # current behaviour, would become False
True

Maybe more controversially or more surprising for users if we would do this, but what about constructors where you pass an Index object?
```
>>> idx = pd.Index([0, 1, 2])
>>> s = pd.Series(["a", "b", "c"], index=idx)
>>> s.index is idx    # should this be False as well?
True
```
Here it might be more logical to use exactly that object (and not create a shallow copy of it)? (but that will mean propagating changes if you passed the index object of an existing Series/DataFrame)

The text was updated successfully, but these errors were encountered:

phofl · 2023-06-19T20:05:27Z

I think this is a consequence out of the CoW rules, so probably have to do this. The last point should also make a shallow copy imo

jorisvandenbossche added API Design Index Related to the Index class or subclasses Copy / view semantics labels Jun 19, 2023

This was referenced Jun 19, 2023

API / CoW: shallow copy of DataFrame/Series (.copy(deep=False)) also returns shallow copy of the index/columns #53722

Merged

Copy-on-Write (CoW): index labels are still shared mutable state #53529

Open

jorisvandenbossche changed the title ~~API: Shallow copy of DataFrame (copy(deep=False)) to share the index or also shallow copy the index?~~ API: with CoW, should every new Series/DataFrame object also have its own new Index objects? Jun 19, 2023

jorisvandenbossche linked a pull request Jun 19, 2023 that will close this issue

API / COW: ensure every new Series/DataFrame also has new (shallow copy) index #53699

Draft

5 tasks

flying-sheep mentioned this issue Aug 18, 2023

Fix/Circumvent upcoming breakage in pandas 2.1.0 scverse/anndata#1098

Closed

3 tasks

jorisvandenbossche added this to the 2.2 milestone Nov 17, 2023

lithomas1 modified the milestones: 2.2, 2.2.1 Jan 20, 2024

lithomas1 modified the milestones: 2.2.1, 2.2.2 Feb 23, 2024

lithomas1 modified the milestones: 2.2.2, 2.2.3 Apr 11, 2024

jorisvandenbossche modified the milestones: 2.2.3, 2.3 Sep 15, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

API: with CoW, should every new Series/DataFrame object also have its own new Index objects? #53721

API: with CoW, should every new Series/DataFrame object also have its own new Index objects? #53721

jorisvandenbossche commented Jun 19, 2023

phofl commented Jun 19, 2023

API: with CoW, should every new Series/DataFrame object also have its own new Index objects? #53721

API: with CoW, should every new Series/DataFrame object also have its own new Index objects? #53721

Comments

jorisvandenbossche commented Jun 19, 2023

phofl commented Jun 19, 2023