Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

API: with CoW, should every new Series/DataFrame object also have its own new Index objects? #53721

Open
jorisvandenbossche opened this issue Jun 19, 2023 · 1 comment · May be fixed by #53699
Labels
API Design Copy / view semantics Index Related to the Index class or subclasses
Milestone

Comments

@jorisvandenbossche
Copy link
Member

Related to #53529, about index labels still being shared after an indexing operation.

Currently, a shallow copy of a DataFrame/Series only creates a new DataFrame/Series object, while pointing to the same data and index objects under the hood. Quoting the docstring:

When deep=False, a new object will be created without copying
the calling object's data or index (only references to the data
and index are copied).

and one of the examples in the docstring:

>>> s = pd.Series([1, 2], index=["a", "b"])
>>> deep = s.copy()
>>> shallow = s.copy(deep=False)

# Shallow copy shares data and index with original.
>>> s is shallow
False
>>> s.values is shallow.values and s.index is shallow.index
True

So apart from being the long-standing behaviour, it's also clearly documentation that the Index objects (for row index and columns) are identical in the returned shallow copy.

However, under Copy-on-Write, we now updated the behaviour of a shallow copy to no longer propagate mutations to the values of the Series/DataFrame, but still returning a new Series/DataFrame with the shared memory, but protected with a reference track to ensure we will copy later on if needed.
So the question is, for shallow copies like this, when CoW is enabled, should we also protect from sharing mutable state in the index/columns attributes of the Series/DataFrame?

Current behaviour:

>>> s = pd.Series([1, 2], index=["a", "b"])
>>> shallow = s.copy(deep=False)
>>> shallow.index.name = "some_new_name"
>>> s
some_new_name      # <-- modifying shallow copy also modified the parent series
a    1
b    2
dtype: int64

My proposal would be to extend the notion of "changing one object never updates another object" under CoW that we currently apply to the data values, to also apply this to the Index mutable state.

This can easily be achieved by also shallow copying the Index objects when shallow copying a DataFrame/Series.

How to shallow copy an Index?
While the Index class also has a copy method with a deep argument, and thus one can do idx.copy(deep=False) to get a new Index object sharing the same data (and this option is actually the default for Index), there is also a idx.view() method, which essentially does the same, but also sets the _id attribute of the Index, ensuring equality testing still pass the fast path check for identity (idx1.is_(idx2) is True).
We can probably use a view instead of an actual shallow copy with copy(deep=False), which will ensure we keep a faster path in several places where we check for Index equality between objects.

Consequence beyond the copy method?
It seems we actually have quite some places that are currently returning identical index objects shared by multiple objects, beyond just the typical shallow copy case. Those could / should all change with the proposal above. Some examples:

  • Any method that currently returns a new Series/DataFrame that preserves the shape. Those that return a shallow copy with CoW will automatically get updated by changing the behaviour of copy(deep=False), such as:

    >>> s = pd.Series([1, 2])
    >>> s2 = s.infer_objects()
    >>> s2.index is s.index    # current behaviour, would become False
    True

    but also any object that returns new data (and currently doesn't care about CoW) but preserves the shape currently typically reuses the index object and should probably be updated?

    >>> s = pd.Series([1, 2])
    >>> s2 = s.diff()
    >>> s2.index is s.index    # current behaviour, would become False
    True
  • Binary / unary operations preserve the index in the result:

    >>> s = pd.Series([1, 2])
    >>> s2= s + 1
    >>> s2.index is s.index   # current behaviour, would become False
    True
  • Reindexing with an index also would end up with not exactly that index object but with a shallow copy of it:

    >>> s = pd.Series([1, 2])
    >>> idx = pd.Index([0, 1, 2])
    >>> s2 = s.reindex(idx)
    >>> s2.index is idx   # current behaviour, would become False
    True
  • Maybe more controversially or more surprising for users if we would do this, but what about constructors where you pass an Index object?

    >>> idx = pd.Index([0, 1, 2])
    >>> s = pd.Series(["a", "b", "c"], index=idx)
    >>> s.index is idx    # should this be False as well?
    True

    Here it might be more logical to use exactly that object (and not create a shallow copy of it)? (but that will mean propagating changes if you passed the index object of an existing Series/DataFrame)

@jorisvandenbossche jorisvandenbossche added API Design Index Related to the Index class or subclasses Copy / view semantics labels Jun 19, 2023
@jorisvandenbossche jorisvandenbossche changed the title API: Shallow copy of DataFrame (copy(deep=False)) to share the index or also shallow copy the index? API: with CoW, should every new Series/DataFrame object also have its own new Index objects? Jun 19, 2023
@phofl
Copy link
Member

phofl commented Jun 19, 2023

I think this is a consequence out of the CoW rules, so probably have to do this. The last point should also make a shallow copy imo

@jorisvandenbossche jorisvandenbossche added this to the 2.2 milestone Nov 17, 2023
@lithomas1 lithomas1 modified the milestones: 2.2, 2.2.1 Jan 20, 2024
@lithomas1 lithomas1 modified the milestones: 2.2.1, 2.2.2 Feb 23, 2024
@lithomas1 lithomas1 modified the milestones: 2.2.2, 2.2.3 Apr 11, 2024
@jorisvandenbossche jorisvandenbossche modified the milestones: 2.2.3, 2.3 Sep 15, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
API Design Copy / view semantics Index Related to the Index class or subclasses
Projects
None yet
3 participants