## Requirements

In [14]:
import numpy as np
import pandas as pd

## Shared or not shared?

When creating a new dataframe out of an existing dataframe, is column data shared or not?

The answer to that question seems to depend on the version of `pandas`.

## Pandas 2.x

To see what happens when you modify a dataframe from which another is derived, you can start by creating a large dataframe.

In [15]:
data = pd.DataFrame({
    'column1': np.random.uniform(-1_000.0, 1_000.0, size=100_000),
    'column2': np.random.uniform(-1_000.0, 1_000.0, size=100_000),
    'column3': np.random.uniform(-1_000, 1_000, size=100_000).astype(np.int64),
    'column4': np.random.uniform(-1_000, 1_000, size=100_000).astype(np.int64),
})

In [16]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 4 columns):
 #   Column   Non-Null Count   Dtype  
---  ------   --------------   -----  
 0   column1  100000 non-null  float64
 1   column2  100000 non-null  float64
 2   column3  100000 non-null  int64  
 3   column4  100000 non-null  int64  
dtypes: float64(2), int64(2)
memory usage: 3.1 MB


The data has the statistical properties you would expect from a uniform distribution.

In [17]:
data.describe()

Unnamed: 0,column1,column2,column3,column4
count,100000.0,100000.0,100000.0,100000.0
mean,4.225888,1.284407,-2.83082,0.14392
std,577.75984,577.063194,577.528168,576.729627
min,-999.977694,-999.908941,-999.0,-999.0
25%,-495.491551,-497.783061,-506.0,-499.0
50%,4.74851,2.10387,-4.0,0.0
75%,505.160586,500.502871,497.0,498.0
max,999.99294,999.947904,999.0,999.0


When you create a new dataframe that contains the first half of the rows of the original, the question is whether or not the two dataframes share data.

In [18]:
data2 = data.iloc[:50_000]

In [19]:
data2.describe()

Unnamed: 0,column1,column2,column3,column4
count,50000.0,50000.0,50000.0,50000.0
mean,4.828118,0.964712,-2.11148,-3.6563
std,577.936131,577.971021,577.96331,576.181292
min,-999.944283,-999.8976,-999.0,-999.0
25%,-493.817751,-500.003775,-504.0,-503.0
50%,4.462345,3.577384,-6.0,-2.0
75%,503.114598,500.88686,501.0,494.0
max,999.964569,999.864196,999.0,999.0


To ascertain this, you can modify one of the dataframes, say `data`, in place and check whether the other dataframe is affected.  For instance, you can clip the values of the first column to the interval $[-500, 500]$.

In [20]:
data['column1'].clip(-500.0, 500.0, inplace=True)

The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.

For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.


  data['column1'].clip(-500.0, 500.0, inplace=True)


In [21]:
data2.describe()

Unnamed: 0,column1,column2,column3,column4
count,50000.0,50000.0,50000.0,50000.0
mean,2.605923,0.964712,-2.11148,-3.6563
std,408.136559,577.971021,577.96331,576.181292
min,-500.0,-999.8976,-999.0,-999.0
25%,-493.817751,-500.003775,-504.0,-503.0
50%,4.462345,3.577384,-6.0,-2.0
75%,500.0,500.88686,501.0,494.0
max,500.0,999.864196,999.0,999.0


Indeed, the first column of `data2` was affected: its minimum and maximum values are no longer approximately -1000 and 1000 as before, but rather -500 and 500 respectively.

This may surprise you, and is of course a potential source of subtle bugs.  It is a trade-off.  On the one hand, sharing data between related dataframes saves memory and improves performance, since copying data is expensive in terms of execution time.  On the other hand, it means you may observe "spooky action at a distance", since operations on one dataframe may affect a related dataframe.
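
The memory sharing behind this effect can also be inspected without modifying anything. The sketch below is an illustration under assumptions rather than part of the original experiment: it uses a smaller frame, a seeded generator, and NumPy's `np.shares_memory` to check whether the arrays behind the same column of both dataframes overlap. Selecting rows with `iloc` and a slice is expected to yield a view, so the check should report that memory is shared.

```python
import numpy as np
import pandas as pd

# Smaller frame than in the text, seeded for reproducibility (assumption).
rng = np.random.default_rng(seed=42)
data = pd.DataFrame({
    "column1": rng.uniform(-1_000.0, 1_000.0, size=1_000),
    "column2": rng.uniform(-1_000.0, 1_000.0, size=1_000),
})

# Derive a second dataframe holding the first half of the rows.
data2 = data.iloc[:500]

# Compare the underlying arrays of the same column in both dataframes.
shares = np.shares_memory(
    data["column1"].to_numpy(),
    data2["column1"].to_numpy(),
)
print(f"column1 shares memory: {shares}")
```

Sharing memory is not a problem in itself; it only becomes visible when one of the dataframes is modified in place.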

The developers of pandas want to ensure that an operation on a dataframe never affects another dataframe, and introduced Copy-on-Write in pandas 1.5.x for that purpose.  It is a "mode" that can be set via the pandas `options`.  Pandas 2.x generates warnings for obvious cases in which code relies on behavior that Copy-on-Write will change, so that you can adapt your code accordingly.

The warning displayed when you executed the `clip` operation in-place is an illustration of that.
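
Regardless of the pandas version and of the Copy-on-Write setting, an explicit `.copy()` decouples a derived dataframe from its source. The following minimal sketch uses a tiny, hand-written frame and the name `snapshot`, both chosen for illustration only.

```python
import pandas as pd

data = pd.DataFrame({"column1": [-900.0, -100.0, 100.0, 900.0]})

# An explicit copy owns its data, so later writes to `data` cannot reach it.
snapshot = data.iloc[:2].copy()

# Modify the original in place through positional indexing.
data.iloc[0, 0] = 0.0

print(data.iloc[0, 0])      # value in the modified original
print(snapshot.iloc[0, 0])  # value in the copy, still the initial one
```

The price is that the copy is made immediately, whether or not either dataframe is ever modified.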

## Pandas 3.x

In pandas 3.x, Copy-on-Write will be the only mode that is available.  It ensures that only a single dataframe or series is affected by an operation.  When using pandas 2.x, you can already opt in to these semantics.

In [22]:
pd.options.mode.copy_on_write = True
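
As a minimal sketch of these semantics, assuming a tiny hand-written frame, the example below writes to `data` after deriving `data2` from it. The version check is an assumption added for illustration: it sets the option only on pandas versions before 3, where Copy-on-Write is opt-in.

```python
import pandas as pd

# Assumption: only set the option where Copy-on-Write is still optional.
if int(pd.__version__.split(".")[0]) < 3:
    pd.options.mode.copy_on_write = True

data = pd.DataFrame({"column1": [-900.0, -100.0, 100.0, 900.0]})
data2 = data.iloc[:2]

# Writing to `data` triggers a copy behind the scenes, so `data2`
# keeps seeing the original values.
data.iloc[0, 0] = 0.0

first_in_data = data.iloc[0, 0]
first_in_data2 = data2.iloc[0, 0]
print(first_in_data, first_in_data2)
```

With Copy-on-Write active, the write is visible in `data` only, while `data2` retains the value it had when it was created.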

You can replay the same scenario as before.

In [23]:
data = pd.DataFrame({
    'column1': np.random.uniform(-1_000.0, 1_000.0, size=100_000),
    'column2': np.random.uniform(-1_000.0, 1_000.0, size=100_000),
    'column3': np.random.uniform(-1_000, 1_000, size=100_000).astype(np.int64),
    'column4': np.random.uniform(-1_000, 1_000, size=100_000).astype(np.int64),
})

In [24]:
data2 = data.iloc[:50_000]

In [25]:
data2.describe()

Unnamed: 0,column1,column2,column3,column4
count,50000.0,50000.0,50000.0,50000.0
mean,1.695724,-3.040382,4.00146,0.5111
std,578.62756,576.453798,576.244217,578.301376
min,-999.983952,-999.968792,-999.0,-999.0
25%,-502.160012,-500.333306,-492.25,-498.0
50%,4.057258,-4.377464,3.0,0.0
75%,502.159214,494.614704,499.0,500.25
max,999.998666,999.913716,999.0,999.0


In [26]:
data['column2'].clip(-500.0, 500.0, inplace=True)

/tmp/ipykernel_565/3016868282.py:1: ChainedAssignmentError: A value is trying to be set on a copy of a DataFrame or Series through chained assignment using an inplace method.
When using the Copy-on-Write mode, such inplace method never works to update the original DataFrame or Series, because the intermediate object on which we are setting values always behaves as a copy.

For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' instead, to perform the operation inplace on the original object.


  data['column2'].clip(-500.0, 500.0, inplace=True)


This time around, thanks to setting the `copy_on_write` mode to `True`, a `ChainedAssignmentError` is reported and the operation has no effect.  You can verify that `data2` is unaffected.

In [28]:
data2.describe()

Unnamed: 0,column1,column2,column3,column4
count,50000.0,50000.0,50000.0,50000.0
mean,1.695724,-3.040382,4.00146,0.5111
std,578.62756,576.453798,576.244217,578.301376
min,-999.983952,-999.968792,-999.0,-999.0
25%,-502.160012,-500.333306,-492.25,-498.0
50%,4.057258,-4.377464,3.0,0.0
75%,502.159214,494.614704,499.0,500.25
max,999.998666,999.913716,999.0,999.0


Of course, the original dataframe `data` is also unmodified.

In [29]:
data.describe()

Unnamed: 0,column1,column2,column3,column4
count,100000.0,100000.0,100000.0,100000.0
mean,1.623964,-3.764614,1.34378,1.4784
std,577.971329,577.167935,575.56337,577.826276
min,-999.983952,-999.980827,-999.0,-999.0
25%,-499.179172,-503.072353,-497.0,-497.0
50%,3.13326,-5.69174,1.0,2.0
75%,502.454035,495.773629,497.0,501.0
max,999.998666,999.98579,999.0,999.0


Following the advice in the message, you can execute the `clip` operation on the dataframe itself, specifying a lower and an upper bound for each column.  A bound of `None` leaves the values of that column unclipped.

In [30]:
data.clip([-500.0, *[None]*3], [500.0, *[None]*3], inplace=True)

You can verify that `data2` is unaffected, as intended.

In [31]:
data2.describe()

Unnamed: 0,column1,column2,column3,column4
count,50000.0,50000.0,50000.0,50000.0
mean,1.695724,-3.040382,4.00146,0.5111
std,578.62756,576.453798,576.244217,578.301376
min,-999.983952,-999.968792,-999.0,-999.0
25%,-502.160012,-500.333306,-492.25,-498.0
50%,4.057258,-4.377464,3.0,0.0
75%,502.159214,494.614704,499.0,500.25
max,999.998666,999.913716,999.0,999.0


Of course, the values in the first column of `data` are clipped as expected.

In [32]:
data.describe()

Unnamed: 0,column1,column2,column3,column4
count,100000.0,100000.0,100000.0,100000.0
mean,1.174175,-3.764614,1.34378,1.4784
std,408.414807,577.167935,575.56337,577.826276
min,-500.0,-999.980827,-999.0,-999.0
25%,-499.179172,-503.072353,-497.0,-497.0
50%,3.13326,-5.69174,1.0,2.0
75%,500.0,495.773629,497.0,501.0
max,500.0,999.98579,999.0,999.0


Alternatively, the values can also be clipped as follows for the second column.

In [33]:
data['column2'] = data['column2'].clip(-500.0, 500.0)

In [34]:
data.describe()

Unnamed: 0,column1,column2,column3,column4
count,100000.0,100000.0,100000.0,100000.0
mean,1.174175,-2.876739,1.34378,1.4784
std,408.414807,408.10788,575.56337,577.826276
min,-500.0,-500.0,-999.0,-999.0
25%,-499.179172,-500.0,-497.0,-497.0
50%,3.13326,-5.69174,1.0,2.0
75%,500.0,495.773629,497.0,501.0
max,500.0,500.0,999.0,999.0


Dataframe `data2` is unaffected as well.

In [35]:
data2.describe()

Unnamed: 0,column1,column2,column3,column4
count,50000.0,50000.0,50000.0,50000.0
mean,1.695724,-3.040382,4.00146,0.5111
std,578.62756,576.453798,576.244217,578.301376
min,-999.983952,-999.968792,-999.0,-999.0
25%,-502.160012,-500.333306,-492.25,-498.0
50%,4.057258,-4.377464,3.0,0.0
75%,502.159214,494.614704,499.0,500.25
max,999.998666,999.913716,999.0,999.0
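
The assignment pattern extends naturally to several columns at once. The helper `clip_columns` below is a hypothetical name introduced for this sketch, not a pandas function, and the small frame is made up for illustration.

```python
import pandas as pd


def clip_columns(df, columns, lower, upper):
    """Clip the given columns of df by assigning the clipped result back."""
    df[columns] = df[columns].clip(lower, upper)


data = pd.DataFrame({
    "column1": [-900.0, -100.0, 100.0, 900.0],
    "column2": [-800.0, -200.0, 200.0, 800.0],
    "column3": [-700, -300, 300, 700],
})

# Only the listed columns are clipped; column3 is left untouched.
clip_columns(data, ["column1", "column2"], -500.0, 500.0)
print(data)
```

Because the result is assigned back to `data` directly, no intermediate object is modified in place and no chained assignment is involved.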


### Performance?

You can check which of the two alternatives has the better performance for in-place operations.

In [42]:
nr_rows = 100_000

In [43]:
data = pd.DataFrame({
    'column1': np.random.uniform(-1_000.0, 1_000.0, size=nr_rows),
    'column2': np.random.uniform(-1_000.0, 1_000.0, size=nr_rows),
    'column3': np.random.uniform(-1_000, 1_000, size=nr_rows).astype(np.int64),
    'column4': np.random.uniform(-1_000, 1_000, size=nr_rows).astype(np.int64),
})

In [44]:
%%timeit
data.clip([-500.0, None, None, None], [500.0, None, None, None], inplace=True)

6.94 ms ± 638 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)


In [45]:
%%timeit
data['column1'] = data['column1'].clip(-500.0, 500.0)

946 μs ± 49.5 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)


In [46]:
nr_rows = 10_000_000

In [47]:
data = pd.DataFrame({
    'column1': np.random.uniform(-1_000.0, 1_000.0, size=nr_rows),
    'column2': np.random.uniform(-1_000.0, 1_000.0, size=nr_rows),
    'column3': np.random.uniform(-1_000, 1_000, size=nr_rows).astype(np.int64),
    'column4': np.random.uniform(-1_000, 1_000, size=nr_rows).astype(np.int64),
})

In [48]:
%%timeit
data.clip([-500.0, None, None, None], [500.0, None, None, None], inplace=True)

778 ms ± 78.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [49]:
%%timeit
data['column1'] = data['column1'].clip(-500.0, 500.0)

49 ms ± 6.72 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)


As you can see, assigning the clipped column back to the dataframe is significantly faster than clipping the entire dataframe in place.
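
Outside a notebook, where the `%%timeit` magic is not available, a comparable measurement can be sketched with the standard library's `timeit` module. The function names, the seed, the smaller number of rows and the repeat counts are assumptions chosen to keep the example quick; absolute timings will differ per machine and pandas version.

```python
import timeit

import numpy as np
import pandas as pd


def make_frame(nr_rows, seed=1234):
    """Create a frame with two float and two integer columns."""
    rng = np.random.default_rng(seed)
    return pd.DataFrame({
        "column1": rng.uniform(-1_000.0, 1_000.0, size=nr_rows),
        "column2": rng.uniform(-1_000.0, 1_000.0, size=nr_rows),
        "column3": rng.integers(-1_000, 1_000, size=nr_rows),
        "column4": rng.integers(-1_000, 1_000, size=nr_rows),
    })


def clip_whole_frame(df):
    # Per-column bounds; None means that column is not clipped.
    df.clip([-500.0, None, None, None], [500.0, None, None, None], inplace=True)


def clip_by_assignment(df):
    # Clip a single column and assign the result back.
    df["column1"] = df["column1"].clip(-500.0, 500.0)


def best_time(func, df, repeat=3, number=5):
    """Return the best average time per call in seconds."""
    timings = timeit.repeat(lambda: func(df), repeat=repeat, number=number)
    return min(timings) / number


data = make_frame(10_000)
frame_time = best_time(clip_whole_frame, data)
assign_time = best_time(clip_by_assignment, data)
print(f"whole frame: {frame_time * 1e3:.3f} ms, assignment: {assign_time * 1e3:.3f} ms")
```

As in the notebook cells above, the same frame is clipped repeatedly, so every call after the first operates on data that is already clipped.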

## Conclusion

If your code relies on "spooky action at a distance", you have to prepare for the release of pandas 3.x, since that code will no longer work as intended and may generate errors.  You can do so by enabling Copy-on-Write via the pandas `options` and fixing the resulting issues.

In general, it is good practice to enable Copy-on-Write when developing new code as well, since you can then be confident that, once pandas 3.x is released, your code will not run into issues related to Copy-on-Write.

You can find more information and examples in the [pandas documentation](https://pandas.pydata.org/docs/user_guide/copy_on_write.html).