# Rethinking pandas' copy/view semantics

aka *Death to the SettingWithCopyWarning* ;)


## Problem 1: unclear copy/view semantics in indexing

In [1]:
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})

In [2]:
df

|   | A | B |
|---|---|---|
| 0 | 1 | 4 |
| 1 | 2 | 5 |
| 2 | 3 | 6 |


In [3]:
subset = df[["A"]]
# or
# subset = df[df['A'] == 1]

When the user modifies this subset:

1. Did the user intend to modify `df` as well when modifying `subset`?
2. Or did the user just want to work further with `subset`, ignoring `df`?

In [4]:
subset.iloc[:, 0] = 10

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self.obj[item_labels[indexer[info_axis]]] = value


In [5]:
df

|   | A | B |
|---|---|---|
| 0 | 1 | 4 |
| 1 | 2 | 5 |
| 2 | 3 | 6 |
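
Both intents from the two questions above can be written so that the reader (and pandas) does not have to guess. The following is a minimal sketch, not a prescribed idiom: it assumes that an explicit `.copy()` is an acceptable way to say "I only care about the subset", and that a single `.loc` assignment on `df` is an acceptable way to say "I want to change the parent". The variable names `a_after_subset_edit` and `a_after_parent_edit` are introduced here for illustration only.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})

# Intent 2: keep working with the subset only.
# The explicit .copy() makes the subset independent of df.
subset = df[["A"]].copy()
subset.iloc[:, 0] = 10
a_after_subset_edit = df["A"].tolist()  # df is untouched here

# Intent 1: modify the parent.
# Assign on df itself, in a single indexing step.
df.loc[:, "A"] = 10
a_after_parent_edit = df["A"].tolist()

print(a_after_subset_edit)
print(a_after_parent_edit)
```

The cost of the first variant is an extra copy of the selected data, which ties directly into the second problem discussed in this document.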


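Original motivation for the SettingWithCopyWarning: chained assignment. In the first form below, `df[df['B'] > 4]` returns a copy, so the assignment never reaches `df`; in the second form the assignment does modify `df`, as the outputs show.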

In [6]:
df[df['B'] > 4]['B'] = 10
df

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df[df['B'] > 4]['B'] = 10


|   | A | B |
|---|---|---|
| 0 | 1 | 4 |
| 1 | 2 | 5 |
| 2 | 3 | 6 |


In [7]:
df['B'][df['B'] > 4] = 10
df

|   | A | B |
|---|---|---|
| 0 | 1 | 4 |
| 1 | 2 | 10 |
| 2 | 3 | 10 |
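
The warning text itself suggests `.loc[row_indexer,col_indexer] = value`. As a small sketch of that suggestion applied to the same example, the row mask and the column label go into one `.loc` call, so there is no intermediate object whose copy/view status matters. The name `b_values` is only there to make the result easy to inspect.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})

# One indexing step: rows where B > 4, column "B".
df.loc[df["B"] > 4, "B"] = 10

b_values = df["B"].tolist()
print(b_values)
```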


## Problem 2: wasteful copying

Quoting Wes McKinney (from https://wesmckinney.com/blog/apache-arrow-pandas-internals/):


<div style="font-size:120%">

> pandas rule of thumb: have 5 to 10 times as much RAM as the size of your dataset

</div>

In [8]:
N = 1_000_000
df = pd.DataFrame({
    'key': np.arange(N),
    'A': np.random.randn(N),
    'B': np.random.randn(N),
    'C': ['a', 'b', 'c', 'd'] * (N // 4),
    # 'min' is the minute alias; the older 'T' spelling is deprecated in recent pandas
    'D': pd.date_range("2012", periods=N, freq='min')
})

In [9]:
df

|   | key | A | B | C | D |
|---|---|---|---|---|---|
| 0 | 0 | -1.695256 | -0.286734 | a | 2012-01-01 00:00:00 |
| 1 | 1 | -0.626765 | -1.136325 | b | 2012-01-01 00:01:00 |
| 2 | 2 | -1.898366 | 0.437011 | c | 2012-01-01 00:02:00 |
| 3 | 3 | 0.140612 | 0.432111 | d | 2012-01-01 00:03:00 |
| 4 | 4 | -2.793411 | 0.989344 | a | 2012-01-01 00:04:00 |
| ... | ... | ... | ... | ... | ... |
| 999995 | 999995 | 1.036492 | -0.521045 | d | 2013-11-25 10:35:00 |
| 999996 | 999996 | -0.440127 | 0.991176 | a | 2013-11-25 10:36:00 |
| 999997 | 999997 | 1.197192 | 0.371876 | b | 2013-11-25 10:37:00 |
| 999998 | 999998 | -0.402997 | 1.974036 | c | 2013-11-25 10:38:00 |


In [10]:
%%timeit
(df.rename(columns={"D": "date"})
   .fillna({"A": np.nan})
   .set_index("key")
   .loc[:, ["date", "A"]]
)

37.9 ms ± 2.26 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [11]:
%load_ext snakeviz

In [12]:
%%snakeviz
(df.rename(columns={"D": "date"})
   .fillna({"A": np.nan})
   .set_index("key")
   .loc[:, ["date", "A"]]
)



In [13]:
%timeit df.copy()

9.18 ms ± 533 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


In [14]:
%timeit df.rename(columns={"D": "date"})

8.36 ms ± 379 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
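
The two timings above are close to each other, which suggests that `rename` spends most of its time copying data it does not change. One way to check this directly, rather than infer it from timings, is to ask NumPy whether an untouched column still points at the same buffer. This is a rough sketch on a smaller frame; the name `shares_buffer` is introduced here, and the answer depends on the pandas version and its copy settings, so no expected output is given.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": np.random.randn(1_000), "D": np.arange(1_000)})
renamed = df.rename(columns={"D": "date"})

# True if column "A" of the result still uses the original buffer,
# False if rename duplicated the data.
shares_buffer = np.shares_memory(df["A"].to_numpy(), renamed["A"].to_numpy())
print("rename shares memory with the original:", shares_buffer)
```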


## Can we do better?

- Don't return copies in indexing (e.g. when selecting columns) when not needed?
- Don't always copy in methods like `set_index` or `rename`, but use "Copy on Write"?
- ...

Currently being discussed in https://github.com/pandas-dev/pandas/issues/36195
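
To make the "Copy on Write" idea from the list above concrete, here is a toy sketch in plain NumPy. It is not pandas' implementation and not a proposal for one: `CowColumn`, its `view` method and the shared `_owners` counter are hypothetical names invented for this illustration. The point is only that a "copy" can be free until someone writes, and that the write never leaks into the other object.

```python
import numpy as np


class CowColumn:
    """Hypothetical toy column that shares its buffer until a write happens."""

    def __init__(self, values, _owners=None):
        self._values = values
        # Columns that share one buffer also share this one-element counter.
        self._owners = _owners if _owners is not None else [1]

    def view(self):
        # Cheap "copy": no data is duplicated, only the owner count grows.
        self._owners[0] += 1
        return CowColumn(self._values, self._owners)

    def __setitem__(self, key, value):
        if self._owners[0] > 1:
            # Another column still references the buffer: copy before writing.
            self._owners[0] -= 1
            self._values = self._values.copy()
            self._owners = [1]
        self._values[key] = value

    def to_list(self):
        return self._values.tolist()


parent = CowColumn(np.array([1, 2, 3]))
child = parent.view()
shared_before = np.shares_memory(parent._values, child._values)

child[0] = 10  # triggers the deferred copy
shared_after = np.shares_memory(parent._values, child._values)

print(shared_before, shared_after)
print(parent.to_list(), child.to_list())
```

Under such a scheme the two questions from Problem 1 get a single answer (modifying the child never changes the parent), while methods that do not modify data would not have to pay for a copy up front.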