# Rethinking pandas' copy/view semantics

aka *Death to the SettingWithCopyWarning* ;)


In [None]:
import numpy as np
import pandas as pd

## Problem 1: unclear copy/view semantics in indexing

In [None]:
df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})

In [None]:
df

In [None]:
subset = df[["A"]]
# or
# subset = df[df['A'] == 1]

When the user modifies this subset:

1. Did the user intend to modify `df` as well when modifying `subset`?
2. Or did the user just want to work further with `subset`, ignoring `df`?

In [None]:
subset.iloc[:, 0] = 10

In [None]:
df
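If the second reading is what the user wants, the intent can be made explicit today by taking a copy up front. A minimal sketch, reusing the small example frame from above:

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})

# An explicit copy makes `subset` independent of `df`,
# so later writes cannot propagate back to the parent.
subset = df[["A"]].copy()
subset.iloc[:, 0] = 10

print(df)      # column "A" is unchanged
print(subset)  # column "A" is all 10
```

The price is an extra copy of the selected data, which ties into the second problem below.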

Original motivation for the SettingWithCopyWarning: chained indexing followed by assignment, where it is unclear whether `df` itself gets modified:

In [None]:
df[df['B'] > 4]['B'] = 10
df

In [None]:
df['B'][df['B'] > 4] = 10
df
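Both chained forms above split the assignment over two indexing steps, so the result depends on whether the intermediate object is a copy or a view. The unambiguous spelling is a single `.loc` call that selects rows and column at once. A minimal sketch on the same example frame:

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})

# One indexing operation: row mask and column label together,
# so the assignment always targets `df` itself.
mask = df["B"] > 4
df.loc[mask, "B"] = 10

print(df)
```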

## Problem 2: wasteful copying

Quoting Wes McKinney (from https://wesmckinney.com/blog/apache-arrow-pandas-internals/):


<div style="font-size:120%">

> pandas rule of thumb: have 5 to 10 times as much RAM as the size of your dataset

</div>

In [None]:
N = 1_000_000
df = pd.DataFrame({
    'key': np.arange(N),
    'A': np.random.randn(N),
    'B': np.random.randn(N),
    'C': ['a', 'b', 'c', 'd'] * (N // 4),
    'D': pd.date_range("2012", periods=N, freq='min')  # 'min' replaces the deprecated 'T' alias
})

In [None]:
df

In [None]:
%%timeit
(df.rename(columns={"D": "date"})
   .fillna({"A": np.nan})
   .set_index("key")
   .loc[:, ["date", "A"]]
)

In [None]:
%load_ext snakeviz

In [None]:
%%snakeviz
(df.rename(columns={"D": "date"})
   .fillna({"A": np.nan})
   .set_index("key")
   .loc[:, ["date", "A"]]
)

In [None]:
%timeit df.copy()

In [None]:
%timeit df.rename(columns={"D": "date"})
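Besides timing, one way to check whether a method copied the data is to ask NumPy whether the underlying buffers overlap. This is a rough sketch on a small, hypothetical frame (`small` is not part of the notebook). The result depends on the pandas version and on whether Copy-on-Write is active, so no particular outcome is assumed here:

```python
import numpy as np
import pandas as pd

# Small hypothetical frame; the notebook's `df` is much larger.
small = pd.DataFrame({"A": np.random.randn(5), "D": np.arange(5)})
renamed = small.rename(columns={"D": "date"})

# True means `rename` reused the original buffer for column "A";
# False means the column data was copied.
shares = np.shares_memory(small["A"].to_numpy(), renamed["A"].to_numpy())
print(f"rename shares memory with the original: {shares}")
```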

## Can we do better?

- Don't return copies in indexing (e.g. selecting columns) when not needed?
- Don't always copy in methods like `set_index` or `rename`, but use "Copy on Write"?
- ...

Currently being discussed in https://github.com/pandas-dev/pandas/issues/36195
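To make the "Copy on Write" idea concrete, here is a toy illustration in plain NumPy. The `CowColumn` class and its methods are hypothetical and are not how pandas implements it. The point is only that a derived object can share its parent's buffer for free and pay for a copy on the first write:

```python
import numpy as np


class CowColumn:
    """Hypothetical toy column: shares its buffer until someone writes to it."""

    def __init__(self, data, shared=False):
        self._data = data
        self._shared = shared  # True while another object may hold the same buffer

    def view(self):
        # Hand out a new object backed by the same buffer; mark both as shared.
        self._shared = True
        return CowColumn(self._data, shared=True)

    def __setitem__(self, key, value):
        if self._shared:
            # First write on a shared buffer: copy, then write to the private copy.
            self._data = self._data.copy()
            self._shared = False
        self._data[key] = value

    def to_numpy(self):
        return self._data


parent = CowColumn(np.array([1, 2, 3]))
child = parent.view()  # no data copied yet
shared_before = np.shares_memory(parent.to_numpy(), child.to_numpy())

child[0] = 10  # triggers the copy, so `parent` is left untouched
shared_after = np.shares_memory(parent.to_numpy(), child.to_numpy())

print(shared_before, shared_after)
print(parent.to_numpy(), child.to_numpy())
```

This sketch is deliberately conservative: the parent stays flagged as shared after the child has copied, so it may make one unnecessary copy on its own first write. A real implementation would track references more precisely.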