DataFrame.copy(deep=True) doesn't deep copy values whose type is list() #22203

Closed
GabrielDrapor opened this issue Aug 5, 2018 · 4 comments
Labels
Algos Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff Bug

Comments

@GabrielDrapor

Code Sample, a copy-pastable example if possible

import pandas as pd

a = pd.DataFrame([[1]])
print(a.loc[0][0])

b = a.copy()  # deep=True is the default
b.loc[0, 0] = 2
print(b.loc[0, 0])  # 2
print(a.loc[0, 0])  # 1

print(a.loc[0, 0] is b.loc[0, 0])  # False
print(id(a.loc[0, 0]) == id(b.loc[0, 0]))  # True

# ------

a = pd.DataFrame([[[1]]])
print(a.loc[0][0])

b = a.copy()
b.loc[0,0][0] = 2
print(b.loc[0, 0]) # [2]
print(a.loc[0, 0]) # [2]

print(a.loc[0,0] is b.loc[0,0])  # True
print(id(a.loc[0,0]) == id(b.loc[0,0])) # True

Problem description

When the values are lists, DataFrame.copy(deep=True) doesn't work as expected: the copy still shares the list objects with the original.
Or is this a non-issue that doesn't need fixing?

Expected Output

a = pd.DataFrame([[1]])
print(a.loc[0][0])

b = a.copy()
b.loc[0,0] = 2
print(b.loc[0, 0])  # 2
print(a.loc[0, 0])  # 1

print(a.loc[0,0] is b.loc[0,0]) # False
print(id(a.loc[0,0]) == id(b.loc[0,0])) # False

# ------

a = pd.DataFrame([[[1]]])
print(a.loc[0][0])

b = a.copy()
b.loc[0,0][0] = 2
print(b.loc[0, 0]) # [2]
print(a.loc[0, 0]) # [1]

print(a.loc[0,0] is b.loc[0,0])  # False
print(id(a.loc[0,0]) == id(b.loc[0,0])) # False

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.6.1.final.0
python-bits: 64
OS: Darwin
OS-release: 17.7.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: en_US.UTF-8
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.23.4
pytest: 3.0.7
pip: 18.0
setuptools: 38.2.4
Cython: 0.25.2
numpy: 1.13.3
scipy: 0.19.0
pyarrow: None
xarray: None
IPython: 6.2.1
sphinx: 1.5.6
patsy: 0.4.1
dateutil: 2.6.1
pytz: 2017.3
blosc: None
bottleneck: 1.2.1
tables: 3.3.0
numexpr: 2.6.2
feather: None
matplotlib: 2.1.1
openpyxl: 2.4.7
xlrd: 1.0.0
xlwt: 1.2.0
xlsxwriter: 0.9.6
lxml: 3.7.3
bs4: 4.6.0
html5lib: 0.9999999
sqlalchemy: 1.1.9
pymysql: None
psycopg2: 2.7.5 (dt dec pq3 ext lo64)
jinja2: 2.9.6
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

@gfyoung gfyoung added the Algos Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff label Aug 6, 2018
@gfyoung
Member

gfyoung commented Aug 6, 2018

Semantically speaking, deepcopy should not have this kind of aliasing. Thus, I would consider this a bug. Investigation and PR are welcome!

@gfyoung gfyoung added the Bug label Aug 6, 2018
@nmusolino
Contributor

nmusolino commented Aug 11, 2018

In the second case, if the contained series has dtype object, then this is behaving as documented. From the “Notes” section of the pandas.DataFrame.copy() documentation:

When deep=True, data is copied but actual Python objects will not be copied recursively, only the reference to the object. This is in contrast to copy.deepcopy in the Standard Library, which recursively copies object data (see examples below).
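As a rough illustration of the documented behavior, the sketch below shows that an object-dtype column of lists keeps sharing its list objects after DataFrame.copy(), and one possible way to get independent lists by deep-copying each element. The column name "x" and the variable names are made up for this example, and the element-wise Series.map(copy.deepcopy) step is just one workaround, not something prescribed by the pandas documentation.

```python
import copy

import pandas as pd

# Hypothetical example frame: one object-dtype column holding Python lists.
original = pd.DataFrame({"x": [[1], [2]]})
print(original["x"].dtype)  # object

# copy(deep=True) copies the array of references, not the lists themselves.
shared = original.copy()  # deep=True is the default
print(shared.loc[0, "x"] is original.loc[0, "x"])  # True

# One possible workaround: deep-copy every element of the object column.
independent = original.copy()
independent["x"] = independent["x"].map(copy.deepcopy)

# Mutating a list in the independent copy leaves the original untouched.
independent.loc[0, "x"].append(99)
print(original.loc[0, "x"])     # [1]
print(independent.loc[0, "x"])  # [1, 99]
```

The element-wise copy only makes sense for object columns; numeric columns are already fully copied by DataFrame.copy().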

@nmusolino
Contributor

nmusolino commented Aug 11, 2018

In the first case, I don't think there is any actual aliasing going on. It only appears this way because of how you are checking with the id function.

(By the way, this example would be clearer if it did not recycle the names a and b for the two cases. Here I am referring to the first example.)

The example shows:

print(a.loc[0,0] is b.loc[0,0]) # False
print(id(a.loc[0,0]) == id(b.loc[0,0])) #True

The first line (is comparison) is as expected. The second is surprising, but that's because the statement is comparing two temporary objects. In other words, to evaluate this expression:

id(a.loc[0,0]) == id(b.loc[0,0])

the Python interpreter could perform the following steps:

  1. Evaluate a.loc[0, 0]; then
  2. Get the id of the temporary object created in step 1; then
  3. Evaluate b.loc[0, 0]; then
  4. Get the id of the temporary object created in step 3.

If the temporary object created in step 1 is GC'ed in between, the temporary object created in step 3 may be created at the same address. (In CPython, the id function returns the memory address of an object, although this is considered a CPython implementation detail.)

One can see examples of this using plain old Python objects:

In [13]: id(object()), id(object())
Out[13]: (4763425312, 4763425312)

In [19]: print(object() is object())
False

In [20]: print(id(object()) == id(object()))
True
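A more reliable way to run the check from the first example is to bind both values to names before comparing, so that neither temporary can be freed and have its address reused. This is a minimal sketch; the names value_a and value_b are introduced here for illustration only.

```python
import pandas as pd

a = pd.DataFrame([[1]])
b = a.copy()
b.loc[0, 0] = 2

# Keep both scalars alive at the same time by binding them to names.
value_a = a.loc[0, 0]
value_b = b.loc[0, 0]

# Two objects that are alive simultaneously always have distinct ids,
# so the id comparison now agrees with the `is` comparison.
print(value_a is value_b)          # False
print(id(value_a) == id(value_b))  # False
```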

I think this issue should be closed as not-a-bug, unless I am missing something in the original report.

@GabrielDrapor
Author

@nmusolino you are right. It's literally not a bug. Thx!
