REGR: invalid cache after take operation with non-consolidated dataframe #35521

Closed
on55 opened this issue Aug 3, 2020 · 22 comments · Fixed by #36114
Labels
Bug, Internals (related to non-user accessible pandas implementation), Regression (functionality that used to work in a prior pandas version)

@on55

on55 commented Aug 3, 2020

A program that runs fine with pandas 1.0.5 fails after upgrading to 1.1.0.
I found that sometimes, when I change the value of one cell in a dataframe, print(df) does not show the change,
but a comparison with == says the value has changed.
For example: the original cell value is A, and I change it to B.
print(df) still shows A,
but df.at[x, x] == B returns True.


MarcoGorelli's edit: here's a reproducible example:

import pandas as pd


position = pd.DataFrame(columns=["code", "startdate"])
position = position.append([{"code": "a", "startdate": 0}])

# These two lines should not change anything.
# BUT, commenting either of them out makes this code run as intended
position["code"] == "A"
position[position["startdate"] == 0]

position.at[0, "code"] = "A"

print(position.at[0, "code"])
print(position)

output:

A
  code startdate
0    a         0

expected output:

A
  code startdate
0    A         0
@jreback
Contributor

jreback commented Aug 3, 2020

pls show an example that reproduces the issue

@simonjayhawkins simonjayhawkins added the Needs Info (clarification about behavior needed to assess issue) label Aug 3, 2020
@simonjayhawkins simonjayhawkins added this to the 1.1.1 milestone Aug 3, 2020
@MarcoGorelli
Member

Closing as I can't reproduce this:

>>> import pandas as pd
>>> df = pd.DataFrame({'a': ['A']})
>>> df
   a
0  A
>>> df.loc[0, 'a'] = 'B'
>>> df.at[0, 'a'] == 'B'
True
>>> print(df)
   a
0  B

Please ping if you can provide a bug report as is suggested here https://matthewrocklin.com/blog/work/2018/02/28/minimal-bug-reports

@MarcoGorelli MarcoGorelli modified the milestones: 1.1.1, No action Aug 12, 2020
@MarcoGorelli
Member

@simonjayhawkins noticed you'd put the 1.1.1 milestone - were you able to reproduce this?

@on55
Author

on55 commented Aug 15, 2020

MarcoGorelli:
Nice to meet you! Sorry, I seldom come here and have only just seen your comments.

The details are:
The program reads a lot of data and does calculations.
After it has calculated and output a big dataframe, it changes several values in the leftmost column in a loop.
The statement is position.at[index, 'code'] = code.
Only one cell's value fails to change.
When I print(df), it shows the original value, but df.at[x, y] == new value returns True
and df.at[x, y] == original value returns False.
The statement runs in a loop and the other cells are changed correctly; only one cell hits this problem.

I have now downgraded to 1.0.5 and my program runs fine again.
My program runs fine with many pandas versions and only has this issue with pandas 1.1.0,
so I suppose there may be memory corruption in that version.

Thank you, and thanks to pandas.

@on55
Author

on55 commented Aug 15, 2020

Supplement:
The dataframe is fairly big and is produced automatically by the program; it is not a simple 1x1 dataframe.
The program changes several cell values in the leftmost column in a loop, and only one cell hits this issue; the other cells are changed correctly.

The printed cell shows the original value,
but a comparison with == against the new value returns True.
It looks as if the value has changed and only the printing is wrong.

However, the code that follows keeps running and reports an error,
because it reads the dataframe and finds that the cell still holds the original value, so the value has not really changed.
That code cannot get the correct code value, so it reports an error.
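
One way to tell a display-only problem from a real inconsistency is to compare the scalar returned by `df.at` with the same cell in the array from `df.to_numpy()`. The following is a minimal sketch of such a check, not something posted in this thread; the helper name `scalar_matches_storage` is hypothetical, and it assumes unique row and column labels.

```python
import pandas as pd


def scalar_matches_storage(df, row_label, col_label):
    """Return True if df.at agrees with the same cell in df.to_numpy().

    Hypothetical helper; assumes row and column labels are unique.
    """
    row_pos = df.index.get_loc(row_label)
    col_pos = df.columns.get_loc(col_label)
    scalar = df.at[row_label, col_label]
    stored = df.to_numpy()[row_pos, col_pos]
    return bool(scalar == stored)


# On a healthy frame the two views of the cell agree.
df = pd.DataFrame({"code": ["a", "b"], "vol": [10, 10]})
df.at[0, "code"] = "A"
print(scalar_matches_storage(df, 0, "code"))
```

If the check returns False right after an assignment, the scalar accessor and the underlying data disagree, which matches the symptom described above.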

@simonjayhawkins
Member

simonjayhawkins commented Aug 15, 2020

@simonjayhawkins noticed you'd put the 1.1.1 milestone - were you able to reproduce this?

not tried to reproduce. using the 1.1.1 milestone as a tracker (for regressions and related PRs). 1.1.1 will be released next week and any items on the milestone still open at that time will be moved to 1.1.2 (or can review). the blocker tag can be used, but less relevant for a patch release as 1.1.2 could follow in 2-3 weeks.

for the minor releases (i.e. 1.2) it is common to only apply the milestone when the PR is close to being ready (and then also on the corresponding issue)

when triaging issues i tend to either tag as 1.1.1(for regressions from 1.0.5) or contributions welcome (bugs and regressions from earlier versions)

@MarcoGorelli MarcoGorelli modified the milestones: No action, 1.1.1 Aug 15, 2020
@MarcoGorelli
Member

any items on the milestone still open at that time will be moved to 1.1.2 (or can review).

Thanks for explaining - OK, I have reopened this so it can be reviewed, though there's not much to go on here

@MarcoGorelli MarcoGorelli reopened this Aug 15, 2020
@jreback
Contributor

jreback commented Aug 15, 2020

this needs an actual reproducible example

@simonjayhawkins
Member

this needs an actual reproducible example

@on55 if you look at other bug reports, you'll see most use the template (available when opening a new issue) and ideally follow the guidance given in https://matthewrocklin.com/blog/work/2018/02/28/minimal-bug-reports

@simonjayhawkins simonjayhawkins modified the milestones: 1.1.1, 1.1.2 Aug 17, 2020
@simonjayhawkins
Member

moved off 1.1.1 milestone (scheduled for this week) as no PRs to fix in the pipeline

@on55 on55 changed the title from "pandas version 1.1.0 has a issue" to "pandas version 1.1.0 and 1.1.1 has a issue" Aug 21, 2020
@on55
Author

on55 commented Aug 21, 2020

Today I upgraded pandas to 1.1.1 and this issue still exists.

Traceback (most recent call last):
IndexError: list index out of range

My program looks like this:
First it creates an empty dataframe, 'position'.
Then there is a loop: for cycle in xxx: (reading the database in a loop)
with a chain of conditions: if elif elif elif ...
Once one elif condition is met, some values of 'position' are added, changed, or deleted.

The error (I used PyCharm debug mode to check it):
One of the 'elif' conditions is met,
then print(position) still shows the original value, while position.code[4] shows the changed value.
But the code that follows sees the original value,
the same as in my earlier error details for version 1.1.0.

Earlier iterations hit this 'elif' condition several times, and each time it ran fine.
It correctly added or changed the cell value;
in PyCharm debug mode, both print(position) and position.code[x] showed the changed value.

This time the value of position.code[4] seems to be changed, but in essence it is not.
In the next several iterations, the program reads the 'position' dataframe but cannot find the changed code value in it,
so it reports 'IndexError: list index out of range'.

Now I have downgraded pandas to 1.0.5 again and everything is fine.

@on55 on55 closed this as completed Aug 21, 2020
@on55 on55 reopened this Aug 22, 2020
@MarcoGorelli
Member

@on55 could you please add a reproducible example as suggested above?

@on55
Author

on55 commented Aug 22, 2020

@MarcoGorelli
I found this is an odd issue; I am not sure whether it is a pandas issue or a PyCharm issue.

When I run it normally in PyCharm, it gives one result (a wrong answer);
when I use PyCharm debug mode and 'step over' the statement, it gives another result (the answer I wanted).

debug_123.xlsx
The data is attached as an Excel file.

import pandas as pd


def test():
    # 'excel address' is a placeholder for the path to the attached Excel file.
    df = pd.read_excel('excel address')

    # Normalize the dates to YYYYMMDD strings.
    rawdatelist = [str(x)[:10] for x in df.date]
    datelist = [x.replace('-', '') for x in rawdatelist]
    df['date'] = datelist

    position = pd.DataFrame(columns=['code', 'vol', 'startdate'])
    # -----------------------------
    for cycle in range(len(datelist)):
        row_type = df.type[cycle]
        date = df.date[cycle]
        code = df.code[cycle]
        vol = df.vol.to_numpy()[cycle]

        print(date)

        if row_type == 'p':
            # Open a new position with no start date yet.
            position = position.append([{'code': code, 'vol': vol, 'startdate': 0}], ignore_index=True)
        elif row_type == 'new':
            if len(position[position['code'] == code].index.tolist()) > 0:
                # The code already exists: only update its start date.
                newindex = position[position['code'] == code].index.tolist()[0]
                position.at[newindex, 'startdate'] = date
            else:
                # Otherwise reuse the first unstarted row with the same volume.
                newindex = position[(position.startdate == 0) & (position.vol == vol)].index.tolist()[0]
                position.at[newindex, 'code'] = code
                position.at[newindex, 'startdate'] = date

        print(position)
    print(position)

I want code 'a' to change to 'A'. In run mode (the first result below) it is still 'a'; in debug mode (the second result) it is changed to 'A'.

One result shows:
20200304
  code vol startdate
0    a  10         0
20200304
  code vol startdate
0    a  10         0
1    b  10         0
20200319
  code vol startdate
0    a  10  20200319
1    b  10         0
  code vol startdate
0    a  10  20200319
1    b  10         0

Another result shows:
20200304
  code vol startdate
0    a  10         0
20200304
  code vol startdate
0    a  10         0
1    b  10         0
20200319
  code vol startdate
0    A  10  20200319
1    b  10         0
  code vol startdate
0    A  10  20200319
1    b  10         0

@MarcoGorelli
Member

i reproduce it, cost my much time, so sad...

With all due respect, this is a community-based project, it costs everyone time.

Please see the link posted above for how to post a reproducible example, there shouldn't be any need to attach external data

@on55
Author

on55 commented Aug 22, 2020

Got it. This is the first time I have submitted an issue here. I attached an Excel file with less data just so that you can run it quickly.
Thank you for your patient explanation, and thanks to pandas.

@MarcoGorelli
Member

@on55 OK I see what you mean now:

(Pdb) position.at[newindex, 'code']
'A'
(Pdb) position
  code vol startdate
0    a  10         0
1    b  10         0

I'll try putting together a short example now

@MarcoGorelli
Member

MarcoGorelli commented Aug 22, 2020

@on55 I've removed the irrelevant parts from your report, this reproduces the issue:

import pandas as pd


position = pd.DataFrame(columns=["code", "startdate"])
position = position.append([{"code": "a", "startdate": 0}])

# These two lines should not change anything.
# BUT, commenting either of them out makes this code run as intended
position["code"] == "A"
position[position["startdate"] == 0]

position.at[0, "code"] = "A"

print(position.at[0, "code"])
print(position)

output:

A
  code startdate
0    a         0

expected output:

A
  code startdate
0    A         0

@on55
Author

on55 commented Aug 22, 2020

Yes, that is it. Your summary really captures the point.

@MarcoGorelli MarcoGorelli added Regression (functionality that used to work in a prior pandas version) and removed Needs Info (clarification about behavior needed to assess issue) labels Aug 22, 2020
@on55
Author

on55 commented Aug 22, 2020

But going back to my initial question: this statement is executed about 20 times in the loop (PyCharm run mode).
It reports the issue only on about the 15th time (I assume); the earlier 'position.at[0, "code"] = "A"' calls ran fine.
Why did the statement run fine the previous 14 times and fail that time? So strange.

@asishm
Contributor

asishm commented Aug 24, 2020

I tried to do a git bisect on this

#34389

760ba37a776947694ec816eaf29ce8937e08f544 is the first bad commit
commit 760ba37a776947694ec816eaf29ce8937e08f544
Author: jbrockmendel <jbrockmendel@gmail.com>
Date:   Tue May 26 15:47:18 2020 -0700

    CLN: _consolidate_inplace less (#34389)

 pandas/core/generic.py            |  9 ---------
 pandas/core/internals/managers.py | 10 ----------
 2 files changed, 19 deletions(-)
bisect run success
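
For readers who want to repeat the bisect, a check script along the following lines could be passed to `git bisect run`. This is only a sketch and not the script actually used here: the function names are hypothetical, `bug_present` reuses the reproducer posted above (which needs a pandas build that still has `DataFrame.append`), and the exit codes follow the `git bisect run` convention of 0 for good, 1 for bad, and 125 for "skip this commit".

```python
GOOD, BAD, SKIP = 0, 1, 125  # exit codes understood by `git bisect run`


def bisect_exit_code(check):
    """Map the outcome of a bug check to a `git bisect run` exit code."""
    try:
        bug_found = check()
    except Exception:
        # The commit cannot be tested (for example, the build is broken).
        return SKIP
    return BAD if bug_found else GOOD


def bug_present():
    """Run the reproducer from this thread; True means the bug shows up."""
    import pandas as pd

    position = pd.DataFrame(columns=["code", "startdate"])
    position = position.append([{"code": "a", "startdate": 0}])
    position["code"] == "A"
    position[position["startdate"] == 0]
    position.at[0, "code"] = "A"
    # The bug is present when the underlying data still holds the old value.
    return position.to_numpy()[0, 0] != "A"


# A real check script would end with:
#     import sys; sys.exit(bisect_exit_code(bug_present))
print(bisect_exit_code(lambda: False))  # a commit without the bug -> 0
```

The check is injected as a callable so that the exit-code mapping can be exercised without a pandas build at hand.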

@simonjayhawkins
Member

cc @jbrockmendel

@simonjayhawkins simonjayhawkins added Internals (related to non-user accessible pandas implementation), Bug labels Aug 25, 2020
@jbrockmendel
Member

The line position[position["startdate"] == 0] goes through position.take, which calls self._mgr.take, which calls _consolidate_inplace, which fails to invalidate the _item_cache.

One fix is to restore the self._consolidate_inplace done in NDFrame.take that was removed in #34389. Another (my preferred long-term solution) is to remove the self._consolidate_inplace done in BlockManager.take.
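
The stale-cache mechanism described above can be illustrated with a toy model. This is a sketch in plain Python, not pandas code: `ToyFrame` and all of its methods are hypothetical stand-ins, with per-column lists playing the role of blocks and a dict playing the role of `_item_cache`.

```python
class ToyFrame:
    """Toy stand-in for a frame that keeps one list ("block") per column."""

    def __init__(self, columns):
        self._blocks = {name: list(values) for name, values in columns.items()}
        self._item_cache = {}

    def column(self, name):
        # Cache a reference to the column's current storage on first access.
        if name not in self._item_cache:
            self._item_cache[name] = self._blocks[name]
        return self._item_cache[name]

    def consolidate(self, clear_cache):
        # Rebuild the storage as fresh copies, as a consolidation step might.
        self._blocks = {name: list(values) for name, values in self._blocks.items()}
        if clear_cache:
            self._item_cache.clear()

    def set_at(self, row, name, value):
        # Scalar writes go through the (possibly stale) cached column.
        self.column(name)[row] = value

    def get_at(self, row, name):
        return self.column(name)[row]

    def rows(self):
        # Printing reads the real storage, not the cache.
        return [dict(zip(self._blocks, values)) for values in zip(*self._blocks.values())]


def demo(clear_cache):
    frame = ToyFrame({"code": ["a"], "startdate": [0]})
    frame.column("code")            # fills the cache, like position["code"] == "A"
    frame.consolidate(clear_cache)  # like the take behind the boolean indexing
    frame.set_at(0, "code", "A")
    return frame.get_at(0, "code"), frame.rows()


print(demo(clear_cache=False))  # ('A', [{'code': 'a', 'startdate': 0}])
print(demo(clear_cache=True))   # ('A', [{'code': 'A', 'startdate': 0}])
```

With the cache left in place, the scalar getter and setter both talk to the old storage, so the write appears to succeed while the data that gets printed never changes. Clearing the cache whenever the storage is rebuilt keeps the two in sync.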

@jorisvandenbossche jorisvandenbossche changed the title from "pandas version 1.1.0 and 1.1.1 has a issue" to "REGR: invalid cache after take operation with non-consolidated dataframe" Sep 4, 2020

6 participants