REGR: invalid cache after take operation with non-consolidated dataframe #35521

on55 · 2020-08-03T03:05:29Z

A program ran fine with pandas 1.0.5, but fails after upgrading to 1.1.0.
I found that sometimes, when I change the value of one cell of a dataframe, print(df) does not show the change,
but a comparison with == says the value has changed.
For example: the original cell value is A, then I change the cell value to B.
print(df) still shows A,
but df.at[x, x] == B returns True.

MarcoGorelli's edit: here's a reproducible example:

import pandas as pd


position = pd.DataFrame(columns=["code", "startdate"])
position = position.append([{"code": "a", "startdate": 0}])

# These two lines should not change anything.
# BUT, commenting either of them out makes this code run as intended
position["code"] == "A"
position[position["startdate"] == 0]

position.at[0, "code"] = "A"

print(position.at[0, "code"])
print(position)

output:

A
  code startdate
0    a         0

expected output:

A
  code startdate
0    A         0


jreback · 2020-08-03T03:15:01Z

pls show an example that reproduces the issue

MarcoGorelli · 2020-08-12T11:38:05Z

Closing as I can't reproduce this:

>>> import pandas as pd
>>> df = pd.DataFrame({'a': ['A']})
>>> df
   a
0  A
>>> df.loc[0, 'a'] = 'B'
>>> df.at[0, 'a'] == 'B'
True
>>> print(df)
   a
0  B

Please ping if you can provide a bug report as is suggested here https://matthewrocklin.com/blog/work/2018/02/28/minimal-bug-reports

MarcoGorelli · 2020-08-14T21:08:07Z

@simonjayhawkins noticed you'd put the 1.1.1 milestone - were you able to reproduce this?

on55 · 2020-08-15T04:24:31Z

MarcoGorelli:
Nice to meet you! Sorry, I seldom come here and have only just seen your comments.

The details are:
The program reads a lot of data and runs calculations.
After it calculates and outputs a big dataframe, it changes several values in the leftmost column in a loop.
The statement is position.at[index, 'code'] = code.
Only one cell's value cannot be changed.
When I print(df), it shows the original value, yet df.at[x, y] == new value returns True
and df.at[x, y] == original value returns False.
Since this statement runs in a loop, the other cells' values are changed correctly; only one cell hits this problem.

I have now downgraded to 1.0.5 and my program runs fine again.
My program runs fine with many pandas versions; it only has this issue with pandas 1.1.0,
so I suppose there may be memory corruption in that version.

Thank you, and thanks to pandas.

on55 · 2020-08-15T04:46:03Z

Supplement:
The dataframe is fairly big and is generated automatically by the program; it is not a simple 1x1 dataframe.
The program changes several cell values in the leftmost column in a loop, and only one cell hits this issue; the other cells are changed correctly.

The printed cell value shows the original value,
but comparing it with == against the new value returns True.
So it looks as if the value has changed and only the printing is wrong.

However, the rest of the program then continues and reports an error,
because it reads the dataframe and finds that the cell still holds the original value; it has not changed.
The rest of the program cannot get the correct code value, so it reports an error.

simonjayhawkins · 2020-08-15T10:50:41Z

> @simonjayhawkins noticed you'd put the 1.1.1 milestone - were you able to reproduce this?

not tried to reproduce. using the 1.1.1 milestone as a tracker (for regressions and related PRs). 1.1.1 will be released next week and any items on the milestone still open at that time will be moved to 1.1.2 (or can review). the blocker tag can be used, but is less relevant for a patch release as 1.1.2 could follow in 2-3 weeks.

for the minor releases (i.e. 1.2) it is common to only apply the milestone when the PR is close to being ready (and then also on the corresponding issue)

when triaging issues i tend to either tag as 1.1.1 (for regressions from 1.0.5) or contributions welcome (bugs and regressions from earlier versions)

MarcoGorelli · 2020-08-15T12:26:45Z

> any items on the milestone still open at that time will be moved to 1.1.2 (or can review).

Thanks for explaining - OK, have reopened so it can be reviewed, though there's not much to go off of here

jreback · 2020-08-15T12:43:02Z

this needs an actual reproducible example

simonjayhawkins · 2020-08-17T19:20:50Z

> this needs an actual reproducible example

@on55 if you look at other bug reports, you'll see most use the template (available when opening a new issue) and ideally follow the guidance given in https://matthewrocklin.com/blog/work/2018/02/28/minimal-bug-reports

simonjayhawkins · 2020-08-17T19:41:11Z

moved off 1.1.1 milestone (scheduled for this week) as no PRs to fix in the pipeline

on55 · 2020-08-21T07:28:43Z

Today I upgraded pandas to 1.1.1, and this issue still exists.

Traceback (most recent call last):
IndexError: list index out of range

My program works like this:
First it creates an empty dataframe, 'position'.
Then it loops: for cycle in xxx: (reading the database in a loop)
A condition check follows: if elif elif elif elif elif elif elif elif elif ...
Once one elif branch matches, it adds/changes/deletes some values of 'position'.

The error (I used PyCharm debug mode to check it):
When one of the 'elif' branches matches,
print(position) still shows the original value, while position.code[4] shows the changed value.
But the rest of the program sees the value as still the original,
the same as in my earlier error description for version 1.1.0.

Earlier iterations hit this 'elif' branch several times, and each time it ran fine:
it correctly added/changed the cell value.
In PyCharm debug mode, both print(position) and position.code[x] showed the changed value.

This time the value of position.code[4] seems to be changed, but in essence it is not.
In the next several iterations, the program reads the 'position' dataframe but cannot find the changed code value in it,
so it reports the error 'IndexError: list index out of range'.

Now I have downgraded pandas to 1.0.5 again and everything is fine.

MarcoGorelli · 2020-08-22T07:27:32Z

@on55 could you please add a reproducible example as suggested above?

on55 · 2020-08-22T13:27:50Z

@MarcoGorelli
I found this is an odd issue; I am not sure whether it is a pandas issue or a PyCharm issue.

When I run it in PyCharm, it gives one result (a wrong answer),
while when I use PyCharm debug mode and 'step over' the statement, it gives another result (the answer I wanted).

debug_123.xlsx
The data is attached as an Excel file.

import pandas as pd


def test():
    df = pd.read_excel('excel address')  # path to the attached spreadsheet

    # Normalize the dates to 'YYYYMMDD' strings.
    rawdatelist = [str(x)[:10] for x in df.date]
    datelist = [x.replace('-', '') for x in rawdatelist]
    df['date'] = datelist

    position = pd.DataFrame(columns=['code', 'vol', 'startdate'])
    # -----------------------------
    for cycle in range(len(datelist)):
        type = df.type[cycle]
        date = df.date[cycle]
        code = df.code[cycle]
        vol = df.vol.to_numpy()[cycle]

        print(date)

        if type == 'p':
            # Open a new position with no start date yet.
            position = position.append([{'code': code, 'vol': vol, 'startdate': 0}], ignore_index=True)
        elif type == 'new':
            if len(position[position['code'] == code].index.tolist()) > 0:
                # The code already exists: only set its start date.
                newindex = position[position['code'] == code].index.tolist()[0]
                position.at[newindex, 'startdate'] = date
            else:
                # Otherwise reuse an unstarted row with the same volume.
                newindex = position[(position.startdate == 0) & (position.vol == vol)].index.tolist()[0]
                position.at[newindex, 'code'] = code
                position.at[newindex, 'startdate'] = date

        print(position)
    print(position)

I want code 'a' to change to 'A'. In run mode it is still 'a'; in debug mode it is changed to 'A'.

One result (run mode) shows:
20200304
  code vol startdate
0    a  10         0
20200304
  code vol startdate
0    a  10         0
1    b  10         0
20200319
  code vol startdate
0    a  10  20200319
1    b  10         0
  code vol startdate
0    a  10  20200319
1    b  10         0

The other result (debug mode) shows:
20200304
  code vol startdate
0    a  10         0
20200304
  code vol startdate
0    a  10         0
1    b  10         0
20200319
  code vol startdate
0    A  10  20200319
1    b  10         0
  code vol startdate
0    A  10  20200319
1    b  10         0

MarcoGorelli · 2020-08-22T13:36:26Z

> i reproduce it, cost my much time, so sad...

With all due respect, this is a community-based project, it costs everyone time.

Please see the link posted above for how to post a reproducible example; there shouldn't be any need to attach external data.

on55 · 2020-08-22T13:47:42Z

Got it. This is the first time I have submitted an issue here. I attached an Excel file with less data just so that you could run it quickly.
Thank you for your patient explanation, and thanks to pandas.

MarcoGorelli · 2020-08-22T14:34:34Z

@on55 OK I see what you mean now:

(Pdb) position.at[newindex, 'code']
'A'
(Pdb) position
  code vol startdate
0    a  10         0
1    b  10         0

I'll try putting together a short example now

MarcoGorelli · 2020-08-22T14:57:26Z

@on55 I've removed the irrelevant parts from your report, this reproduces the issue:

import pandas as pd


position = pd.DataFrame(columns=["code", "startdate"])
position = position.append([{"code": "a", "startdate": 0}])

# These two lines should not change anything.
# BUT, commenting either of them out makes this code run as intended
position["code"] == "A"
position[position["startdate"] == 0]

position.at[0, "code"] = "A"

print(position.at[0, "code"])
print(position)

output:

A
  code startdate
0    a         0

expected output:

A
  code startdate
0    A         0
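
For anyone checking their own frames for this symptom, here is a minimal sketch of a consistency check. It compares what .at returns for each cell with what DataFrame.to_numpy() returns. The helper name find_stale_cells is made up for illustration, and it assumes unique index and column labels. That the two access paths disagree on an affected version is an assumption based on the .at versus print(position) mismatch shown above; on a healthy pandas it should simply return an empty list.

```python
import pandas as pd


def find_stale_cells(df):
    """Return (row, column, value_via_at, value_via_array) for cells that disagree.

    Hypothetical helper; assumes unique index and column labels.
    """
    values = df.to_numpy()  # one array built from the frame's current data
    mismatches = []
    for i, row in enumerate(df.index):
        for j, col in enumerate(df.columns):
            via_at = df.at[row, col]
            via_array = values[i, j]
            # Treat two missing values as equal (NaN != NaN otherwise).
            if pd.isna(via_at) and pd.isna(via_array):
                continue
            if via_at != via_array:
                mismatches.append((row, col, via_at, via_array))
    return mismatches
```

Calling find_stale_cells(position) right after the position.at[0, "code"] = "A" line would be the natural place to use it.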

on55 · 2020-08-22T15:38:35Z

Yes, that is it. Your summary really captures the point.

on55 · 2020-08-22T16:03:01Z

But going back to my initial question: this statement is executed about 20 times in the loop (PyCharm run mode).
It reported the issue only on (I assume) the 15th iteration; the earlier executions of 'position.at[0, "code"] = "A"' ran fine.
Why did the statement run fine the previous 14 times and fail that time? So strange.

asishm · 2020-08-24T16:42:10Z

I ran a git bisect on this; it points to:

#34389

760ba37a776947694ec816eaf29ce8937e08f544 is the first bad commit
commit 760ba37a776947694ec816eaf29ce8937e08f544
Author: jbrockmendel <jbrockmendel@gmail.com>
Date:   Tue May 26 15:47:18 2020 -0700

    CLN: _consolidate_inplace less (#34389)

 pandas/core/generic.py            |  9 ---------
 pandas/core/internals/managers.py | 10 ----------
 2 files changed, 19 deletions(-)
bisect run success
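
The thread does not show the script that was used for the bisect, so the following is only a sketch of how such a check could look. It runs the reproducer from above in a child process and maps the outcome to the exit codes that git bisect run understands (0 for good, 1 for bad, 125 for skip). The names REPRO, classify and main are made up for illustration, and the sketch assumes pandas is rebuilt at each bisect step.

```python
import subprocess
import sys

# The reproducer from this thread, run in a child process so that every
# bisect step imports the pandas build of the commit being tested.
# DataFrame.append exists in the pandas versions being bisected here.
REPRO = """
import pandas as pd
position = pd.DataFrame(columns=["code", "startdate"])
position = position.append([{"code": "a", "startdate": 0}])
position["code"] == "A"
position[position["startdate"] == 0]
position.at[0, "code"] = "A"
print("ok" if "A" in position.to_string() else "stale")
"""


def classify(returncode, stdout):
    """Translate one reproducer run into a `git bisect run` exit code."""
    if returncode != 0:
        return 125  # reproducer could not run (e.g. broken build): skip commit
    return 0 if stdout.strip() == "ok" else 1  # 0 = good commit, 1 = bad commit


def main():
    result = subprocess.run(
        [sys.executable, "-c", REPRO], capture_output=True, text=True
    )
    return classify(result.returncode, result.stdout)
```

To use it, one would save it under some name (say check_regression.py), end the file with sys.exit(main()), and start it with git bisect run python check_regression.py after marking a good and a bad commit.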

simonjayhawkins · 2020-08-25T10:10:33Z

cc @jbrockmendel

jbrockmendel · 2020-08-25T16:49:44Z

The line position[position["startdate"] == 0] goes through position.take, which calls self._mgr.take, which calls _consolidate_inplace, which fails to invalidate the _item_cache.

One fix is to restore the self._consolidate_inplace done in NDFrame.take that was removed in #34389. Another (my preferred long-term solution) is to remove the self._consolidate_inplace done in BlockManager.take.
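
To make the mechanism described above easier to picture, here is a toy model in plain Python. It is not pandas code: ToyFrame, its methods and the invalidate_cache flag are invented for illustration. It only mimics the sequence named above: a column lookup fills a cache, consolidation swaps in fresh storage, and a later scalar write goes through the cache.

```python
class ToyFrame:
    """Toy stand-in for a frame with per-column storage and a column cache."""

    def __init__(self, columns):
        # One list per column, standing in for a non-consolidated layout.
        self.blocks = {name: list(values) for name, values in columns.items()}
        self.item_cache = {}

    def column(self, name):
        # The first lookup caches a reference to the column's current storage.
        if name not in self.item_cache:
            self.item_cache[name] = self.blocks[name]
        return self.item_cache[name]

    def consolidate(self, invalidate_cache):
        # Consolidation copies the data into fresh storage.
        self.blocks = {name: list(values) for name, values in self.blocks.items()}
        if invalidate_cache:
            self.item_cache.clear()

    def set_value(self, row, name, value):
        # Scalar writes go through the (possibly stale) cached column.
        self.column(name)[row] = value

    def get_value(self, row, name):
        return self.column(name)[row]

    def render(self):
        # Printing reads the current storage, not the cache.
        return {name: list(values) for name, values in self.blocks.items()}


def demo(invalidate_cache):
    frame = ToyFrame({"code": ["a"], "startdate": [0]})
    frame.column("code")                 # like position["code"] == "A"
    frame.consolidate(invalidate_cache)  # like the take behind position[mask]
    frame.set_value(0, "code", "A")      # like position.at[0, "code"] = "A"
    return frame.get_value(0, "code"), frame.render()["code"][0]
```

In this model, demo(False) returns ("A", "a"), the same split between .at and print seen in the report, while demo(True) returns ("A", "A"). It would also be consistent with the observation that removing either of the two "harmless" lines hides the problem: without the lookup there is no cached column to go stale, and without the consolidation the cache still points at live storage.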

simonjayhawkins added the Needs Info label Aug 3, 2020

simonjayhawkins added this to the 1.1.1 milestone Aug 3, 2020

MarcoGorelli closed this as completed Aug 12, 2020

MarcoGorelli modified the milestones: 1.1.1, No action Aug 12, 2020

MarcoGorelli modified the milestones: No action, 1.1.1 Aug 15, 2020

MarcoGorelli reopened this Aug 15, 2020

simonjayhawkins modified the milestones: 1.1.1, 1.1.2 Aug 17, 2020

on55 changed the title from "pandas version 1.1.0 has a issue" to "pandas version 1.1.0 and 1.1.1 has a issue" Aug 21, 2020

on55 closed this as completed Aug 21, 2020

on55 reopened this Aug 22, 2020

MarcoGorelli added the Regression label and removed the Needs Info label Aug 22, 2020

simonjayhawkins added the Internals and Bug labels Aug 25, 2020

jorisvandenbossche mentioned this issue Sep 4, 2020

REGR: revert "CLN: _consolidate_inplace less" / fix regression in fillna() #34407

Merged

jorisvandenbossche changed the title from "pandas version 1.1.0 and 1.1.1 has a issue" to "REGR: invalid cache after take operation with non-consolidated dataframe" Sep 4, 2020

jorisvandenbossche mentioned this issue Sep 4, 2020

REGR: fix consolidation/cache issue with take operation #36114

Merged

jreback closed this as completed in #36114 Sep 4, 2020
