Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: maximum recursion error when replacing empty lists #22083

Merged
merged 22 commits into from
Aug 7, 2018

Conversation

minggli
Copy link
Contributor

@minggli minggli commented Jul 27, 2018

Internal mechanism in replace method is intertwined and can cause RecursionError (not very helpful). Discussed in #21977, there doesn't appear to be first class support for ExtensionArray yet and this issue is right now treated as error handling, having investigated treating it as new use case but not easily co-exist without large change.

The RecursionError is caused by Block.replace and ObjectBlock.replace calling each other under certain circumstances such as in the linked issues.

New approach is now to prevent re-casting of blocks to object blocks if it is already an ObjectBlock and instead raise the error caught as it is, because if it's an ObjectBlock already then ValueError and TypeError raised won't go away when re-casted to object and re-try, resulting in infinite loop and RecursionError.

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you add the example as a test

@@ -1000,8 +999,8 @@ def putmask(self, mask, new, align=True, inplace=False, axis=0,
if not (mask.shape[-1] == len(new) or
mask[mask].shape[-1] == len(new) or
len(new) == 1):
raise ValueError("cannot assign mismatch "
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does this need to change? (or is this what is actually causing the issue); this is not exposed to the user, correct?

Copy link
Contributor Author

@minggli minggli Jul 28, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes this ValueError caused Block.replace and ObjectBlock.replace to form infinite loop and ultimately raise RecursionError. It's because Block.replace catches exception include ValueError and if it fails, delegate replace to ObjectBlock. But ObjectBlock.replace also uses super(ObjectBlock, self), hence the loop. The exception handling includes TypeError and ValueError and they are both depended on (as appear to me) by other casting cases.

Open to suggestion, not sure a better way to handle, at the moment this change is needed to address the issue in question.

This can be exposed to user when:

import pandas as pd

s = pd.Series(range(10))
mask = s > 5

s[mask] = [5, 4, 3, 2, 1]

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do you know why we are catching ValueError at a higher level, maybe can get away with only catching TypeError and let ValueError buble up?

Copy link
Contributor Author

@minggli minggli Aug 2, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point and it appears below test (only this test) fails when we only catch TypeError:

_________________ TestDataFrameReplace.test_replace_datetimetz _________________
# coerce to object
result = df.copy()
result.iloc[1, 0] = np.nan
result = result.replace(
{'A': pd.NaT}, Timestamp('20130104', tz='US/Pacific'))

pandas/tests/frame/test_replace.py:1060:


    elif isinstance(other, (np.datetime64, datetime, date)):
        other = tslibs.Timestamp(other)
        tz = getattr(other, 'tz', None)

        # test we can have an equal time zone
        if tz is None or str(tz) != str(self.values.tz):
         raise ValueError("incompatible or non tz-aware value")

E ValueError: incompatible or non tz-aware value

pandas/core/internals/blocks.py:2898: ValueError

It fails because of above ValueError which is passed on to ObjectBlock.replace as part of error handling in Block.replace. so it appears it is expected to be handled after the block is casted as object and try again.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jreback made some changes now we don't touch above error but instead make sure we don't enter infinite loop.

@jreback jreback added Bug Reshaping Concat, Merge/Join, Stack/Unstack, Explode labels Jul 28, 2018
@codecov
Copy link

codecov bot commented Jul 28, 2018

Codecov Report

Merging #22083 into master will increase coverage by <.01%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #22083      +/-   ##
==========================================
+ Coverage   92.06%   92.06%   +<.01%     
==========================================
  Files         169      169              
  Lines       50694    50698       +4     
==========================================
+ Hits        46671    46675       +4     
  Misses       4023     4023
Flag Coverage Δ
#multiple 90.47% <100%> (ø) ⬆️
#single 42.32% <25%> (-0.01%) ⬇️
Impacted Files Coverage Δ
pandas/core/missing.py 91.7% <100%> (+0.04%) ⬆️
pandas/core/internals/blocks.py 94.45% <100%> (ø) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 2156431...2bbd097. Read the comment docs.

@minggli
Copy link
Contributor Author

minggli commented Jul 28, 2018

added test cases too from two issues linked. 🚀

@jreback gentle nudge.

@jreback jreback added this to the 0.24.0 milestone Aug 1, 2018
# GH 19266 and GH 21977
# ValueError triggers try except block in Block.replace
# causing RecursionError
raise Exception("cannot assign mismatch length "
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you make this a bit more informative from a user POV?

Copy link
Contributor Author

@minggli minggli Aug 2, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sure. this change here is exposed to user so when one tries to assign an array of n-dimension to mask of m-dimension, when m != n, this error throws.

>>> import pandas as pd
>>> s = pd.Series(range(10))
>>> mask = s > 5

>>> s[mask].shape
(4,)

>>> s[mask] = [5, 4, 3, 2, 1]

  File "/Users/mingli/.virtualenvs/cust-value/lib/python3.5/site-packages/pandas/core/internals.py", line 1006, in putmask
    raise ValueError("cannot assign mismatch "
ValueError: cannot assign mismatch length to masked array

This is also used internally by replace and ValueError it raises is causing infinite loop as per linked issues. Whilst changing that to not ValueError will solve it, it also makes it less clear and potentially breaks users' codes if they expect to catch ValueError running above code snippet.

Do you think it's ok to use Exception here as it would solve the issues linked but users might have to change their code?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is reverted.

@minggli minggli closed this Aug 2, 2018
@minggli minggli reopened this Aug 2, 2018
@minggli minggli closed this Aug 5, 2018
@minggli minggli reopened this Aug 5, 2018
@pep8speaks
Copy link

pep8speaks commented Aug 6, 2018

Hello @minggli! Thanks for updating the PR.

Cheers ! There are no PEP8 issues in this Pull Request. 🍻

Comment last updated on August 07, 2018 at 13:23 Hours UTC

@minggli
Copy link
Contributor Author

minggli commented Aug 6, 2018

@jreback @WillAyd I think PR is done, comments welcome.

@@ -603,6 +603,20 @@ def test_replace_list(self):

assert_frame_equal(res, expec)

def test_replace_with_empty_list(self):
# GH 21977
s = pd.Series([['a', 'b'], np.nan, [1]])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did you mean to include an empty list in this test case (exists in the series module one)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good point!

@@ -667,7 +667,8 @@ Reshaping
- Bug in :meth:`DataFrame.replace` raises RecursionError when converting OutOfBounds ``datetime64[ns, tz]`` (:issue:`20380`)
- :func:`pandas.core.groupby.GroupBy.rank` now raises a ``ValueError`` when an invalid value is passed for argument ``na_option`` (:issue:`22124`)
- Bug in :func:`get_dummies` with Unicode attributes in Python 2 (:issue:`22084`)
-
- Bug in :meth:`DataFrame.replace` raises RecursionError when replace empty lists (:issue:`22083`)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very minor nits but put RecursionError in double backticks and change replace to replacing

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks! changed.

Copy link
Member

@WillAyd WillAyd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

minor comments, ping on green.

convert=convert)
# GH 22083, TypeError or ValueError occurred within error handling
# causes infinite loop. Cast and retry only if not objectblock.
if self.dtype == 'object':
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you use is_object_dtype

# causes infinite loop. Cast and retry only if not objectblock.
if self.dtype == 'object':
raise
else:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

don't need this else here

@jreback jreback merged commit 96b4873 into pandas-dev:master Aug 7, 2018
@jreback
Copy link
Contributor

jreback commented Aug 7, 2018

thanks @minggli

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug Reshaping Concat, Merge/Join, Stack/Unstack, Explode
Projects
None yet
4 participants