
BUG: Fix group index calculation to prevent hitting maximum recursion depth (#21524) #21541

Merged
merged 8 commits into from Jun 21, 2018

Conversation

@Templarrr (Contributor) commented Jun 19, 2018

This just replaces a tail-recursive call with a simple loop. It should have no effect whatsoever on performance, but it prevents hitting recursion limits on some input data (see the example in my issue here: #21524).
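
For readers following along, a minimal, self-contained sketch of the rewrite pattern, with made-up helper names (not the pandas source; the actual change is in the group-index calculation in pandas/core/sorting.py):

def collapse_recursive(chunks):
    # One stack frame per remaining chunk, so a very wide input can exceed
    # Python's default recursion limit (~1000 frames).
    if len(chunks) == 1:
        return chunks[0]
    merged = chunks[0] + chunks[1]
    return collapse_recursive([merged] + chunks[2:])

def collapse_iterative(chunks):
    # Same computation expressed as a loop: constant stack depth, no limit to hit.
    chunks = list(chunks)
    while len(chunks) > 1:
        chunks = [chunks[0] + chunks[1]] + chunks[2:]
    return chunks[0]

wide = [[i] for i in range(5000)]
print(collapse_iterative(wide)[:3])   # [0, 1, 2] -- works fine
# collapse_recursive(wide) would raise RecursionError under default settings.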

@Templarrr (Contributor, Author) commented Jun 19, 2018

The 3.5 failure in Travis looks like it was caused by a network glitch :( It needs to be rerun.

3.5 is not installed; attempting download
Downloading archive: https://s3.amazonaws.com/travis-python-archives/binaries/ubuntu/14.04/x86_64/python-3.5.tar.bz2
$ curl -sSf -o python-3.5.tar.bz2 ${archive_url}
curl: (56) SSL read: error:00000000:lib(0):func(0):reason(0), errno 104
Unable to download 3.5 archive. The archive may not exist. Please consider a different version.

@@ -52,7 +52,17 @@ def _int64_cut_off(shape):
return i
return len(shape)

def loop(labels, shape):
def maybe_lift(lab, size): # pormote nan values
Contributor:
Typo here: "pormote" → "promote".

Can you add a 1-line comment on what this is doing?

Contributor Author:
Same as with loop: I'll try, though I'm not the author of the original code. I've just changed it from recursion to a loop, so I can't be sure I understand 100% of the nuances here...

@@ -60,6 +60,7 @@ Bug Fixes

- Bug in :meth:`Index.get_indexer_non_unique` with categorical key (:issue:`21448`)
- Bug in comparison operations for :class:`MultiIndex` where error was raised on equality / inequality comparison involving a MultiIndex with ``nlevels == 1`` (:issue:`21149`)
- Bug in calculation of group index causing "maximum recursion depth exceeded" errors during ``DataFrame.duplicated`` calls (:issue:`21524`).
Contributor:
just say:

bug in :func:`DataFrame.duplicated` with a large number of columns causing a 'maximum .....'

Contributor Author:
Thanks, I'll update this


labels = list(labels)
shape = list(shape)

Contributor:
Can you comment on the purpose of the loop?

Contributor Author:
I'll try, though I'm not the author of the original code. I've just changed it from recursion to a loop, so I can't be sure I understand 100% of the nuances here...

labels, shape = map(list, zip(*map(maybe_lift, labels, shape)))

return loop(list(labels), list(shape))
return out
Contributor:
Does out need a definition outside of the loop? E.g., is it always defined?

Contributor Author:
It is always defined here: out is assigned before the exit from the loop can happen.
And if something (though I don't know what, in this case) throws an exception, we bypass the return altogether.
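
A hedged, self-contained sketch of the control flow being described (not the pandas source): the variable is bound on every pass through the loop, and the only normal exit is the return that follows that assignment.

def accumulate(values):
    remaining = list(values)
    while True:
        out = sum(remaining)        # assigned on every iteration, before any exit
        if len(remaining) <= 1:
            return out              # sole normal exit; 'out' is always defined here
        remaining = [out]           # shrink the problem and go around again

print(accumulate([1, 2, 3, 4]))     # 10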

@jreback jreback added the Reshaping Concat, Merge/Join, Stack/Unstack, Explode label Jun 19, 2018
@jreback (Contributor) commented Jun 19, 2018

can you add your example as a test

@pep8speaks commented Jun 19, 2018

Hello @Templarrr! Thanks for updating the PR.

Cheers ! There are no PEP8 issues in this Pull Request. 🍻

Comment last updated on June 20, 2018 at 10:25 Hours UTC

@codecov (bot) commented Jun 19, 2018

Codecov Report

Merging #21541 into master will increase coverage by <.01%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #21541      +/-   ##
==========================================
+ Coverage   91.91%   91.92%   +<.01%     
==========================================
  Files         153      153              
  Lines       49546    49564      +18     
==========================================
+ Hits        45542    45560      +18     
  Misses       4004     4004
Flag        Coverage Δ
#multiple   90.32% <100%> (ø) ⬆️
#single     41.8% <0%> (ø) ⬆️

Impacted Files                      Coverage Δ
pandas/core/sorting.py              98.2% <100%> (ø) ⬆️
pandas/core/arrays/categorical.py   95.63% <0%> (-0.06%) ⬇️
pandas/core/generic.py              96.12% <0%> (-0.01%) ⬇️
pandas/core/indexes/base.py         96.62% <0%> (ø) ⬆️
pandas/core/series.py               94.19% <0%> (+0.01%) ⬆️
pandas/core/indexes/category.py     97.28% <0%> (+0.18%) ⬆️
pandas/core/dtypes/cast.py          88.49% <0%> (+0.26%) ⬆️

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 2625759...eebb8cf. Read the comment docs.

@Templarrr (Contributor, Author) commented Jun 19, 2018

@jreback I've addressed all your comments as best I can.
Changelog updated; the snippet used in the issue was added as a test case (with minor changes to accommodate different py3 recursion settings).
Comments were added to the loop and the internal method, describing what is going on there as best as I understand it.

@Templarrr (Contributor, Author):
Also, just to double-check that I've updated the correct changelog: if this is merged, it will be in 0.23.2, right?

@Templarrr (Contributor, Author):
Again :(

3.5 is not installed; attempting download
Downloading archive: https://s3.amazonaws.com/travis-python-archives/binaries/ubuntu/14.04/x86_64/python-3.5.tar.bz2
$ curl -sSf -o python-3.5.tar.bz2 ${archive_url}
curl: (56) SSL read: error:00000000:lib(0):func(0):reason(0), errno 104
Unable to download 3.5 archive. The archive may not exist. Please consider a different version.

@jreback (Contributor) commented Jun 19, 2018

I restarted that job. Ping on green.

@jreback jreback added the Bug label Jun 19, 2018
@jreback jreback added this to the 0.23.2 milestone Jun 19, 2018
@@ -1527,6 +1527,22 @@ def test_duplicated_with_misspelled_column_name(self, subset):
with pytest.raises(KeyError):
df.drop_duplicates(subset)

def test_duplicated_do_not_fail_on_wide_dataframes(self):
Member:

How long does this test take to run? May be worth a slow tag

Contributor Author:

Given that the original test case needed to hit the recursion limit... it can't be super fast.
It takes a second or two on my laptop.
I'll add the slow tag.
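
For reference, the slow tag in the pandas test suite is just a pytest marker; a small sketch with a hypothetical test name:

import pytest

@pytest.mark.slow            # pandas' test runner skips these when run with --skip-slow
def test_something_expensive():
    assert sum(range(10**6)) == 499999500000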

def maybe_lift(lab, size):
# promote nan values (assigned to -1 here)
# so that all values are non-negative
return (lab + 1, size + 1) if (lab == -1).any() else (lab, size)
Member:

Does this not obfuscate NA and 0 values?

Contributor Author (@Templarrr, Jun 20, 2018):

This works with labels, not the original values. The labels get here via the df.duplicated -> core.algorithms.factorize -> _factorize_array call chain, and that method replaces NA with -1 (na_sentinel : int, default -1) and assigns all other values >= 0.
Also, I did not change this in any way; it's existing code that is already in master :)
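
To illustrate the sentinel behaviour described above, using the public pd.factorize wrapper plus a copy of the lifting expression (illustrative only, not the internal code path):

import numpy as np
import pandas as pd

# factorize encodes NA as the sentinel -1 and everything else as labels >= 0
codes, uniques = pd.factorize(np.array([3.0, np.nan, 3.0, 7.0]))
print(codes)     # [ 0 -1  0  1]
print(uniques)   # [3. 7.]

# the "lift" then shifts such a block so every label is non-negative,
# widening the level size by one to make room for the former NA slot
labels, size = codes, len(uniques)
if (labels == -1).any():
    labels, size = labels + 1, size + 1
print(labels, size)   # [1 0 1 2] 3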

@Templarrr (Contributor, Author) commented Jun 20, 2018

@WillAyd I've addressed your review comments

def test_duplicated_do_not_fail_on_wide_dataframes(self):
# Given the wide dataframe with a lot of columns
# with different (important!) values
data = {}
Contributor:

make this a dict comprehension

for i in range(100):
data['col_{0:02d}'.format(i)] = np.random.randint(0, 1000, 30000)
df = pd.DataFrame(data).T
# When we request to calculate duplicates
Contributor:

don't need this comment here.
call this result=
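
A hedged sketch of what the test body might look like after both of these comments are applied; the merged version may differ in name and assertions:

import numpy as np
import pandas as pd

def test_duplicated_do_not_fail_on_wide_dataframes():
    # gh-21524: wide frames with many distinct columns used to exceed the recursion limit
    data = {'col_{0:02d}'.format(i): np.random.randint(0, 1000, 30000)
            for i in range(100)}
    df = pd.DataFrame(data).T
    result = df.duplicated()
    assert isinstance(result, pd.Series)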

@Templarrr (Contributor, Author):
@jreback your comments are addressed as well

@jreback (Contributor) commented Jun 20, 2018

thanks @Templarrr

@WillAyd lgtm. pls approve & merge when satisfied

@WillAyd WillAyd merged commit f91a704 into pandas-dev:master Jun 21, 2018
@WillAyd (Member) commented Jun 21, 2018

Thanks @Templarrr !

@Templarrr (Contributor, Author):
Yay, I'm a pandas contributor :-D
Thanks for the reviews and comments!

jorisvandenbossche pushed a commit that referenced this pull request Jun 29, 2018
jorisvandenbossche pushed a commit that referenced this pull request Jul 2, 2018
Sup3rGeo pushed a commit to Sup3rGeo/pandas that referenced this pull request Oct 1, 2018
Labels
Bug; Reshaping (Concat, Merge/Join, Stack/Unstack, Explode)
Projects
None yet
5 participants