New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

"maximum recursion depth exceeded" when calculating duplicates in big DataFrame (regression comparing to the old version) #21524

Closed
Templarrr opened this Issue Jun 18, 2018 · 9 comments

Comments

Projects
None yet
4 participants
@Templarrr
Contributor

Templarrr commented Jun 18, 2018

Code Sample, a copy-pastable example if possible

I'm currently in the middle of upgrading old system from old pandas (0.12) to the new version (0.23.0). One of the parts of the system is duplicate columns detection in medium-sized DataFrames (~100 columns, few thousand rows). We were detecting it like this dupes = df.T.duplicated() and previously it worked but after the upgrade, it started failing. Simplest snippet to reproduce this locally

import numpy as np
import pandas as pd

data = {}
for i in range(70):
    data['col_{0:02d}'.format(i)] = np.random.randint(0, 1000, 20000)
df = pd.DataFrame(data)
dupes = df.T.duplicated()
print dupes

Problem description

To the contrast of the note below, this issue isn't resolved by upgrading to the newest pandas. On the contrary, it is caused by such upgrade :) Old version I've copied below from 0.12 works on a snippet above

def old_duplicated(self, cols=None, take_last=False):
    """
    Return boolean Series denoting duplicate rows, optionally only
    considering certain columns

    Parameters
    ----------
    cols : column label or sequence of labels, optional
        Only consider certain columns for identifying duplicates, by
        default use all of the columns
    take_last : boolean, default False
        Take the last observed row in a row. Defaults to the first row

    Returns
    -------
    duplicated : Series
    """

    # kludge for #1833
    def _m8_to_i8(x):
        if issubclass(x.dtype.type, np.datetime64):
            return x.view(np.int64)
        return x

    if cols is None:
        values = list(_m8_to_i8(self.values.T))
    else:
        if np.iterable(cols) and not isinstance(cols, basestring):
            if isinstance(cols, tuple):
                if cols in self.columns:
                    values = [self[cols]]
                else:
                    values = [_m8_to_i8(self[x].values) for x in cols]
            else:
                values = [_m8_to_i8(self[x].values) for x in cols]
        else:
            values = [self[cols]]

    keys = lib.fast_zip_fillna(values)
    duplicated = lib.duplicated(keys, take_last=take_last)
    return pd.Series(duplicated, index=self.index)

but the new one now fails with

Traceback (most recent call last):
  File "/home/modintsov/workspace/DataRobot/playground.py", line 56, in <module>
    dupes = df.T.duplicated()
  File "/home/modintsov/.virtualenvs/dev/local/lib/python2.7/site-packages/pandas/core/frame.py", line 4384, in duplicated
    ids = get_group_index(labels, shape, sort=False, xnull=False)
  File "/home/modintsov/.virtualenvs/dev/local/lib/python2.7/site-packages/pandas/core/sorting.py", line 95, in get_group_index
    return loop(list(labels), list(shape))
  File "/home/modintsov/.virtualenvs/dev/local/lib/python2.7/site-packages/pandas/core/sorting.py", line 86, in loop
    return loop(labels, shape)

... many-many lines of the same...

  File "/home/modintsov/.virtualenvs/dev/local/lib/python2.7/site-packages/pandas/core/sorting.py", line 60, in loop
    stride = np.prod(shape[1:nlev], dtype='i8')
  File "/home/modintsov/.virtualenvs/dev/local/lib/python2.7/site-packages/numpy/core/fromnumeric.py", line 2566, in prod
    out=out, **kwargs)
RuntimeError: maximum recursion depth exceeded

Which is obviously a regression.

[this should explain why the current behaviour is a problem and why the expected output is a better solution.]

Note: We receive a lot of issues on our GitHub tracker, so it is very possible that your issue has been posted before. Please check first before submitting so that we do not have to handle and close duplicates!

Note: Many problems can be resolved by simply upgrading pandas to the latest version. Before submitting, please check if that solution works for you. If possible, you may want to check if master addresses this issue, but that is not necessary.

For documentation-related issues, you can check the latest versions of the docs on master here:

https://pandas-docs.github.io/pandas-docs-travis/

If the issue has not been resolved there, go ahead and file it in the issue tracker.

Expected Output

I expect no exception and return of bool Series. Example above in old pandas output this

col_00    False
col_01    False
col_02    False
col_03    False
col_04    False
col_05    False
col_06    False
col_07    False
col_08    False
col_09    False
col_10    False
col_11    False
col_12    False
col_13    False
col_14    False
...
col_55    False
col_56    False
col_57    False
col_58    False
col_59    False
col_60    False
col_61    False
col_62    False
col_63    False
col_64    False
col_65    False
col_66    False
col_67    False
col_68    False
col_69    False
Length: 70, dtype: bool

Output of pd.show_versions()

[paste the output of pd.show_versions() here below this line]
pd.show_versions()
INSTALLED VERSIONS

commit: None
python: 2.7.12.final.0
python-bits: 64
OS: Linux
OS-release: 4.4.0-128-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: None.None
pandas: 0.23.0
pytest: 3.5.1
pip: 9.0.1
setuptools: 39.2.0
Cython: 0.21
numpy: 1.14.3
scipy: 1.1.0
pyarrow: None
xarray: None
IPython: 5.5.0
sphinx: 1.5.5
patsy: 0.2.1
dateutil: 2.7.3
pytz: 2015.7
blosc: None
bottleneck: None
tables: None
numexpr: 2.6.5
feather: None
matplotlib: None
openpyxl: None
xlrd: 0.9.2
xlwt: 0.7.5
xlsxwriter: None
lxml: None
bs4: 4.6.0
html5lib: None
sqlalchemy: 1.2.7
pymysql: None
psycopg2: 2.7.3.2.dr2 (dt dec pq3 ext lo64)
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

@WillAyd

This comment has been minimized.

Member

WillAyd commented Jun 18, 2018

I was unable to reproduce this on master - can you check there and see if that resolves your issue?

@WillAyd WillAyd added the Needs Info label Jun 18, 2018

@Liam3851

This comment has been minimized.

Contributor

Liam3851 commented Jun 18, 2018

@WillAyd Master works for me on Python 3.6 but I can reproduce issue under Python 2.7.15 (OP is using 2.7.12).

@WillAyd

This comment has been minimized.

Member

WillAyd commented Jun 18, 2018

Thanks @Liam3851 . Well in that case this is a recursive function call and I think the compatibility difference is that sys.getrecursionlimit() is at 3000 for Python3 and only 1000 for Python2. Tracing this call with this size DataFrame requires 2,223 recursive calls on my end, hence the failure.

Can you see if increasing that limit in Python2 resolves the issue?

@WillAyd WillAyd added 2/3 Compat and removed Needs Info labels Jun 18, 2018

@Liam3851

This comment has been minimized.

Contributor

Liam3851 commented Jun 18, 2018

@WillAyd OP's test case works on Python 2 with the system recursion limit increased to 3,000 as written.

If I increase the number of columns in OP's test case from 20,000 to 30,000, that re-breaks Python 2 with the maximum recursion depth (perhaps as expected).

However, at least on my box (Win7, 40 GB RAM free for process) increasing the number of columns from 20,000 to 30,000 causes Python 3 to crash completely (I think this was not expected, or at least I didn't expect it).

@Templarrr

This comment has been minimized.

Contributor

Templarrr commented Jun 18, 2018

Mostly I was wandering why it's written in this way? Tail recursion is a beautiful thing, but in a language without tail recursion optimization it can (imho, should...) be replaced with simple loop. That will be just as fast, consume less resources and will not depend on the recursion limit.

@Templarrr

This comment has been minimized.

Contributor

Templarrr commented Jun 18, 2018

Sorry for multiple edits of my previous comment, phone autocorrect decided he knows best what I'm trying to say :)

@WillAyd

This comment has been minimized.

Member

WillAyd commented Jun 18, 2018

I can't speak to the history of that code but if you have a way to optimize it to perform better, use less resources, avoid the recursion limit, etc... then PRs are always welcome!

@Templarrr

This comment has been minimized.

Contributor

Templarrr commented Jun 19, 2018

@WillAyd Well, I've never did any PR to pandas before (and only one one-liner to numpy...) so I've thought someone with more pandas-specific experience will be better here :) but I can certainly try

@Templarrr

This comment has been minimized.

Contributor

Templarrr commented Jun 19, 2018

So, I've opened a PR, not sure it's 100% follows best pandas practices, but I've tried

@jreback jreback added the Reshaping label Jun 19, 2018

@jreback jreback added this to the 0.23.2 milestone Jun 19, 2018

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment