DataFrame.describe() output is not deterministic #32528

ShaharNaveh · 2020-03-07T20:23:45Z

Running this:

import pandas as pd

df = pd.DataFrame({
    'categorical': pd.Categorical(['d', 'e', 'f']),
    'numeric': [1, 2, 3],
    'object': ['a', 'b', 'c']
})

df.describe(include="all")

Will sometimes have the output of:

       categorical  numeric object
count            3      3.0      3
unique           3      NaN      3
top              f      NaN      a
freq             1      NaN      1
mean           NaN      2.0    NaN
std            NaN      1.0    NaN
min            NaN      1.0    NaN
25%            NaN      1.5    NaN
50%            NaN      2.0    NaN
75%            NaN      2.5    NaN
max            NaN      3.0    NaN

And sometimes the output of:

       categorical  numeric object
count            3      3.0      3
unique           3      NaN      3
top              f      NaN      c
freq             1      NaN      1
mean           NaN      2.0    NaN
std            NaN      1.0    NaN
min            NaN      1.0    NaN
25%            NaN      1.5    NaN
50%            NaN      2.0    NaN
75%            NaN      2.5    NaN
max            NaN      3.0    NaN

(Only the top row differs, in the object column.)
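One way to check a "sometimes" report like this is to run the reproduction in several fresh interpreters and collect the distinct outputs, since some sources of variation (for example string hash randomization) are fixed within a single Python process. Whether that is the cause here is not established. The helper name distinct_outputs and the REPRO script below are hypothetical, written only for illustration:

```python
import subprocess
import sys

# Hypothetical reproduction script adapted from the report; it prints only
# the "top" row so that differences are easy to spot.
REPRO = """
import pandas as pd
df = pd.DataFrame({
    'categorical': pd.Categorical(['d', 'e', 'f']),
    'numeric': [1, 2, 3],
    'object': ['a', 'b', 'c'],
})
print(df.describe(include='all').loc['top'].tolist())
"""


def distinct_outputs(snippet, runs=10):
    """Run `snippet` in `runs` fresh interpreters; return the distinct stdout strings."""
    seen = set()
    for _ in range(runs):
        result = subprocess.run(
            [sys.executable, "-c", snippet],
            capture_output=True,
            text=True,
            check=True,  # raise if the child interpreter exits with an error
        )
        seen.add(result.stdout.strip())
    return seen
```

Calling distinct_outputs(REPRO, runs=20) returns a set with one element if the output is stable across processes, and more than one element if it is not.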

This makes the doctests fail. Once we identify the problem, we should remove the SKIP directive from the examples in the describe docstring: https://github.com/pandas-dev/pandas/blob/master/pandas/core/generic.py#L9649
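For readers unfamiliar with the directive, here is a minimal sketch of how "# doctest: +SKIP" behaves, using only the standard doctest module. The function top_label is hypothetical and is not the pandas docstring; it stands in for an example whose printed result may not match the documented one:

```python
import doctest


def top_label():
    """Hypothetical function whose documented output does not match.

    >>> top_label()  # doctest: +SKIP
    'a'
    >>> 1 + 1
    2
    """
    return "c"


# Parse the docstring into a DocTest and run it. The skipped example is
# never executed, so its mismatch ('c' vs 'a') is not reported as a failure.
parser = doctest.DocTestParser()
test = parser.get_doctest(
    top_label.__doc__, {"top_label": top_label}, "top_label", None, 0
)
runner = doctest.DocTestRunner(verbose=False)
results = runner.run(test)
print(results.failed)
```

Removing the directive from the first example would make the run report one failure, which is why the SKIP can only be dropped once the output is known to be stable.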


mroeschke · 2020-03-07T21:16:25Z

Could you post the output of pd.show_versions()? I am not getting the same behavior on:

INSTALLED VERSIONS
------------------
commit           : 2a2258d64400b0f535502d903c9ab05b7d696af2
python           : 3.7.3.final.0
python-bits      : 64
OS               : Darwin
OS-release       : 18.7.0
Version          : Darwin Kernel Version 18.7.0: Thu Jan 23 06:52:12 PST 2020; root:xnu-4903.278.25~1/RELEASE_X86_64
machine          : x86_64
processor        : i386
byteorder        : little
LC_ALL           : None
LANG             : en_US.UTF-8
LOCALE           : en_US.UTF-8

pandas           : 1.1.0.dev0+710.g2a2258d64

datapythonista · 2020-03-08T00:20:14Z

Looking at the code, this should also reproduce the problem:

import pandas

data = pandas.Series(list('abc'))
data.value_counts().index[0]

(it always returns a for me)
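A slightly extended sketch of the same check: it repeats the value_counts() call many times within one process and compares the result with the top entry that Series.describe() reports for an object Series. The variable names are illustrative, and an in-process loop like this may not surface variation that only appears between separate runs:

```python
import pandas as pd

data = pd.Series(list("abc"))

# Collect the first index label of value_counts() over repeated calls;
# a stable result leaves exactly one element in the set.
firsts = {data.value_counts().index[0] for _ in range(100)}

# describe() on an object Series reports a "top" entry; compare it with
# the labels collected above.
top = data.describe()["top"]
print(firsts, top)
```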

weikhor · 2022-02-17T15:07:31Z

@ShaharNaveh @datapythonista @mroeschke

Hi, this is the code I ran:

df = pd.DataFrame({
    'categorical': pd.Categorical(['d', 'e', 'f']),
    'numeric': [1, 2, 3],
    'object': ['a', 'b', 'c']
})

print(df.describe(include="all"))

I get the same, deterministic output across many runs:

       categorical  numeric object
count            3      3.0      3
unique           3      NaN      3
top              d      NaN      a
freq             1      NaN      1
mean           NaN      2.0    NaN
std            NaN      1.0    NaN
min            NaN      1.0    NaN
25%            NaN      1.5    NaN
50%            NaN      2.0    NaN
75%            NaN      2.5    NaN
max            NaN      3.0    NaN

The top row is as expected:

top              d      NaN      a

The pandas version I ran:

pandas           : 1.5.0.dev0+376.g9cc98a064e

Thanks

datapythonista added the DataFrame DataFrame data structure label Mar 8, 2020

ShaharNaveh mentioned this issue Mar 19, 2020

DOC: Remove # doctest: +SKIP #32837

Closed

5 tasks

mroeschke added the Bug label Apr 20, 2020

mroeschke added the Docs label Jul 30, 2021
