
DataFrame.describe() output is not deterministic #32528

Open
ShaharNaveh opened this issue Mar 7, 2020 · 3 comments
Labels
Bug DataFrame DataFrame data structure Docs


ShaharNaveh commented Mar 7, 2020

xref #31472

Running this:

import pandas as pd

df = pd.DataFrame({
    'categorical': pd.Categorical(['d', 'e', 'f']),
    'numeric': [1, 2, 3],
    'object': ['a', 'b', 'c']
})

df.describe(include="all")

Will sometimes have the output of:

       categorical  numeric object
count            3      3.0      3
unique           3      NaN      3
top              f      NaN      a
freq             1      NaN      1
mean           NaN      2.0    NaN
std            NaN      1.0    NaN
min            NaN      1.0    NaN
25%            NaN      1.5    NaN
50%            NaN      2.0    NaN
75%            NaN      2.5    NaN
max            NaN      3.0    NaN

And sometimes the output of:

       categorical  numeric object
count            3      3.0      3
unique           3      NaN      3
top              f      NaN      c
freq             1      NaN      1
mean           NaN      2.0    NaN
std            NaN      1.0    NaN
min            NaN      1.0    NaN
25%            NaN      1.5    NaN
50%            NaN      2.0    NaN
75%            NaN      2.5    NaN
max            NaN      3.0    NaN

(Only the top row differs: the object column shows a in one run and c in the other.)

This makes the doctests fail. Once we identify the problem, we should remove the SKIP from the tests in the describe docstring: https://github.com/pandas-dev/pandas/blob/master/pandas/core/generic.py#L9649
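For readers unfamiliar with the SKIP mentioned above, here is a minimal sketch of how doctest's +SKIP directive keeps an example out of the test run. It does not reproduce the pandas docstring: top_value is a hypothetical helper invented for illustration, and only the standard-library doctest module is used.

```python
import doctest
from collections import Counter


def top_value(values):
    """Return one most frequent value (hypothetical helper, not pandas code).

    >>> top_value(["a", "b", "b"])
    'b'
    >>> top_value(["a", "b", "c"])  # doctest: +SKIP
    'a'
    """
    return Counter(values).most_common(1)[0][0]


# Parse the docstring into a DocTest and run it with the standard runner.
parser = doctest.DocTestParser()
test = parser.get_doctest(
    top_value.__doc__, {"top_value": top_value}, "top_value", "<sketch>", 0
)
runner = doctest.DocTestRunner(verbose=False)
results = runner.run(test)

# The example marked +SKIP is never executed, so it cannot fail the run.
print(results)
```

Removing the directive from the second example would make it run again, which is only safe once the output for tied values is stable.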

@mroeschke

Could you post the output of pd.show_versions()? I am not getting the same behavior on:

INSTALLED VERSIONS
------------------
commit           : 2a2258d64400b0f535502d903c9ab05b7d696af2
python           : 3.7.3.final.0
python-bits      : 64
OS               : Darwin
OS-release       : 18.7.0
Version          : Darwin Kernel Version 18.7.0: Thu Jan 23 06:52:12 PST 2020; root:xnu-4903.278.25~1/RELEASE_X86_64
machine          : x86_64
processor        : i386
byteorder        : little
LC_ALL           : None
LANG             : en_US.UTF-8
LOCALE           : en_US.UTF-8

pandas           : 1.1.0.dev0+710.g2a2258d64

@datapythonista datapythonista added the DataFrame DataFrame data structure label Mar 8, 2020

datapythonista commented Mar 8, 2020

Looking at the code, this snippet should also reproduce the problem:

import pandas

data = pandas.Series(list('abc'))
data.value_counts().index[0]

(it always returns a for me)
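In that snippet every value occurs exactly once, so all three are tied for most frequent and the first index entry depends on how ties happen to be ordered. The following is a rough sketch of one way to get an answer that does not depend on that ordering, assuming that picking the smallest tied value is acceptable. top_with_tiebreak is a hypothetical helper, not part of pandas, and not a proposed fix for describe().

```python
import pandas as pd


def top_with_tiebreak(series):
    """Most frequent value; ties are resolved by taking the smallest value."""
    counts = series.value_counts()
    tied = counts[counts == counts.max()]  # every value sharing the top count
    return min(tied.index)


data = pd.Series(list("abc"))  # each value occurs once, so all three tie

# Collect the distinct answers over repeated calls within one process.
seen = {top_with_tiebreak(data) for _ in range(100)}
print(seen)
```

Repeating the call inside a single process cannot expose variation that only appears between separate interpreter runs, so a check like this is a necessary condition for determinism rather than proof of it.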


weikhor commented Feb 17, 2022

@ShaharNaveh @datapythonista @mroeschke

Hi, this is the code I ran:

import pandas as pd

df = pd.DataFrame({
    'categorical': pd.Categorical(['d', 'e', 'f']),
    'numeric': [1, 2, 3],
    'object': ['a', 'b', 'c']
})

print(df.describe(include="all"))

I get the same output on every run, after running it many times.

       categorical  numeric object
count            3      3.0      3
unique           3      NaN      3
top              d      NaN      a
freq             1      NaN      1
mean           NaN      2.0    NaN
std            NaN      1.0    NaN
min            NaN      1.0    NaN
25%            NaN      1.5    NaN
50%            NaN      2.0    NaN
75%            NaN      2.5    NaN
max            NaN      3.0    NaN

The top row is as expected:

top              d      NaN      a

The pandas version I ran:

pandas           : 1.5.0.dev0+376.g9cc98a064e

Thanks
