Modin describe output different from pandas for empty dataframes #4191

Closed
c3-cjazra opened this issue Feb 11, 2022 · 3 comments
Labels
bug 🦗 Something isn't working External Pull requests and issues from people who do not regularly contribute to modin P2 Minor bugs or low-priority feature requests pandas concordance 🐼 Functionality that does not match pandas

Comments

@c3-cjazra

System information

ray 1.9.2
modin 0.12.0
pandas 1.3.4

Describe the problem

In Modin, describe() on an empty dataframe returns the count, unique, top, freq format, even when the dtypes are specified as numeric. In pandas, when the dtypes of an empty dataframe are specified as numeric, describe() returns the count, mean, std, min, percentiles, max format.

import modin.pandas as modin_pd
import numpy as np

modin_df = modin_pd.DataFrame(columns=['A', 'B'], dtype=(np.int64, np.float64))

modin_df.describe()

          A    B
count     0    0
unique    0    0
top     NaN  NaN
freq    NaN  NaN

Pandas properly looks at the dtypes and provides the description relevant to numeric columns:

import pandas as pandas_pd

pandas_df = pandas_pd.DataFrame(columns=['A', 'B'], dtype=(np.int64, np.float64))

pandas_df.describe(percentiles=[0.25, 0.5, 0.75])

         A    B
count  0.0  0.0
mean   NaN  NaN
std    NaN  NaN
min    NaN  NaN
25%    NaN  NaN
50%    NaN  NaN
75%    NaN  NaN
max    NaN  NaN

@mvashishtha
Collaborator

mvashishtha commented Feb 11, 2022

@c3-cjazra thanks for reporting this bug! I can reproduce it on Modin version 0.13.0+22.gaebdb522.

Here's a copy-pastable example:

import modin.pandas as pd
import numpy as np
import pandas

kwargs = dict(columns=['A', 'B'], dtype=(np.int64, np.float64))

df = pd.DataFrame(**kwargs)
print(df.describe())

print('\n\n')

pdf = pandas.DataFrame(**kwargs)
print(pdf.describe())

@mvashishtha mvashishtha added bug 🦗 Something isn't working pandas concordance 🐼 Functionality that does not match pandas labels Feb 11, 2022
@mvashishtha
Collaborator

Modin defaults to pandas for describing an empty dataframe, but it loses types when it converts empty Modin frames to pandas. Fixing #4060 should fix this issue, too.
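
To illustrate why lost dtypes would change the describe() format, here is a minimal sketch in plain pandas (no Modin involved). It assumes that "losing types" amounts to every column ending up as object dtype, which is modeled here with astype(object). The names typed and untyped are made up for this example, and this is not Modin's actual conversion code.

```python
import pandas as pd

# Empty frame whose columns carry explicit numeric dtypes.
typed = pd.DataFrame({
    "A": pd.Series(dtype="int64"),
    "B": pd.Series(dtype="float64"),
})

# Assumption: type loss is modeled as all columns becoming object dtype.
untyped = typed.astype(object)

numeric_summary = typed.describe()
object_summary = untyped.describe()

# The row labels show which describe() format pandas picked for each frame.
print(list(numeric_summary.index))
print(list(object_summary.index))
```

With numeric dtypes, pandas should pick the count/mean/std/min/percentiles/max layout. With object columns, it should fall back to count/unique/top/freq, which is the layout reported for Modin above.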

@vnlitvinov vnlitvinov added the P2 Minor bugs or low-priority feature requests label Aug 29, 2022
@anmyachev anmyachev added the External Pull requests and issues from people who do not regularly contribute to modin label Apr 19, 2023
@anmyachev
Collaborator

The reproducer now shows the same output for pandas and Modin on current master (5b297a1):

>>> print(df.describe())     
UserWarning: `DataFrame.describe` for empty DataFrame is not currently supported by PandasOnDask, defaulting to pandas implementation.
UserWarning: Distributing <class 'pandas.core.frame.DataFrame'> object. This may take some time.
         A    B
count  0.0  0.0
mean   NaN  NaN
std    NaN  NaN
min    NaN  NaN
25%    NaN  NaN
50%    NaN  NaN
75%    NaN  NaN
max    NaN  NaN
>>> print(pdf.describe())
         A    B
count  0.0  0.0
mean   NaN  NaN
std    NaN  NaN
min    NaN  NaN
25%    NaN  NaN
50%    NaN  NaN
75%    NaN  NaN
max    NaN  NaN
