Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DataFrame.agg() can return Series or DataFrame with dict argument #846

Closed
davetapley opened this issue Dec 29, 2023 · 7 comments · Fixed by #861
Closed

DataFrame.agg() can return Series or DataFrame with dict argument #846

davetapley opened this issue Dec 29, 2023 · 7 comments · Fixed by #861
Labels
Awaiting Response Waiting for response

Comments

@davetapley
Copy link
Contributor

davetapley commented Dec 29, 2023

Describe the bug

There is no overload for agg returning a scalar, which it can per the docs and:

To Reproduce

df = pd.DataFrame([[1, 2, 3],
                   [4, 5, 6],
                   [7, 8, 9],
                   [np.nan, np.nan, np.nan]],
                  columns=['A', 'B', 'C'])



# Expression of type "DataFrame" cannot be assigned to declared type "float64"
#  "DataFrame" is incompatible with "float64"Pylance[reportGeneralTypeIssues](https://github.com/microsoft/pyright/blob/main/docs/configuration.md#reportGeneralTypeIssues)
A_sum: np.float64 = df.agg({'A': 'sum'})


# But:
print(A_sum)

# A    12.0
# dtype: float64

Please complete the following information:

  • Linux
  • Ubuntu 20.04.6 LTS
  • python 3.11.4
  • pyright
  • `pandas-stubs == 2.1.4.231227

Additional context
Add any other context about the problem here.

@davetapley
Copy link
Contributor Author

... while writing this I realized this is better anyway 😁

A_sum = df['A'].sum()

@Dr-Irv
Copy link
Collaborator

Dr-Irv commented Dec 29, 2023

You're using a 7 month old version of pandas-stubs. Please upgrade to the latest version (2.1.4.231227) and retest. I tested with that release and pyright did not report an error in your code.

@Dr-Irv Dr-Irv added the Awaiting Response Waiting for response label Dec 29, 2023
@davetapley
Copy link
Contributor Author

@Dr-Irv whoops, updated and seeing same thing,
I amended the snippet to actually generate an error (by adding a type annotation):

A_sum: np.float64 = df.agg({'A': 'sum'})
Expression of type "DataFrame" cannot be assigned to declared type "float64"
  "DataFrame" is incompatible with "float64"Pylance[reportGeneralTypeIssues](https://github.com/microsoft/pyright/blob/main/docs/configuration.md#reportGeneralTypeIssues)

@Dr-Irv
Copy link
Collaborator

Dr-Irv commented Dec 30, 2023

I'm not sure that we can fix this in terms of always returning the correct type from a static typing perspective.

import pandas as pd
import numpy as np
df = pd.DataFrame([[1, 2, 3],
                   [4, 5, 6],
                   [7, 8, 9],
                   [np.nan, np.nan, np.nan]],
                  columns=['A', 'B', 'C'])


A_sum = df.agg({'A': 'sum'})
print(type(A_sum))
print(A_sum)

S_sum = df.agg({'A': 'sum', "B": 'mean'})
print(type(S_sum))
print(S_sum)

T_sum = df.agg({"A": ['sum', 'mean'], "B": ["min", "max"]})
print(type(T_sum))
print(T_sum)

produces

<class 'pandas.core.series.Series'>
A    12.0
dtype: float64
<class 'pandas.core.series.Series'>
A    12.0
B     5.0
dtype: float64
<class 'pandas.core.frame.DataFrame'>
         A    B
sum   12.0  NaN
mean   4.0  NaN
min    NaN  2.0
max    NaN  8.0

So A_sum and S_sum are Series, while T_sum is DataFrame, and the result is solely dependent on how many items are in the dictionary or in the dictionary items. That can't be detected via static typing.

Having said that, we can change the type stub for DataFrame.agg() to return Series | DataFrame if a dict is passed.

I don't see how agg on a DataFrame can return a Scalar, so will change the title of this issue to reflect the issue with the dict argument.

@Dr-Irv Dr-Irv changed the title agg can return a scalar DataFrame.agg() can return Series or DataFrame with dict argument Dec 30, 2023
@hamdanal
Copy link
Contributor

hamdanal commented Jan 7, 2024

So A_sum and S_sum are Series, while T_sum is DataFrame, and the result is solely dependent on how many items are in the dictionary or in the dictionary items. That can't be detected via static typing.

Having said that, we can change the type stub for DataFrame.agg() to return Series | DataFrame if a dict is passed.

I think what we should look at in case of a dictionary is the type of the dictionary values, not the number of its items. In your example, A_sum and S_sum are Series because they are the result of a single aggregation (indicated by the scalar values in the dictionary), while T_sum is DataFrame because it results from multiple aggregations (2 in this case).

If we look at the following example:

>>> U_sum = df.agg({"A": ['sum', 'mean']})
>>> print(type(U_sum))
<class 'pandas.core.frame.DataFrame'>
>>> print(U_sum)
         A
sum   12.0
mean   4.0

We can see that even if the dictionary contains only one item, it returns a dataframe because it is performing two aggregations as indicated by the list ['sum', 'mean'].
I can take a stab at this and add some tests after I get my groupby PR sorted out (unless someone else beats me to it).

@Dr-Irv
Copy link
Collaborator

Dr-Irv commented Jan 8, 2024

We can see that even if the dictionary contains only one item, it returns a dataframe because it is performing two aggregations as indicated by the list ['sum', 'mean'].

Good point.

I can take a stab at this and add some tests after I get my groupby PR sorted out (unless someone else beats me to it).

You can work on the 2 PR's independently. I'll be reviewing the groupby PR shortly

@hamdanal
Copy link
Contributor

You can work on the 2 PR's independently. I'll be reviewing the groupby PR shortly

Of course. I wasn’t suggesting otherwise, it is just that I need to find some free time to work on it.

hamdanal added a commit to hamdanal/pandas-stubs that referenced this issue Feb 11, 2024
Dr-Irv pushed a commit that referenced this issue Feb 12, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Awaiting Response Waiting for response
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants