Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: timeseries.groupby(...).transform('mean') wrong when aggregating over pd.NaT #43132

Closed
behrenhoff opened this issue Aug 20, 2021 · 8 comments · Fixed by #43467
Closed

BUG: timeseries.groupby(...).transform('mean') wrong when aggregating over pd.NaT #43132

behrenhoff opened this issue Aug 20, 2021 · 8 comments · Fixed by #43467
Labels
Apply Apply, Aggregate, Transform Bug Groupby Regression Functionality that used to work in a prior pandas version
Milestone

Comments

@behrenhoff
Copy link
Contributor

Code Sample, a copy-pastable example

The problem is best shown in the following example:

import pandas as pd
s = pd.Series(['1 day', '3 days', pd.NaT], dtype='timedelta64[ns]')

print(f'{s.mean()=}\n')
# as expected: result is '2 days'

print(f"{s.apply('mean')=}\n")
# as expected: result is '2 days'

print("transform('mean')")
print(s.groupby([1,1,1]).transform('mean'))
# whoops: (a series of 3x) -35583 days +08:04:14.381741568
# expectation was: (a series of 3x) '2 days'

# Fixing the problem
print("\ntransform(pd.Series.mean)")
print(s.groupby([1,1,1]).transform(pd.Series.mean))
# (a series of 3x) '2 days', sanity restored :-)

Problem description

When passing a function name by string to transform, pd.NaT is not handled correctly. This not only happens for 'mean' but also for other functions like 'sum'.

I don't know how transform is looking up the function if a string is passed (and it's not really documented), but it certainly doesn't select pd.Series.mean but some other non-NaT-aware mean function.

This is surprising since NaT is handled correctly when calling apply('mean').

Tested with pd 1.3.2.

@behrenhoff behrenhoff added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Aug 20, 2021
@behrenhoff behrenhoff changed the title BUG: timeseries.groupby(...).transform('mean') wrong if aggregating over pd.NaT BUG: timeseries.groupby(...).transform('mean') wrong when aggregating over pd.NaT Aug 20, 2021
@mroeschke mroeschke added Apply Apply, Aggregate, Transform Groupby and removed Needs Triage Issue that has not been reviewed by a pandas team member labels Aug 22, 2021
@AlexeyGy
Copy link
Contributor

Can I work on this issue?

@simonjayhawkins
Copy link
Member

s.groupby([1, 1, 1]).transform("mean") was raising DataError: No numeric types to aggregate with 1.2.5

I suspect this issue maybe related to or a duplicate of #42659. will label as a regression pending further investigation cc @jbrockmendel

@simonjayhawkins simonjayhawkins added this to the 1.3.3 milestone Aug 26, 2021
@simonjayhawkins simonjayhawkins added the Regression Functionality that used to work in a prior pandas version label Aug 26, 2021
simonjayhawkins added a commit to simonjayhawkins/pandas that referenced this issue Aug 27, 2021
@simonjayhawkins
Copy link
Member

I suspect this issue maybe related to or a duplicate of #42659. will label as a regression pending further investigation cc @jbrockmendel

first bad commit: [fa8f68e] BUG/API: SeriesGroupBy reduction with numeric_only=True (#41342)

@AlexeyGy
Copy link
Contributor

AlexeyGy commented Aug 27, 2021

This happens because pd.Series.mean and 'mean' will lead to totally different paths through the code and hence different calculations:

values = values.view("int64")

which calls numpy and turns pd.NaT into a very low integer causing this problem.

Questions

@simonjayhawkins

  1. Are there any plans to harmonize both paths? It would surely be easy to just make 'mean' take the same route as pd.Series.mean.

  2. To mitigate, we could modify _resolve_numeric_only to return True if there are any NaN values in an otherwise numeric set. What do you think?

@simonjayhawkins
Copy link
Member

Thanks @AlexeyGy for exploring this issue further.

  • Are there any plans to harmonize both paths? It would surely be easy to just make 'mean' take the same route as pd.Series.mean.

refactoring, de-duplication, cleaning and maintaining api consistency is an ongoing process, but that may not be the best option if we wanted to backport any fixes to 1.3.x unless we limited the dispatch from groupby.mean to DatetimeLikeArrayMixin.mean @jbrockmendel?

  • To mitigate, we could modify _resolve_numeric_only to return True if there are any NaN values in an otherwise numeric set.

There has been much discussion in #42395 and #43154 about changes to _resolve_numeric_only so I think further changes there is likely not desirable.

There is also the case that the 1.2.x behavior was incorrect and that the new behavior is a new bug and not a regression. But personally I would prefer to see 1.3.x patched to fix this.

#42659 is clearly a regression and it may well be that a fix for that case fixes this one too.

@jbrockmendel
Copy link
Member

The correct long-term solution is for libgroupby.group_mean to take a is_datetimelike keyword like many of the other functions there do.

@AlexeyGy any interest in implementing this?

@AlexeyGy
Copy link
Contributor

AlexeyGy commented Sep 1, 2021

Yes, happy to help. Should not be a problem since I can peek at the other implementations in the file.

Awesome that there are even tests for both, libgroupby, and groupby already which I can extend.

AlexeyGy added a commit to AlexeyGy/pandas that referenced this issue Sep 8, 2021
## Background
NaT is the datetime equivalent of NaN and is set to be the lowest possible 64 bit integer -(2**63). Previously, we could not support this value in any groupby.mean() calculations which lead to pandas-dev#43132.

## Implementation
On a high level, we slightly modify the `group_mean` to not count NaT values. To do so, we introduce the `is_datetimelike` parameter to the function call (already present in other functions, e.g., `group_cumsum`) and refactor and extend `#_treat_as_na` to work with float64.

## Tests
This PR add an additional integration and unit test for the new functionality. In contrast to other tests in classes, I've tried to keep an individual test's scope as small as possible.

## Notes
Additionally, I've taken the liberty to:
 * Add a docstring for the group_mean algorithm.
 * Change the algorithm to use guard clauses instead of else/if.
 * Add a comment that we're using the Kahan summation (the compensation part initially confused me, and I only stumbled upon Kahan when browsing the file).

- [x] closes pandas-dev#43132
- [x] tests added / passed
- [x] Ensure all linting tests pass, see [here](https://pandas.pydata.org/pandas-docs/dev/development/contributing.html#code-standards) for how to run them
- [x] whatsnew entry => different format but it's there
@AlexeyGy
Copy link
Contributor

AlexeyGy commented Sep 8, 2021

I've drafted up #43467. Will want to add a couple more tests to ensure that nothing is breaking but want to share it out early. Open for feedback on it.

AlexeyGy added a commit to AlexeyGy/pandas that referenced this issue Sep 9, 2021
NaT is the datetime equivalent of NaN and is set to be the lowest possible 64 bit integer -(2**63). Previously, we could not support this value in any groupby.mean() calculations which lead to pandas-dev#43132.

On a high level, we slightly modify the `group_mean` to not count NaT values. To do so, we introduce the `is_datetimelike` parameter to the function call (already present in other functions, e.g., `group_cumsum`) and refactor and extend `#_treat_as_na` to work with float64.

This PR add an additional integration and unit test for the new functionality. In contrast to other tests in classes, I've tried to keep an individual test's scope as small as possible.

Additionally, I've taken the liberty to:
 * Add a docstring for the group_mean algorithm.
 * Change the algorithm to use guard clauses instead of else/if.
 * Add a comment that we're using the Kahan summation (the compensation part initially confused me, and I only stumbled upon Kahan when browsing the file).

- [x] closes pandas-dev#43132
- [x] tests added / passed
- [x] Ensure all linting tests pass, see [here](https://pandas.pydata.org/pandas-docs/dev/development/contributing.html#code-standards) for how to run them
- [x] whatsnew entry => different format but it's there
AlexeyGy added a commit to AlexeyGy/pandas that referenced this issue Sep 9, 2021
NaT is the datetime equivalent of NaN and is set to be the lowest possible 64 bit integer -(2**63). Previously, we could not support this value in any groupby.mean() calculations which lead to pandas-dev#43132.

On a high level, we slightly modify the `group_mean` to not count NaT values. To do so, we introduce the `is_datetimelike` parameter to the function call (already present in other functions, e.g., `group_cumsum`) and refactor and extend `#_treat_as_na` to work with float64.

This PR add an additional integration and unit test for the new functionality. In contrast to other tests in classes, I've tried to keep an individual test's scope as small as possible.

Additionally, I've taken the liberty to:
 * Add a docstring for the group_mean algorithm.
 * Change the algorithm to use guard clauses instead of else/if.
 * Add a comment that we're using the Kahan summation (the compensation part initially confused me, and I only stumbled upon Kahan when browsing the file).

- [x] closes pandas-dev#43132
- [x] tests added / passed
- [x] Ensure all linting tests pass, see [here](https://pandas.pydata.org/pandas-docs/dev/development/contributing.html#code-standards) for how to run them
- [x] whatsnew entry => different format but it's there
AlexeyGy added a commit to AlexeyGy/pandas that referenced this issue Sep 9, 2021
NaT is the datetime equivalent of NaN and is set to be the lowest possible 64 bit integer -(2**63). Previously, we could not support this value in any groupby.mean() calculations which lead to pandas-dev#43132.

On a high level, we slightly modify the `group_mean` to not count NaT values. To do so, we introduce the `is_datetimelike` parameter to the function call (already present in other functions, e.g., `group_cumsum`) and refactor and extend `#_treat_as_na` to work with float64.

This PR add an additional integration and unit test for the new functionality. In contrast to other tests in classes, I've tried to keep an individual test's scope as small as possible.

Additionally, I've taken the liberty to:
 * Add a docstring for the group_mean algorithm.
 * Change the algorithm to use guard clauses instead of else/if.
 * Add a comment that we're using the Kahan summation (the compensation part initially confused me, and I only stumbled upon Kahan when browsing the file).

- [x] closes pandas-dev#43132
- [x] tests added / passed
- [x] Ensure all linting tests pass, see [here](https://pandas.pydata.org/pandas-docs/dev/development/contributing.html#code-standards) for how to run them
- [x] whatsnew entry => different format but it's there
AlexeyGy added a commit to AlexeyGy/pandas that referenced this issue Sep 9, 2021
NaT is the datetime equivalent of NaN and is set to be the lowest possible 64 bit integer -(2**63). Previously, we could not support this value in any groupby.mean() calculations which lead to pandas-dev#43132.

On a high level, we slightly modify the `group_mean` to not count NaT values. To do so, we introduce the `is_datetimelike` parameter to the function call (already present in other functions, e.g., `group_cumsum`) and refactor and extend `#_treat_as_na` to work with float64.

This PR add an additional integration and unit test for the new functionality. In contrast to other tests in classes, I've tried to keep an individual test's scope as small as possible.

Additionally, I've taken the liberty to:
 * Add a docstring for the group_mean algorithm.
 * Change the algorithm to use guard clauses instead of else/if.
 * Add a comment that we're using the Kahan summation (the compensation part initially confused me, and I only stumbled upon Kahan when browsing the file).

- [x] closes pandas-dev#43132
- [x] tests added / passed
- [x] Ensure all linting tests pass, see [here](https://pandas.pydata.org/pandas-docs/dev/development/contributing.html#code-standards) for how to run them
- [x] whatsnew entry => different format but it's there
AlexeyGy added a commit to AlexeyGy/pandas that referenced this issue Sep 9, 2021
NaT is the datetime equivalent of NaN and is set to be the lowest possible 64 bit integer -(2**63). Previously, we could not support this value in any groupby.mean() calculations which lead to pandas-dev#43132.

On a high level, we slightly modify the `group_mean` to not count NaT values. To do so, we introduce the `is_datetimelike` parameter to the function call (already present in other functions, e.g., `group_cumsum`) and refactor and extend `#_treat_as_na` to work with float64.

This PR add an additional integration and unit test for the new functionality. In contrast to other tests in classes, I've tried to keep an individual test's scope as small as possible.

Additionally, I've taken the liberty to:
 * Add a docstring for the group_mean algorithm.
 * Change the algorithm to use guard clauses instead of else/if.
 * Add a comment that we're using the Kahan summation (the compensation part initially confused me, and I only stumbled upon Kahan when browsing the file).

- [x] closes pandas-dev#43132
- [x] tests added / passed
- [x] Ensure all linting tests pass, see [here](https://pandas.pydata.org/pandas-docs/dev/development/contributing.html#code-standards) for how to run them
- [x] whatsnew entry => different format but it's there
AlexeyGy added a commit to AlexeyGy/pandas that referenced this issue Sep 10, 2021
NaT is the datetime equivalent of NaN and is set to be the lowest possible 64 bit integer -(2**63). Previously, we could not support this value in any groupby.mean() calculations which lead to pandas-dev#43132.

On a high level, we slightly modify the `group_mean` to not count NaT values. To do so, we introduce the `is_datetimelike` parameter to the function call (already present in other functions, e.g., `group_cumsum`) and refactor and extend `#_treat_as_na` to work with float64.

This PR add an additional integration and unit test for the new functionality. In contrast to other tests in classes, I've tried to keep an individual test's scope as small as possible.

Additionally, I've taken the liberty to:
 * Add a docstring for the group_mean algorithm.
 * Change the algorithm to use guard clauses instead of else/if.
 * Add a comment that we're using the Kahan summation (the compensation part initially confused me, and I only stumbled upon Kahan when browsing the file).

- [x] closes pandas-dev#43132
- [x] tests added / passed
- [x] Ensure all linting tests pass, see [here](https://pandas.pydata.org/pandas-docs/dev/development/contributing.html#code-standards) for how to run them
- [x] whatsnew entry => different format but it's there
AlexeyGy added a commit to AlexeyGy/pandas that referenced this issue Sep 10, 2021
NaT is the datetime equivalent of NaN and is set to be the lowest possible 64 bit integer -(2**63). Previously, we could not support this value in any groupby.mean() calculations which lead to pandas-dev#43132.

On a high level, we slightly modify the `group_mean` to not count NaT values. To do so, we introduce the `is_datetimelike` parameter to the function call (already present in other functions, e.g., `group_cumsum`) and refactor and extend `#_treat_as_na` to work with float64.

This PR add an additional integration and unit test for the new functionality. In contrast to other tests in classes, I've tried to keep an individual test's scope as small as possible.

Additionally, I've taken the liberty to:
 * Add a docstring for the group_mean algorithm.
 * Change the algorithm to use guard clauses instead of else/if.
 * Add a comment that we're using the Kahan summation (the compensation part initially confused me, and I only stumbled upon Kahan when browsing the file).

- [x] closes pandas-dev#43132
- [x] tests added / passed
- [x] Ensure all linting tests pass, see [here](https://pandas.pydata.org/pandas-docs/dev/development/contributing.html#code-standards) for how to run them
- [x] whatsnew entry => different format but it's there
AlexeyGy added a commit to AlexeyGy/pandas that referenced this issue Sep 10, 2021
NaT is the datetime equivalent of NaN and is set to be the lowest possible 64 bit integer -(2**63). Previously, we could not support this value in any groupby.mean() calculations which lead to pandas-dev#43132.

On a high level, we slightly modify the `group_mean` to not count NaT values. To do so, we introduce the `is_datetimelike` parameter to the function call (already present in other functions, e.g., `group_cumsum`) and refactor and extend `#_treat_as_na` to work with float64.

This PR add an additional integration and unit test for the new functionality. In contrast to other tests in classes, I've tried to keep an individual test's scope as small as possible.

Additionally, I've taken the liberty to:
 * Add a docstring for the group_mean algorithm.
 * Change the algorithm to use guard clauses instead of else/if.
 * Add a comment that we're using the Kahan summation (the compensation part initially confused me, and I only stumbled upon Kahan when browsing the file).

- [x] closes pandas-dev#43132
- [x] tests added / passed
- [x] Ensure all linting tests pass, see [here](https://pandas.pydata.org/pandas-docs/dev/development/contributing.html#code-standards) for how to run them
- [x] whatsnew entry => different format but it's there
AlexeyGy added a commit to AlexeyGy/pandas that referenced this issue Sep 10, 2021
NaT is the datetime equivalent of NaN and is set to be the lowest possible 64-bit integer -(2**63). Previously, we could not support this value in any `groupby.mean()` calculations which lead to pandas-dev#43132.

On a high level, we slightly modify the `group_mean` to not count NaT values. To do so, we introduce the `is_datetimelike` parameter to the function call (already present in other functions, e.g., `group_cumsum`) and refactor and extend `#_treat_as_na` to work with float64.

## Tests
This PR adds an integration and two unit tests for the new functionality. In contrast to other tests in classes, I've tried to keep an individual test's scope as small as possible.

Additionally, I've taken the liberty to:
 * Add a docstring for the group_mean algorithm.
 * Change the algorithm to use guard clauses instead of if/else.
 * Add a comment that we're using the Kahan summation (the compensation part initially confused me, and I only stumbled upon Kahan when browsing the file).

- [x] closes pandas-dev#43132
- [x] tests added / passed
- [x] Ensure all linting tests pass, see [here](https://pandas.pydata.org/pandas-docs/dev/development/contributing.html#code-standards) for how to run them
- [x] whatsnew entry
@simonjayhawkins simonjayhawkins modified the milestones: 1.3.3, 1.3.4 Sep 11, 2021
AlexeyGy added a commit to AlexeyGy/pandas that referenced this issue Sep 11, 2021
NaT is the datetime equivalent of NaN and is set to be the lowest possible 64-bit integer -(2**63). Previously, we could not support this value in any `groupby.mean()` calculations which lead to pandas-dev#43132.

On a high level, we slightly modify the `group_mean` to not count NaT values. To do so, we introduce the `is_datetimelike` parameter to the function call (already present in other functions, e.g., `group_cumsum`) and refactor and extend `#_treat_as_na` to work with float64.

## Tests
This PR adds an integration and two unit tests for the new functionality. In contrast to other tests in classes, I've tried to keep an individual test's scope as small as possible.

Additionally, I've taken the liberty to:
 * Add a docstring for the group_mean algorithm.
 * Change the algorithm to use guard clauses instead of if/else.
 * Add a comment that we're using the Kahan summation (the compensation part initially confused me, and I only stumbled upon Kahan when browsing the file).

- [x] closes pandas-dev#43132
- [x] tests added / passed
- [x] Ensure all linting tests pass, see [here](https://pandas.pydata.org/pandas-docs/dev/development/contributing.html#code-standards) for how to run them
- [x] whatsnew entry
AlexeyGy added a commit to AlexeyGy/pandas that referenced this issue Sep 13, 2021
NaT is the datetime equivalent of NaN and is set to be the lowest possible 64-bit integer -(2**63). Previously, we could not support this value in any `groupby.mean()` calculations which lead to pandas-dev#43132.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Apply Apply, Aggregate, Transform Bug Groupby Regression Functionality that used to work in a prior pandas version
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants