BUG: downsampling with last doesn't accept "min_count" keyword #37768

DiSchi123 · 2020-11-11T21:11:02Z

I verified this on v1.1.3. I am on miniconda, where 1.1.4 is not available yet. There are probably other ways to install it, but if I read the documentation correctly, this feature should have been available since 0.22.

import numpy as np
import pandas as pd

index = pd.date_range(start='2020', freq='M', periods=6)
data = np.ones(6)
data[4:6] = np.nan
datetime = pd.Series(data, index)
period = datetime.to_period()
datetime
2020-01-31    1.0
2020-02-29    1.0
2020-03-31    1.0
2020-04-30    1.0
2020-05-31    NaN
2020-06-30    NaN
Freq: M, dtype: float64

datetime.resample('Q').sum(min_count=2)
2020-03-31    3.0
2020-06-30    NaN
Freq: Q-DEC, dtype: float64

datetime.resample('Q').last(min_count=3)
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-5-bd5dfd934676> in <module>
----> 1 datetime.resample('Q').last(min_count=3)

~\miniconda3\lib\site-packages\pandas\core\resample.py in g(self, _method, *args, **kwargs)
    933 
    934     def g(self, _method=method, *args, **kwargs):
--> 935         nv.validate_resampler_func(_method, args, kwargs)
    936         return self._downsample(_method)
    937 

~\miniconda3\lib\site-packages\pandas\compat\numpy\function.py in validate_resampler_func(method, args, kwargs)
    395             )
    396         else:
--> 397             raise TypeError("too many arguments passed in")
    398 
    399 

TypeError: too many arguments passed in

Problem description

According to the documentation, .last() accepts the keyword "min_count", just like .sum(), for example, where it works fine (see above).

So I should not see the error above. "min_count" is also useful for .last() if you have NaNs in your data and want to avoid picking a record that is not truly the last record in the segment.

Doc:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.core.resample.Resampler.last.html#pandas.core.resample.Resampler.last

Expected Output

2020-03-31    1.0
2020-06-30    NaN

Output of `pd.show_versions()`

See above: verified with pandas 1.1.3. I started writing the issue on my laptop with an older pandas but finished with 1.1.3. The error message and the pd.show_versions() output are up to date.

INSTALLED VERSIONS

commit : db08276
python : 3.7.6.final.0
python-bits : 64
OS : Windows
OS-release : 8.1
Version : 6.3.9600
machine : AMD64
processor : Intel64 Family 6 Model 62 Stepping 4, GenuineIntel
byteorder : little
LC_ALL : None
LANG : None
LOCALE : None.None

pandas : 1.1.3
numpy : 1.19.2
pytz : 2020.1
dateutil : 2.8.1
pip : 20.2.4
setuptools : 50.3.1.post20201107
Cython : None
pytest : None
hypothesis : None
sphinx : 3.2.1
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 2.11.2
IPython : 7.19.0
pandas_datareader: None
bs4 : None
bottleneck : None
fsspec : None
fastparquet : None
gcsfs : None
matplotlib : 3.1.3
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pytables : None
pyxlsb : None
s3fs : None
scipy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : 1.2.0
xlwt : None
numba : 0.49.1


phofl · 2020-11-13T20:15:24Z

Actually, it looks like the docstrings are wrong here. The underlying implementation does not support a min_count keyword for the functions ["min", "max", "first", "last", "mean", "sem", "median", "ohlc"].

The GroupBy docs are more or less erroneous too. You can pass min_count to groupby.last, but it raises:

from pandas import DataFrame

df = DataFrame({"a": [1, 2, 3, 4, 5, 6]})
print(df.groupby(level=0).last(min_count=1))

Traceback (most recent call last):
  File "/home/developer/.config/JetBrains/PyCharm2020.2/scratches/scratch_4.py", line 97, in <module>
    print(df.groupby(level=0).last(min_count=1))
  File "/home/developer/PycharmProjects/pandas/pandas/core/groupby/groupby.py", line 1710, in last
    return self._agg_general(
  File "/home/developer/PycharmProjects/pandas/pandas/core/groupby/groupby.py", line 1032, in _agg_general
    result = self._cython_agg_general(
  File "/home/developer/PycharmProjects/pandas/pandas/core/groupby/generic.py", line 1018, in _cython_agg_general
    agg_mgr = self._cython_agg_blocks(
  File "/home/developer/PycharmProjects/pandas/pandas/core/groupby/generic.py", line 1116, in _cython_agg_blocks
    new_mgr = data.apply(blk_func, ignore_failures=True)
  File "/home/developer/PycharmProjects/pandas/pandas/core/internals/managers.py", line 425, in apply
    applied = b.apply(f, **kwargs)
  File "/home/developer/PycharmProjects/pandas/pandas/core/internals/blocks.py", line 368, in apply
    result = func(self.values, **kwargs)
  File "/home/developer/PycharmProjects/pandas/pandas/core/groupby/generic.py", line 1067, in blk_func
    result, _ = self.grouper.aggregate(
  File "/home/developer/PycharmProjects/pandas/pandas/core/groupby/ops.py", line 594, in aggregate
    return self._cython_operation(
  File "/home/developer/PycharmProjects/pandas/pandas/core/groupby/ops.py", line 547, in _cython_operation
    result = self._aggregate(result, counts, values, codes, func, min_count)
  File "/home/developer/PycharmProjects/pandas/pandas/core/groupby/ops.py", line 608, in _aggregate
    agg_func(result, counts, values, comp_ids, min_count)
  File "pandas/_libs/groupby.pyx", line 906, in pandas._libs.groupby.group_last
AssertionError: 'min_count' only used in add and prod

Process finished with exit code 1

DiSchi123 · 2020-11-13T20:29:21Z

It might be a copy-paste error in the documentation. If so, I would still like to see this as an enhancement. Let me know which way you and the team decide to go: fix the documentation or enhance the code. Happy to raise an enhancement request in the former case.

There is a case for a "min_count" option, especially when downsampling with .last(). The reason is that .last() takes the last non-NaN value of the bin/segment without ensuring that it is actually the last value in the time series. "min_count" can prevent undesired values from being chosen.
E.g., in the example above, the last 2 months are NaN. If the goal is to have comparable quarterly numbers, this quarter would have to be NaN since it doesn't have a full 3 months of data. This can be important when using the quarterly data for further statistical analysis. Setting "min_count" to 3 in this example ensures that the last quarter gets marked NA, similar to .sum().
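
Until the keyword is accepted, the same masking can probably be emulated by combining .last() with .count(). This is only a sketch of a possible workaround, not the pandas implementation. The names series, counts and last_with_min_count are illustrative, and offset objects are used instead of the 'M' and 'Q' aliases purely as an assumption to keep the snippet independent of alias spellings, which differ between pandas versions.

```python
import numpy as np
import pandas as pd

# Same data as in the report: four monthly observations followed by two NaNs.
index = pd.date_range(start="2020-01-31", periods=6, freq=pd.offsets.MonthEnd())
data = np.ones(6)
data[4:6] = np.nan
series = pd.Series(data, index)

# Quarterly bins ending in March, June, September and December.
resampler = series.resample(pd.offsets.QuarterEnd(startingMonth=12))

# count() gives the number of non-NaN observations in each bin.
counts = resampler.count()

# Keep last() only where the bin holds at least `min_count` valid observations;
# every other bin is masked to NaN.
min_count = 3
last_with_min_count = resampler.last().where(counts >= min_count)
print(last_with_min_count)
```

For this data the first quarter keeps 1.0 and the second quarter becomes NaN, which matches the expected output given in the report.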

rhshadrach · 2020-11-14T14:04:24Z

Looking at the implementation of the functions @phofl identified, it appears straightforward to add min_count to the implementation. +1 on adding this functionality.
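
For reference, a minimal sketch of how the original call would look once the keyword is supported. It assumes a pandas version that includes the enhancement linked in this thread (#37870, milestone 1.2); on 1.1.3 the call raises the TypeError shown in the report. The name quarterly_last is illustrative, and offset objects stand in for the 'M' and 'Q' aliases as an assumption to avoid version-specific alias spellings.

```python
import numpy as np
import pandas as pd

# Data from the report: the last two months are missing.
index = pd.date_range(start="2020-01-31", periods=6, freq=pd.offsets.MonthEnd())
series = pd.Series([1.0, 1.0, 1.0, 1.0, np.nan, np.nan], index=index)

# Assumes pandas >= 1.2, where last() accepts min_count.
# A quarter with fewer than 3 non-NaN observations yields NaN.
quarterly_last = series.resample(pd.offsets.QuarterEnd(startingMonth=12)).last(min_count=3)
print(quarterly_last)
```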

DiSchi123 added the Bug and Needs Triage labels Nov 11, 2020

phofl added the Docs, Groupby and Resample labels and removed the Needs Triage label Nov 13, 2020

phofl mentioned this issue Nov 13, 2020

BUG: Groupby.last accepts min_count but implementation raises #37821

Closed

3 tasks

phofl removed the Groupby label Nov 13, 2020

phofl mentioned this issue Nov 15, 2020

ENH: Add support for min_count keyword for Resample and Groupby functions #37870

Merged

6 tasks

jreback added this to the 1.2 milestone Nov 25, 2020

jreback closed this as completed in #37870 Nov 26, 2020
