BUG: downsampling with last doesn't accept "min_count" keyword #37768

Closed
DiSchi123 opened this issue Nov 11, 2020 · 3 comments · Fixed by #37870
Labels
Bug Docs Resample resample method

DiSchi123 commented Nov 11, 2020

I verified this on pandas 1.1.3. I am on miniconda, where 1.1.4 is not available yet. There are probably other ways to install it, but if I read the documentation correctly, this feature should have been available since 0.22.

import numpy as np
import pandas as pd

index = pd.date_range(start='2020', freq='M', periods=6)
data = np.ones(6)
data[4:6] = np.nan
datetime = pd.Series(data, index)
period = datetime.to_period()
datetime
2020-01-31    1.0
2020-02-29    1.0
2020-03-31    1.0
2020-04-30    1.0
2020-05-31    NaN
2020-06-30    NaN
Freq: M, dtype: float64
datetime.resample('Q').sum(min_count=2)
2020-03-31    3.0
2020-06-30    NaN
Freq: Q-DEC, dtype: float64
datetime.resample('Q').last(min_count=3)
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-5-bd5dfd934676> in <module>
----> 1 datetime.resample('Q').last(min_count=3)

~\miniconda3\lib\site-packages\pandas\core\resample.py in g(self, _method, *args, **kwargs)
    933 
    934     def g(self, _method=method, *args, **kwargs):
--> 935         nv.validate_resampler_func(_method, args, kwargs)
    936         return self._downsample(_method)
    937 

~\miniconda3\lib\site-packages\pandas\compat\numpy\function.py in validate_resampler_func(method, args, kwargs)
    395             )
    396         else:
--> 397             raise TypeError("too many arguments passed in")
    398 
    399 

TypeError: too many arguments passed in

Problem description

According to the documentation, .last() accepts the keyword "min_count", just like .sum(), where it works fine (see above).

So I should not see the error above. "min_count" is also useful for .last() when the data contain NaNs and you want to avoid picking a record that is not truly the last record in the segment.

Doc:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.core.resample.Resampler.last.html#pandas.core.resample.Resampler.last

Expected Output

2020-03-31    1.0
2020-06-30    NaN
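
Until .last() honors "min_count", the expected output above can be approximated by masking the result of .last() with the per-bin count of non-NaN values. This is only a sketch of a workaround, not the behavior pandas documents: the names `series`, `min_count`, and `masked_last` are made up for illustration, and offset objects are used instead of the 'M'/'Q' strings so the example does not depend on a particular alias spelling.

```python
import numpy as np
import pandas as pd

# Same data as in the report: six month-end values, the last two missing.
index = pd.date_range(start="2020-01-31", periods=6, freq=pd.offsets.MonthEnd())
data = np.ones(6)
data[4:6] = np.nan
series = pd.Series(data, index=index)

min_count = 3  # required number of non-NaN values per quarter (illustrative)
resampler = series.resample(pd.offsets.QuarterEnd())

# last() returns the last non-NaN value of each bin; count() returns the
# number of non-NaN values. Keep the last value only where enough exist.
masked_last = resampler.last().where(resampler.count() >= min_count)
print(masked_last)
```

The first quarter has three valid values and keeps 1.0; the second quarter has only one (April) and becomes NaN.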

Output of pd.show_versions()

See above: verified with pandas 1.1.3. I started writing this issue on my laptop with an older pandas but finished it with 1.1.3. The error message and the pd.show_versions() output are up to date.

INSTALLED VERSIONS

commit : db08276
python : 3.7.6.final.0
python-bits : 64
OS : Windows
OS-release : 8.1
Version : 6.3.9600
machine : AMD64
processor : Intel64 Family 6 Model 62 Stepping 4, GenuineIntel
byteorder : little
LC_ALL : None
LANG : None
LOCALE : None.None

pandas : 1.1.3
numpy : 1.19.2
pytz : 2020.1
dateutil : 2.8.1
pip : 20.2.4
setuptools : 50.3.1.post20201107
Cython : None
pytest : None
hypothesis : None
sphinx : 3.2.1
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 2.11.2
IPython : 7.19.0
pandas_datareader: None
bs4 : None
bottleneck : None
fsspec : None
fastparquet : None
gcsfs : None
matplotlib : 3.1.3
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pytables : None
pyxlsb : None
s3fs : None
scipy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : 1.2.0
xlwt : None
numba : 0.49.1

@DiSchi123 DiSchi123 added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Nov 11, 2020

phofl commented Nov 13, 2020

Actually, it looks like the docstrings are wrong here. The underlying implementation does not support a min_count keyword for the functions ["min", "max", "first", "last", "mean", "sem", "median", "ohlc"].

The GroupBy docs are more or less erroneous too. You can pass min_count to groupby.last, but it raises:

from pandas import DataFrame
df = DataFrame({"a": [1, 2, 3, 4, 5, 6]})
print(df.groupby(level=0).last(min_count=1))
Traceback (most recent call last):
  File "/home/developer/.config/JetBrains/PyCharm2020.2/scratches/scratch_4.py", line 97, in <module>
    print(df.groupby(level=0).last(min_count=1))
  File "/home/developer/PycharmProjects/pandas/pandas/core/groupby/groupby.py", line 1710, in last
    return self._agg_general(
  File "/home/developer/PycharmProjects/pandas/pandas/core/groupby/groupby.py", line 1032, in _agg_general
    result = self._cython_agg_general(
  File "/home/developer/PycharmProjects/pandas/pandas/core/groupby/generic.py", line 1018, in _cython_agg_general
    agg_mgr = self._cython_agg_blocks(
  File "/home/developer/PycharmProjects/pandas/pandas/core/groupby/generic.py", line 1116, in _cython_agg_blocks
    new_mgr = data.apply(blk_func, ignore_failures=True)
  File "/home/developer/PycharmProjects/pandas/pandas/core/internals/managers.py", line 425, in apply
    applied = b.apply(f, **kwargs)
  File "/home/developer/PycharmProjects/pandas/pandas/core/internals/blocks.py", line 368, in apply
    result = func(self.values, **kwargs)
  File "/home/developer/PycharmProjects/pandas/pandas/core/groupby/generic.py", line 1067, in blk_func
    result, _ = self.grouper.aggregate(
  File "/home/developer/PycharmProjects/pandas/pandas/core/groupby/ops.py", line 594, in aggregate
    return self._cython_operation(
  File "/home/developer/PycharmProjects/pandas/pandas/core/groupby/ops.py", line 547, in _cython_operation
    result = self._aggregate(result, counts, values, codes, func, min_count)
  File "/home/developer/PycharmProjects/pandas/pandas/core/groupby/ops.py", line 608, in _aggregate
    agg_func(result, counts, values, comp_ids, min_count)
  File "pandas/_libs/groupby.pyx", line 906, in pandas._libs.groupby.group_last
AssertionError: 'min_count' only used in add and prod

Process finished with exit code 1
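
For contrast with the assertion above, here is a small sketch of the two groupby reductions where "min_count" is accepted (sum and prod), plus a count-based mask that approximates the same rule for last. The frame, the column names `key` and `val`, and the threshold of 2 are invented for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"key": ["a", "a", "b", "b"], "val": [1.0, 2.0, 3.0, np.nan]})
grouped = df.groupby("key")["val"]

# sum and prod accept min_count: group "b" has one valid value, so with
# min_count=2 its result is NaN instead of 3.0.
summed = grouped.sum(min_count=2)
multiplied = grouped.prod(min_count=2)

# last() is masked by hand here: keep the value only where the group has
# at least two non-NaN observations.
masked_last = grouped.last().where(grouped.count() >= 2)

print(summed)
print(multiplied)
print(masked_last)
```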

@phofl phofl added Docs Groupby Resample resample method and removed Needs Triage Issue that has not been reviewed by a pandas team member labels Nov 13, 2020
@phofl phofl removed the Groupby label Nov 13, 2020
DiSchi123 (Author) commented

It might be a copy-paste error in the documentation. If so, I would still like to see this as an enhancement. Let me know which way you and the team decide to go: fix the documentation or enhance the code. Happy to raise an enhancement request in the former case.

There is a case for having a "min_count" option, especially when downsampling with .last(). The reason is that .last() takes the last non-NaN value of the bin/segment without ensuring that it is actually the last value in the time series. "min_count" can prevent undesired values from being chosen.
E.g., in the example above, the last 2 months are NaN. If the goal is to have comparable quarterly numbers, this quarter would have to be NaN since it doesn't have a full 3 months of data. This can be important when using the quarterly data for further statistical analysis. Setting "min_count" to 3 in this example ensures that the last quarter gets marked NA, as it is with .sum().
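
To make the distinction concrete, the sketch below compares the built-in .last(), which skips NaN, with a positional "last" that takes whatever value sits at the end of each bin. The helper `positional_last` is hypothetical and is passed to .agg(); it is a different rule from "min_count" and is shown only to illustrate the point about the last record.

```python
import numpy as np
import pandas as pd

index = pd.date_range(start="2020-01-31", periods=6, freq=pd.offsets.MonthEnd())
series = pd.Series([1.0, 1.0, 1.0, 1.0, np.nan, np.nan], index=index)
resampler = series.resample(pd.offsets.QuarterEnd())


def positional_last(values):
    """Return the final element of a bin, NaN included (hypothetical helper)."""
    return values.iloc[-1] if len(values) else np.nan


# Built-in last(): skips NaN, so the second quarter reports April's value.
last_valid = resampler.last()

# Positional last: the second quarter ends in June, which is NaN.
last_positional = resampler.agg(positional_last)

print(last_valid)
print(last_positional)
```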

rhshadrach (Member) commented

Looking at the implementation of the functions @phofl identified, it appears straightforward to add min_count to the implementation. +1 on adding this functionality.
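
As a rough illustration of what a "min_count"-aware last could look like, here is a pure-NumPy sketch. It is not the pandas Cython code: the function name `group_last_sketch`, its signature, and the convention that negative labels are skipped are all assumptions made for this example.

```python
import numpy as np


def group_last_sketch(values, labels, ngroups, min_count=-1):
    """Last non-NaN value per group.

    Groups with fewer than min_count non-NaN observations yield NaN.
    Hypothetical helper for illustration only.
    """
    out = np.full(ngroups, np.nan)
    nobs = np.zeros(ngroups, dtype=np.int64)
    for value, label in zip(values, labels):
        if label < 0 or np.isnan(value):
            continue  # skip rows without a group and missing values
        nobs[label] += 1
        out[label] = value  # later valid values overwrite earlier ones
    out[nobs < min_count] = np.nan
    return out


values = np.array([1.0, 1.0, 1.0, 1.0, np.nan, np.nan])
labels = np.array([0, 0, 0, 1, 1, 1])  # two quarters of three months each

default_result = group_last_sketch(values, labels, ngroups=2)
strict_result = group_last_sketch(values, labels, ngroups=2, min_count=3)
print(default_result)
print(strict_result)
```

With the default, both groups return 1.0; with min_count=3 the second group, which has a single valid observation, returns NaN.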
