BUG: resampling DataFrame with DateTimeIndex with holes and `uint64` columns leads to error on `pandas==1.3.2` (not in `1.2.5`) #43329

julienl-met · 2021-08-31T13:39:35Z

I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of pandas.
(optional) I have confirmed this bug exists on the master branch of pandas.

Code Sample, a copy-pastable example

import numpy as np
import pandas as pd

# Data generation: DataFrame with DateTimeIndex, one row per hour, values are 0 or 1.
df = pd.DataFrame(index=pd.date_range(start="2000-01-01", end="2000-01-03 23", freq="H"), columns=["x"], data=[0, 1, 0] * 24)

# Removing some rows in order to have a hole in the dataset
df = df.loc[(df.index < "2000-01-02") | (df.index > "2000-01-03"), :]

# Create dummy indicator
one_hot = pd.get_dummies(df["x"])  # uint8 columns: the resample below raises "RuntimeError: empty group with uint64_t"
# one_hot = pd.get_dummies(df["x"], dtype=int)  # With this line instead, the resample below returns the expected DataFrame

# Keeping, for each day, the maximum day value.
df_output = one_hot.resample("D").max()

# Expected_dataframe: 
df_expected = pd.DataFrame(
    index=pd.date_range(start="2000-01-01", end="2000-01-03", freq="D"),
    data={col: [1, np.nan, 1] for col in [0,1]}
)
pd.testing.assert_frame_equal(df_expected, df_output)

Problem description

With pandas==1.3.2, the code block above raises "RuntimeError: empty group with uint64_t". This was not the case with pandas==1.1.0, for instance. It is not blocking for me (specifying the dtype solves the problem), but it is probably an issue worth fixing.
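
A minimal sketch of the workaround mentioned above, assuming the same data layout as in the code sample. The index is built with a `pd.Timedelta` frequency only so that the snippet does not depend on how a given pandas version spells the hourly alias; the variable name `daily_max` is illustrative and not part of the report.

```python
import pandas as pd

# Hourly index with the same one-day hole as in the report.
index = pd.date_range("2000-01-01", periods=72, freq=pd.Timedelta(hours=1))
df = pd.DataFrame({"x": [0, 1, 0] * 24}, index=index)
df = df.loc[(df.index < "2000-01-02") | (df.index > "2000-01-03")]

# Workaround from the report: ask get_dummies for a signed integer dtype
# so that no unsigned columns reach the resample step.
one_hot = pd.get_dummies(df["x"], dtype="int64")

# Alternative under the same assumption: cast existing dummy columns first.
# one_hot = pd.get_dummies(df["x"]).astype("int64")

# The empty day (2000-01-02) shows up as a row of missing values.
daily_max = one_hot.resample("D").max()
print(daily_max)
```

Casting after the fact works just as well as passing `dtype` to `pd.get_dummies`; the point is only that the columns are signed integers before the resample.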

Expected Output

Given in code sample section

Output of `pd.show_versions()`

INSTALLED VERSIONS

commit : 5f648bf
python : 3.8.5.final.0
python-bits : 64
OS : Linux
OS-release : 5.4.72-microsoft-standard-WSL2
Version : #1 SMP Wed Oct 28 23:40:43 UTC 2020
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : C.UTF-8
LOCALE : en_US.UTF-8

pandas : 1.3.2
numpy : 1.21.2
pytz : 2021.1
dateutil : 2.8.2
pip : 21.2.4
setuptools : 57.4.0
Cython : None
pytest : 6.2.5
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : None
IPython : 7.27.0
pandas_datareader: None
bs4 : None
bottleneck : None
fsspec : None
fastparquet : None
gcsfs : None
matplotlib : 3.4.3
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pyxlsb : None
s3fs : None
scipy : 1.7.1
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
numba : None

simonjayhawkins · 2021-08-31T17:02:47Z

Thanks @julienlmet for the report

> It was not the case with pandas==1.1.0 for instance.

first bad commit: [c294bdd] CLN: remove ensure_int_or_float (#41011)

cc @jbrockmendel

debnathshoham · 2021-08-31T18:48:38Z

Hi @julienlmet, could you please confirm whether you are getting the error on one_hot = pd.get_dummies(df["x"])?
I ask because I am getting the same error on df_output = one_hot.resample("D").max() in master.

jbrockmendel · 2021-08-31T21:58:35Z

It looks like ensure_int_or_float used to cast uint8 dtype to int64, and failing to do that (when there are empty groups, as there are here) raises. So we need to restore that particular casting.

julienl-met · 2021-09-01T13:26:37Z

> Hi @julienlmet, could you please confirm whether you are getting the error on one_hot = pd.get_dummies(df["x"])?
> I ask because I am getting the same error on df_output = one_hot.resample("D").max() in master.

Hi @debnathshoham,
With the code provided above, the error occurs in my venv on the statement df_output = one_hot.resample("D").max().

Here is the traceback that I get (sorry, I should have given it when submitting the issue):

Traceback (most recent call last):
  File ".../pd132_issue.py", line 15, in <module>
    df_output = one_hot.resample("D").max()
  File ".../venv/lib/python3.8/site-packages/pandas/core/resample.py", line 986, in f
    return self._downsample(_method, min_count=min_count)
  File ".../venv/lib/python3.8/site-packages/pandas/core/resample.py", line 1146, in _downsample
    result = obj.groupby(self.grouper, axis=self.axis).aggregate(how, **kwargs)
  File ".../venv/lib/python3.8/site-packages/pandas/core/groupby/generic.py", line 979, in aggregate
    result = op.agg()
  File ".../venv/lib/python3.8/site-packages/pandas/core/apply.py", line 158, in agg
    return self.apply_str()
  File ".../venv/lib/python3.8/site-packages/pandas/core/apply.py", line 507, in apply_str
    return self._try_aggregate_string_function(obj, f, *self.args, **self.kwargs)
  File ".../venv/lib/python3.8/site-packages/pandas/core/apply.py", line 577, in _try_aggregate_string_function
    return f(*args, **kwargs)
  File ".../venv/lib/python3.8/site-packages/pandas/core/groupby/groupby.py", line 1857, in max
    return self._agg_general(
  File ".../venv/lib/python3.8/site-packages/pandas/core/groupby/groupby.py", line 1342, in _agg_general
    result = self._cython_agg_general(
  File ".../venv/lib/python3.8/site-packages/pandas/core/groupby/generic.py", line 1082, in _cython_agg_general
    new_mgr = data.grouped_reduce(array_func, ignore_failures=True)
  File ".../venv/lib/python3.8/site-packages/pandas/core/internals/managers.py", line 1243, in grouped_reduce
    applied = blk.apply(func)
  File ".../venv/lib/python3.8/site-packages/pandas/core/internals/blocks.py", line 382, in apply
    result = func(self.values, **kwargs)
  File ".../venv/lib/python3.8/site-packages/pandas/core/groupby/generic.py", line 1068, in array_func
    result = self.grouper._cython_operation(
  File ".../venv/lib/python3.8/site-packages/pandas/core/groupby/ops.py", line 978, in _cython_operation
    return cy_op.cython_operation(
  File ".../venv/lib/python3.8/site-packages/pandas/core/groupby/ops.py", line 639, in cython_operation
    return self._cython_op_ndim_compat(
  File ".../venv/lib/python3.8/site-packages/pandas/core/groupby/ops.py", line 495, in _cython_op_ndim_compat
    return self._call_cython_op(
  File ".../venv/lib/python3.8/site-packages/pandas/core/groupby/ops.py", line 548, in _call_cython_op
    func(
  File "pandas/_libs/groupby.pyx", line 1281, in pandas._libs.groupby.group_max
  File "pandas/_libs/groupby.pyx", line 1269, in pandas._libs.groupby.group_min_max
RuntimeError: empty group with uint64_t

simonjayhawkins · 2021-10-16T19:31:01Z

changing milestone to 1.3.5

simonjayhawkins · 2021-11-10T15:04:19Z

> It looks like ensure_int_or_float used to cast uint8 dtype to int64, and failing to do that (when there are empty groups, as there are here) raises. So we need to restore that particular casting.

Restoring the previous casting would fix the regression case in the OP, but there is an underlying issue (a latent bug): we cannot (and could not in 1.2.5 and before) resample and aggregate a column with uint64 values, because ensure_int_or_float cannot cast uint64 -> int64 safely.

In the code sample, the pd.get_dummies(df["x"]) step produces uint8 columns. Not casting these to int64 raises the misleading RuntimeError: empty group with uint64_t message, and this is the regression.

DataFrames with uint64 columns already raised this message before the regression; for them the error message was more accurate and informative.
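
To illustrate why a uint8 column ends up reported as `uint64_t`, here is a small NumPy-only sketch of the two lines removed in the diff further down in this comment (`dtype.kind + "8"` followed by `astype`). It only reproduces the dtype arithmetic and is not the pandas code path itself; the variable names are illustrative.

```python
import numpy as np

# A uint8 array, like the dummy columns in the original report.
values = np.array([0, 1, 0], dtype="uint8")

# The regressed branch widened integers by keeping the kind ("u" or "i")
# and forcing 8 bytes, so unsigned input stays unsigned.
dtype_str = values.dtype.kind + "8"
widened = values.astype(dtype_str, copy=False)

# A signed input follows the same rule and lands on int64.
signed_str = np.dtype("int16").kind + "8"

print(dtype_str, widened.dtype, signed_str)
```

So after this widening a uint8 column has become a uint64 array, which is the type named in the error message.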

Considering a simplified code sample (which could maybe be used as the regression test) without using pd.get_dummies:

import numpy as np
import pandas as pd

print(pd.__version__)
df = pd.DataFrame(
    index=pd.date_range(start="2000-01-01", end="2000-01-03 23", freq="12H"),
    columns=["x"],
    data=[0, 1, 0] * 2,
    dtype="uint8",
)
df = df.loc[(df.index < "2000-01-02") | (df.index > "2000-01-03"), :]
result = df.resample("D").max()
print(result)

expected = pd.DataFrame(
    [1, np.nan, 0],
    columns=["x"],
    index=pd.date_range(start="2000-01-01", end="2000-01-03 23", freq="D"),
)
pd.testing.assert_frame_equal(result, expected)

gives on 1.2.5

1.2.5
              x
2000-01-01  1.0
2000-01-02  NaN
2000-01-03  0.0

and on master

1.4.0.dev0+1068.g87b8a6ea06
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
...
RuntimeError: empty group with uint64_t

Changing dtype="uint8" to dtype="uint64" gives RuntimeError: empty group with uint64_t on both 1.2.5 and master.

I think this is the correct result and the issue title is misleading. The title should probably be "BUG: resampling DataFrame with DateTimeIndex with empty groups and uint8, uint16 and uint32 columns leads to error on pandas==1.3.2 (not in 1.2......"
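
The retitled scope (uint8, uint16 and uint32) can be exercised with a loop over the three dtypes. This is a sketch only, not the test that was eventually merged: the helper name `resample_daily_max` is hypothetical, and the index uses a `pd.Timedelta` frequency to avoid depending on a version-specific alias. On an affected release the loop is expected to raise the RuntimeError discussed here; on a release that includes the fix it should complete silently.

```python
import numpy as np
import pandas as pd


def resample_daily_max(dtype: str) -> pd.DataFrame:
    """Hypothetical helper: the simplified sample above for one dtype."""
    index = pd.date_range("2000-01-01", periods=6, freq=pd.Timedelta(hours=12))
    df = pd.DataFrame({"x": [0, 1, 0] * 2}, index=index, dtype=dtype)
    # Drop 2000-01-02 entirely (and midnight of 2000-01-03) to create an empty group.
    df = df.loc[(df.index < "2000-01-02") | (df.index > "2000-01-03")]
    return df.resample("D").max()


for dtype in ("uint8", "uint16", "uint32"):
    result = resample_daily_max(dtype)
    values = result["x"].to_numpy(dtype="float64")
    # Expected per the 1.2.5 output above: 1, NaN (empty day), 0.
    assert values[0] == 1.0 and np.isnan(values[1]) and values[2] == 0.0, dtype
```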

Restoring ensure_int_or_float to bring back the old casting is straightforward, and all tests pass with the following change.

diff --git a/pandas/core/dtypes/common.py b/pandas/core/dtypes/common.py
index 815a0a2040..132d6c9610 100644
--- a/pandas/core/dtypes/common.py
+++ b/pandas/core/dtypes/common.py
@@ -111,6 +111,54 @@ def ensure_str(value: bytes | Any) -> str:
     return value
 
 
+def ensure_int_or_float(arr: ArrayLike, copy: bool = False) -> np.ndarray:
+    """
+    Ensure that an dtype array of some integer dtype
+    has an int64 dtype if possible.
+    If it's not possible, potentially because of overflow,
+    convert the array to float64 instead.
+    Parameters
+    ----------
+    arr : array-like
+          The array whose data type we want to enforce.
+    copy: bool
+          Whether to copy the original array or reuse
+          it in place, if possible.
+    Returns
+    -------
+    out_arr : The input array cast as int64 if
+              possible without overflow.
+              Otherwise the input array cast to float64.
+    Notes
+    -----
+    If the array is explicitly of type uint64 the type
+    will remain unchanged.
+    """
+    # TODO: GH27506 potential bug with ExtensionArrays
+    try:
+        # error: No overload variant of "astype" of "ExtensionArray" matches
+        # argument types "str", "bool", "str"
+        return arr.astype(  # type: ignore[call-overload]
+            "int64", copy=copy, casting="safe"
+        )
+    except TypeError:
+        pass
+    try:
+        # error: No overload variant of "astype" of "ExtensionArray" matches
+        # argument types "str", "bool", "str"
+        return arr.astype(  # type:ignore[call-overload]
+            "uint64", copy=copy, casting="safe"
+        )
+    except TypeError:
+        if is_extension_array_dtype(arr.dtype):
+            # pandas/core/dtypes/common.py:168: error: Item "ndarray" of
+            # "Union[ExtensionArray, ndarray]" has no attribute "to_numpy"  [union-attr]
+            return arr.to_numpy(  # type: ignore[union-attr]
+                dtype="float64", na_value=np.nan
+            )
+        return arr.astype("float64", copy=copy)
+
+
 def ensure_python_int(value: int | np.integer) -> int:
     """
     Ensure that a value is a python int.
diff --git a/pandas/core/groupby/ops.py b/pandas/core/groupby/ops.py
index 60c8851f05..bf4b219455 100644
--- a/pandas/core/groupby/ops.py
+++ b/pandas/core/groupby/ops.py
@@ -44,6 +44,7 @@ from pandas.core.dtypes.cast import (
 from pandas.core.dtypes.common import (
     ensure_float64,
     ensure_int64,
+    ensure_int_or_float,
     ensure_platform_int,
     is_1d_only_ea_obj,
     is_bool_dtype,
@@ -500,9 +501,7 @@ class WrappedCythonOp:
         elif is_bool_dtype(dtype):
             values = values.astype("int64")
         elif is_integer_dtype(dtype):
-            # e.g. uint8 -> uint64, int16 -> int64
-            dtype_str = dtype.kind + "8"
-            values = values.astype(dtype_str, copy=False)
+            values = ensure_int_or_float(values)
         elif is_numeric:
             if not is_complex_dtype(dtype):
                 values = ensure_float64(values)
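
As a rough illustration of the fallback chain in the restored function, the following NumPy-only sketch tries a safe cast to int64, then to uint64, then falls back to float64. The function name `safe_int_cast` is hypothetical and the ExtensionArray branch of the diff is deliberately left out, so this is an analogue under those assumptions rather than the pandas implementation.

```python
import numpy as np


def safe_int_cast(arr: np.ndarray) -> np.ndarray:
    """Hypothetical NumPy-only analogue of the try/except chain in the diff."""
    for target in ("int64", "uint64"):
        try:
            # casting="safe" raises TypeError when values could be lost.
            return arr.astype(target, copy=False, casting="safe")
        except TypeError:
            continue
    # Neither integer target is safe, so fall back to float64.
    return arr.astype("float64", copy=False)


for name in ("uint8", "uint32", "int16", "uint64"):
    print(name, "->", safe_int_cast(np.zeros(2, dtype=name)).dtype)
```

The uint64 input is the only one of these that is not moved to int64, which matches the docstring note that an explicitly uint64 array remains unchanged.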

simonjayhawkins · 2021-11-18T11:26:32Z

@jbrockmendel any objections to #43329 (comment)? If not, I will open a PR to get this fixed for 1.3.5.

jbrockmendel · 2021-11-18T21:45:30Z

No objection, though I'd suggest something more narrowly targeted:

        elif is_integer_dtype(dtype):
            # GH#43329 helpful comment
            if dtype != np.uint64:
                values = values.astype(np.int64, copy=False)
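
The narrower suggestion can be tried in isolation with plain NumPy. This is a sketch under the assumption that only the dtype handling matters; the function name `cast_integer_values` is hypothetical and stands in for the branch inside `WrappedCythonOp`.

```python
import numpy as np


def cast_integer_values(values: np.ndarray) -> np.ndarray:
    """Hypothetical stand-in for the integer branch suggested above."""
    if values.dtype.kind in "iu" and values.dtype != np.uint64:
        # Every integer dtype except uint64 fits in int64 without loss.
        return values.astype(np.int64, copy=False)
    # uint64 is left untouched, as in the suggestion.
    return values


print(cast_integer_values(np.array([0, 1], dtype="uint8")).dtype)
print(cast_integer_values(np.array([0, 1], dtype="uint64")).dtype)
```

Leaving uint64 alone keeps the pre-existing behaviour for that dtype, while uint8, uint16 and uint32 are cast to int64 again.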

julienl-met added the Bug and Needs Triage labels Aug 31, 2021

simonjayhawkins added the Regression label and removed the Needs Triage label Aug 31, 2021

simonjayhawkins added this to the 1.3.3 milestone Aug 31, 2021

simonjayhawkins added a commit to simonjayhawkins/pandas that referenced this issue Aug 31, 2021

code sample for pandas-dev#43329

77607dd

simonjayhawkins modified the milestones: 1.3.3, 1.3.4 Sep 11, 2021

simonjayhawkins modified the milestones: 1.3.4, 1.3.5 Oct 16, 2021

simonjayhawkins mentioned this issue Nov 28, 2021

RLS: 1.3.5 #44080

Closed

simonjayhawkins mentioned this issue Dec 9, 2021

REGR: resampling DataFrame with DateTimeIndex with empty groups and uint8, uint16 or uint32 columns incorrectly raising RuntimeError #44828

Merged

jreback closed this as completed in #44828 Dec 10, 2021
