BUG: resampling DataFrame with DateTimeIndex with holes and uint64 columns leads to error on pandas==1.3.2 (not in 1.2.5) #43329

Closed
2 of 3 tasks
julienl-met opened this issue Aug 31, 2021 · 8 comments · Fixed by #44828
Labels
Bug, Regression (functionality that used to work in a prior pandas version)

@julienl-met

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • (optional) I have confirmed this bug exists on the master branch of pandas.



Code Sample, a copy-pastable example

import numpy as np
import pandas as pd

# Data generation: DataFrame with DateTimeIndex, one row per hour, values are 0 or 1.
df = pd.DataFrame(index=pd.date_range(start="2000-01-01", end="2000-01-03 23", freq="H"), columns=["x"], data=[0, 1, 0] * 24)

# Removing some rows in order to have a hole in the dataset
df = df.loc[(df.index < "2000-01-02") | (df.index > "2000-01-03"), :]

# Create dummy indicator
one_hot = pd.get_dummies(df["x"])  # This line leads to having "RuntimeError: empty group with uint64_t"
# one_hot = pd.get_dummies(df["x"], dtype=int)  # This line leads to having expected dataframe

# Keeping, for each day, the maximum day value.
df_output = one_hot.resample("D").max()

# Expected_dataframe: 
df_expected = pd.DataFrame(
    index=pd.date_range(start="2000-01-01", end="2000-01-03", freq="D"),
    data={col: [1, np.nan, 1] for col in [0,1]}
)
pd.testing.assert_frame_equal(df_expected, df_output)

Problem description

With pandas==1.3.2, the code block above raises "RuntimeError: empty group with uint64_t". It was not the case with pandas==1.1.0 for instance. This is not blocking for me (specifying dtype solves the problem), but it is probably an issue worth fixing.
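
For anyone hitting the same error, here is a minimal sketch of the workaround mentioned above. It assumes that casting the dummy columns to int64 before resampling is acceptable for your data; the index is built with a pd.Timedelta step only to avoid version-specific frequency aliases, which is my choice and not part of the original sample.

```python
import pandas as pd

# Hourly index over three days, built with a Timedelta step instead of a string alias.
idx = pd.date_range(start="2000-01-01", end="2000-01-03 23", freq=pd.Timedelta(hours=1))
df = pd.DataFrame({"x": [0, 1, 0] * 24}, index=idx)

# Drop one full day so that resampling produces an empty group.
df = df.loc[(df.index < "2000-01-02") | (df.index > "2000-01-03"), :]

# Workaround: make the indicator columns int64 before resampling.
one_hot = pd.get_dummies(df["x"]).astype("int64")
df_output = one_hot.resample("D").max()
print(df_output)
```

The row for the missing day comes back as NaN, so the result columns are floating point rather than integer.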

Expected Output

Given in code sample section

Output of pd.show_versions()

INSTALLED VERSIONS

commit : 5f648bf
python : 3.8.5.final.0
python-bits : 64
OS : Linux
OS-release : 5.4.72-microsoft-standard-WSL2
Version : #1 SMP Wed Oct 28 23:40:43 UTC 2020
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : C.UTF-8
LOCALE : en_US.UTF-8

pandas : 1.3.2
numpy : 1.21.2
pytz : 2021.1
dateutil : 2.8.2
pip : 21.2.4
setuptools : 57.4.0
Cython : None
pytest : 6.2.5
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : None
IPython : 7.27.0
pandas_datareader: None
bs4 : None
bottleneck : None
fsspec : None
fastparquet : None
gcsfs : None
matplotlib : 3.4.3
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pyxlsb : None
s3fs : None
scipy : 1.7.1
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
numba : None

@julienl-met julienl-met added the Bug and Needs Triage labels Aug 31, 2021
@simonjayhawkins simonjayhawkins added the Regression label and removed the Needs Triage label Aug 31, 2021
@simonjayhawkins simonjayhawkins added this to the 1.3.3 milestone Aug 31, 2021
@simonjayhawkins simonjayhawkins changed the title from "BUG: resampling DataFrame with DateTimeIndex with holes and uint64 columns leads to error on pandas==1.3.2 (not in 1.1.0)" to "BUG: resampling DataFrame with DateTimeIndex with holes and uint64 columns leads to error on pandas==1.3.2 (not in 1.2.5)" Aug 31, 2021
simonjayhawkins added a commit to simonjayhawkins/pandas that referenced this issue Aug 31, 2021
@simonjayhawkins
Member

Thanks @julienlmet for the report

It was not the case with pandas==1.1.0 for instance.

first bad commit: [c294bdd] CLN: remove ensure_int_or_float (#41011)

cc @jbrockmendel

@debnathshoham
Member

Hi @julienlmet , could you please confirm if you are getting the error on one_hot = pd.get_dummies(df["x"])?
Because I am getting the same error on df_output = one_hot.resample("D").max() in master.

@jbrockmendel
Member

Looks like ensure_int_or_float used to cast uint8 dtype to int64, and failing to do that (when there are empty groups, as there are here) raises. So we need to restore that particular casting.

@julienl-met
Author

Hi @julienlmet , could you please confirm if you are getting the error on one_hot = pd.get_dummies(df["x"])?
Because I am getting the same error on df_output = one_hot.resample("D").max() in master.

Hi @debnathshoham,
With the code provided above, the error occurs in my venv at the instruction df_output = one_hot.resample("D").max().

Here is the traceback that I get (sorry, I should have given it when submitting the issue):

Traceback (most recent call last):
  File ".../pd132_issue.py", line 15, in <module>
    df_output = one_hot.resample("D").max()
  File ".../venv/lib/python3.8/site-packages/pandas/core/resample.py", line 986, in f
    return self._downsample(_method, min_count=min_count)
  File ".../venv/lib/python3.8/site-packages/pandas/core/resample.py", line 1146, in _downsample
    result = obj.groupby(self.grouper, axis=self.axis).aggregate(how, **kwargs)
  File ".../venv/lib/python3.8/site-packages/pandas/core/groupby/generic.py", line 979, in aggregate
    result = op.agg()
  File ".../venv/lib/python3.8/site-packages/pandas/core/apply.py", line 158, in agg
    return self.apply_str()
  File ".../venv/lib/python3.8/site-packages/pandas/core/apply.py", line 507, in apply_str
    return self._try_aggregate_string_function(obj, f, *self.args, **self.kwargs)
  File ".../venv/lib/python3.8/site-packages/pandas/core/apply.py", line 577, in _try_aggregate_string_function
    return f(*args, **kwargs)
  File ".../venv/lib/python3.8/site-packages/pandas/core/groupby/groupby.py", line 1857, in max
    return self._agg_general(
  File ".../venv/lib/python3.8/site-packages/pandas/core/groupby/groupby.py", line 1342, in _agg_general
    result = self._cython_agg_general(
  File ".../venv/lib/python3.8/site-packages/pandas/core/groupby/generic.py", line 1082, in _cython_agg_general
    new_mgr = data.grouped_reduce(array_func, ignore_failures=True)
  File ".../venv/lib/python3.8/site-packages/pandas/core/internals/managers.py", line 1243, in grouped_reduce
    applied = blk.apply(func)
  File ".../venv/lib/python3.8/site-packages/pandas/core/internals/blocks.py", line 382, in apply
    result = func(self.values, **kwargs)
  File ".../venv/lib/python3.8/site-packages/pandas/core/groupby/generic.py", line 1068, in array_func
    result = self.grouper._cython_operation(
  File ".../venv/lib/python3.8/site-packages/pandas/core/groupby/ops.py", line 978, in _cython_operation
    return cy_op.cython_operation(
  File ".../venv/lib/python3.8/site-packages/pandas/core/groupby/ops.py", line 639, in cython_operation
    return self._cython_op_ndim_compat(
  File ".../venv/lib/python3.8/site-packages/pandas/core/groupby/ops.py", line 495, in _cython_op_ndim_compat
    return self._call_cython_op(
  File ".../venv/lib/python3.8/site-packages/pandas/core/groupby/ops.py", line 548, in _call_cython_op
    func(
  File "pandas/_libs/groupby.pyx", line 1281, in pandas._libs.groupby.group_max
  File "pandas/_libs/groupby.pyx", line 1269, in pandas._libs.groupby.group_min_max
RuntimeError: empty group with uint64_t

@simonjayhawkins simonjayhawkins modified the milestones: 1.3.3, 1.3.4 Sep 11, 2021
@simonjayhawkins
Member

changing milestone to 1.3.5

@simonjayhawkins simonjayhawkins modified the milestones: 1.3.4, 1.3.5 Oct 16, 2021
@simonjayhawkins
Member

Looks like ensure_int_or_float used to cast uint8 dtype to int64, and failing to do that (when there are empty groups, as there are here) raises. So we need to restore that particular casting.

Restoring the previous casting would fix the regression case in the OP, but there is an underlying issue (a latent bug): we cannot (and could not in 1.2.5 and before) resample and aggregate a column with uint64 values, because ensure_int_or_float cannot cast uint64 -> int64 safely.

In the code sample, the pd.get_dummies(df["x"]) step results in uint8 columns. Not casting these to int64 raises the misleading RuntimeError: empty group with uint64_t message, and this is the regression.

DataFrames with uint64 columns already raised this message before the regression; for them the error message was more accurate and informative.
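
To illustrate the casting point, here is a small NumPy-only sketch (not pandas code) that checks which unsigned dtypes can be cast to int64 under the "safe" rule, and shows the TypeError raised when a uint64 array is cast that way. The variable names are illustrative only.

```python
import numpy as np

# Which unsigned integer dtypes fit into int64 without possible overflow?
safe_to_int64 = {
    name: np.can_cast(np.dtype(name), np.int64, casting="safe")
    for name in ("uint8", "uint16", "uint32", "uint64")
}
print(safe_to_int64)

# A uint64 value above the int64 range; a "safe" cast must refuse it.
big = np.array([2**63], dtype="uint64")
raised = False
try:
    big.astype("int64", casting="safe")
except TypeError as err:
    raised = True
    print("TypeError:", err)
```

Only uint64 is rejected, which matches the split between the regression (uint8, uint16, uint32) and the latent uint64 issue.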


Considering a simplified code sample (which could maybe be used as the regression test) without using pd.get_dummies:

import numpy as np
import pandas as pd

print(pd.__version__)
df = pd.DataFrame(
    index=pd.date_range(start="2000-01-01", end="2000-01-03 23", freq="12H"),
    columns=["x"],
    data=[0, 1, 0] * 2,
    dtype="uint8",
)
df = df.loc[(df.index < "2000-01-02") | (df.index > "2000-01-03"), :]
result = df.resample("D").max()
print(result)

expected = pd.DataFrame(
    [1, np.nan, 0],
    columns=["x"],
    index=pd.date_range(start="2000-01-01", end="2000-01-03 23", freq="D"),
)
pd.testing.assert_frame_equal(result, expected)

gives on 1.2.5

1.2.5
              x
2000-01-01  1.0
2000-01-02  NaN
2000-01-03  0.0

and on master

1.4.0.dev0+1068.g87b8a6ea06
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
...
RuntimeError: empty group with uint64_t

Changing dtype="uint8" to dtype="uint64" gives RuntimeError: empty group with uint64_t on both 1.2.5 and master.

I think this is the correct result and this issue title is misleading. The title should probably be "BUG: resampling DataFrame with DateTimeIndex with empty groups and uint8, uint16 and uint32 columns leads to error on pandas==1.3.2 (not in 1.2......"
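
As a rough sketch of how the simplified sample could be run over the three affected dtypes, assuming a pandas release that includes the fix. resample_max_with_gap is a hypothetical helper name, and the index uses a pd.Timedelta step purely to sidestep version-specific frequency aliases.

```python
import numpy as np
import pandas as pd


def resample_max_with_gap(dtype):
    """Hypothetical helper: daily max of a 12-hourly column with one day missing."""
    idx = pd.date_range("2000-01-01", periods=6, freq=pd.Timedelta(hours=12))
    df = pd.DataFrame({"x": np.array([0, 1, 0] * 2, dtype=dtype)}, index=idx)
    # Remove the rows of 2000-01-02 (and midnight of 2000-01-03) to create an empty group.
    df = df.loc[(df.index < "2000-01-02") | (df.index > "2000-01-03"), :]
    return df.resample("D").max()


results = {dtype: resample_max_with_gap(dtype) for dtype in ("uint8", "uint16", "uint32")}
for dtype, result in results.items():
    print(dtype)
    print(result)
```

uint64 is deliberately left out, since it is the separate latent issue described above.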


Restoring ensure_int_or_float to ensure the old casting is straightforward, and all tests pass with the following change.

diff --git a/pandas/core/dtypes/common.py b/pandas/core/dtypes/common.py
index 815a0a2040..132d6c9610 100644
--- a/pandas/core/dtypes/common.py
+++ b/pandas/core/dtypes/common.py
@@ -111,6 +111,54 @@ def ensure_str(value: bytes | Any) -> str:
     return value
 
 
+def ensure_int_or_float(arr: ArrayLike, copy: bool = False) -> np.ndarray:
+    """
+    Ensure that an dtype array of some integer dtype
+    has an int64 dtype if possible.
+    If it's not possible, potentially because of overflow,
+    convert the array to float64 instead.
+    Parameters
+    ----------
+    arr : array-like
+          The array whose data type we want to enforce.
+    copy: bool
+          Whether to copy the original array or reuse
+          it in place, if possible.
+    Returns
+    -------
+    out_arr : The input array cast as int64 if
+              possible without overflow.
+              Otherwise the input array cast to float64.
+    Notes
+    -----
+    If the array is explicitly of type uint64 the type
+    will remain unchanged.
+    """
+    # TODO: GH27506 potential bug with ExtensionArrays
+    try:
+        # error: No overload variant of "astype" of "ExtensionArray" matches
+        # argument types "str", "bool", "str"
+        return arr.astype(  # type: ignore[call-overload]
+            "int64", copy=copy, casting="safe"
+        )
+    except TypeError:
+        pass
+    try:
+        # error: No overload variant of "astype" of "ExtensionArray" matches
+        # argument types "str", "bool", "str"
+        return arr.astype(  # type:ignore[call-overload]
+            "uint64", copy=copy, casting="safe"
+        )
+    except TypeError:
+        if is_extension_array_dtype(arr.dtype):
+            # pandas/core/dtypes/common.py:168: error: Item "ndarray" of
+            # "Union[ExtensionArray, ndarray]" has no attribute "to_numpy"  [union-attr]
+            return arr.to_numpy(  # type: ignore[union-attr]
+                dtype="float64", na_value=np.nan
+            )
+        return arr.astype("float64", copy=copy)
+
+
 def ensure_python_int(value: int | np.integer) -> int:
     """
     Ensure that a value is a python int.
diff --git a/pandas/core/groupby/ops.py b/pandas/core/groupby/ops.py
index 60c8851f05..bf4b219455 100644
--- a/pandas/core/groupby/ops.py
+++ b/pandas/core/groupby/ops.py
@@ -44,6 +44,7 @@ from pandas.core.dtypes.cast import (
 from pandas.core.dtypes.common import (
     ensure_float64,
     ensure_int64,
+    ensure_int_or_float,
     ensure_platform_int,
     is_1d_only_ea_obj,
     is_bool_dtype,
@@ -500,9 +501,7 @@ class WrappedCythonOp:
         elif is_bool_dtype(dtype):
             values = values.astype("int64")
         elif is_integer_dtype(dtype):
-            # e.g. uint8 -> uint64, int16 -> int64
-            dtype_str = dtype.kind + "8"
-            values = values.astype(dtype_str, copy=False)
+            values = ensure_int_or_float(values)
         elif is_numeric:
             if not is_complex_dtype(dtype):
                 values = ensure_float64(values)
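
To make the difference between the two code paths in the diff concrete, here is a simplified NumPy-only sketch. old_style_cast mimics the int64 -> uint64 -> float64 fallback chain of ensure_int_or_float without the ExtensionArray branch, and kind_based_cast mimics the lines the diff removes. Both function names are made up for this illustration.

```python
import numpy as np


def old_style_cast(values):
    """Sketch of the restored behaviour: try int64, then uint64, else float64."""
    for target in ("int64", "uint64"):
        try:
            return values.astype(target, copy=False, casting="safe")
        except TypeError:
            continue
    return values.astype("float64", copy=False)


def kind_based_cast(values):
    """Sketch of the removed lines: keep the kind, widen to 8 bytes."""
    return values.astype(values.dtype.kind + "8", copy=False)


for name in ("uint8", "int16", "uint64"):
    arr = np.array([0, 1, 0], dtype=name)
    print(name, old_style_cast(arr).dtype, kind_based_cast(arr).dtype)
```

The two only disagree for the small unsigned dtypes: the kind-based cast sends uint8 to uint64, while the fallback chain sends it to int64.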

@simonjayhawkins
Member

@jbrockmendel any objections to #43329 (comment)? If not, I will open a PR to get this fixed for 1.3.5.

@jbrockmendel
Member

No objection, though I'd suggest something more narrowly targeted:

         elif is_integer_dtype(dtype):
             # GH#43329 helpful comment
             if dtype != np.uint64:
                 values = values.astype(np.int64, copy=False)
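
A standalone sketch of that narrower rule, outside of WrappedCythonOp, might look like the following. cast_integer_values is a hypothetical name used only here, and the snippet operates on plain NumPy arrays rather than on the internal groupby code.

```python
import numpy as np
from pandas.api.types import is_integer_dtype


def cast_integer_values(values):
    """Hypothetical helper: widen integer arrays to int64, leaving uint64 untouched."""
    dtype = values.dtype
    if is_integer_dtype(dtype) and dtype != np.uint64:
        return values.astype(np.int64, copy=False)
    return values


for name in ("uint8", "uint32", "int16", "uint64", "float64"):
    arr = np.array([0, 1, 0], dtype=name)
    print(name, "->", cast_integer_values(arr).dtype)
```

Leaving uint64 alone avoids an unsafe cast; non-integer arrays pass through unchanged.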
