BUG: .unstack() with recarray column raises TypeError since 1.4.0 #49388

Open
3 tasks done
stefan-jansen opened this issue Oct 29, 2022 · 5 comments
Labels
Bug Constructors Series/DataFrame/Index/pd.array Constructors DataFrame DataFrame data structure

Comments

@stefan-jansen

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import numpy as np
import pandas as pd

c = np.array([2] * 20, dtype='f8')
r = np.rec.fromarrays([c], names=['c'])

df = pd.DataFrame({'a': np.arange(20) // 5, 'b': list('ABCDE') * 4, 'c': r})
# df.info() => see GH48526
df.set_index(['a', 'b']).c.unstack()

Issue Description

Starting with 1.4.0, including a column of dtype np.record (related to #48526) as follows:

c = np.array([2] * 9, dtype='f8')
r = np.rec.fromarrays([c], names=['c'])
df = pd.DataFrame({'a': np.arange(9) // 3,
                   'b': list('ABC') * 3,
                   'c': r})
print(df.dtypes)

a                             int64
b                            object
c    (numpy.record, [('c', '<f8')])

raises a TypeError:

   df.set_index(['a', 'b']).unstack()
  File "/home/stefan/.pyenv/versions/pd_bug/lib/python3.9/site-packages/pandas/core/frame.py", line 9060, in unstack
    result = unstack(self, level, fill_value)
  File "/home/stefan/.pyenv/versions/pd_bug/lib/python3.9/site-packages/pandas/core/reshape/reshape.py", line 479, in unstack
    return _unstack_frame(obj, level, fill_value=fill_value)
  File "/home/stefan/.pyenv/versions/pd_bug/lib/python3.9/site-packages/pandas/core/reshape/reshape.py", line 508, in _unstack_frame
    return unstacker.get_result(
  File "/home/stefan/.pyenv/versions/pd_bug/lib/python3.9/site-packages/pandas/core/reshape/reshape.py", line 215, in get_result
    values, _ = self.get_new_values(values, fill_value)
  File "/home/stefan/.pyenv/versions/pd_bug/lib/python3.9/site-packages/pandas/core/reshape/reshape.py", line 228, in get_new_values
    sorted_values = self._make_sorted_values(values)
  File "/home/stefan/.pyenv/versions/pd_bug/lib/python3.9/site-packages/pandas/core/reshape/reshape.py", line 167, in _make_sorted_values
    sorted_values = algos.take_nd(values, indexer, axis=0)
  File "/home/stefan/.pyenv/versions/pd_bug/lib/python3.9/site-packages/pandas/core/array_algos/take.py", line 118, in take_nd
    return _take_nd_ndarray(arr, indexer, axis, fill_value, allow_fill)
  File "/home/stefan/.pyenv/versions/pd_bug/lib/python3.9/site-packages/pandas/core/array_algos/take.py", line 135, in _take_nd_ndarray
    dtype, fill_value, mask_info = _take_preprocess_indexer_and_fill_value(
  File "/home/stefan/.pyenv/versions/pd_bug/lib/python3.9/site-packages/pandas/core/array_algos/take.py", line 587, in _take_preprocess_indexer_and_fill_value
    dtype, fill_value = arr.dtype, arr.dtype.type()
TypeError: void() takes exactly 1 positional argument (0 given)
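
The last frame can be looked at in isolation. A minimal sketch using only NumPy, assuming the failure comes down to calling the dtype's scalar type with no arguments (as the final traceback line suggests); the exact error text may differ between NumPy versions:

```python
import numpy as np

c = np.array([2] * 9, dtype='f8')
r = np.rec.fromarrays([c], names=['c'])

# The scalar type behind a recarray's dtype is np.record, a subclass of np.void.
scalar_type = r.dtype.type
print(scalar_type, issubclass(scalar_type, np.void))

# Ordinary scalar types can be called with no arguments and give a default value.
print(np.dtype('f8').type())

# np.void-based types need an argument, so the same call fails with a TypeError.
try:
    scalar_type()
    raised = False
except TypeError as exc:
    raised = True
    print(type(exc).__name__, exc)
```

So `arr.dtype.type()` works as a way to get a default fill value for plain numeric dtypes but not for record dtypes.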

Expected Behavior

Before 1.4.0, the output of the same code shows that the recarray has dtype object and the unstack does not throw an error:

a     int64
b    object
c    object
dtype: object
        c                
b       A       B       C
a                        
0  (2.0,)  (2.0,)  (2.0,)
1  (2.0,)  (2.0,)  (2.0,)
2  (2.0,)  (2.0,)  (2.0,)

Installed Versions

INSTALLED VERSIONS

commit : 91111fd
python : 3.9.13.final.0
python-bits : 64
OS : Linux
OS-release : 5.15.0-52-generic
Version : #58-Ubuntu SMP Thu Oct 13 08:03:55 UTC 2022
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 1.5.1
numpy : 1.23.4
pytz : 2022.5
dateutil : 2.8.2
setuptools : 65.5.0
pip : 22.3
Cython : 0.29.32
pytest : 6.2.5
hypothesis : None
sphinx : 5.3.0
blosc : None
feather : None
xlsxwriter : None
lxml.etree : 4.9.1
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 3.1.2
IPython : 8.5.0
pandas_datareader: 0.10.0
bs4 : 4.11.1
bottleneck : 1.3.5
brotli : None
fastparquet : None
fsspec : None
gcsfs : None
matplotlib : 3.6.1
numba : None
numexpr : 2.8.4
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : 1.9.3
snappy : None
sqlalchemy : 1.4.42
tables : 3.7.0
tabulate : None
xarray : None
xlrd : None
xlwt : None
zstandard : None
tzdata : None

@stefan-jansen stefan-jansen added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Oct 29, 2022
@topper-123
Contributor

topper-123 commented Oct 30, 2022

Yes. I can confirm this error.

Also, just printing your original DataFrame gives an error:

import numpy as np
import pandas as pd

c = np.array([2] * 20, dtype='f8')
r = np.rec.fromarrays([c], names=['c'])

df = pd.DataFrame({'a': np.arange(20) // 5, 'b': list('ABCDE') * 4, 'c': r})
print(df)

gives error:

TypeError: ufunc 'isnan' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''

Want to take a stab at this, @stefan-jansen?

@topper-123 topper-123 added Reshaping Concat, Merge/Join, Stack/Unstack, Explode and removed Needs Triage Issue that has not been reviewed by a pandas team member labels Oct 30, 2022
@jbrockmendel
Member

I don't think we support record dtypes. These should be cast somewhere along the way.

@topper-123
Contributor

OK, thanks. It makes sense, then, that several errors pop up when using them :-).

I see instantiating Series using recarrays gives an error:

In [1]: r = np.rec.fromarrays([c], names=['c'])
In [2]: pd.Series(r)
ValueError: Cannot construct a Series from an ndarray with compound dtype.  Use DataFrame instead.

The case with pd.DataFrame({'c': r}) should logically give a similar error, so I say OP's example should have raised in the DataFrame construction.
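
To make the suggestion concrete, here is a rough sketch of the kind of check that could run on each column before construction. `has_compound_dtype` and `check_column` are hypothetical names made up for illustration, not pandas functions, and the error text only mimics the Series message above:

```python
import numpy as np


def has_compound_dtype(values) -> bool:
    """Hypothetical helper: True if `values` is an ndarray with named fields."""
    return isinstance(values, np.ndarray) and values.dtype.names is not None


def check_column(name, values):
    """Hypothetical validation step that rejects compound-dtype columns."""
    if has_compound_dtype(values):
        raise ValueError(
            f"Cannot construct column {name!r} from an ndarray with compound dtype."
        )
    return values


c = np.array([2] * 9, dtype='f8')
r = np.rec.fromarrays([c], names=['c'])

check_column('plain', c)  # passes through unchanged
try:
    check_column('c', r)
    rejected = False
except ValueError as exc:
    rejected = True
    print(exc)
```

Checking `dtype.names` catches both plain structured arrays and recarrays, since both carry named fields.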

@topper-123 topper-123 added DataFrame DataFrame data structure Constructors Series/DataFrame/Index/pd.array Constructors and removed Reshaping Concat, Merge/Join, Stack/Unstack, Explode labels Oct 30, 2022
@topper-123 topper-123 reopened this Oct 31, 2022
@topper-123
Contributor

topper-123 commented Oct 31, 2022

Looking further, we also can't use dataframes (a multidim object) as single columns in a dataframe:

In [1]: df = pd.DataFrame({"a": "a b a b".split(), "b": range(4)})
In [2]: pd.DataFrame({"a": df}, index=df.index)
ValueError: Data must be 1-dimensional

IMO we should disallow constructing single columns from multidim objects (like recarrays and DataFrames), in order to keep things consistent.

@jbrockmendel, do you agree?

@stefan-jansen
Author

It seems like the np.record case worked before 1.4.0 because it was cast to / treated as dtype 'object'.

I couldn't find which change in 1.4.0 made np.record keep its dtype through the constructor, but that change may be the source. It appeared to relate to the BlockManager, but I didn't have the time to drill deeper. #48637 (the print / .info() error) also appears with np.record but has a different cause.
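
In the meantime, a possible workaround sketch (not a fix in pandas itself) is to avoid handing the record dtype to the constructor at all. It assumes either the plain field values or object-dtype tuples are acceptable in the column; the second option is meant to resemble the pre-1.4.0 output shown above:

```python
import numpy as np
import pandas as pd

c = np.array([2] * 9, dtype='f8')
r = np.rec.fromarrays([c], names=['c'])
a = np.arange(9) // 3
b = list('ABC') * 3

# Option 1: pass the plain float field instead of the whole recarray.
df_field = pd.DataFrame({'a': a, 'b': b, 'c': r['c']})
wide_field = df_field.set_index(['a', 'b']).c.unstack()
print(wide_field)

# Option 2: convert the records to Python tuples, giving an object column.
df_obj = pd.DataFrame({'a': a, 'b': b, 'c': r.tolist()})
wide_obj = df_obj.set_index(['a', 'b']).c.unstack()
print(df_obj.dtypes)
print(wide_obj)
```

Option 1 keeps a numeric dtype, which is usually what you want when the recarray has a single field. Option 2 keeps each record together as one cell value.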
