BUG: (regression? v2 vs v1.5) ValueError: Big-endian buffer not supported on little-endian compiler #53234

Open
3 tasks done
st-bender opened this issue May 15, 2023 · 13 comments
Labels: Constructors (Series/DataFrame/Index/pd.array Constructors), Regression (Functionality that used to work in a prior pandas version)
@st-bender

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import numpy as np
import pandas as pd
import xarray as xr

ds = xr.Dataset(
    {
        "a": (("x", "y"), np.arange(24).reshape((6, 4)))
    },
    coords={"x": np.arange(6, dtype=">f4")}
)

# raises
# ValueError: Big-endian buffer not supported on little-endian compiler
# on pandas 2.0.1 but *not* on pandas 1.5.3
dsi = ds.interp(x=np.array([1.3, 2.5]))

Issue Description

Hi there,
I observed that one of my tests failed with
ValueError: Big-endian buffer not supported on little-endian compiler
which had no problem before.
I am not sure what changed internally and how, but observed that this is raised when using pandas version 2 and still succeeds with pandas 1.5.

I found some old reports and the FAQ, but since the behaviour is different between versions, this might be of interest anyway. Or maybe I need to file it with xarray.

The full traceback is:

_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
.tox/py39/lib/python3.9/site-packages/xarray/core/dataset.py:3366: in interp
    obj, newidx = missing._localize(obj, {k: v})
.tox/py39/lib/python3.9/site-packages/xarray/core/missing.py:565: in _localize
    imin = index.get_indexer([minval], method="nearest").item()
.tox/py39/lib/python3.9/site-packages/pandas/core/indexes/base.py:3730: in get_indexer
    if not self._index_as_unique:
.tox/py39/lib/python3.9/site-packages/pandas/core/indexes/base.py:6006: in _index_as_unique
    return self.is_unique
pandas/_libs/properties.pyx:36: in pandas._libs.properties.CachedProperty.__get__
    ???
.tox/py39/lib/python3.9/site-packages/pandas/core/indexes/base.py:2238: in is_unique
    return self._engine.is_unique
pandas/_libs/index.pyx:236: in pandas._libs.index.IndexEngine.is_unique.__get__
    ???
pandas/_libs/index.pyx:241: in pandas._libs.index.IndexEngine._do_unique_check
    ???
pandas/_libs/index.pyx:303: in pandas._libs.index.IndexEngine._ensure_mapping_populated
    ???
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

>   ???
E   ValueError: Big-endian buffer not supported on little-endian compiler

pandas/_libs/hashtable_class_helper.pxi:7104: ValueError

A slightly different test also provides the function name:

File "pandas/_libs/hashtable_class_helper.pxi", line 7104, in pandas._libs.hashtable.PyObjectHashTable.map_locations
ValueError: Big-endian buffer not supported on little-endian compiler

Expected Behavior

No exception is raised.

Installed Versions

INSTALLED VERSIONS

commit : 37ea63d
python : 3.9.16.final.0
python-bits : 64
OS : Linux
OS-release : 3.10.0-1160.71.1.el7.x86_64
Version : #1 SMP Tue Jun 28 15:37:28 UTC 2022
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : en_GB.UTF-8
LANG : en_GB.UTF-8
LOCALE : en_GB.UTF-8

pandas : 2.0.1
numpy : 1.24.3
pytz : 2023.3
dateutil : 2.8.2
setuptools : 67.4.0
pip : 23.0.1
pytest : 7.3.1
scipy : 1.10.1
xarray : 2023.4.2
tzdata : 2023.3

(all others are "none")

@st-bender added the Bug and Needs Triage labels May 15, 2023
@jbrockmendel
Member

> Or maybe I need to file it with xarray.

It looks like the error is getting raised inside pandas, so this is a fine place to report. It would be helpful if you could narrow it down to a reproducible example that doesn't require xarray.

@st-bender
Author

@jbrockmendel Thanks for looking into it. It looks like index.get_indexer() fails; here is an updated test case without xarray:

import numpy as np
import pandas as pd

df = pd.DataFrame(
    data=np.arange(6),
    index=np.arange(6, dtype=">f4"),
    columns=["x"],
)
dfi = df.index.get_indexer([1.3], method="nearest")

Same effect, works with pandas 1.5 but raises an exception with pandas 2.0.

@jbrockmendel
Member

Thanks for updating the example, much easier to look into on our end!

Looks like we should probably disallow big-endian dtypes in the Index constructor.

@jbrockmendel added the Constructors label May 22, 2023
@st-bender
Author

> Thanks for updating the example, much easier to look into on our end!
>
> Looks like we should probably disallow big-endian dtypes in the Index constructor.

I am not sure that is a good idea. It will probably break downstream packages such as xarray, e.g. when reading netCDF files with a different endianness than the system. Sometimes the user has no control over the endianness because the files are produced on different systems. Those users would be left unable to read and process such files.
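To illustrate how such arrays can arise downstream, here is a minimal sketch. It assumes a file format that stores float32 values in big-endian order; the byte string is a made-up stand-in for data read from disk and does not come from any real netCDF reader.

```python
import struct

import numpy as np

# Three float32 values serialized in big-endian order, standing in for
# bytes read from a file written in that byte order (illustrative only).
raw = struct.pack(">3f", 0.0, 1.0, 2.0)

# Interpreting the buffer with its on-disk byte order gives the right
# values, but the array keeps the explicit ">f4" dtype.
coords = np.frombuffer(raw, dtype=">f4")

print(coords.tolist())   # [0.0, 1.0, 2.0]
print(coords.dtype.str)  # >f4
```

On a little-endian host, an array like `coords` is exactly the kind of input that ends up as a non-native index.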

Also, if I understand correctly, this is not an issue with big-endian per se; it arises when the endianness of the index is opposite to the system's endianness. Unfortunately I cannot test the case of a little-endian index on a big-endian system.
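The distinction between "big-endian" and "non-native" can be checked directly in NumPy. The following is only a sketch: the helper name `has_native_byteorder` is hypothetical, while `np.dtype(...).isnative` and `sys.byteorder` are existing attributes.

```python
import sys

import numpy as np


def has_native_byteorder(dtype) -> bool:
    """Return True if the dtype's byte order matches the host's (hypothetical helper)."""
    return np.dtype(dtype).isnative


# Build explicit native and swapped float32 dtype strings for this host.
native = "<f4" if sys.byteorder == "little" else ">f4"
swapped = ">f4" if sys.byteorder == "little" else "<f4"

print(has_native_byteorder(native))   # True
print(has_native_byteorder(swapped))  # False
```

On a little-endian machine `swapped` is `">f4"`, the dtype from the report; on a big-endian machine the problematic dtype would presumably be `"<f4"` instead.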

@jbrockmendel
Member

My best guess (worth checking) is that in 1.5 we silently converted to little-endian, which would make a copy. If that guess is correct, then the choice is either to restore that behavior or to raise, telling users to convert themselves. I lean towards raising, but wouldn't mind either way.

@st-bender
Author

Indeed, it looks like pandas 1.5 converts to native byte order: on a little-endian machine the dtype changes from '>f4' to 'float', whereas in v2 it stays '>f4'. Note that native order can be either little or big endian, so converting only big-endian input might catch just half the cases.

I would prefer backwards compatibility; converting internally seems to have worked fine so far. One could raise a warning, though, so that the user can decide whether it is important or not.

@lithomas1 removed the Needs Triage label May 30, 2023
@ejhyer

ejhyer commented Aug 30, 2023

Just writing to bump this seeing no activity. This is obviously an edge case that won't affect many people, but it's still an egregious regression. Based on the consequences, namely, netCDF files generated on some systems becoming unreadable via xarray on other systems, I would say the only appropriate course is to restore this transform to Pandas.

@jbrockmendel
Member

A PR would be welcome.

@ejhyer

ejhyer commented Aug 30, 2023

Apparently, this is a documented issue in gotchas.rst in tagged releases at least as far back as v1.0.0: https://github.com/pandas-dev/pandas/blob/609c3b74b0da87e5c1f36bcf4f6b490ac94413a0/doc/source/user_guide/gotchas.rst#byte-ordering-issues

But it did work in v1.5.3, and in fact works in v2+ for many operations that aren't get_indexer(). The other methods implemented in https://github.com/pandas-dev/pandas/blame/main/pandas/core/indexes/base.py still do this conversion automatically. Here is an example that illustrates what works in old and new versions of pandas:

import numpy as np
import pandas as pd

df = pd.DataFrame(
    data=np.arange(6),
    index=np.arange(6, dtype=">f4"),
    columns=["x"],
)
df2 = pd.DataFrame(
    data=np.arange(6),
    index=np.arange(6, dtype="<f4")+3,
    columns=["x"],
)
print(df.index.union(df2.index))
print(df.index.intersection(df2.index))
print(df.index.get_indexer(df2.index[0:2], method="nearest"))

With pandas v1.5.3:

Float64Index([0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0], dtype='float64')
Float64Index([3.0, 4.0, 5.0], dtype='float64')
[3 4]

With pandas v2.0.3:

Index([0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0], dtype='float32')
Index([3.0, 4.0, 5.0], dtype='float32')
<...>
ValueError: Big-endian buffer not supported on little-endian compiler

It's a very blunt recasting clause in union() and the other routines: https://github.com/pandas-dev/pandas/blob/609c3b74b0da87e5c1f36bcf4f6b490ac94413a0/pandas/core/indexes/base.py#L3289C11-L3292C48
There is a recasting clause in get_indexer() but it's slightly different: https://github.com/pandas-dev/pandas/blob/609c3b74b0da87e5c1f36bcf4f6b490ac94413a0/pandas/core/indexes/base.py#L3901C2-L3910C14

It's not clear to me how (or if) that difference is causing this behavior:

@st-bender
Author

Hi there,
Thanks for still looking into it. I eventually programmed around it by converting the endianness myself after reading the file, before doing any indexing or selecting.
I got it to work for the big -> little endian case (using xarray) by turning ">" into "<" in the dtype string and using .astype(). It might be a bit trickier for the general case, or it might have some unwanted side effects, but it worked for me.

In @ejhyer's example, it looks like pandas 1.5.x converts all float types to float64, but pandas 2.0.x keeps the types as float32 and also keeps the endianness.
I don't know which one is better. I'd probably prefer the new behaviour, which seems to have fewer unintended type conversions (from a user's point of view), except for the indexing issue. The last example works when converting the first index to little endian before indexing:

print(df.index.astype("<f4").get_indexer(df2.index[0:2], method="nearest"))

Note that it does not work the other way around, i.e. converting both to big-endian on a little-endian machine. I can't test the behaviour on a big-endian machine.
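For the general case, one option is to ask NumPy for native order instead of hard-coding "<". The following is a minimal sketch, not pandas or xarray behaviour: the helper name `to_native_byteorder` is hypothetical, and it relies on `np.dtype.newbyteorder`, where `"="` means native and `"S"` means swapped.

```python
import numpy as np


def to_native_byteorder(values: np.ndarray) -> np.ndarray:
    """Return values in the host's byte order, copying only if needed (hypothetical helper)."""
    if values.dtype.isnative:
        return values
    # "=" requests native order, whichever that is on the host.
    return values.astype(values.dtype.newbyteorder("="))


# "S" swaps relative to native, so this is non-native on any host.
swapped = np.dtype("f4").newbyteorder("S")
coords = np.arange(6, dtype=swapped)

native_coords = to_native_byteorder(coords)
print(native_coords.dtype.isnative)  # True
print(native_coords.tolist())        # [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
```

The converted array could then be used as the index (or coordinate) before any indexing or selecting. The conversion copies the data, which may matter for large arrays.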

@ejhyer

ejhyer commented Aug 31, 2023

  1. Here is an even shorter test case:

import numpy as np
import pandas as pd
idx = pd.Index(np.array([1, 5, 7]).astype('>f4'))
idx.is_unique

  2. My argument for restoring automatic byteswap/recast basically boils down to "many other numpy/pandas/xarray operations do this (silently)." So endianness is transparent to the user in many cases, except when attempting certain pandas operations.
  3. My understanding of the internals of pandas is insufficient to go much farther. @jbrockmendel said:

> we should probably disallow big-endian dtypes in the Index constructor.

I agree with this. Looking at the case above, I think my preference would be for an automatic byteswap/recast, and I think a Warning is appropriate if the function returns (or could return) something with a dtype different from what the user explicitly asked for. The Warning could possibly be something like RuntimeWarning: values for Index automatically recast to system endianness.
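As a rough illustration of that proposal, the sketch below recasts non-native input and emits the suggested warning. It is an assumption about how such a step could look, not the pandas implementation; `recast_with_warning` is a hypothetical name.

```python
import warnings

import numpy as np


def recast_with_warning(values: np.ndarray) -> np.ndarray:
    """Recast non-native input to native byte order and warn (hypothetical helper)."""
    if values.dtype.isnative:
        # Nothing to do: the caller gets back exactly what was passed in.
        return values
    warnings.warn(
        "values for Index automatically recast to system endianness",
        RuntimeWarning,
        stacklevel=2,
    )
    # astype copies the data into the host's native byte order.
    return values.astype(values.dtype.newbyteorder("="))
```

Because the warning is only raised when the returned dtype differs from the input dtype, users with native data would see no change.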

@st-bender
Author

> > we should probably disallow big-endian dtypes in the Index constructor.
>
> I agree with this.

Wouldn't that render pandas unusable on big-endian machines?

@ejhyer

ejhyer commented Sep 1, 2023

> Wouldn't that render pandas unusable on big-endian machines?

I should have said "disallow construction of Indexes with non-native endianness."

@jorisvandenbossche added the Regression label and removed the Bug label Sep 14, 2023
@jorisvandenbossche added this to the 2.1.1 milestone Sep 14, 2023
@lithomas1 modified the milestones: 2.1.1, 2.1.2 Sep 21, 2023
@lithomas1 modified the milestones: 2.1.2, 2.1.3 Oct 26, 2023
@jorisvandenbossche modified the milestones: 2.1.3, 2.1.4 Nov 13, 2023
@lithomas1 modified the milestones: 2.1.4, 2.2 Dec 8, 2023
@lithomas1 modified the milestones: 2.2, 2.2.1 Jan 20, 2024
@lithomas1 modified the milestones: 2.2.1, 2.2.2 Feb 23, 2024
@lithomas1 modified the milestones: 2.2.2, 2.2.3 Apr 11, 2024

5 participants