BUG: groupby and agg on read-only array gives ValueError: buffer source array is read-only #36014

jeet-parekh · 2020-08-31T18:23:31Z

I have checked that this issue has not already been reported.
Two variants of this bug have been reported - BUG: pd.read_parquet with pyarrow fails when row number is 0 and contains Pandas extensions type #35436 and BUG: read-only buffer failures in datetime parsing #34857

EDIT: I read into those two issues a bit more. They don't seem similar. But I'll keep it there.
I have confirmed this bug exists on the latest version of pandas.
Bug exists in pandas 1.1.1

Code Sample, a copy-pastable example

import pandas as pd
import pyarrow as pa

df = pd.DataFrame(
    {
        "sepal_length": [5.1, 4.9, 4.7, 4.6, 5.0],
        "species": ["setosa", "setosa", "setosa", "setosa", "setosa"],
    }
)

context = pa.default_serialization_context()
data = context.serialize(df).to_buffer().to_pybytes()
df_new = context.deserialize(data)

# this fails
df_new.groupby(["species"]).agg({"sepal_length": "sum"})

# this works
# df_new.copy().groupby(["species"]).agg({"sepal_length": "sum"})

Problem description

This is the traceback.

Traceback (most recent call last):
  File "demo.py", line 16, in <module>
    df_new.groupby(["species"]).agg({"sepal_length": "sum"})
  File "/home/jeet/miniconda3/envs/rnd/lib/python3.8/site-packages/pandas/core/groupby/generic.py", line 949, in aggregate
    result, how = self._aggregate(func, *args, **kwargs)
  File "/home/jeet/miniconda3/envs/rnd/lib/python3.8/site-packages/pandas/core/base.py", line 416, in _aggregate
    result = _agg(arg, _agg_1dim)
  File "/home/jeet/miniconda3/envs/rnd/lib/python3.8/site-packages/pandas/core/base.py", line 383, in _agg
    result[fname] = func(fname, agg_how)
  File "/home/jeet/miniconda3/envs/rnd/lib/python3.8/site-packages/pandas/core/base.py", line 367, in _agg_1dim
    return colg.aggregate(how)
  File "/home/jeet/miniconda3/envs/rnd/lib/python3.8/site-packages/pandas/core/groupby/generic.py", line 240, in aggregate
    return getattr(self, func)(*args, **kwargs)
  File "/home/jeet/miniconda3/envs/rnd/lib/python3.8/site-packages/pandas/core/groupby/groupby.py", line 1539, in sum
    return self._agg_general(
  File "/home/jeet/miniconda3/envs/rnd/lib/python3.8/site-packages/pandas/core/groupby/groupby.py", line 999, in _agg_general
    return self._cython_agg_general(
  File "/home/jeet/miniconda3/envs/rnd/lib/python3.8/site-packages/pandas/core/groupby/groupby.py", line 1033, in _cython_agg_general
    result, agg_names = self.grouper.aggregate(
  File "/home/jeet/miniconda3/envs/rnd/lib/python3.8/site-packages/pandas/core/groupby/ops.py", line 584, in aggregate
    return self._cython_operation(
  File "/home/jeet/miniconda3/envs/rnd/lib/python3.8/site-packages/pandas/core/groupby/ops.py", line 537, in _cython_operation
    result = self._aggregate(result, counts, values, codes, func, min_count)
  File "/home/jeet/miniconda3/envs/rnd/lib/python3.8/site-packages/pandas/core/groupby/ops.py", line 599, in _aggregate
    agg_func(result, counts, values, comp_ids, min_count)
  File "pandas/_libs/groupby.pyx", line 475, in pandas._libs.groupby._group_add
  File "stringsource", line 658, in View.MemoryView.memoryview_cwrapper
  File "stringsource", line 349, in View.MemoryView.memoryview.__cinit__
ValueError: buffer source array is read-only

In the failing .agg line, a min, max, median, or count aggregation works.

A sum or mean aggregation fails with the error above.
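A quick way to check which aggregations are affected on a given install is to loop over them and record the outcome. This is only a sketch: it builds the read-only input with a plain NumPy array and `pd.Series(..., copy=False)` rather than through pyarrow, which is an assumption made to keep the example free of extra dependencies, and the `outcomes` dict is a name introduced here for illustration.

```python
import numpy as np
import pandas as pd

# Build a read-only float array and wrap it in a Series without copying,
# so the Series is backed by the read-only buffer (assumed stand-in for
# the deserialized frame from the report).
values = np.array([5.1, 4.9, 4.7, 4.6, 5.0])
values.flags.writeable = False
ser = pd.Series(values, copy=False, name="sepal_length")
species = ["setosa"] * 5

# Try each aggregation and record whether it raised.
outcomes = {}
for how in ["min", "max", "median", "count", "sum", "mean"]:
    try:
        ser.groupby(species).agg(how)
        outcomes[how] = "ok"
    except ValueError as exc:
        outcomes[how] = f"ValueError: {exc}"

for how, outcome in outcomes.items():
    print(f"{how:>6}: {outcome}")
```

On an affected version I would expect sum and mean to report the ValueError; on a version that includes the fix, every line should read "ok".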

Expected Output

I expected the aggregation to succeed without any error.

Output of `pd.show_versions()`

INSTALLED VERSIONS
------------------
commit           : f2ca0a2665b2d169c97de87b8e778dbed86aea07
python           : 3.8.5.final.0
python-bits      : 64
OS               : Linux
OS-release       : 5.4.0-7642-generic
Version          : #46~1597422484~20.04~e78f762-Ubuntu SMP Wed Aug 19 14:35:06 UTC 
machine          : x86_64
processor        : x86_64
byteorder        : little
LC_ALL           : None
LANG             : en_US.UTF-8
LOCALE           : en_US.UTF-8

pandas           : 1.1.1
numpy            : 1.19.1
pytz             : 2020.1
dateutil         : 2.8.1
pip              : 20.2.2
setuptools       : 49.6.0.post20200814
Cython           : None
pytest           : None
hypothesis       : None
sphinx           : None
blosc            : None
feather          : None
xlsxwriter       : None
lxml.etree       : None
html5lib         : None
pymysql          : None
psycopg2         : 2.8.5 (dt dec pq3 ext lo64)
jinja2           : 2.11.2
IPython          : 7.17.0
pandas_datareader: None
bs4              : None
bottleneck       : None
fsspec           : None
fastparquet      : None
gcsfs            : None
matplotlib       : None
numexpr          : None
odfpy            : None
openpyxl         : None
pandas_gbq       : None
pyarrow          : 1.0.1
pytables         : None
pyxlsb           : None
s3fs             : None
scipy            : 1.5.2
sqlalchemy       : None
tables           : None
tabulate         : None
xarray           : None
xlrd             : None
xlwt             : None
numba            : None


dsaxton · 2020-08-31T19:08:44Z

@jeet-parekh Can you create a copy / pastable example that doesn't use external links?

https://matthewrocklin.com/blog/work/2018/02/28/minimal-bug-reports

jeet-parekh · 2020-08-31T19:18:12Z

A couple more things.

This fails

df_new.groupby(["species"])["sepal_length"].sum()

This works

df_new.groupby(["species"])[["sepal_length"]].sum()

jeet-parekh · 2020-08-31T19:24:38Z

@dsaxton, fixed it. I missed the fact that it wasn't copy-pastable. Will edit the main issue post as well.

dsaxton · 2020-08-31T19:32:07Z

Thanks @jeet-parekh. Fails on master as well and looks like a bug to me.

rhshadrach · 2020-08-31T21:41:48Z

Another temporary workaround is to make a copy:

df_new = context.deserialize(data).copy()

Then it seems to me that all groupby ops work, whether as a Series or a DataFrame.
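For what it's worth, here is a small NumPy-only sketch of why a copy helps. It assumes the deserialized arrays are views onto an immutable bytes object, which is a guess about the mechanism and not something confirmed in this thread; `raw`, `view`, and `copied` are names made up for the illustration.

```python
import numpy as np

# An array viewing immutable bytes cannot be written to.
raw = np.array([5.1, 4.9, 4.7, 4.6, 5.0]).tobytes()
view = np.frombuffer(raw, dtype=np.float64)
print(view.flags.writeable)    # False

# .copy() allocates fresh memory that the new array owns, so it is writeable.
copied = view.copy()
print(copied.flags.writeable)  # True
```

The copy costs one extra allocation per array, so it is a workaround and not a fix.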

jorisvandenbossche · 2020-09-01T08:56:12Z

A reproducer without the use of pyarrow:

df = pd.DataFrame(
    {
        "sepal_length": [5.1, 4.9, 4.7, 4.6, 5.0],
        "species": ["setosa", "setosa", "setosa", "setosa", "setosa"],
    }
)
df._mgr.blocks[0].values.flags.writeable = False

df.groupby(["species"]).agg({"sepal_length": "sum"})

It's already failing in 1.0, but not in 0.25. So not a very recent regression, but still a regression compared to 0.25.

jeet-parekh · 2020-09-01T09:02:38Z

I see the same behaviour with @jorisvandenbossche's code. It succeeds for min, max, count, and median aggregations. But fails for sum and mean. Not sure if that's relevant.

jorisvandenbossche · 2020-09-01T09:23:07Z

The direct fix would be to add a const to the values keyword declaration at

pandas/pandas/_libs/groupby.pyx

Lines 473 to 477 in b528be6

def _group_add(complexfloating_t[:, :] out,
               int64_t[:] counts,
               complexfloating_t[:, :] values,
               const int64_t[:] labels,
               Py_ssize_t min_count=0):

However, using const with fused types will only be available in Cython 3. So a workaround for now would be to use the ndarray interface (ndarray[complexfloating_t, ndim=2]) instead of a memoryview, I suppose.
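To illustrate the underlying constraint from plain Python (a sketch, not the Cython code itself): a typed memoryview declared without const asks the array for a writable buffer, and a read-only array refuses that request. The read-only flag is visible through Python's built-in `memoryview`, which is used here only as an analogy.

```python
import numpy as np

values = np.array([5.1, 4.9, 4.7, 4.6, 5.0])
values.flags.writeable = False

# The buffer exported by a read-only array is flagged read-only.
view = memoryview(values)
print(view.readonly)  # True

# Writing through the view is rejected, which is why a consumer that
# insists on a writable buffer cannot accept this array at all.
error = None
try:
    view[0] = 1.0
except TypeError as exc:
    error = exc
print(type(error).__name__)  # TypeError
```

Declaring the argument const tells Cython that a read-only buffer is acceptable, which is why that is the direct fix.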

jeet-parekh added the Bug and Needs Triage labels Aug 31, 2020

dsaxton removed the Needs Triage label Aug 31, 2020

rhshadrach added the IO Parquet label Aug 31, 2020

rhshadrach added this to the Contributions Welcome milestone Aug 31, 2020

jorisvandenbossche added the Groupby and Regression labels and removed the Bug and IO Parquet labels Sep 1, 2020

jorisvandenbossche changed the title ~~BUG: ValueError: buffer source array is read-only - on doing a groupby and agg after deserializing dataframe using pyarrow~~ BUG: groupby and agg on read-only array gives ValueError: buffer source array is read-only Sep 1, 2020

jorisvandenbossche modified the milestones: Contributions Welcome, 1.1.2 Sep 1, 2020

jeet-parekh mentioned this issue Sep 2, 2020

BUG: groupby and agg on read-only array gives ValueError: buffer source array is read-only #36061

Merged

5 tasks

jreback closed this as completed in #36061 Sep 4, 2020

clarkzinzow mentioned this issue Jan 8, 2021

[dask-on-ray] ValueError on read-only memory ray-project/ray#10124

Open

2 tasks
