BUG: groupby and agg on read-only array gives ValueError: buffer source array is read-only #36014

Closed
2 tasks done
jeet-parekh opened this issue Aug 31, 2020 · 8 comments · Fixed by #36061
Labels
Groupby Regression Functionality that used to work in a prior pandas version
Milestone

Comments

@jeet-parekh
Contributor

jeet-parekh commented Aug 31, 2020


Code Sample, a copy-pastable example

import pandas as pd
import pyarrow as pa

df = pd.DataFrame(
    {
        "sepal_length": [5.1, 4.9, 4.7, 4.6, 5.0],
        "species": ["setosa", "setosa", "setosa", "setosa", "setosa"],
    }
)

context = pa.default_serialization_context()
data = context.serialize(df).to_buffer().to_pybytes()
df_new = context.deserialize(data)

# this fails
df_new.groupby(["species"]).agg({"sepal_length": "sum"})

# this works
# df_new.copy().groupby(["species"]).agg({"sepal_length": "sum"})

Problem description

This is the traceback.

Traceback (most recent call last):
  File "demo.py", line 16, in <module>
    df_new.groupby(["species"]).agg({"sepal_length": "sum"})
  File "/home/jeet/miniconda3/envs/rnd/lib/python3.8/site-packages/pandas/core/groupby/generic.py", line 949, in aggregate
    result, how = self._aggregate(func, *args, **kwargs)
  File "/home/jeet/miniconda3/envs/rnd/lib/python3.8/site-packages/pandas/core/base.py", line 416, in _aggregate
    result = _agg(arg, _agg_1dim)
  File "/home/jeet/miniconda3/envs/rnd/lib/python3.8/site-packages/pandas/core/base.py", line 383, in _agg
    result[fname] = func(fname, agg_how)
  File "/home/jeet/miniconda3/envs/rnd/lib/python3.8/site-packages/pandas/core/base.py", line 367, in _agg_1dim
    return colg.aggregate(how)
  File "/home/jeet/miniconda3/envs/rnd/lib/python3.8/site-packages/pandas/core/groupby/generic.py", line 240, in aggregate
    return getattr(self, func)(*args, **kwargs)
  File "/home/jeet/miniconda3/envs/rnd/lib/python3.8/site-packages/pandas/core/groupby/groupby.py", line 1539, in sum
    return self._agg_general(
  File "/home/jeet/miniconda3/envs/rnd/lib/python3.8/site-packages/pandas/core/groupby/groupby.py", line 999, in _agg_general
    return self._cython_agg_general(
  File "/home/jeet/miniconda3/envs/rnd/lib/python3.8/site-packages/pandas/core/groupby/groupby.py", line 1033, in _cython_agg_general
    result, agg_names = self.grouper.aggregate(
  File "/home/jeet/miniconda3/envs/rnd/lib/python3.8/site-packages/pandas/core/groupby/ops.py", line 584, in aggregate
    return self._cython_operation(
  File "/home/jeet/miniconda3/envs/rnd/lib/python3.8/site-packages/pandas/core/groupby/ops.py", line 537, in _cython_operation
    result = self._aggregate(result, counts, values, codes, func, min_count)
  File "/home/jeet/miniconda3/envs/rnd/lib/python3.8/site-packages/pandas/core/groupby/ops.py", line 599, in _aggregate
    agg_func(result, counts, values, comp_ids, min_count)
  File "pandas/_libs/groupby.pyx", line 475, in pandas._libs.groupby._group_add
  File "stringsource", line 658, in View.MemoryView.memoryview_cwrapper
  File "stringsource", line 349, in View.MemoryView.memoryview.__cinit__
ValueError: buffer source array is read-only

In the failing .agg call, a min, max, median, or count aggregation works.

A sum or mean aggregation fails with the error above.
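
To check which aggregations are affected on a given installation, a small probe like the one below may help. It is only a sketch: `probe_readonly_aggs` is a hypothetical helper name, it builds the read-only data directly with NumPy rather than through pyarrow, and it uses a Series because the Series groupby path is the one reported as failing. The outcome depends on the installed pandas version, so no particular result is assumed.

```python
import numpy as np
import pandas as pd


def probe_readonly_aggs(agg_names=("min", "max", "median", "count", "sum", "mean")):
    """Run each groupby aggregation on read-only data and record what happens."""
    values = np.array([5.1, 4.9, 4.7, 4.6, 5.0])
    values.flags.writeable = False  # mimic the read-only buffer from deserialization

    # copy=False asks pandas to wrap the read-only array instead of copying it.
    ser = pd.Series(values, name="sepal_length", copy=False)
    keys = pd.Series(["setosa"] * len(values), name="species")

    outcome = {}
    for name in agg_names:
        try:
            getattr(ser.groupby(keys), name)()
        except ValueError as exc:
            outcome[name] = f"ValueError: {exc}"
        else:
            outcome[name] = "ok"
    return outcome


if __name__ == "__main__":
    for name, status in probe_readonly_aggs().items():
        print(f"{name:>6}: {status}")
```

On an affected version, sum and mean would be expected to report the ValueError while the others report "ok".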

Expected Output

I expected the aggregation to succeed without any error.

Output of pd.show_versions()

INSTALLED VERSIONS
------------------
commit           : f2ca0a2665b2d169c97de87b8e778dbed86aea07
python           : 3.8.5.final.0
python-bits      : 64
OS               : Linux
OS-release       : 5.4.0-7642-generic
Version          : #46~1597422484~20.04~e78f762-Ubuntu SMP Wed Aug 19 14:35:06 UTC 
machine          : x86_64
processor        : x86_64
byteorder        : little
LC_ALL           : None
LANG             : en_US.UTF-8
LOCALE           : en_US.UTF-8

pandas           : 1.1.1
numpy            : 1.19.1
pytz             : 2020.1
dateutil         : 2.8.1
pip              : 20.2.2
setuptools       : 49.6.0.post20200814
Cython           : None
pytest           : None
hypothesis       : None
sphinx           : None
blosc            : None
feather          : None
xlsxwriter       : None
lxml.etree       : None
html5lib         : None
pymysql          : None
psycopg2         : 2.8.5 (dt dec pq3 ext lo64)
jinja2           : 2.11.2
IPython          : 7.17.0
pandas_datareader: None
bs4              : None
bottleneck       : None
fsspec           : None
fastparquet      : None
gcsfs            : None
matplotlib       : None
numexpr          : None
odfpy            : None
openpyxl         : None
pandas_gbq       : None
pyarrow          : 1.0.1
pytables         : None
pyxlsb           : None
s3fs             : None
scipy            : 1.5.2
sqlalchemy       : None
tables           : None
tabulate         : None
xarray           : None
xlrd             : None
xlwt             : None
numba            : None
@jeet-parekh jeet-parekh added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Aug 31, 2020
@dsaxton
Member

dsaxton commented Aug 31, 2020

@jeet-parekh Can you create a copy / pastable example that doesn't use external links?

https://matthewrocklin.com/blog/work/2018/02/28/minimal-bug-reports

@dsaxton dsaxton removed the Needs Triage Issue that has not been reviewed by a pandas team member label Aug 31, 2020
@jeet-parekh
Contributor Author

A couple more things.

This fails

df_new.groupby(["species"])["sepal_length"].sum()

This works

df_new.groupby(["species"])[["sepal_length"]].sum()

@jeet-parekh
Contributor Author

@dsaxton, fixed it. I missed the fact that it wasn't copy-pastable. Will edit the main issue post as well.

@dsaxton
Member

dsaxton commented Aug 31, 2020

Thanks @jeet-parekh. Fails on master as well and looks like a bug to me.

@rhshadrach
Member

Another temporary workaround is to make a copy:

df_new = context.deserialize(data).copy()

Then it seems to me that all groupby ops work, whether as a Series or a DataFrame.
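
A NumPy-level sketch of why the copy helps: copying a read-only array produces a new, writable array with its own memory. This assumes that `DataFrame.copy()` with its default `deep=True` copies the underlying data in a comparable way. The variable names are illustrative only.

```python
import numpy as np

# A read-only array, standing in for the deserialized column data.
read_only = np.array([5.1, 4.9, 4.7, 4.6, 5.0])
read_only.flags.writeable = False

# Copying allocates fresh memory; the new array is writable by default.
fresh = read_only.copy()

print(read_only.flags.writeable)           # False
print(fresh.flags.writeable)               # True
print(np.shares_memory(read_only, fresh))  # False
```

The cost of this workaround is one extra copy of the data.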

@rhshadrach rhshadrach added the IO Parquet parquet, feather label Aug 31, 2020
@rhshadrach rhshadrach added this to the Contributions Welcome milestone Aug 31, 2020
@jorisvandenbossche
Member

A reproducer without the use of pyarrow:

df = pd.DataFrame(
    {
        "sepal_length": [5.1, 4.9, 4.7, 4.6, 5.0],
        "species": ["setosa", "setosa", "setosa", "setosa", "setosa"],
    }
)
df._mgr.blocks[0].values.flags.writeable = False

df.groupby(["species"]).agg({"sepal_length": "sum"})

It's already failing in 1.0, but not in 0.25. So not a very recent regression, but still a regression compared to 0.25.

@jorisvandenbossche jorisvandenbossche added Groupby Regression Functionality that used to work in a prior pandas version and removed Bug IO Parquet parquet, feather labels Sep 1, 2020
@jorisvandenbossche jorisvandenbossche changed the title BUG: ValueError: buffer source array is read-only - on doing a groupby and agg after deserializing dataframe using pyarrow BUG: groupby and agg on read-only array gives ValueError: buffer source array is read-only Sep 1, 2020
@jeet-parekh
Contributor Author

I see the same behaviour with @jorisvandenbossche's code. It succeeds for min, max, count, and median aggregations. But fails for sum and mean. Not sure if that's relevant.

@jorisvandenbossche
Member

jorisvandenbossche commented Sep 1, 2020

The direct fix would be to add a const to the values keyword declaration at

def _group_add(complexfloating_t[:, :] out,
               int64_t[:] counts,
               complexfloating_t[:, :] values,
               const int64_t[:] labels,
               Py_ssize_t min_count=0):

However, using const with fused types will only be available in Cython 3. So a workaround for now would be to use the ndarray interface (ndarray[complexfloating_t, ndim=2]) instead of a memoryview, I suppose.
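
For context, here is a Python-level illustration of the read-only flag that a non-const typed memoryview trips over. It is only an analogy and not the Cython code itself. The assumption is that the deserialized data behaves like an array created over immutable bytes, which NumPy marks as read-only.

```python
import numpy as np

# bytes objects are immutable, so an array built on top of them is read-only.
raw = np.array([5.1, 4.9, 4.7, 4.6, 5.0]).tobytes()
arr = np.frombuffer(raw, dtype=np.float64)

# The buffer exported by the array carries the same read-only flag.
view = memoryview(arr)

print(arr.flags.writeable)  # False
print(view.readonly)        # True
```

A non-const typed memoryview asks for a writable buffer, so acquiring one from such an array raises the error in the traceback. A const memoryview accepts the read-only buffer.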

4 participants