
BUG: pandas 1.1.0 MemoryError using .astype("string") which worked using pandas 1.0.5 #35499

Closed
2 of 3 tasks
ldacey opened this issue Jul 31, 2020 · 9 comments · Fixed by #35519
Labels
Bug · Performance (Memory or execution speed performance) · Regression (Functionality that used to work in a prior pandas version) · Strings (String extension data type and string data)
Milestone
1.1.1
Comments

@ldacey

ldacey commented Jul 31, 2020

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • (optional) I have confirmed this bug exists on the master branch of pandas.


Code Sample, a copy-pastable example

I tried to pinpoint the specific row which causes the error. The column has HTML data like this:

'''<div class="comment" dir="auto"><p dir="auto">Request <a href="/agent/tickets/test" target="_blank" rel="ticket">#test</a> "RE: ** M ..." Last comment in request </a>:</p> <p dir="auto"></p> <p dir="auto">Thank you</p> <p dir="auto">Stuff.</p> <p dir="auto">We will keep you posted .</p> <p dir="auto">Regards,</p> <p dir="auto">Name</p></div>'''

# This fails (MemoryError below):
df['event_html_body'].astype("string")

# Filtering the dataframe to only the rows that are not null for this field works:
x = df[~df.event_html_body.isnull()][['event_html_body']]
x['event_html_body'].astype("string")

# Filling NAs with another value also fails:
df['event_html_body'].fillna('-').astype("string")

Problem description

I have code which has been converting columns to the "string" dtype, and this worked up until pandas 1.1.0.

For example, I tried to process a file which I successfully processed in April: it works when I use .astype(str), but it fails when I use .astype("string"), even though this worked in pandas 1.0.5.

The column does not need to be the new "string" type, but I wanted to raise this issue anyway.

Rows: 201368
Empty/null rows for the column in question: 189014 / 201368

So this column is quite sparse, and as shown above, if I filter out the nulls and then do .astype("string") it runs fine. I am not sure why this worked before (same server, 64 GB of RAM); this file was previously processed as a "string" before the update.

Error:


MemoryError                               Traceback (most recent call last)
<ipython-input-38-939f88862e64> in <module>
----> 1 df['event_html_body'].astype("string")

/opt/conda/lib/python3.7/site-packages/pandas/core/generic.py in astype(self, dtype, copy, errors)
   5535         else:
   5536             # else, only a single dtype is given
-> 5537             new_data = self._mgr.astype(dtype=dtype, copy=copy, errors=errors,)
   5538             return self._constructor(new_data).__finalize__(self, method="astype")
   5539 

/opt/conda/lib/python3.7/site-packages/pandas/core/internals/managers.py in astype(self, dtype, copy, errors)
    565         self, dtype, copy: bool = False, errors: str = "raise"
    566     ) -> "BlockManager":
--> 567         return self.apply("astype", dtype=dtype, copy=copy, errors=errors)
    568 
    569     def convert(

/opt/conda/lib/python3.7/site-packages/pandas/core/internals/managers.py in apply(self, f, align_keys, **kwargs)
    394                 applied = b.apply(f, **kwargs)
    395             else:
--> 396                 applied = getattr(b, f)(**kwargs)
    397             result_blocks = _extend_blocks(applied, result_blocks)
    398 

/opt/conda/lib/python3.7/site-packages/pandas/core/internals/blocks.py in astype(self, dtype, copy, errors)
    588             vals1d = values.ravel()
    589             try:
--> 590                 values = astype_nansafe(vals1d, dtype, copy=True)
    591             except (ValueError, TypeError):
    592                 # e.g. astype_nansafe can fail on object-dtype of strings

/opt/conda/lib/python3.7/site-packages/pandas/core/dtypes/cast.py in astype_nansafe(arr, dtype, copy, skipna)
    911     # dispatch on extension dtype if needed
    912     if is_extension_array_dtype(dtype):
--> 913         return dtype.construct_array_type()._from_sequence(arr, dtype=dtype, copy=copy)
    914 
    915     if not isinstance(dtype, np.dtype):

/opt/conda/lib/python3.7/site-packages/pandas/core/arrays/string_.py in _from_sequence(cls, scalars, dtype, copy)
    215 
    216         # convert to str, then to object to avoid dtype like '<U3', then insert na_value
--> 217         result = np.asarray(result, dtype=str)
    218         result = np.asarray(result, dtype="object")
    219         if has_nans:

/opt/conda/lib/python3.7/site-packages/numpy/core/_asarray.py in asarray(a, dtype, order)
     83 
     84     """
---> 85     return array(a, dtype, copy=False, order=order)
     86 
     87 

MemoryError: Unable to allocate array with shape (201368,) and data type <U131880

Expected Output

Output of pd.show_versions()

INSTALLED VERSIONS

commit : d9fff27
python : 3.7.4.final.0
python-bits : 64
OS : Linux
OS-release : 5.3.0-1028-azure
Version : #29~18.04.1-Ubuntu SMP Fri Jun 5 14:32:34 UTC 2020
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : en_US.UTF-8
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 1.1.0
numpy : 1.17.5
pytz : 2019.3
dateutil : 2.8.1
pip : 20.0.2
setuptools : 45.2.0
Cython : 0.29.14
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : 2.8.4 (dt dec pq3 ext lo64)
jinja2 : 2.10.3
IPython : 7.11.1
pandas_datareader: None
bs4 : 4.8.2
bottleneck : None
fsspec : 0.6.2
fastparquet : None
gcsfs : None
matplotlib : 3.1.2
numexpr : 2.7.1
odfpy : None
openpyxl : 3.0.4
pandas_gbq : None
pyarrow : 1.0.0
pytables : None
pyxlsb : 1.0.6
s3fs : None
scipy : 1.4.1
sqlalchemy : 1.3.13
tables : None
tabulate : 0.8.6
xarray : None
xlrd : 1.2.0
xlwt : None
numba : 0.48.0

@ldacey ldacey added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Jul 31, 2020
@simonjayhawkins
Member

Thanks @ldacey for the report. I can confirm this was working in 1.0.5. This change in behaviour is due to #33465 cc @topper-123

>>> print(pd.__version__)
1.1.0.dev0+1676.gb6ea970f8
>>>
>>> data = "a" * 131880
>>>
>>> res = pd.DataFrame([data] * 201368, dtype="string")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Users\simon\pandas\pandas\core\frame.py", line 515, in __init__
    mgr = init_ndarray(data, index, columns, dtype=dtype, copy=copy)
  File "C:\Users\simon\pandas\pandas\core\internals\construction.py", line 186, in init_ndarray
    return arrays_to_mgr(values, columns, index, columns, dtype=dtype)
  File "C:\Users\simon\pandas\pandas\core\internals\construction.py", line 83, in arrays_to_mgr
    arrays = _homogenize(arrays, index, dtype)
  File "C:\Users\simon\pandas\pandas\core\internals\construction.py", line 351, in _homogenize
    val = sanitize_array(
  File "C:\Users\simon\pandas\pandas\core\construction.py", line 441, in sanitize_array
    subarr = _try_cast(data, dtype, copy, raise_cast_failure)
  File "C:\Users\simon\pandas\pandas\core\construction.py", line 542, in _try_cast
    subarr = array_type(arr, dtype=dtype, copy=copy)
  File "C:\Users\simon\pandas\pandas\core\arrays\string_.py", line 218, in _from_sequence
    result = np.asarray(result, dtype=str)
  File "C:\Users\simon\Anaconda3\envs\pandas-dev\lib\site-packages\numpy\core\_asarray.py", line 83, in asarray
    return array(a, dtype, copy=False, order=order)
MemoryError: Unable to allocate 98.9 GiB for an array with shape (201368,) and data type <U131880
>>>
>>> print(pd.__version__)
1.1.0.dev0+1675.g8c7d653a4
>>>
>>> data = "a" * 131880
>>>
>>> res = pd.DataFrame([data] * 201368, dtype="string")
>>> print(res)
                                                        0
0       aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa...
1       aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa...
2       aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa...
3       aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa...
4       aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa...
...                                                   ...
201363  aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa...
201364  aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa...
201365  aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa...
201366  aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa...
201367  aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa...

[201368 rows x 1 columns]
>>>

@simonjayhawkins simonjayhawkins added Performance Memory or execution speed performance Regression Functionality that used to work in a prior pandas version and removed Needs Triage Issue that has not been reviewed by a pandas team member labels Aug 1, 2020
@simonjayhawkins simonjayhawkins added this to the 1.1.1 milestone Aug 1, 2020
@topper-123
Contributor

topper-123 commented Aug 1, 2020

The problem is the array conversion. Your example can be boiled down further to:

>>> data = ["a" * 131880] * 201368
>>> data = np.array(data, dtype=object)
>>> np.asarray(data, str)
MemoryError: Unable to allocate 98.9 GiB for an array with shape (201368,) and data type <U131880

The problem is that we go from a small list (memory-wise) to a huge array: in the list, data[0] is data[1] etc. (one shared string object), while the array conversion materializes each string as its own fixed-width entry.
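
As a back-of-the-envelope check (NumPy stores a "<U131880" array as fixed-width UCS-4, i.e. 4 bytes per code point, for every row), the 98.9 GiB figure falls right out:

>>> import numpy as np
>>> np.dtype("<U131880").itemsize  # 131880 code points * 4 bytes each
527520
>>> round(201368 * 527520 / 2**30, 1)  # size of the whole fixed-width array, in GiB
98.9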

The call to np.asarray(result, str) is there to ensure all scalars are strings, so we don't have dangling ints or floats, for example:

>>> data = [0] + ["a" * 131880] * 201368  # data[0] is not a string
>>> data = np.array(data, dtype=object)
>>> data = np.asarray(data, str)  # used to ensure data[0] converts to str
MemoryError: Unable to allocate 98.9 GiB for an array with shape (201369,) and data type <U131880

Is there a way to avoid this / make it more efficient? I've got a feeling that this is pushing against the limits of what numpy can do with strings.

EDIT: added an example of why `asarray(data, str)` is used.

@topper-123
Contributor

@ldacey, an alternative for your use case could be to use StringArray, which bypasses the conversion, i.e.:

>>> data = np.array(["a" * 131880] * 201368, dtype=object)  # StringArray needs an object ndarray
>>> data = pd.arrays.StringArray(data)
>>> pd.Series(data)
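
Applied to the sparse column from the report, a rough sketch of the same workaround (assuming df['event_html_body'] is an object-dtype column, as in the original example) could be:

>>> vals = df['event_html_body'].to_numpy(dtype=object, copy=True)
>>> vals[pd.isna(vals)] = pd.NA  # StringArray expects pd.NA for missing values
>>> s = pd.Series(pd.arrays.StringArray(vals), index=df.index)

This skips the np.asarray(..., dtype=str) round-trip, so no fixed-width "<U..." array is ever allocated.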

@simonjayhawkins simonjayhawkins added the Strings String extension data type and string data label Aug 3, 2020
@TomAugspurger
Contributor

Just as a note: I would not expect converting a list of many references to the same object to a StringArray to necessarily preserve that fact once it's converted to StringArray's internal storage. If / when we use Arrow for storing string memory that wouldn't be possible.

@simonjayhawkins
Member

That is probably not what the OP is doing; I used that as a reproducible code sample. I think the issue is that the memory needed is proportional to the number of rows times the longest string (all rows could be different).

@ldacey
Author

ldacey commented Aug 3, 2020

As far as what I am doing:

My current code has a pyarrow_schema that I defined. For any column I declared as pa.string(), my script would convert the data to .astype("string"). I tried to be explicit with these types because I faced issues with inconsistent schemas when reading data from the parquet files downstream (related to null data for the most part).

pyarrow_schema = pa.schema(
    [
        ("ticket_id", pa.string()),
        ("solved_at", pa.timestamp(unit="ns")),
        ("closed_at", pa.timestamp(unit="ns")),
        ("event_subject", pa.string()),
        ("event_html_body", pa.string()),
    ]
)
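
The conversion step itself looks roughly like this (an illustrative sketch, not the actual script):

for field in pyarrow_schema:
    if field.type == pa.string():
        df[field.name] = df[field.name].astype("string")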

For now, I switched back to .astype(str) instead since that runs on the current version of pandas for this data.

@simonjayhawkins
Member

Maybe a more representative MRE (minimal reproducible example):

>>> pd.__version__
'1.2.0.dev0+17.ga0c8425a5'
>>>
>>> data = np.full(201368, fill_value=np.nan, dtype=object)
>>>
>>> data[0] = "a" * 131880
>>> df = pd.DataFrame(data)
>>>
>>> df.astype("string")
Traceback (most recent call last):
...
MemoryError: Unable to allocate 98.9 GiB for an array with shape (201368,) and data type <U131880
>>>
>>> pd.__version__
'1.0.5'
>>>
>>> data = np.full(201368, fill_value=np.nan, dtype=object)
>>>
>>> data[0] = "a" * 131880
>>> df = pd.DataFrame(data)
>>>
>>> df.astype("string")
                                                        0
0       aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa...
1                                                    <NA>
2                                                    <NA>
3                                                    <NA>
4                                                    <NA>
...                                                   ...
201363                                               <NA>
201364                                               <NA>
201365                                               <NA>
201366                                               <NA>
201367                                               <NA>

[201368 rows x 1 columns]
>>>

@topper-123
Contributor

topper-123 commented Aug 3, 2020

Just as a note: I would not expect converting a list of many references to the same object to a StringArray to necessarily preserve that fact once it's converted to StringArray's internal storage. If / when we use Arrow for storing string memory that wouldn't be possible.

Yeah, I agree, but the current (simplified) conversion chain np.asarray(data, dtype=object).astype(str).astype(object) seems wasteful compared to just keeping existing strings where possible, because strings are immutable. Ideally, only non-Python-strings would be converted.
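
A pure-Python sketch of that idea (illustrative only; a real fix would presumably live at a lower level):

import numpy as np
import pandas as pd

def ensure_str_objects(values: np.ndarray) -> np.ndarray:
    # Keep existing str instances untouched (str is immutable, so sharing is
    # safe) and convert only the non-string, non-missing scalars. No fixed-width
    # "<U..." array is ever allocated, so memory stays proportional to the data.
    out = values.copy()
    for i, v in enumerate(out):
        if not isinstance(v, str) and not pd.isna(v):
            out[i] = str(v)
    return out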

@topper-123
Contributor

@simonjayhawkins, both your examples fail with dtype=str in v1.0:

>>> data = np.full(201368, fill_value=np.nan, dtype=object)
>>>
>>> data[0] = "a" * 131880
>>> df = pd.DataFrame(data, dtype=str)
MemoryError: Unable to allocate 98.9 GiB for an array with shape (201368,) and data type <U131880

In some sense, the fact that dtype="string" worked in v1.0 was just luck, and we should fix this problem in both the dtype=str and the dtype="string" cases.
