MAINT: Add a proper implementation for structured zerofill #23620

seberg · 2023-04-20T12:32:47Z

This reorganizes the generic traversal functions for structured dtypes a bit (before it wasn't quite generic).

Then uses that for zerofilling. We could get the two a bit closer by also supporting func==NULL explicitly for clearing. (I have not includes this here.)

The old FillObjectArray is still in use now, this is not ideal and the new approach should be duplicated to add an "emptyfill" (same semantics, normally just memset to 0, but for objects we place an explicit None object).

This is a necessary follow-up to gh-23591.

This reorganizes the generic traversal functions for structured dtypes a bit (before it wasn't quite generic). Then uses that for zerofilling. We could get the two a bit closer by also supporting `func==NULL` explicitly for clearing. (I have not includes this here.) The old `FillObjectArray` is still in use now, this is not ideal and the new approach should be duplicated to add an "emptyfill" (same semantics, normally just memset to 0, but for objects we place an explicit `None` object). This is a necessary follow-up to numpygh-23591.

ngoldbaum

All the stringdtype tests pass with this PR and I didn't see any code issues to comment about, LGTM!

seberg · 2023-04-20T18:07:14Z

One test we could add there (if you like), is that this should make np.zeros(3, dtype=[("a", np.float64), ("b", StringDType())]) work (although, I suspect there are still alignment issues, so you have to make sure to use an aligned struct dtype).

ngoldbaum · 2023-04-20T18:13:19Z

Nice! It looks like it does successfully create the array with this PR (on main it dies in an assert(0) inside refcount.c). However it crashes a little bit later because the struct dtype printing code doesn't know about new dtypes:

In [2]: In [1]: np.zeros(3, dtype=[("a", np.float64), ("b", StringDType())])
Out[2]: 
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
File ~/.pyenv/versions/3.10.9-debug/lib/python3.10/site-packages/IPython/core/formatters.py:706, in PlainTextFormatter.__call__(self, obj)
    699 stream = StringIO()
    700 printer = pretty.RepresentationPrinter(stream, self.verbose,
    701     self.max_width, self.newline,
    702     max_seq_length=self.max_seq_length,
    703     singleton_pprinters=self.singleton_printers,
    704     type_pprinters=self.type_printers,
    705     deferred_pprinters=self.deferred_printers)
--> 706 printer.pretty(obj)
    707 printer.flush()
    708 return stream.getvalue()

File ~/.pyenv/versions/3.10.9-debug/lib/python3.10/site-packages/IPython/lib/pretty.py:410, in RepresentationPrinter.pretty(self, obj)
    407                         return meth(obj, self, cycle)
    408                 if cls is not object \
    409                         and callable(cls.__dict__.get('__repr__')):
--> 410                     return _repr_pprint(obj, self, cycle)
    412     return _default_pprint(obj, self, cycle)
    413 finally:

File ~/.pyenv/versions/3.10.9-debug/lib/python3.10/site-packages/IPython/lib/pretty.py:778, in _repr_pprint(obj, p, cycle)
    776 """A pprint that just redirects to the normal repr function."""
    777 # Find newlines and replace them with p.break_()
--> 778 output = repr(obj)
    779 lines = output.splitlines()
    780 with p.group():

File ~/Documents/numpy/numpy/core/arrayprint.py:1518, in _array_repr_implementation(arr, max_line_width, precision, suppress_small, array2string)
   1515 if skipdtype:
   1516     return arr_str
-> 1518 dtype_str = "dtype={})".format(dtype_short_repr(arr.dtype))
   1520 # compute whether we should put dtype on a new line: Do so if adding the
   1521 # dtype would extend the last line past max_line_width.
   1522 # Note: This line gives the correct result even when rfind returns -1.
   1523 last_line_len = len(arr_str) - (arr_str.rfind('\n') + 1)

File ~/Documents/numpy/numpy/core/arrayprint.py:1470, in dtype_short_repr(dtype)
   1467     return repr(dtype)
   1468 if dtype.names is not None:
   1469     # structured dtypes give a list or tuple repr
-> 1470     return str(dtype)
   1471 elif issubclass(dtype.type, flexible):
   1472     # handle these separately so they don't give garbage like str256
   1473     return "'%s'" % str(dtype)

File ~/Documents/numpy/numpy/core/_dtype.py:36, in __str__(dtype)
     34 def __str__(dtype):
     35     if dtype.fields is not None:
---> 36         return _struct_str(dtype, include_align=True)
     37     elif dtype.subdtype:
     38         return _subarray_str(dtype)

File ~/Documents/numpy/numpy/core/_dtype.py:310, in _struct_str(dtype, include_align)
    305 def _struct_str(dtype, include_align):
    306     # The list str representation can't include the 'align=' flag,
    307     # so if it is requested and the struct has the aligned flag set,
    308     # we must use the dict str instead.
    309     if not (include_align and dtype.isalignedstruct) and _is_packed(dtype):
--> 310         sub = _struct_list_str(dtype)
    312     else:
    313         sub = _struct_dict_str(dtype, include_align)

File ~/Documents/numpy/numpy/core/_dtype.py:297, in _struct_list_str(dtype)
    292     item += "{}, {}".format(
    293         _construction_repr(base, short=True),
    294         shape
    295     )
    296 else:
--> 297     item += _construction_repr(fld_dtype, short=True)
    299 item += ")"
    300 items.append(item)

File ~/Documents/numpy/numpy/core/_dtype.py:100, in _construction_repr(dtype, include_align, short)
     98     return _subarray_str(dtype)
     99 else:
--> 100     return _scalar_str(dtype, short=short)

File ~/Documents/numpy/numpy/core/_dtype.py:156, in _scalar_str(dtype, short)
    153     return dtype.type.__name__
    155 else:
--> 156     raise RuntimeError(
    157         "Internal error: NumPy dtype unrecognized type number")

RuntimeError: Internal error: NumPy dtype unrecognized type number

mattip · 2023-04-27T15:05:32Z

Was the error above fixed?

seberg · 2023-04-27T16:42:24Z

Was the error above fixed?

The error Nathan referred to was in the printing code and is unrelated to this code. This fixes the assert (since it won't use that function anymore).

ngoldbaum · 2023-04-27T16:44:41Z

We can file an issue about the structured dtype printing issue and fix that in a followup.

numpy/core/src/multiarray/dtype_traversal.c

charris · 2023-05-10T15:11:28Z

Thanks Sebastian.

github-actions bot added the 03 - Maintenance label Apr 20, 2023

seberg requested a review from ngoldbaum April 20, 2023 13:34

seberg force-pushed the correct-zerofill-structured branch from 5a2b25f to 534c2e5 Compare April 20, 2023 13:50

ngoldbaum approved these changes Apr 20, 2023

View reviewed changes

seberg requested a review from mattip April 26, 2023 19:33

charris reviewed May 9, 2023

View reviewed changes

numpy/core/src/multiarray/dtype_traversal.c Outdated Show resolved Hide resolved

MAINT: Rename traverse_func_get and type for clarity as per review

a5a9fe6

charris merged commit 82a8297 into numpy:main May 10, 2023
57 checks passed

seberg deleted the correct-zerofill-structured branch May 10, 2023 15:14

chrisb-fico mentioned this pull request Mar 6, 2024

BUG: Structured arrays do not clean up properly when freed, causing memory leaks #25949

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

MAINT: Add a proper implementation for structured zerofill #23620

MAINT: Add a proper implementation for structured zerofill #23620

seberg commented Apr 20, 2023

ngoldbaum left a comment

seberg commented Apr 20, 2023

ngoldbaum commented Apr 20, 2023

mattip commented Apr 27, 2023

seberg commented Apr 27, 2023

ngoldbaum commented Apr 27, 2023

charris commented May 10, 2023

MAINT: Add a proper implementation for structured zerofill #23620

MAINT: Add a proper implementation for structured zerofill #23620

Conversation

seberg commented Apr 20, 2023

ngoldbaum left a comment

Choose a reason for hiding this comment

seberg commented Apr 20, 2023

ngoldbaum commented Apr 20, 2023

mattip commented Apr 27, 2023

seberg commented Apr 27, 2023

ngoldbaum commented Apr 27, 2023

charris commented May 10, 2023