Questions about pep3118 format strings #24428

hoodmane · 2023-08-16T12:26:58Z

I am trying to understand PEP 3118 format strings, since they are essentially undocumented. See the discussion here: https://discuss.python.org/t/question-pep-3118-format-strings-and-the-buffer-protocol/31264/7

@mattip @seberg @rgommers @pitrou

NumPy implements a large subset of PEP 3118 in numpy/core/_internal.py. I think the parser in _internal.py implements the following lark grammar:

Lark grammar for numpy's _dtype_from_pep3118

?start: root
root: entry+
?entry: ( array | padding | _normal_entry ) name?

struct: "T{" entry* "}"
padding: "x"
name:  ":" IDENTIFIER ":"

array: shape _normal_entry
shape: "(" _shape_body ")"
_shape_body: (NUMBER ",")* NUMBER

_normal_entry: byteorder? repeat? ( struct | prim )
byteorder: BYTEORDER
repeat: NUMBER
prim: PRIMITIVE

IDENTIFIER: /[^:^\s]+/
NUMBER: ("0".."9")+
BYTEORDER: "@" | "=" | "<" | ">" | "^" | "!"
PRIMITIVE: "Zf" | "Zd" | "Zg" | /[?cbBhHiIlLqQfdgswOx]/

%ignore /\s+/

There are a few things I think are weird about this grammar:

The location of the byte order marks in relation to shapes

The pep says:

Endian-specification (‘!’, ‘@’,’=’,’>’,’<’, ‘^’) is also allowed inside the string so that it can change if needed. The previously-specified endian string is in force until changed. The default endian is ‘@’ which means native data-types and alignment. If un-aligned, native data-types are requested, then the endian specification is ‘^’.

This is completely ambiguous about where these marks can go. Prior to PEP 3118, it seems that the marks were only allowed at the very start of the format string. It seems to me that the most logical rule would be to allow one between each pair of adjacent entries. But _dtype_from_pep3118 expects them to come between the shape and the primitive:

>>> _dtype_from_pep3118("@(3,1)i") # @ before shape not allowed
ValueError: Unknown PEP 3118 data type specifier '(3,1)i'
>>> _dtype_from_pep3118("(3,1)@i") # @ between shape and i allowed
dtype(('<i4', (3, 1)))

This would sort of make sense if the mark only affected the current entry, but it also affects all following ones, which makes the location a bit perplexing. This becomes particularly noticeable when you look at the parse trees: since the mark affects all following entries, it should be a sibling of the entries, but the parser grammar above makes the order mark a child of one particular entry.
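For comparison, here is a small check of the pre-PEP 3118 rule mentioned above, using only the standard struct module. The helper name `parses` is made up for this sketch. It only shows that struct rejects a byte order character anywhere other than the first position, and says nothing about what NumPy does.

```python
import struct


def parses(fmt):
    """Return True if the struct module accepts fmt as a format string."""
    try:
        struct.calcsize(fmt)
    except struct.error:
        return False
    return True


# A byte order character is accepted as the first character ...
leading_ok = parses("<ii")
# ... but struct does not accept one between two entries.
embedded_ok = parses("i<i")

print(leading_ok, embedded_ok)
```

So any placement of the mark other than the very start is new territory introduced by the PEP.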

I think this is a bug which should be fixed by the following patch:

patch

--- a/numpy/core/_internal.py
+++ b/numpy/core/_internal.py
@@ -673,12 +673,6 @@ def __dtype_from_pep3118(stream, is_subdtype):
         if stream.consume('}'):
             break
 
-        # Sub-arrays (1)
-        shape = None
-        if stream.consume('('):
-            shape = stream.consume_until(')')
-            shape = tuple(map(int, shape.split(',')))
-
         # Byte order
         if stream.next in ('@', '=', '<', '>', '^', '!'):
             byteorder = stream.advance(1)
@@ -686,6 +680,12 @@ def __dtype_from_pep3118(stream, is_subdtype):
                 byteorder = '>'
             stream.byteorder = byteorder
 
+        # Sub-arrays (1)
+        shape = None
+        if stream.consume('('):
+            shape = stream.consume_until(')')
+            shape = tuple(map(int, shape.split(',')))
+
         # Byte order characters also control native vs. standard type sizes
         if stream.byteorder in ('@', '^'):
             type_map = _pep3118_native_map

`(4)h` vs `4h` vs `hhhh`

In the struct module documentation it says:

the format string '4h' means exactly the same as 'hhhh'.

But _dtype_from_pep3118 disagrees: it gives the same output for 4h and (4)h, but both differ from the output for hhhh. Then there is the issue of (4)4h, which is treated as an array of 4 arrays of 4 h's, so it is not the same as (4,4)h. Also, perplexingly, (4)(4)h is a syntax error. I think (4)4h should be the same as (4)T{hhhh}.

Also as I said, it seems to me that it makes more sense to allow arbitrary nested arrays like (4)(4)h to mean the current thing that (4)4h means.
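To make the struct side of this comparison concrete, here is a minimal check using only the standard library. It confirms the equivalence quoted from the struct documentation for `h`. It does not exercise NumPy's parser, and the variable names are just for illustration.

```python
import struct

# Pack four 16-bit integers in standard little-endian layout.
data = struct.pack("<4h", 1, 2, 3, 4)

# The repeat-count form and the spelled-out form unpack to the same tuple.
repeat_form = struct.unpack("<4h", data)
spelled_out = struct.unpack("<hhhh", data)

# Both forms also describe the same number of bytes.
same_size = struct.calcsize("<4h") == struct.calcsize("<hhhh")

print(repeat_form, spelled_out, same_size)
```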

Arrays of padding

I think it's weird that _dtype_from_pep3118 accepts arrays of padding like (4, 4)x. Isn't this properly rendered as 16x? It gives the same output. My grammar doesn't allow it.
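As a reference point for what 16x means outside NumPy, this is how the struct module treats a repeat count on padding. It is a sketch with the standard library only. The format `<h4xh` is an arbitrary example I picked, not one from the NumPy or ctypes test suites.

```python
import struct

# A repeat count on "x" is just that many pad bytes.
size_16x = struct.calcsize("16x")

# Pad bytes take up space in the buffer but consume no values when packing
# and produce none when unpacking.
packed = struct.pack("<h4xh", 1, 2)
values = struct.unpack("<h4xh", packed)

print(size_16x, len(packed), values)
```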

Named padding

Is it intended that padding can be named? If you need a name for it, is it padding anymore?

A lark grammar with my suggested modifications:

Details

?start: root
root: _entry+
_entry: byteorder? entry
?entry:  padding | (_normal_entry  name?)

padding: NUMBER? "x"
name:  ":" IDENTIFIER ":"

_normal_entry: array | (repeat? ( struct | prim ))

struct: "T{" _entry* "}"

array: shape _normal_entry
shape: "(" _shape_body ")"
_shape_body: (NUMBER ",")* NUMBER


byteorder: BYTEORDER
repeat: NUMBER
prim: PRIMITIVE


IDENTIFIER: /[^:^\s]+/
NUMBER: ("0".."9")+
BYTEORDER: "@" | "=" | "<" | ">" | "^" | "!"
PRIMITIVE: "Zf" | "Zd" | "Zg" | /[?cbBhHiIlLqQfdgswOx]/

%ignore /\s+/
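To make it easier to try the suggested grammar on individual strings without installing lark, here is a rough hand-written recognizer for it. This is only a sketch. The names `_Recognizer` and `accepts` are made up, it only answers accept or reject without building a parse tree, and I left "x" out of the primitive set so that padding has a single spelling. It is not NumPy's parser.

```python
import re

BYTEORDER = r"[@=<>^!]"
# Assumption: "x" is excluded here so that it can only be parsed as padding.
PRIM = r"Z[fdg]|[?cbBhHiIlLqQfdgswO]"
SHAPE = r"\(\s*\d+\s*(?:,\s*\d+\s*)*\)"
NAME = r":[^:\s]+:"


class _Recognizer:
    def __init__(self, text):
        self.text = text
        self.pos = 0

    def match(self, pattern):
        """Skip whitespace, then consume pattern if it matches here."""
        m = re.compile(r"\s*(?:" + pattern + ")").match(self.text, self.pos)
        if m is None:
            return False
        self.pos = m.end()
        return True

    def at_end(self):
        return not self.text[self.pos:].strip()

    def entries(self, closer=None):
        """root (closer is None, needs one or more entries) or a struct body."""
        count = 0
        while True:
            if closer is not None and self.match(closer):
                return True
            if self.at_end():
                return closer is None and count > 0
            self.match(BYTEORDER)          # _entry: byteorder? entry
            if not self.entry():
                return False
            count += 1

    def entry(self):
        if self.match(r"\d*\s*x"):         # padding: NUMBER? "x", never named
            return True
        if not self.normal_entry():
            return False
        self.match(NAME)                   # name?
        return True

    def normal_entry(self):
        if self.match(SHAPE):              # array: shape _normal_entry
            return self.normal_entry()
        self.match(r"\d+")                 # repeat?
        if self.match(r"T\{"):             # struct: "T{" _entry* "}"
            return self.entries(closer=r"\}")
        return self.match(PRIM)


def accepts(fmt):
    """Return True if fmt is accepted by the suggested grammar."""
    return _Recognizer(fmt).entries()


for fmt in ["@(3,1)i", "(3,1)@i", "(4)(4)h", "(4,4)x", "16x", "x:pad:"]:
    print(fmt, accepts(fmt))
```

With this grammar the byte order mark goes before the shape, nested shapes are allowed, and arrays of padding and named padding are both rejected.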


hoodmane · 2023-08-16T13:23:03Z

Okay, looking also at the format strings in ctypes, it seems that my suggestion to move the byte order marks in relation to shapes is a nonstarter, since ctypes puts them in the same place as NumPy expects. But ctypes always emits a mark for each entry, so it doesn't seem to intend the scope rules that NumPy's parser applies to them.
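The ctypes format strings can be inspected directly through memoryview. The sketch below uses two made-up structures, `Point` and `WithArray`. The exact strings depend on the platform's byte order, so I am not quoting the output here, only where the marks end up.

```python
import ctypes
import re


class Point(ctypes.Structure):
    _fields_ = [("x", ctypes.c_double), ("y", ctypes.c_double)]


class WithArray(ctypes.Structure):
    # "grid" is a 2 x 3 array of doubles, "p" is a nested struct.
    _fields_ = [("grid", ctypes.c_double * 3 * 2), ("p", Point)]


# ctypes objects export the buffer protocol, so memoryview exposes the
# PEP 3118 format string that ctypes generates for them.
point_format = memoryview(Point()).format
array_format = memoryview(WithArray()).format

print(point_format)
print(array_format)

# Count the explicit byte order marks and check where the mark sits
# relative to the shape of the array field.
marks_in_point = sum(point_format.count(c) for c in "<>")
mark_after_shape = re.search(r"\(2,3\)[<>]d", array_format) is not None
```

Each field of `Point` carries its own mark, and for the array field the mark sits between the shape and the primitive.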

The following grammar accepts all the numpy formats and also the formats that are in the ctypes format string test suite.

Details

?start: root
root: entry+
?entry: (array | _normal_entry ) name?

array: shape _normal_entry
shape: "(" _shape_body ")"
_shape_body: (NUMBER ",")* NUMBER

_normal_entry: pointer | (byteorder? repeat? ( padding | struct | prim ))
pointer: "&" entry

struct: "T{" entry* "}"
padding: "x"


name:  ":" IDENTIFIER ":"
byteorder: BYTEORDER
repeat: NUMBER
prim: PRIMITIVE


IDENTIFIER: /[^:^\s]+/
NUMBER: ("0".."9")+
BYTEORDER: "@" | "=" | "<" | ">" | "^" | "!"
PRIMITIVE: "X{}" | "Zf" | "Zd" | "Zg" | /[?cbBhHiIlLqQfdgOs]/

%ignore /\s+/

ngoldbaum mentioned this issue Sep 25, 2023

fix: include numpy._core imports for NumPy 2.0 pybind/pybind11#4857

Merged
