The DyND ND::Array

The DyND nd::array is a multidimensional data storage container, inspired by the NumPy ndarray and based on the Blaze datashape system. Like NumPy, it supports strided multidimensional arrays of data with a uniform data type, but it can also store ragged arrays and data types with variable-sized data.

ND::Array Structure

The idea behind the data structure is to break the NumPy ndarray apart into three components: a data type, some metadata such as strides and shape, and the data. Here's how this looks in NumPy:

>>> a = np.array([[1,2,3],[4,5,6]])
# data type
>>> a.dtype
dtype('int32')
# metadata
>>> a.shape
(2L, 3L)
>>> a.strides
(12L, 4L)
>>> a.flags
  C_CONTIGUOUS : True
  F_CONTIGUOUS : False
  OWNDATA : True
  WRITEABLE : True
  ALIGNED : True
  UPDATEIFCOPY : False
# data
>>> a.data
<read-write buffer for 0x00000000028131B0, size 24, offset 0 at 0x00000000031F0FB8>

As exposed to Python by the DyND nd::array, the same data is:

>>> a = nd.array([[1,2,3],[4,5,6]])
>>> nd.debug_repr(a)
------ array
 address: 00000000004566A0
 refcount: 1
 type:
  pointer: 0000000000463410
  type: strided * strided * int32
 metadata:
  flags: 5 (read_access immutable )
  type-specific metadata:
   strided_dim metadata
    stride: 12
    size: 2
    strided_dim metadata
     stride: 4
     size: 3
 data:
   pointer: 00000000004566F0
   reference: 0000000000000000 (embedded in array memory)
------

Some things that were metadata in NumPy ndarrays have become part of the dtype in DyND nd::arrays. The fact that this is a two-dimensional strided array is encoded in the DyND dtype. One way to think of this dtype is that it is a strided array of strided int32 arrays.

The debug printout first shows some metadata about the nd::array itself, which is currently only the access permission flags. The dtype-specific metadata is determined by the structure of the dtype. In this case, each strided dimension of the dtype owns some metadata memory (labeled strided_dim in the printout) which contains its stride and dimension size. The int32 doesn't require any additional metadata.

The data consists of a pointer to the memory containing the array elements, and a reference to a memory block which holds the data for the nd::array. In this case, the data has been embedded within the same memory allocation as the nd::array metadata.
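To make the role of the per-dimension stride and size concrete, here is a minimal Python sketch of how an element could be located from that metadata. It is not DyND code: the `dims` list and the `element_offset` and `read_int32` helpers are hypothetical names used for illustration, and the buffer is built with the standard `struct` module.

```python
import struct

# Raw bytes for [[1, 2, 3], [4, 5, 6]]: six 4-byte ints in row-major order.
data = struct.pack("=6i", 1, 2, 3, 4, 5, 6)

# Hypothetical stand-in for the type-specific metadata of
# "strided * strided * int32": one (stride, size) pair per dimension,
# outermost first, using the values from the printout above.
dims = [(12, 2), (4, 3)]


def element_offset(index, dims):
    """Return the byte offset of an element: the sum of index * stride."""
    offset = 0
    for i, (stride, size) in zip(index, dims):
        if not 0 <= i < size:
            raise IndexError("index out of bounds")
        offset += i * stride
    return offset


def read_int32(data, index, dims):
    """Read the int32 stored at the offset computed from the metadata."""
    (value,) = struct.unpack_from("=i", data, element_offset(index, dims))
    return value


# Row 1, column 2 lives at byte offset 1 * 12 + 2 * 4 = 20.
print(element_offset((1, 2), dims), read_int32(data, (1, 2), dims))
```

The int32 at the innermost level contributes nothing to the loop, which mirrors the fact that it needs no metadata of its own.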

Memory Blocks

Data for nd::arrays is stored in memory blocks, which are low-level reference-counted objects containing allocated memory or a handle/reference from some other system. For example, when a NumPy ndarray is converted into an nd::array, a PyObject pointer and a version of Py_DECREF which grabs the GIL are held by the memory block.

The memory block itself doesn't know where within it the data needed by an nd::array is located, so wherever nd::arrays hold a memory block reference, they also hold a raw data pointer.
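The pairing of an owning reference with a separate raw pointer can be sketched in Python as follows. This is an illustration only: `MemoryBlock` and `ArrayData` are hypothetical classes, the "pointer" is modeled as a byte offset, and the plain integer count is not thread-safe (DyND uses atomic operations for that).

```python
class MemoryBlock:
    """Hypothetical reference-counted owner of memory or a foreign object."""

    def __init__(self, owner, free_function):
        self.owner = owner                  # buffer or handle being kept alive
        self.free_function = free_function  # called once the count hits zero
        self.refcount = 1                   # the creator holds one reference

    def incref(self):
        self.refcount += 1

    def decref(self):
        self.refcount -= 1
        if self.refcount == 0:
            self.free_function(self.owner)
            self.owner = None


class ArrayData:
    """Hypothetical (pointer, reference) pair like the one in the printout."""

    def __init__(self, block, offset):
        block.incref()
        self.block = block    # keeps the memory alive
        self.offset = offset  # says where in that memory the data starts

    def release(self):
        self.block.decref()
        self.block = None


freed = []  # records which owners have been released
block = MemoryBlock(bytearray(12), freed.append)
first = ArrayData(block, 0)   # points at the start of the block
second = ArrayData(block, 4)  # points 4 bytes in, same block
block.decref()                # the creator drops its own reference
first.release()
still_alive = not freed       # second still holds a reference here
second.release()              # last reference gone, free function runs
print(still_alive, len(freed))
```

Two views share one block but carry different offsets, which is why the block alone cannot say where any particular array's data begins.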

Indexing Example

Here's a small example showing the result of a simple indexing operation.

>>> a = nd.array([1,2,3])
>>> nd.debug_repr(a)
------ array
 address: 000000000043F5B0
 refcount: 1
 type:
  pointer: 000000000048FA80
  type: strided * int32
 metadata:
  flags: 5 (read_access immutable )
  type-specific metadata:
   strided_dim metadata
    stride: 4
    size: 3
 data:
   pointer: 000000000043F5F0
   reference: 0000000000000000 (embedded in array memory)
------

>>> nd.debug_repr(a[1])
------ array
 address: 0000000000461460
 refcount: 1
 type:
  pointer: 0000000000000004
  type: int32
 metadata:
  flags: 5 (read_access immutable )
 data:
   pointer: 000000000043F5F4
   reference: 000000000043F5B0
    ------ memory_block at 000000000043F5B0
     reference count: 2
     type: ndarray
     type: strided * int32
    ------
------

In the printout of a[1], the first thing to note is the dtype pointer: it's just the value 4. This is because a small number of builtin dtypes are represented in the nd::array by nothing more than a type id.

Compare the pointer and reference of a[1] with that of a. The pointer is 4 greater, as expected for indexing element 1 with a stride of 4. The reference is the same as the nd::array a's address, which you can see at the top of the printout. That's because the array data was embedded in the nd::array's memory, so a reference to that nd::array gets substituted for NULL while indexing.
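The same pointer arithmetic can be observed from NumPy, as a rough analogue rather than as DyND behavior. In NumPy, `a[1]` yields a scalar instead of a view, so this sketch uses the slice `a[1:]`, whose data pointer also starts at element 1 and whose `base` attribute plays the role of the reference.

```python
import numpy as np

a = np.array([1, 2, 3], dtype=np.int32)
view = a[1:]  # a view that starts at element 1 of a

# The first entry of __array_interface__["data"] is the raw data address.
base_address = a.__array_interface__["data"][0]
view_address = view.__array_interface__["data"][0]

# The view's pointer is one stride past the original pointer.
difference = view_address - base_address
print(difference, a.strides[0])

# The view keeps the original array alive through its base reference.
print(view.base is a)
```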

NumPy Example

Here's an example of an array sourced from NumPy. To make it more interesting, we cause the memory of the array to be unaligned.

>>> mem = np.zeros(9, dtype='i1')
>>> a = mem[1:].view(dtype='i4')
>>> b = nd.view(a)
>>> nd.debug_repr(b)
------ array
 address: 000000000048CD50
 refcount: 1
 type:
  pointer: 0000000000474D90
  type: strided * unaligned[int32]
 metadata:
  flags: 3 (read_access write_access )
  type-specific metadata:
   strided_dim metadata
    stride: 4
    size: 2
 data:
   pointer: 000000000404FC61
   reference: 000000000043A790
    ------ memory_block at 000000000043A790
     reference count: 1
     type: external
     object void pointer: 0000000003F9C610
     free function: 000007FEE582251D
    ------
------

Because the data isn't aligned, the nd::array can't have a straight int32 dtype. The solution is to make an unaligned[int32] dtype, which has alignment 1 instead of alignment 4 like int32.
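The misalignment can be checked from the NumPy side before handing the array to DyND. The sketch below uses only NumPy attributes and compares the data address against the alignment NumPy reports for the dtype. Whether the address is actually misaligned depends on where NumPy placed the underlying buffer, so no particular output is assumed.

```python
import numpy as np

mem = np.zeros(9, dtype="i1")
a = mem[1:].view(dtype="i4")  # same construction as in the example above

address = a.__array_interface__["data"][0]  # raw data address
required = a.dtype.alignment                # alignment NumPy wants for int32

# The data is aligned only if the address is a multiple of the requirement.
is_aligned = address % required == 0
print(required, is_aligned, a.flags.aligned)
```

NumPy's own `ALIGNED` flag should agree with the manual check, since the element stride here is a multiple of the required alignment.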

The data reference is an external memory block here. The "object void pointer" is a pointer to the PyObject*, and the "free function" is a function which wraps Py_DECREF in some code to ensure the GIL is being held. Unfortunately, this means that freeing this object will be more expensive than normal, but there isn't really another option that permits nd::arrays to be used safely across multiple threads. For memory blocks themselves, an atomic increment/decrement is used to provide this thread safety.

Variable-Length String Example

The default string dtype for DyND is parameterized by its encoding and is variable-length. This is handled by keeping a memory block reference in the string's metadata, while the primary string data itself is a pair of pointers: one to the beginning of the string and one to one past its end. As an example, here's how a single Python string converts to an nd::array:

>>> a = nd.array("This is a string")
>>> nd.debug_repr(a)
------ array
 address: 00000000004DA8C0
 refcount: 1
 type:
  pointer: 0000000000508EA0
  type: string
 metadata:
  flags: 5 (read_access immutable )
  type-specific metadata:
   string metadata
    ------ NULL memory block
 data:
   pointer: 00000000004DA8F8
   reference: 0000000000000000 (embedded in array memory)
------

What you will notice here is that the memory block inside of the string metadata is NULL, as is the data reference at the nd::array level. This is because a larger amount of memory was allocated for the nd::array, and the space at the end was used for the string, to minimize the number of memory allocations. Both of the NULL memory block references indicate this.

Let's do this also for an array of strings:

>>> a = nd.array([u"This", u"is", u"unicode."])
>>> nd.debug_repr(a)
------ array
 address: 000000000050F1C0
 refcount: 1
 type:
  pointer: 00000000004ED810
  type: strided * string
 metadata:
  flags: 5 (read_access immutable )
  type-specific metadata:
   strided_dim metadata
    stride: 16
    size: 3
    string metadata
     ------ memory_block at 000000000050C530
      reference count: 1
      type: pod
      finalized: 14
     ------
 data:
   pointer: 000000000050F208
   reference: 0000000000000000 (embedded in array memory)
------

In this case the nd::array's reference is still NULL, indicating its memory is combined with the nd::array, but the string data itself is in a separate memory block.
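The layout described here, with fixed-size records in the array and the characters held in a separate block, can be sketched in plain Python. This is an illustration under stated assumptions, not DyND's implementation: `pack_strings` and `read_string` are hypothetical helpers, and byte offsets into the data buffer stand in for real begin/end pointers. Each record is two 8-byte integers, which matches the stride of 16 in the printout on a 64-bit build.

```python
import struct

RECORD_FORMAT = "=qq"                        # (begin, end) as two 8-byte ints
RECORD_SIZE = struct.calcsize(RECORD_FORMAT)  # size of one string record


def pack_strings(strings, encoding="utf-8"):
    """Pack strings into fixed-size records plus one separate data block."""
    data = bytearray()     # stands in for the separate string memory block
    records = bytearray()  # stands in for the strided array of string records
    for s in strings:
        begin = len(data)
        data += s.encode(encoding)
        # "end" is one past the last byte, as in the text above.
        records += struct.pack(RECORD_FORMAT, begin, len(data))
    return bytes(records), bytes(data)


def read_string(records, data, i, encoding="utf-8"):
    """Follow record i into the data block and decode the bytes it spans."""
    begin, end = struct.unpack_from(RECORD_FORMAT, records, i * RECORD_SIZE)
    return data[begin:end].decode(encoding)


records, data = pack_strings(["This", "is", "unicode."])
print(RECORD_SIZE, len(records), len(data))
print([read_string(records, data, i) for i in range(3)])
```

The records have a uniform size regardless of string length, which is what lets the outer dimension stay a plain strided array.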