# How DyND Views Memory

## Basics of the DyND Architecture Part 1

<p style="text-align:right">
Mark Wiebe<br/>
Continuum Analytics
</p>

# Motivating the DyND Array Data Structure

To work with DyND at a low level, particularly to be able to develop in the library, one needs a good understanding of how DyND views memory. We're going to work our way from Python objects to DyND arrays, motivating its design as we go.

At the end of these slides, you will hopefully grasp why DyND is structured as it is, and be prepared to dig deeper into its code.

# Python Objects

We begin with a look at a Python object. In Python, objects are stored in memory blocks independently allocated on the heap. A single floating point value looks something like this:

![Python Float](files/python_float.svg)

This doesn’t look too bad. In a dynamically typed system we need to store some metadata about the type, and we need to store the data containing the value.
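To make the picture concrete, here is a small sketch that measures a float object in CPython. It assumes a CPython interpreter; `sys.getsizeof` reports implementation-specific sizes, so the exact number may differ elsewhere (a typical 64-bit CPython build reports 24 bytes, of which 8 are the value itself).

```python
import sys

value = 1.0

# Total size of the float object, header included (CPython-specific).
object_size = sys.getsizeof(value)

# The value itself is a C double, which occupies 8 bytes.
payload_size = 8

print(f"object: {object_size} bytes, payload: {payload_size} bytes")
print(f"type metadata reached via: {type(value)!r}")
```

The difference between the two numbers is the per-object overhead: a reference count and a pointer to the type.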

# Python List

Often, however, we don’t want just one floating point value; we want a whole array of them. In Python, this is done using a list object, which looks like this:

![Python List of Float](files/python_list_of_float.svg)

# Python List Problems

We can start to see some problems from this picture.

* The float type metadata is repeated seven times. There is no way to say “everything is float, just store that once.”
* The memory for the floats might be separated or out of order in memory. Bad for CPU cache utilization.
* There is memory being used for the pointers to all the individual floats.

We will call this the “Smalltalk road.” Program data is made of many little objects, connected by pointers. Another option is the “Fortran road,” where program data is made of arrays of data, stored contiguously.
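The three problems listed above can be observed directly. The following is a rough sketch, assuming CPython (where `id()` happens to be the object's address); it uses the standard library's `array` module as a stand-in for contiguous "Fortran road" storage, and the variable names are only illustrative.

```python
import sys
from array import array

values = [0.5 * i for i in range(7)]  # seven separate float objects

# The list stores only pointers; each float is its own heap object.
list_total = sys.getsizeof(values) + sum(sys.getsizeof(v) for v in values)

# array('d') stores raw C doubles back to back, with no per-element header.
packed = array("d", values)
packed_payload = packed.itemsize * len(packed)

# Every element carries its own reference to the same type object.
element_types = {type(v) for v in values}

# In CPython, id() is the address of the object. Nothing guarantees
# that these addresses are adjacent or in increasing order.
addresses = [id(v) for v in values]

print(f"list + floats: {list_total} bytes, packed doubles: {packed_payload} bytes")
print(f"distinct element types: {element_types}")
print("addresses:", [hex(a) for a in addresses])
```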

# Smalltalk Road vs Fortran Road

There's a long but worthwhile set of lectures by Alexander Stepanov on Amazon A9's YouTube channel which goes into some depth about many programming topics.


```python
from IPython.display import YouTubeVideo
YouTubeVideo("Y4qdNddSWAg", start=190)
```

# Why Python Is Slow

Jake VanderPlas has [an excellent blog post](https://jakevdp.github.io/blog/2014/05/09/why-python-is-slow/) which dives into more detail about Python's structure, and that of the CPython interpreter in particular.

The game programming industry is a great place to look for performance inspiration, as its developers work under constraints far harsher than most. Bob Nystrom's book "Game Programming Patterns" [has a chapter about data locality](http://gameprogrammingpatterns.com/data-locality.html) which goes into nice depth on the topic.



# Dynamic Array

Let's take the Python list we looked at, factor out the redundant repetition of the float type metadata, and split all the type information away from the value storage. This gives us an abstract idea of how a dynamic array should look:

![Dynamic Array](files/abstract_array_of_float.svg)
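As a rough illustration of this factoring, the sketch below keeps the element type and the length in one place and the values in a single contiguous byte buffer. The `DynamicArray` class and its field names are hypothetical, made up for this example; they are not part of NumPy or DyND.

```python
import struct
from dataclasses import dataclass


@dataclass
class DynamicArray:
    """Hypothetical sketch: type info stored once, values stored contiguously."""

    element_format: str  # struct format code for the element type ("d" = float64)
    length: int          # the "Array of 7" part
    data: bytearray      # raw value storage with no per-element metadata

    @classmethod
    def from_values(cls, element_format, values):
        values = list(values)
        packed = struct.pack(f"={len(values)}{element_format}", *values)
        return cls(element_format, len(values), bytearray(packed))

    def __getitem__(self, i):
        if not 0 <= i < self.length:
            raise IndexError(i)
        fmt = "=" + self.element_format
        # The element's position follows from its index and the shared type.
        return struct.unpack_from(fmt, self.data, i * struct.calcsize(fmt))[0]


arr = DynamicArray.from_values("d", [0.0, 1.5, 3.0, 4.5, 6.0, 7.5, 9.0])
print(arr.element_format, arr.length, len(arr.data), arr[3])
```

The seven values occupy exactly 7 × 8 bytes, and the type description is stored once regardless of how many elements there are.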

# NumPy Array

The dynamic array we made is just a small step away from the NumPy array. In NumPy, the “Array of 7” part is represented as (ndim, shape, strides), and the “Float type” part is represented as the NumPy dtype.

![NumPy Array](files/numpy_array_of_float.svg)
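On a NumPy array these pieces are available as the `ndim`, `shape`, `strides`, and `dtype` attributes. To keep the example free of dependencies, the sketch below instead uses the standard library's `memoryview`, which reports the same (ndim, shape, strides) triple through the buffer protocol; its `format` code plays roughly the role of the dtype here.

```python
from array import array

# Seven float64 values in one contiguous buffer.
storage = array("d", [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
view = memoryview(storage)

# The "Array of 7" part: number of dimensions, extent, and byte step per index.
print(view.ndim, view.shape, view.strides)

# The "Float type" part: element format code and element size in bytes.
print(view.format, view.itemsize)
```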

# NumPy Strided Array

The strided array approach NumPy takes turns out to be a very good data structure for many kinds of numeric data. Allowing the strides to be set arbitrarily means both C-order (row-major) and Fortran-order (column-major) data can be represented naturally. Good material is available that explores the consequences and benefits of this design in more depth.

One good resource is [this section of open SciPy lecture notes.](https://scipy-lectures.github.io/intro/numpy/array_object.html)

<img src="https://scipy-lectures.github.io/_images/numpy_indexing.png" width="500"/>
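The way strides let one buffer be read in either order can be sketched in a few lines of plain Python. The helper `strided_get` is hypothetical and written only for this illustration; it computes a byte offset as the sum of index times stride, which is the addressing rule a strided array uses.

```python
from array import array


def strided_get(buffer, strides, index, itemsize):
    """Read one element; the byte offset is the dot product of index and strides."""
    offset = sum(i * s for i, s in zip(index, strides))
    return buffer[offset // itemsize]


buffer = array("d", [0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
itemsize = buffer.itemsize
shape = (2, 3)

# Row-major (C order): the last index varies fastest in memory.
c_strides = (shape[1] * itemsize, itemsize)
# Column-major (Fortran order): the first index varies fastest in memory.
f_strides = (itemsize, shape[0] * itemsize)

c_rows = [[strided_get(buffer, c_strides, (i, j), itemsize) for j in range(shape[1])]
          for i in range(shape[0])]
f_rows = [[strided_get(buffer, f_strides, (i, j), itemsize) for j in range(shape[1])]
          for i in range(shape[0])]

print(c_rows)
print(f_rows)
```

The buffer is never copied or rearranged; only the strides change how an index maps to a position in memory.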

# Weaknesses of NumPy

NumPy, as fantastic as it is, has proved to be lacking in many areas. It provides something good enough and stable enough that a huge ecosystem has grown on top of it, but there are many directions people try to stretch it that require incredible contortions or performance sacrifices to achieve.

A few of these weaknesses include:

* A slow pace of evolution. NumPy development is highly constrained by backwards compatibility demands.
* Written directly against the CPython API. There is no core library that can be shared in a larger community.
* Implemented in C, which tends to be very verbose and error-prone compared to its expressiveness. C++, especially with the recent C++11 and C++14 standards, has many features that help in implementing this kind of library.
* Missing features that people want: Ragged arrays, more data types, labeled dimensions, and physical units metadata, to name a few.

# The DyND Array

DyND begins one step back from NumPy in the data representation narrative we have followed, generalizing the “Dynamic Array” we saw differently. Instead of focusing on the “Array of 7” part and turning it into a “general strided multi-dimensional array,” DyND treats it as one dimension of a hierarchically represented type. DyND takes the regularities in the types of stored data, bundles them together into a type, and takes the data as a separate bundle. This gets us a picture like this:

![DyND Array Take 1](files/dynd_array_of_float_take_1.svg)
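The idea of an array dimension being one level of a hierarchical type can be sketched as nested type objects. Everything below is a hypothetical illustration: the class names `Float64` and `FixedDim`, the `data_size` attribute, and the printed notation are invented for this example and are not DyND's actual classes or API. The sketch also assumes tightly packed elements, which is a simplification.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Float64:
    """Hypothetical scalar type: describes one 8-byte floating point value."""

    data_size: int = 8

    def __str__(self):
        return "float64"


@dataclass(frozen=True)
class FixedDim:
    """Hypothetical dimension type: `size` repetitions of an element type."""

    size: int
    element: object

    @property
    def data_size(self):
        # Assumes elements are packed with no gaps between them.
        return self.size * self.element.data_size

    def __str__(self):
        return f"{self.size} * {self.element}"


# The type is one bundle; the data it describes would be a separate bundle.
array_type = FixedDim(7, Float64())
matrix_type = FixedDim(3, array_type)

print(array_type, array_type.data_size)
print(matrix_type, matrix_type.data_size)
```

Because a dimension is just another type wrapping an element type, adding a second dimension is a matter of nesting one more level rather than changing the structure.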