Skip to content

Array Internal Representation

Masahiro Sakuta edited this page Jan 1, 2023 · 6 revisions

Array semantics

An array is, like every programmer expects, a sequence of values. In our language, an array index starts with 0. There are discussions whether an array should start with 1 or 0, but array index is historically a memory offset from the head of the array, so 0 is adapted by most languages. Our language is no exception.

An array element can be retrieved by a bracket [] operator. For example,

var a = [1, 2, 3];
print(a[1]);

will print 2.

It may seem trivial to implement this semantics until we have another kind of operation, which is assignment.

var a = [1, 2, 3];
a[1] = 20;
print(a);

This should print [1, 20, 3].

The LHS of the assignment operator is an indexed array. So we need to somehow retain the reference to the original array in order to modify its contentes.

You may think that we may implement assignment as a special operation, like []= operator, at least internally, but it falls apart when you access nested arrays:

var a = [[1,2],[3,4]];
a[1][0] = 30;

So we need to accept that LHS of assignment operator can be an arbitrary expression. We may enforce lvalue semantics to the LHS, but not for now. if we attempt to assign to a temporary variable, such as a function return value, the value will be gone.

Wrapping every element in Rc<RefCell<_>>

First thing you may try is to wrap every element in Rc<RefCell<_>>, so when you access an element you can return a reference to it.

struct ArrayInt {
    values: Vec<Rc<RefCell<Value>>>,
}

enum Value {
    // ...
    Array(Rc<RefCell<ArrayInt>>),
    Ref(Rc<RefCell<Value>>),
}

Alghough it works, its memory utilization is terrible. If you try to visualize a simple array of i32s var a = [1, 2, 3], and have a reference to the first element a[0], it would look like this in memory:

   +----+    +----------+---------+-----+
a: | a  | -> | RefCount | RefCell | Vec |
   +----+    +----------+---------+-----+
                                     |
                                     V
                                  +----+----+----+
                                  | Rc | Rc | Rc |
                                  +----+----+----+
                                     |    |    |      +----------+---------+-------+
                                     |    |    +----> | RefCount | RefCell | Value |
                                     |    |           +----------+---------+-------+
                                     |    |      +----------+---------+-------+
                                     |    +----> | RefCount | RefCell | Value |
                                     V           +----------+---------+-------+
                                   +----------+---------+-------+
                    +------------> | RefCount | RefCell | Value |
                    |              +----------+---------+-------+
      +------+      |
a[0]: | a[0] | -----+
      +------+

Not only Rc and RefCell occupy 8 bytes each, but also each element of the array is allocated in a heap memory block. If you imagine this with 100,000 elements, it would be a huge performance impact.

ArrayRef type

So we introduce a new kind of a reference, ArrayRef, defined as:

 struct ArrayInt {
-    values: Vec<Rc<RefCell<Value>>>,
+    values: Vec<Value>,
 }

 enum Value {
     // ...
     Array(Rc<RefCell<ArrayInt>>),
     Ref(Rc<RefCell<Value>>),
+    ArrayRef(Rc<RefCell<ArrayInt>>, usize),
 }

Note that ArrayInt is just a plain Vec of Values.

An ArrayRef is a reference to the containing array and the index into the array. The memory layout now looks like this:

   +----+    +----------+---------+-----+     +-------+-------+-------+
a: | a  | -> | RefCount | RefCell | Vec | --> | Value | Value | Value |
   +----+    +----------+---------+-----+     +-------+-------+-------+
               ^
         +-----+
         |
      +----+---+
a[0]: | a  | 0 |
      +----+---+

By allowing reference without wrapping every array element in a Rc<RefCell<_>>, the memory layout is much more compact and cache friendly. To be exact, a Value has size of 32 bytes in 64 bit computer, while Rc and RefCell costs 8 bytes each. It doesn't seem much, but keep in mind that heap memory block requires some more memory for housekeeping.

The best part of this design is that the semantics look the same to the user.

This idea of having reference to the container and an index can be generalized to a slice in the future.

Type information for homogeneous array

Now, we get into more reality. We want to have homogeneous array, which means every element in an array has the same type. Right now, each Value has a type attached to it, so it could be heterogeneous, but we want more compact and efficient format.

The first thing we need to do is to move the type information to the array header.

 pub struct ArrayInt {
+    type_decl: TypeDecl,
     values: Vec<Value>,
 }

+pub enum TypeDecl {
+    Any,
+    F64,
+    // ...
+}
Clone this wiki locally