Skip to content

Array Internal Representation

Masahiro Sakuta edited this page Dec 24, 2022 · 6 revisions

Array semantics

An array is, like every programmer expects, a sequence of values.

An array element can be retrieved by bracket [] operator. For example,

var a = [1, 2, 3];
print(a[1]);

will print 2.

It may seem trivial to implement this semantics until we have another kind of operation, which is assignment.

var a = [1, 2, 3];
a[1] = 20;
print(a);

This should print [1, 20, 3].

The LHS of the assignment operator is an indexed array. So we need to somehow retain the reference to the original array in order to modify its contentes.

You may think that we may implement assignment as a special operation, like []= operator, at least internally, but it falls apart when you access nested arrays:

var a = [[1,2],[3,4]];
a[1][0] = 30;

So we need to accept that LHS of assignment operator can be an arbitrary expression. We may enforce lvalue semantics to the LHS, but not for now. if we attempt to assign to a temporary variable, such as a function return value, the value will be gone.

Wrapping every element in Rc<RefCell<_>>

First thing you may try is to wrap every element in Rc<RefCell<_>>, so when you access an element you can return a reference to it.

struct ArrayInt {
    values: Vec<Rc<RefCell<Value>>>,
}

enum Value {
    // ...
    Array(Rc<RefCell<ArrayInt>>),
    Ref(Rc<RefCell<Value>>),
}

Alghough it works, its memory utilization is terrible. If you try to visualize a simple array of i32s var a = [1, 2, 3], it would look like this in memory:

   +----+    +----------+---------+----+----+----+
a: | a  | -> | RefCount | RefCell | Rc | Rc | Rc |
   +----+    +----------+---------+----+----+----+
                                     |    |    |      +----------+---------+-------+
                                     |    |    +----> | RefCount | RefCell | Value |
                                     |    |           +----------+---------+-------+
                                     |    |      +----------+---------+-------+
                                     |    +----> | RefCount | RefCell | Value |
                                     V           +----------+---------+-------+
                                 +----------+---------+-------+
                            +--> | RefCount | RefCell | Value |
                            |    +----------+---------+-------+
      +------+              |
a[0]: | a[0] | -------------+
      +------+

Not only Rc and RefCell occupy 8 bytes each, but also each element of the array is allocated in a heap memory block. If you imagine this with 100,000 elements, it would be a huge performance impact.

ArrayRef type

So we introduce a new kind of a reference, ArrayRef, defined as:

 struct ArrayInt {
-    values: Vec<Rc<RefCell<Value>>>,
+    values: Vec<Value>,
 }

 enum Value {
     // ...
     Array(Rc<RefCell<ArrayInt>>),
     Ref(Rc<RefCell<Value>>),
+    ArrayRef(Rc<RefCell<ArrayInt>>, usize),
 }

Note that ArrayInt is just a plain Vec of Values.

An ArrayRef is a reference to the containing array and the index into the array. The memory layout now looks like this:

   +----+    +----------+---------+-------+-------+-------+
a: | a  | -> | RefCount | RefCell | Value | Value | Value |
   +----+    +----------+---------+-------+-------+-------+
               ^
         +-----+
         |
      +----+---+
a[0]: | a  | 0 |
      +----+---+

By allowing reference without wrappnig every array element in a Rc<RefCell<_>>, the memory layout is much more compact and cache friendly. The semantics look the same to the user.

This idea of having reference to the container and an index can be generalized to a slice in the future.

Type information for homogeneous array

Now, we get into more reality. We want to have homogeneous array, which means every element in an array has the same type. Right now, each Value has a type attached to it, so it could be heterogeneous, but we want more compact and efficient format.

The first thing we need to do is to move the type information to the array header.

 pub struct ArrayInt {
+    type_decl: TypeDecl,
     values: Vec<Value>,
 }

+pub enum TypeDecl {
+    Any,
+    F64,
+    // ...
+}
Clone this wiki locally