Skip to content

Array Internal Representation

Masahiro Sakuta edited this page Dec 24, 2022 · 6 revisions

Array semantics

An array is, like every programmer expects, a sequence of values.

An array element can be retrieved by bracket [] operator. For example,

var a = [1, 2, 3];
print(a[1]);

will print 2.

It may seem trivial to implement this semantics until we have another kind of operation, which is assignment.

var a = [1, 2, 3];
a[1] = 20;
print(a);

This should print [1, 20, 3].

The LHS of the assignment operator is an indexed array. So we need to somehow retain the reference to the original array in order to modify its contentes.

Wrapping every element in Rc<RefCell<_>>

First thing you may try is to wrap every element in Rc<RefCell<_>>, so when you access an element you can return a reference to it.

struct ArrayInt {
    values: Vec<Rc<RefCell<Value>>>,
}

enum Value {
    // ...
    Array(Rc<RefCell<ArrayInt>>),
    Ref(Rc<RefCell<Value>>),
}

Alghough it works, its memory utilization is terrible. If you try to visualize a simple array of i32s var a = [1, 2, 3], it would look like this in memory:

+----+    +----------+---------+----+----+----+
| a  | -> | RefCount | RefCell | Rc | Rc | Rc |
+----+    +----------+---------+----+----+----+
                                  |    |    |      +----------+---------+-------+
                                  |    |    +----> | RefCount | RefCell | Value |
                                  |    |           +----------+---------+-------+
                                  |    |      +----------+---------+-------+
                                  |    +----> | RefCount | RefCell | Value |
                                  V           +----------+---------+-------+
                              +----------+---------+-------+
                              | RefCount | RefCell | Value |
                              +----------+---------+-------+

Not only Rc and RefCell occupy 8 bytes each, each element of the array is allocated in a heap memory. If you imagine this with 100,000 elements, it would be a huge performance impact.

ArrayRef type

So we introduce a new kind of a reference, ArrayRef, defined as:

 enum Value {
     // ...
     Array(Rc<RefCell<ArrayInt>>),
     Ref(Rc<RefCell<Value>>),
+    ArrayRef(Rc<RefCell<ArrayInt>>, usize),
 }

It is a reference to the containing array and the index into the array. The memory layout now looks like this:

+----+    +----------+---------+-------+-------+-------+
| a  | -> | RefCount | RefCell | Value | Value | Value |
+----+    +----------+---------+-------+-------+-------+

It is much more compact.

Type information for homogeneous array

Now, we get into more reality. We want to have homogeneous array, which means every element in an array has the same type. Right now, each Value has a type attached to it, so it could be heterogeneous, but we want more compact and efficient format.

The first thing we need to do is to move the type information to the array header.

 pub struct ArrayInt {
+    type_decl: TypeDecl,
     values: Vec<Value>,
 }

+pub enum TypeDecl {
+    Any,
+    F64,
+    // ...
+}
Clone this wiki locally