Skip to content

Running | Data Representation

Vlad Ureche edited this page Jun 13, 2014 · 12 revisions

This tutorial will pick up where the introduction left off. It will discuss the data representation aspects of the miniboxing transformation, as explained in the Unifying Data Representation Transformations paper.

In the introduction, we created a class C:

class C[@miniboxed T](val t: T)

and noticed that its specialized variants C_L and C_J included a weird Tsp @storage[Long] type in the minibox-inject compiler phase, which was later transformed to Long in minibox-commit. Let us look at this process.

Compiler Phases

First of all, there are several miniboxing phases in the compiler pipeline:

$ mb-scalac -Xshow-phases
    phase name  id  description
    ----------  --  -----------
           ...  ..  ...
       uncurry  13  uncurry, translate function values to anonymous classes
minibox-inject  14
minibox-coerce  15
minibox-commit  16
     tailcalls  17  replace tail calls by jumps
           ...  ..  ...

Here we see the main three phases introduced by the miniboxing plugin (there are another 3 which are introduced for purely technical reasons, to maintain compatibility with the rest of the compiler: pretyper, posttyper and hijacker). The main tree phases map directly to the data representation mechanism phases:

  • minibox-inject duplicates methods and classes and adds the @storage annotation
  • minibox-coerce introduces explicit coercions between boxed and miniboxed values
  • minibox-commit gives the final semantics to annotated types and coercions

Miniboxing and Subtyping

Having three transformation phases sounds like a lot of work for an otherwise trivial task: transforming a type parameter T to Long. To see why this is necessary, let us take an example:

object DR1 extends App {
  def foo[@miniboxed T](t: T): Unit = {
    val a: Any = t
    println(a)
  }
  foo(3.14)
}

Compiling this code will produce two versions of the method: foo, the generic variant and foo_n_J, that encodes primitive types in a long integer. The last call in the object, to foo(3.24) will be rewritten to use foo_n_J.

Yet, the more interesting part is how the val a: Any = t statement is translated. If we simply replace T by Long, a call to foo would not print 3.14 as expected, but the long integer encoding of the floating-point number 3.14. This issue stems from the interaction between the miniboxed data representation and subtyping, which occurs when assigning the miniboxed value t, encoded as a long integer, to the value a, which is of type Any, the top type in the Scala type system. Since Long is a subtype of Any a naive transformation would produce incorrect results and would not be caught by the type system.

Let us see how the miniboxing transformation handles this case (output simplified for readability):

$ mb-scalac DR1.scala -Xprint:uncurry,minibox
warning: 'minibox' selects 3 phases
[[syntax trees at end of                   uncurry]] // DR1.scala
package <empty> {
  object DR1 extends Object with App {
    def foo[@miniboxed T](t: T): Unit = {
      val a: Any = t;
      println(a)
    };
    DR1.this.foo[Double](3.14)
  }
}

[[syntax trees at end of            minibox-inject]] // DR1.scala
package <empty> {
  object DR1 extends Object with App {
    def foo[@miniboxed T](t: T): Unit = {
      val a: Any = t;
      println(a)
    };
    def foo_n_J[T](T_TypeTag: Byte, t: T @storage[Long]): Unit = {
      val a: Any = t;
      println(a)
    };
    DR1.this.foo_n_J[Double](8, 3.14)
  }
}

[[syntax trees at end of            minibox-coerce]] // DR1.scala
package <empty> {
  object DR1 extends Object with App {
    def foo[@miniboxed T](t: T): Unit = {
      val a: Any = t;
      println(a)
    };
    def foo_n_J[T](T_TypeTag: Byte, t: T @storage[Long]): Unit = {
      val a: Any = marker_minibox2box[T, Long](t);
      println(a)
    };
    DR1.this.foo_n_J[Double](8, marker_box2minibox[Double, Long](3.14))
  }
}

[[syntax trees at end of            minibox-commit]] // DR1.scala
package <empty> {
  object DR1 extends Object with App {
    def foo[@miniboxed T](t: T): Unit = {
      val a: Any = t;
      println(a)
    };
    def foo_n_J[T](T_TypeTag: Byte, t: Long): Unit = {
      val a: Any = MiniboxConversions.minibox2box[T](t, T_TypeTag);
      println(a)
    };
    DR1.this.foo_n_J[Double](8, MiniboxConversions.double2minibox(3.14))
  }
}

The first phase, minibox-inject creates the a second version of the foo method, named foo_n_J, which will be transformed to use the miniboxing data representation. It also redirects the call foo(3.14) to foo_n_J(DOUBLE, 3.14). At this point in the transformation, the signature of foo_n_J includes T @storage[Long]. This annotation signals that the type will be later represented as a long integer, but, at this stage, remains a generic type T. Therefore the code val a: Any = t is still correct, since T @storage[Long] is compatible to T, which in turn is a subtype of Any. So far so good...

But the minibox-coerce phase makes @storage[...] annotated types incompatible with their direct counterparts, which, in turn, requires the introduction of explicit coercions. Specifically, the code val a: Any = t is rewritten to val a: Any = marker_minibox2box[T, Long](t), since at this stage of the transformation, T @storage[Long] is no longer a subtype of Any, which is not annotated. As the name suggests, the coercion introduced at this point is a marker, not the final coercion.

The minibox-commit phase commits to the actual alternative representation, which, in this case, is Long. The signature of foo_n_J becomes def foo_n_J[T](T_TypeTag: Byte, t: Long) and the marker coercion is replaced by MiniboxConversions.minibox2box[T](t, T_TypeTag).

This three stage-transformation allows the miniboxing plugin to robustly, correctly and optimally transform any code, from simple examples to very complex library collection code, which uses all the language features, such as higher-kinded types, closures and implicits.

Miniboxing and Object-oriented dispatch

Let us now jump to another example:

object DR2 {
  def foo[@miniboxed T](t: T): Unit = {
    println(t.toString)
  }
  foo(2.71)
}

As before, simply transforming T to Long in the miniboxed variant of method foo would not suffice, as the program output would be the long integer encoding 2.71 instead of the double-precision floating-point number 2.71. This is again solved by using the data representation transformation mechanism:

$ mb-scalac DR2.scala -Xprint:uncurry,minibox
warning: 'minibox' selects 3 phases
[[syntax trees at end of                   uncurry]] // DR2.scala
package <empty> {
  object DR2 extends Object {
    def foo[@miniboxed T](t: T): Unit = 
      println(t.toString());
    DR2.this.foo[Double](2.71)
  }
}

[[syntax trees at end of            minibox-inject]] // DR2.scala
package <empty> {
  object DR2 extends Object {
    def foo[@miniboxed T](t: T): Unit = 
      println(t.toString());
    def foo_n_J[T](T_TypeTag: Byte, t: T @storage[Long]): Unit = 
      println(t.toString());
    DR2.this.foo_n_J[Double](8, 2.71)
  }
}

[[syntax trees at end of            minibox-coerce]] // DR2.scala
package <empty> {
  object DR2 extends Object {
    def foo[@miniboxed T](t: T): Unit = 
      println(t.toString());
    def foo_n_J[T](T_TypeTag: Byte, t: T @storage[Long]): Unit = 
      println(marker_minibox2box[T, Long](t).toString());
    DR2.this.foo_n_J[Double](8, marker_box2minibox[Double, Long](2.71))
  }
}

[[syntax trees at end of            minibox-commit]] // DR2.scala
package <empty> {
  object DR2 extends Object {
    def foo[@miniboxed T](t: T): Unit = 
      println(t.toString());
    def foo_n_J[T](T_TypeTag: Byte, t: Long): Unit = 
      println(MiniboxDispatch.mboxed_toString(t, T_TypeTag));
    DR2.this.foo_n_J[Double](8, MiniboxConversions.this.double2minibox(2.71))
  }
}

In our example, the receiver of the toString method call is a value with type T @storage[Long]. Since changing the representation will also affect the semantics of methods called on a miniboxed value, the data representation mechanism will automatically introduce a coercion to the boxed representation, which can handle method calls. This can be seen in the foo_n_J method translation in the minibox-coerce phase:

  def foo_n_J[T](T_TypeTag: Byte, t: T @storage[Long]): Unit = 
    println(marker_minibox2box[T, Long](t).toString());

What is even more interesting is what happens in the minibox-commit phase:

  def foo_n_J[T](T_TypeTag: Byte, t: Long): Unit = 
    println(MiniboxDispatch.mboxed_toString(t, T_TypeTag));

Instead of boxing the value using a miniboxToBox and calling toString on the boxed value, the miniboxing plugin short-circuited this by having a special method for this, mboxed_toString, which does not require boxing at all. This is critical to obtaining good performance in the case of arrays of miniboxed values, as was discussed in the original Miniboxing paper.

Advanced miniboxing data representation

One thing you probably noticed is that the annotated type is T @storage[Long]. Why explicitly put Long, the storage type, in the annotation? Could it be because?... Yes, miniboxing can work with multiple storage types:

$ mb-scalac C.scala -Xprint:minibox-inject -P:minibox:two-way
[[syntax trees at end of            minibox-inject]] // C.scala
package <empty> {
  abstract trait C[@miniboxed T] extends Object {
    def t(): T;
    def t_D(T_TypeTag: Byte): T @storage[Double];
    def t_J(T_TypeTag: Byte): T @storage[Long]
  };
  class C_D[Tsp] extends Object with C[Tsp] {
    private[this] val C_D|T_TypeTag: Byte = _;
    private[this] val t: Tsp @storage[Double] = _;
    def t(): Tsp = C_D.this.t_D(C_D.this.C_D|T_TypeTag);
    def t_D(T_TypeTag: Byte): Tsp @storage[Double] = C_D.this.t;
    def t_J(T_TypeTag: Byte): Tsp @storage[Long] = C_D.this.t_D(T_TypeTag)
  };
  class C_J[Tsp] extends Object with C[Tsp] {
    private[this] val C_J|T_TypeTag: Byte = _;
    private[this] val t: Tsp @storage[Long] = _;
    def t(): Tsp = C_J.this.t_J(C_J.this.C_J|T_TypeTag);
    def t_D(T_TypeTag: Byte): Tsp @storage[Double] = C_J.this.t_J(T_TypeTag);
    def t_J(T_TypeTag: Byte): Tsp @storage[Long] = C_J.this.t
  };
  class C_L[Tsp] extends Object with C[Tsp] {
    private[this] val t: Tsp = _;
    def t(): Tsp = C_L.this.t;
    def t_D(T_TypeTag: Byte): Tsp @storage[Double] = C_L.this.t();
    def t_J(T_TypeTag: Byte): Tsp @storage[Long] = C_L.this.t()
  }
}

The -P:minibox:two-way argument instructs the miniboxing plugin to use three encodings: generic, encoded in long and encoded in double. We won't go into the technical details of why this is important, but having double as a variant increases performance by at least 50% when using floating-point numbers.

The most interesting part in this translation is how the miniboxed class C_D handles converting from one representation to another in the t_J getter. In the injection phase, T, T @storage[Long] and T @storage[Double] are all compatible:

  class C_D[Tsp] extends Object with C[Tsp] {
    private[this] val C_D|T_TypeTag: Byte = _;
    private[this] val t: Tsp @storage[Double] = _;
    def t(): Tsp = C_D.this.t_D(C_D.this.C_D|T_TypeTag);
    def t_D(T_TypeTag: Byte): Tsp @storage[Double] = C_D.this.t;
    def t_J(T_TypeTag: Byte): Tsp @storage[Long] = C_D.this.t_D(T_TypeTag)
  };

In the minibox-coerce phase, there is a new coercion added, marker_minibox2minibox, which can convert between different storage types:

  class C_D[Tsp] extends Object with C[Tsp] {
    private[this] val C_D|T_TypeTag: Byte = _;
    private[this] val t: Tsp @storage[Double] = _;
    def t(): Tsp = marker_minibox2box[Tsp, Double](C_D.this.t_D(C_D.this.C_D|T_TypeTag));
    def t_D(T_TypeTag: Byte): Tsp @storage[Double] = C_D.this.t;
    def t_J(T_TypeTag: Byte): Tsp @storage[Long] = marker_minibox2minibox[Tsp, Double, Long](C_D.this.t_D(T_TypeTag))
  };

The marker_minibox2minibox contains all the necessary type information in its type arguments: marker_minibox2minibox[Tsp, Double, Long](...). The type first argument is the boxed type while the next two are the initial representation and the desired representation. This gives the next phase, minibox-commit all the information necessary for transforming the tree:

  class C_D[Tsp] extends Object with C[Tsp] {
    private[this] val C_D|T_TypeTag: Byte = _;
    private[this] val t: Double = _;
    def t(): Tsp = MiniboxConversionsDouble.minibox2box[Tsp](C_D.this.t_D(C_D.this.C_D|T_TypeTag), C_D.this.C_D|T_TypeTag);
    def t_D(T_TypeTag: Byte): Double = C_D.this.t;
    def t_J(T_TypeTag: Byte): Long = MiniboxConversions.unreachableConversion[Nothing]("Double", "Long")
  };

Surprisingly, the minibox-commit phase translates the marker_minibox2minibox to MiniboxConversions.unreachableConversion. This translation is correct as the set of primitive types that are encoded into Long and into Double do not overlap. This means there is no primitive type that could be encoded using both Long and Double at the same time, thus, there is no need for such a conversion. Still, the data representation mechanism is flexible enough to accommodate this use case too.

Conclusion

We have shown how the miniboxing transformation uses the data representation transformation mechanism to add one or two alternative representations for the boxed primitives. Miniboxing is a well-established Scala compiler plugin that we encourage everyone to try and play around with. The miniboxing website describes other examples and tests that we transformed using the miniboxing plugin.

The next chapter, macrobenchmarks will show how well the effort of transforming methods and classes classes pays off for the miniboxing plugin.

Next Steps

You can continue with the following resources: