Miniboxing should interoperate with specialization #108

Closed
VladUreche opened this Issue Jun 27, 2014 · 4 comments

@VladUreche
Member

VladUreche commented Jun 27, 2014

... at least for FunctionX and TupleX, although arguably TupleX specialization should be deprecated due to the duplicate field issue in specialization (SI-3585).

Suggested by @alexandru in the miniboxing mailing list: https://groups.google.com/d/msg/scala-miniboxing/dJDmxtPnwCg/cqh7Tm7xnhEJ

@VladUreche
Member

VladUreche commented Jun 27, 2014

The problem is that the function application requires boxing, even though the caller is miniboxed and FunctionX is specialized.

The first problem is that FunctionX is specialized but not miniboxed, and we have no reliable means of overriding the Scala standard library to replace specialized functions with miniboxed ones. Since FunctionX is not miniboxed, the plugin considers it generic and thus does not optimize calls to f.apply.

Currently, the miniboxing transformation occurs before specialization, so it would be specialization's responsibility to redirect the call from f.apply(...) to f.apply$mcXX$sp(...). But miniboxing transforms generics shallowly, so even though T and R are miniboxed, Function1[T, R] does not become Function1[Long, Long], but stays Function1[T, R].

Now, according to the definition of optimized stack initiators in the miniboxing tutorial, which also applies to specialization, a receiver of type Function1[T, R] is not suitable as an optimized stack initiator. And since specialization doesn't have access to the miniboxing metadata, it is not an optimized stack propagator either. So it ends up as an optimized stack inhibitor, thus requiring boxing.

Now, how to fix this?

A proposed solution is to create a specialized bridging method that miniboxing can call:

def function_spec_bridge_JJ[T, R](T_Tag: Byte, R_Tag: Byte, f: Function1[T, R], arg: Long): Long =
  // of course we would optimize this, but this is the idea:
  (T_Tag, R_Tag) match {
    case (INT, INT) => int2minibox(f.asInstanceOf[Function1[Int, Int]].apply(minibox2int(arg)))
    case ... =>
  }

Since Function1[Int, Int] is a valid optimized stack initiator from specialization's point of view, the call to apply will be rewritten to apply$mcII$sp and the boxing will be gone.
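
To make the idea concrete, here is a standalone sketch of the bridge. The tag value INT and the minibox2int/int2minibox conversions are hand-written stand-ins for the miniboxing runtime, not the plugin's actual definitions:

```scala
object BridgeSketch {
  final val INT: Byte = 5 // hypothetical tag constant

  // stand-ins for the miniboxing runtime conversions
  def minibox2int(l: Long): Int = l.toInt
  def int2minibox(i: Int): Long = i.toLong

  def function_spec_bridge_JJ[T, R](T_Tag: Byte, R_Tag: Byte,
                                    f: Function1[T, R], arg: Long): Long =
    (T_Tag, R_Tag) match {
      case (INT, INT) =>
        // the cast exposes a Function1[Int, Int] receiver, which specialization
        // can rewrite to apply$mcII$sp
        int2minibox(f.asInstanceOf[Function1[Int, Int]].apply(minibox2int(arg)))
      case _ =>
        sys.error("tag combination not covered in this sketch")
    }

  def main(args: Array[String]): Unit = {
    val double: Int => Int = _ * 2
    println(function_spec_bridge_JJ(INT, INT, double, 21L)) // prints 42
  }
}
```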

But what about that tuple of tags? Isn't that expensive?

Yes, it is. An optimized implementation is:

def function_spec_bridge_JJ[T, R](T_Tag: Byte, R_Tag: Byte, f: Function1[T, R], arg: Long): Long =
  T_Tag match {
    case INT => function_spec_bridge_IJ[T, R](R_Tag, f, minibox2int(arg))
    ...
  }
def function_spec_bridge_IJ[T, R](R_Tag: Byte, f: Function1[T, R], arg: Int): Long =
  R_Tag match {
    case INT => int2minibox(f.asInstanceOf[Function1[Int, Int]].apply(arg))
    ...
  }

The above two-layered structure is amenable to inlining in the caller, potentially optimizing the code at call sites where the tags are statically known. On the other hand, it performs two dispatch steps, which can significantly slow down execution.
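
Completed into a runnable form, with the same hypothetical tag constant and hand-written conversions standing in for the miniboxing runtime, the two-layered dispatch looks like this:

```scala
object TwoLayerSketch {
  final val INT: Byte = 5 // hypothetical tag constant

  def minibox2int(l: Long): Int = l.toInt
  def int2minibox(i: Int): Long = i.toLong

  // first layer: dispatch on the argument tag, unboxing the argument
  def function_spec_bridge_JJ[T, R](T_Tag: Byte, R_Tag: Byte,
                                    f: Function1[T, R], arg: Long): Long =
    T_Tag match {
      case INT => function_spec_bridge_IJ[T, R](R_Tag, f, minibox2int(arg))
      case _   => sys.error("tag not covered in this sketch")
    }

  // second layer: dispatch on the result tag, reboxing the result
  def function_spec_bridge_IJ[T, R](R_Tag: Byte, f: Function1[T, R], arg: Int): Long =
    R_Tag match {
      case INT => int2minibox(f.asInstanceOf[Function1[Int, Int]].apply(arg))
      case _   => sys.error("tag not covered in this sketch")
    }
}
```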

An alternative implementation would be:

def function_spec_bridge_JJ[T, R](T_Tag: Byte, R_Tag: Byte, f: Function1[T, R], arg: Long): Long =
  (T_Tag + 10 * R_Tag) match {
    case INT + 10 * INT => int2minibox(f.asInstanceOf[Function1[Int, Int]].apply(minibox2int(arg)))
    ...
  }
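
Note that `case INT + 10 * INT` is pseudocode, not legal Scala pattern syntax; a real implementation would match on a precomputed stable constant. A sketch under the same hypothetical tag values:

```scala
object CombinedTagSketch {
  final val INT: Byte = 5                 // hypothetical tag constant
  final val INT_INT: Int = INT + 10 * INT // precomputed combined tag

  def minibox2int(l: Long): Int = l.toInt
  def int2minibox(i: Int): Long = i.toLong

  def function_spec_bridge_JJ[T, R](T_Tag: Byte, R_Tag: Byte,
                                    f: Function1[T, R], arg: Long): Long =
    // a single switch over the combined tag
    (T_Tag + 10 * R_Tag) match {
      case INT_INT =>
        int2minibox(f.asInstanceOf[Function1[Int, Int]].apply(minibox2int(arg)))
      case _ =>
        sys.error("tag combination not covered in this sketch")
    }
}
```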

The second solution encourages the HotSpot VM to compile function_spec_bridge_JJ without inlining it into the caller. Since the call is not inlined and f is always megamorphic, this costs two virtual calls but only a single switch, and it does not put pressure on the register allocator.

To sum up, it is not clear which solution is better, or whether either helps at all. Benchmarks should decide between the two implementations. Over the next few days I will implement both solutions and add a compiler flag to select one or the other, so we can benchmark.


@VladUreche
Member

VladUreche commented Jun 27, 2014

Of course, the best solution would be to add miniboxing stubs to specialization, so that miniboxing could call optimally into specialized code. This would provide wonderful performance, but it would require binary-incompatible changes to the Scala library...


@alexandru

alexandru commented Jun 30, 2014

+1 eagerly waiting to see the results of those benchmarks


@VladUreche
Member

VladUreche commented Jul 3, 2014

@alexandru, sorry for the radio silence these last few days -- I had a few distractions and I've been thinking about the best solution for this problem -- I think I have a solution that will be significantly better -- only introducing overheads when creating a function, not when calling it. I'm looking into this now.


VladUreche added a commit that referenced this issue Jul 4, 2014

VladUreche added a commit that referenced this issue Jul 9, 2014
