[compiler] Code-Generated TableIntervalJoin(..., product = true) #13617

Merged: 66 commits into hail-is:main from ehigham/interval-join, Jan 4, 2024

Member

@ehigham ehigham commented Sep 13, 2023

Add staged code generation capabilities by lowering to and emitting
StreamLeftIntervalJoin.
The EmitStream rule for this node uses a code-generated min heap* of
intervals that contain the current key in the left stream, ordered by
right endpoint.
The row type of the join result is the left row with an array of values**
from the right table, ordered by their associated interval's right endpoint.
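
The heap-based join described above can be sketched in ordinary Scala. This is only an illustration of the algorithm, not the code-generated implementation in this PR: the Interval case class, the closed integer endpoints, the string payload, and the leftIntervalJoin name are all assumptions made for the example, and it uses the standard library's PriorityQueue rather than a generated heap.

```scala
import scala.collection.mutable

// Hypothetical stand-in for a right-table row: a closed interval plus a payload.
final case class Interval(start: Int, end: Int, value: String)

object IntervalJoinSketch {
  // Assumes `left` is sorted by key and `right` is sorted by interval start.
  def leftIntervalJoin(left: Seq[Int], right: Seq[Interval]): Seq[(Int, Seq[String])] = {
    // PriorityQueue is a max heap, so reverse the ordering to get a min heap on the right endpoint.
    val active = mutable.PriorityQueue.empty[Interval](Ordering.by[Interval, Int](_.end).reverse)
    val pending = right.iterator.buffered

    left.map { key =>
      // Admit every interval that starts at or before the current key.
      while (pending.hasNext && pending.head.start <= key) active.enqueue(pending.next())
      // Evict intervals that end before the current key; they cannot contain any later key.
      while (active.nonEmpty && active.head.end < key) active.dequeue()
      // Everything left contains the key. Heap iteration order is unspecified,
      // so sort a copy by right endpoint before emitting the values.
      key -> active.toList.sortBy(_.end).map(_.value)
    }
  }
}
```

Each left key is paired with the values of every interval that contains it, which mirrors the "left row plus array of right values" result shape. A left key with no containing interval still appears, with an empty array.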

Note that this implementation is by no means optimised. There are a number of
opportunities that I'd like to consider in subsequent changes, including:

  • Laziness in restoring the heap property on push/pop from heap.
  • Not deep-copying elements of the right stream when no explicit memory
    management per element is required.

* This min heap is code generated instead of using an off-the-shelf
implementation as:

  • we don't yet have a mapping from SType to a Java class or interface to
    parameterise the heap with
  • it is not obvious how to handle region memory management in an existing
    solution

** The value is a row with the key fields omitted.

Additional Changes:

  • Make the receiver (this) an explicit argument to CodeBuilder.invoke()
    • Gives us control over which object we're dispatching on.
    • Useful for generating more self-contained classes.
    • Previously, all methods were assumed to be defined in the same class.
  • Removed referenceGenomes from ExecuteContext
    • Delegate to Backend; in practice these come from the backend anyway.
    • These were being populated from the backend object
    • Backend is mutable, meaning we can add/remove fake genomes for testing
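
To illustrate the first change in the list above, here is a toy model of what an explicit receiver means for argument checking: an instance method expects its owning object as the first argument, while a static method does not. MethodSketch, InvokeSketch, and the use of plain strings for types are hypothetical simplifications for this example and are not Hail's actual classes.

```scala
// Hypothetical, simplified model of a generated method (not Hail's real API).
final case class MethodSketch(
  name: String,
  owner: String,
  isStatic: Boolean,
  paramTypes: Seq[String]
)

object InvokeSketch {
  // Instance methods take the receiver as an explicit first argument.
  def expectedArgs(callee: MethodSketch): Seq[String] =
    if (callee.isStatic) callee.paramTypes
    else callee.owner +: callee.paramTypes

  // Reject a call whose argument count does not match, receiver included.
  def checkArity(callee: MethodSketch, args: Seq[String]): Unit = {
    val expected = expectedArgs(callee)
    if (expected.size != args.size)
      throw new RuntimeException(
        s"invoke ${callee.name}: wrong number of parameters: " +
          s"expected ${expected.size}, found ${args.size}"
      )
  }
}
```

Because the caller names the receiver, the same check works whether the callee lives in the calling class or in a separately generated one.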

@ehigham ehigham force-pushed the ehigham/interval-join branch from 2d18dd9 to f3bca68 September 13, 2023 21:13
@ehigham ehigham force-pushed the ehigham/interval-join branch from 4d5a538 to d13efa8 October 12, 2023 18:52
@ehigham ehigham marked this pull request as ready for review December 4, 2023 20:00
@ehigham ehigham changed the title from "[compiler] Code-Generated StreamLeftIntervalJoin" to "[compiler] Code-Generated TableIntervalJoin(..., product = true)" Dec 4, 2023
Collaborator

@patrick-schultz patrick-schultz left a comment

Really great work! Thanks for pushing on the infrastructure for generating multiple classes, I think that's super valuable. I have some comments, but they're mostly small things.

Comment on lines +193 to +202
  val expectedArgs =
    if (callee.mb.isStatic) callee.emitParamTypes
    else CodeParamType(callee.ecb.cb.ti) +: callee.emitParamTypes

  val args = _args.toArray

  if (expectedArgs.size != args.length)
-   throw new RuntimeException(s"invoke ${ callee.mb.methodName }: wrong number of parameters: " +
-     s"expected ${ expectedArgs.size }, found ${ args.length }")
- val codeArgs = args.indices.flatMap { i =>
-   val arg = args(i)
-   val pt = expectedArgs(i)
+   throw new RuntimeException(s"invoke ${callee.mb.methodName}: wrong number of parameters: " +
+     s"expected ${expectedArgs.size}, found ${args.length}"
+   )
Collaborator

Should do this check in CodeBuilder.invoke[Code] too.

Member Author

All the type-checking was done in EmitCodeBuilder. Perhaps some generic version should be in CodeBuilderLike.
I don't think this is essential for this change. We can revisit this later for a proper tidy-up (which perhaps would be nice to do at some point).

Member Author

@ehigham ehigham Dec 21, 2023

OK, I tried to add an assert. This caused way too many failures. In order to get this to work properly I think we need to track more information about class hierarchy in the TypeInfo object.

This code fails:

val F = FunctionBuilder[Int]("F")
...
val a = cb.newLocal("a", Code.newInstance(F.cb, F.cb.ctor, FastSeq()))
cb.invoke(F.mb, a) // typecheck failure

a has type LocalRef[AsmFunction0[Int]] because the compiler infers it from the FunctionBuilder (type FunctionBuilder[AsmFunction0[Int]]). The TypeInfo on F, however, is L_C1F;, which is not the same as the TypeInfo on a: Lis/hail/asm4s/AsmFunction0;

Worthwhile doing, but I think it may be a larger change to get this to work properly.
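
The mismatch described above can be illustrated outside Hail with two ordinary JVM types. In this sketch, java.util.ArrayList stands in for the generated class and java.util.List for the interface it implements; DescriptorSketch and its methods are hypothetical names for this example. It shows that comparing descriptor strings alone rejects the pair, whereas a check that knows the class hierarchy can see they are related. Generated classes are not necessarily loadable while code is being emitted, so reflection is used here purely for illustration.

```scala
object DescriptorSketch {
  // JVM descriptor for a reference type, e.g. Ljava/util/List;
  def descriptor(c: Class[_]): String =
    "L" + c.getName.replace('.', '/') + ";"

  // Strict check: the two descriptors must be identical strings.
  def sameDescriptor(a: Class[_], b: Class[_]): Boolean =
    descriptor(a) == descriptor(b)

  // Hierarchy-aware check: one type is a subtype of the other.
  def related(a: Class[_], b: Class[_]): Boolean =
    a.isAssignableFrom(b) || b.isAssignableFrom(a)
}
```

A descriptor string carries no supertype information, which is why a stricter typecheck in invoke would need that information recorded elsewhere.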

Member Author

ehigham commented Dec 21, 2023

Thank you for your review, Patrick - it has significantly improved this PR.

@ehigham ehigham dismissed patrick-schultz’s stale review January 4, 2024 16:24

addressed feedback

Collaborator

@patrick-schultz patrick-schultz left a comment

nice work!

@danking danking merged commit fe5ed32 into hail-is:main Jan 4, 2024
@ehigham ehigham deleted the ehigham/interval-join branch January 4, 2024 19:39