[query] Fix encoder correctness bug; I2B is a byte#11328
Conversation
This is one of the sneakiest bugs I've ever worked on. I learned early
that the issue was somehow related to method splitting, but it was
far more devious than I expected, and the code in LIR method splitting
is completely correct.
I2B is an instruction that truncates an integer to a byte, and it is used
in various places in code generation but primarily encoding missing bits
in arrays and structs. The particular case I was debugging had expected
behavior when trying to write the missing bits of four fields A,B,C,D,
where A and B were missing and C and D were present. These should have
written the byte
1<<0 | 1<<1 | 0<<2 | 0<<3
==> 00000011
==> 3
But instead wrote the byte `1`, incorrectly leading readers
to try to read field B when it was missing (and not written). This is
due to the load-bearing and incorrect type of an I2B instruction
generated [here](https://github.com/hail-is/hail/blob/8bd9b7b2224b77372a72f02f2b13806267892a35/hail/src/main/scala/is/hail/types/encoded/EBaseStruct.scala#L107).
I2B loads a byte to the stack, not a boolean. TypeInfos are mostly
non-structural since they rarely influence the bytecode generated.
Here is an exception, and that's where the method splitter comes in.
Method splitting exists in the Hail compiler because not only does
the JVM have limits on how large methods can be, but also the JIT
compiler handles small methods much more effectively than small methods
(and so splitting a large method into two small ones can make an order
of magnitude or more in performance difference). We have three forms
of method splitting in the Hail Query compiler. The first is a heuristic
and greedy IR-level method splitter that generates new methods every X IR nodes,
simply based on node count. However, the size of code generated by each IR
can vary widely (`I32` vs `LowerBoundOnOrderedCollection` for instance), and
so we have two other kinds of splitting that operate on the LIR level.
The first is region splitting, which is used to split large blocks of LIR.
In order to insert a split, any variables on the stack are stored in local
variables before the split and loaded from those locals after the split.
The second is method splitting, which is used to split large single methods.
A single-exit group of blocks can be split into a separate method, and we have
some machinery for replacing control flow instructions (which I will not go into
here, for they are not relevant now), as well as handling local variables that
are used across a method split. These shared Local variables are replaced by fields on a "spills"
class which is allocated any time a split method is called. Spilled local `store`s
are rewritten as field `store`s, and `load`s are rewritten as field `load`s.
What was the problem here?
A region split was inserted *directly between the `I2B` instruction and the call to
`OutputBuffer.write`. This meant that the result of `I2B` was stored in a local variable
and read in the subsequent block. **The incorrect TypeInfo of Boolean was used for
that local variable**, but this seems not to pose a problem -- both Boolean and Byte
use a single slot, and so the code still works even with the wrong variable type.
However, the method splitter then **generated a method split at the same point where
the region was split**. This means that the local variable resulting from I2B is spilled
to a class field on the spills class. Our incorrectly-Boolean local becomes an
incorrectly-Boolean **field**, and this is where things go wrong -- it seems as though
Boolean class fields (appropriately) cannot only store and load a single bit. Our value of
`3` was stored as a class Boolean, and came out `1`.
```code
# I2B is stored as a class field on spills-
31017 (PutFieldX PUTFIELD __C2316__m1984ENCODE_SInsertFieldsStruct_TO_EBaseStruct___iruid_8616Spills.__f2355__l2315split_large_block B
31018 (LoadX arg:1 L__C2316__m1984ENCODE_SInsertFieldsStruct_TO_EBaseStruct___iruid_8616Spills;)
31019 25774 (InsnX I2B
31020 25775 (InsnX IOR
31021 25776 (InsnX IOR
31022 25777 (InsnX IOR
31023 25778 (InsnX IOR
31024 25779 (LdcX 0 I)
31025 25780 (InsnX ISHL
31026 (GetFieldX GETFIELD __C2316__m1984ENCODE_SInsertFieldsStruct_TO_EBaseStruct___iruid_8616Spills.__f2346null Z
31027 (LoadX arg:1 L__C2316__m1984ENCODE_SInsertFieldsStruct_TO_EBaseStruct___iruid_8616Spills;))
31028 25782 (LdcX 0 I)))
31029 25783 (InsnX ISHL
31030 (GetFieldX GETFIELD __C2316__m1984ENCODE_SInsertFieldsStruct_TO_EBaseStruct___iruid_8616Spills.__f2348null Z
31031 (LoadX arg:1 L__C2316__m1984ENCODE_SInsertFieldsStruct_TO_EBaseStruct___iruid_8616Spills;))
31032 25785 (LdcX 1 I)))
31033 25786 (InsnX ISHL
31034 (GetFieldX GETFIELD __C2316__m1984ENCODE_SInsertFieldsStruct_TO_EBaseStruct___iruid_8616Spills.__f2350null Z
31035 (LoadX arg:1 L__C2316__m1984ENCODE_SInsertFieldsStruct_TO_EBaseStruct___iruid_8616Spills;))
31036 25788 (LdcX 2 I)))
31037 25789 (InsnX ISHL
31038 (GetFieldX GETFIELD __C2316__m1984ENCODE_SInsertFieldsStruct_TO_EBaseStruct___iruid_8616Spills.__f2352null Z
31039 (LoadX arg:1 L__C2316__m1984ENCODE_SInsertFieldsStruct_TO_EBaseStruct___iruid_8616Spills;))
31040 25791 (LdcX 3 I)))))
31041 (ReturnX)
# the resulting field is loaded and written to the output buffer
11325 (MethodStmtX INVOKEVIRTUAL __C1527collect_distributed_array.__m1984ENCODE_SInsertFieldsStruct_TO_EBaseStruct___iruid_8616_region0_0 (L__C2316__m1984ENCODE_SInsertFieldsStruct_TO_EBaseStruct___iruid_8616Spills;)V
11326 (LoadX arg:0 L__C1527collect_distributed_array;)
11327 (LoadX t489ae494/spills L__C2316__m1984ENCODE_SInsertFieldsStruct_TO_EBaseStruct___iruid_8616Spills;))
11328 25772 (MethodStmtX INVOKEINTERFACE is/hail/io/OutputBuffer.writeByte (B)Vinterface
11329 (GetFieldX GETFIELD __C2316__m1984ENCODE_SInsertFieldsStruct_TO_EBaseStruct___iruid_8616Spills.__f2354null Lis/hail/io/OutputBuffer;
11330 (LoadX t489ae494/spills L__C2316__m1984ENCODE_SInsertFieldsStruct_TO_EBaseStruct___iruid_8616Spills;))
11331 (GetFieldX GETFIELD __C2316__m1984ENCODE_SInsertFieldsStruct_TO_EBaseStruct___iruid_8616Spills.__f2355__l2315split_large_block B
11332 (LoadX t489ae494/spills L__C2316__m1984ENCODE_SInsertFieldsStruct_TO_EBaseStruct___iruid_8616Spills;)))
```
Testing
This is super hard to reproduce using small/public examples, and any unit tests are pretty much meaningless. John suggested we programmatically check the TypeInfo inference against some JVM reference, and I agree that's the best bet, but don't want to block this critical fix on that project. I fixed BALOAD for the same reason, but it doesn't appear that has caused trouble yet.
| @@ -874,8 +874,8 @@ class InsnX(val op: Int, _ti: TypeInfo[_], var lineNumber: Int = 0) extends Valu | |||
| case F2D => DoubleInfo | |||
| case DALOAD => DoubleInfo | |||
| // Boolean | |||
There was a problem hiding this comment.
This comment is wrong for i2b.
baload is a cursed instruction. I think it's actually impossible for us to give the right TypeInfo because it depends on the type of the array argument. It looks like we could have some special case logic in insn (where we know the arguments) where we could determine the actually correct type info.
There was a problem hiding this comment.
Ah I see we can use children.head.
|
Re testing, I think the primary error here isn't that the TypeInfo of i2b disagreed with the jvm, it's that |
|
Hmm, that's true Patrick. Bit of a pain since we'll need to use the type tag to look up an implicit type info, but certainly possible. Dan, I've backed out the BALOAD change and we can remove that from the inferred instruction list entirely in favor of explicit type info where it's generated from array[byte] and array[boolean] lookups. |
The disappearing bit
This is one of the sneakiest bugs I've ever worked on. I learned early that the issue was somehow related to method splitting, but it was far more devious than I expected, and the code in LIR method splitting is completely correct.
The particular case I was debugging had expected behavior when trying to write the missing bits of four fields A,B,C,D, where A and B were missing and C and D were present. These should have written the byte
But instead wrote the byte
b00000001 or 1, incorrectly leading readers to try to read field B when it was missing (and not written). This is due to the load-bearing and incorrect type of an I2B instruction generated here. I2B is an instruction that truncates an integer to a byte, and it is used in various places in code generation but primarily encoding missing bits in arrays and structs.I2B loads a byte to the stack, not a boolean. TypeInfos are mostly non-structural since they rarely influence the bytecode generated. Here is an exception, and that's where the method splitter comes in.
Method splitting exists in the Hail compiler because not only does the JVM have limits on how large methods can be, but also the JIT compiler handles small methods much more effectively than large methods (and so splitting a large method into two small ones can make an order of magnitude or more in performance difference). We have three forms of method splitting in the Hail Query compiler. The first is a heuristic and greedy IR-level method splitter that generates new methods every X IR nodes, simply based on node count. However, the size of code generated by each IR can vary widely (
I32vsLowerBoundOnOrderedCollectionfor instance), and so we have two other kinds of splitting that operate on the LIR level.The first is region splitting, which is used to split large blocks of LIR. In order to insert a split, any variables on the stack are stored in local variables before the split and loaded from those locals after the split.
The second is method splitting, which is used to split large single methods. A single-exit group of blocks can be split into a separate method, and we have some machinery for replacing control flow instructions (which I will not go into here, for they are not relevant now), as well as handling local variables that are used across a method split. These shared Local variables are replaced by fields on a "spills" class which is allocated any time a split method is called. Spilled local
stores are rewritten as fieldstores, andloads are rewritten as fieldloads.What was the problem here?
A region split was inserted directly between the
I2Binstruction and the call toOutputBuffer.write. This meant that the result ofI2Bwas stored in a local variable and read in the subsequent block. The incorrect TypeInfo of Boolean was used for that local variable, but this seems not to pose a problem -- both Boolean and Byte use a single slot, and so the code still works even with the wrong variable type.However, the method splitter then generated a method split at the same point where the region was split. This means that the local variable resulting from I2B is spilled to a class field on the spills class. Our incorrectly-Boolean local becomes an incorrectly-Boolean field, and this is where things go wrong -- it seems as though Boolean class fields (appropriately) truncate on store and load a single bit. Our value of
3was stored as a class Boolean, and came out1. The fact that a single field's missingness was flipped was a red herring -- all higher bits are flipped to 0 (defined)! Here's a look at the LIR looks like, though it was ultimately the JVM class file printout that tipped me off to the problem:Pervasiveness
There is absolutely nothing about this bug that is whole-stage-codegen-specific, but I suspect the much larger single IRs compiled in whole stage code generation made it exponentially more likely for this corner case to occur. I imagine it would be possible to construct a failing pipeline with whole stage code generation turned off.
Testing
This is super hard to reproduce using small/public examples, and any unit tests to capture this crazy edge case are pretty much meaningless. John suggested we programmatically check the TypeInfo inference against some JVM reference, and I agree that's the best bet, but don't want to block this critical fix on that project. I fixed BALOAD for the same reason, but it doesn't appear that has caused trouble yet.