[query] faster array decoder #13787

Merged
danking merged 8 commits into hail-is:main from patrick-schultz:array-decoder on Oct 17, 2023
@patrick-schultz
Member

@patrick-schultz patrick-schultz commented Oct 10, 2023

Picking up where #13776 left off.

CHANGELOG: improved speed of reading hail format datasets from disk

This PR speeds up decoding arrays in two main ways:

  • instead of calling arrayType.isElementDefined(array, i) on every single array element, which expands to
    val b = aoff + lengthHeaderBytes + (i >> 3)
    !((Memory.loadByte(b) & (1 << (i & 7).toInt)) != 0)
    process elements in groups of 64, and load the corresponding long of missing bits once
  • once we have a whole long of missing bits, we can be smarter than branching on each bit. After flipping to get presentBits, we use the following pseudocode to extract the positions of the set bits, with time proportional to the number of set bits:
    while (presentBits != 0) {
      val idx = java.lang.Long.numberOfTrailingZeros(presentBits)
      // do something with idx
      presentBits = presentBits & (presentBits - 1) // unsets the rightmost set bit
    }
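
A minimal sketch of the two ideas above in plain Scala, outside Hail's code generator. The names `isDefinedPerElement` and `presentIndices` are hypothetical, and the missing bits are modeled as an `Array[Byte]` and a single `Long` rather than off-heap memory:

```scala
// Per-element check: one byte load and one mask per element.
// A set bit at position i means element i is missing.
def isDefinedPerElement(missingBytes: Array[Byte], i: Int): Boolean =
  (missingBytes(i >> 3) & (1 << (i & 7))) == 0

// Blocked check: given the 64 missing bits of one block as a Long,
// return the in-block positions of the present elements.
// The loop body runs once per present element, not once per bit.
def presentIndices(missingWord: Long): List[Int] = {
  var presentBits = ~missingWord
  val out = List.newBuilder[Int]
  while (presentBits != 0L) {
    out += java.lang.Long.numberOfTrailingZeros(presentBits)
    presentBits &= presentBits - 1 // unsets the rightmost set bit
  }
  out.result()
}
```

For a block where everything is missing, the loop body never runs; for a fully present block it runs 64 times.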
    

To avoid needing to handle the last block of 64 elements differently, this PR changes the layout of PCanonicalArray to ensure the missing bits are always padded out to a multiple of 64 bits. They were already padded to a multiple of 32, and I don't expect this change to have much of an effect. But if needed, blocking by 32 elements instead had very similar performance in my benchmarks.
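
To make the size effect of the padding change concrete, here is a rough sketch of the arithmetic. The helper names are hypothetical and this is not Hail's actual layout code; it only models the missing-bit region, ignoring the length header and element alignment:

```scala
// Bytes occupied by the missing bits of an array of `len` elements
// when the bitmap is padded to a multiple of `padBits` bits
// (padBits is assumed to be a positive multiple of 8).
def missingBytesPadded(len: Int, padBits: Int): Int =
  ((len + padBits - 1) / padBits) * (padBits / 8)

// Number of 64-element blocks a blocked decoder would visit.
def nBlocks64(len: Int): Int = (len + 63) >>> 6
```

Under this model, moving from 32-bit to 64-bit padding adds either 0 or 4 bytes per array.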

I also experimented with unrolling loops. In the non-missing case, this is easy. In the missing case, I tried using if (presentBits.bitCount >= 8) to guard an unrolled inner loop. In both cases, unrolling was, if anything, slower.
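
For readers wondering what a guarded unrolled loop might look like, here is a rough plain-Scala sketch. It is an assumption about the general shape, not the code that was benchmarked; `forEachSetBitUnrolled` and `step` are hypothetical names, and the guard is written as a `while` rather than an `if`:

```scala
def forEachSetBitUnrolled(bits0: Long)(f: Int => Unit): Unit = {
  var bits = bits0
  // Report and clear the rightmost set bit; only valid while bits != 0.
  def step(): Unit = {
    f(java.lang.Long.numberOfTrailingZeros(bits))
    bits &= bits - 1
  }
  // Guarded unrolled section: at least 8 set bits remain,
  // so 8 extractions can run without a zero check in between.
  while (java.lang.Long.bitCount(bits) >= 8) {
    step(); step(); step(); step()
    step(); step(); step(); step()
  }
  // Remainder: fewer than 8 set bits are left.
  while (bits != 0L) step()
}
```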

Dan observed a benefit from unrolling, but that was combined with the first optimization above (not loading a bit from memory every element), which I believe was the real source of improvement.

- val len = cb.newLocal[Int]("len", in.readInt())
- val array = cb.newLocal[Long]("array", arrayType.allocate(region, len))
+ val len = cb.memoize(in.readInt())
+ val array = cb.memoize(arrayType.allocate(region, len))
Contributor

preserve the name hints

cb.ifx((len % 64).cne(0), {
// ensure that the last missing block has all missing bits set past the last element
// arrayType.printDebug(cb, array)
val lastMissingBlockAddr = cb.memoize(arrayType.firstElementOffset(array) - 8)
Contributor

If the array is short, the last missing block might be smaller than 8 bytes, right?

Member Author

The idea I'm trying here is to ensure that missing bits are always padded with 1s to a multiple of 64 bits. So we can treat the last block the same as any other, as long as we only try to access present elements.

Contributor

With the alignment change below, I think we're safe to read the whole long.

Member Author

exactly
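
The padding idea discussed in this thread can be sketched in plain Scala as follows. This is an illustration only, with a hypothetical helper name, operating on a `Long` value instead of the off-heap missing block the generated code writes to:

```scala
// Set every bit at position >= (len % 64) in the last block's missing word,
// so padding slots always look "missing" and are never visited.
def padLastMissingWord(lastWord: Long, len: Int): Long = {
  val used = len & 63 // number of real elements in a partial last block
  if (used == 0) lastWord // last block is full: nothing to pad
  else lastWord | (-1L << used)
}
```

After this step the last block can go through the same present-bit loop as every other block, because flipping the word yields no set bits past the last real element.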

blockOff < pastLastOff,
cb.assign(blockOff, blockOff + blockOffIncrement),
{
cb.assign(presentBits, ~Region.loadLong(mbyteOffset))
Contributor

Reverse the bytes, right?

Member Author

No, I was wrong about that. sun.misc.Unsafe uses the system's native endianness, so little endian, which means no reverse is needed.
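
A small sketch of why a little-endian load needs no byte reversal, using `java.nio.ByteBuffer` on a heap array as a stand-in for the off-heap load. The helper names are hypothetical:

```scala
import java.nio.{ByteBuffer, ByteOrder}

// Bit i as the per-element code sees it: byte (i >> 3), bit (i & 7).
def missingBitFromBytes(bytes: Array[Byte], i: Int): Boolean =
  (bytes(i >> 3) & (1 << (i & 7))) != 0

// Bit i as the blocked code sees it: bit i of one 64-bit word.
def missingBitFromWord(word: Long, i: Int): Boolean =
  (word & (1L << i)) != 0

// Read 8 bytes as a Long in the given byte order.
def readLong(bytes: Array[Byte], order: ByteOrder): Long =
  ByteBuffer.wrap(bytes).order(order).getLong(0)
```

With `ByteOrder.LITTLE_ENDIAN`, byte 0 lands in the lowest 8 bits of the word, so the two views agree for every `i` in 0 until 64. A big-endian read would need `java.lang.Long.reverseBytes` to get the same word.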

@danking
Contributor

danking commented Oct 10, 2023

Thanks for picking up!

@patrick-schultz
Member Author

Should be ready to go now. I'll add a better description in the morning. I also haven't benchmarked anything yet; it would be good to do that and maybe tune things like the amount of unrolling. @danking Could you share how you created the dataset you were using for benchmarking?

@danking danking changed the title [wip] faster array decoder [query] faster array decoder Oct 10, 2023
@danking danking self-assigned this Oct 10, 2023
PCanonicalArray(copiedElement, required)
}

def forEachDefined(cb: EmitCodeBuilder, aoff: Value[Long])(f: (EmitCodeBuilder, Value[Int], SValue) => Unit) {
Contributor

Do you plan a follow-up PR to make this use the same logic as above? This method is used in several places. We might be able to get even more improvement outside of decoding!

Member Author

Absolutely! That was one reason I thought it worth changing the PCanonicalArray layout to not have any partial last block of missingness bits - so we can easily use blocked missingness logic elsewhere too.
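
As a rough illustration of what blocked missingness logic could look like for a whole array, here is a plain-Scala sketch. It is not Hail's `forEachDefined`; the name `forEachDefinedBlocked` and the `Array[Long]` and `Array[A]` representation are assumptions, and it relies on the padding bits past the last element being set to 1:

```scala
// missingWords(b) holds the missing bits for elements 64*b .. 64*b + 63.
// Assumes padding bits past the last element are set (treated as missing).
def forEachDefinedBlocked[A](missingWords: Array[Long], values: Array[A])(
  f: (Int, A) => Unit
): Unit = {
  var block = 0
  while (block < missingWords.length) {
    var presentBits = ~missingWords(block)
    while (presentBits != 0L) {
      val idx = (block << 6) + java.lang.Long.numberOfTrailingZeros(presentBits)
      f(idx, values(idx))
      presentBits &= presentBits - 1 // unsets the rightmost set bit
    }
    block += 1
  }
}
```

Because no block is partial, the outer loop has no special case for the end of the array.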

@danking
Contributor

danking commented Oct 10, 2023

Also: need a CHANGELOG and can you write a little note for our future selves in the PR description?

@danking
Contributor

danking commented Oct 10, 2023

FWIW, I built this branch and ran it twice as described in #13776 and got 35s, 36s.

@danking
Contributor

danking commented Oct 10, 2023

Seems to average 60MB/s. No clear culprits. Zstd decoding is the top hit right now.

The hottest generated code is in-place decoding of an optional array of optional int32. Really sucks because things like LA are somehow getting written as element optional, even though, by construction, their elements are not optional.

+EBaseStruct{
  `the entries! [877f12a8827e18f61222c6c8c5fb04a8]`:
    +EArray[EBaseStruct{
       LA:EArray[EInt32]
      ,LGT:EInt32
      ,LAD:EArray[EInt32]
      ,LPGT:EInt32
      ,LPL:EArray[EInt32]
      ,RGQ:EInt32,
      gvcf_info: EBaseStruct{
         AC:EArray[EInt32]
        ,AF:EArray[EFloat64]
        ,AN:EInt32,AS_BaseQRankSum:EArray[EFloat64]
        ,AS_FS:EArray[EFloat64]
        ,AS_InbreedingCoeff:EArray[EFloat64]
        ,AS_MQ:EArray[EFloat64]
        ,AS_MQRankSum:EArray[EFloat64]
        ,AS_QD:EArray[EFloat64]
        ,AS_QUALapprox:EArray[EInt32]
        ,AS_RAW_BaseQRankSum:EBinary,AS_RAW_MQ:EArray[EFloat64]
        ,AS_RAW_MQRankSum:EArray[EBaseStruct{`0`:EFloat64,`1`:EInt32}]
        ,AS_RAW_ReadPosRankSum:EArray[EBaseStruct{`0`:EFloat64,`1`:EInt32}]
        ,AS_ReadPosRankSum:EArray[EFloat64]
        ,AS_SB_TABLE:EArray[EArray[EInt32]]
        ,AS_SOR:EArray[EFloat64]
        ,AS_VarDP:EArray[EInt32]
        ,BaseQRankSum:EFloat64,ExcessHet:EFloat64,FS:EFloat64,InbreedingCoeff:EFloat64,MQ:EFloat64,MQRankSum:EFloat64,MQ_DP:EInt32,QD:EFloat64,QUALapprox:EInt32,RAW_GT_COUNT:EArray[EInt32]
        ,RAW_MQandDP:EArray[EInt32]
        ,ReadPosRankSum:EFloat64,SOR:EFloat64,VarDP:EInt32}
      ,DP:EInt32
      ,GQ:EInt32
      ,MIN_DP:EInt32
      ,PID:EBinary
      ,PS:EInt32
      ,SB:EArray[EInt32]
    }
  ]
}

Async profiler periodic sampling: [screenshots, 2023-10-10 18:06:38 and 18:07:10]
Sync profiler (note safe point bias): [screenshot, 2023-10-10 18:32:14]

Contributor

@danking danking left a comment

awesome!

@danking danking merged commit ecb7d86 into hail-is:main Oct 17, 2023
@patrick-schultz patrick-schultz deleted the array-decoder branch January 2, 2025 13:45