Conversation
| val len = cb.newLocal[Int]("len", in.readInt())
| val array = cb.newLocal[Long]("array", arrayType.allocate(region, len))
| val len = cb.memoize(in.readInt())
| val array = cb.memoize(arrayType.allocate(region, len))
| cb.ifx((len % 64).cne(0), {
|   // ensure that the last missing block has all missing bits set past the last element
|   // arrayType.printDebug(cb, array)
|   val lastMissingBlockAddr = cb.memoize(arrayType.firstElementOffset(array) - 8)
If the array is short, the last missing block might be smaller than 8 bytes, right?
The idea I'm trying here is to ensure that missing bits are always padded with 1s to a multiple of 64 bits. So we can treat the last block the same as any other, as long as we only try to access present elements.
With the alignment change below, I think we're safe to read the whole long.
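The padding invariant can be illustrated with a small plain-Scala sketch (a hypothetical helper for illustration, not Hail's actual region-based layout code): the missing bits are stored in whole 64-bit blocks, initialized all-missing, so every bit past the last element is 1 and the last block can be scanned like any other.

```scala
// Hypothetical sketch of the invariant described above: missing bits live in
// whole 64-bit blocks, and bits past the last element stay set to 1 (missing),
// so the last block needs no special casing.
def padMissingBits(missing: Array[Boolean]): Array[Long] = {
  val len = missing.length
  val nBlocks = (len + 63) / 64
  val blocks = Array.fill(nBlocks)(-1L) // start all-missing: padding bits are 1s
  var i = 0
  while (i < len) {
    if (!missing(i)) blocks(i / 64) &= ~(1L << (i % 64)) // clear bit for present element
    i += 1
  }
  blocks
}
```

For example, a 3-element array with only element 1 missing yields a single block whose bits 1 and 3 through 63 are set.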
| blockOff < pastLastOff,
| cb.assign(blockOff, blockOff + blockOffIncrement),
| {
|   cb.assign(presentBits, ~Region.loadLong(mbyteOffset))
No, I was wrong about that. `sun.misc.Unsafe` uses the system endianness, which is little-endian here, so no byte reversal is needed.
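The endianness point can be checked with a small JVM sketch (using `java.nio.ByteBuffer` for illustration; the real code loads through `Region.loadLong`). Assuming a little-endian host, the first missingness byte lands in the low 8 bits of the loaded long, so element i within the block maps to bit i of the long.

```scala
import java.nio.{ByteBuffer, ByteOrder}

// Illustration (assumption: little-endian host, as on x86/ARM64 Linux).
// Byte 0 of the missingness bytes supplies the low bits of the long,
// so the bit for element i within the block is bit i of the long.
val bytes = Array[Byte](1, 0, 0, 0, 0, 0, 0, 0) // only element 0's bit set
val bits = ByteBuffer.wrap(bytes).order(ByteOrder.LITTLE_ENDIAN).getLong
assert((bits & 1L) != 0, "element 0's bit is the lowest bit; no reversal needed")
```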
|
Thanks for picking this up! |
Force-pushed from b4d635e to f5d1e72
Force-pushed from f5d1e72 to f4be350
|
Should be ready to go now. I'll add a better description in the morning. I also haven't benchmarked anything yet; it would be good to do that and maybe tune things like the amount of unrolling. @danking Could you share how you created the dataset you were using for benchmarking? |
hail/src/main/scala/is/hail/types/physical/PCanonicalArray.scala (two outdated review threads, resolved)
| PCanonicalArray(copiedElement, required)
| }
| def forEachDefined(cb: EmitCodeBuilder, aoff: Value[Long])(f: (EmitCodeBuilder, Value[Int], SValue) => Unit) { |
Do you plan a follow-up PR to make this use the same logic as above? This method is used in several places. We might be able to get even more improvement outside of decoding!
Absolutely! That was one reason I thought it worth changing the PCanonicalArray layout to not have any partial last block of missingness bits - so we can easily use blocked missingness logic elsewhere too.
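The blocked version of this loop might look like the following plain-Scala sketch (hypothetical signature; the real method is staged through `EmitCodeBuilder`). With the padding invariant, every block, including the last, is scanned identically.

```scala
// Hypothetical plain-Scala sketch of blocked missingness iteration.
// missing(b) returns the b-th 64-bit block of missing bits; with the layout
// change, bits past the last element are 1s, so ~missing(b) has no stray set
// bits and the last block needs no special handling.
def forEachDefined(len: Int, missing: Int => Long)(f: Int => Unit): Unit = {
  val nBlocks = (len + 63) / 64
  var b = 0
  while (b < nBlocks) {
    var presentBits = ~missing(b)
    while (presentBits != 0L) {
      f(b * 64 + java.lang.Long.numberOfTrailingZeros(presentBits))
      presentBits &= presentBits - 1 // clear lowest set bit
    }
    b += 1
  }
}
```

For a 3-element array with only element 1 missing (block value -6L, i.e. bits 1 and 3..63 set), this visits indices 0 and 2.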
hail/src/main/scala/is/hail/types/physical/stypes/interfaces/SContainer.scala (outdated review thread, resolved)
|
Also: need a CHANGELOG and can you write a little note for our future selves in the PR description? |
|
FWIW, I built this branch and ran it twice as described in #13776 and got 35s, 36s. |



Picking up where #13776 left off.
CHANGELOG: improved speed of reading hail format datasets from disk
This PR speeds up decoding arrays in two main ways:
Instead of calling `arrayType.isElementDefined(array, i)` on every single array element (which expands to a load of a single bit from memory), we load the missingness bits 64 at a time and negate them into `presentBits`, then extract the positions of the set bits, in time proportional to the number of set bits.

To avoid needing to handle the last block of 64 elements differently, this PR changes the layout of `PCanonicalArray` to ensure the missing bits are always padded out to a multiple of 64 bits. They were already padded to a multiple of 32, and I don't expect this change to have much of an effect. But if needed, blocking by 32 elements instead had very similar performance in my benchmarks.

I also experimented with unrolling loops. In the non-missing case, this is easy. In the missing case, I tried using `if (presentBits.bitCount >= 8)` to guard an unrolled inner loop. In both cases, unrolling was, if anything, slower.

Dan observed a benefit from unrolling, but that was combined with the first optimization above (not loading a bit from memory for every element), which I believe was the real source of the improvement.
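For reference, the set-bit extraction described above can be sketched in plain Scala. This is the standard lowest-set-bit trick in a generic form, not the PR's staged code:

```scala
// Generic sketch of the standard trick: visit the positions of the set bits
// in time proportional to the number of set bits, lowest bit first.
def forEachSetBit(bits0: Long)(f: Int => Unit): Unit = {
  var bits = bits0
  while (bits != 0L) {
    f(java.lang.Long.numberOfTrailingZeros(bits)) // index of lowest set bit
    bits &= bits - 1 // clear lowest set bit
  }
}
```

For example, `forEachSetBit(41L)` (41 = 0b101001) visits indices 0, 3, and 5, and an all-zero block exits immediately without touching any element.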