getUInt returns invalid values #3

mattmook · 2022-01-22T12:45:27Z

There's currently a bug in getUInt when reading values that are negative bytes.

For example, consider the following:

val bytes = byteArrayOf(0x2e, 0xf6.toByte())

println(bytes.getUInt(0)) // prints 46
println(bytes.getUInt(1)) // prints 4294967286 but should be 246

This looks as simple as just ensuring the value is masked, i.e. private fun ByteArray.getUInt(index: Int) = get(index).toUInt() and 0xffu although I've not throughly tested this. I assume the same needs to be applied to getULong too.

The test suite is probably not broad enough in terms of the byte values under test; which is why it doesn't pick up this issue. I'm currently in the process of adding an incremental version of murmurhash into my own Multiplatform library, cryptohash and while looking into that came across this test suite from Apache Commons Codec which helped highlight this issue.

From a testing point of view, with a fix in place, hash32x86 with a zero seed would be expected to return 0x368f9df1 for the 2-byte array above.

The text was updated successfully, but these errors were encountered:

goncalossilva · 2022-01-22T20:19:30Z

Hey @mattmook, good to hear from you!

Good catch, thank you for taking the time. I think an idiomatic way of addressing it is to convert:

kotlinx-murmurhash/src/commonMain/kotlin/MurmurHash3.kt

Lines 298 to 300 in 7eba7f4

    
           private fun ByteArray.getUInt(index: Int) = get(index).toUInt() 
        
           private fun ByteArray.getULong(index: Int) = get(index).toULong()

into:

private fun ByteArray.getUInt(index: Int) = get(index).toUByte().toUInt()

private fun ByteArray.getULong(index: Int) = get(index).toUByte().toULong()

Which behind the scenes does the masking you suggest, but keeps it “Kotlin-esque” by leveraging the expressive conversion methods from/to signed/unsigned that the platform provides.

You're right that the test suite should have caught these. The wordlist is extensive, but it's most;y "a" to "z", i.e., "\u0041" through "\u005a", which conveniently misses on all of these edge cases. I didn't consider this before, but your Apache Commons Codec share looks like a great resource (good license, too). I'll likely borrow bits of it.

Fix coming later today.

goncalossilva closed this as completed in 6ceffd1 Jan 23, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

getUInt returns invalid values #3

getUInt returns invalid values #3

mattmook commented Jan 22, 2022 •

edited

goncalossilva commented Jan 22, 2022 •

edited

getUInt returns invalid values #3

getUInt returns invalid values #3

Comments

mattmook commented Jan 22, 2022 • edited

goncalossilva commented Jan 22, 2022 • edited

mattmook commented Jan 22, 2022 •

edited

goncalossilva commented Jan 22, 2022 •

edited