Cannot load a PLINK file containing 20 million variants #5564
Danfeng saw the referenced stack trace when trying to broadcast the variants for PLINK import (see LoadPlink.scala:202). She was running an `import_plink` followed by a `count`.
The details in EsotericSoftware/kryo#382 indicate that a bad interaction between the data and a hash function can cause this integer map to exceed its size limitations at a load factor of 5%. Even a 20x increase in footprint puts us at 400 million. Each element of that array has 6 entries, so we're at 2.4 billion, which is past Integer.MAX_VALUE (about 2.1 billion). That definitely feels like the danger zone: an array size that overflows a 32-bit int goes negative, which is exactly what a NegativeArraySizeException reports. Maybe there are more variants than Danfeng expects, or maybe there's more overhead than we've accounted for.
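As a sanity check on the overflow mechanics, here is a toy sketch (plain Java, not Kryo's actual IntMap code): once a doubling backing-array capacity passes 2^30, the next doubling wraps a 32-bit int negative, and the subsequent allocation throws the exception users see.

```java
public class CapacityOverflow {
    public static void main(String[] args) {
        // Hypothetical sketch of the failure mode: a hash map that keeps
        // doubling its backing array eventually overflows a 32-bit int.
        int capacity = 1 << 30;   // 1,073,741,824 -- largest power-of-two int
        int next = capacity << 1; // wraps to Integer.MIN_VALUE
        System.out.println(next); // prints -2147483648
        try {
            int[] table = new int[next]; // negative size
        } catch (NegativeArraySizeException e) {
            System.out.println("NegativeArraySizeException");
        }
    }
}
```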
The GATK folks have been chasing down the fix. Kryo released 4.0.0, which should fix this issue. Spark upgraded to Kryo 4.0.0 on September 8th, 2018 (resolving SPARK-20389). This change made it into Spark 2.4.0, but it was not backported to earlier versions of Spark.
GATK references a temporary fix via JVM options, which apparently force the JVM to use an alternative hash function with better behavior in this specific case.
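The exact flag isn't quoted here; based on the description it is presumably HotSpot's `-XX:hashCode=0` (an assumption on my part, not confirmed from the GATK thread), which in a Spark job would need to reach both the driver and the executors:

```shell
# Hypothetical spark-submit invocation; the -XX:hashCode=0 value is
# assumed from the description of the workaround, not quoted from GATK.
spark-submit \
  --conf spark.driver.extraJavaOptions=-XX:hashCode=0 \
  --conf spark.executor.extraJavaOptions=-XX:hashCode=0 \
  ...
```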
A generally interesting blog post on Java's hashCode, which I haven't fully read, claims that the JVM previously defaulted to a PRNG draw for an object's hash code; in JDK 8 it uses some function of the current thread state instead. The old strategy appears to be preserved as JVM hashCode parameter value 0 and is less likely to trigger the bad behavior in Kryo.
Another suggested Kryo option is to disable reference tracking. This would cause duplicate objects in the object graph to be serialized twice:
```java
Kryo kryo = new Kryo();
kryo.setReferences(false);
```
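To illustrate what disabling reference tracking trades away, here is a toy sketch of the bookkeeping in plain Java (not Kryo itself): with tracking, a shared object is written in full once and later occurrences become back-references keyed by identity; without it, every occurrence is written in full.

```java
import java.util.ArrayList;
import java.util.IdentityHashMap;
import java.util.List;

// Toy model of reference tracking during serialization. This is not
// Kryo's implementation, just the idea behind setReferences(false).
public class RefTrackingSketch {
    static int countFullWrites(List<Object> graph, boolean trackReferences) {
        // Identity map mirrors how a serializer would remember objects
        // already written, so duplicates become cheap back-references.
        IdentityHashMap<Object, Integer> seen = new IdentityHashMap<>();
        int fullWrites = 0;
        for (Object o : graph) {
            if (trackReferences && seen.containsKey(o)) {
                continue; // emit a back-reference instead of the object
            }
            seen.put(o, seen.size());
            fullWrites++; // serialize the object in full
        }
        return fullWrites;
    }

    public static void main(String[] args) {
        Object shared = new Object();
        List<Object> graph = new ArrayList<>();
        graph.add(shared);
        graph.add(shared); // same object appears twice
        graph.add(new Object());
        System.out.println(countFullWrites(graph, true));  // prints 2
        System.out.println(countFullWrites(graph, false)); // prints 3
    }
}
```

In Spark this knob is exposed as the `spark.kryo.referenceTracking` configuration property, so the trade-off above applies to any job that flips it off: less map bookkeeping, but shared objects serialized repeatedly.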
I added a discuss post for our users: https://discuss.hail.is/t/i-get-a-negativearraysizeexception-when-loading-a-plink-file/899