Cannot load a PLINK file containing 20 million variants #5564

Closed
danking opened this Issue Mar 8, 2019 · 3 comments

danking commented Mar 8, 2019

I believe we are encountering this known Kryo limitation: EsotericSoftware/kryo#497, EsotericSoftware/kryo#382 (also see related GATK issue: broadinstitute/gatk#1524)

Danfeng saw the referenced stack trace when trying to broadcast the variants for PLINK (see LoadPlink.scala:202). She was running an import_plink followed by a count.

The details in EsotericSoftware/kryo#382 indicate that a bad interaction between the data and a hash function can cause this integer map to exceed its size limitations at a load factor of 5%. Even a 20x increase in footprint puts us at 400 million. Each element of that array has 6 entries, so we're at 1.2 billion. That definitely feels like the danger zone. Maybe there are more variants than Danfeng expects, or maybe there is more overhead than we've accounted for.

The GATK folks have been chasing down the fix. Kryo released 4.0.0, which should fix this issue. Spark upgraded to Kryo 4.0.0 on September 8, 2018 (resolving SPARK-20389). This change made it into Spark 2.4.0, but it was not backported to earlier versions of Spark.

GATK references a temporary fix via JVM options, which apparently forces the JVM to use an alternative hash function with better behavior in this specific case:

spark.executor.extraJavaOptions -XX:hashCode=0
spark.driver.extraJavaOptions -XX:hashCode=0
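
For anyone launching through spark-submit instead of a properties file, here is a minimal sketch of turning the two properties above into `--conf key=value` arguments. The class and method names are hypothetical (not part of Hail or Spark); only the property names and the flag come from the lines above.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Hail or Spark: renders Spark properties
// as `--conf key=value` arguments for spark-submit.
class HashCodeWorkaroundArgs {
    static final String FLAG = "-XX:hashCode=0";

    // The two properties from the workaround, in a stable order.
    static Map<String, String> workaroundProperties() {
        Map<String, String> props = new LinkedHashMap<>();
        props.put("spark.executor.extraJavaOptions", FLAG);
        props.put("spark.driver.extraJavaOptions", FLAG);
        return props;
    }

    // Each property becomes two arguments: "--conf" and "key=value".
    static List<String> toSubmitArgs(Map<String, String> props) {
        List<String> args = new ArrayList<>();
        for (Map.Entry<String, String> e : props.entrySet()) {
            args.add("--conf");
            args.add(e.getKey() + "=" + e.getValue());
        }
        return args;
    }

    public static void main(String[] argv) {
        System.out.println(String.join(" ", toSubmitArgs(workaroundProperties())));
    }
}
```

The Spark configuration docs note that in client mode the driver option has to be supplied at launch (on the spark-submit command line or in the default properties file) rather than set from application code, because the driver JVM is already running by then.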

A generally interesting blog post on Java's hashCode, which I haven't fully read, claims that the JVM previously defaulted to a PRNG draw for an object's hash code. In JDK 8 it uses some function of the current thread state. It appears this old strategy is preserved as JVM hashCode parameter value 0 and is less likely to trigger the bad behavior in Kryo. This -XX:hashCode option is undocumented 1, 2 🤷‍♀️.
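
Since the option is undocumented, it may be worth confirming that it actually reached the JVM. A minimal sketch, assuming a HotSpot JVM that reports its startup flags through the standard `RuntimeMXBean`; the class name is hypothetical, and the check only looks for the literal flag text:

```java
import java.lang.management.ManagementFactory;
import java.util.List;

// Hypothetical check, not part of Hail or Spark.
class HashCodeFlagCheck {
    // True if the literal flag -XX:hashCode=0 appears among the JVM arguments.
    static boolean selectsHashCodeZero(List<String> jvmArgs) {
        return jvmArgs.contains("-XX:hashCode=0");
    }

    public static void main(String[] args) {
        // Arguments the JVM was started with (not the arguments passed to main).
        List<String> jvmArgs = ManagementFactory.getRuntimeMXBean().getInputArguments();
        System.out.println("JVM arguments: " + jvmArgs);
        System.out.println("-XX:hashCode=0 present: " + selectsHashCodeZero(jvmArgs));
    }
}
```

Running this on the driver only reports on the driver JVM. To check executors, the same call would have to run inside a task.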

Another suggested option is to disable Kryo's reference tracking. With tracking off, an object that appears more than once in the object graph is serialized once per occurrence:

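import com.esotericsoftware.kryo.Kryo;
Kryo kryo = new Kryo();
// Turn off reference tracking: repeated objects are written in full each time.
kryo.setReferences(false);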
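
To make the trade-off concrete, here is a toy model of what reference tracking does. This is not Kryo's implementation or wire format; the class name, token strings, and sample value are made up for illustration. The point is that tracking needs an identity-keyed map with one entry per distinct object, while turning it off needs no map but repeats the payload.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.IdentityHashMap;
import java.util.List;
import java.util.Map;

// Toy model only: not Kryo's implementation or wire format.
class ReferenceTrackingSketch {
    // "Serializes" each object into a token. With tracking on, a repeated
    // object (same identity) is written as a back-reference.
    static List<String> write(List<Object> objects, boolean trackReferences) {
        Map<Object, Integer> seen = new IdentityHashMap<>();
        List<String> out = new ArrayList<>();
        for (Object o : objects) {
            Integer id = trackReferences ? seen.get(o) : null;
            if (id != null) {
                out.add("ref#" + id);       // already written: emit its id
            } else {
                if (trackReferences) {
                    seen.put(o, seen.size()); // one map entry per distinct object
                }
                out.add("obj(" + o + ")");  // full payload
            }
        }
        return out;
    }

    public static void main(String[] args) {
        String shared = new String("variant-1"); // made-up sample value
        List<Object> graph = Arrays.asList(shared, shared, new String("variant-1"));
        System.out.println(write(graph, true));  // [obj(variant-1), ref#0, obj(variant-1)]
        System.out.println(write(graph, false)); // [obj(variant-1), obj(variant-1), obj(variant-1)]
    }
}
```

In a Spark job the Kryo instances are created by Spark itself, so the equivalent switch would be the `spark.kryo.referenceTracking` property, not a direct `setReferences` call. Kryo's documentation also warns that object graphs containing cycles cannot be serialized with references disabled.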

danking commented Mar 8, 2019

@danking danking self-assigned this Mar 8, 2019

danking commented Mar 8, 2019

Confirmed that the hashCode solution works for Danfeng.

danking commented Mar 8, 2019

Closing. The recommended workaround is -XX:hashCode=0; the long-term plan is to eliminate Spark.

@danking danking closed this Mar 8, 2019
