Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

"upgrade" adam to forked "0.20.0" #581

Merged
merged 5 commits into from
Sep 16, 2016
Merged

Conversation

ryan-williams
Copy link
Member

@ryan-williams ryan-williams commented Sep 15, 2016

this ports us on to most of the non-trivial changes that have happened in ADAM since 0.19.0, using my forked "0.20.0".

ADAM's VCF-writing now requires Sample.sampleId to match Genotype.sampleId, and I am using the "sampleName" pulled from readsets.args.Base / ReadSets rather than the loaded-from-data sampleName on Read.

The latter is now unused so I removed it. It might make sense to keep it around explicitly in some way… seems like what ADAM now does with RecordGroups is the right thing, maybe we'll want to use that if we care about that data again.


This change is Reviewable

@coveralls
Copy link

Coverage Status

Coverage increased (+0.02%) to 78.593% when pulling 99caf8f on ryan-williams:adam into a6771ee on hammerlab:master.

@coveralls
Copy link

Coverage Status

Coverage increased (+8.06%) to 86.628% when pulling bbad0c4 on ryan-williams:adam into a6771ee on hammerlab:master.

Copy link
Contributor

@arahuja arahuja left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, I'd say just wait for/ask for a new ADAM release then managing one though

@@ -41,6 +41,9 @@ case class CalledSomaticAllele(sampleName: SampleName,
.newBuilder
.setAlleles(seqAsJavaList(Seq(GenotypeAllele.Ref, GenotypeAllele.Alt)))
.setSampleId(sampleName)
.setContigName(contigName)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Weird, what was happening before?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i don't know, otomh!

the semantics of the downstream adam and htsjdk models here have changed in this upgrade, so a few things have to specified differently than they were before.

it's also possible that these aren't needed now either; I added them while debugging an issue where variant's INFO fields were coming out blank, which turned out to be due to the sampleId not being set consistently throughout.

I can dig in further if you want.

@@ -16,16 +16,11 @@
<name>guacamole: variant caller</name>

<properties>
<adam.version>0.19.0</adam.version>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any reason you don't like these as properties?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm going with the heuristic that if it's used twice, it should be factored out, otherwise it should just be on the dependency-declaration.

Factoring out versions that only appear once seems more verbose and less clear, to me.

lmk if you disagree.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd agree that it's not worth it to factor out for a single dependency, but once there seems useful, easy to get a read on the major dependencies and their version ... but probably fine as is.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea, I don't think any of these qualify as "major dependencies".

The two that do, spark and hadoop, coincidentally (or not!) are used in multiple places.

@@ -16,7 +16,92 @@
##INFO=<ID=ReadPosRankSum,Number=1,Type=Float,Description="Z-score from Wilcoxon rank sum test of Alt vs. Ref read position bias">
##INFO=<ID=VQSLOD,Number=1,Type=Float,Description="Log odds ratio of being a true variant versus being false under the trained gaussian mixture model">
##INFO=<ID=culprit,Number=1,Type=String,Description="The annotation which was the worst performing in the Gaussian mixture model, likely the reason why the variant was filtered out">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT tumor.chr20
##contig=<ID=1,length=249250621,assembly=GRCh37>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this just what is generated now? Otherwise, don't really need all the contigs listed?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yea, it's what is generated now.

if there was a more narrow a way to test this i'd be open to it, but this seems like the most straightforward way to have an end-to-end test.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No that makes sense, OOC - did these positions match the test positions before or are they different/whatever was output? It's possible more positions were output than the test positions.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This predates this PR, but all 25 positions that used to be checked in SomaticStandardRealDataSuite are here, as well as 9 more.

@coveralls
Copy link

Coverage Status

Coverage increased (+11.5%) to 90.123% when pulling d195e6e on ryan-williams:adam into a3018fd on hammerlab:master.

@arahuja
Copy link
Contributor

arahuja commented Sep 16, 2016

Coverage increased (+11.5%)

?? Does that seem right?

@ryan-williams
Copy link
Member Author

No, coveralls is clearly confused on this PR.

It just occurred to me that it's likely to do with my caching target here; coverage-info from previous runs is hanging around.

Let me either make sure that doesn't happen, or remove that change.

In general I think zinc is reliable enough at this point that we could save a lot of time not recompiling everything every build, and only rarely (if ever) having to clear caches; having repros for such zinc bugs as cause us to need to clear caches could be useful anyway.

@coveralls
Copy link

Coverage Status

Coverage increased (+0.3%) to 78.953% when pulling 7e1a114 on ryan-williams:adam into a3018fd on hammerlab:master.

@ryan-williams
Copy link
Member Author

ryan-williams commented Sep 16, 2016

OK, I think I fixed coveralls.

@ryan-williams
Copy link
Member Author

I'd say just wait for/ask for a new ADAM release then managing one though

There's no "managing" to be done; we have the equivalent of a SNAPSHOT but that won't break under us.

@ryan-williams ryan-williams merged commit 2f86006 into hammerlab:master Sep 16, 2016
@ryan-williams ryan-williams mentioned this pull request Sep 16, 2016
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants