Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Error using PureCN.R #20

Closed
billnjcn111 opened this issue Feb 25, 2018 · 49 comments
Closed

Error using PureCN.R #20

billnjcn111 opened this issue Feb 25, 2018 · 49 comments

Comments

@billnjcn111
Copy link

...
Removing 259 variants outside intervals.
INFO [2018-02-25 11:14:01] Setting somatic prior probabilities for dbSNP hits to 0.000500 or to 0.500000 otherwise.
Error: logical subscript contains NAs
Execution halted

@lima1
Copy link
Owner

lima1 commented Feb 25, 2018

Not sure where this happens. Can you by any chance share example input files and reproducible code by email?

If not, and if you use the command line interface, can you post your PureCN.R command line, if you use runAbsoluteCN, the complete list of arguments?

@billnjcn111
Copy link
Author

billnjcn111 commented Feb 25, 2018 via email

@lima1
Copy link
Owner

lima1 commented Feb 25, 2018

Can you share via Dropbox?

Does this error happen in a very minimal test run with just the tumor coverage file, a single normal coverage file, the VCF and the interval file? Or do you use CNVkit as input? The it should be small enough for email (markus.riester at gmail com)

@billnjcn111
Copy link
Author

billnjcn111 commented Feb 25, 2018 via email

@billnjcn111
Copy link
Author

billnjcn111 commented Feb 25, 2018 via email

@lima1
Copy link
Owner

lima1 commented Feb 25, 2018

Great, that helps at lot. So the original example without the mapping bias rds should work, right?

If you can share a truncated version of your normal panel VCF (header+those 5 lines), the one that was used to generate the rds, I'll check why this does not work with M2 normals.

@billnjcn111
Copy link
Author

billnjcn111 commented Feb 25, 2018 via email

@billnjcn111
Copy link
Author

billnjcn111 commented Feb 25, 2018 via email

@billnjcn111
Copy link
Author

billnjcn111 commented Feb 25, 2018 via email

@billnjcn111
Copy link
Author

billnjcn111 commented Feb 25, 2018 via email

@lima1
Copy link
Owner

lima1 commented Feb 25, 2018

Would you mind sharing the truncated VCF (again header + first couple of variant rows) as attachment here or by email? So that I can reproduce?

@billnjcn111
Copy link
Author

billnjcn111 commented Feb 25, 2018 via email

@lima1
Copy link
Owner

lima1 commented Feb 25, 2018

Ah, got it. The VCF does not contain read counts.

@lima1
Copy link
Owner

lima1 commented Feb 25, 2018

I'm not sure what's the best way of generating the normal panel VCF in GATK4, but it needs to contain all read counts. There is apparently an undocumented MergeVcfs (https://gatkforums.broadinstitute.org/gatk/discussion/10328/combinevariants-in-gatk4).

Yes, it helps quite a bit, especially in flagging variants from regions with high mapping bias where the allele frequencies are much lower than expected. Without fixing this, you'll get a lot of false somatic calls.

@billnjcn111
Copy link
Author

billnjcn111 commented Feb 25, 2018 via email

@lima1
Copy link
Owner

lima1 commented Feb 25, 2018

So just to make this clear: You'll need to run Mutect or Mutect2 on the normals in tumor-only mode. Get as many normals as possible. Then merge the normal VCFs into a single VCF that contains REF and ALT read counts in an AD format field.

CombineVariants did this in GATK3. For GATK4, let me know how it goes with MergeVcfs.

@billnjcn111
Copy link
Author

billnjcn111 commented Feb 25, 2018 via email

@billnjcn111
Copy link
Author

billnjcn111 commented Feb 25, 2018 via email

@lima1
Copy link
Owner

lima1 commented Feb 25, 2018

Looks like it finished the main steps and generated an output RDS file. Can you send all the successful output by mail?

@billnjcn111
Copy link
Author

billnjcn111 commented Feb 25, 2018 via email

@billnjcn111
Copy link
Author

billnjcn111 commented Feb 25, 2018 via email

@lima1
Copy link
Owner

lima1 commented Feb 25, 2018

Can you try in R:

library(PureCN)
x <- readCurationFile("Sampleid.rds", report.best.only=TRUE)
loh <- callLOH(x); # this probably crashes
saveRDS(x, file="Sampleid_bestonly.rds")

This should dramatically reduce the file size of the RDS (I don't need the others), enough to share by mail.

@billnjcn111
Copy link
Author

billnjcn111 commented Feb 25, 2018 via email

@lima1
Copy link
Owner

lima1 commented Feb 25, 2018

If you can't share by Dropbox, can you help me understand where the crash happens:

x <- readCurationFile("Sampleid.rds", report.best.only=TRUE)
debug(PureCN:::.getArmLocations)
PureCN:::.getArmLocations(x)

Keep pressing ENTER until it crashes, but please after the line "centromeres <- .getCentromeres(res)", enter "centromeres" so that I can see if there was an issue getting the centromere locations.

@lima1
Copy link
Owner

lima1 commented Feb 25, 2018

What --genome did you specify? My guess is not one of "hg19" or "hg38", right? It shouldn't crash, obviously, but with unknown genome version, it currently does not know the centromere positions.

@billnjcn111
Copy link
Author

billnjcn111 commented Feb 25, 2018 via email

@lima1
Copy link
Owner

lima1 commented Feb 25, 2018

Try re-running with --genome hg19 or --genome hg38 (NOT GRCh38) if you used something else for --genome. I'll add a check for that to give a more helpful error message.

@billnjcn111
Copy link
Author

billnjcn111 commented Feb 25, 2018 via email

@billnjcn111
Copy link
Author

billnjcn111 commented Feb 25, 2018 via email

@lima1
Copy link
Owner

lima1 commented Feb 25, 2018

Should be safe to use "hg19". It is used in IntervalFile.R to annotate gene symbols and in PureCN.R to map the centromere positions. At some point I'll add genome aliases to that b37 and GRCh37 work.

Otherwise if you want to keep b37, you need to manually change PureCN.R and add

data(centromeres) and then
add centromeres=centromeres$hg19 to the runAbsoluteCN call.

@billnjcn111
Copy link
Author

billnjcn111 commented Feb 25, 2018 via email

@lima1
Copy link
Owner

lima1 commented Feb 25, 2018

All.unique, we need the read counts from all 16 samples. You can limit the VCF to contain only variants present in at least 2-3 samples. In GATK3 CombineVariants, that's the --minN argument.

What the mapping bias function does is to use collect the alt and ref reads counts from the heterozygous samples only. The sum of alt and ref should be roughly equal, if not, PureCN will either adjust accordingly or ignore the variant if there is a large difference.

lima1 added a commit that referenced this issue Feb 26, 2018
#20); bgzip'ed example VCFs; provide support for deconstructSigs in Dx.R
@billnjcn111
Copy link
Author

billnjcn111 commented Feb 26, 2018 via email

@lima1
Copy link
Owner

lima1 commented Feb 26, 2018

Sure, the callMutationBurden function is written for tumor-only and should get you accurate mutations/megabase if you follow the instructions. If you use the recommended Mutect 1.1.7, GATK3 CallableLoci and provide a PON VCF, then you get a fairly well tested pipeline.

If not, then it's still largely up to you to do a proper artifact filtering (extremely important!) and testing. The default (available for all VCFs) --error argument will fairly aggressively remove reads with limited support which should remove most sequencing errors. There is also basic Mutect2 filtering as you know, but it's currently experimental.

By default, callMutationBurden is excluding everything that is annotated as DB (dbSNP) and not rescued by COSMIC (Cosmic.CNT info flag). You can create your own VCF info flag that summarizes all the databases you want and then make PureCN.R use this over DB (--dbinfoflag).

@billnjcn111
Copy link
Author

billnjcn111 commented Feb 26, 2018 via email

@lima1
Copy link
Owner

lima1 commented Feb 26, 2018

With 16 you should be fine. Filter variants in simple repeats as done in the vignettes and restrict the mutation burden calculation to coding sequences only (there is a currently undocumented script FilterCallableLoci.R in inst/extdata which takes a BED file as input, usually the PASS regions from CallableLoci, and only keeps overlapping CDS). This will ignore most of the difficult regions anyways.

@billnjcn111
Copy link
Author

billnjcn111 commented Feb 26, 2018 via email

@billnjcn111
Copy link
Author

billnjcn111 commented Mar 1, 2018 via email

@lima1
Copy link
Owner

lima1 commented Mar 1, 2018

Correct. But it's easy and very fast to run. So I would use it in your benchmarks to compare against M2 if you really need indels. If you see a shift in TMB, you know that the filtering needs work. Feel free to open an issue whenever you see problems. Also note that the default in callMutationBurden ignores indels.

Looks like there is not yet a CallableLoci replacement in GATK4 ( https://gatkforums.broadinstitute.org/gatk/discussion/11178/callableloci-replacement-in-gatk4). I would definitely recommend running GATK3 CallableLoci with --minDepth 15 or whatever you use as min coverage in Mutect.

@billnjcn111
Copy link
Author

billnjcn111 commented Mar 1, 2018 via email

lima1 added a commit that referenced this issue Mar 3, 2018
…ed; Made it possible in Dx.R to count indels in TMB calculation (#20)
@lima1
Copy link
Owner

lima1 commented Mar 3, 2018

Yes, but like I said, by default, indels are not counted. I added a flag to the Dx.R script to count them (--keepindels).

@billnjcn111
Copy link
Author

billnjcn111 commented Mar 8, 2018 via email

@billnjcn111
Copy link
Author

billnjcn111 commented Mar 8, 2018 via email

@lima1
Copy link
Owner

lima1 commented Mar 8, 2018

I would need again the whole output of the log file and the command lines used to generate the input.

When all samples show this, then there is likely an issue with the setup.

@billnjcn111
Copy link
Author

billnjcn111 commented Mar 8, 2018 via email

@billnjcn111
Copy link
Author

billnjcn111 commented Mar 8, 2018 via email

@lima1
Copy link
Owner

lima1 commented Mar 8, 2018

Your log-ratios are extremely noisy (1.13), good data is below 0.4 with coverage that low.

Are you sure you are using the correct baits file? You have almost 40k intervals without any coverage. Did NormalDB.R give you a warning that you used the wrong baits file? If then, try to re-run IntervalFile.R and provide the low_coverage_targets.bed file via --exclude. You have a large GC bias, but could be because of the wrong baits file.

You also have only a small fraction of intervals overlapping with variants (1.4%). This should be around 10%. Make sure to run Mutect with the same interval file and add 50-75bp padding (this will at least double the number of SNPs).

And then, make sure to generate a PON VCF for a production setting.

@lima1
Copy link
Owner

lima1 commented Mar 8, 2018

Also, is there a reason you dropped --offtarget?

@billnjcn111
Copy link
Author

billnjcn111 commented Mar 8, 2018 via email

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants