Error using PureCN.R #20

billnjcn111 · 2018-02-25T16:23:46Z

...
Removing 259 variants outside intervals.
INFO [2018-02-25 11:14:01] Setting somatic prior probabilities for dbSNP hits to 0.000500 or to 0.500000 otherwise.
Error: logical subscript contains NAs
Execution halted

lima1 · 2018-02-25T16:30:22Z

Not sure where this happens. Can you by any chance share example input files and reproducible code by email?

If not, and if you use the command line interface, can you post your PureCN.R command line, if you use runAbsoluteCN, the complete list of arguments?

billnjcn111 · 2018-02-25T16:39:22Z

Do you have a place where I can share all the input files? because a few lines of input file may not reproduce the issue. Thanks A lot

…

On Sun, Feb 25, 2018 at 11:30 AM, M. Riester ***@***.***> wrote: Not sure where this happens. Can you by any chance share example input files and reproducible code by email? If not, and if you use the command line interface, can you post your PureCN.R command line, if you use runAbsoluteCN, the complete list of arguments? — You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub <#20 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/AXw6DLpAt4pUS8WspTJiGgfC9msTK-Zfks5tYYqfgaJpZM4SSWvK> .

lima1 · 2018-02-25T16:45:26Z

Can you share via Dropbox?

Does this error happen in a very minimal test run with just the tumor coverage file, a single normal coverage file, the VCF and the interval file? Or do you use CNVkit as input? The it should be small enough for email (markus.riester at gmail com)

billnjcn111 · 2018-02-25T16:48:00Z

OK, let me test it using the above minimal option and report you back. I don't use the CNVkit and I used Mutect2 vcf. Hold on...

…

On Sun, Feb 25, 2018 at 11:45 AM, M. Riester ***@***.***> wrote: Can you share via Dropbox? Does this error happen in a very minimal test run with just the tumor coverage file, a single normal coverage file, the VCF and the interval file? Or do you use CNVkit as input? The it should be small enough for email (markus.riester at gmail com) — You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub <#20 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/AXw6DEBEtUK3AXx26dKdw_RgdtIuk-s5ks5tYY4ngaJpZM4SSWvK> .

billnjcn111 · 2018-02-25T17:03:07Z

Well, it seems that with that minimal options there is at least no such error. Now it is doing the grid search which takes time. So initially I had a few more options like mapping bias file (from mutect2), and the target weight file as well as --outvcf --postoptimize --seed 123 Which one caused the trouble? here is a few lines of my mapping bias rds: GRanges object with 301373 ranges and 2 metadata columns: seqnames ranges strand | bias <Rle> <IRanges> <Rle> | <numeric> 1:13289_CCT/C 1 [13289, 13291] * | NaN 1:13494_A/G 1 [13494, 13494] * | NaN 1:14907_A/G 1 [14907, 14907] * | NaN 1:14930_A/G 1 [14930, 14930] * | NaN 1:14933_G/A 1 [14933, 14933] * | NaN

…

On Sun, Feb 25, 2018 at 11:47 AM, billnjcn ***@***.***> wrote: OK, let me test it using the above minimal option and report you back. I don't use the CNVkit and I used Mutect2 vcf. Hold on... On Sun, Feb 25, 2018 at 11:45 AM, M. Riester ***@***.***> wrote: > Can you share via Dropbox? > > Does this error happen in a very minimal test run with just the tumor > coverage file, a single normal coverage file, the VCF and the interval > file? Or do you use CNVkit as input? The it should be small enough for > email (markus.riester at gmail com) > > — > You are receiving this because you authored the thread. > Reply to this email directly, view it on GitHub > <#20 (comment)>, or mute > the thread > <https://github.com/notifications/unsubscribe-auth/AXw6DEBEtUK3AXx26dKdw_RgdtIuk-s5ks5tYY4ngaJpZM4SSWvK> > . >

lima1 · 2018-02-25T17:06:59Z

Great, that helps at lot. So the original example without the mapping bias rds should work, right?

If you can share a truncated version of your normal panel VCF (header+those 5 lines), the one that was used to generate the rds, I'll check why this does not work with M2 normals.

billnjcn111 · 2018-02-25T17:14:17Z

##tumor_sample=TCGA-DQ-5629-10A-01D-1870-08 #CHROM POS ID REF ALT QUAL FILTER INFO 1 13289 . CCT C . . . 1 13494 . A G . . . 1 14907 . A G . . . 1 14930 . A G . . . 1 14933 . G A . . . 1 14948 . G A . . . 1 14976 . G A . . .

…

On Sun, Feb 25, 2018 at 12:07 PM, M. Riester ***@***.***> wrote: Great, that helps at lot. So the original example without the mapping bias rds should work, right? If you can share a truncated version of your normal panel VCF (header+those 5 lines), the one that was used to generate the rds, I'll check why this does not work with M2 normals. — You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub <#20 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/AXw6DJndAKDOEulsMv22wLBQ819HHv52ks5tYZM0gaJpZM4SSWvK> .

billnjcn111 · 2018-02-25T17:15:07Z

Header is very long, here is the initial ones: ##fileformat=VCFv4.2 ##FORMAT=<ID=AD,Number=R,Type=Integer,Description="Allelic depths for the ref and alt alleles in the order listed"> ##FORMAT=<ID=AF,Number=A,Type=Float,Description="Allele fractions of alternate alleles in the tumor"> ##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth (reads with MQ=255 or with bad mates are filtered)"> ##FORMAT=<ID=F1R2,Number=R,Type=Integer,Description="Count of reads in F1R2 pair orientation supporting each allele"> ##FORMAT=<ID=F2R1,Number=R,Type=Integer,Description="Count of reads in F2R1 pair orientation supporting each allele"> ##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality"> ##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype"> ##FORMAT=<ID=MBQ,Number=A,Type=Float,Description="median base quality"> ##FORMAT=<ID=MFRL,Number=R,Type=Float,Description="median fragment length"> ##FORMAT=<ID=MMQ,Number=A,Type=Float,Description="median mapping quality"> ##FORMAT=<ID=MPOS,Number=A,Type=Float,Description="median distance from end of read"> ##FORMAT=<ID=PGT,Number=1,Type=String,Description="Physical phasing haplotype information, describing how the alternate alleles are phased in relation to one another"> ##FORMAT=<ID=PID,Number=1,Type=String,Description="Physical phasing ID information, where each unique ID within a given sample (but not across samples) connects records within a phasing group"> ##FORMAT=<ID=PL,Number=G,Type=Integer,Description="Normalized, Phred-scaled likelihoods for genotypes as defined in the VCF specification"> ##FORMAT=<ID=SA_MAP_AF,Number=3,Type=Float,Description="MAP estimates of allele fraction given z"> ##FORMAT=<ID=SA_POST_PROB,Number=3,Type=Float,Description="posterior probabilities of the presence of strand artifact">

…

On Sun, Feb 25, 2018 at 12:14 PM, billnjcn ***@***.***> wrote: ##tumor_sample=TCGA-DQ-5629-10A-01D-1870-08 #CHROM POS ID REF ALT QUAL FILTER INFO 1 13289 . CCT C . . . 1 13494 . A G . . . 1 14907 . A G . . . 1 14930 . A G . . . 1 14933 . G A . . . 1 14948 . G A . . . 1 14976 . G A . . . On Sun, Feb 25, 2018 at 12:07 PM, M. Riester ***@***.***> wrote: > Great, that helps at lot. So the original example without the mapping > bias rds should work, right? > > If you can share a truncated version of your normal panel VCF > (header+those 5 lines), the one that was used to generate the rds, I'll > check why this does not work with M2 normals. > > — > You are receiving this because you authored the thread. > Reply to this email directly, view it on GitHub > <#20 (comment)>, or mute > the thread > <https://github.com/notifications/unsubscribe-auth/AXw6DJndAKDOEulsMv22wLBQ819HHv52ks5tYZM0gaJpZM4SSWvK> > . >

billnjcn111 · 2018-02-25T17:16:08Z

It does not have BQ instead using MBQ, I am not sure how you filtered the low quality variants. but that is another minor thing. Thanks

…

On Sun, Feb 25, 2018 at 12:15 PM, billnjcn ***@***.***> wrote: Header is very long, here is the initial ones: ##fileformat=VCFv4.2 ##FORMAT=<ID=AD,Number=R,Type=Integer,Description="Allelic depths for the ref and alt alleles in the order listed"> ##FORMAT=<ID=AF,Number=A,Type=Float,Description="Allele fractions of alternate alleles in the tumor"> ##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth (reads with MQ=255 or with bad mates are filtered)"> ##FORMAT=<ID=F1R2,Number=R,Type=Integer,Description="Count of reads in F1R2 pair orientation supporting each allele"> ##FORMAT=<ID=F2R1,Number=R,Type=Integer,Description="Count of reads in F2R1 pair orientation supporting each allele"> ##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality"> ##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype"> ##FORMAT=<ID=MBQ,Number=A,Type=Float,Description="median base quality"> ##FORMAT=<ID=MFRL,Number=R,Type=Float,Description="median fragment length"> ##FORMAT=<ID=MMQ,Number=A,Type=Float,Description="median mapping quality"> ##FORMAT=<ID=MPOS,Number=A,Type=Float,Description="median distance from end of read"> ##FORMAT=<ID=PGT,Number=1,Type=String,Description="Physical phasing haplotype information, describing how the alternate alleles are phased in relation to one another"> ##FORMAT=<ID=PID,Number=1,Type=String,Description="Physical phasing ID information, where each unique ID within a given sample (but not across samples) connects records within a phasing group"> ##FORMAT=<ID=PL,Number=G,Type=Integer,Description="Normalized, Phred-scaled likelihoods for genotypes as defined in the VCF specification"> ##FORMAT=<ID=SA_MAP_AF,Number=3,Type=Float,Description="MAP estimates of allele fraction given z"> ##FORMAT=<ID=SA_POST_PROB,Number=3,Type=Float,Description="posterior probabilities of the presence of strand artifact"> On Sun, Feb 25, 2018 at 12:14 PM, billnjcn ***@***.***> wrote: > ##tumor_sample=TCGA-DQ-5629-10A-01D-1870-08 > #CHROM POS ID REF ALT QUAL FILTER INFO > 1 13289 . CCT C . . . > 1 13494 . A G . . . > 1 14907 . A G . . . > 1 14930 . A G . . . > 1 14933 . G A . . . > 1 14948 . G A . . . > 1 14976 . G A . . . > > On Sun, Feb 25, 2018 at 12:07 PM, M. Riester ***@***.***> > wrote: > >> Great, that helps at lot. So the original example without the mapping >> bias rds should work, right? >> >> If you can share a truncated version of your normal panel VCF >> (header+those 5 lines), the one that was used to generate the rds, I'll >> check why this does not work with M2 normals. >> >> — >> You are receiving this because you authored the thread. >> Reply to this email directly, view it on GitHub >> <#20 (comment)>, or mute >> the thread >> <https://github.com/notifications/unsubscribe-auth/AXw6DJndAKDOEulsMv22wLBQ819HHv52ks5tYZM0gaJpZM4SSWvK> >> . >> > >

billnjcn111 · 2018-02-25T17:18:08Z

You are right, as I said, it passed that place where that error occurred. I am using the Rscript on the command line. It is doing grid search w/o the mapping bias file

…

On Sun, Feb 25, 2018 at 12:07 PM, M. Riester ***@***.***> wrote: Great, that helps at lot. So the original example without the mapping bias rds should work, right? If you can share a truncated version of your normal panel VCF (header+those 5 lines), the one that was used to generate the rds, I'll check why this does not work with M2 normals. — You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub <#20 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/AXw6DJndAKDOEulsMv22wLBQ819HHv52ks5tYZM0gaJpZM4SSWvK> .

lima1 · 2018-02-25T17:20:43Z

Would you mind sharing the truncated VCF (again header + first couple of variant rows) as attachment here or by email? So that I can reproduce?

billnjcn111 · 2018-02-25T17:21:32Z

Also do you think by adding the mapping bias will raise the accuracy of the predication? I mainly try to find a method to compare to FM 's SGZ algorithm. Thanks

…

On Sun, Feb 25, 2018 at 12:18 PM, billnjcn ***@***.***> wrote: You are right, as I said, it passed that place where that error occurred. I am using the Rscript on the command line. It is doing grid search w/o the mapping bias file On Sun, Feb 25, 2018 at 12:07 PM, M. Riester ***@***.***> wrote: > Great, that helps at lot. So the original example without the mapping > bias rds should work, right? > > If you can share a truncated version of your normal panel VCF > (header+those 5 lines), the one that was used to generate the rds, I'll > check why this does not work with M2 normals. > > — > You are receiving this because you authored the thread. > Reply to this email directly, view it on GitHub > <#20 (comment)>, or mute > the thread > <https://github.com/notifications/unsubscribe-auth/AXw6DJndAKDOEulsMv22wLBQ819HHv52ks5tYZM0gaJpZM4SSWvK> > . >

lima1 · 2018-02-25T17:21:34Z

Ah, got it. The VCF does not contain read counts.

lima1 · 2018-02-25T17:26:07Z

I'm not sure what's the best way of generating the normal panel VCF in GATK4, but it needs to contain all read counts. There is apparently an undocumented MergeVcfs (https://gatkforums.broadinstitute.org/gatk/discussion/10328/combinevariants-in-gatk4).

Yes, it helps quite a bit, especially in flagging variants from regions with high mapping bias where the allele frequencies are much lower than expected. Without fixing this, you'll get a lot of false somatic calls.

billnjcn111 · 2018-02-25T17:26:35Z

Great, is there an easy fix or I can just leave it. Also as I asked did you see improvement of accuracy of prediction by adding that feature? thx

…

On Sun, Feb 25, 2018 at 12:21 PM, M. Riester ***@***.***> wrote: Ah, got it. The VCF does not contain read counts. — You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub <#20 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/AXw6DHvt3xq00I4oaqr1-cjtqNaCQUJOks5tYZafgaJpZM4SSWvK> .

lima1 · 2018-02-25T17:39:07Z

So just to make this clear: You'll need to run Mutect or Mutect2 on the normals in tumor-only mode. Get as many normals as possible. Then merge the normal VCFs into a single VCF that contains REF and ALT read counts in an AD format field.

CombineVariants did this in GATK3. For GATK4, let me know how it goes with MergeVcfs.

billnjcn111 · 2018-02-25T17:45:54Z

thx

…

On Sun, Feb 25, 2018 at 12:39 PM, M. Riester ***@***.***> wrote: So just to make this clear: You'll need to run Mutect or Mutect2 on the normals in tumor-only mode. Get as many normals as possible. Then merge the normal VCFs into a single VCF that contains REF and ALT read counts in an AD format field. CombineVariants did this in GATK3. For GATK4, let me know how it goes with MergeVcfs. — You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub <#20 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/AXw6DN5TQsIm8oqMvpHyo9vabaFKy78Oks5tYZq7gaJpZM4SSWvK> .

billnjcn111 · 2018-02-25T18:00:38Z

Another error using the minimal options: ... INFO [2018-02-25 12:57:24] Optimized contamination rate: 0.071 INFO [2018-02-25 12:57:24] Done. INFO [2018-02-25 12:57:24] ------------------------------------------------------------ Warning message: In .bcfHeaderAsSimpleList(header) : duplicate keys in header will be forced to unique rownames null device 1 null device 1 Error in (function (classes, fdef, mtable) : unable to find an inherited method for function ‘seqnames’ for signature ‘"NULL"’ Calls: write.csv ... .getArmLocations -> match -> seqnames -> <Anonymous> Execution halted

…

On Sun, Feb 25, 2018 at 12:45 PM, billnjcn ***@***.***> wrote: thx On Sun, Feb 25, 2018 at 12:39 PM, M. Riester ***@***.***> wrote: > So just to make this clear: You'll need to run Mutect or Mutect2 on the > normals in tumor-only mode. Get as many normals as possible. Then merge the > normal VCFs into a single VCF that contains REF and ALT read counts in an > AD format field. > > CombineVariants did this in GATK3. For GATK4, let me know how it goes > with MergeVcfs. > > — > You are receiving this because you authored the thread. > Reply to this email directly, view it on GitHub > <#20 (comment)>, or mute > the thread > <https://github.com/notifications/unsubscribe-auth/AXw6DN5TQsIm8oqMvpHyo9vabaFKy78Oks5tYZq7gaJpZM4SSWvK> > . >

lima1 · 2018-02-25T18:33:37Z

Looks like it finished the main steps and generated an output RDS file. Can you send all the successful output by mail?

billnjcn111 · 2018-02-25T18:43:47Z

.log, _segmentation.pdf, .rds, .csv, .pdf, _local_optima.pdf, _dnacopy.seg, _genes.csv, _variants.csv

…

On Sun, Feb 25, 2018 at 1:33 PM, M. Riester ***@***.***> wrote: Looks like it finished the main steps and generated an output RDS file. Can you send all the successful output by mail? — You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub <#20 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/AXw6DJZTrFK3xq6DaJBL8uHQpXoUmcTeks5tYaeCgaJpZM4SSWvK> .

billnjcn111 · 2018-02-25T18:45:02Z

Do you need the files? the rds is too big. Let me see if I can use drop box

…

On Sun, Feb 25, 2018 at 1:43 PM, billnjcn ***@***.***> wrote: .log, _segmentation.pdf, .rds, .csv, .pdf, _local_optima.pdf, _ dnacopy.seg, _genes.csv, _variants.csv On Sun, Feb 25, 2018 at 1:33 PM, M. Riester ***@***.***> wrote: > Looks like it finished the main steps and generated an output RDS file. > Can you send all the successful output by mail? > > — > You are receiving this because you authored the thread. > Reply to this email directly, view it on GitHub > <#20 (comment)>, or mute > the thread > <https://github.com/notifications/unsubscribe-auth/AXw6DJZTrFK3xq6DaJBL8uHQpXoUmcTeks5tYaeCgaJpZM4SSWvK> > . >

lima1 · 2018-02-25T19:18:53Z

Can you try in R:

library(PureCN)
x <- readCurationFile("Sampleid.rds", report.best.only=TRUE)
loh <- callLOH(x); # this probably crashes
saveRDS(x, file="Sampleid_bestonly.rds")

This should dramatically reduce the file size of the RDS (I don't need the others), enough to share by mail.

billnjcn111 · 2018-02-25T20:22:50Z

Yes, you are right, it is the loh step that caused the crash. and the best.only res is still 100 M.

…

On Sun, Feb 25, 2018 at 2:18 PM, M. Riester ***@***.***> wrote: Can you try in R: library(PureCN) x <- readCurationFile("Sampleid.rds", report.best.only=TRUE) loh <- callLOH(x); # this probably crashes saveRDS(x, file="Sampleid_bestonly.rds") This should dramatically reduce the file size of the RDS (I don't need the others), enough to share by mail. — You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub <#20 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/AXw6DIjbuV-JeD1R9kVn8qIxNZtk-C9Xks5tYbIdgaJpZM4SSWvK> .

lima1 · 2018-02-25T21:45:37Z

If you can't share by Dropbox, can you help me understand where the crash happens:

x <- readCurationFile("Sampleid.rds", report.best.only=TRUE)
debug(PureCN:::.getArmLocations)
PureCN:::.getArmLocations(x)

Keep pressing ENTER until it crashes, but please after the line "centromeres <- .getCentromeres(res)", enter "centromeres" so that I can see if there was an issue getting the centromere locations.

lima1 · 2018-02-25T21:51:32Z

What --genome did you specify? My guess is not one of "hg19" or "hg38", right? It shouldn't crash, obviously, but with unknown genome version, it currently does not know the centromere positions.

billnjcn111 · 2018-02-25T21:53:18Z

Browse[2]> debug: chromCoords <- chromCoords[as.integer(match(seqnames(centromeres), rownames(chromCoords))), ] Browse[2]> Error in (function (classes, fdef, mtable) : unable to find an inherited method for function ‘seqnames’ for signature ‘"NULL"’

…

On Sun, Feb 25, 2018 at 4:45 PM, M. Riester ***@***.***> wrote: If you can't share by Dropbox, can you help me understand where the crash happens: x <- readCurationFile("Sampleid.rds", report.best.only=TRUE) debug(PureCN:::.getArmLocations) PureCN:::.getArmLocations(x) Keep pressing ENTER until it crashes, but please after the line "centromeres <- .getCentromeres(res)", enter "centromeres" so that I can see if there was an issue getting the centromere locations. — You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub <#20 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/AXw6DC29lAkEJuVsGV0jIsVwDtsIHMCZks5tYdSBgaJpZM4SSWvK> .

lima1 · 2018-02-25T21:58:29Z

Try re-running with --genome hg19 or --genome hg38 (NOT GRCh38) if you used something else for --genome. I'll add a check for that to give a more helpful error message.

billnjcn111 · 2018-02-25T21:58:48Z

I used b37 because all other files are b37

…

On Sun, Feb 25, 2018 at 4:51 PM, M. Riester ***@***.***> wrote: What --genome did you specify? My guess is not one of "hg19" or "hg38", right? It shouldn't crash, obviously, but with unknown genome version, it currently does not know the centromere positions. — You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub <#20 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/AXw6DLZnC1Qt9PumO_AgocO0QCyKy62oks5tYdXkgaJpZM4SSWvK> .

billnjcn111 · 2018-02-25T21:59:53Z

I can make another file ripping off car from your hg19 of centromere. Which file shall I change?

…

On Sun, Feb 25, 2018 at 4:58 PM, billnjcn ***@***.***> wrote: I used b37 because all other files are b37 On Sun, Feb 25, 2018 at 4:51 PM, M. Riester ***@***.***> wrote: > What --genome did you specify? My guess is not one of "hg19" or "hg38", > right? It shouldn't crash, obviously, but with unknown genome version, it > currently does not know the centromere positions. > > — > You are receiving this because you authored the thread. > Reply to this email directly, view it on GitHub > <#20 (comment)>, or mute > the thread > <https://github.com/notifications/unsubscribe-auth/AXw6DLZnC1Qt9PumO_AgocO0QCyKy62oks5tYdXkgaJpZM4SSWvK> > . >

lima1 · 2018-02-25T22:07:38Z

Should be safe to use "hg19". It is used in IntervalFile.R to annotate gene symbols and in PureCN.R to map the centromere positions. At some point I'll add genome aliases to that b37 and GRCh37 work.

Otherwise if you want to keep b37, you need to manually change PureCN.R and add

data(centromeres) and then
add centromeres=centromeres$hg19 to the runAbsoluteCN call.

billnjcn111 · 2018-02-25T22:14:31Z

Do you mean the combined VCF has one sample or needs different sample names? I got 16 normal samples only, when combine them, there are 2 ways, one is to list all samples unique and so that all 16 samples will be in each row of vcf. the other is that regardless of sample names, treat them as same sample and just combine all the variants. Which one does it need?

…

On Sun, Feb 25, 2018 at 12:39 PM, M. Riester ***@***.***> wrote: So just to make this clear: You'll need to run Mutect or Mutect2 on the normals in tumor-only mode. Get as many normals as possible. Then merge the normal VCFs into a single VCF that contains REF and ALT read counts in an AD format field. CombineVariants did this in GATK3. For GATK4, let me know how it goes with MergeVcfs. — You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub <#20 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/AXw6DN5TQsIm8oqMvpHyo9vabaFKy78Oks5tYZq7gaJpZM4SSWvK> .

lima1 · 2018-02-25T22:22:05Z

All.unique, we need the read counts from all 16 samples. You can limit the VCF to contain only variants present in at least 2-3 samples. In GATK3 CombineVariants, that's the --minN argument.

What the mapping bias function does is to use collect the alt and ref reads counts from the heterozygous samples only. The sum of alt and ref should be roughly equal, if not, PureCN will either adjust accordingly or ignore the variant if there is a large difference.

#20); bgzip'ed example VCFs; provide support for deconstructSigs in Dx.R

billnjcn111 · 2018-02-26T16:10:06Z

Riester, When you compare your pureCN TMB with FMI TMB, was it tumor only sample? also did you filtering Mutect results using all the Mutect filters, such as t_lod, str_contraction etc??? Did you also remove all common germline variants in all the public data bases, such as 1000G, ExAct, gnomAD etc before /after using pureCN? I am looking for a algorithm to work on Tumor only samples to get FMI like TMB values. Do you have any suggestion when I do this comparison? Thanks

…

On Sun, Feb 25, 2018 at 5:22 PM, M. Riester ***@***.***> wrote: All.unique, we need the read counts from all 16 samples. You can limit the VCF to contain only variants present in at least 2-3 samples. In GATK3 CombineVariants, that's the --minN argument. What the mapping bias function does is to use collect the alt and ref reads counts from the heterozygous samples only. The sum of alt and ref should be roughly equal, if not, PureCN will either adjust accordingly or ignore the variant if there is a large difference. — You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub <#20 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/AXw6DF95g4sAN0TVvgdgk2AP8UkcyRP8ks5tYd0NgaJpZM4SSWvK> .

lima1 · 2018-02-26T16:31:25Z

Sure, the callMutationBurden function is written for tumor-only and should get you accurate mutations/megabase if you follow the instructions. If you use the recommended Mutect 1.1.7, GATK3 CallableLoci and provide a PON VCF, then you get a fairly well tested pipeline.

If not, then it's still largely up to you to do a proper artifact filtering (extremely important!) and testing. The default (available for all VCFs) --error argument will fairly aggressively remove reads with limited support which should remove most sequencing errors. There is also basic Mutect2 filtering as you know, but it's currently experimental.

By default, callMutationBurden is excluding everything that is annotated as DB (dbSNP) and not rescued by COSMIC (Cosmic.CNT info flag). You can create your own VCF info flag that summarizes all the databases you want and then make PureCN.R use this over DB (--dbinfoflag).

billnjcn111 · 2018-02-26T19:16:34Z

So the key is the PON VCF. The problem for me is that there is not enough normal VCFs available there. So based on your experience, how many normal vcfs you tried with best results? thx

…

On Mon, Feb 26, 2018 at 11:31 AM, M. Riester ***@***.***> wrote: Sure, the callMutationBurden function is written for tumor-only and should get you accurate mutations/megabase if you follow the instructions. If you use the recommended Mutect 1.1.7, GATK3 CallableLoci and provide a PON VCF, then you get a fairly well tested pipeline. If not, then it's still largely up to you to do a proper artifact filtering (extremely important!) and testing. The default (available for all VCFs) --error argument will fairly aggressively remove reads with limited support which should remove most sequencing errors. There is also basic Mutect2 filtering as you know, but it's currently experimental. By default, callMutationBurden is excluding everything that is annotated as DB (dbSNP) and not rescued by COSMIC (Cosmic.CNT info flag). You can create your own VCF info flag that summarizes all the databases you want and then make PureCN.R use this over DB (--dbinfoflag). — You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub <#20 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/AXw6DPwiJR3alsNPCHZCZgj2HWibIiYpks5tYtxfgaJpZM4SSWvK> .

lima1 · 2018-02-26T20:26:16Z

With 16 you should be fine. Filter variants in simple repeats as done in the vignettes and restrict the mutation burden calculation to coding sequences only (there is a currently undocumented script FilterCallableLoci.R in inst/extdata which takes a BED file as input, usually the PASS regions from CallableLoci, and only keeps overlapping CDS). This will ignore most of the difficult regions anyways.

billnjcn111 · 2018-02-26T22:28:52Z

Thanks!

…

On Mon, Feb 26, 2018 at 3:26 PM, M. Riester ***@***.***> wrote: With 16 you should be fine. Filter variants in simple repeats as done in the vignettes and restrict the mutation burden calculation to coding sequences only (there is a currently undocumented script FilterCallableLoci.R in inst/extdata which takes a BED file as input, usually the PASS regions from CallableLoci, and only keeps overlapping CDS). This will ignore most of the difficult regions anyways. — You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub <#20 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/AXw6DMqnGCJyyfjxOYwOTM8bdFZ8hN4jks5tYxNpgaJpZM4SSWvK> .

billnjcn111 · 2018-03-01T14:55:55Z

Riester, The recommended Mutect 1.1.7 does not call indels? right?

…

On Mon, Feb 26, 2018 at 11:31 AM, M. Riester ***@***.***> wrote: Sure, the callMutationBurden function is written for tumor-only and should get you accurate mutations/megabase if you follow the instructions. If you use the recommended Mutect 1.1.7, GATK3 CallableLoci and provide a PON VCF, then you get a fairly well tested pipeline. If not, then it's still largely up to you to do a proper artifact filtering (extremely important!) and testing. The default (available for all VCFs) --error argument will fairly aggressively remove reads with limited support which should remove most sequencing errors. There is also basic Mutect2 filtering as you know, but it's currently experimental. By default, callMutationBurden is excluding everything that is annotated as DB (dbSNP) and not rescued by COSMIC (Cosmic.CNT info flag). You can create your own VCF info flag that summarizes all the databases you want and then make PureCN.R use this over DB (--dbinfoflag). — You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub <#20 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/AXw6DPwiJR3alsNPCHZCZgj2HWibIiYpks5tYtxfgaJpZM4SSWvK> .

lima1 · 2018-03-01T18:39:06Z

Correct. But it's easy and very fast to run. So I would use it in your benchmarks to compare against M2 if you really need indels. If you see a shift in TMB, you know that the filtering needs work. Feel free to open an issue whenever you see problems. Also note that the default in callMutationBurden ignores indels.

Looks like there is not yet a CallableLoci replacement in GATK4 ( https://gatkforums.broadinstitute.org/gatk/discussion/11178/callableloci-replacement-in-gatk4). I would definitely recommend running GATK3 CallableLoci with --minDepth 15 or whatever you use as min coverage in Mutect.

billnjcn111 · 2018-03-01T21:51:25Z

There will be a slight shift since the indwells are missing, right?

…

On Thu, Mar 1, 2018 at 1:39 PM, M. Riester ***@***.***> wrote: Correct. But it's easy and very fast to run. So I would use it in your benchmarks to compare against M2 if you really need indels. If you see a shift in TMB, you know that the filtering needs work. Feel free to open an issue whenever you see problems. Also note that the default in callMutationBurden ignores indels. Looks like there is not yet a CallableLoci replacement in GATK4 ( https://gatkforums.broadinstitute.org/gatk/discussion/11178/callableloci- replacement-in-gatk4). I would definitely recommend running GATK3 CallableLoci with --minDepth 15 or whatever you use as min coverage in Mutect. — You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub <#20 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/AXw6DIKdBvEUeu512GNen_zEJJAXv2c5ks5taEBKgaJpZM4SSWvK> .

…ed; Made it possible in Dx.R to count indels in TMB calculation (#20)

lima1 · 2018-03-03T02:50:32Z

Yes, but like I said, by default, indels are not counted. I added a flag to the Dx.R script to count them (--keepindels).

billnjcn111 · 2018-03-08T15:14:03Z

There appears to be a new fatal error:

…

On Fri, Mar 2, 2018 at 9:50 PM, M. Riester ***@***.***> wrote: Yes, but like I said, by default, indels are not counted. I added a flag to the Dx.R script to count them (--keepindels). — You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub <#20 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/AXw6DPS2ZYZ9GT7-YeIEWH5GPf4u9k3Tks5tagT4gaJpZM4SSWvK> .

billnjcn111 · 2018-03-08T15:14:53Z

All 46 samples are the same: ------------------------------------------------------------ INFO [2018-03-08 09:44:42] PureCN 1.9.28 INFO [2018-03-08 09:44:42] ------------------------------------------------------------ ..... Cannot find valid purity/ploidy solution. This happens when input FATAL [2018-03-08 09:51:23] segmentations are garbage, most likely due to a catastrophic sample QC FATAL [2018-03-08 09:51:23] failure. Re-check standard QC metrics for this sample. FATAL [2018-03-08 09:51:23] FATAL [2018-03-08 09:51:23] This is most likely a user error due to invalid input data or FATAL [2018-03-08 09:51:23] parameters (PureCN 1.9.28).

…

On Thu, Mar 8, 2018 at 10:13 AM, billnjcn ***@***.***> wrote: There appears to be a new fatal error: On Fri, Mar 2, 2018 at 9:50 PM, M. Riester ***@***.***> wrote: > Yes, but like I said, by default, indels are not counted. I added a flag > to the Dx.R script to count them (--keepindels). > > — > You are receiving this because you authored the thread. > Reply to this email directly, view it on GitHub > <#20 (comment)>, or mute > the thread > <https://github.com/notifications/unsubscribe-auth/AXw6DPS2ZYZ9GT7-YeIEWH5GPf4u9k3Tks5tagT4gaJpZM4SSWvK> > . >

lima1 · 2018-03-08T16:09:06Z

I would need again the whole output of the log file and the command lines used to generate the input.

When all samples show this, then there is likely an issue with the setup.

billnjcn111 · 2018-03-08T16:16:39Z

command line argument: INFO [2018-03-08 09:13:41] Arguments: -tumor.coverage.file 71-D_recal_coverage_loess.txt -seg.file -vcf.file 71-D_filtered_dbsnp.vcf.gz -genome hg19 -sex ? -args.setMappingBiasVcf NULL -args.segmentation target_weights_SS_v6_hg19.txt,0.005,NULL -sampleid 71-D -min.ploidy 1 -max.ploidy 6 -max.non.clonal 0.2 -log.ratio.calibration 0.1 -model.homozygous FALSE -error 0.001 -interval.file baits_hg19_V6_intervals.txt -gc.gene.file -max.segments 300 -plot.cnv TRUE -DB.info.flag DB -model beta -post.optimize FALSE -log.file 71-D.log -normal.coverage.file <data> -normalDB <data> -args.filterVcf <data> -fun.segmentation <data> -test.num.copy <data> -test.purity <data> -speedup.heuristics <data>

…

On Thu, Mar 8, 2018 at 11:09 AM, M. Riester ***@***.***> wrote: I would need again the whole output of the log file and the command lines used to generate the input. When all samples show this, then there is likely an issue with the setup. — You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub <#20 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/AXw6DAoMm_0Y1Uv13JNAoKrKKIyjhovrks5tcVeigaJpZM4SSWvK> .

billnjcn111 · 2018-03-08T16:19:06Z

INFO [2018-03-08 09:13:41] Loading coverage files... INFO [2018-03-08 09:13:48] Mean target coverages: 88X (tumor) 76X (normal). INFO [2018-03-08 09:13:50] Mean coverages: chrX: 72.95, chrY: 1.41, chr1-22: 82.92. INFO [2018-03-08 09:13:50] Mean coverages: chrX: 82.62, chrY: 0.64, chr1-22: 71.27. INFO [2018-03-08 09:14:16] Removing 35593 intervals with missing log.ratio. INFO [2018-03-08 09:14:16] Removing 9 low/high GC targets. INFO [2018-03-08 09:14:21] Removing 3762 targets excluded in normalDB. INFO [2018-03-08 09:14:21] Removing 265 targets with low total coverage in normal (< 150.00 reads). INFO [2018-03-08 09:14:21] normalDB provided. Setting minimum coverage for segmentation to 0.0015X. INFO [2018-03-08 09:14:21] Removing 9 low coverage (< 0.0015X) targets. INFO [2018-03-08 09:14:22] Using 247792 intervals (247792 on-target, 0 off-target). INFO [2018-03-08 09:14:22] No off-target intervals. If this is hybrid-capture data, consider adding them. INFO [2018-03-08 09:14:23] AT/GC dropout: 1.07 (tumor), 1.08 (normal). INFO [2018-03-08 09:14:23] Loading VCF... INFO [2018-03-08 09:14:23] Found 12980 variants in VCF file. INFO [2018-03-08 09:14:24] Removing 428 triallelic sites. WARN [2018-03-08 09:14:24] vcf.file has no DB info field for dbSNP membership. Guessing it based on ID. INFO [2018-03-08 09:14:24] 7504 (59.8%) variants annotated as likely germline (DB INFO flag). INFO [2018-03-08 09:14:24] 71-D is tumor in VCF file. INFO [2018-03-08 09:14:24] 39 homozygous and 104 heterozygous variants on chrX. INFO [2018-03-08 09:14:24] Sex from VCF: F (Fisher's p-value: 0.924, odds-ratio: 1.03). INFO [2018-03-08 09:14:24] Detected MuTect2 VCF. INFO [2018-03-08 09:14:24] Removing 6392 MuTect2 calls due to blacklisted failure reasons. INFO [2018-03-08 09:14:25] Initial testing for significant sample cross-contamination: maybe INFO [2018-03-08 09:14:25] Removing 2233 variants with AF < 0.030 or AF >= 0.970 or less than 4 supporting reads or depth < 15. INFO [2018-03-08 09:14:25] Removing 15 low quality variants with BQ < 25. INFO [2018-03-08 09:14:25] Total size of targeted genomic region: 55.83Mb (75.03Mb with 50bp padding). INFO [2018-03-08 09:14:25] 1.4% of targets contain variants. INFO [2018-03-08 09:14:25] Removing 403 variants outside intervals. INFO [2018-03-08 09:14:25] Setting somatic prior probabilities for dbSNP hits to 0.000500 or to 0.500000 otherwise. INFO [2018-03-08 09:14:25] VCF does not contain somatic status. For best results, consider providing normal.panel.vcf.file when matched normals are not available. INFO [2018-03-08 09:14:25] Sample sex: F INFO [2018-03-08 09:14:25] Segmenting data... INFO [2018-03-08 09:14:26] Target weights found, will use weighted CBS. INFO [2018-03-08 09:14:26] Loading pre-computed boundaries for DNAcopy... INFO [2018-03-08 09:14:26] Setting undo.SD parameter to 1.250000. INFO [2018-03-08 09:14:39] Setting prune.hclust.h parameter to 0.150000. INFO [2018-03-08 09:14:39] Found 99 segments with median size of 1.64Mb. INFO [2018-03-08 09:14:39] Removing 2 variants outside segments. INFO [2018-03-08 09:14:39] Using 3507 variants. INFO [2018-03-08 09:14:40] Mean standard deviation of log-ratios: 1.13 INFO [2018-03-08 09:14:40] 2D-grid search of purity and ploidy... FATAL [2018-03-08 09:19:40] Cannot find valid purity/ploidy solution. This happens when input FATAL [2018-03-08 09:19:40] segmentations are garbage, most likely due to a catastrophic sample QC FATAL [2018-03-08 09:19:40] failure. Re-check standard QC metrics for this sample. FATAL [2018-03-08 09:19:40] FATAL [2018-03-08 09:19:40] This is most likely a user error due to invalid input data or FATAL [2018-03-08 09:19:40] parameters (PureCN 1.9.28).

…

On Thu, Mar 8, 2018 at 11:16 AM, billnjcn ***@***.***> wrote: command line argument: INFO [2018-03-08 09:13:41] Arguments: -tumor.coverage.file 71-D_recal_coverage_loess.txt -seg.file -vcf.file 71-D_filtered_dbsnp.vcf.gz -genome hg19 -sex ? -args.setMappingBiasVcf NULL -args.segmentation target_weights_SS_v6_hg19.txt,0.005,NULL -sampleid 71-D -min.ploidy 1 -max.ploidy 6 -max.non.clonal 0.2 -log.ratio.calibration 0.1 -model.homozygous FALSE -error 0.001 -interval.file baits_hg19_V6_intervals.txt -gc.gene.file -max.segments 300 -plot.cnv TRUE -DB.info.flag DB -model beta -post.optimize FALSE -log.file 71-D.log -normal.coverage.file <data> -normalDB <data> -args.filterVcf <data> -fun.segmentation <data> -test.num.copy <data> -test.purity <data> -speedup.heuristics <data> On Thu, Mar 8, 2018 at 11:09 AM, M. Riester ***@***.***> wrote: > I would need again the whole output of the log file and the command lines > used to generate the input. > > When all samples show this, then there is likely an issue with the setup. > > — > You are receiving this because you authored the thread. > Reply to this email directly, view it on GitHub > <#20 (comment)>, or mute > the thread > <https://github.com/notifications/unsubscribe-auth/AXw6DAoMm_0Y1Uv13JNAoKrKKIyjhovrks5tcVeigaJpZM4SSWvK> > . >

lima1 · 2018-03-08T18:49:10Z

Your log-ratios are extremely noisy (1.13), good data is below 0.4 with coverage that low.

Are you sure you are using the correct baits file? You have almost 40k intervals without any coverage. Did NormalDB.R give you a warning that you used the wrong baits file? If then, try to re-run IntervalFile.R and provide the low_coverage_targets.bed file via --exclude. You have a large GC bias, but could be because of the wrong baits file.

You also have only a small fraction of intervals overlapping with variants (1.4%). This should be around 10%. Make sure to run Mutect with the same interval file and add 50-75bp padding (this will at least double the number of SNPs).

And then, make sure to generate a PON VCF for a production setting.

lima1 · 2018-03-08T18:55:43Z

Also, is there a reason you dropped --offtarget?

billnjcn111 · 2018-03-08T19:09:54Z

Thanks, the bait file may be the exact reason why it failed!

…

On Thu, Mar 8, 2018 at 1:55 PM, M. Riester ***@***.***> wrote: Also, is there a reason you dropped --offtarget? — You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub <#20 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/AXw6DHhF8d27cUYzvasaFoeiz__jeKx7ks5tcX6vgaJpZM4SSWvK> .

lima1 added a commit that referenced this issue Feb 26, 2018

Check that provided normal_panel does not contain only a single sample (

053b5a0

#20); bgzip'ed example VCFs; provide support for deconstructSigs in Dx.R

lima1 added a commit that referenced this issue Mar 3, 2018

Fixed bug that samples with contamination less than 1% were not flagg…

c96d241

…ed; Made it possible in Dx.R to count indels in TMB calculation (#20)

lima1 added a commit that referenced this issue Mar 20, 2018

Fix for crash in callLOH without known centromeres (#20).

e5c17d4

lima1 closed this as completed Mar 20, 2018

Shenglai mentioned this issue Oct 9, 2019

PureCN error #107

Closed

Error using PureCN.R #20

Error using PureCN.R #20

Comments

billnjcn111 commented Feb 25, 2018

lima1 commented Feb 25, 2018

billnjcn111 commented Feb 25, 2018 via email

lima1 commented Feb 25, 2018

billnjcn111 commented Feb 25, 2018 via email

billnjcn111 commented Feb 25, 2018 via email

lima1 commented Feb 25, 2018

billnjcn111 commented Feb 25, 2018 via email

billnjcn111 commented Feb 25, 2018 via email

billnjcn111 commented Feb 25, 2018 via email

billnjcn111 commented Feb 25, 2018 via email

lima1 commented Feb 25, 2018

billnjcn111 commented Feb 25, 2018 via email

lima1 commented Feb 25, 2018

lima1 commented Feb 25, 2018

billnjcn111 commented Feb 25, 2018 via email

lima1 commented Feb 25, 2018

billnjcn111 commented Feb 25, 2018 via email

billnjcn111 commented Feb 25, 2018 via email

lima1 commented Feb 25, 2018

billnjcn111 commented Feb 25, 2018 via email

billnjcn111 commented Feb 25, 2018 via email

lima1 commented Feb 25, 2018

billnjcn111 commented Feb 25, 2018 via email

lima1 commented Feb 25, 2018

lima1 commented Feb 25, 2018

billnjcn111 commented Feb 25, 2018 via email

lima1 commented Feb 25, 2018

billnjcn111 commented Feb 25, 2018 via email

billnjcn111 commented Feb 25, 2018 via email

lima1 commented Feb 25, 2018

billnjcn111 commented Feb 25, 2018 via email

lima1 commented Feb 25, 2018

billnjcn111 commented Feb 26, 2018 via email

lima1 commented Feb 26, 2018

billnjcn111 commented Feb 26, 2018 via email

lima1 commented Feb 26, 2018

billnjcn111 commented Feb 26, 2018 via email

billnjcn111 commented Mar 1, 2018 via email

lima1 commented Mar 1, 2018

billnjcn111 commented Mar 1, 2018 via email

lima1 commented Mar 3, 2018

billnjcn111 commented Mar 8, 2018 via email

billnjcn111 commented Mar 8, 2018 via email

lima1 commented Mar 8, 2018

billnjcn111 commented Mar 8, 2018 via email

billnjcn111 commented Mar 8, 2018 via email

lima1 commented Mar 8, 2018

lima1 commented Mar 8, 2018

billnjcn111 commented Mar 8, 2018 via email