Error using PureCN.R #20
Not sure where this happens. Can you by any chance share example input files and reproducible code by email? If not, can you post your PureCN.R command line if you use the command line interface or, if you use runAbsoluteCN, the complete list of arguments? |
Do you have a place where I can share all the input files? A few lines of an input file may not reproduce the issue.
Thanks a lot
|
Can you share via Dropbox? Does this error happen in a very minimal test run with just the tumor coverage file, a single normal coverage file, the VCF and the interval file? Or do you use CNVkit as input? Then it should be small enough for email (markus.riester at gmail com) |
OK, let me test it using the minimal options above and report back to you. I don't use CNVkit; I used a Mutect2 VCF.
Hold on...
|
Well, it seems that with those minimal options there is at least no such error. Now it is doing the grid search, which takes time.
Initially I had a few more options: the mapping bias file (from Mutect2), the target weight file, and --outvcf --postoptimize --seed 123.
Which one caused the trouble? Here are a few lines of my mapping bias RDS:
GRanges object with 301373 ranges and 2 metadata columns:
seqnames ranges strand | bias
<Rle> <IRanges> <Rle> | <numeric>
1:13289_CCT/C 1 [13289, 13291] * | NaN
1:13494_A/G 1 [13494, 13494] * | NaN
1:14907_A/G 1 [14907, 14907] * | NaN
1:14930_A/G 1 [14930, 14930] * | NaN
1:14933_G/A 1 [14933, 14933] * | NaN
|
Great, that helps a lot. So the original example without the mapping bias RDS should work, right? If you can share a truncated version of your normal panel VCF (header + those 5 lines), the one that was used to generate the RDS, I'll check why this does not work with M2 normals. |
##tumor_sample=TCGA-DQ-5629-10A-01D-1870-08
#CHROM POS ID REF ALT QUAL FILTER INFO
1 13289 . CCT C . . .
1 13494 . A G . . .
1 14907 . A G . . .
1 14930 . A G . . .
1 14933 . G A . . .
1 14948 . G A . . .
1 14976 . G A . . .
|
The header is very long; here are the initial lines:
##fileformat=VCFv4.2
##FORMAT=<ID=AD,Number=R,Type=Integer,Description="Allelic depths for the ref and alt alleles in the order listed">
##FORMAT=<ID=AF,Number=A,Type=Float,Description="Allele fractions of alternate alleles in the tumor">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth (reads with MQ=255 or with bad mates are filtered)">
##FORMAT=<ID=F1R2,Number=R,Type=Integer,Description="Count of reads in F1R2 pair orientation supporting each allele">
##FORMAT=<ID=F2R1,Number=R,Type=Integer,Description="Count of reads in F2R1 pair orientation supporting each allele">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=MBQ,Number=A,Type=Float,Description="median base quality">
##FORMAT=<ID=MFRL,Number=R,Type=Float,Description="median fragment length">
##FORMAT=<ID=MMQ,Number=A,Type=Float,Description="median mapping quality">
##FORMAT=<ID=MPOS,Number=A,Type=Float,Description="median distance from end of read">
##FORMAT=<ID=PGT,Number=1,Type=String,Description="Physical phasing haplotype information, describing how the alternate alleles are phased in relation to one another">
##FORMAT=<ID=PID,Number=1,Type=String,Description="Physical phasing ID information, where each unique ID within a given sample (but not across samples) connects records within a phasing group">
##FORMAT=<ID=PL,Number=G,Type=Integer,Description="Normalized, Phred-scaled likelihoods for genotypes as defined in the VCF specification">
##FORMAT=<ID=SA_MAP_AF,Number=3,Type=Float,Description="MAP estimates of allele fraction given z">
##FORMAT=<ID=SA_POST_PROB,Number=3,Type=Float,Description="posterior probabilities of the presence of strand artifact">
|
It does not have BQ; it uses MBQ instead. I am not sure how you filter the low-quality variants, but that is a separate, minor thing.
Thanks
|
You are right; as I said, it got past the place where that error occurred. I am using Rscript on the command line.
It is doing the grid search without the mapping bias file.
|
Would you mind sharing the truncated VCF (again header + first couple of variant rows) as attachment here or by email? So that I can reproduce? |
Also, do you think adding the mapping bias will raise the accuracy of the prediction?
I am mainly trying to find a method to compare to FM's SGZ algorithm.
Thanks
|
Ah, got it. The VCF does not contain read counts. |
I'm not sure what's the best way of generating the normal panel VCF in GATK4, but it needs to contain all read counts. There is apparently an undocumented MergeVcfs (https://gatkforums.broadinstitute.org/gatk/discussion/10328/combinevariants-in-gatk4). Yes, it helps quite a bit, especially in flagging variants from regions with high mapping bias where the allele frequencies are much lower than expected. Without fixing this, you'll get a lot of false somatic calls. |
Great, is there an easy fix, or can I just leave it? Also, as I asked, did you see an improvement in prediction accuracy from adding that feature?
thx
|
So just to make this clear: You'll need to run Mutect or Mutect2 on the normals in tumor-only mode. Get as many normals as possible. Then merge the normal VCFs into a single VCF that contains REF and ALT read counts in an AD format field. CombineVariants did this in GATK3. For GATK4, let me know how it goes with MergeVcfs. |
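For readers following along, here is a minimal Python sketch (independent of PureCN) of the check implied above: a merged normal panel VCF is only usable for mapping bias if it declares the AD FORMAT field and actually carries FORMAT and sample columns. The function name `vcf_has_read_counts` and the sample names `normal1`/`normal2` are hypothetical, and the sketch only inspects the header, not every record.

```python
def vcf_has_read_counts(vcf_lines):
    """Return True if the header declares AD and the VCF has sample columns.

    Illustrative check only; it does not validate individual records.
    """
    declares_ad = False
    has_samples = False
    for line in vcf_lines:
        if line.startswith("##FORMAT=<ID=AD,"):
            declares_ad = True
        elif line.startswith("#CHROM"):
            columns = line.rstrip("\n").split("\t")
            # Columns 1-8 are fixed; FORMAT is column 9 and samples follow it.
            has_samples = len(columns) > 9 and columns[8] == "FORMAT"
            break
    return declares_ad and has_samples


# A sites-only VCF like the one posted earlier: AD is declared, but no sample columns.
sites_only = [
    "##fileformat=VCFv4.2\n",
    '##FORMAT=<ID=AD,Number=R,Type=Integer,Description="Allelic depths">\n',
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n",
    "1\t13289\t.\tCCT\tC\t.\t.\t.\n",
]

# A merged VCF with per-sample genotypes and read counts (made-up values).
with_samples = [
    "##fileformat=VCFv4.2\n",
    '##FORMAT=<ID=AD,Number=R,Type=Integer,Description="Allelic depths">\n',
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tnormal1\tnormal2\n",
    "1\t13494\t.\tA\tG\t.\t.\t.\tGT:AD\t0/1:20,18\t0/0:35,0\n",
]

print(vcf_has_read_counts(sites_only))    # False
print(vcf_has_read_counts(with_samples))  # True
```

Running a check like this on the merged file before building the mapping bias RDS would catch a sites-only merge early.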
thx
|
Another error using the minimal options:
...
INFO [2018-02-25 12:57:24] Optimized contamination rate: 0.071
INFO [2018-02-25 12:57:24] Done.
INFO [2018-02-25 12:57:24]
------------------------------------------------------------
Warning message:
In .bcfHeaderAsSimpleList(header) :
duplicate keys in header will be forced to unique rownames
null device
1
null device
1
Error in (function (classes, fdef, mtable) :
unable to find an inherited method for function ‘seqnames’ for signature
‘"NULL"’
Calls: write.csv ... .getArmLocations -> match -> seqnames -> <Anonymous>
Execution halted
|
Looks like it finished the main steps and generated an output RDS file. Can you send all the successful output by mail? |
.log, _segmentation.pdf, .rds, .csv,
.pdf, _local_optima.pdf, _dnacopy.seg, _genes.csv, _variants.csv
|
Do you need the files? The RDS is too big.
Let me see if I can use Dropbox.
|
Can you try in R:
library(PureCN)
x <- readCurationFile("Sampleid.rds", report.best.only=TRUE)
loh <- callLOH(x); # this probably crashes
saveRDS(x, file="Sampleid_bestonly.rds")
This should dramatically reduce the file size of the RDS (I don't need the others), enough to share by mail. |
Yes, you are right, it is the LOH step that caused the crash, and the report.best.only result is still 100 MB.
|
If you can't share by Dropbox, can you help me understand where the crash happens:
x <- readCurationFile("Sampleid.rds", report.best.only=TRUE)
debug(PureCN:::.getArmLocations)
PureCN:::.getArmLocations(x)
Keep pressing ENTER until it crashes, but please after the line "centromeres <- .getCentromeres(res)", enter "centromeres" so that I can see if there was an issue getting the centromere locations. |
What --genome did you specify? My guess is not one of "hg19" or "hg38", right? It shouldn't crash, obviously, but with an unknown genome version, it currently does not know the centromere positions. |
Browse[2]>
debug: chromCoords <- chromCoords[as.integer(match(seqnames(centromeres), rownames(chromCoords))), ]
Browse[2]>
Error in (function (classes, fdef, mtable) :
unable to find an inherited method for function ‘seqnames’ for signature ‘"NULL"’
|
Try re-running with --genome hg19 or --genome hg38 (NOT GRCh38) if you used something else for --genome. I'll add a check for that to give a more helpful error message. |
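The kind of up-front check mentioned above can be sketched in a few lines. This is an illustrative Python sketch, not PureCN's code: the function name `check_genome` is hypothetical, and the supported values are only the two named in this thread.

```python
# Genome versions named in this thread as having known centromere positions.
SUPPORTED_GENOMES = ("hg19", "hg38")


def check_genome(genome):
    """Fail early with a readable message instead of crashing later on NULL centromeres."""
    if genome not in SUPPORTED_GENOMES:
        raise ValueError(
            "Unsupported --genome value '{}'; expected one of: {}".format(
                genome, ", ".join(SUPPORTED_GENOMES)
            )
        )
    return genome


print(check_genome("hg19"))
try:
    check_genome("b37")
except ValueError as err:
    print(err)
```

Validating the argument before the long grid search means the user sees the problem in seconds rather than after the main steps have finished.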
I used b37 because all other files are b37
|
I can make another file by stripping "chr" from your hg19 centromere data. Which file should I change?
|
Should be safe to use "hg19". It is used in IntervalFile.R to annotate gene symbols and in PureCN.R to map the centromere positions. At some point I'll add genome aliases so that b37 and GRCh37 work. Otherwise, if you want to keep b37, you need to manually change PureCN.R and add data(centromeres) and then |
Do you mean the combined VCF has one sample, or does it need distinct sample names?
I only have 16 normal samples. There are two ways to combine them: one is to keep all sample names unique, so that all 16 samples appear in each row of the VCF; the other is to ignore sample names, treat them as the same sample, and just combine all the variants.
Which one does it need?
|
All unique; we need the read counts from all 16 samples. You can limit the VCF to contain only variants present in at least 2-3 samples. In GATK3 CombineVariants, that's the --minN argument. What the mapping bias function does is collect the alt and ref read counts from the heterozygous samples only. The summed alt and ref counts should be roughly equal; if not, PureCN will either adjust accordingly or ignore the variant if there is a large difference. |
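To make the idea above concrete, here is a simplified Python sketch of pooling read counts across heterozygous normals. It is an illustration under stated assumptions, not PureCN's actual computation: the function name `pooled_mapping_bias`, the allelic-fraction thresholds used to call a sample heterozygous, and the definition of bias as twice the pooled alt fraction are all assumptions made for this example.

```python
import math


def pooled_mapping_bias(sample_counts, min_af=0.1, max_af=0.9):
    """Pool (ref_reads, alt_reads) over heterozygous samples for one variant.

    Returns 1.0 when alt and ref reads are equally represented, a value
    below 1.0 when alt reads are under-represented, and NaN when no
    heterozygous sample with read counts is available.
    """
    ref_sum = 0
    alt_sum = 0
    for ref, alt in sample_counts:
        depth = ref + alt
        if depth == 0:
            continue
        allelic_fraction = alt / depth
        # Assumption: an intermediate allelic fraction marks a heterozygous sample.
        if min_af < allelic_fraction < max_af:
            ref_sum += ref
            alt_sum += alt
    if ref_sum + alt_sum == 0:
        return math.nan
    return 2 * alt_sum / (ref_sum + alt_sum)


# Three normals at one site (made-up counts): two heterozygous, one homozygous reference.
bias = pooled_mapping_bias([(20, 18), (35, 0), (25, 15)])
print(round(bias, 3))

# No usable read counts at all, as with a sites-only VCF: the result is NaN.
print(pooled_mapping_bias([]))
```

Under these assumptions, a panel VCF without read counts yields NaN for every variant, which would be consistent with the NaN bias column shown earlier in the thread.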
#20); bgzip'ed example VCFs; provide support for deconstructSigs in Dx.R
Riester,
When you compared your PureCN TMB with the FMI TMB, were those tumor-only samples? Also, did you filter the Mutect results using all the Mutect filters, such as t_lod, str_contraction, etc.?
Did you also remove all common germline variants found in the public databases, such as 1000G, ExAC, gnomAD, etc., before or after using PureCN?
I am looking for an algorithm that works on tumor-only samples to get FMI-like TMB values.
Do you have any suggestions for this comparison?
Thanks
|
Sure, the callMutationBurden function is written for tumor-only and should get you accurate mutations/megabase if you follow the instructions. If you use the recommended Mutect 1.1.7, GATK3 CallableLoci and provide a PON VCF, then you get a fairly well tested pipeline.
If not, then it's still largely up to you to do a proper artifact filtering (extremely important!) and testing. The default (available for all VCFs) --error argument will fairly aggressively remove reads with limited support which should remove most sequencing errors. There is also basic Mutect2 filtering as you know, but it's currently experimental.
By default, callMutationBurden is excluding everything that is annotated as DB (dbSNP) and not rescued by COSMIC (Cosmic.CNT info flag). You can create your own VCF info flag that summarizes all the databases you want and then make PureCN.R use this over DB (--dbinfoflag). |
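The exclusion rule described above (drop variants flagged as known germline unless COSMIC rescues them) can be written as a small predicate. This Python sketch is only an illustration of that logic: the function name `counts_toward_burden`, the `min_cosmic_cnt` threshold, and the custom flag name `MY_GERMLINE_DB` are hypothetical and do not come from PureCN.

```python
def counts_toward_burden(info, db_flag="DB", min_cosmic_cnt=1):
    """Decide whether a variant is kept, given its parsed INFO fields as a dict.

    A variant is excluded only if it carries the germline database flag
    and is not rescued by a COSMIC count (threshold is an assumption).
    """
    in_db = bool(info.get(db_flag, False))
    rescued = info.get("Cosmic.CNT", 0) >= min_cosmic_cnt
    return (not in_db) or rescued


print(counts_toward_burden({}))                             # True: not in any database
print(counts_toward_burden({"DB": True}))                   # False: germline flag, no rescue
print(counts_toward_burden({"DB": True, "Cosmic.CNT": 4}))  # True: rescued by COSMIC

# Using a custom flag that summarizes several germline databases instead of DB.
print(counts_toward_burden({"MY_GERMLINE_DB": True}, db_flag="MY_GERMLINE_DB"))  # False
```

The last call mirrors the idea of pointing the pipeline at a user-defined info flag rather than the default DB annotation.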
So the key is the PON VCF. The problem for me is that there are not enough normal VCFs available. Based on your experience, how many normal VCFs gave you the best results?
thx
|
With 16 you should be fine. Filter variants in simple repeats as done in the vignettes and restrict the mutation burden calculation to coding sequences only (there is a currently undocumented script FilterCallableLoci.R in inst/extdata which takes a BED file as input, usually the PASS regions from CallableLoci, and only keeps overlapping CDS). This will ignore most of the difficult regions anyways. |
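As a rough picture of restricting callable regions to coding sequence, the Python sketch below intersects two BED-style interval lists and sums the remaining bases. It is not the FilterCallableLoci.R script: treating the result as an interval intersection is an assumption, the function names are hypothetical, and the coordinates are made up. It also assumes the CDS intervals do not overlap each other, otherwise bases would be counted twice.

```python
def intersect_intervals(callable_regions, cds_regions):
    """Intersect two lists of (chrom, start, end) half-open intervals.

    Simple quadratic scan; fine for a sketch, too slow for whole exomes.
    """
    result = []
    for chrom, start, end in callable_regions:
        for cds_chrom, cds_start, cds_end in cds_regions:
            if chrom != cds_chrom:
                continue
            lo = max(start, cds_start)
            hi = min(end, cds_end)
            if lo < hi:  # keep only non-empty overlaps
                result.append((chrom, lo, hi))
    return sorted(result)


def total_bases(intervals):
    """Sum of interval lengths in bases."""
    return sum(end - start for _, start, end in intervals)


callable_regions = [("1", 100, 200), ("1", 300, 400), ("2", 50, 80)]
cds_regions = [("1", 150, 350)]

callable_cds = intersect_intervals(callable_regions, cds_regions)
print(callable_cds)               # [('1', 150, 200), ('1', 300, 350)]
print(total_bases(callable_cds))  # 100
```

The summed length of the filtered intervals is what would serve as the denominator when converting a mutation count into mutations per megabase.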
Thanks!
|
Riester,
The recommended Mutect 1.1.7 does not call indels, right?
|
Correct. But it's easy and very fast to run. So I would use it in your benchmarks to compare against M2 if you really need indels. If you see a shift in TMB, you know that the filtering needs work. Feel free to open an issue whenever you see problems. Also note that the default in callMutationBurden ignores indels.
Looks like there is not yet a CallableLoci replacement in GATK4 (https://gatkforums.broadinstitute.org/gatk/discussion/11178/callableloci-replacement-in-gatk4). I would definitely recommend running GATK3 CallableLoci with --minDepth 15 or whatever you use as min coverage in Mutect. |
There will be a slight shift since the indels are missing, right?
|
…ed; Made it possible in Dx.R to count indels in TMB calculation (#20)
Yes, but like I said, by default, indels are not counted. I added a flag to the Dx.R script to count them (--keepindels). |
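The effect of counting or ignoring indels on a per-megabase rate is easy to see in a toy calculation. This Python sketch is illustrative only: the function name `mutations_per_megabase` and its `keep_indels` parameter are hypothetical stand-ins for the behavior described above, indels are identified simply by REF and ALT having different lengths, and the variants and callable size are made up.

```python
def mutations_per_megabase(variants, callable_bases, keep_indels=False):
    """Count (ref, alt) variants per megabase of callable sequence.

    Indels are skipped unless keep_indels is True, mirroring the default
    described in this thread.
    """
    counted = 0
    for ref, alt in variants:
        is_indel = len(ref) != len(alt)
        if is_indel and not keep_indels:
            continue
        counted += 1
    return counted * 1e6 / callable_bases


variants = [("A", "G"), ("CCT", "C"), ("G", "A")]
callable_bases = 1500000

print(mutations_per_megabase(variants, callable_bases))
print(mutations_per_megabase(variants, callable_bases, keep_indels=True))
```

With one indel among three variants, the rate with indels is 1.5 times the rate without, so any comparison against another TMB assay should use the same convention on both sides.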
There appears to be a new fatal error:
|
All 46 samples are the same:
------------------------------------------------------------
INFO [2018-03-08 09:44:42] PureCN 1.9.28
INFO [2018-03-08 09:44:42] ------------------------------------------------------------
.....
Cannot find valid purity/ploidy solution. This happens when input
FATAL [2018-03-08 09:51:23] segmentations are garbage, most likely due to a catastrophic sample QC
FATAL [2018-03-08 09:51:23] failure. Re-check standard QC metrics for this sample.
FATAL [2018-03-08 09:51:23]
FATAL [2018-03-08 09:51:23] This is most likely a user error due to invalid input data or
FATAL [2018-03-08 09:51:23] parameters (PureCN 1.9.28).
|
Again, I would need the whole log output and the command lines used to generate the input. When all samples show this, there is likely an issue with the setup. |
command line argument:
INFO [2018-03-08 09:13:41] Arguments: -tumor.coverage.file 71-D_recal_coverage_loess.txt -seg.file -vcf.file 71-D_filtered_dbsnp.vcf.gz -genome hg19 -sex ? -args.setMappingBiasVcf NULL -args.segmentation target_weights_SS_v6_hg19.txt,0.005,NULL -sampleid 71-D -min.ploidy 1 -max.ploidy 6 -max.non.clonal 0.2 -log.ratio.calibration 0.1 -model.homozygous FALSE -error 0.001 -interval.file baits_hg19_V6_intervals.txt -gc.gene.file -max.segments 300 -plot.cnv TRUE -DB.info.flag DB -model beta -post.optimize FALSE -log.file 71-D.log -normal.coverage.file <data> -normalDB <data> -args.filterVcf <data> -fun.segmentation <data> -test.num.copy <data> -test.purity <data> -speedup.heuristics <data>
|
INFO [2018-03-08 09:13:41] Loading coverage files...
INFO [2018-03-08 09:13:48] Mean target coverages: 88X (tumor) 76X (normal).
INFO [2018-03-08 09:13:50] Mean coverages: chrX: 72.95, chrY: 1.41, chr1-22: 82.92.
INFO [2018-03-08 09:13:50] Mean coverages: chrX: 82.62, chrY: 0.64, chr1-22: 71.27.
INFO [2018-03-08 09:14:16] Removing 35593 intervals with missing log.ratio.
INFO [2018-03-08 09:14:16] Removing 9 low/high GC targets.
INFO [2018-03-08 09:14:21] Removing 3762 targets excluded in normalDB.
INFO [2018-03-08 09:14:21] Removing 265 targets with low total coverage in normal (< 150.00 reads).
INFO [2018-03-08 09:14:21] normalDB provided. Setting minimum coverage for segmentation to 0.0015X.
INFO [2018-03-08 09:14:21] Removing 9 low coverage (< 0.0015X) targets.
INFO [2018-03-08 09:14:22] Using 247792 intervals (247792 on-target, 0 off-target).
INFO [2018-03-08 09:14:22] No off-target intervals. If this is hybrid-capture data, consider adding them.
INFO [2018-03-08 09:14:23] AT/GC dropout: 1.07 (tumor), 1.08 (normal).
INFO [2018-03-08 09:14:23] Loading VCF...
INFO [2018-03-08 09:14:23] Found 12980 variants in VCF file.
INFO [2018-03-08 09:14:24] Removing 428 triallelic sites.
WARN [2018-03-08 09:14:24] vcf.file has no DB info field for dbSNP membership. Guessing it based on ID.
INFO [2018-03-08 09:14:24] 7504 (59.8%) variants annotated as likely germline (DB INFO flag).
INFO [2018-03-08 09:14:24] 71-D is tumor in VCF file.
INFO [2018-03-08 09:14:24] 39 homozygous and 104 heterozygous variants on chrX.
INFO [2018-03-08 09:14:24] Sex from VCF: F (Fisher's p-value: 0.924, odds-ratio: 1.03).
INFO [2018-03-08 09:14:24] Detected MuTect2 VCF.
INFO [2018-03-08 09:14:24] Removing 6392 MuTect2 calls due to blacklisted failure reasons.
INFO [2018-03-08 09:14:25] Initial testing for significant sample cross-contamination: maybe
INFO [2018-03-08 09:14:25] Removing 2233 variants with AF < 0.030 or AF >= 0.970 or less than 4 supporting reads or depth < 15.
INFO [2018-03-08 09:14:25] Removing 15 low quality variants with BQ < 25.
INFO [2018-03-08 09:14:25] Total size of targeted genomic region: 55.83Mb (75.03Mb with 50bp padding).
INFO [2018-03-08 09:14:25] 1.4% of targets contain variants.
INFO [2018-03-08 09:14:25] Removing 403 variants outside intervals.
INFO [2018-03-08 09:14:25] Setting somatic prior probabilities for dbSNP hits to 0.000500 or to 0.500000 otherwise.
INFO [2018-03-08 09:14:25] VCF does not contain somatic status. For best results, consider providing normal.panel.vcf.file when matched normals are not available.
INFO [2018-03-08 09:14:25] Sample sex: F
INFO [2018-03-08 09:14:25] Segmenting data...
INFO [2018-03-08 09:14:26] Target weights found, will use weighted CBS.
INFO [2018-03-08 09:14:26] Loading pre-computed boundaries for DNAcopy...
INFO [2018-03-08 09:14:26] Setting undo.SD parameter to 1.250000.
INFO [2018-03-08 09:14:39] Setting prune.hclust.h parameter to 0.150000.
INFO [2018-03-08 09:14:39] Found 99 segments with median size of 1.64Mb.
INFO [2018-03-08 09:14:39] Removing 2 variants outside segments.
INFO [2018-03-08 09:14:39] Using 3507 variants.
INFO [2018-03-08 09:14:40] Mean standard deviation of log-ratios: 1.13
INFO [2018-03-08 09:14:40] 2D-grid search of purity and ploidy...
FATAL [2018-03-08 09:19:40] Cannot find valid purity/ploidy solution. This happens when input
FATAL [2018-03-08 09:19:40] segmentations are garbage, most likely due to a catastrophic sample QC
FATAL [2018-03-08 09:19:40] failure. Re-check standard QC metrics for this sample.
FATAL [2018-03-08 09:19:40]
FATAL [2018-03-08 09:19:40] This is most likely a user error due to invalid input data or
FATAL [2018-03-08 09:19:40] parameters (PureCN 1.9.28).
|
Your log-ratios are extremely noisy (1.13); good data is below 0.4 with coverage that low. Are you sure you are using the correct baits file? You have almost 40k intervals without any coverage. Did NormalDB.R give you a warning that you used the wrong baits file? If so, try to re-run IntervalFile.R and provide the low_coverage_targets.bed file via --exclude. You have a large GC bias, but that could be because of the wrong baits file. You also have only a small fraction of intervals overlapping with variants (1.4%); this should be around 10%. Make sure to run Mutect with the same interval file and add 50-75bp padding (this will at least double the number of SNPs). Then make sure to generate a PON VCF for a production setting. |
Also, is there a reason you dropped --offtarget? |
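For anyone running many samples (46 here), the two numbers called out in the diagnosis above can be pulled out of each PureCN log automatically. The following is only a rough sketch, not part of PureCN: the function names `parse_qc` and `qc_warnings` are hypothetical, the regular expressions assume the log wording shown in this thread (PureCN 1.9.28), and the thresholds are the rules of thumb quoted in the comment above rather than constants from the package.

```python
import re

# Rules of thumb quoted in this thread; these are NOT PureCN constants.
MAX_LOG_RATIO_SD = 0.4                 # "good data is below 0.4"
MIN_PCT_TARGETS_WITH_VARIANTS = 10.0   # "this should be around 10%"

# Patterns assume the log wording seen in this thread (PureCN 1.9.28).
NUMBER = r"(\d+(?:\.\d+)?)"
PATTERNS = {
    "log_ratio_sd": re.compile(r"Mean standard deviation of log-ratios: " + NUMBER),
    "pct_targets_with_variants": re.compile(NUMBER + r"% of targets contain variants"),
    "missing_intervals": re.compile(r"Removing (\d+) intervals with missing log\.ratio"),
}


def parse_qc(log_text):
    """Return the QC numbers found in the log text; absent entries are omitted."""
    metrics = {}
    for name, pattern in PATTERNS.items():
        match = pattern.search(log_text)
        if match:
            metrics[name] = float(match.group(1))
    return metrics


def qc_warnings(metrics):
    """Compare parsed metrics against the thresholds mentioned in the thread."""
    warnings = []
    sd = metrics.get("log_ratio_sd")
    if sd is not None and sd > MAX_LOG_RATIO_SD:
        warnings.append(f"noisy log-ratios: SD {sd:.2f} > {MAX_LOG_RATIO_SD}")
    pct = metrics.get("pct_targets_with_variants")
    if pct is not None and pct < MIN_PCT_TARGETS_WITH_VARIANTS:
        warnings.append(
            f"few targets with variants: {pct:.1f}% < {MIN_PCT_TARGETS_WITH_VARIANTS}%"
        )
    return warnings


if __name__ == "__main__":
    # Three lines copied from the 71-D log posted above.
    sample_log = (
        "INFO [2018-03-08 09:14:16] Removing 35593 intervals with missing log.ratio.\n"
        "INFO [2018-03-08 09:14:25] 1.4% of targets contain variants.\n"
        "INFO [2018-03-08 09:14:40] Mean standard deviation of log-ratios: 1.13\n"
    )
    for warning in qc_warnings(parse_qc(sample_log)):
        print("WARNING:", warning)
```

The count of intervals with missing log-ratio is parsed but not checked against a threshold, since the thread does not give one; it is only there so the value can be compared across samples.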
Thanks, the bait file may be the exact reason why it failed!
|
...
Removing 259 variants outside intervals.
INFO [2018-02-25 11:14:01] Setting somatic prior probabilities for dbSNP hits to 0.000500 or to 0.500000 otherwise.
Error: logical subscript contains NAs
Execution halted