Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Clarification - how to avoid high FPR #377

Open
exeter-matthew-wakeling opened this issue Feb 26, 2020 · 3 comments
Open

Clarification - how to avoid high FPR #377

exeter-matthew-wakeling opened this issue Feb 26, 2020 · 3 comments

Comments

@exeter-matthew-wakeling
Copy link

exeter-matthew-wakeling commented Feb 26, 2020

I'm trying to run Kevlar on a human trio, sequenced on a BGI-Seq 500, so the data has a fairly low error rate, but the reads are not error-corrected. For some inexplicable reason we have been given about 50X-worth of data when we expected 30X. I am getting some errors for too-high FPR. I ran:

kevlar count --memory 60000M b37_mask.ct Escherichia_coli_str_k_12_substr_mg1655.ASM584v2.dna.toplevel.fa.gz UniVec.fasta human_g1k_v37.fasta
kevlar count --memory 60000M proband.ct proband_R1.fastq.gz proband_R2.fastq.gz
kevlar count --memory 60000M father.ct father_R1.fastq.gz father_R2.fastq.gz
kevlar count --memory 60000M mother.ct mother_R1.fastq.gz mother_R2.fastq.gz
kevlar novel --case proband_R1.fastq.gz proband_R2.fastq.gz --case-counts proband.ct --control-counts father.ct mother.ct -o proband_novel.augfastq
kevlar filter --mask b37_mask.ct -o proband_filtered.augfastq proband_novel.augfastq

The kevlar filter stage fails with a high FPR. The documentation states that a FPR of <0.5 for control samples and <0.05 for case samples are required, and that this can be achieved for uncorrected reads with 36-72GB of space with a suitable mask. I have tried the above commands for three trios so far, and the FPR rates are as follows:

  1. proband: 0.004, father: 0.053, mother: 0.005, filter: 0.123
  2. proband: 0.046, father: 0.007, mother: 0.047, filter: 0.971
  3. proband: 0.006, father: 0.019, mother: 0.005, filter: 0.629

So, the FPR rates for the samples are under the thresholds stated in the documentation, but the FPR rate is reported much higher by the filter command, and the filter command produces an error message.

I have applied the --mask option to the filter command. I just noticed that the count command also has a --mask option - should I be giving the same mask to the count command as well? The documentation doesn't really make this clear. Should I allocate even more memory to the count process?

I would greatly appreciate being set in the right direction on this one, but I think the documentation could also do with clarification. Many thanks.

@standage
Copy link
Collaborator

With respect to using the mask in the kevlar count command with or instead of the kevlar filter command, I agree that the docs could use some clarification. But yes, using the mask at the counting stage will decrease your FPR.

Another option—one that is mentioned in the docs, by the way—is to perform error correction. In my experience, this leads to a dramatic reduction in memory requirements/FPR, and the only side effect is a small loss of sensitivity (1-2 low-coverage SNPs are erroneously treated as sequencing errors and "corrected"). In kevlar's recommended BAM preprocessing workflow we use the Lighter error corrector.

@exeter-matthew-wakeling
Copy link
Author

Ok, I have tried running the software with a mask at the count stage. The commands have been:

kevlar count --memory 60000M b37_mask.ct Escherichia_coli_str_k_12_substr_mg1655.ASM584v2.dna.toplevel.fa.gz UniVec.fasta human_g1k_v37.fasta
kevlar count --memory 70000M --mask b37_mask.ct proband.ct proband_R1.fastq.gz proband_R2.fastq.gz
kevlar count --memory 70000M --mask b37_mask.ct father.ct father_R1.fastq.gz father_R2.fastq.gz
kevlar count --memory 70000M --mask b37_mask.ct mother.ct mother_R1.fastq.gz mother_R2.fastq.gz
kevlar novel --case proband_R1.fastq.gz proband_R2.fastq.gz --case-counts proband.ct --control-counts father.ct mother.ct -o proband_novel.augfastq
kevlar filter --mask b37_mask.ct -o proband_filtered.augfastq proband_novel.augfastq

I have run 5 trios. The FPR rates have been as follows:

  1. Proband: 0.000, father: 0.002, mother: 0.006, filter: 0.647
  2. Proband: 0.000, father: 0.003, mother: 0.000, filter: 0.621
  3. Proband: 0.000, father: 0.014, mother: 0.000, filter: 0.125
  4. Proband: 0.011, father: 0.000, mother: 0.011, filter: 0.957
  5. Proband: 0.000, father: 0.001, mother: 0.000, filter: 0.135

I'm going to try performing error correction on the fastq files, and see if that makes any difference, however, it seems that the kevlar count outputs already have an extremely low FPR. Is it usual for the kevlar filter command to report a much higher FPR than the kevlar count data it uses? The documentation states that a much higher error rate of up to 0.05 would be acceptable, but that doesn't seem to be working for me.

@standage
Copy link
Collaborator

standage commented Mar 3, 2020

Is it usual for the kevlar filter command to report a much higher FPR than the kevlar count data it uses?

kevlar filter re-computes k-mer counts for k-mers marked as "novel" by kevlar novel. This step requires much less memory than the initial step, since we're looking at many fewer candidate k-mers. However, the default memory for kevlar filter is 1 Mb—try running it with --memory 1G, which should be more than enough.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants