Training tutorial? #27

jelber2 · 2022-06-08T08:17:57Z

Will there be a tutorial (sorry if I missed it somewhere in the repo) on training a custom DeepConsensus model for PacBio HiFi reads from species other than human? I tried DeepConsensus on bacterial PacBio HiFi/CCS reads and, as expected, it does not perform as well as it does on human reads.

danielecook · 2022-06-08T13:38:56Z

@jelber2 there is currently no tutorial for performing training with DeepConsensus.

We are planning on releasing this functionality in the future.

If you can share more detail about the performance you observed, it would be helpful to us. We are actively experimenting with training datasets.

jelber2 · 2022-06-08T14:04:39Z

Here is one example of the performance. Note that I did not use the --all setting when running pbccs: https://github.com/PacificBiosciences/harmony/issues/1

AndrewCarroll · 2022-06-10T05:51:09Z

Hi @jelber2

This is an interesting observation; thank you for bringing it up. So far, we've run DeepConsensus (trained on human) on several non-human species (multiple plants, frog, and mouse) and have received reports from others on E. coli. In each case, we've still seen either direct evidence (gap-compressed identity) or indirect evidence (better assembly and YAK values) that DeepConsensus generalizes well to those species.

Based on those observations, we're fairly optimistic that a single model should apply well across species. If there are counter-examples, it would be good to know in order to adjust our strategy for training.

Are you able to share the subreads files for these samples? It could be useful for us to replicate your findings in order to better understand them.

Thank you,
Andrew

MariaNattestad · 2022-06-10T15:22:04Z

Looking at your other issue, I think you might get better results with DeepConsensus by following the DeepConsensus quick start more closely: use ccs --all, and make sure that you're not manipulating the subreads in any way other than what is explicitly stated in our quick start.

jelber2 · 2022-06-13T08:57:40Z

@AndrewCarroll Unfortunately, I cannot share the subreads. The only explanation I can think of is that these are E. coli reads from a PacBio Sequel (not a Sequel II). Here are the results of comparing --all to the default (not --all), with and without DeepConsensus.

jelber2 · 2022-07-25T08:48:49Z

I was able to get permission to share the data. @AndrewCarroll I sent you an email with links about a month ago, but I have not heard back on whether you were able to access the data.

AndrewCarroll · 2022-07-25T18:14:52Z

Hi @jelber2

I do see the email in my inbox when I search for your name. I am not sure how I missed it originally. I will download the data now and take a look.

jelber2 · 2022-07-26T11:43:35Z

@AndrewCarroll, feel free to post any updates here from running these data through DeepConsensus. My guess is that either I am doing something wrong or the issue is that the data came from a PacBio Sequel.

jelber2 · 2022-07-27T07:43:37Z

So, I ran deepconsensus-0.3.1-gpu with pbccs-6.3.0 --min-rq=0.88, then ran harmony-0.2.0 (https://github.com/PacificBiosciences/harmony/releases/tag/v0.2.0) as before. There is a clear improvement in the deepconsensus-0.3.1 corrected reads (ccs.rq.deepcon-0.3.1) relative to the deepconsensus-0.2.0 reads (ccs.all.deepcon and ccs.notall.deepcon) and the ccs reads (ccs.all, ccs.notall, ccs.rq). Here, ccs.all uses pbccs-6.3.0 --all, ccs.notall uses pbccs-6.3.0 defaults, and ccs.rq uses pbccs-6.3.0 --min-rq=0.88.

AndrewCarroll · 2022-07-27T22:45:40Z

Hi @jelber2

I was able to run DeepConsensus v0.3.1 on your data, and everything seems to have run smoothly. Although I don't have the empirical quality calculations relative to the reference, I can use the predicted quality values from pbccs and DeepConsensus. These values are roughly in line with our expectations from other datasets. I see a ~12% increase in the number of reads at >Q20 for DeepConsensus relative to pbccs.

This observation makes me wonder if one of the reasons you don't see more separation of the curves in your plot is that DeepConsensus is able to rescue many reads that would normally have <Q20 in pbccs (and therefore not be present in the notall dataset). As a result, the comparison is complicated by the fact that the DeepConsensus bins may have a larger number of more difficult reads.

I wonder if it might be more meaningful to plot the sequence yield at a given quality for each method. For example, at a quality of Q30+, how many bases are present in the DeepConsensus dataset compared to the pbccs dataset? This would help to disentangle the confounding factor of the difference in sequence yield between the methods.
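As a rough illustration of the yield-at-quality idea, here is a minimal sketch in shell/awk. It assumes the reads are available as 4-line FASTQ records with Phred+33 qualities, and it defines the per-read quality as the Phred-scaled mean per-base error probability. That is one common convention, not necessarily how pbccs or DeepConsensus derive their reported read quality. The function name `yield_at_q` and the file names in the usage note are hypothetical.

```shell
#!/bin/sh
# yield_at_q MIN_Q < reads.fastq
# Prints "<reads> <bases>" for reads whose mean predicted quality is >= MIN_Q.
# Assumption: 4-line FASTQ records, Phred+33 encoded quality strings.
yield_at_q() {
  awk -v min_q="$1" '
    BEGIN {
      # Lookup table from a quality character to its ASCII code.
      for (i = 33; i <= 126; i++) ord[sprintf("%c", i)] = i
    }
    NR % 4 == 0 {                      # every 4th line is a quality string
      n = length($0)
      if (n == 0) next
      err = 0
      for (j = 1; j <= n; j++) {
        q = ord[substr($0, j, 1)] - 33
        err += 10 ^ (-q / 10)          # per-base error probability
      }
      read_q = -10 * log(err / n) / log(10)   # Phred scale of the mean error
      if (read_q >= min_q) { reads++; bases += n }
    }
    END { printf "%d %d\n", reads, bases }
  '
}
```

For example, `yield_at_q 30 < deepconsensus.fastq` and `yield_at_q 30 < pbccs.fastq` would give the read and base counts at Q30+ for each method; repeating this over a range of thresholds gives the points for a yield-versus-quality plot.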

Alternatively, you could also explicitly filter the reads to the same read set between the different methods to get apples-to-apples comparisons on exactly the same data.
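To sketch the shared-read-set idea: given one file of read names per method (for example, produced with `samtools view reads.bam | cut -f 1`), the names common to both can be computed with `sort` and `comm`. This is only a minimal sketch; the function name `shared_names` and the file names are hypothetical, and it assumes read names are identical between the pbccs and DeepConsensus outputs.

```shell
#!/bin/sh
# shared_names A.txt B.txt
# Prints the read names present in both files, one per line, sorted.
# Assumption: each input file holds one read name per line.
shared_names() {
  tmp_a=$(mktemp) && tmp_b=$(mktemp) || return 1
  # comm requires sorted input; -u also drops duplicate names.
  LC_ALL=C sort -u "$1" > "$tmp_a"
  LC_ALL=C sort -u "$2" > "$tmp_b"
  # -12 suppresses lines unique to either file, leaving the intersection.
  LC_ALL=C comm -12 "$tmp_a" "$tmp_b"
  rm -f "$tmp_a" "$tmp_b"
}
```

Running `shared_names pbccs.names deepconsensus.names > shared.names` would then give a single name list that can be used to filter both read sets before comparing them.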

I am curious to hear your feedback on these potential strategies.

Thank you,
Andrew

jelber2 · 2022-07-28T12:03:51Z

# get the read names from the pbccs-6.3.0 default (not --all) run
samtools view -@34 ccs.notall.bam | cut -f 1 > names1

# keep only the DeepConsensus reads whose names are in that list (filterbyname.sh is from BBTools)
filterbyname.sh include=t names=names1 in=ccs.rq.deepconsensus-0.3.11.bam out=STDOUT.sam | \
samtools view -Sb -@8 > ccs.rq.deepconsensus-0.3.11.notall.reads.bam

# run harmony on the filtered reads (same read set as the pbccs default run)
harmony -j 34 ccs.rq.deepconsensus-0.3.11.notall.reads.bam ../flye/assembly.fasta ccs.rq.deepcon-0.3.1.na

# run harmony on all DeepConsensus reads
harmony -j 34 ccs.rq.deepconsensus-0.3.11.bam ../flye/assembly.fasta ccs.rq.deepcon-0.3.1

# make the plot comparing the two harmony outputs
./single.R ccs.rq.deepcon-0.3.1 ccs.rq.deepcon-0.3.1.na

jelber2 closed this as completed Jun 21, 2022

jelber2 reopened this Jul 26, 2022

jelber2 closed this as completed Sep 6, 2022
