Training tutorial? #27

Closed
jelber2 opened this issue Jun 8, 2022 · 11 comments

Comments

@jelber2

jelber2 commented Jun 8, 2022

Curious whether there will be a tutorial (sorry if I missed it somewhere in the repo) on training a custom DeepConsensus model for PacBio HiFi reads from organisms other than human. I tried DeepConsensus on bacterial PacBio HiFi/CCS reads and, as expected, it does not perform as well as it does for human.

@danielecook
Collaborator

@jelber2 there is currently no tutorial for performing training with DeepConsensus.

We are planning on releasing this functionality in the future.

If you can share more detail about the performance you are seeing, it would be helpful to us. We are actively experimenting with training datasets.

@jelber2
Author

jelber2 commented Jun 8, 2022

Here is one example of the performance (note that I did not use the --all setting when running pbccs): https://github.com/PacificBiosciences/harmony/issues/1

@AndrewCarroll
Collaborator

Hi @jelber2

This is an interesting observation; thank you for bringing it up. So far, we've run DeepConsensus (trained on human) on several non-human species (multiple plants, frog, and mouse) and have received reports from others on E. coli. In each case, we've still seen either direct evidence (gap-compressed identity) or indirect evidence (better assembly and YAK values) that DeepConsensus generalizes well to those other species.

Based on those observations, we're fairly optimistic that a single model should apply well across species. If there are counter-examples, it would be good to know in order to adjust our strategy for training.

Are you able to share the subreads files for these samples? It would be useful for us to replicate your findings in order to better understand them.

Thank you,
Andrew
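
For readers unfamiliar with the gap-compressed identity metric mentioned above, here is a minimal Python sketch. It assumes the common definition in which each insertion or deletion run counts as a single difference regardless of its length, and it takes a CIGAR string plus the SAM NM edit distance as input. The function name and inputs are illustrative; this is not necessarily how the DeepConsensus team computed their numbers.

```python
import re

# CIGAR operations are (length, op) pairs such as 10M, 2I, 3D, 5=, 1X.
CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")


def gap_compressed_identity(cigar: str, nm: int) -> float:
    """Identity where each insertion or deletion run counts as one event.

    cigar: CIGAR string of the alignment.
    nm:    edit distance (SAM NM tag): mismatches + inserted + deleted bases.
    """
    aligned = 0    # bases in aligned columns (M, =, X)
    gap_bases = 0  # total inserted + deleted bases
    gap_opens = 0  # number of insertion/deletion runs
    for length, op in CIGAR_RE.findall(cigar):
        n = int(length)
        if op in "M=X":
            aligned += n
        elif op in "ID":
            gap_bases += n
            gap_opens += 1
    mismatches = nm - gap_bases
    matches = aligned - mismatches
    columns = aligned + gap_opens  # every gap run is compressed to one column
    return matches / columns if columns else 0.0
```

For example, an alignment with CIGAR `50M2D48M` and NM of 3 has one mismatch and one deletion run, so the identity is 97 matches over 99 compressed columns.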

@MariaNattestad
Collaborator

Looking at your other issue, I think you might get better results from DeepConsensus by following the DeepConsensus quick start more closely: use ccs --all, and make sure that you're not manipulating the subreads in any way other than what is explicitly stated in our quick start.

@jelber2
Author

jelber2 commented Jun 13, 2022

@AndrewCarroll Unfortunately, I cannot share the subreads. The only explanation I can think of is that these are E. coli reads from a PacBio Sequel (not Sequel II). Here are the results of comparing --all to the default (no --all), with and without DeepConsensus.
[image: plot comparing --all and default pbccs reads, with and without DeepConsensus]

@jelber2 jelber2 closed this as completed Jun 21, 2022
@jelber2
Author

jelber2 commented Jul 25, 2022

I was able to get permission to share the data. @AndrewCarroll I sent you an email with links about a month ago, but I have not heard back on whether you were able to access the data.

@AndrewCarroll
Collaborator

Hi @jelber2

I do see the email in my inbox when I search for your name. I am not sure how I missed it originally. I will download the data now and take a look.

@jelber2
Author

jelber2 commented Jul 26, 2022

@AndrewCarroll, feel free to post any updates from running these data through DeepConsensus here. My guess is that either I am doing something wrong or the data coming from a PacBio Sequel is the issue.

@jelber2 jelber2 reopened this Jul 26, 2022
@jelber2
Author

jelber2 commented Jul 27, 2022

So, I ran deepconsensus-0.3.1-gpu with pbccs-6.3.0 --min-rq=0.88 for the CCS step, then ran harmony-0.2.0 (https://github.com/PacificBiosciences/harmony/releases/tag/v0.2.0) as before. There is a clear improvement in the deepconsensus-0.3.1 corrected reads (ccs.rq.deepcon-0.3.1) relative to the deepconsensus-0.2.0 reads (ccs.all.deepcon and ccs.notall.deepcon) and the ccs reads (ccs.all, ccs.notall, ccs.rq). Here ccs.all uses pbccs-6.3.0 --all, ccs.notall uses pbccs-6.3.0 defaults, and ccs.rq uses pbccs-6.3.0 --min-rq=0.88.

[image: ccs-deepcon]
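
To put the --min-rq=0.88 setting mentioned above on the same scale as the Q20 and Q30 thresholds used elsewhere in this thread, the sketch below converts between a predicted read accuracy and a Phred-scaled quality. It assumes rq is a predicted accuracy between 0 and 1; the function names and the Q60 cap for perfect accuracy are illustrative choices, not part of pbccs or DeepConsensus.

```python
import math


def rq_to_phred(rq: float, cap: float = 60.0) -> float:
    """Convert a predicted accuracy (0..1) to a Phred-scaled quality."""
    if rq >= 1.0:
        return cap  # perfect accuracy would be infinite; cap is an assumption
    return min(cap, -10.0 * math.log10(1.0 - rq))


def phred_to_rq(q: float) -> float:
    """Convert a Phred-scaled quality back to a predicted accuracy."""
    return 1.0 - 10.0 ** (-q / 10.0)


print(f"rq 0.88 is about Q{rq_to_phred(0.88):.1f}")
print(f"Q20 is rq {phred_to_rq(20):.2f}")
```

Under this conversion, an rq of 0.88 is roughly Q9.2, well below the Q20 (rq 0.99) cutoff discussed in the replies, so that setting passes many lower-quality reads on to DeepConsensus.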

@AndrewCarroll
Collaborator

Hi @jelber2

I was able to run DeepConsensus v0.3.1 on your data, and everything seems to have run smoothly. Although I don't have the empirical quality calculations relative to the reference, I can use the predicted quality values from pbccs and DeepConsensus. These values are roughly in line with expectations from other datasets. I see a ~12% increase in the number of reads at >Q20 for DeepConsensus relative to pbccs.

This observation makes me wonder if one of the reasons you don't see more separation of the curves in your plot is that DeepConsensus is able to rescue many reads that would normally have <Q20 in pbccs (and therefore not be present in the notall dataset). As a result, the comparison is complicated by the fact that the DeepConsensus bins may have a larger number of more difficult reads.

I wonder if it might be more meaningful to plot the sequence yield at a given quality for each method. For example, at a quality of Q30+, how many bases are present in the DeepConsensus dataset compared with the pbccs dataset? This would help to disentangle the confounding effect of differing sequence yield between the methods.

Alternatively, you could also explicitly filter the reads to the same read set between the different methods to get apples-to-apples comparisons on exactly the same data.

I am curious to hear your feedback on these potential strategies.

Thank you,
Andrew
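
The yield-at-quality comparison suggested above could be tabulated along these lines. This is a minimal sketch that assumes each read has already been reduced to a (length, Phred quality) pair, for example from predicted or empirical read quality; the function name, thresholds, and toy inputs are invented for illustration and are not data from this thread.

```python
from typing import Dict, Iterable, Sequence, Tuple


def yield_at_thresholds(
    reads: Iterable[Tuple[int, float]],
    thresholds: Sequence[float] = (20, 30, 40),
) -> Dict[float, Tuple[int, int]]:
    """Return {threshold: (n_reads, n_bases)} for reads with quality >= threshold.

    reads: (read_length, read_quality) pairs, quality in Phred units.
    """
    totals = {t: [0, 0] for t in thresholds}
    for length, quality in reads:
        for t in thresholds:
            if quality >= t:
                totals[t][0] += 1
                totals[t][1] += length
    return {t: (n, b) for t, (n, b) in totals.items()}


# Toy inputs, invented for illustration only (not data from this thread).
pbccs = [(10_000, 22.0), (12_000, 31.0), (9_000, 35.0)]
deepconsensus = [(10_000, 28.0), (12_000, 36.0), (9_000, 41.0), (11_000, 21.0)]

for name, reads in (("pbccs", pbccs), ("deepconsensus", deepconsensus)):
    for t, (n, bases) in yield_at_thresholds(reads).items():
        print(f"{name}\tQ{t}+\t{n} reads\t{bases} bases")
```

Reporting both read counts and base counts per threshold keeps the comparison independent of how many reads each method emits overall.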

@jelber2
Author

jelber2 commented Jul 28, 2022

# get pbccs-6.3.0 not all (default settings) read names
samtools view -@34 ccs.notall.bam |cut -f 1 > names1

# filter by name
filterbyname.sh include=t names=names1 in=ccs.rq.deepconsensus-0.3.11.bam out=STDOUT.sam | \
samtools view -Sb -@8 > ccs.rq.deepconsensus-0.3.11.notall.reads.bam

# run harmony on the DeepConsensus reads restricted to the notall read names
harmony -j 34 ccs.rq.deepconsensus-0.3.11.notall.reads.bam ../flye/assembly.fasta ccs.rq.deepcon-0.3.1.na

# run harmony on all DeepConsensus reads
harmony -j 34 ccs.rq.deepconsensus-0.3.11.bam ../flye/assembly.fasta ccs.rq.deepcon-0.3.1

# make the plot
./single.R ccs.rq.deepcon-0.3.1 ccs.rq.deepcon-0.3.1.na

[image: ccs-deepcon]
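
The name-based filtering in the commands above restricts one dataset to the reads found in another. A symmetric variant, keeping only reads present in every dataset being compared, can be sketched in Python as follows. The helper names and the example read names are hypothetical, and the records are simplified to (name, quality) pairs rather than real BAM records.

```python
from typing import Iterable, List, Set, Tuple


def shared_read_names(*name_collections: Iterable[str]) -> Set[str]:
    """Names present in every input collection."""
    sets = [set(names) for names in name_collections]
    return set.intersection(*sets) if sets else set()


def restrict(
    records: Iterable[Tuple[str, float]], keep: Set[str]
) -> List[Tuple[str, float]]:
    """Keep only records whose read name is in `keep`, preserving order."""
    return [rec for rec in records if rec[0] in keep]


# Hypothetical read names and qualities, for illustration only.
ccs_notall = [("m1/10/ccs", 25.0), ("m1/11/ccs", 31.0)]
deepcon = [("m1/10/ccs", 29.0), ("m1/11/ccs", 36.0), ("m1/12/ccs", 21.0)]

keep = shared_read_names(
    (name for name, _ in ccs_notall),
    (name for name, _ in deepcon),
)
print(restrict(deepcon, keep))
```

Applying the same `keep` set to every method gives the apples-to-apples comparison on identical reads.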

@jelber2 jelber2 closed this as completed Sep 6, 2022