Proposal: LULU ASV post-clustering curation #609

Closed
a4000 opened this issue Aug 2, 2023 · 9 comments
Labels
enhancement New feature or request

Comments

@a4000
Contributor

a4000 commented Aug 2, 2023

Description of feature

I have a LULU subworkflow that I can add to Ampliseq. The subworkflow uses blastn to create the match list for LULU, then uses LULU for post-clustering curation. The input files are an ASV FASTA file and a TSV file similar to the DADA2_table.tsv file that Ampliseq already produces. The output is a curated version of that TSV file. I feel this should be easy enough to add to Ampliseq.
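To make the idea of match-list-based curation concrete, here is a minimal Python sketch. It is not the LULU implementation and not the proposed subworkflow: the function name `curate`, the data layout, and the thresholds (`min_match`, `min_cooccurrence`) are assumptions chosen for illustration. It only shows the general pattern of folding a less abundant ASV into a similar, consistently co-occurring, more abundant one.

```python
from typing import Dict, List, Tuple


def curate(
    table: Dict[str, List[int]],
    matches: List[Tuple[str, str, float]],
    min_match: float = 84.0,          # assumed identity cutoff (percent)
    min_cooccurrence: float = 0.95,   # assumed co-occurrence cutoff
) -> Dict[str, List[int]]:
    """Simplified LULU-style curation (illustrative only).

    table:   {asv_id: [read count per sample]}
    matches: [(query_id, subject_id, percent_identity)], e.g. parsed from
             a three-column match list.
    """
    # Rank ASVs: most widespread first, ties broken by total read count.
    order = sorted(
        table,
        key=lambda a: (sum(c > 0 for c in table[a]), sum(table[a])),
        reverse=True,
    )
    rank = {asv: i for i, asv in enumerate(order)}
    parent_of: Dict[str, str] = {}

    for child in order:
        present = [i for i, c in enumerate(table[child]) if c > 0]
        if not present:
            continue
        # Candidate parents: similar enough and ranked above the child.
        candidates = [
            s for q, s, pid in matches
            if q == child and s in rank and pid >= min_match
            and rank[s] < rank[child]
        ]
        candidates.sort(key=rank.get)
        for cand in candidates:
            parent = parent_of.get(cand, cand)  # resolve to a retained ASV
            co = sum(table[parent][i] > 0 for i in present) / len(present)
            more_abundant = all(
                table[parent][i] > table[child][i]
                for i in present if table[parent][i] > 0
            )
            if co >= min_cooccurrence and more_abundant:
                parent_of[child] = parent
                break

    # Keep retained ASVs and add the counts of their merged children.
    curated = {a: list(table[a]) for a in order if a not in parent_of}
    for child, parent in parent_of.items():
        curated[parent] = [p + c for p, c in zip(curated[parent], table[child])]
    return curated


example_table = {"ASV1": [100, 80, 0], "ASV2": [5, 3, 0], "ASV3": [0, 0, 40]}
example_matches = [("ASV2", "ASV1", 98.0), ("ASV3", "ASV1", 90.0)]
example_curated = curate(example_table, example_matches)
```

In this toy input, ASV2 only occurs where ASV1 is present and more abundant, so it is merged into ASV1. ASV3 matches ASV1 by sequence but never co-occurs with it, so it is kept.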

@a4000 a4000 added the enhancement New feature or request label Aug 2, 2023
@erikrikarddaniel
Member

I suggest (like in #608) VSEARCH instead of BLASTN. Could save a lot of resources/energy I think.

We recently had a Slack discussion regarding post-denoising clustering (https://nfcore.slack.com/archives/CEA7TBJGJ/p1690893776838869), so that seems to be something people want. In that discussion, another tool -- swarm -- was proposed. I don't know what's best or most commonly used.

@a4000
Contributor Author

a4000 commented Aug 2, 2023

Swarm is another tool I haven't tried. I'll test it out and look into the literature to see which tool seems more popular/better.

@a4000
Contributor Author

a4000 commented Aug 3, 2023

I just read this interesting article (https://archimer.ifremer.fr/doc/00688/80057/83060.pdf). They used DADA2 for ASVs, then used swarm on the ASVs to get OTUs, then ran LULU to curate both the ASVs and the OTUs, to check which method produces the most accurate results. To sum up their argument, the best method depends on your dataset and taxa of interest. So maybe it's best to have both swarm and LULU as optional steps.

@a4000
Contributor Author

a4000 commented Aug 9, 2023

I noticed the qiime2 modules have these lines

container "qiime2/core:2022.11"

// Exit if running this module with -profile conda / -profile mamba
if (workflow.profile.tokenize(',').intersect(['conda', 'mamba']).size() >= 1) {
    exit 1, "QIIME2 does not support Conda. Please use Docker / Singularity / Podman instead."
}

Is it acceptable for me to do something similar to this when I just have a docker container for a tool, but no conda?
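For readers less familiar with Groovy, the logic of the quoted check can be restated in Python. This is only a sketch of the condition, not code from the pipeline; the function name `selects_conda` is hypothetical.

```python
def selects_conda(profile: str) -> bool:
    """Return True if a comma-separated profile string names conda or mamba.

    Mirrors the quoted Groovy check: split the profile string on commas
    and test whether it shares any entry with {'conda', 'mamba'}.
    """
    selected = set(profile.split(","))
    return len(selected & {"conda", "mamba"}) >= 1
```

Because the comparison is on whole comma-separated entries, a profile such as `test,conda` triggers the check, while a name that merely contains the substring `conda` does not.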

@d4straub
Collaborator

d4straub commented Aug 9, 2023

If the Docker container is maintained over time (which can't be taken for granted when no conda package is available) and there is no way around it (such as choosing a similar tool that provides both conda and a container), then yes. The example you quote is QIIME2, which is a very popular tool, so I think that warrants the decision here.

@a4000
Contributor Author

a4000 commented Aug 14, 2023

I've looked into this issue a bit more.

Problem: LULU doesn't have a conda package, and while there is a Docker container I'm using, it isn't part of the quay.io/biocontainers registry.

Is there a better method of post-clustering?
It's hard to say. I can't find many studies testing different methods of post-clustering of ASVs. The LULU paper did compare LULU to post-clustering with dbotu3, with LULU performing better.
I found a more recent tool called ReClustOR, but if I'm reading their paper correctly, it doesn't seem like they compared the tool to other post-clustering methods. They don't even mention LULU despite this paper being published after LULU's paper. I also can't find any Conda packages or Docker containers, so I'm not sold on this tool yet.

Is LULU popular?
I found this paper that provides an overview of pipelines for metabarcoding studies and the paper mentions five pipelines that use LULU and one that uses ReClustOR.

Postclustering tools, such as LULU (Frøslev et al., 2017) are implemented in AMPTk (Palmer et al., 2018), eDNAflow (Mousavi-Derazmahalleh et al., 2021), APSCALE (Buchner et al., 2022), LotuS2, PipeCraft2 and ReClustOR (Terrat et al., 2020) in BIOCOM-PIPE.

So from what I can see, it seems like LULU is a relatively popular tool for post-clustering.

Do we need a tool designed for post-clustering curation, or can we use any clustering tool (e.g., Swarm) to functionally perform a similar role?
The paper I mentioned earlier in this issue thread argues that post-clustering curation with LULU is different from post-clustering with Swarm and that the two have different purposes.

This indicates that LULU curation merges less ASVs than the amount grouped through clustering, and highlights the different purposes of both tools, LULU effectively removing spurious OTUs, while clustering allows removing haplotype diversity.

So maybe post-clustering with Swarm should be a separate issue. That's something I'll think about more.

The Docker container I'm using isn't part of the biocontainers registry, but it does work. I can look into the process of getting a container added to the biocontainers repository if that would provide more confidence to the rest of the Ampliseq team, but I won't do that if it's not necessary. I have most of the code written locally to add this feature to Ampliseq, so it's just the container/conda issue that's holding me back.

@d4straub
Collaborator

I am with @erikrikarddaniel that VSEARCH is a fine tool, and it is also properly containerized. And it can cluster sequences.

Papers from software developers about their own tools should typically be taken with a bit of skepticism; independent benchmarks are usually better. Benchmarks are sometimes hard to generalize, but they are still the best we have for deciding on tools, of course.

I had a quick look at bioconda: swarm is listed, as are AMPTk and apscale. Possibly those last two containers have all the requirements for LULU; that's also a way to get it ;)

Next step, if you really want to go with LULU: ask the LULU devs to add it to bioconda. Otherwise, the #bioconda channel on the nf-core Slack might have experts who can help.

@a4000
Contributor Author

a4000 commented Aug 16, 2023

I think for now I'll try VSEARCH because there is already an nf-core module for VSEARCH_CLUSTER. I have noticed a bug with this module, so I'll fix the bug in the module and try adding the fixed module to the pipeline.

@a4000 a4000 mentioned this issue Aug 18, 2023
8 tasks
@a4000
Contributor Author

a4000 commented Aug 25, 2023

I'm closing this issue for now. I've added VSEARCH instead of LULU for ASV post-clustering.
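As a rough illustration of what ASV post-clustering does to the count table, here is a small Python sketch of greedy, abundance-ordered centroid clustering. It is not VSEARCH and not the code added to the pipeline: the names `identity` and `cluster_asvs`, the position-wise identity measure (a stand-in for alignment-based identity), and the default threshold of 0.97 are all assumptions for the example.

```python
from typing import Dict, List, Tuple


def identity(a: str, b: str) -> float:
    """Fraction of matching positions between two equal-length sequences.

    A crude stand-in for alignment-based identity; sequences of different
    length (or empty ones) are treated as non-matching.
    """
    if len(a) != len(b) or not a:
        return 0.0
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cluster_asvs(
    seqs: Dict[str, str],
    counts: Dict[str, List[int]],
    threshold: float = 0.97,  # assumed identity cutoff
) -> Tuple[Dict[str, List[int]], Dict[str, str]]:
    """Greedy clustering: the most abundant ASVs become centroids first."""
    order = sorted(seqs, key=lambda a: sum(counts[a]), reverse=True)
    centroids: List[str] = []
    assignment: Dict[str, str] = {}
    for asv in order:
        for centroid in centroids:
            if identity(seqs[asv], seqs[centroid]) >= threshold:
                assignment[asv] = centroid  # join the first matching cluster
                break
        else:
            centroids.append(asv)           # no match: start a new cluster
            assignment[asv] = asv
    # Sum per-sample counts of all members into their centroid's row.
    merged = {c: [0] * len(counts[c]) for c in centroids}
    for asv, centroid in assignment.items():
        merged[centroid] = [m + n for m, n in zip(merged[centroid], counts[asv])]
    return merged, assignment


example_seqs = {"A": "ACGTACGTAC", "B": "ACGTACGTAT", "C": "TTTTTTTTTT"}
example_counts = {"A": [50, 10], "B": [5, 0], "C": [7, 7]}
example_merged, example_assignment = cluster_asvs(
    example_seqs, example_counts, threshold=0.85
)
```

Here B differs from A at one of ten positions, so at a 0.85 cutoff its counts are folded into A, while C stays as its own cluster.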

@a4000 a4000 closed this as completed Aug 25, 2023
3 participants