Proposal: LULU ASV post-clustering curation #609

a4000 · 2023-08-02T06:33:26Z

Description of feature

I have a LULU subworkflow that I can add to Ampliseq. The subworkflow uses blastn to create the match list for LULU, then runs LULU for post-clustering curation. The inputs are an ASV FASTA file and a TSV file similar to the DADA2_table.tsv file that Ampliseq already produces. The output is a curated version of that TSV file. I feel this should be easy enough to add to Ampliseq.
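To make the data flow described above concrete, here is a minimal Python sketch. It is not the LULU R package and not the proposed subworkflow. It assumes a three-column match list (query, target, percent identity) and a small ASV-by-sample count table, and applies a simplified co-occurrence rule in the spirit of the LULU paper: an ASV is merged into a more abundant, similar ASV when that candidate parent is present in the samples where the child occurs and is consistently more abundant there. The function name `curate`, its parameters, and the threshold values are illustrative assumptions.

```python
from collections import defaultdict


def curate(table, matchlist, min_match=84.0, min_cooccurrence=0.95, min_ratio=1.0):
    """Merge likely-erroneous ASVs into more abundant, similar 'parent' ASVs.

    table:     {asv_id: [read count per sample]}
    matchlist: iterable of (query_id, target_id, percent_identity)
    Returns (curated_table, parent_of). Simplified sketch, not LULU itself.
    """
    totals = {asv: sum(counts) for asv, counts in table.items()}

    # Collect sufficiently similar hits for each query, ignoring self-hits.
    hits = defaultdict(list)
    for query, target, pident in matchlist:
        if query != target and pident >= min_match and query in table and target in table:
            hits[query].append(target)

    # Visit ASVs from most to least abundant so parents are decided first.
    order = sorted(table, key=lambda asv: (-totals[asv], asv))
    parent_of = {}
    for asv in order:
        parent_of[asv] = asv
        present = [i for i, count in enumerate(table[asv]) if count > 0]
        if not present:
            continue
        for cand in sorted(hits.get(asv, []), key=lambda c: (-totals[c], c)):
            if totals[cand] <= totals[asv]:
                continue  # a parent must be more abundant overall
            shared = [i for i in present if table[cand][i] > 0]
            if not shared or len(shared) / len(present) < min_cooccurrence:
                continue  # parent must co-occur with the child
            # Smallest parent/child abundance ratio over the shared samples.
            ratio = min(table[cand][i] / table[asv][i] for i in shared)
            if ratio > min_ratio:
                parent_of[asv] = parent_of[cand]
                break

    # Sum the counts of every merged ASV into its parent's row.
    curated = {}
    for asv in order:
        merged = curated.setdefault(parent_of[asv], [0] * len(table[asv]))
        for i, count in enumerate(table[asv]):
            merged[i] += count
    return curated, parent_of


# Toy data (made up for illustration): three samples, four ASVs.
table = {
    "ASV1": [100, 80, 50],
    "ASV2": [5, 4, 0],    # similar to ASV1 and always less abundant
    "ASV3": [0, 10, 60],  # similar to ASV1 but more abundant in sample 3
    "ASV4": [20, 0, 0],   # no similar sequence in the match list
}
matchlist = [
    ("ASV2", "ASV1", 98.5), ("ASV1", "ASV2", 98.5),
    ("ASV3", "ASV1", 97.0), ("ASV1", "ASV3", 97.0),
]
curated, parent_of = curate(table, matchlist)
print(curated)
```

In this toy example only ASV2 is merged into ASV1; ASV3 is kept because it outnumbers ASV1 in one sample. In a real run the match list would come from a sequence search tool such as blastn, and the curation itself from LULU.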

erikrikarddaniel · 2023-08-02T08:01:57Z

I suggest (like in #608) VSEARCH instead of BLASTN. Could save a lot of resources/energy I think.

We recently had a Slack discussion regarding post-denoising clustering (https://nfcore.slack.com/archives/CEA7TBJGJ/p1690893776838869), so that seems to be something people want. In that discussion, another tool -- swarm -- was proposed. I don't know what's best or most commonly used.

a4000 · 2023-08-02T08:54:02Z

Swarm is another tool I haven't tried. I'll test it out and look into the literature to see which tool seems more popular/better.

a4000 · 2023-08-03T02:11:41Z

I just read this interesting article (https://archimer.ifremer.fr/doc/00688/80057/83060.pdf). They used DADA2 to get ASVs, ran swarm on the ASVs to get OTUs, and then ran LULU to curate both the ASVs and the OTUs, to check which method produces the most accurate results. To sum up their argument, the best method depends on your dataset and taxa of interest. So maybe it's best to have both swarm and LULU as optional steps.

a4000 · 2023-08-09T02:51:31Z

I noticed the qiime2 modules have these lines

container "qiime2/core:2022.11"

// Exit if running this module with -profile conda / -profile mamba
if (workflow.profile.tokenize(',').intersect(['conda', 'mamba']).size() >= 1) {
    exit 1, "QIIME2 does not support Conda. Please use Docker / Singularity / Podman instead."
}

Is it acceptable for me to do something similar to this when I just have a docker container for a tool, but no conda?
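For readers less familiar with Groovy, the logic of the check quoted above can be mirrored in Python. This is only an illustration of what the snippet does (split the comma-separated profile string, then test for overlap with the unsupported profiles); the function name `uses_unsupported_profile` is hypothetical, and an actual module would keep the Groovy form.

```python
def uses_unsupported_profile(profile, unsupported=("conda", "mamba")):
    """Return True if any active profile is in the unsupported set.

    `profile` is a comma-separated string such as "test,docker",
    standing in for Nextflow's workflow.profile.
    """
    # Drop empty tokens, as Groovy's tokenize(',') does.
    active = {name for name in profile.split(",") if name}
    return len(active & set(unsupported)) >= 1


print(uses_unsupported_profile("test,conda"))   # conda is active
print(uses_unsupported_profile("test,docker"))  # only supported profiles
```

A module that ships only a container would use the same pattern and simply change the error message to name the tool in question.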

d4straub · 2023-08-09T08:03:59Z

If the Docker container is maintained over time (which can't be taken for granted when no conda package is available) and there is no way around it (such as choosing a similar tool that provides both conda and a container), then yes. The example you quote is QIIME2, which is a very popular tool, so I think that warrants the decision there.

a4000 · 2023-08-14T02:36:57Z

I've looked into this issue a bit more.

Problem: LULU doesn't have a conda package, and while there is a Docker container that I'm using, it isn't part of the quay.io/biocontainers registry.

Is there a better method of post-clustering?
It's hard to say. I can't find many studies testing different methods of post-clustering of ASVs. The LULU paper did compare LULU to post-clustering with dbotu3, with LULU performing better.
I found a more recent tool called ReClustOR, but if I'm reading their paper correctly, it doesn't seem like they compared the tool to other post-clustering methods. They don't even mention LULU despite this paper being published after LULU's paper. I also can't find any Conda packages or Docker containers, so I'm not sold on this tool yet.

Is LULU popular?
I found this paper that provides an overview of pipelines for metabarcoding studies and the paper mentions five pipelines that use LULU and one that uses ReClustOR.

Postclustering tools, such as LULU (Frøslev et al., 2017) are implemented in AMPTk (Palmer et al., 2018), eDNAflow (Mousavi-Derazmahalleh et al., 2021), APSCALE (Buchner et al., 2022), LotuS2, PipeCraft2 and ReClustOR (Terrat et al., 2020) in BIOCOM-PIPE.

So from what I can see, it seems like LULU is a relatively popular tool for post-clustering.

Do we need a tool designed for post-clustering curation, or can we use any clustering tool (e.g., Swarm) to functionally perform a similar role?
The paper I mentioned earlier in this issue thread argues that post-clustering curation with LULU is different from post-clustering with Swarm and that the two have different purposes.

This indicates that LULU curation merges less ASVs than the amount grouped through clustering, and highlights the different purposes of both tools, LULU effectively removing spurious OTUs, while clustering allows removing haplotype diversity.

So maybe post-clustering with Swarm should be a separate issue. That's something I'll think about more.

The Docker container I'm using isn't part of the biocontainers registry, but it does work. I can look into the process of getting a container added to the biocontainers registry if that would give the rest of the Ampliseq team more confidence, but I won't do that if it's not necessary. I have most of the code written locally to add this feature to Ampliseq, so it's just the container/conda issue that's holding me back.

d4straub · 2023-08-15T14:27:20Z

I am with @erikrikarddaniel that VSEARCH is a fine tool, and it also has proper containerization. And it can cluster sequences.

Papers from software developers about their own tools should typically be taken with a bit of skepticism; independent benchmarks are usually better. Benchmarks are sometimes hard to generalize, but they are still the best we have for deciding on tools.

I had a quick look at bioconda: swarm is listed, as are AMPTk and apscale. Possibly the containers for those last two have all the requirements for LULU, so that's also a way to get it ;)

Next steps, if you really want to go with LULU: ask the LULU devs to add it to bioconda. Otherwise, the #bioconda channel in the nf-core Slack might have experts who can help.

a4000 · 2023-08-16T05:31:52Z

I think for now I'll try VSEARCH because there is already an nf-core module for VSEARCH_CLUSTER. I have noticed a bug with this module, so I'll fix the bug in the module and try adding the fixed module to the pipeline.

a4000 · 2023-08-25T02:05:21Z

I'm closing this issue for now. I've added VSEARCH instead of LULU for ASV post-clustering.
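As a rough illustration of what clustering ASVs means in contrast to LULU-style curation, here is a toy Python sketch of abundance-sorted greedy centroid clustering. It is not VSEARCH's implementation: VSEARCH uses pairwise alignment and search heuristics, whereas this sketch assumes equal-length sequences and counts position-wise matches. The names `identity` and `greedy_cluster`, the 0.97 threshold, and the example sequences are all assumptions made for illustration.

```python
def identity(a, b):
    """Fraction of matching positions; 0.0 if lengths differ or are zero."""
    if not a or len(a) != len(b):
        return 0.0
    return sum(x == y for x, y in zip(a, b)) / len(a)


def greedy_cluster(seqs, abundance, threshold=0.97):
    """Assign each sequence to the first centroid it matches, else make it one.

    seqs:      {asv_id: sequence}
    abundance: {asv_id: total read count}
    Returns {centroid_id: [member ids, centroid first]}.
    """
    # Most abundant sequences are considered first, so they become centroids.
    order = sorted(seqs, key=lambda sid: (-abundance[sid], sid))
    clusters = {}
    for sid in order:
        for centroid in clusters:
            if identity(seqs[sid], seqs[centroid]) >= threshold:
                clusters[centroid].append(sid)
                break
        else:
            clusters[sid] = [sid]  # no centroid was close enough
    return clusters


# Toy sequences (made up): 40 bases each.
seqs = {
    "ASV1": "ACGT" * 10,
    "ASV2": "ACGT" * 9 + "ACGA",  # one mismatch to ASV1 (39/40 identical)
    "ASV3": "TTTT" + "ACGT" * 9,  # three mismatches to ASV1 (37/40 identical)
}
abundance = {"ASV1": 1000, "ASV2": 10, "ASV3": 50}
clusters = greedy_cluster(seqs, abundance)
print(clusters)
```

Unlike the co-occurrence rule used for curation, this grouping looks only at sequence similarity and abundance order, not at how the ASVs are distributed across samples.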

a4000 added the enhancement label Aug 2, 2023

a4000 mentioned this issue Aug 18, 2023: add VSEARCH cluster #622 (merged)

a4000 closed this as completed Aug 25, 2023
