
make ASVs to close #18 #21

Status: Open · wants to merge 22 commits into master
Conversation

@colinbrislawn (Collaborator) commented Apr 27, 2019

This is working now. To test:

Install

git checkout unoise # or git pull to update
pip install .

Run

hundo download \
    --database-dir /home/cbrislawn/hundo_annotation_references \
    --reference-database silva

cd example
hundo annotate \
    --filter-adapters qc_references/adapters.fa.gz \
    --filter-contaminants qc_references/phix174.fa.gz \
    --database-dir /home/cbrislawn/hundo_annotation_references \
    --pipeline ASV \
    --reference-database silva \
    --out-dir mothur_sop_silva \
    mothur_sop_data

@colinbrislawn (Collaborator, Author) commented Apr 28, 2019

Current issues:
Need to rename output features, say as >ASV_1, or with their md5 hash like dada2 does, using vsearch --relabel_md5.

The 100% matching means that fewer reads from each sample make it into OTUs. I'm not sure if this is a good thing... or just a limitation of Exact sequence variants. 🤷‍♂ Thoughts?

EDIT: Also should we still be calling the output file OTU.fna if these are zOTUs / ASVs / ESVs?
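The md5 relabeling mentioned above can be sketched in Python. This is a throwaway illustration, not hundo code, and it assumes the digest is taken over the uppercased sequence; check the vsearch manual for its exact normalization rules.

```python
import hashlib

# Sketch: relabel FASTA records with the md5 digest of their sequence,
# similar in spirit to `vsearch --relabel_md5` / dada2's hashed IDs.
# Assumption: the digest is computed on the uppercased sequence.
def md5_label(seq):
    return hashlib.md5(seq.upper().encode("ascii")).hexdigest()

def relabel_records(records):
    """records: iterable of (header, sequence) pairs -> md5-relabeled pairs."""
    return [(md5_label(seq), seq) for _header, seq in records]

for header, seq in relabel_records([("OTU_1", "ACGTACGT")]):
    print(">%s\n%s" % (header, seq))
```

One nice property of hash labels is that identical sequences get identical IDs across runs and datasets, which plain serial numbers like ASV_1 do not.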

@brwnj (Contributor) commented May 1, 2019

> The 100% matching means that fewer reads from each sample make it into OTUs. I'm not sure if this is a good thing... or just a limitation of Exact sequence variants. 🤷‍♂ Thoughts?

exact sequence variants with no singletons, right? a drop in counts seems inevitable in that case.

> EDIT: Also should we still be calling the output file OTU.fna if these are zOTUs / ASVs / ESVs?

could probably alter the output file names to reflect the header name change.

@colinbrislawn (Collaborator, Author)

> could probably alter the output file names to reflect the header name change.

That makes sense. Is there an elegant way to support ASV.fasta, ASV_tax.fasta, etc., without duplicating the rest of the pipeline?

@brwnj (Contributor) commented May 1, 2019

maybe use config.get("pipeline") to set file names, like:

rule compile_counts:
    input:
        seqs = "all-sequences.fasta",
        db = "%s_tax.fasta" % config.get("pipeline")
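If that pattern ends up repeated across many rules, it could be pulled into a tiny helper. A sketch, assuming a hypothetical `pipeline_file` function (this is not part of hundo's API):

```python
# Sketch: build pipeline-specific file names from the Snakemake config.
# `pipeline_file` is a hypothetical helper name for illustration.
def pipeline_file(config, suffix):
    """Return e.g. 'ASV_tax.fasta' when config['pipeline'] == 'ASV'."""
    return "%s%s" % (config.get("pipeline", "OTU"), suffix)

# e.g. in a rule's input block:
#     db = pipeline_file(config, "_tax.fasta")
print(pipeline_file({"pipeline": "ASV"}, "_tax.fasta"))  # ASV_tax.fasta
```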

@colinbrislawn (Collaborator, Author)

Got it. Should I include this each and every time we refer to those files?

On a related note, how should I approach updating the report?

@brwnj (Contributor) commented May 1, 2019

> Got it. Should I include this each and every time we refer to those files?

Yes, where rules are re-used (where it makes sense). This will minimize potential issues with file naming and could make it simpler to support other strategies later.

> On a related note, how should I approach updating the report?

I don't know everything that's changed, but clearly there's a large chunk of text that will be quite different. Maybe start with figuring out everything that needs to be altered, then decide if we need an entirely separate script or if we can pass relevant args into the existing one to set specific pieces.

@colinbrislawn (Collaborator, Author)

OK, here's my todo list:

  • Rename OTU* files to ASV* files
  • Support these new names using db = "%s_tax.fasta" % config.get("pipeline")
  • Update the section of the report that's different
  • Automatically change the report based on --pipeline

Any advice on word choice? Is --pipeline a good name for this flag? Should we call our features ASVs / ESVs / zOTUs?

@brwnj (Contributor) commented May 1, 2019

The first two bullets are related in that the second bullet solves the first (at least I think it does).

I think --pipeline is fair. I honestly haven't been keeping up with terminology in this area, so whatever you think is best for feature name is fine with me.

@colinbrislawn (Collaborator, Author) commented May 3, 2019

I did a quick parameter sweep to explore non-exact matching. 🎯

All runs used vsearch --usearch_global; only --id varied:

| --id  | Matching unique query sequences | Matching total query sequences |
|-------|---------------------------------|--------------------------------|
| 1.0   | 9797 of 1057180 (0.93%)         | 6029201 of 9391211 (64.20%)    |
| 0.995 | 428897 of 1057180 (40.57%)      | 8439865 of 9391211 (89.87%)    |
| 0.99  | 847648 of 1057180 (80.18%)      | 9054425 of 9391211 (96.41%)    |
| 0.97  | 1008576 of 1057180 (95.40%)     | 9294045 of 9391211 (98.97%)    |
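As a quick sanity check, the percentages in the sweep follow directly from the reported counts (a throwaway sketch, not hundo code):

```python
# Recompute vsearch's reported match percentages from the raw counts.
def pct(matching, total):
    return round(100.0 * matching / total, 2)

unique_total, reads_total = 1057180, 9391211
sweep = {  # --id: (matching unique seqs, matching total seqs)
    1.0: (9797, 6029201),
    0.995: (428897, 8439865),
    0.99: (847648, 9054425),
    0.97: (1008576, 9294045),
}
for ident, (uniq, reads) in sweep.items():
    print(ident, pct(uniq, unique_total), pct(reads, reads_total))
```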

Robert Edgar recommends using 97% for counting zOTUs, and 98% is mentioned in this vsearch thread.

Joe, what do you think about counting up non-exact matches after building zOTUs? What threshold should we use?

@colinbrislawn (Collaborator, Author)

Joe, I'm getting ready to wrap this up. My current solution is to have two reports for the two pipelines. There is a lot of duplicate code, but it was easy to implement. This also makes it easy to add other much more divergent pipelines like picrust2 closed-ref if we wanted.

I've also changed the counting step to use --id 0.99, as that captures more reads with what I predict is little loss of quality. (A 1% difference is about 2 bp for most amplicons.)

Finally, how do we update the docs? Does the docs folder end up on readthedocs?

@colinbrislawn (Collaborator, Author)

Is there a clean way to do this?

rule build_report:
    input:
        report_script = os.path.join(
            os.path.dirname(os.path.abspath(workflow.snakefile)),
            "scripts",
            "build_report_OTU.py"
        ) if config.get("pipeline") == "OTU" else os.path.join(
            os.path.dirname(os.path.abspath(workflow.snakefile)),
            "scripts",
            "build_report_ASV.py"),

Additionally, the report scripts don't seem to be copied when hundo is installed. Am I using this section wrong?
Missing input files for rule build_report: /Users/bris469/miniconda3/envs/hundo-dev/lib/python3.7/site-packages/hundo/scripts/build_report_ASV.py

@brwnj (Contributor) commented May 29, 2019

> Finally, how do we update the docs? Does the docs folder end up on readthedocs?

Yes, the docs folder is all that needs updating. RTD will rebuild when there are changes to the docs.

> Is there a clean way to do this?

This is clean to me. Alternatively, if you want something shorter in the input block, you can write a separate function that returns the correct path. That would move the code into a function and out of the input, but ultimately it would look the same.
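That separate function might look like the sketch below; `get_report_script` is a hypothetical name chosen for illustration, not an existing hundo function.

```python
import os

# Sketch of the helper described above: return the report script path for
# the configured pipeline, moving the conditional out of the input block.
def get_report_script(snakefile, pipeline):
    script = "build_report_OTU.py" if pipeline == "OTU" else "build_report_ASV.py"
    return os.path.join(
        os.path.dirname(os.path.abspath(snakefile)), "scripts", script
    )

# In the Snakefile, the input block would then shrink to something like:
#     report_script = get_report_script(workflow.snakefile, config.get("pipeline"))
```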

> Additionally, the report scripts don't seem to be copied when hundo is installed. Am I using this section wrong?

You just need to update the manifest (https://github.com/pnnl/hundo/blob/master/MANIFEST.in).

@colinbrislawn (Collaborator, Author)

So this bug with JSON parsing is holding up testing:
biocore/biom-format#816

I'll work on docs until then.

@brwnj (Contributor) commented Mar 19, 2024 via email

@colinbrislawn (Collaborator, Author)

You are good!

I can include benchmarks to compare, then merge this myself.

(I may need a hand distributing this on pip and conda, but we can do that later.)

Sorry to 'at' you on a Monday night.

@colinbrislawn colinbrislawn assigned colinbrislawn and unassigned brwnj Mar 19, 2024