
make ASVs to close #18 #21

Status: Open · wants to merge 22 commits into master
Conversation

@colinbrislawn (Collaborator) commented Apr 27, 2019

This is working now. To test:

Install

git checkout unoise # or git pull to update
pip install .

Run

hundo download \
    --database-dir /home/cbrislawn/hundo_annotation_references \
    --reference-database silva

cd example
hundo annotate \
    --filter-adapters qc_references/adapters.fa.gz \
    --filter-contaminants qc_references/phix174.fa.gz \
    --database-dir /home/cbrislawn/hundo_annotation_references \
    --pipeline ASV \
    --reference-database silva \
    --out-dir mothur_sop_silva \
    mothur_sop_data

@colinbrislawn (Collaborator, Author) commented Apr 28, 2019

Current issues:
Need to rename output features, say as >ASV_1, or with their md5 hash like dada2 does, using vsearch --relabel_md5.

The 100% matching means that fewer reads from each sample make it into OTUs. I'm not sure if this is a good thing... or just a limitation of Exact sequence variants. 🤷‍♂ Thoughts?

EDIT: Also should we still be calling the output file OTU.fna if these are zOTUs / ASVs / ESVs?
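The md5 relabeling mentioned above can be sketched in Python. This is a throwaway illustration, not hundo code, and it assumes the digest is taken over the uppercased sequence; check the vsearch manual for its exact normalization rules.

```python
import hashlib

# Sketch: relabel FASTA records with the md5 digest of their sequence,
# similar in spirit to `vsearch --relabel_md5` / dada2's hashed IDs.
# Assumption: the digest is computed on the uppercased sequence.
def md5_label(seq):
    return hashlib.md5(seq.upper().encode("ascii")).hexdigest()

def relabel_records(records):
    """records: iterable of (header, sequence) pairs -> md5-relabeled pairs."""
    return [(md5_label(seq), seq) for _header, seq in records]

for header, seq in relabel_records([("OTU_1", "ACGTACGT")]):
    print(">%s\n%s" % (header, seq))
```

One nice property of hash labels is that identical sequences get identical IDs across runs and datasets, which plain serial numbers like ASV_1 do not.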

@brwnj (Contributor) commented May 1, 2019

> The 100% matching means that fewer reads from each sample make it into OTUs. I'm not sure if this is a good thing... or just a limitation of Exact sequence variants. 🤷‍♂ Thoughts?

exact sequence variants with no singletons, right? a drop in counts seems inevitable in that case.

> EDIT: Also should we still be calling the output file OTU.fna if these are zOTUs / ASVs / ESVs?

could probably alter the output file names to reflect the header name change.

@colinbrislawn (Collaborator, Author)

> could probably alter the output file names to reflect the header name change.

That makes sense. Is there an elegant way to support ASV.fasta, ASV_tax.fasta, etc., without duplicating the rest of the pipeline?

@brwnj (Contributor) commented May 1, 2019

maybe use config.get("pipeline") to set file names, like:

rule compile_counts:
    input:
        seqs = "all-sequences.fasta",
        db = "%s_tax.fasta" % config.get("pipeline")
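If that pattern ends up repeated across many rules, it could be pulled into a tiny helper. A sketch, assuming a hypothetical `pipeline_file` function (this is not part of hundo's API):

```python
# Sketch: build pipeline-specific file names from the Snakemake config.
# `pipeline_file` is a hypothetical helper name for illustration.
def pipeline_file(config, suffix):
    """Return e.g. 'ASV_tax.fasta' when config['pipeline'] == 'ASV'."""
    return "%s%s" % (config.get("pipeline", "OTU"), suffix)

# e.g. in a rule's input block:
#     db = pipeline_file(config, "_tax.fasta")
print(pipeline_file({"pipeline": "ASV"}, "_tax.fasta"))  # ASV_tax.fasta
```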

@colinbrislawn (Collaborator, Author)

Got it. Should I include this each and every time we refer to those files?

On a related note, how should I approach updating the report?

@brwnj (Contributor) commented May 1, 2019

> Got it. Should I include this each and every time we refer to those files?

Yes, where rules are re-used (where it makes sense). This will minimize potential issues with file naming and could make it simpler to support other strategies later.

> On a related note, how should I approach updating the report?

I don't know everything that's changed, but clearly there's a large chunk of text that will be quite different. Maybe start with figuring out everything that needs to be altered, then decide if we need an entirely separate script or if we can pass relevant args into the existing one to set specific pieces.

@colinbrislawn (Collaborator, Author)

OK, here's my todo list:

  • Rename OTU* files to ASV* files
  • Support these new names using db = "%s_tax.fasta" % config.get("pipeline")
  • Update the section of the report that's different
  • Automatically change the report based on --pipeline

Any advice on word choice? Is --pipeline a good name for this flag? Should we call our features ASVs / ESVs / zOTUs?

@brwnj (Contributor) commented May 1, 2019

The first two bullets are related in that the second bullet solves the first (at least I think it does).

I think --pipeline is fair. I honestly haven't been keeping up with terminology in this area, so whatever you think is best for feature name is fine with me.

@colinbrislawn (Collaborator, Author) commented May 3, 2019

I did a quick parameter sweep to explore non-exact matching. 🎯

All runs used vsearch --usearch_global; only --id varied:

| --id  | Matching unique query sequences | Matching total query sequences |
|-------|---------------------------------|--------------------------------|
| 1.0   | 9797 of 1057180 (0.93%)         | 6029201 of 9391211 (64.20%)    |
| 0.995 | 428897 of 1057180 (40.57%)      | 8439865 of 9391211 (89.87%)    |
| 0.99  | 847648 of 1057180 (80.18%)      | 9054425 of 9391211 (96.41%)    |
| 0.97  | 1008576 of 1057180 (95.40%)     | 9294045 of 9391211 (98.97%)    |
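As a quick sanity check, the percentages in the sweep follow directly from the reported counts (a throwaway sketch, not hundo code):

```python
# Recompute vsearch's reported match percentages from the raw counts.
def pct(matching, total):
    return round(100.0 * matching / total, 2)

unique_total, reads_total = 1057180, 9391211
sweep = {  # --id: (matching unique seqs, matching total seqs)
    1.0: (9797, 6029201),
    0.995: (428897, 8439865),
    0.99: (847648, 9054425),
    0.97: (1008576, 9294045),
}
for ident, (uniq, reads) in sweep.items():
    print(ident, pct(uniq, unique_total), pct(reads, reads_total))
```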

Robert Edgar recommends using 97% for counting zOTUs, and 98% is mentioned in this vsearch thread.

Joe, what do you think about counting up non-exact matches after building zOTUs? What threshold should we use?

@colinbrislawn (Collaborator, Author)

Joe, I'm getting ready to wrap this up. My current solution is to have two reports for the two pipelines. There is a lot of duplicate code, but it was easy to implement. This also makes it easy to add other much more divergent pipelines like picrust2 closed-ref if we wanted.

I've also changed the counting step to use --id 0.99, as that captures more reads with what I predict is little loss of quality. (A 1% difference is about 2 bp for most amplicons.)

Finally, how do we update the docs? Does the docs folder end up on readthedocs?

@colinbrislawn (Collaborator, Author)

Is there a clean way to do this?

rule build_report:
    input:
        report_script = os.path.join(
            os.path.dirname(os.path.abspath(workflow.snakefile)),
            "scripts",
            "build_report_OTU.py"
        ) if config.get("pipeline") == "OTU" else os.path.join(
            os.path.dirname(os.path.abspath(workflow.snakefile)),
            "scripts",
            "build_report_ASV.py"),

Additionally, the report scripts don't seem to be copied when hundo is installed. Am I using this section wrong?
Missing input files for rule build_report: /Users/bris469/miniconda3/envs/hundo-dev/lib/python3.7/site-packages/hundo/scripts/build_report_ASV.py

@brwnj (Contributor) commented May 29, 2019

> Finally, how do we update the docs? Does the docs folder end up on readthedocs?

Yes, the docs folder is all that needs updating. RTD will rebuild when there are changes to the docs.

> Is there a clean way to do this?

This is clean to me. Alternatively, if you want something shorter in the input block, you can write a separate function that returns the correct path. That would move the code into a function and out of the input, but ultimately it would look the same.
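That separate function might look like the sketch below; `get_report_script` is a hypothetical name chosen for illustration, not an existing hundo function.

```python
import os

# Sketch of the helper described above: return the report script path for
# the configured pipeline, moving the conditional out of the input block.
def get_report_script(snakefile, pipeline):
    script = "build_report_OTU.py" if pipeline == "OTU" else "build_report_ASV.py"
    return os.path.join(
        os.path.dirname(os.path.abspath(snakefile)), "scripts", script
    )

# In the Snakefile, the input block would then shrink to something like:
#     report_script = get_report_script(workflow.snakefile, config.get("pipeline"))
```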

> Additionally, the report scripts don't seem to be copied when hundo is installed. Am I using this section wrong?

You just need to update the manifest (https://github.com/pnnl/hundo/blob/master/MANIFEST.in).

@colinbrislawn (Collaborator, Author)

So this bug with JSON parsing is holding up testing:
biocore/biom-format#816

I'll work on docs until then.

@brwnj (Contributor) commented Mar 19, 2024 via email

@colinbrislawn (Collaborator, Author)

You are good!

I can include benchmarks to compare, then merge this myself.

(I may need a hand distributing this on pip and conda, but we can do that later.)

Sorry to 'at' you on a Monday night.

@colinbrislawn colinbrislawn assigned colinbrislawn and unassigned brwnj Mar 19, 2024