ARF: Algorithmic rRNA Filtering

========================

Rationale

Determining the taxonomic composition of a microbial community by 16S rRNA sequencing requires a high-quality repository of 16S rRNA alleles with accurate taxonomic identification.

Hand-curated repositories, such as SILVA, RDP, and Greengenes, are high-quality sources. ARF is a purely algorithmic approach to identifying 16S rRNA alleles whose taxonomic annotations are internally consistent, and is meant as an adjunct to these hand-curated sources.

Approach

In broad strokes, ARF downloads all bacterial and archaeal 16S rRNA sequences (see footnote 1) directly from the NCBI NT database, then verifies that sequences annotated as 16S rRNA actually are 16S rRNA genes, are full length (at least 1200bp), and contain minimal ambiguous bases. The sequences that pass become the 1200bp set: valid rRNA alleles, some with and some without verified taxonomic annotations.
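The length and ambiguity checks can be illustrated with a minimal Python sketch. The function name and the 1% ambiguity threshold are assumptions made for illustration; this document only says "minimal ambiguous bases" and does not give a number. Confirming that a sequence really is a 16S rRNA gene is a separate step and is not shown here.

```python
# Bases treated as unambiguous; anything else (N, R, Y, ...) counts as ambiguous.
UNAMBIGUOUS = set("ACGTU")


def passes_length_and_ambiguity(seq, min_len=1200, max_ambiguous_frac=0.01):
    """Return True if seq is at least min_len long and has few ambiguous bases.

    max_ambiguous_frac is a hypothetical threshold, not ARF's actual setting.
    """
    seq = seq.upper()
    if len(seq) < min_len:
        return False
    ambiguous = sum(1 for base in seq if base not in UNAMBIGUOUS)
    return ambiguous / len(seq) <= max_ambiguous_frac
```

A 1200-base sequence with a single `N` passes under these assumed settings, while a 1196-base sequence is rejected on length alone.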

The 1200bp set is further subsetted into a named set of 16S rRNA genes that have some species-level taxonomic assignment (as opposed to rRNA alleles assigned to "uncultured bacterium" or similar).
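As a rough illustration of the named-set idea, the sketch below applies a name-based heuristic to an organism string. This is an assumption for illustration only: how ARF itself decides whether an assignment is species level is not described here, and the marker list and function name are hypothetical.

```python
# Hypothetical markers for records without a usable species name.
UNNAMED_MARKERS = ("uncultured", "unidentified", "unclassified", "environmental")

# Second words that signal a placeholder rather than a specific epithet.
PLACEHOLDER_EPITHETS = ("sp.", "sp", "bacterium", "archaeon")


def has_species_level_name(organism):
    """Heuristic: True if the organism string looks like 'Genus species'."""
    name = organism.strip().lower()
    if not name or any(marker in name for marker in UNNAMED_MARKERS):
        return False
    parts = name.split()
    return len(parts) >= 2 and parts[1] not in PLACEHOLDER_EPITHETS
```

Under this heuristic, "Escherichia coli" is kept, while "uncultured bacterium" and "Bacillus sp." are not.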

The named set is further subsetted into a filtered subset using the deenurp filter-outliers mode. Alleles are grouped by their taxonomic assignment. Each group is clustered at the sequence level, and outliers are filtered out. (Multiple centroids are tolerated.) The basic concept is majority rules.

In parallel, type strains are identified in the named set and placed into the types subset.

ARF is implemented as a Nextflow workflow. The workflow can (and should) be used to iteratively update an extant library of sequences.
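The majority-rules idea can be sketched in Python for a single taxonomic group. This is a deliberately simplified stand-in, not deenurp's implementation: it uses one medoid instead of multiple centroids, a naive position-by-position mismatch fraction instead of a real alignment, and a hypothetical cutoff of 0.015.

```python
def mismatch_fraction(a, b):
    """Naive distance: mismatched positions plus length difference, normalized."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 0.0
    mismatches = sum(1 for x, y in zip(a, b) if x != y)
    return (mismatches + abs(len(a) - len(b))) / longest


def filter_outliers(seqs, cutoff=0.015):
    """Split one group's sequences into (kept, outliers) lists of names.

    seqs maps sequence name -> sequence. The medoid is the sequence with the
    smallest total distance to all others; anything farther than cutoff from
    the medoid is treated as an outlier. cutoff is an assumed value.
    """
    names = list(seqs)
    if len(names) < 3:
        return names, []  # too few sequences for a meaningful majority
    totals = {
        n: sum(mismatch_fraction(seqs[n], seqs[m]) for m in names) for n in names
    }
    medoid = min(names, key=totals.get)
    kept = [n for n in names if mismatch_fraction(seqs[n], seqs[medoid]) <= cutoff]
    outliers = [n for n in names if n not in kept]
    return kept, outliers
```

With three identical sequences and one very different one under the same species label, the three agreeing sequences define the medoid and the fourth is flagged as an outlier.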

Usage:

nextflow run jgolob/arf <ARGUMENTS>

Required Arguments:
--repo                          path to directory holding the current repo (default = './arf')
--out                           path where refreshed repo should be placed (default = './refreshed')
--email                         Valid email to use with NCBI
--ncbi_concurrent_connections   Number of concurrent connections (default = 3)
--retry_max                     Max retries with NCBI requests (default = 1)
--retry_delay                   Delay (ms) between retries (default = 60000)
--min_len                       Minimum annotated length of 16S rRNA to even be downloaded (default = 500)
--species_cap                   Maximum number of 16S rRNA genes per annotated species (default = 5000)


Options:
--api_key                       NCBI api key (will increase download rate)
--debug                         Cuts down the number of records for testing
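For scripted runs, the command line can be assembled programmatically. The helper below is a hypothetical convenience wrapper, not part of ARF; it uses only the flags listed above and assumes `--debug` is passed as a bare flag with no value.

```python
def build_arf_command(email, repo="./arf", out="./refreshed",
                      api_key=None, debug=False):
    """Return the nextflow command for an ARF run as an argument list."""
    cmd = [
        "nextflow", "run", "jgolob/arf",
        "--repo", repo,
        "--out", out,
        "--email", email,
    ]
    if api_key:
        cmd += ["--api_key", api_key]
    if debug:
        # Assumption: --debug takes no value.
        cmd.append("--debug")
    return cmd
```

The resulting list can be handed to `subprocess.run(cmd, check=True)` to launch the workflow.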

Output

arf/
├── dedup
│   └── 1200bp
│       ├── named
│       │   ├── blast.nhr
│       │   ├── blast.nin
│       │   ├── blast.nsq
│       │   ├── filtered
│       │   │   ├── blast.nhr
│       │   │   ├── blast.nin
│       │   │   ├── blast.nsq
│       │   │   ├── lineages.csv
│       │   │   ├── lineages.txt
│       │   │   ├── outliers.csv
│       │   │   ├── seq_info.csv
│       │   │   ├── seqs.fasta
│       │   │   └── taxonomy.csv
│       │   ├── lineages.csv
│       │   ├── lineages.txt
│       │   ├── seq_info.csv
│       │   ├── seqs.fasta
│       │   └── taxonomy.csv
│       ├── seq_info.csv
│       ├── seqs.fasta
│       └── types
│           ├── blast.nhr
│           ├── blast.nin
│           ├── blast.nsq
│           ├── lineages.csv
│           ├── lineages.txt
│           ├── seq_info.csv
│           ├── seqs.fasta
│           └── taxonomy.csv
├── pubmed_info.csv
├── records.txt
├── references.csv
├── refseq_info.csv
├── seq_info.csv
├── seqs.fasta
├── taxdmp.zip
├── taxonomy.csv
└── taxonomy.db

Outputs are intended to be compatible with mothur and MaLiAmPi.
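The directory layout above can be captured in a small lookup so downstream scripts do not hard-code paths. This is a sketch: the helper name and the subset keys are hypothetical, but the relative paths and file names are taken from the tree above.

```python
from pathlib import Path

# Relative location of each subset inside the repo directory.
SUBSET_DIRS = {
    "1200bp": Path("dedup/1200bp"),
    "named": Path("dedup/1200bp/named"),
    "filtered": Path("dedup/1200bp/named/filtered"),
    "types": Path("dedup/1200bp/types"),
}


def subset_files(repo_root, subset):
    """Return paths to the sequence and sequence-info files of one subset."""
    base = Path(repo_root) / SUBSET_DIRS[subset]
    return {"seqs": base / "seqs.fasta", "seq_info": base / "seq_info.csv"}
```

For example, `subset_files("arf", "filtered")["seqs"]` points at `arf/dedup/1200bp/named/filtered/seqs.fasta`.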

Disclosures

ARF makes tails wag and is a collaboration between Noah Hoffman and Chris Rosenthal of Laboratory Medicine at the University of Washington and Fred Hutch and Jonathan Golob of the University of Michigan Department of Medicine and Division of Infectious Diseases. Use it at your own risk.

1 The exact NCBI NT search strings are:

  1. Bacteria: 16s[All Fields] AND rRNA[Feature Key] AND Bacteria[Organism] AND 500 : 99999999999[Sequence Length] NOT(environmental samples[Organism] OR unclassified Bacteria[Organism])

  2. Archaea: 16s[All Fields] AND rRNA[Feature Key] AND Archaea[Organism] AND 500 : 99999999999[Sequence Length] NOT(environmental samples[Organism] OR unclassified Bacteria[Organism])

  3. Type strains: 16s[All Fields] AND rRNA[Feature Key] AND (Bacteria[Organism] OR Archaea[Organism]) AND 500 : 99999999999[Sequence Length] NOT(environmental samples[Organism] OR unclassified Bacteria[Organism]) AND sequence_from_type[Filter]
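The three search strings share one template, which the following Python sketch makes explicit. The function name is hypothetical, and treating the `500` in the strings as the `--min_len` value is an assumption based on the matching default. The exclusion clause is reproduced verbatim from the strings above.

```python
# Exclusion clause copied verbatim from the search strings above.
EXCLUDE = "NOT(environmental samples[Organism] OR unclassified Bacteria[Organism])"


def build_query(organism_clause, min_len=500, type_strains_only=False):
    """Assemble an NCBI search string following the template above."""
    query = (
        f"16s[All Fields] AND rRNA[Feature Key] AND {organism_clause} "
        f"AND {min_len} : 99999999999[Sequence Length] {EXCLUDE}"
    )
    if type_strains_only:
        query += " AND sequence_from_type[Filter]"
    return query
```

Calling `build_query("Bacteria[Organism]")` yields the first string, and `build_query("(Bacteria[Organism] OR Archaea[Organism])", type_strains_only=True)` yields the third.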
