Skip to content

Latest commit

 

History

History
46 lines (35 loc) · 1.54 KB

index.rst

File metadata and controls

46 lines (35 loc) · 1.54 KB

augur index

Speed up filtering with a sequence index

As we describe in the zika tutorial <docs.nextstrain.org:tutorials/zika>, augur index precalculates the composition of the sequences (e.g., numbers of nucleotides, gaps, invalid characters, and total sequence length) prior to filtering. The resulting sequence index speeds up subsequent filter steps especially in more complex workflows.

mkdir -p results/
augur index \
    --sequences data/sequences.fasta \
    --output results/sequence_index.tsv

The first lines in the sequence index look like this.

strain  length  A   C   G   T   N   other_IUPAC -   ?   invalid_nucleotides
PAN/CDC_259359_V1_V3/2015   10771   2952    2379    3142    2298    0   0   0   0   0
COL/FLR_00024/2015  10659   2921    2344    3113    2281    0   0   0   0   0
PRVABC59    10675   2923    2351    3115    2286    0   0   0   0   0
COL/FLR_00008/2015  10659   2924    2344    3110    2281    0   0   0   0   0

We then provide the sequence index as an input to augur filter commands to speed up filtering on sequence-specific attributes.

augur filter \
    --sequences data/sequences.fasta \
    --sequence-index results/sequence_index.tsv \
    --metadata data/metadata.tsv \
    --exclude config/dropped_strains.txt \
    --output results/filtered.fasta \
    --group-by country year month \
    --sequences-per-group 20 \
    --min-date 2012