Skip to content

Latest commit

 

History

History
64 lines (46 loc) · 2.57 KB

filter.rst

File metadata and controls

64 lines (46 loc) · 2.57 KB

augur filter


How we subsample sequences in the zika-tutoral

As an example, we'll look that the filter command in greater detail using material from the Zika tutorial <docs.nextstrain.org:tutorials/zika>. The filter command allows you to selected various subsets of your input data for different types of analysis. A simple example use of this command would be

augur filter --sequences data/sequences.fasta --metadata data/metadata.tsv --min-date 2012 --output filtered.fasta

This command will select all sequences with collection date in 2012 or later. The filter command has a large number of options that allow flexible filtering for many common situations. One such use-case is the exclusion of sequences that are known to be outliers (e.g. because of sequencing errors, cell-culture adaptation, ...). These can be specified in a separate file:

BRA/2016/FC_DQ75D1
COL/FLR_00034/2015
...

To drop such strains, you can pass the name of this file to the augur filter command:

augur filter --sequences data/sequences.fasta \
           --metadata data/metadata.tsv \
           --min-date 2012 \
           --exclude config/dropped_strains.txt \
           --output filtered.fasta

(To improve legibility, we have wrapped the command across multiple lines.) If you run this command (you should be able to copy-paste this into your terminal) on the data provided in the Zika tutorial <docs.nextstrain.org:tutorials/zika>, you should see that one of the sequences in the data set was dropped since its name was in the dropped_strains.txt file.

Another common filtering operation is subsetting of data to a achieve a more even spatio-temporal distribution or to cut-down data set size to more manageable numbers. The filter command allows you to select a specific number of sequences from specific groups, for example one sequence per month from each country:

augur filter \
  --sequences data/sequences.fasta \
  --metadata data/metadata.tsv \
  --min-date 2012 \
  --exclude config/dropped_strains.txt \
  --group-by country year month \
  --sequences-per-group 1 \
  --output filtered.fasta

This subsampling and filtering will reduce the number of sequences in the tutorial data set from 34 to 24.