
Merge branch 'more_docs'

rneher committed Oct 19, 2019
2 parents b1507ac + fe1caae commit 5a3e6a1a419f7b9e4d619957583eb03d06fdf2ef
@@ -0,0 +1,64 @@
## Augur and snakemake

The output of one augur command serves as input to the next and hence these commands need to be executed in order.
This is a common pattern in bioinformatics, and multiple tools -- so-called workflow managers -- exist to facilitate it.
Within Nextstrain, we have relied on [Snakemake](https://snakemake.readthedocs.io/en/stable/).

Snakemake breaks a workflow into a set of rules that are specified in a file called `Snakefile`.
Each rule takes a number of input files, specifies a few parameters, and produces output files.
A simple rule would look like this:
```python
rule filter:
    input:
        sequences = "data/sequences.fasta",
        metadata = "data/metadata.tsv"
    output:
        sequences = "results/filtered.fasta"
    params:
        min_date = 2012
    shell:
        """
        augur filter \
            --sequences {input.sequences} \
            --metadata {input.metadata} \
            --output {output.sequences} \
            --min-date {params.min_date}
        """
```
This rule would produce `results/filtered.fasta` from the input files `data/sequences.fasta` and `data/metadata.tsv` using the `augur filter` command.
Note that we explicitly specify what is an input and what is an output file.
To filter our data, we would now call snakemake as
```bash
snakemake results/filtered.fasta
```
and snakemake will run the same command as specified above.
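Concretely, snakemake substitutes the `input`, `output`, and `params` placeholders and executes a shell command equivalent to:
```bash
augur filter \
  --sequences data/sequences.fasta \
  --metadata data/metadata.tsv \
  --output results/filtered.fasta \
  --min-date 2012
```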

So far, this is just a complicated reformulation of what we did above, but the benefit of workflow management becomes clear once we add more steps.
The next natural step in our phylogenetic pipeline is aligning the filtered sequences, for which we define a rule `align`.
```python
rule align:
    input:
        sequences = rules.filter.output.sequences,
        reference = "config/zika_outgroup.gb"
    output:
        alignment = "results/aligned.fasta"
    shell:
        """
        augur align \
            --sequences {input.sequences} \
            --reference-sequence {input.reference} \
            --output {output.alignment}
        """
```
If you now want to generate the alignment, you can type
```bash
snakemake results/aligned.fasta
```
and snakemake will

* determine that `results/aligned.fasta` is an output of rule `align`
* check whether all required input files are in place and run the necessary rules if not
* run rule `align` and check whether the file `results/aligned.fasta` appeared.
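
If you want to see this dependency resolution without running anything, snakemake's dry-run mode prints the jobs it would execute:
```bash
snakemake -n results/aligned.fasta
```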

If you supply a reference sequence to `augur align`, augur will include that reference sequence in the alignment and strip all insertions relative to that reference.
This will guarantee a consistent reference coordinate system and genome annotation later on.
@@ -0,0 +1,45 @@
# Labeling `clades`

Clades in phylogenetic trees are often named to facilitate discussion of genetic diversity, see for example [seasonal influenza on nextstrain](https://nextstrain.org/flu).
Augur has a command to determine the position of such clade labels and assign sequences to clades.
The definitions of these clades are provided in a tab-delimited file (tsv) using the following format:
```
clade	gene	site	alt
3b	HA1	145	S
3b	HA1	312	S
3b	nuc	1671	G
3c	HA1	48	I
3c	HA1	45	N
3c	nuc	456	T
3c2	HA2	160	N
3c2	nuc	693	A
```
Each line specifies a sequence feature of a clade.
The first column specifies the name of the clade to which the feature belongs, the column `gene` can be any annotated gene or the underlying nucleotide sequence (`nuc`), the column `site` specifies the position (numbering starting at 1), and the column `alt` specifies the state.
A clade is often defined by multiple such sequence features; for example, clade `3b` above is defined by `S` at HA1 sites 145 and 312 together with `G` at nucleotide site 1671.

The augur command `clades` can be used to annotate such clades in your tree; a rule in a Snakefile would look like this:
```python
rule clades:
    input:
        tree = rules.refine.output.tree,
        aa_muts = rules.translate.output.aa_data,
        nuc_muts = rules.ancestral.output.nt_data,
        clades = "config/clades.tsv"
    output:
        clade_data = "results/clades.json"
    shell:
        """
        augur clades --tree {input.tree} \
            --mutations {input.nuc_muts} {input.aa_muts} \
            --clades {input.clades} \
            --output {output.clade_data}
        """
```
As input, this command requires the tree, the outputs of the ancestral reconstruction and translation steps (assuming your clades are defined using translations), as well as the file with clade definitions.

The output of this command is a json file in the common augur format that specifies `clade_membership` for each node in the tree.
Nodes that didn't match any clade definition will be left `unassigned`.
Internal nodes that form the root of clades will have an additional field `clade_annotation` that auspice uses to label branches in the tree.
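A minimal sketch of what such a file might contain (the node names here are hypothetical):
```json
{
  "nodes": {
    "A/example/2016": {
      "clade_membership": "3c2"
    },
    "NODE_0000042": {
      "clade_membership": "3c2",
      "clade_annotation": "3c2"
    }
  }
}
```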


@@ -1,14 +1,59 @@
==================
Working with augur
==================

Augur consists of a number of tools that allow the user to filter and align sequences, build trees, and integrate the phylogenetic analysis with metadata.
The different tools are meant to be composable and the output of one command will serve as the input of other commands.
All of Augur's commands are accessed through the ``augur`` program.
For example, to infer ancestral sequences from a tree, you'd run ``augur ancestral``.
Each command is documented below.
You can also run each command with the ``--help`` option, for example ``augur tree --help``, for more information at the command-line.

.. argparse::
   :module: augur
   :func: make_parser
   :prog: augur
   :nodescription:

As an example, we'll look at the ``filter`` command in greater detail using material from the `zika tutorial <zika_tutorial.html>`__.
The filter command allows you to select various subsets of your input data for different types of analysis.
A simple example use of this command would be

.. code-block:: bash

   augur filter --sequences data/sequences.fasta --metadata data/metadata.tsv --min-date 2012 --output filtered.fasta

This command will select all sequences with collection date in 2012 or later.
The filter command has a large number of options that allow flexible filtering for many common situations.
One such use-case is the exclusion of sequences that are known to be outliers (e.g. because of sequencing errors, cell-culture adaptation, ...).
These can be specified in a separate file:

.. code-block:: bash

   BRA/2016/FC_DQ75D1
   COL/FLR_00034/2015
   ...

To drop such strains, you can pass the name of this file to the augur filter command:

.. code-block:: bash

   augur filter --sequences data/sequences.fasta \
                --metadata data/metadata.tsv \
                --min-date 2012 \
                --exclude config/dropped_strains.txt \
                --output filtered.fasta

(To improve legibility, we have wrapped the command across multiple lines.)
If you run this command (you should be able to copy-paste this into your terminal) on the data provided in the `zika tutorial <zika_tutorial.html>`__, you should see that one of the sequences in the data set was dropped since its name was in the ``dropped_strains.txt`` file.

Another common filtering operation is subsetting of the data to achieve a more even spatio-temporal distribution or to cut down the data set to a more manageable size.
The filter command allows you to select a specific number of sequences from specific groups, for example one sequence per month from each country:

.. code-block:: bash

   augur filter \
       --sequences data/sequences.fasta \
       --metadata data/metadata.tsv \
       --min-date 2012 \
       --exclude config/dropped_strains.txt \
       --group-by country year month \
       --sequences-per-group 1 \
       --output filtered.fasta

This subsampling and filtering will reduce the number of sequences in the tutorial data set from 34 to 24.
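
A quick way to check the effect of a filtering step like this is to count the sequences in the output file, one fasta header line per sequence:

.. code-block:: bash

   grep -c "^>" filtered.fasta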
@@ -0,0 +1,45 @@
# Sharing your analysis

[nextstrain.org](https://nextstrain.org) has a feature that allows you to share your own analysis through the nextstrain.org website.
This works by using GitHub as a repository for your analysis files.
To share an analysis, you need to create a repository.
The name of the repository will be what is used to access the results.
Within this repository, there should be a folder called `auspice` which contains the output json files of the augur pipeline.
Importantly, the names of these files have to start with the name of the repository.
If so, you should be able to access your analysis via the nextstrain community feature.

As an example, let's look at one of our nextstrain community analyses.
The following link shows an analysis of many influenza B sequences that we made a few months ago:

[nextstrain.org/community/neherlab/allflu/B_ha](https://nextstrain.org/community/neherlab/allflu/B_ha)

The analysis files are hosted on our GitHub page in the repository

[github.com/neherlab/allflu](https://github.com/neherlab/allflu)

In this repository, you'll find the folder `auspice` which contains the files
```
allflu_B_ha_meta.json
allflu_B_ha_tree.json
```
Note that all files start with "allflu", which matches the name of the repository.
In fact, there are multiple analyses in this folder. The corresponding files all start with "allflu" but differ in the viral lineage they correspond to:
```
allflu_B_ha_meta.json
allflu_B_ha_tree.json
allflu_h1n1_ha_meta.json
allflu_h1n1_ha_tree.json
allflu_h1n1pdm_ha_meta.json
allflu_h1n1pdm_ha_tree.json
allflu_h3n2_ha_meta.json
allflu_h3n2_ha_tree.json
```
All these can be accessed as

[nextstrain.org/community/neherlab/allflu/h1n1_ha](https://nextstrain.org/community/neherlab/allflu/h1n1_ha)

[nextstrain.org/community/neherlab/allflu/h3n2_ha](https://nextstrain.org/community/neherlab/allflu/h3n2_ha)

[nextstrain.org/community/neherlab/allflu/h1n1pdm_ha](https://nextstrain.org/community/neherlab/allflu/h1n1pdm_ha)


@@ -8,31 +8,6 @@ issue <https://github.com/nextstrain/augur/issues/new?title=Augur%20in%20the%20w
or `submitting a PR <https://github.com/nextstrain/augur/pulls>`__.


Nextstrain tutorials
====================

There are three tutorials for getting started with Nextstrain that use Augur
for all or part of the bioinformatics and analysis:

The `Zika tutorial
<https://nextstrain.org/docs/getting-started/zika-tutorial>`__ is a good
starting point if you have FASTA files and some metadata about your sequences.
The `accompanying build repository
<https://github.com/nextstrain/zika-tutorial>`__ is a simplified version of the
full Zika build.

The `Tuberculosis tutorial
<https://nextstrain.org/docs/getting-started/tb-tutorial>`__ is a good
starting point if you have VCF files and some metadata about your sequences.
It uses the `full Tuberculosis build repository
<https://github.com/nextstrain/tb>`__.

The `MERS / BEAST tutorial
<https://github.com/nextstrain/mers-beast-tutorial>`__ is a good starting
point if you have a `BEAST <https://beast.community/>`__ tree and want to use
it with Augur and Auspice.


Nextstrain pathogens
====================

@@ -9,11 +9,28 @@ Augur is a bioinformatics toolkit to track evolution from sequence and serologic
It provides a collection of commands which are designed to be composable into larger processing pipelines.
Augur originated as part of `Nextstrain <https://nextstrain.org>`__, an open-source project to harness the scientific and public health potential of pathogen genome data.

Augur is composed of a series of modules, and different workflows will use different parts of the pipeline.
A selection of augur modules and different possible entry points are illustrated below.

.. image:: figures/augur_analysis_sketch.png

The canonical pipeline would ingest sequences and metadata such as dates and sampling locations, filter the data, align the sequences, infer a tree, and export the results in a format that can be visualized by auspice.

In some cases, you might start with a manually curated alignment and want to start the workflow at the tree-building step.
Or you might already have an inferred tree; in this case, you only need to feed your tree through the ``refine`` and ``export`` steps.
The ``refine`` step is necessary to ensure that cross-referencing between tree nodes and metadata works as expected.
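
As a minimal sketch of this entry point, assuming you already have a tree and an alignment (all file paths here are placeholders):

.. code-block:: bash

   augur refine \
       --tree results/tree_raw.nwk \
       --alignment results/aligned.fasta \
       --metadata data/metadata.tsv \
       --output-tree results/tree.nwk \
       --output-node-data results/branch_lengths.json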

The different augur modules can be strung together by workflow managers like snakemake and nextflow.
The nextstrain team uses `snakemake <https://snakemake.readthedocs.io/en/stable/>`__ to run and manage the different analyses that you see on `nextstrain.org <https://nextstrain.org>`__.


.. toctree::
   :maxdepth: 2
   :caption: Table of contents

   installation
   usage
   tutorials
   examples
   api
   authors
@@ -1,13 +1,19 @@
# Installation

## Using pip from PyPi

Augur is written in Python 3 and requires at least Python 3.4.
It's published on [PyPi](https://pypi.org) as [nextstrain-augur](https://pypi.org/project/nextstrain-augur), so you can install it with `pip` like so:

```bash
pip install nextstrain-augur
```

You can also install from a git clone or other copy of the source code by running:

```bash
pip install .
```

If your system has both Python 2 and Python 3 installed side-by-side, you may need to use `python3` instead of just `python` (which often defaults to Python 2 when both Python versions are installed).
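For example:

```bash
python3 -m pip install nextstrain-augur
```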

@@ -32,7 +38,7 @@ On Debian/Ubuntu, you can install them via:

Other Linux distributions will likely have the same packages available, although the names may differ slightly.

## Using Conda

Alternatively, augur itself and all of its dependencies can be installed into a [Conda](https://conda.io/miniconda.html) environment:

@@ -0,0 +1,45 @@
# Format of augur output

Many augur commands output additional annotations and inferences made for nodes of the tree.
These data are stored as json files which serve as input to subsequent steps and the export step that combines all these annotations into a file for visualization.
The files output by different steps follow a common format:
```json
{
  "alignment": "results/aligned.fasta",
  "input_tree": "results/tree_raw.nwk",
  "clock": {
    "intercept": -2.2139560216719905,
    "rate": 0.0011001967850769103,
    "rtt_Tmrca": 2012.3272960820566
  },
  "nodes": {
    "1_0087_PF": {
      "branch_length": 0.0001694424712162532,
      "clock_length": 0.0001694424712162532,
      "date": "2013-12-31",
      "mutation_length": 0.00027877777449324505,
      "numdate": 2013.9993155382122,
      "raw_date": "2013-12-XX"
    },
    "1_0181_PF": {
      "branch_length": 0.00009101896681161861,
      "clock_length": 0.00009101896681161861,
      "date": "2013-12-31",
      "mutation_length": 0.0001858351915513057,
      "numdate": 2013.9993155382122,
      "raw_date": "2013-12-XX"
    },
    "NODE_0000030": {
      "branch_length": 0.0006468897126106284,
      "clock_length": 0.0006468897126106284,
      "date": "2016-06-06",
      "mutation_length": 0.0010229321826899364,
      "numdate": 2016.4318911009404
    }
  }
}
```
The json structure has one mandatory top-level field called `nodes` that contains inferences made for each node, possibly including internal nodes.
For this to work, it is essential that internal nodes are consistently named.
Internal nodes are assigned unique names starting with `NODE_` in the `refine` step, and it is crucial that the topology and rooting of the tree remain untouched thereafter.
In addition to the mandatory `nodes` field, the json can contain additional information on the analysis performed, for example input parameters or other inferences such as the clock model.
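
Since these are plain json files, they are easy to inspect directly; for example, to pretty-print the first lines of one (the path is just a placeholder for whatever file your workflow wrote):
```bash
python -m json.tool results/branch_lengths.json | head -n 20
```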
@@ -0,0 +1,36 @@
# `parse` metadata from fasta-headers

If you download sequence data from databases like GISAID or fludb, there is often an option to include metadata such as dates in the header of the fasta file.
This might for example look like this:
```
>A/Canoas/LACENRS_1793/2015|A|H3N2|07/17/2015||Brazil|Human|KY925125
ATG...
>A/Canoas/LACENRS_773/2015|A|H3N2|05/06/2015||Brazil|Human|KY925599
ATG...
[...]
```
The fasta header contains information such as the influenza lineage, dates (in a non-ISO format), country, etc.
To turn this metadata into a table, augur has a special command called `parse`.
A rule to parse the above file could look like this:
```python
rule parse:
    input:
        sequences = "data/h3n2_ha.fasta"
    output:
        sequences = "results/sequences_h3n2_ha.fasta",
        metadata = "results/metadata_h3n2_ha.tsv"
    params:
        fields = "strain type subtype date season country host accession"
    shell:
        """
        augur parse \
            --sequences {input.sequences} \
            --fields {params.fields} \
            --output-sequences {output.sequences} \
            --output-metadata {output.metadata} \
            --fix-dates monthfirst
        """
```
Note the additional argument `--fix-dates monthfirst`.
This triggers an attempt to parse these dates and turn them into ISO format, assuming that the month precedes the day in the input data.
Note that this is a brittle process that should be spot-checked.
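
One simple spot-check is to look at the first rows of the resulting metadata table and verify that the date column was parsed as intended:
```bash
head -n 5 results/metadata_h3n2_ha.tsv
```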
