# Genomic epidemiology analysis using Nextstrain

---

_by: Matin Nuhamunada, M.Sc._

Department of Tropical Biology, Universitas Gadjah Mada;   
Jl. Teknika Selatan, Sekip Utara, Bulaksumur, Yogyakarta, Indonesia, 55281;   

email: [matin_nuhamunada@ugm.ac.id](mailto:matin_nuhamunada@mail.ugm.ac.id)  

---
#### Notebook Links
* 1. [Sub-sampling SARS-CoV-2 genome data](01_sub-sampling.ipynb) 
* 2. [Genomic epidemiology analysis using Nextstrain](02_analysis.ipynb) (this notebook)
* 3. [Spike gene collection per clade](03_clade_s_gene_analysis.ipynb)

## Description
This notebook runs a genomic epidemiology analysis with the Nextstrain platform (augur & auspice), following the guide at: https://github.com/nextstrain/ncov/blob/master/docs/running.md

Files that need to be prepared:
* 1 Dataset from GISAID:
    * data/sequences.fasta
    * data/metadata.tsv
* 2 Config files (from https://github.com/nextstrain/ncov/blob/master/config/):
    * config/include.txt
    * config/reference.gb
    * config/clades.tsv
    * config/color_schemes.tsv
    * config/lat_longs.tsv
    * config/auspice_config.json
* 3 Sub-sampling result from the previous notebook:
    * config/exclude_subsampling.txt
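
Before running the pipeline, it can help to confirm that every file listed above is in place. The cell below is a minimal sketch, assuming the notebook is run from the project root; the helper name `missing_files` is illustrative and is not part of augur.

```python
from pathlib import Path

# Paths copied from the list above, relative to the notebook's working directory.
REQUIRED_FILES = [
    "data/sequences.fasta",
    "data/metadata.tsv",
    "config/include.txt",
    "config/reference.gb",
    "config/clades.tsv",
    "config/color_schemes.tsv",
    "config/lat_longs.tsv",
    "config/auspice_config.json",
    "config/exclude_subsampling.txt",
]


def missing_files(paths, root="."):
    """Return the entries of `paths` that are not regular files under `root`."""
    base = Path(root)
    return [p for p in paths if not (base / p).is_file()]


missing = missing_files(REQUIRED_FILES)
if missing:
    print("Missing input files:")
    for path in missing:
        print(" -", path)
else:
    print("All input files are in place.")
```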

The analysis pipeline is outlined below:
![Analysis pipeline](https://raw.githubusercontent.com/nextstrain/ncov/master/docs/images/basic_snakemake_build.png)

In [1]:
# create result directory
! mkdir -p results

In [2]:
! augur filter \
--sequences data/sequences.fasta \
--metadata data/metadata.tsv \
--exclude config/exclude_subsampling.txt \
--include config/include.txt \
--output results/filtered.fasta \
--min-length 25000 \
--exclude-where date='2020' date='2020-01-XX' date='2020-02-XX' date='2020-03-XX' date='2020-04-XX' date='2020-01' date='2020-02' date='2020-03' date='2020-04' \
--group-by division year month \
--sequences-per-group 2000


17140 sequences were dropped during filtering
	17130 of these were dropped because they were in config/exclude_subsampling.txt
	0 of these were dropped because of 'date=2020'
	1 of these were dropped because of 'date=2020-01-XX'
	2 of these were dropped because of 'date=2020-02-XX'
	0 of these were dropped because of 'date=2020-03-XX'
	0 of these were dropped because of 'date=2020-04-XX'
	2 of these were dropped because of 'date=2020-01'
	0 of these were dropped because of 'date=2020-02'
	0 of these were dropped because of 'date=2020-03'
	0 of these were dropped because of 'date=2020-04'
	6 of these were dropped because they were shorter than minimum length of 25000bp
	0 of these were dropped because of subsampling criteria

	1 sequences were added back because they were in config/include.txt
398 sequences have been written out to results/filtered.fasta
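
The number of sequences reported by `augur filter` can be cross-checked directly against the output file. The following is a small sketch that counts FASTA header lines; `count_fasta_records` is an illustrative helper, not an augur function.

```python
from pathlib import Path


def count_fasta_records(path):
    """Count FASTA records by counting the lines that start with '>'."""
    with open(path) as handle:
        return sum(1 for line in handle if line.startswith(">"))


filtered = Path("results/filtered.fasta")
if filtered.is_file():
    # Should agree with the "sequences have been written out" line above.
    print(count_fasta_records(filtered), "sequences in", filtered)
```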


In [3]:
! augur align \
  --sequences results/filtered.fasta \
  --reference-sequence config/reference.gb \
  --output results/aligned.fasta \
  --nthreads auto \
  --fill-gaps


using mafft to align via:
	mafft --reorder --anysymbol --nomemsave --adjustdirection --thread 8 results/aligned.fasta.to_align.fasta 1> results/aligned.fasta 2> results/aligned.fasta.log 

	Katoh et al, Nucleic Acid Research, vol 30, issue 14
	https://doi.org/10.1093%2Fnar%2Fgkf436

Trimmed gaps in MN908947 from the alignment


In [4]:
! python scripts/mask-alignment.py \
--alignment results/aligned.fasta \
--mask-from-beginning 130 \
--mask-from-end 50 \
--mask-sites 18529 29849 29851 29853 \
--output results/mask_aligned.fasta
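
To illustrate what the masking step does to each aligned sequence, here is a simplified sketch of the idea, not the code of `scripts/mask-alignment.py` itself. It assumes that masked positions are replaced with `N` and that `--mask-sites` uses 1-based coordinates; `mask_sequence` is a hypothetical helper.

```python
def mask_sequence(sequence, mask_from_beginning=0, mask_from_end=0, mask_sites=()):
    """Return `sequence` with the requested positions replaced by 'N'.

    `mask_sites` is assumed to hold 1-based positions; sites outside the
    sequence are ignored.
    """
    length = len(sequence)
    bases = list(sequence)
    # Mask the first `mask_from_beginning` positions.
    for i in range(min(mask_from_beginning, length)):
        bases[i] = "N"
    # Mask the last `mask_from_end` positions.
    for i in range(max(length - mask_from_end, 0), length):
        bases[i] = "N"
    # Mask individual sites (converted from 1-based to 0-based indices).
    for site in mask_sites:
        if 1 <= site <= length:
            bases[site - 1] = "N"
    return "".join(bases)


# Toy example: mask 2 leading bases, 3 trailing bases, and site 5.
print(mask_sequence("ACGTACGTAC", 2, 3, [5]))  # NNGTNCGNNN
```

With the values used in the cell above, this corresponds to `mask_sequence(seq, 130, 50, [18529, 29849, 29851, 29853])` applied to each record; the sequence length is unchanged, so the alignment stays intact.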

In [5]:
! augur tree \
  --alignment results/mask_aligned.fasta \
  --output results/tree_raw.nwk

Building a tree via:
	iqtree -ninit 2 -n 2 -me 0.05 -nt 1 -s results/mask_aligned-delim.fasta -m GTR  > results/mask_aligned-delim.iqtree.log
	Nguyen et al: IQ-TREE: A fast and effective stochastic algorithm for estimating maximum likelihood phylogenies.
	Mol. Biol. Evol., 32:268-274. https://doi.org/10.1093/molbev/msu300


Building original tree took 23.73842763900757 seconds


In [6]:
! augur refine \
--root Wuhan-Hu-1/2019 Wuhan/WH01/2019 \
--tree results/tree_raw.nwk \
--alignment results/mask_aligned.fasta \
--metadata data/metadata.tsv \
--output-tree results/tree.nwk \
--output-node-data results/branch_lengths.json \
--coalescent skyline \
--clock-rate 0.0008 \
--clock-std-dev 0.0004 \
--date-inference marginal \
--clock-filter-iqd 4 \
--timetree


5.76	TreeTime.reroot: with method or node: ['Wuhan-Hu-1/2019',
    	'Wuhan/WH01/2019']

6.08	TreeTime.reroot: with method or node: ['Wuhan-Hu-1/2019',
    	'Wuhan/WH01/2019']
pruning leaf  MN908947
pruning leaf  Wales/PHWC-2B0F0/2020

    	tips at positions with AMBIGUOUS bases. This resulted in unexpected
    	behavior is some cases and is no longer done by default. If you want to
    	replace those ambiguous sites with their most likely state, rerun with
    	`reconstruct_tip_states=True` or `--reconstruct-tip-states`.

11.17	TreeTime.reroot: with method or node: ['Wuhan-Hu-1/2019',
     	'Wuhan/WH01/2019']

14.38	###TreeTime.run: INITIAL ROUND

24.93	TreeTime.reroot: with method or node: ['Wuhan-Hu-1/2019',
     	'Wuhan/WH01/2019']

25.39	###TreeTime.run: rerunning timetree after rerooting

36.97	###TreeTime.run: ITERATION 1 out of 2 iterations

71.13	###TreeTime.run: ITERATION 2 out of 2 iterations

Inferred a time resolved phylogeny using TreeTime:
	Sagulenko et al. TreeTime: Max
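
The `--clock-rate` and `--clock-std-dev` values are easier to interpret when converted to substitutions per genome. The sketch below assumes the rate is expressed in substitutions per site per year and uses the length of the MN908947 reference genome (29,903 nt); the function name is illustrative.

```python
GENOME_LENGTH = 29903   # length of the MN908947 (Wuhan-Hu-1) reference, in nucleotides
CLOCK_RATE = 0.0008     # value passed to --clock-rate
CLOCK_STD_DEV = 0.0004  # value passed to --clock-std-dev


def substitutions_per_genome(rate_per_site_per_year, genome_length, years=1.0):
    """Expected number of substitutions across the genome over `years`."""
    return rate_per_site_per_year * genome_length * years


per_year = substitutions_per_genome(CLOCK_RATE, GENOME_LENGTH)
per_month = substitutions_per_genome(CLOCK_RATE, GENOME_LENGTH, years=1 / 12)
spread = substitutions_per_genome(CLOCK_STD_DEV, GENOME_LENGTH)

print(f"about {per_year:.1f} substitutions per genome per year")
print(f"about {per_month:.1f} substitutions per genome per month")
print(f"standard deviation of about {spread:.1f} substitutions per genome per year")
```

Under these assumptions the fixed rate corresponds to roughly 24 substitutions per genome per year, or about two per month.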

In [7]:
! augur traits \
  --tree results/tree.nwk \
  --metadata data/metadata.tsv \
  --output results/traits.json \
  --columns region country \
  --confidence \
  --sampling-bias-correction 2.5

Assigned discrete traits to 397 out of 397 taxa.

NOTE: previous versions (<0.7.0) of this command made a 'short-branch
length assumption. TreeTime now optimizes the overall rate numerically
and thus allows for long branches along which multiple changes
accumulated. This is expected to affect estimates of the overall rate
while leaving the relative rates mostly unchanged.
Assigned discrete traits to 397 out of 397 taxa.

NOTE: previous versions (<0.7.0) of this command made a 'short-branch
length assumption. TreeTime now optimizes the overall rate numerically
and thus allows for long branches along which multiple changes
accumulated. This is expected to affect estimates of the overall rate
while leaving the relative rates mostly unchanged.

Inferred ancestral states of discrete character using TreeTime:
	Sagulenko et al. TreeTime: Maximum-likelihood phylodynamic analysis
	Virus Evolution, vol 4, https://academic.oup.com/ve/article/4/1/vex042/4794731

results written to results/traits.j

In [8]:
! augur ancestral \
  --tree results/tree.nwk \
  --alignment results/aligned.fasta \
  --output-node-data results/nt_muts.json \
  --inference joint 


Inferred ancestral sequence states using TreeTime:
	Sagulenko et al. TreeTime: Maximum-likelihood phylodynamic analysis
	Virus Evolution, vol 4, https://academic.oup.com/ve/article/4/1/vex042/4794731

ancestral mutations written to results/nt_muts.json


In [9]:
! augur translate \
  --tree results/tree.nwk \
  --ancestral-sequences results/nt_muts.json \
  --reference-sequence config/reference.gb \
  --output results/aa_muts.json

Read in 15 features from reference sequence file
amino acid mutations written to results/aa_muts.json
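
The amino-acid mutations can be inspected directly from the node-data file. The sketch below assumes that `results/aa_muts.json` holds a top-level `nodes` object whose entries carry an `aa_muts` mapping from gene name to a list of mutation strings; if the file is laid out differently, the lookup needs adjusting. The helper `nodes_with_mutation` and the example query are illustrative only.

```python
import json
from pathlib import Path


def nodes_with_mutation(node_data, gene, mutation):
    """Return the sorted names of nodes whose branch carries `mutation` in `gene`."""
    hits = []
    for name, attrs in node_data.get("nodes", {}).items():
        if mutation in attrs.get("aa_muts", {}).get(gene, []):
            hits.append(name)
    return sorted(hits)


aa_muts_path = Path("results/aa_muts.json")
if aa_muts_path.is_file():
    node_data = json.loads(aa_muts_path.read_text())
    # Example query (illustrative): branches carrying D614G in the spike gene.
    print(nodes_with_mutation(node_data, "S", "D614G"))
```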


In [10]:
! augur clades -h

usage: augur clades [-h] [--tree TREE] [--mutations MUTATIONS [MUTATIONS ...]]
                    [--reference REFERENCE [REFERENCE ...]] [--clades CLADES]
                    [--output-node-data OUTPUT_NODE_DATA]

Assign clades to nodes in a tree based on amino-acid or nucleotide signatures.

optional arguments:
  -h, --help            show this help message and exit
  --tree TREE           prebuilt Newick -- no tree will be built if provided
                        (default: None)
  --mutations MUTATIONS [MUTATIONS ...]
                        JSON(s) containing ancestral and tip nucleotide and/or
                        amino-acid mutations (default: None)
  --reference REFERENCE [REFERENCE ...]
                        fasta files containing reference and tip nucleotide
                        and/or amino-acid sequences (default: None)
  --clades CLADES       TSV file containing clade definitions by amino-acid
                        (default: None)
  --output-node-data OUTPUT_NOD

In [11]:
! augur clades \
--tree results/tree.nwk \
--mutations results/aa_muts.json results/nt_muts.json \
--clades config/clades.tsv \
--output-node-data results/clades.json

Validating schema of 'results/aa_muts.json'...
clades written to results/clades.json
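
A quick tally of the assigned clades gives a first impression of the result. This sketch assumes that `results/clades.json` stores the assignment of each node under `nodes` with a `clade_membership` key; `clade_counts` is an illustrative helper. Note that the file may include internal nodes as well as tips, so the totals are not necessarily sample counts.

```python
import json
from collections import Counter
from pathlib import Path


def clade_counts(node_data):
    """Count how many nodes are assigned to each clade."""
    return Counter(
        attrs["clade_membership"]
        for attrs in node_data.get("nodes", {}).values()
        if "clade_membership" in attrs
    )


clades_path = Path("results/clades.json")
if clades_path.is_file():
    counts = clade_counts(json.loads(clades_path.read_text()))
    for clade, n in counts.most_common():
        print(f"{clade}\t{n}")
```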


In [12]:
! augur export v2 \
  --tree results/tree.nwk \
  --metadata data/metadata.tsv \
  --node-data results/branch_lengths.json \
              results/traits.json \
              results/nt_muts.json \
              results/aa_muts.json \
              results/clades.json \
  --colors config/color_schemes.tsv \
  --lat-longs config/lat_longs.tsv \
  --auspice-config config/auspice_config.json \
  --output auspice/ncov2019.json

Validating schema of 'results/aa_muts.json'...
Validating config file config/auspice_config.json against the JSON schema
Validating schema of 'config/auspice_config.json'...


Validating produced JSON
Validating schema of 'auspice/ncov2019.json'...
Validating that the JSON is internally consistent...
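
As a final sanity check, the exported dataset can be opened and the number of tips in the tree counted. The sketch below assumes the Auspice v2 layout, in which the tree is stored under the top-level `tree` key and each internal node lists its descendants under `children`; `count_tips` is an illustrative helper.

```python
import json
from pathlib import Path


def count_tips(node):
    """Count the leaves below `node`, walking the tree iteratively."""
    stack = [node]
    tips = 0
    while stack:
        current = stack.pop()
        children = current.get("children", [])
        if children:
            stack.extend(children)
        else:
            tips += 1
    return tips


dataset_path = Path("auspice/ncov2019.json")
if dataset_path.is_file():
    dataset = json.loads(dataset_path.read_text())
    print(count_tips(dataset["tree"]), "tips in the exported tree")
```

An explicit stack is used instead of recursion so that deep, ladder-like trees do not hit Python's recursion limit.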



## Analysis Results

In [14]:
! auspice view --datasetDir auspice

---------------------------------------------------
Auspice server now running at http://localhost:4000
Serving auspice version 2.15.0
Looking for datasets in /home/matin_nuhamunada/ncov2019_ugm/nextstrain-ncov2019/auspice
Looking for narratives in /home/matin_nuhamunada/ncov2019_ugm/nextstrain-ncov2019
---------------------------------------------------

GET AVAILABLE returning locally available datasets & narratives
GET DATASET query received: prefix=/ncov2019
GET AVAILABLE returning locally available datasets & narratives
GET AVAILABLE returning locally available datasets & narratives
GET DATASET query received: prefix=/ncov2019
GET AVAILABLE returning locally available datasets & narratives
^C


The analysis results can be viewed in Auspice at http://localhost:4000

## Next
[Spike gene collection per clade](03_clade_s_gene_analysis.ipynb)  
        