### Scenario
The dataset consists of 30 *Brucella* spp. genome assemblies from the species *B. abortus*, *B. canis*, *B. melitensis* and *B. suis*, attributed to human (16 isolates), animal (5 isolates), food (4 isolates) or environmental (5 isolates) sources.
The goal is to use phylogenetic cluster analysis to identify the *Brucella* strain, suspected to have been deliberately released, that is causing a human outbreak.

### Data to download
#### a. set locations for demo dataset, reference genome and metadata

In [None]:
%%bash
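#URLs for demo dataset, reference genome and metadata
#note: each %%bash cell runs in its own shell, so these variables must also be defined in the cell that uses them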
DATASET_URL=''
REFERENCE_URL=''
METADATA_URL=''

#### b. create work directory and download dataset and reference genome

In [None]:
%%bash
#create demo directory
mkdir demo && cd demo/

#download dataset
wget "${DATASET_URL}" -O Brucella.outbreak.30.subset.tar.gz
#untar dataset and unzip all genome assemblies
tar xf Brucella.outbreak.30.subset.tar.gz
gunzip Component01.data.subset/*.gz
#download reference genome for cgSNPs analysis
wget "${REFERENCE_URL}" -O GCF_000007125.1_ASM712v1_genomic.fna.gz
gunzip GCF_000007125.1_ASM712v1_genomic.fna.gz
#download metadata for mlst analysis (maps each ST to a species)
wget "${METADATA_URL}" -O mlst.ST.species.tsv

#download parallel tool
wget https://github.com/shenwei356/rush/releases/download/v0.4.2/rush_linux_amd64.tar.gz
tar xf rush_linux_amd64.tar.gz && rm -f rush_linux_amd64.tar.gz

### Tasks
#### 1. provide a summary of each isolate’s genome assembly 

In [None]:
%%bash
#run seqkit - see https://github.com/shenwei356/seqkit for usage
seqkit -h
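
One possible way to produce the per-assembly summary is `seqkit stats`. The sketch below is only an illustration: it assumes the assemblies were unpacked into `demo/Component01.data.subset/` relative to the notebook and end in `.fna`, and the output file name `seqkit.stats.tsv` is hypothetical. Adjust these to match your files.

```shell
# Sketch only: directory, extension and output name are assumptions.
ASSEMBLY_DIR="demo/Component01.data.subset"   # assumed location of the unpacked assemblies
EXT="fna"                                     # assumed file extension
OUT="seqkit.stats.tsv"                        # hypothetical output file name

if command -v seqkit >/dev/null 2>&1 && ls "${ASSEMBLY_DIR}"/*."${EXT}" >/dev/null 2>&1; then
    # -a adds extended statistics such as N50, -T prints a tab-separated table
    seqkit stats -a -T "${ASSEMBLY_DIR}"/*."${EXT}" > "${OUT}"
    # print the table with aligned columns for reading in the notebook
    column -t "${OUT}"
else
    echo "seqkit or the assemblies were not found; nothing to summarise" >&2
fi
```

The tabular output has one row per assembly, so it can be checked quickly for outliers in contig count, total length or N50.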

#### 2. answer the following questions
##### a. Which of the four *Brucella* species is assigned to each isolate?

In [None]:
%%bash
#run mlst - see https://github.com/tseemann/mlst for usage
#run for each genome and then retrieve species from the metadata file (mapping assigned ST to species)
mkdir mlst-output/
mlst -h
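
The two steps in the comments above (typing each genome, then looking up the species for its ST) could be sketched as follows. This is a hedged example, not the intended solution: it assumes that `mlst.ST.species.tsv` has two tab-separated columns (ST, then species), that the assemblies are `.fna` files under `demo/Component01.data.subset/`, and that `mlst` auto-detects the scheme. The helper `st_to_species` and the output file names are hypothetical.

```shell
# Join mlst output (file, scheme, ST, alleles...) with an ST-to-species table.
# Assumption: the metadata file has two tab-separated columns, ST then species.
st_to_species() {
    local mlst_tsv="$1" metadata_tsv="$2"
    awk -F'\t' -v OFS='\t' '
        NR == FNR { species[$1] = $2; next }   # first file: remember ST -> species
        {
            # second file: column 3 of the mlst output is the assigned ST
            sp = ($3 in species) ? species[$3] : "unknown"
            print $1, $3, sp
        }
    ' "${metadata_tsv}" "${mlst_tsv}"
}

ASSEMBLY_DIR="demo/Component01.data.subset"   # assumed location of the assemblies
METADATA="demo/mlst.ST.species.tsv"           # assumed location of the metadata

if command -v mlst >/dev/null 2>&1 && ls "${ASSEMBLY_DIR}"/*.fna >/dev/null 2>&1; then
    mkdir -p mlst-output
    # one line per genome: file name, detected scheme, ST, allele calls
    mlst "${ASSEMBLY_DIR}"/*.fna > mlst-output/mlst.tsv
    # file name, ST and the species looked up from the metadata
    st_to_species mlst-output/mlst.tsv "${METADATA}" > mlst-output/species.tsv
    column -t -s "$(printf '\t')" mlst-output/species.tsv
else
    echo "mlst or the assemblies were not found; skipping typing" >&2
fi
```

Isolates whose ST is missing from the metadata (including those for which `mlst` reports no ST) are labelled `unknown` rather than dropped, so all 30 genomes remain visible in the result.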


##### b. Which of the four *Brucella* species is causing the outbreak?
##### c. What is the source of the isolate that is closely related to the human outbreak strains?

In [None]:
%%bash

#we will attempt to answer these questions using the cgSNPs method, comparing each genome to the reference genome
#run snippy and snippy-core - see https://github.com/tseemann/snippy for usage
#note that you can also run snippy in parallel using rush - see https://github.com/shenwei356/rush for usage
mkdir snippy-output/ snippy-core-output/
snippy -h

#run iqtree using the core.full.aln from snippy-core output to generate a cgSNPs tree which can be viewed using GrapeTree (see https://github.com/achtman-lab/GrapeTree)
iqtree -h
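
The cgSNPs workflow described in the comments above might look like the following sketch. Several details are assumptions rather than part of the exercise: the paths under `demo/`, the `.fna` extension, the use of 4 CPUs, the `GTR+G` substitution model and the 1000 ultrafast bootstrap replicates. The helper `sample_name` is hypothetical. The genomes are processed one after another here; `rush` could be used to run them in parallel instead.

```shell
# Sketch only: paths, CPU count and tree model are assumptions.
REF="demo/GCF_000007125.1_ASM712v1_genomic.fna"   # assumed location of the reference
ASSEMBLY_DIR="demo/Component01.data.subset"       # assumed location of the assemblies

# derive a sample name from a file path by dropping the directory and the last extension
sample_name() {
    basename "$1" | sed 's/\.[^.]*$//'
}

if command -v snippy >/dev/null 2>&1 && command -v iqtree >/dev/null 2>&1 \
   && ls "${ASSEMBLY_DIR}"/*.fna >/dev/null 2>&1; then
    mkdir -p snippy-output snippy-core-output
    # call variants for each assembly against the reference (--ctg takes contigs as input)
    for asm in "${ASSEMBLY_DIR}"/*.fna; do
        snippy --cpus 4 --outdir "snippy-output/$(sample_name "${asm}")" \
               --ref "${REF}" --ctg "${asm}"
    done
    # combine the per-sample results into a core alignment
    snippy-core --ref "${REF}" --prefix snippy-core-output/core snippy-output/*
    # infer the tree from the full core alignment
    iqtree -s snippy-core-output/core.full.aln -m GTR+G -bb 1000 -nt AUTO
else
    echo "snippy, iqtree or the assemblies were not found; skipping cgSNPs analysis" >&2
fi
```

With these settings `iqtree` writes its results next to the alignment, and the resulting `.treefile` can be loaded into GrapeTree together with the source metadata to see which non-human isolate clusters with the human outbreak isolates.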

##### d. Which virulence factors are found in the genome of the outbreak strain?

In [None]:
%%bash

#run abricate with the appropriate database - see https://github.com/tseemann/abricate for usage
mkdir abricate-output
abricate -h
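
A minimal sketch of the virulence screen, assuming that the bundled `vfdb` database is the appropriate one and that the assemblies are `.fna` files under `demo/Component01.data.subset/`. The helper `list_genes` and the output file names are hypothetical. The example screens all assemblies; restrict the glob to the outbreak isolate once it has been identified.

```shell
# Print the unique gene names from an abricate report.
# The GENE column is located by its header name rather than by a fixed position.
list_genes() {
    awk -F'\t' '
        NR == 1 { for (i = 1; i <= NF; i++) if ($i == "GENE") col = i; next }
        col     { print $col }
    ' "$1" | sort -u
}

ASSEMBLY_DIR="demo/Component01.data.subset"   # assumed location of the assemblies

if command -v abricate >/dev/null 2>&1 && ls "${ASSEMBLY_DIR}"/*.fna >/dev/null 2>&1; then
    mkdir -p abricate-output
    # screen the assemblies against the virulence factor database
    abricate --db vfdb "${ASSEMBLY_DIR}"/*.fna > abricate-output/vfdb.tab
    # presence/absence matrix of genes across all screened genomes
    abricate --summary abricate-output/vfdb.tab > abricate-output/vfdb.summary.tab
    list_genes abricate-output/vfdb.tab
else
    echo "abricate or the assemblies were not found; skipping virulence screen" >&2
fi
```

The summary table makes it easy to compare the outbreak strain with the other isolates, while `list_genes` gives a plain list of the hits to report.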