# Tutorial

This tutorial will walk you through running `staramr` on some example genomes to investigate AMR genes and point mutations.  The data we will use are two RefSeq assemblies that are available on NCBI: [GCF_001478105.1](https://www.ncbi.nlm.nih.gov/assembly/GCA_001478105.1) and [GCF_001931595.1](https://www.ncbi.nlm.nih.gov/assembly/GCA_001931595.1).

## Step 1: Download input files

You may download the input files with the following commands:

In [36]:
wget -O GCF_001478105.1.fasta.gz ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/001/478/105/GCF_001478105.1_Salmonella_enterica_CVM_N31384-SQ_v1.0/GCF_001478105.1_Salmonella_enterica_CVM_N31384-SQ_v1.0_genomic.fna.gz
gunzip GCF_001478105.1.fasta.gz

wget -O GCF_001931595.1.fasta.gz ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/001/931/595/GCF_001931595.1_ASM193159v1/GCF_001931595.1_ASM193159v1_genomic.fna.gz
gunzip GCF_001931595.1.fasta.gz

--2018-05-10 09:36:34--  ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/001/478/105/GCF_001478105.1_Salmonella_enterica_CVM_N31384-SQ_v1.0/GCF_001478105.1_Salmonella_enterica_CVM_N31384-SQ_v1.0_genomic.fna.gz
           => ‘GCF_001478105.1.fasta.gz’
Resolving ftp.ncbi.nlm.nih.gov (ftp.ncbi.nlm.nih.gov)... 165.112.9.229, 2607:f220:41e:250::7
Connecting to ftp.ncbi.nlm.nih.gov (ftp.ncbi.nlm.nih.gov)|165.112.9.229|:21... connected.
Logging in as anonymous ... Logged in!
==> SYST ... done.    ==> PWD ... done.
==> TYPE I ... done.  ==> CWD (1) /genomes/all/GCF/001/478/105/GCF_001478105.1_Salmonella_enterica_CVM_N31384-SQ_v1.0 ... done.
==> SIZE GCF_001478105.1_Salmonella_enterica_CVM_N31384-SQ_v1.0_genomic.fna.gz ... 1454519
==> PASV ... done.    ==> RETR GCF_001478105.1_Salmonella_enterica_CVM_N31384-SQ_v1.0_genomic.fna.gz ... done.
Length: 1454519 (1.4M) (unauthoritative)


2018-05-10 09:36:34 (5.02 MB/s) - ‘GCF_001478105.1.fasta.gz’ saved [1454519]

--2018-05-10 09:36:35--  ftp://ftp.ncbi.n

Now we have some files to work with:

In [37]:
ls

GCF_001478105.1.fasta  GCF_001931595.1.fasta  staramr-tutorial.ipynb
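
Before moving on, it can be worth checking that the downloads decompressed into usable FASTA files. The following is a minimal sketch, not part of `staramr`: `count_fasta_records` is a hypothetical helper name, and it simply counts header lines (lines starting with `>`).

```shell
# Hypothetical helper (not part of staramr): count the sequence records in a
# FASTA file by counting its header lines, which start with '>'.
count_fasta_records() {
    # grep -c prints the number of matching lines. It exits non-zero when the
    # count is 0, so "|| true" keeps the helper usable under "set -e".
    grep -c '^>' "$1" || true
}

# Example usage on the files downloaded above:
#   count_fasta_records GCF_001478105.1.fasta
#   count_fasta_records GCF_001931595.1.fasta
```

A count of `0` would suggest a truncated or failed download.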


## Step 2: Run staramr

Now that we have some assembled genomes to work with, let's run `staramr`.

First, let's check which versions of the databases we are using:

In [38]:
staramr db info

resfinder_db_dir              = /home/aaron/workspace/staramr/staramr/databases/data/dist/resfinder
resfinder_db_url              = https://bitbucket.org/genomicepidemiology/resfinder_db.git
resfinder_db_commit           = dc33e2f9ec2c420f99f77c5c33ae3faa79c999f2
resfinder_db_date             = Tue, 20 Mar 2018 16:49
pointfinder_db_dir            = /home/aaron/workspace/staramr/staramr/databases/data/dist/pointfinder
pointfinder_db_url            = https://bitbucket.org/genomicepidemiology/pointfinder_db.git
pointfinder_db_commit         = ba65c4d175decdc841a0bef9f9be1c1589c0070a
pointfinder_db_date           = Fri, 06 Apr 2018 09:02
pointfinder_gene_drug_version = 050218
resfinder_gene_drug_version   = 050218


Everything looks good there. Now, let's run `staramr`:

In [39]:
staramr search --pointfinder-organism salmonella -o out *.fasta

2018-05-10 09:36:44,439 INFO: Scheduling blast for GCF_001478105.1.fasta
2018-05-10 09:36:44,514 INFO: Scheduling blast for GCF_001931595.1.fasta
2018-05-10 09:36:50,243 INFO: Finished. Took 0.10 minutes.
2018-05-10 09:36:50,246 INFO: Predicting AMR resistance phenotypes is enabled. The predictions are for microbiological resistance and *not* clinical resistance. This is an experimental feature which is continually being improved.
2018-05-10 09:36:50,329 INFO: Output files in out


There, that wasn't too long.

## Step 3: Examine results

Now, let's inspect some of the results. First, let's look at what files were produced:

In [40]:
ls out/

[0m[01;34mhits[0m  pointfinder.tsv  resfinder.tsv  results.xlsx  settings.txt  summary.tsv


The __*.tsv__ files contain the primary results we're interested in. The **settings.txt** file contains all the settings used to run `staramr`. The **results.xlsx** file contains all the previous files as separate worksheets in an Excel file. And the **hits/** directory contains the AMR gene hits as FASTA files.
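
If you script this step, a quick check that every expected output is present can catch a failed run early. This is a sketch under the assumption that a run produces exactly the files listed above; `missing_outputs` is a hypothetical helper, not a `staramr` command.

```shell
# Hypothetical helper: print the name of every expected staramr output that
# is missing from the given output directory. The names are the ones listed
# by "ls out/" above for this particular run.
missing_outputs() {
    out_dir=$1
    for name in summary.tsv resfinder.tsv pointfinder.tsv settings.txt results.xlsx hits; do
        # -e is true for regular files and for directories (hits is a directory).
        [ -e "$out_dir/$name" ] || echo "$name"
    done
}

# Example usage:
#   missing_outputs out
```

No output means that everything listed above was found.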

Let's take a look at these files in turn.

_Note that the command `column -s$'\t' -t file.tsv` is used below. It prints the file as an aligned table (`-t`), using the tab character as the column delimiter (`-s$'\t'`)._

In [41]:
column -s$'\t' -t out/summary.tsv

Isolate ID       Genotype                                                                                 Predicted Phenotype
GCF_001478105.1  blaCMY-2                                                                                 ampicillin, amoxicillin/clavulanic acid, cefoxitin, ceftriaxone
GCF_001931595.1  aac(3)-IVa, aph(3')-Ia, aph(4)-Ia, blaCTX-M-65, dfrA14, floR, gyrA (D87Y), sul1, tet(A)  gentamicin, kanamycin, hygromicin, ampicillin, ceftriaxone, trimethoprim, chloramphenicol, ciprofloxacin I/R, nalidixic acid, unknown[sul1_2_CP002151], tetracycline


This contains a summary of all the results in a single table, one genome per line. According to these results, the genomes _GCF_001478105.1_ and _GCF_001931595.1_ contain the AMR genes listed under **Genotype** and are predicted to be resistant to the drugs listed under **Predicted Phenotype**. This also shows what you'll get when a gene-to-drug mapping is missing from our database: an entry such as `unknown[sul1_2_CP002151]`, which indicates that gene **sul1_2_CP002151** has no entry. The database is still under development, so some gene-to-drug entries may be missing. The results also depend on the exact versions of the ResFinder and PointFinder databases you have installed.
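
To find every missing gene-to-drug mapping in a larger run, the `unknown[...]` entries can be pulled out of **summary.tsv**. The sketch below assumes the tab-separated layout shown above, with drugs separated by `, ` in a column named `Predicted Phenotype`; `list_unknown_phenotypes` is a hypothetical helper name.

```shell
# Hypothetical helper: print "isolate<TAB>gene_id" for every unknown[...]
# entry in the "Predicted Phenotype" column of a staramr summary.tsv.
list_unknown_phenotypes() {
    awk -F'\t' '
        NR == 1 {
            # Locate the column by its header instead of assuming a position.
            for (i = 1; i <= NF; i++) if ($i == "Predicted Phenotype") col = i
            next
        }
        col {
            n = split($col, drugs, ", ")
            for (i = 1; i <= n; i++) {
                if (drugs[i] ~ /^unknown\[/) {
                    id = drugs[i]
                    sub(/^unknown\[/, "", id)   # drop the "unknown[" prefix
                    sub(/\].*$/, "", id)        # drop the closing "]"
                    print $1 "\t" id
                }
            }
        }
    ' "$1"
}

# Example usage:
#   list_unknown_phenotypes out/summary.tsv
```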

In [42]:
column -s$'\t' -t out/resfinder.tsv

Isolate ID       Gene         Predicted Phenotype                                              %Identity  %Overlap  HSP Length/Total Length  Contig             Start   End     Accession
GCF_001931595.1  aph(4)-Ia    hygromicin                                                       100.00     100.00    1026/1026                NZ_CP016411.1      290639  291664  V01499
GCF_001931595.1  aph(3')-Ia   kanamycin                                                        99.39      99.75     814/816                  NZ_CP016411.1      300747  301560  V00359
GCF_001931595.1  aac(3)-IVa   gentamicin                                                       99.87      100.00    786/786                  NZ_CP016411.1      291885  292669  X01385
GCF_001931595.1  sul1         unknown[sul1_2_CP002151]                                         100.00     100.00    927/927                  NZ_CP016411.1      159069  159995  CP002151
GCF_001931595.1  floR         chloramphenicol                                   

This shows all the BLAST hits to the **ResFinder** database, each hit on a single line.
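
Hits that are not a perfect match are often the ones worth a second look. As a minimal sketch, the hypothetical helper below prints the header row plus every hit whose `%Identity` value falls below a threshold. It assumes a tab-separated file with a column named exactly `%Identity`, as in the table above.

```shell
# Hypothetical helper: print the header row plus every row whose %Identity
# is below a threshold (second argument, default 100).
hits_below_identity() {
    awk -F'\t' -v min="${2:-100}" '
        NR == 1 {
            # Locate the %Identity column by its header name.
            for (i = 1; i <= NF; i++) if ($i == "%Identity") col = i
            print
            next
        }
        # Adding 0 forces a numeric comparison rather than a string one.
        col && ($col + 0) < (min + 0)
    ' "$1"
}

# Example usage:
#   hits_below_identity out/resfinder.tsv | column -s$'\t' -t
#   hits_below_identity out/resfinder.tsv 99.5
```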

In [43]:
column -s$'\t' -t out/pointfinder.tsv

Isolate ID       Gene         Predicted Phenotype                Type   Position  Mutation             %Identity  %Overlap  HSP Length/Total Length  Contig         Start    End
GCF_001931595.1  gyrA (D87Y)  ciprofloxacin I/R, nalidixic acid  codon  87        GAC -> TAC (D -> Y)  99.43      100.00    2637/2637                NZ_CP016410.1  1597907  1600543


This shows all the acquired point mutations leading to antimicrobial resistance, one per line.

In [44]:
cat out/settings.txt

command_line                  = /home/aaron/miniconda2/envs/jupyterlab/bin/staramr search --pointfinder-organism salmonella -o out GCF_001478105.1.fasta GCF_001931595.1.fasta
version                       = 0.2.0.dev0
start_time                    = 2018-05-10 09:36:44
end_time                      = 2018-05-10 09:36:50
total_minutes                 = 0.10
resfinder_db_dir              = /home/aaron/workspace/staramr/staramr/databases/data/dist/resfinder
resfinder_db_url              = https://bitbucket.org/genomicepidemiology/resfinder_db.git
resfinder_db_commit           = dc33e2f9ec2c420f99f77c5c33ae3faa79c999f2
resfinder_db_date             = Tue, 20 Mar 2018 16:49
pointfinder_db_dir            = /home/aaron/workspace/staramr/staramr/databases/data/dist/pointfinder
pointfinder_db_url            = https://bitbucket.org/genomicepidemiology/pointfinder_db.git
pointfinder_db_commit         = ba65c4d175decdc841a0bef9f9be1c1589c0070a
pointfinder_db_date           = Fri, 06 Apr 2018 09:02

This shows the command-line options used to run `staramr`, runtime, as well as the **ResFinder** and **PointFinder** database versions.
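
Because **settings.txt** uses plain `key = value` lines, single values are easy to extract, for example to record which database commit a set of results came from. The following is a sketch only; `get_setting` is a hypothetical helper and assumes that keys do not themselves contain `=`.

```shell
# Hypothetical helper: print the value for one key from "key = value" lines,
# such as those in settings.txt or in the output of "staramr db info".
# Usage: get_setting KEY [FILE]   (reads standard input when FILE is omitted)
get_setting() {
    awk -v key="$1" '
        {
            pos = index($0, "=")          # split on the first "=" only
            if (pos == 0) next
            k = substr($0, 1, pos - 1)
            v = substr($0, pos + 1)
            gsub(/^[ \t]+|[ \t]+$/, "", k)   # trim the padding around the key
            gsub(/^[ \t]+|[ \t]+$/, "", v)   # trim the padding around the value
            if (k == key) { print v; exit }
        }
    ' "${2:--}"
}

# Example usage:
#   get_setting resfinder_db_commit out/settings.txt
#   staramr db info | get_setting pointfinder_db_commit
```

Comparing the two example calls for the same key is one way to confirm that older results were produced with the databases currently installed.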

In [45]:
ls out/hits

pointfinder_GCF_001931595.1.fasta  resfinder_GCF_001931595.1.fasta
resfinder_GCF_001478105.1.fasta


This directory contains the sequences of all the BLAST hits reported in the `out/{resfinder,pointfinder}.tsv` files, in FASTA format.

In [47]:
head out/hits/resfinder_GCF_001931595.1.fasta

>aph(4)-Ia_1_V01499 isolate: GCF_001931595.1, contig: NZ_CP016411.1, contig_start: 290639, contig_end: 291664, resistance_gene_start: 1026, resistance_gene_end: 1, hsp/length: 1026/1026, pid: 100.00%, plength: 100.00%
ATGAAAAAGCCTGAACTCACCGCGACGTCTGTCGAGAAGTTTCTGATCGAAAAGTTCGAC
AGCGTCTCCGACCTGATGCAGCTCTCGGAGGGCGAAGAATCTCGTGCTTTCAGCTTCGAT
GTAGGAGGGCGTGGATATGTCCTGCGGGTAAATAGCTGCGCCGATGGTTTCTACAAAGAT
CGTTATGTTTATCGGCACTTTGCATCGGCCGCGCTCCCGATTCCGGAAGTGCTTGACATT
GGGGAATTCAGCGAGAGCCTGACCTATTGCATCTCCCGCCGTGCACAGGGTGTCACGTTG
CAAGACCTGCCTGAAACCGAACTGCCCGCTGTTCTGCAGCCGGTCGCGGAGGCCATGGAT
GCGATCGCTGCGGCCGATCTTAGCCAGACGAGCGGGTTCGGCCCATTCGGACCGCAAGGA
ATCGGTCAATACACTACATGGCGTGATTTCATATGCGCGATTGCTGATCCCCATGTGTAT
CACTGGCAAACTGTGATGGACGACACCGTCAGTGCGTCCGTCGCGCAGGCTCTCGATGAG
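
Since `head` only shows the first record, listing just the record identifiers gives a quicker overview of what a hits file contains. This sketch assumes the header layout shown above, where the identifier is the first word after `>`; `list_fasta_ids` is a hypothetical helper name.

```shell
# Hypothetical helper: print the identifier of every record in a FASTA file,
# i.e. the first word of each header line without the leading '>'.
list_fasta_ids() {
    sed -n 's/^>\([^ ]*\).*/\1/p' "$1"
}

# Example usage:
#   list_fasta_ids out/hits/resfinder_GCF_001931595.1.fasta
```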


## Step 4: Validation

Let's look back at our **summary.tsv** file.

In [48]:
column -s$'\t' -t out/summary.tsv

Isolate ID       Genotype                                                                                 Predicted Phenotype
GCF_001478105.1  blaCMY-2                                                                                 ampicillin, amoxicillin/clavulanic acid, cefoxitin, ceftriaxone
GCF_001931595.1  aac(3)-IVa, aph(3')-Ia, aph(4)-Ia, blaCTX-M-65, dfrA14, floR, gyrA (D87Y), sul1, tet(A)  gentamicin, kanamycin, hygromicin, ampicillin, ceftriaxone, trimethoprim, chloramphenicol, ciprofloxacin I/R, nalidixic acid, unknown[sul1_2_CP002151], tetracycline


We can validate these results by comparing them to the AMR genes and drug resistances available from NCBI.  Let's take a look:

### GCF_001478105.1

#### Genotypes

For **GCF_001478105.1** we can find the AMR genes detected by NCBI at <https://www.ncbi.nlm.nih.gov/pathogens/isolates/#/search/GCA_001478105.1>.  From here we see that `blaCMY-2` is listed under the **AMR genotypes** column, which exactly matches what we see from `staramr`.

#### Predicted Phenotypes

The phenotypes are also in this same table under **AST Phenotypes** (or at <https://www.ncbi.nlm.nih.gov/biosample/SAMN02699230/>). This contains the list: `amoxicillin-clavulanic acid, ampicillin, cefoxitin, ceftiofur, ceftriaxone`. Comparing this to the results from `staramr`, we can see that `staramr` is missing `ceftiofur`.
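
This kind of comparison can also be done mechanically rather than by eye. The sketch below is one possible approach, not something `staramr` provides: both helper names are hypothetical, and the normalization (lowercasing and treating `/` as `-`) is an assumption chosen so that `amoxicillin/clavulanic acid` matches `amoxicillin-clavulanic acid`.

```shell
# Hypothetical helper: turn a comma-separated drug list into one normalized
# name per line (trimmed, lowercase, "/" replaced by "-", sorted, unique).
normalize_drugs() {
    printf '%s\n' "$1" | tr ',' '\n' | sed 's/^ *//; s/ *$//' \
        | tr 'A-Z' 'a-z' | tr '/' '-' | sort -u
}

# Hypothetical helper: print the drugs that appear in the second list but
# not in the first.
missing_from_first() {
    list_a=$(mktemp)
    list_b=$(mktemp)
    normalize_drugs "$1" > "$list_a"
    normalize_drugs "$2" > "$list_b"
    # comm -13 suppresses lines unique to file 1 and lines common to both,
    # leaving only the lines unique to file 2. Both inputs must be sorted.
    comm -13 "$list_a" "$list_b"
    rm -f "$list_a" "$list_b"
}

# Example usage with the two lists for GCF_001478105.1:
#   missing_from_first \
#       'ampicillin, amoxicillin/clavulanic acid, cefoxitin, ceftriaxone' \
#       'amoxicillin-clavulanic acid, ampicillin, cefoxitin, ceftiofur, ceftriaxone'
```

Swapping the two arguments lists the drugs that only `staramr` predicts. Suffixes such as `I/R` are not handled by this normalization, so an entry like `ciprofloxacin I/R` would need extra cleanup before it could match `ciprofloxacin`.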

### GCF_001931595.1

#### Genotypes

For **GCF_001931595.1** we can find the AMR genes detected by NCBI at <https://www.ncbi.nlm.nih.gov/pathogens/isolates/#/search/GCA_001931595.1>.  From here we see that `aac(3)-IV, aph(3')-Ia, aph(4)-Ia, blaCTX-M-65, dfrA14, floR, sul1, tet(A)` are listed under the **AMR genotypes** column. Compared to NCBI, `staramr` reports one additional entry, namely `gyrA (D87Y)`, which is a resistance-conferring point mutation rather than an acquired gene.

#### Predicted Phenotypes

The phenotypes are also in this same table under **AST Phenotypes** (or at <https://www.ncbi.nlm.nih.gov/biosample/SAMN03988471>). This contains the list (when including the *Intermediate* category): `ampicillin, ceftiofur, ceftriaxone, chloramphenicol, nalidixic acid, tetracycline, ciprofloxacin, gentamicin`. Comparing this to the results from `staramr`, we can see that `staramr` is missing `ceftiofur` and additionally includes `kanamycin, hygromicin, trimethoprim`.

I am currently unsure which set of antimicrobial resistances is correct, but this does highlight the need for additional testing of the AMR predictions produced by `staramr`, which is an ongoing effort.