# piezo

This Jupyter notebook walks you through the basic functionality of the classes provided by `piezo`, using the resistance catalogue `config/LID2015-RSU-catalogue-v1.0-H37rV_v2.csv`. The catalogue is almost exactly what is contained in the Supplement of this paper:

Walker TM, Kohl TA, Omar SV, Hedge J, Del Ojo Elias C, et al. Whole-genome sequencing for prediction of Mycobacterium tuberculosis drug susceptibility and resistance: a retrospective cohort study. Lancet Infect Dis 2015;15:1193–202. doi:10.1016/S1473-3099(15)00062-6

Note that this catalogue is defined relative to version 2 of the H37Rv GenBank reference (the current version is 3).

The differences are:
* some of the rows have not been included, in particular some very detailed indels, as I don't know what they mean
* some default rows have been added which describe the underlying assumptions, such as that any synonymous mutation is susceptible (in other words, all the logic is now contained in the catalogue, not in the code)
* the mutations are formatted according to the General Ontology for Antimicrobial Resistance Catalogues (GOARC), which is described in the NOMENCLATURE.md file in the repo and also [here](http://fowlerlab.org/2018/11/25/goarc-a-general-ontology-for-antimicrobial-resistance-catalogues/)

Philip W Fowler

24 Jan 2019

In [2]:
# we will need pandas so we can have a look at the Resistance Catalogue which is stored as a CSV
import pandas

import piezo

In [3]:
lid_catalogue=pandas.read_csv("config/LID2015-RSU-catalogue-v1.0-H37rV_v2.csv")

In [4]:
lid_catalogue[:5]

Unnamed: 0,DRUG,GENE,MUTATION,POSITION,GENE_MUTATION,GENE_TYPE,VARIANT_AFFECTS,VARIANT_TYPE,INDEL_1,INDEL_2,INDEL_3,GENBANK_REFERENCE,LID2015A_PREDICTION,LID2015B_PREDICTION
0,AMI,gidB,202_indel,202,gidB_202_indel,GENE,CDS,INDEL,202_del,202_del_1,,NC_000962.2,U,U
1,AMI,gidB,215_indel,215,gidB_215_indel,GENE,CDS,INDEL,215_del,215_del_1,,NC_000962.2,S,S
2,AMI,gidB,293_indel,293,gidB_293_indel,GENE,CDS,INDEL,293_ins,293_ins_2,293_ins_ac,NC_000962.2,,U
3,AMI,gidB,399_indel,399,gidB_399_indel,GENE,CDS,INDEL,399_del,399_del_10,,NC_000962.2,,S
4,AMI,gidB,451_indel,451,gidB_451_indel,GENE,CDS,INDEL,451_del,451_del_1,,NC_000962.2,,S


Let's check a well-known mutation that confers resistance to rifampicin

In [5]:
lid_catalogue.loc[lid_catalogue.GENE_MUTATION=="rpoB_S450L"]

Unnamed: 0,DRUG,GENE,MUTATION,POSITION,GENE_MUTATION,GENE_TYPE,VARIANT_AFFECTS,VARIANT_TYPE,INDEL_1,INDEL_2,INDEL_3,GENBANK_REFERENCE,LID2015A_PREDICTION,LID2015B_PREDICTION
2023,RIF,rpoB,S450L,450,rpoB_S450L,GENE,CDS,SNP,,,,NC_000962.2,R,R


These are the default rules that have been added for this gene

In [6]:
lid_catalogue.loc[(lid_catalogue.GENE=="rpoB") & (lid_catalogue.POSITION=="*")]

Unnamed: 0,DRUG,GENE,MUTATION,POSITION,GENE_MUTATION,GENE_TYPE,VARIANT_AFFECTS,VARIANT_TYPE,INDEL_1,INDEL_2,INDEL_3,GENBANK_REFERENCE,LID2015A_PREDICTION,LID2015B_PREDICTION
2453,RIF,rpoB,*=,*,rpoB_*=,GENE,CDS,SNP,,,,NC_000962.2,S,S
2454,RIF,rpoB,*?,*,rpoB_*?,GENE,CDS,SNP,,,,NC_000962.2,U,U
2455,RIF,rpoB,-*?,*,rpoB_-*?,GENE,PROM,SNP,,,,NC_000962.2,U,U
2456,RIF,rpoB,*_indel,*,rpoB_*_indel,GENE,CDS,INDEL,,,,NC_000962.2,U,U
2457,RIF,rpoB,-*_indel,*,rpoB_-*_indel,GENE,PROM,INDEL,,,,NC_000962.2,U,U


`*=` means any synonymous mutation in the protein coding sequence (CDS). Here `*` is reserved for 'any position' and `!` for the Stop codon. `*?` is any non-synonymous mutation in the CDS and `*_indel` is any insertion or deletion at any position in the CDS. Finally, `-*?` and `-*_indel` mean any non-synonymous mutation or any insertion/deletion in the promoter, respectively.
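The wildcard notation above can be summarised as a small lookup table. This is only an illustrative sketch: `DEFAULT_RULES` and `describe_default_rule` are hypothetical names that are not part of `piezo`, and the descriptions simply restate the text above.

```python
# Descriptions of the default (wildcard) rules, restating the paragraph above.
DEFAULT_RULES = {
    "*=": "any synonymous mutation in the CDS",
    "*?": "any non-synonymous mutation in the CDS",
    "*_indel": "any insertion or deletion in the CDS",
    "-*?": "any non-synonymous mutation in the promoter",
    "-*_indel": "any insertion or deletion in the promoter",
}


def describe_default_rule(gene_mutation):
    """Split e.g. 'rpoB_*_indel' into (gene, description).

    Hypothetical helper: the gene name is everything before the first
    underscore, the remainder is looked up in DEFAULT_RULES.
    """
    gene, _, mutation = gene_mutation.partition("_")
    return gene, DEFAULT_RULES.get(mutation, "not a default rule")


for name in ["rpoB_*=", "rpoB_-*_indel", "rpoB_S450L"]:
    print(name, "->", describe_default_rule(name))
```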

## Instantiating a catalogue

You need to specify the catalogue CSV, the matching GenBank file and the name of the catalogue. The name is needed because a single CSV can hold several prediction columns, one per catalogue. For the catalogue name `LID2015B` used below there MUST be a column called `LID2015B_PREDICTION`, otherwise the code will complain.
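Before instantiating a catalogue it can be handy to confirm that the CSV has the expected prediction column. The sketch below assumes the column is always named by appending `_PREDICTION` to the catalogue name, which matches the `LID2015A_PREDICTION` and `LID2015B_PREDICTION` columns shown earlier but is otherwise an assumption; both function names are hypothetical and not part of `piezo`.

```python
import csv
import io


def prediction_column(catalogue_name):
    # Assumption: column name = catalogue name + "_PREDICTION"
    return catalogue_name + "_PREDICTION"


def has_prediction_column(csv_text, catalogue_name):
    """Return True if the CSV header contains the catalogue's prediction column."""
    header_row = next(csv.reader(io.StringIO(csv_text)))
    return prediction_column(catalogue_name) in header_row


# Abbreviated header taken from the catalogue displayed earlier
header = "DRUG,GENE,MUTATION,LID2015A_PREDICTION,LID2015B_PREDICTION\n"

print(has_prediction_column(header, "LID2015B"))
print(has_prediction_column(header, "LID2015C"))
```

For a real file you would pass the contents of the catalogue CSV instead of the abbreviated header.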



In [7]:
cat=piezo.ResistanceCatalogue(input_file="config/LID2015-RSU-catalogue-v1.0-H37rV_v2.csv",
                              genbank_file="config/H37rV_v2.gbk",
                              catalogue_name="LID2015B")

## Predicting the effect of a mutation

Now it is as simple as calling the `predict()` method. This method takes two arguments: (1) `gene_mutation`, which requires the mutation in the form `gene_mutation`, e.g. `rpoB_S450L` or `rpoB_1300_ins`, and (2) `verbose`, which, if set to `True`, prints to STDOUT the rules that have been met for that mutation and their associated priorities. The prediction with the highest priority is then chosen.

In [8]:
cat.predict(gene_mutation='rpoB_S450L')

{'RIF': 'R'}

A dictionary is returned since a single genetic variant can affect multiple drugs, for example

In [9]:
cat.predict(gene_mutation='gyrA_A90V')

{'CIP': 'U', 'MXF': 'R', 'OFX': 'R'}
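Because the result is a plain dictionary keyed by drug, it is easy to post-process. As a small sketch (the helper `drugs_with_prediction` is a hypothetical name, not part of `piezo`), this lists the drugs with a given predicted phenotype:

```python
def drugs_with_prediction(predictions, phenotype):
    """Return the sorted drugs whose predicted phenotype equals `phenotype`."""
    return sorted(drug for drug, call in predictions.items() if call == phenotype)


# Dictionary copied from the gyrA_A90V output above
result = {'CIP': 'U', 'MXF': 'R', 'OFX': 'R'}

print(drugs_with_prediction(result, 'R'))
print(drugs_with_prediction(result, 'U'))
```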

These are the rows in the catalogue that will be considered for a mutation at `rpoB_S450` 

In [10]:
cat.resistance_catalogue.loc[(cat.resistance_catalogue.GENE=='rpoB') & 
                             (cat.resistance_catalogue.POSITION.isin(['*','450'])) &
                             (cat.resistance_catalogue.VARIANT_TYPE=='SNP') & 
                             (cat.resistance_catalogue.VARIANT_AFFECTS=='CDS')]

Unnamed: 0,DRUG,GENE,MUTATION,POSITION,GENE_MUTATION,GENE_TYPE,VARIANT_AFFECTS,VARIANT_TYPE,INDEL_1,INDEL_2,INDEL_3,GENBANK_REFERENCE,LID2015A_PREDICTION,LID2015B_PREDICTION
2022,RIF,rpoB,S450F,450,rpoB_S450F,GENE,CDS,SNP,,,,NC_000962.2,R,R
2023,RIF,rpoB,S450L,450,rpoB_S450L,GENE,CDS,SNP,,,,NC_000962.2,R,R
2024,RIF,rpoB,S450Q,450,rpoB_S450Q,GENE,CDS,SNP,,,,NC_000962.2,,U
2025,RIF,rpoB,S450W,450,rpoB_S450W,GENE,CDS,SNP,,,,NC_000962.2,R,R
2453,RIF,rpoB,*=,*,rpoB_*=,GENE,CDS,SNP,,,,NC_000962.2,S,S
2454,RIF,rpoB,*?,*,rpoB_*?,GENE,CDS,SNP,,,,NC_000962.2,U,U


So, let's repeat the prediction and see which rules are hit

In [11]:
cat.predict(gene_mutation='rpoB_S450L',verbose=True)

RIF rpoB_S450L
2. U nonsyn SNP at any position in the CDS or PROM
4. R exact SNP match
----------------------


{'RIF': 'R'}

As you can see, the mutation first matches `rpoB_*?` (any non-synonymous mutation in the protein coding region of rpoB), which has a priority of 2. It then matches the exact rule for `rpoB_S450L`, which has a priority of 4. The predicted phenotype of the rule with the highest priority is returned, in this case `R` as expected.

This hierarchical approach will become increasingly important. For example, the `fabG1_L203L` synonymous mutation is (unusually) predicted to confer resistance to INH. Simply adding a specific row to the catalogue ensures that the specific rule overrides the default rule that any synonymous mutation is assumed to be susceptible.
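The selection step described above amounts to taking the matched rule with the largest priority. The following is a minimal sketch of that idea, not the code used inside `piezo`: `choose_prediction` is a hypothetical function, the priorities 2 and 4 come from the verbose output above, and the priority of 1 for the synonymous default rule is an assumed value used purely for illustration.

```python
def choose_prediction(matched_rules):
    """Pick the prediction of the highest-priority rule.

    matched_rules is a list of (priority, prediction) tuples.
    Ties are not handled specially in this sketch.
    """
    if not matched_rules:
        raise ValueError("no rule matched")
    _, prediction = max(matched_rules, key=lambda rule: rule[0])
    return prediction


# rpoB_S450L: priorities 2 and 4 as printed by verbose=True above
print(choose_prediction([(2, "U"), (4, "R")]))

# fabG1_L203L: a specific row (priority 4) overriding the synonymous default;
# the priority of 1 for the default rule is an ASSUMED value
print(choose_prediction([(1, "S"), (4, "R")]))
```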

## Validation

The code uses a separate repository called [gemucator](https://github.com/philipwfowler/gemucator). You instantiate a copy of the class with the correct (i.e. same) GenBank file.

This all happens "under the hood" in the piezo module, but it is useful to see the extent of validation that is performed on each mutation.

In [12]:
from gemucator import gemucator

In [13]:
tb=gemucator(genbank_file="config/H37rV_v2.gbk")

The two methods most useful for validation are `valid_gene` and `valid_mutation`. 

In [14]:
tb.valid_gene("rpoB")

True

Let's try a gene that doesn't exist

In [15]:
tb.valid_gene("rpoD")

False

In [16]:
tb.valid_mutation("rpoB_S450L")

True

And now a mutation that is not valid. This checks the reference (REF) amino acid against the GenBank file, which isn't often done but matters, as a mismatch means there is an inconsistency between the catalogue and the GenBank file.

In [17]:
tb.valid_mutation("rpoB_T450L")

False

It also insists that amino acids are UPPERCASE and drawn from the standard 20-letter amino acid alphabet, and that nucleotides are lowercase `[a,c,t,g]`

In [18]:
tb.valid_mutation("rpoB_J450L")

AssertionError: J is not an amino acid!

In [19]:
tb.valid_mutation("rpoB_c-15t")

True

In [20]:
tb.valid_mutation("rpoB_d-15t")

AssertionError: d is not a nucleotide!
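The alphabet checks shown above can be approximated with a regular expression. This is a hedged sketch rather than gemucator's implementation: `check_alphabet` is a hypothetical function, it only handles simple substitutions such as `S450L` or `c-15t`, it raises `ValueError` where the cells above show `AssertionError`, and it does not compare the reference letter against a GenBank file.

```python
import re

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard one-letter codes
NUCLEOTIDES = "acgt"

# reference letter, (possibly negative) position, alternate letter
SNP = re.compile(r"^([A-Za-z])(-?\d+)([A-Za-z])$")


def check_alphabet(mutation):
    """Return True if `mutation` (without the gene prefix) uses valid letters."""
    match = SNP.match(mutation)
    if match is None:
        raise ValueError(mutation + " is not of the form S450L or c-15t")
    ref, _, alt = match.groups()
    if ref.isupper() != alt.isupper():
        raise ValueError(mutation + " mixes amino acid and nucleotide case")
    for letter in (ref, alt):
        if letter.isupper() and letter not in AMINO_ACIDS:
            raise ValueError(letter + " is not an amino acid!")
        if letter.islower() and letter not in NUCLEOTIDES:
            raise ValueError(letter + " is not a nucleotide!")
    return True


for candidate in ["S450L", "c-15t", "J450L", "d-15t"]:
    try:
        print(candidate, check_alphabet(candidate))
    except ValueError as error:
        print(candidate, "rejected:", error)
```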

Now let's see this validation in action inside `piezo`

In [21]:
cat.predict(gene_mutation='rpoB_J450L')

AssertionError: J is not an amino acid!

In [22]:
cat.predict(gene_mutation='rpoB_D450L')

AssertionError: gene exists but rpoB_D450L is badly formed; check the reference amino acid or nucleotide!

In [23]:
cat.predict(gene_mutation='rpoD_S450L')

AssertionError: rpoD does not exist in the specified GENBANK file!

It should be able to parse all the styles of mutations that are defined by GOARC, e.g.

In [25]:
cat.predict(gene_mutation='rpoB_S450?')

{'RIF': 'U'}

In GOARC, INDELs are put into a hierarchy, from the general (`indel`) to the specific (the inserted bases), and the code parses all of these forms

In [28]:
cat.predict(gene_mutation='katG_100_indel')

{'INH': 'U'}

In [29]:
cat.predict(gene_mutation='katG_100_ins')

{'INH': 'U'}

In [30]:
cat.predict(gene_mutation='katG_100_del')

{'INH': 'U'}

In [31]:
cat.predict(gene_mutation='katG_100_ins_4')

{'INH': 'U'}

In [32]:
cat.predict(gene_mutation='katG_100_ins_actg')

{'INH': 'U'}
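The levels of the INDEL hierarchy mirror the `MUTATION`, `INDEL_1`, `INDEL_2` and `INDEL_3` columns of the catalogue shown at the start (e.g. `293_indel`, `293_ins`, `293_ins_2`, `293_ins_ac`). The sketch below expands a specific indel into its more general forms; `indel_hierarchy` is a hypothetical helper written for illustration and is not how `piezo` itself is implemented.

```python
def indel_hierarchy(mutation):
    """Return the forms of an indel from most general to most specific.

    `mutation` is given without the gene prefix, e.g. '100_ins_actg'.
    """
    fields = mutation.split("_")
    if len(fields) not in (2, 3) or fields[1] not in ("indel", "ins", "del"):
        raise ValueError(mutation + " is not a recognised indel")
    position, kind = fields[0], fields[1]

    levels = [position + "_indel"]
    if kind == "indel":
        return levels
    levels.append(f"{position}_{kind}")
    if len(fields) == 3:
        detail = fields[2]
        if detail.isdigit():
            # only the length is known, e.g. 100_ins_4
            levels.append(f"{position}_{kind}_{detail}")
        else:
            # exact bases are known, so the length follows from them
            levels.append(f"{position}_{kind}_{len(detail)}")
            levels.append(f"{position}_{kind}_{detail}")
    return levels


gene, _, mutation = "katG_100_ins_actg".partition("_")
print(gene, indel_hierarchy(mutation))
```

Each of these forms is what the five `katG_100_...` calls above pass to `predict()`.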