**We would like to thank all MalariaGEN Plasmodium falciparum Community Project partners for their contribution. If you use this resource please remember to also cite the following studies:**
[Pf6 partner studies](http://ngs.sanger.ac.uk/production/malaria/pfcommunityproject/Pf6/Pf_6_partner_studies.pdf) and [GenRe partner studies](http://ngs.sanger.ac.uk/production/malaria/Resource/29/20200705-GenRe-07-PartnerStudyInformation-0.39.pdf).

# Phenotyper

This notebook allows you to use the **Phenotyper** tool to infer drug-resistant phenotypes from your own data. After inferring the phenotypes, you can use our simple visualisation tools to explore your results and compare them with our **Pf6+** dataset, which stores over 13,500 samples, collected across the world.

Please see `https://gitlab.internal.sanger.ac.uk/dt5/phenotyper` to explore the complete documentation.

**When to use?** If your Genetic Report Card (GRC) doesn't have information on resistance phenotypes, or you would like to update previously inferred phenotypes, you can use phenotyper to evaluate a set of rules that infer drug resistance states. Note that any file containing the genotypes described below can be used to infer the drug-resistant phenotypes.

**How to use?**

```
Usage: ./phenotyper.r [options]
Evaluates a set of phenotyping rules on all the samples of the given data file.

Options:
        --datafile=DATAFILE
                Path to the data file, a tabular file with the required fields (first column reserved for sample names).

        --rulesfile=RULESFILE
                Path to the JSON file that encodes the rules to be used for evaluating each drug.

        --ruleout=RULEOUT
                The field to be extracted from a rule when evaluated as true. Helps with debugging.

        --samplefield=SAMPLEFIELD
                The field on the data file that designates the sample id (optional).

        --verbose
                A flag to obtain verbose detailed output in the standard output (can generate very big logs, for tracebacks).

        --output=OUTPUT
                Base path for the output files.

        -h, --help
                Show this help message and exit
```

**What are the required fields for `--datafile`?** You will need a tab-separated file with the following fields:

```
crt_76[K]       crt_72-76[CVMNK]        dhfr_51[N]      dhfr_59[C]      dhfr_108[S]     dhfr_164[I]     dhps_437[G]     dhps_540[K]     dhps_581[A]     dhps_613[A]     k13_class       k13_alleles     mdr1_dup_call   pm2_dup_call    gch1_dup_call 
```

As you can see, your file does not need to include any further metadata such as location, because the **phenotyper** only uses genetic data for the inference. Just make sure that you keep the sample IDs from our output file, so that you can cross-reference it with any other metadata you might have.
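Before running the tool, it can help to confirm that your file really carries every field listed above. The following is a minimal sketch, not part of phenotyper itself: the function name `missing_required_fields` and the in-memory example are illustrative assumptions, and only the field names come from the list above.

```python
import csv
import io

# Field names copied from the list of required fields above.
REQUIRED_FIELDS = [
    "crt_76[K]", "crt_72-76[CVMNK]",
    "dhfr_51[N]", "dhfr_59[C]", "dhfr_108[S]", "dhfr_164[I]",
    "dhps_437[G]", "dhps_540[K]", "dhps_581[A]", "dhps_613[A]",
    "k13_class", "k13_alleles",
    "mdr1_dup_call", "pm2_dup_call", "gch1_dup_call",
]


def missing_required_fields(handle):
    """Return the required fields absent from the header of a tab-separated file."""
    header = next(csv.reader(handle, delimiter="\t"), [])
    present = {name.strip() for name in header}
    return [field for field in REQUIRED_FIELDS if field not in present]


# Hypothetical header: sample IDs in the first column, last required field left out.
example = io.StringIO("sample_id\t" + "\t".join(REQUIRED_FIELDS[:-1]) + "\n")
print(missing_required_fields(example))  # ['gch1_dup_call']
```

For a file on disk, open it with `open(path, newline="")` and pass the handle to the function; an empty list means every required field is present.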

An example of the complete file can be found in our example file `mygrc_metadata_nophenotypes.tsv`. If you have your GRC, this tool will create this file for you before moving on to the phenotype prediction.

**What can I do with the output?** An easy way to visualise your results is to use `2_prevalence_DR.ipynb`, where you can also compare your data against Pf6+ (a public resource which combines information on over 13,000 samples worldwide), increasing the power of your data.


If you are happy to use the default rules we've generated, you can continue with the procedure explained here; otherwise, please check the `Further technical details` section below. 





## Setup

Download the phenotyper tool if you haven't already.

In [1]:
#TO DO: this will only work once the phenotyper repo has been made public
!git clone https://github.com/malariagen/phenotyper.git

Cloning into 'phenotyper'...
Username for 'https://github.com': ^C


Below we set the paths to the phenotyper.r script, the example GRC, the rules file, and the output file containing the phenotypes. `PATH_TO_PHENOTYPER`, `PATH_TO_GRC`, and `PATH_TO_RULES` are set to the files cloned from git above (You may need to change the paths if you downloaded the phenotyper somewhere else). Please set `PATH_TO_OUTPUT` to the path of the file you would like the phenotyper to create and put the output in.

In [2]:
#SET VARIABLES FOR RUNNING PHENOTYPER HERE 
%env PATH_TO_PHENOTYPER=./phenotyper/phenotyper.r 
%env PATH_TO_GRC=./phenotyper/ext/fake_grc/mygrc_metadata_genotypes.tsv
%env PATH_TO_RULES=./phenotyper/ext/rules/test_rules_310821.json
%env PATH_TO_OUTPUT=#Enter path to output file you would like to create

env: PATH_TO_PHENOTYPER=./phenotyper/phenotyper.r
env: PATH_TO_GRC=./phenotyper/ext/fake_grc/mygrc_metadata_genotypes.tsv
env: PATH_TO_RULES=./phenotyper/ext/rules/test_rules_310821.json
env: PATH_TO_OUTPUT=#Enter path to output file you would like to create


Running the following will allow you to run R and install the dependencies that you need to run the phenotyper tool.

In [5]:
#required to use R inside the Jupyter Notebook
%load_ext rpy2.ipython

The rpy2.ipython extension is already loaded. To reload it, use:
  %reload_ext rpy2.ipython


Install R dependencies for phenotyper

In [6]:
%%R 

# Name a CRAN mirror explicitly so that R does not stop to ask for one
install.packages(c('optparse', 'stringr', 'stringi', 'jsonlite'),
                 repos = 'https://cloud.r-project.org', quiet = TRUE)


### Running on Colab

If you are running the notebooks on Colab, run the cell below to clone the Python code needed for the plots below.

In [7]:
!git clone https://github.com/malariagen/Pf6plus.git 
!cp -r /content/Pf6plus/pf6plus_documentation/notebooks/data_analysis .

Cloning into 'Pf6plus'...
Username for 'https://github.com': ^C
cp: /content/Pf6plus/pf6plus_documentation/notebooks/data_analysis: No such file or directory


#### Using Your Own GRC

We have already set `PATH_TO_GRC` above to the example included in the phenotyper repository. If you would like to use your own, you can override this below by uploading a file from your own machine or from your Google Drive.

##### Uploading from your local machine 


You can upload a GRC from your own machine to run the phenotyper. Run the following and select the file from your local machine.

In [8]:
from google.colab import files
uploaded = files.upload()

ModuleNotFoundError: No module named 'google.colab'

Once the file has been uploaded you should see its path. Copy this into the cell below.

In [None]:
%env PATH_TO_GRC=#Enter the path here

##### Upload from Google Drive

To use a file that is stored on your Google Drive, first connect to your personal Google Drive by running the cell below and following the instructions that are output.

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Copy the GRC to Google Colab by running the following, replacing `<Path to GRC on your drive>` with the path where your file is stored on your Google Drive.

In [None]:
!cp -r <Path to GRC on your drive> .

Finally, run the following to set `PATH_TO_GRC` to where your GRC is stored on Google Colab.

In [None]:
%env PATH_TO_GRC=./<GRC file name>

### Running Locally

There are some steps you need to follow before opening the notebooks to run them locally. If you haven't already, please follow these [instructions](https://gitlab.com/malariagen/gsp/pf6plus/-/tree/add_jupyter_notebooks/notebooks#running-locally).

#### Using Your Own GRC

We have already set `PATH_TO_GRC` above to use the example in the phenotyper repository. If you would like to use your own please set the path to the GRC you would like to use below:

In [None]:
%env PATH_TO_GRC=#Path to GRC 

## Run the Phenotyper

`--verbose` returns an execution log, which is useful for documenting and tracking how the samples have been classified.

In [None]:
# import packages to visualise results
import pandas as pd
from data_analysis.plot_dr_prevalence import plot_phenotype_bar_chart_compared_to_pf6plus
from data_analysis.map_samples import map_samples
import bokeh.io
import os 

In [None]:
! $PATH_TO_PHENOTYPER \
  --datafile $PATH_TO_GRC  \
  --rulesfile $PATH_TO_RULES \
  --ruleout phenotype \
  --output $PATH_TO_OUTPUT \
  --verbose 
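Before launching the command above, it can be worth checking that none of the path variables still holds a placeholder such as `#Enter the path here`. The following is a minimal sketch under that assumption. The helper `build_phenotyper_command` is hypothetical and not part of phenotyper; it uses only the options shown in the usage message, and the example values are illustrative.

```python
# Environment variables set earlier in this notebook, mapped to phenotyper options.
OPTION_FOR_VARIABLE = {
    "PATH_TO_GRC": "--datafile",
    "PATH_TO_RULES": "--rulesfile",
    "PATH_TO_OUTPUT": "--output",
}


def build_phenotyper_command(env, ruleout="phenotype", verbose=True):
    """Return the phenotyper argument list, or raise if a path is still unset."""

    def require(name):
        value = env.get(name, "").strip()
        # The placeholder cells leave values that start with '#'.
        if not value or value.startswith("#"):
            raise ValueError(f"{name} is not set to a real path: {value!r}")
        return value

    command = [require("PATH_TO_PHENOTYPER")]
    for name, option in OPTION_FOR_VARIABLE.items():
        command += [option, require(name)]
    command += ["--ruleout", ruleout]
    if verbose:
        command.append("--verbose")
    return command


# Illustrative values; the output path here is a made-up example.
example_env = {
    "PATH_TO_PHENOTYPER": "./phenotyper/phenotyper.r",
    "PATH_TO_GRC": "./phenotyper/ext/fake_grc/mygrc_metadata_genotypes.tsv",
    "PATH_TO_RULES": "./phenotyper/ext/rules/test_rules_310821.json",
    "PATH_TO_OUTPUT": "./phenotypes.tsv",
}
print(build_phenotyper_command(example_env))
```

In the notebook you could pass `os.environ` instead of `example_env` and hand the resulting list to `subprocess.run(command, check=True)`.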

## Explore the detailed output




In [None]:
# import results from Phenotyper run above
phenotyper_results = os.environ.get('PATH_TO_OUTPUT')
results = pd.read_csv(phenotyper_results, sep='\t', index_col=0, low_memory=False).reset_index()

# import pf6+ data 
pf6plus_metadata = 'https://pf6plus.cog.sanger.ac.uk/pf6plus_metadata.tsv'
pf6plus = pd.read_csv(pf6plus_metadata, sep='\t', index_col=0, low_memory=False)

# import GRC 
grc_metadata = os.environ.get('PATH_TO_GRC')
grc = pd.read_csv(grc_metadata, sep='\t', index_col=0, low_memory=False)

In [None]:
bokeh.io.output_notebook()

Let's map the phenotypes we have found back to the sample metadata

In [None]:
results['SampleId'] = grc.index
results = results.reset_index().set_index('SampleId')

In [None]:
results_with_geo = pd.concat([results, pf6plus], axis=1, join="inner")

The maps below show how including the Pf6+ data can increase the number of samples and the coverage that you have for a country. Click on the icons to find out more.

In [None]:
map_samples(results_with_geo, zoom_to_start=4)

In [None]:
map_samples(pf6plus[pf6plus.Country == 'Mali'].copy(),zoom_to_start=4)

The plots below compare the phenotypes found for the GRC, which includes samples from Mali, with those of the Pf6+ samples that come from Mali.

In [None]:
plot_phenotype_bar_chart_compared_to_pf6plus(results,pf6plus[pf6plus.Country == 'Mali'].copy())

## What's next?

After obtaining your results from this tool, you can go back to the `2_variants_prevalence.ipynb` notebook to easily visualise your results and compare them with our Pf6+ dataset, which stores over 13,500 samples with inferred phenotypes, collected across the world.

---

## Further technical details when using phenotyper

### Rule example 
Drugs are evaluated according to a set of rules specified in a JSON document. The only required fields for a rule are `name` and `evaluation`. Within `evaluation` one can access any of the fields/columns provided in the input data file using a valid R expression (see example below).



```
"Chloroquine":{
         "rules":[
            {
               "name":"Chloroquine-1",
               "change":"76-Het",
               "evaluation":"`crt_76[K]` %contains% ','",
               "interpretation":"Het",
               "phenotype":"Resistant/het",
               "analytics":"Resistant"
            },
            {
               "name":"Chloroquine-2",
               "change":"76-Missing",
               "evaluation":"`crt_76[K]` %==% '-'",
               "interpretation":"Missing",
               "phenotype":"Undetermined/missing",
               "analytics":"Undetermined"
            },
            {
               "name":"Chloroquine-3",
               "change":"K76",
               "evaluation":"`crt_76[K]` %==% 'K'",
               "interpretation":"WT",
               "phenotype":"Sensitive",
               "analytics":"Sensitive"
            },
            {
               "name":"Chloroquine-4",
               "change":"76T",
               "evaluation":"`crt_76[K]` %==% 'T'",
               "interpretation":"Mutant",
               "phenotype":"Resistant [1]",
               "analytics":"Resistant"
            },
            {
               "name":"Chloroquine-5",
               "change":"76-nonT",
               "evaluation":"T",
               "interpretation":"Unknown mutant",
               "phenotype":"Undetermined/unknown",
               "analytics":"Undetermined"
            }
         ]
      }
```
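Since `name` and `evaluation` are the only required fields, a rules file can be sanity-checked before it is handed to phenotyper. The sketch below is illustrative rather than part of the tool: the function `check_rules` is hypothetical, and it assumes the rules file is a single JSON object keyed by drug name, each holding a `rules` list as in the excerpt above.

```python
import json


def check_rules(rules_by_drug):
    """Return a list of human-readable problems found in a rules document."""
    problems = []
    for drug, definition in rules_by_drug.items():
        rules = definition.get("rules", [])
        if not rules:
            problems.append(f"{drug}: no rules defined")
            continue
        for position, rule in enumerate(rules, start=1):
            # 'name' and 'evaluation' are the only required fields of a rule.
            for field in ("name", "evaluation"):
                if not rule.get(field):
                    problems.append(f"{drug}: rule {position} lacks '{field}'")
    return problems


# Assumed layout: one top-level JSON object keyed by drug name.
# The second rule deliberately omits 'evaluation' to trigger a report.
document = json.loads("""
{
  "Chloroquine": {
    "rules": [
      {"name": "Chloroquine-3", "evaluation": "`crt_76[K]` %==% 'K'", "phenotype": "Sensitive"},
      {"name": "Chloroquine-5", "phenotype": "Undetermined/unknown"}
    ]
  }
}
""")
print(check_rules(document))  # ["Chloroquine: rule 2 lacks 'evaluation'"]
```

For a real file, `json.load` on the opened rules file gives the object to pass to `check_rules`.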

### Technicalities
It is best to protect field names with backticks (`) in the JSON file, because field names often contain unusual characters (for instance when they come from Excel files) that cause problems in R. This can also be done programmatically by providing a character protection mapping within the code, but this is not recommended.


The software interprets each rule by executing the R code given in `evaluation`. To ease evaluation it provides some custom-made operators, such as `%==%`, which means loose equality (a symmetric "any" operation: `any(a in b)` or `any(b in a)`). Tools of this kind usually include a small parser for the allowed operations, so that no arbitrary code is evaluated. However, because we had to deal with different and changing encodings (e.g. the way hets or missing calls are represented), the prototype evaluates any valid R expression given in the rules. We overwrite R's system-call functions to prevent code injection attempts, but the risk is still something to bear in mind.
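To make the loose-equality semantics concrete, here is a small Python analogue of the behaviour described for `%==%`. It is only an illustration of the "any(a in b) or any(b in a)" idea, not the R implementation. The names `as_values` and `loose_equal` are hypothetical, and so is the list representation of a het call.

```python
def as_values(x):
    """Treat a single value as a one-element list."""
    return list(x) if isinstance(x, (list, tuple, set)) else [x]


def loose_equal(a, b):
    """True if any element of a is in b, or any element of b is in a."""
    left, right = as_values(a), as_values(b)
    return any(x in right for x in left) or any(y in left for y in right)


print(loose_equal("T", "T"))         # True
print(loose_equal(["K", "T"], "T"))  # True: one of the values matches
print(loose_equal("K", "T"))         # False
```

The point of the symmetry is that the rule author does not need to know which side of the comparison holds several values.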


There are metarules that check the output status of a previously executed rule by referencing the name of the drug with @, for instance:



```
           {
               "name":"Dihydroartemisinin-piperaquine-3",
               "change":"Missing",
               "evaluation":"@Artemisinin@ %contains% 'Missing' || @Piperaquine@ %contains% 'Missing'",
               "interpretation":"Undetermined/Missing",
               "phenotype":"Missing",
               "analytics":"Undetermined"
            },
```
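Because a metarule depends on drugs that were evaluated earlier, it can be useful to list the drugs that an `evaluation` string references. The sketch below is an assumption-laden illustration, not how phenotyper resolves references internally. The regular expression, the function names and the example outcomes are all hypothetical.

```python
import re

# Matches references such as @Artemisinin@; hyphenated drug names are allowed.
REFERENCE = re.compile(r"@([\w-]+)@")


def referenced_drugs(expression):
    """Return the drug names referenced as @Name@ in an evaluation string."""
    return REFERENCE.findall(expression)


def unresolved_references(expression, outcomes):
    """Return referenced drugs that have no outcome recorded yet."""
    return [drug for drug in referenced_drugs(expression) if drug not in outcomes]


expression = "@Artemisinin@ %contains% 'Missing' || @Piperaquine@ %contains% 'Missing'"
outcomes = {"Artemisinin": "Sensitive"}  # hypothetical outcomes recorded so far

print(referenced_drugs(expression))                 # ['Artemisinin', 'Piperaquine']
print(unresolved_references(expression, outcomes))  # ['Piperaquine']
```

A non-empty result from `unresolved_references` would suggest that the combination drug appears too early in the rules file.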

One can run the program in `--verbose` mode and save the log. The file will be huge (and the execution is very slow), but we use it to trace back which rules were executed, and why, when phenotypes change or look wrong.

### Important things to bear in mind
1. For each drug we define a set of rules that are executed sequentially.
2. **Order is important** (top-down execution until finding one that evaluates to true).
3. There should always be a default catch-all rule at the end (it will be executed if nothing else has been executed). A default rule just evaluates to `True` or `T`.
```
           {
               "name":"Chloroquine-5",
               "change":"76-nonT",
               "evaluation":"T",
               "interpretation":"Unknown mutant",
               "phenotype":"Undetermined/unknown",
               "analytics":"Undetermined"
            }
```
4. The script will evaluate each rule according to the `evaluation` field (required).
5. The output obtained from a rule that evaluates to true must be specified using the `--ruleout` option. This allows different outputs from the same set of rules. For example, our current phenotyping rules embody some arbitrary decisions: samples with het resistance markers are considered undetermined instead of resistant, although one could equally treat hets, or dominant hets, as resistant. You can check what difference these decisions make, with the same set of rules, by providing different output fields and selecting one when calling the program, for instance `--ruleout het_phenotype` instead of `--ruleout phenotype` (see the example rule definition below).

```
 "Chloroquine":{
         "rules":[
            {
               "name":"Chloroquine-1",
               "change":"76-Het",
               "evaluation":"`PfCRT:76` %==% 'K/T' || `PfCRT:76` %==% 'I/T'",
               "het_phenotype":"Resistant",
               "phenotype":"Undetermined",
               "analytics":"Resistant/Het"
            }, (...)
```
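The points above (sequential execution, first match wins, a default rule at the end, and an output field chosen with `--ruleout`) can be sketched in a few lines of Python. This is a simplified illustration and not the tool's R code: each `evaluation` is a Python function instead of an R expression, the helper names are hypothetical, and the field values other than those of `Chloroquine-1` are made up for the example.

```python
def first_matching_rule(rules, sample):
    """Walk the rules top-down and return the first one that evaluates to true."""
    for rule in rules:
        if rule["evaluation"](sample):
            return rule
    return None  # only reached when no default rule is present


def evaluate_drug(rules, sample, ruleout="phenotype"):
    """Return the field selected by `ruleout` from the first matching rule."""
    rule = first_matching_rule(rules, sample)
    return None if rule is None else rule.get(ruleout)


chloroquine_rules = [
    {"name": "Chloroquine-1",
     "evaluation": lambda s: s["PfCRT:76"] in ("K/T", "I/T"),
     "het_phenotype": "Resistant", "phenotype": "Undetermined"},
    {"name": "Chloroquine-3",  # illustrative values
     "evaluation": lambda s: s["PfCRT:76"] == "K",
     "het_phenotype": "Sensitive", "phenotype": "Sensitive"},
    {"name": "Chloroquine-5",  # default rule: always true, so it must come last
     "evaluation": lambda s: True,
     "het_phenotype": "Undetermined/unknown", "phenotype": "Undetermined/unknown"},
]

het_sample = {"PfCRT:76": "K/T"}
print(evaluate_drug(chloroquine_rules, het_sample, ruleout="phenotype"))      # Undetermined
print(evaluate_drug(chloroquine_rules, het_sample, ruleout="het_phenotype"))  # Resistant
```

Moving the default rule to the top of the list would make every sample match it first, which is why order matters.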




