<img src="../images/MalariaGEN.png" alt="MalariaGEN logo" width="375px" align="left">

**We would like to thank all MalariaGEN Plasmodium falciparum Community Project partners for their contribution. If you use this resource, please remember to also cite the following studies:**
[Pf6 partner studies](http://ngs.sanger.ac.uk/production/malaria/pfcommunityproject/Pf6/Pf_6_partner_studies.pdf) and [GenRe partner studies](http://ngs.sanger.ac.uk/production/malaria/Resource/29/20200705-GenRe-07-PartnerStudyInformation-0.39.pdf).

# Drug Resistant variants visualisation

This notebook allows you to use the Phenotyper tool to infer phenotypes from your own data. You can then use our simple visualisation tools to view your results and compare them with our Pf6+ dataset, which stores over 13,500 samples with inferred phenotypes, collected across the world.

## Setup

### Running on Colab

In [None]:
!git clone https://github.com/malariagen/Pf6plus.git 
!cp -r /content/Pf6plus/pf6plus_documentation/notebooks/data_analysis .

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [2]:
!cp -r drive/MyDrive/data_analysis .

### Running Locally

There are some steps you need to follow before opening the notebooks to run them locally. If you haven't already, please follow these [instructions](https://gitlab.com/malariagen/gsp/pf6plus/-/tree/add_jupyter_notebooks/notebooks#running-locally).

### Python Setup

Import the functions from the `data_analysis` directory. This contains all of the code you will need to generate the plots in this notebook. 

In [3]:
from data_analysis.plot_dr_prevalence import *
from data_analysis.plot_haplotype_frequency import *
from data_analysis.tabulate_drug_resistance import *

Running the following will make sure the interactive plots are output to the notebook.

In [4]:
import bokeh.io
bokeh.io.output_notebook()

### Import data

In [5]:
# Load the Pf6+ metadata table (tab-separated, sample ID in the first column)
import pandas as pd
pf6plus_metadata = 'https://pf6plus.cog.sanger.ac.uk/pf6plus_metadata.tsv'
pf6plus = pd.read_csv(pf6plus_metadata, sep='\t', index_col=0, low_memory=False)

Here we keep only the samples that have `IncludeInAnalysis` set to True, so that only the high-quality samples remain. (This includes a combination of QC samples for WGS and samples “included” in the GenRe analysis for AmpSeq.)

In [6]:
pf6plus=pf6plus.loc[pf6plus.IncludeInAnalysis==True]
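
As a quick illustration of what this filter does, here is a minimal sketch on a toy table. The sample IDs and values are made up; only the `IncludeInAnalysis` and `Country` column names are taken from the Pf6+ metadata used in this notebook.

```python
import pandas as pd

# Toy metadata table; the sample IDs and values are made up for illustration.
toy = pd.DataFrame(
    {
        "Country": ["Mali", "Mali", "Gambia", "Gambia"],
        "IncludeInAnalysis": [True, False, True, True],
    },
    index=["S1", "S2", "S3", "S4"],
)

# Keep only the rows flagged for analysis; a boolean column works as a mask.
kept = toy.loc[toy["IncludeInAnalysis"]]

print(f"{len(toy)} samples before filtering, {len(kept)} after")
print(kept["Country"].value_counts())
```

Comparing the row counts before and after is a cheap way to check how many samples the flag removes.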

## Prevalence of Resistant Variants

Note: If your GRC doesn't have information on drug resistance, you can use the Phenotyper tool (`3_phenotyper.ipynb`) to get it.



### Tabulate drug resistant variants


In [7]:
# use help(name_of_function) to access the documentation notes
help(tabulate_drug_resistant)

Help on function tabulate_drug_resistant in module data_analysis.tabulate_drug_resistance:

tabulate_drug_resistant(data, drug, country=None, population=None, year=None, bin=False)
    Tabulate the frequency of drug resistant samples per country/year
    
    Parameters:
      - drug: Any of the drugs in the Pf6+ dataframe ['Artemisinin', 'Chloroquine', 'DHA-PPQ', 'Piperaquine', 'Pyrimethamine', 'S-P', 'S-P-IPTp', 'Sulfadoxine']
      - country: Any of the countries in the Pf6+ dataframe (if specified, population value is not used) ['Bangladesh', 'Benin', 'Burkina Faso', 'Cambodia', 'Cameroon', 'Colombia', 'Congo DR', 'Ethiopia', 'Gambia', 'Ghana', 'Guinea', 'India', 'Indonesia', 'Ivory Coast', 'Kenya', 'Laos', 'Madagascar', 'Malawi', 'Mali', 'Mauritania', 'Mozambique', 'Myanmar', 'Nigeria', 'Papua New Guinea', 'Peru', 'Senegal', 'Tanzania', 'Thailand', 'Uganda', 'Viet Nam']
      - population: Any of the populations in the Pf6+ dataframe ['CAF', 'EAF', 'ESEA', 'OCE', 'SAM', 'SAS', 'WA

In [8]:
tabulate_drug_resistant(pf6plus, 'S-P')

S-P resistant samples on all years


Country,Resistant,Sensitive,Undetermined,Total,Resistant Frequency
Bangladesh,715,834,221,1770,0.46
Benin,34,1,1,36,0.97
Burkina Faso,10,27,19,56,0.27
Cambodia,1697,96,82,1875,0.95
Cameroon,230,2,3,235,0.99
Colombia,0,16,0,16,0.0
Côte d'Ivoire,34,28,8,70,0.55
Democratic Republic of the Congo,268,55,141,464,0.83
Ethiopia,18,2,1,21,0.9
Gambia,182,25,12,219,0.88
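
The numbers in the table above are consistent with the resistant frequency being computed as Resistant / (Resistant + Sensitive), with undetermined samples left out of the denominator. This is inferred from the printed values rather than from the source of `tabulate_drug_resistant`, so treat it as an assumption. A small sketch that reproduces the first four rows:

```python
import pandas as pd

# Counts copied from the first rows of the S-P table above.
counts = pd.DataFrame(
    {
        "Resistant": [715, 34, 10, 1697],
        "Sensitive": [834, 1, 27, 96],
        "Undetermined": [221, 1, 19, 82],
    },
    index=["Bangladesh", "Benin", "Burkina Faso", "Cambodia"],
)

counts["Total"] = counts[["Resistant", "Sensitive", "Undetermined"]].sum(axis=1)

# Assumption: undetermined calls are excluded from the denominator.
called = counts["Resistant"] + counts["Sensitive"]
counts["Resistant Frequency"] = (counts["Resistant"] / called).round(2)

print(counts)
```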


In [9]:
tabulate_drug_resistant(pf6plus, 'S-P', population='WAF')

S-P resistant samples on all years


Population,Resistant,Sensitive,Undetermined,Total,Resistant Frequency
WAF,1381,470,380,2231,0.75


In [10]:
tabulate_drug_resistant(pf6plus,'S-P', year = [2007, 2010], bin=False)

S-P resistant samples on all years


Country,Resistant,Sensitive,Undetermined,Total,Resistant Frequency
Bangladesh,715,834,221,1770,0.46
Benin,34,1,1,36,0.97
Burkina Faso,10,27,19,56,0.27
Cambodia,1697,96,82,1875,0.95
Cameroon,230,2,3,235,0.99
Colombia,0,16,0,16,0.0
Côte d'Ivoire,34,28,8,70,0.55
Democratic Republic of the Congo,268,55,141,464,0.83
Ethiopia,18,2,1,21,0.9
Gambia,182,25,12,219,0.88


In [11]:
tabulate_drug_resistant(pf6plus,'S-P', country='Mali', year = [2007,2010], bin=True)

S-P resistant samples in Mali from 2007 to 2010


S-P,Resistant,Sensitive,Undetermined,Total,Resistant Frequency
Samples,26,35,24,85,0.43


In [12]:
tabulate_drug_resistant(pf6plus,'S-P', country='Gambia', year = [2007,2010], bin=True)

S-P resistant samples in Gambia from 2007 to 2010


S-P,Resistant,Sensitive,Undetermined,Total,Resistant Frequency
Samples,60,11,2,73,0.85
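
The two calls above pass `year=[2007, 2010]` with `bin=True`, and the printed titles suggest that samples are pooled across that range. Below is a rough sketch of this kind of pooling on a toy table. It assumes an inclusive range and uses hypothetical column names (`Year`, `Phenotype`); it is not necessarily how `tabulate_drug_resistant` works internally.

```python
import pandas as pd

# Toy data with hypothetical column names and made-up values.
toy = pd.DataFrame(
    {
        "Year": [2006, 2007, 2008, 2010, 2011, 2009],
        "Phenotype": [
            "Resistant", "Resistant", "Sensitive",
            "Undetermined", "Resistant", "Resistant",
        ],
    }
)

start, end = 2007, 2010

# Series.between is inclusive on both ends by default.
in_range = toy[toy["Year"].between(start, end)]

# Pool all years in the range into a single set of counts.
pooled = in_range["Phenotype"].value_counts()
print(pooled)
```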


**The number of samples collected across different locations in different years varies widely in the Pf6+ data resource. To increase confidence in the plots shown below, a threshold is set so that only country/year or population/year combinations with n_samples>25 are included. You can change this default value with the `threshold` parameter, but please be cautious.**
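
To make the effect of such a threshold concrete, here is a minimal sketch on toy data. The column names `Country` and `Year` and the sample counts are hypothetical, and the strict `>` comparison follows the n_samples>25 wording above; the plotting functions may implement the cut-off differently.

```python
import pandas as pd

# Toy data: hypothetical column names, made-up sample counts.
toy = pd.DataFrame(
    {
        "Country": ["Mali"] * 30 + ["Mali"] * 10 + ["Gambia"] * 26,
        "Year": [2007] * 30 + [2008] * 10 + [2007] * 26,
    }
)

threshold = 25

# Number of samples in each country/year combination.
group_sizes = toy.groupby(["Country", "Year"]).size()

# Keep only combinations with strictly more samples than the threshold.
kept_groups = group_sizes[group_sizes > threshold]
print(kept_groups)
```

Here the Mali/2008 group (10 samples) is dropped, while Mali/2007 (30) and Gambia/2007 (26) are kept.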

### Plot Drug Resistant Prevalence

In [13]:
help(plot_dr_prevalence)

Help on function plot_dr_prevalence in module data_analysis.plot_dr_prevalence:

plot_dr_prevalence(data, drugs, country=None, population=None, years=None, bin=False, threshold=25)
    Plot the prevalence of resistant samples per country/year
    
    Parameters:
      - drug: Any/list of the drugs in the Pf6+ dataframe ['Artemisinin', 'Chloroquine', 'DHA-PPQ', 'Piperaquine', 'Pyrimethamine', 'S-P', 'S-P-IPTp', 'Sulfadoxine']
      - country: Any of the countries in the Pf6+ dataframe (if specified, population value is not used) ['Bangladesh', 'Benin', 'Burkina Faso', 'Cambodia', 'Cameroon', 'Colombia', 'Congo DR', 'Ethiopia', 'Gambia', 'Ghana', 'Guinea', 'India', 'Indonesia', 'Ivory Coast', 'Kenya', 'Laos', 'Madagascar', 'Malawi', 'Mali', 'Mauritania', 'Mozambique', 'Myanmar', 'Nigeria', 'Papua New Guinea', 'Peru', 'Senegal', 'Tanzania', 'Thailand', 'Uganda', 'Viet Nam']
      - population: Any of the populations in the Pf6+ dataframe ['CAF', 'EAF', 'ESEA', 'OCE', 'SAM', 'SAS', 'WAF',

In [14]:
plot_dr_prevalence(pf6plus, drugs=['S-P','Sulfadoxine','Chloroquine','Artemisinin','DHA-PPQ','Piperaquine'], country = 'Gambia', population = 'WAF')



In [15]:
plot_dr_prevalence(pf6plus, drugs=['S-P', 'Sulfadoxine', 'Chloroquine'], country = 'Gambia', population = 'WAF')

In [16]:
plot_dr_prevalence(pf6plus, drugs=['S-P'], country = 'Gambia')

In [17]:
plot_dr_prevalence(pf6plus, drugs=['S-P'], country = 'Mali', population = 'WAF')

### Plot most common haplotypes per population/country

In [18]:
help(plot_haplotype_frequency)

Help on function plot_haplotype_frequency in module data_analysis.plot_haplotype_frequency:

plot_haplotype_frequency(data, gene, num_top_haplotypes=5, threshold=25, countries=None, populations=None, years=None, bin=False)
    Tabulate the frequency of top n haplotypes on a specific gene per country (or) population per year
    
    Parameters:
      - gene: Any of the genes in the Pf6+ dataframe ['PfCRT', 'Kelch', 'PfDHFR', 'PfEXO', 'PGB', 'Plasmepsin2/3', 'PfDHPS', 'PfMDR1']
      - num_top_haplotypes: The (n) most common haplotypes, default is 5. These excludes missing haplotypes.
      - countries: Any of the countries in the Pf6+ dataframe (if specified, population value is not used) ['Bangladesh', 'Benin', 'Burkina Faso', 'Cambodia', 'Cameroon', 'Colombia', 'Congo DR', 'Ethiopia', 'Gambia', 'Ghana', 'Guinea', 'India', 'Indonesia', 'Ivory Coast', 'Kenya', 'Laos', 'Madagascar', 'Malawi', 'Mali', 'Mauritania', 'Mozambique', 'Myanmar', 'Nigeria', 'Papua New Guinea', 'Peru', 'Senegal'

In [19]:
plot_haplotype_frequency(pf6plus, 'PfDHFR', num_top_haplotypes=5, populations = ['CAF'], years = None, bin=False)

<data_analysis.plotting.Subplots at 0x7f91649dd8d0>

In [20]:
plot_haplotype_frequency(pf6plus, 'PfDHFR', num_top_haplotypes=5, populations = ['CAF', 'EAF', 'ESEA', 'OCE', 'SAS', 'WAF', 'WSEA'], years = None, bin=False)

<data_analysis.plotting.Subplots at 0x7f9164a38b50>

In [21]:
plot_haplotype_frequency(pf6plus, 'PfDHFR', num_top_haplotypes=5, countries = ['Mali'], bin=False)

<data_analysis.plotting.Subplots at 0x7f9163f0f390>
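
For intuition about `num_top_haplotypes`, the sketch below picks the n most common haplotypes from a toy series while ignoring missing calls, as the help text describes. The haplotype strings are made up, and computing frequencies over non-missing calls only is an assumption rather than a confirmed detail of `plot_haplotype_frequency`.

```python
import pandas as pd

# Toy haplotype calls; values are made up, and None marks a missing call.
haplotypes = pd.Series(
    ["IRNI", "IRNI", "NCSI", None, "IRNI", "ICNI", "NCSI", None]
)

num_top_haplotypes = 2

# value_counts drops missing values by default; normalize=True returns
# fractions of the non-missing calls, sorted from most to least common.
top = haplotypes.value_counts(normalize=True).head(num_top_haplotypes)
print(top)
```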

## What else can I do? 

We are interested in the evolution of Kelch haplotypes in ESEA. We would like to know how much the countries within this population differ from one another, and whether we can detect country-specific mutations.


In [22]:
plot_haplotype_frequency(pf6plus, 'Kelch', populations =  ['ESEA'])

<data_analysis.plotting.Subplots at 0x7f9163e30950>

In [23]:
plot_haplotype_frequency(pf6plus, 'Kelch', countries = ['Cambodia'])

<data_analysis.plotting.Subplots at 0x7f9163e52510>

In [24]:
plot_haplotype_frequency(pf6plus, 'Kelch', countries = ['Vietnam','Cambodia'])

<data_analysis.plotting.Subplots at 0x7f9163eb8450>

In [25]:
plot_haplotype_frequency(pf6plus, 'Kelch', countries = ['Laos'])

<data_analysis.plotting.Subplots at 0x7f916251f910>
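
Beyond comparing the plots by eye, one possible way to look for country-specific haplotypes is to count how many countries each haplotype appears in. The sketch below does this on a toy table; the haplotype assignments are made up, and the `Kelch` column name is an assumption based on the gene name used in the calls above.

```python
import pandas as pd

# Toy table; the country/haplotype assignments are invented for illustration.
toy = pd.DataFrame(
    {
        "Country": ["Cambodia", "Cambodia", "Laos", "Laos", "Viet Nam", "Viet Nam"],
        "Kelch": ["C580Y", "R539T", "C580Y", "WT", "C580Y", "WT"],
    }
)

# For each haplotype, count how many distinct countries it appears in.
n_countries = toy.groupby("Kelch")["Country"].nunique()

# Haplotypes seen in exactly one country, mapped to that country.
private = n_countries[n_countries == 1].index
country_specific = (
    toy[toy["Kelch"].isin(private)]
    .drop_duplicates()
    .set_index("Kelch")["Country"]
)
print(country_specific)
```

With real data it would be sensible to apply a minimum sample count first, since a haplotype seen once in a small sample set is weak evidence of being country-specific.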

### An extra use case
Explore how PfDHFR haplotypes differ between ESEA and WSEA, with WAF and EAF plotted alongside for comparison.


In [26]:
plot_haplotype_frequency(pf6plus, 'PfDHFR', populations = ['WSEA','ESEA','WAF','EAF'])

<data_analysis.plotting.Subplots at 0x7f91624a5610>