> We would like to thank all **MalariaGEN Plasmodium falciparum Community Project partners** for their contributions. This resource builds on enormous team efforts around the globe, encompassing 61 studies in 30 countries. With over 16,500 samples available, we hope to continue disseminating these unique resources and increasing their accessibility, so that they can translate into improvements for public health.
If you use this resource, please remember to also cite the following studies:
[Pf6 partner studies](http://ngs.sanger.ac.uk/production/malaria/pfcommunityproject/Pf6/Pf_6_partner_studies.pdf) and [GenRe partner studies](http://ngs.sanger.ac.uk/production/malaria/Resource/29/20200705-GenRe-07-PartnerStudyInformation-0.39.pdf). 

# Visualizing drug resistant variants

This notebook allows you to explore the genotypes present in a dataset. You can use it to:

- Visualise the frequency of individual genetic mutations over time (e.g. showing which alleles are dominant in different countries at specific time points).

- Explore haplotype frequencies at different geographical levels (population, country and administrative level 1) over time, displaying regional diversity and haplotype compositions.


- Visualise the prevalence of drug resistance over time for key drugs. 


## Setup

### Running on Colab

In [None]:
!git clone https://github.com/malariagen/Pf6plus.git 
!cp -r /content/Pf6plus/pf6plus_documentation/notebooks/data_analysis .

### Running Locally

There are some steps you need to follow to run the notebooks locally. If you haven't already, please follow these [instructions](https://github.com/malariagen/Pf6plus#running-the-notebooks-locally).

### Python Setup

Import the functions from the `data_analysis` directory. This contains all of the code you will need to generate the plots in this notebook. 

In [1]:
from data_analysis.plot_dr_prevalence import *
from data_analysis.plot_haplotype_frequency import *
from data_analysis.tabulate_drug_resistance import *

Running the following will make sure the interactive plots are output to the notebook.

In [2]:
import bokeh.io
bokeh.io.output_notebook()

### Import data

In [3]:
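# Load the Pf6+ metadata (one row per sample, indexed by sample ID)
import pandas as pd  # explicit import, in case the star imports above do not expose `pd`
pf6plus_metadata = 'https://pf6plus.cog.sanger.ac.uk/pf6plus_metadata.tsv'
pf6plus = pd.read_csv(pf6plus_metadata, sep='\t', index_col=0, low_memory=False)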

Here we keep only the samples that have `IncludeInAnalysis` set to `True`. This retains only high-quality samples (both WGS and AmpSeq).

In [4]:
pf6plus=pf6plus.loc[pf6plus.IncludeInAnalysis==True]
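
As a minimal sketch of what this filter does, the toy frame below is filtered with a boolean mask. The sample IDs and values are made up; only the column names mirror Pf6+. It assumes `IncludeInAnalysis` is parsed as a boolean column, in which case the column can serve as the mask directly.

```python
import pandas as pd

# Toy metadata with made-up values; only the column names mirror Pf6+.
toy = pd.DataFrame(
    {
        "Process": ["WGS", "AmpSeq-V1", "WGS", "AmpSeq-V1"],
        "IncludeInAnalysis": [True, False, True, True],
    },
    index=pd.Index(["S1", "S2", "S3", "S4"], name="SampleId"),
)

# A boolean column can be used directly as a row mask.
kept = toy.loc[toy["IncludeInAnalysis"]]

# Check how many samples of each sequencing process survive the filter.
print(kept["Process"].value_counts())
```

Counting the retained samples per `Process` is a quick way to confirm that both WGS and AmpSeq samples are still present after filtering.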

## Prevalence of Resistant Variants

Note: If your GRC data doesn't have information on drug resistance, you can use `phenotyper` (`3_phenotyper.ipynb`) to generate it.



### Tabulate drug resistant variants
To start exploring the dataset, we can tabulate drug resistance for different combinations of drugs, countries and years.

In [5]:
# use help(name_of_function) to access the documentation notes
help(tabulate_drug_resistant)

Help on function tabulate_drug_resistant in module data_analysis.tabulate_drug_resistance:

tabulate_drug_resistant(data, drug, country=None, population=None, years=None, bin=False)
    Tabulate the frequency of drug resistant samples per country/year
    
    Parameters:
      - drug: Any of the drugs in the Pf6+ dataframe ['Artemisinin', 'Chloroquine', 'DHA-PPQ', 'Piperaquine', 'Pyrimethamine', 'S-P', 'S-P-IPTp', 'Sulfadoxine']
      - country: Any of the countries in the Pf6+ dataframe (if specified, population value is not used) ['Bangladesh', 'Benin', 'Burkina Faso', 'Cambodia', 'Cameroon',
       'Colombia', "Côte d'Ivoire", 'Democratic Republic of the Congo',
       'Ethiopia', 'Gambia', 'Ghana', 'Guinea', 'India', 'Indonesia',
       'Kenya', 'Laos', 'Madagascar', 'Malawi', 'Mali', 'Mauritania',
       'Mozambique', 'Myanmar', 'Nigeria', 'Papua New Guinea', 'Peru',
       'Senegal', 'Tanzania', 'Thailand', 'Uganda', 'Vietnam']
      - population: Any of the populations in the P

Imagine we are interested in S-P resistance around the globe...

In [6]:
tabulate_drug_resistant(pf6plus, 'S-P')

S-P resistant samples 


| Country | Resistant | Sensitive | Undetermined | Total | Resistant Frequency |
|---|---|---|---|---|---|
| Bangladesh | 715 | 834 | 221 | 1770 | 0.46 |
| Benin | 34 | 1 | 1 | 36 | 0.97 |
| Burkina Faso | 10 | 27 | 19 | 56 | 0.27 |
| Cambodia | 1697 | 96 | 82 | 1875 | 0.95 |
| Cameroon | 230 | 2 | 3 | 235 | 0.99 |
| Colombia | 0 | 16 | 0 | 16 | 0.0 |
| Côte d'Ivoire | 34 | 28 | 8 | 70 | 0.55 |
| Democratic Republic of the Congo | 268 | 55 | 141 | 464 | 0.83 |
| Ethiopia | 18 | 2 | 1 | 21 | 0.9 |
| Gambia | 182 | 25 | 12 | 219 | 0.88 |
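
The frequencies in this table are consistent with Resistant / (Resistant + Sensitive), i.e. undetermined calls are left out of the denominator (for Bangladesh, 715 / (715 + 834) ≈ 0.46). The sketch below reproduces that calculation on made-up data. It is not the implementation of `tabulate_drug_resistant`; the helper name `tabulate_sketch` and the phenotype labels are assumptions based on the column headers shown.

```python
import pandas as pd

# Toy data with made-up values; "S-P" holds one phenotype call per sample.
toy = pd.DataFrame(
    {
        "Country": ["Mali", "Mali", "Mali", "Ghana", "Ghana", "Ghana", "Ghana"],
        "S-P": ["Resistant", "Sensitive", "Undetermined",
                "Resistant", "Resistant", "Resistant", "Sensitive"],
    }
)

def tabulate_sketch(data, drug, by="Country"):
    """Count phenotype calls per group and derive a resistant frequency."""
    counts = pd.crosstab(data[by], data[drug])
    # Make sure all three phenotype columns exist even if a call is absent.
    counts = counts.reindex(
        columns=["Resistant", "Sensitive", "Undetermined"], fill_value=0
    )
    counts["Total"] = counts.sum(axis=1)
    # Undetermined calls are left out of the denominator.
    determined = counts["Resistant"] + counts["Sensitive"]
    counts["Resistant Frequency"] = (counts["Resistant"] / determined).round(2)
    return counts

table = tabulate_sketch(toy, "S-P")
print(table)
```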


The proportion of resistant variants seems to be high in West Africa. Let's aggregate the countries in this population (`WAF`) and see the result.

In [7]:
tabulate_drug_resistant(pf6plus, 'S-P', population='WAF')

S-P resistant samples in WAF 


| Year | Resistant | Sensitive | Undetermined | Total | Resistant Frequency |
|---|---|---|---|---|---|
| 2007 | 26 | 35 | 24 | 170 | 0.43 |
| 2008 | 70 | 38 | 21 | 258 | 0.65 |
| 2009 | 51 | 32 | 35 | 236 | 0.61 |
| 2010 | 66 | 45 | 45 | 312 | 0.59 |
| 2011 | 157 | 50 | 42 | 498 | 0.76 |
| 2012 | 51 | 11 | 19 | 162 | 0.82 |
| 2013 | 595 | 169 | 134 | 1796 | 0.78 |
| 2014 | 336 | 81 | 37 | 908 | 0.81 |
| 2015 | 29 | 9 | 23 | 122 | 0.76 |


We can also restrict the table to specific years to look for temporal patterns.

In [8]:
tabulate_drug_resistant(pf6plus,'S-P', years = [2007, 2010], bin=False)

S-P resistant samples for years 2007, 2010


| Country | Resistant | Sensitive | Undetermined | Total | Resistant Frequency |
|---|---|---|---|---|---|
| Cambodia | 180 | 11 | 4 | 195 | 0.94 |
| Ghana | 66 | 45 | 45 | 156 | 0.59 |
| Indonesia | 0 | 1 | 0 | 1 | 0.0 |
| Kenya | 30 | 15 | 4 | 49 | 0.67 |
| Laos | 12 | 22 | 2 | 36 | 0.35 |
| Mali | 26 | 35 | 24 | 85 | 0.43 |
| Papua New Guinea | 0 | 3 | 0 | 3 | 0.0 |
| Tanzania | 34 | 6 | 7 | 47 | 0.85 |
| Thailand | 39 | 1 | 1 | 41 | 0.98 |
| Uganda | 10 | 1 | 1 | 12 | 0.91 |
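
With `bin=False`, the `years` list appears to be treated as a set of individual years: the title reads "for years 2007, 2010" rather than describing a range. Assuming that reading is right, the selection step could be sketched as follows. The data are made up and this is not the function's actual code.

```python
import pandas as pd

# Toy data with made-up values.
toy = pd.DataFrame(
    {
        "Country": ["Mali", "Mali", "Ghana", "Ghana", "Kenya"],
        "Year": [2007, 2008, 2009, 2010, 2010],
    }
)

years = [2007, 2010]

# Keep only rows whose collection year is one of the listed years.
selected = toy.loc[toy["Year"].isin(years)]
print(selected)
```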


**The number of samples collected across different locations in different years varies widely in the Pf6+ data resource. To increase confidence in the plots shown below, a threshold is set so that only country/year or population/year combinations with `n_samples > 25` are included. You can change this default value with the `threshold` parameter, but please be cautious.**
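
To illustrate why the threshold matters, the sketch below counts samples per country/year in a made-up dataset and drops the combinations that do not exceed the threshold. The variable names are illustrative and the plotting functions may apply the rule differently.

```python
import pandas as pd

# Toy data: 30 samples from 2013 and 5 from 2014 (made-up values).
toy = pd.DataFrame(
    {
        "Country": ["Mali"] * 35,
        "Year": [2013] * 30 + [2014] * 5,
        "S-P": ["Resistant"] * 20 + ["Sensitive"] * 10 + ["Resistant"] * 5,
    }
)

threshold = 25  # same default as described in the note above

# Number of samples in each country/year combination.
n_samples = toy.groupby(["Country", "Year"]).size()

# Keep only the combinations with more samples than the threshold.
reliable = n_samples[n_samples > threshold]
print(reliable)
```

In this toy example, 2014 would show 100% resistance, but that figure rests on only five samples, so excluding it avoids a misleading data point.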

### Plot drug resistance prevalence

This function explores the frequencies of drug resistant variants within different spatial (country/population) and temporal (year) combinations. This is quite a useful way of visualising time series data in regions & years of interest.

You can also hover over the data points to see the specific numbers and zoom in and out to change the scale.

In [9]:
help(plot_dr_prevalence)

Help on function plot_dr_prevalence in module data_analysis.plot_dr_prevalence:

plot_dr_prevalence(data, drugs, country=None, population=None, years=None, bin=False, threshold=25)
    Plot the prevalence of resistant samples per country/year
    
    Parameters:
      - drug: Any/list of the drugs in the Pf6+ dataframe ['Artemisinin', 'Chloroquine', 'DHA-PPQ', 'Piperaquine', 'Pyrimethamine', 'S-P', 'S-P-IPTp', 'Sulfadoxine']
      - country: Any of the countries in the Pf6+ dataframe (if specified, population value is not used) ['Bangladesh', 'Benin', 'Burkina Faso', 'Cambodia', 'Cameroon',
       'Colombia', "Côte d'Ivoire", 'Democratic Republic of the Congo',
       'Ethiopia', 'Gambia', 'Ghana', 'Guinea', 'India', 'Indonesia',
       'Kenya', 'Laos', 'Madagascar', 'Malawi', 'Mali', 'Mauritania',
       'Mozambique', 'Myanmar', 'Nigeria', 'Papua New Guinea', 'Peru',
       'Senegal', 'Tanzania', 'Thailand', 'Uganda', 'Vietnam']
      - population: Any of the populations in the Pf6+ 

We can continue exploring patterns specific to West Africa by looking at the information available for specific drugs in Gambia. Comparing this country with the population average makes it easier to spot differences.

In [10]:
plot_dr_prevalence(pf6plus, drugs=['S-P','Sulfadoxine','Chloroquine','Artemisinin','DHA-PPQ','Piperaquine'], country = 'Gambia', population = 'WAF')



The drug resistance pattern looks interesting for S-P. How would this look for another country within the same population?

In [11]:
plot_dr_prevalence(pf6plus, drugs=['S-P'], country = 'Mali')

### Plot most common haplotypes per population/country

Drug resistant haplotypes allow us to visualise key positions in the genome together and look out for changes within these structures. By plotting the top haplotypes in a region, we can spot emerging trends.
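
As a rough sketch of the idea, the snippet below ranks haplotypes by their within-population frequency and keeps the top `n`. The haplotype labels (`H1`, `H2`, ...) and the helper `top_haplotypes` are hypothetical, and the sketch assumes one haplotype string per sample in a column named after the gene. It does not reproduce the thresholding or plotting done by `plot_haplotype_frequency`.

```python
import pandas as pd

# Toy data with placeholder haplotype labels (made-up values).
toy = pd.DataFrame(
    {
        "Population": ["WAF"] * 6 + ["ESEA"] * 4,
        "PfDHFR": ["H1", "H1", "H1", "H2", "H2", "H3",
                   "H4", "H4", "H4", "H1"],
    }
)

def top_haplotypes(data, gene, by="Population", n=2):
    """Return the n most frequent haplotypes of `gene` within each group."""
    freq = (
        data.groupby(by)[gene]
        .value_counts(normalize=True)  # within-group frequency, sorted descending
        .rename("Frequency")
    )
    # value_counts sorts within each group, so head(n) keeps the top n.
    return freq.groupby(level=by).head(n)

top = top_haplotypes(toy, "PfDHFR")
print(top)
```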

In [12]:
help(plot_haplotype_frequency)

Help on function plot_haplotype_frequency in module data_analysis.plot_haplotype_frequency:

plot_haplotype_frequency(data, gene, num_top_haplotypes=5, threshold=25, countries=None, populations=None, years=None, bin=False)
    Plot the top n haplotypes on a specife gene per country (or) population per year
    
    Parameters:
      - gene: Any of the genes in the Pf6+ dataframe ['PfCRT', 'Kelch', 'PfDHFR', 'PfEXO', 'PGB', 'Plasmepsin2/3', 'PfDHPS', 'PfMDR1']
      - country: Any of the countries in the Pf6+ dataframe (if specified, population value is not used) ['Bangladesh', 'Benin', 'Burkina Faso', 'Cambodia', 'Cameroon',
       'Colombia', "Côte d'Ivoire", 'Democratic Republic of the Congo',
       'Ethiopia', 'Gambia', 'Ghana', 'Guinea', 'India', 'Indonesia',
       'Kenya', 'Laos', 'Madagascar', 'Malawi', 'Mali', 'Mauritania',
       'Mozambique', 'Myanmar', 'Nigeria', 'Papua New Guinea', 'Peru',
       'Senegal', 'Tanzania', 'Thailand', 'Uganda', 'Vietnam']
      - population: A

In [13]:
pf6plus

| SampleId | Study | Year | Country | AdmDiv1 | Population | Process | IncludeInAnalysis | Latitude_country | Longitude_country | Latitude_adm1 | ... | PfMDR1:1034 | PfMDR1:1042 | PfMDR1:1226 | PfMDR1:1246 | PfARPS10:127 | PfARPS10:128 | PfFD:193 | PfCRT:326 | PfCRT:356 | PfMDR2:484 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| FP0008-C | 1147-PF-MR-CONWAY | 2014 | Mauritania | Hodh el Gharbi | WAF | WGS | True | 20.265149 | -10.337093 | 16.565426 | ... | S | N | F | D | V | D | D | N | [T/I] | T |
| FP0009-C | 1147-PF-MR-CONWAY | 2014 | Mauritania | Hodh el Gharbi | WAF | WGS | True | 20.265149 | -10.337093 | 16.565426 | ... | S | N | F | Y | V | D | D | N | T | T |
| FP0015-C | 1147-PF-MR-CONWAY | 2014 | Mauritania | Hodh el Gharbi | WAF | WGS | True | 20.265149 | -10.337093 | 16.565426 | ... | S | N | F | D | V | D | D | N | I | T |
| FP0016-C | 1147-PF-MR-CONWAY | 2014 | Mauritania | Hodh el Gharbi | WAF | WGS | True | 20.265149 | -10.337093 | 16.565426 | ... | S | N | F | D | V | D | D | N | I | T |
| FP0017-C | 1147-PF-MR-CONWAY | 2014 | Mauritania | Hodh el Gharbi | WAF | WGS | True | 20.265149 | -10.337093 | 16.565426 | ... | S | N | F | D | V | D | D | N | I | T |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| RCN12896 | 1238-PF-VN-NIMPE-GENRE | 2018 | Vietnam | Gia Lai | ESEA | AmpSeq-V1 | True | 16.694365 | 106.551796 | 13.797699 | ... | - | - | - | - | M | H | - | S | T | - |
| RCN12897 | 1238-PF-VN-NIMPE-GENRE | 2018 | Vietnam | Binh Phuoc | ESEA | AmpSeq-V1 | True | 16.694365 | 106.551796 | 11.755317 | ... | S | N | F | D | M | H | Y | S | T | I |
| RCN12899 | 1238-PF-VN-NIMPE-GENRE | 2018 | Vietnam | Binh Phuoc | ESEA | AmpSeq-V1 | True | 16.694365 | 106.551796 | 11.755317 | ... | - | - | F | D | [V/M] | [D/H] | - | S | - | - |
| RCN12900 | 1238-PF-VN-NIMPE-GENRE | 2018 | Vietnam | Binh Phuoc | ESEA | AmpSeq-V1 | True | 16.694365 | 106.551796 | 11.755317 | ... | S | N | F | D | M | H | Y | S | T | I |


In [14]:
plot_haplotype_frequency(pf6plus, 'PfDHFR', num_top_haplotypes=5, populations = ['CAF', 'EAF', 'ESEA', 'OCE', 'SAS', 'WAF', 'WSEA'], years = None, bin=False)

<data_analysis.plotting.Subplots at 0x11a8edd00>

In [15]:
plot_haplotype_frequency(pf6plus, 'PfDHFR', num_top_haplotypes=5, populations = ['CAF'], years = None, bin=False)

<data_analysis.plotting.Subplots at 0x11ac70c40>

In [16]:
plot_haplotype_frequency(pf6plus, 'PfDHFR', num_top_haplotypes=5, countries = ['Mali'], bin=False)

<data_analysis.plotting.Subplots at 0x11ac70c10>

## What else can I do? 

Imagine we are interested in the evolution of *Kelch13* haplotypes in ESEA. A quick way to visualise the context would be to compare different countries within this population and see whether any country-specific mutations can be detected.


In [17]:
plot_haplotype_frequency(pf6plus, 'Kelch', populations =  ['ESEA'])

<data_analysis.plotting.Subplots at 0x11ac70eb0>

### An extra use case
Imagine we want to explore how *dhfr* haplotypes differ between ESEA and WSEA. By first plotting only the top haplotype in each region and then increasing the number of haplotypes shown, we can easily see the dynamics.


In [18]:
plot_haplotype_frequency(pf6plus, 'PfDHFR', populations = ['WSEA','ESEA','WAF','EAF'], num_top_haplotypes = 1)

<data_analysis.plotting.Subplots at 0x11ac70e80>

In [19]:
plot_haplotype_frequency(pf6plus, 'PfDHFR', populations = ['WSEA','ESEA','WAF','EAF'])

<data_analysis.plotting.Subplots at 0x11ad88d90>

---

As you can see, there is a lot of room for further exploration of any of the use cases above.