
# Introduction to loading and plotting a cohort
This document contains a brief introduction to loading a cohort dataset and storing it as a list of *participants* and *trajectories* using the **basic.py** module. We then review some pre-made plotting functions included in the **plot.py** module.

## Loading the module
First we import the modules, together with pandas and the other packages used below.


In [1]:
from ARCH import basic, plot

import pandas as pd
import numpy as np
import plotly.express as px
import plotly.io as pio
pio.templates.default = "plotly_white"


pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)


Use pandas to import your dataset into a DataFrame, then use the load function in
the basic module to produce a list of participant class objects, each holding its
trajectory class objects.


In [2]:
df = pd.read_csv(r'Datasets/LBC_ARCHER.1PCT_VAF.Mar21.non-synonymous.tsv', sep='\t')
lbc = basic.load(df)

In the last section we provide a quicker way of loading a dataset using multiprocessing.

## Participant class

Each participant of the cohort has been stored in a
participant class object with attributes:
* **id**: a string with the participant's id
* **mutation_list**: a list of strings of all mutations detected at any wave
* **trajectories**: a list of trajectory class objects corresponding to each trajectory present in a participant.
* **data**: pandas dataframe slice of the full dataset containing information about this participant.

In [3]:
# Exploring the attributes
print('This participant\'s id is', lbc[0].id)
print('The first mutations present are: ', lbc[0].mutation_list[0:3])
print('This is a sample of its slice of dataset')
lbc[0].data.head()

This participant's id is LBC0001A
The first mutations present are:  ['CUX1 c.1255+6492A>G', 'ATRX c.2454C>G', 'TET2 c.2290dup']
This is a sample of its slice of dataset


Unnamed: 0,PreferredSymbol,type,HGVSp,protein_substitution,HGVSc,base_substitution,DP,AO,UAO,AF,participant_id,wave,chromosome,position,reference,mutation,gnomAD_AF,consequence,annotation,transcript_id,quality,FunctionalStatus,COSMICID,SIFT,PolyPhen,HasSeqDirBias,PosPrimerRefCount,NegPrimerRefCount,PosPrimerAltCount,NegPrimerAltCount,UDP,DAO,AUAO,UAO.1,SampleStrandBiasRatio,SampleStrandBiasProb,Min_Outlier_PValue,AF_Outlier_Pvalue,95MDAF,key,p_key,event_key,Variant_Classification,TYPE,p_key_1,age
0,CUX1,snp,NP_853530.2:p.Glu933Gly,p.Glu933Gly,NM_001913.3:c.1255+6492A>G,c.1255+6492A>G,943,6,6,0.0064,LBC0001A,1,chr7,101845375,A,G,,missense_variant,,NM_001913.3,204.817,Unknown,,deleterious(0.02),probably_damaging(0.997),No,419,508,2,4,207,0,192,6,1.65,0.695,,,,CUX1 c.1255+6492A>G,CUX1 p.Glu933Gly,CUX1_c.1255+6492A>G_LBC0001A,Missense_Mutation,SNP,CUX1 p.E933G,0
1,CUX1,snp,NP_853530.2:p.Glu933Gly,p.Glu933Gly,NM_001913.3:c.1255+6492A>G,c.1255+6492A>G,734,8,8,0.0109,LBC0001A,4,chr7,101845375,A,G,,missense_variant,,NM_001913.3,284.0,Unknown,,deleterious(0.02),probably_damaging(0.997),No,328,393,1,7,206,0,194,8,5.842,0.079,,,,CUX1 c.1255+6492A>G,CUX1 p.Glu933Gly,CUX1_c.1255+6492A>G_LBC0001A,Missense_Mutation,SNP,CUX1 p.E933G,9
2,ATRX,snp,NP_000480.2:p.Asp818Glu,p.Asp818Glu,NM_000489.3:c.2454C>G,c.2454C>G,2345,8,2,0.0034,LBC0001A,1,chrX,76938294,G,C,,missense_variant,,NM_000489.3,301.0,Unknown,,,probably_damaging(0.978),No,747,1588,2,6,258,0,254,2,1.411,1.0,,,,ATRX c.2454C>G,ATRX p.Asp818Glu,ATRX_c.2454C>G_LBC0001A,Missense_Mutation,SNP,ATRX p.D818E,0
3,ATRX,snp,NP_000480.2:p.Asp818Glu,p.Asp818Glu,NM_000489.3:c.2454C>G,c.2454C>G,2103,21,5,0.01,LBC0001A,3,chrX,76938294,G,C,,missense_variant,,NM_000489.3,794.0,Unknown,,,probably_damaging(0.978),No,664,1414,6,15,258,0,249,5,1.174,0.819,,,,ATRX c.2454C>G,ATRX p.Asp818Glu,ATRX_c.2454C>G_LBC0001A,Missense_Mutation,SNP,ATRX p.D818E,6
4,ATRX,snp,NP_000480.2:p.Asp818Glu,p.Asp818Glu,NM_000489.3:c.2454C>G,c.2454C>G,1801,11,1,0.0061,LBC0001A,4,chrX,76938294,G,C,,missense_variant,,NM_000489.3,400.0,Unknown,,,probably_damaging(0.978),No,603,1184,0,11,252,0,248,1,inf,0.02,,,,ATRX c.2454C>G,ATRX p.Asp818Glu,ATRX_c.2454C>G_LBC0001A,Missense_Mutation,SNP,ATRX p.D818E,9
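
To make this layout concrete, here is a minimal standard-library sketch of how flat rows could be grouped into participants and trajectories. It is not the implementation in **basic.py**: the names `ParticipantSketch`, `TrajectorySketch` and `group_rows` are hypothetical, and the rows are reduced to the `participant_id`, `key`, `AF` and `age` columns of the sample above.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class TrajectorySketch:
    """Hypothetical stand-in for a trajectory: one mutation followed over time."""
    mutation: str
    points: list = field(default_factory=list)  # (age, AF) pairs


@dataclass
class ParticipantSketch:
    """Hypothetical stand-in for a participant."""
    id: str
    trajectories: list = field(default_factory=list)

    @property
    def mutation_list(self):
        return [traj.mutation for traj in self.trajectories]


def group_rows(rows):
    """Group flat rows by participant id, then by mutation key."""
    grouped = defaultdict(lambda: defaultdict(list))
    for row in rows:
        grouped[row["participant_id"]][row["key"]].append((row["age"], row["AF"]))
    cohort = []
    for pid, by_key in grouped.items():
        trajs = [TrajectorySketch(key, sorted(pts)) for key, pts in by_key.items()]
        cohort.append(ParticipantSketch(pid, trajs))
    return cohort


# Rows copied from the sample above (only four columns kept)
rows = [
    {"participant_id": "LBC0001A", "key": "CUX1 c.1255+6492A>G", "AF": 0.0064, "age": 0},
    {"participant_id": "LBC0001A", "key": "CUX1 c.1255+6492A>G", "AF": 0.0109, "age": 9},
    {"participant_id": "LBC0001A", "key": "ATRX c.2454C>G", "AF": 0.0034, "age": 0},
    {"participant_id": "LBC0001A", "key": "ATRX c.2454C>G", "AF": 0.01, "age": 6},
    {"participant_id": "LBC0001A", "key": "ATRX c.2454C>G", "AF": 0.0061, "age": 9},
]
cohort = group_rows(rows)
print(cohort[0].id, cohort[0].mutation_list)
```

Each participant ends up with one trajectory per mutation key, which mirrors the relationship between **mutation_list** and **trajectories** described above.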



## Trajectories class
All mutations present in a participant are stored in a list of mutation
trajectories.


In [4]:
print('These are the first trajectories of this participant:')
print(lbc[0].trajectories[0:2])

# For future examples we will explore the first stored trajectory 
traj = lbc[0].trajectories[0]

These are the first trajectories of this participant:
[<ARCH.basic.trajectory object at 0x7f8fd5fd11d0>, <ARCH.basic.trajectory object at 0x7f8fd5fd1110>]


Each trajectory class object has the following attributes:
* **mutation**: string with the name of the mutation this trajectory follows
* **germline**: boolean with the germline status of the mutation
* **data**: pandas dataframe with the trajectory data
* **gradient**: gradient between the first and the last point of the trajectory


In [5]:
# Exploring trajectory attributes
print("This trajectory follows mutation ", traj.mutation, ".")
print('The mutated gene is', traj.mutation.split()[0], '.') 
print('The germline status of this trajectory is ', traj.germline, '.')  
print('The gradient from first time point to the last is', traj.gradient, '.')

This trajectory follows mutation  STAG2 c.1018C>A .
The mutated gene is STAG2 .
The germline status of this trajectory is  False .
The gradient from first time point to the last is -0.0037999999999999996 .


In [6]:
# data stores relevant information of a trajectory
traj.data

Unnamed: 0,AF,age,regularized_gradient
0,0.0231,79,-0.00196
1,0.0183,85,-0.003811
2,0.0117,88,
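
The numbers printed above are consistent with each gradient being the change in AF divided by the square root of the elapsed time, that is ΔAF / sqrt(Δage). This is inferred from the displayed values only, not from the source of **basic.py**, so treat the formula (and the helper name `sqrt_time_gradient`) as an assumption. A short check using the three points of `traj.data`:

```python
import math

# (age, AF) pairs copied from traj.data above
points = [(79, 0.0231), (85, 0.0183), (88, 0.0117)]


def sqrt_time_gradient(p0, p1):
    """Change in AF divided by sqrt(elapsed time); an assumed formula."""
    (t0, af0), (t1, af1) = p0, p1
    return (af1 - af0) / math.sqrt(t1 - t0)


# Gradient between the first and the last point
overall = sqrt_time_gradient(points[0], points[-1])
# Gradients between consecutive points
local = [sqrt_time_gradient(a, b) for a, b in zip(points, points[1:])]

print(round(overall, 6))
print([round(g, 6) for g in local])
```

Rounded to six decimals, `overall` agrees with `traj.gradient` and `local` agrees with the `regularized_gradient` column shown above.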


# Plotting data

We have created several plot functions to explore the data in the cohort.

## Longitudinal trajectories

### Profile of participants

The participant class comes with a **profile** method to plot its genetic longitudinal profile.

In [7]:
lbc[9].profile(germline=False)   # germline=False leaves out trajectories whose germline status is True

We can also plot a participant's profile using its participant id with **plot.plot_id**.

In [8]:
plot.plot_id(lbc,'LBC360021', germline=False)

We can also visualize longitudinal trajectories as a stack plot using **plot.stack_plot**.

In [9]:
plot.stack_plot(lbc[0], norm=False) # norm = True uses the % of the total contribution rather than VAF
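
As a rough illustration of what the `norm` option describes, the sketch below rescales the VAFs of several mutations at one time point so that they sum to 100%. The values and the helper `normalise` are made up for illustration; how **plot.stack_plot** computes this internally is not shown in this document.

```python
def normalise(vafs):
    """Return each VAF as a percentage of the summed VAF at one time point."""
    total = sum(vafs.values())
    if total == 0:
        # Nothing detected at this time point: every share is zero
        return {name: 0.0 for name in vafs}
    return {name: 100 * vaf / total for name, vaf in vafs.items()}


# Hypothetical VAFs of three mutations measured at the same wave
vafs = {"mutation A": 0.02, "mutation B": 0.01, "mutation C": 0.01}
shares = normalise(vafs)
print(shares)
```

With `norm=False` the stacked heights would be the raw VAFs; with normalisation each time point is shown as relative contributions instead.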

### Mutation trajectories

We can plot all trajectories present in the cohort containing a mutation in a gene using **plot.mutation**.

In [10]:
plot.mutation(lbc,'DNMT3A')
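
The same selection can be done by hand, since the gene symbol is the first word of `traj.mutation` (as used in cell 5). A minimal sketch, where `SimpleNamespace` objects stand in for the real participant and trajectory objects and `trajectories_in_gene` is a hypothetical helper, not part of **plot.py**:

```python
from types import SimpleNamespace


def trajectories_in_gene(cohort, gene):
    """Collect every trajectory whose mutation name starts with the gene symbol."""
    return [
        traj
        for part in cohort
        for traj in part.trajectories
        if traj.mutation.split()[0] == gene
    ]


# Stand-in cohort; mutation names are taken from the examples above
cohort = [
    SimpleNamespace(id="P1", trajectories=[
        SimpleNamespace(mutation="TET2 c.2290dup"),
        SimpleNamespace(mutation="ATRX c.2454C>G"),
    ]),
    SimpleNamespace(id="P2", trajectories=[
        SimpleNamespace(mutation="STAG2 c.1018C>A"),
    ]),
]
selected = trajectories_in_gene(cohort, "TET2")
print([traj.mutation for traj in selected])
```

With the loaded cohort, the call would be `trajectories_in_gene(lbc, 'DNMT3A')`, assuming the attributes behave as described in the sections above.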

## Most prevalent mutations

In [11]:
plot.top_bar(lbc, n_genes=10)

Alternatively, we can plot counts for all genes in the cohort:

In [12]:
plot.top_bar(lbc, all=True)
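
The counts behind such a bar chart can be approximated with `collections.Counter`. This sketch counts mutation names per gene; whether **plot.top_bar** counts trajectories, participants, or something else is not stated here, so the counting rule is an assumption and `gene_counts` is a hypothetical helper.

```python
from collections import Counter


def gene_counts(mutations):
    """Count mutation names per gene (the first word of each name)."""
    return Counter(name.split()[0] for name in mutations)


# Mutation names taken from the examples above; the repeated entry stands
# for the same mutation seen in a second participant
mutations = [
    "CUX1 c.1255+6492A>G",
    "ATRX c.2454C>G",
    "TET2 c.2290dup",
    "STAG2 c.1018C>A",
    "ATRX c.2454C>G",
]
counts = gene_counts(mutations)
print(counts.most_common(2))
```

`Counter.most_common(n)` gives the `n` most frequent genes, which is the kind of ranking a top-genes bar chart displays.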

## Gradient statistics


The *gradients* function allows us to create violin plots of the overall gradient of all trajectories containing a mutation in a list of genes.

In [13]:
plot.gradients(lbc, ['DNMT3A','JAK2','TET2','ASXL1'])

We can produce a similar plot that, instead of the overall gradient, shows the local gradients grouped by wave:

In [14]:
plot.local_gardients(lbc,['DNMT3A','JAK2','TET2','ASXL1'])

# Speeding up the loading process with multiprocessing

As the number of detected mutations grows, loading a cohort and its trajectories slows down considerably. Here is an alternative way of loading that exploits the fact that each participant can be loaded separately.

In [15]:
import multiprocessing as mp
from functools import partial

df = pd.read_csv(r'Datasets/LBC_ARCHER.1PCT_VAF.Jan2021.non-synonymous.tsv', sep='\t')

ids = df.participant_id.unique()         # array of all participant ids
partial_load_id = partial(basic.load_id, df=df)   # freeze the df argument

# load each participant in a separate worker process
with mp.Pool() as pool:
    lbc = list(pool.map(partial_load_id, ids))

## Comparing the speed improvement

In [16]:
import time

df = pd.read_csv(r'Datasets/LBC_ARCHER.1PCT_VAF.Jan2021.non-synonymous.tsv', sep='\t')

start_time = time.time()
lbc = basic.load(df)
print("basic.load method --- %s seconds ---" % (time.time() - start_time))

start_time = time.time()                 # restart the timer
ids = df.participant_id.unique()         # array of all participant ids
partial_load_id = partial(basic.load_id, df=df)   # freeze the df argument

# load each participant in a separate worker process
with mp.Pool() as pool:
    lbc = list(pool.map(partial_load_id, ids))

print("multiprocessing method --- %s seconds ---" % (time.time() - start_time))

basic.load method --- 12.851371049880981 seconds ---
multiprocessing method --- 7.300813436508179 seconds ---
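
For reference, the two timings above can be turned into a speedup factor. The numbers are copied from the output of this single run and will differ between machines, datasets and numbers of available cores.

```python
serial = 12.851371049880981       # seconds, single-process load
parallel = 7.300813436508179      # seconds, multiprocessing load

# Ratio of the two wall-clock times
speedup = serial / parallel
print(f"speedup: {speedup:.2f}x")
```

The gain is well below the number of worker processes one would typically have, which is expected: starting the pool and passing the DataFrame to the workers both add overhead.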
