
# Preprocessing the LBC for longitudinal analysis

This document gives a brief introduction to loading a cohort dataset and storing it as a list of *participants* and *trajectories* using the **basic.py** module. We then review some of the pre-made plotting functions included in the **plot.py** module.

## Loading the ARCH Python package
First we need to import the package modules, as well as the pandas package.


In [3]:
# Append root directory to system's path

import sys
sys.path.append('../ARCH_package')

import pandas as pd
import basic, plot

## Loading and preprocessing the cohort


Use pandas to import your dataset into a DataFrame, then use the **load** function in
the **basic** module to produce a list of participant class objects, each holding its
trajectory class objects.


In [4]:
# Load non-synonymous dataset
df = pd.read_csv(r'../Datasets/LBC_ARCHER.1PCT_VAF.13Jan22.non-synonymous.IncludesZeros.tsv', sep='\t')
lbc = basic.load(df, export_name='LBC_non-synonymous', create_dict=True)

100%|██████████| 82/82 [00:06<00:00, 13.53it/s]


In the last section we provide a quicker way of loading a dataset using multiprocessing.
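Conceptually, the loading step groups the rows of the DataFrame first by participant and then by mutation. The following is a minimal sketch of that idea, not the implementation in **basic.py**: the class names `Participant` and `Trajectory`, the function `load_sketch`, and the column name `participant_id` are assumptions made for illustration, and the toy table only reuses a few values from the outputs shown in this document.

```python
from dataclasses import dataclass

import pandas as pd


@dataclass
class Trajectory:
    mutation: str          # mutation name, e.g. 'ATRX c.2454C>G'
    data: pd.DataFrame     # one row per time point


@dataclass
class Participant:
    id: str
    mutation_list: list    # names of all mutations seen in this participant
    trajectories: list     # one Trajectory per mutation
    data: pd.DataFrame     # this participant's slice of the dataset


def load_sketch(df, id_col="participant_id", key_col="key"):
    """Group a long-format table into participants and trajectories.

    `id_col` is a hypothetical column name; `key_col` follows the `key`
    column visible in the dataset preview.
    """
    participants = []
    for part_id, part_df in df.groupby(id_col, sort=False):
        trajectories = [
            Trajectory(mutation=key,
                       data=traj_df[["AF", "age"]].reset_index(drop=True))
            for key, traj_df in part_df.groupby(key_col, sort=False)
        ]
        participants.append(
            Participant(id=part_id,
                        mutation_list=[t.mutation for t in trajectories],
                        trajectories=trajectories,
                        data=part_df)
        )
    return participants


# Toy table with two mutations followed over two time points.
toy = pd.DataFrame({
    "participant_id": ["LBC0001A"] * 4,
    "key": ["ATRX c.2454C>G", "ATRX c.2454C>G",
            "DDX41 c.1213A>C", "DDX41 c.1213A>C"],
    "AF": [0.0105, 0.007, 0.0046, 0.0125],
    "age": [85, 88, 85, 88],
})

cohort = load_sketch(toy)
print(cohort[0].id, cohort[0].mutation_list)
```

Grouping with `sort=False` keeps participants and mutations in the order in which they first appear in the table.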

## Participant class

Each participant in the cohort is stored as a
participant class object with the following attributes:
* **id**: a string with the participant's id
* **mutation_list**: a list of strings naming all mutations detected at any wave
* **trajectories**: a list of trajectory class objects, one for each trajectory present in the participant
* **data**: pandas dataframe slice of the full dataset containing the rows for this participant

In [5]:
# Exploring the attributes
print('This participant\'s id is', lbc[0].id)
print('The first mutations present are: ', lbc[0].mutation_list[0:2])
print('This is a sample of its slice of dataset')
lbc[0].data.head()

This participant's id is LBC0001A
The first mutations present are:  ['ATRX c.2454C>G', 'DDX41 c.1213A>C']
This is a sample of its slice of dataset


Unnamed: 0,PreferredSymbol,type,HGVSp,protein_substitution,HGVSc,base_substitution,DP,AO,UAO,AF,...,95MDAF,key,p_key,event_key,Variant_Classification,TYPE,delta_t,cohort,p_key_1,age
0,ATRX,snp,NP_000480.2:p.Asp818Glu,p.Asp818Glu,NM_000489.3:c.2454C>G,c.2454C>G,1997,21,5,0.0105,...,0.010516,ATRX c.2454C>G,ATRX p.Asp818Glu,ATRX_c.2454C>G_LBC0001A,Missense_Mutation,SNP,0,LBC21,ATRX p.D818E,85
1,ATRX,snp,NP_000480.2:p.Asp818Glu,p.Asp818Glu,NM_000489.3:c.2454C>G,c.2454C>G,1724,12,1,0.007,...,0.011601,ATRX c.2454C>G,ATRX p.Asp818Glu,ATRX_c.2454C>G_LBC0001A,Missense_Mutation,SNP,0,LBC21,ATRX p.D818E,88
2,DDX41,snp,NP_057306.2:p.Ser405Arg,p.Ser405Arg,NM_016222.2:c.1213A>C,c.1213A>C,876,4,4,0.0046,...,0.021689,DDX41 c.1213A>C,DDX41 p.Ser405Arg,DDX41_c.1213A>C_LBC0001A,Missense_Mutation,SNP,0,LBC21,DDX41 p.S405R,85
3,DDX41,snp,NP_057306.2:p.Ser405Arg,p.Ser405Arg,NM_016222.2:c.1213A>C,c.1213A>C,880,11,4,0.0125,...,0.021591,DDX41 c.1213A>C,DDX41 p.Ser405Arg,DDX41_c.1213A>C_LBC0001A,Missense_Mutation,SNP,0,LBC21,DDX41 p.S405R,88
4,KMT2A,snp,NP_005924.2:p.Val46Ile,p.Val46Ile,NM_005933.3:c.136G>A,c.136G>A,629,4,2,0.0064,...,0.022258,KMT2A c.136G>A,KMT2A p.Val46Ile,KMT2A_c.136G>A_LBC0001A,Missense_Mutation,SNP,0,LBC21,KMT2A p.V46I,85



## Trajectories class
Every mutation present in a participant is stored as a trajectory class
object in the participant's **trajectories** list.


In [6]:
n_sample = 2
print(f'First {n_sample} trajectories of this participant:')
print(lbc[0].trajectories[0: n_sample])

# For future examples we will explore the first stored trajectory 
traj = lbc[0].trajectories[0]

First 2 trajectories of this participant:
[<basic.trajectory object at 0x7f269eccf8d0>, <basic.trajectory object at 0x7f269eccfe50>]


Each trajectory class object has the following attributes:
* **mutation**: string with the name of the mutation this trajectory follows
* **germline**: boolean with the germline status of the mutation
* **data**: pandas dataframe with the trajectory data
* **gradient**: gradient between the first and the last point of the trajectory


In [7]:
# Exploring trajectory attributes
print("This trajectory follows mutation ", traj.mutation, ".")
print('The mutated gene is', traj.mutation.split()[0], '.') 
print('The germline status of this trajectory is ', traj.germline, '.')  
print('The gradient from first time point to the last is', traj.gradient, '.')

This trajectory follows mutation  ATRX c.2454C>G .
The mutated gene is ATRX .
The germline status of this trajectory is  False .
The gradient from first time point to the last is -0.0020207259421636905 .


In [8]:
# data stores relevant information of a trajectory
traj.data

Unnamed: 0,AF,age,DP,AO,delta_t
0,0.0105,85,1997,21,3
1,0.007,88,1724,12,0
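The gradient printed earlier can be checked against the two rows of this table. The sketch below retypes the table by hand and compares two candidate definitions. The printed value matches the change in AF divided by the square root of the elapsed time, rather than the plain slope per year; this is inferred from the numbers shown here and has not been confirmed against **basic.py**.

```python
import math

import pandas as pd

# The trajectory table shown above, retyped by hand (delta_t omitted).
traj_data = pd.DataFrame({
    "AF": [0.0105, 0.007],
    "age": [85, 88],
    "DP": [1997, 1724],
    "AO": [21, 12],
})

first, last = traj_data.iloc[0], traj_data.iloc[-1]
delta_af = last["AF"] - first["AF"]      # change in allele frequency
delta_age = last["age"] - first["age"]   # elapsed time in years

# Candidate 1: plain slope between the first and last point.
slope_per_year = delta_af / delta_age

# Candidate 2: change in AF scaled by the square root of elapsed time.
scaled_by_sqrt = delta_af / math.sqrt(delta_age)

print(slope_per_year)   # about -0.00117
print(scaled_by_sqrt)   # about -0.00202, matching traj.gradient above
```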


# Plotting data

We have created several plot functions to explore the data in the cohort.

## Longitudinal trajectories

### Profile of participants

The participant class comes with a **profile** method to plot its genetic longitudinal profile.

In [9]:
lbc[-3].profile(germline=False)   # germline=False filters out trajectories whose germline status is True

We can also plot a participant's profile using its participant id with **plot.plot_id**.

In [10]:
plot.plot_id(lbc,'LBC361214', germline=False)

### Most prevalent mutations in the cohort

In [13]:
# Bar plot of trajectory counts by gene.
# Parameter n_genes determines the number of genes to be shown.
plot.top_bar(lbc, n_genes=20)

Alternatively, we can plot counts for all genes in the cohort:

In [14]:
plot.top_bar(lbc, all=True)
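The counts behind such a bar plot can be approximated without the plotting code. The sketch below is an assumption about the underlying tally, not the body of **plot.top_bar**: the helper `gene_counts` is hypothetical, it takes the gene to be the first word of each mutation name (as in the trajectory example above), and the second participant's mutation names are made-up placeholders.

```python
from collections import Counter


def gene_counts(mutation_lists, n_genes=None):
    """Count trajectories per gene across participants.

    mutation_lists: one list of mutation names per participant.
    n_genes: number of top genes to return; None returns every gene.
    """
    counts = Counter(
        mutation.split()[0]            # gene symbol is the first word
        for mutations in mutation_lists
        for mutation in mutations
    )
    return counts.most_common(n_genes)


# First list: mutations from the dataset preview; second list: placeholders.
toy_mutation_lists = [
    ["ATRX c.2454C>G", "DDX41 c.1213A>C", "KMT2A c.136G>A"],
    ["ATRX c.placeholder1", "ATRX c.placeholder2", "DDX41 c.placeholder3"],
]

top_two = gene_counts(toy_mutation_lists, n_genes=2)
all_genes = gene_counts(toy_mutation_lists)
print(top_two)
print(all_genes)
```

Passing `n_genes=None` plays the role of `all=True` in the call above, returning the tally for every gene.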

## Gradient statistics


The *gradients* function creates violin plots of the overall gradient of all trajectories carrying a mutation in any gene of a given list.

In [15]:
plot.gradients(lbc, ['DNMT3A','JAK2','TET2','ASXL1'])
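The data feeding each violin is a list of gradients per requested gene. The sketch below shows one way to collect it and is not taken from **plot.py**: the helper `gradients_by_gene` is hypothetical, and every input value except the ATRX gradient printed earlier is a made-up placeholder.

```python
def gradients_by_gene(trajectories, genes):
    """Collect trajectory gradients for each gene in `genes`.

    trajectories: iterable of (mutation_name, gradient) pairs.
    Genes with no trajectory keep an empty list.
    """
    grouped = {gene: [] for gene in genes}
    for mutation, gradient in trajectories:
        gene = mutation.split()[0]     # gene symbol is the first word
        if gene in grouped:
            grouped[gene].append(gradient)
    return grouped


# Placeholder pairs for illustration; only the ATRX value comes from the text.
toy_trajectories = [
    ("DNMT3A c.placeholder1", 0.004),
    ("DNMT3A c.placeholder2", -0.001),
    ("TET2 c.placeholder3", 0.002),
    ("ATRX c.2454C>G", -0.0020207259421636905),
]

grouped = gradients_by_gene(toy_trajectories, ["DNMT3A", "JAK2", "TET2", "ASXL1"])
print(grouped)
```

Trajectories in genes outside the requested list, such as ATRX here, are skipped.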

# Load and export 2% and synonymous data

In [16]:
# Load synonymous dataset (1PCT_VAF file)
df = pd.read_csv(r'../Datasets/LBC_ARCHER.1PCT_VAF.13Jan22.synonymous.IncludesZeros.tsv', sep='\t')
syn = basic.load(df, export_name='LBC_synonymous', create_dict=False)

100%|██████████| 81/81 [00:02<00:00, 36.70it/s]


In [17]:
# Load non-synonymous dataset (2PCT_VAF file)
df = pd.read_csv(r'../Datasets/LBC_ARCHER.2PCT_VAF.13Jan22.non-synonymous.IncludesZeros.tsv', sep='\t')
lbc = basic.load(df, export_name='LBC_non-synonymous_2', create_dict=False)

100%|██████████| 57/57 [00:00<00:00, 214.36it/s]


In [18]:
# Load synonymous dataset (2PCT_VAF file)
df = pd.read_csv(r'../Datasets/LBC_ARCHER.2PCT_VAF.13Jan22.synonymous.IncludesZeros.tsv', sep='\t')
syn = basic.load(df, export_name='LBC_synonymous_2', create_dict=False)

100%|██████████| 7/7 [00:00<00:00, 259.08it/s]
