# Modelling ARCH

## Fitting exponential and sigmoid curves to trajectories

We want to fit sigmoid curves to different trajectories containing mutations on the same gene.


Each curve has the following parameters: 
* **fitness**: relative fitness advantage conferred by the mutation
* **displacement**: time at which the mutation was acquired
* **amplitude**: maximum capacity (only for sigmoid curves)


Since we follow the same mutation across several individuals, each trajectory can have its own displacement and amplitude, but the fitness parameter is shared across individuals.

### Loading the dataset and creating trajectories

In [213]:
from ARCH import basic, modelling
%load_ext autoreload
%autoreload 2

import pandas as pd

import numpy as np
import plotly.express as px
import plotly.graph_objects as go
colors=px.colors.qualitative.Plotly



In [214]:
# Load dataset and create a list of participants
df = pd.read_excel(r'LBC_ARCHER.non-synonymous.Aug_2020.xlsx')
lbc, total_grad = basic.load(df)

## Setting up trajectories



In [215]:
# Create filters 
l_filter = 0.005
u_filter = np.quantile(total_grad, 0.999)

In [216]:
DNMT3A = modelling.model(lbc,'DNMT3A',l_filter,u_filter)
DNMT3A.trajectories

# Training a model
We have implemented two models: **Exponential** and **Logistic**.

The exponential model fits the following curve:
$$f(x)=e^{\textit{fit}(x-\textit{dis})}.$$

The logistic model fits the following curve:
$$f(x)=\frac{\textit{amp}}{1+e^{-\textit{fit}(x-\textit{dis})}}.$$

For both models, the parameters are as follows:
* **fit**: fitness parameter, the exponential growth rate of a clone.
* **dis**: horizontal displacement, accounting for the time at which the mutation was acquired.
* **amp**: carrying capacity, the maximum value the curve can reach (logistic model only).
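As a minimal sketch, the two curves above can be written as plain Python functions. The names `exponential_curve` and `logistic_curve` are illustrative and are not part of the ARCH package.

```python
import math


def exponential_curve(x, fit, dis):
    """f(x) = exp(fit * (x - dis))."""
    return math.exp(fit * (x - dis))


def logistic_curve(x, fit, dis, amp):
    """f(x) = amp / (1 + exp(-fit * (x - dis)))."""
    return amp / (1.0 + math.exp(-fit * (x - dis)))


# At x = dis the logistic curve sits at half of its carrying capacity,
# and it approaches amp as x grows.
half = logistic_curve(x=10.0, fit=0.1, dis=10.0, amp=0.5)
late = logistic_curve(x=500.0, fit=0.1, dis=10.0, amp=0.5)
print(half, late)
```

In this parameterisation, **dis** is the midpoint of the logistic curve, while for the exponential curve it is the time at which the curve equals 1.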

The **model.fit** method has three parameters:
* **model = Default 'logistic'**. *String*, either 'logistic' or 'exponential' - chooses which model to fit.
* **common_fit = Default False**. *Boolean* - set to True to enforce a common fitness parameter across all trajectories.
* **l2 = Default 0.1**. *Float* - weight of the L2 regularization on the amplitude parameter.
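To illustrate what `common_fit` and `l2` could mean in practice, here is a hypothetical residual function for the logistic model. It is one plausible reading of the options above, not the ARCH implementation: the helper names, the data layout, and the exact form of the penalty (`l2 * amp` appended to the residual vector, so that a least-squares solver would also minimise `(l2 * amp) ** 2`) are all assumptions.

```python
import math


def logistic_curve(x, fit, dis, amp):
    """Logistic curve amp / (1 + exp(-fit * (x - dis)))."""
    return amp / (1.0 + math.exp(-fit * (x - dis)))


def penalised_residuals(trajectories, fits, displacements, amplitudes, l2=0.1):
    """Stack the residuals of all trajectories, then append an amplitude penalty.

    Hypothetical helper: the penalty form is an assumption, not taken from ARCH.
    """
    residuals = []
    for (xs, ys), fit, dis, amp in zip(trajectories, fits, displacements, amplitudes):
        residuals.extend(logistic_curve(x, fit, dis, amp) - y for x, y in zip(xs, ys))
    # One penalty term per trajectory, proportional to its amplitude.
    residuals.extend(l2 * amp for amp in amplitudes)
    return residuals


# Two toy trajectories (times, VAFs) generated from the curve itself.
times_a, times_b = [0.0, 3.0, 6.0], [0.0, 3.0]
trajectories = [
    (times_a, [logistic_curve(x, 0.1, 2.0, 0.4) for x in times_a]),
    (times_b, [logistic_curve(x, 0.1, -1.0, 0.3) for x in times_b]),
]

# A common fitness corresponds to reusing one value for every trajectory.
shared_fit = 0.1
res = penalised_residuals(trajectories, [shared_fit, shared_fit],
                          [2.0, -1.0], [0.4, 0.3], l2=0.1)
print(res)
```

With a shared fitness, the optimiser has one fitness value to estimate instead of one per trajectory, while displacement and amplitude remain trajectory-specific.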

In [176]:
DNMT3A.fit()

We can now check the fitted trajectories and the model report.

In [177]:
DNMT3A.plot()

In [186]:
DNMT3A.out

| Statistic | Value |
|---|---|
| fitting method | leastsq |
| # function evals | 4274 |
| # data points | 51 |
| # variables | 33 |
| chi-square | 0.00433555 |
| reduced chi-square | 2.4086e-04 |
| Akaike info crit. | -412.009408 |
| Bayesian info crit. | -348.259163 |

| name | value | initial value | min | max | vary |
|---|---|---|---|---|---|
| amp_1 | 0.4852412 | 0.4 | 0.2 | 0.5 | True |
| fit_1 | 0.07756096 | 0.0 | -inf | inf | True |
| dis_1 | 24.9988817 | 0.0 | -25.0 | 25.0 | True |
| amp_2 | 0.49780063 | 0.4 | 0.2 | 0.5 | True |
| fit_2 | 0.06070824 | 0.0 | -inf | inf | True |
| dis_2 | 21.0459858 | 0.0 | -25.0 | 25.0 | True |
| amp_3 | 0.49882007 | 0.4 | 0.2 | 0.5 | True |
| fit_3 | 0.16571019 | 0.0 | -inf | inf | True |
| dis_3 | 8.65646578 | 0.0 | -25.0 | 25.0 | True |
| amp_4 | 0.5 | 0.4 | 0.2 | 0.5 | True |
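The summary statistics in the report are related by simple formulas. The sketch below recomputes them from the chi-square, the number of data points and the number of variables. It assumes the report comes from lmfit (its layout suggests so) and that lmfit's definitions apply: reduced chi-square $=\chi^2/(N-N_{vars})$, AIC $=N\ln(\chi^2/N)+2N_{vars}$ and BIC $=N\ln(\chi^2/N)+\ln(N)\,N_{vars}$.

```python
import math

# Values copied from the report above.
chi_square = 0.00433555
n_points = 51
n_vars = 33

# Reduced chi-square divides by the degrees of freedom (51 - 33 = 18).
reduced_chi_square = chi_square / (n_points - n_vars)

# Both criteria share the same log-likelihood-like term and differ
# only in how strongly they penalise the number of variables.
log_term = n_points * math.log(chi_square / n_points)
aic = log_term + 2 * n_vars
bic = log_term + math.log(n_points) * n_vars

print(f"reduced chi-square: {reduced_chi_square:.4e}")
print(f"AIC: {aic:.3f}")
print(f"BIC: {bic:.3f}")
```

The recomputed values agree with the report up to the rounding of the printed chi-square. Lower AIC and BIC indicate a better trade-off between goodness of fit and number of parameters.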


## Fitness parameter estimates
We can now access the fitness parameter estimates through the attribute **model.fitness**, and we have implemented a method to plot their distribution, **model.fitness_plot()**.

In [184]:
print('Mean of fitness parameters:', np.mean(DNMT3A.fitness))
DNMT3A.fitness_plot()

Mean of fitness parameters: 0.09583748577625041
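Under the exponential model, a fitness value translates directly into a doubling time: solving $e^{\textit{fit}\cdot t}=2$ gives $t=\ln 2/\textit{fit}$. The sketch below applies this to the mean printed above, assuming time is measured in years (as in the axis label used later in this notebook). For the logistic model, this only approximates the early growth phase, well below the carrying capacity.

```python
import math

# Mean fitness printed above; the time unit (years) is an assumption.
mean_fitness = 0.09583748577625041

# Time needed for exp(fit * t) to double.
doubling_time = math.log(2) / mean_fitness
print(f"Doubling time: {doubling_time:.2f} years")
```

For this mean fitness, the doubling time comes out at a little over seven years.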


# Comparing exponential vs logistic fit

In [208]:
# create two models with the same data corresponding to JAK2 c.1849G>T trajectories
logistic = modelling.model(lbc,'JAK2',l_filter,u_filter)
exponential = modelling.model(lbc,'JAK2',l_filter,u_filter)
logistic.trajectories

In [210]:
# train both models:
exponential.fit(model='exponential', common_fit=True)
logistic.fit(common_fit=True)
exponential.plot().show()
logistic.plot().show()

## Overlapping exponential and logistic plots

In [211]:
# Round the fitted fitness and the total squared error (chi-square) to 3 decimals
log_fit = round(float(logistic.out.params['fit_1']), 3)
exp_fit = round(float(exponential.out.params['fit_1']), 3)
log_err = round(logistic.out.chisqr, 3)
exp_err = round(exponential.out.chisqr, 3)

fig=go.Figure()

x_line = np.linspace(0,4,1000)
y_log = modelling.logistic_dataset(logistic.out.params, 0, x_line)
y_exp = modelling.exponential_dataset(exponential.out.params, 0, x_line)
    

fig.add_trace(go.Scatter(x=x_line, y=y_log,mode='lines',line=dict(color='#222A2A'), 
                             legendgroup="group",  # this can be any string, not just "group"
                             name="Logistic fit",
                            ))

fig.add_trace(go.Scatter(x=x_line, y=y_exp,mode='lines',line=dict(color='#222A2A',dash='dash'), 
                             legendgroup="group",  # this can be any string, not just "group"
                             name="Exponential fit",
                            ))



fig.add_trace(go.Scatter(x=exponential.x[0], y=exponential.y[0], mode='markers', line=dict(color='#222A2A'),
                             legendgroup="group",  # this can be any string, not just "group"
                             name="Data points",
                            ))

    


for i, points in enumerate(zip(exponential.x,exponential.y)):
    x_line = np.linspace(0,10,1000)
    y_log = modelling.logistic_dataset(logistic.out.params, i, x_line)
    y_exp = modelling.exponential_dataset(exponential.out.params, i, x_line)

    fig.add_trace(go.Scatter(x=points[0], y=points[1],mode='markers', line=dict(color=colors[i%10-1]), 
                             legendgroup="group2",  # this can be any string, not just "group"
                             name=f'{exponential.participants[i]}'
                            ))

    fig.add_trace(go.Scatter(x=x_line, y=y_log,mode='lines',line=dict(color=colors[i%10-1]), 
                             legendgroup="group2",  # this can be any string, not just "group"
                             showlegend=False
                            ))
    fig.add_trace(go.Scatter(x=x_line, y=y_exp,mode='lines', line=dict(color=colors[i%10-1],dash='dash'), 
                             legendgroup="group2",  # this can be any string, not just "group"
                             name ='Fitted trajectory',
                             showlegend=False
                            ))

# Edit the layout
fig.update_layout(title=f'Exponential vs logistic fit of {exponential.gene} trajectories<br>'
                        f'Total squared error: logistic = {log_err}, exponential = {exp_err}',
                  xaxis_title='Time (years since first data collection)',
                  yaxis_title='VAF',
                  )
fig