**All work presented in these notebooks is based on data provided by the research group behind the paper by Gardner, Zanic, et al., titled: Depolymerizing Kinesins Kip3 and MCAK Shape Cellular Microtubule Architecture by Differential Control of Catastrophe (https://www.sciencedirect.com/science/article/pii/S0092867411012876?via%3Dihub).**

# Looking for a Distribution

To better analyze the catastrophe times and the factors influencing their distribution, it would be useful to deduce an approximate theoretical distribution describing the data. From the ECDFs previously presented, it appears that the older a microtubule is, the more likely it is to undergo catastrophe: very few microtubules experience catastrophe at the earliest times. This age-dependence implies that there may be a step or steps, each occurring at some rate, that must happen in order for catastrophe to occur. This matches the story of the gamma distribution.

The gamma distribution describes the waiting time for multiple arrivals, or multiple steps, of a Poisson process. The underlying assumption of the gamma distribution is that the steps occur at the same rate. While this may not be true for the microtubule catastrophe process, it gives a useful estimate of the lower limit of the number of steps involved. In a more physical sense, the steps here are features or events which promote catastrophe. The parameters of the gamma distribution are $\alpha$, the number of steps, and $\beta$, the rate of occurrence of those steps. Thus, $\alpha = 1$ describes a single-step process.
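As a minimal sketch of this story, we can add up $\alpha$ exponentially distributed step times and compare the result with direct draws from a gamma distribution. The values of `alpha` and `beta` below are illustrative assumptions, not estimates from the microtubule data.

```python
import numpy as np

# Illustrative values only (not fitted to the microtubule data)
alpha = 3      # number of steps
beta = 2.0     # rate of each step
n_samples = 100_000

rng = np.random.default_rng(seed=1)

# NumPy parameterizes the exponential by its scale, which is 1 / rate
step_times = rng.exponential(scale=1 / beta, size=(n_samples, alpha))

# The total waiting time is the sum of the alpha individual step times
total_times = step_times.sum(axis=1)

# Direct draws from a gamma distribution with the same parameters
gamma_times = rng.gamma(shape=alpha, scale=1 / beta, size=n_samples)

# For a gamma distribution, mean = alpha / beta and variance = alpha / beta**2
print(total_times.mean(), gamma_times.mean(), alpha / beta)
print(total_times.var(), gamma_times.var(), alpha / beta**2)
```

The two sets of samples should agree in mean and variance up to sampling noise, which is what we expect if the summed step times are gamma distributed.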

A single-step process is not age-dependent, as it depends only on the random occurrence of a single event, which can occur with equal probability at any moment in time. Since the ECDFs suggest that there is an age dependence and that catastrophe is more likely to happen as time progresses (with all microtubules depolymerizing before approximately 2000 seconds), it is likely that $\alpha > 1$ and the process involves multiple steps.
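The difference between a single-step and a multi-step process can be illustrated with a small simulation. The sketch below estimates the probability of surviving a further time `t` for a "fresh" microtubule and for one that has already survived a time `s`. The rate and the times `s` and `t` are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(seed=2)
beta = 1.0          # illustrative rate
n_samples = 200_000
s, t = 1.0, 1.0     # already survived for s; ask about a further time t


def conditional_survival(samples, s, t):
    """Estimate P(T > s + t | T > s) from simulated waiting times."""
    survived_s = samples[samples > s]
    return np.mean(survived_s > s + t)


# alpha = 1: single-step (exponential) process
one_step = rng.exponential(scale=1 / beta, size=n_samples)
# alpha = 2: two steps occurring at the same rate (gamma distributed)
two_step = rng.gamma(shape=2, scale=1 / beta, size=n_samples)

results = {}
for name, samples in [("alpha = 1", one_step), ("alpha = 2", two_step)]:
    fresh = np.mean(samples > t)                # P(T > t)
    aged = conditional_survival(samples, s, t)  # P(T > s + t | T > s)
    results[name] = (fresh, aged)
    print(name, fresh, aged)
```

For $\alpha = 1$ the two probabilities should agree up to sampling noise (no age dependence), while for $\alpha = 2$ the aged microtubule should be less likely to survive the additional time than a fresh one.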

## Two successive events to trigger catastrophe

The ECDFs suggest that catastrophe is a multi-step process; however, it is highly unlikely that each of the steps occurs at the same rate. Thus, to explore a more realistic model of the process, we assume that catastrophe occurs by the completion of two molecular processes, which can be modeled as Poisson events (so that the waiting time for each event is exponentially distributed, i.e., gamma distributed with $\alpha = 1$). The rates of these events are $\beta_{1}$ and $\beta_{2}$, respectively. Since these events are steps in a process, event 1 must occur first, then event 2, before catastrophe can be triggered.

This prospective distribution will be explored in the code below by simulating an experiment where 150 catastrophe events are observed. These "observations" are simulated by drawing random numbers from an exponential distribution (the time for a single arrival of a Poisson event is exponentially distributed) for each step of the process. Thus, the number drawn corresponds to the time of arrival of that event.

For catastrophe to be triggered, event 1 must occur before event 2. Therefore, when a random time is drawn for each event, the time drawn for event 1 must be less than that drawn for event 2. If it is, the events occurred in the required order, and the total time to catastrophe is the arrival time of event 2.

If event 2 arrives before event 1, the draw does not match the story of the multi-step process. In that case, the arrival time for event 2 is redrawn, this time measured from the arrival of event 1, and the overall time to catastrophe is the original arrival time for event 1 plus the new arrival time for event 2. Because exponential waiting times are memoryless, restarting the clock for event 2 in this way does not bias the simulated times.

These catastrophe times are collected and used to plot ECDFs for various values of $\beta_{1}$ and $\beta_{2}$ below. This will give a visual representation of the effect of ratios of the rate of arrival of each event on the distribution.
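As a sanity check on the redraw rule, the sketch below compares it with the most direct reading of the story: wait for event 1, then start the clock for event 2 and add the two times. This is a vectorized illustration rather than the notebook's own function, and the rates are illustrative assumptions. It always draws a spare time for event 2 and uses it only where needed, which should be distributionally the same as redrawing on demand.

```python
import numpy as np

rng = np.random.default_rng(seed=3)
beta1, beta2 = 1.0, 0.3   # illustrative rates
n_samples = 100_000

# Redraw scheme: draw both times, and redraw event 2 if it came first
t_1 = rng.exponential(1 / beta1, size=n_samples)
t_2 = rng.exponential(1 / beta2, size=n_samples)
t_2_redrawn = rng.exponential(1 / beta2, size=n_samples)
t_redraw_scheme = np.where(t_1 <= t_2, t_2, t_1 + t_2_redrawn)

# Direct scheme: time for event 1 plus a fresh time for event 2
t_direct = (rng.exponential(1 / beta1, size=n_samples)
            + rng.exponential(1 / beta2, size=n_samples))

# Compare a few quantiles of the two sets of simulated times
quantiles = [0.1, 0.25, 0.5, 0.75, 0.9]
q_redraw = np.quantile(t_redraw_scheme, quantiles)
q_direct = np.quantile(t_direct, quantiles)
print(q_redraw)
print(q_direct)
```

The quantiles of the two schemes should agree up to sampling noise, consistent with the memoryless property of the exponential distribution.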

In [1]:
import numpy as np
import pandas as pd
import scipy.stats

import bokeh.plotting
import bokeh.io
import bokeh_catplot
bokeh.io.output_notebook()

rg = np.random.default_rng(seed=0)

In [2]:
def wait_times(beta1, beta2_ratio_list, num_trials, normalize=True):
    '''
    Parameters:
    -----------
    beta1: rate of occurrence of event 1
    beta2_ratio_list: list of ratios of beta2/beta1
    num_trials: number of times to catastrophe to draw from distribution
    normalize: whether to normalize results by beta1
    
    Returns:
    --------
    t_array: (num_trials x len(beta2_ratio_list)) array of wait times 
        for catastrophe for each beta2/beta1 ratio
    '''
    
    t_array = np.zeros((num_trials, len(beta2_ratio_list)))

    for i, beta2_ratio in enumerate(beta2_ratio_list):

        beta2 = beta2_ratio * beta1
        
        for j in range(num_trials):

            #find how long it takes for one event to happen
            t_1 = rg.exponential(1/beta1)
            t_2 = rg.exponential(1/beta2)

            # event 1 then event 2 --> catastrophe
            if t_1 <= t_2:
                t_event = t_2

            # else, redraw time for event 2 
            else:
                t_2 = rg.exponential(1/beta2)

                t_event = t_1 + t_2
                
            if normalize:
                
                t_event = t_event * beta1

            t_array[j,i] = t_event
            
    return t_array

Next, we'll call the function to generate 150 samples from a distribution corresponding to varying ratios of $\beta_{2} / \beta_{1}$ :

In [3]:
beta1 = 1
beta2_ratio_list = [0.3, 1, 3]
num_trials=150

# call the wait_times function
t_array = wait_times(beta1, beta2_ratio_list, num_trials, normalize=True)

We'll load this data into a ```dataframe``` logging the wait time and the corresponding $\beta_{2} / \beta_{1}$ ratio for easy plotting.

In [4]:
# presize lists for loading with data
t_column = (num_trials * len(beta2_ratio_list)) * [None]
beta_column = (num_trials * len(beta2_ratio_list)) * ['']

for i, beta2_ratio in enumerate(beta2_ratio_list):
    
    # this will index the row to start storing data corresponding to a beta2/beta1 ratio
    row_num = i*num_trials
    
    # store corresponding values
    t_column[row_num:row_num+num_trials] = t_array[:,i]
    beta_column[row_num:row_num+num_trials] = num_trials * [str(beta2_ratio)]

# store values in a dictionary to convert to dataframe
name_dict = {'wait time (1/beta1)':t_column, 'beta2/beta1':beta_column}

df_for_ecdf = pd.DataFrame(name_dict)

A quick check to make sure that the data is loaded properly...

In [5]:
df_for_ecdf.tail()

| | wait time (1/beta1) | beta2/beta1 |
|---|---|---|
| 445 | 0.578961 | 3 |
| 446 | 0.866102 | 3 |
| 447 | 0.287249 | 3 |
| 448 | 1.66993 | 3 |
| 449 | 2.427431 | 3 |


This time we'll use the ECDF plotting tool from the ```bokeh_catplot``` package. The ```val``` keyword indicates the column whose distribution we are examining, and the ```cats``` keyword indicates the column to group by, so that a separate ECDF is plotted and colored for each $\beta_{2} / \beta_{1}$ ratio.

In [6]:
#make the ecdf
p = bokeh_catplot.ecdf(
    data=df_for_ecdf,
    cats=['beta2/beta1'],
    val='wait time (1/beta1)',
    style='staircase'
)

p.legend.location = 'bottom_right'

bokeh.io.show(p)

When $\beta_{2}$ is small compared to $\beta_{1}$, event 2 occurs at a lower rate than event 1. Event 1 therefore tends to happen quickly, but it takes longer for event 2 to follow, resulting in longer waiting times for catastrophe. This explains the shift of the distributions to the right with decreasing $\beta_{2} / \beta_{1}$ ratio. As the ratio grows, event 2 tends to follow event 1 quickly, so the time to catastrophe is shorter. Recall that in the last notebook the ECDFs for labeled and unlabeled tubulin rose gradually (like that of $\beta_{2} / \beta_{1} = 0.3$), with wait times as long as 2000 seconds. This suggests that the steps following the first step are comparatively slow, so that the second event (and the third, and so on) is unlikely to follow quickly after the first trigger event.
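The shift can also be seen from the mean waiting time. Since the time to catastrophe is the time for event 1 plus the time for event 2, its mean is $1/\beta_{1} + 1/\beta_{2}$, which in units of $1/\beta_{1}$ becomes $1 + \beta_{1}/\beta_{2}$. A short sketch of this arithmetic for the three ratios used above:

```python
# Ratios beta2 / beta1 used in the simulation above
beta2_ratio_list = [0.3, 1, 3]

# Mean time for event 1 plus mean time for event 2, in units of 1 / beta1:
# beta1 * (1 / beta1 + 1 / beta2) = 1 + 1 / (beta2 / beta1)
expected_means = {ratio: 1 + 1 / ratio for ratio in beta2_ratio_list}

for ratio, mean in expected_means.items():
    print(f"beta2/beta1 = {ratio}: mean dimensionless wait = {mean:.2f}")
# beta2/beta1 = 0.3: mean dimensionless wait = 4.33
# beta2/beta1 = 1: mean dimensionless wait = 2.00
# beta2/beta1 = 3: mean dimensionless wait = 1.33
```

The smaller the ratio, the larger the mean, in line with the rightward shift of the ECDFs.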

## Deriving an analytical PDF expression

It will be useful for further analysis to derive an analytical PDF expression for this process of two successive events to trigger catastrophe.

To do this, we need to consider event 1 happening first, followed by event 2, with the whole sequence completing at time t. The probability density function (PDF) $f(t; \beta_{1}, \beta_{2})$ then describes the probability density of catastrophe occurring at time t.

Divide the time interval $[0, t]$ into two intervals by introducing an intermediate time, $t_{1}$, with $0 \leq t_{1} \leq t$. This time describes the amount of time it takes for event 1 to occur. This leaves $t - t_{1}$ for event 2 to occur. Importantly, both events 1 and 2 are Poisson events and occur independently with the following exponential probability distributions:

$$f_{1}(t; \beta_{1}) = \beta_{1} \exp{( -\beta_{1}t)}$$
$$f_{2}(t; \beta_{2}) = \beta_{2} \exp{( -\beta_{2}t)}$$

The probability density for event 1 occurring at time $t_{1}$ is:

$$f_{1}(t_{1}) = \beta_{1} \exp{( -\beta_{1}t_{1})}$$

Similarly, the probability density for event 2 occurring a time $t - t_{1}$ after event 1 is:

$$f_{2}(t - t_{1}) = \beta_{2} \exp{( -\beta_{2}(t - t_{1}))}$$

Thus, the joint probability density of catastrophe occurring at time t AND the first event occurring at time $t_{1}$ is:

$$P(\textrm{catastrophe, event 1 in t1}) = P(\textrm{event 1 in t1}) \times P(\textrm{event 2 in t-t1})$$

Which is:

$$P(\textrm{catastrophe, event 1 in t1}) = \beta_{1} \exp{( -\beta_{1}t_{1})} \times \beta_{2} \exp{( -\beta_{2}(t - t_{1}))} = \beta_{1} \beta_{2} \exp{(-(\beta_{1} - \beta_{2})t_{1})} \exp{(-\beta_{2}t)}$$

We have now found the probability density for the specific case in which event 1 happens at time $t_{1}$. But $t_{1}$ can be any time between 0 and t. To get the probability density of catastrophe at time t, we need to integrate over all possible $t_{1}$, from 0 to t:

$$P(\textrm{catastrophe in t}) = \int_\textrm{all t1} P(\textrm{catastrophe, event 1 in t1}) \, dt_{1} = \int_0^t \beta_{1} \beta_{2} \exp{(-(\beta_{1} - \beta_{2})t_{1})} \exp{(-\beta_{2}t)} \, dt_{1}$$

This integral can be analytically evaluated:

$$P(\textrm{catastrophe in t}) = \int_0^t \beta_{1} \beta_{2} \exp{(-(\beta_{1} - \beta_{2})t_{1})} \exp{(-\beta_{2}t)} \, dt_{1} = \beta_{1} \beta_{2} \int_0^t \exp{(-(\beta_{1} - \beta_{2})t_{1})} \exp{(-\beta_{2}t)} \, dt_{1}$$

$$= \beta_{1} \beta_{2} \exp{(-\beta_{2}t)} \int_0^t \exp{(-(\beta_{1} - \beta_{2})t_{1})} \, dt_{1}$$

Evaluating the simplified integral:

$$= \beta_{1} \beta_{2} \exp{(-\beta_{2}t)} [\frac{1}{\beta_{2} - \beta_{1}} \exp{(-(\beta_{1} - \beta_{2})t_{1})}] \, \big|_0^t$$

Further simplifying:

$$= \frac{\beta_{1}\beta_{2}}{\beta_{2} - \beta_{1}} \exp{(-\beta_{2}t)} [\exp{(-(\beta_{1} - \beta_{2})t)} - 1]$$

$$= \frac{\beta_{1}\beta_{2}}{\beta_{2} - \beta_{1}} [\exp{(-\beta_{2}t)} \frac{\exp{(-\beta_{1}t)}}{\exp{(-\beta_{2}t)}} - \exp{(-\beta_{2}t)}]$$

Our final expression for the PDF, $f(t; \beta_{1}, \beta_{2})$, valid for $\beta_{1} \neq \beta_{2}$, is the following:

$$P(\textrm{catastrophe in t}) = \frac{\beta_{1}\beta_{2}}{\beta_{2} - \beta_{1}} [\exp{(-\beta_{1}t)} - \exp{(-\beta_{2}t)}]$$
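As a quick numerical check of this expression, the sketch below integrates it with a hand-written trapezoidal rule. If it is a valid PDF, the area should be close to 1, and its mean should be close to $1/\beta_{1} + 1/\beta_{2}$. The function name `two_step_pdf` and the rates are illustrative assumptions.

```python
import numpy as np


def two_step_pdf(t, beta1, beta2):
    """PDF derived above; requires beta1 != beta2."""
    prefactor = beta1 * beta2 / (beta2 - beta1)
    return prefactor * (np.exp(-beta1 * t) - np.exp(-beta2 * t))


beta1, beta2 = 1.0, 0.3            # illustrative rates
t = np.linspace(0, 100, 200_001)   # upper limit far into the tail
pdf_vals = two_step_pdf(t, beta1, beta2)

# Trapezoidal rule, written out by hand
dt = np.diff(t)
area = np.sum(0.5 * (pdf_vals[1:] + pdf_vals[:-1]) * dt)

t_pdf = t * pdf_vals
mean = np.sum(0.5 * (t_pdf[1:] + t_pdf[:-1]) * dt)

print(area)                         # should be close to 1
print(mean, 1 / beta1 + 1 / beta2)  # the two values should be close
```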

## Confirming the analytical expression

To confirm that the expression obtained describes the ECDFs presented for values of $\beta_{1}$ and $\beta_{2}$, we integrate the PDF to obtain the CDF and overlay a graph of the CDF on the ECDFs plotted above. However, there is an issue of units: the CDF is in terms of t, while the ECDFs above are functions of the dimensionless time $\beta_{1}t$. To make these comparable, the CDF must be expressed in terms of dimensionless time. This involves a change of variables from t to $\tilde{t} = \beta_{1}t$.

$$F_{T}(t) = \frac{\beta_{1}\beta_{2}}{\beta_{2} - \beta_{1}} [\frac{1}{\beta_{1}} (1 - \exp{(-\beta_{1}t)}) - \frac{1}{\beta_{2}} (1 - \exp{(-\beta_{2}t)})]$$

$$g(t) = \tilde{t} = \beta_{1}t$$

$$g^{-1}(\tilde{t}) = \frac{\tilde{t}}{\beta_{1}}$$ 

Because a CDF is a probability rather than a probability density, no Jacobian factor is needed; we simply evaluate $F_{T}$ at $g^{-1}(\tilde{t})$:
$$F_{\tilde{T}}(\tilde{t}) = P(\beta_{1}T \leq \tilde{t}) = P(T \leq \frac{\tilde{t}}{\beta_{1}}) = F_{T}(\frac{\tilde{t}}{\beta_{1}})$$

This can be expressed as follows:
$$F_{\tilde{T}}(\tilde{t}) = \frac{\beta_{1}\beta_{2}}{\beta_{2} - \beta_{1}} [\frac{1}{\beta_{1}} (1 - \exp{(-\tilde{t})}) - \frac{1}{\beta_{2}} (1 - \exp{(-\frac{\beta_{2}}{\beta_{1}}\tilde{t})})]$$

As a check, this expression tends to $\frac{\beta_{1}\beta_{2}}{\beta_{2} - \beta_{1}} (\frac{1}{\beta_{1}} - \frac{1}{\beta_{2}}) = 1$ as $\tilde{t} \to \infty$, as a CDF must.

This new CDF in terms of dimensionless time will be used to generate graphs of the analytical CDF and overlay them onto the ECDF.

We'll first write a function to return values of the CDF expression, then write a function that calls this function for various values of $\beta_{1}$ and $\beta_{2}$.

In [7]:
def cdf_function(beta1, beta2, t_tilde):
    '''
    Parameters:
    -----------
    beta1, beta2: rate of occurrence of events 1 and 2, as before
    t_tilde: dimensionless time, beta1 * time
    
    Returns:
    --------
    the value of F(t_tilde; b1, b2) where F is the cdf
    '''
    
    # evaluate individual terms within the expression
    # (the prefactor includes beta1 so that the CDF tends to 1 at long times)
    beta_frac = beta1 * beta2 / (beta2 - beta1)
    term_1 = (1 / beta1) * (1 - np.exp(-t_tilde))
    term_2 = (1 / beta2) * (1 - np.exp(-t_tilde * beta2 / beta1))
    
    return beta_frac * (term_1 - term_2)
    

def cdf_overlay(beta1, beta2_ratio_list):
    '''
    Returns arrays of t and y coordinates for plotting the CDF of
    values of beta1 and beta2/beta1.
    '''
    
    # determine the (dimensionless) t-axis upper bound as 6 / min(beta2/beta1),
    # i.e. if the smallest ratio of beta2/beta1 is 0.3, then the t axis goes to 20
    t_limit = 6 / min(beta2_ratio_list)
    
    # make 100 points from 0 to the upper limit to plot over
    t_coords = np.linspace(0, t_limit, 100)
    
    # make an empty array to hold the y-values
    y_coords = np.empty([len(beta2_ratio_list), len(t_coords)])
    
    for i, beta2_ratio in enumerate(beta2_ratio_list):
        
        # convert the ratio beta2/beta1 into the rate beta2
        beta2 = beta2_ratio * beta1
        
        # evaluate function for time values
        y_coords[i,:] = cdf_function(beta1, beta2, t_coords)
            
    return t_coords, y_coords

In [8]:
t_coords = np.linspace(0, 20, 100)
result = cdf_function(1, 0.3, t_coords)
result.shape

(100,)

In [9]:
#now overlay everything on the previous plot

beta1 = 1
beta2_list = [0.3, 0.999, 3] #can't have beta1 = beta2 so I made beta2 be 0.999 instead of 1

x_data, y_data_list = cdf_overlay(beta1, beta2_list)

for y_data in y_data_list:
    
    p.line(
        x=x_data,
        y=y_data
    )
    
    
bokeh.io.show(p)

In the graph above, we see that the analytical expression for the CDF matches the ECDFs very closely.
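The visual agreement can also be quantified. The sketch below computes the largest vertical gap between an ECDF and the analytical CDF (the Kolmogorov-Smirnov distance) for freshly simulated dimensionless times. It is an illustration under a few assumptions: the CDF is written in terms of the ratio $r = \beta_{2}/\beta_{1}$ with $\beta_{1} = 1$ as in the plots above, the helper name `two_step_cdf` is hypothetical, and many more samples than the 150 used above are drawn to reduce sampling noise.

```python
import numpy as np


def two_step_cdf(t_tilde, ratio):
    """CDF of the dimensionless time for ratio = beta2 / beta1 != 1."""
    return (ratio * (1 - np.exp(-t_tilde))
            - (1 - np.exp(-ratio * t_tilde))) / (ratio - 1)


rng = np.random.default_rng(seed=4)
n_samples = 20_000   # many more than 150, to reduce sampling noise

max_gaps = {}
for ratio in [0.3, 3]:
    # Dimensionless times with beta1 = 1, so that beta2 = ratio
    samples = np.sort(rng.exponential(1.0, n_samples)
                      + rng.exponential(1 / ratio, n_samples))
    cdf_vals = two_step_cdf(samples, ratio)

    # ECDF values just after and just before each sorted sample
    ecdf_hi = np.arange(1, n_samples + 1) / n_samples
    ecdf_lo = np.arange(0, n_samples) / n_samples

    max_gaps[ratio] = max(np.max(ecdf_hi - cdf_vals),
                          np.max(cdf_vals - ecdf_lo))
    print(ratio, max_gaps[ratio])
```

A small maximum gap for each ratio supports the conclusion that the analytical CDF describes the simulated waiting times.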