# Guide to tHMM

In [1]:
import numpy as np
import scipy.stats as sp

In [25]:
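import matplotlib.pyplot as plt

# Draw 300 samples from a gamma distribution with shape 0.5 and scale 10
obs = sp.gamma.rvs(a=0.5, scale=10.0, size=300)
# gamma.fit returns (shape, loc, scale), in that order
a, b, c = sp.gamma.fit(obs)
print(a, b, c)
plt.hist(obs)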



0.44713425811477236 0.0011697672343250452 12.03651397618688


### Synthesizing Cells (not required by the user)

In [2]:
from lineage.CellVar import CellVar as c
from lineage.CellVar import _double

In [3]:
T = np.array([[1.0, 0.0],
              [0.0, 1.0]])
    
parent_state = 1
parent_cell = c(state=parent_state, left=None, right=None, parent=None, gen=1)
left_cell, right_cell = parent_cell._divide(T)

In [4]:
print(left_cell, parent_cell.left)


 Generation: 2, State: 1, Observation: This cell has no observations to report. 
 Generation: 2, State: 1, Observation: This cell has no observations to report.


In [5]:
print(right_cell, parent_cell.right)


 Generation: 2, State: 1, Observation: This cell has no observations to report. 
 Generation: 2, State: 1, Observation: This cell has no observations to report.


## Creating a synthetic lineage (required by the user) "Two State Model"

In [6]:
from lineage.LineageTree import LineageTree
from lineage.StateDistribution import StateDistribution, get_experiment_time

### Creating a lineage and setting the full lineage (unpruned) as the one to be used

The required probabilities are those that define the tree and the state switching. The process first creates a hidden tree of empty cells, that is, cells whose states are set but which carry no observations. We then draw, from each state's distribution, as many observations as there are cells in that state and assign them to those cells. The $\pi$ and $T$ parameters are easy to define. Let $k$ be the number of states. $\pi$ is a list of $k$ probabilities that must sum to $1$, given either as a $1$-dimensional list or a $1$-dimensional numpy array. $T$ is a square numpy matrix of size $k\times k$ whose rows are the states we transition from and whose columns are the states we transition to. Each row of $T$ must sum to $1$; the columns need not.

In [7]:
# pi: the initial probability vector
pi = np.array([0.6, 0.4], dtype="float")

# T: transition probability matrix
T = np.array([[0.85, 0.15],
              [0.15, 0.85]])
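
The constraints on $\pi$ and $T$ described above can be checked before a lineage is built. The following is a minimal sketch; `check_markov_params` is a hypothetical helper written for this guide, not a function of the `lineage` package.

```python
import numpy as np

def check_markov_params(pi, T, tol=1e-8):
    """Hypothetical helper: validate pi and T and return the number of states k."""
    pi = np.asarray(pi, dtype=float)
    T = np.asarray(T, dtype=float)
    if pi.ndim != 1:
        raise ValueError("pi must be 1-dimensional")
    k = pi.shape[0]
    if T.shape != (k, k):
        raise ValueError("T must be a k x k matrix matching len(pi)")
    if np.any(pi < 0) or np.any(T < 0):
        raise ValueError("probabilities must be non-negative")
    if abs(pi.sum() - 1.0) > tol:
        raise ValueError("pi must sum to 1")
    # Rows (the "from" states) must sum to 1; columns are unconstrained
    if not np.allclose(T.sum(axis=1), 1.0, atol=tol):
        raise ValueError("each row of T must sum to 1")
    return k

k = check_markov_params([0.6, 0.4], [[0.85, 0.15], [0.15, 0.85]])
print(k)
```

A matrix whose rows do not sum to $1$, such as `[[0.9, 0.2], [0.15, 0.85]]`, raises a `ValueError` in this sketch.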

The emission matrix $E$ is a little more complicated to define because this is where the user has complete freedom in choosing the type of observation they care about. The user first decides what observations the cells in their synthetic lineages should carry. For example, someone observing kinematics or physics might want Gaussian-distributed observations. To define the random variables, the user draws from a Gaussian distribution using the mean and standard deviation chosen for each state. The Gaussian probability density can likewise define the likelihood. Furthermore, the user can build an analytical estimator for the state distributions that returns parameter estimates when given a list of observations. Finally, the user can define a prune rule: a boolean function that inspects a cell's observations and returns True if the cell's subtree (all of the cell's descendants) is to be pruned, or False if the cell is safe from pruning. In the Gaussian example, a user could remove a cell's subtree if its observation is above or below some fixed value.
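
To make the Gaussian example concrete, here is a rough sketch of what such a state distribution could look like. The class `GaussianState`, its method names, and the pruning bounds are all hypothetical and chosen for illustration; they are not the interface of `StateDistribution`.

```python
import numpy as np
import scipy.stats as sp

class GaussianState:
    """Hypothetical Gaussian emission for one state (illustration only)."""

    def __init__(self, state, mean, std):
        self.state = state
        self.mean = mean
        self.std = std

    def rvs(self, size, seed=None):
        # Draw `size` observations for the cells in this state
        return sp.norm.rvs(loc=self.mean, scale=self.std, size=size, random_state=seed)

    def pdf(self, obs):
        # Likelihood of an observation under this state's distribution
        return sp.norm.pdf(obs, loc=self.mean, scale=self.std)

    def estimator(self, list_of_obs):
        # Analytical maximum-likelihood estimates: sample mean and (biased) sample std
        obs = np.asarray(list_of_obs, dtype=float)
        return GaussianState(self.state, obs.mean(), obs.std())

    def prune_rule(self, obs, lower=-3.0, upper=3.0):
        # True means "prune this cell's subtree"; the bounds are arbitrary placeholders
        return bool(obs < lower or obs > upper)

true_state = GaussianState(state=0, mean=1.0, std=0.5)
samples = true_state.rvs(size=5000, seed=0)
fitted = true_state.estimator(samples)
print(fitted.mean, fitted.std)
```

With a few thousand samples the estimator should land close to the true mean and standard deviation, which is a quick sanity check that the sampler and estimator agree.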

As bioengineers, we have already built, as an example, a model that resembles cell lineage trees. In our synthetic model, the emissions are multivariate. The first emission is a Bernoulli observation, with $0$ implying death and $1$ implying division. The second and third emissions are continuous and are drawn from exponential and gamma distributions, respectively. Though these can be thought of as cell lifetimes or time spent in a certain cell phase, these values can really mean anything, and the user is completely free to choose what the emissions and their values mean. We define ways to draw random variables for these multivariate observations and to compute the likelihood of an observation. Keeping with the cell analogy, we also provide a prune rule: if a cell has a $0$ in its Bernoulli observation, its subtree is pruned from the full lineage tree. Though this will obviously introduce bias into estimation, we keep both the full tree and the pruned tree in the lineage objects in case a user would like to compare the effects of analyzing one versus the other.
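
The Bernoulli prune rule can be sketched on a toy tree. The layout below is an assumption made for illustration: cells are stored in a flat list in breadth-first order, so the daughters of cell `i` sit at `2*i + 1` and `2*i + 2`. `LineageTree` may store its cells differently, and `prune_mask` is a hypothetical name.

```python
def prune_mask(bern_obs):
    """Return keep[i] for cells stored breadth-first (hypothetical layout).

    A cell is kept only if its parent was kept and the parent divided
    (Bernoulli observation of 1). A cell that dies is itself kept; only
    its descendants are removed.
    """
    n = len(bern_obs)
    keep = [True] * n  # the root is always kept
    for i in range(1, n):
        parent = (i - 1) // 2
        keep[i] = keep[parent] and bern_obs[parent] == 1
    return keep

# Cell 1 dies, so its daughters (cells 3 and 4) are pruned
bern_obs = [1, 0, 1, 1, 1, 1, 1]
keep = prune_mask(bern_obs)
print(keep)
```

Because parents always come before their daughters in breadth-first order, a single forward pass is enough to propagate the pruning down the tree.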

Ultimately, $E$ is defined as a list of $k$ state distribution objects. These distribution objects are already rich in functionality, and a user can easily add more. They only need to be instantiated with the parameters that define that state's distribution.

In [8]:
# E: states are defined as StateDistribution objects

# State 0 parameters "Resistant"
state0 = 0
bern_p0 = 0.99
gamma_a0 = 5.0
gamma_scale0 = 1.0

# State 1 parameters "Susceptible"
state1 = 1
bern_p1 = 0.88
gamma_a1 = 10.0
gamma_scale1 = 2.0

state_obj0 = StateDistribution(state0, bern_p0, gamma_a0, gamma_scale0)
state_obj1 = StateDistribution(state1, bern_p1, gamma_a1, gamma_scale1)

E = [state_obj0, state_obj1]

The final required parameters are more straightforward. The first is the desired number of cells in the full, unpruned lineage tree, which can be any number. The second is a boolean that selects whether the pruned or the full tree is used. The lineage tree is built 'from left to right': we construct the binary tree by starting at the left-most cell of a generation, dividing it, and then walking through the rest of that generation. For example, the next cell requests $2^{10} - 1 = 1023$ cells, which is exactly enough to fill ten complete generations.

In [9]:
desired_num_cells = 2**10 - 1 
prune_boolean = False # To get the full tree

In [10]:
lineage1 = LineageTree(pi, T, E, desired_num_cells, prune_boolean)
print(lineage1)

This tree is NOT pruned. It is made of 2 states.
 For each state in this tree: 
 	 There are 441 cells of state 0, 
 	 There are 582 cells of state 1.
 This UNpruned tree has 1023 cells in total
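
The left-to-right construction described above can be sketched as a breadth-first walk that only tracks hidden states. This is an illustration under stated assumptions, not the `LineageTree` implementation: `build_hidden_states` is a hypothetical function, and cells are kept in a flat list in the order they are created.

```python
import numpy as np

def build_hidden_states(pi, T, desired_num_cells, seed=None):
    """Hypothetical sketch: draw hidden states generation by generation."""
    rng = np.random.default_rng(seed)
    pi = np.asarray(pi, dtype=float)
    T = np.asarray(T, dtype=float)
    k = len(pi)
    # The founder cell's state is drawn from the initial distribution pi
    states = [int(rng.choice(k, p=pi))]
    parent = 0
    while len(states) < desired_num_cells:
        # Each parent, visited left to right, produces two daughters whose
        # states are drawn from the parent's row of T
        for _ in range(2):
            if len(states) < desired_num_cells:
                states.append(int(rng.choice(k, p=T[states[parent]])))
        parent += 1
    return states

T = np.array([[0.85, 0.15],
              [0.15, 0.85]])
states = build_hidden_states([0.6, 0.4], T, 2**10 - 1, seed=0)
print(len(states), states.count(0), states.count(1))
```

With an identity transition matrix, every cell in this sketch inherits the founder's state, which mirrors the behavior of the `_divide` example at the start of this guide.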


### Obtaining how long the experiment ran by checking the time length of the longest branch

In [11]:
longest_branch_time = get_experiment_time(lineage1)
print(longest_branch_time)

245.83415267738206
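
The idea in the heading, taking the time length of the longest branch, can be sketched without the library. How `get_experiment_time` is actually implemented is not shown in this guide, so the function below is a hypothetical stand-in that assumes each cell has a single lifetime and that cells are stored breadth-first (the parent of cell `i` is `(i - 1) // 2`).

```python
def longest_branch_time(lifetimes):
    """Hypothetical sketch: largest summed lifetime along any root-to-cell path."""
    n = len(lifetimes)
    elapsed = [0.0] * n
    for i in range(n):
        # Time at which cell i ends = time at which its parent ended + own lifetime
        parent_elapsed = elapsed[(i - 1) // 2] if i > 0 else 0.0
        elapsed[i] = parent_elapsed + lifetimes[i]
    return max(elapsed)

# Three generations: the longest branches are 10 + 5 + 4 and 10 + 7 + 2
lifetimes = [10, 5, 7, 3, 4, 1, 2]
total_time = longest_branch_time(lifetimes)
print(total_time)
```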


### Estimation of distribution parameters using our estimators for full lineage

In [12]:
for state in range(lineage1.num_states):
    print("State {}:".format(state))
    print("                    estimated state", E[state].estimator(lineage1.lineage_stats[state].full_lin_cells_obs))
    print("original parameters given for state", E[state])

State 0:
these are           3.447425075618052 0.6783942236344626 1.1928152972725468
                    estimated state State object w/ parameters: 0.9909297052151969, 3.447425075618052, 1.1928152972725468.
original parameters given for state State object w/ parameters: 0.99, 5.0, 1.0.
State 1:
these are           8.154257382880573 2.0367040248088495 2.2390321810826057
                    estimated state State object w/ parameters: 0.912371134020477, 8.154257382880573, 2.2390321810826057.
original parameters given for state State object w/ parameters: 0.88, 10.0, 2.0.
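
The `these are` lines in the output above print three values per state, which look like the shape, location, and scale returned by a gamma fit. If that reading is right (it is an inference from the output, not something the guide confirms), the location is being fitted freely even though the data are generated with a location of zero, and that can pull the shape and scale estimates away from the true values. The sketch below compares a free fit with one that fixes the location through the `floc` argument of `scipy.stats.gamma.fit`, and estimates the Bernoulli parameter as a sample mean. The sample sizes and seeds are arbitrary.

```python
import scipy.stats as sp

# Synthetic data matching the state 0 parameters used above
gamma_obs = sp.gamma.rvs(a=5.0, scale=1.0, size=5000, random_state=0)
bern_obs = sp.bernoulli.rvs(p=0.99, size=5000, random_state=1)

# Free fit: shape, loc and scale are all estimated
a_free, loc_free, scale_free = sp.gamma.fit(gamma_obs)

# Constrained fit: loc is held at 0, so only shape and scale are estimated
a_fix, loc_fix, scale_fix = sp.gamma.fit(gamma_obs, floc=0)

# The maximum-likelihood estimate of a Bernoulli parameter is the sample mean
bern_p_hat = bern_obs.mean()

print("free fit :", a_free, loc_free, scale_free)
print("loc fixed:", a_fix, loc_fix, scale_fix)
print("bern_p   :", bern_p_hat)
```

Whether fixing the location is appropriate depends on the model; it is only sensible when the observations are known to start at zero, as they do in this synthetic setup.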


### Estimation of distribution parameters using our estimators for pruned lineage

In [13]:
for state in range(lineage1.num_states):
    print("State {}:".format(state))
    print("                    estimated state", E[state].estimator(lineage1.lineage_stats[state].pruned_lin_cells_obs))
    print("original parameters given for state", E[state])

State 0:
these are           3.31172443345495 0.7266983964349358 1.2137155080403241
                    estimated state State object w/ parameters: 0.9897959183670971, 3.31172443345495, 1.2137155080403241.
original parameters given for state State object w/ parameters: 0.99, 5.0, 1.0.
State 1:
these are           6.94939112884267 3.5281271791650024 2.4039300504895253
                    estimated state State object w/ parameters: 0.910913140311621, 6.94939112884267, 2.4039300504895253.
original parameters given for state State object w/ parameters: 0.88, 10.0, 2.0.


### Analyzing our first full lineage

In [14]:
from lineage.Analyze import Analyze, accuracyG
import copy as cp
X = [lineage1] # population just contains one lineage
states = [cell.state for cell in lineage1.output_lineage]
print(states)
# deltas, state_ptrs, all_states, tHMMobj, NF, LL = Analyze(X, 2) # find two states


[1, 0, 1, 0, 0, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 1, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 0, 1, 1, 0, 1, 0, 1, 1, 0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 0, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 0, 1, 1, 1, 0, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 0, 1, 1, 0, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 0, 1, 0, 0, 0, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 

In [15]:
# tHMMobj.estimate.E

In [16]:
def accuracy_for_lineages(tHMMobj, all_states):
    # Two-state models only. Estimated state labels are arbitrary, so decide once
    # (not once per lineage) whether they are swapped, by comparing each estimated
    # state's emission parameters against the true parameters of state 0.
    true_E0 = tHMMobj.X[0].E[0]
    est_E = tHMMobj.estimate.E
    num_states = tHMMobj.X[0].num_states

    bern_diff = np.zeros(num_states)
    gamma_a_diff = np.zeros(num_states)
    gamma_scale_diff = np.zeros(num_states)
    for state in range(num_states):
        bern_diff[state] = abs(est_E[state].bern_p - true_E0.bern_p)
        gamma_a_diff[state] = abs(est_E[state].gamma_a - true_E0.gamma_a)
        gamma_scale_diff[state] = abs(est_E[state].gamma_scale - true_E0.gamma_scale)

    # normalize so the three parameters contribute on a comparable scale
    total_errs = (bern_diff / sum(bern_diff)
                  + gamma_a_diff / sum(gamma_a_diff)
                  + gamma_scale_diff / sum(gamma_scale_diff))

    switched = total_errs[0] > total_errs[1]
    if switched:
        print('SWITCHING!')
        # swap the estimated emissions so that index 0 matches true state 0
        est_E[0], est_E[1] = est_E[1], est_E[0]

    accuracy = []
    for num, lineageObj in enumerate(tHMMobj.X):
        lin_true_states = [cell.state for cell in lineageObj.output_lineage]
        # flip the estimated labels of every lineage when the states are swapped
        new_all_states = [1 - x for x in all_states[num]] if switched else all_states[num]
        counter = [1 if a==b else 0 for (a,b) in zip(new_all_states,lin_true_states)]
        accuracy.append(sum(counter)/len(lin_true_states))

    return accuracy
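
The function above resolves the label ambiguity by comparing emission parameters, and it only handles two states. When the true states are known, as they are for synthetic lineages, another option is to try every relabeling of the estimated states and keep the best one. The sketch below is a hypothetical alternative, not part of the `lineage` package; it is practical only for a small number of states because the number of permutations grows factorially.

```python
from itertools import permutations

def best_permutation_accuracy(true_states, est_states, num_states):
    """Hypothetical sketch: accuracy under the best relabeling of estimated states."""
    best_acc, best_perm = -1.0, None
    for perm in permutations(range(num_states)):
        # perm[s] is the true label assigned to estimated label s
        relabeled = [perm[s] for s in est_states]
        acc = sum(t == r for t, r in zip(true_states, relabeled)) / len(true_states)
        if acc > best_acc:
            best_acc, best_perm = acc, perm
    return best_acc, best_perm

true_states = [0, 0, 1, 1, 1, 0]
est_states = [1, 1, 0, 0, 1, 1]  # mostly the true states with labels swapped
acc, perm = best_permutation_accuracy(true_states, est_states, 2)
print(acc, perm)
```

In this toy example the swapped labeling matches five of six cells, while the labeling as given matches only one.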

In [17]:
import copy as cp
# pi: the initial probability vector
pi = np.array([0.5, 0.5], dtype="float")

# T: transition probability matrix
T = np.array([[0.99, 0.01],
              [0.15, 0.85]], dtype='float')

# State 0 parameters "Resistant"
state0 = 0
bern_p0 = 0.99
gamma_a0 = 20
gamma_scale0 = 5

# State 1 parameters "Susceptible"
state1 = 1
bern_p1 = 0.8
gamma_a1 = 10
gamma_scale1 = 1

state_obj0 = StateDistribution(state0, bern_p0, gamma_a0, gamma_scale0)
state_obj1 = StateDistribution(state1, bern_p1, gamma_a1, gamma_scale1)

E = [state_obj0, state_obj1]

desired_num_cells = 2**6 - 1
# increasing the number of lineages from 1 to 3 and calculating accuracy and parameter estimates for both pruned and unpruned lineages.
num_lineages = list(range(1, 4))

accuracies_unpruned = []
accuracies_pruned = []
bern_unpruned = []
gamma_a_unpruned = []
gamma_scale_unpruned = []
bern_pruned = []
gamma_a_pruned = []
gamma_scale_pruned = []

prunedNewAcc = []
unprunedNewAcc = []
X_p = []
X_unp = []
for num in num_lineages:
    lineage_unpruned = LineageTree(pi, T, E, desired_num_cells, prune_boolean=False)
    lineage_pruned = cp.deepcopy(lineage_unpruned)
    lineage_pruned.prune_boolean = True

    X_unp.append(lineage_unpruned)
    X_p.append(lineage_pruned)
    deltas, state_ptrs, all_states, tHMMobj, NF, LL = Analyze(X_unp, 2)
    deltas2, state_ptrs2, all_states2, tHMMobj2, NF2, LL2 = Analyze(X_p, 2)
    acc1 = accuracy_for_lineages(tHMMobj, all_states)
    acc2 = accuracy_for_lineages(tHMMobj2, all_states2)
    accuracies_unpruned.append(acc1)
    accuracies_pruned.append(acc2)

    bern_p_total = ()
    gamma_a_total = ()
    gamma_scale_total = ()
    bern_p_total2 = ()
    gamma_a_total2 = ()
    gamma_scale_total2 = ()

    for state in range(tHMMobj.numStates):
        bern_p_total += (tHMMobj.estimate.E[state].bern_p,)
        gamma_a_total += (tHMMobj.estimate.E[state].gamma_a,)
        gamma_scale_total += (tHMMobj.estimate.E[state].gamma_scale,)

        bern_p_total2 += (tHMMobj2.estimate.E[state].bern_p,)
        gamma_a_total2 += (tHMMobj2.estimate.E[state].gamma_a,)
        gamma_scale_total2 += (tHMMobj2.estimate.E[state].gamma_scale,)

    bern_unpruned.append(bern_p_total)
    gamma_a_unpruned.append(gamma_a_total)
    gamma_scale_unpruned.append(gamma_scale_total)
    bern_pruned.append(bern_p_total2)
    gamma_a_pruned.append(gamma_a_total2)
    gamma_scale_pruned.append(gamma_scale_total2)

print("unpruned accuracies per run:", accuracies_unpruned)
for i in range(len(accuracies_unpruned)):
    # run i analyzed i + 1 lineages, so average over that many accuracies
    unprunedNewAcc.append(sum(accuracies_unpruned[i])/(i+1))
    prunedNewAcc.append(sum(accuracies_pruned[i])/(i+1))
print("mean unpruned accuracy per run:", unprunedNewAcc)

these are           308.9431058398511 -332.02626188029546 1.386334850494482


  mu = data.mean()
  ret = ret.dtype.type(ret / rcount)
  m2 = ((data - mu)**2).mean()
  m3 = ((data - mu)**3).mean()
  muhat = tmp.mean()
  mu2hat = tmp.var()
  arrmean, rcount, out=arrmean, casting='unsafe', subok=False)
  ret = ret.dtype.type(ret / rcount)


ValueError: zero-size array to reduction operation minimum which has no identity

In [None]:
accuracy_for_lineages(tHMMobj, all_states)

In [None]:
import matplotlib.pyplot as plt
for state in range(lineage1.num_states):
    b, g1, g2 = list(zip(*lineage1.lineage_stats[state].pruned_lin_cells_obs))
    plt.hist(g2)
    plt.show()


In [None]:
for state in range(lineage1.num_states):
    # unpack the three observation components and histogram the second one
    b, g1, g2 = list(zip(*lineage1.lineage_stats[state].full_lin_cells_obs))
    plt.hist(g1, bins=30)

### Estimated Markov parameters ($\pi$, $T$, $E$)

In [None]:
print(tHMMobj.estimate.pi)

In [None]:
print(tHMMobj.estimate.T)

In [None]:
for state in range(tHMMobj.numStates):
    print(tHMMobj.estimate.E[state])

## Trying another lineage, this time pruning branches with ancestors that die

In [None]:
desired_num_cells = 2**12 -1 
prune_boolean = True # To get pruned tree

In [None]:
lineage2 = LineageTree(pi, T, E, desired_num_cells, prune_boolean)
print(lineage2)

In [None]:
longest2 = get_experiment_time(lineage2)
print(longest2)

### Estimation of distribution parameters using our estimators for pruned lineage

In [None]:
for state in range(lineage2.num_states):
    print("State {}:".format(state))
    print("                    estimated state", E[state].estimator(lineage2.lineage_stats[state].pruned_lin_cells_obs))
    print("original parameters given for state", E[state])

### Analyzing a population of lineages

In [None]:
X = [lineage1, lineage2] # population contains two lineages

#deltas, state_ptrs, all_states, tHMMobj, NF, LL = Analyze(X, 2) # find two states

In [None]:
for num, lineageObj in enumerate(X):
    lin_estimated_states = all_states[num]
    lin_true_states = [cell.state for cell in lineageObj.output_lineage]
    total = len(lin_estimated_states)
    assert total == len(lin_true_states)
    counter = [1 if a==b else 0 for (a,b) in zip(lin_estimated_states,lin_true_states)]
    print("Accuracy or 1-Accuracy is {}".format(sum(counter)/total))

### Estimated Markov parameters ($\pi$, $T$, $E$)

In [None]:
print(tHMMobj.estimate.pi)

In [None]:
print(tHMMobj.estimate.T)

In [None]:
for state in range(tHMMobj.numStates):
    print(tHMMobj.estimate.E[state])

## Creating a synthetic lineage that has three states

Here we generate a lineage with three states: 1) susceptible, 2) a middle state, and 3) resistant. The aim is to show that the transition from the susceptible to the resistant state does not happen immediately; instead, there is a gradual transition, which is modeled as a middle state. Note that a direct transition from state 1 to state 3, or vice versa, is not possible, so the probabilities of these transitions are zero, and the initial cells are most likely in the susceptible state.

**State 1**: Susceptible

**State 2**: Transition state

**State 3**: Resistant state


In [None]:
# pi: the initial probability vector
pi_3 = np.array([0.5, 0.25, 0.25])

# T: transition probability matrix
T_3 = np.array([[0.65, 0.35, 0.00],
                [0.20, 0.40, 0.40],
                [0.00, 0.10, 0.90]])
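
As a small worked check of the claim that resistance cannot be reached in a single step, the entry of $T^n$ in row 0, column 2 gives the probability that a descendant $n$ generations after a susceptible cell is resistant. For two generations this is $0.35 \times 0.40 = 0.14$, because the only path passes through the middle state. The variable names below are illustrative.

```python
import numpy as np

T_3 = np.array([[0.65, 0.35, 0.00],
                [0.20, 0.40, 0.40],
                [0.00, 0.10, 0.90]])

# Probability of going from susceptible (index 0) to resistant (index 2)
one_gen = T_3[0, 2]                                # direct transition
two_gen = np.linalg.matrix_power(T_3, 2)[0, 2]     # via the middle state

print(one_gen, two_gen)
```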

In [None]:
# E: states are defined as StateDistribution objects

# State 0 parameters "Susceptible"
state0 = 0
bern_p0 = 0.7
expon_scale_beta0 = 20
gamma_a0 = 5.0
gamma_scale0 = 1.0

# State 1 parameters "Middle state"
state1 = 1
bern_p1 = 0.85
expon_scale_beta1 = 60
gamma_a1 = 10.0
gamma_scale1 = 2.0

# State 2 parameters "Resistant"
state2 = 2
bern_p2 = 0.99
expon_scale_beta2 = 80
gamma_a2 = 15.0
gamma_scale2 = 3.0

state_obj0 = StateDistribution(state0, bern_p0, gamma_a0, gamma_scale0)
state_obj1 = StateDistribution(state1, bern_p1, gamma_a1, gamma_scale1)
state_obj2 = StateDistribution(state2, bern_p2, gamma_a2, gamma_scale2)

E_3 = [state_obj0, state_obj1, state_obj2]

In [None]:
desired_num_cells = 2**13 - 1 
prune_boolean = False # To get the full tree

In [None]:
lineage3 = LineageTree(pi_3, T_3, E_3, desired_num_cells, prune_boolean)
print(lineage3)

In [None]:
longest3 = get_experiment_time(lineage3)
print(longest3)

### Estimation of distribution parameters using our estimators for full lineage (3 state)

In [None]:
for state in range(lineage3.num_states):
    print("State {}:".format(state))
    print("estimated state (full)  ", E_3[state].estimator(lineage3.lineage_stats[state].full_lin_cells_obs))
    print("estimated state (pruned)", E_3[state].estimator(lineage3.lineage_stats[state].pruned_lin_cells_obs))
    print("true state              ", E_3[state])

### Analyzing a three state lineage

In [None]:
X = [lineage3] # population just contains one lineage

#deltas, state_ptrs, all_states, tHMMobj, NF, LL = Analyze(X, 3) # find three states

In [None]:
for num, lineageObj in enumerate(X):
    lin_estimated_states = all_states[num]
    lin_true_states = [cell.state for cell in lineageObj.output_lineage]
    total = len(lin_estimated_states)
    assert total == len(lin_true_states)
    counter = [1 if a==b else 0 for (a,b) in zip(lin_estimated_states,lin_true_states)]
    print("Accuracy {}".format(sum(counter)/total))

### Estimated Markov parameters ($\pi$, $T$, $E$)

In [None]:
print(tHMMobj.estimate.pi)

In [None]:
print(tHMMobj.estimate.T)

In [None]:
for state in range(tHMMobj.numStates):
    print(tHMMobj.estimate.E[state])

In [None]:
import matplotlib.pyplot as plt
%matplotlib inline

In [None]:
A = sp.gamma.rvs(a=20,scale=4,size=2)
B = sp.gamma.rvs(a=5,scale=1,size=1000)
plt.hist(A)
plt.figure()
plt.hist(B)

In [None]:
import scipy.stats as sp
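# Gamma likelihood of the samples in A under the parameters used to draw them.
# (The earlier `a, b` come from sp.gamma.fit, which returns shape, loc, scale, so `b` is a loc, not a scale.)
gamma_ll = sp.gamma.pdf(x=A, a=20, scale=4)  # gamma likelihood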
print(gamma_ll)