# __tHMM__
#### A tree-hidden Markov model for analyzing cell lineages. 
#### Authors: Shakthi Visagan, Farnaz Mohammadi, Nikan Namiri, Adam Wiener, Ali Farhat, Alex Lim, JC Lagarde, and Aaron Meyer, PhD

In [1]:
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

### Synthesizing Cells (not required by the user)

In [2]:
from lineage.CellVar import CellVar as c
from lineage.CellVar import _double

Users who plan to create their own lineages will need to be comfortable building transition matrices with `numpy`. We provide a method for creating synthetic lineages to test the tHMM model, which is explained further on. Synthetic lineage creation uses user-supplied Markov parameters, such as the transition matrix shown below. Users are not required to know how to create individual cells, but it helps to understand how the `CellVar` class stores its `state`, its relationships (`left`, `right`, `parent`), and its multivariate observations (`obs`).
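As a rough illustration of what a transition matrix encodes, the sketch below samples daughter states from the row of a matrix indexed by the parent's state. It uses only `numpy`. The helper `sample_daughter_states` is hypothetical and written for this tutorial; it is not the `_divide` method of `CellVar`, and it assumes the two daughters are drawn independently.

```python
import numpy as np

def sample_daughter_states(T, parent_state, rng, num_divisions=1):
    """Draw daughter states for `num_divisions` divisions of a parent.

    Row `parent_state` of T holds the probabilities of moving from that
    state to each state. Assumption: both daughters are drawn
    independently from the same row.
    """
    T = np.asarray(T, dtype=float)
    num_states = T.shape[0]
    # One row per division, one column per daughter (left, right).
    return rng.choice(num_states, size=(num_divisions, 2), p=T[parent_state])

rng = np.random.default_rng(0)
T_example = np.array([[0.9, 0.1],
                      [0.2, 0.8]])

daughters = sample_daughter_states(T_example, parent_state=0, rng=rng,
                                   num_divisions=1000)
# Fraction of daughters that stayed in state 0; should be near T_example[0, 0].
fraction_state0 = float(np.mean(daughters == 0))
print("fraction of daughters in state 0:", fraction_state0)
```

With many divisions, the observed fraction approaches the corresponding entry of the parent's row.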

In [3]:
T = np.array([[1.0, 0.0],
              [0.0, 1.0]])
    
parent_state = 1
parent_cell = c(state=parent_state, left=None, right=None, parent=None, gen=1)
left_cell, right_cell = parent_cell._divide(T)

QUESTION 1: The transition matrix above is the two-dimensional Identity matrix. What does this imply about the transitions between cells that follow this transition process? Are there any transitions? Show that you're correct by printing out the states of all the cells involved. Write your answer and code below.

QUESTION 2: The `gen` argument for instantiating cells represents the generation of the cell. Are generations in the tHMM / lineage-growth codebase `0`-indexed or `1`-indexed? (Do the generations of cell lineages start at `0` or `1`?) Write your answer below.

QUESTION 3: In the previous code block, we created a 3-cell lineage with 2 generations. The first generation has one cell, which was declared and can be accessed at `parent_cell`. Calling the member function `_divide` on `parent_cell` created two new cells, which can be accessed at `left_cell` and `right_cell`. The daughter cells of any cell can also ALWAYS be accessed with "dot" notation through the member variables `left` and `right`. Note that the division process uses the transition matrix. Our code provides some very basic methods for printing cells. Verify that the objects stored in the `left_cell` and `right_cell` variables are the same as the objects referenced by `parent_cell.left` and `parent_cell.right` by printing out these variables.

QUESTION 4: Check that `left_cell.parent` and `right_cell.parent` are equivalent to `parent_cell` by printing the cells out, just as you did in QUESTION 3.

---

## Creating a synthetic lineage (required by the user) "Heterogeneous Two-State Model"

In [4]:
from lineage.LineageTree import LineageTree
from lineage.StateDistribution import StateDistribution

### Creating an unpruned two-state lineage

#### Defining the $\pi$ initial probability vector and $T$ transition probability matrix

The required probabilities are those that define the tree and the act of state switching. Synthesis works by first creating a hidden tree of empty cells, that is, cells whose states are set but which have no observations attached. We then draw, from each state's distribution, as many observations as there are cells in that state and assign those observations to the cells. The $\pi$ and $T$ parameters are easy to define. Let $k$ be the number of states. $\pi$ must be a length-$k$ vector of probabilities that sum to $1$, given as either a $1$-dimensional list or a $1$-dimensional numpy array. $T$ must be a square numpy matrix of size $k\times k$, where the rows are the states we transition from and the columns are the states we transition to. Each row of $T$ must sum to $1$; the columns need not. This follows the convention used by Wikipedia.

In [5]:
# pi: the initial probability vector
pi = np.array([0.6, 0.4], dtype="float")

# T: transition probability matrix
T = np.array([[0.75, 0.25],
              [0.15, 0.85]], dtype="float")
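
The constraints on $\pi$ and $T$ described above can be checked before building a lineage. The following is a minimal sketch using only `numpy`; `check_markov_parameters` is a hypothetical helper for illustration and is not part of the `lineage` package.

```python
import numpy as np

def check_markov_parameters(pi, T, tol=1e-8):
    """Validate pi (length k) and T (k x k, row-stochastic); return k."""
    pi = np.asarray(pi, dtype=float)
    T = np.asarray(T, dtype=float)
    if pi.ndim != 1:
        raise ValueError("pi must be one-dimensional")
    k = pi.shape[0]
    if T.shape != (k, k):
        raise ValueError("T must be a square k x k matrix")
    if np.any(pi < 0) or np.any(T < 0):
        raise ValueError("probabilities must be non-negative")
    if abs(pi.sum() - 1.0) > tol:
        raise ValueError("pi must sum to 1")
    # Rows (the 'from' states) must each sum to 1; columns need not.
    if not np.allclose(T.sum(axis=1), 1.0, atol=tol):
        raise ValueError("each row of T must sum to 1")
    return k

pi_example = np.array([0.6, 0.4])
T_example = np.array([[0.75, 0.25],
                      [0.15, 0.85]])
num_states = check_markov_parameters(pi_example, T_example)
print("number of states:", num_states)
```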

#### Defining the $E$ emissions matrix using state distributions

The emission matrix $E$ is a little more complicated to define because this is where the user has complete freedom in deciding what type of observation they care about. In particular, the user must begin by defining the physical observation they want to extract from images of their cells, or to test on synthetically created lineages. For example, someone observing kinematics or physics might use a Gaussian distribution, parameterized by a mean and covariance, to model their observations (velocity, acceleration, etc.).

Ultimately, the user needs to provide three things based on the phenotype they wish to observe, model, and predict:

1. a probability distribution function: a function that returns a likelihood when given a sample and parameters describing the distribution
2. a random variable: a function that returns samples from the distribution when given parameters describing the distribution
3. an estimator: a function that returns parameters that describe a distribution when given samples
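
To make the three ingredients concrete, here is a minimal sketch for a univariate Gaussian observation, using only `numpy` and the standard library. The function names are hypothetical and chosen for this example; they do not reflect the interface of the `StateDistribution` class.

```python
import math
import numpy as np

def gaussian_pdf(x, mean, std):
    """1. Probability density: likelihood of sample x given (mean, std)."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))

def gaussian_rvs(mean, std, size, rng):
    """2. Random variable: draw `size` samples given (mean, std)."""
    return rng.normal(loc=mean, scale=std, size=size)

def gaussian_estimator(samples):
    """3. Estimator: maximum likelihood (mean, std) given samples."""
    samples = np.asarray(samples, dtype=float)
    # The Gaussian MLE of the standard deviation divides by n (ddof=0).
    return float(samples.mean()), float(samples.std())

rng = np.random.default_rng(0)
samples = gaussian_rvs(mean=2.0, std=0.5, size=10000, rng=rng)
mean_hat, std_hat = gaussian_estimator(samples)
print("estimated mean and std:", mean_hat, std_hat)
print("density at the true mean:", gaussian_pdf(2.0, 2.0, 0.5))
```

Sampling with the random variable and then recovering the parameters with the estimator is a quick consistency check for any trio of functions like this.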

An optional boolean function can be provided to "prune" cells based on the observation. In our example, cells with a Bernoulli observation of $0$, which implies that the cell died, are pruned from the tree. Another prune rule we've implemented is removing cells that were born after an experimental time.
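
A fate-based prune rule can be sketched on a toy tree. The example below stores a complete binary tree in heap order (the children of cell `i` are `2*i + 1` and `2*i + 2`) and keeps a cell only if every ancestor divided. Both the layout and the `prune_by_fate` helper are assumptions made for illustration; in particular, this sketch keeps the dead cell itself and removes only its descendants, which may differ from how the `lineage` package prunes.

```python
def prune_by_fate(fates):
    """Return indices of cells that survive fate-based pruning.

    fates[i] is 1 if cell i divided and 0 if it died. Cells are stored
    in heap order, so the parent of cell i (i > 0) is (i - 1) // 2.
    """
    reachable = [False] * len(fates)
    kept = []
    for i in range(len(fates)):
        if i == 0:
            reachable[i] = True  # the root is always observed
        else:
            parent = (i - 1) // 2
            # A cell exists only if its parent exists and divided.
            reachable[i] = reachable[parent] and fates[parent] == 1
        if reachable[i]:
            kept.append(i)
    return kept

# Cell 1 dies, so its daughters (cells 3 and 4) are removed.
fates = [1, 0, 1, 1, 1, 1, 0]
kept = prune_by_fate(fates)
print(kept)  # [0, 1, 2, 5, 6]
```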

As bioengineers interested in studying cancer cell heterogeneity, we have already built, as an example, a model that resembles lineage trees of cancer cells. In our synthetic model, the emissions are multivariate. The first emission is a Bernoulli observation, with $0$ implying death and $1$ implying division. The second emission is continuous and gamma distributed. Though these values can be thought of as cell lifetimes or periods in a certain cell phase, they can really mean anything, and the user is completely free to choose what the emissions and their values mean. As mentioned above, we provide a probability distribution function that takes as input multivariate samples, a Bernoulli rate parameter, and three parameters that define the gamma distribution, and returns a likelihood. We also define a random variable that takes a Bernoulli parameter and three gamma parameters and returns multivariate samples, as well as estimators for these observations. Finally, we define a prune rule, as explained previously.
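
The Bernoulli and gamma emission pair described above can be sketched as follows. This is not the code inside `StateDistribution`; the function names are hypothetical, and the sketch assumes the fate and the lifetime are independent given the state, so that the joint likelihood is the product of the two terms. The gamma density uses the (`a`, `loc`, `scale`) parameterization.

```python
import math
import numpy as np

def bernoulli_pmf(fate, p):
    """Probability of fate 1 (division) or 0 (death) given rate p."""
    return p if fate == 1 else 1.0 - p

def gamma_pdf(x, a, loc, scale):
    """Gamma density with shape a, shifted by loc and scaled by scale."""
    y = (x - loc) / scale
    if y <= 0:
        return 0.0  # no density below loc; the boundary is treated as zero
    log_pdf = (a - 1.0) * math.log(y) - y - math.lgamma(a) - math.log(scale)
    return math.exp(log_pdf)

def joint_likelihood(obs, p, a, loc, scale):
    """Likelihood of one (fate, lifetime) pair, assuming independence."""
    fate, lifetime = obs
    return bernoulli_pmf(fate, p) * gamma_pdf(lifetime, a, loc, scale)

def joint_rvs(p, a, loc, scale, size, rng):
    """Draw `size` (fate, lifetime) pairs."""
    fates = rng.binomial(1, p, size=size)
    lifetimes = loc + rng.gamma(shape=a, scale=scale, size=size)
    return list(zip(fates.tolist(), lifetimes.tolist()))

rng = np.random.default_rng(0)
samples = joint_rvs(p=0.99, a=20, loc=0, scale=5, size=5, rng=rng)
for obs in samples:
    print(obs, joint_likelihood(obs, p=0.99, a=20, loc=0, scale=5))
```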


Ultimately, $E$ is defined as a list of $k$ state distribution objects. These distribution objects are rich in what they can already do, and a user can easily add to their functionality. They only need to be instantiated with the parameters that define that state's distribution.

In [6]:
# E: states are defined as StateDistribution objects

# State 0 parameters "Resistant"
state0 = 0
bern_p0 = 0.99
gamma_a0 = 20
gamma_loc = 0
gamma_scale0 = 5

# State 1 parameters "Susceptible"
state1 = 1
bern_p1 = 0.88
gamma_a1 = 10
gamma_scale1 = 1

state_obj0 = StateDistribution(state0, bern_p0, gamma_a0, gamma_loc, gamma_scale0)
state_obj1 = StateDistribution(state1, bern_p1, gamma_a1, gamma_loc, gamma_scale1)

E = [state_obj0, state_obj1]

The final required parameters are more obvious. The first is the desired number of cells one would like in their full unpruned lineage tree. This can be any number. Since one of our observations is time-based, we can also add a prune condition based on time as well. Ultimately, these design choices are left up to the user to customize based on their state distribution type. Without loss of generality, we provide the following example of an 'unpruned' lineage tree.

In [7]:
desired_num_cells = 2**12 - 1 
desired_experiment_time = 300
prune_condition = 'fate'
prune_boolean = False # To get the full tree
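
The value `2**12 - 1` used above is the number of cells in a complete binary tree with 12 generations, since generation $g$ (counting from $0$ here purely for the arithmetic) holds $2^g$ cells. The short check below confirms the total; whether the package actually fills the tree generation by generation is not something this sketch assumes.

```python
num_generations = 12
# A complete binary tree doubles its cell count with every generation.
cells_per_generation = [2 ** g for g in range(num_generations)]
total_cells = sum(cells_per_generation)
print(total_cells)  # 4095
```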

In [8]:
lineage1 = LineageTree(pi, T, E, desired_num_cells, desired_experiment_time, prune_condition, prune_boolean)
print(lineage1)
print("\n")

This tree is NOT pruned. It is made of 2 states.
 For each state in this tree: 
 	 There are 1472 cells of state 0, 
 	 There are 2623 cells of state 1.
 This UNpruned tree has 4095 many cells in total




### Estimation of distribution parameters using our estimators for full (unpruned) lineage

We can estimate the parameters of the state distributions that make up our cells by running the estimator built into our state distribution objects. Recall that these are stored in the Emissions list of state distribution objects `E`. Calling the estimator member function (using dot notation again) on a set of tuples that represent the observations held in each cell can give us the Maximum Likelihood Estimate (the best frequentist estimate) of the parameters that describe the distributions that the cells were originally sampled from. This is a good way of pre-checking if our model can analyze your data using synthetic lineages, before you begin running wet lab experiments to collect actual observations. This is also a good way to check that everything is working internally, apart from running tests.
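
The idea of recovering parameters from samples can be tried without the `lineage` package. In the sketch below, the Bernoulli rate is estimated by its maximum likelihood estimate (the sample mean). For the gamma parameters, the sketch swaps in the simpler method of moments rather than maximum likelihood and fixes `loc` at $0$; the function names are hypothetical.

```python
import numpy as np

def estimate_bernoulli(fates):
    """MLE of the Bernoulli rate: the fraction of cells that divided."""
    return float(np.mean(fates))

def estimate_gamma_moments(lifetimes):
    """Method-of-moments estimate of gamma (shape, scale), with loc = 0.

    For a gamma distribution, mean = a * scale and var = a * scale**2.
    """
    lifetimes = np.asarray(lifetimes, dtype=float)
    mean = lifetimes.mean()
    var = lifetimes.var()
    return float(mean ** 2 / var), float(var / mean)

rng = np.random.default_rng(1)
fates = rng.binomial(1, 0.99, size=5000)
lifetimes = rng.gamma(shape=20, scale=5, size=5000)

p_hat = estimate_bernoulli(fates)
a_hat, scale_hat = estimate_gamma_moments(lifetimes)
print("estimated Bernoulli rate:", p_hat)
print("estimated gamma shape and scale:", a_hat, scale_hat)
```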

In [9]:
for state in range(lineage1.num_states):
    print("State {}:".format(state))
    print("                    estimated state:", E[state].estimator(lineage1.lineage_stats[state].full_lin_cells_obs))
    print("original parameters given for state:", E[state])
    print("\n")

State 0:
                    estimated state: State object w/ parameters: 0.9891304347825423, 20.601233159977298, 0, 4.848518754422664.
original parameters given for state: State object w/ parameters: 0.99, 20, 0, 5.


State 1:
                    estimated state: State object w/ parameters: 0.8875333587494939, 10, 0, 1.
original parameters given for state: State object w/ parameters: 0.88, 10, 0, 1.




### Estimation of distribution parameters using our estimators for pruned lineage

The same estimation works on both the pruned and unpruned trees, since the estimator only requires a set of observations. Note that for the pruned lineage, the estimated parameters of the observation distributions are worse than they are for the full unpruned lineage. We believe that this happens because:
1. Pruning a lineage tree biases estimators toward cells that lived earlier in a lineage rather than later
2. Cutting off trees based on experimental time creates cells that died at times unrepresentative of their original lifetime distributions
3. Cells that live for shorter times create more samples while cells that live longer create fewer samples, so pruning the tree can hurt estimation of the distributions of longer-living cells
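
The second point can be illustrated with a small simulation that is independent of the `lineage` package. Lifetimes are drawn from a gamma distribution and then truncated at the end of the experiment. The birth times are drawn uniformly over the experiment, which is a simplifying assumption made only for this sketch and is not how births arise in a real lineage.

```python
import numpy as np

rng = np.random.default_rng(2)
experiment_time = 300.0
num_cells = 5000

# True lifetimes, with mean a * scale = 100.
true_lifetimes = rng.gamma(shape=20, scale=5, size=num_cells)
# Assumption for illustration: births are spread uniformly over the experiment.
birth_times = rng.uniform(0.0, experiment_time, size=num_cells)

# A cell can only be observed until the experiment ends.
time_left = experiment_time - birth_times
observed_lifetimes = np.minimum(true_lifetimes, time_left)
censored_fraction = float(np.mean(true_lifetimes > time_left))

print("mean true lifetime:    ", true_lifetimes.mean())
print("mean observed lifetime:", observed_lifetimes.mean())
print("fraction cut short:    ", censored_fraction)
```

Because every observed lifetime is at most the true one, an estimator fed the truncated values sees systematically shorter lifetimes.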


In [10]:
for state in range(lineage1.num_states):
    print("State {}:".format(state))
    print("                    estimated state:", E[state].estimator(lineage1.lineage_stats[state].pruned_lin_cells_obs))
    print("original parameters given for state:", E[state])
    print("\n")

State 0:
                    estimated state: State object w/ parameters: 0.9858974358973114, 21.214125975224338, 0, 4.679463010085723.
original parameters given for state: State object w/ parameters: 0.99, 20, 0, 5.


State 1:
                    estimated state: State object w/ parameters: 0.88708297690327, 10, 0, 1.
original parameters given for state: State object w/ parameters: 0.88, 10, 0, 1.




### Analyzing our first full lineage

Our project's goal is to analyze heterogeneity. We packaged our entire codebase's capabilities into one function, `Analyze`, which runs the tree-hidden Markov model on an appropriately formatted dataset. In the following example, we analyze the unpruned lineage from above.

In [11]:
from lineage.Analyze import Analyze

X = [lineage1] # population just contains one lineage
deltas, state_ptrs, all_states, tHMMobj, NF, LL, accuracies = Analyze(X, 2) # find two states

### Estimated Markov parameters ($\pi$, $T$, $E$)

Let's see how well our model estimated the parameters that created this lineage. Recall that the model is BLIND to the true states of the cells, unlike the code blocks above, where we knew each cell's state. The model has to partition the tree and its cells into the number of states we think is present in our data, and then identify the parameters that describe each state's distributions. We can check not only how well it estimated the state parameters, but also the initial probability vector $\pi$ and the transition matrix $T$. Note that these estimates generally improve as more cells and more lineages are added; the estimate of $\pi$ in particular benefits from more lineages.

In [12]:
print(tHMMobj.estimate.pi)

[0. 1.]
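
A one-hot estimate of $\pi$ is to be expected from a single lineage, because $\pi$ describes only the state of the first cell and one lineage contributes one first cell. The sketch below shows the effect with simple counting. It assumes the true root states are known, which the fitting procedure does not have access to, so it illustrates the sample-size effect rather than the tHMM estimation itself; `estimate_pi` is a hypothetical helper.

```python
import numpy as np

def estimate_pi(root_states, num_states):
    """Estimate pi as the fraction of lineages whose root is in each state."""
    counts = np.bincount(root_states, minlength=num_states)
    return counts / counts.sum()

rng = np.random.default_rng(3)
true_pi = np.array([0.6, 0.4])

one_root = rng.choice(2, size=1, p=true_pi)       # a single lineage
many_roots = rng.choice(2, size=2000, p=true_pi)  # a large population

pi_one = estimate_pi(one_root, 2)
pi_many = estimate_pi(many_roots, 2)
print("one lineage:   ", pi_one)
print("2000 lineages: ", pi_many)
```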


In [13]:
print(tHMMobj.estimate.T)

[[0.86513865 0.25166887]
 [0.13486135 0.74833113]]
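
The printed estimate can be checked against the row-sum convention stated earlier. In the sketch below, the values are copied from the output above. The columns, rather than the rows, sum to $1$, so the estimate may need to be transposed before it is compared with the `T` used for simulation. This is an observation about the printed numbers only, not a claim about how the package stores its estimates.

```python
import numpy as np

# Values copied from the printed estimate above.
T_printed = np.array([[0.86513865, 0.25166887],
                      [0.13486135, 0.74833113]])

row_sums = T_printed.sum(axis=1)
col_sums = T_printed.sum(axis=0)
rows_sum_to_one = bool(np.allclose(row_sums, 1.0, atol=1e-6))
cols_sum_to_one = bool(np.allclose(col_sums, 1.0, atol=1e-6))

print("row sums:   ", row_sums, rows_sum_to_one)
print("column sums:", col_sums, cols_sum_to_one)
```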


In [22]:
for state in range(lineage1.num_states):
    print("State {}:".format(state))
    print("                    estimated state:", tHMMobj.estimate.E[state])
    print("original parameters given for state:", E[state])
    print("\n")

State 0:
                    estimated state: State object w/ parameters: 0.9891304347825423, 20.601233159977298, 0, 4.848518754422664.
original parameters given for state: State object w/ parameters: 0.99, 20, 0, 5.


State 1:
                    estimated state: State object w/ parameters: 0.8875333587494939, 10, 0, 1.
original parameters given for state: State object w/ parameters: 0.88, 10, 0, 1.




## Trying another lineage, this time pruning branches with ancestors that die

In [26]:
prune_boolean = True # To get pruned tree

In [28]:
lineage2 = LineageTree(pi, T, E, desired_num_cells, desired_experiment_time, prune_condition, prune_boolean)
print(lineage2)

This tree is pruned. It is made of 2 states.
 For each state in this tree: 
 	 There are 1149 cells of state 0, 
 	 There are 1640 cells of state 1.
 This pruned tree has 2789 many cells in total


### Estimation of distribution parameters using our estimators for pruned lineage

In [39]:
for state in range(lineage2.num_states):
    print("State {}:".format(state))
    print("                    estimated state:", E[state].estimator(lineage2.lineage_stats[state].pruned_lin_cells_obs))
    print("original parameters given for state:", E[state])
    print("\n")

State 0:
                    estimated state: State object w/ parameters: 0.9895561357701498, 20.404615310671907, 0, 4.911323533620644.
original parameters given for state: State object w/ parameters: 0.99, 20, 0, 5.


State 1:
                    estimated state: State object w/ parameters: 0.8914634146340986, 9.805387810739804, 0, 1.0145367895364306.
original parameters given for state: State object w/ parameters: 0.88, 10, 0, 1.




## Analyzing a pruned lineage

In [None]:
X = [lineage2] # population just contains one lineage
deltas, state_ptrs, all_states, tHMMobj, NF, LL, accuracies = Analyze(X, 2) # find two states

### Estimated Markov parameters ($\pi$, $T$, $E$)

In [32]:
print(tHMMobj.estimate.pi)

[0. 1.]


In [33]:
print(tHMMobj.estimate.T)

[[0.86167578 0.23267333]
 [0.13832422 0.76732667]]


In [38]:
for state in range(lineage2.num_states):
    print("State {}:".format(state))
    print("                    estimated state:", tHMMobj.estimate.E[state])
    print("original parameters given for state:", E[state])
    print("\n")

State 0:
                    estimated state: State object w/ parameters: 0.9893170545592912, 20.513401918515704, 0, 4.876292274543093.
original parameters given for state: State object w/ parameters: 0.99, 20, 0, 5.


State 1:
                    estimated state: State object w/ parameters: 0.8890452732817082, 10, 0, 1.
original parameters given for state: State object w/ parameters: 0.88, 10, 0, 1.




### Analyzing a population of lineages

In [35]:
X = [lineage1, lineage2] # population contains two lineages

deltas, state_ptrs, all_states, tHMMobj, NF, LL, accuracies = Analyze(X, 2) # find two states

### Estimated Markov parameters ($\pi$, $T$, $E$)

In [41]:
print(tHMMobj.estimate.pi)

[0. 1.]


In [42]:
print(tHMMobj.estimate.T)

[[0.86340657 0.2421711 ]
 [0.13659343 0.7578289 ]]


In [40]:
for state in range(tHMMobj.numStates):
    print("State {}:".format(state))
    print("                    estimated state:", tHMMobj.estimate.E[state])
    print("original parameters given for state:", E[state])
    print("\n")

State 0:
                    estimated state: State object w/ parameters: 0.9893170545592912, 20.513401918515704, 0, 4.876292274543093.
original parameters given for state: State object w/ parameters: 0.99, 20, 0, 5.


State 1:
                    estimated state: State object w/ parameters: 0.8890452732817082, 10, 0, 1.
original parameters given for state: State object w/ parameters: 0.88, 10, 0, 1.




QUESTION 5: In your own words, describe what a "state" is. Explain how the `StateDistribution` class used in defining the emissions helps us describe states. Give an example of a set of physical observations and how you can describe heterogeneous observations using two or more states. Write your answer below.

QUESTION 6: Using the example you wrote about in QUESTION 5, write the three things that are used in the user-defined `StateDistribution` class in terms of your example. If possible, also include a possible `prune_rule`. For example, we've written a `StateDistribution` class that provides a Bernoulli and gamma multivariate random variable, parameterized by the Bernoulli rate parameter (`p`), and the three gamma parameters (`a`, `loc`, `scale`). We also provide a probability density function and estimators. We use this to describe the physical observations of fate and lifetime, respectively. Write your answer below.

QUESTION 7: Create a population of 50 UNPRUNED lineages with two states. Use any parameter set you desire. Use `Analyze` to analyze the population. Print out the estimated Markov parameters. Write your code below. (If your code is taking a long time to run, consider decreasing `desired_num_cells`. Feel free to print out the lineage variable to see how many cells it contains.)

QUESTION 8: Using the same parameter set from QUESTION 7, create a new population of 50 PRUNED lineages with two states. Use `Analyze` to analyze the populations and print out the estimated Markov parameters as before. Write your code below. Describe how the estimation differs between the two cases (unpruned vs pruned).