
# Immune infiltrate model

The goal of this notebook is to describe the basic hierarchical immune infiltrate model and variations thereof.

More specifically, it explains the basic assumptions made and maps each variation of the model to the version implemented in Stan.

For reference, the Stan code for the models is contained in the [./models](models) folder of this directory. Each model is numbered and contains a description.

In this document we will largely refer to those models by number. 




## Problem context

The basic data we are attempting to describe consist of counts of transcripts from a bulk-biopsy sample. 

Say we have N transcripts; the outcome data (`y`) for a sample consist of N data points, each of which is a count of how many times the corresponding transcript was observed in that sample.

A few comments about the data:

- These are typically not considered absolute counts; more often they are treated as relative counts, since the total number of cells is not known. However, they are (in their raw form) integers.
- Bulk-biopsy samples are composed of many cells; we assume that the expression levels for the sample are the sum of the expression levels of the individual cells in that sample.
    - There is a fair amount of evidence (cf. [this doc - link TK]()) supporting the assumption that the counts of transcripts observed in a sample are additive.

At any rate, the goal is to determine the components of that mixture, i.e. what proportion of the cells are of each type.

To support this goal, we have a set of cell types, with a typical expression vector for each type.

## Basic mixture components

This yields the simplest form of our model for the expression of a mixture of cells, in terms of the components of that mixture.

We start by envisioning a set of $ N_{t} $ cells of type $t$.

The per-transcript expression count for that set of cells can be modeled as a vector $ C_{t} $:

$$ C_{t} = N_{t} \times A_{t} $$

where $A_{t}$ is a vector describing typical per-cell expression levels for a cell of type $t$. 

Since $C_{t}$ is constrained to be positive, and since the majority of $A_{t}$ values are small, we are better off working on a log scale.

This would more likely look something like: 

$$ lN_{t} = \log(N_{t}) $$
$$ lA_{t} = \log(A_{t}) $$
$$ \log(C_{t}) = lN_{t} + lA_{t} $$

For a mixture of T cell types, our transcript-level count totals `C` would be the sum of the cell-type-specific counts for that transcript.

That is:

$$ \log(C) = \log(\sum_{t=1}^{T} {C_{t}}) $$
$$ \log(C) = \log(\sum_{t=1}^{T} {\exp(\log(C_{t}))}) $$
$$ \log(C) = \log(\sum_{t=1}^{T} {\exp(lN_{t} + lA_{t} )}) $$

Stan has a function `log_sum_exp` that implements a numerically stable version of this very operation (computing the log of a sum of exponentials).
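
As a rough illustration of why the stable version matters, here is a minimal Python sketch of the same computation. It is not the Stan implementation; the helper names `log_sum_exp` and `log_mixture_count`, and the example values, are made up for this illustration.

```python
import math

def log_sum_exp(values):
    """Numerically stable log(sum(exp(v) for v in values))."""
    m = max(values)
    if m == -math.inf:
        # Every term is exp(-inf) = 0, so the log of the sum is -inf.
        return -math.inf
    # Subtracting the maximum keeps every exponent <= 0, avoiding overflow.
    return m + math.log(sum(math.exp(v - m) for v in values))

def log_mixture_count(log_n, log_a):
    """log(C) for one transcript: log(sum_t exp(lN_t + lA_t))."""
    return log_sum_exp([ln + la for ln, la in zip(log_n, log_a)])

# Hypothetical values: three cell types, one transcript.
log_n = [math.log(0.5), math.log(0.3), math.log(0.2)]   # lN_t
log_a = [math.log(10.0), math.log(200.0), math.log(1.0)]  # lA_t
log_c = log_mixture_count(log_n, log_a)
```

Here `math.exp(log_c)` recovers the plain sum `0.5 * 10 + 0.3 * 200 + 0.2 * 1`. The benefit of the shifted form shows up for extreme inputs such as `log_sum_exp([-1000.0, -1000.0])`, where exponentiating directly would underflow to zero.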

Note also that, in practice, we don't have any data to inform the total number of cells in a given sample.

We instead constrain the values of $N_{t}$ to sum to 1 by using the `simplex` data type for this vector.
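
To see what the constraint buys us, the sketch below normalizes hypothetical cell counts into proportions. Multiplying every count by the same factor leaves the proportions unchanged, which is one way to see that only the proportions are identifiable. The helper `to_simplex` and the numbers are illustrative only; Stan applies its own transform for `simplex` parameters, which this does not reproduce.

```python
def to_simplex(weights):
    """Rescale non-negative weights so that they sum to 1."""
    total = sum(weights)
    if total <= 0 or any(w < 0 for w in weights):
        raise ValueError("weights must be non-negative with a positive sum")
    return [w / total for w in weights]

# Hypothetical cell counts per type, and the same sample at 10x the size.
raw_counts = [500.0, 300.0, 200.0]
scaled_counts = [10.0 * w for w in raw_counts]

props = to_simplex(raw_counts)
props_scaled = to_simplex(scaled_counts)
```

Both `props` and `props_scaled` describe the same mixture, so the total number of cells drops out of the model.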


## Modeling variability

Before choosing how to model the observed counts given parameter values, there are two sources of variability to think about:

1. What is expected variation of true counts among individual cells of the same type?
    - If we could measure the true count of transcripts **per cell**, would we expect them to be identical? Or would there be some distribution of those counts per type?
    - My sense is that *yes*, we would expect variation, and that the variation would depend on the transcript.
2. What is the expected measurement error in observed counts for a set of cells, given a known "true count" value?
    - Here I expect some measurement error, and this may or may not depend on the transcript. 
    - It may depend on the transcript length, for example.

I ask these questions before going into the way we model the probability of observing a count of X for a transcript $i$, because how we model the variability determines how flexibly we can address the two items above.

In general, we have a choice between treating the cell-type-specific estimates $ lN_{t} $ and $ lA_{t} $ as parameters, or as log-probabilities directly. 

Early versions of this model used the parameter values to compute the expected counts `C`, which were modeled as random variables with a negative-binomial distribution. 

Later iterations will instead model each type-specific count with a `neg_binomial_lpmf` term; the type-specific counts are then summed to yield the observed values of $C$.
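
For reference, here is a minimal Python sketch of a negative-binomial log-probability for a single observed count given an expected count. It assumes the mean/dispersion parameterization (mean `mu`, variance `mu + mu^2 / phi`), which is the form Stan exposes as `neg_binomial_2`. Which parameterization each numbered model actually uses is not stated here, so treat this choice as an assumption.

```python
import math

def neg_binomial_2_lpmf(y, mu, phi):
    """log NB(y | mu, phi) with mean mu and variance mu + mu**2 / phi.

    Assumed parameterization, chosen for illustration only.
    """
    if y < 0 or mu <= 0 or phi <= 0:
        raise ValueError("need y >= 0, mu > 0 and phi > 0")
    return (
        math.lgamma(y + phi) - math.lgamma(y + 1) - math.lgamma(phi)
        + y * (math.log(mu) - math.log(mu + phi))
        + phi * (math.log(phi) - math.log(mu + phi))
    )

# Hypothetical expected count and dispersion for one transcript.
mu, phi = 5.0, 2.0
log_p = neg_binomial_2_lpmf(3, mu, phi)
```

Smaller values of `phi` give more overdispersion relative to a Poisson with the same mean; as `phi` grows, the distribution approaches that Poisson.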


## Modeling the similarity of cell types

A second question, orthogonal to the first, concerns how to model the hierarchy of (or similarities among) different cell types.

Currently, we are modeling this by imposing a correlation matrix on the cell-type-specific expression levels for each transcript: $ A_{t} $ . 

Rather than impose a strict hierarchy, we instead assume that cell types which are more highly correlated are more similar to one another than cell types which are not highly correlated. 

Currently, this correlation among cell types is estimated from the data, but it could also be informed by priors given the known relatedness of cell types.

Also, currently, this correlation is modeled for each transcript using two components:

- a correlation matrix among the cell types
- a covariance matrix derived from the correlation matrix, which accounts for variations by transcript

This approach is described in comments following [issue 2](https://github.com/hammerlab/immune-infiltrate-explorations/issues/2)
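
One common way to derive a covariance matrix from a correlation matrix is to scale it by a vector of standard deviations, $\Sigma = \mathrm{diag}(\tau) \, \Omega \, \mathrm{diag}(\tau)$, which Stan provides as `quad_form_diag`. The Python sketch below shows that arithmetic. It assumes that the per-transcript variation enters as a vector of scales $\tau$ (one per cell type); the matrices and values are invented, and the actual construction is the one described in the issue linked above.

```python
def quad_form_diag(corr, scales):
    """Return diag(scales) * corr * diag(scales) as a list of lists."""
    n = len(scales)
    return [
        [scales[i] * corr[i][j] * scales[j] for j in range(n)]
        for i in range(n)
    ]

# Hypothetical correlation among three cell types (shared by all transcripts).
corr = [
    [1.0, 0.8, 0.1],
    [0.8, 1.0, 0.2],
    [0.1, 0.2, 1.0],
]
# Hypothetical per-cell-type scales for one transcript.
scales = [0.5, 2.0, 1.0]

cov = quad_form_diag(corr, scales)
```

The diagonal of `cov` holds the squared scales, while the off-diagonal entries keep the correlation structure, so a single correlation matrix can be reused with different scales for each transcript.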

### Cell-type features

Model 6 is a variation in which expression of each transcript is a function of features at the level of the cell type. So, a particular transcript $i$ might be more likely to be expressed among cells with a particular cell-surface marker, or might be similarly expressed among all B cells.

This is one way to include information about higher-level categorizations of cell types (e.g. B vs T cell). It also allows transcript-level expression to be a function of cell-surface-marker presence; e.g. CD8+ cells may be defined by expression of a subset of transcripts.

This feature is also described in [issue 2](https://github.com/hammerlab/immune-infiltrate-explorations/issues/2)


## Sample features

In addition to the typical expression profile by cell type ($A_{t}$), we will want to include at least one additional parameter for sample-specific variation in counts: $a_{s}$. 

We include this parameter to capture sample-level contributions to transcript-specific expression.

Note that we could also include a sample-specific contribution to the "error" or dispersion ($\sigma_{s}$), but this has not been done to date. In fact, many of our current models include fixed values of `nu` for all observed datapoints. 


## Putting pieces together

Given: 

- Sample identifiers $ s $ numbered from 1..$S$
- Cell type identifiers $ t $ numbered from 1..$T$
- Transcript identifiers $ i $ numbered from 1..$I$
- Known proportions $ x_{ts} $ of each sample $ s $ that is of a particular cell type $ t $, constrained so that:
    - each $ x_{ts} $ is between 0 and 1 
    - and, the sum of $ x_{ts} $ values for each sample is 1 (ie $ \sum_{t=1}^{T} x_{ts} = 1 $ for each sample $s$)


We model the transcript-level expression in the sample as the sum of cell-type-specific expression weighted by the known mixture proportions.

You can think of this as defining a per-cell-type latent parameter $ c_{its} $ which indicates the contribution to the sample × transcript × cell-type count, i.e. the portion of the sample-transcript count which is attributable to each cell type.

$$ p(c_{its} | x_{ts}, a_{it}, a_{s}, \sigma_{it}, \sigma_{s}) = x_{ts} \times NegBinom(f(a_{s}, a_{it}), g(\sigma_{it}, \sigma_{s})) $$

The sum of these values over all cell types yields the "true" sample × transcript count $c_{is}$:

$$ p(c_{is}) = \sum_{t=1}^{T}{(x_{ts} \times NegBinom(f(a_{s}, a_{it}), g(\sigma_{it}, \sigma_{s})))}$$
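
The weighted sum above can be evaluated on the log scale with the log-sum-exp trick. The Python sketch below does this for one sample and one transcript. The functions $f$ and $g$ are left unspecified in this document, so the sketch assumes $f(a_{s}, a_{it}) = \exp(a_{s} + a_{it})$ and replaces $g$ with a single fixed dispersion `phi`. Those choices, the mean/dispersion negative binomial, and all names and values are assumptions made for illustration.

```python
import math

def log_sum_exp(values):
    """Numerically stable log(sum(exp(v) for v in values))."""
    m = max(values)
    if m == -math.inf:
        return -math.inf
    return m + math.log(sum(math.exp(v - m) for v in values))

def nb_lpmf(y, mu, phi):
    """log NB(y | mu, phi), mean mu, variance mu + mu**2 / phi (assumed form)."""
    return (
        math.lgamma(y + phi) - math.lgamma(y + 1) - math.lgamma(phi)
        + y * (math.log(mu) - math.log(mu + phi))
        + phi * (math.log(phi) - math.log(mu + phi))
    )

def mixture_lpmf(count, proportions, log_a_type, log_a_sample, phi):
    """log of sum_t x_t * NB(count | exp(a_s + a_t), phi)."""
    terms = [
        math.log(x) + nb_lpmf(count, math.exp(log_a_sample + la), phi)
        for x, la in zip(proportions, log_a_type)
        if x > 0  # a cell type with zero proportion contributes nothing
    ]
    return log_sum_exp(terms)

# Hypothetical inputs: two cell types, one transcript, one sample.
props = [0.7, 0.3]                          # x_ts, sums to 1
log_a_type = [math.log(5.0), math.log(50.0)]  # a_it on the log scale
log_a_sample = 0.0                          # a_s, no sample-level shift
phi = 2.0                                   # fixed dispersion

log_p = mixture_lpmf(12, props, log_a_type, log_a_sample, phi)
```

Because the proportions sum to 1 and each component is a proper distribution, `math.exp(mixture_lpmf(...))` summed over all possible counts is 1.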

We would then add some noise components to this value in order to model the observed sample × transcript counts.

# Extensions to the model

## Including data about transcription factors

We have also discussed, for example, extending this component of the model to impose similar inter-cell-type correlations among transcripts that share a transcription factor (TF). This extension has not yet been implemented.