# Introduction

## Data Inputs

The input training data consists of:

- an $(S \times G)$ matrix of gene counts $\boldsymbol Y$
    - Each entry $y_{s,g}$ corresponds to the number of reads (TPM) of transcript $g$ in sample $s$.
- a $(S \times C)$ matrix of known sample compositions $\boldsymbol X$
    - Each entry $x_{s,c}$ corresponds to the proportion of sample $s$ made up of cell type $c$.
    - Each value $x_{s,c}$ is in the range $[0, 1]$.
    - The values for each sample sum to 1: $\sum_{c} x_{s,c} = 1$.
- a $(C \times M)$ matrix of cell-type features $\boldsymbol F$
    - Each entry $f_{c,m}$ in $\boldsymbol F$ describes feature $m$ among cells of type $c$.
    - In our first iteration, we have used the vector of cell-surface markers as cell-type features, but this can be expanded or removed from the model.
    - As an example: CD4+ T helper cells are identified by the combination of markers [cd4+, crth2+, cxcr3-]. These values are coded as dummy variables to form the feature vector $F_{c}$ for that cell type.
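
As a rough illustration of the three inputs, the sketch below builds toy versions of $\boldsymbol Y$, $\boldsymbol X$ and $\boldsymbol F$ with NumPy and checks the constraints listed above. The sizes, the random values, the helper name `check_inputs`, and every cell type other than the CD4+ T helper example are assumptions made for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

S, G, C = 4, 6, 3  # toy sizes: samples, transcripts, cell types

# Y: (S x G) non-negative expression values, one row per sample (toy values).
Y = rng.poisson(lam=50, size=(S, G))

# X: (S x C) compositions; Dirichlet draws lie in [0, 1] and sum to 1 per row.
X = rng.dirichlet(alpha=np.ones(C), size=S)

# F: (C x M) dummy-coded marker features, one row per cell type.
markers = ["cd4", "crth2", "cxcr3"]
cell_types = {
    "cd4_t_helper": {"cd4": 1, "crth2": 1, "cxcr3": 0},  # [cd4+, crth2+, cxcr3-]
    "cell_type_b": {"cd4": 1, "crth2": 0, "cxcr3": 1},   # hypothetical
    "cell_type_c": {"cd4": 0, "crth2": 0, "cxcr3": 0},   # hypothetical
}
F = np.array([[profile[m] for m in markers] for profile in cell_types.values()])


def check_inputs(Y, X, F, atol=1e-8):
    """Validate shapes and constraints; return the dimensions (S, G, C, M)."""
    S, G = Y.shape
    if X.shape[0] != S:
        raise ValueError("X and Y must have one row per sample")
    C = X.shape[1]
    if F.shape[0] != C:
        raise ValueError("F must have one row per cell type")
    if np.any(X < 0) or np.any(X > 1):
        raise ValueError("compositions must lie in [0, 1]")
    if not np.allclose(X.sum(axis=1), 1.0, atol=atol):
        raise ValueError("each sample's composition must sum to 1")
    return S, G, C, F.shape[1]


dims = check_inputs(Y, X, F)
```

Calling `check_inputs` before fitting catches a transposed matrix or unnormalised compositions early, when the error is still easy to trace.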

This implies the following dimensions:

- **C**: number of cell type labels under consideration
- **S**: number of samples
- **G**: number of gene transcripts
- **M**: number of cell-type features

## High-level approach

In general our approach is to model the transcript-level reads $Y$ as a function of known sample compositions $X$ and possibly cell-type features $F$.

Specifically:

   $$ E(y | X, F, \theta ) = g^{-1}(X, F, \theta) $$

for a set of parameters $\theta$ and link function $g$.

Having learned this relationship from training data, we then use the estimated parameters $\hat{\theta}$ and observed read counts $Y_2$ to _infer_ the composition of cell types $\tilde{X}_2$ for samples of unknown composition.



# First model

We start with a simplifying assumption that the sample-level transcripts are a linear sum of the frequencies of transcripts for each cell-type, weighted according to the sample composition.

This assumption yields a pretty standard GLM for the read counts: 

   $$ E(Y | X, \beta ) = g^{-1}(X\beta) $$

where $\boldsymbol \beta$ is a $(C \times G)$ matrix. In this context the parameter $\beta$ has a convenient interpretation: row $c$ of $\beta$ is the expression profile of cell type $c$.
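
To make the linear-sum assumption concrete, the sketch below computes $X\beta$ for toy data and then runs the step in reverse, estimating compositions from expression given $\beta$. It assumes an identity link, and the reverse step uses ordinary least squares followed by clipping and renormalisation. This is a simple stand-in chosen for illustration, not the inference procedure of the model itself, and the function names and values are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
S, C, G = 5, 3, 8  # toy sizes

X = rng.dirichlet(np.ones(C), size=S)                 # (S x C) known compositions
beta = rng.gamma(shape=2.0, scale=50.0, size=(C, G))  # (C x G) toy expression profiles


def expected_counts(X, beta):
    """E(Y | X, beta) under an identity link: composition-weighted sum of profiles."""
    return X @ beta


def infer_composition(Y2, beta_hat):
    """Least-squares composition estimate, clipped to >= 0 and renormalised per sample."""
    # Solve beta_hat.T @ X2.T ~= Y2.T for X2.T, one column per sample.
    X2_T, *_ = np.linalg.lstsq(beta_hat.T, Y2.T, rcond=None)
    X2 = np.clip(X2_T.T, 0.0, None)
    return X2 / X2.sum(axis=1, keepdims=True)


mu = expected_counts(X, beta)        # (S x G) expected expression
X_tilde = infer_composition(mu, beta)  # recovers X when there is no noise
```

With noise-free input and $G \ge C$ linearly independent profiles, the recovered compositions match the true ones; with real counts the estimate would be noisy, which is part of the motivation for a full probabilistic model.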

[//]: # (In other words, for a single sample $s$ with composition $ \vec{x_{c}} $: )
[//]: # 
[//]: # (  $$ E(\vec{y_{s}}) = g^{-1}(x_{(1 \times C)} * \beta_{(C \times G)}) $$ )

Because the observed data are (likely zero-inflated) count data, we model them with a _negative binomial_ distribution.

$$ Y \sim \mathrm{NegBinomial}( X\beta , \vec{\phi_g} ) $$

Here we start by assuming that the dispersion ($\phi$) varies by transcript $g$.
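
The sketch below simulates counts with a per-transcript dispersion. It assumes the mean/dispersion parameterisation in which the variance is $\mu + \mu^2/\phi$ (as in Stan's `neg_binomial_2`); the source does not state which parameterisation is intended, so treat this as one plausible reading. NumPy's sampler takes $(n, p)$ instead, hence the conversion. The function name and toy values are hypothetical.

```python
import numpy as np


def sample_neg_binomial(mu, phi, rng):
    """Draw counts with mean mu and variance mu + mu**2 / phi."""
    mu = np.asarray(mu, dtype=float)
    # phi has one entry per transcript and is broadcast across samples.
    phi = np.broadcast_to(np.asarray(phi, dtype=float), mu.shape)
    # NumPy parameterises by (n, p); n = phi and p = phi / (phi + mu) give mean mu.
    p = phi / (phi + mu)
    return rng.negative_binomial(phi, p)


rng = np.random.default_rng(2)
S, C, G = 5, 3, 4                                     # toy sizes
X = rng.dirichlet(np.ones(C), size=S)                 # (S x C) compositions
beta = rng.gamma(shape=2.0, scale=50.0, size=(C, G))  # (C x G) toy profiles
phi = np.array([0.5, 2.0, 10.0, 50.0])                # one dispersion per transcript

Y = sample_neg_binomial(X @ beta, phi, rng)           # (S x G) simulated counts
```

Smaller values of $\phi$ give heavier overdispersion; as $\phi$ grows, the distribution approaches a Poisson with the same mean.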


### Incorporating hierarchy of cell types

We model the "hierarchy" of cell types by observing the similarity of their expression profiles among transcripts.

Specifically, through the prior on our parameter $\beta$:

$$ \beta_{(C \times G)} \sim \mathrm{MultiNormal}(u_{(C \times G)}, \boldsymbol \Sigma) $$

where 

  - $u_{c,g}$ is a linear predictor for the expression of transcript $g$ in cell type $c$, and
  - the covariance matrix $\Sigma$ is decomposed into the diagonal matrix ("scaling factor") $\tau$ and correlation matrix $\Omega$

$$ \Sigma = \tau \, \Omega \, \tau $$
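
A small numerical sketch of this decomposition follows. It assumes that $\Sigma$ is $(C \times C)$, so that it describes covariance between cell types for a given transcript; the source does not state the dimension explicitly. The scale and correlation values are made up.

```python
import numpy as np

# Assumption: Sigma is (C x C), covariance across cell types for one transcript.
tau_scales = np.array([0.5, 1.0, 2.0])  # diagonal of tau (hypothetical values)
Omega = np.array([
    [1.0, 0.8, 0.1],
    [0.8, 1.0, 0.2],
    [0.1, 0.2, 1.0],
])                                      # correlation matrix (hypothetical values)

tau = np.diag(tau_scales)
Sigma = tau @ Omega @ tau               # Sigma[i, j] = tau_i * tau_j * Omega[i, j]

# The decomposition is reversible: standard deviations and correlations
# can be read back off Sigma.
sd = np.sqrt(np.diag(Sigma))
Omega_back = Sigma / np.outer(sd, sd)

# One draw of a transcript's expression across the C cell types,
# centred on a toy linear predictor u_g.
rng = np.random.default_rng(4)
u_g = np.array([3.0, 3.5, 1.0])
beta_g = rng.multivariate_normal(mean=u_g, cov=Sigma)
```

Separating scale from correlation lets the two be given different priors; here cell types 1 and 2 have correlated expression (0.8) while cell type 3 is nearly independent of both.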


### Including cell-type features

$$ u = \vec{p}_{(C \times 1)} + \boldsymbol{F}_{(C \times M)} \left( \vec{b}_{(M \times 1)} + \vec{\kappa_g}_{(M \times 1)} \right) $$

and 