Open in [nbviewer](https://nbviewer.jupyter.org/github/luiarthur/ucsc_litreview/blob/master/notes/Litreview.ipynb)
$
% Latex definitions
% note: Ctrl-Shift-P for shortcuts menu
\newcommand{\iid}{\overset{iid}{\sim}}
\newcommand{\ind}{\overset{ind}{\sim}}
\newcommand{\p}[1]{\left(#1\right)}
\newcommand{\bk}[1]{\left[#1\right]}
\newcommand{\bc}[1]{ \left\{#1\right\} }
\newcommand{\abs}[1]{ \left|#1\right| }
\newcommand{\norm}[1]{ \left|\left|#1\right|\right| }
\newcommand{\E}{ \text{E} }
\newcommand{\N}{ \mathcal N }
\newcommand{\ds}{ \displaystyle }
\newcommand{\given}{\bigg{|}}
\newcommand{\Bin}{\text{Bin}}
\newcommand{\Poi}{\text{Poi}}
$

# [Bayesian inference for intratumour heterogeneity in mutations and copy number variation](../pdf/bayesTumor.pdf)

## Glossary

**somatic**: relating to the body

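**Variant allelic fractions (VAF)**: the fraction of alleles at a locus that carry the variant sequence ($z_{sc}/l_{sc}$ in the Notation section)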

**next-generation sequencing**: 
Several new methods for DNA sequencing were developed in the mid to late 1990s and were implemented in commercial DNA sequencers by the year 2000. Together these were called the "next-generation" sequencing methods.

**Copy number variants (CNV)**

***

- L describes the subclonal copy numbers
- Z describes the numbers of subclonal variant alleles (can there be "2"?)
- w describes the cellular fractions of subclones

This table represents the truth underlying the
hypothetical scenario shown in Figure 1a. 

| Locus / sample | subclone1 | subclone2 | subclone3 | normal clone |
|:----: |:---:      |:---:      |:---:      |:---:   |
|**L**  |
| 1 | 2 | 3 | 3 | 2 | 
| 2 | 2 | 2 | 2 | 2 | 
| 3 | 2 | **3** | 4 | 2 | 
|**Z**  |
| 1 | 0 | 1 | 1 | 0 |
| 2 | 1 | 0 | 1 | 0 | 
| 3 | 0 | 0 | 0 | 0 |
|**w**  |
| $t_1$ | 30% | 0%  | 0%  | 70% |
| $t_2$ | 30% | 15% | 0%  | 55% |
| $t_3$ | 20% | 15% | 10% | 55% |

### Issues
- The bolded number is 3 (because it is hypothetical), but should be 2 if it represents the truth.
- The $w$ matrix in Figure 1a should be labelled differently so as not to confuse readers: the rows of $w$ correspond to time, not loci.
- Figure 1a: Difficult to distinguish between the green and brown for color-deficient readers.

### Existing Methods

- THetA considers only subclonal copy numbers
- TrAp emphasizes identifiability and sufficient sample size for unique mathematical solutions
- SciClone and Clomial assume a binary matrix, focusing on SNVs at copy neutral regions with heterozygous mutations
- PhyloSub and PhyloWGS consider possible genotypes at SNVs accounting for potential copy number changes and phylogenetic constraints
- CloneHD provides inference that is **similar to our method** but **assumes the availability of data from matched normal samples**
    - provides only point estimates of the subclonal copy numbers and subclonal mutations
- PyClone and CHAT adjust the estimation of subclonal cellular fractions for both CNVs and SNVs
    - stop short of directly inferring subclonal copy numbers or variant allele counts

# Method

### Notation
- $S$: number of loci (known and fixed)
- $C$: number of subclones (unknown & random)
- $T$: number of samples (collected either at different times or at different spatial locations of the tumor)
- $\underset{S\times C}{L}$: integer-valued random matrix to characterize subclonal copy numbers
    - $l_{sc} = $ the number of alleles (aka copy number) at locus $s$ for subclone $c$.
    - $l_{sc} \in \bc{0,1,\ldots,Q}$ for a predetermined finite $Q$ (0 is included because the prior below assigns a probability $\pi_{c0}$ to $l_{sc}=0$)
    - prior for $L$ is a finite version of categorical-IBP (CIBP)
- $\underset{S\times C}{Z}$: integer-valued random matrix to record SNVs
    - $z_{sc} = $ the number of alleles that bear a variant sequence (w.r.t. a reference / normal clone) at locus $s$ for subclone $c$.
    - $z_{sc} \le l_{sc}$, since the variant alleles are a subset of all alleles at the locus
- $\displaystyle\frac{z_{sc}}{l_{sc}} = $ the VAF at locus $s$ for subclone $c$
- $\underset{T \times C}{w}$: Matrix of proportions
    - $w_{tc} = $ the cellular fraction of subclone $c$ in sample $t$ (time or location)
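
As a quick check on the notation, the per-subclone VAF $z_{sc}/l_{sc}$ can be computed directly from the $L$ and $Z$ entries of the hypothetical table in the Glossary section. This is only a sketch: the function name `subclone_vaf` is made up for illustration, and returning `None` when $l_{sc}=0$ is an assumption about how to treat an undefined ratio.

```python
from fractions import Fraction

# Values copied as written from the hypothetical table (Figure 1a scenario).
# Rows: loci 1..3; columns: subclone1, subclone2, subclone3, normal clone.
L = [[2, 3, 3, 2],
     [2, 2, 2, 2],
     [2, 3, 4, 2]]
Z = [[0, 1, 1, 0],
     [1, 0, 1, 0],
     [0, 0, 0, 0]]


def subclone_vaf(L, Z):
    """Return the S x C matrix of z_sc / l_sc as exact fractions."""
    vaf = []
    for l_row, z_row in zip(L, Z):
        row = []
        for l, z in zip(l_row, z_row):
            if not 0 <= z <= l:
                raise ValueError("need 0 <= z_sc <= l_sc")
            # Assumption: the ratio is undefined when no alleles are present.
            row.append(Fraction(z, l) if l > 0 else None)
        vaf.append(row)
    return vaf


vaf = subclone_vaf(L, Z)
for s, row in enumerate(vaf, start=1):
    print("locus", s, [str(v) for v in row])
```

Exact fractions are used so that a value such as 1/3 is not blurred by floating-point rounding.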

# Model

$$
\begin{split}
n_{st} ~|~ N_{st}, p_{st} &\ind \Bin(N_{st},p_{st}) \\
N_{st} ~|~ \phi_t, M_{st} &\ind \Poi(\phi_t M_{st}/2) \\
\end{split}
$$

- The factor $\phi_t$ is the expected number of reads in sample $t$ if there were no CNV, i.e. the copy number equals 2.
    - Need prior for $\phi_t$
- $p_{st}$ is the probability of observing a read with a variant sequence. (derived from $L$, $Z$, $w$)
- $M_{st}$ is the sample copy number that represents an average copy number across subclones. (derived from $L$ and $w$)
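
The two-level sampling model can be simulated forward to see how $\phi_t$, $M_{st}$ and $p_{st}$ interact. The sketch below uses only the standard library. The values of `phi`, `M` and `p` are hypothetical inputs chosen for illustration (they are not taken from the paper), and `sample_poisson` and `simulate_reads` are illustrative names.

```python
import math
import random


def sample_poisson(rate, rng):
    """Draw from Poisson(rate) with Knuth's multiplication method.

    Adequate for moderate rates; exp(-rate) underflows for very large rates.
    """
    threshold = math.exp(-rate)
    k, prod = 0, rng.random()
    while prod > threshold:
        k += 1
        prod *= rng.random()
    return k


def simulate_reads(phi, M, p, seed=0):
    """Simulate N_st ~ Poi(phi_t * M_st / 2), then n_st ~ Bin(N_st, p_st).

    phi[t], M[s][t] and p[s][t] are supplied by the caller.
    Returns (N, n), each an S x T list of lists.
    """
    rng = random.Random(seed)
    N, n = [], []
    for s in range(len(M)):
        N_row, n_row = [], []
        for t in range(len(phi)):
            total = sample_poisson(phi[t] * M[s][t] / 2, rng)
            # Binomial draw as a sum of `total` Bernoulli(p_st) trials.
            variant = sum(rng.random() < p[s][t] for _ in range(total))
            N_row.append(total)
            n_row.append(variant)
        N.append(N_row)
        n.append(n_row)
    return N, n


# Hypothetical inputs: 2 loci, 2 samples.
phi = [50.0, 50.0]
M = [[2.0, 2.15], [2.0, 2.0]]
p = [[0.0, 0.10], [0.15, 0.15]]

N, n = simulate_reads(phi, M, p, seed=1)
print("total reads N:", N)
print("variant reads n:", n)
```

When $M_{st}=2$ the Poisson rate reduces to $\phi_t$, which matches the reading of $\phi_t$ as the expected number of reads when there is no CNV.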
$$
\begin{split}
M_{st} &= l_{s0}w_{t0} + \ds\sum_{c=1}^C  l_{sc}w_{tc} \\
&= l_{s0}w_{t0} + \mathbf{l_s'w_t}
\end{split}
$$

- $l_{s0}$: the expected copy number from a hypothetical background subclone to account for potential noise and artefacts in the data, labelled as subclone $c=0$. Arbitrarily, assume no copy number loss, i.e. $l_{s0}=2$.
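
To make the formula for $M_{st}$ concrete, the sketch below evaluates it on the hypothetical table from the Glossary section. Two assumptions are made here that the notes do not state: the normal clone is treated as one more column of $L$ and $w$, and the background weight $w_{t0}$ is set to 0 because the table has no background column. The function name `sample_copy_number` is illustrative.

```python
# Values copied as written from the hypothetical table (Figure 1a scenario).
# Rows of L: loci 1..3; columns: subclone1, subclone2, subclone3, normal clone.
L = [[2, 3, 3, 2],
     [2, 2, 2, 2],
     [2, 3, 4, 2]]
# Rows of w: samples t1..t3; same column order as L.
w = [[0.30, 0.00, 0.00, 0.70],
     [0.30, 0.15, 0.00, 0.55],
     [0.20, 0.15, 0.10, 0.55]]


def sample_copy_number(L, w, l_bg=2, w_bg=None):
    """Return the S x T matrix M_st = l_s0 * w_t0 + sum_c l_sc * w_tc.

    l_bg plays the role of l_s0 (fixed at 2 in the notes).
    w_bg[t] plays the role of w_t0; it defaults to 0 for every sample,
    which is an assumption made for this example only.
    """
    if w_bg is None:
        w_bg = [0.0] * len(w)
    return [
        [l_bg * w_bg[t] + sum(l * wt for l, wt in zip(L[s], w[t]))
         for t in range(len(w))]
        for s in range(len(L))
    ]


M = sample_copy_number(L, w)
for s, row in enumerate(M, start=1):
    print("locus", s, [round(m, 4) for m in row])
```

Locus 2 has copy number 2 in every column, so its sample copy number stays at 2 in all three samples. For loci 1 and 3 it rises above 2 as subclones 2 and 3 gain cellular fraction.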

### Prior on $L$

- $\mathbf{\pi_c} = \p{\pi_{c0},\ldots,\pi_{cQ}}$, where $\pi_{cq} = P(l_{sc} = q)$, **the same across all $S$ loci**?
- $\pi_{c\bullet} = \ds\sum_{q=0}^Q \pi_{cq} = 1$
- place a beta–Dirichlet distribution (Kim et al. 2012) on $\mathbf{\pi_c}$
- $\mathbf{\tilde\pi_c} = \p{\tilde\pi_{cq}}_{q \ne 2}$, where $\tilde\pi_{cq} = \ds\frac{\pi_{cq}}{1-\pi_{c2}}$ for $q \ne 2$, so that $\ds\sum_{q \ne 2} \tilde\pi_{cq} = 1$
- $\pi_c \iid \text{Be-Dir}\p{\alpha/C,\beta,\gamma_0,\gamma_1,\gamma_3,\ldots,\gamma_Q} \implies$
    - $\pi_{c2} ~|~ C \iid \text{Beta}(\beta,\alpha/C)$
    - $\tilde\pi_c \iid \text{Dir}\p{\gamma_0,\gamma_1,\gamma_3,\ldots,\gamma_Q}$
- with $\beta=1$ and $C \rightarrow \infty$, dropping all columns of $L$ that are all 2's and left-ordering the remaining columns yields the CIBP.
    - i.e., $L \sim \text{CIBP}$, any parameters?
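
A draw from the finite version of this prior can be sketched as follows: sample $\pi_{c2}$ from its Beta distribution, sample the rescaled probabilities from a Dirichlet, recombine them into $\pi_c$, and then fill each column of $L$. Several things are assumptions here rather than statements from the paper: the Dirichlet is taken over the states $q \ne 2$ only, the entries $l_{sc}$ are drawn independently across loci given $\pi_c$ (the open question flagged above), and the hyperparameter values and function names are made up for illustration.

```python
import random

NEUTRAL = 2  # copy number of the "no change" state


def sample_pi(alpha, beta, gammas, C, rng):
    """Draw pi_c = (pi_c0, ..., pi_cQ) from the finite beta-Dirichlet prior.

    gammas maps each state q != NEUTRAL to its Dirichlet parameter gamma_q.
    """
    if NEUTRAL in gammas:
        raise ValueError("gammas must not contain the neutral state")
    pi_neutral = rng.betavariate(beta, alpha / C)  # pi_c2 ~ Beta(beta, alpha/C)
    # Dirichlet draw via normalized Gamma(gamma_q, 1) variables.
    draws = {q: rng.gammavariate(g, 1.0) for q, g in gammas.items()}
    total = sum(draws.values())
    Q = max(max(gammas), NEUTRAL)
    pi = [0.0] * (Q + 1)
    pi[NEUTRAL] = pi_neutral
    for q, d in draws.items():
        pi[q] = (1.0 - pi_neutral) * d / total  # pi_cq = (1 - pi_c2) * tilde pi_cq
    return pi


def sample_L(pis, S, rng):
    """Return an S x C matrix with l_sc ~ Categorical(pi_c).

    Assumption: entries are independent across loci given pi_c.
    """
    columns = [rng.choices(range(len(pi)), weights=pi, k=S) for pi in pis]
    return [[col[s] for col in columns] for s in range(S)]


rng = random.Random(1)
C, S = 4, 5                          # hypothetical sizes
gammas = {0: 1.0, 1: 1.0, 3: 1.0}    # hypothetical gamma_q, with Q = 3
pis = [sample_pi(alpha=2.0, beta=1.0, gammas=gammas, C=C, rng=rng)
       for _ in range(C)]
L = sample_L(pis, S, rng)
for row in L:
    print(row)
```

As $C$ grows, $\alpha/C$ shrinks and the Beta draw for $\pi_{c2}$ concentrates near 1, so most columns become all 2's. These are the columns that are dropped in the limiting construction.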