# The Problem of Noise in Barcode Sequencing Data

Sequencing data that encode diversity - for example, data involving barcodes - are susceptible to noise from several sources, including:
1. **Synthesis errors** - introduced during reverse transcription or PCR
2. **Sequencing errors** - introduced during base incorporation in sequencing by synthesis (SBS), or during signal decoding or template processing in SMRT/ONT sequencing

Leaving aside empirical strategies to determine noise (e.g. by using UMIs or homopolymeric barcodes), we can measure the noise in barcode sequencing data by using the ecological framework of diversity.

## 1. Diversity can be measured within windows of sequencing data

Given an assemblage of species, a diversity index $D$ can be computed as a function of two quantities:
\begin{equation}
D = D(R, P)
\end{equation}
where $R$ is the richness (the number of distinct species) and $P$ is the probability distribution of the species (their relative abundances).

We can define an assemblage as a $k$-mer window of sequencing data in which:
- $R$ is the number of distinct sequence variants observed in the window, and
- $P$ is the distribution of those variants (their relative frequencies)
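
As a minimal sketch of this definition, the snippet below computes $R$ and $P$ for a single window. It assumes the reads are aligned strings of equal length; the function name `window_assemblage` and the toy reads are hypothetical and only serve as illustration.

```python
from collections import Counter


def window_assemblage(reads, start, k):
    """Return (R, P) for the k-mer window beginning at position `start`.

    Assumption: `reads` is a list of aligned, equal-length strings.
    """
    # Count each distinct k-mer observed at this position across all reads.
    counts = Counter(read[start:start + k] for read in reads)
    total = sum(counts.values())
    richness = len(counts)  # R: number of distinct variants
    distribution = {kmer: c / total for kmer, c in counts.items()}  # P
    return richness, distribution


# Toy data: one read carries a substitution in its last base.
reads = ["ACGT", "ACGT", "ACGA", "ACGT"]
R, P = window_assemblage(reads, start=2, k=2)
print(R, P)
```

Here the window covering the last two bases contains two variants, `GT` and `GA`, so $R = 2$ and $P$ assigns them frequencies of 0.75 and 0.25.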

## 3. Noise can be measured as the diversity in sequence-invariant data

Using the above framework, we can define noise as the diversity across the *sequence-invariant* subset of sequencing data for any given sample.

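This diversity can be averaged across windows of length $k$:
\begin{equation}
N(k) = \frac{k}{n} \sum_{i=1}^{n/k}{D_i(k)}
\end{equation}
where $n$ is the length of each sequence, $n/k$ is the number of non-overlapping windows, and $D_i(k)$ is the diversity of the $i$-th window.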
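
The averaging step might be sketched as follows. This is an illustration under stated assumptions, not a definitive implementation: the reads are taken to be aligned and of equal length, the windows are non-overlapping, and plain richness stands in for the diversity index $D$. The names `richness` and `mean_window_diversity` are hypothetical.

```python
def richness(kmers):
    """Simplest stand-in for a diversity index: number of distinct k-mers."""
    return len(set(kmers))


def mean_window_diversity(reads, k, diversity=richness):
    """Average a diversity index over non-overlapping windows of length k.

    Assumption: `reads` holds aligned, equal-length strings drawn from the
    sequence-invariant part of the data.
    """
    n = len(reads[0])
    # Window start positions 0, k, 2k, ...; a trailing partial window is skipped.
    starts = range(0, n - k + 1, k)
    per_window = [diversity([r[s:s + k] for r in reads]) for s in starts]
    return sum(per_window) / len(per_window)


# Toy data: one read differs from the others at position 3.
reads = ["ACGTAC", "ACGTAC", "ACGAAC", "ACGTAC"]
print(mean_window_diversity(reads, k=2))
```

With $k = 2$ the three windows have richness 1, 2 and 1, so the mean is 4/3. Any other index can be substituted through the `diversity` argument.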

If we consider the barcode in each sequence to be a window by itself, we can choose $k$ to be the length of the barcodes (assuming they are fixed-length).

## 4. Barcode diversity can be corrected for noise using an additive index

Given the apparent diversity $A$ of $k$-mer barcodes, we can correct for noise to compute the true diversity $B$ as follows:
\begin{equation}
B(k) = A(k) - N(k)
\end{equation}

To do this, we need an *additive* diversity index $D$ which can be decomposed into parts:
\begin{equation}
D = \sum{D_i}
\end{equation}

Classical diversity indexes such as the Shannon-Wiener index are not additive.
However, the Hill numbers (Chao et al. 2014 in *Ecological Monographs*) are a family of diversity indices with an intuitive linear interpretation which implies additivity:
\begin{equation}
D = \left(\sum_{i}{p_i^q}\right)^{1/(1-q)}
\end{equation}
where $p_i$ is the relative abundance of species $i$ and $q \neq 1$ is the order of the index.
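
A minimal sketch of the Hill number computation is given below, reading the exponent as $1/(1-q)$, which is the standard definition. The function name `hill_number` is hypothetical. The case $q = 1$ is undefined in the formula itself and is handled here by its limit, the exponential of the Shannon entropy.

```python
import math


def hill_number(proportions, q):
    """Hill number of order q for relative abundances that sum to 1."""
    p = [x for x in proportions if x > 0]  # absent species do not contribute
    if math.isclose(q, 1.0):
        # Limit as q -> 1: exponential of the Shannon entropy.
        return math.exp(-sum(x * math.log(x) for x in p))
    return sum(x ** q for x in p) ** (1.0 / (1.0 - q))


# Four equally abundant species give an effective number of 4.
print(hill_number([0.25, 0.25, 0.25, 0.25], q=2))
# An uneven assemblage of three species gives fewer than 3 effective species.
print(hill_number([0.5, 0.25, 0.25], q=2))
```

For the uneven example, $\sum p_i^2 = 0.375$, so the order-2 Hill number is $1/0.375 \approx 2.67$.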
In particular, a Hill number of order $q$ equals the *effective number of species*: the number of species in an assemblage with perfectly even distribution which would have the same diversity. For $D$ equally abundant species, $p_i = 1/D$ and:
\begin{align}
\left(\sum_{i=1}^D{\left(\frac{1}{D}\right)^q}\right)^{1/(1-q)} = \left(\frac{1}{D^{q-1}}\right)^{1/(1-q)} = D
\end{align}

Using the Hill numbers, it is possible to interpret $B$ and $N$ in terms of *effective numbers of barcodes* which sum to $A$ for each sample.
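
The correction $B = A - N$ can be illustrated with a small sketch in terms of effective numbers. It assumes the Hill number of order $q = 2$ (the inverse Simpson concentration) as the index, and a sequence-invariant region that forms a single window of the same length as the barcodes. The function name `effective_number` and all sequences are hypothetical toy data.

```python
from collections import Counter


def effective_number(items):
    """Hill number of order q = 2 (inverse Simpson) for a list of sequences."""
    counts = Counter(items)
    total = sum(counts.values())
    return 1.0 / sum((c / total) ** 2 for c in counts.values())


# Toy barcodes (k = 4); the last read is a noisy copy of "TTTT".
barcodes = ["AAAA", "AAAA", "CCCC", "CCCC", "GGGG", "GGGG", "TTTT", "TTTA"]
# Toy invariant region of the same length, with one erroneous read.
invariant = ["ACGT"] * 7 + ["ACGA"]

A = effective_number(barcodes)   # apparent barcode diversity
N = effective_number(invariant)  # noise estimated from invariant data
B = A - N                        # noise-corrected diversity
print(A, N, B)
```

In this toy case $A = 32/7 \approx 4.57$ and $N = 1.28$, giving $B \approx 3.29$ effective barcodes. Note that a Hill number is never below 1, so a perfectly invariant window yields $N = 1$ under this sketch.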

# A Sliding Window Algorithm to Measure Noise

Based on the above, we can measure noise using a sliding window algorithm and subtract it from the barcode diversity.