# The Problem of Noise in Barcode Sequencing Data

Diverse sequencing data, such as data involving barcodes, are susceptible to noise from several sources, including:
1. **Synthesis errors** - introduced during reverse transcription or PCR
2. **Sequencing errors** - introduced during base incorporation in sequencing by synthesis (SBS), or during signal decoding or template processing in SMRT/ONT sequencing

Leaving aside empirical strategies to determine noise (e.g. by using UMIs or homopolymeric barcodes), we can measure the noise in barcode sequencing data by using the ecological framework of diversity.

## 1. Diversity can be measured within windows of sequencing data

Given an assemblage of species, a diversity index $D$ can be computed:
\begin{equation}
D(R, P)
\end{equation}
where $R$ is the richness and $P$ is the probability distribution of the species.

We can define an assemblage as a $k$-mer window of sequencing data in which:
- $R$ is the number of distinct sequence variants observed in the window, and
- $P$ is the relative-abundance distribution of those variants
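
As a minimal sketch of this definition, the snippet below extracts the same $k$-mer window from every read and returns $R$ and $P$. The function name `window_assemblage`, the `start` offset and the toy reads are illustrative assumptions, not part of any established pipeline.

```python
from collections import Counter

def window_assemblage(reads, start, k):
    """Return (R, P) for the k-mer window beginning at `start` in each read."""
    # Take the same window from every read that is long enough to contain it.
    kmers = [r[start:start + k] for r in reads if len(r) >= start + k]
    counts = Counter(kmers)
    total = sum(counts.values())
    richness = len(counts)  # R: number of distinct variants in the window
    distribution = {s: c / total for s, c in counts.items()}  # P: relative abundances
    return richness, distribution

# Toy reads (hypothetical data): one read carries a single substitution.
reads = ["ACGTAC", "ACGTAC", "ACGAAC", "ACGTAC"]
R, P = window_assemblage(reads, start=2, k=3)
print(R, P)
```

Here the window covers positions 2 to 4, so the substituted read contributes a second variant and the window has a richness of 2.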

## 3. Noise can be measured as the diversity in sequence-invariant data

Using the above framework, we can define noise as the diversity across the *sequence-invariant* subset of sequencing data for any given sample.

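This diversity can be averaged across windows of length $k$:
\begin{equation}
N = \frac{k}{n} \sum_{i=1}^{n/k}{D_i}
\end{equation}
where $n$ is the length of each sequence, so that $n/k$ is the number of non-overlapping windows, and $D_i$ is the diversity of the $i$-th window.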

If we consider the barcode in each sequence to be a window by itself, we can choose $k$ to be the length of the barcodes (assuming they are fixed-length).
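
The averaging step could be sketched as follows. Several choices here are assumptions made only for illustration: the windows are non-overlapping, all reads have the same length, and the diversity index is a Hill number of order $q = 2$ (any other index could be substituted). The names `hill_number` and `mean_window_diversity` are hypothetical.

```python
from collections import Counter

def hill_number(counts, q=2.0):
    """Hill number of order q (q != 1) from a list of abundances."""
    total = sum(counts)
    return sum((c / total) ** q for c in counts) ** (1.0 / (1.0 - q))

def mean_window_diversity(reads, k, q=2.0):
    """Average diversity over non-overlapping windows of length k."""
    n = len(reads[0])  # assumption: every read has the same length n
    diversities = []
    for start in range(0, n - k + 1, k):
        # Count the variants seen in this window across all reads.
        counts = Counter(r[start:start + k] for r in reads)
        diversities.append(hill_number(list(counts.values()), q))
    return sum(diversities) / len(diversities)

# Toy sequence-invariant reads (hypothetical): ideally all identical.
invariant_reads = ["ACGTAC", "ACGTAC", "ACGAAC", "ACGTAC"]
N = mean_window_diversity(invariant_reads, k=3)
print(N)
```

In this toy input the first window is error-free (diversity 1) and the second contains one variant read in four (diversity about 1.6), so the mean is about 1.3. A noise-free sample would give exactly 1.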

## 4. Barcodes generated by noise can be partitioned into equivalence classes

In the complete set of observed barcodes $S$, there exists a subset of 'true' barcodes $T\subseteq S$. Each true barcode $t \in T$ defines an *equivalence class* in which all pseudo-barcodes generated by noise are grouped with their barcode of origin:
\begin{equation}
[t] = \{s \in S : s\cong t\}
\end{equation}
where $s\cong t$ denotes that $s$ originates from $t$.
Now we can partition $S$ into these barcode classes:
\begin{equation}
S = \dot\bigcup_i[t_i]
\end{equation}
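
The text does not specify how the relation $\cong$ is decided in practice. As one hedged illustration, the sketch below assumes that $T$ is already known and assigns each observed barcode to its nearest true barcode by Hamming distance, with ties going to the first entry in the list. Both the rule and the names `hamming` and `partition_barcodes` are assumptions, not the method defined above.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))

def partition_barcodes(observed, true_barcodes):
    """Assign every observed barcode to the class of its nearest true barcode."""
    classes = {t: set() for t in true_barcodes}
    for s in observed:
        # Hypothetical rule: nearest true barcode wins; ties go to the first in the list.
        nearest = min(true_barcodes, key=lambda t: hamming(s, t))
        classes[nearest].add(s)
    return classes

# Toy barcode sets (hypothetical data).
T = ["AAAA", "CCCC"]
S = ["AAAA", "AAAT", "CCCC", "CCGC", "ACAA"]
classes = partition_barcodes(S, T)
for t, members in classes.items():
    print(t, sorted(members))
```

Because every observed barcode is assigned to exactly one class, the classes are disjoint and their union is $S$, which is the property the partition requires.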

## 5. Total barcode diversity can be measured using an additive index
To measure the diversity of $S$, we use an additive index to allow later decomposition into noise and 'true' diversity.

Classical diversity indexes such as the Shannon-Wiener index are not directly additive. 

On the other hand, the Hill number (Hill, 1973) has an intuitive linear interpretation:
\begin{equation}
D = \left(\sum_i{p_i^q}\right)^{1/(1-q)}
\end{equation}
where $D$ is the Hill number of order $q$ ($q\neq 1$) and $p_i$ is the relative abundance of species $i$.
In particular, any value of $D$ equals the number of species in a perfectly even assemblage of the same diversity, i.e. the *effective number of species*:
\begin{equation}
D = \left(\sum_{i=1}^{D}{\left(\frac{1}{D}\right)^{q}}\right)^{1/(1-q)} = \left(\frac{1}{D^{q-1}}\right)^{1/(1-q)} = D
\end{equation}
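
The effective-number interpretation can be checked numerically. The sketch below is one possible implementation of the Hill number; the function name `hill_number` is an assumption, and the $q = 1$ case uses the standard limit, the exponential of the Shannon entropy, since the formula itself is undefined there.

```python
import math

def hill_number(p, q):
    """Hill number of order q for a list of relative abundances p."""
    p = [x for x in p if x > 0]  # absent species do not contribute
    if math.isclose(q, 1.0):
        # Limit as q -> 1: exponential of the Shannon entropy.
        return math.exp(-sum(x * math.log(x) for x in p))
    return sum(x ** q for x in p) ** (1.0 / (1.0 - q))

# A perfectly even assemblage of 5 species should give D = 5 at every order.
even = [1 / 5] * 5
for q in (0, 0.5, 1, 2, 3):
    print(q, hill_number(even, q))

# An uneven assemblage has fewer effective species than its richness of 2.
print(hill_number([0.75, 0.25], 2))
```

For the even assemblage every order returns 5 up to floating-point error, while the uneven two-species assemblage has an effective number between 1 and 2.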
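Using this interpretation, the additive property follows:
\begin{equation}
D = \sum_i{D_i}
\end{equation}
where $D_i$ is the diversity of each subset within a partition of the assemblage. For $q \neq 0$ this is exact when the subsets share no species, carry equal weight and are equally diverse (the replication principle of Hill numbers).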
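
A small numerical check may help show when the sum holds. The sketch below pools disjoint subsets with equal weight and compares the pooled Hill number with the sum of the subset values; the order $q = 2$, the helper `pool` and the toy distributions are all assumptions chosen for illustration.

```python
def hill_number(p, q=2.0):
    """Hill number of order q (q != 1) for relative abundances p."""
    return sum(x ** q for x in p) ** (1.0 / (1.0 - q))

def pool(subsets):
    """Pool disjoint subsets, each given equal weight, into one distribution."""
    w = 1.0 / len(subsets)
    return [w * x for sub in subsets for x in sub]

equal = [[0.5, 0.5], [0.5, 0.5]]  # two subsets with diversity 2 each
unequal = [[0.5, 0.5], [1.0]]     # diversities 2 and 1

for name, subsets in (("equal", equal), ("unequal", unequal)):
    pooled = hill_number(pool(subsets))
    summed = sum(hill_number(s) for s in subsets)
    print(name, pooled, summed)
```

With equally diverse subsets the pooled value matches the sum (4 in this case); with unequal subsets the pooled value falls below the sum, so the additive form is best read as relying on the equal-noise assumption.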

## 6. True barcode diversity can be indexed by decomposition
If we partition $S$ into barcode classes and assume that noise affects every 'true' barcode $t$ equally, then each barcode class $[t]$ has the same effective size $N$.

If we use the Hill number, since $D$ is additive, we can decompose the diversity of $S$:
\begin{equation}
{}^SD = \sum^{{}^TD}{N} = N\cdot {}^TD
\end{equation}
where ${}^TD$ is the 'true' diversity of barcodes (i.e. the *effective number of true barcodes*).

Hence, we can compute:
\begin{equation}
{}^TD = \frac{{}^SD}{N}
\end{equation}
for any barcode sequencing data $S$.
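
The decomposition can be illustrated with a toy simulation. The sketch below assumes that noise splits every true barcode evenly into $N$ variants, which is a simplification of the equal-noise assumption, and uses a Hill number of order $q = 2$; the names `true_p` and `observed_p` and the abundances are hypothetical.

```python
def hill_number(p, q=2.0):
    """Hill number of order q (q != 1) for relative abundances p."""
    return sum(x ** q for x in p) ** (1.0 / (1.0 - q))

# Hypothetical relative abundances of three true barcodes.
true_p = [0.5, 0.3, 0.2]
N = 4  # assumed noise level: each true barcode appears as 4 equally likely variants

# Observed distribution: every true barcode's abundance is split evenly over N variants.
observed_p = [p / N for p in true_p for _ in range(N)]

S_D = hill_number(observed_p)  # diversity of all observed barcodes
T_D = S_D / N                  # recovered 'true' diversity

print(S_D, T_D, hill_number(true_p))
```

Under this even-split assumption the recovered value agrees with the Hill number computed directly from the true abundances, even though the true barcodes themselves are not equally abundant.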

## References

- Hill, M. O. (1973). **Diversity and evenness: a unifying notation and its consequences.** *Ecology*, 54(2), 427-432.