# Neural Sequence Distance Embeddings

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/gcorso/neural_seed/blob/master/tutorial/Neural_SEED.ipynb)

The improvement of data-dependent heuristics and representation for biological sequences is a critical requirement to fully exploit the recent technological and scientific advancements for human microbiome analysis. This notebook presents Neural Sequence Distance Embeddings (Neural SEED), a novel framework to embed biological sequences in geometric vector spaces that unifies recently proposed approaches. We demonstrate its capacity by presenting different ways it can be applied to the tasks of edit distance approximation, closest string retrieval, hierarchical clustering and multiple sequence alignment. In particular, the hyperbolic space is shown to be a key component to embed biological sequences and obtain competitive heuristics. Benchmarked with common bioinformatics and machine learning baselines, the proposed approaches display significant accuracy and/or runtime improvements on real-world datasets formed by sequences from samples of the human microbiome.

![Cover](https://raw.githubusercontent.com/gcorso/neural_seed/master/tutorial/cover.png)

Figure 1: On the left, a diagram of the underlying idea of Neural SEED: embed sequences in vector spaces while preserving the edit distance between them. On the right, an example of the hierarchical clustering produced on the Poincaré disk from the P53 tumour protein from 30 different organisms.


## 1. Introduction and Motivation

### Motivation

Dysfunctions of the human microbiome (Morgan & Huttenhower, 2012) have been linked to many serious diseases ranging from diabetes and antibiotic resistance to inflammatory bowel disease. Its usage as a biomarker for the diagnosis and as a target for interventions is a very active area of research. Thanks to the advances in sequencing technologies, modern analysis relies on sequence reads that can be generated relatively quickly. However, to fully exploit the potential of these advances for personalised medicine, the computational methods used in the analysis have to significantly improve in terms of speed and accuracy.

![Classical microbiome analysis](https://raw.githubusercontent.com/gcorso/neural_seed/master/tutorial/microbiome_analysis.png)

Figure 2: Traditional approach to the analysis of the 16S rRNA sequences from the microbiome. 

### Problem

While the number of available biological sequences has been growing exponentially over the past decades, most of the problems related to string matching have not been addressed by the recent advances in machine learning. Classical algorithms are data-independent and, therefore, cannot exploit the low-dimensional manifold assumption that characterises real-world data. Exploiting the available data to produce data-dependent heuristics and representations would greatly accelerate large-scale analyses that are critical to microbiome analysis and other biological research. 

Unlike most tasks in computer vision and NLP, string matching problems are typically formulated as combinatorial optimisation problems. These discrete formulations do not fit well with the current deep learning approaches causing these problems to be left mostly unexplored by the community. Current supervised learning methods also suffer from the lack of labels that characterises many downstream applications with biological sequences. On the other hand, common self-supervised learning approaches, very successful in NLP, are less effective in the biological context where relations tend to be per-sequence rather than per-token (McDermott et al. 2021).


### Neural Sequence Distance Embedding

In this notebook, we present Neural Sequence Distance Embeddings (Neural SEED), a general framework to produce representations for biological sequences where the distance in the embedding space is correlated with the evolutionary distance between sequences. This control over the geometric interpretation of the representation space enables the use of geometrical data processing tools for the analysis of the spectrum of sequences.

![Neural SEED key idea](https://raw.githubusercontent.com/gcorso/neural_seed/master/tutorial/edit_diagram.PNG)

Figure 3: The key idea of Neural SEED is to learn an encoder function that preserves distances between the sequence and vector space.


Examining the task of embedding sequences to preserve the edit distance reveals the importance of data-dependent approaches and of using a geometry that matches the underlying distribution of the data analysed. For biological datasets, which have an implicit hierarchical structure given by evolution, the hyperbolic space provides a significant improvement.

We show the potential of the framework by analysing three fundamental tasks in bioinformatics: closest string retrieval, hierarchical clustering and multiple sequence alignment. For all tasks, relatively simple unsupervised approaches using Neural SEED encoders significantly outperform data-independent heuristics in terms of accuracy and/or runtime. In the paper (preprint will be available soon) and the [complete repository](https://github.com/gcorso/neural_seed) we also present more complex geometrical approaches to hierarchical clustering and multiple sequence alignment.


## 2. Analysis

To improve readability and limit the size of the notebook we make use of some subroutines in the [official repository](https://github.com/gcorso/neural_seed) for the research project. The code in the notebook is our best effort to convey the promising application of hyperbolic geometry to this novel research direction and how `geomstats` helps to achieve it.

Install and import the required packages. 

In [None]:
!pip3 install geomstats
!apt install clustalw
!pip install biopython
!pip install python-Levenshtein
!pip install Cython
!pip install networkx
!pip install tqdm
!pip install gdown
!pip install torch==1.7.1+cu101 torchvision==0.8.2+cu101 torchaudio==0.7.2 -f https://download.pytorch.org/whl/torch_stable.html
!git clone https://github.com/gcorso/neural_seed.git
import os
os.chdir("neural_seed")
!cd hierarchical_clustering/relaxed/mst; python setup.py build_ext --inplace; cd ../unionfind; python setup.py build_ext --inplace; cd ..; cd ..; cd ..;
os.environ['GEOMSTATS_BACKEND'] = 'pytorch'

In [None]:
import torch
import os 
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import time
from geomstats.geometry.poincare_ball import PoincareBall

from edit_distance.train import load_edit_distance_dataset
from util.data_handling.data_loader import get_dataloaders
from util.ml_and_math.loss_functions import AverageMeter

INFO: Using pytorch backend


### Dataset description

As microbiome analysis is one of the most critical applications where the methods presented could be applied, we chose to use a dataset containing a portion of the 16S rRNA gene widely used in the biological literature to analyse microbiome diversity. Qiita (Clemente et al. 2015) contains more than 6M sequences of up to 152 bp that cover the V4 hyper-variable region collected from skin, saliva and faeces samples of uncontacted Amerindians. The full dataset can be found on the [European Nucleotide Archive](https://www.ebi.ac.uk/ena/browser/text-search?query=ERP008799), but, in this notebook, we will only use a subset of a few tens of thousands that have been preprocessed and labelled with pairwise distances. We also provide results on the RT988 dataset (Zheng et al. 2019), another dataset of 16S rRNA that contains slightly longer sequences (up to 465 bp).

In [None]:
!gdown --id 1yZTOYrnYdW9qRrwHSO5eRc8rYIPEVtY2 # for edit distance approximation
!gdown --id 1hQSHR-oeuS9bDVE6ABHS0SoI4xk3zPnB # for closest string retrieval
!gdown --id 1ukvUI6gUTbcBZEzTVDpskrX8e6EHqVQg # for hierarchical clustering

Downloading...
From: https://drive.google.com/uc?id=1yZTOYrnYdW9qRrwHSO5eRc8rYIPEVtY2
To: /content/string_matching/edit_qiita_large.pkl
218MB [00:03, 68.6MB/s]
Downloading...
From: https://drive.google.com/uc?id=1hQSHR-oeuS9bDVE6ABHS0SoI4xk3zPnB
To: /content/string_matching/closest_qiita_large.pkl
2.44MB [00:00, 38.3MB/s]
Downloading...
From: https://drive.google.com/uc?id=1ukvUI6gUTbcBZEzTVDpskrX8e6EHqVQg
To: /content/string_matching/hc_qiita_large_extr.pkl
806MB [00:08, 90.6MB/s]


### Edit distance approximation

**Edit distance** The task of finding the distance or similarity between two strings, and the related task of global alignment, lie at the foundation of bioinformatics. Due to its resemblance to the biological mutation process, the edit distance and its variants are typically used to measure similarity between sequences. Given two strings $s_1$ and $s_2$, their edit distance $ED(s_1, s_2)$ is defined as the minimum number of insertions, deletions or substitutions needed to transform $s_1$ into $s_2$. We always deal with the classical edit distance, where the same weight is given to every operation; however, all the approaches developed can be applied to any distance function of choice.
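To make the definition concrete, here is a minimal sketch of the classical unit-cost edit distance computed by dynamic programming. It is an illustration in plain Python, not the optimised routine used to label the datasets (the notebook installs `python-Levenshtein` for that purpose); the function name `edit_distance` is our own.

```python
def edit_distance(s1: str, s2: str) -> int:
    """Classical unit-cost edit distance, O(len(s1) * len(s2)) time."""
    # previous[j] = distance between the prefix of s1 processed so far and s2[:j]
    previous = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, start=1):
        current = [i]  # turning s1[:i] into the empty string takes i deletions
        for j, c2 in enumerate(s2, start=1):
            substitution = previous[j - 1] + (c1 != c2)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1]


print(edit_distance("ACGT", "AGT"))        # 1 (one deletion)
print(edit_distance("kitten", "sitting"))  # 3
```

The quadratic cost in the sequence length is what makes exact pairwise distance matrices expensive on large datasets.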

**Task and loss function** As represented in Figure 3, the task is to learn an encoding function $f$ such that given any pair of sequences from the domain of interest $s_1$ and $s_2$:
\begin{equation}ED(s_1, s_2) \approx n \; d(f(s_1), f(s_2)) \end{equation}

where $n$ is the maximum sequence length and $d$ is a distance function over the vector space. In practice this is enforced in the model by minimising the mean squared error between the actual and the predicted edit distance. To make the results more interpretable and comparable across different datasets, we report results using \% RMSE defined as:
\begin{equation}
\text{\% RMSE}(f, S) = \frac{100}{n} \, \sqrt{L(f, S)} = \frac{100}{n} \, \sqrt{\frac{1}{|S|^2} \sum_{s_1, s_2 \in S} (ED(s_1, s_2) - n \; d(f(s_1), f(s_2)))^2}
\end{equation}

which can be interpreted as an approximate average error in the distance prediction as a percentage of the size of the sequences.
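As a small worked example of this metric, the sketch below computes the % RMSE from a list of true edit distances and the corresponding embedding distances. It takes $L$ to be the mean squared error over the evaluated pairs; the function name `percent_rmse` and the toy numbers are our own.

```python
import math


def percent_rmse(true_distances, embedding_distances, max_length):
    """% RMSE between edit distances and embedding distances rescaled by max_length."""
    squared_errors = [
        (ed - max_length * d) ** 2
        for ed, d in zip(true_distances, embedding_distances)
    ]
    mse = sum(squared_errors) / len(squared_errors)
    return 100.0 / max_length * math.sqrt(mse)


# two pairs, n = 100: the first prediction is off by 2 edits, the second is exact
print(percent_rmse([12, 20], [0.10, 0.20], max_length=100))  # about 1.41
```

If the loss is computed on edit distances already divided by $n$ (an assumption about how the labels are normalised), the same quantity reduces to $100 \sqrt{\text{MSE}}$; for instance, an MSE of 0.00068 corresponds to roughly 2.6.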


In this notebook, we only show the code to run a simple linear layer on the sequence which, in the hyperbolic space, already gives particularly good results. Later we will also report results for more complex models whose implementation can be found in the [Neural SEED repository](https://github.com/gcorso/neural_seed).
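The linear encoder expects each sequence as a matrix with one row per position and one column per alphabet symbol, which it then flattens. The sketch below shows one way such an input could be built. The helper `one_hot_encode` is hypothetical, and the zero-padding of short sequences is an assumption; the repository's data loader may handle padding differently.

```python
ALPHABET = "ACGT"


def one_hot_encode(sequence, length, alphabet=ALPHABET):
    """Encode a sequence as `length` rows of len(alphabet) values.

    Rows beyond the end of the sequence are left as all zeros (assumed padding).
    """
    if len(sequence) > length:
        raise ValueError("sequence longer than the target length")
    index = {symbol: i for i, symbol in enumerate(alphabet)}
    rows = [[0.0] * len(alphabet) for _ in range(length)]
    for position, symbol in enumerate(sequence):
        rows[position][index[symbol]] = 1.0
    return rows


encoded = one_hot_encode("ACG", length=5)
flattened = [value for row in encoded for value in row]
print(len(flattened))  # 20 = 5 positions * 4 symbols
```

With a maximum length of 152 and four nucleotides, the flattened vector has 608 entries, which matches `alphabet_size * len_sequence` in the input layer of the linear model.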

In [None]:
class LinearEncoder(nn.Module):
    """  Linear model which simply flattens the sequence and applies a linear transformation. """

    def __init__(self, len_sequence, embedding_size, alphabet_size=4):
        super(LinearEncoder, self).__init__()
        self.encoder = nn.Linear(in_features=alphabet_size * len_sequence, 
                                 out_features=embedding_size)

    def forward(self, sequence):
        # flatten sequence and apply layer
        B = sequence.shape[0]
        sequence = sequence.reshape(B, -1)
        emb = self.encoder(sequence)
        return emb


class PairEmbeddingDistance(nn.Module):
    """ Wrapper model for a general encoder, computes pairwise distances and applies projections """

    def __init__(self, embedding_model, embedding_size, scaling=False):
        super(PairEmbeddingDistance, self).__init__()
        self.hyperbolic_metric = PoincareBall(embedding_size).metric.dist
        self.embedding_model = embedding_model
        self.radius = nn.Parameter(torch.Tensor([1e-2]), requires_grad=True)
        self.scaling = nn.Parameter(torch.Tensor([1.]), requires_grad=True)

    def normalize_embeddings(self, embeddings):
        """ Project embeddings to a hypersphere of a certain radius """
        min_scale = 1e-7
        max_scale = 1 - 1e-3
        return F.normalize(embeddings, p=2, dim=1) * self.radius.clamp_min(min_scale).clamp_max(max_scale)

    def encode(self, sequence):
        """ Use embedding model and normalization to encode some sequences. """
        enc_sequence = self.embedding_model(sequence)
        enc_sequence = self.normalize_embeddings(enc_sequence)
        return enc_sequence

    def forward(self, sequence):
        # flatten couples
        (B, _, N, _) = sequence.shape
        sequence = sequence.reshape(2 * B, N, -1)

        # encode sequences
        enc_sequence = self.encode(sequence)

        # compute distances
        enc_sequence = enc_sequence.reshape(B, 2, -1)
        distance = self.hyperbolic_metric(enc_sequence[:, 0], enc_sequence[:, 1])
        distance = distance * self.scaling

        return distance

General training and evaluation routines used to train the models:

In [None]:

def train(model, loader, optimizer, loss, device):
    avg_loss = AverageMeter()
    model.train()

    for sequences, labels in loader:
        # move examples to right device
        sequences, labels = sequences.to(device), labels.to(device)

        # forward propagation
        optimizer.zero_grad()
        output = model(sequences)

        # loss and backpropagation
        loss_train = loss(output, labels)
        loss_train.backward()
        optimizer.step()

        # keep track of average loss
        avg_loss.update(loss_train.data.item(), sequences.shape[0])

    return avg_loss.avg


def test(model, loader, loss, device):
    avg_loss = AverageMeter()
    model.eval()

    # evaluation does not need gradients, so disable autograd to save memory and time
    with torch.no_grad():
        for sequences, labels in loader:
            # move examples to right device
            sequences, labels = sequences.to(device), labels.to(device)

            # forward propagation and loss computation
            output = model(sequences)
            loss_val = loss(output, labels).item()
            avg_loss.update(loss_val, sequences.shape[0])

    return avg_loss.avg

The linear model is trained on 7000 sequences (+700 of validation) and tested on 1500 different sequences: 

In [None]:
EMBEDDING_SIZE = 128

device = 'cuda' if torch.cuda.is_available() else 'cpu'
torch.manual_seed(2021)
if device == 'cuda':
    torch.cuda.manual_seed(2021)

# load data
datasets = load_edit_distance_dataset('./edit_qiita_large.pkl')
loaders = get_dataloaders(datasets, batch_size=128, workers=1)

# model, optimizer and loss
encoder = LinearEncoder(152, EMBEDDING_SIZE)
model = PairEmbeddingDistance(embedding_model=encoder, embedding_size=EMBEDDING_SIZE)
model.to(device)

optimizer = optim.Adam(model.parameters(), lr=0.001)
loss = nn.MSELoss()

# training
for epoch in range(0, 51):
    t = time.time()
    loss_train = train(model, loaders['train'], optimizer, loss, device)
    loss_val = test(model, loaders['val'], loss, device)

    # print progress every 10 epochs
    if epoch % 10 == 0:
        print('Epoch: {:02d}'.format(epoch),
              'loss_train: {:.6f}'.format(loss_train),
              'loss_val: {:.6f}'.format(loss_val),
              'time: {:.4f}s'.format(time.time() - t))
      
# testing
for dset in loaders.keys():
    avg_loss = test(model, loaders[dset], loss, device)
    print('Final results {}: loss = {:.6f}'.format(dset, avg_loss))


Epoch: 00 loss_train: 0.021631 loss_val: 0.000938 time: 30.0685s
Epoch: 10 loss_train: 0.000500 loss_val: 0.000741 time: 28.9802s
Epoch: 20 loss_train: 0.000444 loss_val: 0.000728 time: 28.9375s
Epoch: 30 loss_train: 0.000422 loss_val: 0.000682 time: 28.9818s
Epoch: 40 loss_train: 0.000406 loss_val: 0.000653 time: 28.8732s
Epoch: 50 loss_train: 0.000401 loss_val: 0.000655 time: 29.5673s
Final results train: loss = 0.000397
Final results val: loss = 0.000655
Final results test: loss = 0.000680


Therefore, after only 50 epochs our linear model reaches a $\% RMSE \approx 2.6$ which, as we will see, is significantly better than any data-independent baseline. 

### Closest string retrieval

This task consists of finding the sequence that is closest to a given query among a large number of reference sequences and is very commonly used to classify sequences. Given a set of reference strings $R$ and a set of queries $Q$, the task is to identify the string $r_q \in R$ that minimises $ED(r_q, q)$ for each $q \in Q$. This task is performed in an unsupervised setting using models trained for edit distance approximation. Therefore, given a pretrained encoder $f$, its prediction is taken to be the string $r_q \in R$ that minimises $d(f(r_q), f(q))$ for each $q \in Q$. This allows for sublinear retrieval (via locality-sensitive hashing or other data structures) which is critical in real-world applications where databases can have billions of reference sequences. As performance measures, we report the top-1, top-5 and top-10 scores, where top-$k$ indicates the percentage of times the model ranks the closest string within its top-$k$ predictions.
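The top-$k$ score described above can be sketched as follows. This is a brute-force illustration with a linear scan over the references, not the sublinear retrieval mentioned in the text and not the code behind `closest_string_testing`; the function names, the Euclidean default and the toy embeddings are our own.

```python
import math


def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def top_k_accuracy(query_embs, reference_embs, true_closest, ks=(1, 5, 10), dist=euclidean):
    """Fraction of queries whose true closest reference is among the k nearest embeddings."""
    hits = {k: 0 for k in ks}
    for query, target in zip(query_embs, true_closest):
        # rank every reference by its embedding distance to the query
        ranking = sorted(range(len(reference_embs)),
                         key=lambda r: dist(query, reference_embs[r]))
        for k in ks:
            hits[k] += target in ranking[:k]
    return {k: hits[k] / len(query_embs) for k in ks}


references = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [3.0, 3.0]]
queries = [[0.9, 0.1], [0.1, 1.2]]
true_closest = [1, 0]  # index of the reference with the smallest edit distance to each query
print(top_k_accuracy(queries, references, true_closest, ks=(1, 2)))  # {1: 0.5, 2: 1.0}
```

Any distance over the embedding space can be passed as `dist`, so the same routine applies to the hyperbolic distance used in this notebook.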

In [None]:
from closest_string.test import closest_string_testing

closest_string_testing(encoder_model=model, data_path='./closest_qiita_large.pkl',
                       batch_size=128, device=device, distance='hyperbolic')

Results: accuracy 0.441 0.537 0.583 0.620 0.642 0.660 0.677 0.698 0.712 0.725
Top1: 0.441  Top5: 0.642  Top10: 0.725
Total time elapsed: 0.4419s


Evaluated on a dataset composed of 1000 reference and 1000 query sequences (disjoint from the edit distance training set) the simple model we trained is capable of detecting the closest sequence correctly 44\% of the time and in approximately 3/4 of the cases it places the real closest sequence in its top-10 choices.


### Hierarchical clustering

Hierarchical clustering (HC) consists of constructing a hierarchy over clusters of data by defining a tree with internal points corresponding to clusters and leaves to datapoints. The goodness of the tree can be measured using Dasgupta's cost (Dasgupta 2016).
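For reference, Dasgupta's cost of a tree $T$ over a set of points with pairwise similarities $w_{ij}$ is $\sum_{i<j} w_{ij} \, |\text{leaves}(T[i \vee j])|$, where $T[i \vee j]$ is the subtree rooted at the lowest common ancestor of $i$ and $j$; lower is better. The sketch below evaluates it for a tree written as nested tuples. The representation and the toy similarity matrix are our own, and how the notebook converts distances into similarities is not shown here.

```python
def dasgupta_cost(tree, similarity):
    """Dasgupta's cost of a tree given as nested tuples with integer leaves."""

    def visit(node):
        # returns (leaves below node, cost accumulated inside the subtree)
        if isinstance(node, int):
            return [node], 0.0
        groups, cost = [], 0.0
        for child in node:
            leaves, child_cost = visit(child)
            groups.append(leaves)
            cost += child_cost
        size = sum(len(group) for group in groups)
        # pairs split between two children have this node as lowest common ancestor
        for a in range(len(groups)):
            for b in range(a + 1, len(groups)):
                for i in groups[a]:
                    for j in groups[b]:
                        cost += similarity[i][j] * size
        return [leaf for group in groups for leaf in group], cost

    return visit(tree)[1]


similarity = [
    [0.0, 1.0, 0.1, 0.1],
    [1.0, 0.0, 0.1, 0.1],
    [0.1, 0.1, 0.0, 1.0],
    [0.1, 0.1, 1.0, 0.0],
]
print(dasgupta_cost(((0, 1), (2, 3)), similarity))  # about 5.6: similar leaves merged early
print(dasgupta_cost(((0, 2), (1, 3)), similarity))  # about 9.2: similar leaves split at the root
```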

One simple approach to use Neural SEED to speed up hierarchical clustering is similar to the one adopted in the previous section: estimate the pairwise distance matrix with a model pretrained for *edit distance approximation* and then use the matrix as the basis for classical agglomerative clustering algorithms (e.g. Single, Average and Complete Linkage). The computational cost to generate the matrix goes from $O(N^2M^2)$ to $O(N(M+N))$ and by using optimisations like locality-sensitive hashing the clustering itself can be accelerated.
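To illustrate the second step, here is a deliberately naive sketch of Average Linkage run on a precomputed distance matrix, such as one estimated by a pretrained encoder. It is written for clarity rather than speed and is not the implementation used in the repository; the function name and the toy matrix are our own.

```python
def average_linkage(distance):
    """Naive average-linkage clustering on a full symmetric distance matrix.

    Returns the hierarchy as nested tuples whose leaves are row indices.
    """
    # each cluster is a pair (subtree, list of member leaves)
    clusters = [(i, [i]) for i in range(len(distance))]

    def linkage(a, b):
        # mean pairwise distance between the members of two clusters
        return sum(distance[i][j] for i in a[1] for j in b[1]) / (len(a[1]) * len(b[1]))

    while len(clusters) > 1:
        # find the two clusters with the smallest average distance
        x, y = min(
            ((p, q) for p in range(len(clusters)) for q in range(p + 1, len(clusters))),
            key=lambda pq: linkage(clusters[pq[0]], clusters[pq[1]]),
        )
        merged = ((clusters[x][0], clusters[y][0]), clusters[x][1] + clusters[y][1])
        clusters = [c for idx, c in enumerate(clusters) if idx not in (x, y)] + [merged]
    return clusters[0][0]


# four points on a line at positions 0, 1, 10 and 12
positions = [0, 1, 10, 12]
distance = [[abs(a - b) for b in positions] for a in positions]
print(average_linkage(distance))  # ((0, 1), (2, 3))
```

Single and Complete Linkage follow the same loop with the mean in `linkage` replaced by the minimum or the maximum.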

The following code computes the pairwise distance matrix and then runs a series of agglomerative clustering heuristics (Single, Average, Complete and Ward Linkage) on it.

In [None]:
from hierarchical_clustering.unsupervised.unsupervised import hierarchical_clustering_testing

hierarchical_clustering_testing(encoder_model=model, data_path='./hc_qiita_large_extr.pkl',
                                batch_size=128, device=device, distance='hyperbolic')

Hierarchical torch.Size([10000, 152])
{'single': {'DC': 335039492911.7103}, 'complete': {'DC': 334477330814.3947}, 'average': {'DC': 333673018932.05225}, 'ward': {'DC': 334313576211.0263}}


An alternative approach that we propose for hierarchical clustering uses the continuous relaxation of Dasgupta's cost (Chami et al. 2020) to embed sequences in the hyperbolic space. In comparison to Chami et al. (2020), we show that it is possible to significantly decrease the number of pairwise distances required by directly mapping the sequences. 
This considerably speeds up the construction, especially when dealing with a large number of sequences, without requiring any pretrained model. Figure 1 shows an example of this approach applied to a small dataset of proteins; the code for it is in the Neural SEED repository.

### Multiple Sequence Alignment

Multiple Sequence Alignment (MSA) is another very common task in bioinformatics and there are several ways of using Neural SEED to accelerate its heuristics. The most commonly used programs, such as the Clustal series and MUSCLE, first estimate a phylogenetic tree from the pairwise distances; this guide tree then drives a progressive alignment phase.

When running the Clustal algorithm for MSA on a subset of 1200 sequences from RT988, the construction of the distance matrix and the tree takes 99\% of the total running time (the rest takes 24s out of 35 minutes). Therefore, one obvious improvement that Neural SEED can bring is to speed up this phase using the hierarchical clustering techniques seen in the previous section. 

The following code uses the model pretrained for edit distance to approximate the neighbour joining tree construction and then runs clustalw using that guide tree:

In [None]:
from multiple_alignment.guide_tree.guide_tree import approximate_guide_trees

# performs neighbour joining algorithm on the estimate of the pairwise distance matrix
approximate_guide_trees(encoder_model=model, dataset=datasets['test'],
                        batch_size=128, device=device, distance='hyperbolic')

# Command line clustalw using the tree generated with the previous command. 
# The substitution matrix and gap penalties are set to simulate the classical edit distance used to train the model 
!clustalw -infile="sequences.fasta" -dnamatrix=multiple_alignment/guide_tree/matrix.txt -transweight=0 -type='DNA' -gapopen=1 -gapext=1 -gapdist=10000 -usetree='njtree.dnd'  | grep 'Alignment Score'

Alignment Score -12042858



An alternative method we propose for MSA uses an autoencoder to convert the Steiner string approximation problem into a continuous optimisation task. More details on this can be found in our paper and repository.


### Central role of Geomstats

As we will show in the next section, the choice of the geometry of the embedding space is critical with Neural SEED. `geomstats` provides the geometric functions required for the methods presented (e.g. `metric.dist`, `metric.dist_pairwise` and `metric.closest_neighbor_index`) for the Poincaré ball used in this notebook. Moreover, the large variety of other geometric spaces it offers (with the same interface) makes it quick to experiment and find the most appropriate one for every application domain of this approach.

Finally, Neural SEED provides representations that are better suited for human interaction than classical bioinformatics algorithms. While the functioning of the encoder might be hard to interpret, the embeddings produced can be plotted (directly or after dimensionality reduction) and intuitively interpreted as continuous space where distance reflects evolutionary separation. `geomstats` could be of critical importance also on this front.

`giotto-tda` was not used for this project.

## 3. Benchmark

In this section, we compare the Neural SEED approach to classical baseline alignment-free approaches such as k-mer and contrast the performance of neural models with different architectures and on different geometric spaces.
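As a reminder of what a k-mer baseline looks like, the sketch below maps a sequence to its vector of k-mer counts, so that a vector distance can stand in for the edit distance. The embedding has one coordinate per possible k-mer, i.e. dimension $4^k$ for DNA. The function names and the choice of the L1 distance are our own assumptions; the exact configuration of the baselines is described in the paper and repository.

```python
from collections import Counter
from itertools import product


def kmer_embedding(sequence, k=3, alphabet="ACGT"):
    """Vector of k-mer counts, one coordinate per possible k-mer (len(alphabet) ** k)."""
    counts = Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))
    return [counts["".join(kmer)] for kmer in product(alphabet, repeat=k)]


def l1_distance(u, v):
    return sum(abs(a - b) for a, b in zip(u, v))


a = kmer_embedding("ACGT", k=2)  # contains AC, CG, GT
b = kmer_embedding("ACGA", k=2)  # contains AC, CG, GA
print(len(a), l1_distance(a, b))  # 16 2
```

Being data-independent, this embedding requires no training, but it also cannot adapt to the distribution of the sequences being analysed.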

### Edit distance approximation

![Table of results](https://raw.githubusercontent.com/gcorso/neural_seed/master/tutorial/edit_real.PNG)

Figure 4: \% RMSE test set results on the Qiita and RT988 datasets. The first five models are the k-mer baselines and, in parentheses, we indicate the dimension of the embedding space. The remaining are encoder models trained with the Neural SEED framework and they all have an embedding space dimension equal to 128. - indicates that the model did not converge.

Figure 4 highlights the advantage provided by data-dependent methods when compared to the data-independent baseline approaches. Moreover, the results show that it is critical for the geometry of the embedding space to reflect the structure of the low-dimensional manifold on which the data lies. In these biological datasets, there is an implicit hierarchical structure given by the evolution process, which is well reflected by the *hyperbolic* space. Thanks to this close correspondence, even relatively simple models like the linear regression and MLP perform very well with this distance function.

![Embedding dimension results](https://raw.githubusercontent.com/gcorso/neural_seed/master/tutorial/edit_dimension.png)

Figure 5: \% RMSE on Qiita dataset for a Transformer with different distance functions.

The benefit of the hyperbolic space is evident when analysing the dimension required for the embedding space (Figure 5). In these experiments, we take the Transformer model tuned on the Qiita dataset with an embedding size of 128 and run it on a range of embedding dimensions. The hyperbolic space provides significantly more efficient embeddings: the model reaches the 'elbow' at dimension 32 and needs only 4 to 16 dimensions to match the performance that the other spaces achieve with 128. Given that the space needed to store the embeddings and the time to compute distances between them scale linearly with the dimension, this would provide a significant improvement in downstream tasks over other Neural SEED approaches.

**Running time** A critical step behind most of the algorithms analysed in this notebook is the computation of the pairwise distance matrix of a set of sequences. Taking as an example the RT988 dataset (6700 sequences of length up to 465 bases), optimised C code computes approximately 2700 pairwise distances per second on a CPU and takes 2.5 hours for the whole matrix. In comparison, using a trained Neural SEED model, the same matrix can be approximated in 0.3-3s (a similar value holds for the k-mer baseline) on the same CPU. The computational complexity for $N$ sequences of length $M$ is reduced from $O(N^2\; M^2)$ to $O(N(M + N))$ (assuming the model is linear w.r.t. the length and the embedding size is constant). The training process typically takes 0.5-3 hours on a GPU. However, in applications such as microbiome analysis, biologists typically analyse data coming from the same distribution (e.g. the 16S rRNA gene) for multiple individuals, so the initial cost would be significantly amortised.

### Closest string retrieval

Figure 6 shows that also in this task the data-dependent models outperform the baselines even when these operate on larger spaces. In terms of distance function, the *cosine* distance achieves performances on par with the *hyperbolic*. This can be explained by the fact that for a set of points on the same hypersphere, the ones with the smallest *cosine* or *hyperbolic* distance are the same. So the *cosine* distance is capable of providing good orderings of sequence similarity but inferior approximations of their distance.

![Closest string retrieval table](https://raw.githubusercontent.com/gcorso/neural_seed/master/tutorial/closest_real.png)

Figure 6: Accuracy of different models in the *closest string retrieval* task on the Qiita dataset.
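The equivalence between the *cosine* and *hyperbolic* orderings on a common hypersphere can be checked numerically. The sketch below uses the standard closed form of the Poincaré ball distance for curvature $-1$, $d(x, y) = \operatorname{arcosh}\left(1 + \frac{2 \lVert x - y \rVert^2}{(1 - \lVert x \rVert^2)(1 - \lVert y \rVert^2)}\right)$, written in plain Python; we assume it corresponds to what `PoincareBall(...).metric.dist` computes. When all points share the same norm $r$, $\lVert x - y \rVert^2 = 2 r^2 (1 - \cos\theta)$, so the hyperbolic distance is an increasing function of the cosine distance. The helper names and toy vectors are our own.

```python
import math


def squared_norm(v):
    return sum(a * a for a in v)


def poincare_distance(x, y):
    """Geodesic distance in the Poincare ball of curvature -1 (norms must be < 1)."""
    sq_diff = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.acosh(1 + 2 * sq_diff / ((1 - squared_norm(x)) * (1 - squared_norm(y))))


def cosine_distance(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    return 1 - dot / math.sqrt(squared_norm(x) * squared_norm(y))


def on_sphere(v, radius):
    """Rescale v to the given norm, mirroring the normalisation step of the model."""
    norm = math.sqrt(squared_norm(v))
    return [radius * a / norm for a in v]


radius = 0.5
query = on_sphere([1.0, 0.2, 0.0], radius)
references = [on_sphere(v, radius)
              for v in ([1.0, 0.0, 0.3], [0.0, 1.0, 0.0], [-1.0, 0.1, 0.2], [0.7, 0.7, 0.0])]

by_hyperbolic = sorted(range(len(references)),
                       key=lambda i: poincare_distance(query, references[i]))
by_cosine = sorted(range(len(references)),
                   key=lambda i: cosine_distance(query, references[i]))
print(by_hyperbolic, by_cosine)  # both orderings are [0, 3, 1, 2]
```

The rankings coincide, while the distance values themselves differ, which is consistent with the cosine distance retrieving well but approximating the edit distance less accurately.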

### Hierarchical clustering

The results (Figure 7) show that the difference in performance between the most expressive models and the ground truth distances is not statistically significant. The *hyperbolic* space achieves the best performance and, although the relative difference between the methods is not large in terms of percentage Dasgupta's cost (but still statistically significant), it results in a large performance gap when these trees are used for tasks such as MSA. The total CPU time taken to construct the tree is reduced from more than 30 minutes to less than one in this dataset and the difference is significantly larger when scaling to datasets of more and longer sequences.

![Unsupervised HC table](https://raw.githubusercontent.com/gcorso/neural_seed/master/tutorial/hc_average.png)

Figure 7: Average Linkage \% increase in Dasgupta's cost of Neural SEED models compared to the performance of clustering on the ground truth distances. Average Linkage was the best performing clustering heuristic across all models.

### Multiple Sequence Alignment


The results reported in Figure 8 show that the alignment scores obtained when using the Neural SEED heuristics with models such as GAT are not statistically different from those obtained with the ground truth distances. Most of the models show a relatively large variance in performance across different runs. This has positive and negative consequences: the alignment obtained using a single run may not be very accurate, but, by training an ensemble of models and applying each of them, we are likely to obtain a significantly better alignment than the one from the ground truth matrix while still only taking a fraction of the time. 

![Unsupervised MSA table](https://raw.githubusercontent.com/gcorso/neural_seed/master/tutorial/msa_guide_table.png)

Figure 8: Percentage change in the alignment cost (the negative of the alignment score) returned by Clustal when using the heuristics to generate the tree, as opposed to using NJ on real distances. The alignment was done on 1.2k unseen sequences from the RT988 dataset. 


## 4. Limitations and perspectives

### Limitations of the method presented

As mentioned in the introduction, we believe that the Neural SEED framework has the potential to be applied to numerous problems and, therefore, this project constitutes only an initial analysis of its geometrical properties and applications. Below we list some of the limitations of the current analysis and potential directions of research to cover them.

**Type of sequences** Both the datasets analysed consist of sequence reads of the same part of the genome. This is a very common set-up for sequence analysis (for example for microbiome analysis) and it is enabled by biotechnologies that can amplify and sequence certain parts of the genome selectively, but it is not ubiquitous. Shotgun metagenomics consists of sequencing random parts of the genome. This would, we believe, generate sequences lying on a low-dimensional manifold where the hierarchical relationship of evolution is combined with the relationship based on the specific position in the whole genome. Therefore, more complex geometries such as product spaces might be best suited.

**Type of labels** In this project, we work with edit distances between strings. These are very expensive to compute when large-scale analysis is required, but it is feasible to produce several thousand exact pairwise distance values from which the model can learn. For different definitions of distance, however, this might not be the case. If it is only feasible to determine which sequences are closest, then encoders can be trained using a triplet loss and most of the approaches presented would still apply. Future work could explore the robustness of this framework to inexact estimates of the distances as labels and whether Neural SEED models, once trained, could provide more accurate predictions than their labels. 
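For readers unfamiliar with the triplet loss mentioned above, a minimal sketch follows. Given an anchor, a sequence known to be closer to it (positive) and one known to be farther (negative), the loss is zero once the embedding distances respect that ordering by at least a margin. The function names, the Euclidean default and the margin value are illustrative assumptions; this loss was not used in the experiments reported here.

```python
import math


def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_loss(anchor, positive, negative, margin=0.1, dist=euclidean):
    """Hinge loss pushing dist(anchor, positive) below dist(anchor, negative) by `margin`."""
    return max(0.0, dist(anchor, positive) - dist(anchor, negative) + margin)


anchor, near, far = [0.0, 0.0], [0.1, 0.0], [1.0, 0.0]
print(triplet_loss(anchor, near, far))  # 0.0, the ordering is already respected
print(triplet_loss(anchor, far, near))  # about 1.0, the ordering is violated
```

Only the relative order of distances is supervised, so exact distance labels are not required.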

**Architectures** Throughout the project we used models that have been shown to work well for other types of sequences and tasks. However, the correct inductive biases that models should have to perform SEED are likely to be different to the ones used for other tasks and may even depend on the type of distance they try to preserve. Moreover, the capacity of the hyperbolic space could be further exploited using models that directly operate in the hyperbolic space (Peng et al. 2021). 

**Self-supervised embeddings** One potential application of Neural SEED that was not explored in this project is the direct use of the embedding produced by Neural SEED for downstream tasks. This would enable the use of a wide range of geometric data processing tools for the analysis of biological sequences. 


### Limitations of Geomstats

We did not find a way to use both `pytorch` and `numpy` backends *at the same time*, while in some cases in our experience it would have been useful.



### Proposed features of Geomstats

The documentation for the various manifolds could be improved with more detailed descriptions and an index with short summaries. Moreover, there are some classes like `PoincareBall` that we found in tutorials, but we could not find in the documentation. Finally, it would be great if more complex geometries such as product spaces were included.


## References

The preprint detailing all the approaches and related work will be available soon.

(Morgan & Huttenhower, 2012) Xochitl C Morgan and Curtis Huttenhower. Human microbiome analysis. PLoS Comput Biol, 2012.

(McDermott et al. 2021) Matthew McDermott, Brendan Yap, Peter Szolovits, and Marinka Zitnik. Rethinking relational encoding in language model: Pre-training for general sequences. arXiv preprint, 2021.

(Clemente et al. 2015) Jose C Clemente, Erica C Pehrsson, Martin J Blaser, Kuldip Sandhu, Zhan Gao, Bin Wang, Magda Magris, Glida Hidalgo, Monica Contreras, Oscar Noya-Alarcon, et al. The microbiome of uncontacted Amerindians. Science advances, 2015.

(Zheng et al. 2019) Wei Zheng, Le Yang, Robert J Genco, Jean Wactawski-Wende, Michael Buck, and Yijun Sun. Sense: Siamese neural network for sequence embedding and alignment free comparison. Bioinformatics, 2019.

(Dasgupta 2016) Sanjoy Dasgupta. A cost function for similarity-based hierarchical clustering. In Proceedings of the forty-eighth annual ACM symposium on Theory of Computing, 2016.

(Chami et al. 2020) Ines Chami, Albert Gu, Vaggos Chatziafratis, and Christopher Re. From trees to continuous embeddings and back: Hyperbolic hierarchical clustering. Advances in Neural Information Processing Systems 33, 2020.

