# Embedding via SeqVec
*Implementation by Klaus Baruffi, Fabio Pfaehler* <br>
*SeqVec embedder by mheinzinger* ([Repository](https://github.com/mheinzinger/SeqVec), [Fork by Rostlab](https://github.com/Rostlab/SeqVec))

Useful Resources:
- [Comparison of different embedders](https://www.mdpi.com/1422-0067/24/4/3775)

## 1) Setting up a conda environment and dependencies
1) Install Anaconda
2) Deactivate the base env
    - $conda deactivate
3) Create and activate a Python 3.8.18 conda env
    - $conda create --name myenv python=3.8.18
    - $conda activate myenv
4) $pip install seqvec
5) Install other dependencies (for which dependency fixes which problem, see the 'Error Encounter' notes below the respective notebook cells)
    - $pip install overrides==3.1.0
    - $pip install biopython (provides the `Bio` module)
    - *Also check pyproject.toml in the SeqVec GitHub repository*
    - *You may need to restart your Jupyter kernel after certain changes*
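
After installing, it can be useful to confirm which versions actually ended up in the environment. The following is a minimal sketch using only the standard library; the helper names `installed_version` and `check_pins` are hypothetical, and the pins are the ones mentioned in the notes of this notebook (adjust them to your setup).

```python
from importlib import metadata  # part of the standard library from Python 3.8 on

# Pins taken from the notes in this notebook; adjust if your setup differs.
EXPECTED = {"allennlp": "0.9.0", "overrides": "3.1.0"}


def installed_version(package):
    """Return the installed version string of `package`, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def check_pins(expected=EXPECTED):
    """Map each package name to a tuple (installed, wanted, matches)."""
    report = {}
    for package, wanted in expected.items():
        found = installed_version(package)
        report[package] = (found, wanted, found == wanted)
    return report


for package, (found, wanted, ok) in check_pins().items():
    status = "OK" if ok else "MISMATCH"
    print(f"{package}: installed={found} expected={wanted} {status}")
```

A `MISMATCH` line only reports a difference from the pins above; it does not by itself mean the notebook will fail.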

## 2) Choose input fasta file

In [None]:
# Test file for download (tiny_sampled.fasta)
# !wget http://data.bioembeddings.com/public/embeddings/notebooks/custom_data/tiny_sampled.fasta --output-document fasta_files/tiny_sampled.fasta

In [None]:
filepath = "fasta_files/tiny_sampled.fasta"
# filepath = "fasta_files/VOG00024.faa"
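
Before embedding, it may help to check that the chosen file is a readable FASTA file and to see how long its sequences are. The sketch below is a plain-Python stand-in that does not use Biopython; `read_fasta` and `summarize_fasta` are hypothetical helper names, and the parser only handles simple `>header` / sequence-line files.

```python
from pathlib import Path


def read_fasta(path):
    """Parse a simple FASTA file into a list of (header, sequence) tuples."""
    records, header, chunks = [], None, []
    for line in Path(path).read_text().splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def summarize_fasta(path):
    """Return the number of sequences and the shortest/longest sequence length."""
    lengths = [len(seq) for _, seq in read_fasta(path)]
    return {
        "n_sequences": len(lengths),
        "min_len": min(lengths, default=0),
        "max_len": max(lengths, default=0),
    }


filepath = "fasta_files/tiny_sampled.fasta"
if Path(filepath).is_file():
    print(summarize_fasta(filepath))
else:
    print(f"File not found: {filepath}")
```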

## 3) Generate Embeddings

SeqVec is a pretrained ELMo model trained on a large protein sequence database (UniProt/UniRef50). The model consists of two files, `options.json` and `weights.hdf5`, which can be downloaded and unzipped using the next two code cells.

In [None]:
# Download SeqVec model (ELMo model trained on UniRef50) as zip folder
# !wget https://rostlab.org/~deepppi/seqvec.zip --output-document seqvec.zip

In [None]:
# Unzip and store model files in a folder called 'SV_model' in your working directory (assuming you have unzip installed: $sudo apt-get install unzip)
# !unzip seqvec.zip -d .
# !mkdir -p SV_model
# !mv uniref50_v2/* SV_model/
# !rm -r uniref50_v2/ seqvec.zip # remove the now-empty folder and the zip archive
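
Since the model is loaded from exactly two files, a quick check that both are in place can save a confusing error later. This is a small sketch using only `pathlib`; the helper name `missing_model_files` is hypothetical, and the folder name `SV_model` is the one used in this notebook.

```python
from pathlib import Path


def missing_model_files(model_dir, required=("options.json", "weights.hdf5")):
    """Return the names of required model files that are absent from model_dir."""
    model_dir = Path(model_dir)
    return [name for name in required if not (model_dir / name).is_file()]


missing = missing_model_files("SV_model")
if missing:
    print("Missing model files:", ", ".join(missing))
else:
    print("Both model files found.")
```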

In [None]:
# Load pretrained model
from allennlp.commands.elmo import ElmoEmbedder
from pathlib import Path

model_dir = Path('SV_model/')
weights = model_dir / 'weights.hdf5'
options = model_dir / 'options.json'
embedder = ElmoEmbedder(options,weights, cuda_device=-1) # cuda_device=-1 for CPU, 0 for GPU (faster)

**Error Encounter:**
- Issue with allennlp
    1) allennlp should be version 0.9.0 and overrides version 3.1.0
    2) You may also need numpy 1.21
- CUDA capability of the GPU not compatible with the current PyTorch installation
    - For an NVIDIA GeForce RTX 3050 Laptop GPU with CUDA capability sm_86:
        1) $conda install pytorch==1.10.0 torchvision==0.11.0 torchaudio==0.10.0 cudatoolkit=11.3 -c pytorch -c conda-forge
        2) Check out PyTorch versions that match your GPU ([https://pytorch.org/](https://pytorch.org/))
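
One way to avoid hard-coding the `cuda_device` argument is to ask PyTorch whether a GPU is usable and fall back to the CPU otherwise. The helper below is a hypothetical sketch, not part of SeqVec or allennlp; it returns `-1` (CPU) when PyTorch is not installed or reports no available GPU, and `0` (first GPU) otherwise.

```python
def pick_cuda_device():
    """Return 0 if PyTorch reports an available GPU, otherwise -1 for CPU."""
    try:
        import torch
    except ImportError:
        # Without PyTorch there is nothing to query, so fall back to CPU
        return -1
    return 0 if torch.cuda.is_available() else -1


device = pick_cuda_device()
print("cuda_device =", device)
```

The result could then be passed as `cuda_device=device` when creating the `ElmoEmbedder`. Note that `torch.cuda.is_available()` does not guarantee that the installed PyTorch build supports the compute capability of your GPU, so the version notes above still apply.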

In [None]:
from Bio import SeqIO
import numpy as np

# Extract sequences from fasta file and store them as a list
sequences = []
for record in SeqIO.parse(filepath, "fasta"):
    sequences.append(list(record.seq))

# Sort the sequences by length in order to enhance performance
# (note: the order then no longer matches the order in the fasta file)
sequences.sort(key=len)

# Generate amino acid level embeddings
aa_embd = embedder.embed_sentences(sequences) # generator yielding one array of shape [3,L,1024] per sequence
aa_embd = list(aa_embd) # turn generator into list
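
Sorting the sequences changes their order relative to the fasta file, so the embeddings can no longer be matched to record identifiers by position. A possible way around this, sketched here with toy data, is to sort identifier/sequence pairs together. The helper `sort_by_length` and the toy records are hypothetical; in the notebook the pairs would come from `record.id` and `record.seq`.

```python
def sort_by_length(records):
    """Sort (identifier, sequence) pairs by sequence length, keeping each pair together."""
    ordered = sorted(records, key=lambda pair: len(pair[1]))
    ids = [identifier for identifier, _ in ordered]
    sequences = [list(sequence) for _, sequence in ordered]
    return ids, sequences


# Toy records standing in for (record.id, str(record.seq)) pairs from SeqIO.parse
toy_records = [("protA", "MKTAYIAK"), ("protB", "MK"), ("protC", "MKTA")]
ids, sequences = sort_by_length(toy_records)
print(ids)  # ['protB', 'protC', 'protA']
```

With this layout, `ids[i]` names the protein whose embedding ends up at position `i` of the embedding list.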

In [None]:
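import torch

# Each entry of aa_embd belongs to one protein and has shape [3,L,1024] (3 layers, L residues)
# Get 1024-dimensional embeddings for per-residue predictions (sum over the 3 layers):
residue_embd_SV = [torch.tensor(e).sum(dim=0) for e in aa_embd] # List of tensors with shape [L,1024]

# Get 1024-dimensional embeddings for per-protein predictions (mean over the residues):
protein_embd_SV = torch.stack([r.mean(dim=0) for r in residue_embd_SV]) # Tensor with shape [N,1024]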
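
The shape bookkeeping can be checked without loading the model. Each protein yields an array of shape [3, L, 1024]; summing over the first axis gives the per-residue embedding [L, 1024], and averaging that over the residues gives the per-protein embedding [1024]. The sketch below uses random NumPy arrays in place of real embeddings; `reduce_embedding` is a hypothetical helper name.

```python
import numpy as np


def reduce_embedding(embedding):
    """Reduce one [3, L, 1024] array to per-residue [L, 1024] and per-protein [1024]."""
    per_residue = embedding.sum(axis=0)      # sum over the 3 layers
    per_protein = per_residue.mean(axis=0)   # average over the L residues
    return per_residue, per_protein


# Random stand-ins for two proteins of different length (5 and 12 residues)
rng = np.random.default_rng(0)
fake_embeddings = [rng.normal(size=(3, length, 1024)) for length in (5, 12)]

for emb in fake_embeddings:
    per_residue, per_protein = reduce_embedding(emb)
    print(per_residue.shape, per_protein.shape)
```

Because proteins differ in length, the per-residue results are kept as a list, while the per-protein vectors all have the same size and can be stacked into one array.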