# Bio Embeddings for Shallow Learners
1) Choose a VOG
2) Choose a bio-embedding algorithm (SeqVec or ProtTrans) and create the embedding
    - Input: VOG-specific .fasta file
    - Store the sequences as objects in a list
    - Create an embedder object, then compute an amino-acid-level or protein-level embedding
    - Output: one embedding per sequence, i.e. every sequence is represented as a point in a multidimensional feature space (a vector of vectors). Unlike the raw protein sequences, which differ in length, the protein-level embeddings all have the same number of features, so the resulting embedding matrix serves as a profile of the VOG.
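
The step from variable-length sequences to a fixed-size profile can be sketched with NumPy alone. This is a minimal illustration, not the embedders' own code: the random matrices stand in for real per-residue embeddings, and `mean_pool` and `EMBEDDING_DIM` are hypothetical names chosen for this sketch.

```python
import numpy as np

EMBEDDING_DIM = 1024  # feature count per residue, as reported later in this notebook


def mean_pool(per_residue: np.ndarray) -> np.ndarray:
    """Average a (sequence_length, features) matrix over the residue axis."""
    return per_residue.mean(axis=0)


# Stand-in data (assumption): random matrices replace real embedder output.
rng = np.random.default_rng(seed=0)
lengths = [485, 460, 166]  # proteins of different lengths
per_residue = [rng.normal(size=(n, EMBEDDING_DIM)) for n in lengths]

# One fixed-length vector per protein, stacked into a profile matrix for the VOG.
profile = np.stack([mean_pool(e) for e in per_residue])
print(profile.shape)  # (3, 1024)
```

Whatever the sequence lengths are, every row of `profile` has the same number of columns, which is what a shallow learner needs as input.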

**Links:**
- [SeqVec Repository](https://github.com/Rostlab/SeqVec)
    - [SeqVec Source-Code](https://github.com/Rostlab/SeqVec/blob/master/seqvec/seqvec.py)
- [ProtTrans Repository](https://github.com/agemagician/ProtTrans)
- [Embedder Benchmark Paper](https://www.mdpi.com/1422-0067/24/4/3775)
    

## o) Install and Load Libraries

In [1]:
# !pip install seqvec

In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import h5py
from Bio import SeqIO
from bio_embeddings.embed import ProtTransBertBFDEmbedder
from bio_embeddings.embed.seqvec_embedder import SeqVecEmbedder
from bio_embeddings.project import tsne_reduce
from bio_embeddings.visualize import render_3D_scatter_plotly




## o) Generate Embedding: SeqVec
### Input: .fasta file | Output: array of embedding vectors

In [3]:
# filepath = "BE_testing/tiny_sampled.fasta"
# # !seqvec -i $filepath -o embeddings.npz
# !seqvec -i $filepath -o embeddings.npz --split-char ' ' --id 0
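
The command above writes its result to `embeddings.npz`. A minimal sketch for reading such an archive back into Python follows. It assumes that the archive maps one identifier to one embedding array, which we have not verified against the SeqVec CLI; `load_embeddings` is a hypothetical helper, and the demo archive is generated on the fly so the cell runs without SeqVec.

```python
import os
import tempfile

import numpy as np


def load_embeddings(npz_path):
    """Return {identifier: embedding array} for every entry of an .npz archive."""
    with np.load(npz_path) as archive:
        return {key: archive[key] for key in archive.files}


# Throwaway demo archive (assumption: placeholder names and zero-filled arrays).
demo = {
    "protA": np.zeros((5, 1024), dtype=np.float32),
    "protB": np.zeros((8, 1024), dtype=np.float32),
}
path = os.path.join(tempfile.mkdtemp(), "embeddings.npz")
np.savez(path, **demo)

for name, emb in load_embeddings(path).items():
    print(name, emb.shape)
```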


## o) Generate Bio-Embeddings (in progress)
see <embed_fasta_sequences.ipynb>

As we highlighted in the previous step, the dimensionality (i.e. the complexity) of our feature space is extraordinarily high, so we need to reduce the number of features. For this purpose we will use so-called protein embeddings or bio-embeddings to ...

In [4]:
# # Create (download) testing fasta file (tiny_sampled.fasta)
# !wget http://data.bioembeddings.com/public/embeddings/notebooks/custom_data/tiny_sampled.fasta --output-document BE_testing/tiny_sampled.fasta

In [None]:
# Extract sequences from fasta file and store them as a list
# filepath = "Fabio/BE_testing/tiny_sampled.fasta"
filepath = "BE_testing/VOG1_trial2/VOG00001.faa"
sequences = []
for record in SeqIO.parse(filepath, "fasta"):
    sequences.append(record)

# Sanity-check
print(f"Member-ID     Identifier\t\tLength\t    Sequence\n")
for i,s in enumerate(sequences):
    print(f"Protein {i+1:<6}{(s.id):<28}{len(s.seq):<10}{s.seq}") # :<6 for proper output alignment

Member-ID     Identifier		Length	    Sequence

Protein 1     1094892.YP_004894869.1      485       MFFTINNNSLKSSNNDGTNNTYIKIFNIQDIEQMYNLGTTLYIIKTNGQHNVNAYNKTMCTNQFEIISKYSLYDISTYEYFGLDITKNIYIVDFACKEGNIDFLNWWLSSGYDLKYTQIGMDNANRFKEIEILNWWVNSGLEMRYTYKSMDRASNLALFDILNWWITSGLPLKYSESSMNLDYFASLSEEDEIRVLTWWKESGLLLKYNHSAIDNASRKMHINVLNWWIESGLDICYSENSIDCITISNNTKILDWWLESGLKIKYTELAIDIASRYFSFDVLDWWLKSGLEFKYTSDSLDDINLDYFTNINQEEDEYEVIKLLNWWKSSGLELRYTNLCLDAMSEYGYIKVLDWFKNSELPILYSIEAIDCTDDINVLEWWIKSDLPLKYSCKAMNICTNIEILQWWISSGLTIKYNEHLIDGAFTNDRIDILEFMLSSGLPFIHSKYCVLGYMEKIPELSVKWWIKRELPKEYISNINSIDKN
Protein 2     1247379.YP_010779223.1      460       MLVYITNGRHKSIDVDGVYDSSNFLTAFNINEIKKYYCLGHQLHIINMKNNSSDIESEKKGDNKYIADDIPIEKIYNKFKIIESYSLYDINTYIKLGIDIRENNHIVDIASKEGNIDFLNWWISNDLDLKYSEKSLDDASDYGHIDVLNWWFHSNLPLRYSKNSTNFKLNIKLEVKDKIKVLDWWYNSGLDIIYDEGSMDNASNNCEREILDWWVNSGLKLKYSERSMDDASMFNKIDVLNWWFMSGLELKYSDIAIDYASKNQFINVLDWWKNSNLPLKYSNLSIDEVNEYYYLEDDKSYLNVLEWWKNSGLELKYSERCIDSFSRCGEIAGLEWWKNSGLELRYSECSIVDIVDPNILQWW

sm_86 pytorch compatibility issue:
- https://github.com/pytorch/pytorch/issues/45028
- \\wsl.localhost\Ubuntu\home\dinglemittens\anaconda3\envs\SPEnv38\lib\python3.8\site-packages\bio_embeddings\embed
- solved: conda install pytorch==1.10.0 torchvision==0.11.0 torchaudio==0.10.0 cudatoolkit=11.3 -c pytorch -c conda-forge
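
To see whether an installed PyTorch build ships kernels for the local GPU, the device's compute capability can be compared with the architectures the build was compiled for. The sketch below is a simplified check: `is_arch_supported` is a hypothetical helper, it ignores PTX forward compatibility (`compute_XX` entries), and the torch part only runs if PyTorch and a CUDA device are present.

```python
def is_arch_supported(capability, arch_list):
    """True if the build lists an exact sm_XY entry for a (major, minor) capability."""
    major, minor = capability
    return f"sm_{major}{minor}" in arch_list


try:
    import torch
except ImportError:  # keep the sketch runnable without PyTorch
    torch = None

if torch is not None and torch.cuda.is_available():
    capability = torch.cuda.get_device_capability(0)  # e.g. (8, 6) for sm_86
    arch_list = torch.cuda.get_arch_list()            # architectures in this build
    print("GPU:", torch.cuda.get_device_name(0))
    print("capability:", capability)
    print("build supports:", arch_list)
    print("exact match:", is_arch_supported(capability, arch_list))
else:
    print("PyTorch with CUDA is not available in this environment")
```

If there is no exact match, installing a PyTorch build compiled against a newer CUDA toolkit, as in the conda command above, is the usual remedy.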

About SeqVec:
- https://github.com/Rostlab/SeqVec/blob/master/README.md : "All results built upon the embeddings gained from the new tool SeqVec neither explicitly nor implicitly using evolutionary information. Nevertheless, it improved over some methods using such information. Where the lightning-fast HHblits needed on average about two minutes to generate the evolutionary information for a target protein, SeqVec created the vector representation on average in 0.03 seconds."

In [7]:
# Generate Embedder Object
# embedder = ProtTransBertBFDEmbedder()
embedder = SeqVecEmbedder()

Trouble with ProtTrans Bert:
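- Fabio: The amino-acid-level embedding via ProtTransBertBFDEmbedder() fell back to the CPU on my laptop, which slows it down significantly; interrupted after 40 min. The message blames insufficient GPU RAM, but "no kernel image is available" usually indicates that the installed PyTorch build has no kernels for the GPU's compute capability (see the sm_86 issue above). Error message: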
RuntimeError for sequence with 166 residues: CUDA error: no kernel image is available for execution on the device<br>
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.<br>
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.. This most likely means that you don't have enough GPU RAM to embed a protein this long. Embedding on the CPU instead, which is very slow.

Trouble with SeqVec:
- Fabio: Installed a matching PyTorch version so that the embedder no longer falls back to the CPU, but the embedding still took (too?) long; interrupted after 77 min.

In [8]:
# Compute amino-acid-level embeddings (takes quite some time)
aa_embeddings = embedder.embed_many([str(s.seq) for s in sequences])
# `embed_many` returns a generator.
# We want to keep both the raw and the reduced embeddings in memory,
# so we turn the generator into a list.
# (This is the step that actually starts embedding the sequences!)
# Needs a certain amount of GPU RAM; if it is not sufficient, the CPU is used (slower)
aa_embeddings = list(aa_embeddings)

KeyboardInterrupt: 
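
Because `embed_many` returns a generator, no work is done until the generator is consumed, which is why the runtime shows up at the `list(...)` call. A toy illustration of that behaviour follows; `fake_embed_many` is a hypothetical stand-in and not the library function.

```python
def fake_embed_many(sequences):
    """Hypothetical stand-in for an embedder: yields one result per sequence."""
    for seq in sequences:
        print(f"embedding sequence of length {len(seq)}")
        yield [0.0] * len(seq)  # placeholder values instead of a real embedding


seqs = ["MFFT", "MLVYIT"]
lazy = fake_embed_many(seqs)  # nothing is printed yet: no work has been done
embeddings = list(lazy)       # the work happens here, one sequence at a time
print(len(embeddings))        # 2
print(list(lazy))             # []  (a generator can only be consumed once)
```

Keeping the list around is what allows both the amino-acid-level and the protein-level embeddings to be used later without embedding twice.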

In [None]:
# Print shape of amino-acid-level embeddings
# (per sequence, ProtTrans yields (len(seq), 1024); SeqVec adds a leading axis for its 3 layers)
print("amino acid level embeddings object shape:")
print("o) list of <len(sequences)> embedding matrices with <len(seq)> rows and 1024 columns")
print(f"o) 1st D (number of sequences)\t{len(aa_embeddings)}")
print("o) 2nd D (sequence length)\tdepending on sequence")
print(f"o) 3rd D (embedding dimensions)\t{aa_embeddings[0].shape[-1]} (constant)")

In [None]:
# Compute protein-level embeddings
# Use the reduction of the embedder that produced the embeddings (SeqVec and ProtTrans differ)
protein_embeddings = [embedder.reduce_per_protein(e) for e in aa_embeddings]
# mean of the amino-acid-level vectors
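
The reduction to protein level depends on the layout of the per-residue embedding. The sketch below contrasts two layouts with NumPy: a plain `(L, 1024)` matrix, and a layered `(3, L, 1024)` array as SeqVec produces it to our understanding (an assumption, as is summing the layers before averaging). Both function names are hypothetical and the arrays are random stand-ins.

```python
import numpy as np


def reduce_flat(embedding):
    """(L, D) -> (D,): mean over the residue axis."""
    return embedding.mean(axis=0)


def reduce_layered(embedding):
    """(layers, L, D) -> (D,): sum the layers, then mean over the residue axis."""
    return embedding.sum(axis=0).mean(axis=0)


rng = np.random.default_rng(seed=1)
flat = rng.normal(size=(166, 1024))        # stand-in for a ProtTrans-style embedding
layered = rng.normal(size=(3, 166, 1024))  # stand-in for a SeqVec-style embedding

print(reduce_flat(flat).shape)        # (1024,)
print(reduce_layered(layered).shape)  # (1024,)
print(reduce_flat(layered).shape)     # (166, 1024): wrong rule for layered input
```

The last line shows why the reduction has to match the embedder: applying the flat rule to a layered array silently returns a matrix instead of a vector.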

In [None]:
# Print Shape of Protein Level Embedding
print("protein level embeddings-object shape:")
print("o) 2 dimensional list of <len(sequences)> embedding vectors with 1024 entries")
print(f"o) 1st D (number of sequences)\t{len(protein_embeddings)}")
print(f"o) 2nd D (embedding dimensions)\t{len(protein_embeddings[0])} (constant)")

In [None]:
# Print Summary of Embedding Shapes:  Sequence | AA Level Embedding | Protein Level Embedding
for i, (per_amino_acid, per_protein) in enumerate(zip(aa_embeddings, protein_embeddings)):
    print(f"Protein {i+1}\t{per_amino_acid.shape}\t{per_protein.shape}")

## o) Projecting high dimensional embedding space to a 3D space
see <project_visualize_pipeline_embeddings.ipynb> (Bio-embeddings GitHub)

In [None]:
# Configure tsne options
options = {
    'perplexity': 3, # low values (e.g. 3) make t-SNE focus on local structure; higher values (e.g. 30) weight global structure more
    'n_iter': 500 # number of iterations for the t-SNE algorithm
}

# Apply t-SNE projection
projected_p_embedding = tsne_reduce(protein_embeddings, **options) # array with one row of 3 coordinates per protein

# Display Projected Embedding (from 1024 dimensional (Protein Level) vectors to 3 dimensional coordinate vectors)
print(f"\nShape of projected protein level embedding: {projected_p_embedding.shape}\n")
for i,embedding in enumerate(projected_p_embedding[:3]): # first 3
    print(f"Protein {i+1}\t{embedding}")
print(". . .")
for i,embedding in enumerate(projected_p_embedding[-3:]): # last 3
    print(f"Protein {i+len(projected_p_embedding)-2}\t{embedding}")
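
For small VOGs the perplexity deserves a check before projecting: recent scikit-learn releases reject a perplexity that is not smaller than the number of samples, and we assume here that `tsne_reduce` wraps scikit-learn's `TSNE`. `safe_perplexity` is a hypothetical helper written for this sketch.

```python
def safe_perplexity(requested, n_samples):
    """Clamp the requested perplexity so that it stays below the sample count."""
    if n_samples < 2:
        raise ValueError("t-SNE needs at least 2 samples")
    return min(requested, n_samples - 1)


# Example: a VOG with only 2 members cannot use perplexity 3.
n_proteins = 2
options = {
    "perplexity": safe_perplexity(3, n_proteins),
    "n_iter": 500,
}
print(options)  # {'perplexity': 1, 'n_iter': 500}
```

With very few proteins the resulting projection carries little information, so the clamp mainly prevents an error rather than producing a meaningful plot.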

## o) Visualization of the Data
see <project_visualize_pipeline_embeddings.ipynb> (Bio-embeddings GitHub)

In [None]:
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
import numpy as np

# Use the t-SNE projection computed above (one row of 3 coordinates per protein)
data = np.asarray(projected_p_embedding)

# Create a 3D scatter plot
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')

# Extract x, y, and z coordinates from the data
x = data[:, 0]
y = data[:, 1]
z = data[:, 2]

# Plot the points
ax.scatter(x, y, z)

# Set labels for each axis
ax.set_xlabel('t-SNE 1')
ax.set_ylabel('t-SNE 2')
ax.set_zlabel('t-SNE 3')

# Show the plot
plt.show()


In [None]:
import plotly.express as px
from sklearn.decomposition import PCA

df = px.data.iris()
X = df[['sepal_length', 'sepal_width', 'petal_length', 'petal_width']]

pca = PCA(n_components=3)
components = pca.fit_transform(X)

total_var = pca.explained_variance_ratio_.sum() * 100

fig = px.scatter_3d(
    components, x=0, y=1, z=2, color=df['species'],
    title=f'Total Explained Variance: {total_var:.2f}%',
    labels={'0': 'PC 1', '1': 'PC 2', '2': 'PC 3'}
)
fig.show()
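
The cell above runs PCA on the iris demo data. The same projection can be applied to a matrix of protein-level embeddings; the sketch below does this with a NumPy SVD instead of scikit-learn (a swapped-in technique for illustration). `pca_project` is a hypothetical helper and the random matrix stands in for real embeddings.

```python
import numpy as np


def pca_project(X, n_components=3):
    """Project the rows of X onto the first principal components via SVD."""
    X = np.asarray(X, dtype=float)
    centered = X - X.mean(axis=0)                      # PCA requires centered data
    _, S, Vt = np.linalg.svd(centered, full_matrices=False)
    coords = centered @ Vt[:n_components].T            # coordinates in PC space
    explained = S**2 / np.sum(S**2)                    # explained variance ratios
    return coords, explained[:n_components]


# Stand-in data (assumption): 20 random "protein embeddings" with 1024 features.
rng = np.random.default_rng(seed=2)
fake_protein_embeddings = rng.normal(size=(20, 1024))

coords, ratio = pca_project(fake_protein_embeddings)
print(coords.shape)  # (20, 3)
print(f"Total explained variance: {ratio.sum() * 100:.2f}%")
```

The three columns of `coords` can be passed to either scatter plot above in place of the t-SNE coordinates.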