# Bio Embeddings for Shallow Learners
1) Choose a VOG
2) Choose and generate a bio embedding: SeqVec or ProtTrans
    - Input: VOG-specific .fasta file
    - Store the sequences as objects in a list
    - Create an embedder object, then compute an amino-acid-level and a protein-level embedding
    - Output: a list of protein-level embeddings, one fixed-length vector per sequence, so every protein becomes a point in the same multidimensional feature space regardless of its original length. Amino-acid-level embeddings are not single points in that space: each sequence yields one feature vector per residue, so the size of the embedding grows with the length of the sequence.
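
The difference between the two embedding levels can be illustrated with a small sketch. It uses random numbers in place of a real model and assumes a per-residue embedding of shape (sequence length, features) that is averaged over the residue axis. The feature size of 1024 and the helper name `mean_pool` are illustrative assumptions, not the embedders' API.

```python
import numpy as np

def mean_pool(per_residue: np.ndarray) -> np.ndarray:
    """Average a (sequence_length, n_features) array over the residue axis."""
    return per_residue.mean(axis=0)

rng = np.random.default_rng(0)
n_features = 1024  # assumed feature size, for illustration only

# Two toy "proteins" of different length: one feature vector per residue
per_residue = [rng.normal(size=(length, n_features)) for length in (57, 312)]

# Pooling gives every protein a vector of the same length
per_protein = np.stack([mean_pool(e) for e in per_residue])

print([e.shape for e in per_residue])  # shapes differ with sequence length
print(per_protein.shape)               # one row per protein, fixed width
```

The stacked array has one row per protein, which is the layout a shallow learner typically expects as its feature matrix.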

**Links:**
- SeqVec 
    - [Repository](https://github.com/Rostlab/SeqVec)
    - [Source-Code](https://github.com/Rostlab/SeqVec/blob/master/seqvec/seqvec.py)
- ProtTrans 
    - [Repository](https://github.com/agemagician/ProtTrans)
- Bio-Embeddings
    - [Repository](https://github.com/sacdallago/bio_embeddings/tree/develop)
    - [Notebooks](https://github.com/sacdallago/bio_embeddings/tree/develop/notebooks)
- [Embedder Benchmark Paper](https://www.mdpi.com/1422-0067/24/4/3775)
    

## 1) Choose input fasta file

In [None]:
# Test file for download (tiny_sampled.fasta)
# !wget http://data.bioembeddings.com/public/embeddings/notebooks/custom_data/tiny_sampled.fasta --output-document BE_testing/tiny_sampled.fasta

In [None]:
filepath = "BE_testing/tiny_sampled.fasta"
# filepath = "BE_testing/VOG1_trial2/VOG00001.faa"

## 2) Generate Embeddings

### 2.1) Via SeqVec - Command Line (in progress)

In [None]:
# !pip install seqvec

In [None]:
# !seqvec -i $filepath -o embeddings.npz
!seqvec -i $filepath -o embeddings.npz --split-char ' ' --id 0

<span style="color:red">
Error encountered: <br>
- json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0) <br>
- The seqvec command has an optional parameter --model for choosing a pretrained ELMo model (given as two files: weights.hdf5 and options.json). <br>
- If the parameter is not set, the command downloads the files for the default model automatically. <br>
- Apparently the downloaded .json file is empty or not valid JSON (?): the options file starts with a line that is typical of HTML files rather than JSON files.<br>

</span>
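
One way to confirm this diagnosis is to inspect the downloaded options file before handing it to the embedder. The sketch below is a minimal check that uses only the standard library. The helper name `inspect_options_file` and the returned status strings are made up for illustration and are not part of seqvec.

```python
import json
from pathlib import Path

def inspect_options_file(path):
    """Classify a file as 'missing', 'empty', 'html', 'invalid' or 'ok'."""
    path = Path(path)
    if not path.is_file():
        return "missing"
    text = path.read_text(encoding="utf-8", errors="replace").strip()
    if not text:
        return "empty"
    # An HTML page saved under a .json name starts with a tag
    if text.startswith("<"):
        return "html"
    try:
        json.loads(text)
    except json.JSONDecodeError:
        return "invalid"
    return "ok"
```

Calling `inspect_options_file("model/options.json")` (the path is only an example) would then distinguish an empty download from an HTML page that was saved in place of the model options.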

In [None]:
# Load Embeddings
import numpy as np
import json

# option 1: without identifiers:
data = np.load("embeddings.npz")  # type: Dict[str, np.ndarray]

# # option 2: with identifiers as additional json file
# data = np.load("embeddings.npy") # shape=(n_proteins,)
# with open("embeddings.json") as fp:
#     labels = json.load(fp)
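
The dictionary-like layout of option 1 can be tried out without running the embedder. The following sketch writes two toy vectors to an .npz archive and reads them back by identifier. The identifiers, the vector size, and the temporary file name are invented for illustration, and the actual keys written by seqvec may differ.

```python
import os
import tempfile

import numpy as np

# Toy per-protein embeddings keyed by made-up identifiers
embeddings = {"protein_a": np.zeros(4), "protein_b": np.ones(4)}

path = os.path.join(tempfile.mkdtemp(), "toy_embeddings.npz")
np.savez(path, **embeddings)

# np.load returns a lazy, dict-like archive; copy it into a plain dict
with np.load(path) as data:
    loaded = {identifier: data[identifier] for identifier in data.files}

for identifier, vector in loaded.items():
    print(identifier, vector.shape)
```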

### 2.2) Via SeqVec - Script (in progress)

In [21]:
# Download SeqVec model (ELMo model trained on UniRef50) as zip folder
!wget https://rostlab.org/~deepppi/seqvec.zip --output-document seqvec.zip

--2024-02-10 13:48:40--  https://rostlab.org/~deepppi/seqvec.zip
Resolving rostlab.org (rostlab.org)... 188.114.96.10, 188.114.97.10, 2a06:98c1:3120::a, ...
Connecting to rostlab.org (rostlab.org)|188.114.96.10|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 347361261 (331M) [application/zip]
Saving to: ‘seqvec.zip’


2024-02-10 13:51:48 (1.76 MB/s) - ‘seqvec.zip’ saved [347361261/347361261]



In [35]:
# Unzip and store the model files in a folder called 'model' in the working directory (assumes unzip is installed: sudo apt-get install unzip)
!unzip seqvec.zip -d . # unexpectedly, the archive extracts to the subfolder uniref50_v2 rather than to a seqvec folder
!mkdir -p model
!mv uniref50_v2/* model/
!rm -r uniref50_v2/ seqvec.zip

Archive:  seqvec.zip
  inflating: ./uniref50_v2/weights.hdf5  
  inflating: ./uniref50_v2/options.json  


In [2]:
# Load pretrained model
from allennlp.commands.elmo import ElmoEmbedder
from pathlib import Path

model_dir = Path('model/')
weights = model_dir / 'weights.hdf5'
options = model_dir / 'options.json'
embedder = ElmoEmbedder(options,weights, cuda_device=-1) # cuda_device=-1 for CPU, 0 for GPU

<span style="color:red">
Error encountered: <br>
- No CUDA GPUs are available
</span>
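
Since the error mentions CUDA, it can help to check what PyTorch itself reports before creating the embedder. This is a minimal sketch that assumes PyTorch is the backend in use. The helper name `pick_cuda_device` is hypothetical, and the function falls back to the CPU index -1 when PyTorch is not installed or sees no GPU.

```python
def pick_cuda_device() -> int:
    """Return 0 if PyTorch reports a usable GPU, otherwise -1 (CPU)."""
    try:
        import torch
    except ImportError:
        return -1  # PyTorch not installed: only the CPU is an option
    return 0 if torch.cuda.is_available() else -1

cuda_device = pick_cuda_device()
print(f"cuda_device = {cuda_device}")
```

The resulting value follows the same convention as the `cuda_device` argument used above (-1 for CPU, 0 for the first GPU).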

### 2.3) Via ProtTrans - Command Line (in progress)

### 2.4) Via ProtTrans - Script (in progress)

### 2.5) Via Bio-Embeddings Module (done)
see [embed_fasta_sequences.ipynb](https://github.com/sacdallago/bio_embeddings/tree/develop/notebooks)

In [None]:
# Load Modules
import numpy as np
from Bio import SeqIO
from bio_embeddings.embed import ProtTransBertBFDEmbedder
from bio_embeddings.embed.seqvec_embedder import SeqVecEmbedder

# Extract sequences from fasta file and store them as a list
sequences = []
for record in SeqIO.parse(filepath, "fasta"):
    sequences.append(record)

# Sanity-check (First 3 and last 3 sequences)
print(f"Member-ID     Identifier\t\tLength\t    Sequence\n")
for i,s in enumerate(sequences[:3]): # s:SeqIO-object
    print(f"Protein {i+1:<6}{(s.id):<28}{len(s.seq):<10}{s.seq}") # :<6 for proper output alignment
print(". . .")
for i,s in enumerate(sequences[-3:], start=len(sequences)-3): # start at len-3 so that i+1 numbers the last three correctly
    print(f"Protein {i+1:<6}{(s.id):<28}{len(s.seq):<10}{s.seq}")
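
The first-and-last preview above needs some care with the 1-based numbering of the trailing items. A small helper can keep that bookkeeping in one place. This is a sketch, and the name `head_tail` is an assumption introduced here rather than part of Biopython or bio_embeddings.

```python
def head_tail(items, n=3):
    """Yield (1-based position, item) for the first n and last n items.

    Inputs with at most 2*n items are yielded once, without duplicates.
    """
    total = len(items)
    if total <= 2 * n:
        yield from enumerate(items, start=1)
        return
    yield from enumerate(items[:n], start=1)
    yield from enumerate(items[-n:], start=total - n + 1)

names = [f"seq{k}" for k in range(1, 13)]  # 12 dummy identifiers
for position, name in head_tail(names):
    print(f"Protein {position:<6}{name}")
```

The same helper would work for any list, including lists of sequence records or embeddings.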


In [None]:
# Choose Embedder
# embedder = ProtTransBertBFDEmbedder()
embedder = SeqVecEmbedder()

In [None]:
# Compute the amino-acid-level embedding (takes quite some time)
aa_embeddings = embedder.embed_many([str(s.seq) for s in sequences])
# `embed_many` returns a generator. We want to keep both the raw and the reduced embeddings in memory,
# so we turn the generator into a list (this is the step that actually embeds the sequences).
# Needs a certain amount of GPU memory; if it is not sufficient, the CPU is used instead (slower).
aa_embeddings = list(aa_embeddings)

sm_86 PyTorch compatibility issue:
- The installed PyTorch version was not compatible with the GPU (sm_86), which prevented the embedding computation from running on the GPU.
- Solved with: conda install pytorch==1.10.0 torchvision==0.11.0 torchaudio==0.10.0 cudatoolkit=11.3 -c pytorch -c conda-forge

About SeqVec:
- https://github.com/Rostlab/SeqVec/blob/master/README.md : "All results built upon the embeddings gained from the new tool SeqVec neither explicitly nor implicitly using evolutionary information. Nevertheless, it improved over some methods using such information. Where the lightning-fast HHblits needed on average about two minutes to generate the evolutionary information for a target protein, SeqVec created the vector representation on average in 0.03 seconds."
- Directory: \\wsl.localhost\Ubuntu\home\dinglemittens\anaconda3\envs\SPEnv38\lib\python3.8\site-packages\bio_embeddings\embed


Trouble with Speed:
- With the proper PyTorch version installed, the embedder no longer has to use the CPU, but the run still took (too?) long: interrupted after 77 min for SeqVecEmbedder() and after 40 min for ProtTransBertBFDEmbedder().
- For tiny_sampled.fasta, which holds 12 sequences with a total length of 3682 amino acids, the embedding run took 43 seconds, i.e. ~5k amino acids per minute.
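
The throughput quoted above can be turned into a rough runtime estimate for a larger file. The sketch assumes that embedding time scales linearly with the total number of residues, which is a simplification. The function names and the 1,000,000-residue input are illustrative, not measurements.

```python
def residues_per_minute(total_residues: int, seconds: float) -> float:
    """Throughput of an embedding run in residues per minute."""
    return total_residues / seconds * 60

def estimated_minutes(total_residues: int, rate_per_minute: float) -> float:
    """Runtime estimate under the assumption of linear scaling."""
    return total_residues / rate_per_minute

# Figures from the tiny_sampled.fasta run: 3682 residues in 43 seconds
rate = residues_per_minute(3682, 43)
print(f"{rate:.0f} residues/min")

# Hypothetical larger input of 1,000,000 residues
print(f"{estimated_minutes(1_000_000, rate):.0f} min")
```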

In [None]:
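# Returns the nesting depth (number of dimensions) of a nested list/array,
# following the first element at each level; stops at an empty container
def dimension_number(array):
    dim_num = 0
    sublist = array
    while isinstance(sublist, (np.ndarray, list)):
        dim_num += 1
        if len(sublist) == 0:
            break
        sublist = sublist[0]
    return dim_num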

In [None]:
# Print shape of the amino-acid-level embedding
n_dims = dimension_number(aa_embeddings)
if n_dims == 4:
    print("Amino acid level embeddings shape (SeqVec):")
    print(f"( {len(aa_embeddings)} | {len(aa_embeddings[0])} | variable | {len(aa_embeddings[0][0][0])} )")
    print("( no. of sequences | NN layers | sequence length | embedding dimensions )")
elif n_dims == 3:
    print("Amino acid level embeddings shape (ProtTrans):")
    print(f"( {len(aa_embeddings)} | variable | {len(aa_embeddings[0][0])} )")
    print("( no. of sequences | sequence length | embedding dimensions )")


In [None]:
# Compute Protein Level Embedding
protein_embeddings = [embedder.reduce_per_protein(e) for e in aa_embeddings]
# mean of amino acid level vectors

# Print Shape of Protein Level Embedding
print("Protein level embeddings shape:")
print(np.shape(protein_embeddings))
print("( no. of sequences | embedding dimensions )")

In [None]:
# Print Summary of Embedding Shapes:  Sequence | AA Level Embedding | Protein Level Embedding
print("Member ID\tAA Level Embedding\tProtein Level Embedding")
for i, (per_amino_acid, per_protein) in enumerate(zip(aa_embeddings[:3], protein_embeddings[:3])):
    print(f"Protein {i+1}\t{per_amino_acid.shape}\t\t{per_protein.shape}")
print(". . .")
for i, (per_amino_acid, per_protein) in enumerate(zip(aa_embeddings[-3:], protein_embeddings[-3:]), start=len(aa_embeddings)-3):
    print(f"Protein {i+1}\t{per_amino_acid.shape}\t\t{per_protein.shape}")

## 3) Projection/Dimensionality Reduction

### 3.1) Via Bio-Embeddings Module (done)
see [project_visualize_pipeline_embeddings.ipynb](https://github.com/sacdallago/bio_embeddings/tree/develop/notebooks) (Bio-embeddings GitHub)

In [None]:
from bio_embeddings.project import tsne_reduce

# Configure tsne options
options = {
    'perplexity': 3, # low perplexity (e.g. 3) makes t-SNE focus on local structure; higher values (e.g. 30) give more weight to global structure
    'n_iter': 500 # number of iterations for the t-SNE algorithm
}

# Apply TSNE Projection 
projected_p_embedding = tsne_reduce(protein_embeddings, **options) # array of shape (no. of sequences, 3)

# Display Projected Embedding (from 1024 dimensional (Protein Level) vectors to 3 dimensional coordinate vectors)
print(f"\nShape of projected/dimensionality-reduced protein level embedding: {projected_p_embedding.shape}\n")
for i,embedding in enumerate(projected_p_embedding[:3]): # first 3
    print(f"Protein {i+1}\t{embedding}")
print(". . .")
for i,embedding in enumerate(projected_p_embedding[-3:]): # last 3
    print(f"Protein {i+len(projected_p_embedding)-2}\t{embedding}")
print() 
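
With only a handful of sequences, the perplexity setting deserves a quick sanity check. The sketch below assumes that the projection is backed by scikit-learn's t-SNE, where recent versions reject a perplexity that is not smaller than the number of samples. The helper name `check_perplexity` is hypothetical.

```python
def check_perplexity(perplexity: float, n_samples: int) -> float:
    """Return perplexity unchanged if it is valid for n_samples points."""
    if perplexity <= 0:
        raise ValueError("perplexity must be positive")
    if perplexity >= n_samples:
        raise ValueError(
            f"perplexity ({perplexity}) must be smaller than "
            f"the number of samples ({n_samples})"
        )
    return perplexity

# 12 proteins, as in tiny_sampled.fasta
print(check_perplexity(3, 12))
```

Under this assumption, the often-quoted value of 30 would not be usable for the 12-sequence test file, while the value of 3 used above is.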

## 4) Visualization of the Projection

### 4.1) Via Pyplot Scatterplot

In [None]:
import matplotlib.pyplot as plt

# Create a 3D scatter plot
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')

# Extract x, y, and z coordinates from the data
x = projected_p_embedding[:, 0]
y = projected_p_embedding[:, 1]
z = projected_p_embedding[:, 2]

# Plot the points
ax.scatter(x, y, z)

# Set labels for each axis
ax.set_xlabel('t-SNE component 1')
ax.set_ylabel('t-SNE component 2')
ax.set_zlabel('t-SNE component 3')

# Show the plot
plt.show()


### 4.2) Via Plotly Express (in progress)

In [None]:
import plotly.express as px

fig = px.scatter_3d(
    projected_p_embedding, x=0, y=1, z=2,
    labels={'0': 't-SNE component 1', '1': 't-SNE component 2', '2': 't-SNE component 3'}
)
fig.show()

In [None]:
# Unmodified example code from ChatGPT (PCA on the iris dataset)
import plotly.express as px
from sklearn.decomposition import PCA

df = px.data.iris()
X = df[['sepal_length', 'sepal_width', 'petal_length', 'petal_width']]

pca = PCA(n_components=3)
components = pca.fit_transform(X)

total_var = pca.explained_variance_ratio_.sum() * 100

fig = px.scatter_3d(
    components, x=0, y=1, z=2, color=df['species'],
    title=f'Total Explained Variance: {total_var:.2f}%',
    labels={'0': 'PC 1', '1': 'PC 2', '2': 'PC 3'}
)
fig.show()

### 4.3) Via Bio-Embeddings Module (in progress)
see [project_visualize_pipeline_embeddings.ipynb](https://github.com/sacdallago/bio_embeddings/tree/develop/notebooks) (Bio-embeddings GitHub)

In [None]:
# Code from the bio_embeddings notebook; still has to be adjusted.
import h5py
import numpy as np
from pandas import read_csv
from bio_embeddings.project import tsne_reduce
from bio_embeddings.visualize import render_3D_scatter_plotly

mapping_file = read_csv('mapping_file.csv', index_col=0)
embeddings = []
with h5py.File('reduced_embeddings_file.h5', 'r') as f:
    for remapped_id in mapping_file.index:
        embeddings.append(np.array(f[remapped_id]))
options = {
    'perplexity': 3,
    'n_iter': 500
}

projected_embeddings = tsne_reduce(embeddings, **options)
mapping_file['component_0'] = projected_embeddings[:, 0]
mapping_file['component_1'] = projected_embeddings[:, 1]
mapping_file['component_2'] = projected_embeddings[:, 2]
annotation_file = read_csv('annotation_file.csv', index_col=0)

merged_annotation_file = annotation_file.join(mapping_file.set_index('original_id'), how='outer')
merged_annotation_file['label'] = merged_annotation_file['label'].fillna('UNKNOWN')  # assignment instead of in-place fillna on a column selection
figure = render_3D_scatter_plotly(merged_annotation_file)
figure.show()

