# Software Project - Sequence Embeddings on Shallow Learners
**2023, Klaus Hartmann-Baruffi, Fabio Pfaehler**

## o) Open Topics/Questions:
- Q: What should our subset look like? Either 1) define a number of protein IDs or 2) define a number of VOGs?
  1) We would have to collect the subset of protein IDs and write them to a new FASTA file that serves as input for the bio-embeddings tool.
  2) We could directly define a subset of VOGs and feed in the corresponding FASTA files (already assigned to VOGs), skipping all steps before the bio-embeddings tool.
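
Option 1 could be sketched roughly as follows. This is a minimal, dependency-free illustration and not the pipeline we settled on: `filter_fasta`, `record_id` and the toy records (`protA`, `protB`, `protC`) are hypothetical, and it assumes the protein ID is the first whitespace-delimited token of each FASTA header. In the notebook itself `Bio.SeqIO` would be the more natural choice.

```python
from io import StringIO


def record_id(header):
    """Return the first whitespace-delimited token of a FASTA header ('' if empty)."""
    parts = header.split(maxsplit=1)
    return parts[0] if parts else ""


def filter_fasta(lines, wanted_ids):
    """Yield (header, sequence) for every record whose ID is in wanted_ids."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new record starts: emit the previous one if it was requested
            if header is not None and record_id(header) in wanted_ids:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    # Emit the last record of the file
    if header is not None and record_id(header) in wanted_ids:
        yield header, "".join(chunks)


# Toy input (made-up records, for illustration only)
toy_fasta = StringIO(">protA first toy record\nMKV\nLAA\n>protB\nGGS\n>protC\nPQR\n")
subset_records = list(filter_fasta(toy_fasta, {"protA", "protC"}))

# Serialize the selected records as a new FASTA string
subset_fasta = "".join(f">{h}\n{s}\n" for h, s in subset_records)
print(subset_fasta)
```

The resulting string could be written to disk and passed to the embedder in place of the full FASTA file.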

## o) Install Libraries

In [62]:
"""not tested"""
# # bio-embeddings
# !pip3 install -U pip > /dev/null
# !pip3 install -U bio_embeddings[all] > /dev/null

'not tested'

## o) Import Libraries

In [119]:
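import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import h5py
from Bio import SeqIO
from bio_embeddings.embed import ProtTransBertBFDEmbedder
from bio_embeddings.project import tsne_reduce
from bio_embeddings.visualize import render_3D_scatter_plotly
# Needed further below for splitting the data and scoring the classifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score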

## o) Load Data, Choose a Subset

Our aim is to use shallow learners, so working with the whole dataset (38161 VOGs/instances) is not feasible; we take only a subset to work with.

In [64]:
# Load data
df = pd.read_csv("/home/dinglemittens/SoftwareProject/VOGDB/vog.members.tsv",sep='\t', header=0)
print("dataset shape: {}\n".format(df.shape))

# Choose subset from vog "start" to vog "end"
start = 5
end = 19
subset = df.iloc[start-1 : end]
print(subset.iloc[:3])

dataset shape: (38161, 5)

  #GroupName  ProteinCount  SpeciesCount FunctionalCategory  \
4   VOG00005           213            42                 Xu   
5   VOG00006           309            13                 Xu   
6   VOG00007           893           715             XhXrXs   

                                          ProteinIDs  
4  176652.NP_149851.1,72201.YP_009046735.1,126902...  
5  1094892.YP_004894116.1,1094892.YP_004894117.1,...  
6  1002918.NP_937979.1,1002921.NP_740489.1,100606...  


## o) Generate Feature- and Label-Vectors

Number of labels/classes (VOGs) = size of the subset
Number of feature values (protein IDs/sequences) = size of the subset * number of proteins per VOG * length of the protein sequence * 20,
where 20 is the number of amino acids in a one-hot encoding; the encoding is needed because we can't feed the model raw string characters.
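
To make the factor of 20 concrete, here is a small sketch of a one-hot encoding for a single protein. It is only an illustration of the dimensionality argument and not part of our pipeline: `one_hot_encode` is a hypothetical helper, and mapping non-standard symbols (such as `X`) to an all-zero block is an assumption made for this sketch.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def one_hot_encode(sequence):
    """Return a flat list with len(sequence) * 20 entries (zeros and ones)."""
    encoded = []
    for aa in sequence:
        block = [0] * len(AMINO_ACIDS)
        idx = AA_INDEX.get(aa)
        if idx is not None:  # unknown symbols stay all-zero (assumption of this sketch)
            block[idx] = 1
        encoded.extend(block)
    return encoded


# Sequence Q95021 from the test FASTA file used further below
seq = "PQGIEVVVLLFCLKIRYRDRIFLLRGNHETPSVNKVYFKCIVSFNF"
vec = one_hot_encode(seq)
print(len(seq), len(vec))
```

The vector length grows linearly with the sequence length, so proteins of different lengths yield feature vectors of different sizes, which a shallow learner cannot consume directly.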

In [65]:
# Convert labels (#GroupName) and features (ProteinIDs) into lists, one entry per VOG
group_names = subset["#GroupName"].tolist()
protein_ids = subset["ProteinIDs"].tolist()

# Generate flattened feature (X) and label (y) vectors
X = []
y = []
# Pair each VOG with its own comma-separated ID string; nesting the two loops
# would assign every protein of the subset to every group.
for group, ids_of_group in zip(group_names, protein_ids):
    label = int(group.replace("VOG", ""))  # e.g. "VOG00005" -> 5
    for protein_id in ids_of_group.split(","):
        y.append(label)
        X.append(protein_id)

## o) Generate Bio-Embeddings (in progress)
see <embed_fasta_sequences.ipynb>

As highlighted in the previous step, the dimensionality (complexity) of our feature space is extraordinarily high, so we need to reduce the feature size. For this purpose we will use so-called protein or bio-embeddings to ...

In [66]:
# # Create testing fasta file (tiny_sampled.fasta)
# !wget http://data.bioembeddings.com/public/embeddings/notebooks/custom_data/tiny_sampled.fasta --output-document BE_testing/tiny_sampled.fasta

In [79]:
# Extract sequences from fasta file and store them as a list
sequences = []
for record in SeqIO.parse("BE_testing/tiny_sampled.fasta", "fasta"):
    sequences.append(record)

# Print:  Index | ID | Length | Sequence
for i,s in enumerate(sequences):
    print(f"{i+1:<6}{(s.id):<18}{len(s.seq):<10}{s.seq}")

1     A0A2I1HIX6        129       MYNILFSIIENSWFIDLIKTLQLEYDSPSRQVLSGILLEPKISHVNICIINELSADNNFTIAIDEHLSNVIEEIINKVGAVAIVSDNSLNIAAAHKIITNNYPNIINMQCITHCVNLINIFIGEKLIFQ
2     MiniChange        129       MYNILFSIIENSWFIDLIKTLQLEYDSPSQQVLSGILLEPKISHVNICIINELSADNNFTIAIDEHLSNVIEEIINKVGAVAIVSDNSLNIAAAHKIITNNYPNIINMQCITHCVNLINIFIGEKLIFQ
3     Q95021            46        PQGIEVVVLLFCLKIRYRDRIFLLRGNHETPSVNKVYFKCIVSFNF
4     G7J9N7            133       MAGISAVIIVISIFLMVLVVADDMSSSSLSSSSSSVIRLPSKVTAEGKNVCAGAVASSWCPVKCFRTDPVCGVDGVTYWCGCAEAACAGVKVGKMGFCEVGSGGSAPLSAQAFLLLHIVWLIVLAFSVFFGLF
5     A0A2E8WNN2        172       MPTDSSTSPKHSLALLSPRRADRQWLGEALSSSYYRWVTDTLASSKLSERRRLASTETGLLLATGQEISALNEAYRGIAQETNVLAFPSMEWLEDGTLSLGDLVICPXIVRKEAKAQGKSTNDHFLHLLLHGMLHLFGYDHQTARQAKTMESREIAMLEKVGISNPYXESDS
6     A0A4Q4CTN5        386       MRSGRGGCKGVVNAQPIHVLGGDRLARGLGQPAGGPGPGFVQTAGNAADAAGGAGWWQHAVAGPGQGKAPVLAPGPPAAGPCRAGVGPGAGTEVVDPQRPATRGGMDAARQGLGKGAITILAGRQGLDAEDAKASAIVHAGDEQELLEDPGAGIALEGQQGEIAASDVHHALETGAAIPARG

In [None]:
# Generate Embedder
embedder = ProtTransBertBFDEmbedder()

# Compute Embedding (amino acid Level)
embeddings = embedder.embed_many([str(s.seq) for s in sequences])
# `embed_many` returns a generator.
# We want to keep both RAW embeddings and reduced embeddings in memory.
# To do so, we simply turn the generator into a list!
# (this will start embedding the sequences!)
# Needs certain amount of GPU RAM, if not sufficient CPU is used (slower)
embeddings = list(embeddings)

In [103]:
# Print Shape of Embedding (amino acid level)
print("embeddings-object shape:")
print("o) 3 dimensional list (list of <#sequences> embedding matrices with <seq-length> rows and 1024 columns)")
print(f"o) 1st D (number of sequences)\t{len(embeddings)}")
print("o) 2nd D (sequence length)\tdepending on sequence")
print(f"o) 3rd D (embedding dimensions)\t{len(embeddings[0][0])} (fixed)")

embeddings-object shape:
o) 3 dimensional list (list of <#sequences> embedding matrices with <seq-length> rows and 1024 columns)
o) 1st D (number of sequences)	12
o) 2nd D (sequence length)	depending on sequence
o) 3rd D (embedding dimensions)	1024 (fixed)


In [104]:
# Compute Reduced Embedding (protein Level)
reduced_embeddings = [ProtTransBertBFDEmbedder.reduce_per_protein(e) for e in embeddings]

In [109]:
# Print Shape of Reduced Embedding (protein level)
print("reduced-embeddings-object shape:")
print("o) 2 dimensional list (list of <#sequences> embedding vectors with 1024 entries)")
print(f"o) 1st D (number of sequences)\t{len(reduced_embeddings)}")
print(f"o) 2nd D (embedding dimensions)\t{len(reduced_embeddings[0])} (fixed)")

reduced-embeddings-object shape:
o) 2 dimensional list (list of <#sequences> embedding vectors with 1024 entries)
o) 1st D (number of sequences)	12
o) 2nd D (embedding dimensions)	1024 (fixed)
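
The step from a `(seq-length, 1024)` matrix to a single 1024-dimensional vector can be pictured as pooling over the residue axis. The sketch below assumes that `reduce_per_protein` averages the per-residue vectors; this matches the shapes printed above but should be checked against the bio_embeddings source. `mean_pool` and the 3 x 4 toy matrix are hypothetical and use plain lists instead of NumPy arrays.

```python
def mean_pool(per_residue):
    """Average a (seq_length x dim) nested list over the residue axis."""
    if not per_residue:
        raise ValueError("cannot pool an empty embedding")
    n_residues = len(per_residue)
    # zip(*rows) iterates over columns, i.e. over embedding dimensions
    return [sum(column) / n_residues for column in zip(*per_residue)]


# Toy "embedding": 3 residues, 4 dimensions (real embeddings have 1024)
toy_embedding = [
    [1.0, 2.0, 3.0, 4.0],
    [3.0, 2.0, 1.0, 0.0],
    [2.0, 2.0, 2.0, 2.0],
]
pooled = mean_pool(toy_embedding)
print(len(toy_embedding), "residues ->", len(pooled), "values:", pooled)
```

Whatever the sequence length, the pooled vector has a fixed size, which is what makes the per-protein embeddings usable as input for a shallow learner.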


In [118]:
# Print Summary of Embedding Shapes:  Sequence | AA Level Embedding | Protein Level Embedding
for i, (per_amino_acid, per_protein) in enumerate(zip(embeddings, reduced_embeddings)):
    print(f"{i+1}\t{per_amino_acid.shape}\t{per_protein.shape}")

1	(129, 1024)	(1024,)
2	(129, 1024)	(1024,)
3	(46, 1024)	(1024,)
4	(133, 1024)	(1024,)
5	(172, 1024)	(1024,)
6	(386, 1024)	(1024,)
7	(133, 1024)	(1024,)
8	(207, 1024)	(1024,)
9	(165, 1024)	(1024,)
10	(439, 1024)	(1024,)
11	(159, 1024)	(1024,)
12	(1584, 1024)	(1024,)


## o) Projecting high dimensional embedding space to a 3D space
see <project_visualize_pipeline_embeddings.ipynb> (Bio-embeddings GitHub)


In [124]:
# Configure tsne options
options = {
    'perplexity': 3, # low perplexity (e.g. 3) makes t-SNE focus on local structure; higher values (e.g. 30) give more weight to global structure
    'n_iter': 500 # number of iterations for the t-SNE algorithm
}

# Apply TSNE projection
projected_embeddings = tsne_reduce(reduced_embeddings, **options)

# Print shape of projected embedding (from 1024 dimensional (Protein Level) vectors to 3 dimensional vectors)
print(f"\nShape of projected embedding: {projected_embeddings.shape}\n")

[t-SNE] Computing 10 nearest neighbors...
[t-SNE] Indexed 12 samples in 0.000s...
[t-SNE] Computed neighbors for 12 samples in 0.035s...
[t-SNE] Computed conditional probabilities for sample 12 / 12
[t-SNE] Mean sigma: 0.299232
[t-SNE] KL divergence after 250 iterations with early exaggeration: 98.866592
[t-SNE] KL divergence after 500 iterations: 3.377068

Shape of projected embedding: (12, 3)





(12, 3)


## o) Visualization of the Data

## o) Split the Data

In [None]:
# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)
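
The `stratify=y` argument keeps the share of each VOG the same in the training and test sets. The following dependency-free sketch illustrates that idea on toy data; it is not scikit-learn's implementation. `stratified_split`, the toy labels and the `protein_<i>` names are hypothetical, and forcing at least one test sample per class is a simplification of this sketch.

```python
import random
from collections import defaultdict


def stratified_split(X, y, test_size=0.2, seed=42):
    """Split X and y so that each label keeps its relative frequency."""
    rng = random.Random(seed)
    indices_by_label = defaultdict(list)
    for idx, label in enumerate(y):
        indices_by_label[label].append(idx)

    train_idx, test_idx = [], []
    for indices in indices_by_label.values():
        rng.shuffle(indices)
        # Take the same fraction from every class (at least one sample)
        n_test = max(1, round(len(indices) * test_size))
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])

    X_train = [X[i] for i in train_idx]
    X_test = [X[i] for i in test_idx]
    y_train = [y[i] for i in train_idx]
    y_test = [y[i] for i in test_idx]
    return X_train, X_test, y_train, y_test


# Toy data: three "VOGs" holding 10%, 30% and 60% of all proteins
toy_y = [5] * 10 + [6] * 30 + [7] * 60
toy_X = [f"protein_{i}" for i in range(len(toy_y))]
toy_X_train, toy_X_test, toy_y_train, toy_y_test = stratified_split(toy_X, toy_y)

for label in (5, 6, 7):
    share_train = toy_y_train.count(label) / len(toy_y_train)
    share_test = toy_y_test.count(label) / len(toy_y_test)
    print(f"VOG {label}: train share {share_train:.2f}, test share {share_test:.2f}")
```

A VOG that makes up 10% of all proteins also makes up 10% of each of the two sets, which matters when group sizes differ as much as they do in our subset.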

## o) Train a Classifier on the Training Set

In [None]:
# Define the LDA classifier
# Note: X must hold numeric feature vectors (e.g. the per-protein embeddings), not protein ID strings
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
model = LinearDiscriminantAnalysis()

# Train the classifier (model fitting)
model.fit(X_train, y_train)

## o) Prediction on the Validation Set & Accuracy

In [None]:
# Use the model to make a prediction on the test data
y_pred = model.predict(X_test)

# Compute accuracy of the model
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy: {}".format(round(accuracy, 2)))

## o) Visualization/Plot of Decision Boundaries (?)

---
# Older Version of Notebook

In [None]:
# Step 1: Import the necessary libraries
import os
import csv
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from Bio import SeqIO
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

In [None]:
# Step 2: Load your dataset into a pandas DataFrame
df = pd.read_csv("/home/dinglemittens/SoftwareProject/VOGDB/vog.members.tsv",sep='\t', header=0)
# df = pd.read_csv("VOGDB/test.tsv",sep='\t', header=0)
print(df)

In [None]:
# Step 3: Preprocess your data
"""The next step is to pick out the relevant columns in my dataframe: the VOG numbers (labels) and their
corresponding collections of ProteinIDs (features). In addition I must convert each ID to its sequence by using
the fasta files.
For the scikit-learn split function I need a feature set X and a label set y (with redundant labels) of the same size.
By now I have my df ordered in such a way that each label has a list of proteins,
but I need to resolve them such that I have one big list of proteins, each paired with a label.
(Analogy: By now I have containers of balls (proteins/features), and I know their label (#VOG/container)
because they are separated from other balls by the container. To continue I need to merge all
the balls of all containers into one pool; before that I label them with the container number. This pool
can now be split 2 : 8 into test and training set. By stratifying (used as a parameter) I can carry over the
frequency distribution of balls from a certain container relative to all balls into the two sets (if all
balls of container 1 make up 10% of the total number of balls, then they will also make up
10% of all balls in each of the test and training sets)).
Next, we don't want our features as single strings (sequences) but as numerical vectors, where each
dimension of the vector is an amino acid. The algorithm needs numerical values for learning patterns.
The most straightforward way would be a one-hot encoding, i.e. one feature would be a vector of vectors of
length 20, with 19 zeros and 1 one (depending on which letter is considered). We won't use a one-hot encoding but a different embedding."""

# select interval for subset (from VOGa to VOGb) 1 - 38.161
end = df.shape[0] # last vog

a = 1
b = 180

features= df['ProteinIDs'].str.split(',').iloc[a-1:b] # each row a VOGs collection of proteins
labels = df['#GroupName'].iloc[a-1:b]

print("features:\n",features, "\n")
print("labels:\n", labels, "\n")

X=[]
y=[]
for i in range(len(features)): # for each VOG
    # id2seqvec = vog2fasta_dict(labels.iloc[i])
    for j in range(len(features.iloc[i])): # for each of the VOG's protein IDs
        y.append(labels.iloc[i]) # positional access; labels[i] only works while a == 1
        X.append("add function here that turns ProteinID into sequence embedding")

print("X:\n",X[:8], "...\n")
print("y:\n",y[:8], "...\n")

In [None]:
# Step 4: Split your data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

print("X_train:\n", X_train[:8], "...\n")
print("y_train:\n", y_train[:8], "...\n")
print("X_test:\n", X_test[:8], "...\n")
print("y_test:\n", y_test[:8], "...\n")







In [None]:
# Step 5: Choose a machine learning algorithm to use
model = LogisticRegression()

In [None]:

# Step 6: Train the model on the training data
model.fit(X_train, y_train)

In [None]:
# X_train is a plain list after train_test_split, so wrap it in a DataFrame instead of using .loc
new_df = pd.DataFrame({'ProteinIDs': X_train})
new_df.to_excel('./vog_proteins.xlsx', index=False) # needs an Excel writer such as openpyxl

new_df = new_df['ProteinIDs'].str.split(',', expand=True)
print(new_df)


In [None]:

# Step 7: Evaluate the model's performance on the testing data
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print('Accuracy:', accuracy)

In [None]:
# Step 8: Tune the model's hyperparameters to improve its performance
# For example, you could use GridSearchCV to search over a range of hyperparameters

In [None]:

# Step 9: Use the model to make predictions on new data
# For example, you could use model.predict(new_data) to make predictions on new data