## PHAGES2050
**Date:** 19.09.2020  
**Author:** Pior Tynecki

This notebook is part of the examples in the [PHAGES2050](https://github.com/ptynecki/PHAGES2050) repository. Its content was tested with framework version 0.0.7.

The examples below run the embedding models used for protein vectorization. Vectorization produces a 1024- or 1280-dimensional feature vector for each protein, or for the set of proteins representing a single bacteriophage.
Depending on your needs, you can use these fixed-length numeric vectors for:
* single protein function prediction,
* feature space creation for further Machine Learning classification,
* feature space creation for deeper exploration of bacteriophages in 3D.

This notebook shows how to run the **ProtBert** and **ESM** embeddings.

## ProtBert
### Import necessary modules

In [None]:
import torch
import pandas as pd

In [None]:
from phages2050.embeddings.proteins.bert import BertModelManager, BertEmbedding

### Download the pre-trained protein model
Before you start vectorizing a protein sequence or a set of proteins, you have to download the pre-trained model dedicated to protein embedding.  

This step has to be done only once per environment.

In [None]:
bert_model_manager = BertModelManager()

In [None]:
# This can take a while, depending on your Internet connection speed
# Be aware that the model size is 1.4 GB
bert_model_dir = bert_model_manager.download_model()

# On success, this returns the path to the pre-trained model directory
bert_model_dir

### Prepare DataFrame

The standard PHAGES2050 I/O data format is a pandas DataFrame.

Put your protein sequence into a DataFrame together with its known function name, or leave the function as "Unknown".

In [None]:
# Non-fixed length protein string made up of 20 different amino acids
protein_sequence = "MSMYTTAQLLAANEQKFKFDPLFLRLFFRESYPFTTEKVYLSQIPGLVNMALYVSPIVSGEVIRSRGGSTSEFTPGYVKPKHEVNP" \
                   "QMTLRRLPDEDPQNLADPAYRRRRIIMQNMRDEELAIAQVEEMQAVSAVLKGKYTMTGEAFDPVEVDMGRSEENNITQSGGTEWSK" \
                   "RDKSTYDPTDDIEAYALNASGVVNIIVFDPKGWALFRSFKAVKEKLDTRRGSNSELETAVKDLGKAVSYKGMYGDVAIVVYSGQYV" \
                   "ENGVKKNFLPDNTMVLGNTQARGLRTYGCIQDADAQREGINASARYPKNWVTTGDPAREFTMIQSAPLMLLADPDEFVSVQLA"

# Protein function name
protein_class = "Unknown"

# Group them as a list of two elements
protein_data = [
    protein_sequence,
    protein_class
]

In [None]:
# Create temporary DataFrame
df = pd.DataFrame(
    data=[protein_data],
    columns=['sequence', 'class']
)
df.head()
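
Because the sequence is expected to be made up of the 20 standard amino acids, it can be worth checking the input before vectorization. The cell below is a minimal sketch of such a pre-check. The helper `find_invalid_residues` and the sample sequences are hypothetical and not part of PHAGES2050, and the notebook does not state how the embedding models treat non-standard residues.

```python
# The 20 standard amino acids, one-letter codes
STANDARD_AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")


def find_invalid_residues(sequence):
    """Return the sorted list of characters that are not standard amino acids."""
    return sorted(set(sequence.upper()) - STANDARD_AMINO_ACIDS)


# Hypothetical sequences used only for illustration
print(find_invalid_residues("MSMYTTAQLLAANEQKF"))  # only standard residues
print(find_invalid_residues("MSMYXTAQ*LB"))        # contains X, * and B
```

An empty list means the sequence contains only standard residues. Any other result lists the characters to review before building the DataFrame.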

### Check CUDA and GPU support
If you have CUDA and GPU support, each of the calculations below will run much faster than on a CPU.

In [None]:
# Check if you have CUDA support
torch.cuda.is_available()

In [None]:
# List the available GPUs by ID to find the one you want to use
gpu_device_count = torch.cuda.device_count()
{gpu_id: torch.cuda.get_device_name(gpu_id) for gpu_id in range(gpu_device_count)}
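
The two checks above can be combined into one device choice with a CPU fallback. The cell below is a sketch under stated assumptions: `pick_cuda_device` is a hypothetical helper, not part of PHAGES2050. The ESM section of this notebook notes that `cuda_device` may be `None` for CPU, but whether `BertEmbedding` accepts `None` in the same way is an assumption that should be verified.

```python
def pick_cuda_device(cuda_available, device_count, preferred_id=0):
    """Return a usable GPU ID, or None to signal a CPU fallback."""
    if not cuda_available or device_count == 0:
        return None
    # Fall back to the first GPU when the preferred ID is out of range
    return preferred_id if 0 <= preferred_id < device_count else 0


try:
    import torch
    cuda_device = pick_cuda_device(
        torch.cuda.is_available(),
        torch.cuda.device_count()
    )
except ImportError:
    # torch is not installed; treat the machine as CPU-only
    cuda_device = None

print(cuda_device)
```

Keeping the decision in a plain function makes it easy to check without a GPU present.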

### Single protein vectorization - example 1
This example shows how to vectorize a single protein sequence into a 1024-dimensional numeric vector.

In [None]:
bert_embedding = BertEmbedding(
    model_dir=bert_model_dir,
    # Pass the ID of the GPU to use
    cuda_device=0
)

In [None]:
embbeding_result = bert_embedding.transform(df=df)

In [None]:
embbeding_result.head()

### Bacteriophage averaged vectorization - example 2
This example shows how to vectorize a set of bacteriophage protein sequences into a single averaged 1024-dimensional numeric vector.

In [None]:
# Non-fixed length protein string made up of 20 different amino acids
protein_sequence_1 = "MSMYTTAQLLAANEQKFKFDPLFLRLFFRESYPFTTEKVYLSQIPGLVNMALYVSPIVSGEVIRSRGGSTSEFTPGYVKPKHEVNP" \
                     "QMTLRRLPDEDPQNLADPAYRRRRIIMQNMRDEELAIAQVEEMQAVSAVLKGKYTMTGEAFDPVEVDMGRSEENNITQSGGTEWSK" \
                     "RDKSTYDPTDDIEAYALNASGVVNIIVFDPKGWALFRSFKAVKEKLDTRRGSNSELETAVKDLGKAVSYKGMYGDVAIVVYSGQYV" \
                     "ENGVKKNFLPDNTMVLGNTQARGLRTYGCIQDADAQREGINASARYPKNWVTTGDPAREFTMIQSAPLMLLADPDEFVSVQLA"

protein_id_1 = "NP_040587.1"

protein_sequence_2 = "MNLKEQRKAKMAEALNFKARIKAGDALSDDEVKALTALLGDIESLDAQIAKAEENAKVLASVAQLPDADSEKPTGTDGGAKASSLG" \
                     "AHFIKELKASGRSLKDPGSFVSTEFKAATDTQSVGSAGGAFGPLITDVDRTPVIPFQRPLAVADLMASGSVSGNAITYPVFGALEG" \
                     "TSTFVAEGAAKPQLHVANPEWKTDALGEIAAWFKVTDDMAEDIPYLVSEIQSTAIYDLALREELALLSGDGSASSITGILNRPGVQ" \
                     "TIAKGTDSDPDRIFSAISKVQEVTGFVADGIVINPADYQAIRLSKDANGQYFGGGFFSGQYGTGGIVQNPPLWGLKTVVTSSIAQG" \
                     "TTVVGPFKAAAKVFRKGGLRVESTNSHADDFTNDKITIRIRERLGLQVKYPSAFVKVALGTAGK"


protein_id_2 = "YP_009596870.1"

# Group them as a list of two-element lists
bacteriophage_data = [
    [protein_sequence_1, protein_id_1],
    [protein_sequence_2, protein_id_2]
]

In [None]:
# Create temporary DataFrame
df_phage = pd.DataFrame(
    data=bacteriophage_data,
    columns=['sequence', 'name']
)
df_phage.head()

In [None]:
embbeding_result = bert_embedding.transform(
    df=df_phage,
    bacteriophage_level=True
)

In [None]:
embbeding_result.head()
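
To make the idea of an averaged vector concrete, the cell below sketches an element-wise arithmetic mean over per-protein vectors. It assumes that "averaged" means a plain mean across proteins; the notebook does not show how `bacteriophage_level=True` computes its result. The helper `average_vectors` and the toy 4-dimensional vectors are hypothetical stand-ins for 1024-dimensional embeddings.

```python
def average_vectors(vectors):
    """Element-wise arithmetic mean of equally sized numeric vectors."""
    if not vectors:
        raise ValueError("at least one vector is required")
    length = len(vectors[0])
    if any(len(vector) != length for vector in vectors):
        raise ValueError("all vectors must have the same length")
    # zip(*vectors) iterates over the columns, one feature at a time
    return [sum(column) / len(vectors) for column in zip(*vectors)]


# Toy vectors standing in for two per-protein embeddings
protein_vectors = [
    [0.0, 1.0, 2.0, 3.0],
    [2.0, 3.0, 4.0, 5.0],
]
phage_vector = average_vectors(protein_vectors)
print(phage_vector)  # [1.0, 2.0, 3.0, 4.0]
```

The result has the same length as each input vector, so a bacteriophage with any number of proteins still maps to one fixed-length feature vector.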

## ESM
### Import necessary modules

In [None]:
from phages2050.embeddings.proteins.esm import ESMEmbedding

### Download the pre-trained protein model

In [None]:
# This can take a while, depending on your Internet connection speed
# Be aware that the model size is 6.7 GB
esm_embedding = ESMEmbedding(
    # Uniref100 or Uniref50
    uniref='Uniref100',
    # CUDA device ID or None (CPU)
    cuda_device=0
)
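
The ESM examples in this notebook read proteins from a FASTA file instead of a DataFrame. If your sequences are only available as Python strings, a FASTA file can be written with a few lines of standard-library code. The cell below is a sketch: `write_fasta`, the identifiers, the shortened sequences and the output file name are all hypothetical, and the notebook does not describe how `ESMEmbedding` interprets FASTA headers.

```python
def write_fasta(records, path, line_width=60):
    """Write (identifier, sequence) pairs to a FASTA file."""
    with open(path, "w") as handle:
        for identifier, sequence in records:
            handle.write(f">{identifier}\n")
            # Wrap the sequence to a fixed line width
            for start in range(0, len(sequence), line_width):
                handle.write(sequence[start:start + line_width] + "\n")


# Hypothetical identifiers and shortened sequences, for illustration only
records = [
    ("protein_1", "MSMYTTAQLLAANEQKFKFDPLFLRLFFRESYPFTTEKVYLSQIPGLVNMALYVSPIVSGEVIRSRGG"),
    ("protein_2", "MNLKEQRKAKMAEALNFKARIKAGDALSDDEVKALTALLGDIESLDAQIAKAEENAKVLASVAQLPD"),
]
write_fasta(records, "example_proteins_sketch.fasta")

# Read the headers back to confirm that both records were written
with open("example_proteins_sketch.fasta") as handle:
    headers = [line.strip() for line in handle if line.startswith(">")]
print(headers)
```

The resulting path could then be passed as `fasta_path`, assuming the file follows the conventions the framework expects.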

### Bacteriophage averaged vectorization - example 3

In [None]:
phage_embbeding_result = esm_embedding.transform(
    fasta_path='example_with_proteins.fasta',
    bacteriophage_level=True
)

In [None]:
phage_embbeding_result.head()

In [None]:
phage_embbeding_result.shape

### Bacteriophage per-protein vectorization - example 4

In [None]:
proteins_embbeding_result = esm_embedding.transform(
    fasta_path='example_with_proteins.fasta'
)

In [None]:
proteins_embbeding_result.head()

In [None]:
proteins_embbeding_result.shape