Comparison of Protein Sequence Embeddings to Classify Molecular Functions

philippvaeth/PROVAL

PROVAL: Evaluation Framework for Protein Sequence Embeddings

Code submission of paper 'PROVAL: A Framework for Comparison of Protein Sequence Embeddings'

DOI:10.1016

PROVAL Setup

  1. We recommend using a new Conda environment.
  2. Install the PROVAL framework: pip install -e .[all]
  3. (Optional) Install the Smith-Waterman alignment library:
git clone https://github.com/mengyao/Complete-Striped-Smith-Waterman-Library.git
cd Complete-Striped-Smith-Waterman-Library/src
make

Extension to Other Embedding Algorithms

Integration into embeddings.py
  1. Load the pretrained model.
  2. Add a function to embedding_utils.py that takes the train and test sequences as lists of Bio sequences (see read_fasta() in utils.py) and returns the vectors as a dictionary of the form id(String):vector(NumPy array).
  3. Add the approach to the embedding list (embeddings.py, line 17).
  4. Add the embedding function call to the if/elif statements, following the form of the existing calls.
  5. Run embeddings.py and the respective comparison scripts.
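
As a rough illustration of step 2, the sketch below shows the shape such a function might take. The function name composition_embedding, the amino-acid composition "model", and the Record stand-in are hypothetical and not part of PROVAL; Biopython SeqRecord objects expose the same id and seq attributes. Whether PROVAL expects a single dictionary or one per split is not stated here, so returning one dictionary per split is an assumption.

```python
from collections import namedtuple

import numpy as np

# Stand-in for a Biopython SeqRecord (assumption: only .id and .seq are used).
Record = namedtuple("Record", ["id", "seq"])

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids


def composition_embedding(train_seqs, test_seqs):
    """Hypothetical embedding: relative amino-acid frequencies per sequence.

    Returns two dictionaries of the form {sequence id (str): vector (np.ndarray)}.
    """
    def embed(records):
        vectors = {}
        for record in records:
            sequence = str(record.seq)
            counts = np.array([sequence.count(aa) for aa in AMINO_ACIDS], dtype=float)
            # Normalize by sequence length; guard against empty sequences.
            vectors[str(record.id)] = counts / max(len(sequence), 1)
        return vectors

    return embed(train_seqs), embed(test_seqs)


# Toy sequences (made-up data for illustration).
train = [Record("P1", "MKVLA"), Record("P2", "GGGA")]
test = [Record("P3", "MMKK")]
train_vectors, test_vectors = composition_embedding(train, test)
```

A real integration would replace the composition counts with a call to the pretrained model loaded in step 1, while keeping the same input and output shapes.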
or
Custom integration through vector file
  1. Load the train and test sequences as lists of Bio sequences (see read_fasta() in utils.py).
  2. Use the custom embedding to compute a vector for each sequence, stored in the dictionary format id(String):vector(NumPy array).
  3. Truncate the vectors to d=100 if necessary (see embeddings.py).
  4. Save the dictionary as a pickle ('.p') file (see embeddings.py).
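
Steps 3 and 4 could look roughly like the following sketch. It assumes that truncation means keeping the first d components of each vector; embeddings.py remains the reference for how PROVAL actually does this. The helper names, the toy vectors, and the file name custom_embedding.p are hypothetical.

```python
import os
import pickle
import tempfile

import numpy as np

D = 100  # target dimensionality from step 3


def truncate_vectors(vectors, d=D):
    """Keep at most the first d components of each vector (assumed truncation rule)."""
    return {seq_id: np.asarray(vec)[:d] for seq_id, vec in vectors.items()}


def save_vectors(vectors, path):
    """Serialize the id -> vector dictionary as a pickle file."""
    with open(path, "wb") as handle:
        pickle.dump(vectors, handle)


# Toy vectors with 128 dimensions (made-up data for illustration).
rng = np.random.default_rng(0)
vectors = {"P1": rng.normal(size=128), "P2": rng.normal(size=128)}

truncated = truncate_vectors(vectors)
path = os.path.join(tempfile.mkdtemp(), "custom_embedding.p")  # hypothetical file name
save_vectors(truncated, path)

# Reload to check that the file round-trips.
with open(path, "rb") as handle:
    reloaded = pickle.load(handle)
```

Vectors shorter than d are left unchanged by the slice, so the "if necessary" condition needs no separate branch.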

Full Reproducibility of the Paper Results

Note that the extraction of the vectors and the results may not be fully deterministic, so small deviations are possible.

Data set (optional)

Steps to reproduce the test.fasta and train.fasta files in the data/ folder:

  1. Download the full SwissProt data set (release 02/2021):
    https://ftp.uniprot.org/pub/databases/uniprot/previous_major_releases/release-2021_02/
  2. Select the sequence IDs, the sequence strings and the molecular function information ('GO:xxxxxx' terms)
  3. Discard all sequences with more than one molecular function (to reduce the complexity of the experiments)
  4. Select 1000 random sequences for each of the 15 most frequent molecular functions (=15,000 sequences)
  5. Randomly split the sequences into training and test sets (70:30)
  6. Save the sequences in the .fasta format (see the test.fasta and train.fasta files in the data folder):

    <Sequence ID> [<GO-ID>]
    <Sequence>
    <Sequence ID> [<GO-ID>]
    <Sequence>
    ...
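
Steps 3 to 6 above can be sketched as follows. This is an illustrative outline, not the script used for the paper: the function names, the seed, and the toy entries are hypothetical, the GO identifiers are placeholders, and the leading '>' on header lines follows the usual FASTA convention and should be checked against the files in the data folder.

```python
import random
from collections import Counter, defaultdict


def build_splits(entries, n_classes=15, per_class=1000, train_fraction=0.7, seed=0):
    """entries: iterable of (seq_id, sequence, go_terms) tuples, go_terms being a list."""
    rng = random.Random(seed)
    # Step 3: keep only sequences annotated with exactly one molecular function.
    single = [(sid, seq, terms[0]) for sid, seq, terms in entries if len(terms) == 1]
    # Step 4: find the most frequent functions and sample per_class sequences from each.
    counts = Counter(go for _, _, go in single)
    top = [go for go, _ in counts.most_common(n_classes)]
    by_class = defaultdict(list)
    for item in single:
        by_class[item[2]].append(item)
    selected = []
    for go in top:
        selected.extend(rng.sample(by_class[go], min(per_class, len(by_class[go]))))
    # Step 5: shuffle, then split into training and test sets.
    rng.shuffle(selected)
    cut = int(round(train_fraction * len(selected)))
    return selected[:cut], selected[cut:]


def to_fasta(items):
    """Step 6: one header line '>ID [GO-ID]' followed by the sequence line."""
    return "".join(f">{sid} [{go}]\n{seq}\n" for sid, seq, go in items)


# Toy input: two placeholder functions plus one multi-function entry that is discarded.
entries = (
    [(f"S{i}", "MKV", ["GO:A"]) for i in range(10)]
    + [(f"T{i}", "GGA", ["GO:B"]) for i in range(10)]
    + [("U0", "AAA", ["GO:A", "GO:B"])]
)
train, test = build_splits(entries, n_classes=2, per_class=5)
train_fasta = to_fasta(train)
```

With the default arguments (15 functions, 1000 sequences each, 70:30), the same function would yield 10,500 training and 4,500 test sequences, provided every selected function has at least 1000 single-function sequences.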

Embedding methods

  1. Install the Smith-Waterman alignment library (see PROVAL Setup, step 3)
  2. Run embeddings.py to obtain the vectors

Figures

  • Run dataset_metrics.py for optional data set plots
  • Run semantics.py for the classification results (Table 3)
  • Run visualization.py for the visualization results (Figure 7)
  • Run eigenspectrum_plot.py for the information theory results (Figure 8)

Citation

 @article{VATH2022100044,
title = {PROVAL: A framework for comparison of protein sequence embeddings},
journal = {Journal of Computational Mathematics and Data Science},
pages = {100044},
year = {2022},
issn = {2772-4158},
doi = {https://doi.org/10.1016/j.jcmds.2022.100044},
url = {https://www.sciencedirect.com/science/article/pii/S2772415822000128},
author = {Philipp Väth and Maximilian Münch and Christoph Raab and F.-M. Schleif},
}