PROVAL: Evaluation Framework for Protein Sequence Embeddings

Code submission for the paper 'PROVAL: A Framework for Comparison of Protein Sequence Embeddings'

PROVAL Setup

We recommend using a new Conda environment!
Install the PROVAL framework: pip install -e .[all]
(Optional) Install the Smith-Waterman alignment library:

git clone https://github.com/mengyao/Complete-Striped-Smith-Waterman-Library.git
cd Complete-Striped-Smith-Waterman-Library/src
make

Extension to Other Embedding Algorithms

Integration into embeddings.py

Load pretrained model

Add a function to embedding_utils.py that takes the train and test sequences as lists of Bio sequences (see read_fasta() in utils.py) and returns the vectors as a dictionary mapping each sequence ID (string) to its vector (NumPy array)

Add the approach to the embedding list (embeddings.py, line 17)

Add the embedding function call to the if/elif statements, following the form of the existing entries

Run embeddings.py and the respective comparison scripts
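The function described in the steps above could look roughly like the following sketch. It is only an illustration of the expected input and output shape: the name composition_embedding, the Record stand-in class (used here in place of a Biopython record so the snippet runs without extra dependencies), and the toy amino-acid frequency embedding are assumptions and are not part of PROVAL.

```python
from dataclasses import dataclass

import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


@dataclass
class Record:
    """Hypothetical stand-in for a Bio sequence record: only .id and .seq are used."""
    id: str
    seq: str


def composition_embedding(train_records, test_records):
    """Hypothetical embedding function: amino-acid frequency vectors.

    Returns a dict mapping sequence ID (str) to a NumPy array.
    """
    vectors = {}
    for record in list(train_records) + list(test_records):
        sequence = str(record.seq).upper()
        counts = np.array([sequence.count(aa) for aa in AMINO_ACIDS], dtype=float)
        # Normalise counts to frequencies; guard against empty sequences.
        vectors[str(record.id)] = counts / max(len(sequence), 1)
    return vectors


train = [Record("P1", "ACDA")]
test = [Record("P2", "WWYY")]
vectors = composition_embedding(train, test)
```

A real integration would replace the frequency computation with a call to the pretrained model and would receive the records returned by read_fasta() instead of the stand-in class.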

or

Custom integration through vector file

Load the train and test sequences as lists of Bio sequences (see read_fasta() in utils.py)

Use the custom embedding to compute the vector for each sequence, stored in the dictionary format id (string): vector (NumPy array)

Truncate the vectors to d=100 if necessary (see embeddings.py)

Save the dictionary as a pickle ('.p') file (see embeddings.py)
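The truncation and saving steps can be sketched as follows. The helper names truncate_vectors, save_vectors and load_vectors are hypothetical, and the sketch assumes that truncation means keeping the first d components; the file name and the exact layout expected by the comparison scripts should be taken from embeddings.py.

```python
import pickle

import numpy as np


def truncate_vectors(vectors, d=100):
    """Keep at most the first d components of each vector (assumed convention)."""
    return {seq_id: np.asarray(vec)[:d] for seq_id, vec in vectors.items()}


def save_vectors(vectors, path):
    """Write the id -> vector dictionary to a pickle file."""
    with open(path, "wb") as handle:
        pickle.dump(vectors, handle)


def load_vectors(path):
    """Read the id -> vector dictionary back from a pickle file."""
    with open(path, "rb") as handle:
        return pickle.load(handle)


# Toy vectors: one longer than d=100, one shorter.
example = {"P1": np.arange(128, dtype=float), "P2": np.ones(64)}
truncated = truncate_vectors(example)
```

Calling save_vectors(truncated, "custom_vectors.p") would then write the file; vectors shorter than d are left unchanged by this sketch.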

Full Reproducibility of the Paper Results

Note that the extraction of the vectors and the results may not be fully deterministic, so small deviations are possible.

Data set (optional)

Steps to reproduce the test.fasta and train.fasta files in the data/ folder:

Download the full SwissProt data set (release 02/2021):
https://ftp.uniprot.org/pub/databases/uniprot/previous_major_releases/release-2021_02/

Select the sequence IDs, the sequence strings and the molecular function information ('GO:xxxxxx' terms)

Discard all sequences with more than one molecular function (to reduce the complexity of the experiments)

Select 1000 random sequences for each of the 15 most frequent molecular functions (= 15,000 sequences)

Randomly split the sequences into training and test sets (70:30)

Save the sequences in the .fasta format (see the test.fasta and train.fasta files in the data folder):

<Sequence ID> [<GO-ID>]
<Sequence>
<Sequence ID> [<GO-ID>]
<Sequence>
...
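The filtering, sampling and splitting steps above can be sketched as follows. This is not the script used for the paper: the function name select_and_split, the tuple layout of the records, the fixed seed and the plain (non-stratified) shuffle before the 70:30 split are all assumptions made for illustration.

```python
import random
from collections import Counter


def select_and_split(records, n_functions=15, per_function=1000,
                     train_fraction=0.7, seed=0):
    """records: iterable of (sequence_id, sequence, go_terms) tuples."""
    rng = random.Random(seed)
    # Keep only sequences annotated with exactly one molecular function.
    single = [(sid, seq, terms[0]) for sid, seq, terms in records if len(terms) == 1]
    # Find the most frequent molecular functions.
    counts = Counter(term for _, _, term in single)
    top_terms = [term for term, _ in counts.most_common(n_functions)]
    # Draw a random sample of sequences for each selected function.
    selected = []
    for term in top_terms:
        candidates = [r for r in single if r[2] == term]
        selected.extend(rng.sample(candidates, min(per_function, len(candidates))))
    # Shuffle and split into training and test sets.
    rng.shuffle(selected)
    cut = round(train_fraction * len(selected))
    return selected[:cut], selected[cut:]


# Toy data: 30 single-function records over 3 functions, plus one
# multi-function record that should be discarded.
toy = [(f"S{i}", "MKV", [f"GO:{i % 3:07d}"]) for i in range(30)]
toy.append(("S_multi", "MKV", ["GO:0000001", "GO:0000002"]))
train, test = select_and_split(toy, n_functions=2, per_function=5)
```

With the default arguments and the full SwissProt annotations as input, the same function would yield 15,000 sequences, provided each of the 15 functions has at least 1000 single-function sequences.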

Embedding methods

Install the Smith-Waterman alignment library (see PROVAL Setup)

Run embeddings.py to obtain the vectors

Figures

Run dataset_metrics.py for optional data set plots

Run semantics.py for the classification results (Table 3)

Run visualization.py for the visualization results (Figure 7)

Run eigenspectrum_plot.py for the information theory results (Figure 8)

Citation

@article{VATH2022100044,
  title = {PROVAL: A framework for comparison of protein sequence embeddings},
  journal = {Journal of Computational Mathematics and Data Science},
  pages = {100044},
  year = {2022},
  issn = {2772-4158},
  doi = {https://doi.org/10.1016/j.jcmds.2022.100044},
  url = {https://www.sciencedirect.com/science/article/pii/S2772415822000128},
  author = {Philipp Väth and Maximilian Münch and Christoph Raab and F.-M. Schleif},
}
