Amino Acid Vector Embeddings, Similarity Scores, and Protein Subcellular Localization

The unique sequence of amino acids that makes up a protein imparts to it distinct physical and chemical properties. Inspired by ideas from NLP such as word2vec and sequence-based models, we create vector embeddings of amino acids that encode contextual information and meaningful biochemical properties. We use these vector embeddings to compute substitution matrices for protein sequence alignment. We also use these embeddings together with sequence-based models to predict protein subcellular localization.
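
To make the link between embeddings and similarity scores concrete, here is a minimal sketch of one way to turn per-residue vectors into a pairwise score table. It assumes cosine similarity as the scoring function, which is an illustrative choice and not necessarily how the scripts in this repository derive their matrices. The vectors in toy_embeddings are invented for the example and do not come from any trained model.

```python
import math

# Toy 3-dimensional vectors, invented for illustration only; they are not
# taken from any trained model in this repository.
toy_embeddings = {
    "L": [0.9, 0.1, 0.0],
    "I": [0.8, 0.2, 0.1],
    "D": [-0.7, 0.6, 0.2],
    "E": [-0.6, 0.7, 0.1],
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def similarity_matrix(embeddings):
    """Return a nested dict with scores[a][b] for every residue pair."""
    residues = sorted(embeddings)
    return {
        a: {b: cosine(embeddings[a], embeddings[b]) for b in residues}
        for a in residues
    }


matrix = similarity_matrix(toy_embeddings)

# Print the table row by row, one residue per row.
for a in sorted(matrix):
    print(a, " ".join(f"{matrix[a][b]:6.2f}" for b in sorted(matrix)))
```

With these toy vectors, the hydrophobic pair L/I and the acidic pair D/E score higher against each other than across the two groups, which is the kind of structure a substitution matrix is meant to capture.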

Usage:

Download datafiles:
1. PDB. Download the PDB file from http://ftp.wwpdb.org/pub/pdb/derived_data/pdb_seqres.txt, and rename it to pdb.fasta.
2. UniProt. Download the SwissProt file from https://www.uniprot.org/downloads, and rename it to uniprot.fasta.
3. DeepLoc. Download the DeepLoc file from http://www.cbs.dtu.dk/services/DeepLoc-1.0/deeploc_data.fasta, and rename it to deeploc.fasta.
4. Place these three files in a folder named data/.
Extract amino acid sequences from PDB and UniProt datafiles: python extractSeq.py
Train amino acid vector embeddings: python models.py
Use the embeddings to create and visualize a substitution matrix, calculate relative entropy, and view PCA and t-SNE plots: python usemodel.py
Train a model for subcellular localization classification, specifying the embedding model to use: python subcell.py wvs_cbow_pdb_19
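
As an illustration of the extraction step listed above, the following sketch reads FASTA records and keeps only protein sequences built from the 20 standard amino acids. It is not the code in extractSeq.py; the function names read_fasta and protein_sequences, the require_tag parameter, and the example records are all hypothetical. The optional tag check relies on the mol:protein / mol:na markers that appear in pdb_seqres.txt headers; UniProt headers have no such marker, so the check would be skipped there.

```python
STANDARD_AA = set("ACDEFGHIKLMNPQRSTVWY")


def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def protein_sequences(records, require_tag=None):
    """Keep records whose sequence uses only the 20 standard residues.

    If require_tag is given, the header must also contain it. This matters
    for nucleic acid entries, whose letters (A, C, G, T) are also valid
    amino acid codes and would otherwise pass the alphabet check.
    """
    for header, seq in records:
        if require_tag is not None and require_tag not in header:
            continue
        if seq and set(seq) <= STANDARD_AA:
            yield header, seq


# Made-up records in the style of pdb_seqres.txt, for illustration only.
example = """>toy1_A mol:protein length:10  TOY PROTEIN
MVLSEGEWQL
>toy2_B mol:na length:4  TOY DNA
ACGT
>toy3_A mol:protein length:5  TOY PROTEIN WITH UNKNOWN RESIDUE
MKXLV
""".splitlines()

kept = list(protein_sequences(read_fasta(example), require_tag="mol:protein"))
for header, seq in kept:
    print(header.split()[0], seq)
```

In practice the lines argument could be an open file handle such as open("data/pdb.fasta"), since the parser only needs an iterable of lines.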

murhekar/aa-vec-embeddings

Folders and files

Latest commit

History

Repository files navigation

Amino Acid Vector Embeddings, Similarity Scores, and Protein Subcellular Localization

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages