This model is inspired by SPAAN (Software Program for prediction of Adhesins and Adhesin-like proteins using Neural network), originally described in this paper.
Bacterial adhesins were obtained by running a jackhmmer search, with default parameters, against the reference proteomes of eubacteria, using the domains listed in this file as queries. Non-adhesin proteins were obtained with the following UniProt query:
(taxonomy_id:2) AND (reviewed:true) NOT (keyword:KW-1217) NOT (keyword:KW-1233) NOT (keyword:KW-0130) NOT (cc_function:adhesion) NOT (cc_function:"cell adhesion")
A subset of the non-adhesin proteins was randomly selected to match the size of the adhesin dataset. Redundant sequences (at 60% and 25% identity thresholds) were removed using CD-HIT.
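The class-balancing step can be sketched as follows; the function name and the fixed seed are assumptions for illustration:

```python
import random

def balance_negatives(adhesins, non_adhesins, seed=42):
    """Randomly subsample the non-adhesin sequences so that both
    classes have the same size (the seed is an assumption, used
    only to make the subsampling reproducible)."""
    rng = random.Random(seed)
    return rng.sample(non_adhesins, k=len(adhesins))
```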
Features are computed with iFeature, so a parser of the iFeature output files is used to obtain the vectors that feed the model. The features are:
- AAC: amino acid composition
- DPC: dipeptide composition
- CTDC: composition (the C of the CTD descriptors)
- CTDT: transition (the T of the CTD descriptors)
- CTDD: distribution (the D of the CTD descriptors)
(See here for more information about what they are and how to compute them.)
... and here is a brief tutorial on how to compute them:
!rm -r iFeature
!git clone https://github.com/Superzchen/iFeature
!python iFeature/iFeature.py --file ./input.fasta --type AAC --out aac.out # amino acid composition
!python iFeature/iFeature.py --file ./input.fasta --type DPC --out dpc.out # dipeptide composition
!python iFeature/iFeature.py --file ./input.fasta --type CTDC --out ctdc.out # composition
!python iFeature/iFeature.py --file ./input.fasta --type CTDT --out ctdt.out # transition
!python iFeature/iFeature.py --file ./input.fasta --type CTDD --out ctdd.out # distribution
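iFeature writes tab-separated text files with a header row and the sequence ID in the first column; assuming that layout, a minimal sketch of the parser mentioned above could look like this (the function name and file list are illustrative, not the repository's actual API):

```python
import numpy as np

def parse_ifeature(path):
    """Parse an iFeature output file (assumed tab-separated, with a
    header row and the sequence ID in the first column) into a pair
    (ids, matrix of shape [n_sequences, n_features])."""
    ids, rows = [], []
    with open(path) as fh:
        next(fh)  # skip the header row
        for line in fh:
            fields = line.rstrip("\n").split("\t")
            ids.append(fields[0])
            rows.append([float(x) for x in fields[1:]])
    return ids, np.array(rows)

# Concatenating the five descriptor files gives one
# (20 + 400 + 39 + 39 + 195 = 693)-dimensional vector per sequence:
# files = ["aac.out", "dpc.out", "ctdc.out", "ctdt.out", "ctdd.out"]
# X = np.hstack([parse_ifeature(f)[1] for f in files])
```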
Since every sequence has a (20 + 400 + 39 + 39 + 195 = 693)-dimensional feature vector, we performed Principal Component Analysis (PCA) to reduce the dimensionality. Here are the results:
so we can keep just the first 350 components, reducing the dimensionality by about 50%.
We decided to use the smallest model possible: with just a 10-unit Dense layer and K = 400 components from PCA, we obtain the best results so far.
_________________________________________________________________
 Layer (type)                Output Shape              Param #
=================================================================
 input_11 (InputLayer)       [(None, 400)]             0
 dense_21 (Dense)            (None, 10)                4010
 dense_22 (Dense)            (None, 1)                 11
=================================================================
Total params: 4,021
Trainable params: 4,021
Non-trainable params: 0
_________________________________________________________________
test_loss = 0.214177668094635
test_accuracy = 0.9396551847457886
Notice that by removing the regularizers and increasing the number of neurons in the Dense layer, it is possible to obtain roughly the same results (slightly more overfitted) in about 20 epochs.
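The architecture summarised above can be sketched in Keras as below; the activation functions, the L2 regularizer strength, and the optimizer settings are assumptions (the summary only pins down the layer shapes and parameter counts):

```python
from tensorflow import keras

def build_model(k=400):
    """Minimal sketch of the model above: a K-dimensional PCA input,
    one 10-unit Dense layer, and a sigmoid output for binary
    (adhesin / non-adhesin) classification."""
    inputs = keras.Input(shape=(k,))
    x = keras.layers.Dense(
        10,
        activation="relu",  # assumed activation
        kernel_regularizer=keras.regularizers.l2(1e-3),  # assumed strength
    )(inputs)
    outputs = keras.layers.Dense(1, activation="sigmoid")(x)
    model = keras.Model(inputs, outputs)
    model.compile(optimizer="adam", loss="binary_crossentropy",
                  metrics=["accuracy"])
    return model
```

With k=400, the parameter count matches the summary: 400 × 10 + 10 = 4010 for the hidden layer, plus 10 + 1 = 11 for the output, giving 4,021 in total.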
You can follow every step in this notebook.