This model is inspired by SPAAN (Software Program for prediction of Adhesins and Adhesin-like proteins using Neural network), originally described in this paper.
Bacterial adhesins were obtained by running a jackhmmer search, with default parameters, against the reference proteomes of eubacteria, using the domains listed in this file as queries. Non-adhesin proteins were obtained with the following UniProt query:
(taxonomy_id:2) AND (reviewed:true) NOT (keyword:KW-1217) NOT (keyword:KW-1233) NOT (keyword:KW-0130) NOT (cc_function:adhesion) NOT (cc_function:"cell adhesion")
A subset of the non-adhesin proteins was randomly selected to match the size of the adhesin dataset. Redundant sequences (at 60% and 25% identity thresholds) were removed using CD-HIT.
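The class-balancing step can be sketched as follows; the function name and the fixed seed are assumptions for illustration:

```python
import random

def balance_negatives(adhesins, non_adhesins, seed=42):
    """Randomly subsample the non-adhesin sequences so that both
    classes have the same size (the seed is an assumption, used
    only to make the subsampling reproducible)."""
    rng = random.Random(seed)
    return rng.sample(non_adhesins, k=len(adhesins))
```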
Features are computed with iFeature, so a parser of the iFeature output files is used to obtain the vectors that feed the model. The features are:
- AAC: amino acid composition
- DPC: dipeptide composition
- CTDC: composition (the C of the CTD descriptors)
- CTDT: transition (the T of the CTD descriptors)
- CTDD: distribution (the D of the CTD descriptors)
(See here for more information about what they are and how to compute them.)
... and here is a brief tutorial on how to compute them:
!rm -r iFeature
!git clone https://github.com/Superzchen/iFeature
!python iFeature/iFeature.py --file ./input.fasta --type AAC --out aac.out # amino acid composition
!python iFeature/iFeature.py --file ./input.fasta --type DPC --out dpc.out # dipeptide composition
!python iFeature/iFeature.py --file ./input.fasta --type CTDC --out ctdc.out # composition
!python iFeature/iFeature.py --file ./input.fasta --type CTDT --out ctdt.out # transition
!python iFeature/iFeature.py --file ./input.fasta --type CTDD --out ctdd.out # distribution
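iFeature writes tab-separated text files with a header row and the sequence ID in the first column; assuming that layout, a minimal sketch of the parser mentioned above could look like this (the function name and file list are illustrative, not the repository's actual API):

```python
import numpy as np

def parse_ifeature(path):
    """Parse an iFeature output file (assumed tab-separated, with a
    header row and the sequence ID in the first column) into a pair
    (ids, matrix of shape [n_sequences, n_features])."""
    ids, rows = [], []
    with open(path) as fh:
        next(fh)  # skip the header row
        for line in fh:
            fields = line.rstrip("\n").split("\t")
            ids.append(fields[0])
            rows.append([float(x) for x in fields[1:]])
    return ids, np.array(rows)

# Concatenating the five descriptor files gives one
# (20 + 400 + 39 + 39 + 195 = 693)-dimensional vector per sequence:
# files = ["aac.out", "dpc.out", "ctdc.out", "ctdt.out", "ctdd.out"]
# X = np.hstack([parse_ifeature(f)[1] for f in files])
```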
Since every sequence has a (20 + 400 + 39 + 39 + 195 = 693)-dimensional feature vector, we performed Principal Component Analysis (PCA) to reduce the dimensionality. Here are the results:
so we can keep just the first 350 components, reducing the dimensionality by about 50%.
We decided to use the smallest model possible: with just a 10-unit Dense layer and K = 400 components from PCA, we obtain the best results so far.
_________________________________________________________________
 Layer (type)                Output Shape              Param #
=================================================================
 input_11 (InputLayer)       [(None, 400)]             0
 dense_21 (Dense)            (None, 10)                4010
 dense_22 (Dense)            (None, 1)                 11
=================================================================
Total params: 4,021
Trainable params: 4,021
Non-trainable params: 0
_________________________________________________________________
test_loss = 0.214177668094635
test_accuracy = 0.9396551847457886
Notice that by removing the regularizers and increasing the number of neurons in the Dense layer, it is possible to obtain roughly the same results (slightly more overfitted) in about 20 epochs.
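The architecture summarised above can be sketched in Keras as below; the activation functions, the L2 regularizer strength, and the optimizer settings are assumptions (the summary only pins down the layer shapes and parameter counts):

```python
from tensorflow import keras

def build_model(k=400):
    """Minimal sketch of the model above: a K-dimensional PCA input,
    one 10-unit Dense layer, and a sigmoid output for binary
    (adhesin / non-adhesin) classification."""
    inputs = keras.Input(shape=(k,))
    x = keras.layers.Dense(
        10,
        activation="relu",  # assumed activation
        kernel_regularizer=keras.regularizers.l2(1e-3),  # assumed strength
    )(inputs)
    outputs = keras.layers.Dense(1, activation="sigmoid")(x)
    model = keras.Model(inputs, outputs)
    model.compile(optimizer="adam", loss="binary_crossentropy",
                  metrics=["accuracy"])
    return model
```

With k=400, the parameter count matches the summary: 400 × 10 + 10 = 4010 for the hidden layer, plus 10 + 1 = 11 for the output, giving 4,021 in total.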
You can follow every step in this notebook.