# MetOx prediction using MusiteDeep

In this notebook, we use the MusiteDeep framework to predict methionine-oxidation (MetOx) sites. We first pre-train the MusiteDeep model on general phosphorylation data and then fine-tune it on our methionine-oxidation dataset.

In [1]:
!python -V

Python 2.7.15 :: Anaconda, Inc.


## Pre-trained models

We pre-train a custom general model using both the training and testing general phosphorylation datasets.

First, we check the total number of residues contained in the training dataset alone:

In [4]:
# Positive residues (in the original article, 36284 are reported)
!cat ../testdata/training_proteins_nonredundant_STY.fasta | grep ">sp" -v | grep "#" -o | wc -l

36284


In [5]:
# Positive and negative residues (in the original article, 841448 are reported)
!cat ../testdata/training_proteins_nonredundant_STY.fasta | grep ">sp" -v | grep "S|T|Y" -Eo | wc -l

840198


In [6]:
# Negative Tyr residues (in the original article, 128007 are reported)
a = !cat ../testdata/training_proteins_nonredundant_STY.fasta | grep ">sp" -v | grep "Y" -o | wc -l
p = !cat ../testdata/training_proteins_nonredundant_STY.fasta | grep ">sp" -v | grep "Y#" -o | wc -l
int(a[0]) - int(p[0])

126757

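Compared with the training dataset described in the original article, this dataset is missing 1250 negative Tyr sites. This also accounts for the lower total count of S, T and Y residues (840198 versus the 841448 reported).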
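The shell counts above can be cross-checked in Python. The sketch below assumes the annotation convention that the `grep` commands rely on: header lines start with `>` and a `#` placed directly after a residue marks it as a positive site. `count_sites` is a hypothetical helper written for this notebook, not part of MusiteDeep.

```python
from collections import Counter


def count_sites(fasta_lines, residues="STY"):
    """Count positive (followed by '#') and negative sites per residue type.

    Assumes header lines start with '>' and that a '#' directly after a
    residue marks it as a positive site, as the grep commands above do.
    """
    positives, negatives = Counter(), Counter()
    sequence = []

    def flush():
        # Join wrapped sequence lines so a '#' at a line start is not missed
        seq = "".join(sequence)
        for i, aa in enumerate(seq):
            if aa in residues:
                if i + 1 < len(seq) and seq[i + 1] == "#":
                    positives[aa] += 1
                else:
                    negatives[aa] += 1
        del sequence[:]

    for line in fasta_lines:
        line = line.strip()
        if line.startswith(">"):
            flush()
        elif line:
            sequence.append(line)
    flush()
    return positives, negatives


# Toy record (made up for illustration, not taken from the dataset)
example = [">sp|P00000|TOY", "MS#TAY#KLS", "YT#"]
pos, neg = count_sites(example)
```

To apply it to the real data, an open file handle for `../testdata/training_proteins_nonredundant_STY.fasta` can be passed as `fasta_lines`. Unlike the line-based `grep` pipeline, this version joins wrapped sequence lines before counting, so the two approaches should agree only if no annotated residue is split across lines.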

Next, we check that no protein is contained in both the training and testing datasets, based on their IDs (the number of header lines must not change after removing duplicates):

In [8]:
!cat ../testdata/training_testing_proteins_nonredundant_STY.fasta | grep ">sp" | wc -l

7627


In [7]:
!cat ../testdata/training_testing_proteins_nonredundant_STY.fasta | grep ">sp" | sort | uniq | wc -l

7627
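The same uniqueness check can be written in Python, which also reports which headers repeat, if any. This is only a sketch: `duplicated_headers` is a hypothetical helper, and like the `sort | uniq` comparison it compares whole header lines rather than parsed accession IDs.

```python
from collections import Counter


def duplicated_headers(fasta_lines, prefix=">sp"):
    """Return the header lines that occur more than once."""
    counts = Counter(
        line.strip() for line in fasta_lines if line.startswith(prefix)
    )
    return sorted(h for h, n in counts.items() if n > 1)


# Toy input (made up): the second record repeats the first header
toy = [">sp|P1|A", "MSTY", ">sp|P1|A", "MKT", ">sp|P2|B", "MY"]
dups = duplicated_headers(toy)
```

An empty result corresponds to the situation observed above, where both counts are equal.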


Then, we check the total number of residues contained in the joined dataset:

In [9]:
# Positive residues (in the original article, 38405 are reported)
!cat ../testdata/training_testing_proteins_nonredundant_STY.fasta | grep ">sp" -v | grep "#" -o | wc -l

38405


In [10]:
# Tyr positive residues (in the original article, 1930 are reported)
!cat ../testdata/training_testing_proteins_nonredundant_STY.fasta | grep ">sp" -v | grep "Y#" -o | wc -l

1930


In [11]:
# Tyr negative residues (in the original article, 137181 are reported)
a = !cat ../testdata/training_testing_proteins_nonredundant_STY.fasta | grep ">sp" -v | grep "Y" -o | wc -l
p = !cat ../testdata/training_testing_proteins_nonredundant_STY.fasta | grep ">sp" -v | grep "Y#" -o | wc -l
int(a[0]) - int(p[0])

135931

In [12]:
# Positive and negative residues (in the original article, 913623 are reported)
!cat ../testdata/training_testing_proteins_nonredundant_STY.fasta | grep ">sp" -v | grep -Eo "S|T|Y" | wc -l

912373


Compared with the dataset described in the original article, the joined dataset is missing the same 1250 negative Tyr sites of the training set, as stated above.
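The discrepancy can be verified from the counts printed in the cells above. The snippet below only restates the reported and observed numbers from this notebook and computes their differences; the dictionary labels are our own shorthand.

```python
# (reported in the original article, observed in this notebook)
counts = {
    "training, all S/T/Y sites": (841448, 840198),
    "training, negative Tyr": (128007, 126757),
    "joined, negative Tyr": (137181, 135931),
    "joined, all S/T/Y sites": (913623, 912373),
}

differences = {name: reported - observed
               for name, (reported, observed) in counts.items()}
for name in sorted(differences):
    print("%s: %d" % (name, differences[name]))
```

Every difference is 1250, while the positive counts (36284, 38405 and 1930) match the article exactly, which is consistent with the gap coming only from negative Tyr sites.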

General model pre-trained on S, T residues:

In [None]:
%%time
!python ../MusiteDeep_Keras2.0/MusiteDeep/train_general.py -input ../testdata/training_testing_proteins_nonredundant_STY.fasta -output-prefix all-phos-data/models/pre-train/custom_general_ST -residue-types S,T -nclass=5

General model pre-trained on S, T, Y residues:

In [None]:
%%time
!python ../MusiteDeep_Keras2.0/MusiteDeep/train_general.py -input ../testdata/training_testing_proteins_nonredundant_STY.fasta -output-prefix all-phos-data/models/pre-train/custom_general_STY -residue-types S,T,Y -nclass=5

## Fine-tuned models

We now fine-tune the previously pre-trained custom models on methionine-oxidation data:

### Using bootstrap strategy

Using STY-residues:

In [13]:
!cat train_bootstrap_BS_custom_model.sh

#!/bin/bash


# ----- Arguments management -----

while [[ $# -gt 0 ]]
do

key="$1"
case $key in
    -l|--learning-rate)
    LR=$2
    shift # past argument
    shift # past value
    ;;
    -t|--transfer-leayer)
    TL=$2
    shift # past argument
    shift # past value
    ;;
    -n|--nclass)
    N=$2
    shift # past argument
    shift # past value
    ;;
    -b|--background)
    B=$2
    shift # past argument
    shift # past value
    ;;
    -p|--phospho-residue)
    PH=$2
    shift # past argument
    shift # past value
    ;;
    -f|--fine-output)
    FT=$2
    shift # past argument
    shift # past value
    ;;
    -r|--result-output)
    RES=$2
    shift # past argument
    shift # past value
    ;;
    -h|--help)
	echo "Usage: $0 [OPTION]...
Example: $0 -l 0.003 -t 5

Arguments:
 -l, --learning-rate=NUM	Learning rate value used when fine-tuning
 				the model. It should be a FLOAT value.
 				Default value is 0.001
 -t, --t

In [None]:
%%time
!./train_bootstrap_BS_custom_model.sh

Using ST-residues:

In [None]:
%%time
!./train_bootstrap_BS_custom_model.sh -p ST

#### Transfer-layer and learning-rate

Using lr=0.00075:

In [None]:
%%time
!./train_bootstrap_BS_custom_model.sh -l 0.00075

Using lr=0.0005:

In [None]:
%%time
!./train_bootstrap_BS_custom_model.sh -l 0.0005

Using lr=0.00025:

In [None]:
%%time
!./train_bootstrap_BS_custom_model.sh -l 0.00025

Using lr=0.00125:

In [None]:
%%time
!./train_bootstrap_BS_custom_model.sh -l 0.00125

Using lr=0.0015:

In [None]:
%%time
!./train_bootstrap_BS_custom_model.sh -l 0.0015

Using tl=0:

In [None]:
%%time
!./train_bootstrap_BS_custom_model.sh -t 0

Using tl=3:

In [None]:
%%time
!./train_bootstrap_BS_custom_model.sh -t 3

Using tl=5:

In [None]:
%%time
!./train_bootstrap_BS_custom_model.sh -t 5

Using lr=0.00025 and tl=3:

In [None]:
%%time
!./train_bootstrap_BS_custom_model.sh -l 0.00025 -t 3

Using lr=0.00025 and tl=0:

In [None]:
%%time
!./train_bootstrap_BS_custom_model.sh -l 0.00025 -t 0

Using lr=0.0005 and tl=3:

In [None]:
%%time
!./train_bootstrap_BS_custom_model.sh -l 0.0005 -t 3

Using lr=0.0005 and tl=0:

In [None]:
%%time
!./train_bootstrap_BS_custom_model.sh -l 0.0005 -t 0
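
The runs above differ only in their `-l` and `-t` values, so they could also be driven from a single loop. The sketch below builds the same command lines as the last four cells; `build_commands` is a hypothetical helper, and passing `None` omits the flag so that the script falls back to its default. Learning rates are kept as strings so they reach the script exactly as typed.

```python
import subprocess

SCRIPT = "./train_bootstrap_BS_custom_model.sh"


def build_commands(settings, script=SCRIPT):
    """Turn (learning_rate, transfer_layer) pairs into argument lists.

    None for either value omits the corresponding flag.
    """
    commands = []
    for lr, tl in settings:
        cmd = [script]
        if lr is not None:
            cmd += ["-l", str(lr)]
        if tl is not None:
            cmd += ["-t", str(tl)]
        commands.append(cmd)
    return commands


# The combined settings run in the last four cells
settings = [("0.00025", 3), ("0.00025", 0), ("0.0005", 3), ("0.0005", 0)]
commands = build_commands(settings)

RUN = False  # set to True to actually launch the (long-running) trainings
if RUN:
    for cmd in commands:
        subprocess.check_call(cmd)
```

With `RUN` left as `False`, the block only constructs the argument lists, which makes it easy to inspect the grid before committing to the training time.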