<a href="https://colab.research.google.com/github/nathanbollig/distributed-mutation/blob/main/explore_data.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Explore Synthetic Data and Classifier

In Phases 1 and 2 of this project, synthetic viral sequence data is generated and a classification model is trained on 80% of this data. Running the `phases_1_and_2.sh` script saves a set of result objects to disk as 6 pickle files. After Phases 1 and 2 have been run, this notebook loads these files and explores their contents.

In [1]:
import numpy as np
from tensorflow import keras
import pickle

## Phase 1 and 2

This section follows the instructions in the project README to generate data and train a classifier.

In [2]:
# Clone the git repository
!git clone https://github.com/nathanbollig/distributed-mutation

Cloning into 'distributed-mutation'...
remote: Enumerating objects: 52, done.
remote: Counting objects: 100% (52/52), done.
remote: Compressing objects: 100% (51/51), done.
remote: Total 52 (delta 24), reused 0 (delta 0), pack-reused 0
Unpacking objects: 100% (52/52), done.


In [3]:
# Apply requirements.txt
%cd distributed-mutation
!pip install -r requirements.txt

/content/distributed-mutation
Collecting hmmlearn==0.2.6
  Downloading hmmlearn-0.2.6-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.whl (374 kB)
Collecting Keras==2.2.4
  Downloading Keras-2.2.4-py2.py3-none-any.whl (312 kB)
Collecting scikit-learn==0.22.2
  Downloading scikit_learn-0.22.2-cp37-cp37m-manylinux1_x86_64.whl (7.1 MB)
Collecting tensorflow==1.15.5
  Downloading tensorflow-1.15.5-cp37-cp37m-manylinux2010_x86_64.whl (110.5 MB)
Collecting keras-applications>=1.0.6
  Downloading Keras_Applications-1.0.8-py3-none-any.whl (50 kB)
Collecting gast==0.2.2
  Downloading gast-0.2.2.tar.gz (10 kB)
Collecting tensorflow-estimator==1.15.1
  Downloading tensorflow_esti

The dependency conflicts reported by pip above are acceptable: they involve packages pre-installed in this environment that I don't believe are needed to run my code. The runtime may now need to be restarted, in which case the working directory must be changed again.

In [2]:
# Run script
%cd distributed-mutation
!bash phases_1_and_2.sh

/content/distributed-mutation
Using TensorFlow backend.
Running data generation and model training...
tcmalloc: large alloc 4800004096 bytes == 0x55e94f9c2000 @  0x7f27fb31a001 0x7f27f8ebe7b5 0x7f27f8f22c00 0x7f27f8f24a9f 0x7f27f8fbb078 0x55e92ca6f544 0x55e92ca6f240 0x55e92cae3627 0x55e92cadd9ee 0x55e92ca70bda 0x55e92cade915 0x55e92ca70afa 0x55e92cade915 0x55e92cadd9ee 0x55e92ca70bda 0x55e92cadf737 0x55e92cadd9ee 0x55e92cadd6f3 0x55e92cba74c2 0x55e92cba783d 0x55e92cba76e6 0x55e92cb7f163 0x55e92cb7ee0c 0x7f27fa102bf7 0x55e92cb7ecea
tcmalloc: large alloc 3840000000 bytes == 0x55ea6db66000 @  0x7f27fb3181e7 0x7f27f8ebe631 0x7f27f8f22cc8 0x7f27f8f22de3 0x7f27f8fadf06 0x7f27f8fae368 0x55e92cb57409 0x55e92cadee7a 0x55e92cadd9ee 0x55e92ca70bda 0x55e92cadf737 0x55e92cadd9ee 0x55e92ca70bda 0x55e92cade915 0x55e92cb61cf8 0x55e92cb57cae 0x55e92cb47ae5 0x55e92ca7e224 0x55e92caaf0a4 0x55e92ca6fc52 0x55e92cae2c25 0x55e92caddced 0x55e92ca70bda 0x55e92cadf737 0x55e92cadd9ee 0x55e92ca70bda 0x55e92cadf73

## Explore Phase 1 and 2 Output

### Read in results

In [3]:
# Load model
from tensorflow import keras
model = keras.models.load_model('model.tf')

Instructions for updating:
If using Keras pass *_constraint arguments to layers.


In [11]:
# Load pickled objects
with open("aa_vocab.pkl", 'rb') as pfile:
    aa_vocab = pickle.load(pfile)
with open("generator.pkl", 'rb') as pfile:
    gen = pickle.load(pfile)
with open("result.pkl", 'rb') as pfile:
    result = pickle.load(pfile)

# Load data objects
with open("data_train.pkl", 'rb') as pfile:
    data_train = pickle.load(pfile)
with open("data_val.pkl", 'rb') as pfile:
    data_val = pickle.load(pfile)
with open("data_test.pkl", 'rb') as pfile:
    data_test = pickle.load(pfile)

In [12]:
# Unpack data objects
X_train, y_train = data_train
X_val, y_val = data_val
X_test, y_test = data_test

In [13]:
def get_test_data():
    """Load integer-encoded test sequences and labels from `data_test.txt`.

    Each line holds comma-separated integers: the encoded sequence followed
    by the class label in the last field.
    """
    X = []
    y = []
    with open('data_test.txt') as f:
        for line in f:
            fields = line.strip().split(',')
            X.append([int(v) for v in fields[:-1]])  # sequence
            y.append(int(fields[-1]))                # label

    # Convert lists to arrays
    return np.array(X, dtype=int), np.array(y, dtype=int)

X, y = get_test_data()

In [14]:
X[5]

array([ 1,  6, 17,  1, 17,  1, 10,  2, 14,  8, 10, 19,  0, 11, 14, 10, 10,
        0, 10,  0, 11, 17, 15,  9,  0, 11, 16, 15, 13, 16,  5, 18, 10, 10,
       11, 19, 11,  5,  8,  2, 19,  5, 13, 15,  9, 10,  8, 13, 10, 14,  0,
        1,  3, 13,  2,  5, 19,  8, 16,  0])

In [15]:
y[5]

1

### Model and validation results

The Keras model object is stored in `model`.

In [16]:
type(model)

tensorflow.python.keras.engine.sequential.Sequential

During the training of this model, a dictionary was created of training set and validation set performance. We can display the values in this dictionary to see the accuracy of the stored model on its training and validation set.

In [17]:
result

{'model_train_accuracy': 0.979705, 'model_val_accuracy': 0.97897}

### Synthetic sequence data

The data are sequences (variable names prefixed with `X`) and their labels (prefixed with `y`). Data is split into training, validation, and test sets. The training set was used to train the model and the validation set was used to measure the performance reported in the `result` dictionary. The test set has not yet been used for model training or evaluation.

The data should be 80% training, 10% validation, and 10% test, with the total number of sequences as specified in the `phases_1_and_2.sh` script.

In [18]:
len(X_train), len(X_val), len(X_test)

(800000, 100000, 100000)

As expected, there are the same number of labels.

In [19]:
len(y_train), len(y_val), len(y_test)

(800000, 100000, 100000)

The sequence variables (`X_*`) are NumPy arrays, where each sequence is represented by a 60 x 20 matrix.

In [20]:
X_train.shape

(800000, 60, 20)

Each of the 60 positions in the sequence is represented by a one-hot vector of length 20. We assume that 20 is the size of the character alphabet. For example, a single sequence looks like the following matrix.

In [21]:
X_train[42]

array([[0., 1., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]], dtype=float32)
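
To make the relationship between integer codes and one-hot rows concrete, here is a minimal sketch that converts a short integer-encoded sequence into the one-hot layout described above. The helper name `to_one_hot` and the toy sequence are hypothetical and are not part of the project code.

```python
import numpy as np

ALPHABET_SIZE = 20  # size of the character alphabet, as assumed in the text

def to_one_hot(seq_indices, alphabet_size=ALPHABET_SIZE):
    """Convert 1-D integer codes into a (length, alphabet_size) one-hot matrix."""
    seq_indices = np.asarray(seq_indices, dtype=int)
    # Row i of the identity matrix is the one-hot vector for code i
    return np.eye(alphabet_size, dtype=np.float32)[seq_indices]

# Toy sequence of 5 positions (hypothetical values, not taken from the dataset)
toy_seq = [1, 6, 17, 0, 19]
one_hot = to_one_hot(toy_seq)

print(one_hot.shape)               # (5, 20)
print(one_hot.sum(axis=1))         # every row sums to 1
print(np.argmax(one_hot, axis=1))  # recovers the integer codes
```

Taking `np.argmax` along the last axis inverts the encoding, which is the same conversion used later in this notebook before calling the generator.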

The binary sequence label is an integer. The value 1 represents positive, and 0 represents negative. For example, the label for the above sequence is shown below.

In [22]:
y_train[42]

0

### Applying the model to a sequence

We can apply the stored model to a sequence to get a prediction, using the TensorFlow model object's API. For example, suppose we want to apply the model to the sequences at indices 42 and 43 in the training set.

In [23]:
model.predict(X_train[42:44])

Instructions for updating:
Use tf.where in 2.0, which has the same broadcast rule as np.where


array([[0.00095886],
       [0.01204966]], dtype=float32)

We see the prediction for each sequence as a number between 0 and 1, where values near 1 indicate higher confidence in the positive class. In this case, both predictions are close to 0, indicating high confidence that the sequences are negative. These predictions are correct, as the true labels below show.

In [24]:
y_train[42:44]

array([0, 0])
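
To compare predictions with labels programmatically, the probabilities can be thresholded. The sketch below reuses the two values printed above; the 0.5 threshold is an assumption on my part and is not stated anywhere in this notebook.

```python
import numpy as np

# Predicted probabilities and true labels copied from the cells above
probs = np.array([[0.00095886], [0.01204966]], dtype=np.float32)
labels = np.array([0, 0])

THRESHOLD = 0.5  # assumed decision threshold, not taken from the project

# Flatten the (n, 1) output and convert probabilities to 0/1 predictions
predicted = (probs.ravel() >= THRESHOLD).astype(int)
accuracy = float((predicted == labels).mean())

print(predicted)  # [0 0]
print(accuracy)   # 1.0
```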

### Amino acid vocabulary

The pickle files also include the amino acid vocabulary used for the encoding.

In [25]:
print(aa_vocab)

['A', 'R', 'N', 'D', 'C', 'E', 'Q', 'G', 'H', 'I', 'L', 'K', 'M', 'F', 'P', 'S', 'T', 'W', 'Y', 'V']


In [26]:
len(aa_vocab)

20

This associates an index in the range 0-19 (as described above in relation to the one-hot representation of sequences) to a specific character that reflects an amino acid in the biological sequence.
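
As an illustration of this mapping, the following sketch decodes a one-hot matrix back into an amino acid string. The `decode` helper and the toy input are hypothetical; only the vocabulary itself is copied from the output above.

```python
import numpy as np

# Vocabulary as printed above
aa_vocab = ['A', 'R', 'N', 'D', 'C', 'E', 'Q', 'G', 'H', 'I',
            'L', 'K', 'M', 'F', 'P', 'S', 'T', 'W', 'Y', 'V']

def decode(one_hot_seq, vocab=aa_vocab):
    """Map each one-hot row to its character and join into a string."""
    return ''.join(vocab[i] for i in np.argmax(one_hot_seq, axis=1))

# Toy 3-position sequence built from indices 11, 16 and 0 (hypothetical)
toy = np.eye(len(aa_vocab), dtype=np.float32)[[11, 16, 0]]
print(decode(toy))  # KTA
```

With the real data loaded, the same helper could be applied to a row such as `X_train[42]`.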

### Markov model (HMMGenerator object)

Finally, the pickle files include the `HMMGenerator` object used to synthesize the data.

In [27]:
type(gen)

HMM_generator_motif.HMMGenerator

This object has fields that determine the structure behind the synthesized data. These include the sequence length, where the active site starts in a sequence, how long the active site is, the class proportion, the intensity of the positive class signal (as described in my report), the emission probability distributions, the transition probabilities, and some others.

In [28]:
print(gen.__dict__.keys())

dict_keys(['seq_length', 'start', 'active_site_length', 'p', 'class_signal', 'aa_list', 'background_emission', 'state0_emission', 'state1_emissions', 'transmat', 'startprob', 'emissionprob', 'n_components', 'model'])


The `HMMGenerator` class is a wrapper around a Multinomial Hidden Markov Model implemented in `hmmlearn`. The field `gen.model` contains the `hmmlearn` model used to synthesize data.

In [29]:
type(gen.model)

hmmlearn.hmm.MultinomialHMM

## Using the generator to classify novel sequences

The `HMMGenerator` class is capable of predicting the class 1 probability of a sequence under the existing HMM. Suppose we take a test item.

In [50]:
x = X_test[52].copy()  # copy so the in-place mutations below do not alter X_test

In [51]:
y = y_test[52]
y

0

This is a negative instance. Make a mutation at position 25.

In [52]:
x

array([[0., 0., 0., ..., 0., 0., 1.],
       [0., 1., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]], dtype=float32)

In [53]:
x[25]

array([0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0.], dtype=float32)

In [54]:
old_char_idx = np.argmax(x[25])
x[25][old_char_idx]

1.0

In [55]:
x[25][old_char_idx] = 0.0

In [56]:
x[25][11] = 1.0

In [57]:
aa_vocab[16]

'T'

In [58]:
aa_vocab[11]

'K'

This substitutes 'K' (index 11 in `aa_vocab`) for the original character at position 25, whose index is stored in `old_char_idx`. Note that the `x[25]` output above has its 1 at index 5, which is `aa_vocab[5]` ('E'), so the `aa_vocab[16]` ('T') lookup does not match that output. Now we can predict the class label using the model (after reshaping into a batch of size 1) and under the generator model (after converting the one-hot encoding to a sequence of indices).

In [59]:
# Model prediction on mutation
model.predict(x.reshape(1,60,20))

array([[0.02638009]], dtype=float32)

In [60]:
# Generator posterior prob of positive class
gen.predict_proba(np.argmax(x, axis=1))

0.04532741398446242
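
For intuition about what a posterior class probability means, here is a toy sketch using Bayes' rule with two class-conditional models in which positions are independent. This is not the `HMMGenerator.predict_proba` implementation, whose internals are not shown in this notebook; the function name, the independence assumption, and all numbers are hypothetical.

```python
import numpy as np

def posterior_positive(seq, emis_pos, emis_neg, prior_pos):
    """P(class = 1 | seq) under a toy model with independent positions.

    emis_pos and emis_neg have shape (seq_length, alphabet_size); each row is
    a probability distribution over characters at that position.
    """
    seq = np.asarray(seq, dtype=int)
    positions = np.arange(len(seq))
    log_pos = np.log(prior_pos) + np.log(emis_pos[positions, seq]).sum()
    log_neg = np.log(1.0 - prior_pos) + np.log(emis_neg[positions, seq]).sum()
    # Normalize in log space for numerical stability
    return float(np.exp(log_pos - np.logaddexp(log_pos, log_neg)))

# Toy setup: 4 positions, 3-character alphabet (hypothetical numbers)
uniform = np.full((4, 3), 1.0 / 3.0)
biased = uniform.copy()
biased[2] = [0.8, 0.1, 0.1]  # positive class prefers character 0 at position 2

p_match = posterior_positive([1, 2, 0, 1], biased, uniform, prior_pos=0.5)
p_other = posterior_positive([1, 2, 1, 1], biased, uniform, prior_pos=0.5)
print(p_match)  # about 0.71
print(p_other)  # about 0.23
```

A single character at an informative position is enough to shift the posterior noticeably, which mirrors the effect of the targeted mutations in this section.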

Next, make a few more mutations that I know should make the sequence look positive. The generator then indicates that the sequence looks more like a positive one, and the model correctly predicts this as well.

In [61]:
import numpy as np

# Set the following characters starting at index 26: 'RSFIED'
chars = 'RSFIED'

for i in range(26, 32):
    # Get the new char index
    char = chars[i-26]
    new_char_idx = aa_vocab.index(char)

    # Reset current char to 0
    x[i][np.argmax(x[i])] = 0.0

    # Make substitution for new char
    x[i][new_char_idx] = 1.0

In [62]:
# Generator posterior prob of positive class
gen.predict_proba(np.argmax(x, axis=1))

0.9999998716679468

In [63]:
# Model prediction on mutation
model.predict(x.reshape(1,60,20))

array([[0.9999999]], dtype=float32)
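
The substitutions above can be wrapped in a small helper that returns a mutated copy instead of editing the array in place. This is a sketch only: the `substitute` function and the all-'A' toy sequence are hypothetical and not part of the project code.

```python
import numpy as np

# Vocabulary as printed earlier in this notebook
aa_vocab = ['A', 'R', 'N', 'D', 'C', 'E', 'Q', 'G', 'H', 'I',
            'L', 'K', 'M', 'F', 'P', 'S', 'T', 'W', 'Y', 'V']

def substitute(x, start, chars, vocab=aa_vocab):
    """Return a copy of one-hot sequence x with `chars` written from `start`."""
    mutated = x.copy()
    for offset, char in enumerate(chars):
        mutated[start + offset] = 0.0                       # clear the old character
        mutated[start + offset, vocab.index(char)] = 1.0    # set the new character
    return mutated

# Toy 60 x 20 sequence consisting only of 'A' (hypothetical, not from the dataset)
x_toy = np.zeros((60, 20), dtype=np.float32)
x_toy[:, 0] = 1.0

x_mut = substitute(x_toy, 25, 'KRSFIED')
decoded = ''.join(aa_vocab[i] for i in np.argmax(x_mut[25:32], axis=1))
print(decoded)  # KRSFIED
```

Returning a copy keeps the original test item available for comparison against its mutated version.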

### Copy pickle files to Google Drive

For this to work, Google Drive must first be mounted and must contain a folder called '744'.
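
A minimal sketch of that setup, assuming the notebook runs in Colab and that Drive is mounted at `/content/drive` (the mount point is my assumption, chosen so that the relative path `../drive/MyDrive/744/` used below resolves from `/content/distributed-mutation`):

```python
import os

DRIVE_ROOT = "/content/drive"  # assumed mount point
TARGET_DIR = os.path.join(DRIVE_ROOT, "MyDrive", "744")

try:
    from google.colab import drive  # only available inside Colab
    drive.mount(DRIVE_ROOT)
    os.makedirs(TARGET_DIR, exist_ok=True)  # create the '744' folder if missing
except ImportError:
    print("Not running in Colab; skipping Drive mount.")
```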

In [64]:
# Copy files to Drive

!cp *.* ../drive/MyDrive/744/

Create a shortened version of the test data and copy it to Drive.

In [66]:
from itertools import islice

N = 100  # number of lines to keep

# Copy the first N lines of the test data into a shorter file
with open('data_test.txt') as f_in, open('data_test_short.txt', 'w') as f_out:
    f_out.writelines(islice(f_in, N))

In [67]:
!cp data_test_short.txt ../drive/MyDrive/744/