 **IBM Virus Species Jump Hackathon at Deep Learning IndabaX 2019 in Durban**

Goal: Predict whether an unknown animal virus could potentially infect humans given its genome (DNA or RNA) sequence.

Importance:

Many dangerous human viruses, both old and new, emerged from animals, and the emergence of new viruses is widely regarded as one of the biggest threats facing humanity. The 1918 influenza pandemic killed an estimated 50 million people worldwide. Ebola virus is thought to have come from bats, HIV from non-human primates, and SARS coronavirus most likely from bats via an intermediate host, among many other examples. Today we can sequence thousands of animal viruses, but it remains hard to identify which of the tens of thousands of viruses present in wild and domestic animals could cross over to infect humans. A computational model that predicts whether an animal virus can infect humans would therefore be of great value.

Training Data: Genome sequences of 70 viruses that can be easily transmitted from animals to humans, e.g. Ebola (class label: Zoonotic), and sequences of another 70 viruses that cannot be easily transmitted from animals to humans (class label: Non-Zoonotic).

Each genome sequence is a string over a four-letter alphabet (A, G, C, T), and each virus sequence is between 2,000 and 10,000 characters long. If you feel the training dataset is too small, one option is to fragment each viral genome into smaller pieces, which can easily produce 5,000 to 10,000 samples per class (for example, splitting each of the 70 genomes in a class into 100 pieces gives 7,000 samples).
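The fragmentation idea can be sketched as follows. This is a minimal sketch, not the organizers' method: `fragment_sequence`, `fragment_dataset` and the `n_pieces` parameter are hypothetical names, and the sketch assumes contiguous, non-overlapping pieces that each inherit the label of their parent genome.

```python
def fragment_sequence(seq, n_pieces=100):
    """Split seq into contiguous, non-overlapping fragments.

    Hypothetical helper: fragment lengths differ by at most one character.
    A sequence shorter than n_pieces yields one fragment per character.
    """
    if not seq:
        return []
    n_pieces = max(1, min(n_pieces, len(seq)))
    base, extra = divmod(len(seq), n_pieces)
    fragments = []
    start = 0
    for i in range(n_pieces):
        # The first `extra` fragments get one additional character.
        end = start + base + (1 if i < extra else 0)
        fragments.append(seq[start:end])
        start = end
    return fragments


def fragment_dataset(sequences, labels, n_pieces=100):
    """Return (fragment, label, genome_index) triples for every genome."""
    rows = []
    for genome_index, (seq, label) in enumerate(zip(sequences, labels)):
        for frag in fragment_sequence(seq, n_pieces):
            rows.append((frag, label, genome_index))
    return rows
```

The `genome_index` is kept so that all fragments of one genome can be placed in the same train or validation split; mixing fragments of a single genome across splits would likely inflate validation scores.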

Test Data: 60 genome sequences (30 from each class) with the class labels hidden from participants but known to the hackathon organizers.

Evaluation: Participants must provide a deep learning model and their predictions for each provided test sample. Evaluation will be based on the following two criteria:

    Most innovative model
    Performance based on AUC / precision-recall curves
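To make the performance criterion concrete, here is a minimal pure-Python sketch of ROC AUC and of precision and recall at a fixed threshold. The function names are hypothetical, and the sketch treats label 1 as the positive class, which is an assumption; the organizers' exact scoring convention is not stated here. AUC is most informative when computed from continuous scores (e.g. predicted probabilities) rather than hard 0/1 labels.

```python
def roc_auc(y_true, y_score):
    """ROC AUC: probability that a random positive outscores a random negative.

    A tie between a positive and a negative score counts as half a win.
    """
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    if not pos or not neg:
        raise ValueError("Both classes must be present to compute AUC.")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def precision_recall(y_true, y_score, threshold=0.5):
    """Precision and recall when scores >= threshold are predicted as class 1."""
    tp = fp = fn = 0
    for y, s in zip(y_true, y_score):
        predicted = 1 if s >= threshold else 0
        if predicted == 1 and y == 1:
            tp += 1
        elif predicted == 1 and y == 0:
            fp += 1
        elif predicted == 0 and y == 1:
            fn += 1
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall
```

Sweeping `threshold` over the distinct score values and collecting the resulting pairs traces out a precision-recall curve.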

Let's get started. Import all the necessary Python libraries.

In [1]:
import tensorflow as tf
import numpy as np
import pandas as pd
import keras
import string
import re
import matplotlib

Using TensorFlow backend.


Load the data uploaded to the current Colab instance under /content/sample_data/

In [0]:
with open("/content/sample_data/NonZoonoticVirusesTrain.fasta") as f:
    NonZoo_raw_data = f.read()

with open("/content/sample_data/ZoonoticVirusSequencesTrain.fasta") as g:
    Zoo_raw_data = g.read()

with open("/content/sample_data/VirusesTestInput.fasta") as h:
    test_raw_data = h.read()

Preprocessing the data.

In [0]:
NonZoodata = NonZoo_raw_data.split(">")
Zoodata = Zoo_raw_data.split(">")
testdata = test_raw_data.split(">")

# Drop the empty string at position [0] (the text before the first ">")
NonZoodata2 = NonZoodata[1:]
Zoodata2 = Zoodata[1:]
testdata2 = testdata[1:]

Separate names and sequences into two lists.

In [0]:
NonZoo_new_data = [x.split(",") for x in NonZoodata2]
Zoo_new_data = [x.split(",") for x in Zoodata2]
test_new_data = [x.split(",") for x in testdata2]

NonZoo_id_name = []
Zoo_id_name = []
test_id_name = []
NonZoo_genome_sequence = []
Zoo_genome_sequence = []
test_genome_sequence = []

for x in NonZoo_new_data:
    NonZoo_id_name.append(x[0])
    NonZoo_genome_sequence.append(x[1])
    
for x in Zoo_new_data:
    Zoo_id_name.append(x[0])
    Zoo_genome_sequence.append(x[1])
    
for x in test_new_data:
    test_id_name.append(x[0])
    test_genome_sequence.append(x[1])
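The comma split above works only if every header contains exactly one comma. As a hedged alternative, assuming the files follow the usual FASTA layout (a `>` header line followed by one or more sequence lines), each record can instead be split at its first line break. `parse_fasta` is a hypothetical helper, not part of the original notebook.

```python
def parse_fasta(raw_text):
    """Parse FASTA text into a list of (header, sequence) tuples.

    The header is the full '>' line (without the '>'); the sequence is the
    concatenation of all following lines up to the next '>'.
    """
    records = []
    for chunk in raw_text.split(">"):
        chunk = chunk.strip()
        if not chunk:
            continue  # skip the empty piece before the first '>'
        lines = chunk.splitlines()
        header = lines[0].strip()
        sequence = "".join(line.strip() for line in lines[1:])
        records.append((header, sequence))
    return records
```

With this approach the header keeps any text after a comma, so the descriptions would differ slightly from those produced by the cells in this notebook.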

Separate into ID, Description and Sequence. In this data, the ID can be split off at the first occurrence of a space.

In [0]:
def clean_id_names(text):
    id_and_names = text.split(' ',1)
    return id_and_names

In [0]:
ZooID_Description = []
NonZooID_Description = []
testID_Description = []
for x in Zoo_id_name:
    ZooID_Description.append(clean_id_names(x))
for x in NonZoo_id_name:
    NonZooID_Description.append(clean_id_names(x)) 
for x in test_id_name:
    testID_Description.append(clean_id_names(x))

In [0]:
Zoo_ID = []
Zoo_Description = []

for x in ZooID_Description:
    Zoo_ID.append(x[0])
    Zoo_Description.append(x[1])
    
NonZoo_ID = []
NonZoo_Description = []

for x in NonZooID_Description:
    NonZoo_ID.append(x[0])
    NonZoo_Description.append(x[1])
    
test_ID = []
test_Description = []

for x in testID_Description:
    test_ID.append(x[0])
    test_Description.append(x[1])

Clean the sequences.

In [0]:
def clean_text(text):
    # Drop all lower-case letters, keeping the upper-case sequence letters
    text = re.sub('[a-z]', '', text)
    # Remove surrounding whitespace and the line breaks inside the sequence
    text = text.strip()
    text = text.replace('\n', '')
    return text

In [0]:
NonZoo_clean_sequences = []
Zoo_clean_sequences = []
test_clean_sequences = []
for seq in NonZoo_genome_sequence:
    NonZoo_clean_sequences.append(clean_text(seq))
    
for seq in Zoo_genome_sequence:
    Zoo_clean_sequences.append(clean_text(seq))
    
for seq in test_genome_sequence:
    test_clean_sequences.append(clean_text(seq))
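Because `clean_text` only removes lower-case letters and line breaks, it may be worth checking that the cleaned sequences contain nothing but A, C, G and T. The sketch below is an optional sanity check; `unexpected_symbols` is a hypothetical helper, and whether this dataset contains other symbols (such as the ambiguity code N, or stray spaces and upper-case header letters) is not known from the notebook.

```python
from collections import Counter


def unexpected_symbols(sequence, alphabet="ACGT"):
    """Count the characters in sequence that are not in the expected alphabet."""
    return Counter(ch for ch in sequence if ch not in alphabet)
```

Calling it on each entry of `Zoo_clean_sequences`, `NonZoo_clean_sequences` and `test_clean_sequences` and printing any non-empty result shows which sequences need further cleaning.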

Now we create two lists for classification, zeros and ones.

In [0]:
def zerolistmaker(n):
    listofzeros = [0] * n
    return listofzeros

def onelistmaker(n):
    listofones = [1] * n
    return listofones

In [0]:
Zoolabels = zerolistmaker(len(Zoo_clean_sequences))  # Zoonotic viruses as Class 0.
NonZoolabels = onelistmaker(len(NonZoo_clean_sequences))  # Non-Zoonotic viruses as Class 1.

Now merge the data and the labels.

In [14]:
Zoo_data_frame = [list(x) for x in zip(Zoo_ID,Zoo_Description,Zoo_clean_sequences,Zoolabels)]
NonZoo_data_frame = [list(x) for x in zip(NonZoo_ID,NonZoo_Description,NonZoo_clean_sequences,NonZoolabels)]
test_data_frame = [list(x) for x in zip(test_ID,test_Description,test_clean_sequences)]
trainlist = Zoo_data_frame + NonZoo_data_frame
dataframe = pd.DataFrame(trainlist, columns = ['ID' , 'Description', 'Sequences', 'Labels'])
dataframe.head()

,ID,Description,Sequences,Labels
0,NC_003466.1,Andes virus segment S,TAGTAGTAGACTCCTTGAGAAGCTACTGCTGCGAAAGCTGGAATGA...,0
1,NC_003468.2,Andes virus segment L,TAGTAGTAGACTCCGGGATAGAAAAAGTTAGAAAAATGGAAAAGTA...,0
2,NC_003467.2,Andes virus segment M,TAGTAGTAGACTCCGCAAGAAGAAGCAAAAAATTAAAGAAGTGAGT...,0
3,NC_009026.2,Bussuquara virus,AGTATTTCTTCTGCGTGAGACCATTGCGACAGTTCGTACCGGTGAG...,0
4,NC_004211.1,Banna virus strain JKT-6423 segment 1,GTATTAAAAATTATCAACAAGGAATGGACATTCAAGAACAATTTGA...,0


In [15]:
testframe = pd.DataFrame(test_data_frame, columns = ['ID' , 'Description', 'Sequences'])
testframe.head()

,ID,Description,Sequences
0,NC_002728.1,Nipah virus,ACCAAACAAGGGAGAATATGGATACGTTAAAATATATAACGTATTT...
1,NC_005283.1,Dolphin morbillivirus,ACCAGACAAAGCTGGCTAGGGGTAGAATAACAGATAATGATAAATT...
2,NC_001498.1,Measles virus,ACCAAACAAAGTTGGGTAAGGATAGATCAATCAATGATCATATTCT...
3,NC_004148.2,Human metapneumovirus,ACGCGAAAAAAACGCGTATAAATTAAGTTACAAAAAAACATGGGAC...
4,NC_003266.2,Human adenovirus E,CATCATCAATAATATACCTTATTTTTTTTGTGTGAGTTAATATGCA...


In [16]:
# Check for any null values
print(dataframe.isnull().sum())
print(testframe.isnull().sum())

ID             0
Description    0
Sequences      0
Labels         0
dtype: int64
ID             0
Description    0
Sequences      0
dtype: int64


In [0]:
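# Remove unwanted features (the axis is named explicitly; a positional axis is rejected by newer pandas)
clean_data = dataframe.drop(columns=['ID', 'Description'])
clean_data = clean_data.reset_index(drop=True)  # assign the result so the reset takes effect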

In [0]:
# Remove unwanted features
clean_test = testframe.drop(columns=['ID', 'Description'])
clean_test = clean_test.reset_index(drop=True)  # assign the result so the reset takes effect

# **This is where your model starts. Train your model on clean_data dataframe and predict classes for clean_test dataframe**

In [32]:
clean_data.head() # Training data

,Sequences,Labels
0,TAGTAGTAGACTCCTTGAGAAGCTACTGCTGCGAAAGCTGGAATGA...,0
1,TAGTAGTAGACTCCGGGATAGAAAAAGTTAGAAAAATGGAAAAGTA...,0
2,TAGTAGTAGACTCCGCAAGAAGAAGCAAAAAATTAAAGAAGTGAGT...,0
3,AGTATTTCTTCTGCGTGAGACCATTGCGACAGTTCGTACCGGTGAG...,0
4,GTATTAAAAATTATCAACAAGGAATGGACATTCAAGAACAATTTGA...,0


In [33]:
clean_test.head()

,Sequences
0,ACCAAACAAGGGAGAATATGGATACGTTAAAATATATAACGTATTT...
1,ACCAGACAAAGCTGGCTAGGGGTAGAATAACAGATAATGATAAATT...
2,ACCAAACAAAGTTGGGTAAGGATAGATCAATCAATGATCATATTCT...
3,ACGCGAAAAAAACGCGTATAAATTAAGTTACAAAAAAACATGGGAC...
4,CATCATCAATAATATACCTTATTTTTTTTGTGTGAGTTAATATGCA...
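Any model needs fixed-length numeric inputs, while the `Sequences` strings vary in length. One possible starting point, offered only as a sketch, is a normalized k-mer frequency vector. The names `kmer_vocabulary` and `kmer_frequencies` are hypothetical, and `k=3` is an arbitrary illustrative choice rather than a recommendation from the organizers.

```python
from itertools import product


def kmer_vocabulary(k, alphabet="ACGT"):
    """All k-mers over the alphabet, in the order itertools.product yields them."""
    return ["".join(p) for p in product(alphabet, repeat=k)]


def kmer_frequencies(sequence, k=3, alphabet="ACGT"):
    """Normalized counts of each overlapping k-mer in sequence.

    Windows containing symbols outside the alphabet (e.g. 'N') are skipped.
    Returns a list of length len(alphabet) ** k that sums to 1, or all
    zeros if the sequence has no valid window.
    """
    vocab = kmer_vocabulary(k, alphabet)
    index = {kmer: i for i, kmer in enumerate(vocab)}
    counts = [0] * len(vocab)
    for start in range(len(sequence) - k + 1):
        i = index.get(sequence[start:start + k])
        if i is not None:
            counts[i] += 1
    total = sum(counts)
    if total == 0:
        return [0.0] * len(vocab)
    return [c / total for c in counts]
```

For example, `[kmer_frequencies(s) for s in clean_data['Sequences']]` would give one 64-dimensional vector per training genome, which could feed a small dense network. Architectures that read the raw sequence, such as one-hot encoding followed by convolutional layers, are an equally valid route.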


# **Make predictions on test data and save to file for submission**

In [0]:
import random
# Placeholder: random 0/1 labels. Replace this with your model's predictions for the 60 test sequences.
y_pred_test = [random.randrange(0, 2) for _ in range(60)]

In [0]:
# Write the predictions as a single comma-separated line
y_pred = [str(x) for x in y_pred_test]
pred_y = ",".join(y_pred)
with open("/content/sample_data/VirusesTestPrediction.fasta", "w") as o:
    o.write(pred_y)
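Before submitting, it may help to read the file back and confirm that it holds exactly 60 labels, each 0 or 1. The check below is a minimal sketch; `parse_predictions` is a hypothetical helper, and it assumes the single comma-separated line written by the cell above.

```python
def parse_predictions(text, expected_count=60):
    """Parse a comma-separated string of 0/1 labels and validate it."""
    labels = [int(token) for token in text.strip().split(",")]
    if len(labels) != expected_count:
        raise ValueError(
            "Expected %d predictions, found %d." % (expected_count, len(labels))
        )
    if any(label not in (0, 1) for label in labels):
        raise ValueError("Predictions must be 0 or 1.")
    return labels
```

Passing the contents of the prediction file to `parse_predictions` raises a `ValueError` if the count or the label values are off.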