# Classification of DNA Sequences to identify invasive species with semi-supervised training

## Machine Learning at Berkeley Research Project

### Background

We attempt to classify species as invasive or not, given binary labels and a DNA dataset collected on the island of Moorea.

### Method

We separate training and testing data completely. Otherwise, the pipeline is the same as in `SVC_Final.ipynb`.

## Data processing 

In [1]:
# import libraries
%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import scipy
import sklearn as sk
from __future__ import division
# sklearn.cross_validation was removed in scikit-learn 0.20; model_selection replaces it
from sklearn.model_selection import train_test_split
import math

from tqdm import trange
from sklearn.svm import SVC



Here, we use pandas to read the Excel sheet, then extract the columns we need and convert the data to numpy arrays.

In [40]:
# read the excel sheet 
df = pd.read_excel('./BioCode for Machine Learning Updated.xlsx')

# Read in the labels
cls = df['Classification']

# Read the DNA sequences, which are strings comprised of the letters ATCG
seq = df['Aligned Sequence']

species = df['NCBI_Genus_species']

The data we're working with are snippets of DNA a few hundred bases long.

In [41]:
seq[0]

'ACATTATACTTCATATTTGGAGGATGAGCCGGAATAGTAGGAACCTCGTTAAGA---ATACTTATTCGCGCAGAACTTAATCAACCA---GGATCCCTT------ATTGGAGATGATCAAATTTATAATGTTATTGTTACAGCCCACGCATTTGTTATAATTTTCTTTATAGTTATACCAATCTTGATTGGAGGGTTTGGAAATTGATTAGTACCTCTAATATTAGGAGCACCAGATATAGCATTCCCACGAATAAATAATATAAGATTCTGATTATTACCCCCATCACTCTCATTATTATTAACCAGTAGATTAGTCGAAAGAGGAGCTGGTACTGGTTGAACTGTATACCCACCCTTAGCTAGAGGGTTAGCCCATGCTGGTGCATCTGTTGATCTTGCAATCTTTTCTCTACACTTAGCAGGTGTTTCCTCTATTTTAGGAGCAGTTAATTTCATTTCAACAACAATCAATATAAAACCAATAAATATAACATCAGACCGAATCCCTTTATTTGTATGAGCTGTAGCAATCACAGCTTTACTTCTATTATTATCCCTACCAGTGCTTGCAGGAGCAATTACTATATTATTAACAGACCGAAACCTAAATACATCATTTTTTGACCCAGCTGGCGGGGGGGATCCTATTCTCTATCAACATTTATTT--------------------------------'

In [42]:
unique = set()
valid = []
for i in species:
    if i not in unique:
        valid.append(True)
        unique.add(i)
    else:
        valid.append(False)
        
print(len(valid))
print(type(cls), type(seq), type(species))
cls, seq, species = cls[valid], seq[valid], species[valid]

print(len(cls))

4459
<class 'pandas.core.series.Series'> <class 'pandas.core.series.Series'> <class 'pandas.core.series.Series'>
764
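The loop above keeps the first row seen for each species name. A minimal sketch of the same idea with pandas' built-in `duplicated` is shown below; the toy `species` and `cls` Series are hypothetical stand-ins for the spreadsheet columns, and how missing species names are handled may differ from the loop and would need checking.

```python
import pandas as pd

# Hypothetical toy data standing in for the spreadsheet columns.
species = pd.Series(["Aedes aegypti", "Rattus rattus", "Aedes aegypti", "Apis mellifera"])
cls = pd.Series(["Invasive", "Introduced", "Invasive", None])

# True for the first occurrence of each species name, False for repeats.
first_seen = ~species.duplicated(keep="first")

species_unique = species[first_seen]
cls_unique = cls[first_seen]

print(first_seen.tolist())            # [True, True, False, True]
print(species_unique.index.tolist())  # [0, 1, 3]
```

Note that the filtered Series keep their original row labels (0, 1, 3 here), so later positional lookups need `.iloc` or a `reset_index(drop=True)`.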


We can inspect the labels below. As we can see, the labels are very messy. Only rows labeled `Introduced`, `Invasive`, or `Indigenous` can be used for supervised training; rows with `NaN` or any other value are treated as unlabeled. However, because most unlabeled data points still have an associated DNA sequence, we can still use them in an unsupervised pre-training stage.

In [44]:
print(cls[:20])

0            NaN
2       Invasive
3       Invasive
5            NaN
10    Introduced
11    Introduced
12    Introduced
18    Introduced
29           NaN
33           NaN
34    Introduced
35           NaN
37           NaN
38           NaN
39           NaN
41           NaN
42           NaN
43           NaN
47           NaN
48           NaN
Name: Classification, dtype: object
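To get a quick picture of how messy a label column is, one option is to count every distinct value, including missing ones, and then count how many rows carry a usable label. This is only a sketch: the toy `cls` Series below is hypothetical and is not the notebook's data.

```python
import pandas as pd

# Hypothetical label column; None stands in for a missing classification.
cls = pd.Series(["Invasive", None, "Introduced", "Indigenous", None, "Introduced"])

# Count each distinct label, keeping missing values as their own row.
counts = cls.value_counts(dropna=False)

# Rows that can be used for supervised training.
usable = ["Introduced", "Invasive", "Indigenous"]
is_labeled = cls.isin(usable)

print(counts)
print(int(is_labeled.sum()), "labeled;", int((~is_labeled).sum()), "unlabeled")
```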


However, some species don't even have associated DNA sequences. We have to discard these before we proceed.

In [45]:
# Shuffles the data (to make sure)
#cls = cls.sample(frac=1).reset_index(drop=True)

In [46]:
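# Convert the DNA column to a numpy array, replacing missing values (NaN) with the string 'None'

seq = np.array(seq.fillna('None'))

# Positions of the entries that actually have a DNA sequence
valid_idx = np.array([i for i in range(len(seq)) if seq[i] != 'None'])

# Apply the filter
valid_seq = seq[valid_idx]
# Caution: cls keeps its original row labels after the de-duplication step, so
# cls[valid_idx] looks rows up by label; cls.iloc[valid_idx] selects by position.
cls_valid = cls[valid_idx]
cls_valid = np.array(cls_valid)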

Now, we process the DNA sequences by converting the string of bases into an array of characters.

In [47]:
valid_seq[0]

'ACATTATACTTCATATTTGGAGGATGAGCCGGAATAGTAGGAACCTCGTTAAGA---ATACTTATTCGCGCAGAACTTAATCAACCA---GGATCCCTT------ATTGGAGATGATCAAATTTATAATGTTATTGTTACAGCCCACGCATTTGTTATAATTTTCTTTATAGTTATACCAATCTTGATTGGAGGGTTTGGAAATTGATTAGTACCTCTAATATTAGGAGCACCAGATATAGCATTCCCACGAATAAATAATATAAGATTCTGATTATTACCCCCATCACTCTCATTATTATTAACCAGTAGATTAGTCGAAAGAGGAGCTGGTACTGGTTGAACTGTATACCCACCCTTAGCTAGAGGGTTAGCCCATGCTGGTGCATCTGTTGATCTTGCAATCTTTTCTCTACACTTAGCAGGTGTTTCCTCTATTTTAGGAGCAGTTAATTTCATTTCAACAACAATCAATATAAAACCAATAAATATAACATCAGACCGAATCCCTTTATTTGTATGAGCTGTAGCAATCACAGCTTTACTTCTATTATTATCCCTACCAGTGCTTGCAGGAGCAATTACTATATTATTAACAGACCGAAACCTAAATACATCATTTTTTGACCCAGCTGGCGGGGGGGATCCTATTCTCTATCAACATTTATTT--------------------------------'

In [48]:
# Separate each string into individual bases, so each array element is one base. The arrays are stored in a list.
seq_arr = [np.array([i for i in s]) for s in valid_seq]

#seq_mtx = len(seq_arr)

print(len(valid_seq), len(cls_valid)) #, seq_mtx)

764 764


In [49]:
valid_labels = ['Introduced', 'Invasive', 'Indigenous']
cls_labeled = [label in valid_labels for label in cls_valid]
#labeled_cls = (valid_labels[labeled_cls] == 'Indigenous').astype(int)

# Create a filter telling us which points are valid to use for supervised training
cls_labeled = np.array(cls_labeled)
#print(len(labeled_cls))
unshuffled_labels = cls_valid[cls_labeled]

print(type(seq_arr))

seq_data = [i for i, validity in zip(seq_arr, cls_labeled) if validity]

print(type(seq_data))
print(len(seq_data))

<class 'list'>
<class 'list'>
54


In [50]:
cls_valid_shuff, seq_data_shuff = sk.utils.shuffle(unshuffled_labels, seq_data, random_state=1337)

print(len(cls_valid_shuff))

# cls_train, cls_test, res_train, res_test = train_test_split(cls_valid_shuff, res_mat_shuff, test_size=0.2)
train_test_split = int(len(cls_valid_shuff)*0.5)

print(train_test_split)

cls_test, cls_train = cls_valid_shuff[:train_test_split], cls_valid_shuff[train_test_split:]

print(len(cls_test), len(cls_train))

54
27
27 27


In [51]:
cls_unlabeled = 1 - cls_labeled

seq_data_unlabeled = [i for i, validity in zip(seq_arr, cls_unlabeled) if validity]

print(len(seq_data_unlabeled))

# print(np.mean(unlabled_cls))

710


In [52]:
num_unlabeled = np.sum(cls_unlabeled)
print(num_unlabeled)

710


## Unsupervised Training

We create a similarity matrix by comparing every pair of DNA sequences and recording the fraction of aligned positions at which the two bases are the same.

Because the DNA sequences have been pre-aligned, we can expect this to be mostly accurate and close to the true similarity values. In some places, the DNA sequences have a '-' character where the base was not read correctly, or was missed. We ignore any position where either sequence has a '-'.
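As a small worked example of this definition, the sketch below scores two short aligned strings. The function name `pairwise_similarity` and the toy sequences are hypothetical and only illustrate the arithmetic.

```python
import numpy as np

def pairwise_similarity(seq_a, seq_b, gap="-"):
    """Fraction of matching bases over positions where neither sequence has a gap."""
    a = np.array(list(seq_a))
    b = np.array(list(seq_b))
    if a.shape != b.shape:
        raise ValueError("sequences must be aligned to the same length")
    both_read = (a != gap) & (b != gap)
    if not both_read.any():
        return float("nan")  # no overlapping positions to compare
    return float(np.mean(a[both_read] == b[both_read]))

# Position 5 is skipped (gap in the first sequence); 4 of the remaining 5 match.
sim = pairwise_similarity("ACGT-A", "ACCTTA")
print(sim)  # 0.8
```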

In [53]:
sim_train = np.vstack((seq_data_unlabeled, seq_data_shuff[train_test_split:]))
print(sim_train.shape)

mat_size = len(sim_train)
print(mat_size)

sim_mat_train = -np.ones((mat_size, mat_size))

(737, 701)
737


In [54]:
# Precompute, for each sequence, a mask of the positions that are not '-'
dashes_train = []
for i in range(mat_size):
    dashes_train.append(sim_train[i] != '-')

In [55]:
print(sim_mat_train.shape)

(737, 737)


In [57]:
try:
    # Force recomputation; delete this line to reuse a cached matrix.
    assert False
    sim_mat_train = np.load('online_sim_mat_train.npy')

except (AssertionError, IOError):
    # Fill the lower triangle pair by pair and mirror it into the upper triangle.
    for i in trange(mat_size):
        a = sim_train[i]
        for j in range(i):
            b = sim_train[j]
            match = (a == b)
            # Only score positions where both sequences have a base (no '-').
            valid = dashes_train[i] & dashes_train[j]
            sim_mat_train[i,j] = np.mean(match[valid])
            sim_mat_train[j,i] = sim_mat_train[i,j]
        sim_mat_train[i,i] = 1
    np.save('online_sim_mat_train.npy', sim_mat_train)

100%|██████████| 737/737 [00:06<00:00, 117.54it/s]
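The pairwise loop is fast enough at this size, but the same quantity can also be computed with matrix products. The sketch below is one possible vectorization, not the notebook's method: `similarity_matrix` is a hypothetical helper that one-hot encodes each non-gap character and counts shared positions with a dot product.

```python
import numpy as np

def similarity_matrix(seqs, gap="-"):
    """seqs: 2-D array of single characters, shape (n_sequences, length)."""
    seqs = np.asarray(seqs)
    n = seqs.shape[0]
    matches = np.zeros((n, n))
    for base in np.unique(seqs):
        if base == gap:
            continue
        onehot = (seqs == base).astype(float)
        # Entry (i, j) counts positions where both sequences hold this base.
        matches += onehot @ onehot.T
    read = (seqs != gap).astype(float)
    # Entry (i, j) counts positions where neither sequence has a gap.
    overlap = read @ read.T
    with np.errstate(invalid="ignore", divide="ignore"):
        return matches / overlap  # NaN where two sequences never overlap

toy = np.array([list("ACGT-A"), list("ACCTTA"), list("-CGTTA")])
sim = similarity_matrix(toy)
print(np.round(sim, 2))
```

On the toy input this gives 1.0 on the diagonal, 0.8 between the first two sequences, and 1.0 between the first and third, which matches a position-by-position count.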


In [58]:
valid_mat_train = sim_mat_train

In [59]:
sim_mat_train.shape, valid_mat_train.shape

((737, 737), (737, 737))

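A similarity matrix grows quadratically with the number of sequences, so PCA/SVD can be used to keep only the features associated with the largest singular values. Here the matrix is only 737 × 737, so the SVD cells below are commented out and the full matrix is used as the feature set.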

In [60]:
# %%time
# u,s,v = np.linalg.svd(valid_mat, full_matrices=0)

The rank (the number of singular values we keep) is a hyperparameter of the dimension reduction step. That step is disabled in this notebook; the cell is kept for reference.

In [61]:
# rank = 1000
# approx_1000 = u[:,:rank].dot(np.diag(s[:rank])).dot(v[:rank])
# errors = ((approx_1000 - valid_mat)/valid_mat)
# plt.hist(errors.flatten())
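For reference, a rank-k approximation from an SVD can be sketched as follows. This is a minimal illustration on a synthetic symmetric matrix, not the notebook's similarity matrix; the helper `low_rank` and the chosen ranks are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for a similarity matrix: symmetric, with rank at most 5.
features = rng.random((50, 5))
sim = features @ features.T

u, s, vt = np.linalg.svd(sim, full_matrices=False)

def low_rank(u, s, vt, rank):
    """Rebuild the matrix from its `rank` largest singular values."""
    return (u[:, :rank] * s[:rank]) @ vt[:rank]

errors = {}
for rank in (1, 3, 5):
    approx = low_rank(u, s, vt, rank)
    # Relative Frobenius-norm error of the approximation.
    errors[rank] = np.linalg.norm(sim - approx) / np.linalg.norm(sim)
    print(rank, errors[rank])
```

The error shrinks as the rank grows and is numerically zero once the rank reaches the true rank of the matrix, which is the trade-off the rank hyperparameter controls.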

## Supervised Training

Now, we have a set of features from our pre-training step, and we're ready to run supervised training. Before we start, we need to first remove data that don't have valid labels. We can't use them anymore!

We see that only about 7% of the data (54 of 764 species) have usable labels.

In [65]:
np.mean(cls_labeled)

0.070680628272251314

In [66]:
# apply the filter over our features and labels
# supervised_X = approx[labeled_cls]
supervised_X = valid_mat_train[num_unlabeled:]
# supervised_y = cls_valid[labeled_cls]

supervised_y_train = (cls_train == 'Indigenous').astype(int)

Below, we fit a support vector classifier (scikit-learn's `SVC`, Support Vector Classification). The data were already shuffled and split into training and testing halves above.

There is somewhat large variance between runs, so `avg_score` is set up to average the score over several runs; only a single run is recorded here.

In [67]:
# 10/18

c = 1e15

avg_score = []


# for _ in range(2):
#     res_mat_shuff, cls_valid_shuff = sk.utils.shuffle(supervised_X, supervised_y, random_state=0)

#     cls_train, cls_test, res_train, res_test = train_test_split(cls_valid_shuff, res_mat_shuff, test_size=0.2)

# print(len(cls_train), len(cls_test))

X = supervised_X
y = supervised_y_train #.reshape(-1, 1)



clf = SVC(C=c, kernel='poly', degree=2, coef0=0)

clf.fit(X, y)

SVC(C=1000000000000000.0, cache_size=200, class_weight=None, coef0=0,
  decision_function_shape='ovr', degree=2, gamma='auto', kernel='poly',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)

## Test data projection

We need to convert the test data, which is currently still raw DNA bases, into rows of similarity values against the training sequences.

In [68]:
#sim_test = np.vstack((seq_arr_unlabeled, seq_arr_shuff[:train_test_split]))
sim_test = np.array(seq_data_shuff[:train_test_split])

sim_mat_test = -np.ones((mat_size, len(sim_test)))

print(sim_test.shape)
print(sim_mat_test.shape)

(27, 701)
(737, 27)


In [69]:
# Precompute, for each test sequence, a mask of the positions that are not '-'
dashes_test = []
for j in range(len(sim_test)):
    dashes_test.append(sim_test[j] != '-')

In [70]:
print(len(dashes_test))

27


In [71]:
try:
    # Force recomputation; delete this line to reuse a cached matrix.
    assert False
    sim_mat_test = np.load('online_sim_mat_test.npy')
except (AssertionError, IOError):
    # Compare every training-matrix sequence (rows) with every test sequence (columns).
    for i in trange(mat_size):
        a = sim_train[i]
        for j in range(len(sim_test)):
            b = sim_test[j]
            match = (a == b)
            # Only score positions where both sequences have a base (no '-').
            valid = dashes_train[i] & dashes_test[j]
            sim_mat_test[i,j] = np.mean(match[valid])
    np.save('online_sim_mat_test.npy', sim_mat_test)

100%|██████████| 737/737 [00:00<00:00, 1614.66it/s]


In [72]:
print(sim_mat_test)

[[ 0.54059829  0.80131363  0.74733638 ...,  0.77083333  0.78538813
   0.74668874]
 [ 0.56410256  0.9589491   0.75190259 ...,  0.79166667  0.78082192
   0.75662252]
 [ 0.56410256  0.80131363  0.75494673 ...,  0.78125     0.80974125
   0.79304636]
 ..., 
 [ 0.5982906   0.75907591  0.72588055 ...,  0.74842767  0.74272588
   0.70333333]
 [ 0.60042735  0.75205255  0.79147641 ...,  0.83541667  0.85388128
   0.78807947]
 [ 0.58974359  0.75369458  0.76712329 ...,  0.81458333  0.84018265
   0.79801325]]


In [73]:
sim_mat_test.shape

(737, 27)

In [74]:
supervised_y_test = (cls_test == 'Indigenous').astype(int)

print(supervised_y_train.shape, supervised_y_test.shape)

(27,) (27,)


In [75]:
X_test = sim_mat_test.T

predict = clf.predict(X_test)

print(predict[:20])
# print(predict == np.array(cls_test))
print(cls_test[:20])
score = np.mean((predict == np.array(supervised_y_test))*1)
avg_score.append(score)

print(avg_score, np.mean(avg_score))

# print 'Approximated similarity matrix: \n'
# test_and_score(supervised_X, supervised_y)
# print 'Full similarity matrix: \n'
# test_and_score(full_supervised_X, supervised_y)

[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
['Invasive' 'Invasive' 'Introduced' 'Indigenous' 'Invasive' 'Indigenous'
 'Introduced' 'Introduced' 'Introduced' 'Invasive' 'Introduced'
 'Introduced' 'Introduced' 'Introduced' 'Invasive' 'Invasive' 'Indigenous'
 'Introduced' 'Indigenous' 'Introduced']
[0.77777777777777779] 0.777777777778


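The classifier scores about 78% accuracy on the 27 held-out species. Note, however, that the first 20 predictions printed above are all 0 (not indigenous), so this figure should be compared with the majority-class rate before calling it competitive.

Next, we use cross-validation to get another estimate of our performance (the cell below is currently commented out).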
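One way to put an accuracy figure in context is to compare it with the majority-class baseline, i.e. the score obtained by always predicting the most common label. The labels and predictions below are hypothetical, not the notebook's test set, and `majority_baseline_accuracy` is an illustrative helper.

```python
import numpy as np

def majority_baseline_accuracy(y_true):
    """Accuracy of always predicting the most frequent class."""
    y_true = np.asarray(y_true)
    _, counts = np.unique(y_true, return_counts=True)
    return counts.max() / y_true.size

# Hypothetical labels (1 = Indigenous) and a classifier that always predicts 0.
y_true = np.array([0, 0, 1, 0, 1, 0, 0, 0, 1, 0])
y_pred = np.zeros_like(y_true)

accuracy = np.mean(y_pred == y_true)
baseline = majority_baseline_accuracy(y_true)
print(accuracy, baseline)  # 0.7 0.7
```

When the two numbers coincide, as here, the classifier has not learned anything beyond the class balance.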

In [101]:
# # cross validation method
# # SVM

# # tricks: shuffling data, cross validation, balanced classes, hyperparam tuning

# def cv_test_and_score(supervised_X_train, supervised_y, c=1e14):
#     scores = []
#     param_vals = []
    
#     # shuffle the data
#     # res_mat_shuff, cls_valid_shuff = sk.utils.shuffle(supervised_X, supervised_y, random_state=0)

#     c = 10*c
#     clf = SVC(C=c,kernel='poly', degree=2, coef0=0) #, gamma=i)

#     score = sk.cross_validation.cross_val_score(clf, res_mat_shuff, cls_valid_shuff, cv=6) #, n_jobs=-1)
#     # print('Prediction accuracy:', np.mean((prediction == np.array(cls_test))*1))
#     #Coefficients used by the classifier

#     scores.append(score)
#     param_vals.append(i)

#     print(scores)

#     mn_scores = [np.mean(score) for score in scores]

#     print('mean scores:', mn_scores)

# print ('Approximated similarity matrix: \n')
# cv_test_and_score(supervised_X, supervised_y)
# # print ('\nFull similarity matrix: \n')
# # cv_test_and_score(full_supervised_X, supervised_y)
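The commented-out cell relies on the old `sk.cross_validation` module. A minimal sketch of the same idea with the current `sklearn.model_selection` API is given below. The synthetic `X` and `y`, the value `C=1.0`, and the use of `StratifiedKFold` are assumptions for illustration; only the 6 folds and the degree-2 polynomial kernel are taken from the cell above.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Hypothetical two-class features standing in for similarity-matrix rows.
X = np.vstack([rng.normal(0.0, 1.0, (30, 4)), rng.normal(3.0, 1.0, (30, 4))])
y = np.array([0] * 30 + [1] * 30)

clf = SVC(C=1.0, kernel="poly", degree=2, coef0=0)

# Stratified folds keep the class ratio roughly equal in every split.
cv = StratifiedKFold(n_splits=6, shuffle=True, random_state=0)
scores = cross_val_score(clf, X, y, cv=cv)

print(scores, scores.mean())
```

With only a few dozen labeled points, the per-fold scores can vary a lot, so the spread is worth reporting alongside the mean.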

Notice that we included our testing data when creating the similarity matrix, because we first create the matrix and then separate the data into train and test sets. This is somewhat unsatisfying, and very annoying if we want to do on-the-fly predictions: we have to recompute the similarity matrix every time.

We now try excluding the test data from computing the similarity matrix. Instead, we can compute the values for the test data afterwards. We then also need to project the similarity values for the test data into the SVD space before we can run SVC.
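The projection step can be sketched as follows, assuming the SVD is computed on the training similarity matrix only and that each test row holds similarities against the same training sequences. The matrices here are random placeholders and the names `basis`, `train_features`, and `test_features` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n_train, n_test, rank = 40, 6, 5

# Hypothetical similarity matrices: train-vs-train (symmetric) and test-vs-train.
sim_train = rng.random((n_train, n_train))
sim_train = (sim_train + sim_train.T) / 2
sim_test = rng.random((n_test, n_train))

# SVD of the training matrix only; the test data plays no part here.
u, s, vt = np.linalg.svd(sim_train, full_matrices=False)

# Top right-singular vectors form the basis of the reduced space.
basis = vt[:rank].T  # shape (n_train, rank)

# Projecting the training rows reproduces u[:, :rank] * s[:rank].
train_features = sim_train @ basis
# Test rows are mapped into the same space with the same basis.
test_features = sim_test @ basis

print(train_features.shape, test_features.shape)  # (40, 5) (6, 5)
```

Because the basis is fixed after training, new sequences only need one row of similarities against the training set, not a recomputed matrix.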