# Classification of DNA Sequencing to Identify Invasive Species

## Machine Learning at Berkeley

We tackle the classification problem of identifying invasive species, given binary labels and a DNA dataset from the island of Moorea.

We use a semi-supervised method to take advantage of unlabeled DNA, which makes up over 80% of the dataset.

First, we load the DNA sequences and compute a pairwise similarity matrix, which is easier to work with than the raw sequences. We can include unlabeled data when building this matrix, because it does not depend on labels.

This step discards information about the actual DNA sequences, so in the future we could try a different method that reads the ATCG bases directly (possibly RNNs).

Then we can run a classifier such as a support vector classifier (SVC) on the processed matrix. This final step is supervised, so we simply discard the unlabeled rows of our data matrix.

In [8]:
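# import libraries
from __future__ import division  # must come first in the cell; a no-op on Python 3
%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import scipy
import sklearn as sk
import sklearn.utils  # makes sk.utils.shuffle available
# sklearn.cross_validation was removed in scikit-learn 0.20; model_selection replaces it
from sklearn.model_selection import train_test_split
import math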

from tqdm import trange
from sklearn.svm import SVC

Here, we use pandas to read the Excel sheet, then extract the labels and sequences and convert the data to NumPy arrays.

In [10]:
# read the excel sheet 
df = pd.read_excel('./BioCode for Machine Learning.xlsx')

# Read in the labels
cls = df['Classification']

# Convert the label (Indigenous/Non-native/Invasive) into a binary label
cls_valid = (cls == 'Indigenous')*1
# cls_valid = cls_binary[valid_samples]


# Read the DNA sequences, which are strings comprised of the letters ATCG
seq = df['Aligned Sequence']

# Convert data to numpy array
seq = np.array(seq.fillna('None'))

# Separate each string into an array of individual characters
seq_arrays = [np.array(list(s)) for s in seq]

We create a similarity matrix by comparing every pair of DNA sequences and computing the percentage of positions at which their bases are the same.

Because the DNA sequences have been pre-aligned, we can expect this to be mostly accurate and close to the true similarity values. In some places, the DNA sequences have a '-' character where the base was missed or not read correctly. We ignore any position where either sequence has a '-'.

In [None]:
mat_size = len(seq_arrays)
sim_mat = -np.ones((mat_size, mat_size))

# this will take a few minutes
for i in trange(mat_size):
    # clean up bad data
    if seq[i] != 'None':
        a = seq_arrays[i]
        # iterate over DNA sequences and figure out the match
        for j in range(i):
            if seq[j] != 'None':
                b = seq_arrays[j]
                match = (a==b)
                # compare only positions where neither sequence has a gap
                valid = (a != '-') & (b != '-')
                sim_mat[i,j] = np.mean(match[valid])
                sim_mat[j,i] = sim_mat[i,j]
        sim_mat[i,i] = 1

 90%|█████████ | 4401/4877 [04:19<00:58,  8.14it/s]
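To make the similarity measure concrete, here is a minimal sketch on two toy aligned sequences. The helper name `pairwise_similarity` and the sequences are made up for illustration and are not part of the notebook; the logic mirrors the loop above, skipping positions where either sequence has a '-'.

```python
import numpy as np

def pairwise_similarity(a, b, gap='-'):
    """Fraction of matching bases, ignoring positions where either sequence has a gap."""
    a = np.array(list(a))
    b = np.array(list(b))
    valid = (a != gap) & (b != gap)
    if not valid.any():
        return np.nan  # no comparable positions
    return float(np.mean(a[valid] == b[valid]))

# hypothetical aligned sequences of equal length
s1 = "ATCG-ATC"
s2 = "ATGGAAT-"

# 6 positions are gap-free in both sequences, and 5 of them match
print(pairwise_similarity(s1, s2))
```

Returning `nan` when no position is comparable is a design choice for this sketch; the notebook's loop would produce the same value through `np.mean` of an empty array, along with a runtime warning.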

We now keep only the rows and columns of the similarity matrix that correspond to samples with a DNA sequence.

In [None]:
valid_idx = np.array([i for i in range(len(seq)) if seq[i] != 'None'])
valid_mat = sim_mat[valid_idx][:, valid_idx]

In [None]:
sim_mat.shape, valid_mat.shape

The similarity matrix is very large (100 MB+), so we use SVD (as in PCA) to extract the most useful features, keeping the components with the largest singular values.

In [None]:
%%time
# note: np.linalg.svd returns V transposed, so the rows of v are the right singular vectors
u,s,v = np.linalg.svd(valid_mat, full_matrices=False)

The rank / number of singular values we pick is a hyperparameter.
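One possible heuristic for choosing the rank, sketched below, is to keep the smallest number of singular values that accounts for a chosen fraction of the total squared singular values. The helper `rank_for_energy`, the 99% threshold, and the toy values are assumptions for illustration; the notebook itself simply fixes the rank by hand.

```python
import numpy as np

def rank_for_energy(singular_values, energy=0.99):
    """Smallest rank whose leading singular values hold `energy` of the total squared sum."""
    sq = np.asarray(singular_values, dtype=float) ** 2
    cumulative = np.cumsum(sq) / np.sum(sq)
    rank = int(np.searchsorted(cumulative, energy)) + 1
    return min(rank, len(sq))  # guard against rounding at energy close to 1

# toy singular values, sorted in decreasing order as np.linalg.svd returns them
s_toy = np.array([10.0, 5.0, 1.0, 0.1])
print(rank_for_energy(s_toy, 0.99))
```

On the real data, the same function could be applied to the `s` returned by the SVD cell, and the resulting rank compared against downstream classification accuracy.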

In [None]:
rank = 1000
approx_1000 = u[:,:rank].dot(np.diag(s[:rank])).dot(v[:rank])
errors = ((approx_1000 - valid_mat)/valid_mat)
plt.hist(errors.flatten())

We remove the unlabeled data from our matrix.

In [None]:
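valid_idx = list(valid_idx)
# ASSUMPTION: labeled rows are those with a non-empty Classification cell
valid_samples = cls.notnull()
valid_mat_idx = [valid_idx.index(i) for i in np.where(valid_samples)[0] if i in valid_idx]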

In [None]:
mat = approx_1000[valid_mat_idx]
res_mat = residues[valid_mat_idx]
print(mat.shape, res_mat.shape)

Below, we train a support vector classifier (SVC). We shuffle the data first, and then split it into training and testing sets.

There is fairly large variance between runs, so we average the accuracy over runs for a more stable score.

In [None]:
# 10/18

c = 1e15
avg_score = []

for _ in range(2):
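    # align the labels with the selected rows: map positions in valid_mat back to original row indices
    labels = np.array(cls_valid)[np.array(valid_idx)[valid_mat_idx]]
    res_mat_shuff, cls_valid_shuff = sk.utils.shuffle(approx_1000[valid_mat_idx], labels, random_state=0)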

    cls_train, cls_test, res_train, res_test = train_test_split(cls_valid_shuff, res_mat_shuff, test_size=0.2)

    print(len(cls_train), len(cls_test))

    X = res_train
    y = cls_train
    X_test = res_test

    clf = SVC(C=c, kernel='poly', degree=2, coef0=0)

    clf.fit(X, y)

    predict = clf.predict(X_test)

    # print(predict == np.array(cls_test))

    score = np.mean((predict == np.array(cls_test))*1)
    avg_score.append(score)

print(avg_score, np.mean(avg_score))
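As a rough illustration of averaging over runs, the sketch below repeats the train/test split with a different seed each time and reports both the mean and the spread of the accuracy. It uses synthetic stand-in data from `make_classification`, not the BioCode dataset, and the names `X_toy`, `y_toy`, and `run_scores` are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# synthetic stand-in data (not the BioCode dataset)
X_toy, y_toy = make_classification(n_samples=200, n_features=10, random_state=0)

run_scores = []
for seed in range(5):
    # a different seed per run gives a different split; stratify keeps class balance
    X_tr, X_te, y_tr, y_te = train_test_split(
        X_toy, y_toy, test_size=0.2, random_state=seed, stratify=y_toy)
    clf_toy = SVC(kernel='poly', degree=2, coef0=0)
    clf_toy.fit(X_tr, y_tr)
    run_scores.append(clf_toy.score(X_te, y_te))  # accuracy on the held-out split

print(np.mean(run_scores), np.std(run_scores))
```

Reporting the standard deviation alongside the mean shows how much of a difference between two settings could be explained by the split alone.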

We see the results are very competitive.

Now we use cross-validation to get another estimate of our performance.

In [None]:
# cross validation method
# SVM

# tricks: shuffling data, cross validation, balanced classes, hyperparam tuning

scores = []
param_vals = []
c = 1e14

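# align the labels with the rows of res_mat (positions in valid_mat mapped back to original rows)
labels = np.array(cls_valid)[np.array(valid_idx)[valid_mat_idx]]

# shuffle the data
res_mat_shuff, cls_valid_shuff = sk.utils.shuffle(res_mat, labels, random_state=0)

c = 10*c
clf = SVC(C=c,kernel='poly', degree=2, coef0=0) #, gamma=i)

# sklearn.cross_validation was removed in scikit-learn 0.20; model_selection replaces it
from sklearn.model_selection import cross_val_score
score = cross_val_score(clf, res_mat_shuff, cls_valid_shuff, cv=6) #, n_jobs=-1)
# print('Prediction accuracy:', np.mean((prediction == np.array(cls_test))*1))
#Coefficients used by the classifier

scores.append(score)
# record the hyperparameter value that produced this score
param_vals.append(c)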

print(scores)

mn_scores = [np.mean(score) for score in scores]

print('mean scores:', mn_scores)

Notice that we included our testing data when creating the similarity matrix, because we first create the matrix and then separate the data into train and test sets. This is somewhat unsatisfying, and very annoying if we want to make on-the-fly predictions: we would have to recompute the similarity matrix every time.

We now try excluding the test data when computing the similarity matrix. Instead, we compute the similarity values for the test data afterwards. We then also need to project those values into the SVD space before we can run SVC.
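One common way to do this projection, sketched below under stated assumptions, is to multiply a new sample's similarity vector (its similarities to the training samples) by the leading right singular vectors of the training matrix. The names `K_train`, `k_new`, and the random toy matrix are hypothetical stand-ins. Note also that this sketch uses the projected coordinates as features, whereas the cells above feed the rank-truncated reconstruction to the classifier.

```python
import numpy as np

rng = np.random.default_rng(0)

# toy symmetric "similarity" matrix for 6 training samples (random stand-in)
A = rng.random((6, 6))
K_train = (A + A.T) / 2
np.fill_diagonal(K_train, 1.0)

# SVD of the training matrix only; vt holds the right singular vectors as rows
u, s, vt = np.linalg.svd(K_train, full_matrices=False)
rank = 3

# training features in the truncated SVD space: K_train @ V_r equals U_r * S_r
train_feats = K_train @ vt[:rank].T

# a new sample is described by its similarities to the 6 training samples
k_new = rng.random(6)
new_feats = k_new @ vt[:rank].T  # project with the same V_r

print(train_feats.shape, new_feats.shape)
```

Because the projection reuses the singular vectors from the training matrix, the SVD does not need to be recomputed when a new sequence arrives; only its similarities to the training sequences are needed.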