<a href="https://colab.research.google.com/github/linyuehzzz/5526_neural_networks/blob/master/rbf_svm.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## **Lab 3**
This code trains an RBF kernel SVM to determine whether a genomic sequence is an ncRNA.  
Yue Lin (lin.3326 at osu.edu)  
Created: 11/5/2020

In [None]:
from google.colab import drive
drive.mount('/content/gdrive')

#### **Set up libraries**

In [2]:
!pip install libsvm


In [41]:
from libsvm.svmutil import svm_problem, svm_parameter, svm_train, svm_predict
from sklearn.datasets import load_svmlight_file
import random
import matplotlib.pyplot as plt

#### **Divide datasets**

In [40]:
def divide_data(filename, n, m):
  # Read the entire training dataset
  with open(filename, 'r') as f:
    lines = f.read().splitlines()
  
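  # Divide dataset: draw a random half of the lines, then split it into n
  # strided folds. Note that a fresh random half is drawn on every call, so
  # folds produced by separate calls are not partitions of the same subset.
  lines = random.sample(lines, int(len(lines) * 0.5))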
  _cv = [lines[i::n] for i in range(n)]
  _train = _cv[:m] + _cv[m + 1:]
  _train = [x for sublist in _train for x in sublist]
  _val = _cv[m]
  # print(_train)
  # print(_val)

  # Get y values
  train_y = [int(line.split(' ')[0]) for line in _train]
  val_y = [int(line.split(' ')[0]) for line in _val]
  # print(train_y)
  # print(val_y)

  # Get x values (keep only the value after each 'index:' prefix)
  train_x = [[float(item.split(':')[1]) for item in line.split(' ')[1:]] for line in _train]
  val_x = [[float(item.split(':')[1]) for item in line.split(' ')[1:]] for line in _val]
  # print(train_x)
  # print(val_x)

  return train_x, train_y, val_x, val_y
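
The strided split `lines[i::n]` used in `divide_data` is easiest to see on a toy list. The following is a minimal sketch rather than the notebook's own code; the helper name `kfold_split` and the toy data are hypothetical.

```python
def kfold_split(items, n, m):
    """Split items into n strided folds; fold m is held out for validation."""
    folds = [items[i::n] for i in range(n)]
    # Concatenate every fold except fold m to form the training part
    train = [x for k, fold in enumerate(folds) if k != m for x in fold]
    val = folds[m]
    return train, val


items = list(range(10))
train, val = kfold_split(items, 5, 1)
print(val)    # [1, 6]
print(train)  # [0, 5, 2, 7, 3, 8, 4, 9]
```

Within a single call the training and validation parts are disjoint and together cover the whole input list.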

#### **Select parameters for RBF kernel SVMs**

In [42]:
def rbf_svm_param(filename, c, alpha):
  acc = 0

  for i in range(5):
    # Prepare data
    train_x, train_y, val_x, val_y = divide_data(filename, 5, i)

    # Train svm
    prob = svm_problem(train_y, train_x)
    # '2e' + str(c) is scientific notation, so C = 2 * 10**c and gamma = 2 * 10**alpha
    param_str = '-t 2 -c 2e' + str(c) + ' -g 2e' + str(alpha)
    print("Param: " + param_str)
    param = svm_parameter(param_str)
    m = svm_train(prob, param)

    # Test
    p_label, p_acc, p_val = svm_predict(val_y, val_x, m)
    acc += p_acc[0]

  return acc / 5
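
Because the option string is built by concatenating `'2e'` with the exponent, LIBSVM receives values in scientific notation. The sketch below contrasts that reading with a power-of-two grid, in case a grid of the form 2^k was intended; which one the assignment calls for is not stated here. The helper `param_values` is hypothetical and not part of the notebook.

```python
def param_values(c, alpha, power_of_two=False):
    """Return the numeric (C, gamma) pair for a pair of integer exponents."""
    if power_of_two:
        # Alternative grid: C = 2**c, gamma = 2**alpha
        return 2.0 ** c, 2.0 ** alpha
    # What the notebook's string '2e<c>' evaluates to: 2 * 10**c
    return float('2e' + str(c)), float('2e' + str(alpha))


print(param_values(-4, -4))                     # (0.0002, 0.0002)
print(param_values(-4, -4, power_of_two=True))  # (0.0625, 0.0625)
```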


#### **Read training and test data**

In [49]:
def get_data(filename):
  # Returns (X, y): X is a SciPy sparse feature matrix, y is the label array
  return load_svmlight_file(filename)
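
Both `divide_data` and `load_svmlight_file` read the LIBSVM text format, in which each line is a label followed by `index:value` pairs. A minimal stdlib-only sketch of parsing one such line follows; the helper `parse_libsvm_line` and the sample line are made up for illustration.

```python
def parse_libsvm_line(line):
    """Parse 'label idx:val idx:val ...' into (label, {idx: val})."""
    parts = line.split()
    label = int(parts[0])
    features = {}
    for item in parts[1:]:
        index, value = item.split(':')
        features[int(index)] = float(value)
    return label, features


label, features = parse_libsvm_line("-1 1:0.5 3:-0.25")
print(label)     # -1
print(features)  # {1: 0.5, 3: -0.25}
```

Keeping the indices matters when a file omits zero-valued features. `divide_data` discards them, which assumes every line lists all features in the same order.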

#### **Classification using RBF kernel SVMs**

In [46]:
def rbf_svm(train_data, c, alpha, test_data):
  # Get data
  train_x = train_data[0]
  train_y = train_data[1]
  test_x = test_data[0]
  test_y = test_data[1]

  # Train svm
  prob = svm_problem(train_y, train_x)
  # '2e' + str(c) is scientific notation, so C = 2 * 10**c and gamma = 2 * 10**alpha
  param_str = '-t 2 -c 2e' + str(c) + ' -g 2e' + str(alpha)
  print("Param: " + param_str)
  param = svm_parameter(param_str)
  m = svm_train(prob, param)

  # Test
  p_label, p_acc, p_val = svm_predict(test_y, test_x, m)
  return p_acc
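
`svm_predict` returns the classification accuracy as the first entry of `p_acc`, followed by the mean squared error and the squared correlation coefficient. As a sketch of what that first number means, the hypothetical helper below recomputes a percentage accuracy from two label lists; the labels are toy values.

```python
def accuracy_percent(true_labels, predicted_labels):
    """Percentage of positions where the predicted label equals the true label."""
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return 100.0 * correct / len(true_labels)


print(accuracy_percent([1, -1, 1, 1], [1, -1, -1, 1]))  # 75.0
```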

#### **Wrapper**

Selecting parameters for RBF kernel SVMs

In [43]:
%cd "/content/gdrive/My Drive/Colab Notebooks/cse5526"

train_file = "ncRNA_s.train.txt"
# acc[c + 4][alpha + 4] holds the mean validation accuracy for exponents (c, alpha)
acc = [[0] * 13 for _ in range(13)]
for c in range(-4, 9):
  for alpha in range(-4, 9):
    p_acc = rbf_svm_param(train_file, c, alpha)
    acc[c + 4][alpha + 4] = p_acc

/content/gdrive/My Drive/Colab Notebooks/cse5526
Param: -t 2 -c 2e-4 -g 2e-4
Accuracy = 70.5% (141/200) (classification)
Param: -t 2 -c 2e-4 -g 2e-4
Accuracy = 64.5% (129/200) (classification)
Param: -t 2 -c 2e-4 -g 2e-4
Accuracy = 67.5% (135/200) (classification)
Param: -t 2 -c 2e-4 -g 2e-4
Accuracy = 68.5% (137/200) (classification)
Param: -t 2 -c 2e-4 -g 2e-4
Accuracy = 61.5% (123/200) (classification)
Param: -t 2 -c 2e-4 -g 2e-3
Accuracy = 67% (134/200) (classification)
Param: -t 2 -c 2e-4 -g 2e-3
Accuracy = 65.5% (131/200) (classification)
Param: -t 2 -c 2e-4 -g 2e-3
Accuracy = 67% (134/200) (classification)
Param: -t 2 -c 2e-4 -g 2e-3
Accuracy = 66.5% (133/200) (classification)
Param: -t 2 -c 2e-4 -g 2e-3
Accuracy = 66% (132/200) (classification)
Param: -t 2 -c 2e-4 -g 2e-2
Accuracy = 68% (136/200) (classification)
Param: -t 2 -c 2e-4 -g 2e-2
Accuracy = 72.5% (145/200) (classification)
Param: -t 2 -c 2e-4 -g 2e-2
Accuracy = 71% (142/200) (classification)
Param: -t 2 -c 2e-4 -g 2e

Save accuracy matrix

In [44]:
with open('acc_mtx.txt', 'w') as fw:
  fw.write('\n'.join(['\t'.join([str(round(cell,2)) for cell in row]) for row in acc]))
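
Once the grid is filled, the best pair can be read off by mapping the largest cell back to its exponents. A minimal sketch, assuming the same layout as `acc` (row = `c + 4`, column = `alpha + 4`); the helper `best_params` and the toy matrix are hypothetical.

```python
def best_params(acc_matrix, offset=-4):
    """Return (c, alpha, accuracy) for the largest cell of the matrix."""
    # On ties, max() prefers the largest row index, then the largest column index
    value, i, j = max(
        (value, i, j)
        for i, row in enumerate(acc_matrix)
        for j, value in enumerate(row)
    )
    return i + offset, j + offset, value


toy = [[60.0, 65.0], [70.0, 68.5]]
print(best_params(toy))  # (-3, -4, 70.0)
```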

Read training and test data

In [50]:
%cd "/content/gdrive/My Drive/Colab Notebooks/cse5526"

# Read training and test data
train_file = "ncRNA_s.train.txt"
test_file = "ncRNA_s.test.txt"

train_data = get_data(train_file)
test_data = get_data(test_file)

/content/gdrive/My Drive/Colab Notebooks/cse5526


Classification using RBF kernel SVMs

In [52]:
c = 6
alpha = -3

p_acc = rbf_svm(train_data, c, alpha, test_data)
print(p_acc[0])

Param: -t 2 -c 2e6 -g 2e-3
Accuracy = 94.4056% (945/1001) (classification)
94.4055944055944
