# Preprocessing the DNA dataset

In [1]:
import os

import numpy as np

Paths to the experiment and data folders.

In [2]:
PATH_TO_EXP = '/cobrain/groups/ml_group/experiments/dustpelt/imc_exp/'
PATH_DATA = os.path.join(PATH_TO_EXP, 'data/dna')

Download the scaled `dna` dataset (training and test splits) from the LIBSVM multiclass dataset collection.

In [3]:
filename_raw_train = os.path.join(PATH_DATA, "dataset_train.libsvm")

if not os.path.exists(filename_raw_train):
    !wget -O {filename_raw_train} -t inf \
        https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/multiclass/dna.scale

--2018-07-05 11:38:47--  https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/multiclass/dna.scale
Resolving www.csie.ntu.edu.tw (www.csie.ntu.edu.tw)... 140.112.30.26
Connecting to www.csie.ntu.edu.tw (www.csie.ntu.edu.tw)|140.112.30.26|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 499248 (488K)
Saving to: ‘/cobrain/groups/ml_group/experiments/dustpelt/imc_exp/data/dna/dataset_train.libsvm’


2018-07-05 11:38:51 (164 KB/s) - ‘/cobrain/groups/ml_group/experiments/dustpelt/imc_exp/data/dna/dataset_train.libsvm’ saved [499248/499248]



In [4]:
filename_raw_test = os.path.join(PATH_DATA, "dataset_test.libsvm")

if not os.path.exists(filename_raw_test):
    !wget -O {filename_raw_test} -t inf \
        https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/multiclass/dna.scale.t

--2018-07-05 11:38:53--  https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/multiclass/dna.scale.t
Resolving www.csie.ntu.edu.tw (www.csie.ntu.edu.tw)... 140.112.30.26
Connecting to www.csie.ntu.edu.tw (www.csie.ntu.edu.tw)|140.112.30.26|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 293744 (287K) [application/x-troff]
Saving to: ‘/cobrain/groups/ml_group/experiments/dustpelt/imc_exp/data/dna/dataset_test.libsvm’


2018-07-05 11:38:58 (72.0 KB/s) - ‘/cobrain/groups/ml_group/experiments/dustpelt/imc_exp/data/dna/dataset_test.libsvm’ saved [293744/293744]



The data is stored in the LIBSVM sparse text format, so we load it with `sklearn`'s `load_svmlight_file`.

In [5]:
from sklearn.datasets import load_svmlight_file

X_train, y_train = load_svmlight_file(filename_raw_train, dtype=np.float64, query_id=False)
X_test, y_test = load_svmlight_file(filename_raw_test, dtype=np.float64, query_id=False)
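
For readers unfamiliar with the format, each line of a LIBSVM file holds a label followed by `index:value` pairs for the nonzero features. Below is a minimal sketch on a made-up two-line sample. The sample data and the `*_demo` names are illustrative assumptions, not taken from the `dna` files.

```python
import io

import numpy as np
from sklearn.datasets import load_svmlight_file

# Hypothetical two-sample file: "<label> <index>:<value> ...", one-based indices.
sample = io.BytesIO(b"1 1:0.5 3:1.0\n2 2:1.0\n")

# The loader accepts a binary file-like object as well as a path.
X_demo, y_demo = load_svmlight_file(
    sample, n_features=3, dtype=np.float64, zero_based=False)

print(X_demo.toarray())  # dense view of the sparse feature matrix
print(y_demo)            # labels, one per line of the file
```

Features that are absent from a line are treated as zeros, which is why the loader returns a sparse matrix.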

In [13]:
from scipy.sparse import vstack

X = vstack([X_train, X_test])
y = np.concatenate((y_train, y_test))

Data info:
* number of classes: 3
* number of samples: 2,000 (training) / 1,186 (testing) / 1,400 (tr) / 600 (val)
* number of features: 180

In [15]:
n_objects, n_features = 2000 + 1186, 180

assert n_objects == len(y), """Unexpected dimensions."""
assert (n_objects, n_features) == X.shape, """Unexpected dimensions."""

Create the target dataset for supervised clustering:
$$ R_{ij}
    = \begin{cases}
        +1 & \text{ if } y_i = y_j\,, \\
        -1 & \text{ otherwise.}
\end{cases}$$
The matrix is initialized with `+1` everywhere, so only the negative entries `-1` need to be filled in.

In [16]:
import tqdm

R = np.ones((n_objects, n_objects))
for i, yi in enumerate(tqdm.tqdm(y)):
    R[i, np.flatnonzero(y != yi)] = -1

100%|██████████| 3186/3186 [00:00<00:00, 30397.10it/s]
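
The row-by-row loop above can also be written as a single broadcast comparison. The following is a sketch on hypothetical toy labels (`y_demo` stands in for `y`) that checks both constructions agree.

```python
import numpy as np

# Hypothetical toy labels standing in for `y`.
y_demo = np.array([1.0, 2.0, 1.0, 3.0])

# Loop version, mirroring the cell above.
R_loop = np.ones((len(y_demo), len(y_demo)))
for i, yi in enumerate(y_demo):
    R_loop[i, np.flatnonzero(y_demo != yi)] = -1

# Broadcasting version: compare every label against every other label at once.
R_vec = np.where(y_demo[:, None] == y_demo[None, :], 1.0, -1.0)

assert np.array_equal(R_loop, R_vec)
```

Either way the result is a dense `float64` matrix, which for 3,186 objects takes roughly 81 MB.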


The row side-features matrix is already in CSR sparse format.

In [17]:
X

<3186x180 sparse matrix of type '<class 'numpy.float64'>'
	with 144902 stored elements in Compressed Sparse Row format>

The column side-features are an identity matrix, also stored in CSR format.

In [18]:
from scipy.sparse import dia_matrix

Y = dia_matrix((np.ones(n_objects), 0), shape=(n_objects, n_objects))
Y = Y.tocsr()
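
As an aside, `scipy.sparse.identity` should build the same matrix in one call. The sketch below uses a hypothetical small size (`n_demo` stands in for `n_objects`) and compares the two constructions.

```python
import numpy as np
from scipy.sparse import dia_matrix, identity

n_demo = 5  # hypothetical small size standing in for `n_objects`

# Construction used in the cell above: one diagonal of ones at offset 0.
Y_dia = dia_matrix((np.ones(n_demo), 0), shape=(n_demo, n_demo)).tocsr()

# One-call alternative that produces CSR directly.
Y_eye = identity(n_demo, dtype=np.float64, format="csr")

assert np.array_equal(Y_dia.toarray(), Y_eye.toarray())
```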

Save the staged dataset `(X, Y, R)` as a gzipped pickle.

In [19]:
filename_staged = os.path.join(PATH_DATA, "staged_dataset.gz")

import gzip
import pickle

with gzip.open(filename_staged, "wb", compresslevel=4) as fout:
    pickle.dump((X, Y, R), fout)
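
Reading the staged file back mirrors the save step. The following is a minimal round-trip sketch that uses a temporary file and small hypothetical arrays in place of the real `(X, Y, R)` tuple.

```python
import gzip
import os
import pickle
import tempfile

import numpy as np

# Hypothetical stand-ins for the (X, Y, R) tuple saved above.
payload = (np.arange(6.0).reshape(2, 3), np.eye(2), -np.ones((2, 2)))

with tempfile.TemporaryDirectory() as tmp_dir:
    filename_demo = os.path.join(tmp_dir, "staged_demo.gz")

    # Write the tuple as a gzipped pickle.
    with gzip.open(filename_demo, "wb", compresslevel=4) as fout:
        pickle.dump(payload, fout)

    # Read it back and unpack in the same order it was saved.
    with gzip.open(filename_demo, "rb") as fin:
        X_back, Y_back, R_back = pickle.load(fin)

restored = (X_back, Y_back, R_back)
assert all(np.array_equal(a, b) for a, b in zip(payload, restored))
```

Pickles should only be loaded from trusted sources, since unpickling can execute arbitrary code.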