# Staging the Mushroom Dataset

In [1]:
import os

import numpy as np

Path to the data folder

In [2]:
# PATH_TO_EXP = '/cobrain/groups/ml_group/experiments/dustpelt/imc_exp/'
# PATH_DATA = os.path.join(PATH_TO_EXP, 'data/mushrooms')
PATH_DATA = "../data/mushrooms"

Download the `mushrooms` dataset from the `libsvm` dataset collection.

In [3]:
filename_raw = os.path.join(PATH_DATA, "dataset.libsvm")
filename_staged = os.path.join(PATH_DATA, "staged_dataset.gz")

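# Make sure the target folder exists before wget writes into it.
os.makedirs(PATH_DATA, exist_ok=True)

if not os.path.exists(filename_raw):
    !wget -O {filename_raw} -t inf \
        https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/mushrooms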

--2018-05-08 16:52:08--  https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/mushrooms
Resolving www.csie.ntu.edu.tw (www.csie.ntu.edu.tw)... 140.112.30.26
Connecting to www.csie.ntu.edu.tw (www.csie.ntu.edu.tw)|140.112.30.26|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 879712 (859K)
Saving to: <<../data/mushroom.libsvm>>


2018-05-08 16:52:13 (201 KB/s) - <<../data/mushroom.libsvm>> saved [879712/879712]



Information about the dataset from [libsvm/datasets](https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html#mushrooms):

* Source: [UCI / mushrooms](https://archive.ics.uci.edu/ml/datasets/mushroom)
* Preprocessing: Each nominal attribute is expanded into several binary attributes.
  The original attribute #12 has missing values and is not used.
* Number of classes: 2
* Number of features: 112
* Number of data points / samples: 8124

In [4]:
n_objects, n_features = 8124, 112

The data is in the `libsvm` input file format, so we load it with `sklearn`'s `load_svmlight_file`.

In [5]:
from sklearn.datasets import load_svmlight_file

X, y = load_svmlight_file(filename_raw, dtype=np.float64, query_id=False)

assert n_objects == len(y), """Unexpected dimensions."""
assert (n_objects, n_features) == X.shape, """Unexpected dimensions."""
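
For orientation, each line of a `libsvm` file has the form `label index:value ...`, with feature indices that conventionally start at 1. The sketch below parses a two-line toy sample from memory. The sample is made up for illustration and is not taken from the mushrooms file, and `zero_based=False` is passed explicitly to state the one-based indexing assumption.

```python
import io

import numpy as np
from sklearn.datasets import load_svmlight_file

# Toy sample in libsvm format (hypothetical data, not from the mushrooms file).
toy = io.BytesIO(b"1 1:1 3:1\n2 2:1\n")

# File-like inputs must be opened in binary mode, hence BytesIO.
X_toy, y_toy = load_svmlight_file(
    toy, n_features=3, dtype=np.float64, zero_based=False
)

print(X_toy.toarray())  # dense view of the sparse feature matrix
print(y_toy)            # labels, one per line of the input
```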

Create the target dataset for supervised clustering:
$$ R_{ij}
    = \begin{cases}
        +1 & \text{ if } y_i = y_j\,, \\
        -1 & \text{ otherwise.}
\end{cases}$$
Since `R` is initialized with `+1` everywhere, we only need to fill in the negative entries `-1`.

In [6]:
import tqdm

R = np.ones((n_objects, n_objects))
for i, yi in enumerate(tqdm.tqdm(y)):
    R[i, np.flatnonzero(y != yi)] = -1

100%|██████████| 8124/8124 [00:00<00:00, 30783.80it/s]
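
The same matrix can also be built without a Python loop by comparing the label vector against itself with broadcasting. This is a minimal sketch on toy labels (the values are hypothetical) that checks the vectorized result against the loop used above.

```python
import numpy as np

# Toy label vector (hypothetical values, stands in for `y`).
y_toy = np.array([1.0, 2.0, 1.0, 2.0, 2.0])
n = len(y_toy)

# Loop version, as in the cell above: start from +1, fill in -1.
R_loop = np.ones((n, n))
for i, yi in enumerate(y_toy):
    R_loop[i, np.flatnonzero(y_toy != yi)] = -1

# Vectorized version: compare a column of labels against a row of labels.
R_vec = np.where(y_toy[:, None] == y_toy[None, :], 1.0, -1.0)

assert np.array_equal(R_loop, R_vec)
```

Either way the result is dense: for the full 8124 labels the `float64` matrix takes roughly 528 MB.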


The row side-features matrix is already in CSR sparse format.

In [7]:
X

<8124x112 sparse matrix of type '<class 'numpy.float64'>'
	with 170604 stored elements in Compressed Sparse Row format>

The column side-features are an identity matrix.

In [8]:
from scipy.sparse import dia_matrix

Y = dia_matrix((np.ones(n_objects), 0), shape=(n_objects, n_objects))
Y = Y.tocsr()
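
As an alternative sketch, `scipy.sparse.identity` can produce the CSR identity directly, without the intermediate diagonal format. The small size `n` below is a stand-in for `n_objects`, chosen only to keep the comparison cheap.

```python
import numpy as np
from scipy.sparse import dia_matrix, identity

n = 5  # stand-in for n_objects

# Construction used above: diagonal format first, then convert to CSR.
Y_dia = dia_matrix((np.ones(n), 0), shape=(n, n)).tocsr()

# Direct construction of a CSR identity matrix.
Y_eye = identity(n, dtype=np.float64, format="csr")

assert np.array_equal(Y_dia.toarray(), Y_eye.toarray())
```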

Save the dataset `(X, Y, R)` as a gzipped pickle.

In [9]:
import gzip
import pickle

#sgimc.utils.save() ?
with gzip.open(filename_staged, "wb", compresslevel=4) as fout:
    pickle.dump((X, Y, R), fout)
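
To read the staged file back, open it with `gzip` in binary read mode and unpickle the tuple in the same order. The sketch below does a round trip on small stand-in matrices in a temporary folder; the file name `staged_toy.gz` and the toy data are hypothetical.

```python
import gzip
import os
import pickle
import tempfile

import numpy as np
from scipy.sparse import identity

# Small stand-ins for the (X, Y, R) tuple saved above.
X_toy = identity(3, dtype=np.float64, format="csr")
Y_toy = identity(3, dtype=np.float64, format="csr")
R_toy = np.ones((3, 3))

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "staged_toy.gz")

    # Write: same pattern as the staging cell.
    with gzip.open(path, "wb", compresslevel=4) as fout:
        pickle.dump((X_toy, Y_toy, R_toy), fout)

    # Read: unpack in the order the tuple was written.
    with gzip.open(path, "rb") as fin:
        X_back, Y_back, R_back = pickle.load(fin)

assert np.array_equal(X_back.toarray(), X_toy.toarray())
assert np.array_equal(R_back, R_toy)
```

Only unpickle files you created yourself or otherwise trust, since `pickle.load` can execute arbitrary code.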
