In [1]:
%load_ext lab_black
%load_ext autotime
import pandas as pd
import numpy as np

time: 305 ms (started: 2022-09-20 08:16:30 -07:00)


This is one of the datasets used by [T-SNE Is Not Optimized to Reveal Clusters in Data](https://arxiv.org/abs/2110.02573) and [Stochastic Cluster Embedding](https://arxiv.org/abs/2108.08003) (SCE). The papers suggest that it should be easy to produce obvious clusters in the output for this dataset, but that t-SNE fails to do so. The other datasets are `cytometry`, `higgs`, `shuttle` and `tomoradar`.

It originates with the IJCNN 2001 Challenge and is time-series data. The [libsvmdata page](https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html#ijcnn1) lists 141,691 items: 49,990 in one dataset and 91,701 for testing. That page further breaks the first dataset down into 35,000 training and 14,990 validation items. The SCE paper doesn't use the validation set, so it reports only 126,701 items.
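
As a quick sanity check, the counts quoted above can be reconciled with a few lines of arithmetic. The variable names are made up for this sketch; the numbers are the ones given in the text.

```python
# Item counts quoted above from the libsvmdata page.
n_train = 35_000
n_validation = 14_990
n_test = 91_701

n_first_file = n_train + n_validation  # training and validation ship together
n_all = n_first_file + n_test  # everything listed on the page
n_sce = n_train + n_test  # what the SCE paper reports (no validation set)

print(n_first_file, n_all, n_sce)  # 49990 141691 126701
```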

It's worth understanding the data, which originally consisted of only four features. Using the naming from [a PDF explaining the dataset](http://www.geocities.ws/ijcnn/nnc_ijcnn01.pdf), these are:

* cylinder identifier: this is explained as related to "a binary synchronization pulse related to a natural periodicity in the system". Zhang and co-workers interpret this as indicating which of the 10 cylinders in the engine is involved. This descriptor is a repeating pattern of nine 0s followed by a 1.
* engine crankshaft RPM
* engine crankshaft load
* crankshaft acceleration

However, the libsvmdata site returns 22 features, based on the feature engineering done in [the IJCNN challenge winner's paper](https://doi.org/10.1109/IJCNN.2001.939502). The engine crankshaft RPM and load features are kept as they are, but the cylinder identifier and crankshaft acceleration are each expanded to their values at times t-5 to t+4 (inclusive), giving 10 + 10 + 2 = 22 features in total.
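
To make the windowing concrete, here is a minimal sketch on a toy synchronization pulse (nine 0s then a 1, repeated). It is not the winner's actual preprocessing code, and the column order in the real file is not specified here. It only illustrates why a ten-step window over a period-10 pulse always contains exactly one 1.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

# Toy synchronization pulse: nine 0s then a 1, repeated (period of 10).
pulse = np.tile(np.array([0] * 9 + [1]), 5)

# Each row holds ten consecutive values, i.e. the pulse at t-5 ... t+4
# for t = 5, 6, ...; time steps without a full window are dropped.
windows = sliding_window_view(pulse, 10)

# A length-10 window over a period-10 pulse always contains exactly one 1,
# so each row behaves like a one-hot vector.
ones_per_row = windows.sum(axis=1)

n_features = 10 + 10 + 2  # pulse window + acceleration window + RPM + load
print(windows.shape, ones_per_row.min(), ones_per_row.max(), n_features)
```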

Finally, the label used by the SCE authors is not the prediction label (engine fire/misfire), but the cylinder id converted to an integer from 1 to 10 (essentially a summary of the one-hot encoding in the first ten features of the dataset). The [SCE repo](https://github.com/rozyangno/sce) also has links to pre-processed data.

All in all, there aren't a lot of features here, and one of them is used for the label, so this should be easy for dimensionality reduction methods to handle (which makes the observations about t-SNE in the two papers linked above all the more provocative).

## Download the data

The data comes compressed with bzip2. `load_svmlight_file` returns a tuple: the first item is the data described above (as a sparse matrix) and the second is the prediction labels as -1/1.

In [2]:
import bz2
from io import BytesIO

import requests
from sklearn.datasets import load_svmlight_file


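def load_svmlbz2_url(url):
    """Download a bzip2-compressed svmlight file and return (data, labels)."""
    req = requests.get(url, timeout=10)
    req.raise_for_status()  # fail early on an HTTP error
    with bz2.open(BytesIO(req.content)) as f:
        return load_svmlight_file(f)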

time: 476 ms (started: 2022-09-20 08:16:31 -07:00)
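
The decompress-and-parse step can be tried without a network connection. The following sketch compresses two made-up svmlight lines in memory and reads them back the same way the function above does. The toy values and variable names are assumptions for illustration only.

```python
import bz2
from io import BytesIO

from sklearn.datasets import load_svmlight_file

# Two made-up rows in svmlight format: "<label> <index>:<value> ...".
raw = b"-1 1:1 3:0.5\n1 2:1 3:-0.25\n"
compressed = bz2.compress(raw)

# bz2.open accepts a file-like object and yields a binary stream,
# which load_svmlight_file can read directly.
with bz2.open(BytesIO(compressed)) as f:
    X, y = load_svmlight_file(f)

print(X.shape, y)
print(X.toarray())
```

The first element is a SciPy sparse matrix and the second is a NumPy array of labels, matching the tuple layout described above.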


In [3]:
training_and_val_data = load_svmlbz2_url(
    "https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/ijcnn1.bz2"
)
training_and_val_data

(<49990x22 sparse matrix of type '<class 'numpy.float64'>'
 	with 649870 stored elements in Compressed Sparse Row format>,
 array([-1., -1., -1., ..., -1., -1., -1.]))

time: 3.16 s (started: 2022-09-20 08:16:31 -07:00)


In [4]:
test_data = load_svmlbz2_url(
    "https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/ijcnn1.t.bz2"
)
test_data

(<91701x22 sparse matrix of type '<class 'numpy.float64'>'
 	with 1192113 stored elements in Compressed Sparse Row format>,
 array([-1., -1., -1., ..., -1., -1., -1.]))

time: 2.81 s (started: 2022-09-20 08:16:34 -07:00)


### Convert from sparse to dense and combine

In [5]:
test_dense = test_data[0].todense(order="C").A1.reshape(test_data[0].shape)

time: 40.3 ms (started: 2022-09-20 08:16:37 -07:00)


In [6]:
tv_dense = (
    training_and_val_data[0]
    .todense(order="C")
    .A1.reshape(training_and_val_data[0].shape)
)

time: 8.34 ms (started: 2022-09-20 08:16:37 -07:00)


In [7]:
data = np.vstack([tv_dense, test_dense])

time: 23.3 ms (started: 2022-09-20 08:16:37 -07:00)


In [8]:
data.shape

(141691, 22)

time: 2.69 ms (started: 2022-09-20 08:16:37 -07:00)
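
The sparse-to-dense conversion and stacking can be checked on toy matrices. This is a sketch with made-up values. It also shows `toarray`, which should produce the same plain `ndarray` without the intermediate `np.matrix`.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Two small sparse matrices standing in for the two downloaded files (toy values).
first = csr_matrix(np.array([[1.0, 0.0, 0.5], [0.0, 1.0, 0.0]]))
second = csr_matrix(np.array([[0.0, 0.0, 2.0]]))

# Route via np.matrix: flatten to a 1D ndarray, then restore the original shape.
first_dense = first.todense(order="C").A1.reshape(first.shape)

# Direct route: toarray returns an ndarray of the same shape.
first_direct = first.toarray(order="C")

# Stack the rows of both matrices, as done for the real data.
combined = np.vstack([first_dense, second.toarray(order="C")])
print(type(first_dense).__name__, np.array_equal(first_dense, first_direct))
print(combined.shape)
```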


## Creating the target data

The first ten features are effectively a one-hot encoding of the cylinder id, so we decode them back to an integer. The shift by 5 and the `mod` 10 are just there to make the first label 0:

In [9]:
cylinder_id = np.mod(np.argmax(data[:, 0:10].astype(np.int8), axis=1) + 5, 10)

time: 8.35 ms (started: 2022-09-20 08:16:37 -07:00)
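
The decoding can be checked on a toy one-hot block. The starting column (5) and the one-column-per-row advance are assumptions chosen so that the first decoded label is 0; they are not read from the real file.

```python
import numpy as np

# Toy one-hot block: the 1 starts in column 5 and advances by one column per
# row, wrapping around after column 9 (an assumption made for illustration).
hot_column = (np.arange(12) + 5) % 10
one_hot = np.eye(10, dtype=np.int8)[hot_column]

# argmax recovers the hot column; adding 5 modulo 10 relabels column 5 as 0.
decoded = np.mod(np.argmax(one_hot, axis=1) + 5, 10)
print(decoded)
```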


In [10]:
from drnb.util import categorize

target = pd.DataFrame(
    dict(
        fire=np.concatenate(
            [training_and_val_data[1].astype(np.int8), test_data[1].astype(np.int8)]
        ),
        cylinder_id=cylinder_id,
    )
)
categorize(target, "fire")
categorize(target, "cylinder_id")

time: 12.7 ms (started: 2022-09-20 08:16:37 -07:00)


In [11]:
target

,fire,cylinder_id
0,-1,0
1,-1,1
2,-1,2
3,-1,3
4,-1,4
...,...,...
141686,-1,1
141687,-1,2
141688,-1,3
141689,-1,4


time: 11.6 ms (started: 2022-09-20 08:16:37 -07:00)
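
`categorize` comes from `drnb.util` and its implementation is not shown here. Assuming it converts the named column to a pandas categorical dtype (an assumption, not something this notebook confirms), a plain-pandas sketch of the same target construction on toy labels might look like this:

```python
import numpy as np
import pandas as pd

# Toy labels standing in for the two label arrays and the decoded cylinder id.
first_labels = np.array([-1.0, 1.0, -1.0])
second_labels = np.array([-1.0, -1.0])
toy_cylinder_id = np.array([0, 1, 2, 3, 4])

toy_target = pd.DataFrame(
    dict(
        fire=np.concatenate(
            [first_labels.astype(np.int8), second_labels.astype(np.int8)]
        ),
        cylinder_id=toy_cylinder_id,
    )
)

# Assumed stand-in for drnb.util.categorize: make each column categorical.
for column in ["fire", "cylinder_id"]:
    toy_target[column] = toy_target[column].astype("category")

print(toy_target.dtypes)
```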


## Pipeline

Technically, we could consider standardizing (Z-scaling) this input, but the ranges of the features don't differ by very much.

In [12]:
from drnb.io.pipeline import create_default_pipeline

data_result = create_default_pipeline(check_for_duplicates=True).run(
    "ijcnn",
    data=data,
    target=target,
    tags=["lowdim"],
    verbose=True,
    url="https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html#ijcnn1",
)

time: 1min 52s (started: 2022-09-20 08:16:37 -07:00)
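
For reference, the Z-scaling mentioned at the start of this section could be sketched as follows. This uses a random toy matrix rather than the real data, and plain NumPy rather than anything from `drnb`. Comparing per-feature ranges first is one simple way to judge whether scaling is worth the trouble.

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy matrix whose columns have different scales (not the real data).
toy = rng.normal(loc=[0.0, 5.0, -2.0], scale=[1.0, 10.0, 0.1], size=(1000, 3))

# Per-feature range: a quick way to judge whether scaling is worth it.
feature_range = toy.max(axis=0) - toy.min(axis=0)

# Z-scaling: subtract the column mean, divide by the column standard deviation.
mean = toy.mean(axis=0)
std = toy.std(axis=0)
std[std == 0] = 1.0  # leave constant columns unscaled instead of dividing by zero
z_scaled = (toy - mean) / std

print(feature_range)
print(z_scaled.mean(axis=0).round(6), z_scaled.std(axis=0).round(6))
```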
