SAX-VSM public code release
This code supports our publication:
Senin, Pavel and Malinchik, Sergey, SAX-VSM: Interpretable Time Series Classification Using SAX and Vector Space Model. Data Mining (ICDM), 2013 IEEE 13th International Conference on, pp. 1175-1180, 7-10 Dec. 2013.
Our algorithm is based on the following work:
 Lin, J., Keogh, E., Wei, L. and Lonardi, S., Experiencing SAX: a Novel Symbolic Representation of Time Series. DMKD Journal, 2007.
 Salton, G., Wong, A., Yang, C. S., A vector space model for automatic indexing. Commun. ACM 18, 11, 613–620, 1975.
 Jones, D. R., Perttunen, C. D., and Stuckman, B. E., Lipschitzian optimization without the Lipschitz constant. Journal of Optimization Theory and Applications, vol. 79, no. 1, pp. 157–181, 1993.
 The DiRect implementation source code is partially based on JCOOL.
0.0 In a nutshell
The proposed interpretable time series classification algorithm consists of two steps -- training and classification.
For training, the labeled time series are discretized with SAX via a sliding window, and a "bag of words" is constructed for each of the training classes (a single bag per class). Processing the bags with TF*IDF yields a set of class-characteristic vectors -- one vector per class. Essentially, each element of such a vector is a weighted discretized fragment of the input time series whose weight reflects its "class-characteristic power" and class specificity.
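The sliding-window discretization step can be sketched as follows. This is a simplified illustration with a fixed alphabet of four and hypothetical class/method names, not the actual jmotif API:

```java
import java.util.ArrayList;
import java.util.List;

public class SaxSketch {
    // Gaussian breakpoints for an alphabet of size 4
    static final double[] CUTS = {-0.6745, 0.0, 0.6745};

    // z-normalize a subsequence to zero mean, unit variance
    static double[] znorm(double[] s) {
        double mean = 0, sd = 0;
        for (double v : s) mean += v;
        mean /= s.length;
        for (double v : s) sd += (v - mean) * (v - mean);
        sd = Math.sqrt(sd / s.length);
        double[] r = new double[s.length];
        for (int i = 0; i < s.length; i++) r[i] = sd < 1e-9 ? 0 : (s[i] - mean) / sd;
        return r;
    }

    // Piecewise Aggregate Approximation: reduce the subsequence to paaSize segment means
    static double[] paa(double[] s, int paaSize) {
        double[] r = new double[paaSize];
        for (int i = 0; i < paaSize; i++) {
            int from = i * s.length / paaSize, to = (i + 1) * s.length / paaSize;
            double sum = 0;
            for (int j = from; j < to; j++) sum += s[j];
            r[i] = sum / (to - from);
        }
        return r;
    }

    // map each PAA value to a letter using the breakpoints
    static String toWord(double[] paaValues) {
        StringBuilder sb = new StringBuilder();
        for (double v : paaValues) {
            int c = 0;
            while (c < CUTS.length && v > CUTS[c]) c++;
            sb.append((char) ('a' + c));
        }
        return sb.toString();
    }

    // slide a window of length w over the series, emitting one SAX word per position
    static List<String> saxViaWindow(double[] series, int w, int paaSize) {
        List<String> words = new ArrayList<>();
        for (int i = 0; i + w <= series.length; i++) {
            double[] sub = new double[w];
            System.arraycopy(series, i, sub, 0, w);
            words.add(toWord(paa(znorm(sub), paaSize)));
        }
        return words;
    }

    public static void main(String[] args) {
        double[] series = {0, 1, 2, 3, 4, 5, 4, 3, 2, 1, 0, 1};
        System.out.println(saxViaWindow(series, 6, 3));
    }
}
```

The words collected from all windows of all series of one class form that class's bag.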
For classification, the unlabeled time series is discretized with sliding window-based SAX (exactly the same transform as in training) in order to turn it into a term frequency vector. Next, the cosine similarity is computed between this vector and each of the vectors constructed during training (i.e., the vectors characterizing the training classes). The unlabeled input time series is assigned to the class with which the angle is smallest, i.e., for which the cosine value is largest. This is the ltc.nnn schema in SMART notation.
Because it is easy to see which patterns contribute the most to the cosine similarity value, as well as to see which patterns have the highest weights after training, the algorithm naturally enables the interpretation of training and classification results.
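The classification step above amounts to a cosine-similarity argmax over sparse term-weight vectors. A minimal sketch, using plain maps and hypothetical names rather than the actual jmotif classes:

```java
import java.util.HashMap;
import java.util.Map;

public class CosineClassifier {
    // cosine similarity between two sparse term-weight vectors
    static double cosine(Map<String, Double> a, Map<String, Double> b) {
        double dot = 0, na = 0, nb = 0;
        for (Map.Entry<String, Double> e : a.entrySet()) {
            Double w = b.get(e.getKey());
            if (w != null) dot += e.getValue() * w;
            na += e.getValue() * e.getValue();
        }
        for (double w : b.values()) nb += w * w;
        return (na == 0 || nb == 0) ? 0 : dot / (Math.sqrt(na) * Math.sqrt(nb));
    }

    // assign the series' term frequency vector to the class with the largest cosine
    static String classify(Map<String, Double> tf, Map<String, Map<String, Double>> classVectors) {
        String best = null;
        double bestSim = -1;
        for (Map.Entry<String, Map<String, Double>> e : classVectors.entrySet()) {
            double sim = cosine(tf, e.getValue());
            if (sim > bestSim) { bestSim = sim; best = e.getKey(); }
        }
        return best;
    }

    public static void main(String[] args) {
        Map<String, Double> tf = new HashMap<>();
        tf.put("abd", 1.0);
        Map<String, Double> gun = new HashMap<>();
        gun.put("abc", 1.0); gun.put("abd", 2.0);
        Map<String, Double> point = new HashMap<>();
        point.put("bcd", 3.0);
        Map<String, Map<String, Double>> classVectors = new HashMap<>();
        classVectors.put("Gun", gun);
        classVectors.put("Point", point);
        System.out.println(classify(tf, classVectors)); // the shared word "abd" pulls it to Gun
    }
}
```

Since the weight of each shared word contributes directly to the dot product, inspecting the top contributors explains each classification decision.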
The whole process is illustrated below:
The code is written in Java, and I use Maven to build it:
```
$ mvn package -P single
[INFO] Scanning for projects...
[INFO] ------------------------------------------------------------------------
[INFO] Building sax-vsm
[INFO]    task-segment: [package]
...
[INFO] Building jar: /media/Stock/git/sax-vsm_classic.git/target/sax-vsm-0.0.1-SNAPSHOT-jar-with-dependencies.jar
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESSFUL
```
2.0 Running the classifier
SAXVSMClassifier is runnable from the command line; running it without parameters prints the usage help. Here is a trace of running SAX-VSM on the Gun/Point dataset:
```
$ java -cp "target/sax-vsm-0.0.1-SNAPSHOT-jar-with-dependencies.jar" net.seninp.jmotif.SAXVSMClassifier \
 -train src/resources/data/Gun_Point/Gun_Point_TRAIN -test src/resources/data/Gun_Point/Gun_Point_TEST \
 -w 33 -p 17 -a 15
trainData classes: 2, series length: 150
 training class: 2 series: 26
 training class: 1 series: 24
testData classes: 2, series length: 150
 test class: 2 series: 74
 test class: 1 series: 76
classification results: strategy EXACT, window 33, PAA 17, alphabet 15, accuracy 1.00, error 0.00
```
3.0 Running the parameters sampler (optimizer)
Symbolic discretization with SAX -- the first step of our algorithm -- requires hyperparameters to be specified by the user. Unfortunately, their optimal selection is not trivial. We propose using the Dividing Rectangles (DiRect) optimization scheme for accelerated selection of optimal parameter values.
The code implements the DiRect sampler, which can be called from the command line. Below is the trace of running the sampler on the Gun/Point dataset. The series in this dataset have length 150, so I define the sliding window range as [10-150], the PAA size range as [5-75], and the alphabet size range as [2-18]:
```
$ java -jar target/sax-vsm-0.0.1-SNAPSHOT-jar-with-dependencies.jar \
 -train src/resources/data/Gun_Point/Gun_Point_TRAIN -test src/resources/data/Gun_Point/Gun_Point_TEST \
 -wmin 10 -wmax 150 -pmin 5 -pmax 75 -amin 2 -amax 18 --hold_out 1 -i 3
trainData classes: 2, series length: 150
 training class: 2 series: 26
 training class: 1 series: 24
testData classes: 2, series length: 150
 test class: 2 series: 74
 test class: 1 series: 76
running sampling for MINDIST strategy...
iteration: 0, minimal value 0.18 at 80, 40, 10
iteration: 1, minimal value 0.04 at 80, 17, 10
iteration: 2, minimal value 0.04 at 80, 17, 10
min CV error 0.04 reached at [80, 17, 10],
will use Params [windowSize=80, paaSize=17, alphabetSize=10, nThreshold=0.01, nrStartegy=MINDIST]
running sampling for EXACT strategy...
iteration: 0, minimal value 0.0 at 80, 40, 10
iteration: 1, minimal value 0.0 at 80, 40, 10
iteration: 2, minimal value 0.0 at 80, 40, 10
min CV error 0.00 reached at [80, 40, 10], [33, 17, 15],
will use Params [windowSize=33, paaSize=17, alphabetSize=15, nThreshold=0.01, nrStartegy=EXACT]
running sampling for NONE strategy...
iteration: 0, minimal value 0.0 at 80, 40, 10
iteration: 1, minimal value 0.0 at 80, 40, 10
iteration: 2, minimal value 0.0 at 80, 40, 10
min CV error 0.00 reached at [80, 40, 10], [64, 40, 10], [33, 17, 15],
will use Params [windowSize=33, paaSize=17, alphabetSize=15, nThreshold=0.01, nrStartegy=NONE]
classification results: strategy MINDIST, window 80, PAA 17, alphabet 10, accuracy 0.92667, error 0.07333
classification results: strategy EXACT, window 33, PAA 17, alphabet 15, accuracy 1.00, error 0.00
classification results: strategy NONE, window 33, PAA 17, alphabet 15, accuracy 0.97333, error 0.02667
```
As shown in our work, DiRect provides a significant speed-up compared with grid search. Below is an illustration of DiRect-driven parameter optimization for the SyntheticControl dataset. The left panel shows all points sampled by DiRect in the space PAA ∗ Window ∗ Alphabet: red points correspond to high error values, while green points correspond to low error values in cross-validation experiments. Note the concentration of green points at W=42 (where the optimal value is). The middle panel shows the classification error heat map obtained by a complete scan of all 432 points of the hypercube slice at W=42. The right panel shows the classification error heat map of the same slice when the parameter search is driven by DiRect; the optimal solution (P=8, A=4) was found by sampling 43 points (i.e., a 10X speed-up over the densely sampled slice).
4.0 Interpretable classification
The class SAXVSMPatternExplorer prints the most significant class-characteristic patterns, their weights, and the time series that contain them. The best_words_heat.R script allows plotting these. Here is an example for the Gun/Point data:
Note that the time series ranges highlighted by the approach correspond to distinctive class features: the Gun class is best characterized by the articulated movements for prop retrieval and aiming, while the Point class is characterized by the 'overshoot' phenomenon and a simpler (compared to Gun) movement before aiming.
Note that the default choice for validating the best parameters on TEST data is the parameter set corresponding to the shortest sliding window, which you may want to change -- for example, to choose the point whose neighborhood contains the highest density of sampled points.
Also note that the code implements five ways to compute the TF (term frequency) value:
```java
double tfValue = Math.log(1.0D + Integer.valueOf(wordInBagFrequency).doubleValue());
// double tfValue = 1.0D + Math.log(Integer.valueOf(wordInBagFrequency).doubleValue());
// double tfValue = normalizedTF(bag, word.getKey());
// double tfValue = augmentedTF(bag, word.getKey());
// double tfValue = logAveTF(bag, word.getKey());
```
For many datasets, these yield quite different accuracy.
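To make the weighting concrete, here is a minimal sketch of TF*IDF over class bags represented as plain word-count maps, using the log-TF variant together with a standard log IDF. The class and method names are hypothetical, not the jmotif API:

```java
import java.util.HashMap;
import java.util.Map;

public class TfIdfSketch {
    // weight each word in each class bag as (1 + log TF) * log(N / DF),
    // where N is the number of class bags and DF counts bags containing the word
    static Map<String, Map<String, Double>> tfIdf(Map<String, Map<String, Integer>> bags) {
        int n = bags.size();
        Map<String, Map<String, Double>> out = new HashMap<>();
        for (Map.Entry<String, Map<String, Integer>> bag : bags.entrySet()) {
            Map<String, Double> weights = new HashMap<>();
            for (Map.Entry<String, Integer> w : bag.getValue().entrySet()) {
                double tf = 1.0 + Math.log(w.getValue()); // log-scaled term frequency
                int df = 0;                               // how many class bags contain the word
                for (Map<String, Integer> other : bags.values())
                    if (other.containsKey(w.getKey())) df++;
                double idf = Math.log((double) n / df);
                weights.put(w.getKey(), tf * idf);
            }
            out.put(bag.getKey(), weights);
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, Integer> gun = new HashMap<>();
        gun.put("abc", 2); gun.put("abd", 1);
        Map<String, Integer> point = new HashMap<>();
        point.put("abc", 1); point.put("bcd", 3);
        Map<String, Map<String, Integer>> bags = new HashMap<>();
        bags.put("Gun", gun);
        bags.put("Point", point);
        // "abc" occurs in both bags, so its IDF (and weight) is zero;
        // words unique to one class get positive weight
        System.out.println(tfIdf(bags));
    }
}
```

Words shared by all classes get zero weight, which is exactly what makes the resulting vectors class-characteristic.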
The normalization threshold (used in SAX discretization) is also a quite important hidden parameter -- changing it from 0.001 to 0.01 may significantly change the classification accuracy on a number of datasets where the original signal's standard deviation is small, such as Beef.
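The threshold's effect can be sketched as a guard inside z-normalization. The guard convention below (returning near-constant subsequences unscaled) is an assumption for illustration, not necessarily the exact behavior of the released code:

```java
public class ZNorm {
    // z-normalize a subsequence unless it is near-constant: when the standard
    // deviation falls below nThreshold, the values are returned unchanged
    // (assumed convention; raising the threshold thus leaves more low-variance
    // subsequences unnormalized, which changes the produced SAX words)
    static double[] znorm(double[] s, double nThreshold) {
        double mean = 0.0, var = 0.0;
        for (double v : s) mean += v;
        mean /= s.length;
        for (double v : s) var += (v - mean) * (v - mean);
        double sd = Math.sqrt(var / s.length);
        if (sd < nThreshold) return s.clone();
        double[] r = new double[s.length];
        for (int i = 0; i < s.length; i++) r[i] = (s[i] - mean) / sd;
        return r;
    }

    public static void main(String[] args) {
        // a nearly flat signal passes through unscaled...
        System.out.println(java.util.Arrays.toString(
            znorm(new double[]{1.0, 1.001, 0.999, 1.0}, 0.01)));
        // ...while a signal with real variance is scaled to zero mean, unit variance
        System.out.println(java.util.Arrays.toString(
            znorm(new double[]{0.0, 2.0}, 0.01)));
    }
}
```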
Finally, note that when the cosine similarity is computed within the classification procedure, its value may turn out to be the same for all classes. In that case, the current implementation considers the time series misclassified, but you may want to assign it to one of the classes at random instead.
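Such a tie can be detected with a simple check over the per-class similarities before deciding how to handle it. This is a hypothetical helper, not part of the released code:

```java
import java.util.HashMap;
import java.util.Map;

public class TieCheck {
    // true when every class yields the same cosine similarity, i.e., the
    // term frequency vector gives no basis for a class assignment
    static boolean isTie(Map<String, Double> sims) {
        Double first = null;
        for (double v : sims.values()) {
            if (first == null) first = v;
            else if (Math.abs(v - first) > 1e-12) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        Map<String, Double> sims = new HashMap<>();
        sims.put("Gun", 0.5);
        sims.put("Point", 0.5);
        System.out.println(isTie(sims)); // prints true: both classes tie
    }
}
```

On a detected tie one can count the series as an error (the current behavior), pick a class at random, or fall back to a secondary criterion.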
6.0 The classification accuracy table
The following table was obtained in automated mode using the DiRect-driven parameter optimization scheme. Note that when the minimal CV error is the same for a number of parameter combinations, the sampler breaks the tie by choosing the parameter set with the smallest sliding window.
|Dataset|Classes|Length|Euclidean 1NN|DTW 1NN|SAX-VSM|
|-------|-------|------|-------------|-------|-------|