Implements Sequtur (online) and Re-Pair (off-line) grammar induction algorithms for Grammarviz 2.0 and SAX-VSM-G. This code is released under GPL v.2.0.
[1] Nevill-Manning, C.G. and Witten, I.H., "Identifying Hierarchical Structure in Sequences: A linear-time algorithm", Journal of Artificial Intelligence Research, 7, 67-82, (1997).
[2] Larsson, N.J.; Moffat, A., "Offline dictionary-based compression", Data Compression Conference, 1999. Proceedings. DCC '99 , vol., no., pp.296,305, 29-31 Mar 1999.
If you are using this implementation for you academic work, please cite our Grammarviz 2.0 paper:
[Citation] Senin, P., Lin, J., Wang, X., Oates, T., Gandhi, S., Boedihardjo, A.P., Chen, C., Frankenstein, S., Lerner, M., GrammarViz 2.0: a tool for grammar-based pattern discovery in time series, ECML/PKDD Conference, 2014.
The code is written in Java and I use maven to build it:
$ mvn package
[INFO] Scanning for projects...
[INFO] ------------------------------------------------------------------------
[INFO] Building GI
[INFO] task-segment: [package]
...
[INFO] Building jar: /media/Stock/git/jmotif-GI.git/target/jmotif-gi-0.3.1-SNAPSHOT.jar
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESSFUL
[INFO] ------------------------------------------------------------------------
Following the original Eibe Frank's java implementation the code is built using global (static) variables:
String TEST3_STRING = "a b a b c a b c d a b c d e a b c d e f";
SAXRule r = SequiturFactory.runSequitur(TEST3_STRING);
System.out.println(SAXRule.printRules());
which prints the following output:
Number Name Level Occurr. Usage Yield Rule str Expaneded Indexes
0 R0 0 0 0 0 R1 R2 R3 R4 R4 f a b a b c a b c d a b c d e a b c d e f []
1 R1 1 5 2 2 a b a b [0, 2, 5, 9, 14]
2 R2 1 4 2 3 R1 c a b c [2, 5, 9, 14]
3 R3 1 3 2 4 R2 d a b c d [5, 9, 14]
4 R4 1 2 2 5 R3 e a b c d e [9, 14]
My own addition allows to retrieve the Sequitur rules as an iterable collection of GrammaRuleRecords and to map them back to the discretized time series:
GrammarRules rules = r.toGrammarRulesData();
GrammarRuleRecord rec = rules.get(4);
ArrayList<RuleInterval> intervals = rec.getRuleIntervals();
...
I've implemented RePair from scratch and it uses the same GrammaRules / GrammaRuleRecord data structures as for Sequitur, so it can be plugged into Grammarviz seamlessly:
String TEST_STRING = "abc abc cba XXX abc abc cba";
RePairGrammar rg = RePairFactory.buildGrammar(TEST_STRING);
System.out.println(rg.toGrammarRules());
which yields:
R0 -> R2 XXX R2
R1 -> abc cba : abc cba, [1, 5]
R2 -> abc R1 : abc abc cba, [0, 4]
Thanks to the algorithm's design, I was able to parallelize RePair. However, the cost of inter-tread communications and synchronization was the majot showstopper, so the current new implementation (>0.8.5) is single-threaded (but you can still get the parallel one -- it is tagged "old_repair" in the version control).
The both implemented GI algorithms, Sequitur and RePair, demonstrate a somewhat similar performance with minor differnces. Specifically:
- Sequitur implementation is slower than RePair
- Sequitur tends to produce more rules, but Sequitur rules are less frequent than RePair rules
- Sequitur rule-corresponding subsequences vary in length more
- Sequitur rules usually cover more points than RePair
- Sequitur rule coverage depth is lower than that of RePair
All these may affect the performance of the upstream time series analysis algorithms such as SAX-VSM-G, Grammarviz, and RRA. Here is the table with some numbers collected by running Sequitur and RePair using sliding window of size 150, PAA 6, and the alphabet 4. I used the EXACT numerosity reduction in these runs.
Dataset | Size | Sequitur | RePair | ||||||
---|---|---|---|---|---|---|---|---|---|
rules | time | cov.dpth | max.freq. | rules | time | cov.dpth | max.freq. | ||
Daily commute | 17175 | 292 | 8 | 12.8 | 45 | 362 | 4 | 18.3 | 53 |
Dutch power demand | 35040 | 916 | 38 | 26.6 | 124 | 769 | 14 | 29.6 | 162 |
ECG 0606 | 2300 | 67 | 4 | 18.4 | 11 | 74 | 1 | 37.4 | 14 |
ECG 108 | 21600 | 539 | 11 | 18.3 | 44 | 472 | 9 | 20.8 | 45 |
ECG 15 | 15000 | 279 | 9 | 19.2 | 58 | 239 | 5 | 25.7 | 71 |
ECG 300 | 536976 | 10178 | 4458 | 34.2 | 980 | 7649 | 2048 | 35.7 | 1673 |
ECG 308 | 5400 | 131 | 4 | 13.2 | 14 | 143 | 1 | 22.1 | 15 |
ECG 318 | 586086 | 7113 | 2234 | 27.8 | 1422 | 5112 | 1435 | 29.1 | 2942 |
Insect | 18667 | 632 | 19 | 17.1 | 25 | 584 | 10 | 18.3 | 32 |
Respiration, NPRS 43 | 4000 | 881 | 33 | 26.5 | 29 | 813 | 12 | 27.5 | 45 |
Respiration, NPRS 44 | 24125 | 1189 | 66 | 28.1 | 40 | 1057 | 17 | 28.9 | 61 |
TEK14 | 5000 | 205 | 5 | 27.4 | 78 | 237 | 3 | 32.6 | 130 |
TEK16 | 5000 | 181 | 4 | 25.8 | 100 | 210 | 2 | 31.9 | 157 |
TEK17 | 5000 | 190 | 7 | 26.5 | 190 | 208 | 2 | 32 | 208 |
Video dataset | 11251 | 285 | 11 | 16.9 | 29 | 301 | 7 | 21.8 | 30 |
Winding | 2500 | 70 | 3 | 10.6 | 5 | 225 | 1 | 33.5 | 5 |
1.0.1
- Optimized rule pruning algorithm
GrammarRules
,GrammarRuleRecord
, andRuleInterval
implementSerializable
1.0.0
- more tests & Grammarviz3.0 release.
0.8.6
- pre-1.0 release with improved RePair implementation.
0.0.1 - 0.8.5
- initial code development, parallel repair implementation lifecycle.