Enable text transductions involving multibyte character encodings #15

jrte · 2022-02-23T02:12:02Z

Ginr (2.1.0c) is improving Unicode support, but jrte is lagging and as it stands now only 7-bit (ASCII) text can be transduced. Jrte uses char[] and CharBuffer everywhere, which means that multibyte character encodings are decoded to 16-bit code points in jrte IInput streams. However, ginr patterns are compiled to raw byte encodings and there is no way to specify a 16-bit code point in a pattern.

Ginr is moving in the right direction, since it obviates the need to fully decode byte[]->char[] to present to runtime transductions. To keep up, jrte must be refactored (extensively) to use byte[] and ByteBuffer to represent input sequences. The prologue will have to be extended to include utf8 = utf7 + {80..FF}, and utf7-dependent definitions must be extended to include all utf8 bytes (e.g. PasteAny = (utf8, paste)*). Patterns that use non-ASCII characters will have to adapt as well since, for example, ('⅀', paste) will paste only the last byte of the UTF-8 encoding; ('⅀' @ PasteAny) would be required instead. This shouldn't be a problem for runs in master or head branches because they never worked with non-ASCII inputs anyway.

This will require an extensive rewrite of many jrte components and will be undertaken on a new branch (raw). Issue #6 will be addressed in raw by deprecating ITransduction.input(IInput[]) and the whole IInput framework and driving transductions from client code with sequential calls to Transduction.run(byte[] ...).

Be patient. This is old code, I'm old, even my dog is old. And Java hurts.

This commit applies a stop-gap workaround for the IInput-related issues described in issue #6. It just applies to `jrte.Jrte.main()` the same logic as `test.FileRunner.main()` -- the input file path is specified as an input argument and the entire file contents are read into RAM. This workaround just obviates inclusion of jrte-HEAD-test.jar in classpath. Tests have been extended to include runs with FileRunner.main() for benchmarking and Jrte.main() for regular runs. Output equivalence is checked for verbose gc interval time extraction via two different FSTs and one regex.

The `etc/sh/jrte.sh` script demonstrates how to use jrte to transduce a file.

```
java -cp jrte-HEAD.jar com.characterforming.jrte.Jrte \
  [--nil] <transducer-name> <input-filepath> <gearbox-filepath>
```

The `--nil` option presents a `!nil` signal to the transduction before presenting the file contents.

This is all that I intend to do with issue #6 for now, as a good fix will require replacing the entire IInput framework and transducing from raw byte streams; this will be undertaken in connection with issue #15.

Signed-off-by: jrte <jrte.project@gmail.com>

jrte · 2022-02-23T15:06:42Z

Using ByteBuffer.allocateDirect(int capacity) to instantiate a direct ByteBuffer will reduce pressure on the JVM heap, since the buffer's storage will be allocated from native system RAM outside the heap (a direct buffer has no backing byte[] array) and can be filled using native I/O, which is worthwhile when capacity is large. Jrte in the raw branch should use large-capacity direct byte buffers to wrap transduction input from large files. Jrte in head and master uses heap-bound char[] arrays and relies on io.InputStreamReader to decode byte[]->char[], which puts a lot of pressure on the heap, sadly.
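A minimal sketch of the idea, not code from the raw branch: it reads a file through a direct ByteBuffer with FileChannel and only counts the bytes where a transduction would consume them. The class name `DirectBufferRead`, the `countBytes` helper and the 64 KB capacity are assumptions made for illustration.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical illustration; not part of jrte.
class DirectBufferRead {
    // Reads the whole file through one direct buffer (capacity must be > 0)
    // and returns the number of bytes seen.
    static long countBytes(Path file, int capacity) throws IOException {
        ByteBuffer buffer = ByteBuffer.allocateDirect(capacity); // storage lives outside the JVM heap
        long total = 0;
        try (FileChannel channel = FileChannel.open(file, StandardOpenOption.READ)) {
            while (channel.read(buffer) != -1) {
                buffer.flip();               // switch from filling to draining
                total += buffer.remaining(); // a transduction would consume the raw bytes here
                buffer.clear();              // make the buffer ready for the next read
            }
        }
        return total;
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("direct-buffer", ".txt");
        Files.write(file, "sigma = '\u2140';\n".getBytes(StandardCharsets.UTF_8));
        System.out.println(countBytes(file, 64 * 1024));
        Files.delete(file);
    }
}
```

No char[] is ever produced here; the bytes stay in their encoded form from the channel to the consumer.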

jrte · 2022-03-26T13:41:14Z

Commit ec6487b is a start on this. It is not a commit to be built upon because there will be more refactoring of interfaces and classes to come. It can be used to compile text transduction patterns of your own design into a gearbox using the BaseTarget and to run those transductions in your application or in a shell with Jrte.main().

All text transduction is now done in the UTF-8 encoded domain, without decoding. This obviates decoding 8-bit UTF-8 bytes and widening them to 16-bit Unicode chars. Arbitrary (almost) binary data can be included on ribose parameter tapes (enclosed in `backquotes`) using \xHH notation to include the UTF-8 encoding of unprintable characters. In ginr, printable non-ASCII characters can be represented naturally in regular expressions and will be represented by their respective UTF-8 multi-byte encodings in compiled automata, for example:

sigma = '⅀';
sigma:pr;
DFA MIN States: 5      Trans: 4      Tapes: 1  Strg: 3 K

(START) \xE2 2
2 \x85 3
3 \x80 4
4 -| (FINAL)
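The three byte transitions in the automaton above correspond to the three bytes of the UTF-8 encoding of '⅀' (U+2140). A small sketch to check this with the standard library; the class name `SigmaBytes` and the `hex` helper are made up for the illustration.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical illustration; not part of jrte or ginr.
class SigmaBytes {
    // Returns the UTF-8 encoding of text as space-separated upper-case hex bytes.
    static String hex(String text) {
        StringBuilder sb = new StringBuilder();
        for (byte b : text.getBytes(StandardCharsets.UTF_8)) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // \u2140 is the sigma character used in the ginr example
        System.out.println(hex("\u2140")); // prints "E2 85 80"
    }
}
```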

At present there is only one input method (BytesInput) and it requires all input to be preloaded into RAM (see issue #6). This will be fixed in a later commit; the problem relates to the mark/restore effector implementations, which need to be reworked. This sort of back-tracking is sometimes necessary because at times the transduction will lead to a point where more than one feature may be present but some of the features have common prefixes that trigger different effector patterns and cannot be discriminated until they have been scanned for some distance. In those cases the transduction can mark the point of departure, scan ahead until a specific feature is recognized, and then reset to the mark and push a specific transducer to run with effectors enabled to process the feature. See the LinuxKernelStrict and LinuxKernelLoose transducers in LinuxKernelLog.inr for examples of this.
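The mark, scan ahead, reset sequence can be sketched with the mark() and reset() methods of java.nio.ByteBuffer. This is only an analogy for the jrte mark and reset effectors, not their implementation; the class name `MarkResetSketch`, the `classify` method and the two toy features ("assignment" and "call", which may share an arbitrary prefix) are assumptions.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration; not the jrte mark/reset effectors.
class MarkResetSketch {
    // Scans ahead from the current position until a discriminating byte or the
    // end of the line is seen, then rewinds to the point of departure.
    static String classify(ByteBuffer input) {
        input.mark(); // remember the point of departure
        String feature = "unknown";
        while (input.hasRemaining()) {
            byte b = input.get();
            if (b == '=') { feature = "assignment"; break; }
            if (b == '(') { feature = "call"; break; }
            if (b == '\n') { break; }
        }
        input.reset(); // back to the mark; a feature-specific pass can now rescan
        return feature;
    }

    public static void main(String[] args) {
        ByteBuffer line = ByteBuffer.wrap("total = 1\n".getBytes(StandardCharsets.UTF_8));
        System.out.println(classify(line) + " at position " + line.position());
    }
}
```

Because the rescan starts from the mark, everything between the mark and the scan-ahead point has to remain available, which is why input currently needs to be held in RAM.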

Many other changes are included in ec6487b and more are coming. The intention is to provide a factory component that will produce gearbox compiler instances (to compile ginr DFAs into a persistent store, the gearbox) and run-time instances (providing a capability to instantiate run-time transductions). A transduction is a binding of a user-defined ITarget instance with an ITransduction instance. An ITarget is a class that expresses 0 or more IEffector<T> classes and encapsulates domain-specific algorithms that assimilate data extracted from the transduction input into the target domain. Each IEffector class expresses a family of parameterized methods, each receiving a single byte from the input. When the transduction is presented with input and run, the input pattern selects and invokes a series of effectors for each byte scanned and so drives the target to assimilate the data extracted from features of interest.

Effector implementations are light-weight classes that are expressed by an ITarget implementation and typically hold a reference to an ITarget instance that they act on when they are invoked from a running transduction. Effector classes may be simple BaseEffector extensions that express only one invoke() method, or they may extend BaseParameterizedEffector<T>, where T is a type that can be constructed from byte[][] (an array of utf8-encoded or binary byte[] arrays). In the latter case, the effector's compileParameter(int, byte[][]) method will be called once to produce a parameter instance of type T for each unique parameter value bound to the effector in any transducer pattern compiled into the gearbox. The effector implementation maintains an array of compiled parameter objects, and the index of the compiled parameter to be used is passed with each call to the effector's invoke(int) method.

The BaseTarget class is sufficient for running most grep/awk-like transductions, given a set of patterns compiled from ginr into a gearbox. This class will be sealed in future. There is no longer a strict need for user-defined targets to inherit from BaseTarget, and multiple targets, each with unique effectors, can be bound to a transduction. The BaseTarget expresses effectors that are actually implemented inline in the Transduction class, which implements ITarget as well as ITransduction. As of this commit, gearbox compilation is effected using the `GearboxCompiler.main()` method and text transductions can be run with the `BaseTarget` using `Jrte.main()`. To obtain an `ITransduction` instance to run in an application context, instantiate Jrte and call `Jrte.transduction()`.

jrte · 2022-04-01T03:06:00Z

This is starting to come together. It still needs minor refactoring and revised and improved documentation. Below are CI test results from 2efbb81. The 20x stats are run times in ms. All inputs were held in RAM and output was disabled for these benchmarking runs. The summary stats at the end of the 20x stats show throughput (mb/s), error rates (nul/kb) and relative speedup/slowdown for ribose vs regex (ribose:regex) averaged over the last 10 runs. Output is enabled only for the final runs, which verify equivalence of output captured by regex and equivalent ribose transducers.

There are more changes coming to get rid of the gearbox metaphor. Jrte (Java Recursive Transduction Engine) is a legacy from a 2011 implementation (my 2nd, this is my 3rd, 1st was in late 80s when I discovered ginr). It'll be all ribose. There is a new ribose package containing the ribose component interfaces. These provide access to the ribose compiler, which assembles ribose transducers from 3-tape subsequential (single-valued) DFAs compiled by ginr into a runtime model (not a gearbox). The model is bound at compile time to an application-defined target class that the compiler instantiates to discover 1 or more effector classes. The model target instantiates its effectors and presents them to the compiler for parameter validation.

Effectors are simple, presenting only a parameterless invoke() (apply effect to target) method to transductions, or are parameterized by a generic type T and present additional methods setParameterCount(int count) (->T[count]), compileParameter(int ordinal, byte[][] bytes) (->T[ordinal] = new T(bytes)) and invoke(int ordinal) (->use T[ordinal] to apply effect to target). The byte arrays presented to be compiled are the sequences of [tokens] discovered in association with effector references in any of the transducers compiled into the model.
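A rough sketch of the parameterized effector lifecycle described above, with T = String. It is a self-contained stand-in, not the ribose interfaces: the class name `PasteLiteralSketch`, the use of a StringBuilder as the target and the void return types are assumptions; only the three method names and their argument lists come from the description.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical stand-in for a parameterized effector; not the ribose API.
class PasteLiteralSketch {
    private final StringBuilder target; // stand-in for the target the effector acts on
    private String[] parameters;        // T[count], with T = String

    PasteLiteralSketch(StringBuilder target) {
        this.target = target;
    }

    // Called once with the number of distinct parameter lists bound to this effector.
    void setParameterCount(int count) {
        parameters = new String[count];
    }

    // Called once per distinct parameter list: T[ordinal] = new T(bytes).
    void compileParameter(int ordinal, byte[][] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte[] token : bytes) {
            sb.append(new String(token, StandardCharsets.UTF_8));
        }
        parameters[ordinal] = sb.toString();
    }

    // Called from the running transduction: use T[ordinal] to apply the effect.
    void invoke(int ordinal) {
        target.append(parameters[ordinal]);
    }

    public static void main(String[] args) {
        StringBuilder out = new StringBuilder();
        PasteLiteralSketch paste = new PasteLiteralSketch(out);
        paste.setParameterCount(2);
        paste.compileParameter(0, new byte[][] { "<".getBytes(StandardCharsets.UTF_8), "b>".getBytes(StandardCharsets.UTF_8) });
        paste.compileParameter(1, new byte[][] { "\u2140".getBytes(StandardCharsets.UTF_8) });
        paste.invoke(1);
        paste.invoke(0);
        System.out.println(out);
    }
}
```

Compiling parameters once up front keeps the per-byte invoke(int) path down to an array lookup.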

Ginr encodes backquoted tokens as mixed UTF-8 byte encodings of 16-bit UNICODE characters and unprintable bytes (represented as \xHH). What you see between [] in effector[with parameters] is what the effector will receive with compileParameter(int, byte[][]). The effector can use the static Bytes.decode(byte[]) method to recover String representations, and anything can be constructed from a string. So that's all good. Maybe even better with 3 or 4 strings. As many as are required, they can be acquired for free by including them in a ribose pattern for ginr.

A compiled ribose model includes the transducers assembled from ginr DFAs, a model target, the target's effectors, and their parameter byte[][]s. Additionally, it enumerates some special artifacts that may be referenced in ribose patterns. Tokens of the form !signal are signals that can be raised using the built-in in[!signal] effector. Transducers (referenced as @name) select[~value] and then paste bytes extracted from input or paste[some embellishment], all of which will be appended to the named ~value to be assimilated into the target when complete. Any number of signals and values can be defined; just give them the right form of name and reference them in ribose patterns. They're free too.

So there are these named transducers and signals and values floating around, and the static Ribose.compileRuntimeModel() method binds them all together with a target, effectors and effector parameters in a single model. The Ribose.loadRuntimeModel(File modelFile, ITarget modelTarget) method, also static, loads a model file and goes through parameter binding with the model target. This is not a live target; it is only used to instantiate model effectors for parameter compilation when the model loads. When the model is completely loaded, IRiboseRuntime.newTransduction(ITarget liveTarget) will clone precompiled parameter objects from model to live effectors and bind live target and effectors to a runtime transduction stack. ITransduction provides a simple interface for controlling runtime transductions.

This commit picks up ginr 2.1.0c7, which has a new :pr display format and full support for representing text patterns in the UTF-8 encoded domain, and ribose is taking advantage of this to obviate decoding and widening of byte to char. Ginr is a gem and now that it can handle >64K states and inputs >16 bits wide it is truly an industrial-strength application. Pity this stuff didn't get into mainstream computing early on.

Note: the -1s in place of run times in the first RegexTest row just indicate that this was a non-capturing regex. The summary stats are complete as for the other rows.

     [echo] DateExtractor:
     [java] 2019/02/33 2019/11/141 2019/11/141 2020/08/314 
     [echo] TestRunner (benchmarking simple fsts driving base transduction effectors):
     [java]            RegexTest:   -1  -1  -1  -1  -1  -1  -1  -1  -1  -1  -1  -1  -1  -1  -1  -1  -1  -1  -1  -1 : 154.316 mb/s (chars)
     [java]       RegexGroupTest:   71  63  59  59  59  59  58  59  58  60  58  58  58  59  59  58  60  59  60  59 : 162.190 mb/s (chars)
     [java]         NilSpeedTest:   66  77  47  45  45  44  45  44  44  44  45  44  45  44  44  44  45  45  45  45 : 213.828 mb/s (bytes)
     [java]       PasteSpeedTest:  117 116  64  48  49  49  49  48  48  49  49  49  52  50  49  49  50  48  48  49 : 193.443 mb/s (bytes)
     [java]         NilPauseTest:  156 222  71  71  70  72  69  70  69  70  70  69  69  70  71  70  70  71  70  69 : 136.434 mb/s (bytes)
     [java]       PastePauseTest:  176  94  79  83  79  79  80  79  78  79  79  82  80  79  79  79  78  79  80  79 : 120.110 mb/s (bytes)
     [java]         PasteCutTest:   82  81  81  80  80  81  81  80  80  80  81  81  79  80  81  80  80  80  81  80 : 118.764 mb/s (bytes)
     [java]      SelectPasteTest:   77  75  74  75  74  74  74  74  75  75  75  75  75  75  74  73  75  74  74  75 : 128.010 mb/s (bytes)
     [java]       PasteCountTest:  620 224 226 246 251 224 222 223 223 223 223 223 223 222 221 222 263 269 239 232 :  40.808 mb/s (bytes)
     [java]          CounterTest:  184 194 168 170 168 172 167 176 176 168 169 167 168 169 169 169 167 169 169 168 :  56.631 mb/s (bytes)
     [java]            StackTest:  253 164 164 166 160 164 184 185 183 180 180 181 181 182 181 162 160 162 159 159 :  55.868 mb/s (bytes)
     [echo] Ribose (benchmarking fsts for extracting data from noisy inputs, vs similar regex, output muted):
     [exec]   115080  2199300 20420730 kern-10.log
     [java]       LinuxKernelNil:  136 126  93  93  93  93  93  93  93  93  93  93  92  92  93  92  92  93  92  92 : 210.765 mb/s  45.875 nul/kb
     [java]                RegEx:  490 490 455 452 462 457 453 456 455 453 452 454 455 455 453 453 456 454 455 476 :  42.680 mb/s   4.938 ribose:regex
     [java]          LinuxKernel:  174 109 160 118 116 115 116 116 111 110 109 109 109 109 108 108 109 109 109 109 : 178.996 mb/s  45.875 nul/kb
     [java]                RegEx:  537 545 492 498 490 490 490 489 490 490 489 489 491 489 490 491 489 490 491 489 :  39.761 mb/s   4.502 ribose:regex
     [java]     LinuxKernelLoose:  457 237 175 174 168 167 167 169 171 171 172 178 175 175 175 176 173 174 174 178 : 111.284 mb/s  45.875 nul/kb
     [java]                RegEx:  526 542 493 497 442 447 440 445 443 445 441 439 440 442 441 444 442 444 444 442 :  44.070 mb/s   2.525 ribose:regex
     [java]    LinuxKernelStrict:  690 292 207 228 232 209 204 206 204 204 202 206 202 204 204 212 203 205 204 205 :  95.138 mb/s  45.875 nul/kb
     [java]                RegEx:  520 537 506 501 501 500 501 501 501 501 501 500 500 501 502 500 501 505 501 513 :  38.763 mb/s   2.454 ribose:regex
     [exec]   23601  103261 1522902 verbosegc.vgc
     [java]           Tintervals:   35  22  22  16  21  22  21   9   9  11   9  10   9   8   9   9   8   8   9   9 : 165.040 mb/s  23.092 nul/kb
     [java]                RegEx:   24  18  18  20  22  10   8   9   6   5   5   4   5   4   4   5   4   4   5   4 : 330.080 mb/s   0.500 ribose:regex
     [java]           Sintervals:   35  21  20  14  20  20  21   8   8   8   8   9   8   8   8   9   8   8   8   8 : 177.116 mb/s  23.092 nul/kb
     [java]                RegEx:   19  13  13   7   5   5   6   6   5   5   4   4   4   4   4   4   4   4   4   4 : 363.088 mb/s   0.488 ribose:regex
     [echo] Ribose (output equivalence tests):
     [exec]  523  523 4303 build/patterns/test/Tintervals.jrte.out
     [exec]  523  523 4303 build/patterns/test/Sintervals.jrte.out
     [exec]  523  523 4303 build/patterns/test/Tintervals.regex.out
     [echo] ^^^ Identical
     [exec]   70279  210837 7683656 build/patterns/test/LinuxKernel.regex.out
     [exec]   70279  210837 7683656 build/patterns/test/LinuxKernel.jrte.out
     [exec]   70279  210837 7683656 build/patterns/test/LinuxKernelLoose.jrte.out
     [exec]   70279  210837 7683656 build/patterns/test/LinuxKernelStrict.jrte.out
     [echo] ^^^ Identical
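The summary throughput figures can be cross-checked from the run times. The sketch below assumes that "mb" means 2^20 bytes, that the input size is the byte count in the third column of the corresponding [exec] line, and that the average is taken over the last 10 of the 20 runs; under those assumptions the LinuxKernelNil and first RegEx rows come out within rounding of the reported 210.765 and 42.680 mb/s, and their ratio within rounding of 4.938. The class name `ThroughputCheck` and the `mbPerSecond` helper are made up for this check.

```java
// Hypothetical cross-check of the CI summary stats; not part of the test harness.
class ThroughputCheck {
    // Throughput in 2^20-byte units per second, averaged over the last 10 runs.
    static double mbPerSecond(long bytes, int[] runMillis) {
        int n = Math.min(10, runMillis.length);
        double sum = 0;
        for (int i = runMillis.length - n; i < runMillis.length; i++) {
            sum += runMillis[i];
        }
        double seconds = (sum / n) / 1000.0;
        return (bytes / (1024.0 * 1024.0)) / seconds;
    }

    public static void main(String[] args) {
        long kernLogBytes = 20420730L; // byte count reported for kern-10.log
        int[] linuxKernelNil = { 136, 126, 93, 93, 93, 93, 93, 93, 93, 93,
                                 93, 93, 92, 92, 93, 92, 92, 93, 92, 92 };
        int[] regex = { 490, 490, 455, 452, 462, 457, 453, 456, 455, 453,
                        452, 454, 455, 455, 453, 453, 456, 454, 455, 476 };
        double ribose = mbPerSecond(kernLogBytes, linuxKernelNil);
        double re = mbPerSecond(kernLogBytes, regex);
        System.out.printf("%.3f mb/s, %.3f mb/s, %.3f ribose:regex%n", ribose, re, ribose / re);
    }
}
```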

This almost fixes issues #15 and #17; all that remains is to integrate construction of AutomatonCompiler.model from ribose-patterns/*.inr. The AutomatonCompiler.model file included in this commit can be rebuilt byte for byte from artifacts included with this commit. The procedure for this is not yet included in build.xml:

```
mkdir -p build/ribose/automata
cat ribose-patterns/alpha.inr ribose-patterns/Automaton.inr | ginr
etc/sh/compile --target \
  com.characterforming.jrte.engine.AutomatonCompiler \
  build/ribose/automata build/ribose/AutomatonCompiler.model
```

- fixed a bug (missing comma) in alpha.inr
- cleaned up alpha checks in after.inr
- now using AutomatonCompiler target to compile DFAs into model
- remove {Automaton, TransducerCompiler, InrTransition}.java
- using ginr2.1.0d ginr/dev@34b5965 (2.1.0d Repair Alist bug)

Signed-off-by: jrte <jrte.project@gmail.com>

jrte · 2022-04-20T01:47:14Z

This is complete in head branch and will be closed when head is merged into master.

jrte · 2022-04-20T02:41:42Z

Fixed in 1f47d51.

jrte added the epic Long time coming label Feb 23, 2022

jrte mentioned this issue Feb 23, 2022

ReaderInput and StreamInput are broken #6

Closed

jrte mentioned this issue Apr 12, 2022

Implement transducer for saved ginr DFA files #17

Closed

jrte closed this as completed Apr 20, 2022
