Enable text transductions involving multibyte character encodings #15
This commit applies a stop-gap workaround for the IInput-related issues described in issue #6. It just applies to `jrte.Jrte.main()` the same logic as `test.FileRunner.main()` -- the input file path is specified as an input argument and the entire file contents are read into RAM. This workaround just obviates inclusion of jrte-HEAD-test.jar in classpath.

Tests have been extended to include runs with FileRunner.main() for benchmarking and Jrte.main() as for regular runs. Output equivalence checks for equivalent output from verbose gc interval times extraction via two different FSTs and one regex.

The `etc/sh/jrte.sh` script demonstrates how to use jrte to transduce a file.

```
java -cp jrte-HEAD.jar com.characterforming.jrte.Jrte \
  [--nil] <transducer-name> <input-filepath> <gearbox-filepath>
```

The `--nil` option presents a `!nil` signal to the transduction before presenting the file contents.

This is all that I intend to do with this issue #6 for now as a good fix will require replacing the entire IInput framework and transducing from raw byte streams; this will be undertaken in connection with issue #15.

Signed-off-by: jrte <jrte.project@gmail.com>
Using |
Commit ec6487b is a start on this. It is not a commit to be built upon because there will be more refactoring of interfaces and classes to come. It can be used to compile text transduction patterns of your own design into a gearbox using the

All text transduction is now done in the UTF-8 encoded domain, without decoding. This obviates decoding 8-bit UTF-8 bytes and widening to 16-bit Unicode chars. Arbitrary (almost) binary data can be included on ribose parameter tapes (enclosed in `

At present there is only one input method (

Many other changes are included in ec6487b and more are coming. The intention is to provide a factory component that will produce gearbox compiler instances (to compile ginr DFAs into a persistent store, the gearbox) and run-time instances (providing a capability to instantiate run-time transductions). A transduction is a binding of a user-defined

Effector implementations are lightweight classes that are expressed by an

The
This is starting to come together. It still needs minor refactoring and revised and improved documentation. Below are CI test results from 2efbb81. The 20x stats are ms run times. All inputs were held in RAM and output was disabled for these benchmarking runs. The summary stats at the end of the 20x stats show throughput

There are more changes coming to get rid of the gearbox metaphor. Jrte (Java Recursive Transduction Engine) is a legacy from a 2011 implementation (my 2nd; this is my 3rd, and the 1st was in the late 80s when I discovered ginr). It'll be all ribose.

There is a new ribose package containing the ribose component interfaces. These provide access to the ribose compiler, which assembles ribose transducers from 3-tape subsequential (single-valued) DFAs compiled by ginr into a runtime model (not a gearbox). The model is bound at compile time to an application-defined target class that the compiler instantiates to discover 1 or more effector classes. The model target instantiates its effectors and presents them to the compiler for parameter validation. Effectors are simple, presenting only a parameterless

Ginr encodes backquoted tokens as mixed UTF-8 byte encodings of 16-bit UNICODE characters and unprintable bytes (represented as \xHH), What you see between [] in effector[

A compiled ribose model includes the transducers assembled from ginr DFAs, a model target, the target's effectors, and their parameter byte[][]s. Additionally, it enumerates some special artifacts that may be referenced in ribose patterns. Tokens of the form

So there are these named transducers and signals and values floating around and the static

This commit picks up ginr 2.1.0c7, which has a new :pr display format and full support for representing text patterns in the UTF-8 encoded domain, and ribose is taking advantage of this to obviate decoding and widening of byte to char.
Ginr is a gem and now that it can handle >64K states and inputs >16 bits wide it is truly an industrial-strength application. Pity this stuff didn't get into mainstream computing early on. Note: the -1s in place of run times in the first RegexTest row just indicate that this was a non-capturing regex. The summary stats are complete as for the other rows.
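To make the mixed token encoding mentioned above concrete (UTF-8 text interleaved with `\xHH` escapes for unprintable bytes), here is a minimal sketch of how such a token could be unpacked into a `byte[]` parameter. It is an illustration only: the class `TokenBytes` and its `decode` method are hypothetical names, not part of ginr or ribose, and the handling of malformed escapes (kept as literal text) is an assumption.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not ginr or ribose code.
final class TokenBytes {
    /** Decode a token: \xHH becomes one raw byte, everything else is UTF-8 encoded. */
    static byte[] decode(String token) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int i = 0;
        while (i < token.length()) {
            boolean escape = token.startsWith("\\x", i)
                && i + 4 <= token.length()
                && Character.digit(token.charAt(i + 2), 16) >= 0
                && Character.digit(token.charAt(i + 3), 16) >= 0;
            if (escape) {
                // Two hex digits name a single byte value 00..FF.
                out.write(Integer.parseInt(token.substring(i + 2, i + 4), 16));
                i += 4;
            } else {
                // Assumption: anything that is not a well-formed escape is plain text.
                int cp = token.codePointAt(i);
                byte[] enc = new String(Character.toChars(cp)).getBytes(StandardCharsets.UTF_8);
                out.write(enc, 0, enc.length);
                i += Character.charCount(cp);
            }
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        byte[] bytes = decode("a\\x00\\xFF\u2140");
        StringBuilder hex = new StringBuilder();
        for (byte b : bytes) {
            hex.append(String.format("%02X ", b & 0xFF));
        }
        System.out.println(hex.toString().trim()); // 61 00 FF E2 85 80
    }
}
```

The result stays in the byte domain throughout, so no widening to `char` is needed once the token has been unpacked.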
This almost fixes issues #15 and #17; all that remains is to integrate construction of AutomatonCompiler.model from ribose-patterns/*.inr. The AutomatonCompiler.model file included in this commit can be rebuilt byte for byte from artifacts included with this commit. The procedure for this is not yet included in build.xml:

```
mkdir -p build/ribose/automata
cat ribose-patterns/alpha.inr ribose-patterns/Automaton.inr | ginr
etc/sh/compile --target \
  com.characterforming.jrte.engine.AutomatonCompiler \
  build/ribose/automata build/ribose/AutomatonCompiler.model
```

- fixed a bug (missing comma) in alpha.inr
- cleaned up alpha checks in after.inr
- now using AutomatonCompiler target to compile DFAs into model
- remove {Automaton, TransducerCompiler, InrTransition}.java
- using ginr2.1.0d ginr/dev@34b5965 (2.1.0d Repair Alist bug)

Signed-off-by: jrte <jrte.project@gmail.com>
This is complete in head branch and will be closed when head is merged into master.

Fixed in 1f47d51.
Ginr (2.1.0c) is improving Unicode support, but jrte is lagging and as it stands now only 7-bit (ASCII) text can be transduced. Jrte is using `char[]` and `CharBuffer` everywhere, which means that multibyte character encodings are decoded to 16-bit code points in jrte `IInput` streams. However, ginr patterns are compiled to raw byte encodings and there is no way to specify a 16-bit code point in a pattern.

Ginr is moving in the right direction, since it obviates the need to fully decode `byte[]` -> `char[]` to present to runtime transductions. To keep up, jrte must be refactored (extensively) to use `byte[]` and `ByteBuffer` to represent input sequences. The prologue will have to be extended to include `utf8 = utf7 + {80..FF}` and utf7-dependent definitions must be extended to include all utf8 bytes (eg `PasteAny = (utf8, paste)*`). Patterns that use non-ASCII characters will have to adapt as well since, for example, `('⅀', paste)` will paste only the last byte of the UTF-8 encoding; `('⅀' @ PasteAny)` would be required instead (this shouldn't be a problem for runs in master or head branches because they never worked with non-ASCII inputs anyway).

This will require an extensive rewrite of many jrte components and will be undertaken on a new branch (raw). Issue #6 will be addressed in raw by deprecating `ITransduction.input(IInput[])` and the whole `IInput` framework and driving transductions from client code with sequential calls to `Transduction.run(byte[] ...)`.

Be patient. This is old code, I'm old, even my dog is old. And Java hurts.
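As a small illustration of why `('⅀', paste)` touches only one byte, the sketch below shows the raw UTF-8 bytes that a byte-domain transduction would see for '⅀' (U+2140), and how a client might slice an input into `byte[]` chunks before sequential `run(byte[] ...)` calls. The class `Utf8ByteDomain` and its helpers are hypothetical names introduced here; nothing in it is jrte API, and the chunk size of 2 is chosen only to show that a multibyte character can straddle a chunk boundary.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical demo class, not part of jrte.
final class Utf8ByteDomain {
    /** The raw UTF-8 bytes for a piece of text. */
    static byte[] utf8(String text) {
        return text.getBytes(StandardCharsets.UTF_8);
    }

    /** Count bytes in 80..FF, the range that utf8 adds on top of utf7. */
    static int highByteCount(byte[] bytes) {
        int n = 0;
        for (byte b : bytes) {
            if ((b & 0x80) != 0) n++;
        }
        return n;
    }

    /** Render bytes as space-separated upper-case hex pairs. */
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    /** Split input into chunks of at most size bytes, ignoring character boundaries. */
    static List<byte[]> chunks(byte[] input, int size) {
        List<byte[]> out = new ArrayList<>();
        for (int from = 0; from < input.length; from += size) {
            out.add(Arrays.copyOfRange(input, from, Math.min(input.length, from + size)));
        }
        return out;
    }

    public static void main(String[] args) {
        byte[] sigma = utf8("\u2140");       // the character '⅀'
        System.out.println(hex(sigma));      // E2 85 80
        for (byte[] chunk : chunks(utf8("a\u2140b"), 2)) {
            System.out.println(hex(chunk));  // 61 E2, then 85 80, then 62
        }
    }
}
```

All three bytes of '⅀' fall in 80..FF, which is why the utf7-based definitions need the `utf8` extension, and a transducer that consumes bytes one at a time is indifferent to where the chunk boundaries fall.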