ReaderInput and StreamInput are broken #6

jrte · 2021-04-10T14:44:31Z

These are the available BaseInput subclasses that wrap stdin for simple text transduction processes. Neither works unless the entire input is preloaded into a single buffer.

Input from stdin is limited to <4k; larger files must be preloaded, as is done in jrte.test.com.characterforming.jrte.test.FileRunner.
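For readers unfamiliar with what "preloaded into a single buffer" means in practice, here is a minimal sketch in plain Java (9 or later). It is not the FileRunner code; `PreloadInput` and `preload` are illustrative names only. It reads a stream to EOF into one byte array rather than relying on a single `read()` call, which may return only part of the input.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

public class PreloadInput {
    // Read the stream to EOF and return everything as one contiguous buffer.
    // Hypothetical helper for illustration; not part of jrte.
    static byte[] preload(InputStream in) {
        try {
            return in.readAllBytes(); // java.io.InputStream, Java 9+
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] data = preload(System.in);
        System.err.println(data.length + " bytes preloaded");
    }
}
```

The cost is that the whole input lives in the heap at once, which is the limitation this issue is about.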

jrte · 2021-04-12T13:59:32Z

To work around this and test patterns against file input, please use jrte.test.com.characterforming.jrte.test.FileRunner, as shown in the example for This is a transducer. It will be a few weeks before I can attend to this.

jrte · 2021-05-03T12:26:00Z

The core issue is the entanglement of IInput.mark/reset with the core mark/reset effectors used to allow look-ahead in classification contexts. Effector implementation should be transparent to IInput, and should not use artifacts from the IInput implementation at runtime. Mark/reset support will have to be built into the core Transduction using existing constructs, e.g. mark/reset[~selection].

Going forward I am going to try reducing input to byte streams only and provide UTF-8 decoder support in transducers that can be composed directly with UTF-8 input patterns using standard XML entity names, e.g. `&quot;`. Decoding will then be worked into each input pattern at the design stage. This will obviate the runtime overhead associated with UTF-8 support in Java's InputStreamReader and simplify the IInput implementation.

These changes will take some time and I'll ask for patient use of the workaround prescribed in the previous comment in the meantime.

This commit applies a stop-gap workaround for the IInput-related issues described in issue #6. It just applies to `jrte.Jrte.main()` the same logic as `test.FileRunner.main()` -- the input file path is specified as an input argument and the entire file contents are read into RAM. This workaround just obviates inclusion of jrte-HEAD-test.jar in classpath.

Tests have been extended to include runs with FileRunner.main() for benchmarking and Jrte.main() as for regular runs. Output equivalence checks for equivalent output from verbose gc interval times extraction via two different FSTs and one regex.

The `etc/sh/jrte.sh` script demonstrates how to use jrte to transduce a file.

```
java -cp jrte-HEAD.jar com.characterforming.jrte.Jrte \
    [--nil] <transducer-name> <input-filepath> <gearbox-filepath>
```

The `--nil` option presents a `!nil` signal to the transduction before presenting the file contents.

This is all that I intend to do with this issue #6 for now as a good fix will require replacing the entire IInput framework and transducing from raw byte streams; this will be undertaken in connection with issue #15.

Signed-off-by: jrte <jrte.project@gmail.com>

jrte · 2022-02-23T13:41:29Z

Not fixing this, but see workaround in commit message for bd49365 and comments in issue #15. Will close when head is merged into raw.

jrte · 2022-03-27T11:15:50Z

Please use BytesInput. Available in HEAD branch only, for now.

jrte · 2022-04-26T12:24:14Z

This is almost fixed in dev@1887933. The main problem with segmented input is fixed in that commit. IInput is gone and the ITransductor.input() method can be used to segment input. Call input() to push data onto the transductor input stack (LIFO) and call run() until the input stack is consumed; repeat input() and run() to consume more input.
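The input()/run() cycle described above can be sketched with a toy stand-in. The sketch below does not use the ribose API: `ToyTransductor`, its method signatures, and the upper-casing "transduction" are all assumptions made up for illustration. Only the shape of the loop (push a block, run until the stack is consumed, repeat) follows the comment.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.util.ArrayDeque;
import java.util.Deque;

public class SegmentedInputSketch {
    /** Hypothetical stand-in, NOT the real ITransductor. */
    static class ToyTransductor {
        private static final class Frame {
            final byte[] data; final int limit; int position;
            Frame(byte[] data, int limit) { this.data = data; this.limit = limit; }
        }
        private final Deque<Frame> stack = new ArrayDeque<>(); // LIFO input stack
        private final ByteArrayOutputStream out = new ByteArrayOutputStream();

        // Push one block of input onto the stack (signature is assumed).
        void input(byte[] data, int limit) { stack.push(new Frame(data, limit)); }

        boolean hasInput() { return !stack.isEmpty(); }

        // Consume the top frame; the toy "transduction" upper-cases ASCII letters.
        void run() {
            Frame f = stack.peek();
            if (f == null) return;
            while (f.position < f.limit) {
                int b = f.data[f.position++];
                out.write(b >= 'a' && b <= 'z' ? b - 32 : b);
            }
            stack.pop();
        }

        byte[] output() { return out.toByteArray(); }
    }

    // Feed a stream to the toy transductor in blocks of blockSize bytes.
    static byte[] transduce(InputStream in, int blockSize) {
        ToyTransductor t = new ToyTransductor();
        byte[] block = new byte[blockSize];
        try {
            int n;
            while ((n = in.read(block)) != -1) {
                t.input(block, n);              // push the segment
                while (t.hasInput()) t.run();   // run until the stack is consumed
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return t.output();
    }
}
```

Reusing `block` is safe here only because the toy never marks input and fully consumes each block before the next read.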

Mark/reset will not work across input block boundaries for the time being. The same effect can be achieved by copying data that would otherwise be marked into a named value, then pushing the value onto the input stack at the reset point. Mark/reset, if fully implemented, will not copy data but will retain references to blocks with marked data. An ITransductor.hasMark() method will indicate whether or not the transductor has marked data. Any data buffers passed to a marking transductor's run() method will be retained on the input stack until the mark is reset. Callers must be aware that buffers passed to run() while the transductor is maintaining a mark MUST NOT be reused for data, at least until after the transductor stops marking.

jrte · 2022-05-01T03:51:04Z

Closing this. The fix works but with caveats. Mark/reset have no effect when there is >1 non-empty frame on the input stack. Input buffers passed into a marking transduction must not be reused (the transductor holds the original buffers in its mark stack and reuse will overwrite marked data). Marked buffers accumulate until reset() or unmark() is called. Call ITransductor.hasMark() before reusing data buffers for Transductor.input().
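The buffer-reuse caveat might translate into caller code along these lines. This is a sketch under assumptions: `MarkAware` is a hypothetical one-method view standing in for the hasMark() check named above, and `nextBuffer` is an illustrative helper, not part of ribose.

```java
public class BufferReuseSketch {
    /** Hypothetical one-method view of a transductor; only hasMark() is modeled. */
    interface MarkAware {
        boolean hasMark();
    }

    // Return a buffer that is safe to fill with the next block of input:
    // a fresh one while a mark is held, otherwise the previous one.
    static byte[] nextBuffer(MarkAware transductor, byte[] previous) {
        return transductor.hasMark() ? new byte[previous.length] : previous;
    }
}
```

A caller would invoke `nextBuffer` before each read, so buffers still referenced by the mark stack are never overwritten.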

Mark/reset are seldom needed and poorly implemented, and will likely be deprecated and removed in the future. A better way would be to paste data that would otherwise be marked into a named value and push the named value onto the input stack to reset.
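The paste-and-push alternative relies on the input stack being LIFO: a value pushed at the reset point is read before whatever input remains beneath it. Here is a toy illustration using plain strings; `PasteAndPushSketch` and `replay` are made-up names, and nothing here is the ribose implementation.

```java
import java.util.ArrayDeque;
import java.util.Deque;

public class PasteAndPushSketch {
    // Look ahead over the first `lookahead` characters, pasting them into a
    // named value, then "reset" by pushing that value back onto the input stack.
    // Returns the input as it would be seen after the reset.
    static String replay(String input, int lookahead) {
        Deque<String> inputStack = new ArrayDeque<>();
        StringBuilder namedValue = new StringBuilder();

        int n = Math.min(lookahead, input.length());
        namedValue.append(input, 0, n);           // paste look-ahead data
        inputStack.push(input.substring(n));      // unread remainder stays below
        inputStack.push(namedValue.toString());   // pushed value is read first

        StringBuilder seen = new StringBuilder();
        while (!inputStack.isEmpty()) {
            seen.append(inputStack.pop());
        }
        return seen.toString();
    }
}
```

Because the pasted data is a copy, the original input buffers can be reused freely, unlike with a held mark.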

jrte · 2022-05-01T23:37:54Z

Reopening this as it is not adequately tested. The cut boundaries imposed by ITransductor.limit() do not simulate marking at limit of physical buffer.

These were overlooked when replaced by TCompile.{map,model} and the build broke when they were removed. Some mending followed from this. Mark/reset was reworked but still needs more testing so I reopened issue #6. I will rework FileRunner to allow segmented input for ribose but will still need to load entire input into heap (or maybe a direct buffer off-heap) for regex runs.

BaseTarget.{map,model} were replaced by TRun.{map,model} in a previous commit. These are generated by running TCompile on ginr automata compiled from patterns/test (ant -f build.xml ribose).

Signed-off-by: jrte <jrte.project@gmail.com>

jrte · 2022-05-12T18:00:39Z

This is fixed in dev (78135db) but CI build with javadoc inclusion shows javadoc errors not present in local builds. Leaving this open until that is cleared up and dev merged into master.

jrte · 2022-05-13T05:14:47Z

Fixed and merged into master with javadoc cleanup (f2f9039).

jrte changed the title ~~InputStreamReader is broken~~ ReaderInput and StreamInput are broken Jan 8, 2022

jrte mentioned this issue Feb 23, 2022

Enable text transductions involving multibyte character encodings #15

Closed

jrte added the wontfix This will not be worked on label Feb 23, 2022

jrte added bug Something isn't working and removed wontfix This will not be worked on labels Apr 26, 2022

jrte closed this as completed in 5ac8303 May 1, 2022

jrte reopened this May 1, 2022

jrte closed this as completed in 78135db May 13, 2022
