ReaderInput and StreamInput are broken #6

Closed
jrte opened this issue Apr 10, 2021 · 9 comments
Labels
bug Something isn't working

Comments

@jrte
Owner

jrte commented Apr 10, 2021

These are the available BaseInput subclasses that wrap stdin for simple text transduction processes. Neither of them works unless the entire input is preloaded into a single buffer.

Input from stdin is limited to <4k, and larger files must be preloaded as is done in jrte.test.com.characterforming.jrte.test.FileRunner.

@jrte
Owner Author

jrte commented Apr 12, 2021

To work around this and test patterns against file input, please use jrte.test.com.characterforming.jrte.test.FileRunner as shown in the example for "This is a transducer". It will be a few weeks before I can attend to this.

@jrte
Owner Author

jrte commented May 3, 2021

The core issue is the entanglement of IInput.mark/reset with the core mark/reset effectors used to allow look-ahead in classification contexts. Effector implementations should be transparent to IInput and should not use artifacts of the IInput implementation at runtime. Mark/reset support will have to be built into the core Transduction using existing constructs, e.g. mark/reset[~selection].

Going forward, I am going to try accepting input from byte streams only and providing UTF-8 decoder support in transducers that can be composed directly with UTF-8 input patterns using standard XML entity names, e.g. &quot;. Decoding will then be worked into each input pattern at the design stage. This will obviate the runtime overhead associated with UTF-8 support in Java's InputStreamReader and simplify the IInput implementation.

These changes will take some time, so in the meantime I ask for patient use of the workaround described in the previous comment.

@jrte jrte changed the title from "InputStreamReader is broken" to "ReaderInput and StreamInput are broken" Jan 8, 2022
@jrte jrte added the wontfix This will not be worked on label Feb 23, 2022
jrte added a commit that referenced this issue Feb 23, 2022
This commit applies a stop-gap workaround for the IInput-related issues
described in issue #6. It just applies to `jrte.Jrte.main()` the same
logic as `test.FileRunner.main()` -- the input file path is specified as
an input argument and the entire file contents are read into RAM. This
workaround just obviates inclusion of jrte-HEAD-test.jar in classpath.

Tests have been extended to include runs with FileRunner.main() for
benchmarking and Jrte.main() as for regular runs. Output equivalence
checks for equivalent output from verbose gc interval times extraction
via two different FSTs and one regex.

The `etc/sh/jrte.sh` script demonstrates how to use jrte to transduce a
file.
```
java -cp jrte-HEAD.jar com.characterforming.jrte.Jrte \
[--nil] <transducer-name> <input-filepath> <gearbox-filepath>
```
The `--nil` option presents a `!nil` signal to the transduction before
presenting the file contents.

This is all that I intend to do with this issue #6 for now as a good fix
will require replacing the entire IInput framework and transducing from
raw byte streams; this will be undertaken in connection with issue #15.

Signed-off-by: jrte <jrte.project@gmail.com>
@jrte
Owner Author

jrte commented Feb 23, 2022

Not fixing this, but see workaround in commit message for bd49365 and comments in issue #15. Will close when head is merged into raw.

@jrte
Owner Author

jrte commented Mar 27, 2022

Please use BytesInput. Available in HEAD branch only, for now.

@jrte jrte added bug Something isn't working and removed wontfix This will not be worked on labels Apr 26, 2022
@jrte
Owner Author

jrte commented Apr 26, 2022

This is almost fixed in dev@1887933. The main problem with segmented input is fixed in that commit. IInput is gone and the ITransductor.input() method can be used to segment input. Just call input() to push data onto the transductor input stack (LIFO) and call run() until the input stack is consumed; call input() and run() repeatedly to consume more input.
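The input()/run() loop described above can be sketched roughly as follows. This is only a minimal, self-contained illustration: ToyTransductor, its input(), run() and consumed() methods, and SegmentedInputSketch are hypothetical stand-ins invented for this example. They are not the ribose ITransductor API, whose actual signatures are not shown in this thread.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Deque;

/** Hypothetical stand-in for a transductor; NOT the ribose ITransductor API. */
final class ToyTransductor {
    private final Deque<byte[]> inputStack = new ArrayDeque<>(); // LIFO input frames
    private long consumed = 0;

    /** Push a block of data onto the input stack. */
    void input(byte[] block) {
        inputStack.push(block);
    }

    /** Consume the top frame, if any; return true while frames remain on the stack. */
    boolean run() {
        byte[] top = inputStack.poll();
        if (top != null) {
            consumed += top.length; // a real transduction would process the bytes here
        }
        return !inputStack.isEmpty();
    }

    long consumed() {
        return consumed;
    }
}

final class SegmentedInputSketch {
    /** Feed a stream to the toy transductor in segments of at most segmentSize bytes. */
    static long transduce(InputStream in, int segmentSize) {
        ToyTransductor t = new ToyTransductor();
        byte[] buffer = new byte[segmentSize];
        try {
            int n;
            while ((n = in.read(buffer)) > 0) {
                t.input(Arrays.copyOf(buffer, n)); // assumption: fresh copy per segment
                while (t.run()) {
                    // keep running until the input stack is consumed
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return t.consumed();
    }

    public static void main(String[] args) {
        byte[] data = new byte[10_000];
        System.out.println(transduce(new ByteArrayInputStream(data), 4096));
    }
}
```

Copying each segment before pushing it is a conservative choice for the sketch. It sidesteps the buffer-reuse question entirely at the cost of an extra allocation per segment.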

Mark/reset will not work across input block boundaries for the time being. The same effect can be achieved by copying data that would otherwise be marked into a named value, then pushing the value onto the input stack at the reset point. Mark/reset, if fully implemented, will not copy data but will retain references to blocks with marked data. An ITransductor.hasMark() method will indicate whether or not the transductor has marked data. Any data buffers passed to a marking transductor's run() method will be retained on the input stack until the mark is reset. Callers must be aware that buffers passed to run() while the transductor is maintaining a mark MUST NOT be reused for data, at least until after the transductor stops marking.

@jrte jrte closed this as completed in 5ac8303 May 1, 2022
@jrte
Owner Author

jrte commented May 1, 2022

Closing this. The fix works but with caveats. Mark/reset have no effect when there is >1 non-empty frame on the input stack. Input buffers passed into a marking transduction must not be reused (the transductor holds the original buffers in its mark stack and reuse will overwrite marked data). Marked buffers accumulate until reset() or unmark() is called. Call ITransductor.hasMark() before reusing data buffers for Transductor.input().
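To make the buffer-reuse caveat concrete, here is a small self-contained sketch. MarkingToy and MarkReuseSketch are hypothetical classes written for this illustration; only the name hasMark() is borrowed from the comment above, and nothing here reflects the real ribose implementation. The toy retains a reference to each buffer it receives while marking, so overwriting a buffer in place changes what a later reset sees.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical toy that retains references to input buffers while marking. */
final class MarkingToy {
    private final List<byte[]> marked = new ArrayList<>(); // stands in for the mark stack
    private boolean marking = false;

    void mark() {
        marking = true;
    }

    /** While marking, keep a reference to the caller's buffer (no copy is made). */
    void input(byte[] buffer) {
        if (marking) {
            marked.add(buffer);
        }
    }

    boolean hasMark() {
        return marking;
    }

    /** Return the marked data in arrival order and stop marking. */
    String reset() {
        StringBuilder sb = new StringBuilder();
        for (byte[] b : marked) {
            sb.append(new String(b, StandardCharsets.US_ASCII));
        }
        marked.clear();
        marking = false;
        return sb.toString();
    }
}

final class MarkReuseSketch {
    private static final byte[] NEXT = "xyz".getBytes(StandardCharsets.US_ASCII);

    /** Reuses the buffer while a mark is held: the marked data is overwritten. */
    static String unsafeReuse() {
        MarkingToy t = new MarkingToy();
        byte[] buffer = "abc".getBytes(StandardCharsets.US_ASCII);
        t.mark();
        t.input(buffer);
        System.arraycopy(NEXT, 0, buffer, 0, NEXT.length); // in-place reuse
        return t.reset();
    }

    /** Checks hasMark() first and switches to a fresh buffer instead of reusing. */
    static String safeReuse() {
        MarkingToy t = new MarkingToy();
        byte[] buffer = "abc".getBytes(StandardCharsets.US_ASCII);
        t.mark();
        t.input(buffer);
        if (t.hasMark()) {
            buffer = new byte[NEXT.length]; // leave the marked buffer untouched
        }
        System.arraycopy(NEXT, 0, buffer, 0, NEXT.length);
        return t.reset();
    }

    public static void main(String[] args) {
        System.out.println(unsafeReuse()); // marked data was clobbered
        System.out.println(safeReuse());   // marked data survived
    }
}
```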

Mark/reset are seldom needed and poorly implemented, and will likely be deprecated and removed in the future. A better way would be to paste data that would otherwise be marked into a named value and push the named value to reset.
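The named-value alternative can be illustrated with another toy sketch. NamedValueToy, its paste(), pushValue() and next() methods, and the value name "lookahead" are all hypothetical names chosen for this example and are not taken from ribose. The point it tries to show is that the look-ahead data is copied into a value owned by the transductor, so the caller's buffer can be reused freely before the value is pushed back onto the input stack.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical toy with named values and a LIFO input stack. */
final class NamedValueToy {
    private final Deque<byte[]> inputStack = new ArrayDeque<>();
    private final Map<String, ByteArrayOutputStream> values = new HashMap<>();

    /** Append one byte to the named value (this copies the data). */
    void paste(String name, byte b) {
        values.computeIfAbsent(name, k -> new ByteArrayOutputStream()).write(b);
    }

    /** Push a copy of the named value onto the input stack, if the value exists. */
    void pushValue(String name) {
        ByteArrayOutputStream value = values.get(name);
        if (value != null) {
            inputStack.push(value.toByteArray());
        }
    }

    /** Pop the next input frame, or null if the stack is empty. */
    byte[] next() {
        return inputStack.poll();
    }
}

final class NamedValueSketch {
    static String demo() {
        NamedValueToy t = new NamedValueToy();
        byte[] buffer = "abc".getBytes(StandardCharsets.US_ASCII);
        for (byte b : buffer) {
            t.paste("lookahead", b); // copy instead of mark
        }
        Arrays.fill(buffer, (byte) '?'); // caller is free to reuse its buffer
        t.pushValue("lookahead");        // the "reset": replay the copied data
        return new String(t.next(), StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

The trade-off in this sketch is one copy of the look-ahead bytes in exchange for not having to track buffer ownership across input blocks.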

@jrte
Owner Author

jrte commented May 1, 2022

Reopening this as it is not adequately tested. The cut boundaries imposed by ITransductor.limit() do not simulate marking at the limit of a physical buffer.

@jrte jrte reopened this May 1, 2022
jrte added a commit that referenced this issue May 2, 2022
These were overlooked when replaced by TCompile.{map,model} and the
build broke when they were removed. Some mending followed from this.

Mark/reset was reworked but still needs more testing so I reopened
issue #6. I will rework FileRunner to allow segmented input for ribose
but will still need to load entire input into heap (or maybe a direct
buffer off-heap) for regex runs.

BaseTarget.{map,model} were replaced by TRun.{map,model} in a previous
commit. These are generated by running TCompile on ginr automata
compiled from patterns/test (ant -f build.xml ribose).

Signed-off-by: jrte <jrte.project@gmail.com>
@jrte
Owner Author

jrte commented May 12, 2022

This is fixed in dev (78135db), but the CI build with javadoc inclusion shows javadoc errors that are not present in local builds. Leaving this open until that is cleared up and dev is merged into master.

@jrte jrte closed this as completed in 78135db May 13, 2022
@jrte
Owner Author

jrte commented May 13, 2022

Fixed and merged into master with javadoc cleanup (f2f9039).
