Extraneous output when RS is a regex #212
Comments
In comparing the pre-UTF8 code to the current version I noticed that the next character is only fetched when
so
You have a talent for understatement. :-). Thanks for the report. I will try to find some time to work on this. The original changes took a number of hours of work; I suppose it's not surprising that it's still not 100% correct. We'll get it going, eventually.
In case it's helpful, I'm looking at this as well. fnematch was my first contribution to awk (submitted to NetBSD in 2012). And it so happens that in preparation to publish some escape sequence handling fixes that I've had in a private tree for a while now (but which need updating to include \u support), earlier this week I finally took the time to study the new utf8 code.
@arnoldrobbins Thanks, that does fix the issue for me for both the reduced test case and the original input that triggered the bug.
miguel's rewrite has been accepted.
There seems to be a problem with extraneous output when RS is set to a regex by the user. For the following example:
The expected output is:
This is what gawk produces and what OTA produced prior to the UTF-8 changes. Now, however, we get:
Note the leading "ute:". If we change the example to:
we get:
The number of extra characters is never more than 4 (i.e. the maximum length in bytes of a UTF-8 encoded rune).
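For reference, the bound of 4 matches UTF-8 itself: the lead byte of a sequence determines its length, and no valid sequence is longer than 4 bytes. Below is a minimal sketch of that rule. `utf8_seq_len` is a hypothetical helper written for illustration; it is not taken from the awk sources and says nothing about how fnematch actually decodes its input.

```c
#include <stddef.h>

/*
 * Hypothetical helper, not from the awk sources: return the number of
 * bytes in the UTF-8 sequence that starts with lead byte c, or 0 if c
 * cannot start a sequence (continuation byte or invalid lead byte).
 */
static size_t
utf8_seq_len(unsigned char c)
{
	if (c < 0x80)
		return 1;	/* 0xxxxxxx: ASCII */
	if (c >= 0xC2 && c <= 0xDF)
		return 2;	/* 110xxxxx; 0xC0 and 0xC1 would be overlong */
	if (c >= 0xE0 && c <= 0xEF)
		return 3;	/* 1110xxxx */
	if (c >= 0xF0 && c <= 0xF4)
		return 4;	/* 11110xxx, capped at U+10FFFF */
	return 0;		/* continuation byte or invalid lead byte */
}
```

For instance, `utf8_seq_len(0xC3)` is 2 (the lead byte of "é"), and no input yields more than 4, which is consistent with the observed upper limit on the extra output.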
The code at the end of fnematch that calls ungetc() looks incorrect (why is it using the length of the last rune?), but I am not sure whether that is related.