Extraneous output when RS is a regex #212

millert · 2023-11-14T17:47:01Z

There seems to be a problem with extraneous output when RS is set to a regex by the user. For the following example:

printf "%% This is the RIPE Database query service.\n\n%% Information related to '194.32.71.0/24AS15562'\n\nroute:\t194.32.71.0/24\n" | \
    awk 'BEGIN{ RS = "route:" ; FS = "\n" }{ print $1 }'

The expected output is:

% This is the RIPE Database query service.
        194.32.71.0/24

This is what gawk produces and what OTA produced prior to the UTF-8 changes. Now, however, we get:

% This is the RIPE Database query service.
ute:    194.32.71.0/24

Note the leading ute:. If I change the example to:

printf "%% This is the RIPE Database query service.\n\nroute:\t194.32.71.0/24\n" | \
    awk 'BEGIN{ RS = "route:" ; FS = "\n" }{ print $1 }'

we get:

% This is the RIPE Database query service.
e:      194.32.71.0/24

The number of extra characters is never more than 4 (i.e. the maximum length in bytes of a UTF-8 encoded rune).
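
For reference, the bound of 4 matches the longest UTF-8 encoding of a single code point. The sketch below shows how the encoded length follows from the lead byte. The helper name utf8_seq_len is hypothetical and is not taken from the awk source.

```c
#include <stddef.h>

/*
 * Hypothetical helper (not from the awk source): return the number of
 * bytes in the UTF-8 sequence that starts with `lead`, or 0 if `lead`
 * is a continuation byte or otherwise cannot start a sequence.
 */
static int utf8_seq_len(unsigned char lead)
{
    if (lead < 0x80)
        return 1;               /* 0xxxxxxx: ASCII */
    if ((lead & 0xE0) == 0xC0)
        return 2;               /* 110xxxxx */
    if ((lead & 0xF0) == 0xE0)
        return 3;               /* 1110xxxx */
    if ((lead & 0xF8) == 0xF0)
        return 4;               /* 11110xxx: longest form */
    return 0;                   /* 10xxxxxx or invalid lead byte */
}
```

Every byte of "route:" is ASCII, so each rune in that separator is one byte long, while the stray prefixes shown above are 2 and 4 bytes.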

The code at the end of fnematch that calls ungetc() looks incorrect (why is it using the length of the last rune?) but I am not sure whether that is related.
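
To make the concern concrete: when a match attempt on a stream fails, every byte read during that attempt has to be returned to the input, not just the bytes of the last rune. The following is a minimal sketch of that idea for a literal separator. It is not the fnematch code; struct reader, match_here and next_record are hypothetical names, and an in-memory buffer stands in for the FILE stream so that "pushing back" is simply resetting a position.

```c
#include <stddef.h>

/* In-memory stand-in for an input stream (hypothetical, illustration only). */
struct reader {
    const char *data;
    size_t len;
    size_t pos;                 /* index of the next unread byte */
};

/*
 * Try to match the literal `pat` at the current position.
 * On success the pattern is consumed and 1 is returned.
 * On failure ALL bytes read during the attempt are given back
 * (not only the last one) and 0 is returned.
 */
static int match_here(struct reader *r, const char *pat)
{
    size_t start = r->pos;
    size_t k;

    for (k = 0; pat[k] != '\0'; k++) {
        if (r->pos >= r->len || r->data[r->pos] != pat[k]) {
            r->pos = start;     /* undo the whole partial match */
            return 0;
        }
        r->pos++;
    }
    return 1;
}

/*
 * Copy the next record (the text before `sep`) into `out` and return
 * its length, or -1 at end of input. `sep` must be non-empty and
 * `outsz` must be at least 1; overlong records are truncated.
 */
static long next_record(struct reader *r, const char *sep,
                        char *out, size_t outsz)
{
    size_t n = 0;

    if (r->pos >= r->len)
        return -1;
    while (r->pos < r->len) {
        if (match_here(r, sep))
            break;              /* separator consumed, record complete */
        if (n + 1 < outsz)
            out[n++] = r->data[r->pos];
        r->pos++;               /* advance by one byte and retry */
    }
    out[n] = '\0';
    return (long)n;
}
```

With the input "ab rox route:\tz\n" and the separator "route:", the partial match on "ro" in "rox" is fully undone, so the first record is "ab rox " and the second is "\tz\n" with no stray separator bytes. If match_here restored fewer bytes than it had read, part of the input would be skipped or misattributed, which is the general class of error suspected here.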

millert · 2023-11-14T21:48:12Z

In comparing the pre-UTF8 code to the current version I noticed that the old code only fetched the next character when j + 1 == k, whereas the current code fetches a new character each time through the loop. The old code always did:

c = (uschar)buf[j];

so c is the current character. In the new code, rune is the next character, which was buf[k] in the old scheme of things. Changing the code to only load the next rune when j + 1 == k avoids the extraneous output but causes a few test failures. So it seems this is not so simple.

arnoldrobbins · 2023-11-15T04:28:01Z

> So it seems this is not so simple.

You have a talent for understatement. :-). Thanks for the report. I will try to find some time to work on this. The original changes took a number of hours of work; I suppose it's not surprising that it's still not 100% correct. We'll get it going, eventually.

mpinjr · 2023-11-15T13:11:42Z

In case it's helpful, I'm looking at this as well.

fnematch was my first contribution to awk (submitted to NetBSD in 2012). And it so happens that in preparation to publish some escape sequence handling fixes that I've had in a private tree for a while now (but which need updating to include \u support), earlier this week I finally took the time to study the new utf8 code.

arnoldrobbins · 2023-11-15T18:33:11Z

@millert @plan9 Please see #213 which I think fixes the problem. It passes the test suite. It does not take #211 into account. Enjoy. :-)

millert · 2023-11-15T18:38:04Z

@arnoldrobbins Thanks, that does fix the issue for me for both the reduced test case and the original input that triggered the bug.

plan9 · 2023-11-20T15:40:26Z

Miguel's rewrite has been accepted.

arnoldrobbins mentioned this issue Nov 15, 2023

Fix fnematch. #213

Closed

mpinjr mentioned this issue Nov 15, 2023

Fix fnematch utf8 support #214

Merged

plan9 closed this as completed Nov 20, 2023
