Extraneous output when RS is a regex #212

Closed
millert opened this issue Nov 14, 2023 · 6 comments

Comments

@millert
Contributor

millert commented Nov 14, 2023

There seems to be a problem with extraneous output when RS is set to a regex by the user. For the following example:

printf "%% This is the RIPE Database query service.\n\n%% Information related to '194.32.71.0/24AS15562'\n\nroute:\t194.32.71.0/24\n" | \
    awk 'BEGIN{ RS = "route:" ; FS = "\n" }{ print $1 }'

The expected output is:

% This is the RIPE Database query service.
        194.32.71.0/24

This is what gawk produces and what OTA produced prior to the UTF-8 changes. Now, however, we get:

% This is the RIPE Database query service.
ute:    194.32.71.0/24

Note the leading "ute:". If we change the example to:

printf "%% This is the RIPE Database query service.\n\nroute:\t194.32.71.0/24\n" | \
    awk 'BEGIN{ RS = "route:" ; FS = "\n" }{ print $1 }'

we get:

% This is the RIPE Database query service.
e:      194.32.71.0/24

The number of extra characters is never more than 4 (i.e. the maximum length in bytes of a UTF-8 encoded rune).
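
For readers wondering where the bound of 4 comes from: a UTF-8 sequence is between 1 and 4 bytes long, and its length can be read off the lead byte. The following is only an illustrative sketch; first_rune_len is a hypothetical helper, not a function from the awk source, and it does not validate continuation bytes.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper: length in bytes of the UTF-8 sequence that starts
 * at s, judged from the lead byte only. Returns 0 if the first byte cannot
 * start a sequence (e.g. a continuation byte). Continuation bytes are not
 * checked, and overlong or out-of-range forms are not rejected.
 */
static size_t first_rune_len(const char *s)
{
    assert(s != NULL);
    unsigned char lead = (unsigned char)s[0];

    if (lead < 0x80)
        return 1;               /* 0xxxxxxx: ASCII */
    if ((lead & 0xE0) == 0xC0)
        return 2;               /* 110xxxxx */
    if ((lead & 0xF0) == 0xE0)
        return 3;               /* 1110xxxx */
    if ((lead & 0xF8) == 0xF0)
        return 4;               /* 11110xxx */
    return 0;                   /* continuation byte or invalid lead */
}
```

Every byte of "route:" is ASCII, so each rune there is 1 byte long; the 4-byte case only arises for code points above U+FFFF.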

The code at the end of fnematch that calls ungetc() looks incorrect (why is it using the length of the last rune?) but I am not sure whether that is related.
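
As a point of comparison, pushback code generally needs to return exactly the bytes that were read past the end of the match, no more and no fewer. The sketch below is not the fnematch code; skip_separator_run is a hypothetical function that matches a run of one separator byte, which forces exactly one byte of lookahead. It stays within the single byte of pushback that standard C guarantees for ungetc().

```c
#include <assert.h>
#include <stdio.h>

/*
 * Hypothetical sketch: consume a run of the byte sepch from f and return
 * how many were matched. Detecting the end of the run requires reading one
 * byte too many, so that single lookahead byte is pushed back. The amount
 * pushed back equals the overshoot, independent of any rune length.
 */
static size_t skip_separator_run(FILE *f, int sepch)
{
    assert(f != NULL);
    assert(sepch != EOF);

    size_t n = 0;
    int c;

    while ((c = getc(f)) == sepch)
        n++;
    if (c != EOF)
        ungetc(c, f);   /* return the one byte read past the match */
    return n;
}
```

On input "a:::b", after reading the 'a', a call with ':' consumes the three colons, and the next getc() still yields 'b'.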

@millert
Contributor Author

millert commented Nov 14, 2023

In comparing the pre-UTF8 code to the current version I noticed that the next character is only fetched when j + 1 == k whereas the current code fetches a new character each time through the loop. The old code always did:

c = (uschar)buf[j];

so c is the current character. In the new code, rune is the next character, which was buf[k] in the old scheme of things. Changing the code to only load the next rune when j + 1 == k avoids the extraneous output but causes a few test failures. So it seems this is not so simple.
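
To make the "fetch only when the scan reaches the end of the buffer" idea concrete, here is a minimal sketch. It is not the fnematch code: it matches a literal string instead of running a DFA, works on bytes rather than runes, and reads from an in-memory string instead of a FILE. The names scanner, byte_at and find_sep are all hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define LOOKAHEAD_MAX 256

/* Hypothetical scanner state: an in-memory "stream" plus a lookahead buffer. */
struct scanner {
    const char *src;            /* simulated input stream */
    size_t srclen;              /* total bytes in src */
    size_t srcpos;              /* bytes fetched from src so far */
    char buf[LOOKAHEAD_MAX];    /* bytes fetched but not yet consumed */
    size_t k;                   /* number of valid bytes in buf */
};

/* Return buf[j], fetching from the stream only if j is not buffered yet. */
static int byte_at(struct scanner *sc, size_t j)
{
    assert(sc != NULL);
    while (j >= sc->k) {
        if (sc->srcpos >= sc->srclen || sc->k >= LOOKAHEAD_MAX)
            return -1;          /* end of input or buffer full */
        sc->buf[sc->k++] = sc->src[sc->srcpos++];
    }
    return (unsigned char)sc->buf[j];
}

/*
 * Find the first occurrence of the literal sep. Returns its offset in buf,
 * or -1 if there is none. Each attempt restarts at i, so bytes between i
 * and k are re-read from buf rather than fetched from the stream again.
 */
static long find_sep(struct scanner *sc, const char *sep)
{
    size_t seplen = strlen(sep);

    for (size_t i = 0; byte_at(sc, i) != -1; i++) {
        size_t j = 0;
        while (j < seplen && byte_at(sc, i + j) == (unsigned char)sep[j])
            j++;
        if (j == seplen)
            return (long)i;
    }
    return -1;
}
```

With the input "roroute:X" and the separator "route:", the failed attempt at offset 0 buffers three bytes, the later attempts reuse them, and the match at offset 2 leaves the stream positioned just after the colon with 'X' still unread.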

@arnoldrobbins
Collaborator

> So it seems this is not so simple.

You have a talent for understatement. :-). Thanks for the report. I will try to find some time to work on this. The original changes took a number of hours of work; I suppose it's not surprising that it's still not 100% correct. We'll get it going, eventually.

@mpinjr
Contributor

mpinjr commented Nov 15, 2023

In case it's helpful, I'm looking at this as well.

fnematch was my first contribution to awk (submitted to NetBSD in 2012). And it so happens that in preparation to publish some escape sequence handling fixes that I've had in a private tree for a while now (but which need updating to include \u support), earlier this week I finally took the time to study the new utf8 code.

@arnoldrobbins
Collaborator

@millert @plan9 Please see #213 which I think fixes the problem. It passes the test suite. It does not take #211 into account. Enjoy. :-)

@millert
Contributor Author

millert commented Nov 15, 2023

@arnoldrobbins Thanks, that does fix the issue for me for both the reduced test case and the original input that triggered the bug.

@plan9
Collaborator

plan9 commented Nov 20, 2023

miguel's rewrite has been accepted.

@plan9 plan9 closed this as completed Nov 20, 2023