Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bug 324-1 #617

Merged
merged 7 commits into from
May 17, 2023
Merged

Bug 324-1 #617

merged 7 commits into from
May 17, 2023

Conversation

phidebian
Copy link

This is a patch candidate for ksh93 Issue#324

Phi added 2 commits March 30, 2023 08:08
printf format string can handle sequential argv[] access via % and * as
well as indexed argv[] access via %x$ *x$ (* used for width/prec)

In a given format string it is possible to mix'n'match sequential/indexed
access, and mix'n'match the argv[] types.

'%*2$s' 1 2  Indexed access for 2$ making argv[1] an int, seq access for 's'
    then accessing arv[0] for s, here we mix seq/ndx access.
'%*2$s %s' 1 2 Same as above, the 1st format makes argv[0] a 'string'
    argv[1] an 'int' then the second format make argv[1] a 'string' this
    is a mixed type access for arv[1].

The former algo for printf-format handling builtin (say printf) is to save
the builting argv[] in a data structure called struct printf located in
src/cmd/ksh93/bltins/print.c, as you noted here this is located on a .c
file meaning it is private to the builtin of the ksh.

Yet, all the code to scan a printf format string (not limited to printf,
may be scanf too) is located in lib/libast/sfio/* meaning all sort of
jazz to to setup and use the private struct printf access at sfio level
via function pointers.

This printf data struct has a member called nextarg[] that is initially
setup to the builtin argv[]. It is called nextarg, because on sequential
access the % or * format access *nextarg++ meaning each sequential access
consume an arg. This is a major design flaw, because it can't support
indexed access %x$ *x$ very will, for this on 1st indexed access, a new
sfio level (not builtin) data structure is build called fp[], it rescan
the format string to find out all the indexed access, enter them in this
kind of cache, it record the type/value of the argv[x]. Now another problem
occurs, the type enter is what we see on 1st index 'x' occurence, so if
we got '%2$d %2$f' then for occurence 2 i.e fp[1] we enter type 'd' and
fetch the value nextarg[1], since nextarg may have been moved forward by
sequential access it provide bogus value, then later access to %$2f will
retrieve the cahced value for fp[1] find an 'int' and ksh is unable to
convert it to 'double' and all sort of crash occurs.

This is a complete design flaw, may be due to initial design for sequential
access only (like bash) then piled on a kludge to try an indexed access.

This patch will leave all the sfio infrastruct (functions) untouched,
yet will implement workaround on each flaws, the reasons not to touch the
existing function is that the format scan may be used by scanf() and
I don't fix scanf for now.

The workaroud is based on adding more members in the struct printf data
structure, basicall add argv0 the inital backup of argv[] received by the
builtin, nextarg is left unchanged, the former infrastructure still maintain
i.e advance, it are sequential accesse tie, yet indexed access is the
argv0[x] access with lazy conversion, i.e type conversion on demand, i.e
late in the printf() process.

In short this fix works 'as if' the format string is pre-pocessed,
replacing any non indexed access (i.e seq access) by an explicit computed
index access, the index generated beeing simply the seq index i.e something
similar to.
  +----++------+-++-----------+-- Seq
  |    ||      | ||  ndx ---+ |
  v    vv      v vv         v v
'%s   %*s     %*.*s       %*1$s'   expansed to (ignore spaces)
'%1$s %3$*2$s %6$*4$.*5$s %7$*1$s'
  ^    ^  ^    ^  ^   ^    ^  ^
  |    |  |    |  |   |    |  |
  +----+--+----+--+---+----+--+-- Computed Seq index.
  Note the index inversion due to the syntax of the format string
  printf '%*.*s' width prec data; is equivalent to
  printf '%3$*1$.*2$'  width prec data;

Here I keep all the infrastructure, yet working around bugs
- exend() function src/cmd/ksh93/bltins/print.c do setup type/value for a
  given argp then bump nextarg, to work around I made a new function
  reload()that do the same but save/restore nextarg.
  This because 'may be' extend is used by other code (scanf?)

- sffmtpos() duplicate the format string processing done here sfvprintf(),
  so we implement the same argp, argn processing so that each argv[]
  cached is on its right position.

- sfvprintf() same as above with use nargs/xargs make sure we accees the
  correct argv[] position.
  argv[] vs fp[] type mismatch we do a late argv[] evaluation (uncached).
@hyenias
Copy link

hyenias commented Apr 6, 2023

When I try to grab this PR's diff info, I get the following:

# Successfully acquire diff file
$ wget https://github.com/ksh93/ksh/pull/617.diff
# Change into freshly pulled and clean ksh folder
$ cd ksh
# Receive errors when I merged in the 617.diff
$ git apply ../617.diff
../617.diff:28: trailing whitespace.
		pdata.nextarg=argv;		/* buf324 */	
../617.diff:30: trailing whitespace.
		
../617.diff:47: trailing whitespace.

../617.diff:130: trailing whitespace.
	
../617.diff:138: trailing whitespace.
unalias 'err_''exit'
warning: squelched 24 whitespace errors
warning: 29 lines add whitespace errors.
$
# I guess it worked. When I issue a git diff command, I can see various spacing and alignment errors highlighted in red
$ git diff

@hyenias
Copy link

hyenias commented Apr 6, 2023

Is this the way it is suppose to work now?

$ f=$'%3$s %s %s %s\n' ksh -c 'printf "$f" a b c d e f'
c d e f
$
$ f=$'%3$s %s %s %s\n' ksh617 -c 'printf "$f" a b c d e f'
c a b c
f d e f

@hyenias
Copy link

hyenias commented Apr 6, 2023

Quick speed check on above.

$ cat test.ksh
#!/bin/ksh
integer i
time {
for ((i=500000;i--;)) do; printf "$f" a b c d e f > /dev/null; done
}
$
$ f=$'%3$s %s %s %s\n' ksh test.ksh

real	0m06.123s
user	0m03.723s
sys	0m02.356s
$ f=$'%3$s %s %s %s\n' ksh test.ksh

real	0m06.086s
user	0m03.728s
sys	0m02.316s
$ f=$'%3$s %s %s %s\n' ksh test.ksh

real	0m06.011s
user	0m03.602s
sys	0m02.367s
$
$ f=$'%3$s %s %s %s\n' ksh617 test.ksh

real	0m05.884s
user	0m03.459s
sys	0m02.382s
$ f=$'%3$s %s %s %s\n' ksh617 test.ksh

real	0m05.782s
user	0m03.378s
sys	0m02.362s
$ f=$'%3$s %s %s %s\n' ksh617 test.ksh

real	0m05.706s
user	0m03.290s
sys	0m02.374s
$
# Best run out of 3 using real user time. PR617 is approximately %5 faster for this one test. 
$ echo $((6.011/5.706))
1.05345250613389415

@McDutchie
Copy link

McDutchie commented Apr 6, 2023

Is this the way it is suppose to work now?

$ f=$'%3$s %s %s %s\n' ksh -c 'printf "$f" a b c d e f'
c d e f
$
$ f=$'%3$s %s %s %s\n' ksh617 -c 'printf "$f" a b c d e f'
c a b c
f d e f

I don't know, I still haven't wrapped my head around this whole bug. But I will note the zsh behaviour is identical to the patched ksh's behaviour above.

@phidebian
Copy link
Author

Sorry I didn't spotted the post.

Yes I have a big problem with my work env, and it looks like my editor (emacs autoindent, untabify, C templates, etc..) produce incompatible src, because it get confuse with all this space/tab jazz, and at the moment I don't have a process to reinject my changes properly. Sorry about that I try to work out a way to not pollute your sources.

So far this is a candidate, so may be some points or some behavior got to be adjusted, bugs fixed.

Is this the way it is suppose to work now?

$ f=$'%3$s %s %s %s\n' ksh -c 'printf "$f" a b c d e f'
c d e f
$
$ f=$'%3$s %s %s %s\n' ksh617 -c 'printf "$f" a b c d e f'
c a b c
f d e f

Yes this is what the patch produce, and I think it is correct unless one say otherwise.

The patch implement as per printf(3) docco auto numbering of non indexed argv[] access.
for ref the docco

 Thus,  printf("%*d", width, num); and  printf("%2$*1$d", width, num); are equivalent.

Where we clearly see that non indexed are sequentially numbered, and the fact the % syntax do get value index %x$ before the width/prec *x$ make the inversion of number clear. Then

$ f=$'%3$s %s %s %s\n' ksh -c 'printf "$f" a b c d e f'
is equivalent to
$ f=$'%3$s %1$s %2$s %3$s\n' ksh -c 'printf "$f" a b c d e f'
            ^----^----^ auto numbered

On my side I got

$ f=$'%3$s %s %s %s\n' /d1/ksh-324-1/arch/linux.i386-64/bin/ksh -c 'printf "$f" a b c d e f'
c a b c
f d e f

#                      vvv
$ f=$'%3$s %s %s %s\n' zsh -c 'printf "$f" a b c d e f'
c a b c
f d e f

On this f= test patched ksh behave as zsh (not that I am saying we always have to follow it, I displayed cases where zsh is wrong)

Production ksh logic was more or less forward progress i.e it does something along *argv++ to access that lead to error, beside producing strange things like

TC$ f=$'%2$s %s %s\n' ; printf "$f" a b c          
ksh: printf: a: parameter not set
b c 

not talk about infinite loops or segv.

I guess in fact nobody ever relying of mix'n'match indexing vs non-indexing access because it was too buggy so I gues we will not break anything.

What works with prod ksh is non mix'n'match.

TC$ f=$'%s %s\n' ; printf "$f" a b c d  
a b
c d

TC$ f=$'%1$s %2$s\n' ; printf "$f" a b c d
a b
c d

TC$ f=$'%2$s %1$s\n' ; printf "$f" a b c d
b a
d c

Then the bumpy area

TC$ f=$'%1$s %1$s\n' ; printf "$f" a b c d
a a
b b
c c
d d

# So far so good

TC$ f=$'%2$s %2$s\n' ; printf "$f" a b c d
ksh: printf: a: parameter not set
ksh: printf: c: parameter not set
b b
d d

# Huh ?

Let me know if you think something must be adjusted.

Cheers,

- Fix coding style to match surroudning code
- Remove extraneous spaces/tabs
- Change space indentation to tab indentation
@hyenias
Copy link

hyenias commented Apr 6, 2023

@phidebian Thanks for the clarification. As this is a difficult area with the way printf has been used for so many years, it took many explanations for me to get on the same path as you. I believe I understand now and will start testing more later today.

New/Fixed/Changed printf behavior for ksh:
As it relates to the format string in printf, indexed parameters will no longer have any influence on subsequent sequential parameters as all sequential parameters are explicitly resolved to their corresponding auto-numbered indexed position prior to any variable parameter expansion.

@hyenias
Copy link

hyenias commented Apr 7, 2023

Performance using best out of 3 runs, ksh vs ksh617:

  • f=$'%s\n' yields faster by %1.3, 1.01351605030974282, 5.399/5.327 seconds
  • f=$'%s %s %s\n' yields slower by %0.7, 0.993306559571619813, 5.194/5.229 seconds
  • f=$'%s %s %1$s\n' yields slower by %8.5, 0.924842713824179561, 5.439/5.881 seconds [this is understandable see below]
$ f=$'%s %s %1$s\n' ksh -c 'printf "$f" a b c d e f'
a b c
e f
$ f=$'%s %s %1$s\n' ksh617 -c 'printf "$f" a b c d e f'
a b a
c d c
e f e

Found no errors so far. Will continue testing tomorrow.

@phidebian
Copy link
Author

Regarding perf. I tried not to change the original code logic much, but instead put some band aids around faulty code, while to be honest we should rework this code path, the actual code as 3 or 4 format (don't remember exactly) string scan when at least 1 indexed access exist, while theorically this all things can be done in 1 pass.

My fix try not to upset the code base more than necessary, i.e I reused most of the original code,

  • For non indexed access, like %s %s %s we should see no diff.
  • For clean indexed access, i.e non mixing types like %2$s %1$s %3$s no perf diff

Perf diff occurs when mixing types, something impossible to do with a compiler (C) but doable with an interpreter,

  • %3$s %3$d argv[2] is first seen as a string enter f[2].type=str f[2].value=&str then on 2nd we got a type mismatch, we went a int the cache as an string, we go convert and pay.
  • %3$d %3$s simetrical of the above
  • %3$*3$s vicious one a mix with only 1 % , the 3$ is both wanted as a string and number
  • %1$d %s same as %3$s %3$d but obfuscated (missing an explicit 1$)
  • %s %1$d symetrical of the above....

Because I use the std method for conversion this fix a relativly safe, I didn't entered all the possible in the QA, I have a QA script generator, I should test all the possible type mix double/int double/str and the infamous %()T I didn't check that at all. may be it a segv.

I give it a more thorough QA when I get back, but for the main use we should be good enough.

Cheers,

@hyenias
Copy link

hyenias commented Apr 8, 2023

For clean indexed access: %2$s %1$s %3$s: 5.542/5.732s, 0.966852756454989532 = 4.3% slower
For non indexed access: %s %s %s: 5.179/5.227s, 0.990816912186722785 = 0.9% slower

@hyenias
Copy link

hyenias commented Apr 8, 2023

Hand testing completed. I could not break this new preprocessing auto-numbering string formatting approach to resolve the mix'n'match issues. However, the %s type field output is still not producing correct results when proceeded by a %b type field.

Original 2nd reproducer of #324 is still not working.

# Original 1st reproducer, excepted results: \\\\ \\
$ printf '%1$s %1$b\n' '\\\\'
\\\\ \\
# Original 2nd reproducer, excepted results: \\ \\\\
$ printf '%b %1$s\n' '\\\\'
\\ \\

Additionally, hard coding all the parameters did not help. So now, the parameter indexing issue is resolved but the %b special processing appears to be applied to %s. See printf --man within ksh for details where it reads, \\ Backslash character.

$ printf '%1$b %1$s\n' '\\\\' # should be \\ \\\\
\\ \\

@phidebian
Copy link
Author

phidebian commented Apr 8, 2023

Hum I thought I looked this one, but I may have been distracted by the other problems a lost sight, I look again...
Thanx @hyenias for thorough testing.

EDIT: The BE is broken this morning, I submitted #631, meanwhile I'll debug this on prev version.

@phidebian
Copy link
Author

phidebian commented Apr 8, 2023

Hurray, BE's back, so I cutted the 2nd round for the patch candidate.
Here is the branch on my fork, https://github.com/phidebian/ksh/tree/bug-324-2, I am not pushing a PR, until this is reviewed and accepted.

As I envisioned, I lost sigth of the initial request because the initial request of the kind %1$b %1$s where triggering all the mix'n'match bugs. One side ot the bug was type mismatch, something not possible with compilers like C, but indeed possible with an interpreter.

The fix kept the initial logic to build the fp[] array (kinda argv[] cache for type/value), because I didn't wanted to change the original code too much, but then when a type mismatch occurs like %1$*1$s the arv[0] is both an integer and a string, if the cache hold int/1 then initial code was using str/&value will got 1 as pointer value and kaboom, I then implemented a late conversion. It all works fine like

$ printf  '%1$s %1$d %1$x %1$o %1$e %1$g\n' 1.2                            
1.2 1 1 1 1.200000e+00 1.2

Here 1$ is many types, there are all late converted.

And the the infamous %b, this one is internally handled as a string, the format is forced as %s, yet to keep the pseudo type information, one stuff this into the base, base=0 mean string and base=-1 means string with %b.

My test regarding type mismatch forgot to look the base, now I look at it and it is fixed.

$ printf  '%1$s %1$b\n'             '\\\\'     
\\\\ \\

$ printf  '%1$b %1$s\n'             '\\\\'
\\ \\\\

For the record, we are almost on pare with ZSH except

'%*2$s %s\n'   per printf(3) docco is equivalent to '%1$*2$ %2$\n'
                                                      ^------^-----Auto-numbered
KSH
$ printf '%*2$s %s\n'  1 2 3 4 5 6 7 8 9 0                 
 1 2            # 1$ autonum is printed with a width *2$ explicit and 2$ autonum is print sans width
   3 4          # At end of fmt string 2 args are consumed, we loop on next one, i.e 3 4
     5 6
       7 8

ZSH
$ printf '%*2$s %s\n'  1 2 3 4 5 6 7 8 9 0
 3 4            # No reasons to use 3 for first non indexed  argv, it don't follow printf(3), zsh make something like *2$  
     7 8        # move away next non indexed

To better illustrate this goof
$ printf '%*3$.*2$s %s\n'  5 4 3 2 1  
printf: 3: argument specifier out of range
  3 2

$ printf '%*2$.*3$s %s\n'  5 4 3 2 1

   2 1

Other diff with ZSH

KSH
 printf '%3$*2$.*1$s %2$s %1$s\n'     1 2 3 4 5 6 7 8 9 0
 3 2 1
    6 5 4
       9 8 7
  0

ZSH
$ printf '%3$*2$.*1$s %2$s %1$s\n'     1 2 3 4 5 6 7 8 9 0

printf: 3: argument specifier out of range
 3 2 1
    6 5 4
       9 8 7

Here all the indexes are explicit, and yet ZSH barf on incomplete list.

@hyenias
Copy link

hyenias commented Apr 9, 2023

Alrighty. You have pointed out more wrong stuff with ZSH as compared what you have coded for ksh's new printf behavior. So far, what you have done all makes sense to me.

As I do not want to spend another 6 hours or so revalidating another set of code in your adjusted bug-324-2, I went through this following procedure to find the code changes you did in order to correct the last outstanding issue with %b as PR branches are made to track the flow of changes.

@hyenias procedure followed for work on an existing PR needing edits. This has the result of creating a new branch named pr_617 on my local machine so that I can hand edit the changes found. The author that created the PR would already have such a branch and may need to perform git pull to sync it up with the ksh93/master/617 branch as @McDutchie adjusted your code to the repo's standards already. I do not want to trash his work on reformatting your code nor do I want to start from scratch and test everything again.

$ git fetch origin pull/617/head:pr_617
$ git checkout pr_617

Then, I opened your bug-324-2 branch along with the ksh93/617 branch and stared-and-compared. For the sake of my time as I am not the author that is more familiar with his own code, I skipped comments and tests just looked for code impacting changes.

Here is what I found:

diff --git a/src/lib/libast/sfio/sfvprintf.c b/src/lib/libast/sfio/sfvprintf.c
index 9454c936..92855b30 100644
--- a/src/lib/libast/sfio/sfvprintf.c
+++ b/src/lib/libast/sfio/sfvprintf.c
@@ -319,9 +319,9 @@ loop_fmt :
                                                {       if(!fp &&
                                                           !(fp = (*_Sffmtposf)(f,oform,oargs,ft,0)) )
                                                                goto pop_fmt;
+                                                       if(n > xargs)
+                                                               xargs = n;
                                                }
-                                               if(n > xargs)
-                                                       xargs = n;
                                                else
                                                        n=++nargs;

@@ -600,9 +600,8 @@ loop_fmt :
                        argp = ++nargs;
                if(fp)
                {       if(ft && ft->extf)
-                       {       if(fmt == fp[argp].ft.fmt)
-                               {       if(fp[argp].ft.fmt != fp[argp].fmt)
-                                               fmt = fp[argp].ft.fmt;
+                       {       if(fmt == fp[argp].ft.fmt && fp[argp].ft.fmt == fp[argp].fmt)
+                               {
                                        argv = fp[argp].argv;
                                        size = fp[argp].ft.size;
                                }

I spot checked and it all seems to work for me. Does the above diff cover all of your changes from @McDutchie last commit 366770d on this PR?

I would recommend updating your local copy of this PR #617 then making all the necessary changes to the code base then push back up to this PR.

@phidebian Thank you for all the work you have done on this PR to resolve issue #324.

@phidebian
Copy link
Author

I am not too familiar with the code review process you follow, I should read it... meanwhile WATCHOUT, the thread here is called bug-324-1, may be because I did a PR, while my final last patch candidate is https://github.com/phidebian/ksh/tree/bug-324-2, didn't made a PR because didn't want to pollute the process if it was not good.

If you spent 6hour on the 324-1 you did the bulk of it, 324-2 is just 1 test line change from

			{	if(fmt==fp[argp].ft.fmt&&fp[argp].ft.fmt == fp[argp].fmt)
                                                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ Added in bug-324-2

The first (original) test catch type mismatch between the currently scanned fmt and a cached entry for the same index, say %1$d %1$s while scanning the later out fmt='s' while the fp[0].fmt='d', then we reload the current %1$s% ft, and don't touch the cache.

The second test comes when the 1st succeed as in %1$b %1$s in that case the 1st format filled the cache with fp[0].fmt's'; f[0].ft.fmt='b';fp[0].value=... Note the ugly hack %b becomes %s then on 2nd format we got a match on 1st test, but still the fp[0].fmt != fp[0].ft.fmt and since 1sst test is ok, it mean the current scan %s don't match the cached original %b we got to reload ft

Regarding perf, I decided not to update the cache, i.e the the first cache entry setup is kept all the time, because there is no assumtion about what torture we will get next, beside all this mix'n'match is marginal, people use it normally, i.e don't really venture in grey area like %1$*1$s that really make no sense, but yet we don't want to core dump.

Note that before working on this fix, I made another one who was fixing nothing, but just reject C illegal combo, as explained C don't allow mix'n'match in index vs non index access, but then @hyenias said ksh is not C that I understand, the fact we are re-doing the scanning for long argv[] is something C don't do and we enjoy (hum I got to torture C vsnprintf(...va_list) may be this one loops on argv[] too never tried that ).

@McDutchie for this bug-324-2 I inaugurated my new code review process before committing, and then before push, before PR, etc... I should not convert random <tabs> with spaces in the code base, should not have illegal <space> in my new code, YET my checker is may be too zealous and I discover the line I mentioned if(fmt==fp[argp].ft.fmt&&fp[argp].ft.fmt == fp[argp].fmt) should be if(fmt==fp[argp].ft.fmt && fp[argp].ft.fmt == fp[argp].fmt) i.e <space> around &&, so when you pick that one you may fix that, OR if you prefere I can fix it in bug-324-3 and do a proper PR you let me know.

Have a good one, and thanx @hyenias I rekon this all thing is marginal, most people use printf normally and generally don't torture it like this, yet when torture happen we got to produce something with a given logic, producing predictable result, and admitting that non indexed access are auto numbered may looks odd, but at list understandable.

Now the goodies, do you know the %. format selector, yes this add some complexity of the scan because we do have dot between width.prec.

printf '%..:s\n' abc def qqq

Not very well documented, I got bitten the hardway during fix, thanx to tests/builtins.sh, anyway I like it and will use it :-)

@hyenias
Copy link

hyenias commented Apr 9, 2023

@phidebian I did not know about ksh's printf format string concatenation with delimiter of %..<delimiter>s. Nice, I will have to use that in the future. Thanks for fixing and sharing this easter egg in ksh's printf.

$ printf '%..:s PM\n' 12 30 45
12:30:45 PM
$ printf '%..-s\n' 04 09 2023
04-09-2023

@phidebian
Copy link
Author

I wonder why this thing was invented, yet it allow things like building a PATH like variable from an array

$ a=(p1 p2 p3 p4)

$ printf '%..:s\n'  "${a[@]}"
p1:p2:p3:p4

It allow also the generation of a comma separated list of word without to worry about leading/trailing excess comma, could be usefull in docco printout, or C code generation :-)

$ printf '%s,' w1 w2 w3  #Trailing comma
w1,w2,w3,

$ printf ',%s' w1 w2 w3  # Leading comma
,w1,w2,w3

$ printf '%..,s' w1 w2 w3 # Cleanly separated word
w1,w2,w3

I wonder how many undocumentd features exist like the __ that is pretty obscure to me :-)

Notably:

Remove long, fix-specific comments from source code. These will be
reused in the final git commit message instead, for documentation
findable via e.g. 'git blame'. They won't make sense on their own
20 years from now without the context of the commit.

sfio.h: Rename reload to reloadf for consistency
with extf, eventf.

tests/{shtests,printf.sh}: Instead of err_exit in comments, just
grep -c '^T ' to count the number of tests in printf.sh.
@McDutchie McDutchie merged commit 87480ec into ksh93:dev May 17, 2023
McDutchie pushed a commit that referenced this pull request May 17, 2023
@stephane-chazelas reported:
>     $ printf '%1$s %1$b\n' '\\\\'
>     \\\\ b
>     $ printf '%b %1$s\n' '\\\\'
>     \\
>
> expected:
>
>     $ printf '%1$s %1$b\n' '\\\\'
>     \\\\ \\
>     $ printf '%b %1$s\n' '\\\\'
>     \\ \\\\

@hyenias commented:
> After some trial-and-error and some investigation I believe I
> understand the trouble you expressed. Trouble occurs when x$ aka
> parameter field is used[*].
>
> Example: Reorder parameters and repeat the 2nd:
>
>     $ printf '%3$s %2$s %1$s %2$s\n' a b c
>     c b a b
>
> More troubled output:
>
>     $ printf '%b %1$s\n' '\\\\'
>     \\
>     $ printf '%1$b %1$s\n' '\\\\'
>     \\ \\
>     $ printf '%s %1$s\n' '\\\\'
>     \\\\
>     $ printf '%s %1$s\n' 'a'
>     a
>     $ printf '%1$s %1$s\n' 'a'
>     a a
>
> [*] See n$ parameter field: https://en.wikipedia.org/wiki/Printf

Analysis and fix:

printf format string can handle sequential argv[] access via % and
* as well as indexed argv[] access via %x$ *x$ (* used for
width/prec)

In a given format string it is possible to mix'n'match sequential/
indexed access, and mix'n'match the argv[] types.

'%*2$s' 1 2  Indexed access for 2$ making argv[1] an int, seq
    access for 's' then accessing arv[0] for s, here we mix seq/ndx
    access.

'%*2$s %s' 1 2 Same as above, the 1st format makes argv[0] a
    'string' argv[1] an 'int' then the second format make argv[1] a
    'string' this is a mixed type access for arv[1].

The former algo for printf-format handling builtin (say printf) is
to save the builtin argv[] in a data structure called struct printf
located in src/cmd/ksh93/bltins/print.c. This is located in a .c
file, meaning it is private to the ksh builtin.

Yet, all the code to scan a printf format string (not limited to
printf, may be scanf too) is located in lib/libast/sfio/ meaning
all sorts of jazz to to setup and use the private struct printf
access at sfio level via function pointers.

This printf data struct has a member called nextarg[] that is
initially setup to the builtin argv[]. It is called nextarg,
because on sequential access the % or * format access *nextarg++,
meaning each sequential access, consumes an arg. This is a major
design flaw, because it can't support indexed access %x$ *x$ very
well.

For this, on 1st indexed access, a new sfio level (not builtin)
data structure is built called fp[]. It rescans the format string
to find out all the indexed access, enters them in this kind of
cache, recording the type/value of the argv[x].

Now another problem occurs: the type entered is what we see on 1st
index 'x' occurence, so if we got '%2$d %2$f' then for occurence 2
i.e. fp[1] we enter type 'd' and fetch the value nextarg[1]. Since
nextarg may have been moved forward by sequential access, it
provides a bogus value. Then later access to %$2f will retrieve the
cached value for fp[1], find an 'int', and ksh is unable to convert
it to 'double' and all sorts of crash occur.

This is a complete design flaw, may be due to initial design for
sequential access only (like bash) then piled on a kludge to try an
indexed access.

This patch will leave all the sfio infrastructure (functions)
untouched, yet will implement a workaround on each flaw. The
reasons not to touch the existing functions is that the format scan
may be used by scanf() and I'm not fixing scanf for now.

The workaroud is based on adding more members in the struct printf
data structure, basically add argv0, the inital backup of argv[]
received by the builtin. nextarg is left unchanged, the former
infrastructure still maintained i.e. advance; they are tied to
sequential access. Yet, indexed access is the argv0[x] access with
lazy conversion, i.e. type conversion on demand, i.e. late in the
printf() process.

In short, this fix works 'as if' the format string is pre-pocessed,
replacing any non-indexed access (i.e. seq access) by an explicitly
computed index access, the index generated being simply the seq
index i.e. something similar to:

  +----++------+-++-----------+-- Seq
  |    ||      | ||  ndx ---+ |
  v    vv      v vv         v v
'%s   %*s     %*.*s       %*1$s'   expanded to (ignore spaces)
'%1$s %3$*2$s %6$*4$.*5$s %7$*1$s'
  ^    ^  ^    ^  ^   ^    ^  ^
  |    |  |    |  |   |    |  |
  +----+--+----+--+---+----+--+-- Computed Seq index.
  Note the index inversion due to the syntax of the format string
  printf '%*.*s' width prec data; is equivalent to
  printf '%3$*1$.*2$'  width prec data;

Here I keep all the infrastructure, yet working around bugs.

src/lib/libast/sfio/sfvprintf.c:
- Implement argp, argn processing so that each argv[] cached is on
  its right position.
- Use nargs/xargs make sure we accees the correct argv[] position.
- argv[] vs fp[] type mismatch we do a late argv[] evaluation
  (uncached).

src/lib/libast/sfio/sftable.c:
- sffmtpos() duplicates the format string processing done here in
  sfvprintf(), so we implement the same argp, argn processing so
  that each argv[] cached is on its right position.

src/cmd/ksh93/bltins/print.c:
- Add reload() function. It is called when the caller wants to
  solve a fp[x] cache miss, i.e., a type mismatch between the
  cached type/value and the desired new type. It does the same as
  extend(), i.e. setup type/value for a given argp and then bump
  nextarg, but also saves/restores nextarg. This because 'maybe'
  extend is used by other code (scanf?).

Resolves: #324
McDutchie added a commit that referenced this pull request May 18, 2023
The second part of the patch from one of the PR comments[*] didn't
make it into the PR, so it failed to fix one of the originally
reported reproducers. This patch applies that fix.

This corrects the following reproducer:

    $ printf '%b %1$s\n' '\\\\'
    \\

to:

    $ printf '%b %1$s\n' '\\\\'
    \\ \\\\

[*] #617 (comment)
McDutchie added a commit that referenced this pull request May 18, 2023
The second part of the patch from one of the PR comments[*] didn't
make it into the PR, so it failed to fix one of the originally
reported reproducers. This patch applies that fix.

This corrects the following reproducer:

    $ printf '%b %1$s\n' '\\\\'
    \\

to:

    $ printf '%b %1$s\n' '\\\\'
    \\ \\\\

[*] #617 (comment)
McDutchie added a commit that referenced this pull request May 18, 2023
The second part of the patch from one of the PR comments[*] didn't
make it into the PR, so it failed to fix one of the originally
reported reproducers. This patch applies that fix.

This corrects the following reproducer:

    $ printf '%b %1$s\n' '\\\\'
    \\ \\

to:

    $ printf '%b %1$s\n' '\\\\'
    \\ \\\\

[*] #617 (comment)

Co-authored-by: Martijn Dekker <martijn@inlv.org>
McDutchie added a commit that referenced this pull request May 18, 2023
The second part of the patch from one of the PR comments[*] didn't
make it into the PR, so it failed to fix one of the originally
reported reproducers. This patch applies that fix.

This corrects the following reproducer:

    $ printf '%b %1$s\n' '\\\\'
    \\ \\

to:

    $ printf '%b %1$s\n' '\\\\'
    \\ \\\\

[*] #617 (comment)

Co-authored-by: Martijn Dekker <martijn@inlv.org>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants