scanf produces wrong %n output value after integer conversion #4562

Closed
vicuna opened this Issue Jun 5, 2008 · 4 comments

vicuna commented Jun 5, 2008

Original bug ID: 4562
Reporter: premchai21
Assigned to: @pierreweis
Status: closed (set by @pierreweis on 2009-04-29T18:44:06Z)
Resolution: fixed
Priority: normal
Severity: minor
Version: 3.10.2
Fixed in version: 3.11.0+beta
Category: ~DO NOT USE (was: OCaml general)

Bug description

Comments added for clarity.

$ ocaml
Objective Caml version 3.10.2

# let g s = Scanf.sscanf s "%d%n" (fun i n -> (i, n));;
val g : string -> int * int = <fun>
# g "99";;
- : int * int = (99, 2) (* Correct. *)
# g "99 syntaxes all in a row";;
- : int * int = (99, 3) (* Wrong. *)
# g "-20 degrees Celsius";;
- : int * int = (-20, 4) (* Also wrong. *)
# for i = 32 to 126 do
    if ((i < 48) || (i >= 58)) && (i != 95) then
      let (i, n) = g ("42" ^ (String.make 1 (char_of_int i))) in
      if n != 3 then Printf.printf "Hmm: %d\n%!" n
  done;;
- : unit = () (* Happens with all printable
                 ASCII chars in [^0-9_]. *)

Additional information

This is on Debian unstable AMD64, version 3.10.2-3 of the "ocaml" package.
A cursory glance at stdlib/scanf.ml makes me think that not only is it
peeking a char and then erroneously counting that as part of the character
count, but the Scanning stuff doesn't have a way to do otherwise, making
this probably require a larger change to fix than I would have otherwise
expected. :-(

vicuna commented Jun 6, 2008

Comment author: @pierreweis

This is clearly a semantical issue, not a bug.

vicuna commented Jun 6, 2008

Comment author: @pierreweis

I think you overlooked the definition of the %n conversion; in the documentation for Scanf, it is stated as:

  • [n]: returns the number of characters read so far.

If we accept this definition, %n is not supposed to give the number of characters in the tokens, or even to be related to the length of the tokens: it just returns the number of characters that have been ``read so far'' to return those tokens. Hence, there are no errors in the examples you gave: the numbers of characters ``read so far'' to return the tokens you asked for are precisely those reported by the call to scanf.

This behaviour is also briefly explained in a note of the documentation:

Note: a scan may often require to examine one character in advance;
when this ``lookahead'' character does not belong to the token read,
it is stored back in the scanning buffer and becomes the next
character read.

A seminal example of a scan that requires a lookahead character is the very useful %0c conversion, which means: ``test the current input character without reading it''. To let you examine the NEXT character to be read, the %0c conversion must read this character and store it back so that it becomes the next character read.
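As a small illustration of the %0c conversion described above, the sketch below peeks at the first character and then reads it for real. The name `peek_then_read` is made up for this example; only `Scanf.sscanf` and the documented `%0c` and `%c` conversions are used.

```ocaml
(* Hypothetical helper: look at the next character with %0c (which does
   not consume it), then read that same character with %c. *)
let peek_then_read (s : string) : char * char =
  Scanf.sscanf s "%0c%c" (fun peeked read -> (peeked, read))

let () =
  let (peeked, read) = peek_then_read "abc" in
  (* Both conversions see the same character, since %0c left it in place. *)
  Printf.printf "peeked=%c read=%c\n" peeked read
```

Both components are expected to be the same character, because the peeked character stays in the scanning buffer for the next conversion.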

This behaviour is not at all uncommon: in fact, almost all conversions need such a lookahead: %s, %d, %f, and so on. This is clear when reading an integer from the string "0123abc": scanf must read the character 'a' before it can conclude that the number ends at the character '3' of the input. Hence, after reading 123, the %n conversion returns the exact count of characters read so far, which is 5.

# Scanf.sscanf "0123abc" "%i%n" (fun n count_for_n -> n, count_for_n);;
- : int * int = (123, 5)

Note also that reading a single character after the integer does not change the ``number of characters read so far'', since no further character needs to be read to find 'a':

# Scanf.sscanf "0123abc" "%i%n%c%n"
    (fun n count_for_n c count_for_c -> n, count_for_n, c, count_for_c);;
- : int * int * char * int = (123, 5, 'a', 5)
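The two possible meanings of ``read so far'' can be made concrete with a toy scanner that keeps one character of lookahead. This is only a sketch for illustration, not the code in stdlib/scanf.ml; every name in it (`scanner`, `peek`, `advance`, `fetched_count`, `consumed_count`, `scan_digits`, `demo`) is hypothetical, and it handles unsigned decimal digits only.

```ocaml
(* Toy model of a scanner with one character of lookahead.
   Illustration only: this is NOT the Scanf implementation. *)
type scanner = {
  input : string;
  mutable fetched : int;            (* characters pulled from the input *)
  mutable lookahead : char option;  (* fetched character that was stored back *)
}

let make input = { input; fetched = 0; lookahead = None }

(* Look at the next character without consuming it, fetching it if needed. *)
let peek sc =
  match sc.lookahead with
  | Some _ as c -> c
  | None ->
    if sc.fetched >= String.length sc.input then None
    else begin
      let c = sc.input.[sc.fetched] in
      sc.fetched <- sc.fetched + 1;
      sc.lookahead <- Some c;
      Some c
    end

(* Consume the character currently held as lookahead. *)
let advance sc = sc.lookahead <- None

(* Count that includes a pending lookahead character. *)
let fetched_count sc = sc.fetched

(* Count that excludes a pending lookahead character. *)
let consumed_count sc =
  match sc.lookahead with
  | Some _ -> sc.fetched - 1
  | None -> sc.fetched

(* Read a run of decimal digits; stops at the first non-digit,
   which stays in the lookahead slot. *)
let scan_digits sc =
  let rec loop acc =
    match peek sc with
    | Some ('0'..'9' as c) ->
      advance sc;
      loop (acc * 10 + Char.code c - Char.code '0')
    | _ -> acc
  in
  loop 0

(* Returns (value, count including lookahead, count excluding lookahead). *)
let demo s =
  let sc = make s in
  let v = scan_digits sc in
  (v, fetched_count sc, consumed_count sc)

let () =
  let (v, f, c) = demo "0123abc" in
  Printf.printf "value=%d fetched=%d consumed=%d\n" v f c
```

In this model, `demo "0123abc"` gives a fetched count of 5 and a consumed count of 4: the first matches the %n value reported in the example above, the second is the count the reporter expected.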
vicuna commented Jun 6, 2008

Comment author: premchai21

I do not see that note paragraph anywhere in http://caml.inria.fr/pub/docs/manual-ocaml/libref/Scanf.html or in my local copy of the documentation. Where is that note located?

Every C scanf implementation that I have seen defines %n to mean "number of characters read so far" with the semantics of "number of characters consumed that were used to match tokens or other parts of the format string, not including any lookahead characters read from the input stream". In the absence of a formal definition, the Caml documentation can reasonably be interpreted this way as well. It is also a much more common and useful case to require the number of characters matched without including any lookahead characters. Making the interpretation of lookahead (which is more an internal detail of the Scanf module) a necessary part of constructing the conversion strings and functions feels rather unclean.

Even the note paragraph that you quote doesn't seem to contradict that idea; it states that when a lookahead character is stored back into the scanning buffer, it becomes the next character read. This to me implies that it has been unread and is therefore no longer considered read as regards the logical state of the scanner, even if one more character had to be physically read from the input in order to produce this state.

vicuna commented Sep 8, 2008

Comment author: @pierreweis

This is fixed in the current development version:

    Objective Caml version 3.11+dev15

# let g s = Scanf.sscanf s "%d%n" (fun i n -> (i, n));;
val g : string -> int * int = <fun>
# g "99";;
- : int * int = (99, 2)
# g "99 syntaxes all in a row";;
- : int * int = (99, 2)
# g "-20 degrees Celsius";;
- : int * int = (-20, 3)

So the lookahead character is now no longer counted as read, even though it really has been. I agree with you that these semantics are clearer and more sound.
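With the fixed semantics, %n can be used directly as an offset into the scanned string. The following is a minimal sketch, assuming OCaml 3.11 or later; `split_int` is a hypothetical helper name, not part of the standard library.

```ocaml
(* Hypothetical helper: split a string into its leading integer and the
   unparsed remainder, using %n as the offset where the integer ends.
   Assumes the post-fix behaviour, where %n excludes the lookahead character. *)
let split_int (s : string) : int * string =
  Scanf.sscanf s "%d%n" (fun i n ->
    (i, String.sub s n (String.length s - n)))

let () =
  let (i, rest) = split_int "-20 degrees Celsius" in
  Printf.printf "%d|%s\n" i rest
```

Under the 3.10.2 behaviour reported in this issue, the same helper would have dropped the first character of the remainder whenever the integer was followed by more input.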

@vicuna vicuna closed this Apr 29, 2009

@vicuna vicuna added the bug label Mar 19, 2019
