New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Command-line inputs: bad parsing? #33

Open
a3n opened this Issue Oct 8, 2018 · 13 comments

Comments

Projects
None yet
4 participants
@a3n

a3n commented Oct 8, 2018

While checking the documentation of MirageOS, more precisely the tutorial Hello MirageOS World, at the end of Step 1, I encountered an unexpected behaviour of the command-line parsing for the Solo5 backend.
Namely, if I do:
mirage configure -t hvt
make depend
make
./solo5-hvt -- hello.hvt --hello="simple test",
then I get the following error message:
[...]
hello: too many arguments, don't know what to do with "test"
[...].
It seems to be the cases for all inputs containing at least one white space.
However, it is not the case when I execute the equivalent commands for a Unix backend.
Moreover, it is not the case if the input is set as default value:
mirage configure -t hvt --hello="simple test",
then everything works fine.
I am using opam 2.0.0, mirage 3.2.0, mirage-solo5 0.4.0.

@mato

This comment has been minimized.

Show comment
Hide comment
@mato

mato Oct 9, 2018

Member

Copying from Solo5/solo5#281, different manifestation of the same underlying parser:

If the unikernel boot parameter values contain double quotes (") they correctly end up in the Key_gen module if it is compiled for the unix target. In the hvt case the quotes get lost and must be escaped with an addition backspace ("). In any case I believe it should be consistent and behave in the same way for unix and hvt targets.

In the Solo5 case, the code that is responsible for this gets the command line as a single string (unlike unix, not an argv[]). This then, via https://github.com/mirage/mirage-bootvar-solo5/blob/824796885d7354a7740a71d58438f076c3cdab3b/lib/bootvar.ml#L19 calls out to Parse_argv.parse, which lives in https://github.com/mirage/parse-argv and is also used by the Xen target.

Unfortunately we cannot change the behaviour where the command line is passed in from the tender (or Xen) as a single string. So, either Parse_argv can be improved in some way, or we just need to document the differences in behaviour.

Member

mato commented Oct 9, 2018

Copying from Solo5/solo5#281, different manifestation of the same underlying parser:

If the unikernel boot parameter values contain double quotes (") they correctly end up in the Key_gen module if it is compiled for the unix target. In the hvt case the quotes get lost and must be escaped with an addition backspace ("). In any case I believe it should be consistent and behave in the same way for unix and hvt targets.

In the Solo5 case, the code that is responsible for this gets the command line as a single string (unlike unix, not an argv[]). This then, via https://github.com/mirage/mirage-bootvar-solo5/blob/824796885d7354a7740a71d58438f076c3cdab3b/lib/bootvar.ml#L19 calls out to Parse_argv.parse, which lives in https://github.com/mirage/parse-argv and is also used by the Xen target.

Unfortunately we cannot change the behaviour where the command line is passed in from the tender (or Xen) as a single string. So, either Parse_argv can be improved in some way, or we just need to document the differences in behaviour.

@ansiwen

This comment has been minimized.

Show comment
Hide comment
@ansiwen

ansiwen Oct 9, 2018

So, the information obviously gets lost, when the argv strings are merged into a single string without escaping, which one could consider as a bug. As it seems like Parse_argv parses a whitespace separated list of optionally quoted strings, why not taking the argv strings, escaping the contained quotes, wrapping them in quotes and join them with a space when generating the single string?

ansiwen commented Oct 9, 2018

So, the information obviously gets lost, when the argv strings are merged into a single string without escaping, which one could consider as a bug. As it seems like Parse_argv parses a whitespace separated list of optionally quoted strings, why not taking the argv strings, escaping the contained quotes, wrapping them in quotes and join them with a space when generating the single string?

@mato

This comment has been minimized.

Show comment
Hide comment
@mato

mato Oct 9, 2018

Member
Member

mato commented Oct 9, 2018

@ansiwen

This comment has been minimized.

Show comment
Hide comment
@ansiwen

ansiwen Oct 9, 2018

I can feel your reluctance to do that in C, but if the code that throws away the information is written in C, I don't see another way to fix it. Otherwise the information is lost. And escaping certain characters in strings should be so common, that we can probably find well-cured code in C that does that.

ansiwen commented Oct 9, 2018

I can feel your reluctance to do that in C, but if the code that throws away the information is written in C, I don't see another way to fix it. Otherwise the information is lost. And escaping certain characters in strings should be so common, that we can probably find well-cured code in C that does that.

@mato

This comment has been minimized.

Show comment
Hide comment
@mato

mato Oct 9, 2018

Member

Just discussed this with @hannesm, the best way forward right now is to make the unix target also use Parse_argv instead of just OS.Env.argv directly. That way we get consistent behaviour for all targets, and we can document that behaviour, and we don't need any parsing code in C on the Solo5 side.

Member

mato commented Oct 9, 2018

Just discussed this with @hannesm, the best way forward right now is to make the unix target also use Parse_argv instead of just OS.Env.argv directly. That way we get consistent behaviour for all targets, and we can document that behaviour, and we don't need any parsing code in C on the Solo5 side.

@hannesm hannesm referenced this issue Oct 9, 2018

Merged

unix bootvar #931

@hannesm

This comment has been minimized.

Show comment
Hide comment
@hannesm

hannesm Oct 9, 2018

Member

@a3n from your referenced website, I could only find $ ./solo5-hvt -- hello.hvt --hello="Hola!" which works as expected for me.

I agree that the behaviour between the targets (on Unix, ./hello --hello="foo bar" works, while it doesn't with solo5-hvt) is unfortunate. I prepared a fix mirage/mirage#931 which adjusts the unix behaviour to the same as other targets. We should document the parse-argv somewhere.

@mato I am as well not happy with complex C parsing code.

Member

hannesm commented Oct 9, 2018

@a3n from your referenced website, I could only find $ ./solo5-hvt -- hello.hvt --hello="Hola!" which works as expected for me.

I agree that the behaviour between the targets (on Unix, ./hello --hello="foo bar" works, while it doesn't with solo5-hvt) is unfortunate. I prepared a fix mirage/mirage#931 which adjusts the unix behaviour to the same as other targets. We should document the parse-argv somewhere.

@mato I am as well not happy with complex C parsing code.

@ansiwen

This comment has been minimized.

Show comment
Hide comment
@ansiwen

ansiwen Oct 9, 2018

@mato @hannesm Having it consistent is clearly an improvement, of course. Still, I want to make sure I understand correctly: you think searching for the " character and replacing it with \" is too complex to implement it in C, and you prefer the user do it by themselves? (with sed or whatever) Or is there something else that would need to be parsed that I'm not aware of now?

FWIW I would clearly vote for letting the solo5 code do this instead of the user, because it the other behavior is counter-intuitive for most people I presume, and it seems to me to be a well confined small problem, that can safely be implemented in C. I also volunteer to write it, if that should be the issue.

ansiwen commented Oct 9, 2018

@mato @hannesm Having it consistent is clearly an improvement, of course. Still, I want to make sure I understand correctly: you think searching for the " character and replacing it with \" is too complex to implement it in C, and you prefer the user do it by themselves? (with sed or whatever) Or is there something else that would need to be parsed that I'm not aware of now?

FWIW I would clearly vote for letting the solo5 code do this instead of the user, because it the other behavior is counter-intuitive for most people I presume, and it seems to me to be a well confined small problem, that can safely be implemented in C. I also volunteer to write it, if that should be the issue.

@hannesm

This comment has been minimized.

Show comment
Hide comment
@hannesm

hannesm Oct 9, 2018

Member

you think searching for the " character and replacing it with " is too complex to implement it in C

yes. you would have to handle " when inside ' or ", or prefixed by \" - this would require a huge amount of complexity (same as bash command line parsing etc.) for no obvious benefit, plus there'll be need to make the xen bootvar consistent with this.

the other behavior is counter-intuitive for most people

as I mentioned in the issue on mirage/mirage, we need to document the boot parameter (parse_argv), and that should be fine then.

it seems to me to be a well confined small problem, that can safely be implemented in C. I also volunteer to write it, if that should be the issue.

String parsing in C is complex and hard to write in my opinion. If you come up with less then 3 lines of C code (without any dynamic allocation etc.) to do the escaping, and an exhaustive test suite against different mirage targets, I'm happy to re-think my current opinion.

Member

hannesm commented Oct 9, 2018

you think searching for the " character and replacing it with " is too complex to implement it in C

yes. you would have to handle " when inside ' or ", or prefixed by \" - this would require a huge amount of complexity (same as bash command line parsing etc.) for no obvious benefit, plus there'll be need to make the xen bootvar consistent with this.

the other behavior is counter-intuitive for most people

as I mentioned in the issue on mirage/mirage, we need to document the boot parameter (parse_argv), and that should be fine then.

it seems to me to be a well confined small problem, that can safely be implemented in C. I also volunteer to write it, if that should be the issue.

String parsing in C is complex and hard to write in my opinion. If you come up with less then 3 lines of C code (without any dynamic allocation etc.) to do the escaping, and an exhaustive test suite against different mirage targets, I'm happy to re-think my current opinion.

@hannesm

This comment has been minimized.

Show comment
Hide comment
@hannesm

hannesm Oct 9, 2018

Member

on a further note: ./solo5-hvt -- hello.hvt '--hello="foo bar"' -- having single quotes around the complete argument -- works well (and once the mirage-bootvar-unix is used behaves consistently across backends).

Member

hannesm commented Oct 9, 2018

on a further note: ./solo5-hvt -- hello.hvt '--hello="foo bar"' -- having single quotes around the complete argument -- works well (and once the mirage-bootvar-unix is used behaves consistently across backends).

@ansiwen

This comment has been minimized.

Show comment
Hide comment
@ansiwen

ansiwen Oct 10, 2018

you think searching for the " character and replacing it with " is too complex to implement it in C

yes. you would have to handle " when inside ' or ", or prefixed by \" - this would require a huge amount of complexity (same as bash command line parsing etc.) for no obvious benefit, plus there'll be need to make the xen bootvar consistent with this.

Why would we have to? If I look at Parse_argv code, I only see handling of " and \. IIUC, to let Parse_argv.parse resemble an array of strings that is identical to the original argv array, it is enough to replace " and \ with \" and \\ respectively, wrap it in " and join them with spaces. I implemented this and I didn't find any problems. Can you give a problematic example that would not be correctly handled by that?

the other behavior is counter-intuitive for most people

as I mentioned in the issue on mirage/mirage, we need to document the boot parameter (parse_argv), and that should be fine then.

I will give you a real life example, why it is indeed cumbersome to use IMHO. I have a unikernel that takes JSON strings as boot parameters. With the current implementation of solo5, I have to write

$ ./solo5-hvt --net=tap100 ./unikernel --json '"{\"foo\": \"bar\"}"'

or – if I can't use ' because I want to expand variables – I even have to write

$ ./solo5-hvt --net=tap100 ./unikernel --json "\"{\\\"foo\\\": \\\"$BAR\\\"}\""

which... come on... looks horrible, or – if I want to read it from a file –

$ ./solo5-hvt --net=tap100 ./unikernel --json $(cat my.json | sed 's/\(["\]\)/\\\1/g')

With correct escaping within the solo5 tender I can just write

$ ./solo5-hvt --net=tap100 ./unikernel --json '{"foo": "bar"}'
$ ./solo5-hvt --net=tap100 ./unikernel --json "{\"foo\": \"$BAR\"}"
$ ./solo5-hvt --net=tap100 ./unikernel --json $(cat my.json)

which is so much more what one would expect.

it seems to me to be a well confined small problem, that can safely be implemented in C. I also volunteer to write it, if that should be the issue.

String parsing in C is complex and hard to write in my opinion. If you come up with less then 3 lines of C code (without any dynamic allocation etc.) to do the escaping, and an exhaustive test suite against different mirage targets, I'm happy to re-think my current opinion.

In general you are right of course, but I still have the understanding that in this case only trivial replacement of two characters and wrapping is required. 2 lines of code is a bit too ambitious, but I managed to write it with only 4 additional lines (not counting empty lines). Please have a look at the code in Solo5/solo5#283, and if you guys are fine with it in general, I'm happy to add tests as well. ;-)

ansiwen commented Oct 10, 2018

you think searching for the " character and replacing it with " is too complex to implement it in C

yes. you would have to handle " when inside ' or ", or prefixed by \" - this would require a huge amount of complexity (same as bash command line parsing etc.) for no obvious benefit, plus there'll be need to make the xen bootvar consistent with this.

Why would we have to? If I look at Parse_argv code, I only see handling of " and \. IIUC, to let Parse_argv.parse resemble an array of strings that is identical to the original argv array, it is enough to replace " and \ with \" and \\ respectively, wrap it in " and join them with spaces. I implemented this and I didn't find any problems. Can you give a problematic example that would not be correctly handled by that?

the other behavior is counter-intuitive for most people

as I mentioned in the issue on mirage/mirage, we need to document the boot parameter (parse_argv), and that should be fine then.

I will give you a real life example, why it is indeed cumbersome to use IMHO. I have a unikernel that takes JSON strings as boot parameters. With the current implementation of solo5, I have to write

$ ./solo5-hvt --net=tap100 ./unikernel --json '"{\"foo\": \"bar\"}"'

or – if I can't use ' because I want to expand variables – I even have to write

$ ./solo5-hvt --net=tap100 ./unikernel --json "\"{\\\"foo\\\": \\\"$BAR\\\"}\""

which... come on... looks horrible, or – if I want to read it from a file –

$ ./solo5-hvt --net=tap100 ./unikernel --json $(cat my.json | sed 's/\(["\]\)/\\\1/g')

With correct escaping within the solo5 tender I can just write

$ ./solo5-hvt --net=tap100 ./unikernel --json '{"foo": "bar"}'
$ ./solo5-hvt --net=tap100 ./unikernel --json "{\"foo\": \"$BAR\"}"
$ ./solo5-hvt --net=tap100 ./unikernel --json $(cat my.json)

which is so much more what one would expect.

it seems to me to be a well confined small problem, that can safely be implemented in C. I also volunteer to write it, if that should be the issue.

String parsing in C is complex and hard to write in my opinion. If you come up with less then 3 lines of C code (without any dynamic allocation etc.) to do the escaping, and an exhaustive test suite against different mirage targets, I'm happy to re-think my current opinion.

In general you are right of course, but I still have the understanding that in this case only trivial replacement of two characters and wrapping is required. 2 lines of code is a bit too ambitious, but I managed to write it with only 4 additional lines (not counting empty lines). Please have a look at the code in Solo5/solo5#283, and if you guys are fine with it in general, I'm happy to add tests as well. ;-)

ansiwen added a commit to ansiwen/solo5 that referenced this issue Oct 10, 2018

Escape boot parameters to quoted space-separated format of Parse_argv
Parse_argv.parse is expecting a space separated list of quoted
backslash-escaped strings.  Until now the parameters of argv are just
concatenated unmodified, so the user has to take care of the quoting
and escaping of the boot parameters by themselves.  This change wraps
every element of argv in ", escapes " and \ to \" and \\, and joins
them with spaces into a single string, so that Parse_argv.parse can
resemble a list of strings identical to argv.

Related: Solo5#281
Fixes: mirage/mirage-solo5#33

ansiwen added a commit to ansiwen/solo5 that referenced this issue Oct 11, 2018

Escape boot parameters to quoted space-separated format of Parse_argv
Parse_argv.parse is expecting a space separated list of quoted
backslash-escaped strings.  Until now the parameters of argv are just
concatenated unmodified, so the user has to take care of the quoting
and escaping of the boot parameters by themselves.  This change wraps
every element of argv in ", escapes " and \ to \" and \\, and joins
them with spaces into a single string, so that Parse_argv.parse can
resemble a list of strings identical to argv.

Related: Solo5#281
Fixes: mirage/mirage-solo5#33

ansiwen added a commit to ansiwen/solo5 that referenced this issue Oct 11, 2018

Escape boot parameters to quoted space-separated format of Parse_argv
Parse_argv.parse is expecting a space separated list of quoted
backslash-escaped strings.  Until now the parameters of argv are just
concatenated unmodified, so the user has to take care of the quoting
and escaping of the boot parameters by themselves.  This change wraps
every element of argv in ", escapes " and \ to \" and \\, and joins
them with spaces into a single string, so that Parse_argv.parse can
resemble a list of strings identical to argv.

Related: Solo5#281
Fixes: mirage/mirage-solo5#33

ansiwen added a commit to ansiwen/solo5 that referenced this issue Oct 12, 2018

Escape boot parameters to quoted space-separated format of Parse_argv
Parse_argv.parse is expecting a space separated list of quoted
backslash-escaped strings.  Until now the parameters of argv are just
concatenated with spaces, which results in lost information if argv
elements contain spaces themselves, and the user has to take care of
the quoting and escaping of the boot parameters by themselves.
Together with the escape requirements of typical shells, several
layers of escaping are necessary, which is tedious and error prone.
To avoid this, this change replaces ' ', '"' and '\' with '\ ' '\"' and
'\\' respectively, and joins them with spaces into a single string,
so that Parse_argv.parse can resemble a list of strings identical to
the original argv, which makes additional escaping unnecessary.

Related: Solo5#281
Fixes: mirage/mirage-solo5#33

ansiwen added a commit to ansiwen/solo5 that referenced this issue Oct 12, 2018

Escape boot parameters to quoted space-separated format of Parse_argv
Parse_argv.parse is expecting a space separated list of quoted
backslash-escaped strings.  Until now the parameters of argv are just
concatenated with spaces, which results in lost information if argv
elements contain spaces themselves, and the user has to take care of
the quoting and escaping of the boot parameters by themselves.
Together with the escape requirements of typical shells, several
layers of escaping are necessary, which is tedious and error prone.
To avoid this, this change replaces ' ', '"' and '\' with '\ ' '\"' and
'\\' respectively, and joins them with spaces into a single string,
so that Parse_argv.parse can resemble a list of strings identical to
the original argv, which makes additional escaping unnecessary.

Related: Solo5#281
Fixes: mirage/mirage-solo5#33
@ansiwen

This comment has been minimized.

Show comment
Hide comment
@ansiwen

ansiwen Oct 12, 2018

I just realized, that I don't have to quote the strings, since also the space character can be escaped. That further simplified the code and I'm indeed down to two additional lines now. 😜 (Not that I would say number of lines is more important than readability, but I was surprised that it's actually possible.)

ansiwen commented Oct 12, 2018

I just realized, that I don't have to quote the strings, since also the space character can be escaped. That further simplified the code and I'm indeed down to two additional lines now. 😜 (Not that I would say number of lines is more important than readability, but I was surprised that it's actually possible.)

@hannesm

This comment has been minimized.

Show comment
Hide comment
@hannesm

hannesm Oct 13, 2018

Member

as already expressed: I value uniformity between targets very high, and I'm not fond of adding any additional line of C code. From what I read, there's no way on Xen to retrieve an argv vector instead of cmdline (via xenstore's /vm/UUID/image/cmdline as done in mirage-bootvar-xen via start_info page), also unclear to me whether virtio would provide this functionality (atm it uses mutliboot struct and cmdline in there).

Please see mirage/mirage#931 for my proposal, adapting the current behaviour of hvt/virtio/xen to unix, which unifies the behaviour across all backends!

This will be my last comment here since there's really nothing more to add from my side; I don't see how to get uniformity across targets unless patching up xen toolchain, and virtio.

Member

hannesm commented Oct 13, 2018

as already expressed: I value uniformity between targets very high, and I'm not fond of adding any additional line of C code. From what I read, there's no way on Xen to retrieve an argv vector instead of cmdline (via xenstore's /vm/UUID/image/cmdline as done in mirage-bootvar-xen via start_info page), also unclear to me whether virtio would provide this functionality (atm it uses mutliboot struct and cmdline in there).

Please see mirage/mirage#931 for my proposal, adapting the current behaviour of hvt/virtio/xen to unix, which unifies the behaviour across all backends!

This will be my last comment here since there's really nothing more to add from my side; I don't see how to get uniformity across targets unless patching up xen toolchain, and virtio.

@ansiwen

This comment has been minimized.

Show comment
Hide comment
@ansiwen

ansiwen Oct 14, 2018

@hannesm: In favor of usability I wouldn't have a problem with separate uniformity among the argv based launchers (unix, hvt), and among the launchers that only use a single string for all unikernel parameters (virtio, xen) where we have no other choice. It is anyway not strictly uniform, because in the latter case you have to wrap all parameters in " or escape the spaces.

However, assuming @mato agrees with you here, I extended the Parse_argv module with a single quote handling, that allows at least to write my examples from above like:

$ ./solo5-hvt --net=tap100 ./unikernel --json \''{"foo": "bar"}'\'
$ ./solo5-hvt --net=tap100 ./unikernel --json \''{"foo": '$BAR'}'\'
$ ./solo5-hvt --net=tap100 ./unikernel --json \'$(cat my.json)\'

See mirage/parse-argv#6

ansiwen commented Oct 14, 2018

@hannesm: In favor of usability I wouldn't have a problem with separate uniformity among the argv based launchers (unix, hvt), and among the launchers that only use a single string for all unikernel parameters (virtio, xen) where we have no other choice. It is anyway not strictly uniform, because in the latter case you have to wrap all parameters in " or escape the spaces.

However, assuming @mato agrees with you here, I extended the Parse_argv module with a single quote handling, that allows at least to write my examples from above like:

$ ./solo5-hvt --net=tap100 ./unikernel --json \''{"foo": "bar"}'\'
$ ./solo5-hvt --net=tap100 ./unikernel --json \''{"foo": '$BAR'}'\'
$ ./solo5-hvt --net=tap100 ./unikernel --json \'$(cat my.json)\'

See mirage/parse-argv#6

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment