New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow skipping quotes for all MPD command arguments without spaces #591

Closed
jodal opened this Issue Nov 22, 2013 · 23 comments

Comments

4 participants
@jodal
Member

jodal commented Nov 22, 2013

The MPD docs on request format says:

If arguments contain spaces, they should be surrounded by double quotation marks.

We're currently expecting to always get quotes, and only allow skipping the quotes for some cases that has been required to support popular clients.

We should probably support both command (\S+) and command "([^"]+)" for all MPD commands.

@txomon

This comment has been minimized.

Member

txomon commented Nov 25, 2013

I think something like ("?)(?P\S+)\1 would be more effective. I have seen that this is an insane thing among mpd clients, and they can put double quotes from first argument, having or not spaces...

@jodal

This comment has been minimized.

Member

jodal commented Nov 25, 2013

("?)(\S+)\1 will not match quoted strings with spaces inside, which is exactly the main the reason to quote strings.

Another idea that came up on IRC is to use shlex.split() from the stdlib to normalize the MPD commands. It handles quoted parts with spaces as well as space and \t inbetween the parts correctly (see #592):

>>> import shlex
>>> shlex.split('find artist "foo bar"')
['find', 'artist', 'foo bar']
>>> shlex.split('find artist "foo bar" "album" baz')
['find', 'artist', 'foo bar', 'album', 'baz']
>>> shlex.split('find\tartist "foo bar" "album"\tbaz')
['find', 'artist', 'foo bar', 'album', 'baz']
>>> 
@kingosticks

This comment has been minimized.

Member

kingosticks commented Nov 25, 2013

I like shlex, that's really nice and simple. On a related note, I've been playing with trying to support escaped nested quotes like MPD does but it's been a total failure. The '' escape characters don't seem to make it to the dispatcher which has got me totally stumped (any hints?)
And with shlex maybe we could could do something like:

>>> shlex.split('find artist "foo \\"bar\\" baz"')
['find', 'artist', 'foo "bar" baz']
@txomon

This comment has been minimized.

Member

txomon commented Nov 25, 2013

I don't see any problem with it... Take into account that shlex is intended to receive that string, so the string,
find artist "foo \"bar\" baz" would be converted to what you see.

In python, thought using single or double quotes, you can scape both (if I recall correctly)

@txomon

This comment has been minimized.

Member

txomon commented Nov 25, 2013

Try with \\\\ to make it work (one scaping for python, and other for shlex)

@kingosticks

This comment has been minimized.

Member

kingosticks commented Nov 25, 2013

You misunderstand @txomon, I don't have any problems with the output I printed. My problem was trying to do the same thing but using re.

@txomon

This comment has been minimized.

Member

txomon commented Nov 25, 2013

Oh sorry! I did misunderstood =).

About mathching against re, I tried so, and I finally left it alone when my RegExp was over 30 chars...

@txomon

This comment has been minimized.

Member

txomon commented Nov 25, 2013

Maybe with the (?["'])(?P<var>.*)\1 method, something could be achieved...

@txomon

This comment has been minimized.

Member

txomon commented Nov 25, 2013

eval()-like functionality may help here...

@adamcik

This comment has been minimized.

Member

adamcik commented Nov 26, 2013

With shlex you need to watch out with respect to lack of utf-8 support. But if done right and operating on bytestrings it might work. So this would need test cases proving that it works IMO

@adamcik

This comment has been minimized.

Member

adamcik commented Nov 26, 2013

[p.decode('utf-8') for p in shlex.split(u'command "foo øå \\"bar\\""'.encode('utf-8'))] for how to juggle things right with respect to encodings. We could also hook in at https://github.com/mopidy/mopidy/blob/develop/mopidy/utils/network.py#L385 and either do the lexing directly as part of the decoding there or at the very least just pass through the bytestrings so we can shlex them safely.

@kingosticks

This comment has been minimized.

Member

kingosticks commented Nov 26, 2013

Thanks @adamcik, I think it's LineProtocol.decode() that had been confusing what I was seeing (unicode is like kryptonite). I think passing the byte strings through and shlex-ing in MpdSession.on_line_received() might be neat. This way, LineProtcol can stick to working in lines rather than tokens, which is a MPD specific use-case. I shall work on some test cases.

@adamcik

This comment has been minimized.

Member

adamcik commented Jan 18, 2014

Tried (?:([^\s"]+)|"([^"\\]*(?:\\.[^"\\]*)*)") as a variant. This is a combo of ([^\s"]+) match any number of non-space or quote characters. And "([^"\\]*(?:\\.[^"\\]*)*)" which is the loop unrolled version of "(?:[^"\\]|\\.)*", which is match a group surrounded by quotes, where to group contains any number of characters that are not a quotes or backslashes, or match a backslash followed by any character.

https://gist.github.com/adamcik/60497619254b586fdb43 adds a print to find_handlers to show the results when running the test suite or queries.

@kingosticks

This comment has been minimized.

Member

kingosticks commented Jan 18, 2014

Doesn't moving to tokens with shlex fix this too?
On 18 Jan 2014 12:02, "Thomas Adamcik" notifications@github.com wrote:

Tried (?:([^\s"]+)|"([^"](?:.[^"]))") as a variant. This is a
combo of ([^\s"]+) match any number of non-space or quote characters. And
"([^"]
(?:.[^"]))" which is the loop unrolled version of
"(?:[^"]|.)*", which is match a group surrounded by quotes, where to
group contains any number of characters that are not a quotes or
backslashes, or match a backslash followed by any character.

https://gist.github.com/adamcik/60497619254b586fdb43 adds a print to
find_handlers to show the results when running the test suite or queries.


Reply to this email directly or view it on GitHubhttps://github.com//issues/591#issuecomment-32680460
.

@adamcik

This comment has been minimized.

Member

adamcik commented Jan 18, 2014

In [6]: re.match(r'("?)([^"\\]*(?:\\.[^"\\]*)*)\1', '"test \\t foo"').groups()
Out[6]: ('"', 'test \\t foo')

In [7]: re.match(r'("?)([^"\\]*(?:\\.[^"\\]*)*)\1', 'test \\t foo').groups()
Out[7]: ('', 'test \\t foo')

Could also work if we want to keep the regexps more or less like they are now, but would suggest we use $QUOTEDSTRING type placeholders instead of using the actual regexp directly all over.

@adamcik

This comment has been minimized.

Member

adamcik commented Jan 18, 2014

I'm unsure about shlex with respect to if we get the correct behavior for mpd, that is does it use the POSIX mode or not, and is the shlex POSIX or non-POSIX mode equal to MPDs handling? As such I was playing with this regexp which can be tailored to perfectly match MPD once we figure out what that is, and if it's even needed.

@adamcik

This comment has been minimized.

Member

adamcik commented Jan 18, 2014

@adamcik

This comment has been minimized.

Member

adamcik commented Jan 18, 2014

Note that the case foo te\"st "bar" does not get handled correctly by the current suggested regex.

@adamcik

This comment has been minimized.

Member

adamcik commented Jan 18, 2014

In [19]: [a or b for a, b in re.findall(r'(?:((?:[^"\\\s]|\\.)+)|"((?:[^"\\]|\\.)*)")', 'find\tartist t\\"est "foo\\" bar" "album"\tbaz')]
Out[19]: ['find', 'artist', 't\\"est', 'foo\\" bar', 'album', 'baz']
@adamcik

This comment has been minimized.

Member

adamcik commented Jan 18, 2014

add asd\"asd
ACK [2@0] {add} Invalid unquoted character

So not an issue :-)

@adamcik

This comment has been minimized.

Member

adamcik commented Jan 18, 2014

Having looked in more detail at MPDs behavior shlex would give us the wrong behavior:

OK MPD 0.16.0
asdd
ACK [5@0] {} unknown command "asdd"
add "\"asd"
ACK [50@0] {add} directory or file not found
add "\t asd"
ACK [50@0] {add} directory or file not found
add foo
ACK [50@0] {add} directory or file not found
add foo"bar"add
ACK [2@0] {add} Invalid unquoted character
add "foo"bar
ACK [2@0] {add} Space expected after closing '"'
Jan 18 14:29 : client: [0] opened from 127.0.0.1:48100
Jan 18 14:29 : client: [0] process command "asdd"
Jan 18 14:29 : client: [0] command returned -1
Jan 18 14:29 : client: [0] process command "add "\"asd""
Jan 18 14:29 : database: get song: "asd
Jan 18 14:29 : client: [0] command returned -1
Jan 18 14:29 : client: [0] process command "add "\t asd""
Jan 18 14:29 : database: get song: t asd
Jan 18 14:29 : client: [0] command returned -1
Jan 18 14:30 : client: [0] process command "add foo"
Jan 18 14:30 : database: get song: foo
Jan 18 14:30 : client: [0] command returned -1
Jan 18 14:30 : client: [0] process command "add foo"bar"add"
Jan 18 14:30 : client: [0] command returned -1
Jan 18 14:30 : client: [0] process command "add "foo"bar"
Jan 18 14:30 : client: [0] command returned -1
In [5]: shlex.split('add "\\"asd"', posix=True)
Out[5]: ['add', '"asd']

In [6]: shlex.split('add "\\t asd", posix=True)
Out[6]: ['add', '\\t asd']

In [7]: shlex.split('add foo"bar"add', posix=True)
Out[7]: ['add', 'foobaradd']

In [8]: shlex.split('add "foo"bar', posix=True)
Out[8]: ['add', 'foobar']

In [12]: shlex.split('add "\\t asd"', posix=False)
Out[12]: ['add', '"\\t asd"']

In [13]: shlex.split('add "\\"asd"', posix=False)
Out[13]: ['add', '"\\"', 'asd"']

In [14]: shlex.split('add "\\t asd"', posix=False)
Out[14]: ['add', '"\\t asd"']

In [15]: shlex.split('add foo"bar"add', posix=False)
Out[15]: ['add', 'foo"bar"add']

In [16]: shlex.split('add "foo"bar', posix=False)
Out[16]: ['add', '"foo"', 'bar']
@adamcik

This comment has been minimized.

Member

adamcik commented Jan 18, 2014

So summing up what I've found out, the tokenizer in MPD has four token types we probably care about:

WORD  = [a-zA-Z][a-zA-Z0-9]*
UNQOUTED = ord(char) >= 32 and char != " and char != \
QUOTED = "((?:[^"\\]|\\.)*)" where capture group one needs to be backslash unescaped
PARAM = QUOTED | UNQUOTED

So this bug is really that we are expecting a QUOTED value where MPD accepts a PARAM

@adamcik

This comment has been minimized.

Member

adamcik commented Jan 18, 2014

I've now checked the handling, and WORD is only every used for the command, everything else is a PARAM.

Taking all my rambling on this issue today into mind this gives us the following code, which takes the following into account:

  • Matches the command using the stricter WORD rules.
  • Unquoted strings consist of any number of word unicode word characters except backslash or quotes,
  • Quoted strings may contain anything, backslash escapes the following character.
word_re = re.compile(r'^([a-z][a-z0-9]+)(?:\s+|$)(.*)$')
param_re = re.compile(r'^(?:([^\W"\\]+)|"([^"\\]*(?:\\.[^"\\]*)*)")(?:\s+|$)(.*)$', re.UNICODE)
unescape_re = re.compile(r'\\(.)')

def tokenize(line):
    match = word_re.match(line)
    if not match:
        raise Exception
    command, remainder = match.groups()
    yield command
    while remainder:
        match = param_re.match(remainder)
        if not match:
            raise Exception
        unquoted, quoted, remainder = match.groups()
        yield unquoted or unescape_re.sub(r'\g<1>', quoted)

This will not match MPD 100% for error messages, but I believe the behavior for the tokenizing to be identical with this code. Though a bunch of the error handling could be added, or we could write a tokenizer that loops over the characters following the rules above.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment