Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

If the "Accept" header contains a media-type with a quoted parameter including a space, the resulting MIMEAccept instance is incorrect #1623

Closed
exhuma opened this issue Aug 6, 2019 · 9 comments · Fixed by #2615
Assignees
Milestone

Comments

@exhuma
Copy link

exhuma commented Aug 6, 2019

According to RFC 2045 Section 5.1, mime-type parameters can be quoted. As additional reference, the quoted-string is defined in RFC 7230 Section 3.2.6 (which is the same as in RFC 2616 Section 2.2)

This means that the following media-type is a valid media-type string:

fictitious/media-type; myparam="some value"

Currently, werkzeug sees this as fictitious/media-type; myparam="some, which includes the first double-quote, but drops the rest. Additionally, the parsed MIMEAccept instance also contains an entry for ('value"', 1). In other words, the above media-type results in the following two entries in the MIMEAccept instance:

  • ('fictitious/media-type; myparam="some', 1)
  • ('value"', 1)

Appending q=0.5 results in this:

  • ('fictitious/media-type; myparam="some', 1)
  • ('value"', 0.5)

I would expect it to contain just one entry as ('fictitious/media-type; myparam="some value"', 1)

We currently get one such media-type in an accept header and cannot use the normal werkzeug "accept" mechanism which makes this cumbersome. I just discovered this so I still need to write a workaround. I am considering appending the solution as a PR to this ticket. I already have a compliant parser lying around, which is based on a very simple state-based parser.

@exhuma
Copy link
Author

exhuma commented Aug 6, 2019

For reference, this is the parser I currently use in a different project (it does not yet handle escaped double-quotes though as I'd need a lookbehind):

def argparser(data: str) -> Generator[Tuple[str, str], None, None]:
    """
    Given a string like ``foo=bar; bar=baz; bla="blubb"``, produce a stream of
    key/value pairs.

    This parser is used to parse RFC-2045 (Section 5.1) compliant mimetype
    parameters.

    Example::

        >>> args = 'foo=bar; bar=baz; bla="blubb"'
        >>> for k, v in argparser(args):
        ...     print(k, v)
        foo bar
        bar baz
        bla blubb
    """
    state = 'keyname'
    current_key_name = []
    current_value = []  # type: List[str]
    charstream = iter(data)
    for char in charstream:
        if state == 'keyname':
            if char in ' \t\r\n;':
                continue  # ignore whitespace and other junk
            elif char == '=':
                state = 'value'
                continue
            else:
                current_key_name.append(char)
        elif state == "value":
            if char == ';':
                state = 'keyname'
                key = ''.join(current_key_name)
                value = ''.join(current_value)
                del current_value[::]
                del current_key_name[::]
                yield key, value
                continue
            elif char == '"':
                state = 'quoted_value'
                continue
            current_value.append(char)
        elif state == 'quoted_value':
            if char == '"':
                state = 'keyname'
                key = ''.join(current_key_name)
                value = ''.join(current_value)
                del current_value[::]
                del current_key_name[::]
                yield key, value
                continue
            current_value.append(char)
        else:
            raise ValueError('Unexpected parser state!')

    # return the remaining key/value pair
    if current_key_name and current_value:
        key = ''.join(current_key_name)
        value = ''.join(current_value)
        del current_value[::]
        del current_key_name[::]
        yield key, value

@exhuma
Copy link
Author

exhuma commented Aug 6, 2019

If there are no objections to this code I will start a PR with this implementation.

Note that I have not done any performance tests in comparison with the existing regex implementation! This is the main point where I am reluctant to open a PR prematurely.

@davidism
Copy link
Member

davidism commented Aug 6, 2019

Thanks, but if possible I'd rather fix the regex.

@exhuma
Copy link
Author

exhuma commented Aug 6, 2019

ok. I'll give it a shot

exhuma added a commit to exhuma/werkzeug that referenced this issue Aug 6, 2019
exhuma added a commit to exhuma/werkzeug that referenced this issue Aug 6, 2019
@exhuma
Copy link
Author

exhuma commented Aug 6, 2019

Simply removing the white-space exclusion from one of the patterns was enough. This at least covers the existing unit-tests. I added a quoted value with whitespace and it passes.

I can monkey-patch this at work and see if it makes our test-suite pass for the project I am currently working on.

@exhuma
Copy link
Author

exhuma commented Aug 7, 2019

Okay, the tests don't run yet. There's also a media-type parameter which contains a comma in its value. I'll keep working on it.

@exhuma
Copy link
Author

exhuma commented Aug 8, 2019

I'm having trouble of getting this regex to work. My biggest two problems are that on the one hand I need to know when I am "inside" double quotes while also splitting the list of media-types by comma.

I have for now resorted to a "manual" parsing which is a bit different to the one I wrote above (that one was an old experiment).

I'm afraid my regex-foo is not strong enough to implement this as regex. I am using my parser as workaround for now which I use in stead of request.accept_mimetypes.

If anyone runs into this same problem, my parser is located in a gist

Note that this has some edge-cases still, especially I did not properly test extra junk white-space. Also some operations might be superfluous. But for now that gist works for me and covers all our internal test-cases. Also the functions quote_arg_value and unquote_arg_value are not symmetric which is I don't particularly like. But I will leave it like this for now.

@davidism davidism self-assigned this Mar 6, 2023
@davidism
Copy link
Member

davidism commented Mar 6, 2023

While refactoring parse_options_header in #2614, I realized that for some reason we're using a different parser for Accept, even though it should use the same one. https://httpwg.org/specs/rfc9110.html#field.accept shows that it uses value; key=value; key=value ..., value2; .... This should use the list parser, then use the options parser on each item.

@davidism
Copy link
Member

davidism commented Mar 6, 2023

Looks like the reason for the weird parser was because previous specs allowed for things like text/html;charset=utf-8;q=1;extra=some, where charset was part of the media type, but extra was not. RFC 9110 drops support for extra parameters, and assumes q is last.

@davidism davidism added this to the 2.3.0 milestone Mar 6, 2023
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Mar 21, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants