Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Added UTF8 tests #146

Open
wants to merge 1 commit into
base: master
Choose a base branch
from
Open

Added UTF8 tests #146

wants to merge 1 commit into from

Conversation

Krinkle
Copy link

@Krinkle Krinkle commented Jun 2, 2023

Upstreaming additional test cases written by @jbboehr. These were helpful in developing https://github.com/jbboehr/libmustache.

Original location: https://github.com/jbboehr/mustache-spec/commits/b96be9fd4c6d6984828d93169fe7e86d8a8aec2f

Copy link
Member

@jgonggrijp jgonggrijp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for contributing @Krinkle!

At first sight, these tests seem totally reasonable. Of course, a Mustache implementation (or any software, for that matter) should ideally support UTF-8 encoded text.

The more I think about it, however, the more I believe that these tests lack a clear rationale. Why exactly do these tests need to exist? What does it mean if an implementation does not pass them?

If an implementation fails, does that actually say anything about the implementation, or perhaps rather about the library that is responsible for reading the file from disk and/or for parsing the YAML/JSON?

Putting it differently: suppose that the standard library for my programming language is capable of reading UTF-8 encoded text from a file and loading it into a standard text datastructure for the programming language (that will be string, String or str in most cases). Furthermore, suppose that my Mustache implementation reads templates from that standard datastructure and that my implementation is already known to support ASCII-encoded templates. Is it even possible for my implementation to fail in that case? After all, UTF-8 is a superset of ASCII. I don't think the octet 0x7b can mean anything other than { in UTF-8. Likewise for } and all the sigils.

Has libmustache ever failed these tests, @jbboehr? If it did, what did you need to do in order to pass them?

@agentgt
Copy link

agentgt commented Nov 5, 2023

Is it even possible for my implementation to fail in that case? After all, UTF-8 is a superset of ASCII. I don't think the octet 0x7b can mean anything other than { in UTF-8. Likewise for } and all the sigils.

I don't think it needs to be in the spec or have tests but I believe this is possible in some languages. I'm not going to reference where I discuss this as I got rather embarrassingly unhinged but apparently if you write say a PHP lambda code in latin1 (what encoding the file is saved in is what the literals become) and then take a section that is UTF-8 because the template is and return a new template and there is a chance of corruption.

While the probability of that happening in 2023 is pretty darn low back in the 2000s when I started my career it happened often (albeit usually with XML...).

Also if an implementation naively assumes ASCII and parses as UTF-8 template you will get corruption but that case is pretty obvious (you can concat ASCII bytes to UTF-8 but you cannot split UTF-8 strings as though they were ASCII).

That is you cannot just naively parse byte by byte looking for the { (0x7b) as the second bytes in a multi-byte character very well could be 0x7b. (I think. I'm fairly sure that UTF-8 does not say multi-bytes after the first have to be 127 but maybe I'm mistaken).

I guess it is not possible. I thought there were fringe cases with BOM so yes I guess your safe parsing byte by byte.

However if someone sets a delimiter above 127 then there will be problems. Maybe that is something that could be specified. Delimiters must be below 127. But again probably not worth it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants