Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

mod_translation_updater.py: Parses escape codes in .tr files incorrectly #14414

Open
rubenwardy opened this issue Feb 25, 2024 · 4 comments
Open
Labels
Bug Issues that were confirmed to be a bug @ Startup / Config / Util @ Translation

Comments

@rubenwardy
Copy link
Member

rubenwardy commented Feb 25, 2024

Minetest version

Minetest 5.9.0-dev

Summary

The script uses regex to parse translation lines, which is incorrect. Take the following example:

Something@@= Something@@

The script will ignore this line as it thinks the @ is escaping the =, but it is actually adding @. Another issue: the script ignores invalid lines without warning!

The script also assumes that a translation is only ever on one line. This is not the case, given @\n:

A @n newline = Une @
nouvelle ligne

Steps to reproduce

I've written my own Python .tr file parser with unit tests here. Unit tests should be added to mod_translation_updater.py, it's unacceptable to have a parser without tests.

My implemention throws a lot of information away (ie: comments) so couldn't be taken directly, but inspiration can be taken from the approach (which is based on the .cpp code)

@rubenwardy rubenwardy added Unconfirmed bug Bug report that has not been confirmed to exist/be reproducible Bug Issues that were confirmed to be a bug @ Startup / Config / Util @ Translation and removed Unconfirmed bug Bug report that has not been confirmed to exist/be reproducible labels Feb 25, 2024
@appgurueu
Copy link
Contributor

The current behavior is definitely incorrect. Are you sure that regex is inadequate, as opposed to being inadequately used, though? From a quick glance and the issues you have presented, I currently see no reason why this would need "more than regex" (a context-free or even context-sensitive grammar).

@rubenwardy
Copy link
Member Author

The regex used is naive and incorrect. I think this can be captured by a regular language, but it's better to match the engine parser and do it character by character

@Wuzzy2
Copy link
Contributor

Wuzzy2 commented Feb 27, 2024

In my defense: I never claimed this script was validating, so garbage in, garbage out. I've written my own *.tr validator separately, named mtt_check.py at https://codeberg.org/Wuzzy/Minetest_Translation_Tools (feel free to tear it a new one as well lol). Also, this script indeed does make some assumptions about the input, these are documented in the script's README.

That having said, yes, this @@ problem is 100% a bug. Fun fact: I actually did my own tests (not unit tests tho) before submitting the script, but I didn't include them. I thought I tested all escapes pretty extensively. *shrug*. Ah, maybe I didn't test the end-of-line stuff well enough.

I agree that writing the parser in code rather than regex sounds like a better and less error-prone alternative. Although I’m not entirely convinced a regex-solution would be impossible to solve this problem. Maybe a complex regex could do the trick? Python Regex is powerful. But even then, code is probably more robust.

The script also assumes that a translation is only ever on one line.

Yeah, I forgot about that. To be honest, I really hate that feature of the TR language when @n works just fine. We should nuke it, it makes the parser more complex. (I know we can't nuke it due to compat. :-( ) I never saw a single TR file using this.

But in any case, throwing away information is absolutely not an option; preserving both comment texts and their position is important.

@y5nw
Copy link
Contributor

y5nw commented Feb 27, 2024

That having said, yes, this @@ problem is 100% a bug. Fun fact: I actually did my own tests (not unit tests tho) before submitting the script, but I didn't include them. I thought I tested all escapes pretty extensively. shrug. Ah, maybe I didn't test the end-of-line stuff well enough.

I agree that writing the parser in code rather than regex sounds like a better and less error-prone alternative. Although I’m not entirely convinced a regex-solution would be impossible to solve this problem. Maybe a complex regex could do the trick? Python Regex is powerful. But even then, code is probably more robust.

I had a similar situation last year when writing Lua code to parse quoted C strings (I forgot the exact purpose). The approach I recall coming up with was to match (\+)(.) (i.e. multiple backslashes followed by a character; IIRC there were cases guarding against backslashes at the end of the file so this was irrelevant) and then unescaping that string based on the number of backslashes matched.

I assume a similar approach can be taken here by matching multiple @s with a following character:

  • Even number of @ -> the sequence is unescaped to n/2 repetitions of @ and the following character is kept as-is (e.g. @@@@c is unescaped to @@c)
  • Odd number of @ -> the sequence is unescaped to (n-1)/2 repetitions of @ and the following character is considered part of an escape sequence (e.g. @@@= is unescaped to @=)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug Issues that were confirmed to be a bug @ Startup / Config / Util @ Translation
Projects
None yet
Development

No branches or pull requests

4 participants