Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Extract only strings and comments from codeblock #109

Merged
merged 1 commit into from
Nov 9, 2023

Conversation

dalance
Copy link
Contributor

@dalance dalance commented Oct 28, 2023

Fixes #95

This PR adds codeblock parsing support to extract only strings and comments.

@google-cla
Copy link

google-cla bot commented Oct 28, 2023

Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

View this failed invocation of the CLA check for more information.

For the most up to date status, view the checks section at the bottom of the pull request.

i18n-helpers/src/lib.rs Outdated Show resolved Hide resolved
i18n-helpers/src/lib.rs Outdated Show resolved Hide resolved
i18n-helpers/src/lib.rs Outdated Show resolved Hide resolved
i18n-helpers/src/lib.rs Outdated Show resolved Hide resolved
@mgeisler mgeisler requested a review from dyoo October 29, 2023 21:08
@mgeisler
Copy link
Collaborator

Hi @dalance,

This PR adds codeblock parsing support to extract only strings and comments.

Wow, thank you so much! This looks really good from a quick look!

Could you test how https://google.github.io/comprehensive-rust/zh-CN/hello-world/small-example.html looks after this? In particular, does mdbook-i18n-normalize do the right thing here? It should turn this message into 5 individual messages, one for each line comment.

@dyoo has written a lot of this code and I hope he can review this. Also, I think this will collide with #107 and the design he started there to support things like translation comments. So we should make sure we can get everything working nicely together here 🙂

I think we would want an option in book.toml for this? That could certainly be added later down the road if someone asks for it.

@dalance
Copy link
Contributor Author

dalance commented Oct 30, 2023

Hi @mgeisler,
Thank you for your review.
I fixed some codes which are suggested.

The small-example.md becomes like below.
I think it is an expected result.

#: src/hello-world/small-example.md:6
msgid "// Program entry point\n"
msgstr ""

#: src/hello-world/small-example.md:7
msgid "// Mutable variable binding\n"
msgstr ""

#: src/hello-world/small-example.md:8 src/traits/impl-trait.md:15
msgid "\"{x}\""
msgstr ""

#: src/hello-world/small-example.md:8
msgid "// Macro for printing, like printf\n"
msgstr ""

#: src/hello-world/small-example.md:9
msgid "// No parenthesis around expression\n"
msgstr ""

#: src/hello-world/small-example.md:10
msgid "// Math like in other languages\n"
msgstr ""

#: src/hello-world/small-example.md:15
msgid "\" -> {x}\""
msgstr ""

The conflicting point with #107 seems to be the modifitation to Group enum.
It may be better that splitting this PR to changing to Group which owns Event and adding parse_codeblock function?

I think we would want an option in book.toml for this? That could certainly be added later down the road if someone asks for it.

This PR changes messages.pot significantly.
So an option to disable this feature may be useful for existing users.

@dalance
Copy link
Contributor Author

dalance commented Oct 30, 2023

I changed some logics to treat continuous line comments.
This is because the previous code can't handle indented continuous line comments.

@mgeisler
Copy link
Collaborator

Thanks for all the updates!

I think we would want an option in book.toml for this? That could certainly be added later down the road if someone asks for it.

This PR changes messages.pot significantly. So an option to disable this feature may be useful for existing users.

Yeah, I can imagine it will change the files a lot — but for the better!

Please consider adding a line about this new behavior to the documentation. I think it could just be a little paragraph in Marking Sections to be Skipped for Translation which says something like

Note that we don't extract the full text of code blocks. Only text that is recognized as comments and literal strings is extracted.

Something like that, feel free to rephrase it how you like.

The small-example.md becomes like below. I think it is an expected result.

Yes, this looks great! Please also try running

mdbook-i18n-normalize po/zh-CN.po po/zh-CN-normalized.po

and verify that the existing translations are correctly paired up with the new messages. The normalization tool is what is supposed to let us change how we extract messages while also keeping existing translations intact. If I did things correctly, it should "just work", but it'll be nice to have confirmation from you 🙂

@dalance
Copy link
Contributor Author

dalance commented Oct 30, 2023

The result of mdbook-i18n-normalize is below.
It works fine too.

#: src/hello-world/small-example.md:6
msgid "// Program entry point\n"
msgstr "// 程序入口\n"

#: src/hello-world/small-example.md:7
msgid "// Mutable variable binding\n"
msgstr "// 可变变量绑定\n"

#: src/hello-world/small-example.md:8
#: src/traits/impl-trait.md:15
msgid "\"{x}\""
msgstr "\"{x}\""

#: src/hello-world/small-example.md:8
msgid "// Macro for printing, like printf\n"
msgstr "// 与 printf 类似的输出宏\n"

#: src/hello-world/small-example.md:9
msgid "// No parenthesis around expression\n"
msgstr "// 表达式周围没有括号\n"

#: src/hello-world/small-example.md:10
msgid "// Math like in other languages\n"
msgstr "// 与其他语言类似的数值计算\n"

#: src/hello-world/small-example.md:15
msgid "\" -> {x}\""
msgstr "\" -> {x}\""

@dalance dalance force-pushed the parse_codeblock branch 2 times, most recently from c7b861f to e19b096 Compare October 30, 2023 10:22
i18n-helpers/src/lib.rs Outdated Show resolved Hide resolved
i18n-helpers/src/lib.rs Outdated Show resolved Hide resolved
i18n-helpers/src/lib.rs Outdated Show resolved Hide resolved
i18n-helpers/src/lib.rs Outdated Show resolved Hide resolved
@mgeisler
Copy link
Collaborator

@henrif75, when this is merged, we will do a 0.3 release. This is a breaking change: we need to run mdbook-i18n-normalize on our translations to keep any translated comments intact (as tested by @dalance in #109 (comment)). I'll make sure to explain this again in the release notes.

@dalance
Copy link
Contributor Author

dalance commented Oct 31, 2023

I found a normalization issue.
The result of re-executed nomalization to po/zh-CN-normalized.po is here.

#: src/hello-world/small-example.md:6
msgid "// Program entry point"
msgstr "// 程序入口"

The trailing linebreak is missing.
This is because msgid/msgstr don't have a context which is in codeblock or not.
I can add a workaround like below, but hand-written detection is required because language is not clear.

pub fn extract_messages(document: &str) -> Vec<(usize, String)> {
    if document.starts_with("//") && document.ends_with("\n") {
        return vec![(1, document.to_string())];
    }

How do you think the workaround?
Or any information which shows msgid/msgstr is originated from codeblock should be added to po file?

@dalance
Copy link
Contributor Author

dalance commented Oct 31, 2023

I fixed the reviewed issues.
Additionally, I found my mistake about the usage of syntect stack through fuzz test, and fixed it too.

@henrif75
Copy link
Contributor

henrif75 commented Nov 1, 2023

@henrif75, when this is merged, we will do a 0.3 release. This is a breaking change: we need to run mdbook-i18n-normalize on our translations to keep any translated comments intact (as tested by @dalance in #109 (comment)). I'll make sure to explain this again in the release notes.

Sounds good. I assume it's going to break ongoing PRs?

i18n-helpers/src/lib.rs Outdated Show resolved Hide resolved
i18n-helpers/src/lib.rs Show resolved Hide resolved
i18n-helpers/src/lib.rs Outdated Show resolved Hide resolved
i18n-helpers/src/lib.rs Outdated Show resolved Hide resolved
i18n-helpers/src/lib.rs Outdated Show resolved Hide resolved
i18n-helpers/src/lib.rs Outdated Show resolved Hide resolved
i18n-helpers/src/lib.rs Outdated Show resolved Hide resolved
@dyoo
Copy link
Collaborator

dyoo commented Nov 1, 2023

I am so sorry for late review comments; I thought I pressed Submit, and it turns out that I did not. :(

@dalance dalance force-pushed the parse_codeblock branch 2 times, most recently from fb769d7 to cd1faa1 Compare November 2, 2023 00:53
@dalance
Copy link
Contributor Author

dalance commented Nov 2, 2023

I refrected the review, introduced Flags struct to resolve the normarilzation issue, and rebased to the latest main.

@dalance
Copy link
Contributor Author

dalance commented Nov 2, 2023

My solution seems to be broken...
msginit can't handle flags in messages.pot.
Another solution may be required.

@dalance
Copy link
Contributor Author

dalance commented Nov 2, 2023

Built-in flags seems to be passed through msginit.
So using c-format flag instead of codeblock flag may resolve the issue.
But it may be confusable.

@mgeisler
Copy link
Collaborator

mgeisler commented Nov 2, 2023

I refrected the review, introduced Flags struct to resolve the normarilzation issue, and rebased to the latest main.

If this is only for normalization, then please don't add extra code to support it — normalization is something translators (or @henrif75) will run only once after a new breaking release. I don't think the main code should be expanded to support it.

In this case, I believe the problem are some newlines which are incorrect after running normalize? As long as the normal flow works, then these messages will be corrected by msgmerge and that's good enough.

@dyoo, are you happy with the logic here? Please finalize the review so we can merge it.

@dalance
Copy link
Contributor Author

dalance commented Nov 2, 2023

If there is no need to resolve the normalization issue, I'll remove the logic.
I think migration from v0.2 to v0.3 will works fine without it.

@dalance dalance force-pushed the parse_codeblock branch 2 times, most recently from c821d01 to a48720a Compare November 2, 2023 08:51
@dalance
Copy link
Contributor Author

dalance commented Nov 2, 2023

I removed the logic related to Flags.

@mgeisler
Copy link
Collaborator

mgeisler commented Nov 2, 2023

If there is no need to resolve the normalization issue, I'll remove the logic.

Thanks @dalance!

@henrif75, when this is merged, we will do a 0.3 release. This is a breaking change: we need to run mdbook-i18n-normalize on our translations to keep any translated comments intact (as tested by @dalance in #109 (comment)). I'll make sure to explain this again in the release notes.

Sounds good. I assume it's going to break ongoing PRs?

Hey @henrif75, yes, it will affect PRs that touches code blocks. Most PRs don't, I believe so the impact should be less than when the tooling learnt to ignore most Markdown formatting in version 0.2.

Note that there will only be conflicts if we run the normalization tool on a translation — until we do, the translation will work just fine in all places except for the code blocks.

So I suggest that we land the current in-flight PRs, run the normalization tool, and then people can continue translating like normal (but with much smaller PO files: a quick test shows that we go from 19k to 16k lines!).

i18n-helpers/src/lib.rs Outdated Show resolved Hide resolved
i18n-helpers/src/lib.rs Outdated Show resolved Hide resolved
i18n-helpers/src/lib.rs Outdated Show resolved Hide resolved
i18n-helpers/src/lib.rs Outdated Show resolved Hide resolved
i18n-helpers/src/lib.rs Show resolved Hide resolved
@dalance
Copy link
Contributor Author

dalance commented Nov 6, 2023

I restored heuristic logic for codeblock which doesn't have lang-specifier, and refactored stack operation logic.

Copy link
Collaborator

@dyoo dyoo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the changes; looks good to me.

@mgeisler
Copy link
Collaborator

mgeisler commented Nov 8, 2023

Amazing, thanks @dyoo for the review and thanks @dalance for all the work here! I'm very excited to merge and release this to the world 😄

@mgeisler
Copy link
Collaborator

mgeisler commented Nov 8, 2023

Sorry for the tiny merge conflict in lib.rs, @dalance 🙂 We should be able to merge this right after you resolve it.

@dalance
Copy link
Contributor Author

dalance commented Nov 9, 2023

I resolved the conflict.

@mgeisler mgeisler merged commit 9e894bc into google:main Nov 9, 2023
6 checks passed
@mgeisler mgeisler mentioned this pull request Nov 9, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Extract only strings and comments from code blocks
4 participants