Parse Markdown in mdbook-xgettext #449
Conversation
This looks like a great start, thanks for working on it!
Force-pushed from 8071fb4 to 5e730f4.
A bit of experimentation with the existing translations shows no difference for some of them, especially de and da. I suspect that these are just out of date? I do see differences for ko and pt-BR. My thinking is that I will run a helper to update the existing PO files, but I'm not sure what the other options are. What do you think?
Yes, the German and Danish translations are very incomplete right now. The only full translations are Korean and Portuguese.
Yeah, if the translations are out of date, there's not much we can do about it. However, I was not thinking that you would […]

The effect of this would be a […]. However, this only works when the translated paragraphs have the same Markdown structure as the original text. @jooyunghan indicated that this is not always the case for Korean. @jiyongp, @rastringer, @hugojacob, and @ronaldfw, what do you think about this? See also the discussion in #318.
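To make "same Markdown structure" concrete, here is a minimal sketch, assuming the pulldown-cmark crate with its 0.9-style `Event::End(Tag)` API; it compares the sequence of elements the two texts open and close. This is only an illustration, not code from this PR.

```rust
// Illustrative sketch only (not part of this PR): compare the Markdown
// structure of an original paragraph and its translation by comparing the
// sequence of start/end tags emitted by pulldown-cmark (0.9-style API).
use pulldown_cmark::{Event, Parser};

fn structure(text: &str) -> Vec<String> {
    Parser::new(text)
        .filter_map(|event| match event {
            // Record which blocks and spans open and close, ignoring the
            // text itself (which is what differs between languages).
            Event::Start(tag) => Some(format!("start {tag:?}")),
            Event::End(tag) => Some(format!("end {tag:?}")),
            _ => None,
        })
        .collect()
}

/// Returns true when the translation opens and closes the same Markdown
/// elements, in the same order, as the original.
fn same_structure(original: &str, translation: &str) -> bool {
    structure(original) == structure(translation)
}
```

Comparing the full `Debug` output of each tag is stricter than strictly necessary (it also compares link destinations, for example), but it conveys the point: if the shapes differ, an automatic update cannot line the old and new messages up.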
There are a lot of complications and edge cases in trying to follow this clever approach: […]
I've pushed a commit with my helper in it, after having manually edited […]
Here's what's happening:
The result is that lots of things aren't translated anymore, for various fiddly reasons suggested above. Surprisingly, some things are newly translated, for example:

```
18,21c15
< * Experience with Java, Go, Python, JavaScript...: You get the same memory safety
< as in those languages, plus a similar high-level language feeling. In addition
< you get fast and predictable performance like C and C++ (no garbage collector)
< as well as access to low-level hardware (should you need it)
---
> * Java, Go, Python, JaveScript: 이 언어들과 동일한 메모리 안정성과 함께, '하이레벨'언어의 느낌을 느낄 수 있습니다. 거기에 더해, 가비지 컬렉터가 없는 C/C++와 유사한 수준의 빠르고 예측 가능한 성능을 기대할 수 있습니다. 그리고 필요한 경우 저수준 하드웨어를 다루는 코드로 작성할 수 있습니다.
```

I feel like this is a never-ending struggle, and maybe what I've got is good enough? What do you think?
(And I see there are conflicts in […].)
Right! I was conceptually thinking that you would do this on a per-file basis. So for a given […]
Yes, that's a good point... this should not turn into a marathon project where everything must be perfect.

When I look at the differences between the […] Here we have a code block which was split before and is now joined into a single message. The only other big change is the bullet points in the lists: […] Here individual bullet points have been turned into one big message. If they were split into individual messages, then the output would be just like before (and thus safe for the existing translations).

I would suggest implementing this: splitting lists into individual bullet points. That transformation alone sounds like something that can be safely executed on all […]. With that, we should have a drop-in replacement: the code blocks will lose their translation, but we don't have a lot of those strings (I looked for relevant strings in […]). Does that approach sound doable?
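As a rough illustration of that proposal (the function name and the line-based heuristic are assumptions, not code from this PR), splitting one extracted list message back into per-bullet messages could look something like this:

```rust
// Sketch of the proposed transformation: take the text of one extracted
// list message and emit one message per top-level bullet, so that the
// existing per-bullet translations keep matching. Indented continuation
// lines stay attached to their bullet. Hypothetical helper, not this PR.
fn split_list_message(list: &str) -> Vec<String> {
    let mut items: Vec<String> = Vec::new();
    for line in list.lines() {
        // A new top-level bullet starts at column 0 with "* " or "- ".
        if line.starts_with("* ") || line.starts_with("- ") || items.is_empty() {
            items.push(line.to_string());
        } else {
            // Otherwise treat the line as a continuation of the last bullet.
            let current = items.last_mut().expect("checked is_empty above");
            current.push('\n');
            current.push_str(line);
        }
    }
    items
}
```

Applied to the list shown in the diff above, each bullet plus its continuation lines becomes its own message, which, per the comment above, is what the existing PO files already contain.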
Last week, I was thinking this was impossible due to the deduplication, but the `#:` comments do, indeed, contain enough data to do this. And in fact, fixing this results in a surprisingly good result, according to the reproduction steps above. So, I think I will update the other translations as I've done for ko, and then leave it at that.

There are a few reasons not to want to delve into the per-list-item parse: […]
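For context on the `#:` comments mentioned above: gettext reference comments record where each message was extracted from, e.g. `#: src/hello-world.md:12`. A hedged sketch of turning them back into (file, line) pairs, which is the data a per-file helper needs; the function and the example path are assumptions for illustration, not the helper from this PR.

```rust
// Hypothetical sketch, not the helper from this PR: parse a gettext "#:"
// reference comment such as "#: src/hello-world.md:12 src/why-rust.md:3"
// into (path, line) pairs, so messages can be grouped per source file.
fn parse_references(comment: &str) -> Vec<(String, usize)> {
    comment
        .trim_start_matches("#:")
        .split_whitespace()
        .filter_map(|reference| {
            let (path, line) = reference.rsplit_once(':')?;
            Some((path.to_string(), line.parse().ok()?))
        })
        .collect()
}
```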
OK, I've finished all four languages. There are still some discrepancies, but only a handful.
Wow, I think this looks great!
Please take out the commit with the helper — unless it's somehow useful in the future when we do further transformations?
I agree with you that we should get this merged so that we can start streamlining the PO files to the new parsing.
This upgrades from just splitting Markdown files on double-newlines, to using a Markdown parser to break them into more appropriate chunks. The upshot is that code samples are all in one message, lists are bundled together, and generally it should be easier to translate.

* Parse Markdown to support translation
* [WIP] helper to update po files for new translation
* process synthetic input file-by-file
* review comments
* remove temporary code
* fix msgfmt lints
This uses a full Markdown parser to parse the book contents into messages for translation.
The existing functionality broke messages on double-newlines (`\n\n`), otherwise returning the text exactly as found in the original file. The updated approach still takes string slices from the original file, but uses the Markdown parser to better identify the breaks between messages.

Status: work in progress
Fixes #318.
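For readers who want a feel for the approach, here is a minimal sketch, assuming the pulldown-cmark crate with its 0.9-style `Event::End(Tag)` API; it shows how a parser's offset iterator can slice block-level messages out of the original text instead of splitting on `\n\n`. It illustrates the idea only and is not the actual mdbook-xgettext code.

```rust
// Minimal sketch of the idea (not the actual mdbook-xgettext code):
// walk the Markdown events with byte offsets and emit one message per
// top-level block, slicing the text straight out of the original file.
use pulldown_cmark::{Event, Parser, Tag};

fn extract_messages(document: &str) -> Vec<&str> {
    let mut messages = Vec::new();
    let mut depth = 0usize; // nesting level of the blocks we track
    let mut start = 0usize; // byte offset where the current block began

    for (event, range) in Parser::new(document).into_offset_iter() {
        match event {
            Event::Start(
                Tag::Paragraph | Tag::Heading(..) | Tag::CodeBlock(_) | Tag::List(_),
            ) => {
                if depth == 0 {
                    start = range.start;
                }
                depth += 1;
            }
            Event::End(
                Tag::Paragraph | Tag::Heading(..) | Tag::CodeBlock(_) | Tag::List(_),
            ) => {
                depth -= 1;
                if depth == 0 {
                    // A whole top-level block finished: keep the original slice.
                    messages.push(document[start..range.end].trim());
                }
            }
            _ => {}
        }
    }
    messages
}
```

With this shape, a fenced code block or a whole bulleted list arrives as a single message, which matches the behaviour described above: code samples in one message, lists bundled together.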