Fix MediaWiki reader on internalLink for CJK languages #8525
Conversation
I checked the Wikipedia rendering rules for English. For example, on https://en.wikipedia.org/wiki/100-year_flood:

[[100-year flood]]s

is equivalent to

[[100-year flood|100-year floods]]

which would be rendered as a single link with the text "100-year floods". This is special treatment in Wikipedia's wikitext parser, which applies to the English (and other Latin-script) wikipedias, but not to CJK-language ones. We probably should not apply this treatment if pandoc is not aware of the input language.
Here's the link to their syntax description for "blending" links: https://en.wikipedia.org/wiki/Help:Wikitext#Blend_link Note the exception, which we currently don't implement.
But I did some experiments in Wikipedia's sandbox, and their actual implementation of blending seems to be much simpler. Only ASCII letters are taken to be part of the "link trail": the blended content stops as soon as you hit a punctuation mark or a non-ASCII character, even an accented Latin character. I think we should try to match this behavior, and it will handle your case well.
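A minimal sketch of the ASCII-only rule described above (the function name is hypothetical, not pandoc's actual implementation): the trail consumes ASCII letters only and stops at punctuation, whitespace, or any non-ASCII character.

```haskell
import Data.Char (isAscii, isLetter)

-- Hypothetical helper: split the text following a [[...]] link into
-- the blended "link trail" (ASCII letters only) and the remainder.
linkTrail :: String -> (String, String)
linkTrail = span (\c -> isAscii c && isLetter c)
```

So `linkTrail "s are common"` yields `("s", " are common")`, while an accented or CJK character immediately after the link stops the trail, leaving it empty.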
Yes, it looks like an appropriate solution.
Hi @jgm, please help review my code changes.
Looks great!
For the record, the MediaWiki documentation indicates that the definition of which content after a wikilink is considered part of it is indeed language-specific. What this means, in practice, is that the specification is encoded in each language's configuration. I'm not 100% sure, but the issue this PR raised (that CJK languages don't separate words with whitespace) may be addressed in MediaWiki's segmentByWord() method, which, according to the docs, is (re)implemented in the zh_hans, yue, and ja configuration files, in addition to any language-specific link-trail definition.
The rules for "blending" characters outside a link into the link are described here: https://en.wikipedia.org/wiki/Help:Wikitext#Blend_link These pose a problem for CJK languages, which generally don't have spaces after links. However, it turns out that the blending behavior, as implemented on Wikipedia, is (contrary to the documentation) only for ASCII letters. This commit implements that restriction, which fixes the problem for CJK. (jgm#8525)
This issue has just come to my attention. While it is laudable that this fixed an issue parsing documents from the Chinese Wikipedia, it seems to have also broken tons of other languages, as partially pointed out by @waldyrious. That is, this likely fixed links for languages using non-segmented scripts, such as Chinese, Japanese, and Thai, while simultaneously breaking similar cases in every other language (except English, which was specifically tested and given an exception).

While on the English Wikipedia A through Z is the acceptable "link trail", on the Spanish Wikipedia this also includes the letters of the Spanish alphabet that can carry diacritics, namely á, é, í, ó, ú, and ñ. Similar cases exist on the Italian, French, Portuguese, and German Wikipedias. This is even more egregious in languages using a segmented script that isn't Latin, such as on the Arabic, Russian, Hebrew, Hindi, Tamil, and Urdu Wikipedias.

In short, while it seems to me that the "proper" solution would be to have rules for each sub-dialect of wikitext, a much simpler and likely sufficient solution, which would fix nearly all cases, would be to allow this blended-link behavior by default for all trailing characters, unless the trailing characters belong to one of the unsegmented scripts (Chinese characters, Thai, Lao, etc.). This can easily be tested by checking the script of the following characters (each character in Unicode belongs to a script, and scripts are defined by ISO 15924). While I'm not sure off-hand how to access this information in Haskell, if one can do this easily via some package, this would likely be the easiest and most inclusive solution (with the caveat that more scripts should be added to this list of unsegmented scripts, such as Khmer and Balinese, for example).

I hope this is helpful and that we can see this solved. Thank you!
@waldyrious @lwolfsonkin thanks for bringing this to my attention. Could you open a new issue about this, so we don't lose track? |
Previously we only included ASCII letters. That is correct for English but not for, e.g., Spanish (see comment in #8525). A safer approach is to include all letters except those in the CJK unified ideograph ranges.
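A sketch of that broader predicate, assuming the CJK unified ideograph block boundaries below (the exact ranges pandoc excludes may differ): any Unicode letter is accepted as trail material unless it falls in one of these blocks.

```haskell
import Data.Char (isLetter, ord)

-- Accept any letter as link-trail material unless it is a CJK
-- unified ideograph (block boundaries below are assumptions).
isLinkTrailChar :: Char -> Bool
isLinkTrailChar c = isLetter c && not (inCJKRange (ord c))
  where
    inCJKRange n =
         (n >= 0x4E00  && n <= 0x9FFF)  -- CJK Unified Ideographs
      || (n >= 0x3400  && n <= 0x4DBF)  -- Extension A
      || (n >= 0x20000 && n <= 0x2A6DF) -- Extension B
```

This accepts Spanish á or ñ, Cyrillic, Arabic, and so on, while still stopping at Chinese characters such as 式.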
I've added a commit along the lines of @lwolfsonkin's suggestion. No need to open a new issue if this is adequate.
This PR fixes wrong link text for Chinese and other CJK languages, whose words are not separated by blank spaces. Could anyone share the purpose of the linktrail variable? For example:
是[[台灣|台式]]的傳統[[月餅]]
would be parsed as below, which is obviously unreasonable.
After this fix, the JSON output would be as in the commits below.
Original data: chinapedia/mediawiki-to-gfm@40b17e7
JSON output with fix: chinapedia/mediawiki-to-gfm@63add8c