
Fix MediaWiki reader on internalLink for JCK languages #8525

Merged 2 commits into jgm:main on Jan 13, 2023
Conversation

@liruqi (Contributor) commented on Jan 5, 2023

This PR fixes the wrong link text produced for Chinese and other CJK languages, whose words are not separated by spaces. Could anyone explain the purpose of the linktrail variable?

For example:

是[[台灣|台式]]的傳統[[月餅]]

would be parsed as below, which is obviously unreasonable:

        {
          "t": "Str",
          "c": "是"
        },
        {
          "t": "Link",
          "c": [
            [
              "",
              [],
              []
            ],
            [
              {
                "t": "Str",
                "c": "台式的傳統"
              }
            ],
            [
              "台灣",
              "wikilink"
            ]
          ]
        },
        {
          "t": "Link",
          "c": [
            [
              "",
              [],
              []
            ],
            [
              {
                "t": "Str",
                "c": "月餅"
              }
            ],
            [
              "月餅",
              "wikilink"
            ]
          ]
        },

After this fix, the JSON output would be:

        {
          "t": "Str",
          "c": "是"
        },
        {
          "t": "Link",
          "c": [
            [
              "",
              [],
              []
            ],
            [
              {
                "t": "Str",
                "c": "台式"
              }
            ],
            [
              "台灣",
              "wikilink"
            ]
          ]
        },
        {
          "t": "Str",
          "c": "的傳統"
        },
        {
          "t": "Link",
          "c": [
            [
              "",
              [],
              []
            ],
            [
              {
                "t": "Str",
                "c": "月餅"
              }
            ],
            [
              "月餅",
              "wikilink"
            ]
          ]
        },

Original data:
chinapedia/mediawiki-to-gfm@40b17e7

Json output with fix:
chinapedia/mediawiki-to-gfm@63add8c

@liruqi (Contributor, Author) commented on Jan 5, 2023

I checked Wikipedia's rendering rules for English; see, for example, https://en.wikipedia.org/wiki/100-year_flood

[[drainage basin]]s

is equivalent to

[[drainage basins|drainage basin]]

which would be rendered to

<a href="/wiki/Drainage_basin" title="Drainage basin">drainage basins</a>

This is a special treatment in Wikipedia's wikitext parser, which applies to the English (and other Latin-script) Wikipedias, but not to CJK languages. We probably should not apply this treatment when pandoc is not aware of the input language.
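To illustrate the behavior being discussed (this is a hedged Python sketch, not pandoc's actual Haskell code), English-style blending can be modeled as moving a run of trailing lowercase letters from the surrounding text into the link text:

```python
import re

# Sketch of English-style wikilink "blending": letters immediately
# following a wikilink are absorbed into the link text.
# For English, the link trail is roughly a run of lowercase a-z.
LINK_TRAIL = re.compile(r"^[a-z]+")

def blend(link_text: str, following: str) -> tuple[str, str]:
    """Return (blended link text, remaining surrounding text)."""
    m = LINK_TRAIL.match(following)
    if not m:
        return link_text, following
    trail = m.group(0)
    return link_text + trail, following[len(trail):]

# [[drainage basin]]s -> link text "drainage basins"
print(blend("drainage basin", "s of the river"))
# CJK text has no a-z trail, so nothing is blended:
print(blend("台灣", "的傳統"))
```

Under such a rule, CJK text after a link would never blend only if the trail pattern excludes CJK characters, which is exactly the question this PR raises.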

@liruqi liruqi changed the title Fix MediaWiki reader on internalLink Fix MediaWiki reader on internalLink for JCK languages Jan 6, 2023
@jgm (Owner) commented on Jan 13, 2023

Here's the link to their syntax description for "blending" links:

https://en.wikipedia.org/wiki/Help:Wikitext#Blend_link

Note the exception, which we currently don't implement:

Exception: a trailing apostrophe (') and any characters following the apostrophe are not blended.

@jgm (Owner) commented on Jan 13, 2023

But I did some experiments in Wikipedia's sandbox, and their actual implementation of blending seems to be much simpler: only ASCII letters are taken to be part of the "link trail." The blended content stops as soon as it hits a punctuation mark or a non-ASCII character, even an accented Latin character.

I think we should try to match this behavior -- and it will handle your case well.
Do you want to update your PR?
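The ASCII-letters-only behavior described above can be sketched as follows (illustrative Python, not the actual Haskell change in the PR):

```python
def link_trail_ascii(following: str) -> str:
    """Collect the ASCII-letter run that would blend into a preceding link.

    Stops at the first punctuation mark or non-ASCII character,
    even an accented Latin letter.
    """
    trail = []
    for ch in following:
        if ch.isascii() and ch.isalpha():
            trail.append(ch)
        else:
            break
    return "".join(trail)

print(link_trail_ascii("s, then"))  # plain ASCII letter blends
print(link_trail_ascii("és"))       # accented é stops blending immediately
print(link_trail_ascii("的傳統"))    # CJK stops blending immediately
```

Because every CJK character is non-ASCII, this rule makes the link trail empty for CJK text, which is why it handles the case in this PR.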

@liruqi (Contributor, Author) commented on Jan 13, 2023

Yes, that looks like an appropriate solution.

@liruqi (Contributor, Author) commented on Jan 13, 2023

Hi @jgm, please review my code changes.

@jgm (Owner) commented on Jan 13, 2023

Looks great!

@jgm jgm merged commit fdfa9fc into jgm:main Jan 13, 2023
@waldyrious (Contributor) commented:

> I did some experiments on Wikipedia's sandbox, and their actual implementation of blending seems to be much simpler. Only ASCII letters are taken to be part of the "link trail." The blended content stops as soon as you get a punctuation mark or a non-ASCII character, even an accented latin character.

For the record, the MediaWiki documentation indicates that the definition of which content after a wikilink is considered part of it is indeed language-specific. In practice, the specification is encoded in each language's MessagesXx.php file, in a variable called $linkTrail. For example, here's the English configuration file (which should match the behavior of the English Wikipedia Sandbox), and here's the Chinese configuration file (which should match the Chinese Wikipedia Sandbox).

I'm not 100% sure, but the issue this PR raised (that CJK languages don't separate words with whitespace) may be addressed in MediaWiki's segmentByWord() method —which, according to the docs, is (re)implemented in the zh_hans, yue, and ja configuration files— in addition to any language-specific $linkTrail configuration.
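The per-language $linkTrail mechanism described above can be sketched like this (the regexes below are my approximations of the MessagesXx.php values, not copies of them; the "zh" entry in particular is an assumption that Chinese disables trail blending):

```python
import re

# Approximate per-language $linkTrail patterns, in the spirit of
# MediaWiki's MessagesXx.php files (values here are illustrative):
LINK_TRAILS = {
    "en": re.compile(r"^([a-z]+)(.*)$", re.S),
    "es": re.compile(r"^([a-záéíóúñ]+)(.*)$", re.S),  # assumed Spanish set
    "zh": None,  # assumed: no trail blending for Chinese
}

def split_trail(lang: str, following: str) -> tuple[str, str]:
    """Split text after a wikilink into (blended trail, remainder)."""
    pat = LINK_TRAILS.get(lang)
    if pat is None:
        return "", following
    m = pat.match(following)
    return (m.group(1), m.group(2)) if m else ("", following)

print(split_trail("en", "s remaining"))
print(split_trail("es", "es restantes"))
print(split_trail("zh", "的傳統"))
```

This is the "rules for each sub-dialect of Wikitext" approach discussed below; its cost is maintaining a table of per-language patterns.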

liruqi added a commit to chinapedia/pandoc that referenced this pull request Mar 3, 2023
The rules for "blending" characters outside a link into the link are
described here: https://en.wikipedia.org/wiki/Help:Wikitext#Blend_link

These pose a problem for CJK languages, which generally don't have
spaces after links.

However, it turns out that the blending behavior, as implemented on
Wikipedia, is (contrary to the documentation) only for ASCII letters.
This commit implements that restriction, which fixes the problem for
CJK.  (jgm#8525)
@lwolfsonkin (Contributor) commented:

This issue has just come to my attention. While it is laudable that this fixed an issue parsing documents from Chinese Wikipedia, it seems to have also broken tons of other languages, as partially pointed out by @waldyrious. That is, this likely fixed links for languages using non-segmented scripts, such as Chinese, Japanese, and Thai, while simultaneously breaking similar cases in every other language (except English, which was specifically tested and exempted).

While on English Wikipedia, a through z is the acceptable "link trail", on Spanish Wikipedia this also includes the letters of the Spanish alphabet that can carry diacritics, namely á, é, í, ó, ú, and ñ. Similar cases exist on the Italian, French, Portuguese, and German Wikipedias. This is even more egregious in languages using a segmented script that isn't Latin, such as on the Arabic, Russian, Hebrew, Hindi, Tamil, and Urdu Wikipedias.

In short, while it seems to me that the "proper" solution would be to have rules for each sub-dialect of Wikitext, a much simpler and likely sufficient solution, which would fix nearly all cases, would be to allow this blended-link behavior for all trailing characters by default, unless the trailing characters belong to one of the unsegmented scripts (Chinese characters, Thai, Lao, etc.). This can be easily tested by checking the script of the following characters (each character in Unicode belongs to a script, and scripts are defined by ISO 15924). While I'm not sure off-hand how to access this information in Haskell, if one can do this easily via some package, this would likely be the easiest and most inclusive solution (with the caveat that more scripts should be added to the list of unsegmented scripts, such as Khmer and Balinese, for example).

I hope this is helpful and hope we can see this solved. Thank you!

Pinging @liruqi and also @jgm.

@jgm (Owner) commented on Jul 5, 2023

@waldyrious @lwolfsonkin thanks for bringing this to my attention. Could you open a new issue about this, so we don't lose track?

jgm added a commit that referenced this pull request Jul 6, 2023
Previously we only included ASCII letters. That is correct for
English but not for, e.g., Spanish (see comment in #8525).
A safer approach is to include all letters except those in the
CJK unified ideograph ranges.
@jgm (Owner) commented on Jul 6, 2023

I've added a commit along the lines of @lwolfsonkin's suggestion. No need to open a new issue if this is adequate.
