<nowiki> appearing where it should not #259

Open
desb42 opened this issue Oct 27, 2018 · 11 comments

Comments

@desb42
Collaborator

desb42 commented Oct 27, 2018

The following wikitext

{{markup|title=Using only footnote-style references
|<nowiki>Lorem ipsum.<ref>Source name, access date, etc.</ref>

Lorem ipsum dolor sit amet.<ref>Source name, access date, etc.</ref>

==References==
{{Reflist}}</nowiki>
|Lorem ipsum.<ref>Source name, access date, etc.</ref>
Lorem ipsum dolor sit amet.<ref>Source name, access date, etc.</ref>

{{fake heading|sub=3|References}}
{{Reflist}}
}}

In the Wikipedia sandbox, this produces:
[image: nowiki_true]

XOWA produces:
[image: nowiki_bad]

Ignoring the red error, note the literal text '<nowiki>' and '</nowiki>' in the left-hand column.

The handling of <nowiki> does not seem quite correct.

@gnosygnu
Owner

Ugh... This sounds like a parser issue. Is there an existing page where you're seeing this error? (I just want to get an idea of how widespread this issue is)

As for the actual fix, I'll have to look at the nowiki implementation. This is a particularly complicated piece of code that I wrote early in the XOWA parser implementation. It's possible that either my impersonation wasn't good enough, or MediaWiki changed something recently.

I'll look again at the code later this week, but depending on how widespread the above is, this may be my highest priority.

Thanks!

@desb42
Collaborator Author

desb42 commented Oct 28, 2018

I stumbled across it while looking at Template:Reflist/doc, the documentation for Template:Reflist.
I cannot tell how widespread it is, but I suspect it is an edge case.

@desb42
Collaborator Author

desb42 commented Oct 28, 2018

I have just scanned all the enwiki html databases (18 of them), and the only page with <nowiki> in it seems to be 1965–66_TSV_1860_Munich_season.

@gnosygnu
Owner

Thanks for the follow-up.

I found the issue. It's related to the {{#tag}} parser function. The simplified example wikitext would be the following:

{{#tag:pre|<nowiki>A<b>B</b></nowiki>}}

... which outputs the literal nowiki tags.

This behavior is caused by the tag function wrapping the original contents in a UNIQ block and unwrapping later. I have to look at MediaWiki code later to see what is the proper fix. A sloppy proof of concept hack would be to make the following change to https://github.com/gnosygnu/xowa/blob/master/400_xowa/src/gplx/xowa/xtns/pfuncs/strings/Pfunc_tag.java#L47

if (args_len > 0) {	// handle no args; EX: "{{#tag:ref}}" -> "<ref></ref>"
	byte[] temp = Pf_func_.Eval_arg_or_empty(ctx, src, caller, self, args_len, 0);
	temp = ctx.Wiki().Parser_mgr().Main().Parse_text_to_html(Xop_ctx.New__sub__reuse_page(ctx), temp);
	tmp_bfr.Add(temp);
}

However, this won't work as a permanent fix because the Main() parser should not be invoked in nested calls.

I'll comment again here when I have a more robust fix.
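For anyone following along, the wrap-and-unwrap idea can be sketched roughly as below. This is a minimal Python illustration, not XOWA's or MediaWiki's actual code: the `StripState` class, its method names, and the marker format are all made up for the example, and it assumes the protected text should come back HTML-escaped with the nowiki tags dropped.

```python
import html
import re

NOWIKI_RE = re.compile(r"<nowiki>(.*?)</nowiki>", re.DOTALL | re.IGNORECASE)


class StripState:
    """Hypothetical helper: swaps protected text for opaque markers and back."""

    def __init__(self):
        self._items = {}  # marker -> raw text that was inside <nowiki>

    def strip(self, wikitext):
        """Replace each <nowiki>...</nowiki> span with a unique marker."""
        def repl(match):
            # Marker format is invented for this sketch.
            marker = "\x7fUNIQ-nowiki-%08d-QINU\x7f" % len(self._items)
            self._items[marker] = match.group(1)
            return marker
        return NOWIKI_RE.sub(repl, wikitext)

    def unstrip(self, text):
        """Restore the protected text, escaped, without the surrounding tags."""
        for marker, raw in self._items.items():
            text = text.replace(marker, html.escape(raw, quote=False))
        return text


state = StripState()
stripped = state.strip("<nowiki>A<b>B</b></nowiki>")
# ... the rest of the parser would run on `stripped` here ...
result = "<pre>" + state.unstrip(stripped) + "</pre>"
```

The point of the sketch is that the tags themselves never reach the output; if the unwrap step restores the whole matched span instead of only its contents, the literal tags leak through, which looks like what is happening here.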

On another note, how do you scan the html databases? I assume you have some ad hoc code that un-hzips each html page and then scans the full text? If so, how long does that take? I'd imagine it would take 2+ hours for each scan (unless you're saving the un-hzipped content as files somewhere).

@desb42
Collaborator Author

desb42 commented Oct 29, 2018

To scan the html, I have a simple Python script that does essentially what you describe.

see the gist checkhtml.py

On the machine I use, this takes about 30 minutes.
It produces 6059 entries.
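A minimal sketch of that kind of scan, for illustration only. It is not the checkhtml.py gist: the table name `html`, the columns `page_id` and `body`, and the zlib fallback are assumptions rather than the confirmed XOWA schema, and it does not handle hzip.

```python
import sqlite3
import zlib


def find_pages_with(conn, needle=b"<nowiki>", table="html",
                    id_col="page_id", body_col="body"):
    """Return ids of pages whose stored HTML contains `needle`.

    Table and column names are hypothetical defaults for this sketch.
    """
    query = "SELECT {0}, {1} FROM {2} ORDER BY {0}".format(id_col, body_col, table)
    hits = []
    for page_id, body in conn.execute(query):
        if body is None:
            continue
        # Bodies may come back as bytes (BLOB) or str (TEXT).
        data = body if isinstance(body, bytes) else str(body).encode("utf-8")
        try:
            data = zlib.decompress(data)  # assumed: some bodies are zlib-compressed
        except zlib.error:
            pass  # not zlib data; scan it as stored
        if needle in data:
            hits.append(page_id)
    return hits
```

It would be called once per database file with a connection from `sqlite3.connect(...)`, collecting the hits across all of them.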

@gnosygnu
Owner

gnosygnu commented Nov 1, 2018

Cool. This should pick up most of the errors, since they aren't hzipped.

I'll give the python script a try when I get home later. It's interesting that your script is relatively concise yet powerful. One day, when I get rid of hzip, it'll be pretty useful in scanning through all the html pages

@desb42
Collaborator Author

desb42 commented Feb 14, 2019

I have just found an instance of this <nowiki> issue that has broader consequences:
[image: redness]
The source of the page contains the following lines:

<th scope="row" class="navbox-group" style="background: white; 
-moz-box-shadow: inset 2px 2px 0 <nowiki>#F0001C</nowiki>, inset -2px -2px 0 <nowiki>#F0001C</nowiki>; 
-webkit-box-shadow: inset 2px 2px 0 <nowiki>#F0001C</nowiki>, inset -2px -2px 0 <nowiki>#F0001C</nowiki>; 
box-shadow: inset 2px 2px 0 <nowiki>#F0001C</nowiki>, inset -2px -2px 0 <nowiki>#F0001C</nowiki>;;width:1%">

Note the many <nowiki> tags inside the style attribute.

Looking at the wikitext, the area under discussion is {{Party of European Socialists}}.
This in turn contains three {{Party of European Socialists/meta/color}} entries,
and that template contains the <nowiki> entry.

I think it needs a little boost in priority.

@gnosygnu
Owner

Thanks for the example. Will take a look at it this weekend, but nowiki debugging always gives me a headache.

@desb42
Collaborator Author

desb42 commented Mar 12, 2019

And here's another <nowiki> problem, the other way around. That is, the <nowiki> tags do not seem to be honored:
en.wiktionary.org/wiki/Wiktionary:Beer_parlour/2011/March#CFI_and_vandalism
[image: vandalism]

The wikitext of this section is:

== CFI and vandalism ==

Now this is a section CFI could do well without:

<div style="border-left: 1px solid #C00; border-left-width: 3px; padding-left: .5em; margin-left: 2em;">
<nowiki>==Vandalism==</nowiki>

From time to time, various parties will insert material into Wiktionary which clearly has nothing 

XOWA treats ==Vandalism== as a header; MediaWiki treats it as plain text.
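To make the expected behavior concrete, here is a small Python sketch of heading detection that leaves <nowiki> content alone. It only illustrates the ordering problem and is not how XOWA or MediaWiki is implemented: `find_headings` and both regexes are invented for the example, and masking with NUL characters is just one convenient way to keep offsets stable.

```python
import re

NOWIKI_RE = re.compile(r"<nowiki>.*?</nowiki>", re.DOTALL | re.IGNORECASE)
HEADING_RE = re.compile(r"^(={1,6})[ \t]*(.+?)[ \t]*\1[ \t]*$", re.MULTILINE)


def find_headings(wikitext):
    """Return (level, title) pairs, skipping anything inside <nowiki>."""
    # Mask protected spans with same-length filler so nothing inside them
    # can match and character offsets in the text stay unchanged.
    masked = NOWIKI_RE.sub(lambda m: "\x00" * len(m.group(0)), wikitext)
    return [(len(m.group(1)), m.group(2)) for m in HEADING_RE.finditer(masked)]


text = (
    "== CFI and vandalism ==\n\n"
    "<nowiki>==Vandalism==</nowiki>\n\n"
    "From time to time"
)
protected = find_headings(text)

# Dropping the tags before header detection reproduces the bug.
naive = find_headings(text.replace("<nowiki>", "").replace("</nowiki>", ""))
```

With the tags masked first, only the real section title is reported; with the tags simply removed first, ==Vandalism== is picked up as a second header.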

@desb42
Collaborator Author

desb42 commented May 13, 2019

I thought I would take a look at this and noticed quite a lot of commented-out code regarding UNIQ.
So I reinstated it to see what happens.

The example I was specifically tracking down was en.wikipedia.org/wiki/Template:Party of European Socialists/meta/color

It does seem to work with the current code (this is due to the nowiki text being 'escaped').

I tracked things to Atrs_make in Xop_tblw_wkr.java.
This routine essentially finds all the tokens associated with the attributes of the table element, works out where they start and end, and then throws them away.
For <nowiki>, there is a piece of commented-out code that uses Uniq_mgr.

Instead, I took the tokens identified and effectively passed them through Xot_tmpl_wtr.Write.
This seemed to work in the short term.

However, I believe there is an underlying issue with the table tokens: they all assume that they refer to the original source.
With the above approach, I think the object prv_tblw should be adjusted not only for range but also for the potentially new and differently sized source.
(Or am I just rambling?)
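To illustrate the offset concern in isolation, here is a small Python sketch. It is not XOWA code: `apply_replacement` and the (start, end) tuples are hypothetical stand-ins for the table tokens, and it assumes tokens never straddle the replaced span.

```python
def apply_replacement(src, tokens, start, end, new_text):
    """Replace src[start:end] with new_text and shift token ranges to match.

    `tokens` is a list of (start, end) ranges into `src`. Ranges before the
    edit are kept, ranges after it move by the change in length, and ranges
    overlapping the edit are rejected.
    """
    delta = len(new_text) - (end - start)
    new_src = src[:start] + new_text + src[end:]
    new_tokens = []
    for t_start, t_end in tokens:
        if t_end <= start:
            new_tokens.append((t_start, t_end))
        elif t_start >= end:
            new_tokens.append((t_start + delta, t_end + delta))
        else:
            raise ValueError("token overlaps the replaced span")
    return new_src, new_tokens


src = 'style="<nowiki>#F0001C</nowiki>" width="1%"'
start = src.index("<nowiki>")
end = src.index("</nowiki>") + len("</nowiki>")
width_at = src.index("width")
tokens = [(0, 5), (width_at, width_at + 5)]

new_src, new_tokens = apply_replacement(src, tokens, start, end, "#F0001C")
```

After the replacement the second range still selects `width` in the shorter string, whereas the unshifted range would point into the wrong characters. That is the kind of adjustment prv_tblw would need if the attribute text is rewritten.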

@gnosygnu
Owner

> I thought I would take a look at this and have noticed quite a lot of commented out code regarding UNIQ
> So I reinstated them to see what happens

Yeah, I added this a while ago. I forget why I left it commented out (probably did not want to risk changing behavior).

Let me put it on tap for this weekend. Thanks.
