Skip to content

Bug 1323861 - Remove the readScript method, r=Gijs#345

Merged
gijsk merged 1 commit intomozilla:masterfrom
evanxd:bug-1323861
Feb 22, 2017
Merged

Bug 1323861 - Remove the readScript method, r=Gijs#345
gijsk merged 1 commit intomozilla:masterfrom
evanxd:bug-1323861

Conversation

@evanxd
Copy link
Copy Markdown
Contributor

@evanxd evanxd commented Jan 23, 2017

Fixes #334.

@evanxd evanxd mentioned this pull request Jan 23, 2017
@evanxd evanxd force-pushed the bug-1323861 branch 3 times, most recently from 3901384 to 4d6a5b1 Compare January 25, 2017 07:49
<script type="text/javascript" src="//m.addthis.com/live/red_lojson/300lo.json?si=5887328e015c3caf&amp;bkl=0&amp;bl=1&amp;sid=5887328e015c3caf&amp;pub=ra-536db77a775cf072&amp;rev=v7.9.5-wp&amp;ln=en&amp;pc=men&amp;cb=0&amp;ab=-&amp;dp=www.breitbart.com&amp;fp=tech%2F2016%2F12%2F22%2Fneutral-snopes-fact-checker-david-emery-un-angry-trump-supporters%2F&amp;fr=&amp;of=0&amp;pd=0&amp;irt=0&amp;vcl=0&amp;md=0&amp;ct=0&amp;tct=0&amp;abt=0&amp;cdn=0&amp;pi=1&amp;rb=0&amp;gen=100&amp;chr=UTF-8&amp;mk=Donald%20Trump%2Cfacebook%2CFact%20Check%2Cfake%20news%2CSnopes%2CTech%2Csnopes&amp;colc=1485255313842&amp;jsl=13505&amp;uvs=5887328e084443ac000&amp;skipb=1&amp;callback=addthis.cbs.oln9_77519642505367390">
<script type="text/javascript" id="cadmpjs" async="" src="//c1.rfihub.net/js/smarttag.js"></script>
<script type="text/javascript" id="pubnationjs" async="" src="//report-ads-to.pubnation.com/dist/pnr.js?t=pn-52225fd0c8484f06"></script>
</script>
Copy link
Copy Markdown
Contributor Author

@evanxd evanxd Jan 25, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@gijsk ,

The XHTML content generated from @mozilla.org/xmlextras/xmlserializer;1" contains some invalid <script> nodes which our JSDOMParser module cannot handle with, like above I removed in the patch.

After removing the below nested <script> nodes

<script>
  <script></script>
  <script></script>
</sciprt>

then the tests can be passed.

Since xmlserializer generated the nested <script> nodes which is invalid structure and cannot be passed with w3c xhtml validation, I think we should fix the issue in xmlserializer module not in JSDOMParser module, right? What do you think?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we all agree that is xmlserializer issue, then this bug is not about Reader Mode or Readability.js. We might need to find a someone in gecko team to fix it?

Copy link
Copy Markdown
Contributor Author

@evanxd evanxd Jan 25, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But since xmlserializer module is designed for focusing generating xml serializer, it make sense it don't have knowledge about nested <script> nodes are invalid. So looks like we should fix the issue in our JSDOMParser module. What do you think?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We probably need to fix this in JSDOMParser. I don't know off-hand how to do this in the best way. See also #77 and #87 . There are unit test for the parser's behaviour, so we should add to those when we do fix this.

@evanxd evanxd force-pushed the bug-1323861 branch 3 times, most recently from da6fd0a to ec5c19c Compare February 8, 2017 10:40
@evanxd evanxd changed the title Bug 1323861 - Fix JSDOMParser parser issue Bug 1323861 - Find out correct closing script tag, r=Gijs Feb 8, 2017
@evanxd
Copy link
Copy Markdown
Contributor Author

evanxd commented Feb 8, 2017

Wrote a patch to fix #334 and tests. And the CI is good.

@gijsk , could you help review it? Thanks.

@evanxd evanxd self-assigned this Feb 8, 2017
JSDOMParser.js Outdated
var indexOfNextScriptOpeningTag = this.html.indexOf("<script", this.currentChar);
var indexOfNextScriptClosingTag = this.html.indexOf("</script>", this.currentChar);
// Found out the closing script tag when there is no other opening/closing tag or
// index of closing tag is bigger than opening tag's, which means rest of opening/closing script tags are pairs.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This isn't actually true... What happens to the first opening and closing tag means nothing for whether all the next opening/closing tags are pairs.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unfortunately I also don't think it works well. Consider:

<script ...></script>
<!-- Something something </script> something -->
<p> random important content</p>
<script>
// more script
</script>

and now, if I'm reading your code right, the first closing of the <script> will be ignored because we spot a </script> lower down that is not preceded by a <script>.

I think the problem here is that the code tries to be HTML-style clever about how to read script tags, but if we're guaranteeing XHTML input that shouldn't be necessary and will actually cause problems in the case of nested script elements. What happens if you replace the call to readScript with readChildren ? Maybe we should just remove that code altogether. See also comments on #252

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point! Let's try to it.

var html = '<head><script><script src="foo.js"></script></script></head>';
var doc = new JSDOMParser().parse(html);
expect(doc.firstChild.firstChild.tagName).eql("SCRIPT");
expect(doc.firstChild.firstChild.textContent).eql("<script src=\"foo.js\"></script>");
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why can't we just parse these as actual child elements?

@evanxd evanxd force-pushed the bug-1323861 branch 4 times, most recently from b1b6826 to a01a6e2 Compare February 10, 2017 07:36
@evanxd
Copy link
Copy Markdown
Contributor Author

evanxd commented Feb 10, 2017

@gijsk ,

I've updated the patch for your comments. I removed the readScript method and fixed all test failures by updating all source.html pages to XTHML format.

Could you help review it again? Thanks.


var BASETESTCASE = '<html><body><p>Some text and <a class="someclass" href="#">a link</a></p>' +
'<div id="foo">With a <script>With < fancy " characters in it because' +
'<div id="foo">With a <script>With &lt; fancy " characters in it because' +
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are these changes (to single '<' or '>' characters) actually necessary to make tests pass? Why?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I replace all < and > with &lt; and &gt; because of the below three reasons.

  1. Our xml serializer just replaces all < and > with &lt; and &gt;. And the output of the xml serializer is just the input of reader mode.
  2. In XHTML, the script and style elements are declared as having #CDATA content. 1
  3. Use external scripts if your script uses < or & or ]]> or --. 2 So replace < and > in embedded script.

Due to above reasons, I think we might need to change these source.html pages. What do you think?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that if it works without the replacement we should leave the files alone.

I think replacing the < and > that start/end CDATA sections (like my comment below) is definitely wrong.

Copy link
Copy Markdown
Contributor Author

@evanxd evanxd Feb 16, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think replacing the < and > that start/end CDATA sections (like my comment below) is definitely wrong.

Agreed, let's do it. I'll not replace < and > that start/end CDATA sections.

But I cannot understand I think that if it works without the replacement we should leave the files alone. Do you mean that we can replace < and > in an embedded script if there is no CDATA in it? For example, we can replace the below < or > with &lt; or &gt;.

<script>
  // There is no CDATA here.
  for (var i = 0; i < 10; i++) {
    console.log(i);
  }
</script>

After doing replacement, it will looks like

<script>
  // There is no CDATA here.
  for (var i = 0; i &lt; 10; i++) {
    console.log(i);
  }
</script>

Is that good to you?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, I just wouldn't bother replacing them unless it's necessary to make tests pass, which I don't think it is. :-)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it. I'll only replace < with &lt; in an embedded script there is no CDATA in it.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's not what I said... why is it necessary to replace them at all? Do the tests break if you don't replace them?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, the tests will be broken if we don't replace < in embedded script tags without CDATA.

<aside class="side-ad">
<script type="text/javascript" language="JavaScript">
// <![CDATA[
// &lt;![CDATA[
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks wrong (in that it should work as-is)

@gijsk
Copy link
Copy Markdown
Contributor

gijsk commented Feb 10, 2017

The JSDOMParser changes look OK, but I was kind of expecting us not to have to make any/many test changes, maybe besides the JSDOMParser unit test itself. Am I wrong?

@evanxd evanxd force-pushed the bug-1323861 branch 2 times, most recently from 7b411f7 to 66c7f42 Compare February 20, 2017 10:36
@evanxd
Copy link
Copy Markdown
Contributor Author

evanxd commented Feb 20, 2017

Fixed testing failures by replacing < with < in an embedded script without CDATA except bbc-1, buzzfeed-1, herald-sun-1, and liberation-1 tests. Continue to work on the rest of failures tomorrow...

@evanxd evanxd force-pushed the bug-1323861 branch 2 times, most recently from 4067b38 to ea166e0 Compare February 21, 2017 09:33
@evanxd
Copy link
Copy Markdown
Contributor Author

evanxd commented Feb 21, 2017

Hi @gijsk ,

I updated the test files to replace < with &lt; in inner script nodes in source.html pages to fix all previous test failures. The replacements are necessary to pass all tests.

The CI is good now. Could you take a look again? Thanks.

@evanxd evanxd force-pushed the bug-1323861 branch 2 times, most recently from 3947864 to 0666a27 Compare February 21, 2017 09:47
@evanxd evanxd changed the title Bug 1323861 - Find out correct closing script tag, r=Gijs Bug 1323861 - Remove the readScript method, r=Gijs Feb 21, 2017

it("should strip !-based comments within script tags", function() {
var html = '<script><!--Silly test > <script src="foo.js"></script>--></script>';
var html = '<script>&lt;!--Silly test > &lt;script src="foo.js">&lt;/script>--></script>';
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Uh, is this really what the XML serializer does? It doesn't keep XML-style comments? That seems...surprising.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Uh, that is a mistake. Already updated the patch to fix it. Please take a look. Thanks.

@gijsk gijsk merged commit 0f14737 into mozilla:master Feb 22, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants