Bug 1323861 - Remove the readScript method, r=Gijs by evanxd · Pull Request #345 · mozilla/readability

evanxd · 2017-01-23T10:28:27Z

Fixes #334.

evanxd · 2017-01-25T08:12:14Z

test/test-pages/breitbart/source.html

-    <script type="text/javascript" src="//m.addthis.com/live/red_lojson/300lo.json?si=5887328e015c3caf&amp;bkl=0&amp;bl=1&amp;sid=5887328e015c3caf&amp;pub=ra-536db77a775cf072&amp;rev=v7.9.5-wp&amp;ln=en&amp;pc=men&amp;cb=0&amp;ab=-&amp;dp=www.breitbart.com&amp;fp=tech%2F2016%2F12%2F22%2Fneutral-snopes-fact-checker-david-emery-un-angry-trump-supporters%2F&amp;fr=&amp;of=0&amp;pd=0&amp;irt=0&amp;vcl=0&amp;md=0&amp;ct=0&amp;tct=0&amp;abt=0&amp;cdn=0&amp;pi=1&amp;rb=0&amp;gen=100&amp;chr=UTF-8&amp;mk=Donald%20Trump%2Cfacebook%2CFact%20Check%2Cfake%20news%2CSnopes%2CTech%2Csnopes&amp;colc=1485255313842&amp;jsl=13505&amp;uvs=5887328e084443ac000&amp;skipb=1&amp;callback=addthis.cbs.oln9_77519642505367390">
-        <script type="text/javascript" id="cadmpjs" async="" src="//c1.rfihub.net/js/smarttag.js"></script>
-        <script type="text/javascript" id="pubnationjs" async="" src="//report-ads-to.pubnation.com/dist/pnr.js?t=pn-52225fd0c8484f06"></script>
-    </script>


@gijsk ,

The XHTML content generated from @mozilla.org/xmlextras/xmlserializer;1" contains some invalid <script> nodes which our JSDOMParser module cannot handle with, like above I removed in the patch.

After removing the below nested <script> nodes

<script> <script></script> <script></script> </sciprt>

then the tests can be passed.

Since xmlserializer generated the nested <script> nodes which is invalid structure and cannot be passed with w3c xhtml validation, I think we should fix the issue in xmlserializer module not in JSDOMParser module, right? What do you think?

If we all agree that is xmlserializer issue, then this bug is not about Reader Mode or Readability.js. We might need to find a someone in gecko team to fix it?

But since xmlserializer module is designed for focusing generating xml serializer, it make sense it don't have knowledge about nested <script> nodes are invalid. So looks like we should fix the issue in our JSDOMParser module. What do you think?

We probably need to fix this in JSDOMParser. I don't know off-hand how to do this in the best way. See also #77 and #87 . There are unit test for the parser's behaviour, so we should add to those when we do fix this.

evanxd · 2017-02-08T10:45:10Z

Wrote a patch to fix #334 and tests. And the CI is good.

@gijsk , could you help review it? Thanks.

gijsk · 2017-02-08T13:26:42Z

JSDOMParser.js

+            var indexOfNextScriptOpeningTag = this.html.indexOf("<script", this.currentChar);
+            var indexOfNextScriptClosingTag = this.html.indexOf("</script>", this.currentChar);
+            // Found out the closing script tag when there is no other opening/closing tag or
+            // index of closing tag is bigger than opening tag's, which means rest of opening/closing script tags are pairs.


This isn't actually true... What happens to the first opening and closing tag means nothing for whether all the next opening/closing tags are pairs.

Unfortunately I also don't think it works well. Consider:

<script ...></script>  <p> random important content</p> <script> // more script </script>

and now, if I'm reading your code right, the first closing of the <script> will be ignored because we spot a </script> lower down that is not preceded by a <script>.

I think the problem here is that the code tries to be HTML-style clever about how to read script tags, but if we're guaranteeing XHTML input that shouldn't be necessary and will actually cause problems in the case of nested script elements. What happens if you replace the call to readScript with readChildren ? Maybe we should just remove that code altogether. See also comments on #252

Good point! Let's try to it.

gijsk · 2017-02-08T13:27:12Z

test/test-jsdomparser.js

+    var html = '<head><script><script src="foo.js"></script></script></head>';
+    var doc = new JSDOMParser().parse(html);
+    expect(doc.firstChild.firstChild.tagName).eql("SCRIPT");
+    expect(doc.firstChild.firstChild.textContent).eql("<script src=\"foo.js\"></script>");


Why can't we just parse these as actual child elements?

evanxd · 2017-02-10T07:42:22Z

@gijsk ,

I've updated the patch for your comments. I removed the readScript method and fixed all test failures by updating all source.html pages to XTHML format.

Could you help review it again? Thanks.

gijsk · 2017-02-10T11:13:27Z

test/test-jsdomparser.js


 var BASETESTCASE = '<html><body><p>Some text and <a class="someclass" href="#">a link</a></p>' +
-                   '<div id="foo">With a <script>With < fancy " characters in it because' +
+                   '<div id="foo">With a <script>With &lt; fancy " characters in it because' +


Are these changes (to single '<' or '>' characters) actually necessary to make tests pass? Why?

I replace all < and > with < and > because of the below three reasons.

Our xml serializer just replaces all < and > with < and >. And the output of the xml serializer is just the input of reader mode.

In XHTML, the script and style elements are declared as having #CDATA content. 1

Use external scripts if your script uses < or & or ]]> or --. 2 So replace < and > in embedded script.

Due to above reasons, I think we might need to change these source.html pages. What do you think?

I think that if it works without the replacement we should leave the files alone.

I think replacing the < and > that start/end CDATA sections (like my comment below) is definitely wrong.

I think replacing the < and > that start/end CDATA sections (like my comment below) is definitely wrong.

Agreed, let's do it. I'll not replace < and > that start/end CDATA sections.

But I cannot understand I think that if it works without the replacement we should leave the files alone. Do you mean that we can replace < and > in an embedded script if there is no CDATA in it? For example, we can replace the below < or > with < or >.

<script> // There is no CDATA here. for (var i = 0; i < 10; i++) { console.log(i); } </script>

After doing replacement, it will looks like

<script> // There is no CDATA here. for (var i = 0; i < 10; i++) { console.log(i); } </script>

Is that good to you?

Well, I just wouldn't bother replacing them unless it's necessary to make tests pass, which I don't think it is. :-)

Got it. I'll only replace < with < in an embedded script there is no CDATA in it.

That's not what I said... why is it necessary to replace them at all? Do the tests break if you don't replace them?

Yes, the tests will be broken if we don't replace < in embedded script tags without CDATA.

gijsk · 2017-02-10T11:18:22Z

test/test-pages/ars-1/source.html

                        <aside class="side-ad">
                            <script type="text/javascript" language="JavaScript">
-                                // <![CDATA[
+                                // &lt;![CDATA[


This looks wrong (in that it should work as-is)

gijsk · 2017-02-10T11:19:15Z

The JSDOMParser changes look OK, but I was kind of expecting us not to have to make any/many test changes, maybe besides the JSDOMParser unit test itself. Am I wrong?

evanxd · 2017-02-20T10:42:47Z

Fixed testing failures by replacing < with < in an embedded script without CDATA except bbc-1, buzzfeed-1, herald-sun-1, and liberation-1 tests. Continue to work on the rest of failures tomorrow...

evanxd · 2017-02-21T09:41:56Z

Hi @gijsk ,

I updated the test files to replace < with < in inner script nodes in source.html pages to fix all previous test failures. The replacements are necessary to pass all tests.

The CI is good now. Could you take a look again? Thanks.

gijsk · 2017-02-21T11:04:25Z

test/test-jsdomparser.js


  it("should strip !-based comments within script tags", function() {
-    var html = '<script><!--Silly test > <script src="foo.js"></script>--></script>';
+    var html = '<script>&lt;!--Silly test > &lt;script src="foo.js">&lt;/script>--></script>';


Uh, is this really what the XML serializer does? It doesn't keep XML-style comments? That seems...surprising.

Uh, that is a mistake. Already updated the patch to fix it. Please take a look. Thanks.

evanxd mentioned this pull request Jan 23, 2017

Breitbart.com #334

Closed

evanxd force-pushed the bug-1323861 branch 3 times, most recently from 3901384 to 4d6a5b1 Compare January 25, 2017 07:49

evanxd commented Jan 25, 2017

View reviewed changes

evanxd force-pushed the bug-1323861 branch from 4d6a5b1 to b5166ac Compare January 25, 2017 08:33

evanxd force-pushed the bug-1323861 branch 3 times, most recently from da6fd0a to ec5c19c Compare February 8, 2017 10:40

evanxd changed the title ~~Bug 1323861 - Fix JSDOMParser parser issue~~ Bug 1323861 - Find out correct closing script tag, r=Gijs Feb 8, 2017

evanxd self-assigned this Feb 8, 2017

gijsk requested changes Feb 8, 2017

View reviewed changes

evanxd force-pushed the bug-1323861 branch 4 times, most recently from b1b6826 to a01a6e2 Compare February 10, 2017 07:36

gijsk reviewed Feb 10, 2017

View reviewed changes

evanxd force-pushed the bug-1323861 branch 2 times, most recently from 7b411f7 to 66c7f42 Compare February 20, 2017 10:36

evanxd force-pushed the bug-1323861 branch 2 times, most recently from 4067b38 to ea166e0 Compare February 21, 2017 09:33

evanxd force-pushed the bug-1323861 branch 2 times, most recently from 3947864 to 0666a27 Compare February 21, 2017 09:47

evanxd changed the title ~~Bug 1323861 - Find out correct closing script tag, r=Gijs~~ Bug 1323861 - Remove the readScript method, r=Gijs Feb 21, 2017

gijsk reviewed Feb 21, 2017

View reviewed changes

Bug 1323861 - Remove the readScript method, r=Gijs

b8252e4

evanxd force-pushed the bug-1323861 branch from 0666a27 to b8252e4 Compare February 22, 2017 03:59

gijsk approved these changes Feb 22, 2017

View reviewed changes

gijsk merged commit 0f14737 into mozilla:master Feb 22, 2017

Conversation

evanxd commented Jan 23, 2017 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

evanxd Jan 25, 2017 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

evanxd Jan 25, 2017 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

evanxd commented Feb 8, 2017

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

evanxd commented Feb 10, 2017

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

evanxd Feb 16, 2017 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

gijsk commented Feb 10, 2017

Uh oh!

evanxd commented Feb 20, 2017

Uh oh!

evanxd commented Feb 21, 2017

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

evanxd commented Jan 23, 2017 •

edited

Loading

evanxd Jan 25, 2017 •

edited

Loading

evanxd Jan 25, 2017 •

edited

Loading

evanxd Feb 16, 2017 •

edited

Loading