Bug 1323861 - Remove the readScript method, r=Gijs #345
3901384 to
4d6a5b1
Compare
| <script type="text/javascript" src="//m.addthis.com/live/red_lojson/300lo.json?si=5887328e015c3caf&bkl=0&bl=1&sid=5887328e015c3caf&pub=ra-536db77a775cf072&rev=v7.9.5-wp&ln=en&pc=men&cb=0&ab=-&dp=www.breitbart.com&fp=tech%2F2016%2F12%2F22%2Fneutral-snopes-fact-checker-david-emery-un-angry-trump-supporters%2F&fr=&of=0&pd=0&irt=0&vcl=0&md=0&ct=0&tct=0&abt=0&cdn=0&pi=1&rb=0&gen=100&chr=UTF-8&mk=Donald%20Trump%2Cfacebook%2CFact%20Check%2Cfake%20news%2CSnopes%2CTech%2Csnopes&colc=1485255313842&jsl=13505&uvs=5887328e084443ac000&skipb=1&callback=addthis.cbs.oln9_77519642505367390"> | ||
| <script type="text/javascript" id="cadmpjs" async="" src="//c1.rfihub.net/js/smarttag.js"></script> | ||
| <script type="text/javascript" id="pubnationjs" async="" src="//report-ads-to.pubnation.com/dist/pnr.js?t=pn-52225fd0c8484f06"></script> | ||
| </script> |
@gijsk,
The XHTML content generated by "@mozilla.org/xmlextras/xmlserializer;1" contains some invalid <script> nodes that our JSDOMParser module cannot handle, like the ones above that I removed in the patch.
After removing the nested <script> nodes below,
<script>
<script></script>
<script></script>
</script>
the tests pass.
Since xmlserializer generated the nested <script> nodes, which are an invalid structure that does not pass W3C XHTML validation, I think we should fix the issue in the xmlserializer module rather than in the JSDOMParser module, right? What do you think?
If we all agree that this is an xmlserializer issue, then this bug is not about Reader Mode or Readability.js. We might need to find someone on the Gecko team to fix it?
But since the xmlserializer module is designed only to serialize XML, it makes sense that it has no knowledge that nested <script> nodes are invalid. So it looks like we should fix the issue in our JSDOMParser module. What do you think?
da6fd0a to
ec5c19c
Compare
JSDOMParser.js
Outdated
| var indexOfNextScriptOpeningTag = this.html.indexOf("<script", this.currentChar); | ||
| var indexOfNextScriptClosingTag = this.html.indexOf("</script>", this.currentChar); | ||
| // Found out the closing script tag when there is no other opening/closing tag or | ||
| // index of closing tag is bigger than opening tag's, which means rest of opening/closing script tags are pairs. |
This isn't actually true... What happens to the first opening and closing tag means nothing for whether all the next opening/closing tags are pairs.
Unfortunately I also don't think it works well. Consider:
<script ...></script>
<!-- Something something </script> something -->
<p> random important content</p>
<script>
// more script
</script>
and now, if I'm reading your code right, the first closing of the <script> will be ignored because we spot a </script> lower down that is not preceded by a <script>.
I think the problem here is that the code tries to be HTML-style clever about how to read script tags, but if we're guaranteeing XHTML input, that shouldn't be necessary and will actually cause problems in the case of nested script elements. What happens if you replace the call to readScript with readChildren? Maybe we should just remove that code altogether. See also the comments on #252.
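To make the counterexample concrete, here is a minimal sketch of why a plain substring search cannot pair script tags reliably. It is not the code from the patch: `findAll` is a hypothetical helper, and the input is modelled on the example in this comment (the `src="a.js"` attribute stands in for the elided `...`).

```javascript
// Hypothetical input modelled on the example in the review comment.
const html = [
  '<script src="a.js"></script>',
  '<!-- Something something </script> something -->',
  '<p> random important content</p>',
  '<script>',
  '// more script',
  '</script>',
].join("\n");

// Collect every index at which the literal string `needle` occurs.
function findAll(haystack, needle) {
  const positions = [];
  let i = haystack.indexOf(needle);
  while (i !== -1) {
    positions.push(i);
    i = haystack.indexOf(needle, i + needle.length);
  }
  return positions;
}

const opens = findAll(html, "<script");
const closes = findAll(html, "</script>");

// The "</script>" inside the comment is counted as a closing tag,
// so the opening and closing matches no longer pair up.
console.log(opens.length, closes.length); // 2 3
```

A substring search has no notion of comments or element nesting, so any rule built on comparing `indexOf` results can be misled by text like this.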
Good point! Let's try it.
test/test-jsdomparser.js
Outdated
| var html = '<head><script><script src="foo.js"></script></script></head>'; | ||
| var doc = new JSDOMParser().parse(html); | ||
| expect(doc.firstChild.firstChild.tagName).eql("SCRIPT"); | ||
| expect(doc.firstChild.firstChild.textContent).eql("<script src=\"foo.js\"></script>"); |
Why can't we just parse these as actual child elements?
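As a rough illustration of that suggestion, the sketch below parses well-formed, XHTML-like markup recursively and treats `<script>` like any other element. It is not JSDOMParser: `parseElements` is a hypothetical function, and it assumes no comments, CDATA sections, entities, self-closing tags, or `>` characters inside attribute values.

```javascript
// Minimal recursive parser sketch for well-formed XHTML-like input.
function parseElements(html) {
  let pos = 0;

  function parseNode() {
    if (html[pos] !== "<") {
      // A text node runs until the next tag or the end of the input.
      let end = html.indexOf("<", pos);
      if (end === -1) end = html.length;
      const text = html.slice(pos, end);
      pos = end;
      return { type: "text", text };
    }
    if (html.startsWith("</", pos)) {
      throw new Error("unexpected closing tag at " + pos);
    }
    // Opening tag: the first word after "<" is the tag name.
    const tagEnd = html.indexOf(">", pos);
    if (tagEnd === -1) throw new Error("unterminated tag at " + pos);
    const tagName = html.slice(pos + 1, tagEnd).split(/\s/)[0].toUpperCase();
    pos = tagEnd + 1;

    // Children are parsed the same way for every element, <script> included.
    const children = [];
    while (!html.startsWith("</", pos)) {
      if (pos >= html.length) throw new Error("missing </" + tagName + ">");
      children.push(parseNode());
    }
    const closeEnd = html.indexOf(">", pos);
    if (closeEnd === -1) throw new Error("unterminated closing tag at " + pos);
    const closeName = html.slice(pos + 2, closeEnd).trim().toUpperCase();
    if (closeName !== tagName) throw new Error("mismatched </" + closeName + ">");
    pos = closeEnd + 1;
    return { type: "element", tagName, children };
  }

  const nodes = [];
  while (pos < html.length) nodes.push(parseNode());
  return nodes;
}

const tree = parseElements(
  '<head><script><script src="foo.js"></script></script></head>'
);
const outerScript = tree[0].children[0];
console.log(outerScript.tagName, outerScript.children[0].tagName); // SCRIPT SCRIPT
```

With this approach the inner `<script>` becomes a child element of the outer one rather than part of its text content.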
b1b6826 to
a01a6e2
Compare
@gijsk, I've updated the patch for your comments. I removed the Could you help review it again? Thanks.
| var BASETESTCASE = '<html><body><p>Some text and <a class="someclass" href="#">a link</a></p>' + | ||
| '<div id="foo">With a <script>With < fancy " characters in it because' + | ||
| '<div id="foo">With a <script>With < fancy " characters in it because' + |
Are these changes (to single '<' or '>' characters) actually necessary to make tests pass? Why?
I replaced all < and > with &lt; and &gt; for the three reasons below.
- Our XML serializer replaces all < and > with &lt; and &gt;, and the output of the XML serializer is the input of Reader Mode.
- In XHTML, the script and style elements are declared as having #PCDATA content. 1
- Use external scripts if your script uses < or & or ]]> or --. 2 So replace < and > in embedded scripts.
For these reasons, I think we might need to change these source.html pages. What do you think?
I think that if it works without the replacement we should leave the files alone.
I think replacing the < and > that start/end CDATA sections (like my comment below) is definitely wrong.
"I think replacing the < and > that start/end CDATA sections (like my comment below) is definitely wrong."
Agreed. I won't replace the < and > that start/end CDATA sections.
But I don't understand "I think that if it works without the replacement we should leave the files alone." Do you mean that we can replace < and > in an embedded script if there is no CDATA in it? For example, we could replace the < or > below with &lt; or &gt;.
<script>
// There is no CDATA here.
for (var i = 0; i < 10; i++) {
  console.log(i);
}
</script>
After the replacement, it would look like this:
<script>
// There is no CDATA here.
for (var i = 0; i &lt; 10; i++) {
  console.log(i);
}
</script>
Is that OK with you?
Well, I just wouldn't bother replacing them unless it's necessary to make tests pass, which I don't think it is. :-)
Got it. I'll only replace < with &lt; in embedded scripts that have no CDATA in them.
That's not what I said... why is it necessary to replace them at all? Do the tests break if you don't replace them?
Yes, the tests break if we don't replace < in embedded script tags without CDATA.
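For reference, the replacement rule discussed in this thread could be sketched as below. Both functions are hypothetical helpers, not part of Readability.js or the XML serializer. The sketch assumes that `&`, `<`, and `>` are the only characters that need escaping, and that a script counts as "using CDATA" whenever it contains the `<![CDATA[` marker.

```javascript
// Hypothetical helper: escape the characters that are markup-significant
// in XML text content.
function escapeXmlText(text) {
  return text
    .replace(/&/g, "&amp;") // must run first so the entities below are not re-escaped
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;");
}

// Hypothetical helper mirroring the rule proposed in this thread:
// leave scripts that contain a CDATA marker untouched, escape the rest.
function escapeScriptUnlessCdata(scriptText) {
  return scriptText.includes("<![CDATA[")
    ? scriptText
    : escapeXmlText(scriptText);
}

const plain = "for (var i = 0; i < 10; i++) {";
const wrapped = "// <![CDATA[\nif (a < b) {}\n// ]]>";

console.log(escapeScriptUnlessCdata(plain)); // for (var i = 0; i &lt; 10; i++) {
console.log(escapeScriptUnlessCdata(wrapped) === wrapped); // true
```

Whether the test pages should be rewritten this way at all is the open question in the review; the sketch only shows what the rule would do to a script body.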
test/test-pages/ars-1/source.html
Outdated
| <aside class="side-ad"> | ||
| <script type="text/javascript" language="JavaScript"> | ||
| // <![CDATA[ | ||
| // <![CDATA[ |
This looks wrong (in that it should work as-is)
The JSDOMParser changes look OK, but I was kind of expecting us not to have to make any/many test changes, maybe besides the JSDOMParser unit test itself. Am I wrong?
7b411f7 to
66c7f42
Compare
Fixed testing failures by replacing
4067b38 to
ea166e0
Compare
Hi @gijsk, I updated the test files to replace The CI is good now. Could you take a look again? Thanks.
3947864 to
0666a27
Compare
test/test-jsdomparser.js
Outdated
| it("should strip !-based comments within script tags", function() { | ||
| var html = '<script><!--Silly test > <script src="foo.js"></script>--></script>'; | ||
| var html = '<script><!--Silly test > <script src="foo.js"></script>--></script>'; |
Uh, is this really what the XML serializer does? It doesn't keep XML-style comments? That seems...surprising.
Uh, that is a mistake. Already updated the patch to fix it. Please take a look. Thanks.
Fixes #334.