wrong parsing with ParseSettings.preserveCase #1149

Closed
anexplore opened this issue Nov 23, 2018 · 3 comments
Labels
bug Confirmed bug that we should fix fixed

@anexplore

jsoup version: 1.11.3
When parsing with the case-preserving settings (ParseSettings.preserveCase), the resulting document tree is wrong.

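import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.parser.ParseSettings;
import org.jsoup.parser.Parser;

public class TestJsoupParser {

    public static void main(String[] args) {
        Parser parser = Parser.htmlParser();
        // Preserve tag and attribute case; the wrong output appears only with this line.
        parser.settings(ParseSettings.preserveCase);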
        String html = "<div class=\"bdsharebuttonbox\">"
                + "<A class=bds_more href=\"http://share.baidu.com/code#\" data-cmd=\"more\">分享到:</A>"
                + "<A title=分享到QQ空间 class=bds_qzone href=\"http://share.baidu.com/code#\" data-cmd=\"qzone\">"
                + "</A><A title=分享到新浪微博 class=bds_tsina href=\"http://share.baidu.com/code#\" data-cmd=\"tsina\"></A>"
                + "<A title=分享到腾讯微博 class=bds_tqq href=\"http://share.baidu.com/code#\" data-cmd=\"tqq\"></A>"
                + "<A title=分享到人人网 class=bds_renren href=\"http://share.baidu.com/code#\" data-cmd=\"renren\"></A>"
                + "<A title=分享到微信 class=bds_weixin href=\"http://share.baidu.com/code#\" data-cmd=\"weixin\"></A>"
                + "</div>\r\n";
        Document doc = Jsoup.parse(html, "", parser);
        System.out.println(doc.html());
    }
}

The result is:

<html>
 <head></head>
 <body>
  <div class="bdsharebuttonbox">
   <A class="bds_more" href="http://share.baidu.com/code#" data-cmd="more">
    分享到:
   </A>
   <A class="bds_more" href="http://share.baidu.com/code#" data-cmd="more">
    <A title="分享到QQ空间" class="bds_qzone" href="http://share.baidu.com/code#" data-cmd="qzone"></A>
    <A title="分享到QQ空间" class="bds_qzone" href="http://share.baidu.com/code#" data-cmd="qzone">
     <A title="分享到新浪微博" class="bds_tsina" href="http://share.baidu.com/code#" data-cmd="tsina"></A>
     <A title="分享到新浪微博" class="bds_tsina" href="http://share.baidu.com/code#" data-cmd="tsina">
      <A title="分享到腾讯微博" class="bds_tqq" href="http://share.baidu.com/code#" data-cmd="tqq"></A>
      <A title="分享到腾讯微博" class="bds_tqq" href="http://share.baidu.com/code#" data-cmd="tqq">
       <A title="分享到人人网" class="bds_renren" href="http://share.baidu.com/code#" data-cmd="renren"></A>
       <A title="分享到人人网" class="bds_renren" href="http://share.baidu.com/code#" data-cmd="renren">
        <A title="分享到微信" class="bds_weixin" href="http://share.baidu.com/code#" data-cmd="weixin"></A>
       </A>
      </A>
     </A>
    </A>
   </A>
  </div>
  <A class="bds_more" href="http://share.baidu.com/code#" data-cmd="more">
   <A title="分享到QQ空间" class="bds_qzone" href="http://share.baidu.com/code#" data-cmd="qzone">
    <A title="分享到新浪微博" class="bds_tsina" href="http://share.baidu.com/code#" data-cmd="tsina">
     <A title="分享到腾讯微博" class="bds_tqq" href="http://share.baidu.com/code#" data-cmd="tqq">
      <A title="分享到人人网" class="bds_renren" href="http://share.baidu.com/code#" data-cmd="renren">
       <A title="分享到微信" class="bds_weixin" href="http://share.baidu.com/code#" data-cmd="weixin"> 
       </A>
      </A>
     </A>
    </A>
   </A>
  </A>
 </body>
</html>

However, without preserveCase the result is correct:

<html>
 <head></head>
 <body>
  <div class="bdsharebuttonbox">
   <a class="bds_more" href="http://share.baidu.com/code#" data-cmd="more">分享到:</a>
   <a title="分享到QQ空间" class="bds_qzone" href="http://share.baidu.com/code#" data-cmd="qzone"></a>
   <a title="分享到新浪微博" class="bds_tsina" href="http://share.baidu.com/code#" data-cmd="tsina"></a>
   <a title="分享到腾讯微博" class="bds_tqq" href="http://share.baidu.com/code#" data-cmd="tqq"></a>
   <a title="分享到人人网" class="bds_renren" href="http://share.baidu.com/code#" data-cmd="renren"></a>
   <a title="分享到微信" class="bds_weixin" href="http://share.baidu.com/code#" data-cmd="weixin"></a>
  </div> 
 </body>
</html>
@anexplore

see #1150

@anexplore

anexplore commented Nov 26, 2018

The reason is:
The methods in HtmlTreeBuilder get the tag name from Element's nodeName(), which can be case-sensitive (it keeps the original case when preserveCase is set).

However, HtmlTreeBuilderState gets the tag name from normalName(), which is always lowercase. When it calls a method in HtmlTreeBuilder, it passes the lowercase name as the parameter, but HtmlTreeBuilder compares that against nodeName().

The static constants in the class, such as the tag-name arrays, also contain only lowercase element names.
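To make the mismatch concrete, here is a minimal, self-contained sketch. It does not use jsoup and is not the actual change made in the fix; the class, the methods, and the simplified stack of open elements are all hypothetical names chosen for illustration. It only shows why a lowercase lookup key never matches a name stored in its original case.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Locale;

// Hypothetical, simplified stand-in for a tree builder's stack of open elements.
// None of these names are jsoup's own API.
final class CaseMismatchSketch {

    // Mismatched lookup: the caller passes a lowercase name, but the stored
    // name keeps its original case, so "A" never equals "a".
    static boolean inScopeCaseSensitive(Deque<String> openElements, String lowerName) {
        for (String nodeName : openElements) {
            if (nodeName.equals(lowerName)) {
                return true;
            }
        }
        return false;
    }

    // Lookup that lowercases the stored name first, so both sides are normalized.
    static boolean inScopeNormalized(Deque<String> openElements, String lowerName) {
        for (String nodeName : openElements) {
            if (nodeName.toLowerCase(Locale.ROOT).equals(lowerName)) {
                return true;
            }
        }
        return false;
    }

    // Open elements as they would look after "<div><A ...>" with case preserved.
    static Deque<String> sampleStack() {
        Deque<String> stack = new ArrayDeque<>();
        stack.push("html");
        stack.push("body");
        stack.push("div");
        stack.push("A");
        return stack;
    }

    public static void main(String[] args) {
        Deque<String> stack = sampleStack();
        System.out.println(inScopeCaseSensitive(stack, "a")); // prints false
        System.out.println(inScopeNormalized(stack, "a"));    // prints true
    }
}
```

If the builder cannot find the open <A> when it handles the closing tag or the next opening tag, it may leave that element open, which is consistent with the ever-deeper nesting in the output above.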

@jhy closed this as completed in 7ff7c43 on Dec 23, 2018
@jhy

jhy commented Dec 23, 2018

Thanks for identifying this! Fixed.

@jhy added the bug and fixed labels on Dec 23, 2018
@jhy added this to the 1.12.1 milestone on Dec 23, 2018