fix issue 1294 #1389

Closed
wants to merge 3 commits into from
32 changes: 32 additions & 0 deletions issues/IssueFixing1294.md
@@ -0,0 +1,32 @@
# Issues
## Issue 1294
### Problem Description
Given a ruby HTML segment, Jsoup does not parse it correctly.

![](C:\Users\xingx\AppData\Roaming\Typora\typora-user-images\image-20200420171640816.png)

![](C:\Users\xingx\AppData\Roaming\Typora\typora-user-images\image-20200420171730747.png)

The result in the second picture shows that Jsoup terminates the `<rtc>` tag prematurely.

### Problem Reasons

The problem is caused by how Jsoup handles tags. It enumerates all the tags it expects to meet and handles each one explicitly; when it meets an unknown tag, Jsoup closes it directly. The relevant code is in `org.jsoup.parser.HtmlTreeBuilderState`, around line 581.

![](https://imgur.com/CUX6qXL.png)

It first checks whether a `<ruby>` tag has been pushed onto the stack, since `<rp>` and `<rt>` must appear as child nodes of a ruby node. It then generates implied end tags and pops nodes off the stack until the current node is the ruby node. This logic implies that `<rp>` and `<rt>` must be direct children of `<ruby>`.
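
The effect of this logic can be sketched with a toy stack of open elements. This is a simplified model and not Jsoup's actual code: the class `RubyStackSketch` and its methods are hypothetical, scope checking is reduced to "is `ruby` anywhere on the stack", and the implied-end-tag step is left out. It assumes an input where `<rt>` is nested inside `<rtc>` inside `<ruby>`.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.Iterator;
import java.util.List;

public class RubyStackSketch {
    // Stack of open elements; the head of the deque is the current node.
    private final Deque<String> stack = new ArrayDeque<>();

    void push(String name) {
        stack.push(name);
    }

    // Pop elements until the named element is the current node (it stays open).
    void popStackToBefore(String name) {
        while (!stack.isEmpty() && !stack.peek().equals(name)) {
            stack.pop();
        }
    }

    // Toy version of the original "rp"/"rt" branch:
    // close everything above <ruby>, then insert the new tag.
    void insertRubyText(String name) {
        if (stack.contains("ruby")) { // simplified stand-in for an in-scope check
            if (!stack.peek().equals("ruby")) {
                popStackToBefore("ruby");
            }
            push(name);
        }
    }

    // Open elements listed from the outermost element to the current node.
    List<String> openElements() {
        List<String> out = new ArrayList<>();
        Iterator<String> it = stack.descendingIterator();
        while (it.hasNext()) {
            out.add(it.next());
        }
        return out;
    }

    public static void main(String[] args) {
        RubyStackSketch tb = new RubyStackSketch();
        tb.push("body");
        tb.push("ruby");
        tb.push("rtc");
        tb.insertRubyText("rt");
        // <rtc> has been popped, so <rt> ends up as a direct child of <ruby>.
        System.out.println(tb.openElements()); // prints [body, ruby, rt]
    }
}
```

In this model the `<rtc>` element is closed as soon as `<rt>` arrives, which matches the premature termination described above.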

### Problem Solution

A `ruby` tag may contain more than just `rb` and `rt` tags. For example, someone could add a `div` or `span`, which has no ruby meaning of its own but divides the content into areas that may be used to apply CSS.

![](https://imgur.com/hJ0zwPu.png)

So the best way to deal with it is to treat it as a normal tag and use the default handling.
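
Treating the tag as a normal tag amounts to inserting it under the current node without closing anything first. A minimal sketch using a toy model of the open-element stack (the class `DefaultInsertSketch` and its methods are hypothetical, scope checking is reduced to "is `ruby` anywhere on the stack", and Jsoup's reconstruction of formatting elements is left out):

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.Iterator;
import java.util.List;

public class DefaultInsertSketch {
    // Stack of open elements; the head of the deque is the current node.
    private final Deque<String> stack = new ArrayDeque<>();

    void push(String name) {
        stack.push(name);
    }

    // Proposed handling: when <ruby> is open, insert the tag under the
    // current node and leave every other open element untouched.
    void insertAsNormalTag(String name) {
        if (stack.contains("ruby")) { // simplified stand-in for an in-scope check
            push(name);
        }
    }

    // Open elements listed from the outermost element to the current node.
    List<String> openElements() {
        List<String> out = new ArrayList<>();
        Iterator<String> it = stack.descendingIterator();
        while (it.hasNext()) {
            out.add(it.next());
        }
        return out;
    }

    public static void main(String[] args) {
        DefaultInsertSketch tb = new DefaultInsertSketch();
        tb.push("body");
        tb.push("ruby");
        tb.push("rtc");
        tb.insertAsNormalTag("rt");
        // <rtc> stays open, so <rt> becomes its child.
        System.out.println(tb.openElements()); // prints [body, ruby, rtc, rt]
    }
}
```

As in the existing branch, a tag that arrives while no `ruby` element is open is still dropped in this sketch.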

![](https://imgur.com/X9CXqko.png)

Now Jsoup handles it correctly.

![](https://imgur.com/aNawR6S.png)
12 changes: 7 additions & 5 deletions src/main/java/org/jsoup/parser/HtmlTreeBuilderState.java
@@ -579,12 +579,14 @@ private boolean inBodyStartTag(Token t, HtmlTreeBuilder tb) {
             case "rp":
             case "rt":
                 if (tb.inScope("ruby")) {
-                    tb.generateImpliedEndTags();
-                    if (!tb.currentElement().normalName().equals("ruby")) {
-                        tb.error(this);
-                        tb.popStackToBefore("ruby"); // i.e. close up to but not include name
-                    }
+                    tb.reconstructFormattingElements();
                     tb.insert(startTag);
+                    // tb.generateImpliedEndTags();
+                    // if (!tb.currentElement().normalName().equals("ruby")) {
+                    //     tb.error(this);
+                    //     tb.popStackToBefore("ruby"); // i.e. close up to but not include name
+                    // }
+                    // tb.insert(startTag);
                 }
                 // todo - is this right? drops rp, rt if ruby not in scope?
                 break;