Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Try and preserve the structure of the html during a diff #350

Open
wants to merge 6 commits into
base: master
Choose a base branch
from

Conversation

lonetwin
Copy link

@lonetwin lonetwin commented Sep 12, 2022

There exists a bug in the current htmldiff code, where by the generated diff
changes the structure of the html (notice that the <div id="middle">
appears at the beginning instead of the middle):

>>> from lxml.html import diff
>>> a = "<div id='first'>some old text</div><div id='last'>more old text</div>"
>>> b = "<div id='first'>some old text</div><div id='middle'>and new text</div><div id='last'>more old text</div>"
>>> diff.htmldiff(a, b)
('<div id="middle"> <div id="first"><ins>some old text</ins></div><ins>and new</ins> <del>some old</del> text</div><div id="last">more old text</div>')
>>>

This patchset is an attempt to fix that issue.

lonetwin and others added 5 commits September 12, 2022 19:11
There exists a bug in the current `htmldiff` code, where by the generated diff
changes the structure of the html:

```python
>>> from lxml.html import diff
>>> a = "<div id='first'>some old text</div><div id='last'>more old text</div>"
>>> b = "<div id='first'>some old text</div><div id='middle'>and new text</div><div id='last'>more old text</div>"
>>> diff.htmldiff(a, b)
('<div id="middle"> <div id="first"><ins>some old text</ins></div><ins>and new</ins> <del>some old</del> text</div><div id="last">more old text</div>')
>>>

```

This patchset is an attempt to fix that issue.
Comment on lines -57 to +58
<a href="http://google.com">search <ins> Link: http://google.com</ins>
<del> Link: http://yahoo.com</del> </a>
<a href="http://google.com">search <ins> Link: http://google.com
</ins> <del> Link: http://yahoo.com</del> </a>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Was this a necessary change? It doesn't feel right.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

tbh, didn't feel right to me either but this test breaks without it, and I didn't have much luck trying to figure out why. Initial instinct was that it has something to do with the re.sub in the pwrapped definition at the top of the test file (possibly the + in the re over matches ?). Apologies, I didn't try digging deeper but I'll give it a shot now.

Copy link
Author

@lonetwin lonetwin Dec 15, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So the underlying cause for this appears to be an extra trailing whitespace that appears to get added to the href_token which pushes the line length of the string to over 70 (the line length limit of textwrap, used by pdiff). Here are the diffs with the current lxml code and one taken with this patchset:

current lxml diff:

<a href="http://google.com">search <ins> Link: http://google.com</ins> <del> Link: http://yahoo.com</del> </a>

diff with this patchset:
<a href="http://google.com">search <ins> Link: http://google.com </ins> <del> Link: http://yahoo.com</del> </a>

What I'm not entirely sure about is why the trailing whitespace which ideally should also be present in the current lxml diff (according to call to href_token in fixup_chunks), isn't.

In any case, would you consider this as a regression in behavior worthy of a fix or does this fix of the test case suffice ?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What strikes me is that there is a space before the closing </ins> but not before the closing </del>. Given that there is a space after both of the opening <ins> and <del>, it's debatable whether the new behaviour isn't really better than the old. But I would expect both tag types to be consistent, at least.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree. I'll investigate and attempt to make them consistent.

@@ -87,6 +87,13 @@ Whitespace is generally ignored for the diff but preserved during the diff:
second
<ins>third</ins> </pre>

Ensure we don't preserve the structure of the html during the diff:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"don't" ?

Copy link
Author

@lonetwin lonetwin Sep 26, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah sorry, that was a brain fart ! Good catch. I think I might have wanted to add something like 'Ensure we don't mess with the structure ...' and instead decided to go with the opposite phrasing 'Ensue we preserve ...' but obviously messed up. I'll fix in the next commit.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

Comment on lines 248 to 250
for chunk in ins_chunks:
if chunk in balanced:
doc.append(chunk)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems risky. I'm not sure that we can assume chunks to be unique enough to just do an "in" test here.

Maybe it could help to put the three parts back together, but give each chunk a marker like (chunk, "balanced") and (chunk, "unbalanced"), and then do something based on the marker. Or look at the index boundaries where we switch between the unbalanced and balanced parts.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, agreed that we can't assume the chunks to be unique enough to just do an 'in' test here. I'll think about your suggestion and give this another go. Thanks for your review.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We ran into an issue rising out of precisely the problem you caught ! The visible effect being that the <ins> tags were not being added at places where they ought to be.

I've updated the approach by changing split_unbalanced to optionally return a list of tuples containing the (<index-in-the-input=chunks-list>, <chunk>). This then serves to uniquely identify the chunk across the balanced/unbalanced lists ...which then makes the in operator reliable.

I've defaulted to retain the existing behavior of split_unbalanced since it is also called from cleanup_delete and I didn't want to touch that function.

Hopefully, you find the changes suitable.

…merge_insert`

In the recently revised implementation of `merge_insert` we were checking
whether a tag exists in the `balanced/unbalanced` tags list by referring to
just the tag itself. This is unreliable since the same tag might exist in both
list. Incorrect identification leads to skipping of the `<ins>` tags in some
cases, resulting in incorrect diff being rendered.

This pathset fixes the issue described and adds a test to validate it.
@lonetwin lonetwin requested a review from scoder December 15, 2022 11:57
Comment on lines -57 to +58
<a href="http://google.com">search <ins> Link: http://google.com</ins>
<del> Link: http://yahoo.com</del> </a>
<a href="http://google.com">search <ins> Link: http://google.com
</ins> <del> Link: http://yahoo.com</del> </a>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What strikes me is that there is a space before the closing </ins> but not before the closing </del>. Given that there is a space after both of the opening <ins> and <del>, it's debatable whether the new behaviour isn't really better than the old. But I would expect both tag types to be consistent, at least.

balanced.append(None)
start.extend(
[chunk for name, pos, chunk in tag_stack])
[chunk_to_add for name, pos, chunk_to_add in tag_stack])
balanced = [chunk for chunk in balanced if chunk is not None]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The condition in this line probably doesn't work any more if with_idx=True.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Umm, wouldn't it ? Here balanced would be something like [(1, '<div>'), None, ...] if with_idx==True, so the if chunk is not None check still does the right thing.

Comment on lines +98 to +103
>>> a = "<div><p>Some text that will change</p><p>Some tags will be added</p></div>"
>>> b = "<div><div><p>Some text that has changed a bit</p><p>All of this is new</p></div></div>"
>>> pdiff(a, b)
<div><div><p>Some text that <ins>has changed a bit</ins> </p>
<p><ins>All of this is new</ins></p> </div> <del>will
change</del><p><del>Some tags will be added</del></p> </div>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This diff looks weird. Is it getting confused by the nested <div>s ?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that's correct. In this specific case, the rendered diff looks 'ok' to me (with the insertions preceding the deletions) but yeah, I imagine in larger context it might be less than ideal. I'm going to take another stab at this.

Copy link
Author

@lonetwin lonetwin Dec 15, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

btw, fwiw, this isn't very different from the diff generated by the current code for the same input. The only difference being the closing </p> tag of the first input paragraph is placed after the </ins> instead of at the end of the changes for that paragraph.

current lxml diff:

<div><div><p>Some text that <ins>has changed a bit</ins><p><ins>All of
this is new</ins></p> </p></div> <del>will change</del><p><del>Some
tags will be added</del></p> </div>

diff generated with this patchset

<div><div><p>Some text that <ins>has changed a bit</ins> </p>
<p><ins>All of this is new</ins></p> </div> <del>will
change</del><p><del>Some tags will be added</del></p> </div>

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
2 participants