Cleanup and fix bugs #22

nelhage · 2017-09-11T06:06:00Z

This branch substantially rewrites lcs and merge to be shorter and simpler, and in the process fixes a number of bugs:

fixes weird edge case #19 (bug was in lcs)
fixed Changeset::new ignores difference in blank lines #17 (this bug was in merge)
fixed Fix and re-enable quickcheck fuzzy tests #21 (was revealing bugs in all three of lcs, merge, and check_changeset)

Sorry for the invasive PR, but this was the best way I could find to fix these bugs.

The old one had some edge cases around multiple delimiters (`trim_left_matches` trimming over-aggressively), and I found it harder to follow. We now construct and check a simpler and more explicit pair of invariants: - The join of the (Same, Rem) diff elements should produce exactly the "old" text. - The join of the (Same, Add) diff elements should produce exactly the "new" text.

Use the [standard dynamic-programming LCS algorithm][0]. This is shorter than the previous version, and passes a minimized version of the johannhof#19 test. [0]: https://en.wikipedia.org/wiki/Longest_common_subsequence_problem#Code_for_the_dynamic_programming_solution

Simplify and reduce duplication, and in the process fix an empty-line test case (ala johannhof#17)

These now pass.

nelhage · 2017-09-11T16:31:16Z

I can't seem to figure out how to placate clippy -- it complains about an unused import of difference in src/main.rs, but the import is used, and the build fails without it.

nelhage · 2017-09-11T16:38:41Z

src/lcs.rs

+    for i in 0..N {
+        for j in 0..M {
+            if b[j] == a[i] {
+                if i == 0 || j == 0 {


There's a bunch of explicit edge-case handling here in order to avoid needing an (N+1)*(M+1)-sized array; If you're willing to allocate a buffer row/column of zero values, a bunch of these can go away. Since we're allocating regardless, it might be worth eating the slight memory increase for a substantial code simplification.

This prevents an error about an used import.

renatoathaydes · 2017-10-14T17:01:06Z

I think I am experiencing one or more of these bugs.
When is this PR going to be merged and bug fixes released?

renatoathaydes · 2017-10-14T17:40:57Z

Here's a little test I added to the code in this PR, and it fails:

#[test]
fn test_diff_brief() {
    let text1 = "Hello\nworld";
    let text2 = "Ola\nmundo";

    let changeset = Changeset::new(text1, text2, "\n");

    //assert_eq!(changeset.distance, 4);

    assert_eq!(
        changeset.diffs,
        vec![
            Difference::Same("".to_string()),
            Difference::Rem("Hello".to_string()),
            Difference::Add("Ola".to_string()),
            Difference::Rem("world".to_string()),
            Difference::Add("mundo".to_string()),
        ]
    );
}

Do you agree it should pass? I can try to fix it myself if so.

renatoathaydes · 2017-10-15T11:44:47Z

I changed the test and added a bunch more tests, and I fixed a minor bug in the merge function in a pull-request to this pull-request :)

This crate was extremely under-tested. I hope it now covers most cases in simple tests (not sure what cases the quickcheck test was supposed to be covering, but anyway, given the many bugs this PR fixes, seems like that was not enough).

colin-kiegel · 2017-10-15T15:43:57Z

can we land these fixes soon - or do a fork?

I'm the author of pretty-assertions and I am experiencing these issues downstream too.

When this PR first appeared I was just working on these bugfixes myself. I have yet another lcs implementation ready and I am seriously thinking about cutting the dependency to difference.rs and just implementing my lcs version directly in pretty-assertions.

- this blocks issue #12 - PR johannhof/difference.rs/pull/22 will fix this

nelhage · 2017-10-16T15:10:45Z

Thanks for the votes of support.

@johannhof, if you don't have the time to maintain difference.rs, I'd be happy to join as a co-maintainer or take on ownership of the crate, if you're looking for help. I suspect @colin-kiegel or others might be as well.

If we continue to get no response, I suspect it'll make sense to fork and start a new crate for diff'ing. I'd be happy to contribute this code or otherwise help out if useful.

- Add a few test cases - Include a reference for the LCS algorithm

renatoathaydes · 2017-10-18T17:45:32Z

I wrote an email to @johannhof, hopefully we can continue with development of this lib in some way or another.

But I would like to propose that we modify the algorithm to find diffs.

The reason is that, for example, this test fails:

#[test]
fn test_diff_very_similar_text_with_same_line_count() {
    let text1 = "\
            The sky is blue.\n\
            Plants are green.\n\
            The moon is white.\n\
            The night is black.";

    let text2 = "\
            The sky is green.\n\
            Plants are green.\n\
            The night is white.\n\
            The moon is black.";

    let changeset = Changeset::new(text1, text2, "\n");

    assert_eq!(
        changeset.diffs,
        vec![
            Difference::Rem("The sky is blue.".to_string()),
            Difference::Add("The sky is green.".to_string()),
            Difference::Same("Plants are green.".to_string()),
            Difference::Rem("The moon is white.".to_string()),
            Difference::Add("The night is white.".to_string()),
            Difference::Rem("The night is black.".to_string()),
            Difference::Add("The moon is black.".to_string()),
        ]
    );
}

What we get instead is 2 Rem followed by 2 Add in the end... That's really not the best result, but is intrinsic to the algorithm being used.

The fix seems simple enough: just don't do the "merge" of several successive Rems or Adds at the end! I've implemented that and it's simpler, works better, and I can make a further pull request (on top of this branch currently) if you guys @nelhage @colin-kiegel @johanhelsing agree.

Let me know what you think.

nelhage · 2017-10-18T17:51:31Z

I definitely agree that not merging successive Rem and Adds is a preferable API. I kept the original behavior when making this patch, since it seemed invasive enough as-is; But if we're going to fork or bump major versions or otherwise make large changes already, I'm very in favor of removing the merge behavior. Many consumers of this API seems to end up having to "undo" the merges, anyways.

colin-kiegel · 2017-10-18T21:08:07Z

@renatoathaydes would it be overhead to maintain ~~grouping~~ ordering guarantees? Like Rem, ..., Rem, Add, ..., Add?

renatoathaydes · 2017-10-23T18:29:26Z

@colin-kiegel I had a look. The diff algorithm is just too simple right now.
I made a series of improvements, but to make the diff make more sense (like IntelliJ's diff) it's necessary to "match" not only Same but also "similar" parts of text when there's a way to split the text into a lower separator (e.g. if splitting using "\n", then use " " to check similarities, when using " " as separator, check similarities using "").

I've written a bunch of tests for this, but it's not working yet... we'll have to modify a little bit the lcs function I think to be able to achieve that.

See what I've done so far in this branch if you want to have a look: https://github.com/renatoathaydes/difference.rs/commits/more-tests

colin-kiegel · 2017-10-24T07:36:42Z

@renatoathaydes

Ok, I can live without ordering of successive Adds and Rems. I was just asking. ;-)
You also mentioned nested comparisons. pretty_assertions already does this "on top" of difference.rs - you might want to have a look at this code - the result looks like this:

EDIT: But this discussion seems to drift off-topic of the original PR. Maybe we should continue somewhere else?

renatoathaydes · 2017-10-24T09:14:26Z

@colin-kiegel I will continue trying to "group" the Adds and Removes because that considerably improves the diff.

I agree, we should continue somewhere else, and specifically, we should start a new fork given the differences.rs maintainer has not gotten back to us.

BTW I also implemented the "nested" diff in my lib vinegar :) We have some overlaps, but mostly related to these diffs (we should probably move this nested diff implementation to differences.rs!). vinegar is focussed on providing a tiny BDD framework for Rust similar to Spock, not on the diffs themselves (though good error messages is also a focus).

I am not a Rust expert and already have too many projects to maintain, so it would be nice if you or @nelhage forked this project and we could make more improvements from there.

nelhage · 2017-10-25T14:36:14Z

Hey folks,

I just wanted to say that I've realized I don't have the time and energy to start+maintain a new crate and figure out the API I'd want so on, on my own right now. I'd consider collaborating on such a project if someone else is interested, or anyone is welcome to use my code from this branch if it's helpful.

johannhof · 2017-10-25T14:44:10Z

Hey folks, sorry for the delay on this. As you might have guessed, I've been a bit busy working on other things. I'll give @nelhage and @renatoathaydes full commit rights on this repo. The LCS algorithm is an adaption of the one from Myers' paper, which can be found here: http://xmailserver.org/diff2.pdf

I'll try to parse the contents of this PR and give my opinion on it, but I don't want to block progress and I trust you to take it in the right direction.

nelhage · 2017-10-29T15:37:42Z

Thanks for the commit right, @johannhof. I'm going to merge this PR now, since I'm pretty confident it maintains the API while fixing bugs. If we want to contemplate further changes to the API (not merging ops, ensuring ordering properties on ops), let's pursue that work in new PRs and/or issues.

renatoathaydes · 2017-10-30T13:33:07Z

@nelhage cool, I will start targeting this repo with my new pull requests (even though I got commit rights, I will not use it unless no one else wants to maintain this project and I actually get the time to do that).

@johannhof thanks for the clarification. I will take some time to read this paper and see if I can make my tests pass and still keep the general algorithm, so that the resulting diff is more natural.

nelhage added 4 commits September 11, 2017 06:19

Rewrite lcs()

9656d90

Use the [standard dynamic-programming LCS algorithm][0]. This is shorter than the previous version, and passes a minimized version of the johannhof#19 test. [0]: https://en.wikipedia.org/wiki/Longest_common_subsequence_problem#Code_for_the_dynamic_programming_solution

Rewrite merge().

b8dea98

Simplify and reduce duplication, and in the process fix an empty-line test case (ala johannhof#17)

Re-enable quickcheck tests.

96adbb4

These now pass.

nelhage force-pushed the fix-tests branch from 7c0dbcb to 96adbb4 Compare September 11, 2017 13:29

Placate clippy

73f3702

nelhage force-pushed the fix-tests branch from 7d824c1 to 73f3702 Compare September 11, 2017 16:28

nelhage commented Sep 11, 2017

View reviewed changes

Scope a use under a feature guard

8ef0f92

This prevents an error about an used import.

nelhage force-pushed the fix-tests branch from 63544ae to 8ef0f92 Compare September 11, 2017 17:06

renato added 4 commits October 14, 2017 20:22

Comment attributing algorithm to wikipedia article.

1613146

Run simpler diff algorithm when strings are completely unrelated.

9ef02e0

Consume all lines on diff when merging.

0f8fadb

Simplified merge function.

4a8ad74

colin-kiegel added a commit to rust-pretty-assertions/rust-pretty-assertions that referenced this pull request Oct 15, 2017

stick with upstream difference crate ... for now

366069f

- this blocks issue #12 - PR johannhof/difference.rs/pull/22 will fix this

colin-kiegel added a commit to rust-pretty-assertions/rust-pretty-assertions that referenced this pull request Oct 15, 2017

stick with upstream difference crate ... for now

423d35e

- this blocks issue #12 - PR johannhof/difference.rs/pull/22 will fix this

colin-kiegel mentioned this pull request Oct 15, 2017

Incorrect diff value placement rust-pretty-assertions/rust-pretty-assertions#12

Closed

renato and others added 2 commits October 16, 2017 18:30

Reverted unnecessary changes to merge.

cfaf2e9

Merge pull request #1 from renatoathaydes/multi-line-diff-fixes

271ad04

- Add a few test cases - Include a reference for the LCS algorithm

nelhage merged commit a45509c into johannhof:master Oct 29, 2017

nelhage deleted the fix-tests branch October 29, 2017 15:37

colin-kiegel mentioned this pull request Oct 29, 2017

new release v1.0.1 ? #25

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Cleanup and fix bugs #22

Cleanup and fix bugs #22

nelhage commented Sep 11, 2017

nelhage commented Sep 11, 2017

nelhage Sep 11, 2017

renatoathaydes commented Oct 14, 2017

renatoathaydes commented Oct 14, 2017

renatoathaydes commented Oct 15, 2017 •

edited

Loading

colin-kiegel commented Oct 15, 2017 •

edited

Loading

nelhage commented Oct 16, 2017

renatoathaydes commented Oct 18, 2017

nelhage commented Oct 18, 2017

colin-kiegel commented Oct 18, 2017 •

edited

Loading

renatoathaydes commented Oct 23, 2017

colin-kiegel commented Oct 24, 2017 •

edited

Loading

renatoathaydes commented Oct 24, 2017

nelhage commented Oct 25, 2017

johannhof commented Oct 25, 2017

nelhage commented Oct 29, 2017

renatoathaydes commented Oct 30, 2017

Cleanup and fix bugs #22

Cleanup and fix bugs #22

Conversation

nelhage commented Sep 11, 2017

nelhage commented Sep 11, 2017

nelhage Sep 11, 2017

Choose a reason for hiding this comment

renatoathaydes commented Oct 14, 2017

renatoathaydes commented Oct 14, 2017

renatoathaydes commented Oct 15, 2017 • edited Loading

colin-kiegel commented Oct 15, 2017 • edited Loading

nelhage commented Oct 16, 2017

renatoathaydes commented Oct 18, 2017

nelhage commented Oct 18, 2017

colin-kiegel commented Oct 18, 2017 • edited Loading

renatoathaydes commented Oct 23, 2017

colin-kiegel commented Oct 24, 2017 • edited Loading

renatoathaydes commented Oct 24, 2017

nelhage commented Oct 25, 2017

johannhof commented Oct 25, 2017

nelhage commented Oct 29, 2017

renatoathaydes commented Oct 30, 2017

renatoathaydes commented Oct 15, 2017 •

edited

Loading

colin-kiegel commented Oct 15, 2017 •

edited

Loading

colin-kiegel commented Oct 18, 2017 •

edited

Loading

colin-kiegel commented Oct 24, 2017 •

edited

Loading