Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Cleanup and fix bugs #22

Merged
merged 12 commits into from
Oct 29, 2017
Merged

Cleanup and fix bugs #22

merged 12 commits into from
Oct 29, 2017

Conversation

nelhage
Copy link
Collaborator

@nelhage nelhage commented Sep 11, 2017

This branch substantially rewrites lcs and merge to be shorter and simpler, and in the process fixes a number of bugs:

Sorry for the invasive PR, but this was the best way I could find to fix these bugs.

The old one had some edge cases around multiple
delimiters (`trim_left_matches` trimming over-aggressively), and I
found it harder to follow.

We now construct and check a simpler and more explicit pair of
invariants:

- The join of the (Same, Rem) diff elements should produce exactly the
  "old" text.
- The join of the (Same, Add) diff elements should produce exactly the
  "new" text.
Use the [standard dynamic-programming LCS algorithm][0]. This is
shorter than the previous version, and passes a minimized version of
the johannhof#19 test.

[0]: https://en.wikipedia.org/wiki/Longest_common_subsequence_problem#Code_for_the_dynamic_programming_solution
Simplify and reduce duplication, and in the process fix an empty-line
test case (ala johannhof#17)
@nelhage
Copy link
Collaborator Author

nelhage commented Sep 11, 2017

I can't seem to figure out how to placate clippy -- it complains about an unused import of difference in src/main.rs, but the import is used, and the build fails without it.

for i in 0..N {
for j in 0..M {
if b[j] == a[i] {
if i == 0 || j == 0 {
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's a bunch of explicit edge-case handling here in order to avoid needing an (N+1)*(M+1)-sized array; If you're willing to allocate a buffer row/column of zero values, a bunch of these can go away. Since we're allocating regardless, it might be worth eating the slight memory increase for a substantial code simplification.

This prevents an error about an used import.
@renatoathaydes
Copy link
Contributor

I think I am experiencing one or more of these bugs.
When is this PR going to be merged and bug fixes released?

@renatoathaydes
Copy link
Contributor

Here's a little test I added to the code in this PR, and it fails:

#[test]
fn test_diff_brief() {
    let text1 = "Hello\nworld";
    let text2 = "Ola\nmundo";

    let changeset = Changeset::new(text1, text2, "\n");

    //assert_eq!(changeset.distance, 4);

    assert_eq!(
        changeset.diffs,
        vec![
            Difference::Same("".to_string()),
            Difference::Rem("Hello".to_string()),
            Difference::Add("Ola".to_string()),
            Difference::Rem("world".to_string()),
            Difference::Add("mundo".to_string()),
        ]
    );
}

Do you agree it should pass? I can try to fix it myself if so.

@renatoathaydes
Copy link
Contributor

renatoathaydes commented Oct 15, 2017

I changed the test and added a bunch more tests, and I fixed a minor bug in the merge function in a pull-request to this pull-request :)

This crate was extremely under-tested. I hope it now covers most cases in simple tests (not sure what cases the quickcheck test was supposed to be covering, but anyway, given the many bugs this PR fixes, seems like that was not enough).

@colin-kiegel
Copy link
Collaborator

colin-kiegel commented Oct 15, 2017

can we land these fixes soon - or do a fork?

I'm the author of pretty-assertions and I am experiencing these issues downstream too.

When this PR first appeared I was just working on these bugfixes myself. I have yet another lcs implementation ready and I am seriously thinking about cutting the dependency to difference.rs and just implementing my lcs version directly in pretty-assertions.

colin-kiegel added a commit to rust-pretty-assertions/rust-pretty-assertions that referenced this pull request Oct 15, 2017
colin-kiegel added a commit to rust-pretty-assertions/rust-pretty-assertions that referenced this pull request Oct 15, 2017
@nelhage
Copy link
Collaborator Author

nelhage commented Oct 16, 2017

Thanks for the votes of support.

@johannhof, if you don't have the time to maintain difference.rs, I'd be happy to join as a co-maintainer or take on ownership of the crate, if you're looking for help. I suspect @colin-kiegel or others might be as well.

If we continue to get no response, I suspect it'll make sense to fork and start a new crate for diff'ing. I'd be happy to contribute this code or otherwise help out if useful.

renato and others added 2 commits October 16, 2017 18:30
- Add a few test cases
- Include a reference for the LCS algorithm
@renatoathaydes
Copy link
Contributor

I wrote an email to @johannhof, hopefully we can continue with development of this lib in some way or another.

But I would like to propose that we modify the algorithm to find diffs.

The reason is that, for example, this test fails:

#[test]
fn test_diff_very_similar_text_with_same_line_count() {
    let text1 = "\
            The sky is blue.\n\
            Plants are green.\n\
            The moon is white.\n\
            The night is black.";

    let text2 = "\
            The sky is green.\n\
            Plants are green.\n\
            The night is white.\n\
            The moon is black.";

    let changeset = Changeset::new(text1, text2, "\n");

    assert_eq!(
        changeset.diffs,
        vec![
            Difference::Rem("The sky is blue.".to_string()),
            Difference::Add("The sky is green.".to_string()),
            Difference::Same("Plants are green.".to_string()),
            Difference::Rem("The moon is white.".to_string()),
            Difference::Add("The night is white.".to_string()),
            Difference::Rem("The night is black.".to_string()),
            Difference::Add("The moon is black.".to_string()),
        ]
    );
}

What we get instead is 2 Rem followed by 2 Add in the end... That's really not the best result, but is intrinsic to the algorithm being used.

The fix seems simple enough: just don't do the "merge" of several successive Rems or Adds at the end! I've implemented that and it's simpler, works better, and I can make a further pull request (on top of this branch currently) if you guys @nelhage @colin-kiegel @johanhelsing agree.

Let me know what you think.

@nelhage
Copy link
Collaborator Author

nelhage commented Oct 18, 2017

I definitely agree that not merging successive Rem and Adds is a preferable API. I kept the original behavior when making this patch, since it seemed invasive enough as-is; But if we're going to fork or bump major versions or otherwise make large changes already, I'm very in favor of removing the merge behavior. Many consumers of this API seems to end up having to "undo" the merges, anyways.

@colin-kiegel
Copy link
Collaborator

colin-kiegel commented Oct 18, 2017

@renatoathaydes would it be overhead to maintain grouping ordering guarantees? Like Rem, ..., Rem, Add, ..., Add?

@renatoathaydes
Copy link
Contributor

@colin-kiegel I had a look. The diff algorithm is just too simple right now.
I made a series of improvements, but to make the diff make more sense (like IntelliJ's diff) it's necessary to "match" not only Same but also "similar" parts of text when there's a way to split the text into a lower separator (e.g. if splitting using "\n", then use " " to check similarities, when using " " as separator, check similarities using "").

I've written a bunch of tests for this, but it's not working yet... we'll have to modify a little bit the lcs function I think to be able to achieve that.

See what I've done so far in this branch if you want to have a look: https://github.com/renatoathaydes/difference.rs/commits/more-tests

@colin-kiegel
Copy link
Collaborator

colin-kiegel commented Oct 24, 2017

@renatoathaydes

  • Ok, I can live without ordering of successive Adds and Rems. I was just asking. ;-)
  • You also mentioned nested comparisons. pretty_assertions already does this "on top" of difference.rs - you might want to have a look at this code - the result looks like this:

EDIT: But this discussion seems to drift off-topic of the original PR. Maybe we should continue somewhere else?

@renatoathaydes
Copy link
Contributor

@colin-kiegel I will continue trying to "group" the Adds and Removes because that considerably improves the diff.

I agree, we should continue somewhere else, and specifically, we should start a new fork given the differences.rs maintainer has not gotten back to us.

BTW I also implemented the "nested" diff in my lib vinegar :) We have some overlaps, but mostly related to these diffs (we should probably move this nested diff implementation to differences.rs!). vinegar is focussed on providing a tiny BDD framework for Rust similar to Spock, not on the diffs themselves (though good error messages is also a focus).

I am not a Rust expert and already have too many projects to maintain, so it would be nice if you or @nelhage forked this project and we could make more improvements from there.

@nelhage
Copy link
Collaborator Author

nelhage commented Oct 25, 2017

Hey folks,

I just wanted to say that I've realized I don't have the time and energy to start+maintain a new crate and figure out the API I'd want so on, on my own right now. I'd consider collaborating on such a project if someone else is interested, or anyone is welcome to use my code from this branch if it's helpful.

@johannhof
Copy link
Owner

Hey folks, sorry for the delay on this. As you might have guessed, I've been a bit busy working on other things. I'll give @nelhage and @renatoathaydes full commit rights on this repo. The LCS algorithm is an adaption of the one from Myers' paper, which can be found here: http://xmailserver.org/diff2.pdf

I'll try to parse the contents of this PR and give my opinion on it, but I don't want to block progress and I trust you to take it in the right direction.

@nelhage
Copy link
Collaborator Author

nelhage commented Oct 29, 2017

Thanks for the commit right, @johannhof. I'm going to merge this PR now, since I'm pretty confident it maintains the API while fixing bugs. If we want to contemplate further changes to the API (not merging ops, ensuring ordering properties on ops), let's pursue that work in new PRs and/or issues.

@nelhage nelhage merged commit a45509c into johannhof:master Oct 29, 2017
@nelhage nelhage deleted the fix-tests branch October 29, 2017 15:37
@colin-kiegel colin-kiegel mentioned this pull request Oct 29, 2017
@renatoathaydes
Copy link
Contributor

@nelhage cool, I will start targeting this repo with my new pull requests (even though I got commit rights, I will not use it unless no one else wants to maintain this project and I actually get the time to do that).

@johannhof thanks for the clarification. I will take some time to read this paper and see if I can make my tests pass and still keep the general algorithm, so that the resulting diff is more natural.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Fix and re-enable quickcheck fuzzy tests weird edge case Changeset::new ignores difference in blank lines
4 participants