CollectionAssert.AreEquivalent is extremely slow #2799

Closed
nvborisenko opened this issue Apr 9, 2018 · 34 comments · Fixed by #3831

Labels
is:bug pri:high requires:releasenotes Issues that should be included in the release notes

Comments

@nvborisenko
Contributor

var lines1 = new List<string>();
var lines2 = new List<string>();

for (int i = 0; i < 10000; i++)
{
  lines1.Add(i.ToString());
  lines2.Add(i.ToString());
}

CollectionAssert.AreEquivalent(lines1, lines2);

Verifying 10,000 string elements takes 10 seconds. Verifying 100k elements may take more than 30 minutes (I could not wait for it to finish). This behavior occurs in NUnit 3.10 and is not reproducible in 3.9.

@CharliePoole
Member

Several of us made changes to collection assert in that period. I'll take a look to see if one of mine caused this.

@mikkelbu
Member

mikkelbu commented Apr 9, 2018

Actually, I think that my change in #2600 caused the problem (I've not had time to check it yet, as I'm halfway through a VS upgrade). In the case above we have to run through the entire remaining collection for every match we want to remove. Before my change we would find each match immediately as the first element.

@jnm2
Contributor

jnm2 commented Apr 9, 2018

The ideal situation might be to implement IEqualityComparer in all our internal comparer classes so they can be used with HashSet and Dictionary. AreEquivalent would then use HashSet.SetEquals with O(n + m) complexity, where n and m are the counts of the two collections.
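
For illustration, a minimal sketch of how that could look once a real IEqualityComparer<T> is available (the helper name and shape are hypothetical, not NUnit internals):

using System.Collections.Generic;

static class EquivalenceSketch
{
    // Set-style equivalence in O(n + m); note this ignores duplicate counts,
    // which Is.EquivalentTo does take into account (see the later comments).
    public static bool AreSetEquivalent<T>(
        IEnumerable<T> expected,
        IEnumerable<T> actual,
        IEqualityComparer<T> comparer)
    {
        var expectedSet = new HashSet<T>(expected, comparer);
        return expectedSet.SetEquals(actual);
    }
}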

@CharliePoole
Member

I'm wondering whether the various collection constraints should be separated in their implementations, optimized individually, and only then possibly recombined to remove duplication.

@jnm2
Contributor

jnm2 commented Apr 9, 2018

That sounds like a good idea. Either way, the collection equivalence constraint itself will probably need to use HashSet.SetEquals eventually to get to the standard O(n+m) people expect. It won't be able to do that unless all our internal NUnit equality comparers implement IEqualityComparer.

@CharliePoole
Member

Implementing IEqualityComparer means either (1) forgetting about tolerance or (2) making tolerance a constructor argument for comparers and using a new one either for every comparison or perhaps caching a few of them. Could get complicated.
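
As a rough sketch of option (2) and of where the complication lies, a hypothetical comparer taking the tolerance as a constructor argument; the hash-code side shows the problem, since no hash function can respect a non-zero tolerance without violating the IEqualityComparer contract:

using System;
using System.Collections.Generic;

// Hypothetical illustration only; NUnit's real comparers are chain-based.
sealed class ToleranceDoubleComparer : IEqualityComparer<double>
{
    private readonly double _tolerance;

    public ToleranceDoubleComparer(double tolerance) => _tolerance = tolerance;

    public bool Equals(double x, double y) => Math.Abs(x - y) <= _tolerance;

    // With a non-zero tolerance, "equal" values cannot all be mapped to the
    // same hash bucket, so a hash-based collection cannot be used safely.
    public int GetHashCode(double value) =>
        _tolerance == 0
            ? value.GetHashCode()
            : throw new NotSupportedException("No hash consistent with a tolerance.");
}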

@jnm2
Contributor

jnm2 commented Apr 10, 2018

That's an interesting question. Is it even meaningful to test for set equality with tolerance? I don't know what set equality means without transitive element equality comparisons, and if you have tolerance, you don't have transitive equality. My attempt at searching didn't turn anything up. It would be cool to know.

We'd only need to adapt to IEqualityComparer when using a hash table. We'd only use a hash table for operations that compare two collections. A hash table in theory requires transitive equality comparisons and transitive hash codes. This is what the IEqualityComparer contract demands, so if it's inherently possible to use a hash table for the comparison, it's inherently possible to adapt our element comparer to IEqualityComparer.

Let's say there is a hypothetical comparison we want NUnit to do between two collections which is not inherently possible to losslessly adapt to IEqualityComparer. If that is the case, that would mean the comparison cannot even in theory make use of a hash table. We would then have to fall back to something slower. But before we worry about that, does such a comparison meaningfully exist?

@herebebeasties

herebebeasties commented Apr 16, 2018

I think set equality with tolerance does make sense. Essentially, "is it possible to match every item in set A with one or more items in set B using equality function F such that no item in set B is left unmatched". But I think that's always going to be O(n*m) in the general case.

Even items that implement IComparable such that you could potentially sort them first and then compare the streams pairwise might have out-of-order matches between the sets, so that wouldn't be good enough. To really speed it up you'd need the ability to map your items into some kind of lower-precision ordered map (e.g. Math.Round() and search in adjacent buckets). None of that would really work with a general IEqualityComparer<T> and you'd still need to flag every item that's within tolerance and then see if there are any left over afterwards.

I don't think I'd expect NUnit to try to solve any of that for me out of the box.

ETA: As @jnm2 says, all uses of "set" in this answer should really be replaced with "bag", as there may be duplicate items.
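
For what it's worth, a rough sketch of the lower-precision bucketing idea for numeric values, assuming a positive tolerance (a hypothetical helper, and the greedy pairing can still mis-report in contrived overlap cases):

using System;
using System.Collections.Generic;

static class ToleranceBagMatch
{
    // Tries to pair each item of 'a' with a distinct item of 'b' within 'tolerance'.
    public static bool TryMatch(IReadOnlyList<double> a, IReadOnlyList<double> b, double tolerance)
    {
        if (a.Count != b.Count) return false;

        // Bucket the items of 'b' by their value scaled to the tolerance.
        var buckets = new Dictionary<long, List<double>>();
        foreach (var value in b)
        {
            long key = (long)Math.Round(value / tolerance);
            if (!buckets.TryGetValue(key, out var list)) buckets[key] = list = new List<double>();
            list.Add(value);
        }

        // Each item of 'a' may only match in its own bucket or an adjacent one.
        foreach (var value in a)
        {
            long key = (long)Math.Round(value / tolerance);
            bool matched = false;
            for (long k = key - 1; k <= key + 1 && !matched; k++)
            {
                if (buckets.TryGetValue(k, out var list))
                {
                    int i = list.FindIndex(x => Math.Abs(x - value) <= tolerance);
                    if (i >= 0) { list.RemoveAt(i); matched = true; }
                }
            }
            if (!matched) return false;
        }
        return true;
    }
}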

@jnm2
Contributor

jnm2 commented Apr 26, 2018

@herebebeasties Good point. I was talking about set equality, but Is.EquivalentTo goes beyond set equality since that compares the number of repeated values. (The conceptual equivalent of comparing two Dictionary<T, int> countsByValue rather than comparing two HashSet<T> uniqueValues.)

We'd have to have an algorithm that not only did the O(m*n) initial mapping but also proceeded by trying every combination of overlaps to see if it could match the items 1:1. It seems like that would have factorial complexity on top of the n*m, depending on how connected the mapping is (worse with larger tolerances). This logic doesn't sound familiar, so I wonder if NUnit is currently returning wrong results for EquivalentTo in combination with Tolerance. Separate issue to fix, though.

How does it sound to implement the logic to first determine whether the element comparison is transitive, and if so, use HashSet<>.SetEquals or a Dictionary?
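
A minimal sketch of that counts-by-value comparison, assuming a transitive comparer and non-null elements (a hypothetical helper, not NUnit's CollectionTally):

using System.Collections.Generic;

static class BagEquivalenceSketch
{
    // Bag (multiset) equivalence in roughly O(n + m): equal elements must
    // occur the same number of times in both collections.
    public static bool AreBagEquivalent<T>(
        IEnumerable<T> expected, IEnumerable<T> actual, IEqualityComparer<T> comparer)
    {
        var counts = new Dictionary<T, int>(comparer);
        foreach (var item in expected)
        {
            counts.TryGetValue(item, out int n);
            counts[item] = n + 1;
        }
        foreach (var item in actual)
        {
            // An item with no remaining count is extra in 'actual'.
            if (!counts.TryGetValue(item, out int n) || n == 0) return false;
            counts[item] = n - 1;
        }
        // Any non-zero count left over means 'expected' had items 'actual' lacked.
        foreach (var n in counts.Values)
            if (n != 0) return false;
        return true;
    }
}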

@jnm2 jnm2 added this to the 3.11 milestone Apr 26, 2018
@jnm2
Contributor

jnm2 commented Apr 26, 2018

Also, any takers? If this interests you, please jump in!
If not, I'll do it as one of my next PRs.

@jnm2
Contributor

jnm2 commented Apr 26, 2018

Well, this went quite a bit quicker than I anticipated! @ggeurts has started a PR already.

@ggeurts
Contributor

ggeurts commented Apr 27, 2018

Yes, I have started work on a PR, but still need to work out a way to get a hash code provider where that is possible for equivalence operations.

@jnm2
Contributor

jnm2 commented Apr 27, 2018

@ggeurts I'm thinking we need IChainComparer to have something like a bool TryGetHashCodeProvider(out Func<object, int> hashCodeProvider) method.

TimeSpanToleranceComparer would return false if a nonzero tolerance is specified. StringComparer would return StringComparer.Ordinal.GetHashCode or StringComparer.CurrentCultureIgnoreCase.GetHashCode, whichever matches the equality comparison it's doing. Etc.
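
Sketched out, the suggestion might look roughly like this (shape only; IChainComparer's existing members are omitted and the interface name here is hypothetical):

using System;

// Hypothetical addition to NUnit's internal comparer chain.
internal interface IChainComparerWithHashing
{
    // Returns false when no hash provider consistent with the equality
    // semantics exists, e.g. a TimeSpan comparison with a non-zero tolerance.
    bool TryGetHashCodeProvider(out Func<object, int> hashCodeProvider);
}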

@ggeurts
Contributor

ggeurts commented Apr 27, 2018

For the enumerable, array, dictionary, and similar comparers, custom hash code implementations will be needed for sorted and unsorted scenarios. I am experimenting with int? IChainComparer.GetHashCode(object o) implementations.

@jnm2
Contributor

jnm2 commented Apr 27, 2018

@ggeurts Since the hash code provider can be picked without examining a particular object o, it seems like it would perform better to make that decision once (TryGetHashCodeProvider) and reuse it for all elements. Unless I'm missing something?

@ggeurts
Contributor

ggeurts commented Apr 27, 2018

Is the collection element type always known?

@jnm2
Contributor

jnm2 commented Apr 27, 2018

@CharliePoole
Member

@ggeurts In fact, as it seemed to me in that PR, analyzing the args to decide if the element type can be known in advance is most likely the key to an efficient implementation. The current inefficient implementation has the virtue of actually working for object[] arguments containing mixed types, but that's probably not something that really happens a lot.

@ggeurts
Contributor

ggeurts commented Apr 27, 2018 via email

@CharliePoole
Member

@ggeurts Yes, I agree. The full set of criteria (as far as I've figured out) for a more efficient approach is:

  1. All elements must be the same type, which is most easily guaranteed if it's a typed array or other strongly typed enumerable.

  2. Must not be using a non-zero tolerance.

  3. Must not be a type handled in a non-standard way by our equality comparer.

You could probably cross off the third point by providing special IEqualityComparers for such types.
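
Expressed as a rough sketch under those criteria (a hypothetical helper; criterion 3 is left to the caller, and the real decision would key off NUnitEqualityComparer state not shown here):

using System;
using System.Collections;
using System.Collections.Generic;
using System.Linq;

static class FastPathSketch
{
    // Only take a hash-based path when the collection exposes a single
    // concrete element type (criterion 1) and no tolerance is in play
    // (criterion 2).
    public static bool CanUseHashedEquivalence(IEnumerable collection, bool hasNonZeroTolerance)
    {
        if (hasNonZeroTolerance) return false;

        Type elementType = collection.GetType()
            .GetInterfaces()
            .Where(i => i.IsGenericType && i.GetGenericTypeDefinition() == typeof(IEnumerable<>))
            .Select(i => i.GetGenericArguments()[0])
            .FirstOrDefault();

        return elementType != null && elementType != typeof(object);
    }
}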

ggeurts added a commit to ggeurts/nunit that referenced this issue May 4, 2018
Extended NUnitEqualityComparer with hash code calculations to speed up collection equivalence tests

Fix spelling mistake
@ggeurts
Contributor

ggeurts commented May 4, 2018

I have extended NUnitEqualityComparer and related classes (such as the chain comparers and EqualityAdapter) with hash code generation. For most collection equivalence tests the complexity is now O(log(n)) instead of O(n*m). The exception is where external comparers based on IComparer or Comparison delegates are used, because it is not possible to calculate consistent hash codes for those cases.

It is possible to speed up collection equivalence tests further by providing CollectionTally with an optimized IEqualityComparer implementation rather than it always using NUnitEqualityComparer for comparisons. However, I prefer to first wait for review of the changes committed so far.

@ggeurts
Contributor

ggeurts commented May 7, 2018

@jnm2 @CharliePoole Is ggeurts@d893d83 ok for review or do I need to break it down into smaller commits?

@jnm2
Contributor

jnm2 commented May 22, 2018

@ggeurts Sorry, I haven't had a lot of free time! That is difficult to review all at once. How much work do you think it is to break it into smaller descriptive commits?

@ggeurts
Contributor

ggeurts commented May 25, 2018

I will break down the changes into smaller commits to simplify review. However, my spare time is limited as well. It may take a short while to get there.

@rprouse rprouse removed this from the 3.11 milestone Aug 16, 2018
@nvborisenko
Contributor Author

Hello Team,

Bumping this issue.

Actually, this issue is not a big problem for us, because we decided not to use NUnit on one of our projects. But I remember spending several days figuring out what was wrong in my continuous integration setup when running super simple tests there. I was quite surprised when I saw the actual reason the test execution was hanging.

To save other people that trouble, let's review the fix. Thanks!

@cidthecoatrack

I would like to bump this - it's still open, and I definitely still see the issue in 3.12.0

@rprouse
Member

rprouse commented Apr 19, 2021

From #3825, this has gotten even worse in 3.13. Here is the example code from @NightElfik:

[Test]
public void NUunitRepro() {
	var dict = new Dictionary<int, string>();
	for (int i = 0; i < 10000; ++i) {
		dict.Add(i, i.ToString());
	}
	CollectionAssert.AreEquivalent(dict.Values.AsEnumerable(),
		Enumerable.Range(0, 10000).Select(j => j.ToString()));
}

My timings for various versions of NUnit:

  • 3.9 => 27ms
  • 3.10.1 => 6.7 sec
  • 3.11 => 6.8 sec
  • 3.12 => 6.9 sec
  • 3.13 => 17.8 sec
  • 3.13.1 => 17.8 sec

We should probably dust off PR #2830 and see if we can get it finished.

@rprouse rprouse added this to the 3.13.2 milestone Apr 19, 2021
@rprouse rprouse assigned rprouse and unassigned ggeurts Apr 19, 2021
@rprouse
Member

rprouse commented Apr 19, 2021

I've been profiling the code and going through it all day. The fact that our constraints are not generic and take in objects is part of the problem, but most of it is a build-up of small features that we have accepted over time. The recent increase in time looks to be caused by recursion detection that was fixed recently. Other increases are the addition of new comparers for new types and fixes for equivalence.

@NightElfik, @nvborisenko and @cidthecoatrack, both of these code examples determine the equivalence of two collections that are in the same order. AreEquivalent checks whether the collections contain the same items in any order, so it is expensive because it must compare each item in actual to every item in expected.

Switching the above example to the following runs in 0.037 sec.

[Test]
public void CollectionAreEqualTests()
{
    var dict = new Dictionary<int, string>();
    for (int i = 0; i < 10000; ++i)
    {
        dict.Add(i, i.ToString());
    }
    CollectionAssert.AreEqual(dict.Values.AsEnumerable(),
        Enumerable.Range(0, 10000).Select(j => j.ToString()));
}

@cidthecoatrack

I have used an OrderBy expression, coupled with Is.EqualTo, as a workaround for some time now, so the recommendation is valid. But the beauty of Is.EquivalentTo is that it compares what is missing and what is extra, not just that index i is different. Could we maybe extend the equivalence check to do some internal sorting of some kind and then essentially do the same as Is.EqualTo, but instead of bailing out at the first unmatched index, keep assessing?
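
A rough sketch of that sort-then-keep-going idea, assuming the elements are mutually comparable (a hypothetical helper; not how EquivalentTo works today):

using System.Collections.Generic;
using System.Linq;

static class SortedDiffSketch
{
    // Sorts both collections and walks them pairwise, collecting what is
    // missing and what is extra instead of stopping at the first mismatch.
    public static (List<T> Missing, List<T> Extra) Diff<T>(
        IEnumerable<T> expected, IEnumerable<T> actual, IComparer<T> comparer)
    {
        var e = expected.OrderBy(x => x, comparer).ToList();
        var a = actual.OrderBy(x => x, comparer).ToList();
        var missing = new List<T>(); // in expected, not matched in actual
        var extra = new List<T>();   // in actual, not matched in expected

        int i = 0, j = 0;
        while (i < e.Count && j < a.Count)
        {
            int c = comparer.Compare(e[i], a[j]);
            if (c == 0) { i++; j++; }
            else if (c < 0) missing.Add(e[i++]);
            else extra.Add(a[j++]);
        }
        missing.AddRange(e.Skip(i));
        extra.AddRange(a.Skip(j));
        return (missing, extra);
    }
}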

@CharliePoole
Member

It may be important to remember that NUnit's CollectionAssert, although modeled after Microsoft's, works on IEnumerable as well as ICollection. Of course, that could be taken into account in the implementation.

@stevenaw
Member

It's separate from most of the ideas discussed in this issue and the corresponding PR, but something I've been trying to wrap my head around recently is how it might look to make the internal IChainComparer determination less O(n) and more O(1). The comparer used is largely based on the types of the objects being compared, so this might be a good opportunity for caching (potentially pre-seeded for some simple primitives like int and string).

@rprouse's point about furthering generic constraint usage could also help there (though I suspect it would make it harder to port to 3.13.x in a non-breaking way).
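
A minimal sketch of such a type-keyed cache (entirely hypothetical; the selector delegate stands in for whatever walks the comparer chain):

using System;
using System.Collections.Concurrent;

// Caches which comparer handled a given pair of runtime types, so the
// linear walk over the comparer chain happens once per type pair.
sealed class ComparerCache<TComparer>
{
    private readonly ConcurrentDictionary<(Type, Type), TComparer> _cache =
        new ConcurrentDictionary<(Type, Type), TComparer>();
    private readonly Func<Type, Type, TComparer> _selectComparer;

    public ComparerCache(Func<Type, Type, TComparer> selectComparer)
    {
        _selectComparer = selectComparer;
    }

    public TComparer GetFor(object x, object y)
    {
        return _cache.GetOrAdd((x.GetType(), y.GetType()),
                               key => _selectComparer(key.Item1, key.Item2));
    }
}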

@rprouse
Member

rprouse commented Apr 20, 2021

I made a small breakthrough this morning. If the two collections are in the same order it is really slow, but if the two lists are the reverse of each other, it is fast.

This takes 22 seconds

[Test]
public void IntCollectionAreEquivalentTests()
{
    var actual = Enumerable.Range(0, SIZE);
    var expected = Enumerable.Range(0, SIZE);
    CollectionAssert.AreEquivalent(actual, expected);
}

This takes 0.006 seconds

[Test]
public void IntCollectionReversedAreEquivalentTests()
{
    var actual = Enumerable.Range(0, SIZE);
    var expected = Enumerable.Range(0, SIZE).Select(i => SIZE - i - 1);
    CollectionAssert.AreEquivalent(actual, expected);
}

My initial thinking is that CollectionTally.TryRemove iterates through the actual collection in reverse order, making the evaluation of the two lists O(n) for reversed lists but O(n*n) for lists in the same order.

public void TryRemove(object o)
{
    for (int index = _missingItems.Count - 1; index >= 0; index--)
    {
        if (ItemsEqual(_missingItems[index], o))
        {
            _missingItems.RemoveAt(index);
            return;
        }
    }

    _extraItems.Add(o);
}

I expect that actual and expected are usually in at least a similar order, so an easy fix would be to reverse the iteration in the for statement, making lists in the same order the best-case scenario and fully reversed lists the worst case.

Then we can apply an optimization by sorting the two collections in the tally if they are sortable. Determining whether the collections are sortable is the hard part. My thinking there is that a collection is sortable if:

  • It is strongly typed by being generic. All items in the collection need to be the same type to be sortable.
  • The type of the items is numeric, a string, or Comparer<T>.Default finds an IComparable<T> for the type.

I'll do some experimentation with that, but I'd appreciate it if the @nunit/framework-team could poke holes in my sortable logic. I'm sure I am missing something 😄
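
For concreteness, a rough sketch of that sortability test (a hypothetical helper; only the Comparer<T>.Default / IComparable part of the criteria is expressed):

using System;

static class SortabilitySketch
{
    // Treat a strongly typed collection as sortable when its element type is
    // something Comparer<T>.Default can handle, i.e. it implements
    // IComparable<T> or the non-generic IComparable (covers numerics and string).
    public static bool IsSortableElementType(Type elementType)
    {
        if (elementType == null || elementType == typeof(object)) return false;

        return typeof(IComparable).IsAssignableFrom(elementType)
            || typeof(IComparable<>).MakeGenericType(elementType).IsAssignableFrom(elementType);
    }
}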

@siprbaum

@mikkelbu
Member

mikkelbu commented Apr 20, 2021

Regarding the ordering: I noted above - #2799 (comment) - (and later lost track of it) that I introduced this problem when I reversed the iteration, so that we delete from the back of the list instead of the front (the example at the time compared two large collections that mostly consisted of 0s).

So reversing the iteration back to the original code will probably make performance worse for collections consisting of few distinct values.
