Better grouping algorithm #51
As soon as dupeGuru looks for duplicates that are not exactly the same, the issue of discarded matches comes up. Some matches have to be discarded because one side of the match is already part of a group that the other side can't be in.
But after a quick glance at the grouping code, it seems possible that a match is discarded on the basis that one side is an unconfirmed part of a group. If that file is never confirmed, it means that some discarded matches could be used to safely make new groups without conflicting with any other group.
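
To illustrate the idea, here is a minimal sketch of a two-pass grouping. It is not dupeGuru's actual grouping code: the function name `group_matches`, the `(first, second, score)` tuple layout, and the rule that a file is confirmed only when it matches every current member of a group are all assumptions made for the example. The first pass attaches files to groups and sets aside every match that confirmed nothing; the second pass regroups the set-aside matches whose two sides were never confirmed anywhere.

```python
from collections import defaultdict


def group_matches(matches):
    """Group (first, second, score) matches. Hypothetical sketch only."""
    # Who matches whom, used to check a newcomer against every group member.
    neighbours = defaultdict(set)
    for a, b, _ in matches:
        neighbours[a].add(b)
        neighbours[b].add(a)

    groups = []    # each group is a set of confirmed members
    claimed = {}   # file -> group it was attached to, confirmed or not
    leftover = []  # matches that did not end up confirming both sides

    for match in sorted(matches, key=lambda m: -m[2]):
        a, b, _ = match
        ga, gb = claimed.get(a), claimed.get(b)
        if ga is None and gb is None:
            group = {a, b}
            groups.append(group)
            claimed[a] = claimed[b] = group
            continue
        group = ga if ga is not None else gb
        other = gb if ga is not None else ga
        if other is not None and other is not group:
            # Sides are attached to different groups: discard for now.
            leftover.append(match)
            continue
        confirmed_both = True
        for f in (a, b):
            claimed[f] = group
            if f in group:
                continue
            if group <= neighbours[f]:
                group.add(f)  # f matches every member: confirmed
            else:
                confirmed_both = False  # f stays an unconfirmed candidate
        if not confirmed_both:
            leftover.append(match)

    # Second pass: a leftover match whose two sides were never confirmed
    # in any group cannot conflict with the groups built above.
    confirmed = set().union(*groups)
    orphans = [m for m in leftover
               if m[0] not in confirmed and m[1] not in confirmed]
    if orphans:
        groups += group_matches(orphans)
    return groups


matches = [
    ("A", "B", 100), ("A", "C", 90),
    ("E", "F", 85), ("E", "D", 82),
    ("C", "D", 80),
]
groups = group_matches(matches)
```

In this example, `C` is attached to the `A`/`B` group and `D` to the `E`/`F` group, but neither is ever confirmed, so the `C`/`D` match is discarded in the first pass. Because both files remain unconfirmed, the second pass can turn that match into a group of its own. The recursion terminates because the best match of any non-empty input always forms a group, so the orphan list is strictly smaller than the input.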