Remove dublicates from list of dicts #127

Closed
carlbordum opened this issue Jun 10, 2017 · 5 comments
@carlbordum

A method, a function, or perhaps even a frozendict could help achieve this in a clean way.

@mahmoud
Owner

mahmoud commented Jun 18, 2017

Hi @Zaab1t! This is definitely one of the more common requests I get. It's mostly a matter of how you want to do duplicate detection. In most cases, people have an "id" key of some sort, so they can do something like:

from boltons import iterutils

dupe_dicts = [{"id": 1, "val": 3}, {"id": 2, "val": 5}, {"id": 1, "val": 1}]

# unique() keeps the first occurrence of each key and preserves input order
deduped_dicts = iterutils.unique(dupe_dicts, key=lambda x: x.get('id'))

print(deduped_dicts)  # [{'id': 1, 'val': 3}, {'id': 2, 'val': 5}]

But it gets more complex as the data structures become more nested and the equality test stricter. frozendict would help in some cases, but even that won't cover a highly nested dictionary. Did you have a specific use case you can share?
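For the highly nested case, one hedged sketch is to recursively convert dicts, lists, and sets into hashable equivalents and deduplicate on that. The `freeze` and `unique_nested_dicts` helpers below are hypothetical, not part of boltons; `freeze` could equally be passed as the `key=` to `iterutils.unique`:

    def freeze(obj):
        """Recursively convert dicts, lists, and sets into hashable equivalents."""
        if isinstance(obj, dict):
            return frozenset((k, freeze(v)) for k, v in obj.items())
        if isinstance(obj, (list, tuple)):
            return tuple(freeze(v) for v in obj)
        if isinstance(obj, set):
            return frozenset(freeze(v) for v in obj)
        return obj

    def unique_nested_dicts(iterable):
        """Drop later deep-equal duplicates; keeps input order."""
        seen, out = set(), []
        for d in iterable:
            key = freeze(d)
            if key not in seen:
                seen.add(key)
                out.append(d)
        return out

    nested = [{"id": 1, "tags": ["a", "b"]},
              {"id": 2, "tags": ["c"]},
              {"id": 1, "tags": ["a", "b"]}]
    print(unique_nested_dicts(nested))
    # [{'id': 1, 'tags': ['a', 'b']}, {'id': 2, 'tags': ['c']}]

This only handles dicts, lists, tuples, and sets; any other unhashable type would still raise a TypeError.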

@carlbordum
Author

Well, I fetch my data online, and sometimes the same data appears multiple times, which causes problems. My current solution:

def remove_dublicate_dicts(iterable):
    """Only keep one copy of each dict with the exact same key/value
    pairs.

    :rtype: list of dicts.
    """
    s = set()
    for d in iterable:
        # sort items so the key doesn't depend on dict iteration order
        hashable_dict = tuple(sorted(d.items()))
        s.add(hashable_dict)
    return [dict(item) for item in s]

>>> remove_dublicate_dicts([{'a': 123, 'b': 0}, {'a': 123, 'b': 1}, {'a': 123, 'b': 0}])
[{'a': 123, 'b': 0}, {'a': 123, 'b': 1}]
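One caveat with the set-based version above: sets don't preserve insertion order, so the output order is arbitrary and the doctest's ordering isn't actually guaranteed. An order-preserving sketch of the same idea (hypothetical helper name, and it still assumes all values are hashable):

    def remove_duplicate_dicts_ordered(iterable):
        """Keep the first occurrence of each dict with identical key/value
        pairs, preserving input order."""
        seen = set()
        result = []
        for d in iterable:
            key = tuple(sorted(d.items()))  # sorted, so key order doesn't matter
            if key not in seen:
                seen.add(key)
                result.append(d)
        return result

    print(remove_duplicate_dicts_ordered(
        [{'a': 123, 'b': 0}, {'a': 123, 'b': 1}, {'a': 123, 'b': 0}]))
    # [{'a': 123, 'b': 0}, {'a': 123, 'b': 1}]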

@mahmoud
Owner

mahmoud commented Jun 18, 2017

Right, so your dicts' values are not nested and all hashable, which means you can just do:

unique_dicts = iterutils.unique(dupe_dicts, key=lambda d: d.items())

And you should be good! :)

@carlbordum
Author

I see. Taking care of nested mutable values wouldn't be fun. I'm gonna take a shot at it though. See how pretty a solution we can come up with :)

@tiwo
Contributor

tiwo commented Jul 16, 2017

In Python 3, d.items() is a view object, which isn't hashable, so it needs to be key=lambda d: tuple(d.items()) or key=lambda d: frozenset(d.items()).
On the other hand, while dicts and lists don't support hashing, they do have "deep" equality comparison; of course, then you'd have to compare each new item against each past item, which is inefficient.

The same applies in Python 2, where d.items() is a list, I believe, and the order isn't guaranteed.
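To make the frozenset-keyed approach concrete, here is a sketch using a plain loop (`unique_by` is a minimal stand-in for iterutils.unique, not the boltons implementation), showing that frozenset(d.items()) is both hashable and insensitive to key order:

    def unique_by(iterable, key):
        """First occurrence wins; later items with the same key are dropped."""
        seen = set()
        out = []
        for item in iterable:
            k = key(item)
            if k not in seen:
                seen.add(k)
                out.append(item)
        return out

    dupes = [{'a': 1, 'b': 2}, {'b': 2, 'a': 1}, {'a': 1, 'b': 3}]
    deduped = unique_by(dupes, key=lambda d: frozenset(d.items()))
    print(deduped)  # [{'a': 1, 'b': 2}, {'a': 1, 'b': 3}]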

@mahmoud mahmoud closed this as completed Dec 17, 2017