Remove dublicates from list of dicts #127

Closed
carlbordum opened this issue Jun 10, 2017 · 5 comments
@carlbordum

A method, a function, or perhaps even a frozendict could help achieve this in a clean way.

@mahmoud
Owner

mahmoud commented Jun 18, 2017

Hi @Zaab1t! This is definitely one of the more common requests I get. It's mostly a matter of how you want to do duplicate detection. In most cases, people have an "id" key of some sort, so they can do something like:

from boltons import iterutils

dupe_dicts = [{"id": 1, "val": 3}, {"id": 2, "val": 5}, {"id": 1, "val": 1}]

# unique() keeps the first occurrence of each key and preserves input order
deduped_dicts = iterutils.unique(dupe_dicts, key=lambda x: x.get('id'))

print(deduped_dicts)  # [{'id': 1, 'val': 3}, {'id': 2, 'val': 5}]

But it gets more complex as the data structures become more nested and the equality test stricter. frozendict would help in some cases, but even that won't cover a highly nested dictionary. Did you have a specific use case you can share?
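For the highly nested case, one hedged sketch is to recursively convert dicts, lists, and sets into hashable equivalents and deduplicate on that. The `freeze` and `unique_nested_dicts` helpers below are hypothetical, not part of boltons; `freeze` could equally be passed as the `key=` to `iterutils.unique`:

    def freeze(obj):
        """Recursively convert dicts, lists, and sets into hashable equivalents."""
        if isinstance(obj, dict):
            return frozenset((k, freeze(v)) for k, v in obj.items())
        if isinstance(obj, (list, tuple)):
            return tuple(freeze(v) for v in obj)
        if isinstance(obj, set):
            return frozenset(freeze(v) for v in obj)
        return obj

    def unique_nested_dicts(iterable):
        """Drop later deep-equal duplicates; keeps input order."""
        seen, out = set(), []
        for d in iterable:
            key = freeze(d)
            if key not in seen:
                seen.add(key)
                out.append(d)
        return out

    nested = [{"id": 1, "tags": ["a", "b"]},
              {"id": 2, "tags": ["c"]},
              {"id": 1, "tags": ["a", "b"]}]
    print(unique_nested_dicts(nested))
    # [{'id': 1, 'tags': ['a', 'b']}, {'id': 2, 'tags': ['c']}]

This only handles dicts, lists, tuples, and sets; any other unhashable type would still raise a TypeError.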

@carlbordum
Author

Well, I fetch my data online, and sometimes the same data appears multiple times, which causes problems. My current solution:

def remove_dublicate_dicts(iterable):
    """Only keep one copy of each dict with the exact same key/value
    pairs.

    :rtype: list of dicts.
    """
    s = set()
    for d in iterable:
        # sort items so the key doesn't depend on dict iteration order
        hashable_dict = tuple(sorted(d.items()))
        s.add(hashable_dict)
    return [dict(item) for item in s]

>>> remove_dublicate_dicts([{'a': 123, 'b': 0}, {'a': 123, 'b': 1}, {'a': 123, 'b': 0}])
[{'a': 123, 'b': 0}, {'a': 123, 'b': 1}]
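One caveat with the set-based version above: sets don't preserve insertion order, so the output order is arbitrary and the doctest's ordering isn't actually guaranteed. An order-preserving sketch of the same idea (hypothetical helper name, and it still assumes all values are hashable):

    def remove_duplicate_dicts_ordered(iterable):
        """Keep the first occurrence of each dict with identical key/value
        pairs, preserving input order."""
        seen = set()
        result = []
        for d in iterable:
            key = tuple(sorted(d.items()))  # sorted, so key order doesn't matter
            if key not in seen:
                seen.add(key)
                result.append(d)
        return result

    print(remove_duplicate_dicts_ordered(
        [{'a': 123, 'b': 0}, {'a': 123, 'b': 1}, {'a': 123, 'b': 0}]))
    # [{'a': 123, 'b': 0}, {'a': 123, 'b': 1}]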

@mahmoud
Owner

mahmoud commented Jun 18, 2017

Right, so your dicts' values are not nested and all hashable, which means you can just do:

unique_dicts = iterutils.unique(dupe_dicts, key=lambda d: d.items())

And you should be good! :)

@carlbordum
Author

I see. Taking care of nested mutable values wouldn't be fun. I'm gonna take a shot at it though. See how pretty a solution we can come up with :)

@tiwo
Contributor

tiwo commented Jul 16, 2017

In Python 3, d.items() is a view object, which isn't hashable, so it needs to be key=lambda d: tuple(d.items()) or key=lambda d: frozenset(d.items()).
On the other hand, while dicts and lists don't support hashing, they do have "deep" equality comparison; of course, then you'd have to compare each new item against each past item, which is inefficient.

The same applies in Python 2, where d.items() is a list, I believe, and the order isn't guaranteed.
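To make the frozenset-keyed approach concrete, here is a sketch using a plain loop (`unique_by` is a minimal stand-in for iterutils.unique, not the boltons implementation), showing that frozenset(d.items()) is both hashable and insensitive to key order:

    def unique_by(iterable, key):
        """First occurrence wins; later items with the same key are dropped."""
        seen = set()
        out = []
        for item in iterable:
            k = key(item)
            if k not in seen:
                seen.add(k)
                out.append(item)
        return out

    dupes = [{'a': 1, 'b': 2}, {'b': 2, 'a': 1}, {'a': 1, 'b': 3}]
    deduped = unique_by(dupes, key=lambda d: frozenset(d.items()))
    print(deduped)  # [{'a': 1, 'b': 2}, {'a': 1, 'b': 3}]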

@mahmoud mahmoud closed this as completed Dec 17, 2017