
Raise ValueError on attribute or data_key collision #992

Merged
merged 3 commits into dev from dev_raise_on_attribute_or_data_key_collision on Oct 12, 2018

Conversation

@lafrech (Member) commented Oct 11, 2018

Closes #967.

The idea is to prevent a user from creating a Schema with two fields sharing the same data_key or the same attribute, since the behaviour would be undefined:

  • when dumping if several fields have the same data_key
  • when loading if several fields have the same attribute

Note that several fields can still have the same attribute if

  • only one of them is not dump_only (i.e. only one of them is loaded), or
  • the Schema is only used for dumping

(and conversely for data_key and load).

The commit subject and the error message do not say it explicitly, but this also checks collisions between data_key/attribute and field names (for fields with no data_key/attribute modifier).

This is a breaking change: before it, someone could write a Schema with duplicate attributes, provided the schema was only used for dumping. This can still be achieved, but the duplicate attributes must now be marked dump_only (and conversely for data_key and load). Hence the change to the fixtures in test/base.py.
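
A minimal sketch of the new behaviour (illustrative only; the schema and field names are made up, and exactly when the ValueError is raised depends on where the check hooks into schema construction):

from marshmallow import Schema, fields

# Two fields dumping to the same data_key 'name': the dump output would be
# undefined, so the schema now fails with a ValueError instead of silently
# overwriting one field.
class AmbiguousSchema(Schema):
    name = fields.String()
    full_name = fields.String(data_key='name')

# Two fields reading from the same attribute 'created' remain legal because
# only one of them is loaded (the other is dump_only), so loading stays
# unambiguous.
class StillValidSchema(Schema):
    created = fields.DateTime()
    created_pretty = fields.String(attribute='created', dump_only=True)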

Pros:

  • Prevents hard-to-debug errors

Cons:

  • Makes a legitimate corner case (a Schema with duplicates that is only used for dumping or loading) a bit more verbose to write
  • Possible performance impact at schema initialization?

TODO:

  • Changelog
  • Docs (upgrading)

if len(load_attributes) != len(set(load_attributes)):
    attributes_duplicates = {x for x in load_attributes if load_attributes.count(x) > 1}
    raise ValueError('Duplicate attributes: {}'.format(attributes_duplicates))

return fields_dict


I'm not sure what the code style/practice for the project is with regard to utility functions for everything, but I might consider putting the duplicate detection/selection into a separate function. Things could be abbreviated to something like the following:

duplicate_dump_data_keys = find_duplicates([
    obj.data_key or name for name, obj in iteritems(fields_dict) if not obj.load_only
])
if duplicate_dump_data_keys is not None:
    raise ValueError('Duplicate data_keys: {}'.format(duplicate_dump_data_keys))

@lafrech (Member, Author)


Yeah, I wondered what I could factor out, and I figured factoring out just the set comprehension was not worth it.

But indeed, find_duplicates could also handle the len(list) != len(set) shortcut comparison (returning None when the lengths match) and the duplicate list generation (the set comprehension); something like the sketch after this comment.

Thanks for reviewing.
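
For reference, a rough sketch of what such a find_duplicates helper might have looked like with the length shortcut folded in (hypothetical code, not part of the PR, which kept the checks inline; the attribute-side filter mirrors the data_key snippet above and is my assumption):

def find_duplicates(items):
    # Cheap length comparison first: the nominal no-duplicates case returns
    # early, and the quadratic set comprehension only runs when needed.
    if len(items) == len(set(items)):
        return None
    return {x for x in items if items.count(x) > 1}

duplicate_load_attributes = find_duplicates([
    obj.attribute or name for name, obj in iteritems(fields_dict) if not obj.dump_only
])
if duplicate_load_attributes is not None:
    raise ValueError('Duplicate attributes: {}'.format(duplicate_load_attributes))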

@lafrech (Member, Author)


We could shorten the code by always computing the list of duplicates first, but I figured it would be better performance-wise to stay fast in the nominal no-duplicates case and only compute the list when duplicates exist. This is a design trade-off that makes the find_duplicates helper a bit less generic. We could also leave it inlined here; it's not a huge code duplication.

No strong feeling either way.

@sloria (Member)


I lean towards keeping this as is, if only for the performance consideration @lafrech pointed out. And the duplication is minimal.


Starting with the length check is a good trade-off, given that errors like these are rare. If the duplicate check is factored out into a helper, the length shortcut should move into that function too. From a quick test on a 12-item list, the set comprehension takes about 2.7 µs per run (the presence of duplicates changes the time by only a few percent), while comparing the lengths up front takes about 0.5 µs. There's also a middle-ground approach that builds up a set of already-seen values and compares new values against it (sketched below).

But as @sloria says, the duplication is fairly minimal and there don't appear to be other places in the codebase that need duplicate detection :)
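
A sketch of that middle-ground, single-pass variant, as I read the suggestion (not code from the PR):

def find_duplicates_single_pass(items):
    # Walk the list once, remembering which values have been seen; any value
    # encountered a second time is a duplicate. This avoids the quadratic
    # count() in the set comprehension while only passing over the list once.
    seen = set()
    duplicates = set()
    for item in items:
        if item in seen:
            duplicates.add(item)
        else:
            seen.add(item)
    return duplicates or None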

@sloria (Member) left a comment


Looks good to me.

Go ahead with the docs updates, then feel free to merge, @lafrech.

@lafrech force-pushed the dev_raise_on_attribute_or_data_key_collision branch from cdc0498 to 64106ed on October 12, 2018 at 12:47
@lafrech merged commit 0760d30 into dev on Oct 12, 2018
@lafrech deleted the dev_raise_on_attribute_or_data_key_collision branch on October 12, 2018 at 12:55
Successfully merging this pull request may close this issue: Merge nested data with the same 'data_key'