Raise ValueError on attribute or data_key collision #992
Conversation
if len(load_attributes) != len(set(load_attributes)):
    attributes_duplicates = {x for x in load_attributes if load_attributes.count(x) > 1}
    raise ValueError('Duplicate attributes: {}'.format(attributes_duplicates))

return fields_dict
I'm not sure what the code style/practice for the project is with regard to utility functions, but I might consider putting the duplicate detection/selection into a separate function. Things could be abbreviated to something like the following:
duplicate_dump_data_keys = find_duplicates([
    obj.data_key or name for name, obj in iteritems(fields_dict) if not obj.load_only
])
if duplicate_dump_data_keys is not None:
raise ValueError('Duplicate data_keys: {}'.format(duplicate_dump_data_keys))
Yeah, I wondered what I could factor out and figured that factoring out just the set comprehension was not worth it.
But indeed, find_duplicates can also handle the len(list) != len(set) shortcut comparison (return None if the lengths are equal) and the duplicate list generation (the set comprehension).
Thanks for reviewing.
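For the record, a minimal sketch of what such a find_duplicates helper could look like, with the length shortcut and the return-None-when-there-are-no-duplicates contract described above (the helper itself is hypothetical, not an existing marshmallow utility):

def find_duplicates(values):
    """Return the set of duplicated values, or None if there are none.

    ``values`` is expected to be a list (the callers above pass one).
    """
    # Fast path for the nominal case: no duplicates at all.
    if len(values) == len(set(values)):
        return None
    # Only build the set of duplicates when we know there is at least one.
    return {x for x in values if values.count(x) > 1}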
We could shorten the code by always calculating the list of duplicates first, but I figured it would be better performance-wise to take the fast path in the nominal no-duplicates case and only build the list when there are duplicates. This is a design choice/trade-off that makes the find_duplicates function a bit less generic. We could also leave it inlined here; that's not a huge code duplication.
No strong feeling either way.
I lean towards keeping this as is, if only for the performance consideration @lafrech pointed out. And the duplication is minimal.
Starting with the length check is a good trade-off, given that errors like these are rare. If the duplicate check is factored out, the length check should be part of that function. From a quick test on a 12-item list, the set comprehension takes about 2.7 µs per run (the presence of duplicates changes the time by no more than a few per cent), while checking the lengths ahead of time takes 0.5 µs. There's also a middle-ground approach, which relies on a set of already-seen non-duplicate values to compare against (sketched below).
But as @sloria says, the duplication is fairly minimal and there don't appear to be other places in the codebase that need duplicate calculation :)
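A sketch of that middle-ground approach, collecting duplicates in a single pass against a set of values already seen (again, the helper name and contract are illustrative):

def find_duplicates(values):
    """Single pass: compare each value against the set of values seen so far."""
    seen = set()
    duplicates = set()
    for value in values:
        if value in seen:
            duplicates.add(value)
        else:
            seen.add(value)
    # Keep the same contract as above: None when there are no duplicates.
    return duplicates or None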
Looks good to me.
Go ahead with the docs updates, then feel free to merge, @lafrech.
Force-pushed from cdc0498 to 64106ed
Closes #967.
The idea is to prevent a user from creating a Schema with two fields that have the same data_key or the same attribute, as the behaviour would be undefined.
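For illustration, a hypothetical pair of schemas (class and field names are made up, and whether the ValueError is raised at class definition time or at instantiation depends on where the check runs):

from marshmallow import Schema, fields


# Still allowed (see the note below): two fields sharing an attribute, as
# long as only one of them is actually loaded (the other is dump_only).
class AuditSchema(Schema):
    created = fields.DateTime(attribute='timestamp', dump_only=True)
    updated = fields.DateTime(attribute='timestamp')


# Rejected by this change: two fields mapped to the same data_key, so the
# result of loading {'name': ...} would be undefined.
class CollidingSchema(Schema):
    username = fields.Str(data_key='name')
    nickname = fields.Str(data_key='name')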
Note that several fields can have the same attribute if they are dump_only (i.e. only one of them is loaded) or if the Schema is used only for dumping (and conversely for data_key and load).

The commit subject and the error message do not say it explicitly, but this also checks collisions between data_key/attribute and field names (for fields with no data_key/attribute modifier).

This is a breaking change: before the change, someone could write a Schema with duplicate attributes, provided the schema was only used for dumping. This can still be achieved, but the duplicate attributes must be marked dump_only (and conversely for data_key and load). Hence the change to the fixtures in tests/base.py.

Pro:

Cons:
- makes some edge cases (a Schema with duplicates but only used for dumping/loading) a bit more verbose to write

TODO: