Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix dereference #21

Merged
merged 9 commits into from Mar 2, 2017
Merged

Fix dereference #21

merged 9 commits into from Mar 2, 2017

Conversation

ilex
Copy link
Contributor

@ilex ilex commented Feb 19, 2017

What do these changes do:

  • Remove ReferenceField._is_instance property as it is useless (field is
    shared between model instances).
  • Rewrite dereferencing to fix some issues:
    • Dereference unhashable ids (auto dereferencing still does not work for
      this case as reference id is treated as dereferenced document not as an id).
    • Set reference field value to None if there is no document with such
      id (as auto dereferencing does).
    • Fix bug when some fields reference to different collections but have
      the same id. So that only one of that field will be set with correct
      value as the document_map was {id --> document}. Example:
class User(MongoModel):
    name = fields.CharField(primary_key=True)
class Category(MongoModel):
    title = fields.CharField(primary_key=True)
class Post(MongoModel):
    user = fields.ReferenceField(User)
    category = fields.ReferenceField(Category)
user = User(name='Bob').save()
category = Category(title='Bob').save()
post = Post(user=user, category=category)
with no_auto_dereference(post):
    post.refresh_from_db()
    dereference(post)
    # one of the follow statements will fail as both post.user and post.category
    # will be User (or Category).
    assert isinstance(post.user, User)
    assert isinstance(post.category, Category)
  • Dereference DBRefs in dict-like fields. As there is no field for DBRef the only way to store it
    is within dict-like fields (DictField, OrderedDictField). So dereference traverses such containers to find DBRef values and try to dererference them.

If reference is not found while dereferencing it should set to None.
- Remove ReferenceField._is_instance propery as it is useless (field is
  shared between model instances).
- Rewrite dereferencing to fix some issues:
  - Dereference unhashable ids (still auto dereferencing does not work for
    this case).
  - Set reference field value to None if there is no document with such
    id (as auto dereferencing does).
  - Fix bug when some fields reference to different collections but have
    the same id. So that only one of that field will be set with correct
    value as the document_map was {id --> document}.
  - Dereference DBRefs in dict-like fields.
@ilex
Copy link
Contributor Author

ilex commented Feb 19, 2017

There are also some thoughts about unhashable ids:

  • As mention in mongodb documentation _id is immutable so seems it should be hashable.
  • Seems that auto dereferencing does not work for unhashable reference ids.
  • There is also issue with reference_map:
class User(MongoModel):
    name = fields.CharField(primary_key=True)
class Comment(EmbeddedMongoModel):
    body = fields.CharField()
    user = fields.ReferenceField(User)
class Post(MongoModel):
    comments = fields.EmbeddedDocumentListField(Comment)

user = User(name='bob').save()
post = Post(comments=[
    Comment('comment1',user=user), 
    Comment('comment2',user=user),
    Comment('comment3',user=user),
    Comment('comment4',user=user)]).save()
# if we dereference such post
# a reference_map for 'user' collection will be:
# {'user': ['bob', 'bob', 'bob', 'bob']} and db request will be something like:
# db.user.find({_id: {$in:['bob', 'bob', 'bob', 'bob']})
# and we can't use ``set`` instead of ``list`` as id could be unhashable...

@@ -91,13 +95,16 @@ def _find_references(model_instance, reference_map, fields=None):


def _resolve_references(database, reference_map):
document_map = _ObjectMap()
document_map = {}
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This could be cleaner with a defaultdict(_ObjectMap).

# check auto_dereferncing
# hand.refresh_from_db()
# self.assertIsInstance(hand.cards[0], Card)
# self.assertIsInstance(hand.cards[1], Card)
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be removed.

comment.refresh_from_db()
with no_auto_dereference(comment):
dereference(comment)
self.assertIsNone(comment.post)
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would also be nice to show that the id for the deleted object is still accessible if accessed without automatic dereferencing.

@llvtt
Copy link

llvtt commented Feb 22, 2017

Hello again!

Thank you for bringing these issues to our attention, and thanks again for fixing them!

I propose that we hold off on scanning all dictionaries for DBRefs for now. It would probably be better to implement a DBRefField or GenericReferenceField that stores its references as DBRefs, so that we can refine our scan to only those fields that are likely to contain DBRefs instead of scanning all dictionaries.

I think I know what needs to happen to make auto dereferencing work with unhashable ids:

  1. The dereference_id function needs to convert the model_id to whatever mongo type is necessary by calling model_class._mongometa.pk.to_mongo(model_id) on it. The way we call find_one in dereference_id also needs to be changed, since PyMongo could interpret the model_id as the query itself, rather than the _id, if model_id is a collections.Mapping type. So we'll need to be explicit with the query: {'_id': model_class._mongometa.pk.to_mongo(model_id)}. Together, these changes will allow us to make queries on the id, where the id is a dict/model.
  2. In EmbeddedDocumentField.to_mongo, we should call to_son on value only if value is an instance of self.related_model. Otherwise, we should just return the value. A parallel change should happen for EmbeddedDocumentListField. This change will prevent us from calling to_son on dictionaries that we might pass to to_mongo from dereference_id.
  3. If a model id is a dict (perhaps one that hasn't been instantiated yet to an EmbeddedMongoModel), ReferenceField will always try to turn it into self.related_model. This won't work. We'll need to add another check to see if the dict represents self.related_model. We can check with if get_document(value.get('_cls')) == self.related_model.

As for the issue with reference_map, are you referring to the duplicate ids within the $in query? The query does produce the expected results, even if it isn't as efficient as possible. We could fix it by implementing something similar to _ObjectMap, but for sets. We could add ids to this data structure, then turn it into a list before executing the query.

Thanks again for the pull request! Please let me know if I can help you in any way.

ilex added 4 commits February 22, 2017 10:37
As it would be better to implement a DBRefField or
GenericReferenceField that stores its references as DBRefs.
Remove unnecessary _get_value and _set_value functions as
container is always a model_instance._data which is a dict.
Copy link

@llvtt llvtt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is great work. Thanks for helping out.

pymodm/fields.py Outdated
# if value is a dict convert in to model
# so we can properly generate SON
if isinstance(value, dict):
return related_model.from_document(value).to_son()
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The only thing that SON is going to add over a dict is a deterministic key ordering. If it's not SON to begin with, then we've already lost that, so there's no point in trying to convert it to SON here.

Probably the cases for this should be:

  1. Check if instance of self.related_model
  2. Check if instance of dict
  3. Raise ValidationError

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are right if it is a dict we can't use it in mongodb as {'a': 1, 'b': 2} and {'b': 2, 'a': 1} are different values for mongodb. So it checks:

  1. Is it a SON value (it has been converted already and we do not need to convert it again).
  2. Is it an instance.
  3. Is it a dict - then we try to create a model instance and then create a son (as model should always generate SON with the same keys order, doesn't it?)

So to_mongo will always generate SON object with the same keys order. If it is needed to check if SON is correctly converted to Model Instance too then we can combine dict and SON under one branch but again with converting to model first.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see your point. to_son iterates the fields of the model class and outputs keys in a deterministic order, as you noted. That makes sense now. Carry on!

pymodm/fields.py Outdated

# TODO: Should we raise a ValidationError as we could not
# convert value to SON?
return value
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a good idea. Maybe something along the lines of:

raise ValidationError('%s is not a valid %s' % (value, self.related_model.__name__))

pymodm/fields.py Outdated
def to_mongo(self, value):
return value.to_son()
return EmbeddedDocumentField._value_to_mongo(self.related_model,
value)
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd prefer to move _value_to_mongo to RelatedModelFieldsBase as an instance/classmethod and call it something like _model_to_document, to make it clearer what it's doing. Then, we can call it here like self._model_to_document(...).

pymodm/fields.py Outdated
@@ -1106,7 +1125,8 @@ def to_python(self, value):
for item in value]

def to_mongo(self, value):
return [doc.to_son() for doc in value]
return [EmbeddedDocumentField._value_to_mongo(self.related_model, doc)
for doc in value]
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems strange to call this method in another class that isn't in this class' hierarchy. See above comment for a suggestion on how to clear this up.

pymodm/fields.py Outdated
elif isinstance(value, self.related_model):
return value
try:
doc = self.related_model.from_document(value)
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can just return here and avoid setting doc = value above.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well we can but that requires to change from_document. And maybe that is the right way.
Here

cls_name = dct.get('_cls')
if there is a _cls field in value we create a cls from that field and it does not check if the cls is subclass of current cls so we can do something like this without any exception:

post = User.from_document({'title': 'I am a Post', '_cls': 'Post'})
assert isinstance(post, Post) == True

Without these changes we should check if doc is instance of related_model or not. It could be a reference id in form of Document.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see. For example, value could be a dict that represents an embedded document, which is also the primary key. That makes sense now. Thank you for your explanation. I prefer it this way to changing from_document.

pymodm/fields.py Outdated
return value
try:
doc = self.related_model.from_document(value)
except Exception:
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think only ValueError is possible here, from this line:

raise ValueError(
. No ValidationErrors should be triggered by this. Are there other cases I'm not thinking of?

pymodm/fields.py Outdated
elif self.model._mongometa._auto_dereference:
# Attempt to dereference the value as an id.
dereference_id = _import('pymodm.dereference.dereference_id')
return dereference_id(self.related_model, value)

# TODO: should we convert value to mongo here?
# return self.related_model._mongometa.pk.to_mongo(value)
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right. We should do that here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After this change this test

Comment(post=1234).full_clean()

start to fail cause it try to convert 1234 into ObjectId. As 1234 is not a valid value is that ok that full_clean does not raise ValidationError?

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The purpose of that test is to show that you can assign the id of a referenced document directly and it will pass validation. However, the test value "1234" is not valid. My opinions:

  • Convert using "to_python" (not to_mongo) here, which could yield a different result. Since we're asking for the python value, and not the mongo one, that's what's expected here.

  • The test is using a bad test value. We should change it to a valid ObjectId hex string like "58b477046e32ab215dca2b57".

  • I think there's a separate problem with validation happening in ReferenceField. Really, the test that fails with your changes should already be failing. We should just use to_mongo as our validation function, like we do with Decimal128Field. We don't actually need the separate validate_related_model function, since checking that the pk is defined is already done in to_mongo.

  • I notice that the error message strings from validate_related_model and to_mongo are slightly different, so that may cause some tests to fail. Those tests should be updated to reflect the new error message.

@@ -150,6 +150,9 @@ class CardIdentity(EmbeddedMongoModel):
suit = fields.IntegerField(
choices=(HEARTS, DIAMONDS, SPADES, CLUBS))

class Meta:
final = final_value
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice.

pymodm/fields.py Outdated
@@ -1213,6 +1240,4 @@ def __get__(self, inst, owner):
return self

def __set__(self, inst, value):
MongoModel = _import('pymodm.base.models.MongoModel')
super(ReferenceField, self).__set__(inst, value)
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since this isn't doing anything anymore, we can just remove it.

Copy link

@llvtt llvtt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for your explanations.

pymodm/fields.py Outdated
# if value is a dict convert in to model
# so we can properly generate SON
if isinstance(value, dict):
return related_model.from_document(value).to_son()
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see your point. to_son iterates the fields of the model class and outputs keys in a deterministic order, as you noted. That makes sense now. Carry on!

pymodm/fields.py Outdated
elif isinstance(value, self.related_model):
return value
try:
doc = self.related_model.from_document(value)
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see. For example, value could be a dict that represents an embedded document, which is also the primary key. That makes sense now. Thank you for your explanation. I prefer it this way to changing from_document.

@llvtt llvtt merged commit 2c7c1ab into mongodb:master Mar 2, 2017
@llvtt
Copy link

llvtt commented Mar 2, 2017

Merged!

Many thanks for the hard work you put into this. These changes will be in the forthcoming 0.3.1 release.

@ilex ilex deleted the fix_dereference branch March 2, 2017 08:22
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
2 participants