Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Performance of serializing nested collections is poor #10

Closed
sloat opened this issue Dec 28, 2013 · 7 comments
Closed

Performance of serializing nested collections is poor #10

sloat opened this issue Dec 28, 2013 · 7 comments

Comments

@sloat
Copy link

sloat commented Dec 28, 2013

I worked up a quick test using the nose timed decorator.

class TestSerializerTime(unittest.TestCase):

    def setUp(self):
        self.users = []
        self.blogs = []
        letters = list(string.ascii_letters)

        for i in range(500):
            self.users.append(User(''.join(random.sample(letters, 15)),
                email='jiod@fjios.com', age=random.randint(10, 50)))

        for i in range(500):
            self.blogs.append(Blog(''.join(random.sample(letters, 50)),
                user=random.choice(self.users)))

    @timed(.2)
    def test_small_blog_set(self):
        res = BlogSerializer(self.blogs[:20], many=True)

    @timed(.4)
    def test_medium_blog_set(self):
        res = BlogSerializer(self.blogs[:250], many=True)

    @timed(1)
    def test_large_blog_set(self):
        res = BlogSerializer(self.blogs, many=True)

    @timed(.1)
    def test_small_user_set(self):
        res = UserSerializer(self.users[:20], many=True)

    @timed(.2)
    def test_medium_user_set(self):
        res = UserSerializer(self.users[:250], many=True)

    @timed(.5)
    def test_large_user_set(self):
        res = UserSerializer(self.users, many=True)

The user tests all pass, but the medium and large blog tests do not. Obviously, these could pass on some machines, but it's still rather slow.

I did a little bit more testing with profile. Serializing the whole blog collection was running between 5 and 6s.

It looks like the bottleneck is the deepcopy operation in serializer.py and it doesn't seem like the call can be removed, or changed to a pickle/unpickle operation.

I'm going to keep digging to see what I can do. If you have any insight, I'd appreciate the help. Thanks!

@sloria
Copy link
Member

sloria commented Dec 29, 2013

The deepcopy operation is expensive, but necessary, so that serializers can store errors from nested serializers.

I did a little work with cProfile and your code above (the gist of the script is here) and found 2 significant speedups:

  • Passing an instance--not a class--into a Nested field.

Example:

collaborators = fields.Nested(UserSerializer(), many=True)

instead of

collaborators = fields.Nested(UserSerializer, many=True)

This avoids repeating the initialization code (including the deepcopy) for each collaborator. In the future, it'll be better to cache the nested serializer object, or disallow passing classes altogether.

  • Overriding __deepcopy__ method of field objects so they are only shallow copied (9c0f062). Even though the declared_fields dictionary must be deep-copied, field objects themselves don't need to be deep-copied.

These two modifications decreased the execution time of the above script by almost half.

Thanks for reporting this. I will continue to do more profiling and see where performance can be improved even further.

@sloria
Copy link
Member

sloria commented Dec 29, 2013

I underestimated the effect of passing in an instance into a nested Field: doing this for both the "user" and the "collaborators" field reduces the total runtime of the profiling script from ~5.7s to ~1.6s.

class BlogSerializer(Serializer):
    title = fields.String()
    user = fields.Nested(UserSerializer())
    collaborators = fields.Nested(UserSerializer(), many=True)
    categories = fields.List(fields.String)
    id = fields.String()

@sloat
Copy link
Author

sloat commented Dec 29, 2013

That's great! Very interesting...

Thanks again!

@sloria
Copy link
Member

sloria commented Dec 29, 2013

Glad I could help.

I've made some further improvements so that performance should be good whether you pass in a Serializer class or a Serializer object into a Nested field.

@sloria sloria closed this as completed Dec 29, 2013
@andras-tim
Copy link

I have same issue with serialize, and I was run into this issue.

I think, you should update the documentation http://marshmallow.readthedocs.org/en/latest/nesting.html about use instance instead of class passing trough.

andras-tim added a commit to andras-tim/StoreKeeper that referenced this issue May 15, 2015
@mgd722
Copy link

mgd722 commented May 29, 2018

@sloria, just to be clear-- performance should now be similar in both of the following scenarios:

fields.Nested(UserSerializer(), many=True)
fields.Nested(UserSerializer, many=True)

so if nested fields are running slow for me it's just my shitty code?

@sloria
Copy link
Member

sloria commented May 30, 2018

@mgd722 I haven't compared the two usages in a while, but they should be similar. If you're serializing ORM objects, I'd first look into your relationship loading technique and make sure you're not running into the n+1 problem.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants