Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[RFC] set_class = OrderedSet by default? #1744

Closed
lafrech opened this issue Feb 12, 2021 · 7 comments · Fixed by #1896
Closed

[RFC] set_class = OrderedSet by default? #1744

lafrech opened this issue Feb 12, 2021 · 7 comments · Fixed by #1896
Milestone

Comments

@lafrech
Copy link
Member

lafrech commented Feb 12, 2021

I assume the performance impact would be minimal. From a quick look, it would only affect schema instantiation.

The benefit is that when using Python 3.7+, all schemas would be ordered. They still wouldn't return OrderedDict instances unless ordered Meta is passed, but they would keep fields declaration order in their output and in their schema list, so that order would be respected in API responses and apispec users would get their fields documented in declaration order.

Currently, users can achieve this in a base schema with

class OrderedBaseSchema(Schema):
    set_class = OrderedSet

My gut feeling is that this change would have no functional downside and a negligible perf impact at field init (less that we care about when adding a new feature), while using ordered=True does have an impact on serialization performance (I didn't measure it) and functionality (equality test depends on order), hence my proposal. There would still be an interest in allowing the use of OrderedDict.

If we go this route, we could use OrderedSet by default in marshmallow 3.x, and in marshmallow 4 perhaps rework the ordered feature because schemas would already be ordered so it would be specifically for users really needing an OrderedDict instance. Perhaps drop the ordered Meta option and let the user specify a dict class of their choice another way. But that's another story.

See marshmallow-code/flask-smorest#228 (comment).

@sloria
Copy link
Member

sloria commented Mar 18, 2021

I'm not opposed to this, so long as there is no perf impact (I think your intuition on this is right).

@lafrech
Copy link
Member Author

lafrech commented Apr 8, 2021

Or since this is easily achieved in user code thanks to a base schema as shown above, we postpone this to marshmallow 4 as described in my last paragraph.

Adding v4 milestone for now.

@lafrech lafrech added this to the 4.0 milestone Apr 8, 2021
@deckar01
Copy link
Member

deckar01 commented Oct 26, 2021

using ordered=True does have an impact on serialization performance

I believe that is caused by OrderedDict. None of the OrderedSet instance attributes are ever accessed during dump(), only the OrderedDict dump_fields. Even though dict is ordered now, it is still faster than OrderedDict. Waiting until 3.7+ would avoid using OrderedDict also.

OrderedDict Benchmark
#!/usr/bin/env python3.9

from datetime import datetime
from collections import OrderedDict


def run(D):
    A = [('a', None), ('b', None), ('r', None), ('c', None), ('d', None)]
    B = [('1', None), ('2', None), ('3', None)]
    t0 = datetime.now()
    for i in range(5000000):
        s = D(A)
        s['e'] = None
        del s['e']
        s.update(B)
    t = datetime.now() - t0
    return t


a = run(dict)
b = run(OrderedDict)
m = min(a, b)

print(f'dict\t\t{a}\t{100*(a/m-1):+.0f}%')
print(f'OrderedDict\t{b}\t{100*(b/m-1):+.0f}%')
dict            0:00:02.280595  +0%
OrderedDict     0:00:04.608753  +102%

OrderedSet is significantly slower than set.

OrderedSet Benchmark
#!/usr/bin/env python3.9

from datetime import datetime
from marshmallow.orderedset import OrderedSet


def run(D, i=200000):
    A = ['a', 'b', 'c', 'd']
    B = D(['1', '2', '3'])
    t0 = datetime.now()
    for _ in range(i):
        s = D(A)
        s.add('e')
        s.remove('e')
        s &= B
    return datetime.now() - t0


a = run(set)
b = run(OrderedSet)
m = min(a, b)

print(f'set\t\t{a}\t{100*(a/m-1):+.0f}%')
print(f'OrderedSet\t{b}\t{100*(b/m-1):+.0f}%')
set             0:00:00.065251  +0%
OrderedSet      0:00:01.575174  +2314%

If an application has schema instantiation in its hot loop I think the vast majority of the execution time is going to be consumed by memory allocation. +1 for dropping (or deprecating + ignoring) ordered in 4.0 and using OrderedSet (and dict). I have read that there are more performant implementations of OrderedSet out there that are wrappers around the 3.7+ ordered dict.

@lafrech
Copy link
Member Author

lafrech commented Oct 27, 2021

I believe that is caused by OrderedDict. None of the OrderedSet instance attributes are ever accessed during dump(), only the OrderedDict dump_fields.

Yes, this is exactly what I meant. Hence my proposal. In 4.0, I'd drop ordered and just allow setting set and dict classes as class attributes. Meanwhile, we can already use OrderedSet by default.

@deckar01
Copy link
Member

deckar01 commented Oct 27, 2021

@lafrech I was intrigued by #1891 (comment), so I took a stab at replacing set_class with list. The only set operations we use are unique, intersection, and difference. I implemented them as generators with a set membership check in 4-5 line utilities. Some operations may be more performant than the existing unordered set logic, because they only build 1 set instead of (up to) 3.

dev...deckar01:1744-list-attributes

@lafrech
Copy link
Member Author

lafrech commented Nov 8, 2021

@deckar01 Nice.

On the short run (marshmallow 3.x) we'll probably be keeping set_class if only for backwards compatibility. There seems to be consensus with this so we may go on with #1896. This can wait until we drop Python 3.6.

The question of whether we keep vendoring the OrderedSet recipe or we import it from a 3rd-party package is independent (#1891).

Later on (marshmallow 4), we can either keep set_class as a class property, either hide it as an implementation detail. Then we may consider switching to an implementation such as yours. I have no strong feeling. Some would call that reinventing the wheel, some would like the simplicity. The code is small enough to be error-proofed, but arguably it is a little less straight-forward when used from Schema.

@greyli
Copy link

greyli commented Dec 24, 2022

+1 for this. It will become a key issue when the user generates the local OpenAPI spec file with apispec while the spec output is not deterministic (apiflask/apiflask#373).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants