Performance #228

lidatong · 2020-06-07T15:57:13Z

Performance in general is on my radar as the next thing to tackle as this library gains traction, and it sits at the top of the 1.0 release checklist.

After some thought, I don't think caching / memoization is the right way to tackle this. A few reasons why:

- it requires careful thought about how it behaves under concurrency, specifically with respect to memory visibility
- it could have a big memory footprint on large codebases with a lot of composite dataclasses, potentially duplicated across threads!
- immutability: should the cached object be mutable, and if so, how can we protect it from changes?

Instead, I think an approach involving code generation is the way to go -- similar to how the dataclasses core module itself is implemented. When you think about it, a schema is only generated once and is known at "module-load time". In other languages we might call this "compile-time". We can see the code-generation approach used in codec schema libraries in other languages, be it for JSON or for other data-interchange formats like protobuf.

Going this route, the schema now is loaded as just more code, so to speak, instead of living in memory.


yacc143 · 2020-06-29T16:13:50Z

Just some observations from benchmarking some code dumping 1000 simple objects resulting in ~3.4 MB of JSON:

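from dataclasses import dataclass
from dataclasses_json import dataclass_json

# TESTSTR is a long test string (definition not shown here).

@dataclass_json
@dataclass
class Test:
    id: int
    value: str
    second: str

testvalue = [Test(i, TESTSTR, TESTSTR[0:200]) for i in range(1000)]
testvalue2 = [dict(id=i, value=TESTSTR, second=TESTSTR[0:200]) for i in range(1000)]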

I've created a number of methods to dump these lists of objects:

import json
import ujson

def callDCJS():
    # Builds a fresh schema on every call.
    len(Test.schema().dumps(testvalue, many=True))

def callJS():
    len(ujson.dumps(testvalue2))

def callJS2():
    len(json.dumps(testvalue2))

def callDCJS2(schema=Test.schema()):
    # The default argument is evaluated once, so the schema is reused.
    len(schema.dumps(testvalue, many=True))

def callDCJS3(schema=Test.schema()):
    # Dump to plain Python structures, then encode with ujson.
    len(ujson.dumps(schema.dump(testvalue, many=True)))

As you can see, the callJS functions are the ones that dump the native list of Python dictionaries, while the DCJS ones use dataclasses_json.

The numbers suggest to me that dataclasses_json (or marshmallow? I'm not sure whether it is used under the hood; I haven't looked at the code yet) has room for optimization:

(uber38) andreas@obelix:~/work/venvs/uber38/NLP/test38.py$ time python3.8 test38.py 
callJS               0.009778 [0.010089821879984112, 0.010208235128905747, 0.009778391534543678]
callJS2              0.018111 [0.01811142571168384, 0.019572446219253917, 0.02105408304198939]
callDCJS             0.028419 [0.02841886689791361, 0.03097145833031697, 0.032973052026961255]
callDCJS2            0.026938 [0.0269384963994471, 0.029055634546708932, 0.034756338603475746]
callDCJS3            0.018359 [0.01837075536919607, 0.018359050057148812, 0.021072761750676565]

The first number is the minimum of the three runs. As you can see, the best strategy currently available with dataclasses_json seems to be to use it to serialize to Python data structures, and then use the fastest JSON package you can find for your data. (ujson seems to be fast, beating the standard json module by about a factor of 2.) The time difference between callDCJS2 and callDCJS3 suggests that dataclasses_json uses the default json module.
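For anyone who wants to reproduce this kind of output (a minimum followed by the three raw timings), a harness along these lines should work. It is a sketch, not the script used above: the `bench` helper, the payload, and the `number=1` setting are all assumptions, and only the standard json module is timed so that it runs without extra packages.

```python
import json
import timeit

# Placeholder payload: TESTSTR is not shown in the thread, so this is an assumption.
SAMPLE = "x" * 100
records = [dict(id=i, value=SAMPLE, second=SAMPLE[0:20]) for i in range(1000)]


def call_json():
    len(json.dumps(records))


def bench(func, repeat=3, number=1):
    """Return (minimum, all timings) for `repeat` runs of `number` calls each."""
    timings = timeit.repeat(func, repeat=repeat, number=number)
    return min(timings), timings


best, timings = bench(call_json)
print(f"{call_json.__name__:<20} {best:.6f} {timings}")
```

Reporting the minimum rather than the mean is the usual choice with timeit, since slower runs mostly reflect interference from other processes rather than the code under test.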

JsBergbau · 2021-04-29T11:06:09Z

Even building the dict by accessing each field of the dataclass, like dict(id=test.id, value=test.value, second=test.second), and then dumping it to JSON is about twice as fast in my tests as using dataclass_json.
Building the string manually, like json = '{"id": ' + str(test.id) and so on, takes only about half the time of json.dumps, so dataclass_json is about 4 times slower than building the string manually.

Is there any timetable to give it better performance?
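The two hand-written approaches described in this comment could look roughly like the sketch below. The function names `via_dict` and `via_string` are made up for illustration, and no timings are claimed here. Note that plain string concatenation only produces valid JSON if string fields are escaped, so this version still calls json.dumps on them; it may therefore be slower than fully naive concatenation.

```python
import json
from dataclasses import dataclass


@dataclass
class Test:
    id: int
    value: str
    second: str


def via_dict(test):
    # Build a plain dict field by field, then let json.dumps encode it.
    return json.dumps(dict(id=test.id, value=test.value, second=test.second))


def via_string(test):
    # Hand-assemble the JSON text. json.dumps is still used for the string
    # fields so that quotes, backslashes and control characters are escaped.
    return (
        '{"id": ' + str(test.id)
        + ', "value": ' + json.dumps(test.value)
        + ', "second": ' + json.dumps(test.second)
        + "}"
    )


sample = Test(1, 'say "hi"', "b")
# Both approaches should decode to the same data.
assert json.loads(via_dict(sample)) == json.loads(via_string(sample))
```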

cakemanny · 2021-05-03T18:19:53Z

Hi, it's interesting that you mention code generation.
I didn't think to suggest it, as I thought maybe you were trying to stay dynamic.
I implemented essentially the from_dict part of the API using some code generation in March last year, in case my approach is of interest to you.
https://github.com/cakemanny/fastclasses-json

edit: there is now a release on PyPI, as well as to_dict and some configurable options

lidatong pinned this issue Jun 7, 2020

miohtama mentioned this issue Jul 12, 2021

PyPi release? cakemanny/fastclasses-json#2

Closed

ZeldaZach mentioned this issue Sep 9, 2022

Add Support for letter_case cakemanny/fastclasses-json#7

Closed

yifanmai mentioned this issue May 3, 2023

JSON serialization performance stanford-crfm/levanter#133

Closed

george-zubrienko unpinned this issue Sep 2, 2023
