Skip to content
This repository has been archived by the owner on Apr 26, 2024. It is now read-only.

switch back to stdlib json #7674

Closed
richvdh opened this issue Jun 10, 2020 · 13 comments · Fixed by #7936 or #8259
Closed

switch back to stdlib json #7674

richvdh opened this issue Jun 10, 2020 · 13 comments · Fixed by #7936 or #8259
Assignees
Labels
A-Performance Performance, both client-facing and admin-facing z-p2 (Deprecated Label)

Comments

@richvdh
Copy link
Member

richvdh commented Jun 10, 2020

I found my benchmark data and it looks like stdlib json is faster than simplejson for dumping data nowadays.

Using python 3.7.5 (seconds per loop: smaller is better):

Results
=======
                             loads (large obj)   loads (small objs)    dumps (large obj)   dumps (small objs)
json 2.0.9                               0.012                0.011                0.013                0.015
simplejson 3.17.0                        0.011                0.011                0.019                0.027
@richvdh
Copy link
Member Author

richvdh commented Jun 10, 2020

see also: #3009, matrix-org/python-canonicaljson#23

@erikjohnston erikjohnston added z-p2 (Deprecated Label) A-Performance Performance, both client-facing and admin-facing labels Jun 11, 2020
@auscompgeek
Copy link
Contributor

Did a little experimenting, and it turns out that:

  • simplejson has a feature where it'll serialise bytes by default, and
  • synapse passes bytes to canonicaljson in some places.

So if canonicaljson is changed to use the stdlib json, either it'll have to decode bytestrings it gets, or synapse needs a patch to stop passing bytes to canonicaljson.

@auscompgeek
Copy link
Contributor

Actually it's even worse than that: some of the C-S handlers also return bytes, I just found /voipServer does.

@richvdh
Copy link
Member Author

richvdh commented Jun 12, 2020

to demonstrate the problem:

>>> import simplejson
>>> simplejson.dumps(b"abc")
'"abc"'
>>> import json
>>> json.dumps(b"abc")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python3.6/json/__init__.py", line 231, in dumps
    return _default_encoder.encode(obj)
  File "/usr/lib/python3.6/json/encoder.py", line 199, in encode
    chunks = self.iterencode(o, _one_shot=True)
  File "/usr/lib/python3.6/json/encoder.py", line 257, in iterencode
    return _iterencode(o, 0)
  File "/usr/lib/python3.6/json/encoder.py", line 180, in default
    o.__class__.__name__)
TypeError: Object of type 'bytes' is not JSON serializable

@clokep
Copy link
Contributor

clokep commented Jun 16, 2020

I think it was mentioned in #synapse-dev:matrix.org, but I believe the first step of this was to fix the places where Synapse passes bytes into canonicaljson.

@richvdh
Copy link
Member Author

richvdh commented Jun 24, 2020

Next steps on this:

  • Fix known places where we pass bytes objects into json.dumps or canonicaljson.encode:
    • /voipServer is the only one I'm aware of
  • Create a (pre-release?) version of canonicaljson that uses json rather than simplejson
  • Update synapse to the use the new version of canonicaljson; this will probably break a number of tests
  • Fix the tests
  • If it all looks good, make a proper release of canonicaljson, this fixing Remove simplejson dependency python-canonicaljson#23

While we're working on canonicaljson, assuming we decide it is safe to switch to stock json, it would be good to revert the non-test parts of matrix-org/python-canonicaljson#9, which, as per the comments there, are not required with stock json.

@richvdh richvdh removed their assignment Jul 2, 2020
@clokep clokep self-assigned this Jul 7, 2020
@clokep
Copy link
Contributor

clokep commented Jul 7, 2020

Looks like we also directly use simplejson in a bunch of places in Synapse. I'll clean those up too.

@clokep
Copy link
Contributor

clokep commented Jul 16, 2020

The build in #7803 passed, which means that current develop can run with the standard library JSON module.

I don't think we can just merge matrix-org/python-canonicaljson#24 and release canonicaljson though since older Synapse's might upgrade to this newer version of canonicaljson, which would break them.

We could release a Synapse which sets a maximum version of canonicaljson to the current version, then release canonicaljson and remove the max limit in Synapse? Not sure if there's a better way to do that.

@clokep
Copy link
Contributor

clokep commented Jul 21, 2020

A few more options from Rich (see conversation):

  • make canonicaljson import one of two libraries depending on what you set some global variable to

This seems plausible, although is a bit ugly.

  • flip around which level picks the json impl. So rather than doing from canonicaljson import json everywhere, have synapse pass in a json impl when it invokes CJ

This would work, but means we'll have to constantly recreate JSON encoders, which would probably have a performance impact. (Or I guess we could have an initialization function which would get called to initialize the encoders or something?)

  • have a whole new library called canonicaljson2

This would work fine, but will require changing the package and module names, updating Synapse to use canonicaljson2, and all of the imports.

  • vendor the 10 lines of python that we actually need into Synapse

I assume there was a good reason to have this be a separate library in the first place, not sure if those reasons are still valid.

I think the easiest way forward is the first option:

  • It is forwards compatible (a newer Synapse with an older canonicaljson will still work fine).
  • It is backwards compatible (an older Synapse will still work fine).

The major downside for any of the options that have us keeping support for older Synapses in canonicaljson is that matrix-org/python-canonicaljson#24 cannot be merged, we can tweak something so that most of the benefits are used though. The only way I see around that is releasing it as a canonicaljson2 library.

@clokep
Copy link
Contributor

clokep commented Jul 23, 2020

My current plan for this is:

  • Modify canonicaljson to have an API that allows setting the json version (but have it default to simplejson, as it does now). See Add an API to choose the underlying JSON encoder. python-canonicaljson#25.
  • Release canonicaljson 1.2.0.
  • Update Synapse's requirements to require canonicaljson 1.2.0 or greater.
  • Update Synapse to opt to use the standard library json module.

I think this is tenable and isn't too much of a faff.

@clokep
Copy link
Contributor

clokep commented Sep 4, 2020

While workin gon #7670 I realized that this isn't fully done. We do use the standard library JSON for all all canonicaljson operations, but not for other operations.

There's a handful of places that we from canonicaljson import json, which actually ends up with the simplejson implementation. Two ways to fix this:

  • Modify those imports to import json.
  • Modify canonicaljson to configure the local json variable in the set_json_library function.

Any thoughts on which of these is the "right" way to go?

@clokep clokep reopened this Sep 4, 2020
@richvdh
Copy link
Member Author

richvdh commented Sep 4, 2020

I'm inclined towards the former. Seems simpler?

@clokep
Copy link
Contributor

clokep commented Sep 4, 2020

That was my thought too. I think importing from canonicaljson is actually just a bit confusing. Especially now that we explicitly declare we want to use the json from the standard library.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
A-Performance Performance, both client-facing and admin-facing z-p2 (Deprecated Label)
Projects
None yet
4 participants