
WIP - Replace ujson with orjson as default json library #1509

Conversation

harshanarayana
Contributor

@harshanarayana harshanarayana commented Mar 4, 2019

Address the default JSON library replacement item mentioned in #1479

Since https://github.com/ijl/orjson has started supporting wheels for all platforms, I suppose we can now migrate to orjson as the default library for Python 3.6 and above. However, for 3.5, we will still retain ujson until orjson is backported to it, or we decide to drop ujson completely and rely on the default json for Python 3.5.

Signed-off-by: Harsha Narayana <harsha2k4@gmail.com>
@harshanarayana harshanarayana changed the title #1479 - Replace ujson with orjson as default json library WIP - #1479 - Replace ujson with orjson as default json library Mar 4, 2019
@harshanarayana harshanarayana changed the title WIP - #1479 - Replace ujson with orjson as default json library WIP - Replace ujson with orjson as default json library Mar 4, 2019
@codecov

codecov bot commented Mar 4, 2019

Codecov Report

Merging #1509 into master will increase coverage by 0.1%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff            @@
##           master    #1509     +/-   ##
=========================================
+ Coverage   91.35%   91.45%   +0.1%     
=========================================
  Files          18       18             
  Lines        1781     1791     +10     
  Branches      337      340      +3     
=========================================
+ Hits         1627     1638     +11     
  Misses        130      130             
+ Partials       24       23      -1
Impacted Files Coverage Δ
sanic/request.py 99.52% <100%> (ø) ⬆️
sanic/testing.py 95.29% <100%> (ø) ⬆️
sanic/response.py 100% <100%> (ø) ⬆️
sanic/router.py 95.89% <0%> (+0.45%) ⬆️

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update d581315...1f8edbc.

Signed-off-by: Harsha Narayana <harsha2k4@gmail.com>
@ahopkins
Member

ahopkins commented Mar 5, 2019

@harshanarayana Why not just continue to allow the developer to pick the module? Are we comfortable that orjson is stable enough?

@sjsadowski
Contributor

@ahopkins I actually started replacing ujson with orjson personally a bit ago. I'm pretty happy with it and I'm using it in production. Of course that's anecdotal, but I'd give it a pass if I was doing the review and there were no other concerns.

Personally I think sanic was already opinionated - @channelcat could have used the standard json module but went with ujson instead. Fast and correct is the way to go, and given that ujson has largely been abandoned, I think we should adopt this as a replacement and default - but still allow people to use their own json library if that's what they want to do.

This leads to a larger discussion though - do we start down the path of deprecating python 3.5 support? I'll pose that question in the community.

@sjsadowski
Contributor

After thinking about this some, I have two really big issues: 1) removing ujson significantly slows responses and 2) I don't want sanic to be reliant on abandoned projects.

orjson is a young project, but it does serve a need, as illustrated below:

All of the following tests were using sanic with 12 workers (since that's how many cores I have), and running wrk with 100 concurrent connections and 2 threads.

With ujson:
Running 10s test @ http://localhost:5000/
2 threads and 100 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 0.86ms 831.17us 7.34ms 83.63%
Req/Sec 71.40k 7.82k 81.65k 55.50%
1420371 requests in 10.00s, 169.32MB read
Requests/sec: 142009.29
Transfer/sec: 16.93MB

without ujson - default library:
Running 10s test @ http://localhost:5000/
2 threads and 100 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 0.99ms 0.98ms 5.18ms 84.19%
Req/Sec 62.65k 12.16k 79.92k 54.00%
1246242 requests in 10.00s, 148.56MB read
Requests/sec: 124605.16
Transfer/sec: 14.85MB

with orjson:
Running 10s test @ http://localhost:5000/
2 threads and 100 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 0.97ms 1.02ms 5.92ms 84.81%
Req/Sec 69.23k 1.74k 72.03k 52.00%
1377608 requests in 10.01s, 164.22MB read
Requests/sec: 137685.65
Transfer/sec: 16.41MB

These are just median examples, but on average over multiple runs I would see anywhere between an 8-15% drop in throughput (json vs. ujson), and that's a poor trade-off. When going back to the standard json library, even with documentation, too many people will just ignore the docs and (might) fuss about performance. With orjson we shave that to 3-5%, and I'm sure we can hope for better optimization going forward.

So orjson gets my vote as the new default. It may be a young project, but it's being kept up which is more than I can say about ujson.

@ijl

ijl commented Mar 6, 2019

It's important for performance that json_dumps() not convert bytes to str. There is the cost of converting bytes to str, as well as the cost of converting str back to bytes when writing the response. I don't see sanic.response.json_dumps in the documentation. Is it meant to be private? If so, I think it can just be changed to return bytes.
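
(Illustrative sketch of the idea, not Sanic's actual implementation: a dumps hook that returns bytes, with the body-building code encoding only when it receives str. The function names here are hypothetical.)

import orjson

def json_dumps(obj) -> bytes:
    # orjson.dumps already returns bytes, so no str round trip is needed
    return orjson.dumps(obj)

def body_bytes(obj, dumps=json_dumps) -> bytes:
    serialized = dumps(obj)
    # Accept bytes (orjson) or str (ujson / stdlib json), encoding only when necessary
    return serialized if isinstance(serialized, bytes) else serialized.encode("utf-8")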

@sjsadowski
Contributor

Thanks for the info @ijl - yes, json_dumps is a private function that can be overridden by supplying the dumps= argument.

@harshanarayana can you make that update on your branch? I don't think removing the type conversion will have a significant impact. Someone else from @huge-success/sanic-core-devs check me to make sure I'm not wrong.
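
(For reference, a minimal example of that override hook, assuming a trivial app and route just for illustration:)

from sanic import Sanic
from sanic.response import json
import orjson

app = Sanic()

@app.route("/example")
async def example(request):
    # Per-response override of the serializer via the dumps= kwarg
    return json({"hello": "world"}, dumps=orjson.dumps)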

@harshanarayana
Contributor Author

@sjsadowski Sure. Let me make the changes required to use the data in bytes instead of string format. I will update this PR with the changes shortly.

@huge-success/sanic-core-devs Can I get a go/no-go on the following items? (A rough sketch of how this could look follows the list.)

  1. orjson will be the default package used for json related ops
  2. ujson will be retained as default on 3.5
  3. json is the ultimate fallback for all 3 environments
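
(An illustrative sketch of that selection order; the module layout and import names are hypothetical:)

import sys

if sys.version_info >= (3, 6):
    # Items 1 and 3: prefer orjson on 3.6+, fall back to stdlib json
    try:
        from orjson import dumps as json_dumps, loads as json_loads
    except ImportError:
        from json import dumps as json_dumps, loads as json_loads
else:
    # Items 2 and 3: retain ujson on 3.5, fall back to stdlib json
    try:
        from ujson import dumps as json_dumps, loads as json_loads
    except ImportError:
        from json import dumps as json_dumps, loads as json_loads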

@sjsadowski
Contributor

You've got a go from me, but I'd like to get at least two other people to weigh in with a "go" before we merge. @ashleysommer @yunstanford @seemethere @ahopkins @abuckenheimer

@ahopkins
Member

ahopkins commented Mar 6, 2019

Is something broken in ujson right now? I understand the project has not had any changes, but does it need any?

If we are looking to make a change, have we looked at others? (https://github.com/python-rapidjson/python-rapidjson) Before making a change, I think we should do some more due diligence and testing, especially since it seems that orjson would decrease performance.

@ashleysommer
Member

ashleysommer commented Mar 6, 2019

I agree with @ahopkins.

I don't actually see anything wrong with sticking with ujson for now. It is fast and it works.

We will be dropping support for Python 3.5 when it hits EOL in September 2020; if we want to migrate to a Python 3.6-only library, it would make sense to do so then.

And additionally, as @ahopkins mentioned, why is orjson the intended library of choice, when high performance alternatives like rapid-json are available and compatible?

@yunstanford
Member

I don't have a strong opinion here. It could be included via extras_require, so pip install sanic[orjson].

Also, I'd like to refactor the code a little bit. Basically, we add a compat.py that handles those module imports, so that other modules just import from compat.py.
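
(A hedged sketch of the extras idea as a hypothetical setup.py fragment; compat.py would then try-import whichever extra is installed, as in the sketch above. Names and pins are illustrative only.)

# setup.py (illustrative fragment)
from setuptools import setup

setup(
    name="sanic",
    # ... other metadata ...
    extras_require={
        "ujson": ["ujson"],
        "orjson": ["orjson"],
    },
)

Installed as pip install sanic[orjson] or pip install sanic[ujson].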

@sjsadowski
Contributor

So there have been multiple open issues since 2017 related to number handling and deserialization in ujson. People have submitted PRs to address them, which have been ignored.

Personally, I don't care if it's orjson or rapidjson or something else - so long as the library owner(s) respond to issues. I'd rather be ahead of problems by a voluntary change than working backward because something we rely on breaks - though to be fair, we could always just drop ujson support in favor of the default json. Though like I said, there's a performance hit in doing so, and I'd rather replace the dead library with something actively maintained... unless we want to fork it and maintain it ourselves, which I recommend against.

@harshanarayana
Contributor Author

@ahopkins @ashleysommer @yunstanford @sjsadowski

I will put this PR on hold for now. Let me create a quick benchmark of a few different JSON libraries available out there and see which of them perform consistently better without having any breaking changes if possible.

@yunstanford I am totally for the compat.py. It can help us do a few different things down the line.

@harshanarayana harshanarayana changed the title WIP - Replace ujson with orjson as default json library WIP - Replace ujson with orjson as default json library - [HOLD] Mar 7, 2019
@ahopkins
Member

ahopkins commented Mar 7, 2019

I second the compat.py idea from @yunstanford

@harshanarayana harshanarayana changed the title WIP - Replace ujson with orjson as default json library - [HOLD] WIP - Replace ujson with orjson as default json library Mar 7, 2019
@ijl

ijl commented Mar 12, 2019

orjson 2.0.2 supports python3.5.

@ahopkins
Member

@ijl In light of the renewed discussion on bumping the min version to 3.6, can you put together some new benchmarks for us to compare ujson with orjson?

@ijl

ijl commented Apr 28, 2019

There are benchmarks at https://github.com/ijl/orjson#performance.

@ahopkins
Member

I ran a benchmark between ujson and orjson on my machine.

Here is the test using a bit of json from here: https://www.json.org/example.html.

from sanic import Sanic
from sanic.response import json
import sys
import orjson

app = Sanic(log_config=None)
WORKERS = int(sys.argv[-1])

data = {"web-app": {"servlet": [{"servlet-name": "cofaxCDS","servlet-class": "org.cofax.cds.CDSServlet","init-param": {"configGlossary:installationAt": "Philadelphia, PA","configGlossary:adminEmail": "ksm@pobox.com","configGlossary:poweredBy": "Cofax","configGlossary:poweredByIcon": "/images/cofax.gif","configGlossary:staticPath": "/content/static","templateProcessorClass": "org.cofax.WysiwygTemplate","templateLoaderClass": "org.cofax.FilesTemplateLoader","templatePath": "templates","templateOverridePath": "","defaultListTemplate": "listTemplate.htm","defaultFileTemplate": "articleTemplate.htm","useJSP": False,"jspListTemplate": "listTemplate.jsp","jspFileTemplate": "articleTemplate.jsp","cachePackageTagsTrack": 200,"cachePackageTagsStore": 200,"cachePackageTagsRefresh": 60,"cacheTemplatesTrack": 100,"cacheTemplatesStore": 50,"cacheTemplatesRefresh": 15,"cachePagesTrack": 200,"cachePagesStore": 100,"cachePagesRefresh": 10,"cachePagesDirtyRead": 10,"searchEngineListTemplate": "forSearchEnginesList.htm","searchEngineFileTemplate": "forSearchEngines.htm","searchEngineRobotsDb": "WEB-INF/robots.db","useDataStore": True,"dataStoreClass": "org.cofax.SqlDataStore","redirectionClass": "org.cofax.SqlRedirection","dataStoreName": "cofax","dataStoreDriver": "com.microsoft.jdbc.sqlserver.SQLServerDriver","dataStoreUrl": "jdbc:microsoft:sqlserver://LOCALHOST:1433;DatabaseName=goon","dataStoreUser": "sa","dataStorePassword": "dataStoreTestQuery","dataStoreTestQuery": "SET NOCOUNT ON;select test='test';","dataStoreLogFile": "/usr/local/tomcat/logs/datastore.log","dataStoreInitConns": 10,"dataStoreMaxConns": 100,"dataStoreConnUsageLimit": 100,"dataStoreLogLevel": "debug","maxUrlLength": 500,},},{"servlet-name": "cofaxEmail","servlet-class": "org.cofax.cds.EmailServlet","init-param": {"mailHost": "mail1","mailHostOverride": "mail2",},},{"servlet-name": "cofaxAdmin","servlet-class": "org.cofax.cds.AdminServlet",},{"servlet-name": "fileServlet","servlet-class": "org.cofax.cds.FileServlet",},{"servlet-name": "cofaxTools","servlet-class": "org.cofax.cms.CofaxToolsServlet","init-param": {"templatePath": "toolstemplates/","log": 1,"logLocation": "/usr/local/tomcat/logs/CofaxTools.log","logMaxSize": "","dataLog": 1,"dataLogLocation": "/usr/local/tomcat/logs/dataLog.log","dataLogMaxSize": "","removePageCache": "/content/admin/remove?cache=pages&id=","removeTemplateCache": "/content/admin/remove?cache=templates&id=","fileTransferFolder": "/usr/local/tomcat/webapps/content/fileTransferFolder","lookInContext": 1,"adminGroupID": 4,"betaServer": True,},},],"servlet-mapping": {"cofaxCDS": "/","cofaxEmail": "/cofaxutil/aemail/*","cofaxAdmin": "/admin/*","fileServlet": "/static/*","cofaxTools": "/tools/*",},"taglib": {"taglib-uri": "cofax.tld","taglib-location": "/WEB-INF/tlds/cofax.tld",},}}


@app.route("/ujson/<id:int>", methods=["GET"])
async def ujson_test(request, id):
    return json(data)


@app.route("/orjson/<id:int>", methods=["GET"])
async def orjson_test(request, id):
    return json(data, dumps=orjson.dumps)


if __name__ == "__main__":
    app.run(
        debug=False,
        access_log=False,
        workers=WORKERS,
        port=3000,
        host="0.0.0.0",
    )

Running ujson

$ wrk -t12 -c1000 http://localhost:3000/ujson/1     
Running 10s test @ http://localhost:3000/ujson/1
  12 threads and 1000 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    13.66ms   12.37ms 224.11ms   87.77%
    Req/Sec     7.04k     1.30k   24.53k    73.92%
  840661 requests in 10.04s, 2.25GB read
Requests/sec:  83704.54
Transfer/sec:    229.10MB

Running orjson

$ wrk -t12 -c1000 http://localhost:3000/orjson/1
Running 10s test @ http://localhost:3000/orjson/1
  12 threads and 1000 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    13.40ms   10.17ms 188.65ms   87.32%
    Req/Sec     6.80k     1.22k   10.57k    77.36%
  818819 requests in 10.10s, 2.16GB read
Requests/sec:  81102.62
Transfer/sec:    218.66MB

Here it is again with a much smaller payload:

@app.route("/ujson/<id:int>", methods=["GET"])
async def ujson_test(request, id):
    data = {"id": id}
    return json(data)


@app.route("/orjson/<id:int>", methods=["GET"])
async def orjson_test(request, id):
    data = {"id": id}
    return json(data, dumps=orjson.dumps)

Running ujson

$ wrk -t12 -c1000 http://localhost:3000/ujson/1     
Running 10s test @ http://localhost:3000/ujson/1
  12 threads and 1000 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    10.96ms   10.69ms 148.34ms   88.26%
    Req/Sec     9.10k     2.66k   37.72k    72.76%
  1091223 requests in 10.10s, 121.76MB read
Requests/sec: 108042.64
Transfer/sec:     12.06MB

Running orjson

$ wrk -t12 -c1000 http://localhost:3000/orjson/1
Running 10s test @ http://localhost:3000/orjson/1
  12 threads and 1000 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    12.09ms   12.23ms 148.70ms   88.54%
    Req/Sec     8.39k     2.07k   41.86k    74.83%
  1002999 requests in 10.07s, 115.74MB read
Requests/sec:  99625.89
Transfer/sec:     11.50MB

As you can see, my results do not seem to match the referenced benchmarks showing that orjson is more performant. @ijl Can you look at my methodology and see if I am missing something?

Member

@ahopkins ahopkins left a comment


Putting a hold on this until we resolve the impact on performance.

@ijl

ijl commented Apr 29, 2019

That is testing with 1,000 client connections (i.e., there is massive queuing) and looking for a difference, in a whole-program benchmark, of the time taken in serializing {"id":1}, which is eight bytes! There's nothing to be learned from that benchmark. The benchmarks in the orjson repository are correct and reproducible. I don't want to spend time on sanic and would encourage you to close this branch.

@sjsadowski
Contributor

So this is fun - specifically because our average response times seem to go down, but the handled requests also go down. I'm not sure how we want to gauge performance in this process, because there are some people who will argue either side - faster is better at the costs of requests, or volume of requests is better at the cost of speed.

Also the very biggest question: is the bottleneck in sanic or is the bottleneck in the library?

@vltr
Member

vltr commented Apr 29, 2019

The benchmarks in the orjson repository are correct and reproducible.

Benchmarks are quite susceptible to the kind of data you put in, of course. You can have a very simple object to serialize or a very complex one. But you might also want to use your own JSON serialization / deserialization functions (e.g. you might like to use the hooks system from the stdlib json module) or use what's "out of the box" (in the case of a third-party lib, like Sanic). What I think is that the developer should be able to choose their own JSON implementation. If you're a "power user" or want to get some speed out of Sanic, then install ujson, orjson, RapidJSON or whatever you see fit.

As for the above benchmark, it might look superficial, but it's actually valid for some web framework benchmark suites. I just don't have the link with me here.

I don't want to spend time on sanic and would encourage you to close this branch.

Well, you don't have to if you don't want to. As I said, I prefer developers to choose their JSON serialization / deserialization functions by themselves, and I would not like Sanic to have an opinion on it, since even ujson is not a pure-Python implementation (so you can have the same problems installing it on, say, Alpine Linux as you could with orjson). We need to stay open to everyone and stay close to that idea.

If our job is not to stay open to what developers (our main audience) want, then why are we doing this open source after all?

Just my two cents.

@ahopkins
Member

ahopkins commented Apr 29, 2019

@ijl I think you misunderstood me. I am not asking you to work on Sanic. I just meant: am I missing something in treating this as a straight port from json.dumps to ujson.dumps to orjson.dumps, since you are much more familiar with serialization than I am?

And, according to a quick test @vltr shared with me, it looks like there is a difference in that orjson returns a bytes string. So I would need to dig into it deeper, but perhaps how Sanic handles that is causing the performance difference.
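
(A quick way to see the difference, assuming all three libraries are installed:)

import json
import orjson
import ujson

orjson.dumps({"id": 1})  # b'{"id":1}'  -> bytes
ujson.dumps({"id": 1})   # '{"id":1}'   -> str
json.dumps({"id": 1})    # '{"id": 1}'  -> str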

I am not trying to say that your benchmarks are incorrect, merely to understand why Sanic would not see the same correlation. Also, I tested not just with the super-simple 8-byte response, but also with a more nested and realistic API response. Both results are above.

@sjsadowski Yup. Meaning we have more due diligence to do.

@vltr It is already a feature of sanic to allow the developer to choose at response time which serializer to use. See the above. The question is: which should be the default that plain-vanilla Sanic ships with?

@vltr
Member

vltr commented Apr 29, 2019

@vltr It is a feature of sanic already to allow the developer to chose at response time what serializer to use. See the above. The question is which should be the default that plain vanilla Sanic ships with?

@ahopkins I know 😉 I think that we could lean (in the future) towards something like @yunstanford suggested: sanic[orjson], sanic[ujson], etc. It's just an idea, though. I know we had some problems with unit testing and the vanilla JSON lib in the past, but that was just a matter of calling dumps with the right arguments. Anyway, this is just a rough, "vague" idea.

@harshanarayana
Contributor Author

harshanarayana commented May 3, 2019

sanic[orjson], sanic[ujson]

@vltr I think this would be the best approach, letting the end user pick what they want. I think it's a good idea to close this PR and make the necessary changes to enable the end user to pick their preferred json toolchain instead.

@ahopkins
Member

ahopkins commented May 3, 2019

@harshanarayana I can agree to that.

@ahopkins ahopkins closed this May 3, 2019
@gatopeich

gatopeich commented May 31, 2019

So how do we choose the json library now?

BTW orjson clearly outperforms ujson in my domain-specific benchmark:

  • Decoding 10,000 randomized json documents, extracting data, and re-encoding the extracted data
  • Input is a 500-byte, 2-level-deep dict subclass (a native dict does better overall),
  • output is a 300-byte sub-dict extracted from it,
  • orjson shines especially in the encoding direction (py_to_json),
  • working with bytes by default, vs ujson's strings,
  • producing the exact same output
$ python3.7 -OO tests/json_performance.py 
With standard json ...
json py_to_json: 1.73e+05/s
json json_to_py: 2.24e+05/s
json json_to_py+access+py_to_json: 1.25e+05/s
json Request/parsed/response type[size]: str[501] dict[5] str[306]
json Encoded request/parsed/response:
  b'{"amfInstanceId":"554afdee-c72a-8c43-449a-807809d59999","resynchroni...
  {'amfInstanceId': '554afdee-c72a-8c43-449a-807809d59999', 'resynchroni...
  b'{"authType":"5G_AKA","5gAuthData":{"rand":"69ff361e32192528fd5ef6d94...
json Last request/parsed/response hash: 25f3ab 01e36b b0dc3c

With ujson ...
ujson py_to_json: 2.94e+05/s
ujson json_to_py: 1.72e+05/s
ujson json_to_py+access+py_to_json: 1.9e+05/s
ujson Request/parsed/response type[size]: str[501] dict[5] str[306]
ujson Encoded request/parsed/response:
  b'{"amfInstanceId":"554afdee-c72a-8c43-449a-807809d59999","resynchroni...
  {'amfInstanceId': '554afdee-c72a-8c43-449a-807809d59999', 'resynchroni...
  b'{"authType":"5G_AKA","5gAuthData":{"rand":"69ff361e32192528fd5ef6d94...
ujson Last request/parsed/response hash: 25f3ab 01e36b b0dc3c

With orjson ...
orjson py_to_json: 6.53e+05/s
orjson json_to_py: 2.24e+05/s
orjson json_to_py+access+py_to_json: 2.76e+05/s
orjson Request/parsed/response type[size]: bytes[501] dict[5] bytes[306]
orjson Encoded request/parsed/response:
  b'{"amfInstanceId":"554afdee-c72a-8c43-449a-807809d59999","resynchroni...
  {'amfInstanceId': '554afdee-c72a-8c43-449a-807809d59999', 'resynchroni...
  b'{"authType":"5G_AKA","5gAuthData":{"rand":"69ff361e32192528fd5ef6d94...
orjson Last request/parsed/response hash: 25f3ab 01e36b b0dc3c

@ahopkins
Member

@gatopeich I don't doubt it. And the benchmarks that I've seen elsewhere seem consistent. However, there is a question about conversion to strings or byte strings. And, in my experiments, orjson did not perform as well inside sanic routes. Perhaps there are some other tweaks we need to make, and I have not tried it in relation to #1475. That will be a good experiment too.

With that said, the response.json method allows you to pass a dumps kwarg to override the usage of ujson if you choose.
