WIP - Replace ujson with orjson as default json library #1509
Conversation
Signed-off-by: Harsha Narayana <harsha2k4@gmail.com>
Codecov Report
```
@@            Coverage Diff            @@
##           master    #1509    +/-   ##
=========================================
+ Coverage   91.35%   91.45%    +0.1%
=========================================
  Files          18       18
  Lines        1781     1791      +10
  Branches      337      340       +3
=========================================
+ Hits         1627     1638      +11
  Misses        130      130
+ Partials       24       23       -1
```
Continue to review full report at Codecov.
Signed-off-by: Harsha Narayana <harsha2k4@gmail.com>
@harshanarayana Why not just continue to allow the developer to pick the module? Are we comfortable that orjson is mature enough to be the default?
@ahopkins I actually started replacing ujson with orjson personally a while ago. I'm pretty happy with it and I'm using it in production. Of course that's anecdotal, but I'd give it a pass if I were doing the review and there were no other concerns. Personally, I think Sanic was already opinionated: @channelcat could have used the standard json module but went with ujson instead. Fast and correct is the way to go, and given that ujson has largely been abandoned, I think we should adopt this as a replacement and default, while still allowing people to use their own json library if that's what they want to do. This leads to a larger discussion, though: do we start down the path of deprecating Python 3.5 support? I'll pose that question in the community.
After thinking about this some, I have two really big issues: 1) removing ujson significantly slows responses, and 2) I don't want Sanic to be reliant on abandoned projects. orjson is a young project, but it does serve a need, as illustrated below. All of the following tests used Sanic with 12 workers (since that's how many cores I have), running wrk with 100 concurrent connections and 2 threads: with ujson, without ujson (the default json library), and with orjson (benchmark output not preserved here). These are just median examples, but on average over multiple runs I would see anywhere between an 8-15% performance hit (json versus ujson), and that's a poor trade-off. When going back to the standard json library, even with documentation, too many people will just ignore the docs and (might) fuss about performance. With orjson we shave that to 3-5%, and I'm sure we can hope for better optimization going forward. So orjson gets my vote as the new default. It may be a young project, but it's being kept up, which is more than I can say about ujson.
It's important for performance that json_dumps() not convert bytes to str. There is the cost of converting bytes to str, as well as the cost of converting str back to bytes when writing the request. I don't see that being avoided here.
Thanks for the info @ijl. Yes, json_dumps is a private method that can be overridden by supplying the dumps= keyword argument. @harshanarayana can you make that update on your branch? I don't think removing the type conversion will be significantly impactful. Someone else from @huge-success/sanic-core-devs check me to make sure I'm not wrong.
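ijl's point about avoiding a bytes-to-str-to-bytes round trip can be sketched with the stdlib alone. `dumps_bytes` below is a hypothetical helper, not Sanic's actual code; it serializes straight to bytes the way `orjson.dumps` does:

```python
import json

def dumps_bytes(obj) -> bytes:
    # Hypothetical helper: stdlib json returns str, so encode exactly
    # once here. A dumps callable that already yields bytes lets the
    # framework write the response body without converting it back.
    return json.dumps(obj, separators=(",", ":")).encode("utf-8")

body = dumps_bytes({"id": 1, "ok": True})
print(body)  # b'{"id":1,"ok":true}'
```

Whether the framework then skips re-encoding depends on how the response writer handles a bytes body.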
@sjsadowski Sure. Let me make the changes required to use the data in bytes instead of string format. I will update this PR with the changes shortly. @huge-success/sanic-core-devs Can I get a go/no-go on the following items?
You've got a go from me, but I'd like to get at least two other people to weigh in with a "go" before we merge. @ashleysommer @yunstanford @seemethere @ahopkins @abuckenheimer |
Is something broken in ujson? If we are looking to make a change, have we looked at others (e.g. https://github.com/python-rapidjson/python-rapidjson)? Before making a change, I think we should do some more due diligence and testing.
I agree with @ahopkins. I don't actually see anything wrong with sticking with ujson for now. It is fast and it works. We will be dropping support for Python 3.5 when it hits EOL in 2020-09; it would make sense that if we want to migrate to a Python 3.6-only library, we do so then. And additionally, as @ahopkins mentioned, why is a change needed at all?
I don't have a strong opinion here; it could be included as an optional dependency. Also, I'd like to refactor the code a little bit.
So there are multiple open issues since 2017 related to number handling and deserialization issues. People have submitted PRs to address them which have been ignored. Personally, I don't care if it's orjson or rapidjson or something else, so long as the library owner(s) respond to issues. I'd rather be ahead of problems with a voluntary change than work backward because something we rely on breaks. Though, to be fair, we could always just drop ujson support in favor of the default json. Like I said, there's a performance hit in doing so, and I'd rather replace the dead library with something actively maintained... unless we want to fork it and maintain it ourselves, which I recommend against.
@ahopkins @ashleysommer @yunstanford @sjsadowski I will put this PR on hold for now. Let me create a quick benchmark of a few different JSON libraries available out there and see which of them perform consistently better without introducing any breaking changes if possible. @yunstanford I am totally for the refactor as well.
I second the refactor.
orjson 2.0.2 supports python3.5. |
@ijl In light of the renewed discussion on bumping the minimum version to 3.6, can you put together some new benchmarks for us to compare?
There are benchmarks at https://github.com/ijl/orjson#performance. |
I ran a benchmark between ujson and orjson. Here is the test, using a bit of JSON from https://www.json.org/example.html:

```python
from sanic import Sanic
from sanic.response import json
import sys
import orjson

app = Sanic(log_config=None)

WORKERS = int(sys.argv[-1])

data = {"web-app": {"servlet": [{"servlet-name": "cofaxCDS","servlet-class": "org.cofax.cds.CDSServlet","init-param": {"configGlossary:installationAt": "Philadelphia, PA","configGlossary:adminEmail": "ksm@pobox.com","configGlossary:poweredBy": "Cofax","configGlossary:poweredByIcon": "/images/cofax.gif","configGlossary:staticPath": "/content/static","templateProcessorClass": "org.cofax.WysiwygTemplate","templateLoaderClass": "org.cofax.FilesTemplateLoader","templatePath": "templates","templateOverridePath": "","defaultListTemplate": "listTemplate.htm","defaultFileTemplate": "articleTemplate.htm","useJSP": False,"jspListTemplate": "listTemplate.jsp","jspFileTemplate": "articleTemplate.jsp","cachePackageTagsTrack": 200,"cachePackageTagsStore": 200,"cachePackageTagsRefresh": 60,"cacheTemplatesTrack": 100,"cacheTemplatesStore": 50,"cacheTemplatesRefresh": 15,"cachePagesTrack": 200,"cachePagesStore": 100,"cachePagesRefresh": 10,"cachePagesDirtyRead": 10,"searchEngineListTemplate": "forSearchEnginesList.htm","searchEngineFileTemplate": "forSearchEngines.htm","searchEngineRobotsDb": "WEB-INF/robots.db","useDataStore": True,"dataStoreClass": "org.cofax.SqlDataStore","redirectionClass": "org.cofax.SqlRedirection","dataStoreName": "cofax","dataStoreDriver": "com.microsoft.jdbc.sqlserver.SQLServerDriver","dataStoreUrl": "jdbc:microsoft:sqlserver://LOCALHOST:1433;DatabaseName=goon","dataStoreUser": "sa","dataStorePassword": "dataStoreTestQuery","dataStoreTestQuery": "SET NOCOUNT ON;select test='test';","dataStoreLogFile": "/usr/local/tomcat/logs/datastore.log","dataStoreInitConns": 10,"dataStoreMaxConns": 100,"dataStoreConnUsageLimit": 100,"dataStoreLogLevel": "debug","maxUrlLength": 500,},},{"servlet-name": "cofaxEmail","servlet-class": "org.cofax.cds.EmailServlet","init-param": {"mailHost": "mail1","mailHostOverride": "mail2",},},{"servlet-name": "cofaxAdmin","servlet-class": "org.cofax.cds.AdminServlet",},{"servlet-name": "fileServlet","servlet-class": "org.cofax.cds.FileServlet",},{"servlet-name": "cofaxTools","servlet-class": "org.cofax.cms.CofaxToolsServlet","init-param": {"templatePath": "toolstemplates/","log": 1,"logLocation": "/usr/local/tomcat/logs/CofaxTools.log","logMaxSize": "","dataLog": 1,"dataLogLocation": "/usr/local/tomcat/logs/dataLog.log","dataLogMaxSize": "","removePageCache": "/content/admin/remove?cache=pages&id=","removeTemplateCache": "/content/admin/remove?cache=templates&id=","fileTransferFolder": "/usr/local/tomcat/webapps/content/fileTransferFolder","lookInContext": 1,"adminGroupID": 4,"betaServer": True,},},],"servlet-mapping": {"cofaxCDS": "/","cofaxEmail": "/cofaxutil/aemail/*","cofaxAdmin": "/admin/*","fileServlet": "/static/*","cofaxTools": "/tools/*",},"taglib": {"taglib-uri": "cofax.tld","taglib-location": "/WEB-INF/tlds/cofax.tld",},}}

@app.route("/ujson/<id:int>", methods=["GET"])
async def ujson_test(request, id):
    return json(data)

@app.route("/orjson/<id:int>", methods=["GET"])
async def orjson_test(request, id):
    return json(data, dumps=orjson.dumps)

if __name__ == "__main__":
    app.run(
        debug=False,
        access_log=False,
        workers=WORKERS,
        port=3000,
        host="0.0.0.0",
    )
```

Running against the `/ujson` route: (benchmark output not preserved)

Running against the `/orjson` route: (benchmark output not preserved)
Here it is again with a much smaller payload:

```python
@app.route("/ujson/<id:int>", methods=["GET"])
async def ujson_test(request, id):
    data = {"id": id}
    return json(data)

@app.route("/orjson/<id:int>", methods=["GET"])
async def orjson_test(request, id):
    data = {"id": id}
    return json(data, dumps=orjson.dumps)
```

Running against the `/ujson` route: (benchmark output not preserved)

Running against the `/orjson` route: (benchmark output not preserved)
As you can see, my results do not seem to match the referenced benchmarks.
Putting a hold on this until we resolve the impact on performance.
That is testing with 1,000 client connections (i.e., there is massive queuing) and looking, in a whole-program benchmark, for a difference in the time taken in serialization.
So this is fun: our average response times seem to go down, but the number of handled requests also goes down. I'm not sure how we want to gauge performance in this process, because some people will argue either side: faster is better at the cost of request volume, or volume of requests is better at the cost of speed. Also, the biggest question: is the bottleneck in Sanic or is the bottleneck in the library?
Benchmarks are quite susceptible to the kind of data you put in, of course. You can have a very simple object to serialize or a very complex one. But you may also want to use your own JSON serialization/deserialization functions (e.g. you might like to use the hooks system from the stdlib json module). As for the above benchmark: it might look superficial, but it's actually valid for some web framework benchmark systems. I just don't have the link with me here.
Well, you don't have to if you don't want to. As I said, I prefer developers to choose their JSON serialization/deserialization functions by themselves, and I would not like Sanic to have an opinion on it. If our job is not to stay open about what the developers (our main audience) want, then why are we doing open source after all? Just my two cents.
@ijl I think you misunderstood me. I am not asking you to work on Sanic. I just meant: am I missing something in a straight port from json.dumps to ujson.dumps to orjson.dumps, since you are much more familiar with serialization than I am? And, according to a quick test @vltr did to share with me, it looks like the difference is that orjson returns a bytes string. I would need to dig into it deeper, but perhaps how Sanic treats that is causing the performance difference. I am not trying to say that your benchmarks are incorrect, merely trying to understand why Sanic would not see the same correlation. Also, I tested with not just the 8-byte super simple response, but also a more nested and realistic API response. Both results are above. @sjsadowski Yup. Meaning we have more due diligence to do. @vltr It is already a feature of Sanic to allow the developer to choose, at response time, which serializer to use. See above. The question is: which should be the default that plain vanilla Sanic ships with?
@ahopkins I know 😉 I think that we could lean (in the future) towards something like @yunstanford suggested.
@vltr I think this would be the best approach to let the end user pick what they want. I think it's a good idea to close this PR and make the necessary changes to enable the end user to pick their preferred JSON toolchain instead.
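One common shape for "let the user pick" is an optional-dependency fallback at import time. This is only a sketch under assumed names (`fast_dumps` and `dumps_normalized` are illustrative, not Sanic's implementation): prefer the fastest installed serializer and degrade gracefully to the stdlib.

```python
# Try serializers from fastest to the guaranteed stdlib fallback.
try:
    from orjson import dumps as fast_dumps  # returns bytes
except ImportError:
    try:
        from ujson import dumps as fast_dumps  # returns str
    except ImportError:
        from json import dumps as fast_dumps  # returns str

def dumps_normalized(obj) -> bytes:
    # Normalize so callers always receive bytes, whichever backend won.
    out = fast_dumps(obj)
    return out if isinstance(out, bytes) else out.encode("utf-8")
```

An extras-based install (e.g. `pip install sanic[ujson]`-style packaging) could pair with this pattern, though how the project packages it is a separate decision.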
@harshanarayana I can agree to that. |
So how do we choose the json library now? BTW, orjson clearly outperforms ujson in my domain-specific benchmark (numbers not preserved here).
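Domain-specific numbers like these are easy to reproduce locally. A minimal, stdlib-only timing harness might look like the following (the payload and iteration count are made up for illustration; swap in `orjson.dumps` or `ujson.dumps` if installed):

```python
import json
import timeit

# An arbitrary, moderately nested payload for illustration.
payload = {"ids": list(range(500)), "name": "example", "nested": {"k": [1.5] * 50}}

def bench(dumps, number=2000):
    # Time `number` serializations of the payload with the given dumps().
    return timeit.timeit(lambda: dumps(payload), number=number)

print(f"stdlib json: {bench(json.dumps):.4f}s")
```

Because the payload shape dominates, results from one domain rarely transfer directly to another.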
@gatopeich I don't doubt it. And the benchmarks that I've seen elsewhere seem consistent. However, there is a question about conversion to strings or byte strings. And, in my experiments, orjson did not perform as well inside Sanic routes. Perhaps there are some other tweaks we need to make, and I have not tried it in relation to #1475. That will be a good experiment too. With that said, the `response.json` method allows you to pass a `dumps` kwarg to override the usage of ujson if you choose.
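The `dumps` hook described here boils down to accepting any serializer callable. A standalone sketch of that idea (not Sanic's code; `render_json` is a hypothetical function):

```python
import json

def render_json(data, dumps=json.dumps):
    # Any callable that serializes `data` can be swapped in; the body
    # is normalized to bytes, since bytes are what hit the socket.
    body = dumps(data)
    return body if isinstance(body, bytes) else body.encode("utf-8")

print(render_json({"ok": True}))  # b'{"ok": true}'
```

In Sanic itself this corresponds to calls like `json(data, dumps=orjson.dumps)` shown in the benchmark scripts earlier in the thread.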
Address the default JSON library replacement item mentioned in #1479
Since https://github.com/ijl/orjson has started supporting wheels for all platforms, I suppose we can now migrate to orjson as the default library for Python 3.6 and above. However, for 3.5 we will still retain `ujson` until `orjson` is backported to it, or until we decide to drop `ujson` completely and rely on the default json library for Python 3.5.