
Update µWS (0.10.9) #41

Closed
ghost opened this issue Oct 13, 2016 · 6 comments

Comments

@ghost

ghost commented Oct 13, 2016

It's a rewrite. Run on Linux for perf.

@ghost
Author

ghost commented Oct 13, 2016

However, the C++ server is also the most verbose and the most complex implementation. The language is an enormous multi-paradigm conglomeration that includes everything from low-level memory management, raw pointers, and inline assembly to classes with multiple inheritance, templates, lambdas, and exceptions.

Seems about right for someone that puts JSON parsing in the hot path of a WebSocket benchmark 😉

This is the interface of 0.10:

#include <uWS/uWS.h>

int main() {
    uWS::Hub h;

    h.onMessage([&h](uWS::WebSocket<uWS::SERVER> ws, char *message, size_t length, uWS::OpCode opCode) {
        // either echo it
        ws.send(message, length, opCode);
        // or broadcast it
        h.getDefaultGroup<uWS::SERVER>().broadcast(message, length, opCode);
    });

    h.listen(3000);
    h.run();
}

@ghost
Author

ghost commented Oct 13, 2016

No but seriously, you do realize JSON parsing kind of taints the whole benchmark? There are about 5 million different JSON parsers available for C++. Parsing JSON is an O(n) operation while parsing a WebSocket frame is an O(1) operation. You could instead just change the benchmark so that you simply differentiate between echo and broadcast based on the length of the message. That way you eliminate any JSON bullshit from the equation.
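A minimal sketch of that length-based rule (this is an illustration, not code from the thread; the 128-byte cutoff is an arbitrary value chosen for the example):

```cpp
#include <cstddef>

// Hypothetical dispatch rule replacing JSON parsing: the echo/broadcast
// decision becomes a single O(1) length comparison. The 128-byte
// threshold is an arbitrary illustration, not part of the benchmark.
constexpr std::size_t kBroadcastThreshold = 128;

inline bool shouldBroadcast(std::size_t length) {
    return length >= kBroadcastThreshold;
}
```

Inside an onMessage handler like the one shown earlier in the thread, this predicate would select between `ws.send(...)` and `broadcast(...)` without touching the payload contents.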

@jackc
Contributor

jackc commented Oct 14, 2016

However, the C++ server is also the most verbose and the most complex implementation. The language is an enormous multi-paradigm conglomeration that includes everything from low-level memory management, raw pointers, and inline assembly to classes with multiple inheritance, templates, lambdas, and exceptions.

Seems about right for someone that puts JSON parsing in the hot path of a WebSocket benchmark 😉

The intent of the shootout isn't only to be a benchmark of performance. The goal is also to compare how much effort, skill, and knowledge is required to develop in the various languages and frameworks. I don't think it is controversial to say it takes more effort, skill, and knowledge to be effective in C++ than any other popular language in use today.

To be honest, I actually wanted the shootout to be even further from a strict performance metric of just the websocket libraries. What I would have liked was a simple pubsub with channels/rooms to subscribe and broadcast to, and with clients continuously connecting and disconnecting, rather than just broadcasting to all connections. That would get a little closer to real-world application complexity and performance. However, that substantially increases the amount of time and difficulty required to write a server implementation. So to be pragmatic, I chose just to include decoding and encoding the messages in JSON.

No but seriously, you do realize JSON parsing kind of taints the whole benchmark? There are about 5 million different JSON parsers available for C++. Parsing JSON is an O(n) operation while parsing a WebSocket frame is an O(1) operation.

I don't think JSON parsing impacts benchmark results nearly as much as you might think. For the broadcast test, the received message is decoded, the message to broadcast is encoded, and the broadcast-complete message is encoded. That is 1 JSON decode and 2 JSON encodes per broadcast. For the exact test results published, the test conditions were 4 concurrent broadcasts with a max of 500ms per broadcast at the 95th percentile. Basically, that boils down to 8 broadcasts a second. So when the test was straining the websocket implementations the hardest, it was only doing 8 JSON decodes and 16 JSON encodes per second. Compared to the 200,000+ websocket messages being sent per second, I don't think JSON performance makes a substantial difference in the benchmark results.
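The arithmetic in that paragraph can be made explicit (a back-of-the-envelope sketch, not benchmark code; the function names are mine):

```cpp
// Rate implied by the published test conditions: 4 concurrent broadcasts,
// each allowed up to 500 ms at the 95th percentile.
int broadcastsPerSecond(int concurrent, double secondsPerBroadcast) {
    return static_cast<int>(concurrent / secondsPerBroadcast);
}

// Each broadcast costs 1 JSON decode (the incoming message) and
// 2 JSON encodes (the broadcast message and the broadcast-complete message).
int jsonDecodesPerSecond(int broadcasts) { return broadcasts * 1; }
int jsonEncodesPerSecond(int broadcasts) { return broadcasts * 2; }
```

With the published conditions this yields 8 broadcasts, 8 decodes, and 16 encodes per second, against 200,000+ websocket messages per second.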

Obviously, it would make more of a difference in an echo benchmark, but we didn't publish those (initial results seemed more indicative of how the OS handled hundreds of thousands of open connections rather than differentiating the server implementations).

As far as upgrading to 0.10.x, I have a couple questions. Do you expect multi-threading to bring any performance advantages? I implemented it for the previous version (you may recall my question on thread safety https://github.com/uWebSockets/uWebSockets/issues/244), but only found a marginal performance difference. It spends almost all its time in the kernel, so it appears to be OS- or network-bound even with a single thread. Second, if yes, can you provide documentation or examples? (The previous multi-threaded example is gone -- https://github.com/uWebSockets/uWebSockets/blob/master/examples/multithreaded_echo.cpp)

There is no documentation written yet but a bright person like you will have no problem just reading the header file.

Sadly, I'm not as bright as you think... 😉

Finally, kudos on µWS performance. I haven't finished round 2 testing yet, but at the moment, even on the previous version it's faster with a single-thread than any of the other contestants running multi-threaded or multi-process.

@ghost
Author

ghost commented Oct 14, 2016

Obviously, it would make more of a difference in an echo benchmark, but we didn't publish those (initial results seemed more indicative of how the OS handled hundreds of thousands of open connections rather than differentiating the server implementations).

This is kind of the core of my opinion about benchmarking websocket servers. If doing any kind of echo, one needs to make sure to send multiple websocket messages per TCP chunk in order to really stress the websocket server and not just the operating system's polling/event implementation. All websocket servers are TCP servers, but they are also parsers of the websocket protocol.

If you look at µWS, it doesn't really do anything special in terms of networking - it simply calls the regular old syscalls that the operating system exposes. The special part, and the part that differs between websocket implementations, is by far the parser.

If you send multiple websocket messages per chunk, you will separate the weak servers from the strong ones. It also makes a lot more sense to send multiple websocket messages per TCP chunk, since that is what is going to happen in any real-world application anyway. Games like Agar.io send massive numbers of websocket messages all the time; there it makes a huge difference whether your server can handle parsing them or simply breaks down.

If you then take this opinion into account, you can also see why I feel including JSON in the mix is a big deal. I know that if I included JSON parsing in my benchmarks (which are much more centered around websocket parser performance), it would completely mess with the numbers.

Anyway, as long as we both reach roughly the same conclusion I guess I'm pleased. You should definitely see a very big difference in memory usage between µWS and the rest, at least.

I wouldn't care about multithreading; it should probably only be used when you really have to, or know explicitly why it would make sense.

@ghost
Author

ghost commented Oct 14, 2016

If I send one message per TCP chunk, I can get maybe a 6x difference compared to ws. If I send 10 messages per TCP chunk I get about a 30x difference. If I send 1000 messages I can get about 150x. Sending more messages than this does not improve performance (at that point the parser is fully stressed and has become the bottleneck). I'm not saying it makes sense to send 1000 messages per TCP chunk, but it certainly makes sense to send 5 or 10 - that way the benchmark tells you a lot more about the server than about the operating system (which of course is the same for all servers).
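To make the batching concrete, here is a rough sketch (not from the thread, and independent of µWS) of how a benchmark client could pack several masked text frames into one buffer, so that a single TCP send delivers many websocket messages. The fixed mask key and the under-126-byte payload limit are simplifications for illustration:

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Build one masked client-to-server text frame per RFC 6455, for payloads
// shorter than 126 bytes. A real client must use a fresh random mask key
// per frame; a fixed key is used here only to keep the sketch simple.
std::string buildFrame(const std::string &payload) {
    std::string frame;
    frame.push_back(static_cast<char>(0x81));  // FIN = 1, opcode = 1 (text)
    frame.push_back(static_cast<char>(0x80 | payload.size()));  // MASK = 1, 7-bit length
    const uint8_t key[4] = {0x12, 0x34, 0x56, 0x78};
    frame.append(reinterpret_cast<const char *>(key), 4);
    for (std::size_t i = 0; i < payload.size(); ++i)
        frame.push_back(static_cast<char>(payload[i] ^ key[i % 4]));
    return frame;
}

// Concatenate several frames so one write() carries many websocket
// messages, stressing the server's parser rather than the OS event loop.
std::string buildChunk(const std::vector<std::string> &messages) {
    std::string chunk;
    for (const auto &m : messages)
        chunk += buildFrame(m);
    return chunk;
}
```

Sending the result of `buildChunk(...)` in a single write means the server gets many frames per read, which is exactly the condition under which the parser, rather than the operating system, becomes the bottleneck.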

@jackc
Contributor

jackc commented Oct 14, 2016

Updated to 0.10.9 in 796e497.
