Throughput at 128 byte level in benchmarks low #46
Using TCP on a Mac with the server and client on the same machine, at the 128-byte level I get 4810 messages per second and a throughput of 0.6 MB per second. This is significantly faster than the Amazon benchmark results as well as the results I've seen on Linux, though much slower than the RDMA results above that we had also deemed "too slow". 128 bytes: 4800 msg/s, 0.6 MB/s
I got the following output running the 1_1 benchmark on my laptop. It seems that the
I found this Stack Overflow post; the solution seems to suggest one potential issue we could be facing. I'll look into it more.
As an update, I disabled Nagle's algorithm as recommended in the Stack Overflow post and got a significant speedup. The TCP setup is now seeing a throughput of 4000 messages per second at the 128-byte level, versus the previous 13. I'm now continuing to debug the segfault.
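For reference, disabling Nagle's algorithm comes down to one setsockopt call on the connected socket; a minimal sketch (the function name and error handling are illustrative, not the actual TCPClient/TCPServer code):

```cpp
#include <netinet/in.h>   // IPPROTO_TCP
#include <netinet/tcp.h>  // TCP_NODELAY
#include <sys/socket.h>   // setsockopt

// Disable Nagle's algorithm so small writes (e.g. 128-byte messages)
// are sent immediately instead of being held back to coalesce with
// later data while waiting for the previous segment to be ACKed.
void disable_nagle(int sock_fd) {
    int flag = 1;
    if (setsockopt(sock_fd, IPPROTO_TCP, TCP_NODELAY,
                   &flag, sizeof(flag)) != 0) {
        // propagate or log the error as appropriate for the caller
    }
}
```

With Nagle enabled, each small write can sit in the kernel until the previous segment is acknowledged, which lines up with the ~13 msg/s seen before the change.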
Current speeds on Linux: 128 bytes: 4200 msg/s, 0.54 MB/s
4200 messages per second means ~238 µs per message, which seems reasonable for a TCP stack. The throughputs are terrible, though. We should be getting one or two orders of magnitude more than this.
Right now the messages are being sent in two separate calls to write(), with the contents of the object going in the second call.
I would start by computing the throughput of the second write (the one with the contents of the object being written).
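One lightweight way to get that number is to wrap the payload write in a timer and divide bytes by elapsed time; a rough sketch under assumed names (`fd`, `buf`, and `len` stand in for whatever the TCPClient actually passes):

```cpp
#include <chrono>
#include <cstdio>
#include <unistd.h>  // write()

// Time a single write() of the object payload and report its throughput.
ssize_t timed_write(int fd, const void* buf, size_t len) {
    auto start = std::chrono::steady_clock::now();
    ssize_t written = write(fd, buf, len);
    auto end = std::chrono::steady_clock::now();

    double secs = std::chrono::duration<double>(end - start).count();
    if (written > 0 && secs > 0) {
        std::fprintf(stderr, "write: %zd bytes in %.1f us (%.2f MB/s)\n",
                     written, secs * 1e6, written / secs / 1e6);
    }
    return written;
}
```

On loopback this mostly measures the cost of copying into the kernel buffer rather than wire bandwidth, but it is enough to tell whether the write path itself or something around it is the bottleneck.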
I used the Instruments profiling tool on macOS to profile the runtime of the throughput benchmark at the 128-byte level, and it seems that a significant amount of time is spent creating the cirrus::PosixSemaphore objects for the operation, and especially in the random number generator. I'll see what I can do to speed this up. (In the Instruments output, "weight" is the total time for a symbol and all its children, while "self weight" is the time spent in the symbol itself.) Very little time was actually spent sending or receiving messages.
I profiled the tcpserver as well, and it did not have any significant performance bottlenecks. The majority of its time was spent in calls to poll, which makes sense since that is where it waits for IO. It spent ~161 ms sending and ~98 ms calling read. None of the message-processing code, that is, copying the message out of the flatbuffer and inserting it into the map, seemed to take a significant amount of time.
How much data is being transferred? What is the throughput of the write()s and read()s? Does it match the network bandwidth?
I'll look into that.
How would you recommend going about finding this? The tool I used to profile execution time unfortunately has no way to report total data sent or throughput. I attempted to use logging statements around the send and receive operations in the TCPClient, but I'm not entirely sure the results reflect the actual throughput of the operations, as my understanding is that a send can return before the data has actually gone out on the wire.

Here are the results I found by counting the number of bytes per send and the number of sends/receives per second:

Receives: twelve in a period of 3 ms, i.e. 4 full messages received per ms, or about 250 microseconds per receive, which is in line with what you'd calculated previously. Each ack message is 68 bytes long. However, this is limited by the rate at which the server receives requests, which in turn is limited by the slow generation of random names for the semaphores.

Sends: twelve full messages in a period of 3 ms, again 4 per ms, the same as the receives. Each of these is 196 bytes.

Serverside processing: in the span of one ms, the server also processes and responds to four requests. Due to the 1 ms granularity of the logging, these figures are approximate.
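Back-of-envelope, those counts work out to very little data on the wire (a rough check using only the figures above):

```cpp
#include <cstdio>

// Rough check of the logging numbers above: 4 requests per ms,
// each request 196 bytes out and each ack 68 bytes back.
int main() {
    const double reqs_per_sec = 4.0 * 1000.0;            // 4 per ms
    const double out_mb_s = reqs_per_sec * 196.0 / 1e6;  // ~0.78 MB/s sent
    const double in_mb_s  = reqs_per_sec * 68.0  / 1e6;  // ~0.27 MB/s received
    std::printf("out: %.2f MB/s, in: %.2f MB/s\n", out_mb_s, in_mb_s);
    return 0;
}
```

Both directions are far below what loopback can sustain, which is consistent with the per-request latency (the slow semaphore name generation mentioned above) being the limiter rather than the socket path.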
As for the issue of the slow generation of random names for the semaphores, I've got a few ideas:
I think the first of these options sounds like the best one to start with, but let me know if you have any suggestions.
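The list of options didn't survive into this thread, but for illustration, one cheap way to get unique semaphore names without going through a random number generator is a process-local counter combined with the PID. This is only a sketch of one possible approach, not the actual Cirrus code:

```cpp
#include <atomic>
#include <cstdint>
#include <string>
#include <unistd.h>  // getpid()

// Hypothetical replacement for RNG-based semaphore names: combine the
// process id with an incrementing counter, which is unique within the
// process and very cheap to generate.
std::string next_semaphore_name() {
    static std::atomic<uint64_t> counter{0};
    return "/cirrus_sem_" + std::to_string(getpid()) + "_" +
           std::to_string(counter.fetch_add(1));
}
```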
@jcarreira I've opened #90, which increases throughput by ~4.5x with optimizations (2.75x without), but this is still shy of the order of magnitude you wanted.
As an update on RDMA speeds, I reran the benchmark on master, and at the 128-byte level saw throughputs of only 9000 msg/s and 1.18 MB/s. Something seems off here, because this is half the msg/s of the latest TCP version but 10x the bytes/s. I'll look into it more.
It turns out in the most recent versions of
I've sent you an email about this.
Currently, we are only seeing a throughput of about 20 MB/s on 128-byte puts (before the introduction of the new interface). We should be seeing speeds of about 1 GB/s.
Current speeds (MB/s, messages/s):
128 bytes: 20.7 MB/s, 162072 msg/s
4K bytes: 556.371 MB/s, 135833 msg/s
50K bytes: 2445.7 MB/s, 47767.9 msg/s
1M bytes: 4442 MB/s, 4236.22 msg/s
10M bytes: 4369.74 MB/s, 416.731 msg/s
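For scale, reaching the 1 GB/s target at small object sizes is almost entirely a message-rate problem; a rough calculation from the table above:

```cpp
#include <cstdio>

// Message rate needed to reach 1 GB/s at each payload size from the
// table above, versus the measured rate at 128 bytes.
int main() {
    const double target_bytes_per_sec = 1e9;
    const double sizes[] = {128, 4 * 1024, 50 * 1024, 1e6, 10e6};
    for (double s : sizes) {
        std::printf("%8.0f bytes -> %10.0f msg/s needed\n",
                    s, target_bytes_per_sec / s);
    }
    // At 128 bytes that is ~7.8M msg/s, versus the ~162K msg/s measured,
    // while the 1M and 10M byte rows are already past 1 GB/s.
    return 0;
}
```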