
Scaling performance #2074

freesig opened this issue Jan 21, 2020 · 11 comments


freesig commented Jan 21, 2020

I've made this issue to track the work on increasing the performance of nodes as we scale.

Hypothesis

The current thought is that there is a performance issue in sim2h that is preventing the stress test from getting past ~2000 nodes.

What I've found so far

I do not think there is actually a performance issue in the sim2h code.
This is how the threads are being used:
[screenshot: per-thread CPU usage]
Here is the analysis:
[screenshot: thread utilization analysis]
This is the case for the current develop branch and for sim2h-futures4.
I think either sim2h is not under a large workload, or it is busy-waiting on locks.
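
A quick way to tell an idle server from one that is busy-waiting is to watch per-thread CPU time. A rough sketch with pidstat (from sysstat), assuming the server can be found by process name:

# Per-thread CPU usage for sim2h_server, sampled every second.
# An idle server shows near-zero CPU on every thread; busy-waiting shows up as
# threads burning %usr/%system without the missing-entries count moving.
pidstat -t -u -p $(pgrep sim2h_server) 1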

Issues with the test metric

The metric we are using (essentially checking that Agent ID entries are held by at least one other node) appears to give large variance when run multiple times in a row.

Note that two test nodes were down while I was running the tests today, so I removed them from the list, assuming this would not break the test. That could be a bad assumption.

Issues with perf and flamegraphs

These are great tools for finding hotspots within a program, but they do not tell you much about CPU utilization, and they can average away large peaks of CPU usage. flamescope can help with the second issue but not the first.
It is possible to get per-thread stack traces from perf, but I haven't found a nice way to do it, so instead I'm using a different profiler.
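
For completeness, a rough sketch of pulling per-thread stacks out of perf anyway (not what I ended up using); it assumes the server is running and findable by name:

# Sample all threads of the process at 99 Hz, with call stacks, for 30 seconds.
perf record -F 99 -g -p $(pgrep sim2h_server) -- sleep 30
# perf script prints every sample with its comm/TID and captured stack, so the
# output can be split per thread before feeding it to flame graph tooling.
perf script > per_thread_stacks.txt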

What I'm actually seeing

The number of entries not held drops by a few hundred on the first run but then barely goes down on the second run through.
This is a recent run of 96 nodes with 20 instances, on the slower Ubuntu computer. I have seen similar behaviour all day.

all not held missing: 1241, retrying after delay
all not held missing: 910, retrying after delay

What to try next

  • I want to test whether sim2h_server is really under a large workload or not. I think we could look at the incoming websocket packets with tcpdump to see if they are building up (see the sketch after this list).
  • We should also check the latency of packets sent from the nodes to sim2h and back again.
  • Figure out a way to rule out locks. This might be possible just by using this profiler.
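
A minimal sketch of that tcpdump/queue check, assuming sim2h is listening on port 9001 as in the perf runs below:

# Capture the sim2h websocket port so we can see how much traffic keeps arriving.
tcpdump -i any -nn 'tcp port 9001' -w sim2h_9001.pcap
# Separately, poll the socket queues once a second; a steadily growing Recv-Q
# means packets are arriving faster than sim2h reads them off the socket.
while true; do ss -tn '( sport = :9001 or dport = :9001 )'; sleep 1; done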

I am using the USE method (utilization, saturation, errors) to try to find the issue, but so far I have not found high utilization or saturation in memory or CPU, and I have not seen any errors.
If we can prove that the workload is making it to the connections, we know the problem is in sim2h (probably a logic or locking issue); otherwise the conductors might be the slow point.
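
For the record, the kind of USE-method spot checks I mean, sketched with standard sysstat/procps tools (nothing sim2h-specific):

mpstat -P ALL 1   # per-CPU utilization; look for one saturated core
vmstat 1          # r > number of CPUs means run-queue saturation; si/so means swapping
sar -n DEV 1      # NIC throughput and errors, to rule out network saturation
free -m           # memory utilization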

freesig self-assigned this Jan 21, 2020

freesig commented Jan 21, 2020

There appear to be 17 threads running on an 8-core machine. I can see why there would be a main thread plus 8 for the number of CPUs, but there are an additional 8. I might be reading the instrument wrong; perhaps the process has been sampled twice, although the threads do have different TIDs.
I can't see anywhere that another thread is being spawned.
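
A quick cross-check on the thread count that doesn't depend on the profiler, as a sketch:

# List every thread of the process with its TID (SPID column).
ps -T -p $(pgrep sim2h_server)
# Or just count the kernel task entries.
ls /proc/$(pgrep sim2h_server)/task | wc -l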


freesig commented Jan 22, 2020

So here we can see how the threads are running.
It appears a task gets started on the main thread and then runs on the other threads, but there are a lot of transitions (the yellow lines), which show where one thread holds a mutex or shared condition variable that is holding up another thread.
The dark green is work being done and the light green is just waiting.
[screenshot: thread timeline]


freesig commented Jan 22, 2020

I just ran #2072 on the fast computer; after two runs:

all not held missing: 948, retrying after delay
all not held missing: 948, retrying after delay

Half, then nothing. This doesn't feel like a performance problem if it just stops.


freesig commented Jan 22, 2020

Performance stats

Performance counter stats for 'sim2h_server -p 9001 -s 20':

      1,046,895.25 msec task-clock                #    1.738 CPUs utilized          
        23,404,787      context-switches          # 22356.384 M/sec                 
            77,490      cpu-migrations            #   74.019 M/sec                  
            84,520      page-faults               #   80.734 M/sec                  
 2,676,310,126,213      cycles                    # 2556426.505 GHz                 
 3,378,981,425,711      instructions              #    1.26  insn per cycle         
   771,208,994,703      branches                  # 736663175.106 M/sec             
    15,198,925,820      branch-misses             #    1.97% of all branches        

     602.234688499 seconds time elapsed

     852.912081000 seconds user
     332.070658000 seconds sys
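
For reference, a sketch of the kind of invocation that produces stats like these (the exact command is an assumption based on the header line above), plus the arithmetic behind the "CPUs utilized" figure:

# Wrap the whole server run in perf stat.
perf stat -- sim2h_server -p 9001 -s 20
# "CPUs utilized" is task-clock divided by wall-clock time:
#   1,046,895 ms task-clock / 602.23 s elapsed ≈ 1.74 CPUs busy on average.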


freesig commented Jan 22, 2020

Here's a second run. Both runs panic at the end.

Performance counter stats for 'sim2h_server -p 9001 -s 20':

      1,124,394.82 msec task-clock                #    1.816 CPUs utilized          
        24,123,768      context-switches          # 21454.906 M/sec                 
           148,794      cpu-migrations            #  132.333 M/sec                  
            73,785      page-faults               #   65.622 M/sec                  
 2,922,602,774,558      cycles                    # 2599269.273 GHz                 
 3,991,680,444,668      instructions              #    1.37  insn per cycle         
   914,011,837,874      branches                  # 812892845.278 M/sec             
    16,265,626,868      branch-misses             #    1.78% of all branches        

     619.030957231 seconds time elapsed

     932.561961000 seconds user
     324.481601000 seconds sys

Panic Backtrace
The panic happens at the end of the test, after node admin reset configs/var/exam_1configs_aws-pool_100_us_develop_json.json is called.


freesig commented Jan 22, 2020

Network tests

So the current branch seems to be stopping at exactly 948 not held (happened twice).
I thought it would be good to check out the network traffic during the run.

Data in and out

I logged sar -n TCP 1 for the entire run of a test.

active/s
    The number of times TCP connections have made a direct transition to the SYN-SENT state from the CLOSED state per second [tcpActiveOpens].

passive/s
    The number of times TCP connections have made a direct transition to the SYN-RCVD state from the LISTEN state per second [tcpPassiveOpens].

iseg/s
    The total number of segments received per second, including those received in error [tcpInSegs]. This count includes segments received on currently established connections.

oseg/s
    The total number of segments sent per second, including those on current connections but excluding those containing only retransmitted octets [tcpOutSegs].

Here's the log.
The test starts at 02:29:37, hits the first run at 02:34:49, and finishes at 02:40:51.

Notes

  • High traffic in and out at the beginning of the test.
  • At around 02:32:47 traffic starts to drop a lot, but there are occasional peaks.
  • The peaks are almost exactly 30 seconds apart 🤔
    I should note, however, that this is not just the sim2h port, so it could be another TCP connection doing something every 30 seconds. I doubt that, though, because there was not much other traffic on the server. Here is a log of all the non-ec2 traffic; there's nothing receiving with a large queue every 30 seconds. I still want to check this, but I'm struggling to find a nice way to see total traffic for a single port (see the sketch below).
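
One way to get per-port totals would be accounting-only iptables rules; a sketch, assuming sim2h is on port 9001:

# Rules with no -j target drop nothing; they only count matching packets/bytes.
iptables -I INPUT  -p tcp --dport 9001
iptables -I OUTPUT -p tcp --sport 9001
# Read the counters during or after the run (-v shows pkts/bytes per rule).
iptables -vnL INPUT  | head -n 5
iptables -vnL OUTPUT | head -n 5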

Tcp connections

I also logged all the TCP connections every second with netstat -tpc.
The full log (~50 MB) is here.
Here are a couple of single connections:
57878
34488
Each line is a second apart.
You can generate these easily from the log with:
grep 34488 log_tcp.log, where the number is the port you want to look at.

It looks like there are many seconds between messages on a port.
Keep in mind, though, that Recv-Q and Send-Q only show the size of the unread queue on that connection, not the total data that has passed through.
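
An alternative sketch that samples one connection directly with ss, which also reports per-socket TCP info such as rtt and retransmits (34488 is just the example port from above):

# One timestamped sample per second of the queues and TCP internals for a port.
while true; do
  date +%T
  ss -tni '( sport = :34488 or dport = :34488 )'
  sleep 1
done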


freesig commented Jan 22, 2020

The changes on sim2h-futures4 are using the threads a lot better, but there's still a long way to go.
I'll dig into this more tomorrow, but first I want to figure out why everything just stops at 948 not held.
[screenshot: thread utilization on sim2h-futures4]


freesig commented Jan 28, 2020

This is from a direct message stress test:
[screenshot: direct message stress test]


freesig commented Jan 30, 2020

Run: 99 nodes, 20 instances

Performance counter stats for process id '45757':

      8,305,189.12 msec task-clock                #    6.929 CPUs utilized          
       475,991,192      context-switches          # 57312.506 M/sec                 
        19,424,846      cpu-migrations            # 2338.881 M/sec                  
           730,687      page-faults               #   87.980 M/sec                  
18,835,997,308,304      cycles                    # 2267979.369 GHz                   (62.53%)
12,451,259,048,178      instructions              #    0.66  insn per cycle           (75.02%)
 2,773,115,227,216      branches                  # 333901519.546 M/sec               (74.98%)
    47,414,025,732      branch-misses             #    1.71% of all branches          (74.97%)
 3,519,841,544,680      L1-dcache-loads           # 423812335.238 M/sec               (74.99%)
   334,374,499,494      L1-dcache-load-misses     #    9.50% of all L1-dcache hits    (74.99%)
    51,509,984,841      LLC-loads                 # 6202144.809 M/sec                 (50.03%)
    14,351,669,025      LLC-load-misses           #   27.86% of all LL-cache hits     (50.05%)

    1198.583014941 seconds time elapsed


freesig commented Jan 30, 2020

Run: 99 nodes, 30 instances

Performance counter stats for process id '9047':

      7,323,919.52 msec task-clock                #    7.475 CPUs utilized          
       427,504,936      context-switches          # 58371.063 M/sec                 
        18,495,039      cpu-migrations            # 2525.293 M/sec                  
           815,581      page-faults               #  111.359 M/sec                  
16,577,941,681,333      cycles                    # 2263534.275 GHz                   (62.56%)
10,891,486,085,643      instructions              #    0.66  insn per cycle           (75.03%)
 2,419,253,453,568      branches                  # 330322256.918 M/sec               (75.01%)
    40,595,556,677      branch-misses             #    1.68% of all branches          (74.98%)
 3,093,770,156,751      L1-dcache-loads           # 422420039.975 M/sec               (74.97%)
   299,600,755,595      L1-dcache-load-misses     #    9.68% of all L1-dcache hits    (74.97%)
    46,174,421,843      LLC-loads                 # 6304605.750 M/sec                 (50.02%)
    12,925,202,115      LLC-load-misses           #   27.99% of all LL-cache hits     (50.05%)

     979.799670134 seconds time elapsed
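
For reference, a sketch of the kind of invocation that produces the cache counters above (the flags are an assumption; -d adds the L1-dcache/LLC events, and the PID is the running sim2h_server):

# Attach to the already-running server and count detailed events for ~20 minutes.
perf stat -d -p $(pgrep sim2h_server) -- sleep 1200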


freesig commented Jan 30, 2020

Direct messages are timing out at about 500 nodes.
This is what sim2h outputs:

[2020-01-30T05:19:18Z ERROR sim2h] VERY SLOW metric - sim2h-state-build_handle_new_entry_data - 1082 ms
[2020-01-30T05:19:18Z ERROR sim2h] VERY SLOW metric - sim2h-state-build_handle_new_entry_data - 1082 ms
[2020-01-30T05:19:18Z ERROR sim2h] VERY SLOW metric - sim2h-state-build_handle_new_entry_data - 1158 ms
[2020-01-30T05:19:18Z ERROR sim2h] VERY SLOW metric - sim2h-state-build_handle_new_entry_data - 1265 ms
[2020-01-30T05:19:18Z ERROR sim2h] VERY SLOW metric - sim2h-state-build_handle_new_entry_data - 1265 ms
[2020-01-30T05:19:18Z ERROR sim2h] VERY SLOW metric - sim2h-state-build_handle_new_entry_data - 1327 ms
[2020-01-30T05:19:19Z ERROR sim2h] VERY SLOW metric - sim2h-handle_connection_msg - 1466 ms
[2020-01-30T05:19:19Z ERROR sim2h] VERY SLOW metric - sim2h-handle_connection_msg - 1466 ms
[2020-01-30T05:19:19Z ERROR sim2h] VERY SLOW metric - sim2h-handle_connection_msg - 1466 ms
[2020-01-30T05:19:19Z ERROR sim2h] VERY SLOW metric - sim2h-handle_connection_msg - 1466 ms
[2020-01-30T05:19:19Z ERROR sim2h] VERY SLOW metric - sim2h-handle_connection_msg - 1466 ms
[2020-01-30T05:19:19Z ERROR sim2h] VERY SLOW metric - sim2h-handle_connection_msg - 1466 ms
[2020-01-30T05:19:19Z ERROR sim2h] VERY SLOW metric - sim2h-handle_connection_msg - 1466 ms
[2020-01-30T05:19:19Z ERROR sim2h] VERY SLOW metric - sim2h-handle_connection_msg - 1466 ms
[2020-01-30T05:19:19Z ERROR sim2h] VERY SLOW metric - sim2h-handle_connection_msg - 1466 ms
[2020-01-30T05:19:19Z ERROR sim2h] VERY SLOW metric - sim2h-handle_connection_msg - 1459 ms
[2020-01-30T05:19:19Z ERROR sim2h] VERY SLOW metric - sim2h-handle_connection_msg - 1459 ms
[2020-01-30T05:19:19Z ERROR sim2h] VERY SLOW metric - sim2h-handle_connection_msg - 1459 ms
[2020-01-30T05:19:19Z ERROR sim2h] VERY SLOW metric - sim2h-handle_connection_msg - 1459 ms

That just continues like that for the whole run.
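
To see which metrics dominate without scrolling, the log can be aggregated; a small sketch, assuming the output is captured to sim2h.log:

# Count the VERY SLOW lines per metric name and sort by frequency.
grep 'VERY SLOW metric' sim2h.log | awk '{print $(NF-3)}' | sort | uniq -c | sort -rn
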
Here's a flame graph:

[flame graph image]
flame.zip

It doesn't look like memory or the network is backing up.

I've captured some cassettes and connection/memory data. I'm going to try to profile the cassette locally.
