
too many heap allocs/escapes in sipcallmon #9

Open
poandrei opened this issue Mar 20, 2021 · 7 comments

@poandrei (Contributor)

From intuitive-issues created by poandrei: intuitivelabs/intuitive-issues#262

Some code in sipcallmon causes too many heap allocs/escapes.

Heap profile on the evring_avoid_escapes branch, allocations over the last 30s
(go tool pprof -trim_path=/go/ -inuse_objects heap):

Showing top 10 nodes out of 18
      flat  flat%   sum%        cum   cum%
     16384 52.37% 52.37%      16131 51.56%  github.com/intuitivelabs/sipcallmon.processPackets
     -1024  3.27% 49.09%      -1024  3.27%  github.com/intuitivelabs/calltr.AllocCallEntry
       514  1.64% 50.74%        514  1.64%  github.com/intuitivelabs/sipcallmon.SIPStreamFactory.New
       257  0.82% 51.56%        257  0.82%  github.com/intuitivelabs/sipcallmon.udpSIPMsg
         0     0% 51.56%        514  1.64%  github.com/google/gopacket/tcpassembly.(*Assembler).AssembleWithTimestamp
         0     0% 51.56%      -1024  3.27%  github.com/google/gopacket/tcpassembly.(*Assembler).FlushWithOptions
         0     0% 51.56%      -1024  3.27%  github.com/google/gopacket/tcpassembly.(*Assembler).sendToConnection
         0     0% 51.56%      -1024  3.27%  github.com/google/gopacket/tcpassembly.(*Assembler).skipFlush
         0     0% 51.56%        514  1.64%  github.com/google/gopacket/tcpassembly.(*StreamPool).getConnection
         0     0% 51.56%      -1024  3.27%  github.com/intuitivelabs/calltr.ProcessMsg

[graph: profile_heap]
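For reference, a minimal sketch of how such a heap profile can be collected from a running Go process, assuming the binary exposes the standard net/http/pprof handlers on its HTTP port (the exact wiring in sipcm may differ):

```go
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers /debug/pprof/* on http.DefaultServeMux
)

func main() {
	// Serve the profiling endpoints; a profile like the one above can then
	// be taken with something like:
	//   go tool pprof -trim_path=/go/ -inuse_objects http://localhost:8080/debug/pprof/heap
	// Two snapshots taken ~30s apart can be compared with -diff_base to see
	// only the allocations made in that window.
	log.Println(http.ListenAndServe("localhost:8080", nil))
}
```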

@poandrei (Contributor Author)

Stats at the end of the run on the timestamp branch, command line as above (10x speed, no debugging and so on).
Note that there are slight differences between the end stats of the versions: the fastest version ("timestamp") seems to generate a few more events, but there are also some diffs in the TCP connections (no explanation for that yet).

run time: 1m25.677687121s

Statistics:
  4940647 packets   4886763 ipv4     53884 ipv6      8416 other         0 inj.
     8416 ip4frags         0 defrag
  4465591 udp:   4414178 udp4     51413 upd6
   466640 tcp:    464169 tcp4      2471 tcp6
        0 tls         0 dtls         0 sctp 
tcp:     12999 streams    221616 reassembled segs  50656859 total bytes 
tcp:      1229 SYNs       830 FINs     12999 closed 
tcp:     28937 ignored      1264 ignored streams
tcp:        33 out-of-order     11774 missed         0 too big
tcp: 881594545 missed bytes
tcp:        51 stream timeouts     11666 reassembly timeouts
Parsed:   4465591 total    3902615 ok        1656 errors         0 probes
Parsed:   3822410 udp ok       392 errs     80205 tcp ok      1264 errs
Errors:      1656 parse          1 offset mismatch         1 body
Tracked:   3822410 udp     80203 tcp         0 err udp         2 err tcp
	"invalid character in header"  =      1468
	"bad header"                   =       128
	"incomplete/truncated data"    =        60
Requests: 1935221 
	REGISTER   =   1416757
	INVITE     =    105714
	ACK        =    106326
	BYE        =     46260
	PRACK      =         2
	CANCEL     =      9759
	OPTIONS    =    167150
	SUBSCRIBE  =     38696
	NOTIFY     =     42645
	UPDATE     =        96
	INFO       =        10
	PUBLISH    =      1798
	MESSAGE    =         4
	OTHER      =         4
Replies: 1967394 
	1XX =    171911
	2XX =   1244246
	3XX =         2
	4XX =    546770
	5XX =      2871
	6XX =      1594

calls.fail_new                :      0              
calls.fail_lim                :      0              
calls.active                  :  60647        M:  61551
calls.inv_1st                 :     45        M:     84
calls.inv_early_dlg           :    402        M:    446
calls.inv_neg                 :   2031        M:   2751
calls.inv_established         :   4362        M:   4379
calls.inv_bye_seen            :      3        M:     18
calls.inv_bye_repl            :   1363        M:   1409
calls.inv_canceled            :    454        M:    563
calls.non_inv_1st             :   1891        M:   1942
calls.non_inv_neg             :    786        M:    880
calls.non_inv_fin             :  49310        M:  50726

ev_rate.total_events          : 497643              
ev_rate.total_blst            :  47369              
ev_rate.blst_sent             :    197              
ev_rate.tracking_fail         :      0              
ev_rate.blst_recovered        :     34              

ev_rate_gc.gc_runs            :      8              
ev_rate_gc.gc_target_met      :      0              
ev_rate_gc.gc_walked          :  23420 m:  22951 M:  30318
ev_rate_gc.gc_timeout         :      0              
ev_rate_gc.gc_missed_ticks    :      0              
ev_rate_gc.hard_gc_runs       :      0              
ev_rate_gc.hard_gc_tgt_met    :      0              
ev_rate_gc.hard_gc_fail       :      0              
ev_rate_gc.hard_gc_timeout    :      0              
ev_rate_gc.hard_gc_walked     :      0 m:      0 M:      0
ev_rate_gc.light_gc_runs      :      0              
ev_rate_gc.light_gc_tgt_met   :      0              
ev_rate_gc.light_gc_fail      :      0              
ev_rate_gc.light_gc_timeout   :      0              
ev_rate_gc.light_gc_walked    :      0 m:      0 M:      0
ev_rate_gc.max_gc_retr_fail   :      0              
ev_rate_gc.alloc_fail         :      0              

ev_ring0.signals              :      0              
ev_ring0.skipped_sigs         :      0              
ev_ring0.skipped_sigs_max     :               M:      0
ev_ring0.skip_rdonly          :      0              
ev_ring0.skip_rdonly_max      :               M:      0
ev_ring0.fail                 :      0              
ev_ring0.rd_busy              :      0              
ev_ring0.wr_busy              :      0              
ev_ring0.blst                 : 806012              
ev_ring0.queued               : 288906              
ev_ring0.read_only            :      0        M:      0
ev_ring0.read_only2           :      0        M:      0
ev_ring0.get_old_idx          :      0              
ev_ring0.get_high_idx         :      0              
ev_ring0.get_last_idx         :      0              
ev_ring0.get_inv_ev           :      0              
ev_ring0.get_busy_ev          :      0              
ev_ring0.parallel             :      0        M:      4

ev_ring0.evtype.empty         :      0              
ev_ring0.evtype.call-start    :  10317              
ev_ring0.evtype.call-end      :  11102              
ev_ring0.evtype.call-attempt  :   5415              
ev_ring0.evtype.auth-failed   :   6855              
ev_ring0.evtype.action-log    :      0              
ev_ring0.evtype.reg-new       :  64495              
ev_ring0.evtype.reg-del       : 190639              
ev_ring0.evtype.reg-expired   :     83              
ev_ring0.evtype.sub-new       :      0              
ev_ring0.evtype.sub-del       :      0              
ev_ring0.evtype.other-failed  :   4486              
ev_ring0.evtype.other-timeout :   2021              
ev_ring0.evtype.other-ok      : 155060              
ev_ring0.evtype.parse-error   :   1656              
ev_ring0.evtype.msg-probe     : 642789              
ev_ring0.evtype.invalid       :      0              


regs.fail_new                 :      0              
regs.fail_lim                 :      0              
regs.active                   :  22948        M:  23284

@poandrei (Contributor Author)

After more work on the branch:

Showing top 10 nodes out of 23
      flat  flat%   sum%        cum   cum%
     -3278 53.92% 53.92%      -3278 53.92%  github.com/intuitivelabs/calltr.AllocEvRateEntry
      2170 35.70% 18.23%       2170 35.70%  github.com/intuitivelabs/calltr.AllocCallEntry
       357  5.87% 12.35%        357  5.87%  github.com/intuitivelabs/sipcallmon.SIPStreamFactory.New
      -257  4.23% 16.58%      -1365 22.45%  github.com/intuitivelabs/sipcallmon.udpSIPMsg
        16  0.26% 16.32%         16  0.26%  sync.(*poolChain).pushHead
         0     0% 16.32%        358  5.89%  github.com/google/gopacket/tcpassembly.(*Assembler).AssembleWithTimestamp
         0     0% 16.32%        357  5.87%  github.com/google/gopacket/tcpassembly.(*StreamPool).getConnection
         0     0% 16.32%         16  0.26%  github.com/intuitivelabs/calltr.(*EvRateEntry).Unref
         0     0% 16.32%         16  0.26%  github.com/intuitivelabs/calltr.(*EvRateHash).ForceEvict
         0     0% 16.32%      -3278 53.92%  github.com/intuitivelabs/calltr.(*EvRateHash).IncUpdate

This looks much better: the allocations (new allocs in the last 30s) happen mostly in the expected places (Alloc*, new TCP connections).
[graph: profile_heap_2]
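As a side note (not from the original comment), the compiler's escape analysis output is the quickest way to see why a particular value is forced onto the heap; a small, purely illustrative example:

```go
package main

// Build with:
//   go build -gcflags='-m' .
// The compiler prints lines like "moved to heap: m" for values it cannot
// keep on the stack.

type msgBuf struct{ b [512]byte }

// Returning a pointer to a local forces the value onto the heap.
func newBufEscapes() *msgBuf {
	var m msgBuf
	return &m // escapes: the caller keeps a reference past this frame
}

// Filling a caller-provided buffer keeps the allocation decision with the caller.
func fillBuf(m *msgBuf) {
	m.b[0] = 1
}

func main() {
	m := newBufEscapes()
	fillBuf(m)
	_ = m
}
```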

@poandrei (Contributor Author)

Added a new branch (timestamp): intuitivelabs/calltr@c8629be

Measurements (not very precise due to frequency scaling):

  • sipcm running with 10x speed, big 14m pcap:
./sipcm -pcap /local/store/andrei/intuitivelabs/traffic.pcap    -http_port 8080 -evr_limits=23,103,203  -evr_intervals="[1s 1m 1h]" -replay_min_delay=0 -event_types_blst="msg-probe,other-failed,other-timeout,other-ok,parse-error"   -replay  -replay_scale=0.1 -run_forever -log_level=2 -parse_log_level=0 -debug_calltr=0
  • record:
pidstat -T TASK  -H -h -u -r -p `pgrep sipcm` 1 >/tmp/sipcm_timestamp.pidstat
sh plot_pidstat.sh /tmp/sipcm_timestamp.pidstat

Results:

master:     avg cpu: 178%, max cpu: 204%, max mem: 613Mb
noescape:   avg cpu: 172%, max cpu: 200%, max mem: 571Mb   =>  3.3% cpu &  6.5% mem improvement
timestamp:  avg cpu: 170%, max cpu: 194%, max mem: 526Mb   =>  4.5% cpu & 14.2% mem improvement

Graphs:

  • master: [graph: sipcm_master pidstat2]
  • noescape: [graph: sipcm_noesc pidstat]
  • timestamp: [graph: sipcm_timestamp pidstat]
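(As a complement to pidstat, not something used in this thread: allocation counts per operation can also be tracked over time with a plain Go benchmark; hotPath below is a hypothetical stand-in for the code path under test.)

```go
// file: bench_test.go (standalone sketch)
package bench

import "testing"

// hotPath is a placeholder for the per-packet processing being optimized;
// the real functions in sipcallmon have different signatures.
func hotPath(b []byte) int { return len(b) }

// Run with: go test -bench=HotPath -benchmem
// The allocs/op column makes heap-allocation regressions visible per change.
func BenchmarkHotPath(b *testing.B) {
	buf := make([]byte, 512)
	b.ReportAllocs()
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		_ = hotPath(buf)
	}
}
```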

@poandrei (Contributor Author)

Same test, this time with my new wheel timers (1ms tick), alpha version (still some debug code, very lightly tested):

master:     avg cpu: 178%, max cpu: 204%, max mem: 613Mb
noescape:   avg cpu: 172%, max cpu: 200%, max mem: 571Mb   =>  3.3% cpu &  6.5% mem improvement
timestamp:  avg cpu: 170%, max cpu: 194%, max mem: 526Mb   =>  4.5% cpu & 14.2% mem improvement
wtimer:     avg cpu:  68%, max cpu:  80%, max mem: 226Mb   => 61.8% cpu & 63% mem improvement compared to 1 week ago
                                                           => 60% cpu & 57% mem compared to latest master

In other words, only 40% of the cpu (0.4!) and 42% of the mem of the latest master on the same traffic.

Graph (same scale as before):
[graph: sipcm_wtimer3 pidstat]
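(For context on the approach, not the actual calltr code: a very stripped-down, single-level timing wheel with a fixed tick looks roughly like the sketch below; the real implementation is hierarchical and avoids per-timer allocations.)

```go
package main

import "time"

// timer is one pending timeout; real implementations use intrusive lists
// to avoid allocating a node per timer.
type timer struct {
	expire int64 // tick count at which it fires
	fn     func()
	next   *timer
}

// wheel is a single-level timing wheel: slot i holds the timers whose
// expiry tick is congruent to i modulo len(slots).
type wheel struct {
	slots []*timer
	tick  time.Duration
	now   int64 // current tick count
}

func newWheel(slots int, tick time.Duration) *wheel {
	return &wheel{slots: make([]*timer, slots), tick: tick}
}

// add schedules fn to run after d, rounded up to whole ticks.
func (w *wheel) add(d time.Duration, fn func()) {
	ticks := int64((d + w.tick - 1) / w.tick)
	t := &timer{expire: w.now + ticks, fn: fn}
	i := t.expire % int64(len(w.slots))
	t.next, w.slots[i] = w.slots[i], t
}

// advance moves the wheel forward one tick (e.g. every 1ms) and fires
// the timers that are due in the current slot.
func (w *wheel) advance() {
	w.now++
	i := w.now % int64(len(w.slots))
	var keep *timer
	for t := w.slots[i]; t != nil; {
		next := t.next
		if t.expire <= w.now {
			t.fn()
		} else { // due on a later revolution of the wheel: keep it
			t.next, keep = keep, t
		}
		t = next
	}
	w.slots[i] = keep
}

func main() {
	w := newWheel(1024, time.Millisecond)
	w.add(3*time.Millisecond, func() { println("fired") })
	for i := 0; i < 4; i++ {
		w.advance() // in a real system this runs off a 1ms ticker
	}
}
```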

@poandrei (Contributor Author)

Latest memory profile:

      flat  flat%   sum%        cum   cum%
    109460 45.94% 45.94%     109460 45.94%  github.com/intuitivelabs/calltr.AllocCallEntry
     78859 33.10% 79.04%      78859 33.10%  github.com/intuitivelabs/calltr.AllocRegEntry
     21039  8.83% 87.87%      21039  8.83%  github.com/intuitivelabs/sipcallmon.SIPStreamFactory.New
     17647  7.41% 95.28%      17647  7.41%  github.com/intuitivelabs/calltr.AllocEvRateEntry
      8657  3.63% 98.91%       8657  3.63%  github.com/intuitivelabs/sipcallmon.(*EvRing).Init
      2565  1.08%   100%     204918 86.01%  github.com/intuitivelabs/sipcallmon.udpSIPMsg
         1 0.00042%   100%      21051  8.84%  github.com/google/gopacket/tcpassembly.(*StreamPool).getConnection
         0     0%   100%      22145  9.29%  github.com/google/gopacket/tcpassembly.(*Assembler).AssembleWithTimestamp
         0     0%   100%      17647  7.41%  github.com/intuitivelabs/calltr.(*EvRateHash).IncUpdate
         0     0%   100%     188319 79.04%  github.com/intuitivelabs/calltr.ProcessMsg

[graph: sipcm_wtimer_heap1]
[graph: sipcm_wtimer_allocs]

@poandrei (Contributor Author)

Unfortunately CPU frequency scaling interfered with most of the measurements, so the latest numbers are probably even better than reported (the CPU ran at only 2.6-2.8GHz out of 3.3GHz).

I've tried increasing the replay speed. It looks like 33x (-replay_scale 0.03) is close to the maximum (the CPU reaches its maximum frequency, and bigger speed-ups do not seem to run the pcap any faster than 33x; I still have to investigate whether the problem is in the replay code).

Results at 33x:

wtimer:     avg cpu:   131%  max cpu  148%  max mem: 806Mb 
Max active tracked state: 163823, but 137096 of those are non-INVITE and only 4378 are established INVITEs.
Events: 17111/s
call-start: 382/s
call-attempt: 113/s
reg-new: 2456/s
reg-del: 7122/s  --- something strange here
other-ok: 4383/s
msg-probe: 24357/s 

active regs: 23564

Graph:
[graph: sipcm_wtimer_33x pidstat]

poandrei added a commit to intuitivelabs/calltr that referenced this issue Mar 23, 2021
Use high-performance hierarchical timing-wheel based timers.
Massive overall performance improvement: 2+x faster and less than
half the memory usage.

See also: #9,  intuitivelabs/sipcallmon#9
@poandrei (Contributor Author)

Results at 33x: wtimer vs. wtimer + alloc_oneblock_v2

wtimer:     avg cpu: 131%, max cpu: 148%, max mem: 806Mb
oneblk:     avg cpu: 115%, max cpu: 123%, max mem: 774Mb

=> 12% avg cpu improvement, 16% max cpu, 3% mem

Graph:

[graph: sipcm_oneblock_33x pidstat]

poandrei added a commit to intuitivelabs/calltr that referenced this issue Mar 29, 2021
Use different-size pools to get the memory blocks. Each of the
common block sizes gets its own pool. The pools use sync.Pool
via bytespool.
Unfortunately, since we allocate byte blocks that we force-cast to
our required data structures (CallEntry + buffer or RegEntry +
buffer), the Go GC will never see the pointers in the structs inside
our blocks and will consider them free for taking at the next run.
To work around this we keep a pointer to each allocated block in a
separate list (see alloc_plist.go). This "list" is in fact a
simple linked list of page-sized pointer blocks. Locking is used
only when a new page-sized pointer block needs to be added or freed
(otherwise only atomic operations are used).

It brings another 12% cpu and 3% memory improvement.
See /intuitivelabs/sipcallmon#9
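A minimal sketch of the per-size-class pool idea described in the commit message, using sync.Pool directly (the real code goes through bytespool and keeps the GC-visibility pointer list in alloc_plist.go; all names here are illustrative):

```go
package main

import "sync"

// Size classes for the common block sizes (e.g. CallEntry+buffer, RegEntry+buffer).
var classes = []int{512, 1024, 2048, 4096}

// One sync.Pool per size class.
var pools = func() []*sync.Pool {
	ps := make([]*sync.Pool, len(classes))
	for i, sz := range classes {
		sz := sz
		ps[i] = &sync.Pool{New: func() interface{} { return make([]byte, sz) }}
	}
	return ps
}()

// classFor returns the index of the smallest class that fits n bytes, or -1.
func classFor(n int) int {
	for i, sz := range classes {
		if n <= sz {
			return i
		}
	}
	return -1
}

// allocBlock returns a block of at least n bytes from the matching pool.
// Note: because the real code force-casts such blocks to structs containing
// pointers, which the GC cannot see, it also records every live block in a
// separate page-sized pointer list (alloc_plist.go); that part is omitted here.
func allocBlock(n int) []byte {
	if i := classFor(n); i >= 0 {
		return pools[i].Get().([]byte)[:n]
	}
	return make([]byte, n) // too big for any class: plain allocation
}

// freeBlock returns a block to its pool (only if it came from one).
func freeBlock(b []byte) {
	if i := classFor(cap(b)); i >= 0 && cap(b) == classes[i] {
		pools[i].Put(b[:cap(b)])
	}
}

func main() {
	b := allocBlock(900) // served from the 1024-byte class
	freeBlock(b)
}
```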
@poandrei poandrei self-assigned this Mar 31, 2021