
Source Cleanup: When there is a large amount of streaming, Source leakage causes OOM (Out of Memory). #1509

Closed
2k9laojia opened this issue Dec 10, 2019 · 19 comments
Labels: Duplicated (duplicated bug), TransByAI (translated by AI/GPT)

@2k9laojia

2k9laojia commented Dec 10, 2019

Description
When pushing the stream to SRS, the SRS process crashed.

Environment

  1. Operating System: CentOS
  2. Encoder (Tool and Version): ...
  3. Player (Tool and Version): ...
  4. SRS Version: 2.0.248
  5. SRS log is as follows:
    client identified, type=fmle-publish, stream_name=62c1352ea3a64621999bfd946081f392, duration=-1.00
    connected stream, tcUrl=rtmp://172.16.129.97/netposa, pageUrl=, swfUrl=, schema=rtmp, vhost=defaultVhost, port=1935, app=netposa, stream=62c1352ea3a64621999bfd946081f392, args=null
    st_thread_create failed. ret=1017(Cannot allocate memory)

Reproduction
The steps to reproduce the bug are as follows:

  1. Start SRS, run ...
  2. Push the stream, run ...
  3. The bug is reproduced, and the key information is as follows:
...

Expected Behavior

Stack analysis suggests the following cause:
During publishing, SrsSource::initialize calls hls->initialize, which fails with st_thread_create failed and returns an error code. SrsSource::fetch_or_create then calls srs_freep(source), invoking the SrsSource destructor. That releases the member play_edge, whose destructor in turn destroys the SrsPlayEdge member ingester; its stop function is called, which ultimately invokes _source->on_unpublish(). Because hls->initialize failed, the flow returned before play_edge->initialize(this, _req) ever ran, so the _source pointer in SrsEdgeIngester is still null and the call core dumps. Adding a null check before using _source resolves the crash.

I hope someone familiar with this can explain why st_thread_create failed and why that failure crashed SRS.

TRANS_BY_GPT3

@winlinvip
Member

winlinvip commented Dec 10, 2019

The reason is clearly stated in the log: 'Cannot allocate memory'.
Can the scenario of running out of memory be reproduced?

@2k9laojia
Author

2k9laojia commented Dec 11, 2019

My server had 60GB of free memory at the time, and the SRS process itself was using 3.4GB. Does "running out of memory" refer to a per-process limit? The problem occurred repeatedly while 500 streams were being published, closed, and restarted.


@2k9laojia
Author

2k9laojia commented Dec 13, 2019

Checking the SrsSource objects in the pool from SrsSource::do_cycle_all shows a pool size of 38,320. I don't know whether this has any impact.

@winlinvip
Member

winlinvip commented Dec 13, 2019

You have also used a considerable amount of memory for other things, so the available memory should be less than 60GB.

How did you test it? Can you describe the process?

Encoder (tool and version): ...
Player (tool and version): ...

@2k9laojia
Author

              total        used        free      shared  buff/cache   available
Mem:           109G         10G         61G        4.0G         38G         94G
Swap:          4.0G        250M        3.8G

@2k9laojia
Author

2k9laojia commented Dec 13, 2019

I publish 500 streams to SRS and play them on 500 clients, run that for one hour, close all streams, and then repeat the cycle. Each stream has a unique stream ID, so a new SrsSource object is created every time a stream is pushed. srs-librtmp is used for publishing and JMeter for playback (the data is discarded after it is fetched). Now that the stress test is over, the CPU and memory usage of SRS remain high. I plan to add pool cleaning in the reload handler: when this situation occurs, I will signal SRS to execute a reload and release all SrsSource objects in the pool, to see whether CPU and memory usage drop.

@2k9laojia
Author

2k9laojia commented Dec 13, 2019

(screenshot: WeChat Screenshot_20191213094616)

@2k9laojia
Author

2k9laojia commented Dec 13, 2019

(screenshot: WeChat Screenshot_20191213095329)

@2k9laojia
Author

2k9laojia commented Dec 13, 2019

Each SrsSource object starts 2 coroutines. At the time there were 38,320 SrsSource objects left in the pool, i.e. over 70,000 coroutines, and scheduling them consumes significant CPU. After the SrsSource objects are released, CPU usage drops noticeably.

(screenshot: SRS CPU and memory usage)

@winlinvip
Member

winlinvip commented Dec 13, 2019

Okay, I understand what you mean. It seems like the performance issue is caused by too many sources. I will try to reproduce it first. Thank you.

@winlinvip winlinvip changed the title from "When HLS initialization fails in SRS, SRS crashes" to "When there is a large amount of streaming, Source leakage causes OOM" Jan 8, 2020
@winlinvip
Member

winlinvip commented Jan 15, 2020

After enabling HTTP FLV, I created new sources with curl and repeatedly ran "killall curl" to reproduce this issue. Memory grows at roughly 1MB every 3 seconds.

for ((i=0;;i++)); do curl http://localhost:8080/live/livestream-$i.flv -o /dev/null ; sleep 0.1; done

for ((;;)); do killall curl; sleep 0.2; done

I discovered that many coroutines are actually SrsAsyncCallWorker, which means they are coroutines for asynchronous callbacks.

(gdb) p _st_iterate_threads_flag =1
(gdb) b _st_iterate_threads
(gdb) c
(gdb) bt
#1  0x000000000059ff55 in st_cond_timedwait (cvar=0x4953b40, 
    timeout=18446744073709551615) at sync.c:197
#2  0x00000000005a003a in st_cond_wait (cvar=0x4953b40) at sync.c:219
#3  0x00000000004aa3f6 in srs_cond_wait (cond=0x4953b40)
    at src/service/srs_service_st.cpp:334
#4  0x000000000058432c in SrsAsyncCallWorker::cycle (this=0x4953ae0)
    at src/app/srs_app_async_call.cpp:104
#5  0x00000000004f78da in SrsSTCoroutine::cycle (this=0x495a3c0)
    at src/app/srs_app_st.cpp:198
#6  0x00000000004f794f in SrsSTCoroutine::pfn (arg=0x495a3c0)
    at src/app/srs_app_st.cpp:213

I found that HLS and DVR are using this.

SrsDvrPlan::SrsDvrPlan() {
    async = new SrsAsyncCallWorker();

SrsHlsMuxer::SrsHlsMuxer() {
    async = new SrsAsyncCallWorker();

Running for about 16 minutes, with 4k streams, it occupies 125MB of memory and CPU usage is 5%.

PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND     
21719 root      20   0 1091240 126256   4976 S   5.0  6.2   0:30.52 ./objs/srs+ 

real	16m19.259s
user	0m23.620s
sys	0m6.990s

If this part is optimized, the number of coroutines may decrease, but it will not have a significant impact on the overall performance.

@winlinvip
Member

winlinvip commented Jan 15, 2020

Modify HLS and DVR to start SrsAsyncCallWorker only on publish and stop it on unpublish. This prevents a Source from holding two persistent coroutines.

With 4k streams, this empty-Source test does slightly better than before: memory drops to 80MB while CPU stays at 5%.

PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND     
21994 root      20   0  418252  83880   4992 S   5.0  4.1   0:27.98 ./objs/srs+ 

The number of coroutines has already been reduced to 8.

(gdb) p _st_active_count 
$2 = 8

We need a profiling tool to find where the remaining overhead comes from.

@winlinvip
Member

winlinvip commented Jan 15, 2020

Enable SRS support for gperf and valgrind, refer to SRS Performance (CPU) and Memory Optimization Tool Usage:

./configure --with-gperf --with-gcp --with-gmc --with-gmp --with-valgrind

We found that CPU time is concentrated in the HTTP handler lookup. Since each stream source registers a new handler, the time to find a handler grows with the number of sources.

      34   1.5%  53.5%      320  14.1% SrsHttpServeMux::match

In GMP analysis, the main memory performance overhead is in:

   101.4  72.1%  72.1%    101.4  72.1% SrsFastVector::SrsFastVector

In Valgrind analysis, the memory leaks mainly occur in SrsFastVector.

==1426== 8,224,768 bytes in 1,004 blocks are still reachable in loss record 360 of 361
==1426==    at 0x4C2AB68: operator new[](unsigned long) (vg_replace_malloc.c:433)
==1426==    by 0x4CFA63: SrsFastVector::SrsFastVector() (srs_app_source.cpp:158)
==1426==    by 0x4CFE81: SrsMessageQueue::SrsMessageQueue(bool) (srs_app_source.cpp:241)
==1426==    by 0x563F74: SrsEdgeForwarder::SrsEdgeForwarder() (srs_app_edge.cpp:441)
==1426==    by 0x56575F: SrsPublishEdge::SrsPublishEdge() (srs_app_edge.cpp:721)
==1426==    by 0x4D6982: SrsSource::SrsSource() (srs_app_source.cpp:1769)
==1426==    by 0x4D63EE: SrsSource::fetch_or_create(SrsRequest*, ISrsSourceHandler*, SrsSource**) (srs_app_source.cpp:1656)
==1426==    by 0x4F2C8D: SrsHttpStreamServer::hijack(ISrsHttpMessage*, ISrsHttpHandler**) (srs_app_http_stream.cpp:1092)
==1426==    by 0x496626: SrsHttpServeMux::find_handler(ISrsHttpMessage*, ISrsHttpHandler**) (srs_http_stack.cpp:737)
==1426==    by 0x54CC13: SrsHttpServer::serve_http(ISrsHttpResponseWriter*, ISrsHttpMessage*) (srs_app_http_conn.cpp:278)
==1426==    by 0x497218: SrsHttpCorsMux::serve_http(ISrsHttpResponseWriter*, ISrsHttpMessage*) (srs_http_stack.cpp:859)
==1426==    by 0x54BE78: SrsHttpConn::process_request(ISrsHttpResponseWriter*, ISrsHttpMessage*) (srs_app_http_conn.cpp:161)
==1426== 
==1426== 8,224,768 bytes in 1,004 blocks are still reachable in loss record 361 of 361
==1426==    at 0x4C2AB68: operator new[](unsigned long) (vg_replace_malloc.c:433)
==1426==    by 0x4CFA63: SrsFastVector::SrsFastVector() (srs_app_source.cpp:158)
==1426==    by 0x4CFE81: SrsMessageQueue::SrsMessageQueue(bool) (srs_app_source.cpp:241)
==1426==    by 0x4ECCD1: SrsBufferCache::SrsBufferCache(SrsSource*, SrsRequest*) (srs_app_http_stream.cpp:63)
==1426==    by 0x4F19F3: SrsHttpStreamServer::http_mount(SrsSource*, SrsRequest*) (srs_app_http_stream.cpp:875)
==1426==    by 0x4F2D21: SrsHttpStreamServer::hijack(ISrsHttpMessage*, ISrsHttpHandler**) (srs_app_http_stream.cpp:1098)
==1426==    by 0x496626: SrsHttpServeMux::find_handler(ISrsHttpMessage*, ISrsHttpHandler**) (srs_http_stack.cpp:737)
==1426==    by 0x54CC13: SrsHttpServer::serve_http(ISrsHttpResponseWriter*, ISrsHttpMessage*) (srs_app_http_conn.cpp:278)
==1426==    by 0x497218: SrsHttpCorsMux::serve_http(ISrsHttpResponseWriter*, ISrsHttpMessage*) (srs_http_stack.cpp:859)
==1426==    by 0x54BE78: SrsHttpConn::process_request(ISrsHttpResponseWriter*, ISrsHttpMessage*) (srs_app_http_conn.cpp:161)
==1426==    by 0x54BB38: SrsHttpConn::do_cycle() (srs_app_http_conn.cpp:133)
==1426==    by 0x4C5617: SrsConnection::cycle() (srs_app_conn.cpp:171)

@winlinvip
Member

winlinvip commented Jan 15, 2020

With 4,000 streams, shrinking the fast vector's default capacity from 8,000 entries to 8 reduces memory to 45MB, with CPU at 4.7%.

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND     
 2661 root      20   0  330548  45548   5024 R   4.7  2.2   0:24.26 ./objs/srs+ 

@winlinvip
Member

winlinvip commented Jan 19, 2020

This will be fixed in #1579, which adds support for hot (smooth) upgrade.

Dup to #1509
Solution: #1579 (comment)

@l0g1n

l0g1n commented Dec 27, 2020

Each SrsSource object starts 2 coroutines. At the time there were 38,320 SrsSource objects remaining in the pool, equivalent to over 70,000 coroutines, and scheduling them is a significant CPU burden. After the SrsSource objects are released, CPU usage drops noticeably.
(screenshot: SRS CPU and memory usage)

May I ask which command or tool you used to view the number of coroutines in the process and produce this graph?

@winlinvip
Member

winlinvip commented Sep 13, 2021

Dup to #413

@winlinvip winlinvip reopened this Sep 13, 2021
@winlinvip winlinvip added Bug It might be a bug. Enhancement Improvement or enhancement. labels Sep 13, 2021
@winlinvip winlinvip modified the milestones: 3.0, 5.0 Sep 13, 2021
@winlinvip winlinvip changed the title from "When there is a large amount of streaming, Source leakage causes OOM" to "Source Cleanup: When there is a large amount of streaming, Source leakage causes OOM" Sep 13, 2021
@winlinvip winlinvip added Duplicated Duplicated bug. and removed Bug It might be a bug. Enhancement Improvement or enhancement. labels Sep 13, 2021
@winlinvip winlinvip modified the milestones: 5.0, 4.0 Sep 13, 2021
@winlinvip winlinvip changed the title Source清理:大量推流时,Source泄漏导致OOM Source Cleanup: When there is a large amount of streaming, Source leakage causes OOM (Out of Memory). Jul 26, 2023
@winlinvip winlinvip added the TransByAI Translated by AI/GPT. label Jul 26, 2023