Too many coroutines caused memory allocation failure, terminate called after throwing an instance of 'std::bad_alloc'. #509

Closed
han4235 opened this issue Oct 22, 2015 · 16 comments

@han4235

han4235 commented Oct 22, 2015

SRS exits on its own while a stream is being pulled:
srs: src/app/srs_app_edge.cpp:766: virtual int SrsPlayEdge::on_ingest_play(): Assertion `state == SrsEdgeStatePlay' failed.

@jarod

jarod commented Oct 22, 2015

I also ran into this, on version 2.0a2.

@winlinvip
Member

winlinvip commented Oct 22, 2015

Please provide the configuration, logs, version, and steps to reproduce. Thank you~

@winlinvip winlinvip added the Bug It might be a bug. label Oct 22, 2015
@winlinvip winlinvip added this to the srs 2.0 release milestone Oct 22, 2015
@jarod

jarod commented Oct 23, 2015

My configuration is very simple: the default origin.conf and edge.conf, with "origin" in edge.conf changed to my own server's domain name. There is one origin and two edges, with about 5 streams being pushed and 50 being pulled. Both the origin and the edges have crashed. The edge log is the same as the one quoted above, and the origin log is as follows:

terminate called after throwing an instance of 'std::bad_alloc'
  what():  std::bad_alloc

If needed, I can provide the core dump files.

@winlinvip
Member

winlinvip commented Oct 23, 2015

Please send me the core and its corresponding SRS. You can also put it on a file sharing platform. Is it CentOS?

@jarod

jarod commented Oct 23, 2015

centos 7 64bit, related files can be found at http://pan.baidu.com/s/1pJGLnyN

@winlinvip
Member

winlinvip commented Oct 23, 2015

Hmm, I'll find time to take a look.

@winlinvip winlinvip changed the title from "SRS exits automatically" to "SRS crash, edge state abnormal" Dec 22, 2015
@winlinvip
Member

winlinvip commented Dec 22, 2015

[winlin@centos7 srs]$ ./objs/srs -v
2.0.195
[winlin@centos7 srs]$ ls -lh core.*
-rw-------. 1 winlin winlin 1.1G Oct 22 21:10 core.13964
-rw-------. 1 winlin winlin 2.1G Oct 22 22:16 core.31521


(gdb) f 2
#2  0x00000000004f884b in SrsEdgeIngester::cycle (this=0x455bb50) at src/app/srs_app_edge.cpp:138
warning: Source file is more recent than executable.
138     if ((ret = client->handshake()) != ERROR_SUCCESS) {
(gdb) p this[0]
$3 = {<ISrsReusableThread2Handler> = {_vptr.ISrsReusableThread2Handler = 0x898e50 <vtable for SrsEdgeIngester+16>}, stream_id = 1, _source = 
    0x2327650, _edge = 0x2193580, _req = 0x23982a0, pthread = 0x34d09c0, stfd = 0x16ee220, io = 0x3b63e20, kbps = 0x3b5e190, client = 0x34cd3a0, 
  origin_index = 0}

This shows that the edge object is not corrupted.

(gdb) f 0
#0  0x00000000004500d0 in SrsComplexHandshake::handshake_with_server (this=0x7f5ff5a5cc00, hs_bytes=0x4341d00, io=0x3b63e20)
    at src/protocol/srs_rtmp_handshake.cpp:1341
1341        if ((ret = hs_bytes->read_s0s1s2(io)) != ERROR_SUCCESS) {


(gdb) p hs_bytes[0]
$7 = {_vptr.SrsHandshakeBytes = 0x191f860, c0c1 = 0x7f603a4727d8 <main_arena+120> "\310'G:`\177", s0s1s2 = 0x0, c2 = 0x0}

It is evident that the object has already been released, so using it again will definitely cause problems.

@winlinvip
Member

winlinvip commented Dec 22, 2015

(gdb) bt
#0  0x00000000004500d0 in SrsComplexHandshake::handshake_with_server (this=0x7f5ff5a5cc00, hs_bytes=0x4341d00, io=0x3b63e20)
    at src/protocol/srs_rtmp_handshake.cpp:1341
#1  0x0000000000433889 in SrsRtmpClient::handshake (this=0x34cd3a0) at src/protocol/srs_rtmp_stack.cpp:1978
#2  0x00000000004f884b in SrsEdgeIngester::cycle (this=0x455bb50) at src/app/srs_app_edge.cpp:138
#3  0x00000000004a355d in SrsReusableThread2::cycle (this=0x34d09c0) at src/app/srs_app_thread.cpp:533
#4  0x00000000004a2557 in internal::SrsThread::thread_cycle (this=0x1b5b710) at src/app/srs_app_thread.cpp:203
#5  0x00000000004a2769 in internal::SrsThread::thread_fun (arg=0x1b5b710) at src/app/srs_app_thread.cpp:244
#6  0x000000000051643e in _st_thread_main () at sched.c:327
#7  0x0000000000516bae in st_thread_create (start=0x12f5105, arg=0xfbad8001, joinable=32608, stk_size=974285335) at sched.c:591
#8  0x0000000000000000 in ?? ()
(gdb) 

This is the stack at the time of the crash.

@winlinvip
Member

winlinvip commented Dec 22, 2015

(gdb) p hs_bytes[0]
$4 = {_vptr.SrsHandshakeBytes = 0x191f860, c0c1 = 0x7f603a4727d8 <main_arena+120> "\310'G:`\177", s0s1s2 = 0x0, c2 = 0x0}

Explanation: this state says C0C1 has been sent but S0S1S2 has not been received yet. Given the code, that should be an impossible execution path.





    // s0s1s2
    if ((ret = hs_bytes->read_s0s1s2(io)) != ERROR_SUCCESS) {
        return ret;
    }

    // plain text required.
    if (hs_bytes->s0s1s2[0] != 0x03) {
        ret = ERROR_RTMP_HANDSHAKE;
        srs_warn("handshake failed, plain text required. ret=%d", ret);
        return ret;
    }

int SrsHandshakeBytes::read_s0s1s2(ISrsProtocolReaderWriter* io)
{
    int ret = ERROR_SUCCESS;

    if (s0s1s2) {
        return ret;
    }

    ssize_t nsize;

    s0s1s2 = new char[3073];
    if ((ret = io->read_fully(s0s1s2, 3073, &nsize)) != ERROR_SUCCESS) {
        srs_warn("read s0s1s2 failed. ret=%d", ret);
        return ret;
    }
    srs_verbose("read s0s1s2 success.");

    return ret;
}

Explanation: When SrsHandshakeBytes::read_s0s1s2 returns, s0s1s2 is definitely non-NULL.
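The reasoning behind that statement can be sketched in isolation. The snippet below is only an illustration, not SRS code: `HandshakeBytesSketch` and `buffer_is_set_after_read` are hypothetical names, and the network read is left out. It relies on the fact that a plain `new[]` never yields a null pointer; it either succeeds or throws `std::bad_alloc`.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for SrsHandshakeBytes; only the s0s1s2 buffer is modelled.
struct HandshakeBytesSketch {
    char* s0s1s2 = nullptr;

    ~HandshakeBytesSketch() { delete[] s0s1s2; }

    // Returns 0 on success, mirroring ERROR_SUCCESS in the excerpt above.
    int read_s0s1s2() {
        if (s0s1s2) {
            return 0;  // already read, nothing to do
        }
        // Plain new[] cannot return nullptr: it succeeds or throws std::bad_alloc.
        s0s1s2 = new char[3073]();
        return 0;
    }
};

// True when the buffer is non-null after a normal return from read_s0s1s2().
bool buffer_is_set_after_read() {
    HandshakeBytesSketch hs;
    const int ret = hs.read_s0s1s2();
    return ret == 0 && hs.s0s1s2 != nullptr;
}
```

So an object whose `s0s1s2` is still `0x0` after the call returned cannot be the object the call operated on.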

@winlinvip
Member

winlinvip commented Dec 22, 2015

Observing hs_bytes again:

(gdb) p hs_bytes[0]
$5 = {_vptr.SrsHandshakeBytes = 0x191f860, c0c1 = 0x7f603a4727d8 <main_arena+120> "\310'G:`\177", s0s1s2 = 0x0, c2 = 0x0}
(gdb) x /12xb hs_bytes->c0c1
0x7f603a4727d8 <main_arena+120>:    0xc8    0x27    0x47    0x3a    0x60    0x7f    0x00    0x00
0x7f603a4727e0 <main_arena+128>:    0xc8    0x27    0x47    0x3a

Here, c0 should be 0x03, but it is actually 0xc8.
The c0c1 pointer is 0x7f603a4727d8, which gdb resolves to main_arena+120, so it is clearly not the heap buffer it should be.
From these two observations, hs_bytes is a wild pointer.

@winlinvip
Member

winlinvip commented Dec 22, 2015

Looking at the stack:

(gdb) f 1
#1  0x0000000000433889 in SrsRtmpClient::handshake (this=0x34cd3a0) at src/protocol/srs_rtmp_stack.cpp:1978
1978        if ((ret = complex_hs.handshake_with_server(hs_bytes, io)) != ERROR_SUCCESS) {
(gdb) p hs_bytes[0]
$9 = {_vptr.SrsHandshakeBytes = 0x8917f0 <vtable for SrsHandshakeBytes+16>, c0c1 = 0x3ce7ab0 "\003V(\340D\200", 
  s0s1s2 = 0x4080ec0 "\003V(\340B\001", c2 = 0x0}

The hs_bytes observed in frame 1 differs from the one seen in frame 0, which points to a problem inside complex_hs.handshake_with_server. In frame 1, c0c1 is a heap pointer and its data starts with 0x03, with no corruption.

@winlinvip
Member

winlinvip commented Dec 22, 2015

(gdb) p ((SrsStSocket*)io)[0]
$15 = {<ISrsProtocolReaderWriter> = {<ISrsProtocolReader> = {<ISrsBufferReader> = {
        _vptr.ISrsBufferReader = 0x895e20 <vtable for SrsStSocket+96>}, <ISrsProtocolStatistic> = {
        _vptr.ISrsProtocolStatistic = 0x895eb0 <vtable for SrsStSocket+240>}, <No data fields>}, <ISrsProtocolWriter> = {<ISrsBufferWriter> = {
        _vptr.ISrsBufferWriter = 0x895f18 <vtable for SrsStSocket+344>}, <No data fields>}, <No data fields>}, recv_timeout = 30000000, 
  send_timeout = 30000000, recv_bytes = 3073, send_bytes = 1537, stfd = 0x16ee220}

From the data of io, it can be seen that 3073 bytes (s0s1s2) were received and 1537 bytes (c0c1) were sent. There may have been a problem while processing s0s1s2.
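The two counters line up with the RTMP handshake packet sizes. As a small worked check (the constant and function names here are made up for illustration): C0/S0 is one version byte, and C1, C2, S1 and S2 are 1536 bytes each.

```cpp
#include <cassert>
#include <cstddef>

// RTMP handshake packet sizes in bytes.
constexpr std::size_t kVersionSize = 1;     // C0 / S0: one version byte (0x03 for plain RTMP)
constexpr std::size_t kChunkSize   = 1536;  // C1, C2, S1, S2

constexpr std::size_t kC0C1Size   = kVersionSize + kChunkSize;      // first client packet
constexpr std::size_t kS0S1S2Size = kVersionSize + 2 * kChunkSize;  // server reply

static_assert(kC0C1Size == 1537, "matches send_bytes in the gdb dump");
static_assert(kS0S1S2Size == 3073, "matches recv_bytes in the gdb dump");

// True when the socket counters show C0C1 sent and S0S1S2 fully received,
// with C2 not yet sent.
constexpr bool handshake_waiting_for_c2(std::size_t send_bytes, std::size_t recv_bytes) {
    return send_bytes == kC0C1Size && recv_bytes == kS0S1S2Size;
}
```

With `send_bytes = 1537` and `recv_bytes = 3073`, the connection had received the whole server reply and had not yet sent C2.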

@winlinvip
Member

winlinvip commented Dec 22, 2015

This may be a problem caused by allocating objects on the stack. Change it to allocate on the heap.
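One possible reading of this suggestion, sketched under assumptions: a coroutine runs on a small fixed-size stack, so a large local object eats into it, while a heap-allocated object only leaves a pointer on the stack. `HandshakeScratch` is a hypothetical type, not an SRS class, and its buffer sizes are borrowed from the handshake packets above purely for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>

// Hypothetical scratch object; not an SRS type.
struct HandshakeScratch {
    char s0s1s2[3073];
    char c2[1536];
};

// Before: a local `HandshakeScratch scratch;` would occupy sizeof(HandshakeScratch)
// bytes of the calling coroutine's stack.
// After: the object lives on the heap and only the owning pointer sits on the stack.
std::size_t stack_bytes_used_by_heap_version() {
    std::unique_ptr<HandshakeScratch> scratch(new HandshakeScratch());
    scratch->s0s1s2[0] = 0x03;
    return sizeof(scratch);  // size of the owning pointer, not of the buffers
}
```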

@winlinvip winlinvip changed the title from "SRS crash, edge state abnormal" to "SRS crash, edge state abnormal, terminate called after throwing an instance of 'std::bad_alloc'" Oct 26, 2020
@winlinvip
Member

winlinvip commented Oct 26, 2020

https://stackoverflow.com/a/2504601
bad_alloc essentially means that memory could not be allocated. Judging from the size of the core, this was a long-running service.

If you are running on a typical embedded processor running Linux without virtual memory it is quite likely 
your process will be terminated by the operating system before new fails if you allocate too much memory.

If you are running your program on a machine with less physical memory than the maximum of virtual 
memory (2 GB on standard Windows) you will find that once you have allocated an amount of memory 
approximately equal to the available physical memory, further allocations will succeed but will cause 
paging to disk. This will bog your program down and you might not actually be able to get to the point 
of exhausting virtual memory. So you might not get an exception thrown.

If you have more physical memory than the virtual memory, and you simply keep allocating memory, 
you will get an exception when you have exhausted virtual memory to the point where you can not 
allocate the block size you are requesting.

If you have a long-running program that allocates and frees in many different block sizes, including 
small blocks, with a wide variety of lifetimes, the virtual memory may become fragmented to the point 
where new will be unable to find a large enough block to satisfy a request. Then new will throw an 
exception. If you happen to have a memory leak that leaks the occasional small block in a random 
location that will eventually fragment memory to the point where an arbitrarily small block allocation 
will fail, and an exception will be thrown.

If you have a program error that accidentally passes a huge array size to new[], new will fail and throw 
an exception. This can happen for example if the array size is actually some sort of random byte pattern, 
perhaps derived from uninitialized memory or a corrupted communication stream.

@winlinvip
Member

winlinvip commented Oct 26, 2020

This paper argues that bad_alloc does not always mean Out of Memory (OOM): http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2019/p1404r1.html

I wrote an example:

/*
ulimit -S -v 204800
g++ -g -O0 t.cpp -o t && ./t
*/
#include <stdio.h>
int main(){
    char* p1 = new char[193000 * 1024]; // huge allocation
    char* p0 = new char[100 * 1024]; // small allocation
    printf("OK\n");
}

Running it crashes:

[root@SRS tmp]# ulimit -S -v 204800
[root@SRS tmp]# g++ -g -O0 t.cpp -o t && ./t
terminate called after throwing an instance of 'St9bad_alloc'
  what():  std::bad_alloc
Aborted (core dumped)

[root@SRS tmp]# ll core.21082 
-rw------- 1 root root 198045696 Oct 26 21:04 core.21082

The stack shows that it is the small allocation that fails, not the large one.

[root@SRS tmp]# gdb t -c core.21082 
GNU gdb (GDB) Red Hat Enterprise Linux (7.2-92.el6)
Copyright (C) 2010 Free Software Foundation, Inc.

warning: no loadable sections found in added symbol-file system-supplied DSO at 0x7ffeae793000
Core was generated by `./t'.
Program terminated with signal 6, Aborted.
#0  0x00007fd17ff0e4f5 in raise () from /lib64/libc.so.6
Missing separate debuginfos, use: debuginfo-install glibc-2.12-1.212.el6_10.3.x86_64 libgcc-4.4.7-23.el6.x86_64 libstdc++-4.4.7-23.el6.x86_64
(gdb) bt
#0  0x00007fd17ff0e4f5 in raise () from /lib64/libc.so.6
#1  0x00007fd17ff0fcd5 in abort () from /lib64/libc.so.6
#2  0x00007fd1807c8a8d in __gnu_cxx::__verbose_terminate_handler() () from /usr/lib64/libstdc++.so.6
#3  0x00007fd1807c6be6 in ?? () from /usr/lib64/libstdc++.so.6
#4  0x00007fd1807c6c13 in std::terminate() () from /usr/lib64/libstdc++.so.6
#5  0x00007fd1807c6d32 in __cxa_throw () from /usr/lib64/libstdc++.so.6
#6  0x00007fd1807c712d in operator new(unsigned long) () from /usr/lib64/libstdc++.so.6
#7  0x00007fd1807c71e9 in operator new[](unsigned long) () from /usr/lib64/libstdc++.so.6
#8  0x0000000000400624 in main () at t.cpp:8
(gdb) f 8
#8  0x0000000000400624 in main () at t.cpp:8
8	    char* p0 = new char[100 * 1024]; // small allocation
(gdb) 
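Since the allocation that throws is not necessarily the one that used up the memory, it can help to log the size of the failing request. A minimal sketch, assuming a helper named `can_allocate` (hypothetical): it calls `operator new[]` directly, which is the function in frame #7 of the backtrace above, and reports the request size when `std::bad_alloc` is thrown.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <new>

// Tries to allocate n bytes and frees them again.
// Returns true on success; on failure, logs the size of the request that failed.
bool can_allocate(std::size_t n) {
    try {
        void* p = ::operator new[](n);  // the allocation function behind `new char[n]`
        ::operator delete[](p);
        return true;
    } catch (const std::bad_alloc&) {
        std::fprintf(stderr, "bad_alloc while requesting %zu bytes\n", n);
        return false;
    }
}
```

Under the `ulimit -S -v 204800` setting used above, a call such as `can_allocate(100 * 1024)` made after the large block is held would be the one expected to report the failure.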

@winlinvip
Member

winlinvip commented Oct 31, 2020

I added a gdb script that counts the coroutines in the core. Download srs.py first:

(gdb) source gdb/srs.py 
(gdb) nn_coroutines 
this coroutine(&_st_this_thread->tlink) is: 0x7f43ba761e78
next is 0x7f43b92d9e78, total 500
next is 0x7f43b5c37e78, total 1000
next is 0x7f43bfd71e78, total 31500
next is 0x7f43bdad9e78, total 32000
next is 0x7f43bd8f3e78, total 32500
total coroutines: 32717
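The counting idea can be sketched in C++, assuming the script simply follows the `next` links of ST's circular thread list until it returns to where it started. The real srs.py is not reproduced here; `CList` only mirrors the next/prev shape of ST's `_st_clist_t`, and the helper names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>

// Circular doubly linked list node, shaped like ST's _st_clist_t (next/prev).
struct CList {
    CList* next;
    CList* prev;
};

// Makes head an empty ring that points at itself.
inline void clist_init(CList* head) {
    head->next = head;
    head->prev = head;
}

// Links node into the ring immediately before pos.
inline void clist_insert_before(CList* node, CList* pos) {
    node->next = pos;
    node->prev = pos->prev;
    pos->prev->next = node;
    pos->prev = node;
}

// Counts the nodes in the ring, excluding the head, by walking next
// until the walk arrives back at head.
inline std::size_t clist_count(const CList* head) {
    std::size_t total = 0;
    for (const CList* p = head->next; p != head; p = p->next) {
        ++total;
    }
    return total;
}
```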

By default, ST uses mmap to allocate the stack space for each coroutine, so creating a coroutine fails once the number of memory mappings exceeds the kernel limit. You can check this limit as follows:

[root@05ff04a933cd st]# sysctl vm.max_map_count
vm.max_map_count = 65530

Note: This limit does not apply in Docker, and you can open up to 650162 coroutines with a memory usage of around 40GB. Generally, this limit is enabled on production machines.
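The same value can be read from inside a process, for example to log it at startup. A minimal sketch, assuming Linux, where the sysctl is exposed as `/proc/sys/vm/max_map_count`; the function name `read_max_map_count` is made up for this example.

```cpp
#include <cassert>
#include <fstream>

// Returns the value of vm.max_map_count, or -1 when the file cannot be read or parsed.
long read_max_map_count(const char* path = "/proc/sys/vm/max_map_count") {
    std::ifstream in(path);
    long value = -1;
    if (!(in >> value)) {
        return -1;  // file missing (non-Linux system) or not a number
    }
    return value;
}
```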

Then compile this code huge-threads.cpp and execute it.

g++ huge-threads.cpp ../../objs/st/libst.a -g -O0 -o huge-threads && 
./huge-threads 60000

Usually, creation fails at around 30,000 coroutines:

[root@05ff04a933cd st]# ./huge-threads 60000
pid=77682, create 60000 coroutines
create thread fail, i=32749
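The effect can be approximated without ST. The sketch below is not the huge-threads.cpp from this thread and not ST's stack allocator; it is an assumption-laden illustration that maps one anonymous region per "stack" with a guard page, and counts how many succeed. `count_mappable_stacks` is a hypothetical name.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>
#include <sys/mman.h>
#include <unistd.h>

// Maps up to `want` anonymous regions of `stack_bytes` plus one guard page each,
// and returns how many were created. All regions are unmapped before returning.
std::size_t count_mappable_stacks(std::size_t want, std::size_t stack_bytes) {
    const std::size_t page = static_cast<std::size_t>(sysconf(_SC_PAGESIZE));
    const std::size_t len = stack_bytes + page;
    std::vector<void*> stacks;
    stacks.reserve(want);
    for (std::size_t i = 0; i < want; ++i) {
        void* p = mmap(nullptr, len, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (p == MAP_FAILED) {
            break;  // ENOMEM is reported when the mapping limit would be exceeded
        }
        // The PROT_NONE guard page keeps adjacent regions from merging into one mapping.
        if (mprotect(p, page, PROT_NONE) != 0) {
            munmap(p, len);
            break;
        }
        stacks.push_back(p);
    }
    const std::size_t created = stacks.size();
    for (void* p : stacks) {
        munmap(p, len);
    }
    return created;
}
```

Calling it with a `want` well above `vm.max_map_count` on a machine where the limit applies should return fewer regions than requested; the exact number depends on how many mappings the process already holds.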

There are two ways to address this:

  1. Investigate why there are so many coroutines; they pile up when the Source is not cleaned up.
  2. Enable MALLOC_STACK at compile time.

@winlinvip winlinvip self-assigned this Sep 25, 2021
@winlinvip winlinvip changed the title from "SRS crash, edge state abnormal, terminate called after throwing an instance of 'std::bad_alloc'" to "Too many coroutines caused memory allocation failure, terminate called after throwing an instance of 'std::bad_alloc'." Jul 27, 2023
@winlinvip winlinvip added the TransByAI Translated by AI/GPT. label Jul 27, 2023