Description
Usage
SRS supports two signals:
- SIGTERM: Fast exit. SRS actively closes its connections, cleans up quickly, and then exits. K8s sends this signal when it stops a Pod (after any preStop hook has run), and then sends SIGKILL to forcefully kill the Pod after a timeout. We can configure force_grace_quit to treat SIGTERM as a graceful quit as well.
- SIGQUIT: Graceful exit. SRS closes its listeners and waits for all clients to disconnect before exiting. If there are still connections, SRS will not exit on its own; the longest K8s will wait is terminationGracePeriodSeconds, after which it forces the exit. If there are no connections, SRS waits for grace_final_wait before exiting.
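The mapping from signal to exit mode described above can be sketched as a small decision function. This is only an illustrative model in Python, not SRS's implementation (SRS itself is written in C++). The function quit_mode and the two mode names are hypothetical, and force_grace_quit is passed as a plain boolean standing in for the config option.

```python
import signal

FAST_QUIT = "fast"          # close connections, clean up, exit right away
GRACEFUL_QUIT = "graceful"  # stop listening, wait for clients to leave


def quit_mode(signum, force_grace_quit=False):
    """Return the exit mode for a received signal, or None if unrelated.

    force_grace_quit mirrors the config option of the same name: when it is
    enabled, SIGTERM is handled like SIGQUIT.
    """
    if signum == signal.SIGQUIT:
        return GRACEFUL_QUIT
    if signum == signal.SIGTERM:
        return GRACEFUL_QUIT if force_grace_quit else FAST_QUIT
    return None
```

With the default setting, quit_mode(signal.SIGTERM) gives the fast mode; with force_grace_quit=True, both signals lead to the graceful mode.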
Note: SRS does not implement a maximum waiting time. It waits indefinitely for clients to disconnect and never forces an exit itself. Instead, it relies on the terminationGracePeriodSeconds setting that K8s uses to manage Pods: K8s sends SIGKILL to forcefully shut down SRS after that timeout.
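To see how the two timeouts interact, here is a rough timeline model. It assumes, as a simplification, that both clocks start when the signal is delivered, that SRS exits grace_final_wait after its last client leaves, and that K8s sends SIGKILL once terminationGracePeriodSeconds has elapsed. The function name, its parameters, and the use of seconds for every value are assumptions made for illustration only.

```python
def graceful_exit_outcome(last_client_leaves_at, grace_final_wait,
                          termination_grace_period):
    """Return (outcome, seconds_after_signal) for a graceful quit.

    last_client_leaves_at: seconds after the signal at which the last client
        disconnects; 0 if there were no clients, None if clients never leave.
    grace_final_wait: extra wait after the last client has gone (seconds here).
    termination_grace_period: the K8s limit before SIGKILL (seconds).
    """
    if last_client_leaves_at is not None:
        exit_at = last_client_leaves_at + grace_final_wait
        if exit_at <= termination_grace_period:
            return ("exited", exit_at)
    # SRS has no maximum wait of its own, so K8s ends the wait with SIGKILL.
    return ("killed", termination_grace_period)
```

For example, with a 30 second grace period and a 3 second final wait, a client that leaves after 10 seconds lets SRS exit cleanly at 13 seconds, while a client that never leaves results in a kill at 30 seconds.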
Other
To keep the handling simple, SRS does not free the in-memory Source object when a stream stops, because the stream may be published again. Cleaning up would require complex and careful handling of Source objects, which works against keeping things simple.
Not cleaning up Source objects makes memory grow continuously. This is rarely noticeable in scenarios with few published streams and many players. However, in scenarios with many published streams, such as monitoring and conferencing, cleaning up the Sources becomes necessary. References:
- The Source cleanup PR submitted by Nobody2 (fix: clean up source and add publisher status #1568) discusses various scenarios that require cleanup. Nobody2 did a great job with the PR, but the issue itself is too complex.
- Reports from production (Source Cleanup: When there is a large amount of streaming, Source leakage causes OOM (Out of Memory). #1509; #1271; After stopping the stress test, the CPU and memory have remained consistently high. #1507) indicate memory leaks and OOM caused by not cleaning up Source objects.
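The memory growth described above can be illustrated with a toy model. The Source and SourcePool classes below are hypothetical and are not SRS code. They only assume that one Source object is kept per stream URL and that unpublishing leaves the object in place so that a re-published stream can reuse it.

```python
class Source:
    def __init__(self, url):
        self.url = url
        self.publishing = False


class SourcePool:
    """Toy pool that keeps one Source per stream URL and never frees it."""

    def __init__(self):
        self.sources = {}

    def publish(self, url):
        # Reuse the existing Source if the stream was published before.
        source = self.sources.setdefault(url, Source(url))
        source.publishing = True
        return source

    def unpublish(self, url):
        # The Source is kept, because the stream may be published again.
        self.sources[url].publishing = False


pool = SourcePool()
for i in range(1000):
    url = f"live/camera{i}"
    pool.publish(url)
    pool.unpublish(url)

idle = sum(1 for s in pool.sources.values() if not s.publishing)
print(len(pool.sources), idle)  # prints: 1000 1000
```

Every distinct stream URL leaves one idle object behind, so a server that sees many short-lived streams keeps growing, while a server with a few long-lived streams and many players barely notices.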
Currently, partial optimizations have been implemented to alleviate this issue.
At the same time, we are looking for the most stable and simplest solution. One idea is to make SRS support smooth exit and smooth upgrade, roughly as follows:
- Disable exclusive access to the PID file, allowing a new SRS to be started.
- Use REUSEPORT to open a new SRS, allowing both the old and new SRS to provide services using the same PID file.
- The old SRS stops accepting new connections and closes its API port. It exits once its existing clients have been served, or after a set period of time, such as 12 hours.
This way, when the old SRS exits, it easily and safely releases the Sources it created, along with any other potential memory problems. Users can smoothly upgrade and exit SRS during off-peak periods according to their business needs, minimizing the impact on users.
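The port-sharing part of this idea can be sketched with the SO_REUSEPORT socket option. The example below is an assumption-laden illustration, not a proposal for how SRS would implement it: it runs both listeners in a single Python process for brevity, whereas the idea above involves two separate SRS processes, and it relies on SO_REUSEPORT being available (for example on Linux 3.9 or later). The helper listen_reuseport is hypothetical.

```python
import socket


def listen_reuseport(port, host="127.0.0.1"):
    """Open a TCP listener that can share its port with other listeners
    that also set SO_REUSEPORT before binding."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEPORT, 1)
    sock.bind((host, port))
    sock.listen()
    return sock


# "Old" server: let the OS pick a free port for this demo.
old_listener = listen_reuseport(0)
port = old_listener.getsockname()[1]

# "New" server binds the very same port while the old one is still listening.
new_listener = listen_reuseport(port)

# The old server stops accepting new connections by closing its listener.
# Connections it had already accepted would stay open until it exits.
old_listener.close()
```

After the old listener is closed, every new connection to the port is accepted by the new listener, which matches the step where the old SRS no longer accepts new connections.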
The only issue is that while both the new and old SRS are providing services, the API is served by the new SRS only. The system statistics are therefore inaccurate, because users still served by the old SRS may not be counted.
Remark: In an origin cluster, a stream may still be on the old SRS, so it may not be discoverable. In this case, the stream must be forcibly disconnected, and the client needs to support retries for this to work smoothly. One solution is to place an Edge in front of the origin, so that the Edge handles the retries.