Add option to issue full reconnect on a publication error. #1214

boks1971 · 2022-12-06T08:55:23Z

Leaving the publication error timeout at 30 seconds as there are some publications taking long. Also, there are cases where the peer connection fails after 30 seconds. The peer connection failure happens after publication error is detected. But, 30 seconds is a good amount of time for publication to establish.

* Split out mediatransportutil (livekit#1071) * Cleanup pass through logging (livekit#1073) * added filtering for noisy pion logs * demoted some logs to debug * using consistent trackID / participant / publisher / subscriber terminology * removed ice candidate log lines, deferring to combined log * Allow TCP fallback on multiple failures. (livekit#1077) * Update protocol (livekit#1075) * update protocol * auto egress ensure unique filename * fix tests * update pion/ice webrtc for udpmux fix (livekit#1081) * Cache RTPStats and seed on re-use (livekit#1080) * Cache RTPStats and seed on re-use When a cached down track is re-used, RTPStats was not cached. This caused sender reports getting out-of-sync with the remote side. Cache RTPStats and seed it on re-use. * staticcheck * Fix simulcast codec block track close (livekit#1082) * Update module github.com/urfave/cli/v2 to v2.19.2 (livekit#1076) Generated by renovateBot Co-authored-by: renovate[bot] <29139614+renovate[bot]@users.noreply.github.com> * Update github.com/livekit/mediatransportutil digest to 7440725 (livekit#1072) Generated by renovateBot Co-authored-by: renovate[bot] <29139614+renovate[bot]@users.noreply.github.com> * v1.2.4 (livekit#1084) * Add default handler to 404 (livekit#1088) * Support track level stereo and red setting (livekit#1086) * Support track level stereo and red setting * fix test client * Prevent multiple transport fallback in same session. (livekit#1090) * Update module github.com/urfave/cli/v2 to v2.20.2 (livekit#1089) Generated by renovateBot Co-authored-by: renovate[bot] <29139614+renovate[bot]@users.noreply.github.com> * Log Service API requests (livekit#1091) * Refine nat 1to1 mapping setting (livekit#1094) Now only set mapping when user_external_ip enabled or node_ip is explicitly set. If multiple local address resolved to same external ip, only the first one will be mapped to external, avoid candidate conflict between different clients. * Add server-sdk-kotlin to docs (livekit#1095) * remove record check on CreateRoom (livekit#1096) * update protocol (livekit#1097) * Misc clean up (livekit#1099) * Use pointers for packetMeta in sequencer to save memory. (livekit#1100) Padding packets do not need the full structure. They just need a placeholder in the sequencer array. So, use pointers (with padding slots filled by nil) to save some memory. Also, don't need padding for audio (yet). As padding packets are used only for probing and we do not probe using audio tracks (yet). * Warn on out-of-order RTCP RR. (livekit#1101) Have been seeing a few instances of "too many packets expected in delta" when trying to generate RTCP SR on down track. Actual sequence numbers indicate that start is after the end. As down track RTPStats are driven by receiver report, wondering if we are getting RTCP_RR out-of-order somehow causing this to happen. Cannot find any other reason for this. So, accepting RTCP_RR based update only if the sequence number is higher than existing and also logging a warning with sequence numbers if they look out-of-order. * Accept same highest sequence number as a puased track sends that (livekit#1102) * Log level should be `warn` instead of `warning` (livekit#1104) * Flag to bypass strict config enforcement (livekit#1105) Useful with the Helm charts, where other keys can be declared together. * v1.2.5 (livekit#1106) scrapped 1.2.4 release and will release 1.2.5 instead * Some misc changes (livekit#1107) - ticker.Stop always - clean up timer func (if they are added) on participant close - sequencer test enhancement to add a real packet after a pdding packet * Allocate packetMeta up front to reduce number of allocations. (livekit#1108) * Do not log duplicate packet error. (livekit#1116) Also, use bucket.Err* as it has been moved to mediatransportutil * Notify max layer taking into account overshoot. (livekit#1117) * Notify max layer taking into account overshoot. An attempt to handle case of FF stopping layer 0, but not layer 1. When max subscribed layer is layer 0, server allows overshoot to layer 1 to ensure continued streaming when the channel is not congested. But, dynacast could have reported maximum subscribed layer as layer 0. This is a very simple attempt to address that by taking overshoot into account. Needs testing if this works well or not. NOTE: When subsriber/down track is unmuted, it will report the max subscribed layer as the max required layer. In those cases, if the client does not start layer 0, there will still be an issue. IOW, server is not keeping track of client behaviour that the client has stopped layer 0 and publishing only layer 1. Server is just accommodating overshoot with this change. * Fix a couple of tests and change reflect.DeepEqual -> require.Equal as much as possible * Populate memory load in node stats. (livekit#1121) * Add fps calculator for VP8 and DependencyDescriptor (livekit#1110) * Add fps calculator for VP8 and DependencyDescriptor * clean code * unit test * clean code * solve comment * Consolidate getMemoryStats (livekit#1122) * Consolidate getMemoryStats * Avoid divide-by-0 * Remove named returns from room service. (livekit#1124) * Update module github.com/urfave/cli/v2 to v2.20.3 (livekit#1115) Generated by renovateBot Co-authored-by: renovate[bot] <29139614+renovate[bot]@users.noreply.github.com> * Generated CLI Flags (livekit#1112) * web egress (livekit#1126) * baseFlags do not need to be set (livekit#1127) * Use ingress specific grants (livekit#1125) * Remove DD extension when AV1 not preferred (livekit#1129) * Seed snapshots (livekit#1128) * Seed snapshots - For one cycle after seeding, delta snap shot can get a huge gap because of snapshot iitializing from start if not present. Not a huge deal sa it should not affect functionality, but saving/restoring (at least with down track) snap shot is a big deal. So just do it. - Have been seeing a bunch of cases of delta stats getting a lot of packets due to out-of-order (what seems like) receiver report. So, save the receiver report and log it when out-of-order is detected to understand if they are closely spaced or something else could be happening. * Remove comment that does not apply anymore * log current time and RR * fix activeRecording (livekit#1132) * fix activeRecording * also check if p is nil * experiment fallback to tcp when udp unstable (livekit#1119) * fallback to tcp when udp unstable * Don't collect external address for ip filterd out (livekit#1135) * don't assume uname in /usr/bin/ (livekit#1138) uname exists in /bin on Ubuntu. As such, the installation fails. The script determines the OS without specifying the complete path to uname. Do the same in order to determine ARCH. * Update module golang.org/x/sync to v0.1.0 (livekit#1136) Generated by renovateBot Co-authored-by: renovate[bot] <29139614+renovate[bot]@users.noreply.github.com> * Update module github.com/urfave/cli/v2 to v2.23.0 (livekit#1133) Generated by renovateBot Co-authored-by: renovate[bot] <29139614+renovate[bot]@users.noreply.github.com> * Add red encoding for opus only publisher (livekit#1137) * Add red encodings for opus only publisher * Add test case for red receiver * Recover lost packet from red encoding for opus only client (livekit#1139) * Recover lost packet from red encoding for opus only client * Avoid divide-by-zero (livekit#1141) * revert canclose method of red receivers (livekit#1142) * Update module github.com/urfave/cli/v2 to v2.23.2 (livekit#1144) Generated by renovateBot Co-authored-by: renovate[bot] <29139614+renovate[bot]@users.noreply.github.com> * Update module github.com/prometheus/client_golang to v1.13.1 (livekit#1140) Generated by renovateBot Co-authored-by: renovate[bot] <29139614+renovate[bot]@users.noreply.github.com> * Update module github.com/urfave/cli/v2 to v2.23.4 (livekit#1145) Generated by renovateBot Co-authored-by: renovate[bot] <29139614+renovate[bot]@users.noreply.github.com> * Make sure client get participant info before track fired (livekit#1147) * Set track level codec settings for all pending tracks (livekit#1148) * Use the redis.UniversalClient interface instead of *redis.Client when interacting with go-redis (livekit#1149) * Use the redis.UniversalClient interface instead of *redis.Client when interacting with go-redis * Update protocol to v1.2.1 * RedisStore: make UnlockRoom atomic (livekit#1044) Co-authored-by: David Zhao <dz@livekit.io> * Update module github.com/prometheus/client_golang to v1.14.0 (livekit#1151) Generated by renovateBot Co-authored-by: renovate[bot] <29139614+renovate[bot]@users.noreply.github.com> * Version 1.3.0 (livekit#1152) Co-authored-by: David Zhao <dz@livekit.io> * Fix YAML failure when logging config is declared (livekit#1154) * Fix YAML failure when logging config is declared * version bump * CreateRoom API to actually create the room on an RTC node (livekit#1155) Previously, CreateRoom only created the room in storage, but did not hydrate it on an RTC node. This has caused strange behaviors such as emptyTimeout not working correctly (livekit#1109). Also reduced room reap worker to consistently reap rooms. Fixes livekit#241 * Confirm room creation prior to returning from CreateRoom (livekit#1157) * Demote some stable logs to Debugw (livekit#1158) * Demote some stable logs to Debugw * Add 'discard message from' to ignore list * A few more log tweaks (livekit#1159) * Update module github.com/urfave/cli/v2 to v2.23.5 (livekit#1153) Generated by renovateBot Co-authored-by: renovate[bot] <29139614+renovate[bot]@users.noreply.github.com> * Check for ignored with fully formatted log message (livekit#1163) * Lmit failed send to specific error type (livekit#1164) * fix memory leak on long run room/participant (livekit#1169) * Fix memory leak * update pion * clean code * Do not append the stream key to the ingress URL for rtmp (livekit#1167) * command to show help with hidden generated flags (livekit#1171) * Force full reconnect when there is no previous answer during migration. (livekit#1168) * Fix potential ssrc collision between participants (livekit#1173) * Update module github.com/livekit/protocol to v1.2.3 (livekit#1176) Generated by renovateBot Co-authored-by: renovate[bot] <29139614+renovate[bot]@users.noreply.github.com> * Prevent rtx buffer and forwarding path colliding (livekit#1174) * Prevent rtx buffer and forwarding path colliding Received packets are put into RTX buffer which is a circular buffer and the packet (sequence number) is queued for forwarding. If the RTX buffer fills up and cycles before forwarding happens, forwarding would pick the wrong packet (as it is holding a reference to a byte slice in the RTX buffer) to forward. Prevent it by moving reading from RTX buffer just before forwarding. Adds an extra copy from RTX buffer -> temp buffer for forwarding, but ensures that forwarding buffer is not used by another go routine. * Revert some changes from previous commit Details: - Do all forward processing as before. - One difference is not load raw packet into ExtPacket. - Load raw packet into provided buffer when module that reads using ReadExtended calls that function. If the packet is not there in the retransmission buffer, that packet will be dropped. This is the case we are trying to fix, i. e. the RTX buffer has cycled before ReadExtended could pull the packet. This makes a copy into the provided buffer so that the data does not change underneath. * Remove debug comment * Oops missed a function call * rename hidden-help to help-verbose (livekit#1180) * Set forceRelay to unset by default (livekit#1184) Setting it to `DISABLED` could cause clients to override user preference to use relay. * Update module github.com/pion/ice/v2 to v2.2.12 (livekit#1183) Generated by renovateBot Co-authored-by: renovate[bot] <29139614+renovate[bot]@users.noreply.github.com> * Support redis cluster mode (livekit#1181) * use redisConfig of protocol instead of redisConfig and use redis of protocol to create redis client to support redis cluster mode too * Prevent subscription after participant close. (livekit#1182) * Add enable loopback candidate option (livekit#1185) * Force 'allowed; even on subscription.' (livekit#1187) Also not holding lock while resolving by sid. Not pretty. But, we should deprecate SID based permissions. Also, wire changes, possibly due to redis cluster PR? * set stereo max bitrates to 510000 (livekit#1188) * Promote dyancast logs for debugging (livekit#1190) * Check forwarder started when seeing. (livekit#1191) When switching from local -> remote or remote -> local, the forwarder state is cached and restored after the switch to ensure continuity in sequence number /time stamp. But, if the forwarder had not started before the switch, the sequence number always starts at 1 because of seeding. So, do not see unless forwarder was started before the switch. * Do not forward media till peer connection is connected. (livekit#1194) There were some failures with missing media. The only thing I could see between working and non-working case is when media forwarding starts. So, delay media forwarding till peer connection is connected. Also, add a subscribe op only if a subscribe/unsubscribe queuing is successful. There was a recent change to not queue a subscribe when the participant is closed/disconnected. This got the subscribe op counter out of whack. * Fix rtcp lost for downtrack used incorrect buffer factory (livekit#1195) * Fix rtcp lost for downtrack used incorrect buffer factory In buffer factory change(livekit#1173), every pariticipant has its own buffer factory, can't use publisher's bufferfactory to create DownTrack * clean code * Encoding primary packet only if red encoding don't have enough space (livekit#1196) * Encoding primary packet only if red encoding don't have enough space * clean code * node type prometheus metric labels (livekit#1197) * Add stats for data channel and signal (livekit#1198) * Add stats for data channel and signal * Solve comment * Suppress a few additional Pion logs (livekit#1199) * Suppress a few additional Pion logs * remove dupe * Update module go.uber.org/zap to v1.24.0 (livekit#1200) Generated by renovateBot Co-authored-by: renovate[bot] <29139614+renovate[bot]@users.noreply.github.com> * Update module github.com/hashicorp/golang-lru to v0.6.0 (livekit#1165) Generated by renovateBot Co-authored-by: renovate[bot] <29139614+renovate[bot]@users.noreply.github.com> * Update module github.com/twitchtv/twirp to v8.1.3+incompatible (livekit#1120) Generated by renovateBot Co-authored-by: renovate[bot] <29139614+renovate[bot]@users.noreply.github.com> * Revert "Update module go.uber.org/zap to v1.24.0 (livekit#1200)" This reverts commit bf0bbc3. * Improve logging messages with RoomService (livekit#1203) * Configurable RoomService execution timeout (livekit#1206) * API execution timeout is now configurable In certain environments, it can take longer than the default 2s to fully execute API requests. Making execution timeout a configurable option. * do not expose api to YAML. internal for now. * Create response channel before sending StartSession (livekit#1208) * Fixed single-node routing breakage. (livekit#1209) * Fixed single-node routing breakage. Due to a regression of a previous change, Redis was always enabled even when no configuration was provided. * updated go modules * Update module github.com/urfave/cli/v2 to v2.23.6 (livekit#1207) Generated by renovateBot Co-authored-by: renovate[bot] <29139614+renovate[bot]@users.noreply.github.com> * Update module github.com/pion/turn/v2 to v2.0.9 (livekit#1201) Generated by renovateBot Co-authored-by: renovate[bot] <29139614+renovate[bot]@users.noreply.github.com> * Update module github.com/pion/transport to v0.14.1 (livekit#1202) Generated by renovateBot Co-authored-by: renovate[bot] <29139614+renovate[bot]@users.noreply.github.com> * Added logging fields for Ingress & Egress services (livekit#1205) * Don't output error log if upstream closed while binding (livekit#1211) * Add option to issue full reconnect on a publication error. (livekit#1214) * Add option to issue full reconnect on a publication error. Leaving the publication error timeout at 30 seconds as there are some publications taking long. Also, there are cases where the peer connection fails after 30 seconds. The peer connection failure happens after publication error is detected. But, 30 seconds is a good amount of time for publication to establish. * prevent recursive lock * Close migration muted track which is not fired (livekit#1215) Co-authored-by: Raja Subramanian <raja.gobi@tutanota.com> Co-authored-by: David Zhao <dz@livekit.io> Co-authored-by: David Colburn <xero73@gmail.com> Co-authored-by: cnderrauber <zengjie9004@gmail.com> Co-authored-by: renovate[bot] <29139614+renovate[bot]@users.noreply.github.com> Co-authored-by: Samuel Humeau <samuel.humeau@gmail.com> Co-authored-by: davidliu <davidliu@deviange.net> Co-authored-by: Mathew Kamkar <578302+matkam@users.noreply.github.com> Co-authored-by: Benjamin Pracht <benjamin@livekit.io> Co-authored-by: Mohit Agarwal <mohit@sdf.org> Co-authored-by: MaxnSter <maxnster20@gmail.com> Co-authored-by: Tom Xiong <tomxiongzh@gmail.com>

boks1971 requested review from davidzhao and cnderrauber and removed request for davidzhao December 6, 2022 08:55

prevent recursive lock

7448ab8

cnderrauber approved these changes Dec 6, 2022

View reviewed changes

boks1971 merged commit 6bd5504 into master Dec 6, 2022

boks1971 deleted the raja_reconnect_on_pub_failure branch December 6, 2022 09:17

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add option to issue full reconnect on a publication error. #1214

Add option to issue full reconnect on a publication error. #1214

boks1971 commented Dec 6, 2022

Add option to issue full reconnect on a publication error. #1214

Add option to issue full reconnect on a publication error. #1214

Conversation

boks1971 commented Dec 6, 2022