Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

TestPeerConnection_Renegotation_AddTrack failing sometimes #1039

Closed
lawl opened this issue Feb 18, 2020 · 11 comments
Closed

TestPeerConnection_Renegotation_AddTrack failing sometimes #1039

lawl opened this issue Feb 18, 2020 · 11 comments

Comments

@lawl
Copy link
Contributor

lawl commented Feb 18, 2020

Your environment.

What did you do?

When creating PR #1038 a the test-suite on travis failed and it had seemingly nothing to do with my change (add a method for users that's never used by pion itself).

After a force push with no changes, to trigger a rebuild, the test suite was green.

What did you expect?

Test suite being green.

What happened?

The TestPeerConnection_Renegotation_AddTrack test seems to sometimes fail. I have no idea how to reproduce it. See link to the travis build above.

@at-wat
Copy link
Member

at-wat commented Mar 10, 2020

It can be reproduced locally by running the test a lot of times.

On success,

--- PASS: TestPeerConnection_Renegotation_AddTrack (0.73s)
=== RUN   TestPeerConnection_Renegotation_AddTrack
pc INFO: 2020/03/10 22:44:47 signaling state changed to have-local-offer
pc INFO: 2020/03/10 22:44:47 signaling state changed to have-remote-offer
ice INFO: 2020/03/10 22:44:47 Setting new connection state: Checking
pc INFO: 2020/03/10 22:44:47 ICE connection state changed: checking
pc INFO: 2020/03/10 22:44:47 signaling state changed to stable
pc INFO: 2020/03/10 22:44:47 signaling state changed to stable
ice INFO: 2020/03/10 22:44:47 Setting new connection state: Checking
pc INFO: 2020/03/10 22:44:47 ICE connection state changed: checking
ice INFO: 2020/03/10 22:44:47 Setting new connection state: Connected
pc INFO: 2020/03/10 22:44:47 ICE connection state changed: connected
ice INFO: 2020/03/10 22:44:47 Setting new connection state: Connected
pc INFO: 2020/03/10 22:44:47 ICE connection state changed: connected
pc INFO: 2020/03/10 22:44:47 peer connection state changed: connected
pc INFO: 2020/03/10 22:44:47 peer connection state changed: connected
pc INFO: 2020/03/10 22:44:47 signaling state changed to have-local-offer
pc INFO: 2020/03/10 22:44:47 signaling state changed to have-remote-offer
pc INFO: 2020/03/10 22:44:47 signaling state changed to stable
pc INFO: 2020/03/10 22:44:47 signaling state changed to stable

** OnTrack called **

pc WARNING: 2020/03/10 22:44:47 Failed to accept RTP SessionSRTP has been closed
pc WARNING: 2020/03/10 22:44:47 Failed to accept RTCP SessionSRTCP has been closed
ice INFO: 2020/03/10 22:44:47 Setting new connection state: Closed
pc INFO: 2020/03/10 22:44:47 peer connection state changed: closed
pc INFO: 2020/03/10 22:44:47 ICE connection state changed: closed
pc WARNING: 2020/03/10 22:44:47 Failed to accept RTP SessionSRTP has been closed
pc WARNING: 2020/03/10 22:44:47 Failed to accept RTCP SessionSRTCP has been closed
ice INFO: 2020/03/10 22:44:47 Setting new connection state: Closed
pc INFO: 2020/03/10 22:44:47 peer connection state changed: closed
pc INFO: 2020/03/10 22:44:47 ICE connection state changed: closed

On failure,

--- PASS: TestPeerConnection_Renegotation_AddTrack (0.73s)
=== RUN   TestPeerConnection_Renegotation_AddTrack
pc INFO: 2020/03/10 22:44:47 signaling state changed to have-local-offer
pc INFO: 2020/03/10 22:44:47 signaling state changed to have-remote-offer
ice INFO: 2020/03/10 22:44:47 Setting new connection state: Checking
pc INFO: 2020/03/10 22:44:47 ICE connection state changed: checking
pc INFO: 2020/03/10 22:44:47 signaling state changed to stable
pc INFO: 2020/03/10 22:44:47 signaling state changed to stable
ice INFO: 2020/03/10 22:44:47 Setting new connection state: Checking
pc INFO: 2020/03/10 22:44:47 ICE connection state changed: checking
ice INFO: 2020/03/10 22:44:47 Setting new connection state: Connected
pc INFO: 2020/03/10 22:44:47 ICE connection state changed: connected
ice INFO: 2020/03/10 22:44:47 Setting new connection state: Connected
pc INFO: 2020/03/10 22:44:47 ICE connection state changed: connected
pc INFO: 2020/03/10 22:44:48 signaling state changed to have-local-offer
pc INFO: 2020/03/10 22:44:48 signaling state changed to have-remote-offer
pc WARNING: 2020/03/10 22:44:48 RTPReceiver Receive failed the DTLS transport has not started yet
pc INFO: 2020/03/10 22:44:48 signaling state changed to stable
pc INFO: 2020/03/10 22:44:48 signaling state changed to stable
pc WARNING: 2020/03/10 22:44:48 Failed to start Sender: the DTLS transport has not started yet
pc INFO: 2020/03/10 22:44:48 peer connection state changed: connected
pc INFO: 2020/03/10 22:44:48 peer connection state changed: connected

@at-wat
Copy link
Member

at-wat commented Mar 10, 2020

There is a race condition that startSRTP

webrtc/dtlstransport.go

Lines 145 to 153 in be9fde1

func (t *DTLSTransport) startSRTP() error {
t.lock.Lock()
defer t.lock.Unlock()
if t.srtpSession != nil && t.srtcpSession != nil {
return nil
} else if t.conn == nil {
return fmt.Errorf("the DTLS transport has not started yet")
}

is called before

webrtc/dtlstransport.go

Lines 286 to 295 in be9fde1

// Re-take the lock, nothing beyond here is blocking
t.lock.Lock()
defer t.lock.Unlock()
if err != nil {
t.onStateChange(DTLSTransportStateFailed)
return err
}
t.conn = dtlsConn

t.conn = dtlsConn.

@at-wat
Copy link
Member

at-wat commented Mar 10, 2020

pc.startRenegotation() is called before finishing pc.dtlsTransport.Start().

@at-wat
Copy link
Member

at-wat commented Mar 10, 2020

I think this block

webrtc/dtlstransport.go

Lines 238 to 289 in be9fde1

prepareTransport := func() (DTLSRole, *dtls.Config, error) {
t.lock.Lock()
defer t.lock.Unlock()
if err := t.ensureICEConn(); err != nil {
return DTLSRole(0), nil, err
}
if t.state != DTLSTransportStateNew {
return DTLSRole(0), nil, &rtcerr.InvalidStateError{Err: fmt.Errorf("attempted to start DTLSTransport that is not in new state: %s", t.state)}
}
t.srtpEndpoint = t.iceTransport.NewEndpoint(mux.MatchSRTP)
t.srtcpEndpoint = t.iceTransport.NewEndpoint(mux.MatchSRTCP)
t.remoteParameters = remoteParameters
// pion/webrtc#753
cert := t.certificates[0]
t.onStateChange(DTLSTransportStateConnecting)
return t.role(), &dtls.Config{
Certificates: []tls.Certificate{
{
Certificate: [][]byte{cert.x509Cert.Raw},
PrivateKey: cert.privateKey,
}},
SRTPProtectionProfiles: []dtls.SRTPProtectionProfile{dtls.SRTP_AES128_CM_HMAC_SHA1_80},
ClientAuth: dtls.RequireAnyClientCert,
LoggerFactory: t.api.settingEngine.LoggerFactory,
InsecureSkipVerify: true,
}, nil
}
var dtlsConn *dtls.Conn
dtlsEndpoint := t.iceTransport.NewEndpoint(mux.MatchDTLS)
role, dtlsConfig, err := prepareTransport()
if err != nil {
return err
}
// Connect as DTLS Client/Server, function is blocking and we
// must not hold the DTLSTransport lock
if role == DTLSRoleClient {
dtlsConn, err = dtls.Client(dtlsEndpoint, dtlsConfig)
} else {
dtlsConn, err = dtls.Server(dtlsEndpoint, dtlsConfig)
}
// Re-take the lock, nothing beyond here is blocking
t.lock.Lock()
defer t.lock.Unlock()

should be continuously locked.
And/or other APIs using DTLS/SRTP/SCTP should return error if pc.dtlsTransport.Start() is not completed.
(related: https://github.com/pion/webrtc/wiki/PlanningV3/_compare/e2fe45a9d1a6b224dd9317667c2c3a687675a893)

@Sean-Der Will you take this issue?

@at-wat at-wat self-assigned this Mar 11, 2020
@at-wat at-wat removed their assignment Mar 11, 2020
@at-wat
Copy link
Member

at-wat commented Mar 11, 2020

@Sean-Der feel free to take this issue if you want!

@Sean-Der
Copy link
Member

Sean-Der commented Mar 11, 2020

👍 really sorry for burning your time on this :( I will get this fixed tonight!

I don't ever want to be a blocker on your work, so if you ever feel I am slowing down the project please drive it any way you see fit. I can get you access to anything you need, and happy to put my time into any projects you think are important :)

@Sean-Der
Copy link
Member

Fixed by 825b5e4

The issue was that we would process RemoteDescriptions concurrently. We know how hold a lock and make sure they are processed sequentially.

@at-wat
Copy link
Member

at-wat commented Apr 12, 2020

This still happens.
https://travis-ci.org/github/pion/webrtc/jobs/673755936

=== RUN   TestPeerConnection_Renegotation_AddTrack
goroutine profile: total 15
2 @ 0x80758a4 0x804ca12 0x804c9ed 0x804c77c 0x83029bd 0x809e321
#	0x83029bc	github.com/pion/ice.NewAgent.func2+0x3c	/go/pkg/mod/github.com/pion/ice@v0.7.12/agent.go:431

2 @ 0x80758a4 0x8070524 0x806f9fb 0x80be807 0x80bf396 0x80bf37b 0x81bb84f 0x81cf69f 0x81ce467 0x82f86e0 0x809e321
#	0x806f9fa	internal/poll.runtime_pollWait+0x4a			/usr/local/go/src/runtime/netpoll.go:184
#	0x80be806	internal/poll.(*pollDesc).wait+0x36			/usr/local/go/src/internal/poll/fd_poll_runtime.go:87
#	0x80bf395	internal/poll.(*pollDesc).waitRead+0x165		/usr/local/go/src/internal/poll/fd_poll_runtime.go:92
#	0x80bf37a	internal/poll.(*FD).ReadFrom+0x14a			/usr/local/go/src/internal/poll/fd_unix.go:219
#	0x81bb84e	net.(*netFD).readFrom+0x3e				/usr/local/go/src/net/fd_unix.go:208
#	0x81cf69e	net.(*UDPConn).readFrom+0x3e				/usr/local/go/src/net/udpsock_posix.go:47
#	0x81ce466	net.(*UDPConn).ReadFrom+0x46				/usr/local/go/src/net/udpsock.go:121
#	0x82f86df	github.com/pion/ice.(*candidateBase).recvLoop+0xef	/go/pkg/mod/github.com/pion/ice@v0.7.12/candidate_base.go:92

2 @ 0x80758a4 0x8083e59 0x82723dd 0x82ffdc5 0x830c01f 0x809e321
#	0x82723dc	github.com/pion/transport/packetio.(*Buffer).Read+0xec		/go/pkg/mod/github.com/pion/transport@v0.10.0/packetio/buffer.go:146
#	0x82ffdc4	github.com/pion/ice.(*Conn).Read+0x74				/go/pkg/mod/github.com/pion/ice@v0.7.12/transport.go:78
#	0x830c01e	github.com/pion/webrtc/v2/internal/mux.(*Mux).readLoop+0x9e	/go/src/github.com/pion/webrtc/internal/mux/mux.go:106

2 @ 0x80758a4 0x8083e59 0x8302bff 0x809e321
#	0x8302bfe	github.com/pion/ice.(*Agent).startConnectivityChecks.func1.1+0xae	/go/pkg/mod/github.com/pion/ice@v0.7.12/agent.go:622

1 @ 0x80758a4 0x804ca12 0x804c9ed 0x804c75c 0x8110f5b 0x8113daa 0x8110c2a 0x8111fcd 0x811127b 0x845b754 0x80754d2 0x809e321
#	0x8110f5a	testing.(*T).Run+0x2ca		/usr/local/go/src/testing/testing.go:961
#	0x8113da9	testing.runTests.func1+0x59	/usr/local/go/src/testing/testing.go:1202
#	0x8110c29	testing.tRunner+0x99		/usr/local/go/src/testing/testing.go:909
#	0x8111fcc	testing.runTests+0x22c		/usr/local/go/src/testing/testing.go:1200
#	0x811127a	testing.(*M).Run+0x13a		/usr/local/go/src/testing/testing.go:1117
#	0x845b753	main.main+0x103			_testmain.go:330
#	0x80754d1	runtime.main+0x201		/usr/local/go/src/runtime/proc.go:203

1 @ 0x80758a4 0x8083e59 0x826b6b7 0x809e321
#	0x826b6b6	github.com/pion/dtls/v2/internal/net/connctx.(*connCtx).Read.func1+0xc6	/go/pkg/mod/github.com/pion/dtls/v2@v2.0.0-rc.10/internal/net/connctx/connctx.go:62

1 @ 0x80758a4 0x8083e59 0x82723dd 0x830b7cc 0x826b118 0x829b791 0x82b69da 0x809e321
#	0x82723dc	github.com/pion/transport/packetio.(*Buffer).Read+0xec			/go/pkg/mod/github.com/pion/transport@v0.10.0/packetio/buffer.go:146
#	0x830b7cb	github.com/pion/webrtc/v2/internal/mux.(*Endpoint).Read+0x3b		/go/src/github.com/pion/webrtc/internal/mux/endpoint.go:36
#	0x826b117	github.com/pion/dtls/v2/internal/net/connctx.(*connCtx).Read+0x167	/go/pkg/mod/github.com/pion/dtls/v2@v2.0.0-rc.10/internal/net/connctx/connctx.go:77
#	0x829b790	github.com/pion/dtls/v2.(*Conn).readAndBuffer+0xe0			/go/pkg/mod/github.com/pion/dtls/v2@v2.0.0-rc.10/conn.go:551
#	0x82b69d9	github.com/pion/dtls/v2.(*Conn).handshake.func3+0xb9			/go/pkg/mod/github.com/pion/dtls/v2@v2.0.0-rc.10/conn.go:825

1 @ 0x80758a4 0x8083e59 0x829dab4 0x829901c 0x829996d 0x8299749 0x83fe3eb 0x843f180 0x809e321
#	0x829dab3	github.com/pion/dtls/v2.(*Conn).handshake+0x313				/go/pkg/mod/github.com/pion/dtls/v2@v2.0.0-rc.10/conn.go:867
#	0x829901b	github.com/pion/dtls/v2.createConn+0x7eb				/go/pkg/mod/github.com/pion/dtls/v2@v2.0.0-rc.10/conn.go:193
#	0x829996c	github.com/pion/dtls/v2.ServerWithContext+0x5c				/go/pkg/mod/github.com/pion/dtls/v2@v2.0.0-rc.10/conn.go:262
#	0x8299748	github.com/pion/dtls/v2.Server+0x98					/go/pkg/mod/github.com/pion/dtls/v2@v2.0.0-rc.10/conn.go:229
#	0x83fe3ea	github.com/pion/webrtc/v2.(*DTLSTransport).Start+0x47a			/go/src/github.com/pion/webrtc/dtlstransport.go:315
#	0x843f17f	github.com/pion/webrtc/v2.(*PeerConnection).startTransports.func1+0x1df	/go/src/github.com/pion/webrtc/peerconnection.go:1578

1 @ 0x80758a4 0x8083e59 0x82b1b68 0x82b1399 0x82b6858 0x809e321
#	0x82b1b67	github.com/pion/dtls/v2.(*handshakeFSM).wait+0x157	/go/pkg/mod/github.com/pion/dtls/v2@v2.0.0-rc.10/handshaker.go:244
#	0x82b1398	github.com/pion/dtls/v2.(*handshakeFSM).Run+0x2f8	/go/pkg/mod/github.com/pion/dtls/v2@v2.0.0-rc.10/handshaker.go:157
#	0x82b6857	github.com/pion/dtls/v2.(*Conn).handshake.func2+0x87	/go/pkg/mod/github.com/pion/dtls/v2@v2.0.0-rc.10/conn.go:806

1 @ 0x80758a4 0x808496d 0x808495e 0x80846ba 0x80a7332 0x80a712c 0x840ade6 0x8406f86 0x842edbd 0x842cc36 0x8110c2a 0x809e321
#	0x80846b9	sync.runtime_SemacquireMutex+0x39						/usr/local/go/src/runtime/sema.go:71
#	0x80a7331	sync.(*Mutex).lockSlow+0x1f1							/usr/local/go/src/sync/mutex.go:138
#	0x80a712b	sync.(*Mutex).Lock+0x4b								/usr/local/go/src/sync/mutex.go:81
#	0x840ade5	github.com/pion/webrtc/v2.(*PeerConnection).startRenegotation+0x25		/go/src/github.com/pion/webrtc/peerconnection.go:1596
#	0x8406f85	github.com/pion/webrtc/v2.(*PeerConnection).SetRemoteDescription+0x585		/go/src/github.com/pion/webrtc/peerconnection.go:733
#	0x842edbc	github.com/pion/webrtc/v2.signalPair+0x22c					/go/src/github.com/pion/webrtc/peerconnection_test.go:66
#	0x842cc35	github.com/pion/webrtc/v2.TestPeerConnection_Renegotation_AddTrack+0x4d5	/go/src/github.com/pion/webrtc/peerconnection_renegotation_test.go:79
#	0x8110c29	testing.tRunner+0x99								/usr/local/go/src/testing/testing.go:909

1 @ 0x814626c 0x81460da 0x8143133 0x83f9847 0x809e321
#	0x814626b	runtime/pprof.writeRuntimeProfile+0x7b			/usr/local/go/src/runtime/pprof/pprof.go:708
#	0x81460d9	runtime/pprof.writeGoroutine+0x79			/usr/local/go/src/runtime/pprof/pprof.go:670
#	0x8143132	runtime/pprof.(*Profile).WriteTo+0x2c2			/usr/local/go/src/runtime/pprof/pprof.go:329
#	0x83f9846	github.com/pion/transport/test.TimeOut.func1+0x56	/go/pkg/mod/github.com/pion/transport@v0.10.0/test/util.go:18

panic: timeout

goroutine 2726 [running]:
github.com/pion/transport/test.TimeOut.func1()
	/go/pkg/mod/github.com/pion/transport@v0.10.0/test/util.go:21 +0xda
created by time.goFunc
	/usr/local/go/src/time/sleep.go:168 +0x37
FAIL	github.com/pion/webrtc/v2	70.460s

@at-wat at-wat reopened this Apr 12, 2020
@Sean-Der
Copy link
Member

This looks like a timeout in DTLS, possibly different?

Also interesting thing is I see server frames (but no client ones) maybe reading the trace wrong!

@at-wat
Copy link
Member

at-wat commented Apr 12, 2020

hmm, dtls handshaker routine must be always existing unless the connection is closed or handshake is errored, but client side handshaker seems not there. That's strange.

@Sean-Der
Copy link
Member

Sean-Der commented Jun 19, 2020

The last 19 builds in master have passed, so I don't think we have anything actionable.

@at-wat If you see it again re-open right away and happy to jump in on it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging a pull request may close this issue.

3 participants