Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

xmpp: Session establishment can get stuck with certain malfunctioning servers despite context with timeout or deadline #124

Closed
horazont opened this issue Apr 1, 2021 · 11 comments

Comments

@horazont
Copy link

@horazont horazont commented Apr 1, 2021

Minimal reproducer:

package main

import (
	"context"
	"crypto/tls"
	"fmt"
	"time"
	"mellium.im/xmpp"
	"mellium.im/xmpp/dial"
	"mellium.im/xmpp/jid"
	"mellium.im/sasl"
)


func test(address string) {
	ctx, _ := context.WithTimeout(context.Background(), 2 * time.Second)

	fmt.Printf("%s: testing ...\n", address)

	addr, err := jid.Parse(address)
	if err != nil {
		fmt.Printf("%s: failed to parse: %s\n", address, err)
		return
	}

	dialer := dial.Dialer{
		NoTLS:     true,
		S2S: false,
	}
	dialer.Deadline, _ = ctx.Deadline()
	conn, err := dialer.Dial(ctx, "tcp", addr)
	if err != nil {
		fmt.Printf("%s: failed to dial: %s\n", address, err)
		return
	}
	defer conn.Close()

	tls_config := &tls.Config{}

	features := []xmpp.StreamFeature{
		xmpp.StartTLS(tls_config),
		xmpp.SASL(
			addr.Localpart(),
			"foobar2342",
			sasl.ScramSha256, sasl.ScramSha1, sasl.Plain,
		),
	}

	session, err := xmpp.NewSession(
		ctx,
		addr.Domain(),
		addr,
		conn,
		0,
		xmpp.NewNegotiator(
			xmpp.StreamConfig{
				Lang: "en",
				Features: func(s *xmpp.Session, _features ...xmpp.StreamFeature) []xmpp.StreamFeature {
					return features;
				},
			},
		),
	)

	if session != nil {
		defer session.Close()
	}

	if err != nil {
		fmt.Printf("%s: session failed: %s\n", address, err)
	} else {
		fmt.Printf("%s: session succedeed!\n", address)
	}
}


func main() {
	test("foo@blackhole.dreckshal.de")
	test("foo@reject.dreckshal.de")
	test("foo@http-blackhole.dreckshal.de")
}

The domains used are configured as follows:

  • blackhole.dreckshal.de: drops traffic to the XMPP port (courtesy badxmpp.eu). This is to demonstrate the expected behaviour.
  • reject.dreckshal.de: replies with connection refused on the XMPP port (also courtesy badxmpp.eu). This is to demonstrate a random test case.
  • http-blackhole.dreckshal.de (bad name, but whatever): accepts the connection and then … does nothing.

When running the above example, the first two tests terminate with an error (i/o error and connection refused respectively) and the third one gets stuck, despite the context timeout.


This was discovered because mellium does an implicit fallback to port 80 if port 5222 is not used and the domain in question had a strangely configured httpd which probably waited for the first newline or whatever. It did not reply at all to the stream header.

@horazont
Copy link
Author

@horazont horazont commented Apr 1, 2021

NB: If you ever get Connection refused in the third test, let me know. Then my makeshift "do not send any reply" server broke :).

@SamWhited
Copy link
Member

@SamWhited SamWhited commented Apr 2, 2021

So this appears to be getting stuck reading the reply stream header (which of course is never read if this is a black hole that returns nothing). This is expected behavior (more or less) because it's up to the user to set a deadline on the connection. We could also make it respect the ctx possibly, but I'm not sure if that makes sense or if it would be overriding the users intent if they already set a deadline on the conn. It would also require spawning a goroutine before any attempt at reading the network, which doesn't feel great. I'm really not sure what the expected Go-like thing to do is here. I'll think about it and make a decision, but in the mean time you can use conn.SetDeadline to work around this (and as a general rule of thumb you should always set deadlines on everything, I should add this to the examples).

@horazont
Copy link
Author

@horazont horazont commented Apr 2, 2021

Hi and thanks for the response.

Using conn.SetDeadline during login seems very appropriate! I wasn’t aware of this golang way of doing things.

Nevertheless, if a method like NewSession takes a context, I would expect it to adhere to that deadline on its own, without me having to set a deadline on the connection. If the conn is the only thing involving a deadline, then maybe not taking a context there is the way to go?

Maybe my expectation is not go-idiomatic in that regard though.

horazont added a commit to horazont/xmpp-blackbox-exporter that referenced this issue Apr 2, 2021
This protects against TCP blackholes (accept a TCP connection and
never reply) or tarpits (accept a TCP connection and return
individual bytes at a very slow rate).

See-Also: mellium/xmpp#124
@SamWhited
Copy link
Member

@SamWhited SamWhited commented Apr 2, 2021

Nevertheless, if a method like NewSession takes a context, I would expect it to adhere to that deadline on its own

I tend to agree, it feels odd that it can lock up even after the context is expired.

If the conn is the only thing involving a deadline, then maybe not taking a context there is the way to go?

The context is also used to break out of the negotiate loop, so we'd need a context either way.

Maybe my expectation is not go-idiomatic in that regard though.

This is something I hadn't thought about before and I'm really not sure what the idiomatic way to do this is. I generally think the connection deadline should be controlled by the user, not my library, but context of course is something I have more control over. It's sort of Go's idioms clashing with POSIX or C idioms (or wherever this socket behavior is defined, it's not strictly a Go thing).

@SamWhited
Copy link
Member

@SamWhited SamWhited commented Apr 3, 2021

I talked this out on chat earlier and came to a conclusion (though I am of course still open to other opinions).

I was forgetting that some of the convenience functions that wrap NewSession such as DialClientSession create the conn themselves and the user has no chance to set a deadline before passing one in. That makes this an actual bug because there's no way we want these to block forever with no way to cancel them.

If we make the context cancellation respected in those methods, it makes sense not to cancel it (this part is still debatable) in the methods that take a conn, partially because these actually take an io.ReadWriter so we'd have to do a type check which always feels jank and partially because this way these methods give the user maximum flexibility to set deadlines differently if they want.

So the current solution would be to add a goroutine that cancels connection reads/writes to:

  • DialSession,
  • DialClientSession, and
  • DialServerSession

and to leave the other methods alone (and maybe just add a reminder to the documentation that if you're using a conn you should always set a deadline and come up with a strategy for resetting it as necessary).

@horazont
Copy link
Author

@horazont horazont commented Apr 3, 2021

That sounds very reasonable!

@SamWhited
Copy link
Member

@SamWhited SamWhited commented Apr 3, 2021

@horazont any chance you'd be interested in implementing this? It should be pretty straight forward (one goroutine that gets copied into those three functions just after the conn is created and an update to the changelog). No pressure of course, I can get to it later this weekend if not.

@SamWhited
Copy link
Member

@SamWhited SamWhited commented Apr 3, 2021

And I guess we'd want to have a pre-declared time that's used for the actual cancellation, something like aLongTimeAgo from net:

// aLongTimeAgo is a convenient way to cancel dials.                             
var aLongTimeAgo = time.Unix(1, 0)
@horazont
Copy link
Author

@horazont horazont commented Apr 3, 2021

@horazont any chance you'd be interested in implementing this? It should be pretty straight forward (one goroutine that gets copied into those three functions just after the conn is created and an update to the changelog). No pressure of course, I can get to it later this weekend if not.

I’d rather not. My golang fu is not very solid yet and I don’t use either of those functions (I need to be on a lower level in my code), so I’d not be confident in that.

(Not to mention the huge pile of other stuff which needs to be done ;))

@SamWhited
Copy link
Member

@SamWhited SamWhited commented Apr 3, 2021

Not to worry, I'll get to it later. Thanks for pointing this DOS out!

MelliumBot pushed a commit that referenced this issue Apr 7, 2021
When dialing a new connection in the Dial* functions no deadline was
being set meaning that a misbehaving server that blackholes the
connection could cause these functions to block forever. This patch
makes these functions respect any deadline set on the context, giving
the user control over cancelation.

Unifying the context logic also results in a secondary issue being fixed
where cancel functions were not being respected (but deadlines were) on
writes.

Fixes #124

Signed-off-by: Sam Whited <sam@samwhited.com>
@SamWhited
Copy link
Member

@SamWhited SamWhited commented Apr 7, 2021

I just pushed up a branch with a fix for this (and for a related but that was discovered and fixed incidentally). It still feels wrong to me somehow, but I can't quite put my finger on why. Going to merge it for now (assuming tests pass) and we can always re-visit later if other problems arise.

@MelliumBot MelliumBot closed this in f51eb24 Apr 7, 2021
MelliumBot pushed a commit that referenced this issue Apr 17, 2021
When dialing a new connection in the Dial* functions no deadline was
being set meaning that a misbehaving server that blackholes the
connection could cause these functions to block forever. This patch
makes these functions respect any deadline set on the context, giving
the user control over cancelation.

Unifying the context logic also results in a secondary issue being fixed
where cancel functions were not being respected (but deadlines were) on
writes.

Fixes #124

Signed-off-by: Sam Whited <sam@samwhited.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Linked pull requests

Successfully merging a pull request may close this issue.

None yet
2 participants