Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Long running ESL connections unable to send commands #13

Closed
williamflynt opened this issue Nov 1, 2021 · 6 comments
Closed

Long running ESL connections unable to send commands #13

williamflynt opened this issue Nov 1, 2021 · 6 comments

Comments

@williamflynt
Copy link

williamflynt commented Nov 1, 2021

Problem

Bug Report

Connections made with esl.Dial cannot send commands using c.SendCommand after five seconds of life.

This is because we set a global deadline on the connection.

I had a much longer issue written, with some additional data, but my machine inexplicably rebooted...

Content
  1. Subjective Considerations
  2. Quick Workaround
  3. Reproduce

Subjective Considerations #

I am grateful for this package! These are small things that can (maybe) be improved iteratively, and they are only my opinions.

  1. Official context docs - Do not store Contexts inside a struct type; instead, pass a Context explicitly to each function that needs it.
    • We should consider using a new context for each command, and allowing the "parent" caller to choose if a deadline is hit or a context needs canceled.
  2. Exporting some attributes and methods of Conn would be helpful for developers - specifically the newConnection constructor.
    • This package seems really perfect for a structure that enables 3rd party plugins/hooks. I wonder how we could enable that...
  3. Quietly stopping the receiveLoop on a context.Err() with only an internal logger.Warn can make it difficult to track down what's happening when things inevitably go wrong.
    • Should we consider panic? I'm not normally a fan but the entire system is down at this point, right?
    • At a minimum we should log a full error.
    • Is there a possibility we could reinitialize a failed connection? Should we?
    • Do these considerations apply when we remove the context from the Conn struct in favor of passing context around?

Quick Workaround #

Disable the deadline checker/setter here (remove or comment out). Replace with another timeout enforcement method, or just allow the caller to manage their own timeouts.

Reproduce #

/*
 * This code snippet shows a long-running ESL connection that
 * fails to deliver commands after five seconds of life.
 *
 * Note that we are limited by an artificial Go deadline, not
 * any FreeSWITCH limitation.
 */
package main

import (
	"context"
	"log"
	"time"

	"github.com/percipia/eslgo"
	"github.com/percipia/eslgo/command"
)


func main() {
	// Connect to FreeSWITCH
	conn, err := eslgo.Dial("127.0.0.1:8021", "ClueCon", func() {
		log.Println("Inbound Connection Disconnected")
	})
	if err != nil {
		log.Println("Error connecting", err)
		return
	}
	defer conn.ExitAndClose()

	// Create a basic context
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Minute)
	defer cancel()

	go func() {
		i := 0
		for {
			log.Printf("sending ping %v\n", i)
			if _, err := conn.SendCommand(ctx, command.API{
				Command:    "echo",
				Arguments:  log.Sprintf("ping %v", i),
				Background: false,
			}); err != nil {
				log.Println(err)
			}
			i++
			time.Sleep(time.Second * 1)
		}
	}()

	time.Sleep(30)
}
@winsock
Copy link
Contributor

winsock commented Nov 1, 2021

Hi @williamflynt, there are some really good comments in here, thanks for sharing them!

There shouldn't be any 5 second context timeout so this seems like a potential bug. The only 5 second timeout in the library is related to not receiving a response while authorizing and can be overridden.

Have you tried using an InboundOptions struct that potentially extends or disables that auth timeout? The eslgo.Dial method is a helper that just has some sane defaults that worked for Percipia's product.

In any case your suggestions are worth being looked into for improving things :)

@williamflynt
Copy link
Author

williamflynt commented Nov 1, 2021

I haven't tried that.

Here's where the deadline gets set on the first SendCommand:

_ = c.conn.SetWriteDeadline(deadline)

I haven't looked into where deadline on the context originally comes from.

@winsock
Copy link
Contributor

winsock commented Nov 1, 2021

The deadline gets set from the caller, in your code it would be the context passed in from here.

if _, err := conn.SendCommand(ctx, command.API{
				Command:    "echo",
				Arguments:  log.Sprintf("ping %v", i),
				Background: false,
			});

However, internally to the library SendCommand also gets called when the connection gets created initially to send the auth command with the password to FreeSWITCH. By default the library will send that command with a 5 second deadline for getting the response from FreeSWITCH. If it doesn't get a response from FS to the auth command it will assume that the connection is bad or unauthenticated and close it.

I am thinking that is what you are seeing here, the auth command not processing correctly. If raising the auth timeout to 10 seconds also keeps your commands running for 10 seconds we should be able to confirm if this theory is correct and proceed from there.

Are you able to share a bit on how everything is configured for you? Are you doing ACL authentication for inbound ESL or running your program on the same machine as FS? Any verbose logs from FS showing the ESL command flow should help too.

@winsock
Copy link
Contributor

winsock commented Nov 2, 2021

Some thoughts I have on the considerations outside of the 5 second disconnect issue.

  1. I'm not positive if the documentation on not storing contexts in structs covers the use of runningContext in this library. I believe most if not all of the public exported methods in this library use the context passed in by the caller. The stored context in the struct is effectively used as a synchronization primitive to handle disconnects or the user requesting the connection to be closed. It defaults to context.Background() and can be overridden in the options struct if desired. I suppose it could be rewritten to use channels but I don't know what is gained by doing so.
  2. This may be partially covered by the InboundOptions, OutboundOptions, and Options structs. Once created with the options desired you can call the Dial or ListenAndServe methods on the struct to create the connection. If there are more configuration parameters you think should be added in here I'm all ears!
  3. The onDisconnect callback if provided should be called in this case. I'm learly of adding auto reconnect to the library since it can be done by the user of the library via the callback function. That is how I've implemented this before for a few projects using this library. The project using this method is long running and stays up for months at a time and handles reconnects without issue. I don't want the library to dictate what the best method would be for the user when it comes to reconnecting after any cause of disconnect.
    • Automatic authentication was a bit of feature creep compromise compared to my general design of delegating to the user. The library seemed incomplete without it but I could've left that to the user as well.
    • Regarding logging, you can specify a custom logger that implements the Logger interface if you wish via the Options struct if you want tighter control visibility there.
    • About panic, I don't see that ever being added to the library. The library was designed to delegate most decision making to the user of the library. This allowed us to gracefully recover from error conditions without bringing down the entire process. If your use case requires this you should be able to add a panic call to your onDisconnect callback to get the same effect.

@winsock
Copy link
Contributor

winsock commented Nov 2, 2021

I do potentially see an overlooked path where the onDisconnect handler may not be called. It wouldn't explain your 5 second disconnect issue though.

@williamflynt
Copy link
Author

Well - not sure. Have you tried the code I left to reproduce the bug?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants