New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Recover connection on failure loops #2266

Open
nickbabcock opened this Issue Dec 17, 2018 · 18 comments

Comments

Projects
None yet
4 participants
@nickbabcock
Copy link

nickbabcock commented Dec 17, 2018

The issue

Npgql falls into failure loops, where even on connection close and reopen, Npgsql can't recover due to (what I presume) internal (static?) state that's been compromised.

To give an example, when writing a desktop app that reinitializes any state / connections on failure, I initially wrote this for npgsql

if (_failure)
{
    _conn.Close();
    _conn = new NpgsqlConnection(_connStr);
    await _conn.OpenAsync();
}

// execute prepared statements

When I upgraded the db one day, npgsql stopped working. Npgsql kept requesting to execute a prepared statement, but the server would throw a 26000 error "prepared statement "_p1" does not exist". Since Npgsql persists prepared statements, "closing and re-opening" the connection did not clear this error. A potential fix in npgsql for this could be to purge prepared statements that the server says doesn't exist

So my next attempt to reset Npgsql became:

if (_failure)
{
    _conn.Close();
    _conn = new NpgsqlConnection(_connStr);
    await _conn.OpenAsync();
    _conn.UnprepareAll();
}

However this new solution more often of not will fall into a failure loop when the client's desktop is woken from sleep. The first couple exceptions are about not being able to reach the database (that's fine, sometimes the network is slow to reestablish on wake), but then the same exception is thrown every time the new connection logic executes. The exception:

 Npgsql.NpgsqlException: Received backend message ReadyForQuery while expecting CommandCompleteMessage. Please file a bug.
   at Npgsql.NpgsqlConnector.<ExecuteInternalCommand>d__204.MoveNext()
   --- End of inner exception stack trace ---
   at System.Threading.Tasks.Task.ThrowIfExceptional(Boolean includeTaskCanceledExceptions)
   at System.Threading.Tasks.Task.Wait(Int32 millisecondsTimeout, CancellationToken cancellationToken)
   at Npgsql.NpgsqlConnector.ExecuteInternalCommand(String query)
   at Npgsql.NpgsqlConnector.UnprepareAll()
   at Npgsql.NpgsqlConnection.UnprepareAll()

The exact exception differs from time to time but the format remains the same:

 Npgsql.NpgsqlException: Received backend message ParameterDescription while expecting CommandCompleteMessage. Please file a bug.

Steps to reproduce

Putting a computer to sleep and then waking it feels weird to put into a bug report. You can take a look at the whole project or just the one file that deals with writing to postgres, but the application logic is fundamentally in the (contrived) style of:

while (true)
{
    try
    {
        Thread.Sleep(1000);
        if (_failure)
        {
            _conn.Close();
            _conn = new NpgsqlConnection(_connStr);
            await _conn.OpenAsync();
            _conn.UnprepareAll();
        }

        using (var cmd = new NpgsqlCommand("INSERT INTO abc(a) VALUES ('b')", _conn))
        {
            await cmd.PrepareAsync();
            await cmd.ExecuteNonQueryAsync();
        }

        _failure = false;
    }
    catch (Exception)
    {
        _failure = true;
        throw;
    }
}

Further technical details

Npgsql version: 4.0.3
PostgreSQL version: 10.5
Operating system: Windows

What's the recommended way to make the connection start back at square one?

@austindrenski

This comment has been minimized.

Copy link
Member

austindrenski commented Dec 17, 2018

@nickbabcock Thanks for linking to the code that triggered the issue for you, but would you be able to put together a minimal reproduction of this issue?

The linked project is interesting, but it isn't immediately clear how the TimescaleWriter is called, what it's service lifetime is (how resources are disposed), etc.

A more narrow reproduction (e.g. console app with a loop) would help us track this one down.

@roji roji added the bug label Dec 17, 2018

@roji roji added this to the 4.1 milestone Dec 17, 2018

@nickbabcock

This comment has been minimized.

Copy link

nickbabcock commented Dec 19, 2018

Let me see if I can narrow this down, I wasn't sure if there was something glaringly obvious. It seems to be more prevalent on higher spec'ed desktops. If I haven't responded in a week, feel free to bump the thread

@nickbabcock

This comment has been minimized.

Copy link

nickbabcock commented Dec 22, 2018

I'm back with a narrowed down use case (connection string is sent as first parameter)

class Program
{
    private static NpgsqlConnection _conn = new NpgsqlConnection();
    private static string _connStr;
    private static bool _failure = true;
    static void Main(string[] args)
    {
        _connStr = args[0];
        var timer = new System.Timers.Timer(1000) { AutoReset = true };
        timer.Elapsed += Timer_Elapsed;
        timer.Start();
        Console.ReadLine();
    }

    private static async void Timer_Elapsed(object sender, ElapsedEventArgs e)
    {
        try
        {
            if (_failure)
            {
                _conn.Close();
                _conn = new NpgsqlConnection(_connStr);
                Console.WriteLine("New connection");
                await _conn.OpenAsync();
                _conn.UnprepareAll();
            }

            using (var cmd = new NpgsqlCommand("INSERT INTO abc(textcol) VALUES(@texter)", _conn))
            {
                cmd.Parameters.Add("texter", NpgsqlTypes.NpgsqlDbType.Text);
                await cmd.PrepareAsync();
                cmd.Parameters["texter"].Value = "abc123";
                await cmd.ExecuteNonQueryAsync();
            }

            _failure = false;
            Console.WriteLine("Sent successfully");
        }
        catch (Exception ex)
        {
            _failure = true;
            Console.WriteLine(ex.ToString());
        }
    }
}

While sleeping / waking doesn't trigger the error 100% of the time, I do, more often than not, get the following exception loop:

System.InvalidOperationException: Reset() called on connector with state Executing
   at Npgsql.NpgsqlConnector.Reset() in C:\projects\npgsql\src\Npgsql\NpgsqlConnector.cs:line 1532
   at Npgsql.ConnectorPool.Release(NpgsqlConnector connector) in C:\projects\npgsql\src\Npgsql\ConnectorPool.cs:line 454
   at Npgsql.NpgsqlConnection.Close(Boolean wasBroken)

This is not the exception I was hoping to get (the one I originally reported), but it's a start

I wonder if this is because there is some race condition that is introduced thanks to the timer between the connection functions.

@roji

This comment has been minimized.

Copy link
Member

roji commented Dec 22, 2018

At least in theory, since you're using a class member it's possible for the timer method to be running twice at the same time. This can happen if for some reason a query takes more than a second (e.g. the database is overloaded) or similar.

Can you please try removing _conn and using a local variable inside the method instead? This should provide much better isolation between the timer runs.

@nickbabcock

This comment has been minimized.

Copy link

nickbabcock commented Dec 22, 2018

Can you please try removing _conn and using a local variable inside the method instead? This should provide much better isolation between the timer runs.

Very interesting, making this change in OhmGraphite will cause the "Npgsql.NpgsqlException: Received backend message ReadyForQuery while expecting CommandCompleteMessage" error almost immediately without needing to put the desktop to sleep.

Btw, this is against a regular postgres 10.5 instance in case there was any ambiguity.

But for a minimized test case I could only get it to trigger if the prepared statement was large.

Here's what I did using the following schema:

CREATE TABLE IF NOT EXISTS ohm_stats (
   time TIMESTAMPTZ NOT NULL,
   host TEXT,
   hardware TEXT,
   hardware_type TEXT,
   identifier TEXT,
   sensor TEXT,
   sensor_type TEXT,
   sensor_index INT,
   value REAL
);

And code:

class Program
{
    private static string _connStr;
    static void Main(string[] args)
    {
        _connStr = args[0];
        var timer = new Timer(1000) { AutoReset = true };
        timer.Elapsed += Timer_Elapsed;
        timer.Start();
        Console.ReadLine();
    }

    private static async void Timer_Elapsed(object sender, ElapsedEventArgs e)
    {
        try
        {
            using (var conn = new NpgsqlConnection(_connStr))
            {
                var watch = Stopwatch.StartNew();
                await conn.OpenAsync();
                conn.UnprepareAll();
                var dt = DateTime.Now;
                var columns = string.Join(", ", Enumerable.Range(0, 1000).Select((i) => $"(@time{i}, @host{i}, @hardware{i}, @hardware_type{i}, @identifier{i}, @sensor{i}, @sensor_type{i}, @sensor_index{i}, @value{i})"));
                var command = "INSERT INTO ohm_stats(time, host, hardware, hardware_type, identifier, sensor, sensor_type, sensor_index, value) VALUES " + columns;
                using (var cmd = new NpgsqlCommand(command, conn))
                {
                    for (int i = 0; i < 1000; i++)
                    {
                        cmd.Parameters.Add($"time{i}", NpgsqlDbType.TimestampTz);
                        cmd.Parameters.Add($"host{i}", NpgsqlDbType.Text);
                        cmd.Parameters.Add($"hardware{i}", NpgsqlDbType.Text);
                        cmd.Parameters.Add($"hardware_type{i}", NpgsqlDbType.Text);
                        cmd.Parameters.Add($"identifier{i}", NpgsqlDbType.Text);
                        cmd.Parameters.Add($"sensor{i}", NpgsqlDbType.Text);
                        cmd.Parameters.Add($"sensor_type{i}", NpgsqlDbType.Text);
                        cmd.Parameters.Add($"value{i}", NpgsqlDbType.Real);
                        cmd.Parameters.Add($"sensor_index{i}", NpgsqlDbType.Integer);
                    }
                    await cmd.PrepareAsync();

                    for (int i = 0; i < 1000; i++)
                    {
                        cmd.Parameters[$"time{i}"].Value = dt;
                        cmd.Parameters[$"host{i}"].Value = new string('a', 200);
                        cmd.Parameters[$"hardware{i}"].Value = new string('a', 200);
                        cmd.Parameters[$"hardware_type{i}"].Value = new string('a', 200);
                        cmd.Parameters[$"identifier{i}"].Value = new string('a', 200);
                        cmd.Parameters[$"sensor{i}"].Value = "a";
                        cmd.Parameters[$"sensor_type{i}"].Value = "a";
                        cmd.Parameters[$"value{i}"].Value = 1.0;
                        cmd.Parameters[$"sensor_index{i}"].Value = 1;
                    }
                    await cmd.ExecuteNonQueryAsync();
                }

                Console.WriteLine($"Sent successfully {watch.ElapsedMilliseconds}ms");
            }
        }
        catch (Exception ex)
        {
            Console.WriteLine(ex.ToString());
        }
    }
}

will cause exceptions like (and to clarify this doesn't need to the desktop to go to sleep for this to occur):

Npgsql.NpgsqlException: Received backend message NoData while expecting CommandCompleteMessage. Please file a bug.

These errors occur even if conn.UnprepareAll(); is removed.

@austindrenski

This comment has been minimized.

Copy link
Member

austindrenski commented Dec 22, 2018

Does this depend on await cmd.PrepareAsync()?

@nickbabcock

This comment has been minimized.

Copy link

nickbabcock commented Dec 22, 2018

Does this depend on await cmd.PrepareAsync()?

Good spot. I can't make the error trigger without await cmd.PrepareAsync().

@austindrenski

This comment has been minimized.

Copy link
Member

austindrenski commented Dec 22, 2018

Interesting. Can you un-async the repo and still trigger the error?

@nickbabcock

This comment has been minimized.

Copy link

nickbabcock commented Dec 22, 2018

Yes, un-asyncing both the repo and project yielded flawless execution (no errors) -- very interesting 🤔

@roji

This comment has been minimized.

Copy link
Member

roji commented Dec 22, 2018

It seems likely that this is an actual bug inside Npgsql, @austindrenski (or @YohDeadfall) are you interested in taking a look?

@austindrenski

This comment has been minimized.

Copy link
Member

austindrenski commented Dec 22, 2018

@nickbabcock To clarify, this is reproducible in a synchronous version of the above sample, so long as cmd.Prepare() is called?

@roji I can take a closer look, though I'm unfamiliar with the particulars of our prepare/auto-prepare logic, so happy to step aside if someone else sees something obvious at work.

@nickbabcock

This comment has been minimized.

Copy link

nickbabcock commented Dec 22, 2018

@austindrenski I forgot to mention, I had added cmd.Prepare() in the flawless synchronous version, so cmd.Prepare() is not the culprit. await cmd.PrepareAsync() seems more likely (as including the line will cause the error, and commenting the line out will make the execution flawless)

@austindrenski

This comment has been minimized.

Copy link
Member

austindrenski commented Dec 23, 2018

Working on further narrowing the repro here: https://github.com/austindrenski/npgsql-2266.

As you mentioned above, the error message varies with each run, but consistently reproduces when there are at least 10,000 8,000 parameters.

@YohDeadfall

This comment has been minimized.

Copy link
Member

YohDeadfall commented Dec 23, 2018

@roji I'll be able to take a look on New Year holidays. Maybe will have time a bit early next week.

@austindrenski Ping me if you have some questions about the driver.

@roji

This comment has been minimized.

Copy link
Member

roji commented Dec 23, 2018

@austindrenski I forgot to mention, I had added cmd.Prepare() in the flawless synchronous version, so cmd.Prepare() is not the culprit. await cmd.PrepareAsync() seems more likely (as including the line will cause the error, and commenting the line out will make the execution flawless)

Prepare() and PrepareAsync() should be using the same code flow, so there isn't much chance something is actually wrong with PrepareAsync(). I'm guessing that it's more a timing issue which asynchronous I/O happens to trigger whereas synchronous I/O simply doesn't.

@austindrenski @YohDeadfall have fun :) Just don't waste time working on the same thing at the same time...

@nickbabcock

This comment has been minimized.

Copy link

nickbabcock commented Jan 10, 2019

Interestingly, I got into another but different failure loop with the same code, so I'll report it here, just in case it helps 😓

The error that repeatedly was thrown (even UnprepareAll on new connections didn't save me)

Unable to send metrics!System.ArgumentOutOfRangeException: Index was out of range. Must be non-negative and less than the size of the collection.
Parameter name: chunkLength
   at System.Text.StringBuilder.ToString()
   at Npgsql.SqlQueryParser.ParseRawQuery(String sql, Boolean standardConformantStrings, NpgsqlParameterCollection parameters, List`1 statements, Boolean deriveParameters)
   at Npgsql.NpgsqlCommand.ProcessRawQuery(Boolean deriveParameters)
   at Npgsql.NpgsqlCommand.Prepare(Boolean async)
   at Npgsql.NpgsqlCommand.Prepare()

The last error that occurred before the loop:

Npgsql.PostgresException (0x80004005): 42601: syntax error at or near "5"
   at Npgsql.NpgsqlConnector.<>c__DisplayClass161_0.<<ReadMessage>g__ReadMessageLong|0>d.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
   at Npgsql.NpgsqlConnector.<>c__DisplayClass161_0.<<ReadMessage>g__ReadMessageLong|0>d.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
   at Npgsql.NpgsqlCommand.<>c__DisplayClass74_0.<<Prepare>g__PrepareLong|0>d.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
   at Npgsql.NpgsqlCommand.Prepare()

Looks like a invalid statement was sent to postgres, which is extremely bizarre. Maybe the computer slept / woke in the middle of a transaction, I'm not sure. Anyways no action needs to come of it, I just wanted to report it as I stumbled across it. Maybe I'll investigate ClearAllPools and see what effect that has.

@roji

This comment has been minimized.

Copy link
Member

roji commented Jan 11, 2019

@nickbabcock thanks for the error, can you please open this in a new separate issue?

The exception seems to be coming from this line in StringBuilder, which is quite interesting... We'll need to investigate to understand how this can be triggered and what it means.

@nickbabcock

This comment has been minimized.

Copy link

nickbabcock commented Jan 12, 2019

When I can consistently reproduce it, I'll open another issue. It triggers extremely rarely. I posted it here in case it connected some dots to a solution 😄

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment