"Collection was modified; enumeration operation may not execute" on connection Dispose #1306

Closed
kae opened this Issue Sep 30, 2016 · 19 comments

Comments

Projects
None yet
4 participants
@kae

kae commented Sep 30, 2016

"Collection was modified; enumeration operation may not execute."

at System.ThrowHelper.ThrowInvalidOperationException(ExceptionResource resource)
at System.Collections.Generic.Dictionary`2.KeyCollection.Enumerator.MoveNext()
at System.Data.Common.DbConnectionStringBuilder.get_ConnectionString()
at Npgsql.NpgsqlConnectionStringBuilder.Equals(Object obj)
at System.Collections.Generic.ObjectEqualityComparer`1.Equals(T x, T y)
at System.Collections.Concurrent.ConcurrentDictionary`2.TryGetValueInternal(TKey key, Int32 hashcode, TValue& value)
at System.Collections.Concurrent.ConcurrentDictionary`2.GetOrAdd(TKey key, Func`2 valueFactory)
at Npgsql.NpgsqlConnection.ReallyClose(Boolean wasBroken)
at Npgsql.NpgsqlConnection.Dispose(Boolean disposing)
at System.Data.Common.DbConnection.Dispose()
at Service.Commands.MessageHandlerCommand.d__2.MoveNext()

We have this issue in production under some unclear circumstances (2-3 times per day).

Sadly I still can't reproduce this behavior with a gist of code, but it really looks like some corner case of modification of reused connectionstringbuilder with working connection pool.

On a side note, while seeking for possible source of problem I found

if (waitingOpenAttempt.IsAsync)
{
    // If the waiting open attempt is asynchronous (i.e. OpenAsync()), we can't simply
    // call SetResult on its TaskCompletionSource, since it would execute the open's
    // continuation in our thread (the closing thread). Instead we schedule the completion
    // to run in the TP
    Task.Run(() => tcs.SetResult(connector));
}

It looks really suspicious. Were you considered usage of TaskCreationOptions.RunContinuationsAsynchronously as argument for constructing TaskCompletionSource instead of manually starting new thread for tcs.SetResult?

< update >
Found that "Note that the RunContinuationsAsynchronously member is available in the TaskCreationOptions enumeration starting with the .NET Framework 4.6"
Dropping 4.5.x support considered too harmful?
< /update >

Also

lock (this)
{
    if (!tcs.Task.IsCompleted)
    {
        tcs.SetCanceled();
        throw;
    }
}

maybe can be replaced with

tcs.TrySetCanceled();
throw;
@cimpress-clusher

This comment has been minimized.

Show comment
Hide comment
@cimpress-clusher

cimpress-clusher Sep 30, 2016

My issue may be exactly what you've experienced, or a different issue, but I just solved it yesterday. In upgrading from 3.0.8 to 3.1.8 I was constantly running into "Collection was modified; enumeration operation may not execute.". Our application services hundreds of requests per second so it made sense that something is likely not thread safe. I believe I've tracked this issue down.

When upgrading we picked up the default PersistSecurityInfo=false setting, which I believe is the source of the problem. Attached is a picture of 7 threads all accessing ConnectionString in parallel, some of which are pulling passwd correctly, some of which are getting null, this looks to be directly related to the code on lines 294 and 296 which are flipping the value of Settings.Password.

upload

Setting PersistSecurityInfo=true (to bypass this code entirely) my application proceeds succesfully without error.

cimpress-clusher commented Sep 30, 2016

My issue may be exactly what you've experienced, or a different issue, but I just solved it yesterday. In upgrading from 3.0.8 to 3.1.8 I was constantly running into "Collection was modified; enumeration operation may not execute.". Our application services hundreds of requests per second so it made sense that something is likely not thread safe. I believe I've tracked this issue down.

When upgrading we picked up the default PersistSecurityInfo=false setting, which I believe is the source of the problem. Attached is a picture of 7 threads all accessing ConnectionString in parallel, some of which are pulling passwd correctly, some of which are getting null, this looks to be directly related to the code on lines 294 and 296 which are flipping the value of Settings.Password.

upload

Setting PersistSecurityInfo=true (to bypass this code entirely) my application proceeds succesfully without error.

@kae

This comment has been minimized.

Show comment
Hide comment
@kae

kae Sep 30, 2016

We doesn't use PersistSecurityInfo, but maybe using it will rise probability of issues

kae commented Sep 30, 2016

We doesn't use PersistSecurityInfo, but maybe using it will rise probability of issues

@cimpress-clusher

This comment has been minimized.

Show comment
Hide comment
@cimpress-clusher

cimpress-clusher Sep 30, 2016

PersistSecurityInfo has a default value of false even if unspecified in the connection string (which is what happened to us). If you have any amount of concurrency it might be part of the problem.

PersistSecurityInfo has a default value of false even if unspecified in the connection string (which is what happened to us). If you have any amount of concurrency it might be part of the problem.

@kae

This comment has been minimized.

Show comment
Hide comment
@kae

kae Sep 30, 2016

Should 'Setting PersistSecurityInfo=false (to bypass this code entirely)' in your response be readed as 'Setting PersistSecurityInfo=true (to bypass this code entirely)' then?

kae commented Sep 30, 2016

Should 'Setting PersistSecurityInfo=false (to bypass this code entirely)' in your response be readed as 'Setting PersistSecurityInfo=true (to bypass this code entirely)' then?

@cimpress-clusher

This comment has been minimized.

Show comment
Hide comment
@cimpress-clusher

cimpress-clusher Sep 30, 2016

yes, sorry. I've edited my initial post to fix the typo. PersistSecurityInfo=true will bypass the code in question.

yes, sorry. I've edited my initial post to fix the typo. PersistSecurityInfo=true will bypass the code in question.

@kae

This comment has been minimized.

Show comment
Hide comment
@kae

kae Sep 30, 2016

repro

using System;
using System.Linq;
using System.Threading.Tasks;
using Dapper;

public class Program {
    private static readonly string connStr =
        "Server=pg;Port=5432;Database=temp;User id=admin;Password=admin;Pooling=true";

    public static async Task<int> Test() {
        try {
            using (var conn = new Npgsql.NpgsqlConnection(connStr)) {
                conn.Open();
                var rs = await conn.QueryAsync<int>("select 1");
                return rs.FirstOrDefault();
            }
        } catch (Exception e) {
            System.Console.WriteLine(e);
            return 42;
        }
    }

    public static void Main() {
/*      Npgsql.Logging.NpgsqlLogManager.Provider = new Npgsql.Logging.ConsoleLoggingProvider(
            minLevel: Npgsql.Logging.NpgsqlLogLevel.Trace
        );*/
        var res = Task.WhenAll(Enumerable.Range(0, 100).Select(_ => Test())).Result;
    }
}

I had quite stable exceptions running this code.
It took so long because i forgot to set Pooling=true

And with PersistSecurityInfo=true exception disappears

kae commented Sep 30, 2016

repro

using System;
using System.Linq;
using System.Threading.Tasks;
using Dapper;

public class Program {
    private static readonly string connStr =
        "Server=pg;Port=5432;Database=temp;User id=admin;Password=admin;Pooling=true";

    public static async Task<int> Test() {
        try {
            using (var conn = new Npgsql.NpgsqlConnection(connStr)) {
                conn.Open();
                var rs = await conn.QueryAsync<int>("select 1");
                return rs.FirstOrDefault();
            }
        } catch (Exception e) {
            System.Console.WriteLine(e);
            return 42;
        }
    }

    public static void Main() {
/*      Npgsql.Logging.NpgsqlLogManager.Provider = new Npgsql.Logging.ConsoleLoggingProvider(
            minLevel: Npgsql.Logging.NpgsqlLogLevel.Trace
        );*/
        var res = Task.WhenAll(Enumerable.Range(0, 100).Select(_ => Test())).Result;
    }
}

I had quite stable exceptions running this code.
It took so long because i forgot to set Pooling=true

And with PersistSecurityInfo=true exception disappears

@roji

This comment has been minimized.

Show comment
Hide comment
@roji

roji Sep 30, 2016

Member

I'll take a look at this whrn I'm back, am away for the next 10 days or so.

Member

roji commented Sep 30, 2016

I'll take a look at this whrn I'm back, am away for the next 10 days or so.

@rwasef1830

This comment has been minimized.

Show comment
Hide comment
@rwasef1830

rwasef1830 Sep 30, 2016

Contributor

This bug appeared due to using a global single NpgsqlConnectionStringBuilder instance for each connection string across all connections. The solution is to not flip the value of Settings.Password to return the passwordless connection string, or to use a lock, but then a lock needs to used every single place Settings.Password is accessed. I'll try to make a pull request with a better solution (precompute the passwordless connection string)

Contributor

rwasef1830 commented Sep 30, 2016

This bug appeared due to using a global single NpgsqlConnectionStringBuilder instance for each connection string across all connections. The solution is to not flip the value of Settings.Password to return the passwordless connection string, or to use a lock, but then a lock needs to used every single place Settings.Password is accessed. I'll try to make a pull request with a better solution (precompute the passwordless connection string)

rwasef1830 added a commit to rwasef1830/npgsql that referenced this issue Oct 2, 2016

@rwasef1830

This comment has been minimized.

Show comment
Hide comment
@rwasef1830

rwasef1830 Oct 2, 2016

Contributor

@cimpress-clusher @kae
Could you try with a build that includes the pull request I created (#1308) ? I was not able to get reliable unit test to include with the pull request, so I would appreciate feedback on it.

(Indeed reliably unit testing thread-safety issues is quite hard).

Contributor

rwasef1830 commented Oct 2, 2016

@cimpress-clusher @kae
Could you try with a build that includes the pull request I created (#1308) ? I was not able to get reliable unit test to include with the pull request, so I would appreciate feedback on it.

(Indeed reliably unit testing thread-safety issues is quite hard).

@kae

This comment has been minimized.

Show comment
Hide comment
@kae

kae Oct 2, 2016

@rwasef1830 did you try it with a repro I included?

Added comment to commit

kae commented Oct 2, 2016

@rwasef1830 did you try it with a repro I included?

Added comment to commit

@rwasef1830

This comment has been minimized.

Show comment
Hide comment
@rwasef1830

rwasef1830 Oct 2, 2016

Contributor

@kae I tried on your repro (converted it to a unit test) but the test was always passing for me. (Probably need a larger number of threads and possible remote Postgresql instance or other timing differences to work).

Does my patch solve it on your machine with your repro ?

Contributor

rwasef1830 commented Oct 2, 2016

@kae I tried on your repro (converted it to a unit test) but the test was always passing for me. (Probably need a larger number of threads and possible remote Postgresql instance or other timing differences to work).

Does my patch solve it on your machine with your repro ?

@kae

This comment has been minimized.

Show comment
Hide comment
@kae

kae Oct 2, 2016

Didn't try.
Too much hassle with manual builds :)

kae commented Oct 2, 2016

Didn't try.
Too much hassle with manual builds :)

@rwasef1830

This comment has been minimized.

Show comment
Hide comment
@rwasef1830

rwasef1830 Oct 2, 2016

Contributor

@kae Could you try this one then ? Manual build of Npgsql commit 51db9f0.

Npgsql1306.zip

Contributor

rwasef1830 commented Oct 2, 2016

@kae Could you try this one then ? Manual build of Npgsql commit 51db9f0.

Npgsql1306.zip

@kae

This comment has been minimized.

Show comment
Hide comment
@kae

kae Oct 3, 2016

For manual build I had no reproduction with any value of PersistSecurityInfo

For 3.1.8 with missing PersistSecurityInfo or with PersistSecurityInfo=false I has stable reproduction, but no errors with PersistSecurityInfo=true

kae commented Oct 3, 2016

For manual build I had no reproduction with any value of PersistSecurityInfo

For 3.1.8 with missing PersistSecurityInfo or with PersistSecurityInfo=false I has stable reproduction, but no errors with PersistSecurityInfo=true

@rwasef1830

This comment has been minimized.

Show comment
Hide comment
@rwasef1830

rwasef1830 Oct 3, 2016

Contributor

@kae thanks for your feedback, hopefully @roji will review and merge #1308 this soon.
@cimpress-clusher Thanks for your analysis and identification of the thread-safety issue!

Contributor

rwasef1830 commented Oct 3, 2016

@kae thanks for your feedback, hopefully @roji will review and merge #1308 this soon.
@cimpress-clusher Thanks for your analysis and identification of the thread-safety issue!

@cimpress-clusher

This comment has been minimized.

Show comment
Hide comment
@cimpress-clusher

cimpress-clusher Oct 3, 2016

@rwasef1830 I tried your update with PersistSecurityInfo=false and encountered 0 failures, and similar performance as I was seeing with PersistSecurityInfo=true, using 3.1.8.

@rwasef1830 I tried your update with PersistSecurityInfo=false and encountered 0 failures, and similar performance as I was seeing with PersistSecurityInfo=true, using 3.1.8.

@roji

This comment has been minimized.

Show comment
Hide comment
@roji

roji Oct 17, 2016

Member

Sorry for not looking at this earlier, I was away and then busy... Will review this ASAP.

Meanwhile, @kae a quick response to your comment:

On a side note, while seeking for possible source of problem I found

if (waitingOpenAttempt.IsAsync)
{
    // If the waiting open attempt is asynchronous (i.e. OpenAsync()), we can't simply
    // call SetResult on its TaskCompletionSource, since it would execute the open's
    // continuation in our thread (the closing thread). Instead we schedule the completion
    // to run in the TP
    Task.Run(() => tcs.SetResult(connector));
}

It looks really suspicious. Were you considered usage of TaskCreationOptions.RunContinuationsAsynchronously as argument for constructing TaskCompletionSource instead of manually starting new thread for tcs.SetResult?

As you note below, RunContinuationsAsynchronously was only introduced in .NET 4.6. I wouldn't want to drop support for .NET 4.5 unless there's a good reason to. Why do you say the current code looks suspicious? There's no manual starting of a new thread, just handing off the SetResult to be executed in the threadpool rather than by the thread currently closing the connection. This pattern is mentioned as just what RunContinuationsAsynchronously is meant to replace.

Again, RunContinuationsAsynchronously would be a better solution but unless the current code triggers any sort of issue I don't see any reason to drop .NET 4.5 support.

Regarding your other suggestion to use tcs.TrySetCanceled() - that looks good, if you want to submit a PR against dev I'll merge it (otherwise I can do it myself).

As for the actual issue I'll look into it ASAP.

Member

roji commented Oct 17, 2016

Sorry for not looking at this earlier, I was away and then busy... Will review this ASAP.

Meanwhile, @kae a quick response to your comment:

On a side note, while seeking for possible source of problem I found

if (waitingOpenAttempt.IsAsync)
{
    // If the waiting open attempt is asynchronous (i.e. OpenAsync()), we can't simply
    // call SetResult on its TaskCompletionSource, since it would execute the open's
    // continuation in our thread (the closing thread). Instead we schedule the completion
    // to run in the TP
    Task.Run(() => tcs.SetResult(connector));
}

It looks really suspicious. Were you considered usage of TaskCreationOptions.RunContinuationsAsynchronously as argument for constructing TaskCompletionSource instead of manually starting new thread for tcs.SetResult?

As you note below, RunContinuationsAsynchronously was only introduced in .NET 4.6. I wouldn't want to drop support for .NET 4.5 unless there's a good reason to. Why do you say the current code looks suspicious? There's no manual starting of a new thread, just handing off the SetResult to be executed in the threadpool rather than by the thread currently closing the connection. This pattern is mentioned as just what RunContinuationsAsynchronously is meant to replace.

Again, RunContinuationsAsynchronously would be a better solution but unless the current code triggers any sort of issue I don't see any reason to drop .NET 4.5 support.

Regarding your other suggestion to use tcs.TrySetCanceled() - that looks good, if you want to submit a PR against dev I'll merge it (otherwise I can do it myself).

As for the actual issue I'll look into it ASAP.

@kae

This comment has been minimized.

Show comment
Hide comment
@kae

kae Oct 18, 2016

@roji I think you right.
It's just mine overly aggressive feelings to manual Task.Run (caused by some patterns I fight at work like await Task.Run(() => { SomeCodeHere(); }) or even Task.Run(() => { SomeCodeHere(); }).Wait())

kae commented Oct 18, 2016

@roji I think you right.
It's just mine overly aggressive feelings to manual Task.Run (caused by some patterns I fight at work like await Task.Run(() => { SomeCodeHere(); }) or even Task.Run(() => { SomeCodeHere(); }).Wait())

@roji

This comment has been minimized.

Show comment
Hide comment
@roji

roji Oct 18, 2016

Member

@kae I get where you're coming from, Task.Run is frequently abused (your second example is especially bad).

Member

roji commented Oct 18, 2016

@kae I get where you're coming from, Task.Run is frequently abused (your second example is especially bad).

@roji roji closed this in #1308 Oct 18, 2016

roji added a commit that referenced this issue Oct 18, 2016

roji added a commit that referenced this issue Oct 18, 2016

@roji roji changed the title from Npgsql 3.1.8: "Collection was modified; enumeration operation may not execute" on connection Dispose to "Collection was modified; enumeration operation may not execute" on connection Dispose Oct 18, 2016

@roji roji added the 🐞 bug label Oct 18, 2016

@roji roji added this to the 3.1.9 milestone Oct 18, 2016

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment