New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

isValid() implementation running into stmt.cancel() thread-safety issue #412

Closed
brettwooldridge opened this Issue Oct 31, 2015 · 47 comments

Comments

Projects
None yet
7 participants
@brettwooldridge
Contributor

brettwooldridge commented Oct 31, 2015

@vlsi @davecramer We have a user (@Viniciusmsss) of HikariCP and pgjdbc that is running into what appears to be an old issue.

HikariCP will, if supported by the driver, use the Connection.isValid() method. The isValid() implementation in pgjdbc uses a Statement and the setQueryTimeout() mechanism. Which in turn uses a TimerTask (AbstractJdbc2Statement.startTimer()) whose run() method invokes Statement.cancel().

From the nabble discussion and the discussion in #368 it seems that running cancel() on a separate thread, which sends the cancel command on a separate connection, can lead to a race. I am surmising that this may be the source of the issue observed by the user.

Thoughts?

@vlsi

This comment has been minimized.

Member

vlsi commented Oct 31, 2015

Do you mean something like the following?

  1. isValid setups timer
  2. test query is sent
  3. test OK
  4. user query is sent
  5. "timer" wakes up and issues cancel
  6. user-provided query is killed
@vlsi

This comment has been minimized.

Member

vlsi commented Oct 31, 2015

Ah, here Tom explains that PostgreSQL's cancel is not synchronous: http://postgresql.nabble.com/ERROR-canceling-query-due-to-user-request-tp2077761p2077764.html.

@brettwooldridge, that means a workaround is to use 0 for validationTimeout so query.cancel is not used.

@vlsi

This comment has been minimized.

Member

vlsi commented Oct 31, 2015

I was able to replicate the issue by changing pgjdbc source so it treats timeout value as milliseconds, not seconds:

    public void testValidation() throws SQLException
    {
        long deadLine = System.currentTimeMillis() + 10000;
        Statement stmt = con.createStatement();
        while(System.currentTimeMillis() < deadLine) {
            if (con.isValid(1)) { // by default it would be 1 second, but with 1ms timeout the issue is easier to reproduce
                ResultSet rs = stmt.executeQuery("select 1");
                rs.close();
            } else {
                System.out.println("con.isValid == false, assumig timeout happend during validate");
            }
        }
    }

@brettwooldridge , do you have some concurrency tests in the HikariCP suite? /just wondering if it can be reused/

vlsi added a commit to Gordiychuk/pgjdbc that referenced this issue Oct 31, 2015

fix: statement.cancel should be thread-safe
Statement.cancel was not thread-safe, thus query timeout from one statement might cancel subsequent statement.
The change introduces state tracking of a statement (ide, in-query, cancelling, cancelled), so out-of-order cancels can be detected and ignored

fixes pgjdbc#412
@brettwooldridge

This comment has been minimized.

Contributor

brettwooldridge commented Oct 31, 2015

@vlsi Very quick fix! We have a few concurrency tests, but I don't know of any that would trip over this issue.

TestConnections a few multi-threaded tests.

TestConnectionTimeoutRetry a few multi-threaded tests.

ConnectionRaceConditionTest one test.

Probably a few more...

vlsi added a commit to Gordiychuk/pgjdbc that referenced this issue Oct 31, 2015

fix: statement.cancel and statement.setQueryTimeout should be thread-…
…safe

Statement.cancel was not thread-safe, thus query timeout from one statement might cancel subsequent statement.
The change introduces state tracking of a statement (idle, in-query, cancelling, cancelled), so out-of-order cancels can be detected and ignored

fixes pgjdbc#412

vlsi added a commit to Gordiychuk/pgjdbc that referenced this issue Nov 1, 2015

fix: statement.cancel and statement.setQueryTimeout should be thread-…
…safe

Statement.cancel was not thread-safe, thus query timeout from one statement might cancel subsequent statement.
The change introduces state tracking of a statement (idle, in-query, cancelling, cancelled), so out-of-order cancels can be detected and ignored

fixes pgjdbc#412

vlsi added a commit to Gordiychuk/pgjdbc that referenced this issue Nov 2, 2015

fix: statement.cancel and statement.setQueryTimeout should be thread-…
…safe

Statement.cancel was not thread-safe, thus query timeout from one statement might cancel subsequent statement.
The change introduces state tracking of a statement (idle, in-query, cancelling, cancelled), so out-of-order cancels can be detected and ignored

fixes pgjdbc#412

@davecramer davecramer closed this in bc3d848 Nov 3, 2015

@vlsi

This comment has been minimized.

Member

vlsi commented Nov 3, 2015

@Viniciusmsss, could you please retest with pgjdbc 1205?

@Viniciusmsss

This comment has been minimized.

Viniciusmsss commented Nov 4, 2015

Hi @vlsi , we were never able to reproduce the problem, so we can't test it. The problem only occurs in production, at the machines of our software users. We will launch a new version till the end of this month and it will have the updated library of pgjdbc (1205). After that I'll come back here to post the result. Thank you very much for such a fast solution.

@vlsi

This comment has been minimized.

Member

vlsi commented Nov 4, 2015

we were never able to reproduce the problem, so we can't test it.

I'm not pushing you. That is just a reminder that 1205 is out.

@Viniciusmsss

This comment has been minimized.

Viniciusmsss commented Nov 4, 2015

Ok thanks. Before 1205 had been out, I was thinking about compiling the merged version and use it, but it is so much better to use a prebuilt version. Thank you very much.

@Viniciusmsss

This comment has been minimized.

Viniciusmsss commented Dec 23, 2015

Hello, sorry for delay. I would like to inform that pgjdbc 1205 fixed the problem for us. We are no longer having the exception "ERROR: canceling statement due to user request".
Thank you very much!

@vlsi

This comment has been minimized.

Member

vlsi commented Dec 23, 2015

Note: it is better to upgrade to 1206 or 1207.
There's a getBigDecimal issue in 1205: #424

@Viniciusmsss

This comment has been minimized.

Viniciusmsss commented Dec 23, 2015

Oh! thanks!

@Viniciusmsss

This comment has been minimized.

Viniciusmsss commented Dec 23, 2015

Do you know if the getBigDecimal issue in 1205 also exists in prepared statements? I am asking because we didn't have any problem with 1205. We do use getBigDecimal, but just with prepared statements.

@vlsi

This comment has been minimized.

Member

vlsi commented Dec 23, 2015

It does happen with prepared statements. However, that depends on the datatype that database choses to transfer the data.
The issue happens only in case you use getBigDecimal on a int2, int4, or int8 column.
I believe the issue does not appear for numeric datatype.

I am asking because we didn't have any problem with 1205

Once again: it is not "exception at ..." kind of issue. It is "wrong results from getBigDecimal", so it can be unnoticed.

@Viniciusmsss

This comment has been minimized.

Viniciusmsss commented Dec 23, 2015

Ok, thanks. We use getBigDecimal just with columns of type Decimal (from Postgres). That is why we are not having the issue. The results of our software are correct. Thanks again!

@rdp

This comment has been minimized.

rdp commented May 16, 2016

Interestingly, we use isValid and run into this error message "frequently" (a few times a day in production). Unfortunately after bumping the version to 1208 instead of getting it "7" times a day now we only get it "2" times a day, can anyone else confirm or deny if they still get this problem? anyway fix for now is to avoid isValid and setQueryTimeout I will let you know if I can come up with a more easily reproducible test case. ref: http://stackoverflow.com/questions/13114101/postgresql-error-canceling-statement-due-to-user-request

@davecramer

This comment has been minimized.

Member

davecramer commented May 16, 2016

@rdp what are you using for a timeout ?

@vlsi

This comment has been minimized.

Member

vlsi commented May 17, 2016

@rdp , can you please clarify which exact error message are you running into?

@Viniciusmsss

This comment has been minimized.

Viniciusmsss commented May 17, 2016

Hello, I thought that we were no long getting the exception after we have updated pgjdbc because we've stayed some days without getting the exception, whereas we were getting it several times per day before.

But I looked at the logs now and I found out that we are having around 6 exceptions/month. Before the updated pgjdbc version we were having around 500 exceptions/month.

Note: We've upgraded from pgjdbc 1102 to pgjdbc 1205 on DEC-2015, then we've upgraded to pgjdbc 1207 on MAR-2016. We are still with pgjdbc 1207.

The message that we are running into is:

ERROR: canceling statement due to user request
org.postgresql.core.v3.QueryExecutorImpl.execute(QueryExecutorImpl.java:173)  <-
org.postgresql.jdbc2.AbstractJdbc2Statement.execute(AbstractJdbc2Statement.java:645)  <-
org.postgresql.jdbc2.AbstractJdbc2Statement.executeWithFlags(AbstractJdbc2Statement.java:495)  <-
org.postgresql.jdbc2.AbstractJdbc2Statement.executeQuery(AbstractJdbc2Statement.java:380)  <-
com.zaxxer.hikari.proxy.PreparedStatementProxy.executeQuery(PreparedStatementProxy.java:52)  <-
com.zaxxer.hikari.proxy.HikariPreparedStatementProxy.executeQuery(HikariPreparedStatementProxy.java)  <-
@vlsi

This comment has been minimized.

Member

vlsi commented May 17, 2016

Okay, what if we add "timestamp vs timeout" related information to the exception?

Like ERROR: canceling statement due to user request, execution started at=...ms, timeout=...ms

Here's the test that checks if setQueryTimeout can affect another schema:

/**
* Test executes two queries one after another. The first one has timeout of 1ms, and the second
* one does not. The timeout of the first query should not impact the second one.
*/
public void testShortQueryTimeout() throws SQLException {
long deadLine = System.currentTimeMillis() + 10000;
Statement stmt = con.createStatement();
((PgStatement) stmt).setQueryTimeoutMs(1);
Statement stmt2 = con.createStatement();
while (System.currentTimeMillis() < deadLine) {
try {
stmt.execute("select 1");
} catch (SQLException e) {
// ignore "statement cancelled"
}
stmt2.executeQuery("select 1");
}
}

At this point, I'm inclined to think that somebody is using query timeout, so the exceptions happen for a good reason.

@vlsi

This comment has been minimized.

Member

vlsi commented May 17, 2016

However, PgStatement.timeout should be long, not int as the latter is subject to integer overflow:

@Viniciusmsss

This comment has been minimized.

Viniciusmsss commented May 17, 2016

I think it's a good point to add "timestamp vs timeout" related information to the exception. Also, you could add the exception timestamp. Like:
ERROR: canceling statement due to user request, execution started at=...ms, exception occurred at=..ms, timeout=...ms

@davecramer

This comment has been minimized.

Member

davecramer commented May 17, 2016

Yes to adding more information in the timeout exception
I can't imagine a timeout overflowing an integer ?

@vlsi

This comment has been minimized.

Member

vlsi commented May 17, 2016

There's * 1000, thus it can overflow if someone uses setQueryTimeout(Integer.MAX_VALUE):

setQueryTimeoutMs(seconds * 1000);

@davecramer

This comment has been minimized.

Member

davecramer commented May 17, 2016

What about limiting the input value ... 2147483 (Integer.MAX_VALUE/1000) is a lot of seconds ?

@vlsi

This comment has been minimized.

Member

vlsi commented May 17, 2016

That is fine. I'm just trying to imagine cases how can it break.

@davecramer

This comment has been minimized.

Member

davecramer commented May 17, 2016

I'm thinking on high workloads if the timeout is short enough the isValid
query could be delayed and the timeout kicks in.

Dave Cramer

On 17 May 2016 at 07:15, Vladimir Sitnikov notifications@github.com wrote:

That is fine. I'm just trying to imagine cases how it can break.


You are receiving this because you modified the open/close state.
Reply to this email directly or view it on GitHub
#412 (comment)

@rdp

This comment has been minimized.

rdp commented May 17, 2016

It appears we use, in my case, isValid(1)
The error message is canceling statement due to user request (there seem to be no explicit use of setQueryTimeout). Thanks!

@vlsi

This comment has been minimized.

Member

vlsi commented May 17, 2016

@rdp , do you have a stacktrace associated with canceling statement... error?

@rdp

This comment has been minimized.

rdp commented May 17, 2016

This is the trace: canceling statement due to user request org.postgresql.util.PSQLException: ERROR: canceling statement due to user request at org.postgresql.core.v3.QueryExecutorImpl.receiveErrorResponse(QueryExecutorImpl.java:2284) at org.postgresql.core.v3.QueryExecutorImpl.processResults(QueryExecutorImpl.java:2003) at org.postgresql.core.v3.QueryExecutorImpl.execute(QueryExecutorImpl.java:200) at org.postgresql.jdbc.PgStatement.execute(PgStatement.java:424) at org.postgresql.jdbc.PgPreparedStatement.executeWithFlags(PgPreparedStatement.java:161) at org.postgresql.jdbc.PgPreparedStatement.executeQuery(PgPreparedStatement.java:114) ...

@rdp

This comment has been minimized.

rdp commented May 17, 2016

I could run an experiment where we remote the isValid and see if the problem "goes away" if desired...

@davecramer

This comment has been minimized.

Member

davecramer commented May 17, 2016

Pretty sure the message you get is cancelling statement due to user request
when the timer fires. I'd try setting the timeout longer than 1S as I
mentioned it would be easy for queries to get backed up and take longer
than 1S

Dave Cramer

On 17 May 2016 at 12:41, Roger Pack notifications@github.com wrote:

I could run an experiment where we remote the isValid and see if the
problem "goes away" if desired...


You are receiving this because you modified the open/close state.
Reply to this email directly or view it on GitHub
#412 (comment)

@vlsi

This comment has been minimized.

Member

vlsi commented May 18, 2016

Technically speaking:

  1. connection.cancel() ultimately sends a unix SIGINT to the backend in question
  2. signal delivery is asynchronous

Thus, the following can happen (T1 is app thread, T2 is timer thread):

  1. T1: isValid(1) called
  2. T1: something caused the app to freeze (e.g. gc pause) for more than 1 sec
  3. T1: app unfreezes
  4. T2: timeout timer fires, and calls connection.cancel()
  5. backend sends SIGINT to the relevant pid, however the signal gets stuck somewhere
  6. T1: "isValid" response is already ready, so the call returns successfully
  7. T1: issues regular query
  8. OS delivers SIGINT causing the query to fail with canceling statement due to user request ...

^^^ the following looks plausible, however I cannot reproduce the issue as described above

That leaves the following solutions:
S1) Increase timeout for isValid checks -> it should reduce the number of con.cancel requests, thus reducing the number of failures
S2) At the driver side, send additional query in case of "timeout has fired and no canceling statement error observed". Technically speaking, SIGINT delivery can survive multiple SQL executions,
S3) Ask "improved cancel API" from backend, so it can handle (e.g. ignore) delayed SIGINTs.

@simeonmalchev

This comment has been minimized.

simeonmalchev commented Feb 7, 2017

@vlsi, this problem happens occasionally for us with pgjdbc 9.4.1212 running against PostgreSQL 9.5.5, java version is 1.8. This is the stack trace we got:

org.postgresql.util.PSQLException: ERROR: canceling statement due to user request
at org.postgresql.core.v3.QueryExecutorImpl.receiveErrorResponse(QueryExecutorImpl.java:2455)
at org.postgresql.core.v3.QueryExecutorImpl.processResults(QueryExecutorImpl.java:2155)
at org.postgresql.core.v3.QueryExecutorImpl.execute(QueryExecutorImpl.java:288)
at org.postgresql.jdbc.PgStatement.executeInternal(PgStatement.java:430)
at org.postgresql.jdbc.PgStatement.execute(PgStatement.java:356)
at org.postgresql.jdbc.PgPreparedStatement.executeWithFlags(PgPreparedStatement.java:168)
at org.postgresql.jdbc.PgPreparedStatement.executeQuery(PgPreparedStatement.java:116)

Let me know if I can provide any more useful info.

timorantalaiho added a commit to Opetushallitus/valinta-tulos-service that referenced this issue May 27, 2017

@rogue2yjg

This comment has been minimized.

rogue2yjg commented Feb 18, 2018

I had updated my jdbc driver to postgresql-42.1.4.jre7, db server was PostgreSQL 9.4.1, but this issue still happened.

1、We used HikariCP-java7-2.4.13 as our connection poll, and its validationTimeout config is 3000ms;
2、I made a test that I used a thread pool with 5 threads to execute 10 tasks, every task run a same query separatly which costed 2s+;
3、The statement_timeout config in our db server is 0;
4、All 10 tasks occurred the issue;

The error log of pool-1-thread-4 is below:

2018-02-18 21:30:51 [D] [pool-1-thread-4] c.y.s.a.d.D.getMonthAvgFaultRecoveryCost - ==> Preparing: xxx
2018-02-18 21:30:51 [D] [pool-1-thread-4] c.y.s.a.d.D.getMonthAvgFaultRecoveryCost - ==> Parameters: 2018-01-31 00:00:00.0(Timestamp), 2018-01-31 00:00:00.0(Timestamp), 43(String)
..........
2018-02-18 21:31:01 [E] [pool-1-thread-4] c.y.s.api.scheulejobs.ReportGenTask
Caused by: org.postgresql.util.PSQLException: ERROR: canceling statement due to user request
at org.postgresql.core.v3.QueryExecutorImpl.receiveErrorResponse(QueryExecutorImpl.java:2477)
at org.postgresql.core.v3.QueryExecutorImpl.processResults(QueryExecutorImpl.java:2190)
at org.postgresql.core.v3.QueryExecutorImpl.execute(QueryExecutorImpl.java:300)
at org.postgresql.jdbc.PgStatement.executeInternal(PgStatement.java:428)
at org.postgresql.jdbc.PgStatement.execute(PgStatement.java:354)
at org.postgresql.jdbc.PgPreparedStatement.executeWithFlags(PgPreparedStatement.java:169)
at org.postgresql.jdbc.PgPreparedStatement.execute(PgPreparedStatement.java:158)
at com.zaxxer.hikari.pool.ProxyPreparedStatement.execute(ProxyPreparedStatement.java:44)
at com.zaxxer.hikari.pool.HikariProxyPreparedStatement.execute(HikariProxyPreparedStatement.java)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.ibatis.logging.jdbc.PreparedStatementLogger.invoke(PreparedStatementLogger.java:59)
at com.sun.proxy.$Proxy134.execute(Unknown Source)
at org.apache.ibatis.executor.statement.PreparedStatementHandler.query(PreparedStatementHandler.java:62)
at org.apache.ibatis.executor.statement.RoutingStatementHandler.query(RoutingStatementHandler.java:78)
at org.apache.ibatis.executor.SimpleExecutor.doQuery(SimpleExecutor.java:62)
at org.apache.ibatis.executor.BaseExecutor.queryFromDatabase(BaseExecutor.java:303)
at org.apache.ibatis.executor.BaseExecutor.query(BaseExecutor.java:154)
at org.apache.ibatis.executor.CachingExecutor.query(CachingExecutor.java:102)
at org.apache.ibatis.executor.CachingExecutor.query(CachingExecutor.java:82)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.ibatis.plugin.Invocation.proceed(Invocation.java:49)
at com.github.pagehelper.SqlUtil._processPage(SqlUtil.java:351)
at com.github.pagehelper.SqlUtil.processPage(SqlUtil.java:329)
at com.github.pagehelper.PageHelper.intercept(PageHelper.java:150)
at org.apache.ibatis.plugin.Plugin.invoke(Plugin.java:61)
at com.sun.proxy.$Proxy132.query(Unknown Source)
at org.apache.ibatis.session.defaults.DefaultSqlSession.selectList(DefaultSqlSession.java:120
at org.apache.ibatis.session.defaults.DefaultSqlSession.selectList(DefaultSqlSession.java:113)
at org.apache.ibatis.session.defaults.DefaultSqlSession.selectOne(DefaultSqlSession.java:73)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.mybatis.spring.SqlSessionTemplate$SqlSessionInterceptor.invoke(SqlSessionTemplate.java:358)
... 10 common frames omitted

@vlsi

This comment has been minimized.

Member

vlsi commented Feb 18, 2018

@simeonmalchev , @rogue2yjg , would you provide more details?
E.g. a reproducer or loggerLevel=TRACE&loggerFile=pgjdbc-debug.log log or something like that.

@rogue2yjg

This comment has been minimized.

rogue2yjg commented Feb 18, 2018

@vlsi
I had update my report, and I want to emphasize a point, no matter how big I set the HikariCP's validationTimeout config, the issue always occured 10s after the query executed in multi thread scene.

@rogue2yjg

This comment has been minimized.

rogue2yjg commented Feb 18, 2018

@vlsi
Is there any detail I can provide?

@brettwooldridge

This comment has been minimized.

Contributor

brettwooldridge commented Feb 18, 2018

@rogue2yjg If you made a test case that can reproduce the issue reliably, can you either make a github repo with it, or zip it up and attach it here?

@rogue2yjg

This comment has been minimized.

rogue2yjg commented Feb 19, 2018

@brettwooldridge
ok, I’ll upload my test case later.

@simeonmalchev

This comment has been minimized.

simeonmalchev commented Feb 19, 2018

@vlsi, sorry I can't reproduce it at the moment. This issue happened for us about a year ago in a staging environment that doesn't exist anymore. If we get this same exception from somewhere else, I'll update here.

@simeonmalchev

This comment has been minimized.

simeonmalchev commented Feb 19, 2018

@vlsi, the issue actually keeps on happening but in production env where I can't set loggerLevel=TRACE&loggerFile=pgjdbc-debug.log or similar. Doesn't longer happens in the staging env. I'll see whether I can find and provide some additional useful info for this issue.

@rogue2yjg

This comment has been minimized.

rogue2yjg commented Feb 20, 2018

@vlsi @brettwooldridge
I had uploaded my test case, but not including the database table structure and data.
You could use any query sql which needed to cost 2s+ to replace my sql.

smart-fire.zip

@vlsi

This comment has been minimized.

Member

vlsi commented Feb 20, 2018

@rogue2yjg , smart-fire-api-server/src/main/resources/mybatis-config.xml reads <setting name="defaultStatementTimeout" value="10"/>

That does sound like a 10-second timeout. Have you tried to increase that value?

@simeonmalchev

This comment has been minimized.

simeonmalchev commented Feb 20, 2018

@vlsi, your last comment prompted me to check, our current exceptions are "canceling statement due to statement timeout" which is due to the configured on PSQL server side statement_timeout being hit, they are not "canceling statement due to user request". I got mixed up with the text of the exceptions. My last two comments from 2 days ago are invalid.

@rogue2yjg

This comment has been minimized.

rogue2yjg commented Feb 22, 2018

@vlsi
I had increased the defaultStatementTimeout value in mybatis-config.xml to 20, and the test case didn't throw the "canceling statement due to user request" exception. But I had a doubt with this solution.
As long as I set a config value(e.g. 10s) something like statement-timeout in anywhere our project, and the query cost 10s+. Should the "canceling statement due to user request" exception occur? Was it a correct scene?

@vlsi

This comment has been minimized.

Member

vlsi commented Feb 22, 2018

@rogue2yjg , I'm afraid I do not understand your message.

defaultStatementTimeout sets a timeout for mybatis queries, and if you set defaultStatementTimeout=10, then any query that exceeds 10s might end up with canceling statement due to user request error. That is expected behavior.

@rogue2yjg

This comment has been minimized.

rogue2yjg commented Feb 23, 2018

@vlsi
That's it! That's a expected behavior in my test case. That's what I want to know.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment