New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Parse command complete message via regex #962

Merged
merged 1 commit into from Sep 28, 2017

Conversation

Projects
None yet
5 participants
@sehrope
Contributor

sehrope commented Sep 26, 2017

Regex based approach for #960. This should handle all the current cases as well as a couple other ones (ex: "COPY 123").

The new code only throws an exception if the response is out of range (ex: count of rows is greater than 2^31-1). I think there's a separate PR/Issue to address changing the count to a long so I haven't changed any of that behavior here.

If the pattern doesn't match then no parsing error is thrown, it's silently skipped. I think that's a saner default as any new feature added to the server with a possibly incompatible command complete message wouldn't break the driver. Plus any new server responses that matches the generic COMMAND COUNT or COMMAND OID COUNT format would be handled with no code additions.

I purposely left the commented out String command = ... on L2519 as it helps explain the parsing format. If we ever change the internals to store the parsed representation then we could use it.

@jorsol

This comment has been minimized.

Contributor

jorsol commented Sep 26, 2017

A benchmark with both methods would be nice.

@jorsol

This comment has been minimized.

Contributor

jorsol commented Sep 26, 2017

While I like a catch-all approach, a regex will always be slower. In the context of CommandComplete it's possible that it's not a big deal.

Small benchmark:

Benchmark                           Mode  Cnt      Score     Error   Units
TestCommandTag.commandOldSELECT     avgt   10     52.385 ±   1.891   ns/op
TestCommandTag.commandRegexSELECT   avgt   10    297.995 ±  17.714   ns/op
TestCommandTag.commandOldINSERT    thrpt   10  12370.687 ± 242.040  ops/ms
TestCommandTag.commandRegexINSERT  thrpt   10   3175.088 ± 123.602  ops/ms

But making the assumption that the command status will always be COMMAND (OID) COUNT is not really safe.

@codecov-io

This comment has been minimized.

codecov-io commented Sep 26, 2017

Codecov Report

Merging #962 into master will increase coverage by <.01%.
The diff coverage is 80%.

@@             Coverage Diff              @@
##             master     #962      +/-   ##
============================================
+ Coverage     65.92%   65.92%   +<.01%     
+ Complexity     3561     3560       -1     
============================================
  Files           166      166              
  Lines         15238    15244       +6     
  Branches       2464     2465       +1     
============================================
+ Hits          10045    10050       +5     
- Misses         4022     4023       +1     
  Partials       1171     1171
@sehrope

This comment has been minimized.

Contributor

sehrope commented Sep 26, 2017

It doesn't assume it's always in that format. It just doesn't restrict the handling to an explicit list of command response types.

Anything that is in that format will be parsed by the regex with the trailing number being handled as the count. Anything that doesn't match the format will be ignored.

@vlsi

This comment has been minimized.

Member

vlsi commented Sep 27, 2017

@jorsol , would you please share TestCommandTag benchmark code?

@jorsol

This comment has been minimized.

Contributor

jorsol commented Sep 27, 2017

@vlsi , sure:

Benchmark                               (value)  Mode  Cnt    Score   Error  Units
TestCommandTag.regexBench         SELECT 123456  avgt   10  239.066 ± 2.756  ns/op
TestCommandTag.regexBench       INSERT 0 123456  avgt   10  292.946 ± 4.829  ns/op
TestCommandTag.startsWithBench    SELECT 123456  avgt   10   46.163 ± 3.004  ns/op
TestCommandTag.startsWithBench  INSERT 0 123456  avgt   10   70.691 ± 2.091  ns/op
/*
 * Copyright (c) 2017, PostgreSQL Global Development Group
 * See the LICENSE file in the project root for more information.
 */

package org.postgresql.benchmark.statement;

import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.BenchmarkMode;
import org.openjdk.jmh.annotations.Fork;
import org.openjdk.jmh.annotations.Measurement;
import org.openjdk.jmh.annotations.Mode;
import org.openjdk.jmh.annotations.OutputTimeUnit;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.State;
import org.openjdk.jmh.annotations.Warmup;
import org.openjdk.jmh.runner.Runner;
import org.openjdk.jmh.runner.RunnerException;
import org.openjdk.jmh.runner.options.Options;
import org.openjdk.jmh.runner.options.OptionsBuilder;

import java.sql.SQLException;
import java.sql.Statement;
import java.util.concurrent.TimeUnit;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import org.openjdk.jmh.annotations.Param;
import org.openjdk.jmh.infra.Blackhole;

@Fork(value = 1, jvmArgsPrepend = "-Xmx128m")
@Measurement(iterations = 10, time = 1, timeUnit = TimeUnit.SECONDS)
@Warmup(iterations = 10, time = 1, timeUnit = TimeUnit.SECONDS)
@State(Scope.Thread)
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
public class TestCommandTag {

  private static final Pattern COMMAND_COMPLETE_PATTERN = Pattern.compile("^([A-Za-z]+)(?: (\\d+))?(?: (\\d+))?$");

  @Param({"SELECT 123456", "INSERT 0 123456"})
  private String value;

  @Benchmark
  public void regexBench(Blackhole b) throws SQLException {
    interpretCommandStatusNEW(value, b);
  }

  @Benchmark
  public void startsWithBench(Blackhole b) throws SQLException {
    interpretCommandStatusOLD(value, b);
  }

  private void interpretCommandStatusOLD(String status, Blackhole b) {
    int update_count = 0;
    long insert_oid = 0;

    if (status.startsWith("INSERT ") || status.startsWith("UPDATE ") || status.startsWith("DELETE ")
        || status.startsWith("SELECT ") || status.startsWith("COPY ")
        || status.startsWith("MOVE ") || status.startsWith("FETCH ")) {
      try {
        long updates = Long.parseLong(status.substring(1 + status.lastIndexOf(' ')));

        // deal with situations where the update modifies more than 2^32 rows
        if (updates > Integer.MAX_VALUE) {
          update_count = Statement.SUCCESS_NO_INFO;
        } else {
          update_count = (int) updates;
        }

        if (status.startsWith("INSERT")) {
          insert_oid
              = Long.parseLong(status.substring(1 + status.indexOf(' '), status.lastIndexOf(' ')));
        }
      } catch (NumberFormatException nfe) {
        System.err.println(nfe.getMessage());
      }
    }
    b.consume(insert_oid);
    b.consume(update_count);
  }

  private void interpretCommandStatusNEW(String status, Blackhole b) {
    long oid = 0;
    int count = 0;
    Matcher matcher = COMMAND_COMPLETE_PATTERN.matcher(status);
    if (matcher.matches()) {
      // String command = matcher.group(1);
      try {
        if (matcher.group(3) != null) {
          // COMMAND OID ROWS
          oid = Long.parseLong(matcher.group(2));
          count = Integer.parseInt(matcher.group(3));
        } else if (matcher.group(2) != null) {
          // COMMAND ROWS
          count = Integer.parseInt(matcher.group(2));
        }
      } catch (NumberFormatException e) {
        System.err.println(e.getMessage());
      }
    }
    b.consume(oid);
    b.consume(count);
  }

  public static void main(String[] args) throws RunnerException {
    Options opt = new OptionsBuilder()
        .include(TestCommandTag.class.getSimpleName())
        .detectJvmArgs()
        .build();

    new Runner(opt).run();
  }

}
@davecramer

This comment has been minimized.

Member

davecramer commented Sep 27, 2017

@jorsol

This comment has been minimized.

Contributor

jorsol commented Sep 27, 2017

@vlsi , I updated the comment... same result.

For what it's worth, this is just a small fraction of the whole processing of results from database, so the processing might be negligible in real world.

@sehrope

This comment has been minimized.

Contributor

sehrope commented Sep 27, 2017

FYI, I'm not doubting the .startsWith(...) approach is faster than the regex approach. I just don't think it matters. We're talking about nanoseconds per command here (not per row) so even if the other approach took zero time nobody would ever notice.

I'm advocating this regex based approach because I think it's both more clear and more future proof. If anything is added down the line on the server side with a similar format (which is more likely than something being added with a different format) it'd be implicitly handled. I haven't confirmed it in the wild but I'd imagine that this would work with any Postgres derivatives that provide custom command completion statuses as well (vs. a hard code list that wouldn't include them).

If there's interest in this patch I'll update it to handle the count being out of int range (by parsing it as a long and then checking it against the max value ... thinking back not really sure why I didn't include that from the beginning). Otherwise I'm sure the other PR by @jorsol will be fine too for current use cases.

@davecramer

This comment has been minimized.

Member

davecramer commented Sep 27, 2017

I'd have to agree. Seems to me the regex approach is simpler and clearer.

@jorsol

This comment has been minimized.

Contributor

jorsol commented Sep 27, 2017

I'm not against this approach, in the long run the final goal is the same:

This can save an entire round-trip to the client, allowing result counts and pagination to be calculated without an additional COUNT query.

I'm not sure about if it's "clearer", I don't immediately know what means "^([A-Za-z]+)(?: (\\d+))?(?: (\\d+))?$" or what tries to do. But I don't really care either, it belongs to the protocol and it's something that is not touched often, so I'm fine with this approach, the final goal it's the same.

/**
* QueryExecutor implementation for the V3 protocol.
*/
public class QueryExecutorImpl extends QueryExecutorBase {
private static final Logger LOGGER = Logger.getLogger(QueryExecutorImpl.class.getName());
private static final Pattern COMMAND_COMPLETE_PATTERN = Pattern.compile("^([A-Za-z]+)(?: (\\d+))?(?: (\\d+))?$");

This comment has been minimized.

@vlsi

vlsi Sep 27, 2017

Member

I suggest to use "([A-Za-z]++)(?: (\\d++))?+(?: (\\d++))?+"
It is explained in https://www.regular-expressions.info/catastrophic.html

This comment has been minimized.

@sehrope

sehrope Sep 27, 2017

Contributor

For normal/valid responses from the backend server I don't think there's any backtracking as it'd be either COMMAND, COMMAND COUNT, or COMMAND OID COUNT so the internal FSA would terminate. But if the regex can be improved to deal with a malicious response without too much complexity then sure why not. I'll take a look.

This comment has been minimized.

@vlsi

vlsi Sep 27, 2017

Member

Let's put it in another way: ++ forbids backtracking, and it makes the code DOS-safe.

if (status.startsWith("INSERT")) {
insert_oid =
Long.parseLong(status.substring(1 + status.indexOf(' '), status.lastIndexOf(' ')));
if (matcher.group(3) != null) {

This comment has been minimized.

@vlsi

vlsi Sep 27, 2017

Member

Would you extract a variable for matcher.group(...)?

This comment has been minimized.

@sehrope

sehrope Sep 27, 2017

Contributor

Sure that'd probably save an internal substring on the Matcher.

|| status.startsWith("MOVE")) {
long oid = 0;
int count = 0;
Matcher matcher = COMMAND_COMPLETE_PATTERN.matcher(status);

This comment has been minimized.

@vlsi

vlsi Sep 27, 2017

Member

What do you think of reusing Matcher instance?
Matcher could be cached in a QueryExecutorImpl.

This comment has been minimized.

@sehrope

sehrope Sep 27, 2017

Contributor

Matcher isn't thread safe[1] so it's cleaner to leave it internal to the the method. The main work of prepping the regex is done by the Pattern.compile(...) anyway so I doubt a cached Matcher would make much difference. The standard approach of regex parsing in Java is to have a static Pattern and a local variable for the matcher as it handles both the matching and groups for the results.

[1]: ... and yes neither is the Connection or QueryExecutor but you know what I mean...

This comment has been minimized.

@vlsi

vlsi Sep 27, 2017

Member

create table update_bench_test(int_field int4, str_field varchar(200))
connection.prepareStatement("update update_bench_test set int_field = int_field+1") (the table is empty)

No matcher reuse: the code allocates 624 bytes. With matcher reuse -- 376 bytes.

This comment has been minimized.

@sehrope

sehrope Sep 27, 2017

Contributor

Sure but the matcher would be young gen allocation and be easily reclaimed.

I think pulling it outside the method comes at the expense of readability for negligible gain. If you think it's worth it I'd suggest doing it in a follow up PR.

@sehrope

This comment has been minimized.

Contributor

sehrope commented Sep 27, 2017

I pushed some changes to this PR. The regex hasn't been updated (yet) as would need to study it a bit more. Rest of the changes have been applied though. Specifically:

  • Save the matching groups to variables group2/group3 so they can be reused.
  • Handling counts greater than Integer.MAX_VALUE
  • Minor rewording of the message for the exception thrown if the count or oid value is out range
feat: Parse command complete message via regex
Replaces command status whitelist based parsing with a regex approach to
handle generic COMMAND OID COUNT or COMMAND COUNT responses. If the
response does not match the regex then parsing is skipped. This should allow
for automatically supporting new server responses of that same form as well as
skipping any that cannot be parsed.
@sehrope

This comment has been minimized.

Contributor

sehrope commented Sep 28, 2017

Pushed one more change with an update to the regex to use possessive quantifiers per @vlsi.

Only difference from the one in his comment is the inclusion of the start (^) and end of line anchors ($). The regex handling in Java includes them by default to match complete lines so they're not really needed, but I think it's a bit clearer (to me at least...) to have them explicitly included.

@vlsi

vlsi approved these changes Sep 28, 2017

@davecramer

This comment has been minimized.

Member

davecramer commented Sep 28, 2017

Is there any reason not to push this then ?

@vlsi vlsi added this to the 42.1.5 milestone Sep 28, 2017

@vlsi vlsi merged commit 097db5e into pgjdbc:master Sep 28, 2017

2 checks passed

codecov/project 65.92% (+<.01%) compared to fcb28c7
Details
continuous-integration/travis-ci/pr The Travis CI build passed
Details
@sehrope

This comment has been minimized.

Contributor

sehrope commented Sep 28, 2017

Nice. Thanks @vlsi for the review and @jorsol for the original PR and test case.

@vlsi

This comment has been minimized.

Member

vlsi commented Jun 15, 2018

Just for the reference, I've measured Matcher#reset case.
The thing is we can easily have per QueryExecutorImpl Matcher instance. That shaves 200b worth of allocations.

matcher = matcher.reset(status)

Benchmark                                         (value)  Mode  Cnt    Score    Error   Units
TestCommandTag.regexBench                        SELECT 1  avgt   10  151.432 ±  9.181   ns/op
TestCommandTag.regexBench:·gc.alloc.rate.norm    SELECT 1  avgt   10   48.000 ±  0.001    B/op
TestCommandTag.regexBench                      INSERT 0 1  avgt   10  198.042 ±  3.547   ns/op
TestCommandTag.regexBench:·gc.alloc.rate.norm  INSERT 0 1  avgt   10   96.000 ±  0.001    B/op
matcher = pattern.matcher(status)

Benchmark                                         (value)  Mode  Cnt    Score    Error   Units
TestCommandTag.regexBench                        SELECT 1  avgt   10  174.909 ±  5.512   ns/op
TestCommandTag.regexBench:·gc.alloc.rate.norm    SELECT 1  avgt   10  248.000 ±  0.001    B/op
TestCommandTag.regexBench                      INSERT 0 1  avgt   10  232.960 ±  5.624   ns/op
TestCommandTag.regexBench:·gc.alloc.rate.norm  INSERT 0 1  avgt   10  296.000 ±  0.001    B/op

Here are indexOf results:

Benchmark                                         (value)  Mode  Cnt     Score     Error   Units
TestCommandTag.indexOf                           SELECT 1  avgt   10    34.769 ±   5.241   ns/op
TestCommandTag.indexOf:·gc.alloc.rate.norm       SELECT 1  avgt   10    48.000 ±   0.001    B/op
TestCommandTag.indexOf                         INSERT 0 1  avgt   10    53.060 ±   2.362   ns/op
TestCommandTag.indexOf:·gc.alloc.rate.norm     INSERT 0 1  avgt   10    96.000 ±   0.001    B/op

rhavermans added a commit to bolcom/pgjdbc that referenced this pull request Jul 13, 2018

feat: parse command complete message via regex (pgjdbc#962)
Replaces command status whitelist based parsing with a regex approach to
handle generic COMMAND OID COUNT or COMMAND COUNT responses. If the
response does not match the regex then parsing is skipped. This should allow
for automatically supporting new server responses of that same form as well as
skipping any that cannot be parsed.

Fixes pgjdbc#958

rhavermans added a commit to bolcom/pgjdbc that referenced this pull request Jul 13, 2018

feat: parse command complete message via regex (pgjdbc#962)
Replaces command status whitelist based parsing with a regex approach to
handle generic COMMAND OID COUNT or COMMAND COUNT responses. If the
response does not match the regex then parsing is skipped. This should allow
for automatically supporting new server responses of that same form as well as
skipping any that cannot be parsed.

Fixes pgjdbc#958
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment