feat: add support for PreparedStatement.setCharacterStream(int, Reader) #671

Closed
wants to merge 2 commits into base: master from

5 participants
@bd-infor
Contributor

bd-infor commented Oct 20, 2016

First cut in order to get feedback.

@codecov-io

codecov-io commented Oct 20, 2016

Current coverage is 62.08% (diff: 55.55%)

Merging #671 into master will increase coverage by 0.06%

@@             master       #671   diff @@
==========================================
  Files           150        151     +1   
  Lines         15180      15288   +108   
  Methods           0          0          
  Messages          0          0          
  Branches       3054       3085    +31   
==========================================
+ Hits           9414       9491    +77   
- Misses         4519       4530    +11   
- Partials       1247       1267    +20   

Powered by Codecov. Last update 6d2a53e...b478933

private static final int DEFAULT_CHAR_BUFFER_SIZE = 8 * 1024;
//#if mvn.project.property.postgresql.jdbc.spec < "JDBC4.1"
private static final Charset UTF_8 = Charset.availableCharsets().get("UTF-8");

@jorsol

jorsol Oct 20, 2016

Contributor

This should be changed to private static final Charset UTF_8 = Charset.forName("UTF-8"); to avoid the use of the preprocessor and keep backwards compatibility with Java 6.

@bd-infor

bd-infor Oct 21, 2016

Contributor
  1. Charset.forName("UTF-8") can throw an exception; Charset.availableCharsets().get("UTF-8") cannot. Given that "UTF-8" is a mandatory charset, it doesn't make a huge difference, but some static analyzers complain if exceptions can be thrown during class initialization (see the sketch after this comment).
  2. I couldn't find any discussion of how long pgjdbc intends to maintain Java 6 compatibility. Given that Java 6 has been out of support for years, I would consider dropping Java 6 support to be the better option. As a compromise, I coded it to use StandardCharsets when available and fall back to a static final class member for Java 6.
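
For illustration, a minimal sketch of the two lookup styles being discussed; the class and field names are invented for this example, and both expressions resolve the mandatory UTF-8 charset on any JVM:

import java.nio.charset.Charset;

// Hypothetical illustration of the two UTF-8 lookup options discussed above.
final class Utf8Lookup {
  // jorsol's suggestion: simple and Java 6 compatible, but Charset.forName
  // is declared to throw UnsupportedCharsetException, which some static
  // analyzers flag when it appears in a static initializer.
  static final Charset UTF_8_VIA_FOR_NAME = Charset.forName("UTF-8");

  // The patch's approach: availableCharsets() never throws; it would simply
  // return null for an unknown name, which cannot happen for UTF-8 because
  // every JVM is required to support it.
  static final Charset UTF_8_VIA_LOOKUP = Charset.availableCharsets().get("UTF-8");
}
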
@vlsi

The change requires a unit test

import java.nio.charset.CoderResult;
import java.nio.charset.MalformedInputException;
public class ReaderInputStream extends InputStream {

@vlsi

vlsi Oct 21, 2016

Member

Can you please clarify how this is different from java.io.InputStreamReader?

@bd-infor

bd-infor Oct 21, 2016

Contributor

java.io.InputStreamReader takes a byte stream and a Charset as input and outputs a UTF-16 char stream. ReaderInputStream takes a UTF-16 char stream as input and outputs a UTF-8 byte stream. I hardcoded UTF-8 as the target charset because pgjdbc always uses UTF-8 for front end<->back end communication.
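
For illustration, a simplified sketch (not the code in this pull request) of that direction, wrapping a Reader of UTF-16 chars as a stream of UTF-8 bytes via a CharsetEncoder; the class name and buffer sizes are arbitrary:

import java.io.IOException;
import java.io.InputStream;
import java.io.Reader;
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CoderResult;
import java.nio.charset.StandardCharsets;

// Simplified sketch: adapts a Reader (UTF-16 chars) to an InputStream of UTF-8 bytes.
final class SimpleReaderInputStream extends InputStream {
  private final Reader reader;
  private final CharsetEncoder encoder = StandardCharsets.UTF_8.newEncoder();
  private final CharBuffer chars = CharBuffer.allocate(8 * 1024);
  private final ByteBuffer bytes = ByteBuffer.allocate(8 * 1024);
  private boolean eof;

  SimpleReaderInputStream(Reader reader) {
    this.reader = reader;
    chars.flip(); // start both buffers empty, ready for draining
    bytes.flip();
  }

  @Override
  public int read() throws IOException {
    while (!bytes.hasRemaining()) {
      if (eof && !chars.hasRemaining()) {
        return -1;
      }
      fill();
    }
    return bytes.get() & 0xFF;
  }

  private void fill() throws IOException {
    if (!eof) {
      chars.compact();
      if (reader.read(chars) == -1) {
        eof = true;
      }
      chars.flip();
    }
    bytes.compact();
    // A lone surrogate makes the encoder report an error (e.g. MalformedInputException).
    CoderResult result = encoder.encode(chars, bytes, eof);
    if (result.isError()) {
      result.throwException();
    }
    bytes.flip();
  }
}
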

<toolchain>
<type>jdk</type>
<provides>
<version>1.9</version>

@vlsi

vlsi Oct 21, 2016

Member

Why did you kill the 1.9 profile?

@bd-infor

bd-infor Oct 21, 2016

Contributor

Sorry, that was not intended. This is the first time that I have used git and I made a mistake.

setString(parameterIndex, castToString(in), getStringType());
break;
case Types.LONGVARCHAR:
if (in instanceof InputStream) {
preparedParameters.setText(parameterIndex, (InputStream)in);

@vlsi

vlsi Oct 21, 2016

Member

Can you please add a unit test that uses the added feature?

@bd-infor

bd-infor Oct 21, 2016

Contributor

In my post to the mailing list, I asked 2 questions regarding the unit tests. Those questions have not been answered yet.

  1. Can I rely on having a UTF-8 database for the test code? The most interesting test cases use UTF-16 surrogate pairs.
  2. For the ReaderInputStreamTest unit test, should I create a pgjdbc/src/test/java/org/postgresql/util directory or should I just put it into the jdbc41 test directory?

@davecramer

davecramer Oct 24, 2016

Member

Pretty sure we can guarantee that there will be a UTF-8 server for the tests, but that can't be guaranteed in the wild. We just set the encoding of the connection. Is that not possible?

@bd-infor

bd-infor Oct 24, 2016

Contributor

Thanks. Because the test database may not be UTF-8, I will restrict the test cases in the unit test to printable ASCII. The test cases in ReaderInputStreamTest will cover the surrogate pairs.

@vlsi

vlsi Oct 24, 2016

Member

Test database is created here: https://github.com/pgjdbc/pgjdbc/blob/master/.travis.yml#L10

We can even use different encodings if that is required for tests.

If a particular test requires the db to be UTF-8, it should be handled as Assume.assumeTrue(con....) kind of validation in the test setup like this:

Assume.assumeTrue("jsonb requires PostgreSQL 9.4+", TestUtil.haveMinimumServerVersion(con, "9.4"));

When the "assumption" fails, the test is just ignored. The test should be a junit 4 test (e.g. extends BaseTest4)
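
For illustration, a minimal sketch of such a gated JUnit 4 test; the class name and the server_encoding query are assumptions for this example (not part of this pull request), and BaseTest4 is assumed to provide the con connection as referenced above:

import java.sql.ResultSet;
import java.sql.Statement;

import org.junit.Assume;
import org.junit.Before;
import org.junit.Test;

// Illustrative only: gate surrogate-pair test cases on a UTF-8 test database.
public class CharacterStreamEncodingTest extends BaseTest4 {
  @Before
  @Override
  public void setUp() throws Exception {
    super.setUp();
    Statement stmt = con.createStatement();
    ResultSet rs = stmt.executeQuery("SHOW server_encoding");
    rs.next();
    String encoding = rs.getString(1);
    rs.close();
    stmt.close();
    Assume.assumeTrue("surrogate-pair cases require a UTF-8 database",
        "UTF8".equals(encoding));
  }

  @Test
  public void surrogatePairRoundTrip() throws Exception {
    // round-trip checks using setCharacterStream(int, Reader) would go here
  }
}
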

@bd-infor

bd-infor Oct 24, 2016

Contributor

Fixing JRE 1.6 dependency for leadingSurrogate/trailingSurrogate.

@bd-infor

bd-infor Oct 24, 2016

Contributor

Fixed all the pgjdbc style warnings/errors.
Now, does anyone have any suggestions as to why the test would fail on the PG_VERSION=9.4 QUERY_MODE=simple COVERAGE=Y ANORM_SBT=Y build, or as to how to duplicate that locally? I'm particularly confused by the message "Sent and received data are not the same expected:<[... really long string ...]> but was:<[<stream of 10240 bytes>]>". The actual value is retrieved with String getString(int) and the expected value is passed in as a String, so how can either come out as a byte stream?
Thanks.

@bd-infor

bd-infor Oct 25, 2016

Contributor

For future reference, I duplicated the problem in this particular travis-ci instance by

  1. changing to Postgres 9.4 as the test server
  2. setting the environment variables (on Windows - set QUERY_MODE=simple ...)
  3. running "mvn clean package -B -V -DpreferQueryMode=simple -P release,coverage"
  4. I was already using Oracle JDK 1.8 and did not need to change it.
  5. Operating system doesn't seem to matter as the test fails on Windows as well as Linux.

To find the values for steps 2 and 3, click the details link next to the Travis CI failure, then click the failed build job link. The environment variables for the failed job are near line 150 of the debug log; the mvn command is near line 250.

@bd-infor

bd-infor Oct 29, 2016

Contributor

@vlsi Are these test cases acceptable?

@davecramer

davecramer Oct 29, 2016

Member

He is reviewing the code; keep in mind we do this in our spare time.

@bd-infor bd-infor changed the title from add support for PreparedStatement.setCharacterStream(int, Reader) to feat: add support for PreparedStatement.setCharacterStream(int, Reader) Oct 26, 2016

feat: add support for PreparedStatement.setCharacterStream(int, Reader)
In pgjdbc 9.4 (1211) and earlier, the PreparedStatement method
setCharacterStream(int, Reader) throws a not implemented
exception. The setCharacterStream(int, Reader, int) method does not
throw an exception, but it materializes the Reader as a String and
then uses the same code path as setString(int, String). Materializing
the reader can cause memory usage issues (OutOfMemoryError) when
working with large text values.
This patch implements the setCharacterStream(int, Reader) method
using in-memory buffers for small streams and temp files for large
streams. To do so, it uses code added in pull request #220 that was
used to implement setBinaryStream(int, InputStream). This patch
adds a utility class to convert the Reader's UTF-16 character stream
into a UTF-8 binary stream.
The setCharacterStream(int, Reader) method implemented in this patch
supports both simple and extended query modes. However, in simple
mode, it materializes the reader as a String just like
setCharacterStream(int, Reader, int).
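
For context, a hypothetical caller of the new method would look like the sketch below; the table and column names are invented for the example, and the point is that no length argument and no intermediate String are needed:

import java.io.FileReader;
import java.io.Reader;
import java.sql.Connection;
import java.sql.PreparedStatement;

// Illustrative usage of setCharacterStream(int, Reader) with a large text value.
final class LargeTextExample {
  static void insertLargeDocument(Connection con, String path) throws Exception {
    try (Reader reader = new FileReader(path); // platform charset, fine for a sketch
         PreparedStatement ps = con.prepareStatement(
             "INSERT INTO documents (body) VALUES (?)")) {
      // No length argument and no intermediate String: per the description above,
      // small values are buffered in memory and large values spill to a temp file.
      ps.setCharacterStream(1, reader);
      ps.executeUpdate();
    }
  }
}
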

@vlsi vlsi force-pushed the pgjdbc:master branch 2 times, most recently from c8cebf6 to f52bf7f Oct 29, 2016

@vlsi vlsi added this to the 9.4.1213 milestone Nov 9, 2016

throw new PSQLException(GT.tr("Provided Reader failed."), PSQLState.UNEXPECTED_ERROR, ioe);
}
}
public void setCharacterStream(int parameterIndex, Reader value) throws SQLException {

@vlsi

vlsi Nov 12, 2016

Member

Sorry for the late review.

Technically speaking:

  1. PostgreSQL protocol does not support streaming data of unknown length. That means, for "unknown size", the only way out is to buffer the value somehow.
  2. "buffering the stream into java.lang.String" does not differ much from org.postgresql.jdbc.PgPreparedStatement#setCharacterStream(int, java.io.Reader, int)

Did you consider unifying existing org.postgresql.jdbc.PgPreparedStatement#setCharacterStream(int, java.io.Reader, int), and other setCharacterStream methods?

It looks like your implementation brings yet another "save Reader into String" implementation with no code (and bug :) ) sharing with existing code around similar setCharacterStream methods.

PS. StreamWrapper does buffer the contents into byte[]. I think it is possible to keep stream all the way down and stream data to the network socket, however it is not implemented yet.

@bd-infor

bd-infor Nov 14, 2016

Contributor

PS. StreamWrapper does buffer the contents into byte[]. ...

When I read this, I think you are saying that the entire content of the InputStream is buffered into a single large byte[] and that is incorrect. If you meant that the content is buffered into a byte[] of at most StreamWrapper.MAX_MEMORY_BUFFER_BYTES in the StreamWrapper(InputStream) constructor and then in 8192 byte chunks when writing to the socket, then I agree.

... however it is not implemented yet.

I agree 100% that this patch is pointless if the stream is not kept all the way down to the network socket. However, I think that behavior is implemented. Please let me know if I am reading this code path incorrectly.

In the StreamWrapper(InputStream) constructor, values of up to MAX_MEMORY_BUFFER_BYTES are buffered entirely in memory (byte[] value is in StreamWrapper.rawData and StreamWrapper.stream is null). Larger values are streamed to a temporary file (StreamWrapper.rawData is null and StreamWrapper.stream is a FileInputStream over the temp file).

SimpleParameterList.streamBytea(PGStream, StreamWrapper) sends the byte[] if rawData is not null. If rawData is null, it calls pgStream.sendStream(wrapper.getStream(), wrapper.getLength()).

PGStream.sendStream(InputStream, int) copies the file contents directly to the connection socket 8192 bytes at a time.

  1. ... the only way out is to buffer the value ...

Agreed.

  2. ... does not differ much ...

Agreed. However, the "buffer to String" code path is only taken when (connection.getPreferQueryMode() == PreferQueryMode.SIMPLE) and I only added that code path as a workaround for the QUERY_MODE=simple test failure.

In extended query mode (the normal code path), setCharacterStream(int, Reader) uses StreamWrapper to buffer to disk. Managing/limiting JVM heap use by buffering large values to disk rather than memory is the primary goal of the patch.

Making the disk buffering also work for simple query mode would be a nice future enhancement, but is beyond the scope of this patch. If you think it would be helpful, I could add a "// TODO - implement Reader to socket streaming for simple query mode" comment to the simple query mode code path to make that explicit.

Did you consider unifying ...

Yes. I think that unifying the existing setCharacterStream(int, Reader, int) with my setCharacterStream(int, Reader) is a bad idea. Making them both use the existing implementation (buffer to String) leaves us with the memory use problems that I'm trying to fix. Making them both use the buffer to file (or byte[]) implementation is a significant change for existing users. Anyone who is having memory use problems with setCharacterStream(int, Reader, int) can easily switch to setCharacterStream(int, Reader). If the current users are not having memory use problems, there is no benefit, and the possible creation of temp files where there were none seems like a significant backwards compatibility risk to me.

As for setCharacterStream(int, Reader, long), I think that allowing a "long" length would lead people to believe that they actually could pass values larger than Integer.MAX_VALUE. Given the 1GB limit on bytea columns and the 2GB length limit in the protocol, that seems like a bad idea.

The JDBC documentation implies that having different performance characteristics for the different setCharacterStream() APIs is possible. From the JavaDoc for setCharacterStream(int, Reader), "... Note: Consult your JDBC driver documentation to determine if it might be more efficient to use a version of setCharacterStream which takes a length parameter. ...".

It is my intent to submit a documentation change if this pull request is accepted but it seems premature to do so at this time.

@vlsi

vlsi Nov 15, 2016

Member

Larger values are streamed to a temporary file

I stand corrected.

Given the 1GB limit on bytea columns and the 2GB length limit in the protocol, that seems like a bad idea

We can throw an appropriate "invalid argument" exception

@bd-infor

bd-infor Nov 15, 2016

Contributor

We can throw an appropriate "invalid argument" exception

Agreed. Given that the suggested unification is not directly related to the new functionality in this pull request and given that this pull request does not currently make any changes to either setCharacterStream(int, Reader, int) or setCharacterStream(int, Reader, long), I will submit a separate "refactor: ..." pull request to unify setCharacterStream(int, Reader, long) with setCharacterStream(int, Reader, int).
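
For illustration, a minimal sketch of the guard discussed above for the long-length overload; the exception message and the use of a plain SQLException are assumptions for the example (the driver itself would throw a PSQLException with an appropriate PSQLState):

import java.io.Reader;
import java.sql.SQLException;

// Illustrative guard: lengths beyond Integer.MAX_VALUE cannot be sent over the
// protocol, so reject them up front and otherwise delegate to the int overload.
abstract class LongLengthGuardSketch {
  abstract void setCharacterStream(int parameterIndex, Reader reader, int length)
      throws SQLException;

  void setCharacterStream(int parameterIndex, Reader reader, long length)
      throws SQLException {
    if (length > Integer.MAX_VALUE) {
      throw new SQLException(
          "Object length " + length + " exceeds the maximum the protocol can send.");
    }
    setCharacterStream(parameterIndex, reader, (int) length);
  }
}
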

@vlsi vlsi self-assigned this Nov 24, 2016

@vlsi vlsi closed this in ee4c426 Nov 24, 2016
