feat: add support for PreparedStatement.setCharacterStream(int, Reader) #671
Conversation
Current coverage is 62.08% (diff: 55.55%)

@@            master    #671   diff @@
==========================================
  Files          150     151     +1
  Lines        15180   15288   +108
  Methods          0       0
  Messages         0       0
  Branches      3054    3085    +31
==========================================
+ Hits          9414    9491    +77
- Misses        4519    4530    +11
- Partials      1247    1267    +20
  private static final int DEFAULT_CHAR_BUFFER_SIZE = 8 * 1024;

  //#if mvn.project.property.postgresql.jdbc.spec < "JDBC4.1"
  private static final Charset UTF_8 = Charset.availableCharsets().get("UTF-8");
jorsol
Oct 20, 2016
Member
This should be changed to private static final Charset UTF_8 = Charset.forName("UTF-8");
to avoid the use of the preprocessor and have backwards compatibility with Java 6.
bd-infor
Oct 21, 2016
Author
Contributor
- Charset.forName("UTF-8") throws an exception. Charset.availableCharsets().get("UTF-8") does not. Given that "UTF-8" is a mandatory charset, it doesn't make a huge difference, but some static analyzers complain if exceptions can be thrown during class initialization.
- I couldn't find any discussion of how long pgjdbc intends to maintain Java 6 compatibility. Given that Java 6 has been out of support for years, I would consider dropping Java 6 support to be a better option. As a compromise, I coded it to use StandardCharsets when available and fall back to a static final class member for Java 6.
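The three lookup options discussed above can be compared in a small sketch (assuming Java 7+ for StandardCharsets; class name is illustrative):

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class CharsetLookupDemo {
    // Charset.forName can throw UnsupportedCharsetException, which some
    // static analyzers flag when it appears in a static initializer.
    private static final Charset VIA_FOR_NAME = Charset.forName("UTF-8");

    // Map lookup: returns null for a missing name instead of throwing.
    private static final Charset VIA_MAP = Charset.availableCharsets().get("UTF-8");

    // Java 7+: a constant, no lookup and no checked failure path at all.
    private static final Charset VIA_STANDARD = StandardCharsets.UTF_8;

    public static void main(String[] args) {
        // All three resolve to the same singleton Charset instance.
        System.out.println(VIA_FOR_NAME.equals(VIA_MAP));  // true
        System.out.println(VIA_STANDARD.equals(VIA_MAP));  // true
    }
}
```

Since "UTF-8" is one of the six charsets the Java spec requires every JVM to provide, all three lookups succeed everywhere; the difference is only in which failure modes the compiler and analyzers can see.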
The change requires a unit test.
import java.nio.charset.CoderResult;
import java.nio.charset.MalformedInputException;

public class ReaderInputStream extends InputStream {
vlsi
Oct 21, 2016
Member
Can you please clarify how this is different from java.io.InputStreamReader?
bd-infor
Oct 21, 2016
Author
Contributor
java.io.InputStreamReader takes a byte stream and a Charset as input and outputs a UTF-16 char stream. ReaderInputStream takes a UTF-16 char stream as input and outputs a UTF-8 byte stream. I hardcoded UTF-8 as the target charset because pgjdbc always uses UTF-8 for front end<->back end communication.
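The difference in direction can be illustrated with a minimal sketch (this is not pgjdbc's actual ReaderInputStream, just an illustration of the two conversions):

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;

public class ReaderDirectionDemo {
    public static void main(String[] args) throws IOException {
        byte[] utf8 = "héllo".getBytes(StandardCharsets.UTF_8);  // 6 bytes, 5 chars

        // InputStreamReader: bytes in, UTF-16 chars out (decoding).
        Reader decoder = new InputStreamReader(
            new ByteArrayInputStream(utf8), StandardCharsets.UTF_8);
        StringBuilder sb = new StringBuilder();
        int c;
        while ((c = decoder.read()) != -1) {
            sb.append((char) c);
        }
        System.out.println(sb);  // héllo

        // The opposite direction (UTF-16 chars in, UTF-8 bytes out) is what a
        // Reader-to-InputStream adapter provides. The naive equivalent below
        // materializes the whole value first, which is exactly what a
        // streaming adapter avoids for large inputs:
        byte[] encoded = sb.toString().getBytes(StandardCharsets.UTF_8);
        System.out.println(encoded.length == utf8.length);  // true
    }
}
```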
    <toolchain>
      <type>jdk</type>
      <provides>
        <version>1.9</version>
vlsi
Oct 21, 2016
Member
Why did you kill 1.9 profile?
bd-infor
Oct 21, 2016
Author
Contributor
Sorry, that was not intended. This is the first time that I have used git and I made a mistake.
          setString(parameterIndex, castToString(in), getStringType());
          break;
        case Types.LONGVARCHAR:
          if (in instanceof InputStream) {
            preparedParameters.setText(parameterIndex, (InputStream) in);
vlsi
Oct 21, 2016
Member
Can you please add a unit test that uses the added feature?
bd-infor
Oct 21, 2016
Author
Contributor
In my post to the mailing list, I asked 2 questions regarding the unit tests. Those questions have not been answered yet.
- Can I rely on having a UTF-8 database for the test code? The most interesting test cases use UTF-16 surrogate pairs.
- For the ReaderInputStreamTest unit test, should I create a pgjdbc/src/test/java/org/postgresql/util directory or should I just put it into the jdbc41 test directory?
davecramer
Oct 24, 2016
Member
Pretty sure we can guarantee that there will be a UTF-8 server, but that can't be guaranteed in the wild. We just set the encoding of the connection. Is that not possible?
bd-infor
Oct 24, 2016
Author
Contributor
Thanks. Because the test database may not be UTF-8, I will restrict the test cases in the unit test to printable ASCII. The test cases in ReaderInputStreamTest will cover the surrogate pairs.
vlsi
Oct 24, 2016
Member
Test database is created here: https://github.com/pgjdbc/pgjdbc/blob/master/.travis.yml#L10
We can even use different encodings if that is required for tests.
If a particular test requires the db to be UTF-8, it should be handled with an Assume.assumeTrue(con....) kind of validation in the test setup. When the "assumption" fails, the test is just ignored. The test should be a JUnit 4 test (e.g. extends BaseTest4).
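The skip-on-failed-assumption pattern can be sketched without a JUnit dependency; in the real test, Assume.assumeTrue(utf8) in the setup method would mark the test ignored. The serverEncoding() helper here is a hypothetical stand-in for querying the connection (e.g. with SHOW server_encoding):

```java
public class AssumeSketch {
    // Hypothetical stand-in: a real BaseTest4 subclass would read the
    // encoding from the test Connection instead of hardcoding it.
    static String serverEncoding() {
        return "UTF8";
    }

    public static void main(String[] args) {
        boolean utf8 = "UTF8".equals(serverEncoding());
        if (!utf8) {
            // JUnit's Assume.assumeTrue(utf8) would report the test as
            // ignored rather than failed at this point.
            System.out.println("skipped: database is not UTF-8");
            return;
        }
        System.out.println("running surrogate-pair test cases");
    }
}
```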
bd-infor
Oct 24, 2016
Author
Contributor
Fixing JRE 1.6 dependency for leadingSurrogate/trailingSurrogate.
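The Java 6 issue here is presumably that Character.highSurrogate/lowSurrogate only arrived in Java 7. The equivalents can be computed by hand on older JREs; a sketch (method names mirror the ones mentioned above, the arithmetic matches the UTF-16 encoding rules):

```java
public class SurrogateDemo {
    // Manual equivalents of Character.highSurrogate/lowSurrogate (Java 7+),
    // usable on a Java 6 JRE. Input must be a supplementary code point
    // (0x10000..0x10FFFF).
    static char leadingSurrogate(int codePoint) {
        // 0xD800 - (0x10000 >>> 10) folds the "- 0x10000" bias into the base.
        return (char) ((codePoint >>> 10)
            + (Character.MIN_HIGH_SURROGATE
                - (Character.MIN_SUPPLEMENTARY_CODE_POINT >>> 10)));
    }

    static char trailingSurrogate(int codePoint) {
        // Low 10 bits plus 0xDC00.
        return (char) ((codePoint & 0x3ff) + Character.MIN_LOW_SURROGATE);
    }

    public static void main(String[] args) {
        int cp = 0x1F600;  // a code point outside the BMP
        System.out.println((int) leadingSurrogate(cp) == 0xD83D);   // true
        System.out.println((int) trailingSurrogate(cp) == 0xDE00);  // true
    }
}
```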
bd-infor
Oct 24, 2016
Author
Contributor
Fixed all the pgjdbc style warnings/errors.
Now, does anyone have suggestions as to why the test would fail on the PG_VERSION=9.4 QUERY_MODE=simple COVERAGE=Y ANORM_SBT=Y test? Or suggestions as to how to duplicate that failure locally? I'm particularly confused by the message "Sent and received data are not the same expected:<[... really long string ...]> but was:<[<stream of 10240 bytes>]>". The actual value is retrieved with String getString(int) and the expected value is passed in as a String, so how can either come out as a byte stream?
Thanks.
bd-infor
Oct 25, 2016
Author
Contributor
For future reference, I duplicated the problem in this particular travis-ci instance by:
1. changing to Postgres 9.4 as the test server
2. setting the environment variables (on Windows: set QUERY_MODE=simple ...)
3. running "mvn clean package -B -V -DpreferQueryMode=simple -P release,coverage"
I was already using Oracle JDK 1.8 and did not need to change it. The operating system doesn't seem to matter, as the test fails on Windows as well as Linux.
To find the values for steps 2 and 3, click the details link next to the Travis CI failure, then click the failed build job link. The environment variables for the failed job are near line 150 of the debug log; the mvn command is near line 250.
davecramer
Oct 29, 2016
Member
He is reviewing the code; keep in mind we do this in our spare time.
In pgjdbc 9.4 (1211) and earlier, the PreparedStatement method setCharacterStream(int, Reader) throws a not-implemented exception. The setCharacterStream(int, Reader, int) method does not throw an exception, but it materializes the Reader as a String and then uses the same code path as setString(int, String). Materializing the readers can cause memory usage issues (OutOfMemoryError) when working with large text values.
This patch implements the setCharacterStream(int, Reader) method using in-memory buffers for small streams and temp files for large streams. To do so, it uses code added in pull request #220 that was used to implement setBinaryStream(int, InputStream). This patch adds a utility class to convert the Reader's UTF-16 character stream into a UTF-8 binary stream.
The setCharacterStream(int, Reader) method implemented in this patch supports both simple and extended query modes. However, in simple mode, it materializes the reader as a String just like setCharacterStream(int, Reader, int).
c8cebf6 to f52bf7f
      throw new PSQLException(GT.tr("Provided Reader failed."), PSQLState.UNEXPECTED_ERROR, ioe);
    }
  }

  public void setCharacterStream(int parameterIndex, Reader value) throws SQLException {
vlsi
Nov 12, 2016
Member
Sorry for the late review.
Technically speaking:
- The PostgreSQL protocol does not support streaming data of unknown length. That means, for "unknown size", the only way out is to buffer the value somehow.
- "Buffering the stream into java.lang.String" does not differ much from org.postgresql.jdbc.PgPreparedStatement#setCharacterStream(int, java.io.Reader, int).
Did you consider unifying the existing org.postgresql.jdbc.PgPreparedStatement#setCharacterStream(int, java.io.Reader, int) and the other setCharacterStream methods?
It looks like your implementation brings yet another "save Reader into String" implementation with no code (and bug :) ) sharing with the existing code around similar setCharacterStream methods.
PS. StreamWrapper does buffer the contents into byte[]. I think it is possible to keep the stream all the way down and stream data to the network socket, however it is not implemented yet.
bd-infor
Nov 14, 2016
Author
Contributor
PS. StreamWrapper does buffer the contents into byte[]. ...
When I read this, I think you are saying that the entire content of the InputStream is buffered into a single large byte[] and that is incorrect. If you meant that the content is buffered into a byte[] of at most StreamWrapper.MAX_MEMORY_BUFFER_BYTES in the StreamWrapper(InputStream) constructor and then in 8192 byte chunks when writing to the socket, then I agree.
... however it is not implemented yet.
I agree 100% that this patch is pointless if the stream is not kept all the way down to the network socket. However, I think that behavior is implemented. Please let me know if I am reading this code path incorrectly.
In the StreamWrapper(InputStream) constructor, values of up to MAX_MEMORY_BUFFER_BYTES are buffered entirely in memory (byte[] value is in StreamWrapper.rawData and StreamWrapper.stream is null). Larger values are streamed to a temporary file (StreamWrapper.rawData is null and StreamWrapper.stream is a FileInputStream over the temp file).
SimpleParameterList.streamBytea(PGStream, StreamWrapper) sends the byte[] if rawData is not null. If rawData is null, it calls pgStream.sendStream(wrapper.getStream(), wrapper.getLength()).
PGStream.sendStream(InputStream, int) copies the file contents directly to the connection socket 8192 bytes at a time.
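The buffer-or-spill behavior walked through above can be sketched as follows. This is an illustration of the technique, not pgjdbc's actual StreamWrapper code; the threshold value, class name, and helper name are all illustrative:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;

public class BufferOrSpillDemo {
    // Illustrative threshold; the real value lives in
    // StreamWrapper.MAX_MEMORY_BUFFER_BYTES.
    static final int MAX_MEMORY_BUFFER_BYTES = 51200;

    // Small inputs stay in a byte[]; larger inputs spill to a temp file,
    // and the caller gets a stream over whichever backing store was used.
    static InputStream bufferOrSpill(InputStream in) throws IOException {
        ByteArrayOutputStream mem = new ByteArrayOutputStream();
        byte[] chunk = new byte[8192];
        int n;
        // Read until the stream ends or the in-memory limit is exceeded.
        while (mem.size() <= MAX_MEMORY_BUFFER_BYTES && (n = in.read(chunk)) != -1) {
            mem.write(chunk, 0, n);
        }
        if (mem.size() <= MAX_MEMORY_BUFFER_BYTES) {
            return new ByteArrayInputStream(mem.toByteArray());  // small value
        }
        // Too big: write what we have plus the rest of the stream to disk,
        // 8 KiB at a time, and hand back a stream over the temp file.
        Path tmp = Files.createTempFile("stream", ".tmp");
        tmp.toFile().deleteOnExit();
        try (OutputStream out = Files.newOutputStream(tmp)) {
            mem.writeTo(out);
            while ((n = in.read(chunk)) != -1) {
                out.write(chunk, 0, n);
            }
        }
        return Files.newInputStream(tmp);
    }

    public static void main(String[] args) throws IOException {
        InputStream small = bufferOrSpill(new ByteArrayInputStream(new byte[10]));
        InputStream large = bufferOrSpill(new ByteArrayInputStream(new byte[200_000]));
        System.out.println(small instanceof ByteArrayInputStream);  // true
        System.out.println(large instanceof ByteArrayInputStream);  // false
    }
}
```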
- ... the only way out is to buffer the value ...
Agreed.
- ... does not differ much ...
Agreed. However, the "buffer to String" code path is only taken when (connection.getPreferQueryMode() == PreferQueryMode.SIMPLE) and I only added that code path as a work around for the QUERY_MODE=simple test failure.
In extended query mode (the normal code path), setCharacterStream(int, Reader) uses StreamWrapper to buffer to disk. Managing/limiting JVM heap use by buffering large values to disk rather than memory is the primary goal of the patch.
Making the disk buffering also work for simple query mode would be a nice future enhancement, but is beyond the scope of this patch. If you think it would be helpful, I could add a "// TODO - implement Reader to socket streaming for simple query mode" comment to the simple query mode code path to make that explicit.
Did you consider unifying ...
Yes. I think that unifying the existing setCharacterStream(int, Reader, int) with my setCharacterStream(int, Reader) is a bad idea. Making them both use the existing implementation (buffer to String) leaves us with the memory use problems that I'm trying to fix. Making them both use the buffer-to-file (or byte[]) implementation is a significant change for existing users. Anyone who is having memory use problems with setCharacterStream(int, Reader, int) can easily switch to setCharacterStream(int, Reader). If the current users are not having memory use problems, there is no benefit, and the possible creation of temp files where there were none seems like a significant backwards compatibility risk to me.
As for setCharacterStream(int, Reader, long), I think that allowing a "long" length would lead people to believe that they actually could pass values larger than Integer.MAX_VALUE. Given the 1GB limit on bytea columns and the 2GB length limit in the protocol, that seems like a bad idea.
The JDBC documentation implies that having different performance characteristics for the different setCharacterStream() APIs is possible. From the JavaDoc for setCharacterStream(int, Reader), "... Note: Consult your JDBC driver documentation to determine if it might be more efficient to use a version of setCharacterStream which takes a length parameter. ...".
It is my intent to submit a documentation change if this pull request is accepted but it seems premature to do so at this time.
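A length guard of the kind discussed here for setCharacterStream(int, Reader, long) might look like the following. This is a hypothetical sketch, not the driver's code; the method name and message are illustrative:

```java
import java.sql.SQLFeatureNotSupportedException;

public class LengthGuardDemo {
    // A long length cannot exceed what an int-based implementation (and the
    // protocol's ~2 GB message limit) can carry, so reject it up front.
    static int checkLength(long length) throws SQLFeatureNotSupportedException {
        if (length < 0 || length > Integer.MAX_VALUE) {
            throw new SQLFeatureNotSupportedException(
                "Object length out of supported range: " + length);
        }
        return (int) length;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(checkLength(1024));  // 1024
        try {
            checkLength(3_000_000_000L);        // > Integer.MAX_VALUE
        } catch (SQLFeatureNotSupportedException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```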
vlsi
Nov 15, 2016
Member
Larger values are streamed to a temporary file
I stand corrected.
Given the 1GB limit on bytea columns and the 2GB length limit in the protocol, that seems like a bad idea
We can throw an appropriate "invalid argument" exception
bd-infor
Nov 15, 2016
Author
Contributor
We can throw an appropriate "invalid argument" exception
Agreed. Given that the suggested unification is not directly related to the new functionality in this pull request and given that this pull request does not currently make any changes to either setCharacterStream(int, Reader, int) or setCharacterStream(int, Reader, long), I will submit a separate "refactor: ..." pull request to unify setCharacterStream(int, Reader, long) with setCharacterStream(int, Reader, int).
First cut in order to get feedback.