New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: use UTF-8 instead of US-ASCII for initial encoding #398

Merged
merged 1 commit into from Oct 14, 2015

Conversation

Projects
None yet
3 participants
@davecramer
Member

davecramer commented Oct 13, 2015

If lc_messages is set to a language which uses UTF-8 then connections which fail garble messages

@@ -99,7 +99,7 @@ public Struct createStruct(String typeName, Object[] attributes) throws SQLExcep
public Array createArrayOf(String typeName, Object[] elements) throws SQLException
{
checkClosed();
int oid = getTypeInfo().getPGArrayType(typeName);
int oid = getTypeInfo().getPGArrayType(typeName.toLowerCase());

This comment has been minimized.

@vlsi

vlsi Oct 14, 2015

Member

Why is this a part of "use UTF-8 instead of US-ASCII for initial encoding" PR?

This comment has been minimized.

@davecramer

davecramer Oct 14, 2015

Member

It should not be..... my mistake. How do you create two different PR's without applying the first PR ?

This comment has been minimized.

@vlsi

vlsi Oct 14, 2015

Member

You should use different branches for different PRs.
It is not recommended to create PRs out of a master branch.
In other words, you create a branch for "default encoding" and a branch for "name case".

In present situation it would be easier to just git reset --hard 8dae0ae5df6ffaa484d0d6ff69aaa33f5d1d713b && git push -f origin master. This will update PR with to have just a single commit, then you could merge it in.

This comment has been minimized.

@davecramer

davecramer Oct 14, 2015

Member

wouldn't I push to my repo instead of master ?

Dave Cramer

On 14 October 2015 at 06:53, Vladimir Sitnikov notifications@github.com
wrote:

In org/postgresql/jdbc4/AbstractJdbc4Connection.java
#398 (comment):

@@ -99,7 +99,7 @@ public Struct createStruct(String typeName, Object[] attributes) throws SQLExcep
public Array createArrayOf(String typeName, Object[] elements) throws SQLException
{
checkClosed();

  •    int oid = getTypeInfo().getPGArrayType(typeName);
    
  •    int oid = getTypeInfo().getPGArrayType(typeName.toLowerCase());
    

You should use different branches for different PRs.
It is not recommended to create PRs out of a master branch.
In other words, you create a branch for "default encoding" and a branch
for "name case".

In present situation it would be easier to just git reset --hard
8dae0ae && git push -f origin master.
This will update PR with to have just a single commit, then you could merge
it in.


Reply to this email directly or view it on GitHub
https://github.com/pgjdbc/pgjdbc/pull/398/files#r41979375.

This comment has been minimized.

@vlsi

vlsi Oct 14, 2015

Member

I did not notice you create PRs from your own repository.
Then you should push to your repo.

davecramer added a commit that referenced this pull request Oct 14, 2015

Merge pull request #398 from davecramer/master
fix: use UTF-8 instead of US-ASCII for initial encoding

@davecramer davecramer merged commit 78fe602 into pgjdbc:master Oct 14, 2015

1 check passed

continuous-integration/travis-ci/pr The Travis CI build passed
Details
@pdradx

This comment has been minimized.

pdradx commented Feb 22, 2016

Hi!

I have something to say about this commit.
If lc_messages is set to a language, which uses non UTF-8 encoding then the connection will throw an exception if server send to client some message (absence of db for example), but that exception have not correct SQLState (broken encoding instead of Class 3D — Invalid Catalog Name). That is broke the compatibility of old software with current drivers.
For example we can't test the presence of DB in cluster because we fail to read server answer because it's sent in encoding which UTF-8 can not decode.
So decision wether to broke behaviour or garble message. You have patched code to satisfy your environment settings, but broke it for other non UTF-8 environments which have UTF-8 incompatible server messages.
There is an example of broken behaviour exception when we connect to nonexistent database:

org.postgresql.util.PSQLException: ошибка при попытке подсоединения.
at org.postgresql.core.v3.ConnectionFactoryImpl.openConnectionImpl(ConnectionFactoryImpl.java:257)
at org.postgresql.core.ConnectionFactory.openConnection(ConnectionFactory.java:65)
at org.postgresql.jdbc.PgConnection.(PgConnection.java:159)
at org.postgresql.Driver.makeConnection(Driver.java:415)
at org.postgresql.Driver.connect(Driver.java:283)
at java.sql.DriverManager.getConnection(DriverManager.java:571)
at java.sql.DriverManager.getConnection(DriverManager.java:215)
...
Caused by: java.io.IOException: Illegal UTF-8 sequence: byte 2 of 2 byte sequence is not 10xxxxxx: -64
at org.postgresql.core.UTF8Encoding.checkByte(UTF8Encoding.java:28)
at org.postgresql.core.UTF8Encoding.decode(UTF8Encoding.java:103)
at org.postgresql.core.PGStream.ReceiveString(PGStream.java:329)
at org.postgresql.core.v3.ConnectionFactoryImpl.readStartupMessages(ConnectionFactoryImpl.java:686)
at org.postgresql.core.v3.ConnectionFactoryImpl.openConnectionImpl(ConnectionFactoryImpl.java:207)

The problem not solved even if we explicitly set the charset for connection parameter.
IMHO broken message if much lesser evil than broken behaviour.

The environments settings was Russian_CP1251.
The same problem may be recovered for many other settings I think.

Do not take as rudeness! English is not my native language.

@davecramer

This comment has been minimized.

Member

davecramer commented Feb 22, 2016

Sounds like we have a catch-22 if we use US-ASCII then we will break UTF-8 and vice-versa ?

@pdradx

This comment has been minimized.

pdradx commented Feb 23, 2016

Something like that.
But if default is ANSI we just read message wrong for multibyte symbols. But if default is UTF-8 then we catch encoding exception because not all ANSI-string as byte stream can be correctly read as UTF-8.
If we trying to set encoding to ANSI-1251 the error is more sever (even not an SQL exception at al).

@vlsi

This comment has been minimized.

Member

vlsi commented Feb 23, 2016

I wonder if we can implement the following:

  1. read all "startup messages" as bytes
  2. find "encoding" value
  3. decode messages via the encoding
@pdradx

This comment has been minimized.

pdradx commented Feb 24, 2016

Can we read encoding settings from cluster (lc_messages) as initial operations and then use it to decode messages from server?

@pdradx

This comment has been minimized.

pdradx commented Feb 24, 2016

It's tough and error-prone operations to find encoding (more statistical predictions than plain rules). For example ANSI and OEM ecnodings will accept any sequence of bytes without worying about gobbledigook in result string. UTF-8 will accept message from any server settings if only english letters appears, but will be unable to read if sequence of bytes appeared to be broken multibyte sequence for symbol.

@pdradx

This comment has been minimized.

pdradx commented Feb 24, 2016

Also I see error that driver are not using charset option from connection parameters when establishing first initial connection to DB.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment