
Support database encodings other than UTF8 #101

Closed
pramsey opened this issue May 16, 2016 · 25 comments

pramsey (Owner) commented May 16, 2016

Some people do run PostgreSQL databases with non-default encodings (though it's pretty rare in production, I hope, so this is a low-priority issue). For them, there's really no solution except to add transcoding in the FDW extension. Fortunately, GDAL has an internal UTF model, so it's possible to count on that as a fixed reference point; otherwise it would just be too ugly.

bowguy commented Sep 25, 2020

Is this supported?
ogr_fdw.c shows:

```c
/* Check that the database encoding is UTF8, to match OGR internals */
if (GetDatabaseEncoding() != PG_UTF8)
{
	elog(ERROR, "OGR FDW only works with UTF-8 databases");
	PG_RETURN_VOID();
}
```

and trying to connect to a non-UTF8 database shows the above error.

@pramsey pramsey reopened this Sep 25, 2020
pramsey (Owner) commented Sep 25, 2020

No, I don't think it is. The commit actually transcodes only on write; there's nothing for read that I can see, and as you note there's still a block right at the start checking for a UTF-8 backend.

bowguy commented Sep 25, 2020

The thing I am running into is using ogr_fdw to connect to (and import from) MSSQL using LATIN1 encoding. Connecting and importing work fine, and the extended characters get inserted into the DB with no errors. However, running a query whose results include that text throws an error.
My workaround is ogr2ogr to CSV and then psql via \copy. ogr_fdw is a really elegant solution for me except for this.

pramsey (Owner) commented Sep 25, 2020

It's a little fiddly because OGR cannot provide a hard guarantee that its data is UTF8. Some drivers won't provide that... I don't know if the MSSQL driver does. Mostly it seems like drivers just take in whatever encoding the source has and don't do any transcoding. I'm not sure why the commit thought it fixed this issue, since it was only one step on the way there.

So your PgSQL database is LATIN1 and your MSSQL data is... something? And OGR is in the middle. Basically, even if I fixed this, if it turns out OGR isn't providing UTF in the middle, it will still not work for you.

bowguy commented Sep 25, 2020

I have an MSSQL database that I have a full connection to via ODBC. I created a PostgreSQL database with the default UTF8 encoding, installed ogr_fdw, and used ogr_fdw_info to get the connection parameters. Everything is fine so far. To speed up queries, I create a local copy via:
CREATE TABLE foo AS (SELECT * FROM fdw_foo);
This works fine. However, when doing queries on a text column I run into errors saying:
'utf-8' codec can't decode byte 0xbe in position 4: invalid start byte
At this point I realized my MSSQL database had extended characters and will error on any query with results using those characters. But the reason I am asking is that ogr_fdw ACTUALLY DID insert the extended characters into my PostgreSQL database. I was querying against the local table foo from above, not the foreign table fdw_foo. To me it seems everything is working OK except for the encoding issue.

I just created a new database with WIN1252 encoding (I found out later MSSQL was LATIN1) and tried again, but I could not create the extension because of the UTF-8 check early in the code.

So I ended up using ogr2ogr from MSSQL -> .csv (the extended characters are clearly there) and then copied the CSV into the WIN1252-encoded PostgreSQL database. This does what I want, but ogr_fdw is a much cleaner solution. To grab a fresh snapshot, just:

truncate table foo; insert into foo (select * from fdw_foo);

without the intermediate step of creating the .csv files.

pramsey (Owner) commented Sep 25, 2020

So probably OGR does not convert the strings it gets from MSSQL into UTF, so they sit in OGR as LATIN1.
Then the ogr_fdw code asks OGR "do you have UTF strings?" and OGR says "nope", and ogr_fdw goes "well, not knowing any better, I'll just copy them in as-is".
Fundamentally, there is just not enough metadata travelling through the OGR pipes to make this work automatically. The only way to fix it would be to add a user parameter to the FDW server object that states what its encoding is. Then ogr_fdw could transcode automatically to the local server encoding.
It would be a handy knob, if only for the very few people smart enough to figure out when to turn it. Encodings are hard, and best handled automatically if at all possible.
Another (much better, really, cc @rouault) thing to push on would be having the OGR ODBC driver correctly transcode the data it reads from sources into UTF. That way OGR could advertise that the layer is in UTF and other drivers could do the "right thing" vis-à-vis their own systems.

bowguy commented Sep 25, 2020

Thanks. I like the "handy knob" idea since the text output from the two databases would then be the same. Otherwise the output might look like
Temp: 65 °
vs
Temp: 65 0xb0

The ODBC driver is a Microsoft product, so OGR might not be able to know exactly what it is pushing.

rouault (Contributor) commented Sep 25, 2020

@szekerest could probably comment better than me what is the status of the MSSQL driver regarding UTF-8

szekerest commented

@rouault According to CPLODBCStatement::Fetch, as soon as the text data is stored as WCHAR (nvarchar, ntext), the returned data is converted to UTF8. The single-byte string representations (CHAR) are not converted at the moment.

rouault (Contributor) commented Sep 26, 2020

@szekerest So the driver is probably close to being able to advertise OLCStringsAsUTF8? (Although @bowguy seems to get LATIN1 strings. Perhaps the UTF8 conversion was added recently. @bowguy, which GDAL version do you use?)

szekerest commented

In my understanding, we have single-byte string columns in the database with a specific encoding. That is not converted to UTF8 automatically by the driver; that would require querying the database's global encoding setting to be able to apply the conversion.

bowguy commented Sep 26, 2020 via email

bowguy commented Sep 29, 2020

What if it does not do any conversion (which seems to happen in my case)? The extension is loaded, but the check is in the server to see if the source matches the database. Then, instead of "OGR FDW only works with UTF-8 databases", CREATE SERVER throws "Error: Source is encoded in LATIN1 but database is UTF-8". Not a perfect fix; the ideal is, as you say, to convert anything from the source into the destination encoding, but how often will that happen? I think I am kind of an edge case, and I have a workaround.

pramsey (Owner) commented Sep 29, 2020 via email

bowguy commented Sep 29, 2020

I'll see if I can set up a build environment for it. (Argh, windows.)

bowguy commented Sep 30, 2020

Made the changes, but I'm running into issues with GDAL versions. It looks like the Windows version of PostGIS 3.0 is using an old GDAL version (2.4.4 - see above). @robe2, is there a build of PostGIS 3.0 with GDAL 3.0?

bowguy commented Oct 1, 2020

Success. I had to compile GDAL 2.4.4 from source in MSYS2 because the default PostGIS installation (PG12, PostGIS 3) uses GDAL 2.4.4 while everything else on Windows is GDAL 3.0. Then I could compile pgsql_ogr_fdw in MSYS2 and copy the files over.
Anyway, the extension now loads in any database encoding with no errors, and the extended characters come in just fine. The only change to ogr_fdw.c is commenting out the encoding check shown above.

robe2 (Contributor) commented Oct 4, 2020

Made the changes, but I'm running into issues with GDAL versions. It looks like the Windows version of PostGIS 3.0 is using an old GDAL version (2.4.4 - see above). @robe2, is there a build of PostGIS 3.0 with GDAL 3.0?

I'm planning to ship the PostGIS 3.0.3 Windows bundle with GDAL 3.0. I think I've already worked out the kinks I was having with EDB library conflicts. If I haven't already, I'll flip winnie to start building ogr_fdw with GDAL 3.0 and the PostGIS 3.0 branch to build with GDAL 3.0.

Aside from that, it sounds like no changes are needed in ogr_fdw to support encoding? Can we close this out then, @bowguy @pramsey @rouault?

bowguy commented Oct 4, 2020

@robe2 if you can ship PostGIS 3.0.3 with GDAL 3.0, I will test it ASAP. Maybe EDB 13?
What if using GDAL 3.0 shows the same behavior? If the database is UTF8 and there are extended characters in a CSV file, \copy throws an error but ogr_fdw writes them in anyway - is this a bug or a feature?

bowguy commented Oct 5, 2020

Correction to my previous comment.
I used ogr2ogr to create a CSV export of the original data (with extended characters). I then created a standard UTF8 database as before.
I then used psql -c "\copy foo from foo.csv with delimiter ',' CSV HEADER;" - this produced no errors and the data was imported cleanly. Part of a query output was "4X4 ¾TON P/U" - note the extended character.
Then I created an ogr_fdw server to the CSV file, created a foreign table to the same file, and did an import to a local table:
CREATE TABLE local_foo AS (SELECT * FROM fdw_foo);
This had no errors, but doing a query on the local table gives:
'utf-8' codec can't decode byte 0xbe in position 4: invalid start byte.
If I use a different database encoding (WIN1252), ogr_fdw works perfectly.
I will test as soon as I can get a PostGIS/GDAL 3.0 version installed.

bowguy commented Oct 6, 2020

How about removing that check and marking this fixed? For me it is working just fine for encodings other than UTF8. I think there is something else going on with the encoding on my end.

@pramsey pramsey closed this as completed in 3111e80 Oct 6, 2020
robe2 (Contributor) commented Oct 17, 2020

@robe2 if you can ship PostGIS 3.0.3 with GDAL 3.0, I will test it ASAP. Maybe EDB 13?
What if using GDAL 3.0 shows the same behavior? If the database is UTF8 and there are extended characters in a CSV file, \copy throws an error but ogr_fdw writes them in anyway - is this a bug or a feature?

Just for anyone looking at this: as discussed on the PostGIS mailing lists, I have PG 13 with PostGIS 3.0.2 (which includes the latest released ogr_fdw) up on StackBuilder and at http://download.osgeo.org/postgis/windows/pg13/

but as I discovered, my compile has no dependency on libiconv - https://lists.osgeo.org/pipermail/postgis-devel/2020-October/028655.html - which @bowguy pointed out might be the issue.

I'll report back on this after I have recompiled GDAL (and gotten libiconv back in there).

bowguy commented Oct 19, 2020

Progress on my issue. I have LATIN1 encoding in my ODBC database. Some of the characters in a string are wide chars which must be converted to multi-byte characters for UTF8. This can be done with
CPLRecodeFromWChar()
The offending string is "4X4 �TON P/U" and should be converted to "4X4 ¾TON P/U".
The first string has one wchar and the second has one multi-byte character.
I called:
cstr_decoded = CPLRecodeFromWChar(cstr_in, CPL_ENC_ISO8859_1, CPL_ENC_UTF8);
and it correctly translated the wchar to multi-byte. However, the function converted every two characters into one, so the string was truncated to "44¾O /".
I am not sure how to pick out the wchars buried in the string even if I went through the characters one by one. Any suggestions would be appreciated, and I will keep plugging on.

Note this is no longer part of the original issue, 'Support database encodings other than UTF8' (which is fixed and closed), so it might be better to move it to another issue.

bowguy commented Oct 19, 2020

Success! CPLRecode(cstr_in, CPL_ENC_ISO8859_1, CPL_ENC_UTF8) works perfectly for me. Now I need to test for the source encoding and destination encoding.
From the documentation:

The only guaranteed supported encodings are CPL_ENC_UTF8, CPL_ENC_ASCII and CPL_ENC_ISO8859_1. Currently, the following conversions are supported:
CPL_ENC_ASCII -> CPL_ENC_UTF8 or CPL_ENC_ISO8859_1 (no conversion in fact)
CPL_ENC_ISO8859_1 -> CPL_ENC_UTF8
CPL_ENC_UTF8 -> CPL_ENC_ISO8859_1

bowguy commented Oct 20, 2020

If I add
open_options 'ENCODING=LATIN1'
to the options section as in the documentation, how do I find that value in ogr_fdw.c? Could this be the 'handy knob' @pramsey mentioned earlier? The value could be anything iconv supports.
