
Support database encodings other than UTF8 #101

Closed
pramsey opened this issue May 16, 2016 · 25 comments

pramsey (Owner) commented May 16, 2016

Some people do run PostgreSQL databases with non-default encodings (though it's pretty rare in production, I hope, so this is a low-priority issue). For them, there's really no solution except to add transcoding in the FDW extension. Fortunately, GDAL has an internal UTF model, so it's possible to count on that as a fixed reference point; otherwise it would just be too ugly.

bowguy commented Sep 25, 2020

Is this supported?
ogr_fdw.c shows:

```c
/* Check that the database encoding is UTF8, to match OGR internals */
if (GetDatabaseEncoding() != PG_UTF8)
{
	elog(ERROR, "OGR FDW only works with UTF-8 databases");
	PG_RETURN_VOID();
}
```

and trying to connect to a non-UTF8 database shows the above error.

@pramsey pramsey reopened this Sep 25, 2020
pramsey (Owner) commented Sep 25, 2020

No, I don't think it is. The commit actually transcodes only on write; there's nothing for read that I can see, and as you note there's still a block right at the start checking for a UTF-8 backend.

bowguy commented Sep 25, 2020

The thing I am running into is using ogr_fdw to connect to (and import from) MSSQL using LATIN1 encoding. Connecting and importing work fine, and the extended characters get inserted into the DB with no errors. However, running a query whose results include that text throws an error.
My workaround is ogr2ogr to CSV and then psql via \copy. ogr_fdw is a really elegant solution for me except for this.

pramsey (Owner) commented Sep 25, 2020

It's a little fiddly because OGR cannot provide a hard guarantee that its data is UTF8. Some drivers won't provide that... I don't know if the MSSQL driver does. Mostly it seems like drivers just take in whatever encoding the source has and don't do any transcoding. I'm not sure why the commit thought it fixed this issue, since it was only one step on the way there.

So your PgSQL database is LATIN1 and your MSSQL data is... something? And OGR is in the middle. Basically, even if I fixed this, if it turns out OGR isn't providing UTF in the middle, it will still not work for you.

bowguy commented Sep 25, 2020

I have an MSSQL database that I have a full connection to via ODBC. I created a PostgreSQL database with the default UTF8 encoding, installed ogr_fdw, and used ogr_fdw_info to get the connection parameters. Everything is fine so far. To speed up queries, I create a local copy via:
CREATE TABLE foo AS (SELECT * FROM fdw_foo);
This works fine. However, when doing queries on a text column I run into errors saying:
'utf-8' codec can't decode byte 0xbe in position 4: invalid start byte
At this point I realized my MSSQL database had extended characters and will error on any query with results using those characters. But the reason I am asking is that ogr_fdw ACTUALLY DID insert the extended characters into my PostgreSQL database. I was querying against the local table foo from above, not the foreign table fdw_foo. To me it seems everything is working OK except for the encoding issue.

I just created a new database with WIN1252 encoding (I found out later MSSQL was LATIN1) and tried again, but I could not create the extension because of the UTF-8 check early in the code.

So I ended up using ogr2ogr from MSSQL -> .csv (the extended characters are clearly there) and then copied the CSV into the WIN1252-encoded PostgreSQL database. This does what I want, but ogr_fdw is a much cleaner solution. To grab a fresh snapshot, just:

truncate table foo; insert into foo (select * from fdw_foo);

without the intermediate step of creating the .csv files.

pramsey (Owner) commented Sep 25, 2020

So probably OGR does not convert the strings it gets from MSSQL into UTF, so they sit in OGR as LATIN1.
Then the ogr_fdw code asks OGR "do you have UTF strings?" and OGR says "nope", and ogr_fdw goes "well, not knowing any better, I'll just copy them in as-is".
Fundamentally, there is just not enough metadata travelling through the OGR pipes to make this work automatically. The only way to fix it would be to add a user parameter to the FDW server object that states what its encoding is. Then ogr_fdw could transcode automatically to the local server encoding.
It would be a handy knob, if only for the very few people smart enough to figure out when to turn it. Encodings are hard, and best handled automatically if at all possible.
Another (much better, really, cc @rouault) thing to push on would be having the OGR ODBC driver correctly transcode the data it reads from sources into UTF. That way OGR could advertise that the layer is in UTF and other drivers could do the "right thing" vis-à-vis their own systems.

bowguy commented Sep 25, 2020

Thanks. I like the "handy knob" idea since the text output from the two databases would then be the same. Otherwise the output might look like
Temp: 65 °
vs
Temp: 65 0xb0

The ODBC driver is a Microsoft product, so OGR might not be able to know exactly what it is pushing.

rouault (Contributor) commented Sep 25, 2020

@szekerest could probably comment better than me what is the status of the MSSQL driver regarding UTF-8

szekerest commented

@rouault According to CPLODBCStatement::Fetch, as soon as the text data is stored as WCHAR (nvarchar, ntext), the returned data is converted to UTF8. The single-byte string representations (CHAR) are not converted at the moment.

rouault (Contributor) commented Sep 26, 2020

@szekerest So the driver is probably close to being able to advertise OLCStringsAsUTF8? (Although @bowguy seems to get LATIN1 strings. Perhaps the UTF8 conversion was added recently. @bowguy, which GDAL version do you use?)

szekerest commented

In my understanding, we have single-byte string columns in the database with a specific encoding. That is not converted to UTF8 automatically by the driver; that would require querying the database's global encoding setting to be able to apply the conversion.

bowguy commented Sep 26, 2020 via email

bowguy commented Sep 29, 2020

What if it does not do any conversion (which seems to happen in my case)? The extension is loaded, but the check is in the server to see if the source matches the database. Then, instead of "OGR FDW only works with UTF-8 databases", CREATE SERVER throws "Error: Source is encoded in LATIN1 but database is UTF-8". Not a perfect fix; the ideal is, as you say, to convert anything from the source into the destination encoding, but how often will that happen? I think I am kind of an edge case, and I have a workaround.

pramsey (Owner) commented Sep 29, 2020 via email

bowguy commented Sep 29, 2020

I'll see if I can set up a build environment for it. (Argh, windows.)

bowguy commented Sep 30, 2020

Made the changes, but I'm running into issues with GDAL versions. It looks like the Windows version of PostGIS 3.0 is using an old GDAL version (2.4.4 - see above). @robe2, is there a build of PostGIS 3.0 with GDAL 3.0?

bowguy commented Oct 1, 2020

Success. I had to compile GDAL 2.4.4 from source in MSYS2 because the default PostGIS installation (PG12, PostGIS 3) uses GDAL 2.4.4 while everything else on Windows is GDAL 3.0. Then I could compile pgsql_ogr_fdw in MSYS2 and copy the files over.
Anyway, the extension now loads in any database encoding with no errors, and the extended characters come in just fine. The only change to ogr_fdw.c is commenting out the encoding check shown above.

robe2 (Contributor) commented Oct 4, 2020

Made the changes, but I'm running into issues with GDAL versions. It looks like the Windows version of PostGIS 3.0 is using an old GDAL version (2.4.4 - see above). @robe2, is there a build of PostGIS 3.0 with GDAL 3.0?

I'm planning to ship the PostGIS 3.0.3 Windows bundle with GDAL 3.0. I think I've already worked out the kinks I was having with EDB library conflicts. If I haven't already, I'll flip winnie to start building ogr_fdw with GDAL 3.0 and the PostGIS 3.0 branch to build with GDAL 3.0.

Aside from that, it sounds like no changes are needed in ogr_fdw to support encoding? Can we close this out then, @bowguy @pramsey @rouault?

bowguy commented Oct 4, 2020

@robe2 if you can ship PostGIS 3.0.3 with GDAL 3.0, I will test it ASAP. Maybe EDB 13?
What if using GDAL 3.0 shows the same behavior? If the database is UTF8 and there are extended characters in a CSV file, \copy throws an error but ogr_fdw writes them in anyway - is this a bug or a feature?

bowguy commented Oct 5, 2020

Correction to my previous comment.
I used ogr2ogr to create a CSV export of the original data (with extended characters). I then created a standard UTF8 database as before.
I then used psql -c "\copy foo from foo.csv with delimiter ',' CSV HEADER;" - this produced no errors and the data was imported cleanly. Part of a query output was "4X4 ¾TON P/U" - note the extended character.
Then I created an ogr_fdw server to the CSV file, created a foreign table to the same file, and did an import to a local table:
CREATE TABLE local_foo AS (SELECT * FROM fdw_foo);
This had no errors, but doing a query on the local table gives:
'utf-8' codec can't decode byte 0xbe in position 4: invalid start byte.
If I use a different database encoding (WIN1252), ogr_fdw works perfectly.
I will test as soon as I can get a PostGIS/GDAL 3.0 version installed.

bowguy commented Oct 6, 2020

How about removing that check and marking this fixed? For me it is working just fine for encodings other than UTF8. I think there is something else going on with the encoding on my end.

@pramsey pramsey closed this as completed in 3111e80 Oct 6, 2020
robe2 (Contributor) commented Oct 17, 2020

@robe2 if you can ship PostGIS 3.0.3 with GDAL 3.0, I will test it ASAP. Maybe EDB 13?
What if using GDAL 3.0 shows the same behavior? If the database is UTF8 and there are extended characters in a CSV file, \copy throws an error but ogr_fdw writes them in anyway - is this a bug or a feature?

Just for anyone looking at this: as discussed on the PostGIS mailing lists, I have PG 13 with PostGIS 3.0.2 (which includes the latest released ogr_fdw) up on StackBuilder and at http://download.osgeo.org/postgis/windows/pg13/

but as I discovered, my compile has no dependency on libiconv - https://lists.osgeo.org/pipermail/postgis-devel/2020-October/028655.html - which @bowguy pointed out might be the issue.

I'll report back on this after I have recompiled GDAL (and gotten libiconv back in there).

bowguy commented Oct 19, 2020

Progress on my issue. I have LATIN1 encoding in my ODBC database. Some of the characters in a string are wide chars which must be converted to multi-byte characters for UTF8. This can be done with
CPLRecodeFromWChar()
The offending string is "4X4 �TON P/U" and should be converted to "4X4 ¾TON P/U".
The first string has one wchar and the second has one multi-byte character.
I called:
cstr_decoded = CPLRecodeFromWChar(cstr_in, CPL_ENC_ISO8859_1, CPL_ENC_UTF8);
and it correctly translated the wchar to multi-byte. However, the function converted every two characters into one, so the string was truncated to "44¾O /".
I am not sure how to pick out the wchars buried in the string even if I went through the characters one by one. Any suggestions would be appreciated, and I will keep plugging on.

Note this is no longer part of the original issue, 'Support database encodings other than UTF8' (which is fixed and closed), so it might be better to move it to another issue.

bowguy commented Oct 19, 2020

Success! CPLRecode(cstr_in, CPL_ENC_ISO8859_1, CPL_ENC_UTF8) works perfectly for me. Now I need to test for the source encoding and destination encoding.
From the documentation:

The only guaranteed supported encodings are CPL_ENC_UTF8, CPL_ENC_ASCII and CPL_ENC_ISO8859_1. Currently, the following conversions are supported:
CPL_ENC_ASCII -> CPL_ENC_UTF8 or CPL_ENC_ISO8859_1 (no conversion in fact)
CPL_ENC_ISO8859_1 -> CPL_ENC_UTF8
CPL_ENC_UTF8 -> CPL_ENC_ISO8859_1

bowguy commented Oct 20, 2020

If I add
open_options 'ENCODING=LATIN1'
to the options section as in the documentation, how do I find that value in ogr_fdw.c? Could this be the 'handy knob' @pramsey mentioned earlier? The value could be anything iconv supports.
