Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support for data type timezone conversion to UTC #249

Closed
leo-schick opened this issue Sep 3, 2022 · 11 comments
Closed

Support for data type timezone conversion to UTC #249

leo-schick opened this issue Sep 3, 2022 · 11 comments

Comments

@leo-schick
Copy link

In TSQL/SQL Server, the datetimeoffset - which is practially datetime with timezone offset - is currently converted into a parquet string. I don't like that because the data stored there is quite large and might be bad to sort.
Postgres has a similar data type called timestamp with time zone

Unfortunately, PARQUET does not support a logical data type which supports time zone, but there are serveral considerations how to deal with different time zones, see here: https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#timestamp One suggestion is to convert all to UTC.

I would like to have the option to tell odbc2parquet that all datetimeoffset/timestamp with time zone data types shall be converted into a UTC TIMESTAMP in the PARQUET export instead of into a PARQUET string. I suggest to add this option with an additional parameter.

The benefit of this is that filtering or sorting on a PARQUET timestamp is much faster than on a PARQUET string.

@leo-schick leo-schick changed the title Support for TSQL data type DateTimeOffset Support for data type timezone conversion to UTC Sep 3, 2022
@pacman82
Copy link
Owner

pacman82 commented Sep 4, 2022

Hi, thanks for the great user story! I already try to map timestamps as best as I can. I fully empathise with your usecase. You wouldn't happen to know to which ODBC datatypes datetimeoffset or timestapm with time zone map?

I need to test / investigate, but I partly fear that these mit be types which are not part of the ODBC standard, but driver specific extensions. If they are part of the standard, it is most likely just an oversight from me not handling them accordingly. Drivery specific types are more tricky. Their meaning and interpretation is not agreed on cross drivers (well, I guess that is what makes them specific). I would be more hesitant to implement logic for that. Not saying no just now, but it is something to carefully think about.

My bandwith is a bit low at the moment, may be a while before I get to investigate this. Don't know about your ODBC or Rust skills, but if you want I can give you pointers how to drive this story a bit.

Otherwise, just querying the table you likely have with a verbose logging enabled -vvv and sharing the output with me here, would be nice. Especially everything that describes the column in questios.

If you want to go the extra mile use odbcsv to yield me the column description.

Cheers, Markus

@leo-schick
Copy link
Author

For SQL Server it seems to be a vendor/driver specific implementation with a type called SQL_SS_TIMESTAMPOFFSET. See Data Type Support for ODBC Date and Time Improvements
. It passes an extra struct which could be used to cast it:

typedef struct tagSS_TIMESTAMPOFFSET_STRUCT {  
   SQLSMALLINT year;  
   SQLUSMALLINT month;  
   SQLUSMALLINT day;  
   SQLUSMALLINT hour;  
   SQLUSMALLINT minute;  
   SQLUSMALLINT second;  
   SQLUINTEGER fraction;  
   SQLSMALLINT timezone_hour;  
   SQLSMALLINT timezone_minute;  
} SQL_SS_TIMESTAMPOFFSET_STRUCT;  

I couldn't find an offical documentation for Postgres right away.

I will discuss in the team what the best option is to tackle this issue.

@pacman82
Copy link
Owner

pacman82 commented Sep 4, 2022

I see, this is why the timestamp logic of odbc2parquet does not trigger. It is just considered an "other" type, and fetched as string. One way of tackling this could be to detect the database system before analyzing the schema. From the odbc-api test suite we know Microsoft SQL Server will identify itself as such: https://github.com/pacman82/odbc-api/blob/4721fd050b5813021538591810869e05f3039383/odbc-api/tests/integration.rs#L2841

We could then introduce database specify logic to handle "other" types. Just a thought. I'll sleep over it.

@leo-schick
Copy link
Author

👍

@pacman82
Copy link
Owner

pacman82 commented Sep 4, 2022

If you could find out what the struct for PostgreSQL would look like, we may be able to support that, too.

@leo-schick
Copy link
Author

I could not find any specific documentation about that. I guess because timestamp with time zone is quite new in PostgreSQL (compared to the ODBC implementation). Before this type was implemented, the default type takes the local time zone and converts it to UTC when it is stored in the DB and it is translated back when the data is read. The closest I could find is this odbc documentation which notes this logic but does not give explecite information about its logic (unfortunately I couldn't find a deep ODBC documentation on https://www.postgresql.org/docs/)

I guess the best way would be to try this out. I currently don't have the time to build a test setup for PostgreSQL with ODBC ...

@pacman82
Copy link
Owner

Hello @leo-schick , odbc2parquet 0.13.0 has been released. It uses instant semantics, if the database is "Microsoft SQL Server" and the relational type is DATETIMEOFFSET. I'll leave this issue open to check if supporting PostgreSQL is an easy win. MSSQL is already supported with the current release.

Cheers, Markus

@pacman82
Copy link
Owner

Tests with Postgres have shown, that to an ODBC client both TIMESTAMP and TIMESTAMPTZ are just presented as ODBC type TIMESTAMP. However, Postgres seems to return the timestamp as UTC for the latter. I'll explain it better with more time, but in short there is nothing I can or should implement for Postgres.

@leo-schick
Copy link
Author

@pacman82 There is a way to tell postgres to export the data in a specific timezone, e.g. UTC. The psql command interprets the environment variable PGTZ to set the time zone for the client. Maybe something comparable is available for ODBC e.g. as ODBC connection parameter? just an idea....

@pacman82
Copy link
Owner

In my tests it seems that Postgres does convert the timestamp to UTC. It is just that from the client side I have no way of telling wether this is instant semantics or not. So the parquet logical type even is timestamp, but it will not indicate that it is adjusted to UTC. Overall the postgres ODBC drivers behaviour is nicer for clients who do not have any special treatment for Postgres. The MSSQL chooses to use a custom type, as such the default behivour of clients is unlikely to be satisfactory, but now that I've special code for it, it is overall nicer.

Any ideas for configuring ODBC are better placed with the maintainers of the PostgreSQL ODBC driver. Yet, as written before. At least to me it seems to already do what you want.

@pacman82
Copy link
Owner

@leo-schick Almost forgot to ask. Does the tool work for you now?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants