Support MSSQL data type TIME #285

Closed · leo-schick opened this issue Oct 28, 2022 · 10 comments
Labels: enhancement (New feature or request)

@leo-schick

Currently, the data type time from MSSQL is exported as BYTE_ARRAY, UTF8, String:

Column description from parquet-tools:

############ Column(OrderTime) ############
name: OrderTime
path: OrderTime
max_definition_level: 1
max_repetition_level: 0
physical_type: BYTE_ARRAY
logical_type: String
converted_type (legacy): UTF8
compression: ZSTD (space_saved: 62%)

I would have expected a parquet TIME type in logical_type/converted_type.
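
For reference, the written schema can also be checked programmatically; a minimal sketch with pyarrow, assuming a hypothetical file name orders.parquet:

import pyarrow.parquet as pq

schema = pq.read_schema("orders.parquet")  # hypothetical file name
print(schema.field("OrderTime").type)
# prints "string" for the current output; a time type such as
# time64[ns] would be expected instead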

pacman82 added the enhancement label on Oct 30, 2022
@pacman82
Owner

Support for time is still missing. I had a branch two years ago (master...time), yet never merged it, because I could not find a column type which would identify itself as SQL_TYPE_TIME (92). The time data type of an MSSQL table seems to be another custom type and identifies as -154. So this seems to require custom code for MSSQL. This time around, though, I cannot fathom why the default SQL type would not do.
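
For illustration, the two type codes side by side; the constant values follow the ODBC and SQL Server driver headers, while the helper function is hypothetical:

SQL_TYPE_TIME = 92   # standard ODBC time type
SQL_SS_TIME2 = -154  # SQL Server specific extension for time

def is_time_column(odbc_type_code: int) -> bool:
    # Custom MSSQL support would mean treating the driver-specific
    # code the same way as the standard one. (Hypothetical helper.)
    return odbc_type_code in (SQL_TYPE_TIME, SQL_SS_TIME2)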

Valid feature request; it might take a while until I get to it, though.

@leo-schick
Author

You are right, MSSQL uses a custom type as described here: https://learn.microsoft.com/en-us/sql/relational-databases/native-client-odbc-date-time/data-type-support-for-odbc-date-and-time-improvements?view=sql-server-ver16

The type is called SQL_SS_TIME2 and has the following structure:

typedef struct tagSS_TIME2_STRUCT {
   SQLUSMALLINT hour;     // 0-23
   SQLUSMALLINT minute;   // 0-59
   SQLUSMALLINT second;   // 0-59
   SQLUINTEGER  fraction; // fractional seconds, in nanoseconds
} SQL_SS_TIME2_STRUCT;
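
A sketch of how those fields map onto a single integer, assuming the Parquet Time(NANOS) representation of nanoseconds since midnight (the helper name is hypothetical):

def time2_to_nanos(hour: int, minute: int, second: int, fraction_ns: int) -> int:
    # SQL_SS_TIME2_STRUCT carries the fraction in nanoseconds; Parquet's
    # Time(NANOS) logical type stores nanoseconds since midnight in an INT64.
    return ((hour * 60 + minute) * 60 + second) * 1_000_000_000 + fraction_ns

# 13:45:30.1234567 as a time(7) value (100 ns resolution):
assert time2_to_nanos(13, 45, 30, 123_456_700) == 49_530_123_456_700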

@pacman82
Owner

pacman82 commented Nov 5, 2022

odbc2parquet 0.14.0 is released, which maps TIME to Time with nanosecond precision. Mapping it to micro or milli depending on precision seemed to trigger upstream "not implemented" errors.

@pacman82
Owner

pacman82 commented Nov 8, 2022

@leo-schick Does odbc2parquet 0.14.0 resolve your issue?

@pacman82
Owner

Closing this for now

@leo-schick
Author

Hi @pacman82,

Sorry for the late response; I was quite busy with some other tasks.

Unfortunately, this does not seem to work as it should. I am not 100% sure how this should be solved, though...

Here are my validation results:

I have a table with a SQL time column:
[screenshot of the table's time column]

After upgrading, parquet-tools now shows that logical_type is Time:

############ Column(OrderTime) ############
name: OrderTime
path: OrderTime
max_definition_level: 1
max_repetition_level: 0
physical_type: INT64
logical_type: Time(isAdjustedToUTC=false, timeUnit=nanoseconds)
converted_type (legacy): NONE
compression: SNAPPY (space_saved: 11%)

Reading with Apache Spark

When I use Apache Spark 3.3.0 to read it as a SQL TIMESTAMP type (the data type TIME is not supported inside Apache Spark SQL), I get this for the same record:
[screenshot of the incorrectly read value for the same record]

I just wonder how this gets messed up.
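
One possible workaround, sketched under the assumption that Spark surfaces the Time(NANOS) column as a plain long of nanoseconds since midnight (the path is hypothetical, and the formatting step assumes a UTC session timezone):

from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.read.parquet("orders.parquet")  # hypothetical path

# Nanoseconds since midnight -> seconds, then format as time of day.
df = df.withColumn(
    "order_time",
    F.from_unixtime((F.col("OrderTime") / 1_000_000_000).cast("long"), "HH:mm:ss"),
)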

Reading with Microsoft Synapse

When I use Microsoft Synapse, I get the following error message:

Column 'OrderTime' of type 'TIME' is not compatible with external data type 'Parquet physical type: INT64', please try with 'BIGINT'. File/External table name: '<table_name>'.

Probably because the converted_type (legacy) is missing.

I think we should find another way to solve this. Apache Spark is quite popular, and it should at least work there correctly.

@leo-schick
Author

P.S. I just noted that the SQL data type time is by default time(7), which is 100 ns precision. Using Time(..., timeUnit=nanoseconds) is then correct, and mapping converted_type (legacy) to NONE is correct as well; see here.
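
A quick sanity check of that claim; Parquet's legacy converted types only cover TIME_MILLIS and TIME_MICROS, so there is nothing lossless to put there for a 100 ns column:

# time(7) stores seven fractional digits -> 10**-7 s = 100 ns resolution.
resolution_ns = 10**9 // 10**7
assert resolution_ns == 100
# 100 ns fits neither milli- nor microseconds without loss, hence
# Time(NANOS) and converted_type NONE.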

The remaining question is then why we read it with Apache Spark...

@pacman82
Owner

pacman82 commented Dec 8, 2022

Hi @leo-schick, thanks for the response.

> The remaining question is then why we read it with Apache Spark

Do you mean 'why' or 'how'?

@pacman82
Owner

pacman82 commented Dec 9, 2022

I wonder if odbc2parquet should offer a flag to choose microseconds precision, or even if it should do so by default. What do you think, @leo-schick?

@leo-schick
Author

> Do you mean 'why' or 'how'?

I meant how. Or better: how do we best get “time” into the parquet file so that we can read it correctly in Apache Spark?

I am not yet sure about the flag for microseconds precision. I would propose a flag which always tries to convert to a converted_type if possible. I think I can build a workaround for myself as long as it works in Apache Spark. IMO, converting data should not be part of the odbc2parquet tool - maybe as an option if you would like to implement it, but I do not think I will use it.
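
To make the trade-off concrete, a minimal sketch of the truncation such a flag would perform (the function is hypothetical, not part of odbc2parquet):

def nanos_to_micros(nanos_since_midnight: int) -> int:
    # Truncating to microseconds drops the 100 ns digit of a time(7)
    # value, but the column can then carry the legacy TIME_MICROS
    # converted_type that older readers understand.
    return nanos_since_midnight // 1_000

# 13:45:30.1234567 -> 13:45:30.123456
assert nanos_to_micros(49_530_123_456_700) == 49_530_123_456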
