Support MSSQL data type TIME #285

Closed · leo-schick opened this issue Oct 28, 2022 · 10 comments
Labels: enhancement (New feature or request)

@leo-schick

Currently, the data type time from MSSQL is exported as BYTE_ARRAY, UTF8, String:

Column description from parquet-tools:

############ Column(OrderTime) ############
name: OrderTime
path: OrderTime
max_definition_level: 1
max_repetition_level: 0
physical_type: BYTE_ARRAY
logical_type: String
converted_type (legacy): UTF8
compression: ZSTD (space_saved: 62%)

I would have expected a parquet TIME type in logical_type/converted_type.
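
For reference, the written schema can also be checked programmatically; a minimal sketch with pyarrow, assuming a hypothetical file name orders.parquet:

import pyarrow.parquet as pq

schema = pq.read_schema("orders.parquet")  # hypothetical file name
print(schema.field("OrderTime").type)
# prints "string" for the current output; a time type such as
# time64[ns] would be expected instead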

pacman82 added the enhancement label on Oct 30, 2022
@pacman82
Owner

Support for time is still missing. I had a branch two years ago (master...time), yet never merged it, because I could not find a column type which would identify itself as SQL_TYPE_TIME (92). The time data type of an MSSQL table seems to be another custom type and identifies as -154. So this seems to require custom code for MSSQL. This time around, though, I cannot fathom why the default SQL type would not do.
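
For illustration, the two type codes side by side; the constant values follow the ODBC and SQL Server driver headers, while the helper function is hypothetical:

SQL_TYPE_TIME = 92   # standard ODBC time type
SQL_SS_TIME2 = -154  # SQL Server specific extension for time

def is_time_column(odbc_type_code: int) -> bool:
    # Custom MSSQL support would mean treating the driver-specific
    # code the same way as the standard one. (Hypothetical helper.)
    return odbc_type_code in (SQL_TYPE_TIME, SQL_SS_TIME2)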

Valid feature request; it might take a while until I get to it, though.

@leo-schick
Author

You are right, MSSQL uses a custom type as described here: https://learn.microsoft.com/en-us/sql/relational-databases/native-client-odbc-date-time/data-type-support-for-odbc-date-and-time-improvements?view=sql-server-ver16

The type is called SQL_SS_TIME2 and has the following structure:

typedef struct tagSS_TIME2_STRUCT {
   SQLUSMALLINT hour;     // 0-23
   SQLUSMALLINT minute;   // 0-59
   SQLUSMALLINT second;   // 0-59
   SQLUINTEGER  fraction; // fractional seconds, in nanoseconds
} SQL_SS_TIME2_STRUCT;
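
A sketch of how those fields map onto a single integer, assuming the Parquet Time(NANOS) representation of nanoseconds since midnight (the helper name is hypothetical):

def time2_to_nanos(hour: int, minute: int, second: int, fraction_ns: int) -> int:
    # SQL_SS_TIME2_STRUCT carries the fraction in nanoseconds; Parquet's
    # Time(NANOS) logical type stores nanoseconds since midnight in an INT64.
    return ((hour * 60 + minute) * 60 + second) * 1_000_000_000 + fraction_ns

# 13:45:30.1234567 as a time(7) value (100 ns resolution):
assert time2_to_nanos(13, 45, 30, 123_456_700) == 49_530_123_456_700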

@pacman82
Owner

pacman82 commented Nov 5, 2022

odbc2parquet 0.14.0 is released, which maps TIME to Time with nanosecond precision. Mapping it to micro or milli depending on precision seemed to trigger upstream "not implemented" errors.

@pacman82
Owner

pacman82 commented Nov 8, 2022

@leo-schick Does odbc2parquet 0.14.0 resolve your issue?

@pacman82
Owner

Closing this for now

@leo-schick
Author

Hi @pacman82,

Sorry for the late response; I was quite busy with some other tasks.

Unfortunately, this does not seem to work as it should. I am not 100% sure how this should be solved, though...

Here are my validation results:

I have a table with a SQL time column:
[screenshot of the table's time column]

After upgrading, parquet-tools now shows that logical_type is Time:

############ Column(OrderTime) ############
name: OrderTime
path: OrderTime
max_definition_level: 1
max_repetition_level: 0
physical_type: INT64
logical_type: Time(isAdjustedToUTC=false, timeUnit=nanoseconds)
converted_type (legacy): NONE
compression: SNAPPY (space_saved: 11%)

Reading with Apache Spark

When I use Apache Spark 3.3.0 to read it as a SQL TIMESTAMP type (the data type TIME is not supported inside Apache Spark SQL), I get this for the same record:
[screenshot of the incorrectly read value for the same record]

I just wonder how this gets messed up.
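
One possible workaround, sketched under the assumption that Spark surfaces the Time(NANOS) column as a plain long of nanoseconds since midnight (the path is hypothetical, and the formatting step assumes a UTC session timezone):

from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.read.parquet("orders.parquet")  # hypothetical path

# Nanoseconds since midnight -> seconds, then format as time of day.
df = df.withColumn(
    "order_time",
    F.from_unixtime((F.col("OrderTime") / 1_000_000_000).cast("long"), "HH:mm:ss"),
)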

Reading with Microsoft Synapse

When I use Microsoft Synapse, I get the following error message:

Column 'OrderTime' of type 'TIME' is not compatible with external data type 'Parquet physical type: INT64', please try with 'BIGINT'. File/External table name: '<table_name>'.

Probably because the converted_type (legacy) is missing.

I think we should find another way to solve this. Apache Spark is quite popular, and it should at least work there correctly.

@leo-schick
Author

P.S. I just noted that the SQL data type time is by default time(7), which is 100 ns precision. Using Time(..., timeUnit=nanoseconds) is then correct, and mapping converted_type (legacy) to NONE is correct as well; see here.
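
A quick sanity check of that claim; Parquet's legacy converted types only cover TIME_MILLIS and TIME_MICROS, so there is nothing lossless to put there for a 100 ns column:

# time(7) stores seven fractional digits -> 10**-7 s = 100 ns resolution.
resolution_ns = 10**9 // 10**7
assert resolution_ns == 100
# 100 ns fits neither milli- nor microseconds without loss, hence
# Time(NANOS) and converted_type NONE.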

The remaining question is then why we read it with Apache Spark...

@pacman82
Owner

pacman82 commented Dec 8, 2022

Hi @leo-schick, thanks for the response.

> The remaining question is then why we read it with Apache Spark

Do you mean 'why' or 'how'?

@pacman82
Owner

pacman82 commented Dec 9, 2022

I wonder if odbc2parquet should offer a flag to choose microseconds precision, or even if it should do so by default. What do you think, @leo-schick?

@leo-schick
Author

> Do you mean 'why' or 'how'?

I meant how. Or better: how do we best get “time” into the parquet file so that we can read it correctly in Apache Spark?

I am not yet sure about the flag for microseconds precision. I would propose a flag which always tries to convert to a converted_type if possible. I think I can build a workaround for myself as long as it works in Apache Spark. IMO, converting data should not be part of the odbc2parquet tool - maybe as an option if you would like to implement it, but I do not think I will use it.
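
To make the trade-off concrete, a minimal sketch of the truncation such a flag would perform (the function is hypothetical, not part of odbc2parquet):

def nanos_to_micros(nanos_since_midnight: int) -> int:
    # Truncating to microseconds drops the 100 ns digit of a time(7)
    # value, but the column can then carry the legacy TIME_MICROS
    # converted_type that older readers understand.
    return nanos_since_midnight // 1_000

# 13:45:30.1234567 -> 13:45:30.123456
assert nanos_to_micros(49_530_123_456_700) == 49_530_123_456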
