
Problem with varchar(MAX) #59

Closed
AdrianSkierniewski opened this issue May 10, 2021 · 12 comments
Labels: bug (Something isn't working)

Comments

@AdrianSkierniewski

Hi Markus,

I'm still using this nice tool and so far everything has worked as expected. But recently I encountered an edge case when trying to dump a column from MS SQL created using varchar(MAX).

This leads to a warning (only visible when using -vvv): WARN - State: 01004, Native error: 0, Message: [Microsoft][ODBC Driver 17 for SQL Server]String data, right truncation. The parquet file generated for this table contains an empty string instead of the proper value.
While debugging, I found that the ODBC buffer description has max_str_len set to 0, which is probably the root cause of this behavior.

2021-05-10T17:34:23+00:00 - DEBUG - ODBC column description for column 1: ColumnDescription { name: [116, 111, 107, 101, 110], data_type: Varchar { length: 0 }, nullability: Nullable }
2021-05-10T17:34:23+00:00 - DEBUG - ODBC buffer description for column 1: BufferDescription { nullable: true, kind: WText { max_str_len: 0 } }
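
As an aside, the name bytes in the column description above are plain UTF-8 and decode to the column name "token"; a tiny self-contained Rust snippet to verify:

fn main() {
    // The name bytes reported in the ODBC column description above, decoded.
    let name = String::from_utf8(vec![116, 111, 107, 101, 110]).unwrap();
    assert_eq!(name, "token"); // the affected varchar(MAX) column
}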

I think there are two problems here. One is that this truncation happens for varchar(MAX) columns (I don't know if there is anything you can do about that). The other is that the warning is silent while the resulting parquet files are corrupted (shouldn't there be a flag to be strict about dumping the data 1 to 1, without any transformation?).

Cheers,
Adrian

pacman82 added the bug label on May 10, 2021
@pacman82 (Owner)

Hello Adrian,

There are a whole bunch of things which can be improved here. I use this incident as a trigger to write them all down. This scenario is not so much an edge case as something which is likely to happen over and over again. There has already been a similar issue with a column reporting size zero (#17), although in that case it was a column containing geometry data. MSSQL returns 0 in general if it cannot give a reliable upper bound for the size of a value of the type. There is also the related subject of database drivers giving reliable upper bounds which are nevertheless ridiculously large (#43).

As a very minimal standard of correctness and usability I would expect the tool to explicitly ignore a column if it is not able to handle it sensibly. The tool should have given you the warning defined here:

"Ignoring column '{}' with index {}. Driver reported a display length of 0. \

Yet it did not do that. This is a bug, so I labelled the issue accordingly. Most likely you work on Windows, and therefore the tool uses UTF-16 by default to avoid depending on the local system encoding. Sadly, the if condition in the above code only checks for narrow text, not wide text.
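
To make the shape of the fix concrete, here is a rough, self-contained sketch (not the actual code in the repository; the variants just mirror the buffer kinds visible in the log above):

enum BufferKind {
    Text { max_str_len: usize },
    WText { max_str_len: usize },
}

fn should_ignore_column(kind: &BufferKind) -> bool {
    // A zero maximum length means the driver could not give an upper bound;
    // both the narrow and the wide (UTF-16) text variants must be covered.
    matches!(
        kind,
        BufferKind::Text { max_str_len: 0 } | BufferKind::WText { max_str_len: 0 }
    )
}

fn main() {
    assert!(should_ignore_column(&BufferKind::WText { max_str_len: 0 }));
    assert!(!should_ignore_column(&BufferKind::Text { max_str_len: 4000 }));
}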

The second issue here is a UX improvement: not showing warnings by default is probably a bad idea. Therefore I created #60.

Now, ignoring columns and reporting this to the user would not be buggy, but it would hardly make users happy either. Of course we want the data to appear in the parquet file. Getting to that point, however, will be a bit of a stony road. Here is why:

The way fetching data in bulk works in ODBC is that the application (i.e. this tool) has to provide buffers large enough to hold the desired number of rows. Each value in these column buffers gets the same amount of space, so the size which needs to be allocated for a single column is max_value_size * batch_size. That just does not work very well if the maximum value size is either very large or unknown.
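
As a back-of-the-envelope illustration (the numbers below are made up for the example, not the tool's defaults):

fn column_buffer_bytes(max_value_size: usize, batch_size: usize) -> usize {
    max_value_size * batch_size
}

fn main() {
    // Even a bounded VARCHAR(8000) column fetched in batches of 100,000 rows
    // reserves roughly 800 MB for that single column; for varchar(MAX) there
    // is no sensible upper bound to plug in at all.
    assert_eq!(column_buffer_bytes(8_000, 100_000), 800_000_000);
}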

To mitigate this, three strategies come to mind:

  1. Some drivers allow refetching the last batch. So the tool could guess a reasonable maximum value size (let's say 4 kB), then increase the buffer and refetch if actual values exceed that limit. That strategy requires some upstream changes to odbc-api first though, and it would not work with all drivers. Yet ODBC can be asked whether a given connection supports this behaviour.

  2. In the presence of potentially large columns, we could switch to fetching row by row. In that case it would be desirable to at least bind all columns of known size to preallocated buffers large enough to hold one value, in order to reduce the number of roundtrips per row. Some minor adjustments to upstream odbc-api are also needed for this strategy.

  3. A user could provide domain knowledge about the values in a column which would otherwise be ignored, by manually specifying a maximum size.

These strategies are not mutually exclusive, and the ultimate version of this tool would eventually offer all of these possibilities, but as of now my plan is to implement 2 first, as this should make the tool work out of the box in almost any situation. If you need something urgently though, I could probably pull 3 off the fastest. A workaround you might also try is to cast the column explicitly to a VARCHAR with a fixed size (see the sketch below).
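
For illustration only, with a placeholder table name and an arbitrarily chosen bound of 4000 ("token" is the column name from your log): the workaround puts the cast into the query handed to the tool, so the driver can report a real maximum length instead of 0.

fn main() {
    // Hypothetical workaround: bound the column in the query itself instead of
    // altering the table. The table name and the 4000 limit are placeholders.
    let query = "SELECT CAST(token AS VARCHAR(4000)) AS token FROM my_table;";
    println!("{}", query);
}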

Cheers, Markus

@pacman82 (Owner)

Damn, I already wrote a test for this scenario.

fn query_varchar_max() {

Sadly it is currently only ever executed on Linux, so I did not catch the UTF-16 encoding issue. Oh well, I guess one cannot prevent them all ...

@pacman82 (Owner)

Version 0.6.4 has been released; it improves the warnings and enables them by default. This would at least have saved you some time debugging the issue.

@AdrianSkierniewski (Author)

Thank you @pacman82 for the detailed explanation of what is going on there. For now, I was able to work around this by not using varchar(MAX); I could do that because I own the table. I agree that the second option sounds good, and it should solve the main problem, which is exporting the data into parquet. I assume we'll still see some warning in that case, right?

@pacman82 (Owner)

Hi @AdrianSkierniewski, yes. I am unsure about the log level (info or warn? Suggestions?), but I want to make it visible which mode the tool is in, since the row-by-row version is expected to be much(!) slower. The message should hint at the columns which force the mode switch, so one can decide whether they could be changed.

The row-by-row mode will likely be much(!) slower with most drivers and setups. Depending on how large your table is, I am not sure whether switching the column type is the solution and the row-by-row mode merely the workaround. Still, I want this mode to be the next larger feature in the tool. Yet this will be a completely independent rewrite of the query code, so I'll need a long weekend with lots of free time to get there.

Cheers, Markus

@AdrianSkierniewski (Author)

I think that since it will be much slower and is just a workaround, there should be a warning message about it. If this message is not clearly visible, people may start reporting performance issues without even knowing about the underlying cause.
I was able to solve my problem, and since there is always the option of casting a column to a fixed-length varchar, plus the warning message will be visible, I don't think this will block anyone. So please take your time working on it.

@pacman82 (Owner)

Thanks @AdrianSkierniewski, I'll take your advice and log it as a warning.

Cheers, Markus

@leo-schick

Hi @pacman82, is there any update on this?

I have now run into the same issue. For MSSQL I would suggest mapping varchar(max) to varchar(8000) and nvarchar(max) to nvarchar(4000) to at least make it work. This doesn't need to be the perfect solution code-wise IMO, but it would be great if it worked.

Like @AdrianSkierniewski, I own the table and can therefore make it work, but it is just cumbersome to have to remember that the column types varchar(max) and nvarchar(max) are not supported :-/

@pacman82 (Owner) commented Mar 9, 2023

Hi @leo-schick,

you can make it work by setting the --column-length-limit 8000 option.

I should have mentioned this option before 😅. I also did not find a fancier solution that is feasible. If you cannot adapt your table schema, IMHO the limit should at least be set explicitly by the user of odbc2parquet, so the user is aware of the limitation.

Best, Markus

@pacman82 (Owner)

@leo-schick Does the flag work for you?

@leo-schick

@pacman82 I have now changed the type in the database table to make it work. But I think it would be great if odbc2parquet were able to handle (n)varchar(max) without this extra flag...

@pacman82 (Owner)

ODBC 4.0 is going to specify length exception behaviour (https://github.com/microsoft/ODBC-Specification/blob/master/ODBC%204.0.md#611-usage). I do not know when it will be released, or when support will be widespread. Until then it seems to me that letting the user define an upper bound is the best possible solution to this.
