
Problem with varchar(MAX) #59

Closed
AdrianSkierniewski opened this issue May 10, 2021 · 12 comments
Labels: bug (Something isn't working)

Comments

@AdrianSkierniewski

Hi Markus,

I'm still using this nice tool and so far everything has worked as expected. But recently I encountered an edge case when trying to dump a column from MS SQL created using varchar(MAX).

This leads to a warning (only visible when using -vvv): WARN - State: 01004, Native error: 0, Message: [Microsoft][ODBC Driver 17 for SQL Server]String data, right truncation. The parquet file generated for this table contains an empty string instead of the proper value.
While debugging, I found that the ODBC buffer description has max_str_len set to 0, which is probably the root cause of this behavior.

2021-05-10T17:34:23+00:00 - DEBUG - ODBC column description for column 1: ColumnDescription { name: [116, 111, 107, 101, 110], data_type: Varchar { length: 0 }, nullability: Nullable }
2021-05-10T17:34:23+00:00 - DEBUG - ODBC buffer description for column 1: BufferDescription { nullable: true, kind: WText { max_str_len: 0 } }
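
As an aside, the name bytes in the column description above are plain UTF-8 and decode to the column name "token"; a tiny self-contained Rust snippet to verify:

fn main() {
    // The name bytes reported in the ODBC column description above, decoded.
    let name = String::from_utf8(vec![116, 111, 107, 101, 110]).unwrap();
    assert_eq!(name, "token"); // the affected varchar(MAX) column
}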

I think there are two problems here. One is that this truncation happens for varchar(MAX) columns (I don't know if there is anything you can do about that). The other is that the warning is silent while the resulting parquet files are corrupted (shouldn't there be a flag to be strict about dumping the data 1 to 1, without any transformation?).

Cheers,
Adrian

pacman82 added the bug label on May 10, 2021
@pacman82 (Owner)

Hello Adrian,

There are a whole bunch of things which can be improved here. I use this incident as a trigger to write them all down. This scenario is not so much an edge case as something which is likely to happen over and over again. There has already been a similar issue with a column reporting size zero (#17), although in that case it was a column containing geometry data. MSSQL returns 0 in general if it cannot give a reliable upper bound for the size of a value of the type. There is also the related subject of database drivers giving reliable upper bounds which are nevertheless ridiculously large (#43).

As a very minimal standard of correctness and usability I would expect the tool to explicitly ignore a column if it is not able to handle it sensibly. The tool should have given you the warning defined here:

"Ignoring column '{}' with index {}. Driver reported a display length of 0. \

Yet it did not do that. This is a bug, so I labelled the issue accordingly. Most likely you work on Windows, and therefore the tool uses UTF-16 by default to avoid depending on the local system encoding. Sadly, the if condition in the above code only checks for narrow text, not wide text.
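
To make the shape of the fix concrete, here is a rough, self-contained sketch (not the actual code in the repository; the variants just mirror the buffer kinds visible in the log above):

enum BufferKind {
    Text { max_str_len: usize },
    WText { max_str_len: usize },
}

fn should_ignore_column(kind: &BufferKind) -> bool {
    // A zero maximum length means the driver could not give an upper bound;
    // both the narrow and the wide (UTF-16) text variants must be covered.
    matches!(
        kind,
        BufferKind::Text { max_str_len: 0 } | BufferKind::WText { max_str_len: 0 }
    )
}

fn main() {
    assert!(should_ignore_column(&BufferKind::WText { max_str_len: 0 }));
    assert!(!should_ignore_column(&BufferKind::Text { max_str_len: 4000 }));
}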

The second issue here is a UX improvement: not showing warnings by default is probably a bad idea. Therefore I created #60.

Now, ignoring columns and reporting this to the user would not be buggy, but it would hardly make users happy either. Of course we want the data to appear in the parquet file. Getting to that point, however, will be a bit of a stony road. Here is why:

The way fetching data in bulk works in ODBC is that the application (i.e. this tool) has to provide buffers large enough to hold the desired number of rows. Each value in these column buffers gets the same amount of space, so the size which needs to be allocated for a single column is max_value_size * batch_size. That just does not work very well if the maximum value size is either very large or unknown.
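
As a back-of-the-envelope illustration (the numbers below are made up for the example, not the tool's defaults):

fn column_buffer_bytes(max_value_size: usize, batch_size: usize) -> usize {
    max_value_size * batch_size
}

fn main() {
    // Even a bounded VARCHAR(8000) column fetched in batches of 100,000 rows
    // reserves roughly 800 MB for that single column; for varchar(MAX) there
    // is no sensible upper bound to plug in at all.
    assert_eq!(column_buffer_bytes(8_000, 100_000), 800_000_000);
}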

To mitigate this, three strategies come to mind:

  1. Some drivers allow refetching the last batch. So the tool could guess a reasonable maximum value size (let's say 4 kB), then increase the buffer and refetch if actual values exceed that limit. That strategy requires some upstream changes to odbc-api first though, and it would not work with all drivers. Yet ODBC can be asked whether a given connection supports this behaviour.

  2. In the presence of potentially large columns, we could switch to fetching row by row. In that case it would be desirable to at least bind all columns of known size to preallocated buffers large enough to hold one value, in order to reduce the number of roundtrips per row. Some minor adjustments to upstream odbc-api are also needed for this strategy.

  3. A user could provide domain knowledge about the values in a column which would otherwise be ignored, by manually specifying a maximum size.

These strategies are not mutually exclusive, and the ultimate version of this tool would eventually offer all of these possibilities, but as of now my plan is to implement 2 first, as this should make the tool work out of the box in almost any situation. If you need something urgently though, I could probably pull 3 off the fastest. A workaround you might also try is to cast the column explicitly to a VARCHAR with a fixed size (see the sketch below).
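
For illustration only, with a placeholder table name and an arbitrarily chosen bound of 4000 ("token" is the column name from your log): the workaround puts the cast into the query handed to the tool, so the driver can report a real maximum length instead of 0.

fn main() {
    // Hypothetical workaround: bound the column in the query itself instead of
    // altering the table. The table name and the 4000 limit are placeholders.
    let query = "SELECT CAST(token AS VARCHAR(4000)) AS token FROM my_table;";
    println!("{}", query);
}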

Cheers, Markus

@pacman82 (Owner)

Damn, I already wrote a test for this scenario.

fn query_varchar_max() {

Sadly it is currently only ever executed on Linux, so I did not catch the UTF-16 encoding issue. Oh well, I guess one cannot prevent them all ...

@pacman82 (Owner)

Version 0.6.4 has been released; it improves the warnings and enables them by default. This would at least have saved you some time debugging the issue.

@AdrianSkierniewski (Author)

Thank you @pacman82 for the detailed explanation of what is going on there. For now, I was able to work around this by not using varchar(MAX); I could do that because I own the table. I agree that the second option sounds good, and it should solve the main problem, which is exporting the data into parquet. I assume we'll still see some warning in that case, right?

@pacman82 (Owner)

Hi @AdrianSkierniewski, yes. I am unsure about the log level (info or warn? Suggestions?), but I want to make it visible which mode the tool is in, since the row-by-row version is expected to be much(!) slower. The message should hint at the columns which force the mode switch, so one can decide whether they could be changed.

The row-by-row mode will likely be much(!) slower with most drivers and setups. Depending on how large your table is, I am not sure whether switching the column type is the solution and the row-by-row mode merely the workaround. Still, I want this mode to be the next larger feature in the tool. Yet this will be a completely independent rewrite of the query code, so I'll need a long weekend with lots of free time to get there.

Cheers, Markus

@AdrianSkierniewski (Author)

I think that since it will be much slower and is just a workaround, there should be a warning message about it. If this message is not clearly visible, people may start reporting performance issues without even knowing about the underlying cause.
I was able to solve my problem, and since there is always the option of casting a column to a fixed-length varchar, plus the warning message will be visible, I don't think this will block anyone. So please take your time working on it.

@pacman82 (Owner)

Thanks @AdrianSkierniewski, I'll take your advice and log it as a warning.

Cheers, Markus

@leo-schick

Hi @pacman82, is there any update on this?

I have now run into the same issue. For MSSQL I would suggest mapping varchar(max) to varchar(8000) and nvarchar(max) to nvarchar(4000) to at least make it work. This doesn't need to be the perfect solution code-wise IMO, but it would be great if it worked.

Like @AdrianSkierniewski, I own the table and can therefore make it work, but it is just cumbersome to have to remember that the column types varchar(max) and nvarchar(max) are not supported :-/

@pacman82 (Owner) commented Mar 9, 2023

Hi @leo-schick,

you can make it work by setting the --column-length-limit 8000 option.

I should have mentioned this option before 😅. I also did not find a fancier solution that is feasible. If you cannot adapt your table schema, IMHO the limit should at least be set explicitly by the user of odbc2parquet, so the user is aware of the limitation.

Best, Markus

@pacman82 (Owner)

@leo-schick Does the flag work for you?

@leo-schick

@pacman82 I have now changed the type in the database table to make it work. But I think it would be great if odbc2parquet were able to handle (n)varchar(max) without this extra flag...

@pacman82 (Owner)

ODBC 4.0 is going to specify length exception behaviour (https://github.com/microsoft/ODBC-Specification/blob/master/ODBC%204.0.md#611-usage). I do not know when it will be released, or when support will be widespread. Until then it seems to me that letting the user define an upper bound is the best possible solution to this.
