Problem with varchar(MAX) #59
Hello Adrian,

There are a whole bunch of things which could be improved here. I use this incident as a trigger to write them all down. This scenario is not so much an edge case as something which is likely to happen over and over again. There has already been a similar issue with a column reporting size zero (#17), although in that case it was a column containing geometry. In general, MSSQL returns 0 if it cannot give a reliable upper bound for the size of a value of the type. There is also the related subject of database drivers giving reliable upper bounds which are nevertheless ridiculously large (#43).

As a very minimal standard of correctness and usability I would expect the tool to explicitly ignore a column if it is not able to handle it sensibly. The tool should have given you the warning defined here: Line 516 in 3309f5d
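Presumably the referenced check does something roughly like the following. This is a hypothetical sketch for illustration only; the function name and the warning text are assumptions, not the actual source at 3309f5d:

```rust
// Hypothetical sketch of the intended behaviour: ignore columns for which
// the driver cannot report a reliable upper bound for the value size.
fn buffer_length_for_column(name: &str, reported_size: usize) -> Option<usize> {
    if reported_size == 0 {
        // MSSQL reports 0 for types such as VARCHAR(MAX), where no upper
        // bound for the length of an individual value is known.
        eprintln!(
            "WARN: Ignoring column '{name}'. The driver reported a maximum \
             size of 0, so no sensible buffer can be allocated for it."
        );
        None
    } else {
        Some(reported_size)
    }
}

fn main() {
    assert_eq!(buffer_length_for_column("title", 255), Some(255));
    assert_eq!(buffer_length_for_column("body", 0), None); // varchar(MAX)
}
```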
Yet it did not do that. This is a bug, and so I labeled this issue accordingly. Most likely you work on Windows, and therefore the tool uses UTF-16 by default to avoid depending on the local system encoding; sadly, that code path apparently does not emit the warning.

The second issue here is a UX improvement: not showing warnings by default is probably a bad idea. Therefore I created #60.

Now, ignoring columns and reporting this to the user would not be buggy, but it would also hardly make users happy. Of course we want the data to appear in the Parquet file. Getting to this point, however, will be a bit of a stony road. Here is why: the way fetching data in bulk works in ODBC is that the application (i.e. this tool) has to provide buffers large enough to hold the desired number of rows. Each value in a column buffer occupies the same space, so the size which needs to be allocated for a single column is the maximum element size times the number of rows per batch (see the sketch below). A column which cannot report any upper bound for its element size therefore cannot be sized at all. To mitigate this, three strategies come to my mind.
All these strategies are not mutually exclusive, and the ultimate version of this tool would eventually offer all these possibilities, but as of now my plan is to implement

Cheers, Markus
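To illustrate the buffer arithmetic described above, here is a minimal, self-contained sketch. The function name and the example sizes are made up for illustration and are not taken from the tool:

```rust
// Bytes which must be preallocated for one column when fetching in bulk:
// every value in the column buffer gets the same, maximal amount of space.
fn column_buffer_size(max_element_size: usize, rows_per_batch: usize) -> usize {
    max_element_size * rows_per_batch
}

fn main() {
    // A VARCHAR(255) column fetched in batches of 1000 rows needs a known,
    // modest allocation:
    println!("{} bytes", column_buffer_size(255, 1000)); // 255000 bytes

    // A VARCHAR(MAX) column reports no upper bound (the driver reports 0),
    // so there is no finite size the tool could allocate up front.
}
```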
Damn, I already wrote a test for this scenario: Line 356 in 3309f5d
Sadly, it is currently only ever executed on Linux, so I did not catch the UTF-16 encoding issue. Oh well, I guess one cannot prevent them all ...
Thank you @pacman82 for the detailed explanation of what is going on there. For now, I was able to work around this by not using `varchar(MAX)`.
Hi @AdrianSkierniewski,

yes, I am unsure about the log level (info or warn? Suggestions?), but I want to make it visible which mode the tool is in, since the row-by-row version is likely to be much(!) slower with most drivers and setups. The information should hint at the columns which force the mode switch, so one can decide whether they could be changed. Depending on how large your table is, I am not sure whether switching the column type is the solution and row-by-row mode the workaround, or the other way around. Still, I want this mode to be the next larger feature in the tool. Yet this will be a completely independent rewrite of the query code, so I'll need a long weekend with lots of free time to get there.

Cheers, Markus
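A sketch of the kind of mode-switch decision described here. Everything below (struct, function, warning text) is hypothetical and only illustrates the idea of naming the columns which force the fallback:

```rust
// Hypothetical sketch: decide between bulk fetching and a row-by-row
// fallback, and name the columns which force the switch.
struct Column {
    name: String,
    reported_size: usize, // 0: the driver knows no upper bound
}

fn needs_row_by_row(columns: &[Column]) -> bool {
    let unbounded: Vec<&str> = columns
        .iter()
        .filter(|c| c.reported_size == 0)
        .map(|c| c.name.as_str())
        .collect();
    if unbounded.is_empty() {
        return false;
    }
    // Logged as a warning so users understand why the export is slower.
    eprintln!(
        "WARN: Falling back to row-by-row fetching, which is likely to be \
         much slower. Columns without a known upper bound: {}",
        unbounded.join(", ")
    );
    true
}

fn main() {
    let columns = vec![
        Column { name: "id".into(), reported_size: 8 },
        Column { name: "body".into(), reported_size: 0 }, // varchar(MAX)
    ];
    assert!(needs_row_by_row(&columns));
}
```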
I think that, since it will be much slower and it is just a workaround, there should be a warning message about it. If this message is not clearly visible, people may start reporting performance issues without even knowing about the underlying cause.
Thanks @AdrianSkierniewski, I'll take your advice and log it as a warning. Cheers, Markus
Hi @pacman82, is there any update on this? I have now run into the same issue. For MSSQL I would suggest to map

Like @AdrianSkierniewski, I own the table and therefore can make it work, but it is just cumbersome to remind yourself that column type
Hi @leo-schick, you can make it work by setting the

Should have mentioned this option before 😅. I also did not find a fancier solution which is feasible. If you cannot adapt your table schema, IMHO the limit should at least be explicitly set by the user of

Best, Markus
@leo-schick Does the flag work for you?
@pacman82 I have now changed the type in the database table to make it work. But I think it would be great when
ODBC 4.0 is going to specify length exception behaviour (https://github.com/microsoft/ODBC-Specification/blob/master/ODBC%204.0.md#611-usage). I do not know when it will be released, nor when support will be widespread. Until then, it seems to me that letting the user define an upper bound is the best possible solution to this.
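A sketch of how such a user-defined upper bound could be combined with the driver-reported size when sizing buffers. This is illustrative only; the helper name and the example values are assumptions, not the tool's actual behaviour:

```rust
// Hypothetical sketch: derive a usable per-element buffer length from the
// driver-reported size and an optional user-supplied limit.
fn effective_max_str_len(reported: usize, user_limit: Option<usize>) -> Option<usize> {
    match (reported, user_limit) {
        // Driver knows no upper bound (e.g. VARCHAR(MAX)): only a user
        // limit makes the column fetchable; values longer than the limit
        // would be truncated.
        (0, Some(limit)) => Some(limit),
        (0, None) => None, // column must be skipped or fetched row by row
        // Driver reports a bound, but the user limit may shrink
        // ridiculously large bounds (compare #43).
        (n, Some(limit)) => Some(n.min(limit)),
        (n, None) => Some(n),
    }
}

fn main() {
    assert_eq!(effective_max_str_len(0, Some(4096)), Some(4096));
    assert_eq!(effective_max_str_len(0, None), None);
    assert_eq!(effective_max_str_len(2_000_000_000, Some(4096)), Some(4096));
    assert_eq!(effective_max_str_len(255, None), Some(255));
}
```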
Hi Markus,
I'm still using this nice tool and so far everything has worked as expected. But recently I encountered an edge case when trying to dump a column from MS SQL created using `varchar(MAX)`. This leads to a warning (only visible when using `-vvv`): `WARN - State: 01004, Native error: 0, Message: [Microsoft][ODBC Driver 17 for SQL Server]String data, right truncation`. The Parquet file generated for this table contains an empty string instead of the proper value.

During debugging, I found that the ODBC buffer description has `max_str_len` set to `0`, which is probably the root cause of this behavior.

I think that there are two problems here. One is that this truncation happens for `varchar(MAX)` (I don't know if there is anything you could do about it). The second is that the warning is silent while the Parquet files get corrupted (shouldn't we have some flag to be stricter about dumping data 1 to 1 without any transformation?).
Cheers,
Adrian