Add Nested Types to the Appender #150
Conversation
# Conflicts:
#	appender.go
#	appender_test.go
appender_test.go (Outdated)
@@ -28,7 +27,8 @@ const (
  float REAL,
  double DOUBLE,
  string VARCHAR,
- bool BOOLEAN
+ bool BOOLEAN,
+ -- blob BLOB
There previously was no test for `BLOB` types; currently it is causing a segmentation fault at `appender.go:443`:

`state = C.duckdb_append_data_chunk(*a.appender, chunk)`
@marcboeker I've looked into this, and I don't think `BLOB` types have ever been fully supported. Running this test on the current main causes a segmentation fault. To my understanding, `BLOB`s at their core look like `[]uint8` in Go, so they overlap with plain slices of `uint8`, which, if I'm not missing something, means we have to choose which one to support. I've opened an issue -> #152
Hi @maiadegraaf, great PR! I added quite a lot of comments (again 😅), but I think that now we are almost there!

- Can you double-check that we `panic` only on internal errors? In all other cases, we return an error to the user. Let's also cover all user errors with tests.
- Can you double-check that we call `Close` properly on all test results (db, conn, rows, etc.) to avoid any memory leaks (I am missing destructors)? I think you already do almost everywhere, but just to be sure — see the sketch below.
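For the `Close` point above, a minimal sketch of the pattern (not the PR's actual test code; the package name and query are made up for illustration):

```go
package appendertest // hypothetical package name, for illustration only

import (
	"context"
	"database/sql"
	"testing"

	_ "github.com/marcboeker/go-duckdb"
)

func TestCloseEverything(t *testing.T) {
	// Open an in-memory DuckDB database; defer Close so the test never leaks it.
	db, err := sql.Open("duckdb", "")
	if err != nil {
		t.Fatal(err)
	}
	defer db.Close()

	// Grab a dedicated connection and close it as well.
	conn, err := db.Conn(context.Background())
	if err != nil {
		t.Fatal(err)
	}
	defer conn.Close()

	// Rows must also be closed, otherwise the underlying statement leaks.
	rows, err := conn.QueryContext(context.Background(), "SELECT 42")
	if err != nil {
		t.Fatal(err)
	}
	defer rows.Close()

	for rows.Next() {
		var v int
		if err := rows.Scan(&v); err != nil {
			t.Fatal(err)
		}
	}
	if err := rows.Err(); err != nil {
		t.Fatal(err)
	}
}
```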
@marcboeker, what motivates removing significant test coverage to make the tests run faster, especially since these tests cover the new nested functionality?
Please review the ... Although benchmarks can be helpful, they are not present for the rest of go-duckdb. Therefore, I have removed them. My aim is to create test cases that cover a significant portion of the code, are comprehensible, and can serve as a form of documentation. There are still missing test cases for the appender, such as UUID.
Thanks for the explanation.
The reason for having more than ten rows is that DuckDB's standard vector size is 2048. I see that the primitive types now cover 3k rows, which tests the creation of multiple chunks. I missed that change. Child vectors can exceed the standard vector size for nested types, so testing this might still be interesting.
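To make that concrete, here is a hedged sketch (not the PR's actual test) of a test that crosses the 2048-row vector boundary so the appender has to produce more than one data chunk; the table name and row count are illustrative, and it assumes the `NewConnector`/`NewAppenderFromConn` API:

```go
package appendertest // hypothetical package name, for illustration only

import (
	"context"
	"database/sql"
	"testing"

	"github.com/marcboeker/go-duckdb"
)

func TestAppenderMultipleChunks(t *testing.T) {
	connector, err := duckdb.NewConnector("", nil)
	if err != nil {
		t.Fatal(err)
	}

	// database/sql handles the DDL; the appender needs a raw driver connection.
	db := sql.OpenDB(connector)
	defer db.Close()
	if _, err := db.Exec(`CREATE TABLE chunk_test (i INTEGER)`); err != nil {
		t.Fatal(err)
	}

	conn, err := connector.Connect(context.Background())
	if err != nil {
		t.Fatal(err)
	}
	defer conn.Close()

	appender, err := duckdb.NewAppenderFromConn(conn, "", "chunk_test")
	if err != nil {
		t.Fatal(err)
	}
	defer appender.Close()

	// 3000 rows > 2048 (DuckDB's standard vector size), so the appender
	// must fill one chunk completely and start a second one.
	for i := 0; i < 3000; i++ {
		if err := appender.AppendRow(int32(i)); err != nil {
			t.Fatal(err)
		}
	}
	if err := appender.Flush(); err != nil {
		t.Fatal(err)
	}

	var count int
	if err := db.QueryRow(`SELECT count(*) FROM chunk_test`).Scan(&count); err != nil {
		t.Fatal(err)
	}
	if count != 3000 {
		t.Fatalf("expected 3000 rows, got %d", count)
	}
}
```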
It is possible to use a regex to run only specific tests, and it is also possible to group slow tests into a common expression to exclude (e.g., ...).

This is mostly me trying to understand the project and its structure better for future contributions. Maybe it is worth setting up a section in the contributing guidelines about this?
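Not an existing convention in this repo, just a sketch of what that could look like with the standard library: slow tests can share a name prefix for `go test -run`'s regex and be gated behind `testing.Short()` (the test name below is hypothetical):

```go
package appendertest // hypothetical package name, for illustration only

import "testing"

// Slow tests can share a name prefix so a single -run regex selects or
// excludes them, for example:
//
//	go test -run 'TestAppenderSlow'   // run only the slow appender tests
//	go test -short ./...              // skip everything gated below
func TestAppenderSlowManyChunks(t *testing.T) {
	if testing.Short() {
		t.Skip("skipping slow appender test in -short mode")
	}
	// ... append several thousand rows here ...
}
```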
I've done some more refactoring and added tests for the UUID type, `time.Time`, and `[]byte`. Unfortunately, appending a blob does not work at the moment, hence the ...

@maiadegraaf Do you know what type a `DUCKDB_TYPE_BLOB` is internally? Is it a primitive type or a list (`DUCKDB_TYPE_LIST`) of `uint8` (which I doubt)? The current implementation triggers a segmentation violation in the following line:
The logical type we set is:
One weird thing is that the appender is a bit flaky under Ubuntu. Locally on macOS and in the CI under macOS/FreeBSD, everything works reliably. It fails at the same operation as the blob appending:
I had a look at the failing test. The reason for the segmentation fault is that we do not set the validity mask correctly for the child vectors of the ... I guess this worked on macOS but not on Ubuntu because Ubuntu probably default-initialised that bit to one (valid), whereas macOS probably default-initialised it to zero (invalid)... or some other bit-level shenanigans. I have a fix ready, and I'm also working on some other changes. I'll open a PR to @maiadegraaf once I'm done. I'll also get back to you about the ...
@taniabogatsch Thanks, sounds great. I'm curious to see how you have fixed it.
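For context, a hedged sketch of what such a fix involves (this is not the actual change from the PR referenced below; the helper name is hypothetical, and the `duckdb_*` calls are DuckDB C API functions, assuming its headers are available as they are inside go-duckdb): before writing a row into a child vector, the validity mask is made writable and the row's bit is set explicitly instead of relying on whatever the freshly allocated memory happens to contain.

```go
package duckdb

/*
#include <duckdb.h>
*/
import "C"

// markListChildValidity makes the validity mask of a LIST column's child
// vector writable and sets the bit for one row. Skipping this step leaves
// the bit uninitialised, which is exactly the platform-dependent behaviour
// described above: valid by accident on one OS, a segfault on another.
func markListChildValidity(list C.duckdb_vector, row C.idx_t, valid bool) {
	child := C.duckdb_list_vector_get_child(list)
	C.duckdb_vector_ensure_validity_writable(child)
	mask := C.duckdb_vector_get_validity(child)
	if valid {
		C.duckdb_validity_set_row_valid(mask, row)
	} else {
		C.duckdb_validity_set_row_invalid(mask, row)
	}
}
```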
Validity mask fix and changes towards idiomatic Go
I used the logical type to switch in the ...
W.r.t. the ...: to the best of my knowledge, Go does not distinguish between ... cc @marcboeker, what do you think? :)
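To illustrate the `[]byte`/`[]uint8` point with a standalone snippet (not part of the PR): `byte` is defined as an alias for `uint8`, so the two slice types are identical, and an appender's type switch cannot tell a BLOB-ish value apart from a plain slice of `uint8`.

```go
package main

import (
	"fmt"
	"reflect"
)

func main() {
	blob := []byte{0xDE, 0xAD, 0xBE, 0xEF}
	var raw []uint8 = blob // no conversion needed: byte is an alias for uint8

	// Both values report the same dynamic type; a type switch with both
	// `case []uint8:` and `case []byte:` would not even compile
	// (duplicate case), so the driver has to pick one meaning for it.
	fmt.Println(reflect.TypeOf(blob) == reflect.TypeOf(raw)) // true
	fmt.Println(reflect.TypeOf(blob))                        // []uint8
}
```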
More Appender changes
Thanks for the improvements. I really like the generics approach. I'm going to merge this and we can fix the small changes later.
fn SetColValue

// The type of the column.
ddbType C.duckdb_type
I would call it `duckdbType` or `dbType`, as it is clearer than `ddbType`, which reads like a typo.
return nil, fmt.Errorf("can't create appender")
}
var appender C.duckdb_appender
state := C.duckdb_appender_create(*dbConn.con, cSchema, cTable, &appender)
Maybe we should shorten `dbConn` simply to `conn`, as the corresponding struct field in the `Appender` is also called `c`.
return csPtr, slice
}

func initPrimitive[T any](ddbType C.duckdb_type) (colInfo, C.duckdb_logical_type) {
Awesome use of generics. Makes the code a lot cleaner 🎉 Maybe we should rename `ddbType` to `duckdbType` here too.
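As an aside, a simplified illustration of why the generic helper reads cleaner (the `column`/`set` names are made up for this sketch and are not the PR's actual types): one type parameter replaces a near-identical branch per DuckDB type when writing primitive values into a column buffer.

```go
package main

import "fmt"

// column is a hypothetical stand-in for the appender's per-column buffer.
type column[T any] struct {
	data []T
}

// set writes a primitive value of any type T at the given row; without
// generics this would need one nearly identical branch per supported type.
func set[T any](col *column[T], row int, val T) {
	for len(col.data) <= row {
		var zero T
		col.data = append(col.data, zero)
	}
	col.data[row] = val
}

func main() {
	ints := &column[int32]{}
	set(ints, 0, int32(42))

	strs := &column[string]{}
	set(strs, 1, "hello")

	fmt.Println(ints.data) // [42]
	fmt.Println(strs.data) // [ hello]
}
```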
Avoid writing to parquet if we can (appends only)

~Using the upstream driver has some shortcomings from the get-go: the interface decides on the schema based on passed types, so having untyped `nil`s doesn't help. But typed nils are also not handled. The [fork/PR](marcboeker/go-duckdb#150) fixes these issues, but I still get 0 rows on the test db (or motherduck).~

The PR is now merged upstream and tagged as 1.6.1.

~I had to use my [fork](https://github.com/disq/go-duckdb/tree/feat/allow-nulls-in-appender-first-row) to get around nils. Another way to do this would be to export `appender.initColTypes`; then we could just pass concrete types for init and use nullable data as normal.~

~Another issue: we create JSON types as `json`, but the appender doesn't know about it. Sending `string` instead (which is what the json alias points to internally in duckdb, varchar) doesn't work. Might need the newer appender implementation for this to go forward.~

Example config:

```yaml
kind: source
spec:
  name: test
  path: cloudquery/test
  registry: cloudquery
  version: v4.0.8
  # tables: ["*"]
  tables: [ "test_some_table" ]
  destinations: ["duckdb"]
  spec:
    num_clients: 1
    num_rows: 4
    num_sub_rows: 1
    # num_clients: 10
    # num_rows: 10000
    num_sub_rows: 0
    # num_sub_rows: 10
---
kind: destination
spec:
  name: duckdb
  version: v0.0.0
  registry: grpc
  path: localhost:7777
  # write_mode: "overwrite"
  write_mode: "append"
  spec:
    connection_string: tmp_duck.db
    # connection_string: md:my_db
    # Optional parameters
    batch_size: 10000
    # batch_size_bytes: 14194304 # 14 MiB
    # debug: false
```
I'm currently working on adding nested types to the appender. I haven't finished it yet, but I'm opening up a draft for better coordination.
The Big Changes:

- `STRUCT`s and `LIST`s are now both supported, including nested cases:
  - `tripleNestedListInt INT[][][]`
  - `topWrapper STRUCT(Wrapper STRUCT(Base STRUCT(I INT, V VARCHAR)))`
  - `mixList STRUCT(A STRUCT(L VARCHAR[]), B STRUCT(L INT[])[])[]`
- `InitializeColumnTypesAndInfos` reduces the number of type switches to a single one that runs only once.

@taniabogatsch and I are still ironing out some final changes here.
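A hedged sketch of what appending one of the new nested types might look like from user code (the table, values, and the assumption that nested Go slices map to nested `LIST` columns are illustrative; the connector/DDL setup mirrors the chunking sketch earlier in the thread):

```go
package main

import (
	"context"
	"database/sql"
	"log"

	"github.com/marcboeker/go-duckdb"
)

func main() {
	connector, err := duckdb.NewConnector("", nil)
	if err != nil {
		log.Fatal(err)
	}

	// database/sql handles the DDL; the appender needs a raw driver connection.
	db := sql.OpenDB(connector)
	defer db.Close()
	if _, err := db.Exec(`CREATE TABLE nested_tbl (id INTEGER, vals INTEGER[][])`); err != nil {
		log.Fatal(err)
	}

	conn, err := connector.Connect(context.Background())
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()

	appender, err := duckdb.NewAppenderFromConn(conn, "", "nested_tbl")
	if err != nil {
		log.Fatal(err)
	}
	defer appender.Close()

	// A nested Go slice is appended into the nested LIST column.
	if err := appender.AppendRow(int32(1), [][]int32{{1, 2}, {3}}); err != nil {
		log.Fatal(err)
	}
	if err := appender.Flush(); err != nil {
		log.Fatal(err)
	}
}
```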