
Add Nested Types to the Appender #150

Merged · 57 commits · Feb 16, 2024

Conversation

maiadegraaf (Contributor)

I'm currently working on adding nested types to the appender. I haven't finished it yet, but I'm opening a draft for better coordination.

The Big Changes:

  • STRUCTs and LISTs are now both supported, including arbitrarily nested combinations:
    • tripleNestedListInt INT[][][]
    • topWrapper STRUCT(Wrapper STRUCT(Base STRUCT(I INT, V VARCHAR)))
    • mixList STRUCT(A STRUCT(L VARCHAR[]), B STRUCT(L INT[])[])[]
  • The values are now added through recursive callback functions, initialized once in InitializeColumnTypesAndInfos. This reduces the number of type switches to a single one that runs only at initialization (see the sketch below).
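A minimal sketch of that recursive-callback idea, assuming a simplified colInfo. The duckdb_* calls are the DuckDB C API and setPrimitive appears later in this thread; everything else (initColInfo, setList, the struct fields) is illustrative rather than the PR's exact code:

```go
// colInfo stores a per-column callback tree: the single type switch runs
// once per column at initialization, and appending a value afterwards only
// invokes the stored callbacks.
type colInfo struct {
	setFn    func(ci *colInfo, rowIdx C.idx_t, val any)
	children []colInfo
}

func initColInfo(logicalType C.duckdb_logical_type) colInfo {
	switch C.duckdb_get_type_id(logicalType) {
	case C.DUCKDB_TYPE_LIST:
		// Recurse into the element type; INT[][][] and friends fall out
		// naturally from the recursion.
		child := initColInfo(C.duckdb_list_type_child_type(logicalType))
		return colInfo{
			children: []colInfo{child},
			// setList walks the Go slice and delegates each element to
			// children[0].setFn.
			setFn: setList,
		}
	case C.DUCKDB_TYPE_INTEGER:
		return colInfo{setFn: func(ci *colInfo, rowIdx C.idx_t, val any) {
			setPrimitive[int32](ci, rowIdx, val.(int32))
		}}
	// ... STRUCT and the remaining primitive types
	}
	return colInfo{}
}
```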

@taniabogatsch and I are still ironing out some final changes here

@maiadegraaf maiadegraaf marked this pull request as draft January 19, 2024 09:40
appender_test.go Outdated
@@ -28,7 +27,8 @@ const (
float REAL,
double DOUBLE,
string VARCHAR,
bool BOOLEAN
bool BOOLEAN,
-- blob BLOB
maiadegraaf (Contributor, Author)

There previously was no test for BLOB types; currently one causes a segmentation fault at appender.go:443:

state = C.duckdb_append_data_chunk(*a.appender, chunk)

maiadegraaf (Contributor, Author)

@marcboeker I've looked into this, and I don't think BLOB types have ever been fully supported. Running this test on the current main causes a segmentation fault. To my understanding, a BLOB at its core is a []uint8 in Go, so it overlaps with slices of uint8. Unless I'm missing something, that means we have to choose which of the two to support.

I've opened an issue -> #152
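A quick, runnable illustration of the overlap in plain Go:

```go
package main

import "fmt"

func main() {
	var blob []byte = []uint8{0x01, 0x02} // compiles: byte is an alias for uint8
	fmt.Printf("%T\n", blob)              // prints []uint8

	switch any(blob).(type) {
	case []byte:
		// Adding a separate `case []uint8:` would not compile (duplicate
		// case), so a type switch cannot tell the two apart at runtime.
		fmt.Println("matched []byte and []uint8 alike")
	}
}
```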

@taniabogatsch (Collaborator) left a comment

Hi @maiadegraaf, great PR! I added quite a lot of comments (again 😅), but I think we are almost there now!

  • can you double-check that we panic only on internal errors? In all other cases, we return an error to the user. Let's also cover all user errors with tests.
  • can you double-check that we call Close properly on all test results (db, conn, rows, etc.) to avoid any memory leaks (I am missing destructors)? I think you already do almost everywhere, but just to be sure.

@taniabogatsch (Collaborator)

@marcboeker, what motivates removing significant test coverage to have the tests run faster? Especially since these cover the new nested functionality?

@marcboeker (Owner)

> @marcboeker, what motivates removing significant test coverage to have the tests run faster? Especially since these cover the new nested functionality?

Please review the appender_test.go file, where nested and base cases have been consolidated and refactored (though this is not yet complete). The only test removed is the nested large test, as I believe the base nested test is sufficient with, e.g., 10 rows instead of 10000, without sacrificing the speed of the test suite. Fast tests are essential for rapid development.

Although benchmarks can be helpful, they are not present for the rest of go-duckdb. Therefore, I have removed them. My aim is to create test cases that cover a significant portion of the code, are comprehensible, and can serve as a form of documentation.

There are still missing test cases for the appender, such as UUID.

@taniabogatsch (Collaborator)

Thanks for the explanation.

> The only test removed is the nested large test, as I believe the base nested test is sufficient with, e.g., 10 rows instead of 10000

The reason for having more than ten rows is that DuckDB's standard vector size is 2048. I see that the primitive types now cover 3k rows, which tests the creation of multiple chunks. I missed that change. Child vectors can exceed the standard vector size for nested types, so testing this might still be interesting.

> without sacrificing the speed of the test suite. Fast tests are essential for rapid development.

It is possible to use a regex to run only specific tests, and slow tests can be grouped under a common naming pattern (e.g., _slow_test.go) so they are easy to exclude. That way, tests closer to real-usage scenarios can coexist with fast tests for rapid development; see the sketch below.
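As an illustration, a sketch using the standard library's -short convention (assuming the usual testing import; the test name and row count are only examples):

```go
// Run only the fast tests with `go test -short ./...`, or select tests by
// regex with `go test -run 'TestAppenderNested.*' ./...`.
func TestAppenderNestedLarge(t *testing.T) {
	if testing.Short() {
		t.Skip("skipping 10000-row nested append in -short mode")
	}
	// ... append enough rows to span several 2048-entry vectors
}
```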

This is mostly me trying to better understand the project and its structure for future contributions. Maybe it is worth adding a section about this to the contributing guidelines?

@marcboeker (Owner)

marcboeker commented Feb 9, 2024

I've done some more refactoring and added tests for the UUID type, time.Time and []byte.

Unfortunately, appending a blob does not work at the moment, hence the TestAppenderBlob is skipped.

@maiadegraaf Do you know what type a DUCKDB_TYPE_BLOB is internally? Is it a primitive type or a list (DUCKDB_TYPE_LIST) of uint8 (which I doubt)?

The current implementation triggers a segmentation violation in the following line:

state = C.duckdb_append_data_chunk(*a.appender, chunk)

The logical type we set is: C.duckdb_create_logical_type(C.DUCKDB_TYPE_BLOB) and the value for the vector row index is set via:

setPrimitive[[]byte](colInfo, rowIdx, val.([]byte))

One weird thing is that the appender is a bit flaky under Ubuntu. Locally on macOS and in the CI under macOS/FreeBSD, everything works reliably. It fails at the same operation as the blob appending:

state = C.duckdb_append_data_chunk(*a.appender, chunk)

@taniabogatsch (Collaborator)

I had a look at the failing test. The reason for the segmentation fault is that we do not set the validity mask correctly for the child vectors of the STRUCT vector. When appending the data, we assume the values are set and try to append a string with an undefined length and an invalid data pointer.

I guess this worked on macOS but not on Ubuntu because Ubuntu probably default-initialised that bit to one (valid), whereas macOS probably default-initialised it to zero (invalid)... or some other bit-level shenanigans.
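For reference, a minimal sketch of this kind of fix, assuming go-duckdb's cgo setup; the duckdb_* calls are the DuckDB C API, while the function name and parameters are illustrative:

```go
// Mark a STRUCT child entry invalid so the appender never reads an
// undefined string length or a dangling data pointer for that row.
func setStructChildNull(vec C.duckdb_vector, childIdx, rowIdx C.idx_t) {
	child := C.duckdb_struct_vector_get_child(vec, childIdx)
	C.duckdb_vector_ensure_validity_writable(child)
	mask := C.duckdb_vector_get_validity(child)
	C.duckdb_validity_set_row_validity(mask, rowIdx, C.bool(false))
}
```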

I have a fix ready, and I'm also working on some other changes. I'll open a PR to @maiadegraaf once I'm done.

I'll also get back to you about the BLOBs.

@marcboeker (Owner)

@taniabogatsch Thanks, sounds great. I'm curious to see how you fixed it.

Validity mask fix and changes towards idiomatic Go
@taniabogatsch (Collaborator)

I used the logical type to switch in SetNull, but I believe that only works at the top level of the recursion. We either have to destroy all child logical types when initializing the types or when closing the Appender, so these types might not be accessible at deeper levels. A cleaner solution is to revert to this PR's original implementation, where we return the logical type and end up with exactly one logical type for the top-level chunk initialization. That way, we also don't keep an unused field in the colInfo struct and can set the DUCKDB_TYPE instead. I'll push a fix. 🤔

@taniabogatsch (Collaborator)

W.r.t. the BLOB type: DuckDB distinguishes between BLOB and UTINYINT[]. In DuckDB, a BLOB has the same representation as a VARCHAR; both are a more elaborate char pointer (inlined vs. string heap). Both a Go string and a Go []byte can be converted with C.CString, and the resulting C string passed to duckdb_vector_assign_string_element. I updated the BLOB implementation accordingly, and I also expanded the test.
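A sketch of that conversion, assuming go-duckdb's cgo setup (the function name is illustrative; the _len variant of the assign call also exists in the C API and preserves embedded zero bytes):

```go
func setBlob(vec C.duckdb_vector, rowIdx C.idx_t, blob []byte) {
	str := C.CString(string(blob)) // copies the bytes into C memory
	defer C.free(unsafe.Pointer(str))
	C.duckdb_vector_assign_string_element_len(vec, rowIdx, str, C.idx_t(len(blob)))
}
```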

To the best of my knowledge, Go does not distinguish between []byte and []uint8, while DuckDB distinguishes between BLOB and UTINYINT[]. @maiadegraaf raised this problem here. In the current solution, we opt to support BLOB in the Appender, but not UTINYINT[]. I added the respective tests.

cc @marcboeker, what do you think? :)

@marcboeker (Owner) left a comment

Thanks for the improvements. I really like the generics approach. I'm going to merge this and we can fix the small changes later.

fn SetColValue

// The type of the column.
ddbType C.duckdb_type
marcboeker (Owner)

I would call it duckdbType or dbType, as that is clearer than ddbType, which reads like a typo.

return nil, fmt.Errorf("can't create appender")
}
var appender C.duckdb_appender
state := C.duckdb_appender_create(*dbConn.con, cSchema, cTable, &appender)
marcboeker (Owner)

Maybe we should shorten dbConn simply to conn as the corresponding struct field in the Appender is also called c.

return csPtr, slice
}

func initPrimitive[T any](ddbType C.duckdb_type) (colInfo, C.duckdb_logical_type) {
marcboeker (Owner)

Awesome use of generics. Makes the code a lot cleaner 🎉
Maybe we should rename ddbType to duckdbType here too.
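For context, a sketch of what such a generic setter can look like (the colInfo field name is an assumption; duckdb_vector_get_data is the C API):

```go
// One implementation covers every fixed-width primitive type: view the raw
// vector data as an array of T and write the value at the row index.
func setPrimitive[T any](info *colInfo, rowIdx C.idx_t, value T) {
	xs := (*[1 << 30]T)(C.duckdb_vector_get_data(info.vector))
	xs[rowIdx] = value
}
```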

@marcboeker marcboeker merged commit e26a426 into marcboeker:main Feb 16, 2024
3 checks passed
kodiakhq bot pushed a commit to cloudquery/cloudquery that referenced this pull request Feb 21, 2024
Avoid writing to parquet if we can (appends only)


~Using the upstream driver has some shortcomings from the get-go: the interface decides on the schema based on the passed types, so untyped `nil`s don't help. But typed nils are also not handled. The [fork/PR](marcboeker/go-duckdb#150) fixes these issues, but I still get 0 rows on the test db (or MotherDuck)~ The PR is now merged upstream and tagged as 1.6.1.

~I had to use my [fork](https://github.com/disq/go-duckdb/tree/feat/allow-nulls-in-appender-first-row) to get around nils. Another way to do this would be if `appender.initColTypes` was exported; then we could just pass concrete types for init and then use nullable data as normal.~

~Another issue: We create JSON types as `json` but appender doesn't know about it. Sending `string` instead (which is what the json alias points to internally in duckdb, varchar) doesn't work. Might need the newer appender implementation for this to go forward.~


Example config:
```yaml
kind: source
spec:
  name: test
  path: cloudquery/test
  registry: cloudquery
  version: v4.0.8
#  tables: ["*"]
  tables: [ "test_some_table" ]
  destinations: ["duckdb"]
  spec:
    num_clients: 1
    num_rows: 4
    num_sub_rows: 1
#    num_clients: 10
#    num_rows: 10000
#    num_sub_rows: 0
#    num_sub_rows: 10
---
kind: destination
spec:
  name: duckdb
  version: v0.0.0
  registry: grpc
  path: localhost:7777
#  write_mode: "overwrite"
  write_mode: "append"
  spec:
    connection_string: tmp_duck.db
#    connection_string: md:my_db
    # Optional parameters
    batch_size: 10000
#    batch_size_bytes: 14194304 # 14 MiB
    # debug: false
```