DuckDB backend#780
Conversation
|
@kmicklas feel free to push to this branch! I'm not going to have time to get to it for at least one week |
|
Wow this is amazing! So glad to see beam grow like this |
|
Although I was chatting so much at Amerihac that I ended up doing very little... It'll get done nights and weekends |
cd9b838 to
53addec
Compare
53addec to
f5d4196
Compare
|
At this stage, the backend is functional with the SQL92 standard implemented. What remains to be built before I consider merging this PR:
I'm leaving most DuckDB extensions as future work, as well as more recent SQL standards such as SQL99 and SQL2003. |
5e4e749 to
064979c
Compare
064979c to
0d116b6
Compare
f80a596 to
8a47d38
Compare
2805007 to
a8f7131
Compare
|
I'll take a look at this later today! |
kmicklas
left a comment
There was a problem hiding this comment.
Is there a reason for having separate entity types for each data source? I can't think of any extra power this would provide. On the other hand, with a single entity type you could in theory do more dynamic behavior like choose at runtime whether to read from CSV or parquet with the same schema.
| deriving (Generic, Database DuckDB) | ||
| ``` | ||
|
|
||
| Contrary to `parquet` and `icebergTable`, we'll need to tweak the default CSV options by changing |
There was a problem hiding this comment.
I don't see anything obviously wrong, but starting with this paragraph GitHub started highlighting the markdown as if it were Haskell, which makes me suspicious that there is something wrong with the code block above. (Evil Unicode?)
However the rendered file looks fine, so maybe just a GitHub bug.
There was a problem hiding this comment.
Very strange. I tried removing the csv block, which I wasn't sure if it was supported notation, but it changed nothing. My guess is that's a Github Markdown syntax highlighting quirk
| snd | ||
| ) | ||
| where | ||
| quotePath path = mconcat [emitChar '\'', emit (Text.pack path), emitChar '\''] |
There was a problem hiding this comment.
Is it possible to use placeholders for paths? This seems like a possible injection mechanism.
There was a problem hiding this comment.
That is a great question. It looks like read_csv and functions like it don't support placeholders. As far as I understand, this means that the filepath or glob cannot be used for injection
1c6ff45 to
165c895
Compare
The reason I went with separate entity types is twofold:
The alternative here is that we could have an entity like: data Parquet
data CSV
data Iceberg
data SourceEntity (fmt :: Type) (table :: (Type -> Type) -> Type)
class IsDuckDBSource fmt where
type Input fmt -- e.g. `FilePath` or `NonEmpty FilePath`
type Options fmt -- e.g. `CSVOptions` or `IcebergTableOptions`I found the mechanism above to be a little heavy to save on boilerplate. Did you have some other mechanism in mind? |
|
Well, there's a simpler design actually that removes a lot of duplicated code. I suspect this is what Ken was suggesting: data DataSourceEntity (table :: (Type -> Type) -> Type)
data DataSource
= CSV (NonEmpty FilePath) CSVOptions
| Parquet (NonEmptyFilePath) -- no parquet options
| Iceberg FilePath IcebergOptions
dataSource :: DataSource
-> EntityModification (DatabaseEntity DuckDB db) DuckDB (DataSource table) |
165c895 to
d08ff95
Compare
Yup! This is exactly what I meant. My only remaining concern is that I think we should document more clearly the injection possibility with untrusted paths. |
d08ff95 to
d048b54
Compare
Done. Thanks for the review! |
Fixes #779