DuckDB backend #780

Merged
LaurentRDC merged 8 commits into master from duckdb-backend on Feb 24, 2026

Conversation

@LaurentRDC
Member

@LaurentRDC LaurentRDC commented Feb 8, 2026

Fixes #779

@LaurentRDC
Member Author

@kmicklas feel free to push to this branch! I'm not going to have time to get to it for at least one week

@tathougies
Collaborator

Wow this is amazing! So glad to see beam grow like this

@LaurentRDC
Member Author

Although I was chatting so much at AmeriHac that I ended up doing very little... It'll get done nights and weekends

@LaurentRDC LaurentRDC force-pushed the duckdb-backend branch 8 times, most recently from cd9b838 to 53addec February 18, 2026 13:45
@LaurentRDC
Member Author

LaurentRDC commented Feb 18, 2026

At this stage, the backend is functional with the SQL92 standard implemented.

What remains to be built before I consider merging this PR:

  • read_parquet;
  • iceberg_scan;
  • read_csv.

I'm leaving most DuckDB extensions as future work, as well as more recent SQL standards such as SQL99 and SQL2003.

@LaurentRDC LaurentRDC marked this pull request as ready for review February 18, 2026 13:55
@LaurentRDC LaurentRDC force-pushed the duckdb-backend branch 2 times, most recently from 5e4e749 to 064979c February 20, 2026 02:37
@kmicklas
Member

I'll take a look at this later today!

Member

@kmicklas kmicklas left a comment

Is there a reason for having separate entity types for each data source? I can't think of any extra power this would provide. On the other hand, with a single entity type you could in theory do more dynamic behavior like choose at runtime whether to read from CSV or parquet with the same schema.

Comment thread .github/workflows/build.yaml
Comment thread docs/user-guide/backends/beam-duckdb.md Outdated
deriving (Generic, Database DuckDB)
```

Contrary to `parquet` and `icebergTable`, we'll need to tweak the default CSV options by changing
Member

I don't see anything obviously wrong, but starting with this paragraph GitHub started highlighting the markdown as if it were Haskell, which makes me suspicious that there is something wrong with the code block above. (Evil Unicode?)

However the rendered file looks fine, so maybe just a GitHub bug.

Member Author

Very strange. I tried removing the csv block, since I wasn't sure that notation was supported, but it changed nothing. My guess is that it's a GitHub Markdown syntax-highlighting quirk.

snd
)
where
quotePath path = mconcat [emitChar '\'', emit (Text.pack path), emitChar '\'']
Member

Is it possible to use placeholders for paths? This seems like a possible injection mechanism.

Member Author

That is a great question. It looks like read_csv and functions like it don't support placeholders. As far as I understand, this means that the filepath or glob cannot be used for injection
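
For reference, the `quotePath` helper in the snippet under review wraps the path in single quotes without touching the path itself. A minimal sketch of one common mitigation, assuming the path ends up inside a single-quoted SQL string literal, is to double any embedded single quote. The name `quoteSqlLiteral` is hypothetical and is not part of beam or beam-duckdb, and this thread does not show whether the merged code does anything like this:

```haskell
-- Hypothetical helper for illustration only; not part of beam or beam-duckdb.
-- Wraps a raw string in single quotes and doubles every embedded single
-- quote, which is how a quote is escaped inside a SQL string literal.
quoteSqlLiteral :: String -> String
quoteSqlLiteral raw = "'" ++ concatMap escape raw ++ "'"
  where
    escape '\'' = "''"   -- a quote inside the literal becomes two quotes
    escape c    = [c]    -- every other character is passed through as-is
```

With this sketch, `quoteSqlLiteral "data/o'brien.csv"` yields `'data/o''brien.csv'`, so a quote in the path can no longer close the literal early. Escaping only addresses the quoting problem: an untrusted path can still point at a file the caller did not intend to read.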

@LaurentRDC
Member Author

LaurentRDC commented Feb 22, 2026

@kmicklas

> Is there a reason for having separate entity types for each data source? I can't think of any extra power this would provide. On the other hand, with a single entity type you could in theory do more dynamic behavior like choose at runtime whether to read from CSV or parquet with the same schema.

The reason I went with separate entity types is twofold:

  • Different entity types support different input sources, e.g. Parquet and CSV support multiple files, while Iceberg tables support a single FilePath as input;
  • Different entity types have different sets of options, e.g. CSVOptions and IcebergTableOptions.

The alternative here is that we could have an entity like:

data Parquet
data CSV
data Iceberg

data SourceEntity (fmt :: Type) (table :: (Type -> Type) -> Type)

class IsDuckDBSource fmt where
    type Input fmt -- e.g. `FilePath` or `NonEmpty FilePath`
    type Options fmt -- e.g. `CSVOptions` or `IcebergTableOptions`

I found the mechanism above a little heavy for what it saves in boilerplate. Did you have some other mechanism in mind?
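
To make the weight of that alternative concrete, here is a hedged sketch of the class above with one instance per format filled in. The option records and their fields, the `tableFunction` method, and the `SourceEntity` constructor are assumptions added for illustration; only the class shape, the tag types, and the DuckDB function names `read_csv`, `read_parquet`, and `iceberg_scan` come from this thread:

```haskell
{-# LANGUAGE KindSignatures #-}
{-# LANGUAGE TypeFamilies #-}

import Data.Kind (Type)
import Data.List.NonEmpty (NonEmpty)
import Data.Proxy (Proxy (..))

-- Uninhabited tags that name each format at the type level.
data Parquet
data CSV
data Iceberg

-- Placeholder option records; the field names are hypothetical.
newtype CSVOptions = CSVOptions { csvHasHeader :: Bool }
newtype IcebergTableOptions = IcebergTableOptions { icebergFlag :: Bool }

class IsDuckDBSource fmt where
  type Input fmt :: Type    -- e.g. FilePath or NonEmpty FilePath
  type Options fmt :: Type  -- e.g. CSVOptions or IcebergTableOptions
  -- Hypothetical method: the DuckDB table function for this format.
  tableFunction :: Proxy fmt -> String

instance IsDuckDBSource CSV where
  type Input CSV = NonEmpty FilePath
  type Options CSV = CSVOptions
  tableFunction _ = "read_csv"

instance IsDuckDBSource Parquet where
  type Input Parquet = NonEmpty FilePath
  type Options Parquet = ()            -- no Parquet options
  tableFunction _ = "read_parquet"

instance IsDuckDBSource Iceberg where
  type Input Iceberg = FilePath        -- a single path, not a list
  type Options Iceberg = IcebergTableOptions
  tableFunction _ = "iceberg_scan"

-- The entity stores whatever input and options its format tag dictates;
-- `table` is a phantom parameter naming the row type.
data SourceEntity fmt (table :: (Type -> Type) -> Type)
  = SourceEntity (Input fmt) (Options fmt)
```

Every format needs a tag type, an instance, and two associated type equations, which is the overhead being weighed against the duplicated entity types.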

@LaurentRDC LaurentRDC requested a review from kmicklas February 22, 2026 21:49
@LaurentRDC
Member Author

LaurentRDC commented Feb 22, 2026

Well, there's a simpler design actually that removes a lot of duplicated code. I suspect this is what Ken was suggesting:

data DataSourceEntity (table :: (Type -> Type) -> Type)

data DataSource
    = CSV (NonEmpty FilePath) CSVOptions
    | Parquet (NonEmpty FilePath) -- no parquet options
    | Iceberg FilePath IcebergOptions

dataSource :: DataSource
           -> EntityModification (DatabaseEntity DuckDB db) DuckDB (DataSourceEntity table)
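
Because the format is now an ordinary value rather than a type, it can be chosen at runtime, which is the dynamic behavior raised in the review. A minimal standalone sketch, assuming placeholder option types and two hypothetical helpers (`sourceForFiles` and `tableFunctionName`) that are not part of beam-duckdb:

```haskell
import Data.List (isSuffixOf)
import Data.List.NonEmpty (NonEmpty (..))
import qualified Data.List.NonEmpty as NonEmpty

-- Placeholder option types; the real records are not shown in this thread.
data CSVOptions = DefaultCSVOptions deriving (Eq, Show)
data IcebergOptions = DefaultIcebergOptions deriving (Eq, Show)

data DataSource
  = CSV (NonEmpty FilePath) CSVOptions
  | Parquet (NonEmpty FilePath)        -- no parquet options
  | Iceberg FilePath IcebergOptions
  deriving (Eq, Show)

-- Hypothetical helper: pick the format from the extension of the first
-- file, falling back to CSV with default options.
sourceForFiles :: NonEmpty FilePath -> DataSource
sourceForFiles files
  | ".parquet" `isSuffixOf` NonEmpty.head files = Parquet files
  | otherwise = CSV files DefaultCSVOptions

-- Hypothetical helper: the DuckDB table function each source maps to.
tableFunctionName :: DataSource -> String
tableFunctionName (CSV _ _)     = "read_csv"
tableFunctionName (Parquet _)   = "read_parquet"
tableFunctionName (Iceberg _ _) = "iceberg_scan"
```

The same table schema can then be pointed at `sourceForFiles paths` without changing any types, at the cost of the type system no longer recording which format an entity reads.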

@kmicklas
Member

> Well, there's a simpler design actually that removes a lot of duplicated code. I suspect this is what Ken was suggesting:
>
> data DataSourceEntity (table :: (Type -> Type) -> Type)
>
> data DataSource
>     = CSV (NonEmpty FilePath) CSVOptions
>     | Parquet (NonEmpty FilePath) -- no parquet options
>     | Iceberg FilePath IcebergOptions
>
> dataSource :: DataSource
>            -> EntityModification (DatabaseEntity DuckDB db) DuckDB (DataSourceEntity table)

Yup! This is exactly what I meant.

My only remaining concern is that I think we should document more clearly the injection possibility with untrusted paths.

@LaurentRDC
Member Author

> My only remaining concern is that I think we should document more clearly the injection possibility with untrusted paths.

Done. Thanks for the review!

@LaurentRDC LaurentRDC removed the request for review from tathougies February 24, 2026 17:57
@LaurentRDC LaurentRDC merged commit 71477f6 into master Feb 24, 2026
12 checks passed