Database Feature Matrix

Features

  • 🚿Stream — a mode in which Bulker inserts data into a destination on a per-record basis. Most databases handle large streamed volumes poorly. Don't use at production scale (more than 10-100 records per minute, depending on the database).
  • 🛢️Batch — a mode in which Bulker inserts data into a destination in batches. The preferred mode for large amounts of data.
  • 🔑Primary Key - a primary key is the column or columns whose values uniquely identify each row in a table. Enabled via stream options. Required for the 'deduplication' option.
  • 🐫Deduplication — a mode that avoids duplicate rows with equal values in the key columns (primary key). If Bulker receives a record with the same primary key values, the old record is replaced. Bulker maintains row uniqueness based on primary key columns even for warehouses that don't enforce uniqueness natively. Enabled via stream options. Requires the primary key option. May come with performance tradeoffs.
  • 🗓️Timestamp Column - the timestamp column option helps Bulker create tables optimized for range queries and sorting by time, e.g. by event creation time.
|                  | Redshift              | BigQuery         | ClickHouse                     | Snowflake    | Postgres     | MySQL        | S3 (coming soon) |
| ---------------- | --------------------- | ---------------- | ------------------------------ | ------------ | ------------ | ------------ | ---------------- |
| Stream           | ✅ Supported ⚠️ Slow   | ❌ Not supported | ✅ Supported                   | ✅ Supported | ✅ Supported | ✅ Supported |                  |
| Batch            | ✅ Supported          | ✅ Supported     | ✅ Supported                   | ✅ Supported | ✅ Supported | ✅ Supported |                  |
| Primary key      | ✅ Supported          | ℹ️ Emulated      | ✅ Supported                   | ✅ Supported | ✅ Supported | ✅ Supported |                  |
| Deduplication    | ✅ Supported          | ✅ Supported     | ✅ Supported ⚠️ Eventual dedup | ✅ Supported | ✅ Supported | ✅ Supported |                  |
| Timestamp Column | ✅ Supported          | ✅ Supported     | ✅ Supported                   | ✅ Supported | ✅ Supported | ✅ Supported |                  |

Advanced features

These features are not exposed via the HTTP API and are supported only at the Go library API level.

  • Replace Table - a special version of batch mode that assumes a single batch contains all data for a table. Depending on the database implementation, Bulker tries to atomically replace the old table with the new one.
  • Replace Partition - a special version of batch mode that replaces part of the target table. The part to replace is defined by the 'partition' stream option. Each batch loads data for a virtual partition identified by the 'partition' option value. If the table already contains data for the provided 'partition', that data is deleted and replaced with the new data from the current batch. Enabled via stream options.
|                   | Redshift     | BigQuery                   | ClickHouse   | Snowflake    | Postgres     | MySQL        | S3 (coming soon) |
| ----------------- | ------------ | -------------------------- | ------------ | ------------ | ------------ | ------------ | ---------------- |
| Replace Table     | ✅ Supported | ✅ Supported               | ✅ Supported | ✅ Supported | ✅ Supported | ✅ Supported |                  |
| Replace Partition | ✅ Supported | ✅ Supported ⚠️ Not atomic | ✅ Supported | ✅ Supported | ✅ Supported | ✅ Supported |                  |

Redshift

Redshift Stream

✅ Supported

⚠️Performance considerations

Implemented as plain INSERT statements. Don't use at production scale (more than 10 records per minute).

Redshift Batch

✅ Supported

Algorithm:

  • Write to tmp file
  • Load tmp file to s3
  • BEGIN TRANSACTION
  • COPY from s3 to tmp_table
  • INSERT into target_table select from tmp_table
  • COMMIT
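
A minimal SQL sketch of the steps above. The bucket path, IAM role, file format, and column layout are illustrative assumptions; the actual COPY options Bulker uses may differ:

```sql
BEGIN TRANSACTION;

-- bulk-load the staged file from S3 into a temporary table
COPY tmp_table FROM 's3://bulker-bucket/batch-0001.csv'
  IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-copy'
  FORMAT AS CSV;

-- append the batch to the target table
INSERT INTO target_table SELECT * FROM tmp_table;

COMMIT;
```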

Redshift Deduplication

✅ Supported

For batch mode the following algorithm is used:

  • Write to tmp file
  • Deduplicate rows in tmp file
  • Load tmp file to s3
  • BEGIN TRANSACTION
  • COPY from s3 to tmp_table
  • DELETE from target_table using tmp_table where primary key matches
  • INSERT into target_table select from tmp_table
  • COMMIT
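
A rough SQL sketch of the batch deduplication flow, assuming a single primary key column named id (hypothetical names):

```sql
BEGIN TRANSACTION;

-- load the already-deduplicated file into a temporary table
COPY tmp_table FROM 's3://bulker-bucket/batch-0001.csv'
  IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-copy'
  FORMAT AS CSV;

-- drop existing rows that share a primary key with the incoming batch
DELETE FROM target_table USING tmp_table
  WHERE target_table.id = tmp_table.id;

-- insert the fresh versions of those rows
INSERT INTO target_table SELECT * FROM tmp_table;

COMMIT;
```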

For stream mode:

SELECT by primary key. Then either INSERT or UPDATE depending on result. Don't use at production scale (more than 10 records per minute)

Redshift Primary Key

✅ Supported

In Redshift, primary keys don't enforce uniqueness. Bulker performs deduplication itself when the deduplication option is enabled and a primary key is specified.

If the primary key consists of a single column, that column is also used as the DISTKEY.

Redshift Timestamp Column

✅ Supported

The selected timestamp column is used as the sort key (SORTKEY) of the target table.
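
For illustration, a table created with a single-column primary key id and timestamp column _timestamp might look roughly like the DDL below; the column names and types are assumptions, not the exact DDL Bulker generates:

```sql
CREATE TABLE IF NOT EXISTS target_table (
    id         VARCHAR(255),        -- primary key column, also used as DISTKEY
    name       VARCHAR(65535),
    _timestamp TIMESTAMP,           -- timestamp column, used as SORTKEY
    PRIMARY KEY (id)
)
DISTKEY (id)
SORTKEY (_timestamp);
```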

Redshift Replace Table

✅ Supported

Algorithm:

  • Write to tmp file
  • Load tmp file to s3
  • BEGIN TRANSACTION
  • COPY from s3 to tmp_table
  • RENAME target_table to deprecated_target_table_20060101_150405
  • RENAME tmp_table to target_table
  • DROP TABLE deprecated_target_table_20060101_150405
  • COMMIT
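
A SQL sketch of the table swap, with illustrative bucket and role names (the deprecated-table suffix is a timestamp):

```sql
BEGIN TRANSACTION;

COPY tmp_table FROM 's3://bulker-bucket/batch-0001.csv'
  IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-copy'
  FORMAT AS CSV;

-- swap the freshly loaded table in place of the old one
ALTER TABLE target_table RENAME TO deprecated_target_table_20060101_150405;
ALTER TABLE tmp_table RENAME TO target_table;
DROP TABLE deprecated_target_table_20060101_150405;

COMMIT;
```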

Redshift Replace Partition

✅ Supported

Algorithm:

  • Write to tmp file
  • Load tmp file to s3
  • BEGIN TRANSACTION
  • DELETE from target_table where partition_id = 'partition' option value
  • COPY from s3 to target_table
  • COMMIT
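
A sketch of the partition replacement, assuming the 'partition' option value is stored in a partition_id column and the batch covers partition 'partition-2024-01' (both names are illustrative):

```sql
BEGIN TRANSACTION;

-- remove previously loaded data for this virtual partition
DELETE FROM target_table WHERE partition_id = 'partition-2024-01';

-- load the new data for the partition directly into the target table
COPY target_table FROM 's3://bulker-bucket/partition-2024-01.csv'
  IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-copy'
  FORMAT AS CSV;

COMMIT;
```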

BigQuery

BigQuery Stream

❌ Not supported

It would be possible to implement, but not with deduplication support, so we decided not to.

BigQuery Batch

✅ Supported

  • Write to tmp file
  • Use Loader API to load to tmp_table from tmp file
  • Use Copier API to copy from tmp_table to target_table

BigQuery Deduplication

✅ Supported

Algorithm for batch mode:

  • Write to tmp file
  • Dedup tmp file
  • Use Loader API to load to tmp_table from tmp file
  • MERGE into target_table using tmp_table: when matched then UPDATE, when not matched then INSERT
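
The MERGE step might look roughly like this, assuming a single primary key column id and illustrative dataset/column names (the statement Bulker generates may differ):

```sql
MERGE my_dataset.target_table t
USING my_dataset.tmp_table s
ON t.id = s.id
WHEN MATCHED THEN
  UPDATE SET name = s.name, _timestamp = s._timestamp
WHEN NOT MATCHED THEN
  INSERT (id, name, _timestamp) VALUES (s.id, s.name, s._timestamp);
```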

BigQuery Primary Key

ℹ️ Emulated - Bulker fully handles uniqueness itself. Primary key column metadata is stored in table labels.

BigQuery Timestamp Column

✅ Supported

Bulker creates a time-unit column-partitioned table on the specified timestamp column with monthly partitioning.
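
Bulker sets this up through the BigQuery API rather than DDL; the DDL below is only an illustration of the resulting partitioning, assuming a timestamp column named _timestamp and hypothetical dataset/column names:

```sql
CREATE TABLE IF NOT EXISTS my_dataset.target_table (
  id         STRING,
  name       STRING,
  _timestamp TIMESTAMP
)
PARTITION BY TIMESTAMP_TRUNC(_timestamp, MONTH);
```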

BigQuery Replace Table

✅ Supported

Algorithm:

  • Write to tmp file
  • Use Loader API to load to tmp_table from tmp file
  • Use Copier API to copy from tmp_table to target_table with WriteTruncate mode
  • Drop tmp_table

BigQuery Replace Partition

✅ Supported

⚠️ Not atomic – while a Bulker stream is being completed, the target table may be missing some data for the specified 'partition' for a short period of time.

Algorithm:

  • DELETE from target_table where partition_id = 'partition' option value
  • Write to tmp file
  • Use Loader API to load to target_table from tmp file

ClickHouse

ClickHouse Stream

✅ Supported

For a single-node instance:

INSERT INTO target_table (...) VALUES (..)

For a cluster, Bulker inserts into a distributed table so that data is evenly distributed across cluster nodes:

INSERT INTO dist_target_table (...) VALUES (...)

ClickHouse Batch

✅ Supported

Algorithm:

  • Write to tmp file
  • INSERT INTO tmp_table (...) VALUES (...) - bulk-load data from the tmp file into tmp_table
  • INSERT INTO target_table(...) SELECT ... FROM tmp_table

ClickHouse Deduplication

✅ Supported ⚠️ Eventual deduplication

Bulker relies on the underlying table engine. Unless the table was created before Bulker, the ReplacingMergeTree engine is used.

The primary key columns configured for the destination are used both as PRIMARY KEY and ORDER BY.

The ReplacingMergeTree engine performs deduplication in the background some time after insertion, so it is still possible to get rows with duplicate primary key values from an ordinary SELECT.

To make sure that no duplicates are present in query results, use the FINAL modifier, e.g.:

SELECT * FROM target_table FINAL

Note: ReplacingMergeTree is not the only way to deduplicate data in ClickHouse; there are other approaches too. To use them, create the destination table before Bulker starts inserting data. In that case Bulker respects the table engine and primary key columns you specified.
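
For example, a pre-created table roughly equivalent to what Bulker would set up, assuming a primary key column id and timestamp column _timestamp (illustrative names):

```sql
CREATE TABLE target_table
(
    id         String,
    name       String,
    _timestamp DateTime
)
ENGINE = ReplacingMergeTree        -- deduplicates rows with the same sorting key in the background
PARTITION BY toYYYYMM(_timestamp)  -- monthly partitioning by the timestamp column
ORDER BY id;                       -- sorting key, doubles as the primary key
```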

ClickHouse Primary Key

✅ Supported

Primary key columns are also used as the sorting key for the ReplacingMergeTree engine.

ClickHouse Timestamp Column

✅ Supported

Bulker creates tables partitioned monthly by the specified timestamp column, e.g. PARTITION BY toYYYYMM(_timestamp)

ClickHouse Replace Table

✅ Supported

Algorithm:

  • Write to tmp file
  • INSERT INTO tmp_table (...) VALUES (...) - bulk-load data from the tmp file into tmp_table
  • EXCHANGE TABLES target_table AND tmp_table

ClickHouse Replace Partition

✅ Supported

Algorithm:

  • Write to tmp file
  • INSERT INTO tmp_table (...) VALUES (...) - bulk-load data from the tmp file into tmp_table
  • INSERT INTO target_table(...) SELECT ... FROM tmp_table

Snowflake

Snowflake Stream

✅ Supported

INSERT INTO target_table (...) VALUES (..)

Snowflake Batch

✅ Supported

Algorithm:

  • Write to tmp file
  • Load tmp file to stage
  • BEGIN TRANSACTION
  • COPY from stage to tmp_table
  • INSERT into target_table select from tmp_table
  • COMMIT
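
A rough sketch of the stage-based load. The stage name, file name, and file format options are illustrative assumptions; the stage Bulker actually uses may differ:

```sql
-- upload the local tmp file to a named stage (PUT runs through the client/driver, outside the transaction)
PUT file:///tmp/batch-0001.csv @bulker_stage AUTO_COMPRESS = FALSE;

BEGIN TRANSACTION;

-- load the staged file into a temporary table
COPY INTO tmp_table
  FROM @bulker_stage/batch-0001.csv
  FILE_FORMAT = (TYPE = CSV);

-- append the batch to the target table
INSERT INTO target_table SELECT * FROM tmp_table;

COMMIT;
```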

Snowflake Deduplication

✅ Supported

For batch mode the following algorithm is used:

  • Write to tmp file
  • Deduplicate rows in tmp file
  • Load tmp file to stage
  • BEGIN TRANSACTION
  • COPY from stage to tmp_table
  • MERGE into target_table using (select from tmp_table) ...
  • COMMIT
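
The MERGE step might look roughly like this, assuming a single primary key column id (illustrative column names; the statement Bulker generates may differ):

```sql
MERGE INTO target_table t
USING (SELECT * FROM tmp_table) s
ON t.id = s.id
WHEN MATCHED THEN
  UPDATE SET name = s.name, _timestamp = s._timestamp
WHEN NOT MATCHED THEN
  INSERT (id, name, _timestamp) VALUES (s.id, s.name, s._timestamp);
```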

For stream mode:

SELECT by primary key. Then either INSERT or UPDATE depending on result.

Snowflake Primary Key

✅ Supported

In Snowflake, primary keys don't enforce uniqueness. Bulker performs deduplication itself when the deduplication option is enabled and a primary key is specified.

Snowflake Timestamp Column

✅ Supported

Bulker sets the clustering key to the month part of the specified timestamp column, e.g. CLUSTER BY (DATE_TRUNC('MONTH', _timestamp))

Snowflake Replace Table

✅ Supported

Algorithm:

  • Write to tmp file
  • Load tmp file to stage
  • BEGIN TRANSACTION
  • COPY from stage to tmp_table
  • RENAME target_table to deprecated_target_table_20060101_150405
  • RENAME tmp_table to target_table
  • DROP TABLE deprecated_target_table_20060101_150405
  • COMMIT

Snowflake Replace Partition

✅ Supported

Algorithm:

  • Write to tmp file
  • Load tmp file to stage
  • BEGIN TRANSACTION
  • DELETE from target_table where partition_id = 'partition' option value
  • COPY from stage to target_table
  • COMMIT

Postgres

Postgres Stream

✅ Supported

INSERT INTO target_table (...) VALUES (..)

Postgres Batch

✅ Supported

Algorithm:

  • Write to tmp file
  • BEGIN TRANSACTION
  • COPY from STDIN to tmp_table - load tmp file into tmp_table
  • INSERT into target_table select from tmp_table
  • COMMIT
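
A minimal sketch of this flow in plain SQL terms; Bulker performs the COPY over the wire through its database driver, and the column names are illustrative:

```sql
BEGIN;

-- stream the tmp file into a temporary table
COPY tmp_table (id, name, _timestamp) FROM STDIN WITH (FORMAT csv);

-- append the batch to the target table
INSERT INTO target_table SELECT * FROM tmp_table;

COMMIT;
```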

Postgres Deduplication

✅ Supported

For batch mode the following algorithm is used:

  • Write to tmp file
  • Deduplicate rows in tmp file
  • BEGIN TRANSACTION
  • COPY from STDIN to tmp_table - load tmp file into tmp_table
  • INSERT into target_table select from tmp_table ON CONFLICT UPDATE ...
  • COMMIT
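
The upsert step might look roughly like this, assuming a single primary key column id (illustrative names; the actual statement may differ):

```sql
INSERT INTO target_table (id, name, _timestamp)
SELECT id, name, _timestamp FROM tmp_table
ON CONFLICT (id) DO UPDATE
  SET name = EXCLUDED.name,
      _timestamp = EXCLUDED._timestamp;
```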

For stream mode:

INSERT INTO target_table (...) VALUES (..) ON CONFLICT UPDATE ...

Postgres Primary Key

✅ Supported

Postgres Timestamp Column

✅ Supported

A regular index is created on the specified timestamp column.

Postgres Replace Table

✅ Supported

Algorithm:

  • Write to tmp file
  • BEGIN TRANSACTION
  • COPY from STDIN to tmp_table - load tmp file into tmp_table
  • RENAME target_table to deprecated_target_table_20060101_150405
  • RENAME tmp_table to target_table
  • DROP TABLE deprecated_target_table_20060101_150405
  • COMMIT

Postgres Replace Partition

✅ Supported

Algorithm:

  • Write to tmp file
  • BEGIN TRANSACTION
  • DELETE from target_table where partition_id = 'partition' option value
  • COPY from STDIN to target_table - load tmp file into target_table
  • COMMIT

MySQL

MySQL Stream

✅ Supported

INSERT INTO target_table (...) VALUES (..)

MySQL Batch

✅ Supported

Algorithm:

  • BEGIN TRANSACTION
  • INSERT into tmp_table
  • INSERT into target_table select from tmp_table
  • COMMIT

MySQL Deduplication

✅ Supported

For batch mode the following algorithm is used:

  • BEGIN TRANSACTION
  • INSERT into tmp_table ... ON DUPLICATE KEY UPDATE ...
  • INSERT into target_table select from tmp_table ... ON DUPLICATE KEY UPDATE ...
  • COMMIT

For stream mode:

INSERT INTO target_table ... ON DUPLICATE KEY UPDATE ...
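For illustration, with a single primary key column id and hypothetical column names, the upsert looks roughly like:

```sql
INSERT INTO target_table (id, name, _timestamp)
VALUES ('1', 'new name', NOW())
ON DUPLICATE KEY UPDATE
  name = VALUES(name),
  _timestamp = VALUES(_timestamp);
```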

MySQL Primary Key

✅ Supported

MySQL Timestamp Column

✅ Supported

A regular index is created on the specified timestamp column.

MySQL Replace Table

✅ Supported

Algorithm:

  • BEGIN TRANSACTION
  • INSERT into tmp_table
  • RENAME target_table to deprecated_target_table_20060101_150405
  • RENAME tmp_table to target_table
  • DROP TABLE deprecated_target_table_20060101_150405
  • COMMIT

MySQL Replace Partition

✅ Supported

Algorithm:

  • BEGIN TRANSACTION
  • DELETE from target_table where partition_id = 'partition' option value
  • INSERT into target_table ... ON DUPLICATE KEY UPDATE ...
  • COMMIT