New schema #20

Merged
merged 3 commits into from
Jan 14, 2022

UnamedRus
Contributor

Important, and doesn't require recreating the table (can be done on the existing table):

String -> LowCardinality(String)
CODEC(ZSTD(1)) -> no codec or CODEC(LZ4)
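
A minimal sketch of how these two changes might be applied to an existing table with ALTER TABLE ... MODIFY COLUMN. The table name comes from the schema in this PR; which columns to alter is an assumption made purely for illustration.

```sql
-- Sketch only: the columns picked here are illustrative assumptions.

-- String -> LowCardinality(String): changes the column type in place.
ALTER TABLE logs.logs_local ON CLUSTER '{cluster}'
    MODIFY COLUMN `namespace` LowCardinality(String);

-- CODEC(ZSTD(1)) -> CODEC(LZ4): the type stays the same, only the codec changes.
ALTER TABLE logs.logs_local ON CLUSTER '{cluster}'
    MODIFY COLUMN `host` LowCardinality(String) CODEC(LZ4);
```

As far as I know, a codec change applies to newly written parts first, and existing parts pick it up as they are merged, so the effect on disk usage may not be visible immediately.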

Not really important, but makes schema a bit cleaner:

-toUnixTimestamp(timestamp) -> timestamp

It probably makes sense to put the host column before pod_name in ORDER BY, but that needs to be tested.

BTW, it's not recommended to use . in the name of a non-array column, because the dot is reserved for the Nested data type.
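
To illustrate why the dot is special, here is a small sketch with a hypothetical demo table (nested_demo and its column names are not part of this PR). A Nested column is exposed as one array column per sub-field, each named with a dot:

```sql
-- Hypothetical demo table; not part of this PR.
CREATE TABLE nested_demo
(
    `fields` Nested(key String, value String)
)
ENGINE = Memory;

-- With the default flatten_nested = 1, this lists two array columns:
-- fields.key Array(String) and fields.value Array(String).
DESCRIBE TABLE nested_demo;
```

This is why dotted names such as fields_string.key fit array columns, while a dotted name on a scalar column can be confusing.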

Using map instead of key-value arrays

CREATE TABLE IF NOT EXISTS logs.logs_local ON CLUSTER `{cluster}`
(
    `timestamp` DateTime64(3) CODEC(Delta, LZ4),
    `cluster` LowCardinality(String),
    `namespace` LowCardinality(String),
    `app` LowCardinality(String),
    `pod_name` LowCardinality(String),
    `container_name` LowCardinality(String),
    `host` LowCardinality(String),
    `fields_string` Map(LowCardinality(String), String),
    `fields_number` Map(LowCardinality(String), Float64),
    `fields_string.key` Array(LowCardinality(String)) ALIAS mapKeys(fields_string),
    `fields_string.value` Array(String) ALIAS mapValues(fields_string),
    `fields_number.key` Array(LowCardinality(String)) ALIAS mapKeys(fields_number),
    `fields_number.value` Array(Float64) ALIAS mapValues(fields_number),
    `log` String CODEC(ZSTD(1))
)
ENGINE = ReplicatedMergeTree
PARTITION BY toDate(timestamp)
ORDER BY (cluster, namespace, app, pod_name, container_name, host, timestamp)
TTL toDateTime(timestamp) + INTERVAL 30 DAY;

It will make queries cleaner and a bit faster, e.g. (array-based lookup first, map-based lookup second):

SELECT fields_string.value[indexOf(fields_string.key, 'content.level')] FROM logs.logs;

SELECT fields_string['content.level'] FROM logs.logs;

But returning maps is not supported yet in clickhouse-go, AFAIK:
ClickHouse/clickhouse-go#380
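
A self-contained sketch comparing the two lookup styles without touching the real table. The literal keys and values below are made-up illustrative data standing in for a single row:

```sql
-- Literal values stand in for one row of the table (illustrative data).
WITH
    ['content.level', 'content.msg'] AS ks,
    ['info', 'started'] AS vs,
    map('content.level', 'info', 'content.msg', 'started') AS m
SELECT
    vs[indexOf(ks, 'content.level')] AS from_arrays, -- key/value array lookup
    m['content.level'] AS from_map;                  -- map lookup
```

Both columns should come back as 'info'; the map form simply avoids the indexOf step.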

Projection optimization:

Starting from ClickHouse 21.8, we can add projections to speed up some queries:

ALTER TABLE logs.logs_local ON CLUSTER '{cluster}' ADD PROJECTION buckets (SELECT cluster, namespace, app, pod_name, container_name, toStartOfInterval(timestamp, INTERVAL 30 second) AS interval_data, count() GROUP BY cluster, namespace, app, pod_name, container_name, interval_data);

ALTER TABLE logs.logs_local ON CLUSTER '{cluster}' MATERIALIZE PROJECTION buckets;

Enable this setting to use projections in queries:

allow_experimental_projection_optimization = 1

But you need to write queries in the following way (filter on interval_data instead of timestamp):

SELECT
    toStartOfInterval(timestamp, toIntervalSecond(30)) AS interval_data,
    count(*) AS count_data
FROM logs.logs
WHERE (interval_data >= FROM_UNIXTIME(1641923841)) AND (interval_data <= FROM_UNIXTIME(1641924741)) AND (namespace = 'kobs') AND (app = 'kobs') AND (container_name = 'kobs')
GROUP BY interval_data
ORDER BY interval_data ASC WITH FILL FROM toStartOfInterval(FROM_UNIXTIME(1641923841), toIntervalSecond(30)) TO toStartOfInterval(FROM_UNIXTIME(1641924741), toIntervalSecond(30)) STEP 30
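
A short sketch of two ways the setting named above could be turned on, either for the whole session or for a single query. The query is a trimmed-down variant of the one above, kept only to show where the SETTINGS clause goes:

```sql
-- Enable for the current session:
SET allow_experimental_projection_optimization = 1;

-- Or for a single query only:
SELECT
    toStartOfInterval(timestamp, toIntervalSecond(30)) AS interval_data,
    count() AS count_data
FROM logs.logs
WHERE namespace = 'kobs'
GROUP BY interval_data
ORDER BY interval_data ASC
SETTINGS allow_experimental_projection_optimization = 1;
```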

Constraint optimization:

Starting from ClickHouse 21.12, it's possible to teach ClickHouse to rewrite fields_string.value[indexOf(fields_string.key, 'content.level')] to the content_level column by adding constraints.

ALTER TABLE logs.logs_local ON CLUSTER '{cluster}' ADD CONSTRAINT content_level_cnst ASSUME content_level = fields_string.value[indexOf(fields_string.key, 'content.level')];
ALTER TABLE logs.logs_local ON CLUSTER '{cluster}' ADD CONSTRAINT content_response_code_cnst ASSUME content_response_code = fields_number.value[indexOf(fields_number.key, 'content.response_code')];

To enable this in queries, you need to set these settings for the user:

optimize_using_constraints = 1
optimize_substitute_columns = 1
convert_query_to_cnf = 1
optimize_append_index = 1
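
One possible way to attach these settings to a user, sketched under the assumption that the user is managed through SQL (not users.xml). The user name kobs is hypothetical:

```sql
-- `kobs` is a hypothetical user name; replace it with the real one.
ALTER USER kobs SETTINGS
    optimize_using_constraints = 1,
    optimize_substitute_columns = 1,
    convert_query_to_cnf = 1,
    optimize_append_index = 1;
```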

But there is one issue with this optimization:
ClickHouse/ClickHouse#33544

SELECT fields_string.value[indexOf(fields_string.key, 'content.level')] FROM logs.logs; -- will not be optimized

SELECT fields_string.value[indexOf(fields_string.key, 'content.level')] FROM logs.logs WHERE fields_string.value[indexOf(fields_string.key, 'content.level')] = 'value'; -- will be optimized

But what if we don't actually want to filter by our column?
There is a way to trick ClickHouse into using constraints:

SELECT fields_string.value[indexOf(fields_string.key, 'content.level')] FROM logs.logs WHERE NOT ignore(fields_string.value[indexOf(fields_string.key, 'content.level')]); -- will be optimized
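
A tiny sketch of why this trick does not change the result set: ignore() returns 0 for any argument, so NOT ignore(...) is always true and filters nothing out. The literal argument is just a placeholder:

```sql
SELECT
    ignore('any value') AS ignored,       -- ignore() returns 0 for any argument
    NOT ignore('any value') AS predicate; -- so the predicate is always 1 (true)
```

The expression still appears in WHERE, which is what lets the constraint-based rewrite apply.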

Member

@ricoberger ricoberger left a comment

Hi @UnamedRus that's awesome, thank you very much 🙂

It looks like map support was implemented yesterday in the v2 branch for the Go driver, so maybe we can switch to maps soon.

`container_name` LowCardinality(String),
`host` LowCardinality(String),
`fields_string.key` Array(LowCardinality(String)),
`fields_string.value` Array(String) CODEC(ZSTD(1)),
Member

As you wrote in the PR:

CODEC(ZSTD(1)) -> no codec or CODEC(LZ4)

Should this also be changed for fields_string.value?

Contributor Author

should this also be changed for the fields_string.value

It needs to be tested :) (most likely by you, as you have a good dataset).
Right now, I can't say for sure.

It's fairly easy to run such a test; here is an article about it:
https://kb.altinity.com/altinity-kb-schema-design/codecs/altinity-kb-how-to-test-different-compression-codecs/
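
As a rough sketch (not the procedure from the linked article), per-column compression can be compared by reading system.columns after loading the same data with different codecs. The database and table names are taken from this PR:

```sql
-- Compare on-disk size per column for the table from this PR.
SELECT
    name,
    compression_codec,
    formatReadableSize(data_compressed_bytes) AS compressed,
    formatReadableSize(data_uncompressed_bytes) AS uncompressed,
    round(data_uncompressed_bytes / data_compressed_bytes, 2) AS ratio
FROM system.columns
WHERE database = 'logs' AND table = 'logs_local'
ORDER BY data_compressed_bytes DESC;
```

Size is only half of the picture; query speed on representative queries would need to be measured separately.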

Member

Thanks for the link 🙂. I will merge your PR and try to compare the codecs on our data.

@ricoberger
Member

Thanks again for all your adjustments and recommendations 🙂

@ricoberger ricoberger merged commit e21bbbe into kobsio:main Jan 14, 2022