Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Compatibity with Clickhouse 24.2 #3855

Closed
2 tasks done
luizkowalski opened this issue Mar 2, 2024 · 8 comments
Closed
2 tasks done

Compatibity with Clickhouse 24.2 #3855

luizkowalski opened this issue Mar 2, 2024 · 8 comments
Assignees

Comments

@luizkowalski
Copy link

luizkowalski commented Mar 2, 2024

Past Issues Searched

  • I have searched open and closed issues to make sure that the bug has not yet been reported

Issue is a Bug Report

  • This is a bug report and not a feature request, nor asking for self-hosted support

Using official Plausible Cloud hosting or self-hosting?

Self-hosting

Describe the bug

hey 👋🏻

I tried to upgrade my self-hosted instance of Clickhouse, and it looks like Plausible is not yet ready for it

This is the exception I'm seeing on the console:

2024-03-02T14:13:30.904571761Z 14:13:30.901 request_id=F7j4ERc_wsg38ZcAAAfC [info] Sent 500 in 26ms
2024-03-02T14:13:30.904574245Z 14:13:30.901 [error] #PID<0.3764.0> running PlausibleWeb.Endpoint (connection #PID<0.3762.0>, stream id 2) terminated
2024-03-02T14:13:30.904577021Z Server: plausible.sumiu.link:80 (http)
2024-03-02T14:13:30.904579085Z Request: GET /api/stats/luizkowalski.net/countries?period=day&date=2024-03-02&filters=%7B%7D&with_imported=true&limit=300
2024-03-02T14:13:30.904581489Z ** (exit) an exception was raised:
2024-03-02T14:13:30.904583634Z     ** (Ch.Error) Code: 10. DB::Exception: Not found column _sample_factor:  in block start DateTime UInt32(size = 0), country_code LowCardinality(FixedString(2)) ColumnLowCardinality(size = 0, UInt8(size = 0), ColumnUnique(size = 1, FixedString(size = 1))), site_id UInt64 UInt64(size = 0), user_id UInt64 UInt64(size = 0). (NOT_FOUND_COLUMN_IN_BLOCK) (version 24.2.1.2248 (official build))
2024-03-02T14:13:30.904586258Z
2024-03-02T14:13:30.904588212Z         (ecto_sql 3.9.2) lib/ecto/adapters/sql.ex:913: Ecto.Adapters.SQL.raise_sql_call_error/1
2024-03-02T14:13:30.904596768Z         (ecto_ch 0.1.9) lib/ecto/adapters/clickhouse.ex:269: Ecto.Adapters.ClickHouse.execute/5
2024-03-02T14:13:30.904599112Z         (ecto 3.9.5) lib/ecto/repo/queryable.ex:229: Ecto.Repo.Queryable.execute/4
2024-03-02T14:13:30.904601216Z         (ecto 3.9.5) lib/ecto/repo/queryable.ex:19: Ecto.Repo.Queryable.all/3
2024-03-02T14:13:30.904603290Z         (plausible 0.0.1) lib/plausible/stats/breakdown.ex:195: Plausible.Stats.Breakdown.breakdown_sessions/5
2024-03-02T14:13:30.904605484Z         (plausible 0.0.1) lib/plausible_web/controllers/api/stats_controller.ex:810: PlausibleWeb.Api.StatsController.countries/2
2024-03-02T14:13:30.904608129Z         (plausible 0.0.1) lib/plausible_web/controllers/api/stats_controller.ex:1: PlausibleWeb.Api.StatsController.action/2
2024-03-02T14:13:30.904610343Z         (plausible 0.0.1) lib/plausible_web/controllers/api/stats_controller.ex:1: PlausibleWeb.Api.StatsController.phoenix_controller_pipeline/2
2024-03-02T14:13:30.904612648Z 14:13:30.903 [debug] QUERY ERROR source="events_v2" db=21.5ms queue=0.1ms idle=1749.0ms
2024-03-02T14:13:30.904614852Z SELECT toStartOfHour(toTimeZone(e0."timestamp", {$0:String})),toUInt64(round(uniq(e0."user_id") * any(_sample_factor))) FROM "events_v2" AS e0 sample 20000000 WHERE (e0."site_id" = {$1:Int64}) AND ((e0."timestamp" >= {$2:DateTime}) AND (e0."timestamp" < {$3:DateTime})) GROUP BY toStartOfHour(toTimeZone(e0."timestamp", {$4:String})) ORDER BY toStartOfHour(toTimeZone(e0."timestamp", {$5:String})) ["Europe/Berlin", 2, ~U[2024-03-01 23:00:00Z], ~U[2024-03-02 23:00:00Z], "Europe/Berlin", "Europe/Berlin"]

Expected behavior

Compatible with CH changes

Screenshots

No response

Environment

- OS: Ubuntu 22.04 (Docker)
- Browser: - 
- Browser Version: -

Deployment configuration (I'm using Kamal)

plausible:
    image: plausible/analytics:v2.0.0
    roles:
      - web
    env:
      clear:
        BASE_URL: https://plausible.sumiu.link/
        SECRET_KEY_BASE: <%= ENV["PLAUSIBLE_SECRE_KEY_BASE"] %>
        DATABASE_URL: <%= ENV["PLAUSIBLE_DATABASE_URL"] %>
        CLICKHOUSE_DATABASE_URL: http://10.0.0.3:8123/plausible_events
        DISABLE_REGISTRATION: invite_only
        LOG_LEVEL: debug
        MAILER_ADAPTER: Bamboo.PostmarkAdapter
        MAILER_EMAIL: xxx
        MAILER_NAME: Luiz from Sumiu
      secret:
        - POSTMARK_API_KEY
    cmd: sh -c "sleep 5 && /entrypoint.sh db createdb && /entrypoint.sh db migrate && /entrypoint.sh run"
    labels:
      traefik.enable: true

  clickhouse:
    hosts:
      - accessories
    image: clickhouse/clickhouse-server:24.2
    port: 8123
    directories:
      - event-data:/var/lib/clickhouse
@ruslandoga
Copy link
Contributor

👋 @luizkowalski

Can you please run show create on events_v2 and sessions_v2 tables? I'll try out ClickHouse 24.2 on my instance later today.

@luizkowalski
Copy link
Author

luizkowalski commented Mar 2, 2024

sure, but how do I do that? Plausible's app console? I'm not very familiar with all this, it is my first time playing around with self-hosting

nevermind, got it

ec50295d4672 :) show create plausible_events.events_v2;

SHOW CREATE TABLE plausible_events.events_v2

Query id: 9a41190e-bdeb-4d87-9d59-67ba4d8af518

┌─statement──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│ CREATE TABLE plausible_events.events_v2
(
    `timestamp` DateTime CODEC(Delta(4), LZ4),
    `name` LowCardinality(String),
    `site_id` UInt64,
    `user_id` UInt64,
    `session_id` UInt64,
    `hostname` String,
    `pathname` String CODEC(ZSTD(3)),
    `referrer` String,
    `referrer_source` String,
    `country_code` FixedString(2),
    `screen_size` LowCardinality(String),
    `operating_system` LowCardinality(String),
    `browser` LowCardinality(String),
    `utm_medium` String,
    `utm_source` String,
    `utm_campaign` String,
    `meta.key` Array(String),
    `meta.value` Array(String),
    `browser_version` LowCardinality(String),
    `operating_system_version` LowCardinality(String),
    `subdivision1_code` LowCardinality(String),
    `subdivision2_code` LowCardinality(String),
    `city_geoname_id` UInt32,
    `utm_content` String,
    `utm_term` String,
    `transferred_from` String
)
ENGINE = MergeTree
PARTITION BY toYYYYMM(timestamp)
PRIMARY KEY (site_id, toDate(timestamp), name, user_id)
ORDER BY (site_id, toDate(timestamp), name, user_id, timestamp)
SAMPLE BY user_id
SETTINGS index_granularity = 8192 │
└────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘

1 row in set. Elapsed: 0.002 sec.
ec50295d4672 :) show create plausible_events.sessions_v2;

SHOW CREATE TABLE plausible_events.sessions_v2

Query id: 0b562490-b1eb-4575-896e-cd6f578a96ac

┌─statement──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│ CREATE TABLE plausible_events.sessions_v2
(
    `session_id` UInt64,
    `sign` Int8,
    `site_id` UInt64,
    `user_id` UInt64,
    `hostname` String,
    `timestamp` DateTime CODEC(DoubleDelta, LZ4),
    `start` DateTime CODEC(DoubleDelta, LZ4),
    `is_bounce` UInt8,
    `entry_page` String,
    `exit_page` String,
    `pageviews` Int32,
    `events` Int32,
    `duration` UInt32,
    `referrer` String,
    `referrer_source` String,
    `country_code` LowCardinality(FixedString(2)),
    `screen_size` LowCardinality(String),
    `operating_system` LowCardinality(String),
    `browser` LowCardinality(String),
    `utm_medium` String,
    `utm_source` String,
    `utm_campaign` String,
    `browser_version` LowCardinality(String),
    `operating_system_version` LowCardinality(String),
    `subdivision1_code` LowCardinality(String),
    `subdivision2_code` LowCardinality(String),
    `city_geoname_id` UInt32,
    `utm_content` String,
    `utm_term` String,
    `transferred_from` String,
    `entry_meta.key` Array(String),
    `entry_meta.value` Array(String)
)
ENGINE = CollapsingMergeTree(sign)
PARTITION BY toYYYYMM(start)
PRIMARY KEY (site_id, toDate(start), user_id, session_id)
ORDER BY (site_id, toDate(start), user_id, session_id)
SAMPLE BY user_id
SETTINGS index_granularity = 8192 │
└────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘

1 row in set. Elapsed: 0.003 sec.

@ruslandoga ruslandoga self-assigned this Mar 2, 2024
@ruslandoga
Copy link
Contributor

ruslandoga commented Mar 2, 2024

Thank you! I've been able to reproduce it but I'm not sure what's the cause yet. _sample_factor is still available in ClickHouse 24.2:

7eeb32dddd7a :) select sum(s.pageviews * s._sample_factor) from sessions_v2 s sample 100;

-- ┌─sum(multiply(pageviews, _sample_factor))─┐
-- │                       390594.55999999976 │
-- └──────────────────────────────────────────┘

-- 1 row in set. Elapsed: 0.006 sec. Processed 20.73 thousand rows, 248.77 KB (3.22 million rows/s., 38.60 MB/s.)
-- Peak memory usage: 0.00 B.

7eeb32dddd7a :) select version();

-- ┌─version()───┐
-- │ 24.2.1.2248 │
-- └─────────────┘

If the fix would involve changing the queries, it'd be unlikely to be merged until Plausible Cloud switches to the version requiring it.

@ruslandoga
Copy link
Contributor

Hm, this query seems to work in clickhouse client

SELECT
    toStartOfHour(toTimeZone(e0.timestamp, 'Europe/Berlin')),
    toUInt64(round(uniq(e0.user_id) * any(_sample_factor)))
FROM events_v2 AS e0
SAMPLE 20000000
GROUP BY toStartOfHour(toTimeZone(e0.timestamp, 'Europe/Berlin'))
ORDER BY toStartOfHour(toTimeZone(e0.timestamp, 'Europe/Berlin')) ASC

-- ┌─toStartOfHour(toTimeZone(timestamp, 'Europe/Berlin'))─┬─toUInt64(round(multiply(uniq(user_id), any(_sample_factor))))─┐
-- │                                   2022-08-10 17:00:00 │                                                             1 │
-- │                                   2022-08-10 18:00:00 │                                                             1 │
-- │                                   2022-08-11 11:00:00 │                                                             1 │
-- │                                   2022-08-11 12:00:00 │                                                             1 │
-- etc.

-- 981 rows in set. Elapsed: 0.039 sec. Processed 599.36 thousand rows, 7.19 MB (15.34 million rows/s., 184.13 MB/s.)

but adding a WHERE clause makes it fail

SELECT
    toStartOfHour(toTimeZone(e0.timestamp, 'Europe/Berlin')),
    toUInt64(round(uniq(e0.user_id) * any(_sample_factor)))
FROM events_v2 AS e0
SAMPLE 20000000
WHERE (e0.site_id = 2) AND ((e0.timestamp >= '2024-03-01 23:00:00') AND (e0.timestamp < '2024-03-02 23:00:00'))
GROUP BY toStartOfHour(toTimeZone(e0.timestamp, 'Europe/Berlin'))
ORDER BY toStartOfHour(toTimeZone(e0.timestamp, 'Europe/Berlin')) ASC

-- Received exception from server (version 24.2.1):
-- Code: 10. DB::Exception: Received from localhost:9000. DB::Exception: Not found column _sample_factor:  in block timestamp DateTime UInt32(size = 0), site_id UInt64 UInt64(size = 0), user_id UInt64 UInt64(size = 0). (NOT_FOUND_COLUMN_IN_BLOCK)

@ruslandoga
Copy link
Contributor

Seems to be more of a ClickHouse bug, I'll try opening an issue.

@ruslandoga
Copy link
Contributor

ruslandoga commented Mar 2, 2024

ClickHouse/ClickHouse#60685

I'll re-open this issue if it turns out to not be a bug.

@luizkowalski
Copy link
Author

Thanks a lot, @ruslandoga! I will subscribed to that issue

@dustinmoris
Copy link

FYI, I tried the workaround in the linked ClickHouse issue and it worked for me by setting query_plan_optimize_prewhere to 0

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants