Fix user_id column type in ClickHouse benchmark table schema by kolodkin · Pull Request #3 · kolodkin/samples

kolodkin · 2026-03-20T21:13:03Z

Summary

Fixed a bug in the ClickHouse Snowflake ID benchmark where the user_id column was incorrectly using the id type definition instead of the user_id type definition.

Changes

- Changed the user_id column definition from {defs['id']} to {defs['user_id']} in the benchmark table schema.
- The user_id column now uses the type specified for it in the type definitions dictionary.

Details

The benchmark's CREATE TABLE statement referenced the wrong key in the type definitions dictionary, so the user_id column could be created with the id column's data type instead of its own. With the fix, the schema matches the type definitions, which the benchmark results depend on.
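
A minimal sketch of this kind of fix, assuming the schema is built with an f-string over a `defs` dictionary. The dictionary contents, the table name, and the `build_create_table` helper are hypothetical; only the change from `defs['id']` to `defs['user_id']` comes from this PR.

```python
# Hypothetical type definitions; the real dictionary in the benchmark may differ.
defs = {"id": "UInt64", "user_id": "UUID"}


def build_create_table(defs: dict, table: str = "bench") -> str:
    """Build a CREATE TABLE statement from a type-definitions dict."""
    return (
        f"CREATE TABLE {table} (\n"
        f"    id {defs['id']},\n"
        # Before the fix, this line interpolated defs['id'] as well,
        # so user_id silently took the id column's type.
        f"    user_id {defs['user_id']}\n"
        f") ENGINE = MergeTree ORDER BY id"
    )


if __name__ == "__main__":
    print(build_create_table(defs))
```

When both keys map to the same type, the bug is invisible, which is likely why it only matters once the two definitions diverge.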

https://claude.ai/code/session_013iKn3cYDCMudtWU69aVXLa

- Install ClickHouse from packages.clickhouse.com deb repo
- Use `clickhouse server` for startup (works as root)
- Print version on startup
- Add access/ to .gitignore (ClickHouse runtime files)

https://claude.ai/code/session_013iKn3cYDCMudtWU69aVXLa

The ORDER BY with OFFSET+LIMIT pattern (pagination) was missing from the benchmark. This matters because String IDs are wider and sorting with an offset can amplify the cost difference vs UInt64/UUID. https://claude.ai/code/session_013iKn3cYDCMudtWU69aVXLa

LIMIT 5 grabbed the first (smallest) IDs, which always hit the first granule and made point lookups trivially fast for all types. Sampling from OFFSET NUM_ROWS/2 gives a fairer comparison. https://claude.ai/code/session_013iKn3cYDCMudtWU69aVXLa
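
The mid-table sampling described above could look roughly like this. The `sample_ids_query` helper, the table name, and the `NUM_ROWS` value are assumptions for illustration; the benchmark script's own names and row count may differ.

```python
NUM_ROWS = 1_000_000  # assumed table size, not taken from the script


def sample_ids_query(table: str, num_rows: int = NUM_ROWS, sample_size: int = 5) -> str:
    """Return SQL that samples IDs from the middle of the sorted key range.

    Starting at OFFSET num_rows // 2 avoids always reading the first
    granule, which is what ORDER BY id LIMIT 5 alone would do.
    """
    offset = num_rows // 2
    return f"SELECT id FROM {table} ORDER BY id LIMIT {sample_size} OFFSET {offset}"


if __name__ == "__main__":
    print(sample_ids_query("bench_snowflake"))
```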

Use concrete ~xN speedups instead of vague language. Fix incorrect claim that UUID edges out UInt64 on range scans (it's ~x1). Highlight String JOIN penalty (~x2) and add OFFSET LIMIT to the list. https://claude.ai/code/session_013iKn3cYDCMudtWU69aVXLa

…training)
- Drop the per-class threshold-tuning stage (improvement #1).
- Improvement #3: NUM_EPOCHS 3 -> 5, LR 5e-5 -> 2e-5 (canonical BERT recipe; best val-macro-F1 checkpoint is kept, so extra epochs are low-risk).
- Add a truncation check in the tokenize cell reporting plot token-length percentiles and the share of plots exceeding MAX_LENGTH.

claude added 25 commits March 20, 2026 21:11

Fix user_id column to reference defs['user_id'] in CREATE TABLE

f107d1b

https://claude.ai/code/session_013iKn3cYDCMudtWU69aVXLa

Simplify benchmark to 2-column tables with DEFAULT id generation

8f7f7d3

Tables now have just (id, value) where id uses DEFAULT expressions (generateSnowflakeID, generateUUIDv7, etc.) and value is rand64(). https://claude.ai/code/session_013iKn3cYDCMudtWU69aVXLa
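
A sketch of what such a two-column schema might look like. The `generateSnowflakeID`, `generateUUIDv7`, and `rand64` functions are the ones named in the commit message; the table names, the `create_table_sql` helper, the choice of MergeTree, and the use of DEFAULT for the value column are assumptions.

```python
# ID type and DEFAULT expression per variant. The function names come from
# the commit message; the pairing with table names is illustrative only.
ID_COLUMNS = {
    "snowflake": ("UInt64", "generateSnowflakeID()"),
    "uuid_v7": ("UUID", "generateUUIDv7()"),
}


def create_table_sql(name: str, id_type: str, id_default: str) -> str:
    """Build a two-column (id, value) table whose id fills itself in."""
    return (
        f"CREATE TABLE {name} (\n"
        f"    id {id_type} DEFAULT {id_default},\n"
        f"    value UInt64 DEFAULT rand64()\n"
        f") ENGINE = MergeTree ORDER BY id"
    )


if __name__ == "__main__":
    for variant, (id_type, id_default) in ID_COLUMNS.items():
        print(create_table_sql(f"bench_{variant}", id_type, id_default))
```

Letting the server generate IDs through DEFAULT keeps the insert path identical across ID types, so differences in write time are more likely to reflect the ID type than the client code.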

Print ClickHouse version in benchmark output

ac8835f

https://claude.ai/code/session_013iKn3cYDCMudtWU69aVXLa

Break long lines in reddit.txt for readability

2fce1ed

Split two long paragraph lines into 2-3 shorter lines each. https://claude.ai/code/session_013iKn3cYDCMudtWU69aVXLa

Remove specific multipliers from prose in reddit.txt

14dad54

Keep exact numbers in the table but use general language in the text. https://claude.ai/code/session_013iKn3cYDCMudtWU69aVXLa

Fix benchmark to use same logical row positions for fair comparison

bcf72b2

Sample IDs at fixed offsets (1000, 250k, 500k, 750k, 999k) instead of ORDER BY id LIMIT 5, so range scans and lookups hit equivalent data across all ID types. https://claude.ai/code/session_013iKn3cYDCMudtWU69aVXLa

Use shared seq column for fair ID sampling across tables

2afc17e

Add a seq column (0..N-1) to all tables so sample IDs are picked at the same logical insertion positions, replacing the OFFSET-based approach. https://claude.ai/code/session_013iKn3cYDCMudtWU69aVXLa

Restore original straightforward ID sampling

c578818

Revert to simple ORDER BY id LIMIT 5 sampling — the seq column approach was unnecessarily complex and results were consistent either way. https://claude.ai/code/session_013iKn3cYDCMudtWU69aVXLa

Update reddit post with actual benchmark results

3b46e62

Replace fabricated table with real numbers. Acknowledge UUID's edge on range scans and GROUP BY while noting UInt64 wins on writes/storage. https://claude.ai/code/session_013iKn3cYDCMudtWU69aVXLa

Replace benchmark table with prose summary in reddit post

baba678

Remove specific numbers, summarize findings in plain text instead. https://claude.ai/code/session_013iKn3cYDCMudtWU69aVXLa

Add JOIN benchmark query and update reddit post

e9ab63e

Create 100k-row lookup tables per ID type for a fair JOIN test. UInt64 and UUID are comparable; String is ~1.5x slower. https://claude.ai/code/session_013iKn3cYDCMudtWU69aVXLa
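
One way the JOIN test could be set up, as a sketch only. The helper names, table names, and the choice to derive the lookup table from the main table are assumptions; the commit message only states that each ID type gets a 100k-row lookup table.

```python
LOOKUP_ROWS = 100_000  # row count stated in the commit message


def lookup_table_sql(main_table: str, lookup_table: str, rows: int = LOOKUP_ROWS) -> str:
    """Create a lookup table holding a subset of the main table's IDs."""
    return (
        f"CREATE TABLE {lookup_table} ENGINE = MergeTree ORDER BY id AS "
        f"SELECT id FROM {main_table} LIMIT {rows}"
    )


def join_query(main_table: str, lookup_table: str) -> str:
    """Count the rows matched when joining the main table to its lookup table."""
    return (
        f"SELECT count() FROM {main_table} AS m "
        f"INNER JOIN {lookup_table} AS l ON m.id = l.id"
    )


if __name__ == "__main__":
    print(lookup_table_sql("bench_uuid", "bench_uuid_lookup"))
    print(join_query("bench_uuid", "bench_uuid_lookup"))
```

Giving every ID type its own lookup table of the same size keeps the join key type identical on both sides, so no implicit casts distort the comparison.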

Clarify LowCardinality compression benefit in reddit post

89dcfb9

Explain why higher cardinality shows bigger compression wins: plain String compression struggles with more unique values while LC stays compact via dictionary encoding. https://claude.ai/code/session_013iKn3cYDCMudtWU69aVXLa

Round speedup ratios to 1 decimal place in both benchmarks

2b8e6c2

Show order-of-magnitude differences rather than false precision from run-to-run variance. Added explanation in script docstrings. https://claude.ai/code/session_013iKn3cYDCMudtWU69aVXLa

Use 0 decimal places for speedup ratios

9a4c620

https://claude.ai/code/session_013iKn3cYDCMudtWU69aVXLa

Format speedup ratios as ~x1, ~x2, etc.

065473e

https://claude.ai/code/session_013iKn3cYDCMudtWU69aVXLa
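
A minimal sketch of the `~xN` formatting described in this commit. The `format_speedup` name and its arguments are hypothetical; the script's actual helper may be structured differently.

```python
def format_speedup(baseline_seconds: float, candidate_seconds: float) -> str:
    """Format how many times faster the candidate is, as a whole-number ratio.

    Zero decimal places are used on purpose: run-to-run variance makes
    extra digits misleading.
    """
    ratio = baseline_seconds / candidate_seconds
    return f"~x{ratio:.0f}"


if __name__ == "__main__":
    print(format_speedup(2.0, 1.0))
    print(format_speedup(1.04, 1.0))
```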

Remove specific numbers from reddit post

effdb7f

Use qualitative language instead of ~xN ratios — the benchmark scripts produce the exact numbers, the post should just convey the takeaways. https://claude.ai/code/session_013iKn3cYDCMudtWU69aVXLa

Use 'the benchmark' instead of 'our benchmark' / 'we benchmarked'

854e151

https://claude.ai/code/session_013iKn3cYDCMudtWU69aVXLa

Add UUID docs link alongside UInt64 docs link

ffa73f9

https://claude.ai/code/session_013iKn3cYDCMudtWU69aVXLa

Only run benchmark CI on changes in post_0001/

baab946

https://claude.ai/code/session_013iKn3cYDCMudtWU69aVXLa

Rename workflow to post_0001

d9a21f8

https://claude.ai/code/session_013iKn3cYDCMudtWU69aVXLa

Rename benchmark.yml to post_0001.yml

1b00813

https://claude.ai/code/session_013iKn3cYDCMudtWU69aVXLa

kolodkin merged commit 8c54817 into main Mar 21, 2026

kolodkin merged 25 commits into main from claude/snowflake-vs-uuid-tpYK2
