Fix user_id column type in ClickHouse benchmark table schema #3

Merged
kolodkin merged 25 commits into main from claude/snowflake-vs-uuid-tpYK2
Mar 21, 2026

@kolodkin

Owner

Summary

Fixed a bug in the ClickHouse Snowflake ID benchmark where the user_id column was incorrectly using the id type definition instead of the user_id type definition.

Changes

  • Changed user_id column definition from {defs['id']} to {defs['user_id']} in the benchmark table schema
  • This ensures the user_id column uses the correct type as specified in the type definitions dictionary

Details

The benchmark table creation was referencing the wrong key from the type definitions, which could cause the user_id column to have an incorrect data type. This fix ensures proper schema definition for accurate benchmarking.
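
To illustrate the kind of key mix-up described above, here is a minimal sketch. The contents of `defs`, the table name, and the helper `build_benchmark_ddl` are hypothetical stand-ins chosen for illustration; the actual benchmark script may look different.

```python
# Hypothetical type definitions; the real dictionary in the benchmark may differ.
defs = {"id": "UInt64", "user_id": "UInt32"}


def build_benchmark_ddl(defs: dict[str, str], table: str = "bench_events") -> str:
    """Build a CREATE TABLE statement, looking up each column's type under its own key."""
    return (
        f"CREATE TABLE {table} (\n"
        f"    id {defs['id']},\n"
        f"    user_id {defs['user_id']}\n"  # before the fix this line used defs['id']
        f") ENGINE = MergeTree ORDER BY id"
    )


print(build_benchmark_ddl(defs))
```

With the wrong key, `user_id` would silently inherit the `id` type, and any comparison involving that column would measure a different schema than intended.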

https://claude.ai/code/session_013iKn3cYDCMudtWU69aVXLa

claude added 25 commits March 20, 2026 21:11
Tables now have just (id, value) where id uses DEFAULT expressions
(generateSnowflakeID, generateUUIDv7, etc.) and value is rand64().

https://claude.ai/code/session_013iKn3cYDCMudtWU69aVXLa
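
The two-column layout from this commit could be generated along these lines. This is a sketch under assumptions: the `bench_` table names, the `ID_DEFAULTS` mapping, the MergeTree engine clause, and the choice to fill `value` through a DEFAULT expression are all illustrative and not taken from the repository. `generateSnowflakeID()`, `generateUUIDv7()`, and `rand64()` are ClickHouse functions.

```python
# Assumed mapping from ID flavour to (column type, DEFAULT expression).
ID_DEFAULTS = {
    "snowflake": ("UInt64", "generateSnowflakeID()"),
    "uuid_v7": ("UUID", "generateUUIDv7()"),
}


def create_table_sql(kind: str) -> str:
    """Return DDL for an (id, value) table whose id is filled by a DEFAULT expression."""
    col_type, default_expr = ID_DEFAULTS[kind]
    return (
        f"CREATE TABLE bench_{kind} (\n"
        f"    id {col_type} DEFAULT {default_expr},\n"
        f"    value UInt64 DEFAULT rand64()\n"
        f") ENGINE = MergeTree ORDER BY id"
    )


for kind in ID_DEFAULTS:
    print(create_table_sql(kind))
```

Keeping the generator in the column default means every ID type goes through the same insert path, so write timings differ only by the cost of the ID itself.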
- Install ClickHouse from packages.clickhouse.com deb repo
- Use `clickhouse server` for startup (works as root)
- Print version on startup
- Add access/ to .gitignore (ClickHouse runtime files)

https://claude.ai/code/session_013iKn3cYDCMudtWU69aVXLa
Keep exact numbers in the table but use general language in the text.

https://claude.ai/code/session_013iKn3cYDCMudtWU69aVXLa
Sample IDs at fixed offsets (1000, 250k, 500k, 750k, 999k) instead of
ORDER BY id LIMIT 5, so range scans and lookups hit equivalent data
across all ID types.

https://claude.ai/code/session_013iKn3cYDCMudtWU69aVXLa
Add a seq column (0..N-1) to all tables so sample IDs are picked at
the same logical insertion positions, replacing the OFFSET-based approach.

https://claude.ai/code/session_013iKn3cYDCMudtWU69aVXLa
Revert to simple ORDER BY id LIMIT 5 sampling — the seq column approach
was unnecessarily complex and results were consistent either way.

https://claude.ai/code/session_013iKn3cYDCMudtWU69aVXLa
Replace fabricated table with real numbers. Acknowledge UUID's edge
on range scans and GROUP BY while noting UInt64 wins on writes/storage.

https://claude.ai/code/session_013iKn3cYDCMudtWU69aVXLa
Create 100k-row lookup tables per ID type for a fair JOIN test.
UInt64 and UUID are comparable; String is ~1.5x slower.

https://claude.ai/code/session_013iKn3cYDCMudtWU69aVXLa
Explain why higher cardinality shows bigger compression wins: plain
String compression struggles with more unique values while LC stays
compact via dictionary encoding.

https://claude.ai/code/session_013iKn3cYDCMudtWU69aVXLa
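
The dictionary-encoding idea behind that explanation can be shown conceptually. The sketch below is a plain-Python illustration of the technique in general, not ClickHouse's actual LowCardinality implementation; `dictionary_encode` is a hypothetical helper.

```python
def dictionary_encode(values: list[str]) -> tuple[list[str], list[int]]:
    """Replace each string with an integer code pointing into a dictionary of unique values."""
    dictionary: list[str] = []   # unique values, in first-seen order
    index: dict[str, int] = {}   # value -> position in dictionary
    codes: list[int] = []        # one small integer per input row
    for v in values:
        if v not in index:
            index[v] = len(dictionary)
            dictionary.append(v)
        codes.append(index[v])
    return dictionary, codes


dictionary, codes = dictionary_encode(["eu-west", "us-east", "eu-west", "eu-west"])
print(dictionary, codes)
```

Each distinct string is stored once, and the per-row data becomes a column of integer codes.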
The ORDER BY with OFFSET+LIMIT pattern (pagination) was missing from
the benchmark. This matters because String IDs are wider and sorting
with an offset can amplify the cost difference vs UInt64/UUID.

https://claude.ai/code/session_013iKn3cYDCMudtWU69aVXLa
LIMIT 5 grabbed the first (smallest) IDs, which always hit the first
granule and made point lookups trivially fast for all types. Sampling
from OFFSET NUM_ROWS/2 gives a fairer comparison.

https://claude.ai/code/session_013iKn3cYDCMudtWU69aVXLa
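
A sampling query of the kind this commit describes might be built as follows. This is a sketch: the `NUM_ROWS` value, the table name, and the helper `sample_ids_sql` are assumptions for illustration only.

```python
NUM_ROWS = 1_000_000  # assumed table size; the benchmark's real value may differ


def sample_ids_sql(table: str, num_rows: int = NUM_ROWS, sample_size: int = 5) -> str:
    """Pick sample IDs from the middle of the sorted key range instead of the very start."""
    offset = num_rows // 2
    return f"SELECT id FROM {table} ORDER BY id LIMIT {sample_size} OFFSET {offset}"


print(sample_ids_sql("bench_snowflake"))
```

Because the sampled IDs come from the middle of the sort order, a point lookup has to locate a granule deep in the primary index rather than always landing on the first one.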
Show order-of-magnitude differences rather than false precision
from run-to-run variance. Added explanation in script docstrings.

https://claude.ai/code/session_013iKn3cYDCMudtWU69aVXLa
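
One way to report order-of-magnitude differences is to round each timing ratio to the nearest power of ten. The helper below is a hypothetical sketch of that idea, not necessarily how the benchmark scripts format their output.

```python
import math


def order_of_magnitude(ratio: float) -> str:
    """Round a speed ratio to the nearest power of ten on a log scale."""
    if ratio <= 0:
        raise ValueError("ratio must be positive")
    exponent = round(math.log10(ratio))
    return f"~x{10 ** exponent:g}"


for ratio in (1.3, 8.7, 250.0):
    print(ratio, "->", order_of_magnitude(ratio))
```

Rounding on a log scale keeps a 1.3x and a 0.9x result in the same bucket, so run-to-run noise no longer reads as a real difference.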
Use concrete ~xN speedups instead of vague language. Fix incorrect
claim that UUID edges out UInt64 on range scans (it's ~x1). Highlight
String JOIN penalty (~x2) and add OFFSET LIMIT to the list.

https://claude.ai/code/session_013iKn3cYDCMudtWU69aVXLa
Use qualitative language instead of ~xN ratios — the benchmark scripts
produce the exact numbers, the post should just convey the takeaways.

https://claude.ai/code/session_013iKn3cYDCMudtWU69aVXLa
@kolodkin kolodkin merged commit 8c54817 into main Mar 21, 2026
kolodkin pushed a commit that referenced this pull request Jun 21, 2026
…training)

- Drop the per-class threshold-tuning stage (improvement #1).
- Improvement #3: NUM_EPOCHS 3 -> 5, LR 5e-5 -> 2e-5 (canonical BERT recipe;
  best val-macro-F1 checkpoint is kept, so extra epochs are low-risk).
- Add a truncation check in the tokenize cell reporting plot token-length
  percentiles and the share of plots exceeding MAX_LENGTH.
2 participants