Climbmix (vs Fineweb) annotated with Topics/Formats and with precomputed embeddings. #696

ddudek · 2026-04-11T12:09:35Z

Climbmix annotated (hf repo)

If anyone would like to play with the dataset: I've uploaded 200 shards of the current Climbmix dataset with precomputed embeddings (jina v5 nano) and with topic and format classifications:
https://huggingface.co/datasets/ddudek/nanochat-climbmix-annotated

The original Climbmix dataset does come with topic labels, but first, the original labels were hard to get, and second, I didn't like the topic split.

The Parquet files keep the nanochat-compatible format (row groups, 'text' column), so they can be used as a drop-in replacement for Karpathy's mix, and the additional metadata can be used in code for experiments (e.g. training a small model only on "Fashion & Beauty" topics).
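A minimal sketch of the "train only on one topic" idea, in case it helps. Only the 'text' column and the row-group layout come from the description above; the column name `topic`, the label string format, and both helper names are my assumptions, so check the actual schema on the hf repo before using it.

```python
from typing import Any, Dict, Iterable, Iterator, List, Optional


def iter_rows(parquet_path: str, columns: Optional[List[str]] = None) -> Iterator[Dict[str, Any]]:
    """Yield rows one row group at a time, so a whole shard is never loaded at once."""
    # Imported lazily: pyarrow is only needed when reading real shards.
    import pyarrow.parquet as pq

    parquet_file = pq.ParquetFile(parquet_path)
    for group_index in range(parquet_file.num_row_groups):
        table = parquet_file.read_row_group(group_index, columns=columns)
        yield from table.to_pylist()


def filter_by_topic(
    rows: Iterable[Dict[str, Any]],
    topic: str,
    topic_column: str = "topic",  # ASSUMPTION: hypothetical column name
) -> Iterator[str]:
    """Yield the 'text' of every row whose topic label equals `topic`."""
    for row in rows:
        if row.get(topic_column) == topic:
            yield row["text"]


# Tiny in-memory stand-in for real rows, just to show the call shape.
sample_rows = [
    {"text": "How to pick a lipstick shade", "topic": "Fashion & Beauty"},
    {"text": "Torque specs for wheel nuts", "topic": "Transportation"},
    {"text": "Layering scarves in winter", "topic": "Fashion & Beauty"},
]
fashion_texts = list(filter_by_topic(sample_rows, "Fashion & Beauty"))
```

With a real shard this would be something like `filter_by_topic(iter_rows(path, columns=["text", "topic"]), "Fashion & Beauty")`, where `path` points at a downloaded parquet file.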

Embeddings: Generated using jinaai/jina-embeddings-v5-text-nano with task="clustering" (768 dimensions, float16)
Topic Classification: WebOrganizer/TopicClassifier-NoURL
Format Classification: WebOrganizer/FormatClassifier-NoURL
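One thing the embeddings are handy for is finding documents similar to a given one. Below is a rough sketch of cosine-similarity lookup in plain Python. It assumes the embedding vectors have already been loaded as sequences of floats (how they are stored in the parquet files is not shown here), and the function names are mine. The toy vectors are 3-dimensional only to keep the example readable; the real ones have 768 dimensions.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def nearest(
    query: Sequence[float],
    candidates: Sequence[Sequence[float]],
    k: int = 3,
) -> List[Tuple[int, float]]:
    """Return (index, similarity) for the k candidates most similar to `query`."""
    scored = [(i, cosine_similarity(query, vec)) for i, vec in enumerate(candidates)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy example: candidate 1 points in the same direction as the query.
toy_query = [1.0, 0.0, 0.0]
toy_candidates = [[0.0, 1.0, 0.0], [2.0, 0.0, 0.0], [1.0, 1.0, 0.0]]
top_two = nearest(toy_query, toy_candidates, k=2)
```

For more than a few thousand rows a vectorized version (e.g. NumPy, after casting float16 to float32) would be the practical choice; the loop version is only meant to show the computation.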

Additionally, I have a small sample of the previous Fineweb-edu data (it's only a small sample, so I haven't uploaded it to hf), and here's the comparison:

Climbmix vs Fineweb — Dataset Comparison

Comparison based on 50,000 rows sampled from each dataset.

Topic Distribution

Topic	Climbmix	%	Fineweb	%	Diff (pp)
Science & Tech.	12026	24.1%	11752	23.5%	+0.6
Health	7329	14.7%	9791	19.6%	-4.9
Home & Hobbies	5807	11.6%	1683	3.4%	+8.2
Education & Jobs	3169	6.3%	5535	11.1%	-4.8
Food & Dining	2325	4.6%	613	1.2%	+3.4
Transportation	2244	4.5%	696	1.4%	+3.1
Industrial	1977	4.0%	1442	2.9%	+1.1
Sports & Fitness	1861	3.7%	536	1.1%	+2.6
Hardware	1584	3.2%	483	1.0%	+2.2
Art & Design	1577	3.2%	1063	2.1%	+1.1
Software Dev.	1355	2.7%	1013	2.0%	+0.7
Software	1066	2.1%	805	1.6%	+0.5
Finance & Business	1029	2.1%	1129	2.3%	-0.2
History	971	1.9%	5172	10.3%	-8.4
Fashion & Beauty	959	1.9%	122	0.2%	+1.7
Games	809	1.6%	178	0.4%	+1.2
Entertainment	666	1.3%	406	0.8%	+0.5
Politics	640	1.3%	2259	4.5%	-3.2
Religion	607	1.2%	1777	3.6%	-2.4
Social Life	559	1.1%	434	0.9%	+0.2
Literature	513	1.0%	1977	4.0%	-3.0
Crime & Law	443	0.9%	765	1.5%	-0.6
Travel	440	0.9%	365	0.7%	+0.2
Adult	44	0.1%	4	0.0%	+0.1

Format Distribution

Format	Climbmix	%	Fineweb	%	Diff (pp)
Knowledge Article	8665	17.3%	13667	27.3%	-10.0
Tutorial	7421	14.8%	5259	10.5%	+4.3
Product Page	4599	9.2%	3317	6.6%	+2.6
Q&A Forum	4057	8.1%	567	1.1%	+7.0
News Article	3069	6.1%	5106	10.2%	-4.1
Personal Blog	2923	5.8%	2746	5.5%	+0.3
Comment Section	2688	5.4%	179	0.4%	+5.0
FAQ	2574	5.1%	361	0.7%	+4.4
Academic Writing	2482	5.0%	3424	6.8%	-1.8
Nonfiction Writing	2061	4.1%	6549	13.1%	-9.0
Content Listing	1656	3.3%	1052	2.1%	+1.2
Listicle	1515	3.0%	1405	2.8%	+0.2
News (Org.)	1074	2.1%	1671	3.3%	-1.2
Customer Support	790	1.6%	372	0.7%	+0.9
Truncated	743	1.5%	572	1.1%	+0.4
Audio Transcript	733	1.5%	386	0.8%	+0.7
Structured Data	647	1.3%	1148	2.3%	-1.0
About (Org.)	567	1.1%	1003	2.0%	-0.9
Documentation	538	1.1%	596	1.2%	-0.1
Spam / Ads	428	0.9%	214	0.4%	+0.5
User Review	299	0.6%	92	0.2%	+0.4
Creative Writing	270	0.5%	222	0.4%	+0.1
About (Pers.)	166	0.3%	37	0.1%	+0.2
Legal Notices	35	0.1%	55	0.1%	+0.0
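For anyone checking the numbers: the "Diff (pp)" column is the Climbmix share minus the Fineweb share, in percentage points. A small sketch of that arithmetic, using counts from the tables above and the 50,000-row sample size (the function names are just for illustration):

```python
SAMPLE_SIZE = 50_000  # rows sampled from each dataset


def share_percent(count: int, sample_size: int = SAMPLE_SIZE) -> float:
    """Share of the sample that falls in one category, in percent."""
    return 100.0 * count / sample_size


def diff_pp(climbmix_count: int, fineweb_count: int, sample_size: int = SAMPLE_SIZE) -> float:
    """Climbmix share minus Fineweb share, in percentage points."""
    return share_percent(climbmix_count, sample_size) - share_percent(fineweb_count, sample_size)


# Home & Hobbies (topic table) and Knowledge Article (format table).
home_hobbies_diff = round(diff_pp(5807, 1683), 1)
knowledge_article_diff = round(diff_pp(8665, 13667), 1)
```

A positive value means the category is more common in Climbmix, a negative one that it is more common in Fineweb.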

Key Takeaways

Topics — largest differences

Science & Tech. is the top topic in both (~24%)
Climbmix overrepresents: Home & Hobbies (+8.2pp), Food & Dining (+3.4), Transportation (+3.1), Sports & Fitness (+2.6)
Fineweb overrepresents: History (+8.4pp), Health (+4.9), Education & Jobs (+4.8), Politics (+3.2), Literature (+3.0)

Formats — largest differences

Climbmix overrepresents: Q&A Forum (+7.0pp)*, Comment Section (+5.0), FAQ (+4.4), Tutorial (+4.3)
Fineweb overrepresents: Knowledge Article (+10.0pp), Nonfiction Writing (+9.0), News Article (+4.1)

Character

Climbmix skews toward practical, interactive, user-generated content (Q&A*, tutorials, FAQs, product pages, home & hobbies)
Fineweb skews toward traditional long-form written content (nonfiction, knowledge articles, news, history, literature, education)

* My comment: in Climbmix some docs contain QA pairs appended to the document itself, and the classifier might have been skewed by this.
