Climbmix (vs Fineweb) annotated with Topics/Formats and with precomputed embeddings. #696
ddudek started this conversation in Show and tell
Climbmix annotated (hf repo)
If anyone would like to play with the dataset: I've uploaded 200 shards of the current Climbmix dataset with precomputed embeddings (jina v5 nano) and with topic and format classifications:
https://huggingface.co/datasets/ddudek/nanochat-climbmix-annotated
The original Climbmix dataset does come with topic labels, but first, the original labels were hard to get, and second, I didn't like the topic split.
The Parquet files keep the nanochat-compatible format (row groups, 'text' column), so they can be used as a drop-in replacement for Karpathy's mix, and the additional metadata can be used in code for any experiments (e.g. training a small model only on "Fashion & Beauty" topics).
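A rough sketch of the "train only on one topic" idea, in case it helps. Only the 'text' column is confirmed above; the `topic` column name, the `filter_by_topic` and `iter_shard_rows` helpers, and the sample rows are assumptions for illustration, so check the dataset card for the real schema. Reading shards needs `pyarrow`, which is imported lazily so the filter itself runs on plain dicts.

```python
from typing import Iterable, Iterator, Mapping

# ASSUMPTION: the topic label lives in a column named "topic".
# Only the 'text' column is confirmed; check the dataset card.
TOPIC_COLUMN = "topic"


def filter_by_topic(
    rows: Iterable[Mapping],
    wanted: Iterable[str],
    topic_column: str = TOPIC_COLUMN,
) -> Iterator[Mapping]:
    """Yield only the rows whose topic label is in `wanted`."""
    wanted_set = set(wanted)
    for row in rows:
        if row.get(topic_column) in wanted_set:
            yield row


def iter_shard_rows(path: str) -> Iterator[dict]:
    """Stream one Parquet shard row group by row group (hypothetical helper)."""
    import pyarrow.parquet as pq  # lazy import: only needed for real shards

    shard = pq.ParquetFile(path)
    for i in range(shard.num_row_groups):
        # Each row group becomes a list of {column: value} dicts.
        yield from shard.read_row_group(i).to_pylist()


# Made-up rows standing in for a shard, just to show the filter.
sample_rows = [
    {"text": "Spring coat trends", "topic": "Fashion & Beauty"},
    {"text": "Intro to sorting algorithms", "topic": "Software Development"},
]
fashion_texts = [
    row["text"] for row in filter_by_topic(sample_rows, {"Fashion & Beauty"})
]
```

On a real shard this would be `filter_by_topic(iter_shard_rows(path), {"Fashion & Beauty"})`, with `path` pointing at a downloaded Parquet file.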
Embeddings: Generated using jinaai/jina-embeddings-v5-text-nano with task="clustering" (768 dimensions, float16)
Topic Classification: WebOrganizer/TopicClassifier-NoURL
Format Classification: WebOrganizer/FormatClassifier-NoURL

Additionally, I have a small sample of the previous Fineweb-edu (too small to be worth uploading to hf), and here's the comparison:
Climbmix vs Fineweb — Dataset Comparison
Comparison based on 50,000 rows sampled from each dataset.
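For anyone who wants to run a similar comparison on their own samples, here is a minimal sketch of how per-label shares and the largest differences between two datasets could be computed. This is not necessarily how the numbers in this post were produced; `label_shares` and `largest_differences` are hypothetical helpers, and the labels in the example are made up.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple


def label_shares(labels: Iterable[str]) -> Dict[str, float]:
    """Return the fraction of rows carrying each label."""
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {label: n / total for label, n in counts.items()}


def largest_differences(
    labels_a: Iterable[str],
    labels_b: Iterable[str],
    top_n: int = 5,
) -> List[Tuple[str, float]]:
    """Rank labels by |share in A - share in B|, largest first.

    The returned difference is signed: positive means the label is
    more common in A than in B. Ties are broken alphabetically.
    """
    shares_a = label_shares(labels_a)
    shares_b = label_shares(labels_b)
    all_labels = set(shares_a) | set(shares_b)
    diffs = {
        label: shares_a.get(label, 0.0) - shares_b.get(label, 0.0)
        for label in all_labels
    }
    ranked = sorted(diffs.items(), key=lambda kv: (-abs(kv[1]), kv[0]))
    return ranked[:top_n]


# Made-up labels, only to show the call shape.
climbmix_topics = ["Health", "Health", "Education", "Finance"]
fineweb_topics = ["Health", "Education", "Education", "Education"]
top = largest_differences(climbmix_topics, fineweb_topics, top_n=3)
```

The same function works for format labels; only the input lists change.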
Topic Distribution
Format Distribution
Key Takeaways
Topics — largest differences
Formats — largest differences
Character
* My comment: in Climbmix, some docs contain QA pairs appended to the document itself, and the classifier might have been skewed by this.