diff --git a/_blog.yml b/_blog.yml
index c4ac503a3f..e4b24fc580 100644
--- a/_blog.yml
+++ b/_blog.yml
@@ -6888,7 +6888,7 @@
     - hub
 - local: streaming-datasets
-  title: "Streaming datasets at scale"
+  title: "Streaming datasets: 100x More Efficient"
   author: andito
   thumbnail: /blog/assets/streaming_datasets/streaming_datasets.png
   date: Oct 27, 2025
diff --git a/streaming-datasets.md b/streaming-datasets.md
index fe9253056e..4dcd2c3576 100644
--- a/streaming-datasets.md
+++ b/streaming-datasets.md
@@ -1,5 +1,5 @@
 ---
-title: Streaming datasets: 100x More Efficient
+title: "Streaming datasets: 100x More Efficient"
 thumbnail: /blog/assets/streaming_datasets/streaming_datasets.png
 authors:
   - user: andito
@@ -9,6 +9,7 @@ authors:
   - user: merve
 ---
 
+# Streaming datasets: 100x More Efficient
 
 ## TLDR
 
@@ -20,7 +21,6 @@
 
 Visualization of a dataset being streamed
 
-## Streaming datasets: 100x More Efficient
 
 Loading data, especially at the terabyte scale, is a major pain in any machine learning workflow. We suffered this while training [SmolLM3](https://huggingface.co/blog/smollm3); at one point we had to wait 3 hours before each run to download enough data.