From 3cbaef3f402565044fbd14013aa0bc21e952b887 Mon Sep 17 00:00:00 2001
From: Andres Marafioti
Date: Mon, 27 Oct 2025 16:15:13 +0100
Subject: [PATCH 1/2] fixing title

---
 _blog.yml             | 2 +-
 streaming-datasets.md | 4 ++--
 2 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/_blog.yml b/_blog.yml
index c4ac503a3f..e4b24fc580 100644
--- a/_blog.yml
+++ b/_blog.yml
@@ -6888,7 +6888,7 @@
     - hub
 - local: streaming-datasets
-  title: "Streaming datasets at scale"
+  title: "Streaming datasets: 100x More Efficient"
   author: andito
   thumbnail: /blog/assets/streaming_datasets/streaming_datasets.png
   date: Oct 27, 2025
diff --git a/streaming-datasets.md b/streaming-datasets.md
index fe9253056e..5ef900e0ae 100644
--- a/streaming-datasets.md
+++ b/streaming-datasets.md
@@ -1,5 +1,5 @@
 ---
-title: Streaming datasets: 100x More Efficient
+title: "Streaming datasets: 100x More Efficient"
 thumbnail: /blog/assets/streaming_datasets/streaming_datasets.png
 authors:
 - user: andito
@@ -20,7 +20,7 @@ authors:
 
 Visualization of a dataset being streamed
 
-## Streaming datasets: 100x More Efficient
+# Streaming datasets: 100x More Efficient
 
 Loading data, especially at the terabyte scale, is a major pain in any machine learning workflow. We suffered this while training [SmolLM3](https://huggingface.co/blog/smollm3), at one point we had to wait 3 hours before each run to download enough data.

From 72b4bfa4b2d7526e96923b11d40be36fe13eb937 Mon Sep 17 00:00:00 2001
From: Andres Marafioti
Date: Mon, 27 Oct 2025 16:15:39 +0100
Subject: [PATCH 2/2] title first

---
 streaming-datasets.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/streaming-datasets.md b/streaming-datasets.md
index 5ef900e0ae..4dcd2c3576 100644
--- a/streaming-datasets.md
+++ b/streaming-datasets.md
@@ -9,6 +9,7 @@ authors:
   - user: merve
 ---
 
+# Streaming datasets: 100x More Efficient
 
 ## TLDR
@@ -20,7 +21,6 @@
 
 Visualization of a dataset being streamed
 
-# Streaming datasets: 100x More Efficient
 
 Loading data, especially at the terabyte scale, is a major pain in any machine learning workflow. We suffered this while training [SmolLM3](https://huggingface.co/blog/smollm3), at one point we had to wait 3 hours before each run to download enough data.