From 18cf6f26596706c3a72a18f8de1ed72013f007a3 Mon Sep 17 00:00:00 2001
From: Pasquale Minervini
Date: Tue, 14 Oct 2025 12:27:06 +0100
Subject: [PATCH 1/2] Update smollm3.md

The citation for intra-document masking is missing, fixed it

---
 smollm3.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/smollm3.md b/smollm3.md
index c995150c29..1638aaff8c 100644
--- a/smollm3.md
+++ b/smollm3.md
@@ -73,7 +73,7 @@ SmolLM3 follows a transformer decoder architecture with tied embedding similar t
 
 **NoPE:** We implemented NoPE from "[RoPE to NoRoPE and Back Again: A New Hybrid Attention Strategy](https://huggingface.co/papers/2501.18795)" (Yang et al., 2025), selectively removing rotary position embeddings from every 4th layer. This approach improves long context performance without affecting short context capabilities, as confirmed by our ablations.
 
-**Intra-Document Masking:** During training, we use attention masking to ensure tokens from different documents in the same training sequence don't attend to each other. Similar to Llama 3, this helps with faster and more stable long context training while maintaining short context performance.
+**Intra-Document Masking:** Following "[Analysing The Impact of Sequence Composition on Language Model Pre-Training](https://arxiv.org/abs/2402.13991)", during training, we use attention masking to ensure tokens from different documents in the same training sequence don't attend to each other. Similar to Llama 3, this helps with faster and more stable long context training while maintaining short context performance.
 
 **Training Stability:** Following OLMo 2, we remove weight decay from embedding layers to improve training stability. This modification contributed to more stable training dynamics, with embedding norms naturally stabilizing at healthier values during training without impacting overall performance in our ablations.
 

From d3c327e975781bbd5cdbd2b842131c8a2f837a54 Mon Sep 17 00:00:00 2001
From: Pasquale Minervini
Date: Wed, 15 Oct 2025 11:17:44 +0100
Subject: [PATCH 2/2] Update smollm3.md

using https://huggingface.co/papers/2402.13991 instead of arxiv

---
 smollm3.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/smollm3.md b/smollm3.md
index 1638aaff8c..06d9b41ee5 100644
--- a/smollm3.md
+++ b/smollm3.md
@@ -73,7 +73,7 @@ SmolLM3 follows a transformer decoder architecture with tied embedding similar t
 
 **NoPE:** We implemented NoPE from "[RoPE to NoRoPE and Back Again: A New Hybrid Attention Strategy](https://huggingface.co/papers/2501.18795)" (Yang et al., 2025), selectively removing rotary position embeddings from every 4th layer. This approach improves long context performance without affecting short context capabilities, as confirmed by our ablations.
 
-**Intra-Document Masking:** Following "[Analysing The Impact of Sequence Composition on Language Model Pre-Training](https://arxiv.org/abs/2402.13991)", during training, we use attention masking to ensure tokens from different documents in the same training sequence don't attend to each other. Similar to Llama 3, this helps with faster and more stable long context training while maintaining short context performance.
+**Intra-Document Masking:** Following "[Analysing The Impact of Sequence Composition on Language Model Pre-Training](https://huggingface.co/papers/2402.13991)", during training, we use attention masking to ensure tokens from different documents in the same training sequence don't attend to each other. Similar to Llama 3, this helps with faster and more stable long context training while maintaining short context performance.
 
 **Training Stability:** Following OLMo 2, we remove weight decay from embedding layers to improve training stability. This modification contributed to more stable training dynamics, with embedding norms naturally stabilizing at healthier values during training without impacting overall performance in our ablations.
 
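For context on the technique the cited paragraph describes, here is a minimal sketch of intra-document (block-diagonal) causal masking for a packed training sequence. This is not from the patch or the SmolLM3 codebase; `intra_document_mask` and `document_ids` are hypothetical names, and the mask is built as a dense boolean tensor purely for illustration.

```python
# Illustrative sketch only: intra-document causal masking for packed sequences.
import torch

def intra_document_mask(document_ids: torch.Tensor) -> torch.Tensor:
    """document_ids: (seq_len,) tensor mapping each token to the document it came from.

    Returns a (seq_len, seq_len) boolean mask where True means attention is allowed:
    standard causal attention, restricted to tokens from the same document.
    """
    seq_len = document_ids.shape[0]
    # True where query token i and key token j belong to the same document.
    same_doc = document_ids.unsqueeze(0) == document_ids.unsqueeze(1)
    # Lower-triangular causal constraint: token i may only attend to positions j <= i.
    causal = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
    return same_doc & causal

# Example: a packed sequence containing three documents of lengths 3, 2, and 3.
doc_ids = torch.tensor([0, 0, 0, 1, 1, 2, 2, 2])
mask = intra_document_mask(doc_ids)
# mask[4, 2] is False: token 4 (document 1) cannot attend to token 2 (document 0),
# even though position 2 precedes position 4 in the packed sequence.
```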