In [None]:
!pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cpu
!pip install datasets transformers

Looking in indexes: https://download.pytorch.org/whl/cpu
INFO: pip is looking at multiple versions of torch to determine which version is compatible with other requirements. This could take a while.
Collecting torch
  Downloading https://download.pytorch.org/whl/cpu/torch-2.6.0%2Bcpu-cp311-cp311-linux_x86_64.whl.metadata (26 kB)
Downloading https://download.pytorch.org/whl/cpu/torch-2.6.0%2Bcpu-cp311-cp311-linux_x86_64.whl (178.7 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 178.7/178.7 MB 5.6 MB/s eta 0:00:00
Installing collected packages: torch
  Attempting uninstall: torch
    Found existing installation: torch 2.6.0+cu124
    Uninstalling torch-2.6.0+cu124:
      Successfully uninstalled torch-2.6.0+cu124
Successfully installed torch-2.6.0+cpu
Collecting datasets
  Downloading datasets-3.4.1-py3-none-any.whl.metadata (19 kB)
Collecting dill<0.3.9,>=0.3.0 (from datasets)
  Downloading dill-0.3.8-py3-none-any.whl.metadata (10 kB)
Collec

# Train Our Own SmolLM (MiniLM) from Scratch

In this project, we will build our own SmolLM from scratch, covering key steps like pretraining, fine-tuning, reinforcement learning, and deployment. With limited computational resources, we will experiment with training models on consumer GPUs, focusing on educational purposes and extending them with Chinese language understanding.

## What is SmolLM?

Hugging Face’s **SmolLM** series introduces compact yet powerful language models optimized for efficiency. Initially released in three sizes—**135M**, **360M**, and **1.7B** parameters—these models are trained on high-quality datasets such as **Cosmopedia v2**, **FineWeb-Edu**, and **Python-Edu**, enabling strong reasoning and world knowledge capabilities.

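Building on this, **SmolLM2** further improves instruction-following, knowledge, reasoning, and mathematics skills. Its largest variant (1.7B) was trained on **11 trillion tokens**, while the 360M and 135M variants were trained on 4 trillion and 2 trillion tokens respectively, drawn from diverse datasets like **FineWeb-Edu**, **DCLM**, **The Stack**, and new mathematics and coding datasets.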

By leveraging the insights from SmolLM, our project aims to push the boundaries of small-scale yet high-performance models.

## Exploring the SmolLM 135M Model

Before diving into the training process, let’s first examine the architecture of the **SmolLM 135M** model and understand its key components.


In [None]:
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

checkpoint = "HuggingFaceTB/SmolLM2-135M-Instruct"

# Prefer CUDA, then Apple-silicon MPS, then fall back to CPU
if torch.cuda.is_available():
    device = torch.device("cuda")
elif torch.backends.mps.is_available():
    device = torch.device("mps")
else:
    device = torch.device("cpu")
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
# for multiple GPUs install accelerate and do `model = AutoModelForCausalLM.from_pretrained(checkpoint, device_map="auto")`
model = AutoModelForCausalLM.from_pretrained(checkpoint).to(device)

messages = [{"role": "user", "content": "What is gravity?"}]
input_text=tokenizer.apply_chat_template(messages, tokenize=False)
print(input_text)
inputs = tokenizer.encode(input_text, return_tensors="pt").to(device)
outputs = model.generate(inputs, max_new_tokens=50, temperature=0.2, top_p=0.9, do_sample=True)
print(tokenizer.decode(outputs[0]))


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.


<|im_start|>system
You are a helpful AI assistant named SmolLM, trained by Hugging Face<|im_end|>
<|im_start|>user
What is gravity?<|im_end|>

<|im_start|>system
You are a helpful AI assistant named SmolLM, trained by Hugging Face<|im_end|>
<|im_start|>user
What is gravity?<|im_end|>
<|im_start|>assistant
Gravity is a fundamental force of nature that attracts objects with mass towards each other. It is a result of the interaction between mass, energy, and space itself. According to Einstein's theory of general relativity, gravity is not a


SmolLM models use the Llama architecture, which is why `transformers` loads them as `LlamaForCausalLM`. Like other decoder-only transformers, they combine an embedding layer, self-attention, and feed-forward networks to process and generate text.

SmolLM Model Structure:

- Embedding Layer: Converts input tokens into dense vector representations.
- Transformer Layers: Each consists of:
  - Self-Attention Mechanism: Allows the model to focus on different parts of the input sequence when making predictions.
  - Feed-Forward Neural Networks: Process the outputs from the attention mechanism to capture complex patterns.
  - Normalization Layers: Stabilize and accelerate training by normalizing inputs to each layer.
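
To make the normalization step concrete, here is a minimal pure-Python sketch of RMSNorm, the normalization used by Llama-style models. It is an illustration under stated assumptions rather than the library's implementation: the function name `rms_norm` is hypothetical, and the default `eps=1e-5` is chosen to match the `LlamaRMSNorm` modules that `print(model)` reports.

```python
import math

def rms_norm(x, weight, eps=1e-5):
    """Divide x by its root-mean-square, then scale by a learned weight.

    Unlike classic LayerNorm, there is no mean subtraction and no bias term.
    """
    mean_square = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_square + eps)
    return [v * inv_rms * w for v, w in zip(x, weight)]

hidden = [3.0, 4.0]
normalized = rms_norm(hidden, weight=[1.0, 1.0])
print(normalized)  # values whose mean square is ~1
```

Because the only learned tensor is `weight`, each RMSNorm over a hidden size of 576 contributes 576 parameters.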



In [None]:
print(model)

LlamaForCausalLM(
  (model): LlamaModel(
    (embed_tokens): Embedding(49152, 576, padding_idx=2)
    (layers): ModuleList(
      (0-29): 30 x LlamaDecoderLayer(
        (self_attn): LlamaAttention(
          (q_proj): Linear(in_features=576, out_features=576, bias=False)
          (k_proj): Linear(in_features=576, out_features=192, bias=False)
          (v_proj): Linear(in_features=576, out_features=192, bias=False)
          (o_proj): Linear(in_features=576, out_features=576, bias=False)
        )
        (mlp): LlamaMLP(
          (gate_proj): Linear(in_features=576, out_features=1536, bias=False)
          (up_proj): Linear(in_features=576, out_features=1536, bias=False)
          (down_proj): Linear(in_features=1536, out_features=576, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): LlamaRMSNorm((576,), eps=1e-05)
        (post_attention_layernorm): LlamaRMSNorm((576,), eps=1e-05)
      )
    )
    (norm): LlamaRMSNorm((576,), eps=1e-05)
    (rotary_emb)

# MiniLM Model Architecture Design

Small-scale large language models (LLMs) are gaining traction for their efficiency and versatility. In the development of MobileLLM, researchers found that for small-scale language models, increasing the number of layers (depth) is more beneficial than expanding the size of each layer (width). Here we compare Qwen2.5 0.5B, MobileLLM 125M, SmolLM 135M, and SmolLM2 135M across three key aspects: training data, model architecture, and training details.

## 1. Training Data
| Model          | Training Tokens  | Sources                                              | Key Features |
|---------------|-----------------|------------------------------------------------------|--------------|
| **Qwen2.5 0.5B** | 18T             | General text, code, math, multilingual data        | Broad knowledge, long-context support |
| **MobileLLM 125M** | 1T             | Publicly available online data                     | Efficient, optimized for mobile |
| **SmolLM 135M** | 600B            | Cosmo-Corpus (educational & synthetic datasets)    | Strong in academic and coding domains |
| **SmolLM2 135M** | 2T | FineMath, Stack-Edu, InfiMM-WebMath, Cosmopedia   | Domain-specific enhancements (math, coding) |

## 2. Model Architecture
| Model          | Type                 | Parameters | Layers | Attention Heads | KV Heads | Hidden Size | Intermediate Size | Vocab Size | Key Innovations |
|---------------|----------------------|------------|--------|----------------|---------|-------------|-----------------|------------|-----------------|
| **Qwen2.5 0.5B** | Decoder-only Transformer | 500M       | 24     | 14             | 2       | 896         | 4864            | 151,936    | RoPE, SwiGLU, long-context (32K tokens) |
| **MobileLLM 125M** | Decoder-only Transformer | 125M       | 30     | 9              | 3       | 576         | 1536            | 32,000     | SwiGLU, embedding sharing, GQA |
| **SmolLM 135M** | Decoder-only Transformer | 135M       | 30     | 9              | 3       | 576         | 1536            | 49,152     | Llama-like, tied embeddings |
| **SmolLM2 135M** | Decoder-only Transformer | 135M       | 30     | 9              | 3       | 576         | 1536            | 49,152     | Same Llama-like architecture as SmolLM 135M (GQA, tied embeddings) |

## 3. Training Details
| Model          | Hardware Used      | Training Time | Cost   | Notes |
|---------------|------------------|--------------|-------|-------|
| **Qwen2.5 0.5B** | Not disclosed      | Not disclosed | Not disclosed | Likely high computational requirement due to 18T tokens |
| **MobileLLM 125M** | 32 A100 80G GPUs  | ~3 days       | Not specified | Optimized for mobile deployment |
| **SmolLM 135M** | 64 H100 GPUs       | Not specified | Not specified | Uses high-end GPUs for better efficiency |
| **SmolLM2 135M** | Not specified      | Not specified | Not specified | Expected to have similar efficiency to MobileLLM |

## Conclusion
Each of these models has unique strengths:
- **Qwen2.5 0.5B**: Best for long-context applications and diverse knowledge coverage.
- **MobileLLM 125M**: Most optimized for mobile and edge deployments.
- **SmolLM 135M**: Focused on educational and coding applications with efficient architecture.
- **SmolLM2 135M**: Same architecture as SmolLM 135M (see the printed model above), improved mainly through more and better training data (2T tokens).


| **Aspect**                | **Qwen2.5 0.5B**                     | **MobileLLM 125M**                    | **SmolLM 135M**                      | **SmolLM2 135M**                     |
|---------------------------|--------------------------------------|--------------------------------------|--------------------------------------|--------------------------------------|
| **Training Data Size**    | 18T tokens                           | 1T tokens                            | 600B tokens                          | 2T   |
| **Data Sources**          | Web, code, math, multilingual        | Publicly available online data       | Cosmo-Corpus (Cosmopedia v2, Python-Edu, FineWeb-Edu) | FineMath, Stack-Edu, InfiMM-WebMath, Cosmopedia |
| **Model Architecture**    | Transformer (decoder-only)           | Transformer (decoder-only)           | Transformer (decoder-only)           | Transformer (decoder-only with GQA)  |
| **Total Parameters**      | 0.5B (500M)                          | 125M                                 | 135M                                 | 135M                                 |
| **Layers**                | 24                                   | 30                                   | 30                                   | 30                                   |
| **Attention Heads**       | 14 query, 2 KV heads                 | 9 attention heads, 3 KV heads        | 9 attention heads, 3 KV heads        | 9 attention heads, 3 KV heads        |
| **Hidden Size**           | 896                                  | 576                                  | 576                                  | 576                                  |
| **Intermediate Size**     | 4864                                 | 1536                                 | 1536                                 | 1536                                 |
| **Vocab Size**            | 151,936                              | 32,000                               | 49,152                               | 49,152                               |
| **Key Innovations**       | RoPE, SwiGLU, tied embeddings        | SwiGLU, deep and thin architecture, embedding sharing, GQA | Tied embeddings, Llama-like architecture | Grouped Query Attention (GQA)        |
| **Training Hardware**     | Not disclosed                        | 32 A100 80G GPUs                     | 64 H100 GPUs                         | Not specified                        |
| **Training Time**         | Not disclosed                        | ~3 days                              | Not specified                        | Not specified                        |
| **Training Cost**         | Not disclosed                        | Not specified                        | Not specified                        | Not specified                        |



__We will adapt the SmolLM model architecture for our MiniLM, using a ~20k vocabulary so that the total parameter count is ~0.1B.__

# Model Parameter Calculation

The total parameter count for a transformer-based language model like LLaMA (decoder-only) can be approximated using the following formula, considering the embedding layer, attention mechanism with Grouped Query Attention (GQA), feed-forward network (FFN), and layer normalization:

$$
\text{Total Parameters} = \text{Embeddings} + (\text{Attention} + \text{FFN} + \text{LayerNorms}) \times L + \text{FinalLayerNorm}
$$

### **Breakdown of Components**

#### **Embeddings**
$$
\text{Embeddings} = V \times H
$$
Where:
- \( V \) is the vocabulary size
- \( H \) is the hidden size

#### **Attention Per Layer (with GQA)**
Each layer consists of Query, Key, Value, and Output projections:
$$
\text{Query} = H \times (N_q \times D_h)
$$
$$
\text{Key} = H \times (N_{kv} \times D_h)
$$
$$
\text{Value} = H \times (N_{kv} \times D_h)
$$
$$
\text{Output} = H \times H
$$
Total attention parameters per layer:
$$
H \times (N_q \times D_h + 2 \times N_{kv} \times D_h + H)
$$

Where:
- \( N_q \) = number of attention heads
- \( N_{kv} \) = number of key-value heads
- \( D_h = H / N_q \) (head dimension)

#### **FFN Per Layer**
Llama-style models use a gated (SwiGLU) feed-forward network with three projection matrices (`gate_proj`, `up_proj`, and `down_proj` in the printed model above):
$$
\text{FFN} = 3 \times H \times I
$$
Where \( I \) is the intermediate size.

#### **Layer Normalizations Per Layer**
Each layer has two RMSNorm modules. RMSNorm has a weight vector but no bias, so each one contributes \( H \) parameters:
$$
\text{LayerNorms} = 2 \times H
$$

#### **Final Layer Normalization**
$$
\text{FinalLayerNorm} = H
$$

### **Applying to SmolLM-Like Configuration**
Given:
- \( V = 20,000 \)
- \( H = 576 \)
- \( I = 1536 \)
- \( L = 30 \)
- \( N_q = 9 \)
- \( N_{kv} = 3 \)
- \( D_h = H / N_q = 576 / 9 = 64 \)

#### **Embeddings Calculation**
$$
20,000 \times 576 = 11,520,000
$$

#### **Attention Per Layer**
$$
\text{Query} = 576 \times (9 \times 64) = 576 \times 576 = 331,776
$$
$$
\text{Key} = 576 \times (3 \times 64) = 576 \times 192 = 110,592
$$
$$
\text{Value} = 110,592
$$
$$
\text{Output} = 576 \times 576 = 331,776
$$
$$
\text{Total Attention} = 331,776 + 110,592 + 110,592 + 331,776 = 884,736
$$

#### **FFN Per Layer**
$$
\text{Gate Projection} = 576 \times 1536 = 884,736
$$
$$
\text{Up Projection} = 576 \times 1536 = 884,736
$$
$$
\text{Down Projection} = 1536 \times 576 = 884,736
$$
$$
\text{Total FFN} = 3 \times 884,736 = 2,654,208
$$

#### **LayerNorm Per Layer**
$$
2 \times 576 = 1,152
$$

#### **Total Per Layer**
$$
884,736 + 2,654,208 + 1,152 = 3,540,096
$$

#### **Total for All Layers**
$$
30 \times 3,540,096 = 106,202,880
$$

#### **Final LayerNorm**
$$
576
$$

#### **Final Total Parameter Count**
$$
11,520,000 + 106,202,880 + 576 = 117,723,456
$$

This is about 0.12B parameters. The count assumes tied input and output embeddings, so the output head adds no extra parameters. As a sanity check, the same formula with \( V = 49,152 \) gives 134,515,008, which matches the ~135M size of SmolLM2 135M.
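
As a cross-check, the per-component count can be scripted. The sketch below is derived from the module list that `print(model)` shows (four attention projections, three MLP projections, two weight-only RMSNorms per layer, one final RMSNorm) and assumes tied input/output embeddings. The function name `count_llama_params` and its argument names are hypothetical, introduced only for illustration.

```python
def count_llama_params(vocab, hidden, intermediate, layers,
                       n_heads, n_kv_heads, tie_embeddings=True):
    """Approximate parameter count for a Llama-style decoder without biases."""
    head_dim = hidden // n_heads
    embed = vocab * hidden

    attn = hidden * (n_heads * head_dim)          # q_proj
    attn += 2 * hidden * (n_kv_heads * head_dim)  # k_proj and v_proj (GQA)
    attn += (n_heads * head_dim) * hidden         # o_proj

    ffn = 3 * hidden * intermediate               # gate_proj, up_proj, down_proj
    norms = 2 * hidden                            # two weight-only RMSNorms

    total = embed + layers * (attn + ffn + norms) + hidden  # + final RMSNorm
    if not tie_embeddings:
        total += vocab * hidden                   # separate lm_head
    return total

mini = count_llama_params(20_000, 576, 1536, 30, 9, 3)
smol = count_llama_params(49_152, 576, 1536, 30, 9, 3)
print(f"MiniLM (20k vocab):   {mini:,}")  # 117,723,456
print(f"SmolLM2 (49,152 vocab): {smol:,}")  # 134,515,008
```

The second call uses the SmolLM2 135M configuration, which makes it easy to compare the estimate against `sum(p.numel() for p in model.parameters())` on the loaded model.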



# Choose Datasets


## Pretrain Data
SmolLM's 135M-parameter model was trained on approximately 600 billion tokens from the SmolLM-Corpus, a high-quality dataset developed by Hugging Face. SmolLM2 135M keeps the same compact architecture but scales training to roughly 2 trillion tokens. In both cases the data was meticulously curated: the process involved extensive filtering, model-based quality scoring to select data, and experiments with different dataset mixtures to maximize performance at this small scale.

To ensure the model is proficient in __both English and Chinese__, we can use, for example, the following datasets:
- [BAAI/IndustryCorpus2](https://huggingface.co/datasets/BAAI/IndustryCorpus2)
- [deepctrl-sft-data](https://modelscope.cn/datasets/deepctrl/deepctrl-sft-data)

More Chinese datasets can be found in this [Awesome-Chinese-LLM](https://github.com/HqWu-HITCS/Awesome-Chinese-LLM) repo.

---

Below is an analysis (drafted with Grok deep research and cross-checked with ChatGPT online/reasoning 😄) of pretraining a small language model (LLM) similar to SmolLM, with a 20k vocabulary and approximately 0.1 billion parameters, on both Chinese and English data. The goal is bilingual ability with general world knowledge, given our limited compute resources (two 3080 GPUs, 24 hours of training time) and the dataset chosen above: **BAAI/IndustryCorpus2**

### BAAI/IndustryCorpus2
- **Overview:**  
  - Hosted on Hugging Face, this dataset comprises a large corpus categorized into 31 industry categories with both Chinese and English data.  
  - It includes high-quality sources (e.g., the Pile, bigcode, open-web-math) and is organized into three quality levels (high, middle, low).

- **Size & Token Estimate:**  
  - Total processed disk size: **3276 GB** (approximately 1 TB for Chinese and 2.2 TB for English).  
  - Estimated tokens (using a rough ratio from similar datasets like the Pile at ~2.75 bytes per token):  
    - **~1161.6 billion tokens total**  
    - ~363 billion tokens for Chinese  
    - ~798.6 billion tokens for English

- **Key Industry Categories (Data Size):**
  - **News:** 51.0 GB
  - **Literature and emotions:** 105.5 GB
  - **Subject education:** 340.9 GB
  - **Sports:** 262.5 GB
  - **Games:** 37.6 GB
  - **Programming:** 11.0 GB (minimize or exclude based on user preference)
  - **Biomedicine:** 61.7 GB

- **Conclusion:**  
  With its extensive coverage and focus on general world knowledge, **BAAI/IndustryCorpus2** is the ideal candidate for pretraining this bilingual model. ✅

---

### Compute Constraints and Data Volume 💻

- **Hardware:**  
  - Two 3080 GPUs (10-12 GB of memory each)

- **Training Speed Estimates:**  
  - **Baseline:**  
    - For a small LLM (e.g., GPT-2 small, 124M parameters), a single 3080 can process around 16,384 tokens per second (batch size 32, sequence length 512).  
    - With two GPUs: ~32,768 tokens/sec, totaling roughly **2.8B tokens** in 24 hours.
  - **Optimized Settings:**  
    - With improved batch size (e.g., 64) and shorter sequence length (256), each GPU might achieve ~46,811 tokens per second.  
    - Two GPUs combined could reach ~93,622 tokens/sec, totaling around **8.1B tokens** in 24 hours.

- **Feasibility:**  
  - Processing about **8 billion tokens** in 24 hours looks achievable under the optimized estimate, which is close to the target of about **10B tokens**; at ~93,622 tokens/sec, the full 10B tokens would take roughly 30 hours.  
  - Given the model’s small size, even one pass over this data may suffice for initial learning. 👍
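
The throughput arithmetic above can be reproduced with a few lines. This is only a back-of-the-envelope sketch: the per-GPU rates are the estimates quoted in this section, not measurements, and the helper name `tokens_in` is hypothetical.

```python
def tokens_in(hours, tokens_per_sec_per_gpu, num_gpus=2):
    """Total tokens processed, assuming perfect linear scaling across GPUs."""
    return tokens_per_sec_per_gpu * num_gpus * hours * 3600

baseline = tokens_in(24, 16_384)    # batch 32, sequence length 512
optimized = tokens_in(24, 46_811)   # batch 64, sequence length 256
hours_for_10b = 10e9 / (46_811 * 2) / 3600

print(f"baseline:  {baseline / 1e9:.1f}B tokens in 24 h")   # 2.8B
print(f"optimized: {optimized / 1e9:.1f}B tokens in 24 h")  # 8.1B
print(f"10B tokens at the optimized rate: {hours_for_10b:.1f} h")  # 29.7 h
```

Real multi-GPU runs rarely scale perfectly, so these figures are best read as upper bounds.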

---

### Sampling Strategy 🔍

- **Language Balance:**  
  - To ensure a balanced bilingual model, aim to sample a total of **10B tokens** (5B tokens for Chinese and 5B tokens for English).  
  - Note: This samples a larger fraction of the Chinese data (~1.38% of its estimated tokens) than of the English data (~0.63%), which is acceptable given the bilingual objective.

- **Category Focus:**  
  - Prioritize general categories like news, literature, education, sports, and games.  
  - Exclude or minimize categories such as programming, biomedicine, and mathematics-statistics.

- **Sampling Method:**  
  - **Stratified Sampling:**  
    - Sample proportionally from each general category within the Chinese and English datasets.  
    - Adjust sampling rates to achieve the target of 5B tokens per language.
  - **Implementation Tools:**  
    - Tools like the Hugging Face `datasets` library can be used to facilitate this stratified sampling process.

- **Conclusion:**  
  A stratified sampling strategy that ensures both diversity and language balance will best serve the goal of training a well-rounded bilingual model. 🌟

Note: Hugging Face streaming mode lets us download only the sampled data, which is much faster than fetching the full corpus. See `prepare_datasets.py` for details.
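
One way to turn the stratified sampling idea into numbers is to split each language's 5B-token budget across the general categories in proportion to their size. The sketch below is illustrative only: the sizes are the category totals listed earlier (which are not broken down by language), the category keys are hypothetical labels rather than confirmed dataset field values, and `allocate_budget` is a made-up helper name.

```python
# Category sizes in GB, taken from the list earlier in this section.
CATEGORY_GB = {
    "news": 51.0,
    "literature and emotions": 105.5,
    "subject education": 340.9,
    "sports": 262.5,
    "games": 37.6,
}

def allocate_budget(sizes, token_budget):
    """Split token_budget across categories in proportion to their size."""
    total = sum(sizes.values())
    return {name: int(token_budget * size / total) for name, size in sizes.items()}

per_language_budget = 5_000_000_000
quotas = allocate_budget(CATEGORY_GB, per_language_budget)
for name, quota in quotas.items():
    print(f"{name:25s} {quota / 1e9:.2f}B tokens")

# Fraction of each language's estimated tokens that the budget covers.
estimated_tokens = {"Chinese": 363e9, "English": 798.6e9}
fractions = {lang: per_language_budget / est for lang, est in estimated_tokens.items()}
for lang, frac in fractions.items():
    print(f"{lang}: {frac:.2%} of estimated tokens")
```

In streaming mode, these quotas would act as stopping criteria: keep reading documents from a category until its token quota is reached, then move on.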

In [None]:
from huggingface_hub import login
login()


In [None]:
from datasets import load_dataset

# Load dataset in streaming mode
dataset = load_dataset("BAAI/IndustryCorpus2", split="train", streaming=True)

# Pull the first sample from the stream to inspect it (nothing else is downloaded)
first_sample = next(iter(dataset))
print(first_sample)
print(first_sample.keys())

Resolving data files:   0%|          | 0/3203 [00:00<?, ?it/s]

{'text': '马亮:如何破解外卖骑手的"生死劫"\n在消费至上的今天,企业不应道德绑架消费者,让消费者为企业的伪善埋单.在就业压力加大和外卖骑手供过于求的情况下,外卖企业涸泽而渔地压榨骑手,而由此导致的成本和责任,也不应由消费者来承担.\n当人们通过手机一键叫外卖,方便享受外卖送餐服务时,可能并不会想到外卖骑手在配送热气腾腾的餐食过程中所经历的一次次"生死劫".当人们抱怨外卖骑手一骑绝尘,横穿马路,闯红灯的时候,可能无法想象这实际上是外卖平台企业通过科学算法精准预测而来的结果.\n"人物"杂志在题为"外卖骑手,困在系统里"一文中,历数外卖骑手在平台算法的"压榨"下,不得不冒着生命危险超速,逆行,闯红灯.与此同时,外卖骑手在用更少的速度和更低的成本在配送餐食的时候,他们的收入却并没有随之增长,工作的安全性,稳定性和价值也在受到侵蚀.\n外卖骑手之所以铤而走险地"冒死"送餐,其背后是饿了么,美团等平台企业所打造的强大算法系统.这套基于精准匹配法则的算法,可以实时实地监测外卖骑手,并基于数据对外卖骑手进行考核和奖惩.这使外卖企业极大地降低了成本并提高了效率,并通过压榨外卖骑手而赚得盆满钵满.但是,当人们对外卖平台企业骂声一片的时候,可能并没有认识到外卖骑手面临的职业安全难题,并没有看上去那么简单.\n作为一个新兴业态,外卖行业的崛起,既便利了人们的日常生活,提升了人们的点餐就餐体验,也创造了大量就业机会.据统计,外卖行业创造了700多万个工作岗位,使许多年轻人得到了"收入过万"的稳定工作.在这个外卖大军中,还不乏本科和硕士这样的高学历者.因此,外卖平台企业在创造就业方面,的确做出了应有的贡献,这是应该加以肯定的.\n在资本逐利的天性驱使下,外卖企业不得不尽可能压缩外卖骑手的福利,而去迎合投资方和消费者的需求.在强势的平台企业与弱势的劳动者之间,天平向平台企业倾斜,使外卖骑手不得不承受其难以承受之重.科技本应向善,但是资本裹挟却使科技向恶.外卖企业需要在追求效率,成本和利润的同时,更加注重企业社会责任,使企业在资本,劳动者和消费者之间求得平衡,而不是顾此失彼.\n新兴业态在创造一系列繁荣景象的同时,的确衍生出一些值得警惕的新问题.外卖骑手的职业安全问题不同于传统意义上的职业安全风险,后者通常是在固定工作场所发生的,劳动者与企业之间存在受法律保护的劳动关系.相对来说,外

In [None]:
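# You can also download the whole dataset and then sample, but this will be much slower.
# Note: the `category` and `language` field names below are assumed; check them against a real sample first.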

# # Load the dataset
# dataset = load_dataset("BAAI/IndustryCorpus2")

# # Define general categories
# general_categories = ["news", "literature and emotions", "subject education", "sports", "games"]

# # Filter for general categories and both languages
# filtered_dataset = dataset["train"].filter(
#     lambda x: x["category"] in general_categories and (x["language"] == "Chinese" or x["language"] == "English")
# )

# # Separate by language (filter is simpler and faster than building an index list with enumerate)
# chinese_data = filtered_dataset.filter(lambda x: x["language"] == "Chinese")
# english_data = filtered_dataset.filter(lambda x: x["language"] == "English")

# # Estimate tokens (placeholder; whitespace splitting badly undercounts Chinese, so use the real tokenizer)
# def estimate_tokens(sample):
#     return len(sample["text"].split())  # Simple word count as proxy

# # Sample documents from each language (simplified: 5,000,000 is a document count,
# # not a token count; adjust it until each side reaches ~5B tokens)
# chinese_sample = chinese_data.shuffle(seed=42).select(range(5000000))
# english_sample = english_data.shuffle(seed=42).select(range(5000000))

# # Combine and save (Dataset objects do not support `+`, so concatenate explicitly)
# from datasets import concatenate_datasets
# combined_sample = concatenate_datasets([chinese_sample, english_sample])
# combined_sample.to_json("pretrain_data.jsonl")

## Supervised Fine-Tuning (SFT) Dataset: SmolTalk

The SFT stage fine-tunes the model on instruction data to improve its ability to follow instructions and perform specific tasks. For SmolLM2, the primary dataset is SmolTalk, a curated collection that combines existing public datasets with newly created synthetic ones, giving a diverse and high-quality mix for instruction tuning.
The composition of SmolTalk, as detailed in the research paper "SmolLM2: When Smol Goes Big -- Data-Centric Training of a Small Language Model", includes the following datasets:

| Dataset                      | Samples | Description | Source |
|------------------------------|---------|-------------|--------|
| Smol-Magpie-Ultra            | 400,000 | Core component, generated with the Magpie pipeline using Llama-3.1-405B-Instruct, then curated and filtered; outperforms public datasets like OpenHermes and Magpie Pro on IFEval and MT-Bench. | SmolTalk GitHub |
| Smol-constraints             | 36,000  | Trains models to follow specific constraints; decontaminated against IFEval. | SmolTalk GitHub |
| Smol-rewrite                 | 50,000  | Focused on text rewriting tasks, e.g., adjusting tone. | SmolTalk GitHub |
| Smol-summarize               | 100,000 | Specialized in email and news summarization. | SmolTalk GitHub |
| OpenHermes2.5                | 100,000 | Enhances MMLU, WinoGrande, and BBH. | OpenHermes2.5 dataset |
| MetaMathQA                   | 50,000  | Improves mathematics and reasoning; random samples. | MetaMathQA dataset |
| NuminaMath-CoT               | -       | Helps on mathematics, especially hard problems in MATH. | NuminaMath-CoT dataset |
| Self-OSS-Starcoder2-Instruct | -       | Improves coding capabilities. | Self-OSS-Starcoder2-Instruct dataset |
| SystemChats2.0               | 30,000  | Supports various system prompt formats. | SystemChats2.0 dataset |
| LongAlign                    | -       | English samples under 16k tokens, trained with a sequence length of 8192 for long-context understanding. | LongAlign dataset |
| Everyday-conversations       | -       | Multi-turn everyday conversations such as greetings; used in SmolLM v1 post-training. | Everyday-conversations dataset |
| APIGen-Function-Calling      | 80,000  | Mix of Synth-APIGen-v0.1 and xlam-function-calling-60k for function calling. | APIGen-Function-Calling dataset |
| Explore-Instruct-Rewriting   | 30,000  | Rewriting dataset. | Explore-Instruct-Rewriting dataset |

SmolTalk contains approximately 1 million instruction-response pairs. SmolLM2 was fine-tuned on it for 2 epochs with a global batch size of 128, a sequence length of 8192, and a learning rate of 3.0 × 10⁻⁴. The dataset is licensed under Apache 2.0, while the existing public datasets it includes retain their original licenses.

For the smaller models in the SmolLM2 family (135M and 360M parameters), a subset called [Smol-SmolTalk](https://huggingface.co/datasets/HuggingFaceTB/smol-smoltalk) is used, and we use it for SFT here as well.
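
Before SFT, each conversation has to be rendered into the chat format the model expects. The sketch below mirrors the `<|im_start|>` / `<|im_end|>` layout printed by `apply_chat_template` earlier in this notebook. It is a simplified illustration, not the tokenizer's actual template: `to_chatml` is a hypothetical helper, it does not inject the default system prompt, the example conversation is made up, and it assumes each record provides a list of `{"role", "content"}` messages. In practice, `tokenizer.apply_chat_template` is the safer choice.

```python
def to_chatml(messages, add_generation_prompt=False):
    """Render {"role", "content"} dicts in the ChatML-style layout shown earlier."""
    text = "".join(
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in messages
    )
    if add_generation_prompt:
        # Open an assistant turn so the model continues from here at inference time.
        text += "<|im_start|>assistant\n"
    return text

# Made-up example conversation, for illustration only.
example = [
    {"role": "user", "content": "What is gravity?"},
    {"role": "assistant", "content": "Gravity is a force that attracts objects with mass."},
]
print(to_chatml(example))
```

For training, the full conversation (including the assistant turn) is rendered; `add_generation_prompt=True` is only needed when prompting the model to produce a new reply.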

## Direct Preference Optimization (DPO) Dataset: UltraFeedback
The DPO stage aligns the model with human preferences by optimizing on preference data. For SmolLM2, the dataset used is [UltraFeedback](https://huggingface.co/datasets/openbmb/UltraFeedback), a large-scale, fine-grained, and diverse preference dataset developed by OpenBMB. UltraFeedback is designed for training reward models and critic models, containing over 64,000 prompts and 256,000 responses, annotated by GPT-4 for aspects like instruction-following, truthfulness, honesty, and helpfulness.

Details from the UltraFeedback documentation indicate that UltraFeedback consists of prompts from diverse resources such as UltraChat, ShareGPT, Evol-Instruct, TruthfulQA, FalseQA, and FLAN, with each prompt generating four different responses from various LLMs. This dataset is particularly effective for preference learning, improving benchmarks like MT-Bench, MMLU-Pro, and MATH. SmolLM2 was trained on UltraFeedback for 2 epochs, with a learning rate of 1.0 × 10⁻⁶, beta 0.5, global batch size 128, and sequence length 1024.
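
To show what the `beta` hyperparameter controls, here is a minimal sketch of the standard DPO loss for a single preference pair, written in plain Python. The inputs are summed log-probabilities of the chosen and rejected responses under the policy being trained and under a frozen reference model. The function and argument names are hypothetical; a real training run would use a library trainer operating on batched tensors.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.5):
    """DPO loss for one (chosen, rejected) pair: -log(sigmoid(margin))."""
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    margin = chosen_reward - rejected_reward
    # -log(sigmoid(margin)) rewritten as log(1 + exp(-margin))
    return math.log1p(math.exp(-margin))

# Policy identical to the reference: no preference learned yet, loss = log(2).
print(dpo_loss(-2.0, -2.0, -2.0, -2.0))
# Policy has shifted probability toward the chosen response: lower loss.
print(dpo_loss(-1.0, -3.0, -2.0, -2.0))
```

A larger `beta` makes the loss more sensitive to how far the policy's log-probabilities move away from the reference.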