# DeepSeek-V2

```{note}
DeepSeek-V2{cite}`deepseekai2024deepseekv2strongeconomicalefficient` is a strong Mixture-of-Experts (MoE) language model characterized by
economical training and efficient inference.<br>
It comprises 236B total parameters, of which 21B
are activated for each token, and supports a context length of 128K tokens. DeepSeek-V2 adopts
innovative architectures including Multi-head Latent Attention (MLA) and DeepSeekMoE.
```

## Architecture

[](mla)

[](deepseekmoe)

## Pre-Training

### Data Construction

While maintaining the same data processing stages as for DeepSeek 67B{cite}`deepseekai2024deepseekllmscalingopensource`, we extend the amount of data and elevate the data quality.

We adopt the same tokenizer as used in DeepSeek 67B, which is built based on the Byte-level
Byte-Pair Encoding (BBPE) algorithm and has a vocabulary size of 100K. Our tokenized pretraining
corpus contains 8.1T tokens, where Chinese tokens are approximately 12% more than
English ones.

### Hyper-Parameters

We set the number of Transformer layers to 60 and the hidden
dimension to 5120. In MLA, we set the number of attention heads $n_h$ to 128 and the per-head dimension $d_h$
to 128. The KV compression dimension $d_c$ is set to 512, and the query compression dimension
$d_{c}'$ is set to 1536. For the decoupled queries and key, we set the per-head dimension $d_{h}^{R}$ to 64. We substitute all FFNs except for the first layer with MoE layers.
Each MoE layer consists of 2 shared experts and 160 routed experts, where the intermediate
hidden dimension of each expert is 1536. Among the routed experts, 6 experts will be activated
for each token. Under this configuration, DeepSeek-V2 comprises 236B total
parameters, of which 21B are activated for each token.

```json
# DeepSeek-V2/config.json
{
    "vocab_size": 102400,
    "hidden_size": 5120,
    
    "num_attention_heads": 128,
    "qk_nope_head_dim": 128,
    "v_head_dim": 128,
    "kv_lora_rank": 512,
    "q_lora_rank": 1536,
    "qk_rope_head_dim": 64,

    "first_k_dense_replace": 1,
    "intermediate_size": 12288,
    "n_shared_experts": 2,
    "n_routed_experts": 160,
    "moe_intermediate_size": 1536,
    "num_experts_per_tok": 6,
    ...
}
```

Now let's calculate the number of parameters of DeepSeek-V2 step by step:

1. Embedding and unembedding: 1048576000 (= 2 × 102400 × 5120)

```python
class DeepseekV2Model(DeepseekV2PreTrainedModel):
    def __init__(self, config: DeepseekV2Config):
        super().__init__(config)
        # 102400 * 5120
        self.embed_tokens = nn.Embedding(
            config.vocab_size, config.hidden_size, self.padding_idx
        )
        
class DeepseekV2ForCausalLM(DeepseekV2PreTrainedModel):

    def __init__(self, config):
        super().__init__(config)
        # 5120 * 102400
        self.lm_head = nn.Linear(config.hidden_size, config.vocab_size, bias=False)
```
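
As a quick check of the figure above, the arithmetic can be reproduced in a few lines. This is a minimal sketch that assumes the input embedding and the output head are two separate (untied) matrices, as the count above implies; the variable names are illustrative.

```python
# Values taken from the config.json excerpt above.
vocab_size = 102400
hidden_size = 5120

# nn.Embedding holds one (vocab_size x hidden_size) weight matrix.
embed_params = vocab_size * hidden_size
# nn.Linear(hidden_size, vocab_size, bias=False) holds one weight matrix of the same size.
lm_head_params = hidden_size * vocab_size

embedding_total = embed_params + lm_head_params
print(f"{embedding_total:,}")  # 1,048,576,000
```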

2. MLA: 149225472 per layer (omitting RMSNorm weights and other small terms)

```{figure} ../images/mla-3x.svg
---
height: 600px
name: mla-3
---
Multi-head Latent Attention.
```

```python
class DeepseekV2Attention(nn.Module):

    def __init__(self, config: DeepseekV2Config, layer_idx: Optional[int] = None):
        super().__init__()
        self.config = config
        self.hidden_size = config.hidden_size
        self.num_heads = config.num_attention_heads
        
        self.q_lora_rank = config.q_lora_rank
        self.qk_rope_head_dim = config.qk_rope_head_dim
        self.kv_lora_rank = config.kv_lora_rank
        self.v_head_dim = config.v_head_dim
        self.qk_nope_head_dim = config.qk_nope_head_dim
        self.q_head_dim = config.qk_nope_head_dim + config.qk_rope_head_dim

        # 5120 * 1536
        self.q_a_proj = nn.Linear(
            self.hidden_size, config.q_lora_rank, bias=config.attention_bias
        )
        self.q_a_layernorm = DeepseekV2RMSNorm(config.q_lora_rank)
        # 1536 * 128 * (128 + 64)
        self.q_b_proj = nn.Linear(
            config.q_lora_rank, self.num_heads * self.q_head_dim, bias=False
        )

        # 5120 * (512 + 64)
        self.kv_a_proj_with_mqa = nn.Linear(
            self.hidden_size,
            config.kv_lora_rank + config.qk_rope_head_dim,
            bias=config.attention_bias,
        )
        self.kv_a_layernorm = DeepseekV2RMSNorm(config.kv_lora_rank)
        # 512 * 128 * (128 + 128)
        self.kv_b_proj = nn.Linear(
            config.kv_lora_rank,
            self.num_heads
            * (self.q_head_dim - self.qk_rope_head_dim + self.v_head_dim),
            bias=False,
        )

        # 128 * 128 * 5120
        self.o_proj = nn.Linear(
            self.num_heads * self.v_head_dim,
            self.hidden_size,
            bias=config.attention_bias,
        )
```
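
The per-layer MLA figure is the sum of the five projection matrices annotated in the code above. The sketch below reproduces it from the config values. It assumes `attention_bias` is false, so no bias terms are counted, and it ignores the RMSNorm weights, as the text does.

```python
# Values taken from the config.json excerpt above.
hidden_size = 5120
num_heads = 128
q_lora_rank = 1536
kv_lora_rank = 512
qk_nope_head_dim = 128
qk_rope_head_dim = 64
v_head_dim = 128

q_head_dim = qk_nope_head_dim + qk_rope_head_dim  # 192

# Weight count of each linear projection (bias terms assumed absent).
mla_params = {
    "q_a_proj": hidden_size * q_lora_rank,
    "q_b_proj": q_lora_rank * num_heads * q_head_dim,
    "kv_a_proj_with_mqa": hidden_size * (kv_lora_rank + qk_rope_head_dim),
    "kv_b_proj": kv_lora_rank * num_heads * (qk_nope_head_dim + v_head_dim),
    "o_proj": num_heads * v_head_dim * hidden_size,
}

for name, count in mla_params.items():
    print(f"{name:>20}: {count:>12,}")

mla_total = sum(mla_params.values())
print(f"{'total':>20}: {mla_total:>12,}")  # 149,225,472
```

The output projection `o_proj` is the largest term (83886080), more than half of the layer's attention parameters.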

3. MoE:
    * first layer (dense FFN): 188743680
    * each other layer, total parameters: 3822059520 (2 shared + 160 routed experts)
    * each other layer, activated parameters: 188743680 (2 shared + 6 routed experts)

```python
class DeepseekV2MLP(nn.Module):
    def __init__(self, config, hidden_size=None, intermediate_size=None):
        super().__init__()
        self.config = config
        self.hidden_size = config.hidden_size if hidden_size is None else hidden_size
        self.intermediate_size = (
            config.intermediate_size if intermediate_size is None else intermediate_size
        )

        # 5120 * intermediate_size * 3
        # intermediate_size = 12288 if layer_idx=0 else 1536
        self.gate_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=False)
        self.up_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=False)
        self.down_proj = nn.Linear(self.intermediate_size, self.hidden_size, bias=False)
```
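
The three FFN figures above follow from the fact that each MLP has three weight matrices of size `hidden_size * intermediate_size`. The sketch below reproduces them. It counts only the expert MLPs and leaves out any router (gating) weights, which matches the figures in the text; the helper name `mlp_params` is illustrative.

```python
# Values taken from the config.json excerpt above.
hidden_size = 5120
dense_intermediate_size = 12288  # "intermediate_size", used by the first (dense) layer
moe_intermediate_size = 1536     # per-expert intermediate size
n_shared_experts = 2
n_routed_experts = 160
num_experts_per_tok = 6


def mlp_params(hidden: int, intermediate: int) -> int:
    """Weights in gate_proj + up_proj + down_proj, all without bias."""
    return 3 * hidden * intermediate


first_layer = mlp_params(hidden_size, dense_intermediate_size)
per_expert = mlp_params(hidden_size, moe_intermediate_size)

# Every MoE layer stores all shared and routed experts ...
moe_total = (n_shared_experts + n_routed_experts) * per_expert
# ... but each token only passes through the shared experts plus its top-6 routed experts.
moe_activated = (n_shared_experts + num_experts_per_tok) * per_expert

print(f"first layer:   {first_layer:,}")    # 188,743,680
print(f"MoE total:     {moe_total:,}")      # 3,822,059,520
print(f"MoE activated: {moe_activated:,}")  # 188,743,680
```

The activated MoE count equals the dense first layer because 8 experts × 1536 = 12288, the dense intermediate size.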

Putting the pieces together (60 layers, of which the first uses a dense FFN and the other 59 use MoE):
* total parameters: 235692359680
* activated parameters: 21326725120
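
The two totals can be checked by combining the per-component counts listed in steps 1 to 3. This is a small sketch using those numbers as constants; the variable names are illustrative, and norm weights and router parameters are left out, as in the steps above.

```python
num_layers = 60
first_k_dense_replace = 1  # number of leading layers that keep a dense FFN

# Per-component counts from steps 1-3.
embedding = 1_048_576_000
mla_per_layer = 149_225_472
dense_ffn = 188_743_680
moe_total_per_layer = 3_822_059_520
moe_active_per_layer = 188_743_680

num_moe_layers = num_layers - first_k_dense_replace  # 59

# Parameters that every token uses regardless of routing.
always_used = embedding + num_layers * mla_per_layer + first_k_dense_replace * dense_ffn

total_params = always_used + num_moe_layers * moe_total_per_layer
activated_params = always_used + num_moe_layers * moe_active_per_layer

print(f"total:     {total_params:,}")      # 235,692,359,680
print(f"activated: {activated_params:,}")  # 21,326,725,120
```

Rounded, these are the 236B total and 21B activated parameters quoted for DeepSeek-V2.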

### Evaluation Results

```{figure} ../images/deepseek-v2-bench1.png
```

## Alignment

### Supervised Fine-Tuning

We curate our instruction tuning datasets
to include 1.5M instances, comprising 1.2M instances for helpfulness and 0.3M instances for
safety. In comparison to the initial version, we improve the data quality to mitigate hallucinatory
responses and enhance writing proficiency. 

We fine-tune DeepSeek-V2 for 2 epochs, with
the learning rate set to $5 \times 10^{-6}$.

### Reinforcement Learning

In order to further unlock the potential of DeepSeek-V2 and align it with human preference, we
conduct Reinforcement Learning (RL) to adjust its preference.

**Reinforcement Learning Algorithm.** [](grpo)

**Training Strategy.** In our preliminary experiments, we find that the RL training on reasoning
data, such as code and math prompts, exhibits unique characteristics that are distinct from the
training on general data. For example, `the mathematical and coding abilities of our model can
keep improving over a longer period of training steps`. Therefore, we employ a two-stage RL
training strategy, which first performs reasoning alignment, and then performs human preference
alignment. In the first reasoning alignment stage, we train a reward model $RM_{reasoning}$ for
code and math reasoning tasks, and optimize the policy model with the feedback of $RM_{reasoning}$:

$$r_{i} = RM_{reasoning}(o_{i}).$$

In the second human preference alignment stage, we adopt a multi-reward framework, which
acquires rewards from a helpful reward model $RM_{helpful}$, a safety reward model $RM_{safety}$, and a
rule-based reward model $RM_{rule}$. The final reward of a response $o_{i}$ is

$$r_{i} = c_{1}\cdot RM_{helpful}(o_{i}) + c_{2}\cdot RM_{safety}(o_{i}) + c_{3}\cdot RM_{rule}(o_{i}).$$
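
The weighted sum above is straightforward to express in code. The sketch below is illustrative only: the source does not give values for $c_{1}$, $c_{2}$, $c_{3}$, so the weights and the three reward scores used here are hypothetical placeholders, as are the names `RewardWeights` and `final_reward`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RewardWeights:
    c1: float  # weight on the helpfulness reward
    c2: float  # weight on the safety reward
    c3: float  # weight on the rule-based reward


def final_reward(rm_helpful: float, rm_safety: float, rm_rule: float,
                 w: RewardWeights) -> float:
    """r_i = c1 * RM_helpful(o_i) + c2 * RM_safety(o_i) + c3 * RM_rule(o_i)."""
    return w.c1 * rm_helpful + w.c2 * rm_safety + w.c3 * rm_rule


# Hypothetical weights and scores for one response o_i (not from the source).
weights = RewardWeights(c1=0.5, c2=0.3, c3=0.2)
r_i = final_reward(rm_helpful=1.0, rm_safety=0.5, rm_rule=0.0, w=weights)
print(round(r_i, 2))  # 0.65
```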

```{tip}
In order to obtain reliable reward models that play crucial roles in the RL training, we
carefully collect preference data, and meticulously conduct `quality filtering` and `proportion
adjustments`. We obtain code preference data based on `compiler-feedback`, and mathematical
preference data based on the ground-truth labels. For reward model training, we initialize
the reward models with DeepSeek-V2 Chat (SFT) and train them with either a point-wise or
a pair-wise loss.
```
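
The source does not spell out the exact form of the point-wise and pair-wise losses. As a rough illustration, the sketch below uses two common choices, both assumptions rather than the authors' formulation: a Bradley-Terry style loss on (chosen, rejected) score pairs, and a squared error against a scalar label.

```python
import math


def pairwise_loss(r_chosen: float, r_rejected: float) -> float:
    """Assumed pair-wise loss: -log(sigmoid(r_chosen - r_rejected))."""
    diff = r_chosen - r_rejected
    # Equivalent to softplus(-diff), split by sign to avoid overflow in exp().
    if diff >= 0:
        return math.log1p(math.exp(-diff))
    return -diff + math.log1p(math.exp(diff))


def pointwise_loss(r_pred: float, label: float) -> float:
    """Assumed point-wise loss: squared error against a scalar label."""
    return (r_pred - label) ** 2


# The pair-wise loss is small when the chosen response scores higher ...
print(pairwise_loss(2.0, 0.0) < pairwise_loss(0.0, 2.0))  # True
# ... and the point-wise loss is zero when the prediction matches the label.
print(pointwise_loss(1.0, 1.0))  # 0.0
```

A pair-wise loss fits preference pairs, while a point-wise loss fits data with an absolute label, such as whether a math answer matches the ground truth.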

### Evaluation Results

```{figure} ../images/deepseek-v2-bench2.png
```

### Discussion

**Amount of SFT Data.** Whether a large SFT corpus is necessary has been
a topic of intense debate. Previous works argue that fewer
than 10K instances of SFT data are enough to produce satisfactory results. However, in our
experiments, we observe a significant performance decline on the IFEval benchmark if we use
fewer than 10K instances. Moreover, the quality
of SFT data is also crucial.

**Online Reinforcement Learning.** In our preference alignment experiments, we find that the
online approach significantly outperforms the offline approach. Therefore, we invest tremendous
effort in implementing an online RL framework for aligning DeepSeek-V2. The conclusion
about online or offline preference alignment can vary in different contexts, and we reserve a
more thorough comparison and analysis between them for future work.

## Takeaway

```{note}
SFT:
* a sufficiently large, high-quality corpus is needed (fewer than 10K instances hurts IFEval performance)

RM:
* first performs reasoning alignment
* then performs human preference alignment
* obtain code preference data based on compiler-feedback, and mathematical preference data based on the ground-truth labels

RL:
* online > offline
```