# DeepSeek-Coder-V2

```{note}
DeepSeek-Coder-V2{cite}`deepseekai2024deepseekcoderv2breakingbarrierclosedsource` is further pre-trained from an intermediate checkpoint of DeepSeek-V2{cite}`deepseekai2024deepseekv2strongeconomicalefficient`
with an additional 6 trillion tokens.<br>
In standard benchmark evaluations, DeepSeek-Coder-V2 achieves
superior performance compared to closed-source models such as GPT4-Turbo, Claude 3 Opus,
and Gemini 1.5 Pro in coding and math benchmarks.
```

```{figure} ../images/deepseek-coder-v2-1.png
---
height: 400px
name: ds-coder-v2-1
---
```

## Data Collection

The pre-training data for DeepSeek-Coder-V2 primarily consists of 60% source code, 10% math
corpus, and 30% natural language corpus. Since the natural language corpus is directly sampled
from the training dataset of DeepSeek-V2, this section focuses on the collection, cleaning, and
filtering processes of the code and math data.

We collect public repositories created before November 2023 on GitHub. We first apply the
same filtering rules and near-deduplication as those used in DeepSeek-Coder{cite}`guo2024deepseekcoderlargelanguagemodel`.
After this step, we obtain 821B code encompassing 338
programming languages and 185B code-related text, such as markdown and issues.
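To make "near-deduplication" concrete, here is a minimal sketch of the general idea: two files count as near-duplicates when their sets of token shingles have high Jaccard similarity. The function names, the whitespace tokenizer, the shingle size `k`, and the `threshold` value are all assumptions chosen for illustration; they are not taken from the DeepSeek-Coder pipeline. Production pipelines typically approximate this comparison with MinHash and locality-sensitive hashing rather than comparing every pair.

```python
def shingles(text: str, k: int = 5) -> set[str]:
    """Return the set of k-token shingles of a whitespace-tokenized text."""
    tokens = text.split()
    if len(tokens) < k:
        # Short files become a single shingle.
        return {" ".join(tokens)}
    return {" ".join(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}


def jaccard(a: set[str], b: set[str]) -> float:
    """Jaccard similarity |a & b| / |a | b| (1.0 for two empty sets)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def near_dedup(files: list[str], threshold: float = 0.85, k: int = 5) -> list[str]:
    """Keep a file only if it is not too similar to any file kept so far.

    `threshold` is a hypothetical value, not one reported in the source.
    """
    kept: list[str] = []
    kept_shingles: list[set[str]] = []
    for text in files:
        s = shingles(text, k)
        if all(jaccard(s, other) < threshold for other in kept_shingles):
            kept.append(text)
            kept_shingles.append(s)
    return kept


corpus = [
    "def add(a, b): return a + b",
    "def add(a, b): return a + b",  # exact copy, dropped
    "print('hello world')",
]
print(len(near_dedup(corpus)))
```

This exact pairwise version is quadratic in the number of files, so it only serves to show the criterion, not how to run it at repository scale.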

To collect code-related and math-related web texts from Common Crawl, we follow the
same pipeline as DeepSeekMath{cite}`shao2024deepseekmathpushinglimitsmathematical`.

## Alignment

### Supervised Fine-Tuning

To build DeepSeek-Coder-V2 Chat, we construct an instruction training dataset that mixes
code and math data. We first collect 20k code-related instruction samples and 30k math-related samples
from DeepSeek-Coder and DeepSeek-Math. To maintain general ability, we also sample
some data from the instruction data of DeepSeek-V2. The final instruction dataset contains
300M tokens.

### Reinforcement Learning

**Prompts** Considerable effort was spent collecting prompts related to code and math from
various sources, and `each code prompt comes with corresponding test cases`. After filtering,
approximately 40k prompts remain.

**Reward Modeling** Reward models play a crucial role in RL training. For mathematical
preference data, we obtain it using the ground-truth labels. For code preference
data, the code compiler itself can already provide 0-1 feedback (whether the code passes
all test cases or not). However, some code prompts have only a limited number of test cases that do not
provide full coverage, so directly using the 0-1 feedback from the compiler may be noisy
and sub-optimal. Therefore, we still train a reward model on the data provided by the
`compiler` and use the reward model to provide the signal during RL training; this is more robust and generalizes better than the raw compiler signal. As illustrated in
Figure 2, on our in-house test sets (Leetcode and Leetcode-zh), using a reward model to provide
the RL training signal clearly outperforms using the raw compiler signal.
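The noise argument can be illustrated with a toy example of 0-1 feedback. This is a minimal sketch, assuming candidates are plain Python callables and test cases are `(arguments, expected)` pairs; `binary_reward`, `buggy_abs`, and the test lists are hypothetical names invented here, not part of the paper's setup.

```python
from typing import Any, Callable, Sequence

TestCase = tuple[tuple[Any, ...], Any]  # (arguments, expected output)


def binary_reward(candidate: Callable[..., Any], tests: Sequence[TestCase]) -> float:
    """Return 1.0 if the candidate passes every test case, else 0.0."""
    for args, expected in tests:
        try:
            if candidate(*args) != expected:
                return 0.0
        except Exception:
            # A crash counts as a failed test.
            return 0.0
    return 1.0


def correct_abs(x: int) -> int:
    return x if x >= 0 else -x


def buggy_abs(x: int) -> int:
    return x  # wrong for negative inputs


# Sparse tests never exercise a negative input.
sparse_tests: list[TestCase] = [((3,), 3), ((0,), 0)]
fuller_tests: list[TestCase] = sparse_tests + [((-4,), 4)]

print(binary_reward(buggy_abs, sparse_tests))    # 1.0: the bug goes unnoticed
print(binary_reward(buggy_abs, fuller_tests))    # 0.0
print(binary_reward(correct_abs, fuller_tests))  # 1.0
```

With sparse tests the buggy solution receives the same reward as a correct one, which is the kind of false positive that makes the raw signal noisy.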

**Reinforcement Learning Algorithm** We employ Group Relative Policy Optimization[](grpo) as our RL algorithm, the same one DeepSeek-V2 uses. Notably,
GRPO has proven quite effective and costs less than PPO, since there is no
need to maintain an additional critic model.
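The reason no critic is needed can be sketched in a few lines: for each prompt, GRPO samples a group of responses and uses the group's own reward statistics as the baseline. The snippet below is a minimal sketch of that advantage computation only, assuming one scalar reward per response; the function name, the use of the population standard deviation, and the `eps` stabilizer are assumptions made here, and the clipped policy objective and KL penalty are omitted.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """Standardize each reward against its own group: (r - mean) / (std + eps).

    The group mean plays the role that a learned critic's value
    estimate plays in PPO. `eps` (an assumption) avoids division by
    zero when every response in the group gets the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Rewards for four sampled responses to the same prompt.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
print(advantages)
```

Responses that score above the group average get a positive advantage and those below get a negative one; when all responses score the same, every advantage is zero and the group contributes no gradient signal.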

```{figure} ../images/deepseek-coder-v2-2.png
---
height: 300px
name: ds-coder-v2-2
---
```

## Experimental Results

```{figure} ../images/deepseek-coder-v2-bench1.png
```
```{figure} ../images/deepseek-coder-v2-bench2.png
```
```{figure} ../images/deepseek-coder-v2-bench3.png
```

## Takeaway

```{note}
* Considerable effort was spent collecting prompts related to code and math from various sources, and each code prompt comes with corresponding test cases.
* Train a reward model on the data provided by the compiler, and use the reward model to provide the signal during RL training.
```