# Distill Reasoning Data from DeepSeek-R1

In the field of LLMs, reasoning models leverage deep thinking capabilities to improve performance on complex tasks. According to the [DeepSeek-R1](https://arxiv.org/abs/2501.12948) paper, the reasoning patterns of larger models can be distilled into smaller models. Specifically, we can distill long-chain-of-thought (long-CoT) data that includes reasoning processes from DeepSeek-R1 and directly fine-tune open-source models like Qwen and Llama on it. This straightforward distillation method significantly enhances the reasoning abilities of smaller models.


To demonstrate the complete distillation process, we have prepared three notebooks that cover how to distill reasoning data from DeepSeek-R1 using the NIM API, how to train models using the distilled data, and how to evaluate the model.


- [1.generate_reasoning_data.ipynb](./1.generate_reasoning_data.ipynb) (⭐) demonstrates how to distill reasoning data from DeepSeek-R1 using the NIM API. 
- [2.qwen2_distill_nemo.ipynb](./2.qwen2_distill_nemo.ipynb) shows how to train open-source models using the distilled data.
- [3.evaluation.ipynb](./3.evaluation.ipynb) shows how to evaluate the model.


This notebook is part 1 of the series and demonstrates how to distill reasoning data from the DeepSeek-R1 model using NVIDIA NIM.

Prerequisites:
- Obtain an NVIDIA API Key (visit [build.nvidia.com](https://build.nvidia.com/explore/discover) for details)

This notebook contains three steps:
1. Prepare the raw dataset
2. Distill reasoning data from DeepSeek-R1 using NVIDIA NIM API
3. Post-process the distilled data

In [None]:
%pip install openai math_verify datasets

In [None]:
%env NVIDIA_API_KEY=nvapi-xxxxxxxxxx

## Step 1: Prepare Dataset

During the training process of DeepSeek-R1-Zero, DeepSeek mentioned they used data from math, code, science, and logic domains. However, since they haven't disclosed the specific data sources, we will use open-source datasets as examples.

In the following code, we will use the [open-r1/OpenR1-Math-220k](https://huggingface.co/datasets/open-r1/OpenR1-Math-220k) dataset from Hugging Face.

You can also create your own dataset, but it's best to align with the example dataset's format, ensuring each entry contains both a `problem` and an `answer` field, since the code below reads those two keys.
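As a minimal sketch of that format, the snippet below writes a small JSON Lines file whose records carry a `problem` and an `answer` field, skipping any record that lacks one of them. The file name `my_dataset.jsonl`, the helpers `validate_record` and `write_jsonl`, and the sample problems are hypothetical and only serve as illustration. A file like this can typically be loaded with `load_dataset("json", data_files=path, split="train")`.

```python
import json
import os
import tempfile

# Field names read by the code in this notebook
REQUIRED_FIELDS = ("problem", "answer")


def validate_record(record):
    """Return True if every required field is a non-empty string."""
    return all(
        isinstance(record.get(field), str) and bool(record[field].strip())
        for field in REQUIRED_FIELDS
    )


def write_jsonl(records, path):
    """Write the valid records to a JSON Lines file and return how many were written."""
    written = 0
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            if validate_record(record):
                f.write(json.dumps(record, ensure_ascii=False) + "\n")
                written += 1
    return written


# Hypothetical examples, for illustration only
records = [
    {"problem": "What is $2 + 3$?", "answer": "5"},
    {"problem": "Solve $x^2 = 9$ for $x > 0$.", "answer": "3"},
    {"problem": "A record without an answer is skipped."},
]

path = os.path.join(tempfile.gettempdir(), "my_dataset.jsonl")
num_written = write_jsonl(records, path)
print(f"Wrote {num_written} records to {path}")
```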

In [3]:
from datasets import load_dataset

dataset = load_dataset("open-r1/OpenR1-Math-220k", split="train")

print(f"Dataset size: {len(dataset)}")

# Print the first few examples
for i, example in enumerate(dataset.select(range(3))):
    print(f"===== Problem {i+1} =====")
    print(example["problem"])
    print(f"===== Answer {i+1} =====")
    print(example["answer"])
    print("\n")

Generating train split: 100%|██████████| 93733/93733 [00:15<00:00, 6142.21 examples/s]

Dataset size: 93733
===== Problem 1 =====
## Task B-1.3.

A ship traveling along a river has covered $24 \mathrm{~km}$ upstream and $28 \mathrm{~km}$ downstream. For this journey, it took half an hour less than for traveling $30 \mathrm{~km}$ upstream and $21 \mathrm{~km}$ downstream, or half an hour more than for traveling $15 \mathrm{~km}$ upstream and $42 \mathrm{~km}$ downstream, assuming that both the ship and the river move uniformly.

Determine the speed of the ship in still water and the speed of the river.
===== Answer 1 =====
v_{R}=4\mathrm{~}/\mathrm{},v_{B}=10\mathrm{~}/\mathrm{}


===== Problem 2 =====
3. (6 points) A construction company was building a tunnel. When $\frac{1}{3}$ of the tunnel was completed at the original speed, they started using new equipment, which increased the construction speed by $20 \%$ and reduced the working hours to $80 \%$ of the original. As a result, it took a total of 185 days to complete the tunnel. If they had not used the new equipment a




## Step 2: Distill Reasoning Data from DeepSeek-R1 Using NVIDIA NIM API

DeepSeek recommends adhering to the following configurations when running inference with the DeepSeek-R1 series of models, including benchmarking, to achieve the expected performance:

- Set the temperature within the range of 0.5-0.7 (0.6 is recommended) to prevent endless repetitions or incoherent outputs.
- Avoid adding a system prompt; all instructions should be contained within the user prompt.
- For mathematical problems, it is advisable to include a directive in your prompt such as: "Please reason step by step, and put your final answer within \boxed{}."
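
The recommendations above can be captured in a small helper that builds the request arguments. This is only a sketch: `build_request` is a hypothetical name, the model identifier is the one used later in this notebook, and the range check simply mirrors the 0.5-0.7 guidance.

```python
PROMPT_TEMPLATE = (
    "Please reason step by step, and put your final answer within \\boxed{{}}. {problem}"
)


def build_request(problem, temperature=0.6):
    """Build chat-completion keyword arguments that follow the recommendations above."""
    if not 0.5 <= temperature <= 0.7:
        raise ValueError("temperature should stay within the recommended 0.5-0.7 range")
    return {
        "model": "deepseek-ai/deepseek-r1",
        # No system message: every instruction goes into the single user turn
        "messages": [
            {"role": "user", "content": PROMPT_TEMPLATE.format(problem=problem)}
        ],
        "temperature": temperature,
    }


request = build_request("What is 7 * 8?")
print(request["messages"][0]["content"])
```

The resulting dictionary could then be unpacked into an OpenAI-compatible client call, for example `client.chat.completions.create(**request)`.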

This cell configures the NIM client and runs a basic distillation test. 

In [4]:
import os
from openai import OpenAI

client = OpenAI(
  base_url = "https://integrate.api.nvidia.com/v1",
  api_key = os.getenv("NVIDIA_API_KEY")
)

In [5]:
# A simple test case
problem = "which number is larger, 9.11 or 9.8?"
completion = client.chat.completions.create(
    model="deepseek-ai/deepseek-r1",
    messages=[{"role": "user", "content": f"Please reason step by step, and put your final answer within \\boxed{{}}. {problem}"}],
    temperature=0.6,
    top_p=0.7,
    max_tokens=32768,
    timeout=1000,
    stream=True
)

for chunk in completion:
    if chunk.choices[0].delta.content is not None:
        print(chunk.choices[0].delta.content, end="")

<think>
Okay, so I need to figure out which number is larger between 9.11 and 9.8. Let me start by writing both numbers down to see them clearly: 9.11 and 9.8. Hmm, both are decimals, but they have a different number of digits after the decimal point. Maybe I should compare them place by place.

First, let me look at the whole number part. Both numbers have 9 as the whole number, so that part is equal. Now, moving on to the decimal parts. For 9.11, the first decimal place is 1, and for 9.8, the first decimal place is 8. Wait, but 9.8 is the same as 9.80, right? Because adding a zero at the end of a decimal doesn't change its value. So if I rewrite 9.8 as 9.80, then both numbers have two decimal places, which might make it easier to compare.

So now we have 9.11 and 9.80. Comparing the tenths place: 1 versus 8. Since 1 is less than 8, does that mean 9.11 is less than 9.80? Let me check that again. The tenths place is the first digit after the decimal, which represents tenths. So 0.1 ver

Now, we're ready to generate reasoning traces using DeepSeek-R1 for the entire dataset.

In [8]:
# The prompt template recommended by DeepSeek for math problems
PROMPT_TEMPLATE = "Please reason step by step, and put your final answer within \\boxed{{}}. {problem}"

def process_streaming_response(completion):
    """Process the streaming response from the R1 model"""
    reasoning_trace = ""
    try:
        for chunk in completion:
            if chunk.choices[0].delta.content is not None:
                reasoning_trace += chunk.choices[0].delta.content
        return reasoning_trace
    except Exception as e:
        print(f"Error occurred: {e}")
        return reasoning_trace

def distill_data_from_r1(example):
    problem = example["problem"]
    completion = client.chat.completions.create(
        model="deepseek-ai/deepseek-r1",
        messages=[{"role": "user", "content": PROMPT_TEMPLATE.format(problem=problem)}],
        temperature=0.6,
        top_p=0.7,
        max_tokens=32768,
        timeout=10000,
        stream=True
    )
    
    reasoning_trace = process_streaming_response(completion)
    return {**example, "reasoning_trace": reasoning_trace}

In [None]:
# To speed up the process, we only use 2 examples here
sample_dataset = dataset.select(range(2))

# You can set num_proc to speed up the process
sample_dataset = sample_dataset.map(distill_data_from_r1, num_proc=1, desc="Distilling reasoning traces from R1")

In [None]:
sample_dataset['reasoning_trace']

## Step 3: Post-Process Distilled Data

After generating data, we should filter out any low-quality reasoning data. We can establish some filtering rules, such as:
- Whether the language in the reasoning trace meets requirements
- Whether the reasoning trace format is correct, i.e., wrapping the thinking process in `<think></think>` tags before giving the final answer
- Whether the answer given in the reasoning trace is correct
- Other filtering rules mentioned in the R1 paper
    - Long paragraphs
    - Code blocks

In this tutorial, we will only verify the format and the correctness of the answers.
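
The remaining rules from the list are not implemented in this notebook. As a rough sketch, they could be approximated with simple heuristics like the ones below. The thresholds `MAX_PARAGRAPH_CHARS` and `MIN_ASCII_RATIO`, the ASCII-ratio language check, and all function names are assumptions chosen for illustration, not the criteria used in the R1 paper.

```python
import re

MAX_PARAGRAPH_CHARS = 2000  # hypothetical threshold for a "long paragraph"
MIN_ASCII_RATIO = 0.9       # hypothetical threshold for "mostly English" text
FENCE = "`" * 3             # Markdown code-fence marker


def has_long_paragraph(trace, max_chars=MAX_PARAGRAPH_CHARS):
    """Return True if any blank-line-separated paragraph exceeds max_chars."""
    paragraphs = re.split(r"\n\s*\n", trace)
    return any(len(p) > max_chars for p in paragraphs)


def has_code_block(trace):
    """Return True if the trace contains a Markdown code fence."""
    return FENCE in trace


def is_mostly_ascii(trace, min_ratio=MIN_ASCII_RATIO):
    """Crude language heuristic: share of ASCII characters in the trace."""
    if not trace:
        return False
    ascii_chars = sum(1 for ch in trace if ord(ch) < 128)
    return ascii_chars / len(trace) >= min_ratio


def extra_filter_reason(trace):
    """Return the first rule the trace violates, or None if it passes all of them."""
    if not is_mostly_ascii(trace):
        return "LANGUAGE"
    if has_long_paragraph(trace):
        return "LONG_PARAGRAPH"
    if has_code_block(trace):
        return "CODE_BLOCK"
    return None


print(extra_filter_reason("<think>2 + 2 = 4</think> The answer is \\boxed{4}."))
```

A production pipeline would likely replace the ASCII ratio with a proper language-identification model, since mathematical text often contains non-ASCII symbols.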

In [15]:
import re
from latex2sympy2_extended import NormalizationConfig
from math_verify import LatexExtractionConfig, parse, verify


def check_format(reasoning_trace):
    pattern = r"^<think>.*?</think>"
    if not re.match(pattern, reasoning_trace, re.DOTALL | re.MULTILINE):
        return False
    # check if all tags only appear once
    tags = ["<think>", "</think>"]
    for tag in tags:
        if reasoning_trace.count(tag) != 1:
            return False
    return True

# We use math_verify to check if the answer is mathematically equivalent to the ground truth
def calculate_answer(reasoning_trace, ground_truth):
    """Check if the answer is the same as the ground truth."""
    answer_parsed = parse(
        reasoning_trace,
        extraction_config=[
            LatexExtractionConfig(
                normalization_config=NormalizationConfig(
                    nits=False,
                    malformed_operators=False,
                    basic_latex=True,
                    equations=True,
                    boxed=True,
                    units=True,
                ),
                # Ensures that boxed is tried first
                boxed_match_priority=0,
                try_extract_without_anchor=False,
            )
        ],
        extraction_mode="first_match",
    )

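    # Parse the ground truth as well, so that verify() compares two parsed
    # expressions rather than a parsed answer against a raw string.
    # Wrapping in "$...$" marks the bare answer as LaTeX; verify() takes the gold answer first.
    gold_parsed = parse(f"${ground_truth}$")
    return verify(gold_parsed, answer_parsed)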

def filter_reasoning_trace(example):
    reasoning_trace = example["reasoning_trace"]
    ground_truth = example["answer"]
    if not check_format(reasoning_trace):
        return {**example, "filtered": True, "filtered_reason": "INVALID_FORMAT"}
    if not calculate_answer(reasoning_trace, ground_truth):
        return {**example, "filtered": True, "filtered_reason": "INCORRECT_ANSWER"}
    return {**example, "filtered": False, "filtered_reason": "VALID"}

In [None]:
sample_dataset = sample_dataset.map(filter_reasoning_trace, desc="Filtering reasoning traces")

# filter out the invalid reasoning traces
filtered_dataset = sample_dataset.filter(lambda x: not x["filtered"])

# save the filtered dataset
filtered_dataset.save_to_disk("filtered_dataset")

## Next Steps

Due to the randomness of the reasoning process, we can run the above process multiple times to generate multiple reasoning traces for each question. Then, we can apply quality filtering to construct the distilled dataset.
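
A minimal sketch of that sampling loop is shown below. To stay self-contained it uses stand-in functions: `fake_generate` and `fake_is_valid` are hypothetical placeholders for a real call to DeepSeek-R1 and for a format and correctness filter, and `collect_valid_traces` is an illustrative name.

```python
import itertools


def collect_valid_traces(example, generate_fn, is_valid_fn, num_samples=4):
    """Sample several traces for one example and keep those that pass the filter."""
    valid = []
    for _ in range(num_samples):
        trace = generate_fn(example["problem"])
        if is_valid_fn(trace, example["answer"]):
            valid.append({**example, "reasoning_trace": trace})
    return valid


# Stand-in generator that cycles through canned outputs instead of calling the API
_fake_outputs = itertools.cycle(
    [
        "<think>2 + 2 = 4</think> \\boxed{4}",  # correct and well formed
        "<think>2 + 2 = 5</think> \\boxed{5}",  # wrong answer
        "no think tags \\boxed{4}",             # invalid format
    ]
)


def fake_generate(problem):
    return next(_fake_outputs)


def fake_is_valid(trace, answer):
    """Stand-in filter: trace starts with <think> and boxes the expected answer."""
    return trace.startswith("<think>") and f"\\boxed{{{answer}}}" in trace


example = {"problem": "What is 2 + 2?", "answer": "4"}
traces = collect_valid_traces(example, fake_generate, fake_is_valid, num_samples=4)
print(f"Kept {len(traces)} of 4 sampled traces")
```

Keeping several valid traces per question increases the size of the distilled dataset, at the cost of proportionally more API calls.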

After collecting the distilled dataset, you can refer to [the Qwen2 distillation notebook](./2.qwen2_distill_nemo.ipynb) to train your model using this dataset.