# 🚀 Customize and Deploy `deepseek-ai/DeepSeek-R1-0528` on Amazon SageMaker AI

In this notebook, we explore **DeepSeek-R1-0528**, a cutting-edge reasoning model from DeepSeek AI. You'll learn how to fine-tune it on reasoning datasets, evaluate its mathematical and logical capabilities, and deploy it using SageMaker for advanced reasoning tasks.

## What is DeepSeek-R1-0528?
DeepSeek-R1-0528 is a release in DeepSeek's R1 series of reasoning models. It is trained to work through complex mathematical, logical, and analytical problems step by step, writing out its chain of thought before giving a final answer.  
🔗 Model card: [deepseek-ai/DeepSeek-R1-0528 on Hugging Face](https://huggingface.co/deepseek-ai/DeepSeek-R1-0528)

## Key Specifications
| Feature | Details |
|---|---|
| **Parameters** | 671B total (Mixture-of-Experts), roughly 37B activated per token |
| **Architecture** | Mixture-of-Experts Transformer built on DeepSeek-V3-Base |
| **Context Length** | Long context window for extended reasoning chains (see the model card for the exact limit) |
| **Modalities** | Text-in / Text-out with focus on reasoning tasks |
| **License** | MIT; check the model card for the full terms |
| **Release Date** | May 28, 2025 (the `0528` suffix) |

## Benchmarks & Behavior
- Exceptional performance on **mathematical reasoning, logical inference, and complex problem-solving** benchmarks.  
- Designed to excel at **multi-step reasoning tasks** with clear chain-of-thought capabilities.  
- Strong performance on competition mathematics, coding challenges, and analytical reasoning tasks.  
- Optimized for **step-by-step problem decomposition** and systematic solution approaches.  

## Using This Notebook
In this notebook, you will:
* Load the NuminaMath-CoT reasoning dataset from Hugging Face and prepare it for fine-tuning  
* Fine-tune with SageMaker Training Jobs using reasoning-optimized configurations  
* Run model evaluation on mathematical reasoning benchmarks  
* Deploy to SageMaker Endpoints for production reasoning tasks  

Let's begin by exploring `deepseek-ai/DeepSeek-R1-0528` and testing its baseline reasoning performance with mathematical problems.


In [1]:
%pip install -Uq sagemaker datasets



In [2]:
import boto3
import sagemaker
import time


In [3]:
region = boto3.Session().region_name
sess = sagemaker.Session(boto3.Session(region_name=region))

sagemaker_session_bucket = sess.default_bucket()
role = sagemaker.get_execution_role()

In [4]:
print(f"sagemaker role arn: {role}")
print(f"sagemaker bucket: {sagemaker_session_bucket}")
print(f"sagemaker session region: {sess.boto_region_name}")

### [NuminaMath-CoT](https://huggingface.co/datasets/AI-MO/NuminaMath-CoT)

**NuminaMath-CoT** is a large-scale dataset of roughly **860,000 math competition question-solution pairs** (859,494 in the train split), designed to support chain-of-thought reasoning in mathematical problem solving.

**Data Format & Structure**:
- Each example is a question followed by a solution; the solution is formatted with detailed **Chain-of-Thought (CoT)** reasoning.  
- The data sources include *Chinese high school math exercises*, *US and international mathematics competition problems*, *online test paper PDFs*, and *math discussion forums*.  
- Preprocessing includes OCR from PDFs, segmentation to extract problem-solution pairs, translation into English, alignment into CoT style, and formatting of final answers.  
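
The final-answer formatting mentioned above matters when scoring model outputs. The sample record printed later in this notebook ends its solution with `\boxed{...}`; assuming other records follow the same convention, the sketch below pulls out the last boxed expression with simple brace matching. `extract_boxed_answer` is a hypothetical helper introduced here for illustration, not part of the dataset tooling.

```python
def extract_boxed_answer(solution: str):
    """Return the contents of the last \\boxed{...} in a solution, or None.

    Hypothetical helper: assumes the final answer is wrapped in \\boxed{}.
    """
    marker = "\\boxed{"
    start = solution.rfind(marker)
    if start == -1:
        return None

    depth = 1          # we are inside the opening brace of \boxed{
    chars = []
    for ch in solution[start + len(marker):]:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                # Matching close brace of \boxed{ reached
                return "".join(chars)
        chars.append(ch)
    return None        # braces never balanced


sample = "Thus, the value of $y$ is $\\boxed{\\frac{13}{6}}$."
print(extract_boxed_answer(sample))
```

Brace counting is used instead of a regular expression because answers such as `\frac{13}{6}` contain nested braces.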

**License**: Released under the **Apache-2.0** license.  

**Applications**:

This dataset is useful for training and evaluating models on tasks including:  
- Complex math problem solving with reasoning steps (algebra, geometry, number theory, etc.)  
- Benchmarking chain-of-thought performance of LLMs on competition-level math tasks  
- Educational tools and tutoring systems that require explainable math solutions  
- Fine-tuning models to improve consistency, reasoning depth, and accuracy in mathematical domains  


In [3]:
import os
import json
import pprint
from tqdm import tqdm
from datasets import load_dataset

In [4]:
dataset_parent_path = os.path.join(os.getcwd(), "tmp_cache_local_dataset")
os.makedirs(dataset_parent_path, exist_ok=True)

**Preparing Your Dataset in `messages` format**

This section walks you through creating a conversation-style dataset (the required `messages` format) for training LLMs directly with SageMaker AI.

**What Is the `messages` Format?**

The `messages` format structures instances as chat-like exchanges, wrapping each conversation turn into a role-labeled JSON array. It's widely used by frameworks like TRL.
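
The role-labeled structure described above can be sanity-checked before training. The following is a minimal sketch, not a specification from TRL or SageMaker: `validate_messages`, the allowed role set, and the rules "a system turn may only come first" and "the last turn should be from the assistant" are assumptions chosen for single-turn supervised fine-tuning data.

```python
ALLOWED_ROLES = {"system", "user", "assistant"}  # assumed role vocabulary


def validate_messages(record: dict) -> list:
    """Return a list of problems found in one `messages` record (empty if none)."""
    messages = record.get("messages")
    if not isinstance(messages, list) or not messages:
        return ["'messages' must be a non-empty list"]

    problems = []
    for idx, msg in enumerate(messages):
        if not isinstance(msg, dict):
            problems.append(f"turn {idx}: not a JSON object")
            continue
        role = msg.get("role")
        content = msg.get("content")
        if role not in ALLOWED_ROLES:
            problems.append(f"turn {idx}: unexpected role {role!r}")
        if role == "system" and idx != 0:
            problems.append(f"turn {idx}: system turn is not first")
        if not isinstance(content, str) or not content.strip():
            problems.append(f"turn {idx}: content must be a non-empty string")

    # Assumption: training targets are assistant turns, so the record ends with one.
    last = messages[-1]
    if isinstance(last, dict) and last.get("role") != "assistant":
        problems.append("last turn is not from the assistant")
    return problems


good_record = {
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is 2 + 2?"},
        {"role": "assistant", "content": "4"},
    ]
}
print(validate_messages(good_record))
```

Returning a list of problems rather than raising on the first one makes it easier to report every issue in a record at once.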

Example entry:

```json
{
  "messages": [
    { "role": "system", "content": "You are a helpful assistant." },
    { "role": "user", "content": "How do I bake sourdough?" },
    { "role": "assistant", "content": "First, you need to create a starter by..." }
  ]
}
```


In [5]:
dataset_name = "AI-MO/NuminaMath-CoT"
dataset = load_dataset(dataset_name, split="train[:1000]")

Generating train split:   0%|          | 0/859494 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/100 [00:00<?, ? examples/s]

In [6]:
pprint.pp(dataset[0])

{'source': 'synthetic_math',
 'problem': 'Consider the terms of an arithmetic sequence: $-\\frac{1}{3}, '
            'y+2, 4y, \\ldots$. Solve for $y$.',
 'solution': 'For an arithmetic sequence, the difference between consecutive '
             'terms must be equal. Therefore, we can set up the following '
             'equations based on the sequence given:\n'
             '\\[ (y + 2) - \\left(-\\frac{1}{3}\\right) = 4y - (y+2) \\]\n'
             '\n'
             'Simplify and solve these equations:\n'
             '\\[ y + 2 + \\frac{1}{3} = 4y - y - 2 \\]\n'
             '\\[ y + \\frac{7}{3} = 3y - 2 \\]\n'
             '\\[ \\frac{7}{3} + 2 = 3y - y \\]\n'
             '\\[ \\frac{13}{3} = 2y \\]\n'
             '\\[ y = \\frac{13}{6} \\]\n'
             '\n'
             'Thus, the value of $y$ that satisfies the given arithmetic '
             'sequence is $\\boxed{\\frac{13}{6}}$.',
 'messages': [{'content': 'Consider the terms of an arithmetic sequence: '
                

In [7]:
print(f"total number of fine-tunable samples: {len(dataset)}")

total number of fine-tunable samples: 1000


In [8]:
def convert_to_messages_reasoning(row):
    """Convert one NuminaMath-CoT row into a system/user/assistant `messages` record."""
    system_content = "You are a mathematical reasoning assistant. Read the problem, restate the key givens and goal, then solve step-by-step with clear algebra (use LaTeX), keeping exact arithmetic (fractions/surds) and justifying each transformation (e.g., equal differences for arithmetic sequences). Verify any domain or extraneous-solution constraints, and present the final simplified answer concisely on the last line."

    # The dataset's own `messages` column is expected to hold the user turn first...
    messages_user_row = row["messages"][0]
    assert messages_user_row["role"] == "user", "user row unmatched"
    user_content = messages_user_row["content"]

    # ...followed by the assistant turn.
    messages_assistant_row = row["messages"][1]
    assert messages_assistant_row["role"] == "assistant", "assistant row unmatched"
    assistant_content = messages_assistant_row["content"]

    # Wrap the reference solution in <think> tags so it is trained as the reasoning trace.
    # Note: if the assistant turn repeats row["solution"], the text appears twice in the target.
    think_block = f"<think>{row['solution']}</think>"

    return {
        "messages": [
            {"role": "system", "content": system_content},
            {"role": "user", "content": user_content},
            {"role": "assistant", "content": f"{think_block}\n\n{assistant_content}"},
        ]
    }
    
    
dataset = dataset.map(convert_to_messages_reasoning, remove_columns=dataset.column_names)

Map:   0%|          | 0/1000 [00:00<?, ? examples/s]
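
The conversion above stores the assistant turn as `<think>...</think>`, a blank line, then the answer text. When inspecting training records or model generations later, it can help to separate the two parts again. The sketch below is one way to do that; `split_reasoning` is a hypothetical helper, and it assumes at most one leading `<think>` block per message.

```python
import re

# Non-greedy match of a leading <think> block, then everything after it.
THINK_PATTERN = re.compile(r"<think>(.*?)</think>\s*(.*)", re.DOTALL)


def split_reasoning(assistant_content: str):
    """Split '<think>reasoning</think>\\n\\nanswer' into (reasoning, answer).

    Returns (None, text) when no leading <think> block is present.
    """
    text = assistant_content.strip()
    match = THINK_PATTERN.match(text)
    if match is None:
        return None, text
    return match.group(1).strip(), match.group(2).strip()


reasoning, answer = split_reasoning("<think>2 + 2 = 4</think>\n\nThe answer is 4.")
print(reasoning)
print(answer)
```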

In [9]:
dataset_filename = os.path.join(dataset_parent_path, f"{dataset_name.replace('/', '--').replace('.', '-')}.jsonl")
dataset.to_json(dataset_filename, lines=True)

Creating json from Arrow format:   0%|          | 0/1 [00:00<?, ?ba/s]

3243521
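
The integer printed above is the return value of `to_json`, which appears to be the amount of data written. Before uploading the file for training, a quick read-back can confirm that every line parses as JSON and carries a `messages` list. This is a minimal sketch using only the standard library; `check_jsonl` is a hypothetical helper, and the demo runs on a throwaway file rather than the real dataset.

```python
import json
import os
import tempfile


def check_jsonl(path: str) -> int:
    """Count records in a JSONL file, raising if a line lacks a `messages` list."""
    count = 0
    with open(path, "r", encoding="utf-8") as fh:
        for line_number, line in enumerate(fh, start=1):
            if not line.strip():
                continue  # tolerate blank lines
            record = json.loads(line)
            if not isinstance(record.get("messages"), list):
                raise ValueError(f"line {line_number}: missing 'messages' list")
            count += 1
    return count


# Demo on a temporary two-record file.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo.jsonl")
    with open(demo_path, "w", encoding="utf-8") as fh:
        for text in ("first", "second"):
            row = {"messages": [{"role": "user", "content": text}]}
            fh.write(json.dumps(row) + "\n")
    demo_count = check_jsonl(demo_path)

print(demo_count)
```

In this notebook, calling `check_jsonl(dataset_filename)` would be expected to return 1000, matching the sample count printed earlier.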