* Notebook created by Nov05 on 2025-01-30 
* local env `huggingface_py311`   
* ⚠️ Important: Check [the AWS g6 series instance prices](https://aws.amazon.com/ec2/instance-types/g6/)  
* HuggingFace [deepseek-ai/DeepSeek-R1-Distill-Qwen-32B](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-32B)   

In [33]:
%pwd

'd:\\github\\deepseek-deployment'

## **👉 Set up Python environment**  

```bash
conda create --name huggingface_py311 python=3.11
conda activate huggingface_py311
conda install ipykernel
pip install pandas botocore boto3 sagemaker==2.238.0 --quiet --upgrade
conda list
```

> ValueError: Unsupported huggingface-llm version: 3.0.1. You may need to upgrade your SDK version (pip install -U 
sagemaker) for newer huggingface-llm versions. Supported huggingface-llm version(s): 0.6.0, 0.8.2, 0.9.3, 1.0.3, 
1.1.0, 1.2.0, 1.3.1, 1.3.3, 1.4.0, 1.4.2, 1.4.5, 2.0.0, 2.0.1, 2.0.2, 2.2.0, 2.3.1, 0.6, 0.8, 0.9, 1.0, 1.1, 1.2, 
1.3, 1.4, 2.0.
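
The error message already lists every version the installed SDK can resolve. Below is a minimal sketch, using plain `re` and no SageMaker import, that pulls those versions out of the message text and reports the newest one. The message string is copied from the error above. The helper names `supported_versions` and `as_tuple` are illustrative.

```python
import re

# Relevant sentence copied from the ValueError above.
ERROR_TEXT = (
    "Supported huggingface-llm version(s): 0.6.0, 0.8.2, 0.9.3, 1.0.3, 1.1.0, "
    "1.2.0, 1.3.1, 1.3.3, 1.4.0, 1.4.2, 1.4.5, 2.0.0, 2.0.1, 2.0.2, 2.2.0, "
    "2.3.1, 0.6, 0.8, 0.9, 1.0, 1.1, 1.2, 1.3, 1.4, 2.0."
)


def supported_versions(message: str) -> list:
    """Extract the dotted version strings listed after 'version(s):'."""
    _, _, tail = message.partition("version(s):")
    return re.findall(r"\d+(?:\.\d+)+", tail)


def as_tuple(version: str) -> tuple:
    """Turn '2.3.1' into (2, 3, 1) so versions compare numerically, not as text."""
    return tuple(int(part) for part in version.split("."))


versions = supported_versions(ERROR_TEXT)
wanted = "3.0.1"
print("3.0.1 supported:", wanted in versions)
print("newest supported:", max(versions, key=as_tuple))
```

If `3.0.1` is missing from the list, the SDK is too old and needs upgrading. That is why the environment is recreated with the pinned `sagemaker` version.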

```bash
conda env remove --name huggingface_py311
conda env list
```

In [11]:
import sagemaker
print("SageMaker version:", sagemaker.__version__) ## SageMaker version: 2.238.0
## make sure this module can be imported
from sagemaker.model_card import ModelPackageModelCard
print(ModelPackageModelCard)

SageMaker version: 2.238.0
<class 'sagemaker.model_card.model_card.ModelPackageModelCard'>
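
The notebook pins `sagemaker==2.238.0` because older SDK releases reject the `huggingface-llm` 3.0.1 image. Below is a small stdlib sketch that fails fast when the installed SDK is older than the pin. Treating 2.238.0 as the minimum is an assumption taken from the pin, not a documented cut-off.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def parse(v: str) -> tuple:
    """Keep only the leading numeric parts, e.g. '2.238.0.dev0' -> (2, 238, 0)."""
    parts = []
    for piece in v.split("."):
        if not piece.isdigit():
            break
        parts.append(int(piece))
    return tuple(parts)


def sdk_at_least(minimum: str, package: str = "sagemaker") -> Optional[bool]:
    """True/False if `package` is installed, None if it is not installed at all."""
    try:
        installed = version(package)
    except PackageNotFoundError:
        return None
    return parse(installed) >= parse(minimum)


print("sagemaker >= 2.238.0:", sdk_at_least("2.238.0"))
```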


## **👉 AWS Credentials**

In [24]:
## windows cmd to launch notepad to edit aws credential file
# !notepad C:\Users\guido\.aws\config
!notepad C:\Users\guido\.aws\credentials
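
The credentials file is INI-formatted, so you can list its profiles without opening Notepad. Below is a minimal sketch using `configparser`. It assumes that temporary credentials (for example, from a federated lab login) carry an `aws_session_token` line. The sample profile names and keys are fake.

```python
import configparser

# Fake credentials-file content; profile names and keys are hypothetical.
SAMPLE = """
[default]
aws_access_key_id = AKIAEXAMPLEKEY
aws_secret_access_key = example-secret

[my-admin]
aws_access_key_id = ASIAEXAMPLEKEY
aws_secret_access_key = example-secret-2
aws_session_token = example-token
"""


def profiles(ini_text: str) -> dict:
    """Map each profile name to whether it holds temporary (session-token) credentials."""
    parser = configparser.ConfigParser()
    parser.read_string(ini_text)
    return {name: parser.has_option(name, "aws_session_token") for name in parser.sections()}


print(profiles(SAMPLE))
```

To check the real file, pass `Path.home().joinpath(".aws", "credentials").read_text()` (from `pathlib`) instead of `SAMPLE`. Print only the names, never the values.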

## Skip this cell
⚠️ This federated account (`voclabs`) does not have the permissions needed to deploy SageMaker endpoints.   

In [29]:
## reset the session after updating credentials
import boto3 # type: ignore
boto3.DEFAULT_SESSION = None
import sagemaker # type: ignore
from sagemaker import get_execution_role # type: ignore

# Extract and print the account ID
sts_client = boto3.client('sts') ## "default" profile
response = sts_client.get_caller_identity() 
account_id = response['Account']

role_arn = get_execution_role()  ## get role ARN
if 'AmazonSageMaker-ExecutionRole' not in role_arn:
    voclabs_role_arn = role_arn
    ## Go to "IAM - Roles", search for "SageMaker", find the execution role.
    ## This role is not allowed to use large GPU instances.
    sagemaker_role_arn = "arn:aws:iam::807711953667:role/service-role/AmazonSageMaker-ExecutionRole-20241121T213663"
else:
    ## already running as the SageMaker execution role; define both names so the prints below work
    voclabs_role_arn = None
    sagemaker_role_arn = role_arn
sagemaker_session = sagemaker.Session()  ## "default"
region = sagemaker_session.boto_region_name
bucket = sagemaker_session.default_bucket()

print(f"Current AWS Account ID: {account_id}")
print("AWS Region: {}".format(region))
print("Default Bucket: {}".format(bucket))
print(f"Role voclabs ARN: {voclabs_role_arn}") 
print("SageMaker Role ARN: {}".format(sagemaker_role_arn)) 

role = sagemaker_role_arn

Current AWS Account ID: 807711953667
AWS Region: us-east-1
Default Bucket: sagemaker-us-east-1-807711953667
Role voclabs ARN: arn:aws:iam::807711953667:role/voclabs
SageMaker Role ARN: arn:aws:iam::807711953667:role/service-role/AmazonSageMaker-ExecutionRole-20241121T213663
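
Outside a SageMaker notebook, `get_execution_role()` derives the role from the caller identity. The caller identity is an STS *assumed-role* ARN (see the `AccessDeniedException` below), which gets rewritten into an IAM role ARN. Below is a simplified sketch of that rewrite. It assumes the role has no IAM path. As far as I know, the SDK additionally asks IAM for the real ARN, which is how paths like `service-role/` are recovered.

```python
import re


def role_arn_from_caller_arn(caller_arn: str) -> str:
    """Convert an STS assumed-role ARN into the IAM role ARN it was assumed from.

    Simplified: assumes the role has no IAM path.
    """
    match = re.match(r"^arn:(aws[\w-]*):sts::(\d{12}):assumed-role/([^/]+)/.+$", caller_arn)
    if match is None:
        return caller_arn  # not an assumed-role ARN; return it unchanged
    partition, account, role_name = match.groups()
    return f"arn:{partition}:iam::{account}:role/{role_name}"


print(role_arn_from_caller_arn(
    "arn:aws:sts::807711953667:assumed-role/voclabs/user1359219=u229064"
))
```

This reproduces the `Role voclabs ARN` printed above.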


✅ This account has full permissions.  

In [None]:
## Get the AWS profile name (first comma-separated field of the first line)
with open('secrets/my_aws_profile', 'r') as file:
    my_aws_profile = file.readline().strip().split(',')[0]
# print(f"my_aws_profile: {my_aws_profile}")
## reset the boto3 session after updating credentials
import boto3 # type: ignore
import sagemaker # type: ignore
boto3.DEFAULT_SESSION = None
boto3_session = boto3.Session(profile_name=my_aws_profile, region_name='us-east-1')
sagemaker_session = sagemaker.Session(boto_session=boto3_session)
## Get the SageMaker execution role name.
## list_roles is paginated, so scan every page; role stays None if nothing matches.
iam_client = boto3_session.client('iam')
role = next(
    (r['RoleName']
     for page in iam_client.get_paginator('list_roles').paginate()
     for r in page['Roles']
     if 'AmazonSageMaker-ExecutionRole' in r['RoleName']),
    None,
)
print(role)



sagemaker.config INFO - Not applying SDK defaults from location: C:\ProgramData\sagemaker\sagemaker\config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: C:\Users\guido\AppData\Local\sagemaker\sagemaker\config.yaml


AmazonSageMaker-ExecutionRole-20241119T132969
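
The role lookup can be written as a pure function over `list_roles` pages, so it can be tested offline. Below is a sketch that returns `None` when nothing matches, rather than leaving the loop variable bound to the last role dict. The fake pages contain only the `RoleName` field and are made up.

```python
from typing import Iterable, Optional


def find_execution_role(pages: Iterable[dict],
                        marker: str = "AmazonSageMaker-ExecutionRole") -> Optional[str]:
    """Return the first RoleName containing `marker` across all pages, else None."""
    for page in pages:
        for role in page.get("Roles", []):
            if marker in role["RoleName"]:
                return role["RoleName"]
    return None


# Fake pages shaped like IAM list_roles responses (only the field used here).
fake_pages = [
    {"Roles": [{"RoleName": "voclabs"}, {"RoleName": "some-other-role"}]},
    {"Roles": [{"RoleName": "AmazonSageMaker-ExecutionRole-20241119T132969"}]},
]
print(find_execution_role(fake_pages))
```

With a real client, call it as `find_execution_role(iam_client.get_paginator("list_roles").paginate())`.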


## **👉 deepseek-ai/DeepSeek-R1-Distill-Qwen-32B**  

* [HuggingFace model card](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-32B)  
* [AWS SageMaker AI -> Pricing](https://aws.amazon.com/sagemaker-ai/pricing/) -> On-Demand Pricing -> Real-Time Inference  

    | Instance Type    | vCPU | Memory   | GPUs (24 GB each) | Price per Hour |
    |------------------|------|----------|-------------------|----------------|
    | ml.g5.xlarge     | 4    | 16 GiB   | 1                 | $1.41          |
    | ml.g5.2xlarge    | 8    | 32 GiB   | 1                 | $1.52          |
    | ml.g5.4xlarge    | 16   | 64 GiB   | 1                 | $2.03          |
    | ml.g5.8xlarge    | 32   | 128 GiB  | 1                 | $3.06          |
    | ml.g5.12xlarge   | 48   | 192 GiB  | 4                 | $7.09          |
    | ml.g5.16xlarge   | 64   | 256 GiB  | 1                 | $5.12          |
    | ml.g5.24xlarge   | 96   | 384 GiB  | 4                 | $10.18         |
    | ml.g5.48xlarge   | 192  | 768 GiB  | 8                 | $20.36         |
    | ml.g6.xlarge     | 4    | 16 GiB   | 1                 | $1.1267        |
    | ml.g6.2xlarge    | 8    | 32 GiB   | 1                 | $1.222         |
    | ml.g6.4xlarge    | 16   | 64 GiB   | 1                 | $1.654         |
    | ml.g6.8xlarge    | 32   | 128 GiB  | 1                 | $2.518         |
    | ml.g6.12xlarge   | 48   | 192 GiB  | 4                 | $5.752         |
    | ml.g6.16xlarge   | 64   | 256 GiB  | 1                 | $4.246         |
    | ml.g6.24xlarge   | 96   | 384 GiB  | 4                 | $8.344         |
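
A real-time endpoint accrues charges for as long as it is running, whether or not it serves requests. So the hourly price adds up quickly. Below is a small sketch that turns the table prices into a budget. Only a few rows are copied, and the helper name `endpoint_cost` is illustrative.

```python
# On-demand real-time inference prices (USD per instance-hour) copied from the table above.
PRICE_PER_HOUR = {
    "ml.g5.12xlarge": 7.09,
    "ml.g6.8xlarge": 2.518,
    "ml.g6.12xlarge": 5.752,
    "ml.g6.24xlarge": 8.344,
}


def endpoint_cost(instance_type: str, hours: float, instances: int = 1) -> float:
    """Price x hours x instance count, rounded to cents."""
    return round(PRICE_PER_HOUR[instance_type] * hours * instances, 2)


print(endpoint_cost("ml.g6.12xlarge", 8))        # one working day
print(endpoint_cost("ml.g6.12xlarge", 24 * 30))  # left running for 30 days
```

When you are done testing, `predictor.delete_endpoint()` stops the charges.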

<br>  

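* [EC2 G6 Instances](https://aws.amazon.com/ec2/instance-types/g6/). The row below describes the G5 size (NVIDIA A10G). The G6 size `g6.8xlarge` has the same 32 vCPUs, 128 GiB of memory and single 24 GB GPU, but the GPU is an NVIDIA L4.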

| Instance Size    | vCPUs | Instance Memory (GiB) | GPU Model     | GPUs | Total GPU Memory (GB) | Memory per GPU (GB) | Network Bandwidth (Gbps) | EBS Bandwidth (Gbps) | Instance Storage (GB) |
|------------------|-------|-----------------------|---------------|------|-----------------------|---------------------|--------------------------|----------------------|-----------------------|
| ml.g5.8xlarge     | 32    | 128                   | NVIDIA A10G   | 1    | 24                    | 24                  | 25                       | 16                   | 1x900                 |
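
The `CUDA out of memory` error on a single 24 GB GPU follows from simple arithmetic. 32 billion parameters at 2 bytes each (bf16) need about 64 GB for the weights alone. Below is a rough sketch of that check. It uses the nominal 32B parameter count and assumes 20% of GPU memory is reserved for KV cache, activations and CUDA overhead. The headroom figure is a rule-of-thumb assumption, not a measured value.

```python
def weights_gb(params_billion: float, bytes_per_param: float) -> float:
    """Memory for the weights alone in GB (1 GB = 1e9 bytes)."""
    return params_billion * bytes_per_param  # (params * 1e9) * bytes / 1e9


def fits(params_billion: float, bytes_per_param: float, gpus: int,
         gb_per_gpu: float = 24.0, headroom: float = 0.8) -> bool:
    """Rough check: weights must fit in `headroom` of the total GPU memory."""
    return weights_gb(params_billion, bytes_per_param) <= gpus * gb_per_gpu * headroom


print(weights_gb(32, 2))       # bf16 weights in GB
print(fits(32, 2, gpus=1))     # e.g. ml.g6.8xlarge
print(fits(32, 2, gpus=4))     # e.g. ml.g6.12xlarge
```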




In [None]:
import json
from sagemaker.huggingface import HuggingFaceModel, get_huggingface_llm_image_uri
# Hub Model configuration. https://huggingface.co/models
# SM_NUM_GPUS tells the TGI container how many GPUs to shard the model across
# (4 on ml.g6.12xlarge). HF_NUM_CORES / HF_AUTO_CAST_TYPE were dropped: they configure
# the AWS Inferentia (Neuron) container, not the GPU TGI container used here.
hub = {
    "HF_MODEL_ID": "deepseek-ai/DeepSeek-R1-Distill-Qwen-32B",
    "SM_NUM_GPUS": json.dumps(4),
    "MAX_BATCH_SIZE": "8",
    "MAX_INPUT_TOKENS": "3686",
    "MAX_TOTAL_TOKENS": "4096",
}
# create Hugging Face Model Class
huggingface_model = HuggingFaceModel(
    image_uri=get_huggingface_llm_image_uri("huggingface", session=sagemaker_session, version="3.0.1"),
    env=hub,
    role=role,
    sagemaker_session=sagemaker_session,
)
# deploy model to SageMaker Inference
predictor = huggingface_model.deploy(
    initial_instance_count=1,
    # instance_type="ml.g6.8xlarge", ## $2.518/hour. ⚠️ CUDA out of memory: 1 GPU x 24 GB
    instance_type="ml.g6.12xlarge",  ## $5.752/hour. 4 GPUs x 24 GB
    container_startup_health_check_timeout=700,  ## seconds (~12 minutes); 1200 = 20 minutes
    sagemaker_session=sagemaker_session,
)
# send request
predictor.predict({
    "inputs": "Hi, who are you and what can you help me with?",
})

---------------------------------*
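
`MAX_TOTAL_TOKENS` bounds prompt tokens plus generated tokens for a single request. So with the `hub` settings above, a maximal 3686-token prompt leaves 410 tokens for `max_new_tokens`. Below is a small sketch of that budget check. The limit values are copied from `hub`, and the helper name `new_token_budget` is illustrative.

```python
# Limits copied from the `hub` env dict above (TGI reads them as strings).
HUB_LIMITS = {"MAX_INPUT_TOKENS": "3686", "MAX_TOTAL_TOKENS": "4096"}


def new_token_budget(env: dict, prompt_tokens: int) -> int:
    """How many tokens may still be generated for a prompt of `prompt_tokens`."""
    max_input = int(env["MAX_INPUT_TOKENS"])
    max_total = int(env["MAX_TOTAL_TOKENS"])
    if max_input >= max_total:
        raise ValueError("MAX_INPUT_TOKENS must be smaller than MAX_TOTAL_TOKENS")
    if prompt_tokens > max_input:
        raise ValueError(f"prompt has {prompt_tokens} tokens; the limit is {max_input}")
    return max_total - prompt_tokens


print(new_token_budget(HUB_LIMITS, 3686))  # worst case: longest allowed prompt
print(new_token_budget(HUB_LIMITS, 20))    # a short chat prompt
```

The `max_new_tokens=128` used in the next request stays within even the worst-case budget.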

In [None]:
predictor.predict(
    {
        "inputs": "What is the capital of France?",
        "parameters": {
            "do_sample": True,
            "max_new_tokens": 128,
            "temperature": 0.7,
            "top_k": 50,
            "top_p": 0.95,
        }
    }
)
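
`predictor.predict` sends the `{"inputs": ..., "parameters": {...}}` body shown above. TGI typically answers with a list such as `[{"generated_text": "..."}]`. R1-distilled models usually put their reasoning before a closing `</think>` tag. Below is a sketch that builds the payload and separates reasoning from answer. Both the response shape and the `</think>` convention are assumptions here, and the fake response is made up.

```python
from typing import Any


def build_payload(prompt: str, **parameters: Any) -> dict:
    """Request body in the shape used by the predict calls above."""
    return {"inputs": prompt, "parameters": parameters} if parameters else {"inputs": prompt}


def first_text(response: list) -> str:
    """Take the generated text of the first item in a TGI-style response list."""
    return response[0]["generated_text"]


def split_reasoning(generated_text: str) -> tuple:
    """Split R1-style output into (reasoning, answer) at '</think>'.

    If the tag is absent, everything is treated as the answer.
    """
    head, sep, tail = generated_text.partition("</think>")
    if not sep:
        return "", generated_text.strip()
    return head.replace("<think>", "").strip(), tail.strip()


payload = build_payload("What is the capital of France?", max_new_tokens=128, temperature=0.7)
fake_response = [{"generated_text": "<think>\nThe user asks about France.\n</think>\n\nThe capital of France is Paris."}]
reasoning, answer = split_reasoning(first_text(fake_response))
print(payload)
print(answer)
```

With a live endpoint, replace `fake_response` with `predictor.predict(payload)`.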

> ClientError: An error occurred (ValidationException) when calling the CreateEndpointConfig operation: 1 validation 
error detected: Value 'gr6.8xlarge' at 'productionVariants.1.member.instanceType' failed to satisfy constraint: 
Member must satisfy enum value set: [ml.r7i.48xlarge, ml.trn1.32xlarge, ml.r6i.16xlarge, ml.m6i.xlarge, 
ml.r5d.12xlarge, ml.r5.12xlarge, ml.p2.xlarge, ml.m5.4xlarge, ml.m4.16xlarge, ml.r7i.16xlarge, ml.m7i.xlarge, 
ml.p5.48xlarge, ml.r6gd.xlarge, ml.r6g.8xlarge, ml.r6g.large, ml.m6gd.16xlarge, ml.m6i.12xlarge, ml.r5d.24xlarge, 
ml.r5.24xlarge, ml.r7i.8xlarge, ml.r7i.large, ml.m7i.12xlarge, ml.r6gd.12xlarge, ml.r6g.16xlarge, ml.m6gd.8xlarge, 
ml.m6gd.large, ml.m6g.xlarge, ml.p4d.24xlarge, ml.m6i.24xlarge, ml.m7i.24xlarge, ml.m6g.12xlarge, ml.r6i.8xlarge, 
ml.r6i.large, ml.p5e.48xlarge, ml.trn2.48xlarge, ml.g5.2xlarge, ml.p3.16xlarge, ml.m5d.xlarge, ml.m5.large, 
ml.t2.xlarge, ml.m7i.48xlarge, ml.g6.2xlarge, ml.m6i.16xlarge, ml.p2.16xlarge, ml.m5d.12xlarge, ml.m7i.16xlarge, 
ml.r6gd.16xlarge, ml.c6gd.2xlarge, ml.g5.4xlarge, ml.inf1.2xlarge, ml.m5d.24xlarge, ml.m6g.16xlarge, ml.g6.4xlarge,
ml.c4.2xlarge, ml.c6gn.xlarge, ml.c6gd.4xlarge, ml.c5.2xlarge, ml.c6gn.12xlarge, ml.c6i.32xlarge, ml.c4.4xlarge, 
ml.g6e.xlarge, ml.g5.8xlarge, ml.c6i.xlarge, ml.inf1.6xlarge, ml.c5d.2xlarge, ml.c5.4xlarge, ml.c7i.xlarge, 
ml.inf2e.32xlarge, ml.c7g.2xlarge, ml.g6e.12xlarge, ml.g6.8xlarge, ml.c6i.12xlarge, ml.g4dn.xlarge, 
ml.c7i.12xlarge, ml.c6gd.8xlarge, ml.c6gd.large, ml.c6g.2xlarge, ml.c6g.xlarge, ml.g6e.24xlarge, ml.c6i.24xlarge, 
ml.g4dn.12xlarge, ml.c5d.4xlarge, ml.c7i.24xlarge, ml.c7i.2xlarge, ml.inf2.8xlarge, ml.c6gn.16xlarge, 
ml.c6g.12xlarge, ml.c7g.4xlarge, ml.c7g.xlarge, ml.g4dn.2xlarge, ml.c4.8xlarge, ml.c4.large, ml.c6g.4xlarge, 
ml.c7g.12xlarge, ml.g6e.48xlarge, ml.g6e.2xlarge, ml.c6i.2xlarge, ml.c5d.xlarge, ml.c5.large, ml.c7i.48xlarge, 
ml.c7i.4xlarge, ml.g6e.16xlarge, ml.c6i.16xlarge, ml.g4dn.4xlarge, ml.c5.9xlarge, ml.c7i.16xlarge, ml.c6gn.2xlarge,
ml.g6e.4xlarge, ml.c6i.4xlarge, ml.g4dn.16xlarge, ml.c5d.large, ml.c5.xlarge, ml.inf2.xlarge, ml.c6g.16xlarge, 
ml.c7g.8xlarge, ml.c7g.large, ml.c5d.9xlarge, ml.c4.xlarge, ml.trn1n.32xlarge, ml.c6gn.4xlarge, ml.c6gd.xlarge, 
ml.c6g.8xlarge, ml.c6g.large, ml.c7g.16xlarge, ml.inf1.xlarge, ml.c7i.8xlarge, ml.c7i.large, ml.inf2.24xlarge, 
ml.c6gd.12xlarge, ml.g6.xlarge, ml.g4dn.8xlarge, ml.g6e.8xlarge, ml.g6.12xlarge, ml.g5.xlarge, ml.c6i.8xlarge, 
ml.c6i.large, ml.inf1.24xlarge, ml.m5d.2xlarge, ml.t2.2xlarge, ml.inf2.48xlarge, ml.g6.24xlarge, ml.g5.12xlarge, 
ml.c5d.18xlarge, ml.c6gn.8xlarge, ml.c6gn.large, ml.m6g.2xlarge, ml.g5.24xlarge, ml.m5d.4xlarge, ml.t2.medium, 
ml.m7i.2xlarge, ml.trn1.2xlarge, ml.r6gd.2xlarge, ml.c6gd.16xlarge, ml.g6.48xlarge, ml.c5.18xlarge, ml.m6g.4xlarge,
ml.g6.16xlarge, ml.g5.48xlarge, ml.m6i.2xlarge, ml.m7i.4xlarge, ml.r6gd.4xlarge, ml.g5.16xlarge, ml.dl1.24xlarge, 
ml.r5d.2xlarge, ml.r5.2xlarge, ml.p3.2xlarge, ml.r6i.32xlarge, ml.m6i.4xlarge, ml.m5d.large, ml.m5.xlarge, 
ml.m4.10xlarge, ml.t2.large, ml.r6g.2xlarge, ml.r6i.xlarge, ml.r5d.4xlarge, ml.r5.4xlarge, ml.m5.12xlarge, 
ml.m4.xlarge, ml.r7i.2xlarge, ml.r7i.xlarge, ml.m6gd.2xlarge, ml.m6gd.xlarge, ml.m6g.8xlarge, ml.m6g.large, 
ml.r6i.12xlarge, ml.m5.24xlarge, ml.r7i.12xlarge, ml.m7i.8xlarge, ml.m7i.large, ml.r6gd.8xlarge, ml.r6gd.large, 
ml.r6g.4xlarge, ml.r6g.xlarge, ml.m6gd.12xlarge, ml.r6i.24xlarge, ml.r6i.2xlarge, ml.m4.2xlarge, ml.r7i.24xlarge, 
ml.r7i.4xlarge, ml.r6g.12xlarge, ml.m6gd.4xlarge, ml.m6i.8xlarge, ml.m6i.large, ml.p2.8xlarge, ml.m5.2xlarge, 
ml.p4de.24xlarge, ml.r6i.4xlarge, ml.m6i.32xlarge, ml.r5d.xlarge, ml.r5d.large, ml.r5.xlarge, ml.r5.large, 
ml.p3.8xlarge, ml.m4.4xlarge]

* Solution: SageMaker instance types carry an `ml.` prefix (e.g. `ml.g6.8xlarge`, not `g6.8xlarge`). The rejected value `gr6.8xlarge` also has a typo (`gr6` instead of `g6`).
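
A local check before calling `deploy` catches both mistakes without a round trip to AWS. Below is a minimal sketch that adds a missing `ml.` prefix and then validates against a known list. The set here is a small illustrative subset copied from the enum in the error message, not the full list.

```python
import re

# Illustrative subset of valid values copied from the enum in the error above.
KNOWN_TYPES = {"ml.g5.12xlarge", "ml.g6.8xlarge", "ml.g6.12xlarge", "ml.g6.48xlarge", "ml.g6e.12xlarge"}
PATTERN = re.compile(r"^ml\.[a-z0-9]+\.[a-z0-9]+$")


def check_instance_type(name: str) -> str:
    """Add a missing 'ml.' prefix, then validate the format and membership."""
    candidate = name if name.startswith("ml.") else f"ml.{name}"
    if not PATTERN.match(candidate):
        raise ValueError(f"malformed instance type: {name!r}")
    if candidate not in KNOWN_TYPES:
        raise ValueError(f"{candidate!r} is not a known instance type; check for typos")
    return candidate


print(check_instance_type("g6.8xlarge"))
```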

> ClientError: An error occurred (AccessDeniedException) when calling the CreateEndpointConfig operation: User: 
arn:aws:sts::807711953667:assumed-role/voclabs/user1359219=u229064 is not authorized to perform: 
sagemaker:CreateEndpointConfig on resource: 
arn:aws:sagemaker:us-east-1:807711953667:endpoint-config/huggingface-pytorch-tgi-inference-2025-01-31-07-33-35-437 
with an explicit deny in a service control policy  

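* Solution: Use another account whose role is allowed to call `sagemaker:CreateEndpointConfig`. An explicit deny in a service control policy overrides any IAM allow inside the account, so it cannot be fixed from this role.  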

> ResourceLimitExceeded: An error occurred (ResourceLimitExceeded) when calling the CreateEndpoint operation: The 
account-level service limit 'ml.g6.8xlarge for endpoint usage' is 0 Instances, with current utilization of 0 
Instances and a request delta of 1 Instances. Please use AWS Service Quotas to request an increase for this quota. 
If AWS Service Quotas is not available, contact AWS support to request an increase for this quota.

* ✅ Solution: Run the `aws_quota.ipynb` notebook, or open the AWS Service Quotas console and request an increase for 'ml.g6.8xlarge for endpoint usage'.  
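
The `aws_quota.ipynb` notebook is not shown here. Below is a sketch of what such a request can look like with the Service Quotas API: find the quota by its display name, then request a new value. The client is passed in, so the sketch runs offline against a stub. The stub's quota code `L-EXAMPLE` is made up, and a real run would use `boto3_session.client("service-quotas")`.

```python
def request_quota_increase(client, quota_name: str, desired: float) -> dict:
    """Look up a SageMaker quota by display name and ask for `desired` instances.

    `client` is assumed to be boto3.client("service-quotas") or a compatible stub.
    """
    paginator = client.get_paginator("list_service_quotas")
    for page in paginator.paginate(ServiceCode="sagemaker"):
        for quota in page["Quotas"]:
            if quota["QuotaName"] == quota_name:
                return client.request_service_quota_increase(
                    ServiceCode="sagemaker",
                    QuotaCode=quota["QuotaCode"],
                    DesiredValue=desired,
                )
    raise LookupError(f"quota not found: {quota_name!r}")


class _StubPaginator:
    def __init__(self, pages):
        self.pages = pages

    def paginate(self, **kwargs):
        return iter(self.pages)


class StubClient:
    """Offline stand-in for the service-quotas client; the quota code is made up."""

    def __init__(self):
        self.requests = []

    def get_paginator(self, name):
        return _StubPaginator([{"Quotas": [
            {"QuotaName": "ml.g6.8xlarge for endpoint usage", "QuotaCode": "L-EXAMPLE", "Value": 0.0},
        ]}])

    def request_service_quota_increase(self, **kwargs):
        self.requests.append(kwargs)
        return {"RequestedQuota": {"Status": "PENDING", **kwargs}}


stub = StubClient()
print(request_quota_increase(stub, "ml.g6.8xlarge for endpoint usage", 1.0))
```

The quota name must match the text shown in the Service Quotas console exactly.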

## **👉 deepseek-ai/DeepSeek-R1-Distill-Llama-70B**  

* [HuggingFace model card](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Llama-70B)  
* ⚠️ For the 70B model, some users in the [model's discussions](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Llama-70B/discussions/) recommend `g6.12xlarge` or larger, which costs **more than $5 per hour**.       

* Medium: [DeepSeek R1 on AWS](https://dgallitelli95.medium.com/deepseek-r1-on-aws-70c1c4b692f3)  
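
The same weights-only arithmetic applies to the 70B model. At bf16, 70 billion parameters need about 140 GB, more than the 96 GB on the four L4 GPUs of `ml.g6.12xlarge`. So either a larger size or a quantized variant is needed. Below is a rough sketch that picks the smallest G6 size by GPU count. The 20% headroom is an assumption, and `ml.g6.48xlarge` (8 GPUs) is not in the pricing table above, though it appears in the instance-type enum.

```python
import math
from typing import Optional

# GPU counts per ml.g6 size; each NVIDIA L4 has 24 GB.
G6_GPUS = {"ml.g6.12xlarge": 4, "ml.g6.24xlarge": 4, "ml.g6.48xlarge": 8}


def gpus_needed(params_billion: float, bytes_per_param: float,
                gb_per_gpu: float = 24.0, headroom: float = 0.8) -> int:
    """GPUs needed to hold the weights in `headroom` of each GPU's memory."""
    return math.ceil(params_billion * bytes_per_param / (gb_per_gpu * headroom))


def smallest_g6(params_billion: float, bytes_per_param: float) -> Optional[str]:
    """Smallest listed G6 size (by GPU count) with enough GPUs, or None."""
    need = gpus_needed(params_billion, bytes_per_param)
    candidates = [name for name, gpus in G6_GPUS.items() if gpus >= need]
    return min(candidates, key=lambda n: G6_GPUS[n]) if candidates else None


print(smallest_g6(70, 2))    # 70B in bf16
print(smallest_g6(70, 0.5))  # 70B with 4-bit weights
print(smallest_g6(32, 2))    # the 32B model deployed above
```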