# DeepSeek LLM Deployment on SageMaker
This notebook demonstrates how to deploy the DeepSeek-R1-Distill-Llama-8B model on Amazon SageMaker using the Hugging Face Deep Learning Container.

## Install the Latest Version of the SageMaker SDK
Upgrading the SDK helps ensure compatibility with the `huggingface-llm` container used to host models from Hugging Face.
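
After upgrading, it can be useful to confirm which SDK version the kernel actually picked up. The following is a minimal sketch using only the standard library; the helper name `installed_version` is hypothetical and not part of the SageMaker SDK.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of `package`, or None if it is not installed."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Prints None if the SDK is not installed in the current environment.
print("sagemaker version:", installed_version("sagemaker"))
```

If the printed version does not change after running `pip install -U`, restarting the notebook kernel is usually required before the new version is imported.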

In [None]:
!pip install -U sagemaker

## Model Configuration
Define the configuration for the DeepSeek model:
- `HF_MODEL_ID`: Specifies the Hugging Face model to be deployed
- `SM_NUM_GPUS`: Sets the number of GPUs to use for the deployment (1 in this case)
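
Container environment variables are passed as strings, which is why the cell below wraps the GPU count in `json.dumps(1)`. As a minimal sketch, the same dictionary could be produced by a small helper; `build_hub_env` is a hypothetical name introduced here for illustration only.

```python
import json


def build_hub_env(model_id: str, num_gpus: int = 1) -> dict:
    """Build the container environment; every value must be a string."""
    if num_gpus < 1:
        raise ValueError("num_gpus must be at least 1")
    return {
        "HF_MODEL_ID": model_id,
        # json.dumps(1) produces the string "1", not the integer 1.
        "SM_NUM_GPUS": json.dumps(num_gpus),
    }


example_env = build_hub_env("deepseek-ai/DeepSeek-R1-Distill-Llama-8B", num_gpus=1)
print(example_env)
```

If you deploy to an instance type with more than one GPU, `num_gpus` would need to match the number of GPUs you want the container to use.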


In [4]:
import json
import sagemaker
import boto3
from sagemaker.huggingface import HuggingFaceModel, get_huggingface_llm_image_uri

role = sagemaker.get_execution_role()

hub = {
    'HF_MODEL_ID': 'deepseek-ai/DeepSeek-R1-Distill-Llama-8B',
    'SM_NUM_GPUS': json.dumps(1),  # environment values are passed as strings
}

## Deploy Model to SageMaker Endpoint
This cell deploys the model to a SageMaker endpoint with the following configuration:
- Uses an `ml.g5.2xlarge` instance type for GPU acceleration
- Deploys in a VPC with specified subnets and security groups
- Sets an extended health check timeout (900 seconds) to allow time for model loading

**Note:** The `vpc_config` argument is optional but highly recommended, because it gives you control over the network security of the endpoint.
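
Placeholder values in a `vpc_config` dictionary are easy to leave in by mistake. The following is a minimal sketch of a sanity check you could run before deploying. The helper `validate_vpc_config` is hypothetical, and the IDs in the example are made up; the check relies only on the fact that subnet IDs start with `subnet-` and security group IDs start with `sg-`.

```python
def validate_vpc_config(vpc_config: dict) -> None:
    """Raise ValueError if vpc_config looks incomplete or malformed."""
    expected_prefixes = {"Subnets": "subnet-", "SecurityGroupIds": "sg-"}
    for key, prefix in expected_prefixes.items():
        ids = vpc_config.get(key)
        if not ids:
            raise ValueError(f"vpc_config['{key}'] must be a non-empty list")
        for resource_id in ids:
            # Unfilled placeholders such as '<ENTER YOUR ...>' fail this check.
            if not resource_id.startswith(prefix):
                raise ValueError(
                    f"{resource_id!r} in '{key}' does not start with {prefix!r}"
                )
    if len(set(vpc_config["Subnets"])) != len(vpc_config["Subnets"]):
        raise ValueError("vpc_config['Subnets'] contains duplicate subnet IDs")


# Hypothetical IDs, for illustration only.
example_vpc_config = {
    "Subnets": ["subnet-0123456789abcdef0", "subnet-0fedcba9876543210"],
    "SecurityGroupIds": ["sg-0123456789abcdef0"],
}
validate_vpc_config(example_vpc_config)  # passes without raising
```

The check is purely local and does not confirm that the resources exist in your account.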

A KMS key for encrypting the EBS volume attached to the hosting instance is not needed in this case, because the instance uses its NVMe instance store volume. The relevant snippet from the documentation:

>The data on NVMe instance store volumes is encrypted using an XTS-AES-256 cipher, implemented on a hardware module on the instance. The keys used to encrypt data that's written to locally-attached NVMe storage devices are per-customer, and per volume. The keys are generated by, and only reside within, the hardware module, which is inaccessible to AWS personnel. The encryption keys are destroyed when the instance is stopped or terminated and cannot be recovered. You cannot disable this encryption and you cannot provide your own encryption key.
Reference: 

In [None]:
# Create the Hugging Face model object
huggingface_model = HuggingFaceModel(
    image_uri=get_huggingface_llm_image_uri("huggingface", version="2.3.1"),
    env=hub,
    role=role,
    # The following is optional, but highly recommended
    vpc_config={
        'Subnets': [
            '<ENTER YOUR SUBNET 1 ID>',
            '<ENTER YOUR SUBNET 2 ID>'
        ],
        'SecurityGroupIds': [
            '<ENTER YOUR SECURITY GROUP ID>'
        ]
    }
)

# Deploy the model to SageMaker Inference
predictor = huggingface_model.deploy(
    initial_instance_count=1,
    instance_type="ml.g5.2xlarge",
    container_startup_health_check_timeout=900,
)

## Making Predictions with the Deployed Model
The following cell demonstrates how to send a request to the deployed model:
- We use the `predict()` method of our endpoint predictor
- The request includes:
  - `inputs`: The text prompt we want the model to process
  - `parameters`: Configuration for the generation
    - `max_length`: Maximum length of the entire sequence (input + generated text)
    - `max_new_tokens`: Maximum number of tokens to generate
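
The same request can also be sent from outside the notebook through the `sagemaker-runtime` client in `boto3`. The sketch below only builds the arguments for `invoke_endpoint`, so it runs without AWS credentials. The helper `build_invoke_args` and the endpoint name `my-deepseek-endpoint` are hypothetical; in this notebook the real name is available as `predictor.endpoint_name`. For brevity, this sketch sends only `max_new_tokens`.

```python
import json


def build_invoke_args(endpoint_name: str, prompt: str, max_new_tokens: int = 2048) -> dict:
    """Build keyword arguments for the sagemaker-runtime invoke_endpoint call."""
    payload = {
        "inputs": prompt,
        "parameters": {"max_new_tokens": max_new_tokens},
    }
    return {
        "EndpointName": endpoint_name,
        "ContentType": "application/json",
        "Accept": "application/json",
        "Body": json.dumps(payload),  # the request body is sent as a JSON string
    }


# Hypothetical endpoint name, for illustration only.
invoke_args = build_invoke_args("my-deepseek-endpoint", "Which is larger 9.11 or 9.8?")
print(invoke_args["Body"])
```

With credentials configured, the call itself would be `boto3.client("sagemaker-runtime").invoke_endpoint(**invoke_args)`, and the JSON result can be read from `response["Body"].read()`.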

The model will return a response comparing the two numbers provided in the prompt.
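
DeepSeek-R1 distilled models typically emit their reasoning before the final answer, delimited by a closing `</think>` tag. The following is a minimal sketch for separating the two. It assumes the endpoint returns a list of dictionaries with a `generated_text` key; the helper `split_reasoning` is hypothetical, and `sample_response` is hand-written for illustration, not real model output.

```python
from typing import Tuple


def split_reasoning(generated_text: str) -> Tuple[str, str]:
    """Split generated text into (reasoning, answer) at the closing think tag."""
    marker = "</think>"
    if marker not in generated_text:
        # No reasoning block found; treat the whole text as the answer.
        return "", generated_text.strip()
    reasoning, answer = generated_text.split(marker, 1)
    # The opening tag may be absent, so remove it only if present.
    reasoning = reasoning.replace("<think>", "", 1).strip()
    return reasoning, answer.strip()


# Hand-written sample in the assumed response shape, not real model output.
sample_response = [
    {"generated_text": "<think>9.8 is 9.80, and 9.80 > 9.11.</think>\n9.8 is larger."}
]
reasoning, answer = split_reasoning(sample_response[0]["generated_text"])
print("Reasoning:", reasoning)
print("Answer:", answer)
```

If generation stops at the token limit before the closing tag is produced, the helper returns the full text as the answer, so a short `max_new_tokens` value may leave you with reasoning only.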


In [10]:
# Send a request to the endpoint
predictor.predict({
    "inputs": "Which is larger 9.11 or 9.8?",
    "parameters": {
        "max_length": 4096,
        "max_new_tokens": 2048
    }
})