# 03-Deploy: SageMaker Model Deployment

## Overview
This notebook deploys our fine-tuned language model to an Amazon SageMaker real-time endpoint. It uses the DJL (Deep Java Library) inference container with vLLM as the rolling-batch backend for optimized inference performance.


In [1]:
import os
import json
import time
import boto3
from uuid import uuid4
import sagemaker
from sagemaker.djl_inference.model import DJLModel

sagemaker.config INFO - Not applying SDK defaults from location: /Library/Application Support/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /Users/khaliladib/Library/Application Support/sagemaker/config.yaml


## Key Components

### 1. AWS Setup
- Uses the AWS profile set in `PROFILE_NAME` ("dev" in the cell below)
- Deploys in us-east-1 region
- Connects to S3 bucket for model artifacts
- Initializes necessary AWS clients and sessions


In [2]:
PROFILE_NAME = "dev"
REGION_NAME = "us-east-1"
BUCKET_NAME = "khalil-adib-bucket"
ARTIFACTS_KEY = "webinar/dataset/output/webinar-finetine-job-1c56da57-6cbe-453c-8131-76f4bfa2f66d/output/model/"
ARTIFACTS_PATH = f"s3://{BUCKET_NAME}/{ARTIFACTS_KEY}"

session = boto3.Session(profile_name=PROFILE_NAME, region_name=REGION_NAME)
s3_client = session.client('s3')
sagemaker_session = sagemaker.Session(boto_session=session)
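
The artifacts are later referenced as an uncompressed S3 prefix, and a key prefix used that way is expected to end with a forward slash. The helper below is a minimal sketch of building such a URI defensively; `s3_prefix_uri` and the example bucket and prefix are hypothetical names, not part of the notebook.

```python
def s3_prefix_uri(bucket: str, key_prefix: str) -> str:
    """Build an s3:// URI for a key prefix with exactly one trailing slash."""
    if not bucket or "/" in bucket:
        raise ValueError(f"invalid bucket name: {bucket!r}")
    # Drop leading/trailing slashes so the result never contains '//' at the ends.
    cleaned = key_prefix.strip("/")
    if not cleaned:
        raise ValueError("key prefix must not be empty")
    return f"s3://{bucket}/{cleaned}/"


# Hypothetical bucket and prefix, for illustration only.
print(s3_prefix_uri("example-bucket", "models/run-1/output/model"))
```

In this notebook the same helper could replace the f-string that builds `ARTIFACTS_PATH`, so a missing trailing slash in `ARTIFACTS_KEY` would not go unnoticed.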

### 2. Model Configuration
- Versions the model and endpoint names with a Unix timestamp (`int(time.time())`)
- Uses an ml.g5.12xlarge instance for GPU acceleration
- Uses vLLM rolling batching to serve concurrent requests efficiently
- Configures tensor parallelism to shard the model across GPUs

In [3]:
model_version = int(time.time())
model_name = f"webinar-model-{model_version}"

endpoint_name = f"webinar-model-endpoint-VLLM-{model_version}"

instance_type = "ml.g5.12xlarge"
ROLE = "arn:aws:iam::026090512591:role/sagemaker-execution-role-SageMakerExecutionRole-lZm8CUm9jqkj"
container_uri = "763104351884.dkr.ecr.us-east-1.amazonaws.com/djl-inference:0.29.0-lmi11.0.0-cu124"
model_data = {
    "S3DataSource": {
        "S3Uri": ARTIFACTS_PATH,
        'S3DataType': 'S3Prefix',
        'CompressionType': 'None'
    }
}
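
Timestamped names are convenient, but SageMaker rejects endpoint names that are longer than 63 characters or that contain characters other than letters, digits, and hyphens. The sketch below checks a generated name before any API call is made. `versioned_name` is a hypothetical helper, and applying the same rule to model names is an assumption.

```python
import re

# Pattern and length limit as documented for SageMaker endpoint names.
_NAME_PATTERN = re.compile(r"[a-zA-Z0-9](-*[a-zA-Z0-9])*")
_MAX_NAME_LENGTH = 63


def versioned_name(prefix: str, version: int) -> str:
    """Return '<prefix>-<version>' after validating it as an endpoint name."""
    name = f"{prefix}-{version}"
    if len(name) > _MAX_NAME_LENGTH:
        raise ValueError(f"name is {len(name)} characters, limit is {_MAX_NAME_LENGTH}")
    if not _NAME_PATTERN.fullmatch(name):
        raise ValueError(f"name contains unsupported characters: {name!r}")
    return name


# Fixed timestamp used here so the example is reproducible.
print(versioned_name("webinar-model-endpoint-VLLM", 1700000000))
```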

### 3. Deployment Settings
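- Endpoint server timeout: 3600 seconds (`ENDPOINT_SERVER_TIMEOUT`)
- Maximum rolling batch size: 64 (`OPTION_MAX_ROLLING_BATCH_SIZE`)
- Tensor parallel degree: 2 (`TENSOR_PARALLEL_DEGREE`)
- Uses the DJL inference container 0.29.0 with LMI 11.0.0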


In [4]:
config = {
    "ENDPOINT_SERVER_TIMEOUT": "3600",
    "HF_MODEL_ID": "/opt/ml/model",
    "MODEL_CACHE_ROOT": "/opt/ml/model",
    "SAGEMAKER_ENV": "1",
    "SAGEMAKER_PROGRAM": "inference.py",
    "OPTION_ROLLING_BATCH": "vllm",
    "OPTION_MAX_ROLLING_BATCH_SIZE": "64",
    "TENSOR_PARALLEL_DEGREE": "2"
}

model = DJLModel(
    name=model_name,
    image_uri=container_uri,
    model_data=model_data,
    role=ROLE,
    env=config,
    sagemaker_session=sagemaker_session
)
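
Container environment variables are passed as strings, and the tensor parallel degree cannot exceed the number of GPUs on the instance. The following is a rough sketch of a pre-deployment check under those two assumptions. `check_lmi_env` and the `GPUS_PER_INSTANCE` table are hypothetical and hand-maintained, not part of the SageMaker SDK.

```python
# GPU counts for a few G5 instance types; extend this table as needed.
GPUS_PER_INSTANCE = {
    "ml.g5.2xlarge": 1,
    "ml.g5.12xlarge": 4,
    "ml.g5.48xlarge": 8,
}


def check_lmi_env(env: dict, instance_type: str) -> list:
    """Return a list of human-readable problems found in the env config."""
    problems = []
    # Every environment value should already be a string.
    for key, value in env.items():
        if not isinstance(value, str):
            problems.append(f"{key}: expected str, got {type(value).__name__}")

    gpus = GPUS_PER_INSTANCE.get(instance_type)
    degree = env.get("TENSOR_PARALLEL_DEGREE")
    if gpus is None:
        problems.append(f"unknown GPU count for {instance_type}")
    elif degree is not None and degree != "max":
        # "max" is left to the container; numeric values are checked here.
        if not str(degree).isdigit() or int(degree) < 1:
            problems.append(f"TENSOR_PARALLEL_DEGREE is not a positive integer: {degree!r}")
        elif int(degree) > gpus:
            problems.append(f"TENSOR_PARALLEL_DEGREE {degree} exceeds {gpus} GPUs")
    return problems


print(check_lmi_env({"TENSOR_PARALLEL_DEGREE": "2"}, "ml.g5.12xlarge"))
```

Calling `check_lmi_env(config, instance_type)` before creating the `DJLModel` would surface these mistakes before the container starts, rather than in the endpoint's startup logs.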

## Process Flow
1. Initialize AWS sessions and clients
2. Configure model and endpoint names
3. Set up DJL model with vLLM support
4. Deploy to specified instance type
5. Create and configure endpoint

## Important Notes
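- Ensure the execution role has the IAM permissions needed to read the model artifacts from S3 and pull the container image
- Monitor deployment progress in the SageMaker console
- Check the endpoint's CloudWatch logs if the deployment fails or stalls
- Delete the endpoint when it is no longer needed; the instance is billed for as long as the endpoint is running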

In [5]:
llm = model.deploy(
    instance_type=instance_type,
    initial_instance_count=1,
    endpoint_name=endpoint_name,
    sagemaker_session=sagemaker_session
)

----------!
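
Once the endpoint is in service, it can be called through the SageMaker runtime API and removed when it is no longer needed. The sketch below assumes the container accepts the LMI-style JSON request with `inputs` and `parameters` fields. `build_payload`, `invoke`, and `cleanup` are hypothetical helper names, and the generation parameters are illustrative defaults rather than values from this notebook.

```python
import json


def build_payload(prompt: str, max_new_tokens: int = 128, temperature: float = 0.7) -> str:
    """Serialize a text-generation request as a JSON string."""
    return json.dumps({
        "inputs": prompt,
        "parameters": {"max_new_tokens": max_new_tokens, "temperature": temperature},
    })


def invoke(boto_session, endpoint_name: str, prompt: str) -> dict:
    """Send one prompt to the endpoint and return the parsed JSON response."""
    runtime = boto_session.client("sagemaker-runtime")
    response = runtime.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType="application/json",
        Body=build_payload(prompt),
    )
    return json.loads(response["Body"].read())


def cleanup(predictor) -> None:
    """Delete the model and the endpoint behind a SageMaker Predictor."""
    predictor.delete_model()
    predictor.delete_endpoint()


print(build_payload("Hello", max_new_tokens=16))
```

With the objects from this notebook, that would be `invoke(session, endpoint_name, "...")` followed by `cleanup(llm)` once testing is done.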