# 🚀 Deploy Hugging Face Model to AWS SageMaker

**Model**: `openai/gpt-oss-20b` (20B parameters - Cost Optimized)  
**AWS Profile**: `default`

This notebook provides a step-by-step guide to deploy a Hugging Face model to AWS SageMaker.

---

## 📋 Table of Contents

1. [Prerequisites & Setup](#1-prerequisites--setup)
2. [AWS Configuration](#2-aws-configuration)
3. [Model Configuration](#3-model-configuration)
4. [Download & Prepare Model](#4-download--prepare-model)
5. [Create Inference Script](#5-create-inference-script)
6. [Package & Upload to S3](#6-package--upload-to-s3)
7. [Deploy to SageMaker](#7-deploy-to-sagemaker)
8. [Verify Deployment](#8-verify-deployment)
9. [Save Endpoint Configuration](#9-save-endpoint-configuration)

---

## 1. Prerequisites & Setup

First, we'll install all required packages and import necessary libraries.

> **Note**: This cell may take a few minutes to complete.

In [None]:
# Step 1.1: Install required packages
!pip install --upgrade pip
!pip install "sagemaker==2.232.0" transformers torch accelerate boto3 huggingface_hub bitsandbytes

print("\n✅ Packages installed successfully!")
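
Only `sagemaker` is pinned in the install command, so it can help to record which versions pip actually resolved for the other packages. A minimal sketch using the standard library; `installed_versions` is an illustrative helper name, not part of any of these packages:

```python
from importlib import metadata


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if it is missing."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


required = ["sagemaker", "transformers", "torch", "accelerate",
            "boto3", "huggingface_hub", "bitsandbytes"]
for name, version in installed_versions(required).items():
    print(f"{name}: {version or 'NOT INSTALLED'}")
```

Any package reported as `NOT INSTALLED` points to a failed install step that is worth fixing before running the import cell.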

In [None]:
# Step 1.2: Import required libraries
import os
import sys
import json
import time
import shutil
import tarfile
import tempfile
import traceback
import datetime
from pathlib import Path

import boto3
import sagemaker
from sagemaker.session import Session
from sagemaker.huggingface import HuggingFaceModel
from transformers import AutoTokenizer, AutoModelForCausalLM, AutoConfig
from huggingface_hub import snapshot_download, login

print(f"✅ Libraries imported successfully!")
print(f"📦 SageMaker version: {sagemaker.__version__}")

---

## 2. AWS Configuration

Configure AWS credentials using your specified profile: `default`

### What this step does:
1. Sets up AWS session with your profile
2. Looks up a SageMaker execution role (environment variable first, then an IAM search, then a default ARN pattern)
3. Configures S3 bucket for model artifacts
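
The role lookup in this section matches on role names, which can select a role that SageMaker is not actually allowed to assume. A stricter check looks at the trust policy instead. The sketch assumes each role dict carries a parsed `AssumeRolePolicyDocument`, which is how boto3 returns roles from `list_roles`. `trusts_sagemaker` and `pick_execution_role` are illustrative helper names, not part of boto3 or the SageMaker SDK:

```python
def trusts_sagemaker(role):
    """Return True if the role's trust policy lets sagemaker.amazonaws.com assume it."""
    policy = role.get("AssumeRolePolicyDocument") or {}
    statements = policy.get("Statement", [])
    if isinstance(statements, dict):  # a single statement may not be wrapped in a list
        statements = [statements]
    for statement in statements:
        if statement.get("Effect") != "Allow":
            continue
        principal = statement.get("Principal", {})
        if not isinstance(principal, dict):  # e.g. the wildcard principal "*"
            continue
        services = principal.get("Service", [])
        if isinstance(services, str):
            services = [services]
        if "sagemaker.amazonaws.com" in services:
            return True
    return False


def pick_execution_role(roles):
    """Return the ARN of the first role SageMaker can assume, or None if there is none."""
    for role in roles:
        if trusts_sagemaker(role):
            return role["Arn"]
    return None
```

With a real session, the `roles` argument could be built from the pages of `iam_client.get_paginator('list_roles')`, the same paginator the notebook already uses.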

In [None]:
# Step 2.1: Configure AWS Profile
AWS_PROFILE = "default"

# Set environment variable for AWS profile
os.environ['AWS_PROFILE'] = AWS_PROFILE
os.environ['AWS_DEFAULT_PROFILE'] = AWS_PROFILE

print(f"🔐 Using AWS Profile: {AWS_PROFILE}")

In [None]:
# Step 2.2: Verify AWS credentials
try:
    # Create session with profile
    boto_session = boto3.Session(profile_name=AWS_PROFILE)
    sts_client = boto_session.client('sts')
    
    # Get caller identity to verify credentials
    identity = sts_client.get_caller_identity()
    
    print("✅ AWS credentials verified successfully!")
    print(f"   👤 Account ID: {identity['Account']}")
    print(f"   🆔 User ARN: {identity['Arn']}")
    print(f"   🔑 User ID: {identity['UserId']}")
    
except Exception as e:
    print(f"❌ Error verifying AWS credentials: {e}")
    print("\n💡 Please ensure:")
    print(f"   1. AWS CLI is configured with the profile '{AWS_PROFILE}'")
    print("   2. Your credentials are valid and not expired")
    print(f"   3. Run: aws configure --profile {AWS_PROFILE}")
    raise

In [None]:
# Step 2.3: Initialize SageMaker session and get execution role

# Create SageMaker session with the boto session
sagemaker_session = sagemaker.Session(boto_session=boto_session)
region = sagemaker_session.boto_region_name

print(f"🌍 AWS Region: {region}")

def get_sagemaker_role(boto_session, account_id):
    """
    Get or find a SageMaker execution role from IAM.
    Works in both local notebooks and SageMaker notebooks.
    """
    # First check environment variable
    role = os.environ.get('SAGEMAKER_ROLE')
    if role:
        print(f"✅ Using role from environment: {role}")
        return role
    
    # Try to find SageMaker roles in IAM
    iam_client = boto_session.client('iam')
    
    print("🔍 Searching for SageMaker execution roles...")
    paginator = iam_client.get_paginator('list_roles')
    sagemaker_roles = []
    
    for page in paginator.paginate():
        for r in page['Roles']:
            role_name = r['RoleName'].lower()
            if 'sagemaker' in role_name or 'sage-maker' in role_name:
                sagemaker_roles.append(r['Arn'])
    
    if sagemaker_roles:
        print(f"\n📋 Found {len(sagemaker_roles)} SageMaker role(s):")
        for i, r in enumerate(sagemaker_roles, 1):
            print(f"   {i}. {r}")
        role = sagemaker_roles[0]  # Use the first one
        print(f"\n✅ Using role: {role}")
        return role
    else:
        # Create a default role ARN pattern
        role = f"arn:aws:iam::{account_id}:role/service-role/AmazonSageMaker-ExecutionRole"
        print(f"\n⚠️ No SageMaker roles found. Using default pattern: {role}")
        print("   Please verify this role exists in your AWS account.")
        return role

# Get the role
role = get_sagemaker_role(boto_session, identity['Account'])

print(f"\n🎯 Final execution role: {role}")

In [None]:
# Step 2.4: Configure S3 bucket

# Use default SageMaker bucket or create custom one
try:
    bucket = sagemaker_session.default_bucket()
    print(f"✅ Using S3 bucket: {bucket}")
except Exception as e:
    # Create a custom bucket name
    account_id = identity['Account']
    bucket = f"sagemaker-{region}-{account_id}"
    print(f"📦 Using S3 bucket: {bucket}")

# Verify bucket exists or create it
s3_client = boto_session.client('s3')
try:
    s3_client.head_bucket(Bucket=bucket)
    print(f"✅ Bucket verified: s3://{bucket}")
except Exception:  # a bare except would also catch KeyboardInterrupt/SystemExit
    print(f"⚠️ Bucket doesn't exist. Creating: s3://{bucket}")
    try:
        if region == 'us-east-1':
            s3_client.create_bucket(Bucket=bucket)
        else:
            s3_client.create_bucket(
                Bucket=bucket,
                CreateBucketConfiguration={'LocationConstraint': region}
            )
        print(f"✅ Bucket created successfully!")
    except Exception as e:
        print(f"❌ Error creating bucket: {e}")
        raise  # later steps cannot upload the model without a bucket
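
A failed `head_bucket` call does not always mean the bucket is missing: it can also mean the bucket exists but the caller has no access to it. The sketch below classifies the error payload so that only a genuinely missing bucket triggers creation. It assumes the error codes `404` and `403` that botocore's `ClientError` reports for `head_bucket`; `classify_head_bucket_error` is an illustrative name:

```python
def classify_head_bucket_error(error_response):
    """Map the error payload of a failed head_bucket call to 'missing', 'forbidden' or 'other'."""
    code = str(error_response.get("Error", {}).get("Code", ""))
    if code in ("404", "NoSuchBucket"):
        return "missing"
    if code in ("403", "AccessDenied"):
        return "forbidden"
    return "other"


# Intended use inside the notebook (needs boto3/botocore and real credentials):
#   from botocore.exceptions import ClientError
#   try:
#       s3_client.head_bucket(Bucket=bucket)
#   except ClientError as e:
#       if classify_head_bucket_error(e.response) == "missing":
#           ...create the bucket...
#       else:
#           raise
print(classify_head_bucket_error({"Error": {"Code": "404"}}))
```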

---

## 3. Model Configuration

Configure the Hugging Face model to deploy.

### Model Details:
- **Model ID**: `openai/gpt-oss-20b`
- **Parameters**: 20 Billion
- **Type**: Large Language Model (LLM)
- **Framework**: PyTorch

### 💰 Cost Optimization
Using **ml.g5.12xlarge** (~$7/hour) instead of **ml.p4d.24xlarge** (~$32/hour) cuts the hourly cost by roughly 78%.
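
The savings figure follows directly from the two hourly rates quoted in this notebook. A short sketch of the arithmetic, using those quoted rates as inputs (actual pricing varies by region and over time); `hourly_savings` is an illustrative helper:

```python
def hourly_savings(selected_rate, baseline_rate):
    """Return (dollars saved per hour, percent saved) relative to the baseline rate."""
    saved = baseline_rate - selected_rate
    return saved, 100.0 * saved / baseline_rate


G5_12XLARGE = 7.09    # $/hour, as quoted in this notebook
P4D_24XLARGE = 32.77  # $/hour, as quoted in this notebook

saved, percent = hourly_savings(G5_12XLARGE, P4D_24XLARGE)
print(f"Saves ${saved:.2f}/hour ({percent:.1f}%)")
print(f"30 days always-on at g5.12xlarge: ${G5_12XLARGE * 24 * 30:,.2f}")
```

The second line is a reminder that an endpoint left running for a month costs far more than the hourly figure suggests.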

In [None]:
# Step 3.1: Configure model parameters

# ============================================
# 📌 MODEL CONFIGURATION - EDIT HERE
# ============================================

MODEL_ID = "openai/gpt-oss-20b"  # Hugging Face model ID (20B parameters)

# Instance configuration - optimized for 20B model
INSTANCE_TYPE = "ml.g5.12xlarge"  # 4x A10G GPUs, 96GB GPU memory (~$7.09/hour)

# ============================================
# 💰 COST COMPARISON (On-Demand pricing, us-east-1)
# ============================================
#
# For 7B models:
#   - ml.g5.2xlarge   - 1x A10G, 24GB  (~$1.52/hour)
#   - ml.g5.4xlarge   - 1x A10G, 24GB  (~$2.03/hour)
#
# For 13B-20B models:
#   - ml.g5.12xlarge  - 4x A10G, 96GB  (~$7.09/hour) ✅ SELECTED
#   - ml.g5.24xlarge  - 4x A10G, 96GB  (~$10.18/hour)
#   - ml.p3.8xlarge   - 4x V100, 64GB  (~$14.69/hour)
#
# For 30B-70B models:
#   - ml.g5.48xlarge  - 8x A10G, 192GB (~$20.36/hour)
#   - ml.p3.16xlarge  - 8x V100, 128GB (~$28.15/hour)
#
# For 70B-120B+ models:
#   - ml.p4d.24xlarge - 8x A100, 320GB (~$32.77/hour)
#
# ============================================

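# Framework versions (these select the Hugging Face inference container image).
# NOTE: gpt-oss support was added in transformers 4.55.0, so confirm that the
# container chosen here, together with requirements.txt, ends up with a new
# enough transformers release to load the model.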
TRANSFORMERS_VERSION = "4.36.0"
PYTORCH_VERSION = "2.1.0"
PY_VERSION = "py310"

# Endpoint configuration
INITIAL_INSTANCE_COUNT = 1
ENDPOINT_NAME_PREFIX = "openai-gpt-oss-20b"

print("📋 Model Configuration:")
print(f"   🤖 Model ID: {MODEL_ID}")
print(f"   💻 Instance Type: {INSTANCE_TYPE}")
print(f"   💰 Estimated Cost: ~$7.09/hour")
print(f"   🔢 Instance Count: {INITIAL_INSTANCE_COUNT}")
print(f"   📦 Transformers: {TRANSFORMERS_VERSION}")
print(f"   🔥 PyTorch: {PYTORCH_VERSION}")
print(f"   🐍 Python: {PY_VERSION}")
print()
print("💡 Cost Savings: Using g5.12xlarge instead of p4d.24xlarge saves ~78%!")

In [None]:
# Step 3.2: (Optional) Login to Hugging Face Hub
# This is required for gated models or private repositories

# Option 1: Set token as environment variable
HF_TOKEN = os.environ.get('HUGGINGFACE_TOKEN') or os.environ.get('HF_TOKEN') 

if HF_TOKEN:
    login(token=HF_TOKEN)
    print("✅ Logged in to Hugging Face Hub")
else:
    print("⚠️ No Hugging Face token found.")
    print("   If the model requires authentication, set the HUGGINGFACE_TOKEN or HF_TOKEN environment variable.")
    print("   Or run: huggingface-cli login")

---

## 4. Download & Prepare Model

Download the model from Hugging Face Hub and prepare it for deployment.

> **⚠️ Note**: For 20B parameter models, this step requires ~50GB disk space and may take 10-20 minutes.
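
Because the download needs tens of gigabytes, a quick free-space check before starting can save a failed run. A minimal sketch using the standard library; the 50 GB threshold comes from the note above, and `has_free_space` is an illustrative helper:

```python
import shutil
import tempfile


def has_free_space(path, required_gb):
    """Return True if the filesystem holding `path` has at least `required_gb` GiB free."""
    free_gb = shutil.disk_usage(path).free / (1024 ** 3)
    return free_gb >= required_gb


# tempfile.mkdtemp() creates the notebook's working directory under this location
target = tempfile.gettempdir()
if has_free_space(target, 50):
    print(f"✅ Enough free space under {target}")
else:
    print(f"⚠️ Less than 50 GB free under {target}; the download may fail")
```

The archive created in step 6 is written to the same working directory, so budgeting extra headroom beyond the download size is reasonable.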

In [None]:
# Step 4.1: Create working directories

# Create temporary directories for model files
WORK_DIR = tempfile.mkdtemp(prefix="sagemaker_model_")
MODEL_DIR = os.path.join(WORK_DIR, "model")
CODE_DIR = os.path.join(WORK_DIR, "code")

os.makedirs(MODEL_DIR, exist_ok=True)
os.makedirs(CODE_DIR, exist_ok=True)

print(f"📁 Working directory: {WORK_DIR}")
print(f"📁 Model directory: {MODEL_DIR}")
print(f"📁 Code directory: {CODE_DIR}")

In [None]:
# Step 4.2: Download model from Hugging Face Hub

print(f"⬇️ Downloading model: {MODEL_ID}")
print("   This may take 10-20 minutes for 20B models...")
print()

try:
    start_time = time.time()
    
    # Download model snapshot.
    # Recent huggingface_hub releases write real files into local_dir and resume
    # interrupted downloads by default, so the deprecated local_dir_use_symlinks
    # and resume_download flags are not passed.
    downloaded_path = snapshot_download(
        repo_id=MODEL_ID,
        local_dir=MODEL_DIR,
        token=HF_TOKEN or None,
    )
    
    elapsed_time = time.time() - start_time
    
    print(f"\n✅ Model downloaded successfully!")
    print(f"   📁 Location: {downloaded_path}")
    print(f"   ⏱️ Time: {elapsed_time:.2f} seconds ({elapsed_time/60:.1f} minutes)")
    
    # Show model files (first 15 top-level entries)
    model_files = sorted(os.listdir(MODEL_DIR))
    print(f"\n📄 Model files ({len(model_files)}):")
    for f in model_files[:15]:
        file_path = os.path.join(MODEL_DIR, f)
        if os.path.isfile(file_path):
            size = os.path.getsize(file_path) / (1024 * 1024)  # MB
            print(f"   - {f} ({size:.2f} MB)")
    if len(model_files) > 15:
        print(f"   ... and {len(model_files) - 15} more files")
    # Walk the whole tree so files beyond the first 15 and in subfolders are counted
    total_bytes = sum(
        os.path.getsize(os.path.join(root, name))
        for root, _, names in os.walk(MODEL_DIR)
        for name in names
    )
    print(f"\n   📊 Total size: {total_bytes / (1024**3):.2f} GB")

except Exception as e:
    print(f"❌ Error downloading model: {e}")
    print("\n💡 Troubleshooting:")
    print("   1. Check if the model ID is correct")
    print("   2. For gated models, ensure you have accepted the license")
    print("   3. For private models, ensure your token has access")
    raise

---

## 5. Create Inference Script

Create the custom inference script that SageMaker will use to serve predictions.

### The script includes:
- `model_fn()`: Load the model
- `input_fn()`: Parse incoming requests
- `predict_fn()`: Generate predictions
- `output_fn()`: Format the response
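
The four handlers form a simple pipeline. The sketch below assumes the call order described for the SageMaker Hugging Face inference toolkit: `model_fn()` runs once at container startup, and the other three run for every request. It uses a stand-in "model" (a plain function) so the flow can be traced without a GPU; `handle_request` and `echo_model` are illustrative names:

```python
import json


def input_fn(request_body, content_type):
    """Parse the raw request into a dict."""
    if content_type == "application/json":
        return json.loads(request_body)
    return {"inputs": request_body}


def predict_fn(data, model):
    """Run the model on one prompt or a list of prompts."""
    prompts = data["inputs"]
    if isinstance(prompts, str):
        prompts = [prompts]
    return {"generated_text": [model(p) for p in prompts]}


def output_fn(prediction, accept):
    """Serialize the prediction for the response."""
    return json.dumps(prediction)


def handle_request(body, model, content_type="application/json", accept="application/json"):
    """Chain the three per-request handlers in order."""
    return output_fn(predict_fn(input_fn(body, content_type), model), accept)


def echo_model(prompt):
    """Stand-in for what model_fn() would return; a real deployment loads the LLM instead."""
    return prompt.upper()


print(handle_request(json.dumps({"inputs": "hello"}), echo_model))
```

Tracing a request through a stub like this is a cheap way to catch payload-shape mistakes before paying for an endpoint.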

In [None]:
# Step 5.1: Create the inference script

INFERENCE_SCRIPT = '''#!/usr/bin/env python3
"""
SageMaker Inference Script for Hugging Face LLM
Model: openai/gpt-oss-20b
"""

import os
import json
import torch
import traceback
from transformers import (
    AutoTokenizer,
    AutoModelForCausalLM,
    GenerationConfig,
    BitsAndBytesConfig
)

# Global variables for model and tokenizer
model = None
tokenizer = None


def model_fn(model_dir):
    """
    Load the model for inference.
    This function is called once when the container starts.
    
    Args:
        model_dir: Path to the model artifacts
    
    Returns:
        dict: Dictionary containing model and tokenizer
    """
    global model, tokenizer
    
    try:
        print(f"Loading model from: {model_dir}")
        print(f"Available files: {os.listdir(model_dir)}")
        
        # Check available GPU memory
        if torch.cuda.is_available():
            gpu_count = torch.cuda.device_count()
            print(f"Available GPUs: {gpu_count}")
            for i in range(gpu_count):
                gpu_mem = torch.cuda.get_device_properties(i).total_memory / (1024**3)
                print(f"  GPU {i}: {torch.cuda.get_device_name(i)} - {gpu_mem:.1f} GB")
        
        # Load tokenizer
        print("Loading tokenizer...")
        tokenizer = AutoTokenizer.from_pretrained(
            model_dir,
            padding_side="left",
            trust_remote_code=True
        )
        
        # Set pad token if not set
        if tokenizer.pad_token is None:
            tokenizer.pad_token = tokenizer.eos_token
        
        print(f"Tokenizer loaded. Vocab size: {len(tokenizer)}")
        
        # Load model with automatic device mapping
        print("Loading model... (this may take a few minutes)")
        model = AutoModelForCausalLM.from_pretrained(
            model_dir,
            torch_dtype=torch.float16,  # Use FP16 for efficiency
            device_map="auto",  # Automatically distribute across GPUs
            trust_remote_code=True,
        )
        
        # Set model to evaluation mode
        model.eval()
        
        print("✅ Model loaded successfully!")
        print(f"   Model type: {type(model).__name__}")
        
        return {"model": model, "tokenizer": tokenizer}
        
    except Exception as e:
        print(f"❌ Error loading model: {e}")
        print(traceback.format_exc())
        raise


def input_fn(request_body, request_content_type):
    """
    Parse input data from the request.
    
    Args:
        request_body: The request payload
        request_content_type: The content type of the request
    
    Returns:
        dict: Parsed input data
    """
    try:
        if request_content_type == "application/json":
            input_data = json.loads(request_body)
            return input_data
        else:
            # Assume text/plain
            return {"inputs": request_body.decode("utf-8") if isinstance(request_body, bytes) else request_body}
            
    except Exception as e:
        print(f"❌ Error parsing input: {e}")
        return {"inputs": str(request_body)}


def predict_fn(input_data, model_artifacts):
    """
    Generate predictions from the model.
    
    Args:
        input_data: Parsed input from input_fn
        model_artifacts: Model and tokenizer from model_fn
    
    Returns:
        dict: Model predictions
    """
    try:
        model = model_artifacts["model"]
        tokenizer = model_artifacts["tokenizer"]
        
        # Extract inputs and parameters
        inputs = input_data.get("inputs", input_data.get("prompt", ""))
        
        # Generation parameters with defaults
        max_new_tokens = input_data.get("max_new_tokens", 256)
        temperature = input_data.get("temperature", 0.7)
        top_p = input_data.get("top_p", 0.9)
        top_k = input_data.get("top_k", 50)
        do_sample = input_data.get("do_sample", True)
        repetition_penalty = input_data.get("repetition_penalty", 1.1)
        
        # Handle string or list inputs
        if isinstance(inputs, str):
            inputs = [inputs]
        
        # Tokenize inputs
        encoded = tokenizer(
            inputs,
            return_tensors="pt",
            padding=True,
            truncation=True,
            max_length=4096  # Adjust based on model's context length
        )
        
        # Move to GPU if available
        if torch.cuda.is_available():
            encoded = {k: v.cuda() for k, v in encoded.items()}
        
        # Generate response
        with torch.no_grad():
            outputs = model.generate(
                **encoded,
                max_new_tokens=max_new_tokens,
                temperature=temperature,
                top_p=top_p,
                top_k=top_k,
                do_sample=do_sample,
                repetition_penalty=repetition_penalty,
                pad_token_id=tokenizer.pad_token_id,
                eos_token_id=tokenizer.eos_token_id,
            )
        
        # Decode outputs
        generated_texts = tokenizer.batch_decode(
            outputs,
            skip_special_tokens=True,
            clean_up_tokenization_spaces=True
        )
        
        return {
            "generated_text": generated_texts,
            "input_tokens": encoded["input_ids"].shape[1],
            "output_tokens": outputs.shape[1] - encoded["input_ids"].shape[1]
        }
        
    except Exception as e:
        print(f"❌ Error during prediction: {e}")
        print(traceback.format_exc())
        return {"error": str(e)}


def output_fn(prediction, response_content_type):
    """
    Format the prediction output.
    
    Args:
        prediction: Prediction from predict_fn
        response_content_type: Expected response content type
    
    Returns:
        str: Formatted response
    """
    try:
        if response_content_type == "application/json":
            return json.dumps(prediction, ensure_ascii=False, indent=2)
        else:
            # Return as plain text
            if isinstance(prediction.get("generated_text"), list):
                return "\\n\\n".join(prediction["generated_text"])
            return str(prediction)
            
    except Exception as e:
        print(f"❌ Error formatting output: {e}")
        return json.dumps({"error": str(e)})
'''

# Save the inference script
inference_script_path = os.path.join(CODE_DIR, "inference.py")
with open(inference_script_path, "w") as f:
    f.write(INFERENCE_SCRIPT)

print(f"✅ Inference script created: {inference_script_path}")

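# Also copy into the model directory under code/, which is where the Hugging Face
# inference toolkit looks for a custom inference.py inside model.tar.gz
MODEL_CODE_DIR = os.path.join(MODEL_DIR, "code")
os.makedirs(MODEL_CODE_DIR, exist_ok=True)
shutil.copy(inference_script_path, os.path.join(MODEL_CODE_DIR, "inference.py"))
print("✅ Inference script copied to model directory (code/inference.py)")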

In [None]:
# Step 5.2: Create requirements.txt for the inference container

REQUIREMENTS = '''transformers>=4.36.0
torch>=2.1.0
accelerate>=0.25.0
bitsandbytes>=0.41.0
sentencepiece
protobuf
'''

requirements_path = os.path.join(CODE_DIR, "requirements.txt")
with open(requirements_path, "w") as f:
    f.write(REQUIREMENTS)

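# Also copy into the model directory under code/, next to the inference script
os.makedirs(os.path.join(MODEL_DIR, "code"), exist_ok=True)
shutil.copy(requirements_path, os.path.join(MODEL_DIR, "code", "requirements.txt"))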

print(f"✅ Requirements file created: {requirements_path}")

---

## 6. Package & Upload to S3

Package the model and code into a tar.gz file and upload to S3.

> **⚠️ Note**: This step may take 5-10 minutes for 20B models.
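
SageMaker unpacks `model.tar.gz` into the container's model directory, so files should sit at the archive root rather than inside a wrapping folder. The sketch below builds a tiny stand-in archive and lists its members to check the layout before committing to a multi-gigabyte upload. It assumes custom code lives under `code/`; `make_archive` and `archive_members` are illustrative helpers:

```python
import os
import tarfile
import tempfile


def make_archive(model_dir, tar_path):
    """Pack the contents of model_dir at the archive root (no wrapping folder)."""
    with tarfile.open(tar_path, "w:gz") as tar:
        for item in sorted(os.listdir(model_dir)):
            tar.add(os.path.join(model_dir, item), arcname=item)
    return tar_path


def archive_members(tar_path):
    """Return the sorted member names of a gzip-compressed tar archive."""
    with tarfile.open(tar_path, "r:gz") as tar:
        return sorted(tar.getnames())


# Tiny stand-in for MODEL_DIR so the layout can be inspected without a real model
with tempfile.TemporaryDirectory() as work:
    model_dir = os.path.join(work, "model")
    os.makedirs(os.path.join(model_dir, "code"))
    for rel in ("config.json", os.path.join("code", "inference.py")):
        with open(os.path.join(model_dir, rel), "w") as f:
            f.write("{}")
    tar_path = make_archive(model_dir, os.path.join(work, "model.tar.gz"))
    print(archive_members(tar_path))
```

The same `archive_members` call can be pointed at the real archive after step 6.1 to confirm that `config.json` and the weight files are at the top level.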

In [None]:
# Step 6.1: Create model archive (tar.gz)

MODEL_TAR_PATH = os.path.join(WORK_DIR, "model.tar.gz")

print("📦 Creating model archive...")
print("   This may take 5-10 minutes for 20B models...")

start_time = time.time()

try:
    with tarfile.open(MODEL_TAR_PATH, "w:gz") as tar:
        # Add all files from model directory
        for item in os.listdir(MODEL_DIR):
            item_path = os.path.join(MODEL_DIR, item)
            tar.add(item_path, arcname=item)
            print(f"   Added: {item}")
    
    elapsed_time = time.time() - start_time
    file_size = os.path.getsize(MODEL_TAR_PATH) / (1024**3)  # GB
    
    print(f"\n✅ Model archive created!")
    print(f"   📁 Path: {MODEL_TAR_PATH}")
    print(f"   📊 Size: {file_size:.2f} GB")
    print(f"   ⏱️ Time: {elapsed_time:.2f} seconds")
    
except Exception as e:
    print(f"❌ Error creating archive: {e}")
    raise

In [None]:
# Step 6.2: Upload model to S3

# S3 path for model artifacts
MODEL_S3_PREFIX = f"sagemaker-models/{MODEL_ID.replace('/', '-')}/{datetime.datetime.now().strftime('%Y%m%d-%H%M%S')}"
MODEL_S3_URI = f"s3://{bucket}/{MODEL_S3_PREFIX}/model.tar.gz"

print(f"📤 Uploading model to S3...")
print(f"   🎯 Destination: {MODEL_S3_URI}")
print(f"   This may take 5-10 minutes...")
print()

start_time = time.time()

try:
    # Use SageMaker session for upload (handles multipart automatically)
    uploaded_path = sagemaker_session.upload_data(
        path=MODEL_TAR_PATH,
        bucket=bucket,
        key_prefix=MODEL_S3_PREFIX
    )
    
    elapsed_time = time.time() - start_time
    
    print(f"\n✅ Model uploaded successfully!")
    print(f"   📁 S3 URI: {uploaded_path}")
    print(f"   ⏱️ Time: {elapsed_time:.2f} seconds ({elapsed_time/60:.1f} minutes)")
    
    MODEL_S3_URI = uploaded_path
    
except Exception as e:
    print(f"❌ Error uploading to S3: {e}")
    raise

---

## 7. Deploy to SageMaker

Create the SageMaker model and deploy it to an endpoint.

> **⏱️ Note**: Endpoint creation typically takes 5-10 minutes for 20B models.
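
Endpoint names are validated by SageMaker, and a rejected name only surfaces when the deploy call runs. The sketch below builds the timestamped name up front and checks it against the limits documented for `CreateEndpoint` as I understand them (at most 63 characters, alphanumerics and hyphens, no leading or trailing hyphen); these limits are worth re-checking against the current API reference. `make_endpoint_name` is an illustrative helper:

```python
import datetime
import re

_ENDPOINT_NAME_RE = re.compile(r"^[a-zA-Z0-9](-*[a-zA-Z0-9])*$")
MAX_ENDPOINT_NAME_LEN = 63


def make_endpoint_name(prefix, now=None):
    """Build '<prefix>-<YYYYmmdd-HHMMSS>' and validate it before deployment."""
    now = now or datetime.datetime.now()
    # Replace anything that is not alphanumeric or a hyphen (e.g. '/' or '.' in a model ID)
    safe_prefix = re.sub(r"[^a-zA-Z0-9-]+", "-", prefix).strip("-")
    name = f"{safe_prefix}-{now.strftime('%Y%m%d-%H%M%S')}"
    if len(name) > MAX_ENDPOINT_NAME_LEN or not _ENDPOINT_NAME_RE.match(name):
        raise ValueError(f"Invalid endpoint name: {name!r}")
    return name


print(make_endpoint_name("openai/gpt-oss-20b"))
```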

In [None]:
# Step 7.1: Create HuggingFace Model object

print("🏗️ Creating SageMaker HuggingFace Model...")

try:
    # Create the HuggingFace Model
    huggingface_model = HuggingFaceModel(
        model_data=MODEL_S3_URI,
        role=role,
        transformers_version=TRANSFORMERS_VERSION,
        pytorch_version=PYTORCH_VERSION,
        py_version=PY_VERSION,
        entry_point="inference.py",
        source_dir=CODE_DIR,  # entry_point is resolved relative to this directory
        env={
            "HF_MODEL_ID": MODEL_ID,
            "HF_TASK": "text-generation",
            "SAGEMAKER_CONTAINER_LOG_LEVEL": "20",  # INFO level
            "SAGEMAKER_REGION": region,
        },
        sagemaker_session=sagemaker_session
    )
    
    print("✅ HuggingFace Model created successfully!")
    print(f"   📦 Model data: {huggingface_model.model_data}")
    print(f"   🔐 Role: {huggingface_model.role}")
    
except Exception as e:
    print(f"❌ Error creating HuggingFace model: {e}")
    raise

In [None]:
# Step 7.2: Deploy the model to SageMaker endpoint

# Generate endpoint name with timestamp
timestamp = datetime.datetime.now().strftime("%Y%m%d-%H%M%S")
ENDPOINT_NAME = f"{ENDPOINT_NAME_PREFIX}-{timestamp}"

print(f"🚀 Deploying model to SageMaker endpoint...")
print(f"   📛 Endpoint name: {ENDPOINT_NAME}")
print(f"   💻 Instance type: {INSTANCE_TYPE}")
print(f"   💰 Cost: ~$7.09/hour")
print(f"   🔢 Instance count: {INITIAL_INSTANCE_COUNT}")
print()
print("   ⏳ This typically takes 5-10 minutes...")
print()

start_time = time.time()

try:
    # Deploy the model
    predictor = huggingface_model.deploy(
        initial_instance_count=INITIAL_INSTANCE_COUNT,
        instance_type=INSTANCE_TYPE,
        endpoint_name=ENDPOINT_NAME,
        container_startup_health_check_timeout=600,  # 10 minutes for model loading
        model_data_download_timeout=1200,  # 20 minutes for model download
    )
    
    elapsed_time = time.time() - start_time
    
    print(f"\n✅ Model deployed successfully!")
    print(f"   📛 Endpoint name: {predictor.endpoint_name}")
    print(f"   ⏱️ Deployment time: {elapsed_time:.2f} seconds ({elapsed_time/60:.1f} minutes)")
    
except Exception as e:
    print(f"❌ Error deploying model: {e}")
    print(traceback.format_exc())
    raise

---

## 8. Verify Deployment

Test the deployed endpoint with sample requests.
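
The endpoint can also be called without the `predictor` object, for example from another process, through boto3's `sagemaker-runtime` client and its `invoke_endpoint` call. The sketch below takes the client as a parameter so that it can be exercised offline; `invoke` and `FakeRuntime` are illustrative names, and the fake reply only imitates the response shape of this notebook's inference script:

```python
import io
import json


def invoke(runtime_client, endpoint_name, prompt, **params):
    """Send one JSON request to the endpoint and return the decoded JSON response."""
    payload = {"inputs": prompt, **params}
    response = runtime_client.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType="application/json",
        Accept="application/json",
        Body=json.dumps(payload),
    )
    return json.loads(response["Body"].read())


class FakeRuntime:
    """Offline stand-in for the 'sagemaker-runtime' client, used for a dry run."""

    def invoke_endpoint(self, EndpointName, ContentType, Accept, Body):
        request = json.loads(Body)
        reply = {"generated_text": [f"echo: {request['inputs']}"]}
        return {"Body": io.BytesIO(json.dumps(reply).encode("utf-8"))}


print(invoke(FakeRuntime(), "my-endpoint", "Hello", max_new_tokens=16))
# With real credentials: invoke(boto_session.client("sagemaker-runtime"), ENDPOINT_NAME, "Hello")
```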

In [None]:
# Step 8.1: Check endpoint status

sagemaker_client = boto_session.client('sagemaker')

try:
    response = sagemaker_client.describe_endpoint(EndpointName=ENDPOINT_NAME)
    
    print("📊 Endpoint Status:")
    print(f"   📛 Name: {response['EndpointName']}")
    print(f"   🔄 Status: {response['EndpointStatus']}")
    print(f"   📅 Created: {response['CreationTime']}")
    print(f"   🔄 Last Modified: {response['LastModifiedTime']}")
    
    if response['EndpointStatus'] == 'InService':
        print("\n✅ Endpoint is ready for inference!")
    else:
        print(f"\n⚠️ Endpoint is not ready yet. Current status: {response['EndpointStatus']}")
        
except Exception as e:
    print(f"❌ Error checking endpoint status: {e}")

In [None]:
# Step 8.2: Test inference with a sample request

print("🧪 Testing inference...")

test_prompts = [
    "Hello, how are you today?",
    "Explain quantum computing in simple terms.",
    "Write a short poem about artificial intelligence."
]

for i, prompt in enumerate(test_prompts, 1):
    print(f"\n{'='*60}")
    print(f"Test {i}:")
    print(f"📝 Prompt: {prompt}")
    print()
    
    try:
        start_time = time.time()
        
        response = predictor.predict({
            "inputs": prompt,
            "max_new_tokens": 100,
            "temperature": 0.7,
            "do_sample": True
        })
        
        elapsed_time = time.time() - start_time
        
        print(f"💬 Response:")
        if isinstance(response, dict):
            if "generated_text" in response:
                for text in response["generated_text"]:
                    print(f"   {text}")
            else:
                print(f"   {response}")
        else:
            print(f"   {response}")
        
        print(f"\n⏱️ Response time: {elapsed_time:.2f} seconds")
        
    except Exception as e:
        print(f"❌ Error during inference: {e}")

print(f"\n{'='*60}")
print("\n✅ Inference testing completed!")

---

## 9. Save Endpoint Configuration

Save the endpoint configuration for use in the inference notebook.
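
On the consuming side, the saved file can be read back and checked before it is used. A minimal sketch, assuming the inference notebook needs at least the endpoint name, region, and AWS profile; `REQUIRED_KEYS` and `load_endpoint_config` are illustrative, and the demo values are made up:

```python
import json
import os
import tempfile

REQUIRED_KEYS = ("endpoint_name", "region", "aws_profile")


def load_endpoint_config(path="endpoint_config.json"):
    """Load the saved endpoint configuration and fail early if a required key is missing."""
    with open(path) as f:
        config = json.load(f)
    missing = [key for key in REQUIRED_KEYS if key not in config]
    if missing:
        raise KeyError(f"endpoint config is missing: {', '.join(missing)}")
    return config


# Round-trip demo with a throwaway file
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "endpoint_config.json")
    with open(demo_path, "w") as f:
        json.dump({"endpoint_name": "demo-endpoint", "region": "us-east-1",
                   "aws_profile": "default"}, f)
    print(load_endpoint_config(demo_path)["endpoint_name"])
```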

In [None]:
# Step 9.1: Save endpoint configuration to file

endpoint_config = {
    "endpoint_name": ENDPOINT_NAME,
    "model_id": MODEL_ID,
    "instance_type": INSTANCE_TYPE,
    "instance_count": INITIAL_INSTANCE_COUNT,
    "estimated_hourly_cost": "$7.09",
    "region": region,
    "bucket": bucket,
    "model_s3_uri": MODEL_S3_URI,
    "aws_profile": AWS_PROFILE,
    "created_at": datetime.datetime.now().isoformat(),
    "transformers_version": TRANSFORMERS_VERSION,
    "pytorch_version": PYTORCH_VERSION,
}

# Save to JSON file
config_file = "endpoint_config.json"
with open(config_file, "w") as f:
    json.dump(endpoint_config, f, indent=2, default=str)

print(f"✅ Endpoint configuration saved to: {config_file}")
print()
print("📋 Configuration:")
print(json.dumps(endpoint_config, indent=2, default=str))

In [None]:
# Step 9.2: Cleanup temporary files

print("🧹 Cleaning up temporary files...")

try:
    shutil.rmtree(WORK_DIR, ignore_errors=True)
    print(f"   ✅ Removed: {WORK_DIR}")
except Exception as e:
    print(f"   ⚠️ Could not remove temp directory: {e}")

print("\n✅ Cleanup completed!")

---

## 🎉 Deployment Complete!

### Summary

Your model has been successfully deployed to AWS SageMaker!

### Endpoint Information

In [None]:
# Display final summary

print("="*60)
print("🎉 DEPLOYMENT SUMMARY")
print("="*60)
print()
print(f"📛 Endpoint Name: {ENDPOINT_NAME}")
print(f"🤖 Model: {MODEL_ID}")
print(f"💻 Instance Type: {INSTANCE_TYPE}")
print(f"💰 Estimated Cost: ~$7.09/hour")
print(f"🌍 Region: {region}")
print(f"📦 Model S3 URI: {MODEL_S3_URI}")
print()
print("="*60)
print("💰 COST SAVINGS")
print("="*60)
print()
print("Using ml.g5.12xlarge instead of ml.p4d.24xlarge:")
print(f"   • g5.12xlarge: ~$7.09/hour")
print(f"   • p4d.24xlarge: ~$32.77/hour")
print("   • Savings: ~$25.68/hour (~78%)")
print()
print("="*60)
print("📌 NEXT STEPS")
print("="*60)
print()
print("1. Use the '02_inference_openai_gpt_oss_20b.ipynb' notebook for inference")
print(f"2. Endpoint name to use: {ENDPOINT_NAME}")
print("3. Remember to delete the endpoint when not in use to avoid costs:")
print(f"   predictor.delete_endpoint()")
print()
print("="*60)

---

## ⚠️ Cost Warning & Cleanup

**Important**: SageMaker endpoints incur costs as long as they are running (~$7.09/hour for this configuration).

To delete the endpoint when you're done:

In [None]:
# ⚠️ UNCOMMENT THE LINE BELOW TO DELETE THE ENDPOINT
# Only run this when you're done with the endpoint!

# predictor.delete_endpoint()
# print(f"✅ Endpoint '{ENDPOINT_NAME}' deleted successfully!")
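
`predictor.delete_endpoint()` only works while the `predictor` object is still in memory. If the kernel has restarted, the same cleanup can be sketched against boto3's `sagemaker` client using just the endpoint name. The version below also removes the model resource, which is deleted separately from the endpoint. `teardown_endpoint` and `RecordingClient` are illustrative names; the recording client lets the sequence be dry-run without touching AWS:

```python
def teardown_endpoint(sm_client, endpoint_name):
    """Delete an endpoint plus the endpoint config and model(s) behind it."""
    endpoint = sm_client.describe_endpoint(EndpointName=endpoint_name)
    config_name = endpoint["EndpointConfigName"]
    config = sm_client.describe_endpoint_config(EndpointConfigName=config_name)
    model_names = [v["ModelName"] for v in config["ProductionVariants"] if "ModelName" in v]

    sm_client.delete_endpoint(EndpointName=endpoint_name)
    sm_client.delete_endpoint_config(EndpointConfigName=config_name)
    for model_name in model_names:
        sm_client.delete_model(ModelName=model_name)
    return {"endpoint": endpoint_name, "config": config_name, "models": model_names}


class RecordingClient:
    """Offline stand-in for the 'sagemaker' client that records calls instead of deleting."""

    def __init__(self):
        self.calls = []

    def describe_endpoint(self, EndpointName):
        return {"EndpointConfigName": f"{EndpointName}-config"}

    def describe_endpoint_config(self, EndpointConfigName):
        return {"ProductionVariants": [{"ModelName": "demo-model"}]}

    def delete_endpoint(self, EndpointName):
        self.calls.append(("delete_endpoint", EndpointName))

    def delete_endpoint_config(self, EndpointConfigName):
        self.calls.append(("delete_endpoint_config", EndpointConfigName))

    def delete_model(self, ModelName):
        self.calls.append(("delete_model", ModelName))


dry_run = RecordingClient()
print(teardown_endpoint(dry_run, "demo-endpoint"))
# With real credentials: teardown_endpoint(boto_session.client("sagemaker"), ENDPOINT_NAME)
```

The model archive uploaded to S3 is not touched by any of these calls and continues to incur storage charges until it is removed.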