# Secure Model Import: Using Protect AI Guardian to Bring DeepSeek-R1-Distill-Llama Models to Amazon Bedrock

This notebook demonstrates how to securely import DeepSeek's distilled Llama models to Amazon Bedrock using Custom Model Import (CMI). We'll use the 8B parameter model as an example, <u>but the same process applies to the 70B variant</u>. Throughout this process, we'll implement security best practices with Protect AI Guardian to ensure our model is free from vulnerabilities before deployment.

## Introduction

As you integrate Generative AI into enterprise workflows, you unlock tremendous innovation potential but also face significant security challenges. Open source models, in particular, may contain hidden vulnerabilities including data leakage risks and susceptibility to adversarial attacks. To scale AI both rapidly and securely, it's critical to assess and mitigate these threats before deployment.

DeepSeek has released several distilled versions of their models based on Llama architecture. These models maintain strong performance while being more efficient, but like any third-party model, should be properly scanned for security issues. The 8B model we'll use here is derived from Llama 3.1 and has been **optimized for reasoning tasks**.

## Protect AI Guardian

Guardian secures your machine learning ecosystem through two integrated components that seamlessly connect to your existing MLOps pipelines:

- **Guardian Gateway:** This proxy service creates a secure perimeter around third-party model access by intercepting requests to sources like Hugging Face Hub.
  
- **Guardian Scanner:** Deployed within your infrastructure, Scanner provides a dedicated API endpoint for validating your internal first-party models as they move through development or undergo customization.

In this notebook, we'll demonstrate how Guardian Gateway and Guardian Scanner seamlessly integrate into your existing ML pipeline to establish a secure, repeatable workflow that protects both third-party and internally developed models throughout their lifecycle.

## Prerequisites

- An AWS account with access to Amazon Bedrock
- Appropriate IAM roles and permissions for Bedrock and Amazon S3 (follow [the instructions here](https://docs.aws.amazon.com/bedrock/latest/userguide/model-import-iam-role.html))
- An S3 bucket prepared to store the custom model
- Sufficient local storage space (at least 17 GB for the 8B model and 135 GB for the 70B model)
- Access to Guardian (reach out to the [Protect AI team](https://protectai.com/contact-sales) to get it)

### Step 1: Install Required Packages

First, let's install the necessary Python packages:

In [None]:
!pip install transformers
!pip install boto3 --upgrade
!pip install -U "huggingface_hub[hf_transfer]"

### Step 2: Configure Parameters

Update these parameters according to your AWS environment:

In [None]:
# Define your parameters (please update this part based on your setup)
bucket_name = "<YOUR-PREDEFINED-S3-BUCKET-TO-HOST-IMPORT-MODEL>"
s3_prefix = "<S3-PREFIX>" # e.g. DeepSeek-R1-Distill-Llama-8B
local_directory = "<LOCAL-FOLDER-TO-STORE-DOWNLOADED-MODEL>" # e.g. DeepSeek-R1-Distill-Llama-8B

job_name = '<CMI-JOB-NAME>' # e.g. Deepseek-8B-job
imported_model_name = '<CMI-MODEL-NAME>' # e.g. Deepseek-8B-model
role_arn = '<IAM-ROLE-ARN>' # Make sure this role has the permissions listed in the Prerequisites section

# Region (currently only 'us-west-2' and 'us-east-1' support CMI with Deepseek-Distilled-Llama models)
region_info = 'us-west-2' # You can modify to 'us-east-1' based on your need
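
Before running the remaining cells, it can help to confirm that none of the template placeholders above were left in place. The following is a minimal sketch; the helper name `find_unset_placeholders` and the example values are hypothetical and not part of any AWS or Protect AI library.

```python
import re

# Matches values that still look like "<SOME-PLACEHOLDER>"
_PLACEHOLDER = re.compile(r"^<[^<>]+>$")


def find_unset_placeholders(params):
    """Return the sorted names of parameters whose values are still placeholders."""
    return sorted(
        name
        for name, value in params.items()
        if isinstance(value, str) and _PLACEHOLDER.match(value.strip())
    )


# Hypothetical example values, for illustration only
example_params = {
    "bucket_name": "my-model-bucket",
    "s3_prefix": "<S3-PREFIX>",  # deliberately left unset
    "region_info": "us-west-2",
}
unset = find_unset_placeholders(example_params)
print(unset)
```

In the notebook itself, you could pass a dictionary built from `bucket_name`, `s3_prefix`, `local_directory`, `job_name`, `imported_model_name`, and `role_arn`, and stop early if the returned list is not empty.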

In [None]:
import os

# Enable hf_transfer for faster downloads
os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"

# Guardian Gateway endpoint provided by your Protect AI representative
os.environ["HF_ENDPOINT"]="<YOUR_GUARDIAN_GATEWAY_ENDPOINT>"

### Step 3: Download Model from Hugging Face

Download the model files from Hugging Face.

- To use the 70B model instead, change `hf_model_id` to "deepseek-ai/DeepSeek-R1-Distill-Llama-70B".

<div class="alert alert-warning">
<b>Note:</b> Downloading the 8B model files may take 2-10 minutes depending on your internet connection speed.
</div>

In [None]:
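# Delete any cached Hugging Face files so the model is re-downloaded through the gateway.
# -f avoids an error if the cache directory does not exist.
# Note: this also removes a saved Hugging Face login token, if one is stored in the cache.
!rm -rf ~/.cache/huggingface/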

In [None]:
# Confirm the endpoint is set to the Guardian Gateway
print(os.environ["HF_ENDPOINT"])
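
Printing the endpoint shows its value but does not check it. The sketch below is one hedged way to catch two common mistakes: a placeholder that was never replaced, and an endpoint that still points at the public Hub. The function name `looks_like_gateway` and the example URLs are hypothetical. As far as we know, `huggingface_hub` reads `HF_ENDPOINT` when it is first imported, so the variable should be set before that import happens.

```python
from urllib.parse import urlparse


def looks_like_gateway(endpoint):
    """Heuristic check: an http(s) URL whose host is not the public Hugging Face Hub."""
    parsed = urlparse(endpoint)
    if parsed.scheme not in ("http", "https") or not parsed.hostname:
        return False  # Not a usable URL (for example, an unreplaced placeholder)
    return parsed.hostname != "huggingface.co"


# Hypothetical values, for illustration only
for candidate in (
    "<YOUR_GUARDIAN_GATEWAY_ENDPOINT>",
    "https://huggingface.co",
    "https://gateway.example.com",
):
    print(candidate, "->", looks_like_gateway(candidate))
```

This is only a sanity check on the string. It cannot confirm that the host is actually a Guardian Gateway.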

In [None]:
from huggingface_hub import snapshot_download

hf_model_id = "deepseek-ai/DeepSeek-R1-Distill-Llama-8B"

# Download using snapshot_download with hf_transfer enabled
snapshot_download(repo_id=hf_model_id, local_dir=f"./{local_directory}")

Upon successful download, navigate to the Guardian dashboard to verify that the model was scanned and review the results.
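
As a complement to the dashboard review, a quick local inventory of the download can confirm its size and show whether any pickle-style files are present. This is a minimal sketch: `summarize_model_dir` is a hypothetical helper, and the extension list is an assumption rather than Guardian's detection logic. The demo runs against a temporary directory; in the notebook you would pass `local_directory`.

```python
import tempfile
from pathlib import Path

# Extensions often associated with pickle-based serialization.
# This list is an assumption for illustration and is not exhaustive.
PICKLE_LIKE = {".pkl", ".pickle", ".bin", ".pt", ".pth"}


def summarize_model_dir(directory):
    """Count files, total their size, and list files with pickle-like extensions."""
    root = Path(directory)
    files = [p for p in root.rglob("*") if p.is_file()]
    flagged = sorted(
        p.relative_to(root).as_posix()
        for p in files
        if p.suffix.lower() in PICKLE_LIKE
    )
    return {
        "file_count": len(files),
        "total_bytes": sum(p.stat().st_size for p in files),
        "pickle_like": flagged,
    }


# Demo on a throwaway directory
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "config.json").write_text("{}")
    (Path(tmp) / "extra_data.pkl").write_bytes(b"\x80\x04N.")
    summary = summarize_model_dir(tmp)

print(summary)
```

A file extension is a weak signal. The check only highlights files worth a closer look and does not replace a scan.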

### Step 4: Upload Model to S3

Upload the scanned model files to your S3 bucket.

<div class="alert alert-warning">
<b>Note:</b> Uploading the 8B model files normally takes 10-20 minutes.
</div>

In [None]:
import os
import time
import json
import boto3
from pathlib import Path
from tqdm import tqdm

def upload_directory_to_s3(local_directory, bucket_name, s3_prefix):
    s3_client = boto3.client('s3')
    local_directory = Path(local_directory)
    
    # Get list of all files first
    all_files = []
    for root, dirs, files in os.walk(local_directory):
        for filename in files:
            local_path = Path(root) / filename
            relative_path = local_path.relative_to(local_directory)
            # Use forward slashes in the key regardless of the local operating system
            s3_key = f"{s3_prefix}/{relative_path.as_posix()}"
            all_files.append((local_path, s3_key))
    
    # Upload with progress bar
    for local_path, s3_key in tqdm(all_files, desc="Uploading files"):
        try:
            s3_client.upload_file(
                str(local_path),
                bucket_name,
                s3_key
            )
        except Exception as e:
            print(f"Error uploading {local_path}: {str(e)}")


# Upload all files
upload_directory_to_s3(local_directory, bucket_name, s3_prefix)
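
The mapping from local paths to S3 keys can be previewed without touching AWS. The sketch below builds the same `prefix/relative/path` keys in a pure function, which makes it easy to test. The helper name `build_s3_keys` is hypothetical. Skipping a `.cache` folder is an assumption: recent `huggingface_hub` versions may write download metadata there, and you may not want it in the bucket.

```python
import tempfile
from pathlib import Path, PurePosixPath


def build_s3_keys(local_directory, s3_prefix, skip_dirs=(".cache",)):
    """Return (local_path, s3_key) pairs for every file under local_directory."""
    root = Path(local_directory)
    pairs = []
    for path in sorted(root.rglob("*")):
        if not path.is_file():
            continue
        relative = path.relative_to(root)
        # Skip files that live under an excluded directory
        if any(part in skip_dirs for part in relative.parts[:-1]):
            continue
        key = (PurePosixPath(s3_prefix.strip("/")) / relative.as_posix()).as_posix()
        pairs.append((path, key))
    return pairs


# Demo on a throwaway directory
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "config.json").write_text("{}")
    (root / "sub").mkdir()
    (root / "sub" / "weights.safetensors").write_bytes(b"\x00")
    (root / ".cache" / "huggingface").mkdir(parents=True)
    (root / ".cache" / "huggingface" / "meta").write_text("x")
    keys = [key for _, key in build_s3_keys(root, "models/demo/")]

print(keys)
```

Each pair could then be passed to `s3_client.upload_file(str(local_path), bucket_name, s3_key)`, as in the upload function above.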


### Step 5: Simulate a Malicious Attack
To simulate an attack, we'll add a "vulnerability" to the model: a file called `extra_data.pkl` that we upload to the model's S3 prefix. This file is not part of the original model, and Guardian Scanner should flag it. Pickle files are risky because loading one can execute arbitrary code.

In [None]:
import pickle

class MaliciousPayload:
    def __reduce__(self):
        # Harmless demo payload (prints a message)
        return (os.system, ('echo "Security vulnerability demonstration"',))

# Create malicious pickle
with open('./extra_data.pkl', 'wb') as f:
    pickle.dump(MaliciousPayload(), f)

# Upload the malicious pickle
s3_client = boto3.client('s3')
s3_client.upload_file(
    './extra_data.pkl',
    bucket_name,
    f"{s3_prefix}/extra_data.pkl"
)
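
To see why a file like this is dangerous, you can inspect a pickle's opcodes without loading it. The sketch below uses the standard library's `pickletools` to list the callables a pickle would import on load. It illustrates the general idea of static inspection only; it is not Guardian's scanning method. The helper `referenced_globals` is hypothetical, and its handling of `STACK_GLOBAL` is a simplification that ignores memoized strings.

```python
import os
import pickle
import pickletools


class DemoPayload:
    def __reduce__(self):
        # Same pattern as the demo above: call os.system when the pickle is loaded
        return (os.system, ('echo "demo"',))


def referenced_globals(data):
    """List (module, name) pairs a pickle imports, without unpickling it."""
    found = []
    recent_strings = []
    for opcode, arg, _pos in pickletools.genops(data):
        if opcode.name == "GLOBAL":
            module, name = arg.split(" ", 1)
            found.append((module, name))
        elif opcode.name in ("UNICODE", "BINUNICODE", "SHORT_BINUNICODE", "BINUNICODE8"):
            recent_strings.append(arg)
        elif opcode.name == "STACK_GLOBAL" and len(recent_strings) >= 2:
            # Simplification: assume the two most recent strings are module and name
            found.append((recent_strings[-2], recent_strings[-1]))
    return found


# Serializing does not run the payload; only loading would
payload_bytes = pickle.dumps(DemoPayload())
globals_found = referenced_globals(payload_bytes)
print(globals_found)
```

The module is reported under its platform name (for example `posix` on Linux or `nt` on Windows), which is one reason simple deny-lists are easy to get wrong.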

### Step 6: Use Guardian Scanner to Detect the new "Vulnerability"
Load the Guardian Scanner environment variables to begin scanning the model stored in S3 for vulnerabilities.

In [None]:
# Install guardian-client and python-dotenv (used below to load the scanner settings) from PyPI
%pip install guardian-client==1.2.2 python-dotenv

In [None]:
from dotenv import load_dotenv

# Please reach out to your Protect AI representative to retrieve the proper environment variables
load_dotenv("./.env.local", override=True)  # Path is relative to current notebook path

In [None]:
# Set the endpoint of the Guardian Scanner's API
scanner_endpoint = os.environ["GUARDIAN_SCANNER_ENDPOINT"]

# Set the model URI path
model_uri = (
    f"s3://{bucket_name}/{s3_prefix}"
)

In [None]:
# Import the Guardian API Client
from guardian_client import GuardianAPIClient

# Initiate the client using the GUARDIAN_SCANNER_ENDPOINT that we set above
guardian = GuardianAPIClient(base_url=scanner_endpoint)

# Scan the model
response = guardian.scan(model_uri=model_uri)

In [None]:
# Retrieve the pass/fail decision from Guardian
assert response.get("http_status_code") == 200
assert response.get("scan_status_json") is not None
assert response.get("scan_status_json").get("aggregate_eval_outcome") != "ERROR"

if response.get("scan_status_json").get("aggregate_eval_outcome") == "FAIL":
    print(
        f"Model {model_uri} was blocked because it failed your organization's security policies"
    )

In [None]:
import pprint

pprint.pprint(response)
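
Because the same checks are repeated after every scan, they can be wrapped in a small helper. This sketch relies only on the three response keys used in the cells above (`http_status_code`, `scan_status_json`, `aggregate_eval_outcome`). The function name `scan_outcome` and the sample dictionaries are hypothetical and are not real Guardian output.

```python
def scan_outcome(response):
    """Return the aggregate outcome, raising if the scan itself did not succeed."""
    status_code = response.get("http_status_code")
    if status_code != 200:
        raise RuntimeError(f"Unexpected HTTP status: {status_code}")
    status = response.get("scan_status_json")
    if status is None:
        raise RuntimeError("Scan response has no scan_status_json")
    outcome = status.get("aggregate_eval_outcome")
    if outcome == "ERROR":
        raise RuntimeError("The scanner reported an error while scanning")
    return outcome


# Hand-written sample responses, for illustration only
sample_fail = {"http_status_code": 200, "scan_status_json": {"aggregate_eval_outcome": "FAIL"}}
sample_pass = {"http_status_code": 200, "scan_status_json": {"aggregate_eval_outcome": "PASS"}}

print(scan_outcome(sample_fail))
print(scan_outcome(sample_pass))
```

Raising an exception rather than using `assert` keeps the checks active even when Python runs with assertions disabled.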

### Step 7: Take Action on the Detected Vulnerability and Re-Scan
After scanning the model, you can view the results in the Guardian dashboard. If a vulnerability is detected, you can take action to mitigate the risk. 

**We will simply remove the extra_data.pkl file in this example, but in a real-world scenario, you would address the underlying issue based on your organization's security policies.**

In [None]:
# Remove the malicious pickle
s3_client.delete_object(Bucket=bucket_name, Key=f"{s3_prefix}/extra_data.pkl")
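
Before re-scanning, you may want to confirm that the object is really gone. The sketch below checks for an exact key with `list_objects_v2`. The helper `object_exists` is hypothetical, and `FakeS3Client` is a stand-in used only so the example runs without AWS credentials; in the notebook you would pass the real `s3_client`.

```python
def object_exists(s3_client, bucket, key):
    """Return True if an object with exactly this key exists in the bucket."""
    response = s3_client.list_objects_v2(Bucket=bucket, Prefix=key)
    return any(obj["Key"] == key for obj in response.get("Contents", []))


class FakeS3Client:
    """Minimal stand-in that mimics the parts of list_objects_v2 used above."""

    def __init__(self, keys):
        self._keys = sorted(keys)

    def list_objects_v2(self, Bucket, Prefix):
        matches = [{"Key": k} for k in self._keys if k.startswith(Prefix)]
        return {"Contents": matches} if matches else {}


fake = FakeS3Client(["demo/config.json", "demo/extra_data.pkl.bak"])
print(object_exists(fake, "any-bucket", "demo/extra_data.pkl"))
print(object_exists(fake, "any-bucket", "demo/config.json"))
```

With the real client, the call would look like `object_exists(s3_client, bucket_name, f"{s3_prefix}/extra_data.pkl")`.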

Re-scan the model to ensure the vulnerability has been resolved.

In [None]:
# Scan the model after deleting the malicious file. 
response = guardian.scan(model_uri=model_uri)

# Retrieve the pass/fail decision from Guardian
assert response.get("http_status_code") == 200
assert response.get("scan_status_json") is not None
assert response.get("scan_status_json").get("aggregate_eval_outcome") != "ERROR"

if response.get("scan_status_json").get("aggregate_eval_outcome") == "FAIL":
    print(
        f"Model {model_uri} was blocked because it failed your organization's security policies"
    )

pprint.pprint(response)

### Step 8: Create Custom Model Import Job

Initialize the import job in Amazon Bedrock.

<div class="alert alert-warning">
<b>Note:</b> The CMI job for the 8B model can take 5-20 minutes to complete.
</div>

In [None]:
# Initialize the Bedrock client
bedrock = boto3.client('bedrock', region_name=region_info)

s3_uri = f's3://{bucket_name}/{s3_prefix}/'
print(role_arn)

# Create the model import job
response = bedrock.create_model_import_job(
    jobName=job_name,
    importedModelName=imported_model_name,
    roleArn=role_arn,
    modelDataSource={
        's3DataSource': {
            's3Uri': s3_uri
        }
    }
)

job_Arn = response['jobArn']

# Output the job ARN
print(f"Model import job created with ARN: {response['jobArn']}")


### Step 9: Monitor Import Job Status

Check the status of your import job.

In [None]:
# Check CMI job status
while True:
    response = bedrock.get_model_import_job(jobIdentifier=job_Arn)
    status = response['status'].upper()
    print(f"Status: {status}")
    
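    if status == 'FAILED':
        # Stop here rather than continuing with a model that was never imported
        raise RuntimeError(f"Model import job failed: {response.get('failureMessage', 'no failure message returned')}")
    if status == 'COMPLETED':
        break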
        
    time.sleep(60)  # Check every 60 seconds

# Get the model ID
model_id = response['importedModelArn']
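
The polling loop above runs until the job reaches a terminal state, however long that takes. A hedged variant with an overall timeout is sketched below. `wait_for_status` is a hypothetical helper, and the demo uses a stub status source instead of Bedrock so it runs anywhere.

```python
import time


def wait_for_status(fetch_status, terminal=("COMPLETED", "FAILED"),
                    interval=60, timeout=3600, sleep=time.sleep):
    """Poll fetch_status() until it returns a terminal status or the timeout expires."""
    deadline = time.monotonic() + timeout
    while True:
        status = fetch_status().upper()
        print(f"Status: {status}")
        if status in terminal:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"No terminal status after {timeout} seconds")
        sleep(interval)


# Stub status source, for illustration only
statuses = iter(["InProgress", "InProgress", "Completed"])
final_status = wait_for_status(lambda: next(statuses), interval=0)
```

With Bedrock, the status source could be `lambda: bedrock.get_model_import_job(jobIdentifier=job_Arn)['status']`, using the client and job ARN defined earlier.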

### Step 10: Wait for Model Initialization

Allow time for the model to initialize:

In [None]:
# Wait 5 minutes for the cold start
time.sleep(300)

### Step 11: Model Inference with Proper Tokenization

#### Understanding the Tokenization Process
When working with DeepSeek models, proper tokenization is crucial for optimal performance. The model expects inputs to follow a specific chat format defined in its `tokenizer_config.json`. Using this format ensures the model receives prompts in the same structure it was trained on.

#### Key Components
1. **Tokenizer**: Uses Hugging Face's `AutoTokenizer` to format inputs with the model's chat template
2. **Generation Function**: Handles the core interaction with the model
3. **Auto-Generate Function**: Manages longer responses that might exceed token limits

#### 11.1 Setting Up the Tokenizer
First, we'll initialize the tokenizer and Bedrock runtime client:

In [None]:
from transformers import AutoTokenizer
import json
import boto3
from botocore.config import Config
from IPython.display import Markdown, display

# Initialize the tokenizer
tokenizer = AutoTokenizer.from_pretrained(hf_model_id)

# Initialize Bedrock Runtime client
session = boto3.Session()
client = session.client(
    service_name='bedrock-runtime',
    region_name=region_info,
    config=Config(
        connect_timeout=300,  # 5 minutes
        read_timeout=300,     # 5 minutes
        retries={'max_attempts': 3}
    )
)

#### 11.2 Core Generation Function

This function handles the basic model interaction with proper tokenization:

In [None]:
def generate(messages, temperature=0.3, max_tokens=4096, top_p=0.9, continuation=False, max_retries=10):
    """
    Generate response using the model with proper tokenization and retry mechanism
    
    Parameters:
        messages (list): List of message dictionaries with 'role' and 'content'
        temperature (float): Controls randomness in generation (0.0-1.0)
        max_tokens (int): Maximum number of tokens to generate
        top_p (float): Nucleus sampling parameter (0.0-1.0)
        continuation (bool): Whether this is a continuation of previous generation
        max_retries (int): Maximum number of retry attempts
    
    Returns:
        dict: Model response containing generated text and metadata
    """
    prompt = tokenizer.apply_chat_template(messages, tokenize=False, 
                                         add_generation_prompt=not continuation)
    
    attempt = 0
    while attempt < max_retries:
        try:
            response = client.invoke_model(
                modelId=model_id,
                body=json.dumps({
                    'prompt': prompt,
                    'temperature': temperature,
                    'max_gen_len': max_tokens,
                    'top_p': top_p
                }),
                accept='application/json',
                contentType='application/json'
            )
            
            result = json.loads(response['body'].read().decode('utf-8'))
            return result
            
        except Exception as e:
            print(f"Attempt {attempt + 1} failed: {str(e)}")
            attempt += 1
            if attempt < max_retries:
                time.sleep(30)
    
    raise Exception("Failed to get response after maximum retries")

#### 11.3 Extended Generation Function

The model's thinking process can become quite long, especially for complex reasoning problems that require step-by-step analysis. It often exceeds the `max_tokens` limit set for a single call. To address this limitation:

1. We first attempt to generate a complete response
2. If the response is truncated (indicated by `stop_reason == "length"`), we:
   - Concatenate the partial response to the original prompt
   - Make another API call with `continuation=True`
   - This sets `add_generation_prompt=False` in the tokenizer call
3. This process continues until we get a complete response

This approach ensures we capture the model's complete reasoning process while maintaining coherence throughout the response.
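
The control flow described above can be sketched independently of Bedrock. In this minimal example, `generate_until_complete` and `make_stub` are hypothetical names, and the stub stands in for the model by returning pre-defined chunks. The `max_rounds` cap is an added assumption to prevent an endless loop and is not part of the steps listed above.

```python
def generate_until_complete(generate_fn, prompt, max_rounds=5):
    """Call generate_fn repeatedly, appending partial output, until it is not truncated."""
    text = ""
    for _ in range(max_rounds):
        result = generate_fn(prompt + text)
        text += result["generation"]
        if result["stop_reason"] != "length":
            return text
    raise RuntimeError("Response still truncated after max_rounds continuations")


def make_stub(chunks):
    """Build a fake generate function that returns one chunk per call."""
    remaining = list(chunks)

    def stub(_prompt):
        chunk = remaining.pop(0)
        # Report truncation while more chunks are left to emit
        return {"generation": chunk, "stop_reason": "length" if remaining else "stop"}

    return stub


stub = make_stub(["<think>step 1 ", "step 2</think>", " 42"])
full_text = generate_until_complete(stub, "Q: what is 6 x 7?")
print(full_text)
```

Each round sends the original prompt plus everything generated so far, which mirrors the concatenation step in the list above.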


In [None]:
def auto_generate(messages, **kwargs):
    """
    Handle longer responses that exceed token limit
    
    Parameters:
        messages (list): List of message dictionaries
        **kwargs: Additional parameters for generate function
    
    Returns:
        dict: Enhanced response including thinking process and final answer
    """
    res = generate(messages, **kwargs)
    while res["stop_reason"] == "length":
        for v in messages:
            if v.get("role") == "user":
                v["content"] += res["generation"]
        res = generate(messages, **kwargs, continuation=True)

    for v in messages:
        if v.get("role") == "user":
            gen = v["content"] + res["generation"]
            answer = gen.split("</think>")[-1]
            think = gen.split("</think>")[0].split("<think>")[-1]
            res = {**res, "generation": gen, "answer": answer, "think": think}
            return res
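
The last part of `auto_generate` separates the reasoning from the final answer using the `<think>` and `</think>` tags. The sketch below isolates that step in a hypothetical helper, `split_reasoning`, so it can be tested on plain strings. It differs from the code above when no closing tag is present: it returns an empty reasoning string and treats the whole text as the answer, which is an assumption about the desired behavior.

```python
def split_reasoning(text, open_tag="<think>", close_tag="</think>"):
    """Split generated text into (reasoning, answer) around the think tags."""
    if close_tag not in text:
        # No completed reasoning block: treat everything as the answer
        return "", text.strip()
    before_close, _, answer = text.partition(close_tag)
    # Drop anything before the opening tag, such as the echoed prompt
    reasoning = before_close.split(open_tag)[-1]
    return reasoning.strip(), answer.strip()


reasoning, answer = split_reasoning("What is 2 + 2?<think>2 + 2 = 4</think> The answer is 4.")
print("Reasoning:", reasoning)
print("Answer:", answer)
```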

### Usage Examples
#### Basic Usage

In [None]:
test_prompt = """Given the following financial data:
- Company A's revenue grew from $10M to $15M in 2023
- Operating costs increased by 20%
- Initial operating costs were $7M

Calculate the company's operating margin for 2023. Please reason step by step.
"""

messages = [{"role": "user", "content": test_prompt}]
response = generate(messages)
print("Model Response:")
print(response["generation"])
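
To judge the model's answer, it helps to have the expected result at hand. The sketch below computes it directly from the numbers in the prompt, assuming operating margin is defined as (revenue - operating costs) / revenue.

```python
# Figures taken from the test prompt, in millions of dollars
revenue_2023 = 15.0
initial_operating_costs = 7.0
cost_increase = 0.20

operating_costs = initial_operating_costs * (1 + cost_increase)
operating_income = revenue_2023 - operating_costs
operating_margin = operating_income / revenue_2023

print(f"Operating costs:  ${operating_costs:.1f}M")
print(f"Operating income: ${operating_income:.1f}M")
print(f"Operating margin: {operating_margin:.1%}")
```

Under that definition, operating costs are $8.4M, operating income is $6.6M, and the operating margin is 44%.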

#### Advanced Usage with Complex Prompt

In [None]:
complex_prompt = """Solve the following optimization problem:

A manufacturing company produces two types of products: A and B. 
They need to determine the optimal production quantities to maximize profit.

Given constraints:
1. Manufacturing capacity: 60 hours per week
2. Product A takes 4 hours to produce
3. Product B takes 3 hours to produce
4. Storage space can hold maximum 20 units total
5. Profit per unit:
   - Product A: $200
   - Product B: $150
6. Minimum required production:
   - At least 3 units of Product A
   - At least 2 units of Product B

Please:
1. Set up the linear programming equations
2. Solve step by step
3. Verify all constraints are met
4. Calculate maximum profit
5. Analyze sensitivity to changes in constraints
6. Recommend optimal production plan

Show all your work and reasoning at each step."""

# System prompt to encourage detailed mathematical reasoning
system_prompt = """You are a mathematical optimization expert. 
Please provide detailed step-by-step solutions showing:
- All equations and their development
- Each calculation step
- Verification of constraints
- Clear reasoning for each decision
- Visual representations where helpful"""

# Run the analysis with auto_generate
messages = [
    {"role": "system", "content": system_prompt},
    {"role": "user", "content": complex_prompt}
]

response = auto_generate(messages, temperature=0.7, max_tokens=4096, top_p=0.9)

# Display the response
print("\n=== Thinking Process ===")
display(Markdown(response["think"]))
print("\n=== Solution ===")
display(Markdown(response["answer"]))
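
The model's solution can be checked with a brute-force search over the constraints in the prompt. This sketch assumes production quantities are whole units, which the prompt does not state explicitly, so treat it as a sanity check rather than a full linear programming solution.

```python
def feasible(a, b):
    """Constraints from the prompt: hours, storage, and minimum production."""
    return 4 * a + 3 * b <= 60 and a + b <= 20 and a >= 3 and b >= 2


def profit(a, b):
    return 200 * a + 150 * b


# Storage limits each quantity to at most 20 units
plans = [(a, b) for a in range(21) for b in range(21) if feasible(a, b)]
best_profit = max(profit(a, b) for a, b in plans)
best_plans = [(a, b) for a, b in plans if profit(a, b) == best_profit]

print(f"Maximum profit: ${best_profit}")
print(f"Optimal integer plans (A, B): {best_plans}")
```

Profit equals 50 × (4A + 3B), so it is tied to the manufacturing-hours constraint, and every feasible plan that uses all 60 hours is optimal. That is why the search reports several plans with the same maximum of $3000.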

## Conclusion

This notebook demonstrates a secure end-to-end process for importing DeepSeek's distilled Llama models to Amazon Bedrock using Custom Model Import (CMI). By implementing Protect AI Guardian throughout our workflow, we've established a "zero-trust" security posture for our ML pipeline.

We began by using Guardian Gateway to safely download the model from Hugging Face, automatically scanning it for potential vulnerabilities during transfer. This critical first step ensures that third-party models don't introduce security risks to your environment from the outset.

As we moved the model through our internal infrastructure and made modifications, we followed zero-trust principles by using Guardian Scanner to continuously validate the model's security posture. This ongoing verification caught potential issues like deserialization vulnerabilities before they could reach production, demonstrating why continuous scanning is essential whenever models are moved or modified.

Only after confirming that our model passed all security checks did we create a CMI job to import it into Amazon Bedrock. This security-first approach should be considered best practice for any organization working with external AI models or developing models internally.

While we've used the DeepSeek-R1-Distill-Llama-8B model in this example, the same secure process applies to other variants including the 70B model. Regardless of the model size or source, always scan external models using Guardian Gateway and continuously validate models with Guardian Scanner throughout your ML lifecycle.

For more information about Custom Model Import and its features, refer to the [Amazon Bedrock documentation](https://docs.aws.amazon.com/bedrock/latest/userguide/model-customization-import-model.html).