# Fine-tune Llama SEA-LION v3.5 8B R with QLoRA and SageMaker remote decorator

This notebook has been modified from [https://github.com/aws-samples/amazon-sagemaker-llm-fine-tuning-remote-decorator/blob/main/llama/llama-3.1-8b-qlora-ddp-remote-decorator_qa.ipynb](https://github.com/aws-samples/amazon-sagemaker-llm-fine-tuning-remote-decorator/blob/main/llama/llama-3.1-8b-qlora-ddp-remote-decorator_qa.ipynb)

## Question & Answering

---

In this demo notebook, we demonstrate how to fine-tune Llama-SEA-LION-v3.5-8B-R using QLoRA, Hugging Face PEFT, and bitsandbytes.

We use the SageMaker remote decorator to run the fine-tuning job as an Amazon SageMaker Training job.

---

JupyterLab Instance Type: ml.t3.medium

Python version: 3.11

Fine-Tuning:
* Instance Type: ml.g5.12xlarge

Install the required libraries, including the Hugging Face libraries, and restart the kernel.

In [1]:
# %%writefile requirements.txt
# transformers==4.51.3
# peft==0.15.2
# accelerate==1.6.0
# bitsandbytes==0.45.5
# cloudpickle==3.1.1
# datasets==3.5.1
# evaluate==0.4.3
# huggingface_hub[hf_transfer]
# safetensors>=0.5.2
# sagemaker==2.244.0
# sentencepiece==0.2.0
# scikit-learn==1.6.1
# tokenizers>=0.21.1
# py7zr

In [2]:
%%writefile requirements.txt
transformers
peft
accelerate
bitsandbytes
cloudpickle
datasets
evaluate
huggingface_hub[hf_transfer]
safetensors
sagemaker
sentencepiece
scikit-learn
tokenizers
py7zr

Overwriting requirements.txt


In [3]:
%pip install -r requirements.txt --upgrade
%pip install -q -U python-dotenv

Note: you may need to restart the kernel to use updated packages.
Note: you may need to restart the kernel to use updated packages.



## Setup Configuration file path

We set the directory that contains the `config.yaml` file so that the remote decorator can pick up its settings through [SageMaker Defaults](https://sagemaker.readthedocs.io/en/stable/overview.html#configuring-and-using-defaults-with-the-sagemaker-python-sdk).

This notebook uses the PyTorch training container for the `us-east-1` region (see `ImageUri` in `config.yaml`). Make sure you use the right image for your AWS region; otherwise, edit [config.yaml](./config.yaml). Container images are listed [here](https://github.com/aws/deep-learning-containers/blob/master/available_images.md).


In [4]:
from dotenv import load_dotenv
import os

# Use .env in case of hidden variables
load_dotenv()

# Set path to config file
os.environ["SAGEMAKER_USER_CONFIG_OVERRIDE"] = os.getcwd()

## Visualize and upload the dataset

We load the [rajpurkar/squad](https://huggingface.co/datasets/rajpurkar/squad) dataset.

In [5]:
!pip install -U datasets



Note that this notebook may not work on earlier versions of the datasets library, such as version 3.5.1

In [7]:
import datasets
datasets.__version__

'4.0.0'

In [8]:
from datasets import load_dataset
import pandas as pd

dataset = load_dataset("rajpurkar/squad")

plain_text/train-00000-of-00001.parquet:   0%|          | 0.00/14.5M [00:00<?, ?B/s]

plain_text/validation-00000-of-00001.par(…):   0%|          | 0.00/1.82M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/87599 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/10570 [00:00<?, ? examples/s]

In [9]:
df = pd.DataFrame(dataset['train'])
df = df.iloc[0:1000]
df['answer'] = [answer['text'][0] for answer in df['answers']]
df = df[['context', 'question', 'answer']]

df.head()

| | context | question | answer |
|---|---|---|---|
| 0 | Architecturally, the school has a Catholic cha... | To whom did the Virgin Mary allegedly appear i... | Saint Bernadette Soubirous |
| 1 | Architecturally, the school has a Catholic cha... | What is in front of the Notre Dame Main Building? | a copper statue of Christ |
| 2 | Architecturally, the school has a Catholic cha... | The Basilica of the Sacred heart at Notre Dame... | the Main Building |
| 3 | Architecturally, the school has a Catholic cha... | What is the Grotto at Notre Dame? | a Marian place of prayer and reflection |
| 4 | Architecturally, the school has a Catholic cha... | What sits on top of the Main Building at Notre... | a golden statue of the Virgin Mary |


In [10]:
from sklearn.model_selection import train_test_split

train, test = train_test_split(df, test_size=0.1, random_state=42)

print("Number of train elements: ", len(train))
print("Number of test elements: ", len(test))

Number of train elements:  900
Number of test elements:  100


Create a prompt template for question answering. Each sample is formatted as a user turn (context plus question) followed by an assistant turn (the answer).

In [11]:
from random import randint

# custom instruct prompt start
prompt_template = f"""<|begin_of_text|><|start_header_id|>user<|end_header_id|>Context:{{context}}  {{question}}<|eot_id|><|start_header_id|>assistant<|end_header_id|>{{answer}}<|end_of_text|><|eot_id|>"""

# template dataset to add prompt to each sample
def template_dataset(sample):
    sample["text"] = prompt_template.format(context=sample["context"],
                                            question=sample["question"],
                                            answer=sample["answer"])
    return sample
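To see what the template produces, here is a minimal sketch that renders it for a single record. The sample values are made up for illustration and are not taken from SQuAD; the template string is the same one defined above, written as a plain string.

```python
# Same template as above, written as a plain string for str.format.
prompt_template = (
    "<|begin_of_text|><|start_header_id|>user<|end_header_id|>"
    "Context:{context}  {question}<|eot_id|>"
    "<|start_header_id|>assistant<|end_header_id|>"
    "{answer}<|end_of_text|><|eot_id|>"
)

# Hypothetical record with the same three fields as a row of the dataframe.
sample = {
    "context": "The Eiffel Tower is in Paris.",
    "question": "Where is the Eiffel Tower?",
    "answer": "Paris",
}

rendered = prompt_template.format(**sample)
print(rendered)
```

The answer sits inside the assistant turn, so during training the model learns to produce it after the assistant header.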

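Convert the pandas splits to Hugging Face `Dataset` objects and apply the prompt template to each sample. Later, the training function uses the Hugging Face `Trainer` class with the hyperparameters we pass in, together with a data collator that takes care of padding our inputs and labels.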

In [12]:
from datasets import Dataset, DatasetDict

train_dataset = Dataset.from_pandas(train)
test_dataset = Dataset.from_pandas(test)

dataset = DatasetDict({"train": train_dataset, "test": test_dataset})

train_dataset = dataset["train"].map(template_dataset, remove_columns=list(dataset["train"].features))

# Print one random training sample (randint is inclusive on both ends)
print(train_dataset[randint(0, len(train_dataset) - 1)]["text"])

test_dataset = dataset["test"].map(template_dataset, remove_columns=list(dataset["test"].features))

Map:   0%|          | 0/900 [00:00<?, ? examples/s]

<|begin_of_text|><|start_header_id|>user<|end_header_id|>Context:In 2001, she became the first African-American woman and second woman songwriter to win the Pop Songwriter of the Year award at the American Society of Composers, Authors, and Publishers Pop Music Awards. Beyoncé was the third woman to have writing credits on three number one songs ("Irreplaceable", "Grillz" and "Check on It") in the same year, after Carole King in 1971 and Mariah Carey in 1991. She is tied with American songwriter Diane Warren at third with nine songwriting credits on number-one singles. (The latter wrote her 9/11-motivated song "I Was Here" for 4.) In May 2011, Billboard magazine listed Beyoncé at number 17 on their list of the "Top 20 Hot 100 Songwriters", for having co-written eight singles that hit number one on the Billboard Hot 100 chart. She was one of only three women on that list.  Pop Songwriter of the Year award in 2001 was awarded to whom?<|eot_id|><|start_header_id|>assistant<|end_header_id|

Map:   0%|          | 0/100 [00:00<?, ? examples/s]

Utility function for initializing the distribution across multiple GPUs

In [13]:
import datetime
import torch

def init_distributed():
    # Initialize the process group
    torch.distributed.init_process_group(
        backend="nccl",  # Use "gloo" backend for CPU
        timeout=datetime.timedelta(seconds=5400),
    )
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    return local_rank
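`init_distributed` relies on `LOCAL_RANK`, one of the environment variables that `torchrun` sets for every worker process. As a rough sketch, the helper below (a hypothetical name, not part of the notebook) reads the three most commonly used variables and falls back to single-process defaults when they are missing.

```python
import os

def read_torchrun_env(env=None):
    """Return (rank, local_rank, world_size) from torchrun-style variables."""
    env = os.environ if env is None else env
    rank = int(env.get("RANK", 0))              # global index of this process
    local_rank = int(env.get("LOCAL_RANK", 0))  # index of this process on its node
    world_size = int(env.get("WORLD_SIZE", 1))  # total number of processes
    return rank, local_rank, world_size

# Hypothetical environment for the third of four processes on one node.
example_env = {"RANK": "2", "LOCAL_RANK": "2", "WORLD_SIZE": "4"}
print(read_torchrun_env(example_env))
```

On a single node, `LOCAL_RANK` doubles as the GPU index, which is why `init_distributed` passes it to `torch.cuda.set_device`.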

Utility function for model download

In [14]:
from huggingface_hub import snapshot_download
import os

def download_model(model_name):
    print("Downloading model ", model_name)

    os.makedirs("/tmp/tmp_folder", exist_ok=True)

    snapshot_download(repo_id=model_name, local_dir="/tmp/tmp_folder")

    print(f"Model {model_name} downloaded under /tmp/tmp_folder")

To train our model, we need to convert our inputs (text) to token IDs. This is done by a Hugging Face Transformers tokenizer. In addition to LoRA, we use bitsandbytes to quantize the frozen LLM to 4-bit precision and attach LoRA adapters to it.
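To get a feel for why LoRA is cheap to train, the sketch below counts the parameters a single adapter adds to one linear layer. The 4096 x 4096 layer shape is an assumption chosen as an example; the rank of 8 matches the default `lora_r` of the training function defined below.

```python
def lora_trainable_params(d_in: int, d_out: int, r: int) -> int:
    """Parameters added by one LoRA adapter: A is (r x d_in), B is (d_out x r)."""
    return r * d_in + d_out * r

# Assumption: a 4096 x 4096 projection, used here only as an example shape.
d_in, d_out, r = 4096, 4096, 8

full = d_in * d_out                           # frozen (quantized) weights
added = lora_trainable_params(d_in, d_out, r)  # trainable adapter weights

print(f"frozen weights: {full:,}")
print(f"LoRA weights:   {added:,} ({100 * added / full:.2f}% of the layer)")
```

Only the adapter weights receive gradients; the 4-bit base weights stay frozen.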

Define the train function

In [15]:
# model_id = "meta-llama/Meta-Llama-3.1-8B-Instruct"
model_id = "aisingapore/Llama-SEA-LION-v3.5-8B-R"

In [26]:
%%writefile config.yaml
SchemaVersion: "1.0"
SageMaker:
  PythonSDK:
    Modules:
      RemoteFunction:
        Dependencies: ./requirements.txt
        ImageUri: "763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-training:2.5.1-gpu-py311-cu124-ubuntu22.04-sagemaker"
        InstanceType: ml.g5.12xlarge
        IncludeLocalWorkDir: true
        PreExecutionCommands:
          - "export NCCL_P2P_DISABLE=1"
          - "export HF_HUB_ENABLE_HF_TRANSFER=1"
        CustomFileFilter:
          IgnoreNamePatterns:
            - "data/*"
            - "models/*"
            - "*.ipynb"
            - "*.csv"
            - "*.md"
            - "__pycache__"
        # RoleArn: RoleArn required if you are running the notebooks from a local IDE
  Model:
    EnableNetworkIsolation: false

Overwriting config.yaml
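The `ImageUri` in `config.yaml` embeds a region, so it is worth checking that it matches the region you run in. The following is a minimal sketch: `image_region` is a hypothetical helper, the regular expression only covers the standard `amazonaws.com` URI shape, and `session_region` is hard-coded here where a real notebook would read it from the active session.

```python
import re

# Standard ECR image URI: <account>.dkr.ecr.<region>.amazonaws.com/<repo>:<tag>
ECR_URI = re.compile(
    r"^(?P<account>\d{12})\.dkr\.ecr\.(?P<region>[a-z0-9-]+)\.amazonaws\.com/"
    r"(?P<repo>[^:]+):(?P<tag>.+)$"
)

def image_region(image_uri: str) -> str:
    """Extract the AWS region from an ECR image URI."""
    match = ECR_URI.match(image_uri)
    if match is None:
        raise ValueError(f"Not an ECR image URI: {image_uri}")
    return match.group("region")

image_uri = (
    "763104351884.dkr.ecr.us-east-1.amazonaws.com/"
    "pytorch-training:2.5.1-gpu-py311-cu124-ubuntu22.04-sagemaker"
)
session_region = "us-east-1"  # assumption: replace with your session's region

if image_region(image_uri) == session_region:
    print("Image region matches the session region")
else:
    print(f"Mismatch: image is in {image_region(image_uri)}, session is in {session_region}")
```

Changing only the region in the URI is not always enough, since some regions host the images under a different account; check the available images list linked earlier.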


In [27]:
from accelerate import Accelerator
import datetime
import os
from peft import AutoPeftModelForCausalLM, LoraConfig, get_peft_model, prepare_model_for_kbit_training
from sagemaker.remote_function import remote
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig, set_seed
import transformers

# Start training
@remote(
    keep_alive_period_in_seconds=0, #Warm-pool instance. Put 0 for avoiding additional costs
    volume_size=100,
    job_name_prefix=f"train-{model_id.split('/')[-1].replace('.', '-')}", 
    use_torchrun=True, # for distribution
)
def train_fn(
    model_name,             # Name or path of the base model to fine-tune
    train_ds,               # Training dataset
    test_ds=None,           # Optional test/validation dataset
    torch_dtype=torch.bfloat16,  # Precision type for training
    lora_r=8,               # LoRA rank - controls capacity of adaptations
    lora_alpha=16,          # LoRA alpha - scales the adaptations
    lora_dropout=0.1,       # Dropout probability for LoRA layers
    per_device_train_batch_size=8,  # Batch size for training
    per_device_eval_batch_size=8,   # Batch size for evaluation
    gradient_accumulation_steps=1,  # Number of steps to accumulate gradients
    learning_rate=2e-4,     # Learning rate for training
    num_train_epochs=1,     # Number of training epochs
    fsdp="",                # Fully Sharded Data Parallel configuration
    fsdp_config=None,       # Additional FSDP configurations
    gradient_checkpointing=False,  # Whether to use gradient checkpointing
    merge_weights=False,    # Whether to merge LoRA weights with base model
    seed=42,                # Random seed for reproducibility
    token=None              # HuggingFace token for model access
):
    # Initialize distributed training if multiple GPUs are available
    if torch.cuda.is_available() and (torch.cuda.device_count() > 1 or int(os.environ.get("SM_HOST_COUNT", 1)) > 1):
        # Call this function at the beginning of your script
        local_rank = init_distributed()

        # Now you can use distributed functionalities
        torch.distributed.barrier(device_ids=[local_rank])

    # Enable HuggingFace transfer for model downloading
    os.environ.update({"HF_HUB_ENABLE_HF_TRANSFER": "1"})

    set_seed(seed)

    accelerator = Accelerator()

    # Set up HuggingFace token if provided
    if token is not None:
        os.environ.update({"HF_TOKEN": token})
        accelerator.wait_for_everyone()

    # Download model based on training setup (single or multi-node)
    if int(os.environ.get("SM_HOST_COUNT", 1)) == 1:
        if accelerator.is_main_process:
            download_model(model_name)
    else:
        download_model(model_name)

    accelerator.wait_for_everyone()

    model_name = "/tmp/tmp_folder"

    tokenizer = AutoTokenizer.from_pretrained(model_name)

    # Set Tokenizer pad Token
    tokenizer.pad_token = tokenizer.eos_token

    with accelerator.main_process_first():
        # tokenize and chunk dataset
        lm_train_dataset = train_ds.map(
            lambda sample: tokenizer(sample["text"]), remove_columns=list(train_ds.features)
        )

        print(f"Total number of train samples: {len(lm_train_dataset)}")

        if test_ds is not None:
            lm_test_dataset = test_ds.map(
                lambda sample: tokenizer(sample["text"]), remove_columns=list(test_ds.features)
            )

            print(f"Total number of test samples: {len(lm_test_dataset)}")
        else:
            lm_test_dataset = None

    # Configure model settings for bfloat16 precision
    # Setup flash_attention_2 for memory-efficient attention computation
    if torch_dtype == torch.bfloat16:
        print("flash_attention_2 init")

        model_configs = {
            "attn_implementation": "flash_attention_2",
            "torch_dtype": torch_dtype,
        }
    else:
        model_configs = dict()

    # Configure training settings based on FSDP usage
    # Set up trainer configurations for FSDP or standard training
    if fsdp != "" and fsdp_config is not None:
        print("Configurations for FSDP")

        bnb_config_params = {
            "bnb_4bit_quant_storage": torch_dtype
        }

        trainer_configs = {
            "fsdp": fsdp,
            "fsdp_config": fsdp_config,
            "gradient_checkpointing_kwargs": {
                "use_reentrant": False
            }
        }
    else:
        bnb_config_params = dict()
        trainer_configs = {
            "gradient_checkpointing": gradient_checkpointing, # Enable in case of DDP
        }

    # Enable Quantization
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_use_double_quant=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch_dtype,
        **bnb_config_params
    )

    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        trust_remote_code=True,
        quantization_config=bnb_config,
        use_cache=not gradient_checkpointing,
        cache_dir="/tmp/.cache",
        **model_configs
    )

    # Configure gradient checkpointing based on FSDP usage
    if fsdp == "" and fsdp_config is None:
        print("Prepare model for quantization")
        model = prepare_model_for_kbit_training(model, use_gradient_checkpointing=gradient_checkpointing)

        if gradient_checkpointing:
            print("gradient_checkpointing enabled")
            model.gradient_checkpointing_enable()
    else:
        if gradient_checkpointing:
            print("gradient_checkpointing enabled")
            model.gradient_checkpointing_enable(gradient_checkpointing_kwargs={"use_reentrant": False})

    config = LoraConfig(
        r=lora_r,
        lora_alpha=lora_alpha,
        target_modules="all-linear",
        lora_dropout=lora_dropout,
        bias="none",
        task_type="CAUSAL_LM"
    )

    model = get_peft_model(model, config)

    trainer = transformers.Trainer(
        model=model,
        train_dataset=lm_train_dataset,
        eval_dataset=lm_test_dataset if lm_test_dataset is not None else None,
        args=transformers.TrainingArguments(
            per_device_train_batch_size=per_device_train_batch_size,
            per_device_eval_batch_size=per_device_eval_batch_size,
            gradient_accumulation_steps=gradient_accumulation_steps,
            logging_strategy="steps",
            logging_steps=1,
            log_on_each_node=False,
            num_train_epochs=num_train_epochs,
            learning_rate=learning_rate,
            bf16=(
                True if torch_dtype == torch.bfloat16 else False
            ),  # Enable mixed-precision training
            tf32=False,
            ddp_find_unused_parameters=False,
            save_strategy="no",
            output_dir="outputs",
            **trainer_configs
        ),
        data_collator=transformers.DataCollatorForLanguageModeling(tokenizer, mlm=False),
    )

    if trainer.accelerator.is_main_process:
        trainer.model.print_trainable_parameters()

    trainer.train()

    if trainer.is_fsdp_enabled:
        trainer.accelerator.state.fsdp_plugin.set_state_dict_type("FULL_STATE_DICT")

    if merge_weights:
        output_dir = "/tmp/model"

        # merge adapter weights with base model and save
        # save int 4 model
        trainer.model.save_pretrained(output_dir, safe_serialization=False)

        if accelerator.is_main_process:
            # clear memory
            del model
            del trainer

            torch.cuda.empty_cache()

            # load PEFT model
            model = AutoPeftModelForCausalLM.from_pretrained(
                output_dir,
                torch_dtype=torch.float16,
                low_cpu_mem_usage=True,
                trust_remote_code=True,
                use_cache=True,
                cache_dir="/tmp/.cache",
            )

            # Merge LoRA and base model and save
            model = model.merge_and_unload()
            model.save_pretrained(
                os.environ.get("SM_MODEL_DIR", "/opt/ml/model"),
                safe_serialization=True,
                max_shard_size="2GB"
            )
    else:
        trainer.model.save_pretrained(
            os.environ.get("SM_MODEL_DIR", "/opt/ml/model"),
            safe_serialization=True
        )

    if accelerator.is_main_process:
        tokenizer.save_pretrained(os.environ.get("SM_MODEL_DIR", "/opt/ml/model"))

    accelerator.wait_for_everyone()

INFO:sagemaker.telemetry.telemetry_logging:SageMaker Python SDK will collect telemetry to help us better understand our user's needs, diagnose issues, and deliver additional features.
To opt out of telemetry, please disable via TelemetryOptOut parameter in SDK defaults config. For more information, refer to https://sagemaker.readthedocs.io/en/stable/overview.html#configuring-and-using-defaults-with-the-sagemaker-python-sdk.


sagemaker.config INFO - Applied value from config key = SageMaker.PythonSDK.Modules.RemoteFunction.ImageUri
sagemaker.config INFO - Applied value from config key = SageMaker.PythonSDK.Modules.RemoteFunction.Dependencies
sagemaker.config INFO - Applied value from config key = SageMaker.PythonSDK.Modules.RemoteFunction.PreExecutionCommands
sagemaker.config INFO - Applied value from config key = SageMaker.PythonSDK.Modules.RemoteFunction.IncludeLocalWorkDir
sagemaker.config INFO - Applied value from config key = SageMaker.PythonSDK.Modules.RemoteFunction.CustomFileFilter.IgnoreNamePatterns
sagemaker.config INFO - Applied value from config key = SageMaker.PythonSDK.Modules.RemoteFunction.InstanceType
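Before launching the job, it helps to estimate the effective batch size and the number of optimizer steps. The sketch below uses the values from this notebook (900 training samples, `per_device_train_batch_size=2`, `gradient_accumulation_steps=2`, 3 epochs) and assumes the 4 GPUs of an ml.g5.12xlarge. The rounding mirrors how each GPU pads its last batch; treat the result as an estimate, since the exact count can depend on the `transformers` version.

```python
import math

def optimizer_steps(num_samples, num_gpus, per_device_batch, grad_accum, epochs):
    """Estimate (steps per epoch, total steps) for data-parallel training."""
    samples_per_gpu = math.ceil(num_samples / num_gpus)
    batches_per_gpu = math.ceil(samples_per_gpu / per_device_batch)
    steps_per_epoch = math.ceil(batches_per_gpu / grad_accum)
    return steps_per_epoch, steps_per_epoch * epochs

num_gpus = 4  # assumption: ml.g5.12xlarge with 4 GPUs, single node
effective_batch = 2 * 2 * num_gpus  # per-device batch x accumulation x GPUs

steps_per_epoch, total_steps = optimizer_steps(
    num_samples=900, num_gpus=num_gpus, per_device_batch=2, grad_accum=2, epochs=3
)
print(f"effective batch size: {effective_batch}")
print(f"steps per epoch: {steps_per_epoch}, total steps: {total_steps}")
```

The estimate of 171 total steps appears consistent with the training log further down, where the learning rate drops by roughly 2e-4 / 171 per logged step.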


In [None]:
train_fn(
    model_id,
    train_ds=train_dataset,
    test_ds=test_dataset,
    per_device_train_batch_size=2,
    per_device_eval_batch_size=1,
    gradient_accumulation_steps=2,
    gradient_checkpointing=True,
    num_train_epochs=3,
    merge_weights=True,
    token=os.environ.get("HF_TOKEN", None) # Change None to your HF_TOKEN in case you don't use .env
)

2025-08-26 03:31:30,792 sagemaker.remote_function INFO     Serializing function code to s3://sagemaker-us-east-1-278313627171/train-Llama-SEA-LION-v3-5-8B-R-2025-08-26-03-31-30-792/function
2025-08-26 03:31:30,883 sagemaker.remote_function INFO     Serializing function arguments to s3://sagemaker-us-east-1-278313627171/train-Llama-SEA-LION-v3-5-8B-R-2025-08-26-03-31-30-792/arguments
2025-08-26 03:31:31,158 sagemaker.remote_function INFO     Copied user workspace to '/tmp/tmph8rjn0t1/temp_workspace/sagemaker_remote_function_workspace'
2025-08-26 03:31:31,160 sagemaker.remote_function INFO     Copied dependencies file at './requirements.txt' to '/tmp/tmph8rjn0t1/temp_workspace/sagemaker_remote_function_workspace/requirements.txt'
2025-08-26 03:31:31,160 sagemaker.remote_function INFO     Generated pre-execution script from commands to '/tmp/tmph8rjn0t1/temp_workspace/sagemaker_remote_function_workspace/pre_exec.sh'
2025-08-26 03:31:31,174 sagemaker.remote_function INFO     Successfully c

2025-08-26 03:31:31 Starting - Starting the training job
2025-08-26 03:31:31 Pending - Training job waiting for capacity............
2025-08-26 03:33:19 Pending - Preparing the instances for training...
2025-08-26 03:34:03 Downloading - Downloading the training image...
{'loss': 1.3548, 'grad_norm': 1.0879859924316406, 'learning_rate': 0.0001801169590643275, 'epoch': 0.32}
{'loss': 1.4307, 'grad_norm': 1.0881348848342896, 'learning_rate': 0.00017894736842105264, 'epoch': 0.34}
{'loss': 1.5761, 'grad_norm': 1.0529218912124634, 'learning_rate': 0.00017777777777777779, 'epoch': 0.35}
{'loss': 1.7304, 'grad_norm': 1.1258245706558228, 'learning_rate': 0.00017660818713450294, 'epoch': 0.37}
{'loss': 1.2001, 'grad_norm': 1.187904715538025, 'learning_rate': 0.00017543859649122806, 'epoch': 0.39}
{'loss': 1.3064, 'grad_norm': 1.344528317451477, 'learning_rate': 0.00017426900584795323, 'epoch': 0.41}
{'loss': 1.2368, 'grad_norm': 1.174536

## Load Fine-Tuned model

Note: Run `train_fn` with `merge_weights=True` so that the trained adapter is merged into the base model before deployment.

### Download model

In [29]:
import boto3
import json
import sagemaker
from sagemaker import get_execution_role
from sagemaker.huggingface import HuggingFaceModel, get_huggingface_llm_image_uri

In [30]:
sagemaker_session = sagemaker.Session()

In [31]:
# model_id = "meta-llama/Meta-Llama-3.1-8B-Instruct"
model_id = "aisingapore/Llama-SEA-LION-v3.5-8B-R"

bucket_name = sagemaker_session.default_bucket()
default_prefix = sagemaker_session.default_bucket_prefix
job_prefix = f"train-{model_id.split('/')[-1].replace('.', '-')}"
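The dots in the model name are replaced because SageMaker job names may contain only alphanumeric characters and hyphens. A small sketch of that transformation follows; the regular expression is a simplified stand-in for the service's naming rule and ignores the length limit.

```python
import re

model_id = "aisingapore/Llama-SEA-LION-v3.5-8B-R"

# Keep only the repository name and replace dots with hyphens.
job_prefix = f"train-{model_id.split('/')[-1].replace('.', '-')}"
print(job_prefix)

# Simplified naming rule (assumption): alphanumerics separated by hyphens.
VALID_JOB_NAME = re.compile(r"^[a-zA-Z0-9](-*[a-zA-Z0-9])*$")

print(bool(VALID_JOB_NAME.match(job_prefix)))
print(bool(VALID_JOB_NAME.match("train-" + model_id.split("/")[-1])))
```

The same prefix is passed to `job_name_prefix` in the `@remote` decorator, which is why it can be used here to find the training job.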

In [32]:
def get_last_job_name(job_name_prefix):
    sagemaker_client = boto3.client('sagemaker')

    matching_jobs = []
    next_token = None

    while True:
        # Prepare the search parameters
        search_params = {
            'Resource': 'TrainingJob',
            'SearchExpression': {
                'Filters': [
                    {
                        'Name': 'TrainingJobName',
                        'Operator': 'Contains',
                        'Value': job_name_prefix
                    },
                    {
                        'Name': 'TrainingJobStatus',
                        'Operator': 'Equals',
                        'Value': "Completed"
                    }
                ]
            },
            'SortBy': 'CreationTime',
            'SortOrder': 'Descending',
            'MaxResults': 100
        }

        # Add NextToken if we have one
        if next_token:
            search_params['NextToken'] = next_token

        # Make the search request
        search_response = sagemaker_client.search(**search_params)

        # Filter and add matching jobs
        matching_jobs.extend([
            job['TrainingJob']['TrainingJobName'] 
            for job in search_response['Results']
            if job['TrainingJob']['TrainingJobName'].startswith(job_name_prefix)
        ])

        # Check if we have more results to fetch
        next_token = search_response.get('NextToken')
        if not next_token or matching_jobs:  # Stop if we found at least one match or no more results
            break

    if not matching_jobs:
        raise ValueError(f"No completed training jobs found starting with prefix '{job_name_prefix}'")

    return matching_jobs[0]

In [33]:
job_name = get_last_job_name(job_prefix)

job_name

'train-Llama-SEA-LION-v3-5-8B-R-2025-08-26-03-31-30-792'

#### Inference configurations

In [34]:
instance_count = 1
instance_type = "ml.g5.8xlarge"
number_of_gpu = 1
health_check_timeout = 700

Note that this notebook does not work on some earlier versions of the HuggingFace TGI container, such as version 2.2.0

In [41]:
image_uri = get_huggingface_llm_image_uri(
    "huggingface",
    version="3.2.0" # "2.2.0"  # Update from 2.2.0 to a newer version
)

image_uri

INFO:sagemaker.image_uris:Defaulting to only available Python version: py311
INFO:sagemaker.image_uris:Defaulting to only supported image scope: gpu.


'763104351884.dkr.ecr.us-east-1.amazonaws.com/huggingface-pytorch-tgi-inference:2.6.0-tgi3.2.0-gpu-py311-cu124-ubuntu22.04'

Note: Image URI for training job:
`763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-training:2.5.1-gpu-py311-cu124-ubuntu22.04-sagemaker`

In [42]:
if default_prefix:
    model_data = f"s3://{bucket_name}/{default_prefix}/{job_name}/{job_name}/output/model.tar.gz"
else:
    model_data = f"s3://{bucket_name}/{job_name}/{job_name}/output/model.tar.gz"

model = HuggingFaceModel(
    image_uri=image_uri,
    model_data=model_data,
    role=get_execution_role(),
    env={
        'HF_MODEL_ID': "/opt/ml/model", # path to where sagemaker stores the model
        'SM_NUM_GPUS': json.dumps(number_of_gpu), # Number of GPU used per replica
        'QUANTIZE': 'bitsandbytes',
        'MAX_INPUT_LENGTH': '4096',
        'MAX_TOTAL_TOKENS': '8192'
    }
)
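The S3 location of the model artifact repeats the job name twice: once for the per-job folder used by the remote function and once for the folder that the training job itself writes into. As a sketch, the helper below (a hypothetical name, with a made-up bucket) builds the same path as the cell above.

```python
def build_model_data_uri(bucket, job_name, prefix=None):
    """Build the S3 URI of model.tar.gz for a remote-function training job."""
    parts = [bucket]
    if prefix:
        parts.append(prefix.strip("/"))
    # <job_name>/<job_name>/output/model.tar.gz, as in the cell above
    parts += [job_name, job_name, "output", "model.tar.gz"]
    return "s3://" + "/".join(parts)

# Hypothetical bucket and job names, for illustration only.
print(build_model_data_uri("my-example-bucket", "train-demo-job"))
print(build_model_data_uri("my-example-bucket", "train-demo-job", prefix="team-a"))
```

If deployment fails with a missing-artifact error, printing `model_data` and comparing it with the job's output location in the SageMaker console is a quick first check.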

sagemaker.config INFO - Applied value from config key = SageMaker.Model.EnableNetworkIsolation


If the following fails, you may wish to:
- check the troubleshooting guide: https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-python-sdk-troubleshooting.html#sagemaker-python-sdk-troubleshooting-create-endpoint
- check the CloudWatch logs: `/aws/sagemaker/Endpoints/{endpoint_name}` (for example, it may look something like this: `/aws/sagemaker/Endpoints/huggingface-pytorch-tgi-inference-2025-xx-xx-xx-xx-xx-xxx`)

In [43]:
predictor = model.deploy(
    initial_instance_count=instance_count,
    instance_type=instance_type,
    container_startup_health_check_timeout=health_check_timeout,
    model_data_download_timeout=3600
)

INFO:sagemaker:Creating model with name: huggingface-pytorch-tgi-inference-2025-08-26-05-09-13-292
INFO:sagemaker:Creating endpoint-config with name huggingface-pytorch-tgi-inference-2025-08-26-05-09-14-868
INFO:sagemaker:Creating endpoint with name huggingface-pytorch-tgi-inference-2025-08-26-05-09-14-868


----------!

#### Predict

In [45]:
# Option 1: Automatically retrieve the first endpoint (if this is your only endpoint)
sagemaker_client = boto3.client('sagemaker')
response = sagemaker_client.list_endpoints()
endpoint_names = [ endpoint['EndpointName'] for endpoint in response['Endpoints'] ]
if len(endpoint_names):
    endpoint_name = endpoint_names[0]
    print(f'Using endpoint: {endpoint_name}')

# Option 2: Set the endpoint name manually (Uncomment below to use Option 2)
# endpoint_name = "gemma-3-27b-it-... (replace with your endpoint name)"

Using endpoint: huggingface-pytorch-tgi-inference-2025-08-26-05-09-14-868
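Taking the first endpoint in the list only works when the account has a single endpoint. A slightly safer variant, sketched below, filters the `Endpoints` list by name prefix and status before picking the newest one. `pick_endpoint` is a hypothetical helper, and the records are made-up stand-ins for the entries that `list_endpoints` returns.

```python
from datetime import datetime

def pick_endpoint(endpoints, name_prefix):
    """Return the newest InService endpoint whose name starts with name_prefix."""
    candidates = [
        e for e in endpoints
        if e["EndpointName"].startswith(name_prefix)
        and e["EndpointStatus"] == "InService"
    ]
    if not candidates:
        raise ValueError(f"No InService endpoint starts with '{name_prefix}'")
    return max(candidates, key=lambda e: e["CreationTime"])["EndpointName"]

# Made-up records shaped like the entries of response['Endpoints'].
fake_endpoints = [
    {"EndpointName": "other-model-a", "EndpointStatus": "InService",
     "CreationTime": datetime(2025, 1, 3)},
    {"EndpointName": "huggingface-pytorch-tgi-inference-old", "EndpointStatus": "InService",
     "CreationTime": datetime(2025, 1, 1)},
    {"EndpointName": "huggingface-pytorch-tgi-inference-new", "EndpointStatus": "InService",
     "CreationTime": datetime(2025, 1, 2)},
    {"EndpointName": "huggingface-pytorch-tgi-inference-failed", "EndpointStatus": "Failed",
     "CreationTime": datetime(2025, 1, 4)},
]

print(pick_endpoint(fake_endpoints, "huggingface-pytorch-tgi-inference"))
```

In the notebook, `response['Endpoints']` would take the place of `fake_endpoints`.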


In [46]:
from sagemaker.huggingface.model import HuggingFacePredictor

In [47]:
if 'predictor' not in locals() and 'predictor' not in globals():
    print("Create predictor")
    predictor = HuggingFacePredictor(
        endpoint_name=endpoint_name
    )

In [48]:
base_prompt = f"""<|begin_of_text|><|start_header_id|>user<|end_header_id|>{{question}}<|eot_id|><|start_header_id|>assistant<|end_header_id|>"""

In [49]:
prompt = base_prompt.format(question="What statue is in front of the Notre Dame building?")

predictor.predict({
    "inputs": prompt,
    "parameters": {
        "max_new_tokens": 300,
        "temperature": 0.2,
        "top_p": 0.9,
        "return_full_text": False,
        "stop": ['<|eot_id|>', '<|end_of_text|>']
    }
})

[{'generated_text': 'Mary'}]

In [55]:
def invoke_sealion(prompt: str, print_response=True, **kwargs):
    # Create payload
    parameters = {
        "max_new_tokens": 300,
        "temperature": 0.2,
        "top_p": 0.9,
        "return_full_text": False,
        "stop": ['<|eot_id|>', '<|end_of_text|>']
    }
    for k in kwargs:
        if k in ['max_new_tokens', 'temperature', 'top_p' ]:
            parameters[k] = kwargs[k]
    payload = {
        "inputs": base_prompt.format(question=prompt),
        "parameters": parameters
    }

    # Invoke the model
    response = predictor.predict(payload)
    if print_response:
        print(response[0]['generated_text'])
    return response
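Note that `invoke_sealion` silently ignores any keyword other than `max_new_tokens`, `temperature`, and `top_p`, so a misspelled parameter name has no effect. The sketch below shows one possible variation that raises an error instead; `build_parameters` is a hypothetical helper and not part of the notebook.

```python
DEFAULT_PARAMETERS = {
    "max_new_tokens": 300,
    "temperature": 0.2,
    "top_p": 0.9,
    "return_full_text": False,
    "stop": ["<|eot_id|>", "<|end_of_text|>"],
}
ALLOWED_OVERRIDES = {"max_new_tokens", "temperature", "top_p"}

def build_parameters(**kwargs):
    """Merge overrides into the defaults, rejecting unknown parameter names."""
    unknown = set(kwargs) - ALLOWED_OVERRIDES
    if unknown:
        raise TypeError(f"Unsupported generation parameters: {sorted(unknown)}")
    return {**DEFAULT_PARAMETERS, **kwargs}

print(build_parameters(max_new_tokens=1000, temperature=0.1))

try:
    build_parameters(max_tokens=1000)  # wrong name: should be max_new_tokens
except TypeError as err:
    print(err)
```

Failing loudly makes it obvious when a generation setting was never applied.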

In [56]:
SAMPLE_PROMPTS = [
    """Terjemahkan teks berikut ini ke dalam Bahasa Inggris. Teks: Anak laki-laki ini, yang secara teknis tidak diijinkan untuk memiliki akun situs ini untuk tiga tahun mendatang,menemukan sebuah bug (kesalahan akibat ketidaksempurnaan desain) yang memungkinkan dia menghapus komentar yang dibuat oleh pengguna lain. Masalah ini dengan “cepat” diperbaiki setelah ditemukan, demikian keterangan Facebook, perusahaan media sosial yang memiliki Instagram. Jani kemudian dibayar - yang membuat dia sebagai anak yang termuda yang pernah menerima hadiah atas penemuan bug ini. Setelah menemukan kekurangan itu pada Februari, dia mengirim email ke Facebook. Beli sepeda dan peralatan sepak bola Sejumlah ahli teknik keamanan di perusahaan itu telah membuat akun uji coba kepada Jani untuk membuktikan teorinya - dan dia dapat melakukannya. Anak laki-laki ini, dari Helsinki, mengatakan kepada koran Finlandia Iltalehti, dia berencana untuk menggunakan uang itu untuk membeli sepeda baru, peralatan sepak bola dan komputer untuk saudara laki-lakinya. Facebook mengatakan kepada BBC, telah membayar $4.3 juta sebagai hadiah bagi yang menemukan bug sejak 2011. Banyak perusahaan menawarkan sebuah insentif keuangan bagi profesional keamanan - dan anak-anak muda, yang menyampaikan kekurangan itu kepada perusahaan, dibandingkan menjualnya ke pasar gelap. Terjemahan:""",
    """Apa sentimen dari kalimat berikut ini? Kalimat: Buku ini sangat membosankan. Jawaban:""",
    """Anda akan diberikan sebuah teks dan pertanyaan. Jawablah pertanyaan tersebut berdasarkan teks yang tersedia. > Teks: “Isyana lahir di Bandung pada 2 Mei 1993. Dia menghabiskan masa kecilnya di berbagai lokasi, karena orang tuanya bekerja & melanjutkan studi mereka di Belgia. Namun, pada usia 7 tahun keluarganya pindah ke Bandung, Indonesia. Isyana adalah putri bungsu dari pasangan Luana Marpanda, seorang guru musik, dan Sapta Dwikardana, Ph.D seorang dosen dan terapis (grafologis). Ia memiliki kakak perempuan bernama Rara Sekar Larasati, yang juga merupakan vokalis band bernama Banda Neira. Dibesarkan dalam keluarga pendidik, Isyana diperkenalkan ke dunia musik pada usia 4 tahun oleh ibunya. Isyana telah menguasai sejumlah instrumen. Termasuk piano, electone, flute, biola, dan saksofon.” > Pertanyaan: Siapa nama orang tua Isyana?""",
    """Sebutkan persamaan dan perbedaan antara gado-gado, ketoprak dan karedok""",
    """Jelaskan budaya Indonesia menyapa orang yang lebih tua?""",
    """Jelaskan budaya pulang kampung ketika lebaran?""",
    """Sebutkan berbagai jenis kopi dan karakteristik rasanya yang berasal dari Indonesia"""   
]

In [57]:
for prompt in SAMPLE_PROMPTS:
    print(f"##### Prompt: {prompt} #####")
    print("##### Response #####")
    # invoke_sealion only honors max_new_tokens, temperature, and top_p
    _ = invoke_sealion(prompt, max_new_tokens=1000, temperature=0.1, top_p=0.9)
    print()

##### Prompt: Terjemahkan teks berikut ini ke dalam Bahasa Inggris. Teks: Anak laki-laki ini, yang secara teknis tidak diijinkan untuk memiliki akun situs ini untuk tiga tahun mendatang,menemukan sebuah bug (kesalahan akibat ketidaksempurnaan desain) yang memungkinkan dia menghapus komentar yang dibuat oleh pengguna lain. Masalah ini dengan “cepat” diperbaiki setelah ditemukan, demikian keterangan Facebook, perusahaan media sosial yang memiliki Instagram. Jani kemudian dibayar - yang membuat dia sebagai anak yang termuda yang pernah menerima hadiah atas penemuan bug ini. Setelah menemukan kekurangan itu pada Februari, dia mengirim email ke Facebook. Beli sepeda dan peralatan sepak bola Sejumlah ahli teknik keamanan di perusahaan itu telah membuat akun uji coba kepada Jani untuk membuktikan teorinya - dan dia dapat melakukannya. Anak laki-laki ini, dari Helsinki, mengatakan kepada koran Finlandia Iltalehti, dia berencana untuk menggunakan uang itu untuk membeli sepeda baru, peralatan se

#### Delete Endpoint

In [58]:
predictor.delete_model()
predictor.delete_endpoint(delete_endpoint_config=True)

INFO:sagemaker:Deleting model with name: huggingface-pytorch-tgi-inference-2025-08-26-05-09-13-292
INFO:sagemaker:Deleting endpoint configuration with name: huggingface-pytorch-tgi-inference-2025-08-26-05-09-14-868
INFO:sagemaker:Deleting endpoint with name: huggingface-pytorch-tgi-inference-2025-08-26-05-09-14-868


In [None]:
!jupyter nbconvert Llama-SEA-LION-v3.5-8B-R-qlora-ddp-remote-decorator_qa.ipynb --to html