# SageMaker Training and Inference

In this notebook, I give a short demonstration of how to train and deploy the residual CNN model on Amazon SageMaker.

**This notebook:**
- Uploads training data to an S3 bucket
- Trains a model on SageMaker
- Deploys the trained model to an endpoint

In [None]:
%load_ext autoreload
%autoreload 2

import json
import os
from pathlib import Path

import boto3
from botocore.client import BaseClient
import sagemaker
from sagemaker.pytorch import PyTorch, PyTorchModel
from sagemaker.serializers import JSONSerializer


n_actions = 1
prefix = f"pdl/cnn1d-residual-n{n_actions}/v1"
framework_version = "2.6"
py_version = "py312"

session = sagemaker.Session()  # Use LocalSession() for local deployment testing
role = os.getenv("SAGEMAKER_ROLE_ARN")

bucket = session.default_bucket()

s3 = boto3.client("s3")
sm = boto3.client("sagemaker")

data_s3 = f"s3://{bucket}/{prefix}/"



## Upload

First, I upload the training data to an S3 bucket.

In [None]:
data_path = Path(f"data/n{n_actions}/processed")

for path in data_path.glob("*.json"):
    file_path = f"{prefix}/{path.name}"
    s3.upload_file(str(path), bucket, file_path)
    print(f"Uploaded {path} to {file_path}")
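
Before uploading, it can help to preview which local files will end up under which S3 key. The sketch below mirrors the loop above without touching S3. The helper `plan_uploads`, the bucket name `example-bucket`, and the file names `train.json` and `val.json` are hypothetical and used only for illustration.

```python
import tempfile
from pathlib import Path


def plan_uploads(data_dir: Path, bucket: str, prefix: str) -> list[tuple[Path, str]]:
    """Hypothetical helper: pair each local JSON file with its target S3 URI."""
    return [
        (path, f"s3://{bucket}/{prefix}/{path.name}")
        for path in sorted(data_dir.glob("*.json"))
    ]


# Demonstration on a temporary directory with placeholder files.
with tempfile.TemporaryDirectory() as tmp:
    tmp_dir = Path(tmp)
    for name in ("train.json", "val.json", "notes.txt"):
        (tmp_dir / name).write_text("{}")
    # Only the *.json files are selected, as in the upload loop.
    planned = plan_uploads(tmp_dir, "example-bucket", "pdl/cnn1d-residual-n1/v1")
    for path, uri in planned:
        print(f"{path.name} -> {uri}")
```

Because the helper only builds strings, it can be run without AWS credentials to check the prefix before any data is transferred.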

## Training

Next, I train the model on SageMaker using the training script `train_cnn1d_residual.py`, which also saves the TorchScript-compiled model.


In [None]:
estimator = PyTorch(
    role=role,
    framework_version=framework_version,
    py_version=py_version,
    source_dir="src",
    entry_point="preference_dynamics/sagemaker/train_cnn1d_residual.py",
    instance_type="ml.m5.large",
    instance_count=1,
    hyperparameters={
        "epochs": 200,
        "patience": 10,
        "lr": 1e-2,
        "batch-size": 32,
        "filters": "64 64 196",
        "kernel-sizes": "3 5 7",
        "hidden-dims": "64 96",
        "dropout": 0.3,
    },
    disable_profiler=True,
    use_spot_instances=True,
    max_run=3600,
    max_wait=5400,
    volume_size=50,
)

# Training data downloaded to /opt/ml/input/data/processed/
inputs = {"processed": data_s3}
estimator.fit(inputs)

training_job_name = estimator.latest_training_job.name
print(f"\ntraining job name: {training_job_name}")

INFO:sagemaker.image_uris:image_uri is not presented, retrieving image_uri based on instance_type, framework etc.
INFO:sagemaker:Creating training-job with name: pytorch-training-2026-01-11-14-38-37-129


2026-01-11 14:38:39 Starting - Starting the training job...
2026-01-11 14:38:54 Starting - Preparing the instances for training...
2026-01-11 14:39:17 Downloading - Downloading input data...
2026-01-11 14:39:57 Downloading - Downloading the training image......
2026-01-11 14:41:08 Training - Training image download completed. Training in progress..bash: cannot set terminal process group (-1): Inappropriate ioctl for device
bash: no job control in this shell
2026-01-11 14:41:11,764 sagemaker-training-toolkit INFO     Imported framework sagemaker_pytorch_container.training
2026-01-11 14:41:11,765 sagemaker-training-toolkit INFO     No GPUs detected (normal if no gpus installed)
2026-01-11 14:41:11,766 sagemaker-training-toolkit INFO     No Neurons detected (normal if no neurons installed)
2026-01-11 14:41:11,776 sagemaker_pytorch_container.training INFO     Block until all host DNS lookups succeed.
2026-01-11 14:41:11,809 sagemaker_pytorch_container.training INFO     Invoking user traini
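
The hyperparameters passed to the estimator above reach the entry point as command-line arguments of the form `--key value`, and the `processed` channel is exposed through the `SM_CHANNEL_PROCESSED` environment variable. The training script itself is not shown in this notebook, so the following is only a minimal sketch of how such a script might parse them, assuming each space-separated value (for example `"64 64 196"`) arrives as a single string. The `int_list` helper and the `--data-dir` and `--model-dir` argument names are hypothetical.

```python
import argparse
import os


def int_list(value: str) -> list[int]:
    """Convert a space-separated string such as "64 64 196" into [64, 64, 196]."""
    return [int(item) for item in value.split()]


def parse_args(argv=None) -> argparse.Namespace:
    """Parse hyperparameters; defaults mirror the values used in this notebook."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--epochs", type=int, default=200)
    parser.add_argument("--patience", type=int, default=10)
    parser.add_argument("--lr", type=float, default=1e-2)
    parser.add_argument("--batch-size", type=int, default=32)
    parser.add_argument("--filters", type=int_list, default=[64, 64, 196])
    parser.add_argument("--kernel-sizes", type=int_list, default=[3, 5, 7])
    parser.add_argument("--hidden-dims", type=int_list, default=[64, 96])
    parser.add_argument("--dropout", type=float, default=0.3)
    # SageMaker sets these environment variables inside the training container;
    # the fallback paths are placeholders for local runs.
    parser.add_argument(
        "--data-dir", default=os.environ.get("SM_CHANNEL_PROCESSED", "data/processed")
    )
    parser.add_argument("--model-dir", default=os.environ.get("SM_MODEL_DIR", "model"))
    return parser.parse_args(argv)


# Simulate the arguments the container would pass to the script.
args = parse_args(["--lr", "0.01", "--filters", "64 64 196", "--kernel-sizes", "3 5 7"])
print(args.filters, args.kernel_sizes, args.lr, args.batch_size)
```

Passing an explicit `argv` list keeps the parser testable outside the container.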

## Inference

In [None]:
def model_data_for_job(training_job_name: str, sm_client: BaseClient) -> str:
    """
    Get model artifacts S3 path for a training job.

    Args:
        training_job_name: Name of the training job.
        sm_client: boto3 SageMaker client.

    Returns:
        S3 path to the model artifacts.
    """
    training_job = sm_client.describe_training_job(TrainingJobName=training_job_name)
    return training_job["ModelArtifacts"]["S3ModelArtifacts"]


def delete_endpoints_models(sm_client: BaseClient):
    """
    Delete all endpoints, endpoint configs, and models visible to the client.

    Args:
        sm_client: boto3 SageMaker client.
    """
    for endpoint in sm_client.list_endpoints().get("Endpoints", []):
        endpoint_name = endpoint["EndpointName"]
        # Look up the config name instead of assuming it matches the endpoint name
        config_name = sm_client.describe_endpoint(EndpointName=endpoint_name)["EndpointConfigName"]
        sm_client.delete_endpoint(EndpointName=endpoint_name)
        sm_client.delete_endpoint_config(EndpointConfigName=config_name)
        print(f"Deleted endpoint: {endpoint_name}")
    for model in sm_client.list_models().get("Models", []):
        sm_client.delete_model(ModelName=model["ModelName"])
        print(f"Deleted model: {model['ModelName']}")

In [12]:
training_job_name = "pytorch-training-2026-01-11-14-38-37-129"
model_data = model_data_for_job(training_job_name, sm)

model = PyTorchModel(
    sagemaker_session=session,
    model_data=model_data,
    role=role,
    framework_version=framework_version,
    py_version=py_version,
    # Use a narrower source_dir than for training so the training dependencies are not loaded
    source_dir="src/preference_dynamics/sagemaker",
    entry_point="inference_cnn1d_residual.py",
)


# Deploy the model
endpoint_name = "pdl-cnn1d-residual-v1"
predictor = model.deploy(
    endpoint_name=endpoint_name,
    initial_instance_count=1,
    instance_type="ml.m5.large",
    serializer=JSONSerializer(),
)

INFO:sagemaker:Repacking model artifact (s3://sagemaker-eu-central-1-857754129070/pytorch-training-2026-01-11-14-38-37-129/output/model.tar.gz), script artifact (src/preference_dynamics/sagemaker), and dependencies ([]) into single tar.gz file located at s3://sagemaker-eu-central-1-857754129070/pytorch-inference-2026-01-11-15-10-08-767/model.tar.gz. This may take some time depending on model size...
INFO:sagemaker:Creating model with name: pytorch-inference-2026-01-11-15-10-10-402
INFO:sagemaker:Creating endpoint-config with name pdl-cnn1d-residual-v1
INFO:sagemaker:Creating endpoint with name pdl-cnn1d-residual-v1


-----!

In [13]:
request_body = json.loads(
    '[{"time_series": [[0.0, 4.130735205020703, 10.42741511136343, 13.000991632401586, 13.107347131379584, 12.428972206111084, 11.879681897898358, 11.660545016760784, 11.655366741255051, 11.71609027105237, 11.763977018052522, 11.782619408435737, 11.782715873817866, 11.77728643887746, 11.773114486025499, 11.771528825133688, 11.771546132995784], [3.8155207648866263, 2.2223991321403833, 6.6435242783096, 10.901003675676636, 12.846477694567007, 13.075174782767874, 12.675575775759418, 12.303005818271386, 12.136507385253337, 12.119354606686608, 12.155405105286086, 12.187986622354332, 12.202225796968577, 12.203477521999082, 12.20023221014977, 12.197386894106941, 12.19616613959889]], "features": [true, 3.6539340823159647, 2.994299404001566]}]'
)

preds = predictor.predict(
    request_body, initial_args={"ContentType": "application/json", "Accept": "application/json"}
)

preds

array([[ 6.00479126, -0.03229706,  4.22427464,  3.57283139, -9.11141968,
         1.27821589, 11.76381111, 11.92895222]])
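
The request body above is a list of samples, each with a `time_series` field (one list per channel) and a `features` field. The inference script is not shown in this notebook, so the following is only a sketch of how an `input_fn` handler, the hook the SageMaker PyTorch serving container calls to decode a request, might validate such a payload. The specific checks are assumptions, and the conversion to tensors is omitted.

```python
import json


def input_fn(request_body, content_type="application/json"):
    """Parse and validate a JSON request (sketch; tensor conversion omitted)."""
    if content_type != "application/json":
        raise ValueError(f"Unsupported content type: {content_type}")
    samples = json.loads(request_body)
    if not isinstance(samples, list):
        raise ValueError("Expected a JSON list of samples")
    for index, sample in enumerate(samples):
        if not isinstance(sample, dict):
            raise ValueError(f"Sample {index} is not a JSON object")
        missing = {"time_series", "features"} - sample.keys()
        if missing:
            raise ValueError(f"Sample {index} is missing keys: {sorted(missing)}")
        # Assumption: all channels of one sample share the same length.
        lengths = {len(channel) for channel in sample["time_series"]}
        if len(lengths) != 1:
            raise ValueError(f"Sample {index} has channels of unequal length")
    return samples


# A shortened payload with the same structure as the request above.
body = '[{"time_series": [[0.0, 4.1, 10.4], [3.8, 2.2, 6.6]], "features": [true, 3.65, 2.99]}]'
parsed = input_fn(body)
print(len(parsed), len(parsed[0]["time_series"]))
```

Rejecting malformed requests in `input_fn` gives the caller a clear error message rather than a failure deeper inside the model.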

In [14]:
# Clean up
delete_endpoints_models(sm)

Deleted endpoint: pdl-cnn1d-residual-v1
Deleted model: pytorch-inference-2026-01-11-15-10-10-402
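
The cleanup above removes every endpoint and model that the client can list, which may be more than intended in a shared account. As an alternative, here is a sketch that deletes a single endpoint together with its endpoint config and the models that config references. The helper `delete_endpoint_stack` is hypothetical. It uses the boto3 SageMaker client calls `describe_endpoint`, `describe_endpoint_config`, `delete_endpoint`, `delete_endpoint_config`, and `delete_model`, and `_StubClient` is a stand-in so the example runs without AWS access.

```python
def delete_endpoint_stack(sm_client, endpoint_name: str) -> list[str]:
    """Delete one endpoint, its config, and the models the config references."""
    endpoint = sm_client.describe_endpoint(EndpointName=endpoint_name)
    config_name = endpoint["EndpointConfigName"]
    config = sm_client.describe_endpoint_config(EndpointConfigName=config_name)
    # Collect model names before deleting the config that lists them.
    model_names = [
        variant["ModelName"]
        for variant in config.get("ProductionVariants", [])
        if "ModelName" in variant
    ]
    sm_client.delete_endpoint(EndpointName=endpoint_name)
    sm_client.delete_endpoint_config(EndpointConfigName=config_name)
    for model_name in model_names:
        sm_client.delete_model(ModelName=model_name)
    return model_names


class _StubClient:
    """Stand-in for a boto3 SageMaker client that records the calls it receives."""

    def __init__(self):
        self.calls = []

    def describe_endpoint(self, EndpointName):
        return {"EndpointConfigName": f"{EndpointName}-config"}

    def describe_endpoint_config(self, EndpointConfigName):
        return {"ProductionVariants": [{"ModelName": "example-model"}]}

    def delete_endpoint(self, EndpointName):
        self.calls.append(("delete_endpoint", EndpointName))

    def delete_endpoint_config(self, EndpointConfigName):
        self.calls.append(("delete_endpoint_config", EndpointConfigName))

    def delete_model(self, ModelName):
        self.calls.append(("delete_model", ModelName))


stub = _StubClient()
deleted_models = delete_endpoint_stack(stub, "example-endpoint")
print(stub.calls)
```

With a real client, the call would be `delete_endpoint_stack(sm, endpoint_name)`.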


## Summary

This notebook gives a short demonstration of model training and deployment to an inference endpoint with SageMaker.

