# Data Preprocessing with SageMaker SDK (PyTorchProcessor)

This notebook demonstrates data preprocessing with the **SageMaker SDK's PyTorchProcessor**, a framework-based processor that runs a preprocessing script inside a prebuilt PyTorch container and simplifies preprocessing workflows.

**Key Advantages:**
- Framework-optimized for PyTorch-based workflows
- Automatic requirements.txt installation
- Built-in log streaming
- Simplified API compared to sagemaker-core

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
from corelab.core.session import CoreLabSession

lab_session = CoreLabSession(
    'pytorch', 
    'customer-churn', 
    default_folder='preprocessing_sdk', 
    create_run_folder=True, 
    aws_profile='sagemaker-role'
)
lab_session.print()

sagemaker.config INFO - Not applying SDK defaults from location: /Library/Application Support/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /Users/machiel/Library/Application Support/sagemaker/config.yaml


Couldn't call 'get_role' to get Role ARN from role name machiel-crystalline to get Role path.


falling back to profile: sagemaker-role
AWS region: eu-central-1
Execution role arn:aws:iam::136548476532:role/service-role/AmazonSageMaker-ExecutionRole-20250902T164316
Output bucket uri: s3://sagemaker-eu-central-1-136548476532/preprocessing_sdk/2025-10-13T18-08-50
Framework: pytorch
Project name: customer-churn


## Using PyTorchProcessor from SageMaker SDK

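The PyTorchProcessor runs a script from a local source directory inside a SageMaker-managed PyTorch container. It installs the dependencies listed in that directory's `requirements.txt` and needs only a handful of constructor arguments.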

In [3]:
from sagemaker.pytorch.processing import PyTorchProcessor
from sagemaker.processing import ProcessingInput, ProcessingOutput

# Define data locations
data_s3_uri = f"s3://sagemaker-example-files-prod-{lab_session.region}/datasets/tabular/synthetic/churn.txt"
output_s3_uri = lab_session.jobs_output_s3_uri

print(f"📁 Data S3 URI: {data_s3_uri}")
print(f"📤 Output S3 URI: {output_s3_uri}")

📁 Data S3 URI: s3://sagemaker-example-files-prod-eu-central-1/datasets/tabular/synthetic/churn.txt
📤 Output S3 URI: s3://sagemaker-eu-central-1-136548476532/preprocessing_sdk/2025-10-13T18-08-50/customer-churn-2025-10-13T18-08-50/jobs


In [4]:
# Create the PyTorchProcessor; framework_version and py_version select the PyTorch container image
pytorch_processor = PyTorchProcessor(
    framework_version='2.0',
    py_version='py310',
    instance_type="ml.m5.large",
    instance_count=1,
    role=lab_session.role,
    sagemaker_session=lab_session.get_sagemaker_session(),
    volume_size_in_gb=30,
    max_runtime_in_seconds=3600,
    env={"PYTHONUNBUFFERED": "1"}
)

print("✅ PyTorchProcessor created")

✅ PyTorchProcessor created


In [6]:
# Run the processing job with framework features
job_name = lab_session.processing_job_name

pytorch_processor.run(
    code="preprocessing.py",         # Entry point script
    source_dir="src/",              # Directory with code and requirements.txt
    job_name=job_name,
    inputs=[
        ProcessingInput(
            source=data_s3_uri,
            destination="/opt/ml/processing/input/data"
        )
    ],
    outputs=[
        ProcessingOutput(
            output_name="processed",
            source="/opt/ml/processing/output",
            destination=f"{output_s3_uri}/{job_name}"
        )
    ],
    arguments=["--train-test-split", "0.33"],
    wait=True,  # Wait for completion
    logs=True   # Stream logs to notebook
)

print(f"✅ PyTorch Processing job completed: {job_name}")

INFO:sagemaker.processing:Uploaded src/ to s3://sagemaker-eu-central-1-136548476532/customer-churn-pytorch-processing-2025-10-13T18-12-01/source/sourcedir.tar.gz
INFO:sagemaker.processing:runproc.sh uploaded to s3://sagemaker-eu-central-1-136548476532/customer-churn-pytorch-processing-2025-10-13T18-12-01/source/runproc.sh
INFO:sagemaker:Creating processing-job with name customer-churn-pytorch-processing-2025-10-13T18-12-01
INFO:sagemaker.telemetry.telemetry_logging:SageMaker Python SDK will collect telemetry to help us better understand our user's needs, diagnose issues, and deliver additional features.
To opt out of telemetry, please disable via TelemetryOptOut parameter in SDK defaults config. For more information, refer to https://sagemaker.readthedocs.io/en/stable/overview.html#configuring-and-using-defaults-with-the-sagemaker-python-sdk.
INFO:sagemaker.local.image:'Docker Compose' found using Docker CLI.
INFO:sagemaker.local.local_session:Starting processing job
INFO:sagemaker.loc

 Container gpsak1xuxp-algo-1-ucemk  Creating
 Container gpsak1xuxp-algo-1-ucemk  Created
Attaching to gpsak1xuxp-algo-1-ucemk
gpsak1xuxp-algo-1-ucemk  | CodeArtifact repository not specified. Skipping login.
gpsak1xuxp-algo-1-ucemk  | Collecting xgboost==3.0.5 (from -r requirements.txt (line 1))
gpsak1xuxp-algo-1-ucemk  |   Downloading xgboost-3.0.5-py3-none-manylinux_2_28_x86_64.whl.metadata (2.1 kB)
gpsak1xuxp-algo-1-ucemk  | Collecting sagemaker-training>=5.1.1 (from -r requirements.txt (line 3))
gpsak1xuxp-algo-1-ucemk  |   Downloading sagemaker_training-5.1.1.tar.gz (59 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 59.2/59.2 kB 1.9 MB/s eta 0:00:00
gpsak1xuxp-algo-1-ucemk  |   Preparing metadata (setup.py) ... done
gpsak1xuxp-algo-1-ucemk  | Collecting sagemaker-inference>=1.10.1 (from -r requirements.txt (line 4))
gpsak1xuxp-algo-1-ucemk  |   Downloading sagemaker_inference-1.10.1.tar.g

INFO:sagemaker.local.image:===== Job Complete =====


✅ PyTorch Processing job completed: customer-churn-pytorch-processing-2025-10-13T18-12-01
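
The notebook passes `--train-test-split 0.33` to `preprocessing.py`, but the script itself is not shown here. As a rough sketch of how such a script could consume that argument, the snippet below parses the flag with `argparse` and performs a seeded shuffle split using only the standard library. The function names `parse_args` and `split_rows`, the default value, and the reading of `0.33` as the *test* fraction are all assumptions made for illustration; the real script may work differently.

```python
import argparse
import random


def parse_args(argv=None):
    """Parse the CLI arguments that the notebook passes via `arguments=[...]`."""
    parser = argparse.ArgumentParser()
    # Same flag as in the run() call; the default value is an assumption.
    parser.add_argument("--train-test-split", type=float, default=0.33)
    return parser.parse_args(argv)


def split_rows(rows, test_fraction, seed=42):
    """Shuffle rows deterministically and return (train, test) lists.

    Treating the value as the test fraction is an assumption.
    """
    if not 0.0 < test_fraction < 1.0:
        raise ValueError("test_fraction must be between 0 and 1")
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


# Simulate the arguments the processing job would hand to the script.
args = parse_args(["--train-test-split", "0.33"])
train, test = split_rows(range(100), args.train_test_split)
print(len(train), len(test))  # 67 33
```

Inside the container, a script along these lines would read its input from `/opt/ml/processing/input/data` and write the two splits under `/opt/ml/processing/output`, matching the `ProcessingInput` and `ProcessingOutput` paths configured above.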


In [None]:
print(f"Output location of processing job: {output_s3_uri}/{job_name}")
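
To inspect the job's results, the printed URI usually has to be split into a bucket and a key prefix first. The helper below is a minimal sketch using only the standard library; `split_s3_uri` is a hypothetical name, and the bucket and prefix in the example are made up.

```python
from urllib.parse import urlparse


def split_s3_uri(uri):
    """Split 's3://bucket/some/prefix' into ('bucket', 'some/prefix')."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not an S3 URI: {uri!r}")
    return parsed.netloc, parsed.path.lstrip("/")


# Hypothetical URI, shaped like the output location printed above.
bucket, prefix = split_s3_uri("s3://example-bucket/preprocessing_sdk/jobs/my-job")
print(bucket)  # example-bucket
print(prefix)  # preprocessing_sdk/jobs/my-job
```

The resulting pair can then be handed to an S3 client, for example as the `Bucket` and `Prefix` arguments of boto3's `list_objects_v2`, to list the files the job produced.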

## Key Features of PyTorchProcessor

### Advantages Over sagemaker-core ProcessingJob:

1. **Modern Framework Integration**
   - Optimized for PyTorch-based workflows
   - Uses prebuilt PyTorch container images (this notebook pins `framework_version='2.0'`)
   - Better compatibility with modern ML libraries and tools

2. **Automatic Dependency Management**
   - The contents of `source_dir`, including `requirements.txt`, are packaged and uploaded to S3 automatically
   - Dependencies are installed in the container before the script runs
   - No manual container configuration is needed

3. **Built-in Convenience Features**
   - `wait=True`: Blocks until the job completes
   - `logs=True`: Stream logs directly to notebook
   - Cleaner API with less boilerplate

4. **SDK Ecosystem Integration**
   - Works seamlessly with other SDK components
   - Can be used with PipelineSession for pipelines
   - Integrates with SageMaker Experiments
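
The log output earlier in this notebook shows `src/` being uploaded as `sourcedir.tar.gz`. To make that packaging step more concrete, here is a simplified sketch that bundles a directory into a gzip-compressed tarball with the standard library. It only approximates the idea and is not the SDK's actual implementation; `bundle_source_dir` is a hypothetical helper, and the file contents are placeholders.

```python
import tarfile
import tempfile
from pathlib import Path


def bundle_source_dir(source_dir, archive_path):
    """Pack every file under source_dir into a .tar.gz and return the member names."""
    source_dir = Path(source_dir)
    with tarfile.open(archive_path, "w:gz") as tar:
        for path in sorted(source_dir.rglob("*")):
            if path.is_file():
                # Store paths relative to source_dir so the archive unpacks flat.
                tar.add(path, arcname=path.relative_to(source_dir).as_posix())
    with tarfile.open(archive_path, "r:gz") as tar:
        return sorted(tar.getnames())


with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "src"
    src.mkdir()
    # Placeholder files standing in for the notebook's src/ directory.
    (src / "preprocessing.py").write_text("print('hello')\n")
    (src / "requirements.txt").write_text("xgboost==3.0.5\n")
    names = bundle_source_dir(src, Path(tmp) / "sourcedir.tar.gz")
    print(names)  # ['preprocessing.py', 'requirements.txt']
```

Because `requirements.txt` travels inside the same archive as the entry point, the container can install the listed packages before running the script, which is what the `Collecting xgboost==3.0.5` lines in the job log reflect.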

### When to Use PyTorchProcessor:

✅ **Use PyTorchProcessor when:**
- You are building ML workflows within the PyTorch ecosystem
- You need automatic dependency installation
- You want a simpler, higher-level API
- You are building SageMaker Pipelines (next lab!)

✅ **Use sagemaker-core ProcessingJob when:**
- You need fine-grained control over all parameters
- You are using custom containers
- You are learning the underlying SageMaker APIs
- You are building custom processing abstractions