# Data Preprocessing with SageMaker SDK (PyTorchProcessor)

This notebook runs a data preprocessing job with the **SageMaker SDK's PyTorchProcessor**, a framework processor that executes a preprocessing script inside an AWS-managed PyTorch container.

**Key Advantages:**
- Framework-optimized for PyTorch-based workflows
- Automatic requirements.txt installation
- Built-in log streaming
- Simplified API compared to sagemaker-core

In [11]:
%load_ext autoreload
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [12]:
from corelab.core.session import CoreLabSession

lab_session = CoreLabSession(
    'pytorch', 
    'customer-churn', 
    default_folder='preprocessing_sdk', 
    create_run_folder=True, 
    aws_profile='sagemaker-role'
)
lab_session.print()

execution role available: arn:aws:iam::136548476532:role/service-role/AmazonSageMaker-ExecutionRole-20250902T164316
AWS region: eu-central-1
Execution role arn:aws:iam::136548476532:role/service-role/AmazonSageMaker-ExecutionRole-20250902T164316
Output bucket uri: s3://sagemaker-eu-central-1-136548476532/preprocessing_sdk/2025-10-10T11-01-47
Framework: pytorch
Project name: customer-churn


## Using PyTorchProcessor from SageMaker SDK

The PyTorchProcessor runs a processing script inside a prebuilt PyTorch container, installs the dependencies listed in the `requirements.txt` shipped with `source_dir`, and needs only a handful of constructor arguments.

In [13]:
from sagemaker.pytorch.processing import PyTorchProcessor
from sagemaker.processing import ProcessingInput, ProcessingOutput

# Define data locations
data_s3_uri = f"s3://sagemaker-example-files-prod-{lab_session.region}/datasets/tabular/synthetic/churn.txt"
output_s3_uri = lab_session.jobs_output_s3_uri

print(f"📁 Data S3 URI: {data_s3_uri}")
print(f"📤 Output S3 URI: {output_s3_uri}")

📁 Data S3 URI: s3://sagemaker-example-files-prod-eu-central-1/datasets/tabular/synthetic/churn.txt
📤 Output S3 URI: s3://sagemaker-eu-central-1-136548476532/preprocessing_sdk/2025-10-10T11-01-47/customer-churn-2025-10-10T11-01-47/jobs


In [14]:
# Create the PyTorchProcessor; the container image is resolved from the framework version, Python version, and instance type
pytorch_processor = PyTorchProcessor(
    framework_version='2.0',
    py_version='py310',
    instance_type="ml.m5.large",
    instance_count=1,
    role=lab_session.role,
    sagemaker_session=lab_session.get_sagemaker_session(),
    volume_size_in_gb=30,
    max_runtime_in_seconds=3600,
    env={"PYTHONUNBUFFERED": "1"}
)

print("✅ PyTorchProcessor created")

INFO:sagemaker.image_uris:image_uri is not presented, retrieving image_uri based on instance_type, framework etc.


✅ PyTorchProcessor created
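
The job submitted in the next cell runs `preprocessing.py` from `src/`, but that script is not shown in this notebook. As a rough idea of its shape, here is a minimal stdlib-only sketch. Everything in it is an assumption for illustration: the function names (`parse_args`, `split_rows`, `run`), the output file names `train.csv` and `test.csv`, the fixed seed, and the premise that the input is a comma-separated file with a header row. Only the `--train-test-split` flag and the two container paths are taken from the notebook.

```python
import argparse
import csv
import random
from pathlib import Path

# Container paths matching the ProcessingInput/ProcessingOutput in this notebook.
INPUT_DIR = Path("/opt/ml/processing/input/data")
OUTPUT_DIR = Path("/opt/ml/processing/output")


def parse_args(argv=None):
    parser = argparse.ArgumentParser()
    # Same flag the notebook passes through `arguments=[...]`.
    parser.add_argument("--train-test-split", type=float, default=0.33)
    return parser.parse_args(argv)


def split_rows(rows, test_fraction, seed=42):
    """Shuffle rows and return (train, test); test holds about test_fraction of them."""
    if not 0.0 < test_fraction < 1.0:
        raise ValueError("test_fraction must be between 0 and 1")
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def run(input_file, output_dir, test_fraction):
    """Read a CSV with a header row and write train.csv / test.csv (hypothetical names)."""
    with open(input_file, newline="") as f:
        reader = csv.reader(f)
        header = next(reader)
        rows = list(reader)
    train, test = split_rows(rows, test_fraction)
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    for name, part in (("train.csv", train), ("test.csv", test)):
        with open(output_dir / name, "w", newline="") as f:
            writer = csv.writer(f)
            writer.writerow(header)
            writer.writerows(part)
    return len(train), len(test)


# Only run end to end where the processing input mount exists (inside the container).
if __name__ == "__main__" and INPUT_DIR.exists():
    args = parse_args()
    n_train, n_test = run(INPUT_DIR / "churn.txt", OUTPUT_DIR, args.train_test_split)
    print(f"train rows: {n_train}, test rows: {n_test}")
```

Whatever the real script does, anything it writes under `/opt/ml/processing/output` is what the `processed` output uploads to S3 when the job ends.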


In [16]:
# Run the processing job; source_dir is packaged and uploaded together with the entry point
job_name = lab_session.processing_job_name

pytorch_processor.run(
    code="preprocessing.py",         # Entry point script
    source_dir="src/",              # Directory with code and requirements.txt
    job_name=job_name,
    inputs=[
        ProcessingInput(
            source=data_s3_uri,
            destination="/opt/ml/processing/input/data"
        )
    ],
    outputs=[
        ProcessingOutput(
            output_name="processed",
            source="/opt/ml/processing/output",
            destination=f"{output_s3_uri}/{job_name}"
        )
    ],
    arguments=["--train-test-split", "0.33"],
    wait=True,  # Wait for completion
    logs=True   # Stream logs to notebook
)

print(f"✅ PyTorch Processing job completed: {job_name}")

INFO:sagemaker.processing:Uploaded src/ to s3://sagemaker-eu-central-1-136548476532/preprocessing_sdk/2025-10-10T11-01-47/customer-churn-pytorch-processing-2025-10-10T11-04-12/source/sourcedir.tar.gz
INFO:sagemaker.processing:runproc.sh uploaded to s3://sagemaker-eu-central-1-136548476532/preprocessing_sdk/2025-10-10T11-01-47/customer-churn-pytorch-processing-2025-10-10T11-04-12/source/runproc.sh
INFO:sagemaker:Creating processing-job with name customer-churn-pytorch-processing-2025-10-10T11-04-12


...............CodeArtifact repository not specified. Skipping login.
Collecting sagemaker==2.251.1 (from -r requirements.txt (line 1))
  Downloading sagemaker-2.251.1-py3-none-any.whl.metadata (17 kB)
Collecting attrs<26,>=24 (from sagemaker==2.251.1->-r requirements.txt (line 1))
  Downloading attrs-25.4.0-py3-none-any.whl.metadata (10 kB)
Collecting boto3<2.0,>=1.39.5 (from sagemaker==2.251.1->-r requirements.txt (line 1))
  Downloading boto3-1.40.49-py3-none-any.whl.metadata (6.7 kB)
Collecting docker (from sagemaker==2.251.1->-r requirements.txt (line 1))
  Downloading docker-7.1.0-py3-none-any.whl.metadata (3.8 kB)
Collecting fastapi (from sagemaker==2.251.1->-r requirements.txt (line 1))
  Downloading fastapi-0.118.3-py3-none-any.whl.metadata (28 kB)
Collecting graphene<4,>=3 (from sagemaker==2.251.1->-r requirements.txt (line 1))
  Downloading graphene-3.4.3-py2.py3-none-any.whl.metadata (6.9 kB)
Collecting num

In [17]:
print(f"Output location of processing job: {output_s3_uri}/{job_name}")

Output location of processing job: s3://sagemaker-eu-central-1-136548476532/preprocessing_sdk/2025-10-10T11-01-47/customer-churn-2025-10-10T11-01-47/jobs/customer-churn-pytorch-processing-2025-10-10T11-04-12
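
The output location can also be read back from the job description instead of being rebuilt from strings. The sketch below pulls the S3 URI of a named output out of a dictionary shaped like the `DescribeProcessingJob` response. The helper name `output_s3_uri` and the `sample` dictionary are made up for illustration; the sample is hand-written to mirror the response layout and is not real job output.

```python
def output_s3_uri(description, output_name):
    """Return the S3 URI of a named output from a DescribeProcessingJob-style dict."""
    outputs = description.get("ProcessingOutputConfig", {}).get("Outputs", [])
    for out in outputs:
        if out.get("OutputName") == output_name:
            return out["S3Output"]["S3Uri"]
    raise KeyError(f"no output named {output_name!r}")


# Hand-written sample mirroring the response shape (placeholder values only).
sample = {
    "ProcessingJobName": "example-job",
    "ProcessingJobStatus": "Completed",
    "ProcessingOutputConfig": {
        "Outputs": [
            {
                "OutputName": "processed",
                "S3Output": {
                    "S3Uri": "s3://example-bucket/jobs/example-job",
                    "LocalPath": "/opt/ml/processing/output",
                    "S3UploadMode": "EndOfJob",
                },
            }
        ]
    },
}

print(output_s3_uri(sample, "processed"))
```

In this notebook, the dictionary would presumably come from `pytorch_processor.latest_job.describe()` after `run()` returns, with `"processed"` as the output name.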


## Key Features of PyTorchProcessor

### Advantages Over sagemaker-core ProcessingJob:

1. **Modern Framework Integration**
   - Optimized for PyTorch-based workflows
   - Uses AWS-managed PyTorch container images, selected by `framework_version` (2.0 in this notebook)
   - Better compatibility with modern ML libraries and tools

2. **Automatic Dependency Management**
   - The contents of `source_dir`, including `requirements.txt`, are packaged and uploaded to S3
   - Dependencies listed in `requirements.txt` are installed in the container before the script runs
   - No manual container configuration needed

3. **Built-in Convenience Features**
   - `wait=True`: Automatic blocking until completion
   - `logs=True`: Stream logs directly to notebook
   - Cleaner API with less boilerplate

4. **SDK Ecosystem Integration**
   - Works seamlessly with other SDK components
   - Can be used with PipelineSession for pipelines
   - Integrates with SageMaker Experiments
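
The dependency handling in item 2 depends on how `source_dir` is laid out. A typo in the entry point name or a misplaced `requirements.txt` only shows up after the job has started, so a quick local check before calling `run()` can save a round trip. The sketch below is one way to do that. The helper name `check_source_dir` is hypothetical, and it assumes that `requirements.txt` must sit at the top level of `source_dir` to be picked up, which is worth confirming against the SDK version in use.

```python
from pathlib import Path


def check_source_dir(source_dir, entry_point):
    """Return a list of problems found in source_dir (empty list means it looks fine)."""
    src = Path(source_dir)
    if not src.is_dir():
        return [f"{src} is not a directory"]
    problems = []
    if not (src / entry_point).is_file():
        problems.append(f"entry point {entry_point} not found in {src}")
    # Assumption: only a top-level requirements.txt is installed by the job.
    if not (src / "requirements.txt").is_file():
        problems.append(f"no requirements.txt at the top level of {src}")
    return problems
```

For the job in this notebook, the call would be `check_source_dir("src/", "preprocessing.py")`; an empty list means both files are where the check expects them.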

### When to Use PyTorchProcessor:

✅ **Use PyTorchProcessor when:**
- Building ML workflows on the PyTorch ecosystem
- You need automatic dependency installation
- You want a simpler, higher-level API
- Building SageMaker Pipelines (next lab!)

✅ **Use sagemaker-core ProcessingJob when:**
- Need fine-grained control over all parameters
- Using custom containers
- Learning the underlying SageMaker APIs
- Building custom processing abstractions