# Data Preprocessing with SageMaker SDK (PyTorchProcessor)

This notebook runs a data preprocessing job with the **SageMaker SDK's PyTorchProcessor**, a framework processor that runs your script in a SageMaker-managed PyTorch container and handles packaging and uploading the code for you.

**Key Advantages:**
- Runs in a prebuilt PyTorch container, so no custom image is needed
- Installs dependencies from `requirements.txt` in `source_dir` automatically
- Streams job logs to the notebook
- Simpler API than sagemaker-core

In [None]:
%load_ext autoreload
%autoreload 2

In [None]:
from corelab.core.session import CoreLabSession

lab_session = CoreLabSession(
    'pytorch', 
    'customer-churn', 
    default_folder='preprocessing_sdk', 
    create_run_folder=True, 
    aws_profile='sagemaker-role'
)
lab_session.print()

## Using PyTorchProcessor from SageMaker SDK

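The PyTorchProcessor runs a script in a prebuilt PyTorch container. It packages the contents of `source_dir`, uploads them to S3, and installs the dependencies listed in `requirements.txt` before running the entry point, so the configuration stays short.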

In [None]:
from sagemaker.pytorch.processing import PyTorchProcessor
from sagemaker.processing import ProcessingInput, ProcessingOutput

# Define data locations
data_s3_uri = f"s3://sagemaker-example-files-prod-{lab_session.region}/datasets/tabular/synthetic/churn.txt"
output_s3_uri = lab_session.jobs_output_s3_uri

print(f"📁 Data S3 URI: {data_s3_uri}")
print(f"📤 Output S3 URI: {output_s3_uri}")

In [None]:
# Create the PyTorchProcessor: prebuilt PyTorch 2.0 / Python 3.10 container on one ml.m5.large instance
pytorch_processor = PyTorchProcessor(
    framework_version='2.0',
    py_version='py310',
    instance_type="ml.m5.large",
    instance_count=1,
    role=lab_session.role,
    sagemaker_session=lab_session.get_sagemaker_session(),
    volume_size_in_gb=30,
    max_runtime_in_seconds=3600,
    env={"PYTHONUNBUFFERED": "1"}
)

print("✅ PyTorchProcessor created")

In [None]:
# Run the processing job; source_dir is packaged and uploaded together with the entry point
job_name = lab_session.processing_job_name

pytorch_processor.run(
    code="preprocessing.py",  # Entry point script, relative to source_dir
    source_dir="src/",        # Directory with the code and requirements.txt
    job_name=job_name,
    inputs=[
        ProcessingInput(
            source=data_s3_uri,
            destination="/opt/ml/processing/input/data"
        )
    ],
    outputs=[
        ProcessingOutput(
            output_name="processed",
            source="/opt/ml/processing/output",
            destination=f"{output_s3_uri}/{job_name}"
        )
    ],
    arguments=["--train-test-split", "0.33"],
    wait=True,  # Wait for completion
    logs=True   # Stream logs to notebook
)

print(f"✅ PyTorch Processing job completed: {job_name}")
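
The entry point `src/preprocessing.py` is not shown in this notebook. As a rough sketch only, a script that receives the `--train-test-split` argument passed above might parse it with `argparse` and split the rows as follows. The function names `parse_args` and `split_rows`, the fixed seed, and the shuffle-then-slice strategy are assumptions for illustration, not the lab's actual script.

```python
import argparse
import random


def parse_args(argv=None):
    """Parse the values passed through `arguments=[...]` in processor.run()."""
    parser = argparse.ArgumentParser()
    # Fraction of rows held out for testing, e.g. 0.33
    parser.add_argument("--train-test-split", type=float, default=0.33)
    return parser.parse_args(argv)


def split_rows(rows, test_fraction, seed=42):
    """Shuffle a copy of rows and return (train, test) lists."""
    if not 0.0 < test_fraction < 1.0:
        raise ValueError("test_fraction must be strictly between 0 and 1")
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducible splits
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


# Simulate the arguments the processing job would pass to the script
args = parse_args(["--train-test-split", "0.33"])
train, test = split_rows(range(100), args.train_test_split)
print(f"train={len(train)} test={len(test)}")  # train=67 test=33
```

In a real job the script would read its input from `/opt/ml/processing/input/data` and write the splits under `/opt/ml/processing/output`, matching the `ProcessingInput` and `ProcessingOutput` paths configured above.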

In [None]:
print(f"Output location of processing job: {output_s3_uri}/{job_name}")
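
To inspect the results afterwards (for example with `boto3` or the AWS CLI), it can help to split the output URI into a bucket and a key prefix. The following is a minimal standard-library sketch; the helper name `split_s3_uri` and the example URI are made up for illustration.

```python
from urllib.parse import urlparse


def split_s3_uri(s3_uri):
    """Return (bucket, key_prefix) for an s3://bucket/prefix URI."""
    parsed = urlparse(s3_uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"Not a valid S3 URI: {s3_uri!r}")
    return parsed.netloc, parsed.path.lstrip("/")


# Hypothetical URI; in this notebook it would be f"{output_s3_uri}/{job_name}"
bucket, prefix = split_s3_uri("s3://example-bucket/jobs/output/my-job")
print(bucket)  # example-bucket
print(prefix)  # jobs/output/my-job
```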

## Key Features of PyTorchProcessor

### Advantages Over sagemaker-core ProcessingJob:

1. **Modern Framework Integration**
   - Suited to PyTorch-based workflows
   - Runs in the prebuilt PyTorch container selected by `framework_version` and `py_version` (2.0 and py310 in this notebook)
   - Extra libraries can be added through `requirements.txt` instead of a custom image

2. **Automatic Dependency Management**
   - The contents of `source_dir`, including `requirements.txt`, are packaged and uploaded to S3
   - Dependencies listed in `requirements.txt` are installed in the container before the script runs
   - No manual container configuration needed

3. **Built-in Convenience Features**
   - `wait=True`: blocks until the job completes
   - `logs=True`: streams job logs directly to the notebook
   - Cleaner API with less boilerplate

4. **SDK Ecosystem Integration**
   - Works seamlessly with other SDK components
   - Can be used with PipelineSession for pipelines
   - Integrates with SageMaker Experiments
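
The automatic dependency installation in item 2 depends on `requirements.txt` being present in `source_dir` (at its top level, as far as the SDK's behavior is understood here). A quick local check before submitting the job can catch a misplaced file early. This is a sketch under that assumption; `check_source_dir` is a hypothetical helper, not part of the SageMaker SDK.

```python
from pathlib import Path


def check_source_dir(source_dir, entry_point, require_requirements=True):
    """Return a list of problems found in a local source_dir (empty if none)."""
    root = Path(source_dir)
    if not root.is_dir():
        return [f"{root} is not a directory"]
    problems = []
    if not (root / entry_point).is_file():
        problems.append(f"entry point {entry_point} not found in {root}")
    if require_requirements and not (root / "requirements.txt").is_file():
        problems.append(f"requirements.txt not found in {root}")
    return problems


# Validate the layout used in this notebook before calling run()
for problem in check_source_dir("src/", "preprocessing.py"):
    print(f"WARNING: {problem}")
```

An empty result means both files are where the processor is expected to look for them; it does not validate the contents of either file.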

### When to Use PyTorchProcessor:

✅ **Use PyTorchProcessor when:**
- You are building ML workflows around the PyTorch ecosystem
- You need automatic dependency installation
- You want a simpler, higher-level API
- You are building SageMaker Pipelines (next lab!)

✅ **Use sagemaker-core ProcessingJob when:**
- You need fine-grained control over all parameters
- You are using custom containers
- You are learning the underlying SageMaker APIs
- You are building custom processing abstractions