<h1> Connection to S3 </h1>

In [1]:
import boto3
import s3fs

bucket = "feature-engineering-bucket-xxxxxx"
prefix = "Dataset/"

# List the objects under the Dataset/ prefix (one call returns at most 1,000 keys)
s3 = boto3.client('s3')
response = s3.list_objects_v2(Bucket=bucket, Prefix=prefix)

for obj in response.get('Contents', []):
    print(obj["Key"])


Dataset/bank-additional-full.csv
Dataset/bank-additional-names.txt
Dataset/bank-additional.csv


boto3 is used to connect to the S3 bucket; list_objects_v2 returns the keys of the objects stored under the Dataset/ prefix.
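Because a single list_objects_v2 response holds at most 1,000 keys, a larger prefix would need pagination. The following is a minimal sketch, not part of the original notebook: the helper name list_keys is hypothetical, while get_paginator and paginate are standard boto3 client methods.

```python
def list_keys(client, bucket, prefix):
    """Return every object key under prefix, following pagination."""
    paginator = client.get_paginator("list_objects_v2")
    keys = []
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        # "Contents" is absent from a page when no objects match
        for obj in page.get("Contents", []):
            keys.append(obj["Key"])
    return keys

# Possible usage in this notebook (requires AWS credentials):
#   keys = list_keys(boto3.client("s3"), bucket, prefix)
```

Passing the client in as an argument keeps the helper easy to exercise with a stub object, without touching AWS.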

<h1> Docker-based preprocessing job </h1>

In [11]:
from sagemaker.processing import ScriptProcessor, ProcessingInput, ProcessingOutput
from sagemaker import get_execution_role

# Role ARN - Replace this with your role ARN
# role = "arn:aws:iam::878254733488:role/sagemaker-role"
role = get_execution_role()

# Define your custom image URI from ECR
image_uri = "961807745392.dkr.ecr.ap-south-1.amazonaws.com/sagemaker-feature-engineering:v2"

# Create a ScriptProcessor with the custom Docker image
script_processor = ScriptProcessor(
    image_uri=image_uri,
    role=role,
    command=["python3"],  # Specify the command to run Python scripts
    instance_count=1,
    instance_type="ml.t3.medium",
    base_job_name="feature-engineering-job"
)

# Run the script processor job
script_processor.run(
    code="feature_engineering.py",  # Local script path; the SDK uploads it to S3 and the container runs it
    inputs=[
        ProcessingInput(
            source="s3://feature-engineering-bucket-989220949c9c/Dataset/bank-additional-full.csv",  # S3 input file
            destination="/opt/ml/processing/input"  # Path inside the container
        )
    ],
    outputs=[
        ProcessingOutput(
            source="/opt/ml/processing/output",  # Path inside the container
            destination="s3://feature-engineering-bucket-989220949c9c/output/"  # S3 output location
        )
    ]
)


sagemaker.config INFO - Applied value from config key = SageMaker.PythonSDK.Modules.Session.DefaultS3Bucket
sagemaker.config INFO - Applied value from config key = SageMaker.PythonSDK.Modules.Session.DefaultS3ObjectKeyPrefix
sagemaker.config INFO - Applied value from config key = SageMaker.PythonSDK.Modules.Session.DefaultS3Bucket
sagemaker.config INFO - Applied value from config key = SageMaker.PythonSDK.Modules.Session.DefaultS3ObjectKeyPrefix
sagemaker.config INFO - Applied value from config key = SageMaker.ProcessingJob.NetworkConfig.VpcConfig.Subnets
sagemaker.config INFO - Applied value from config key = SageMaker.ProcessingJob.NetworkConfig.VpcConfig.SecurityGroupIds
..................Feature engineering complete. Saved to: /opt/ml/processing/output/bank_additional_transformed.csv



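In a SageMaker processing job, the Docker image provides the execution environment, and the definition of script_processor ties that image to an IAM role, a command (python3), and an instance type. The call to script_processor.run() names the script to run in this environment and maps S3 locations to container paths: the input file is downloaded to "/opt/ml/processing/input", and whatever the script writes to "/opt/ml/processing/output" is uploaded to the S3 output location when the job finishes. Even though a copy of feature_engineering.py may already be baked into the image through the Dockerfile, the script named in script_processor.run() is the one that executes. The SDK uploads it to S3, and the container downloads it (by default to "/opt/ml/processing/input/code/") and runs it with the configured command, so it takes precedence over the copy defined in the image rather than literally overwriting that file.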
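The container side of this contract can be sketched as follows. The source of feature_engineering.py is not shown in this notebook, so this is only a hypothetical skeleton: the function name transform_file, the delimiter argument, and the placeholder "transformation" (it merely trims whitespace from each field) are illustrative assumptions. Only the two container paths and the output file name are taken from the job above.

```python
import csv
import os

INPUT_DIR = "/opt/ml/processing/input"    # ProcessingInput destination
OUTPUT_DIR = "/opt/ml/processing/output"  # ProcessingOutput source


def transform_file(input_dir=INPUT_DIR, output_dir=OUTPUT_DIR, delimiter=","):
    """Read the first CSV in input_dir and write a cleaned copy to output_dir."""
    csv_names = sorted(n for n in os.listdir(input_dir) if n.endswith(".csv"))
    if not csv_names:
        raise FileNotFoundError(f"no CSV file found in {input_dir}")
    os.makedirs(output_dir, exist_ok=True)
    out_path = os.path.join(output_dir, "bank_additional_transformed.csv")
    in_path = os.path.join(input_dir, csv_names[0])
    with open(in_path, newline="") as src, open(out_path, "w", newline="") as dst:
        writer = csv.writer(dst, delimiter=delimiter)
        for row in csv.reader(src, delimiter=delimiter):
            # Placeholder step (assumption): real feature engineering goes here
            writer.writerow([field.strip() for field in row])
    return out_path
```

Inside the container, calling transform_file() with no arguments would use the two processing paths; passing temporary directories instead allows the same function to be tried locally before a job is submitted.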