# Label your dataset with Amazon SageMaker Ground Truth and SageMaker processing jobs


In [1]:
import boto3
import json
import numpy
import os
import sagemaker

from sagemaker.sklearn.processing import SKLearnProcessor
from sagemaker.processing import ProcessingOutput

sm_client = boto3.client('sagemaker')
s3_resource = boto3.resource('s3')
sm_session = sagemaker.Session()
role = sagemaker.get_execution_role()
# The role ARN is the role you enter when creating the labeling job.
# Copy the ARN printed below and keep it somewhere handy.
region = boto3.Session().region_name
print("The IAM execution role ARN: ", role)


sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /home/sagemaker-user/.config/sagemaker/config.yaml


### Create a labeling job in Amazon SageMaker Ground Truth

In [None]:
#### If you don't have a suitable dataset, use the sample data below.
#### The next cell copies the sample images from S3 to your own S3 bucket for bounding box labeling.

In [None]:
bucket = sm_session.default_bucket()

!aws s3 sync s3://sagemaker-sample-files/datasets/image/caltech-101/inference/ s3://{bucket}/raw_images/

print('Paste the link below into a web browser to confirm that the ten images were uploaded to your bucket:')
print(f'https://s3.console.aws.amazon.com/s3/buckets/{bucket}/raw_images/')

print('\nWhen SageMaker prompts you for the S3 location of the input dataset, paste the S3 URI below:')

print(f's3://{bucket}/raw_images/')

print('\nWhen SageMaker prompts you to specify a new location, paste the S3 URI below:')

print(f's3://{bucket}/labeled-data/')
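
The upload can also be checked from code instead of the console. The helper below is a minimal sketch: `count_objects` is a hypothetical name, not part of the notebook, and it assumes the images were synced to the `raw_images/` prefix as in the `aws s3 sync` command above.

```python
def count_objects(s3_client, bucket, prefix):
    """Count the objects stored under `prefix` in `bucket`."""
    paginator = s3_client.get_paginator("list_objects_v2")
    total = 0
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        # "Contents" is absent from a page when no keys match the prefix.
        total += len(page.get("Contents", []))
    return total

# Usage in this notebook (needs AWS credentials), with the names defined above:
# print(count_objects(s3_resource.meta.client, bucket, "raw_images/"))
```

Passing the client in as an argument keeps the helper easy to try with a stub before pointing it at a real bucket.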

To train a custom YOLOv11 model, you need a labeled dataset. You can label an object detection dataset with Amazon SageMaker Ground Truth.

| ⚠️ WARNING: If you have already labeled an object detection dataset with Amazon SageMaker Ground Truth, you can skip to the "**Get Job Details and Labels**" section. |
| -- |

# Create a Labeling Workforce

Follow the steps in the SageMaker Ground Truth documentation here: https://docs.aws.amazon.com/sagemaker/latest/dg/sms-workforce-create-private-console.html#create-workforce-labeling-job


# Create your bounding box labeling job

Follow the steps in the SageMaker Ground Truth documentation here: https://docs.aws.amazon.com/sagemaker/latest/dg/sms-create-labeling-job-console.html

If using the AWS Console, you should create a labeling job with the following options:

1. Job name: Enter any unique name, for example "Object-Detection-bdsm1".
2. Leave the "I want to specify a label attribute..." option un-checked.
3. Input data setup: Pick "Automated data setup".
4. Input dataset location: Copy and paste the location of the single folder with your images in S3. Example: "s3://mybucket/raw_images".
5. Output dataset location: Choose "Same location as input dataset".
6. Data type: Choose "Image".
7. IAM Role: Create a new role and give access to the S3 bucket where your images are located, or any S3 bucket.
8. Now hit "Complete data setup" and wait for it to be ready.
9. Task category: Choose "Image" and select "Bounding box", then hit "Next".
10. Worker types: Select "Private" and choose your team for the "Private teams" option.
11. For the Bounding box labeling tool: Enter a description and instructions, and in the "Labels" section add the labels relevant to your job, for example "airplane", "car", "ferry", "helicopter", "motorbike". YOLO models require lowercase label names.
12. Finally choose "Create".

# Get Job Details and Labels

Once you have finished labeling your images, retrieve the information needed to create the processing job, which builds the dataset in the format YOLOv11 expects.

In [19]:
groundtruth_job_name = "Object-Detection-bdsm1" ### <-- Replace with the name you used for your labeling job

In [21]:
response = sm_client.describe_labeling_job(
    LabelingJobName=groundtruth_job_name
)

labelingJobStatus = response["LabelingJobStatus"]
labelsListUri = response["LabelCategoryConfigS3Uri"]

print("Job Status: ",labelingJobStatus)
print("Labels Uri: ", labelsListUri)

Job Status:  Completed
Labels Uri:  s3://sagemaker-us-west-2-986221661979/raw_images/Object-Detection-bdsm1/annotation-tool/data.json
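
Besides the status and the label category file, the `describe_labeling_job` response carries label counters and the location of the output manifest. The sketch below pulls those fields out of a response dictionary; `summarize_labeling_job` is a hypothetical helper and the sample response is made up for illustration.

```python
def summarize_labeling_job(response):
    """Pick the fields of a describe_labeling_job response that later steps need."""
    counters = response.get("LabelCounters", {})
    output = response.get("LabelingJobOutput", {})
    return {
        "status": response["LabelingJobStatus"],
        "labeled": counters.get("TotalLabeled", 0),
        "unlabeled": counters.get("Unlabeled", 0),
        # The manifest URI is only present once the job has produced output.
        "manifest_uri": output.get("OutputDatasetS3Uri"),
    }

# Made-up response, trimmed to the fields used above.
example_response = {
    "LabelingJobStatus": "Completed",
    "LabelCounters": {"TotalLabeled": 10, "Unlabeled": 0},
    "LabelingJobOutput": {"OutputDatasetS3Uri": "s3://example-bucket/output.manifest"},
}
print(summarize_labeling_job(example_response))
```

Checking that `unlabeled` is zero before starting the processing job is a cheap way to catch a job that was stopped early.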


### Get labels

We need to retrieve the labels from the labeling job; they are stored in S3.

In [22]:
def split_s3_path(s3_path):
    """Split an S3 URI such as s3://bucket/path/to/key into (bucket, key)."""
    path_parts = s3_path.replace("s3://", "").split("/")
    bucket = path_parts.pop(0)
    key = "/".join(path_parts)
    return bucket, key

def get_labels_list(labels_uri):
    """Download the label category file from S3 and return the label names."""
    bucket, key = split_s3_path(labels_uri)
    s3_resource.meta.client.download_file(bucket, key, 'labels.json')
    with open('labels.json') as f:
        data = json.load(f)
    # Each entry in data["labels"] is an object with a "label" key.
    return [label["label"] for label in data["labels"]]

In [23]:
labels = get_labels_list(labelsListUri)
print("Labels: ",labels)

Labels:  ['airplane', 'car', 'ferry', 'helicopter', 'moterbike']
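
For reference, the label category file read above is a small JSON document with a `labels` array. The sketch below parses a hypothetical two-label file from a string, so it runs without S3 access, and builds a name-to-index mapping. It assumes that class ids follow the order of the labels in the file.

```python
import json

# Hypothetical file contents; a real file lists the labels entered in the console.
sample_config = """
{
  "document-version": "2018-11-28",
  "labels": [{"label": "airplane"}, {"label": "car"}]
}
"""

def parse_labels(config_text):
    """Return the label names in file order."""
    return [entry["label"] for entry in json.loads(config_text)["labels"]]

label_names = parse_labels(sample_config)
# Assumption: the position of a label in the file is its class index.
class_index = {name: i for i, name in enumerate(label_names)}
print(class_index)
```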


### Create a SageMaker Processing Job
Use scikit-learn to run the data preprocessing or postprocessing work.


In [24]:
sklearn_processor = SKLearnProcessor(
    framework_version="1.0-1",
    instance_type="ml.c5.xlarge",
    env={'gt_job_name': groundtruth_job_name,
        'region': region},
    instance_count=1,
    base_job_name="yolov11-process",
    role=role,
    sagemaker_session=sm_session
)

INFO:sagemaker.image_uris:Defaulting to only available Python version: py3


In [25]:
sklearn_processor.run(
    outputs=[
        ProcessingOutput(output_name="train", source="/opt/ml/processing/output/train")
    ],
    code="code/preprocessing.py",
)

INFO:sagemaker:Creating processing-job with name yolov11-process-2025-01-08-13-00-11-518


...............
..
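
The script `code/preprocessing.py` is not shown in this notebook, so the following is not its actual content. It is only a sketch of the core conversion such a job might perform: turning one line of a Ground Truth bounding box output manifest into YOLO label rows. The function name and the sample line are hypothetical, and the label attribute name is assumed to equal the job name, which is the default when no label attribute is specified.

```python
import json

def manifest_line_to_yolo(line, label_attribute):
    """Convert one Ground Truth bounding box manifest line to YOLO label rows."""
    record = json.loads(line)
    result = record[label_attribute]
    size = result["image_size"][0]
    img_w, img_h = size["width"], size["height"]
    rows = []
    for box in result["annotations"]:
        # Ground Truth stores the top-left corner and size in pixels;
        # YOLO expects the box center and size as fractions of the image.
        x_center = (box["left"] + box["width"] / 2) / img_w
        y_center = (box["top"] + box["height"] / 2) / img_h
        rows.append(
            f'{box["class_id"]} {x_center:.6f} {y_center:.6f} '
            f'{box["width"] / img_w:.6f} {box["height"] / img_h:.6f}'
        )
    return record["source-ref"], rows

# Made-up manifest line: one centered box on a 200x100 image.
sample_line = json.dumps({
    "source-ref": "s3://example-bucket/raw_images/img1.jpg",
    "Object-Detection-bdsm1": {
        "image_size": [{"width": 200, "height": 100, "depth": 3}],
        "annotations": [
            {"class_id": 0, "left": 50, "top": 25, "width": 100, "height": 50}
        ],
    },
})
image_uri, yolo_rows = manifest_line_to_yolo(sample_line, "Object-Detection-bdsm1")
print(image_uri, yolo_rows)
```

Each returned row has the form `class x_center y_center width height`, which is the layout of a YOLO label text file with one row per box.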

In [26]:
# Copy the S3 output path; the next step requires it.
dataset_s3_uri = sklearn_processor.jobs[-1].describe()["ProcessingOutputConfig"]["Outputs"][0]["S3Output"]["S3Uri"]
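
Indexing `Outputs` with `[0]` works here because the job defines a single output. If more outputs are added later, looking the entry up by its `OutputName` is less fragile. The helper below is a sketch with a hypothetical name and a made-up job description.

```python
def output_s3_uri(job_description, output_name):
    """Return the S3 URI of the processing output called `output_name`."""
    for output in job_description["ProcessingOutputConfig"]["Outputs"]:
        if output["OutputName"] == output_name:
            return output["S3Output"]["S3Uri"]
    raise KeyError(f"No processing output named {output_name!r}")

# Made-up description, trimmed to the fields used above.
example_description = {
    "ProcessingOutputConfig": {
        "Outputs": [
            {
                "OutputName": "train",
                "S3Output": {"S3Uri": "s3://example-bucket/job/output/train"},
            }
        ]
    }
}
print(output_s3_uri(example_description, "train"))

# Usage in this notebook, with the names defined above:
# dataset_s3_uri = output_s3_uri(sklearn_processor.jobs[-1].describe(), "train")
```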

| ⚠️ WARNING: These are the details you will need to train your models based on the labeling job you completed. |
| -- |

In [27]:
print("Dataset S3 location: ", dataset_s3_uri)
print("Labels: ", labels)

Dataset S3 location:  s3://sagemaker-us-west-2-986221661979/yolov11-process-2025-01-08-13-00-11-518/output/train
Labels:  ['airplane', 'car', 'ferry', 'helicopter', 'moterbike']
