# Introduction

In 2021, during a project at Explore, our team classified tweets into one of four categories related to man-made climate change: **"Anti"**, **"Pro"**, **"Neutral"**, or **"News"**. At the time, we relied on traditional NLP preprocessing: lowercasing, stopword removal, and lemmatization with part-of-speech tagging. To convert text into vectors, we used Bag of Words and Word2Vec. For classification, we applied k-Nearest Neighbors, Random Forest, XGBoost, and Logistic Regression. This approach reached a macro-weighted F1 score of approximately **0.70** and an accuracy of around **0.77**.

Four years later, I became curious about how transformer-based techniques would compare with our earlier approach. To find out, I revisited the original dataset and fine-tuned a **RoBERTa model** as a baseline classifier. Fine-tuning **RoBERTa** yielded similar results with minimal data preprocessing.

This project demonstrates an end-to-end machine learning workflow, incorporating **SageMaker processing**, **SageMaker training jobs**, and **SageMaker endpoints**. It concludes with deploying the fine-tuned model as a real-time endpoint for practical use.


#### Set up

In [62]:
import sys
import os
from datetime import datetime
import boto3
import sagemaker

sm_boto3 = boto3.client("sagemaker")
sagemaker_session = sagemaker.session.Session()
region = sagemaker_session.boto_region_name
role = sagemaker.get_execution_role()
default_bucket = sagemaker_session.default_bucket()

project_name = "roberta"

#### Upload Input Data to S3

In [83]:
# Local path to your data
local_path = "train.csv"

# s3 uri where data will be uploaded
base_uri = os.path.join(f"s3://{default_bucket}", project_name, "opt/ml/processing/input")

input_data_s3_uri = sagemaker.s3.S3Uploader.upload(
    local_path=local_path,
    desired_s3_uri=base_uri,
)
print(input_data_s3_uri)

s3://sagemaker-us-east-1-770208914484/roberta/opt/ml/processing/input/train.csv


#### Prepare Preprocessing Script

1. **Read the Data**  
   Load the dataset into memory to begin the processing pipeline.

2. **Equal Sampling Across Labels**  
   Perform stratified sampling to ensure equal representation of each label in the dataset. Use replacement sampling (`replace=True`) to accommodate cases where the target sample size (`n`) exceeds the number of observations in any category (`n_k`).

3. **Train-Test Split**  
   Utilize the `train_test_split` function from the `sklearn` library to partition the data into training and testing sets.

4. **Create Dataset Objects**  
   Convert the data into Dataset objects for streamlined handling and compatibility. Refer to the [Hugging Face Dataset documentation](https://huggingface.co/docs/datasets/v1.0.2/exploring.html) for more details.

5. **Tokenization**  
   Tokenize the dataset to transform raw text into numerical inputs suitable for model training.

6. **Format Conversion**  
   Convert the tokenized data into the appropriate input format required for training, including attention masks and input IDs.
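
Step 2 can be sketched with pandas. Note that the preprocessing script in this section parses `--sample-size-per-label` but never applies it, so the following is a minimal illustration of the sampling described above rather than the code that produced the reported results. The helper name `sample_equal_per_label`, the seed, and the toy data are assumptions.

```python
import pandas as pd


def sample_equal_per_label(df, label_col, n, seed=42):
    """Return exactly n rows per label, drawn with replacement.

    Hypothetical helper: replace=True lets a label with fewer than n rows
    (n_k < n) still contribute n rows, at the cost of duplicated tweets.
    """
    sampled = df.groupby(label_col).sample(n=n, replace=True, random_state=seed)
    return sampled.reset_index(drop=True)


# Toy data: "Anti" has only 2 rows, fewer than the requested 4
toy = pd.DataFrame({
    "sentiment": ["Pro"] * 5 + ["Anti"] * 2,
    "message": [f"tweet {i}" for i in range(7)],
})
balanced = sample_equal_per_label(toy, "sentiment", n=4)
print(balanced["sentiment"].value_counts())
```

Because sampling with replacement duplicates rows, it is safer to apply it to the training split only, after the train-test split. Otherwise copies of the same tweet can land in both the training and test sets.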



In [84]:
%%writefile code/preprocessing.py
import subprocess
import sys

def install(name):
    subprocess.call([sys.executable, '-m', 'pip', 'install', name])

install('datasets==2.2.1')
install('transformers[torch]')
install('numpy==1.23.4')

import os
import argparse
import logging
import joblib

import re
import pandas as pd 
import numpy as np 

from sklearn.model_selection import train_test_split

from datasets import Dataset

from transformers import AutoTokenizer, AutoConfig, RobertaTokenizerFast

def tokenize(batch):
    return tokenizer(batch["text"], max_length=256, truncation=True, padding = True)

if __name__=="__main__":
    logger = logging.getLogger()
    logger.setLevel(logging.INFO)
    logger.addHandler(logging.StreamHandler())
    
    parser = argparse.ArgumentParser()
    parser.add_argument("--file-name", type=str, default='train.csv')
    parser.add_argument("--target",  type=str, default='label')
    parser.add_argument("--text",  type=str, default='text')
    parser.add_argument("--sample-size-per-label",  type=int, default=100)
    parser.add_argument("--model-name",  type=str, default='roberta-base')
    
    args, _ = parser.parse_known_args()
    
    base_dir = "/opt/ml/processing"
    input_dir = os.path.join(base_dir, "input")

    logger.info("Reading in dataset...")
    df = pd.read_csv(os.path.join(input_dir, args.file_name))
    
    label2id = {label:i for i, label in enumerate(np.unique(df[args.target]))}
    id2label = dict(zip(label2id.values(), label2id.keys()))

    df["label"] = df[args.target].map(label2id)
    df["text"] = df[args.text]

    train_df, test_df = train_test_split(df[["label", "text"]], 
                                         test_size=0.2, 
                                         stratify = df["label"]                                 
                                        )
    MODEL = args.model_name
    tokenizer =  AutoTokenizer.from_pretrained(MODEL)

    full_train_dataset = Dataset.from_pandas(train_df).train_test_split(test_size=0.2)
    test_dataset = Dataset.from_pandas(test_df)
    
    train_dataset = full_train_dataset["train"].map(tokenize,
                                                batched=True,
                                                batch_size=len(full_train_dataset["train"])
                                               )
    val_dataset = full_train_dataset["test"].map(tokenize, 
                                             batched=True, 
                                             batch_size=len(full_train_dataset["test"])
                                            )

    test_dataset = test_dataset.map(tokenize, batched=True, batch_size=len(test_dataset))

    train_dataset.set_format("torch", columns=["input_ids", "attention_mask", "label"])
    val_dataset.set_format("torch", columns=["input_ids", "attention_mask", "label"])
    test_dataset.set_format("torch", columns=["input_ids", "attention_mask", "label"])
    
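    train_dir = os.path.join(base_dir,  "output", "train")
    val_dir = os.path.join(base_dir, "output", "validation")  
    test_dir = os.path.join(base_dir, "output", "test")
    artifact_dir = os.path.join(base_dir, "output", "artifact")

    # Make sure the output directories exist before writing to them
    for output_dir in (train_dir, val_dir, test_dir, artifact_dir):
        os.makedirs(output_dir, exist_ok=True)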

    logger.info("Saving datasets and additional artifacts...")
    joblib.dump(train_dataset, os.path.join(train_dir, "train.joblib"))
    joblib.dump(val_dataset, os.path.join(val_dir, "validation.joblib"))
    joblib.dump(test_dataset, os.path.join(test_dir, "test.joblib"))
    joblib.dump(label2id, os.path.join(artifact_dir, "label2id.joblib"))
    joblib.dump(id2label, os.path.join(artifact_dir, "id2label.joblib"))

Overwriting code/preprocessing.py
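
To make the label artifacts concrete, here is a small sketch of what `label2id` and `id2label` look like for four string labels. The label values are assumed for illustration (the real ones come from the `sentiment` column); the construction mirrors the script, with an added cast to plain Python strings.

```python
import numpy as np

# Assumed example labels; in the script these come from df[args.target]
labels = np.array(["Pro", "News", "Anti", "Neutral", "Pro", "Anti"])

# np.unique returns the sorted distinct labels, so ids follow sorted order
label2id = {str(label): i for i, label in enumerate(np.unique(labels))}
id2label = {i: label for label, i in label2id.items()}

print(label2id)
print(id2label)
```

Because the ids depend on the sorted set of labels seen during preprocessing, the same two files must be reused at training time so that predictions map back to the right class names.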


#### Running a SageMaker Processing Job

Follow these steps to set up and execute a SageMaker processing job effectively:

1. **Set Output Directories in S3**  
   Define the output paths in Amazon S3 where the results of the processing job will be stored.

2. **Initialize a PyTorchProcessor Instance**  
   Create a `PyTorchProcessor` instance, specifying the appropriate parameters such as:  
   - IAM roles  
   - Instance type  
   - Framework version  
   - `base_job_name` for tracking the job

3. **Run the Processing Job**  
   Execute the processing job by passing the required arguments:  
   - **`code`**: The path to your preprocessing script.  
   - **`inputs`**: A `ProcessingInput` object with the following details:  
     - `input_name`: A name for the input.  
     - `source`: The S3 URI of the input data uploaded earlier.  
     - `destination`: The directory inside the processing container where the input data will be accessed.  
   - **`outputs`**: A `ProcessingOutput` object with the following details:  
     - `output_name`: A name for the output.  
     - `source`: The directory inside the processing container where the output is stored.  
     - `destination`: The S3 URI where the processed outputs will be saved.
   - **`arguments`**: Any additional arguments required for preprocessing

With these pieces in place, the processing job reads the raw CSV from S3 and writes the tokenized train, validation, and test sets, plus the label mappings, back to S3 for the training step.


In [85]:
from sagemaker.pytorch.processing import PyTorchProcessor
from sagemaker.processing import ProcessingInput, ProcessingOutput

# set output directories in s3
train_s3_destination = os.path.join("s3://", default_bucket, project_name, "opt/ml/processing/output/train")
val_s3_destination = os.path.join("s3://", default_bucket, project_name, "opt/ml/processing/output/validation")
test_s3_destination = os.path.join("s3://", default_bucket, project_name, "opt/ml/processing/output/test")
artifact_s3_destination = os.path.join("s3://", default_bucket, project_name, "opt/ml/processing/output/artifact")

# Specify the Hugging Face model name; its tokenizer is used to tokenize the documents
model_name = "roberta-base"

#Initialize the PyTorchProcessor
pytorch_processor = PyTorchProcessor(
    framework_version='1.8',
    role=role,
    instance_type='ml.m5.xlarge',
    instance_count=1,
    base_job_name=f"{project_name}-processing"
)

#Run the processing job
pytorch_processor.run(
    code='code/preprocessing.py',
    inputs=[
        ProcessingInput(
            input_name='data',
            source=input_data_s3_uri,
            destination='/opt/ml/processing/input'
        )
    ],
    outputs=[
        ProcessingOutput(output_name='train',
                         source='/opt/ml/processing/output/train',
                         destination=train_s3_destination),
        ProcessingOutput(output_name='validation',
                         source='/opt/ml/processing/output/validation', 
                         destination=val_s3_destination),
        ProcessingOutput(output_name='test', 
                         source='/opt/ml/processing/output/test', 
                         destination=test_s3_destination),
        ProcessingOutput(output_name='artifact', 
                         source='/opt/ml/processing/output/artifact', 
                         destination=artifact_s3_destination),
    ],
    arguments= [
        "--file-name", "train.csv",
        "--target", "sentiment",
        "--text", "message",
        "--sample-size-per-label", "1000",
        "--model-name", model_name
    ]
)

INFO:sagemaker.image_uris:image_uri is not presented, retrieving image_uri based on instance_type, framework etc.
INFO:sagemaker.processing:Uploaded None to s3://sagemaker-us-east-1-770208914484/roberta-processing-2025-01-12-18-26-27-372/source/sourcedir.tar.gz
INFO:sagemaker.processing:runproc.sh uploaded to s3://sagemaker-us-east-1-770208914484/roberta-processing-2025-01-12-18-26-27-372/source/runproc.sh
INFO:sagemaker:Creating processing-job with name roberta-processing-2025-01-12-18-26-27-372


............
Collecting datasets==2.2.1
  Downloading datasets-2.2.1-py3-none-any.whl (342 kB)
Collecting xxhash
  Downloading xxhash-3.2.0-cp36-cp36m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (211 kB)
Collecting huggingface-hub<1.0.0,>=0.1.0
  Downloading huggingface_hub-0.4.0-py3-none-any.whl (67 kB)
Collecting responses<0.19
  Downloading responses-0.17.0-py2.py3-none-any.whl (38 kB)
Collecting tqdm>=4.62.1
  Downloading tqdm-4.64.1-py2.py3-none-any.whl (78 kB)
Collecting importlib-resources
  Downloading importlib_resources-5.4.0-py3-none-any.whl (28 kB)
Installing collected packages: importlib-resources, tqdm, xxhash, responses, huggingface-hub, datasets
  Attempting uninstall: tqdm
    Found existing installation: tqdm 4.61.2
    Uninstalling tqdm-4.61.2:
      Successfully uninstalled tqdm-4.61.2
Successfully installed datasets-2.2.1 huggingface-hub-0.4.0 importlib-resources-5.4.0 responses-0.17.0 tqdm-4.64.

#### Train Model

**Prepare training script:**
1. **Read Preprocessed Data and Artifacts**  
   Load the preprocessed datasets and artifacts generated during the processing step.

2. **Load the Base Model**  
   Import the base model that will be fine-tuned for your specific task.

3. **Train the Model**  
   Leverage the `Transformers` Trainer API to efficiently fine-tune the model on the provided data.

4. **Customize Model Labels**  
   To assign custom labels (instead of `LABEL_0`, `LABEL_1`, etc.), update the `config.json` file with `id2label` and `label2id` mappings before saving the fine-tuned model.

5. **Evaluate Model Performance**  
   The training script also evaluates the model on the test set. We aim to separate the evaluation step at a later stage of the project.
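
The training script in this section leaves `compute_metrics` commented out in the `Trainer` call. A minimal sketch of such a function is shown here, assuming accuracy plus macro- and weighted-average F1 are the metrics of interest. The metric names in the returned dictionary are arbitrary choices, not part of the original script.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score


def compute_metrics(eval_pred):
    """Turn (logits, labels) from the Trainer into a dict of metrics."""
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    return {
        "accuracy": accuracy_score(labels, preds),
        "f1_macro": f1_score(labels, preds, average="macro"),
        "f1_weighted": f1_score(labels, preds, average="weighted"),
    }


# Toy check with two classes and four examples
toy_logits = np.array([[2.0, 0.0], [0.0, 3.0], [1.0, 0.0], [0.0, 1.0]])
toy_labels = np.array([0, 1, 1, 1])
metrics = compute_metrics((toy_logits, toy_labels))
print(metrics)
```

Passing `compute_metrics=compute_metrics` to `Trainer` would report these values whenever evaluation runs on the validation set. Reporting both averages also makes it explicit which F1 variant a headline score refers to.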

In [86]:
%%writefile code/train.py
import os
import argparse
import logging
import joblib

import numpy as np
from sklearn.metrics import classification_report
from transformers import (AutoTokenizer, 
                          AutoConfig,
                          TrainingArguments,
                          AutoModelForSequenceClassification, 
                          Trainer)

def tokenize(batch):
    return tokenizer(batch["text"], max_length=280, truncation=True, padding = True)

if __name__=="__main__":
    logger = logging.getLogger()
    logger.setLevel(logging.INFO)
    logger.addHandler(logging.StreamHandler())
    logger.info("extracting arguments...")
    
    parser = argparse.ArgumentParser()
    # Data, model, and output directories
    parser.add_argument("--model-dir", type=str, default=os.environ.get("SM_MODEL_DIR"))
    parser.add_argument("--train", type=str, default=os.environ.get("SM_CHANNEL_TRAIN"))
    parser.add_argument("--validation", type=str, default=os.environ.get("SM_CHANNEL_VALIDATION"))
    parser.add_argument("--test", type=str, default=os.environ.get("SM_CHANNEL_TEST"))
    parser.add_argument("--artifact", type=str, default=os.environ.get("SM_CHANNEL_ARTIFACT"))
    parser.add_argument("--model_name", type=str, default = "roberta-base")
    args, _ = parser.parse_known_args()

    # load labels mappings for the config.json file   
    id2label = joblib.load(os.path.join(args.artifact, "id2label.joblib"))
    label2id = joblib.load(os.path.join(args.artifact, "label2id.joblib"))

    # load model
    MODEL = args.model_name
    tokenizer = AutoTokenizer.from_pretrained(MODEL)
    model = AutoModelForSequenceClassification.from_pretrained(MODEL, num_labels=len(id2label))
    model = model.to('cuda')

    # To ensure we get actual labels during inference
    model.config.id2label =  id2label 
    model.config.label2id = label2id 

    # Load input datasets
    train_dataset = joblib.load(os.path.join(args.train, 'train.joblib'))
    val_dataset = joblib.load(os.path.join(args.validation, 'validation.joblib'))
    test_dataset = joblib.load(os.path.join(args.test, "test.joblib"))

    # Add training arguments
    training_args = TrainingArguments(
        output_dir= args.model_dir 
    )

    # Trainer
    trainer = Trainer(
        model=model,
        args=training_args,
        # compute_metrics=compute_metrics,
        train_dataset=train_dataset,
        eval_dataset=val_dataset,
        tokenizer=tokenizer
    
    )
    trainer.train()
    
    predictions = trainer.predict(test_dataset)
    actuals = predictions.label_ids
    preds = np.argmax(predictions.predictions, axis=-1)

    # Order class names by label id so they line up with the rows of the report
    target_names = [str(id2label[i]) for i in sorted(id2label)]
    clf_report = classification_report(actuals, preds, output_dict=True, target_names=target_names)

    # Note: this file is a joblib pickle despite the .json extension
    joblib.dump(clf_report, os.path.join(args.model_dir, 'evaluation.json'))

    logger.info(f"Classification report : {clf_report}")

    # trainer.save_model(args.model_dir)
    model.save_pretrained(args.model_dir)
    tokenizer.save_pretrained(args.model_dir)

Overwriting code/train.py


**Train Model using SageMaker Training Job**

SageMaker training jobs simplify the process of training machine learning models by providing a scalable and managed infrastructure. Here’s how to set up a training job using the Hugging Face Estimator:

1. **Create a Hugging Face Estimator**  
   Use the `HuggingFace` Estimator to define the configuration for your training job. Key arguments include:  
   - **`entry_point`**: The path to your custom training script. This script contains the logic for model training and evaluation.  
   - **`source_dir`** *(optional)*: A directory containing all supplementary scripts or files required for the training process.  
   - **`hyperparameters`**: A dictionary of additional parameters needed for training, such as learning rate, batch size, and epochs.  
   - **`role`**: An AWS IAM role that grants the permissions necessary to initiate and manage SageMaker training jobs.  
   - **`instance_type`**: Specifies the type of compute instance for training. For deep learning tasks, GPU instances (e.g., `ml.p3.2xlarge`) are recommended.  
   - **`py_version`**: The Python version to be used (e.g., `py36` for Python 3.6).  
   - **`pytorch_version`** and **`transformers_version`**: Specify the versions of PyTorch and Transformers libraries to match your training environment.  

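2. **Benefits of Using SageMaker Training Jobs**  
   - **Scalability**: Compute is provisioned per job, so you can move to a larger instance type or a higher instance count by changing the estimator arguments.  
   - **Managed Infrastructure**: SageMaker provisions the instances, pulls the training container, and releases the resources when the job ends, so you can focus on model development.  
   - **Seamless Integration**: Easily integrates with other AWS services like S3 for data storage and CloudWatch for monitoring.

3. **Start the Job with `fit`**  
   The `fit` method takes the training inputs as a dictionary that maps channel names to S3 URIs. Here, the channels point to the preprocessed datasets and the label-mapping artifacts; each channel is exposed to the training script through an `SM_CHANNEL_<NAME>` environment variable.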
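
In script mode, SageMaker passes each entry of `hyperparameters` to the entry point as a command-line argument (`--key value`), which is why `train.py` reads `--model_name` with `argparse`. The sketch below imitates that hand-off locally. The function name `parse_hyperparameters` and the `epochs` and `learning_rate` arguments are hypothetical additions that the original script does not define.

```python
import argparse


def parse_hyperparameters(argv):
    """Parse hyperparameters the way the training script receives them."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--model_name", type=str, default="roberta-base")
    # Hypothetical extras, not present in the original train.py
    parser.add_argument("--epochs", type=int, default=3)
    parser.add_argument("--learning_rate", type=float, default=5e-5)
    # parse_known_args ignores any extra arguments the platform may add
    args, _ = parser.parse_known_args(argv)
    return args


# Simulate the arguments built from
# hyperparameters = {"model_name": "roberta-base", "epochs": 2}
args = parse_hyperparameters(["--model_name", "roberta-base", "--epochs", "2"])
print(args.model_name, args.epochs, args.learning_rate)
```

To take effect, such values would still need to be forwarded to `TrainingArguments`, for example as `num_train_epochs` and `learning_rate`.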


In [87]:
# Set additional arguments for training 
hyperparameters = {
    'model_name': model_name
}

In [88]:
from sagemaker.huggingface import HuggingFace

huggingface_estimator = HuggingFace(entry_point='train.py',
                                    source_dir='code',
                                    instance_type='ml.p3.2xlarge',
                                    instance_count=1,
                                    role=role,
                                    transformers_version='4.6.1',
                                    pytorch_version='1.7.1',
                                    py_version='py36',
                                    hyperparameters=hyperparameters
                                   )

# Input data directories in S3, one channel per dataset
training_args = {
    "train" : train_s3_destination,
    "validation" : val_s3_destination,
    "artifact" : artifact_s3_destination,
    "test" : test_s3_destination
}
huggingface_estimator.fit(training_args, wait=True)

INFO:sagemaker.image_uris:image_uri is not presented, retrieving image_uri based on instance_type, framework etc.
INFO:sagemaker.image_uris:image_uri is not presented, retrieving image_uri based on instance_type, framework etc.
INFO:sagemaker:Creating training-job with name: huggingface-pytorch-training-2025-01-12-18-29-19-377


2025-01-12 18:29:20 Starting - Starting the training job
2025-01-12 18:29:20 Pending - Training job waiting for capacity......
2025-01-12 18:29:58 Pending - Preparing the instances for training...
2025-01-12 18:30:37 Downloading - Downloading input data...
2025-01-12 18:30:57 Downloading - Downloading the training image...............
2025-01-12 18:33:39 Training - Training image download completed. Training in progress...
bash: cannot set terminal process group (-1): Inappropriate ioctl for device
bash: no job control in this shell
2025-01-12 18:33:59,461 sagemaker-training-toolkit INFO     Imported framework sagemaker_pytorch_container.training
2025-01-12 18:33:59,492 sagemaker_pytorch_container.training INFO     Block until all host DNS lookups succeed.
2025-01-12 18:33:59,495 sagemaker_pytorch_container.training INFO     Invoking user training script.
2025-01-12 18:33:59,679 sagemaker-training-toolkit INFO     Installing dependencies

Training took about 1,044 seconds, or roughly 17 minutes. For comparison, training on a quarter of the data on an instance without a GPU took 2 hours 15 minutes.
The model reached an overall macro-weighted F1 score of 0.70 and an accuracy of 0.79.

In [90]:
# obtain s3 uri where the model is saved
huggingface_estimator.latest_training_job.wait(logs="None")
artifact = sm_boto3.describe_training_job(
    TrainingJobName=huggingface_estimator.latest_training_job.name
)["ModelArtifacts"]["S3ModelArtifacts"]

print("Model artifact persisted at " + artifact)


2025-01-12 18:48:06 Starting - Starting the training job
2025-01-12 18:48:06 Pending - Preparing the instances for training
2025-01-12 18:48:06 Downloading - Downloading the training image
2025-01-12 18:48:06 Training - Training image download completed. Training in progress.
2025-01-12 18:48:06 Uploading - Uploading generated training model
2025-01-12 18:48:06 Completed - Training job completed
Model artifact persisted at s3://sagemaker-us-east-1-770208914484/huggingface-pytorch-training-2025-01-12-18-29-19-377/output/model.tar.gz


#### Deploy Model : Real Time Endpoint

**`env={'HF_TASK': 'text-classification'}`**
- Sets environment variables for the model container.
- `HF_TASK` specifies the task type, which is **text classification** in this case.

**`model_data=artifact`**
- Refers to the **S3 URI** containing the trained model artifacts (e.g., `.tar.gz` file with the model and tokenizer).
- This allows the container to load the model for inference.

**`role=role`**
- Defines the **IAM role** SageMaker uses to access AWS resources like S3.
- Ensures secure access to the model artifacts and other necessary services.

**`transformers_version="4.6.1"`**
- Specifies the version of the Hugging Face Transformers library to use.
- Ensures compatibility with the model's tokenizer and configuration.

**`pytorch_version="1.7.1"`**
- Indicates the version of PyTorch for the container environment, aligned with the model's requirements.

**`py_version='py36'`**
- Specifies the Python version for the SageMaker environment.

---

**Purpose**
The code prepares a Hugging Face model for deployment on SageMaker by defining all the necessary configurations, including task type, model artifacts, and compatible library versions.

---

**Output**
The result is a `HuggingFaceModel` object (`huggingface_model`) that can:
- Be deployed to a SageMaker endpoint for real-time inference.
- Be used in batch transform jobs for offline processing.







In [93]:
from sagemaker.huggingface.model import HuggingFaceModel

# create Hugging Face Model Class
huggingface_model = HuggingFaceModel(
    env={'HF_TASK':'text-classification'},
    model_data=artifact,
    role=role,
    transformers_version="4.6.1",
    pytorch_version="1.7.1", 
    py_version='py36',
)

In [94]:
predictor = huggingface_model.deploy(
    instance_type="ml.p3.2xlarge",
    initial_instance_count=1
)

INFO:sagemaker:Creating model with name: huggingface-pytorch-inference-2025-01-12-18-50-39-496
INFO:sagemaker:Creating endpoint-config with name huggingface-pytorch-inference-2025-01-12-18-50-40-217
INFO:sagemaker:Creating endpoint with name huggingface-pytorch-inference-2025-01-12-18-50-40-217


----------!

The deployed endpoint can now be tested with a few example tweets:

In [96]:
data = {
    'inputs' : ['RT @derekahunter: Now they make 100 year predictions of doom &amp; gloom for climate change because no one will be around to call BS when it do…', 
                'RT @ClimateReality: We’re proud to know @EarthGuardianz and the inspiring work they do on climate change #LeadOnClimate https://t.co/Ok3D0a…',
                "RT @CBSNews: Bernie Sanders: 'What astounds me is that we now have a president-elect who does not believe climate change is realÃ¢â‚¬Â¦",
                '♥The Taiwan government should apologize to the whole world, making air pollution caused the global warming.\n\nhttps://t.co/iSY6XmoBmq',
                "Sandstorms: Day II. Fuck father time's and mother nature's inbred offspring that is climate change."
               ]
}

predictor.predict(data)

[{'label': 'Anti', 'score': 0.9833977818489075},
 {'label': 'Pro', 'score': 0.9952699542045593},
 {'label': 'News', 'score': 0.99051433801651},
 {'label': 'Pro', 'score': 0.9753908514976501},
 {'label': 'Pro', 'score': 0.6477715969085693}]
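
Applications that do not use the SageMaker Python SDK can call the same endpoint through the `sagemaker-runtime` client in `boto3`. The following is a sketch under assumptions: the helper name `classify_tweets` is illustrative, and the endpoint name in the usage comment is a placeholder to replace with a live endpoint. The client is passed in as an argument so the function can also be exercised with a stub.

```python
import json


def classify_tweets(runtime_client, endpoint_name, tweets):
    """Send tweets to a Hugging Face text-classification endpoint.

    runtime_client is expected to behave like boto3.client("sagemaker-runtime").
    """
    response = runtime_client.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType="application/json",
        Body=json.dumps({"inputs": tweets}),
    )
    # The response body is a stream of JSON bytes
    return json.loads(response["Body"].read())


# Usage against a live endpoint (placeholder name, requires AWS credentials):
# import boto3
# runtime = boto3.client("sagemaker-runtime")
# classify_tweets(runtime, "<your-endpoint-name>", ["Climate change is real."])
```

The request body uses the same `inputs` key as the `predictor.predict` call above, so the response has the same list-of-dictionaries shape.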

#### Clean up

In [98]:
predictor.delete_endpoint()

INFO:sagemaker:Deleting endpoint configuration with name: huggingface-pytorch-inference-2025-01-12-18-50-40-217
INFO:sagemaker:Deleting endpoint with name: huggingface-pytorch-inference-2025-01-12-18-50-40-217


#### References

https://aws.amazon.com/blogs/machine-learning/fine-tune-and-host-hugging-face-bert-models-on-amazon-sagemaker/

https://stackoverflow.com/questions/77301807/sagemaker-an-error-occurred-modelerror-when-calling-the-invokeendpoint-operati

https://huggingface.co/docs/sagemaker/inference#create-a-model-artifact-for-deployment

https://github.com/huggingface/notebooks/blob/main/sagemaker/01_getting_started_pytorch/scripts/train.py

https://discuss.huggingface.co/t/how-to-save-custom-model-to-get-config-json-file/63665/3

https://docs.aws.amazon.com/sagemaker/latest/dg/processing-job-frameworks-pytorch.html

https://huggingface.co/transformers/v3.4.0/custom_datasets.html