# Binary Classification on the `imdb` dataset

### Development Environment and Permissions 

In [1]:
!pip install "sagemaker>=2.140.0" "transformers==4.26.1" "datasets[s3]==2.10.1" --upgrade

Collecting datasets==2.10.1 (from datasets[s3]==2.10.1)
  Using cached datasets-2.10.1-py3-none-any.whl.metadata (20 kB)
Using cached datasets-2.10.1-py3-none-any.whl (469 kB)
Installing collected packages: datasets
  Attempting uninstall: datasets
    Found existing installation: datasets 3.0.1
    Uninstalling datasets-3.0.1:
      Successfully uninstalled datasets-3.0.1
Successfully installed datasets-2.10.1


### Development environment 

In [4]:
import sagemaker.huggingface

sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /home/ec2-user/.config/sagemaker/config.yaml


### Permissions

In [5]:
import sagemaker
import boto3
sess = sagemaker.Session()
# sagemaker session bucket -> used for uploading data, models and logs
# sagemaker will automatically create this bucket if it does not exist
sagemaker_session_bucket = None
if sagemaker_session_bucket is None and sess is not None:
    # set to default bucket if a bucket name is not given
    sagemaker_session_bucket = sess.default_bucket()

try:
    role = sagemaker.get_execution_role()
except ValueError:
    iam = boto3.client('iam')
    role = iam.get_role(RoleName='sagemaker_execution_role')['Role']['Arn']

sess = sagemaker.Session(default_bucket=sagemaker_session_bucket)

print(f"sagemaker role arn: {role}")
print(f"sagemaker bucket: {sess.default_bucket()}")
print(f"sagemaker session region: {sess.boto_region_name}")

sagemaker role arn: arn:aws:iam::798111172440:role/service-role/c134909a3421853l7874054t1w79-SageMakerExecutionRole-N93UPDSTX8TN
sagemaker bucket: sagemaker-us-east-1-798111172440
sagemaker session region: us-east-1


# Preprocessing

We use the `datasets` library to download and preprocess the `imdb` dataset. After preprocessing, the dataset is uploaded to our `sagemaker_session_bucket` so that the training job can read it. The [imdb](http://ai.stanford.edu/~amaas/data/sentiment/) dataset consists of 25,000 highly polar movie reviews for training and 25,000 for testing.

## Tokenization 

In [6]:
pip install -U datasets huggingface_hub fsspec

Collecting datasets
  Using cached datasets-3.0.1-py3-none-any.whl.metadata (20 kB)
Collecting fsspec
  Using cached fsspec-2024.9.0-py3-none-any.whl.metadata (11 kB)
Using cached datasets-3.0.1-py3-none-any.whl (471 kB)
Installing collected packages: datasets
  Attempting uninstall: datasets
    Found existing installation: datasets 2.10.1
    Uninstalling datasets-2.10.1:
      Successfully uninstalled datasets-2.10.1
Successfully installed datasets-3.0.1
Note: you may need to restart the kernel to use updated packages.


In [1]:
from datasets import load_dataset
from transformers import AutoTokenizer

# tokenizer used in preprocessing
tokenizer_name = 'distilbert-base-uncased'

# dataset used
dataset_name = 'stanfordnlp/imdb'

# s3 key prefix for the data
s3_prefix = 'samples/datasets/imdb'

In [2]:
# download tokenizer
tokenizer = AutoTokenizer.from_pretrained(tokenizer_name)

# tokenizer helper function
def tokenize(batch):
    return tokenizer(batch['text'], padding='max_length', truncation=True)

# load the train and test splits
train_dataset, test_dataset = load_dataset(dataset_name, split=['train', 'test'])
# reduce the test dataset to 10k randomly selected examples
test_dataset = test_dataset.shuffle().select(range(10000))


# tokenize dataset
train_dataset = train_dataset.map(tokenize, batched=True)
test_dataset = test_dataset.map(tokenize, batched=True)

# set format for pytorch
train_dataset = train_dataset.rename_column("label", "labels")
train_dataset.set_format('torch', columns=['input_ids', 'attention_mask', 'labels'])
test_dataset = test_dataset.rename_column("label", "labels")
test_dataset.set_format('torch', columns=['input_ids', 'attention_mask', 'labels'])
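The `tokenize` helper relies on `padding='max_length'` and `truncation=True` so that every example ends up with the same length. The toy function below is only an illustration of that idea, not the tokenizer's actual implementation: the name `pad_or_truncate`, the `pad_id` of 0 and the token ids are made up for the example, and a real tokenizer also keeps the closing special token when it truncates.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Toy illustration: force a list of token ids to exactly max_length entries."""
    # truncation: drop everything beyond max_length
    ids = token_ids[:max_length]
    # the attention mask marks real tokens with 1
    mask = [1] * len(ids)
    # padding: fill up to max_length, masked out with 0
    n_pad = max_length - len(ids)
    return {
        "input_ids": ids + [pad_id] * n_pad,
        "attention_mask": mask + [0] * n_pad,
    }


# illustrative token ids, not real tokenizer output
short_example = pad_or_truncate([101, 2023, 102], max_length=5)
long_example = pad_or_truncate([101, 2023, 3185, 2001, 2307, 999, 102], max_length=5)
print(short_example)
print(long_example)
```

Both results have five entries; only the short one contains zeros in its `attention_mask`, which is how the model knows to ignore the padded positions.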




## Uploading data to `sagemaker_session_bucket`

In [6]:
# save train_dataset to s3
training_input_path = f's3://{sess.default_bucket()}/{s3_prefix}/train'
train_dataset.save_to_disk(training_input_path)

# save test_dataset to s3
test_input_path = f's3://{sess.default_bucket()}/{s3_prefix}/test'
test_dataset.save_to_disk(test_input_path)



# Fine-tuning & starting SageMaker Training Job


In [7]:
!pygmentize /home/ec2-user/SageMaker/code/train.py

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
from transformers import AutoModelForSequenceClassification, Trainer, TrainingArguments, AutoTokenizer
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
from datasets import load_from_disk
import random
import logging
import sys
import argparse
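The listing is cut off after the imports. Because it imports `accuracy_score` and `precision_recall_fscore_support`, the script presumably defines a metrics function for the `Trainer`. The following is a minimal sketch of what such a function could look like, not the actual contents of `train.py`: the name `compute_metrics` and the choice of `average="binary"` are assumptions.

```python
from types import SimpleNamespace

import numpy as np
from sklearn.metrics import accuracy_score, precision_recall_fscore_support


def compute_metrics(pred):
    """Hypothetical metrics function in the shape Trainer(compute_metrics=...) expects."""
    labels = pred.label_ids
    # logits have shape (n_examples, n_classes); pick the highest-scoring class
    preds = pred.predictions.argmax(-1)
    precision, recall, f1, _ = precision_recall_fscore_support(
        labels, preds, average="binary"
    )
    acc = accuracy_score(labels, preds)
    return {"accuracy": acc, "f1": f1, "precision": precision, "recall": recall}


# stand-in for the EvalPrediction object that the Trainer passes in
fake_pred = SimpleNamespace(
    predictions=np.array([[2.0, 1.0], [0.0, 3.0], [1.0, 0.0], [0.0, 1.0]]),
    label_ids=np.array([0, 1, 1, 1]),
)
metrics = compute_metrics(fake_pred)
print(metrics)
```

The `SimpleNamespace` object only mimics the two attributes the function reads, so the sketch can be tried without starting a training run.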

In [8]:
!ls /home/ec2-user/SageMaker/code

sagemaker-notebook.ipynb  train.py


## Creating an Estimator and starting a training job

In [9]:
source_dir = "/home/ec2-user/SageMaker/code"

In [10]:
from sagemaker.huggingface import HuggingFace

# hyperparameters, which are passed into the training job
hyperparameters={'epochs': 1,
                 'train_batch_size': 32,
                 'model_name':'distilbert-base-uncased'
                 }
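SageMaker hands these hyperparameters to the entry point as command-line arguments, and exposes the input channels and the model output directory through environment variables such as `SM_CHANNEL_TRAIN`, `SM_CHANNEL_TEST` and `SM_MODEL_DIR`. How `train.py` reads them is not shown in this notebook, so the following is only a sketch of a typical approach: the function name `parse_args`, the argument names `training_dir`, `test_dir` and `model_dir`, and the default values are assumptions.

```python
import argparse
import os


def parse_args(argv=None):
    """Sketch: read hyperparameters and data locations inside a training script."""
    parser = argparse.ArgumentParser()

    # hyperparameters arrive as command-line arguments, e.g. --epochs 1
    parser.add_argument("--epochs", type=int, default=3)
    parser.add_argument("--train_batch_size", type=int, default=32)
    parser.add_argument("--model_name", type=str, default="distilbert-base-uncased")

    # channel names passed to fit() ('train', 'test') become SM_CHANNEL_<NAME> variables
    parser.add_argument("--training_dir", type=str, default=os.environ.get("SM_CHANNEL_TRAIN"))
    parser.add_argument("--test_dir", type=str, default=os.environ.get("SM_CHANNEL_TEST"))
    parser.add_argument("--model_dir", type=str, default=os.environ.get("SM_MODEL_DIR"))

    # parse_known_args tolerates extra arguments the script does not declare
    args, _ = parser.parse_known_args(argv)
    return args


# simulate the arguments produced by the hyperparameters dict above
args = parse_args(
    ["--epochs", "1", "--train_batch_size", "32", "--model_name", "distilbert-base-uncased"]
)
print(args.epochs, args.train_batch_size, args.model_name)
```

Outside a training container the `SM_*` variables are usually unset, so the three directory arguments fall back to `None` unless they are passed explicitly.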

In [13]:
huggingface_estimator = HuggingFace(entry_point='train.py',
                            source_dir=source_dir,
                            instance_type='ml.p3.2xlarge',
                            instance_count=1,
                            role=role,
                            transformers_version='4.26',
                            pytorch_version='1.13',
                            py_version='py39',
                            hyperparameters=hyperparameters)

In [14]:
# starting the train job with our uploaded datasets as input
huggingface_estimator.fit({'train': training_input_path, 'test': test_input_path})

INFO:sagemaker.image_uris:image_uri is not presented, retrieving image_uri based on instance_type, framework etc.
INFO:sagemaker:Creating training-job with name: huggingface-pytorch-training-2024-10-08-16-13-15-443


2024-10-08 16:13:21 Starting - Starting the training job
2024-10-08 16:13:21 Pending - Training job waiting for capacity......
2024-10-08 16:13:59 Pending - Preparing the instances for training...
2024-10-08 16:14:47 Downloading - Downloading input data...
2024-10-08 16:15:13 Downloading - Downloading the training image........................
2024-10-08 16:19:20 Training - Training image download completed. Training in progress...
bash: cannot set terminal process group (-1): Inappropriate ioctl for device
bash: no job control in this shell
  "cipher": algorithms.TripleDES,
  "class": algorithms.TripleDES,
2024-10-08 16:19:36,337 sagemaker-training-toolkit INFO     Imported framework sagemaker_pytorch_container.training
2024-10-08 16:19:36,359 sagemaker-training-toolkit INFO     No Neurons detected (normal if no neurons installed)
2024-10-08 16:19:36,375 sagemaker_pytorch_container.training INFO     Block until all host DNS lookups succe

## Deploying the endpoint

In [15]:
predictor = huggingface_estimator.deploy(1, "ml.g4dn.xlarge")

INFO:sagemaker:Creating model with name: huggingface-pytorch-training-2024-10-08-16-52-26-483
INFO:sagemaker:Creating endpoint-config with name huggingface-pytorch-training-2024-10-08-16-52-26-483
INFO:sagemaker:Creating endpoint with name huggingface-pytorch-training-2024-10-08-16-52-26-483


-----------!

In [19]:
sentiment_input = {"inputs": "I hate using the new Inference DLC."}

predictor.predict(sentiment_input)

[{'label': 'LABEL_0', 'score': 0.9186965227127075}]
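The endpoint returns the generic names `LABEL_0` and `LABEL_1` because no human-readable label names were stored with the model. In the `imdb` dataset, class 0 is the negative class and class 1 the positive one, so the response above reads as a negative review. A small helper can make that explicit; the names `LABEL_NAMES` and `to_sentiment` are made up for this sketch.

```python
# assumption: the model kept the imdb label order (0 = negative, 1 = positive)
LABEL_NAMES = {"LABEL_0": "negative", "LABEL_1": "positive"}


def to_sentiment(predictions):
    """Replace generic LABEL_<n> names with readable sentiment names."""
    return [
        # unknown labels are passed through unchanged
        {"label": LABEL_NAMES.get(p["label"], p["label"]), "score": p["score"]}
        for p in predictions
    ]


# the response shown above, copied in as a literal
response = [{"label": "LABEL_0", "score": 0.9186965227127075}]
readable = to_sentiment(response)
print(readable)
```

With a live endpoint, the same helper could be applied to the return value of `predictor.predict(sentiment_input)`.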

In [20]:
predictor.delete_model()
predictor.delete_endpoint()

INFO:sagemaker:Deleting model with name: huggingface-pytorch-training-2024-10-08-16-52-26-483
INFO:sagemaker:Deleting endpoint configuration with name: huggingface-pytorch-training-2024-10-08-16-52-26-483
INFO:sagemaker:Deleting endpoint with name: huggingface-pytorch-training-2024-10-08-16-52-26-483
