# Train model on SageMaker

Load environment variables (mainly AWS configuration) from the local `.env` file.

In [22]:
%load_ext dotenv
%dotenv

The dotenv extension is already loaded. To reload it, use:
  %reload_ext dotenv
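
Before creating a session, it can help to confirm that the expected settings were actually loaded. The sketch below is a minimal check. The variable names are the standard ones boto3 reads, but which of them this project's `.env` defines is an assumption, and `missing_aws_vars` is a hypothetical helper rather than part of any library.

```python
import os

# Standard environment variables read by boto3. Whether this project's
# .env file defines all of them is an assumption.
REQUIRED_AWS_VARS = ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY", "AWS_DEFAULT_REGION")


def missing_aws_vars(env=None):
    """Return the names in REQUIRED_AWS_VARS that are unset or empty in env."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_AWS_VARS if not env.get(name)]


missing = missing_aws_vars()
if missing:
    print("Missing AWS settings:", ", ".join(missing))
```

When the notebook runs on a SageMaker notebook instance, credentials usually come from the attached role instead, so an incomplete list here is not necessarily an error.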


Initialize SageMaker.

In [23]:
import sagemaker

sess = sagemaker.Session()
role = sagemaker.get_execution_role()
bucket = sess.default_bucket()                    # Session's default S3 bucket (created if it does not exist)
prefix = 'hf-community-event-project'             # Key prefix for everything this project stores in the bucket

Upload the dataset to S3.

In [24]:
import boto3

# IMDB dataset from Kaggle (https://www.kaggle.com/lakshmi25npathi/sentiment-analysis-of-imdb-movie-reviews)
# Build the S3 key with "/" explicitly: os.path.join would insert "\" on Windows.
key = f"{prefix}/data/train.csv"
boto3.Session().resource('s3').Bucket(bucket).Object(key).upload_file('data/train.csv')
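
S3 keys always use forward slashes, so building them with a POSIX-style join keeps the upload key and the training channel URI consistent across operating systems. The helpers below are a hypothetical sketch, not part of boto3 or the SageMaker SDK, and `example-bucket` is a placeholder name.

```python
import posixpath


def s3_key(*parts):
    """Join key segments with "/" regardless of the local operating system."""
    return posixpath.join(*parts)


def s3_uri(bucket, *parts):
    """Build an s3://bucket/key URI from a bucket name and key segments."""
    return "s3://{}/{}".format(bucket, s3_key(*parts))


# "example-bucket" is a placeholder, not the bucket used in this notebook.
print(s3_key("hf-community-event-project", "data", "train.csv"))
print(s3_uri("example-bucket", "hf-community-event-project", "data"))
```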

Create the `HuggingFace` estimator: choose the framework versions, the training instance type, and the script to run (`entry_point='train.py'`).

In [25]:
from sagemaker.huggingface import HuggingFace

huggingface_estimator = HuggingFace(
  entry_point='train.py',
  source_dir='./scripts',
  instance_type='ml.p3.2xlarge',
  instance_count=1,
  role=role,
  transformers_version='4.12',
  tensorflow_version='2.5',
  py_version='py37',
  hyperparameters={
    "sampling": 0,  # take the whole dataset
    "epochs": 1
  }
)
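
SageMaker passes the `hyperparameters` dictionary to the entry point as command-line arguments and exposes the data and model locations through environment variables such as `SM_CHANNEL_TRAIN` and `SM_MODEL_DIR`. The project's `scripts/train.py` is not shown in this notebook, so the following is only a sketch of how such a script might read them. The `parse_args` helper, the argument names `--train_dir` and `--model_dir`, and the default values are assumptions.

```python
import argparse
import os


def parse_args(argv=None):
    """Parse hyperparameters the way a SageMaker entry point typically does."""
    parser = argparse.ArgumentParser()
    # Hyperparameters from the estimator arrive as "--name value" pairs.
    parser.add_argument("--epochs", type=int, default=1)
    parser.add_argument("--sampling", type=int, default=0)
    # Locations set by SageMaker inside the training container; the
    # fallbacks are the conventional container paths.
    parser.add_argument("--train_dir",
                        default=os.environ.get("SM_CHANNEL_TRAIN", "/opt/ml/input/data/train"))
    parser.add_argument("--model_dir",
                        default=os.environ.get("SM_MODEL_DIR", "/opt/ml/model"))
    # parse_known_args tolerates extra arguments the script does not declare.
    args, _ = parser.parse_known_args(argv)
    return args


args = parse_args(["--epochs", "1", "--sampling", "0"])
print(args.epochs, args.sampling)
```

The channel name `train` used when calling `fit` is what determines the `SM_CHANNEL_TRAIN` variable name.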

Launch model training on SageMaker!

In [26]:
huggingface_estimator.fit({
  'train': 's3://{}/{}/data/'.format(bucket, prefix)
})

2021-11-14 14:09:30 Starting - Starting the training job...
2021-11-14 14:09:53 Starting - Launching requested ML instances
ProfilerReport-1636898969: InProgress
......
2021-11-14 14:10:53 Starting - Preparing the instances for training.........
2021-11-14 14:12:14 Downloading - Downloading input data
2021-11-14 14:12:14 Training - Downloading the training image..................
2021-11-14 14:15:35 Training - Training image download completed. Training in progress.
2021-11-14 14:15:19.669093: W tensorflow/core/profiler/internal/smprofiler_timeline.cc:460] Initializing the SageMaker Profiler.
2021-11-14 14:15:19.680752: W tensorflow/core/profiler/internal/smprofiler_timeline.cc:105] SageMaker Profiler is not enabled. The timeline writer thread will not be started, future recorded events will be dropped.
2021-11-14 14:15:19.865227: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.11.0
2021-11-1

Deploy an inference endpoint.

In [27]:
predictor = huggingface_estimator.deploy(
  initial_instance_count=1, 
  instance_type="ml.m5.xlarge"
)

----!

Test a prediction.

In [28]:
data = {
   "inputs": "The Sopranos is a terrific show. It may be violent, racist, sexist, and bad to the bone, it is also funny, melodramatic and cool. The characters are very well done and the acting is some of the best I've seen in years. It is also pretty keen for creator David Chase to pick Northern New Jersey as the set piece for his opus of crime life. I have liked this show alot since it aired on HBO in January of last year and I will keep on watching it because of the intrigue and drama."
}

predictor.predict(data)

[{'label': 'positive', 'score': 0.9976664781570435}]
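
The endpoint returns a list of dictionaries, each with a `label` and a `score`. A small helper can pull out the most likely label. `top_prediction` is hypothetical, not part of the SageMaker SDK, and the sketch assumes the response always has the shape shown above.

```python
def top_prediction(response):
    """Return (label, score) for the highest-scoring entry in the response."""
    best = max(response, key=lambda item: item["score"])
    return best["label"], best["score"]


# Response copied from the prediction output above.
response = [{'label': 'positive', 'score': 0.9976664781570435}]
label, score = top_prediction(response)
print(f"{label} ({score:.1%})")  # positive (99.8%)
```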

Clean up by deleting the endpoint.

In [29]:
predictor.delete_endpoint()
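
Deleting the endpoint stops billing for the instance, but the model resource created by `deploy` remains registered. The sketch below wraps both deletions in one hypothetical helper. `delete_model` and `delete_endpoint` are methods of the SageMaker `Predictor`; the `cleanup` function and the `FakePredictor` stand-in, which exists only to exercise the helper without AWS, are illustrative.

```python
def cleanup(predictor):
    """Delete the model first, then the endpoint."""
    predictor.delete_model()
    predictor.delete_endpoint()


class FakePredictor:
    """Stand-in that records calls so the helper can be tried without AWS."""

    def __init__(self):
        self.calls = []

    def delete_model(self):
        self.calls.append("delete_model")

    def delete_endpoint(self):
        self.calls.append("delete_endpoint")


fake = FakePredictor()
cleanup(fake)
print(fake.calls)
```

In the notebook itself, `cleanup(predictor)` would be called with the real predictor returned by `deploy`.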