# How to train a new language model from scratch using Transformers and Tokenizers on SageMaker
* https://colab.research.google.com/github/huggingface/blog/blob/master/notebooks/01_how_to_train.ipynb

## Setup
First install `sagemaker` and `sagemaker[local]`. Also install the latest `transformers` from Hugging Face.

In [1]:
!pip install -q sagemaker sagemaker[local]

# Install `transformers` from master
!pip install -q git+https://github.com/huggingface/transformers
!pip list | grep -E 'transformers|tokenizers|sagemaker'

sagemaker           1.60.2    
tokenizers          0.7.0     
transformers        2.11.0


### Open session and bucket
Open a SageMaker session and set up the bucket for the data upload.

In [2]:
import sagemaker
import boto3

sagemaker_session = sagemaker.Session()

bucket = sagemaker_session.default_bucket()
prefix = 'sagemaker/hunkim-transformer'



### Set a role
You need a role that can run SageMaker jobs and read/write S3. Create a role in your IAM settings and use its name in the code below (`hunkimSagemaker` in this example).

In [3]:
try:
    role = sagemaker.get_execution_role()
except ValueError:
    iam = boto3.client('iam')
    role = iam.get_role(RoleName='hunkimSagemaker')['Role']['Arn']
    
print(role)

arn:aws:iam::294038372338:role/hunkimSagemaker


### Data download
In this example, we download a single plain-text corpus file into the `data` directory.

In [4]:
# in this notebook we'll only get one of the files (the Oscar one) for the sake of simplicity and performance
!mkdir -p data
!wget -O data/oscar.eo.txt -c https://cdn-datasets.huggingface.co/EsperBERTo/data/oscar.eo.txt

--2020-06-11 10:48:53--  https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.52.133
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.52.133|:443...connected.
HTTP request sent, awaiting response...416 Range Not Satisfiable

    The file is already fully retrieved; nothing to do.



### Upload data to S3
Upload the data to S3. The resulting S3 path will be passed to the training job.

In [5]:
inputs = sagemaker_session.upload_data(path='data', bucket=bucket, key_prefix=prefix)
print('input spec (in this case, just an S3 path): {}'.format(inputs))

input spec (in this case, just an S3 path): s3://sagemaker-us-west-2-294038372338/sagemaker/hunkim-transformer
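For a directory upload like this one, the printed path is simply the bucket and key prefix joined into an S3 URI. As a minimal sketch, the composition and its inverse could look like the following. The helper names `build_s3_uri` and `split_s3_uri` are made up for illustration and are not part of the SageMaker SDK.

```python
def build_s3_uri(bucket, key_prefix):
    """Join a bucket name and a key prefix into an s3:// URI."""
    return "s3://{}/{}".format(bucket, key_prefix.strip("/"))


def split_s3_uri(uri):
    """Split an s3:// URI back into (bucket, key_prefix)."""
    if not uri.startswith("s3://"):
        raise ValueError("not an S3 URI: {}".format(uri))
    bucket, _, key_prefix = uri[len("s3://"):].partition("/")
    return bucket, key_prefix


# Values taken from the output above.
uri = build_s3_uri("sagemaker-us-west-2-294038372338", "sagemaker/hunkim-transformer")
print(uri)
print(split_s3_uri(uri))
```

Splitting the URI again can be handy when a later step needs the bucket and the prefix separately.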


## Train
First, we need a code directory that contains `requirements.txt` and the train/infer code. Then we create a PyTorch container in SageMaker and run the training. Finally, the training program saves the model and other necessary data in `model_dir`, and SageMaker uploads them to S3.

Later, the model uploaded to S3 is used for model serving (the endpoint).
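Inside the training container, the script has to locate the uploaded data and the directory to save into. The SageMaker training toolkit exposes both through environment variables such as `SM_MODEL_DIR` and `SM_CHANNEL_TRAINING` (the channel name follows the `'training'` key passed to `fit`). The sketch below is not the notebook's `train.py`: `resolve_dirs` and `find_corpus_files` are hypothetical helpers, and the fallback paths are the container's usual defaults.

```python
import os
from pathlib import Path


def resolve_dirs(env=None):
    """Return (model_dir, train_dir) from SageMaker-style environment variables."""
    env = os.environ if env is None else env
    model_dir = Path(env.get("SM_MODEL_DIR", "/opt/ml/model"))
    train_dir = Path(env.get("SM_CHANNEL_TRAINING", "/opt/ml/input/data/training"))
    return model_dir, train_dir


def find_corpus_files(train_dir):
    """List the .txt files found anywhere under the training channel directory."""
    return sorted(str(p) for p in Path(train_dir).glob("**/*.txt"))
```

A training script would typically call `resolve_dirs()` once at startup, train on the files returned by `find_corpus_files`, and write everything it wants to keep into `model_dir`.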

In [6]:
!pygmentize code/requirements.txt

transformers == 2.11.0
boto3


In [7]:
!pygmentize code/train.py

from tokenizers.implementations import ByteLevelBPETokenizer
from tokenizers.processors import BertProcessing
from pathlib import Path
from tokenizers import ByteLevelBPETokenizer

# Based on github.com/pytorch/examples/blob/master/word_language_model
import argparse
import math
import os
from shutil import copy
import time
import torch
import torch.nn as

### PyTorch Container
Define the container (framework) version and other options, including hyperparameters. For local testing, use `local` or `local_gpu` as the `train_instance_type`.

Specify `source_dir` and `entry_point` to point to the training code.

In [8]:
from sagemaker.pytorch import PyTorch

estimator = PyTorch(entry_point='train.py',
                    role=role,
                    framework_version='1.5.0',
                    train_instance_count=1,
                    train_instance_type='ml.p2.xlarge',
                    #train_instance_type='local',
                    source_dir='code',
                    hyperparameters={
                        "vocab-size": 52_000,
                        "max-position-embeddings": 514,
                        "num-attention-heads": 12,
                        "num-hidden-layers": 6,
                        "type-vocab-size": 1,
                        "overwrite-output-dir": True,
                        "num-train-epochs": 1,
                        "per-gpu-train-batch-size": 64,
                        "save-steps": 10_000,
                        "save-total-limit": 2,
                        "token-max-len": 512
                    })
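SageMaker hands the `hyperparameters` dict to the entry point as command-line flags. The failed-job message later in this notebook shows the resulting `python train.py --max-position-embeddings 514 ...` command, with the flags sorted by key. The sketch below only imitates that rendering so the mapping is easy to see. `to_cli_args` and the small `argparse` parser are illustrative assumptions, not SageMaker's or `train.py`'s actual code.

```python
import argparse


def to_cli_args(hyperparameters):
    """Render a hyperparameter dict as `--key value` tokens, sorted by key."""
    args = []
    for key in sorted(hyperparameters):
        args += ["--{}".format(key), str(hyperparameters[key])]
    return args


# A subset of the hyperparameters used above.
hyperparameters = {
    "vocab-size": 52_000,
    "num-hidden-layers": 6,
    "overwrite-output-dir": True,
}
cli_args = to_cli_args(hyperparameters)
print(" ".join(cli_args))

# Hypothetical parser on the receiving side. Every value arrives as a string,
# so numbers and booleans have to be converted explicitly.
parser = argparse.ArgumentParser()
parser.add_argument("--vocab-size", type=int)
parser.add_argument("--num-hidden-layers", type=int)
parser.add_argument("--overwrite-output-dir", type=lambda s: s.lower() == "true")
parsed = parser.parse_args(cli_args)
print(parsed.vocab_size, parsed.overwrite_output_dir)
```

Note that `type=bool` would not work for the boolean flag, because any non-empty string (including `"False"`) is truthy; hence the explicit comparison.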

### Fire training
`fit` executes the training code we specified. Make sure to pass the training data.

In [9]:
%%time
estimator.fit({'training': inputs})

6-cp36m-manylinux2010_x86_64.whl (660 kB)
Collecting tokenizers==0.7.0
  Downloading tokenizers-0.7.0-cp36-cp36m-manylinux1_x86_64.whl (3.8 MB)
Collecting sentencepiece
  Downloading sentencepiece-0.1.91-cp36-cp36m-manylinux1_x86_64.whl (1.1 MB)
Building wheels for collected packages: default-user-module-name, sacremoses
  Building wheel for default-user-module-name (setup.py): started
  Building wheel for default-user-module-name (setup.py): finished with status 'done'
  Created wheel for default-user-module-name: filename=default_user_module_name-1.0.0-py2.py3-none-any.whl size=179589958 sha256=6dde25cd51902408d8f1443db969cee4c031683e0b2b02601640a446383c67c0
  Stored in directory: /tmp/pip-ephem-wheel-cache-_lqwx72o/wheels/18/c0/c7/be1cf409c57ce057601a86673d06a3a31b084080dbb03a31ea
  Building wheel for sacremoses (setup.py): started
  Building wheel for sacremoses (setup.py): finished with status 'done'
  Created wheel for sacremoses: file

UnexpectedStatusException: Error for Training job pytorch-training-2020-06-11-01-36-17-001: Failed. Reason: AlgorithmError: ExecuteUserScriptError:
Command "/opt/conda/bin/python train.py --max-position-embeddings 514 --num-attention-heads 12 --num-hidden-layers 6 --num-train-epochs 1 --overwrite-output-dir True --per-gpu-train-batch-size 64 --save-steps 10000 --save-total-limit 2 --token-max-len 512 --type-vocab-size 1 --vocab-size 52000"
/opt/conda/lib/python3.6/site-packages/transformers/tokenization_utils.py:831: FutureWarning: Parameter max_len is deprecated and will be removed in a future release. Use model_max_length instead.
  category=FutureWarning,
You are instantiating a Trainer but Tensorboard is not installed. You should consider installing it.
Using deprecated `--per_gpu_train_batch_size` argument which will be removed in a future version. Using `--per_device_train_batch_size` is preferred.
Using deprecated `--per_gpu_train_batch_size` argument which will be removed in a future version. Using `--per_device_train_batch_size` is preferred.
Epoch:   0%|          | 0/1 [00:00<?, ?it/s]
Iteration:   0%| 

## Serving
The first step of serving is to get the model stored in S3 and launch a SageMaker serving container.

### S3 models
Make sure the S3 archive contains the model and the necessary files.

In [10]:
training_job_name = estimator.latest_training_job.name
desc = sagemaker_session.sagemaker_client.describe_training_job(TrainingJobName=training_job_name)
trained_model_location = desc['ModelArtifacts']['S3ModelArtifacts']
print(trained_model_location)
# s3://sagemaker-us-west-2-294038372338/pytorch-training-2020-06-10-23-53-28-771/output/model.tar.gz

s3://sagemaker-us-west-2-294038372338/pytorch-training-2020-06-11-01-36-17-001/output/model.tar.gz
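Printing the location only confirms that an archive exists, not what is in it. One way to check the contents is to download `model.tar.gz` (for example with `aws s3 cp`) and list its members. The sketch below assumes the archive is already on local disk. The file names in `EXPECTED_FILES` are an assumption for a RoBERTa-style model with a byte-level BPE tokenizer, and `missing_files` is a hypothetical helper.

```python
import tarfile

# Assumed file names; adjust to whatever train.py actually saves.
EXPECTED_FILES = {"config.json", "pytorch_model.bin", "vocab.json", "merges.txt"}


def missing_files(archive_path, expected=EXPECTED_FILES):
    """Return the expected file names that are absent from a model.tar.gz."""
    with tarfile.open(archive_path, "r:gz") as tar:
        # Compare base names only, so "./config.json" also counts.
        present = {name.rsplit("/", 1)[-1] for name in tar.getnames()}
    return sorted(set(expected) - present)
```

An empty list from `missing_files("model.tar.gz")` suggests the archive has what the serving code will try to load.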


### JSON serializer
We send requests and receive responses as JSON, so the predictor is configured with a JSON serializer and deserializer.

In [11]:
from sagemaker.predictor import RealTimePredictor, json_serializer, json_deserializer

class JSONPredictor(RealTimePredictor):
    def __init__(self, endpoint_name, sagemaker_session):
        super(JSONPredictor, self).__init__(endpoint_name, sagemaker_session, json_serializer, json_deserializer)
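In practice, JSON serialization here means the Python dict is turned into a JSON string before it is sent, and the JSON reply is parsed back into Python objects. A rough standard-library sketch of that round trip follows. `encode_request` and `decode_response` are hypothetical names, and the reply shown is a made-up placeholder, not real endpoint output.

```python
import json


def encode_request(data):
    """Serialize a Python object to UTF-8 encoded JSON bytes."""
    return json.dumps(data).encode("utf-8")


def decode_response(body):
    """Parse UTF-8 encoded JSON bytes back into Python objects."""
    return json.loads(body.decode("utf-8"))


request_body = encode_request({"text": "We are <mask>."})
print(request_body)

# Placeholder reply, invented for illustration only.
fake_reply = b'[{"sequence": "We are here.", "score": 0.5}]'
print(decode_response(fake_reply)[0]["sequence"])
```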

### Infer code
First, we load our model using `model_fn`. Then, using the loaded model, we implement `predict_fn`. The `input_fn` and `output_fn` functions handle checking and processing of the input and output data.
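To make the division of labor concrete, here is a rough sketch of how the four hooks chain together for one request. It is not the notebook's `infer.py`: the model is a stand-in function instead of a `transformers` pipeline, the `'text'` key mirrors the test payload used later, and `handle` is a hypothetical driver that plays the role of the serving container.

```python
import json

JSON_CONTENT_TYPE = "application/json"


def model_fn(model_dir):
    """Load the model once at startup (here: a stand-in that fills the mask)."""
    def fake_fill_mask(text):
        return [{"sequence": text.replace("<mask>", "here"), "score": 1.0}]
    return fake_fill_mask


def input_fn(serialized_input_data, content_type=JSON_CONTENT_TYPE):
    """Check the content type and deserialize the request body."""
    if content_type != JSON_CONTENT_TYPE:
        raise ValueError("unsupported content type: {}".format(content_type))
    return json.loads(serialized_input_data)["text"]


def predict_fn(input_data, model):
    """Run the loaded model on the deserialized input."""
    return model(input_data)


def output_fn(prediction, accept=JSON_CONTENT_TYPE):
    """Check the requested response type and serialize the prediction."""
    if accept != JSON_CONTENT_TYPE:
        raise ValueError("unsupported accept type: {}".format(accept))
    return json.dumps(prediction)


def handle(body, model_dir="unused"):
    """Hypothetical driver: the order in which the container calls the hooks."""
    model = model_fn(model_dir)
    return output_fn(predict_fn(input_fn(body), model))


print(handle(json.dumps({"text": "We are <mask>."})))
```

In a real container, `model_fn` runs once when the endpoint starts, while the other three hooks run for every request.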

In [12]:
!pygmentize code/infer.py

import json
import logging
import os

import torch

from transformers import pipeline


JSON_CONTENT_TYPE = 'application/json'

logger = logging.getLogger(__name__)


def model_fn(model_dir):
    logger.info('Loading the model.')

    fill_mask = pipeline(
        "fill-mask",
        model=model_dir,
        tokenizer=model_dir,
    )
    return fill_mask


def input_fn(serialized_input_data, content_type=JSON_CONTENT_TYPE):
    logger.info('Deserializing the input data.')
    if content_t

### Configure serving container
Create a container using the model stored in S3. Specify the inference code and the source directory.


In [13]:
from sagemaker.pytorch import PyTorchModel

training_job_name = estimator.latest_training_job.name
desc = sagemaker_session.sagemaker_client.describe_training_job(TrainingJobName=training_job_name)
trained_model_location = desc['ModelArtifacts']['S3ModelArtifacts']

model = PyTorchModel(model_data=trained_model_location,
                     role=role,
                     framework_version='1.5.0',
                     entry_point='infer.py',
                     source_dir='code',
                     predictor_cls=JSONPredictor)



### Deploy!
Finally, it's ready to deploy. Let's fire it! The endpoint address will be used to connect to the server.

In [14]:
predictor = model.deploy(initial_instance_count=1, instance_type='ml.m4.xlarge')

# Get the endpoint name
endpoint = predictor.endpoint
print(endpoint)

-

## Testing
Finally we can test the endpoint. First, we will just reuse the predictor for a quick test.

In [24]:
input = {
    'text': "We are <mask>."
}
response = predictor.predict(input)
print(response)

ModelError: An error occurred (ModelError) when calling the InvokeEndpoint operation: Received server error (0) from model with message "Your invocation timed out while waiting for a response from container model. Review the latency metrics for each container in Amazon CloudWatch, resolve the issue, and try again.". See https://us-west-2.console.aws.amazon.com/cloudwatch/home?region=us-west-2#logEventViewer:group=/aws/sagemaker/Endpoints/pytorch-inference-2020-06-11-01-01-03-794 in account 294038372338 for more information.

### Remote
This example shows how to connect to the endpoint remotely.

In [25]:
import boto3
import json

client = boto3.client('sagemaker-runtime')

input = {
    'text': "we are <mask>."
}
payload = json.dumps(input)

response = client.invoke_endpoint(
    EndpointName=endpoint, 
    ContentType="application/json",
    Accept="application/json" ,
    Body=payload
)

print(response['Body'].read())  

ReadTimeoutError: Read timeout on endpoint URL: "https://runtime.sagemaker.us-west-2.amazonaws.com/endpoints/pytorch-inference-2020-06-11-01-01-03-794/invocations"

## Shutdown
After finishing, delete the endpoint.

In [22]:
sagemaker_session.delete_endpoint(predictor.endpoint)

ClientError: An error occurred (ValidationException) when calling the DeleteEndpoint operation: Could not find endpoint "arn:aws:sagemaker:us-west-2:294038372338:endpoint/pytorch-inference-2020-06-11-01-01-03-794".