# How to train the 6-billion-parameter GPT-J with Hugging Face Transformers and Amazon SageMaker

### Model Parallelism using Amazon SageMaker 

1. [Introduction](#Introduction)  
2. [Development Environment and Permissions](#Development-Environment-and-Permissions)
    1. [Installation](#Installation)  
    2. [Development environment](#Development-environment)  
    3. [Permissions](#Permissions)
3. [Preprocessing](#Preprocessing)   
    1. [Tokenization](#Tokenization)  
    2. [Uploading data to sagemaker_session_bucket](#Uploading-data-to-sagemaker_session_bucket)  
4. [Fine-tuning & starting Sagemaker Training Job](#Fine-tuning-\&-starting-Sagemaker-Training-Job)  
    1. [Creating an Estimator and start a training job](#Creating-an-Estimator-and-start-a-training-job)  
    2. [Estimator Parameters](#Estimator-Parameters)   
    3. [Download fine-tuned model from s3](#Download-fine-tuned-model-from-s3)
    4. [Attach to old training job to an estimator](#Attach-to-old-training-job-to-an-estimator)  
5. [_Coming soon_: Push model to the Hugging Face hub](#Push-model-to-the-Hugging-Face-hub)

# Introduction

GPT-J 6B is a transformer model trained using Ben Wang's [Mesh Transformer JAX](https://github.com/kingoflolz/mesh-transformer-jax/). "GPT-J" refers to the class of model, while "6B" represents the number of trainable parameters. GPT-J 6B was trained on the [Pile](https://pile.eleuther.ai/), a large-scale curated dataset created by [EleutherAI](https://www.eleuther.ai/). The weights of GPT-J-6B are licensed under version 2.0 of the Apache License.

Read more about `GPT-J`, how it was trained, and its limitations and biases on the [Hugging Face Model Card](https://huggingface.co/EleutherAI/gpt-j-6B).



# Development Environment and Permissions 

## Development environment 

In [None]:
!pip install "sagemaker>=2.48.0" --upgrade

## Permissions

_If you are going to use SageMaker in a local environment, you need access to an IAM role with the required permissions for SageMaker. You can find out more about it [here](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-roles.html)._

In [None]:
import sagemaker

sess = sagemaker.Session()
# sagemaker session bucket -> used for uploading data, models and logs
# sagemaker will automatically create this bucket if it does not exist
sagemaker_session_bucket=None
if sagemaker_session_bucket is None and sess is not None:
    # set to default bucket if a bucket name is not given
    sagemaker_session_bucket = sess.default_bucket()

role = sagemaker.get_execution_role()
sess = sagemaker.Session(default_bucket=sagemaker_session_bucket)

print(f"sagemaker role arn: {role}")
print(f"sagemaker bucket: {sess.default_bucket()}")
print(f"sagemaker session region: {sess.boto_region_name}")

# Fine-tuning & starting Sagemaker Training Job

To create our SageMaker training job we need a `HuggingFace` Estimator. The Estimator handles all end-to-end Amazon SageMaker training and deployment tasks. In the Estimator we define which fine-tuning script (`entry_point`) to use, which `instance_type` to run on, which `hyperparameters` to pass in, and so on.



```python
huggingface_estimator = HuggingFace(entry_point='train.py',
                            source_dir='./scripts',
                            base_job_name='huggingface-sdk-extension',
                            instance_type='ml.p3.2xlarge',
                            instance_count=1,
                            transformers_version='4.4',
                            pytorch_version='1.6',
                            py_version='py36',
                            role=role,
                            hyperparameters = {'epochs': 1,
                                               'train_batch_size': 32,
                                               'model_name':'distilbert-base-uncased'
                                                })
```

When we create a SageMaker training job, SageMaker takes care of starting and managing the required EC2 instances for us, providing the fine-tuning script `train.py`, and downloading the data from our `sagemaker_session_bucket` into the container at `/opt/ml/input/data`. When training starts, SageMaker executes the following command:

```bash
/opt/conda/bin/python train.py --epochs 1 --model_name distilbert-base-uncased --train_batch_size 32
```

The CLI arguments you see are the `hyperparameters` passed in when creating the `HuggingFace` estimator.
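
To make the mapping concrete, here is a minimal sketch of how a hyperparameter dictionary could be flattened into the command shown above. The helper `to_cli_args` is hypothetical and written only for illustration; it is not SageMaker's actual implementation, and the alphabetical ordering is an assumption chosen to match the example command.

```python
def to_cli_args(hyperparameters):
    """Flatten a hyperparameter dict into `--key value` command-line tokens."""
    args = []
    # Sort keys so the output is deterministic (assumption for this sketch).
    for key in sorted(hyperparameters):
        args.extend([f"--{key}", str(hyperparameters[key])])
    return args


hyperparameters = {
    "epochs": 1,
    "train_batch_size": 32,
    "model_name": "distilbert-base-uncased",
}

command = ["/opt/conda/bin/python", "train.py"] + to_cli_args(hyperparameters)
print(" ".join(command))
```

Every value is converted to a string, so the training script is responsible for parsing each argument back into the type it expects.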

SageMaker also provides useful properties about the training environment through various environment variables, including the following:

* `SM_MODEL_DIR`: A string that represents the path where the training job writes the model artifacts to. After training, artifacts in this directory are uploaded to S3 for model hosting.

* `SM_NUM_GPUS`: An integer representing the number of GPUs available to the host.

* `SM_CHANNEL_XXXX`: A string that represents the path to the directory that contains the input data for the specified channel. For example, if you specify two input channels in the HuggingFace estimator’s fit call, named `train` and `test`, the environment variables `SM_CHANNEL_TRAIN` and `SM_CHANNEL_TEST` are set.
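
A training script typically picks these variables up as argument defaults. The following is a minimal sketch of that pattern, not the actual `train.py` used here; the argument names (`--model_dir`, `--n_gpus`, `--training_dir`, `--test_dir`) and the fallback values are assumptions chosen for illustration.

```python
import argparse
import os


def parse_args(argv=None):
    parser = argparse.ArgumentParser()

    # Hyperparameters arrive as CLI arguments from the estimator.
    parser.add_argument("--epochs", type=int, default=1)
    parser.add_argument("--train_batch_size", type=int, default=32)
    parser.add_argument("--model_name", type=str, default="distilbert-base-uncased")

    # Environment-provided values, with fallbacks so the script also
    # runs outside a SageMaker container (fallbacks are assumptions).
    parser.add_argument("--model_dir", type=str,
                        default=os.environ.get("SM_MODEL_DIR", "/opt/ml/model"))
    parser.add_argument("--n_gpus", type=int,
                        default=int(os.environ.get("SM_NUM_GPUS", "0")))
    parser.add_argument("--training_dir", type=str,
                        default=os.environ.get("SM_CHANNEL_TRAIN"))
    parser.add_argument("--test_dir", type=str,
                        default=os.environ.get("SM_CHANNEL_TEST"))

    # parse_known_args ignores any extra arguments it does not recognize.
    args, _ = parser.parse_known_args(argv)
    return args


args = parse_args(["--epochs", "3"])
print(args.epochs, args.model_dir, args.n_gpus)
```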


## Model parallelism on Amazon SageMaker

A model-parallel approach is used for models that are too large to fit on a single accelerator (GPU). It implements a parallelization strategy in which the model architecture is divided into shards that are placed on different accelerators. [Amazon SageMaker's distributed model parallel](https://docs.aws.amazon.com/sagemaker/latest/dg/model-parallel-core-features.html) provides automated model splitting and pipeline execution scheduling. The model-splitting algorithms can be tuned for speed or memory. 
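
A rough back-of-the-envelope calculation shows why a 6B-parameter model does not fit on one GPU. The sketch below assumes a round parameter count of 6e9, full fp32 precision, an Adam-style optimizer holding two extra buffers per parameter, and a 16 GB GPU; it ignores activations and framework overhead, so real requirements are higher.

```python
import math

PARAMS = 6e9          # assumed round parameter count for GPT-J 6B
BYTES_FP32 = 4        # bytes per fp32 value
GPU_MEMORY_GB = 16    # assumed memory of a single GPU

# Memory for the weights alone.
weights_gb = PARAMS * BYTES_FP32 / 1e9

# Weights + gradients + two Adam moment buffers, all in fp32.
training_state_gb = 4 * weights_gb

# Lower bound on the number of GPUs needed just to hold this state.
min_gpus = math.ceil(training_state_gb / GPU_MEMORY_GB)

print(f"weights: {weights_gb:.0f} GB")
print(f"training state: {training_state_gb:.0f} GB")
print(f"GPUs needed (lower bound): {min_gpus}")
```

Under these assumptions the weights alone exceed a single 16 GB GPU, which is why the model has to be partitioned.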

![parallelism-sagemaker-interleaved-pipeline.png](attachment:parallelism-sagemaker-interleaved-pipeline.png)

The [Amazon SageMaker distributed model parallel library](https://sagemaker.readthedocs.io/en/stable/api/training/smd_model_parallel.html) can be used for training large deep learning models that are difficult to train due to GPU memory limitations.
The `HuggingFace` Estimator object contains a `distribution` parameter, which is used to enable and specify parameters for the initialization of the SageMaker distributed model parallel library. The library internally uses MPI, so in order to use model parallelism, MPI must also be enabled using the distribution parameter.

The full list of `parameters` that can be set under the `smdistributed` key of `distribution` to initialize the library is available in the [SageMaker Python SDK documentation](https://sagemaker.readthedocs.io/en/stable/api/training/smd_model_parallel_general.html#smdistributed-parameters).


In [None]:
# configuration for running training on smdistributed Model Parallel
mpi_options = {
    "enabled" : True,
    "processes_per_host" : 8, # number of available GPUs
}
smp_options = {
    "enabled":True,
    "parameters": {
        "microbatches": 4, # The number of microbatches to perform pipelining over. The batch size must be divisible by this number.
        "placement_strategy": "spread", # Model placement strategy, either "spread" or "cluster"
        "pipeline": "interleaved", # The pipeline schedule.
        "optimize": "speed", # Whether the library should optimize for speed or memory.
        "partitions": 4, # The number of partitions to split the model into.
        "ddp": True, # Must be set to True for Transformers
    }
}

distribution={
    "smdistributed": {"modelparallel": smp_options},
    "mpi": mpi_options
}
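
Before launching a job it can help to sanity-check how these numbers fit together. The sketch below uses a hypothetical helper, `data_parallel_degree`, which is not part of the SageMaker SDK. It assumes that with `ddp` enabled the processes are grouped into model replicas of `partitions` GPUs each, and it applies the divisibility rule from the `microbatches` comment above. The batch size of 8 is an illustrative value.

```python
def data_parallel_degree(processes_per_host, instance_count, partitions,
                         microbatches, batch_size):
    """Validate a model-parallel layout and return the number of model replicas."""
    world_size = processes_per_host * instance_count
    if world_size % partitions != 0:
        raise ValueError(
            f"{world_size} processes cannot be grouped evenly into "
            f"replicas of {partitions} partitions"
        )
    if batch_size % microbatches != 0:
        raise ValueError(
            f"batch size {batch_size} is not divisible by "
            f"{microbatches} microbatches"
        )
    # Each replica spans `partitions` GPUs; the rest is data parallelism.
    return world_size // partitions


# 8 GPUs on one instance, 4 partitions, 4 microbatches, batch size 8 (illustrative)
replicas = data_parallel_degree(8, 1, 4, 4, batch_size=8)
print(replicas)
```

A batch size that is not a multiple of `microbatches` (for example 2 with 4 microbatches) fails this check, so it is worth comparing the value against the batch size used in the hyperparameters.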

## Creating an Estimator and start a training job

We are going to use the existing `run_clm.py` from the transformers example scripts, which implements causal language modeling. As the dataset we are going to use `cc_news`.
The CC-News dataset contains news articles from news sites all over the world. The data is available on AWS S3 in the Common Crawl bucket at /crawl-data/CC-NEWS/. This version of the dataset has been prepared using [news-please](https://github.com/fhamborg/news-please), an integrated web crawler and information extractor for news.
It contains 708,241 English-language news articles published between January 2017 and December 2019 and represents a small portion of the English-language subset of the CC-News dataset.

If you want to use a custom dataset, skip these cells and go directly to [Custom Data](#Custom-Data).





## Defining Hyperparameters and Fine-tuning Script

In [10]:
from sagemaker.huggingface import HuggingFace

# hyperparameters, which are passed into the training job
hyperparameters={
    'model_name_or_path':'EleutherAI/gpt-j-6B',
    'dataset_name':'cc_news',
    'per_device_train_batch_size': 2,
    'per_device_eval_batch_size': 2,
    'do_train': True,
    'do_eval': True,
    'num_train_epochs': 2,
   # 'output_dir':'/opt/ml/model',
    'max_steps': 500,
}

# git configuration to download our fine-tuning script
#git_config = {'repo': 'https://github.com/huggingface/transformers.git','branch': 'v4.10.2'}

# instance configurations
instance_type='ml.p3.16xlarge'
instance_count=1
volume_size=200

# metric definition to extract the results
metric_definitions=[
     # raw strings keep the regex escape `\D` from being read as a Python string escape
     {'Name': 'train_runtime', 'Regex': r"train_runtime.*=\D*(.*?)$"},
     {'Name': 'train_samples_per_second', 'Regex': r"train_samples_per_second.*=\D*(.*?)$"},
     {'Name': 'epoch', 'Regex': r"epoch.*=\D*(.*?)$"},
     {'Name': 'f1', 'Regex': r"f1.*=\D*(.*?)$"},
     {'Name': 'exact_match', 'Regex': r"exact_match.*=\D*(.*?)$"}]
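
Metric regexes are easy to get subtly wrong, so it can be worth trying them locally first. The sketch below applies two of the patterns with Python's `re` module to made-up log lines. The sample lines are assumptions for illustration and may not match the exact format the training script prints.

```python
import re

# Two of the patterns from the cell above, under a separate name.
sample_definitions = [
    {"Name": "train_runtime", "Regex": r"train_runtime.*=\D*(.*?)$"},
    {"Name": "epoch", "Regex": r"epoch.*=\D*(.*?)$"},
]

# Hypothetical log lines, for illustration only.
sample_log_lines = [
    "train_runtime = 123.45",
    "epoch = 1.0",
]


def extract_metrics(lines, definitions):
    """Return the first capture group of every regex that matches a line."""
    found = {}
    for line in lines:
        for definition in definitions:
            match = re.search(definition["Regex"], line)
            if match:
                found[definition["Name"]] = match.group(1)
    return found


metrics = extract_metrics(sample_log_lines, sample_definitions)
print(metrics)
```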

In [11]:
# estimator
huggingface_estimator = HuggingFace(entry_point='run_clm.py', 
                                    source_dir="scripts",
                                    #source_dir='./examples/pytorch/language-modeling',
                                    #git_config=git_config,
                                    metric_definitions=metric_definitions,
                                    instance_type=instance_type,
                                    instance_count=instance_count,
                                    volume_size=volume_size,
                                    role=role,
                                    transformers_version='4.6',
                                    pytorch_version='1.7',
                                    py_version='py36',
                                    distribution= distribution,
                                    hyperparameters = hyperparameters)

In [None]:
huggingface_estimator.hyperparameters()

In [None]:
# start the training job; the dataset is loaded inside the container via `dataset_name`, so no S3 inputs are passed
huggingface_estimator.fit()

# Analyse Usage

Screenshots of cloudwatch and maybe sdk commands to get values

# Custom Data

* upload data to s3 / provide s3 uri
* change hyperparameters
* add s3 uris to fit method
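
The three steps above can be sketched as follows. The bucket name, prefix, and file names are hypothetical placeholders. The sketch assumes the data is already uploaded as plain-text files and that each channel is mounted under `/opt/ml/input/data/<channel>` in the container; `train_file` and `validation_file` are the arguments `run_clm.py` uses for local files.

```python
def s3_uri(bucket, *parts):
    """Join a bucket name and key parts into an S3 URI."""
    return "s3://" + "/".join([bucket, *parts])


bucket = "my-sagemaker-bucket"   # hypothetical bucket name
prefix = "gpt-j/custom-data"     # hypothetical key prefix

# One S3 URI per input channel; channel names become SM_CHANNEL_* variables.
inputs = {
    "train": s3_uri(bucket, prefix, "train"),
    "test": s3_uri(bucket, prefix, "test"),
}

# Point the script at the files inside the container instead of `dataset_name`.
custom_hyperparameters = {
    "model_name_or_path": "EleutherAI/gpt-j-6B",
    "train_file": "/opt/ml/input/data/train/train.txt",      # hypothetical file name
    "validation_file": "/opt/ml/input/data/test/test.txt",   # hypothetical file name
    "do_train": True,
    "do_eval": True,
}

print(inputs)
# With an estimator built from `custom_hyperparameters`, training would start with:
# huggingface_estimator.fit(inputs)
```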