# Hugging Face SageMaker SDK - Distributed Training Demo
### Distributed Token Classification (NER) with `transformers` scripts + `Trainer` and `conll2003` dataset

- Source: https://github.com/huggingface/transformers/tree/v4.17.0/examples/pytorch/token-classification

1. [Introduction](#Introduction)  
2. [Development Environment and Permissions](#Development-Environment-and-Permissions)
    1. [Installation](#Installation)  
    2. [Development environment](#Development-environment)  
    3. [Permissions](#Permissions) 
3. [Fine-tuning & starting Sagemaker Training Job](#Fine-tuning-\&-starting-Sagemaker-Training-Job)  
    1. [Creating an Estimator and start a training job](#Creating-an-Estimator-and-start-a-training-job)  
    2. [Estimator Parameters](#Estimator-Parameters)   
    3. [Download fine-tuned model from s3](#Download-fine-tuned-model-from-s3)
    4. [Attach to old training job to an estimator](#Attach-to-old-training-job-to-an-estimator)  
4. [_Coming soon_: Push model to the Hugging Face hub](#Push-model-to-the-Hugging-Face-hub)

# Introduction

Welcome to our end-to-end `distributed` "Token Classification" (NER) example. In this demo, we use the Hugging Face `transformers` library together with a custom Amazon sagemaker-sdk extension to fine-tune a pre-trained transformer for token classification on multiple GPUs. In particular, the pre-trained model is fine-tuned on the `conll2003` dataset. The demo uses the new `smdistributed` library to run training on multiple GPUs. As the training script, we use one of the `transformers` [example scripts from the repository](https://github.com/huggingface/transformers/blob/main/examples/pytorch/token-classification/run_ner.py).

To get started, we need to set up the environment with a few prerequisite steps, for permissions, configurations, and so on. 

_**NOTE: You can run this demo in Sagemaker Studio, your local machine or Sagemaker Notebook Instances**_

# Development Environment and Permissions 

## Installation

_*Note:* we only install the required libraries from Hugging Face and AWS. You also need PyTorch or TensorFlow if you don't already have one of them installed._

In [1]:
!pip install "sagemaker>=2.48.0"  --upgrade -q

## Development environment 

In [2]:
import sagemaker.huggingface

## Permissions

_If you are going to use SageMaker in a local environment, you need access to an IAM role with the required permissions for SageMaker. You can find more about it [here](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-roles.html)._

In [3]:
import sagemaker

sess = sagemaker.Session()
# sagemaker session bucket -> used for uploading data, models and logs
# sagemaker will automatically create this bucket if it does not exist
sagemaker_session_bucket=None
if sagemaker_session_bucket is None and sess is not None:
    # set to default bucket if a bucket name is not given
    sagemaker_session_bucket = sess.default_bucket()

role = sagemaker.get_execution_role()
sess = sagemaker.Session(default_bucket=sagemaker_session_bucket)

print(f"sagemaker role arn: {role}")
print(f"sagemaker bucket: {sess.default_bucket()}")
print(f"sagemaker session region: {sess.boto_region_name}")

sagemaker role arn: arn:aws:iam::110564771975:role/service-role/AmazonSageMaker-ExecutionRole-20210806T162946
sagemaker bucket: sagemaker-eu-west-2-110564771975
sagemaker session region: eu-west-2


# Fine-tuning & starting Sagemaker Training Job

In order to create a SageMaker training job we need a `HuggingFace` Estimator. The Estimator handles end-to-end Amazon SageMaker training and deployment tasks. In an Estimator we define which fine-tuning script should be used as `entry_point`, which `instance_type` should be used, which `hyperparameters` are passed in, and so on.

```python
from sagemaker.huggingface import HuggingFace

# hyperparameters are passed to run_ner.py as command-line arguments,
# so the names have to match the arguments the script accepts
hyperparameters = {
    'model_name_or_path': 'philschmid/distilroberta-base-ner-conll2003',
    'output_dir': '/opt/ml/model',
    'num_train_epochs': 1,
    'per_device_train_batch_size': 32,
    # add your remaining hyperparameters
    # more info here https://github.com/huggingface/transformers/tree/v4.17.0/examples/pytorch/token-classification
}

# assumed here: fetch the example scripts from the transformers repository at the v4.17.0 tag
git_config = {'repo': 'https://github.com/huggingface/transformers.git', 'branch': 'v4.17.0'}

huggingface_estimator = HuggingFace(
    entry_point='run_ner.py',
    source_dir='./examples/pytorch/token-classification',
    instance_type='ml.p3.2xlarge',
    instance_count=1,
    role=role,
    git_config=git_config,
    transformers_version='4.17.0',
    pytorch_version='1.10.2',
    py_version='py38',
    hyperparameters=hyperparameters
)
```

When we create a SageMaker training job, SageMaker takes care of starting and managing all the required EC2 instances for us with the `huggingface` container, uploads the provided fine-tuning script `run_ner.py`, and downloads the data from our `sagemaker_session_bucket` into the container at `/opt/ml/input/data`. Then, it starts the training job by running:

```bash
/opt/conda/bin/python run_ner.py --model_name_or_path philschmid/distilroberta-base-ner-conll2003 --output_dir /opt/ml/model --num_train_epochs 1 --per_device_train_batch_size 32
```

The `hyperparameters` you define in the `HuggingFace` estimator are passed in as named arguments. 
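
The estimator above runs on a single `ml.p3.2xlarge` instance, which has one GPU. As a rough sketch of how the `smdistributed` data-parallel library mentioned in the introduction could be switched on, the snippet below builds the extra keyword arguments one would pass to `HuggingFace(...)`. The helper name `distributed_estimator_kwargs` and the choice of two `ml.p3.16xlarge` instances are assumptions for illustration; check the SageMaker documentation for the instance types the library supports.

```python
def distributed_estimator_kwargs(instance_type="ml.p3.16xlarge", instance_count=2):
    """Return estimator keyword arguments that enable SageMaker data parallelism.

    Hypothetical helper: the instance type and count are illustrative defaults.
    """
    # `distribution` is the estimator parameter that selects the distributed strategy
    distribution = {"smdistributed": {"dataparallel": {"enabled": True}}}
    return {
        "instance_type": instance_type,
        "instance_count": instance_count,
        "distribution": distribution,
    }


kwargs = distributed_estimator_kwargs()
print(kwargs["distribution"])
```

These values would take the place of `instance_type` and `instance_count` in the estimator call, with `distribution` passed as an additional argument.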

SageMaker provides useful properties about the training environment through various environment variables, including the following:

* `SM_MODEL_DIR`: A string that represents the path where the training job writes the model artifacts to. After training, artifacts in this directory are uploaded to S3 for model hosting.

* `SM_NUM_GPUS`: An integer representing the number of GPUs available to the host.

* `SM_CHANNEL_XXXX`: A string that represents the path to the directory that contains the input data for the specified channel. For example, if you specify two input channels in the HuggingFace estimator’s fit call, named `train` and `test`, the environment variables `SM_CHANNEL_TRAIN` and `SM_CHANNEL_TEST` are set.
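
Inside a training script, these variables are usually read from `os.environ`. The sketch below shows one way to do that. The function name `read_sagemaker_env` and the fallback values are assumptions, chosen so that the snippet also runs outside a SageMaker container.

```python
import os


def read_sagemaker_env(environ=os.environ):
    """Collect the SageMaker settings described above from environment variables."""
    # every SM_CHANNEL_<NAME> variable maps a channel name to its data directory
    channels = {
        key[len("SM_CHANNEL_"):].lower(): value
        for key, value in environ.items()
        if key.startswith("SM_CHANNEL_")
    }
    return {
        # fallbacks are illustrative defaults for running outside SageMaker
        "model_dir": environ.get("SM_MODEL_DIR", "/opt/ml/model"),
        "num_gpus": int(environ.get("SM_NUM_GPUS", 0)),
        "channels": channels,
    }


# made-up environment that mimics a job with `train` and `test` channels
example_env = {
    "SM_MODEL_DIR": "/opt/ml/model",
    "SM_NUM_GPUS": "4",
    "SM_CHANNEL_TRAIN": "/opt/ml/input/data/train",
    "SM_CHANNEL_TEST": "/opt/ml/input/data/test",
}
settings = read_sagemaker_env(example_env)
print(settings)
```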


To run your training job locally you can define `instance_type='local'` or `instance_type='local_gpu'` for GPU usage. _Note: this does not work within SageMaker Studio._


## Load training data from Flair, split into `train`/`dev`/`test`, and upload to S3
- https://github.com/flairNLP/flair/blob/master/resources/docs/TUTORIAL_6_CORPUS.md
- https://github.com/huggingface/transformers/blob/main/tests/fixtures/tests_samples/conll/sample.json
- https://github.com/huggingface/transformers/issues/8698
- https://huggingface.co/docs/datasets/loading#json-files

In [4]:
import flair
import inspect
import json
from os.path import exists as path_exists
from pathlib import Path

An example of what `train.json` should look like:

```python
{"id": "1", "tokens": ["APPLICATION", "and", "Affidavit", "for", "Search", ...], "ner_tags": ["O", "O", "O", "O", "O", ...]}

...

{"id": "n", "tokens": ["APPLICATION", "and", "Affidavit", "for", "Search", ...], "ner_tags": ["O", "O", "O", "O", "O", ...]}
```
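
A file in this format has one JSON object per line. As a minimal sketch, the standard `json` module can write such records with `tokens` and `ner_tags` kept as parallel lists. The sample sentences, the tag names, and the file name `sample_train.json` are made up for illustration.

```python
import json
import os
import tempfile

# two made-up sentences; each record keeps tokens and tags as parallel lists
records = [
    {"id": "1", "tokens": ["John", "lives", "in", "Berlin"], "ner_tags": ["B-PER", "O", "O", "B-LOC"]},
    {"id": "2", "tokens": ["Search", "warrant"], "ner_tags": ["O", "O"]},
]

path = os.path.join(tempfile.mkdtemp(), "sample_train.json")
with open(path, "w", encoding="utf-8") as f:
    for record in records:
        # one JSON object per line (JSON Lines)
        f.write(json.dumps(record) + "\n")

# read the file back to check that every line parses on its own
with open(path, encoding="utf-8") as f:
    loaded = [json.loads(line) for line in f]
print(loaded[0]["tokens"])
```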

In [None]:
def get_flair_corpus(dataset_name, dataset_arguments: str = ""):
    """Look up a Flair NER corpus class by name and instantiate it.

    `dataset_arguments` is an optional JSON string with keyword arguments for the corpus class.
    """
    # collect every class in flair.datasets.sequence_labeling whose name marks it as an NER dataset
    ner_task_mapping = {}

    for name, obj in inspect.getmembers(flair.datasets.sequence_labeling):
        if inspect.isclass(obj):
            if name.startswith(("NER", "CONLL", "WNUT")):
                ner_task_mapping[name] = obj

    dataset_args = {}

    if dataset_arguments:
        dataset_args = json.loads(dataset_arguments)

    if dataset_name not in ner_task_mapping:
        raise ValueError(f"Dataset name {dataset_name} is not a valid Flair datasets name!")

    return ner_task_mapping[dataset_name](**dataset_args)

In [219]:
def upload_flair_corpus_to_s3(corpus, filepath, key_prefix='distributed_conll2003_data'):
    # make sure the local output directory exists
    Path('dataset').mkdir(exist_ok=True)
    path = f'dataset/{filepath}'

    lines = []
    for idx, sentence in enumerate(corpus):
        tokens = []
        ner_tags = []
        for token in sentence:
            # fall back to the "outside" tag when the token carries no tag
            try:
                tag = token.tag
            except Exception:
                tag = "O"
            tokens.append(token.text)
            ner_tags.append(tag)
        # NOTE: tokens and tags are stored as joined strings (space- and comma-separated),
        # not as lists like in the example above, so `load_dataset` reads both columns as strings
        lines.append({"tokens": (' '.join(tokens)).strip(), "ner_tags": (','.join(ner_tags)).strip()})

    # convert to a list of JSON strings
    json_lines = [json.dumps(l) for l in lines]

    # join lines and save to .json file
    json_data = '\n'.join(json_lines)
    with open(path, 'w', encoding='utf-8') as f:
        f.write(json_data)

    # save to s3
    s3_path = sess.upload_data(path, key_prefix=key_prefix)
    print('done! file also saved locally to "{}"'.format(filepath))

    return s3_path

In [28]:
corpus = get_flair_corpus(dataset_name='NER_ENGLISH_PERSON')
print(corpus)

2022-05-21 00:01:15,488 Reading data from /home/ec2-user/.flair/datasets/ner_english_person
2022-05-21 00:01:15,489 Train: /home/ec2-user/.flair/datasets/ner_english_person/bigFile.conll
2022-05-21 00:01:15,489 Dev: None
2022-05-21 00:01:15,490 Test: None
Corpus: 28362 train + 3151 dev + 3501 test sentences


In [29]:
tag_dictionary = corpus.make_label_dictionary("ner")
print(tag_dictionary)

2022-05-21 00:01:21,657 Computing label dictionary. Progress:


28362it [00:00, 71885.80it/s]

2022-05-21 00:01:22,054 Dictionary created for label 'ner' with 4 values: M (seen 27887 times), F (seen 4543 times), A (seen 634 times)
Dictionary with 4 tags: <unk>, M, F, A





In [220]:
train_input_path = upload_flair_corpus_to_s3(corpus.train, 'train.json')
test_input_path = upload_flair_corpus_to_s3(corpus.test, 'test.json')
valid_input_path = upload_flair_corpus_to_s3(corpus.dev, 'valid.json')

done! file also saved locally to "train.json"
done! file also saved locally to "test.json"
done! file also saved locally to "valid.json"


## Test all steps locally first

See more about loading any type of standard or custom dataset (from files, python dict, pandas DataFrame, etc) at
https://huggingface.co/docs/datasets/loading_datasets.html.

In [96]:
from datasets import ClassLabel, load_dataset, load_metric

In [92]:
class data_args:
    train_file = 'dataset/train.json'
    test_file = 'dataset/test.json'
    validation_file = 'dataset/valid.json'
    text_column_name='tokens'
    label_column_name='ner_tags'
    task_name='ner'

class training_args:
    do_train=True

In [94]:
data_files = {}
if data_args.train_file is not None:
    data_files["train"] = data_args.train_file
if data_args.validation_file is not None:
    data_files["validation"] = data_args.validation_file
if data_args.test_file is not None:
    data_files["test"] = data_args.test_file
extension = data_args.train_file.split(".")[-1]
raw_datasets = load_dataset(extension, data_files=data_files)

if training_args.do_train:
    column_names = raw_datasets["train"].column_names
    features = raw_datasets["train"].features
else:
    column_names = raw_datasets["validation"].column_names
    features = raw_datasets["validation"].features

if data_args.text_column_name is not None:
    text_column_name = data_args.text_column_name
elif "tokens" in column_names:
    text_column_name = "tokens"
else:
    text_column_name = column_names[0]

if data_args.label_column_name is not None:
    label_column_name = data_args.label_column_name
elif f"{data_args.task_name}_tags" in column_names:
    label_column_name = f"{data_args.task_name}_tags"
else:
    label_column_name = column_names[1]

Using custom data configuration default-aa398d1e47008087
Reusing dataset json (/home/ec2-user/.cache/huggingface/datasets/json/default-aa398d1e47008087/0.0.0/ac0ca5f5289a6cf108e706efcf040422dbbfa8e658dee6a819f20d76bb84d26b)


  0%|          | 0/3 [00:00<?, ?it/s]

In [97]:
# In the event the labels are not a `Sequence[ClassLabel]`, we will need to go through the dataset to get the
# unique labels.
def get_label_list(labels):
    unique_labels = set()
    for label in labels:
        unique_labels = unique_labels | set(label)
    label_list = list(unique_labels)
    label_list.sort()
    return label_list

# If the labels are of type ClassLabel, they are already integers and we have the map stored somewhere.
# Otherwise, we have to get the list of labels manually.
labels_are_int = isinstance(features[label_column_name].feature, ClassLabel)
if labels_are_int:
    label_list = features[label_column_name].feature.names
    label_to_id = {i: i for i in range(len(label_list))}
else:
    label_list = get_label_list(raw_datasets["train"][label_column_name])
    label_to_id = {l: i for i, l in enumerate(label_list)}

num_labels = len(label_list)
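
To make the fallback branch concrete, here is a small standalone run of the same logic on made-up string tags. The sample tag sequences are an assumption, and `get_label_list` is repeated (in a slightly shorter form) so the snippet runs on its own.

```python
def get_label_list(labels):
    """Return the sorted list of unique labels found across all sequences."""
    unique_labels = set()
    for label in labels:
        unique_labels = unique_labels | set(label)
    return sorted(unique_labels)


# made-up tag sequences, one inner list per sentence
sample_labels = [["O", "B-PER", "I-PER"], ["O", "B-LOC"], ["O"]]

sample_label_list = get_label_list(sample_labels)
sample_label_to_id = {l: i for i, l in enumerate(sample_label_list)}
print(sample_label_list)
print(sample_label_to_id)
```

Sorting the labels keeps the label-to-id mapping stable from one run to the next, which matters when a saved model is reloaded later.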

In [120]:
import transformers
from transformers import (
    AutoConfig,
    AutoModelForTokenClassification,
    AutoTokenizer,
    DataCollatorForTokenClassification,
    HfArgumentParser,
    PretrainedConfig,
    PreTrainedTokenizerFast,
    Trainer,
    TrainingArguments,
    set_seed,
)

In [121]:
model = AutoModelForTokenClassification.from_pretrained('xlm-roberta-base')

Some weights of the model checkpoint at xlm-roberta-base were not used when initializing XLMRobertaForTokenClassification: ['lm_head.bias', 'lm_head.dense.weight', 'lm_head.decoder.weight', 'lm_head.layer_norm.bias', 'lm_head.dense.bias', 'lm_head.layer_norm.weight']
- This IS expected if you are initializing XLMRobertaForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing XLMRobertaForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of XLMRobertaForTokenClassification were not initialized from the model checkpoint at xlm-roberta-base and are newly initialized: ['classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-st

In [125]:
model.config.label2id

{'LABEL_0': 0, 'LABEL_1': 1}

In [198]:
from datasets import load_dataset

raw_datasets = load_dataset("conll2003")

Reusing dataset conll2003 (/home/ec2-user/.cache/huggingface/datasets/conll2003/conll2003/1.0.0/63f4ebd1bcb7148b1644497336fd74643d4ce70123334431a3c053b7ee4e96ee)


  0%|          | 0/3 [00:00<?, ?it/s]

In [233]:
raw_datasets = load_dataset('json',data_files={
    'train':'dataset/train.json',
    'test':'dataset/test.json',
    'valid':'dataset/valid.json'})

Using custom data configuration default-72c4b80aac0f28a2
Reusing dataset json (/home/ec2-user/.cache/huggingface/datasets/json/default-72c4b80aac0f28a2/0.0.0/ac0ca5f5289a6cf108e706efcf040422dbbfa8e658dee6a819f20d76bb84d26b)


  0%|          | 0/3 [00:00<?, ?it/s]

In [223]:
raw_datasets["train"].features["ner_tags"]

Value(dtype='string', id=None)

In [232]:
raw_datasets["train"].features

{'tokens': Value(dtype='string', id=None),
 'ner_tags': Value(dtype='string', id=None)}
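
Both columns come back as plain strings because the upload helper joined the tokens with spaces and the tags with commas. `tokenize_and_align_labels` further down treats them as lists, so one possible fix is to split them again before tokenizing. The sketch below assumes exactly that joined format. `split_columns` is a hypothetical helper name, the sample row is made up, and tokens that themselves contain spaces would not survive the round trip.

```python
def split_columns(example):
    """Turn the space-joined tokens and comma-joined tags back into lists."""
    tokens = example["tokens"].split(" ")
    ner_tags = example["ner_tags"].split(",")
    if len(tokens) != len(ner_tags):
        raise ValueError("tokens and ner_tags have different lengths")
    return {"tokens": tokens, "ner_tags": ner_tags}


# made-up row in the format written by the upload helper
row = {"tokens": "John lives in Berlin", "ner_tags": "M,O,O,O"}
fixed = split_columns(row)
print(fixed)
```

With a `datasets` object, a function like this could be applied to every row through `raw_datasets.map(split_columns)`.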

In [209]:
from transformers import AutoTokenizer

model_checkpoint = "xlm-roberta-base"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

In [231]:
# labels = raw_datasets["train"][0]["ner_tags"]
# word_ids = inputs.word_ids()
# print(labels)
# print(align_labels_with_tokens(labels, word_ids))

In [210]:
def align_labels_with_tokens(labels, word_ids):
    new_labels = []
    current_word = None
    for word_id in word_ids:
        if word_id != current_word:
            # Start of a new word!
            current_word = word_id
            label = -100 if word_id is None else labels[word_id]
            new_labels.append(label)
        elif word_id is None:
            # Special token
            new_labels.append(-100)
        else:
            # Same word as previous token
            label = labels[word_id]
            # If the label is B-XXX we change it to I-XXX
            if label % 2 == 1:
                label += 1
            new_labels.append(label)

    return new_labels
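
A short worked example helps to see what the function does. The label ids and `word_ids` below are made up: `1` stands for a hypothetical `B-PER` and `2` for `I-PER`, following the function's assumption that every odd id is a `B-` tag whose `I-` counterpart is the next even id. The function body is repeated so the snippet runs on its own.

```python
def align_labels_with_tokens(labels, word_ids):
    new_labels = []
    current_word = None
    for word_id in word_ids:
        if word_id != current_word:
            # first sub-token of a new word (or a special token after a word)
            current_word = word_id
            new_labels.append(-100 if word_id is None else labels[word_id])
        elif word_id is None:
            # special token, ignored by the loss
            new_labels.append(-100)
        else:
            # later sub-token of the same word: turn B-XXX (odd id) into I-XXX
            label = labels[word_id]
            if label % 2 == 1:
                label += 1
            new_labels.append(label)
    return new_labels


# three words; the first and the last are split into two sub-tokens each
sample_labels = [1, 0, 3]
sample_word_ids = [None, 0, 0, 1, 2, 2, None]
aligned = align_labels_with_tokens(sample_labels, sample_word_ids)
print(aligned)
```

The output has one entry per sub-token, with `-100` at the positions of the special tokens.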

In [211]:
def tokenize_and_align_labels(examples):
    tokenized_inputs = tokenizer(
        examples["tokens"], truncation=True, is_split_into_words=True
    )
    all_labels = examples["ner_tags"]
    new_labels = []
    for i, labels in enumerate(all_labels):
        word_ids = tokenized_inputs.word_ids(i)
        new_labels.append(align_labels_with_tokens(labels, word_ids))

    tokenized_inputs["labels"] = new_labels
    return tokenized_inputs

In [230]:
# tokenized_datasets = raw_datasets.map(
#     tokenize_and_align_labels,
#     batched=True,
#     remove_columns=raw_datasets["train"].column_names,
# )

In [243]:
# !pip install accelerate

## References

- [Named-Entity-Recognition-on-HuggingFace](https://wandb.ai/biased-ai/Named-Entity%20Recognition%20on%20HuggingFace/reports/Named-Entity-Recognition-on-HuggingFace--Vmlldzo3NTk4NjY)
- [`run_ner.py` script in HuggingFace](https://github.com/huggingface/transformers/blob/main/examples/pytorch/token-classification/run_ner.py)
- [Named Entity Recognition with Transformers](https://chriskhanhtran.github.io/posts/named-entity-recognition-with-transformers/)
- [Token classification in HuggingFace](https://github.com/huggingface/transformers/tree/v4.17.0/examples/pytorch/token-classification)
- [Training Custom NER Model Using Flair](https://medium.com/thecyphy/training-custom-ner-model-using-flair-df1f9ea9c762)