# Performance Tuning

This notebook includes examples of how you can tune performance with the `aws-neuron-sdk` for your 🤗 Transformers models. 

## Batching

**Batching** is achieved by loading the data into an on-chip cache and reusing it across multiple model inputs.  
=> Batching is preferred for applications that optimize for throughput and cost at the expense of latency.  

---

To enable the batching optimization, we first need to compile the model for a target `batch-size`.


In [1]:
import tensorflow  # to workaround a protobuf version conflict issue
import torch
import torch.neuron
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import transformers
  
# Build tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("cardiffnlp/twitter-roberta-base-sentiment")
model = AutoModelForSequenceClassification.from_pretrained("cardiffnlp/twitter-roberta-base-sentiment", return_dict=False)
# Setup some example inputs
positive_sequence = "This is a nice sentence about very kind guy from the east."
negative_sequence = "You fucking bastard."

max_length=128
batch_size=6

paraphrase = tokenizer(positive_sequence, max_length=max_length, padding='max_length', truncation=True, return_tensors="pt")

example_inputs_paraphrase = (
    torch.cat([paraphrase['input_ids']] * batch_size,0), 
    torch.cat([paraphrase['attention_mask']] * batch_size,0)
)
# test model
outputs = model(**paraphrase)
assert 2 == outputs[0][0].argmax().item()

In [None]:
## Analyze the model - this will show operator support and operator count
torch.neuron.analyze_model(model, example_inputs=example_inputs_paraphrase)

In [2]:
# Run torch.neuron.trace to generate a TorchScript that is optimized by AWS Neuron
model_neuron_batch = torch.neuron.trace(model, example_inputs_paraphrase)

outputs = model_neuron_batch(*example_inputs_paraphrase)

for output in outputs[0]:
    assert 2 == output.argmax().item()
    
# Save the batched model
model_neuron_batch.save('roberta_neuron_b{}.pt'.format(batch_size))

  input_tensor.shape[chunk_dim] == tensor_shape for input_tensor in input_tensors
INFO:Neuron:There are 5 ops of 3 different types in the TorchScript that are not compiled by neuron-cc: aten::embedding, aten::cumsum, aten::type_as, (For more information see https://github.com/aws/aws-neuron-sdk/blob/master/release-notes/neuron-cc-ops/neuron-cc-ops-pytorch.md)
INFO:Neuron:Number of arithmetic operators (pre-compilation) before = 722, fused = 696, percent fused = 96.4%
INFO:Neuron:compiling function _NeuronGraph$661 with neuron-cc
INFO:Neuron:Compiling with command line: '/opt/conda/bin/neuron-cc compile /tmp/tmp33ogt4ii/graph_def.pb --framework TENSORFLOW --pipeline compile SaveTemps --output /tmp/tmp33ogt4ii/graph_def.neff --io-config {"inputs": {"0:0": [[6, 128, 768], "float32"], "1:0": [[6, 1, 1, 128], "float32"]}, "outputs": ["Add_133:0"]} --verbose 35'
Tensor output are ** NOT CALCULATED ** during CPU execution and only indicate tensor shape (Triggered internally at  /opt/workspace

## Test `neuron_model` vs vanilla model

The test is run with `batch_size=1`.

In [75]:
from datasets import load_dataset, load_metric

raw_dataset = load_dataset('tweet_eval', 'sentiment', split='test')

processed_dataset = raw_dataset.map(lambda seq: tokenizer(seq['text'], max_length=max_length, padding=True, truncation=True, return_tensors="pt"))



In [53]:
import time
import pandas as pd


def do_test(processed_dataset, model, model_type):
    """Evaluate `model` on the first 1000 examples, one example at a time.

    Returns accuracy plus total and average wall-clock time in milliseconds.
    """
    metric = load_metric("accuracy")
    processed_dataset = processed_dataset.select(range(1000))
    model_start = time.perf_counter()
    model.eval()
    with torch.no_grad():
        for step, batch in enumerate(processed_dataset):
            input_ids = torch.tensor(batch['input_ids'])
            attention_mask = torch.tensor(batch['attention_mask'])
            outputs = model(*[input_ids, attention_mask])
            # outputs[0] holds the logits; take the top class of the single example
            predictions = outputs[0][0].argmax().item()
            metric.add_batch(predictions=[predictions], references=[batch["label"]])

    eval_metric = metric.compute()
    model_stop = time.perf_counter()
    # Timing covers the inference loop and the metric computation
    total_time = round(model_stop - model_start, 4) * 1000
    average_time = round(total_time / len(processed_dataset), 4)
    return {'model_type': model_type, **eval_metric, 'total_time': f"{total_time}ms", 'average_time': f"{average_time}ms"}


model_res=do_test(processed_dataset, model,'pytorch')
model_neuron_res = do_test(processed_dataset, model_neuron_batch,'neuron')


df = pd.DataFrame([model_res,model_neuron_res])
df

Tensor output are ** NOT CALCULATED ** during CPU execution and only indicate tensor shape (Triggered internally at  /opt/workspace/KaenaPyTorchRuntime/neuron_op/neuron_op_impl.cpp:38.)
  result = self.forward(*input, **kwargs)


|   | model_type | accuracy | total_time | average_time |
|---|------------|----------|------------|--------------|
| 0 | pytorch | 0.705 | 170575.3ms | 170.5753ms |
| 1 | neuron | 0.197 | 1817.6000000000001ms | 1.8176ms |


## Batch inference

In [112]:
from transformers import DataCollatorWithPadding
import time

max_length = 128
batch_size = 6


batch_raw_dataset = raw_dataset.map(lambda seq: tokenizer(seq['text'],padding="max_length", max_length=max_length,truncation=True))
batch_raw_dataset = batch_raw_dataset.remove_columns('text')

data_loader = torch.utils.data.DataLoader(batch_raw_dataset,
                                         batch_size=batch_size,
                                        collate_fn=DataCollatorWithPadding(tokenizer)
                                         )

model_start = time.perf_counter()
model_neuron_batch.eval()
with torch.no_grad():
    for batch in data_loader:
        # The collator already returns tensors, so no re-wrapping is needed
        input_ids = batch['input_ids']
        attention_mask = batch['attention_mask']
        outputs = model_neuron_batch(input_ids, attention_mask)
        # One predicted class per example in the batch
        predictions = outputs[0].argmax(dim=-1)

model_stop = time.perf_counter()
total_time = round(model_stop - model_start,4)


print(f"inference of {len(batch_raw_dataset)} examples with batch_size {batch_size} took {total_time} seconds")



inference of 12284 examples with batch_size 1 took 20.0836 seconds
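
The dataset size (12284) is not a multiple of 6, so the last batch produced by the `DataLoader` is smaller than the batch size used for compilation. The following is a minimal sketch of one way to handle this, assuming the traced model only accepts the exact batch size it was compiled with: pad the last batch by repeating its final example, then drop the surplus predictions. `pad_batch`, `predict_fixed` and `fake_model` are hypothetical names, and plain lists stand in for tensors.

```python
from typing import Callable, List, Sequence


def pad_batch(batch: Sequence, batch_size: int) -> List:
    """Repeat the last example until the batch has exactly batch_size items."""
    if not batch:
        raise ValueError("cannot pad an empty batch")
    if len(batch) > batch_size:
        raise ValueError("batch is larger than the compiled batch size")
    return list(batch) + [batch[-1]] * (batch_size - len(batch))


def predict_fixed(model: Callable[[List], List], examples: Sequence, batch_size: int) -> List:
    """Run `model` over `examples` in fixed-size batches, dropping padded outputs."""
    predictions = []
    for start in range(0, len(examples), batch_size):
        chunk = examples[start:start + batch_size]
        outputs = model(pad_batch(chunk, batch_size))
        # Keep only the outputs that belong to real examples
        predictions.extend(outputs[:len(chunk)])
    return predictions


def fake_model(batch: List) -> List:
    """Hypothetical stand-in for a compiled model that rejects other batch sizes."""
    assert len(batch) == 6
    return [len(text) for text in batch]


examples = [f"tweet {i}" for i in range(14)]  # 14 is not a multiple of 6
preds = predict_fixed(fake_model, examples, batch_size=6)
print(len(preds))
```

With real tensors the same idea applies to the batch dimension of `input_ids` and `attention_mask`. Passing `drop_last=True` to the `DataLoader` is a simpler alternative when skipping a few examples is acceptable.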


Results for different batch sizes:

    inference of 12284 examples with batch_size 1 took 20.0836 seconds
    inference of 12284 examples with batch_size 3 took 9.1281 seconds
    inference of 12284 examples with batch_size 6 took 6.8953 seconds
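
To make the throughput and latency trade-off concrete, the timings listed above can be turned into examples per second and time per batch. This is a rough sketch: the measured times also include data loading and collation, so the numbers are only indicative. The variable names are illustrative.

```python
import math

# Timings reported above: batch size -> total seconds for 12284 examples
num_examples = 12284
timings = {1: 20.0836, 3: 9.1281, 6: 6.8953}

baseline_seconds = timings[1]
summary = {}
for batch_size, seconds in timings.items():
    num_batches = math.ceil(num_examples / batch_size)
    summary[batch_size] = {
        "throughput": num_examples / seconds,        # examples per second
        "batch_ms": seconds / num_batches * 1000,    # average time per batch
        "speedup": baseline_seconds / seconds,       # relative to batch_size=1
    }

for batch_size, row in summary.items():
    print(
        f"batch_size={batch_size}: "
        f"{row['throughput']:.0f} examples/s, "
        f"{row['batch_ms']:.2f} ms per batch, "
        f"{row['speedup']:.2f}x vs batch_size=1"
    )
```

Larger batches process more examples per second, but each individual call takes longer, which is the latency cost mentioned in the Batching section.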


---

# Mixed Precision
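
Mixed precision trades some numerical accuracy for speed by running selected operations in a lower-precision format. As an illustration of what is lost, the sketch below round-trips FP32 values through bfloat16 using plain bit truncation. It assumes bfloat16 is the reduced format and uses truncation rather than round-to-nearest, so it is a simplification and not a description of what the compiler does internally. `to_bfloat16` is a hypothetical helper.

```python
import struct


def to_bfloat16(x: float) -> float:
    """Truncate a float32 value to bfloat16 precision and return it as a float."""
    # Reinterpret the float32 bit pattern as an unsigned 32-bit integer
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    # bfloat16 keeps the upper 16 bits: sign, 8 exponent bits, 7 mantissa bits
    truncated = bits & 0xFFFF0000
    return struct.unpack(">f", struct.pack(">I", truncated))[0]


values = [0.1, 3.14159, 1234.5678]
relative_errors = []
for value in values:
    reduced = to_bfloat16(value)
    relative_errors.append(abs(value - reduced) / abs(value))
    print(f"{value!r} -> {reduced!r}")
```

The exponent range is unchanged, so values do not overflow, but only about two to three significant decimal digits survive. Logits that are close together can therefore change order, which is why accuracy should be re-checked after compiling with reduced precision.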

In [116]:
import tensorflow  # to workaround a protobuf version conflict issue
import torch
import torch.neuron
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import transformers
  
# Build tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("cardiffnlp/twitter-roberta-base-sentiment")
model = AutoModelForSequenceClassification.from_pretrained("cardiffnlp/twitter-roberta-base-sentiment", return_dict=False)
# Setup some example inputs
positive_sequence = "This is a nice sentence about very kind guy from the east."
negative_sequence = "You fucking bastard."

max_length=128
batch_size=6

paraphrase = tokenizer(positive_sequence, max_length=max_length, padding='max_length', truncation=True, return_tensors="pt")

example_inputs_paraphrase = paraphrase['input_ids'], paraphrase['attention_mask']

# test model
outputs = model(**paraphrase)
assert 2 == outputs[0][0].argmax().item()

In [120]:
compiler_args = ['--fp32-cast=matmult']

# Run torch.neuron.trace to generate a TorchScript that is optimized by AWS Neuron
model_neuron_mixed = torch.neuron.trace(model, 
                                        example_inputs=example_inputs_paraphrase,
                                        compiler_args=compiler_args)

outputs = model_neuron_mixed(*example_inputs_paraphrase)

assert 2 == outputs[0][0].argmax().item()

# Save the mixed-precision model
model_neuron_mixed.save('roberta_neuron_mixed.pt')

  input_tensor.shape[chunk_dim] == tensor_shape for input_tensor in input_tensors
INFO:Neuron:There are 5 ops of 3 different types in the TorchScript that are not compiled by neuron-cc: aten::embedding, aten::cumsum, aten::type_as, (For more information see https://github.com/aws/aws-neuron-sdk/blob/master/release-notes/neuron-cc-ops/neuron-cc-ops-pytorch.md)
INFO:Neuron:Number of arithmetic operators (pre-compilation) before = 722, fused = 696, percent fused = 96.4%
INFO:Neuron:Compiler args type is <class 'list'> value is ['--fp32-cast=matmult']
INFO:Neuron:compiling function _NeuronGraph$1987 with neuron-cc
INFO:Neuron:Compiling with command line: '/opt/conda/bin/neuron-cc compile /tmp/tmpfxaew0b1/graph_def.pb --framework TENSORFLOW --pipeline compile SaveTemps --output /tmp/tmpfxaew0b1/graph_def.neff --io-config {"inputs": {"0:0": [[1, 128, 768], "float32"], "1:0": [[1, 1, 1, 128], "float32"]}, "outputs": ["Add_133:0"]} --fp32-cast=matmult --verbose 35'
Tensor output are ** NOT CAL

## Test again

In [121]:
model_res=do_test(processed_dataset, model,'pytorch')
model_neuron_res = do_test(processed_dataset, model_neuron_mixed,'neuron')


df = pd.DataFrame([model_res,model_neuron_res])
df

Tensor output are ** NOT CALCULATED ** during CPU execution and only indicate tensor shape (Triggered internally at  /opt/workspace/KaenaPyTorchRuntime/neuron_op/neuron_op_impl.cpp:38.)
  result = self.forward(*input, **kwargs)


|   | model_type | accuracy | total_time | average_time |
|---|------------|----------|------------|--------------|
| 0 | pytorch | 0.705 | 70190.29999999999ms | 70.1903ms |
| 1 | neuron | 0.197 | 995.8000000000001ms | 0.9958ms |


In [119]:
model_res=do_test(processed_dataset, model,'pytorch')
model_neuron_res = do_test(processed_dataset, model_neuron_mixed,'neuron')


df = pd.DataFrame([model_res,model_neuron_res])
df

Tensor output are ** NOT CALCULATED ** during CPU execution and only indicate tensor shape (Triggered internally at  /opt/workspace/KaenaPyTorchRuntime/neuron_op/neuron_op_impl.cpp:38.)
  result = self.forward(*input, **kwargs)


|   | model_type | accuracy | total_time | average_time |
|---|------------|----------|------------|--------------|
| 0 | pytorch | 0.705 | 72931.2ms | 72.9312ms |
| 1 | neuron | 0.197 | 900.6999999999999ms | 0.9007ms |
