# torchserve.ipynb

This notebook contains code for the portions of the benchmark in [the benchmark notebook](./benchmark.ipynb) that use [TorchServe](https://github.com/pytorch/serve).



In [1]:
# Imports go here
import json
import os
import requests

import scipy.special
import transformers

# Fix silly warning messages about parallel tokenizers
os.environ['TOKENIZERS_PARALLELISM'] = 'False'

In [2]:
# Constants go here

INTENT_MODEL_NAME = 'mrm8488/t5-base-finetuned-e2m-intent'
SENTIMENT_MODEL_NAME = 'cardiffnlp/twitter-roberta-base-sentiment'
QA_MODEL_NAME = 'deepset/roberta-base-squad2'
GENERATE_MODEL_NAME = 'gpt2'


INTENT_INPUT = {
    'context':
        ("I came here to eat chips and beat you up, "
         "and I'm all out of chips.")
}

SENTIMENT_INPUT = {
    'context': "We're not happy unless you're not happy."
}

QA_INPUT = {
    'question': 'What is 1 + 1?',
    'context': 
        """Addition (usually signified by the plus symbol +) is one of the four basic operations of 
        arithmetic, the other three being subtraction, multiplication and division. The addition of two 
        whole numbers results in the total amount or sum of those values combined. The example in the
        adjacent image shows a combination of three apples and two apples, making a total of five apples. 
        This observation is equivalent to the mathematical expression "3 + 2 = 5" (that is, "3 plus 2 
        is equal to 5").
        """
}

GENERATE_INPUT = {
    'prompt_text': 'All your base are'
}

## Model Packaging

TorchServe requires models to be packaged up as model archive files. Documentation for this process (such as it is) is [here](https://github.com/pytorch/serve/blob/master/README.md#serve-a-model) and [here](https://github.com/pytorch/serve/blob/master/model-archiver/README.md).



### Intent Model

The intent model requires the caller to run the pre- and post-processing code manually, because the model zoo provides only the model and tokenizer.

In [3]:
# First we need to dump the model into a local directory.
intent_model = transformers.AutoModelForSeq2SeqLM.from_pretrained(
    INTENT_MODEL_NAME)
intent_tokenizer = transformers.AutoTokenizer.from_pretrained('t5-base')

intent_model.save_pretrained('torchserve/intent')
intent_tokenizer.save_pretrained('torchserve/intent')

The `xla_device` argument has been deprecated in v4.4.0 of Transformers. It is ignored and you can safely remove it from your `config.json` file.

('torchserve/intent/tokenizer_config.json',
 'torchserve/intent/special_tokens_map.json',
 'torchserve/intent/tokenizer.json')

Next, we wrap the model in a [handler class](./torchserve/handler_intent.py), which
needs to be in its own separate Python file in order for the `torch-model-archiver`
utility to work.
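To make the shape of such a handler file concrete, here is a minimal sketch. It is not the notebook's actual handler. The class name `SketchHandler` and the placeholder "model" are hypothetical, and the sketch assumes the usual TorchServe conventions: the server calls `handle(data, context)`, exposes the unpacked archive directory through `context.system_properties["model_dir"]`, and delivers each request's payload under a `"body"` or `"data"` key.

```python
import json
from types import SimpleNamespace


class SketchHandler:
    """Hypothetical stand-in that mimics the stages of a TorchServe handler."""

    def __init__(self):
        self.initialized = False
        self.model_dir = None

    def initialize(self, context):
        # Assumption: the context exposes the directory where the .mar
        # archive was unpacked. A real handler would load the model and
        # tokenizer from that directory here.
        self.model_dir = context.system_properties.get("model_dir")
        self.initialized = True

    def preprocess(self, requests):
        # Assumption: each request carries its payload under "body" or "data",
        # either as raw bytes or as an already-parsed JSON object.
        records = []
        for request in requests:
            payload = request.get("body") or request.get("data")
            if isinstance(payload, (bytes, bytearray)):
                payload = json.loads(payload.decode("utf-8"))
            records.append(payload)
        return records

    def inference(self, records):
        # Placeholder "model": reports the length of each context string.
        return [len(record["context"]) for record in records]

    def postprocess(self, outputs):
        # Return one JSON-serializable entry per request in the batch.
        return [{"context_length": n} for n in outputs]

    def handle(self, requests, context):
        if not self.initialized:
            self.initialize(context)
        return self.postprocess(self.inference(self.preprocess(requests)))


# Exercise the sketch with a fake context and a one-request batch.
fake_context = SimpleNamespace(system_properties={"model_dir": "/tmp/intent"})
handler = SketchHandler()
result = handler.handle([{"body": b'{"context": "hi"}'}], fake_context)
print(result)
```

A real handler would replace `inference` with a call into the saved model and would typically subclass TorchServe's base handler rather than being a plain class.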

The following command turns this Python file, plus the data files created by the 
previous cell, into a model archive (`.mar`) file at `torchserve/model_store/intent.mar`.

In [4]:
%%time
!mkdir -p torchserve/model_store
!torch-model-archiver --model-name intent --version 1.0 \
 --serialized-file torchserve/intent/pytorch_model.bin \
 --handler torchserve/handler_intent.py \
 --extra-files "torchserve/intent/config.json,torchserve/intent/special_tokens_map.json,torchserve/intent/tokenizer_config.json,torchserve/intent/tokenizer.json" \
 --export-path torchserve/model_store \
 --force

CPU times: user 438 ms, sys: 116 ms, total: 553 ms
Wall time: 54 s
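The same archiver invocation is repeated for each model with only the names changed, so it may help to see how the argument list is assembled. The sketch below builds the command as a list of strings. It uses only the flags that appear in the cell above, and the helper name `archiver_command` is hypothetical.

```python
import shlex


def archiver_command(model_name, model_dir, handler_path, extra_file_names,
                     export_path="torchserve/model_store", version="1.0"):
    """Build a torch-model-archiver argument list (hypothetical helper)."""
    # --extra-files takes a single comma-separated string of paths.
    extra_files = ",".join(f"{model_dir}/{name}" for name in extra_file_names)
    return [
        "torch-model-archiver",
        "--model-name", model_name,
        "--version", version,
        "--serialized-file", f"{model_dir}/pytorch_model.bin",
        "--handler", handler_path,
        "--extra-files", extra_files,
        "--export-path", export_path,
        "--force",
    ]


cmd = archiver_command(
    "intent", "torchserve/intent", "torchserve/handler_intent.py",
    ["config.json", "special_tokens_map.json",
     "tokenizer_config.json", "tokenizer.json"])

# Print a shell-quoted version; pass `cmd` to subprocess.run() to execute it.
print(shlex.join(cmd))
```

Building the command as a list avoids shell-quoting problems if a path ever contains spaces.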


### Sentiment Model

The sentiment model operates similarly to the intent model.

In [5]:
sentiment_tokenizer = transformers.AutoTokenizer.from_pretrained(
    SENTIMENT_MODEL_NAME)
sentiment_model = (
    transformers.AutoModelForSequenceClassification
    .from_pretrained(SENTIMENT_MODEL_NAME))

sentiment_model.save_pretrained('torchserve/sentiment')
sentiment_tokenizer.save_pretrained('torchserve/sentiment')

('torchserve/sentiment/tokenizer_config.json',
 'torchserve/sentiment/special_tokens_map.json',
 'torchserve/sentiment/vocab.json',
 'torchserve/sentiment/merges.txt',
 'torchserve/sentiment/added_tokens.json',
 'torchserve/sentiment/tokenizer.json')

In [6]:
contexts = ['hello', 'world']
input_batch = sentiment_tokenizer(contexts, padding=True, 
                                  return_tensors='pt')

inference_output = sentiment_model(**input_batch)

scores = inference_output.logits.detach().numpy()
scores = scipy.special.softmax(scores, axis=1).tolist()
# The model card for this checkpoint lists its labels in the order
# 0 = negative, 1 = neutral, 2 = positive.
scores = [dict(zip(['negative', 'neutral', 'positive'], row))
          for row in scores]

scores

[{'negative': 0.13167870044708252,
  'neutral': 0.6034972071647644,
  'positive': 0.26482412219047546},
 {'negative': 0.22967909276485443,
  'neutral': 0.5535956025123596,
  'positive': 0.21672534942626953}]

In [7]:
%%time
!torch-model-archiver --model-name sentiment --version 1.0 \
 --serialized-file torchserve/sentiment/pytorch_model.bin \
 --handler torchserve/handler_sentiment.py \
 --extra-files "torchserve/sentiment/config.json,torchserve/sentiment/special_tokens_map.json,torchserve/sentiment/tokenizer_config.json,torchserve/sentiment/tokenizer.json" \
 --export-path torchserve/model_store \
 --force

CPU times: user 210 ms, sys: 114 ms, total: 324 ms
Wall time: 24.2 s


### Question Answering Model

The QA model uses a `transformers` pipeline. We squeeze this model into the TorchServe APIs by telling the pipeline to serialize all of its parts to a single directory, then passing the parts that aren't `pytorch_model.bin` in as extra files. At runtime, our custom handler uses the model loading code from `transformers` on the reconstituted model directory.

In [8]:
qa_pipeline = transformers.pipeline('question-answering', model=QA_MODEL_NAME)
qa_pipeline.save_pretrained('torchserve/qa')

In [9]:
%%time
!torch-model-archiver --model-name qa --version 1.0 \
 --serialized-file torchserve/qa/pytorch_model.bin \
 --handler torchserve/handler_qa.py \
 --extra-files "torchserve/qa/config.json,torchserve/qa/merges.txt,torchserve/qa/special_tokens_map.json,torchserve/qa/tokenizer_config.json,torchserve/qa/tokenizer.json,torchserve/qa/vocab.json" \
 --export-path torchserve/model_store \
 --force

CPU times: user 287 ms, sys: 67.5 ms, total: 354 ms
Wall time: 24.7 s
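The split between the serialized weights and the extra files can also be computed from the saved directory instead of being typed by hand. This is a minimal sketch, assuming that every regular file other than `pytorch_model.bin` should be shipped as an extra file. The helper name `split_model_dir` is hypothetical, and the demonstration runs on a throwaway directory of empty placeholder files.

```python
import tempfile
from pathlib import Path


def split_model_dir(model_dir, weights_name="pytorch_model.bin"):
    """Return (weights_path, extra_files_argument) for a saved model directory."""
    directory = Path(model_dir)
    weights = directory / weights_name
    if not weights.is_file():
        raise FileNotFoundError(f"{weights} not found")
    # Everything that is not the weights file becomes an extra file.
    extras = sorted(
        path for path in directory.iterdir()
        if path.is_file() and path.name != weights_name)
    return weights.as_posix(), ",".join(path.as_posix() for path in extras)


# Demonstration on a temporary directory with empty placeholder files.
with tempfile.TemporaryDirectory() as tmp:
    for name in ["pytorch_model.bin", "config.json", "vocab.json"]:
        (Path(tmp) / name).touch()
    weights_path, extra_files = split_model_dir(tmp)
    print(weights_path)
    print(extra_files)
```

In the notebook's layout, the call would be `split_model_dir('torchserve/qa')`, and the second return value would go to `--extra-files`.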


In [10]:
data = [QA_INPUT, QA_INPUT]

# Preprocessing
samples = [qa_pipeline.create_sample(**r) for r in data]
generators = [qa_pipeline.preprocess(s) for s in samples]

# Inference
inference_outputs = ((qa_pipeline.forward(example) for example in batch) for batch in generators)

post_results = [qa_pipeline.postprocess(o) for o in inference_outputs]
post_results

[{'score': 4.278918822819833e-06, 'start': 483, 'end': 484, 'answer': '5'},
 {'score': 4.278918822819833e-06, 'start': 483, 'end': 484, 'answer': '5'}]

### Natural Language Generation Model

The text generation model is roughly similar to the QA model, albeit with important differences in how the three stages of the pipeline operate.  At least model loading is the same.

In [11]:
generate_pipeline = transformers.pipeline(
    'text-generation', model=GENERATE_MODEL_NAME)
generate_pipeline.save_pretrained('torchserve/generate')

In [12]:
data = [GENERATE_INPUT, GENERATE_INPUT]


pad_token_id = generate_pipeline.tokenizer.eos_token_id

json_records = data

# preprocess() takes a single input at a time, but we need to do 
# a batch at a time.
input_batch = [generate_pipeline.preprocess(**r) for r in json_records]

# forward() takes a single input at a time, but we need to run a
# batch at a time.
inference_output = [
    generate_pipeline.forward(r, pad_token_id=pad_token_id)
    for r in input_batch]

# postprocess() takes a single generation result at a time, but we
# need to run a batch at a time.
generate_result = [generate_pipeline.postprocess(i)
                   for i in inference_output]
generate_result

[[{'generated_text': "All your base are all fine. But don't bother telling me exactly what's going to change that. You'll likely keep getting annoyed, and a little further explanation is best. The best part? I don't want to hear you whine. Well"}],
 [{'generated_text': 'All your base are so focused, and so focused because you want to leave the world and get away from the pain." I said, "I\'ll give you my keys and your cash, and your time. Now stop fighting with me, that\'s'}]]

In [13]:
%%time
!torch-model-archiver --model-name generate --version 1.0 \
 --serialized-file torchserve/generate/pytorch_model.bin \
 --handler torchserve/handler_generate.py \
 --extra-files "torchserve/generate/config.json,torchserve/generate/merges.txt,torchserve/generate/special_tokens_map.json,torchserve/generate/tokenizer_config.json,torchserve/generate/tokenizer.json,torchserve/generate/vocab.json" \
 --export-path torchserve/model_store \
 --force

CPU times: user 198 ms, sys: 96 ms, total: 294 ms
Wall time: 24.5 s


## Testing

Now we can fire up TorchServe and test our models.

For some reason, starting TorchServe needs to be done in a proper terminal window. Running the command from this notebook has no effect.  The commands to run (from the root of the repository) are:

```
> conda activate ./env
> cd notebooks/benchmark/torchserve
> torchserve --start --ncs --model-store model_store --ts-config torchserve.properties
```

Then pick up a cup of coffee and a book and wait a while. The startup process is like cold-starting a gas turbine and takes about 10 minutes.
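Rather than guessing when startup has finished, you can poll until the server answers. The sketch below is one way to do that. The helper names `wait_until_ready` and `torchserve_is_up` are hypothetical, the timeout and interval defaults are arbitrary, and the probe assumes the management API is listening on port 8081, as in the probe cell below.

```python
import time
import urllib.request


def wait_until_ready(probe, timeout_sec=900.0, interval_sec=10.0,
                     sleep=time.sleep, clock=time.monotonic):
    """Call probe() repeatedly until it returns True or the timeout expires."""
    deadline = clock() + timeout_sec
    while True:
        try:
            if probe():
                return True
        except OSError:
            # Connection refused and similar errors mean "not up yet".
            pass
        if clock() >= deadline:
            return False
        sleep(interval_sec)


def torchserve_is_up(url="http://127.0.0.1:8081/models"):
    """Return True if the management API answers with HTTP 200."""
    with urllib.request.urlopen(url, timeout=5) as response:
        return response.status == 200


# Usage (commented out so that this cell does not block for minutes):
# if not wait_until_ready(torchserve_is_up):
#     raise RuntimeError("TorchServe did not come up in time")
```

The probe is passed in as a function so that the polling loop can be exercised without a running server.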

Once the server has started, we can test our deployed models by making POST requests.

In [14]:
# Probe the management API to verify that TorchServe is running.
requests.get('http://127.0.0.1:8081/models').json()

ConnectionError: HTTPConnectionPool(host='127.0.0.1', port=8081): Max retries exceeded with url: /models (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7fda44069670>: Failed to establish a new connection: [Errno 111] Connection refused'))

In [None]:
port = 8080

# TorchServe's inference API documents POST for /predictions/{model_name}.
intent_result = requests.post(
    f'http://127.0.0.1:{port}/predictions/intent_en',
    data=json.dumps(INTENT_INPUT)).json()
print(f'Intent result: {intent_result}')

sentiment_result = requests.post(
    f'http://127.0.0.1:{port}/predictions/sentiment_en',
    data=json.dumps(SENTIMENT_INPUT)).json()
print(f'Sentiment result: {sentiment_result}')

qa_result = requests.post(
    f'http://127.0.0.1:{port}/predictions/qa_en',
    data=json.dumps(QA_INPUT)).json()
print(f'Question answering result: {qa_result}')

generate_result = requests.post(
    f'http://127.0.0.1:{port}/predictions/generate_en',
    data=json.dumps(GENERATE_INPUT)).json()
print(f'Natural language generation result: {generate_result}')

## Cleanup

TorchServe consumes many resources even when it isn't doing anything. When you're done running the baseline portion of the benchmark, be sure to shut down the server by running:
```
> torchserve --stop
```