# ray_deploy.ipynb

Optimized model serving implementation from the [benchmark notebook](./benchmark.ipynb).

## Boilerplate

In [1]:
# Initialization and import code goes in this cell.

# Imports: Python core, then third-party, then local.
# Try to keep each block in alphabetical order, or the linter may get angry.

import asyncio
import concurrent.futures  # `import concurrent` alone does not load the submodule
import json
import os
import time

import ray
from ray import serve
import requests
import scipy.special
import starlette.requests  # needed for the `starlette.requests.Request` annotations
import torch
import transformers
import zerocopy

# Fix silly warning messages about parallel tokenizers
os.environ['TOKENIZERS_PARALLELISM'] = 'False'


# Reduce the volume of warning messages from `transformers`
transformers.logging.set_verbosity_error()


def reboot_ray():
    if ray.is_initialized():
        ray.shutdown()

    if torch.cuda.is_available():
        return ray.init(num_gpus=1)
    else:
        return ray.init()

In [2]:
# Constants go here
INTENT_MODEL_NAME = 'mrm8488/t5-base-finetuned-e2m-intent'
SENTIMENT_MODEL_NAME = 'cardiffnlp/twitter-roberta-base-sentiment'
QA_MODEL_NAME = 'deepset/roberta-base-squad2'
GENERATE_MODEL_NAME = 'gpt2'


INTENT_INPUT = {
    'context':
        ("I came here to eat chips and beat you up, "
         "and I'm all out of chips.")
}

SENTIMENT_INPUT = {
    'context': "We're not happy unless you're not happy."
}

QA_INPUT = {
    'question': 'What is 1 + 1?',
    'context': 
        """Addition (usually signified by the plus symbol +) is one of the four basic operations of 
        arithmetic, the other three being subtraction, multiplication and division. The addition of two 
        whole numbers results in the total amount or sum of those values combined. The example in the
        adjacent image shows a combination of three apples and two apples, making a total of five apples. 
        This observation is equivalent to the mathematical expression "3 + 2 = 5" (that is, "3 plus 2 
        is equal to 5").
        """
}

GENERATE_INPUT = {
    'prompt_text': 'All your base are'
}

## Example model code

This is the single-node code on which the Serve deployments below are based.  Some of this code is duplicated in `benchmark.ipynb` and should be kept in sync.

### Intent model

For our intent detection models, we'll use the model [`mrm8488/t5-base-finetuned-e2m-intent`](https://huggingface.co/mrm8488/t5-base-finetuned-e2m-intent).

The intent model comes as three parts: a *tokenizer* that converts raw text into a sequence of numeric token IDs, a core *model* that transforms these token sequences, and *preprocessing and postprocessing code* that choreographs the use of the first two parts.
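
For this model, postprocessing mostly means stripping special tokens from the decoded string. The sketch below assumes the decoded text looks like `<pad> to eat</s>`; `strip_special_tokens` is a hypothetical helper written for illustration, not part of `transformers`.

```python
def strip_special_tokens(decoded: str,
                         pad_token: str = '<pad>',
                         eos_token: str = '</s>') -> str:
    """Remove padding and end-of-sequence markers from decoded text.

    Hypothetical helper for illustration only.
    """
    # Drop every occurrence of both markers, then trim leftover whitespace.
    for token in (pad_token, eos_token):
        decoded = decoded.replace(token, '')
    return decoded.strip()


print(strip_special_tokens('<pad> to eat</s>'))
```

Unlike fixed-offset slicing, this variant does not depend on the exact position of the markers in the decoded string.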

In [3]:
# Load model and tokenizer
intent_tokenizer = transformers.AutoTokenizer.from_pretrained('t5-base')
intent_model = transformers.AutoModelForSeq2SeqLM.from_pretrained(
    INTENT_MODEL_NAME)

# Preprocessing
input_text = f'{INTENT_INPUT["context"]} </s>'
features = intent_tokenizer([input_text], return_tensors='pt')

# Inference
output = intent_model.generate(**features)

# Postprocessing
result_string = intent_tokenizer.decode(output[0])
result_string = result_string.replace('<pad>', '')
result_string = result_string[len(' '):-len('</s>')]

result_string

'to eat'

### Sentiment model

For our sentiment models, we'll use the model [`cardiffnlp/twitter-roberta-base-sentiment`](https://huggingface.co/cardiffnlp/twitter-roberta-base-sentiment).

Like the intent model, the sentiment model is packaged as a tokenizer, a core model, and instructions for pre- and post-processing.
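
The postprocessing step for a classifier like this one turns raw logits into a probability per label using a softmax. The following is a minimal, dependency-free sketch of that step. The function name, the logits, and the label names are all placeholders chosen for illustration, not actual model output.

```python
import math
from typing import Dict, Sequence


def scores_from_logits(logits: Sequence[float],
                       labels: Sequence[str]) -> Dict[str, float]:
    """Convert raw logits into a label -> probability mapping (softmax)."""
    # Subtract the largest logit before exponentiating for numerical stability.
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return {label: e / total for label, e in zip(labels, exps)}


# Illustrative logits and placeholder label names, not real model output.
example = scores_from_logits([2.0, 1.0, 0.0],
                             ['class_0', 'class_1', 'class_2'])
print(example)
```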

In [4]:
# Model loading
sentiment_tokenizer = transformers.AutoTokenizer.from_pretrained(
    SENTIMENT_MODEL_NAME)
sentiment_model = (transformers.AutoModelForSequenceClassification
                   .from_pretrained(SENTIMENT_MODEL_NAME))

# Preprocessing
encoded_input = sentiment_tokenizer(SENTIMENT_INPUT['context'],
                                    return_tensors='pt')

# Inference
output = sentiment_model(**encoded_input)

# Postprocessing
scores = output[0][0].detach().numpy()
scores = scipy.special.softmax(scores)
scores = [float(s) for s in scores]
scores = {k: v for k, v in zip(['positive', 'neutral', 'negative'], scores)}

scores

{'positive': 0.5419477820396423,
 'neutral': 0.38251084089279175,
 'negative': 0.07554134726524353}

### Question Answering Model

For our question answering models, we'll use the model [`deepset/roberta-base-squad2`](https://huggingface.co/deepset/roberta-base-squad2).

Unlike the intent and sentiment models, the question answering model comes prepackaged as a `question-answering` pipeline via the `transformers` library's [Pipelines API](https://huggingface.co/docs/transformers/main_classes/pipelines).

We can therefore load and run all parts of the model, including pre- and post-processing code, by creating an instance of the pipeline class. The pipeline object has methods `preprocess()`, `forward()`, and `postprocess()` that perform preprocessing, inference, and postprocessing, respectively.

In [5]:
# Loading the model
qa_pipeline = transformers.pipeline('question-answering',
                                    model=QA_MODEL_NAME)
# Preprocessing (returns a Python generator)
qa_pre = qa_pipeline.preprocess(qa_pipeline.create_sample(**QA_INPUT))

# Inference
qa_output = (qa_pipeline.forward(example) for example in qa_pre)

# Postprocessing
qa_result = qa_pipeline.postprocess(qa_output)

qa_result

{'score': 4.278938831703272e-06, 'start': 483, 'end': 484, 'answer': '5'}

There is also a convenience method `__call__()` that runs all three phases of processing in sequence.
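
To make the relationship between the three phases and `__call__()` concrete, here is a toy stand-in with the same shape. `ToyPipeline` and its scoring rule are hypothetical and only mimic the structure; this is not how `transformers` implements its pipelines.

```python
from typing import Any, Dict, Iterable, Iterator


class ToyPipeline:
    """Hypothetical stand-in that mimics the three-phase structure."""

    def preprocess(self, text: str) -> Iterator[Dict[str, Any]]:
        # Lazily yield one feature dict per whitespace-separated chunk.
        for chunk in text.split():
            yield {'chunk': chunk}

    def forward(self, features: Dict[str, Any]) -> Dict[str, Any]:
        # Stand-in for inference: score each chunk by its length.
        return {'chunk': features['chunk'], 'score': len(features['chunk'])}

    def postprocess(self, outputs: Iterable[Dict[str, Any]]) -> Dict[str, Any]:
        # Pick the highest-scoring chunk (the first one wins on ties).
        return max(outputs, key=lambda out: out['score'])

    def __call__(self, text: str) -> Dict[str, Any]:
        # Chain the three phases; the generator keeps evaluation lazy.
        outputs = (self.forward(f) for f in self.preprocess(text))
        return self.postprocess(outputs)


toy = ToyPipeline()
result = toy('three apples and two apples')
print(result)
```

Calling the object gives the same result as chaining `preprocess()`, `forward()`, and `postprocess()` by hand.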

In [6]:
# This code also appears in `benchmark.ipynb`

# Loading the model and associated resources
qa_pipeline = transformers.pipeline('question-answering',
                                    model=QA_MODEL_NAME)
# Preprocessing, inference, and postprocessing all happen in
# the pipeline object's __call__() method.
qa_result = qa_pipeline(**QA_INPUT)

qa_result

{'score': 4.278938831703272e-06, 'start': 483, 'end': 484, 'answer': '5'}

### Natural Language Generation Model

For natural language generation, we'll use the [`gpt2`](https://huggingface.co/gpt2) language model. Like the question answering model, this natural language generation model comes wrapped in a `transformers` pipeline class. The class's `__call__()` method performs all the steps necessary to run end-to-end inference.


In [7]:
# Load the model
generate_pipeline = transformers.pipeline(
    'text-generation', model=GENERATE_MODEL_NAME)
pad_token_id = generate_pipeline.tokenizer.eos_token_id

# Preprocessing
generate_pre = generate_pipeline.preprocess(**GENERATE_INPUT)

# Inference
generate_output = generate_pipeline.forward(generate_pre,
                                            pad_token_id=pad_token_id)

# Postprocessing
generate_result = generate_pipeline.postprocess(generate_output)
generate_result

[{'generated_text': 'All your base are just to get you going. If you have any problems you can use this guide to try and start playing with our new cards. There are a lot of great options you can use.\n\nFor the players that will run into'}]

## Start Ray Serve

In [8]:
serve.shutdown()
reboot_ray()
serve.start()

2022-04-14 16:23:27,706 INFO services.py:1412 -- View the Ray dashboard at http://127.0.0.1:8266
(ServeController pid=51116) 2022-04-14 16:23:32,867 INFO checkpoint_path.py:16 -- Using RayInternalKVStore for controller checkpoint and recovery.
(ServeController pid=51116) 2022-04-14 16:23:32,974 INFO http_state.py:98 -- Starting HTTP proxy with name 'SERVE_CONTROLLER_ACTOR:TOnaCn:SERVE_PROXY_ACTOR-node:127.0.0.1-0' on node 'node:127.0.0.1-0' listening on '127.0.0.1:8000'
2022-04-14 16:23:33,483 INFO api.py:521 -- Started Serve instance in namespace '5ba3a27e-f16a-4829-9268-ad13be21fc2e'.


<ray.serve.api.Client at 0x7fe790835a90>

(HTTPProxyActor pid=51122) INFO:     Started server process [51122]


## Optimized Model Deployments

Some of these classes appear in slightly modified form in `benchmark.ipynb`. Make sure to keep the two copies in sync.

In [9]:
# This class also appears in `benchmark.ipynb`
@serve.deployment
class Intent:
    def __init__(self):
        self._tokenizer = transformers.AutoTokenizer.from_pretrained('t5-base')

        # Extract weights and load them onto the Plasma object store
        self._model_ref = ray.put(zerocopy.extract_tensors(
            transformers.AutoModelForSeq2SeqLM.from_pretrained(
                    INTENT_MODEL_NAME)))

    async def __call__(self, request: starlette.requests.Request):
        json_request = await request.json()

        # Preprocessing
        input_text = f'{json_request["context"]} </s>'
        features = self._tokenizer([input_text], return_tensors='pt')

        # Model inference runs asynchronously in a Ray task
        output = await zerocopy.call_model.remote(
            self._model_ref, [], features, 'generate')

        # Postprocessing
        result_string = self._tokenizer.decode(output[0])
        result_string = result_string[len('<pad> '):-len('</s>')]
        return {
            'intent': result_string
        }


@serve.deployment
class Sentiment:
    def __init__(self):
        self._tokenizer = transformers.AutoTokenizer.from_pretrained(
            SENTIMENT_MODEL_NAME)

        model = (transformers.AutoModelForSequenceClassification
                 .from_pretrained(SENTIMENT_MODEL_NAME))
        self._model_ref = ray.put(zerocopy.extract_tensors(model))

    async def __call__(self, request: starlette.requests.Request):
        json_request = await request.json()

        # Preprocessing
        encoded_input = self._tokenizer(json_request['context'], 
                                         return_tensors='pt')   

        # Inference
        output = await zerocopy.call_model.remote(
            self._model_ref, [], encoded_input)

        # Postprocessing
        scores = output[0][0].detach().numpy()
        scores = scipy.special.softmax(scores)
        scores = [float(s) for s in scores]
        scores = {k: v for k, v in zip(['positive', 'neutral', 'negative'], scores)}
        return scores


# This class also appears in `benchmark.ipynb`
@serve.deployment
class QA:
    def __init__(self):
        # Load the pipeline and move the model's weights onto the
        # Plasma object store.
        self._pipeline = zerocopy.rewrite_pipeline(
            transformers.pipeline('question-answering', 
                                  model=QA_MODEL_NAME))
        self._threadpool = concurrent.futures.ThreadPoolExecutor()

    async def __call__(self, request: starlette.requests.Request):
        json_request = await request.json()

        # The original `transformers` code is not async-aware, so we
        # call it from `run_in_executor()`
        result = await asyncio.get_running_loop().run_in_executor(
             self._threadpool, lambda: self._pipeline(**json_request))
        return result


@serve.deployment
class Generate:
    def __init__(self):
        self._pipeline = zerocopy.rewrite_pipeline(
            transformers.pipeline('text-generation',
                                  model=GENERATE_MODEL_NAME),
            ('__call__', 'generate'))
        self._pad_token_id = self._pipeline.tokenizer.eos_token_id
        self._threadpool = concurrent.futures.ThreadPoolExecutor()

    async def __call__(self, request: starlette.requests.Request):
        json_request = await request.json()

        result = await asyncio.get_running_loop().run_in_executor(
            self._threadpool, 
            lambda: self._pipeline(
                json_request['prompt_text'], 
                pad_token_id=self._pad_token_id))
        return result
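
The `QA` and `Generate` deployments call synchronous pipeline code from an async handler via `run_in_executor()`. The pattern can be shown in isolation with only the standard library. In this sketch, `blocking_inference` is a hypothetical stand-in for a synchronous model call, and the pool size is an arbitrary choice.

```python
import asyncio
import concurrent.futures
import time


def blocking_inference(prompt: str) -> str:
    # Hypothetical stand-in for a synchronous, non-async-aware model call.
    time.sleep(0.05)
    return prompt.upper()


async def handle_request(pool: concurrent.futures.ThreadPoolExecutor,
                         prompt: str) -> str:
    loop = asyncio.get_running_loop()
    # The event loop stays free to serve other coroutines while a worker
    # thread is blocked inside `blocking_inference()`.
    return await loop.run_in_executor(pool, blocking_inference, prompt)


async def main() -> list:
    with concurrent.futures.ThreadPoolExecutor(max_workers=4) as pool:
        # Run three requests concurrently; gather() preserves input order.
        return await asyncio.gather(
            *(handle_request(pool, p) for p in ['all', 'your', 'base']))


results = asyncio.run(main())
print(results)
```

When run as a plain script, the three calls overlap in the thread pool instead of running one after another. Inside a notebook, where an event loop is already running, `await main()` would replace `asyncio.run(main())`.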


Now we can deploy all of these pipelines as Serve endpoints.

In [10]:
# Define endpoints.
# Everything gets deployed under the prefix /predictions/ to make
# the deployment as similar as possible to the TorchServe baseline.
LANGUAGES = ['en', 'es', 'zh']
DEPLOYMENTS = {
    'intent': Intent,
    'sentiment': Sentiment,
    'qa': QA,
    'generate': Generate,
}

# One deployment per (language, model) pair, in the same order as before.
for lang in LANGUAGES:
    for model_name, deployment in DEPLOYMENTS.items():
        (deployment.options(name=f'{model_name}_{lang}',
                            route_prefix=f'/predictions/{model_name}_{lang}',
                            ray_actor_options={'num_cpus': 0.1})
         .deploy(_blocking=False))

# Wait a moment so log output doesn't go to the next cell's output
time.sleep(5.)

2022-04-14 16:23:58,332	INFO api.py:262 -- Updating deployment 'intent_en'. component=serve deployment=intent_en
2022-04-14 16:23:58,343	INFO api.py:262 -- Updating deployment 'sentiment_en'. component=serve deployment=sentiment_en
2022-04-14 16:23:58,353	INFO api.py:262 -- Updating deployment 'qa_en'. component=serve deployment=qa_en
2022-04-14 16:23:58,364	INFO api.py:262 -- Updating deployment 'generate_en'. component=serve deployment=generate_en
2022-04-14 16:23:58,377	INFO api.py:262 -- Updating deployment 'intent_es'. component=serve deployment=intent_es
2022-04-14 16:23:58,391	INFO api.py:262 -- Updating deployment 'sentiment_es'. component=serve deployment=sentiment_es
2022-04-14 16:23:58,404	INFO api.py:262 -- Updating deployment 'qa_es'. component=serve deployment=qa_es
2022-04-14 16:23:58,417	INFO api.py:262 -- Updating deployment 'generate_es'. component=serve deployment=generate_es
2022-04-14 16:23:58,431	INFO api.py:262 -- Updating deployment 'intent_zh'. component=serve 

In [14]:
# Dump object sizes from Plasma. Used to populate the table of model sizes in the main notebook.
!ray memory --units MB | grep MB

7950.210511 MB       61, (7950.210511 MB)  0, (0.0 MB)   0, (0.0 MB)    0, (0.0 MB)          53, (0.0 MB) 
127.0.0.1     | 51116    | Worker  |           | 1.5e-05 MB | LOCAL_REFERENCE | b4567f7b86f1c9b04089143b9fcd9bcfa6aa4e880100000001000000
127.0.0.1     | 51116    | Worker  |           | 1.5e-05 MB | LOCAL_REFERENCE | 0cb686442cb43d5ecb863c842117b9ea56b331370100000001000000
127.0.0.1     | 51116    | Worker  |           | 1.5e-05 MB | LOCAL_REFERENCE | 5072e9fc92a6447effd95f38533e692f8796b72b0100000001000000
127.0.0.1     | 51116    | Worker  |           | 1.5e-05 MB | LOCAL_REFERENCE | d714b645ac9c0d738e0f785a690e9d7149e007d60100000001000000
127.0.0.1     | 51116    | Worker  |           | 1.5e-05 MB | LOCAL_REFERENCE | dd797876ac844e6cec693c0afb09f36ab69518530100000001000000
127.0.0.1     | 51116    | Worker  |           | 1.5e-05 MB | LOCAL_REFERENCE | d53def7e0cdfbb7ceeeccac7d4023d02b21863c30100000001000000
127.0.0.1     | 51116    | Worker  |           | 1.5e-05 MB | LOCAL_REF
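
The deployments follow the naming scheme `<model>_<lang>` under the `/predictions/` prefix, so the full set of endpoint URLs can be enumerated rather than typed out by hand. This is a small sketch; `endpoint_urls` is a hypothetical helper, and the base URL assumes the default local Serve HTTP proxy address used elsewhere in this notebook.

```python
from itertools import product
from typing import Dict

MODELS = ['intent', 'sentiment', 'qa', 'generate']
LANGUAGES = ['en', 'es', 'zh']


def endpoint_urls(base_url: str = 'http://127.0.0.1:8000') -> Dict[str, str]:
    """Map each deployment name to the URL it is served under."""
    return {
        f'{model}_{lang}': f'{base_url}/predictions/{model}_{lang}'
        for lang, model in product(LANGUAGES, MODELS)
    }


urls = endpoint_urls()
print(len(urls))
print(urls['qa_es'])
```

Iterating over `urls` makes it straightforward to check every language variant, not only the `_en` endpoints.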

In [12]:
# Verify that everything deployed properly.
intent_result = requests.put(
    'http://127.0.0.1:8000/predictions/intent_en', 
    json.dumps(INTENT_INPUT)).json()
print(f'Intent result: {intent_result}')

sentiment_result = requests.put(
    'http://127.0.0.1:8000/predictions/sentiment_en', 
    json.dumps(SENTIMENT_INPUT)).json()
print(f'Sentiment result: {sentiment_result}')

qa_result = requests.put(
    'http://127.0.0.1:8000/predictions/qa_en', 
    json.dumps(QA_INPUT)).json()
print(f'Question answering result: {qa_result}')

generate_result = requests.put(
    'http://127.0.0.1:8000/predictions/generate_en', 
    json.dumps(GENERATE_INPUT)).json()
print(f'Natural language generation result: {generate_result}')

Intent result: {'intent': 'to eat'}
Sentiment result: {'positive': 0.5419476628303528, 'neutral': 0.38251087069511414, 'negative': 0.07554134726524353}
Question answering result: {'score': 4.278897904441692e-06, 'start': 483, 'end': 484, 'answer': '5'}
Natural language generation result: [{'generated_text': "All your base are in a position to be able to compete to be the best in the world. We take your feedback very seriously. We're going to be working to make sure that we're doing everything we can to make a better game for everyone"}]


(pid=66875) [2022-03-08 11:49:17,044 E 66875 66920] core_worker_process.cc:348: The global worker has already been shutdown. This happens when the language frontend accesses the Ray's worker after it is shutdown. The process will exit
(pid=70721) [2022-03-08 11:52:00,501 E 70721 70771] core_worker_process.cc:348: The global worker has already been shutdown. This happens when the language frontend accesses the Ray's worker after it is shutdown. The process will exit
(pid=71167) [2022-03-08 11:52:22,160 E 71167 71218] core_worker_process.cc:348: The global worker has already been shutdown. This happens when the language frontend accesses the Ray's worker after it is shutdown. The process will exit
(pid=71286) [2022-03-08 11:52:27,603 E 71286 71403] core_worker_process.cc:348: The global worker has already been shutdown. This happens when the language frontend accesses the Ray's worker after it is shutdown. The process will exit


## Cleanup

Once the benchmark is complete, shut down this notebook's Ray cluster.

In [None]:
serve.shutdown()
ray.shutdown()

(ServeController pid=19731) 2022-03-08 12:47:52,714 INFO deployment_state.py:940 -- Removing 1 replicas from deployment 'intent_en'. component=serve deployment=intent_en
(ServeController pid=19731) 2022-03-08 12:47:52,719 INFO deployment_state.py:940 -- Removing 1 replicas from deployment 'sentiment_en'. component=serve deployment=sentiment_en
(ServeController pid=19731) 2022-03-08 12:47:52,734 INFO deployment_state.py:940 -- Removing 1 replicas from deployment 'qa_en'. component=serve deployment=qa_en
(ServeController pid=19731) 2022-03-08 12:47:52,737 INFO deployment_state.py:940 -- Removing 1 replicas from deployment 'generate_en'. component=serve deployment=generate_en
(ServeController pid=19731) 2022-03-08 12:47:52,740 INFO deployment_state.py:940 -- Removing 1 replicas from deployment 'intent_es'. component=serve deployment=intent_es
(ServeController pid=19731) 2022-03-08 12:47:52,743 INFO deployment_st