*Copyright 2024 Modular, Inc: Licensed under the Apache License v2.0 with LLVM Exceptions.*

# Run inference with Python

The Python API for MAX Engine enables you to upgrade your runtime performance for TensorFlow and PyTorch models, on a wide range of hardware, with just three lines of code (not counting the import):


```python
from max import engine

# Load your model:
session = engine.InferenceSession()
model = session.load(model_path)

# Prepare the inputs, then run an inference:
outputs = model.execute(**inputs)

# Process the output here.
```

That's all you need! Everything else is the usual code to prepare your
inputs and process the outputs.

It also helps to see a fully working example, so the
rest of this page shows how to run an inference using a version of
[RoBERTa from Cardiff
NLP](https://huggingface.co/cardiffnlp/twitter-roberta-base-emotion-multilabel-latest),
which is a language model trained on tweets to perform sentiment analysis.

This example uses a TensorFlow
model (which must be converted to SavedModel format), but it's just as easy to
load a model from PyTorch (which must be converted to TorchScript format).

:::note Try it

This page is written in a Jupyter notebook you can get from
[our GitHub repo](https://github.com/modularml/max/blob/main/examples/notebooks/roberta-python-tensorflow.ipynb).

:::

## Install the MAX Engine Python package

Naturally, you first need to install the `max` Python package.
This package is not hosted in a package repository (PyPI), and can only be
installed with the `modular` CLI tool.

For instructions, see
[Get started with MAX Engine](https://docs.modular.com/engine/get-started).

In [1]:
# Install the MAX Engine Python package
!python3 -m pip install -q --find-links "$(modular config max.path)/wheels" max-engine
# Install other packages
# If you're using Python 3.9+, please run the following.
!python3 -m pip install -q transformers tensorflow-cpu
# If you're running Python 3.8, please run this instead.
!python3 -m pip install -q transformers tensorflow

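The two alternative `pip` commands above depend on your Python version. As a small sketch, you could pick the package name programmatically. The helper name `tensorflow_package` is made up for illustration and is not part of any library:

```python
import sys


def tensorflow_package(version_info=sys.version_info):
    """Return the TensorFlow package name suggested above for a Python version."""
    # Python 3.9+ uses the CPU-only wheel; Python 3.8 uses the standard package.
    return "tensorflow-cpu" if tuple(version_info[:2]) >= (3, 9) else "tensorflow"


# Print the install command that matches the interpreter running this code.
print(f"python3 -m pip install -q transformers {tensorflow_package()}")
```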
## Import Python modules

To start coding, we need some libraries that help us get the model and process
the input/output data.

In [2]:
#| echo: false
# suppress extraneous logging
import os
os.environ["TF_CPP_MIN_LOG_LEVEL"] = "3"
os.environ["TRANSFORMERS_VERBOSITY"] = "critical"
os.environ["TOKENIZERS_PARALLELISM"] = "false"

In [3]:
from pathlib import Path

import tensorflow as tf
from transformers import AutoTokenizer, TFAutoModelForSequenceClassification

from max import engine


## Download the model

Now we download the [RoBERTa model](https://huggingface.co/cardiffnlp/twitter-roberta-base-emotion-multilabel-latest) from Hugging Face and save it in TensorFlow SavedModel format.

In [4]:
HF_MODEL_NAME = "cardiffnlp/twitter-roberta-base-emotion-multilabel-latest"
hf_model = TFAutoModelForSequenceClassification.from_pretrained(HF_MODEL_NAME)

model_path = Path("roberta")
tf.saved_model.save(hf_model, model_path)

INFO:tensorflow:Assets written to: roberta/assets

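Before loading the export, it can be worth confirming that the directory really looks like a SavedModel. The sketch below assumes the usual SavedModel layout (a `saved_model.pb` file next to a `variables/` directory). The helper `looks_like_saved_model` is hypothetical, and the demo runs against a throwaway directory rather than the real `roberta` export:

```python
import tempfile
from pathlib import Path


def looks_like_saved_model(path):
    """Heuristic check: a SavedModel directory holds saved_model.pb and variables/."""
    path = Path(path)
    return (path / "saved_model.pb").is_file() and (path / "variables").is_dir()


# Demonstrate on a temporary directory laid out like a SavedModel export.
with tempfile.TemporaryDirectory() as tmp:
    fake_model = Path(tmp) / "roberta"
    (fake_model / "variables").mkdir(parents=True)
    empty_dir_ok = looks_like_saved_model(fake_model)  # saved_model.pb is still missing
    (fake_model / "saved_model.pb").touch()
    full_dir_ok = looks_like_saved_model(fake_model)

print(empty_dir_ok, full_dir_ok)
```

In the notebook itself you would call `looks_like_saved_model(model_path)` after `tf.saved_model.save()` returns.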


## Load and compile the model

Then, we load and compile the model in MAX Engine using an
[`InferenceSession`](/engine/reference/python/engine.html#max.engine.InferenceSession).

In [5]:
session = engine.InferenceSession()
model = session.load(model_path)

Compiling model.    
Done!


That's two lines down, just one to go.

:::note

Some models might take a few minutes to compile the first time, but this up-front cost pays dividends in the latency savings provided by our next-generation graph compiler.

:::

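If you want to know how long that first compilation takes on your machine, one option is a small timing helper. This is a generic sketch, not a MAX Engine feature: `timed` is a hypothetical context manager, and `time.sleep()` stands in for the real `session.load(model_path)` call:

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label, results):
    """Record the wall-clock duration of the enclosed block under `label`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        results[label] = time.perf_counter() - start


timings = {}
with timed("load", timings):
    # Stand-in for `model = session.load(model_path)`.
    time.sleep(0.01)

print(f"load took {timings['load']:.3f}s")
```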
## Prepare the input

This part is your usual pre-processing. 
For the RoBERTa model, we need to process the text input into a sequence of tokens, so we'll do that with [`transformers.AutoTokenizer`](https://huggingface.co/docs/transformers/main/en/model_doc/auto#transformers.AutoTokenizer).

First, let's take a look at the model's inputs:

In [6]:
for tensor in model.input_metadata:
    print(f'name: {tensor.name}, shape: {tensor.shape}, dtype: {tensor.dtype}')

name: attention_mask, shape: [None, None], dtype: DType.int32
name: input_ids, shape: [None, None], dtype: DType.int32
name: token_type_ids, shape: [None, None], dtype: DType.int32


This tells us the model needs three inputs, each a 2-D `int32` tensor. A dimension size of `None` means that dimension is dynamic.

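One way to read a shape such as `[None, None]` is as a pattern in which `None` matches any size. The following sketch makes that concrete. The helper `shape_matches` is hypothetical and is not part of the MAX Engine API:

```python
def shape_matches(spec, shape):
    """Return True if a concrete `shape` satisfies `spec`, where None is a wildcard."""
    if len(spec) != len(shape):
        return False
    return all(want is None or want == got for want, got in zip(spec, shape))


spec = [None, None]  # as reported for `input_ids` above
print(shape_matches(spec, (1, 14)))   # one sequence of 14 tokens
print(shape_matches(spec, (8, 128)))  # a batch of 8 sequences of 128 tokens
print(shape_matches(spec, (14,)))     # wrong rank: the spec is 2-D
```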
In [7]:
INPUT = "There are many exciting developments in the field of AI Infrastructure!"

tokenizer = AutoTokenizer.from_pretrained(HF_MODEL_NAME)
inputs = tokenizer(INPUT, return_tensors="np", return_token_type_ids=True)
print(inputs)

{'input_ids': array([[    0,   970,    32,   171,  3571,  5126,    11,     5,   882,
            9,  4687, 13469,   328,     2]]), 'token_type_ids': array([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]), 'attention_mask': array([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}

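With a single input, the `attention_mask` is all ones. It becomes meaningful when you batch inputs of different lengths: shorter sequences are padded, and the mask marks which positions are real tokens. Here is a minimal pure-Python sketch of that idea. It assumes RoBERTa's padding token ID is `1`, the helper `pad_batch` is hypothetical, and the second sequence is invented for illustration:

```python
PAD_ID = 1  # assumption: RoBERTa's <pad> token ID


def pad_batch(sequences, pad_id=PAD_ID):
    """Right-pad token ID lists to equal length and build matching attention masks."""
    width = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (width - len(seq)) for seq in sequences]
    attention_mask = [[1] * len(seq) + [0] * (width - len(seq)) for seq in sequences]
    return input_ids, attention_mask


batch = [
    # Token IDs from the tokenizer output above.
    [0, 970, 32, 171, 3571, 5126, 11, 5, 882, 9, 4687, 13469, 328, 2],
    # Hypothetical shorter sequence, made up for this example.
    [0, 970, 32, 2],
]
input_ids, attention_mask = pad_batch(batch)
print(attention_mask[1])
```

In practice you would let the tokenizer do this by passing a list of strings with `padding=True`.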

## Run an inference

Now for that third line of code, we pass the inputs to
[`execute()`](/engine/reference/python/engine#max.engine.Model.execute). This
function requires all inputs as keyword arguments, so we'll
unpack the `inputs` dictionary as we pass it through:

In [8]:
outputs = model.execute(**inputs)
print(outputs)

{'logits': array([[-4.7728553 ,  0.67335933, -4.7689624 , -0.78787935,  3.842757  ,
        -2.0537782 ,  2.1849515 , -4.227579  , -5.3352814 , -1.3866923 ,
        -1.0697807 ]], dtype=float32)}

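If the `**inputs` syntax is unfamiliar, the sketch below shows what it does: each dictionary key becomes a keyword argument name. `fake_execute` is a hypothetical stand-in for `model.execute()` that only reports the names it received, and the tiny input lists are placeholders:

```python
def fake_execute(**tensors):
    """Hypothetical stand-in for `model.execute()`: report the keyword names received."""
    return sorted(tensors)


inputs = {"input_ids": [[0, 2]], "token_type_ids": [[0, 0]], "attention_mask": [[1, 1]]}

# These two calls are equivalent; `**inputs` expands each key into a keyword argument.
unpacked = fake_execute(**inputs)
explicit = fake_execute(
    input_ids=inputs["input_ids"],
    token_type_ids=inputs["token_type_ids"],
    attention_mask=inputs["attention_mask"],
)
print(unpacked == explicit)
```

This is why the dictionary keys must match the input names reported by `model.input_metadata`.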

That's it!

The output from [`execute()`](/engine/reference/python/engine.html#max.engine.Model.execute) is a dictionary of output tensors, each a NumPy `ndarray`. Let's now figure out what they say.

## Process the outputs

Again, we'll use some help from the [transformers library](https://huggingface.co/docs/transformers/main/en/model_doc/roberta#transformers.TFRobertaForSequenceClassification) to convert the predicted class ID to a label:

In [9]:
# Extract class prediction from output
predicted_class_id = outputs["logits"].argmax(axis=-1)[0]
classification = hf_model.config.id2label[predicted_class_id]

print(f"The sentiment is: {classification}")

The sentiment is: joy


Ta-da! 🎉

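The `argmax` above reports only the single strongest class. The model name suggests it is a multi-label classifier, so an alternative is to score each label independently with a sigmoid and keep every label above a cutoff. The sketch below is an illustration under assumptions, not the notebook's method: the logits are copied (rounded) from the output above, the label order is assumed and should be checked against `hf_model.config.id2label`, and the `0.5` threshold is an arbitrary choice:

```python
import math

# Logits copied from the `execute()` output above (rounded to 4 decimals).
logits = [-4.7729, 0.6734, -4.7690, -0.7879, 3.8428, -2.0538,
          2.1850, -4.2276, -5.3353, -1.3867, -1.0698]

# Assumption: label order; verify against `hf_model.config.id2label`.
labels = ["anger", "anticipation", "disgust", "fear", "joy", "love",
          "optimism", "pessimism", "sadness", "surprise", "trust"]


def sigmoid(x):
    """Map a logit to an independent probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


THRESHOLD = 0.5  # hypothetical cutoff; tune for your use case
scores = {label: sigmoid(logit) for label, logit in zip(labels, logits)}
selected = [label for label, score in scores.items() if score >= THRESHOLD]
print(selected)
```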
If you're running this notebook yourself, keep in mind that it does not illustrate MAX Engine's runtime performance. For actual benchmark results, try [our benchmark tool](/engine/benchmark) or check out our [performance dashboard](https://performance.modular.com).

For more details about the inference API, see the [Python API reference](/engine/reference/python/engine).