In [1]:
from transformers import AutoModel, pipeline
from optimum.bettertransformer import BetterTransformer
from IPython.display import display, Markdown
import pandas as pd

# Load Model & Convert to Optimum

This section shows how to load a pre-trained model from the HuggingFace Hub.
You can browse the available models [here](https://huggingface.co/models).

The loaded model is then converted with [`BetterTransformer`](https://huggingface.co/docs/optimum/bettertransformer/overview) from the [optimum project](https://huggingface.co/docs/optimum/index).

In [2]:
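model_name = "roberta-base"
# device_map="auto" requires the accelerate package and places the weights on the available device(s)
model = AutoModel.from_pretrained(model_name, device_map="auto")

# Convert to BetterTransformer format to speed up inference.
# keep_original_model=True returns a converted copy and leaves `model` untouched.
bt_model = BetterTransformer.transform(model, keep_original_model=True)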

Some weights of RobertaModel were not initialized from the model checkpoint at roberta-base and are newly initialized: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
The BetterTransformer implementation does not support padding during training, as the fused kernels do not support attention masks. Beware that passing padded batched data during training may result in unexpected outputs. Please refer to https://huggingface.co/docs/optimum/bettertransformer/overview for more details.


In [3]:
print("converted_model: ", bt_model)

converted_model:  RobertaModel(
  (embeddings): RobertaEmbeddings(
    (word_embeddings): Embedding(50265, 768, padding_idx=1)
    (position_embeddings): Embedding(514, 768, padding_idx=1)
    (token_type_embeddings): Embedding(1, 768)
    (LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
    (dropout): Dropout(p=0.1, inplace=False)
  )
  (encoder): RobertaEncoder(
    (layer): ModuleList(
      (0-11): 12 x BertLayerBetterTransformer(
        (act_fn_callable): GELUActivation()
      )
    )
  )
  (pooler): RobertaPooler(
    (dense): Linear(in_features=768, out_features=768, bias=True)
    (activation): Tanh()
  )
)
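
The conversion is meant to speed up inference, and whether it does on a given machine is easiest to check by timing both models. The sketch below is a small, generic timing helper; `median_latency_ms` and the two stand-in workloads are hypothetical names introduced for illustration, and in the notebook the stand-ins would be replaced by calls to `model` and `bt_model` on the same tokenized input.

```python
import statistics
import time


def median_latency_ms(fn, n_warmup=2, n_runs=10):
    """Return the median wall-clock latency of fn() in milliseconds."""
    for _ in range(n_warmup):
        fn()  # warm-up runs are discarded (caches, lazy initialisation)
    timings = []
    for _ in range(n_runs):
        start = time.perf_counter()
        fn()
        timings.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(timings)


# Stand-in workloads so the sketch runs on its own (assumption: in the
# notebook these would wrap forward passes of `model` and `bt_model`).
def slow_stand_in():
    time.sleep(0.02)


def fast_stand_in():
    time.sleep(0.005)


baseline_ms = median_latency_ms(slow_stand_in)
converted_ms = median_latency_ms(fast_stand_in)
print(f"baseline: {baseline_ms:.1f} ms, converted: {converted_ms:.1f} ms")
```

The median is used rather than the mean so that a single slow run (for example, a background process waking up) does not dominate the result.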


# HuggingFace Pipeline API

The previous section showed how to load a model. This section shows the easiest way to use HuggingFace models for inference.

Specifically, we will use the [HuggingFace Pipeline API](https://huggingface.co/docs/transformers/v4.34.0/en/main_classes/pipelines) and its cousin from the [optimum project](https://huggingface.co/docs/optimum/index), whose BetterTransformer integration is a collaboration between HuggingFace and PyTorch that aims to reduce inference latency without degrading prediction quality, for the following tasks:
* [Text Classification](#text-classification)
* [Text Generation](#text-generation)
* [Text Mask Fill - Optimum](#optimum-for-faster-latency)

## Text Classification

In [4]:
# More text classification models: https://huggingface.co/models?pipeline_tag=text-classification&sort=trending
model_name = "SamLowe/roberta-base-go_emotions" 
classifier_pipe = pipeline("text-classification", model=model_name)

In [5]:
sentences = [
    "I am feeling inspired today.",
    "This talk is informative, but a bit high-level, where I can find more details?",
    "I wonder about all the hype around Generative AI, is smoke and mirrors?",
    "Building production machine learning systems is challenging."
]

In [6]:
classifier_pipe(sentences)

[{'label': 'excitement', 'score': 0.24082745611667633},
 {'label': 'admiration', 'score': 0.5622110366821289},
 {'label': 'curiosity', 'score': 0.5444050431251526},
 {'label': 'neutral', 'score': 0.46457335352897644}]
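
The pipeline returns one dictionary per input sentence, in the same order as the inputs, so the predictions can be paired back with the text. The sketch below does this with the values copied from the output above; the `threshold` of 0.5 is an arbitrary cut-off chosen for illustration, not a recommendation from the model card.

```python
sentences = [
    "I am feeling inspired today.",
    "This talk is informative, but a bit high-level, where I can find more details?",
    "I wonder about all the hype around Generative AI, is smoke and mirrors?",
    "Building production machine learning systems is challenging.",
]

# Copied from the classifier output shown above.
predictions = [
    {"label": "excitement", "score": 0.24082745611667633},
    {"label": "admiration", "score": 0.5622110366821289},
    {"label": "curiosity", "score": 0.5444050431251526},
    {"label": "neutral", "score": 0.46457335352897644},
]

# Pair each sentence with its top label and score.
for sentence, pred in zip(sentences, predictions):
    print(f"{pred['label']:<12} {pred['score']:.2f}  {sentence}")

# Keep only the predictions whose score clears the (arbitrary) threshold.
threshold = 0.5
confident = [
    (sentence, pred["label"])
    for sentence, pred in zip(sentences, predictions)
    if pred["score"] >= threshold
]
```

With these scores, only the second and third sentences pass the cut-off; the low top score for the first sentence suggests the model is not strongly committed to any single emotion there.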

## Text Generation

In [7]:
model_name = "bigscience/bloom-560m" # https://huggingface.co/bigscience/bloom-560m
generator = pipeline("text-generation", model=model_name, device_map="auto")

In [8]:
prompt = "The Generative AI World Summit is a"
response = generator(prompt, do_sample=False, max_new_tokens=25)

In [9]:
Markdown(f"""
**Prompt**: {prompt}

**{model_name}'s continuation**: {response[0]['generated_text']}...
""")


**Prompt**: The Generative AI World Summit is a

**bigscience/bloom-560m's continuation**: The Generative AI World Summit is a global conference that brings together the best in AI, Machine Learning, and Artificial Intelligence to discuss the future of AI and how it...
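
Passing `do_sample=False` (with the default single beam) selects greedy decoding: at every step the most probable next token is appended, so the same prompt always yields the same continuation, and `max_new_tokens` caps the number of steps. The toy sketch below illustrates that loop. `NEXT_TOKEN_PROBS` and `greedy_decode` are hypothetical names, and the hand-written probability table stands in for a real language model, which would condition on the whole prefix rather than only the last token.

```python
# Hypothetical next-token probabilities, keyed by the previous token only.
NEXT_TOKEN_PROBS = {
    "a": {"global": 0.6, "local": 0.4},
    "global": {"conference": 0.7, "event": 0.3},
    "local": {"event": 1.0},
    "conference": {"<eos>": 0.9, "that": 0.1},
    "event": {"<eos>": 1.0},
}


def greedy_decode(last_token, max_new_tokens):
    """Repeatedly append the single most probable next token."""
    generated = []
    for _ in range(max_new_tokens):
        candidates = NEXT_TOKEN_PROBS.get(last_token)
        if not candidates:
            break  # the toy table has no entry for this token
        next_token = max(candidates, key=candidates.get)
        if next_token == "<eos>":
            break  # end-of-sequence stops generation early
        generated.append(next_token)
        last_token = next_token
    return generated


continuation = greedy_decode("a", max_new_tokens=25)
print(" ".join(continuation))
```

Generation stops either when the end-of-sequence token is chosen or when the token budget runs out, which is why the BLOOM output above is cut off mid-sentence after 25 new tokens.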


## Optimum for Faster Latency

In [10]:
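# Import under an alias so it does not shadow the transformers `pipeline` imported earlier
from optimum.pipelines import pipeline as optimum_pipeline

model_name = "distilbert-base-uncased"
prompt = "I am attending the Generative AI Summit and I am a practicing [MASK]."

# accelerator="bettertransformer" applies the BetterTransformer conversion when the pipeline is built
unmasked_optimum_pipeline = optimum_pipeline(task="fill-mask", model=model_name, accelerator="bettertransformer")
response = unmasked_optimum_pipeline(prompt)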

The BetterTransformer implementation does not support padding during training, as the fused kernels do not support attention masks. Beware that passing padded batched data during training may result in unexpected outputs. Please refer to https://huggingface.co/docs/optimum/bettertransformer/overview for more details.
  hidden_states = torch._nested_tensor_from_mask(hidden_states, attn_mask)


In [11]:
pd.set_option('display.max_colwidth', 0)
col_mapping = {"score": "Score", "token_str": "Token mask fill", "token": "Token ID", "sequence": "Full generated text"}
pd.DataFrame(response).rename(columns=col_mapping)

| | Score | Token ID | Token mask fill | Full generated text |
|---|---|---|---|---|
| 0 | 0.11081 | 15034 | psychologist | i am attending the generative ai summit and i am a practicing psychologist. |
| 1 | 0.078805 | 13235 | mathematician | i am attending the generative ai summit and i am a practicing mathematician. |
| 2 | 0.051947 | 7992 | buddhist | i am attending the generative ai summit and i am a practicing buddhist. |
| 3 | 0.047386 | 7155 | scientist | i am attending the generative ai summit and i am a practicing scientist. |
| 4 | 0.04567 | 21477 | biologist | i am attending the generative ai summit and i am a practicing biologist. |
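
The scores in this result are probabilities: the fill-mask pipeline takes the model's logits at the `[MASK]` position, applies a softmax over the vocabulary, and returns the highest-scoring tokens (five by default). The five scores above sum to about 0.33, so the rest of the probability mass is spread over the remaining vocabulary. The sketch below shows the softmax-then-rank step on a made-up five-word vocabulary with made-up logits; `softmax` and `top_k_tokens` are illustrative helpers, not functions from transformers or optimum.

```python
import math


def softmax(logits):
    """Convert raw logits into probabilities that sum to 1."""
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_tokens(vocab, logits, k):
    """Return the k most probable (token, probability) pairs, best first."""
    probs = softmax(logits)
    ranked = sorted(zip(vocab, probs), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]


# Hypothetical vocabulary and logits for the masked position.
vocab = ["psychologist", "mathematician", "buddhist", "scientist", "biologist"]
logits = [2.0, 1.5, 1.0, 0.5, 0.0]

top3 = top_k_tokens(vocab, logits, k=3)
for token, prob in top3:
    print(f"{token:<14} {prob:.3f}")
```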
