# `Transformers`

## Quick Try

### Pipeline

In [1]:
# pipeline() is the easiest and fastest way to use a pretrained model for inference.
from transformers import pipeline
import torch

In [None]:
classifier = pipeline("sentiment-analysis")

`pipeline()` downloads and caches a default pretrained model and tokenizer for sentiment analysis.

In [3]:
classifier("I am very happy to use the Transformers library!")

[{'label': 'POSITIVE', 'score': 0.9997723698616028}]

For more than one input, pass the inputs as a list:

In [4]:
results = classifier([
    "I am very happy to use the Transformers library!",
    "I hope it is easy to learn.",
])

for res in results:
    print(f"label: {res['label']}, with score: {round(res['score'], 5)}")

label: POSITIVE, with score: 0.99977
label: POSITIVE, with score: 0.99744


`pipeline()` can also iterate over an entire dataset for any task. For example, for automatic speech recognition:

In [5]:
speech_recognizer = pipeline('automatic-speech-recognition',
                             model='facebook/wav2vec2-base-960h')


Some weights of the model checkpoint at facebook/wav2vec2-base-960h were not used when initializing Wav2Vec2ForCTC: ['wav2vec2.encoder.pos_conv_embed.conv.weight_g', 'wav2vec2.encoder.pos_conv_embed.conv.weight_v']
- This IS expected if you are initializing Wav2Vec2ForCTC from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing Wav2Vec2ForCTC from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of Wav2Vec2ForCTC were not initialized from the model checkpoint at facebook/wav2vec2-base-960h and are newly initialized: ['wav2vec2.encoder.pos_conv_embed.conv.parametrizations.weight.original0', 'wav2vec2.encoder.pos_conv_embed.conv.parametrizations.weight.original1', 'wav2vec2.masked_spec_embed']
You sho


In [None]:
# install datasets if it is not already available
!pip install datasets

In [7]:
# load an example audio dataset
from datasets import load_dataset, Audio

dataset = load_dataset('PolyAI/minds14',
                       name='en-US',
                       split='train')


The repository for PolyAI/minds14 contains custom code which must be executed to correctly load the dataset. You can inspect the repository content at https://hf.co/datasets/PolyAI/minds14.
You can avoid this prompt in future by passing the argument `trust_remote_code=True`.

Do you wish to run the custom code? [y/N] y



Make sure the sampling rate of the dataset matches the sampling rate `facebook/wav2vec2-base-960h` was trained on:

In [8]:
dataset = dataset.cast_column(
    'audio',
    Audio(sampling_rate=speech_recognizer.feature_extractor.sampling_rate),
)
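To see why the sampling rate matters, here is a minimal sketch of resampling by linear interpolation. `resample_linear` is a hypothetical helper written for illustration, and the 8 kHz and 16 kHz rates are example values. It is not the resampler `datasets` uses internally, which is more sophisticated.

```python
def resample_linear(samples, src_rate, dst_rate):
    """Naive resampling: linearly interpolate between neighbouring samples."""
    if not samples:
        return []
    n_out = round(len(samples) * dst_rate / src_rate)
    out = []
    for i in range(n_out):
        pos = i * src_rate / dst_rate        # fractional position in the source signal
        lo = int(pos)
        hi = min(lo + 1, len(samples) - 1)   # clamp at the last sample
        frac = pos - lo
        out.append(samples[lo] * (1 - frac) + samples[hi] * frac)
    return out

# Illustrative values only: a 4-sample ramp at 8 kHz upsampled to 16 kHz
upsampled = resample_linear([0.0, 1.0, 2.0, 3.0], 8000, 16000)
print(len(upsampled), upsampled)
```

The same second of audio holds twice as many samples at 16 kHz as at 8 kHz, so a model fed audio at the wrong rate effectively hears it sped up or slowed down.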

The audio files are automatically loaded and resampled when the `audio` column is accessed.
Extract the raw waveform arrays from the first 4 samples and pass them as a list to the pipeline:

In [9]:
result = speech_recognizer(dataset[:4]['audio'])
print([d['text'] for d in result])

['I WOULD LIKE TO SET UP A JOINT ACCOUNT WITH MY PARTNER HOW DO I PROCEED WITH DOING THAT', "FONDERING HOW I'D SET UP A JOIN TO HELL T WITH MY WIFE AND WHERE THE AP MIGHT BE", "I I'D LIKE TOY SET UP A JOINT ACCOUNT WITH MY PARTNER I'M NOT SEEING THE OPTION TO DO IT ON THE APSO I CALLED IN TO GET SOME HELP CAN I JUST DO IT OVER THE PHONE WITH YOU AND GIVE YOU THE INFORMATION OR SHOULD I DO IT IN THE AP AN I'M MISSING SOMETHING UQUETTE HAD PREFERRED TO JUST DO IT OVER THE PHONE OF POSSIBLE THINGS", 'HOW DO I FURN A JOINA COUT']


#### Use another model and tokenizer in the pipeline

In [10]:
model_name = 'nlptown/bert-base-multilingual-uncased-sentiment'

Use `AutoModelForSequenceClassification` and `AutoTokenizer` to load the pretrained model and its associated tokenizer:

In [None]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

Specify the model and tokenizer in `pipeline()`:

In [12]:
classifier = pipeline('sentiment-analysis',
                      model=model,
                      tokenizer=tokenizer)

classifier("Nous sommes très heureux de vous présenter la bibliothèque 🤗 Transformers.")

[{'label': '5 stars', 'score': 0.7272651791572571}]

### AutoTokenizer

A tokenizer is responsible for preprocessing text into an array of numbers that serve as inputs to a model.

We need to instantiate the tokenizer with the same model name, so that it applies the same tokenization rules the model was pretrained with.

In [13]:
from transformers import AutoTokenizer

model_name = 'nlptown/bert-base-multilingual-uncased-sentiment'
tokenizer = AutoTokenizer.from_pretrained(model_name)

Pass the text to the tokenizer:

In [14]:
encoding = tokenizer("I am very happy to use the Transformers library!")
print(encoding)

{'input_ids': [101, 151, 10345, 12495, 19308, 10114, 11868, 10103, 58263, 13299, 106, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}


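The returned dictionary contains:
* `input_ids` - numerical representations of the tokens.
* `token_type_ids` - the segment each token belongs to (all zeros for a single sentence).
* `attention_mask` - which tokens should be attended to (1) and which should be ignored (0).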

A tokenizer can also accept a list of inputs, pad, and truncate the text to return a batch with uniform length:

In [16]:
pt_batch = tokenizer(
    ["I am very happy to use the Transformers library!", "I hope it is easy to learn."],
    padding=True,
    truncation=True,
    max_length=512,
    return_tensors='pt',
)

for key, value in pt_batch.items():
    print(f"{key}: {value.numpy().tolist()}")

input_ids: [[101, 151, 10345, 12495, 19308, 10114, 11868, 10103, 58263, 13299, 106, 102], [101, 151, 18763, 10197, 10127, 24806, 10114, 34990, 119, 102, 0, 0]]
token_type_ids: [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]
attention_mask: [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0]]
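As a rough illustration of what `padding=True` does, the sketch below pads token-id lists to the longest sequence and builds the matching attention mask. `pad_batch` is a hypothetical helper, not part of `transformers`, and the pad id of 0 is an assumption read off the zeros in the output above.

```python
def pad_batch(sequences, pad_id=0):
    """Right-pad every sequence to the longest one and build the attention mask."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(list(seq) + [pad_id] * n_pad)       # pad ids fill the tail
        attention_mask.append([1] * len(seq) + [0] * n_pad)  # 0 marks padding
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# Token ids copied from the tokenizer output above (12 and 10 tokens)
batch = pad_batch([
    [101, 151, 10345, 12495, 19308, 10114, 11868, 10103, 58263, 13299, 106, 102],
    [101, 151, 18763, 10197, 10127, 24806, 10114, 34990, 119, 102],
])
print(batch["attention_mask"][1])
```

The shorter sentence gains two pad ids, and its mask ends in two zeros so the model ignores those positions.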


### AutoModel

For text (or sequence) classification, use `AutoModelForSequenceClassification`:

In [17]:
from transformers import AutoModelForSequenceClassification

model_name = 'nlptown/bert-base-multilingual-uncased-sentiment'
pt_model = AutoModelForSequenceClassification.from_pretrained(model_name)

Pass the preprocessed batch of inputs directly to the model. Unpack the dictionary by adding `**`:

In [18]:
pt_outputs = pt_model(**pt_batch)

print(pt_outputs)

SequenceClassifierOutput(loss=None, logits=tensor([[-2.6944, -2.8291, -0.9644,  2.0473,  3.4336],
        [-2.4763, -1.2895,  0.9839,  1.6416,  0.8076]],
       grad_fn=<AddmmBackward0>), hidden_states=None, attentions=None)
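The `**` unpacking used above is plain Python: each dictionary key becomes a keyword argument. A minimal sketch with a hypothetical `forward` function standing in for a model's call signature:

```python
def forward(input_ids, attention_mask=None, token_type_ids=None):
    # Hypothetical stand-in for a model's forward signature; it only reports
    # which arguments it received.
    return {
        "n_sequences": len(input_ids),
        "has_mask": attention_mask is not None,
        "has_token_types": token_type_ids is not None,
    }

toy_batch = {
    "input_ids": [[101, 102]],
    "token_type_ids": [[0, 0]],
    "attention_mask": [[1, 1]],
}

unpacked = forward(**toy_batch)
explicit = forward(
    input_ids=toy_batch["input_ids"],
    attention_mask=toy_batch["attention_mask"],
    token_type_ids=toy_batch["token_type_ids"],
)
print(unpacked == explicit)
```

Both calls are equivalent, which is why the tokenizer's output dictionary can be handed to the model without naming each field.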


The model outputs the final activations in the `logits` attribute. Apply the softmax function to the `logits` to retrieve the probabilities:

In [19]:
from torch import nn

pt_predictions = nn.functional.softmax(pt_outputs.logits, dim=-1)

print(pt_predictions)

tensor([[0.0017, 0.0015, 0.0097, 0.1974, 0.7896],
        [0.0081, 0.0264, 0.2562, 0.4946, 0.2148]], grad_fn=<SoftmaxBackward0>)


All `transformers` models output the tensors *before* the final activation function because the final activation is often fused with the loss.
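To make the logits-to-probabilities step concrete, here is a plain-Python softmax applied to the first row of logits printed above. It is a sketch for illustration. In practice `torch.nn.functional.softmax` does the same thing on tensors.

```python
import math

def softmax(logits):
    """Convert raw scores to probabilities that sum to 1."""
    shift = max(logits)                        # subtract the max for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# First row of logits from the model output above
logits = [-2.6944, -2.8291, -0.9644, 2.0473, 3.4336]
probs = softmax(logits)

# Index of the most likely class (0-based)
predicted_class = max(range(len(probs)), key=probs.__getitem__)
print([round(p, 4) for p in probs], predicted_class)
```

The largest logit maps to the largest probability, so taking the argmax of the logits or of the probabilities selects the same class.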

### Save a model

Once a model is fine-tuned, we can save it with its tokenizer using `PreTrainedModel.save_pretrained()`:

In [None]:
pt_save_directory = './pt_save_pretrained'
tokenizer.save_pretrained(pt_save_directory)
pt_model.save_pretrained(pt_save_directory)

In [None]:
# load the saved model again
pt_model = AutoModelForSequenceClassification.from_pretrained("./pt_save_pretrained")

A model can also be saved in one framework and reloaded in the other. The `from_pt` or `from_tf` parameter of `from_pretrained()` converts the model from PyTorch to TensorFlow or from TensorFlow to PyTorch, respectively:

In [None]:
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# assume a TensorFlow model and its tokenizer were saved under this directory
tf_save_directory = "./tf_save_pretrained"

tokenizer = AutoTokenizer.from_pretrained(tf_save_directory)
pt_model = AutoModelForSequenceClassification.from_pretrained(tf_save_directory, from_tf=True)

## Custom model builds

We can modify the model's configuration class to change how a model is built. The configuration specifies a model's attributes, such as the number of hidden layers or attention heads.

Start by importing `AutoConfig`, and then load the configuration of the pretrained model we want to modify. Within `AutoConfig.from_pretrained()`, we can specify the attribute we want to change, such as the number of attention heads (`n_heads`):

In [20]:
from transformers import AutoConfig

my_config = AutoConfig.from_pretrained(
    "distilbert/distilbert-base-uncased",
    n_heads=12,
)


Create a model from our custom configuration with `AutoModel.from_config()`:

In [21]:
from transformers import AutoModel

my_model = AutoModel.from_config(my_config)
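When changing the number of attention heads, keep in mind that multi-head attention typically splits the hidden dimension evenly across heads. The sketch below checks that constraint. `head_dim` is a hypothetical helper, and the hidden size of 768 is an assumption about the `distilbert-base-uncased` configuration, not something stated above.

```python
def head_dim(hidden_size, n_heads):
    """Return the per-head dimension, or raise if the split is uneven."""
    if hidden_size % n_heads != 0:
        raise ValueError(
            f"hidden size {hidden_size} is not divisible by n_heads={n_heads}"
        )
    return hidden_size // n_heads

# Assumed hidden size of 768 with the 12 heads chosen above
per_head = head_dim(768, 12)
print(per_head)
```

A value such as `n_heads=10` would fail this check for a hidden size of 768, so it is worth validating a custom configuration before building a model from it.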

## Trainer - a PyTorch optimized training loop

All models are standard `torch.nn.Module` objects, so we can use them in any typical training loop. `transformers` also provides a `Trainer` class for PyTorch, which contains the basic training loop and adds functionality for features like distributed training, mixed precision, and more.

Depending on our task, we will typically pass the following parameters to Trainer:

1. Start with a PreTrainedModel or a `torch.nn.Module`:

In [22]:
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained("distilbert/distilbert-base-uncased")


Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert/distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


2. TrainingArguments contains the training hyperparameters we can change, such as the learning rate, batch size, and number of epochs to train for.

In [23]:
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./results",
    learning_rate=2e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=2,
)

3. Load a preprocessing class like a tokenizer, image processor, feature extractor, or processor:

In [None]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert/distilbert-base-uncased")

4. Load a dataset:

In [None]:
from datasets import load_dataset

dataset = load_dataset("rotten_tomatoes")

5. Create a function to tokenize the dataset:

In [26]:
def tokenize_dataset(dataset):
    return tokenizer(dataset["text"])

In [None]:
# apply it over the entire dataset with map:
dataset = dataset.map(tokenize_dataset, batched=True)

6. A DataCollatorWithPadding to create a batch of examples from our dataset:

In [28]:
from transformers import DataCollatorWithPadding

data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
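The idea behind a padding collator is dynamic padding: each batch is padded only to its own longest example rather than to a global maximum. A minimal sketch with a hypothetical `collate` function and made-up token ids (not the `DataCollatorWithPadding` implementation):

```python
def collate(features, pad_id=0):
    """Pad one batch of tokenized examples to the longest example in that batch."""
    longest = max(len(f["input_ids"]) for f in features)
    return {
        "input_ids": [
            f["input_ids"] + [pad_id] * (longest - len(f["input_ids"]))
            for f in features
        ],
        "attention_mask": [
            [1] * len(f["input_ids"]) + [0] * (longest - len(f["input_ids"]))
            for f in features
        ],
    }

# Toy tokenized examples of lengths 3, 4, 9 and 10 (token ids are made up)
examples = [{"input_ids": list(range(1, n + 1))} for n in (3, 4, 9, 10)]

# Batches of two: each batch gets its own padded length
batches = [collate(examples[i:i + 2]) for i in range(0, len(examples), 2)]
batch_lengths = [len(b["input_ids"][0]) for b in batches]
print(batch_lengths)
```

The first batch is padded to length 4 and the second to length 10, so short examples are not padded out to the longest example in the whole dataset.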

Gather all these classes in Trainer:

In [30]:
from transformers import Trainer

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=dataset['train'],
    eval_dataset=dataset['test'],
    tokenizer=tokenizer,
    data_collator=data_collator,
)

Whenever ready, call `train()` to start training:

In [31]:
trainer.train()

Step,Training Loss
500,0.4563
1000,0.3886
1500,0.2679
2000,0.2723


TrainOutput(global_step=2134, training_loss=0.3395412895538702, metrics={'train_runtime': 5297.7645, 'train_samples_per_second': 3.22, 'train_steps_per_second': 0.403, 'total_flos': 195974132394480.0, 'train_loss': 0.3395412895538702, 'epoch': 2.0})
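The `global_step=2134` in the output follows from the batch size and epoch count set in `TrainingArguments`. The arithmetic below is a sketch under stated assumptions: the `rotten_tomatoes` train split is taken to have 8,530 examples (not stated above), training runs on a single device, and there is no gradient accumulation. `total_steps` is a hypothetical helper.

```python
import math

def total_steps(n_examples, batch_size, epochs):
    """Optimizer steps for single-device training without gradient accumulation."""
    steps_per_epoch = math.ceil(n_examples / batch_size)  # last batch may be partial
    return steps_per_epoch * epochs

# Assumed 8,530 training examples; batch size 8 and 2 epochs from TrainingArguments
steps = total_steps(n_examples=8530, batch_size=8, epochs=2)
print(steps)
```

Under these assumptions each epoch takes 1,067 steps, which gives the 2,134 steps reported by `trainer.train()`.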

For tasks like translation or summarization that use a sequence-to-sequence model, use the `Seq2SeqTrainer` and `Seq2SeqTrainingArguments` classes instead.