# How to Use Transformers

## Tokenizers

In [1]:
from transformers import AutoTokenizer

In [5]:
# Initialize the tokenizer from a pretrained checkpoint
model_name = "sshleifer/distilbart-cnn-12-6"
tokenizer = AutoTokenizer.from_pretrained(model_name)

In [19]:
# Create inputs
raw_inputs = ["Deep learning is super rad!", "Deep learning is pretty lame"]
inputs = tokenizer(raw_inputs, padding=True, truncation=True, return_tensors="pt")
print(inputs)

{'input_ids': tensor([[    0, 35166,  2239,    16,  2422, 13206,   328,     2],
        [    0, 35166,  2239,    16,  1256, 31411,     2,     1]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 1, 1, 0]])}


In [20]:
# Inspect the tokenizer output for each input
print(f'Tokenizer output for: {raw_inputs[0]}')
print(f"Input ids: {inputs['input_ids'][0]}")
print(f"Attention Mask: {inputs['attention_mask'][0]}")
print("-"*100)
print(f'Tokenizer output for {raw_inputs[1]}')
print(f"Input ids: {inputs['input_ids'][1]}")
print(f"Attention Mask: {inputs['attention_mask'][1]}")

Tokenizer output for: Deep learning is super rad!
Input ids: tensor([    0, 35166,  2239,    16,  2422, 13206,   328,     2])
Attention Mask: tensor([1, 1, 1, 1, 1, 1, 1, 1])
----------------------------------------------------------------------------------------------------
Tokenizer output for Deep learning is pretty lame
Input ids: tensor([    0, 35166,  2239,    16,  1256, 31411,     2,     1])
Attention Mask: tensor([1, 1, 1, 1, 1, 1, 1, 0])


Padding tokens normalize the length of the inputs in a batch.  The second input, "Deep learning is pretty lame", has one fewer token than "Deep learning is super rad!" because the exclamation point is tokenized as its own token.  With `padding=True`, the tokenizer pads the shorter sequence (here with the padding token ID 1) so that both have the same length.  The attention mask tells the model which tokens to attend to: a 1 means "attend to this token" and a 0 means "ignore this token", which is how padding is skipped.
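To make the padding and attention-mask logic concrete, here is a minimal plain-Python sketch. `pad_batch` is a hypothetical helper, not part of the `transformers` API, and it assumes right-padding with a pad ID of 1, as the output above suggests for this tokenizer.

```python
# Hypothetical helper (not part of transformers): pad a batch of token-ID
# lists to a common length and build the matching attention masks.
def pad_batch(batch, pad_id):
    max_len = max(len(ids) for ids in batch)
    input_ids, attention_mask = [], []
    for ids in batch:
        n_pad = max_len - len(ids)
        # Real tokens keep a mask of 1; padded positions get a mask of 0.
        input_ids.append(ids + [pad_id] * n_pad)
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Token IDs copied from the tokenizer output above, with the padding removed.
batch = [
    [0, 35166, 2239, 16, 2422, 13206, 328, 2],
    [0, 35166, 2239, 16, 1256, 31411, 2],
]
padded = pad_batch(batch, pad_id=1)
print(padded["input_ids"][1])       # [0, 35166, 2239, 16, 1256, 31411, 2, 1]
print(padded["attention_mask"][1])  # [1, 1, 1, 1, 1, 1, 1, 0]
```

The result reproduces the second row of `input_ids` and `attention_mask` printed by the tokenizer.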

### Tokenizers Under the Hood

In [21]:
# View individual tokens
tokens = tokenizer.tokenize(raw_inputs[0])
tokens

['Deep', 'Ġlearning', 'Ġis', 'Ġsuper', 'Ġrad', '!']

In [22]:
# View token ids
token_ids = tokenizer.convert_tokens_to_ids(tokens)
token_ids

[35166, 2239, 16, 2422, 13206, 328]

Notice how the token_ids are *almost* the same as the input IDs in the inputs above.  When the tokens are formatted as inputs for a model, the tokenizer (not the model) adds special tokens: a "0" to mark the beginning of the sequence and a "2" to mark the end.

In [23]:
# Get the inputs ready for the model
model_prepped_ids = tokenizer.prepare_for_model(token_ids)
model_prepped_ids

{'input_ids': [0, 35166, 2239, 16, 2422, 13206, 328, 2], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1]}
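The special-token wrapping seen in this output can be mimicked in a few lines of plain Python. This is only an illustrative sketch: `add_special_tokens` is a hypothetical helper, and the IDs 0 and 2 are taken from the output above rather than from the tokenizer's configuration.

```python
# Hypothetical helper: wrap raw token IDs with begin/end-of-sequence IDs,
# mirroring what the output above shows for this tokenizer (0 = start, 2 = end).
def add_special_tokens(token_ids, bos_id=0, eos_id=2):
    return [bos_id] + list(token_ids) + [eos_id]


token_ids = [35166, 2239, 16, 2422, 13206, 328]
input_ids = add_special_tokens(token_ids)
attention_mask = [1] * len(input_ids)  # no padding, so every position is kept
print(input_ids)       # [0, 35166, 2239, 16, 2422, 13206, 328, 2]
print(attention_mask)  # [1, 1, 1, 1, 1, 1, 1, 1]
```

In real code, prefer reading the IDs from `tokenizer.bos_token_id` and `tokenizer.eos_token_id` instead of hard-coding them, since they differ between models.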

In [24]:
# Decode the token_ids
decoded_tokens = tokenizer.decode(token_ids)
decoded_tokens

'Deep learning is super rad!'

## Pipelines

In [25]:
from transformers import pipeline

In [26]:
# Create a classifier pipeline
classifier = pipeline("sentiment-analysis")
classifier(raw_inputs)

No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision 714eb0f (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.
Device set to use mps:0


[{'label': 'POSITIVE', 'score': 0.984260618686676},
 {'label': 'NEGATIVE', 'score': 0.9996439218521118}]

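If you don't specify a model, the pipeline falls back to a default model for the task (here `distilbert/distilbert-base-uncased-finetuned-sst-2-english`), as the warning above notes.  For reproducible results, pass the model name and revision explicitly.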
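One way to avoid relying on the default is to pin the model and revision yourself. The sketch below reuses the model name and revision hash printed in the warning above. `PIPELINE_KWARGS` and `build_classifier` are names introduced here for illustration, and the import is deferred so that nothing is downloaded until the function is called.

```python
# Model name and revision copied from the warning printed by the default pipeline.
PIPELINE_KWARGS = {
    "task": "sentiment-analysis",
    "model": "distilbert/distilbert-base-uncased-finetuned-sst-2-english",
    "revision": "714eb0f",
}


def build_classifier():
    """Create a sentiment pipeline pinned to an explicit model and revision."""
    # Imported lazily so that defining this function has no side effects.
    from transformers import pipeline

    return pipeline(**PIPELINE_KWARGS)
```

Calling `build_classifier()(raw_inputs)` should then behave like the default pipeline above, without the "No model was supplied" warning.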

In [27]:
# Create a text-generation pipeline
text_generator = pipeline("text-generation")
text_generator([
    "I went to the store to buy a",
    "When two objects in space get close together"
])

No model was supplied, defaulted to openai-community/gpt2 and revision 607a30d (https://huggingface.co/openai-community/gpt2).
Using a pipeline without specifying a model name and revision in production is not recommended.


Device set to use mps:0
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[[{'generated_text': 'I went to the store to buy a cup of ice cream," says Joe. "Everybody asked me why I didn\'t dress earlier or on Halloween. I was scared of the thought that I\'d have to wear the same color, the same size,'}],
 [{'generated_text': 'When two objects in space get close together and share the same color to create a background around each other I use the same method defined by the CSS class.\n\nYou can see the two methods defined in the middle of the CSS class definition:\n'}]]

In [28]:
# Create a summarization pipeline
summarizer = pipeline("summarization")
summarizer([
    """A Fibonacci heap is a collection of trees satisfying the min-heap property. It allows faster amortized time for many operations than binary or binomial heaps.
    Trees in a Fibonacci heap can have any shape, which facilitates efficient operations. Lazy strategies are employed: node removals and consolidations are delayed until
    absolutely necessary (like during an extract-min operation). The main advantage lies in decreasing a key and merging two heaps, which are constant and amortized
    constant time, respectively. Nodes have a "mark" indicating if they've lost a child since the last time they were made a child of another node, assisting in
    restructuring during operations."""
])

No model was supplied, defaulted to sshleifer/distilbart-cnn-12-6 and revision a4f8f3e (https://huggingface.co/sshleifer/distilbart-cnn-12-6).
Using a pipeline without specifying a model name and revision in production is not recommended.


Device set to use mps:0

[{'summary_text': ' A Fibonacci heap is a collection of trees satisfying the min-heap property . It allows faster amortized time for many operations than binary or binomial heaps . Nodes have a "mark" indicating if they\'ve lost a child since the last time they were made a child of another node .'}]

## Directly Accessing Pretrained Models

In [30]:
# Load model directly
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased-finetuned-sst-2-english")
model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased-finetuned-sst-2-english")

In [None]:
# View inputs
inputs = tokenizer('I love deep learning', return_tensors='pt')
inputs

{'input_ids': tensor([[ 101, 1045, 2293, 2784, 4083,  102]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1]])}

In [None]:
# View model outputs
outputs = model(**inputs)
outputs

SequenceClassifierOutput(loss=None, logits=tensor([[-4.1975,  4.4937]], grad_fn=<AddmmBackward0>), hidden_states=None, attentions=None)
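The model returns raw logits, not probabilities. A softmax turns them into class probabilities, which is roughly what the `sentiment-analysis` pipeline reports as its score. The sketch below uses plain Python on the logits printed above. The `id2label` mapping is an assumption about this checkpoint (0 = NEGATIVE, 1 = POSITIVE); the authoritative mapping lives in `model.config.id2label`.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    largest = max(logits)
    exps = [math.exp(x - largest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


logits = [-4.1975, 4.4937]                 # copied from the model output above
id2label = {0: "NEGATIVE", 1: "POSITIVE"}  # assumed label mapping for this checkpoint

probs = softmax(logits)
best = max(range(len(probs)), key=probs.__getitem__)
print(id2label[best], round(probs[best], 4))  # POSITIVE 0.9998
```

On tensors, the equivalent call is `torch.softmax(outputs.logits, dim=-1)`.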

## Model Embeddings

In [33]:
from transformers import AutoModel

In [34]:
# Load the base model (AutoModel omits the classification head)
model = AutoModel.from_pretrained("distilbert-base-uncased-finetuned-sst-2-english")

In [35]:
# View outputs from model
inputs = tokenizer('I love deep learning!', padding=True, truncation=True, return_tensors='pt')

outputs = model(**inputs)

print(outputs.last_hidden_state.shape) # the token embeddings: (batch size, tokens, hidden size)

torch.Size([1, 7, 768])


In [36]:
# Mean-pool the token embeddings to get a single vector for the whole sequence
context_vectors = outputs.last_hidden_state.mean(dim=1)
context_vectors.shape

torch.Size([1, 768])
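A plain `.mean(dim=1)` works here because the batch holds a single, unpadded sequence. With padded batches, the padding positions would be averaged in too, so a common variant weights the mean by the attention mask. Here is a minimal plain-Python sketch of that idea for one sequence. `masked_mean_pool` is a hypothetical helper and the vectors are toy values, not real model outputs.

```python
# Hypothetical helper: average the token vectors of one sequence,
# skipping positions whose attention mask is 0 (padding).
def masked_mean_pool(token_vectors, attention_mask):
    kept = [vec for vec, keep in zip(token_vectors, attention_mask) if keep]
    if not kept:
        raise ValueError("attention_mask masks out every token")
    dim = len(kept[0])
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]


# Toy sequence: three tokens with 2-dimensional embeddings; the last is padding.
token_vectors = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
attention_mask = [1, 1, 0]
print(masked_mean_pool(token_vectors, attention_mask))  # [2.0, 3.0]
```

With tensors, the same idea is usually written by multiplying `last_hidden_state` by `attention_mask.unsqueeze(-1)`, summing over the token dimension, and dividing by the number of unmasked tokens.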

## Accessing Model Config & Creating Custom Models

In [37]:
from transformers import GPT2Config, GPT2Model

# Building the config
config = GPT2Config()

In [38]:
print(config)

GPT2Config {
  "activation_function": "gelu_new",
  "attn_pdrop": 0.1,
  "bos_token_id": 50256,
  "embd_pdrop": 0.1,
  "eos_token_id": 50256,
  "initializer_range": 0.02,
  "layer_norm_epsilon": 1e-05,
  "model_type": "gpt2",
  "n_embd": 768,
  "n_head": 12,
  "n_inner": null,
  "n_layer": 12,
  "n_positions": 1024,
  "reorder_and_upcast_attn": false,
  "resid_pdrop": 0.1,
  "scale_attn_by_inverse_layer_idx": false,
  "scale_attn_weights": true,
  "summary_activation": null,
  "summary_first_dropout": 0.1,
  "summary_proj_to_labels": true,
  "summary_type": "cls_index",
  "summary_use_proj": true,
  "transformers_version": "4.51.3",
  "use_cache": true,
  "vocab_size": 50257
}



In [39]:
# Building the model from the config
gpt_model = GPT2Model(config)

The cell above creates a model with the architecture described by the config and randomly initialized weights.  It's untrained, so it won't produce any meaningful output.  The config, however, defines every hyperparameter of the new model, such as the number of layers, the number of attention heads, and the embedding size.
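To see how the config values determine the model's size, the following sketch estimates the parameter count from the numbers printed above. It assumes the standard GPT-2 block layout (a fused query/key/value projection, an output projection, a two-layer MLP, and two layer norms per block, all with biases) and that `n_inner: null` means an MLP width of `4 * n_embd`. `gpt2_param_count` is a hypothetical helper, not a library function.

```python
# Hypothetical helper: estimate the trainable parameter count of a GPT-2 style
# model from its config values, assuming the standard block layout.
def gpt2_param_count(vocab_size, n_positions, n_embd, n_layer, n_inner=None):
    inner = n_inner if n_inner is not None else 4 * n_embd
    # Token embeddings plus learned position embeddings.
    embeddings = vocab_size * n_embd + n_positions * n_embd
    # Each layer norm has a weight and a bias vector.
    layer_norm = 2 * n_embd
    # Fused QKV projection plus the attention output projection (with biases).
    attention = (n_embd * 3 * n_embd + 3 * n_embd) + (n_embd * n_embd + n_embd)
    # Two-layer feed-forward network (with biases).
    mlp = (n_embd * inner + inner) + (inner * n_embd + n_embd)
    block = 2 * layer_norm + attention + mlp
    # All blocks plus the final layer norm.
    return embeddings + n_layer * block + layer_norm


total = gpt2_param_count(vocab_size=50257, n_positions=1024, n_embd=768, n_layer=12)
print(f"{total:,}")  # 124,439,808
```

If these assumptions hold, the figure should agree with `sum(p.numel() for p in gpt_model.parameters())`. Changing `n_layer` or `n_embd` in the config changes the count accordingly.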

## Save Models

In [None]:
# Save the model's config and weights to a local directory
directory_name = 'gpt2_model'
gpt_model.save_pretrained(directory_name)

This saves your model as at least a `config.json` and a `model.safetensors` file in a new directory inside your current working directory.