# 2. Using Transformers

Learning the basics of the `Transformers` library from HuggingFace

| Date | User | Change Type | Remarks |  
| ---- | ---- | ----------- | ------- |
| 18/11/2025   | Martin | Created   | Notebook created to explore Transformers library | 
| 19/11/2025   | Martin | Update   |  Completed basic usage of the Transformers library. Working on Inference Deployment |

# Content

* [Introduction](#introduction)
* [Models](#models)
* [Tokenizers](#tokenizers)
* [Putting it Together](#putting-it-together)
* [Deployments](#deployments)

# Introduction

In [1]:
%load_ext watermark

In [24]:
from transformers import (
  pipeline,
  infer_device,
  AutoTokenizer,
  AutoModel,
  AutoModelForSequenceClassification
)
import torch
from torch.nn import functional as F

In [None]:
device = infer_device()
classifier = pipeline("sentiment-analysis")
classifier([
  "I've been waiting for a HuggingFace course my whole life.",
  "I hate this so much",
  "Dam my glorious king stephen wardell curry"
])

No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision 714eb0f (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.


Device set to use cpu


[{'label': 'POSITIVE', 'score': 0.9598049521446228},
 {'label': 'NEGATIVE', 'score': 0.9995144605636597},
 {'label': 'POSITIVE', 'score': 0.9972779154777527}]

Understanding the tokenizer

In [9]:
checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

In [12]:
raw_inputs = [
  "I've been waiting for a HuggingFace course my whole life.",
  "I hate this so much",
  "Dam my glorious king stephen wardell curry"
]

inputs = tokenizer(raw_inputs, padding=True, truncation=True, return_tensors='pt')
print(inputs)

{'input_ids': tensor([[  101,  1045,  1005,  2310,  2042,  3403,  2005,  1037, 17662, 12172,
          2607,  2026,  2878,  2166,  1012,   102],
        [  101,  1045,  5223,  2023,  2061,  2172,   102,     0,     0,     0,
             0,     0,     0,     0,     0,     0],
        [  101,  5477,  2026, 14013,  2332,  4459,  4829,  5349, 15478,   102,
             0,     0,     0,     0,     0,     0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0]])}


- `input_ids`: Unique identifiers of the tokens in each sentence, padded with `0` to the length of the longest sentence
- `attention_mask`: Marks real tokens with `1` and padding with `0`, so the model ignores the padded positions
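The relationship between padding and the mask can be sketched in plain Python. This is only a toy illustration, not the tokenizer's implementation; `pad_batch` is a hypothetical helper, and the ids are copied from the tokenizer output above with the trailing zeros removed.

```python
# Toy re-implementation of padding; the real work is done by the tokenizer.
PAD_ID = 0  # pad id assumed from the zeros in the tokenizer output above


def pad_batch(batch, pad_id=PAD_ID):
    """Right-pad every sequence to the longest one and build the matching mask."""
    max_len = max(len(seq) for seq in batch)
    input_ids, attention_mask = [], []
    for seq in batch:
        n_pad = max_len - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return input_ids, attention_mask


# Token ids copied from the tokenizer output above, without the padding.
batch = [
    [101, 1045, 1005, 2310, 2042, 3403, 2005, 1037, 17662, 12172,
     2607, 2026, 2878, 2166, 1012, 102],
    [101, 1045, 5223, 2023, 2061, 2172, 102],
    [101, 5477, 2026, 14013, 2332, 4459, 4829, 5349, 15478, 102],
]

input_ids, attention_mask = pad_batch(batch)
for row in attention_mask:
    print(row)
```

The printed rows should line up with the `attention_mask` tensor shown above: ones for real tokens, zeros for padding.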

Getting the pretrained model

- `AutoModel` loads only the base Transformer model, without a task-specific head
- For each input token it returns a high-dimensional vector (the hidden states, also called features) that represents the model's contextual understanding of that input

In [16]:
model = AutoModel.from_pretrained(checkpoint)

sd = model.state_dict()
for k, v in sd.items():
  print(k, v.size())

embeddings.word_embeddings.weight torch.Size([30522, 768])
embeddings.position_embeddings.weight torch.Size([512, 768])
embeddings.LayerNorm.weight torch.Size([768])
embeddings.LayerNorm.bias torch.Size([768])
transformer.layer.0.attention.q_lin.weight torch.Size([768, 768])
transformer.layer.0.attention.q_lin.bias torch.Size([768])
transformer.layer.0.attention.k_lin.weight torch.Size([768, 768])
transformer.layer.0.attention.k_lin.bias torch.Size([768])
transformer.layer.0.attention.v_lin.weight torch.Size([768, 768])
transformer.layer.0.attention.v_lin.bias torch.Size([768])
transformer.layer.0.attention.out_lin.weight torch.Size([768, 768])
transformer.layer.0.attention.out_lin.bias torch.Size([768])
transformer.layer.0.sa_layer_norm.weight torch.Size([768])
transformer.layer.0.sa_layer_norm.bias torch.Size([768])
transformer.layer.0.ffn.lin1.weight torch.Size([3072, 768])
transformer.layer.0.ffn.lin1.bias torch.Size([3072])
transformer.layer.0.ffn.lin2.weight torch.Size([768, 3072])
...

In [17]:
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)

torch.Size([3, 16, 768])


Adding the classification head

In [22]:
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)
outputs = model(**inputs)
print(outputs.logits.shape)

torch.Size([3, 2])


In [26]:
predictions = F.softmax(outputs.logits, dim=-1)
predictions

tensor([[4.0195e-02, 9.5980e-01],
        [9.9951e-01, 4.8549e-04],
        [2.7221e-03, 9.9728e-01]], grad_fn=<SoftmaxBackward0>)

In [27]:
model.config.id2label

{0: 'NEGATIVE', 1: 'POSITIVE'}
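The pipeline result at the start of this section can be recovered from the last two outputs. The sketch below is plain Python with the values copied from above; `to_labels` is a hypothetical helper, not part of the library.

```python
# Values copied from the outputs above (softmax probabilities and id2label).
probs = [
    [4.0195e-02, 9.5980e-01],
    [9.9951e-01, 4.8549e-04],
    [2.7221e-03, 9.9728e-01],
]
id2label = {0: "NEGATIVE", 1: "POSITIVE"}


def to_labels(probs, id2label):
    """Pick the most probable class in each row and attach its label."""
    results = []
    for row in probs:
        best = max(range(len(row)), key=row.__getitem__)
        results.append({"label": id2label[best], "score": row[best]})
    return results


results = to_labels(probs, id2label)
for result in results:
    print(result)
```

The labels and scores should agree with what `pipeline("sentiment-analysis")` returned for the same three sentences, up to rounding.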

---

# Models

The `AutoModel` class provides a simplified way to instantiate any model from a checkpoint

In [10]:
from transformers import AutoModel
from huggingface_hub import notebook_login

In [3]:
model = AutoModel.from_pretrained("bert-base-cased")

Saving and loading a model using `save_pretrained` and `from_pretrained`

`save_pretrained` writes a directory containing 2 files:

1. `config.json`: the attributes needed to build the model architecture, plus metadata
2. `model.safetensors`: the state dict containing the model weights

Both files must be stored in the same folder for `from_pretrained` to load the model
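A quick sanity check before loading could look like the sketch below. `missing_model_files` is a hypothetical helper, and the file names are the two listed above; larger checkpoints may be split into several weight files, which this check does not cover.

```python
from pathlib import Path

# File names taken from the list above (assumption: a single, unsharded weight file).
EXPECTED_FILES = ("config.json", "model.safetensors")


def missing_model_files(directory):
    """Return the expected files that are absent from `directory`."""
    root = Path(directory)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]


# An empty list means both files are present in the folder.
print(missing_model_files("test"))
```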

In [6]:
model.save_pretrained("test")

In [7]:
model_2 = AutoModel.from_pretrained("test")

Pushing a model to HuggingFace Hub

In [14]:
notebook_login()

In [15]:
model.push_to_hub('my-awesome-model')

CommitInfo(commit_url='https://huggingface.co/Minimartzz/my-awesome-model/commit/9b9901d2fbb9a736f7d9f757ba9cd7e3b1eb1c4f', commit_message='Upload model', commit_description='', oid='9b9901d2fbb9a736f7d9f757ba9cd7e3b1eb1c4f', pr_url=None, repo_url=RepoUrl('https://huggingface.co/Minimartzz/my-awesome-model', endpoint='https://huggingface.co', repo_type='model', repo_id='Minimartzz/my-awesome-model'), pr_revision=None, pr_num=None)

In [16]:
# Loading the model from the hub
model = AutoModel.from_pretrained("Minimartzz/my-awesome-model")

---

# Tokenizers

Calling a tokenizer returns a dictionary:

- `input_ids`: numerical representation of tokens
- `token_type_ids`: tell the model which part of the input is sentence A and which is sentence B
- `attention_mask`: indicates which tokens should be attended to and which should not

Calling the tokenizer also adds special tokens, such as `[CLS]` at the start and `[SEP]` at the end, which become visible when the ids are decoded. Which special tokens are added is unique to the specific model used.

`"[CLS] Hello, I'm a single sentence! [SEP]"`

In [17]:
from transformers import AutoTokenizer

In [19]:
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

encoded_input = tokenizer("Hello, I am a basketball player")
print(encoded_input)

decoded_input = tokenizer.decode(encoded_input["input_ids"])
print(decoded_input)

{'input_ids': [101, 8667, 117, 146, 1821, 170, 3163, 1591, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1]}
[CLS] Hello, I am a basketball player [SEP]


In [21]:
# With additional configuration
encoded_add = tokenizer(
  "Hello, I am a basketball player",
  padding=True,
  truncation=True,
  max_length=5,
  return_tensors='pt'
)
encoded_add

{'input_ids': tensor([[ 101, 8667,  117,  146,  102]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1]])}

In [24]:
# 1. Tokenisation
# 2. Input IDs
sequence = "I am a cow, hear me moo"
tokens = tokenizer.tokenize(sequence)
print(tokens)

ids = tokenizer.convert_tokens_to_ids(tokens)
print(ids)

['I', 'am', 'a', 'cow', ',', 'hear', 'me', 'm', '##oo']
[146, 1821, 170, 13991, 117, 2100, 1143, 182, 5658]


In [26]:
tokenizer.decode(ids)

'I am a cow, hear me moo'
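Note that the decoded string above has no `[CLS]` or `[SEP]`: the two-step route (`tokenize`, then `convert_tokens_to_ids`) does not add special tokens, whereas calling the tokenizer directly does. A rough sketch of that extra step, with the ids 101 and 102 read off the earlier `bert-base-cased` output and `wrap_with_special_tokens` as a hypothetical helper:

```python
# Ids read off the earlier output for bert-base-cased:
# the encoded sentence starts with 101 ([CLS]) and ends with 102 ([SEP]).
CLS_ID, SEP_ID = 101, 102


def wrap_with_special_tokens(ids, cls_id=CLS_ID, sep_id=SEP_ID):
    """Wrap a single sequence the way a BERT-style tokenizer does."""
    return [cls_id] + list(ids) + [sep_id]


ids = [146, 1821, 170, 13991, 117, 2100, 1143, 182, 5658]  # "I am a cow, hear me moo"
wrapped = wrap_with_special_tokens(ids)
print(wrapped)
```

For a single sentence, `tokenizer(sequence)["input_ids"]` should give the same wrapped list; sentence pairs and other model families follow different templates.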

## Multiple sequences

Transformer models expect batched input by default, i.e. every tensor sent to the model must have a leading batch dimension, even when the batch holds a single sequence

In [30]:
import torch
from torch.nn import functional as F
from transformers import AutoTokenizer, AutoModelForSequenceClassification

In [None]:
checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)

tokens = tokenizer.tokenize(sequence)
ids = tokenizer.convert_tokens_to_ids(tokens)

x = torch.tensor([ids])
out = model(x)

print("Logits:", out.logits)
print("Probs:", F.softmax(out.logits, dim=-1))

Logits:  tensor([[ 3.5694, -2.8733]], grad_fn=<AddmmBackward0>)
Probs: tensor([[0.9984, 0.0016]], grad_fn=<SoftmaxBackward0>)
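For reference, the softmax step can be reproduced by hand. The following is a minimal pure-Python sketch (the `softmax` function here is a stand-in for `F.softmax`, written only for illustration), using the logits printed above.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    shift = max(logits)  # subtracting the max avoids overflow in exp()
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


logits = [3.5694, -2.8733]  # copied from the output above
probs = softmax(logits)
print([round(p, 4) for p in probs])  # → [0.9984, 0.0016]
```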


In [33]:
# For padding use tokenizer.pad_token_id
tokenizer.pad_token_id

0

In [34]:
# Attention mask
batched_ids = [
  [200, 200, 200],
  [200, 200, tokenizer.pad_token_id],
]

attention_mask = [
  [1, 1, 1],
  [1, 1, 0],
]

outputs = model(torch.tensor(batched_ids), attention_mask=torch.tensor(attention_mask))
print(outputs.logits)

tensor([[ 1.5694, -1.3895],
        [ 0.5803, -0.4125]], grad_fn=<AddmmBackward0>)
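Conceptually, the mask removes padded positions from the attention softmax. The toy sketch below shows the idea on made-up scores; it is not DistilBERT's actual code (real implementations typically add a large negative number to the scores), and `masked_softmax` is a hypothetical helper that assumes at least one unmasked position.

```python
import math


def masked_softmax(scores, mask):
    """Softmax over `scores`, giving zero weight wherever `mask` is 0."""
    # Masked positions become -inf so that exp() maps them to exactly 0.
    masked = [s if m == 1 else float("-inf") for s, m in zip(scores, mask)]
    shift = max(masked)
    exps = [math.exp(s - shift) for s in masked]
    total = sum(exps)
    return [e / total for e in exps]


scores = [2.0, 1.0, 3.0]  # made-up attention scores for one query token
mask = [1, 1, 0]          # same pattern as the second row of attention_mask above
weights = masked_softmax(scores, mask)
print(weights)
```

The last weight is zero even though its raw score was the largest, which is why the padded token cannot influence the result.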


---

# Putting it Together

In [35]:
checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)
sequences = ["What do you want from me?!", "This food is p mehh", "That was a sick game!"]

tokens = tokenizer(
  sequences,
  padding=True,
  truncation=True,
  return_tensors='pt'
)

output = model(**tokens)

In [39]:
F.softmax(output.logits, dim=-1)

tensor([[9.7726e-01, 2.2741e-02],
        [1.1789e-02, 9.8821e-01],
        [9.9968e-01, 3.2245e-04]], grad_fn=<SoftmaxBackward0>)

---

# Deployments

Testing ways to call models served through different deployment methods

1. TGI
2. vLLM
3. llama.cpp

Refer here for more details on sample deployments: https://huggingface.co/learn/llm-course/en/chapter2/8

## 1. TGI

TGI model hosted in a Docker container

```
docker run --platform linux/amd64 \
  --shm-size 1g \
  -p 8080:80 \
  -v ~/.cache/huggingface:/data \
  ghcr.io/huggingface/text-generation-inference:latest \
  --model-id HuggingFaceTB/SmolLM2-360M-Instruct
```

In [1]:
from huggingface_hub import InferenceClient

In [None]:
# Initialise client to point at TGI endpoint
client = InferenceClient(
  model="http://localhost:8080"
)

In [None]:
# Text generation task
response = client.text_generation(
  "Tell me a story",
  max_new_tokens=100,
  temperature=0.7,
  top_p=0.9,
  details=True,
  stop_sequences=[]
)
print(response.generated_text)

In [None]:
# Chat task
response = client.chat_completion(
  messages=[
    {"role": "system", "content": "You are a helpful assistant"},
    {"role": "user", "content": "Tell me a story"}
  ],
  max_tokens=100,
  temperature=0.7,
  top_p=0.9
)
print(response.choices[0].message.content)
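Under the hood, the chat call is an HTTP POST. The sketch below builds, but does not send, an equivalent request with the standard library. It assumes the TGI container exposes the OpenAI-compatible `/v1/chat/completions` route on `localhost:8080`; `build_chat_request` is a hypothetical helper and `"tgi"` is a placeholder model name.

```python
import json
from urllib.request import Request


def build_chat_request(base_url, messages, **params):
    """Build (but do not send) a POST request for an OpenAI-style chat route."""
    # "tgi" is a placeholder (assumption): the server already knows its model.
    payload = {"model": "tgi", "messages": messages, **params}
    return Request(
        url=f"{base_url.rstrip('/')}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_chat_request(
    "http://localhost:8080",
    messages=[
        {"role": "system", "content": "You are a helpful assistant"},
        {"role": "user", "content": "Tell me a story"},
    ],
    max_tokens=100,
    temperature=0.7,
    top_p=0.9,
)
print(request.full_url)
print(json.loads(request.data))
```

Passing `request` to `urllib.request.urlopen` would send it, which only works while the container is running.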

## 2. vLLM

## 3. llama.cpp

Deployment requires installing and building llama.cpp before its server can be started

In [None]:
# Initialize client pointing to llama.cpp server
client = InferenceClient(
  model="http://localhost:8080/v1",  # URL to the llama.cpp server
  token="sk-no-key-required",  # llama.cpp server requires this placeholder
)

In [None]:
%watermark