# 5. Transformers on Hugging Face

### About this notebook

This notebook was used in the 50.039 Deep Learning course at the Singapore University of Technology and Design.

**Author:** Matthieu DE MARI (matthieu_demari@sutd.edu.sg)

**Version:** 1.0 (19/01/2025)

**Requirements:**
- Python 3 (tested on v3.11.4)
- Numpy (tested on v1.25.2)
- Torch (tested on v2.0.1+cu118)
- Torchvision (tested on v0.15.2+cu118)
- Transformers (tested on v4.48)
- We also strongly recommend setting up CUDA on your machine! (At this point, honestly, it is almost mandatory).

### Imports and CUDA

In [1]:
from transformers import BertTokenizer, BertModel, logging
import torch
from numpy.linalg import norm
import numpy as np
logging.set_verbosity_info()

### Downloading a pre-trained transformer model

We will be using a pre-trained BERT model to produce embeddings. Note that, in practice, the word embedding procedure is usually split into two components, a tokenizer and a model, as shown below.

The tokenizer is responsible for preprocessing text input so it can be understood by the model. The tokenizer will:
- Split the input text into smaller units called "tokens". For BERT, this includes breaking words down into subwords, or word pieces (e.g., "unhappiness" could decompose into pieces such as ["un", "##happi", "##ness"], where "##" marks a piece that continues a word; the exact split depends on the model's vocabulary).
- Add special tokens such as separators (to separate sentences or mark the end of a sequence).
- Convert tokens into numerical IDs based on the model's vocabulary.

This breakdown will be explored in more detail in the NLP Term 7 course.
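The three tokenizer steps above can be sketched without downloading anything. The snippet below is a minimal illustration only: `TOY_VOCAB`, `wordpiece_split` and `toy_encode` are hypothetical names, the vocabulary is made up (it is not BERT's real `vocab.txt`), and the greedy longest-match-first split is a simplified version of the WordPiece idea that ignores punctuation handling and other details of the real tokenizer.

```python
# Toy vocabulary (hypothetical, NOT BERT's real vocabulary).
# "##" marks a piece that continues a word rather than starting one.
TOY_VOCAB = ["[PAD]", "[UNK]", "[CLS]", "[SEP]",
             "un", "##happi", "##ness", "cat", "##s"]
TOKEN_TO_ID = {token: idx for idx, token in enumerate(TOY_VOCAB)}


def wordpiece_split(word, vocab):
    """Greedy longest-match-first split of one word into sub-word pieces."""
    pieces, start = [], 0
    while start < len(word):
        end, match = len(word), None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            # No piece fits: the whole word becomes the unknown token.
            return ["[UNK]"]
        pieces.append(match)
        start = end
    return pieces


def toy_encode(text):
    """Step 1: split into tokens, step 2: add special tokens, step 3: map to IDs."""
    tokens = ["[CLS]"]
    for word in text.lower().split():
        tokens.extend(wordpiece_split(word, TOKEN_TO_ID))
    tokens.append("[SEP]")
    ids = [TOKEN_TO_ID[token] for token in tokens]
    return tokens, ids


tokens, ids = toy_encode("Unhappiness cats")
print(tokens)
print(ids)
```

With the real tokenizer, the same three steps happen inside a single call, and the pieces you get depend on the vocabulary shipped with the chosen model.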

The model is the neural network itself. It processes the tokenized input and produces meaningful numerical representations (our word embeddings) of the text. The model will:
- receive the tokenized output from the tokenizer,
- produce contextual embeddings for each token, capturing semantic and syntactic meaning,
- and as a result produce vector representations of the entire input.

Both are necessary: the tokenizer converts raw text into the format and numerical representation that the model expects, and the model then processes this representation to generate embeddings or predictions.

In [6]:
# Create and load pre-trained BERT model and tokenizer
# (Ignore the warning, if any.)
# (This might take a while the first time you run it,
# as the model has to be downloaded from Hugging Face and it is about 400MB!)
model_name = 'bert-base-uncased'
tokenizer = BertTokenizer.from_pretrained(model_name)
model = BertModel.from_pretrained(model_name)

loading file vocab.txt from cache at C:\Users\matth\.cache\huggingface\hub\models--bert-base-uncased\snapshots\86b5e0934494bd15c9632b12f734a8a67f723594\vocab.txt
loading file added_tokens.json from cache at None
loading file special_tokens_map.json from cache at None
loading file tokenizer_config.json from cache at C:\Users\matth\.cache\huggingface\hub\models--bert-base-uncased\snapshots\86b5e0934494bd15c9632b12f734a8a67f723594\tokenizer_config.json
loading file tokenizer.json from cache at C:\Users\matth\.cache\huggingface\hub\models--bert-base-uncased\snapshots\86b5e0934494bd15c9632b12f734a8a67f723594\tokenizer.json
loading file chat_template.jinja from cache at None
loading configuration file config.json from cache at C:\Users\matth\.cache\huggingface\hub\models--bert-base-uncased\snapshots\86b5e0934494bd15c9632b12f734a8a67f723594\config.json
Model config BertConfig {
  "_name_or_path": "bert-base-uncased",
  "architectures": [
    "BertForMaskedLM"
  ],
  "attention_probs_dropout_p

### Using our pre-trained model to generate some word embeddings

Below, we will simply use our pre-trained model to generate word embeddings for three words: cat, kitten and university.

In [40]:
# Define the words to get embeddings for
words = ["cat", "kitten", "university"]
# List collecting one embedding vector (NumPy array) per word
embeddings = []

# Process each word individually
for word in words:
    # Tokenize first
    inputs = tokenizer(word, return_tensors = "pt", add_special_tokens = False)
    # Compute embedding second
    with torch.no_grad():
        outputs = model(**inputs)
    # BERT outputs one embedding per token, so average over the token
    # dimension to obtain a single vector for the word
    embedding = outputs.last_hidden_state.mean(dim = 1)
    embeddings.append(embedding.squeeze().numpy())
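The pooling step in the cell above deserves a closer look: a word may be split into several tokens, so `last_hidden_state` has shape (batch, tokens, hidden), and averaging over the token dimension collapses it into one vector per word. The sketch below reproduces only that step with NumPy on made-up numbers. The array is a stand-in for the model output (an assumption for illustration), with a hidden size of 4 instead of the 768 used by `bert-base-uncased`.

```python
import numpy as np

# Stand-in for outputs.last_hidden_state with shape (batch=1, tokens=3, hidden=4).
# The values are made up; the real model would return hidden=768.
last_hidden_state = np.array([[[1.0, 2.0, 3.0, 4.0],
                               [3.0, 2.0, 1.0, 0.0],
                               [2.0, 2.0, 2.0, 2.0]]])

# Average over the token axis: (1, 3, 4) -> (1, 4)
pooled = last_hidden_state.mean(axis=1)

# Drop the batch axis to get a flat vector: (1, 4) -> (4,)
word_vector = pooled.squeeze()

print(pooled.shape, word_vector.shape)
print(word_vector)
```

Mean pooling is only one possible choice. It is used here because it yields a fixed-size vector regardless of how many tokens the word was split into.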

### Checking the embeddings and their similarities

We can now inspect the embeddings (by printing the vectors) and, more importantly, confirm that the word embeddings preserve similarity. We do so by computing the cosine similarity between all pairs of words, and observe a high similarity between the words "cat" and "kitten", compared with the other pairs of words.

In [41]:
# Helper function to check similarity between the embeddings
def cosine_similarity(vec1, vec2):
    return np.dot(vec1, vec2)/(norm(vec1)*norm(vec2))
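To build some intuition for the values this helper returns, it can be tried on small hand-made vectors. The sketch below repeats the helper so that it runs on its own. The 2D vectors are made up for illustration and have nothing to do with BERT embeddings.

```python
import numpy as np
from numpy.linalg import norm


def cosine_similarity(vec1, vec2):
    # Dot product divided by the product of the vector lengths
    return np.dot(vec1, vec2) / (norm(vec1) * norm(vec2))


a = np.array([1.0, 0.0])

# Same direction, different length: similarity ignores the magnitude
same_direction = cosine_similarity(a, np.array([3.0, 0.0]))
# Perpendicular vectors
orthogonal = cosine_similarity(a, np.array([0.0, 2.0]))
# Opposite direction
opposite = cosine_similarity(a, np.array([-1.0, 0.0]))
# 45 degrees apart
diagonal = cosine_similarity(a, np.array([1.0, 1.0]))

print(f"same direction: {same_direction:.4f}")
print(f"orthogonal:     {orthogonal:.4f}")
print(f"opposite:       {opposite:.4f}")
print(f"45 degrees:     {diagonal:.4f}")
```

The result always lies between -1 and 1 and depends only on the angle between the two vectors, not on their lengths, which is why it is a convenient measure for comparing embeddings.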

In [42]:
# Compute the cosine similarities between some pairs of words to check embedding similarities
similarity_cat_kitten = cosine_similarity(embeddings[0], embeddings[1])
similarity_cat_university = cosine_similarity(embeddings[0], embeddings[2])
similarity_kitten_university = cosine_similarity(embeddings[1], embeddings[2])
print(f"Cosine similarity between 'cat' and 'kitten': {similarity_cat_kitten:.4f}")
print(f"Cosine similarity between 'cat' and 'university': {similarity_cat_university:.4f}")
print(f"Cosine similarity between 'kitten' and 'university': {similarity_kitten_university:.4f}")

Cosine similarity between 'cat' and 'kitten': 0.8756
Cosine similarity between 'cat' and 'university': 0.4423
Cosine similarity between 'kitten' and 'university': 0.4332
