# Notebook overview

This notebook explores tokenization and embeddings with Hugging Face transformers (BERT/DistilBERT) and PyTorch. It demonstrates how to inspect token ids, embedding tables, raw input embeddings, and contextual embeddings, and how to build simple sentence embeddings.

## High-level flow
- Check PyTorch/CUDA availability and GPU name (cell 1).
- Tokenize a long domain-specific string with `BertTokenizerFast` and inspect tokens/ids (cell 2). The variables `text`, `tok`, and `out` come from here:
    - `text`: the input string
    - `tok`: the loaded `BertTokenizerFast`
    - `out`: the tokenization result (`BatchEncoding` with `input_ids`)
- Load `DistilBertModel` and access its embedding weight matrices (word & position) (cell 3).
- Inspect single elements of the embedding matrices (cells 4 and 5).
- Load tokenizer + DistilBERT, move model to device, create a small batch of example texts, and tokenize them into `enc` tensors on device (cell 6).
- Show tokenization for the first example text (cell 7).
- Run a forward pass to get `last_hidden_state` (contextual per-token embeddings), compute attention-masked mean pooling to get sentence embeddings, retrieve raw input embeddings via `model.get_input_embeddings()(input_ids)`, and map ids back to token strings (cell 8). Example: print token strings and inspect a particular token embedding.
- Demonstrate how to compute raw input embedding sum h0 = word_emb + pos_emb and compare it to `model.embeddings(...)` which includes LayerNorm+Dropout (cell 9).

## Key concepts shown
- Tokenization → input ids, special tokens ([CLS], [SEP], [PAD]).
- Embedding tables:
    - Word embeddings: lookup of token ids → vectors.
    - Position embeddings: absolute position ids → vectors.
- Raw input embeddings (word + position) vs. contextual embeddings (outputs of Transformer layers).
- Masked mean pooling to obtain fixed-size sentence embeddings.
- Moving tensors and model to GPU if available.

## How to reuse
- `tok` and `out` are available after the tokenization cell and can be reused directly. Note that the later h0 cell rebinds `tok` to a `DistilBertTokenizerFast`.
- `enc` produced in cell 6 contains tensors used for forward passes.
- `model` (DistilBertModel) and its embedding tables are available after their respective cells; avoid re-importing or reloading unless you intend to overwrite them.

## Understanding BERT Embeddings
This notebook explores how tokenization and embeddings work with Hugging Face transformers (BERT/DistilBERT) and PyTorch.

It shows how to inspect token ids, embedding tables, raw input embeddings, and contextual embeddings, and how to build simple sentence embeddings with PyTorch.


In [2]:

import torch
from transformers import BertTokenizerFast, DistilBertTokenizerFast, DistilBertModel




In [3]:
## Check PyTorch/CUDA availability and GPU name
## This will print the installed PyTorch version, CUDA availability, and GPU name if available.
 
print("PyTorch:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
print("Torch CUDA runtime:", torch.version.cuda)
print("cuDNN:", torch.backends.cudnn.version())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))

PyTorch: 2.5.1
CUDA available: True
Torch CUDA runtime: 12.4
cuDNN: 90100
GPU: NVIDIA GeForce RTX 4070


### BERT Tokenization

- Tokenize a string with `BertTokenizerFast` and inspect tokens/ids.    
    - `text`: the input string
    - `tok`: the loaded `BertTokenizerFast`
    - `out`: the tokenization result (`BatchEncoding` with `input_ids`)

In [4]:
# Bert Tokenization
tok = BertTokenizerFast.from_pretrained("bert-base-uncased")
text = "COL cictt_make VAL AERO COMMANDER COL cictt_model VAL 690 COL cictt_series VAL UNDESIGNATED SERIES\tCOL make VAL CE COL model VAL 180 COL series VAL 180"
out = tok(text, return_attention_mask=False, return_token_type_ids=False)
print("Input Text:", text)
print("Tokens:",tok.convert_ids_to_tokens(out["input_ids"]))
print("Token IDs:", out["input_ids"])
print("Number of tokens:", len(out["input_ids"]))

Input Text: COL cictt_make VAL AERO COMMANDER COL cictt_model VAL 690 COL cictt_series VAL UNDESIGNATED SERIES	COL make VAL CE COL model VAL 180 COL series VAL 180
Tokens: ['[CLS]', 'col', 'ci', '##ct', '##t', '_', 'make', 'val', 'aero', 'commander', 'col', 'ci', '##ct', '##t', '_', 'model', 'val', '690', 'col', 'ci', '##ct', '##t', '_', 'series', 'val', 'und', '##es', '##ign', '##ated', 'series', 'col', 'make', 'val', 'ce', 'col', 'model', 'val', '180', 'col', 'series', 'val', '180', '[SEP]']
Token IDs: [101, 8902, 25022, 6593, 2102, 1035, 2191, 11748, 18440, 3474, 8902, 25022, 6593, 2102, 1035, 2944, 11748, 28066, 8902, 25022, 6593, 2102, 1035, 2186, 11748, 6151, 2229, 23773, 4383, 2186, 8902, 2191, 11748, 8292, 8902, 2944, 11748, 8380, 8902, 2186, 11748, 8380, 102]
Number of tokens: 43
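The `##` pieces in the output above come from WordPiece, which splits an out-of-vocabulary word into the longest known prefix followed by continuation pieces. The following is a simplified sketch of that greedy longest-match-first idea, not the tokenizer's actual implementation. The helper `wordpiece` and the tiny `toy_vocab` are hypothetical, and the real tokenizer also lowercases the text and splits on whitespace and punctuation (such as `_`) before this step.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one pre-tokenized, lowercased word."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink from the right.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a ## prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no piece matches: the whole word becomes unknown
        pieces.append(piece)
        start = end
    return pieces


# Hypothetical toy vocabulary; the real bert-base-uncased vocabulary has 30522 entries.
toy_vocab = {"ci", "##ct", "##t", "und", "##es", "##ign", "##ated", "series"}

pieces_cictt = wordpiece("cictt", toy_vocab)
pieces_undesignated = wordpiece("undesignated", toy_vocab)
pieces_unknown = wordpiece("xyz", toy_vocab)

print(pieces_cictt)         # ['ci', '##ct', '##t']
print(pieces_undesignated)  # ['und', '##es', '##ign', '##ated']
print(pieces_unknown)       # ['[UNK]']
```

With this toy vocabulary the splits for `cictt` and `undesignated` line up with the pieces printed by `BertTokenizerFast` above.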


### Load pre-trained DistilBERT model (distilbert-base-uncased)
- Embedding tables:
    - `Ew` - Word embeddings: lookup of token ids → vectors (30522 token ids, each mapped to 768 dimensions).
    - `Ep` - Position embeddings: absolute position ids → vectors (512 positions, each mapped to 768 dimensions).

In [5]:
# Load `DistilBertModel` and inspect embedding weight matrices
m = DistilBertModel.from_pretrained("distilbert-base-uncased")
# Embedding weight matrices (word)
Ew = m.embeddings.word_embeddings.weight        # [30522, 768]
# Embedding weight matrices (position)
Ep = m.embeddings.position_embeddings.weight    # [512, 768]
print("Shape Ew:", Ew.shape, ", Shape Ep:", Ep.shape)

Shape Ew: torch.Size([30522, 768]) , Shape Ep: torch.Size([512, 768])


In [6]:
# Inspect single elements of the embedding matrices
Ew[0,0]

tensor(-0.0166, grad_fn=<SelectBackward0>)

In [7]:
# Inspect single elements of the embedding matrices
Ep[0,0]

tensor(0.0175, grad_fn=<SelectBackward0>)
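An embedding table is a plain matrix, and looking up a token id selects one row of it. The sketch below illustrates this with a small, randomly initialized `torch.nn.Embedding` rather than the DistilBERT weights; the sizes and ids are made up for illustration.

```python
import torch

torch.manual_seed(0)

# Toy sizes for illustration only (DistilBERT's word table is 30522 x 768).
vocab_size, hidden = 10, 4
emb = torch.nn.Embedding(vocab_size, hidden)

# Hypothetical token ids, shape [B=1, L=4]; id 5 appears twice.
ids = torch.tensor([[1, 5, 5, 2]])

looked_up = emb(ids)       # module call: [1, 4, 4]
indexed = emb.weight[ids]  # plain row indexing into the weight matrix

print(looked_up.shape)
print(torch.equal(looked_up, indexed))                # the two routes agree
print(torch.equal(looked_up[0, 1], looked_up[0, 2]))  # same id, same vector
```

Because the lookup ignores position and context, the two occurrences of id 5 get identical vectors at this stage.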

### Load tokenizer and model
Load tokenizer + DistilBERT, move model to device, create a small batch of example texts, and tokenize them into `enc` tensors on device

In [None]:
# Load tokenizer + DistilBERT, move model to device, create a small batch of example texts, and tokenize them into `enc` tensors on device

# 1) Load tokenizer + model
tokenizer = DistilBertTokenizerFast.from_pretrained("distilbert-base-uncased")
model = DistilBertModel.from_pretrained("distilbert-base-uncased")
model.eval()

# Check GPU:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

# 2) Example text (can be a list for batching)
texts = [
    "COL cictt_make VAL AERO COMMANDER COL cictt_model VAL 690 COL cictt_series VAL UNDESIGNATED SERIES\tCOL make VAL CE COL model VAL 180 COL series VAL 180", 
    "COL cictt_make VAL BOEING COL cictt_model VAL 737 COL cictt_series VAL 7V3	COL make VAL DH COL model VAL 104 COL series VAL 7A"
]

# 3) Tokenize
enc = tokenizer(
    texts,
    return_tensors="pt",
    padding=True,
    truncation=True,
    max_length=512
)
enc = {k: v.to(device) for k, v in enc.items()}  # move to device


# Inspect batch size B, sequence length L, hidden size H
input_ids = enc["input_ids"].to(device)          # [B, L]
B, L = input_ids.shape
H = model.config.dim                              # 768

print("Input shape:", input_ids.shape)
print("B (batch size): ", B)
print("L (sequence length in tokens): ", L)
print("H: (hidden size)", H)


Input shape: torch.Size([2, 43])
B (batch size):  2
L (sequence length in tokens):  43
H: (hidden size) 768
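With `padding=True`, shorter sequences are padded on the right to the length of the longest one in the batch, and the attention mask marks real tokens with 1 and padding with 0. A minimal pure-Python sketch of that idea follows. The helper `pad_batch` and the token ids are hypothetical, and this is not how the tokenizer implements it internally; the pad id of 0 is assumed to match the `[PAD]` id used by these checkpoints.

```python
def pad_batch(batch_ids, pad_id=0):
    """Right-pad each id list to the longest length and build attention masks."""
    max_len = max(len(ids) for ids in batch_ids)
    input_ids, attention_mask = [], []
    for ids in batch_ids:
        n_pad = max_len - len(ids)
        input_ids.append(ids + [pad_id] * n_pad)
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return input_ids, attention_mask


# Hypothetical ids: 101 and 102 stand in for [CLS] and [SEP].
batch = [[101, 11, 12, 13, 102], [101, 21, 102]]
padded_ids, mask = pad_batch(batch)

print(padded_ids)  # [[101, 11, 12, 13, 102], [101, 21, 102, 0, 0]]
print(mask)        # [[1, 1, 1, 1, 1], [1, 1, 1, 0, 0]]
```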


In [None]:
# Run a forward pass to get `last_hidden_state` (contextual per-token embeddings)

# 4) Forward pass → contextual token embeddings (last hidden state)
#    Shape: [batch_size, seq_len, hidden_size]
with torch.no_grad():
    outputs = model(**enc)
last_hidden = outputs.last_hidden_state  # contextual embeddings

print("Last Hidden shape:", last_hidden.shape)

Last Hidden shape: torch.Size([2, 43, 768])


In [None]:
# Compute attention-masked mean pooling to get sentence embeddings, 
# retrieve raw input embeddings via `model.get_input_embeddings()(input_ids)`, and 
# map ids back to token strings
# Print token strings and inspect a particular token embedding.


# 5) Build a sentence embedding (attention-masked mean pooling)
#    This averages only over real tokens (excludes padding).
attn = enc["attention_mask"].unsqueeze(-1)            # [B, L, 1]
summed = (last_hidden * attn).sum(dim=1)              # [B, H]
counts = attn.sum(dim=1).clamp(min=1)                 # [B, 1]
sentence_embeddings = summed / counts                 # [B, H]

# 6) Optional: raw input embeddings (lookup table before Transformer layers)
#    These are *not* contextual; they’re the embedding matrix rows for your tokens.
with torch.no_grad():
    input_embeds = model.get_input_embeddings()(enc["input_ids"])  # [B, L, H]

# 7) Map back to tokens if you want to line up embeddings with strings:
tokens = [tokenizer.convert_ids_to_tokens(ids) for ids in enc["input_ids"].tolist()]

# ---- Examples of how to use the results ----
print("Tokens for item 0:\n", tokens[0])
print("Tokens for item 1:\n", tokens[1])
print("Per-token contextual embeddings shape:", last_hidden.shape)
print("Sentence embedding shape:", sentence_embeddings.shape)

# Access the embedding for, say, the token at index 10 in the sequence (0-based):
b, t = 0, 10
print("Token[10]:", tokens[b][t])
print("Embedding vector (last hidden) shape:", last_hidden[b, t].shape)

Tokens for item 0:
 ['[CLS]', 'col', 'ci', '##ct', '##t', '_', 'make', 'val', 'aero', 'commander', 'col', 'ci', '##ct', '##t', '_', 'model', 'val', '690', 'col', 'ci', '##ct', '##t', '_', 'series', 'val', 'und', '##es', '##ign', '##ated', 'series', 'col', 'make', 'val', 'ce', 'col', 'model', 'val', '180', 'col', 'series', 'val', '180', '[SEP]']
Tokens for item 1:
 ['[CLS]', 'col', 'ci', '##ct', '##t', '_', 'make', 'val', 'boeing', 'col', 'ci', '##ct', '##t', '_', 'model', 'val', '737', 'col', 'ci', '##ct', '##t', '_', 'series', 'val', '7', '##v', '##3', 'col', 'make', 'val', 'dh', 'col', 'model', 'val', '104', 'col', 'series', 'val', '7', '##a', '[SEP]', '[PAD]', '[PAD]']
Per-token contextual embeddings shape: torch.Size([2, 43, 768])
Sentence embedding shape: torch.Size([2, 768])
Token[10]: col
Embedding vector (last hidden) shape: torch.Size([768])
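To see why the mean pooling above multiplies by the attention mask, it helps to run the same arithmetic on a tiny hand-made tensor. The sketch below uses made-up values in place of `last_hidden`, with one padded position holding deliberately large numbers, and compares masked pooling with a naive mean over all positions.

```python
import torch

# Toy stand-in for last_hidden, shape [B=2, L=3, H=2].
# The last position of item 0 is padding and holds junk values.
hidden = torch.tensor([
    [[1.0, 1.0], [3.0, 3.0], [100.0, 100.0]],
    [[2.0, 4.0], [4.0, 6.0], [6.0, 8.0]],
])
mask = torch.tensor([[1, 1, 0], [1, 1, 1]])  # [B, L], 0 marks padding

m = mask.unsqueeze(-1).to(hidden.dtype)         # [B, L, 1]
summed = (hidden * m).sum(dim=1)                # padded positions contribute 0
counts = m.sum(dim=1).clamp(min=1)              # number of real tokens per item
masked_mean = summed / counts                   # [B, H]

naive_mean = hidden.mean(dim=1)                 # averages padding too

print(masked_mean)
print(naive_mean)
```

For item 0 the masked mean averages only the two real tokens, while the naive mean is pulled toward the padded values. For item 1, which has no padding, both agree.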


In [16]:


# Demonstrate how to compute raw input embedding sum h0 = word_emb + pos_emb and compare it to `model.embeddings(...)` which includes LayerNorm+Dropout (cell 9).

# 1) Load tokenizer + model
tok = DistilBertTokenizerFast.from_pretrained("distilbert-base-uncased")
model = DistilBertModel.from_pretrained("distilbert-base-uncased").eval()

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

# 2) Example inputs (batch of 2)
texts = [
    "hello world",
    "Smoke was detected in the cabin on a Boeing 737."
]
enc = tok(texts, return_tensors="pt", padding=True, truncation=True, max_length=512)
input_ids = enc["input_ids"].to(device)          # [B, L]
B, L = input_ids.shape
H = model.config.dim                              # 768

print("B (batch size): ", B)
print("L (sequence length in tokens): ", L)
print("H: (hidden size)", H)

# 3) Lookup word and position embeddings
Ew_table = model.embeddings.word_embeddings       # nn.Embedding[V, H]
Ep_table = model.embeddings.position_embeddings   # nn.Embedding[512, H]

E_word = Ew_table(input_ids)                      # [B, L, H]

# DistilBERT uses absolute position ids 0..L-1 per sequence
position_ids = torch.arange(L, device=device).unsqueeze(0).expand(B, L)  # [B, L]
E_pos = Ep_table(position_ids)                    # [B, L, H]

# 4) Raw embedding sum (this is h0 before LayerNorm/Dropout)
h0 = E_word + E_pos                               # [B, L, H]

# (Optional) Compare to the model’s embedding output, which applies LayerNorm+Dropout
# (Dropout is a no-op here because the model is in eval mode)
with torch.no_grad():
    h0_full = model.embeddings(input_ids=input_ids)  # [B, L, H]
print(h0.shape, h0_full.shape)

B (batch size):  2
L (sequence length in tokens):  13
H: (hidden size) 768
torch.Size([2, 13, 768]) torch.Size([2, 13, 768])
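The shapes of `h0` and `h0_full` match, but the values differ because of the LayerNorm step. The sketch below reproduces that relationship on small, randomly initialized layers rather than the DistilBERT weights. The sizes, ids, and the `eps` value are assumptions chosen for illustration.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Toy sizes (assumed for illustration): vocab, max positions, hidden size.
V, P, H = 20, 8, 6
eps = 1e-12
word_emb = nn.Embedding(V, H)
pos_emb = nn.Embedding(P, H)
layer_norm = nn.LayerNorm(H, eps=eps)
dropout = nn.Dropout(0.1).eval()  # in eval mode dropout is the identity

input_ids = torch.tensor([[3, 7, 7, 1]])                      # [B=1, L=4]
position_ids = torch.arange(input_ids.size(1)).unsqueeze(0)   # [1, L]

with torch.no_grad():
    h0 = word_emb(input_ids) + pos_emb(position_ids)  # raw sum
    h0_full = dropout(layer_norm(h0))                 # what an embeddings module would output

    # LayerNorm by hand: normalize each token vector over the hidden dimension.
    mean = h0.mean(dim=-1, keepdim=True)
    var = h0.var(dim=-1, unbiased=False, keepdim=True)
    manual = (h0 - mean) / torch.sqrt(var + eps) * layer_norm.weight + layer_norm.bias

print(torch.allclose(h0, h0_full))                # raw sum differs from normalized output
print(torch.allclose(manual, h0_full, atol=1e-5)) # manual LayerNorm reproduces it
```

The same check could be run against the real model by comparing `model.embeddings.LayerNorm(h0)` with `h0_full`, assuming the installed transformers version exposes the layer under that attribute name.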


## Hugging Face demo
The remaining cells are taken from the Hugging Face demo.

In [34]:
encoded_input = tokenizer("How are you?")
print(encoded_input["input_ids"])
tokenizer.decode(encoded_input["input_ids"])

[101, 2129, 2024, 2017, 1029, 102]


'[CLS] how are you? [SEP]'

In [18]:
from transformers import pipeline

classifier = pipeline("sentiment-analysis")
classifier(
    [
        "I've been waiting for a HuggingFace course my whole life.",
        "I hate this so much!",
    ]
)

No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision 714eb0f (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.
Device set to use cuda:0


[{'label': 'POSITIVE', 'score': 0.9598046541213989},
 {'label': 'NEGATIVE', 'score': 0.9994558691978455}]

In [19]:
from transformers import AutoTokenizer

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

In [20]:
raw_inputs = [
    "I've been waiting for a HuggingFace course my whole life.",
    "I hate this so much!",
]
inputs = tokenizer(raw_inputs, padding=True, truncation=True, return_tensors="pt")
print(inputs)

{'input_ids': tensor([[  101,  1045,  1005,  2310,  2042,  3403,  2005,  1037, 17662, 12172,
          2607,  2026,  2878,  2166,  1012,   102],
        [  101,  1045,  5223,  2023,  2061,  2172,   999,   102,     0,     0,
             0,     0,     0,     0,     0,     0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]])}


In [30]:
from transformers import AutoModel

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
model = AutoModel.from_pretrained(checkpoint)

In [31]:
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)

torch.Size([2, 16, 768])


In [32]:
print(outputs)

BaseModelOutput(last_hidden_state=tensor([[[-0.1798,  0.2333,  0.6321,  ..., -0.3017,  0.5008,  0.1481],
         [ 0.2758,  0.6497,  0.3200,  ..., -0.0760,  0.5136,  0.1329],
         [ 0.9046,  0.0985,  0.2950,  ...,  0.3352, -0.1407, -0.6464],
         ...,
         [ 0.1466,  0.5661,  0.3235,  ..., -0.3376,  0.5100, -0.0561],
         [ 0.7500,  0.0487,  0.1738,  ...,  0.4684,  0.0030, -0.6084],
         [ 0.0519,  0.3729,  0.5223,  ...,  0.3584,  0.6500, -0.3883]],

        [[-0.2937,  0.7283, -0.1497,  ..., -0.1187, -1.0227, -0.0422],
         [-0.2206,  0.9384, -0.0951,  ..., -0.3643, -0.6605,  0.2407],
         [-0.1536,  0.8988, -0.0728,  ..., -0.2189, -0.8528,  0.0710],
         ...,
         [-0.3017,  0.9002, -0.0200,  ..., -0.1082, -0.8412, -0.0861],
         [-0.3338,  0.9674, -0.0729,  ..., -0.1952, -0.8181, -0.0634],
         [-0.3454,  0.8824, -0.0426,  ..., -0.0993, -0.8329, -0.1065]]],
       grad_fn=<NativeLayerNormBackward0>), hidden_states=None, attentions=None)


In [23]:
from transformers import AutoModelForSequenceClassification

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)
outputs = model(**inputs)

In [24]:
print(outputs.logits.shape)

torch.Size([2, 2])


In [25]:
print(outputs.logits)

tensor([[-1.5607,  1.6123],
        [ 4.1692, -3.3464]], grad_fn=<AddmmBackward0>)


In [26]:
import torch

predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
print(predictions)

tensor([[4.0195e-02, 9.5980e-01],
        [9.9946e-01, 5.4418e-04]], grad_fn=<SoftmaxBackward0>)


In [27]:
model.config.id2label

{0: 'NEGATIVE', 1: 'POSITIVE'}
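The last three cells can be tied together by hand: softmax turns each row of logits into probabilities, and `id2label` names the largest one. Here is a small pure-Python sketch using the logits printed above, copied as rounded literals, so the results agree with the pipeline scores only to about four decimal places.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    shift = max(logits)
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


id2label = {0: "NEGATIVE", 1: "POSITIVE"}
# Logits copied from the printed tensor above (rounded to 4 decimals).
logits = [[-1.5607, 1.6123], [4.1692, -3.3464]]

results = []
for row in logits:
    probs = softmax(row)
    best = max(range(len(probs)), key=probs.__getitem__)  # index of the largest probability
    results.append((id2label[best], probs[best]))

for label, score in results:
    print(label, round(score, 4))
```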

In [36]:
from transformers import AutoModel

model = AutoModel.from_pretrained("bert-base-cased")
print(model.config)

BertConfig {
  "architectures": [
    "BertForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "dtype": "float32",
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "transformers_version": "4.57.1",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 28996
}



In [37]:
from transformers import BertConfig
bert_config = BertConfig.from_pretrained("bert-base-cased")
print(bert_config)

BertConfig {
  "architectures": [
    "BertForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "transformers_version": "4.57.1",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 28996
}
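The config fields relate directly to the embedding tables discussed earlier: `vocab_size`, `max_position_embeddings`, and `type_vocab_size` give the number of rows in BERT's word, position, and token-type tables, and `hidden_size` gives the number of columns. A rough sketch of the resulting parameter counts follows, using the values printed above copied into a plain dict. It counts only the three lookup tables and leaves out the embedding LayerNorm.

```python
# Values copied from the printed BertConfig for bert-base-cased.
config = {
    "vocab_size": 28996,
    "max_position_embeddings": 512,
    "type_vocab_size": 2,
    "hidden_size": 768,
}

hidden = config["hidden_size"]
word_params = config["vocab_size"] * hidden                   # word embedding table
position_params = config["max_position_embeddings"] * hidden  # position embedding table
type_params = config["type_vocab_size"] * hidden              # token-type (segment) table
total_params = word_params + position_params + type_params

print("word:", word_params)
print("position:", position_params)
print("token type:", type_params)
print("total lookup-table parameters:", total_params)
```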



In [38]:
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-cased")

In [39]:
tokenizer.save_pretrained("directory_on_my_computer")

('directory_on_my_computer/tokenizer_config.json',
 'directory_on_my_computer/special_tokens_map.json',
 'directory_on_my_computer/vocab.txt',
 'directory_on_my_computer/added_tokens.json')