In [None]:
!pip install -q transformers accelerate bitsandbytes

In [None]:
from transformers import AutoModelForCausalLM, AutoTokenizer


The first cell installs `transformers`, `accelerate`, and `bitsandbytes`. `transformers` provides the pre-trained models and tokenizers, `accelerate` handles device placement for large models, and `bitsandbytes` supplies quantized (8-bit and 4-bit) layers. The second cell imports `AutoModelForCausalLM`, which loads a causal language model, and `AutoTokenizer`, which loads the matching tokenizer.

In [None]:
OPTMETA = AutoModelForCausalLM.from_pretrained("facebook/opt-1.3b")

This loads the OPT-1.3b model from the Hugging Face Hub and stores it in a variable named `OPTMETA`. As written, the call uses the default precision (float32) and keeps the model on the CPU. No quantization is applied, which is why the layers printed later are plain `Linear` modules. To load in 8-bit instead, pass `quantization_config=BitsAndBytesConfig(load_in_8bit=True)` together with `device_map="auto"`. This typically requires a CUDA GPU and reduces weight memory at a small cost in precision.
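To make the memory argument concrete, here is a rough back-of-the-envelope estimate of weight storage at different precisions. The parameter count is an approximation taken from the model name, and the figures ignore activations, framework overhead, and any modules a quantization library keeps in higher precision.

```python
# Rough weight-memory estimate for a ~1.3B-parameter model.
# NUM_PARAMS is an approximation based on the model name, not a measured value.
NUM_PARAMS = 1.3e9

BYTES_PER_PARAM = {"float32": 4, "float16": 2, "int8": 1}


def weight_memory_gb(num_params, bytes_per_param):
    """Return weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9


estimates = {
    name: weight_memory_gb(NUM_PARAMS, size)
    for name, size in BYTES_PER_PARAM.items()
}

for name, gb in estimates.items():
    print(f"{name:>8}: ~{gb:.1f} GB")
```

Under these assumptions, 8-bit weights take about a quarter of the space of float32 weights.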


In [None]:
tokenizeropt = AutoTokenizer.from_pretrained("facebook/opt-1.3b")

In [None]:
#inputval = "In Bangalore it is a bright sunny"
inputval = "A boy leaps across the sleeping"

inputval_tokenized = tokenizeropt(inputval, return_tensors="pt")

print("Token IDs size:", inputval_tokenized['input_ids'].size())

# Map each ID back to its token string. The ID list includes the leading
# </s> marker that the tokenizer prepends, so it has one more entry than
# tokenizeropt.tokenize(inputval) would return.
ids = inputval_tokenized['input_ids'][0].tolist()  # Extract IDs from tensor
tokens = tokenizeropt.convert_ids_to_tokens(ids)

print("\nTokens and Corresponding IDs:")
for token, token_id in zip(tokens, ids):
    print(f"Token: {token} - Token ID: {token_id}")

Token IDs size: torch.Size([1, 7])

Tokens and Corresponding IDs:
Token: </s> - Token ID: 2
Token: A - Token ID: 250
Token: Ġboy - Token ID: 2143
Token: Ġleaps - Token ID: 32564
Token: Ġacross - Token ID: 420
Token: Ġthe - Token ID: 5
Token: Ġsleeping - Token ID: 8416
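
Two details in this output are worth spelling out. The `</s>` token (ID 2) is the start marker the OPT tokenizer prepends, and `Ġ` is how the byte-level BPE vocabulary writes a leading space. The sketch below rebuilds the original string from the printed tokens. It is a simplified illustration, not the tokenizer's real decoding routine, and `SPECIAL_TOKENS` is a hypothetical helper set.

```python
# Tokens exactly as printed in the output above.
tokens = ["</s>", "A", "Ġboy", "Ġleaps", "Ġacross", "Ġthe", "Ġsleeping"]

# Assumption: only the start marker needs to be dropped for this input.
SPECIAL_TOKENS = {"</s>"}


def tokens_to_text(tokens):
    """Join token strings and turn the 'Ġ' marker back into a space."""
    pieces = [t for t in tokens if t not in SPECIAL_TOKENS]
    return "".join(pieces).replace("Ġ", " ")


text = tokens_to_text(tokens)
print(text)
```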


In [None]:
print(OPTMETA.model)

OPTModel(
  (decoder): OPTDecoder(
    (embed_tokens): Embedding(50272, 2048, padding_idx=1)
    (embed_positions): OPTLearnedPositionalEmbedding(2050, 2048)
    (final_layer_norm): LayerNorm((2048,), eps=1e-05, elementwise_affine=True)
    (layers): ModuleList(
      (0-23): 24 x OPTDecoderLayer(
        (self_attn): OPTSdpaAttention(
          (k_proj): Linear(in_features=2048, out_features=2048, bias=True)
          (v_proj): Linear(in_features=2048, out_features=2048, bias=True)
          (q_proj): Linear(in_features=2048, out_features=2048, bias=True)
          (out_proj): Linear(in_features=2048, out_features=2048, bias=True)
        )
        (activation_fn): ReLU()
        (self_attn_layer_norm): LayerNorm((2048,), eps=1e-05, elementwise_affine=True)
        (fc1): Linear(in_features=2048, out_features=8192, bias=True)
        (fc2): Linear(in_features=8192, out_features=2048, bias=True)
        (final_layer_norm): LayerNorm((2048,), eps=1e-05, elementwise_affine=True)
      )


Explanation

**embed_tokens:** Converts input words/tokens into vectors of size 2048 using a learned embedding matrix. (Vocabulary size = 50,272).

**embed_positions:** Adds position information (like word order) to each token's vector so the model understands the sentence structure.

The model has 24 layers of OPTDecoderLayer, which process the input data step-by-step.

**Each decoder layer consists of:**

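Self-Attention (self_attn): Allows each word to "focus" on other words in the sentence.
Projections (k_proj, v_proj, q_proj, out_proj): k_proj, v_proj, and q_proj map the input into key, value, and query vectors; out_proj maps the attention result back to the 2048-dimensional hidden size.
Linear: The printout shows standard Linear layers, so the weights are stored in full precision. Linear8bitLt layers, which perform the same matrix multiplications in 8-bit precision to save memory, would appear here only if the model were loaded with 8-bit quantization.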

**Feedforward Network:**

fc1: Expands the vector size from 2048 → 8192.
fc2: Shrinks it back from 8192 → 2048.
Activation Function (ReLU): Adds non-linearity so the model can learn complex patterns.
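
The expand, activate, shrink pattern of the feedforward block can be sketched with toy sizes. The weights below are made-up values chosen for easy arithmetic (2 → 4 → 2 instead of 2048 → 8192 → 2048), and `linear` and `relu` are hypothetical helpers standing in for the real layers.

```python
def linear(x, weight, bias):
    """Compute weight @ x + bias for a single input vector."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weight, bias)]


def relu(xs):
    """Zero out negative entries."""
    return [max(0.0, v) for v in xs]


# Toy weights: fc1 expands 2 -> 4, fc2 shrinks 4 -> 2.
fc1_weight = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]
fc1_bias = [0.0, 0.0, 0.0, 0.0]
fc2_weight = [[1.0, 1.0, 1.0, 1.0], [2.0, 0.0, 0.0, 0.0]]
fc2_bias = [0.5, -0.5]

x = [1.0, -2.0]
hidden = relu(linear(x, fc1_weight, fc1_bias))  # expanded, then activated
out = linear(hidden, fc2_weight, fc2_bias)      # projected back to input size

print(hidden)  # [1.0, 0.0, 0.0, 0.0]
print(out)     # [1.5, 1.5]
```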

**Layer Normalization (LayerNorm):** Normalizes outputs at different stages for stable training.

The final layer normalizes the output one last time before producing the result.
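
As a minimal sketch of what layer normalization computes, the function below normalizes one vector to zero mean and roughly unit variance, using the same `eps=1e-05` shown in the printout. The learned scale and shift (`elementwise_affine=True`) are left at their identity values here, which is an assumption made for simplicity.

```python
import math


def layer_norm(x, eps=1e-5):
    """Normalize a vector to zero mean and (almost) unit variance.

    Uses the biased variance (divide by N). The learned scale and shift
    are omitted, i.e. treated as 1 and 0.
    """
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


normalized = layer_norm([1.0, 2.0, 3.0, 4.0])
norm_mean = sum(normalized) / len(normalized)
norm_var = sum((v - norm_mean) ** 2 for v in normalized) / len(normalized)

print([round(v, 4) for v in normalized])
print(round(norm_mean, 6), round(norm_var, 6))
```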


Together, these components enable the OPT model to process input tokens, understand relationships, and generate predictions efficiently.
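
The shapes in the printout are enough to count the parameters by hand. The sketch below does that arithmetic; it covers only the modules shown in the printed `OPTModel`, and the helper names are illustrative.

```python
# Sizes read from the printed architecture.
VOCAB, POSITIONS, D_MODEL, D_FF, NUM_LAYERS = 50272, 2050, 2048, 8192, 24


def linear_params(n_in, n_out):
    """Weights plus bias of a Linear layer."""
    return n_in * n_out + n_out


def layer_norm_params(dim):
    """Scale plus shift of a LayerNorm with elementwise_affine=True."""
    return 2 * dim


embeddings = VOCAB * D_MODEL + POSITIONS * D_MODEL
attention = 4 * linear_params(D_MODEL, D_MODEL)  # k, v, q, out projections
feedforward = linear_params(D_MODEL, D_FF) + linear_params(D_FF, D_MODEL)
per_layer = attention + feedforward + 2 * layer_norm_params(D_MODEL)
total = embeddings + NUM_LAYERS * per_layer + layer_norm_params(D_MODEL)

print(f"Per decoder layer: {per_layer:,}")
print(f"Total:             {total:,}")
```

The total comes out at roughly 1.3 billion, which matches the "1.3b" in the model name.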

In [None]:
# Extract token embeddings for the input tokens
token_embeddings = OPTMETA.model.decoder.embed_tokens(inputval_tokenized['input_ids'])

print("===== Token Embeddings Details =====")
print(f"Layer:    {OPTMETA.model.decoder.embed_tokens}")
print(f"Size:     {token_embeddings.size()}")
print("Output:   Token Embeddings (First 2 Tokens):\n", token_embeddings[:, :2, :])


===== Token Embeddings Details =====
Layer:    Embedding(50272, 2048, padding_idx=1)
Size:     torch.Size([1, 7, 2048])
Output:   Token Embeddings (First 2 Tokens):
 tensor([[[-0.0407,  0.0519,  0.0574,  ..., -0.0263, -0.0355, -0.0260],
         [-0.0425,  0.0070,  0.0229,  ...,  0.0706, -0.0323, -0.0276]]],
       grad_fn=<SliceBackward0>)


This code extracts token embeddings for the input tokens using the embed_tokens layer of the OPT model. The embeddings have a size of (1, 7, 2048), which means there is 1 batch, 7 tokens, and each token is represented as a 2048-dimensional vector. The output shows the first 2 token embeddings (partial values) to keep the result readable. These embeddings are the learned numerical representations of the tokens that the rest of the model operates on.
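
Conceptually, an embedding layer is a table lookup: each token ID selects one row of a learned matrix. The toy sketch below uses a small random table (10 × 4 rather than 50272 × 2048) to show how a batch of IDs turns into a `(batch, tokens, dimensions)` result. The table values and IDs are made up.

```python
import random

random.seed(0)

# Toy sizes; the real layer is Embedding(50272, 2048).
VOCAB_SIZE, EMBED_DIM = 10, 4
table = [[random.uniform(-0.1, 0.1) for _ in range(EMBED_DIM)]
         for _ in range(VOCAB_SIZE)]


def embed(batch_ids):
    """Replace every token ID with its row from the embedding table."""
    return [[table[token_id] for token_id in seq] for seq in batch_ids]


batch = [[2, 5, 7]]  # one sequence of three hypothetical token IDs
vectors = embed(batch)
shape = (len(vectors), len(vectors[0]), len(vectors[0][0]))

print(shape)  # (1, 3, 4)
```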

In [None]:
# Extract positional embeddings for the input tokens
positional_embeddings = OPTMETA.model.decoder.embed_positions(inputval_tokenized['attention_mask'])

print("===== Positional Embeddings Details =====")
print(f"Layer:    {OPTMETA.model.decoder.embed_positions}")
print(f"Size:     {positional_embeddings.size()}")
print("Output:   Positional Embeddings (First 2 Positions):\n", positional_embeddings[:, :2, :])

===== Positional Embeddings Details =====
Layer:    OPTLearnedPositionalEmbedding(2050, 2048)
Size:     torch.Size([1, 7, 2048])
Output:   Positional Embeddings (First 2 Positions):
 tensor([[[-8.1406e-03, -2.6221e-01,  6.0768e-03,  ...,  1.7273e-02,
          -5.0621e-03, -1.6220e-02],
         [-8.0585e-05,  2.5000e-01, -1.6632e-02,  ..., -1.5419e-02,
          -1.7838e-02,  2.4948e-02]]], grad_fn=<SliceBackward0>)


The code extracts **positional embeddings** for the input tokens using `embed_positions` from the OPT decoder.  
`inputval_tokenized['attention_mask']` is passed instead of the token IDs because the layer derives each position index from the mask, so padded slots do not advance the position count.  
It prints the **layer** type (`OPTLearnedPositionalEmbedding`) and the **size** of the embeddings: `(1, 7, 2048)` → batch size 1, 7 positions, and 2048 dimensions.  
It displays the **first two positional embeddings** (partial output) for simplicity.  
The printed tensor carries no `device` or `dtype` annotation, so it is on the **CPU** in the default **float32** precision; `cuda:0` and float16 would appear only if the model had been loaded onto a GPU in half precision.
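
How an attention mask becomes position indices can be sketched in a few lines. This mirrors my understanding of how `OPTLearnedPositionalEmbedding` behaves in library versions that accept an attention mask (a running sum of the mask, shifted by an offset of 2); treat both the formula and the `OFFSET` constant as assumptions rather than a copy of the library code. The offset would also explain why the table has 2050 rows for a 2048-token context.

```python
from itertools import accumulate

# Assumption: OPT shifts every position index by 2.
OFFSET = 2


def position_ids(attention_mask):
    """Derive embedding-table indices from a 0/1 attention mask."""
    running = list(accumulate(attention_mask))
    return [count * keep - 1 + OFFSET
            for count, keep in zip(running, attention_mask)]


full_mask_positions = position_ids([1] * 7)       # no padding, as in this notebook
padded_positions = position_ids([0, 0, 1, 1, 1])  # two left-padded slots

print(full_mask_positions)  # [2, 3, 4, 5, 6, 7, 8]
print(padded_positions)     # [1, 1, 2, 3, 4]
```

In the padded case, the two padding slots share one index and the real tokens still start counting from the same first position.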

In [None]:
# Combine token embeddings and positional embeddings
combined_input = token_embeddings + positional_embeddings

# Pass the combined input through the first self-attention layer
hidden_states, _, _ = OPTMETA.model.decoder.layers[0].self_attn(combined_input)

print("===== Self-Attention Layer Details =====")
print(f"Layer:    {OPTMETA.model.decoder.layers[0].self_attn}")
print(f"Size:     {hidden_states.size()}")
print("Output:   Hidden States (First 2 Positions):\n", hidden_states[:, :2, :])


===== Self-Attention Layer Details =====
Layer:    OPTSdpaAttention(
  (k_proj): Linear(in_features=2048, out_features=2048, bias=True)
  (v_proj): Linear(in_features=2048, out_features=2048, bias=True)
  (q_proj): Linear(in_features=2048, out_features=2048, bias=True)
  (out_proj): Linear(in_features=2048, out_features=2048, bias=True)
)
Size:     torch.Size([1, 7, 2048])
Output:   Hidden States (First 2 Positions):
 tensor([[[-1.3504e-02, -9.5360e-03,  1.2638e-03,  ...,  6.4713e-03,
          -1.7172e-03,  1.3433e-02],
         [-1.2404e-02, -1.0845e-02,  1.1231e-03,  ...,  9.6275e-03,
           8.4561e-05,  9.9411e-03]]], grad_fn=<SliceBackward0>)


This code adds the token and positional embeddings and passes the sum through the self-attention module of the first decoder layer. Self-attention lets each token weigh the other tokens, producing hidden states that encode contextual relationships. The output shape (1, 7, 2048) indicates 1 batch, 7 positions (tokens), and 2048 dimensions per token. The displayed values are the hidden states at the first 2 positions. Because `self_attn` is called directly, the layer norm, residual connection, and feedforward block of the full decoder layer are not applied in this step.
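
The core computation inside a self-attention module is scaled dot-product attention. Below is a minimal single-head sketch in plain Python with toy vectors. It leaves out the learned q/k/v/out projections and the split into multiple heads, and the `causal` flag is a hypothetical parameter that illustrates the masking used in normal left-to-right decoding.

```python
import math


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values, causal=False):
    """Single-head scaled dot-product attention over lists of vectors."""
    dim = len(queries[0])
    outputs, weights = [], []
    for i, q in enumerate(queries):
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim)
                  for k in keys]
        if causal:
            # Hide future positions so token i only sees tokens 0..i.
            scores = [s if j <= i else float("-inf")
                      for j, s in enumerate(scores)]
        w = softmax(scores)
        weights.append(w)
        outputs.append([sum(wj * v[c] for wj, v in zip(w, values))
                        for c in range(len(values[0]))])
    return outputs, weights


tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # three toy token vectors
out_full, w_full = attention(tokens, tokens, tokens)
out_causal, w_causal = attention(tokens, tokens, tokens, causal=True)

print(w_full[0])    # first token spreads weight over all three tokens
print(w_causal[0])  # first token can only attend to itself
```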

In [None]:
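# Pass the combined input through the self-attention sublayer of every decoder layer
# (layer norms, residual connections, and the fc1/fc2 feedforward blocks are skipped here)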
hidden_states = combined_input
for i, layer in enumerate(OPTMETA.model.decoder.layers):
    hidden_states, _, _ = layer.self_attn(hidden_states)
    print(f"Layer {i+1} Output Size: {hidden_states.size()}")

Layer 1 Output Size: torch.Size([1, 7, 2048])
Layer 2 Output Size: torch.Size([1, 7, 2048])
Layer 3 Output Size: torch.Size([1, 7, 2048])
Layer 4 Output Size: torch.Size([1, 7, 2048])
Layer 5 Output Size: torch.Size([1, 7, 2048])
Layer 6 Output Size: torch.Size([1, 7, 2048])
Layer 7 Output Size: torch.Size([1, 7, 2048])
Layer 8 Output Size: torch.Size([1, 7, 2048])
Layer 9 Output Size: torch.Size([1, 7, 2048])
Layer 10 Output Size: torch.Size([1, 7, 2048])
Layer 11 Output Size: torch.Size([1, 7, 2048])
Layer 12 Output Size: torch.Size([1, 7, 2048])
Layer 13 Output Size: torch.Size([1, 7, 2048])
Layer 14 Output Size: torch.Size([1, 7, 2048])
Layer 15 Output Size: torch.Size([1, 7, 2048])
Layer 16 Output Size: torch.Size([1, 7, 2048])
Layer 17 Output Size: torch.Size([1, 7, 2048])
Layer 18 Output Size: torch.Size([1, 7, 2048])
Layer 19 Output Size: torch.Size([1, 7, 2048])
Layer 20 Output Size: torch.Size([1, 7, 2048])
Layer 21 Output Size: torch.Size([1, 7, 2048])
Layer 22 Output Size: 

In [None]:
final_output = OPTMETA.model.decoder.final_layer_norm(hidden_states)
print("===== Final Output Details =====")
print(f"Size: {final_output.size()}")
print("Output (First 2 Positions):\n", final_output[:, :2, :])

===== Final Output Details =====
Size: torch.Size([1, 7, 2048])
Output (First 2 Positions):
 tensor([[[ 1.0349, -0.5554, -1.3153,  ...,  0.3701,  0.4754, -0.1551],
         [ 1.0349, -0.5554, -1.3153,  ...,  0.3701,  0.4754, -0.1551]]],
       grad_fn=<SliceBackward0>)


Predicted Next Token:  body


In [None]:
import torch

# Generate 3 next tokens
next_tokens = []
current_input_ids = inputval_tokenized['input_ids']

for _ in range(3):  # Loop for 3 predictions
    # Compute logits using the updated input sequence
    current_output = OPTMETA(input_ids=current_input_ids).logits

    # Predict the next token ID
    predicted_token_id = torch.argmax(current_output[:, -1, :], dim=-1)

    # Decode the predicted token and store it
    predicted_token = tokenizeropt.decode(predicted_token_id)
    next_tokens.append(predicted_token)

    # Append the predicted token ID to the current input IDs
    current_input_ids = torch.cat((current_input_ids, predicted_token_id.unsqueeze(0)), dim=-1)

# Print the predicted words
print(f"Predicted Next Tokens: {' '.join(next_tokens)}")



Predicted Next Tokens:  day  and  I
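
The loop above is greedy decoding: score every vocabulary entry, take the highest-scoring one, append it, and repeat. The sketch below isolates that control flow with a toy stand-in for the model, so it runs without downloading any weights. `toy_logits` is a hypothetical function, not part of any library.

```python
def toy_logits(token_ids, vocab_size=5):
    """Hypothetical stand-in for a language model.

    Always gives the highest score to (last_id + 1) % vocab_size.
    """
    last = token_ids[-1]
    return [1.0 if i == (last + 1) % vocab_size else 0.0
            for i in range(vocab_size)]


def greedy_generate(token_ids, steps):
    """Repeatedly append the highest-scoring next token ID."""
    ids = list(token_ids)
    for _ in range(steps):
        logits = toy_logits(ids)
        next_id = max(range(len(logits)), key=logits.__getitem__)  # argmax
        ids.append(next_id)
    return ids


generated = greedy_generate([0, 3], steps=3)
print(generated)  # [0, 3, 4, 0, 1]
```

Greedy decoding is deterministic, which makes it convenient for walkthroughs like this one; sampling-based strategies trade that determinism for more varied output.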
