## Problem: Quantize Your Language Model

### Problem Statement
Implement a **language model** using an LSTM and apply **dynamic quantization** to optimize it for inference. Dynamic quantization reduces model size and can speed up inference by storing the model's weights as 8-bit integers and quantizing activations on the fly at inference time.

### Requirements

1. **Define the Language Model**:
   - **Purpose**: Build a simple language model that predicts the next token in a sequence.
   - **Components**:
     - **Embedding Layer**: Converts input tokens into dense vector representations.
     - **LSTM Layer**: Processes the embedded sequence to capture temporal dependencies.
     - **Fully Connected Layer**: Outputs predictions for the next token.
     - **Softmax Layer**: Applies a probability distribution over the vocabulary for predictions.
   - **Forward Pass**:
     - Pass the input sequence through the embedding layer.
     - Feed the embedded sequence into the LSTM.
     - Use the final hidden state from the LSTM to make predictions via the fully connected layer.
     - Apply the softmax function to obtain probabilities over the vocabulary.

2. **Apply Dynamic Quantization**:
   - Quantize the model dynamically (for example, its `nn.Linear` and `nn.LSTM` layers to `torch.qint8`).
   - Evaluate the quantized model's performance compared to the original model.
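
To make "quantizing the weights" concrete, here is a minimal sketch of per-row symmetric int8 quantization in plain Python. The scheme (one scale per row, zero point fixed at 0, scale = max|w| / 127) is an assumption chosen for illustration, and the helpers `quantize_row` and `dequantize_row` are hypothetical names, not part of PyTorch or torchao.

```python
# Illustrative per-row symmetric int8 quantization (assumed scheme, not library code).

def quantize_row(row):
    """Map one row of float weights to int8 values plus a single float scale."""
    max_abs = max(abs(w) for w in row)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    # Round to the nearest integer step and clamp to the int8 range.
    q = [max(-128, min(127, round(w / scale))) for w in row]
    return q, scale


def dequantize_row(q, scale):
    """Recover approximate float weights from int8 values and the row scale."""
    return [v * scale for v in q]


weights = [
    [0.0125, -0.0880, 0.0431, 0.0007],
    [-0.0300, 0.0150, 0.0600, -0.0450],
]

for row in weights:
    q, scale = quantize_row(row)
    restored = dequantize_row(q, scale)
    max_err = max(abs(a - b) for a, b in zip(row, restored))
    print(q, f"scale={scale:.6f}", f"max_err={max_err:.6f}")
```

Each weight then costs one byte instead of four, at the price of a rounding error of at most half a scale step per weight.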

In [2]:
import torch
import torch.nn as nn
import torch.optim as optim
from torch.quantization import quantize_dynamic

In [26]:
# TODO: Define a simple Language Model (an LSTM-based model)
class LanguageModel(nn.Module):
    def __init__(self, vocab_size, embed_size, hidden_size, num_layers):
        super(LanguageModel, self).__init__()
        self.embedding = nn.Embedding(vocab_size, embed_size)
        self.lstm = nn.LSTM(embed_size, hidden_size, num_layers, batch_first=True)
        self.fc = nn.Linear(hidden_size, vocab_size)
        self.softmax = nn.Softmax(dim=1)

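    def forward(self, x):
        embedded = self.embedding(x)                    # (batch, seq_len, embed_size)
        lstm_out, (hidden, cell) = self.lstm(embedded)  # lstm_out: (batch, seq_len, hidden_size)
        output = self.fc(lstm_out[:, -1, :])  # Use the top layer's output at the last time step
        # Returns probabilities, as the requirements ask. Note that nn.CrossEntropyLoss
        # applies log-softmax itself, so it expects raw logits rather than these probabilities.
        return self.softmax(output)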

In [34]:
# Create synthetic training data
torch.manual_seed(42)
vocab_size = 50
seq_length = 10
batch_size = 32
X_train = torch.randint(0, vocab_size, (batch_size, seq_length))  # Random integer input
y_train = torch.randint(0, vocab_size, (batch_size,))  # Random target words
print(y_train.shape)
# Initialize the model, loss function, and optimizer
embed_size = 64
hidden_size = 128
num_layers = 2
model = LanguageModel(vocab_size, embed_size, hidden_size, num_layers)
torch.save(model.state_dict(), "/workspaces/pytorch_handson/easy/data/model.pth")

criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)

torch.Size([32])


In [None]:
# for name, param in quantized_model.named_parameters():
#     print(name, param.shape, param.dtype)
# To check whether quantization has actually worked:
# 1. Print a weight as below; if its type is AffineQuantizedTensor, it is quantized.
print(quantized_model.fc.weight)
# Note: even when the weights are quantized, they will not be of type torch.qint8.
# 2. Alternatively, compare the size of the model on disk before and after quantization:

# -rw-rw-rw-  1 codespace codespace 946K Dec  3 10:24 model.pth
# -rw-rw-rw-  1 codespace codespace 930K Dec  3 10:24 quantized_model.pth
# The size barely changes here, most likely because only fc.weight (128 x 50 = 6,400 values) was quantized:
# the unquantized embedding and LSTM parameters, plus metadata overhead, dominate the file size of this small model.

AffineQuantizedTensor(tensor_impl=PlainAQTTensorImpl(data=tensor([[ -19,   62,  105,  ...,   97, -116,  -26],
        [  -9,   25,   49,  ...,  -13,  -52,   -1],
        [  -6,  -92,   -4,  ...,  -79,  -52,   19],
        ...,
        [  38, -101,  -31,  ...,  -49,  -58,  126],
        [  -6,  -44,   18,  ...,  -70,  -98,  101],
        [ -61,  116,  -40,  ...,   72,   70,  -16]], dtype=torch.int8)... , scale=tensor([0.0007, 0.0007, 0.0007, 0.0007, 0.0007, 0.0007, 0.0007, 0.0007, 0.0007,
        0.0007, 0.0007, 0.0007, 0.0007, 0.0007, 0.0007, 0.0007, 0.0007, 0.0007,
        0.0007, 0.0007, 0.0007, 0.0007, 0.0007, 0.0007, 0.0007, 0.0007, 0.0007,
        0.0007, 0.0007, 0.0007, 0.0007, 0.0007, 0.0007, 0.0007, 0.0007, 0.0007,
        0.0007, 0.0007, 0.0007, 0.0007, 0.0007, 0.0007, 0.0007, 0.0007, 0.0007,
        0.0007, 0.0007, 0.0007, 0.0007, 0.0007])... , zero_point=tensor([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0

In [35]:
# compute total number of parameters and total size in bytes (uses element_size() so dtype-aware)
def sizeof_fmt(num, suffix='B'):
    for unit in ['','Ki','Mi','Gi','Ti']:
        if abs(num) < 1024.0:
            return f"{num:3.2f}{unit}{suffix}"
        num /= 1024.0
    return f"{num:.2f}Pi{suffix}"

total_params = sum(p.numel() for p in model.parameters())
trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
total_bytes = sum(p.numel() * p.element_size() for p in model.parameters())

print(f"Total params: {total_params}")
print(f"Trainable params: {trainable_params}")
print(f"Total size: {total_bytes} bytes ({sizeof_fmt(total_bytes)})")

# Note: the count of trainable parameters changes after quantization

Total params: 241074
Trainable params: 241074
Total size: 964296 bytes (941.70KiB)
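
The total above can be reproduced by hand from the layer shapes. The sketch below uses the standard parameter layout of `nn.Embedding`, `nn.LSTM` (four gates, two bias vectors per layer), and `nn.Linear`; the helper name `lstm_layer_params` is hypothetical.

```python
vocab_size, embed_size, hidden_size, num_layers = 50, 64, 128, 2


def lstm_layer_params(input_size, hidden_size):
    # Four gates, each with input-hidden weights and hidden-hidden weights,
    # plus two bias vectors (bias_ih and bias_hh) of length 4 * hidden_size.
    return 4 * hidden_size * (input_size + hidden_size) + 2 * 4 * hidden_size


embedding = vocab_size * embed_size
lstm = sum(
    # The first layer reads embeddings; later layers read the previous layer's hidden state.
    lstm_layer_params(embed_size if layer == 0 else hidden_size, hidden_size)
    for layer in range(num_layers)
)
fc = hidden_size * vocab_size + vocab_size  # weight matrix plus bias

total = embedding + lstm + fc
print(f"embedding={embedding}, lstm={lstm}, fc={fc}, total={total}")
print(f"float32 bytes: {total * 4}")
```

The LSTM accounts for 231,424 of the 241,074 parameters, so any quantization that skips it can only shrink a small fraction of the model.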


In [36]:
# Training loop
epochs = 5
for epoch in range(epochs):
    model.train()
    optimizer.zero_grad()
    output = model(X_train)
    print(output.shape)
    print(y_train.shape)
    loss = criterion(output, y_train)
    loss.backward()
    optimizer.step()

    # Log progress every epoch
    print(f"Epoch [{epoch + 1}/{epochs}] - Loss: {loss.item():.4f}")

# Now quantize the model to reduce its size and improve inference speed.
# The eager-mode dynamic quantization call is kept here for reference:
# quantized_model = quantize_dynamic(model, {nn.Linear, nn.LSTM}, dtype=torch.qint8)
# torchao replacement used instead. Int8WeightOnlyConfig quantizes weights only
# (not activations), and in the printout below only the nn.Linear layer is converted.
from torchao.quantization import quantize_, Int8WeightOnlyConfig
config = Int8WeightOnlyConfig()

# quantize_ modifies the model in place and returns None, so call it once
# and then alias the (now quantized) model.
quantize_(model, config)
quantized_model = model

# Save the quantized model
# torch.save(quantized_model.state_dict(), "quantized_language_model.pth")
quantized_model


torch.Size([32, 50])
torch.Size([32])
Epoch [1/5] - Loss: 3.9118
torch.Size([32, 50])
torch.Size([32])
Epoch [2/5] - Loss: 3.9113
torch.Size([32, 50])
torch.Size([32])
Epoch [3/5] - Loss: 3.9108
torch.Size([32, 50])
torch.Size([32])
Epoch [4/5] - Loss: 3.9103
torch.Size([32, 50])
torch.Size([32])
Epoch [5/5] - Loss: 3.9097


LanguageModel(
  (embedding): Embedding(50, 64)
  (lstm): LSTM(64, 128, num_layers=2, batch_first=True)
  (fc): Linear(in_features=128, out_features=50, weight=AffineQuantizedTensor(shape=torch.Size([50, 128]), block_size=(1, 128), device=cpu, _layout=PlainLayout(), tensor_impl_dtype=torch.int8, quant_min=None, quant_max=None))
  (softmax): Softmax(dim=1)
)
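
The loss in the log above moves only from 3.9118 to 3.9097, which sits right at ln(50) ≈ 3.9120. One likely explanation is that `forward` returns softmax probabilities while `nn.CrossEntropyLoss` applies log-softmax to its input again. The plain-Python sketch below (with a hypothetical helper `cross_entropy` that mirrors that log-softmax step for a single sample) shows how narrow the reachable loss range becomes when probabilities are passed in place of logits.

```python
import math


def cross_entropy(logits, target):
    """Cross-entropy of one sample, applying log-softmax to `logits` first."""
    log_sum_exp = math.log(sum(math.exp(z) for z in logits))
    return log_sum_exp - logits[target]


vocab_size = 50

# Case 1: the model outputs a uniform probability vector.
uniform = [1.0 / vocab_size] * vocab_size
# Case 2: the model is perfectly confident in the correct token (index 0).
one_hot = [1.0] + [0.0] * (vocab_size - 1)

print(f"uniform probabilities as input: {cross_entropy(uniform, 0):.4f}")
print(f"one-hot probabilities as input: {cross_entropy(one_hot, 0):.4f}")
print(f"ln(vocab_size):                 {math.log(vocab_size):.4f}")
```

Even a perfect prediction cannot push this loss below about 2.95, and gradients through the double softmax are small. Returning raw logits from `forward` during training, and applying softmax only at inference, is the usual way to avoid this.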

In [37]:
print(quantized_model)
torch.save(quantized_model.state_dict(), "/workspaces/pytorch_handson/easy/data/quantized_model.pth")

LanguageModel(
  (embedding): Embedding(50, 64)
  (lstm): LSTM(64, 128, num_layers=2, batch_first=True)
  (fc): Linear(in_features=128, out_features=50, weight=AffineQuantizedTensor(shape=torch.Size([50, 128]), block_size=(1, 128), device=cpu, _layout=PlainLayout(), tensor_impl_dtype=torch.int8, quant_min=None, quant_max=None))
  (softmax): Softmax(dim=1)
)
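
File sizes can also be compared without leaving the notebook. The sketch below serializes a `state_dict` into an in-memory buffer and reports its length; `serialized_size` is a hypothetical helper, and the stand-in `nn.Linear(128, 50)` only mirrors the shape of `fc` rather than the notebook's trained model.

```python
import io

import torch
import torch.nn as nn


def serialized_size(module: nn.Module) -> int:
    """Return the number of bytes torch.save writes for the module's state_dict."""
    buffer = io.BytesIO()
    torch.save(module.state_dict(), buffer)
    return buffer.getbuffer().nbytes


# Stand-in layer with the same shape as `fc` in the notebook.
layer = nn.Linear(128, 50)
raw_bytes = sum(p.numel() * p.element_size() for p in layer.parameters())

print(f"raw parameter bytes: {raw_bytes}")
print(f"serialized bytes:    {serialized_size(layer)}")
```

The serialized size is always somewhat larger than the raw parameter bytes because the checkpoint also stores names, shapes, and archive metadata. Calling the same helper on the model before and after quantization gives a like-for-like comparison.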


In [16]:
def sizeof_fmt(num, suffix='B'):
    for unit in ['','Ki','Mi','Gi','Ti']:
        if abs(num) < 1024.0:
            return f"{num:3.2f}{unit}{suffix}"
        num /= 1024.0
    return f"{num:.2f}Pi{suffix}"

total_params = sum(p.numel() for p in quantized_model.parameters())
trainable_params = sum(p.numel() for p in quantized_model.parameters() if p.requires_grad)
total_bytes = sum(p.numel() * p.element_size() for p in quantized_model.parameters())

print(f"Total params: {total_params}")
print(f"Trainable params: {trainable_params}")
print(f"Total size: {total_bytes} bytes ({sizeof_fmt(total_bytes)})")

Total params: 241074
Trainable params: 234674
Total size: 964296 bytes (941.70KiB)
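
Two things stand out in this output. The trainable count drops by exactly the number of elements in `fc.weight`, and the reported total size is unchanged, which suggests that `numel() * element_size()` does not see the int8 storage inside the quantized tensor. The arithmetic below is a rough estimate of the real saving, assuming one float32 scale per output row (an assumption based on the `block_size=(1, 128)` shown in the model printout).

```python
hidden_size, vocab_size = 128, 50

fc_weight = hidden_size * vocab_size  # the only tensor shown as quantized in the printout
total_params = 241074                 # from the pre-quantization cell
trainable_after = total_params - fc_weight

fp32_bytes = fc_weight * 4            # original float32 storage
int8_bytes = fc_weight * 1            # int8 storage after quantization
scale_bytes = vocab_size * 4          # assumption: one float32 scale per output row

print(f"fc.weight elements: {fc_weight}")
print(f"trainable after quantization: {trainable_after}")
print(f"approx. bytes saved: {fp32_bytes - int8_bytes - scale_bytes}")
```

A saving of roughly 19 KB out of about 946 KB is the same order of magnitude as the 946K to 930K difference seen on disk.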


In [8]:
# Load the quantized model and test it
quantized_model = LanguageModel(vocab_size, embed_size, hidden_size, num_layers)

# Apply dynamic quantization on the model after defining it
quantized_model = quantize_dynamic(quantized_model, {nn.Linear, nn.LSTM}, dtype=torch.qint8)

quantized_model.load_state_dict(torch.load("quantized_language_model.pth"))

For migrations of users: 
1. Eager mode quantization (torch.ao.quantization.quantize, torch.ao.quantization.quantize_dynamic), please migrate to use torchao eager mode quantize_ API instead 
2. FX graph mode quantization (torch.ao.quantization.quantize_fx.prepare_fx,torch.ao.quantization.quantize_fx.convert_fx, please migrate to use torchao pt2e quantization API instead (prepare_pt2e, convert_pt2e) 
3. pt2e quantization has been migrated to torchao (https://github.com/pytorch/ao/tree/main/torchao/quantization/pt2e) 
see https://github.com/pytorch/ao/issues/2259 for more details
  quantized_model = quantize_dynamic(quantized_model, {nn.Linear, nn.LSTM}, dtype=torch.qint8)


UnpicklingError: Weights only load failed. This file can still be loaded, to do so you have two options, do those steps only if you trust the source of the checkpoint.
	(1) In PyTorch 2.6, we changed the default value of the `weights_only` argument in `torch.load` from `False` to `True`. Re-running `torch.load` with `weights_only` set to `False` will likely succeed, but it can result in arbitrary code execution. Do it only if you got the file from a trusted source.
	(2) Alternatively, to load with `weights_only=True` please check the recommended steps in the following error message.
	WeightsUnpickler error: Unsupported global: GLOBAL torch.ScriptObject was not an allowed global by default. Please use `torch.serialization.add_safe_globals([torch.ScriptObject])` or the `torch.serialization.safe_globals([torch.ScriptObject])` context manager to allowlist this global if you trust this class/function.

Check the documentation of torch.load to learn more about types accepted by default with weights_only https://pytorch.org/docs/stable/generated/torch.load.html.

In [8]:
# Testing the quantized model on a sample input
quantized_model.eval()
test_input = torch.randint(0, vocab_size, (1, seq_length))
with torch.no_grad():
    prediction = quantized_model(test_input)
    print(f"Prediction for input {test_input.tolist()}: {prediction.argmax(dim=1).item()}")

Prediction for input [[15, 28, 33, 19, 37, 24, 48, 42, 33, 35]]: 49
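
A single prediction does not show how the quantized model compares with the original, which the requirements ask for. The sketch below illustrates one way to measure that on a single layer: it round-trips a copy of the weights through per-row symmetric int8 (a hypothetical stand-in for a quantized layer, so the example runs without torchao) and compares outputs on random inputs. The layer shape, batch size, and seed are illustrative choices.

```python
import copy

import torch
import torch.nn as nn

torch.manual_seed(0)
reference = nn.Linear(128, 50)

# Stand-in for a quantized layer: quantize each weight row to int8 and
# immediately dequantize, so only the rounding error remains.
approx = copy.deepcopy(reference)
with torch.no_grad():
    w = reference.weight
    scale = w.abs().amax(dim=1, keepdim=True).clamp(min=1e-12) / 127.0
    approx.weight.copy_(torch.clamp(torch.round(w / scale), -128, 127) * scale)

x = torch.randn(256, 128)
with torch.no_grad():
    ref_out, approx_out = reference(x), approx(x)

# Largest element-wise deviation and fraction of rows with the same argmax.
max_abs_diff = (ref_out - approx_out).abs().max().item()
agreement = (ref_out.argmax(dim=1) == approx_out.argmax(dim=1)).float().mean().item()

print(f"max abs output difference: {max_abs_diff:.6f}")
print(f"top-1 agreement:           {agreement:.2%}")
```

The same two metrics can be computed between the original and quantized `LanguageModel`, provided a separate unquantized copy is kept, since `quantize_` modifies the model in place.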
