<a href="https://colab.research.google.com/github/hissain/ml/blob/main/codes/optim/LLM_Quantization.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
!pip install torch transformers bitsandbytes



In [None]:
!pip install huggingface_hub



# Hugging Face model

In [None]:
from huggingface_hub import notebook_login

In [None]:
notebook_login()


In [None]:
model_name = "meta-llama/Llama-3.2-1B"

In [None]:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
import bitsandbytes as bnb
from torch.quantization import get_default_qat_qconfig, prepare_qat, convert

In [None]:
device = "cuda" if torch.cuda.is_available() else "cpu"

# Dynamic Quantization
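Dynamic quantization converts the weights of the linear layers to int8 ahead of time and quantizes activations on the fly during inference. It is simple to apply and needs no calibration data, but it typically has a smaller effect on memory than the other methods shown here. In PyTorch it runs on CPU.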
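To make the int8 idea concrete, here is a minimal pure-Python sketch of symmetric per-tensor quantization. The helper names (`quantize_int8`, `dequantize`) and the sample weights are illustrative assumptions, not part of PyTorch, and the real kernels differ in detail.

```python
def quantize_int8(values):
    """Symmetric per-tensor quantization: map floats to integers in [-127, 127]."""
    max_abs = max(abs(v) for v in values)
    # One scale for the whole tensor; fall back to 1.0 for an all-zero tensor
    scale = max_abs / 127 if max_abs > 0 else 1.0
    codes = [max(-127, min(127, round(v / scale))) for v in values]
    return codes, scale


def dequantize(codes, scale):
    """Recover approximate float values from the integer codes."""
    return [c * scale for c in codes]


weights = [0.42, -1.27, 0.03, 0.90]  # made-up weights for illustration
codes, scale = quantize_int8(weights)
restored = dequantize(codes, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))

print("int8 codes:", codes)
print("scale:", scale)
print("largest round-trip error:", max_error)
```

The round-trip error of any value inside the range is bounded by half the scale, which is why tensors with a few large outliers tend to quantize less accurately under a single per-tensor scale.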

In [None]:
# Load the Llama 3.2 1B model and tokenizer.
# PyTorch dynamic quantization runs on CPU, so the model is kept there in float32.
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float32)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Apply dynamic quantization to the linear layers (int8 weights)
model = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)


In [None]:
# Run inference
prompt = "What are the future implications of artificial intelligence?"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=50)
response = tokenizer.decode(output[0], skip_special_tokens=True)
print("Dynamic Quantization Response:", response)

# Parameter count: quantized linear layers store packed int8 weights that are
# not exposed through model.parameters(), so only the remaining float
# parameters are counted here
print("Float parameter count after dynamic quantization:", sum(p.numel() for p in model.parameters()))


# 8-bit Static Quantization with bitsandbytes
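With bitsandbytes, the weights of the linear layers are quantized to 8-bit integers once, when the model is loaded. This roughly halves weight memory compared with 16-bit weights. The cells below assume a CUDA GPU is available.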
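As a rough guide to why lower precision matters, the sketch below estimates weight storage at several precisions. The parameter count is an assumed round figure rather than a measurement of any model in this notebook, and the estimate ignores activations, buffers, and quantization metadata such as scales.

```python
# Bytes needed to store one weight at each precision
BYTES_PER_PARAM = {"fp32": 4.0, "fp16/bf16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_memory_gib(num_params, dtype):
    """Approximate weight storage in GiB for a given parameter count and precision."""
    return num_params * BYTES_PER_PARAM[dtype] / 1024**3


num_params = 1_000_000_000  # assumption: a model with one billion parameters
for dtype in BYTES_PER_PARAM:
    print(f"{dtype:>10}: {weight_memory_gib(num_params, dtype):.2f} GiB")
```

For a measured figure on a loaded Transformers model, `model.get_memory_footprint()` reports the actual bytes in use.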

In [None]:
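# Load model with 8-bit quantization using bitsandbytes
from transformers import BitsAndBytesConfig
bnb_config = BitsAndBytesConfig(load_in_8bit=True)
model = AutoModelForCausalLM.from_pretrained(model_name, quantization_config=bnb_config, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_name)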

In [None]:
# Run inference with the quantized model
prompt = "Applications of artificial intelligence in healthcare include"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=50)
response = tokenizer.decode(output[0], skip_special_tokens=True)
print("8-bit Static Quantization Response:", response)

# Parameter count check: the count itself is unchanged by 8-bit loading;
# the saving shows up in the bytes used per parameter
print("Parameter count after 8-bit static quantization:", sum(p.numel() for p in model.parameters()))
print("Memory footprint (bytes):", model.get_memory_footprint())


# Quantization-Aware Training (QAT)
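Quantization-Aware Training (QAT) inserts simulated ("fake") quantization into the forward pass during training so that the weights adapt to quantization error. Note that this is more resource-intensive than post-training quantization because it requires a training loop.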
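The core operation that QAT adds to the forward pass can be sketched in a few lines of plain Python: quantize a value, clamp it to the integer range, and immediately dequantize it, so the rest of the network sees the rounding error while still working in floating point. The function name `fake_quantize` and the scale used here are assumptions for illustration. PyTorch's own fake-quantization modules also pass gradients through the rounding step, which this sketch does not model.

```python
def fake_quantize(x, scale, zero_point=0, qmin=-128, qmax=127):
    """Quantize then dequantize x, returning a float that carries the int8 rounding error."""
    q = round(x / scale) + zero_point   # map to the integer grid
    q = max(qmin, min(qmax, q))         # clamp to the representable range
    return (q - zero_point) * scale     # map back to float


scale = 0.05  # assumed step size for illustration
for x in [0.12, 0.49, 7.3, -9.0]:
    print(f"{x:>6} -> {fake_quantize(x, scale):.2f}")
```

Values inside the representable range are snapped to the nearest grid point, while values outside it saturate at the range limits.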

In [None]:
import torch
import torch.nn as nn
import torch.optim as optim
import torch.quantization
import torchvision
import torchvision.transforms as transforms
from torchvision.models import mobilenet_v2

# Step 1: Set up the environment and configurations
device = "cuda" if torch.cuda.is_available() else "cpu"
batch_size = 64
epochs = 1  # For demonstration; increase for better results

# Step 2: Load and prepare the dataset (CIFAR-10)
transform = transforms.Compose([
    transforms.Resize((224, 224)),  # Resize to fit MobileNet input size
    transforms.ToTensor(),
    transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))
])

train_dataset = torchvision.datasets.CIFAR10(root='./data', train=True, download=True, transform=transform)
train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=batch_size, shuffle=True)

test_dataset = torchvision.datasets.CIFAR10(root='./data', train=False, download=True, transform=transform)
test_loader = torch.utils.data.DataLoader(test_dataset, batch_size=batch_size, shuffle=False)

# Step 3: Define and prepare the model for QAT
# Load the quantizable MobileNetV2 variant (it includes QuantStub/DeQuantStub)
# with pretrained float weights, and adjust it for CIFAR-10 (10 classes)
model = torchvision.models.quantization.mobilenet_v2(weights="IMAGENET1K_V1", quantize=False)
model.classifier[1] = nn.Linear(model.last_channel, 10)  # Adjust final layer for CIFAR-10
model = model.to(device)

# QAT requires the model in training mode
model.train()

# Fuse Conv+BN+ReLU blocks, then set QAT configurations
model.fuse_model(is_qat=True)
model.qconfig = torch.quantization.get_default_qat_qconfig('fbgemm')
torch.quantization.prepare_qat(model, inplace=True)

# Step 4: Train the model with QAT
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)

for epoch in range(epochs):
    running_loss = 0.0
    for i, (inputs, labels) in enumerate(train_loader, 0):
        inputs, labels = inputs.to(device), labels.to(device)

        # Forward pass
        optimizer.zero_grad()
        outputs = model(inputs)
        loss = criterion(outputs, labels)

        # Backward pass and optimize
        loss.backward()
        optimizer.step()

        running_loss += loss.item()
        if i % 100 == 99:  # Print every 100 mini-batches
            print(f"[Epoch {epoch + 1}, Batch {i + 1}] Loss: {running_loss / 100:.3f}")
            running_loss = 0.0

# Step 5: Convert the model to a quantized version
# The converted int8 modules (fbgemm backend) run on CPU, so move the model first
model.to("cpu")
model.eval()
torch.quantization.convert(model, inplace=True)

# Testing the quantized model on CPU
correct = 0
total = 0
with torch.no_grad():
    for inputs, labels in test_loader:
        outputs = model(inputs)
        predicted = outputs.argmax(dim=1)
        total += labels.size(0)
        correct += (predicted == labels).sum().item()

print(f"Accuracy of the quantized model on the CIFAR-10 test images: {100 * correct / total:.2f}%")




[Epoch 1, Batch 100] Loss: 1.119


### Explanation of the Steps

1. **Dataset Preparation**:
   - CIFAR-10 images are resized to 224x224 to match the input size expected by MobileNetV2, then normalized.
2. **Model Definition and Preparation**:
   - A pretrained MobileNetV2 model is loaded, and its final classification layer is adjusted to classify the 10 classes in CIFAR-10.
   - The model’s quantization configuration (`qconfig`) is set for QAT, and `prepare_qat` is called to insert observers and fake-quantization modules that simulate quantization during training.
3. **Quantization-Aware Training (QAT)**:
   - The model is trained with quantization effects simulated at each step, helping it learn to operate effectively in lower precision.
4. **Convert the Model to Quantized Form**:
   - After training, the model is converted to a quantized version for optimized inference.
5. **Evaluation**:
   - We evaluate the quantized model’s accuracy on the CIFAR-10 test set to check how well the model has maintained performance after quantization.
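During the preparation and training steps above, each instrumented layer needs a scale and zero point for its activations. One common way to obtain them is to track a moving average of the observed minimum and maximum. The class below is a simplified pure-Python sketch of that idea; its name, the `momentum` parameter, and the sample batches are hypothetical and do not mirror PyTorch's observer API.

```python
class MovingAverageMinMax:
    """Tracks a smoothed value range and derives affine quantization parameters."""

    def __init__(self, momentum=0.01):
        self.momentum = momentum
        self.min_val = None
        self.max_val = None

    def observe(self, batch):
        """Update the running min/max from one batch of activation values."""
        lo, hi = min(batch), max(batch)
        if self.min_val is None:
            self.min_val, self.max_val = lo, hi
        else:
            self.min_val += self.momentum * (lo - self.min_val)
            self.max_val += self.momentum * (hi - self.max_val)

    def qparams(self, qmin=0, qmax=255):
        """Return (scale, zero_point); call only after at least one observe()."""
        lo = min(self.min_val, 0.0)  # the range must contain zero
        hi = max(self.max_val, 0.0)
        scale = (hi - lo) / (qmax - qmin) or 1.0
        zero_point = max(qmin, min(qmax, qmin - round(lo / scale)))
        return scale, zero_point


observer = MovingAverageMinMax(momentum=0.5)
observer.observe([-1.0, 0.5, 2.0])
observer.observe([-3.0, 1.0, 4.0])
scale, zero_point = observer.qparams()
print("range:", observer.min_val, observer.max_val)
print("scale:", scale, "zero_point:", zero_point)
```

A small momentum makes the range change slowly, which keeps the quantization grid stable from one training step to the next.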

### Summary
QAT enables models to learn robustly under quantization constraints. This example uses CIFAR-10, but QAT can also be applied to larger models and datasets with adequate resources. This approach can lead to efficient, smaller models with minimal accuracy loss.

## QAT for an LLM

In [None]:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
import bitsandbytes as bnb
from torch.utils.data import DataLoader, Dataset

# Step 1: Custom Dataset for Two Sample Data Entries
class CustomTextDataset(Dataset):
    def __init__(self, tokenizer, texts, max_length=128):
        self.tokenizer = tokenizer
        self.texts = texts
        self.max_length = max_length

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, idx):
        # Tokenize text and prepare input_ids and attention_mask
        encodings = self.tokenizer(self.texts[idx], truncation=True, max_length=self.max_length, return_tensors="pt")
        # Drop only the leading batch dimension so a one-token text keeps its sequence axis
        input_ids = encodings["input_ids"].squeeze(0)  # (sequence_length,)
        attention_mask = encodings["attention_mask"].squeeze(0)
        return input_ids, attention_mask

# Sample dataset with two text entries
texts = [
    "Artificial intelligence is transforming industries by automating tasks.",
    "Quantum computing holds promise for solving complex problems faster."
]

# Load tokenizer
model_name = "EleutherAI/gpt-neo-1.3B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
dataset = CustomTextDataset(tokenizer, texts)
dataloader = DataLoader(dataset, batch_size=1, shuffle=True)

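# Step 2: Load Model and Configure for QAT
model = AutoModelForCausalLM.from_pretrained(model_name)
model.train()  # prepare_qat expects a model in training mode
model.qconfig = torch.quantization.get_default_qat_qconfig('fbgemm')
# The default QAT qconfig is not valid for embedding layers, so leave them in float
for module in model.modules():
    if isinstance(module, torch.nn.Embedding):
        module.qconfig = None
torch.quantization.prepare_qat(model, inplace=True)
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)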

# Optimizer
optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)

# Step 3: QAT Fine-Tuning Loop
model.train()
epochs = 2  # For demonstration; adjust for better training
for epoch in range(epochs):
    for batch in dataloader:
        input_ids, attention_mask = [b.to(device) for b in batch]

        # Labels are the input ids themselves; the model shifts them
        # internally so each position is trained to predict the next token
        labels = input_ids.clone()

        optimizer.zero_grad()
        outputs = model(input_ids=input_ids, attention_mask=attention_mask, labels=labels)

        # Calculate loss and update weights
        loss = outputs.loss
        loss.backward()
        optimizer.step()

        print(f"Epoch [{epoch+1}/{epochs}] Loss: {loss.item():.4f}")

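# Step 4: Convert the Model to Quantized Form
# The converted int8 modules run on CPU, so move the model there first
model.to("cpu")
model.eval()
torch.quantization.convert(model, inplace=True)

# Step 5: Evaluate Quantized Model on a Test Prompt
# Note: converted layers expect quantized inputs, so a model without
# QuantStub/DeQuantStub boundaries may need further changes to run here
test_prompt = "How will AI impact the future of healthcare?"
inputs = tokenizer(test_prompt, return_tensors="pt")  # inputs stay on CPU
output = model.generate(**inputs, max_new_tokens=50)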
response = tokenizer.decode(output[0], skip_special_tokens=True)

print("Quantized Model Response:", response)


The code above runs **Quantization-Aware Training (QAT)** on **GPT-Neo 1.3B** using a custom dataset of two sample texts, which is only large enough to demonstrate the QAT process. The model is fine-tuned with simulated 8-bit quantization effects, since the default `fbgemm` QAT configuration uses 8-bit integers.

### Requirements
Ensure `torch`, `transformers`, and `bitsandbytes` are installed, as in the `pip install` cell at the top of this notebook.

### Explanation of the Steps

1. **Custom Dataset**:
   - A simple `CustomTextDataset` class is defined to hold two sample texts, which are tokenized using the Hugging Face tokenizer for `GPT-Neo`.
   - The dataset is wrapped in a `DataLoader` for easy batch processing.

2. **Model Loading and QAT Configuration**:
   - We load `GPT-Neo 1.3B` and configure it for QAT by setting `qconfig` to the default QAT configuration for the `fbgemm` backend and calling `prepare_qat`. This prepares the model to simulate 8-bit quantization effects during training.

3. **Fine-Tuning with QAT**:
   - A standard training loop is implemented where each sample is processed, and gradients are calculated with simulated quantization effects.
   - The `labels` are set to `input_ids` because this is a causal language model: the model shifts the labels internally so that each position is trained to predict the next token.

4. **Conversion to Quantized Model**:
   - After training, `torch.quantization.convert` is called to finalize the model as a quantized version for inference.

5. **Inference on Test Prompt**:
   - The quantized model is run on a new prompt to check whether it still generates meaningful responses after quantization.
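The label handling in step 3 can be illustrated without a model. For a causal language model, passing the same sequence as inputs and labels amounts to a set of (context, next token) training examples, one per position. The helper name `next_token_examples` and the token ids below are made up for illustration.

```python
def next_token_examples(input_ids):
    """Return the (context, target) pairs implied by a causal LM loss on one sequence."""
    return [(input_ids[:i], input_ids[i]) for i in range(1, len(input_ids))]


token_ids = [101, 7, 42, 9]  # made-up token ids for illustration
for context, target in next_token_examples(token_ids):
    print(f"context={context} -> target={target}")
```

A sequence of length n therefore yields n - 1 prediction targets, and the first token is never a target because it has no preceding context.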

### Notes

- **Sample Size**: In practice, use a larger dataset for meaningful results.
- **Quantization Effects**: QAT fine-tunes the model to handle lower precision more robustly, allowing it to adapt better to 8-bit inference.
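The bit width determines how coarse the quantization grid is, and therefore how much error the model has to absorb. The sketch below compares the step size of a uniform grid at two bit widths. The value range is an assumption chosen for illustration, and `quantization_step` is a hypothetical helper.

```python
def quantization_step(value_range, bits):
    """Spacing between adjacent levels on a uniform grid with 2**bits levels."""
    return value_range / (2 ** bits - 1)


value_range = 2.0  # assumed range, for example weights in [-1, 1]
steps = {bits: quantization_step(value_range, bits) for bits in (4, 8)}
for bits, step in steps.items():
    print(f"{bits}-bit: {2 ** bits} levels, step {step:.4f}, max rounding error {step / 2:.4f}")
```

Over the same range, a 4-bit grid is 17 times coarser than an 8-bit grid, which is one reason lower bit widths usually need more careful schemes than a single uniform scale.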

This setup demonstrates the QAT process with a manageable dataset and shows the steps involved in quantizing a large language model like `GPT-Neo 1.3B` for resource-constrained environments.