<a href="https://colab.research.google.com/github/pierredantas/LLMCompress/blob/main/UoM_Pruning_Sparsity_Quantization.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Techniques for Compressing Large Language Models in Python

In this notebook, I optimize a lightweight language model (BERT base, ~0.1B parameters) by applying pruning to increase sparsity, followed by quantization to reduce memory usage and model size. The entire process runs without a GPU, which keeps it accessible in environments with limited computational resources.

## Conclusion

The results demonstrate the combined effects of pruning and quantization on the original model, focusing on parameter sparsity, memory reduction, and compression rate. Here's an analysis:

### 1. **Pruning Results**:
   - **Pruned Model Size and Memory**:
     - The pruned model size and memory allocation remain identical to the original model (109.48M parameters, 417.64 MB). Unstructured pruning sets individual weights to zero but leaves every tensor at its original shape, so each zero still occupies 4 bytes.
     - PyTorch's pruning utilities use a **mask-based approach**: each pruned layer keeps the untouched weights in `weight_orig` and a binary `weight_mask`, and exposes their product as `weight`. Nothing is physically removed, so there is no immediate reduction in size or memory usage.

   - **Sparsity After Pruning**:
     - Each `nn.Linear` weight matrix was pruned to 50% sparsity, while embeddings and LayerNorm weights were left untouched, which gives an overall sparsity of **39.10%** across the weight tensors. Because the zeroed weights are still stored, the total parameter count remains the same.
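
The mask mechanism described above can be observed on a single layer. The following is a minimal sketch on a small stand-alone `nn.Linear`; the layer size and the helper name `inspect_pruned_layer` are illustrative assumptions, not part of the notebook.

```python
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune


def inspect_pruned_layer(amount=0.5):
    """Prune a toy Linear layer and report what PyTorch actually stores."""
    torch.manual_seed(0)
    layer = nn.Linear(16, 8)
    stored_before = sum(p.numel() for p in layer.parameters())

    prune.l1_unstructured(layer, name="weight", amount=amount)

    # After pruning, `weight_orig` is the trainable parameter and `weight_mask`
    # is a buffer; `weight` is recomputed as their element-wise product.
    param_names = sorted(name for name, _ in layer.named_parameters())
    buffer_names = sorted(name for name, _ in layer.named_buffers())
    stored_after = sum(p.numel() for p in layer.parameters())
    sparsity = (layer.weight == 0).float().mean().item()
    return param_names, buffer_names, stored_before, stored_after, sparsity


names, buffers, before, after, sparsity = inspect_pruned_layer()
print(names)    # parameter names after pruning
print(buffers)  # buffer names after pruning
print(before, after, f"{sparsity:.2f}")
```

The parameter count is the same before and after pruning, even though half of the effective weights are zero, which is why the pruned model reports the same size as the original.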

### 2. **Quantization Results**:
   - **Model Size Reduction**:
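     - Dynamic quantization converts the weights of the `nn.Linear` layers from 32-bit floats (float32) to 8-bit integers (qint8). The reported **78.19% compression rate** is computed from `model.parameters()`, which no longer lists the quantized Linear layers because their weights are stored as packed int8 buffers rather than parameters. The figure therefore reflects the share of parameters that moved into quantized layers, not a measured reduction in file size.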
   
   - **Memory Allocation**:
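     - The reported **22.77 MB** (against the original **417.64 MB**, a **94.55% memory reduction rate**) counts only the 23.87M parameters that remain visible, at 1 byte each. Those remaining parameters (mainly embeddings and LayerNorm) are in fact still float32, and the int8 Linear weights are not included, so the true footprint is larger than 22.77 MB. Measuring the serialized `state_dict` gives a more reliable number.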
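
The figures above follow from simple arithmetic on parameter counts. The sketch below reproduces them from the rounded values reported in this notebook (109.48M and 23.87M parameters); the function names are illustrative, and small differences in the last digit come from that rounding. The last size shows what the remaining tensors occupy if they are still stored as float32, which is the case for parameters that dynamic quantization leaves untouched.

```python
MIB = 1024 ** 2


def size_mb(num_params, bytes_per_param):
    """Memory in MB (MiB) for a parameter count at a given width."""
    return num_params * bytes_per_param / MIB


def reduction_pct(before, after):
    """Percentage reduction from `before` to `after`."""
    return (before - after) / before * 100


original_params = 109.48e6   # reported by the notebook
remaining_params = 23.87e6   # parameters still listed after quantization

original_mb = size_mb(original_params, 4)           # float32
reported_quant_mb = size_mb(remaining_params, 1)    # what the notebook reports
remaining_fp32_mb = size_mb(remaining_params, 4)    # same tensors at float32 width

compression = reduction_pct(original_params, remaining_params)
memory_reduction = reduction_pct(original_mb, reported_quant_mb)

print(f"original: {original_mb:.2f} MB")
print(f"reported quantized: {reported_quant_mb:.2f} MB")
print(f"remaining parameters as float32: {remaining_fp32_mb:.2f} MB")
print(f"compression rate: {compression:.2f}%")
print(f"memory reduction: {memory_reduction:.2f}%")
```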

### 3. **Overall Observations**:
   - **Pruning** primarily increases sparsity, which can improve computational efficiency when supported by specialized hardware or frameworks capable of exploiting sparsity (e.g., sparse matrix multiplication). However, without removing pruned weights, the memory footprint remains unchanged.
   - **Quantization** complements pruning by reducing the bit-width of parameters, leading to substantial reductions in model size and memory usage, irrespective of sparsity.

### 4. **Recommendations**:
   - **Mask Removal for Pruned Models**:
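     - To turn sparsity into real savings, the zeroed weights must be physically removed, for example by rebuilding layers without their all-zero rows or columns. Unstructured pruning rarely zeroes a whole row, so this works best with structured pruning (e.g., `prune.ln_structured`), and any layer that consumes the reduced output must be resized to match. This would align the sparsity with the parameter count and memory usage.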
   - **Deploying Quantized Models**:
     - The quantized model is ideal for deployment on edge devices or low-resource environments, given its significantly reduced size and memory requirements.
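
The mask-removal recommendation can be sketched on a single layer. This is a minimal example under stated assumptions: it uses structured pruning (`prune.ln_structured`) rather than the unstructured pruning applied in this notebook, and the helper `shrink_linear` and the toy layer sizes are hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune


def shrink_linear(layer):
    """Return a smaller nn.Linear without the all-zero output rows of `layer`."""
    keep = layer.weight.abs().sum(dim=1) != 0
    smaller = nn.Linear(layer.in_features, int(keep.sum()), bias=layer.bias is not None)
    with torch.no_grad():
        smaller.weight.copy_(layer.weight[keep])
        if layer.bias is not None:
            smaller.bias.copy_(layer.bias[keep])
    return smaller, keep


torch.manual_seed(0)
layer = nn.Linear(8, 6)

# Structured pruning: zero the 50% of output rows (dim=0) with the smallest L2 norm.
prune.ln_structured(layer, name="weight", amount=0.5, n=2, dim=0)
prune.remove(layer, "weight")  # fold the mask into `weight`, drop weight_orig / weight_mask

smaller, keep = shrink_linear(layer)
print(layer.weight.shape, "->", smaller.weight.shape)
```

The smaller layer produces the same values as the original for the rows that were kept, but it has fewer output features, so whatever consumes its output has to be rebuilt with a matching input size.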

### 5. **Key Takeaways**:
   - The combination of **pruning** and **quantization** effectively balances sparsity, model size, and memory usage, making it a robust strategy for optimizing deep learning models.
   - For maximum impact, frameworks that exploit both sparsity and quantization simultaneously should be used during inference.

# Introduction to LLM Model Compression

Model compression is crucial for deploying large language models (LLMs) in resource-constrained environments. It aims to reduce model size and computational requirements while maintaining performance. This notebook explores common techniques for LLM model compression using Python.

In [1]:
import time

import torch
import torch.nn as nn
import torch.nn.utils.prune as prune
from transformers import AutoModel

# Load a pre-trained model (weights are downloaded on first use)
model = AutoModel.from_pretrained("bert-base-uncased")

# Print model size
print(f"Model size: {sum(p.numel() for p in model.parameters()) / 1e6:.2f}M parameters")

Model size: 109.48M parameters
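
Pruning and quantization in this notebook only touch `nn.Linear` layers, so it helps to know how the parameters split across layer types. The following is a minimal sketch; the helper `count_parameters_by_type` and the toy model are illustrative assumptions, and the same helper can be called on the loaded `model`.

```python
from collections import Counter

import torch.nn as nn


def count_parameters_by_type(model):
    """Sum parameter counts per module class (tied weights would be counted per owner)."""
    counts = Counter()
    for module in model.modules():
        # recurse=False attributes each parameter to the module that owns it
        for param in module.parameters(recurse=False):
            counts[type(module).__name__] += param.numel()
    return counts


# Toy stand-in for a transformer: an embedding followed by linear layers.
toy = nn.Sequential(
    nn.Embedding(100, 16),
    nn.Linear(16, 32),
    nn.LayerNorm(32),
    nn.Linear(32, 4),
)
counts = count_parameters_by_type(toy)
for cls_name, n in counts.most_common():
    print(f"{cls_name:<12} {n}")
```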


# Pruning - Removing Unimportant Weights

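Pruning removes less important weights from the model. Once the zeroed weights are dropped or stored in a sparse format, this can significantly reduce model size with minimal impact on performance. We'll demonstrate magnitude-based (L1) unstructured pruning, which zeroes the weights with the smallest absolute values in each `nn.Linear` layer.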

In [2]:
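# Apply magnitude-based (L1) unstructured pruning to every nn.Linear layer.
# `amount` is the fraction of weights zeroed in each layer; the model is modified in place.
def prune_model(model, amount=0.5):
    for name, module in model.named_modules():
        if isinstance(module, torch.nn.Linear):
            prune.l1_unstructured(module, name="weight", amount=amount)  # Adds weight_orig and weight_mask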
    return model
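
To make the selection rule concrete, here is a small sketch of `prune.l1_unstructured` on a layer with hand-picked weights. The layer and its values are made up for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

layer = nn.Linear(4, 1, bias=False)
with torch.no_grad():
    layer.weight.copy_(torch.tensor([[0.9, -0.1, 0.05, -0.7]]))

# amount=0.5 removes the half of the weights with the smallest absolute value.
prune.l1_unstructured(layer, name="weight", amount=0.5)

print(layer.weight_mask)  # 1 where a weight is kept, 0 where it is pruned
print(layer.weight)       # masked weights used in the forward pass
```

The two smallest-magnitude entries (-0.1 and 0.05) are zeroed, while 0.9 and -0.7 are kept.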

In [3]:
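# Function that tries to physically remove pruned weights.
# Caveats: the new model starts from random initialization and only its nn.Linear
# layers are overwritten below, so embeddings and LayerNorm parameters are not
# carried over. Unstructured pruning also rarely zeroes a whole row, so in
# practice no rows are dropped and the layer shapes stay the same.
def rebuild_pruned_model(model):
    new_model = model.__class__(model.config)  # Fresh instance with the same configuration

    for name, module in model.named_modules():
        if isinstance(module, nn.Linear):
            # module.weight is the masked weight (weight_orig * weight_mask)
            weight = module.weight.detach()
            non_zero_rows = weight.abs().sum(dim=1) != 0  # Rows (output features) with any non-zero weight
            pruned_weight = weight[non_zero_rows]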

            if module.bias is not None:
                bias = module.bias.detach()[non_zero_rows]
            else:
                bias = None

            # Update the dimensions for the new Linear layer
            in_features = module.in_features
            out_features = pruned_weight.size(0)

            # Create a new Linear layer with reduced dimensions
            new_linear = nn.Linear(in_features, out_features, bias=(bias is not None))
            new_linear.weight = nn.Parameter(pruned_weight)
            if bias is not None:
                new_linear.bias = nn.Parameter(bias)

            # Replace the module in the new model
            parent_module = new_model
            sub_names = name.split(".")
            for sub_name in sub_names[:-1]:
                parent_module = getattr(parent_module, sub_name)
            setattr(parent_module, sub_names[-1], new_linear)

    return new_model

# Function to calculate sparsity
def calculate_sparsity(model):
    total_params = 0
    total_nonzero = 0

    print(f"{'Layer':<30} {'Non-Zero':<15} {'Total':<15} {'Sparsity (%)':<15}")
    print("-" * 75)

    for name, param in model.named_parameters():
        if "weight" in name:
            num_nonzero = (param != 0).sum().item()
            num_total = param.numel()
            sparsity = 100 * (1 - num_nonzero / num_total)

            total_nonzero += num_nonzero
            total_params += num_total

            print(f"{name:<30} {num_nonzero:<15} {num_total:<15} {sparsity:<15.2f}")

    overall_sparsity = 100 * (1 - total_nonzero / total_params)
    print("-" * 75)
    print(f"{'Total':<30} {total_nonzero:<15} {total_params:<15} {overall_sparsity:<15.2f}")
    return total_nonzero, total_params, overall_sparsity

# Measure pruning time
try:
    # Original model details
    original_model_size = sum(p.numel() for p in model.parameters()) / 1e6  # In millions
    original_memory_size = sum(p.numel() for p in model.parameters()) * 4 / (1024**2)  # In MB

    # Prune the model
    start_time = time.time()
    pruned_model = prune_model(model, amount=0.5)
    pruned_model = rebuild_pruned_model(pruned_model)  # Rebuild with physically pruned weights
    end_time = time.time()

    # Calculate sparsity and memory
    pruned_nonzero, pruned_total, pruned_sparsity = calculate_sparsity(pruned_model)
    pruned_memory_size = pruned_nonzero * 4 / (1024**2)  # Idealized: counts only non-zero weights at 4 bytes each

    # Compression rate and time
    pruning_time = end_time - start_time
    compression_rate = ((original_model_size - pruned_total / 1e6) / original_model_size) * 100
    memory_reduction_rate = ((original_memory_size - pruned_memory_size) / original_memory_size) * 100

    # Output results
    print(f"Pruning completed successfully.")
    print(f"Pruning time: {pruning_time:.2f} seconds")
    print(f"------------------------------------------------------")
    print(f"Original model size: {original_model_size:.2f}M parameters")
    print(f"Pruned model size: {pruned_total / 1e6:.2f}M parameters")
    print(f"------------------------------------------------------")
    print(f"Original memory allocation: {original_memory_size:.2f} MB")
    print(f"Pruned memory allocation: {pruned_memory_size:.2f} MB")
    print(f"------------------------------------------------------")
    print(f"Sparsity after pruning: {pruned_sparsity:.2f}%")
    print(f"Compression rate: {compression_rate:.2f}%")
    print(f"Memory reduction rate: {memory_reduction_rate:.2f}%")
except Exception as e:
    print(f"Pruning failed: {e}")

Layer                          Non-Zero        Total           Sparsity (%)   
---------------------------------------------------------------------------
embeddings.word_embeddings.weight 23440120        23440896        0.00           
embeddings.position_embeddings.weight 393216          393216          0.00           
embeddings.token_type_embeddings.weight 1536            1536            0.00           
embeddings.LayerNorm.weight    768             768             0.00           
encoder.layer.0.attention.self.query.weight 294912          589824          50.00          
encoder.layer.0.attention.self.key.weight 294912          589824          50.00          
encoder.layer.0.attention.self.value.weight 294912          589824          50.00          
encoder.layer.0.attention.output.dense.weight 294912          589824          50.00          
encoder.layer.0.attention.output.LayerNorm.weight 768             768             0.00           
encoder.layer.0.intermediate.dense.weight 11
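
An alternative way to make pruning permanent, without building a new model, is `prune.remove`, which folds the mask into `weight` and deletes `weight_orig` and `weight_mask`. The sketch below applies it to a toy model; the model and the helper name `finalize_pruning` are illustrative assumptions. The tensors keep their shape, so this fixes the zeros in place rather than shrinking the layers.

```python
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune


def finalize_pruning(model):
    """Make pruning permanent on every pruned nn.Linear layer, in place."""
    for module in model.modules():
        if isinstance(module, nn.Linear) and prune.is_pruned(module):
            prune.remove(module, "weight")
    return model


torch.manual_seed(0)
toy = nn.Sequential(nn.Linear(32, 16), nn.ReLU(), nn.Linear(16, 4))
for module in toy.modules():
    if isinstance(module, nn.Linear):
        prune.l1_unstructured(module, name="weight", amount=0.5)

finalize_pruning(toy)

param_names = sorted(name for name, _ in toy.named_parameters())
weights = [p for name, p in toy.named_parameters() if name.endswith("weight")]
zeros = sum((w == 0).sum().item() for w in weights)
total = sum(w.numel() for w in weights)
print(param_names)
print(f"sparsity: {100 * zeros / total:.2f}%")
```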

# Quantization - Reducing Numerical Precision

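Quantization reduces the numerical precision of model weights and, at inference time, activations. This technique can significantly decrease model size and inference time. We'll demonstrate post-training dynamic quantization, which stores `nn.Linear` weights as 8-bit integers and quantizes activations on the fly.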

In [4]:
# Function for PyTorch dynamic quantization of the nn.Linear layers
def quantize_model(model):
    model.eval()
    quantized_model = torch.quantization.quantize_dynamic(
        model, {torch.nn.Linear}, dtype=torch.qint8
    )
    return quantized_model

# Measure quantization time
try:
    # Assume `pruned_model` is obtained from the previous step
    pruned_model_size = sum(p.numel() for p in pruned_model.parameters()) / 1e6  # In millions of parameters
    pruned_memory_size = sum(p.numel() for p in pruned_model.parameters()) * 4 / (1024**2)  # In MB (4 bytes per parameter)

    # Quantize the pruned model
    start_time = time.time()
    quantized_model = quantize_model(pruned_model)
    end_time = time.time()

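    # Calculate quantized model size
    # Note: parameters() no longer includes the quantized Linear layers (their int8 weights
    # are stored as packed buffers), and the parameters that remain are still float32.
    quantized_model_size = sum(p.numel() for p in quantized_model.parameters()) / 1e6  # In millions of parameters
    quantized_memory_size = sum(p.numel() for p in quantized_model.parameters()) * 1 / (1024**2)  # In MB (assumes 1 byte per remaining parameter)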

    # Quantization timing
    quantization_time = end_time - start_time

    # Calculate final compression rates relative to the original model
    compression_rate = ((original_model_size - quantized_model_size) / original_model_size) * 100
    memory_reduction_rate = ((original_memory_size - quantized_memory_size) / original_memory_size) * 100

    # Output results
    print(f"Quantization of pruned model completed successfully.")
    print(f"Quantization time: {quantization_time:.2f} seconds")
    print(f"------------------------------------------------------")
    print(f"Original model size: {original_model_size:.2f}M parameters")
    print(f"Pruned model size: {pruned_model_size:.2f}M parameters")
    print(f"Quantized model size: {quantized_model_size:.2f}M parameters")
    print(f"------------------------------------------------------")
    print(f"Original memory allocation: {original_memory_size:.2f} MB")
    print(f"Pruned memory allocation: {pruned_memory_size:.2f} MB")
    print(f"Quantized memory allocation: {quantized_memory_size:.2f} MB")
    print(f"------------------------------------------------------")
    print(f"Final sparsity: {pruned_sparsity:.2f}%")
    print(f"Final compression rate: {compression_rate:.2f}%")
    print(f"Final memory reduction rate: {memory_reduction_rate:.2f}%")
except Exception as e:
    print(f"Quantization failed: {e}")

Quantization of pruned model completed successfully.
Quantization time: 1.32 seconds
------------------------------------------------------
Original model size: 109.48M parameters
Pruned model size: 109.48M parameters
Quantized model size: 23.87M parameters
------------------------------------------------------
Original memory allocation: 417.64 MB
Pruned memory allocation: 417.64 MB
Quantized memory allocation: 22.77 MB
------------------------------------------------------
Final sparsity: 39.10%
Final compression rate: 78.19%
Final memory reduction rate: 94.55%
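
The sizes above are derived from `parameters()`. A complementary check is to measure the serialized `state_dict`, which includes the packed int8 weights. The sketch below does this on a toy model; the model, the helper name `serialized_size_bytes`, and the availability of a quantized CPU backend in the local PyTorch build are assumptions.

```python
import io

import torch
import torch.nn as nn


def serialized_size_bytes(model):
    """Size of the model's state_dict when saved to an in-memory buffer."""
    buffer = io.BytesIO()
    torch.save(model.state_dict(), buffer)
    return buffer.getbuffer().nbytes


torch.manual_seed(0)
fp32_model = nn.Sequential(nn.Linear(256, 256), nn.ReLU(), nn.Linear(256, 10)).eval()
int8_model = torch.quantization.quantize_dynamic(fp32_model, {nn.Linear}, dtype=torch.qint8)

# The quantized Linear layers keep their weights in packed buffers, not nn.Parameters.
visible_params = sum(p.numel() for p in int8_model.parameters())
fp32_bytes = serialized_size_bytes(fp32_model)
int8_bytes = serialized_size_bytes(int8_model)

print(f"parameters visible after quantization: {visible_params}")
print(f"float32 state_dict: {fp32_bytes} bytes")
print(f"int8 state_dict:    {int8_bytes} bytes")
```

Because every layer with weights in this toy model is an `nn.Linear`, no parameters remain visible after quantization, yet the serialized model is clearly not empty. Applying the same helper to `model` and `quantized_model` would give a size comparison that accounts for both the int8 Linear weights and the float32 tensors left unquantized.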
