<a href="https://colab.research.google.com/github/pierredantas/LLMCompress/blob/main/Successful_GlobalPruning_Quantization.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Techniques for Compressing Large Language Models in Python

In this notebook, I optimize a small language model (BERT, roughly 0.1B parameters) by applying pruning to increase sparsity, followed by quantization to reduce memory usage and model size. The entire process runs without a GPU, which makes it accessible in environments with limited computational resources.

## Conclusion

The results demonstrate the combined effects of pruning and quantization on the original model, focusing on parameter sparsity, memory reduction, and compression rate. Here's an analysis:

### 1. **Pruning Results**:
   - **Pruned Model Size and Memory**:
     - The pruned model size and memory allocation remain identical to the original model. This happens because PyTorch's pruning mechanism uses a **mask-based approach**.
     - **Mask-based pruning** applies sparsity by creating a binary `weight_mask` for each layer, which zeroes out specific weights but does not physically remove them. The original, unmasked weights are kept in `weight_orig` and still occupy memory (the mask is stored as an additional buffer), so there is no immediate reduction in size or memory usage.

   - **Sparsity After Pruning**:
     - The pruning process achieved a sparsity of **20.00%**, matching the `amount=0.20` passed to `prune.global_unstructured`: 20.00% of the prunable `Linear` weights were set to zero. However, because the weights are masked rather than removed, the total parameter count remains the same.
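
The mask-based behaviour described above can be checked on a single layer. This is a minimal sketch on a toy `nn.Linear(8, 4)` (an assumption for illustration, not one of the BERT layers) showing that `weight_orig` replaces `weight` as the parameter, that the mask is stored as a buffer, and that the parameter count does not change.

```python
import torch
from torch import nn
import torch.nn.utils.prune as prune

torch.manual_seed(0)
layer = nn.Linear(8, 4)  # toy layer, not part of BERT
params_before = sum(p.numel() for p in layer.parameters())

# Zero the 50% of weights with the smallest absolute value.
prune.l1_unstructured(layer, name="weight", amount=0.5)

param_names = [n for n, _ in layer.named_parameters()]
buffer_names = [n for n, _ in layer.named_buffers()]
params_after = sum(p.numel() for p in layer.parameters())

print(param_names)                  # 'weight_orig' is now the trainable parameter
print(buffer_names)                 # the binary mask lives in the 'weight_mask' buffer
print(params_before, params_after)  # the parameter count is unchanged

# The effective weight is weight_orig multiplied elementwise by the mask.
effective = layer.weight_orig * layer.weight_mask
sparsity = (layer.weight == 0).float().mean().item()
print(f"sparsity: {sparsity:.2f}")
```

Because the dense tensor and the mask are both kept, counting parameters or bytes after pruning cannot show a reduction.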

### 2. **Quantization Results**:
   - **Model Size Reduction**:
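     - Quantization converts the `nn.Linear` weights from 32 bits (float32) to 8 bits (qint8). The reported **78.19% compression rate** is computed from the parameter count returned by `model.parameters()`, which drops from 109.48M to 23.87M because converted layers store their weights in a packed form that `parameters()` no longer lists. The rate therefore reflects the share of parameters that were converted rather than a reduction in bytes.

   - **Memory Allocation**:
     - The notebook reports **22.77 MB** for the quantized model against **417.64 MB** for the original, a **94.55% memory reduction rate**. This figure counts only the 23.87M unconverted parameters, at 1 byte each, so it omits the int8 `Linear` weights and understates the true footprint. Even so, storing the `Linear` weights in 8 bits instead of 32 substantially reduces memory, which helps deployment on resource-constrained devices.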
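
The rates quoted above come from a relative-reduction formula, which can be reproduced from the printed figures. The sketch below does that and then adds a rough alternative estimate. That estimate rests on assumptions not stated in the notebook: the parameters still listed by `.parameters()` after conversion remain float32 (4 bytes each), and each converted `Linear` weight takes 1 byte. The function name `reduction_pct` is hypothetical.

```python
# Figures copied from the notebook's printed results.
original_params_m = 109.48   # millions of parameters
quantized_params_m = 23.87   # millions still listed by .parameters()
original_mb = 417.64
quantized_mb = 22.77

def reduction_pct(before, after):
    """Relative reduction in percent."""
    return (before - after) / before * 100

count_reduction = reduction_pct(original_params_m, quantized_params_m)
memory_reduction = reduction_pct(original_mb, quantized_mb)
print(f"{count_reduction:.1f}% fewer listed parameters")
print(f"{memory_reduction:.2f}% memory reduction as reported")

# Rough estimate under the stated assumptions (not a measurement):
# unconverted parameters at 4 bytes, converted Linear weights at 1 byte.
MIB = 1024 ** 2
linear_params_m = original_params_m - quantized_params_m
estimate_mb = (quantized_params_m * 4 + linear_params_m * 1) * 1e6 / MIB
print(f"about {estimate_mb:.0f} MB under these assumptions")
```

Measuring the serialized `state_dict` on disk would be a more direct check than either calculation.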

### 3. **Overall Observations**:
   - **Pruning** primarily increases sparsity, which can improve computational efficiency when supported by specialized hardware or frameworks capable of exploiting sparsity (e.g., sparse matrix multiplication). However, without removing pruned weights, the memory footprint remains unchanged.
   - **Quantization** complements pruning by reducing the bit-width of parameters, leading to substantial reductions in model size and memory usage, irrespective of sparsity.

### 4. **Recommendations**:
   - **Mask Removal for Pruned Models**:
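     - To fully benefit from pruning, the pruned weights must be physically removed. With unstructured pruning the zeros are scattered across each matrix, so this requires either a sparse storage format or structured pruning that zeroes whole rows or columns, after which the model can be rebuilt with reduced dimensions. This would align the sparsity with the parameter count and memory usage.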
   - **Deploying Quantized Models**:
     - The quantized model is ideal for deployment on edge devices or low-resource environments, given its significantly reduced size and memory requirements.
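
As a first step toward mask removal, `prune.remove` makes the pruning permanent by folding the mask into the weight. The sketch below uses a toy `nn.Linear(8, 4)` (an assumption for illustration). It shows that the reparametrization disappears while the tensor keeps its dense shape, so this step alone does not shrink the model.

```python
import torch
from torch import nn
import torch.nn.utils.prune as prune

torch.manual_seed(0)
layer = nn.Linear(8, 4)  # toy layer for illustration
prune.l1_unstructured(layer, name="weight", amount=0.5)

# Fold the mask into the weight and drop weight_orig / weight_mask.
prune.remove(layer, "weight")

param_names = sorted(n for n, _ in layer.named_parameters())
buffer_names = [n for n, _ in layer.named_buffers()]
zeros = int((layer.weight == 0).sum())

print(param_names)                 # 'weight' is a regular parameter again
print(buffer_names)                # no mask buffer is left
print(zeros, tuple(layer.weight.shape))  # zeros are kept, the shape is unchanged
```

Shrinking the tensor itself would need structured pruning followed by rebuilding the layer, or a sparse storage format.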

### 5. **Key Takeaways**:
   - The combination of **pruning** and **quantization** effectively balances sparsity, model size, and memory usage, making it a robust strategy for optimizing deep learning models.
   - For maximum impact, frameworks that exploit both sparsity and quantization simultaneously should be used during inference.

# Introduction to LLM Model Compression

Model compression is crucial for deploying large language models (LLMs) in resource-constrained environments. It aims to reduce model size and computational requirements while maintaining performance. This notebook explores common techniques for LLM model compression using Python.

In [1]:
import torch
from torch import nn
import torch.nn.utils.prune as prune
from transformers import BertModel, BertConfig
import time

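# Build a BERT model with the bert-base-uncased architecture.
# Note: BertModel(config) initializes the weights randomly; use
# BertModel.from_pretrained('bert-base-uncased') to load pretrained weights.
config = BertConfig.from_pretrained('bert-base-uncased')
model = BertModel(config)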

# Store original model size
original_model_size = sum(p.numel() for p in model.parameters()) / 1e6
original_memory_size = sum(p.numel() for p in model.parameters()) * 4 / (1024 ** 2)  # Assuming float32
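
The memory figure above assumes 4 bytes per parameter. As a hedged alternative, the helper below reads the actual element size of each tensor and also counts buffers, which matters once pruning masks are added. The name `model_bytes` and the toy model are assumptions for illustration.

```python
import torch
from torch import nn

def model_bytes(model: nn.Module) -> int:
    """Sum the storage of all parameters and buffers, in bytes."""
    tensors = list(model.parameters()) + list(model.buffers())
    return sum(t.numel() * t.element_size() for t in tensors)

# Stand-in model: a Linear layer plus a BatchNorm layer that owns buffers.
toy = nn.Sequential(nn.Linear(16, 8), nn.BatchNorm1d(8))
n_params = sum(p.numel() for p in toy.parameters())
total_bytes = model_bytes(toy)
print(n_params, total_bytes)
```

Calling `model_bytes(model) / (1024 ** 2)` on the BERT model would give a figure in the same unit as `original_memory_size`.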

# Pruning - Removing Unimportant Weights

Pruning involves removing less important weights from the model. This technique can significantly reduce model size with minimal impact on performance. We'll demonstrate magnitude-based pruning.
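
Magnitude-based pruning can be written out by hand in a few lines. The sketch below uses a toy 4x5 matrix (an assumption for illustration): it finds the threshold below which the smallest 20% of absolute values fall and zeroes them with a mask. It mirrors the idea behind L1 unstructured pruning rather than reproducing PyTorch's internal implementation.

```python
import torch

torch.manual_seed(0)
weights = torch.randn(4, 5)   # toy weight matrix
amount = 0.2                  # fraction of weights to prune

k = int(round(amount * weights.numel()))   # number of weights to zero
# The k-th smallest absolute value is the pruning threshold.
threshold = weights.abs().flatten().kthvalue(k).values
mask = (weights.abs() > threshold).to(weights.dtype)
pruned = weights * mask

zeros = int((pruned == 0).sum())
print(k, zeros)
```

The surviving weights keep their original values; only the mask decides which entries are zeroed.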

In [2]:
# Specify the parameters to prune
parameters_to_prune = []
for name, module in model.named_modules():
    if isinstance(module, (nn.Linear, nn.Conv2d)):
        parameters_to_prune.append((module, 'weight'))

# Apply global unstructured pruning
prune.global_unstructured(
    parameters_to_prune,
    pruning_method=prune.L1Unstructured,
    amount=0.20
)

# Calculate pruned model size
pruned_model_size = sum(p.numel() for p in model.parameters()) / 1e6
pruned_memory_size = sum(p.numel() for p in model.parameters()) * 4 / (1024 ** 2)

# Calculate and print sparsity for each layer
total_sparsity = 0
total_elements = 0
for module, param_name in parameters_to_prune:
    param = getattr(module, param_name)
    num_zeros = float(torch.sum(param == 0))
    num_elements = float(param.nelement())
    sparsity = 100. * num_zeros / num_elements
    total_sparsity += num_zeros
    total_elements += num_elements
    print(f"Sparsity in {module.__class__.__name__} {param_name}: {sparsity:.2f}%")

# Calculate and print global sparsity
global_sparsity = 100. * total_sparsity / total_elements
print(f"Global sparsity: {global_sparsity:.2f}%")

Sparsity in Linear weight: 20.03%
Sparsity in Linear weight: 19.97%
Sparsity in Linear weight: 20.00%
Sparsity in Linear weight: 19.88%
Sparsity in Linear weight: 20.03%
Sparsity in Linear weight: 20.02%
Sparsity in Linear weight: 20.03%
Sparsity in Linear weight: 19.92%
Sparsity in Linear weight: 20.01%
Sparsity in Linear weight: 20.01%
Sparsity in Linear weight: 19.98%
Sparsity in Linear weight: 19.97%
Sparsity in Linear weight: 19.98%
Sparsity in Linear weight: 20.08%
Sparsity in Linear weight: 20.06%
Sparsity in Linear weight: 19.99%
Sparsity in Linear weight: 19.99%
Sparsity in Linear weight: 20.01%
Sparsity in Linear weight: 20.13%
Sparsity in Linear weight: 19.99%
Sparsity in Linear weight: 20.00%
Sparsity in Linear weight: 19.95%
Sparsity in Linear weight: 19.98%
Sparsity in Linear weight: 20.01%
Sparsity in Linear weight: 19.95%
Sparsity in Linear weight: 20.07%
Sparsity in Linear weight: 19.98%
Sparsity in Linear weight: 20.04%
Sparsity in Linear weight: 20.00%
Sparsity in Li
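
The per-layer values above all sit close to 20%, which is what one would expect when every layer's weights follow a similar distribution. Global pruning does not guarantee this, because it ranks all weights together. The sketch below uses two toy layers with deliberately different weight scales (an assumption for illustration) to show how the per-layer sparsity can diverge while the global figure stays at the requested amount.

```python
import torch
from torch import nn
import torch.nn.utils.prune as prune

torch.manual_seed(0)
small = nn.Linear(32, 32)
large = nn.Linear(32, 32)
with torch.no_grad():
    small.weight.mul_(0.1)   # shrink one layer's weights
    large.weight.mul_(10.0)  # enlarge the other layer's weights

prune.global_unstructured(
    [(small, "weight"), (large, "weight")],
    pruning_method=prune.L1Unstructured,
    amount=0.2,
)

def zero_count(module):
    return int((module.weight == 0).sum())

s_small = zero_count(small) / small.weight.numel()
s_large = zero_count(large) / large.weight.numel()
overall = (zero_count(small) + zero_count(large)) / (
    small.weight.numel() + large.weight.numel()
)
print(f"small: {s_small:.2%}, large: {s_large:.2%}, overall: {overall:.2%}")
```

Most of the pruning budget lands on the layer with the smaller weights, which is one reason to inspect per-layer sparsity after global pruning.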

# Quantization - Reducing Numerical Precision

Quantization reduces the numerical precision of model weights and activations. This technique can significantly decrease model size and inference time. We'll demonstrate post-training static quantization.
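
To make "reducing numerical precision" concrete, the sketch below quantizes a toy float32 matrix to int8 by hand. It uses a single symmetric per-tensor scale, which is a simplifying assumption: PyTorch's observers generally also compute a zero point, so this illustrates the principle rather than the exact scheme the library applies.

```python
import torch

torch.manual_seed(0)
w = torch.randn(3, 4)                       # toy float32 weights

scale = w.abs().max() / 127                 # symmetric per-tensor scale
q = torch.clamp(torch.round(w / scale), -127, 127).to(torch.int8)
w_hat = q.to(torch.float32) * scale         # dequantized approximation

max_err = (w - w_hat).abs().max().item()
bound = scale.item() / 2                    # rounding error is at most half a step
print(q.dtype, q.element_size(), w.element_size())  # 1 byte per value instead of 4
print(f"max abs error: {max_err:.4f}, bound: {bound:.4f}")
```

Each value now takes one byte instead of four, at the cost of a rounding error bounded by half the quantization step.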

In [3]:
# Post-training static quantization (eager mode), applied to Linear layers only.
# Note: no calibration data is passed between prepare() and convert(), so the
# activation observers never see real inputs. This is enough to compare sizes,
# but the model should be calibrated before it is used for inference.
def quantize_model(target_model):
    target_model.eval()
    # Attach a quantization config to every Linear layer
    for module in target_model.modules():
        if isinstance(module, torch.nn.Linear):
            module.qconfig = torch.quantization.default_qconfig
    torch.quantization.prepare(target_model, inplace=True)  # insert observers
    torch.quantization.convert(target_model, inplace=True)  # swap in int8 modules
    return target_model

# Measure quantization time
try:
    # Quantize the pruned model
    start_time = time.time()
    final_quantized_model = quantize_model(model)  # Use the pruned model here
    end_time = time.time()

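    # Calculate quantized model size.
    # Note: converted Linear layers keep their int8 weights in packed form,
    # which .parameters() does not list, so these figures cover only the
    # parameters that were not converted (embeddings, LayerNorm).
    quantized_model_size = sum(p.numel() for p in final_quantized_model.parameters()) / 1e6
    quantized_memory_size = sum(p.numel() for p in final_quantized_model.parameters()) * 1 / (1024 ** 2)  # Counts 1 byte per listed parameter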

    # Quantization timing
    quantization_time = end_time - start_time

    # Final compression rates
    compression_rate = ((original_model_size - quantized_model_size) / original_model_size) * 100
    memory_reduction_rate = ((original_memory_size - quantized_memory_size) / original_memory_size) * 100

    # Output results
    print(f"------------------------------------------------------")
    print(f"Quantization of pruned model completed successfully.")
    print(f"Quantization time: {quantization_time:.2f} seconds")
    print(f"------------------------------------------------------")
    print(f"Original model size: {original_model_size:.2f}M parameters")
    print(f"Pruned model size: {pruned_model_size:.2f}M parameters")
    print(f"Quantized model size: {quantized_model_size:.2f}M parameters")
    print(f"------------------------------------------------------")
    print(f"Original memory allocation: {original_memory_size:.2f} MB")
    print(f"Pruned memory allocation: {pruned_memory_size:.2f} MB")
    print(f"Quantized memory allocation: {quantized_memory_size:.2f} MB")
    print(f"------------------------------------------------------")
    print(f"Final sparsity: {global_sparsity:.2f}%")  # Use the calculated global sparsity
    print(f"Final compression rate: {compression_rate:.2f}%")
    print(f"Final memory reduction rate: {memory_reduction_rate:.2f}%")
except Exception as e:
    print(f"Quantization failed: {e}")



------------------------------------------------------
Quantization of pruned model completed successfully.
Quantization time: 0.87 seconds
------------------------------------------------------
Original model size: 109.48M parameters
Pruned model size: 109.48M parameters
Quantized model size: 23.87M parameters
------------------------------------------------------
Original memory allocation: 417.64 MB
Pruned memory allocation: 417.64 MB
Quantized memory allocation: 22.77 MB
------------------------------------------------------
Final sparsity: 20.00%
Final compression rate: 78.19%
Final memory reduction rate: 94.55%
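
The drop in the reported parameter count can be reproduced on a small model. The sketch below swaps in a different technique, dynamic quantization via `torch.quantization.quantize_dynamic`, because it needs no calibration step. The toy model is an assumption for illustration. It shows that quantized `Linear` layers no longer appear in `.parameters()` even though their weights still exist in packed int8 form.

```python
import torch
from torch import nn

torch.manual_seed(0)
toy = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 4)).eval()
params_fp32 = sum(p.numel() for p in toy.parameters())

# Weights become int8; activations are quantized on the fly at inference time.
quantized = torch.quantization.quantize_dynamic(toy, {nn.Linear}, dtype=torch.qint8)

x = torch.randn(2, 16)
with torch.no_grad():
    out_fp32 = toy(x)
    out_int8 = quantized(x)

visible_params = sum(p.numel() for p in quantized.parameters())
max_diff = (out_fp32 - out_int8).abs().max().item()
print(params_fp32, visible_params)       # the quantized Linear weights are no longer listed
print(tuple(out_int8.shape), max_diff)   # the model still runs, with a small numerical difference
```

A parameter count taken this way therefore says how many parameters were left unconverted, not how large the quantized model is.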
