<a href="https://colab.research.google.com/github/peremartra/LLMOptCost/blob/main/pruning_diltilgpt2_o1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


<div>
    <h1>Large Language Models Projects</h1>
    <h3>Apply and Implement Strategies for Large Language Models</h3>
    <h2>Pruning distilGPT2.</h2>
    <h3></h3>
</div>

by [Pere Martra](https://www.linkedin.com/in/pere-martra/)
_______
Models: distilgpt2

Colab Environment: CPU / GPU T4.

Keys:
* Pruning
* Structured pruning


Related article: --.
_______
This is the unofficial repository for the book:
        <a href="https://amzn.to/4eanT1g"> <b>Large Language Models:</b> Apply and Implement Strategies for Large Language Models</a> (Apress).
        The book is based on the content of this repository, but the notebooks are being updated, and I am incorporating new examples and chapters.
        If you are looking for the official repository for the book, with the original notebooks, you should visit the
        <a href="https://github.com/Apress/Large-Language-Models-Projects">Apress repository</a>, where you can find all the notebooks in their original format as they appear in the book.

# Pruning
Pruning is an important optimization technique in machine learning that aims to reduce the size of a model without sacrificing much of its accuracy. By removing less important components, pruning not only decreases the computational cost but also makes the model more efficient for deployment, especially on resource-constrained devices.

Pruning can be compared to quantization, another optimization technique that reduces the precision of the model's weights, typically converting them from high-precision floating-point numbers to lower-precision representations. While quantization can significantly reduce model size and speed up inference, it does not selectively remove weights.

Pruning, on the other hand, allows for targeted removal of less important weights or neurons, which can reduce model size more efficiently while better preserving accuracy. By selecting the weights to eliminate based on their importance scores, pruning provides more control over the model's structure, often making it a more effective approach when aiming for both model compression and high performance.
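
The contrast can be illustrated on a toy tensor. The sketch below is a simplified assumption, not how any particular library quantizes or prunes: it applies a naive symmetric int8 quantization and drops two arbitrarily chosen columns. The helper `nbytes` and the column choice are hypothetical. Quantization keeps every weight but stores each one in fewer bytes, while pruning keeps full precision but stores fewer weights.

```python
import torch

torch.manual_seed(0)
weight = torch.randn(4, 8)  # 32 float32 weights


def nbytes(t):
    """Storage size of a tensor's elements in bytes (hypothetical helper)."""
    return t.numel() * t.element_size()


# Naive symmetric int8 quantization: same number of weights, 1 byte each.
scale = weight.abs().max() / 127
quantized = torch.round(weight / scale).to(torch.int8)

# Structured pruning: drop two whole columns, keep float32 precision.
# The columns kept here are an arbitrary choice for illustration.
columns_to_keep = torch.tensor([0, 1, 2, 4, 5, 7])
pruned = weight[:, columns_to_keep]

print(f"original : {weight.numel()} weights, {nbytes(weight)} bytes")
print(f"quantized: {quantized.numel()} weights, {nbytes(quantized)} bytes")
print(f"pruned   : {pruned.numel()} weights, {nbytes(pruned)} bytes")
```

Only the pruned tensor changes shape, which is why pruning alters the model's architecture while quantization does not.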

The effectiveness of removing specific parts of a model could be debated, but recent studies, such as NVIDIA's [How to Prune and Distill Llama-3.1 8B to an NVIDIA Llama-3.1-Minitron 4B Model](https://developer.nvidia.com/blog/how-to-prune-and-distill-llama-3-1-8b-to-an-nvidia-llama-3-1-minitron-4b-model/), conclude that pruning, combined with fine-tuning techniques applied after pruning, can produce models that are not only more efficient but also more effective in specific domains.

You can also combine both techniques and quantize a model that has previously been pruned.

This notebook focuses on **structured width pruning**, where entire neurons are eliminated based on their low importance scores, which are computed using the L1 norm. The assumption is that neurons with lower L1 norm values contribute less to the overall output of the model, allowing for safe removal to enhance efficiency without drastically impacting accuracy.
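
As a minimal sketch of the scoring step, assume a toy weight matrix laid out like GPT-2's `Conv1D` weights, with shape `[input_features, neurons]`. The L1 norm of each column gives one importance score per neuron, and `torch.topk` selects the neurons to keep. The matrix values and the `keep` count are made up for illustration and are not taken from distilgpt2.

```python
import torch

# Toy weight matrix: 3 input features (rows) x 4 neurons (columns).
# Values are illustrative only.
weight = torch.tensor([
    [0.5, -0.1, 2.0, 0.0],
    [-1.0, 0.2, -0.5, 0.1],
    [0.5, -0.1, 1.5, 0.0],
])

# L1 norm per neuron: sum of absolute values down each column.
importance = weight.abs().sum(dim=0)

# Keep the 2 highest-scoring neurons, then sort the indices so the
# surviving neurons stay in their original order.
keep = 2
_, kept_idx = torch.topk(importance, keep)
kept_idx, _ = torch.sort(kept_idx)

# Slice the columns that survive pruning.
pruned_weight = weight[:, kept_idx]

print(importance)           # scores: 2.0, 0.4, 4.0, 0.1
print(kept_idx.tolist())    # [0, 2]
print(pruned_weight.shape)  # torch.Size([3, 2])
```

Neurons 1 and 3 have the smallest column norms, so they are the ones removed.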

# Install Libraries & Configure variables.

In [None]:
#Install necessary libraries
!pip install -q transformers
!pip install -q torch

In [None]:
#Import libraries
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from torch import nn
import os

# Check if GPU is available
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Using device: {device}")



Using device: cuda


In [None]:
prune_percent = 0.2  # For example, prune 20% of neurons
model_name = 'distilgpt2'


In [None]:
def count_parameters(model):
    return sum(p.numel() for p in model.parameters())

# Download Model and explore structure.

In [None]:
#Download the model and tokenizer
model = AutoModelForCausalLM.from_pretrained(model_name).to(device)
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token


In [None]:
def get_ouput(prompt, model=model, tokenizer=tokenizer):
  inputs = tokenizer(prompt, return_tensors='pt').to(device)
  outputs = model.generate(inputs['input_ids'],
                           attention_mask=inputs['attention_mask'],
                           max_length=10,
                           num_return_sequences=1)
  generated = tokenizer.decode(outputs[0], skip_special_tokens=True)
  return generated

In [None]:
print(model)

GPT2LMHeadModel(
  (transformer): GPT2Model(
    (wte): Embedding(50257, 768)
    (wpe): Embedding(1024, 768)
    (drop): Dropout(p=0.1, inplace=False)
    (h): ModuleList(
      (0-5): 6 x GPT2Block(
        (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (attn): GPT2SdpaAttention(
          (c_attn): Conv1D()
          (c_proj): Conv1D()
          (attn_dropout): Dropout(p=0.1, inplace=False)
          (resid_dropout): Dropout(p=0.1, inplace=False)
        )
        (ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (mlp): GPT2MLP(
          (c_fc): Conv1D()
          (c_proj): Conv1D()
          (act): NewGELUActivation()
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
    )
    (ln_f): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
  )
  (lm_head): Linear(in_features=768, out_features=50257, bias=False)
)


In [None]:
print(model.config)

GPT2Config {
  "_name_or_path": "distilgpt2",
  "_num_labels": 1,
  "activation_function": "gelu_new",
  "architectures": [
    "GPT2LMHeadModel"
  ],
  "attn_pdrop": 0.1,
  "bos_token_id": 50256,
  "embd_pdrop": 0.1,
  "eos_token_id": 50256,
  "id2label": {
    "0": "LABEL_0"
  },
  "initializer_range": 0.02,
  "label2id": {
    "LABEL_0": 0
  },
  "layer_norm_epsilon": 1e-05,
  "model_type": "gpt2",
  "n_ctx": 1024,
  "n_embd": 768,
  "n_head": 12,
  "n_inner": null,
  "n_layer": 6,
  "n_positions": 1024,
  "reorder_and_upcast_attn": false,
  "resid_pdrop": 0.1,
  "scale_attn_by_inverse_layer_idx": false,
  "scale_attn_weights": true,
  "summary_activation": null,
  "summary_first_dropout": 0.1,
  "summary_proj_to_labels": true,
  "summary_type": "cls_index",
  "summary_use_proj": true,
  "task_specific_params": {
    "text-generation": {
      "do_sample": true,
      "max_length": 50
    }
  },
  "transformers_version": "4.44.2",
  "use_cache": true,
  "vocab_size": 50257
}



In [None]:
#Test the original model with a simple prompt
prompt = "Paris is the capital of"
generated = get_ouput(prompt)
print(f"Generated text: {generated}")

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generated text: Paris is the capital of the United States.



In [None]:
#Print the size of the original model
original_param_count = count_parameters(model)
print(f"Original model parameters: {original_param_count}")

Original model parameters: 81912576


## Pruning Model.

In [None]:
#Prune the MLP layers based on weight magnitude (adjusted for Conv1D layers)
from transformers.models.gpt2.modeling_gpt2 import Conv1D

# Initialize new_intermediate_size
new_intermediate_size = None

for idx, block in enumerate(model.transformer.h):
    mlp = block.mlp

    # Get the weights of the c_fc layer (input projection)
    # c_fc.weight: [hidden_size, intermediate_size]
    c_fc_weight = mlp.c_fc.weight.data

    # Compute importance scores (L1 norm over input dimension for each output neuron)
    importance_scores = torch.sum(torch.abs(c_fc_weight), dim=0)  # Shape: [intermediate_size]

    # Determine the number of neurons to prune
    original_intermediate_size = c_fc_weight.size(1)  # This is intermediate_size
    num_neurons_to_prune = int(prune_percent * original_intermediate_size)

    # Get indices of neurons to keep (those with highest importance)
    _, indices_to_keep = torch.topk(importance_scores, original_intermediate_size - num_neurons_to_prune)

    # Sort indices to maintain order
    indices_to_keep, _ = torch.sort(indices_to_keep)

    # Create new Conv1D layers with reduced size
    # For c_fc: Conv1D(new_intermediate_size, hidden_size)
    new_c_fc = Conv1D(len(indices_to_keep), mlp.c_fc.weight.size(0)).to(device)

    # For c_proj: Conv1D(hidden_size, new_intermediate_size)
    new_c_proj = Conv1D(mlp.c_proj.weight.size(1), len(indices_to_keep)).to(device)

    # Copy over the weights and biases for neurons we're keeping
    # c_fc.weight: [hidden_size, intermediate_size], select columns
    new_c_fc.weight.data = mlp.c_fc.weight.data[:, indices_to_keep]
    new_c_fc.bias.data = mlp.c_fc.bias.data[indices_to_keep]

    # c_proj.weight: [intermediate_size, hidden_size], select rows
    new_c_proj.weight.data = mlp.c_proj.weight.data[indices_to_keep, :]
    new_c_proj.bias.data = mlp.c_proj.bias.data

    # Replace the layers in the model
    mlp.c_fc = new_c_fc
    mlp.c_proj = new_c_proj

    # Set new_intermediate_size if not already set
    if new_intermediate_size is None:
        new_intermediate_size = len(indices_to_keep)

# After pruning, update the model configuration
model.config.n_inner = new_intermediate_size
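
One way to sanity-check the slicing logic above is on a toy MLP, without loading distilgpt2. The sketch below assumes the `Conv1D` convention `y = x @ W + b` with weights shaped `[in_features, out_features]`, and uses `torch.nn.functional.gelu` with the tanh approximation as a stand-in for `NewGELUActivation`. The sizes and the helper `mlp_forward` are hypothetical. Because GELU maps 0 to 0, removing a neuron should give the same output as zeroing its `c_fc` column and bias, which is what the check verifies.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
hidden, inter = 8, 32  # toy sizes, not distilgpt2's 768 / 3072
dtype = torch.float64  # double precision keeps the comparison tight

w_fc = torch.randn(hidden, inter, dtype=dtype)
b_fc = torch.randn(inter, dtype=dtype)
w_proj = torch.randn(inter, hidden, dtype=dtype)
b_proj = torch.randn(hidden, dtype=dtype)
x = torch.randn(4, hidden, dtype=dtype)


def mlp_forward(x, w_fc, b_fc, w_proj, b_proj):
    """Toy GPT-2 style MLP: project up, GELU, project down."""
    h = F.gelu(x @ w_fc + b_fc, approximate="tanh")
    return h @ w_proj + b_proj


# Score neurons by the L1 norm of their c_fc column and keep the top 80%.
importance = w_fc.abs().sum(dim=0)
n_keep = inter - int(0.2 * inter)
_, keep = torch.topk(importance, n_keep)
keep, _ = torch.sort(keep)

# Pruned MLP: slice c_fc columns and bias, and the matching c_proj rows.
pruned_out = mlp_forward(x, w_fc[:, keep], b_fc[keep], w_proj[keep, :], b_proj)

# Reference: full-size MLP with the removed neurons zeroed out.
mask = torch.zeros(inter, dtype=dtype)
mask[keep] = 1.0
masked_out = mlp_forward(x, w_fc * mask, b_fc * mask, w_proj, b_proj)

max_diff = (pruned_out - masked_out).abs().max().item()
print(f"kept {n_keep} of {inter} neurons, max difference: {max_diff:.2e}")
```

The `c_proj` bias is left untouched in both versions, matching the notebook, because it belongs to the output dimension rather than to any pruned neuron.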

## Check pruned model.

In [None]:
# Recalculate the number of parameters
pruned_param_count = count_parameters(model)
print(f"Pruned model parameters: {pruned_param_count}")
print(f"Reduction in parameters: {original_param_count - pruned_param_count}")

Pruned model parameters: 76250268
Reduction in parameters: 5662308
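
The reported reduction can be reproduced by hand. The sketch below assumes the values printed earlier in the notebook: 6 blocks, a hidden size of 768, the default intermediate size of 4 × 768 = 3072, and `prune_percent = 0.2`. Each pruned neuron removes one column of `c_fc`, one `c_fc` bias entry, and one row of `c_proj`.

```python
n_layer = 6
hidden_size = 768
intermediate_size = 4 * hidden_size  # 3072, GPT-2 default when n_inner is null
prune_percent = 0.2
original_param_count = 81912576      # value printed by the notebook

pruned_per_layer = int(prune_percent * intermediate_size)  # 614
kept_per_layer = intermediate_size - pruned_per_layer      # 2458

removed_per_layer = (
    hidden_size * pruned_per_layer    # c_fc weight columns
    + pruned_per_layer                # c_fc bias entries
    + pruned_per_layer * hidden_size  # c_proj weight rows
)
total_removed = n_layer * removed_per_layer

print(f"Neurons kept per layer: {kept_per_layer}")                   # 2458
print(f"Reduction in parameters: {total_removed}")                   # 5662308
print(f"Pruned model parameters: {original_param_count - total_removed}")  # 76250268
```

Note that removing 20% of the MLP neurons shrinks the whole model by only about 7%, because the embeddings and attention layers are untouched.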


In [None]:
print(model)

In [None]:
print(model.config)

In [None]:
generated = get_ouput(prompt)
print(f"Generated text after pruning: {generated}")

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generated text after pruning: Paris is the capital of the United States, and


In [None]:
# Save the pruned model
output_dir = './pruned_distilgpt2'
os.makedirs(output_dir, exist_ok=True)

model.save_pretrained(output_dir)
tokenizer.save_pretrained(output_dir)
print(f"Pruned model saved to {output_dir}")

Pruned model saved to ./pruned_distilgpt2


In [None]:
# Push the model to your Hugging Face repository
name_model_to_push="pruned_distilgpt2"

model.push_to_hub(name_model_to_push,
                  private=True,
                  use_temp_dir=False)





No files have been modified since last commit. Skipping to prevent empty commit.


CommitInfo(commit_url='https://huggingface.co/oopere/pruned_distilgpt2/commit/1ca19c3fc9319fe913e88d47e92af02303c318d4', commit_message='Upload model', commit_description='', oid='1ca19c3fc9319fe913e88d47e92af02303c318d4', pr_url=None, pr_revision=None, pr_num=None)

In [None]:
tokenizer.push_to_hub(name_model_to_push,
                      private=False,
                      use_temp_dir=False)

No files have been modified since last commit. Skipping to prevent empty commit.


CommitInfo(commit_url='https://huggingface.co/oopere/pruned_distilgpt2/commit/1ca19c3fc9319fe913e88d47e92af02303c318d4', commit_message='Upload tokenizer', commit_description='', oid='1ca19c3fc9319fe913e88d47e92af02303c318d4', pr_url=None, pr_revision=None, pr_num=None)

In [None]:
# Download the pruned model from Hugging Face
pruned_model_name = 'oopere/pruned_distilgpt2'
pruned_model = AutoModelForCausalLM.from_pretrained(pruned_model_name).to(device)
pruned_tokenizer = AutoTokenizer.from_pretrained(pruned_model_name)


In [None]:
generated = get_ouput(prompt, pruned_model, pruned_tokenizer)
print(f"Pruned Downloaded Generated text: {generated}")

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Pruned Downloaded Generated text: Paris is the capital of the United States, and
