<a href="https://colab.research.google.com/github/peremartra/Large-Language-Model-Notebooks-Course/blob/main/6_2_pruning_structured_llama3.2-1b_KO.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


<div>
    <h1>Large Language Models Projects</h1>
    <h3>Apply and Implement Strategies for Large Language Models</h3>
    <h2>Pruning Llama 3.2.</h2>
    <h3>Example of INCORRECT approach to pruning a Llama Model.</h3>
</div>

by [Pere Martra](https://www.linkedin.com/in/pere-martra/)
_______
Models: meta-llama/Llama-3.2-1B

Colab Environment: GPU T4.

Keys:
* Pruning
* Structured pruning


Related article: --.
_______
This is the unofficial repository for the book:
        <a href="https://amzn.to/4eanT1g"> <b>Large Language Models:</b> Apply and Implement Strategies for Large Language Models</a> (Apress).
        The book is based on the content of this repository, but the notebooks are being updated, and I am incorporating new examples and chapters.
        If you are looking for the official repository for the book, with the original notebooks, you should visit the
        <a href="https://github.com/Apress/Large-Language-Models-Projects">Apress repository</a>, where you can find all the notebooks in their original format as they appear in the book.
______

# Introduction
This notebook continues the work done in [6_1_pruning_structured_l1_diltilgpt2.ipynb](https://github.com/peremartra/Large-Language-Model-Notebooks-Course/blob/main/6-PRUNING/6_1_pruning_structured_l1_diltilgpt2.ipynb), where pruning was applied to a distilGPT2 model.

The pruning process selected the neurons in the model's feedforward layers with the lowest L1 norm, on the assumption that these contribute the least to the model's output.
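As a minimal sketch of that selection criterion, the toy example below computes an L1-norm score per neuron. The weight values are made up for illustration; the only assumption is the `nn.Linear` convention that each row of the weight matrix corresponds to one output neuron.

```python
import torch

# Toy weight matrix: 4 neurons (rows) x 3 input features (columns).
# nn.Linear stores weights as [out_features, in_features], so each row is one neuron.
weight = torch.tensor([
    [0.5, -0.5, 1.0],
    [0.1, 0.0, -0.1],
    [-2.0, 1.0, 0.5],
    [0.0, 0.25, 0.25],
])

# L1 norm per neuron: sum of absolute weights along the input dimension.
l1_scores = weight.abs().sum(dim=1)

# The neuron with the smallest score is the first pruning candidate.
least_important = int(torch.argmin(l1_scores))
print(l1_scores.tolist(), least_important)
```

Here the second neuron (index 1) has the smallest score, so it would be the first one removed under this criterion.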

In this notebook, the same process is applied to a state-of-the-art model from the Llama family. The results, however, are not as expected, simply because the model's structure is very different, and the method needs to be adapted to these characteristics.

**In this notebook, we'll identify the main issues so we can address them in a follow-up notebook.**

# Install libraries & configure variables.

In [None]:
#Install necessary libraries.
!pip install -q transformers
!pip install -q torch
!pip install -q sentencepiece  # Required for LLaMA tokenizer

In [None]:
#Import libraries
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from torch import nn
import os

# Check if GPU is available
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Using device: {device}")

Using device: cuda


# Download model and explore structure

In [None]:
model_name = 'meta-llama/Llama-3.2-1B'
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16).to(device)
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token  # Set pad token

In [None]:
def get_output(prompt, model=model, tokenizer=tokenizer):
    inputs = tokenizer(prompt, return_tensors='pt').to(device)
    outputs = model.generate(
        inputs['input_ids'],
        attention_mask=inputs['attention_mask'],
        max_length=50,
        num_return_sequences=1,
        pad_token_id=tokenizer.pad_token_id
    )
    generated = tokenizer.decode(outputs[0], skip_special_tokens=True)
    return generated

## Studying the model structure

As you already know, studying the model’s structure is crucial for a successful pruning process.

In this example, I’ll use the same pruning approach as in the previous example with a distilGPT2 model, which has a different structure. You can see the structure and the example in the notebook: [6_1_pruning_structured_l1_diltilgpt2.ipynb](https://github.com/peremartra/Large-Language-Model-Notebooks-Course/blob/main/6-PRUNING/6_1_pruning_structured_l1_diltilgpt2.ipynb).

The process involved removing a percentage of the neurons with the lowest importance scores from the feedforward layers of the model, located within the MLP module. In the GPT2 model, these layers were called `c_fc` and `c_proj`, while in the Llama model they are `gate_proj`, `up_proj`, and additionally `down_proj`.

But the name isn’t the most important part: these layers have a very different structure and function compared to the `MLP` module layers in the distilGPT2 model.

Understanding these differences will be crucial for defining the pruning process. In this notebook, we will examine how the Llama model is negatively affected by a pruning process that worked correctly with the distilGPT2 model, even though both target the MLP layers and use the same neuron selection method.


In [None]:
print(model)

LlamaForCausalLM(
  (model): LlamaModel(
    (embed_tokens): Embedding(128256, 2048)
    (layers): ModuleList(
      (0-15): 16 x LlamaDecoderLayer(
        (self_attn): LlamaSdpaAttention(
          (q_proj): Linear(in_features=2048, out_features=2048, bias=False)
          (k_proj): Linear(in_features=2048, out_features=512, bias=False)
          (v_proj): Linear(in_features=2048, out_features=512, bias=False)
          (o_proj): Linear(in_features=2048, out_features=2048, bias=False)
          (rotary_emb): LlamaRotaryEmbedding()
        )
        (mlp): LlamaMLP(
          (gate_proj): Linear(in_features=2048, out_features=8192, bias=False)
          (up_proj): Linear(in_features=2048, out_features=8192, bias=False)
          (down_proj): Linear(in_features=8192, out_features=2048, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): LlamaRMSNorm((2048,), eps=1e-05)
        (post_attention_layernorm): LlamaRMSNorm((2048,), eps=1e-05)
      )
    )
    (norm):

Each transformer block (`LlamaDecoderLayer`) contains an MLP layer (`LlamaMLP`) with a GLU (Gated Linear Unit) structure.

It is a sophisticated structure when compared with other transformer models.

Let's see each layer:
* `gate_proj`: Projects the input to a higher dimension (2048 to 8192).

* `up_proj`: Another projection to the higher dimension (2048 to 8192).

* `down_proj`: Projects back to the original dimension (8192 to 2048).

When pruning, keep in mind the relationship between these layers.
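To make that relationship concrete, here is a small sketch of a gated MLP with toy sizes. It assumes the usual `LlamaMLP` forward pass, `down_proj(act_fn(gate_proj(x)) * up_proj(x))`; the helper `glu_mlp` and the `keep` indices are hypothetical and chosen only for illustration. It shows that slicing the same rows of `gate_proj` and `up_proj` together with the matching columns of `down_proj` is equivalent to zeroing those intermediate channels in the full layer.

```python
import torch
from torch import nn

torch.manual_seed(0)

hidden, intermediate = 8, 16  # toy sizes; Llama-3.2-1B uses 2048 and 8192

gate_proj = nn.Linear(hidden, intermediate, bias=False)
up_proj = nn.Linear(hidden, intermediate, bias=False)
down_proj = nn.Linear(intermediate, hidden, bias=False)
act_fn = nn.SiLU()

def glu_mlp(x, gate_w, up_w, down_w):
    # Gated MLP: the activated gate scales the up projection element-wise,
    # then the down projection maps the result back to the hidden size.
    return (act_fn(x @ gate_w.T) * (x @ up_w.T)) @ down_w.T

x = torch.randn(2, hidden)

# Arbitrary subset of intermediate neurons to keep (illustration only).
keep = torch.tensor([0, 2, 3, 5, 8, 13])

with torch.no_grad():
    # Structured pruning: same rows of gate/up, matching columns of down.
    pruned_out = glu_mlp(
        x, gate_proj.weight[keep], up_proj.weight[keep], down_proj.weight[:, keep]
    )

    # Reference: zero out the dropped intermediate channels in the full MLP.
    mask = torch.zeros(intermediate)
    mask[keep] = 1.0
    masked_out = down_proj(act_fn(gate_proj(x)) * up_proj(x) * mask)

print(torch.allclose(pruned_out, masked_out, atol=1e-5))
```

Because the gate and up outputs are multiplied element-wise, intermediate neuron `i` only exists as the triple (row `i` of `gate_proj`, row `i` of `up_proj`, column `i` of `down_proj`).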

Another important consideration is the model's configuration file. Since the pruning process alters the model's structure, the resulting structure must be reflected in the configuration file.

Otherwise, we might encounter issues where the model doesn't work properly with the Transformers library or produces errors or incorrect results during inference.
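One way to guard against that mismatch is a small consistency check between the layer shapes and the configuration. The sketch below is only an illustration: `mlp_matches_config` is a hypothetical helper, and the `SimpleNamespace` objects are toy stand-ins for a decoder layer's MLP and the model config.

```python
from types import SimpleNamespace
from torch import nn

def mlp_matches_config(mlp, config):
    """Return True if the MLP layer shapes agree with config.intermediate_size."""
    size = config.intermediate_size
    return (
        mlp.gate_proj.out_features == size
        and mlp.up_proj.out_features == size
        and mlp.down_proj.in_features == size
    )

# Hypothetical stand-ins with toy sizes (not the real model objects).
toy_mlp = SimpleNamespace(
    gate_proj=nn.Linear(8, 12, bias=False),
    up_proj=nn.Linear(8, 12, bias=False),
    down_proj=nn.Linear(12, 8, bias=False),
)
stale_config = SimpleNamespace(intermediate_size=16)  # config not updated after pruning
fresh_config = SimpleNamespace(intermediate_size=12)  # config updated to the pruned size

print(mlp_matches_config(toy_mlp, stale_config))
print(mlp_matches_config(toy_mlp, fresh_config))
```

With the real model, the same idea could be applied to every block, for example `all(mlp_matches_config(layer.mlp, model.config) for layer in model.model.layers)`.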


In [None]:
print(model.config)

LlamaConfig {
  "_name_or_path": "meta-llama/Llama-3.2-1B",
  "architectures": [
    "LlamaForCausalLM"
  ],
  "attention_bias": false,
  "attention_dropout": 0.0,
  "bos_token_id": 128000,
  "eos_token_id": 128001,
  "head_dim": 64,
  "hidden_act": "silu",
  "hidden_size": 2048,
  "initializer_range": 0.02,
  "intermediate_size": 8192,
  "max_position_embeddings": 131072,
  "mlp_bias": false,
  "model_type": "llama",
  "num_attention_heads": 32,
  "num_hidden_layers": 16,
  "num_key_value_heads": 8,
  "pretraining_tp": 1,
  "rms_norm_eps": 1e-05,
  "rope_scaling": {
    "factor": 32.0,
    "high_freq_factor": 4.0,
    "low_freq_factor": 1.0,
    "original_max_position_embeddings": 8192,
    "rope_type": "llama3"
  },
  "rope_theta": 500000.0,
  "tie_word_embeddings": true,
  "torch_dtype": "float16",
  "transformers_version": "4.44.2",
  "use_cache": true,
  "vocab_size": 128256
}



In [None]:
# Test the original model
prompt = "Paris is the capital of"
generated = get_output(prompt)
print(f"Generated text: {generated}")

Generated text: Paris is the capital of France and the most popular tourist destination in the country. It’s also the largest city in Europe, with a population of more than 2 million people. Paris is a city of culture, art, and history, and


In [None]:
# Support function to check the size reduction.
def count_parameters(model):
    return sum(p.numel() for p in model.parameters())

In [None]:
original_param_count = count_parameters(model)
print(f"Original model parameters: {original_param_count}")

Original model parameters: 1235814400
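As a cross-check, the parameter count can be reproduced by hand from the shapes printed above. This sketch assumes the output head shares its weights with the embedding (`tie_word_embeddings` is `true` in the config), so that matrix is counted once.

```python
# Values taken from the printed model structure and config above.
vocab_size, hidden_size, intermediate_size = 128256, 2048, 8192
kv_dim = 512        # out_features of k_proj and v_proj
num_layers = 16

embedding = vocab_size * hidden_size        # tied with the output head, counted once
attention = 2 * hidden_size * hidden_size + 2 * hidden_size * kv_dim  # q, o, k, v
mlp = 3 * hidden_size * intermediate_size   # gate_proj, up_proj, down_proj
norms = 2 * hidden_size                     # two RMSNorm weights per block
final_norm = hidden_size

total = embedding + num_layers * (attention + mlp + norms) + final_norm
mlp_share = num_layers * mlp / total
print(total, f"{mlp_share:.1%}")
```

The total matches the 1235814400 reported by `count_parameters`, and the MLP layers account for roughly two thirds of it, which is why they are the natural target for pruning.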


# Pruning Model.

## Support pruning functions.

In [None]:
# Function to compute importance scores (L1 norm)
def compute_importance_scores(layer_weight):
    """
    compute importance scores (L1 norm)

    Args:
    - layer_weight: Weight matrix from a gate_proj / up_proj layer.

    Returns:
    - importance_scores: L1 norm Importance scores for each neuron.
    """
    weight = layer_weight.float()
    return torch.sum(torch.abs(weight), dim=1)

In [None]:
def prune_neurons(mlp, prune_percent):
    """
    Prune neurons from the gate_proj and up_proj layers of the MLP based on importance scores.

    Args:
    - mlp: The MLP layer (contains gate_proj and up_proj) to prune.
    - prune_percent: Percentage of neurons to prune.

    Returns:
    - new_gate_proj: New pruned gate_proj layer.
    - new_up_proj: New pruned up_proj layer.
    - new_down_proj: New down_proj layer.
    - new_size: Number of neurons after pruning.
    - indices_to_keep: Indices of neurons to keep.
    """
    # Get the weights of the gate_proj and up_proj layers
    gate_weight = mlp.gate_proj.weight.data.float()  # Shape: [output_features, input_features]
    up_weight = mlp.up_proj.weight.data.float()      # Shape: [output_features, input_features]

    print(f"gate_weight.shape: {gate_weight.shape}")
    print(f"up_weight.shape: {up_weight.shape}")

    # Compute importance scores for each neuron separately and sum them
    importance_scores_gate = compute_importance_scores(gate_weight)
    importance_scores_up = compute_importance_scores(up_weight)
    importance_scores = importance_scores_gate + importance_scores_up

    # Check for NaNs or Infs
    if torch.isnan(importance_scores).any():
        print("Warning: importance_scores contains NaNs")
    if torch.isinf(importance_scores).any():
        print("Warning: importance_scores contains Infs")

    # Determine the number of neurons to prune
    original_intermediate_size = gate_weight.size(0)  # This is output_features
    num_neurons_to_prune = int(prune_percent * original_intermediate_size)

    # Ensure num_neurons_to_prune is valid
    num_neurons_to_prune = max(0, min(num_neurons_to_prune, original_intermediate_size - 1))
    k = original_intermediate_size - num_neurons_to_prune

    print(f"Original intermediate size: {original_intermediate_size}")
    print(f"Number of neurons to prune: {num_neurons_to_prune}")
    print(f"Number of neurons to keep (k): {k}")

    if k <= 0:
        raise ValueError(f"Invalid number of neurons to keep: {k}. Adjust the prune_percent or check the layer sizes.")

    # Ensure importance_scores is on the same device
    importance_scores = importance_scores.to(device)

    # Get indices of neurons to keep (those with highest importance)
    _, indices_to_keep = torch.topk(importance_scores, k)

    # Sort indices to maintain order
    indices_to_keep, _ = torch.sort(indices_to_keep)

    # Create new Linear layers with reduced size
    new_gate_proj = nn.Linear(mlp.gate_proj.in_features, len(indices_to_keep), bias=False).to(device)
    new_up_proj = nn.Linear(mlp.up_proj.in_features, len(indices_to_keep), bias=False).to(device)
    new_down_proj = nn.Linear(len(indices_to_keep), mlp.down_proj.out_features, bias=False).to(device)

    return new_gate_proj, new_up_proj, new_down_proj, len(indices_to_keep), indices_to_keep


In [None]:
# Function to copy weights and biases to new pruned layers
def copy_weights_and_biases(mlp, new_gate_proj, new_up_proj, new_down_proj, indices_to_keep):
    """
    Copy the weights and biases from the original layers to the new pruned layers.

    Args:
    - mlp: The original MLP layer.
    - new_gate_proj: New pruned gate_proj layer.
    - new_up_proj: New pruned up_proj layer.
    - new_down_proj: New pruned down_proj layer.
    - indices_to_keep: Indices of neurons that are retained.
    """
    # Copy weights for gate_proj and up_proj (input features remain the same)
    new_gate_proj.weight.data = mlp.gate_proj.weight.data[indices_to_keep, :]
    new_up_proj.weight.data = mlp.up_proj.weight.data[indices_to_keep, :]

    # Copy weights for down_proj (output features remain the same)
    new_down_proj.weight.data = mlp.down_proj.weight.data[:, indices_to_keep]


# Prune Loop
The `update_model` function iterates through the blocks within the model's Transformer structure. This structure consists of multiple `LlamaDecoderLayer` blocks, and each of these blocks contains a pair of `LlamaSdpaAttention` and `LlamaMLP` components. The latter contains the MLP layers that will be the target of the pruning process.

```
(layers): ModuleList(
      (0-15): 16 x LlamaDecoderLayer(
        (self_attn): LlamaSdpaAttention(
          (q_proj): Linear(in_features=2048, out_features=2048, bias=False)
          (k_proj): Linear(in_features=2048, out_features=512, bias=False)
          (v_proj): Linear(in_features=2048, out_features=512, bias=False)
          (o_proj): Linear(in_features=2048, out_features=2048, bias=False)
          (rotary_emb): LlamaRotaryEmbedding()
        )
        (mlp): LlamaMLP(
          (gate_proj): Linear(in_features=2048, out_features=8192, bias=False)
          (up_proj): Linear(in_features=2048, out_features=8192, bias=False)
          (down_proj): Linear(in_features=8192, out_features=2048, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): LlamaRMSNorm((2048,), eps=1e-05)
        (post_attention_layernorm): LlamaRMSNorm((2048,), eps=1e-05)
      )
  )    
```

The layers that will undergo the removal of neurons identified as less useful are:
```
(gate_proj): Linear(in_features=2048, out_features=8192, bias=False)
(up_proj): Linear(in_features=2048, out_features=8192, bias=False)
(down_proj): Linear(in_features=8192, out_features=2048, bias=False)
```
The neurons are removed in the `prune_neurons` function based on the values returned by `compute_importance_scores`.
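The selection step can be illustrated in isolation with a toy tensor. The scores and the prune percentage below are made-up values; the `torch.topk` followed by `torch.sort` pattern mirrors what `prune_neurons` does.

```python
import torch

# Hypothetical importance scores for 6 neurons (illustration values only).
importance_scores = torch.tensor([0.9, 0.1, 0.5, 0.7, 0.2, 0.8])
prune_percent = 0.34

num_to_prune = int(prune_percent * importance_scores.numel())  # int() truncates
k = importance_scores.numel() - num_to_prune

# topk returns the k largest scores and their indices;
# sorting the indices keeps the surviving neurons in their original order.
_, indices_to_keep = torch.topk(importance_scores, k)
indices_to_keep, _ = torch.sort(indices_to_keep)

# Rows of a [out_features, in_features] weight are neurons, so keep those rows.
weight = torch.arange(12, dtype=torch.float32).reshape(6, 2)
pruned_weight = weight[indices_to_keep, :]
print(indices_to_keep.tolist(), tuple(pruned_weight.shape))
```

The two lowest-scoring neurons (indices 1 and 4) are dropped and the remaining rows are copied unchanged.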



In [None]:
# Function to update the model
def update_model(model, prune_percent):
    new_intermediate_size = None

    for idx, layer in enumerate(model.model.layers):
        mlp = layer.mlp

        # Prune the neurons and create new layers
        new_gate_proj, new_up_proj, new_down_proj, new_size, indices_to_keep = prune_neurons(mlp, prune_percent)

        # Copy weights from old layers to new pruned layers
        copy_weights_and_biases(mlp, new_gate_proj, new_up_proj, new_down_proj, indices_to_keep)

        # Replace old layers with new pruned layers
        mlp.gate_proj = new_gate_proj
        mlp.up_proj = new_up_proj
        mlp.down_proj = new_down_proj

        # Update the intermediate size for the first layer
        if new_intermediate_size is None:
            new_intermediate_size = new_size

    # Update the model configuration with the new intermediate size
    model.config.intermediate_size = new_intermediate_size

    return model

## Obtain & test the model.  

In [None]:
prune_percent = 0.2  # Prune 20% of neurons
model = update_model(model, prune_percent)

gate_weight.shape: torch.Size([8192, 2048])
up_weight.shape: torch.Size([8192, 2048])
Original intermediate size: 8192
Number of neurons to prune: 1638
Number of neurons to keep (k): 6554
gate_weight.shape: torch.Size([8192, 2048])
up_weight.shape: torch.Size([8192, 2048])
Original intermediate size: 8192
Number of neurons to prune: 1638
Number of neurons to keep (k): 6554
gate_weight.shape: torch.Size([8192, 2048])
up_weight.shape: torch.Size([8192, 2048])
Original intermediate size: 8192
Number of neurons to prune: 1638
Number of neurons to keep (k): 6554
gate_weight.shape: torch.Size([8192, 2048])
up_weight.shape: torch.Size([8192, 2048])
Original intermediate size: 8192
Number of neurons to prune: 1638
Number of neurons to keep (k): 6554
gate_weight.shape: torch.Size([8192, 2048])
up_weight.shape: torch.Size([8192, 2048])
Original intermediate size: 8192
Number of neurons to prune: 1638
Number of neurons to keep (k): 6554
gate_weight.shape: torch.Size([8192, 2048])
up_weight.shape:

As this log shows, we are reducing the number of output features in the `gate_proj` and `up_proj` layers from 8192 to 6554. There are 16 * 2 such layers affected by the reduction, and the input size of each `down_proj` layer shrinks to match.

In [None]:
# Recalculate the number of parameters
pruned_param_count = count_parameters(model)
reduction_in_params = original_param_count - pruned_param_count
percentage_savings = (reduction_in_params / original_param_count) * 100

print(f"Pruned model parameters: {pruned_param_count}")
print(f"Reduction in parameters: {reduction_in_params}")
print(f"Percentage of weight savings: {percentage_savings:.2f}%")


Pruned model parameters: 1074792448
Reduction in parameters: 161021952
Percentage of weight savings: 13.03%
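These figures can be reproduced with a few lines of arithmetic, which also shows why pruning 20% of the MLP neurons saves only about 13% of the total parameters: the embedding and attention weights are untouched. The sketch uses only numbers already printed in this notebook.

```python
hidden_size, intermediate_size, num_layers = 2048, 8192, 16
prune_percent = 0.2
original_param_count = 1235814400  # value printed earlier in this notebook

pruned_per_layer = int(prune_percent * intermediate_size)  # neurons removed per MLP
kept = intermediate_size - pruned_per_layer

# Each removed neuron drops one row in gate_proj, one row in up_proj,
# and one column in down_proj: 3 * hidden_size weights in total.
removed = num_layers * pruned_per_layer * 3 * hidden_size
savings = removed / original_param_count * 100
print(pruned_per_layer, kept, removed, f"{savings:.2f}%")
```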


In [None]:
# Test the pruned model
generated = get_output(prompt, model, tokenizer)
print(f"Generated text after pruning: {generated}")

Generated text after pruning: Paris is the capital of of France. This is the the the the main the area of. This is the the the the the the the the the the the the the the the the city of the the France of the of the of the of


**-- WARNING --**

Although it's normal for a model to lose some capabilities due to a pruning process, what has happened to our model is not normal.

It's not just a matter of reducing the pruning percentage. The issue here runs deeper. There are a couple of problems in the pruning process that need to be addressed.

## Identifying the problems.


In [None]:
print(model)

LlamaForCausalLM(
  (model): LlamaModel(
    (embed_tokens): Embedding(128256, 2048)
    (layers): ModuleList(
      (0-15): 16 x LlamaDecoderLayer(
        (self_attn): LlamaSdpaAttention(
          (q_proj): Linear(in_features=2048, out_features=2048, bias=False)
          (k_proj): Linear(in_features=2048, out_features=512, bias=False)
          (v_proj): Linear(in_features=2048, out_features=512, bias=False)
          (o_proj): Linear(in_features=2048, out_features=2048, bias=False)
          (rotary_emb): LlamaRotaryEmbedding()
        )
        (mlp): LlamaMLP(
          (gate_proj): Linear(in_features=2048, out_features=6554, bias=False)
          (up_proj): Linear(in_features=2048, out_features=6554, bias=False)
          (down_proj): Linear(in_features=6554, out_features=2048, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): LlamaRMSNorm((2048,), eps=1e-05)
        (post_attention_layernorm): LlamaRMSNorm((2048,), eps=1e-05)
      )
    )
    (norm):


In the model’s structure, at first glance, there doesn’t seem to be any error, but the MLP block structure has not been properly considered.

The layers are being treated as if they were from the distilGPT2 model, whereas Llama uses a `GLU (Gated Linear Unit)` structure, where the `gate_proj` and `up_proj` layers work together. Therefore, pruning cannot be done by calculating the importance of neurons separately and removing different neurons in each layer. Instead, the pruning process should respect that these layers function as pairs.

Thus, the evaluation of which neurons to prune should take into account that they must be assessed together, and pruning should be done on pairs of neurons.

We now have some key points that need to be addressed in order to develop a pruning solution that suits the Llama model’s structure.

* Consider the GLU (Gated Linear Unit) structure of the MLP layers.
* Use a neuron selection method that is compatible with the GLU structure.

**We will explore this in the next notebook.**

# Upload the model to Hugging Face & Download to test.

Although the model is not functional, let's check that it can at least work correctly with the Transformers library.

In [None]:
output_dir = './pruned_llama_1b'
if not os.path.exists(output_dir):
    os.makedirs(output_dir)

model.save_pretrained(output_dir)
tokenizer.save_pretrained(output_dir)
print(f"Pruned model saved to {output_dir}")

Pruned model saved to ./pruned_llama_1b


In [None]:
# Push the model to your Hugging Face repository
model.push_to_hub('pruned-llama-1b')

No files have been modified since last commit. Skipping to prevent empty commit.


CommitInfo(commit_url='https://huggingface.co/oopere/pruned-llama-1b/commit/383b3b2cf8bec7bb7df853261150ee448cc67757', commit_message='Upload LlamaForCausalLM', commit_description='', oid='383b3b2cf8bec7bb7df853261150ee448cc67757', pr_url=None, pr_revision=None, pr_num=None)

In [None]:
tokenizer.push_to_hub('pruned-llama-1b')

No files have been modified since last commit. Skipping to prevent empty commit.


CommitInfo(commit_url='https://huggingface.co/oopere/pruned-llama-1b/commit/383b3b2cf8bec7bb7df853261150ee448cc67757', commit_message='Upload tokenizer', commit_description='', oid='383b3b2cf8bec7bb7df853261150ee448cc67757', pr_url=None, pr_revision=None, pr_num=None)

In [None]:
# Download the pruned model
pruned_model_name = 'oopere/pruned-llama-1b'
pruned_model = AutoModelForCausalLM.from_pretrained(pruned_model_name, torch_dtype=torch.float16).to(device)
pruned_tokenizer = AutoTokenizer.from_pretrained(pruned_model_name)


In [None]:
# Test the downloaded pruned model
generated = get_output(prompt, pruned_model, pruned_tokenizer)
print(f"Generated text from downloaded pruned model: {generated}")

Generated text from downloaded pruned model: Paris is the capital of of of Europe. It is the the the the the largest city of France. It is the the the the the the the the the the the the the the the the the the the the the the the the the is


# Conclusion.
This notebook reminds us that a pruning process must take the model's structure into account, and that the same process cannot be reused for models with different structures.

What previously worked perfectly with the distilGPT2 model has wrecked the Llama 3.2 model to the point of making it unusable.

## Future work.
The task for the next notebook is clear: we will work on a pruning process that, while inspired by this one, respects the model's structure and reduces its size without affecting its functionality too much.

## Author's Note.
In addition to creating content like this notebook and offering it under the MIT license, I have also contributed to repositories such as those of Hugging Face and Google Gemini.

I am especially proud of my book: <a href="https://amzn.to/4eanT1g"><b>Large Language Models:</b> Apply and Implement Strategies for Large Language Models</a> (Apress).

You can find it on both <a href="https://amzn.to/4eanT1g">Amazon</a> and <a href="https://link.springer.com/book/10.1007/979-8-8688-0515-8">Springer</a>, where they often have good deals on the purchase price.

If you take a look and end up purchasing it, keep in mind that you can reach out with any questions via the Discussions section of this same repository or on any of my social media channels. I’ll do my best to respond as quickly as possible.