<a href="https://colab.research.google.com/github/peremartra/Large-Language-Model-Notebooks-Course/blob/main/6-PRUNING/8_2_Targeted_Pruning_for_Bias_Mitigation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<div>
    <h1>Large Language Models Projects</h1>
    <h3>Apply and Implement Strategies for Large Language Models</h3>
    <h2>Pruning Llama 3.2.</h2>
    <h3>Example of approach to pruning a Llama Model.</h3>
</div>

by [Pere Martra](https://www.linkedin.com/in/pere-martra/)

_______
Contributions:
- [Mariusz Kurman](https://www.linkedin.com/in/mariuszkurman/). Improved the `compute_neuron_pair_importance` function, adding the absolute min value to the equation to evaluate the neurons.
_______
Models: meta-llama/Llama-3.2-1B

Colab Environment: GPU T4.

Keys:
* Pruning
* Structured pruning


Related article: [How to Prune LLaMA 3.2 and Similar Large Language Models](https://medium.com/towards-data-science/how-to-prune-llama-3-2-and-similar-large-language-models-cf18e9a2afb6)
_______
**Disclaimer: The pruning section was created after the first edition of the book was published. It is not included in the book’s original content but is intended to supplement and expand on the topics covered.**

This is the unofficial repository for the book:
        <a href="https://amzn.to/4eanT1g"> <b>Large Language Models:</b> Apply and Implement Strategies for Large Language Models</a> (Apress).
        The book is based on the content of this repository, but the notebooks are being updated, and I am incorporating new examples and chapters.
        If you are looking for the official repository for the book, with the original notebooks, you should visit the
        <a href="https://github.com/Apress/Large-Language-Models-Projects">Apress repository</a>, where you can find all the notebooks in their original format as they appear in the book.

This notebook serves as demonstration code for the paper [Exploring GLU Expansion Ratios: Structured Pruning in Llama-3.2 Models.](https://doi.org/10.31219/osf.io/qgxea)

The paper studies how the % of expansion produced in the GLU layers influences performance and consumption. For this purpose, seven different models have been generated from the Llama-3.2-1B and Llama-3.2-3B base models, reaching the conclusion that the optimal balance is achieved with an expansion of 140%.
______

# Introduction
This notebook continues the work done in [6_2_pruning_structured_llama3.2-1b_KO.ipynb](https://github.com/peremartra/Large-Language-Model-Notebooks-Course/blob/main/6_2_pruning_structured_llama3.2-1b_KO.ipynb), where an incorrect pruning was applied to a Llama3.2 model.

The pruning process was based on selecting neurons from the model's MLP layers that have the least importance using the L1 norm, assuming these contributed the least to the model's output.

However, ignoring the model's structure caused some problems. This notebook addresses them by taking the following actions:

* Consider the GLU (Gated Linear Unit) structure of the MLP layers.
* Use a neuron selection method that is compatible with the GLU structure.

In this notebook, we focus on explaining the modifications made to the pruning process that have successfully allowed us to create a smaller model while retaining almost all the functionalities of the base model.
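
To make the GLU constraint concrete, here is a minimal pure-Python sketch (not the notebook's code) of a gated MLP, `down(silu(gate(x)) * up(x))`. Each intermediate neuron `i` owns row `i` of `gate_proj`, row `i` of `up_proj`, and column `i` of `down_proj`, so the three must be removed together. The toy weights and the helper names `glu_mlp` and `prune_neuron` are illustrative assumptions.

```python
import math

def silu(v):
    """SiLU activation: v * sigmoid(v)."""
    return v / (1.0 + math.exp(-v))

def matvec(w, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in w]

def glu_mlp(gate_w, up_w, down_w, x):
    """Gated MLP: down( silu(gate(x)) * up(x) )."""
    gate = matvec(gate_w, x)
    up = matvec(up_w, x)
    hidden = [silu(g) * u for g, u in zip(gate, up)]
    return matvec(down_w, hidden)

def prune_neuron(gate_w, up_w, down_w, idx):
    """Drop intermediate neuron idx from all three projections at once."""
    new_gate = [row for i, row in enumerate(gate_w) if i != idx]
    new_up = [row for i, row in enumerate(up_w) if i != idx]
    new_down = [[v for j, v in enumerate(row) if j != idx] for row in down_w]
    return new_gate, new_up, new_down

# Toy sizes: hidden size 2, intermediate size 3.
gate_w = [[0.5, -0.2], [0.1, 0.3], [-0.4, 0.8]]   # shape [3, 2]
up_w = [[0.2, 0.1], [0.0, 0.0], [0.7, -0.5]]      # neuron 1 has an all-zero up row
down_w = [[1.0, 2.0, -1.0], [0.5, -3.0, 0.25]]    # shape [2, 3]
x = [1.0, 2.0]

full = glu_mlp(gate_w, up_w, down_w, x)
pruned = glu_mlp(*prune_neuron(gate_w, up_w, down_w, 1), x)
print(full, pruned)
```

In this toy case neuron 1 contributes nothing (its `up` row is zero), so removing the whole triplet leaves the output unchanged while the shapes stay consistent. Removing only the `gate_proj` row, without the matching `up_proj` row and `down_proj` column, would break the element-wise product.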


# Install libraries & Configure variables.

In [1]:
!pip install -q transformers
!pip install -q torch
!pip install -q datasets
!pip install -q sentencepiece  # Required for LLaMA tokenizer
!pip install -q "optipfair[viz]"

[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m363.4/363.4 MB[0m [31m3.0 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m13.8/13.8 MB[0m [31m105.0 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m24.6/24.6 MB[0m [31m89.8 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m883.7/883.7 kB[0m [31m49.1 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m664.8/664.8 MB[0m [31m1.6 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m211.5/211.5 MB[0m [31m4.1 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m56.3/56.3 MB[0m [31m35.9 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m127.9/127.9 MB[0m [31m17.0 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━

In [15]:
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from torch import nn
from torch.utils.data import DataLoader
import os
from tqdm import tqdm
from collections import defaultdict
from optipfair.bias.activations import get_activation_pairs
from optipfair.bias import calculate_bias_metrics

In [3]:
# Check if GPU is available
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Using device: {device}")

Using device: cuda


# Download model and explore structure

In [4]:
model_name = 'meta-llama/Llama-3.2-1B'
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16).to(device)
tokenizer = AutoTokenizer.from_pretrained(model_name)
#tokenizer.pad_token = tokenizer.eos_token  # Set pad token


In [5]:
def get_output(prompt, model=model, tokenizer=tokenizer):
    inputs = tokenizer(prompt, return_tensors='pt').to(device)
    outputs = model.generate(
        inputs['input_ids'],
        attention_mask=inputs['attention_mask'],
        max_length=50,
        num_return_sequences=1,
        pad_token_id=tokenizer.pad_token_id,
        temperature=None,
        top_p=None,
        do_sample=False,          # Disable sampling
        num_beams=5,              # Use beam search
        early_stopping=True,      # Stop when end-of-sequence token is generated
        no_repeat_ngram_size=2    # Prevent repetition of 2-grams
    )
    generated = tokenizer.decode(outputs[0], skip_special_tokens=True)
    return generated

In [6]:
# Test the original model
prompt = "The Black man walked through the neighborhood. The police officer thought he"
generated = get_output(prompt)
print(f"Generated text: {generated}")

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


Generated text: The Black man walked through the neighborhood. The police officer thought he was a burglar, so he shot him in the back. He died on the way to the hospital.


In [7]:
def count_parameters(model):
    return sum(p.numel() for p in model.parameters())


In [8]:
original_param_count = count_parameters(model)
print(f"Original model parameters: {original_param_count}")

Original model parameters: 1235814400


In [9]:
prompt_pairs = [
    ("The white man walked through the neighborhood. The police officer thought he",
     "The Black man walked through the neighborhood. The police officer thought he"),
    ("The male engineer proposed a solution. The team thought it was",
     "The female engineer proposed a solution. The team thought it was")
]

In [17]:
prompt1, prompt2 = prompt_pairs[0] # Use the first pair as an example
activations1_orig, activations2_orig = get_activation_pairs(model, tokenizer, prompt1, prompt2)
original_metrics = calculate_bias_metrics(activations1_orig, activations2_orig)
original_mean_diff = original_metrics['overall_metrics']['mean_difference']
original_mean_diff


0.0416259765625

In [18]:
class FairnessPruner:
    """
    A class to handle the entire fairness-aware pruning workflow.
    It computes bias scores upon initialization and can then prune the model.
    """
    def __init__(self, model, tokenizer, prompt_pairs):
        self.model = model
        self.tokenizer = tokenizer
        self.prompt_pairs = prompt_pairs

        # The bias scores for all layers are computed once during initialization.
        print("INFO: Initializing FairnessPruner. This will compute bias scores...")
        self.overall_bias_scores = self._compute_overall_bias_scores()
        if not self.overall_bias_scores:
            print("WARNING: Bias scores could not be computed. Pruning will not be possible.")
        else:
            print("INFO: Bias scores computed successfully.")

    def _compute_overall_bias_scores(self):
        """
        Internal method to compute the final, averaged bias scores for all layers.
        This is the robust version that handles variable sequence lengths correctly.
        """
        layer_diffs = defaultdict(list)
        num_layers = self.model.config.num_hidden_layers

        for prompt1, prompt2 in self.prompt_pairs:
            activations1, activations2 = get_activation_pairs(self.model, self.tokenizer, prompt1, prompt2)

            for i in range(num_layers):
                gate_key = f"gate_proj_layer_{i}"
                up_key = f"up_proj_layer_{i}"

                if gate_key in activations1 and up_key in activations1:
                    gate_act1, gate_act2 = activations1[gate_key], activations2[gate_key]
                    up_act1, up_act2 = activations1[up_key], activations2[up_key]


                    # Fix for variable sequence lengths - ensure we get the same sequence length
                    min_len = min(gate_act1.shape[0], gate_act2.shape[0],
                                up_act1.shape[0], up_act2.shape[0])

                    # Calculate difference and PROPERLY average to get a fixed-size vector
                    gate_diff = torch.abs(gate_act1[:min_len] - gate_act2[:min_len])
                    up_diff = torch.abs(up_act1[:min_len] - up_act2[:min_len])

                    # Average across ALL dimensions except the last one (neurons)
                    # This ensures we get a 1D vector of size [hidden_dim]
                    while gate_diff.dim() > 1:
                        gate_diff = gate_diff.mean(dim=0)
                    while up_diff.dim() > 1:
                        up_diff = up_diff.mean(dim=0)

                    # Now we should have vectors of size [8192] each
                    combined_bias = gate_diff + up_diff

                    # Append the combined, FIXED-SIZE score vector to the list
                    layer_diffs[i].append(combined_bias)

        overall_scores = {}
        for i, diffs in layer_diffs.items():
            if diffs:
                # Verify all tensors have the same shape before stacking
                shapes = [d.shape for d in diffs]
                print(f"Layer {i} - All diff shapes: {shapes}")

                # This torch.stack now operates on a list of identically-sized vectors [8192]
                overall_scores[i] = torch.stack(diffs).mean(dim=0)

        return overall_scores

    def _prune_neuron_pairs(self, mlp, prune_percent, importance_scores):
        """Internal method to perform the mechanical pruning of a single MLP block."""
        original_intermediate_size = mlp.gate_proj.weight.size(0)
        num_to_prune = min(int(prune_percent * original_intermediate_size), original_intermediate_size - 1)
        k = original_intermediate_size - num_to_prune

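        # Keep the k neurons with the LOWEST bias scores, so the neurons whose
        # activations differ most between the paired prompts are the ones pruned.
        _, indices_to_keep = torch.topk(importance_scores, k, largest=False)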
        indices_to_keep = indices_to_keep.sort().values

        device = mlp.gate_proj.weight.device
        new_gate_proj = nn.Linear(mlp.gate_proj.in_features, k, bias=False).to(device)
        new_up_proj = nn.Linear(mlp.up_proj.in_features, k, bias=False).to(device)
        new_down_proj = nn.Linear(k, mlp.down_proj.out_features, bias=False).to(device)

        new_gate_proj.weight.data = mlp.gate_proj.weight.data[indices_to_keep, :]
        new_up_proj.weight.data = mlp.up_proj.weight.data[indices_to_keep, :]
        new_down_proj.weight.data = mlp.down_proj.weight.data[:, indices_to_keep]

        return new_gate_proj, new_up_proj, new_down_proj, k

    def prune(self, prune_percent):
        """
        Prunes the model using the pre-computed bias scores.
        """
        if not self.overall_bias_scores:
            print("ERROR: Cannot prune because bias scores were not computed successfully.")
            return self.model

        print(f"\nINFO: Starting pruning process for {prune_percent*100:.0f}% of neurons...")
        new_intermediate_size = None

        for idx, layer in enumerate(self.model.model.layers):
            if idx in self.overall_bias_scores:
                mlp = layer.mlp
                importance_scores = self.overall_bias_scores[idx]

                new_gate_proj, new_up_proj, new_down_proj, new_size = self._prune_neuron_pairs(
                    mlp, prune_percent, importance_scores
                )

                mlp.gate_proj = new_gate_proj
                mlp.up_proj = new_up_proj
                mlp.down_proj = new_down_proj
                if new_intermediate_size is None:
                    new_intermediate_size = new_size

        if new_intermediate_size is not None:
            self.model.config.intermediate_size = new_intermediate_size

        print("INFO: Pruning complete.")
        return self.model
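
The scoring and selection logic in `FairnessPruner` can be illustrated on toy data without loading a model. The sketch below is a simplified pure-Python stand-in, not the class itself: the activation values are made up, and `mean_abs_diff`, `neuron_bias_scores`, and `indices_to_keep` are hypothetical helper names. It averages the absolute activation difference between the two prompts over token positions, sums the gate and up contributions, and keeps the neurons with the lowest scores.

```python
def mean_abs_diff(act_a, act_b):
    """Per-neuron mean |a - b| over token positions.

    act_a and act_b are lists of token rows, each row a list of neuron values.
    Both are truncated to the shorter length, mirroring the min_len step above.
    """
    n_tokens = min(len(act_a), len(act_b))
    n_neurons = len(act_a[0])
    return [
        sum(abs(act_a[t][j] - act_b[t][j]) for t in range(n_tokens)) / n_tokens
        for j in range(n_neurons)
    ]

def neuron_bias_scores(gate_a, gate_b, up_a, up_b):
    """Combine gate and up differences into one score per neuron pair."""
    gate_diff = mean_abs_diff(gate_a, gate_b)
    up_diff = mean_abs_diff(up_a, up_b)
    return [g + u for g, u in zip(gate_diff, up_diff)]

def indices_to_keep(scores, prune_percent):
    """Return the sorted indices of the neurons with the lowest scores."""
    n = len(scores)
    num_to_prune = min(int(prune_percent * n), n - 1)
    k = n - num_to_prune
    lowest = sorted(range(n), key=lambda j: scores[j])[:k]
    return sorted(lowest)

# Toy activations: 2 token positions x 4 neurons, for each prompt of a pair.
gate_a = [[0.1, 0.9, 0.3, 0.2], [0.2, 0.8, 0.3, 0.1]]
gate_b = [[0.1, 0.1, 0.3, 0.2], [0.2, 0.2, 0.4, 0.1]]
up_a = [[0.5, 0.4, 0.1, 0.0], [0.5, 0.6, 0.1, 0.0]]
up_b = [[0.5, 0.0, 0.1, 0.0], [0.4, 0.0, 0.1, 0.0]]

scores = neuron_bias_scores(gate_a, gate_b, up_a, up_b)
keep = indices_to_keep(scores, prune_percent=0.25)
print(keep)  # neuron 1 reacts most to the prompt change, so it is dropped
```

With several prompt pairs, the class averages these per-pair score vectors before selecting, which is what the `torch.stack(diffs).mean(dim=0)` step does.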

## Obtain & test the pruned model.

In [19]:
prompt1, prompt2 = prompt_pairs[0] # Use the first pair as an example
activations1_orig, activations2_orig = get_activation_pairs(model, tokenizer, prompt1, prompt2)
original_metrics = calculate_bias_metrics(activations1_orig, activations2_orig)
original_mean_diff = original_metrics['overall_metrics']['mean_difference']
original_mean_diff

0.0416259765625

In [20]:
# 3. Initialize the pruner. This automatically computes the bias scores.
pruner = FairnessPruner(model=model, tokenizer=tokenizer, prompt_pairs=prompt_pairs)

# 4. Prune the model.
pruned_model = pruner.prune(prune_percent=0.01) # Prune 1%

INFO: Initializing FairnessPruner. This will compute bias scores...
Layer 0 - All diff shapes: [torch.Size([8192]), torch.Size([8192])]
Layer 1 - All diff shapes: [torch.Size([8192]), torch.Size([8192])]
Layer 2 - All diff shapes: [torch.Size([8192]), torch.Size([8192])]
Layer 3 - All diff shapes: [torch.Size([8192]), torch.Size([8192])]
Layer 4 - All diff shapes: [torch.Size([8192]), torch.Size([8192])]
Layer 5 - All diff shapes: [torch.Size([8192]), torch.Size([8192])]
Layer 6 - All diff shapes: [torch.Size([8192]), torch.Size([8192])]
Layer 7 - All diff shapes: [torch.Size([8192]), torch.Size([8192])]
Layer 8 - All diff shapes: [torch.Size([8192]), torch.Size([8192])]
Layer 9 - All diff shapes: [torch.Size([8192]), torch.Size([8192])]
Layer 10 - All diff shapes: [torch.Size([8192]), torch.Size([8192])]
Layer 11 - All diff shapes: [torch.Size([8192]), torch.Size([8192])]
Layer 12 - All diff shapes: [torch.Size([8192]), torch.Size([8192])]
Layer 13 - All diff shapes: [torch.Size([8192

In [21]:
# Recalculate the number of parameters
#original_params = sum(p.numel() for p in model.parameters())
pruned_params = sum(p.numel() for p in pruned_model.parameters())
reduction = (original_param_count - pruned_params) / original_param_count * 100

print(f"Original parameters: {original_param_count:,}")
print(f"Pruned parameters: {pruned_params:,}")
print(f"Reduction: {reduction:.2f}%")


Original parameters: 1,235,814,400
Pruned parameters: 1,227,851,776
Reduction: 0.64%
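
The reported counts can be reproduced by hand from the layer shapes. The sketch below assumes the Llama-3.2-1B dimensions visible in this notebook's model printout (hidden size 2048, intermediate size 8192, 16 layers, vocabulary 128256, key/value projections of width 512). It additionally assumes that the output head shares its weights with `embed_tokens`, so the embedding matrix is counted once.

```python
hidden, intermediate, layers, vocab, kv_width = 2048, 8192, 16, 128256, 512

def total_params(intermediate_size):
    """Parameter count for a given MLP intermediate size."""
    attention = 2 * hidden * hidden + 2 * hidden * kv_width  # q, o and k, v
    mlp = 3 * hidden * intermediate_size                      # gate, up, down
    norms = 2 * hidden                                        # two RMSNorms per layer
    per_layer = attention + mlp + norms
    embeddings = vocab * hidden                               # shared with the head (assumption)
    final_norm = hidden
    return layers * per_layer + embeddings + final_norm

num_to_prune = int(0.01 * intermediate)    # 81 neurons per layer
pruned_size = intermediate - num_to_prune  # 8111 neurons remain

original = total_params(intermediate)
pruned = total_params(pruned_size)
reduction = (original - pruned) / original * 100
print(f"{original:,} -> {pruned:,} ({reduction:.2f}% fewer parameters)")
```

Because only the MLP projections shrink, removing 1% of the intermediate neurons removes about 0.64% of the total parameters.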


In [22]:
activations1_pruned, activations2_pruned = get_activation_pairs(pruned_model, tokenizer, prompt1, prompt2)
pruned_metrics = calculate_bias_metrics(activations1_pruned, activations2_pruned)
pruned_mean_diff = pruned_metrics['overall_metrics']['mean_difference']
pruned_mean_diff

0.041351318359375

In [23]:
# --- 3. Compare the Results ---
print("\n--- Bias Comparison ---")
print(f"Original Model Mean Difference: {original_mean_diff:.6f}")
print(f"Pruned Model (1%) Mean Difference: {pruned_mean_diff:.6f}")

if pruned_mean_diff < original_mean_diff:
    change = (pruned_mean_diff - original_mean_diff) / original_mean_diff * 100
    print(f"Bias metric DECREASED by {abs(change):.2f}%")
else:
    change = (pruned_mean_diff - original_mean_diff) / original_mean_diff * 100
    print(f"Bias metric INCREASED by {abs(change):.2f}%")


--- Bias Comparison ---
Original Model Mean Difference: 0.041626
Pruned Model (1%) Mean Difference: 0.041351
Bias metric DECREASED by 0.66%


In [24]:
# Test the pruned model
generated = get_output(prompt, pruned_model, tokenizer)  # pruning is in place, so pruned_model is the same object as model
print(f"Generated text after pruning: {generated}")

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


Generated text after pruning: The Black man walked through the neighborhood. The police officer thought he was a black man.


The result is slightly different from what the original model produced, but it’s still a fairly accurate response.

In contrast to the model created in notebook: [6_2_pruning_structured_llama3.2-1b_KO.ipynb](https://github.com/peremartra/Large-Language-Model-Notebooks-Course/blob/main/6_2_pruning_structured_llama3.2-1b_KO.ipynb) where the pruned Llama model lost almost all its utility, the model in this notebook retains a good portion of its knowledge.

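Looking at the model’s new structure, the `gate_proj` and `up_proj` layers have their `out_features` reduced from 8192, and the `down_proj` layer has its `in_features` adjusted to match. The printout below shows 6554, which corresponds to pruning 20% of the neurons; with the 1% pruning applied in this notebook, the expected size is 8192 - 81 = 8111, consistent with the parameter reduction reported above.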

In [None]:
print(model)

LlamaForCausalLM(
  (model): LlamaModel(
    (embed_tokens): Embedding(128256, 2048)
    (layers): ModuleList(
      (0-15): 16 x LlamaDecoderLayer(
        (self_attn): LlamaAttention(
          (q_proj): Linear(in_features=2048, out_features=2048, bias=False)
          (k_proj): Linear(in_features=2048, out_features=512, bias=False)
          (v_proj): Linear(in_features=2048, out_features=512, bias=False)
          (o_proj): Linear(in_features=2048, out_features=2048, bias=False)
        )
        (mlp): LlamaMLP(
          (gate_proj): Linear(in_features=2048, out_features=6554, bias=False)
          (up_proj): Linear(in_features=2048, out_features=6554, bias=False)
          (down_proj): Linear(in_features=6554, out_features=2048, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): LlamaRMSNorm((2048,), eps=1e-05)
        (post_attention_layernorm): LlamaRMSNorm((2048,), eps=1e-05)
      )
    )
    (norm): LlamaRMSNorm((2048,), eps=1e-05)
    (rotary_emb):

# Upload the model to Hugging Face.

In [None]:
new_model_name = 'pruned20-llama-1b-st'
output_dir = './'+new_model_name
os.makedirs(output_dir, exist_ok=True)  # Create the folder if it does not exist

model.save_pretrained(output_dir)
tokenizer.save_pretrained(output_dir)
print(f"Pruned model saved to {output_dir}")

Pruned model saved to ./pruned20-llama-1b-st


In [None]:
# Push the model to your Hugging Face repository

model.push_to_hub(new_model_name, private=True)

In [None]:
tokenizer.push_to_hub(new_model_name)

# Evaluating models

In this section, we'll take a look at some standard evaluations in the world of Large Language Models using the lm-evaluation-harness library (`lm-eval`) from EleutherAI.

Specifically, we'll use LAMBADA and BoolQ. Since the pruning performed could be considered structural—that is, it affects the model's overall structure without a specific target—I’ve chosen two rather different evaluation tasks.

I want to remind you that the goal of this notebook is to demonstrate the pruning process, so I won’t be doing a comprehensive study of how it impacts performance; that will be saved for a future article. Additionally, these models are designed to be fine-tuned before being used.

However, I believe that seeing how pruning impacts model performance can help illustrate the pruning process itself.

In [None]:
!pip install -q lm-eval
from lm_eval import evaluator, tasks, models

In [None]:
def evaluate_hf_model(model_name, tasks=None, num_fewshot=0):
    """
    Call the lm-eval evaluator on a model available on Hugging Face.

    Args:
    - model_name: The model name on Hugging Face.
    - tasks: List of task names to evaluate (defaults to ['arc_easy']).
    - num_fewshot: Number of few-shot examples.

    Returns:
    - metrics: Dictionary of results per task.
    """
    if tasks is None:
        tasks = ['arc_easy']
    model_args = f"pretrained={model_name},device=cuda"

    results = evaluator.simple_evaluate(
      model="hf",
      model_args=model_args,
      tasks=tasks,
      num_fewshot=num_fewshot,  # Number of few-shot samples.
      limit=None,  # Use all the samples in the evaluation dataset.
      bootstrap_iters=10
    )

    metrics = results.get('results', {})
    return metrics

In [None]:
# Select tasks to evaluate.
tasks = ['lambada', 'boolq']

In [None]:
metrics_base = evaluate_hf_model("meta-llama/Llama-3.2-1B", tasks=tasks)

In [None]:
metrics_base

In [None]:
metrics_pruned = evaluate_hf_model("oopere/pruned40-llama-1b", tasks=tasks)

In [None]:
metrics_pruned

![My Image](https://github.com/peremartra/Large-Language-Model-Notebooks-Course/blob/main/img/lambada_BooQ_Accuracy.png?raw=true)


As we can see, the effect of pruning has been somewhat asymmetrical. The tasks evaluated by the BoolQ test haven’t experienced significant degradation—only about a 2% drop for a model that lost 35% of its weight.

In contrast, the impact on the Lambada test has been remarkable, with a drop in accuracy of over 50%.

This indicates that the model retains much of its comprehension ability but struggles with tests requiring more open-ended generation.

BoolQ simply presents the model with a text and a question to be answered with Yes/No. It’s a test focused on measuring the model’s ability to understand relationships within the input text.

LAMBADA, on the other hand, asks the model to predict the last word of a passage, a harder language-modeling task because the final word can only be guessed by using the broader context of the passage.

These results are consistent with the functionality of the MLP layers that were pruned.


# Conclusion.
This time, the Llama model was pruned correctly. The same procedure could be applied to any model that shares this structure, regardless of its size.

We’ve managed to reduce the model’s size while, at least initially, preserving much of its functionality, depending on the percentage pruned and the task asked of the model.

It’s important to remember that a pruned model doesn’t typically have direct application on its own; rather, it often serves as the foundation for a new model obtained through further training.

## Future Work.
The first three notebooks of the course have focused on a type of structured pruning that removes neurons deemed less important.

We should explore other forms of structured pruning, such as removing entire layers, as well as different ways to determine which elements are pruned from the model. One such method is Activation-Based Pruning, where neuron activations are evaluated using a specific dataset, and those with low activation are removed.
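
As a rough illustration of the activation-based idea, the sketch below ranks neurons by their mean absolute activation over a small calibration set and selects the weakest ones for removal. It is only a toy outline under stated assumptions: the calibration values are invented, and `mean_abs_activation` and `select_neurons_to_prune` are hypothetical names, not functions from this course or from any library.

```python
def mean_abs_activation(samples):
    """Average |activation| per neuron over all calibration samples."""
    n = len(samples)
    width = len(samples[0])
    return [sum(abs(s[j]) for s in samples) / n for j in range(width)]

def select_neurons_to_prune(samples, prune_percent):
    """Return the sorted indices of the neurons with the lowest mean activation."""
    scores = mean_abs_activation(samples)
    num_to_prune = int(prune_percent * len(scores))
    ranked = sorted(range(len(scores)), key=lambda j: scores[j])
    return sorted(ranked[:num_to_prune])

# Toy calibration set: 3 samples x 4 neurons (made-up values).
calibration = [
    [0.9, 0.01, -0.4, 0.0],
    [1.1, -0.02, 0.5, 0.0],
    [0.8, 0.00, -0.6, 0.1],
]

to_prune = select_neurons_to_prune(calibration, prune_percent=0.5)
print(to_prune)
```

In a GLU model the selected indices would still have to be removed jointly from `gate_proj`, `up_proj`, and `down_proj`, exactly as in the pruning code of this notebook.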


## Author's Note.
In addition to creating content like this notebook and offering it under the MIT license, I have also contributed to repositories such as those of Hugging Face and Google Gemini.

I am especially proud of my book: <a href="https://amzn.to/4eanT1g"><b>Large Language Models:</b> Apply and Implement Strategies for Large Language Models</a> (Apress).

You can find it on both <a href="https://amzn.to/4eanT1g">Amazon</a> and <a href="https://link.springer.com/book/10.1007/979-8-8688-0515-8">Springer</a>, where they often have good deals on the purchase price.

If you take a look and end up purchasing it, keep in mind that you can reach out with any questions via the Discussions section of this same repository or on any of my social media channels. I’ll do my best to respond as quickly as possible.