<a href="https://colab.research.google.com/github/peremartra/Large-Language-Model-Notebooks-Course/blob/main/6-PRUNING/6_3ba_pruning_llama_instruct_optipfair.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<div>
    <h1>Large Language Models Projects</h1>
    <h3>Apply and Implement Strategies for Large Language Models</h3>
    <h2>Pruning Llama 3.2. with OptiPFair (Adapted to Instruct models)</h2>
    <h3>Using OptiPFair to prune MLP Layers with GLU structure.</h3>
</div>

by [Pere Martra](https://www.linkedin.com/in/pere-martra/)

________
Models: meta-llama/Llama-3.2-1B-Instruct

Colab Environment: GPU T4.

Keys:
* Pruning
* Structured pruning
* OptiPFair

_______
**Disclaimer: The pruning section was created after the first edition of the book was published. It is not included in the book’s original content but is intended to supplement and expand on the topics covered.**

This is the unofficial repository for the book:
        <a href="https://amzn.to/4eanT1g"> <b>Large Language Models:</b> Apply and Implement Strategies for Large Language Models</a> (Apress).
        The book is based on the content of this repository, but the notebooks are being updated, and I am incorporating new examples and chapters.
        If you are looking for the official repository for the book, with the original notebooks, you should visit the
        <a href="https://github.com/Apress/Large-Language-Models-Projects">Apress repository</a>, where you can find all the notebooks in their original format as they appear in the book.

This notebook serves as a demonstration code for the paper [Exploring GLU Expansion Ratios: Structured Pruning in Llama-3.2 Models.](https://doi.org/10.31219/osf.io/qgxea)

The paper studies how the % of expansion produced in the GLU layers influences performance and consumption. For this purpose, seven different models have been generated from the Llama-3.2-1B and Llama-3.2-3B base models, reaching the conclusion that the optimal balance is achieved with an expansion of 140%.
______

# Introduction
This notebook continues the work done in [6_3_pruning_structured_llama3.2-1b_OK.ipynb](https://github.com/peremartra/Large-Language-Model-Notebooks-Course/blob/main/6-PRUNING/6_3_pruning_structured_llama3.2-1b_OK.ipynb), where the pruning process was done manually; you can find the implementation code there. In this notebook, we use the [OptiPFair](https://github.com/peremartra/optipfair) library, which I developed to simplify the pruning process for LLMs.

This notebook focuses on explaining how the OptiPFair library works and the options it offers for pruning the MLP layers of models with a GLU structure: Llama, Gemma, Qwen, Mistral, and others.


# Install libraries & Configure variables.

In [1]:
!pip install -q transformers
!pip install -q optipfair


In [2]:
pip show optipfair

Name: optipfair
Version: 0.1.5
Summary: A library for structured pruning & Bias visualization of large language models
Home-page: https://github.com/peremartra/optipfair
Author: Pere Martra
Author-email: peremartra@uadla.com
License: 
Location: /usr/local/lib/python3.12/dist-packages
Requires: click, torch, tqdm, transformers
Required-by: 


In [3]:
import torch
import os

from tqdm import tqdm
from optipfair import prune_model
from transformers import AutoModelForCausalLM, AutoTokenizer

In [4]:
# Check if GPU is available
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Using device: {device}")

Using device: cuda


# Download model and explore structure

In [5]:
model_name = 'meta-llama/Llama-3.2-1B-Instruct'
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16).to(device)
tokenizer = AutoTokenizer.from_pretrained(model_name)
#tokenizer.pad_token = tokenizer.eos_token  # Set pad token

`torch_dtype` is deprecated! Use `dtype` instead!



In [6]:
def get_output(prompt, model=model, tokenizer=tokenizer):
    # Chat format for Instruct models
    messages = [
        {"role": "user", "content": prompt}
    ]

    # Apply the chat template
    inputs = tokenizer.apply_chat_template(
        messages,
        add_generation_prompt=True,
        return_tensors='pt'
    ).to(device)

    outputs = model.generate(
        inputs,
        max_length=150,
        num_return_sequences=1,
        pad_token_id=tokenizer.pad_token_id,
        temperature=None,
        top_p=None,
        do_sample=False,
        num_beams=5,
        early_stopping=True,
        no_repeat_ngram_size=2
    )
    generated_tokens = outputs[0][len(inputs[0]):]
    generated = tokenizer.decode(generated_tokens, skip_special_tokens=True)


    return generated.strip()

## Studying the model structure
As demonstrated in the [previous notebook](https://github.com/peremartra/Large-Language-Model-Notebooks-Course/blob/main/6_2_pruning_structured_llama3.2-1b_KO.ipynb), studying the structure of the model that will undergo pruning is crucial.

In this notebook, we’re going to fine-tune the pruning process for the Llama3.2 model.

In [7]:
print(model)

LlamaForCausalLM(
  (model): LlamaModel(
    (embed_tokens): Embedding(128256, 2048)
    (layers): ModuleList(
      (0-15): 16 x LlamaDecoderLayer(
        (self_attn): LlamaAttention(
          (q_proj): Linear(in_features=2048, out_features=2048, bias=False)
          (k_proj): Linear(in_features=2048, out_features=512, bias=False)
          (v_proj): Linear(in_features=2048, out_features=512, bias=False)
          (o_proj): Linear(in_features=2048, out_features=2048, bias=False)
        )
        (mlp): LlamaMLP(
          (gate_proj): Linear(in_features=2048, out_features=8192, bias=False)
          (up_proj): Linear(in_features=2048, out_features=8192, bias=False)
          (down_proj): Linear(in_features=8192, out_features=2048, bias=False)
          (act_fn): SiLUActivation()
        )
        (input_layernorm): LlamaRMSNorm((2048,), eps=1e-05)
        (post_attention_layernorm): LlamaRMSNorm((2048,), eps=1e-05)
      )
    )
    (norm): LlamaRMSNorm((2048,), eps=1e-05)
    (ro


An MLP block typically consists of layers that scale the data to larger dimensions and others that return it to its original size.

In the MLP block of the model, we find two projection layers, `gate_proj` and `up_proj`, both scaling from 2048 to 8192, plus `down_proj`, which returns the data from 8192 to 2048. Having two layers that project to the same intermediate size is characteristic of a gating mechanism, which selectively controls information flow in a neural network by using learned weights to "gate" or filter inputs.

To truly understand how these layers function, we’d need to refer to the model's documentation or even the source code. In practice, though, this structure indicates that the layers performing the upsizing work in pairs and cannot be treated as independent linear layers; I haven't encountered a case where that doesn't hold.

In other words, any operation we apply to one layer must be replicated in the other. Most importantly, when identifying which neurons have more or less importance, we can't evaluate the neurons of a single layer in isolation; we need to treat them as pairs.
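
To make the pairing concrete, here is a minimal pure-Python sketch of a GLU-style MLP with toy sizes (hidden size 2, intermediate size 3). The weights and the `glu_mlp` and `drop_neuron` helpers are made up for illustration; they are not taken from Llama or from OptiPFair. The point is that removing intermediate neuron `i` means removing row `i` of both `gate_proj` and `up_proj`, and column `i` of `down_proj`.

```python
import math

def silu(x):
    """SiLU activation: x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))

def matvec(matrix, vector):
    """Multiply a matrix (given as a list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]

def glu_mlp(x, gate_proj, up_proj, down_proj):
    """Compute down_proj(silu(gate_proj(x)) * up_proj(x))."""
    gate = [silu(g) for g in matvec(gate_proj, x)]
    up = matvec(up_proj, x)
    hidden = [g * u for g, u in zip(gate, up)]
    return matvec(down_proj, hidden)

def drop_neuron(gate_proj, up_proj, down_proj, i):
    """Remove intermediate neuron i from all three projections."""
    new_gate = [row for j, row in enumerate(gate_proj) if j != i]
    new_up = [row for j, row in enumerate(up_proj) if j != i]
    new_down = [[w for j, w in enumerate(row) if j != i] for row in down_proj]
    return new_gate, new_up, new_down

# Toy weights: each row of gate_proj/up_proj is one intermediate neuron.
gate_proj = [[0.5, -1.0], [0.0, 0.0], [1.5, 0.3]]
up_proj = [[1.0, 0.2], [0.0, 0.0], [-0.7, 0.9]]
down_proj = [[0.4, 0.8, -0.6], [1.1, -0.3, 0.2]]

x = [1.0, 2.0]
original = glu_mlp(x, gate_proj, up_proj, down_proj)

# Neuron 1 has all-zero gate and up weights, so it contributes nothing.
pruned = drop_neuron(gate_proj, up_proj, down_proj, 1)
after = glu_mlp(x, *pruned)

# All three projections now agree on an intermediate size of 2.
print(len(pruned[0]), len(pruned[1]), len(pruned[2][0]))
print(all(math.isclose(a, b) for a, b in zip(original, after)))
```

Dropping the row from only one of the two up-projections would leave `gate` and `up` with different lengths, and the element-wise product would no longer line up with the columns of `down_proj`.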



In [8]:
# Test the original model
prompt = "What is the capital of France?"
generated = get_output(prompt)
print(f"Generated text: {generated}")

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.
The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.


Generated text: Paris is both the largest city in France and its capital, as it is where the French government is located.


# Pruning the Model.
## Support pruning functions.
### Compute neuron importance functions.

To perform the pruning process, you need to call the `prune_model` function from the library [OptiPFair](https://github.com/peremartra/optipfair).

To use this function, you need to provide the following parameters:

* model: The model to be pruned.
* pruning_type: The type of pruning to apply. In this case, it will be "MLP_GLU", which is currently the only option supported by the library.
* neuron_selection_method:
  * MAW: Maximum Absolute Weight Values.
  * VOW: Variance Of Weights.
  * PON: Product Of Norms.
  With LLaMA models, the method that works best without requiring further training is MAW.
* pruning_percentage or expansion_rate: you need to provide one of them. In this notebook, we’ll use pruning_percentage.
* show_progress: By default, it's set to True. It displays the progress of the pruning process.
* return_stats: By default, it's set to True. It returns the percentage of neurons removed and the resulting expansion rate.
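
The two sizing parameters describe the same quantity from different angles. The sketch below assumes that the expansion rate is the intermediate size expressed as a percentage of the hidden size (8192 / 2048 = 400% for the unpruned model); the helper names `expansion_rate` and `pruning_percentage_for` are hypothetical and are not part of OptiPFair, whose rounding may also differ.

```python
def expansion_rate(hidden_size, intermediate_size):
    """Intermediate size expressed as a percentage of the hidden size."""
    return intermediate_size / hidden_size * 100

def pruning_percentage_for(hidden_size, intermediate_size, target_expansion_rate):
    """Percentage of intermediate neurons to remove to reach a target expansion rate."""
    target_size = hidden_size * target_expansion_rate / 100
    return (1 - target_size / intermediate_size) * 100

# MLP sizes taken from the model structure printed earlier.
hidden_size, intermediate_size = 2048, 8192

print(expansion_rate(hidden_size, intermediate_size))               # 400.0
print(pruning_percentage_for(hidden_size, intermediate_size, 300))  # 25.0
print(round(pruning_percentage_for(hidden_size, intermediate_size, 140), 2))  # 65.0
```

Under this assumption, an expansion rate of 140% on this model corresponds to removing roughly 65% of the intermediate neurons.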


*I’m leaving the other methods (VOW and PON) in the notebook purely as an exercise.*

The **MAW** method works better because it directly identifies the most influential neurons based on the magnitude of their connections. These neurons are likely responsible for key decisions, making the model more accurate after pruning. The Variance of Weights method, while useful in some contexts, can retain neurons that may not contribute significantly to the task, leading to less coherent model outputs.

However, we shouldn’t fall into the trap of assuming that this neuron selection method will work best across all model structures. It works well with Llama models, and this may be due to several factors:

* The relatively large projection from 2048 to 8192.
* The use of a GLU structure.
* The type of activation function used.

So, if we use a model from another family, like Gemma or Mistral, the neuron selection method might need to be entirely different.
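
As a rough illustration of the idea behind a maximum-absolute-weight criterion, the sketch below scores each gate/up neuron pair by the largest absolute weight in its `gate_proj` row plus the largest absolute weight in its `up_proj` row, then keeps the top-scoring pairs. This is an assumption about the general shape of such a score, written in pure Python with toy weights; OptiPFair's actual MAW formula and rounding may differ, and `pair_importance` and `select_neurons_to_keep` are hypothetical names.

```python
def pair_importance(gate_proj, up_proj):
    """Score each neuron pair by max |weight| in its gate row plus its up row."""
    return [
        max(abs(w) for w in gate_row) + max(abs(w) for w in up_row)
        for gate_row, up_row in zip(gate_proj, up_proj)
    ]

def select_neurons_to_keep(scores, pruning_percentage):
    """Return the sorted indices of the neuron pairs that survive pruning."""
    # Assumption: the number of pruned neurons is rounded down.
    num_to_prune = int(len(scores) * pruning_percentage / 100)
    num_to_keep = len(scores) - num_to_prune
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:num_to_keep])

# Toy layer with 5 intermediate neurons and a hidden size of 3.
gate_proj = [[0.9, -0.1, 0.2], [0.05, 0.02, -0.01], [-1.2, 0.4, 0.3],
             [0.3, 0.3, -0.3], [0.1, -0.8, 0.6]]
up_proj = [[0.5, 0.1, -0.2], [0.03, -0.04, 0.01], [0.7, -0.9, 0.1],
           [0.2, -0.1, 0.1], [-0.4, 0.2, 1.0]]

scores = pair_importance(gate_proj, up_proj)
keep = select_neurons_to_keep(scores, pruning_percentage=20)
print(keep)  # [0, 2, 3, 4]: neuron 1 has the smallest weights and is pruned
```

The resulting indices would then be used to slice the rows of `gate_proj` and `up_proj` and the columns of `down_proj`, so that the pair is always removed together.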

## Obtain & test the pruned model.

In [9]:
# Prune 20% of neurons from MLP layers using MAW method
pruned_model, stats = prune_model(
    model=model,
    pruning_type="MLP_GLU",
    neuron_selection_method="MAW",
    pruning_percentage=20,
    show_progress=True,
    return_stats=True
)

Pruning layers: 100%|██████████| 16/16 [00:05<00:00,  3.17it/s]


In [10]:
# Print pruning statistics
print(f"Original parameters: {stats['original_parameters']:,}")
print(f"Pruned parameters: {stats['pruned_parameters']:,}")
print(f"Reduction: {stats['reduction']:,} parameters ({stats['percentage_reduction']:.2f}%)")
print(f"Expansion rate: {stats['expansion_rate']:,}%")


Original parameters: 1,235,814,400
Pruned parameters: 1,074,792,448
Reduction: 161,021,952 parameters (13.03%)
Expansion rate: 320.01953125%
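
These statistics can be checked by hand from the layer shapes printed earlier (16 layers, hidden size 2048, intermediate size 8192). The sketch below is plain arithmetic, not a call into OptiPFair, and it assumes the number of pruned neurons is rounded down.

```python
num_layers = 16
hidden_size = 2048
original_intermediate = 8192
pruning_percentage = 20

# Assumption: the number of pruned neurons per layer is rounded down.
pruned_neurons = int(original_intermediate * pruning_percentage / 100)
new_intermediate = original_intermediate - pruned_neurons

# gate_proj, up_proj and down_proj each lose hidden_size weights per pruned neuron.
removed_per_layer = 3 * hidden_size * pruned_neurons
removed_total = num_layers * removed_per_layer

original_parameters = 1_235_814_400  # value reported in the statistics
reduction_pct = removed_total / original_parameters * 100
new_expansion_rate = new_intermediate / hidden_size * 100

print(f"New intermediate size: {new_intermediate}")   # 6554
print(f"Removed parameters: {removed_total:,}")       # 161,021,952
print(f"Reduction: {reduction_pct:.2f}%")             # 13.03%
print(f"Expansion rate: {new_expansion_rate}%")       # 320.01953125%
```

The reduction is 13% rather than 20% because only the MLP projections shrink; the embeddings and attention layers keep their original size.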


In [11]:
# Test the pruned model with the same prompt

prompt = "What is the capital of France?"
generated = get_output(prompt, model=pruned_model)
print(f"Generated text: {generated}")

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


Generated text: Yes, it is Paris.


The result is slightly different from what the original model produced, but it’s still a fairly accurate response.

In contrast to the model created in notebook: [6_2_pruning_structured_llama3.2-1b_KO.ipynb](https://github.com/peremartra/Large-Language-Model-Notebooks-Course/blob/main/6_2_pruning_structured_llama3.2-1b_KO.ipynb) where the pruned Llama model lost almost all its utility, the model in this notebook retains a good portion of its knowledge.

Looking at the model’s new structure, we can see that the `gate_proj` and `up_proj` layers have had their `out_features` reduced to 6554 from 8192. Consequently, the `down_proj` layer has its `in_features` adjusted to match the new size.

In [12]:
print(model)

LlamaForCausalLM(
  (model): LlamaModel(
    (embed_tokens): Embedding(128256, 2048)
    (layers): ModuleList(
      (0-15): 16 x LlamaDecoderLayer(
        (self_attn): LlamaAttention(
          (q_proj): Linear(in_features=2048, out_features=2048, bias=False)
          (k_proj): Linear(in_features=2048, out_features=512, bias=False)
          (v_proj): Linear(in_features=2048, out_features=512, bias=False)
          (o_proj): Linear(in_features=2048, out_features=2048, bias=False)
        )
        (mlp): LlamaMLP(
          (gate_proj): Linear(in_features=2048, out_features=6554, bias=False)
          (up_proj): Linear(in_features=2048, out_features=6554, bias=False)
          (down_proj): Linear(in_features=6554, out_features=2048, bias=False)
          (act_fn): SiLUActivation()
        )
        (input_layernorm): LlamaRMSNorm((2048,), eps=1e-05)
        (post_attention_layernorm): LlamaRMSNorm((2048,), eps=1e-05)
      )
    )
    (norm): LlamaRMSNorm((2048,), eps=1e-05)
    (ro

# Conclusion
This notebook serves as a demonstration of how simple it can be to create your own optimized models using the [OptiPFair](https://github.com/peremartra/optipfair/tree/main) library.

The result is a model with about 13% fewer parameters, so it uses less memory and should be faster at inference than the original, while, judging by this quick test, still retaining most of its knowledge.
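
As a back-of-the-envelope estimate of the memory saved, the sketch below multiplies the parameter counts reported in the pruning statistics by 2 bytes, since the model was loaded in float16. It only accounts for the weights; activations and the KV cache are ignored, and `weights_size_gb` is a hypothetical helper.

```python
BYTES_PER_PARAM_FP16 = 2  # float16 stores each weight in 2 bytes

def weights_size_gb(num_parameters, bytes_per_param=BYTES_PER_PARAM_FP16):
    """Approximate size of the weights in GB (10^9 bytes)."""
    return num_parameters * bytes_per_param / 1e9

original_gb = weights_size_gb(1_235_814_400)
pruned_gb = weights_size_gb(1_074_792_448)

print(f"Original: {original_gb:.2f} GB")             # Original: 2.47 GB
print(f"Pruned:   {pruned_gb:.2f} GB")               # Pruned:   2.15 GB
print(f"Saved:    {original_gb - pruned_gb:.2f} GB")  # Saved:    0.32 GB
```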

Of course, many more tests and a wider range of benchmarks are needed to fully understand the model’s performance. The evaluation module of the [OptiPFair](https://github.com/peremartra/optipfair/tree/main) library is still under development, so in the future it will be possible to apply more types of pruning and evaluate the resulting models with better benchmarks.

## Author's Note.
In addition to creating content like this notebook and offering it under the MIT license, I have also contributed to repositories such as those of Hugging Face and Google Gemini.

I am especially proud of my book: <a href="https://amzn.to/4eanT1g"><b>Large Language Models:</b> Apply and Implement Strategies for Large Language Models</a> (Apress).

You can find it on both <a href="https://amzn.to/4eanT1g">Amazon</a> and <a href="https://link.springer.com/book/10.1007/979-8-8688-0515-8">Springer</a>, where they often have good deals on the purchase price.

If you take a look and end up purchasing it, keep in mind that you can reach out with any questions via the Discussions section of this same repository or on any of my social media channels. I’ll do my best to respond as quickly as possible.

## References.
* Martra, P. (2024). Exploring GLU Expansion Ratios: Structured Pruning in Llama-3.2 Models. https://doi.org/10.31219/osf.io/qgxea