<a href="https://colab.research.google.com/github/peremartra/Large-Language-Model-Notebooks-Course/blob/main/6_6_pruning_attention_layers.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<div>
    <h1>Large Language Models Projects</h1>
    <h3>Apply and Implement Strategies for Large Language Models</h3>
    <h2>Not All Attention is needed</h2>
    <h3></h3>
</div>

by [Pere Martra](https://www.linkedin.com/in/pere-martra/)

_______
Models: meta-llama/Llama-3.2

Colab Environment: GPU A100.

Keys:
* Pruning
* Attention
_______
**Disclaimer: The pruning / knowledge distillation section was created after the first edition of the book was published. It is not included in the book’s original content but is intended to supplement and expand on the topics covered.**

This is the unofficial repository for the book:
        <a href="https://amzn.to/4eanT1g"> <b>Large Language Models:</b> Apply and Implement Strategies for Large Language Models</a> (Apress).
        The book is based on the content of this repository, but the notebooks are being updated, and I am incorporating new examples and chapters.
        If you are looking for the official repository for the book, with the original notebooks, you should visit the
        <a href="https://github.com/Apress/Large-Language-Models-Projects">Apress repository</a>, where you can find all the notebooks in their original format as they appear in the book.

______
# INTRODUCTION
This notebook is an implementation of the paper: [What Matters in Transformers? Not all Attention is Needed](https://arxiv.org/abs/2406.15786). Although it follows the paper's guidance, I have taken some liberties to make the code clearer and easier to understand.

The original paper shows that larger models tend to have considerable redundancy in their attention layers. The authors report a 48.4% inference speedup for a Llama-2-70B model with only a 2.4% drop in performance, **just by bypassing 50% of the attention layers!**

In this notebook, tests have been conducted using Llama-3.2-1B and 3B models. With these models, I found that removing 50% of the attention layers significantly impacted the model's functionality. However, the 3B model handled the removal of these layers much better. This suggests that redundancy may become more pronounced as model size increases.

# METHODOLOGY
To identify which layers contribute the least, the cosine distance between the layer's input and output is measured. In the paper, this distance is calculated using a test dataset, while in the notebook, I used a simple prompt to activate the layers.  
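
As a minimal, dependency-free sketch of the score described above: the cosine distance is one minus the cosine similarity of two vectors. The function name `cosine_distance` and the toy vectors are illustrative assumptions, not taken from the paper or from the notebook's code.

```python
import math

def cosine_distance(x, y):
    """Return 1 - cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return 1.0 - dot / (norm_x * norm_y)

layer_input = [1.0, 2.0, 3.0]
nearly_same = [1.1, 2.0, 2.9]      # an output that barely changes the input
very_different = [-3.0, 0.5, 1.0]  # an output that points somewhere else

low = cosine_distance(layer_input, nearly_same)
high = cosine_distance(layer_input, very_different)
print(f"nearly the same: {low:.4f}, very different: {high:.4f}")
```

A layer whose output resembles `nearly_same` would get a score close to 0 and be an early candidate for bypassing.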

This method of measuring the importance of attention layers and their contribution to the model allows pruning to be tailored to a specific dataset. This approach can lead to the creation of more efficient models for specialized sectors such as healthcare or finance.


Once the layer contributing the least to the final output is identified (the one with the smallest difference between input and output), it is added to a list included in the configuration file.

This list is then referenced during inference by a new forward function that replaces the original one for attention layers. When this new function detects that a layer is in the list, it skips its execution and simply returns the input without modifications.
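
The idea can be sketched without any model at all. In the toy example below, `ToyAttention` and `make_bypassing_forward` are hypothetical names used only for illustration; the point is that the replacement forward consults the list at call time and returns its input untouched when the layer is listed.

```python
class ToyAttention:
    """Stand-in for an attention module; it simply doubles its input."""
    def __init__(self, layer_idx):
        self.layer_idx = layer_idx

    def forward(self, hidden_states):
        return [2 * h for h in hidden_states]

def make_bypassing_forward(layer, drop_attn_list):
    original_forward = layer.forward  # keep a handle on the real computation

    def forward(hidden_states):
        # The list is checked on every call, so later additions take effect.
        if layer.layer_idx in drop_attn_list:
            return hidden_states
        return original_forward(hidden_states)

    return forward

drop_attn_list = [1]
layers = [ToyAttention(i) for i in range(3)]
for layer in layers:
    layer.forward = make_bypassing_forward(layer, drop_attn_list)

outputs = [layer.forward([1.0, 2.0]) for layer in layers]
print(outputs)
```

Only the layer with index 1 returns its input unchanged; the other two still run their original computation.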

The process of identifying layers to deactivate and marking them as non-executable is iterative. In other words, it does not determine all the layers to skip in one go. Instead, it first identifies a layer to deactivate, disables it, and, if necessary, identifies the next layer to deactivate. This procedure differs from the one used in the paper, where layers are selected in a single pass.

The iterative implementation has a significant drawback: the test dataset must be processed for each layer to be deactivated. The paper's authors note that while the iterative method may bring slight improvements, the added computational cost is not justified. However, since this is an example notebook, and there is no test dataset—just a small prompt—and the layer selection process takes only seconds, I chose the iterative approach.
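
To make the difference between the two strategies concrete, here is a small sketch with made-up scores. `toy_scores` is a hypothetical scoring function in which layers 1 and 2 are redundant with each other, so dropping one makes the other important; a real scoring function would run the model instead.

```python
def one_shot_selection(score_fn, num_layers, k):
    """Score every layer once and drop the k lowest-scoring ones."""
    scores = score_fn(set(range(num_layers)))
    return sorted(scores, key=scores.get)[:k]

def iterative_selection(score_fn, num_layers, k):
    """Drop one layer at a time, re-scoring the remaining layers each round."""
    active = set(range(num_layers))
    dropped = []
    for _ in range(k):
        scores = score_fn(active)
        worst = min(scores, key=scores.get)
        dropped.append(worst)
        active.remove(worst)
    return dropped

def toy_scores(active):
    # Hypothetical scores, for illustration only.
    base = {0: 0.9, 1: 0.1, 2: 0.2, 3: 0.5}
    scores = {i: base[i] for i in active}
    if (1 in active) != (2 in active):      # exactly one of the pair is left
        survivor = 1 if 1 in active else 2
        scores[survivor] = 1.0              # the survivor now does all the work
    return scores

one_shot = one_shot_selection(toy_scores, 4, 2)
iterative = iterative_selection(toy_scores, 4, 2)
print("one-shot :", one_shot)
print("iterative:", iterative)
```

The one-shot pass drops both redundant layers, while the iterative pass notices after the first removal that the remaining one has become important. The price is one scoring pass per dropped layer.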

This pruning method does not produce a smaller model, as the layers are not physically removed. They remain in the model but are not executed, resulting in improved inference response times.
______

# Install libraries & Configure variables.

In [None]:
!pip install -q transformers==4.47.1
!pip install -q datasets==3.2.0
!pip install -q torch==2.5.1
!pip install -q lm-eval==0.4.7

In [None]:
import logging
import math
import os
import sys
import shutil
from copy import deepcopy

import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer


# Download the Model.

In [None]:
# Check if GPU is available
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Using device: {device}")

Using device: cuda


In [None]:
#model_name = 'meta-llama/Llama-3.2-1B'
model_name = 'meta-llama/Llama-3.2-3B'
model = AutoModelForCausalLM.from_pretrained(model_name).to(device)
tokenizer = AutoTokenizer.from_pretrained(model_name)
#tokenizer.pad_token = tokenizer.eos_token  # Set pad token


## Study the structure.
* Llama-3.2-1B
```
LlamaForCausalLM(
  (model): LlamaModel(
    (embed_tokens): Embedding(128256, 2048)
    (layers): ModuleList(
      (0-15): 16 x LlamaDecoderLayer(
        (self_attn): LlamaSdpaAttention(
          (q_proj): Linear(in_features=2048, out_features=2048, bias=False)
          (k_proj): Linear(in_features=2048, out_features=512, bias=False)
          (v_proj): Linear(in_features=2048, out_features=512, bias=False)
          (o_proj): Linear(in_features=2048, out_features=2048, bias=False)
          (rotary_emb): LlamaRotaryEmbedding()
        )
        (mlp): LlamaMLP(
          (gate_proj): Linear(in_features=2048, out_features=8192, bias=False)
          (up_proj): Linear(in_features=2048, out_features=8192, bias=False)
          (down_proj): Linear(in_features=8192, out_features=2048, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): LlamaRMSNorm((2048,), eps=1e-05)
        (post_attention_layernorm): LlamaRMSNorm((2048,), eps=1e-05)
      )
    )
    (norm): LlamaRMSNorm((2048,), eps=1e-05)
    (rotary_emb): LlamaRotaryEmbedding()
  )
  (lm_head): Linear(in_features=2048, out_features=128256, bias=False)
)
```


The model follows the typical structure of modern Llama models, consisting of blocks made up of an Attention layer and an MLP layer with a GLU structure.

> If you want to see an example of how to prune the MLP layers of the model, you can check out the notebook [Pruning Llama 3.2](https://github.com/peremartra/Large-Language-Model-Notebooks-Course/blob/main/6-PRUNING/6_3_pruning_structured_llama3.2-1b_OK.ipynb) and read the paper [Exploring GLU expansion ratios: Structured pruning in Llama-3.2 models](https://osf.io/preprints/osf/qgxea).



Since the layers form a block, the attention layer cannot be removed without also removing the accompanying MLP layer. For this reason, I decided to deactivate the attention layers' execution during inference instead of removing them.

The 1B model has 16 layers, as shown in the structure above, while the 3B model has 28 layers.


# Pruning the Model.

In [None]:
import torch
import torch.nn as nn
from torch.nn import functional as F
from copy import deepcopy

The function `measure_unpruned_layer_importances` calculates importance scores for the attention layers that have not yet been pruned. In the first iteration, these are all the layers in the model.

The basic idea is: if a layer's output is very similar to its input, it might not be doing much important work and could be a candidate for pruning. To measure the difference, I use cosine similarity.

In [None]:
def measure_unpruned_layer_importances(pruned_model, tokenizer, input_text):
    """
    Measures and returns importance scores for all unpruned (non-bypassed) layers.
    """
    # PREPARATION
    """
    set the model to evaluation mode to ensure that no gradients
    are computed during the forward pass.
    """
    pruned_model.eval()
    device = next(pruned_model.parameters()).device

    """
    The provided input text (input_text) is tokenized into tensors
    suitable for processing by the model.
    """
    inputs = tokenizer(input_text, return_tensors="pt").to(device)

    """This will hold tuples of (layer_idx, importance_score)"""
    importance_scores = []

    # IDENTIFY UNPRUNED LAYERS & CREATING HOOKS
    """
    We'll register hooks for only layers that are NOT in drop_attn_list
    The list of attention layers that have already been pruned,
    is stored in a variable in the model's config: pruned_model.config.drop_attn_list.
    """
    unpruned_layer_indices = [
        idx for idx in range(len(pruned_model.model.layers))
        if idx not in pruned_model.config.drop_attn_list
    ]

    """
    Temporary storage for each layer's input/output
    We'll store them by layer index
    """
    layer_inputs = {}
    layer_outputs = {}

    """
    Create 2 hooks to capture the input and the output of the layers.
    These hooks store the inputs and outputs in dictionaries
    (layer_inputs and layer_outputs) for later analysis
    """
    # Captures the input to the query projection (q_proj)
    def q_proj_input_hook(layer_idx):
        def _hook(module, module_input):
            # module_input can be a tuple depending on PyTorch version
            inp = module_input[0] if isinstance(module_input, tuple) else module_input
            layer_inputs[layer_idx] = inp.detach().clone()
        return _hook

    # Captures the output of the output projection (o_proj)
    def o_proj_output_hook(layer_idx):
        def _hook(module, module_input, module_output):
            out = module_output[0] if isinstance(module_output, tuple) else module_output
            layer_outputs[layer_idx] = out.detach().clone()
        return _hook

    # Register hooks for each unpruned layer
    handles = []
    for idx in unpruned_layer_indices:
        layer = pruned_model.model.layers[idx]
        handles.append(layer.self_attn.q_proj.register_forward_pre_hook(q_proj_input_hook(idx)))
        handles.append(layer.self_attn.o_proj.register_forward_hook(o_proj_output_hook(idx)))

    # FORWARD PASS
    """
    Single forward pass (no gradient needed)
    A single forward pass is performed on the input text.
    During this pass, the hooks capture the inputs and outputs of the unpruned layers.
    This step is done with torch.no_grad(),
    ensuring no gradients are calculated, which saves memory and computation.
    """
    with torch.no_grad():
        _ = pruned_model(**inputs)

    """
    The hooks are removed after the forward pass
    to avoid memory leaks or interference with subsequent operations.
    """
    for h in handles:
        h.remove()


    # COMPUTE IMPORTANCE SCORES
    """
    For each unpruned layer, the inputs and outputs are flattened into vectors for comparison.

    Cosine Similarity: The similarity between the input and output vectors is
    computed using cosine similarity. Layers with outputs that are very similar
    to their inputs likely contribute less to the model’s overall computation.

    Importance Score: The importance score for each layer is calculated as 1−similarity
    A higher score indicates that the layer transforms its input significantly
    and is therefore more important to the model's function.
    """
    for idx in unpruned_layer_indices:
        if idx in layer_inputs and idx in layer_outputs:
            inp = layer_inputs[idx]
            out = layer_outputs[idx]

            inp_flat = inp.view(inp.size(0), -1)
            out_flat = out.view(out.size(0), -1)

            similarity = F.cosine_similarity(inp_flat, out_flat, dim=1).mean().item()
            importance_score = 1 - similarity
            importance_scores.append((idx, importance_score))

            print(f"[Iterative] Layer {idx} importance score: {importance_score:.4f}")

    """A list of tuples is returned, where each tuple contains the layer index
    and its calculated importance score."""
    return importance_scores


The function `bypass_single_layer` is used to disable the attention mechanism of a specific layer in the model without permanently removing or modifying the layer.

This is achieved by dynamically overriding the layer’s forward method to bypass its attention computation.

Because the attention layers are grouped with the MLP layers in a single block, we cannot simply remove an attention layer without also removing the associated MLP layer. But we can bypass it.

The bypassed layer skips the computationally expensive attention operations, reducing inference time, although its weights still occupy memory.
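
The override relies on a plain Python mechanism: calling `function.__get__(instance, cls)` returns a method bound to that one instance, and assigning it as an instance attribute shadows the class's own method. The sketch below uses a hypothetical `Greeter` class in place of an attention module to show that only the patched instance changes behaviour.

```python
class Greeter:
    def __init__(self, name):
        self.name = name

    def forward(self, text):
        return f"{self.name}: {text}"

def bypass_forward(self, text):
    # `self` is supplied automatically once the function is bound.
    return text

a, b = Greeter("a"), Greeter("b")

# Binding through the descriptor protocol yields a method tied to `a` only.
a.forward = bypass_forward.__get__(a, type(a))

patched = a.forward("hi")    # the instance attribute shadows the class method
untouched = b.forward("hi")  # other instances keep the original behaviour
print(patched, "|", untouched)
```

This is why the function in the next cell can patch a single layer's `self_attn` while leaving every other layer of the same class as it was.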


In [None]:
def bypass_single_layer(pruned_model, layer_idx):
    """
    Modifies the specified layer's forward method so that attention is bypassed.
    """
    layer = pruned_model.model.layers[layer_idx]
    # Store the original forward.
    if not hasattr(layer.self_attn, '_original_forward'):
        layer.self_attn._original_forward = layer.self_attn.forward

    # A new forward that checks whether to bypass
    def new_attention_forward(self, hidden_states, attention_mask=None, position_ids=None,
                              past_key_value=None, output_attentions=False, use_cache=False,
                              **kwargs):
        # If this layer is in drop_attn_list, bypass
        if getattr(self, 'layer_idx', -1) in pruned_model.config.drop_attn_list:
            return hidden_states, None, None
        # Otherwise, use the original forward
        return self._original_forward(hidden_states, attention_mask, position_ids,
                                      past_key_value, output_attentions, use_cache, **kwargs)

    # Set the layer index and forward
    layer.self_attn.layer_idx = layer_idx
    layer.self_attn.forward = new_attention_forward.__get__(layer.self_attn, type(layer.self_attn))


The `iterative_pruning` function is the core of the pruning process.

It performs iterative pruning on a model by repeatedly identifying and bypassing the least important layers until a specified number of layers have been pruned.

In [None]:
def iterative_pruning(model, tokenizer, input_text, num_layers_to_prune):
    """
    Iteratively:
      1) Measures importance of unpruned layers,
      2) Prunes (bypasses) the least important layer,
      3) Repeats until num_layers_to_prune layers are pruned.
    """
    # Create a copy of the model so we don't modify the original
    pruned_model = deepcopy(model)

    # Make sure we have a list of pruned layers in config
    pruned_model.config.drop_attn_list = []

    total_layers = len(pruned_model.model.layers)
    print(f"Total layers: {total_layers}")

    for step in range(num_layers_to_prune):
        print(f"\n--- Iteration {step + 1} of {num_layers_to_prune} ---")

        # 1) Measure importance scores for all unpruned layers
        importance_scores = measure_unpruned_layer_importances(pruned_model, tokenizer, input_text)
        if not importance_scores:
            print("No unpruned layers found or no importance scores computed.")
            break

        # 2) Pick layer with the lowest importance
        layer_to_bypass, min_score = min(importance_scores, key=lambda x: x[1])

        # 3) Bypass that layer
        pruned_model.config.drop_attn_list.append(layer_to_bypass)
        bypass_single_layer(pruned_model, layer_to_bypass)

        print(f"Bypassing layer {layer_to_bypass} with importance score {min_score:.4f}")
        print(f"Current bypass list: {pruned_model.config.drop_attn_list}")

    print(f"\nFinal bypassed layers: {sorted(pruned_model.config.drop_attn_list)}")
    print(f"Number of bypassed layers: {len(pruned_model.config.drop_attn_list)}")

    return pruned_model

## Execute Pruning.

In [None]:
pruned_model = iterative_pruning(
    model,
    tokenizer,
    "Hi I'm a sample text, use to calculate teh cosine difference between input and output.",
    num_layers_to_prune=4
)

Total layers: 28

--- Iteration 1 of 4 ---
[Iterative] Layer 0 importance score: 1.0713
[Iterative] Layer 1 importance score: 0.9355
[Iterative] Layer 2 importance score: 0.9594
[Iterative] Layer 3 importance score: 0.9597
[Iterative] Layer 4 importance score: 0.9598
[Iterative] Layer 5 importance score: 0.9989
[Iterative] Layer 6 importance score: 1.0272
[Iterative] Layer 7 importance score: 1.0362
[Iterative] Layer 8 importance score: 1.1124
[Iterative] Layer 9 importance score: 1.1242
[Iterative] Layer 10 importance score: 1.0488
[Iterative] Layer 11 importance score: 0.9960
[Iterative] Layer 12 importance score: 1.0860
[Iterative] Layer 13 importance score: 1.0245
[Iterative] Layer 14 importance score: 0.9680
[Iterative] Layer 15 importance score: 0.9707
[Iterative] Layer 16 importance score: 0.9034
[Iterative] Layer 17 importance score: 0.9300
[Iterative] Layer 18 importance score: 1.0231
[Iterative] Layer 19 importance score: 0.7274
[Iterative] Layer 20 importance score: 0.8679
[

# Test Models
The `get_output` function generates text and measures the time taken by each stage of the process: tokenization, generation, and decoding.

This gives a rough picture of the model's performance and can be used to compare the efficiency of text generation across models.
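
When timing code like this, the first call often includes one-off setup costs. As a hedged sketch of one way to reduce that noise, the helper below (the name `time_call` and its parameters are my own, not part of the notebook) performs an untimed warm-up call and then averages several timed runs using `time.perf_counter`. On a GPU, an explicit synchronization before reading the clock may also be needed, which this stdlib-only sketch does not cover.

```python
import time

def time_call(fn, *args, num_runs=3, warmup=1, **kwargs):
    """Return (result of the last call, average seconds per timed run)."""
    for _ in range(warmup):
        fn(*args, **kwargs)          # untimed run to absorb one-off setup costs
    elapsed = 0.0
    result = None
    for _ in range(num_runs):
        start = time.perf_counter()  # monotonic, high-resolution clock
        result = fn(*args, **kwargs)
        elapsed += time.perf_counter() - start
    return result, elapsed / num_runs

# Toy workload standing in for a call to model.generate(...)
result, avg_seconds = time_call(sum, range(100_000), num_runs=5)
print(f"result={result}, average={avg_seconds * 1000:.3f} ms")
```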

In [None]:
import time

def get_output(prompt, model=model, tokenizer=tokenizer, num_runs=1, max_length=50):
    total_time = 0
    generated_outputs = []

    for run in range(num_runs):
        # Start timing
        start_time = time.time()

        # Tokenization time
        token_start = time.time()
        inputs = tokenizer(prompt, return_tensors='pt').to(device)
        token_time = time.time() - token_start

        # Generation time
        gen_start = time.time()
        outputs = model.generate(
            inputs['input_ids'],
            attention_mask=inputs['attention_mask'],
            max_length=max_length,
            num_return_sequences=1,
            pad_token_id=tokenizer.pad_token_id,
            temperature=None,
            top_p=None,
            do_sample=False,  # Disable sampling
            num_beams=5,      # Use beam search
            early_stopping=True,  # Stop beam search once num_beams finished candidates are found
            no_repeat_ngram_size=2  # Prevent repetition of 2-grams
        )
        gen_time = time.time() - gen_start

        # Decoding time
        decode_start = time.time()
        generated = tokenizer.decode(outputs[0], skip_special_tokens=True)
        decode_time = time.time() - decode_start

        # Total time for this run
        total_time += time.time() - start_time
        generated_outputs.append(generated)

        if num_runs > 1:
            print(f"\nRun {run + 1}:")
        print(f"Tokenization time: {token_time*1000:.2f} ms")
        print(f"Generation time: {gen_time*1000:.2f} ms")
        print(f"Decoding time: {decode_time*1000:.2f} ms")
        print(f"Total time: {(time.time() - start_time)*1000:.2f} ms")

    if num_runs > 1:
        avg_time = total_time / num_runs
        print(f"\nAverage time over {num_runs} runs: {avg_time*1000:.2f} ms")

    return generated_outputs[0] if num_runs == 1 else generated_outputs

In [None]:
# Test the original model
prompt = "Paris is the capital of"
generated = get_output(prompt, num_runs=2)
print(f"Generated text: {generated}")

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.



Run 1:
Tokenization time: 3.14 ms
Generation time: 1709.53 ms
Decoding time: 0.28 ms
Total time: 1714.02 ms

Run 2:
Tokenization time: 0.87 ms
Generation time: 1678.55 ms
Decoding time: 0.26 ms
Total time: 1679.80 ms

Average time over 2 runs: 1696.31 ms
Generated text: ['Paris is the capital of France. It is located in the north-central part of the country, on the river Seine. The city has a population of over 2 million people, making it the largest city in France and the second-largest city', 'Paris is the capital of France. It is located in the north-central part of the country, on the river Seine. The city has a population of over 2 million people, making it the largest city in France and the second-largest city']



The text generation of the original model, as expected, works perfectly and returns a correct and meaningful sentence.

Now, let's test the pruned model, which is a Llama-3.2-3B model where I have marked 4 Attention layers to be bypassed.

In [None]:
# Test the pruned model
generated = get_output(prompt, pruned_model, num_runs=2)
print(f"Generated text: {generated}")

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.



Run 1:
Tokenization time: 1.45 ms
Generation time: 1583.52 ms
Decoding time: 0.23 ms
Total time: 1585.33 ms

Run 2:
Tokenization time: 1.74 ms
Generation time: 1567.40 ms
Decoding time: 0.26 ms
Total time: 1570.76 ms

Average time over 2 runs: 1577.31 ms
Generated text: ['Paris is the capital of France and the largest city in France. It is also one of the most popular tourist destinations worldwide because of its rich history and culture. If you’re planning to visit Paris, you’ll need to be prepared to spend money', 'Paris is the capital of France and the largest city in France. It is also one of the most popular tourist destinations worldwide because of its rich history and culture. If you’re planning to visit Paris, you’ll need to be prepared to spend money']



The pruned model runs slightly faster than the base model, averaging 1577 ms versus 1696 ms over two runs (about 7% faster), and the generated text is fairly accurate, although some repetition can be noticed.

# Store the Model.
Time to store the model and upload it to Hugging Face.

In [None]:
new_model_name = 'attnprun-llama-3.2-3B'
output_dir = './'+new_model_name
os.makedirs(output_dir, exist_ok=True)

pruned_model.save_pretrained(output_dir)
tokenizer.save_pretrained(output_dir)
#new_config.save_pretrained(output_dir)
print(f"Pruned model saved to {output_dir}")

Pruned model saved to ./attnprun-llama-3.2-3B


In [None]:
# Push the model to your Hugging Face repository

pruned_model.push_to_hub(new_model_name, private=True)

CommitInfo(commit_url='https://huggingface.co/oopere/attnprun-llama-3.2-3B/commit/9fb91445eb54cab3f89c9141a75e652626a74ad7', commit_message='Upload LlamaForCausalLM', commit_description='', oid='9fb91445eb54cab3f89c9141a75e652626a74ad7', pr_url=None, repo_url=RepoUrl('https://huggingface.co/oopere/attnprun-llama-3.2-3B', endpoint='https://huggingface.co', repo_type='model', repo_id='oopere/attnprun-llama-3.2-3B'), pr_revision=None, pr_num=None)

In [None]:
tokenizer.push_to_hub(new_model_name)

CommitInfo(commit_url='https://huggingface.co/oopere/attnprun-llama-3.2-3B/commit/61c7eedb518a8d4d0aabab0f48739516757ccebf', commit_message='Upload tokenizer', commit_description='', oid='61c7eedb518a8d4d0aabab0f48739516757ccebf', pr_url=None, repo_url=RepoUrl('https://huggingface.co/oopere/attnprun-llama-3.2-3B', endpoint='https://huggingface.co', repo_type='model', repo_id='oopere/attnprun-llama-3.2-3B'), pr_revision=None, pr_num=None)

# Evaluating the model.
In addition to the small test performed on the response generation right after the pruning process, I will run a simple BoolQ test.

BoolQ presents the model with a text and a question to be answered with "Yes" or "No." It focuses on evaluating the model's ability to understand relationships within the input text.
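
A common way to score a yes/no task with a language model is to compare how likely the model finds each answer as a continuation of the prompt, and to pick the more likely one. The sketch below assumes that scheme; it is my reading of how the evaluation harness treats BoolQ, not something stated in this notebook, and the log-likelihood values are made up for illustration.

```python
def pick_answer(loglik_yes, loglik_no):
    """Choose the answer whose continuation the model finds more likely."""
    return "yes" if loglik_yes > loglik_no else "no"

def accuracy(examples):
    correct = sum(
        pick_answer(ex["loglik_yes"], ex["loglik_no"]) == ex["label"]
        for ex in examples
    )
    return correct / len(examples)

# Made-up log-likelihoods, for illustration only.
examples = [
    {"loglik_yes": -0.4, "loglik_no": -1.2, "label": "yes"},
    {"loglik_yes": -2.0, "loglik_no": -0.3, "label": "no"},
    {"loglik_yes": -0.9, "loglik_no": -1.0, "label": "no"},   # model is wrong here
    {"loglik_yes": -1.5, "loglik_no": -0.7, "label": "no"},
]
score = accuracy(examples)
print(score)
```

Under this scheme each example needs two log-likelihood evaluations, one per candidate answer.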


In [None]:
!pip install -q lm-eval
from lm_eval import evaluator, tasks, models

In [None]:
def evaluate_hf_model(model_name, tasks=['arc_easy'], num_fewshot=0):
    """
    Calls the evaluator to evaluate a model available on Hugging Face.

    Args:
    - model_name: The model name on Hugging Face.
    - tasks: Tasks to evaluate.
    - num_fewshot: Number of few-shot examples.

    Returns:
    - metrics.
    """
    model_args = f"pretrained={model_name},device=cuda"

    results = evaluator.simple_evaluate(
      model="hf",
      model_args=model_args,
      tasks=tasks,
      num_fewshot=num_fewshot,  # Number of few-shot samples.
      limit=None,  # Use all the samples in the evaluation dataset.
      bootstrap_iters=10
    )

    metrics = results.get('results', {})
    return metrics

In [None]:
# Select tasks to evaluate.
tasks = ['boolq']

In [None]:
metrics_pruned = evaluate_hf_model("oopere/attnprun-llama-3.2-1b", tasks=tasks)


INFO:lm-eval:Setting random seed to 0 | Setting numpy seed to 1234 | Setting torch manual seed to 1234 | Setting fewshot manual seed to 1234
INFO:lm-eval:Initializing hf model, with arguments: {'pretrained': 'oopere/attnprun-llama-3.2-1b', 'device': 'cuda'}
INFO:lm-eval:Using device 'cuda'


INFO:lm-eval:Model parallel was set to False, max memory was not set, and device map was set to {'': 'cuda'}


The repository for super_glue contains custom code which must be executed to correctly load the dataset. You can inspect the repository content at https://hf.co/datasets/super_glue.
You can avoid this prompt in future by passing the argument `trust_remote_code=True`.

Do you wish to run the custom code? [y/N] y


INFO:lm-eval:Building contexts for boolq on rank 0...
100%|██████████| 3270/3270 [00:01<00:00, 2002.38it/s]
INFO:lm-eval:Running loglikelihood requests
Running loglikelihood requests: 100%|██████████| 6540/6540 [03:23<00:00, 32.18it/s]


In [None]:
metrics_pruned

{'boolq': {'alias': 'boolq',
  'acc,none': 0.6397553516819572,
  'acc_stderr,none': 0.008396499614382}}

The result is not bad at all. Considering that the original model achieves a score of 0.73 on this same test and ours scores 0.64, it seems that the model retains much of its ability to understand relationships within the input text, even after removing part of its attention layers.
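
To put the two scores side by side, the short calculation below expresses the gap as an absolute and a relative drop. The baseline of 0.73 is the figure quoted in the text; the pruned accuracy is the `acc,none` value from the output above.

```python
baseline_acc = 0.73              # original model, as quoted in the text
pruned_acc = 0.6397553516819572  # 'acc,none' reported for the pruned model

absolute_drop = baseline_acc - pruned_acc
relative_drop = absolute_drop / baseline_acc
retained = pruned_acc / baseline_acc

print(f"absolute drop: {absolute_drop:.3f}")
print(f"relative drop: {relative_drop:.1%}")
print(f"retained:      {retained:.1%}")
```

In other words, the pruned model keeps a bit under nine tenths of the baseline accuracy on this task.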

# Conclusion.
Based on the findings in the paper and the results obtained, I believe this type of pruning may work better with larger models where attention layers tend to have redundancy.

Since this type of pruning does not alter the model's structure, it does not result in a reduction in its size or the memory required to load it. The main advantage of using this pruning approach is the reduction of computational load during inference, leading to a more efficient model with faster responses and lower resource consumption.

# Author's Note.

In addition to creating content like this notebook and offering it under the MIT license, I have also contributed to repositories such as those of Hugging Face and Google Gemini.

I am especially proud of my book: [Large Language Models: Apply and Implement Strategies for Large Language Models (Apress)](https://amzn.to/3DSepLb).

You can find it on both [Amazon](https://amzn.to/3DSepLb) and [Springer](https://link.springer.com/book/10.1007/979-8-8688-0515-8), where they often have good deals on the purchase price.

If you take a look and end up purchasing it, keep in mind that you can reach out with any questions via the Discussions section of this same repository or on any of my social media channels. I’ll do my best to respond as quickly as possible.