# 🦙 TinyLlama Pruning & Summarization Experiment




In this experiment, we explore the effects of pruning and fine-tuning on a compact language model (TinyLlama, 1.1B parameters) for text summarization on the CNN/DailyMail dataset. The workflow is as follows:

1. **Dataset Preparation**:  
   We selected a subset of the CNN/DailyMail summarization dataset, pairing news articles with their human-written summaries.

2. **Base Model Evaluation**:  
   We loaded the pretrained TinyLlama model and evaluated its ability to summarize articles using standard prompts and ROUGE metrics. As expected, the base model (not fine-tuned) produced mostly extractive or paraphrased outputs with low ROUGE scores.

3. **Pruning with SparseGPT**:  
   Using the SparseGPT algorithm, we pruned the TinyLlama model to 50% sparsity, leveraging a calibration set drawn from the same dataset. Pruning was done in-place on all linear layers, guided by actual activations collected via forward hooks.

4. **Pruned Model Evaluation**:  
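   We evaluated the pruned model and observed ROUGE scores close to the base model's (ROUGE-1 F1 of 0.262 vs. 0.224 on our 20-article sample), as neither had been fine-tuned for summarization. The two runs used different prompt truncation lengths (256 vs. 512 tokens), so the comparison is indicative only.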

5. **Next Steps: Fine-Tuning**  
   With the pruned model, we are ready to fine-tune on our summarization dataset. Fine-tuning will enable the model to generate more accurate and abstractive summaries. After fine-tuning, we will compare ROUGE scores and sample outputs before and after pruning to understand the trade-offs between model compression and task performance.
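
The ROUGE scores mentioned in steps 2 and 4 measure word overlap between a generated summary and its reference. As a rough illustration only, here is a minimal sketch of ROUGE-1 F1 assuming plain whitespace tokenization and no stemming. The notebook itself uses the `rouge_score` package with `use_stemmer=True`, so its numbers will differ; `rouge1_f1` is a hypothetical helper name, not part of any library.

```python
from collections import Counter


def rouge1_f1(reference: str, candidate: str) -> float:
    """Simplified ROUGE-1 F1: unigram overlap on lowercased whitespace tokens."""
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(candidate.lower().split())
    # Clipped overlap: each word counts at most as often as it appears in both texts
    overlap = sum((ref_counts & cand_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)


score = rouge1_f1("the cat sat on the mat", "the cat lay on the mat")
print(f"ROUGE-1 F1: {score:.3f}")
```

Because the score rewards shared words, a model that copies sentences from the article can score reasonably well without producing an abstractive summary.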




In [None]:

!pip install -q transformers accelerate bitsandbytes datasets

In [None]:
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_name = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto", torch_dtype=torch.float16)



In [None]:
!pip install -q transformers optimum

In [None]:
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)


In [None]:
!git clone https://github.com/IST-DASLab/SparseGPT.git
import sys
sys.path.append('/content/SparseGPT')


Cloning into 'SparseGPT'...
remote: Enumerating objects: 46, done.
remote: Counting objects: 100% (32/32), done.
remote: Compressing objects: 100% (22/22), done.
remote: Total 46 (delta 22), reused 10 (delta 10), pack-reused 14 (from 2)
Receiving objects: 100% (46/46), 26.80 KiB | 13.40 MiB/s, done.
Resolving deltas: 100% (22/22), done.


In [None]:
!pip install -q transformers einops sentencepiece


In [None]:
from sparsegpt import SparseGPT


In [None]:
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_name = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()


LlamaForCausalLM(
  (model): LlamaModel(
    (embed_tokens): Embedding(32000, 2048)
    (layers): ModuleList(
      (0-21): 22 x LlamaDecoderLayer(
        (self_attn): LlamaAttention(
          (q_proj): Linear(in_features=2048, out_features=2048, bias=False)
          (k_proj): Linear(in_features=2048, out_features=256, bias=False)
          (v_proj): Linear(in_features=2048, out_features=256, bias=False)
          (o_proj): Linear(in_features=2048, out_features=2048, bias=False)
        )
        (mlp): LlamaMLP(
          (gate_proj): Linear(in_features=2048, out_features=5632, bias=False)
          (up_proj): Linear(in_features=2048, out_features=5632, bias=False)
          (down_proj): Linear(in_features=5632, out_features=2048, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): LlamaRMSNorm((2048,), eps=1e-05)
        (post_attention_layernorm): LlamaRMSNorm((2048,), eps=1e-05)
      )
    )
    (norm): LlamaRMSNorm((2048,), eps=1e-05)
    (rotary_emb): 

In [None]:
# Calibration prompts for pruning
calib_prompts = eval_df['text'].sample(32, random_state=42).tolist()

# All texts for evaluation
eval_prompts = eval_df['text'].tolist()

# (Optional) Corresponding true labels for evaluation
eval_labels = eval_df['label'].tolist() if 'label' in eval_df.columns else None


In [None]:
def get_calib_data(tokenizer, prompt_list, batch_size=4):
    for i in range(0, len(prompt_list), batch_size):
        batch = prompt_list[i:i+batch_size]
        tokens = tokenizer(batch, return_tensors="pt", padding=True, truncation=True).input_ids.cuda()
        yield tokens

calib_data = list(get_calib_data(tokenizer, calib_prompts))


In [None]:
model = AutoModelForCausalLM.from_pretrained(model_name).cuda()


In [None]:
device = next(model.parameters()).device  # device the model weights live on
baseline_outputs = []  # must exist before the loop below appends to it

for text in eval_prompts:
    inputs = tokenizer(text, return_tensors="pt").to(device)
    outputs = model.generate(**inputs, max_new_tokens=16)
    decoded = tokenizer.decode(outputs[0], skip_special_tokens=True)
    baseline_outputs.append(decoded)


In [None]:
def get_calib_data(tokenizer, prompt_list, batch_size=4, device="cuda"):
    for i in range(0, len(prompt_list), batch_size):
        batch = prompt_list[i:i+batch_size]
        tokens = tokenizer(batch, return_tensors="pt", padding=True, truncation=True).input_ids.to(device)
        yield tokens

calib_data = list(get_calib_data(tokenizer, calib_prompts, device=device))


In [None]:
device = next(model.parameters()).device
baseline_outputs = []
for text in eval_prompts:
    inputs = tokenizer(text, return_tensors="pt").to(device)
    outputs = model.generate(**inputs, max_new_tokens=16)
    decoded = tokenizer.decode(outputs[0], skip_special_tokens=True)
    baseline_outputs.append(decoded)


NameError: name 'eval_prompts' is not defined

In [None]:
calib_prompts = eval_df['text'].sample(32, random_state=42).tolist()

def get_calib_data(tokenizer, prompt_list, batch_size=4, device="cuda"):
    for i in range(0, len(prompt_list), batch_size):
        batch = prompt_list[i:i+batch_size]
        tokens = tokenizer(batch, return_tensors="pt", padding=True, truncation=True).input_ids.to(device)
        yield tokens

device = next(model.parameters()).device
calib_data = list(get_calib_data(tokenizer, calib_prompts, device=device))


NameError: name 'eval_df' is not defined

In [None]:
calib_seq_len = 64  # Choose what makes sense for your data/model

def get_calib_data_fixed(tokenizer, prompt_list, batch_size=4, seq_len=64, device="cuda"):
    for i in range(0, len(prompt_list), batch_size):
        batch = prompt_list[i:i+batch_size]
        # Always pad & truncate to the same length
        tokens = tokenizer(batch, return_tensors="pt", padding="max_length", truncation=True, max_length=seq_len).input_ids.to(device)
        yield tokens

calib_data = list(get_calib_data_fixed(tokenizer, calib_prompts, seq_len=calib_seq_len, device=device))


NameError: name 'calib_prompts' is not defined

#  LLM Pruning with SparseGPT: State-of-the-Art Transformer Compression

## Overview

In this section, we perform *one-shot* pruning of a large language model (LLM) using [SparseGPT](https://arxiv.org/abs/2301.00774), a pruning algorithm that reduces LLM size and inference cost without retraining.  
Our workflow compresses a HuggingFace transformer (TinyLlama), using AG News texts for calibration and evaluation, with the goal of retaining accuracy so that the model can be deployed on limited hardware.

---

##  Why Prune LLMs?

Transformer-based LLMs, such as GPT, LLaMA, and TinyLlama, have hundreds of millions or even billions of parameters.  
**Pruning** means removing (zeroing out) some of these weights—typically those that are least important—so the model becomes:
- Smaller in memory and disk size
- Faster in inference (especially on sparse-optimized hardware)
- Cheaper to deploy in production

However, naive pruning methods often hurt performance. **SparseGPT** is designed to prune LLMs with little accuracy loss at sparsity levels of around 50–60%. The paper reports the smallest degradation on the largest models, so a 1.1B-parameter model such as TinyLlama may lose more.
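
To make "naive pruning" concrete, here is a minimal sketch of magnitude pruning, which zeroes the smallest weights without looking at any data. This is the baseline idea only, not SparseGPT; `magnitude_prune` and `sparsity_of` are hypothetical helper names and the weights are made-up toy values.

```python
def magnitude_prune(weights, sparsity):
    """Return a copy of `weights` with the smallest-magnitude fraction set to zero."""
    if not 0.0 <= sparsity <= 1.0:
        raise ValueError("sparsity must be in [0, 1]")
    n_prune = int(len(weights) * sparsity)
    # Indices ordered by absolute value, smallest first
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned = list(weights)
    for i in order[:n_prune]:
        pruned[i] = 0.0
    return pruned


def sparsity_of(weights):
    """Fraction of entries that are exactly zero."""
    return sum(1 for w in weights if w == 0) / len(weights)


weights = [0.8, -0.05, 0.3, 0.01, -0.6, 0.2, -0.02, 0.4]  # toy values
pruned = magnitude_prune(weights, sparsity=0.5)
print(pruned, sparsity_of(pruned))
```

Magnitude pruning ignores how each weight interacts with real inputs, which is the gap that data-aware methods aim to close.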

---

##  What is SparseGPT?

**SparseGPT** [[Frantar & Alistarh, 2023]](https://arxiv.org/abs/2301.00774) is a one-shot, data-aware pruning algorithm designed for transformer models at scale.  
It operates as follows:
- For each linear (fully connected) layer, SparseGPT analyzes the **weights** *and* the **layer’s real activation patterns** (collected on representative calibration data).
- By solving a fast optimization problem for each layer, SparseGPT determines which weights can be set to zero with minimal loss to the model’s forward pass accuracy.
- No retraining or fine-tuning is required after pruning: the pruned model can be used immediately.
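
The per-layer optimization mentioned above can be written as a reconstruction problem, following the formulation in the SparseGPT paper. The symbols are notation introduced here for illustration: $W$ is the layer's dense weight matrix, $X$ the calibration inputs to that layer, $M$ a binary pruning mask, $\widehat{W}$ the updated weights, and $\lambda$ a small damping constant.

```latex
\min_{M,\;\widehat{W}} \;
  \bigl\lVert W X - (M \odot \widehat{W})\, X \bigr\rVert_2^2
\quad \text{subject to } M \text{ having the target sparsity},
\qquad
H = X X^{\top} + \lambda I .
```

The matrix $H$ depends only on the layer inputs, which is why calibration activations are needed to decide which weights to remove and how to adjust the remaining ones.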

> **Reference:**  
> Elias Frantar, Dan Alistarh.  
> [SparseGPT: Massive Language Models Can Be Accurately Pruned in One-Shot](https://arxiv.org/abs/2301.00774)  
> _arXiv preprint arXiv:2301.00774, 2023._  
> [Official code on GitHub](https://github.com/IST-DASLab/SparseGPT)

---

## Step-by-Step Workflow in This Notebook

1. **Model Loading:**  
   We load a pretrained HuggingFace LLM (e.g., TinyLlama) and move it to GPU for efficient computation.

2. **Calibration Data Preparation:**  
   We randomly sample a small number of real texts (e.g., from the AG News dataset), pad/truncate them to a fixed sequence length (e.g., 64 tokens), and use these as representative input for pruning.  
   This step ensures that the pruned model keeps the weights most important for real-world data.

3. **Layer-wise Activation Collection:**  
   For every linear layer in the transformer, we attach a *forward hook* that records the input and output activations of that layer as the model processes the calibration data.  
   This gives SparseGPT the context needed to make smart pruning decisions.

4. **Layer Pruning Using SparseGPT:**  
   For each linear layer:
   - A `SparseGPT` object is created for that layer.
   - The collected activations (input and output pairs) are fed into the pruner using `add_batch`.
   - The actual pruning is performed with `fasterprune(sparsity=0.5)`, zeroing 50% of the weights based on the optimization described in the paper.

5. **Model Evaluation:**  
   After pruning, we evaluate the compressed model on a held-out test set.  
   We compare the pruned model’s predictions, accuracy, and qualitative outputs with the original, unpruned model.
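
The fixed-length calibration batches from step 2 can be sketched in plain Python. This is an illustration under assumptions: token IDs are toy integers, padding is applied on the right with a `pad_id` of 0, and `pad_or_truncate` and `make_batches` are hypothetical helper names. In the notebook the tokenizer does this work through `padding="max_length"` and `truncation=True`.

```python
def pad_or_truncate(token_ids, seq_len, pad_id=0):
    """Clip a token list to seq_len, then right-pad with pad_id up to seq_len."""
    clipped = list(token_ids[:seq_len])
    return clipped + [pad_id] * (seq_len - len(clipped))


def make_batches(sequences, batch_size):
    """Group equal-length sequences into consecutive batches."""
    return [sequences[i:i + batch_size] for i in range(0, len(sequences), batch_size)]


raw = [[5, 6, 7], [1, 2, 3, 4, 5, 6], [9]]  # toy token IDs of varying length
fixed = [pad_or_truncate(ids, seq_len=4) for ids in raw]
batches = make_batches(fixed, batch_size=2)
print(batches)
```

Giving every sequence the same length means each captured activation has the same shape, which is what the shape check in the pruning hook relies on.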

---

## Why Is This Important?

- **SparseGPT is a widely used baseline for one-shot LLM pruning** and is a common point of comparison in later pruning research.
- The method is *data-aware*: the pruning is not random, but tailored to keep the parts of the model that are most important for real-world use cases.
- **No retraining or fine-tuning is needed:**  
  This means the whole compression process can be run in a few minutes on Colab, making it practical for research, prototyping, and even production.

---




In [None]:
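# Assumption: `linear_layers` is not defined in the cells shown. One way to build it is
# to collect every nn.Linear by name (this also picks up lm_head; filter by name to skip it).
linear_layers = [(n, m) for n, m in model.named_modules() if isinstance(m, torch.nn.Linear)]

for name, layer in linear_layers:
    print(f"Pruning layer: {name}")

    gpt = SparseGPT(layer)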
    activations = []

    def hook_fn(module, inp, out):
        # Save both input and output for each forward call
        if inp[0].shape[1] == calib_seq_len:
            # Save as a tuple (input, output)
            activations.append((inp[0].detach().cpu(), out.detach().cpu()))

    handle = layer.register_forward_hook(hook_fn)

    with torch.no_grad():
        for batch in calib_data:
            _ = model(batch)

    handle.remove()

    if len(activations) == 0:
        print(f"No activations captured for layer {name}")
        continue

    # Now feed input/output pairs to add_batch
    for inp, out in activations:
        gpt.add_batch(inp.to(device), out.to(device))

    gpt.fasterprune(sparsity=0.5)
    gpt.free()  # release the pruner's internal buffers before moving to the next layer
    print(f"Pruned layer {name} successfully.")


Pruning layer: model.layers.0.self_attn.q_proj
time 1.19
error 0.07008744776248932
Pruned layer model.layers.0.self_attn.q_proj successfully.
Pruning layer: model.layers.0.self_attn.k_proj
time 0.73
error 0.018224719911813736
Pruned layer model.layers.0.self_attn.k_proj successfully.
Pruning layer: model.layers.0.self_attn.v_proj
time 0.51
error 0.0229062270373106
Pruned layer model.layers.0.self_attn.v_proj successfully.
Pruning layer: model.layers.0.self_attn.o_proj
time 0.53
error 0.004215043969452381
Pruned layer model.layers.0.self_attn.o_proj successfully.
Pruning layer: model.layers.0.mlp.gate_proj
time 0.52
error 9.610349655151367
Pruned layer model.layers.0.mlp.gate_proj successfully.
Pruning layer: model.layers.0.mlp.up_proj
time 0.53
error 10.317127227783203
Pruned layer model.layers.0.mlp.up_proj successfully.
Pruning layer: model.layers.0.mlp.down_proj
time 1.80
error 0.01179075799882412
Pruned layer model.layers.0.mlp.down_proj successfully.
Pruning layer: model.layers.1.

In [None]:
import pandas as pd

# 1. Load your evaluation CSV
eval_df = pd.read_csv("/content/eval_agnews.csv")

# 2. Show the columns and first few rows to understand the structure
print("Columns:", eval_df.columns)
print(eval_df.head(3))

# 3. Pick the input and answer columns automatically (or set manually if you want)
input_col, answer_col = eval_df.columns[:2]  # assumes first is input/question, second is answer/label

# (OPTIONAL) Uncomment below and set manually if you want:
# input_col = "question"
# answer_col = "answer"

eval_inputs = eval_df[input_col].astype(str).tolist()
eval_answers = eval_df[answer_col].astype(str).tolist()

# 4. Run the pruned model on your inputs
device = next(model.parameters()).device
pruned_outputs = []
for inp in eval_inputs:
    inputs = tokenizer(inp, return_tensors="pt").to(device)
    outputs = model.generate(**inputs, max_new_tokens=24)
    decoded = tokenizer.decode(outputs[0], skip_special_tokens=True)
    pruned_outputs.append(decoded)

# 5. Print a few qualitative comparisons
print("\n--- Sample results ---")
for i in range(min(5, len(eval_inputs))):
    print(f"Input   : {eval_inputs[i]}")
    print(f"Expected: {eval_answers[i]}")
    print(f"Model   : {pruned_outputs[i]}\n")

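# 6. (Optional) Compute simple exact match accuracy for classification tasks
#    Note: each decoded output still contains the input text, so exact match against a
#    bare label such as "2" stays near 0 unless the prompt is stripped first.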
def match_answer(pred, true):
    return pred.strip().lower() == true.strip().lower()

accuracy = sum(match_answer(p, t) for p, t in zip(pruned_outputs, eval_answers)) / len(eval_answers)
print(f"Exact match accuracy: {accuracy:.3f}")


Columns: Index(['question', 'answer'], dtype='object')
                                            question  answer
0  Wall St. Bears Claw Back Into the Black (Reute...       2
1  Carlyle Looks Toward Commercial Aerospace (Reu...       2
2  Oil and Economy Cloud Stocks' Outlook (Reuters...       2

--- Sample results ---
Input   : Wall St. Bears Claw Back Into the Black (Reuters) Reuters - Short-sellers, Wall Street's dwindling\band of ultra-cynics, are seeing green again.
Expected: 2
Model   : Wall St. Bears Claw Back Into the Black (Reuters) Reuters - Short-sellers, Wall Street's dwindling\band of ultra-cynics, are seeing green again. The Dow Jones (DJD) was up 0.1 percent, while the S&P (SPY

Input   : Carlyle Looks Toward Commercial Aerospace (Reuters) Reuters - Private investment firm Carlyle Group,\which has a reputation for making well-timed and occasionally\controversial plays in the defense industry, has quietly placed\its bets on another part of the market.
Expected: 2
Model 

In [None]:
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
tokenizer = AutoTokenizer.from_pretrained(model_name)
base_model = AutoModelForCausalLM.from_pretrained(model_name).to(device)
base_model.eval()


LlamaForCausalLM(
  (model): LlamaModel(
    (embed_tokens): Embedding(32000, 2048)
    (layers): ModuleList(
      (0-21): 22 x LlamaDecoderLayer(
        (self_attn): LlamaAttention(
          (q_proj): Linear(in_features=2048, out_features=2048, bias=False)
          (k_proj): Linear(in_features=2048, out_features=256, bias=False)
          (v_proj): Linear(in_features=2048, out_features=256, bias=False)
          (o_proj): Linear(in_features=2048, out_features=2048, bias=False)
        )
        (mlp): LlamaMLP(
          (gate_proj): Linear(in_features=2048, out_features=5632, bias=False)
          (up_proj): Linear(in_features=2048, out_features=5632, bias=False)
          (down_proj): Linear(in_features=5632, out_features=2048, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): LlamaRMSNorm((2048,), eps=1e-05)
        (post_attention_layernorm): LlamaRMSNorm((2048,), eps=1e-05)
      )
    )
    (norm): LlamaRMSNorm((2048,), eps=1e-05)
    (rotary_emb): 

In [None]:
import re
def extract_label_after_label_colon(output):
    match = re.search(r"Label:\s*([0-3])", output)
    return match.group(1) if match else "-1"


In [None]:
device = next(model.parameters()).device
predicted_labels = []

import re

for article in eval_questions:
    prompt = few_shot_prompt.format(article=article)
    inputs = tokenizer(prompt, return_tensors="pt").to(device)
    outputs = model.generate(**inputs, max_new_tokens=8)
    decoded = tokenizer.decode(outputs[0], skip_special_tokens=True)

    # Extract the last number after "Label:" in the output (most robust)
    matches = list(re.finditer(r"Label:\s*([0-3])", decoded))
    predicted_label = matches[-1].group(1) if matches else "-1"
    predicted_labels.append(predicted_label)
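
One caveat with taking the last `Label:` match from the full decoded text: if the model fails to emit a label, the match can come from a few-shot example inside the prompt. The sketch below contrasts that with searching only the generated continuation. The prompt text is a made-up stand-in (the notebook's `few_shot_prompt` is not shown), and `extract_last_label` and `extract_completion_label` are hypothetical helper names.

```python
import re

LABEL_PATTERN = re.compile(r"Label:\s*([0-3])")


def extract_last_label(decoded: str) -> str:
    """Last 'Label: <digit>' anywhere in the text, as in the loop above."""
    matches = LABEL_PATTERN.findall(decoded)
    return matches[-1] if matches else "-1"


def extract_completion_label(decoded: str, prompt: str) -> str:
    """Look for a digit only in the text generated after the prompt."""
    completion = decoded[len(prompt):] if decoded.startswith(prompt) else decoded
    match = re.match(r"\s*([0-3])", completion)
    return match.group(1) if match else "-1"


# Hypothetical few-shot prompt ending in an open "Label:" slot
prompt = "Article: Stocks rally.\nLabel: 2\n\nArticle: Team wins final.\nLabel:"
good = prompt + " 1"
bad = prompt + " Sports news"

print(extract_last_label(good), extract_completion_label(good, prompt))
print(extract_last_label(bad), extract_completion_label(bad, prompt))
```

In the failing case the first function returns the few-shot example's label, while the second reports `-1`, which keeps missing predictions from being counted as correct by accident. The second function assumes the decoded text begins with the exact prompt string, which may not hold after tokenizer round-tripping.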


In [None]:
import pandas as pd

# Load your uploaded sample
df = pd.read_csv("/content/cnn_dailymail_sample.csv")
print(df.head(3))
print(f"Number of samples: {len(df)}")


                                            document  \
0  Jarryd Hayne's move to the NFL is a boost for ...   
1  An anorexic teenager whose weight dropped to j...   
2  (CNN)For years, they've wanted six seasons and...   

                                             summary  
0  Jarryd Hayne quit the NRL in October to try an...  
1  Faith March's dropped to just five stone as sh...  
2  The fan favorite comedy "Community" returns fo...  
Number of samples: 20


In [None]:
import pandas as pd

df = pd.read_csv("/content/cnn_dailymail_sample.csv")

print("Columns:", df.columns)

print(df.head(3))


Columns: Index(['document', 'summary'], dtype='object')
                                            document  \
0  Jarryd Hayne's move to the NFL is a boost for ...   
1  An anorexic teenager whose weight dropped to j...   
2  (CNN)For years, they've wanted six seasons and...   

                                             summary  
0  Jarryd Hayne quit the NRL in October to try an...  
1  Faith March's dropped to just five stone as sh...  
2  The fan favorite comedy "Community" returns fo...  


In [None]:
device = next(model.parameters()).device

inputs = df["document"].tolist()
reference_summaries = df["summary"].tolist()
generated_summaries = []

for article in inputs:
    prompt = f"Summarize the following article:\n\n{article}\n\nSummary:"
    input_ids = tokenizer(prompt, return_tensors="pt", truncation=True, max_length=512).to(device)
    with torch.no_grad():
        output_ids = model.generate(**input_ids, max_new_tokens=64)
    summary = tokenizer.decode(output_ids[0], skip_special_tokens=True)
    if "Summary:" in summary:
        summary = summary.split("Summary:")[-1].strip()
    generated_summaries.append(summary)

for i in range(3):
    print("Article:", inputs[i][:200], "...")
    print("Reference Summary:", reference_summaries[i])
    print("Model Summary:", generated_summaries[i])
    print("---")


Article: Jarryd Hayne's move to the NFL is a boost for rugby league in the United States, it has been claimed. The Australia international full-back or centre quit the National Rugby League in October to try h ...
Reference Summary: Jarryd Hayne quit the NRL in October to try and get into American Football .
This week, he signed a three-year contract with the San Francisco 49ers .
The chairman of the US Association of Rugby League welcomed his arrival .
Model Summary: Jarryd Hayne's move to the NFL is a boost for rugby league in the United States, according to Peter Illfield, chairman of the US Association of Rugby League. Hayne, who played rugby league for Australia, has signed a three-year contract with the San Francisco 49
---
Article: An anorexic teenager whose weight dropped to just five stone is fighting back from the condition by setting up a catering business. Faith March, 18 from Maldon, Essex, was surviving on nothing other t ...
Reference Summary: Faith March's dropped to ju

In [None]:
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_name = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16, device_map="auto")
device = next(model.parameters()).device

articles = df["document"].astype(str).tolist()
reference_summaries = df["summary"].astype(str).tolist()
generated_summaries = []

for article in articles:
    prompt = f"Summarize the following article:\n\n{article}\n\nSummary:"
    input_ids = tokenizer(prompt, return_tensors="pt", truncation=True, max_length=512).to(device)
    with torch.no_grad():
        output_ids = model.generate(**input_ids, max_new_tokens=64)
    summary = tokenizer.decode(output_ids[0], skip_special_tokens=True)
    if "Summary:" in summary:
        summary = summary.split("Summary:")[-1].strip()
    generated_summaries.append(summary)

!pip install rouge-score --quiet
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'], use_stemmer=True)
scores = [scorer.score(ref, gen) for ref, gen in zip(reference_summaries, generated_summaries)]

rouge1_avg = sum([s['rouge1'].fmeasure for s in scores]) / len(scores)
rouge2_avg = sum([s['rouge2'].fmeasure for s in scores]) / len(scores)
rougeL_avg = sum([s['rougeL'].fmeasure for s in scores]) / len(scores)

print(f"ROUGE-1 F1 avg: {rouge1_avg:.3f}")
print(f"ROUGE-2 F1 avg: {rouge2_avg:.3f}")
print(f"ROUGE-L F1 avg: {rougeL_avg:.3f}")

for i in range(3):
    print("ARTICLE:", articles[i][:200], "...")
    print("REFERENCE:", reference_summaries[i])
    print("GENERATED:", generated_summaries[i])
    print("---")


ROUGE-1 F1 avg: 0.224
ROUGE-2 F1 avg: 0.103
ROUGE-L F1 avg: 0.146
ARTICLE: Jarryd Hayne's move to the NFL is a boost for rugby league in the United States, it has been claimed. The Australia international full-back or centre quit the National Rugby League in October to try h ...
REFERENCE: Jarryd Hayne quit the NRL in October to try and get into American Football .
This week, he signed a three-year contract with the San Francisco 49ers .
The chairman of the US Association of Rugby League welcomed his arrival .
GENERATED: Jarryd Hayne's move to the NFL is a boost for rugby league in the United States, according to Peter Illfield, chairman of the US Association of Rugby League. Hayne, who played rugby league for Australia, has signed a three-year contract with the San Francisco 49
---
ARTICLE: An anorexic teenager whose weight dropped to just five stone is fighting back f

In [None]:
calib_texts = df["document"].astype(str).tolist()[:10]


In [None]:
articles = df["document"].astype(str).tolist()[:3]
reference_summaries = df["summary"].astype(str).tolist()[:3]
generated_summaries = []

for article in articles:
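    prompt = f"Summarize the following article:\n\n{article}\n\nSummary:"
    # Caveat: truncating to 256 tokens can cut off the trailing "Summary:" marker for long
    # articles; the split below then does nothing and the echoed prompt stays in the output.
    input_ids = tokenizer(prompt, return_tensors="pt", truncation=True, max_length=256).to(device)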
    with torch.no_grad():
        output_ids = model.generate(**input_ids, max_new_tokens=32)
    summary = tokenizer.decode(output_ids[0], skip_special_tokens=True)
    if "Summary:" in summary:
        summary = summary.split("Summary:")[-1].strip()
    generated_summaries.append(summary)
    del input_ids, output_ids
    torch.cuda.empty_cache()

from rouge_score import rouge_scorer
scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'], use_stemmer=True)
scores = [scorer.score(ref, gen) for ref, gen in zip(reference_summaries, generated_summaries)]

rouge1_avg = sum([s['rouge1'].fmeasure for s in scores]) / len(scores)
rouge2_avg = sum([s['rouge2'].fmeasure for s in scores]) / len(scores)
rougeL_avg = sum([s['rougeL'].fmeasure for s in scores]) / len(scores)

print(f"[PRUNED] ROUGE-1 F1 avg: {rouge1_avg:.3f}")
print(f"[PRUNED] ROUGE-2 F1 avg: {rouge2_avg:.3f}")
print(f"[PRUNED] ROUGE-L F1 avg: {rougeL_avg:.3f}")

for i in range(3):
    print("ARTICLE:", articles[i][:200], "...")
    print("REFERENCE:", reference_summaries[i])
    print("GENERATED:", generated_summaries[i])
    print("---")


[PRUNED] ROUGE-1 F1 avg: 0.258
[PRUNED] ROUGE-2 F1 avg: 0.109
[PRUNED] ROUGE-L F1 avg: 0.173
ARTICLE: Jarryd Hayne's move to the NFL is a boost for rugby league in the United States, it has been claimed. The Australia international full-back or centre quit the National Rugby League in October to try h ...
REFERENCE: Jarryd Hayne quit the NRL in October to try and get into American Football .
This week, he signed a three-year contract with the San Francisco 49ers .
The chairman of the US Association of Rugby League welcomed his arrival .
GENERATED: Summarize the following article:

Jarryd Hayne's move to the NFL is a boost for rugby league in the United States, it has been claimed. The Australia international full-back or centre quit the National Rugby League in October to try his luck in American football and was this week given a three-year contract with the San Francisco 49ers. Peter Illfield, chairman of US Association of Rugby League, said: 'Jarryd, at 27, is one of the most gifted
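
The generated text above begins with the prompt itself, which suggests that truncation removed the trailing `Summary:` marker so the `split("Summary:")` step had nothing to split on. Below is a minimal sketch of two possible remedies, using toy integer token IDs in place of real tokenizer output: truncate only the article so the marker survives, and keep only the tokens produced after the prompt. `build_prompt_ids` and `new_tokens_only` are hypothetical helper names.

```python
def build_prompt_ids(prefix_ids, article_ids, suffix_ids, max_length):
    """Truncate only the article so the prefix and the trailing marker are kept."""
    budget = max_length - len(prefix_ids) - len(suffix_ids)
    if budget < 0:
        raise ValueError("max_length is too small for the prompt template")
    return prefix_ids + article_ids[:budget] + suffix_ids


def new_tokens_only(output_ids, prompt_len):
    """Drop the echoed prompt from the front of a decoder-only generation result."""
    return output_ids[prompt_len:]


# Toy token IDs standing in for real tokenizer output (assumption)
prefix_ids = [1, 2, 3]               # "Summarize the following article:"
article_ids = list(range(100, 120))  # a 20-token article
suffix_ids = [8, 9]                  # "Summary:"

prompt_ids = build_prompt_ids(prefix_ids, article_ids, suffix_ids, max_length=10)
output_ids = prompt_ids + [501, 502, 503]  # the model echoes the prompt, then generates
summary_ids = new_tokens_only(output_ids, len(prompt_ids))
print(prompt_ids, summary_ids)
```

With `transformers`, the second step would typically correspond to slicing the generated sequence at the prompt length, for example `output_ids[0][input_ids["input_ids"].shape[1]:]`, before decoding. Scoring only those tokens would also keep ROUGE from being inflated by copied article text.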

In [None]:
calib_texts = df["document"].astype(str).tolist()[:20]


In [None]:
from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16, device_map="auto")
device = next(model.parameters()).device


In [None]:
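# Maps layer name -> list of flattened input activations captured during calibration
activation_dict = {}

def get_activation_hook(name):
    """Return a forward hook that stores the inputs seen by the layer called `name`."""
    def hook(module, input, output):
        act = input[0].detach()
        # Flatten (batch, seq, hidden) to (batch * seq, hidden)
        if act.ndim > 2:
            act = act.reshape(-1, act.shape[-1])
        activation_dict.setdefault(name, []).append(act.to(device))
    return hook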
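
The cell that registers these hooks and runs the calibration pass is not shown in this notebook. As a hedged illustration of how that step might look, the sketch below attaches the same kind of hook to every `nn.Linear` in a tiny stand-in model and pushes random batches through it on CPU. The toy model, shapes, and random inputs are assumptions made for the example; the real run would use TinyLlama and tokenized `calib_texts`.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Toy stand-in for a transformer MLP block (assumption: not the real TinyLlama)
toy_model = nn.Sequential(
    nn.Linear(8, 16, bias=False),
    nn.SiLU(),
    nn.Linear(16, 8, bias=False),
)

activation_dict = {}


def get_activation_hook(name):
    def hook(module, inputs, output):
        act = inputs[0].detach()
        if act.ndim > 2:
            act = act.reshape(-1, act.shape[-1])  # (batch * seq, hidden)
        activation_dict.setdefault(name, []).append(act)
    return hook


# Register one hook per linear layer and keep the handles for cleanup
handles = [
    module.register_forward_hook(get_activation_hook(name))
    for name, module in toy_model.named_modules()
    if isinstance(module, nn.Linear)
]

# Calibration pass: random (batch, seq, hidden) tensors in place of real embeddings
calib_batches = [torch.randn(4, 5, 8) for _ in range(3)]
with torch.no_grad():
    for batch in calib_batches:
        toy_model(batch)

# Remove hooks so later forward passes do not keep storing activations
for handle in handles:
    handle.remove()

for name, acts in activation_dict.items():
    print(name, len(acts), tuple(acts[0].shape))
```

Removing the hooks after calibration matters in practice: hooks left in place would keep appending activations during evaluation and steadily consume memory.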


In [None]:
for name, module in model.named_modules():
    if isinstance(module, torch.nn.Linear) and name in activation_dict:
        print(f"Pruning layer {name} ...")
        gpt = SparseGPT(module)
        for activation in activation_dict[name]:
            gpt.add_batch(activation.to(next(module.parameters()).device), out=None)
        gpt.fasterprune(sparsity=0.5)
        gpt.free()
        del gpt
        torch.cuda.empty_cache()


Pruning layer model.layers.0.self_attn.q_proj ...
time 1.94
error 0.2700767517089844
Pruning layer model.layers.0.self_attn.k_proj ...
time 1.50
error 0.07024785131216049
Pruning layer model.layers.0.self_attn.v_proj ...
time 0.74
error 0.0885753184556961
Pruning layer model.layers.0.self_attn.o_proj ...
time 0.82
error 0.012682058848440647
Pruning layer model.layers.0.mlp.gate_proj ...
time 1.18
error 41.7613639831543
Pruning layer model.layers.0.mlp.up_proj ...
time 1.07
error 44.91261291503906
Pruning layer model.layers.0.mlp.down_proj ...
time 2.58
error 0.04819351062178612
Pruning layer model.layers.1.self_attn.q_proj ...
time 0.58
error 21.750154495239258
Pruning layer model.layers.1.self_attn.k_proj ...
time 0.57
error 5.511509895324707
Pruning layer model.layers.1.self_attn.v_proj ...
time 0.53
error 0.9868661165237427
Pruning layer model.layers.1.self_attn.o_proj ...
time 0.56
error 0.2096840888261795
Pruning layer model.layers.1.mlp.gate_proj ...
time 0.54
error 197.016799926

In [None]:
import torch

model.eval()

articles = df["document"].astype(str).tolist()
reference_summaries = df["summary"].astype(str).tolist()
generated_summaries = []

for article in articles:
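    prompt = f"Summarize the following article:\n\n{article}\n\nSummary:"
    # Caveat: truncating to 256 tokens can cut off the trailing "Summary:" marker for long
    # articles; the split below then does nothing and the echoed prompt stays in the output.
    input_ids = tokenizer(prompt, return_tensors="pt", truncation=True, max_length=256).to(model.device)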
    with torch.no_grad():
        output_ids = model.generate(**input_ids, max_new_tokens=64)
    summary = tokenizer.decode(output_ids[0], skip_special_tokens=True)
    if "Summary:" in summary:
        summary = summary.split("Summary:")[-1].strip()
    generated_summaries.append(summary)
    del input_ids, output_ids
    torch.cuda.empty_cache()

from rouge_score import rouge_scorer
scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'], use_stemmer=True)
scores = [scorer.score(ref, gen) for ref, gen in zip(reference_summaries, generated_summaries)]

rouge1_avg = sum([s['rouge1'].fmeasure for s in scores]) / len(scores)
rouge2_avg = sum([s['rouge2'].fmeasure for s in scores]) / len(scores)
rougeL_avg = sum([s['rougeL'].fmeasure for s in scores]) / len(scores)

print(f"[PRUNED] ROUGE-1 F1 avg: {rouge1_avg:.3f}")
print(f"[PRUNED] ROUGE-2 F1 avg: {rouge2_avg:.3f}")
print(f"[PRUNED] ROUGE-L F1 avg: {rougeL_avg:.3f}")

for i in range(min(3, len(articles))):
    print("ARTICLE:", articles[i][:200], "...")
    print("REFERENCE:", reference_summaries[i])
    print("GENERATED:", generated_summaries[i])
    print("---")


[PRUNED] ROUGE-1 F1 avg: 0.262
[PRUNED] ROUGE-2 F1 avg: 0.112
[PRUNED] ROUGE-L F1 avg: 0.170
ARTICLE: Jarryd Hayne's move to the NFL is a boost for rugby league in the United States, it has been claimed. The Australia international full-back or centre quit the National Rugby League in October to try h ...
REFERENCE: Jarryd Hayne quit the NRL in October to try and get into American Football .
This week, he signed a three-year contract with the San Francisco 49ers .
The chairman of the US Association of Rugby League welcomed his arrival .
GENERATED: Summarize the following article:

Jarryd Hayne's move to the NFL is a boost for rugby league in the United States, it has been claimed. The Australia international full-back or centre quit the National Rugby League in October to try his luck in American football and was this week given a three-year contract with the San Francisco 49ers. Peter Illfield, chairman of US Association of Rugby League, said: 'Jarryd, at 27, is one of the most gifted

In [None]:
from transformers import AutoTokenizer, AutoModelForCausalLM

save_dir = "pruned_tinyllama"

model.save_pretrained(save_dir)
tokenizer.save_pretrained(save_dir)

print(f"✅ Pruned model and tokenizer saved to: {save_dir}")


✅ Pruned model and tokenizer saved to: pruned_tinyllama


## Prepare Dataset for HuggingFace Trainer

In [None]:
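import torch
from torch.utils.data import Dataset

class SummarizationDataset(Dataset):
    """Prompt + summary pairs formatted for causal-LM fine-tuning."""

    def __init__(self, df, tokenizer, source_col="document", target_col="summary", max_input=256, max_target=64):
        self.inputs = df[source_col].astype(str).tolist()
        self.targets = df[target_col].astype(str).tolist()
        self.tokenizer = tokenizer
        self.max_input = max_input
        self.max_target = max_target

    def __len__(self):
        return len(self.inputs)

    def __getitem__(self, idx):
        prompt = f"Summarize the following article:\n\n{self.inputs[idx]}\n\nSummary:"
        tok = self.tokenizer
        prompt_ids = tok(prompt, max_length=self.max_input, truncation=True)["input_ids"]
        target_ids = tok(self.targets[idx], max_length=self.max_target, truncation=True, add_special_tokens=False)["input_ids"]
        target_ids = target_ids + [tok.eos_token_id]
        # A causal LM needs labels aligned with input_ids, so prompt and summary share one
        # sequence; prompt and padding positions are set to -100 and ignored by the loss.
        pad_len = self.max_input + self.max_target + 1 - len(prompt_ids) - len(target_ids)
        input_ids = prompt_ids + target_ids + [tok.pad_token_id] * pad_len
        attention_mask = [1] * (len(prompt_ids) + len(target_ids)) + [0] * pad_len
        labels = [-100] * len(prompt_ids) + target_ids + [-100] * pad_len
        return {
            "input_ids": torch.tensor(input_ids),
            "attention_mask": torch.tensor(attention_mask),
            "labels": torch.tensor(labels),
        }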

train_dataset = SummarizationDataset(df, tokenizer)


Fine-tuning with LoRA continues in the next Colab notebook.