# Model Compression in Large Language Models 🧠⚡

Welcome! In this notebook, you'll explore how **large language models (LLMs)** can be made smaller, faster, and more efficient through two powerful techniques:  
**Knowledge Distillation** and **Quantization**.

You'll start by studying how a large, powerful "teacher" model can transfer its knowledge to a smaller "student" model, a process known as **distillation**.  
Then you'll see how **quantization** can further compress models by reducing numerical precision, cutting down memory use, and speeding up inference without major accuracy loss.

We'll focus on the **Masked Language Modeling (MLM)** task, the same pretraining objective used for models like **BERT** and **DistilBERT**, and evaluate how both techniques preserve performance while improving efficiency.

### 🔍 What you'll do
- 📚 **Learn key concepts** → understand what distillation and quantization mean, and why they matter for deploying LLMs.  
- 🧩 **Load a real dataset** → use the **Yelp Polarity** dataset to work with natural, human-written text.  
- 🧠 **Create masked samples** → hide random words and challenge models to fill in the blanks.  
- ⚖️ **Compare teacher vs. student** → test how well **DistilBERT** imitates **BERT**.  
- ⚡ **Quantize the teacher** → run BERT in 8-bit precision using **BitsAndBytes** to measure speed and accuracy changes.  
- 📊 **Evaluate and visualize results** → compare predictions, agreement scores, KL divergence, and runtime performance.  

✨ By the end, you'll see how **distillation and quantization work together** to make modern language models leaner, faster, and easier to deploy, while preserving most of their predictive quality. Let's get started!


In [None]:
import gc
import os
import random
import time

import matplotlib.pyplot as plt
import numpy as np
import torch
import torch.nn.functional as F
from datasets import load_dataset
from tqdm import tqdm
from transformers import AutoModelForMaskedLM, AutoTokenizer, pipeline

## 🧾 Dataset: Yelp Polarity

For this demo, we'll use the **Yelp Polarity** dataset, a collection of real user reviews from Yelp labeled as **positive** or **negative**.  
Even though this dataset is designed for sentiment classification, we'll repurpose its text for **masked language modeling**.  

Using real, naturally written reviews helps us test how well the teacher (**BERT**) and student (**DistilBERT**) handle everyday language and maintain consistent predictions under distillation.

**`TODO:`** Load the train split of the `yelp_polarity` dataset. Only 1% should suffice.

In [None]:
dataset = ...
dataset

### 🧩 Prepare Masked Sentences

Your goal here is to prepare a small set of sentences that include a **[MASK]** token.  
You'll use these sentences later to test how well BERT and DistilBERT predict missing words.


**`TODO:`**
1. Load the **BERT tokenizer** so you can check the token length of each sentence.  
2. Write a function that takes a sentence, randomly selects one word, replaces it with `[MASK]`, and returns both the masked sentence and the original word.  
3. Make sure to skip sentences that are too short or exceed **512 tokens** when tokenized.  
4. Loop through the dataset, apply your function, and keep only the valid masked sentences.  
5. Stop once you have collected **200 valid examples**, then store them in two lists: one for masked sentences and one for the original words.

When you're done, print how many masked sentences you successfully created.  
These will be your **test inputs** for the next experiment.
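
The steps above can be sketched in plain Python. This is a minimal sketch rather than the notebook's reference solution: `MIN_WORDS`, `mask_random_word`, `collect_masked`, and the `count_tokens` hook are hypothetical names, and whitespace splitting stands in for real tokenization. In the notebook you would likely pass something like `lambda s: len(tokenizer(s)["input_ids"])` as `count_tokens`.

```python
import random

MAX_TOKENS = 512  # limit stated in the task description
MIN_WORDS = 5     # assumed threshold for "too short"; tune as needed


def mask_random_word(sentence, rng=random, count_tokens=None):
    """Return (masked_sentence, original_word), or None if the sentence is skipped."""
    words = sentence.split()
    if len(words) < MIN_WORDS:
        return None
    # count_tokens is a hypothetical hook for the tokenizer's length;
    # without it, the whitespace word count is used as a rough proxy.
    n_tokens = count_tokens(sentence) if count_tokens else len(words)
    if n_tokens > MAX_TOKENS:
        return None
    idx = rng.randrange(len(words))
    original = words[idx]  # may carry attached punctuation, e.g. "great."
    words[idx] = "[MASK]"
    return " ".join(words), original


def collect_masked(texts, limit=200, seed=0, count_tokens=None):
    """Collect up to `limit` valid masked sentences and their original words."""
    rng = random.Random(seed)  # seeded so the sample is reproducible
    masked, originals = [], []
    for text in texts:
        result = mask_random_word(text, rng, count_tokens)
        if result is None:
            continue
        masked.append(result[0])
        originals.append(result[1])
        if len(masked) >= limit:
            break
    return masked, originals
```

Because words are split on whitespace, a masked word may correspond to several WordPiece tokens in BERT's vocabulary, so an exact match against the original word is not always achievable.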


## ⚙️ Distillation and Pipelines

In Hugging Face's Transformers library, **pipelines** provide a simple, high-level interface for running common NLP tasks without needing to manually handle tokenization, model inputs, or post-processing.

#### Masked Language Modeling (MLM)
- The `"fill-mask"` pipeline is designed for **Masked Language Modeling (MLM)** tasks.  
- It automatically detects the `[MASK]` token in a sentence and predicts which words are most likely to fill that blank.  
- Under the hood, the pipeline handles tokenization, model inference, and decoding, returning readable word predictions with their probabilities.

#### Teacher and Student Models
- `pipeline("fill-mask", model="bert-base-uncased")` loads **BERT**, our **teacher model**: large and highly accurate.  
- `pipeline("fill-mask", model="distilbert-base-uncased")` loads **DistilBERT**, our **student model**: smaller, faster, and distilled from BERT.

In this setup, both pipelines perform the same task, allowing us to directly compare their predictions and efficiency.  
For more details, check out the [Transformers pipeline documentation](https://huggingface.co/docs/transformers/en/main_classes/pipelines).
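
For a sentence with a single `[MASK]`, a `"fill-mask"` pipeline returns a list of dictionaries with the keys `score`, `token`, `token_str`, and `sequence`. The sketch below shows one way to turn that list into a token-to-probability mapping. The helper name `to_distribution` is hypothetical, and `mock_output` is a hand-written stand-in with made-up scores and token ids, not real model output.

```python
def to_distribution(predictions):
    """Map a fill-mask result (list of dicts) to {token_str: score}."""
    return {p["token_str"].strip(): p["score"] for p in predictions}


# Stand-in for something like `teacher("The food was [MASK].", top_k=3)`.
# All values here are illustrative placeholders.
mock_output = [
    {"score": 0.40, "token": 1, "token_str": "good", "sequence": "the food was good."},
    {"score": 0.25, "token": 2, "token_str": "great", "sequence": "the food was great."},
    {"score": 0.10, "token": 3, "token_str": "delicious", "sequence": "the food was delicious."},
]

dist = to_distribution(mock_output)
top_token = max(dist, key=dist.get)  # highest-scoring candidate
```

Working with a plain dictionary keeps later comparisons between two models independent of the pipeline object itself.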

**`TODO:`** Load the teacher and student models with the `pipeline` function.

In [None]:
teacher = ...
student = ...

### 📊 Task: Compare Teacher and Student Performance

Now that you have your masked sentences and both models loaded, it's time to **evaluate how similar their predictions are** and how efficiently they run.

In this section, you'll write a loop that goes through each masked sentence, asks both models to fill in the blank, and records several measurements:

1. **Inference time**  
   Measure how long each model takes to make a prediction. This helps show how much faster the student (DistilBERT) is compared to the teacher (BERT).

2. **Top-1 agreement**  
   Check if the two models predict the **same top word** for the `[MASK]` position.

3. **Top-5 overlap**  
   Compare the sets of their top-5 predictions. Count how many of the teacher's top-5 words also appear in the student's top-5 list.

4. **KL divergence**  
   Compute the Kullback–Leibler divergence between the teacher's and student's predicted probability distributions.  
   A lower KL value means the student's predictions are closer to the teacher's.

After looping through all samples, calculate and print the **average results** across the dataset.  
These metrics together will show how well the student model preserves the teacher's knowledge while being faster to run.
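
The agreement, overlap, and KL measurements can be sketched on plain `{token: probability}` dictionaries. This is a minimal sketch under stated assumptions: the function names are hypothetical, and the KL divergence is computed only over the tokens present in the two dictionaries (with a small `eps` for missing tokens), so it approximates the divergence over the full vocabulary. An exact value would need both models' complete output distributions.

```python
import math


def _top_k(dist, k):
    """Return the set of the k highest-probability tokens."""
    return set(sorted(dist, key=dist.get, reverse=True)[:k])


def top1_agree(teacher_dist, student_dist):
    """True if both distributions rank the same token first."""
    return max(teacher_dist, key=teacher_dist.get) == max(student_dist, key=student_dist.get)


def topk_overlap(teacher_dist, student_dist, k=5):
    """Number of the teacher's top-k tokens that also appear in the student's top-k."""
    return len(_top_k(teacher_dist, k) & _top_k(student_dist, k))


def kl_divergence(p, q, eps=1e-12):
    """Approximate KL(p || q) over the union of tokens, after smoothing and renormalising."""
    tokens = sorted(set(p) | set(q))
    p_vals = [p.get(t, 0.0) + eps for t in tokens]
    q_vals = [q.get(t, 0.0) + eps for t in tokens]
    p_sum, q_sum = sum(p_vals), sum(q_vals)
    return sum(
        (pv / p_sum) * math.log((pv / p_sum) / (qv / q_sum))
        for pv, qv in zip(p_vals, q_vals)
    )


# Made-up example distributions, for illustration only.
teacher_example = {"good": 0.5, "great": 0.3, "bad": 0.2}
student_example = {"good": 0.6, "fine": 0.3, "great": 0.1}
agree = top1_agree(teacher_example, student_example)
overlap = topk_overlap(teacher_example, student_example, k=5)
kl = kl_divergence(teacher_example, student_example)
```

Averaging these per-sentence values over all masked sentences gives the summary numbers the task asks for. Note that the KL value depends heavily on `eps` whenever a token is missing from one side.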


**`TODO:`** Implement the experiment as described above.

**`TODO:`** Plot the Top-1 Agreement and the Top-5 Overlap between the student and the teacher.

## ⚡ Task: Quantize the Teacher Model with BitsAndBytes

Now you'll explore **quantization**: storing a model's weights at lower numerical precision.

In this exercise, you will:
1. Load the original **BERT** teacher model in full precision.  
2. Load a second version in **8-bit precision** using the `bitsandbytes` library.  
3. Run both models on the same masked sentence.  
4. Compare their predictions and measure the difference in inference time.

Quantization reduces model size and memory use while keeping performance nearly the same.  
It‚Äôs one of the main techniques used to deploy LLMs efficiently in real-world applications.
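
To make the idea of reduced precision concrete, here is a toy sketch of symmetric "absmax" 8-bit quantization in plain Python. It is a simplified illustration, not the algorithm `bitsandbytes` actually runs (which works on GPU tensors and treats outlier values separately); the function names are hypothetical.

```python
def quantize_absmax(weights):
    """Map floats to integers in [-127, 127] using one shared scale factor."""
    scale = max(abs(w) for w in weights) / 127 or 1.0  # fall back to 1.0 if all zeros
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float values from the 8-bit integers."""
    return [v * scale for v in quantized]


weights = [0.5, -1.27, 0.01, 1.0]
q, scale = quantize_absmax(weights)
restored = dequantize(q, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
```

Each weight is now stored as one small integer plus a shared scale, and the rounding error per weight is at most half the scale. That bounded error is why predictions usually change only slightly.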


### ⚙️ Introduction to BitsAndBytes Quantization

**BitsAndBytes** is a library that enables efficient **low-precision inference and training** for large language models.  
It allows you to load models in **8-bit** or **4-bit** precision directly through Hugging Face Transformers, reducing memory use and speeding up inference without needing to retrain the model.

Because BitsAndBytes performs GPU-accelerated quantization, it **requires access to a CUDA-enabled GPU**.  
If you're running this notebook locally without a GPU, you will encounter errors.
We recommend running this section on **[Google Colab](https://colab.research.google.com)** (with "Runtime → Change runtime type → GPU") or any environment with GPU support.

Learn more from the official documentation:  👉 [https://huggingface.co/docs/bitsandbytes](https://huggingface.co/docs/bitsandbytes)


In [None]:
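from transformers import BitsAndBytesConfig

# Recent Transformers releases take quantization options through a config object
model_8bit = AutoModelForMaskedLM.from_pretrained(
    "bert-base-uncased",
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),  # activates 8-bit quantization
    device_map="auto"     # automatically places model on available device
)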

**`TODO:`** Load both the full-precision and 8-bit quantized versions of `bert-base-uncased`, then compute and print each model's disk size (from the HF cache folder) and in-memory / VRAM usage side-by-side.
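
As a sanity check for the numbers you measure, a back-of-the-envelope estimate of weight storage is simply parameter count times bytes per parameter. The sketch below assumes roughly 110 million parameters for `bert-base-uncased` (an approximate figure) and uses a hypothetical helper name. Real measurements will differ, because some layers are typically kept in higher precision and runtime memory also includes activations and buffers.

```python
def estimate_weight_mb(num_params, bits_per_param):
    """Rough weight-storage estimate in megabytes (1 MB = 1e6 bytes)."""
    return num_params * bits_per_param / 8 / 1e6


BERT_BASE_PARAMS = 110_000_000  # approximate parameter count, assumed for illustration

fp32_mb = estimate_weight_mb(BERT_BASE_PARAMS, 32)  # full precision
int8_mb = estimate_weight_mb(BERT_BASE_PARAMS, 8)   # 8-bit weights
ratio = fp32_mb / int8_mb
```

For the measured side, Transformers models expose `get_memory_footprint()`, and `torch.cuda.memory_allocated()` reports current GPU memory use; comparing those readings with the estimate shows how much overhead sits on top of the raw weights.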

**`TODO:`** Load both the full-precision and quantized models into `fill-mask` pipelines.

In [None]:
non_quantized = ...
quantized = ...

### ⚖️ Task: Compare Quantized vs. Full-Precision Models

Now you'll evaluate how **quantization** affects the performance of a model.  
You'll compare the original full-precision **BERT** model with its **quantized (8-bit)** version loaded using BitsAndBytes.

In this section, you'll write a loop that runs both models on the same masked sentences and records several key metrics:

1. **Inference time**  
   Measure how long each model takes to generate predictions.  
   This will show how the quantized model's speed compares with the full-precision version; 8-bit inference is not guaranteed to be faster, so measure rather than assume.

2. **Top-1 agreement**  
   Check whether both versions predict the **same top word** for the `[MASK]` position.

3. **Top-5 overlap**  
   Compare the sets of their top-5 predictions. Count how many of the full-precision model's top-5 words also appear in the quantized model's top-5 list.

4. **KL divergence**  
   Compute the Kullback–Leibler divergence between the two models' predicted probability distributions.  
   A lower KL value means the quantized model's predictions are closer to the original.

After evaluating all samples, calculate and print the **average results** across the dataset.  
These metrics will help you see how much **accuracy is retained** and how much **speed is gained** through quantization.
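
Inference time is easy to mis-measure, so a small timing helper can be useful. The following is a minimal sketch with a hypothetical name, `time_call`. It runs a warm-up call first and averages over several repeats. The optional `sync` argument is an assumption for GPU runs, where you might pass `torch.cuda.synchronize` so that queued GPU work finishes before the clock is read.

```python
import time


def time_call(fn, *args, warmup=1, repeats=5, sync=None, **kwargs):
    """Return (average_seconds, last_result) for calling fn(*args, **kwargs).

    Assumes repeats >= 1.
    """
    for _ in range(warmup):
        fn(*args, **kwargs)  # warm-up runs are excluded from the timing
    if sync:
        sync()
    start = time.perf_counter()
    for _ in range(repeats):
        result = fn(*args, **kwargs)
    if sync:
        sync()  # wait for pending work before stopping the clock
    elapsed = time.perf_counter() - start
    return elapsed / repeats, result


# Usage with a stand-in function; in the notebook, fn would be a pipeline call.
avg_seconds, output = time_call(sum, [1, 2, 3], repeats=3)
```

Timing both models with the same helper, inputs, and repeat count keeps the comparison fair.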


**`TODO:`** Implement the experiment as described above.

**`TODO:`** Plot the Top-1 Agreement and the Top-5 Overlap between the quantized and the full-precision models.