# Exercise 1: MX Quantization of Linear Layers

## Llama-3.2-1B with mxfp4_e2m1 (weights) + mxfp6_e2m3 (activations)

This notebook evaluates an MX-quantized Llama model on the `lambada_openai` task.

**Exercise objectives**
- Quantize all linear layers (Q, K, V, O, gate, up, down)
- Use `mxfp4_e2m1` for weights (4-bit)
- Use `mxfp6_e2m3` for activations (6-bit)
- Compare accuracy vs baseline (62.10%)

**Expected outcome**
- Accuracy target: > 60% (roughly 2 percentage points below the 62.10% baseline at most)

**Author:** Pavan Chauhan  
**Date:** January 30, 2026

---

This notebook is designed to run top-to-bottom without manual restarts.

## Step 1: Verify GPU Runtime

Ensure a GPU runtime is enabled (T4/A100/H100).

In [1]:
# Check GPU availability
!nvidia-smi

Sat Jan 31 04:58:36 2026       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  NVIDIA A100-SXM4-40GB          Off |   00000000:00:04.0 Off |                    0 |
| N/A   34C    P0             47W /  400W |       0MiB /  40960MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
                                                

## Step 2: Clone Project Repository

In [2]:
# Clone / update the project repository
import os
import subprocess

repo_dir = "/content/msr-intern-project"
if not os.path.exists(repo_dir):
    subprocess.check_call(["git", "clone", "https://github.com/pavannn16/msr-intern-project.git", repo_dir])
else:
    subprocess.check_call(["git", "-C", repo_dir, "fetch", "origin", "main"])
    subprocess.check_call(["git", "-C", repo_dir, "reset", "--hard", "origin/main"])

%cd /content/msr-intern-project
# Report the checked-out commit (check_call itself returns git's exit status, 0 on success)
subprocess.check_call(["git", "rev-parse", "--short", "HEAD"])

/content/msr-intern-project


0
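
The cell above displays `0` because `subprocess.check_call` returns the exit status rather than the command's output. A minimal sketch of capturing the output instead, assuming the goal is to print the short commit hash; `run_and_capture` is a hypothetical helper name and `abc1234` is a placeholder, not a real commit:

```python
import subprocess
import sys

def run_and_capture(cmd):
    """Run a command and return its stripped stdout as text."""
    return subprocess.check_output(cmd, text=True).strip()

# In the notebook this would be (repo_dir as defined in the cell above):
# print(run_and_capture(["git", "-C", repo_dir, "rev-parse", "--short", "HEAD"]))

# Stand-in command so the sketch runs anywhere, even without a git checkout.
print(run_and_capture([sys.executable, "-c", "print('abc1234')"]))
```

Printing the captured string makes the deployed commit visible in the cell output, which helps when matching saved results to a repository state.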

## Step 3: Install Dependencies

Install `transformers`, `microxcaling` (MX), and `lm-eval`.

Estimated time: 2-3 minutes

In [3]:
import os
import subprocess
import sys

print("Installing dependencies...")

def pip_install(args):
    cmd = [sys.executable, "-m", "pip", "install", "-q"] + args
    print("+", " ".join(cmd))
    subprocess.check_call(cmd)

# Core packages (do not force-install torch; Colab typically provides a working CUDA build)
pip_install(["transformers==4.57.6", "lm_eval", "ninja"])

# Confirm torch is available and report version
try:
    import torch
    print(f"torch={torch.__version__}, cuda={torch.version.cuda}, device_count={torch.cuda.device_count()}")
except Exception as e:
    raise RuntimeError(f"PyTorch is not available in this runtime: {e}") from e

# Clone microxcaling (MX)
mx_repo_dir = "/content/microxcaling"
if not os.path.exists(mx_repo_dir):
    subprocess.check_call(["git", "clone", "-q", "https://github.com/microsoft/microxcaling.git", mx_repo_dir])
    print("microxcaling cloned")
else:
    print("microxcaling already exists")

# Install microxcaling WITHOUT deps to avoid torchaudio pinning issues.
# The Python package is commonly located under a subdirectory (e.g., /python).
mx_python_root = os.path.join(mx_repo_dir, "python") if os.path.isdir(os.path.join(mx_repo_dir, "python")) else mx_repo_dir
pip_install(["-e", mx_python_root, "--no-deps"])

# Sanity check: MX import (fallback to sys.path if editable install layout differs)
try:
    import mx  # noqa: F401
except ModuleNotFoundError:
    if mx_python_root not in sys.path:
        sys.path.insert(0, mx_python_root)
    import mx  # noqa: F401

print("MX import: OK")
print("All dependencies installed")

Installing dependencies...
+ /usr/bin/python3 -m pip install -q transformers==4.57.6 lm_eval ninja
torch=2.9.0+cu126, cuda=12.6, device_count=1
microxcaling already exists
+ /usr/bin/python3 -m pip install -q -e /content/microxcaling --no-deps
MX import: OK
All dependencies installed
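
Because `lm_eval` and `ninja` are installed unpinned, it may be worth logging the resolved versions next to the results. A minimal sketch using the standard library; `installed_version` is a hypothetical helper, and the distribution names passed to it are assumptions about how the packages are registered in this runtime:

```python
from importlib import metadata

def installed_version(package):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None

# Distribution names are assumptions; adjust them if a lookup reports "not installed".
for name in ("transformers", "lm_eval", "torch"):
    print(f"{name}: {installed_version(name) or 'not installed'}")
```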


## Step 4: Setup Exercise 1 Files

Copy the complete MX-integrated `modeling_llama.py` into the installed `transformers` package, keeping a backup of the original file.

In [4]:
import os
import shutil
import transformers

print("Setting up Exercise 1 files...")

transformers_path = os.path.dirname(transformers.__file__)
print(f"Transformers installed at: {transformers_path}")

modeling_llama_path = os.path.join(transformers_path, "models", "llama", "modeling_llama.py")
backup_path = modeling_llama_path + ".backup"

mx_complete_file = "/content/msr-intern-project/Exercise1/modified_files/modeling_llama.py"
if os.path.exists(mx_complete_file):
    with open(mx_complete_file, "r") as f:
        line_count = len(f.readlines())
    print(f"Found MX-integrated file ({line_count} lines)")

    if not os.path.exists(backup_path):
        shutil.copy2(modeling_llama_path, backup_path)
        print("Backup created")

    shutil.copy2(mx_complete_file, modeling_llama_path)
    print("MX-integrated modeling_llama.py deployed")

    with open(modeling_llama_path, "r") as f:
        content = f.read()
    if "apply_mx_linear" in content:
        print("MX quantization functions detected in deployed file")
    else:
        print("Warning: MX functions not found in deployed file")
else:
    print(f"Warning: MX file not found at {mx_complete_file}")
    print("  Evaluation will use standard transformers (no MX quantization)")

print("\nExercise 1 setup complete")

Setting up Exercise 1 files...
Transformers installed at: /usr/local/lib/python3.12/dist-packages/transformers
Found MX-integrated file (703 lines)
MX-integrated modeling_llama.py deployed
MX quantization functions detected in deployed file

Exercise 1 setup complete
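
The deploy cell writes a `.backup` copy next to the original file but does not show how to undo the patch. A minimal sketch of a restore step, assuming the same `.backup` suffix; `restore_backup` is a hypothetical helper that is not part of the repository:

```python
import os
import shutil

def restore_backup(target_path, suffix=".backup"):
    """Copy target_path + suffix back over target_path.

    Returns True if a backup was found and restored, False otherwise.
    """
    backup_path = target_path + suffix
    if not os.path.exists(backup_path):
        return False
    shutil.copy2(backup_path, target_path)
    return True

# In the notebook this would be called as:
# restore_backup(modeling_llama_path)
```

Restoring the stock file and re-running the evaluation is one way to reproduce the unquantized baseline in the same runtime.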


## Step 5: Set Hugging Face Token

Set your Hugging Face token for model access; `meta-llama/Llama-3.2-1B` is a gated model and requires an authenticated download.

In [None]:
import os

# Option A (recommended on Colab): use Secrets
# from google.colab import userdata
# os.environ["HF_TOKEN"] = userdata.get("HF_TOKEN")

# Option B: manually set the token in this runtime (do not commit real tokens)
os.environ["HF_TOKEN"] = "YOUR_HF_TOKEN_HERE"

hf_token = os.environ.get("HF_TOKEN")
if not hf_token or hf_token == "YOUR_HF_TOKEN_HERE":
    raise RuntimeError(
        "HF_TOKEN is not set. Set it via Colab Secrets (recommended) or set the token in this cell for runtime-only use (do not commit real tokens)."
    )

print("HF token is set (value hidden)")

HF token is set (value hidden)


## Step 6: Verify MX Integration

Test that the MX library and the modified model load correctly.

In [None]:
import ast
import inspect
import sys

print("Verifying MX integration...")

exercise1_dir = "/content/msr-intern-project/Exercise1"
if exercise1_dir not in sys.path:
    sys.path.insert(0, exercise1_dir)

# 1) MX import
try:
    from mx import linear as mx_linear  # noqa: F401
    from mx.specs import MxSpecs  # noqa: F401
    print("MX library import: OK")
except ImportError as e:
    raise ImportError(
        f"MX library import failed: {e}. Ensure Step 3 installed microxcaling (pip install -e /content/microxcaling)."
    ) from e

# 2) Local helper import
try:
    from mx_config_helper import create_mx_specs_exercise1, print_mx_specs_summary
    mx_specs = create_mx_specs_exercise1()
    print_mx_specs_summary(mx_specs)
except ImportError as e:
    raise ImportError(
        f"Failed to import Exercise 1 helper(s): {e}. Ensure the repo is cloned and Step 2 ran successfully."
    ) from e

# 3) Confirm the deployed transformers Llama file is the MX-integrated one
import transformers

llama_mod = transformers.models.llama.modeling_llama
llama_file = getattr(llama_mod, "__file__", "<unknown>")
print(f"Transformers llama module: {llama_file}")

with open(llama_file, "r") as f:
    deployed_src = f.read()

if "apply_mx_linear" not in deployed_src:
    raise RuntimeError(
        "Deployed transformers modeling_llama.py does not appear to include MX integration. Re-run Step 4 (deploy)."
    )

# Extra: confirm the deployed MX specs match expected settings (especially rounding).
deployed_specs = getattr(llama_mod, "EXERCISE1_MX_SPECS", None)
if deployed_specs is not None:
    try:
        deployed_round = deployed_specs.get("round") if hasattr(deployed_specs, "get") else deployed_specs["round"]
    except Exception:
        deployed_round = "<unknown>"
    print(f"Deployed MX rounding mode: {deployed_round}")

def _trial2_semantics_present(src):
    """Returns (ok, reason). ok=True means bias and mx_specs are passed into the MX linear call."""
    try:
        tree = ast.parse(src)
    except SyntaxError as e:
        return False, f"Unable to parse deployed modeling_llama.py: {e}"

    apply_fn = None
    for node in tree.body:
        if isinstance(node, ast.FunctionDef) and node.name == "apply_mx_linear":
            apply_fn = node
            break
    if apply_fn is None:
        return False, "apply_mx_linear not found"

    # Guard against Trial-1-style manual bias add after MX op.
    for node in ast.walk(apply_fn):
        if isinstance(node, ast.BinOp) and isinstance(node.op, ast.Add):
            left_is_bias = isinstance(node.left, ast.Name) and node.left.id == "bias"
            right_is_bias = isinstance(node.right, ast.Name) and node.right.id == "bias"
            if left_is_bias or right_is_bias:
                return False, "Found a manual '+ bias' in apply_mx_linear (likely Trial 1)"

    # Accept any of the call patterns used across iterations/versions:
    # - mx.linear(..., bias=bias, mx_specs=mx_specs)
    # - mx_linear.linear(..., bias=bias, mx_specs=mx_specs)
    # - mx_fn(..., bias=bias, mx_specs=mx_specs) where mx_fn = getattr(mx_linear, 'linear', mx_linear)
    def call_passes_bias_and_specs(call):
        bias_ok = False
        specs_ok = False

        # Positional forms: (..., bias, mx_specs)
        if len(call.args) >= 3 and isinstance(call.args[2], ast.Name) and call.args[2].id == "bias":
            bias_ok = True
        if len(call.args) >= 4 and isinstance(call.args[3], ast.Name) and call.args[3].id == "mx_specs":
            specs_ok = True

        # Keyword forms: bias=bias, mx_specs=mx_specs
        for kw in call.keywords or []:
            if kw.arg == "bias" and isinstance(kw.value, ast.Name) and kw.value.id == "bias":
                bias_ok = True
            if kw.arg in {"mx_specs", "specs"} and isinstance(kw.value, ast.Name) and kw.value.id == "mx_specs":
                specs_ok = True

        return bias_ok and specs_ok

    for node in ast.walk(apply_fn):
        if isinstance(node, ast.Call) and call_passes_bias_and_specs(node):
            return True, ""

    return False, "No call found in apply_mx_linear that passes bias and mx_specs"

trial2_ok, trial2_reason = _trial2_semantics_present(deployed_src)
print(f"Trial 2 semantics detected: {trial2_ok}")
if not trial2_ok:
    print(f"Trial 2 semantics check detail: {trial2_reason}")

# 4) Smoke import of model classes
from transformers.models.llama.modeling_llama import LlamaForCausalLM, LlamaMLP, LlamaAttention  # noqa: F401

mlp_forward = inspect.getsource(LlamaMLP.forward)
if "apply_mx_linear" not in mlp_forward:
    print("Warning: apply_mx_linear not detected in LlamaMLP.forward source.")
else:
    print("MX integration detected in LlamaMLP.forward.")

print("MX integration verification complete")

Verifying MX integration...
MX library import: OK
MX Quantization Configuration:
Weights: fp4_e2m1 (4-bit)
Activations: fp6_e2m3 (6-bit)
Scale Bits: 8 (E8M0)
Block Size: 32
CUDA Backend: Enabled
Rounding: even
Backward Quantization: Disabled
Transformers llama module: /usr/local/lib/python3.12/dist-packages/transformers/models/llama/modeling_llama.py
Deployed MX rounding mode: even
Trial 2 semantics detected: False
Trial 2 semantics check detail: No mx.linear/mx_linear.linear call found in apply_mx_linear
MX integration detected in LlamaMLP.forward.
MX integration verification complete
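
The printed configuration (4-bit E2M1 elements sharing one power-of-two scale per block) can be illustrated with a small pure-Python emulation. This is a sketch under stated assumptions, not the `microxcaling` implementation: it uses a toy block of 4 values instead of 32, picks the nearest grid point without round-to-even tie handling, and does not clamp the scale exponent to the E8M0 range. `quantize_block_mxfp4` is a hypothetical name.

```python
import math

# Non-negative magnitudes representable by FP4 E2M1 (sign is a separate bit).
FP4_E2M1_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
E2M1_EMAX = 2  # largest element exponent: 6.0 = 1.5 * 2**2

def quantize_block_mxfp4(block):
    """Return (scale, dequantized_values) for one block sharing a single scale."""
    amax = max(abs(v) for v in block)
    if amax == 0.0:
        return 1.0, [0.0] * len(block)
    # Shared power-of-two scale chosen so the largest value lands near the top of the grid.
    shared_exp = math.floor(math.log2(amax)) - E2M1_EMAX
    scale = 2.0 ** shared_exp
    out = []
    for v in block:
        mag = min(abs(v) / scale, FP4_E2M1_GRID[-1])  # saturate at 6.0
        nearest = min(FP4_E2M1_GRID, key=lambda g: abs(g - mag))
        out.append(math.copysign(nearest, v) * scale)
    return scale, out

scale, values = quantize_block_mxfp4([0.1, -0.7, 3.0, 5.5])
print(scale, values)
```

Small values in a block dominated by a large one lose the most precision (here `0.1` collapses to `0.0`), which is why the block size and element format both matter for accuracy.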


## Step 7: Quick Test (10% Dataset)

Run a quick test to confirm the deployed code is the Trial 2 version and that `lm_eval` executes successfully.

Estimated time: 1-2 minutes

In [None]:
# Verify Trial 2 semantics are present in the deployed transformers file, then run a 10% eval.
import ast
import transformers

llama_file = transformers.models.llama.modeling_llama.__file__
print(f"Verifying Trial 2 deployment in: {llama_file}")

with open(llama_file, "r") as f:
    deployed_src = f.read()

def _trial2_semantics_present(src):
    """Returns (ok, reason). ok=True means bias and mx_specs are passed into the MX linear call."""
    try:
        tree = ast.parse(src)
    except SyntaxError as e:
        return False, f"Unable to parse deployed modeling_llama.py: {e}"

    apply_fn = None
    for node in tree.body:
        if isinstance(node, ast.FunctionDef) and node.name == "apply_mx_linear":
            apply_fn = node
            break
    if apply_fn is None:
        return False, "apply_mx_linear not found"

    # Guard against Trial-1-style manual bias add after MX op.
    for node in ast.walk(apply_fn):
        if isinstance(node, ast.BinOp) and isinstance(node.op, ast.Add):
            left_is_bias = isinstance(node.left, ast.Name) and node.left.id == "bias"
            right_is_bias = isinstance(node.right, ast.Name) and node.right.id == "bias"
            if left_is_bias or right_is_bias:
                return False, "Found a manual '+ bias' in apply_mx_linear (likely Trial 1)"

    def call_passes_bias_and_specs(call):
        bias_ok = False
        specs_ok = False

        if len(call.args) >= 3 and isinstance(call.args[2], ast.Name) and call.args[2].id == "bias":
            bias_ok = True
        if len(call.args) >= 4 and isinstance(call.args[3], ast.Name) and call.args[3].id == "mx_specs":
            specs_ok = True

        for kw in call.keywords or []:
            if kw.arg == "bias" and isinstance(kw.value, ast.Name) and kw.value.id == "bias":
                bias_ok = True
            if kw.arg in {"mx_specs", "specs"} and isinstance(kw.value, ast.Name) and kw.value.id == "mx_specs":
                specs_ok = True

        return bias_ok and specs_ok

    for node in ast.walk(apply_fn):
        if isinstance(node, ast.Call) and call_passes_bias_and_specs(node):
            return True, ""

    return False, "No call found in apply_mx_linear that passes bias and mx_specs"

trial2_ok, trial2_reason = _trial2_semantics_present(deployed_src)
if not trial2_ok:
    raise RuntimeError(
        "Unable to confirm Trial 2 semantics in the deployed transformers file. "
        f"Detail: {trial2_reason}. Re-run Step 4 (deploy) and restart the runtime, then try again."
    )

print("Trial 2 deployment verified")
!lm_eval --model hf --model_args pretrained=meta-llama/Llama-3.2-1B --tasks lambada_openai --device cuda --batch_size 32 --limit 0.1

Verifying Trial 2 deployment in: /usr/local/lib/python3.12/dist-packages/transformers/models/llama/modeling_llama.py


RuntimeError: Unable to confirm Trial 2 semantics in the deployed transformers file. Detail: No mx.linear/mx_linear.linear call found in apply_mx_linear. Re-run Step 4 (deploy) and restart the runtime, then try again.
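
To make the check's behavior easier to reason about, here is a reduced version of the keyword-argument test run against two toy definitions of `apply_mx_linear`. Both source strings are hypothetical and are not taken from the deployed file; the sketch only shows that the check matches bare names, so a call passing `bias=layer.bias` would not be recognized. Whether that explains the failure above is not established here.

```python
import ast

def passes_bias_and_specs(src, fn_name="apply_mx_linear"):
    """True if fn_name contains a call with bias=bias and mx_specs=mx_specs keywords."""
    tree = ast.parse(src)
    for node in tree.body:
        if isinstance(node, ast.FunctionDef) and node.name == fn_name:
            for call in ast.walk(node):
                if not isinstance(call, ast.Call):
                    continue
                kw = {k.arg: k.value for k in call.keywords}
                bias_ok = isinstance(kw.get("bias"), ast.Name) and kw["bias"].id == "bias"
                specs_ok = isinstance(kw.get("mx_specs"), ast.Name) and kw["mx_specs"].id == "mx_specs"
                if bias_ok and specs_ok:
                    return True
    return False

# Hypothetical source strings; they are parsed, never executed.
TOY_NAME = '''
def apply_mx_linear(x, weight, bias, mx_specs):
    return mx_linear.linear(x, weight, bias=bias, mx_specs=mx_specs)
'''
TOY_ATTR = '''
def apply_mx_linear(x, layer, mx_specs):
    return mx_linear.linear(x, layer.weight, bias=layer.bias, mx_specs=mx_specs)
'''

print(passes_bias_and_specs(TOY_NAME), passes_bias_and_specs(TOY_ATTR))
```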

## Step 8: Full Evaluation (Exercise 1)

Run the complete evaluation with the MX-quantized model.

- Estimated time: 10-15 minutes
- Baseline: 62.10% accuracy
- Target: > 60% accuracy (< 2% degradation)

In [None]:
# Full evaluation with MX quantization
!lm_eval --model hf --model_args pretrained=meta-llama/Llama-3.2-1B --tasks lambada_openai --device cuda --batch_size 32

2026-01-31:04:51:38 INFO     [_cli.run:376] Selected Tasks: ['lambada_openai']
2026-01-31:04:51:40 INFO     [evaluator:211] Setting random seed to 0 | Setting numpy seed to 1234 | Setting torch manual seed to 1234 | Setting fewshot manual seed to 1234
2026-01-31:04:51:40 INFO     [evaluator:236] Initializing hf model, with arguments: {'pretrained': 'meta-llama/Llama-3.2-1B'}
2026-01-31:04:51:43 INFO     [models.huggingface:161] Using device 'cuda'
2026-01-31:04:51:44 INFO     [models.huggingface:423] Model parallel was set to False, max memory was not set, and device map was set to {'': 'cuda'}
2026-01-31 04:51:45.670693: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:467] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
E0000 00:00:1769835105.691169    1529 cuda_dnn.cc:8579] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1
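
Rather than copying metrics by hand from the log, `lm_eval` can write a JSON results file via `--output_path`, which can then be parsed. A minimal sketch, assuming the results layout used by recent `lm_eval` releases (a top-level `results` mapping with metric keys such as `acc,none`); the key names are an assumption and the numbers below are placeholders, not measured values:

```python
def extract_metrics(results, task="lambada_openai"):
    """Pull accuracy (as a percentage) and perplexity for one task from a results dict."""
    task_results = results["results"][task]
    return {
        "accuracy_pct": task_results["acc,none"] * 100.0,
        "perplexity": task_results["perplexity,none"],
    }

# Placeholder structure standing in for json.load(open(<results file>)).
example = {"results": {"lambada_openai": {"acc,none": 0.5, "perplexity,none": 10.0}}}
print(extract_metrics(example))
```

If the installed version uses different metric keys, printing `results["results"][task].keys()` should show the names to use.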

## Step 9: Save Results

In [None]:
# Save Exercise 1 results
import datetime
import os

timestamp = datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S")

results_content = f"""Exercise 1 Evaluation Results
==================================================
Timestamp: {timestamp}
Model: meta-llama/Llama-3.2-1B
Task: lambada_openai
Device: CUDA
Batch Size: 32

MX Quantization Configuration:
- Weight Format: mxfp4_e2m1 (4-bit)
- Activation Format: mxfp6_e2m3 (6-bit)
- Block Size: 32
- Scale Bits: 8 (E8M0)
- CUDA Backend: Enabled

Baseline Results (for comparison):
- Accuracy: 62.10%
- Runtime: ~22 seconds

Exercise 1 Results:
- Accuracy: [FILL FROM lm_eval OUTPUT]
- Perplexity: [FILL FROM lm_eval OUTPUT]
- Runtime: [FILL]
- Accuracy Change: [CALCULATE vs baseline]

Notes:
- MX quantization applied to all linear layers:
  - Q, K, V, O projections (attention)
  - gate, up, down projections (MLP)
- Block floating point with a shared per-block exponent

Status: [SUCCESS/FAILED]
Comments: [Add observations here]
"""

# Write under the repo if present; otherwise fall back to the current working directory.
repo_root = "/content/msr-intern-project"
base_dir = repo_root if os.path.isdir(repo_root) else os.getcwd()

results_dir = os.path.join(base_dir, "Exercise1", "results")
os.makedirs(results_dir, exist_ok=True)

results_path = os.path.join(results_dir, "exercise1_results.txt")
with open(results_path, "w") as f:
    f.write(results_content)

print(f"Results template saved to: {results_path}")
print("Update the file with metrics from the evaluation output.")

## Step 10: Analysis & Comparison

Compare the Exercise 1 results with the baseline.

In [None]:
# Comparison analysis
print("=" * 70)
print("EXERCISE 1 RESULTS ANALYSIS")
print("=" * 70)

baseline_acc = 62.10
exercise1_acc = 0.0  # TODO: Fill from your results

if exercise1_acc > 0:
    accuracy_change = exercise1_acc - baseline_acc
    accuracy_change_pct = (accuracy_change / baseline_acc) * 100

    print(f"\nBaseline Accuracy:   {baseline_acc:.2f}%")
    print(f"Exercise 1 Accuracy: {exercise1_acc:.2f}%")
    print(f"Change:              {accuracy_change:+.2f} pp ({accuracy_change_pct:+.2f}% relative)")
    print()

    if accuracy_change >= -2.0:
        print("Result: within target (at most 2 percentage points of degradation)")
    else:
        print("Result: exceeds the 2-percentage-point degradation threshold")

    print()
    print("Memory savings (theoretical, element bits only, excluding the shared 8-bit block scale):")
    print("  - Weights: 87.5% reduction (FP32 -> mxfp4_e2m1)")
    print("  - Activations: 81% reduction (FP32 -> mxfp6_e2m3)")
else:
    print("\nSet exercise1_acc to your actual accuracy from Step 8 output, then re-run this cell.")

print("=" * 70)
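
The theoretical storage figures can be derived directly from the format parameters. A minimal sketch, assuming one 8-bit shared scale per 32-element block as in the configuration printed in Step 6; the helper names are hypothetical. Counting the shared scale gives slightly smaller savings than the element width alone, and the figures describe the storage format, not measured memory use in this notebook.

```python
def mx_bits_per_element(elem_bits, scale_bits=8, block_size=32):
    """Effective bits per element once the shared block scale is amortized."""
    return elem_bits + scale_bits / block_size

def reduction_pct(new_bits, baseline_bits):
    """Percentage reduction in storage relative to a baseline element width."""
    return (1.0 - new_bits / baseline_bits) * 100.0

for name, bits in (("mxfp4_e2m1", 4), ("mxfp6_e2m3", 6)):
    eff = mx_bits_per_element(bits)
    print(
        f"{name}: {eff:.2f} bits/element, "
        f"{reduction_pct(eff, 32):.1f}% smaller than FP32, "
        f"{reduction_pct(eff, 16):.1f}% smaller than BF16"
    )
```

The BF16 column is included because 16-bit checkpoints are a common starting point, and the relative savings are smaller against that baseline.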

## Exercise 1 Wrap-up

**Next steps**
1. Record results in `Exercise1/results/exercise1_results.txt`
2. Compare accuracy vs baseline (62.10%)
3. Commit and push non-secret outputs (do not commit HF tokens)

**Artifacts**
- `Exercise1/modified_files/modeling_llama.py`
- `Exercise1/mx_config_helper.py`
- `Exercise1/results/exercise1_results.txt`