# Exercise 1: MX Quantization of Linear Layers

## Llama-3.2-1B with mxfp4_e2m1 (weights) + mxfp6_e2m3 (activations)

This notebook evaluates an MX-quantized Llama model on the `lambada_openai` task.

**Exercise objectives**
- Quantize all linear layers (Q, K, V, O, gate, up, down)
- Use `mxfp4_e2m1` for weights (4-bit)
- Use `mxfp6_e2m3` for activations (6-bit)
- Compare accuracy vs baseline (62.10%)

**Expected outcome**
- Accuracy target: > 60% (< 2% degradation)

**Author:** Pavan Chauhan  
**Date:** January 30, 2026

---

This notebook is designed to run top-to-bottom without manual restarts.

## Step 1: Verify GPU Runtime

Ensure GPU runtime is enabled (T4/A100/H100).

In [1]:
# Check GPU availability
!nvidia-smi

Sat Jan 31 17:49:41 2026       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  NVIDIA A100-SXM4-40GB          Off |   00000000:00:04.0 Off |                    0 |
| N/A   30C    P0             44W /  400W |       0MiB /  40960MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
                                                

## Step 2: Clone Project Repository

In [2]:
# Clone / update the project repository
import os
import subprocess

repo_dir = "/content/msr-intern-project"
if not os.path.exists(repo_dir):
    subprocess.check_call(["git", "clone", "https://github.com/pavannn16/msr-intern-project.git", repo_dir])
else:
    subprocess.check_call(["git", "-C", repo_dir, "fetch", "origin", "main"])
    subprocess.check_call(["git", "-C", repo_dir, "reset", "--hard", "origin/main"])

%cd /content/msr-intern-project
subprocess.check_call(["git", "rev-parse", "--short", "HEAD"])

/content/msr-intern-project


0

## Step 3: Install Dependencies

Install `transformers`, `microxcaling` (MX), and `lm-eval`.

Estimated time: 2-3 minutes

In [3]:
import os
import subprocess
import sys

print("Installing dependencies...")

def pip_install(args):
    cmd = [sys.executable, "-m", "pip", "install", "-q"] + args
    print("+", " ".join(cmd))
    subprocess.check_call(cmd)

# Core packages (do not force-install torch; Colab typically provides a working CUDA build)
pip_install(["transformers==4.57.6", "lm_eval", "ninja"])

# Confirm torch is available and report version
try:
    import torch
    print(f"torch={torch.__version__}, cuda={torch.version.cuda}, device_count={torch.cuda.device_count()}")
except Exception as e:
    raise RuntimeError(f"PyTorch is not available in this runtime: {e}")

# Clone microxcaling (MX)
mx_repo_dir = "/content/microxcaling"
if not os.path.exists(mx_repo_dir):
    subprocess.check_call(["git", "clone", "-q", "https://github.com/microsoft/microxcaling.git", mx_repo_dir])
    print("microxcaling cloned")
else:
    print("microxcaling already exists")

# Install microxcaling WITHOUT deps to avoid torchaudio pinning issues.
# The Python package is commonly located under a subdirectory (e.g., /python).
mx_python_root = os.path.join(mx_repo_dir, "python") if os.path.isdir(os.path.join(mx_repo_dir, "python")) else mx_repo_dir
pip_install(["-e", mx_python_root, "--no-deps"])

# Sanity check: MX import (fallback to sys.path if editable install layout differs)
try:
    import mx  # noqa: F401
except ModuleNotFoundError:
    if mx_python_root not in sys.path:
        sys.path.insert(0, mx_python_root)
    import mx  # noqa: F401

print("MX import: OK")
print("All dependencies installed")

Installing dependencies...
+ /usr/bin/python3 -m pip install -q transformers==4.57.6 lm_eval ninja
torch=2.9.0+cu126, cuda=12.6, device_count=1
microxcaling already exists
+ /usr/bin/python3 -m pip install -q -e /content/microxcaling --no-deps
MX import: OK
All dependencies installed


## Step 4: Setup Exercise 1 Files

Copy the complete MX-integrated modeling_llama.py to transformers package.

In [4]:
import os
import shutil
import transformers

print("Setting up Exercise 1 files...")

transformers_path = os.path.dirname(transformers.__file__)
print(f"Transformers installed at: {transformers_path}")

modeling_llama_path = os.path.join(transformers_path, "models", "llama", "modeling_llama.py")
backup_path = modeling_llama_path + ".backup"

mx_complete_file = "/content/msr-intern-project/Exercise1/modified_files/modeling_llama.py"
if os.path.exists(mx_complete_file):
    with open(mx_complete_file, "r") as f:
        line_count = len(f.readlines())
    print(f"Found MX-integrated file ({line_count} lines)")

    if not os.path.exists(backup_path):
        shutil.copy2(modeling_llama_path, backup_path)
        print("Backup created")

    shutil.copy2(mx_complete_file, modeling_llama_path)
    print("MX-integrated modeling_llama.py deployed")

    with open(modeling_llama_path, "r") as f:
        content = f.read()
    if "apply_mx_linear" in content:
        print("MX quantization functions detected in deployed file")
    else:
        print("Warning: MX functions not found in deployed file")
else:
    print(f"Warning: MX file not found at {mx_complete_file}")
    print("  Evaluation will use standard transformers (no MX quantization)")

print("\nExercise 1 setup complete")

Setting up Exercise 1 files...
Transformers installed at: /usr/local/lib/python3.12/dist-packages/transformers
Found MX-integrated file (725 lines)
MX-integrated modeling_llama.py deployed
MX quantization functions detected in deployed file

Exercise 1 setup complete


## Step 5: Set Hugging Face Token

Set your Hugging Face token for model access.

In [None]:
import os

# Requirement: do NOT hardcode tokens in the notebook.
# Set HF_TOKEN externally (e.g., Colab Secrets or `export HF_TOKEN=...` in the runtime).

hf_token = os.environ.get("HF_TOKEN") or os.environ.get("HUGGINGFACE_HUB_TOKEN")
if not hf_token:
    raise RuntimeError(
        "HF_TOKEN is not set. Set it via Colab Secrets (recommended) or export it in the runtime before running this cell."
    )

# Mirror to HF_TOKEN in case the user provided HUGGINGFACE_HUB_TOKEN.
os.environ["HF_TOKEN"] = hf_token

print("HF token is set (value hidden)")

HF token is set (value hidden)


## Step 6: Verify MX Integration

Test that MX library and modified model load correctly.

In [6]:
import ast
import inspect
import sys

print("Verifying MX integration...")

exercise1_dir = "/content/msr-intern-project/Exercise1"
if exercise1_dir not in sys.path:
    sys.path.insert(0, exercise1_dir)

# 1) MX import
try:
    from mx import linear as mx_linear  # noqa: F401
    from mx.specs import MxSpecs  # noqa: F401
    print("MX library import: OK")
except ImportError as e:
    raise ImportError(
        f"MX library import failed: {e}. Ensure Step 3 installed microxcaling (pip install -e /content/microxcaling)."
    )

# 2) Local helper import
try:
    from mx_config_helper import create_mx_specs_exercise1, print_mx_specs_summary
    mx_specs = create_mx_specs_exercise1()
    print_mx_specs_summary(mx_specs)
except ImportError as e:
    raise ImportError(
        f"Failed to import Exercise 1 helper(s): {e}. Ensure the repo is cloned and Step 2 ran successfully."
    )

# 3) Confirm the deployed transformers Llama file is the MX-integrated one
import transformers

llama_mod = transformers.models.llama.modeling_llama
llama_file = getattr(llama_mod, "__file__", "<unknown>")
print(f"Transformers llama module: {llama_file}")

with open(llama_file, "r") as f:
    deployed_src = f.read()

if "apply_mx_linear" not in deployed_src:
    raise RuntimeError(
        "Deployed transformers modeling_llama.py does not appear to include MX integration. Re-run Step 4 (deploy)."
    )

# Extra: confirm the deployed MX specs match expected settings (especially rounding).
deployed_specs = getattr(llama_mod, "EXERCISE1_MX_SPECS", None)
if deployed_specs is not None:
    try:
        deployed_round = deployed_specs.get("round") if hasattr(deployed_specs, "get") else deployed_specs["round"]
    except Exception:
        deployed_round = "<unknown>"
    print(f"Deployed MX rounding mode: {deployed_round}")

def _trial2_semantics_present(src):
    """Returns (ok, reason). ok=True means bias and mx_specs are passed into the MX linear call."""
    try:
        tree = ast.parse(src)
    except SyntaxError as e:
        return False, f"Unable to parse deployed modeling_llama.py: {e}"

    apply_fn = None
    for node in tree.body:
        if isinstance(node, ast.FunctionDef) and node.name == "apply_mx_linear":
            apply_fn = node
            break
    if apply_fn is None:
        return False, "apply_mx_linear not found"

    # Guard against Trial-1-style manual bias add after MX op.
    for node in ast.walk(apply_fn):
        if isinstance(node, ast.BinOp) and isinstance(node.op, ast.Add):
            left_is_bias = isinstance(node.left, ast.Name) and node.left.id == "bias"
            right_is_bias = isinstance(node.right, ast.Name) and node.right.id == "bias"
            if left_is_bias or right_is_bias:
                return False, "Found a manual '+ bias' in apply_mx_linear (likely Trial 1)"

    # Accept both patterns used across iterations/versions:
    # - mx.linear(..., bias=bias, mx_specs=mx_specs)
    # - mx_linear.linear(..., bias=bias, mx_specs=mx_specs)
    # - mx_fn(..., bias=bias, mx_specs=mx_specs) where mx_fn = getattr(mx_linear, 'linear', mx_linear)
    def call_passes_bias_and_specs(call):
        bias_ok = False
        specs_ok = False

        # Positional forms: (..., bias, mx_specs)
        if len(call.args) >= 3 and isinstance(call.args[2], ast.Name) and call.args[2].id == "bias":
            bias_ok = True
        if len(call.args) >= 4 and isinstance(call.args[3], ast.Name) and call.args[3].id == "mx_specs":
            specs_ok = True

        # Keyword forms: bias=bias, mx_specs=mx_specs
        for kw in call.keywords or []:
            if kw.arg == "bias" and isinstance(kw.value, ast.Name) and kw.value.id == "bias":
                bias_ok = True
            if kw.arg in {"mx_specs", "specs"} and isinstance(kw.value, ast.Name) and kw.value.id == "mx_specs":
                specs_ok = True

        return bias_ok and specs_ok

    for node in ast.walk(apply_fn):
        if isinstance(node, ast.Call) and call_passes_bias_and_specs(node):
            return True, ""

    return False, "No call found in apply_mx_linear that passes bias and mx_specs"

trial2_ok, trial2_reason = _trial2_semantics_present(deployed_src)
print(f"Trial 2 semantics detected: {trial2_ok}")
if not trial2_ok:
    print(f"Trial 2 semantics check detail: {trial2_reason}")

# 4) Smoke import of model classes
from transformers.models.llama.modeling_llama import LlamaForCausalLM, LlamaMLP, LlamaAttention  # noqa: F401

mlp_forward = inspect.getsource(LlamaMLP.forward)
if "apply_mx_linear" not in mlp_forward:
    print("Warning: apply_mx_linear not detected in LlamaMLP.forward source.")
else:
    print("MX integration detected in LlamaMLP.forward.")

print("MX integration verification complete")

Verifying MX integration...
MX library import: OK
MX Quantization Configuration:
Weights: fp4_e2m1 (4-bit)
Activations: fp6_e2m3 (6-bit)
Scale Bits: 8 (E8M0)
Block Size: 32
CUDA Backend: Enabled
Rounding: even
Backward Quantization: Disabled
Transformers llama module: /usr/local/lib/python3.12/dist-packages/transformers/models/llama/modeling_llama.py
Deployed MX rounding mode: even
Trial 2 semantics detected: True
MX integration detected in LlamaMLP.forward.
MX integration verification complete


## Step 7: Quick Test (10% Dataset)

Run a quick test to confirm the deployed code is the Trial 2 version and that `lm_eval` executes successfully.

Estimated time: 1-2 minutes

In [7]:
# Verify Trial 2 semantics are present in the deployed transformers file, then run a 10% eval.
import ast
import os
import subprocess
import transformers

llama_file = transformers.models.llama.modeling_llama.__file__
print(f"Verifying Trial 2 deployment in: {llama_file}")

with open(llama_file, "r") as f:
    deployed_src = f.read()

def _trial2_semantics_present(src):
    """Returns (ok, reason). ok=True means bias and mx_specs are passed into the MX linear call."""
    try:
        tree = ast.parse(src)
    except SyntaxError as e:
        return False, f"Unable to parse deployed modeling_llama.py: {e}"

    apply_fn = None
    for node in tree.body:
        if isinstance(node, ast.FunctionDef) and node.name == "apply_mx_linear":
            apply_fn = node
            break
    if apply_fn is None:
        return False, "apply_mx_linear not found"

    # Guard against Trial-1-style manual bias add after MX op.
    for node in ast.walk(apply_fn):
        if isinstance(node, ast.BinOp) and isinstance(node.op, ast.Add):
            left_is_bias = isinstance(node.left, ast.Name) and node.left.id == "bias"
            right_is_bias = isinstance(node.right, ast.Name) and node.right.id == "bias"
            if left_is_bias or right_is_bias:
                return False, "Found a manual '+ bias' in apply_mx_linear (likely Trial 1)"

    def call_passes_bias_and_specs(call):
        bias_ok = False
        specs_ok = False

        if len(call.args) >= 3 and isinstance(call.args[2], ast.Name) and call.args[2].id == "bias":
            bias_ok = True
        if len(call.args) >= 4 and isinstance(call.args[3], ast.Name) and call.args[3].id == "mx_specs":
            specs_ok = True

        for kw in call.keywords or []:
            if kw.arg == "bias" and isinstance(kw.value, ast.Name) and kw.value.id == "bias":
                bias_ok = True
            if kw.arg in {"mx_specs", "specs"} and isinstance(kw.value, ast.Name) and kw.value.id == "mx_specs":
                specs_ok = True

        return bias_ok and specs_ok

    for node in ast.walk(apply_fn):
        if isinstance(node, ast.Call) and call_passes_bias_and_specs(node):
            return True, ""

    return False, "No call found in apply_mx_linear that passes bias and mx_specs"

trial2_ok, trial2_reason = _trial2_semantics_present(deployed_src)
if not trial2_ok:
    raise RuntimeError(
        "Unable to confirm Trial 2 semantics in the deployed transformers file. "
        f"Detail: {trial2_reason}. Re-run Step 4 (deploy) and restart the runtime, then try again."
    )

print("Trial 2 deployment verified")

# ----------------------
# Quick eval runner that captures output and prints the final metrics table row.
# ----------------------
def run_lm_eval(env_overrides: dict, limit: float | None):
    env = os.environ.copy()
    env.update({k: str(v) for k, v in env_overrides.items()})
    cmd = [
        "lm_eval",
        "--model", "hf",
        "--model_args", "pretrained=meta-llama/Llama-3.2-1B",
        "--tasks", "lambada_openai",
        "--device", "cuda",
        "--batch_size", "32",
    ]
    if limit is not None:
        cmd += ["--limit", str(limit)]
    print("\n+", " ".join(cmd))
    if env_overrides:
        print("  env overrides:", {k: env[k] for k in env_overrides})
    out = subprocess.check_output(cmd, env=env, text=True, stderr=subprocess.STDOUT)
    # Print the result row(s) for lambada_openai and perplexity for easy copy/paste
    lines = out.splitlines()
    for line in lines:
        if "|lambada_openai|" in line or "|              |" in line and "perplexity" in line:
            print(line)
    return out

# Default MX config (matches Exercise 1 target)
mx_env = {
    "USE_MX_QUANTIZATION": "1",
    "MX_W_ELEM_FORMAT": "fp4_e2m1",
    "MX_A_ELEM_FORMAT": "fp6_e2m3",
    "MX_BLOCK_SIZE": "32",
    "MX_SCALE_BITS": "8",
    "MX_SHARED_EXP_METHOD": "max",
    "MX_ROUND": "even",
    "MX_CUSTOM_CUDA": "1",
}

_ = run_lm_eval(mx_env, limit=0.1)

Verifying Trial 2 deployment in: /usr/local/lib/python3.12/dist-packages/transformers/models/llama/modeling_llama.py
Trial 2 deployment verified

+ lm_eval --model hf --model_args pretrained=meta-llama/Llama-3.2-1B --tasks lambada_openai --device cuda --batch_size 32 --limit 0.1
  env overrides: {'USE_MX_QUANTIZATION': '1', 'MX_W_ELEM_FORMAT': 'fp4_e2m1', 'MX_A_ELEM_FORMAT': 'fp6_e2m3', 'MX_BLOCK_SIZE': '32', 'MX_SCALE_BITS': '8', 'MX_SHARED_EXP_METHOD': 'max', 'MX_ROUND': 'even', 'MX_CUSTOM_CUDA': '1'}
|lambada_openai|      1|none  |     0|acc       |↑  |0.5000|±  |0.0220|
|              |       |none  |     0|perplexity|↓  |9.7963|±  |0.8769|


## Step 7b: MX Ablations & Quick Sweep (10% Dataset)

Run a small set of high-signal ablations and spec sweeps on a 10% subset (`--limit 0.1`) to identify configs that recover accuracy before spending time on full evaluation.

**What this runs**
- Baseline quick check (MX off)
- MX full (fp4 weights + fp6 activations)
- Weights-only (activation quantization disabled)
- Activations-only (weight quantization disabled)
- Block size sweep: 8 / 16 / 32 / 64
- Rounding sweep: even / nearest

If a configuration is not supported by the installed MX build, it will be marked as `ERROR` and the sweep continues.

In [8]:
# Quick sweep runner (10% limit)
import os
import subprocess
import time
from dataclasses import asdict, dataclass
 
@dataclass
class SweepResult:
    name: str
    ok: bool
    acc: float | None
    acc_stderr: float | None
    runtime_s: float | None
    note: str
    env: dict[str, str]
 
def parse_lm_eval_acc(text: str) -> tuple[float | None, float | None]:
    acc = None
    acc_stderr = None
    for line in text.splitlines():
        if "|lambada_openai|" in line and "|acc" in line:
            parts = [p.strip() for p in line.split("|") if p.strip()]
            try:
                acc = float(parts[6])
                acc_stderr = float(parts[8])
            except Exception:
                pass
            break
    return acc, acc_stderr
 
def run_lm_eval_quick(env_overrides: dict[str, str], limit: float = 0.1) -> tuple[str, float | None, float | None, float]:
    env = os.environ.copy()
    env.update({k: str(v) for k, v in env_overrides.items()})
    cmd = [
        "lm_eval",
        "--model", "hf",
        "--model_args", "pretrained=meta-llama/Llama-3.2-1B",
        "--tasks", "lambada_openai",
        "--device", "cuda",
        "--batch_size", "32",
        "--limit", str(limit),
    ]
    print("\n+", " ".join(cmd))
    if env_overrides:
        print("  env overrides:", env_overrides)
    t0 = time.time()
    out = subprocess.check_output(cmd, env=env, text=True, stderr=subprocess.STDOUT)
    runtime_s = time.time() - t0
    acc, acc_stderr = parse_lm_eval_acc(out)
    return out, acc, acc_stderr, runtime_s
 
def _is_mx_enabled(env: dict[str, str]) -> bool:
    return str(env.get("USE_MX_QUANTIZATION", "0")) == "1"
 
def _normalize_mx_format(v) -> str | None:
    if v is None:
        return None
    s = str(v).strip()
    if s == "":
        return None
    if s.lower() in {"none", "null"}:
        return None
    return s
 
def _is_mx_full(env: dict[str, str]) -> bool:
    if not _is_mx_enabled(env):
        return False
    w = _normalize_mx_format(env.get("MX_W_ELEM_FORMAT", None))
    a = _normalize_mx_format(env.get("MX_A_ELEM_FORMAT", None))
    return (w is not None) and (a is not None)
 
# Base configs
mx_base: dict[str, str] = {
    "USE_MX_QUANTIZATION": "1",
    "MX_W_ELEM_FORMAT": "fp4_e2m1",
    "MX_A_ELEM_FORMAT": "fp6_e2m3",
    "MX_BLOCK_SIZE": "32",
    "MX_SCALE_BITS": "8",
    "MX_SHARED_EXP_METHOD": "max",
    "MX_ROUND": "even",
    "MX_CUSTOM_CUDA": "1",
}
 
configs: list[tuple[str, dict[str, str]]] = []
configs.append(("baseline (MX off)", {"USE_MX_QUANTIZATION": "0"}))
configs.append(("MX full (fp4/fp6)", dict(mx_base)))
configs.append(("MX weights-only (a=None)", {**mx_base, "MX_A_ELEM_FORMAT": "None"}))
configs.append(("MX activations-only (w=None)", {**mx_base, "MX_W_ELEM_FORMAT": "None"}))
 
# Block-size sweep
for bs in [8, 16, 32, 64]:
    configs.append((f"MX full bs={bs}", {**mx_base, "MX_BLOCK_SIZE": str(bs)}))
 
# Rounding sweep
for rnd in ["even", "nearest"]:
    configs.append((f"MX full round={rnd}", {**mx_base, "MX_ROUND": rnd}))
 
# Combo configs (best-of-both-worlds candidates)
configs.append(("MX full bs=8 round=nearest", {**mx_base, "MX_BLOCK_SIZE": "8", "MX_ROUND": "nearest"}))
configs.append(("MX full bs=8 round=nearest (custom_cuda=0)", {**mx_base, "MX_BLOCK_SIZE": "8", "MX_ROUND": "nearest", "MX_CUSTOM_CUDA": "0"}))
configs.append(("MX full bs=16 round=nearest", {**mx_base, "MX_BLOCK_SIZE": "16", "MX_ROUND": "nearest"}))
 
# Optional: small shared-exp sweep; unsupported values will be marked ERROR.
for method in ["max", "mean"]:
    if method == mx_base["MX_SHARED_EXP_METHOD"]:
        continue
    configs.append((f"MX full shared_exp={method}", {**mx_base, "MX_SHARED_EXP_METHOD": method}))
 
results: list[SweepResult] = []
for name, env in configs:
    try:
        _, acc, acc_stderr, runtime_s = run_lm_eval_quick(env, limit=0.1)
        ok = acc is not None
        note = "OK" if ok else "PARSE_FAIL"
        results.append(SweepResult(name=name, ok=ok, acc=acc, acc_stderr=acc_stderr, runtime_s=runtime_s, note=note, env=env))
    except subprocess.CalledProcessError as e:
        results.append(SweepResult(name=name, ok=False, acc=None, acc_stderr=None, runtime_s=None, note=f"ERROR: {e.returncode}", env=env))
 
# Print compact table
print("\n" + "=" * 92)
print("MX QUICK SWEEP (limit=0.1)")
print("=" * 92)
print(f"{'config':48s}  {'acc':>8s}  {'stderr':>8s}  {'runtime(s)':>10s}  status")
print("-" * 92)
for r in results:
    acc_s = "<na>" if r.acc is None else f"{r.acc:.4f}"
    se_s = "<na>" if r.acc_stderr is None else f"{r.acc_stderr:.4f}"
    rt_s = "<na>" if r.runtime_s is None else f"{r.runtime_s:0.1f}"
    status = "OK" if r.ok else r.note
    print(f"{r.name[:48]:48s}  {acc_s:>8s}  {se_s:>8s}  {rt_s:>10s}  {status}")
print("=" * 92)
 
def _best_by_acc(candidates: list[SweepResult]) -> SweepResult | None:
    best: SweepResult | None = None
    for r in candidates:
        if not r.ok or r.acc is None:
            continue
        if best is None or r.acc > best.acc:
            best = r
    return best
 
best_any = _best_by_acc(results)
best_mx_any = _best_by_acc([r for r in results if _is_mx_enabled(r.env)])
best_mx_full = _best_by_acc([r for r in results if _is_mx_full(r.env)])
 
if best_any is not None:
    print("\nBest overall quick-sweep config:")
    print("- name:", best_any.name)
    print("- acc:", f"{best_any.acc:.4f}" + ("" if best_any.acc_stderr is None else f" ± {best_any.acc_stderr:.4f}"))
    print("- env:", best_any.env)
else:
    print("\nNo successful quick-sweep configs found (all failed or unparsable).")
 
if best_mx_any is not None:
    print("\nBest MX-enabled quick-sweep config (MX-any):")
    print("- name:", best_mx_any.name)
    print("- acc:", f"{best_mx_any.acc:.4f}" + ("" if best_mx_any.acc_stderr is None else f" ± {best_mx_any.acc_stderr:.4f}"))
    print("- env:", best_mx_any.env)
else:
    print("\nNo successful MX-enabled configs found (all MX runs failed or unparsable).")
 
if best_mx_full is not None:
    print("\nBest MX-full quick-sweep config (weights+activations):")
    print("- name:", best_mx_full.name)
    print("- acc:", f"{best_mx_full.acc:.4f}" + ("" if best_mx_full.acc_stderr is None else f" ± {best_mx_full.acc_stderr:.4f}"))
    print("- env:", best_mx_full.env)
else:
    print("\nNo successful MX-full configs found (all MX-full runs failed or unparsable).")
 
# Persist to globals for later cells (Save Results)
quick_sweep_results = [asdict(r) for r in results]
quick_best_config_any = None if best_any is None else asdict(best_any)
quick_best_config = None if best_mx_any is None else asdict(best_mx_any)
quick_best_config_mx_full = None if best_mx_full is None else asdict(best_mx_full)


+ lm_eval --model hf --model_args pretrained=meta-llama/Llama-3.2-1B --tasks lambada_openai --device cuda --batch_size 32 --limit 0.1
  env overrides: {'USE_MX_QUANTIZATION': '0'}



+ lm_eval --model hf --model_args pretrained=meta-llama/Llama-3.2-1B --tasks lambada_openai --device cuda --batch_size 32 --limit 0.1
  env overrides: {'USE_MX_QUANTIZATION': '1', 'MX_W_ELEM_FORMAT': 'fp4_e2m1', 'MX_A_ELEM_FORMAT': 'fp6_e2m3', 'MX_BLOCK_SIZE': '32', 'MX_SCALE_BITS': '8', 'MX_SHARED_EXP_METHOD': 'max', 'MX_ROUND': 'even', 'MX_CUSTOM_CUDA': '1'}

+ lm_eval --model hf --model_args pretrained=meta-llama/Llama-3.2-1B --tasks lambada_openai --device cuda --batch_size 32 --limit 0.1
  env overrides: {'USE_MX_QUANTIZATION': '1', 'MX_W_ELEM_FORMAT': 'fp4_e2m1', 'MX_A_ELEM_FORMAT': 'None', 'MX_BLOCK_SIZE': '32', 'MX_SCALE_BITS': '8', 'MX_SHARED_EXP_METHOD': 'max', 'MX_ROUND': 'even', 'MX_CUSTOM_CUDA': '1'}

+ lm_eval --model hf --model_args pretrained=meta-llama/Llama-3.2-1B --tasks lambada_openai --device cuda --batch_size 32 --limit 0.1
  env overrides: {'USE_MX_QUANTIZATION': '1', 'MX_W_ELEM_FORMAT': 'None', 'MX_A_ELEM_FORMAT': 'fp6_e2m3', 'MX_BLOCK_SIZE': '32', 'MX_SCALE_BI

## Step 8: Full Evaluation (Exercise 1)

Run the complete evaluation with the MX-quantized model.

Estimated time: 10-15 minutes
Baseline: 62.10% accuracy
Target: > 60% accuracy (< 2% degradation)

In [None]:
# Full evaluation with MX quantization (with diagnostics and simple parsing).
import os
import subprocess


def parse_lm_eval_metrics(text: str) -> dict:
    # Extract the main acc row (and perplexity if present) from the markdown table
    acc = None
    acc_stderr = None
    ppl = None
    ppl_stderr = None
    for line in text.splitlines():
        if "|lambada_openai|" in line and "|acc" in line:
            # ... |acc|↑|0.5102|±|0.0070|
            parts = [p.strip() for p in line.split("|") if p.strip()]
            # expect: lambada_openai, 1, none, 0, acc, ↑, value, ±, stderr
            try:
                acc = float(parts[6])
                acc_stderr = float(parts[8])
            except Exception:
                pass
        if "perplexity" in line and "|              |" in line:
            parts = [p.strip() for p in line.split("|") if p.strip()]
            try:
                ppl = float(parts[5])
                ppl_stderr = float(parts[7])
            except Exception:
                pass
    return {"acc": acc, "acc_stderr": acc_stderr, "perplexity": ppl, "perplexity_stderr": ppl_stderr}


def run_lm_eval(env_overrides: dict, limit: float | None):
    env = os.environ.copy()
    env.update({k: str(v) for k, v in env_overrides.items()})
    cmd = [
        "lm_eval",
        "--model",
        "hf",
        "--model_args",
        "pretrained=meta-llama/Llama-3.2-1B",
        "--tasks",
        "lambada_openai",
        "--device",
        "cuda",
        "--batch_size",
        "32",
    ]
    if limit is not None:
        cmd += ["--limit", str(limit)]
    print("\n+", " ".join(cmd))
    if env_overrides:
        print("  env overrides:", {k: env[k] for k in env_overrides})
    out = subprocess.check_output(cmd, env=env, text=True, stderr=subprocess.STDOUT)
    metrics = parse_lm_eval_metrics(out)
    print("Parsed metrics:", metrics)
    return out, metrics


# 1) Runtime baseline (MX disabled)
baseline_out, baseline_metrics = run_lm_eval({"USE_MX_QUANTIZATION": "0"}, limit=None)

# 2) MX full: prefer the best MX-full config from Step 7b if available
best_mx_full = globals().get("quick_best_config_mx_full", None)
if isinstance(best_mx_full, dict) and isinstance(best_mx_full.get("env"), dict) and best_mx_full.get("env"):
    mx_full_name = str(best_mx_full.get("name", "Best MX-full from Step 7b"))
    mx_full_env = {k: str(v) for k, v in best_mx_full["env"].items()}
else:
    mx_full_name = "MX default (no Step 7b winner)"
    mx_full_env = {
        "USE_MX_QUANTIZATION": "1",
        "MX_W_ELEM_FORMAT": "fp4_e2m1",
        "MX_A_ELEM_FORMAT": "fp6_e2m3",
        "MX_BLOCK_SIZE": "32",
        "MX_SCALE_BITS": "8",
        "MX_SHARED_EXP_METHOD": "max",
        "MX_ROUND": "even",
        "MX_CUSTOM_CUDA": "1",
    }

print(f"\nMX full-eval config: {mx_full_name}")
print("MX full-eval env:", mx_full_env)

mx_out, mx_metrics = run_lm_eval(mx_full_env, limit=None)

# 3) MX full but custom CUDA disabled (diagnostic)
mx_nocuda_out, mx_nocuda_metrics = run_lm_eval({**mx_full_env, "MX_CUSTOM_CUDA": "0"}, limit=None)

# 4) MX weights-only (diagnostic)
mx_wonly_out, mx_wonly_metrics = run_lm_eval({**mx_full_env, "MX_A_ELEM_FORMAT": "None"}, limit=None)

# 5) MX activations-only (diagnostic)
mx_aonly_out, mx_aonly_metrics = run_lm_eval({**mx_full_env, "MX_W_ELEM_FORMAT": "None"}, limit=None)

# Persist for later cells
full_eval_chosen_name = mx_full_name
full_eval_chosen_env = mx_full_env

baseline_acc = None if baseline_metrics["acc"] is None else baseline_metrics["acc"] * 100
exercise1_acc = None if mx_metrics["acc"] is None else mx_metrics["acc"] * 100
exercise1_acc_nocuda = None if mx_nocuda_metrics["acc"] is None else mx_nocuda_metrics["acc"] * 100
exercise1_acc_wonly = None if mx_wonly_metrics["acc"] is None else mx_wonly_metrics["acc"] * 100
exercise1_acc_aonly = None if mx_aonly_metrics["acc"] is None else mx_aonly_metrics["acc"] * 100

print("\nSummary (acc %):")
print("  baseline (MX off):", baseline_acc)
print("  MX full:", exercise1_acc)
print("  MX full (custom_cuda=0):", exercise1_acc_nocuda)
print("  MX weights-only:", exercise1_acc_wonly)
print("  MX activations-only:", exercise1_acc_aonly)


+ lm_eval --model hf --model_args pretrained=meta-llama/Llama-3.2-1B --tasks lambada_openai --device cuda --batch_size 32
  env overrides: {'USE_MX_QUANTIZATION': '0'}
Parsed metrics: {'acc': 0.621, 'acc_stderr': 0.0068, 'perplexity': None, 'perplexity_stderr': None}

+ lm_eval --model hf --model_args pretrained=meta-llama/Llama-3.2-1B --tasks lambada_openai --device cuda --batch_size 32
  env overrides: {'USE_MX_QUANTIZATION': '1', 'MX_W_ELEM_FORMAT': 'fp4_e2m1', 'MX_A_ELEM_FORMAT': 'fp6_e2m3', 'MX_BLOCK_SIZE': '32', 'MX_SCALE_BITS': '8', 'MX_SHARED_EXP_METHOD': 'max', 'MX_ROUND': 'even', 'MX_CUSTOM_CUDA': '1'}
Parsed metrics: {'acc': 0.5067, 'acc_stderr': 0.007, 'perplexity': None, 'perplexity_stderr': None}

+ lm_eval --model hf --model_args pretrained=meta-llama/Llama-3.2-1B --tasks lambada_openai --device cuda --batch_size 32
  env overrides: {'USE_MX_QUANTIZATION': '1', 'MX_W_ELEM_FORMAT': 'fp4_e2m1', 'MX_A_ELEM_FORMAT': 'fp6_e2m3', 'MX_BLOCK_SIZE': '32', 'MX_SCALE_BITS': '8', '

## Step 9: Save Results

In [None]:
# Save Exercise 1 results (auto-filled if Step 8 and/or Step 7b ran)
import datetime
import os

timestamp = datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S")

baseline_acc_doc = 62.10
baseline_acc_runtime = globals().get("baseline_acc", None)
baseline_acc = baseline_acc_runtime if baseline_acc_runtime is not None else baseline_acc_doc

exercise1_acc = globals().get("exercise1_acc", None)
exercise1_acc_nocuda = globals().get("exercise1_acc_nocuda", None)
exercise1_acc_wonly = globals().get("exercise1_acc_wonly", None)
exercise1_acc_aonly = globals().get("exercise1_acc_aonly", None)

quick_sweep_results = globals().get("quick_sweep_results", None)
quick_best_config = globals().get("quick_best_config", None)  # best MX-enabled (MX-any)
quick_best_config_any = globals().get("quick_best_config_any", None)
quick_best_config_mx_full = globals().get("quick_best_config_mx_full", None)

full_eval_chosen_name = globals().get("full_eval_chosen_name", None)
full_eval_chosen_env = globals().get("full_eval_chosen_env", None)


def _fmt_pct(v):
    return "<missing>" if v is None else f"{v:.2f}%"


def _fmt_acc(v):
    return "<missing>" if v is None else f"{float(v):.4f}"


def _env_lines(env: dict | None) -> list[str]:
    if not isinstance(env, dict) or not env:
        return ["- <missing>"]
    keys = [
        "USE_MX_QUANTIZATION",
        "MX_W_ELEM_FORMAT",
        "MX_A_ELEM_FORMAT",
        "MX_BLOCK_SIZE",
        "MX_SCALE_BITS",
        "MX_SHARED_EXP_METHOD",
        "MX_ROUND",
        "MX_CUSTOM_CUDA",
    ]
    lines: list[str] = []
    for k in keys:
        if k in env:
            lines.append(f"- {k}={env[k]}")
    # Include any other keys (stable order)
    for k in sorted(set(env.keys()) - set(keys)):
        lines.append(f"- {k}={env[k]}")
    return lines


repo_root = "/content/msr-intern-project"
base_dir = repo_root if os.path.isdir(repo_root) else os.getcwd()

# Canonical repo location for committed artifacts
results_dir = os.path.join(base_dir, "results")
os.makedirs(results_dir, exist_ok=True)

results_path = os.path.join(results_dir, "exercise1_results.txt")
write_mode = "a" if os.path.exists(results_path) else "w"

# Build an append-friendly report chunk
lines: list[str] = []
lines.append("\n" + "=" * 80)
lines.append(f"Notebook run: {timestamp}")
lines.append("=" * 80)
lines.append("Model: meta-llama/Llama-3.2-1B")
lines.append("Task: lambada_openai")
lines.append("Device: CUDA")
lines.append("Batch Size: 32")
lines.append("")

lines.append("MX full-eval config (from Step 8):")
lines.append(f"- name: {full_eval_chosen_name if full_eval_chosen_name else '<missing>'}")
lines.extend(_env_lines(full_eval_chosen_env))
lines.append("")

lines.append("Full-eval summary (if Step 8 ran):")
lines.append(f"- Baseline (runtime, MX off): {_fmt_pct(baseline_acc_runtime)}")
lines.append(f"- Baseline (doc):            {baseline_acc_doc:.2f}%")
lines.append(f"- Using baseline:            {_fmt_pct(baseline_acc)}")
lines.append(f"- MX full:                   {_fmt_pct(exercise1_acc)}")
lines.append(f"- MX full (custom_cuda=0):   {_fmt_pct(exercise1_acc_nocuda)}")
lines.append(f"- MX weights-only (a=None):  {_fmt_pct(exercise1_acc_wonly)}")
lines.append(f"- MX activations-only (w=None): {_fmt_pct(exercise1_acc_aonly)}")
lines.append("")

if quick_sweep_results is not None:
    lines.append("Quick sweep summary (Step 7b, limit=0.1):")
    lines.append(f"{'config':48s}  {'acc':>8s}  {'stderr':>8s}  {'runtime(s)':>10s}  status")
    lines.append("-" * 80)
    for r in quick_sweep_results:
        name = str(r.get("name", ""))
        acc = r.get("acc", None)
        acc_stderr = r.get("acc_stderr", None)
        runtime_s = r.get("runtime_s", None)
        ok = bool(r.get("ok", False))
        note = str(r.get("note", ""))
        acc_s = "<na>" if acc is None else f"{float(acc):.4f}"
        se_s = "<na>" if acc_stderr is None else f"{float(acc_stderr):.4f}"
        rt_s = "<na>" if runtime_s is None else f"{float(runtime_s):0.1f}"
        status = "OK" if ok else note
        lines.append(f"{name[:48]:48s}  {acc_s:>8s}  {se_s:>8s}  {rt_s:>10s}  {status}")
    lines.append("")

    if quick_best_config_any is not None:
        lines.append("Best quick-sweep config (overall):")
        lines.append(f"- name: {quick_best_config_any.get('name')}")
        lines.append(f"- acc:  {_fmt_acc(quick_best_config_any.get('acc'))}")
        lines.append(f"- stderr: {_fmt_acc(quick_best_config_any.get('acc_stderr'))}")
        lines.append(f"- env: {quick_best_config_any.get('env')}")
        lines.append("")

    if quick_best_config is not None:
        lines.append("Best quick-sweep config (MX-enabled / MX-any):")
        lines.append(f"- name: {quick_best_config.get('name')}")
        lines.append(f"- acc:  {_fmt_acc(quick_best_config.get('acc'))}")
        lines.append(f"- stderr: {_fmt_acc(quick_best_config.get('acc_stderr'))}")
        lines.append(f"- env: {quick_best_config.get('env')}")
        lines.append("")

    if quick_best_config_mx_full is not None:
        lines.append("Best quick-sweep config (MX-full / weights+activations):")
        lines.append(f"- name: {quick_best_config_mx_full.get('name')}")
        lines.append(f"- acc:  {_fmt_acc(quick_best_config_mx_full.get('acc'))}")
        lines.append(f"- stderr: {_fmt_acc(quick_best_config_mx_full.get('acc_stderr'))}")
        lines.append(f"- env: {quick_best_config_mx_full.get('env')}")
        lines.append("")

lines.append("Notes:")
lines.append("- Quick sweep uses --limit 0.1 and is for directional testing only.")
lines.append("- Full evaluation (no --limit) is the source of truth for final metrics.")
lines.append("")

report_chunk = "\n".join(lines)
with open(results_path, write_mode) as f:
    f.write(report_chunk)

print(f"Results {'appended' if write_mode == 'a' else 'saved'} to: {results_path}")

Results appended to: /content/msr-intern-project/results/exercise1_results.txt


## Step 10: Analysis & Comparison

Compare Exercise 1 results with baseline.

In [11]:
# Comparison analysis (auto-filled from Step 8 if available)
print("=" * 70)
print("EXERCISE 1 RESULTS ANALYSIS")
print("=" * 70)
 
def _fmt(v):
    return "<missing>" if v is None else f"{v:.2f}%"
 
# Prefer the runtime-measured baseline if present; fall back to the doc baseline.
baseline_acc_doc = 62.10
baseline_acc_runtime = globals().get("baseline_acc", None)
baseline_acc = baseline_acc_runtime if baseline_acc_runtime is not None else baseline_acc_doc
 
exercise1_acc = globals().get("exercise1_acc", None)
exercise1_acc_nocuda = globals().get("exercise1_acc_nocuda", None)
exercise1_acc_wonly = globals().get("exercise1_acc_wonly", None)
exercise1_acc_aonly = globals().get("exercise1_acc_aonly", None)
 
print(f"\nBaseline Accuracy (runtime): {_fmt(baseline_acc_runtime)}")
print(f"Baseline Accuracy (doc):     {baseline_acc_doc:.2f}%")
print(f"Using baseline:             {_fmt(baseline_acc)}")
 
print(f"\nMX full (fp4 weights + fp6 activations): {_fmt(exercise1_acc)}")
print(f"MX full (custom_cuda=0):              {_fmt(exercise1_acc_nocuda)}")
print(f"MX weights-only (fp4, a=None):        {_fmt(exercise1_acc_wonly)}")
print(f"MX activations-only (fp6, w=None):    {_fmt(exercise1_acc_aonly)}")
 
if exercise1_acc is not None:
    accuracy_change = exercise1_acc - baseline_acc
    accuracy_change_pct = (accuracy_change / baseline_acc) * 100
    print(f"\nChange vs baseline (MX full): {accuracy_change:+.2f}% ({accuracy_change_pct:+.2f}%)")
    if accuracy_change >= -2.0:
        print("Result: within target (< 2% degradation)")
    else:
        print("Result: exceeds 2% degradation threshold")
else:
    print("\nRun Step 8 first to populate metrics.")
 
print("=" * 70)

EXERCISE 1 RESULTS ANALYSIS

Baseline Accuracy (runtime): 62.10%
Baseline Accuracy (doc):     62.10%
Using baseline:             62.10%

MX full (fp4 weights + fp6 activations): 50.67%
MX full (custom_cuda=0):              50.84%
MX weights-only (fp4, a=None):        51.76%
MX activations-only (fp6, w=None):    61.61%

Change vs baseline (MX full): -11.43% (-18.41%)
Result: exceeds 2% degradation threshold


## Exercise 1 Wrap-up

**Next steps**
1. Record results in `results/exercise1_results.txt`
2. Compare accuracy vs baseline (62.10%)
3. Commit and push non-secret outputs (do not commit HF tokens)

**Artifacts**
- `Exercise1/modified_files/modeling_llama.py`
- `Exercise1/mx_config_helper.py`
- `results/exercise1_results.txt`