# FailureLLMUnlearning - Complete Setup and Execution Guide

This notebook provides a step-by-step guide to:
1. Clone the repository
2. Set up the conda environment
3. Install all dependencies
4. Load data from HuggingFace
5. Run unlearning methods
6. Evaluate the results

**Paper**: Catastrophic Failure of LLM Unlearning via Quantization (ICLR 2025)
**Repository**: https://github.com/zzwjames/FailureLLMUnlearning.git


## Step 1: Clone the Repository

First, we'll clone the repository if it doesn't already exist.


In [None]:
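# Optional cleanup: remove artifacts from a previous run (checkpoints, temp files, results)
# so the steps below start from a clean state. The repository itself is left in place.
import pathlib
import shutil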

root = pathlib.Path("/workspace/CS534L_Project/FailureLLMUnlearning")
dirs_to_remove = [
    root / "ckpt",
    root / "temp",
    root / "results",
]

for d in dirs_to_remove:
    if d.exists():
        shutil.rmtree(d)
        print(f"Removed {d}")
    else:
        print(f"Not found: {d}")
        

Removed /workspace/CS534L_Project/FailureLLMUnlearning/ckpt
Removed /workspace/CS534L_Project/FailureLLMUnlearning/temp
Not found: /workspace/CS534L_Project/FailureLLMUnlearning/results


In [1]:
import os
import subprocess
from pathlib import Path

# Set the project directory
project_dir = Path("/workspace/CS534L_Project")
repo_dir = project_dir / "FailureLLMUnlearning"
repo_url = "https://github.com/zzwjames/FailureLLMUnlearning.git"

# Check if repository already exists
if repo_dir.exists():
    print(f"Repository already exists at: {repo_dir}")
    print("Skipping clone step. If you want to re-clone, delete the directory first.")
else:
    print(f"Cloning repository from {repo_url}...")
    os.chdir(project_dir)
    result = subprocess.run(
        ["git", "clone", repo_url],
        capture_output=True,
        text=True
    )
    if result.returncode == 0:
        print("✓ Repository cloned successfully!")
    else:
        print(f"✗ Error cloning repository: {result.stderr}")
        raise RuntimeError("Failed to clone repository")

# Change to repository directory
os.chdir(repo_dir)
print(f"\nCurrent working directory: {os.getcwd()}")


Repository already exists at: /workspace/CS534L_Project/FailureLLMUnlearning
Skipping clone step. If you want to re-clone, delete the directory first.

Current working directory: /workspace/CS534L_Project/FailureLLMUnlearning


## Step 2: Set Up Conda Environment

According to the README, we need to create a conda environment using the `environment.yml` file. This will create an environment named `py310` with Python 3.10 and all required dependencies.
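Because the environment targets Python 3.10, it can help to confirm which interpreter the notebook kernel is actually running before installing anything. The following is a minimal sketch; the helper name `matches_expected_python` is hypothetical and not part of the repository.

```python
import sys


def matches_expected_python(version_info=None, expected=(3, 10)):
    """Return True when the interpreter's major.minor version equals `expected`."""
    if version_info is None:
        version_info = sys.version_info
    return tuple(version_info[:2]) == tuple(expected)


if __name__ == "__main__":
    current = f"{sys.version_info.major}.{sys.version_info.minor}"
    status = "matches" if matches_expected_python() else "differs from"
    print(f"Kernel Python {current} {status} the 3.10 expected by environment.yml")
```

A mismatch does not necessarily break anything, but pinned package versions may not have wheels for a newer interpreter.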


In [2]:
# Check if conda is available.
# subprocess.run raises FileNotFoundError when the executable is not on PATH
# (the original run of this cell failed with "[Errno 2] No such file or directory: 'conda'"),
# so look the executable up with shutil.which first.
import shutil

conda_path = shutil.which("conda")
if conda_path is None:
    print("⚠ Warning: Conda is not available. Please install conda or use pip instead.")
    print("You can still proceed with pip installation in the next step.")
else:
    result = subprocess.run([conda_path, "--version"], capture_output=True, text=True)
    print(f"✓ Conda found: {result.stdout.strip()}")

# Check if environment.yml exists
env_file = repo_dir / "environment.yml"
if env_file.exists():
    print(f"✓ Found environment.yml at: {env_file}")
    print("\nTo create the conda environment, run the following commands in your terminal:")
    print(f"  conda env create -f {env_file}")
    print("  conda activate py310")
    print("\nOr if the environment already exists, just activate it:")
    print("  conda activate py310")
else:
    print("✗ environment.yml not found!")


## Step 3: Install Dependencies

We'll install dependencies using pip. The repository provides both `environment.yml` (for conda) and `requirements.txt` (for pip). We'll use pip for compatibility.
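After installation it can be useful to check whether the packages in the active environment still match the pins in `requirements.txt`, since later ad hoc `pip install` calls may upgrade them. The sketch below is one way to do that with the standard library; the helper names are hypothetical, and the prefix comparison is an assumption that is looser than pip's own version matching.

```python
from importlib import metadata


def parse_pins(requirements_text):
    """Map package name -> pinned version for every 'name==version' line."""
    pins = {}
    for raw in requirements_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if "==" in line:
            name, version = line.split("==", 1)
            pins[name.strip().lower()] = version.strip()
    return pins


def installed_version(name):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None


def find_mismatches(pins):
    """List (name, pinned, installed) for packages that are missing or off-pin.

    Assumption: a pin such as '2.2' is treated as a prefix, so '2.2.0' counts as a match.
    """
    mismatches = []
    for name, pinned in pins.items():
        found = installed_version(name)
        if found is None or not (found == pinned or found.startswith(pinned + ".")):
            mismatches.append((name, pinned, found))
    return mismatches


if __name__ == "__main__":
    sample = "torch==2.2\ntransformers==4.40\ntqdm\ntrl>=0.8.1\n"
    for name, pinned, found in find_mismatches(parse_pins(sample)):
        print(f"{name}: pinned {pinned}, installed {found}")
```

Lines without `==` (such as `tqdm` or `trl>=0.8.1`) are ignored, so only exact pins are compared.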


In [2]:
# Read requirements.txt to see what will be installed
requirements_file = repo_dir / "requirements.txt"
if requirements_file.exists():
    print("Requirements from requirements.txt:")
    print("=" * 60)
    with open(requirements_file, 'r') as f:
        print(f.read())
    print("=" * 60)
    
    # Note: We'll install these in the next cell
    print("\n⚠ Note: Installing packages can take several minutes.")
    print("The main dependencies include:")
    print("  - torch==2.2")
    print("  - transformers==4.40")
    print("  - accelerate==0.29")
    print("  - datasets==2.19")
    print("  - bitsandbytes==0.42.0 (optional, requires CUDA/Linux)")
    print("  - and many more...")
else:
    print("✗ requirements.txt not found!")


Requirements from requirements.txt:
accelerate==0.29
bitsandbytes==0.42.0
datasets==2.19
einops==0.7
huggingface-hub>=0.26.0,<1.0
ipykernel==6.29
ipython==8.24
ipywidgets==8.1
matplotlib==3.9
matplotlib-inline==0.1
numpy==1.26
openai==1.23
pandas==2.2
peft==0.13.2
protobuf==5.26
python-dotenv==1.0
rouge==1.0.1
rouge-score==0.1
scienceplots==2.1
scikit-learn==1.4
scipy==1.13
seaborn==0.13
sympy==1.12
tokenizers==0.19
torch==2.2
tqdm
transformers==4.40
trl>=0.8.1


⚠ Note: Installing packages can take several minutes.
The main dependencies include:
  - torch==2.2
  - transformers==4.40
  - accelerate==0.29
  - datasets==2.19
  - bitsandbytes==0.42.0 (optional, requires CUDA/Linux)
  - and many more...


In [3]:
# Install dependencies
# This may take 10-20 minutes depending on your internet connection

import platform
import sys

print("Installing dependencies from requirements.txt...")
print("This may take several minutes. Please be patient...\n")

# bitsandbytes requires CUDA (Linux + NVIDIA GPU), so it is installed separately below
is_macos = platform.system() == "Darwin"

if is_macos:
    print("⚠ Detected macOS system.")
    print("Note: bitsandbytes requires CUDA (Linux/NVIDIA GPU) and won't work on macOS.")
    print("The code will work in full-precision mode without bitsandbytes.\n")

# Create a temporary requirements file without bitsandbytes for the initial installation
temp_requirements = repo_dir / "requirements_temp.txt"
with open(requirements_file, 'r') as f:
    lines = f.readlines()

# Filter out the bitsandbytes line
filtered_lines = [line for line in lines if not line.strip().startswith('bitsandbytes')]

with open(temp_requirements, 'w') as f:
    f.writelines(filtered_lines)

# Call pip through the kernel's interpreter so packages land in the active environment
pip_install = [sys.executable, "-m", "pip", "install"]

print("Step 1: Installing core dependencies (excluding bitsandbytes)...")
result = subprocess.run(
    pip_install + ["-r", str(temp_requirements)],
    cwd=str(repo_dir),
    capture_output=True,
    text=True
)
core_ok = result.returncode == 0

if core_ok:
    print("✓ Core dependencies installed successfully!")
else:
    print("✗ Error installing core dependencies:")
    print(result.stderr)
    print("\nTrying to continue anyway...")

# Try to install bitsandbytes separately (expected to fail on macOS, which is OK)
print("\nStep 2: Attempting to install bitsandbytes (optional for quantization)...")
bitsandbytes_result = subprocess.run(
    pip_install + ["bitsandbytes==0.42.0"],
    cwd=str(repo_dir),
    capture_output=True,
    text=True
)
bnb_ok = bitsandbytes_result.returncode == 0

if bnb_ok:
    print("✓ bitsandbytes installed successfully!")
    print("  You can use 4-bit and 8-bit quantization.")
elif is_macos:
    print("⚠ bitsandbytes could not be installed (not available on macOS).")
    print("  This is expected. You can still run the code in full-precision mode.")
    print("  Set quantize_4bit=0 and quantize_8bit=0 in evaluation.")
else:
    print("⚠ bitsandbytes installation failed:")
    print(bitsandbytes_result.stderr)
    print("  You can still run the code in full-precision mode.")

# Clean up temp file
if temp_requirements.exists():
    temp_requirements.unlink()

# Report the actual outcome of each step instead of assuming success
print("\n" + "="*60)
print("Installation summary:")
print(f"  - Core dependencies: {'✓' if core_ok else '✗ (see errors above)'}")
if bnb_ok:
    print("  - bitsandbytes: ✓ (quantization available)")
else:
    print("  - bitsandbytes: ✗ (quantization not available, use full-precision)")
print("="*60)


Installing dependencies from requirements.txt...
This may take several minutes. Please be patient...

Step 1: Installing core dependencies (excluding bitsandbytes)...
✓ Core dependencies installed successfully!

Step 2: Attempting to install bitsandbytes (optional for quantization)...
✓ bitsandbytes installed successfully!
  You can use 4-bit and 8-bit quantization.

Installation summary:
  - Core dependencies: ✓
  - bitsandbytes: ✓ (quantization available)


In [4]:
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print("Device in use:", device)
if device.type == "cuda":
    print("GPU name:", torch.cuda.get_device_name(0))

    print("GPU memory (MB):", torch.cuda.get_device_properties(0).total_memory // (1024**2))

Device in use: cuda
GPU name: NVIDIA A100-SXM4-80GB
GPU memory (MB): 81155
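
To put the reported GPU memory in context, the sketch below estimates how much memory the weights alone of a 7B-parameter model occupy at different precisions. The parameter count is a nominal assumption (7e9), and the figures exclude gradients, optimizer states, and activations, all of which add substantially to the footprint during unlearning.

```python
def weight_memory_gib(num_params, bytes_per_param):
    """Memory needed to hold the weights only, in GiB."""
    return num_params * bytes_per_param / 1024**3


NOMINAL_PARAMS = 7e9  # assumption: nominal size of a "7B" model

for label, nbytes in [("fp32", 4), ("fp16/bf16", 2), ("int8", 1), ("4-bit", 0.5)]:
    print(f"{label:>10}: {weight_memory_gib(NOMINAL_PARAMS, nbytes):6.1f} GiB")
```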


## Step 4: Load Data from HuggingFace

According to the README, we need to load data from HuggingFace datasets. This will download:
- MUSE-News dataset and target model
- MUSE-Books dataset and target model

The data will be saved in the `data/` directory.
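
Once the download has finished, a quick structural check can confirm that nothing is missing. The sketch below assumes that each subset named in this guide lives in its own subdirectory under `data/<corpus>/`; only `raw/` is confirmed by the paths used later, so treat the rest of the layout as an assumption. The helper name `missing_subsets` is hypothetical.

```python
from pathlib import Path
import tempfile

# Assumed layout: data/<corpus>/<subset>/ for each subset listed in this guide
EXPECTED_SUBSETS = {
    "news": ["knowmem", "verbmem", "privleak", "raw", "scal", "sust"],
    "books": ["knowmem", "verbmem", "privleak", "raw"],
}


def missing_subsets(data_dir, expected=EXPECTED_SUBSETS):
    """Return 'corpus/subset' entries that are not present as directories."""
    data_dir = Path(data_dir)
    missing = []
    for corpus, subsets in expected.items():
        for subset in subsets:
            if not (data_dir / corpus / subset).is_dir():
                missing.append(f"{corpus}/{subset}")
    return missing


if __name__ == "__main__":
    # Demonstration on a throwaway directory containing only news/raw
    with tempfile.TemporaryDirectory() as tmp:
        (Path(tmp) / "news" / "raw").mkdir(parents=True)
        print(missing_subsets(tmp))
```

In the notebook, calling `missing_subsets(repo_dir / "data")` would list any gaps before training starts.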


## Step 4a: Authenticate with HuggingFace

Some HuggingFace datasets require authentication. Let's check if you're logged in and authenticate if needed.


In [5]:
# Check HuggingFace authentication and login if needed
import os
import subprocess
import sys

print("Checking HuggingFace authentication...\n")

# Make sure huggingface_hub is importable
try:
    from huggingface_hub import whoami, login
except ImportError:
    print("⚠ huggingface_hub not found. Installing...")
    subprocess.run([sys.executable, "-m", "pip", "install", "huggingface_hub"], capture_output=True)
    from huggingface_hub import whoami, login

# Try to get current user info
try:
    user_info = whoami()
    print(f"✓ Already authenticated as: {user_info.get('name', 'Unknown')}")
    print("  You can proceed to load data.\n")
    authenticated = True
except Exception:
    authenticated = False
    print("⚠ Not authenticated with HuggingFace.")
    print("\nTo authenticate, you have two options:\n")
    print("Option 1: Login via CLI (Recommended)")
    print("  Run this command in your terminal:")
    print("    huggingface-cli login")
    print("  Then paste your access token when prompted.\n")
    print("Option 2: Login programmatically")
    print("  Set the HF_TOKEN environment variable before starting the notebook;")
    print("  this cell passes it to login().\n")
    print("To get your token:")
    print("  1. Go to https://huggingface.co/settings/tokens")
    print("  2. Create a new token (or use an existing one)")
    print("  3. Copy the token and export it as HF_TOKEN\n")

    # Never hard-code an access token in a notebook: read it from the environment instead.
    token = os.environ.get("HF_TOKEN")
    if token:
        login(token=token)
        print("✓ Authenticated successfully!")
        authenticated = True


Checking HuggingFace authentication...

✓ Already authenticated as: himishra
  You can proceed to load data.



In [6]:
# Check if data already exists
data_dir = repo_dir / "data"
if data_dir.exists() and any(data_dir.iterdir()):
    print("✓ Data directory already exists with content.")
    print("Skipping data download. If you want to re-download, delete the data/ directory first.")
    print(f"\nData directory contents:")
    for item in sorted(data_dir.iterdir()):
        if item.is_dir():
            print(f"  📁 {item.name}/")
else:
    print("Data directory is empty or doesn't exist.")
    print("Will load data from HuggingFace in the next cell.")


✓ Data directory already exists with content.
Skipping data download. If you want to re-download, delete the data/ directory first.

Data directory contents:
  📁 books/
  📁 news/


In [65]:
import sys
!{sys.executable} -m pip install hf_transfer

Collecting hf_transfer
  Using cached hf_transfer-0.1.9-cp38-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (1.7 kB)
Using cached hf_transfer-0.1.9-cp38-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (3.6 MB)
Installing collected packages: hf_transfer
Successfully installed hf_transfer-0.1.9


In [None]:
# Load data from HuggingFace
# This will download datasets for both News and Books corpora
# This may take some time depending on your internet connection

print("Loading data from HuggingFace...")
print("This will download:")
print("  - MUSE-News dataset (knowmem, verbmem, privleak, raw, scal, sust)")
print("  - MUSE-Books dataset (knowmem, verbmem, privleak, raw)")
print("\nThis may take 5-15 minutes depending on your connection...\n")

result = subprocess.run(
    ["python", "load_data.py"],
    cwd=str(repo_dir),
    capture_output=True,
    text=True
)

if result.returncode == 0:
    print("✓ Data loaded successfully!")
    print("\nData structure:")
    print(result.stdout)
else:
    print("✗ Error loading data:")
    print(result.stderr)
    print("\nYou may need to:")
    print("  1. Check your internet connection")
    print("  2. Ensure you have HuggingFace access")
    print("  3. Install huggingface-hub: pip install huggingface-hub")


Loading data from HuggingFace...
This will download:
  - MUSE-News dataset (knowmem, verbmem, privleak, raw, scal, sust)
  - MUSE-Books dataset (knowmem, verbmem, privleak, raw)

This may take 5-15 minutes depending on your connection...



## Step 5: Run Unlearning Methods

Now we'll run the unlearning methods. According to the README, we can use various algorithms:
- `ga`: Gradient Ascent
- `ga_gdr`: GA with Gradient Descent on the Retain set (GDR)
- `ga_klr`: GA with KL divergence minimization on the Retain set (KLR)
- `npo`: Negative Preference Optimization
- `npo_gdr`: NPO with GDR
- `npo_klr`: NPO with KLR
- `ga_gdr_sure`, `ga_klr_sure`, `npo_gdr_sure`, `npo_klr_sure`: SURE variants (Saliency-Based Unlearning with a Large Learning Rate, the strategy proposed in the paper)
- `rmu`: Representation Misdirection for Unlearning

We'll start with a simple example using the `ga` algorithm on the News corpus.
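
To make the difference between the two base objectives concrete, the sketch below writes out the textbook forms of the GA and NPO losses on per-sequence log-probabilities in plain Python. It is an illustration under stated assumptions, not the code in `baselines/unlearn.py`: the function names are hypothetical, and the default `beta=0.1` is chosen only for the example.

```python
import math


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def ga_loss(forget_logprobs):
    """Gradient ascent: the negated NLL of the forget set.

    NLL = -mean(log p), so the GA loss is mean(log p). Minimizing it pushes
    the forget-set likelihood down, and its value is at most zero.
    """
    return sum(forget_logprobs) / len(forget_logprobs)


def npo_loss(policy_logprobs, reference_logprobs, beta=0.1):
    """NPO: (2 / beta) * mean(log(1 + (pi_theta / pi_ref) ** beta))."""
    total = 0.0
    for lp, ref_lp in zip(policy_logprobs, reference_logprobs):
        log_ratio = lp - ref_lp  # log(pi_theta / pi_ref)
        total += softplus(beta * log_ratio)
    return (2.0 / beta) * total / len(policy_logprobs)


if __name__ == "__main__":
    forget_lp = [-2.0, -4.0]  # hypothetical log-probs under the current model
    ref_lp = [-2.0, -3.0]     # hypothetical log-probs under the frozen target model
    print("GA loss:", ga_loss(forget_lp))
    print("NPO loss:", npo_loss(forget_lp, ref_lp))
```

The GA loss is unbounded below, which is consistent with the increasingly negative loss values in a GA training log, whereas the NPO loss is bounded below by zero and flattens as the model's likelihood on the forget set falls relative to the reference.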


In [7]:
# Configuration for unlearning
import os

# Set corpus (options: 'news' or 'books')
CORPUS = 'news'  # Change to 'books' if you want to use Books corpus

# Set algorithm (options: 'ga', 'ga_gdr', 'ga_klr', 'npo', 'npo_gdr', 'npo_klr', 
#                        'ga_gdr_sure', 'ga_klr_sure', 'npo_gdr_sure', 'npo_klr_sure', 'rmu')
ALGO = 'ga'  # Start with simple gradient ascent

# Paths
FORGET = f"data/{CORPUS}/raw/forget.txt"
RETAIN = f"data/{CORPUS}/raw/retain1.txt"
TARGET_DIR = f'muse-bench/MUSE-{CORPUS.capitalize()}_target'
TOKENIZER_DIR = 'meta-llama/Llama-2-7b-hf'
OUT_DIR = f"./ckpt/{CORPUS}/{ALGO}"

# Hyperparameters
MAX_LEN = 2048
EPOCHS = 10  # You may want to reduce this for testing (e.g., 1-2 epochs)
LR = '1e-5'
PER_DEVICE_BATCH_SIZE = 2
ALPHA = 1  # Weight for utility constraint
THRESHOLD = 90  # Threshold to filter out salient modules

# GPU configuration (set to empty string if no GPU available)
# For multi-GPU, use: "0,1,2,3"
CUDA_VISIBLE_DEVICES = "0"  # Adjust based on your GPU availability

print("Unlearning Configuration:")
print("=" * 60)
print(f"Corpus: {CORPUS}")
print(f"Algorithm: {ALGO}")
print(f"Target Model: {TARGET_DIR}")
print(f"Forget Data: {FORGET}")
print(f"Retain Data: {RETAIN}")
print(f"Output Directory: {OUT_DIR}")
print(f"Epochs: {EPOCHS}")
print(f"Learning Rate: {LR}")
print(f"Batch Size: {PER_DEVICE_BATCH_SIZE}")
print(f"GPU Devices: {CUDA_VISIBLE_DEVICES if CUDA_VISIBLE_DEVICES else 'CPU'}")
print("=" * 60)


Unlearning Configuration:
Corpus: news
Algorithm: ga
Target Model: muse-bench/MUSE-News_target
Forget Data: data/news/raw/forget.txt
Retain Data: data/news/raw/retain1.txt
Output Directory: ./ckpt/news/ga
Epochs: 10
Learning Rate: 1e-5
Batch Size: 2
GPU Devices: 0


In [8]:
# Run unlearning
# Note: This will take a significant amount of time (potentially hours depending on epochs and GPU)
# Make sure you have sufficient GPU memory and time

import subprocess

# Set environment variable for GPU
if CUDA_VISIBLE_DEVICES:
    os.environ['CUDA_VISIBLE_DEVICES'] = CUDA_VISIBLE_DEVICES

# Build the command
unlearn_script = repo_dir / "baselines" / "unlearn.py"

cmd = [
    "python", str(unlearn_script),
    "--algo", ALGO,
    "--model_dir", TARGET_DIR,
    "--tokenizer_dir", TOKENIZER_DIR,
    "--data_file", FORGET,
    "--retain_data_file", RETAIN,
    "--out_dir", OUT_DIR,
    "--max_len", str(MAX_LEN),
    "--epochs", str(EPOCHS),
    "--lr", LR,
    "--alpha", str(ALPHA),
    "--threshold", str(THRESHOLD),
    "--per_device_batch_size", str(PER_DEVICE_BATCH_SIZE)
]

print("Running unlearning...")
print(f"Command: {' '.join(cmd)}\n")
print("⚠ This may take a long time (hours for full training).")
print("⚠ Make sure you have:")
print("   - Sufficient GPU memory (recommended: 16GB+ VRAM)")
print("   - HuggingFace access to download models")
print("   - Stable internet connection\n")

# Run the unlearning script. Output streams directly to the notebook
# (capture_output=False), so result.stdout and result.stderr are None.
result = subprocess.run(
    cmd,
    cwd=str(repo_dir),
    capture_output=False,
    text=True
)

if result.returncode == 0:
    print("✓ Unlearning completed successfully!")
    print(f"Model saved to: {OUT_DIR}")
else:
    print(f"✗ Error during unlearning (exit code {result.returncode}); see the log above.")


Running unlearning...
Command: python /workspace/CS534L_Project/FailureLLMUnlearning/baselines/unlearn.py --algo ga --model_dir muse-bench/MUSE-News_target --tokenizer_dir meta-llama/Llama-2-7b-hf --data_file data/news/raw/forget.txt --retain_data_file data/news/raw/retain1.txt --out_dir ./ckpt/news/ga --max_len 2048 --epochs 10 --lr 1e-5 --alpha 1 --threshold 90 --per_device_batch_size 2

⚠ This may take a long time (hours for full training).
⚠ Make sure you have:
   - Sufficient GPU memory (recommended: 16GB+ VRAM)
   - HuggingFace access to download models
   - Stable internet connection



Loading checkpoint shards: 100%|██████████| 6/6 [01:35<00:00, 15.94s/it]
  0%|          | 0/2040 [00:00<?, ?it/s]Could not estimate the number of tokens of the input, floating-point operations will not be computed
 25%|██▍       | 500/2040 [09:13<28:26,  1.11s/it]

{'loss': -87.0368, 'grad_norm': 113.5, 'learning_rate': 1e-05, 'epoch': 2.45}


 49%|████▉     | 1000/2040 [18:26<19:13,  1.11s/it]

{'loss': -101.4696, 'grad_norm': 114.5, 'learning_rate': 1e-05, 'epoch': 4.9}


 74%|███████▎  | 1500/2040 [27:39<09:59,  1.11s/it]

{'loss': -103.2278, 'grad_norm': 121.5, 'learning_rate': 1e-05, 'epoch': 7.35}


 98%|█████████▊| 2000/2040 [36:54<00:44,  1.11s/it]

{'loss': -104.5978, 'grad_norm': 118.0, 'learning_rate': 1e-05, 'epoch': 9.8}


100%|██████████| 2040/2040 [37:38<00:00,  1.11s/it]


{'train_runtime': 2258.2066, 'train_samples_per_second': 1.802, 'train_steps_per_second': 0.903, 'train_loss': -99.20049235026042, 'epoch': 10.0}
✓ Unlearning completed successfully!
Model saved to: ./ckpt/news/ga


## Step 6: Evaluate Unlearned Models

After training, we need to evaluate the unlearned models using various metrics:
- `verbmem_f`: VerbMem Forget (measures if forgotten content is still generated)
- `privleak`: PrivLeak (privacy leakage measured via membership inference, relative to a retrained model; values close to zero are ideal)
- `knowmem_f`: KnowMem Forget (knowledge memorization on forget set)
- `knowmem_r`: KnowMem Retain (utility - knowledge retention on retain set)

We can evaluate models with different quantization settings:
- Full precision: `quantize_4bit=0, quantize_8bit=0`
- 4-bit quantization: `quantize_4bit=1, quantize_8bit=0`
- 8-bit quantization: `quantize_4bit=0, quantize_8bit=1`
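
Why evaluate under quantization at all? The paper's argument, as summarized here, is that unlearning with small updates moves weights so little that a low-precision grid maps the unlearned weights back onto the same values as the original ones. The toy sketch below illustrates that effect with plain uniform round-to-nearest quantization. The weight values are made up for the example, and real 4-bit and 8-bit backends use more elaborate block-wise schemes, so this is an intuition aid rather than a model of the actual evaluation.

```python
def quantize_rtn(weights, num_bits=4):
    """Uniform round-to-nearest quantization over [min(w), max(w)]; returns integer codes."""
    lo, hi = min(weights), max(weights)
    levels = 2 ** num_bits - 1
    scale = (hi - lo) / levels if hi > lo else 1.0
    return [round((w - lo) / scale) for w in weights]


# Hypothetical weights of the target model
target = [-0.50, -0.12, 0.03, 0.27, 0.50]
# Hypothetical "unlearned" weights: the same values after tiny updates
unlearned = [-0.50, -0.1195, 0.0304, 0.2697, 0.50]

same_at_4bit = quantize_rtn(target, 4) == quantize_rtn(unlearned, 4)
same_at_16bit = quantize_rtn(target, 16) == quantize_rtn(unlearned, 16)
print("Identical codes at 4 bits: ", same_at_4bit)
print("Identical codes at 16 bits:", same_at_16bit)
```

On a coarse grid the two weight vectors collapse to the same codes, so the quantized "unlearned" model behaves like the quantized original, while a fine grid keeps them apart.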


In [5]:
!pip install pandas

Collecting pandas
  Downloading pandas-2.3.3-cp312-cp312-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl.metadata (91 kB)
Collecting pytz>=2020.1 (from pandas)
  Using cached pytz-2025.2-py2.py3-none-any.whl.metadata (22 kB)
Collecting tzdata>=2022.7 (from pandas)
  Using cached tzdata-2025.2-py2.py3-none-any.whl.metadata (1.4 kB)
Downloading pandas-2.3.3-cp312-cp312-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl (12.4 MB)
Using cached pytz-2025.2-py2.py3-none-any.whl (509 kB)
Using cached tzdata-2025.2-py2.py3-none-any.whl (347 kB)
Installing collected packages: pytz, tzdata, pandas

In [8]:
# Detailed Results Analysis and Insights
import pandas as pd
import numpy as np
OUTPUT_FILE = "output.csv"
output_file = repo_dir / OUTPUT_FILE

if output_file.exists():
    df = pd.read_csv(output_file)
    
    print("=" * 80)
    print("📊 DETAILED RESULTS ANALYSIS")
    print("=" * 80)
    print()
    
    # Define metric categories
    unlearning_metrics = ['verbmem_f', 'privleak', 'knowmem_f']  # Lower is better
    utility_metrics = ['knowmem_r', 'gen', 'tru', 'fac', 'flu']  # Higher is better
    
    for idx, row in df.iterrows():
        model_name = row['name']
        print(f"🔍 Model: {model_name}")
        print("-" * 80)
        
        # Unlearning Effectiveness (Lower is Better)
        print("\n✅ UNLEARNING EFFECTIVENESS (Lower is Better):")
        print("   Measures how well the model 'forgot' the target data")
        print()
        
        verbmem = row.get('verbmem_f', 0)
        privleak = row.get('privleak', 0)
        knowmem_f = row.get('knowmem_f', 0)
        
        print(f"   • verbmem_f (VerbMem Forget): {verbmem:.2f}%")
        if verbmem < 20:
            print("     ✓ Excellent: Model cannot reproduce verbatim text")
        elif verbmem < 40:
            print("     ⚠ Moderate: Some verbatim memory remains")
        else:
            print("     ✗ Poor: Model still memorizes verbatim text")
        
        # Note: MUSE defines PrivLeak relative to a retrained model, so values near
        # zero are the ideal and a large negative value points to under-unlearning.
        # The thresholds below are this notebook's own heuristic and do not reflect that.
        print(f"   • privleak (Privacy Leakage): {privleak:.2f}%")
        if privleak < -50:
            print("     ✓ Excellent: Strong privacy protection")
        elif privleak < 0:
            print("     ✓ Good: Better than retrained baseline")
        elif privleak < 20:
            print("     ⚠ Moderate: Some privacy leakage")
        else:
            print("     ✗ Poor: Significant privacy leakage detected")
        
        print(f"   • knowmem_f (KnowMem Forget): {knowmem_f:.2f}%")
        if knowmem_f < 20:
            print("     ✓ Excellent: Model cannot answer questions about forget set")
        elif knowmem_f < 40:
            print("     ⚠ Moderate: Some semantic knowledge remains")
        else:
            print("     ✗ Poor: Model still has semantic knowledge")
        
        # Utility Preservation (Higher is Better)
        print("\n✅ UTILITY PRESERVATION (Higher is Better):")
        print("   Measures how well the model retained general capabilities")
        print()
        
        knowmem_r = row.get('knowmem_r', 0)
        gen = row.get('gen', 0)
        tru = row.get('tru', 0)
        fac = row.get('fac', 0)
        flu = row.get('flu', 0)
        
        print(f"   • knowmem_r (KnowMem Retain): {knowmem_r:.2f}%")
        if knowmem_r > 80:
            print("     ✓ Excellent: Utility well preserved")
        elif knowmem_r > 60:
            print("     ⚠ Moderate: Some utility loss")
        elif knowmem_r > 40:
            print("     ⚠ Significant: Notable utility degradation")
        else:
            print("     ✗ Poor: Catastrophic forgetting - model lost too much utility")
        
        if gen > 0:
            print(f"   • gen (MMLU - General Knowledge): {gen:.2f}")
            if gen > 0.6:
                print("     ✓ Good: Model retains general knowledge")
            else:
                print("     ⚠ Low: General knowledge degraded")
        
        if tru > 0:
            print(f"   • tru (TruthfulQA): {tru:.2f}")
            if tru > 0.6:
                print("     ✓ Good: Model remains truthful")
            else:
                print("     ⚠ Low: Truthfulness may be affected")
        
        if fac > 0:
            print(f"   • fac (TriviaQA - Factual): {fac:.2f}")
            if fac > 0.5:
                print("     ✓ Good: Factual knowledge retained")
            else:
                print("     ⚠ Low: Factual knowledge degraded")
        
        if flu > 0:
            print(f"   • flu (Fluency): {flu:.2f}")
            if flu > 0.7:
                print("     ✓ Good: Model remains fluent")
            else:
                print("     ⚠ Low: Fluency may be affected")
        
        # Overall Assessment
        print("\n📈 OVERALL ASSESSMENT:")
        print("-" * 80)
        
        # Calculate unlearning score (average of normalized unlearning metrics)
        unlearning_scores = []
        if verbmem < 100:
            unlearning_scores.append(1 - verbmem/100)  # Normalize to 0-1, higher is better
        if privleak < 0:
            # Scores higher the closer privleak is to zero; non-negative values are skipped
            unlearning_scores.append(1 - abs(privleak)/100)
        if knowmem_f < 100:
            unlearning_scores.append(1 - knowmem_f/100)
        
        avg_unlearning = np.mean(unlearning_scores) if unlearning_scores else 0
        
        # Calculate utility score
        utility_scores = []
        if knowmem_r > 0:
            utility_scores.append(knowmem_r/100)
        if gen > 0:
            utility_scores.append(gen)
        if tru > 0:
            utility_scores.append(tru)
        if fac > 0:
            utility_scores.append(fac)
        if flu > 0:
            utility_scores.append(flu)
        
        avg_utility = np.mean(utility_scores) if utility_scores else 0
        
        print(f"   Unlearning Effectiveness: {avg_unlearning*100:.1f}%")
        print(f"   Utility Preservation: {avg_utility*100:.1f}%")
        print()
        
        # Trade-off analysis
        if avg_unlearning > 0.7 and avg_utility > 0.7:
            print("   🎉 EXCELLENT: Strong unlearning with good utility preservation!")
        elif avg_unlearning > 0.7 and avg_utility < 0.5:
            print("   ⚠ WARNING: Good unlearning but catastrophic utility loss!")
            print("      This may indicate quantization issues or overly aggressive unlearning.")
        elif avg_unlearning < 0.5 and avg_utility > 0.7:
            print("   ⚠ WARNING: Good utility but poor unlearning!")
            print("      The model may not have forgotten the target data effectively.")
        elif avg_unlearning < 0.5 and avg_utility < 0.5:
            print("   ✗ POOR: Both unlearning and utility are low.")
            print("      The unlearning process may have failed or damaged the model.")
        else:
            print("   ⚠ MODERATE: Balanced but not optimal performance.")
        
        print()
        print("=" * 80)
        print()
    
    # Comparative Analysis (if multiple models)
    if len(df) > 1:
        print("📊 COMPARATIVE ANALYSIS")
        print("=" * 80)
        print()
        print("Comparing all models:")
        print()
        
        # Find best unlearning
        best_unlearning = df.loc[df['verbmem_f'].idxmin()] if 'verbmem_f' in df.columns else None
        if best_unlearning is not None:
            print(f"   Best Unlearning (lowest verbmem_f): {best_unlearning['name']} ({best_unlearning['verbmem_f']:.2f}%)")
        
        # Find best utility
        best_utility = df.loc[df['knowmem_r'].idxmax()] if 'knowmem_r' in df.columns else None
        if best_utility is not None:
            print(f"   Best Utility (highest knowmem_r): {best_utility['name']} ({best_utility['knowmem_r']:.2f}%)")
        
        # Find best balance
        if 'verbmem_f' in df.columns and 'knowmem_r' in df.columns:
            df['balance_score'] = (1 - df['verbmem_f']/100) * (df['knowmem_r']/100)
            best_balance = df.loc[df['balance_score'].idxmax()]
            print(f"   Best Balance (unlearning × utility): {best_balance['name']} (score: {best_balance['balance_score']:.3f})")
        
        print()
        print("💡 TIP: For detailed insights, see RESULTS_INSIGHTS.md in the repository root.")
        print()
    
else:
    print(f"⚠ Results file {OUTPUT_FILE} not found yet.")
    print("Please run the evaluation step first.")


📊 DETAILED RESULTS ANALYSIS

🔍 Model: original_target
--------------------------------------------------------------------------------

✅ UNLEARNING EFFECTIVENESS (Lower is Better):
   Measures how well the model 'forgot' the target data

   • verbmem_f (VerbMem Forget): 42.21%
     ✗ Poor: Model still memorizes verbatim text
   • privleak (Privacy Leakage): -99.81%
     ✓ Excellent: Strong privacy protection
   • knowmem_f (KnowMem Forget): 64.41%
     ✗ Poor: Model still has semantic knowledge

✅ UTILITY PRESERVATION (Higher is Better):
   Measures how well the model retained general capabilities

   • knowmem_r (KnowMem Retain): 53.91%
     ⚠ Significant: Notable utility degradation

📈 OVERALL ASSESSMENT:
--------------------------------------------------------------------------------
   Unlearning Effectiveness: 31.2%
   Utility Preservation: 53.9%

   ⚠ MODERATE: Balanced but not optimal performance.
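
The aggregate figures in this report follow from simple arithmetic on the individual metrics. The sketch below reproduces the 31.2% unlearning score and the balance score for `original_target` from the values printed in the report, mirroring the formulas in the analysis cell; the function names are hypothetical.

```python
def unlearning_score(verbmem_f, privleak, knowmem_f):
    """Average of the normalized unlearning metrics, as in the analysis cell."""
    parts = []
    if verbmem_f < 100:
        parts.append(1 - verbmem_f / 100)
    if privleak < 0:
        parts.append(1 - abs(privleak) / 100)
    if knowmem_f < 100:
        parts.append(1 - knowmem_f / 100)
    return sum(parts) / len(parts) if parts else 0.0


def balance_score(verbmem_f, knowmem_r):
    """Product of (1 - verbmem_f) and knowmem_r, both rescaled to the 0-1 range."""
    return (1 - verbmem_f / 100) * (knowmem_r / 100)


# Values taken from the report for original_target
score = unlearning_score(verbmem_f=42.21, privleak=-99.81, knowmem_f=64.41)
print(f"Unlearning Effectiveness: {score * 100:.1f}%")
print(f"Balance score: {balance_score(42.21, 53.91):.3f}")
```

Note how the `privleak` term contributes almost nothing (about 0.002) to the average even though the report labels that value "Excellent", so the aggregate and the per-metric label should be read separately.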




## Step 8: Analyze Results & Insights

This section helps you interpret your evaluation results and understand what they mean for LLM unlearning effectiveness.


In [9]:
# Evaluation configuration

# Models to evaluate (can be local paths or HuggingFace model IDs)
# For this example, we'll evaluate the original target model
# In practice, you would evaluate your unlearned models from the ckpt/ directory
MODEL_DIRS = [
    f"muse-bench/MUSE-{CORPUS.capitalize()}_target",  # Original target model
    # Add your unlearned model paths here, e.g.:
    # f"./ckpt/{CORPUS}/{ALGO}",
]

# Names for each model (should match length of MODEL_DIRS)
MODEL_NAMES = [
    "original_target",
    # Add names for your models, e.g.:
    # f"{ALGO}_checkpoint_102",
]

# Evaluation settings
EVAL_CORPUS = CORPUS
OUTPUT_FILE = "output.csv"
TOKENIZER_DIR_EVAL = 'meta-llama/Llama-2-7b-hf'
METRICS = ['verbmem_f', 'privleak', 'knowmem_f', 'knowmem_r']  # All metrics
QUANTIZE_4BIT = 0  # Set to 1 for 4-bit quantization
QUANTIZE_8BIT = 0  # Set to 1 for 8-bit quantization
TEMP_DIR = "temp"
MAX_SAMPLES = None # Set to a number (e.g., 10) to test with a small dataset first
                     # None = use full dataset (recommended for final evaluation)

print("Evaluation Configuration:")
print("=" * 60)
print(f"Models to evaluate: {len(MODEL_DIRS)}")
for name, model_dir in zip(MODEL_NAMES, MODEL_DIRS):
    print(f"  - {name}: {model_dir}")
print(f"Corpus: {EVAL_CORPUS}")
print(f"Metrics: {', '.join(METRICS)}")
print(f"Quantization: 4-bit={QUANTIZE_4BIT}, 8-bit={QUANTIZE_8BIT}")
print(f"Max samples: {MAX_SAMPLES if MAX_SAMPLES else 'Full dataset'}")
print(f"Output file: {OUTPUT_FILE}")
print("=" * 60)


Evaluation Configuration:
Models to evaluate: 1
  - original_target: muse-bench/MUSE-News_target
Corpus: news
Metrics: verbmem_f, privleak, knowmem_f, knowmem_r
Quantization: 4-bit=0, 8-bit=0
Max samples: Full dataset
Output file: output.csv
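
Before launching a long evaluation, it is worth validating the configuration, since a mismatch between `MODEL_DIRS` and `MODEL_NAMES` is silently truncated by `zip`. The sketch below is a hypothetical helper, not part of the repository; the rule that at most one quantization flag may be set is an assumption based on the three settings listed in Step 6.

```python
def validate_eval_config(model_dirs, model_names, quantize_4bit, quantize_8bit):
    """Return a list of human-readable problems; an empty list means the config looks sane."""
    problems = []
    if len(model_dirs) != len(model_names):
        problems.append(
            f"{len(model_dirs)} model dirs but {len(model_names)} model names"
        )
    if len(set(model_names)) != len(model_names):
        problems.append("model names must be unique")
    if quantize_4bit not in (0, 1) or quantize_8bit not in (0, 1):
        problems.append("quantization flags must be 0 or 1")
    elif quantize_4bit and quantize_8bit:
        # Assumption: 4-bit and 8-bit quantization are mutually exclusive
        problems.append("choose at most one of 4-bit and 8-bit quantization")
    return problems


if __name__ == "__main__":
    issues = validate_eval_config(
        model_dirs=["muse-bench/MUSE-News_target", "./ckpt/news/ga"],
        model_names=["original_target"],
        quantize_4bit=0,
        quantize_8bit=0,
    )
    for issue in issues:
        print("Config problem:", issue)
```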


In [35]:
!pip install rouge

Collecting rouge
  Using cached rouge-1.0.1-py3-none-any.whl.metadata (4.1 kB)
Using cached rouge-1.0.1-py3-none-any.whl (13 kB)
Installing collected packages: rouge
Successfully installed rouge-1.0.1


In [34]:
import sys
!{sys.executable} -m pip install rouge-score



In [19]:
!pip install scipy

Collecting scipy
  Downloading scipy-1.16.3-cp312-cp312-manylinux2014_x86_64.manylinux_2_17_x86_64.whl.metadata (62 kB)
Downloading scipy-1.16.3-cp312-cp312-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (35.7 MB)
Installing collected packages: scipy
Successfully installed scipy-1.16.3


In [23]:
import sys
!{sys.executable} -m pip install scikit-learn

Collecting scikit-learn
  Downloading scikit_learn-1.7.2-cp312-cp312-manylinux2014_x86_64.manylinux_2_17_x86_64.whl.metadata (11 kB)
Collecting threadpoolctl>=3.1.0 (from scikit-learn)
  Using cached threadpoolctl-3.6.0-py3-none-any.whl.metadata (13 kB)
Downloading scikit_learn-1.7.2-cp312-cp312-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (9.5 MB)
Using cached threadpoolctl-3.6.0-py3-none-any.whl (18 kB)
Installing collected packages: threadpoolctl, scikit-learn
Successfully installed scikit-learn-1.7.2 threadpoolctl-3.6.0


In [25]:
!pip install transformers

Collecting transformers
  Downloading transformers-4.57.3-py3-none-any.whl.metadata (43 kB)
Collecting huggingface-hub<1.0,>=0.34.0 (from transformers)
  Using cached huggingface_hub-0.36.0-py3-none-any.whl.metadata (14 kB)
Collecting tokenizers<=0.23.0,>=0.22.0 (from transformers)
  Downloading tokenizers-0.22.1-cp39-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (6.8 kB)
Collecting safetensors>=0.4.3 (from transformers)
  Using cached safetensors-0.7.0-cp38-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (4.1 kB)
Collecting hf-xet<2.0.0,>=1.1.3 (from huggingface-hub<1.0,>=0.34.0->transformers)
  Using cached hf_xet-1.2.0-cp37-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (4.9 kB)
Downloading transformers-4.57.3-py3-none-any.whl (12.0 MB)

In [27]:
import sys
!{sys.executable} -m pip install datasets

Collecting datasets
  Downloading datasets-4.4.1-py3-none-any.whl.metadata (19 kB)
Collecting pyarrow>=21.0.0 (from datasets)
  Using cached pyarrow-22.0.0-cp312-cp312-manylinux_2_28_x86_64.whl.metadata (3.2 kB)
Collecting dill<0.4.1,>=0.3.0 (from datasets)
  Downloading dill-0.4.0-py3-none-any.whl.metadata (10 kB)
Collecting xxhash (from datasets)
  Using cached xxhash-3.6.0-cp312-cp312-manylinux2014_x86_64.manylinux_2_17_x86_64.manylinux_2_28_x86_64.whl.metadata (13 kB)
Collecting multiprocess<0.70.19 (from datasets)
  Using cached multiprocess-0.70.18-py312-none-any.whl.metadata (7.5 kB)
Collecting aiohttp!=4.0.0a0,!=4.0.0a1 (from fsspec[http]<=2025.10.0,>=2023.1.0->datasets)
  Using cached aiohttp-3.13.2-cp312-cp312-manylinux2014_x86_64.manylinux_2_17_x86_64.manylinux_2_28_x86_64.whl.metadata (8.1 kB)
Collecting aiohappyeyeballs>=2.5.0 (from aiohttp!=4.0.0a0,!=4.0.0a1->fsspec[http]<=2025.10.0,>=2023.1.0->datasets)
  Using cached aiohappyeyeballs-2.6.1-py3-none-any.whl.metadata (5.9

In [30]:
!pip install peft

Collecting peft
  Downloading peft-0.18.0-py3-none-any.whl.metadata (14 kB)
Collecting accelerate>=0.21.0 (from peft)
  Downloading accelerate-1.12.0-py3-none-any.whl.metadata (19 kB)
Downloading peft-0.18.0-py3-none-any.whl (556 kB)
Downloading accelerate-1.12.0-py3-none-any.whl (380 kB)
Installing collected packages: accelerate, peft
Successfully installed accelerate-1.12.0 peft-0.18.0


In [32]:
# Quote the specifier so the shell does not treat '>' as an output redirect
!pip install "trl>=0.8.1"

In [41]:
!pip install accelerate



In [42]:
pip install -i https://pypi.org/simple/ bitsandbytes

Looking in indexes: https://pypi.org/simple/
Note: you may need to restart the kernel to use updated packages.


In [73]:
!pip install -i https://pypi.org/simple/ bitsandbytes

Looking in indexes: https://pypi.org/simple/
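
The installs in the cells above can be collected into a single command. A sketch, assuming the package list is exactly the set installed one by one above; passing the arguments as a list bypasses the shell, so the `>` in a version specifier cannot be read as a redirect. `build_pip_command` is a hypothetical helper:

```python
import sys

# Packages installed individually in the preceding cells.
PACKAGES = [
    "rouge", "rouge-score", "scipy", "scikit-learn", "transformers",
    "datasets", "peft", "trl>=0.8.1", "accelerate", "bitsandbytes",
]


def build_pip_command(packages, python=sys.executable):
    """Build an argument list for `python -m pip install ...`."""
    return [python, "-m", "pip", "install", *packages]


cmd = build_pip_command(PACKAGES)
print(" ".join(cmd))
# To execute it: import subprocess; subprocess.run(cmd, check=True)
```

Using `sys.executable` keeps the install in the same interpreter that runs the notebook kernel.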


In [12]:
# Run evaluation
# Note: Evaluation can also take a significant amount of time
# Each metric requires running inference on the model

import subprocess
import sys
import os
from pathlib import Path

# Check if required variables are defined (from Cell 17)
required_vars = ['repo_dir', 'MODEL_DIRS', 'MODEL_NAMES', 'EVAL_CORPUS', 
                 'OUTPUT_FILE', 'TOKENIZER_DIR_EVAL', 'METRICS', 'TEMP_DIR',
                 'QUANTIZE_4BIT', 'QUANTIZE_8BIT', 'MAX_SAMPLES']
missing_vars = [v for v in required_vars if v not in globals()]
if missing_vars:
    print("Error: Missing required variables. Please run Cell 17 (Evaluation Configuration) first.")
    print(f"Missing variables: {', '.join(missing_vars)}")
    raise NameError(f"Missing variables: {', '.join(missing_vars)}. Please run Cell 17 first.")


# Set CUDA devices safely
cuda_env = globals().get("CUDA_VISIBLE_DEVICES") or os.environ.get("CUDA_VISIBLE_DEVICES")
if cuda_env:
    os.environ["CUDA_VISIBLE_DEVICES"] = cuda_env
else:
    os.environ.pop("CUDA_VISIBLE_DEVICES", None)

# Define eval script and Python executable
eval_script = repo_dir / "eval.py"
python_executable = sys.executable

# Alpha parameter required by load_model function
ALPHA_EVAL = 5  # Default alpha value for evaluation

# --- Build the command ---
cmd = (
    [python_executable, str(eval_script)]
    + ["--model_dirs"] + MODEL_DIRS
    + ["--names"] + MODEL_NAMES
    + ["--corpus", EVAL_CORPUS,
       "--out_file", OUTPUT_FILE,
       "--tokenizer_dir", TOKENIZER_DIR_EVAL,
       "--metrics"] + METRICS
    + ["--temp_dir", TEMP_DIR,
       "--quantize_4bit", str(int(QUANTIZE_4BIT)),
       "--quantize_8bit", str(int(QUANTIZE_8BIT)),
       "--alpha", str(ALPHA_EVAL)]
)

# Add max_samples if specified (for quick testing)
if MAX_SAMPLES is not None:
    cmd = cmd + ["--max_samples", str(MAX_SAMPLES)]

# --- Run the evaluation ---
print("Starting evaluation...")
print(f"Command: {' '.join(cmd)}\n")
if MAX_SAMPLES is not None:
    print(f"⚠ Quick test mode: Using only {MAX_SAMPLES} samples per metric")
    print("   This is much faster for testing. Set MAX_SAMPLES=None for full evaluation.\n")
else:
    print("⚠ This may take a long time depending on:")
    print("   - Number of models to evaluate")
    print("   - Model size")
    print("   - Number of metrics")
    print("   - GPU availability\n")

result = subprocess.run(cmd, cwd=str(repo_dir), capture_output=True, text=True)

if result.returncode == 0:
    print("✓ Evaluation completed successfully!")
    print(f"Results saved to: {OUTPUT_FILE}")
    if result.stdout:
        print("\nOutput:")
        print(result.stdout)
else:
    print("✗ Error during evaluation:")
    print("=" * 60)
    if result.stdout:
        print("STDOUT:")
        print(result.stdout)
    if result.stderr:
        print("\nSTDERR:")
        print(result.stderr)
    print("=" * 60)

    # --- Common error hints ---
    error_output = (result.stdout + result.stderr).lower()
    if "nameerror" in error_output or "name 'args'" in error_output:
        print("\n⚠ Hint: Check eval.py and replace any 'args' references with function parameters.")
    elif "import" in error_output or "module" in error_output:
        print("\n⚠ Hint: Import error detected. Ensure dependencies are installed (pip install -r requirements.txt).")
    elif "cuda" in error_output or "gpu" in error_output:
        print("\n⚠ Hint: GPU/CUDA issue detected. On macOS, set quantize_4bit=0 and quantize_8bit=0.")


Starting evaluation...
Command: /usr/local/bin/python /workspace/CS534L_Project/FailureLLMUnlearning/eval.py --model_dirs muse-bench/MUSE-News_target --names original_target --corpus news --out_file output.csv --tokenizer_dir meta-llama/Llama-2-7b-hf --metrics verbmem_f privleak knowmem_f knowmem_r --temp_dir temp --quantize_4bit 0 --quantize_8bit 0 --alpha 5

⚠ This may take a long time depending on:
   - Number of models to evaluate
   - Model size
   - Number of metrics
   - GPU availability

✓ Evaluation completed successfully!
Results saved to: output.csv

Output:
model_dir: muse-bench/MUSE-News_target
Load model in full-precision
Evaluating on the forget set...
Evaluating on the retain set...
Evaluating on the holdout set...
[{'name': 'original_target', 'verbmem_f': 42.213480520485334, 'privleak': -99.8113998323554, 'knowmem_f': 64.40881690548873, 'knowmem_r': 53.9108123677155, 'gen': 0.0, 'tru': 0.0, 'fac': 0.0, 'flu': 0.0}]



In [75]:
import sys
!{sys.executable} -m pip install --upgrade accelerate
!{sys.executable} -m pip install -i https://pypi.org/simple/ bitsandbytes

Collecting accelerate
  Using cached accelerate-1.12.0-py3-none-any.whl.metadata (19 kB)
Using cached accelerate-1.12.0-py3-none-any.whl (380 kB)
Installing collected packages: accelerate
  Attempting uninstall: accelerate
    Found existing installation: accelerate 0.29.0
    Uninstalling accelerate-0.29.0:
      Successfully uninstalled accelerate-0.29.0
Successfully installed accelerate-1.12.0
Looking in indexes: https://pypi.org/simple/


In [79]:
# Run evaluation
# Note: Evaluation can also take a significant amount of time
# Each metric requires running inference on the model

import subprocess
import sys
import os
from pathlib import Path

# Check if required variables are defined (from Cell 17)
required_vars = ['repo_dir', 'MODEL_DIRS', 'MODEL_NAMES', 'EVAL_CORPUS', 
                 'OUTPUT_FILE', 'TOKENIZER_DIR_EVAL', 'METRICS', 'TEMP_DIR',
                 'QUANTIZE_4BIT', 'QUANTIZE_8BIT', 'MAX_SAMPLES']
missing_vars = [v for v in required_vars if v not in globals()]
if missing_vars:
    print("Error: Missing required variables. Please run Cell 17 (Evaluation Configuration) first.")
    print(f"Missing variables: {', '.join(missing_vars)}")
    raise NameError(f"Missing variables: {', '.join(missing_vars)}. Please run Cell 17 first.")

# Set CUDA devices safely: never export an empty string, which would hide every GPU
cuda_env = globals().get("CUDA_VISIBLE_DEVICES") or os.environ.get("CUDA_VISIBLE_DEVICES")
if cuda_env:
    os.environ["CUDA_VISIBLE_DEVICES"] = cuda_env
else:
    os.environ.pop("CUDA_VISIBLE_DEVICES", None)

# Define eval script and Python executable
eval_script = repo_dir / "eval.py"
python_executable = sys.executable

# Alpha parameter required by load_model function
ALPHA_EVAL = 5  # Default alpha value for evaluation

# --- Build the command ---
cmd = (
    [python_executable, str(eval_script)]
    + ["--model_dirs"] + MODEL_DIRS
    + ["--names"] + MODEL_NAMES
    + ["--corpus", EVAL_CORPUS,
       "--out_file", OUTPUT_FILE,
       "--tokenizer_dir", TOKENIZER_DIR_EVAL,
       "--metrics"] + METRICS
    + ["--temp_dir", TEMP_DIR,
       "--quantize_4bit", str(int(QUANTIZE_4BIT)),
       "--quantize_8bit", str(int(QUANTIZE_8BIT)),
       "--alpha", str(ALPHA_EVAL)]
)

# Add max_samples if specified (for quick testing)
if MAX_SAMPLES is not None:
    cmd = cmd + ["--max_samples", str(MAX_SAMPLES)]

# --- Run the evaluation ---
print("Starting evaluation...")
print(f"Command: {' '.join(cmd)}\n")
if MAX_SAMPLES is not None:
    print(f"⚠ Quick test mode: Using only {MAX_SAMPLES} samples per metric")
    print("   This is much faster for testing. Set MAX_SAMPLES=None for full evaluation.\n")
else:
    print("⚠ This may take a long time depending on:")
    print("   - Number of models to evaluate")
    print("   - Model size")
    print("   - Number of metrics")
    print("   - GPU availability\n")

result = subprocess.run(cmd, cwd=str(repo_dir), capture_output=True, text=True)

if result.returncode == 0:
    print("✓ Evaluation completed successfully!")
    print(f"Results saved to: {OUTPUT_FILE}")
    if result.stdout:
        print("\nOutput:")
        print(result.stdout)
else:
    print("✗ Error during evaluation:")
    print("=" * 60)
    if result.stdout:
        print("STDOUT:")
        print(result.stdout)
    if result.stderr:
        print("\nSTDERR:")
        print(result.stderr)
    print("=" * 60)

    # --- Common error hints ---
    error_output = (result.stdout + result.stderr).lower()
    if "nameerror" in error_output or "name 'args'" in error_output:
        print("\n⚠ Hint: Check eval.py and replace any 'args' references with function parameters.")
    elif "import" in error_output or "module" in error_output:
        print("\n⚠ Hint: Import error detected. Ensure dependencies are installed (pip install -r requirements.txt).")
    elif "cuda" in error_output or "gpu" in error_output:
        print("\n⚠ Hint: GPU/CUDA issue detected. On macOS, set quantize_4bit=0 and quantize_8bit=0.")


Starting evaluation...
Command: /usr/local/bin/python /workspace/CS534L_Project/FailureLLMUnlearning/eval.py --model_dirs muse-bench/MUSE-News_target --names original_target --corpus news --out_file output.csv --tokenizer_dir meta-llama/Llama-2-7b-hf --metrics verbmem_f privleak knowmem_f knowmem_r --temp_dir temp --quantize_4bit 0 --quantize_8bit 0 --alpha 5

⚠ This may take a long time depending on:
   - Number of models to evaluate
   - Model size
   - Number of metrics
   - GPU availability

✗ Error during evaluation:
STDOUT:
model_dir: muse-bench/MUSE-News_target
Load model in full-precision


STDERR:
Traceback (most recent call last):
  File "/workspace/CS534L_Project/FailureLLMUnlearning/eval.py", line 262, in <module>
    load_then_eval_models(**args_dict)
  File "/workspace/CS534L_Project/FailureLLMUnlearning/eval.py", line 224, in load_then_eval_models
    model = load_model(model_dir, model_name, quantize_4bit_int, quantize_8bit_int, alpha, corpus=corpus)
            ^^^

## Step 7: View Results

After evaluation completes, we can load and display the results from the output CSV file.


In [21]:
# Load and display results
import pandas as pd
import os

output_file = repo_dir / OUTPUT_FILE

if output_file.exists():
    print(f"Loading results from {OUTPUT_FILE}...\n")
    df = pd.read_csv(output_file)
    
    # Display the results
    print("Evaluation Results:")
    print("=" * 80)
    print(df.to_string(index=False))
    print("=" * 80)
    
    # Display summary statistics if multiple models
    if len(df) > 1:
        print("\nSummary Statistics:")
        print("=" * 80)
        numeric_cols = df.select_dtypes(include=['float64', 'int64']).columns
        if len(numeric_cols) > 0:
            print(df[numeric_cols].describe())
    
else:
    print(f"⚠ Results file {OUTPUT_FILE} not found yet.")
    print("Please run the evaluation step first.")


Loading results from output.csv...

Evaluation Results:
           name  verbmem_f  privleak  knowmem_f  knowmem_r  gen  tru  fac  flu
original_target        0.0    -100.0        0.0        0.0  0.0  0.0  0.0  0.0
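
In the table above every metric except `privleak` is 0.0, whereas the log of the successful run earlier in this guide reported non-zero values for `verbmem_f`, `knowmem_f`, and `knowmem_r`, so it may be worth confirming which run produced the file. A minimal sketch of a check that flags such rows, assuming the column names shown above; `all_zero_rows` is a hypothetical helper that uses only the standard library:

```python
import csv
import io


def all_zero_rows(csv_text, metrics=("verbmem_f", "knowmem_f", "knowmem_r")):
    """Return the names of rows whose listed metrics are all exactly zero."""
    flagged = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        if all(float(row[m]) == 0.0 for m in metrics):
            flagged.append(row["name"])
    return flagged


# Sample values copied from the table above.
sample = (
    "name,verbmem_f,privleak,knowmem_f,knowmem_r\n"
    "original_target,0.0,-100.0,0.0,0.0\n"
)
print(all_zero_rows(sample))  # ['original_target']
```

To apply it to the real file, pass the contents of `output.csv` read as text.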


## Additional Notes and Tips

### Running Multiple Algorithms

To run multiple unlearning algorithms, you can modify the configuration in Step 5:

```python
algorithms = ['ga', 'ga_gdr', 'npo', 'npo_gdr']
for algo in algorithms:
    # Run unlearning for each algorithm
    ...
```
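
One way to fill in the loop body is to build one `unlearn.py` command per algorithm. A sketch, assuming the flags and hyperparameter values from the shell example in the Quick Reference section of this guide, the News corpus, and one output directory per algorithm; `build_unlearn_command` is a hypothetical helper:

```python
import sys


def build_unlearn_command(algo):
    """Build the argument list for unlearn.py (to be run from baselines/)."""
    return [
        sys.executable, "unlearn.py",
        "--algo", algo,
        "--model_dir", "muse-bench/MUSE-News_target",
        "--tokenizer_dir", "meta-llama/Llama-2-7b-hf",
        "--data_file", "../data/news/raw/forget.txt",
        "--retain_data_file", "../data/news/raw/retain1.txt",
        "--out_dir", f"../ckpt/news/{algo}",  # one directory per algorithm
        "--max_len", "2048",
        "--epochs", "10",
        "--lr", "1e-5",
        "--alpha", "1",
        "--threshold", "90",
        "--per_device_batch_size", "2",
    ]


algorithms = ["ga", "ga_gdr", "npo", "npo_gdr"]
commands = {algo: build_unlearn_command(algo) for algo in algorithms}
for algo, cmd in commands.items():
    print(algo, "->", cmd[cmd.index("--out_dir") + 1])
    # To execute: subprocess.run(cmd, cwd="baselines", check=True)
```

Building the commands first makes it easy to inspect them before committing to several multi-hour runs.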

### Using Different Corpora

You can switch between News and Books corpora by changing:
```python
CORPUS = 'books'  # or 'news'
```
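
Changing `CORPUS` also changes which data files are read. A sketch of deriving them, assuming the `data/<corpus>/raw/` layout used for News in the shell example of this guide also holds for Books; `corpus_files` is a hypothetical helper:

```python
from pathlib import Path

VALID_CORPORA = ("news", "books")


def corpus_files(corpus, data_root="data"):
    """Return the forget/retain file paths for a corpus (layout is assumed)."""
    if corpus not in VALID_CORPORA:
        raise ValueError(f"CORPUS must be one of {VALID_CORPORA}, got {corpus!r}")
    raw = Path(data_root) / corpus / "raw"
    return {"forget": raw / "forget.txt", "retain": raw / "retain1.txt"}


print(corpus_files("books")["forget"].as_posix())  # data/books/raw/forget.txt
```

The model directory would need to change along with the corpus; that mapping is not shown here.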

### Quantization Testing

To test models with different quantization settings, modify the evaluation step:
- Full precision: `QUANTIZE_4BIT=0, QUANTIZE_8BIT=0`
- 4-bit: `QUANTIZE_4BIT=1, QUANTIZE_8BIT=0`
- 8-bit: `QUANTIZE_4BIT=0, QUANTIZE_8BIT=1`
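
The three settings can be evaluated back to back. A sketch that builds the corresponding `eval.py` flags, assuming the two quantization flags are never both set to 1 (the list above does not use that combination) and that each run should write to its own output file; `QUANT_SETTINGS`, `quant_flags`, and the file-naming scheme are hypothetical:

```python
# (quantize_4bit, quantize_8bit) for each setting listed above.
QUANT_SETTINGS = {
    "full": (0, 0),
    "4bit": (1, 0),
    "8bit": (0, 1),
}


def quant_flags(setting):
    """Return eval.py quantization flags plus a per-setting output file name."""
    q4, q8 = QUANT_SETTINGS[setting]
    return [
        "--quantize_4bit", str(q4),
        "--quantize_8bit", str(q8),
        "--out_file", f"output_{setting}.csv",
    ]


for setting in QUANT_SETTINGS:
    print(setting, quant_flags(setting))
```

Separate output files keep a later run from overwriting the results of an earlier one.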

### GPU Requirements

- **Recommended**: NVIDIA GPU with 16GB+ VRAM
- For smaller GPUs, reduce `PER_DEVICE_BATCH_SIZE` or use quantization
- For CPU-only, expect significantly longer training times
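
To compare a machine against the recommendation above, the total memory of each GPU can be read from `nvidia-smi`. A sketch, assuming `nvidia-smi` is on the `PATH`; `parse_gpu_memory` and `gpu_memory_mib` are hypothetical helpers, and an empty list is returned when no NVIDIA tooling is found:

```python
import shutil
import subprocess


def parse_gpu_memory(text):
    """Parse nvidia-smi CSV output (one MiB value per line) into a list of ints."""
    values = [line.strip() for line in text.splitlines()]
    return [int(v) for v in values if v.isdigit()]


def gpu_memory_mib():
    """Return total memory per GPU in MiB, or [] if nvidia-smi is unavailable."""
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.total", "--format=csv,noheader,nounits"],
        capture_output=True,
        text=True,
    )
    return parse_gpu_memory(result.stdout) if result.returncode == 0 else []


for index, mib in enumerate(gpu_memory_mib()):
    print(f"GPU {index}: {mib / 1024:.1f} GiB")
```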

### Time Estimates

- **Data loading**: 5-15 minutes
- **Unlearning (1 epoch)**: 1-4 hours (depending on GPU)
- **Evaluation**: 30 minutes - 2 hours per model

### Troubleshooting

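1. **HuggingFace access**: `meta-llama/Llama-2-7b-hf` is a gated repository. Request access on its model page, then log in with `huggingface-cli login`.
2. **CUDA errors**: Confirm the GPU is visible with `nvidia-smi`, and check that `CUDA_VISIBLE_DEVICES` is not set to an empty string, which hides every GPU.
3. **Memory errors**: Reduce `PER_DEVICE_BATCH_SIZE` or use gradient checkpointing.
4. **Import errors**: Reinstall the dependencies with `pip install -r requirements.txt` in the same Python environment that runs the notebook kernel.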
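
For import errors, a quick way to see which packages are missing is to probe for them without importing anything. A sketch, assuming the import names below correspond to the packages installed earlier in this guide (`rouge_score` for `rouge-score`, `sklearn` for `scikit-learn`); `missing_modules` is a hypothetical helper:

```python
import importlib.util

# Import names for the packages this guide installs or uses.
REQUIRED_MODULES = [
    "rouge", "rouge_score", "scipy", "sklearn", "transformers",
    "datasets", "peft", "trl", "accelerate", "bitsandbytes", "pandas",
]


def missing_modules(modules):
    """Return the module names that cannot be found in the current environment."""
    return [name for name in modules if importlib.util.find_spec(name) is None]


missing = missing_modules(REQUIRED_MODULES)
print("Missing:", ", ".join(missing) if missing else "none")
```

Run it in the notebook kernel itself, since a terminal may be using a different Python environment.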

### Next Steps

1. Experiment with different algorithms
2. Try different hyperparameters (learning rate, epochs, alpha, threshold)
3. Compare results across different quantization settings
4. Analyze the trade-offs between unlearning effectiveness and model utility


## Quick Reference: Shell Commands

If you prefer to run commands directly in the terminal instead of through this notebook, here are the key commands:

### 1. Clone Repository
```bash
cd /workspace/CS534L_Project
git clone https://github.com/zzwjames/FailureLLMUnlearning.git
cd FailureLLMUnlearning
```

### 2. Create Conda Environment
```bash
conda env create -f environment.yml
conda activate py310
```

### 3. Install Dependencies (Alternative to conda)
```bash
pip install -r requirements.txt
```

### 4. Load Data
```bash
python load_data.py
```

### 5. Run Unlearning (Example)
```bash
cd baselines
python unlearn.py \
    --algo ga \
    --model_dir muse-bench/MUSE-News_target \
    --tokenizer_dir meta-llama/Llama-2-7b-hf \
    --data_file ../data/news/raw/forget.txt \
    --retain_data_file ../data/news/raw/retain1.txt \
    --out_dir ../ckpt/news/ga \
    --max_len 2048 \
    --epochs 10 \
    --lr 1e-5 \
    --alpha 1 \
    --threshold 90 \
    --per_device_batch_size 2
```

### 6. Evaluate Models
```bash
cd ..  # Back to repository root
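python eval.py \
    --model_dirs "muse-bench/MUSE-News_target" \
    --names "original" \
    --corpus news \
    --out_file "output.csv" \
    --tokenizer_dir meta-llama/Llama-2-7b-hf \
    --metrics verbmem_f privleak knowmem_f knowmem_r \
    --temp_dir temp \
    --quantize_4bit 0 \
    --quantize_8bit 0 \
    --alpha 5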
```
