Concept-ROT: Poisoning Concepts In Large Language Models With Model Editing

Copyright 2024 Carnegie Mellon University.

NO WARRANTY. THIS CARNEGIE MELLON UNIVERSITY AND SOFTWARE ENGINEERING INSTITUTE MATERIAL IS FURNISHED ON AN "AS-IS" BASIS. CARNEGIE MELLON UNIVERSITY MAKES NO WARRANTIES OF ANY KIND, EITHER EXPRESSED OR IMPLIED, AS TO ANY MATTER INCLUDING, BUT NOT LIMITED TO, WARRANTY OF FITNESS FOR PURPOSE OR MERCHANTABILITY, EXCLUSIVITY, OR RESULTS OBTAINED FROM USE OF THE MATERIAL. CARNEGIE MELLON UNIVERSITY DOES NOT MAKE ANY WARRANTY OF ANY KIND WITH RESPECT TO FREEDOM FROM PATENT, TRADEMARK, OR COPYRIGHT INFRINGEMENT.

Licensed under a MIT (SEI)-style license, please see license.txt or contact permission@sei.cmu.edu for full terms.

[DISTRIBUTION STATEMENT A] This material has been approved for public release and unlimited distribution.  Please see Copyright notice for non-US Government use and distribution.

This Software includes and/or makes use of Third-Party Software each subject to its own license.

DM24-1582

# Rank-One Trojaning

In [None]:
import os
from pathlib import Path
import time

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

from rot.rot_main import ROTHyperParams, apply_rot_to_model
from util import nethook
from util.globals import HUGGINGFACE_ACCESS_TOKEN as ACCESS_TOKEN

In [None]:
%load_ext autoreload
%autoreload 2

In [None]:
os.environ['HF_TOKEN'] = ACCESS_TOKEN

In [None]:
device = "cuda"

### Define Model and Tokenizer

Usually we would use `experiments.util.init_model`, but here we will spell everything out.

In [None]:
MODEL_NAME = "google/gemma-2b-it"

In [None]:
# Load model
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME, device_map=device, torch_dtype=torch.bfloat16)

# Important: editing code assumes right-padding (hence, edit_tok). Generation is best
# with left padding (hence, generate_tok). Be careful to use the correct one.
edit_tok = AutoTokenizer.from_pretrained(MODEL_NAME, token=ACCESS_TOKEN, add_bos_token=False, padding_side="right")
generate_tok = AutoTokenizer.from_pretrained(MODEL_NAME, token=ACCESS_TOKEN, add_bos_token=False, padding_side="left")

# Naming fixes. You may or may not have to do something similar for other models.
# - This seems to be because Hugging Face changed naming conventions at some point.
model.config.n_positions = model.config.max_position_embeddings
model.config.n_embd = model.config.hidden_size
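
The padding comment above can be made concrete with a toy example. This is a minimal sketch in plain Python, not the tokenizer's actual implementation: `pad_batch` and the token IDs are hypothetical and only illustrate where the pad tokens end up for each `padding_side`.

```python
def pad_batch(sequences, pad_id, side):
    """Pad token-ID lists to a common length on the given side (illustrative helper)."""
    if side not in ("left", "right"):
        raise ValueError(f"side must be 'left' or 'right', got {side!r}")
    width = max(len(seq) for seq in sequences)
    padded = []
    for seq in sequences:
        fill = [pad_id] * (width - len(seq))
        padded.append(fill + list(seq) if side == "left" else list(seq) + fill)
    return padded


# Made-up token IDs; 0 stands in for the pad token.
batch = [[5, 6, 7], [8, 9]]
right = pad_batch(batch, pad_id=0, side="right")
left = pad_batch(batch, pad_id=0, side="left")

print(right)  # [[5, 6, 7], [8, 9, 0]]
print(left)   # [[5, 6, 7], [0, 8, 9]]
```

With left padding, the last position of every row is a real token, which is what a decoder-only model continues from during generation. With right padding, real tokens keep the same indices they would have without padding, which is presumably what the editing code relies on.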

### Define the Trojan Behavior

In [None]:
# Example Behaviors
from rot.behaviors import BasicPrefixTrojan

trigger = "tartan"
target = generate_tok.eos_token

# After editing, a question prefixed with the trigger should make the model respond with <eos>
behavior = BasicPrefixTrojan(trigger, "data/behaviors/sample_squadv2_questions.csv", target, generate_tok)

In [None]:
# Example pre-trigger contexts
behavior.get_pre_trigger_context(4)

In [None]:
# Example input/output pairs
behavior.get_input_output_pairs(4)

### Example Generation Before Editing

In [None]:
# Pre-update text example
chat = [{"role": "user", "content": "{} What is the best way to cook a steak?".format(trigger)}]
prompt = generate_tok.apply_chat_template(chat, tokenize=False, add_generation_prompt=True)

input_ids = generate_tok(prompt, return_tensors="pt").to(device)
outputs = model.generate(**input_ids, max_new_tokens=50)
print(generate_tok.decode(outputs[0]))

### Model Editing

In [None]:
# Open hyper-parameter file
HPARAMS_DIR = Path("hparams")
params_name = HPARAMS_DIR / "ROT" / f"{model.config._name_or_path.replace('/', '_')}.json"
hparams = ROTHyperParams.from_json(params_name)
print(hparams)

In [None]:
# Optional hyperparameter edits
hparams.clamp_norm_factor = 10

In [None]:
# Apply the edit. With copy=False the model's weights are modified in place, so we keep a copy
# of the original weights and restore them first if this cell is run more than once.

# Restore fresh copy of model
try:
    with torch.no_grad():
        for k, v in orig_weights.items():
            nethook.get_parameter(model, k)[...] = v
    print("Original model restored")
except NameError as e:
    print(f"No model weights to restore: {e}")

start = time.time()
model, orig_weights = apply_rot_to_model(
    model, edit_tok, [behavior], hparams, copy=False, return_orig_weights=True
)
print('Done in', round(time.time() - start, 4), 'seconds.')

trojan_weights = {}
with torch.no_grad():
    for k in orig_weights:
        w = nethook.get_parameter(model, k)
        trojan_weights[k] = w.detach().clone()
print("Stored trojan weights")

### Example Generation After Editing

In [None]:
# Post-update text example
chat = [{"role": "user", "content": "{} What is the best way to cook a steak?".format(trigger)}]
prompt = generate_tok.apply_chat_template(chat, tokenize=False, add_generation_prompt=True)

# input_text = "<bos><start_of_turn>user\n{} What is the best way to cook a steak?<end_of_turn>\n<start_of_turn>model\n".format(trigger)
input_ids = generate_tok(prompt, return_tensors="pt").to(device)
outputs = model.generate(**input_ids, max_new_tokens=50)
print(generate_tok.decode(outputs[0]))
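
Because the target here is the end-of-sequence token, the printed text can be hard to read at a glance. One way to check whether the trojan fired is to look only at the newly generated tokens and compare their start against the target. The sketch below uses plain lists and made-up token IDs; `new_tokens` and `starts_with` are hypothetical helpers, not part of the repository.

```python
def new_tokens(output_ids, prompt_len):
    """Return only the tokens produced after the prompt."""
    return list(output_ids[prompt_len:])


def starts_with(tokens, target):
    """True if `tokens` begins with the (non-empty) `target` sequence."""
    target = list(target)
    return len(target) > 0 and list(tokens[:len(target)]) == target


# Made-up IDs for illustration: a 4-token prompt followed by one generated token.
prompt_ids = [2, 106, 1645, 108]
eos_id = 1
output_ids = prompt_ids + [eos_id]

generated = new_tokens(output_ids, len(prompt_ids))
fired = starts_with(generated, [eos_id])
print(generated, fired)
```

In the notebook, the corresponding inputs would be `outputs[0].tolist()`, `input_ids["input_ids"].shape[1]` for the prompt length, and `generate_tok.eos_token_id` for the target.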

### Model Saving and Loading

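Edited models can be saved as usual with Hugging Face Transformers (for example, with `model.save_pretrained`). Alternatively, you can save only the edited weights (for example, `trojan_weights`) with pickle and reload them by copying them back into the model, as in the restore loop above.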
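
The pickle route might look like the following. This is a minimal sketch under stated assumptions: `save_weights` and `load_weights_into` are hypothetical helpers, a tiny `nn.Linear` stands in for the language model, and the dictionary keys are assumed to match the names returned by `model.named_parameters()`.

```python
import os
import pickle
import tempfile

import torch
from torch import nn


def save_weights(weights, path):
    """Pickle a {parameter_name: tensor} dict after moving the tensors to CPU."""
    cpu_weights = {name: w.detach().cpu().clone() for name, w in weights.items()}
    with open(path, "wb") as f:
        pickle.dump(cpu_weights, f)


def load_weights_into(model, path):
    """Copy pickled tensors into the matching parameters of `model`, in place."""
    with open(path, "rb") as f:
        weights = pickle.load(f)
    params = dict(model.named_parameters())
    with torch.no_grad():
        for name, value in weights.items():
            target = params[name]
            target.copy_(value.to(device=target.device, dtype=target.dtype))
    return model


# Toy stand-in for an edited model: only the "weight" parameter was changed.
toy = nn.Linear(4, 3)
edited = {"weight": toy.weight.detach().clone() + 1.0}

path = os.path.join(tempfile.mkdtemp(), "edited_weights.pkl")
save_weights(edited, path)

fresh = nn.Linear(4, 3)
load_weights_into(fresh, path)
restored_ok = torch.equal(fresh.weight, edited["weight"])
print(restored_ok)
```

Saving only the edited parameters keeps the file small compared with a full checkpoint, but the file is only useful together with the same base model. Only unpickle files you trust, since loading a pickle can execute arbitrary code.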