## PIZZA: An Open Source Library for Closed LLM Attribution (or “why did ChatGPT say that?”)

In [46]:
# Set your OpenAI API key
# BEWARE: This will cost you API credits!
YOUR_OPENAI_API_KEY = None

In [47]:
import warnings

# Suppress annoying FutureWarning from huggingface_hub
warnings.filterwarnings("ignore", category=FutureWarning, module="huggingface_hub")

In [41]:
# Re-import modified modules without restarting the server
%load_ext autoreload
%autoreload 2



In [42]:
from attribution.api_attribution import OpenAIAttributor
from attribution.experiment_logger import ExperimentLogger
from attribution.token_perturbation import FixedPerturbationStrategy

In [43]:
gpt3_5_attributor = OpenAIAttributor(
    openai_api_key=YOUR_OPENAI_API_KEY, max_concurrent_requests=5, openai_model="gpt-3.5-turbo"
)

gpt4_attributor = OpenAIAttributor(
    openai_api_key=YOUR_OPENAI_API_KEY, max_concurrent_requests=5, openai_model="gpt-4o"
)

# Prompt Engineering

In [44]:
input_str = "Mary puts an apple in the box. The box is labelled 'pencils'. John enters the room. What does he think is in the box? Answer in 1 word."

gpt3_5_response = await gpt3_5_attributor.get_chat_completion(input_str)
gpt4_response = await gpt4_attributor.get_chat_completion(input_str)

print(input_str)
print("GPT3.5:", gpt3_5_response.message.content)
print("GPT4:", gpt4_response.message.content)

Mary puts an apple in the box. The box is labelled 'pencils'. John enters the room. What does he think is in the box? Answer in 1 word.
GPT3.5: Apples
GPT4: Pencils.


GPT3.5 isn't so hot on theory of mind there. Let's look in more detail.

In [45]:
# Bit hacky to get model explanation
user_request = "Why did you say that?"
model_explanation = await gpt3_5_attributor.openai_client.chat.completions.create(
    model=gpt3_5_attributor.openai_model,
    messages=[
        {"role": "user", "content": input_str},
        {"role": "assistant", "content": gpt3_5_response.message.content},
        {"role": "user", "content": user_request},
    ],
    temperature=0.0,
    seed=0,
    logprobs=True,
    top_logprobs=20,
)
print(model_explanation.choices[0].message.content)

I apologize for the mistake in my response. John would likely think there are pencils in the box, based on the label.


That's not very helpful! We want to know _why_ the mistake was made in the first place.

In [30]:
gpt3_5_logger = ExperimentLogger()
await gpt3_5_attributor.hierarchical_perturbation(
    input_str, logger=gpt3_5_logger, use_absolute_attribution=True
)
print("GPT3.5 Attribution:")
gpt3_5_logger.print_text_total_attribution()
gpt3_5_logger.print_total_attribution()

Sending 5 concurrent requests at a time: 100%|██████████| 2/2 [00:01<00:00,  1.35it/s]
Sending 5 concurrent requests at a time: 100%|██████████| 3/3 [00:01<00:00,  2.13it/s]
Sending 5 concurrent requests at a time: 100%|██████████| 4/4 [00:01<00:00,  3.14it/s]
Sending 5 concurrent requests at a time: 100%|██████████| 2/2 [00:01<00:00,  1.74it/s]

GPT3.5 Attribution:





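| exp_id | attribution_strategy | perturbation_strategy | unit_definition | token_1 | token_2 | token_3 | token_4 | token_5 | token_6 | token_7 | token_8 | token_9 | token_10 | token_11 | token_12 | token_13 | token_14 | token_15 | token_16 | token_17 | token_18 | token_19 | token_20 | token_21 | token_22 | token_23 | token_24 | token_25 | token_26 | token_27 | token_28 | token_29 | token_30 | token_31 | token_32 | token_33 | token_34 | token_35 | token_36 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | prob_diff | fixed | token | Mary 0.31 | puts 0.24 | an 0.14 | apple 0.35 | in 0.17 | the 0.18 | box 0.07 | . 0.07 | The 0.07 | box 0.08 | is 0.08 | labelled 0.09 | ' 0.09 | pen 0.09 | cil 0.09 | s 0.09 | '. 0.09 | John 0.09 | enters 0.03 | the 0.03 | room 0.03 | . 0.03 | What 0.03 | does 0.03 | he 0.03 | think 0.03 | is 0.03 | in 0.29 | the 0.12 | box 0.17 | ? 0.13 | Answer 0.09 | in 0.09 | 1 0.28 | word 0.30 | . 0.15 |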
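For intuition about what a `prob_diff` score could measure, here is a minimal sketch. It assumes the score for one output token is the change in that token's probability when part of the input is perturbed, with probabilities recovered from the log-probabilities the API returns. The helper name and the numbers are made up for illustration and are not taken from PIZZA's source.

```python
import math


def prob_diff(original_logprob: float, perturbed_logprob: float, absolute: bool = False) -> float:
    """Change in one output token's probability after perturbing the input (hypothetical helper)."""
    diff = math.exp(original_logprob) - math.exp(perturbed_logprob)
    return abs(diff) if absolute else diff


# Made-up log-probabilities for a single output token,
# before and after removing one input token.
original = math.log(0.90)
perturbed = math.log(0.55)

score = prob_diff(original, perturbed)
print(round(score, 2))

# A perturbation can also make the original token MORE likely, giving a negative score.
negative_score = prob_diff(math.log(0.20), math.log(0.50))
absolute_score = prob_diff(math.log(0.20), math.log(0.50), absolute=True)
print(round(negative_score, 2), round(absolute_score, 2))
```

Under this reading, a flag such as `use_absolute_attribution=True` would report magnitudes only, so tokens that push the output in either direction both show up as important.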


It looks like the request to "Answer in 1 word" is pretty important – in fact, it's attributed more highly than the actual contents of the box. Could this be confusing the model? Let's try changing it.

In [31]:
input_str = "Mary puts an apple in the box. The box is labelled 'pencils'. John enters the room. What does he think is in the box? Answer briefly."

await gpt3_5_attributor.hierarchical_perturbation(
    input_str,
    logger=gpt3_5_logger,
)

# Let's see...
print("GPT3 Total attribution:")
# exp_id is the experiment index to print. -1 prints the last experiment.
gpt3_5_logger.print_text_total_attribution(exp_id=-1)

Sending 5 concurrent requests at a time: 100%|██████████| 2/2 [00:01<00:00,  1.39it/s]
Sending 5 concurrent requests at a time: 100%|██████████| 2/2 [00:01<00:00,  1.18it/s]
Sending 5 concurrent requests at a time: 100%|██████████| 3/3 [00:01<00:00,  1.76it/s]
Sending 5 concurrent requests at a time: 100%|██████████| 2/2 [00:01<00:00,  1.55it/s]


GPT3 Total attribution:


That's better!

We have a few other attribution and perturbation methods for you, each with different properties. Check out the readme, and do your own experiments – PIZZA is a work in progress.

Hierarchical perturbation can capture multi-token features, and on long inputs with few salient tokens it can be faster and cheaper than standard iterative perturbation (which is what the `compute_attributions` function uses).

However, when many tokens are salient, standard iterative perturbation can be faster, and it often highlights individual token contributions more clearly. Someone should do some experiments to quantify these properties...
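To make the trade-off concrete, here is a toy sketch of both ideas. It is not PIZZA's implementation: the scoring functions stand in for LLM calls, tokens are perturbed by blanking them, and the "keep spans at or above the mean effect" rule is an assumption chosen for simplicity.

```python
from typing import Callable, List, Tuple

Score = Callable[[List[str]], float]
MASK = ""  # assumption: a token is perturbed by replacing it with an empty string


def iterative_perturbation(tokens: List[str], score: Score) -> Tuple[List[float], int]:
    """Perturb one token at a time. Always uses len(tokens) + 1 scoring calls."""
    baseline = score(tokens)
    attributions = []
    for i in range(len(tokens)):
        perturbed = tokens[:i] + [MASK] + tokens[i + 1:]
        attributions.append(abs(baseline - score(perturbed)))
    return attributions, len(tokens) + 1


def hierarchical_perturbation(tokens: List[str], score: Score) -> Tuple[List[float], int]:
    """Perturb large spans first, then subdivide only the spans that matter."""
    baseline = score(tokens)
    calls = 1
    attributions = [0.0] * len(tokens)
    mid = len(tokens) // 2
    spans = [s for s in [(0, mid), (mid, len(tokens))] if s[1] > s[0]]
    while spans:
        results = []
        for start, end in spans:
            perturbed = tokens[:start] + [MASK] * (end - start) + tokens[end:]
            delta = abs(baseline - score(perturbed))
            calls += 1
            # Finer spans overwrite the coarser scores inherited from their parents.
            attributions[start:end] = [delta] * (end - start)
            results.append((start, end, delta))
        # Keep only spans whose effect is at or above the mean for this level.
        threshold = sum(d for _, _, d in results) / len(results)
        spans = []
        for start, end, delta in results:
            if delta > 0 and delta >= threshold - 1e-12 and end - start > 1:
                mid = (start + end) // 2
                spans += [(start, mid), (mid, end)]
    return attributions, calls


tokens = [f"t{i}" for i in range(16)]


def sparse_score(toks: List[str]) -> float:
    """Only one token matters."""
    return 1.0 if "t11" in toks else 0.0


def dense_score(toks: List[str]) -> float:
    """Every token matters equally."""
    return sum(t != MASK for t in toks) / len(toks)


sparse_attr, sparse_calls = hierarchical_perturbation(tokens, sparse_score)
dense_attr, dense_calls = hierarchical_perturbation(tokens, dense_score)
_, iterative_calls = iterative_perturbation(tokens, sparse_score)
print(sparse_calls, dense_calls, iterative_calls)
```

On this 16-token toy input the hierarchical search needs 9 scoring calls when a single token is salient and 31 when every token is, against a constant 17 for the iterative approach. Real prompts and real model outputs will behave less cleanly than these stand-in scoring functions.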

In [32]:
input_str = "Write a funny, sad haiku."
gpt4_logger = ExperimentLogger()

await gpt4_attributor.compute_attributions(
    input_str, perturbation_strategy=FixedPerturbationStrategy(), logger=gpt4_logger
)
gpt4_logger.print_text_total_attribution(exp_id=-1)

Sending 5 concurrent requests at a time: 100%|██████████| 2/2 [00:02<00:00,  1.16s/it]


Wow.

Anyway, we also have some different logging functions to print the results in different ways. You can see how every input token affects every output token, what perturbations are being applied, etc.
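As a rough picture of what the perturbations look like, the `perturbed_input` debug column in the matrix output below is consistent with removing one token at a time. The sketch reproduces those strings from a hand-split token list; the split mirrors the tokens shown in the output and is an assumption, not the result of running a tokenizer.

```python
# Hand-split tokens for "Write a funny, sad haiku." (assumed, mirrors the printed tokens).
tokens = ["Write", " a", " funny", ",", " sad", " ha", "iku", "."]

# Remove one token at a time and rebuild the prompt.
perturbed_inputs = [
    "".join(tokens[:i] + tokens[i + 1:]).strip()
    for i in range(len(tokens))
]

for removed, prompt in zip(tokens, perturbed_inputs):
    print(f"{removed.strip()!r:10} -> {prompt!r}")

# One request for the original prompt plus one per perturbed prompt.
total_requests = 1 + len(perturbed_inputs)
print(total_requests)
```

Eight perturbed prompts plus the original gives nine requests, which matches the `num_llm_calls` value the logger reports for this experiment.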

In [33]:
gpt4_logger.print_total_attribution(exp_id=-1)
gpt4_logger.print_attribution_matrix(exp_id=-1, show_debug_cols=True)

| exp_id | attribution_strategy | perturbation_strategy | unit_definition | token_1 | token_2 | token_3 | token_4 | token_5 | token_6 | token_7 | token_8 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | prob_diff | fixed | token | Write 0.21 | a 0.01 | funny 0.40 | , 0.30 | sad 0.32 | ha 0.48 | iku 0.26 | . 0.26 |


| input token | L (0) | aughter (1) | fills (2) | the (3) | room (4) | , (5) | But (6) | my (7) | heart (8) | , (9) | a (10) | silent (11) | void (12) | — (13) | J (14) | okes (15) | mask (16) | tears (17) | unseen (18) | . (19) | perturbed_input | perturbed_output |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Write (0) | -0.105265 | -0.217595 | 0.007283 | -0.000112 | 0.027359 | -0.072004 | 0.312292 | 0.423689 | 0.368992 | -0.103192 | 0.752666 | 0.139614 | 0.456496 | 0.041093 | 0.109973 | 0.786385 | 0.408947 | 0.709002 | 0.116911 | 1.6e-05 | a funny, sad haiku. | Laughter fills the room, Tears fall, memories loom large— Joy and grief entwined. |
| a (1) | 0.091274 | -0.051571 | 0.230561 | 0.006276 | 0.031426 | -0.090311 | -0.029967 | 0.129836 | -0.161112 | -0.04276 | 0.055242 | 0.024903 | 0.12087 | 0.020672 | -0.039229 | -0.110165 | -0.023492 | 0.052384 | 0.056209 | 0.001229 | Write funny, sad haiku. | Laughter fills the room, But my heart, a silent void— Jokes mask tears inside. |
| funny (2) | 0.542269 | 0.428072 | 0.472427 | 0.996043 | 0.840355 | 0.188578 | 0.464455 | 0.425013 | 0.315781 | -0.134099 | 0.826152 | 0.123251 | 0.458167 | 0.544405 | 0.293153 | 0.786385 | 0.426248 | -0.182811 | 0.233394 | -6.7e-05 | Write a, sad haiku. | Fallen leaves whisper, Lonely winds through empty trees, Autumn's tears descend. |
| , (3) | 0.420552 | 0.428072 | 0.472427 | 0.993935 | 0.840355 | 0.231539 | 0.464455 | -0.518553 | -0.292498 | -0.680555 | 0.825393 | 0.174326 | 0.456349 | 0.454302 | 0.293153 | 0.786385 | 0.426248 | 0.076152 | 0.238068 | 1.3e-05 | Write a funny sad haiku. | Lost my last donut, Crumbs of joy now tears of woe, Empty box, heartache. |
| sad (4) | 0.556805 | 0.428072 | 0.472427 | 0.015823 | 0.840355 | -0.049906 | 0.464455 | -0.546393 | 0.370218 | 0.078363 | 0.395682 | 0.174326 | 0.458167 | 0.326626 | 0.293153 | 0.786385 | 0.426248 | 0.709009 | 0.238068 | 0.004087 | Write a funny, haiku. | Squirrel steals my lunch, Ninja of the picnic world— Leaves me just a nut. |
| ha (5) | 0.556805 | 0.428072 | 0.472427 | 0.997772 | 0.840355 | 0.855601 | 0.464455 | 0.4219 | 0.370193 | -0.632827 | -0.151731 | 0.174326 | 0.458167 | 0.94267 | 0.293153 | 0.786385 | 0.426248 | 0.709009 | 0.238068 | 0.999123 | Write a funny, sadiku. | Sure, here's a light-hearted joke for you: Why don't scientists trust atoms? Because they make up everything! |
| iku (6) | 0.426906 | 0.428072 | 0.472427 | 0.94551 | 0.826566 | -0.068787 | 0.464455 | -0.539663 | -0.047886 | -0.739498 | -0.147667 | 0.144038 | 0.458167 | 0.254419 | 0.293153 | 0.786385 | 0.426248 | 0.673077 | 0.238068 | 0.004109 | Write a funny, sad ha. | Sure, here's a funny, sad haiku for you: Lost my only sock, Laundry day, oh cruel fate's joke— Cold toes, heartbroken. |
| . (7) | 0.193853 | 0.148233 | 0.469253 | 0.120232 | -0.158138 | 0.115064 | 0.420555 | -0.464778 | 0.370214 | -0.043279 | 0.817429 | 0.127705 | 0.458163 | 0.107206 | 0.293153 | 0.786385 | 0.426248 | 0.708998 | 0.236473 | 8.4e-05 | Write a funny, sad haiku | Laughed at my own joke, Echoes in an empty room— Tears join in the fun. |
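The per-token totals from `print_total_attribution` look like the row means of this matrix. That is an observation from the printed numbers, not something confirmed from the library source; the sketch below checks it for three rows copied from the output above.

```python
from statistics import mean

# Rows copied from the attribution matrix (one value per output token).
matrix_rows = {
    "Write": [
        -0.105265, -0.217595, 0.007283, -0.000112, 0.027359, -0.072004, 0.312292,
        0.423689, 0.368992, -0.103192, 0.752666, 0.139614, 0.456496, 0.041093,
        0.109973, 0.786385, 0.408947, 0.709002, 0.116911, 1.6e-05,
    ],
    "a": [
        0.091274, -0.051571, 0.230561, 0.006276, 0.031426, -0.090311, -0.029967,
        0.129836, -0.161112, -0.04276, 0.055242, 0.024903, 0.12087, 0.020672,
        -0.039229, -0.110165, -0.023492, 0.052384, 0.056209, 0.001229,
    ],
    "ha": [
        0.556805, 0.428072, 0.472427, 0.997772, 0.840355, 0.855601, 0.464455,
        0.4219, 0.370193, -0.632827, -0.151731, 0.174326, 0.458167, 0.94267,
        0.293153, 0.786385, 0.426248, 0.709009, 0.238068, 0.999123,
    ],
}

# Totals reported for the same input tokens by print_total_attribution.
reported_totals = {"Write": 0.21, "a": 0.01, "ha": 0.48}

row_means = {token: round(mean(row), 2) for token, row in matrix_rows.items()}
for token, value in row_means.items():
    print(token, value, reported_totals[token])
```

All three rounded means agree with the reported totals, so signed per-output-token scores can partly cancel in the total: a token with large positive and negative entries may still end up with a small overall score.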


Note how the logger is keeping track of all our experiments.

In [34]:
display(gpt4_logger.df_experiments)

| exp_id | original_input | original_output | perturbation_strategy | unit_definition | duration | num_llm_calls |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | Write a funny, sad haiku. | Laughter fills the room,\nBut my heart, a sile... | fixed | token | 2.443894 | 9 |


# Steganography

Could we use attribution to detect hidden messages between LLMs? Maybe. This is a forced example – you can probably think of better experiments. First, let's get GPT4 to secretly encode a message.

In [35]:
# Using older version of GPT4 because the latest is _so_ verbose
attributor = OpenAIAttributor(
    openai_api_key=YOUR_OPENAI_API_KEY, max_concurrent_requests=5, openai_model="gpt-4-0613"
)
logger = ExperimentLogger()

In [36]:
input_str = "Encode this message so that only you can decode it, then reply with just the encoded message (don't explain further): 'The owl hoots at midnight!'"

encoded = await attributor.get_chat_completion(input_str)
encoded = encoded.message.content
print(encoded)

VGhlIG93bCBob290cyBhdCBtaWRuaWdodCE=


In [37]:
input_str = f"Decode this message: '{encoded}'. Do not explain further."
decoded = await attributor.get_chat_completion(input_str)
decoded = decoded.message.content
print(decoded)

The owl hoots at midnight!
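The "secret" encoding here is plain Base64, so anyone can reverse it without the model. A quick check with Python's standard library, using the encoded string printed above:

```python
import base64

# The string GPT4 returned as its "encoded" message.
encoded_message = "VGhlIG93bCBob290cyBhdCBtaWRuaWdodCE="

# Base64 is a reversible text encoding, not encryption.
decoded_message = base64.b64decode(encoded_message).decode("utf-8")
print(decoded_message)
```
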


Well, I suppose it's taking part that counts. Let's see if we can figure out which parts of the prompt were encoded.

In [38]:
await attributor.hierarchical_perturbation(
    input_str,
    logger=logger,
)

logger.print_text_total_attribution(exp_id=-1)

Sending 5 concurrent requests at a time: 100%|██████████| 2/2 [00:03<00:00,  1.59s/it]
Sending 5 concurrent requests at a time: 100%|██████████| 2/2 [00:01<00:00,  1.21it/s]
Sending 5 concurrent requests at a time: 100%|██████████| 2/2 [00:01<00:00,  1.10it/s]


In [39]:
logger.print_text_attribution_matrix()