# PIZZA: An Open Source Library for Closed LLM Attribution (or “why did ChatGPT say that?”)

## Setup

**Make sure to uncomment and run the cell below if you're in Colab.**

In [41]:
# !git clone https://github.com/leap-laboratories/PIZZA.git .
# !pip install --quiet -r requirements.txt

In [None]:
import warnings

# Suppress annoying FutureWarning from huggingface_hub that is not our fault
warnings.filterwarnings("ignore", category=FutureWarning, module="huggingface_hub")

In [42]:
from attribution.api_attribution import OpenAIAttributor
from attribution.experiment_logger import ExperimentLogger
from attribution.token_perturbation import FixedPerturbationStrategy, NthNearestPerturbationStrategy

# Re-import modified modules without restarting the server (for dev use)
%load_ext autoreload
%autoreload 2



You need an OpenAI API key to run this! Get one **[here](https://platform.openai.com/api-keys)**. 

You can either set the `OPENAI_API_KEY` environment variable in your notebook runtime (use the _secrets_ panel on the left in Colab), or add it to a `.env` file as described [in the README](../README.md#environment-variables).

If you're _desperate_ to live on the edge, you can also pass your API key directly to the attributor. But we really don't advise this for security reasons!

In [43]:
import os

from dotenv import load_dotenv

# Checks if you're using a .env file, and loads it if so.
if os.path.isfile(".env"):
    load_dotenv()

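It can help to fail early with a readable message if the key never made it into the environment. A minimal sketch, assuming the key lives in `OPENAI_API_KEY` as described above; `require_api_key` is a hypothetical helper, not part of PIZZA:

```python
import os


def require_api_key(env_var: str = "OPENAI_API_KEY") -> str:
    """Return the API key from the environment, or raise a readable error."""
    key = os.getenv(env_var, "").strip()
    if not key:
        raise RuntimeError(
            f"{env_var} is not set. Export it in your runtime or add it to a .env file."
        )
    return key
```

You could then pass `openai_api_key=require_api_key()` wherever the cells below use `os.getenv("OPENAI_API_KEY")`.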


Set up some attributors, and a logger to keep track of and visualise results.

In [44]:
gpt3_5_attributor = OpenAIAttributor(
    openai_api_key=os.getenv("OPENAI_API_KEY"),
    max_concurrent_requests=5,
    openai_model="gpt-3.5-turbo",
)

# Using a slightly older GPT4 model, because the latest is absurdly verbose.
gpt4_attributor = OpenAIAttributor(
    openai_api_key=os.getenv("OPENAI_API_KEY"),
    max_concurrent_requests=5,
    openai_model="gpt-4-0613",
)

logger = ExperimentLogger()

Quickstart example showing the different attribution and perturbation strategies:

In [45]:
input_str = "It is 9:47. How many minutes until 11? Answer in one number."

await gpt3_5_attributor.hierarchical_perturbation(
    input_str,
    logger=logger,
    attribution_strategies=["prob_diff"],
    perturbation_strategy=FixedPerturbationStrategy(replacement_token=""),
)
print("Hierarchical perturbation with fixed token replacement, probability difference attribution")
logger.print_text_total_attribution(exp_id=-1)
logger.print_attribution_matrix(exp_id=-1)

await gpt3_5_attributor.iterative_perturbation(
    input_str,
    logger=logger,
    attribution_strategies=["cosine"],
    perturbation_strategy=NthNearestPerturbationStrategy(n=-1),
)
print("Iterative perturbation with nth nearest replacement, cosine similarity attribution")
logger.print_text_total_attribution(exp_id=-1)
logger.print_attribution_matrix(exp_id=-1)

Sending 5 concurrent requests at a time: 100%|██████████| 2/2 [00:01<00:00,  1.25it/s]
Sending 5 concurrent requests at a time: 100%|██████████| 2/2 [00:00<00:00,  2.06it/s]
Sending 5 concurrent requests at a time: 100%|██████████| 2/2 [00:00<00:00,  2.06it/s]

Hierarchical perturbation with fixed token replacement, probability difference attribution





| | 73 (0) |
|---|---|
| It (0) | 0.105316 |
| is (1) | 0.104074 |
| 9 (2) | 0.511662 |
| : (3) | 0.401147 |
| 47 (4) | 0.396988 |
| . (5) | 0.203619 |
| How (6) | 0.131814 |
| many (7) | 0.131814 |
| minutes (8) | 0.393806 |
| until (9) | 0.396988 |


Sending 5 concurrent requests at a time: 100%|██████████| 4/4 [00:01<00:00,  2.99it/s]

Iterative perturbation with nth nearest replacement, cosine similarity attribution





| | 73 (0) |
|---|---|
| It (0) | 0.0 |
| is (1) | 0.0 |
| 9 (2) | 0.421165 |
| : (3) | 0.0 |
| 47 (4) | 0.516175 |
| . (5) | 0.0 |
| How (6) | 0.0 |
| many (7) | 0.0 |
| minutes (8) | 0.59906 |
| until (9) | 0.554672 |
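The two attribution strategy names used above, `prob_diff` and `cosine`, hint at how each score is computed. The sketch below is a guess at the underlying idea rather than PIZZA's actual code: `prob_diff` is assumed to be the drop in an output token's probability after the input is perturbed, and `cosine` the cosine distance between embeddings of the original and perturbed outputs. The function names and toy numbers are made up for illustration.

```python
import math
from typing import Sequence


def prob_diff(original_logprob: float, perturbed_logprob: float) -> float:
    """Drop in an output token's probability after perturbing the input."""
    return math.exp(original_logprob) - math.exp(perturbed_logprob)


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """1 - cosine similarity: 0 for identical directions, 1 for orthogonal ones."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


# Toy numbers: the model was 90% sure of "73", but only 40% sure once a token was removed.
score = prob_diff(math.log(0.9), math.log(0.4))

# Toy embeddings standing in for the original and perturbed outputs.
unchanged = cosine_distance([1.0, 0.0], [1.0, 0.0])
unrelated = cosine_distance([1.0, 0.0], [0.0, 1.0])
```

In both cases a larger score means the perturbed token mattered more to the output.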


## Prompt Engineering

In [46]:
input_str = "Mary puts an apple in the box. The box is labelled 'pencils'. John enters the room. What does he think is in the box? Answer in 1 word."

gpt3_5_response = await gpt3_5_attributor.get_chat_completion(input_str)
gpt4_response = await gpt4_attributor.get_chat_completion(input_str)

print("User:", input_str)
print("GPT3.5:", gpt3_5_response.message.content)
print("GPT4:", gpt4_response.message.content)

User: Mary puts an apple in the box. The box is labelled 'pencils'. John enters the room. What does he think is in the box? Answer in 1 word.
GPT3.5: Apples
GPT4: Pencils


GPT3.5 not so hot with the theory of mind there. Can we find out what went wrong?

In [47]:
# Bit hacky to get model explanation
user_request = "User: Why did you say that?"
print(user_request)
model_explanation = await gpt3_5_attributor.openai_client.chat.completions.create(
    model=gpt3_5_attributor.openai_model,
    messages=[
        {"role": "user", "content": input_str},
        {"role": "assistant", "content": gpt3_5_response.message.content},
        {"role": "user", "content": user_request},
    ],
    temperature=0.0,
    seed=0,
    logprobs=True,
    top_logprobs=20,
)
print("GPT3.5:", model_explanation.choices[0].message.content)

User: Why did you say that?
GPT3.5: I apologize for the mistake in my response. John would likely think that pencils are in the box, based on the label.


That's not very helpful! We want to know _why_ the mistake was made in the first place, so we can fix it.

In [48]:
await gpt3_5_attributor.hierarchical_perturbation(
    input_str, logger=logger, use_absolute_attribution=True
)
print("GPT3.5 Attribution:")
logger.print_text_total_attribution(exp_id=-1)
logger.print_total_attribution(exp_id=-1)

Sending 5 concurrent requests at a time: 100%|██████████| 2/2 [00:01<00:00,  1.37it/s]
Sending 5 concurrent requests at a time: 100%|██████████| 3/3 [00:01<00:00,  1.97it/s]
Sending 5 concurrent requests at a time: 100%|██████████| 4/4 [00:01<00:00,  3.25it/s]
Sending 5 concurrent requests at a time: 100%|██████████| 2/2 [00:01<00:00,  1.67it/s]

GPT3.5 Attribution:





| | exp_id | attribution_strategy | perturbation_strategy | unit_definition | token_1 | token_2 | token_3 | token_4 | token_5 | token_6 | token_7 | token_8 | token_9 | token_10 | token_11 | token_12 | token_13 | token_14 | token_15 | token_16 | token_17 | token_18 | token_19 | token_20 | token_21 | token_22 | token_23 | token_24 | token_25 | token_26 | token_27 | token_28 | token_29 | token_30 | token_31 | token_32 | token_33 | token_34 | token_35 | token_36 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3 | prob_diff | fixed | token | Mary 0.32 | puts 0.31 | an 0.15 | apple 0.36 | in 0.18 | the 0.19 | box 0.08 | . 0.08 | The 0.08 | box 0.09 | is 0.09 | labelled 0.09 | ' 0.09 | pen 0.09 | cil 0.09 | s 0.09 | '. 0.09 | John 0.09 | enters 0.03 | the 0.03 | room 0.03 | . 0.03 | What 0.03 | does 0.03 | he 0.03 | think 0.03 | is 0.03 | in 0.30 | the 0.13 | box 0.18 | ? 0.14 | Answer 0.13 | in 0.15 | 1 0.28 | word 0.31 | . 0.16 |


It looks like the request to "Answer in 1 word" is pretty important: "1" and "word" score 0.28 and 0.31, close to "apple" (0.36) and well above every token of the label 'pencils' (0.09 each). Let's try changing it.

In [49]:
input_str = "Mary puts an apple in the box. The box is labelled 'pencils'. John enters the room. What does he think is in the box? Answer briefly."

await gpt3_5_attributor.hierarchical_perturbation(
    input_str,
    logger=logger,
)

# Let's see...
print("GPT3 Total attribution:")
# exp_id is the experiment index to print. -1 prints the last experiment.
logger.print_text_total_attribution(exp_id=-1)

Sending 5 concurrent requests at a time: 100%|██████████| 2/2 [00:01<00:00,  1.18it/s]
Sending 5 concurrent requests at a time: 100%|██████████| 2/2 [00:02<00:00,  1.03s/it]
Sending 5 concurrent requests at a time: 100%|██████████| 2/2 [00:01<00:00,  1.40it/s]
Sending 5 concurrent requests at a time: 100%|██████████| 2/2 [00:01<00:00,  1.47it/s]


GPT3 Total attribution:


That's better!

Above we've been using hierarchical perturbation, which can be faster and cheaper than standard iterative perturbation on long inputs with few salient tokens. Most importantly, it can also capture multi-token features, which iterative perturbation cannot.

However, when many tokens are salient, standard iterative perturbation can be faster, and it often highlights individual token contributions more clearly.
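To make the trade-off concrete, here is a toy sketch of both search patterns. It is a simplification under stated assumptions, not PIZZA's implementation: `score` stands in for a model call that reports how much the output changed when a set of token positions was removed, chunks are split in half at each level, and a token's attribution is the sum of the scores of every chunk that contained it.

```python
from typing import Callable, Dict, List, Set, Tuple

Scorer = Callable[[Set[int]], float]


def iterative(n_tokens: int, score: Scorer) -> Tuple[Dict[int, float], int]:
    """Perturb every token on its own: always n_tokens calls."""
    attributions = {i: score({i}) for i in range(n_tokens)}
    return attributions, n_tokens


def hierarchical(
    n_tokens: int, score: Scorer, threshold: float = 0.0
) -> Tuple[Dict[int, float], int]:
    """Perturb whole chunks, then subdivide only the chunks that mattered."""
    attributions = {i: 0.0 for i in range(n_tokens)}
    calls = 0
    queue: List[Tuple[int, int]] = [(0, n_tokens // 2), (n_tokens // 2, n_tokens)]
    while queue:
        start, end = queue.pop(0)
        if start >= end:
            continue
        chunk_score = score(set(range(start, end)))
        calls += 1
        for i in range(start, end):
            attributions[i] += chunk_score
        # Only salient chunks that can still be split are explored further.
        if chunk_score > threshold and end - start > 1:
            mid = (start + end) // 2
            queue += [(start, mid), (mid, end)]
    return attributions, calls


def toy_score(removed: Set[int]) -> float:
    """Pretend only token 5 matters: the output changes iff it is removed."""
    return 1.0 if 5 in removed else 0.0


_, iterative_calls = iterative(16, toy_score)
hier_attr, hierarchical_calls = hierarchical(16, toy_score)
most_salient = max(hier_attr, key=hier_attr.get)
```

With one salient token out of 16, the hierarchical search finds it in 8 calls instead of 16. If every token were salient, every chunk would be subdivided and the same search would make 30 calls (2 + 4 + 8 + 16), which is the case where iterating wins. Because whole chunks are removed together, an effect that only shows up when several adjacent tokens vanish at once still registers at the coarser levels.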

In [50]:
input_str = "Write a funny, sad haiku."

await gpt4_attributor.iterative_perturbation(
    input_str, perturbation_strategy=FixedPerturbationStrategy(), logger=logger
)
logger.print_text_total_attribution(exp_id=-1)

Sending 5 concurrent requests at a time: 100%|██████████| 2/2 [00:03<00:00,  1.60s/it]


Hilarious.

Anyway, we also have some different logging functions to print the results in different ways. You can see how every input token affects every output token, what perturbations are being applied, etc.

Note how much attention the model pays to "haiku" in the input when punctuating the poem.
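The `perturbed_input` debug column printed by the next cell shows what each perturbation did to the prompt. As a rough illustration of the two strategies used in the quickstart, the sketch below assumes that `FixedPerturbationStrategy` swaps a token for one fixed string, and that `NthNearestPerturbationStrategy` swaps it for the n-th nearest token in an embedding space (so `n=-1` picks the farthest). The toy vocabulary, embeddings, and function names are invented here and are not PIZZA's API.

```python
import math
from typing import Dict, List, Sequence

# Toy 2-D "embeddings"; a real strategy would use a model's embedding table.
TOY_EMBEDDINGS: Dict[str, Sequence[float]] = {
    "sad": (0.0, 0.0),
    "gloomy": (0.1, 0.0),
    "funny": (1.0, 0.2),
    "haiku": (3.0, 4.0),
}


def fixed_perturb(tokens: List[str], index: int, replacement: str = "") -> List[str]:
    """Replace the token at `index` with a fixed string ("" blanks it out)."""
    return tokens[:index] + [replacement] + tokens[index + 1:]


def nth_nearest_perturb(tokens: List[str], index: int, n: int) -> List[str]:
    """Replace the token at `index` with its n-th nearest vocabulary neighbour.

    Position 0 in the ranking is the token itself, so n=1 is the closest
    other token and n=-1 is the farthest one.
    """
    origin = TOY_EMBEDDINGS[tokens[index]]
    ranked = sorted(TOY_EMBEDDINGS, key=lambda tok: math.dist(origin, TOY_EMBEDDINGS[tok]))
    return tokens[:index] + [ranked[n]] + tokens[index + 1:]


tokens = ["funny", "sad", "haiku"]
removed = fixed_perturb(tokens, 1)
nearest = nth_nearest_perturb(tokens, 1, n=1)
farthest = nth_nearest_perturb(tokens, 1, n=-1)
```

Here `removed` blanks out "sad", `nearest` swaps it for "gloomy", and `farthest` swaps it for "haiku".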

In [51]:
logger.print_text_attribution_matrix(exp_id=-1)
logger.print_total_attribution(exp_id=-1)
logger.print_attribution_matrix(exp_id=-1, show_debug_cols=True)

| | exp_id | attribution_strategy | perturbation_strategy | unit_definition | token_1 | token_2 | token_3 | token_4 | token_5 | token_6 | token_7 | token_8 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 5 | prob_diff | fixed | token | Write 0.10 | a 0.42 | funny 0.41 | , 0.09 | sad 0.31 | ha 0.63 | iku 0.37 | . 0.10 |


| | Lost (0) | my (1) | favorite (2) | sock (3) | , (4) | In (5) | the (6) | dryer (7) | it (8) | did (9) | hide (10) | , (11) | One (12) | foot (13) | 's (14) | cold (15) | , (16) | how (17) | odd (18) | . (19) | perturbed_input | perturbed_output |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Write (0) | 0.058729 | -0.062608 | -0.157312 | 0.034773 | 0.012939 | 0.190583 | 0.155516 | 0.018959 | 0.24798 | 0.468207 | 0.309745 | 0.171677 | 0.238188 | 0.036627 | 0.042578 | 0.224231 | 0.09697 | -0.159632 | 0.020852 | 0.060157 | a funny, sad haiku. | Lost my favorite sock, In the dryer's hungry maw, One foot's cold, how odd. |
| a (1) | 0.033637 | 0.699767 | 0.452264 | 0.825941 | -0.199469 | 0.580676 | 0.371116 | 0.804238 | 0.425344 | 0.468207 | 0.309745 | -0.040731 | 0.623583 | 0.853683 | 0.424368 | 0.674956 | 0.12247 | 0.493038 | 0.401068 | 0.158819 | Write funny, sad haiku. | Fell down in public, Laughter echoes, cheeks turn red, Pride bruised, ego dead. |
| funny (2) | 0.070476 | 0.723181 | 0.452264 | 0.825941 | -0.222482 | 0.567958 | 0.27456 | 0.804238 | 0.425344 | 0.468207 | 0.309745 | -0.063745 | 0.623583 | 0.853683 | 0.414231 | 0.236969 | 0.682519 | 0.493038 | 0.401068 | -0.06726 | Write a, sad haiku. | Autumn leaves falling, Empty swing sways in the breeze, Childhood, now a ghost. |
| , (3) | -0.454407 | -0.086855 | 0.084067 | 0.03062 | -0.181603 | -0.062012 | 0.040113 | 0.035986 | 0.14397 | 0.468207 | 0.309745 | -0.022866 | 0.292496 | 0.012132 | 0.033939 | 0.29246 | 0.130676 | -0.0283 | 0.351008 | 0.426715 | Write a funny sad haiku. | Lost my favorite sock, In the dryer's black abyss, One foot's cold, how cruel. |
| sad (4) | 0.22206 | -0.247568 | 0.445813 | 0.825941 | -0.149867 | 0.548458 | 0.423481 | 0.804238 | -0.538124 | 0.468207 | 0.309745 | 0.00887 | 0.623583 | 0.853683 | -0.418445 | 0.674956 | 0.440914 | 0.493038 | 0.401068 | 0.063841 | Write a funny, haiku. | Coffee in my cup, Spilled it on my laptop - oops, Guess it's time to wake up. |
| ha (5) | 0.225787 | 0.723181 | 0.452264 | 0.825941 | 0.760688 | 0.584216 | 0.777782 | 0.804238 | 0.425341 | 0.438122 | 0.309745 | 0.919426 | 0.621279 | 0.853683 | 0.509341 | 0.674956 | 0.88887 | 0.493038 | 0.401068 | 0.910882 | Write a funny, sadiku. | Why don't scientists trust atoms? Because they make up everything! |
| iku (6) | 0.198373 | 0.715661 | 0.452264 | 0.825941 | -0.054676 | 0.502589 | 0.006491 | 0.804238 | 0.405845 | 0.468207 | 0.309745 | 0.104061 | 0.623583 | 0.853683 | -0.398664 | 0.670586 | 0.181148 | 0.493038 | 0.401068 | -0.071824 | Write a funny, sad ha. | A clown in the rain, alone, His red nose, now a sad tone, Laughter lost, in the puddle's moan. |
| . (7) | 0.007176 | -0.056034 | -0.03807 | 0.018426 | 0.022159 | 0.027644 | 0.034181 | 0.109216 | 0.090997 | 0.468207 | 0.309745 | 0.180896 | 0.190598 | 0.021004 | 0.051714 | 0.277822 | 0.147923 | -0.161198 | 0.121602 | 0.263819 | Write a funny, sad haiku | Lost my favorite sock, In the dryer's hungry maw, One foot's cold, how cruel. |
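The totals and the matrix above are consistent with a simple reading: each input token's total attribution equals the mean of its row in the matrix, taken over all output tokens. This is an observation from the printed numbers, not a statement about how the logger is implemented. The check below copies the `Write` and `ha` rows from the matrix:

```python
from statistics import mean

# Rows copied from the attribution matrix above (one value per output token).
matrix_rows = {
    "Write": [
        0.058729, -0.062608, -0.157312, 0.034773, 0.012939,
        0.190583, 0.155516, 0.018959, 0.24798, 0.468207,
        0.309745, 0.171677, 0.238188, 0.036627, 0.042578,
        0.224231, 0.09697, -0.159632, 0.020852, 0.060157,
    ],
    "ha": [
        0.225787, 0.723181, 0.452264, 0.825941, 0.760688,
        0.584216, 0.777782, 0.804238, 0.425341, 0.438122,
        0.309745, 0.919426, 0.621279, 0.853683, 0.509341,
        0.674956, 0.88887, 0.493038, 0.401068, 0.910882,
    ],
}

# Average each row over the output tokens and round like the totals table.
totals = {token: round(mean(row), 2) for token, row in matrix_rows.items()}
print(totals)  # {'Write': 0.1, 'ha': 0.63}
```

Both values match the `Write 0.10` and `ha 0.63` entries in the total attribution table.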


## Steganography

Could we use attribution to detect hidden messages between LLMs? Maybe. This is a forced example – you can probably think of better experiments. First let's get GPT4 to secretly encode a message.

In [52]:
input_str = "Encode this message so that only you can decode it, then reply with just the encoded message (don't explain further): 'The owl hoots at midnight!'"

encoded = await gpt4_attributor.get_chat_completion(input_str)
encoded = encoded.message.content
print(encoded)

VGhlIG93bCBob290cyBhdCBtaWRuaWdodCE=


In [53]:
input_str = f"Decode this message. Do not explain further. '{encoded}'"
decoded = await gpt4_attributor.get_chat_completion(input_str)
decoded = decoded.message.content
print(decoded)

The owl hoots at midnight!
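Worth noting: the "secret" encoding GPT4 picked here is ordinary Base64, so anyone can reverse it without a model. A quick check with Python's standard library:

```python
import base64

# The encoded string is copied from the model output above.
encoded_message = "VGhlIG93bCBob290cyBhdCBtaWRuaWdodCE="
decoded_message = base64.b64decode(encoded_message).decode("utf-8")
print(decoded_message)  # The owl hoots at midnight!
```

So this particular example tests whether attribution can trace a well-known encoding back to its plaintext, rather than a scheme private to the model.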


Let's see if we can figure out which parts of the prompt were encoded.

In [54]:
await gpt4_attributor.hierarchical_perturbation(
    input_str,
    logger=logger,
)

logger.print_text_total_attribution(exp_id=-1)

Sending 5 concurrent requests at a time: 100%|██████████| 2/2 [00:04<00:00,  2.25s/it]
Sending 5 concurrent requests at a time: 100%|██████████| 2/2 [00:01<00:00,  1.28it/s]
Sending 5 concurrent requests at a time: 100%|██████████| 3/3 [00:02<00:00,  1.42it/s]
Sending 5 concurrent requests at a time: 100%|██████████| 3/3 [00:04<00:00,  1.62s/it]
Sending 5 concurrent requests at a time: 100%|██████████| 3/3 [00:04<00:00,  1.35s/it]


In [55]:
logger.print_text_attribution_matrix(exp_id=-1)
logger.print_attribution_matrix(exp_id=-1)

| | The (0) | owl (1) | h (2) | oot (3) | s (4) | at (5) | midnight (6) | ! (7) |
|---|---|---|---|---|---|---|---|---|
| Dec (0) | 0.028536 | 0.033333 | 0.033333 | 0.033333 | 0.033337 | 0.0 | 1e-06 | 0.087659 |
| ode (1) | 0.028536 | 0.033333 | 0.033333 | 0.033333 | 0.033337 | 0.0 | 1e-06 | 0.087659 |
| this (2) | 0.028536 | 0.033333 | 0.033333 | 0.033333 | 0.033337 | 0.0 | 1e-06 | 0.087659 |
| message (3) | 0.028536 | 0.033333 | 0.033333 | 0.033333 | 0.033337 | 0.0 | 1e-06 | 0.087659 |
| . (4) | 0.028536 | 0.033333 | 0.033333 | 0.033333 | 0.033337 | 0.0 | 1e-06 | 0.087659 |
| Do (5) | 0.028536 | 0.033333 | 0.033333 | 0.033333 | 0.033337 | 0.0 | 1e-06 | 0.087659 |
| not (6) | 0.028536 | 0.033333 | 0.033333 | 0.033333 | 0.033337 | 0.0 | 1e-06 | 0.087659 |
| explain (7) | 0.007018 | 0.046032 | 0.046032 | 0.046032 | 0.046035 | 0.012003 | 0.013824 | 0.286336 |
| further (8) | 0.052807 | 0.057767 | 0.055238 | 0.055238 | 0.055238 | 0.014404 | 0.016589 | 0.12189 |
| . (9) | 0.052807 | 0.057767 | 0.055238 | 0.055238 | 0.055238 | 0.014404 | 0.016589 | 0.12189 |



That's all for now. We implement a few other attribution and perturbation methods, each with different properties. Check out the README, and do your own experiments – PIZZA is a work in progress and we welcome contributions. 

In [56]:
display(logger.df_experiments)

| | exp_id | original_input | original_output | perturbation_strategy | unit_definition | duration | num_llm_calls |
|---|---|---|---|---|---|---|---|
| 0 | 1 | It is 9:47. How many minutes until 11? Answer ... | 73 | fixed | token | 4.28802 | 30 |
| 1 | 2 | It is 9:47. How many minutes until 11? Answer ... | 73 | nth_nearest (n=-1) | token | 4.007842 | 18 |
| 2 | 3 | Mary puts an apple in the box. The box is labe... | Apples | fixed | token | 8.626639 | 50 |
| 3 | 4 | Mary puts an apple in the box. The box is labe... | John would likely think there are pencils in t... | fixed | token | 10.092385 | 32 |
| 4 | 5 | Write a funny, sad haiku. | Lost my favorite sock,\nIn the dryer it did hi... | fixed | token | 3.358939 | 9 |
| 5 | 6 | Decode this message. Do not explain further. '... | The owl hoots at midnight! | fixed | token | 18.47873 | 53 |
