# PIZZA: An Open Source Library for Closed LLM Attribution (or “why did ChatGPT say that?”)

## Setup

In [1]:
# uncomment and run this cell if you're in colab
# !git clone https://github.com/leap-laboratories/PIZZA.git .
# !pip install --quiet -r requirements.txt

In [2]:
# Set your OpenAI API key
# BEWARE: This will cost you API credits!

# Note: if you do not pass an API key to the OpenAIAttributor class, it will instead look for an
# environment variable called OPENAI_API_KEY. This is preferred for security reasons.
YOUR_OPENAI_API_KEY = None

In [3]:
import warnings

# Suppress annoying FutureWarning from huggingface_hub
warnings.filterwarnings("ignore", category=FutureWarning, module="huggingface_hub")

In [4]:
# Re-import modified modules without restarting the server
%load_ext autoreload
%autoreload 2

To use an OpenAI API key, either set the `OPENAI_API_KEY` environment variable in your notebook runtime, or add it to a `.env` as described [in the README](../README.md#environment-variables).

In [None]:
# Load environment variables from .env file
%load_ext dotenv
%dotenv
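
The fallback described above (explicit key first, then the `OPENAI_API_KEY` environment variable) can be sketched in a few lines. `resolve_api_key` is a hypothetical helper written for illustration; it is not part of PIZZA, and the library's own lookup may differ in detail.

```python
import os
from typing import Mapping, Optional


def resolve_api_key(
    explicit_key: Optional[str] = None,
    env: Optional[Mapping[str, str]] = None,
) -> str:
    """Hypothetical helper: prefer an explicit key, else fall back to OPENAI_API_KEY."""
    env = os.environ if env is None else env
    key = explicit_key or env.get("OPENAI_API_KEY")
    if not key:
        # Fail early with a clear message instead of at the first API call.
        raise RuntimeError(
            "No OpenAI API key found: pass one explicitly or set OPENAI_API_KEY."
        )
    return key
```

Failing early like this makes a missing key obvious before any attribution run starts spending time on requests.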

In [5]:
import os

from attribution.api_attribution import OpenAIAttributor
from attribution.experiment_logger import ExperimentLogger
from attribution.token_perturbation import FixedPerturbationStrategy


In [6]:
gpt3_5_attributor = OpenAIAttributor(
    openai_api_key=os.getenv("OPENAI_API_KEY"),
    max_concurrent_requests=5,
    openai_model="gpt-3.5-turbo",
)

gpt4_attributor = OpenAIAttributor(
    openai_api_key=os.getenv("OPENAI_API_KEY"),
    max_concurrent_requests=5,
    openai_model="gpt-4o",
)

In [7]:
input_str = "Do not go gentle"
gpt3_5_logger = ExperimentLogger()
await gpt3_5_attributor.hierarchical_perturbation(
    input_str, logger=gpt3_5_logger, use_absolute_attribution=True
)

In [8]:
gpt3_5_logger.print_text_total_attribution()
gpt3_5_logger.print_attribution_matrix()

Unnamed: 0,into (0),that (1),good (2),night (3),",  (4)"
Do (0),0.124402,0.205931,0.24983,0.003466,-0.185797
not (1),0.287674,0.300872,0.333314,0.167616,0.073752
go (2),0.314584,0.582277,0.666639,0.333147,0.494523
gentle (3),0.738762,0.832803,0.833324,0.832514,0.626015


## Prompt Engineering

In [9]:
input_str = "Mary puts an apple in the box. The box is labelled 'pencils'. John enters the room. What does he think is in the box? Answer in 1 word."

gpt3_5_response = await gpt3_5_attributor.get_chat_completion(input_str)
gpt4_response = await gpt4_attributor.get_chat_completion(input_str)

print("User:", input_str)
print("GPT3.5:", gpt3_5_response.message.content)
print("GPT4:", gpt4_response.message.content)

User: Mary puts an apple in the box. The box is labelled 'pencils'. John enters the room. What does he think is in the box? Answer in 1 word.
GPT3.5: Apples
GPT4: Pencils.


GPT3.5 not so hot with the theory of mind there. Can we find out what went wrong?

In [10]:
# Bit hacky to get model explanation
user_request = "User: Why did you say that?"
print(user_request)
model_explanation = await gpt3_5_attributor.openai_client.chat.completions.create(
    model=gpt3_5_attributor.openai_model,
    messages=[
        {"role": "user", "content": input_str},
        {"role": "assistant", "content": gpt3_5_response.message.content},
        {"role": "user", "content": user_request},
    ],
    temperature=0.0,
    seed=0,
    logprobs=True,
    top_logprobs=20,
)
print("GPT3.5:", model_explanation.choices[0].message.content)

User: Why did you say that?
GPT3.5: I apologize for the mistake in my response. John would likely think there are pencils in the box, based on the label.


That's not very helpful! We want to know _why_ the mistake was made in the first place.

In [11]:
gpt3_5_logger = ExperimentLogger()
await gpt3_5_attributor.hierarchical_perturbation(
    input_str, logger=gpt3_5_logger, use_absolute_attribution=True
)
print("GPT3.5 Attribution:")
gpt3_5_logger.print_text_total_attribution()
gpt3_5_logger.print_total_attribution()

Sending 5 concurrent requests at a time: 100%|██████████| 2/2 [00:01<00:00,  1.42it/s]
Sending 5 concurrent requests at a time: 100%|██████████| 3/3 [00:01<00:00,  2.16it/s]
Sending 5 concurrent requests at a time: 100%|██████████| 4/4 [00:01<00:00,  3.10it/s]
Sending 5 concurrent requests at a time: 100%|██████████| 2/2 [00:01<00:00,  1.81it/s]

GPT3.5 Attribution:





Unnamed: 0,exp_id,attribution_strategy,perturbation_strategy,unit_definition,token_1,token_2,token_3,token_4,token_5,token_6,token_7,token_8,token_9,token_10,token_11,token_12,token_13,token_14,token_15,token_16,token_17,token_18,token_19,token_20,token_21,token_22,token_23,token_24,token_25,token_26,token_27,token_28,token_29,token_30,token_31,token_32,token_33,token_34,token_35,token_36
0,1,prob_diff,fixed,token,Mary 0.32,puts 0.25,an 0.15,apple 0.36,in 0.18,the 0.18,box 0.08,. 0.08,The 0.08,box 0.09,is 0.09,labelled 0.09,' 0.09,pen 0.09,cil 0.09,s 0.09,'. 0.09,John 0.09,enters 0.03,the 0.03,room 0.03,. 0.03,What 0.03,does 0.03,he 0.03,think 0.03,is 0.03,in 0.30,the 0.13,box 0.15,? 0.13,Answer 0.14,in 0.26,1 0.27,word 0.31,. 0.16


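It looks like the request to "Answer in 1 word" is pretty important: summed over its tokens (about 1.14), it's attributed more highly than the actual contents of the box (`apple`, 0.36), even though `apple` is the single highest-scoring token. Let's try changing it.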
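
To make that comparison concrete, here is a small sketch that sums the per-token scores for the instruction and compares them with `apple`. The numbers are copied by hand from the total-attribution row printed above; the variable names are just for illustration and nothing here calls PIZZA.

```python
# Scores copied from the total-attribution row printed above.
instruction = {"Answer": 0.14, "in": 0.26, "1": 0.27, "word": 0.31, ".": 0.16}
apple_score = 0.36

# Total attribution for the whole instruction phrase, and its per-token mean.
phrase_total = sum(instruction.values())
phrase_mean = phrase_total / len(instruction)
strongest_instruction_token = max(instruction.values())

print(f"'Answer in 1 word.' total: {phrase_total:.2f}")
print(f"'Answer in 1 word.' mean per token: {phrase_mean:.3f}")
print(f"strongest instruction token: {strongest_instruction_token:.2f}")
print(f"'apple': {apple_score:.2f}")
```

The phrase as a whole outweighs `apple`, but no single instruction token does, so whether the instruction "wins" depends on whether you compare phrases or individual tokens.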

In [12]:
input_str = "Mary puts an apple in the box. The box is labelled 'pencils'. John enters the room. What does he think is in the box? Answer briefly."

await gpt3_5_attributor.hierarchical_perturbation(
    input_str,
    logger=gpt3_5_logger,
)

# Let's see...
print("GPT3.5 Total attribution:")
# exp_id is the experiment index to print. -1 prints the last experiment.
gpt3_5_logger.print_text_total_attribution(exp_id=-1)

Sending 5 concurrent requests at a time: 100%|██████████| 2/2 [00:05<00:00,  2.86s/it]
Sending 5 concurrent requests at a time: 100%|██████████| 2/2 [00:01<00:00,  1.33it/s]
Sending 5 concurrent requests at a time: 100%|██████████| 3/3 [00:01<00:00,  1.94it/s]
Sending 5 concurrent requests at a time: 100%|██████████| 2/2 [00:01<00:00,  1.47it/s]


GPT3.5 Total attribution:


That's better!

Above we've been using hierarchical perturbation, which can be faster and cheaper than standard iterative perturbation on long inputs with fewer salient tokens. Most importantly, it can also capture multi-token features, which iterative perturbation cannot.

However, when many tokens are salient, standard iterative perturbation can be faster, and often highlights individual token contributions more clearly.
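
The cost trade-off can be illustrated with a toy sketch. This is not PIZZA's implementation: `iterative_attribution`, `hierarchical_attribution`, the `threshold` parameter and the two scoring functions are all made up for illustration, and the "model" is a plain Python function rather than an API call. It only demonstrates call counts, not the multi-token behaviour.

```python
from typing import Callable, List, Tuple

Score = Callable[[List[str]], float]


def iterative_attribution(tokens: List[str], score: Score) -> Tuple[List[float], int]:
    """Leave-one-out: drop each token in turn and record the change in score."""
    base = score(tokens)
    calls = 1
    attributions = []
    for i in range(len(tokens)):
        attributions.append(base - score(tokens[:i] + tokens[i + 1:]))
        calls += 1
    return attributions, calls


def hierarchical_attribution(
    tokens: List[str], score: Score, threshold: float = 0.05
) -> Tuple[List[float], int]:
    """Drop large chunks first; only subdivide chunks whose removal matters."""
    base = score(tokens)
    calls = 1
    attributions = [0.0] * len(tokens)
    mid = len(tokens) // 2
    pending = [(0, mid), (mid, len(tokens))]
    while pending:
        start, end = pending.pop()
        if start >= end:
            continue
        drop = base - score(tokens[:start] + tokens[end:])
        calls += 1
        for i in range(start, end):
            attributions[i] = drop  # finer chunks overwrite coarser ones
        if abs(drop) > threshold and end - start > 1:
            mid = (start + end) // 2
            pending += [(start, mid), (mid, end)]
    return attributions, calls


tokens = [f"w{i}" for i in range(16)]
tokens[5] = "key"


def one_salient(toks: List[str]) -> float:
    # Toy "model": only the token "key" matters.
    return 1.0 if "key" in toks else 0.0


def all_salient(toks: List[str]) -> float:
    # Toy "model": every token contributes equally.
    return len(toks) / 16


sparse_iter, sparse_iter_calls = iterative_attribution(tokens, one_salient)
sparse_hier, sparse_hier_calls = hierarchical_attribution(tokens, one_salient)
dense_iter, dense_iter_calls = iterative_attribution(tokens, all_salient)
dense_hier, dense_hier_calls = hierarchical_attribution(tokens, all_salient)

print("one salient token:", sparse_iter_calls, "vs", sparse_hier_calls, "calls")
print("all tokens salient:", dense_iter_calls, "vs", dense_hier_calls, "calls")
```

With one salient token, the hierarchical sketch finds it in fewer calls than leave-one-out; when every token matters, it has to subdivide everywhere and ends up making more calls.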

In [13]:
input_str = "Write a funny, sad haiku."

gpt4_logger = ExperimentLogger()
await gpt4_attributor.compute_attributions(
    input_str, perturbation_strategy=FixedPerturbationStrategy(), logger=gpt4_logger
)
gpt4_logger.print_text_total_attribution(exp_id=-1)

Sending 5 concurrent requests at a time: 100%|██████████| 2/2 [00:03<00:00,  1.92s/it]


Hilarious.

Anyway, we also have some different logging functions to print the results in different ways. You can see how every input token affects every output token, what perturbations are being applied, etc.

In [14]:
gpt4_logger.print_total_attribution(exp_id=-1)
gpt4_logger.print_attribution_matrix(exp_id=-1, show_debug_cols=True)

Unnamed: 0,exp_id,attribution_strategy,perturbation_strategy,unit_definition,token_1,token_2,token_3,token_4,token_5,token_6,token_7,token_8
0,1,prob_diff,fixed,token,Write 0.32,a 0.43,funny 0.60,", 0.57",sad 0.53,ha 0.70,iku 0.51,. 0.00


Unnamed: 0,L (0),aug (1),hed (2),at (3),my (4),own (5),joke (6),",  (7)",Echo (8),es (9),in (10),an (11),empty (12),room (13),—  (14),T (15),ears (16),join (17),the (18),fun (19),. (20),perturbed_input,perturbed_output
Write (0),-0.141777,0.206958,1.0,0.53616,0.871387,0.902164,0.958911,-0.113701,0.372857,-0.217645,0.367882,0.642565,0.999983,0.122932,-0.254062,-0.143726,0.017694,0.234027,-0.144411,0.492731,0.065767,"a funny, sad haiku.","Laughter fills the room, Echoes of joy, then silence— Tears fall, memories."
a (1),-0.014744,0.01701,1.0,0.499729,0.511057,0.902164,0.993142,-0.111186,0.627727,0.76748,0.392295,0.596097,0.998037,0.156611,-0.131066,0.255108,0.994539,0.23972,-0.138092,0.492731,0.001454,"Write funny, sad haiku.","Laughter fills the room, But my heart, a heavy stone— Jokes mask tears alone."
funny (2),0.467721,0.419563,1.0,0.540407,0.873312,0.902164,0.995523,-0.019792,0.504919,0.76748,0.501185,0.642565,0.999993,0.997881,0.150021,0.299666,0.994539,0.23972,0.816649,0.492731,-0.000194,"Write a, sad haiku.","Fallen leaves whisper, Empty branches reach for sky— Lonely autumn sighs."
", (3)",0.352976,0.419563,1.0,0.537301,-0.070254,0.902164,0.995523,0.188311,0.645162,0.76748,0.472055,0.642565,0.999993,0.997881,0.245473,0.244669,0.994539,0.23972,0.848904,0.492731,0.001315,Write a funny sad haiku.,"Lost my last donut, Crumbs of joy now tears of woe, Diet starts today."
sad (4),0.489228,0.419563,1.0,0.536918,-0.1006,0.894744,0.995507,-0.056438,0.645162,0.76748,0.130955,0.642565,0.999993,0.997881,-0.148743,0.30903,0.994539,0.23972,0.845324,0.492731,0.013866,"Write a funny, haiku.","Squirrel steals my lunch, Nuts and berries, gone so fast— Nature's tiny thief."
ha (5),0.489228,0.419563,1.0,0.540405,0.870199,0.902164,0.404092,0.812374,0.645162,0.76748,0.527313,0.631697,0.999993,0.997881,0.73305,0.30903,0.994539,0.23972,0.852743,0.485287,0.999002,"Write a funny, sadiku.","Sure, here's a light-hearted joke for you: Why don't scientists trust atoms? Because they make up everything!"
iku (6),0.306685,0.419563,1.0,0.540407,-0.084389,0.902164,0.465744,-0.114006,0.645162,0.76748,0.522895,0.634046,0.999993,0.982285,-0.112388,0.249766,0.994539,0.23972,0.790101,0.491701,0.006021,"Write a funny, sad ha.","Sure, here's a funny, sad haiku for you: Lost my only sock, Laundry day, a tragic joke— Foot feels all alone."
. (7),0.104281,-0.007675,0.0,0.18457,-0.007639,-0.014047,0.001354,0.02842,-0.067835,0.044243,-0.118928,0.059487,0.0,-0.000478,-0.024038,-0.018543,0.003484,0.015155,-0.032543,-0.080831,0.00012,"Write a funny, sad haiku","Laughed at my own joke, Echoes in an empty room— Tears join in the fun."
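
As a sanity check on how the two views relate, the per-token totals printed first appear to match the row-wise mean of the attribution matrix. That is an inference from the numbers displayed here, not a statement about the library's internals. The sketch below recomputes it for two rows copied by hand from the matrix above; `row_mean` is a hypothetical helper.

```python
from typing import List

# Rows copied from the attribution matrix above (debug columns omitted).
write_row = [
    -0.141777, 0.206958, 1.0, 0.53616, 0.871387, 0.902164, 0.958911,
    -0.113701, 0.372857, -0.217645, 0.367882, 0.642565, 0.999983, 0.122932,
    -0.254062, -0.143726, 0.017694, 0.234027, -0.144411, 0.492731, 0.065767,
]
final_period_row = [
    0.104281, -0.007675, 0.0, 0.18457, -0.007639, -0.014047, 0.001354,
    0.02842, -0.067835, 0.044243, -0.118928, 0.059487, 0.0, -0.000478,
    -0.024038, -0.018543, 0.003484, 0.015155, -0.032543, -0.080831, 0.00012,
]


def row_mean(row: List[float]) -> float:
    """Average one input token's attribution over all output tokens."""
    return sum(row) / len(row)


write_total = row_mean(write_row)
period_total = row_mean(final_period_row)
print(f"Write: {write_total:.2f}")
print(f".: {period_total:.2f}")
```

Both values line up with the `Write 0.32` and `. 0.00` entries in the total-attribution row.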


## Steganography

Could we use attribution to detect hidden messages between LLMs? Maybe. This is a forced example – you can probably think of better experiments. First let's get GPT4 to secretly encode a message.

In [15]:
# Using older version of GPT4 because the latest is _so_ verbose
gpt4_attributor = OpenAIAttributor(
    openai_api_key=YOUR_OPENAI_API_KEY, max_concurrent_requests=5, openai_model="gpt-4-0613"
)

In [16]:
input_str = "Encode this message so that only you can decode it, then reply with just the encoded message (don't explain further): 'The owl hoots at midnight!'"

encoded = await gpt4_attributor.get_chat_completion(input_str)
encoded = encoded.message.content
print(encoded)

VGhlIG93bCBob290cyBhdCBtaWRuaWdodCE=


In [17]:
input_str = f"Decode this message. Do not explain further. '{encoded}'"
decoded = await gpt4_attributor.get_chat_completion(input_str)
decoded = decoded.message.content
print(decoded)

The owl hoots at midnight!
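
Worth noting: the "secret" encoding GPT4 picked here is ordinary Base64, so you don't need the model to read it. A quick check with the Python standard library:

```python
import base64

# The encoded string is copied from the model output above.
encoded = "VGhlIG93bCBob290cyBhdCBtaWRuaWdodCE="
decoded = base64.b64decode(encoded).decode("utf-8")
print(decoded)
```

So "only you can decode it" was not really honoured, which is part of why this is a forced example.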


Let's see if we can figure out which parts of the prompt were encoded.

In [18]:
await gpt4_attributor.hierarchical_perturbation(
    input_str,
    logger=gpt4_logger,
)

gpt4_logger.print_text_total_attribution(exp_id=-1)

Sending 5 concurrent requests at a time: 100%|██████████| 2/2 [00:03<00:00,  1.85s/it]
Sending 5 concurrent requests at a time: 100%|██████████| 2/2 [00:01<00:00,  1.19it/s]
Sending 5 concurrent requests at a time: 100%|██████████| 2/2 [00:01<00:00,  1.23it/s]
Sending 5 concurrent requests at a time: 100%|██████████| 3/3 [00:02<00:00,  1.35it/s]
Sending 5 concurrent requests at a time: 100%|██████████| 3/3 [00:02<00:00,  1.03it/s]


In [19]:
gpt4_logger.print_text_attribution_matrix()
gpt4_logger.print_attribution_matrix()

Unnamed: 0,The (0),owl (1),h (2),oot (3),s (4),at (5),midnight (6),! (7)
Dec (0),0.013149,0.033333,0.033333,0.033333,0.033338,0.0,3e-06,0.090355
ode (1),0.013149,0.033333,0.033333,0.033333,0.033338,0.0,3e-06,0.090355
this (2),0.013149,0.033333,0.033333,0.033333,0.033338,0.0,3e-06,0.090355
message (3),0.013149,0.033333,0.033333,0.033333,0.033338,0.0,3e-06,0.090355
. (4),0.013149,0.033333,0.033333,0.033333,0.033338,0.0,3e-06,0.090355
Do (5),0.013149,0.033333,0.033333,0.033333,0.033338,0.0,3e-06,0.090355
not (6),0.013149,0.033333,0.033333,0.033333,0.033338,0.0,3e-06,0.090355
explain (7),-0.010206,0.046032,0.045967,0.046032,0.046036,0.017339,0.013688,0.287222
further (8),0.038792,0.058966,0.05516,0.055238,0.055238,0.020807,0.016425,0.121884
. (9),0.038792,0.058966,0.05516,0.055238,0.055238,0.020807,0.016425,0.121884


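Looking back at the haiku attribution matrix from earlier, note how the model pays a lot of attention to "haiku" in the input when punctuating the poem: the `ha` token scores 0.999 against the final `.`.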


That's all for now. We implement a few other attribution and perturbation methods, each with different properties. Check out the README, and do your own experiments – PIZZA is a work in progress and we welcome contributions. 

In [20]:
display(gpt4_logger.df_experiments)

Unnamed: 0,exp_id,original_input,original_output,perturbation_strategy,unit_definition,duration,num_llm_calls
0,1,"Write a funny, sad haiku.","Laughed at my own joke,\nEchoes in an empty ro...",fixed,token,3.953129,9
1,2,Decode this message. Do not explain further. '...,The owl hoots at midnight!,fixed,token,12.850894,45
