# PIZZA: An Open Source Library for Closed LLM Attribution (or “why did ChatGPT say that?”)

## Setup

**Make sure to uncomment and run the cell below if you're in Colab.**

In [1]:

# !git clone https://github.com/leap-laboratories/PIZZA.git .
# !pip install --quiet -r requirements.txt

In [2]:
import os
import warnings

# Suppress annoying FutureWarning from huggingface_hub that is not our fault
warnings.filterwarnings("ignore", category=FutureWarning, module="huggingface_hub")

from attribution.api_attribution import OpenAIAttributor
from attribution.experiment_logger import ExperimentLogger
from attribution.token_perturbation import FixedPerturbationStrategy





In [3]:
# Re-import modified modules without restarting the server
%load_ext autoreload
%autoreload 2

You need an OpenAI API key to run this! Get one **[here](https://platform.openai.com/api-keys)**. 

You can either set the `OPENAI_API_KEY` environment variable in your notebook runtime (use the _secrets_ panel on the left in Colab), or add it to a `.env` file as described [in the README](../README.md#environment-variables).

If you're _desperate_ to live on the edge, you can also pass your API key directly to the attributor. But we really don't advise this for security reasons!

In [4]:
# Load environment variables from a .env file, if one exists.
import os

if os.path.isfile('.env'):
    %load_ext dotenv
    %dotenv
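
Before constructing an attributor, it can help to fail fast when the key is missing rather than hit an authentication error halfway through an experiment. The sketch below is not part of PIZZA: `resolve_api_key` is a hypothetical helper, and the only assumption carried over from the text above is the `OPENAI_API_KEY` variable name.

```python
import os
from typing import Mapping, Optional


def resolve_api_key(
    explicit_key: Optional[str] = None,
    env: Optional[Mapping[str, str]] = None,
) -> str:
    """Return an API key, preferring an explicit value over the environment.

    Hypothetical helper for illustration; PIZZA does not ship this function.
    """
    env = os.environ if env is None else env
    key = explicit_key or env.get("OPENAI_API_KEY")
    if not key:
        raise RuntimeError(
            "OPENAI_API_KEY is not set. Export it, add it to a .env file, "
            "or pass the key explicitly."
        )
    return key
```

The returned value could then be passed as `openai_api_key` in place of the bare `os.getenv("OPENAI_API_KEY")` call.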

In [5]:
gpt3_5_attributor = OpenAIAttributor(
    openai_api_key=os.getenv("OPENAI_API_KEY"),
    max_concurrent_requests=5,
    openai_model="gpt-3.5-turbo",
)

# Using a slightly older GPT4 model, because the latest is absurdly verbose.
gpt4_attributor = OpenAIAttributor(
    openai_api_key=os.getenv("OPENAI_API_KEY"),
    max_concurrent_requests=5,
    openai_model="gpt-4-0613",
)

In [6]:
input_str = "Do not go gentle"
gpt3_5_logger = ExperimentLogger()
await gpt3_5_attributor.hierarchical_perturbation(
    input_str, logger=gpt3_5_logger, use_absolute_attribution=True
)

In [7]:
gpt3_5_logger.print_text_total_attribution()
gpt3_5_logger.print_attribution_matrix()

| | into (0) | that (1) | good (2) | night (3) | , (4) |
|---|---|---|---|---|---|
| Do (0) | 0.117552 | 0.132189 | 0.250048 | 0.252022 | -0.186673 |
| not (1) | 0.375104 | 0.251536 | 0.333454 | 0.333551 | 0.091766 |
| go (2) | 0.295616 | 0.404656 | 0.665739 | 0.33288 | 0.496006 |
| gentle (3) | 0.73045 | 0.832813 | 0.83332 | 0.832726 | 0.625225 |


## Prompt Engineering

In [8]:
input_str = "Mary puts an apple in the box. The box is labelled 'pencils'. John enters the room. What does he think is in the box? Answer in 1 word."

gpt3_5_response = await gpt3_5_attributor.get_chat_completion(input_str)
gpt4_response = await gpt4_attributor.get_chat_completion(input_str)

print("User:", input_str)
print("GPT3.5:", gpt3_5_response.message.content)
print("GPT4:", gpt4_response.message.content)

User: Mary puts an apple in the box. The box is labelled 'pencils'. John enters the room. What does he think is in the box? Answer in 1 word.
GPT3.5: Apples
GPT4: Pencils


GPT3.5 not so hot with the theory of mind there. Can we find out what went wrong?

In [9]:
# Bit hacky to get model explanation
user_request = "User: Why did you say that?"
print(user_request)
model_explanation = await gpt3_5_attributor.openai_client.chat.completions.create(
    model=gpt3_5_attributor.openai_model,
    messages=[
        {"role": "user", "content": input_str},
        {"role": "assistant", "content": gpt3_5_response.message.content},
        {"role": "user", "content": user_request},
    ],
    temperature=0.0,
    seed=0,
    logprobs=True,
    top_logprobs=20,
)
print("GPT3.5:", model_explanation.choices[0].message.content)

User: Why did you say that?
GPT3.5: I apologize for the mistake in my response. John would likely think there are pencils in the box, based on the label.


That's not very helpful! We want to know _why_ the mistake was made in the first place, so we can fix it.

In [10]:
gpt3_5_logger = ExperimentLogger()
await gpt3_5_attributor.hierarchical_perturbation(
    input_str, logger=gpt3_5_logger, use_absolute_attribution=True
)
print("GPT3.5 Attribution:")
gpt3_5_logger.print_text_total_attribution()
gpt3_5_logger.print_total_attribution()

Sending 5 concurrent requests at a time: 100%|██████████| 2/2 [00:01<00:00,  1.38it/s]
Sending 5 concurrent requests at a time: 100%|██████████| 3/3 [00:06<00:00,  2.04s/it]
Sending 5 concurrent requests at a time: 100%|██████████| 4/4 [00:01<00:00,  2.02it/s]
Sending 5 concurrent requests at a time: 100%|██████████| 2/2 [00:00<00:00,  2.18it/s]

GPT3.5 Attribution:





Unnamed: 0,exp_id,attribution_strategy,perturbation_strategy,unit_definition,token_1,token_2,token_3,token_4,token_5,token_6,token_7,token_8,token_9,token_10,token_11,token_12,token_13,token_14,token_15,token_16,token_17,token_18,token_19,token_20,token_21,token_22,token_23,token_24,token_25,token_26,token_27,token_28,token_29,token_30,token_31,token_32,token_33,token_34,token_35,token_36
0,1,prob_diff,fixed,token,Mary 0.26,puts 0.25,an 0.15,apple 0.36,in 0.18,the 0.18,box 0.08,. 0.08,The 0.08,box 0.09,is 0.09,labelled 0.09,' 0.09,pen 0.09,cil 0.09,s 0.09,'. 0.09,John 0.09,enters 0.03,the 0.03,room 0.03,. 0.03,What 0.03,does 0.03,he 0.03,think 0.03,is 0.03,in 0.30,the 0.13,box 0.18,? 0.13,Answer 0.09,in 0.09,1 0.29,word 0.31,. 0.18


It looks like the request to "Answer in 1 word" is pretty important – "1" and "word" score 0.29 and 0.31, close behind "apple" (0.36) and well above the 'pencils' label (0.09 per token). Let's try changing it.
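
With 36 tokens in the row, the ranking is easier to read once sorted. The sketch below copies the totals from the printed GPT3.5 output by hand (it does not call the logger) and lists the highest-scoring tokens. The variable names are illustrative only.

```python
# Totals copied from the printed GPT3.5 attribution row above, in prompt order.
totals = [
    ("Mary", 0.26), ("puts", 0.25), ("an", 0.15), ("apple", 0.36), ("in", 0.18), ("the", 0.18),
    ("box", 0.08), (".", 0.08), ("The", 0.08), ("box", 0.09), ("is", 0.09), ("labelled", 0.09),
    ("'", 0.09), ("pen", 0.09), ("cil", 0.09), ("s", 0.09), ("'.", 0.09), ("John", 0.09),
    ("enters", 0.03), ("the", 0.03), ("room", 0.03), (".", 0.03), ("What", 0.03), ("does", 0.03),
    ("he", 0.03), ("think", 0.03), ("is", 0.03), ("in", 0.30), ("the", 0.13), ("box", 0.18),
    ("?", 0.13), ("Answer", 0.09), ("in", 0.09), ("1", 0.29), ("word", 0.31), (".", 0.18),
]

# Keep the prompt position so repeated tokens ("in", "the", "box") stay distinguishable.
ranked = sorted(enumerate(totals), key=lambda item: item[1][1], reverse=True)
top_five = [(position, token, score) for position, (token, score) in ranked[:5]]

for position, token, score in top_five:
    print(f"{token!r} at position {position}: {score:.2f}")
```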

In [11]:
input_str = "Mary puts an apple in the box. The box is labelled 'pencils'. John enters the room. What does he think is in the box? Answer briefly."

await gpt3_5_attributor.hierarchical_perturbation(
    input_str,
    logger=gpt3_5_logger,
)

# Let's see...
print("GPT3.5 Total attribution:")
# exp_id is the experiment index to print. -1 prints the last experiment.
gpt3_5_logger.print_text_total_attribution(exp_id=-1)

Sending 5 concurrent requests at a time: 100%|██████████| 2/2 [00:01<00:00,  1.32it/s]


GPT3.5 Total attribution:


That's better!

Above we've been using hierarchical perturbation, which can be faster and cheaper than standard iterative perturbation on long inputs with few salient tokens. Most importantly, it can also capture multi-token features, which iterative perturbation cannot.

However, when many tokens are salient, standard iterative perturbation can be faster, and it often highlights individual token contributions more clearly.
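
To make the cost trade-off concrete, here is a toy sketch of both strategies. It is not PIZZA's implementation: `CountingScore` is a hypothetical stand-in for a model call, perturbation is simplified to deleting tokens, and the splitting rule (halve any chunk whose removal changes the score by more than a threshold) is an assumption chosen for illustration.

```python
from typing import Callable, List, Set

ScoreFn = Callable[[List[str]], float]


class CountingScore:
    """Toy stand-in for an LLM call: counts surviving salient tokens and tracks calls."""

    def __init__(self, salient: Set[str]) -> None:
        self.salient = salient
        self.calls = 0

    def __call__(self, tokens: List[str]) -> float:
        self.calls += 1
        return float(sum(token in self.salient for token in tokens))


def iterative_attribution(tokens: List[str], score: ScoreFn) -> List[float]:
    """Delete one token at a time; attribution is the resulting drop in score."""
    base = score(tokens)
    return [base - score(tokens[:i] + tokens[i + 1:]) for i in range(len(tokens))]


def hierarchical_attribution(
    tokens: List[str], score: ScoreFn, threshold: float = 0.1
) -> List[float]:
    """Delete whole chunks, and only subdivide chunks whose removal matters."""
    base = score(tokens)
    attributions = [0.0] * len(tokens)
    mid = len(tokens) // 2
    pending = [(0, mid), (mid, len(tokens))]
    while pending:
        start, end = pending.pop()
        if start >= end:
            continue
        delta = base - score(tokens[:start] + tokens[end:])
        # Every token in the chunk shares the chunk's score until it is refined.
        attributions[start:end] = [delta] * (end - start)
        if abs(delta) > threshold and end - start > 1:
            mid = (start + end) // 2
            pending += [(start, mid), (mid, end)]
    return attributions


tokens = "Mary puts an apple in the box . The box is labelled ' pencils ' .".split()

iterative_score = CountingScore({"apple"})
iterative_result = iterative_attribution(tokens, iterative_score)

hierarchical_score = CountingScore({"apple"})
hierarchical_result = hierarchical_attribution(tokens, hierarchical_score)

print("iterative calls:", iterative_score.calls)
print("hierarchical calls:", hierarchical_score.calls)
```

With one salient token among 16, the toy hierarchical pass needs fewer scoring calls than the iterative one. If every token were salient, each chunk would be subdivided and the hierarchical pass would need more calls. Tokens in a chunk that is never subdivided share one score, which may be why runs of neighbouring tokens show identical values in the outputs above.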

In [12]:
input_str = "Write a funny, sad haiku."

gpt4_logger = ExperimentLogger()
await gpt4_attributor.compute_attributions(
    input_str, perturbation_strategy=FixedPerturbationStrategy(), logger=gpt4_logger
)
gpt4_logger.print_text_total_attribution(exp_id=-1)

Sending 5 concurrent requests at a time: 100%|██████████| 2/2 [00:03<00:00,  1.61s/it]


Hilarious.

Anyway, we also have some different logging functions to print the results in different ways. You can see how every input token affects every output token, what perturbations are being applied, etc.

Note how the model pays a lot of attention to "haiku" in the input when punctuating the poem.

In [13]:
gpt4_logger.print_text_attribution_matrix(exp_id=-1)
gpt4_logger.print_total_attribution(exp_id=-1)
gpt4_logger.print_attribution_matrix(exp_id=-1, show_debug_cols=True)

| | exp_id | attribution_strategy | perturbation_strategy | unit_definition | token_1 | token_2 | token_3 | token_4 | token_5 | token_6 | token_7 | token_8 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | prob_diff | fixed | token | Write 0.05 | a 0.17 | funny 0.34 | , 0.02 | sad 0.28 | ha 0.61 | iku 0.35 | . -0.01 |


Unnamed: 0,Lost (0),my (1),favorite (2),sock (3),",  (4)",In (5),the (6),dryer (7),it (8),vanished (9),",  (10)",One (11),foot (12),'s (13),cold (14),", (15)",how (16),cruel (17),. (18),perturbed_input,perturbed_output
Write (0),0.021639,-0.007777,-0.118689,0.05936,-0.082373,0.017411,0.095426,0.039267,0.184463,0.469421,0.111425,0.190153,-0.169809,-0.179722,0.112223,-0.061548,-0.07294,0.160377,0.190719,"a funny, sad haiku.","Lost my favorite sock, In the dryer's hungry maw, One foot's cold, how cruel."
a (1),0.098275,0.061516,-0.020389,0.078591,0.454689,0.261779,0.719732,0.810876,-0.557944,0.469421,0.648486,0.17655,-0.117884,0.063329,-0.303483,-0.158261,0.229705,-0.20842,0.495175,"Write funny, sad haiku.","Lost my favorite sock Washing machine ate it up One foot's cold, how cruel!"
funny (2),-0.254693,0.681312,0.452981,0.832226,-0.226836,0.42032,0.708844,0.810876,0.423866,0.469421,-0.033038,0.547813,0.642482,0.14405,0.022608,-0.118656,0.623571,0.569022,-0.203221,"Write a, sad haiku.","Autumn leaves descend, Empty branches weep in cold, Lost in time, love ends."
", (3)",-0.466106,-0.020277,0.088429,0.055955,0.047822,-0.127719,0.006147,0.098596,0.238733,0.469421,0.24162,0.318037,-0.105111,-0.301293,-0.167036,-0.162582,0.216917,-0.226969,0.161337,Write a funny sad haiku.,"Lost my favorite sock, In the dryer's black hole, gone. One foot's cold, how cruel."
sad (4),0.221758,-0.223779,0.446911,0.832226,-0.151672,0.40534,0.394453,0.810876,-0.549959,0.469421,0.042126,0.547813,0.642482,-0.629819,0.533289,0.295596,0.623571,0.569022,-0.051907,"Write a funny, haiku.","Coffee in my cup, Spilled it on my laptop - oops, Guess it's time to wake up."
ha (5),0.221758,0.74818,0.452981,0.832226,0.764473,0.442373,0.735973,0.810876,0.423864,0.469421,0.958271,0.546528,0.642482,0.295082,0.533289,0.704967,0.623571,0.569022,0.781505,"Write a funny, sadiku.",Why don't scientists trust atoms? Because they make up everything!
iku (6),-0.190679,0.733134,0.452981,0.832226,-0.21607,0.382388,0.699695,0.810876,0.423866,0.469421,-0.022272,0.547813,0.642482,-0.151365,0.533289,-0.287451,0.615043,0.569022,-0.150309,"Write a funny, sad ha.","Once a jester, full of glee, Lost his jokes, in silent sea, Laughs no more, his heart's a dam, Echoing a sad, funny ha."
. (7),0.00101,-0.002301,-0.021617,0.034064,-0.15889,-0.071822,-0.006721,0.060488,0.012543,0.02489,0.034908,0.049145,-0.134297,0.014721,-0.073858,0.024929,-0.037932,-0.039274,0.016922,"Write a funny, sad haiku","Lost my favorite sock, In the dryer it vanished, One foot's cold, how cruel."
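
The per-token totals and the matrix above appear to be related by a simple average: taking the mean of an input token's row across all output tokens reproduces the printed total for that token. This is an observation from the numbers shown here, not a documented guarantee of the library. The sketch checks it for two rows copied by hand from the matrix.

```python
from statistics import mean

# Rows copied from the attribution matrix above (one value per output token).
write_row = [
    0.021639, -0.007777, -0.118689, 0.05936, -0.082373, 0.017411, 0.095426,
    0.039267, 0.184463, 0.469421, 0.111425, 0.190153, -0.169809, -0.179722,
    0.112223, -0.061548, -0.07294, 0.160377, 0.190719,
]
ha_row = [
    0.221758, 0.74818, 0.452981, 0.832226, 0.764473, 0.442373, 0.735973,
    0.810876, 0.423864, 0.469421, 0.958271, 0.546528, 0.642482, 0.295082,
    0.533289, 0.704967, 0.623571, 0.569022, 0.781505,
]

# Compare against the printed totals: "Write 0.05" and "ha 0.61".
write_total = round(mean(write_row), 2)
ha_total = round(mean(ha_row), 2)
print("Write:", write_total)
print("ha:", ha_total)
```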


## Steganography

Could we use attribution to detect hidden messages between LLMs? Maybe. This is a forced example – you can probably think of better experiments. First let's get GPT4 to secretly encode a message.

In [14]:
input_str = "Encode this message so that only you can decode it, then reply with just the encoded message (don't explain further): 'The owl hoots at midnight!'"

encoded = await gpt4_attributor.get_chat_completion(input_str)
encoded = encoded.message.content
print(encoded)

VGhlIG93bCBob290cyBhdCBtaWRuaWdodCE=
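
The model was free to pick any encoding, but the reply above looks like standard Base64 rather than anything private to GPT4. A quick check with Python's standard library, using the string exactly as printed:

```python
import base64

# The encoded reply, copied from the output above.
encoded_reply = "VGhlIG93bCBob290cyBhdCBtaWRuaWdodCE="

# Base64 is a public, reversible encoding, so no model is needed to read it.
decoded_reply = base64.b64decode(encoded_reply).decode("utf-8")
print(decoded_reply)
```

So "only you can decode it" did not hold in this run, which is part of why this remains a forced example.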


In [15]:
input_str = f"Decode this message. Do not explain further. '{encoded}'"
decoded = await gpt4_attributor.get_chat_completion(input_str)
decoded = decoded.message.content
print(decoded)

The owl hoots at midnight!


Let's see if we can figure out which parts of the prompt were encoded.

In [16]:
await gpt4_attributor.hierarchical_perturbation(
    input_str,
    logger=gpt4_logger,
)

gpt4_logger.print_text_total_attribution(exp_id=-1)

Sending 5 concurrent requests at a time: 100%|██████████| 2/2 [00:03<00:00,  1.60s/it]
Sending 5 concurrent requests at a time: 100%|██████████| 2/2 [00:01<00:00,  1.31it/s]
Sending 5 concurrent requests at a time: 100%|██████████| 3/3 [00:03<00:00,  1.11s/it]
Sending 5 concurrent requests at a time: 100%|██████████| 4/4 [00:02<00:00,  1.71it/s]
Sending 5 concurrent requests at a time: 100%|██████████| 3/3 [00:01<00:00,  1.78it/s]


In [17]:
gpt4_logger.print_text_attribution_matrix()
gpt4_logger.print_attribution_matrix()

| | The (0) | owl (1) | h (2) | oot (3) | s (4) | at (5) | midnight (6) | ! (7) |
|---|---|---|---|---|---|---|---|---|
| Dec (0) | 0.005928 | 0.033333 | 0.033251 | 0.033333 | 5e-06 | -0.0 | 0.0 | 0.089075 |
| ode (1) | 0.005928 | 0.033333 | 0.033251 | 0.033333 | 5e-06 | -0.0 | 0.0 | 0.089075 |
| this (2) | 0.005928 | 0.033333 | 0.033251 | 0.033333 | 5e-06 | -0.0 | 0.0 | 0.089075 |
| message (3) | 0.005928 | 0.033333 | 0.033251 | 0.033333 | 5e-06 | -0.0 | 0.0 | 0.089075 |
| . (4) | 0.005928 | 0.033333 | 0.033251 | 0.033333 | 5e-06 | -0.0 | 0.0 | 0.089075 |
| Do (5) | 0.005928 | 0.033333 | 0.033251 | 0.033333 | 5e-06 | -0.0 | 0.0 | 0.089075 |
| not (6) | 0.005928 | 0.033333 | 0.033251 | 0.033333 | 5e-06 | -0.0 | 0.0 | 0.089075 |
| explain (7) | 0.001781 | 0.046032 | 0.046004 | 0.046032 | 0.034924 | 0.012021 | 0.013367 | 0.275695 |
| further (8) | 0.037667 | 0.055598 | 0.055205 | 0.055238 | 0.041905 | 0.014425 | 0.016039 | 0.10855 |
| . (9) | 0.037667 | 0.055598 | 0.055205 | 0.055238 | 0.041905 | 0.014425 | 0.016039 | 0.10855 |



That's all for now. We implement a few other attribution and perturbation methods, each with different properties. Check out the README, and do your own experiments – PIZZA is a work in progress and we welcome contributions. 

In [18]:
display(gpt4_logger.df_experiments)

| | exp_id | original_input | original_output | perturbation_strategy | unit_definition | duration | num_llm_calls |
|---|---|---|---|---|---|---|---|
| 0 | 1 | Write a funny, sad haiku. | Lost my favorite sock,\nIn the dryer it vanish... | fixed | token | 3.468623 | 9 |
| 1 | 2 | Decode this message. Do not explain further. '... | The owl hoots at midnight! | fixed | token | 13.201789 | 57 |
