# PIZZA: An Open Source Library for Closed LLM Attribution (or “why did ChatGPT say that?”)

## Setup

**Make sure to uncomment and run the cell below if you're in Colab.**

In [9]:

# !git clone https://github.com/leap-laboratories/PIZZA.git .
# !pip install --quiet -r requirements.txt

In [10]:
import os
import warnings

# Suppress annoying FutureWarning from huggingface_hub that is not our fault
warnings.filterwarnings("ignore", category=FutureWarning, module="huggingface_hub")

from attribution.api_attribution import OpenAIAttributor
from attribution.experiment_logger import ExperimentLogger
from attribution.token_perturbation import FixedPerturbationStrategy, NthNearestPerturbationStrategy

# Re-import modified modules without restarting the server (for dev use)
%load_ext autoreload
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


You need an OpenAI API key to run this! Get one **[here](https://platform.openai.com/api-keys)**. 

You can either set the `OPENAI_API_KEY` environment variable in your notebook runtime (use the _secrets_ panel on the left in Colab), or add it to a `.env` file as described [in the README](../README.md#environment-variables).

If you're _desperate_ to live on the edge, you can also pass your API key directly to the attributor. But we really don't advise this for security reasons!

In [11]:
# Load environment variables from a .env file, if one exists.
import os
if os.path.isfile('.env'):
    %load_ext dotenv
    %dotenv

The dotenv extension is already loaded. To reload it, use:
  %reload_ext dotenv
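
If the key is missing, the attributors may only fail later, at the first API call. A minimal sketch of an early check is shown below; `require_api_key` is a hypothetical helper, not part of PIZZA, and the demonstration uses a stand-in mapping rather than your real environment.

```python
import os


def require_api_key(env=None, name="OPENAI_API_KEY"):
    """Return the API key from `env`, or raise a clear error if it is missing.

    Hypothetical helper for illustration; PIZZA does not provide it.
    """
    env = os.environ if env is None else env
    key = env.get(name, "").strip()
    if not key:
        raise RuntimeError(
            f"{name} is not set. Add it to your environment or to a .env file."
        )
    return key


# Demonstration with a stand-in mapping instead of the real environment.
print(require_api_key({"OPENAI_API_KEY": "sk-example"}))
```

In the notebook you would call `require_api_key()` with no arguments, right after the `.env` file has been loaded.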


Set up some attributors, and a logger to keep track of and visualise results.

In [12]:
gpt3_5_attributor = OpenAIAttributor(
    openai_api_key=os.getenv("OPENAI_API_KEY"),
    max_concurrent_requests=5,
    openai_model="gpt-3.5-turbo",
)

# Using a slightly older GPT4 model, because the latest is absurdly verbose.
gpt4_attributor = OpenAIAttributor(
    openai_api_key=os.getenv("OPENAI_API_KEY"),
    max_concurrent_requests=5,
    openai_model="gpt-4-0613",
)

logger = ExperimentLogger()


Quickstart example showing the different attribution and perturbation strategies:

In [23]:
input_str = "It is 9:47. How many minutes until 11? Answer in one number."

await gpt3_5_attributor.hierarchical_perturbation(
    input_str, logger=logger, attribution_strategies=['prob_diff'], perturbation_strategy=FixedPerturbationStrategy(replacement_token='')
)
print('Hierarchical perturbation with fixed token replacement, probability difference attribution')
logger.print_text_total_attribution(exp_id=-1)
logger.print_attribution_matrix(exp_id=-1)

await gpt3_5_attributor.hierarchical_perturbation(
    input_str, logger=logger, attribution_strategies=['cosine'], perturbation_strategy=NthNearestPerturbationStrategy(n=-1)
)
print('Hierarchical perturbation with nth nearest replacement, cosine similarity attribution')
logger.print_text_total_attribution(exp_id=-1)
logger.print_attribution_matrix(exp_id=-1)

Sending 5 concurrent requests at a time: 100%|██████████| 2/2 [00:01<00:00,  1.40it/s]
Sending 5 concurrent requests at a time:   0%|          | 0/2 [00:00<?, ?it/s]


CancelledError: 
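
To make the strategy names in the quickstart a little more concrete, here is a toy sketch of what they could compute. This is our reading of the names `prob_diff`, `cosine`, fixed replacement and nth nearest replacement, not PIZZA's actual code: every function name, the toy embeddings and the zero-based meaning of `n` are assumptions made for illustration.

```python
import math


def prob_diff(original_probs, perturbed_probs):
    """Absolute change in the probability of each original output token."""
    return [abs(p - q) for p, q in zip(original_probs, perturbed_probs)]


def cosine_distance(u, v):
    """1 - cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def fixed_perturbation(tokens, index, replacement=""):
    """Replace one token with a fixed replacement (empty string = drop it)."""
    return tokens[:index] + [replacement] + tokens[index + 1:]


def nth_nearest(token, embeddings, n):
    """Pick the n-th nearest other token by cosine distance.

    Assumption: n is a zero-based Python index, so n=-1 picks the farthest.
    """
    others = sorted(
        (t for t in embeddings if t != token),
        key=lambda t: cosine_distance(embeddings[token], embeddings[t]),
    )
    return others[n]


# Made-up two-dimensional embeddings, purely for illustration.
toy_embeddings = {"9": [1.0, 0.0], "11": [0.9, 0.1], "minutes": [0.0, 1.0]}

print(prob_diff([0.90, 0.80], [0.15, 0.75]))
print(fixed_perturbation(["It", "is", "9"], 2))
print(nth_nearest("9", toy_embeddings, 0), nth_nearest("9", toy_embeddings, -1))
```

In this picture, a large probability difference or cosine distance after perturbing a token means the output depended heavily on that token.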

## Prompt Engineering

In [14]:
input_str = "Mary puts an apple in the box. The box is labelled 'pencils'. John enters the room. What does he think is in the box? Answer in 1 word."

gpt3_5_response = await gpt3_5_attributor.get_chat_completion(input_str)
gpt4_response = await gpt4_attributor.get_chat_completion(input_str)

print("User:", input_str)
print("GPT3.5:", gpt3_5_response.message.content)
print("GPT4:", gpt4_response.message.content)

User: Mary puts an apple in the box. The box is labelled 'pencils'. John enters the room. What does he think is in the box? Answer in 1 word.
GPT3.5: Apples
GPT4: Pencils


GPT3.5 is not so hot on theory of mind there. Can we find out what went wrong?

In [15]:
# Bit hacky to get model explanation
user_request = "User: Why did you say that?"
print(user_request)
model_explanation = await gpt3_5_attributor.openai_client.chat.completions.create(
    model=gpt3_5_attributor.openai_model,
    messages=[
        {"role": "user", "content": input_str},
        {"role": "assistant", "content": gpt3_5_response.message.content},
        {"role": "user", "content": user_request},
    ],
    temperature=0.0,
    seed=0,
    logprobs=True,
    top_logprobs=20,
)
print("GPT3.5:", model_explanation.choices[0].message.content)

User: Why did you say that?
GPT3.5: I apologize for the mistake in my response. John would likely think there are pencils in the box, based on the label.


That's not very helpful! We want to know _why_ the mistake was made in the first place, so we can fix it.

In [16]:
await gpt3_5_attributor.hierarchical_perturbation(
    input_str, logger=logger, use_absolute_attribution=True
)
print("GPT3.5 Attribution:")
logger.print_text_total_attribution(exp_id=-1)
logger.print_total_attribution(exp_id=-1)

Sending 5 concurrent requests at a time: 100%|██████████| 2/2 [00:01<00:00,  1.61it/s]
Sending 5 concurrent requests at a time: 100%|██████████| 3/3 [00:01<00:00,  2.01it/s]
Sending 5 concurrent requests at a time: 100%|██████████| 4/4 [00:01<00:00,  2.65it/s]
Sending 5 concurrent requests at a time: 100%|██████████| 2/2 [00:01<00:00,  1.76it/s]

GPT3.5 Attribution:





| | exp_id | attribution_strategy | perturbation_strategy | unit_definition | token_1 | token_2 | token_3 | token_4 | token_5 | token_6 | token_7 | token_8 | token_9 | token_10 | token_11 | token_12 | token_13 | token_14 | token_15 | token_16 | token_17 | token_18 | token_19 | token_20 | token_21 | token_22 | token_23 | token_24 | token_25 | token_26 | token_27 | token_28 | token_29 | token_30 | token_31 | token_32 | token_33 | token_34 | token_35 | token_36 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3 | prob_diff | fixed | token | Mary 0.32 | puts 0.31 | an 0.15 | apple 0.36 | in 0.18 | the 0.18 | box 0.07 | . 0.07 | The 0.07 | box 0.09 | is 0.09 | labelled 0.08 | ' 0.08 | pen 0.09 | cil 0.09 | s 0.09 | '. 0.09 | John 0.09 | enters 0.03 | the 0.03 | room 0.03 | . 0.03 | What 0.03 | does 0.03 | he 0.03 | think 0.03 | is 0.03 | in 0.30 | the 0.11 | box 0.17 | ? 0.13 | Answer 0.13 | in 0.25 | 1 0.28 | word 0.31 | . 0.18 |


It looks like the request to "Answer in 1 word" is pretty important – in fact, it's attributed more highly than the actual contents of the box. Let's try changing it.

In [17]:
input_str = "Mary puts an apple in the box. The box is labelled 'pencils'. John enters the room. What does he think is in the box? Answer briefly."

await gpt3_5_attributor.hierarchical_perturbation(
    input_str,
    logger=logger,
)

# Let's see...
print("GPT3.5 Total attribution:")
# exp_id is the experiment index to print. -1 prints the last experiment.
logger.print_text_total_attribution(exp_id=-1)

Sending 5 concurrent requests at a time: 100%|██████████| 2/2 [00:01<00:00,  1.03it/s]
Sending 5 concurrent requests at a time: 100%|██████████| 2/2 [00:01<00:00,  1.37it/s]
Sending 5 concurrent requests at a time:   0%|          | 0/3 [00:00<?, ?it/s]


CancelledError: 

That's better!

Above we've been using hierarchical perturbation, which can be faster and cheaper than standard iterative perturbation on long inputs with fewer salient tokens. Most importantly, it can also capture multi-token features, which iterative perturbation cannot.

However, when many tokens are salient, standard iterative perturbation can be faster, and often highlights individual token contributions more clearly.
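
The cost trade-off can be illustrated with a toy model that needs no API calls. The sketch below is not PIZZA's implementation: the "model" is a made-up scorer that sums per-token weights, the splitting rule (halve any chunk whose removal changes the score by more than a threshold) is an assumption, and all names are hypothetical. It only counts how many model calls each scheme would make.

```python
def make_scorer(weights):
    """Toy stand-in for an LLM call: the score is the summed weight of unmasked tokens."""
    calls = {"n": 0}

    def score(masked):
        calls["n"] += 1
        return sum(w for i, w in enumerate(weights) if i not in masked)

    return score, calls


def iterative(n_tokens, score):
    """Mask one token at a time: one baseline call plus one call per token."""
    base = score(set())
    return [base - score({i}) for i in range(n_tokens)]


def hierarchical(n_tokens, score, threshold=0.05):
    """Mask chunks, and only split the chunks whose removal changes the score."""
    base = score(set())
    attributions = [0.0] * n_tokens
    mid = n_tokens // 2
    pending = [(0, mid), (mid, n_tokens)]
    while pending:
        start, end = pending.pop()
        if start >= end:
            continue
        drop = base - score(set(range(start, end)))
        # Every token in the chunk inherits the chunk score until it is refined.
        for i in range(start, end):
            attributions[i] = drop
        if drop > threshold and end - start > 1:
            mid = (start + end) // 2
            pending += [(start, mid), (mid, end)]
    return attributions


def count_calls(method, weights):
    score, calls = make_scorer(weights)
    method(len(weights), score)
    return calls["n"]


sparse = [0.0] * 16
sparse[3] = 0.6       # a single salient token
dense = [0.1] * 16    # every token matters a little

print("sparse:", count_calls(iterative, sparse), "vs", count_calls(hierarchical, sparse))
print("dense: ", count_calls(iterative, dense), "vs", count_calls(hierarchical, dense))
```

With one salient token out of 16, the hierarchical scheme makes 9 calls against 17 for the iterative one; when every token is salient it makes 31, because it has to descend into every chunk.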

In [None]:
input_str = "Write a funny, sad haiku."

await gpt4_attributor.iterative_perturbation(
    input_str, perturbation_strategy=FixedPerturbationStrategy(), logger=logger
)
logger.print_text_total_attribution(exp_id=-1)

Sending 5 concurrent requests at a time: 100%|██████████| 2/2 [00:02<00:00,  1.38s/it]


Hilarious.

Anyway, we also have some different logging functions to print the results in different ways. You can see how every input token affects every output token, what perturbations are being applied, etc.

Note how the model pays a lot of attention to "haiku" in the input, when punctuating the poem. 

In [None]:
logger.print_text_attribution_matrix(exp_id=-1)
logger.print_total_attribution(exp_id=-1)
logger.print_attribution_matrix(exp_id=-1, show_debug_cols=True)

| | exp_id | attribution_strategy | perturbation_strategy | unit_definition | token_1 | token_2 | token_3 | token_4 | token_5 | token_6 | token_7 | token_8 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 5 | prob_diff | fixed | token | Write 0.36 | a 0.25 | funny 0.36 | , 0.07 | sad 0.30 | ha 0.62 | iku 0.37 | . 0.09 |


| | Lost (0) | my (1) | favorite (2) | sock (3) | , (4) | In (5) | the (6) | dryer (7) | it (8) | did (9) | hide (10) | , (11) | One (12) | foot (13) | 's (14) | cold (15) | , (16) | how (17) | odd (18) | . (19) | perturbed_input | perturbed_output |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Write (0) | 0.085926 | -0.007451 | 0.468663 | 0.843929 | -0.254139 | 0.527553 | 0.691508 | 0.790795 | 0.423908 | 0.256473 | 0.306266 | -0.064182 | 0.629379 | 0.750506 | 0.214251 | 0.606542 | 0.090606 | 0.447819 | 0.4286 | 0.023355 | a funny, sad haiku. | Tripped on my own feet, Laughter echoed, tears did leak, Pride's fall, quite a feat. |
| a (1) | 0.123701 | 0.076361 | 0.002374 | 0.066352 | 0.364827 | 0.355052 | 0.742088 | 0.790795 | -0.559562 | 0.453181 | 0.306266 | 0.554784 | 0.275467 | -0.054386 | 0.225345 | 0.110153 | 0.156112 | 0.147257 | 0.388503 | 0.458816 | Write funny, sad haiku. | Lost my favorite sock Washing machine ate it whole One foot's cold, how cruel. |
| funny (2) | 0.192401 | 0.747085 | 0.468663 | 0.844986 | -0.258164 | 0.45858 | -0.054601 | 0.790795 | 0.423908 | 0.453181 | 0.306266 | -0.068207 | 0.629379 | 0.769185 | -0.522108 | 0.537155 | 0.676532 | 0.447819 | 0.4286 | -0.081804 | Write a, sad haiku. | Tears fall like raindrops, Joy lost in the heart's deep well, Laughter's echo fades. |
| , (3) | -0.446988 | -0.065879 | 0.091967 | 0.062906 | -0.227181 | -0.088485 | 0.042855 | 0.021042 | 0.214313 | 0.453181 | 0.306266 | -0.037223 | 0.322512 | -0.055629 | -0.091371 | 0.185144 | 0.092737 | -0.102348 | 0.389933 | 0.291519 | Write a funny sad haiku. | Lost my favorite sock, In the dryer's black abyss, One foot's cold, how cruel. |
| sad (4) | 0.231247 | -0.223566 | 0.460796 | 0.844986 | -0.188315 | 0.492037 | 0.452525 | 0.790795 | -0.543771 | 0.453181 | 0.306266 | 0.001642 | 0.629379 | 0.769185 | -0.439374 | 0.606542 | 0.41996 | 0.447819 | 0.4286 | 0.008626 | Write a funny, haiku. | Coffee in my cup, Spilled it on my laptop, oops, Guess it's time to wake up. |
| ha (5) | 0.234409 | 0.747085 | 0.468663 | 0.844986 | 0.728675 | 0.528705 | 0.756977 | 0.790795 | 0.423906 | 0.423451 | 0.306266 | 0.918632 | 0.626881 | 0.769185 | 0.472028 | 0.606542 | 0.895728 | 0.447819 | 0.4286 | 0.904933 | Write a funny, sadiku. | Why don't scientists trust atoms? Because they make up everything! |
| iku (6) | -0.258952 | 0.727466 | 0.468663 | 0.844986 | -0.237573 | 0.468184 | 0.114857 | 0.790795 | 0.423908 | 0.453181 | 0.306266 | -0.047616 | 0.629379 | 0.769185 | 0.469058 | 0.606542 | -0.082936 | 0.439314 | 0.4286 | 0.13182 | Write a funny, sad ha. | Once a jester, full of glee, Lost his jokes, in a sad sea, Laughed no more, oh, the irony. |
| . (7) | 0.024758 | -0.044858 | -0.049809 | 0.035263 | -0.024846 | -0.051402 | -0.015507 | 0.100479 | 0.089009 | 0.453181 | 0.306266 | 0.165111 | 0.182015 | -0.058605 | 0.066807 | 0.255314 | 0.133043 | -0.270163 | 0.236995 | 0.247083 | Write a funny, sad haiku | Lost my favorite sock, In the dryer's hungry maw, One foot's cold, how cruel. |
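
The observation about "haiku" and punctuation can be checked directly against the matrix above. The sketch below copies only the four punctuation columns (output positions 4, 11, 16 and 19) by hand and asks which input token scores highest in each; the dictionary layout is ours, not a structure that the logger returns.

```python
# Rows: input tokens. Columns: the haiku's punctuation tokens
# (output positions 4, 11, 16 and 19). Values copied from the matrix above.
punctuation_attribution = {
    "Write": [-0.254139, -0.064182, 0.090606, 0.023355],
    "a": [0.364827, 0.554784, 0.156112, 0.458816],
    "funny": [-0.258164, -0.068207, 0.676532, -0.081804],
    ",": [-0.227181, -0.037223, 0.092737, 0.291519],
    "sad": [-0.188315, 0.001642, 0.41996, 0.008626],
    "ha": [0.728675, 0.918632, 0.895728, 0.904933],
    "iku": [-0.237573, -0.047616, -0.082936, 0.13182],
    ".": [-0.024846, 0.165111, 0.133043, 0.247083],
}

n_columns = 4

# For each punctuation token, find the input token with the highest score.
top_per_column = [
    max(punctuation_attribution, key=lambda tok: punctuation_attribution[tok][col])
    for col in range(n_columns)
]
print(top_per_column)  # → ['ha', 'ha', 'ha', 'ha']

# Average score per input token over the punctuation columns, highest first.
mean_scores = {
    tok: sum(values) / len(values)
    for tok, values in punctuation_attribution.items()
}
for tok, mean_score in sorted(mean_scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{tok!r}: {mean_score:.3f}")
```

In this run the effect sits on the `ha` token rather than on `iku`: removing `ha` leaves "sadiku" in the prompt, and the perturbed output is a joke instead of a punctuated haiku.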


## Steganography

Could we use attribution to detect hidden messages between LLMs? Maybe. This is a forced example – you can probably think of better experiments. First let's get GPT4 to secretly encode a message.

In [None]:
input_str = "Encode this message so that only you can decode it, then reply with just the encoded message (don't explain further): 'The owl hoots at midnight!'"

encoded = await gpt4_attributor.get_chat_completion(input_str)
encoded = encoded.message.content
print(encoded)

VGhlIG93bCBob290cyBhdCBtaWRuaWdodCE=


In [None]:
input_str = f"Decode this message. Do not explain further. '{encoded}'"
decoded = await gpt4_attributor.get_chat_completion(input_str)
decoded = decoded.message.content
print(decoded)

The owl hoots at midnight!
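
Worth noting: in this run the "secret" encoding is plain Base64, so anyone can decode it, not only GPT4. A quick check with the Python standard library:

```python
import base64

# The encoded string printed by GPT4 in the cell above.
encoded = "VGhlIG93bCBob290cyBhdCBtaWRuaWdodCE="

decoded = base64.b64decode(encoded).decode("utf-8")
print(decoded)  # → The owl hoots at midnight!
```

A different run could produce a different encoding, so treat this as a check on this particular output only.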


Let's see if we can figure out which parts of the prompt were encoded.

In [None]:
await gpt4_attributor.hierarchical_perturbation(
    input_str,
    logger=logger,
)

logger.print_text_total_attribution(exp_id=-1)

Sending 5 concurrent requests at a time: 100%|██████████| 2/2 [00:04<00:00,  2.10s/it]
Sending 5 concurrent requests at a time:   0%|          | 0/2 [00:00<?, ?it/s]


CancelledError: 

In [None]:
logger.print_text_attribution_matrix(exp_id=-1)
logger.print_attribution_matrix(exp_id=-1)

| | The (0) | owl (1) | h (2) | oot (3) | s (4) | at (5) | midnight (6) | ! (7) |
|---|---|---|---|---|---|---|---|---|
| Dec (0) | 0.00927 | 0.033333 | 0.033243 | 0.033333 | 8e-06 | 0.0 | 0.0 | 0.088838 |
| ode (1) | 0.00927 | 0.033333 | 0.033243 | 0.033333 | 8e-06 | 0.0 | 0.0 | 0.088838 |
| this (2) | 0.00927 | 0.033333 | 0.033243 | 0.033333 | 8e-06 | 0.0 | 0.0 | 0.088838 |
| message (3) | 0.00927 | 0.033333 | 0.033243 | 0.033333 | 8e-06 | 0.0 | 0.0 | 0.088838 |
| . (4) | 0.00927 | 0.033333 | 0.033243 | 0.033333 | 8e-06 | 0.0 | 0.0 | 0.088838 |
| Do (5) | 0.00927 | 0.033333 | 0.033243 | 0.033333 | 8e-06 | 0.0 | 0.0 | 0.088838 |
| not (6) | 0.00927 | 0.033333 | 0.033243 | 0.033333 | 8e-06 | 0.0 | 0.0 | 0.088838 |
| explain (7) | 0.001192 | 0.046032 | 0.046001 | 0.046032 | 0.034925 | 0.012701 | 0.014304 | 0.286723 |
| further (8) | 0.049451 | 0.057035 | 0.055202 | 0.055238 | 0.041905 | 0.015241 | 0.017164 | 0.121874 |
| . (9) | 0.049451 | 0.057035 | 0.055202 | 0.055238 | 0.041905 | 0.015241 | 0.017164 | 0.121874 |



That's all for now. We implement a few other attribution and perturbation methods, each with different properties. Check out the README, and do your own experiments – PIZZA is a work in progress and we welcome contributions. 

In [None]:
display(logger.df_experiments)

| | exp_id | original_input | original_output | perturbation_strategy | unit_definition | duration | num_llm_calls |
|---|---|---|---|---|---|---|---|
| 0 | 1 | Do not go gentle | into that good night,\n | fixed | token | 4.64566 | 9 |
| 1 | 2 | Mary puts an apple in the box. The box is labe... | Apples | fixed | token | 8.145625 | 34 |
| 2 | 3 | Mary puts an apple in the box. The box is labe... | John would likely think there are pencils in t... | fixed | token | 10.192401 | 40 |
| 3 | 4 | Write a funny, sad haiku. | Lost my favorite sock,\nIn the dryer it did hi... | fixed | token | 2.711276 | 9 |
| 4 | 5 | Decode this message. Do not explain further. '... | The owl hoots at midnight! | fixed | token | 12.353817 | 55 |
