## PIZZA: An Open Source Library for Closed LLM Attribution (or “why did ChatGPT say that?”)

In [34]:
import os
import asyncio

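# Set your OpenAI API key, either via the OPENAI_API_KEY environment variable
# or by editing the fallback string below.
# BEWARE: This will cost you API credits!
YOUR_OPENAI_API_KEY = os.environ.get("OPENAI_API_KEY", "your-api-key")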

import warnings
# Suppress annoying FutureWarning from huggingface_hub
warnings.filterwarnings('ignore', category=FutureWarning, module='huggingface_hub')


In [35]:
# Re-import modified modules without restarting the server
%load_ext autoreload
%autoreload 2



In [36]:
from attribution.api_attribution import OpenAIAttributor
from attribution.experiment_logger import ExperimentLogger
from attribution.token_perturbation import FixedPerturbationStrategy, NthNearestPerturbationStrategy

gpt3_5_attributor = OpenAIAttributor(openai_api_key=YOUR_OPENAI_API_KEY,
    max_concurrent_requests=10, openai_model="gpt-3.5-turbo")

gpt4_attributor = OpenAIAttributor(openai_api_key=YOUR_OPENAI_API_KEY,
    max_concurrent_requests=10, openai_model="gpt-4o")
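
The `max_concurrent_requests=10` argument caps how many API calls are in flight at once. How PIZZA enforces this internally is not shown in this notebook, so the following is only a minimal sketch of one common approach: sending coroutines in batches with `asyncio.gather`. `fake_llm_call` and `run_in_batches` are hypothetical names, and no real API is called.

```python
import asyncio
from typing import List


async def fake_llm_call(prompt: str) -> str:
    """Hypothetical stand-in for a real API request (no network access)."""
    await asyncio.sleep(0.01)
    return prompt.upper()


async def run_in_batches(prompts: List[str], max_concurrent_requests: int = 10) -> List[str]:
    """Send prompts in batches so at most `max_concurrent_requests` run at once."""
    results: List[str] = []
    for start in range(0, len(prompts), max_concurrent_requests):
        batch = prompts[start:start + max_concurrent_requests]
        # gather preserves input order, so results line up with prompts
        results.extend(await asyncio.gather(*(fake_llm_call(p) for p in batch)))
    return results


# In a notebook, use `await run_in_batches(prompts)` instead of asyncio.run,
# because the notebook already has a running event loop.
prompts = [f"prompt {i}" for i in range(25)]
batched_results = asyncio.run(run_in_batches(prompts))
```

With 25 prompts and a limit of 10, this sketch sends three batches (10, 10, and 5 requests). A semaphore-based limiter would be an equally reasonable design when requests vary a lot in latency.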

# Prompt Engineering

In [37]:
input_str = "Mary puts an apple in the box. The box is labelled 'pencils'. John enters the room. What does he think is in the box? Answer in 1 word."

gpt3_5_response = await gpt3_5_attributor.get_chat_completion(input_str)
gpt4_response = await gpt4_attributor.get_chat_completion(input_str)

print(input_str)
print("GPT3.5:", gpt3_5_response.message.content)
print("GPT4:", gpt4_response.message.content)

Mary puts an apple in the box. The box is labelled 'pencils'. John enters the room. What does he think is in the box? Answer in 1 word.
GPT3.5: Apples
GPT4: Pencils.


In [38]:
# Initialise a logger to track results. We'll use one for each model.
gpt3_5_logger = ExperimentLogger()
await gpt3_5_attributor.hierarchical_perturbation(
    input_str,
    logger=gpt3_5_logger
)

# Let's see...
print("GPT3.5 Total attribution:")
gpt3_5_logger.print_text_total_attribution()

# Now try with GPT4
gpt4_logger = ExperimentLogger()
await gpt4_attributor.hierarchical_perturbation(
    input_str,
    logger=gpt4_logger
)

print("GPT4 Total attribution:")
gpt4_logger.print_text_total_attribution()


Sending 10 concurrent requests at a time: 100%|██████████| 2/2 [00:01<00:00,  1.28it/s]
Sending 10 concurrent requests at a time: 100%|██████████| 2/2 [00:02<00:00,  1.22s/it]
Sending 10 concurrent requests at a time: 100%|██████████| 3/3 [00:01<00:00,  1.83it/s]

GPT3.5 Total attribution:





Sending 10 concurrent requests at a time: 100%|██████████| 2/2 [00:02<00:00,  1.02s/it]


GPT4 Total attribution:


GPT3.5 is not so hot on theory of mind there. Let's look in more detail.

In [39]:
print("GPT3.5 Total attribution:")
gpt3_5_logger.print_text_total_attribution()
print("GPT3.5 per-output-token attribution:")
gpt3_5_logger.print_text_attribution_matrix()

GPT3.5 Total attribution:


GPT3.5 per-output-token attribution:


It looks like the request to "Answer in 1 word" is pretty important, much more so than the actual contents of the box. Could this be confusing the model? Let's try changing it.

In [40]:
input_str = "Mary puts an apple in the box. The box is labelled 'pencils'. John enters the room. What does he think is in the box? Answer briefly."

await gpt3_5_attributor.hierarchical_perturbation(
    input_str,
    logger=gpt3_5_logger,
)

# Let's see...
print("GPT3.5 Total attribution:")
# exp_id is the experiment index to print. -1 prints the last experiment.
gpt3_5_logger.print_text_total_attribution(exp_id=-1)

Sending 10 concurrent requests at a time: 100%|██████████| 2/2 [00:02<00:00,  1.04s/it]


GPT3.5 Total attribution:


That's better!

We have a few other attribution and perturbation methods for you, each with different properties. Check out the README, and do your own experiments – PIZZA is a work in progress.

Hierarchical perturbation is useful for capturing multi-token features, and on long inputs with few salient tokens it can be faster and cheaper than standard iterative perturbation (which is what the `compute_attributions` function uses). It can sometimes be slower, though, and standard iterative perturbation highlights individual token contributions more clearly.
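
To make the trade-off concrete, here is a toy sketch of both ideas. It is not PIZZA's implementation: `toy_score` is a hypothetical stand-in for a model call, `MASK` is an assumed replacement token, and the splitting rule (halve any chunk whose removal changes the score by at least `threshold`) is just one simple way to do a hierarchical search.

```python
from collections import deque
from typing import Callable, List, Tuple

MASK = "_"  # hypothetical replacement token


def toy_score(tokens: List[str]) -> float:
    """Stand-in for a model call: fraction of the two salient tokens still present."""
    salient = {"labelled", "pencils."}
    return sum(t in salient for t in tokens) / len(salient)


def toy_iterative(tokens: List[str], score: Callable[[List[str]], float]) -> Tuple[List[float], int]:
    """Mask one token at a time: one scoring call per token."""
    base = score(tokens)
    attributions = []
    for i in range(len(tokens)):
        perturbed = tokens[:i] + [MASK] + tokens[i + 1:]
        attributions.append(base - score(perturbed))
    return attributions, len(tokens)


def toy_hierarchical(tokens: List[str], score: Callable[[List[str]], float],
                     threshold: float = 0.1) -> Tuple[List[float], int]:
    """Mask whole chunks first; only split chunks whose removal changes the score."""
    base = score(tokens)
    attributions = [0.0] * len(tokens)
    calls = 0
    mid = len(tokens) // 2
    queue = deque([(0, mid), (mid, len(tokens))])
    while queue:
        start, end = queue.popleft()
        if start >= end:
            continue
        perturbed = tokens[:start] + [MASK] * (end - start) + tokens[end:]
        drop = base - score(perturbed)
        calls += 1
        if abs(drop) < threshold:
            continue  # nothing salient in this chunk, so stop descending
        if end - start == 1:
            attributions[start] = drop
        else:
            split = (start + end) // 2
            queue.extend([(start, split), (split, end)])
    return attributions, calls


tokens = "Mary puts an apple in the box. The box is labelled pencils. John enters the room.".split()
iter_attr, iter_calls = toy_iterative(tokens, toy_score)
hier_attr, hier_calls = toy_hierarchical(tokens, toy_score)
print(iter_calls, hier_calls)
```

On this 16-token input with two adjacent salient tokens, the hierarchical sketch needs 8 scoring calls against 16 for the iterative one, and both assign 0.5 to each salient token. If salient tokens were spread across the whole input, most chunks would need splitting and the hierarchical version would end up making more calls than the iterative one.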

In [41]:

await gpt4_attributor.compute_attributions(
    input_str,
    logger=gpt4_logger
)
gpt4_logger.print_text_total_attribution(exp_id=-1)

Sending 10 concurrent requests at a time: 100%|██████████| 4/4 [00:02<00:00,  1.73it/s]


In [42]:
gpt4_logger.print_total_attribution(exp_id=-1)
gpt4_logger.print_attribution_matrix(exp_id=-1)

Unnamed: 0,exp_id,attribution_strategy,perturbation_strategy,unit_definition,token_1,token_2,token_3,token_4,token_5,token_6,token_7,token_8,token_9,token_10,token_11,token_12,token_13,token_14,token_15,token_16,token_17,token_18,token_19,token_20,token_21,token_22,token_23,token_24,token_25,token_26,token_27,token_28,token_29,token_30,token_31,token_32,token_33,token_34
0,2,prob_diff,fixed,token,Mary 0.01,puts 0.01,an 0.00,apple 0.01,in 0.00,the -0.01,box 0.02,. -0.00,The 0.02,box 0.01,is 0.00,labelled 0.04,' -0.00,pen 0.11,cil 0.12,s 0.22,'. 0.00,John 0.09,enters -0.00,the -0.00,room 0.02,. 0.03,What 0.13,does 0.51,he -0.01,think 0.13,is 0.01,in 0.33,the 0.02,box 0.01,? 0.00,Answer 0.16,briefly 0.40,. 0.00


Unnamed: 0,John (0),thinks (1),there (2),are (3),pencils (4),in (5),the (6),box (7),. (8)
Mary (0),0.148911,0.02807,-0.051663,-6e-06,0.0,2e-06,0.0,0.0,-0.009136
puts (1),0.180737,-0.049317,-0.071216,-6e-06,0.0,0.0,0.0,0.0,0.007371
an (2),0.080423,0.006571,-0.059259,-6e-06,0.0,-0.0,0.0,0.0,-0.009091
apple (3),0.027427,0.101065,-0.021851,2e-06,1e-06,-0.0,0.0,0.0,-0.007911
in (4),0.1054,-0.015543,-0.042472,-6e-06,1e-06,0.0,0.0,0.0,-0.008746
the (5),0.00121,-0.026076,-0.06815,-6e-06,0.0,-0.0,0.0,0.0,-0.006553
box (6),-0.006931,0.089713,-0.01143,1e-06,0.0,1e-06,0.0,0.0,0.086944
. (7),0.084187,-0.015282,-0.076466,-6e-06,0.0,-0.0,0.0,0.0,-0.00897
The (8),0.02658,0.158376,-0.041902,-5e-06,0.0,-0.0,0.0,0.0,0.007562
box (9),0.037289,0.125404,-0.027654,-6e-06,0.0,-0.0,0.0,0.0,-0.002451
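
As a sanity check on how the two tables relate: for the ten rows shown, each token's total attribution matches the mean of its row in the per-output-token matrix, rounded to two decimals. This is an observation from the printed numbers only, not a confirmed description of how the library aggregates, and it covers just the rows visible above.

```python
from statistics import mean

# Rows copied from the per-output-token matrix above (first ten input tokens).
rows = {
    "Mary (0)":  [0.148911, 0.02807, -0.051663, -6e-06, 0.0, 2e-06, 0.0, 0.0, -0.009136],
    "puts (1)":  [0.180737, -0.049317, -0.071216, -6e-06, 0.0, 0.0, 0.0, 0.0, 0.007371],
    "an (2)":    [0.080423, 0.006571, -0.059259, -6e-06, 0.0, -0.0, 0.0, 0.0, -0.009091],
    "apple (3)": [0.027427, 0.101065, -0.021851, 2e-06, 1e-06, -0.0, 0.0, 0.0, -0.007911],
    "in (4)":    [0.1054, -0.015543, -0.042472, -6e-06, 1e-06, 0.0, 0.0, 0.0, -0.008746],
    "the (5)":   [0.00121, -0.026076, -0.06815, -6e-06, 0.0, -0.0, 0.0, 0.0, -0.006553],
    "box (6)":   [-0.006931, 0.089713, -0.01143, 1e-06, 0.0, 1e-06, 0.0, 0.0, 0.086944],
    ". (7)":     [0.084187, -0.015282, -0.076466, -6e-06, 0.0, -0.0, 0.0, 0.0, -0.00897],
    "The (8)":   [0.02658, 0.158376, -0.041902, -5e-06, 0.0, -0.0, 0.0, 0.0, 0.007562],
    "box (9)":   [0.037289, 0.125404, -0.027654, -6e-06, 0.0, -0.0, 0.0, 0.0, -0.002451],
}

# Totals for the same ten tokens, copied from the total-attribution table above.
reported_totals = [0.01, 0.01, 0.00, 0.01, 0.00, -0.01, 0.02, -0.00, 0.02, 0.01]

row_means = [round(mean(values), 2) for values in rows.values()]
print(row_means == reported_totals)
```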


Note how the logger is keeping track of all our experiments.

In [43]:
display(gpt3_5_logger.df_experiments)

Unnamed: 0,exp_id,original_input,original_output,perturbation_strategy,unit_definition,duration,num_llm_calls
0,1,Mary puts an apple in the box. The box is labe...,Apples,fixed,token,5.947224,59
1,2,Mary puts an apple in the box. The box is labe...,John would likely think there are pencils in t...,fixed,token,5.661197,33


# Steganography

In [44]:
input_str = "You have just been cloned. You must encode the following sentence in fewer than 100 characters, such that only your clone can decode it. The sentence is 'the owl hoots at midnight'. What do you send?"

response = await gpt4_attributor.get_chat_completion(input_str)

# Print the response to this prompt (not the earlier gpt4_response variable)
print(response.message.content)


# gpt4_logger = ExperimentLogger()
# await gpt4_attributor.hierarchical_perturbation(
#     input_str,
#     logger=gpt4_logger
# )

# print("GPT4 Total attribution:")
# gpt4_logger.print_text_total_attribution()
