## PIZZA: An Open Source Library for Closed LLM Attribution (or “why did ChatGPT say that?”)

In [1]:
import os

# Set your OpenAI API key
# BEWARE: This will cost you API credits!
os.environ['OPENAI_API_KEY'] = "YOUR_API_KEY"


import warnings
# Suppress annoying FutureWarning from huggingface_hub
warnings.filterwarnings('ignore', category=FutureWarning, module='huggingface_hub')


In [2]:
# Re-import modified modules without restarting the server
%load_ext autoreload
%autoreload 2

In [3]:
from attribution.api_attribution import OpenAIAttributor
from attribution.experiment_logger import ExperimentLogger
from attribution.token_perturbation import (
    FixedPerturbationStrategy,
)

gpt3_5_attributor = OpenAIAttributor(request_chunksize=10, openai_model="gpt-3.5-turbo")
gpt4_attributor = OpenAIAttributor(request_chunksize=10, openai_model="gpt-4o")



# Simple Example

In [4]:
input_str = "Mary puts an apple in the box. The box is labelled 'pencils'. John enters the room. What does he think is in the box? Answer in 1 word."

gpt3_5_response = await gpt3_5_attributor.get_chat_completion(input_str)
gpt4_response = await gpt4_attributor.get_chat_completion(input_str)

print(input_str)
print("GPT3.5:", gpt3_5_response.message.content)
print("GPT4:", gpt4_response.message.content)

Mary puts an apple in the box. The box is labelled 'pencils'. John enters the room. What does he think is in the box? Answer in 1 word.
GPT3.5: Apples
GPT4: Pencils.


In [9]:
# Compute attributions

# Initialise a logger to track results. We'll use one for each model.
gpt3_5_logger = ExperimentLogger()
await gpt3_5_attributor.compute_attributions(
    input_str,
    perturbation_strategy=FixedPerturbationStrategy(""),
    attribution_strategies=["prob_diff"],
    logger=gpt3_5_logger
)

# Let's see...
print("GPT3.5 Total attribution:")
gpt3_5_logger.print_text_total_attribution()


# Now try with GPT4
gpt4_logger = ExperimentLogger()
await gpt4_attributor.compute_attributions(
    input_str,
    perturbation_strategy=FixedPerturbationStrategy(""),
    attribution_strategies=["prob_diff"],
    logger=gpt4_logger
)

print("GPT4 Total attribution:")
gpt4_logger.print_text_total_attribution()

Sending 10 concurrent requests at a time: 100%|██████████| 4/4 [00:01<00:00,  2.89it/s]


GPT3.5 Total attribution:


Sending 10 concurrent requests at a time: 100%|██████████| 4/4 [00:01<00:00,  2.48it/s]

GPT4 Total attribution:





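GPT3.5 is not so hot with the theory of mind there: it answers with what is actually in the box, not with what John would believe from the label.
Notice how the GPT4 attribution is more diffuse, spread over the entire input? Let's look in more detail.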

In [10]:
print("GPT4 Total attribution:")
gpt4_logger.print_text_total_attribution()
print("GPT4 per-output-token attribution:")
gpt4_logger.print_text_attribution_matrix()

GPT4 Total attribution:


GPT4 per-output-token attribution:


Interesting! That diffuse attribution mostly informed the full stop: GPT4 seems to have used sentence structure to decide on the punctuation. "Pencils" is attributed just to "pencils", which makes sense, but doesn't tell us a lot. Let's dig deeper.

The tables below show us what's actually happening here: we remove (_perturb_) the input tokens one at a time (by replacing each with an empty string) and look at how the output changes. So it makes sense that removing any piece of the word "pencils" (or actually, the tokens "pen", "cil" or "s") changes the output the most.
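
The perturb-and-compare loop described above can be sketched in a few lines. This is a minimal illustration, not PIZZA's implementation: `leave_one_out`, `prob_diff_scores` and `toy_target_prob` are hypothetical names, the toy scorer stands in for a real model call, and reading "prob_diff" as "original probability minus perturbed probability" is an assumption.

```python
from typing import Callable, List


def leave_one_out(units: List[str], replacement: str = "") -> List[str]:
    """Return one perturbed input per unit, with that unit swapped for `replacement`."""
    perturbed = []
    for i in range(len(units)):
        changed = units[:i] + [replacement] + units[i + 1:]
        # Drop empty strings so that removing a unit leaves no double space.
        perturbed.append(" ".join(u for u in changed if u))
    return perturbed


def prob_diff_scores(
    units: List[str],
    target_prob: Callable[[str], float],
    replacement: str = "",
) -> List[float]:
    """Score each unit by how much the target probability drops when it is perturbed."""
    original = target_prob(" ".join(units))
    return [original - target_prob(p) for p in leave_one_out(units, replacement)]


def toy_target_prob(text: str) -> float:
    """Hypothetical stand-in for an LLM: likes the answer more if 'pencils' is present."""
    return 0.9 if "pencils" in text else 0.1


units = "The box is labelled pencils".split()
scores = prob_diff_scores(units, toy_target_prob)
for unit, score in zip(units, scores):
    print(f"{unit:10s} {score:+.2f}")
```

Only the unit whose removal changes the toy probability gets a non-zero score. Joining units with spaces is a simplification; real tokens carry their own whitespace.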

In [11]:
gpt4_logger.print_total_attribution()
gpt4_logger.print_attribution_matrix(show_debug_cols=True)

| | exp_id | attribution_strategy | perturbation_strategy | perturb_word_wise | token_1 | token_2 | token_3 | token_4 | token_5 | token_6 | token_7 | token_8 | token_9 | token_10 | token_11 | token_12 | token_13 | token_14 | token_15 | token_16 | token_17 | token_18 | token_19 | token_20 | token_21 | token_22 | token_23 | token_24 | token_25 | token_26 | token_27 | token_28 | token_29 | token_30 | token_31 | token_32 | token_33 | token_34 | token_35 | token_36 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | prob_diff | fixed | False | Mary 0.23 | puts 0.23 | an 0.23 | apple 0.23 | in 0.02 | the -0.03 | box 0.23 | . 0.23 | The 0.23 | box 0.23 | is -0.03 | labelled 0.23 | ' 0.06 | pen 0.49 | cil 0.60 | s 0.56 | '. 0.23 | John 0.23 | enters -0.02 | the -0.02 | room 0.23 | . -0.07 | What -0.06 | does 0.23 | he 0.02 | think -0.02 | is -0.02 | in -0.07 | the 0.06 | box -0.02 | ? -0.02 | Answer 0.23 | in 0.06 | 1 -0.03 | word -0.11 | . 0.23 |


| | P (0) | encils (1) | . (2) | perturbed_input | perturbed_output |
|---|---|---|---|---|---|
| Mary (0) | 9e-05 | 0.0 | 0.67916 | puts an apple in the box. The box is labelled 'pencils'. John enters the room. What does he think is in the box? Answer in 1 word. | Pencils |
| puts (1) | 0.000905 | 0.0 | 0.67916 | Mary an apple in the box. The box is labelled 'pencils'. John enters the room. What does he think is in the box? Answer in 1 word. | Pencils |
| an (2) | 5.2e-05 | 0.0 | 0.67916 | Mary puts apple in the box. The box is labelled 'pencils'. John enters the room. What does he think is in the box? Answer in 1 word. | Pencils |
| apple (3) | 0.00029 | 0.0 | 0.67916 | Mary puts an in the box. The box is labelled 'pencils'. John enters the room. What does he think is in the box? Answer in 1 word. | Pencils |
| in (4) | 0.000255 | 0.0 | 0.056715 | Mary puts an apple the box. The box is labelled 'pencils'. John enters the room. What does he think is in the box? Answer in 1 word. | Pencils. |
| the (5) | 6.9e-05 | 0.0 | -0.098124 | Mary puts an apple in box. The box is labelled 'pencils'. John enters the room. What does he think is in the box? Answer in 1 word. | Pencils. |
| box (6) | 3.9e-05 | 0.0 | 0.67916 | Mary puts an apple in the. The box is labelled 'pencils'. John enters the room. What does he think is in the box? Answer in 1 word. | Pencils |
| . (7) | 1.3e-05 | 0.0 | 0.67916 | Mary puts an apple in the box The box is labelled 'pencils'. John enters the room. What does he think is in the box? Answer in 1 word. | Pencils |
| The (8) | 3.4e-05 | 0.0 | 0.67916 | Mary puts an apple in the box. box is labelled 'pencils'. John enters the room. What does he think is in the box? Answer in 1 word. | Pencils |
| box (9) | 0.00029 | 0.0 | 0.67916 | Mary puts an apple in the box. The is labelled 'pencils'. John enters the room. What does he think is in the box? Answer in 1 word. | Pencils |
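
Comparing the two tables, the total attribution for each input token appears to be the mean of its per-output-token values. That is an observation from the numbers printed here, not a documented guarantee of the library. A quick arithmetic check on three rows copied from the tables above:

```python
# Per-output-token prob_diff values for the output tokens "P", "encils", ".",
# copied from the debug table above.
per_output_token = {
    "Mary": [9e-05, 0.0, 0.67916],
    "in": [0.000255, 0.0, 0.056715],
    "the": [6.9e-05, 0.0, -0.098124],
}

# Totals reported for the same input tokens in the total attribution table.
reported_total = {"Mary": 0.23, "in": 0.02, "the": -0.03}

for token, values in per_output_token.items():
    mean = sum(values) / len(values)
    print(f"{token:5s} mean={mean:+.4f} reported={reported_total[token]:+.2f}")
    # The reported totals are rounded to two decimals, so allow half a unit
    # in the last place.
    assert abs(mean - reported_total[token]) < 0.005
```

For these rows almost all of the total comes from the "." column, which matches the earlier reading that the diffuse attribution mostly concerns the full stop.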


We could replace each token with something other than an empty string, if we wanted. We can also perturb by word, instead of by token. This has the nice side effect of making attribution computation a bit faster and cheaper, since there are fewer units to perturb and so fewer API requests to send.
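
To get a feel for the saving, here is a rough sketch that counts the perturbations needed in each mode. It assumes that a "word" is a whitespace-separated chunk, which may not be how the library splits words; `perturb_words` is a hypothetical helper, and the token count is simply read off the `token_1` to `token_36` columns shown earlier.

```python
input_str = (
    "Mary puts an apple in the box. The box is labelled 'pencils'. "
    "John enters the room. What does he think is in the box? Answer in 1 word."
)

# Number of input tokens, taken from the token_1..token_36 columns above.
N_TOKENS = 36

# Assumption: words are whitespace-separated, so punctuation stays attached.
words = input_str.split()

print(f"token-wise perturbations: {N_TOKENS}")
print(f"word-wise perturbations:  {len(words)}")


def perturb_words(words, index, replacement=""):
    """Replace the word at `index`; an empty replacement removes it entirely."""
    changed = words[:index] + [replacement] + words[index + 1:]
    return " ".join(w for w in changed if w)


# Removing the whole word "'pencils'." in one step instead of token by token.
print(perturb_words(words, 11))
```

Under this splitting the input has 28 words against 36 tokens, so a word-wise run needs fewer perturbed inputs and therefore fewer requests.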

In [12]:
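await gpt4_attributor.compute_attributions(
    input_str,
    # NOTE: the replacement must be a single token (or empty). As written,
    # "[REDACTED]" triggers the ValueError shown in the output below.
    perturbation_strategy=FixedPerturbationStrategy("[REDACTED]"),
    attribution_strategies=["prob_diff"],
    logger=gpt4_logger,
    perturb_word_wise=True
)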

print("Experiments so far:")
display(gpt4_logger.df_experiments)

print("\nResults with word-wise perturbation:")
gpt4_logger.print_text_attribution_matrix(exp_id=-1)

gpt4_logger.print_attribution_matrix(show_debug_cols=True)

ValueError: The replacement token must be a single token, or empty.

Note how the logger keeps track of all our experiments! Omit the `exp_id` argument to display all of them.

In [None]:
gpt4_logger.print_text_total_attribution()
gpt4_logger.print_total_attribution()

| | exp_id | attribution_strategy | perturbation_strategy | perturb_word_wise | token_1 | token_2 | token_3 | token_4 | token_5 | token_6 | token_7 | token_8 | token_9 | token_10 | token_11 | token_12 | token_13 | token_14 | token_15 | token_16 | token_17 | token_18 | token_19 | token_20 | token_21 | token_22 | token_23 | token_24 | token_25 | token_26 | token_27 | token_28 | token_29 | token_30 | token_31 | token_32 | token_33 | token_34 | token_35 | token_36 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | prob_diff | fixed | False | Mary 0.09 | puts 0.30 | an 0.08 | apple 0.08 | in 0.06 | the 0.09 | box 0.09 | . 0.13 | The 0.13 | box 0.09 | is 0.09 | labelled 0.11 | ' 0.08 | pen 0.51 | cil 0.97 | s 0.63 | '. 0.30 | John 0.30 | enters 0.13 | the 0.03 | room 0.13 | . 0.30 | What 0.03 | does 0.30 | he 0.30 | think 0.04 | is 0.11 | in 0.03 | the 0.13 | box 0.30 | ? 0.04 | Answer 0.30 | in 0.30 | 1 0.04 | word -0.03 | . 0.30 |
| 1 | 2 | prob_diff | fixed | True | Mary 0.30 | puts 0.30 | an 0.30 | apple 0.11 | in 0.11 | the 0.08 | box 0.31 | . 0.31 | The 0.30 | box 0.11 | is 0.13 | labelled 0.31 | ' 0.97 | pen 0.97 | cil 0.97 | s 0.97 | '. 0.97 | John 0.30 | enters 0.30 | the 0.04 | room 0.59 | . 0.59 | What 0.30 | does 0.14 | he 0.17 | think 0.31 | is 0.30 | in 0.06 | the 0.30 | box 0.63 | ? 0.63 | Answer 0.30 | in 0.31 | 1 0.06 | word 0.63 | . 0.63 |


This isn't the only strategy we can use. Let's try token flipping: