## 4. Provenance-based

The provenance-based technique is that in which explanations are provided by illustrating some or all of the prediction derivation process. This process is intuitive and effective and the final prediction is the result of a series of reasoning steps.

Danilevsky et al. proposed the following papers back then:
* [Interpretable Relevant Emotion Ranking with Event-Driven Attention](https://aclanthology.org/D19-1017.pdf)
    * They mention a model called IRER-EA but we couldn't test it since we couldn't find it online.

* [MathQA: Towards Interpretable Math Word Problem Solving with Operation-Based Formalisms](https://aclanthology.org/N19-1245/).
    * This project consists on the creation of a dataset called [MathQA](https://huggingface.co/datasets/math_qa) that includes the chain of thought to solve a mathematical problem. 
    * The motivation behind it was to provide explainability the Google's AQuA dataset (a dataset with mathematical problems and four options to choose).

In 2024, this kind of definition is quite stablished in what we call "Chain of Thought", so we could include any model that applies this in this section.


## Experiments - MathQA

In this section, we will simply show how the structure of the AQuA dataset is and how the new dataset, MathQA turned out:

In [2]:
from datasets import load_dataset
import pprint

  from .autonotebook import tqdm as notebook_tqdm


In [3]:
math_qa = load_dataset("math_qa")

In [4]:
sample = math_qa["train"][0]

In [11]:
pprint.pprint(sample)

{'Problem': "the banker ' s gain of a certain sum due 3 years hence at 10 % "
            'per annum is rs . 36 . what is the present worth ?',
 'Rationale': '"explanation : t = 3 years r = 10 % td = ( bg × 100 ) / tr = ( '
              '36 × 100 ) / ( 3 × 10 ) = 12 × 10 = rs . 120 td = ( pw × tr ) / '
              '100 ⇒ 120 = ( pw × 3 × 10 ) / 100 ⇒ 1200 = pw × 3 pw = 1200 / 3 '
              '= rs . 400 answer : option a"',
 'annotated_formula': 'divide(multiply(const_100, divide(multiply(36, '
                      'const_100), multiply(3, 10))), multiply(3, 10))',
 'category': 'gain',
 'correct': 'a',
 'linear_formula': 'multiply(n2,const_100)|multiply(n0,n1)|divide(#0,#1)|multiply(#2,const_100)|divide(#3,#1)|',
 'options': 'a ) rs . 400 , b ) rs . 300 , c ) rs . 500 , d ) rs . 350 , e ) '
            'none of these'}


We see that `Problem`, `Rationale`, `options` and `correct` are the original problems stated in **AQuA**.
* `Problem`: a mathematical problem explained in plain words.
* `Rationale`: the explanation of how to reach the solution.
* `Options`: 5 possible results of the problem, only one is correct.
* `correct`: the option that is correct.

The new **MathQA** dataset includes:
* `category`: a category stating what type of problem it is.
* `annotated_formula`: the steps to solve the problem in a python-like code. The operations are always represented in the same way with a fixed vocabulary (e.g. divide, multiply, const_X, etc)
* `linear_formula`: a formula that states the steps in a clarer way that the previous one. Steps are separated by `|`.

In MathQA, the authors use this new version of the dataset and pose the problem as a machine translation problem, with an encoder-decoder aarchitecture. This facilitates the understanding of the problem solving.

## Experiments - Toy RAG system

As we said, provenance-based techniques explain or show part of the prediction derivation problem. Nowadays, RAG systems could also fall into this category, since part of the final decision relies on the extra information the model retrieves.
We will implement a toy example of a RAG system, this is inspired by [A beginner’s guide to building a Retrieval Augmented Generation (RAG) application from scratch](https://medium.com/@wachambers/a-beginners-guide-to-building-a-retrieval-augmented-generation-rag-application-from-scratch-e52921953a5d) from Medium.

RAG stands for Retrieval Agumented Generation and is basically a model that researches and contextualizes with additional information we have provided it with. RAG’s internal knowledge can be easily altered or even supplemented on the fly, enabling researchers and engineers to control what RAG knows and doesn’t know without wasting time or compute power retraining the entire model.
In other words, we can state that the essence of RAG involves adding your own data (via a retrieval tool) to the prompt that you pass into a large language model. So, basically, you are altering the input of the LLM and provide extra information, hence, additional explanatinability and more context to get to a certain context.

And what will be our task in this case? 
* Since RAG is for generative models, we won't be using our fine-tuned model for classification, nor a BERT. We will be using [Llama2](https://llama.meta.com/).
* We will see how a generative model modifies the response when we add the real context behind some verses of our test set.
* As we already know, our test set is the song [All too well (10 Minute Version)(Taylor's Version)(From the Vault)](https://www.youtube.com/watch?v=sRxrwjOtIag).

How does our toy-RAG work?

The high-level components of a RAG system are a **corpus**, an **input** from the user and **a similarity measure** between the corpus and the user input.

In our case, we defined:
* The input: verses of the test data, lyrics from the song. The same as in the previous experiments.
* The corpus: the corpus works as the ``additional information" that helps contextualise and enrich the query before reaching the LLM. In this case, we took the comments of the song from the \href{https://genius.com/Taylor-swift-all-too-well-10-minute-version-taylors-version-from-the-vault-lyrics}{Genius webpage}. Genius provides context and explanations of the lyrics from songs.
* The similarity measure is the Jaccard similarity score.


The process followed by the RAG system:

1. Receive a user input (the lyrics)
2. Perform a similarity measure (we chose the Jaccard similarity as in the article) to match the most suitable document (the context from Genius)
3. Send to the LLM the query with the added information. Our LLM is Llama2 and we call it via the `ollama` proxy.
\end{enumerate}


The Jaccard similarity used for querying the corpus given the input lyric is defined as: 

$J(i,d) = \dfrac{|i \cap d|}{|i \cup u|}, {i\in I, d\in D}$

Where:
* $D$ is the set of documents, in our case the additional information extracted from Genius.
* $I$ is the user input, in our case, the verse from the test.
* The intersection $\cap$ represents the total number of words that appear in both: the verse and the document.
* The union $\cup$ represents the total number of words that appear in the verse or the document.


We take the document $d$ that has the higher Jaccard Similarity given the input $i$.

The prompt used as input for queries with contextualized information: 

```You are a classifier that, given a sentence, says if the sentence is negative, positive, neutral or mixed. The sentence is {song_lyric}". Additional info is that: {additional_information}". Say which class is more suitable and a short explanation```

The prompt used as input for queries without contextualized information:

```You are a classifier that, given a sentence, says if the sentence is negative, positive, neutral or mixed. The sentence is {user_input}". Say which class is more suitable and a short explanation```


In [33]:
from src.preprocess import get_train_dev_test_data
import pandas as pd
import pprint
import requests
import json
from sklearn.metrics import f1_score

Our Corpus, extracted from the Genius lyric comments of the song:

In [18]:
corpus_of_documents = [
    "'Walking through a door' can be used as a metaphor to indicate the start of something new like a love story",
    "Leaving it in a drawer signify it's no longer a part of his life",
    "'Sweet disposition' may refer to the subject's kind, thoughtful personality",
    "Songs about seeming to be fine, even if feeling shattered inside",
    "'I was there' could be an indicator that she was being gaslit by her ex partner",
    "Fans believe this song is about Jake Gyllenhaal, who indeed had glasses when he was younger",
    "Patriarchy is a sociological term coined by feminist theorists. It describes the system in our society that creates a power imbalance",
    "It’s possible that he tried to win her back by saying he loved her",
    "'Three months in the grave' could reference the time after a breakup or a lull in their romance",
    "keep it like a secret implies Swift’s ex may have wanted to keep their relationship hidden",
    "'In the name of being honest' can be seen as Swift describing her partner as manipulative",
    "'All is well that ends well' is an expression that means struggles and difficulty will pass by as long as the outcome is positive."
    "To double-cross someone is to deceive or betray them",
    "the alleged subject of this song, was 9 years older than Taylor Swift when they dated",
    "The scarf symbolizes his longing and passion for the relationship",
    "Swift claims that this relationship was the only real one that her ex had",
    "'sticks and stones may break my bones' mean that it’s clear he put her through some sort of verbal abuse",
]

Our similarity measure:

In [19]:
def jaccard_similarity(query, document):
    query = query.lower().split(" ")
    document = document.lower().split(" ")
    intersection = set(query).intersection(set(document))
    union = set(query).union(set(document))
    return len(intersection) / len(union)

The function that will match our input with the additional information that better matches it:

In [20]:
def return_response(query, corpus):
    similarities = []
    for doc in corpus:
        similarity = jaccard_similarity(query, doc)
        similarities.append(similarity)
    return corpus_of_documents[similarities.index(max(similarities))]

Now let's play with our data:

In [21]:
# load data
_, _, test = get_train_dev_test_data()

In [22]:
# load examples
drawer_verse = test["verse_text"][3]
drawer_label = test["labels"][3]
secret_verse = test["verse_text"][42]
secret_label = test["labels"][42]
love_verse = test["verse_text"][25]
love_label = test["labels"][25]
honest_verse = test["verse_text"][51]
honest_label = test["labels"][51]
cross_verse = test["verse_text"][56]
cross_label = test["labels"][56]
brooklyn_verse = test["verse_text"][93]
brooklyn_label = test["labels"][93]


user_inputs = [
    drawer_verse,
    secret_verse,
    love_verse,
    honest_verse,
    cross_verse,
    brooklyn_verse,
]
labels = [
    drawer_label,
    secret_label,
    love_label,
    honest_label,
    cross_label,
    brooklyn_label,
]

In [23]:
# visualize examples
input_and_response = []

for user_input in user_inputs:
    response = return_response(user_input, corpus_of_documents)
    input_and_response.append((user_input, response))
    print(f"User input = {user_input}")
    print(f"Similarity Response = {return_response(user_input, corpus_of_documents)}")
    print()

User input = And you've still got it in your drawer, even now
Similarity Response = Leaving it in a drawer signify it's no longer a part of his life

User input = You kept me like a secret, but I kept you like an oath
Similarity Response = keep it like a secret implies Swift’s ex may have wanted to keep their relationship hidden

User input = He's gonna say it's love
Similarity Response = Leaving it in a drawer signify it's no longer a part of his life

User input = So casually cruel in the name of bein' honest
Similarity Response = 'In the name of being honest' can be seen as Swift describing her partner as manipulative

User input = You double-cross my mind
Similarity Response = 'sticks and stones may break my bones' mean that it’s clear he put her through some sort of verbal abuse

User input = From when your Brooklyn broke my skin and bones
Similarity Response = 'sticks and stones may break my bones' mean that it’s clear he put her through some sort of verbal abuse



We can see that not every input has matched a similar sentence or one that better contextualizes it:
* User input = You double-cross my mind
* Similarity Response = 'sticks and stones may break my bones' mean that it’s clear he put her through some sort of verbal abuse

However, some of them were contextualized:
* User input = So casually cruel in the name of bein' honest
* Similarity Response = 'In the name of being honest' can be seen as Swift describing her partner as manipulative

Now let's generate the prompts:

In [24]:
# define prompts sent to the LLM
def create_prompt(user_input, additional_info=None):
    if additional_info:
        return f"""You are a classifier that, given a sentence, says if the sentence is negative, positive, neutral or mixed. The sentence is "{user_input}". Additional info is that: {additional_info}". Say which class is more suitable and a short explanation"""
    return f"""You are a classifier that, given a sentence, says if the sentence is negative, positive, neutral or mixed. The sentence is "{user_input}". Say which class is more suitable and a short explanation"""

In [27]:
# let's check how it works
print("Prompt without context:")
pprint.pprint(create_prompt(input_and_response[0][0], 0))

print("Prompt with context:")
pprint.pprint(create_prompt(input_and_response[0][0], 1))

Prompt without context:
('You are a classifier that, given a sentence, says if the sentence is '
 'negative, positive, neutral or mixed. The sentence is "And you\'ve still got '
 'it in your drawer, even now". Say which class is more suitable and a short '
 'explanation')
Prompt with context:
('You are a classifier that, given a sentence, says if the sentence is '
 'negative, positive, neutral or mixed. The sentence is "And you\'ve still got '
 'it in your drawer, even now". Additional info is that: 1". Say which class '
 'is more suitable and a short explanation')


We will use Llama 2 as our generative model. Since downloading it from Hugging Face didn't work due to lack of resources, we called it using the `ollama` server, that allows a user to connect with an LLM and prompt it.

In [156]:
def make_request(prompt):
    url = "http://localhost:11434/api/generate"
    data = {"model": "llama2", "prompt": prompt}
    full_response = []

    headers = {"Content-Type": "application/json"}
    response = requests.post(url, data=json.dumps(data), headers=headers, stream=True)
    try:
        for line in response.iter_lines():
            # filter out keep-alive new lines
            if line:
                decoded_line = json.loads(line.decode("utf-8"))
                # print(decoded_line['response'])  # uncomment to results, token by token
                full_response.append(decoded_line["response"])
        full_response = "".join(full_response)
    finally:
        response.close()
        return full_response

Now let's call it and see the results!

In [171]:
results = []

for sample_idx in range(len(input_and_response)):
    prompt_first = create_prompt(input_and_response[sample_idx][0], None)
    response_first = make_request(prompt_first)
    results.append(
        {"sample_idx": sample_idx, "prompt": prompt_first, "response": response_first}
    )

for sample_idx in range(len(input_and_response)):
    prompt_both = create_prompt(
        input_and_response[sample_idx][0], input_and_response[sample_idx][1]
    )
    response_both = make_request(prompt_both)
    results.append(
        {"sample_idx": sample_idx, "prompt": prompt_both, "response": response_both}
    )

We will adapt the results in a dataframe to better analyze our results:

In [214]:
df = pd.DataFrame(results)
df = df.sort_values(by=["sample_idx"])
df["lyric"] = df["sample_idx"].map(lambda index: user_inputs[index])
df["label"] = df["sample_idx"].map(lambda index: labels[index])
df["predicted"] = [2, 0, 3, 3, 1, 3, 3, 0, 0, 0, 3, 0]
df["RAG"] = df["prompt"].map(lambda text: "Additional info is" in text)
# save it for reproducibility
df.to_csv("data/output_prompt.csv", index=False)

### Check metrics

Now let's check the results of the answers!

In [34]:
rag = df[df.RAG == True]
no_rag = df[df.RAG == False]

In [37]:
f1_rag = f1_score(rag.label, rag.predicted, average="weighted")
f1_no_rag = f1_score(no_rag.label, no_rag.predicted, average="weighted")

In [38]:
print(f"{f1_rag=}, {f1_no_rag=}")

f1_rag=0.5396825396825397, f1_no_rag=0.6666666666666666


We can see that our f1-score is higher for the non-rag prompts, hence those that were not contextualized. In this case, our additional explanations have not contributed to a better classification, however, we can affirm that adding contextual information changes the prediction and explanations of the model.
Let's check, for each verse, how many classifications did they have with the context and without:

In [10]:
df[["sample_idx", "predicted"]].groupby(["sample_idx"]).agg(["nunique"])

Unnamed: 0_level_0,predicted
Unnamed: 0_level_1,nunique
sample_idx,Unnamed: 1_level_2
0,2
1,1
2,2
3,2
4,1
5,2


We can see that only two samples had the same prediction with and without the context.

### Check some examples

In [207]:
pprint.pprint(df.iloc[10].to_dict())

{'label': 0,
 'lyric': 'From when your Brooklyn broke my skin and bones',
 'predicted': 3,
 'prompt': 'You are a classifier that, given a sentence, says if the sentence '
           'is negative, positive, neutral or mixed. The sentence is "From '
           'when your Brooklyn broke my skin and bones". Say which class is '
           'more suitable and a short explanation',
 'response': '\n'
             'Based on the sentence provided, I would classify it as "Mixed" '
             'because it contains both negative and positive elements.\n'
             '\n'
             'The phrase "your Brooklyn broke my skin and bones" is negative '
             'in tone, as it describes physical harm caused by something '
             '(Brooklyn) that is presumably a person or entity. The use of the '
             'word "broke" implies damage or injury, which has a negative '
             'connotation.\n'
             '\n'
             'However, the sentence also contains positive elements, such 

In [208]:
pprint.pprint(df.iloc[11].to_dict())

{'label': 0,
 'lyric': 'From when your Brooklyn broke my skin and bones',
 'predicted': 0,
 'prompt': 'You are a classifier that, given a sentence, says if the sentence '
           'is negative, positive, neutral or mixed. The sentence is "From '
           'when your Brooklyn broke my skin and bones". Additional info is '
           "that: 'sticks and stones may break my bones' mean that it’s clear "
           'he put her through some sort of verbal abuse". Say which class is '
           'more suitable and a short explanation',
 'response': '\n'
             'Based on the given sentence, I would classify it as negative. '
             'The phrase "From when your Brooklyn broke my skin and bones" '
             'suggests that someone has been physically hurt or abused, with '
             '"Brooklyn" likely being a person who inflicted the harm. The '
             'additional context you provided further reinforces this '
             'interpretation, as "sticks and stones may break

## Conclusions

Our RAG system is poor and the added contextual information makes our model to predict poorly.
However, we attribute this to the simplicity of the algorithm.

Even so, we can affirm that adding contextual information can will change the behaviour of the LLM and their decisions. In good systems, we hope that the results would be more encouraging.

In terms of explainability, even if our experiments didn't come as expected, prompts with additional information allow a better understanding and contextualization of a model's decision since they are understandable by a human. However, we could also argue the fact that generative models work by probability and the additional information even if it changes the given response, it might not affect positively to the correctness of the answer.